
DEPARTMENT OF INFORMATICS / CHAIR IN INFORMATION SYSTEM INFORMATION SYSTEMS RESEARCH GROUP

University of Fribourg, Switzerland Boulevard de Pérolles 90

CH-1700 Fribourg

INTUITIONISTIC FUZZY COMPONENT FAILURE IMPACT ANALYSIS

(IFCFIA)

A GRADUAL METHOD FOR SLA DEPENDENCY MAPPING

AND BI-POLAR IMPACT ASSESSMENT

Technical Paper

by

Roland Schuetze

e-mail: [email protected]

http://diuf.unifr.ch/is


A. ABSTRACT

Combining well-grounded academic research with practice-oriented requirements and business scenarios is a useful and common practice within Information Systems. This paper elaborates the theoretical foundations of reliability engineering and impact assessment practices in IT Service Management, expanded by fuzzy mathematical models and methodologies.

This ongoing research will provide a bridge from IT-centric service levels, written in IT technical terms, to business-oriented service achievement. It will help Service Level Agreements (SLAs) relate metrics for business applications to measurable parameters for technical services that can be defined and reported against an SLA and monitored under Service Level Management.

IT landscapes are inherently integrated, and the fulfilment of any higher-level objective requires proper enforcement on multiple resources at several levels. Based on the proposed IFCFIA framework, we assess the complex dependency and impact relationships between backend components and the quality of the frontend service. This work describes dependency couplings in a practical and feasible manner in order to address the distributed nature of SLAs in a multi-tier architectural environment and to offer transparency into complex multi-level impact assessments and fault analysis.

The IFCFIA framework allows service providers to concentrate on the quality rather than the performance of a service. IFCFIA can help to implement a flexible SLA model in which the organization benefits from optimized investments in IT infrastructure and capacity levels. Based on IFCFIA, we can define rules that capture business insights into how service accounts as a whole can provide and improve quality. It supports service administrators in proactively tracking measures of individual components to gather the overall SLA quality status of the impacted business services, performing an intelligent multi-level impact or fault tree assessment by applying intuitionistic fuzzy mathematical models and methods.


B. CONTENT

1. PREFACE

2. BACKGROUND AND PROBLEM STATEMENT

2.1 SLAS IN MULTI-LAYERED SERVICE DELIVERY MODELS

2.1.1 SLA Dependency Mapping

2.1.2 The concept of Key Quality and Performance Indicators

2.2 COMPLEXITY OF MULTI-LAYERED SLA TRANSLATIONS

2.2.1 Types of SLA Translations

2.2.2 Multi-layered SLA optimizations

2.3 IT IMPACT ANALYSIS APPLIED IN SERVICE MANAGEMENT

2.4 RECOMMENDATIONS AND PROPOSED EXTENSIONS

2.5 RESEARCH OBJECTIVES

3. INTERDEPENDENCIES AND COUPLINGS

3.1 DEPENDENCE COUPLING AS MEASUREMENT

3.2 DETERMINING A LEVEL OF TIGHTLY COUPLING

3.2.1 Overview

3.2.2 Static Coupling Calculations

3.2.3 Dynamic Coupling Calculations

3.2.4 Aggregated Measurements of Service Quality

3.3 DETERMINING THE LEVEL OF LOOSELY COUPLING

3.3.1 Overview

3.3.2 Calculating a degree of loosely coupling

3.3.3 CFIA grid for Loosely Coupling Assessments

3.4 BI-POLAR COUPLING ASPECTS

3.5 HYPOTHESIS

4. INTUITIONISTIC FUZZY SETS

4.1 MOTIVATION ON INTUITIONISTIC FUZZY SETS

4.2 IFS DEFINITION AND BASIC OPERATIONS

4.3 APPLYING IFS TO COUPLING- AND IMPACT ASSESSMENTS

4.4 SEMANTICS OF INTUITIONISTIC FUZZY DEPENDENCIES

4.5 BRIEF ASPECTS ON INTUITIONISTIC FUZZY REASONING

5. IFCFIA SOLUTION APPROACH

5.1 IFCFIA – OVERVIEW OF THE METHOD

5.1.1 From CFIA to IFCFIA

5.1.2 IFCFIA Seven Step Approach

5.2 DESCRIPTION STEP 1-3: CREATING THE CFIA GRID

5.2.1 Business Application Maps based on auto-discovery

5.2.2 Creating the CFIA grid with couplings

5.3 DESCRIPTION STEP 4: CREATING THE DIRECT COUPLING INDEX

5.3.1 Pulling together the Level of loosely and tightly coupling

5.3.2 Defining the Vagueness

5.3.3 IFCFIA formal Definition

5.4 DESCRIPTION STEP 5: CALCULATION OF THE INDIRECT COUPLINGS

5.4.1 Overview

5.4.2 Indirect Coupling Calculations

5.4.3 Types of indirect impact operations

5.4.4 Example of indirect coupling calculations

5.4.5 Updating the CFIA Grid with the indirect coupling index

5.4.6 Impact- and Root Cause Analysis

5.5 STEP 6: (OPTIONAL) EXTENDING THE BUSINESS VIEW

5.5.1 IT (human) enabled Services

5.5.2 Adding the Costs of Failure to the IFCFIA

5.6 STEP 7: (OPTIONAL) INTUITIONISTIC FUZZY REASONING

5.6.1 Fuzzification of Performance Measures

5.6.2 Applying IFCFIA for Fuzzy Intuitionistic Reasoning

5.6.3 Example IFCFIA based Rules

5.7 ADAPTED IMPACT CALCULATION FOR GRADUAL FAILURES

5.7.1 From bi-modal to gradual failure situations

5.7.2 Applying derived mathematical models from FCM

5.7.3 Extending IFCFIA Step 5: indirect coupling calculations

6. IFCFIA USE CASES

6.1 SCENARIO: INCIDENT IN LOGISTICS MANAGEMENT

6.1.1 Overview of Scenario

6.1.2 Service Components Auto-Discovery “Logistics Management”

6.1.3 Creating The Fault Tree for Logistics Management Application

6.1.4 TADDM Server Affinity Report

6.1.5 Component Topology Billing Application

6.1.6 Creating the Logistics Management IFCFIA Grid

6.2 USE CASE: BUSINESS IMPACT ANALYSIS

6.2.1 Building the IFCFIA Dependency Graph

6.2.2 Calculating the indirect impact

6.2.3 Impact Assessment: Incident in Logistics Management

6.3 USE CASE: ROOT CAUSE ANALYSIS

6.4 USE CASE: ADVANCED SERVICE LEVEL MONITORING

6.4.1 SLA Monitoring and early quality analysis

6.4.2 Automated Fuzzy Reasoning based on Backend Monitoring

6.4.3 Fuzzy clustering for analysis of monitoring data

6.5 USE CASE: CAPACITY IN CONSUMPTION BASED MODELS

6.6 USE CASE: 'COST VERSUS BENEFIT' FOR IT INVESTMENTS

6.7 EXAMPLE: IFCFIA VERSUS OTHER IMPACT ANALYSIS

7. IMPACT MODELS APPLIED IN ITIL V3 BEST PRACTICES - ABILITIES AND LIMITS

7.1 IMPACT ANALYSIS WITHIN ITIL V3 BEST PRACTICES

7.1.1 ITIL v3 Service Lifecycle Modules

7.1.2 ITIL v3 Impact Analysis Activities and Tools

7.2 ITIL V3 DEPENDENCY ANALYSIS - ABILITIES AND LIMITS

7.2.1 Configuration auto-discovery

7.2.2 Fault Tree Analysis

7.2.3 Component Failure Impact Analysis (CFIA)

7.2.4 Business Impact Analysis (BIA)

7.2.5 Summary IT Impact Analysis in ITIL v3

7.3 RECOMMENDATIONS AND PROPOSED EXTENSIONS

7.4 APPLYING IFCFIA TO EXTEND ITIL QUALITY METHODS

8. LIMITATIONS AND CONCLUSION

8.1 LIMITATIONS OF IFCFIA

8.1.1 Multiple incoming arcs

8.1.2 Loopbacks in the directed dependency graph

8.2 CONCLUSION

A. REFERENCES

B. TERMS AND DEFINITIONS


C. LIST OF FIGURES

Figure 1: The Challenge of Virtualized multi-layered SLA translations

Figure 2: SLAs Defining the delivered Service Quality

Figure 3: KQI, PI, and SLA Relationship [Open Group 04]

Figure 4: PI/KQI Indicator Hierarchy

Figure 5: Delta application response time as function of delta DB query time

Figure 6: Threshold of PI/KQI and KQI derived from underlying services PI

Figure 7: Web-app using three-tier servers with SLA translations

Figure 8: Overview: Incident and Restoration Process

Figure 9: Example MTTR / MTBF Allocation

Figure 10: Point System for Loosely coupling Assessments

Figure 11: Extended CFIA Grid with Direct Couplings

Figure 12: Zadeh Fuzzy Complement

Figure 13: Sugeno and Yager Complement

Figure 14: Certainty Mappings for Sugeno and Yager

Figure 15: Sugeno complement for lambda = 2

Figure 16: direct IFS relationships “Coupling” in a directed graph

Figure 17: One-Level Dependency Map after performing FCC or RCC

Figure 18: extended dependency graph with IT enabled Services

Figure 19: Fuzzification of “response time” metric into the fuzzy variables

Figure 20: Mapping of Thresholds and Linguistic Variables

Figure 21: Indirect Couplings with KQI Activation Levels

Figure 22: J2EE 4 tier client-server architecture

Figure 23: Topology Logistics Management J2EE Application

Figure 24: TADDM Grouping Composer for manual Assignments

Figure 25: Bill Payment Business Service

Figure 26: Grouping of Frontend Components (WebServer)

Figure 27: Logistics Management Software Components

Figure 28: Frontend Software Components of Logistics Management

Figure 29: XML Export of Logistics Management Dependencies

Figure 30: L2 Dependency for Web Server hpux1.lab.collation.net:3880

Figure 31: L3 Dependency for Web Logic Server histronix.lab.collation.net

Figure 32: L4 Dependency for Database Server

Figure 33: L5 Dependency DNS/NIS service majestix.eng.collation.net:53

Figure 34: L6 Dependency Computer System majestix.eng.collation.net

Figure 35: L2 Dependency for Web Server cleopatra.lab.collation.net:4580

Figure 36: L3 Dependency for Web Logic Server

Figure 37: Logistics Management Software Topology Level 2-4

Figure 38: Computer System L2 Dependency for hpux1.lab.collation.net

Figure 39: Dependencies on the hpux1.lab.collation.net computer system

Figure 40: Software Components depending on Computer Systems

Figure 41: Logistics Management Application Physical Topology

Figure 42: Fault Tree Software Component Levels representing the J2EE Layer

Figure 43: TADDM Server Affinity Report

Figure 44: Bill Payment Business Service

Figure 45: Billing Application Physical Topology

Figure 46: Dependency Graph for Logistics Management Application

Figure 47: IFCFIA Dependency Directed Graph for Logistics Management

Figure 48: Simplified indirect Intuitionistic Dependency Map

Figure 49: Fuzzification of conformance measurements

Figure 50: Fuzzification of application “response time” metric

Figure 51: Natural granulation of component performance measurements

Figure 52: booked vs. burst capacity

Figure 53: 3 tier web architecture providing static and dynamic web pages

Figure 54: ITIL Service Lifecycle Modules Source: krpm.wordpress.com

Figure 55: TADDM Application Mapping showing the dependencies

Figure 56: Frontend availability calculation based on component availabilities

Figure 57: FTA for Fault Tree for Web-based system

Figure 58: Example basic CFIA Matrix

Figure 59: CFIA Worksheet with Failure Modes [Bailey et al 2008]


D. LIST OF TABLES

Table 1: Fenton and Melton Coupling Levels [Alghamdi 07]

Table 2: Example Business Types with regard to RPO, RTO and Impact

Table 3: Combined Classical and Probabilistic logical IFS operations

Table 4: IFCFIA Grid with indirect couplings to the business service

Table 5: Extended CFIA with Cost of Failure

Table 6: Business Impact Versus Cost and Risk

Table 7: Component Relationship Matrix

Table 8: Determining the Loosely Coupling Index

Table 9: Determining the Tightly Coupling Index

Table 10: IFCFIA Grid with indirect coupling calculations and cost of failure

Table 11: Final IFCFIA Matrix for the Bill Payment Business Service

Table 12: Attitude based impact calculations

Table 13: IFCFIA Grid with RCC Couplings used for Root Cause Analysis

Table 14: IFCFIA with monitored Failure Modes


E. LIST OF ABBREVIATIONS

ADDM Application Dependency Discovery Management

BIA Business Impact Analysis

BCM Booked Capacity Models

CBS Coupling between Services

CDM Common Data Model

CFO Chief Financial Officer

CI Configuration Item (CMDB)

CMDB Configuration Management Database

CFIA Component Failure Impact Analysis

DMZ De-Militarized Zone

FCC Forward Coupling Calculation

FCM Fuzzy Cognitive Maps

FTA Fault Tree Analysis

FMECA Failure Mode, Effect and Criticality Analysis

IaaS Infrastructure-as-a-Service

IFS Intuitionistic Fuzzy Set

IP Internet Protocol

ITIL IT Infrastructure Library

ITeS IT Enabled Services

ITSCM IT Service Continuity Management

ITSM IT Service Management

J2EE Java 2 Platform, Enterprise Edition

KQI Key Quality Indicator

LMA Logistics Management Application

MTTRS Mean Time to Restore Service

MTTR Mean Time to Recover

MTBF Mean Time Between Failures

MTBSI Mean-Time-Between-System-Incidents

MOM Message-Oriented Middleware

MOO Multi-Objective Optimization

PaaS Platform as a Service

PI (technical) Performance Indicator

PVA Pain Value Analysis

OLA Operational Level Agreement

QoS Quality of Service

RBD Reliability Block Diagram

RCA Root Cause Analysis


RCC Reverse Coupling Calculation

RED Redundancy Level based on Point System

RPO Recovery Point Objective

RTO Recovery Time Objective

SaaS Software-as-a-Service

SFA Service Failure Analysis

SLA Service Level Agreement

SLACS SLA Compliance of the Service

SLM Service Level Management

SLO Service Level Objectives

SOA Service Oriented Architecture

SQL Structured Query Language

TADDM Tivoli Application Dependency Discovery Manager

Translation Types of Quality Parameters

o M2C Metric to Configuration

o C2C Configuration to Configuration

o M2M Metric to Metric

o C2M Configuration to Metric

UML Unified Modeling Language

UBB Usage Based Billing

WAS Web Application Server

WLS Clustered Server for Workload Distribution


1. PREFACE

In an increasingly service-oriented world, “best effort” service delivery is not good enough. But how does the business know whether it is getting an adequate service?

Service level requirements are set to ensure that the business goals underlying IT services are met. Service Level Agreements (SLAs) incorporate the expectations and obligations regarding the properties of a service. The most significant part of an SLA is the scope of service duties, which contains a description of the offered service, the constraints, the steps required to deliver the service and the objectives agreed between a service provider and a service requestor. Those objectives mostly concern the Quality of Service (QoS). SLAs can be used as an instrument to set, monitor and enforce performance thresholds on the operations of a service.

The early SLAs were IT-centric, written in IT technical terms, and predominantly provided the IT user with service levels that had more to do with internal IT performance measurements than with business-oriented service achievement. Frequently, metrics were inappropriate, measurements imprecise and monitoring weak. The SLA reports simply did not reflect the experience of the customer when using the service. SLAs are now becoming increasingly business-focused and measured in real time. They are increasingly seen as a strategic tool to align IT support services directly with business mission achievement. The more mature organization now writes business-centric SLAs and has sophisticated performance measurement tools that accurately reflect the customer's or service user's actual experience.

Service Level Agreements (SLAs) related to customer satisfaction or other front end measures (response time, wait time, correctness, etc.) of the composed service are often used to manage delivery contracts and have revenue impacts for service providers. Guaranteeing business-focused SLAs results in an optimization problem that spans multiple domains (e.g. networking, computer systems, and software engineering). The landscape of today's IT service providers is inherently integrated. It consists of all kinds of elements, namely networks, servers, storage, and software stacks. The fulfilment of any higher-level objective requires proper enforcement on multiple resources at several levels.

FIGURE 1: THE CHALLENGE OF VIRTUALIZED MULTI-LAYERED SLA TRANSLATIONS

This work will help to bridge from IT-centric service levels, written in IT technical terms, to business-oriented service achievement by relating business metrics for Service Level Agreements (SLAs) to measurable parameters for technical services that can be defined and reported against an SLA and monitored under Service Level Management. Based on the proposed IFCFIA framework, we assess the complex dependency and impact relationships between backend components and the quality of the frontend service.

Today’s IT management requires flexible SLA models in which the organization benefits from optimized investments in IT infrastructure and capacity levels. Instead of tightening SLAs across the board, which is a costly approach, individual service levels should be driven directly by business needs, so that the organization benefits by projecting and paying only for what is required.

Based on the IFCFIA framework proposed here, we can define rules that capture such business insights into how service accounts as a whole can improve quality and optimize service levels, and that allow service providers to concentrate on the quality rather than on the performance of a service. The framework supports service administrators in proactively tracking measures of individual components to gather the overall SLA quality status of the impacted business services, performing an intelligent multi-level impact or fault tree assessment by applying intuitionistic fuzzy mathematical models and methods.


2. BACKGROUND AND PROBLEM STATEMENT


2.1 SLAS IN MULTI-LAYERED SERVICE DELIVERY MODELS

2.1.1 SLA DEPENDENCY MAPPING

Service Level Agreements (SLAs) incorporate the expectations and obligations regarding the properties of a service. They are documents that define the relationship between two parties: the provider and the recipient [Service Level Agreement Zone, 2007]. The agreement works as a contract between the two. The most significant part of an SLA is the scope of service duties, which contains a description of the offered service, the constraints, the steps required to deliver the service and the objectives agreed between a service provider and a service requestor. Those objectives mostly concern the quality of service (QoS).

FIGURE 2: SLAS DEFINING THE DELIVERED SERVICE QUALITY

The early SLAs were often IT-centric, written in IT technical terms, with service levels that had more to do with internal IT performance measurements than with business-oriented service achievement. Frequently, metrics were inappropriate, measurements imprecise and monitoring weak.

In recent years, SLAs have become increasingly business-focused and measured in real time. The SLA process is now used as an instrument to set, monitor and enforce performance thresholds on the operations of a business service. SLAs are increasingly seen as a strategic tool to align IT support services directly with business mission achievement. Business-centric SLAs are backed by sophisticated performance measurement tools that accurately reflect the customer's or service user's actual experience. Service Level Agreements related to customer satisfaction or other front end measures (response time, wait time, correctness, etc.) of the composed service are also often used to manage delivery contracts and have revenue impacts for service providers.

In parallel, the delivery of Information Technology (IT) services is moving away from a single provider model, and is increasingly based on the composition of multiple services and assets. Often, the composed service consists of component services that come from specialized providers.


For a composite service, the business-centric SLA related to front end measures depends on the proper execution of the underlying services like hardware, software, personnel (IT enabled services) or even licenses.

The challenge with such enterprise SLAs is translating metrics for business applications into measurable parameters for technical services that can be defined and reported against an SLA and monitored under Service Level Management (SLM). Service composition, translation and mapping therefore lie at the core of SLA management, in that they correlate metrics and parameters within and across layers. Guaranteeing business-focused SLAs results in an optimization problem that spans multiple domains (e.g. networking, computer systems, and software engineering). The fulfilment of any higher-level objective requires proper enforcement on multiple resources at several levels. For example, guaranteeing certain bounds on the response times of an ERP-type application involves the ERP software, the application and database servers, the network configuration, and more.

It would be easier to monitor, understand, and manage Quality of Service (QoS) metrics related to individual services and the resources they use (such as storage, network, processing power, etc.). However, the virtualized service delivery model requires the composition of services to deliver the overall service to the client. The interactions between the individual services, many of which may come from different sources, make it harder to control the performance of the overall service or to provide quality measures for it in terms of the quality and performance of the underlying services [Joshi et al. 2011].

When setting up service monitoring, we therefore need to translate the metrics related to individual components of the service, such as accuracy, responsiveness, uptime, etc. (which are in a sense backstage metrics), back to the front stage experienced by the client or business. Service monitoring on backstage metrics implies a bottom-up approach and begins by monitoring backend applications and resources. Knowing the relation and dependency of a backend service to the end-user service (or composite service), service administrators can proactively track and verify these dependencies by periodically polling the measures of individual services and gathering the overall quality status of the end-user service. This allows administrators responsible for the functioning of a service to monitor its quality based on the measurements typically already taken for the infrastructure components.
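The bottom-up monitoring idea can be illustrated with a minimal Python sketch. It assumes each backstage metric exposes a polling function and a single upper threshold; the names `BackendCheck` and `poll_frontend_status`, the metric names and all threshold values are hypothetical and not taken from the paper. The roll-up shown here is deliberately crisp (any violation marks the frontend service as "at risk"), which is a simplification and not the gradual IFCFIA assessment developed later.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BackendCheck:
    """One backstage metric with a hypothetical upper threshold."""
    name: str
    poll: Callable[[], float]  # returns the current measurement
    threshold: float           # largest acceptable value (assumed)


def poll_frontend_status(checks: List[BackendCheck]) -> Dict[str, object]:
    """Poll every backend check once and derive a coarse frontend status."""
    readings = {c.name: c.poll() for c in checks}
    violations = sorted(c.name for c in checks if readings[c.name] > c.threshold)
    status = "ok" if not violations else "at risk"
    return {"status": status, "violations": violations, "readings": readings}


# Illustrative backend components of one end-user service (fixed sample values).
checks = [
    BackendCheck("db_query_ms", lambda: 120.0, 200.0),
    BackendCheck("app_server_cpu_pct", lambda: 93.0, 85.0),
    BackendCheck("network_latency_ms", lambda: 12.0, 50.0),
]

report = poll_frontend_status(checks)
print(report["status"], report["violations"])
```

In a real deployment the `poll` callables would query the monitoring system periodically; here they return fixed values so that the sketch stays self-contained.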

The further explained process of SLA dependency mapping can create visi-bility between applications and infrastructure dependencies. It can capture, connect and unveil relationships including the way in which applications behave and relate to the technology architecture on which they rely.


2.1.2 THE CONCEPT OF KEY QUALITY AND PERFORMANCE INDICATORS

The Open Group describes a concept of key quality and performance indicators (KQI/PI), developed in the TM Forum's Wireless Services Measurement Handbook (GB 923), in the “Open Group SLA Management Handbook. Volume 4: Enterprise Perspective” [The Open Group 04]. The importance of this KQI/PI concept is that it allows the provider of the service to concentrate on the quality rather than the performance of a service.

FIGURE 3: KQI , PI, AND SLA RELATIONSHIP [OPEN GROUP 04]

Service Level Specification parameters can be one of two types: Key Quality Indicators (KQIs) and (mostly technical) Service Performance Indicators (PIs). At the highest level, a KQI or group of KQIs is required to monitor the quality of the business service offered to the end-user. These KQIs will often form part of the contractual SLA between the provider and the customer. A KQI provides a measurement of a specific aspect of the performance of a product or a service. The KQI is derived from a number of sources, including performance metrics of the service or of underlying support services as PIs. As a service or application is supported by a number of service elements, a number of different PIs may need to be determined to calculate a particular KQI. The mapping between the PIs and the KQI may be simple or complex, empirical or formal.

The automated process of translating and correlating high-level requirements and policies of all kinds down to the infrastructure level creates a set of related PIs, which we term a KQI/PI Hierarchy. The KQI/PI Association Hierarchy Graph, or KQI/PI Hierarchy for short, is a directed graph representing the association relationships between sets of KQIs/PIs within (or across) tiers in a multi-tier architecture as well as across multi-stakeholder domains.


FIGURE 4 : PI/KQI INDICATOR HIERARCHY

The following notation is used for expressing the association relationship between two sets of KQIs/PIs A and B:

A → B, read as “A is coupled to B”.

The associative coupling relationship is transitive (in mathematical terms): if A → B and B → C, then A → C.

While the association relationship only relates adjacent sets of KQIs/PIs, the Hierarchy establishes KQI/PI associations across the whole stack in a distributed multi-tier architecture. This widens the scope of the KQIs/PIs and of the reasoning about root causes of a KQI/PI violation. Indirect dependencies can be derived instead of being entered and maintained by the system engineer or operator: the indirect dependencies between KQIs/PIs can be determined from the relationships and couplings of the direct dependencies they contain.
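The derivation of indirect dependencies from direct couplings can be illustrated with a small sketch. It is not part of the KQI/PI concept itself: the function name `indirect_couplings` and the three indicator names are hypothetical, and the hierarchy is assumed to be given as a simple mapping from each indicator to the indicators it is directly coupled to.

```python
from collections import deque


def indirect_couplings(direct):
    """Return, for every indicator, all indicators reachable via A -> B couplings.

    `direct` maps each indicator to the set of indicators it is directly
    coupled to. Transitivity (A -> B and B -> C implies A -> C) is applied
    by a breadth-first walk over the directed graph.
    """
    closure = {}
    for start in direct:
        reached = set()
        queue = deque(direct.get(start, ()))
        while queue:
            node = queue.popleft()
            if node in reached:
                continue  # already visited, also guards against cycles
            reached.add(node)
            queue.extend(direct.get(node, ()))
        closure[start] = reached
    return closure


# Hypothetical three-level hierarchy: PI -> PI -> KQI
direct = {
    "db_query_time": {"app_server_time"},
    "app_server_time": {"end_user_response_time"},
    "end_user_response_time": set(),
}
closure = indirect_couplings(direct)
```

Only the two direct couplings are entered by hand; the coupling of `db_query_time` to `end_user_response_time` is derived.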

Having determined multiple PI parameters P1, P2, …, Pn, a formula of the form f(P1; P2; …; Pn) = F(Qn) may in theory be determined to calculate the KQI parameters Qn [The Open Group 04].

KQIs and PIs can be set on different levels, where the highest level is the business or end-user KQIs which are being measured. The PIs can be combined by some empirical or theoretical function to lead to a measure of the KQI. The exact form of the function linking PIs to a KQI is an important concept for SLA definitions. As in most cases the KQI/PI relationship cannot be described mathematically, measurements in real or laboratory environments can determine the relationship of a specific KQI to the impacting PIs, as the example in Figure 5 shows for the delta in application response time with regard to the delta in database execution time. For instance, extending the response time of a DB query by 1 second may lead to an additional delay of half a second in the business service response time to the end-user.
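As an illustration of how such an empirical relationship might be estimated from measurements, the following sketch fits a straight line through the origin. The linear form is an assumption made only for this example, the function name `fit_slope` is hypothetical, and the sample values are invented so that they reproduce the ratio mentioned in the text (1 second of extra query time leading to half a second of extra response time).

```python
def fit_slope(delta_pi, delta_kqi):
    """Least-squares slope through the origin: delta_kqi ~ slope * delta_pi."""
    numerator = sum(x * y for x, y in zip(delta_pi, delta_kqi))
    denominator = sum(x * x for x in delta_pi)
    if denominator == 0:
        raise ValueError("need at least one non-zero PI delta")
    return numerator / denominator


# Hypothetical laboratory measurements, in seconds (not taken from the paper)
delta_db_query_time = [0.5, 1.0, 2.0]
delta_app_response_time = [0.25, 0.5, 1.0]

slope = fit_slope(delta_db_query_time, delta_app_response_time)
predicted_delay = slope * 1.0  # extra response time for 1 s of extra query time
```

In practice the relationship need not be linear; the sketch only shows that the PI-to-KQI function can be obtained from measured deltas rather than from a closed formula.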

FIGURE 5: DELTA APPLICATION RESPONSE TIME AS FUNCTION OF DELTA DB QUERY TIME

Dependency couplings can be constructed in a practical and feasible manner in order to satisfy aspects of the distributed nature of SLAs in a multi-tier architectural environment. KQIs/PIs within a Hierarchy are related to each other, but the nature of the relationship is not rigorously defined. Even if it is certain that these KQIs/PIs are related, the impact that they can have on each other is not immediately obvious. So how can soft dependencies be described via a KQI/PI dependency hierarchy? This is discussed in the remainder of this work.

FIGURE 6: THRESHOLD OF PI/KQI AND KQI DERIVED FROM UNDERLYING SERVICES PI

The KQI is derived from a number of information sources, including metrics for calculating the performance of the service or metrics of underlying services as PIs. In general, a KQI is defined from a set of PIs, and each PI or KQI has upper and lower thresholds for warning ("Lower Warning/Upper Warning") and for error ("Lower Error/Upper Error").

For instance, a set of PI values indicating warnings can degrade a service until it provokes an interruption; this would then have to be considered an error indicating a KQI violation.
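The threshold scheme can be sketched as follows. The four threshold names follow the text; everything else is an assumption for illustration, in particular the aggregation rule in `kqi_status` (more than a given number of simultaneous PI warnings is treated as a KQI error) and the utilization values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    lower_error: float
    lower_warning: float
    upper_warning: float
    upper_error: float


def classify(value, t):
    """Return 'error', 'warning' or 'ok' for a single PI or KQI value."""
    if value <= t.lower_error or value >= t.upper_error:
        return "error"
    if value <= t.lower_warning or value >= t.upper_warning:
        return "warning"
    return "ok"


def kqi_status(pi_statuses, max_warnings):
    """Hypothetical aggregation rule: any PI error, or more than
    `max_warnings` simultaneous PI warnings, counts as a KQI violation."""
    if "error" in pi_statuses:
        return "error"
    warnings = pi_statuses.count("warning")
    if warnings > max_warnings:
        return "error"
    return "warning" if warnings else "ok"


# Hypothetical utilization thresholds in percent
utilization = Thresholds(lower_error=5, lower_warning=10,
                         upper_warning=80, upper_error=95)
statuses = [classify(v, utilization) for v in (85, 50, 90)]
status = kqi_status(statuses, max_warnings=1)
```

Here no single PI is in error, yet two concurrent warnings are escalated to a KQI violation, which mirrors the degradation case described above.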

2.2 COMPLEXITY OF MULTI-LAYERED SLA TRANSLATIONS

2.2.1 TYPES OF SLA TRANSLATIONS

In “Challenges in SLA Translation” (SAP Research, Dec 09) [Hui Li 09], SLA definitions are investigated in close correspondence with a multi-layered service architecture. In the described model each layer or sub-layer can have SLAs defined, sometimes referred to as “sub-SLAs” or “OLAs (Operational Level Agreements)”. For example, the service layer might have “web service response time” as a metric and “authentication method” as a parameter. The resource layer might have “number of cores” as a metric and “network latency” as a parameter. It is evident that these metrics and parameters at one layer or at different layers are somehow correlated, but fully characterizing their relationships can be very complex and remains a challenging task. The author characterizes such problems as “SLA translation” problems, meaning any form of transformation of metrics and parameters, within one layer or from one (sub-)layer to another, in a multi-layered service environment.

The virtualized service delivery model requires the composition of services to deliver the overall service to the client. The interactions between the individual services and components, many of which may come from different sources and infrastructure components, make it harder to control the performance of the overall service or to provide quality measures for it in terms of the quality and performance of the underlying services. The fulfilment of any higher-level objective requires proper enforcement not on a single resource, but on multiple resources at several levels.

The following translation types are described, distinguishing between observables (metrics) and configurables (parameters) in the service layers:

C2C (Configuration to Configuration): this type of translation mostly relates to the dependencies

M2C (Metric to Configuration): this type of translation mostly translates higher-level objectives to lower-level system parameters

C2M (Configuration to Metric): this type of translation predicts higher-level objectives from lower-level system parameters

M2M (Metric to Metric): this type of translation correlates a high-level metric with lower-level metrics


The system we examine here supports a web-based application and is built on the rather common three-tier server configuration (four tiers when the thin client tier is also counted). The first server tier, which consists of nodes 1 and 2, involves Hypertext Transfer Protocol (HTTP) servers in the role of load balancers. The second tier, which consists of nodes 3 through 6, involves Web application servers (WASs). The third tier involves nodes 7 and 8 as database (DB) servers.

Explanation (legend of Figure 7): KQI OR PI; TRANSLATION-TYPE; APPL; NODE; TECH COMPONENT; LOG. TIER

FIGURE 7: WEB-APP USING THREE-TIER SERVERS WITH SLA TRANSLATIONS

In the example topology of a web application using a three-tier server configuration, all types of translations are used. Examples shown within this scenario are:

1. M2C (Metric to Configuration) translates, in the example, the end-user objective “Response Time” to the underlying application server topology (“Deploy Option”), which is needed to ensure enough capacity to handle the expected number of requests in time. This layout and capacity planning is normally based on the peak hours in order to guarantee the expected response time and Service Level also during heavy workload, e.g. during month-end closings.

2. C2C (Configuration to Configuration) is used here to translate the “Deploy Option” of an application server to the supporting “Database Configuration”. A clustered application server for high-availability topologies needs a corresponding database configuration to support the clustered processing, which is needed e.g. for the processing of Java 2 Entity Beans.

3. M2M (Metric to Metric) correlates a high-level metric with lower-level metrics, here for example the service objective for “Application Response Time” with the required average database “Query Execution Time”. For instance, a sub-second end-user application response time requires an average DB query execution time of at most half a second.

4. Finally, an example for C2M (Configuration to Metric) is the translation of the defined “Database Configuration” and DB cluster setup to the lower-level system parameters of the Storage Area Network (SAN) infrastructure with the required “Bandwidth” capacity metric.
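The four examples above can be written down as data, which makes the naming scheme of the translation types explicit. This is only a sketch: the `Kind` enumeration and the function `translation_type` are hypothetical names, and the classification of each item as metric or configuration simply follows the four examples.

```python
from enum import Enum


class Kind(Enum):
    METRIC = "M"          # observable
    CONFIGURATION = "C"   # configurable parameter


def translation_type(source, target):
    """Name the translation from a `source` kind to a `target` kind, e.g. 'M2C'."""
    return f"{source.value}2{target.value}"


# (source item, source kind, target item, target kind), as in the four examples
examples = [
    ("Response Time", Kind.METRIC, "Deploy Option", Kind.CONFIGURATION),
    ("Deploy Option", Kind.CONFIGURATION, "Database Configuration", Kind.CONFIGURATION),
    ("Application Response Time", Kind.METRIC, "Query Execution Time", Kind.METRIC),
    ("Database Configuration", Kind.CONFIGURATION, "Bandwidth", Kind.METRIC),
]
labels = [translation_type(s_kind, t_kind) for _, s_kind, _, t_kind in examples]
```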

Normally the service level measurements change when SLA translations are carried out over several layers. In some cases the measurement can remain the same, e.g. for availability measures. In this situation, the end-user service availability will always be lower than the availability of the weakest component or tier of the solution. In other words, tier availability targets are always higher than the end-to-end availability target. As a result, the elimination of tiers with their corresponding relationships can lead to improved availability.
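A short calculation illustrates this point. The sketch assumes that the tiers are arranged in series and fail independently, so that the end-to-end availability is the product of the tier availabilities; the tier names and availability figures are hypothetical.

```python
from math import prod


def end_to_end_availability(tier_availabilities):
    """Availability of tiers in series, assuming independent failures."""
    return prod(tier_availabilities)


# Hypothetical availability targets per tier
tiers = {"http": 0.999, "was": 0.998, "db": 0.9995}

total = end_to_end_availability(tiers.values())
weakest = min(tiers.values())

# Eliminating one tier (here the HTTP tier) raises the end-to-end figure
without_http = end_to_end_availability(
    a for name, a in tiers.items() if name != "http"
)
```

With these figures the end-to-end availability is roughly 99.65 %, below the 99.8 % of the weakest tier, and removing a tier improves it.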

Correlating a higher-level objective such as an end-user service response time with low-level operational parameters may involve sophisticated analytic models. SLA translation means any form of transformation of metrics and parameters, within one layer or from one (sub-)layer to another, in a multi-layered environment.

2.2.2 MULTI-LAYERED SLA OPTIMIZATIONS

Multi-layered SLA optimization results in a combinatorial optimization problem of finding the optimal mapping between each service and its related components and infrastructures. Finding optimal solutions (optimal combinations of concrete services) among a huge number of possible solutions takes a significant amount of time and cost, and several heuristics have been proposed to find semi-optimal solutions in a reasonably short time. When a problem has a number of possibly conflicting objectives (goals) to be optimized simultaneously, there is mostly no single optimal solution but rather a whole set of alternative solutions of equivalent quality, which are called Pareto solutions. For example, in the SLA-aware service composition problem, minimizing cost and minimizing overall response time are clearly conflicting and, therefore, there may be no single optimum to be found. Multi-objective scenarios can yield a whole set of Pareto solutions, which are all optimal in some sense, and give the option to assess the trade-offs between different solutions.
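The notion of a Pareto solution can be made concrete with a brute-force sketch for two objectives that are both minimized (cost and response time). The candidate compositions and their values are hypothetical, and the exhaustive comparison is only suitable for small candidate sets; it is not one of the heuristics mentioned above.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` in every objective and
    strictly better in at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical service compositions as (cost, response_time_in_seconds)
candidates = [(100, 2.0), (150, 1.2), (200, 0.8), (180, 1.5), (250, 0.8)]
front = pareto_front(candidates)
```

The composition (180, 1.5) is dropped because (150, 1.2) is cheaper and faster, and (250, 0.8) is dropped because (200, 0.8) is cheaper at the same response time. The three remaining compositions are the trade-off set from which a provider would choose.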

Another important aspect is that, in order to deliver the SLA guarantees, the service provider must also care about reducing the operational costs on its side. This is essentially a problem with multiple objectives (e.g. performance and cost) which conflict with each other. In such cases Multi-Objective Optimization (MOO) methods prove to be more applicable than single-objective methods.

The need to reduce operational costs on the service provider side leads to a basic assumption for the following conceptual work of this paper: instead of tightening SLAs over all layers and maximizing across the board, which is a costly approach, SLAs are optimized according to functional needs in their respective context and relationships, with the objective of delivering defined performance parameters while minimizing cost. These cost-optimized parameters for system components are defined during the system design and development phase by system and software engineers using several types of methodologies (e.g. statistics, software performance engineering, QoS mapping, queuing theory, optimization theory).

Another important aspect is that service objectives can be positively or negatively related. As an example of a negative relation, systems with more than one redundant node are often better solutions from the availability point of view, but they naturally run at a lower utilization and a higher total cost of ownership. An example of a positive linkage of service objectives is the following: a consequence of the relationship between capacity and availability is that workload growth reduces availability by consuming redundant capacity. This linkage is a major advantage, as several service level targets are directly related here: an increase in the availability service level is coupled to a better service level for the application response time in peak hours.
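The link between capacity and availability can be sketched with a simple count of spare nodes. The model is deliberately crude and is an assumption for illustration: every node has the same capacity, and redundancy is whatever remains after the peak workload is served. The node count matches the four application server nodes of the example topology, while the capacity and workload figures are hypothetical.

```python
from math import ceil


def redundant_nodes(total_nodes, node_capacity, peak_workload):
    """Nodes left over as redundancy once the peak workload is served."""
    needed = ceil(peak_workload / node_capacity)
    return total_nodes - needed


# Four application server nodes, hypothetical capacity of 100 requests/s each
spare_today = redundant_nodes(total_nodes=4, node_capacity=100, peak_workload=180)
spare_after_growth = redundant_nodes(total_nodes=4, node_capacity=100, peak_workload=350)
```

At a peak of 180 requests/s two nodes are spare; after growth to 350 requests/s none is, so a single node failure would then affect both availability and peak-hour response time.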

2.3 IT IMPACT ANALYSIS APPLIED IN SERVICE MANAGEMENT

Service Management standards are influenced by the range and quality of methods and techniques and by the benefits of established best practices. ITIL (IT Infrastructure Library) provides a best-practice-based framework, developed since the late 1980s by the UK Office of Government Commerce. It is the most widely used and accepted approach to IT Service Management (ITSM) around the world. ITIL includes several valuable management ideas and well-tried procedures (ITIL V3: A Management Guide [Van Haren 08]).


Chapter 8 gives an overview and discusses the pros and cons of those impact and dependency techniques in IT service management that are explicitly applied within the ITIL v3 best practices and management guidance. There are several areas in the ITIL v3 Service Lifecycle modules where dependency and impact analysis or similar reliability engineering techniques are used. Those techniques are mainly applied in ITIL v3 within Service Design, Service Operation and Continual Service Improvement. The four most important methods and tools referred to in ITIL v3, Configuration Management Database (CMDB), Fault Tree Analysis (FTA), Component Failure Impact Analysis (CFIA) and Business Impact Analysis (BIA), are discussed in detail, as these will be the basis for the proposed framework in this project work.

The following section is a summary of the discussion of traditional IT impact analysis within chapter 7:

Application Dependency Discovery Management (ADDM)

Application discovery is the process of automatically analysing the artefacts of a software application and the physical elements that constitute a network (e.g. servers, firewalls, etc.). Dependency mapping creates visibility between discovered applications and infrastructure dependencies. Automated application discovery and subsequent dependency mapping can capture, connect and unveil relationships, including the way in which applications behave and relate to the technology architecture on which they rely. Application Dependency Discovery Management has its roots in an application management perspective and originally aimed to streamline the infrastructure management processes. ADDM introduces a level of trust that discovered information is no longer hypothetical, but real. By automatically discovering interdependencies between and among applications and underlying systems, ADDM products deliver a point-in-time view of the “truth”. This can be a powerful enabler that, over time, can minimize the effort IT organizations expend on the information assimilation function and can also provide a basis for ever-higher levels of automated problem resolution [EMA Radar Dec.10].

But as discussed in chapter 8, there are several limitations to using ADDM tools. On the one hand, ADDM reduces the dependency on the human factor; on the other hand, it can provide only a basic view for impact assessments of business services, as logical dependencies cannot be discovered and thus must again be complemented by human interaction. The automated discovery finds dependencies by looking, for instance, at the TCP connections or by evaluating the configuration of programs, which does not provide insight into the consequences for impacted higher-level services and SLAs. So the ADDM picture needs to be extended with additional logical dependencies. This goes far beyond the scope of ADDM tools, as for the functioning of an information system we also need to know about dependencies on e.g. IT users, IT staff, business units and supporting processes and functions such as the helpdesk.

ADDM keeps the assessment of component relations as a simple binary result (connected/not connected). This can hardly be interpreted for impact assessments and dependency couplings, but it gives a fundamental view of related and interfacing infrastructure components.

Fault Tree Analysis (FTA)

Fault tree analysis is a top-down, deductive failure analysis in which a state of a system is analysed using Boolean logic to combine a series of lower-level events. Events in a fault tree are associated with statistical probabilities. The output probabilities of the fault tree follow from the set operations of Boolean logic.

The Fault Tree Analysis adds a logical representation of all the different relationships that are necessary to result in the top event. In constructing this fault tree, a thorough understanding is required of the logic and basic causes leading to the top event. The FTA can be incorporated within the CFIA matrix to assess the dependencies of a business service. The major limitation here is that classical FTA is binary (fail–success) and may therefore fail (as do most deductive dependency models) to address soft dependency problems as needed for KQI/PI relationships. In practice we mostly have soft dependency relationships, which also allow the more complex consideration of a degraded mode of operation.

One of the key benefits resulting from the application of FTA techniques is that they force the analyst to follow a systematic procedure for analysing the system. In most cases, the mere construction of the model leads to a better understanding of the system design, including aspects such as component interdependencies and reliability weaknesses. Because an FTA does not produce a unique answer, the value of an FTA still depends on the skill and experience of the analyst.
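How Boolean gates combine event probabilities in a classical fault tree can be sketched as follows. The sketch assumes statistically independent basic events; the tree mirrors the three-tier example (a tier fails only if all of its nodes fail, and the service fails if any tier fails), and all probabilities are hypothetical.

```python
from math import prod


def and_gate(probabilities):
    """Output event occurs only if all (independent) input events occur."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output event occurs if at least one (independent) input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical failure probabilities of the individual nodes
p_http_tier = and_gate([0.01, 0.01])   # both HTTP servers fail
p_was_tier = and_gate([0.02] * 4)      # all four application servers fail
p_db_tier = and_gate([0.01, 0.01])     # both DB servers fail

# Top event: the business service is unavailable
p_top = or_gate([p_http_tier, p_was_tier, p_db_tier])
```

Each event here is either failed or working, which is exactly the binary view that cannot express a degraded mode of operation.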

Component Failure Impact Analysis (CFIA)

The purpose of a Component Failure Impact Analysis (CFIA) is to help management predict and evaluate the impact of component failures on IT systems. Component failures include hardware and software failures but should also cover the processes, tools and people that support the systems. When conducting a CFIA, a matrix is created with IT services on one axis and Configuration Items (CIs) on the other. This enables the identification of critical CIs (that could cause the failure of multiple IT services) and fragile IT services (that have multiple single points of failure).
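A minimal sketch of such a grid, and of how critical CIs and fragile services can be read from it, is given below. The service and CI names are hypothetical, the grid is stored as a mapping from each service to the CIs that are single points of failure for it, and the thresholds `min_services` and `min_spofs` are assumptions chosen for the example.

```python
# Hypothetical CFIA grid: service -> CIs whose failure stops that service
cfia = {
    "web_shop": {"http_lb", "db_cluster"},
    "reporting": {"db_cluster"},
    "intranet": {"http_lb", "was_node", "file_server"},
}


def critical_cis(grid, min_services=2):
    """CIs whose failure would stop at least `min_services` services."""
    counts = {}
    for cis in grid.values():
        for ci in cis:
            counts[ci] = counts.get(ci, 0) + 1
    return {ci for ci, n in counts.items() if n >= min_services}


def fragile_services(grid, min_spofs=2):
    """Services that have at least `min_spofs` single points of failure."""
    return {service for service, cis in grid.items() if len(cis) >= min_spofs}


critical = critical_cis(cfia)
fragile = fragile_services(cfia, min_spofs=3)
```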

A basic CFIA will target a specific section of the infrastructure, just looking at simple binary choices (e.g. if we lose component x, will a service stop working?). More advanced CFIAs can be expanded to include a number of variables, such as likelihood of failure, repair and recovery time, detailed recovery procedures, organizational assignments and integration into wider service management processes, and can also consider and evaluate different component failure modes.

The Component Failure Impact Analysis method provides a systematic approach that helps management predict and evaluate the impact of component failures on IT systems. It extends the pure system view (hardware and software) of component failures to also include the processes, tools and people that support the systems. This provides a starting point for considering different management approaches and techniques to mitigate or avoid the impact of failures. With CFIA the solution is not purely technical; it becomes methodological. CFIA provides a relevant assessment of the physical components of the service, but also examines the systems management framework, the supporting tools and the skills within the delivery organization. Limitations of CFIA stem from its static system analysis, which does not consider the impact of multiple component failures or latent defects that impact timing and sequencing. When defining the CFIA grid, the interdependencies between and among applications and underlying systems need to be constructed in a theoretical and feasible (mostly manual) way. CFIA can answer the question “Which business services depend indirectly on a particular component x?” but cannot comment on the type of dependency or on the level to which the services are logically coupled and impacted.

Business Impact Analysis (BIA)

Business Impact Analysis identifies vital business functions and their dependencies. These dependencies may include suppliers, business processes, IT services etc. As its output, BIA defines the requirements, which include recovery time objectives and minimum Service Level Targets for each IT service.

The BIA can best be conducted on the basis of a CFIA. Having created the CFIA matrix including the dependencies, the grid can be expanded to include fields related to the business value and the cost of failure of a service. These fields can simply show the hourly failure cost to the business or can map the number of users supported by each business service. The component coupling to the higher-level services thus also indicates the cost and the users affected by a degraded operation of an infrastructure node.
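Extending the CFIA grid with business fields can be sketched as follows. The function name `failure_impact`, the services, the hourly costs and the user counts are all hypothetical; the sketch simply sums the fields of every service that depends on the failed component.

```python
def failure_impact(component, depends_on, hourly_cost, users):
    """Aggregate cost and users of all services that depend on `component`.

    Users are summed per service, so a person using two affected
    services is counted twice.
    """
    affected = [s for s, cis in depends_on.items() if component in cis]
    return {
        "services": sorted(affected),
        "hourly_cost": sum(hourly_cost[s] for s in affected),
        "users": sum(users[s] for s in affected),
    }


# Hypothetical CFIA grid with added BIA fields
depends_on = {"web_shop": {"http_lb", "db_cluster"}, "reporting": {"db_cluster"}}
hourly_cost = {"web_shop": 5000, "reporting": 500}   # cost of failure per hour
users = {"web_shop": 1200, "reporting": 40}

impact = failure_impact("db_cluster", depends_on, hourly_cost, users)
```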

The same BIA estimate that is used during operation to assess the business impact in case of an incident can also be used to justify IT infrastructure improvements by quantifying the total cost to the organisation of an IT service failure. These costs can then be used to support a business case for additional IT infrastructure investment and provide an objective 'cost versus benefit' assessment.

In practice business impact is hard to measure, as it can have several consequences, from financial impact to fuzzy aspects such as a feeling of dissatisfaction when IT service problems occur. Measures of the business impact of a failure, such as “user productivity loss”, “IT productivity loss” or “lost business cost”, are hard to quantify in monetary value. BIA also provides a static view that does not consider the impact of multiple component failures or latent defects that impact timing and sequencing.

2.4 RECOMMENDATIONS AND PROPOSED EXTENSIONS

Based on the discussion of the traditional methodologies (chapter 7), the following section proposes several recommendations and best practices and lists useful and required extensions of traditional impact analysis:

1. The notion that a single method can support every use case should be replaced by a more complete view that may include several combined and integrated steps to provide the needed results. We therefore recommend that all described methods (ADDM, FTA, CFIA and BIA) be leveraged to provide the overall dependency picture and to show the different aspects of an impact assessment.

2. The CFIA is best suited as the overall frame for incorporating all data and methods. CFIA can be freely extended with different kinds of variables showing failure modes, several reliability parameters, and operational capabilities and techniques, and it extends the pure system view (hardware and software) of component failures to also include the processes, tools and people that support the systems. This is necessary because, for the functioning of an information system, we also need to know about dependencies on e.g. IT users, IT staff, business units, supporting processes such as backups and functions such as helpdesks.

3. The initial CFIA grid should best be set up using auto-discovery tools (ADDM), which provide trust that the discovered information is real and up to date. By automatically discovering interdependencies between and among applications and underlying systems, ADDM products are a powerful enabler that minimizes the effort IT organizations expend on the information assimilation function and can also provide a basis for further higher-level, logical dependency assessments.

4. We recommend that a Fault Tree Analysis (FTA) be incorporated in the CFIA matrix creation process to assess the dependencies of components on a business service. The use of FTA enables the identification of dependent components that could cause the failure of the IT business services when an incident occurs. The basic step of the CFIA, creating a grid with components on one axis and the IT services that depend on those components on the other, can be built using the results of the FTA. We therefore recommend an export from the FTA tools to automate the definition of the grid of lower-level components for each business service.


5. As classical FTA is bi-modal (fail–success) and cannot address soft dependency problems as required for the described KQI/PI relationships, we recommend extending the traditional concepts with a limited or partial dependency model. In practice there are mostly weak or soft dependency relationships, which also allow the more complex consideration of a degraded mode of operation or the concept of a probability of a dependency or impact. In the following approach this can be modelled via fuzzy extensions of the classical FTA.

6. Impact assessments are fuzzy in nature, so the ability to consider the level of vagueness would provide more accurate results. Any assessment in practice is subject to vagueness, uncertainty, limited or imprecise knowledge, unproven information or simple hesitancy to make a statement. This is not supported by the classical approaches and tools.

7. Impact assessments of complex systems need to consider contrary aspects: on the one hand, the risk resulting from the interdependencies of interacting and related components; on the other hand, the set of mitigation, restoration and resilience capabilities of each component. The traditional methods discussed already cover both aspects. Fault Tree Analysis (FTA), as the term fault tree indicates, works in the "failure space" and looks at system failure combinations. The FTA method thus covers the negative risk of interdependencies and the negative impacts of failure. The basic CFIA itself is primarily focused on the mitigation, restoration and resilience capabilities, which represent the positive aspect of independence. Our proposal is to consider the real-world impact of an incident by pulling both aspects simultaneously into one integrated result set.

8. Finally, the viewpoint is also an important concept. It is basically a specification that describes a particular view of the service, which is an important parameter for performing an impact assessment. A viewpoint is defined with a particular stakeholder or set of stakeholders in mind and allows different stakeholders to focus on their own concerns. The impact of a specific incident depends on how closely it is related to a stakeholder's concerns and requirements. Various stakeholders may have their individual concerns, which lead to different subjective impact assessments. We therefore propose to support an “attitude”-based impact assessment model that allows a parameterized impact assessment to be performed.

These recommendations will now be the basis for our proposed soft or granular dependency framework (referred to below as IFCFIA), which will implement the requirements and recommendations described above.

In general, the proposed concepts should be aligned with current industry-standard quality techniques, IT Service Management (ITIL v3) methods and IT architectural dependency modelling standards and best practices.
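One common way to represent a partial dependency together with hesitancy, and with a positive and a negative aspect at the same time, is a value in the sense of Atanassov's intuitionistic fuzzy sets: a membership degree and a non-membership degree whose sum does not exceed 1, the remainder being the hesitancy. The sketch below only illustrates this data structure. It does not reproduce the IFCFIA definitions, and the class name `IFValue` and the interpretation given in the comments are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IFValue:
    membership: float      # degree to which a component failure impacts the service
    non_membership: float  # degree to which it does not (e.g. due to resilience)

    def __post_init__(self):
        # Both degrees lie in [0, 1] and together must not exceed 1
        if not (0.0 <= self.membership <= 1.0 and 0.0 <= self.non_membership <= 1.0):
            raise ValueError("degrees must lie in [0, 1]")
        if self.membership + self.non_membership > 1.0:
            raise ValueError("membership + non_membership must not exceed 1")

    @property
    def hesitancy(self):
        """Part of the assessment that is neither asserted nor denied."""
        return 1.0 - self.membership - self.non_membership


# Hypothetical coupling of a DB server to an end-user service
coupling = IFValue(membership=0.5, non_membership=0.25)
```

A classical binary dependency is the special case (1, 0) or (0, 1) with zero hesitancy.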


2.5 RESEARCH OBJECTIVES

Combining well-grounded academic research with practice-oriented requirements and business scenarios is a useful and common practice within information systems and IT service management. This paper elaborates the theoretical foundations of IT service management impact assessment methods and tools. This theoretical foundation is put into operation as an integrated framework and process workflow for an end-to-end approach and methodology, the so-called IFCFIA framework. It is applied to a real-world data-centre infrastructure and IT application landscapes. Within several use-case scenarios the unifying IFCFIA framework is demonstrated and validated; business, technical and operational benefits are shown and limitations are discussed.

In this research project, the following objectives are pursued and questions analysed:

The first, mostly theoretical objective is to review the Service Management standards for impact and dependency assessments, which are influenced by several proven methods and techniques and by the benefits of established best practices, especially those related to ITIL v3 and the Open Group concepts of key quality and performance indicators and their coupling, described as a measure of the strength of interconnection between services and quality parameters. This is augmented by the elaboration of a range of quality methods in systems and software engineering as an attempt to capture various facets of systems quality within Service Level Agreements.

The second objective is to provide a bridge from IT-centric Service Level Agreements, written in IT technical terms, to business-oriented service achievement. As SLAs are now becoming increasingly business-focused and are seen as a strategic tool to align IT support services directly with business mission achievement, the objective of this project is to provide an integrated approach for aligning and mapping business-focused SLAs to multiple domains (e.g. networking, computer systems, and software engineering). The thesis will research, evaluate and analyse ways for enterprise SLAs to translate metrics for business applications into measurable parameters for technical services that can be defined and reported against an SLA and monitored under Service Level Management. This bridge would then allow quality assessments by tracking measures of individual components to gather the business SLA status.

The third objective is the development of a model and methodology to assess the complex dependency and impact relationships of backend components to the quality of the frontend service. This work will therefore construct dependency couplings in a practical and feasible manner in order to satisfy aspects of the distributed nature of SLAs in a multi-tier architectural environment. Even if it is certain that quality indicators are related, the impact that they can have on each other is not immediately obvious and will be elaborated and modelled in the further work. As a result, the concepts should offer granularity and transparency into complex multi-level impact assessments and fault analysis.

The fourth objective is to enrich the basic methodology with additional capabilities and valuable features, which comprise:

o Instead of binary degrees of compliance, the designed model should support a more granular and complex consideration of a degraded mode of SLA operations.

o As impact assessments are fuzzy in nature, better results are obtained by also considering a level of vagueness, uncertainty, and limited or imprecise knowledge.

o The model naturally needs to consider both positive and negative strengths, because only the positive and negative aspects together define the overall system behaviour and the probable impact on the business service.

o Various stakeholders with individual concerns and requirements are taken into consideration, which leads to different subjective impact assessments (“attitude-based”).

The fifth objective of this thesis is to define a clear business value proposition of the defined IFCFIA methodology (and also its limitations) to the business stakeholders with regard to:

o Quantitative evaluation and estimation of risks and impacts.

o Allowing the service provider to concentrate on the quality rather than the performance of a service.

o Individual Service Levels that are directly driven by business needs, where the organization benefits by projecting and paying for only what is required.

o Operational enablement to pro-actively track measures of individual components to gather the overall SLA quality status of the impacted business services.

o Enrichment of impact and dependency techniques applied within IT service management best practices and guidance. The proposed framework in this thesis is designed to incorporate and naturally extend the existing ITIL quality methods rather than to replace them with an isolated new approach.

The sixth objective of this thesis is to show and prototype practical usage scenarios based on real-world data and data-centre system landscapes. The technical implementation is also leveraging several leading commercial tools in the service management area to support the IFCFIA method.

o Business Impact Analysis

o Root Cause Analysis

o Advanced Service Level Monitoring

o Capacity optimization in Consumption Based Pricing with regard to financial aspects and variable pricing models

o 'Cost versus Benefit' assessments for IT investments

o Analysis of Usage Data to recommend on optimal planning and booking of capacity baselines with regard to evolving market and business needs

The seventh objective is to verify the twofold intuitionistic hypothesis. On the one hand, it states that dependence is only loosely coupled to independence, which allows positive and negative instances of impacts and dependencies to be approached naturally and independently. On the other hand, we want to prove that only the simultaneous consideration of both positive and negative aspects of dependence can best define the overall system behaviour and the probable impact on the dependent business service. This is discussed within concrete scenarios.

The eighth objective is to demonstrate how intuitionistic fuzzy mathematical logic can be applied for practical usage in IT Service Management. The concept of Intuitionistic Fuzzy Sets (IFS) is a generalization of the classical fuzzy set which defines another degree of freedom: the independent judgment of positive and negative aspects. This two-sided (intuitionistic) view, including the possibility to formally represent a third aspect of imperfect knowledge, could be used to describe many real-world service management problems in a more adequate way, by specifying both dependence and independence, pros and cons, for each direct or indirect coupling impact. Atanassov's IFS give us a natural tool for modelling such bipolar relationships within IT systems.


3. INTERDEPENDENCIES AND COUPLINGS


3.1 DEPENDENCE COUPLING AS MEASUREMENT

Dependence Coupling is a measure that we propose to capture how dependent a component or service is on other services or resources for its delivery.

In general the goal is to build components that do not have tight dependencies on each other, so that if one component were to die (fail), sleep (not respond) or remain busy (slow to respond) for some reason, the other components in the system continue to work as if no failure had happened. Loose coupling describes an approach where integration interfaces are developed with minimum assumptions between the sending and receiving parties, thus reducing the risk that a change in one application or module will affect other applications or modules. Loose coupling isolates the components of an application so that each component interacts asynchronously with the others and treats them as a “black box”. For example, in the case of a web application architecture, the application server can be isolated from the web server and from the database. The application server does not know about the web server and vice versa; this decouples the layers, so that there are no dependencies between them from either a code or a functional perspective.

Service oriented architecture (SOA) is an architectural style that uses open standards to describe component architecture. A service is a function that is well-defined, self-contained, and does not depend on the context or state of other services. Services in a system need to couple to execute a task. When services are linked together, they exhibit environmental coupling, which is caused by calling and being called by other services. Compared with traditional systems, SOA is architected with looser coupling. Loose coupling, or a low dependency factor, indicates that the service provider does not have to depend on other services or resources to complete delivery of its service. A high dependency factor, or tight coupling, on the other hand indicates that successful delivery of other services or availability of resources is a prerequisite for the completion of a service.

In the following chapters we will introduce two new types of logical relationship which express the level of interdependency between components: “is tightly coupled” and “is loosely coupled”.

The tightly coupled dependency measurement can be roughly seen as an indicator of the risk resulting from interdependencies, whereas the loosely coupled aspect refers more to the mitigation and resilience capabilities of a system. Loose coupling, or a low dependency factor, indicates that the service does not have to depend on other services or resources to complete delivery of its service. A high dependency factor, or tight coupling, on the other hand indicates that successful delivery of other services or availability of resources is a prerequisite for the completion of a service.


3.2 DETERMINING A LEVEL OF TIGHTLY COUPLING

3.2.1 OVERVIEW

In an initial assessment the type of relationship between two components should indicate the principal measurement which can best be used to specify the level of coupling. When the dependency is between a service and some resource it uses, coupling will essentially be a function of how often the resource is used. For instance, the dependence of a service on the network layer might be measured by how often it makes a socket call, or how much data it transfers. The dependence of a database on a compute partition will be determined by how many compute resources it needs from that partition. For web services we can examine the environmental coupling which is caused by calling and being called by other services. The type of relationship between two components thus indicates the principal measurement which can be applied to specify the level of coupling, and this measurement can be different for each type of component.

The simplest way to describe the level of coupling would be to capture a linguistic description of the dependency – define it as high or moderately high or low. The degree of dependency or coupling could be directly defined by the experts who have created the service. This can be refined by setting a dependency degree between 0 and 1 by the judgment of experts. Another option is to mine the monitoring results and historical data to obtain the data which can then be mapped and normalized against the measure of a dependency relationship.

For tightly coupling we will prefer an ordinary interpretable measurement as this can be best used to determine a probable degradation impact on component operation.

For normalization, the measurement should result in a value between 0 and 1, where 0 means independence and no coupling impact. A value of 1 implies a full compliance relationship, which means in the context of our KQI/PI hierarchy: “KQI A will be violated if related PI B is violated”. In other words, if a lower-level component's service target fails, the impacted business service objective fails as well.
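The two routes described above, an expert's linguistic rating and a normalized measurement, can be illustrated with a minimal sketch. The numeric values assigned to the linguistic terms and the helper name `coupling_degree` are assumptions chosen for illustration only; they are not part of the proposed method.

```python
# Hypothetical mapping from linguistic expert ratings to coupling degrees in [0, 1].
# The numbers are illustrative assumptions, not values taken from the paper.
LINGUISTIC_DEGREES = {
    "none": 0.0,
    "low": 0.25,
    "moderately high": 0.6,
    "high": 0.85,
    "full": 1.0,
}


def coupling_degree(rating=None, measured=None, max_measured=None):
    """Return a coupling degree in [0, 1] from a rating or a raw measurement."""
    if rating is not None:
        return LINGUISTIC_DEGREES[rating.lower()]
    if measured is None or max_measured is None or max_measured <= 0:
        raise ValueError("need a rating, or a measurement with a positive maximum")
    # Normalize against the largest observed value and clip to [0, 1].
    return max(0.0, min(1.0, measured / max_measured))
```

A rating such as "high" is looked up directly, while a mined measurement (for example socket calls per hour) is scaled against the largest value observed in the monitoring data.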

3.2.2 STATIC COUPLING CALCULATIONS

Services in a system need to couple to execute a task. Traditional components are more tightly and statically integrated, having a fixed interface with defined dependencies.

Measurements for those couplings are mostly related to procedural programming coupling measures, which measure the coupling of software components that are implemented in procedural programming languages. Examples for procedural couplings are [Fenton and Melton, 1990] or [Dhama 1995].

Fenton and Melton proposed a metric to measure the coupling between two components x and y, which is defined in the equation

$$C(x, y) = i + \frac{n}{n + 1}$$

where n is the number of interconnections between x and y and i is the highest (worst) level of coupling type found between x and y using the following table.

TABLE 1: FENTON AND MELTON COUPLING LEVELS [ALGHAMDI 07]

Another well-established method is Dhama's metric. [Dhama 1995] defines how a module is coupled. This definition, like most others in software engineering, is a global one. It provides a measure of how tightly coupled the module is with the others. The Dhama metric is an example of an intrinsic coupling metric, which calculates the coupling value of each component individually.

Because Dhama’s metric is mainly made for software modules, and because for couplings we need in particular to define pair-wise relations between two components where one is coupled to the other, [Joshi et al. 2009] proposed to adapt Dhama’s metric to define the coupling between services x and y using the following formula for Service Coupling C

$$C = \frac{1}{i + u + g + r}$$

where

i = in data parameters – data sent from calling service x to called service y
u = out data parameters – data sent from called service y to calling service x
g = number of global variables used as data
r = number of times x calls y.

The lower this measure, the more tightly coupled the two services are. Dhama's metric returns a value closer to zero the more tightly two components are coupled. Conversely, it returns one for the minimal conceivable coupling, for example one call without any data transfer. Because the denominator would be zero, the expression is undefined when no coupling exists.
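The behaviour of the two static metrics can be checked with a small sketch. It assumes the commonly cited Fenton and Melton form C(x, y) = i + n / (n + 1), and it assumes an unweighted reciprocal form 1 / (i + u + g + r) for the adapted Dhama metric, which is consistent with the remarks above (value one for a single call without data, undefined without any coupling). The function names are hypothetical.

```python
def fenton_melton(i, n):
    """Fenton/Melton coupling: i = worst coupling level, n = interconnections."""
    if i < 0 or n < 0:
        raise ValueError("level and interconnection count must be non-negative")
    return i + n / (n + 1)


def service_coupling(i, u, g, r):
    """Adapted Dhama metric, assumed here as 1 / (i + u + g + r).

    i: in data parameters, u: out data parameters,
    g: global variables used as data, r: number of calls from x to y.
    Returns None when no coupling exists (the denominator would be zero).
    """
    total = i + u + g + r
    if total == 0:
        return None
    return 1 / total
```

Note the opposite orientation of the two results: a larger Fenton and Melton value indicates tighter coupling, whereas a smaller value of the adapted Dhama metric does.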

[Alghamdi 07] classified coupling measures into two groups, procedural programming coupling measures (to which Dhama's also belongs) and object-oriented coupling measures, and gives an overview of the available measurements. Alghamdi also proposed an approach for breaking the calculation of coupling into two basic steps. The first step is to generate a description matrix that captures the factors that affect coupling in a system. The second step is to calculate the coupling between each pair of components of the system from the description matrix to produce a coupling matrix.

The objective of generating a description matrix is to create a structure that captures all of the characteristics of a software system that relate to coupling, which can then be used to calculate coupling information for that system. The coupling matrix for a software system of m components is a matrix of order m×m, where each row and each column represents a component of the system. The coupling values can be calculated from the description matrix in various ways; for example, the degree of coupling between two components can be the sum of the weights of all members shared by the two components.
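The two-step calculation can be sketched as follows, using the "sum of the weights of all shared members" rule mentioned above. The description matrix is represented here as a mapping from component to the set of members it uses; this representation, the weights and the names `uses` and `weights` are assumptions for illustration, not Alghamdi's original data structures.

```python
def coupling_matrix(uses, weights):
    """Derive an m x m coupling matrix from a simple description structure.

    uses:    component name -> set of member names the component touches
    weights: member name -> weight of that member
    The coupling of two components is the summed weight of their shared members.
    """
    names = sorted(uses)
    matrix = {a: {b: 0.0 for b in names} for a in names}
    for a in names:
        for b in names:
            if a != b:
                shared = uses[a] & uses[b]
                matrix[a][b] = sum(weights[m] for m in shared)
    return matrix


# Hypothetical three-tier example.
uses = {
    "web": {"session", "config"},
    "app": {"session", "orders"},
    "db": {"orders"},
}
weights = {"session": 2.0, "config": 1.0, "orders": 3.0}
matrix = coupling_matrix(uses, weights)
```

With this rule the matrix is symmetric; a directed rule (who calls whom) would make it asymmetric.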

This approach would be very useful for our impact concerns, as IT infrastructures may have totally different ways and mechanisms of coupling. In an initial assessment the type of relationship between two components should indicate the principal measurement which can be applied to specify the level of coupling. When the dependency is between a service and some resource it uses, coupling will essentially be a function of how often the resource is used. For instance, the dependence of a service on the network layer might be measured by how often it makes a socket call, or how much data it transfers. The dependence of a database on a compute partition will be determined by how many compute resources it needs from that partition. For web services we can examine the environmental coupling which is caused by calling and being called by other services.

[Hong Yang 2010] wrote his doctoral thesis about coupling determination as a measurement within software engineering, while also respecting the scientific basis of coupling measurements. He sees a tendency for researchers and practitioners to apply metrics without a full awareness of what they mean. Coupling, the measure of the interdependence between parts of a (software) system, is one important property for which many metrics have been defined. Hong Yang sees a particular problem in the lack of coverage of all forms of connections that comprise coupling. To illustrate this he identifies indirect forms of coupling that manifest between two seemingly unrelated parts of the system through hidden connections.


3.2.3 DYNAMIC COUPLING CALCULATIONS

In recent work these concepts have been further developed and a suite of dynamic coupling metrics for service oriented software and infrastructure has been proposed.

These dynamic measures are needed for Service oriented Architectures (SOA). SOA is an architectural style that uses open standards to describe software components. A service is a function that is well-defined, self-contained, and does not depend on the context or state of other services. Services in a system need to couple to execute a task. When services are linked together, they exhibit environmental coupling, which is caused by calling and being called by other services. Compared with traditional systems, SOA is architected with looser and more dynamic couplings.

SOA complicates static reliability modelling by providing the ability to select and invoke services at run time. The dynamic nature of SOA will therefore place new demands on the modelling techniques used to predict reliability and availability. For example, it may be justified to develop stochastic simulation models for mission-critical SOA systems.

SOA also provides the ability to assemble a system from services that have guaranteed quality of service (QoS) attributes that specify reliability characteristics. This provides very loose coupling and could allow the reliability engineer to model system availability using a clearly defined hierarchy of independent models.

[Quynh and Thang, 2009] proposed and compared several metrics to evaluate the couplings in dynamic systems. Two of them are CBS (Coupling Between Services) and DC2S (Direct Coupling between 2 Services).

The CBS metric calculates the number of relationships between service A and the other services in a system:

$$CBS_A = \sum_{i=1}^{n} AB_i$$

where n is the number of services in the system, AB_i = 0 if A does not connect to B_i, and AB_i = 1 if A connects to B_i.

The higher the CBS, the higher the dependency of A on other services, due to its connections to numerous services. As an example, service A could have been programmed so that it exchanges data with services B, C and D, but at run time only communicates with service B. Dynamic measurement therefore brings more exact results by considering the environmental coupling and measuring the connection to service B only.


The DC2S identifies the dependency between two services A and B:

$$DC2S(A, B) = \frac{N(A, B)}{\sum_{i=1}^{n} N(A, B_i)}$$

where n is the number of services in the system and N(A, B_i) is the number of connections from service A to service B_i.

Although this metric looks at the connections between the services, it does so by calculating the share of A’s connections that go to B out of all of A’s connections. This can lead to false assumptions, because each value is relative to the calling service's own total: DC2S(B,C) = 0.9 does not imply a tighter coupling than DC2S(A,B) = 0.5. Therefore we recommend the usage of the CBS for dynamic coupling calculations.
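Both dynamic metrics can be computed from observed run-time calls. The sketch below assumes that CBS counts the distinct services a service actually connects to and that DC2S is the share of A's connections that go to B, as described above. The call log format (a list of caller/callee pairs) and all function names are assumptions for illustration.

```python
from collections import Counter


def connection_counts(calls):
    """Count observed run-time connections per (caller, callee) pair."""
    return Counter(calls)


def cbs(service, counts):
    """Number of distinct other services that `service` connects to at run time."""
    return len({callee for (caller, callee), k in counts.items()
                if caller == service and callee != service and k > 0})


def dc2s(a, b, counts):
    """Share of a's connections that go to b (0.0 if a makes no connections)."""
    total = sum(k for (caller, _), k in counts.items() if caller == a)
    if total == 0:
        return 0.0
    return counts[(a, b)] / total


# Hypothetical call log: A calls B three times and C once, B calls C twice.
log = [("A", "B")] * 3 + [("A", "C")] + [("B", "C")] * 2
counts = connection_counts(log)
```

In this log DC2S(B, C) is 1.0 although B makes only two calls in total, while DC2S(A, B) is 0.75 for three calls, which illustrates why DC2S values of different calling services are hard to compare.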

To summarize, as SOA provides the ability to create even more complex systems with a high degree of dynamic configurability, static or intuitive approaches cannot be used to predict such system couplings with a sufficient degree of confidence. We conclude that the use of SOA will further increase the need for the application of more complex modelling techniques and dynamic coupling metrics.

3.2.4 AGGREGATED MEASUREMENTS OF SERVICE QUALITY

Measuring the couplings of all distinct quality parameters individually results in an extremely large relationship matrix, even for a small number of components and services. To limit the size and complexity of the model scope in real-world system landscape assessments, quality parameters are mostly pulled together. The general functioning and basic operational level, e.g. described as an availability performance indicator, is often an aggregated performance value such as the maximum number of total PI failures in a period. If the number of failures exceeds the defined threshold, which is the maximum number of allowed failures, the system or component is in failure status (not available); otherwise it is considered operational and overall compliant to specifications (available). The aggregation or combination of the single performance PIs leads to an aggregated measurement which can be described as a component compliance PI. For service architectures this can be the SLACS (SLA Compliance of the Service) [Rud et al. 07], measured as the fraction of time during which all SLA fulfilment indicators of the service lie in green. This QPI/PI defines the overall degree of compliance of a component to specifications and can be used as an overall measurement of a probable failure situation for component operations.
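The two aggregations named above can be sketched briefly. The sketch assumes equally spaced monitoring samples, each holding one boolean per SLA fulfilment indicator (True meaning green); the sample format and the function names are assumptions for illustration.

```python
def is_available(failures_in_period, max_allowed_failures):
    """Component counts as available unless failures exceed the allowed maximum."""
    return failures_in_period <= max_allowed_failures


def slacs(samples):
    """Fraction of equally spaced samples in which all SLA indicators are green."""
    if not samples:
        raise ValueError("at least one sample is required")
    compliant = sum(1 for sample in samples if all(sample.values()))
    return compliant / len(samples)


# Hypothetical monitoring window with two indicators per sample.
window = [
    {"response_time": True, "availability": True},
    {"response_time": False, "availability": True},
    {"response_time": True, "availability": True},
    {"response_time": True, "availability": True},
]
```

For the window above, three of four samples are fully green, so the SLACS value is 0.75.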


3.3 DETERMINING THE LEVEL OF LOOSELY COUPLING

3.3.1 OVERVIEW

Loosely Coupling is similar to, but not the same as, the coupling measure used in traditional software engineering to describe the independence between two modules. Loose coupling, which means a low dependency factor, indicates that the service does not have to depend on a fully functional related component. Put the other way round, it indicates the capability to mitigate the impact on the dependent service when an associated component dies (fails), sleeps (does not respond) or remains busy (is slow to respond) for any reason.

Loosely Coupling is therefore defined as a measurement of the level to which the dependent component can complete the delivery of its service even if the coupled component fails or is degraded in operation.

There are numerous dimensions to the idea of loose coupling. Integration between two applications, for instance, may be coupled loosely in time through the usage of Message oriented Middleware (MoM), meaning that the availability of one system does not have any effect on the other. On the other hand, integration might be coupled loosely in format through the utilization of middleware as a means of performing data transformation, i.e. differences in data models do not prevent integration.

In Service Oriented Architectures (SOA), loose coupling is the approach in which integration interfaces are developed with very few assumptions between sending and receiving parties. This reduces the risk that a change in one particular application will make a change in a related application necessary. A measurement for this kind of loose coupling may therefore be the maximum number of changes to the data elements of the sending or receiving systems that still allows the systems to communicate correctly.

The degree of loosely coupling in this sense is a contrary effect to the degree of tightly coupling. Dependencies are mostly expressed and described in a positive form. There are several measurements and metrics for defining a dependency or level of coupling, but there are hardly any metrics for defining a level of non-dependency. This means that in real-world situations we prefer to determine a level of dependence and implicitly assess the independence via the negation of the dependency level.

We propose in our conceptual work to explicitly assess a degree of loosely coupling as a level of resilience and component independence. Loosely coupling aspects should here focus on the restoration and mitigation capabilities of the affected components compared to the business objectives to recover.


Within ITIL v3 best practices, a “Recovery Point Objective” (RPO) is defined by business continuity planning. It is the maximum tolerable period in which data might be lost from an IT service due to a major incident. The RPO gives systems designers a limit to work to. For instance, if the RPO is set to 4 hours, then in practice offsite mirrored backups must be continuously maintained; a daily offsite backup on tape will not suffice. The RPO is only a measure of the maximum time period in which data might be lost in case of an incident affecting an IT service, not a direct measure of how much data might be lost.

The “Recovery Time Objective” (RTO) is the time it takes to recover the service. The events that mark the start and end of the RTO duration must be pre-agreed. Roughly speaking, an RTO is often based on the principle of setting the time objective to be “the amount of time the business can be without the service”. Each service component has an associated availability rate, recovery objectives, technology prerequisites, and cost to deliver the service.

So the RTO and RPO form part of the basic specification for any IT service and component. If for instance the RTO is set to 4 hours and the RPO to 1 hour, then a mirror copy of production data must be continuously maintained at the recovery site, and close-to-dedicated recovery hardware must be available at the recovery site: hardware that is always capable of being pressed into service within 30 minutes or so. These RTO and RPO settings demand a fundamentally different hardware design, which is, for instance, considerably more expensive than tape backup designs.

When getting closer to an RTO of seconds, the cost to provide such a solution increases exponentially. The same applies to the RPO. But the pressure to maintain or reduce IT costs means that Chief Information Officers (CIOs) must justify the investment in availability technologies by categorizing IT systems in terms of their criticality and implementing the most cost-effective solutions to achieve agreed-upon recovery objectives or service-level agreements (SLAs).

So the impact of an incident mainly depends on whether the specific RTO target can be met or not. The major parameter for the time required to restore a service is the “Mean Time to Restore Service” (MTTRS). The time difference between the RTO and the MTTRS is therefore a major indicator of whether an incident will have a larger business impact or not.


FIGURE 8: OVERVIEW: INCIDENT AND RESTORATION PROCESS

Source: http://en.wikibooks.org/wiki/ITIL_v3_(Information_Technology_Infrastructure_Library)/Service_Design

Important resilience and recovery related service design parameters are:

o Mean Time to Recover (MTTR): The typical time that it takes to recover (including repair) a component, sub-system or system. Usually seconds, minutes or hours, possibly days or even months in the case of a component like a data centre.

o Mean Time to Restore Service (MTTRS): The time between failure and full restoration of a service. This means we also need to assess the abilities of system designs to meet RPO criteria.

o Mean Time Between Failure (MTBF): The mean (average) time between successive failures of a given component, sub-system or system.

o Mean Time Between System Incidents (MTBSI): The elapsed time between the detection of two consecutive incidents.

Based on terminology above, MTBSI = MTBF + MTTRS, and availability A can be calculated by A = MTBF / (MTBF + MTTRS).

MTTRS differs from MTTR in that MTTR is the mean time to repair a configuration item, whereas MTTRS is the mean time to restore service after the repair. E.g. MTTR is the time to change the CPU of a node; MTTRS is the time to restore all services provided by that node.
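The relations MTBSI = MTBF + MTTRS and A = MTBF / (MTBF + MTTRS), together with the RTO versus MTTRS comparison discussed earlier, can be worked through in a few lines. All times are assumed to be in the same unit (hours in the example); the function names and the example numbers are hypothetical.

```python
def mtbsi(mtbf, mttrs):
    """Mean time between system incidents: MTBF + MTTRS."""
    return mtbf + mttrs


def availability(mtbf, mttrs):
    """Availability A = MTBF / (MTBF + MTTRS), a value in [0, 1]."""
    if mtbf < 0 or mttrs < 0 or mtbf + mttrs == 0:
        raise ValueError("times must be non-negative and not both zero")
    return mtbf / (mtbf + mttrs)


def rto_margin(rto, mttrs):
    """Positive: restoration is expected within the RTO; negative: RTO likely missed."""
    return rto - mttrs


# Hypothetical component: fails on average every 998 h, service restored after 2 h.
a = availability(998, 2)
```

For this component the availability is 0.998; with an RTO of 4 hours the margin is 2 hours, whereas an RTO of 1 hour would be missed by 1 hour on average.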


Component failures and applied restoration methods include hardware and software but should also cover the processes, tools and people that support the systems. Resilience is not a technical solution only, so the assessment should be relevant to the physical components of the service but should also examine the systems management framework, the supporting tools and the skills within the delivery organization.

Well-written RTO/RPO business objectives must measure unplanned and planned downtime. They must take into account the timing of the downtime (e.g., end of month, quarterly close, and peak sales periods), and they must measure downtime from the perspective of the user. This means measuring the availability of the end IT service, not just of the individual infrastructure components such as clients, servers, storage and networks.

To summarize, there are several measurements and metrics available for defining a dependency or level of coupling, but there are hardly any metrics for defining a level of non-dependency or independence.

Therefore we decided to create our own measurements to define a degree of loosely coupling, which will be proposed in the following chapter. The proposed measurements for loosely coupling are based on the idea that service design needs to consider the following two major impact aspects:

o Loosely coupling is a measurement of the level to which the dependent component can complete the delivery of its service even if the coupled component fails or is degraded in operation. This can be seen as a level of independence and is complementary to the tightly coupling aspect.

o Loosely coupling also indicates the capability to mitigate the impact on the dependent service in time when an associated component is not compliant to specifications. This can be expressed, for instance, as the relation and time difference of the “Mean Time to Restore Service (MTTRS)” to the “Recovery Time Objective (RTO)”.

3.3.2 CALCULATING A DEGREE OF LOOSELY COUPLING

Most tightly coupling metrics, like the Fenton and Melton metric, are examples of inter-modular coupling metrics, which calculate the coupling between each pair of components in the system. For loosely coupling, based on resilience measurements, intrinsic coupling metrics are a better fit, as each component has individual resilience capabilities and we therefore want to calculate the loosely coupling degree of each component individually.

Relating the MTTR to the Business Service RTO/RPO target

A simple method is to consider only the four parameters RTO, RPO, MTTR and required availability A = MTBF / (MTBF + MTTR). (In the following work we assume that MTTRS and MTTR are the same, in line with classical concepts which do not differentiate between those two parameters.)

Recovery Time Objective (RTO) is defined as the period of time within which systems, applications, or functions must be recovered after an outage. The RTO measures how long a failed application can be down before it begins to cost the enterprise significant amounts of money. As an example, when the finance group within a firm is unable to print and send out bills to customers for less than 24 hours, the overall impact on the business would be minimal. If the order entry system were unavailable for even a few hours, the losses could be substantial. The RTO for the finance billing application might therefore be set at 24 hours, while the RTO for the order entry system may be limited to 30 minutes or less.

Recovery Point Objective (RPO) is the point in time to which systems and data must be recovered after an outage. RPOs determine the amount of data that may need to be recreated after the systems or functions have been recovered. If for instance a system failure resulted in the order entry database being corrupted, then the RPO would define how much lost data would be too much to tolerate. In practice, the recovery targets are mostly turned into availability categories for a business service via translation into different business type and data value classes.

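| Bus. Type | Required Availability | Impact                      | RPO (minutes) | RTO         |
|-----------|-----------------------|-----------------------------|---------------|-------------|
| 1         | 90%                   | Not important to operations | 10,000        | 7 days      |
| 2         | 99%                   | Important for productivity  | 1,440         | 1 day       |
| 3         | 99.9%                 | Business important          | 120           | 2 hours     |
| 4         | 99.99%                | Business vital              | 10            | 15 minutes  |
| 5         | 99.999%               | Mission critical            | 1             | 1.5 minutes |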

TABLE 2: EXAMPLE BUSINESS TYPES WITH REGARD TO RPO, RTO AND IMPACT
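To make the availability classes of Table 2 more tangible, the following sketch converts a required availability into the downtime it tolerates per year and looks up the highest business type a given availability would satisfy. It assumes a 365-day year with 24x7 service hours; the helper names are hypothetical, and the downtime figures are plain arithmetic rather than values stated in the table.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year, 24x7 service hours

# Rows of Table 2: business type -> (required availability, RPO min, RTO min)
BUSINESS_TYPES = {
    1: (0.90, 10_000, 7 * 24 * 60),
    2: (0.99, 1_440, 24 * 60),
    3: (0.999, 120, 2 * 60),
    4: (0.9999, 10, 15),
    5: (0.99999, 1, 1.5),
}


def allowed_downtime_minutes(availability):
    """Downtime per year (in minutes) tolerated by a required availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def business_type_for(availability):
    """Highest business type whose required availability is met, else None."""
    met = [t for t, (required, _, _) in BUSINESS_TYPES.items()
           if availability >= required]
    return max(met) if met else None
```

Under these assumptions a 99.9% requirement (type 3) tolerates roughly 525.6 minutes, that is about 8.8 hours, of downtime per year.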

RTO/RPO business objectives are business measurements from the end user's perspective. In our proposed measurement we therefore relate the resilience capabilities of the individual infrastructure components such as PCs, servers, storage and networks to the required objective of the business service. In case an infrastructure component is shared by several business services, the proposed approach is to define the MTTR target for that infrastructure component by collecting the RTO/RPO targets of all affected business applications and taking the minimum of those.

So the simplest method to calculate a degree of Loosely Coupling LC would be to compare the MTTR of an individual infrastructure component Comp with the RTO/RPO objectives of all impacted Business Services BS1-BSn.

$$LC_{Comp} = \frac{\min(RTO_{BS_1 \ldots BS_n},\ RPO_{BS_1 \ldots BS_n})}{MTTR_{Comp}}$$

The MTBF is also an important factor, as it indicates how often an incident will occur over time for a component. A high number of incidents, which decreases the overall system availability, will also decrease the degree of loose coupling for a component, as the mitigation capabilities of the sub-component must be applied more often, which results in a more tightly coupled relationship.

This can best be shown as the component availability level A:

$$A_{Comp} = \frac{MTBF_{Comp}}{MTBF_{Comp} + MTTR_{Comp}}$$

So the adjusted degree of Loosely Coupling LC for an individual component can be calculated as follows, also considering the component availability level.

$$LC_{Comp} = \frac{\min(RTO_{BS_1 \ldots BS_n},\ RPO_{BS_1 \ldots BS_n})}{MTTR_{Comp} \cdot A_{Comp}}$$

The higher the LC index, the better the component mitigation capabilities in relation to the required business services' RTO/RPO targets, and the more resilient the overall business system is against business impacts of infrastructure incidents. The result can afterwards be normalized by mapping it to an LC degree in [0,1], as described in the next chapter.
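The basic and the adjusted LC index can be computed as in the following sketch, which implements the two formulas as printed above. All times are assumed to share one unit (minutes in the example); the function name, the parameter layout and the example values are hypothetical.

```python
def loose_coupling(mttr, objectives, availability=None):
    """LC index of one infrastructure component.

    mttr:         mean time to recover the component
    objectives:   iterable of (rto, rpo) pairs, one per impacted business service
    availability: optional component availability A; if given, the adjusted
                  form min(RTO, RPO) / (MTTR * A) is used
    """
    if mttr <= 0:
        raise ValueError("MTTR must be positive")
    # Tightest recovery objective over all impacted business services.
    tightest = min(min(rto, rpo) for rto, rpo in objectives)
    denominator = mttr if availability is None else mttr * availability
    return tightest / denominator


# Hypothetical component shared by two business services (times in minutes).
services = [(120, 120), (15, 10)]
basic = loose_coupling(5, services)
adjusted = loose_coupling(5, services, availability=0.8)
```

Here the tightest objective is 10 minutes, so the basic index is 2.0 and the adjusted index is 2.5. Read literally, dividing by an availability below 1 raises the index, whereas the text describes frequent incidents as lowering the degree of loose coupling; the sketch follows the formula as printed and leaves that point open.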

In practice, an issue is that RTO targets are usually defined at the front-end business service, not at the level of individual infrastructure components such as servers, storage and networks. Moreover, determining the criticality of business applications or IT systems and writing meaningful, achievable RTO/RPO objectives with business owners is often far more challenging than the implementation of the technology itself.

For this reason, a second approach is proposed which uses individual component resilience assessments without comparison to the business services' RPO/RTO.


Individual Components’ Resilience Assessments

[J.Eifert 2012] proposed a method for defining individual components' independence degrees (which can be used as an index for loosely coupling) without relating to business recovery objectives, by simply assessing three major component resilience parameters: the MTBF, the MTTR and the redundancy level RED, which is set by an evaluation point method proposed here.

Redundancy aspects consider redundant installations, which can be classified, for example, as hot, warm and cold redundancy. Hot sites are kept functional at all times and can take over at a moment's notice. The fail-over function is usually implemented using a "heartbeat" cable: as soon as the second component stops receiving the "pulse", it takes over the work of the first one. To make the system more reliable, the redundant component(s) would have to be located far away from the primary installation. This requires a wide area network with high bandwidth and comes with larger hardware expenses and maintenance costs. A cold site requires more effort to be brought online, but is considerably less expensive. A warm site resides somewhere in the middle. In a warm site the components are as redundant as in the hot one, which means they are identical twins, including their dedicated function. Unlike in the hot site, however, they are not running all the time, which reduces the expenses significantly but does not allow a seamless fail-over. This process may take a few minutes to a few hours depending on the system. Due to the high costs, companies often prefer cold redundancy, which means the components are not prepared for a take-over.

A complete recovery process may be assessed, including recovery timings (to enable IT to provide the business with accurate estimates of when service can be restored), available alternatives (to identify which alternative recovery options are available in the event of a failure) and the corresponding recovery procedures (confidence that valid recovery procedures exist for each component). Component recovery times mostly have to be estimated. There are rare cases in which measurements exist, but for the same component several different problems could occur, each of which may influence the recovery process. Thus, the planned recovery time (MTTR) can only be seen as a more or less precise estimate.

[J.Eifert 2012] proposed to assess the three resilience parameters MTBF, MTTR and RED and then to combine them in a formula. To allocate a degree (ideally between zero and one) to the MTTR and the MTBF, a linear function is proposed as the standard, with the MTBF or MTTR on the x-axis and the allocated value on the y-axis.

FIGURE 9: EXAMPLE MTTR / MTBF ALLOCATION

A lower MTTR results in a higher degree of loose coupling, because a fast repair means higher resilience. For the MTBF the opposite applies: a higher MTBF results in a higher degree. An exponential function like the red line could also be used for a more pessimistic assignment. The blue line represents an optimistic assignment. In the MTBF example, 300 days are used as the maximum value for the MTBF. Of course this value needs to be adapted to the technological standard and the system expectations. Any MTBF higher than the maximum number of days is assigned the maximum level of one. For the MTTR we can use, as an example, a duration of >= 24 hours resulting in the lowest loosely coupling level of zero, but the scales for the MTTR must also be defined individually with regard to business needs.
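The linear allocation can be sketched as two small functions. This is only an illustration of the standard linear case: the function names are hypothetical, and the limits of 300 days and 24 hours are the example values from the text, not fixed constants.

```python
def lc_from_mtbf(mtbf_days: float, max_days: float = 300.0) -> float:
    """Linear mapping: MTBF of 0 days -> 0, MTBF >= max_days -> 1."""
    return min(max(mtbf_days / max_days, 0.0), 1.0)


def lc_from_mttr(mttr_hours: float, max_hours: float = 24.0) -> float:
    """Linear mapping: MTTR of 0 hours -> 1, MTTR >= max_hours -> 0."""
    return min(max(1.0 - mttr_hours / max_hours, 0.0), 1.0)


print(lc_from_mtbf(100.0))  # one third of the 300-day maximum
print(lc_from_mttr(1.0))    # fast repair, close to 1
print(lc_from_mtbf(400.0))  # above the maximum, capped at 1.0
print(lc_from_mttr(30.0))   # slower than 24 hours, floored at 0.0
```

The pessimistic or optimistic curves would replace the linear expression inside each function while keeping the same clamping to [0,1].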

The redundancy degree RED should again provide a value between zero and one, which can be inserted into the formula later on. The best possible case is a hot redundant component item for which procedures are worked out and tested; this is rewarded with a value of one. A value of zero means no redundancy at all.

[J.Eifert 2012] proposed a system which gives 0.7 "points" for a hot redundancy, 0.5 for a warm one and 0.2 for a cold one. If recovery procedures are documented, 0.2 points are added to the points given for the redundancy, and another 0.1 if the procedures are also tested. If multiple redundancies exist, the highest redundancy is evaluated. The point system for the redundancy level RED allows many different factors to be evaluated while remaining simple and practicable. It is important to note that a warm redundancy with tested procedures gets a slightly higher "ranking" than a hot one without procedures.

The following picture displays the point system for a better understanding. A sample shows the summing up of the points.

FIGURE 10: POINT SYSTEM FOR LOOSELY COUPLING ASSESSMENTS

It is therefore important that the redundancy variable also reflects whether procedures are planned and/or tested when determining the loosely coupling degree.
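The point system can be sketched as a small scoring function. The names are hypothetical, and two details are assumptions not stated explicitly in the text: a component without any redundancy scores zero regardless of procedures, and the bonus for tested procedures is only granted when procedures are documented.

```python
# Points per redundancy type, as given in the text.
REDUNDANCY_POINTS = {"hot": 0.7, "warm": 0.5, "cold": 0.2}


def redundancy_degree(redundancies, procedures_documented=False,
                      procedures_tested=False):
    """Return RED in [0, 1] for a component.

    redundancies: iterable of "hot", "warm" or "cold"; empty means none.
    """
    # If multiple redundancies exist, only the highest one is evaluated.
    base = max((REDUNDANCY_POINTS[r] for r in redundancies), default=0.0)
    if base == 0.0:
        return 0.0  # assumption: no redundancy always scores zero
    bonus = 0.0
    if procedures_documented:
        bonus += 0.2
        if procedures_tested:
            bonus += 0.1
    # Rounding only removes floating-point noise such as 0.8999999.
    return round(base + bonus, 2)


print(redundancy_degree(["hot"], procedures_documented=True))
print(redundancy_degree(["warm"], True, True))
```

With these rules a warm redundancy with tested procedures (0.8) ranks above a hot one without procedures (0.7), as described above.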

Having worked-out procedures for the recovery of a component item will certainly shorten and secure the recovery process. Although this is well known, enterprises rarely invest resources and time to develop such procedures. And among the enterprises that have established such procedures, even fewer take the time to test them.

A formula can now be created from the MTBF, the MTTR, the redundancy, and the existence and testing of procedures for a component. To calculate the degree of Loosely Coupling for a specific component, LCComp, we apply the already assessed levels for the MTTR (LCMTTR), the MTBF (LCMTBF) and the redundancy (LCRed).

LCComp = a * LCMTTR + b * LCMTBF + c * LCRed

This formula also includes weighting factors, which normalize the formula but also allow specific weights to be defined for the different aspects. Therefore, a + b + c = 1 must always hold. Each of the concerns associated with the mitigation strategies can be annotated with a weight symbolizing, for instance, a priority of the restoration capability. The initial weight values will be given by the experts after assessing the foreseen average impact on the affected business applications. If a weight value then falls below a specified minimum, the decision could be not to apply a probably unsuitable mitigation strategy. In the case of strategy conflicts, the weight may be used to decide which method to prefer for a specific infrastructure instance.

Apart from using the weights to differentiate the importance of the various component resilience capabilities, we may assign different weights to particular instances of the same component, representing the preferences of the respective impacted business application. This may result in different weights for the same underlying infrastructure components.

A brief example of a single component’s resilience assessment

In line with the described method, 300 days are set for the MTBF as the maximum value, corresponding to a Loosely Coupling Level of 1, and for the MTTR a duration of >= 24 hours results in the lowest loosely coupling level of zero.

The investigated component has an MTTR of 1 hour, an MTBF of 100 days and a hot fail-over method for which procedures have been developed but not yet tested. Mapping these variables to values in [0,1], we get 1 - (1/24) = 0.9583 for the MTTR, 100/300 = 0.3333 for the MTBF and 0.7 + 0.2 = 0.9 for the redundancy level.

Inserting the values into the formula with the weighting factors a = 0.3, b = 0.2 and c = 0.5, we get the following result for the Loosely Coupling Index LCComp:

LCComp = a * LCMTTR + b * LCMTBF + c * LCRed

= 0.3 * 0.9583 + 0.2 * 0.3333 + 0.5 * 0.9 = 0.804
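The worked example can be re-computed with a few lines of code. This is a minimal sketch; the function name is hypothetical, and the weight check simply enforces the condition a + b + c = 1 stated above.

```python
def lc_component(lc_mttr: float, lc_mtbf: float, lc_red: float,
                 a: float, b: float, c: float) -> float:
    """Weighted sum LCComp = a * LCMTTR + b * LCMTBF + c * LCRed."""
    if abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("weights a, b and c must sum to 1")
    return a * lc_mttr + b * lc_mtbf + c * lc_red


lc_mttr = 1.0 - 1.0 / 24.0   # MTTR of 1 hour on a 24-hour scale
lc_mtbf = 100.0 / 300.0      # MTBF of 100 days on a 300-day scale
lc_red = 0.7 + 0.2           # hot redundancy, procedures documented

result = lc_component(lc_mttr, lc_mtbf, lc_red, a=0.3, b=0.2, c=0.5)
print(round(result, 3))  # 0.804
```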

3.3.3 CFIA GRID FOR LOOSELY COUPLING ASSESSMENTS

Referring to the impact assessment method already described, the overall frame for incorporating all data and methods will be the Component Failure Impact Analysis (CFIA). The CFIA method can be freely extended with different kinds of variables showing failure modes, several reliability parameters, and operational capabilities and techniques. It extends the pure system view (hardware and software) of component failures to also include the processes, tools and people that support the systems.

FIGURE 11: EXTENDED CFIA GRID WITH DIRECT COUPLINGS

Once the CFIA grid has been built, components that have a large number of Xs are critical to many services, and their failure can result in a high impact. Equally, IT services that are used in many business applications, and therefore have a high count of Xs, are important and vulnerable to failure. Impact analysis using the CFIA can answer the questions “which business services depend on a particular component x” and “which components have an impact on a specific business service”.
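The two questions and the count of Xs can be illustrated with a very small grid model. The component and service names below are invented for illustration; a set entry stands for an X in the grid.

```python
# Hypothetical CFIA grid: component -> business services marked with an X.
cfia_grid = {
    "san-storage": {"Billing", "CRM", "Reporting"},
    "db-server": {"Billing", "CRM"},
    "web-server": {"CRM"},
}


def dependent_services(grid, component):
    """Which business services depend on the given component?"""
    return sorted(grid.get(component, set()))


def impacting_components(grid, service):
    """Which components have an impact on the given business service?"""
    return sorted(c for c, services in grid.items() if service in services)


def x_count_by_component(grid):
    """Components ordered by their number of Xs, most critical first."""
    return sorted(((c, len(s)) for c, s in grid.items()),
                  key=lambda item: (-item[1], item[0]))


print(dependent_services(cfia_grid, "db-server"))
print(impacting_components(cfia_grid, "CRM"))
print(x_count_by_component(cfia_grid))
```

The additional resilience columns discussed next would become further attributes per component in such a structure.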

In the grid we can show all data relevant for the Loosely Coupling assessment as columns to be filled in by the relevant system administrators, so that the Loosely Coupling Indicator can be calculated using, for example, the described formula, or simply determined by an expert assessment. The Resilience Index (or Loosely Coupling Indicator) can also be written directly into the grid as an additional column. The Tightly Coupling Index can be calculated with an appropriate formula as described, or alternatively assessed by the experts. The “failure mode and effects” column allows the attributes to be distinguished by failure mode, such as outage or slow response.

RTO/RPO targets are related to the business services and may therefore be an extended attribute of the business application in the grid, and can also be shown there on top of the business services.

As discussed, impact assessments are more precise when the coupling degree is considered together with the level of vagueness. When judging resilience, it is important to check not only whether a failover or recovery capability exists, but also the risk of whether these methods will succeed completely or result only in a partial or limited restoration. Recovery/failover is also often not completely tested in all real-life cases, as this is very costly and disturbs the business. We will therefore allow the experts to set a level of certainty next to the Loosely and Tightly Coupling Index. The degree of certainty could express anything like undefined risk, vagueness, limited or imprecise knowledge, unproven information or simple hesitancy to make a statement.

The rationale behind adding a certainty factor to the CFIA coupling degrees is the following. Even though fuzzy logic and the fuzzy methodologies proposed later can handle fuzzy data such as vagueness, limited or imprecise knowledge and unproven information, the fuzzy mathematical model itself is exact and fuzzy mathematics provides precise calculations. The advantage is that fuzzy logic can handle imprecise data, with the prerequisite that it needs to know about this imprecision. So the fuzziness of the assessment also needs to be described.

For instance, if someone defines a degree of Loosely Coupling of 0.75 for a component and this is simply a best guess, the derived impact assessment will also be a best guess. In our approach we therefore allow the expert to also add the certainty level of this 0.75 assessment, so the expert can assign a low certainty factor because it was only a best guess based on limited knowledge. Our method can then apply the 0.75 component resilience estimate in an adapted way to derive the business impact, also considering the fact that this input was based on a simple estimate without closer investigation. For impact assessments, this meta-information about a high vagueness should decrease the level of resilience capabilities assumed for a component.

As an example, let the recovery/failover be implemented but not completely tested, as testing is very costly and means an unacceptable disturbance to business. We can now assign a low certainty, and the impact assessment method can further adjust (decrease) the resilience factor, expressed as the loosely coupling index.

3.4 BI-POLAR COUPLING ASPECTS

A key principle of the impact assessment method proposed in the following is the idea of naturally envisaging positive and negative instances of the dependency relation and considering them simultaneously by pulling both strengths together.

Let us discuss this idea with a simple example. A complex system can to some degree be compared with a human organism. For instance, the likelihood that a person becomes ill when infected with influenza results from the combination of the viral attack and the stability and recovery ability of the infected individual's immune system. These two factors (viral attack and immune system) have opposite effects. When defining something like a "personal influenza disease risk", it does not make sense to consider only one aspect, usually the infection probability, in isolation. A high risk of infection combined with strong immune abilities will not have an impact on the health of the individual person (in IT terms, a business impact). With a strong immune condition, a "high personal risk of influenza disease" is not necessarily given, even if the person contracts the virus.

For a complex IT system this is quite similar: the risk of infection corresponds to the dependencies through interactions (which can result in an incident), and the opposing immune ability corresponds to the built-in system resilience capabilities. Both in combination determine the risk of a business impact in case an incident occurs somewhere in the system. The degree of independence is not always identical with the logical negation of dependency, and the measurements for independence may be totally different from those for dependencies. So even though they are complementary aspects, dependence and resilience are only loosely complementarily coupled, similar to the immune abilities and the infection risk in the influenza example.

The intelligence in any system model with regard to the described bi-polar aspect lies in the modelling of the indirect dependencies and interactions. This is because an incident can interfere indirectly with other components in several ways, which mainly results from the bi-polar aspect and the combination of contrary forces. To show this, we return to the influenza example. Two types of virus infection scenarios are possible, which mainly influence the impact and determine whether an epidemic occurs. Either the virus can be transmitted to third parties only when the person himself gets sick, meaning the virus is stronger than the immune system, which leads to a smaller epidemic risk. Or even healthy but infected persons can transmit the virus to others, in which case the epidemic risk increases significantly. So a high / low influenza contagion risk leads to a pessimistic (worst case) / optimistic (best case) indirect impact assessment.

IT systems try to implement the first scenario: the resilience capabilities of each component should pro-actively limit the interference and impact of the incident on related components or business services. However, the type of relation between IT components, together with the required MTTR timeframes, may lead to different results in the way components impact others, and these are hard to predict exactly.

Within the proposed IFCFIA approach, after the definition of the direct impacts, the indirect coupling between components or services can be calculated from the degrees of direct coupling. We therefore propose that impact calculations should support different types of interdependencies, which may involve the use of different probabilistic variants of the logical operations in the calculation of indirect impacts. This allows modelling the way the incident impact is transferred throughout the larger and complex system (similar to a virus epidemic). Depending on which operations are applied when combining the dependency levels, the indirect impacts may be greater or smaller. Therefore three basic types of impact analysis are introduced later: worst case (pessimistic), best case (optimistic) and moderate impact analyses.

Impacts are complex, which creates uncertainty. They involve a multitude of effects that cannot be easily assessed and may involve complex causalities, non-linear relationships and interactions between effects. This may make it difficult to determine exactly what may happen. Impact assessments are therefore fuzzy in nature, and we also need to consider a level of vagueness, uncertainty and limited or imprecise knowledge. As our bi-polar approach envisages positive and negative instances independently, each aspect may have a different and independent level of vagueness.

Later in this concept, a real-world datacentre use case will demonstrate that the loosely and tightly coupling concepts can be implemented via independent methods, using the best suited approaches for determining the degree of tightly coupling (interdependency) and the degree of loosely coupling (resilience).

So the simultaneous and free play of the two contrary forces, dependence and independence, positive and negative aspects together, will define the overall system behaviour and the probable impact on the coupled business service. Considering and judging positive and negative impact aspects in isolation will not provide reliable assessments to the business. Our proposal therefore recommends merging both aspects and contrary relations by combinatorial operations into one integrated result set, to allow an integrated two-sided inference model and intuitionistic reasoning.

This leads to the question of whether the discussed traditional impact analysis methods can be applied using a bi-polar model. In general, the discussed traditional methods already cover both aspects. Fault Tree Analysis (FTA), as the term fault tree indicates, works in the "failure space" and looks at system failure combinations. The FTA method thus covers the negative aspect: the risk of interdependencies and negative impacts on failure. On the other side, the described Component Failure Impact Analysis (CFIA) approach is primarily focused on the mitigation, restoration and resilience capabilities, which represent the positive aspect of independence.

3.5 HYPOTHESIS

The discussion of the bi-polar coupling aspects leads to the following two-fold hypotheses put forward in this paper:

Independence is not always identical with the logical negation of dependency, and the measurements for independence may be totally different from those for dependencies. On the one side we have the interdependencies resulting from interactions between related system components; on the other side each component has a set of resilience capabilities ensuring independence, by allowing a component to function even if another component fails. The best results for impact assessments can be achieved by naturally and independently envisaging positive and negative instances of the dependency relationship, using the best suited measurements and methods to assess the strength of tightly and loosely coupling relations.

Even though tightly and loosely coupling relations are independent, we believe that only the simultaneous consideration of both the positive and the negative aspects can define the overall system behavior and the probable impact on the dependent business service. Considering and judging positive and negative impact aspects in isolation will not lead to real-world results and reliable judgments for the business. The application of two-sided (intuitionistic) reasoning, by combining both aspects into inference rules and logics, allows a much more granular impact prediction than traditional impact analysis can provide.

Impact relationships are complex in nature, which creates uncertainty, so adjusting the result by also considering the vagueness is a key element for more exact impact assessments.

The next chapter describes a fuzzy mathematical model for implementing the proposed intuitionistic (bi-polar) concept.

4. INTUITIONISTIC FUZZY SETS

4.1 MOTIVATION ON INTUITIONISTIC FUZZY SETS

The Intuitionistic Fuzzy Set (IFS) was proposed by [Atanassov 86,99]. It is characterized by a membership function and a non-membership function and is a generalization of Zadeh’s fuzzy set [Zadeh, 65,94], whose basic component is only a membership function. Over the last decades, IFS have been applied to many different fields, such as decision making, logic programming, medical diagnosis, pattern recognition, machine learning and market prediction. For example, [Szmidt and Kacprzyk 02,04] considered the use of Atanassov IFSs for building soft decision-making models, using the membership degree and non-membership degree to express a decision-maker’s hesitation.

A bibliography of Intuitionistic Fuzzy Sets and their applications can be found at

http://www.clbme.bas.bg/projects/gnifs/ifs/publ.html

As an example, the decision to buy a specific product can be modeled using the Atanassov IFS. The membership function expresses the degree to which a given good is preferred by the customer, while the non-membership function indicates the degree to which the given product is not preferred. Sometimes it seems more natural to describe imprecise and uncertain opinions not only by membership functions, because in some situations it is easier to describe our negative feelings than our positive attitude. Customers also often need to compare preferences expressed by means of orderings which admit uncertainty due to imprecision, vagueness and hesitance. In this case the Atanassov IFS gives us a natural tool for modeling such orderings.

Another application of IFS is the following situation: A human being who expresses the degree of membership of a given element in a Fuzzy Set (FS) very often does not express the corresponding degree of non-membership as the complement to 1. This reflects a well-known psychological fact that the linguistic negation is not always identified with the logical negation.

A similar approach can be chosen and adapted for our impact assessments of complex systems, reflecting the fact that in practice independence is not always identical with the logical negation of dependency and the measurements for independence may be totally different from those for dependencies. On the one side we have the interdependencies resulting from interactions between related system components; on the other side each component has a set of resilience capabilities ensuring independence, by allowing a component to function even if another component fails. Both the positive and the negative aspects together will define the overall system behavior and the probable impact on the business.


4.2 IFS DEFINITION AND BASIC OPERATIONS

Let E be a fixed universe and let A be a subset of E. Let us construct the set

A* = { <x, μA(x), νA(x)> | x ∈ E }, where 0 ≤ μA(x) + νA(x) ≤ 1.

Atanassov called the set A* an Intuitionistic Fuzzy Set (IFS). Every element x has a degree of membership μA(x), with μA: E → [0,1], and a degree of non-membership νA(x), with νA: E → [0,1].

Intuitionistic fuzzy sets, with independent membership and non-membership degrees, are a generalization of fuzzy sets. Unlike in classical fuzzy sets, where νA(x) = 1 - μA(x), the values of μA(x) and νA(x) are independent of each other, apart from the constraint μA(x) + νA(x) ≤ 1.

For each IFS A in E, we call π(x) = 1 - μA(x) - νA(x) the intuitionistic index of x in A. It is the hesitancy degree of x to A and corresponds to the degree of uncertainty, indeterminacy, limited knowledge, etc.
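The definition can be made concrete with a small value type. This is a minimal sketch; the class name IFValue and its attribute names are hypothetical and only mirror the symbols μ, ν and π used above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IFValue:
    """An intuitionistic fuzzy value <mu, nu> with mu + nu <= 1."""
    mu: float  # degree of membership
    nu: float  # degree of non-membership

    def __post_init__(self):
        if not (0.0 <= self.mu <= 1.0 and 0.0 <= self.nu <= 1.0):
            raise ValueError("mu and nu must lie in [0, 1]")
        if self.mu + self.nu > 1.0 + 1e-9:
            raise ValueError("mu + nu must not exceed 1")

    @property
    def hesitancy(self) -> float:
        """Intuitionistic index pi = 1 - mu - nu."""
        return 1.0 - self.mu - self.nu


coupling = IFValue(mu=0.6, nu=0.3)
print(coupling.hesitancy)  # remaining share, roughly 0.1, is hesitancy
```

A classical fuzzy value is the special case in which nu equals 1 - mu, so that the hesitancy is zero.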

Let a and b be intuitionistic fuzzy logical statements with the estimations < μa, νa > and < μb, νb > respectively, where μa is the degree of truth and νa is the degree of falsity of statement a (and likewise μb and νb for statement b).

A variety of operations over IFS have been defined. Below are shown several basic IFS operations for the IFS A,B. [Atanassov 08]

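The logical operations conjunction (∧) and disjunction (∨) can be defined in two variants (classical and probabilistic). The possibility of both a classical and a probabilistic interpretation of conjunction (∧) and disjunction (∨) will be a key concept in the indirect dependency calculations proposed next.

Classical variant:

V(p∧q) = <min(μ(p),μ(q)), max(ν(p),ν(q))>

V(p∨q) = <max(μ(p),μ(q)), min(ν(p),ν(q))>

Probabilistic variant:

V(p∧q) = <μ(p)*μ(q), ν(p)+ν(q)-ν(p)*ν(q)>

V(p∨q) = <μ(p)+μ(q)-μ(p)*μ(q), ν(p)*ν(q)>

where V(p) = <μ(p), ν(p)> denotes the intuitionistic fuzzy estimation of statement p.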

The following IFS operations are further proposed [Kolev and Ivanov, 2009]: worst case (pessimistic), best case (optimistic), moderate (medium) and classical fuzzy analyses. These include operations expressed by means of intuitionistic fuzzy values carrying probabilistic information.

Worst case

V(p∧q) = <min(μ(p),μ(q)), max(ν(p),ν(q))>

V(a∨b) = <μ(a)+μ(b)-μ(a)*μ(b), ν(a)*ν(b)>

Best case

V(p∧q) = <μ(p)*μ(q), ν(p)+ν(q)-ν(p)*ν(q)>

V(a∨b) = <max(μ(a),μ(b)), min(ν(a),ν(b))>

Moderate

V(p∧q) = <μ(p)*μ(q), ν(p)+ν(q)-ν(p)*ν(q)>

V(a∨b) = <μ(a)+μ(b)-μ(a)*μ(b), ν(a)*ν(b)>

Classical

V(p∧q) = <min(μ(p),μ(q)), max(ν(p),ν(q))>

V(a∨b) = <max(μ(a),μ(b)), min(ν(a),ν(b))>

TABLE 3: COMBINED CLASSICAL AND PROBABILISTIC LOGICAL IFS OPERATIONS

Depending on which operations are applied (classical and/or probabilistic) by combining the membership levels, the results will be greater or smaller.
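The four analysis types of Table 3 can be sketched as pairs of a conjunction and a disjunction. This is an illustrative reading of the table, not a reference implementation: values are plain (μ, ν) tuples, the function names are hypothetical, and the classical disjunction is taken in its standard max/min form.

```python
def and_classical(p, q):
    return (min(p[0], q[0]), max(p[1], q[1]))


def or_classical(p, q):
    return (max(p[0], q[0]), min(p[1], q[1]))


def and_probabilistic(p, q):
    return (p[0] * q[0], p[1] + q[1] - p[1] * q[1])


def or_probabilistic(p, q):
    return (p[0] + q[0] - p[0] * q[0], p[1] * q[1])


# Analysis type -> (conjunction, disjunction), following Table 3.
ANALYSES = {
    "worst": (and_classical, or_probabilistic),
    "best": (and_probabilistic, or_classical),
    "moderate": (and_probabilistic, or_probabilistic),
    "classical": (and_classical, or_classical),
}

p, q = (0.6, 0.3), (0.5, 0.4)  # sample <mu, nu> values (assumed)
for name, (conj, disj) in ANALYSES.items():
    print(name, conj(p, q), disj(p, q))
```

For the same inputs, the worst case yields the highest membership (dependency) degree in both operations and the best case the lowest, which is what makes the choice of operations usable as a pessimistic or optimistic attitude.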

To summarize, the membership degree and non-membership degree of an IFS are themselves crisp values: μ is the exact lower boundary of all estimates for the belonging of an element x to the IFS A, and ν is the exact upper boundary of all estimates that the element x does not belong to the IFS A. These membership functions are loosely coupled, with the only constraint that the sum of the two degrees does not exceed one.

In other words, an intuitionistic fuzzy set is a generalization of a fuzzy set which introduces another degree of freedom into the set description: the independent judgment of positive and negative aspects. This two-sided (intuitionistic) view, including the possibility to formally represent a third aspect of imperfect knowledge, could be used to describe many real-world problems in a more adequate way, by specifying both advantages and disadvantages, pros and cons, for each variable in the model together with the vagueness of these statements.

4.3 APPLYING IFS TO COUPLING- AND IMPACT ASSESSMENTS

The following lists the advantages and key capabilities of applying IFS to coupling and impact assessments:

Loosely and tightly coupling concepts can be naturally approached by separately envisaging positive and negative instances. Independent methods for determining the degree of tightly and loosely coupling can be applied, using the best suited approaches.

IFS describe, besides the degree of truth and the degree of falsity, also the uncertainty of a statement. As impact assessments are fuzzy in nature, we need to consider a level of vagueness, uncertainty and limited or imprecise knowledge.

IFS allow two loosely related, complementary aspects to be integrated in one single IFS as a combined result set, pulling together the complementary dependency aspects.

A variety of operations over IFS have been defined and well evaluated. After the definition of the direct impacts, the indirect coupling between components or services can be calculated from the degrees of direct coupling using the appropriate and best suited IFS operations.

Different types of interdependencies may involve the use of different probabilistic variants of the logical operations in the calculation of indirect impacts. IFS can implement the logical operations conjunction (∧) and disjunction (∨) in two variants, classical and probabilistic.

Applying different IFS operations can be leveraged as a method to express attitudes in impact assessments. Therefore three basic types of impact analysis are introduced later: worst case (pessimistic), best case (optimistic) and moderate impact analyses. This allows stakeholders to consider their individual concerns and requirements, which leads to subjective impact assessments (viewpoint concept).

IFS, as extensions of classical fuzzy sets, fully retain the advantages of fuzzy mathematical models and the existing related work. This is leveraged in our project to establish soft-dependency models for SLAs. A simple binary and sharp assessment of whether a component “is coupled” or is “not coupled” to a business service will never be precise enough for granular impact assessments on business service levels.

IFS allow the application of two-sided (intuitionistic) fuzzy reasoning. Using two-sided fuzzy propositions, complex system behaviors can be closely simulated by considering the perception of both (somewhat opposite) sides of the impact subject matter simultaneously. IFS-based rules are developed later for several scenarios.

Finally, different semantics of intuitionistic fuzzy dependencies can be modeled. This is needed for soft dependency relationships that also allow the more complex and granular consideration of a degraded mode of operation or the concept of the possibility of a failure.

Therefore Atanassov IFS give us a great natural tool for modeling coupling and impact assessments which can also be leveraged and integrated by IFS extension of classical reliability engineering methods.

4.4 SEMANTICS OF INTUITIONISTIC FUZZY DEPENDENCIES

The intuitionistic fuzzy dependencies between components may have different kinds of semantics depending on the type of information they represent.

A probabilistic coupling dependency between KPI A and KPI B means “the probability that B is not available in case A is not available”.

An ordinary fuzzy coupling dependency between B and A means that “if B is not available, then A is partially not available”.

So a component KQI as an aggregated KPI value (e.g. SLA compliance of the service, measured as the fraction of time during which all SLA fulfillment indicators of the service lie in green SLACS) can be interpreted with two semantic meanings:

as a measurement of a probable degradation level for component operation

as a probability that a component is compliant to specifications or not

We think that in practice impacts can be best expressed by considering both semantics, which is also aligned with the way human dependency assessments work. Stating verbally that a system or service is partly available, for instance operationally degraded by 10% but considered available with 80% probability, gives a good measure of the expected usability and operational status. This allows the notion of a service that is still usable with some sort of degradation (functional and probabilistic), where the first indicator refers to the level of degradation and the second indicates the probability that it occurs.

4.5 BRIEF ASPECTS ON INTUITIONISTIC FUZZY REASONING

The membership and non-membership functions of property variables can be designed individually for both coupling aspects (membership/tightly and non-membership/loosely). Subsequently, the inference rules of the system can be constructed with algorithms for reasoning and defuzzification for both aspects separately or in combination.

IFS allow the application of two-sided (intuitionistic) fuzzy reasoning by combining both aspects, including the vagueness of the statement, into inference rules and logics. Using two-sided fuzzy logic, human decision-making or even complex system behavior can be closely simulated by considering the perception of both (somewhat opposite) sides of the subject matter simultaneously. Two-sided fuzzy if-then rules can be constructed using different interpretations of fuzzy implications, modeled by applying the different IFS operations and a bi-polar interpretation of the result sets.

5. IFCFIA SOLUTION APPROACH

5.1 IFCFIA – OVERVIEW OF THE METHOD

5.1.1 FROM CFIA TO IFCFIA

Component Failure Impact Analysis (CFIA) provides a systematic approach to assist management in predicting and evaluating the impact of component failures on IT systems. The discovered physical dependencies (found using ADDM tools and extracted in a structured data format) can be extended in the CFIA with additional logical dependencies. This extends the pure system view (hardware and software) of component failures to also include the processes, tools and people that support the systems. It provides a relevant assessment of the physical components of the service, and also examines the systems management framework, the supporting tools and the skills within the delivery organization. This provides the starting point for applying our intuitionistic fuzzy approach and techniques to analyse and describe the impact of failures more granularly.

As CFIA can be freely extended with different kinds of variables, we propose extending the classical CFIA to the IFCFIA (Intuitionistic Fuzzy Component Failure Impact Analysis) by adding the direct and indirect fuzzy dependencies to the grid. The initial CFIA grid is set up using auto-discovery tools (ADDM), which provide the basis for the subsequent higher-level, logical indirect dependency assessments.

The indirect impacts can be determined by a fuzzy Fault Tree Analysis (FTA), which is described in detail in this chapter. The fuzzy FTA adds a degree of dependency (coupling) to the impacted business service, expressed as an Intuitionistic Fuzzy Set (IFS), and is incorporated into the standard CFIA matrix creation process.

Whereas classical FTA is binary (fail or success) and cannot address soft dependency problems, FTA using Intuitionistic Fuzzy Sets (IFS) can describe the impact at a finer granularity (degraded modes or probability) by considering two aspects, dependence and resilience. The coupling IFS allows the application of two-sided (intuitionistic) fuzzy reasoning by combining the positive and negative relationships and the level of vagueness to assess the impact of a lower-level incident on a business service. When an incident is reported, a fuzzy FTA also enables a more exact identification of the dependent components that are likely to cause the failure of the IT business services.

5.1.2 IFCFIA SEVEN STEP APPROACH

As an overview, creating an IFCFIA grid requires the following seven steps. The process up to IFCFIA Step 3, the creation of the CFIA grid, has already been discussed.


Step 1: Auto-discovery by ADDM tools

All infrastructure component items and technical dependencies of a defined scope are auto-discovered using ADDM tools. Automatically discovering the interdependencies among applications and underlying systems provides trust that the discovered information is real and minimizes the effort IT organizations spend on assimilating the information. The discovered components and their relations can be extracted by ADDM tools in a structured data format, e.g. XML, for further automated processing.

Step 2: Defining the Business Service

The in-scope discovered component items are grouped to form the business applications, as the business service is the top level in the component hierarchy. A business service groups the different kinds of IT resources into a logical group that acts together as one unit to provide the service. Business services can contain any number of lower-level resources. The result of step 2 is a list, grouped by business service, of all directly and indirectly related components.

Step 3: Creating the CFIA Grid

A CFIA grid is created as described in chapter 3.3, showing the auto-discovered components on one axis and, on the other axis, the IT business services that depend on these components. In the matrix we can list all data relevant for the loosely coupling assessment, including the business RTO/RPO targets.

The grid is complemented with the calculated or assessed coupling degrees for loosely and tightly coupling. The tightly coupling index is defined as an inter-modular coupling metric, which calculates the coupling between each pair of directly related components. For loosely coupling an intrinsic coupling metric is chosen, as it refers to the resilience capabilities of the individual component. The CFIA also shows the assessed level of certainty next to the loosely and tightly coupling indexes.

Step 4: Defining the direct impact as IFS

As the next step, the two independent loosely and tightly coupling indexes are combined into an integrated Intuitionistic Fuzzy Set (IFS). This requires the two coupling indexes A and B to be normalized and combined by IFS operations (we choose the fuzzy operation A@¬B). The result of step 4 is a combined IFS describing the (inter-modular) coupling dependency between two components. We call this the intuitionistic fuzzy probabilistic direct impact between two components. The resulting direct coupling index can be added to the CFIA grid as an additional column.


Step 5: Calculating the indirect couplings as IFS

After the direct couplings have been defined as inter-modular IFS, the indirect coupling between components or services can be calculated from the degrees of direct coupling. Different probabilistic variants of the logical operations can be used in the calculation of the indirect impacts, which allows modelling the way the incident impact is propagated through the complex system. Depending on the operation logic applied to the IFS, the indirect impacts may be greater or smaller. Therefore several basic types of impact analysis are introduced: worst case (pessimistic), best case (optimistic) and moderate impact analyses. The result of step 5 is the coupling index of each component to the front-end business service, represented as an indirect coupling IFS.

Step 6 (optional): Extending the Business View

The IFCFIA may optionally be extended with additional logical dependencies and business impact information. To operate IT systems we also need to know about dependencies on, e.g., IT users and roles, organizational elements, supporting processes or maintenance services. These can be expressed with a coupling relationship of the form "is coupled to": a procedure, a Service Level Agreement (SLA) or even technical or user documentation. Business impact information, such as the hourly cost of failure or the number of impacted users, can also be added to the business service. Thus, when a component is unavailable, the number of impacted users is known and an impact calculation based on the cost of unavailability can be performed.

Step 7 (optional): Applying Intuitionistic Fuzzy Reasoning

As the last step, the IFCFIA allows the application of two-sided (intuitionistic) fuzzy reasoning by combining both aspects, including the vagueness of the fact, into inference rules and logic. Using two-sided fuzzy logic, the complex system behaviour can be closely analysed by considering both contrary coupling aspects simultaneously. Two-sided fuzzy if-then rules can consider different interpretations of fuzzy implications by applying the best-suited bi-polar IFR operations and interpretations.

Result:

The IFCFIA is now ready for fuzzy Business Impact Analysis (BIA) and for the evaluation of incidents and events on infrastructure functions as Root Cause Analysis (RCA). It allows extracting the most likely root cause, or other reasoning, by combining membership, non-membership and the vagueness 1 - μA(x) - νA(x) into inference rules. IFCFIA provides a comprehensive view of the relations and dependencies of a complex system by listing its components with their direct and indirect couplings to the impacted business functions, and can now be leveraged for several business-critical scenarios.


5.2 DESCRIPTION STEP 1-3: CREATING THE CFIA GRID

The following chapters describe the IFCFIA approach in detail. As the process up to the CFIA creation has already been discussed, we summarize the first three steps in one section.

5.2.1 BUSINESS APPLICATION MAPS BASED ON AUTO-DISCOVERY

During IFCFIA Step 1 all infrastructure component items and technical dependencies of a defined scope are auto-discovered using ADDM tools. Automatically discovering the interdependencies among applications and underlying systems provides trust that the discovered information is real and minimizes the effort IT organizations spend on assimilating the information. The discovered components and their relations are extracted by ADDM tools in a structured data format such as XML. When IBM TADDM is used as the auto-discovery tool, the provided data is structured according to the CDM data model.

The auto-discovery tools can also provide the indirectly linked components, which have a relation to the directly dependent items. In our use case we use the TADDM component affinity report for this purpose. This report extracts all related components that have a dependency (IP dependency, transactional dependency or configuration dependency) on the components directly related to the in-scope business services.

The in-scope discovered component items are grouped to form the business applications, as the business service is the top level in the component hierarchy. A business service groups the different kinds of IT resources into a logical group that acts together as one unit to provide the service. Business services can contain any number of lower-level resources. The result of step two is a list, grouped by business service, of all directly related components. This is shown in detail in the following use case for the Logistics Management application.

This grouping step implicitly creates the fault tree of the business service by chaining all directly and indirectly linked components. If an incident occurs for a business application, a list of components that may be the root cause of the incident can now be identified. A fault tree analysis (FTA) can therefore already be performed after the basic CFIA creation.

5.2.2 CREATING THE CFIA GRID WITH COUPLINGS

Having auto-discovered, as part of the CMDB discovery, the in-scope infrastructure components, their relationships and the configurations to be assessed, the next step is to create a grid with the components on one axis and the IT services that depend on them on the other. This matrix is called the CFIA grid (as described in chapter 3.3).


In the grid we can show all data relevant for the loosely coupling assessment, including the business RTO/RPO targets. The tightly coupling index can be determined with an appropriate formula (e.g. Dhama's metric) or, alternatively, assessed by experts. It is defined as an inter-modular coupling metric, which calculates the coupling between each pair of directly related components. For loosely coupling an intrinsic coupling metric is chosen, as it refers to the resilience capabilities of the individual component. The CFIA also shows the level of certainty next to each coupling index. The certainty can express, for example, undefined risk, vagueness, limited or imprecise knowledge, unproven information or simple hesitancy to make a statement. Therefore not only the technical capabilities are considered, but also the risk of whether these capabilities will succeed completely or lead to only a limited restoration.

Once the CFIA grid has been created, components that have a large number of Xs are critical to many services, and their failure can result in a high impact. Equally, IT services that are used in many business applications, and therefore have a high count of Xs, are important and vulnerable to failure. The impact analysis using the CFIA can answer the questions "which business services depend on a particular component x?" and "which components have an impact on a specific business service?".
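The two questions above can be illustrated with a minimal sketch. It assumes the grid is held as a plain mapping from business service to the set of components marked with an X; apart from the Logistics Management application, all service and component names are hypothetical, as are the helper function names.

```python
from collections import defaultdict

# Hypothetical CFIA grid: each business service maps to the set of
# components it depends on (one "X" per service/component pair).
cfia_grid = {
    "Logistics Management": {"db-server", "app-server", "san-storage"},
    "Order Entry": {"app-server", "web-server"},
    "Reporting": {"db-server", "san-storage"},
}


def services_depending_on(grid, component):
    """Which business services depend on a particular component?"""
    return sorted(s for s, comps in grid.items() if component in comps)


def components_impacting(grid, service):
    """Which components have an impact on a specific business service?"""
    return sorted(grid.get(service, set()))


def x_count_per_component(grid):
    """Count the Xs per component; a high count marks a critical component."""
    counts = defaultdict(int)
    for comps in grid.values():
        for comp in comps:
            counts[comp] += 1
    return dict(counts)


print(services_depending_on(cfia_grid, "db-server"))
print(components_impacting(cfia_grid, "Order Entry"))
print(x_count_per_component(cfia_grid))
```

Such a binary grid only records whether a relation exists, which is why the following steps attach coupling degrees to each X.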

Although all methods rely on a number of simplifying assumptions that may not hold true for every system and situation, our further approach based on the CFIA data needs input data and information that are as precise as possible, including the meta-information of how certain an assessed coupling degree can be considered. This requirement is evident when considering soft dependency relationships, which also allow the more complex and granular consideration of a degraded mode of operation or of the probability (global view) or possibility (local view) of a failure. A simple assessment of whether a component is "related" or "not related" to a business service will never be precise enough for granular impact assessments.

Our approach starts with an auto-discovery to ensure a basic level of trust that the discovered information is not hypothetical, but real. This information is mapped and extended as data in the filled CFIA grid. The subsequent work, which extends the discovered relations with logical and derived dependencies, should continue to follow this principle of real, trusted and well-evaluated information.


5.3 DESCRIPTION STEP 4: CREATING THE DIRECT COUPLING INDEX

5.3.1 PULLING TOGETHER THE LEVEL OF LOOSELY AND TIGHTLY COUPLING

The CFIA grid has been built up to step 4 and shows the loosely and tightly coupling index for each component with the associated level of certainty. The goal is to create a single IFS in which the degree of membership is defined as the degree of tightly coupling and the degree of non-membership as the degree of loosely coupling. Tightly coupling refers to the degree of interdependency, and loosely coupling concerns the resilience of a component, which is a form of non-dependency. Although they represent contrary aspects and are adversely related, they are not the logical negation of each other. Having evaluated the two couplings individually, we therefore want to combine them into a single intuitionistic fuzzy set (IFS). As IFS are defined with independent membership and non-membership functions, coupling relations can be expressed as IFS in which each element fulfils the tightly coupling criteria to some extent μ(x) and the loosely coupling criteria to some extent ν(x), subject to μA(x) + νA(x) ≤ 1.

When performing mathematical operations, the result set must always fulfil the IFS criteria; we can therefore operate on the tightly coupling index only with regard to the loosely coupling index and vice versa. Merely enforcing μA(x) + νA(x) ≤ 1 by an isolated normalization of membership and non-membership does not provide meaningful results. Assume, for example, a tightly coupling degree of μA(x) = 0.8 and a loosely coupling degree of μB(x) = 0.4. Any isolated transformation, such as reducing both by multiplying with a common factor < 1, will provide useless results. The underlying reason is that there is an inverse relationship between the two couplings (tightly and loosely): a high loosely coupling factor (resilience capability) implicitly reduces the degree of tightly coupling and, the other way round, a strong coupling relation automatically decreases the efficiency of the component's resilience capabilities.

Therefore we decided to consider the fuzzy complement of tightly coupling together with the loosely coupling degree and, vice versa, the fuzzy complement of loosely coupling together with the tightly coupling degree. Using this approach the real impact can be closely simulated by considering the perception of both sides of the coupling subject matter simultaneously. This makes sense especially because loosely coupling measurements need to consider two major aspects: first, the level to which the dependent component can complete the delivery of its service even if the coupled component fails or is degraded in operation and, second, the capability to mitigate the impact on the dependent service in time when an associated component does not comply with its specifications. The first aspect is expressed as the complement of the tightly coupling index and the second is the loosely coupling index itself. Combining the two allows both aspects to define the IFS representation of the coupling.

As a first, classical approach we use the strict Zadeh fuzzy negation C(μ(x)) = 1 − μ(x) = ν(x) to derive the non-membership degree from the membership value.

FIGURE 12: ZADEH FUZZY COMPLEMENT

The combined IFS D* can then be created using the fuzzy operation A@¬B, which averages the membership of A with the non-membership of B and vice versa:

μdircpl(x) = (μA(x) + νB(x)) / 2 and νdircpl(x) = (νA(x) + μB(x)) / 2

where A is the fuzzy set for tightly coupling with membership μA(x) and non-membership νA(x), and B is the fuzzy set for loosely coupling with membership μB(x) and non-membership νB(x). As the measurement of the direct dependency we use μcombined(x) = μdircpl(x) and νcombined(x) = νdircpl(x) as the degrees of the direct coupling relationship.

In our example, with a tightly coupling degree of μA(x) = 0.8 and a loosely coupling degree of μB(x) = 0.4, the Zadeh fuzzy complement yields νA(x) = 0.2 and νB(x) = 0.6.

Applying the operation A@¬B to these values yields the combined coupling IFS D* = (μdircpl(x), νdircpl(x)) = ((0.8 + 0.6) / 2, (0.2 + 0.4) / 2) = (0.7, 0.3).
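The worked example can be reproduced with a minimal sketch. It assumes that A@¬B is the arithmetic mean of the membership of A with the non-membership of B (and vice versa), which is the reading that matches the degrees (0.7, 0.3) stated above; the function names are illustrative only.

```python
def zadeh_complement(mu: float) -> float:
    """Strict Zadeh negation: non-membership derived as 1 - membership."""
    return 1.0 - mu


def combine_direct_coupling(mu_a, nu_a, mu_b, nu_b):
    """A @ not-B: average tightly membership with loosely non-membership.

    A = (mu_a, nu_a) is the tightly coupling set,
    B = (mu_b, nu_b) is the loosely coupling set.
    """
    mu_dircpl = (mu_a + nu_b) / 2.0
    nu_dircpl = (nu_a + mu_b) / 2.0
    return mu_dircpl, nu_dircpl


tight = 0.8   # tightly coupling degree mu_A
loose = 0.4   # loosely coupling degree mu_B

direct = combine_direct_coupling(
    tight, zadeh_complement(tight),
    loose, zadeh_complement(loose),
)
# Rounded for display; floating point may differ in the last digits.
print(tuple(round(v, 4) for v in direct))
```

With the Zadeh complement the two resulting degrees always add up to 1, so no vagueness is carried in this first variant.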

We may also use weighting factors a and b that normalize the formula. This also allows defining specific weights for the loosely and tightly coupling aspects, in case one is seen as more important than the other. In that case a + b = 1 must always hold:

μdircpl(x)= a * μA(x) + b * νB(x) and νdircpl(x) = a * νA(x) + b * μB(x)

Alternatively, other fuzzy operations may be chosen, provided they satisfy the necessary requirement for the combined direct coupling IFS D* that μdircpl(x) + νdircpl(x) ≤ 1. Examples are the operators (max(μA(x),νB(x)),min(μB(x),νA(x))) and (min(μA(x),νB(x)),max(μB(x),νA(x))).

The operator (max(tightly),min(loosely)) prefers the highest (most dominant) coupling assessment among the membership of tightly coupling and the non-membership of loosely coupling, together with the lowest degree of the resilience and independency assessment.

D*(MaxMin)=(max(μA(x),νB(x)),min(μB(x),νA(x)))=(max(0.8,0.6),min(0.4,0.2)) = (0.8,0.2)

D*(MinMax)=(min(μA(x),νB(x)),max(μB(x),νA(x)))=(min(0.8,0.6),max(0.4,0.2)) = (0.6,0.4)
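The alternative operators can be compared side by side in a short sketch. The MinMax variant below takes the maximum of μB and νA for the non-membership, following the numeric evaluation (max(0.4, 0.2)) given above; function and parameter names are illustrative assumptions.

```python
def weighted(mu_a, nu_a, mu_b, nu_b, a=0.5):
    """Weighted combination with weights a and b = 1 - a, so a + b = 1."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("weight a must lie in [0, 1]")
    b = 1.0 - a
    return a * mu_a + b * nu_b, a * nu_a + b * mu_b


def max_min(mu_a, nu_a, mu_b, nu_b):
    """Prefer the most dominant coupling and the weakest resilience."""
    return max(mu_a, nu_b), min(mu_b, nu_a)


def min_max(mu_a, nu_a, mu_b, nu_b):
    """Prefer the weakest coupling and the most dominant resilience."""
    return min(mu_a, nu_b), max(mu_b, nu_a)


# Example values from the text: A = (0.8, 0.2), B = (0.4, 0.6)
args = (0.8, 0.2, 0.4, 0.6)
print("weighted a=0.5:", weighted(*args))
print("weighted a=0.8:", weighted(*args, a=0.8))
print("max-min:", max_min(*args))
print("min-max:", min_max(*args))
```

A weight a above 0.5 emphasizes the tightly coupling assessment, a weight below 0.5 the loosely coupling assessment.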

As we have used the Zadeh complement function νA(x) = 1 - μA(x) and νB(x) = 1 - μB(x), the proposed operations yield a combined direct coupling IFS D* that is a classical (Zadeh) fuzzy set, which fulfils μdircpl(x) + νdircpl(x) = 1.

What we have not yet considered is the certainty measurement of the couplings. This can now be included by an optional extension of the method, described in the following chapter.


5.3.2 DEFINING THE VAGUENESS

Motivation

Fuzzy logic can handle fuzzy data such as vagueness, limited or imprecise knowledge and unproven information, but the fuzzy mathematical model itself is exact, and fuzzy mathematics needs precise input to perform the impact calculations. In our approach we therefore allow the expert to add a certainty level to the coupling assessments, so that the expert can assign a low certainty factor if an assessment was only a best guess based on limited knowledge. Our method can then apply the coupling degree in an adapted way to derive the business impact, taking into account that this input was a simple best guess made without deeper investigation.

Although all methods rely on a number of simplifying assumptions that may not always hold true, our IFCFIA approach needs data and information that are as precise as possible, including how certain an assessed coupling degree can be considered. This requirement is evident for granular dependency relationships, which also allow the more complex and granular consideration of impacts such as a degraded mode of operation or the possibility of a failure of a business service. Atanassov IFS give us a natural tool for modelling vagueness or imperfect knowledge. For each IFS A in X, we call π(x) = 1 - μA(x) - νA(x) the intuitionistic index of x in A, which can be considered as the degree of uncertainty, indeterminacy, limited knowledge etc. IFS are a generalization of fuzzy sets. Unlike in classical fuzzy sets, the values of μA(x) and νA(x) are independent of each other (subject to μA(x) + νA(x) ≤ 1) and vary with the degree of vagueness, which is defined as the intuitionistic index π.

We continue with the approach of combining the fuzzy complement of tightly coupling with the loosely coupling degree and, vice versa, the fuzzy complement of loosely coupling with the tightly coupling degree, in order to simulate the impact by considering the perception of both sides of the couplings simultaneously. The difference is that we now allow weaker forms of complements than the strict Zadeh complement, for which the intuitionistic degree (level of vagueness) π(x) = 1 - μA(x) - νA(x) is zero by definition.

Fuzzy Complements

In general a fuzzy complement c is a mapping c: [0,1] → [0,1] that assigns a value to each membership value. The value c(μA(x)) is interpreted not only as the degree to which x belongs to the fuzzy set cA (the membership complement set), but also as the degree to which x does not belong to the fuzzy set A (non-membership). This idea can be used to derive a meaningful non-membership function for our coupling IFS with complements other than the Zadeh complement.

For the fuzzy complement c mappings several axiomatic requirements are defined:

c(0) = 1 and c(1) = 0 (boundary condition)

for all a, b ∈ [0, 1]: if a <= b, then c(a) >= c(b) (monotonicity)

c is a continuous function

c is involutive, i.e., c(c(a)) = a for each a ∈ [0, 1]

Classes of complements which fulfil these axiomatic requirements are, for instance, the complement functions defined by Sugeno and Yager, which are now applied in our method. The Sugeno and Yager complements are involutive fuzzy complements; they are shown in the graph below, varying by λ (Sugeno) or w (Yager). For the Sugeno complement function the classical Zadeh complement is obtained for λ = 0, and for the Yager complement function the Zadeh complement is obtained for w = 1.

FIGURE 13: SUGENO AND YAGER COMPLEMENT

In our approach either of the two complement functions, Sugeno or Yager, can be chosen. We require λ >= 0 for the Sugeno complement and 0 < w <= 1 for the Yager complement. This is necessary to satisfy the IFS requirement μA(x) + νA(x) <= 1.

For Sugeno, the larger λ is chosen, the lower is the complement value νA(x) and the higher is the considered degree of vagueness π(x). For λ = 0 we have no vagueness and therefore a standard Zadeh fuzzy complement is generated. For Yager, the lower w is chosen, the lower is the complement value νA(x) and the higher is the considered degree of vagueness π(x). For w = 1 we have no vagueness and therefore a standard Zadeh fuzzy complement is generated.

This means that the parameter λ (Sugeno) or w (Yager) determines not only the complement value, but also the level of vagueness, which is defined as π(x) = 1 - μA(x) - νA(x).
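The effect of the parameters on the complement value and on the vagueness can be sketched as follows. The Sugeno form (1 - a) / (1 + λa) is the one used in the worked example of this chapter; the Yager form (1 - a^w)^(1/w) is the standard textbook definition and is an assumption here, as the formula is not spelled out in the text. Function names are illustrative.

```python
def sugeno_complement(mu: float, lam: float) -> float:
    """Sugeno class complement; lam = 0 gives the Zadeh complement."""
    if lam <= -1.0:
        raise ValueError("Sugeno parameter must be greater than -1")
    return (1.0 - mu) / (1.0 + lam * mu)


def yager_complement(mu: float, w: float) -> float:
    """Yager class complement (assumed standard form); w = 1 gives Zadeh."""
    if w <= 0.0:
        raise ValueError("Yager parameter must be positive")
    return (1.0 - mu ** w) ** (1.0 / w)


def vagueness(mu: float, nu: float) -> float:
    """Intuitionistic index pi = 1 - mu - nu."""
    return 1.0 - mu - nu


mu = 0.7
for lam in (0.0, 0.5, 2.0, 5.0):
    nu = sugeno_complement(mu, lam)
    print(f"Sugeno lambda={lam}: nu={nu:.4f}, pi={vagueness(mu, nu):.4f}")
for w in (1.0, 0.9, 0.75, 0.5):
    nu = yager_complement(mu, w)
    print(f"Yager  w={w}: nu={nu:.4f}, pi={vagueness(mu, nu):.4f}")
```

For λ >= 0 and 0 < w <= 1 both complements stay at or below 1 - μ, so the resulting pair (μ, ν) remains a valid IFS element.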

To set these parameters, a mapping from the expert’s vagueness assessment of the tightly and loosely coupling degrees to the chosen parameter λ or w is needed, which can be set up as follows:

For the linguistic variable Certainty with the linguistic terms (fuzzy sets) {certain, confident, uncertain} the mapping can be defined, e.g., as:

“certain” to λ = 0.5 or w = 0.9,

“confident” to λ = 2 or w = 0.75,

“uncertain” to λ = 5 or w = 0.5.


The proposed method requires the experts to judge only the certainty of the membership aspects of the tightly and loosely coupling relationships. This matches how real-world situations are described, as dependencies are mostly expressed in a positive form (membership) only. In other words, in real-world situations we prefer to specify a level of dependency and to assess the independence implicitly via the negation of the level of dependence.

FIGURE 14: CERTAINTY MAPPINGS FOR SUGENO AND YAGER

The experts can express the vagueness indicator as linguistic terms in the CFIA grid; afterwards it is defuzzified back into a crisp number for the calculation, depending on the complement type (Sugeno, Yager).

Defining the direct coupling IFS D*

Assume, for instance, that we have independently determined a degree of tightly coupling of 0.7 (seen as an indicator of the risk resulting from interdependencies) and a degree of 0.5 for loosely coupling (degree of resilience capabilities).

We now choose the Sugeno complement and define for tightly coupling a lambda parameter of λ = 2. This value is mapped from the expert's assessment of a “confident” level for the tightly coupling impact statement.

Here we get a Sugeno complement for tightly coupling of c(μA(x) = 0.7) = (1 - 0.7) / (1 + 2 * 0.7) = 0.3 / 2.4 = 0.125 = νA(x).

We can create then a first IFS A* for Tightly Coupling with the tuple (0.7, 0.125). The degree of vagueness for tightly coupling is defined as 1 - μA(x) -νA(x) = 0.175.

The following graph shows the Sugeno complement for λ = 2:

FIGURE 15: SUGENO COMPLEMENT FOR LAMBDA = 2

For loosely coupling we have a membership degree of 0.5 and follow the same approach. For the Sugeno complement we define a vagueness indicator of λ = 0.5, which corresponds to a “certain” statement.

Here we get a Sugeno complement for loosely coupling c(μB(x) = 0.5) = (1-0.5) / (1 + 0.5 * 0.5) = 0.5 / 1.25 = 0.4 = νB(x).

Now we create a second IFS B* for Loosely Coupling with the tuple (0.5, 0.4). The degree of vagueness for loosely coupling is 1 - μB(x) -νB(x) = 0.1.

We now have two independent IFS, A* for tightly and B* for loosely coupling, which we again want to combine into a single direct coupling IFS D*. Using this approach the real impact can be closely simulated by considering both contrary sides of the coupling aspect simultaneously.

The IFS D* can be defined with the adapted IFS operation A@¬B [Atanassov 08]:

μdircpl(x) = (μA(x) + νB(x)) / 2 and νdircpl(x) = (νA(x) + μB(x)) / 2

where A* is the IFS for tightly coupling with membership μA(x) and non-membership νA(x), and B* is the equivalent IFS for loosely coupling. We denote μcombined(x) = μdircpl(x) and νcombined(x) = νdircpl(x) as the degrees of the direct coupling IFS D*.

With an IFS A* = (0.7, 0.125) for tightly coupling and B* = (0.5, 0.4) for loosely coupling, the combined IFS D* = (A@¬B) is ((0.7 + 0.4) / 2, (0.125 + 0.5) / 2) = (0.55, 0.3125) with a vagueness of πD(x) = 1 - 0.55 - 0.3125 = 0.1375.
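The complete step from expert assessment to direct coupling IFS can be sketched end to end. The certainty mapping uses the example λ values given earlier in this chapter, the Sugeno complement follows the worked calculation, and A@¬B is taken as the pairwise mean that reproduces the tuple (0.55, 0.3125); all function and variable names are illustrative assumptions.

```python
# Example mapping of linguistic certainty terms to the Sugeno parameter.
SUGENO_LAMBDA = {"certain": 0.5, "confident": 2.0, "uncertain": 5.0}


def sugeno_complement(mu: float, lam: float) -> float:
    """Sugeno complement (1 - mu) / (1 + lam * mu)."""
    return (1.0 - mu) / (1.0 + lam * mu)


def to_ifs(mu: float, certainty: str):
    """Build the (membership, non-membership) pair for one coupling index."""
    return mu, sugeno_complement(mu, SUGENO_LAMBDA[certainty])


def combine(tight, loose):
    """A @ not-B applied to the tightly (A*) and loosely (B*) coupling IFS."""
    mu_a, nu_a = tight
    mu_b, nu_b = loose
    return (mu_a + nu_b) / 2.0, (nu_a + mu_b) / 2.0


def vagueness(ifs) -> float:
    """Intuitionistic index pi = 1 - mu - nu."""
    mu, nu = ifs
    return 1.0 - mu - nu


a_star = to_ifs(0.7, "confident")   # tightly coupling
b_star = to_ifs(0.5, "certain")     # loosely coupling
d_star = combine(a_star, b_star)

print("A* =", a_star, "pi =", round(vagueness(a_star), 4))
print("B* =", b_star, "pi =", round(vagueness(b_star), 4))
print("D* =", d_star, "pi =", round(vagueness(d_star), 4))
```

Because both inputs already carry some vagueness, the combined set D* also has a non-zero intuitionistic index.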

We may also use weighting factors a and b that normalize the formula. This also allows defining specific weights for the loosely and tightly coupling aspects, in case one is seen as more important than the other. In that case a + b = 1 must always hold:

μdircpl(x)= a * μA(x) + b * νB(x) and νdircpl(x) = a * νA(x) + b * μB(x)

Alternatively, other fuzzy operations may be chosen, provided they satisfy the necessary requirement for the combined direct coupling IFS D* that μdircpl(x) + νdircpl(x) ≤ 1. Examples are the operators (max(μA(x),νB(x)),min(μB(x),νA(x))) and (min(μA(x),νB(x)),max(μB(x),νA(x))).

The operator (max(tightly),min(loosely)) prefers the highest (most dominant) coupling assessment among the membership of tightly coupling and the non-membership of loosely coupling, together with the lowest degree of the resilience and independency assessment.

D*(MaxMin)=(max(μA(x),νB(x)),min(μB(x),νA(x)))=(max(0.7,0.4),min(0.5,0.125)) = (0.7,0.125) with vagueness πD(x)= 0.175.

D*(MinMax)=(min(μA(x),νB(x)),max(μB(x),νA(x)))=(min(0.7,0.4),max(0.5,0.125)) = (0.4,0.5) with vagueness πD(x)= 0.1.

5.3.3 IFCFIA FORMAL DEFINITION

Having defined the couplings as Intuitionistic Fuzzy Sets (IFS), we now give the following formal definition of the Intuitionistic Fuzzy Component Failure Impact Analysis (IFCFIA).

Dependence coupling C is a measure that we propose to capture how dependent a component or service is on other services or resources for its delivery. IFCFIA can then be defined as the following tuple of components and coupling relationships:

IFCFIA = (E, C), where E is a set of components and C is the intuitionistic fuzzy set of coupling relationships between the components:

C = {<a, b, μC(a, b), νC(a, b)> / a∈E, b∈E },

where the function μC: E×E → [0, 1] defines the degree of tightly coupling (:= dependency between the components a and b) and the function νC: E×E → [0, 1] defines the probabilistic degree of loosely coupling (:= independency between the components a and b).

The intuitionistic fuzzy set of coupling relationships C between the components comprises all direct and indirect relationships. Once the direct couplings are defined, the indirect couplings can therefore be calculated rather than determined individually (IFCFIA approach step 5). This is done via IFS operations that implement composition functions on the direct intuitionistic relations.
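One possible in-memory representation of the tuple IFCFIA = (E, C) is sketched below. The class and attribute names are assumptions made for illustration, and the component names are hypothetical; only the constraint μC + νC ≤ 1 and the index π are taken from the definitions above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Coupling:
    """One element of C: membership mu (tightly) and non-membership nu (loosely)."""
    mu: float
    nu: float

    def __post_init__(self):
        if not (0.0 <= self.mu <= 1.0 and 0.0 <= self.nu <= 1.0):
            raise ValueError("mu and nu must lie in [0, 1]")
        if self.mu + self.nu > 1.0 + 1e-12:
            raise ValueError("IFS constraint mu + nu <= 1 violated")

    @property
    def pi(self) -> float:
        """Intuitionistic index (vagueness) of this coupling."""
        return 1.0 - self.mu - self.nu


@dataclass
class IFCFIA:
    """IFCFIA = (E, C): components E and coupling relationships C on E x E."""
    components: set = field(default_factory=set)
    couplings: dict = field(default_factory=dict)

    def add_coupling(self, a: str, b: str, mu: float, nu: float) -> None:
        """Register the coupling <a, b, mu, nu> and both components."""
        self.components.update((a, b))
        self.couplings[(a, b)] = Coupling(mu, nu)


model = IFCFIA()
# Hypothetical component name; degrees taken from the worked example.
model.add_coupling("db-server", "Logistics Management", 0.55, 0.3125)
print(model.couplings[("db-server", "Logistics Management")])
```

Validating the constraint at construction time keeps every stored relationship a valid IFS element, whichever combination operator produced it.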


5.4 DESCRIPTION STEP 5: CALCULATION OF THE INDIRECT COUPLINGS

5.4.1 OVERVIEW

IFCFIA Step 4 resulted in the intuitionistic fuzzy probabilistic direct impacts between two components. Once the IFS for the direct impact are defined, the indirect coupling between components or services can be calculated from the degrees of direct coupling rather than determined individually. Different probabilistic variants of the logical operations can be used in the calculation of the indirect impacts, which allows modelling the way the incident impact is propagated through the complex system.

Different types of impact analysis use classical or probabilistic variants of the logical operations conjunction and disjunction in the calculation of indirect impacts. Depending on which combination of operations is used, the indirect impacts may be greater or smaller. Several basic types of impact analysis are introduced, which may express attitudes leading to a pessimistic (worst case), optimistic (best case) or moderate (medium case) assessment of the impact caused by an incident situation.

This variation of the indirect impact calculation can be leveraged to implement a viewpoint-based or attitude-based concept. A viewpoint is basically a specification that describes a particular view of the service, which is an important parameter for performing an impact assessment. A viewpoint is defined with a particular stakeholder or set of stakeholders in mind and allows different stakeholders to focus on their own concerns. The impact of a specific incident depends on its relation to a stakeholder’s concerns and requirements. Various stakeholders may have individual concerns, which lead to different subjective impact assessments. This can make it difficult to reach a common agreement on the strength of the expected impact.

We therefore also regard this concept as a kind of “attitude”-based impact assessment model, which allows a parameterized impact assessment with regard to the stakeholders’ individual attitudes and concerns.

5.4.2 INDIRECT COUPLING CALCULATIONS

The direct coupling from component x to component y is defined as follows in the directed dependency map:

where V is the evaluating function of the intuitionistic fuzzy statement.


The methodology for calculating the indirect coupling follows the forward dependency direction. Following it we can answer the question “Which are the indirect dependants of a particular component x?” starting from the node x in the dependency graph and traversing through its direct or indirect dependants.

Therefore, when we perform an impact analysis, we use a Forward Coupling Calculation (FCC), an adapted variant of the approach introduced by [Kolev and Ivanov, 2009]. Using FCC, the indirect coupling from component x to component y is defined as follows:

where i is the component directly coupled to y on the path from x to y.

Applying the Forward Coupling Calculation (FCC), we will more exactly call the forward looking relation “couples to” and the backward looking relation “is coupled to”.

For our KQI/KPI hierarchy a forward-looking coupling calculation means a bottom-up calculation: if, e.g., an infrastructure component fails, what is the coupling to a higher-level component or, finally, to the business service B?

Vice versa, a root cause analysis is a top-down approach and requires the reverse task to be solved, i.e. “To which components is the business application B coupled (on which does it depend)?” This second task requires a methodology for calculating indirect impacts that starts from the dependant and traverses its impact arcs in the reverse direction. We refer to this method as Reverse Coupling Calculation (RCC).

RCC uses the following formula to calculate indirect impact from component y to component x:

where i is the component directly coupled to x on the path from y backwards to x.

The decision which method should be applied depends on the task to be solved: impact analysis or root cause analysis. Both FCC and RCC are valid methods and can be used for indirect coupling calculations.

FCC and RCC results for indirect dependencies may differ. The two results are equal only if the conjunction is distributive over the disjunction, and this is not the case for the probabilistic logical operations, which are not distributive.
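
The missing distributivity can be checked on the membership degrees alone. The following short derivation is a sketch that uses the probabilistic conjunction and disjunction listed in section 5.4.3; the symbols a, b and c are introduced here only for illustration and stand for the membership degrees μ of three couplings.

```latex
\begin{align*}
a \wedge (b \vee c) &= a\,(b + c - bc) = ab + ac - abc \\
(a \wedge b) \vee (a \wedge c) &= ab + ac - (ab)(ac) = ab + ac - a^{2}bc
\end{align*}
```

The two sides agree only if a²bc = abc, i.e. if a is 0 or 1 or if bc = 0. With the classical operations min and max, in contrast, both sides always coincide.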

For the further work we will apply the FCC method, as we assess the impact bottom-up, from the infrastructure or backstage to the business applications or front stage services. Once automated discovery tools (here Tivoli Application Dependency Discovery Manager) have discovered the topology of the physical and software infrastructure, we want to draw conclusions about the impact on the business applications. The fulfilment of any higher-level objective requires proper enforcement not on a single resource, but on multiple resources at several levels. Service monitoring on the backstage metrics implies a bottom-up approach and begins by monitoring backend applications and resources.

With the described method we can relate the metrics of individual components, such as accuracy, responsiveness and uptime (which are in a sense backstage metrics), to the front stage experienced by the client or business. The indirect couplings indicate whether the frontend service will meet the SLAs or whether it may be impacted or operationally degraded in some way.

5.4.3 TYPES OF INDIRECT IMPACT OPERATIONS

Classical and probabilistic interpretation of logical operations

Depending on which combination of IFS operations is used, the indirect impacts may be greater or smaller. Four types of impact analysis are introduced in [Kolev and Ivanov, 2009]: worst case (pessimistic), best case (optimistic), moderate (medium) and classical fuzzy analysis.

The possibility of both a classical and a probabilistic interpretation of the logical operations conjunction (∧) and disjunction (∨) is a key concept in the proposed indirect impact calculations. The partial impact between the component PI and the business KPI is now expressed by means of intuitionistic fuzzy values carrying probabilistic information. The result of combining the classical and probabilistic interpretations of the logical operations can be interpreted either as a probabilistic indirect dependency between the component PI and the business KQI (the probability that a KQI breaches the SLA if the component PI fails) or as an ordinary indirect fuzzy dependency (the KQI is partially out of specification or degraded if the component PI fails).

The classical and/or probabilistic interpretation of the logical operations can be applied uniformly to all calculations when traversing the dependency graph and its direct or indirect dependants. This approach can be seen as a “general attitude”-based interpretation of the indirect impacts.

Alternatively, the decision about the classical and/or probabilistic interpretation can be assigned as a characteristic to the relation itself. In this case the specific type of relation determines the best-suited operation. For example, simple ordinary relation measurements that are often described probabilistically, such as component availability, may call for probabilistic interpretations. In practice, assigning an additional attribute to each individual relation will be difficult to handle, as it needs to be defined by an expert’s assessment and the number of individual relations can become large even within a limited evaluation scope.

The following classical and probabilistic calculations are proposed:

Worst case impact analysis

The worst case impact analysis uses the classical conjunction and the probabilistic disjunction in the calculation of indirect impacts. Thus a greater value for the degree of truth of indirect impacts is achieved. V, the evaluating function of an intuitionistic fuzzy statement, can be defined as follows to maximize the impact (worst case):

V (p ∧ q) = < min (μ(p), μ(q)), max (ν(p), ν(q)) >

V (a ∨ b) = < μ(a) + μ(b) - μ(a) * μ(b), ν(a)* ν(b) >

Best case impact analysis

The best case impact analysis uses the probabilistic conjunction and the classical disjunction in the calculation of indirect impacts. Thus a smaller value for the degree of truth of indirect impacts is achieved:

V (p ∧ q) = < μ(p) * μ(q), ν(p) + ν(q) - ν(p) * ν(q) >

V (a ∨ b) = < max (μ(a), μ(b)), min (ν(a), ν(b)) >

Moderate impact analysis

The moderate impact analysis uses either probabilistic or classical logical operations in the calculation of indirect impacts. Probabilistic operations are more applicable for IFCFIAs with a probabilistic kind of component dependencies:

V (p ∧ q) = < μ(p) * μ(q), ν(p) + ν(q) - ν(p) * ν(q) >

V (a ∨ b) = < μ(a) + μ(b) - μ(a) * μ(b), ν(a) * ν(b) >

Classical fuzzy impact analysis

Classical intuitionistic fuzzy operations are more applicable for IFCFIAs with ordinary fuzzy kind of component dependencies:

V (p ∧ q) = < min (μ(p), μ(q)), max (ν(p), ν(q)) >

V (a ∨ b) = < max (μ(a), μ(b)), min (ν(a), ν(b)) >
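
The four operation pairs above can be written down compactly in code. The following is a minimal Python sketch, not the implementation used in this work: the function names, the `MODES` table and the two sample values `p` and `q` are assumptions made for illustration only.

```python
from typing import Callable, Dict, Tuple

IFS = Tuple[float, float]  # (mu, nu): degree of truth, degree of falsity


def conj_classical(p: IFS, q: IFS) -> IFS:
    return (min(p[0], q[0]), max(p[1], q[1]))


def conj_probabilistic(p: IFS, q: IFS) -> IFS:
    return (p[0] * q[0], p[1] + q[1] - p[1] * q[1])


def disj_classical(a: IFS, b: IFS) -> IFS:
    return (max(a[0], b[0]), min(a[1], b[1]))


def disj_probabilistic(a: IFS, b: IFS) -> IFS:
    return (a[0] + b[0] - a[0] * b[0], a[1] * b[1])


Op = Callable[[IFS, IFS], IFS]

# Analysis type -> (conjunction, disjunction), following the formulas in the text.
MODES: Dict[str, Tuple[Op, Op]] = {
    "worst": (conj_classical, disj_probabilistic),
    "best": (conj_probabilistic, disj_classical),
    "moderate": (conj_probabilistic, disj_probabilistic),
    "classical": (conj_classical, disj_classical),
}

# Illustrative values only, not taken from the paper.
p, q = (0.6, 0.3), (0.5, 0.4)
for name, (conj, disj) in MODES.items():
    print(name, "AND:", conj(p, q), "OR:", disj(p, q))
```

For these sample values the worst case pair yields the largest membership degree for both operations and the best case pair the smallest, which matches the ordering described above.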

The applied method of impact analysis can not only express the risk attitude but may also be used to adapt to different types of KQI/PI relations, as each individual KQI/PI relation may be more or less sensitive to coupling and has therefore a different derived indirect impact.

The result of IFCFIA Step 5 is the indirect coupling index of each component to the front-end business service represented as IFS. This can now be added to the CFIA grid, which we call afterwards the IFCFIA grid. This grid with the indirect fuzzy relationships builds the core and basis of the Intuitionistic Fuzzy Component Failure Impact Analysis.

5.4.4 EXAMPLE OF INDIRECT COUPLING CALCULATIONS

The direct couplings with their corresponding intuitionistic fuzzy relationships can be drawn in a directed graph; indirect couplings can then be calculated from the direct coupling degrees.

FIGURE 16: DIRECT IFS RELATIONSHIPS “COUPLING” IN A DIRECTED GRAPH

As an example, the Forward Coupling Calculation (FCC) method (used for impact analysis) gives the indirect coupling indcpl(C2,B0) in the graph above, i.e. the indirect dependency of the Business Application B0 on the Component C2, as:

indcpl(C2,B0) = (dircpl(C2,C3) ∨ (dircpl(C2,C4) ∧ dircpl(C4,C3))) ∧ dircpl(C3,B0)

Applying the classical indirect coupling operations, the result is indcpl_classic(C2,B0) = (0.60,0.30). A moderate impact assessment gives indcpl_moderate(C2,B0) = (0.43,0.43), a worst case impact assessment gives indcpl_worst(C2,B0) = (0.60,0.30), and a best case impact assessment gives indcpl_best(C2,B0) = (0.36,0.51).
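
The evaluation of the FCC expression can be retraced with a few lines of Python. This is a sketch under a strong assumption: the direct coupling values of Figure 16 are not reproduced in the text, so the values in `DIRCPL` below are hypothetical and were merely chosen so that the results round to the four figures quoted above. The helper names are likewise illustrative.

```python
def conj(p, q, probabilistic):
    """Intuitionistic fuzzy AND, classical (min/max) or probabilistic."""
    if probabilistic:
        return (p[0] * q[0], p[1] + q[1] - p[1] * q[1])
    return (min(p[0], q[0]), max(p[1], q[1]))


def disj(a, b, probabilistic):
    """Intuitionistic fuzzy OR, classical (max/min) or probabilistic."""
    if probabilistic:
        return (a[0] + b[0] - a[0] * b[0], a[1] * b[1])
    return (max(a[0], b[0]), min(a[1], b[1]))


# mode -> (probabilistic conjunction?, probabilistic disjunction?)
MODES = {
    "classical": (False, False),
    "moderate": (True, True),
    "worst": (False, True),
    "best": (True, False),
}

# Hypothetical direct couplings (mu, nu); NOT the values of Figure 16.
DIRCPL = {
    ("C2", "C3"): (0.6, 0.3),
    ("C2", "C4"): (0.5, 0.4),
    ("C4", "C3"): (0.6, 0.4),
    ("C3", "B0"): (0.6, 0.3),
}


def indcpl_c2_b0(mode):
    """(dircpl(C2,C3) OR (dircpl(C2,C4) AND dircpl(C4,C3))) AND dircpl(C3,B0)."""
    prob_and, prob_or = MODES[mode]
    via_c4 = conj(DIRCPL["C2", "C4"], DIRCPL["C4", "C3"], prob_and)
    to_c3 = disj(DIRCPL["C2", "C3"], via_c4, prob_or)
    mu, nu = conj(to_c3, DIRCPL["C3", "B0"], prob_and)
    return (round(mu, 2), round(nu, 2))


for mode in MODES:
    print(mode, indcpl_c2_b0(mode))
```

With these assumed inputs the classical and the worst case analysis coincide, because the coupling from C3 to B0 limits the result in both cases.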

After calculating the indirect coupling IFS, we can see directly which components create the biggest risk to our business application. A component with a coupling membership value close to one will have a major impact, because the business application depends on this component and hardly any resilience capabilities exist for it. A component with a coupling membership value close to zero either has great resilience or is decoupled and functionally independent, which indicates a small coupling level.

The directed graph can also be leveraged for a root cause analysis (RCA). This is a top-down approach and requires the reverse task to be solved, i.e. “On which components does the business application B depend (to which components is it coupled)?” The difference is that RCA calculates indirect impacts starting from the dependant and traversing its impact arcs in the reverse direction. We refer to this top-down method as Reverse Coupling Calculation (RCC), in contrast to the bottom-up Forward Coupling Calculation (FCC).

Once all indirect couplings have been calculated, a simple one-level intuitionistic dependency map can be drawn showing all dependencies of a business service.

FIGURE 17: ONE-LEVEL DEPENDENCY MAP AFTER PERFORMING FCC OR RCC

FCC and RCC results for indirect dependencies may differ. An example here is the coupling calculation of indcpl(C2,B0) from the graph above applying:

FCC method: indcpl(C2,B0) = (dircpl(C2,C3) ∨ (dircpl(C2,C4) ∧ dircpl(C4,C3))) ∧ dircpl(C3,B0)

RCC method: indcpl(C2,B0) = (dircpl(C3,B0) ∧ dircpl(C4,C3) ∧ dircpl(C2,C4)) ∨ (dircpl(C2,C3) ∧ dircpl(C3,B0))
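
The difference between the two expressions can be made visible numerically. The sketch below evaluates both with the moderate (probabilistic) and with the classical operations. The four coupling values A to D are hypothetical stand-ins for the values of Figure 16, which are not reproduced in the text, and all function names are illustrative.

```python
# Hypothetical direct couplings (mu, nu); NOT the values of Figure 16.
A = (0.6, 0.3)  # dircpl(C2,C3)
B = (0.5, 0.4)  # dircpl(C2,C4)
C = (0.6, 0.4)  # dircpl(C4,C3)
D = (0.6, 0.3)  # dircpl(C3,B0)


def and_prob(p, q):
    return (p[0] * q[0], p[1] + q[1] - p[1] * q[1])


def or_prob(a, b):
    return (a[0] + b[0] - a[0] * b[0], a[1] * b[1])


def and_min(p, q):
    return (min(p[0], q[0]), max(p[1], q[1]))


def or_max(a, b):
    return (max(a[0], b[0]), min(a[1], b[1]))


def fcc(conj, disj):
    """(A OR (B AND C)) AND D, the FCC expression from the text."""
    return conj(disj(A, conj(B, C)), D)


def rcc(conj, disj):
    """(D AND C AND B) OR (A AND D), the RCC expression from the text."""
    return disj(conj(conj(D, C), B), conj(A, D))


def rounded(value):
    return (round(value[0], 2), round(value[1], 2))


print("moderate  FCC", rounded(fcc(and_prob, or_prob)),
      "RCC", rounded(rcc(and_prob, or_prob)))
print("classical FCC", rounded(fcc(and_min, or_max)),
      "RCC", rounded(rcc(and_min, or_max)))
```

Under these assumptions the probabilistic operations give different FCC and RCC results, while the classical min/max operations, which are distributive, give the same result for both.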

The decision which method should be applied depends on the task to be solved: impact analysis or root cause analysis. Both FCC and RCC are valid methods and can be used for indirect coupling impact calculations.

5.4.5 UPDATING THE CFIA GRID WITH THE INDIRECT COUPLING INDEX

In the CFIA grid built as described in Step 3, components that have a large number of Xs are critical to many business services, so their failure can result in a high impact. Conversely, business applications that use many components are vulnerable to failure.

TABLE 4: IFCFIA GRID WITH INDIRECT COUPLINGS TO THE BUSINESS SERVICE

After the IFS for the direct impacts had been defined, the indirect couplings were calculated rather than determined individually. We can now simply replace the Xs in the column for the Business Services with the IFS results of the indirect coupling calculations. As a result of IFCFIA Step 4 we have determined the intuitionistic fuzzy probabilistic direct impacts between two components, which form the direct coupling IFS, and we have now added the indirect coupling IFS as an additional column to the CFIA grid.

The impact analysis using the IFCFIA can answer the questions “to which degree is the impacted business service dependent on a particular component x, and to which degree is it independent?” and “which components have an impact on a specific business service, to which degree, and to which degree do they not impact the business function?”

We have now extended the classical CFIA to the IFCFIA (Intuitionistic Fuzzy Component Failure Impact Analysis) grid by adding the indirect fuzzy dependencies as IFS to the table. The IFS numbers shown are only examples to convey the look and feel of the IFCFIA grid and are not actually derived by IFS operations. The coupling of the highest-level infrastructure components (e.g. Switch, Http Server) to the business application is also defined by the experts and can be written in the same way, as a coupling parameter next to those highest-level infrastructure components that are directly related to the business service.

As FCC and RCC results for indirect dependencies may differ, we need two columns in the CFIA, one for each type of calculation, for each business service. For each IFS, the intuitionistic index π(x) = 1 - μA(x) - νA(x) implicitly provides the degree of vagueness, uncertainty, limited knowledge etc. of the indirect coupling IFS.

The CFIA matrix can now be leveraged for impact analysis, which requires the Forward Coupling Calculation (FCC) method for indirect couplings, or for a Root Cause Analysis (RCA), which requires the Reverse Coupling Calculation (RCC) method, as described in the next chapter.

5.4.6 IMPACT- AND ROOT CAUSE ANALYSIS

The IFCFIA can be used in two principal ways: bottom-up as an impact assessment or top-down as a fault tree analysis.

IFCFIA Impact Analysis

CFIA can be used to help predict and evaluate the impact on business services arising from component failures within the IT infrastructure design. Once the IFCFIA grid has been built, the impact analysis using the IFCFIA can answer the question “Which business services are the indirect dependants of a particular component x, and how tightly or loosely are they coupled?”, starting from the low-level infrastructure component in the dependency hierarchy and traversing through its direct or indirect dependants to the business application services.

Therefore a forward coupling calculation must be applied and thus the IFCFIA column “FCC coupling to Business Service” must be taken for impact assessments.

The IFCFIA can now be used as a proactive method to determine the potential impact on service delivery in the event that a particular component (or configuration item) should fail. We can now leverage all three measurements (tight coupling, loose coupling and the level of vagueness) for reasoning about possible impacts. The tightly coupled dependency degree can be seen as an indicator of the risk resulting from interdependencies, whereas the loosely coupled aspect refers to the mitigation and resilience capabilities of a system.

IFCFIA can be a very useful tool, as it creates a visual tabular view of services and their required component items and shows in detail how the infrastructure is arranged and organized and how its elements depend on each other. A basic IFCFIA will target a specific section of the infrastructure and simply look at scenarios such as: if we lose component x, will a business service be degraded, or is it likely to stop working?

By adding the indirect coupling IFS we have much more granular information. We can now even reason over several semantics of intuitionistic fuzzy dependencies between components, depending on the type of information they represent.

A probabilistic coupling dependency between KPI a and KPI b means “the probability that b is not available in case a is not available”.

An ordinary fuzzy coupling dependency between b and a means that “if b is not available, then a is partially not available”.

The IFCFIA can express dependencies that take both semantics into account, which is also aligned with the way human dependency assessment works. Verbally expressing that a system or service is partly available, i.e. operationally degraded, for instance to a degree of 10%, but is considered available with a probability of 80%, gives a good measure of the expected usability and operational status.

IFCFIA therefore allows the notion of a service that is still usable with some sort of degradation (functional and probabilistic), where the first indicator refers to the level of degradation and the second indicates the probability that it occurs.

IFCFIA Fault Tree Analysis

The purpose of a fault tree analysis is to determine the root cause of a failure, given that a particular item is out of order. A root cause analysis is a top-down approach and requires the reverse task of the impact analysis to be solved, i.e. “On which components does the business application B depend (to which components is it coupled)?”

Therefore a reverse coupling calculation must be applied and thus the IFCFIA column “RCC coupling from Business Service” must be taken for the root cause analysis.

Assuming that the component y has failed, the intuitionistic fuzzy set R, representing the root cause possibilistic distribution, is defined as follows:

R (y) = { < x, μR ( x,y ), νR ( x,y ) > | x ∈ C },

where μR ( x,y ) = μ(indcpl(x,y)) and νR ( x,y ) = ν(indcpl(x,y))

and C is the set of all components.

The IFCFIA analysis procedure takes into account the direct and indirect impacts of other components on the failed components. The result of the analysis is a sorted intuitionistic fuzzy distribution of components, giving an ordered set of possible root causes.

Once the IFCFIA grid has been created, we simply need to sort by the level of IFS coupling (we propose to sort primarily by tight coupling and secondarily by loose coupling) to obtain an order for the probability of possible root causes. The infrastructure component with the highest coupling is the most likely cause and should therefore be considered first as the cause of the impact on a higher-level business service.
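
The proposed ordering can be sketched as a two-key sort. The component names and coupling values below are hypothetical, and the direction of the secondary key is an assumption: among components with equal tight coupling, the one with the lower loose coupling degree is ranked first.

```python
# Hypothetical RCC couplings (mu, nu) of components to one business service.
rcc_coupling = {
    "Switch": (0.8, 0.1),
    "Http Server": (0.6, 0.3),
    "Database": (0.6, 0.2),
    "Disk": (0.3, 0.6),
}


def rank_root_causes(coupling):
    """Sort by tight coupling (descending), then loose coupling (ascending)."""
    return sorted(coupling, key=lambda c: (-coupling[c][0], coupling[c][1]))


ranking = rank_root_causes(rcc_coupling)
print(ranking)  # ['Switch', 'Database', 'Http Server', 'Disk']
```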

To summarize, the indirect couplings between components and the business services can be calculated using either the FCC or the RCC method, but the choice of the best-suited calculation method depends on the task to be solved: impact analysis or root cause analysis.

5.5 STEP 6: (OPTIONAL) EXTENDING THE BUSINESS VIEW

5.5.1 IT (HUMAN) ENABLED SERVICES

The described impact model can be extended to include IT Enabled Services (ITeS), which typically include a large human element. To operate IT systems we also need to know about dependencies on, for example, IT users and roles, IT staff, IT organizational elements, business units, and supporting processes and functions such as helpdesk and maintenance services.

This can also be expressed with a logical association relationship such as “is coupled to”: a procedure, an organizational unit, a Service Level Agreement (SLA), a manual, a user documentation or a support function like the help desk. We can easily leverage the IFCFIA method to show all those relationships as well, where coupling can be interpreted as any form of interdependency arising from an interaction. This interaction can be triggered manually or automated with service management tools.

FIGURE 18: EXTENDED DEPENDENCY GRAPH WITH IT ENABLED SERVICES

For example, the correct operation of an infrastructure component may depend on available and complete technical documentation. If the documentation is not available, the component operates in a degraded way, and in the event of a failure a longer Mean Time To Recover (MTTR) or limited restoration capabilities will result.

For components that do not belong to the system itself, for instance related support functions or relations to user experience, we propose to capture the dependency simply via a linguistic description, e.g. high, medium high, medium, low or very low. The degree of dependency or coupling could be determined directly by the experts who created the service. We can then map the linguistic terms to specific IFS coupling values, e.g. high (with some uncertainty) to (0.8,0.1) or medium dependency (without uncertainty) to (0.5,0.5).
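
Such a mapping can be kept in a simple lookup table. In the following sketch only the values for “high” and “medium” are taken from the text; the values for the other three terms are assumptions added for illustration. The helper `hesitation` computes the intuitionistic index π = 1 - μ - ν, which makes the difference between a term with and without uncertainty explicit.

```python
# Linguistic term -> IFS coupling value (mu, nu).
LINGUISTIC_TO_IFS = {
    "high": (0.8, 0.1),           # from the text, with some uncertainty
    "medium high": (0.65, 0.25),  # assumed value
    "medium": (0.5, 0.5),         # from the text, without uncertainty
    "low": (0.2, 0.7),            # assumed value
    "very low": (0.05, 0.9),      # assumed value
}


def hesitation(ifs):
    """Intuitionistic index pi = 1 - mu - nu, rounded for display."""
    mu, nu = ifs
    return round(1.0 - mu - nu, 2)


for term, ifs in LINGUISTIC_TO_IFS.items():
    assert 0.0 <= ifs[0] + ifs[1] <= 1.0, "mu + nu must not exceed 1"
    print(f"{term:12s} {ifs}  pi = {hesitation(ifs)}")
```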

With this approach we are able to predict, with the same bottom-up calculation, the indirect impact of a failure of a single infrastructure component up to high-level business indicators such as the perceived User Experience or Customer Satisfaction.

5.5.2 ADDING THE COSTS OF FAILURE TO THE IFCFIA

The IFCFIA matrix developed during the activities described in the previous chapters can be expanded with fields that record the number of users supported by each business service, so that the component coupling to the higher-level services also indicates the users affected by a degraded operation of an infrastructure node. Thus, when a component is unavailable, the number of users impacted is understood. This enables cost calculations based on the number of users impacted and/or the amount of lost user processing time, or even the total cost of unavailability.

However, the number of user workstations does not necessarily equate to the number of users at one point in time. Other measurements of the cost of failure should therefore complement these numbers, such as SLA penalties when service providers fail to deliver the pre-agreed quality, or an estimate of the financial impact of IT failure based on the transaction volumes (related to the vital business functions) normally processed during the period of failure.

For organisations unable to justify the failure costs, a 'user assessment' of a monetary hourly value is a simple technique that provides a business and user view of the cost of non-availability of a business service. For certain businesses, a consequence of IT failure may even be external claims for financial compensation by impacted customers or business partners.

An approach to obtain an indicative cost per hour of unavailability is to take the annual cost to the business of taking the service and simply divide it by the number of service hours contracted in the SLA for a year.

An example of the calculation of the hourly cost of failure is shown in the chapter Business Impact Analysis.

Since we now hold the measurements for our business applications, we can compute them directly for our CIs using the following formula:

C_CI = Σ (i = 1 … n) μA(x)_i * C_i

where n is the number of business applications i, C_CI denotes the hourly cost of a failure of the component item CI, μA(x)_i is the degree of membership of tight coupling of the component to the business application i, and C_i denotes the hourly cost of a failure of the business application i.

For instance, the hourly cost of failure for the node HS-01 Http Server, which is coupled with μ=0.6 to Business Service 1 and with μ=0.5 to Business Service 2, is:

C_HttpServer = 0.6 * 10,000 + 0.5 * 3,000 = 7,500
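
The calculation is a weighted sum and can be sketched in a few lines. The function name and the data layout (a list of pairs of tight coupling degree and hourly cost of the business service) are illustrative assumptions; the numbers are those of the Http Server example above.

```python
def hourly_cost_of_failure(couplings):
    """Sum of mu_i * C_i over all business services coupled to the component.

    couplings: iterable of (mu, hourly_cost_of_business_service) pairs.
    """
    return sum(mu * cost for mu, cost in couplings)


# HS-01 Http Server: mu = 0.6 to Business Service 1 (hourly cost 10,000)
# and mu = 0.5 to Business Service 2 (hourly cost 3,000).
http_server = [(0.6, 10_000), (0.5, 3_000)]
cost = hourly_cost_of_failure(http_server)
print(round(cost))  # 7500
```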

The calculated total cost of failure per component can then be added as an additional column to the IFCFIA grid, which allows the monetary impact of each lower-level component failure to be assessed at a glance.

TABLE 5: EXTENDED CFIA WITH COST OF FAILURE

IFCFIA makes it possible to size and plan the capacity and performance parameters of each individual component based on the risk of failure of the coupled front-end business services. A high tight coupling index indicates a higher risk to the affected business service, which means that this infrastructure component is vital to the business and therefore needs highly committed resilience and ensured performance. A high loose coupling index for a component indicates a stronger resilience capability, which allows a small buffer overhead in the individual component’s capacity planning and sizing.

5.6 STEP 7: (OPTIONAL) INTUITIONISTIC FUZZY REASONING

5.6.1 FUZZIFICATION OF PERFORMANCE MEASURES

[Zadeh 94] “Two concepts within fuzzy logic play a central role in its applications. The first is a linguistic variable; that is, a variable whose values are words or sentences in a natural or synthetic language. The other is a fuzzy if-then rule in which the antecedent and consequents are propositions containing linguistic variables. The essential function of linguistic variables is that of granulation of variables and their dependencies. In effect, the use of linguistic variables and fuzzy if-then rules results - through granulation - in lossy data compression. In this respect, fuzzy logic mimics the remarkable ability of the human mind to summarize data and focus on decision-relevant information. … The concept of a linguistic variable goes to the heart of the way in which humans perceive, reason, and communicate …”

For the now proposed intuitionistic fuzzy reasoning we will apply Zadeh’s key concept for fuzzy rules in granulation (fuzzification) of the observed data using linguistic parameters.

In our impact analysis scenario we want to reason from measurements of the backend components, using the intuitionistic fuzzy coupling relationships, to determine the impact on frontend services. To measure the frontend quality of our service we apply fuzzy rules to the observed backend performance metrics, which enables us to generate performance rules for the expected frontend behaviour. For example, we fuzzify the “response time” performance metric of a database system for a standard query into the fuzzy variables HIGH, MEDIUM and LOW.

FIGURE 19: FUZZIFICATION OF “RESPONSE TIME” METRIC INTO THE FUZZY VARIABLES
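
A fuzzification of this kind can be sketched with trapezoidal membership functions. The breakpoints below (in milliseconds) and the trapezoidal shape are assumptions made for illustration; the actual shapes of Figure 19 are not reproduced in the text. The labels refer to the measured response time itself, so LOW here means a short response time.

```python
import math


def trapezoid(x, a, b, c, d):
    """Membership that rises on [a, b], is 1 on [b, c] and falls on [c, d]."""
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


# Hypothetical breakpoints in milliseconds for a standard database query.
RESPONSE_TIME_SETS = {
    "LOW": (0.0, 0.0, 100.0, 200.0),
    "MEDIUM": (100.0, 200.0, 400.0, 600.0),
    "HIGH": (400.0, 600.0, math.inf, math.inf),
}


def fuzzify(response_time_ms):
    """Return the membership degree of a measurement in each fuzzy variable."""
    return {
        label: trapezoid(response_time_ms, *points)
        for label, points in RESPONSE_TIME_SETS.items()
    }


print(fuzzify(150.0))  # {'LOW': 0.5, 'MEDIUM': 0.5, 'HIGH': 0.0}
```

A measurement between two breakpoints belongs partly to both neighbouring variables, which is what allows the later rules to fire gradually instead of switching at a hard threshold.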

Similar fuzzification rules can be applied to the other performance measures. In the described KQI/PI hierarchy [Open Group 04] there are natural boundaries for the granulation of the measurements of the performance parameters. As each Performance Indicator (PI) or Key Quality Indicator (KQI) has a lower and an upper warning threshold and a lower and an upper error threshold, we can use those thresholds as the best-suited limits for the linguistic performance variables. This makes sense especially because a set of PI values indicating warnings may degrade a service until it provokes an interruption.

When these thresholds are used as limits for linguistic performance expressions, fuzzy rules using these verbal expressions can best describe the situation in which the component performance is likely to result in an error indicating a service violation at a higher level.

FIGURE 20: MAPPING OF THRESHOLDS AND LINGUISTIC VARIABLES

For an SLA-aware service composition, the economic goal is to minimize cost while still keeping the performance attributes, e.g. response time, in the green area (>=medium).

The same linguistic parameters can therefore also be leveraged for reasoning about cost optimization, as this requires taking action not only in the case of low performance but also when a “very high” performance situation occurs. It is not only about delivering the SLA guarantees but also about reducing operational costs, so there are two objectives, performance and cost, which are contrary to each other.

5.6.2 APPLYING IFCFIA FOR FUZZY INTUITIONISTIC REASONING

Once we have determined the fuzzy rules that define the performance measures, we can create linguistic rules for the service that help to determine the Front-Stage Service Quality.

Within the indirect coupling IFS to the Business Service, we have defined a degree for the positive (tight) and the negative (loose) instance of the coupling aspect, where the parameter π(x) = 1 - μA(x) - νA(x) shows the level of vagueness. Therefore three coupling aspects (tight, loose and vagueness) can be considered for reasoning. The membership and non-membership functions of property variables can be designed individually for both coupling aspects (membership/tight and non-membership/loose). Subsequently, the inference rules of the system can be constructed with algorithms for reasoning and defuzzification for all aspects separately or in combination.

IFS allow two-sided (intuitionistic) fuzzy reasoning by combining both aspects, including the vagueness of the statement, into inference rules and logic. Using two-sided fuzzy logic, complex system behaviour can be closely simulated by considering both opposite sides of the coupling subject matter simultaneously. Two-sided fuzzy if-then rules can be constructed using different interpretations of fuzzy implications, modelled by applying the different IFR operations and a two-sided interpretation of the result sets.

The IFCFIA grid shows the fuzzy coupling relation of each low-level component to the related business applications and services. The tightly and loosely coupled IFS values are an aggregation over all indirect couplings and dependencies. With fuzzy reasoning based on the IFS coupling level we can now translate the metrics related to individual components of the service infrastructure, such as accuracy, responsiveness and uptime (which are in a sense backstage metrics), back to the front stage experienced by the client or business. This indicates whether the frontend service will meet the SLAs in case of a lower-level incident or whether the service levels to the business may be degraded.

General coupling rules can now be formulated. The principal approach used for static couplings is described in [Joshi et al. 2009]. We extend these concepts by distinguishing performance Quality of Service (QoS) attributes that are sensitive to tight coupling or to loose coupling, or by combining both relationships for two-sided intuitionistic reasoning.

5.6.3 EXAMPLE IFCFIA BASED RULES

Most ordinary measurements, like response time, are mainly influenced by and most sensitive to tight coupling, and we can therefore define the following example rules for them:

If {“Component Service” is (tightly coupled > 0.7) to “Business Service” and “Component Service” performance is LOW} then “Business Service” performance is LOW.

If {“Component Service” is (tightly coupled > 0.3 and < 0.7) to “Business Service” and “Component Service” performance is LOW} then “Business Service” performance is MEDIUM.

If {“Component Service” is (tightly coupled > 0.7) to “Business Service” and “Component Service” performance is MEDIUM} then “Business Service” performance is MEDIUM.

If {“Business Service” is tightly coupled to “User Experience” and “Business Service” performance is MEDIUM or LOW} then “User Experience” is LOW.

If {(“Business Service” is tightly coupled to “User Experience” and “Business Service” is tightly coupled to “Helpdesk Service”) and (“Business Service” performance is LOW or “Helpdesk Service” performance is LOW)} then Customer Satisfaction is LOW.

For QoS attributes that are more sensitive to the loosely coupled aspect (most probability measurements, like reliability) we can define rules in the same way:

If {“Component Service” is (loosely coupled < 0.5) to “Business Service” and “Component Service” reliability is LOW} then “Business Service” reliability is LOW.

If {“Component Service” is (loosely coupled > 0.5) to “Business Service” and “Component Service” reliability is LOW} then “Business Service” reliability is MEDIUM.

More sophisticated rules can now use both aspects for tightly and loosely coupling to define rules to perform two-sided intuitionistic reasoning:

If {“Component Service” is (tightly coupled > 0.5) and (loosely coupled < 0.4) to “Business Service” and (“Component Service” performance is LOW or “Component Service” reliability is LOW)} then “Business Service” performance is LOW.
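
The component-to-business rules above can be evaluated mechanically once the coupling IFS and the fuzzified component state are known. The following sketch is a simplification under stated assumptions: it uses crisp threshold tests instead of a full fuzzy inference with defuzzification, returns None when no rule fires, and all function names are illustrative.

```python
from typing import Optional


def business_performance(tight: float, component_performance: str) -> Optional[str]:
    """Rules for measurements that are sensitive to tight coupling."""
    if tight > 0.7 and component_performance == "LOW":
        return "LOW"
    if 0.3 < tight < 0.7 and component_performance == "LOW":
        return "MEDIUM"
    if tight > 0.7 and component_performance == "MEDIUM":
        return "MEDIUM"
    return None  # no rule applies


def business_reliability(loose: float, component_reliability: str) -> Optional[str]:
    """Rules for measurements that are sensitive to loose coupling."""
    if component_reliability == "LOW" and loose < 0.5:
        return "LOW"
    if component_reliability == "LOW" and loose > 0.5:
        return "MEDIUM"
    return None


def two_sided_performance(tight: float, loose: float,
                          component_performance: str,
                          component_reliability: str) -> Optional[str]:
    """Two-sided rule that combines both coupling aspects."""
    degraded = component_performance == "LOW" or component_reliability == "LOW"
    if tight > 0.5 and loose < 0.4 and degraded:
        return "LOW"
    return None


# Example: a component coupled with the IFS (0.8, 0.1) to a business service.
print(business_performance(0.8, "LOW"))                 # LOW
print(business_reliability(0.1, "LOW"))                 # LOW
print(two_sided_performance(0.8, 0.1, "HIGH", "LOW"))   # LOW
```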

[Joshi et al. 2009] showed several examples of static coupling scenarios with individual rules. These include a Software as a Service (SaaS) example of collaboration tool services, a helpdesk service with large human elements and an example for Infrastructure as a Service (IaaS). These rules involve three types of service elements, i.e. human agents, the actual software that encodes the service and other infrastructure resources.

When defining couplings we also need to consider dynamic coupling concepts, as service-oriented architectures and virtualized infrastructures have become much more dynamic. A “dynamic” coupling measure can be built from the interaction between services in a system at runtime. When building a service, a developer may assume that service A can interact with services B, C and D, but at runtime it only communicates with service B. Calculating the dynamic coupling between services will therefore give a more exact result than one based on the design specification. As an example, the proposed metric for tight coupling of service interactions measures the level of coupling between two services in a system as the percentage of the number of calls (connections) from A to B compared to the number of calls (connections) from A to the other services in the system (environmental coupling).

This means that the coupling level itself should then be fuzzified and mapped to linguistic parameters. The coupling is thus no longer static but, just like the other component performance parameters, has to be monitored and evaluated during operation under service management.
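
The call-based metric can be computed from a runtime call log. The sketch below rests on one interpretation that the text leaves open: the coupling from A to B is taken as the share of all outgoing calls of A that go to B. The log format (a list of caller and callee pairs) and the sample data are hypothetical.

```python
from collections import Counter


def dynamic_coupling(calls, source, target):
    """Share of the outgoing calls of `source` that are addressed to `target`."""
    outgoing = Counter(callee for caller, callee in calls if caller == source)
    total = sum(outgoing.values())
    return outgoing[target] / total if total else 0.0


# Hypothetical runtime call log as (caller, callee) pairs.
call_log = [("A", "B")] * 8 + [("A", "C")] * 2 + [("B", "D")] * 5

print(dynamic_coupling(call_log, "A", "B"))  # 0.8
print(dynamic_coupling(call_log, "A", "D"))  # 0.0
```

The resulting degree can then be fuzzified like any other monitored performance parameter.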

The linguistically defined impact R can in this case also be mapped as a fuzzy set with a two-dimensional membership function of the tight coupling index μA(x) and the loose coupling index νA(x).

R(x, y) = f(μA(x), νA(x)) = f(a,b), with a = μA(x), b = νA(x)

where f is called the fuzzy implication function providing the membership value of the predicted impact. One of the key operations in fuzzy logic and approximate reasoning is the fuzzy implication, which is usually performed by an operator, called an implication function or, simply, an implication. Many fuzzy rule based systems do their inference processes through these operators which are useful also in fields like composition of fuzzy relations or fuzzy relational equations.

For impact assessments, the difference δ = μA(x) - νA(x) and the ratio ο = μA(x) / νA(x) between the truth degrees of tight and loose coupling are meaningful indicators for reasoning about probable impact effects and can be used as key parameters in the implication function.

δ = μA(x) - νA(x) (absolute) | ο = μA(x) / νA(x) (relative) | Business Impact | Risk             | Cost
-----------------------------|------------------------------|-----------------|------------------|------------------
≅ 0                          | ≅ 1                          | moderate        | moderate         | moderate
[>0;0.5]                     | [>1;2]                       | moderate high   | moderate to high | moderate low
[>0.5;1]                     | > 2                          | high            | high             | low to very low
[<0;-0.5]                    | [0.5;1]                      | moderate low    | moderate to low  | high moderate
[<-0.5;-1]                   | < 0.5                        | low             | low              | high to very high

TABLE 6: BUSINESS IMPACT VERSUS COST AND RISK

An important aspect when defining the required SLA guarantees is the reduction of operational costs. This essentially results in a problem with multiple objectives (here performance and cost) that conflict with each other.

If δ = μA(x) - νA(x) ≅ 0 or ο = μA(x) / νA(x) ≅ 1, these parameters are balanced and can be considered risk/cost efficient. If the business can accept the proposed service levels, system and software engineers can define the system components in a risk/cost-efficient way during the system design and development phase.
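As an illustration of how the δ ranges of Table 6 could be evaluated automatically, the following sketch maps a pair of tightly and loosely coupling degrees to the corresponding row of the table. The function name `classify_delta` and the tolerance `tol` used for "approximately 0" are assumptions; the paper does not specify how close to zero counts as balanced.

```python
def classify_delta(mu, nu, tol=0.05):
    """Return the (business impact, risk, cost) row of Table 6 for delta = mu - nu.

    mu: tightly coupling degree, nu: loosely coupling degree, both in [0, 1]
    with mu + nu <= 1 (intuitionistic fuzzy set condition).
    tol: assumed half-width of the balanced "delta is about 0" band.
    """
    if not (0.0 <= mu <= 1.0 and 0.0 <= nu <= 1.0) or mu + nu > 1.0 + 1e-9:
        raise ValueError("mu and nu must lie in [0, 1] with mu + nu <= 1")
    delta = mu - nu
    if abs(delta) <= tol:
        return ("moderate", "moderate", "moderate")
    if 0 < delta <= 0.5:
        return ("moderate high", "moderate to high", "moderate low")
    if delta > 0.5:
        return ("high", "high", "low to very low")
    if -0.5 <= delta < 0:
        return ("moderate low", "moderate to low", "high moderate")
    return ("low", "low", "high to very high")


balanced = classify_delta(0.43, 0.43)
tight = classify_delta(0.9, 0.1)
```

A balanced pair such as (0.43, 0.43) falls into the risk/cost efficient "moderate" row, whereas a strongly tightly coupled pair such as (0.9, 0.1) yields high impact and risk at low cost.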


5.7 ADAPTED IMPACT CALCULATION FOR GRADUAL FAILURES

5.7.1 FROM BI-MODAL TO GRADUAL FAILURE SITUATIONS

To reduce the complexity of operational monitoring, compliance with technical performance parameters is in practice mostly measured in a bi-modal way (components either operate correctly or fail), which with regard to SLAs means compliant with the specification or not. This can be refined by distinguishing the attributes by failure mode, such as outage or slow response, where each failure mode is again monitored as a binary condition. This makes sense because different failure modes of a system component can result in different system performance problems and need individually appropriate safeguards. Failure modes may also be monitored differently: a crash, for example, can be detected by the system monitoring software, whereas a hang is more difficult to detect and may need a manual check. Such binary failure modes may change over time; for example, the failure mode slow response may evolve into an outage.

Although binary failure modes (e.g. slow response Y/N, outage Y/N) are mostly applied in monitoring for reasons of simplicity and manageability, the model can be extended for impact assessment by specifying a granular failure or service degradation level that causes the impact. With this approach it is possible to predict impacts based on the degree to which a specific SLA with its corresponding performance measurement is fulfilled. This allows, for example, forecasting the impact of 80% SLA achievement or 60% compliance of a performance parameter. A challenge here is measuring the granular operational service level (that is, specifying to which degree an SLA is met) and the corresponding level of degradation of the performance values.

As shown, the direct coupling dependencies can be visualized in a directed graph representing the direct impacts. The map consists of nodes and arcs between nodes. Each node represents a quality characteristic of the system. In the IT landscape model these characteristics can indicate the level of compliance with the SLA quality targets. Each service level specification parameter, described as a Key Quality Indicator (KQI), represents a node. Each KQI is characterized by a number Ai in the interval [0,1] that represents its value and results from the transformation of the SLA compliance level for which the node stands. The tightly coupling model describes the causal relationships between two nodes. A decrease in the value of a quality parameter (QoS) or SLA compliance level yields a corresponding decrease at the nodes connected to it via tightly coupling relationships. As most tightly coupling metrics, such as the Fenton and Melton metric, are inter-modular coupling measurements that calculate the coupling between each pair of components in the system, soft effects of partial functioning or degraded SLA compliance between IT components can be modeled directly using this approach.


5.7.2 APPLYING DERIVED MATHEMATICAL MODELS FROM FCM

This concept is derived from the mathematical model of cognitive maps. In 1986 Bart Kosko [Kosko 86] introduced the notion of fuzziness to cognitive maps and created the theory of Fuzzy Cognitive Maps (FCMs). A fuzzy cognitive map is a cognitive map within which the relations between the elements (e.g. components, IT resources) can be used to compute the "strength of impact" of these elements.

FCMs are used in a much wider range of applications [Stylios, Georgopoulos 1997], all of which deal with creating and using models of impacts in complex processes and systems. In the IT landscape scenario FCMs can be used to describe mutual dependencies between infrastructure and higher level IT components.

In this extended model the activation level of a quality parameter indicates the level of SLA compliance, e.g. what the impact is if a technical component is 80% compliant with its specifications.

We can now use the standard mathematical model of the FCM approach to compute the value of each quality parameter, which is influenced by the values of the coupled quality indicators with the appropriate weights and by its previous value. The value Ai for each quality indicator KQIi can be calculated by the following rule:

$$A_i^{t+1} = f\Big(\sum_{j \neq i} A_j^{t}\, W_{ji} + A_i^{old}\Big)$$

where Ai(t+1) is the activation level of quality parameter KQIi at time t+1, Aj(t) is the activation level of quality parameter KQIj at time t, Ai(old) is the activation level of quality parameter KQIi at time t, Wji is the weight of the dependence coupling between KQIj and KQIi, and f is a threshold function.

In vector form, the new state vector Anew is computed by multiplying the previous state vector Aold by the weight matrix W, adding the previous state and applying f.

The weight of the dependency between KQIj and KQIi can be positive (Wji > 0), which means that an increase in the value of KQIj leads to an increase in the value of KQIi, and a decrease in the value of KQIj leads to a decrease in the value of KQIi. Or there is negative causality (Wji < 0), which means that an increase in the value of KQIj leads to a decrease in the value of KQIi and vice versa.
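One update step of this rule can be sketched as follows. The names `fcm_step` and `clip01` are illustrative, and clipping to [0, 1] is only an assumed stand-in for the threshold function f, which the text does not fix (a sigmoid is another common choice). The entry `weights[j][i]` holds Wji, the influence of KQIj on KQIi.

```python
def clip01(x):
    """Assumed threshold function f: keep activation levels inside [0, 1]."""
    return max(0.0, min(1.0, x))


def fcm_step(activations, weights, f=clip01):
    """One FCM update: A_i(t+1) = f(sum over j != i of A_j(t) * W_ji + A_i(t))."""
    n = len(activations)
    new_state = []
    for i in range(n):
        influence = sum(
            activations[j] * weights[j][i] for j in range(n) if j != i
        )
        new_state.append(f(influence + activations[i]))
    return new_state


# Hypothetical two-node map: KQI 0 influences KQI 1 with weight 0.5.
W = [[0.0, 0.5],
     [0.0, 0.0]]
state = fcm_step([0.2, 0.3], W)
```

Here KQI 0 keeps its value because nothing influences it, while KQI 1 receives 0.2 * 0.5 on top of its previous value of 0.3.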


5.7.3 EXTENDING IFCFIA STEP 5: INDIRECT COUPLING CALCULATIONS

The direct couplings with their corresponding intuitionistic fuzzy relationships can be drawn in a directed graph; indirect couplings can then be calculated from the direct coupling degrees. We now also add the activation levels of the KQIs. Each KQI is characterized by a number Ai in the interval [0,1] that represents its value and results from the transformation of the SLA compliance level for which this KQI stands.

FIGURE 21: INDIRECT COUPLINGS WITH KQI ACTIVATION LEVELS

As an example, the Forward Coupling Calculation (FCC) method (used for impact analysis) is applied to indcpl(C2,B0) in the example graph, which gives the indirect coupling dependency of the business application B0 on the component C2.

indcpl classic (C2,B0) = (0.60, 0.30)
indcpl moderate (C2,B0) = (0.43, 0.43)
indcpl worst (C2,B0) = (0.60, 0.30)
indcpl best (C2,B0) = (0.36, 0.51)

The KQI activation level for B0 at time t+1 can now be calculated as follows, using an activation level KQI(t) B0 = 0.8 at time t and a decrease of 0.3 in the performance indicator C2:

KQI(t+1) B0 classic = 0.8 - 0.3 * 0.60 = 0.62
KQI(t+1) B0 moderate = 0.8 - 0.3 * 0.43 = 0.671
KQI(t+1) B0 worst = 0.8 - 0.3 * 0.60 = 0.62
KQI(t+1) B0 best = 0.8 - 0.3 * 0.36 = 0.692


So if the performance indicator C2 decreases by 0.3, we can estimate an impact on the quality indicator KQI B0 of a decrease between 0.108 and 0.18.
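The four calculations above can be reproduced with a few lines of code. This is only a sketch of the arithmetic shown in the text; the function name `kqi_next` and the dictionary layout are illustrative assumptions.

```python
# Indirect coupling pairs (membership, non-membership) of B0 on C2 from the text.
indcpl = {
    "classic": (0.60, 0.30),
    "moderate": (0.43, 0.43),
    "worst": (0.60, 0.30),
    "best": (0.36, 0.51),
}


def kqi_next(kqi_now, component_drop, coupling_membership):
    """KQI at t+1: current level minus the component decrease scaled by the coupling."""
    return kqi_now - component_drop * coupling_membership


# KQI(t) of B0 is 0.8 and the performance indicator C2 decreases by 0.3.
results = {
    mode: round(kqi_next(0.8, 0.3, mu), 3) for mode, (mu, _nu) in indcpl.items()
}
decreases = sorted(round(0.8 - value, 3) for value in results.values())
impact_range = (decreases[0], decreases[-1])
```

The resulting impact range of (0.108, 0.18) matches the estimate stated above.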

This simple approach can be useful especially for planning purposes, where it is required to consider how several smaller improvements at different infrastructure components (e.g. improvements in performance or throughput) will in total impact a business service performance parameter KQI. All single impacts are pulled together and aggregated into the total effect on the business service KQI.


6. IFCFIA USE CASES


6.1 SCENARIO: INCIDENT IN LOGISTICS MANAGEMENT

6.1.1 OVERVIEW OF SCENARIO

Incident discovered by system monitoring

In our example IT Service Management receives a failure event at an infrastructure component: in the scenario below, a database server failed to start a required database service. This incident is discovered by monitoring the correct start and availability of the relevant database services with standard monitoring tools such as IBM Tivoli Monitoring, and an error event is created which automatically results in a problem ticket for the corresponding service desk. Today’s enterprise business service management should not only be concerned about a failed component; it must be more concerned with the impact of that component on the business. Unfortunately this relation and the dependencies are not obvious, and the impact of this failure cannot be assessed at all by the service desk maintaining the database infrastructure.

The IT department assumes that this may have an impact on the Logistics Management application, but no clear dependency or usage maps are available, so this remains an assumption, as hundreds of database servers are operated across the datacentre. Business management is concerned about this unclear situation and therefore asked for several methods of impact analysis to be evaluated. Finally it was decided to create an IFCFIA framework for the Logistics Management business application, unveiling its dependencies on the underlying infrastructure components.

Background on Logistics Management Application

The Logistics Management Application has a 4-tier client-server architecture with the essential components Browser/PCs, Web Server, Application Server and Database Server.

FIGURE 22: J2EE 4 TIER CLIENT-SERVER ARCHITECTURE

As the Logistics Management Application is based on Java 2 Platform, it is a distributed application and can be classified into a set of layers. The typical separation of layers in J2EE applications is Presentation Layer, Controller Layer, Business Layer and Data Layer.


The separation of software systems into front-end and back-end layers simplifies development and separates maintenance. This is mapped to the corresponding infrastructure topology. A four-tier topology provides an efficient physical and logical layout to support scaling out or scaling up, and allows distribution of services across the member servers of the datacentre. In a J2EE application architecture, the application server can be isolated from the web server and from the database. The application server does not know about the web server and vice versa; this decouples these layers, and there are no dependencies from a code or functional perspective. Several communication protocols are used: HTTP requests are exchanged between browser and web server; the business layer consists of one or more application servers on which the Logistics Management Application (LMA) is invoked using Remote Method Invocation (RMI). The application servers get persistent data from the fourth layer, called the data access layer, which consists of one or multiple databases. Usually Structured Query Language (SQL) statements are passed between the third and fourth layer.

FIGURE 23: TOPOLOGY LOGISTICS MANAGEMENT J2EE APPLICATION

6.1.2 SERVICE COMPONENTS AUTO-DISCOVERY “LOGISTICS MANAGEMENT”

As a first step an automated discovery is launched over the datacentre infrastructure scope in which the Logistics Management Application is deployed. In our use case we take IBM’s Tivoli Application Dependency Discovery Manager (TADDM) as the auto-discovery solution, which provides automated application dependency mapping and configuration auditing. The TADDM discovery reaches down to network devices, storage devices, cross-tier dependencies and run-time configurations. TADDM employs agent-free discovery, together with a Data Centre Reference Model, to produce cross-tier dependency maps and topological views. As part of the discovery process, the TADDM discovery feature examines the configuration of each device and discovers the ports that are assigned to the applications. It uses this information to determine relationships and dependencies between applications and other discovered components. A dependent component relies on data or configurations from another component, and a provider component provides information to a dependent component. The basic automated discovery finds dependencies either by looking at the TCP connections or by evaluating the configuration of programs (e.g. JDBC resources).

In principle three types of dependencies can be automatically discovered, transactional dependencies, service dependencies and IP dependencies:

Transactional dependencies occur between application components, such as Web servers, application servers, and databases. The dependent component issues requests to the provider component in order to perform certain functions, such as JDBC calls from a J2EE server to a database. In this case, the provider is often referred to as a server and the dependent as a client.

Service dependencies occur between application components and infrastructure services, such as DNS, LDAP, and NFS. The provider is the infrastructure service, and the dependent component requests system services from the provider, such as a request to map a DNS name to an IP address.

IP dependencies occur between two computer systems or between an application server and a computer system. TADDM creates this type of relationship when it discovers a relationship between two computer systems but cannot discover exactly what kind of relation is involved.

After the application discovery, dependency mapping as the second step creates visibility into the discovered applications (in our example the Logistics Management Application) and infrastructure dependencies. Automated application discovery and subsequent dependency mapping unveil the relationships which are needed to define the basic structure of the IFCFIA Grid.

Building the Logistics Management business application definition is the next step. Business applications in TADDM can contain any number and kind of lower-level resources. Based on the already discovered relationships, TADDM can automatically add the closest connected components, for example the servers on which the applications run, or the switches and routers between the servers. The purpose of this grouping is to bring together various lower-level objects and their relationships and treat them as units in order to perform reporting and analysis.


Besides automatic grouping, we can create business applications or services manually in TADDM. A business application groups different kinds of IT resources into a logical group that acts together as one unit to provide a service. Business applications in TADDM can be defined via [O'Brien, 2008]:

Application descriptors (development or deployment time): tagging of the application components

Template-based definition (operations time): signature-based application grouping

Manual definition (post-discovery): manual drag-and-drop grouping of application components

In the Logistics Management scenario application descriptors are applied and a manual grouping has also been defined. In the TADDM Grouping Composer we can navigate down into the Inventory Summary to the computer systems and applications which are part of the Logistics Management business application and add the related software systems.

FIGURE 24: TADDM GROUPING COMPOSER FOR MANUAL ASSIGNMENTS

The highest grouping level in TADDM is the business service, which in our scenario is the “Bill Payment Service”. It comprises two applications, the Billing and the Logistics Management Application, both of which are included within the Bill Payment Service process flow.

FIGURE 25: BILL PAYMENT BUSINESS SERVICE


In our scenario we create the full Fault Tree for the supporting infrastructure components only for the Logistics Management Application. To do so, we need to assign at least the highest infrastructure component level to the business application. For the use case we assign the two frontend software systems (Web Servers) to the Logistics Management Application in the TADDM Grouping Composer.

FIGURE 26: GROUPING OF FRONTEND COMPONENTS (WEBSERVER)

FIGURE 27: LOGISTICS MANAGEMENT SOFTWARE COMPONENTS

Once the business application has been grouped, TADDM shows a structured diagram of the Logistics Management business application with the related software components.

6.1.3 CREATING THE FAULT TREE FOR LOGISTICS MANAGEMENT APPLICATION

So far we have assigned in the grouping the highest level of software components of the Logistics Management business application, which are the Apache Web Servers.


The highest level (L1) of the Fault Tree hierarchy is always the business service, in our example the Logistics Management Application.

FIGURE 28: FRONTEND SOFTWARE COMPONENTS OF LOGISTICS MANAGEMENT

The Logistics Management Application has two frontend software components:

cleopatra.lab.collation.net:4580 - Web Server
hpux1.lab.collation.net:3880 - Web Server

The dependencies can be exported as XML to allow automated processing of the discovered dependencies.

FIGURE 29: XML EXPORT OF LOGISTICS MANAGEMENT DEPENDENCIES

We now navigate to the first Web Server, hpux1.lab.collation.net:3880. In the TADDM details view, we choose the dependency tab and extract a list of the dependent software components for the Web Server, which represents Level 2 (L2), the Web Server layer, in our Fault Tree. Alternatively, a custom report can be developed if more automation is required for large environments.

FIGURE 30: L2 DEPENDENCY FOR WEB SERVER HPUX1.LAB.COLLATION.NET:3880


The L2 Web Server hpux1.lab.collation.net:3880 is dependent on two software components:

Transactional (Web Logic Server) histronix.lab.collation.net:7021
Transactional (Java Server) caesar.lab.collation.net:2809

Now we go down to examine the Level 3 (Application Server Layer)

FIGURE 31: L3 DEPENDENCY FOR WEB LOGIC SERVER HISTRONIX.LAB.COLLATION.NET

The Web Logic Server histronix.lab.collation.net:7021 is dependent on two software components:

Transactional (Sybase Enterprise Database Server) whatzit.lab.collation.net:4002
Service Dependency 192.168.1.21:53 (DNS/NIS service)

We are going further down to Level 4 which represents the Data Layer.

FIGURE 32: L4 DEPENDENCY FOR DATABASE SERVER

Here we have no transactional dependency to a software system, only the dependency to the DNS/NIS service.


This results in a Level 5 which represents a technical DNS/NIS service. This service is widely used for most software components.

FIGURE 33: L5 DEPENDENCY DNS/NIS SERVICE MAJESTIX.ENG.COLLATION.NET:53

The DNS/NIS service also needs a computer system on which it is deployed, which results in Level 6. In our complete Directed Graph, we have also assigned computer system dependencies to all other software components in the different layers.

FIGURE 34: L6 DEPENDENCY COMPUTER SYSTEM MAJESTIX.ENG.COLLATION.NET

We can now go on to the lowest level in our example, which consists of devices of the computer system such as network adapters. Here we see the Intel® PRO/1000 network adapter as a Level 7 (L7) dependency.

We could go even further down, for example by assigning technical documentation of the network adapter or any support function as L8. In the example we stop with the network adapter as the lowest examined level.

The seven-level Directed Graph created so far gives us a relationship from an incident occurring at a network adapter (here at majestix.eng.collation.net) up to the impacted business application “Logistics Management”.
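The upward path from a failed low-level device to the business application can be sketched as a simple graph traversal. The dictionary below only encodes the chain walked through above for the first Web Server; the data structure, the node labels and the function name `impacted_by` are illustrative assumptions and not part of TADDM.

```python
from collections import deque

# provider -> components that directly depend on it (L7 up to L1)
DEPENDENTS = {
    "Intel PRO/1000 network adapter": ["majestix.eng.collation.net"],
    "majestix.eng.collation.net": ["majestix.eng.collation.net:53"],
    "majestix.eng.collation.net:53": ["whatzit.lab.collation.net:4002"],
    "whatzit.lab.collation.net:4002": ["histronix.lab.collation.net:7021"],
    "histronix.lab.collation.net:7021": ["hpux1.lab.collation.net:3880"],
    "hpux1.lab.collation.net:3880": ["Logistics Management"],
}


def impacted_by(failed_component):
    """Breadth-first walk returning every component reachable from the failure."""
    seen = {failed_component}
    queue = deque([failed_component])
    impacted = []
    while queue:
        node = queue.popleft()
        for dependent in DEPENDENTS.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                impacted.append(dependent)
                queue.append(dependent)
    return impacted


chain = impacted_by("Intel PRO/1000 network adapter")
```

Starting from the network adapter, the traversal visits six further nodes and ends at the business application, which corresponds to the seven levels of the graph.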

We will have a brief look at the second frontend component, the Level 2 Web Server cleopatra.lab.collation.net:4580.


FIGURE 35: L2 DEPENDENCY FOR WEB SERVER CLEOPATRA.LAB.COLLATION.NET:4580

The second Web Server cleopatra.lab.collation.net:4580 is dependent on three software components:

Transactional (Web Logic Server) histronix.lab.collation.net:7021
Transactional (Web Logic Server) homeopatix.lab.collation.net:7021
Service Dependency 192.168.1.21:53 (DNS/NIS service)

Here again we can go down the dependent software components on L3 which represent the Application Servers (WebLogic Server).

FIGURE 36: L3 DEPENDENCY FOR WEB LOGIC SERVER

The Web Logic Server homeopatix.lab.collation.net:7021 is dependent on three software components:

Transactional (Sybase Server) brutus.lab.collation.net:2638
Transactional (Microsoft SQL Server) hades.lab.collation.net:1433
Service Dependency 192.168.1.21:53 (DNS/NIS service)

The Sybase Database Server brutus.lab.collation.net:2638 depends only on the underlying DNS service as L4; for the Microsoft SQL Server hades.lab.collation.net:1433 no further dependency is discovered besides the underlying computer system (L4).


The second L3 Web Logic Server histronix.lab.collation.net:7021 has already been examined, as this software component is also related to the first Web Server hpux1.lab.collation.net:3880.

An overview of the relationships between the software components is shown in the software topology for Logistics Management.

FIGURE 37: LOGISTICS MANAGEMENT SOFTWARE TOPOLOGY LEVEL 2-4

Now we can provide accurate mappings to the machines and devices supporting the software components which are part of the Logistics Management business application.

Finally we need to assign in our complete Fault Tree the supporting computer system dependencies to all software components in the different layers. For instance, the Web Server hpux1.lab.collation.net:3880 is deployed on the HP computer system hpux1.lab.collation.net.

FIGURE 38: COMPUTER SYSTEM L2 DEPENDENCY FOR HPUX1.LAB.COLLATION.NET

Remark: In addition to the complexity arising from the size and number of components, there is a trend towards server consolidation, using multiprocessor machines to reduce the overall number of systems to manage and to harvest otherwise wasted CPU capacity of servers that might not have much load under a 1-server-1-application architecture. This makes it more difficult to provide accurate mappings of functions to machines or applications, because several instances of databases or Web servers coexist on the same host machine, each capable of performing separate critical tasks in the environment. To find out, for instance, what else is deployed on the same server as the Web Server hpux1.lab.collation.net:3880, we can simply look at the dependency tab in the details view of the server hpux1.lab.collation.net.

FIGURE 39: DEPENDENCIES ON THE HPUX1.LAB.COLLATION.NET COMPUTER SYSTEM

The same computer system dependency assignment needs to be performed through the created fault tree over all software component layers.

FIGURE 40: SOFTWARE COMPONENTS DEPENDING ON COMPUTER SYSTEMS

The result of the discovered relations is shown in TADDM in the physical system connection topology graph of the Logistics Management including the computer systems as depicted in the following figure:


FIGURE 41: LOGISTICS MANAGEMENT APPLICATION PHYSICAL TOPOLOGY

To summarize, for the software components the constructed Fault Tree represents the separation of layers in J2EE applications: Presentation Layer (Browser/PCs), Controller Layer (Web Servers), Business Layer (WebLogic Server) and Data Layer (MS SQL, Sybase).

FIGURE 42: FAULT TREE SOFTWARE COMPONENT LEVELS REPRESENTING THE J2EE LAYER

6.1.4 TADDM SERVER AFFINITY REPORT

The TADDM Server Affinity Report offers a more automated approach to extracting the transactional and/or service dependencies which may have an indirect impact on the Logistics Management business application. The relationships can be retrieved with this report:

https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Tivoli%20Application%20Dependency%20Discovery%20Manager/page/Shipped%20reports

This report displays relationships between servers, arranged according to the source and target of each relationship. The first table displays all servers within the specified scope that are sources of relationships, and the connections from those servers to other servers. The second table displays all servers within the specified scope that are targets of relationships, and the connections to those servers from other servers [IBM TADDM, 2012].

FIGURE 43: TADDM SERVER AFFINITY REPORT

The affinity report can be exported and filtered for the relevant hosts with the corresponding transactional and/or service dependencies. We may also filter for routers, switches, network cards and similar components. The TADDM Server Affinity Report provides several export formats, e.g. a spreadsheet or XML representation.

6.1.5 COMPONENT TOPOLOGY BILLING APPLICATION

As already discussed, the highest grouping level in TADDM is the business service “Bill Payment Service”, which comprises two applications, the Billing and the Logistics Management Application, both of which are included within the Bill Payment Service process flow.

FIGURE 44: BILL PAYMENT BUSINESS SERVICE

In our scenario we do not create a detailed Fault Tree for the Billing Application, as this has been done in detail for the Logistics Management Application. We show the physical topology here to provide an overview of this application as well.

FIGURE 45: BILLING APPLICATION PHYSICAL TOPOLOGY

6.1.6 CREATING THE LOGISTICS MANAGEMENT IFCFIA GRID

We have now gathered the necessary topology information to build an IFCFIA grid representing a dependency tree with the business application at the top level. All the data has been extracted from the TADDM database, which contains all necessary information to provide a topological view of the components (nodes) with the type of dependency. We mapped the components to one (or more) business applications and discovered all their direct and indirect relations. This is now assembled in a structured data format to allow further processing and includes hardware, software, network components, servers, services and all other IT components with their relations. With this data we can create the grid with components on one axis and the components (or IT services) which depend on them on the other axis. The table lists components and their direct dependants vertically. The highest level, level 1, is the business application, which is added horizontally in the following step. So the vertically listed components start with level 2.
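Assembling the discovered relations into a grid can be sketched as below. The record layout (level, component id, dependency type, parent id) and the function name `build_grid` are assumptions for illustration; the sample records are relations named in the fault tree walk-through, not an export of the TADDM database.

```python
# (hierarchy level, discovered component id, dependency type, parent component id)
records = [
    ("L2", "hpux1.lab.collation.net:3880", "composition", "Logistics Management"),
    ("L3", "histronix.lab.collation.net:7021", "transactional", "hpux1.lab.collation.net:3880"),
    ("L3", "caesar.lab.collation.net:2809", "transactional", "hpux1.lab.collation.net:3880"),
    ("L4", "whatzit.lab.collation.net:4002", "transactional", "histronix.lab.collation.net:7021"),
]


def build_grid(relations):
    """Grid with discovered components as rows and their parents as columns.

    A cell holds the dependency type, or an empty string if the row
    component has no direct relation to the column component.
    """
    parents = sorted({parent for _level, _comp, _dep, parent in relations})
    grid = {comp: {p: "" for p in parents} for _level, comp, _dep, _parent in relations}
    for _level, comp, dep_type, parent in relations:
        grid[comp][parent] = dep_type
    return grid


grid = build_grid(records)
```

Each row can later be extended with further columns, such as coupling indices and certainty levels, without changing the structure of the grid.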

The following table shows the structural Dependency Tree created for the Logistics Management application, with all discovered components listed on the vertical axis and the relation to the directly related component. The hierarchy level indicates the level in the Dependency Tree. The table also shows the component and dependency types.

TABLE 7: COMPONENT RELATIONSHIP MATRIX

| Hierarchy Level | Discovered Component Configuration Item (CI) | Discovered Component Id | Dependency Type | Parent Component Id's | Parent Component Type |
|---|---|---|---|---|---|
| L2 | Web Server | hpux1.lab.collation.net:3880 | composition | Logistics Management | Business Application |
| L3 | HP UX Computer System | hpux1.lab.collation.net | service | hpux1.lab.collation.net:3880 | Software Component |
| L3 | HP UX Computer System | hpux1.lab.collation.net | service | hpux1.lab.collation.net:3880 | Software Component |
| L4 | Cisco | moralelastic.lab.collation.net | service | hpux1.lab.collation.net | Computer System |
| L4 | ethernet-scmacd | ethernet-scmacd (6) | service | hpux1.lab.collation.net | Computer System |
| L3 | Java Server | caesar.lab.collation.net:2809 | transactional | hpux1.lab.collation.net:3880 | Apache AppServer |
| L4 | Computer System | panacea.lab.collation.net | service | caesar.lab.collation.net:2809 | Software Component |
| L3 | WebLogic Server | histronix.lab.collation.net:7021 | transactional | hpux1.lab.collation.net:3880; cleopatra.lab.collation.net:4580 | Apache AppServer |
| L4 | Sun Sparc Computer System | histronix.lab.collation.net | service | histronix.lab.collation.net:7021 | Software Component |
| L5 | Network Device | dmfe1 | service | histrionix.lab.collation.net | Computer System |
| L4 | Sybase Server Enterprise | whatzit.lab.collation.net:4002 | transactional | histrionix.lab.collation.net:7021 | WebLogic Server |
| L5 | Sun Sparc Computer System | whatzit.lab.collation.net | service | whatzit.lab.collation.net:4002 | Software Component |
| L2 | Web Server | cleopatra.lab.collation.net:4580 | composition | Logistics Management | Business Application |
| L4 | Sun Sparc Computer System | cleopatra.lab.collation.net | service | cleopatra.lab.collation.net:4580 | Software Component |
| L3 | WebLogic Server | homeopathix.lab.collation.net:7021 | transactional | cleopatra.lab.collation.net:4580 | Apache AppServer |
| L4 | Sun Sparc Computer System | homeopathix.lab.collation.net | service | homeopathix.lab.collation.net:7021 | Software Component |
| L5 | Network Device | dmfe2 | service | homeopathix.lab.collation.net | Computer System |
| L3 | WebLogic Server | histrionix.lab.collation.net:7021 | transactional | hpux1.lab.collation.net:3880; cleopatra.lab.collation.net:4580 | Apache AppServer |
| L4 | Sun Sparc Computer System | histronix.lab.collation.net | service | histronix.lab.collation.net:7021 | Software Component |
| L4 | Microsoft SQL Server | hades.lab.collation.net:1433 | transactional | homeopathix.lab.collation.net:7021 | WebLogic Server |
| L5 | Windows Computer System | hades.lab.collation.net | service | hades.lab.collation.net:1433 | Software Component |
| L6 | Intel PRO/1000 MT | Intel PRO/1000 MT Network 2 | service | hades.lab.collation.net | Computer System |
| L4 | Sybase Adaptive Server IQ | brutus.lab.collation.net:2638 | transactional | homeopathix.lab.collation.net:7021 | WebLogic Server |
| L5 | Sun Sparc Computer System | brutus.lab.collation.net | service | brutus.lab.collation.net:2638 | Software Component |
| L4, L5 | DNS/NIS service | majestix.eng.collation.net:53 | service | histrionix.lab.collation.net:7021; brutus.lab.collation.net:2638; homeopathix.lab.collation.net:7021; whatzit.lab.collation.net:4002 | Web Logic Server; Sybase Server |
| L5, L6 | Computer System | majestix.eng.collation.net | service | majestix.eng.collation.net:53 | Software Component |
| L3, L5 | Cisco Router | aniline.lab.collation.net | service | cleopatra.lab.collation.net; whatzit.lab.collation.net | Computer System |
| L4, L5 | Cisco Router | orinjade.lab.collation.net | service | homeopathix.lab.collation.net; histrionix.lab.collation.net; hades.lab.collation.net | Computer System |
| L3 | Cisco Router | moralelastix.lab.collation.net | service | hpux1.lab.collation.net; whatzit.lab.collation.net | Computer System |
| L4 | Switch | infarctus.lab.collation.net | service | homeopathix.lab.collation.net; histrionix.lab.collation.net | Computer System |

We can now optionally extend the tree with additional logical dependencies, which can express dependencies on e.g. IT users, IT staff, business units and supporting processes and functions. After building the dependency tree from the auto-discovered relationships, in the next step we assess the level of resilience. For loosely coupling, based on resilience measurements, an intrinsic coupling metric is chosen, as each component has individually designed resilience capabilities.

In the grid we can show all data relevant for the Loosely Coupling assessment as columns to be filled in by the relevant system administrators, so that the Loosely Coupling Indicator can be calculated using e.g. the described formula, or simply by an expert assessment. The Loosely Coupling Indicator (or Resilience Index) can be written directly into the grid as an additional column. We require the experts to set a level of certainty next to the Loosely Coupling Index. Not only the technical failover capability should be considered, but also the risk of whether these methods will succeed completely, will result only in a partial or limited restoration, or have never been tested.

If several failure modes are considered (such as outage, slow response, limited function), each failure mode is now represented in a separate line. The following grid is assembled by extending the relationship matrix with additional data. The green columns show the intrinsic Loosely Coupling Index with the corresponding certainty assessment. We show here only the dependent components of the Web Server hpux1.lab.collation.net:3880:

TABLE 8: DETERMINING THE LOOSELY COUPLING INDEX

Discovered Component Configuration Item (CI) | Discovered Component Id | Failure Mode and Effect | % Availability QPI | Mean Time to Repair in hours | Mean Time Between Failures (MTBF) in hours | Failover and recovery attributes (Y/N) | Loosely Coupling Index (resilience) | Certainty LC
Web Server | hpux1.lab.collation.net:3880 | Outage | 99.99 | 24 | 10.000 | N N N Y Y N N Y Y | 0.75 | VH
HP UX Computer System | hpux1.lab.collation.net | Outage | 99.50 | 24 | 10.000 | Y Y N N N N Y N N | 0.88 | H
HP UX Computer System | hpux1.lab.collation.net | Slow Response | 99.50 | 24 | 10.000 | Y Y N N N N Y N N | 0.88 | H
Cisco | moralelastic.lab.collation.net | Limited Function | 99.20 | 12 | 2.000 | Y Y N N N N Y N N | 0.82 | M
ethernet-scmacd | ethernet-scmacd (6) | Limited Function | 99.30 | 48 | 5.000 | N N N Y Y Y N N Y | 0.65 | M
Java Server | caesar.lab.collation.net:2809 | Outage | 99.85 | 24 | 50.000 | N N N N N N N N Y | 0.69 | H
Computer System | panacea.lab.collation.net | Slow Response | 99.20 | 12 | 2.000 | Y Y N N N N Y N N | 0.82 | M
WebLogic Server | histronix.lab.collation.net:7021 | Outage | 99.85 | 24 | 10.000 | Y Y N N N N Y N N | 0.75 | VH
Sun Sparc Computer System | histronix.lab.collation.net | Outage | 99.20 | 12 | 2.000 | Y Y N N N N Y N N | 0.82 | M
Network Device | dmfe1 | Limited Function | 99.50 | 12 | 10.000 | Y Y N N N N Y N N | 0.57 | L
Sybase Server Enterprise | whatzit.lab.collation.net:4002 | Outage | 99.85 | 24 | 10.000 | Y Y N N N N Y N N | 0.88 | VH

Y/N attribute columns: Single Point of Failure; Hot/Warm Failover; Cold Recovery; Recovery Method; Recovery Proc.; Integrated; Partly Integrated; Failover Procedures (Y/N), Tested (Y/N); Recovery Procedures (Y/N), Tested (Y/N).
Certainty LC scale: Very High (VH), High (H), Medium (M), Low (L), Very Low (VL).

The Tightly Coupling Index can be calculated using an appropriate formula as described, or alternatively assessed directly by the experts:

TABLE 9: DETERMINING THE TIGHTLY COUPLING INDEX

Discovered Component Configuration Item (CI) | Discovered Component Id | Dependency Type | Parent Component Id's | Failure Mode and Effect | Tightly Coupling Index (direct parent) | Certainty TC | Direct Impact (IFS) on parent
Web Server | hpux1.lab.collation.net:3880 | composition | Logistics Management | Outage | 0.30 | H | (0.4,0.4)
HP UX Computer System | hpux1.lab.collation.net | service | hpux1.lab.collation.net:3880 | Outage | 0.88 | H | (0.3,0.6)
HP UX Computer System | hpux1.lab.collation.net | service | hpux1.lab.collation.net:3880 | Slow Response | 0.88 | H | (0.3,0.6)
Cisco | moralelastic.lab.collation.net | service | hpux1.lab.collation.net | Limited Function | 0.45 | M | (0.8,0.1)
ethernet-scmacd | ethernet-scmacd (6) | service | hpux1.lab.collation.net | Limited Function | 0.20 | M | (0.5,0.4)
Java Server | caesar.lab.collation.net:2809 | transactional | hpux1.lab.collation.net:3880 | Outage | 0.89 | H | (0.4,0.5)
Computer System | panacea.lab.collation.net | service | caesar.lab.collation.net:2809 | Slow Response | 0.45 | M | (0.8,0.1)
WebLogic Server | histronix.lab.collation.net:7021 | transactional | hpux1.lab.collation.net:3880, cleopatra.lab.collation.net:4580 | Outage | 0.75 | H | (0.2,0.6)
Sun Sparc Computer System | histronix.lab.collation.net | service | histronix.lab.collation.net:7021 | Outage | 0.45 | M | (0.8,0.1)
Network Device | dmfe1 | service | histrionix.lab.collation.net | Limited Function | 0.67 | L | (0.8,0.1)
Sybase Server Enterprise | whatzit.lab.collation.net:4002 | transactional | histrionix.lab.collation.net:7021 | Outage | 0.90 | H | (0.5,0.4)
Sun Sparc Computer System | whatzit.lab.collation.net | service | whatzit.lab.collation.net:4002 | Slow Response | 0.67 | H | (0.5,0.4)

Certainty TC scale: Very High (VH), High (H), Medium (M), Low (L), Very Low (VL).

For the Tightly Coupling Index, too, the “failure mode and effect” column allows the attributes to be distinguished with regard to the different modes of failure, such as outage or slow response. We again consider the tightly coupling degree together with the level of certainty set by the expert. The grid has also already been extended with the fuzzy intuitionistic direct coupling index, which is based on a combination of the loosely and tightly coupling degrees; the direct coupling IFS is therefore added as the last green column in the table above.

After defining the direct couplings as inter-modular IFS, the indirect coupling between components or services can be calculated from the degrees of direct coupling. As described, different probabilistic variants of the logical operations can be used when calculating the indirect impacts. For impact analysis a Forward Coupling Calculation (FCC) is best suited. This is a bottom-up calculation: if, for example, an infrastructure component fails, what is its coupling to a higher-level component or, finally, to the business service? Conversely, a root cause analysis is a top-down approach and requires the reverse task to be solved, i.e. “On which components does the business application depend?” This second method calculates the indirect impacts starting from the top and traversing the impact arcs in the reverse direction. We refer to this method as Reverse Coupling Calculation (RCC). FCC and RCC results for indirect dependencies may differ, so both are calculated in the grid.

TABLE 10: IFCFIA GRID WITH INDIRECT COUPLING CALCULATIONS AND COST OF FAILURE

Depending on which combination of IFS operations is applied when calculating the indirect impacts, the resulting IFS values may be greater or smaller. Four types of impact analysis are used: worst case (pessimistic), best case (optimistic), moderate risk and classical fuzzy analysis. The IFCFIA grid also adds business impact information. Thus, when a component fails, the number of users impacted is known and an impact calculation can be performed based on the total hourly cost of failure of the component at which the incident occurs. The following table shows the final IFCFIA grid for the “Bill Payment Business Service”.

Extended IFCFIA Grid with indirect couplings and cost of failure

Level 0: Bill Payment Business Service
Level 1: Logistics Management Application (cost of failure 10.000 per hour, # Users 700, RTO 2 hours, RPO 4 hours)
Level 1: Billing Application (cost of failure 3.000 per hour, # Users 300, RTO 12 hours, RPO 12 hours)

Hierarchy Level | Discovered Component Configuration Item (CI) | Discovered Component Id | Direct Impact (IFS) on parent | Logistics Management: FCC coupling Moderate Risk | Logistics Management: FCC coupling Best Case | Logistics Management: FCC coupling Worst Case | Logistics Management: FCC coupling Classical | RCC coupling to Logistics Management | Billing Application: FCC coupling Moderate Risk | RCC coupling to Billing Application | total End Users impacted | total cost of failure per hour
L2 | Web Server | hpux1.lab.collation.net:3880 | (0.8,0.1) | (0.7,0.1) | (0.5,0.1) | (0.8,0.1) | (0.8,0.1) | (0.8,0.1) | | | 700 | 4000
L3 | HP UX Computer | hpux1.lab.collation.net | (0.3,0.6) | (0.3,0.5) | (0.2,0.6) | (0.4,0.5) | (0.3,0.5) | (0.2,0.6) | | | 700 | 3000
L4 | Cisco | moralelastic.lab.collation | (0.8,0.1) | (0.7,0.3) | (0.5,0.4) | (0.7,0.2) | (0.7,0.3) | (0.8,0.1) | | | 700 | 7000
L4 | ethernet-scmacd | ethernet-scmacd (6) | (0.5,0.4) | (0.6,0.3) | (0.6,0.3) | (0.6,0.3) | (0.6,0.3) | (0.7,0.3) | | | 700 | 6000
L3 | Java Server | caesar.lab.collation:2809 | (0.4,0.5) | | | | | | (0.7,0.3) | (0.8,0.1) | 300 | 2100
L4 | Computer System | panacea.lab.collation.net | (0.8,0.1) | (0.7,0.3) | (0.5,0.3) | (0.8,0.3) | (0.7,0.3) | (0.8,0.1) | | | 700 | 7000
L3 | WebLogic Server | histronix.lab.collation:7021 | (0.2,0.6) | (0.3,0.5) | (0.2,0.6) | (0.4,0.5) | (0.3,0.5) | (0.2,0.6) | | | 700 | 3000
L4 | Sun Sparc Computer | histronix.lab.collation.net | (0.8,0.1) | (0.7,0.1) | (0.5,0.1) | (0.8,0.1) | (0.8,0.1) | (0.8,0.1) | | | 700 | 7000
L5 | Network Device | dmfe1 | (0.8,0.1) | (0.7,0.3) | (0.7,0.3) | (0.7,0.3) | (0.7,0.3) | (0.8,0.1) | (0.8,0.1) | (0.7,0.3) | 1000 | 9400
L4 | Sybase Server | whatzit.lab.collation:4002 | (0.5,0.4) | (0.3,0.5) | (0.2,0.6) | (0.6,0.3) | (0.5,0.4) | (0.4,0.5) | | | 700 | 3000
L5 | Sun Sparc Computer | whatzit.lab.collation.net | (0.5,0.4) | (0.7,0.3) | (0.7,0.3) | (0.7,0.3) | (0.7,0.3) | (0.8,0.1) | | | 700 | 7000

TABLE 11: FINAL IFCFIA MATRIX FOR THE BILL PAYMENT BUSINESS SERVICE


FIGURE 46: DEPENDENCY GRAPH FOR LOGISTICS MANAGEMENT APPLICATION

The coupling of the highest-level infrastructure components (the web servers) to the Logistics Management Application is also set by the experts and entered as a parameter, namely as the coupling of the Level 2 components. The numbers depicted in the IFCFIA grid above are only examples; a real impact calculation is carried out next in our use case, Business Impact Analysis for Logistics Management.

The IFCFIA dependency graph shown is the graphical representation of the IFCFIA grid. The complexity increases significantly with the level of detail discovered. In practice the scope of the evaluated dependencies will therefore be limited to a reasonable and manageable level and number of components.

6.2 USE CASE: BUSINESS IMPACT ANALYSIS

6.2.1 BUILDING THE IFCFIA DEPENDENCY GRAPH

For each component, membership values for Tightly Coupling (e.g. Dhama's metric or Fenton & Melton) and Loosely Coupling (resilience capabilities) are calculated using the metrics described in chapters 3 and 4. The expert can adapt the resulting metric value according to his knowledge. The expert also adds a certainty in words (or, if preferred, in numbers): VH stands for Very High, H for High, M for Medium, L for Low and VL for Very Low. These linguistic expressions are translated into numeric values, which give the parameter for the Sugeno complement. Using the Sugeno complement, we can compute the non-membership function for Tightly and Loosely Coupling and then create two separate IFS for the two contrary coupling aspects.

As an example, for the Sybase Adaptive Server Enterprise whatzit.lab.collation.net:4002 we have determined a degree of tightly coupling of 0.7 to the WebLogic Server histrionix.lab.collation.net:7021 (seen as an indicator of the risk resulting from inter-dependencies) and a degree of 0.5 for loosely coupling (degree of resilience capabilities). We choose the Sugeno complement c(μ) = (1 - μ) / (1 + λμ) and define for tightly coupling a vagueness parameter of λ = 2, which means the statement is considered to be on a confident level. We get a Sugeno complement for tightly coupling of c(μA(x)) = 0.125 = νA(x). We can then create a first IFS A* for Tightly Coupling with the tuple (0.7, 0.125). The degree of vagueness for tightly coupling is 1 - μA(x) - νA(x) = 0.175.

For Loosely Coupling we have a membership degree of 0.5 and follow the same approach. For the Sugeno complement we define a vagueness indicator of λ = 0.5, which corresponds to a “certain” statement. Here we get a Sugeno complement for loosely coupling of c(μB(x)) = 0.4 = νB(x). We can now create a second IFS B* for Loosely Coupling with the tuple (0.5, 0.4). The degree of vagueness for loosely coupling is 1 - μB(x) - νB(x) = 0.1.

We now have two independent IFS, A* for tightly and B* for loosely coupling, which we combine into a single direct coupling IFS D*. With this approach the real impact can be closely simulated by considering both contrary sides of the coupling aspect simultaneously.

The IFS D* is defined here by the IFS operation D* = A@¬B, where A* is the IFS for tightly coupling with membership μA(x) and non-membership νA(x), and B* is the equivalent IFS for loosely coupling with membership μB(x) and non-membership νB(x). With the negation ¬B = <νB(x), μB(x)> and the averaging operation @, this gives μD(x) = (μA(x) + νB(x)) / 2 and νD(x) = (νA(x) + μB(x)) / 2.

Setting the IFS A* = (0.7, 0.125) for tightly coupling and B* = (0.5, 0.4) for loosely coupling, the combined IFS D* = (A@¬B) is (0.55, 0.3125) with a vagueness of πD(x) = 1 - 0.55 - 0.3125 = 0.1375. The fuzzy intuitionistic direct degree of coupling between whatzit.lab.collation.net:4002 (Sybase Adaptive Server Enterprise) and histrionix.lab.collation.net:7021 (WebLogic Server) is therefore (0.55, 0.3125).
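The direct coupling calculation above can be retraced with a few lines of Python. This is a minimal sketch, not part of the IFCFIA tooling: the names `IFS`, `sugeno_complement` and `direct_coupling` are hypothetical, and the code only assumes the Sugeno complement c(μ) = (1 - μ) / (1 + λμ) and the averaging operation @ applied to A and the negation of B.

```python
from typing import NamedTuple


class IFS(NamedTuple):
    """Intuitionistic fuzzy value: membership mu and non-membership nu."""
    mu: float
    nu: float

    @property
    def pi(self) -> float:
        """Vagueness (intuitionistic index) 1 - mu - nu."""
        return 1.0 - self.mu - self.nu


def sugeno_complement(mu: float, lam: float) -> float:
    """Sugeno complement c(mu) = (1 - mu) / (1 + lam * mu), for lam > -1."""
    return (1.0 - mu) / (1.0 + lam * mu)


def direct_coupling(tight: IFS, loose: IFS) -> IFS:
    """D = A @ not-B: average of A with the negation <nu_B, mu_B> of B."""
    return IFS((tight.mu + loose.nu) / 2.0, (tight.nu + loose.mu) / 2.0)


# Values of the Sybase / WebLogic example in the text.
tight = IFS(0.7, sugeno_complement(0.7, lam=2.0))   # tightly coupling IFS A*
loose = IFS(0.5, sugeno_complement(0.5, lam=0.5))   # loosely coupling IFS B*
direct = direct_coupling(tight, loose)              # direct coupling IFS D*
print(direct, round(direct.pi, 4))
```

Keeping μ and ν together in one immutable value makes it harder to update one side of the bi-polar assessment without the other.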

FIGURE 47: IFCFIA DEPENDENCY DIRECTED GRAPH FOR LOGISTICS MANAGEMENT

The same procedure is carried out for the other components in scope of the Logistics Management application to determine their fuzzy intuitionistic direct impacts. The complexity increases significantly with the level of detail discovered, so in practice the scope of the evaluated dependencies will be limited to a reasonable and manageable level.

6.2.2 CALCULATING THE INDIRECT IMPACT

Depending on which combination of IFS operations is used, the indirect impacts may be greater or smaller. Four types of impact analysis are introduced [Kolev and Ivanov, 2009]: worst case (pessimistic), best case (optimistic), moderate and classical fuzzy analysis.

Worst case impact analysis

V(p∧q) = <min(μ(p),μ(q)), max(ν(p),ν(q))>

V(a∨b) = <μ(a)+μ(b)-μ(a)*μ(b), ν(a)*ν(b)>

Best case impact analysis

V(p∧q) = <μ(p)*μ(q), ν(p)+ν(q)-ν(p)*ν(q)>

V(a∨b) = <max(μ(a),μ(b)), min(ν(a),ν(b))>

Moderate impact analysis

V(p∧q) = <μ(p)*μ(q), ν(p)+ν(q)-ν(p)*ν(q)>

V(a∨b) = <μ(a)+μ(b)-μ(a)*μ(b), ν(a)*ν(b)>

Classical fuzzy impact analysis

V(p∧q) = <min(μ(p),μ(q)), max(ν(p),ν(q))>

V(a∨b) = <max(μ(a),μ(b)), min(ν(a),ν(b))>

TABLE 12: ATTITUDE BASED IMPACT CALCULATIONS

As an example of a dependency assessment using the Forward Coupling Calculation (FCC) method, which is used for impact analysis, we can calculate the indirect dependency of the Sybase Server whatzit.lab.collation.net:4002 on the business application Logistics Management:

indcpl(C_whatzit:4002, B_LogisticsManagement)

= ( indcpl(C_whatzit:4002, C_hpux1:3880) ∧ dircpl(C_hpux1:3880, B_LogisticsManagement) ) ∨ ( indcpl(C_whatzit:4002, C_cleopatra:4580) ∧ dircpl(C_cleopatra:4580, B_LogisticsManagement) )

= dircpl(C_whatzit:4002, C_histronix:7021) ∧ ( ( dircpl(C_histronix:7021, C_hpux1:3880) ∧ dircpl(C_hpux1:3880, B_LogisticsManagement) ) ∨ ( dircpl(C_histronix:7021, C_cleopatra:4580) ∧ dircpl(C_cleopatra:4580, B_LogisticsManagement) ) )

Moderate impact assessment: indcpl_moderate(C_whatzit:4002, B_LogisticsManagement) = (0.3438, 0.4675)

Worst case impact assessment: indcpl_worst(C_whatzit:4002, B_LogisticsManagement) = (0.5500, 0.3125)

Best case impact assessment: indcpl_best(C_whatzit:4002, B_LogisticsManagement) = (0.2200, 0.6288)

Classical impact assessment: indcpl_classic(C_whatzit:4002, B_LogisticsManagement) = (0.5000, 0.4000)
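The four attitude-based variants can be compared side by side in a short Python sketch. Several things here are assumptions rather than content of the paper: the function names are hypothetical, the max/min disjunction is assumed for the best-case and classical variants, and the four intermediate direct couplings (`h_hpux`, `hpux_lm`, `h_cleo`, `cleo_lm`) are illustrative values chosen so that the sketch reproduces the four results listed above. Only the first hop (0.55, 0.3125) is taken from the worked example.

```python
from functools import reduce
from typing import Callable, Dict, NamedTuple, Sequence, Tuple


class IFS(NamedTuple):
    mu: float  # membership (degree of coupling)
    nu: float  # non-membership


Op = Callable[[IFS, IFS], IFS]


def and_min(p: IFS, q: IFS) -> IFS:
    return IFS(min(p.mu, q.mu), max(p.nu, q.nu))


def and_prod(p: IFS, q: IFS) -> IFS:
    return IFS(p.mu * q.mu, p.nu + q.nu - p.nu * q.nu)


def or_max(a: IFS, b: IFS) -> IFS:
    return IFS(max(a.mu, b.mu), min(a.nu, b.nu))


def or_prob(a: IFS, b: IFS) -> IFS:
    return IFS(a.mu + b.mu - a.mu * b.mu, a.nu * b.nu)


# attitude -> (conjunction, disjunction)
ATTITUDES: Dict[str, Tuple[Op, Op]] = {
    "worst": (and_min, or_prob),
    "best": (and_prod, or_max),
    "moderate": (and_prod, or_prob),
    "classical": (and_min, or_max),
}


def indirect_coupling(attitude: str, first_hop: IFS,
                      branches: Sequence[Sequence[IFS]]) -> IFS:
    """first_hop AND (branch_1 OR branch_2 ...); each branch is an AND of its hops."""
    conj, disj = ATTITUDES[attitude]
    combined = reduce(disj, (reduce(conj, hops) for hops in branches))
    return conj(first_hop, combined)


whatzit_h = IFS(0.55, 0.3125)   # Sybase -> WebLogic, from the worked example
h_hpux = IFS(0.5, 0.4)          # assumed: WebLogic -> hpux1:3880
hpux_lm = IFS(0.8, 0.1)         # assumed: hpux1:3880 -> Logistics Management
h_cleo = IFS(0.5, 0.4)          # assumed: WebLogic -> cleopatra:4580
cleo_lm = IFS(0.75, 0.15)       # assumed: cleopatra:4580 -> Logistics Management

results = {
    name: indirect_coupling(name, whatzit_h,
                            [(h_hpux, hpux_lm), (h_cleo, cleo_lm)])
    for name in ATTITUDES
}
for name, value in results.items():
    print(f"{name:9s} ({value.mu:.4f}, {value.nu:.4f})")
```

Because each attitude is just a pair of operators, further variants can be added to `ATTITUDES` without touching the traversal logic.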

After all indirect impacts have been calculated, a simple one-level dependency map can be drawn in which the indirect impact of each component on the business service is easily depicted.

FIGURE 48: SIMPLIFIED INDIRECT INTUITIONISTIC DEPENDENCY MAP

Various stakeholders may have individual concerns and requirements, which lead to different subjective impact assessments. The approach can support particular views of the same incident impact, i.e. a kind of “attitude”-based impact assessment with regard to the stakeholders' concerns and attitude.

Another interpretation of using different probabilistic variants of the logical operations in the calculation of indirect impacts is that they model the way the incident impact is transferred throughout the complex system. An incident can interfere indirectly with other components in several ways, which mainly results from the bi-polar aspect, i.e. the combination of the contrary forces of dependencies and resilience capabilities.

6.2.3 IMPACT ASSESSMENT: INCIDENT IN LOGISTICS MANAGEMENT

In our example IT Management finds a failure in an infrastructure component: in the scenario below, the Sybase Server “whatzit.lab.collation.net” failed to start a required database service. This incident is discovered by monitoring the correct start and availability of the relevant database services using standard monitoring tools like IBM Tivoli Monitoring, and an error event is created which results in a ticket for the corresponding service desk. Today’s enterprise business service management should not only be concerned about a failed component; it must be more concerned with the impact of that component on the business. Unfortunately this relation and the dependencies are not obvious, and the impact of this failure is difficult to assess and is not in the responsibility of the service desk maintaining the database infrastructure.

Having calculated the indirect coupling IFS, we can see directly which components create the biggest risk to our business application. A component with a coupling membership value close to one will have a major impact, because the business application is absolutely dependent on this component and hardly any resilience capabilities exist for it. A component with a coupling membership value close to zero either has great resilience, or its resilience is of little importance because of its very small coupling level.

The intuitionistic fuzzy dependencies between components can be interpreted with different kinds of semantics, depending on the type of information they represent.

A probabilistic coupling dependency between the business service and the component where the incident occurs means that the calculated indirect IFS represents “the probability that the higher-level service Logistics Management (LM) is not compliant with its specifications in case the component Sybase Server “whatzit.lab.collation.net” (SS) failed in operation”.

An ordinary fuzzy coupling dependency between LM and SS means that “if SS is non-compliant with its specifications, then LM is partially degraded in its functionalities”, so the indirect coupling IFS provides a measurement of a probable degradation level for a service operation in the Logistics Management application.

We believe that in practice impacts are best expressed by considering both semantics, which is also aligned with the way human expressions and dependency assessments work.

Within the use case, as an immediate response to the discovered incident, IT Management can now predict to the business department concerned that the Logistics Management Application, using a moderate risk and impact assessment, is likely to be operationally degraded to a degree of 34%, but can be considered operational with a probability of 47%. This approach provides a measurement of the expected usability, compliance and operational status. It allows the notion of a service that is still usable with some degree of degradation (functional and probabilistic), where the first indicator refers to the level of degradation and the second indicates the probability that it occurs.

Service management can also express the vagueness of the assessment verbally, for instance by rating the intuitionistic index π(x) = 1 - μA(x) - νA(x) as follows:

(0% ; <15%) - certain

(15% ; <30%) - medium

(>=30%) - uncertain

Applied to our use case, 19% indicates a statement of medium certainty, so we can communicate a medium certainty level to the business together with the predicted impact.
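The verbal rating of the vagueness can be written as a small helper. This is a sketch with hypothetical function names; the band limits are the ones listed above, and the input is the moderate assessment (0.3438, 0.4675) of the use case.

```python
def vagueness(mu: float, nu: float) -> float:
    """Intuitionistic index pi = 1 - mu - nu."""
    return 1.0 - mu - nu


def rate_vagueness(pi: float) -> str:
    """Map the intuitionistic index to the verbal bands used in the text."""
    if pi < 0.15:
        return "certain"
    if pi < 0.30:
        return "medium"
    return "uncertain"


pi = vagueness(0.3438, 0.4675)
print(f"{pi:.0%} -> {rate_vagueness(pi)}")
```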

Evaluating the extended IFCFIA grid, we see that 700 users in total are impacted by the incident of a database service failure on the Sybase Server “whatzit.lab.collation.net”, and we also know that the hourly cost of failure for the impacted Logistics Management Application within the “Bill Payment” Business Service is 10.000$ per hour. As we predict an operational degradation of 34% for Logistics Management, we can provide a cost estimate of 3400$ per hour of downtime of the database service at “whatzit.lab.collation.net”.

The IFCFIA grid also indicates the failover and recovery procedures. In our use case the server “whatzit.lab.collation.net” has a cold recovery only, with a corresponding MTTR of 4 hours, so we can finally calculate the total estimated financial impact of the incident “database service not available” as 4 × 3400$ = 13.600$.
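The cost arithmetic of the use case can be stated explicitly. The following sketch uses hypothetical function names and assumes, as the text does, that the hourly cost scales linearly with the predicted degradation and that the outage lasts for the MTTR.

```python
def hourly_cost_estimate(degradation: float, hourly_cost_of_failure: float) -> float:
    """Expected cost per hour: degradation degree times full hourly cost of failure."""
    return degradation * hourly_cost_of_failure


def incident_cost_estimate(degradation: float, hourly_cost_of_failure: float,
                           mttr_hours: float) -> float:
    """Expected total cost of an incident that lasts for the MTTR."""
    return hourly_cost_estimate(degradation, hourly_cost_of_failure) * mttr_hours


# Logistics Management: 34% degradation, 10,000 per hour, cold recovery with MTTR 4 h.
per_hour = hourly_cost_estimate(0.34, 10_000)
total = incident_cost_estimate(0.34, 10_000, mttr_hours=4)
print(round(per_hour), round(total))
```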

6.3 USE CASE: ROOT CAUSE ANALYSIS

The purpose of a fault tree analysis is to determine the root cause of a failure, given that a particular item is out of order. A root cause analysis (RCA) is a top-down approach and requires the reverse task of the impact analysis to be solved, i.e. “To which components is the business application B coupled (on which does it depend)?” The IFCFIA analysis procedure takes into account the direct and indirect impacts of other components on the failed components. The result of the analysis is an intuitionistic fuzzy distribution of components, giving an ordered set of possible root causes.

RCA calculates the indirect impacts starting from the top and traversing the impact arcs in the reverse direction. We refer to this method as Reverse Coupling Calculation (RCC). Forward Coupling Calculation (FCC) and RCC results for indirect dependencies may differ, so both are calculated in the IFCFIA grid, but for the RCA we only need to consider the RCC couplings.

Once the IFCFIA grid has been created, we can simply sort by the highest level of IFS coupling (we propose to sort primarily by tightly coupling and secondarily by loosely coupling) to obtain an order of probability for the possible root causes. The infrastructure component with the highest coupling is the most likely one and should therefore be considered first as the cause of the impact on a higher-level business service.

TABLE 13: IFCFIA GRID WITH RCC COUPLINGS USED FOR ROOT CAUSE ANALYSIS

Level 1: Logistics Management (cost of failure 10.000 per hour, # Users 700, RTO 2 hours, RPO 4 hours)

Hierarchy Level | Discovered Component Configuration Item (CI) | Discovered Component Id | Parent Component Id's | Failure Mode and Effect | FCC moderate risk | RCC
L2 | Web Server | hpux1.lab.collation.net:3880 | Logistics Management | Outage | (0.6,0.3) | (0.6,0.3)
L3 | HP UX Computer System | hpux1.lab.collation.net | hpux1.lab.collation.net:3880 | Outage | (0.3,0.5) | (0.2,0.6)
L3 | HP UX Computer System | hpux1.lab.collation.net | hpux1.lab.collation.net:3880 | Slow Response | (0.3,0.5) | (0.2,0.6)
L4 | Cisco | moralelastic.lab.collation.net | hpux1.lab.collation.net | Limited Function | (0.7,0.3) | (0.7,0.2)
L4 | ethernet-scmacd | ethernet-scmacd (6) | hpux1.lab.collation.net | Limited Function | (0.6,0.3) | (0.6,0.3)
L3 | Java Server | caesar.lab.collation.net:2809 | hpux1.lab.collation.net:3880 | Outage | |
L4 | Computer System | panacea.lab.collation.net | caesar.lab.collation.net:2809 | Slow Response | (0.7,0.3) | (0.5,0.4)
L3 | WebLogic Server | histronix.lab.collation.net:7021 | hpux1.lab.collation.net:3880, cleopatra.lab.collation.net:4580 | Outage | (0.3,0.5) | (0.8,0.1)
L4 | Sun Sparc Computer System | histronix.lab.collation.net | histronix.lab.collation.net:7021 | Outage | (0.7,0.3) | (0.7,0.2)
L5 | Network Device | dmfe1 | histrionix.lab.collation.net | Limited Function | (0.7,0.3) | (0.5,0.3)
L4 | Sybase Server Enterprise | whatzit.lab.collation.net:4002 | histrionix.lab.collation.net:7021 | Outage | (0.3,0.5) | (0.8,0.1)
L5 | Sun Sparc Computer System | whatzit.lab.collation.net | whatzit.lab.collation.net:4002 | Slow Response | (0.7,0.3) | (0.4,0.1)

Applied to our use case, the users of the Logistics Management application find that the system responds very slowly or that user requests even time out. IT Management quickly analyses the IFCFIA grid and finds two components, the Sybase Server Enterprise “whatzit.lab.collation.net:4002” and the WebLogic Server “histronix.lab.collation.net:7021”, with the highest RCC coupling of (0.8,0.1), which are therefore the most likely root causes of the degraded operation of the Logistics Management application. After checking the status of both systems and finding the WebLogic Server up and running, the root cause is quickly identified: the Sybase server “whatzit.lab.collation.net” has failed to start a required database service.

As we compare here the degrees of the direct and indirect couplings which can cause the observed impact, indirect impacts are more likely to be the root cause than direct impacts when their indirect coupling is higher than the direct one.

Using two-sided fuzzy logic, possible root causes of a system failure can be ordered by considering both opposite sides of the subject matter simultaneously: the dependency and the resilience. The degree of dependency in the IFS is already adjusted for the included recovery capabilities, which provides more realistic results. Using IFCFIA for root cause analysis therefore provides a granular and ordered view that unveils the dependencies between business applications and their supporting components and infrastructures.
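The ordering of root cause candidates can be sketched in Python using the RCC column of Table 13. The type `Candidate` and the function `rank_root_causes` are hypothetical names. The sort key (membership descending, then non-membership ascending) is one possible reading of "primarily by tightly coupling, secondarily by loosely coupling" and should be treated as an assumption. The Java Server row is omitted because the table gives no RCC value for it.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    component: str
    failure_mode: str
    mu: float  # RCC membership (degree of coupling)
    nu: float  # RCC non-membership


RCC_GRID: List[Candidate] = [
    Candidate("hpux1.lab.collation.net:3880", "Outage", 0.6, 0.3),
    Candidate("hpux1.lab.collation.net", "Outage", 0.2, 0.6),
    Candidate("hpux1.lab.collation.net", "Slow Response", 0.2, 0.6),
    Candidate("moralelastic.lab.collation.net", "Limited Function", 0.7, 0.2),
    Candidate("ethernet-scmacd (6)", "Limited Function", 0.6, 0.3),
    Candidate("panacea.lab.collation.net", "Slow Response", 0.5, 0.4),
    Candidate("histronix.lab.collation.net:7021", "Outage", 0.8, 0.1),
    Candidate("histronix.lab.collation.net", "Outage", 0.7, 0.2),
    Candidate("dmfe1", "Limited Function", 0.5, 0.3),
    Candidate("whatzit.lab.collation.net:4002", "Outage", 0.8, 0.1),
    Candidate("whatzit.lab.collation.net", "Slow Response", 0.4, 0.1),
]


def rank_root_causes(grid: List[Candidate]) -> List[Candidate]:
    """Highest coupling first; ties broken by the lowest non-membership."""
    return sorted(grid, key=lambda c: (-c.mu, c.nu))


ranking = rank_root_causes(RCC_GRID)
for candidate in ranking[:3]:
    print(candidate)
```

The first two entries of `ranking` are the WebLogic Server and the Sybase Server, the two candidates that IT Management checks first in the use case.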

6.4 USE CASE: ADVANCED SERVICE LEVEL MONITORING

6.4.1 SLA MONITORING AND EARLY QUALITY ANALYSIS

The virtualized service delivery model requires the composition of services to deliver the overall service to the client. The fulfilment of any higher-level objective requires proper enforcement on multiple resources at several levels. IFCFIA provides the capability to assess the dependencies and relationships of backend components to the SLAs of the end-user service (or composite service) impacted by these backend services.

Service monitoring based on backstage metrics implies a bottom-up approach and begins by monitoring backend applications and resources. When analysing the monitored backend status and operational performance levels, we need to translate the metrics related to individual components of the service, such as accuracy, responsiveness or uptime (which are in a sense backstage metrics), back to the front stage experienced by the client or business. This is required to assess whether the business service will meet its SLAs, to decide which resources to allocate to it, and perhaps to choose between providers when selecting component services to be composed.

Using IFCFIA, service administrators can pro-actively track and verify the overall quality status of the impacted business services by periodically polling the measures of the individual components. This allows administrators responsible for the functioning of a service to monitor its quality based on the measurements that are typically already taken for the component services.

In this use case scenario we want to reason about frontend service performance, such as the end-user response time of the “Logistics Management Application”, using monitoring results of the backend components and applying the intuitionistic fuzzy coupling relationships to determine the impact and predicted behaviour. For instance, we monitor the database performance of the Sybase Enterprise Server “whatzit.lab.collation.net:4002” by periodically executing a standard test query and measuring the response time.

A central concept of the quality of services is an adaptive penalization of individual requests according to the current degree of SLA conformance C. The conformance C is monitored per service; in the use case this is the database response time performance.

We define conformance C = Number of timely transaction invocations / Total number of invocations of the transaction

In practice, so-called step-wise SLAs are commonly used to specify the QoS requirements of a service class. These SLAs consist of one or more percentile constraints and an optional deadline constraint. A percentile constraint requires, e.g. in our use case, that n% of all database service requests are processed within x seconds.
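The conformance measure and a percentile constraint can be computed directly from the response times of the periodic test query. The sketch below uses hypothetical function names and sample response times that are invented for illustration only.

```python
from typing import Sequence


def conformance(response_times_s: Sequence[float], deadline_s: float) -> float:
    """C = timely invocations / total invocations for one monitored service."""
    if not response_times_s:
        raise ValueError("at least one measurement is required")
    timely = sum(1 for t in response_times_s if t <= deadline_s)
    return timely / len(response_times_s)


def meets_percentile_constraint(response_times_s: Sequence[float],
                                deadline_s: float,
                                required_fraction: float) -> bool:
    """Percentile constraint: required_fraction of requests finish within deadline_s."""
    return conformance(response_times_s, deadline_s) >= required_fraction


# Hypothetical response times (seconds) of the periodic Sybase test query.
samples = [0.8, 1.1, 0.9, 2.7, 1.0, 0.7, 3.2, 0.9, 1.2, 1.0]
c = conformance(samples, deadline_s=2.0)
print(c, meets_percentile_constraint(samples, 2.0, required_fraction=0.95))
```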

We then apply fuzzification rules to define a linguistic degree of the conformance measure:

FIGURE 49: FUZZIFICATION OF CONFORMANCE MEASUREMENTS

All performance measures which are non-compliant can then be mapped to a failure mode.

TABLE 14: IFCFIA WITH MONITORED FAILURE MODES

Level 1: Logistics Management Application (cost of failure 10.000 per hour, # Users 700, RTO 2 hours, RPO 4 hours)

Hierarchy Level | Discovered Component Configuration Item (CI) | Discovered Component Id | Failure Mode and Effect | Direct Impact (IFS) on parent | FCC coupling Moderate Risk | FCC coupling Best Case | FCC coupling Worst Case | FCC coupling Classical | total End Users impacted | total cost of failure per hour
L2 | Web Server | hpux1.lab.collation.net:3880 | Outage | (0.8,0.1) | (0.7,0.1) | (0.5,0.1) | (0.8,0.1) | (0.8,0.1) | 700 | 4000
L3 | WebLogic Server | histronix.lab.collation.net:7021 | Outage | (0.2,0.6) | (0.3,0.5) | (0.2,0.6) | (0.4,0.5) | (0.3,0.5) | 700 | 3000
L4 | Sybase Server | whatzit.lab.collation.net:4002 | Medium Conformance | (0.3,0.6) | (0.3,0.5) | (0.2,0.6) | (0.6,0.3) | (0.5,0.4) | 700 | 3000
L4 | Sybase Server | whatzit.lab.collation.net:4002 | Low Conformance | (0.5,0.4) | (0.4,0.4) | (0.3,0.5) | (0.6,0.3) | (0.5,0.4) | 700 | 4000
L4 | Sybase Server | whatzit.lab.collation.net:4002 | Outage (Non-Conformance) | (0.6,0.3) | (0.5,0.4) | (0.4,0.5) | (0.7,0.2) | (0.6,0.3) | 700 | 5000
L5 | Sun Sparc Computer | whatzit.lab.collation.net | Slow Response | (0.5,0.4) | (0.7,0.3) | (0.7,0.3) | (0.7,0.3) | (0.7,0.3) | 700 | 7000

This allows the engineer to consider how the failure modes of each system component can result in system performance problems, and to define appropriate safeguards against such problems. IFCFIA considers the different failure modes of each lower-level system component and can assess the dependency effects on the business service for each failure mode individually.

In our use case we discover, using a standard monitoring tool, a “Low Conformance” situation at the Sybase database “whatzit.lab.collation.net:4002”. Evaluating the IFCFIA matrix with business services on one axis and components with their different failure modes on the other, we can judge an impact of (0.4,0.4) on Logistics Management using a moderate impact assessment method, and a corresponding business impact of 4000$ per hour for as long as the incident lasts.
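Evaluating the matrix for a monitored failure mode amounts to a lookup keyed by component and failure mode. The sketch below holds the three Sybase rows of Table 14 in a dictionary; the names `GridRow`, `GRID` and `assess` are hypothetical.

```python
from typing import Dict, NamedTuple, Tuple


class GridRow(NamedTuple):
    fcc_moderate: Tuple[float, float]  # (mu, nu) towards Logistics Management
    users_impacted: int
    cost_per_hour: int


SYBASE = "whatzit.lab.collation.net:4002"

# Rows taken from Table 14 (moderate risk column only).
GRID: Dict[Tuple[str, str], GridRow] = {
    (SYBASE, "Medium Conformance"): GridRow((0.3, 0.5), 700, 3000),
    (SYBASE, "Low Conformance"): GridRow((0.4, 0.4), 700, 4000),
    (SYBASE, "Outage (Non-Conformance)"): GridRow((0.5, 0.4), 700, 5000),
}


def assess(component: str, failure_mode: str) -> GridRow:
    """Return the predicted impact for a monitored failure mode of a component."""
    return GRID[(component, failure_mode)]


row = assess(SYBASE, "Low Conformance")
print(row.fcc_moderate, row.users_impacted, row.cost_per_hour)
```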

6.4.2 AUTOMATED FUZZY REASONING BASED ON BACKEND MONITORING

In the Logistics Management scenario the reasoning should be based on monitored measurements of the backend components, applying the intuitionistic fuzzy coupling relationships to determine the impact on the frontend services. We have already fuzzified the backend conformance measurements; now we also need to fuzzify the frontend performance measurements in the same way, for instance the end-user response time of the Logistics Management application.

FIGURE 50: FUZZIFICATION OF APPLICATION “RESPONSE TIME” METRIC

For automated reasoning about the frontend quality of our service we apply fuzzy rules to the observed backend performance metrics, which enables us to generate performance rules for the expected frontend behaviour. We can define individual rules for a single backend component or, which is more likely in practice, establish more generic rules for several component services. For the component service “whatzit.lab.collation.net:4002”, for instance, the following rules apply using the IFCFIA couplings.

If {“Component Service” is (tightly coupled > 0.7) to “Logistics Management Application” and “Component Service” conformance is LOW} then “Logistics Management Application” response time performance is LOW.

If {“Component Service” is (tightly coupled > 0.4) to “Logistics Management Application” and “Component Service” conformance is MEDIUM} then “Logistics Management Application” response time performance is MEDIUM or LOW.

Similar fuzzification rules can be applied to any other performance measure. Such rules are defined by experts when developing and publishing a service.
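The two rules can be sketched in Python on top of a simple fuzzification of the conformance measure. The breakpoints 0.80, 0.90 and 0.99 are purely hypothetical, since the membership functions of the fuzzification figures are not reproduced here, and the treatment of the consequent "MEDIUM or LOW" (assigning the firing strength to both labels) is also an assumption. All function names are illustrative.

```python
from typing import Dict


def ramp_down(x: float, a: float, b: float) -> float:
    """1 up to a, 0 from b on, linear in between."""
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    return (b - x) / (b - a)


def triangle(x: float, a: float, b: float, c: float) -> float:
    """0 outside (a, c), 1 at b, linear on both sides."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def fuzzify_conformance(c: float) -> Dict[str, float]:
    """Hypothetical membership functions for the conformance C in [0, 1]."""
    return {
        "LOW": ramp_down(c, 0.80, 0.90),
        "MEDIUM": triangle(c, 0.80, 0.90, 0.99),
    }


def infer_response_time(coupling_mu: float, conformance: float) -> Dict[str, float]:
    """Fire the two rules; returns degrees for response time performance LOW / MEDIUM."""
    levels = fuzzify_conformance(conformance)
    out = {"LOW": 0.0, "MEDIUM": 0.0}
    if coupling_mu > 0.7:                     # rule 1: conformance LOW -> LOW
        out["LOW"] = max(out["LOW"], levels["LOW"])
    if coupling_mu > 0.4:                     # rule 2: conformance MEDIUM -> MEDIUM or LOW
        out["MEDIUM"] = max(out["MEDIUM"], levels["MEDIUM"])
        out["LOW"] = max(out["LOW"], levels["MEDIUM"])
    return out


print(infer_response_time(coupling_mu=0.8, conformance=0.70))
print(infer_response_time(coupling_mu=0.5, conformance=0.90))
```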

The use of IFS allows two-sided (intuitionistic) fuzzy reasoning by combining the tightly and loosely coupling aspects of a statement into inference rules and logics. Using two-sided fuzzy logic, complex system behaviour can be closely simulated by considering both opposite sides of the coupling subject matter simultaneously.

The IFCFIA grid shows the fuzzy coupling relation of each low-level component to the related business applications and services. The tightly and loosely coupled IFS values are an aggregation over all indirect couplings and dependencies. With fuzzy reasoning based on the IFS coupling level we can now translate the monitored metrics related to individual components of the service infrastructure, such as accuracy, responsiveness or uptime (which are in a sense backstage metrics), back to the front stage experienced by the client or business. This indicates whether the frontend service will meet its SLAs or whether the service levels may be degraded.

6.4.3 FUZZY CLUSTERING FOR ANALYSIS OF MONITORING DATA

Standard analysis methods and tools classify monitoring results sharply. For SLA constraint monitoring this often means that the monitor provides only two result classes: a conformance level requirement is either reached or not. In our use case, for example, the ratio of timely database test queries to the total number of database test queries issued by monitoring tools to the Sybase Enterprise Server either reaches the minimum required conformance level or it does not.

Monitoring results with very similar characteristics may be classified in different result classes. A sharp classification approach cannot treat monitoring data according to their actual impact, since sharp classes do not reflect the actual position of the monitoring data within the possible result set. Fuzzy classes open new perspectives for positioning the SLA monitoring results inside the classification space. In fuzzy clustering, monitoring data elements can belong to more than one cluster, and each element is associated with a set of membership levels. These indicate the strength of the association between that monitoring result and a particular cluster, where the clusters represent the different quality levels. Fuzzy clustering is the process of assigning these membership levels and then using them to assign monitoring data elements to one or more quality clusters based on their membership degrees. This information offers new possibilities for segmenting, targeting and controlling monitoring data and allows reacting proactively to quality problems before the acceptance limit has been reached. In contrast to sharp results, where a shift in SLA performance can only be detected when it is too late and the values have already fallen below the limit, the fuzzy classes enable monitoring of the performance evolution.
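As an illustration of membership levels, the following sketch computes the membership degrees of a single monitoring result with the standard fuzzy c-means membership formula for fixed, distinct cluster centres. The cluster centres, the fuzzifier m = 2 and the sample value are assumptions chosen for the example; the source does not prescribe a particular clustering algorithm.

```python
def fcm_memberships(value, centres, m=2.0):
    """Fuzzy c-means membership degrees of one value for given cluster centres.

    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)), with d_i = |value - centre_i|.
    The centres are assumed to be distinct.
    """
    distances = [abs(value - c) for c in centres]
    # A value that coincides with a centre belongs fully to that cluster.
    if 0.0 in distances:
        return [1.0 if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


# Hypothetical quality clusters for a conformance ratio (timely / total queries)
labels = ["LOW", "MEDIUM", "HIGH"]
centres = [0.90, 0.95, 0.99]

for label, degree in zip(labels, fcm_memberships(0.94, centres)):
    print(f"{label}: {degree:.3f}")
```

The degrees sum to one, so a result close to a class boundary is visible as a split membership instead of being forced into one sharp class.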

FIGURE 51: NATURAL GRANULATION OF COMPONENT PERFORMANCE MEASUREMENTS

There are natural boundaries for the granulation of the measurements of the performance parameters. As each performance measurement may have a lower and upper warning threshold and a lower and upper error threshold, we can use those thresholds as the best-suited limits for linguistic performance variables. This can be leveraged to monitor the shift when a set of measurement values indicating warnings degrades a service until it provokes an interruption. Based on the monitoring position information, a fuzzy classification allows service management to monitor the evolution of a single QoS parameter over time. By comparing either the attribute values or the class membership degrees of a QoS performance criterion over time, it is possible to detect whether this criterion is increasing, maintaining or decreasing its value with regard to the natural thresholds. Service Management can therefore proactively analyse these observations and derive appropriate reactions.
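A minimal sketch of this idea, assuming a metric where higher values are worse (for example a response time) and using only the upper warning and error thresholds: the membership in a "violated" class rises linearly between the two thresholds, and the trend is read from the change of that membership over time. The function names, the linear shape and the threshold values are assumptions for illustration.

```python
def violation_membership(value, warning, error):
    """Degree (0..1) to which a measurement belongs to the 'violated' class."""
    if value <= warning:
        return 0.0
    if value >= error:
        return 1.0
    # Linear transition between the warning and the error threshold
    return (value - warning) / (error - warning)


def trend(values, warning, error):
    """Compare the first and last membership degree of a measurement series."""
    degrees = [violation_membership(v, warning, error) for v in values]
    delta = degrees[-1] - degrees[0]
    if delta > 0:
        return "degrading"
    if delta < 0:
        return "improving"
    return "stable"


# Hypothetical response times in seconds, warning at 2.0 s and error at 4.0 s
print(violation_membership(3.0, 2.0, 4.0))
print(trend([2.2, 2.8, 3.4], 2.0, 4.0))
```

None of the sample values crosses the error threshold, yet the series is already reported as degrading, which is the early signal a sharp classification would miss.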

SLOs vs. SLAs. The size of the gaps between monitored and projected service levels could be used to design "service level objectives" ("SLOs") and the timing for the corresponding achievement of "service level agreements" ("SLAs"). As aspirational targets, SLOs do not involve the potential risk of "penalties" for non-compliance, in contrast to binding contractual commitments like SLAs.


6.5 USE CASE: CAPACITY IN CONSUMPTION BASED MODELS

Virtualization technologies require new cost models. IT spending that is based on fixed (capacity-based) SLAs is mostly not aligned to actual demand and the corresponding service levels. IT organizations prefer to pay on demand rather than for capacity. As such, a lot of fixed cost is being remodelled into variable cost and aligned to the business demand cycle. Many innovative models stem from this trend, such as Cloud, Software-as-a-Service (SaaS), Infrastructure-as-a-Service (IaaS) or Usage Based Billing (UBB) models. Most IaaS providers use virtualization technologies to encapsulate applications and components. Statically partitioning the physical resources into virtual machines (VMs) according to the applications’ peak demands leads to poor resource utilization. Overbooking is used to improve the overall resource utilization, and resource capping is applied to achieve performance isolation among co-located applications by guaranteeing that no application can consume more resources than the maximum allocated to it.

Examples are virtualized hosting environments, where the trend is moving from Booked Capacity Models (BCM) to Consumption Based Models (CBM). In a consumption based model, the customer books, for instance, an entitlement of CPU capacity. The Logical Partition (LPAR) is allowed to burst over that entitlement, but burst capacity is not guaranteed. The burst capacity is also measured and charged, but it is cheaper than the booked CPU capacity.
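The charging principle can be illustrated with a short calculation. The rates, the hourly granularity and the usage profile below are hypothetical; they only show that the entitlement is paid for in every period, while burst capacity is charged at a lower rate and only when it is actually used.

```python
def cpu_cost(hourly_usage, entitlement, booked_rate, burst_rate):
    """Cost of a billing period under a consumption based model (sketch).

    hourly_usage: CPU units consumed in each hour
    entitlement:  booked (guaranteed) CPU units, charged for every hour
    booked_rate:  price per booked CPU unit and hour (assumed)
    burst_rate:   price per burst CPU unit and hour (assumed, cheaper)
    """
    booked = entitlement * booked_rate * len(hourly_usage)
    burst_units = sum(max(0.0, used - entitlement) for used in hourly_usage)
    return booked + burst_units * burst_rate


usage = [1.0, 2.0, 3.0, 2.0]  # hypothetical four-hour profile

# Entitlement sized for average demand versus sized for peak demand
print(round(cpu_cost(usage, 2.0, 0.10, 0.06), 2))
print(round(cpu_cost(usage, 3.0, 0.10, 0.06), 2))
```

The smaller entitlement is cheaper for this profile, but the third hour then relies on burst capacity that is not guaranteed.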

FIGURE 52: BOOKED VS. BURST CAPACITY

The decision on the recommended entitlement of higher-priced guaranteed CPU capacity for a component versus cheaper, but non-guaranteed, burst capacity (booked vs. burst) can also be informed by the IFCFIA grid. A component with a tight coupling to business services with a large number of users and a high hourly cost of failure would tend to have a higher sizing in booked and guaranteed capacity and a lower level of overbooking.


With dynamic coupling metrics, measurement tools can even evaluate the degree of coupling during operations, and automated capacity adjustments on infrastructure components can then be triggered using rules based on the IFCFIA.

The described framework is general enough to be applied to any type of IT service, but cloud services in particular can benefit from the IFCFIA. The cloud delivery model determines which performance indicators must be covered in an SLA and will have to address the financial implications of cloud usage. Naturally, cloud capacity is used simultaneously by multiple consumers. All consumers will be asked to pay for their own usage and service performance. As a result, the cloud has to be equipped with SLA-based costing capabilities.

IFCFIA can help to determine the best-suited SLAs for the provisioned cloud services. For cloud delivery models like Platform as a Service (PaaS) or Infrastructure as a Service (IaaS), IFCFIA-based rules can derive capacity requirements which ensure that the required higher-level QoS, such as response times, throughput, availability, reliability and scalability of transactions, is met. Thus, the proposed conceptual IFCFIA framework can help to implement a flexible SLA model where the organization benefits by projecting the capacity and paying only for what is required. Instead of tightening SLAs across the board, which is a costly approach, SLAs can now be driven by business needs.

6.6 USE CASE: 'COST VERSUS BENEFIT' FOR IT INVESTMENTS

Another aspect is the cost justification of improvements to the backend IT Infrastructure that improve the Quality of Service (QoS) on the front-stage or end-user level. It is necessary to demonstrate how the proposed improvements deliver tangible benefits to the business services. Where the proposed improvements require a significant re-investment in the IT Infrastructure, the benefits often need to be expressed in financial terms, i.e. as a business case.

A frequently used technique to justify IT Infrastructure improvements is to quantify the total cost to the organisation of one or more IT Service failures. These costs can then be used to support a business case for additional IT Infrastructure investment and provide an objective 'cost versus benefit' assessment. As discussed, this cost of failure information can be included in our extended IFCFIA grid.

Often the fuzzy element in such business cases is the impact and interdependency of a backend infrastructure component on the supported business services. With the degrees of tight and loose coupling, the IFCFIA grid indicates the direct and indirect dependency of business services on infrastructure elements. Therefore we can also derive from the IFCFIA measurements based on the IFS couplings, to be used within business cases for investments in the IT infrastructure and for recommended component capacity levels.

6.7 EXAMPLE: IFCFIA VERSUS OTHER IMPACT ANALYSIS

This chapter gives a brief example comparing traditional impact analysis with the proposed IFCFIA approach.

FIGURE 53: 3 TIER WEB ARCHITECTURE PROVIDING STATIC AND DYNAMIC WEB PAGES

The system supports a web-based application and is built on the rather common three-tier server configuration (four tiers when also counting the thin client tier). The first server tier, which consists of nodes 1 and 2, comprises Hypertext Transfer Protocol (HTTP) servers in the role of load balancers. The second tier consists of the application server node 3, which creates the dynamic web representation, and an HTTP server node 4 for static web pages. The third tier comprises node 5 as the database server coupled to the application server.

In the scenario the WebSphere Application Server and the Microsoft SQL Database Server are considered by IT Management as business critical and therefore have a defined warm recovery strategy, whereas the File Server, which is used for static web pages only, is not considered business critical and therefore has only a weekly tape backup.

In traditional impact analysis, when either the database server or the application server fails, the business system is considered non-operational. As this application is mission critical, senior management is informed immediately and becomes severely concerned about this out-of-line situation.

But in the scenario the application server is restored within several minutes and the database is recovered in less than one hour. Most end-users do not notice the problem, as they work most of the time with static web pages only. The users affected by the issue take a coffee break and try again some time later, as they already know this kind of temporary disturbance.

The business impact was greatly overestimated by traditional dependency analysis methods because:

Traditional methods cannot consider the fact that only 40% of the transactions use dynamic web pages and 60% use static or already buffered web pages.

In the scenario there are active JDBC calls from the J2EE server to the database only when querying product or customer information which is not already in memory. Even this function is only occasionally used by specific user roles.

The application server is provided via a cloud pooling architecture, and a new active application server instance can be deployed out of the cloud pool in about 10 minutes.

As the database server is classified as mission critical, it has a failover with an MTTR of 45 minutes.

The “recovery point objective” (RPO) as well as the “recovery time objective” (RTO) are defined by the business as one hour. So the maximum tolerable period in which data might be lost due to a major incident is one hour, as is the maximum time allowed to recover the service. Therefore this incident of 45 minutes can be considered as not having a significant business impact at all.

IFCFIA can take all of those aspects into consideration to define a granular impact measurement. IFCFIA will provide a realistic, very low dependency factor, and senior management does not need to be concerned about the described incident.
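To make the argument concrete, the following sketch combines two of the aspects listed above, the share of affected transactions and the outage duration relative to the RTO, into a single score. The scoring formula is purely hypothetical and is not the IFCFIA calculation; it only illustrates why an outage inside the tolerated recovery time yields no experienced impact.

```python
def experienced_impact(affected_share, outage_minutes, rto_minutes):
    """Hypothetical impact score in [0, 1]; NOT a formula defined by IFCFIA.

    affected_share: fraction of transactions that depend on the failed component
    outage_minutes: time until the component is operational again
    rto_minutes:    recovery time objective agreed with the business
    """
    if outage_minutes <= rto_minutes:
        # The business tolerates outages up to the RTO.
        return 0.0
    # Only the part of the outage beyond the RTO counts as disturbance.
    beyond_tolerance = (outage_minutes - rto_minutes) / outage_minutes
    return affected_share * beyond_tolerance


# Database failover from the scenario: 40% dynamic transactions, MTTR 45 min, RTO 60 min
print(experienced_impact(0.40, 45, 60))

# Hypothetical longer outage of the same component, for comparison
print(round(experienced_impact(0.40, 180, 60), 3))
```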

Conversely, let the File Server providing the static web pages fail. As this server is not considered a business critical system component, traditional impact analysis is not primarily concerned about this incident, and management is not informed about it at all. Because of the resulting late notification to IT service management, the File Server backup of the previous week is restored only after two days. In reality the non-operational static web pages block most of the required end-user dialogues and functions. The users get frustrated and stop working with the application (and will also no longer use the properly working transactions); a major business impact results, with a high associated cost of failure.

In this second situation IFCFIA can also granularly predict the impact on the business. The potential risk resulting from the failure of the static file server is therefore judged correctly, which then allows the best-suited mitigation strategy to be selected.

Traditional impact analyses will in most cases over- or underestimate the impact, as no granular dependency assessments are possible. Failover capabilities or fast restoration capabilities, using e.g. cloud pools, minimize the disturbance to higher-level business functions; this needs to be considered besides the negative incident traversing through the system via interdependencies. Therefore a pure fault tree analysis will never give a realistic impact in the presence of today's advanced resilience techniques. A good failover or restoration capability on a lower-level infrastructure component can mitigate the impact on the dependent business functions completely. The real impact depends on whether the system resilience can prevent the incident from being transferred to indirectly coupled components. Rather than judging positive and negative impact aspects in isolation, we need to consider both simultaneously to get realistic results and reliable judgments for the business. Finally, each business can survive a short time with limited or degraded operational system support, so a realistic impact analysis is also measured against the RPO and RTO targets of the business service to define the disturbance experienced by the business.

7. IMPACT MODELS APPLIED IN ITIL V3 BEST PRACTICES - ABILITIES AND LIMITS

7.1 IMPACT ANALYSIS WITHIN ITIL V3 BEST PRACTICES

7.1.1 ITIL V3 SERVICE LIFECYCLE MODULES

Service Management standards are influenced by the range and quality of methods and techniques and by the benefits of established best practices. ITIL (IT Infrastructure Library) provides a best-practice-based framework, developed since the late 1980s by the UK Office of Government Commerce. It is the most widely used and accepted approach to IT Service Management (ITSM) around the world. ITIL includes several valuable management ideas and well-tried procedures. [ITIL V3: A Management Guide (Van Haren Publishing 08)]

Application of IFCFIA to ITIL v3 Service Lifecycle Modules

The following figure shows the five ITIL v3 Service Lifecycle Modules:

FIGURE 54: ITIL SERVICE LIFECYCLE MODULES SOURCE: KRPM.WORDPRESS.COM

7.1.2 ITIL V3 IMPACT ANALYSIS ACTIVITIES AND TOOLS

The following list contains the activities, methods and tools related to impact and dependency analysis that are explicitly mentioned in ITIL v3 (source: http://www.itilexam.net/Glossary.html):

ITIL Service Failure Analysis (SFA) (Service Design): An activity that identifies underlying causes of one or more IT Service interruptions. SFA identifies opportunities to improve the IT Service Provider's Processes and tools, and not just the IT Infrastructure.

ITIL Root Cause Analysis (RCA) (Service Operation): An activity that identifies the root cause of an incident or problem. RCA typically concentrates on IT infrastructure failures.

ITIL Pain Value Analysis (Service Operation): An activity used to help identify the Business Impact of one or more Problems. A formula may be defined to calculate Pain Value based on the number of Users affected, the duration of the Downtime, the Impact on each User, and the cost to the Business.

ITIL Business Impact Analysis (BIA) (Service Strategy): BIA is the Activity in Business Continuity Management that identifies Vital Business Functions and their dependencies. These dependencies may include Suppliers, people, other Business Processes, IT Services etc. BIA defines requirements which include Recovery Time Objectives, Recovery Point Objectives and minimum Service Level Targets for each IT Service.

ITIL Configuration Management Database (CMDB) (Service Transition): A database used to store Configuration Records throughout their Lifecycle. The Configuration Management System maintains one or more CMDBs, which are populated by auto-discovery tools; each CMDB stores Attributes of Configuration Items (CIs) and Relationships with other CIs.

ITIL Fault Tree Analysis (FTA) (Service Design) (Continual Service Improvement): A technique that can be used to determine the chain of Events that leads to a Problem. Fault Tree Analysis represents a chain of Events using Boolean notation in a diagram.

ITIL Component Failure Impact Analysis (CFIA) (Service Design): A technique that helps to identify the impact of CI failure on IT Services. A matrix is created with IT Services on one edge and CIs on the other. This enables the identification of critical CIs (that could cause the failure of multiple IT Services) and of fragile IT Services (that have multiple Single Points of Failure).

ITIL Failure Modes and Effects Analysis (FMEA): This proactive troubleshooting method can be part of a CFIA and allows the engineer to consider how the failure modes of each system component can result in system performance problems and to ensure that appropriate safeguards against such problems are put in place.
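The CFIA matrix described in the list above can be sketched as a small data structure. The service and CI names, the boolean encoding (True marks a CI as a single point of failure for a service) and the thresholds of two are assumptions for illustration, not part of the ITIL definition.

```python
# Hypothetical CFIA matrix: IT Service -> {CI: is this CI a single point of failure?}
cfia_matrix = {
    "Web Shop": {"db-server": True, "app-server": True, "file-server": False},
    "Reporting": {"db-server": True, "app-server": False},
}


def critical_cis(matrix, min_services=2):
    """CIs whose failure could cause the failure of several IT Services."""
    counts = {}
    for cis in matrix.values():
        for ci, is_spof in cis.items():
            if is_spof:
                counts[ci] = counts.get(ci, 0) + 1
    return sorted(ci for ci, n in counts.items() if n >= min_services)


def fragile_services(matrix, min_spofs=2):
    """IT Services that have multiple single points of failure."""
    return sorted(
        service for service, cis in matrix.items()
        if sum(cis.values()) >= min_spofs
    )


print(critical_cis(cfia_matrix))
print(fragile_services(cfia_matrix))
```

Note that each cell of such a matrix is binary: a CI either is or is not a single point of failure for a service.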


7.2 ITIL V3 DEPENDENCY ANALYSIS - ABILITIES AND LIMITS

In this paper we focus on those impact and dependency techniques in IT service management that are explicitly applied within ITIL v3 best practices and management guidance. There are several areas in the ITIL v3 Service Lifecycle Modules where dependency and impact analysis or similar reliability engineering techniques are used. In ITIL v3 those techniques are mainly applied within Service Design, Service Operation and Continual Service Improvement. In the following chapters the four methods and tools referred to in ITIL v3, Configuration Management Database (CMDB), Fault Tree Analysis (FTA), Component Failure Impact Analysis (CFIA) and Business Impact Analysis (BIA), are discussed in detail, as these form the basis for the framework proposed in this project work.

7.2.1 CONFIGURATION AUTO-DISCOVERY

CMDB Discovery Capabilities

A key success factor in implementing a Configuration Management Database (CMDB) is the ability to automatically discover information about the Configuration Items (CIs) (auto-discovery) and to track changes as they happen. A CMDB therefore needs discovery capabilities to explore the details of the managed resources; conversely, a core component of application dependency discovery solutions is the domain configuration database (CMDB).

Application discovery is the process of automatically analysing the artefacts of a software application and the physical elements that constitute a network (e.g. servers, firewalls, etc.). Dependency mapping creates visibility between discovered applications and infrastructure dependencies. Automated application discovery and subsequent dependency mapping can capture, connect and unveil relationships, including the way in which applications behave and relate to the technology architecture on which they rely. This process can result in an automatic discovery of complex business applications and dependencies, which can be done e.g. through application templates created by operations after deployment or through application descriptors that are created at development and deployment time.

The automated discovery engine of a CMDB, which retrieves the attributes of Configuration Items (CIs) and their relationships with other CIs, is provided by tools called Application Dependency Discovery Managers.

Application Dependency Discovery Manager

An Application Dependency Discovery Manager (ADDM) is, as the name implies, an application that discovers dependencies. It is usually used for large applications with many components that have reached a high complexity. The summary based on [Craig, EMA ADDM Radar Dec 10] lists the following six points as minimum criteria for an ADDM.


Automated discovery and modelling of applications, transactions, and/or services

Graphical service model consisting of top-down application / transaction / service flow analysis paired with supporting bottom-up infrastructure / software / metadata relationships

GUI-based reporting

Capabilities supporting discovery and mapping of both custom and packaged applications

Linkages of applications to discovered artefacts via identifying technology

Support for a variety of IT and/or business roles

ADDM tools can create a visual map depicting the managed systems and their interrelations for the specialists and technicians who must work on the account servers and infrastructures. The visibility of the environment is increased. ADDM shows detailed maps of business applications and their relationships to one another. In this way, it provides the critical element not to represent the discovered data in peer-to-peer and hierarchical representation.

Enterprise Management Associates (EMA) is a leading industry analyst firm which created an ADDM Radar Report that provides an overview of the strengths and capabilities of several leading ADDM tool providers. EMA evaluates vendor ADDM solutions based on five key areas represented by the five sides of a hexagram within the Radar. They include Deployment & Administration, Cost Advantage, Architecture & Integration, Functionality, and Vendor Strength. [EMA ADDM Radar Dec 10]

After the first deployment, the architecture can be maintained and most of the dependencies will be automatically refreshed. This is done mostly via an agentless discovery of interdependencies between applications, middleware, servers and network components.

ADDM solutions automate the process of mapping transactions and applications to underlying infrastructure and application interdependencies. They leverage a wide variety of discovery and analysis techniques to create service models in a more or less automated fashion (depending on the vendor and product).

Three of the suite vendors (ASG, HP, and IBM) achieved similar scores on this measurement. Our use case in the later chapter is built with the ADDM suite from IBM, the IBM Tivoli Application Dependency Discovery Manager (TADDM). Therefore the product functionality of TADDM is explained in more detail in the next chapter.


IBM Tivoli Application Dependency Discovery Manager

The IBM Tivoli Application Dependency Discovery Manager (TADDM) is IBM’s auto-discovery solution that provides automated application dependency mapping and configuration auditing. TADDM provides visibility into how the infrastructure actually delivers the applications and services. TADDM can discover the interdependencies between business applications, software applications, and physical components, such as hosts and network devices. [Jacob et al. 2009] describes TADDM capabilities and best practices.

FIGURE 55: TADDM APPLICATION MAPPING SHOWING THE DEPENDENCIES

The TADDM discovery reaches down to layer-2 network devices, storage devices, cross-tier dependencies, and run-time configurations. TADDM employs agent-free discovery, together with a Data Centre Reference Model, to produce complete cross-tier dependency maps and topological views.

The following four steps provide a simplified description of the operation of the TADDM discovery:

1. The agent-free discovery engine instructs the discovery sensors to determine and collect the identity and settings of each application, system, and network component.

2. The discovered data is fed to the Data Centre Reference Model, which creates the specific runtime cross-tier application topologies.

3. The topologies, along with their configuration data, interdependencies, and change history, are stored in the Configuration Management Database (CMDB).

4. The product console provides analytics and topological views of the CMDB.

As part of the discovery process, the TADDM discovery feature examines the configuration of each device and discovers the ports that are assigned to the applications. The discovery feature uses this information to determine relationships and dependencies between applications and other discovered components. A dependent component relies on data or configurations from another component, and a provider component provides information to a dependent component. The basic automated discovery finds dependencies by looking either at the TCP connections or by evaluating the configuration of programs (e.g. JDBC resources returned by JMX).

In principle two types of dependencies can be discovered automatically: transactional dependencies and service dependencies. Transactional dependencies occur between application components, such as Web servers, application servers, and databases. The dependent component issues requests to the provider component in order to perform certain functions, such as JDBC calls from a J2EE server to a database. In this case, the provider is often referred to as a server and the dependent as a client. Service dependencies occur between application components and infrastructure services, such as DNS, LDAP, and NFS. The provider is the infrastructure service, and the dependent component requests system services from the provider, such as a request to map a DNS name to an IP address. More details about TADDM discovery can be found at the IBM developerWorks site.
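The distinction between the two dependency types can be illustrated with a small sketch that classifies observed TCP connections. This is not the TADDM API or its discovery logic: the tuple layout, the host names and the port-based classification are assumptions made for the example. Only the well-known default ports of DNS (53), LDAP (389) and NFS (2049) are taken as given.

```python
# Well-known default ports of the infrastructure services named above
INFRASTRUCTURE_PORTS = {53: "DNS", 389: "LDAP", 2049: "NFS"}


def classify_dependencies(connections):
    """Turn (dependent, provider, provider_port) tuples into typed dependencies.

    A connection to an infrastructure service port is treated as a service
    dependency, every other connection as a transactional dependency
    (a simplifying assumption for this sketch).
    """
    dependencies = []
    for dependent, provider, port in connections:
        kind = "service" if port in INFRASTRUCTURE_PORTS else "transactional"
        dependencies.append(
            {"dependent": dependent, "provider": provider, "type": kind}
        )
    return dependencies


# Hypothetical observed connections
observed = [
    ("app-server", "db-server", 1433),  # JDBC call to a database
    ("app-server", "dns-1", 53),        # name resolution request
]

for dependency in classify_dependencies(observed):
    print(dependency)
```

Because the input consists of observed connections, a dependency that is not active at discovery time does not appear in the result.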

The relationships discovered with TADDM are defined as part of the Common Data Model (CDM). This is an information model that provides consistent definitions for managed resources, business systems and processes and the relationships between those elements. CDM is based on the Unified Modelling Language (UML) and on the DMTF’s CIM object model, and it assigns a specific semantic meaning to each relationship. More information about the CDM can be found at http://www.dmtf.org, and [Tai et al. 2008] explains the IBM Tivoli Common Data Model: Guide to Best Practices.

After TADDM discovers or imports all of the IT resources from different sources, we can create business views or services manually or automatically in TADDM. A business application is a way to group different kinds of IT resources into a logical group which acts together as one unit to provide some kind of service. The top level in the component hierarchy of TADDM is the business service. Business services can contain any number of lower-level resources, from business applications to EAR modules in a J2EE server or specific configuration files on systems. The purpose of the business service is to consolidate multiple lower-level objects and their relationships in order to perform reporting and analysis considering all related resources.

Limitations in automated application dependency discovery

The notion that a single ADDM tool can support every use case is being replaced by a more pragmatic view that asset-centric requirements, change and configuration automation, and performance management will require different details and differing types of application/infrastructure interdependencies. [EMA ADDM Radar Dec. 10]

By discovering interdependencies between and among applications and underlying systems, ADDM products deliver only a point-in-time view of the “truth.”

The automated discovery finds dependencies by looking either at the TCP connections or by evaluating the configuration of programs (e.g. JDBC resources returned by JMX). These automatic discovery capabilities are limited to active connections and do not provide a complete picture of all dependencies.

Only physical dependencies can be discovered by ADDM. These need to be extended with additional logical dependencies. For the functioning of an information system we also need to know about dependencies on e.g. IT users, IT staff and business units and on supporting processes and functions such as the helpdesk. This can be expressed with a logical relationship like “is coupled to” (a procedure, an SLA or even a manual) or “is supported by” (for support functions like help desk or maintenance organizations).


7.2.2 FAULT TREE ANALYSIS

Overview

Root Cause Analysis (RCA) is an ITIL v3 activity in service operation that identifies the root cause of an incident or problem. This section briefly presents the most commonly used root cause analysis method, which is also referred to in ITIL v3 best practices: Fault Tree Analysis (FTA). Fault tree analysis is a top-down, deductive failure analysis in which a state of a system is analysed using Boolean logic to combine a series of lower-level events. Events in a fault tree are associated with statistical probabilities. The output probabilities of the fault tree follow from the set operations of Boolean logic. This allows the probabilities P of several independent states A, B to be combined with the logic operators AND, OR, XOR.
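For two independent states A and B, the standard probability identities behind these three operators can be written as follows (the notation is chosen here for illustration and is not taken from the source):

```latex
\begin{align}
P(A \text{ AND } B) &= P(A)\,P(B) \\
P(A \text{ OR } B)  &= P(A) + P(B) - P(A)\,P(B) \\
P(A \text{ XOR } B) &= P(A) + P(B) - 2\,P(A)\,P(B)
\end{align}
```

The OR expression is equivalent to $1 - (1 - P(A))(1 - P(B))$, which generalises directly to more than two redundant components.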

Below is an example of a KQI (End-User Application Availability) determination based on several System PIs (Component Availability) using a basic FTA method.

FIGURE 56: FRONTEND AVAILABILITY CALCULATION BASED ON COMPONENT AVAILABILITIES

In the example, the availability of the individual, independent components is shown on the left side. The backend hosts are mirrored, which means that the probability of having at least one host up (OR) is (0.98 + 0.98) - (0.98 * 0.98) = 0.9996, i.e. 99.96%. The total end-user availability can be calculated as the product (AND) of the individual component availabilities: 0.9996 * 0.98 * 0.975 * 0.98 ≈ 0.9360, i.e. about 93.60%. In this scenario the model can be used to calculate the impact of a concrete sub-layer performance, e.g. the impact of backend server availability on the end-user KQI availability.
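The calculation can be reproduced with a short script. It is a minimal sketch that uses the availability factors quoted in the text; the function names are chosen for illustration, and the OR gate is written in the equivalent form 1 - (1 - a)(1 - b).

```python
from math import prod


def parallel(*availabilities):
    """OR gate: at least one of the redundant, independent components is up."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def series(*availabilities):
    """AND gate: every independent component in the chain must be up."""
    return prod(availabilities)


# Two mirrored backend hosts with 98% availability each
hosts = parallel(0.98, 0.98)

# End-user availability as the product of the chain of component availabilities
end_user = series(hosts, 0.98, 0.975, 0.98)

print(f"mirrored hosts: {hosts:.4f}")
print(f"end-user KQI:   {end_user:.4f}")
```

Replacing the mirrored pair by a single host, `series(0.98, 0.98, 0.975, 0.98)`, shows the effect of the redundancy on the end-user KQI.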

By using the FTA method, this example implements the simplest variant of the KQI/PI hierarchy model described above. FTA can fully map the scenario when we limit the set to exactly one KQI/PI (here availability) with a binary two-state scenario: operational or failed.

FTA involves backward reasoning through successive refinements from general to specific. As a deductive methodology it examines preceding events leading to failure in a relational sequence. The resulting fault tree is a graphical representation of the potential combinations of failures that generated the incident. The tree starts with a ‘top event’ representing the analysed incident and decomposes it into contributory events and their relationships until the root causes are identified.

There are also several concepts available for extending classical FTA with fuzzy logic and fuzzy elements. [Tyagi et al. 2010], in a fuzzy set theoretic approach to fault tree analysis, describe several fuzzy FTA operators on triangular and trapezoidal fuzzy numbers. Traditionally, fault trees have been used to assess fixed probabilities (i.e. each event that comprises the tree has a fixed probability of occurring). Fuzzy concepts can substitute static probabilities by employing the possibility of failure as a fuzzy set defined in the probability space.
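The following sketch shows how AND and OR gates might be evaluated on triangular fuzzy failure possibilities. It uses the component-wise approximation that is common in fuzzy fault tree analysis; it is not necessarily the operator set defined in [Tyagi et al. 2010], and the class name and sample values are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triangular:
    """Triangular fuzzy number (low, mode, high) for a failure possibility."""
    low: float
    mode: float
    high: float


def fuzzy_and(a, b):
    """AND gate: both events must occur (component-wise product approximation)."""
    return Triangular(a.low * b.low, a.mode * b.mode, a.high * b.high)


def fuzzy_or(a, b):
    """OR gate: at least one event occurs (component-wise 1 - (1-a)(1-b))."""
    return Triangular(
        1 - (1 - a.low) * (1 - b.low),
        1 - (1 - a.mode) * (1 - b.mode),
        1 - (1 - a.high) * (1 - b.high),
    )


# Hypothetical fuzzy failure possibilities of two mirrored hosts and a database
host_a = Triangular(0.01, 0.02, 0.04)
host_b = Triangular(0.01, 0.02, 0.04)
database = Triangular(0.01, 0.025, 0.05)

# Top event: both mirrored hosts fail, or the database fails
top_event = fuzzy_or(fuzzy_and(host_a, host_b), database)
print(top_event)
```

Instead of a single failure probability, the top event carries a lower bound, a most plausible value and an upper bound.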

The following graph shows a classical fault tree for a web-based 3-tier (Web server, Application Server and Database) application system. This basic example will later be extended with fuzzy and more granular relationships.

FIGURE 57: FTA FOR FAULT TREE FOR WEB-BASED SYSTEM


Limitations of Fault Tree Analysis

Even though classical FTA provides a large number of usage capabilities, we have seen several limitations in applying FTA to the KQI/PI hierarchy model.

The fault tree must explicitly show all the different relationships that are necessary to result in the top event. Constructing this fault tree requires a thorough understanding of the logic and basic causes leading to the top event. In practice this is not possible in complex virtualized environments.

FTA is not (fully) suitable for modelling dynamic scenarios. Classical FTA is binary (fail–success) and may therefore fail (as most deductive dependency models do) to address soft dependency problems as needed for PQI/PI relationships.

The FTA model always requires a compliance relationship, which states that “KQI/PI A will be violated if a related KQI/PI B is violated”. In our concept we need a softer form of the compliance relationship, the dependency relationship, which states that “KQI/PI A could be violated if a related KQI/PI B is violated”.

FTA treats the assessment of a component's state as a binary result (normal/failed). In practice we have softer dependency relationships, which also allow the more complex consideration of a degraded mode of operation or the concept of probability (global view) / possibility (local view) of a failure.

FTA supports a single event as the top event; to analyse other types of failures, additional fault trees must be developed. The level of detail, the types of events included, and the organization of the tree can vary significantly from analyst to analyst. Because an FTA does not produce a unique answer, the value of an FTA depends on the skill and experience of the analyst. The accuracy of FTA results depends on data that is often difficult to obtain.

FTA, as the term fault tree indicates, works in the "failure space" and looks at system failure combinations. The FTA method therefore mainly covers the negative risk of interdependencies and the negative impacts of failure. It says nothing about the positive aspect of independence via the mitigation, restoration and resilience capabilities that are built into the system.

Finally, FTA relies on a number of simplifying assumptions that may not hold true for every system. In practice we often have n:m relationships, which means the tree structure can be nested and an incident may have several parallel impacts. These result in correlated error situations, which in combination may lead to a business service failure. This complexity can hardly be described mathematically with traditional FTA methods.

Other Root Cause Analysis Methods for IT Incidents

There are also other Root Cause Analysis (RCA) methods used in IT Service Management for incident investigations. They are only briefly listed here, as we focus on the methods referred to by ITIL v3; however, the principal fuzzy methodology proposed in the following concepts can be applied in a very similar way to most of them.

Reliability Block Diagram (RBD): RBDs provide a graphical means for representing availability-related system dependencies. The most fundamental difference between Fault Tree Analysis (FTA) and RBDs is that in an RBD one works in the "success space" and thus looks at combinations of system successes, while in a fault tree one works in the "failure space" and looks at combinations of system failures. Traditionally, fault trees have been used to assess fixed probabilities (i.e. each event that comprises the tree has a fixed probability of occurring), while RBDs may include time-varying distributions for the success (reliability equation) and other properties, such as repair/restoration distributions.

Cause-Effect Analysis: The cause-effect analysis uses fishbone (Ishikawa) diagrams to illustrate how various causes can be linked to an identified effect. There may be a series of causes that can be identified, one leading to another. This series should be pursued until the fundamental, correctable cause has been identified.

7.2.3 COMPONENT FAILURE IMPACT ANALYSIS (CFIA)

Overview

The purpose of a Component Failure Impact Analysis (CFIA) is to provide a systematic approach that assists management in predicting and evaluating the impact of component failures on IT systems. Component failures include hardware and software but should also cover the processes, tools and people that support the systems. This provides a starting point to consider ‘availability management’ approaches and techniques to mitigate or avoid the impact of failures. The CFIA study is a development of a technique first used by IBM in the early 1970s to analyse the impact of hardware and software failures on applications. The study has evolved into a systematic IT infrastructure assessment that is relevant to the challenges of a wide range of current system and service delivery organisations. To provide a relevant assessment, the threat assessment addresses not only the physical components of the service but also examines the systems management framework, the supporting tools and the skills within the delivery organisation. The objective of conducting a CFIA study is to manage the risk of system failure through predictive analysis of the current state of the ‘in scope’ solution. This ranges from identification of ‘Single Points of Failure’ (SPoF) to identification of methods to reduce the recovery time of a system.

The CFIA or threat assessment approach sits within the ITIL v2 ‘Availability Management’ process or ITIL v3 Continual Service Improvement (CSI), as its purpose is to provide a systematic approach to service improvement and to achieving agreed service levels.

[ITIL Glossary 2011] CFIA: A matrix is created with IT services on one axis and CIs on the other. (Thus an application is considered as higher-level IT service to the business). This enables the identification of critical CIs (that could cause the failure of multiple IT services) and fragile IT services (that have multiple single points of failure).

FIGURE 58: EXAMPLE BASIC CFIA MATRIX

A basic CFIA targets a specific section of the infrastructure and looks only at simple binary choices (e.g. if we lose component x, will a service stop working?). More advanced CFIAs can be expanded to include a number of variables, such as likelihood of failure, recovery time and cost. The grid (using a spreadsheet or graph paper) lists the components in one column and the IT Service(s) across the top row. An indicator then records whether a component failure causes an outage and whether there is an immediate backup (“hot-start” or “warm-start”). With this grid several questions can be examined, such as “is a component a Single Point of Failure (SPOF)?”, “what is the business/customer impact of this component failing?”, “how many users would be impacted?”, “what would be the cost to the business?”, “what is the probability of failure?”, “what can be done to prevent this impact?”, “is there a need for additional redundancy or some form of resiliency?”, “what would redundancy cost?” etc.
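The basic grid and the questions it answers can be illustrated with a small data structure. The following is a minimal sketch; the marker convention ("X" for a failure that causes an outage, "A" for an immediate backup), the component names and the function names are assumptions made for this example.

```python
# Hypothetical CFIA grid: rows are configuration items (CIs), columns are IT services.
# Assumed markers: "X" = failure of the CI causes an outage of the service,
# "A" = an immediate backup exists (hot/warm start), missing entry = no dependency.
CFIA_GRID = {
    "router":       {"email": "X", "webshop": "X"},
    "mail_server":  {"email": "X"},
    "db_server":    {"webshop": "X"},
    "app_server_1": {"webshop": "A"},
    "app_server_2": {"webshop": "A"},
}


def single_points_of_failure(grid, service):
    """CIs whose failure alone takes the given service down."""
    return sorted(ci for ci, row in grid.items() if row.get(service) == "X")


def critical_cis(grid, min_services=2):
    """CIs that could cause the failure of multiple IT services."""
    return sorted(
        ci for ci, row in grid.items()
        if sum(1 for marker in row.values() if marker == "X") >= min_services
    )


def fragile_services(grid, min_spofs=2):
    """IT services that have multiple single points of failure."""
    services = {service for row in grid.values() for service in row}
    return sorted(
        s for s in services if len(single_points_of_failure(grid, s)) >= min_spofs
    )


print(single_points_of_failure(CFIA_GRID, "webshop"))
print(critical_cis(CFIA_GRID))
print(fragile_services(CFIA_GRID))
```

Further columns such as probability of failure, recovery time or cost can be added by replacing the single marker with a small record per cell.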

Some examples of additional fields that can be included in the grid are as follows:

Probability of Failure - this can be based on the Mean Time Between Failure (MTBF) information if available or on the current availability trends. This can be expressed as a low/medium/high indicator or as a numeric representation.

Recovery Time - this is the estimated time to recover the component. This can be based on recent recovery timings, recovery information from disaster recovery testing or a scheduled test recovery.

Recovery procedures - this is to verify that up-to-date recovery procedures are available for the component.

Device Independence - where components have duplex physical files to provide resilience.

A complete recovery process may be assessed, including recovery timings, to enable IT to provide the business with accurate estimates of when service can be restored and of the alternative recovery options available in the event of a failure, providing confidence that valid recovery procedures exist for each component.
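As an illustration of how the "Probability of Failure" and "Recovery Time" fields could be derived, the sketch below uses the common steady-state relation availability = MTBF / (MTBF + MTTR). This relation is standard reliability engineering practice and is not stated in the text above; the thresholds for the low/medium/high indicator and the function names are purely hypothetical.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time a component is up, from MTBF and mean recovery time."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def failure_indicator(unavailability: float, low=0.001, medium=0.01) -> str:
    """Map a numeric unavailability to a low/medium/high indicator (assumed thresholds)."""
    if unavailability < low:
        return "low"
    if unavailability < medium:
        return "medium"
    return "high"


# Hypothetical component: fails every 980 hours on average, 20 hours to recover.
availability = steady_state_availability(mtbf_hours=980, mttr_hours=20)
indicator = failure_indicator(1.0 - availability)
print(f"availability = {availability:.3f}, indicator = {indicator}")
```

Either representation (numeric or indicator) can then be stored as an additional column of the CFIA grid.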

CFIA including Failure Modes and Effect Analysis

By adding further attributes, the CFIA (Component Failure Impact Analysis) can be conducted as an extension of the proactive troubleshooting method called Failure Modes and Effect Analysis (FMEA), as also defined in the Service Strategy of ITIL v3. Sometimes the name Failure Modes, Effects and Criticality Analysis (FMECA) is also used for the same approach. A CFIA based on the FMEA approach also considers the different failure modes of each lower-level system component. This can be shown in the normal CFIA matrix with IT services on one axis and components with their different failure modes on the other. Each component line is thus duplicated for each investigated failure mode.

[Bailey et al. 2008] provide, in the IBM Systems Journal, Vol. 47, No. 4, 2008, an example of a CFIA with failure modes. This technique allows the engineer to consider how the failure modes of each system component can result in system performance problems and to ensure that appropriate safeguards against such problems are put in place. FMEA provides input to the test strategy and test case design, allowing early identification of negative test scenarios.

The example below shows that there are two ways in which the function can fail: crash and hang. For each of these failure modes, the CFIA (FMEA) worksheet captures details about the type of failure. For example, a crash can be detected by the system monitoring software, whereas a hang is more difficult to detect and needs a manual check.

CFIA can be a very useful tool as it creates a visual tabular view of services and their required component items and shows how the infrastructure is arranged and organized and how its parts depend on each other. A basic CFIA targets a specific section of the infrastructure and looks only at simple scenarios (e.g. if we lose component x, will a business service be impacted?).

More advanced CFIAs can be expanded to include a number of variables, such as likelihood of failure, repair and recovery time, detailed recovery procedures, organizational assignments and integration into wider service management processes, and can also consider and evaluate different component failure modes (FMEA).

The following figure is an excerpt of a CFIA worksheet with failure modes shown by [BAILEY ET AL. IBM SYSTEMS JOURNAL VOL 47, NO 4, 2008]

FIGURE 59: CFIA WORKSHEET WITH FAILURE MODES [BAILEY ET AL 2008]

Limitations of Component Failure Impact Analysis

CFIA provides a static system analysis that does not consider the impact of multiple component failures or of latent defects that affect timing and sequencing.

The notion that a single grid can support every use case is being replaced by a more pragmatic view that the CFIA grid should focus on a specific set of requirements and usage.

The interdependencies between and among applications and underlying systems need to be constructed in a theoretical and feasible (mostly manual) way. ADDM tools can only support a set of basic technical relations and need to be extended with all kinds of logical dependencies.

CFIA can answer the question “Which are the indirectly dependent business services of a particular component x?” but cannot comment on the type of dependency or on the degree to which the services are logically coupled and impacted.

CFIAs have no ability to consider a level of vagueness. Yet any assessment in practice involves some vagueness, uncertainty, limited or imprecise knowledge, unproven information or simple hesitancy to make a statement.

In practice, soft dependency relationships are often required, which also allow the more complex consideration of a degraded mode of operation or the concept of probability of a failure. This is not supported by the classical CFIA approaches and tools.

7.2.4 BUSINESS IMPACT ANALYSIS (BIA)

Overview

In case of an incident, Root Cause Analysis (RCA), as part of the ITIL problem management process, concentrates on identifying the underlying problem of an incident that interferes with operating an IT service. RCA focuses only on technology components and a single problem. It is a tactical activity of daily operations, conducted by the application/infrastructure support team in dealing with IT service problems.

There is a second ITIL v3 activity, the Service Failure Analysis (SFA), which is more strategic in nature and aims to improve the quality of IT services as a whole. SFA is an input for IT management to take strategic steps in increasing the value of the IT organization. The SFA approach therefore covers not only the technology components of a service but also the internal processes of the IT organization that provides these services, and it can be considered part of the quality improvement activities for IT services. Both RCA and SFA lead to the question of how to identify the business impact of possible or detected problems. A formula may be defined to calculate the Pain Value based on the number of users affected, the duration of the downtime, the impact on each user, and the cost to the business.

For Service Operation the impact analysis is called Pain Value Analysis (PVA); as an activity of Service Design or Business Continuity Management it is referred to as Business Impact Analysis (BIA). BIA identifies vital business functions and their dependencies. These dependencies may include suppliers, business processes, IT Services etc. As output, BIA defines the requirements, which include recovery time objectives and minimum Service Level Targets for each IT Service.

The Component Failure Impact Analysis (CFIA) matrix developed during the activities described in the previous chapters can be expanded to include fields that map the number of users supported by each business service, so that the component coupling to the higher-level services also indicates the users affected by a degraded operation of an infrastructure node. Thus, when a component is unavailable, the number of users impacted is understood. This can enable cost calculations to be based on the number of users impacted and/or the amount of lost user processing time, or even the total cost of unavailability.

However, the number of user workstations does not necessarily equate to the number of users at one point in time. So other measurements of costs of failure should complement these numbers, like:

Each SLA normally contains a set of penalty clauses that apply when service providers fail to deliver the pre-agreed quality. There can also be a reward for over-achieving the SLO target.

Estimation of the financial impact of IT failure against the transaction volumes (related to the vital business functions) normally processed during the period of failure.

For organisations unable to justify the failure costs via more advanced measurement techniques, a 'user assessment' of a monetary hourly value is a simple technique that provides a business and user view of the business service cost of non-availability.

For certain businesses a consequence of IT failure may be claims for financial compensation by impacted customers, for example for the loss of interest due to delayed payments.

An approach to obtain an indicative cost of one hour of unavailability is to take the annual cost to the business of taking the service and simply divide it by the number of service hours contracted in the SLA for a year.

As an example, the hourly cost of failure can be calculated as follows:

Total Cost of Failure per hour =

User productivity loss (hourly costs of all affected users)
+ IT productivity loss (hourly costs of affected IT staff)
+ Lost revenue (lost business cost)
+ Other losses (overtime, materials, penalties and fines)
+ Monetary value for loss of trust and reputation.
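The cost formula and the indicative hourly cost (annual service cost divided by contracted service hours) can be written down directly. The sketch below is only an illustration; the class and function names are hypothetical and all monetary figures are invented example values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HourlyFailureCost:
    """Cost components of one hour of service failure (same currency unit)."""
    user_productivity_loss: float  # hourly costs of all affected users
    it_productivity_loss: float    # hourly costs of affected IT staff
    lost_revenue: float            # lost business cost
    other_losses: float            # overtime, materials, penalties and fines
    reputation_loss: float         # monetary value for loss of trust and reputation

    def total(self) -> float:
        return (
            self.user_productivity_loss
            + self.it_productivity_loss
            + self.lost_revenue
            + self.other_losses
            + self.reputation_loss
        )


def indicative_hourly_cost(annual_service_cost: float, contracted_hours: float) -> float:
    """Annual cost of the service divided by the service hours contracted per year."""
    if contracted_hours <= 0:
        raise ValueError("contracted service hours must be positive")
    return annual_service_cost / contracted_hours


# Invented example figures.
example = HourlyFailureCost(4000.0, 300.0, 2500.0, 200.0, 500.0)
print(example.total())

# 50 service hours per week over 52 weeks = 2,600 contracted hours.
print(indicative_hourly_cost(520_000.0, 2_600.0))
```

The reputation component is the least tangible input; keeping it as a separate field makes it easy to report the total with and without it.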

Remark: While traditional IT measures may show the percentage of the SLA target met, this does little to change the feeling of dissatisfaction if IT service problems have impacted the business operation. Therefore reputation-based metrics for measuring trust are also a well-established and relevant technique.

BIA can also be used to justify IT infrastructure improvements by quantifying the total cost to the organisation of IT Service failure(s). These costs can then be used to support a business case for additional IT infrastructure investment and provide an objective 'cost versus benefit' assessment.

Limitations in BIA

Business impact is hard to measure, as it can have several consequences, from financial impact to fuzzy aspects like a feeling of dissatisfaction if IT service problems occur.

The business impact of a failure is hard to quantify in monetary terms, for example “user productivity loss”, “IT productivity loss”, “lost business cost”, losses like overtime (maybe not paid extra), materials, possible resulting future penalties/fines, and especially the value of lost trust and reputation.

Often the fuzzy element in BIA is the impact and interdependency of the lower-level services and backend infrastructure components on the supported business services. The interdependencies between and among applications and underlying systems need to be constructed in a theoretical and feasible (mostly manual) way.

For BIA it is not enough to answer “which are the indirectly dependent business services of a particular component x?”; we also need to assess the type of dependency and the degree to which the services are coupled and impacted.

Components may have different failure modes with differing business impact. For example, there are two ways in which a function can fail: crash and hang. For each of these failure modes a different business impact needs to be assessed.

IT organizations mostly use an indicative cost of failure for a single time period or a range of time periods, e.g. the agreed service hours. So we need to calculate the indicative hourly cost of a failure by taking the annual cost to the business of taking the service and simply dividing it by the number of service hours contracted in the SLA for a year. This gives the IT expenditure cost to the business per hour.

BIA provides a static view that does not consider the impact of multiple component failures or of latent defects that affect timing and sequencing.

As BIA assessments are fuzzy in nature, we need the ability to consider a level of vagueness. Assessment in practice involves vagueness, uncertainty, limited or imprecise knowledge, unproven information or simple hesitancy to make a statement. This is not supported by the classical BIA approaches and tools.

In practice we have soft dependency relationships that also allow the more complex consideration of a degraded business impact or the concept of probability (global view) / possibility (local view) of a business impact. This is not supported by the classical BIA approaches and tools.

The business impact of a specific incident depends upon how closely it is related to a stakeholder's concerns. Various stakeholders may have their individual concerns and requirements, which lead to a different subjective impact assessment. Classical BIA assessments do not support particular views of the same incident impact, i.e. some kind of “attitude”-based impact assessment with regard to the stakeholders' concerns and attitude.

7.2.5 SUMMARY IT IMPACT ANALYSIS IN ITIL V3

ADDM has its roots in an application management perspective and originally aimed to streamline the infrastructure management processes. ADDM introduces a level of trust that discovered information is no longer hypothetical, but real. By automatically discovering interdependencies between and among applications and underlying systems, ADDM products deliver a point-in-time view of the “truth.” This can be a powerful enabler that, over time, can minimize the effort IT organizations expend on the information assimilation function and can also provide a basis for ever-higher levels of automated problem resolution [EMA Radar Dec.10].

But as discussed, there are several limitations in using CMDB/ADDM tools. On the one hand they reduce dependency on the human factor, but on the other hand they can provide only a basic view for impact assessments of business services, as logical dependencies cannot be discovered and must again be complemented by human interaction. Automated discovery finds dependencies by looking either at the TCP connections or at the configuration of programs, which does not provide insight into the consequences for impacted higher-level services and SLAs. So the ADDM picture needs to be extended with additional logical dependencies. This goes far beyond the scope of ADDM tools, since for the functioning of an information system we also need to know about dependencies on e.g. IT users, IT staff and business units, and on supporting processes and functions, e.g. the helpdesk. ADDM records the assessment of the component relations as a simple result (connected/not connected). This can hardly be interpreted for impact assessments and dependency couplings, but it gives a fundamental view of related and interfacing infrastructure components.

The Component Failure Impact Analysis (CFIA) method helps by providing a systematic approach that assists management in predicting and evaluating the impact of component failures on IT systems. It extends the pure system view (hardware and software) of component failures to include also the processes, tools and people that support the systems. This provides a starting point to consider different management approaches and techniques to mitigate or avoid the impact of failures. CFIA is thus not a purely technical solution but a methodological one. It provides a relevant assessment of the physical components of the service, and also examines the systems management framework, the supporting tools and the skills within the delivery organization.

Fault Tree Analysis adds a logical representation of all the different relationships that are necessary to result in the top event. Constructing this fault tree requires a thorough understanding of the logic and basic causes leading to the top event. The FTA can be incorporated within the CFIA matrix to assess the dependencies of a business service. The major limitation here is that classical FTA is binary (fail–success) and may therefore fail (as most deductive dependency models do) to address soft dependency problems as needed for PQI/PI relationships. In practice we have softer dependency relationships, which also allow the more complex consideration of a degraded mode of operation or the concept of probability (global view) / possibility (local view) of a failure.

One of the key benefits resulting from the application of FTA techniques is that they force the analyst to follow a systematic procedure of analysis of the system. In most cases, the mere construction of the model leads to a better understanding of the system design, including aspects such as component interdependencies and reliability weaknesses. Because an FTA does not produce a unique answer, the value of an FTA still depends on the skill and experience of the analyst.

Having created the CFIA including the dependencies, we can expand the grid to include fields related to the Business Value (BIA) and the Cost of Failure of a Service. These fields can simply show the hourly failure cost to the business or can map the number of users supported by each business service. The component coupling to the higher-level services thus also indicates the cost and the users affected by a degraded operation of an infrastructure node.

The same BIA estimate used during operation to assess the business impact of an incident can also be used to justify IT infrastructure improvements by quantifying the total cost to the organisation of IT Service failure(s). These costs can then be used to support a business case for additional IT infrastructure investment and provide an objective 'cost versus benefit' assessment.

7.3 RECOMMENDATIONS AND PROPOSED EXTENSIONS

Based on the previous discussion, this chapter proposes recommendations and best practices and lists useful and required extensions of traditional impact assessments and dependency analysis:

1. The notion that a single method can support every use case should be replaced by a more complete view that may include several combined and integrated methods to provide the needed results. We therefore recommend that all the major methods described, CMDB/ADDM, FTA, CFIA and BIA, be leveraged to provide the overall dependency picture and to show the different aspects of an impact assessment.

2. CFIA is best suited as the overall frame for incorporating all data and methods. CFIA can be freely extended with different kinds of variables showing failure modes, several reliability parameters, and operational capabilities and techniques, and it extends the pure system view (hardware and software) of component failures to include also the processes, tools and people that support the systems. This is necessary because, for the functioning of an information system, we also need to know about dependencies on e.g. IT users, IT staff, business units and supporting processes like backups and functions like helpdesks.

3. The initial CFIA grid is best set up using auto-discovery tools (ADDM), which provide trust that the discovered information is real and up to date. By automatically discovering interdependencies between and among applications and underlying systems, ADDM products act as a powerful enabler that minimizes the effort IT organizations expend on the information assimilation function and can also provide a basis for further higher-level, logical dependency assessments.

4. We recommend that a Fault Tree Analysis (FTA) be incorporated in the CFIA matrix creation process to assess the dependencies of components on a business service. The use of FTA enables the identification of dependent components that could cause the failure of the IT business services when an incident occurs. The basic step of the CFIA, creating a grid with components on one axis and the IT Services that depend on those components on the other, can be built using the results of the FTA. We therefore recommend an export from the FTA tools to automate the definition of the grid of lower-level components for each business service.

5. As classical FTA is bi-modal (fail–success) and cannot address soft dependency problems as required for the described PQI/PI relationships, we recommend extending the traditional concepts with a limited or partial dependency model. In practice we have weaker or softer dependency relationships, which also allow the more complex consideration of a degraded mode of operation or the concept of probability (global view) / possibility (local view) of a dependency or impact. In our next approach this can be modelled via fuzzy extensions of the classical FTA.

6. Impact assessments are fuzzy in nature, so we also recommend the ability to consider the level of vagueness. Any assessment in practice involves things like vagueness, uncertainty, limited or imprecise knowledge, unproven information or simple hesitancy to make a statement. This is not supported by the classical approaches and tools.

7. Impact assessments on complex systems need to consider contrary aspects. On the one hand we have the risk resulting from interdependencies between interacting and related components; on the other hand each component has a set of mitigation, restoration and resilience capabilities. This means we naturally need an approach that envisages positive and negative instances of the same dependency and impact relationship. Only both aspects together, positive and negative, define the overall system behaviour and the probable impact on the dependent business service. Considering and judging positive and negative dependency aspects in isolation will not lead to real-world results. This thesis will be explained more closely in this paper. The discussed traditional methods already cover both aspects. Fault Tree Analysis (FTA), as the term fault tree indicates, works in the "failure space" and looks at system failure combinations. The FTA method thus covers the aspect of negative risk of interdependencies and negative impacts of failure. The basic CFIA itself is primarily focused on the mitigation, restoration and resilience capabilities, which represent the positive aspect of independence. Our proposal recommends considering the real-world impact of an incident by merging both aspects simultaneously into one integrated result set.

8. Finally, the viewpoint is also an important concept. It is basically a specification that describes a particular view of the service, which is an important parameter for performing an impact assessment. A viewpoint is defined with a particular stakeholder or set of stakeholders in mind and allows different stakeholders to focus on their own concerns. The impact of a specific incident depends upon how closely it is related to a stakeholder's concerns and requirements. Various stakeholders may have their individual concerns, which lead to a different subjective impact assessment. Therefore we propose to support some kind of “attitude”-based impact assessment model that allows a parameterized impact assessment with regard to the stakeholders' attitude and concerns.

These recommendations are the basis for our proposed soft or gradual dependency framework (referred to as IFCFIA), which will help to implement the requirements and recommendations described above.

7.4 APPLYING IFCFIA TO EXTEND ITIL QUALITY METHODS

ITIL v3 best practices explicitly apply reliability engineering techniques and impact analysis methods, and different ITIL v3 processes and activities are linked to them. The need to translate and correlate high-level requirements, service qualities and policies of all kinds down to the infrastructure level is a key issue in ITIL processes that are part of several Service Lifecycle modules.

The proposed IFCFIA framework in this thesis is designed to incorporate and naturally extend the existing ITIL quality methods rather than to replace them with an isolated new approach.

IFCFIA helps to translate enterprise SLAs for business applications into measurable parameters for technical services, which can be defined and reported against an SLA and monitored under Service Level Management. This is a key challenge for ITIL Service Level Management (SLM), which affects Service Design and Continual Service Improvement. SLM is the process responsible for negotiating Service Level Agreements and ensuring that these are met over all layers. SLM is responsible for ensuring that all IT Service Management processes, Operational Level Agreements and underpinning contracts on all lower-level services and components are appropriate for the agreed Service Level Targets of the composite business application.

ITIL Availability Management and Component Capacity Management can benefit from IFCFIA in the sense that the fulfilment of any higher-level objective requires proper capacities on multiple resources at several levels. In Service Transition, the ITIL Configuration Management process is responsible for maintaining information about all Configuration Items required to deliver an IT Service (including their relationships), which is stored in the Configuration Management Database (CMDB). IFCFIA can naturally extend the CMDB approach to provide the relationships for advanced impact and dependency analysis. Change Management can leverage the data for impact assessment in order to approve or reject a change request.

In Service Operation, as part of Incident/Problem Management, a Root Cause Analysis (RCA) is performed over all layers to identify the root cause of an incident or problem. RCA typically concentrates on IT infrastructure failures. A Pain Value Analysis during service operation is used to help identify the business impact of a problem, which requires relating the lower-level infrastructure to the end users in order to estimate the number of users and the cost to the business. The Business Impact Analysis (BIA) is part of Service Strategy in Business Continuity Management and identifies vital business functions and their dependencies on all coupled components and resources. BIA defines requirements for the datacentre environment, which include recovery time objectives and minimum Service Level Targets for each delivering IT Service and component on all levels.

ITIL IT Service Continuity Management (ITSCM) is responsible for managing risks that could seriously impact IT Services (Business Continuity Management). ITSCM can leverage IFCFIA for reducing risks to an acceptable level and planning for the recovery of IT Services and datacentre infrastructures.


8. LIMITATIONS AND CONCLUSION


8.1 LIMITATIONS OF IFCFIA

8.1.1 MULTIPLE INCOMING ARCS

For components that depend on multiple other components, the relationships between them may carry different semantics. In the dependency map, which is created as a directed graph, two incoming arcs for a component C may mean different things:

“C depends on the proper working of both related lower-level components”.

“C is properly working if either one of both sub-components is properly working”.

Within the IFCFIA approach, the two cases lead to different loosely coupling factors for the components. For instance, in the case of a mirrored fileserver, each fileserver has a very high loosely coupling degree because the other fileserver is available for a hot failover. This is modelled by assigning each fileserver a loosely coupling degree that represents the hot failover as the probability that at least one of the two fileservers is up and running. With the coupling parameters set up this way, the standard IFS operations are applicable for calculating the indirect impact for components with multiple incoming arcs.
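The probability used for the mirrored fileserver can be illustrated with a minimal sketch. It assumes that the replicas fail independently of each other, which the text does not state; the function name `at_least_one_up` and the 99% figure are hypothetical and chosen for illustration only.

```python
def at_least_one_up(availabilities):
    """Probability that at least one replica is up.

    Assumption (not from the paper): replicas fail independently.
    """
    p_all_down = 1.0
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError("each availability must lie in [0, 1]")
        p_all_down *= (1.0 - a)  # every replica must be down at once
    return 1.0 - p_all_down


# Two mirrored fileservers, each assumed to be up 99% of the time.
mirror = at_least_one_up([0.99, 0.99])
print(f"{mirror:.4f}")  # prints 0.9999
```

Under these assumptions the resulting value would serve as the loosely coupling degree of each fileserver in the mirrored pair.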

However, for high availability scenarios in which a service operates without disturbance for, e.g., more than 99.9% of the time, modelling the hot failover as a loosely coupling factor may not be exact enough. In addition, when it is combined with the tightly coupling degree, the coupling result may be too high compared to the real impact. For these cases the MaxMin operator for the combination of the loosely and tightly coupling IFS provides the better result, as the loosely coupling is the determinant in such relations.
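To illustrate why the choice of operator matters, the following sketch compares two standard operations on intuitionistic fuzzy values: the max/min combination and the probabilistic (algebraic) sum. The class `IFSValue`, the function names and the sample degrees are hypothetical, and it is an assumption that the max/min combination shown here corresponds to the MaxMin operator referred to above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IFSValue:
    """Intuitionistic fuzzy value: membership mu and non-membership nu."""
    mu: float
    nu: float

    def __post_init__(self):
        if min(self.mu, self.nu) < 0.0 or self.mu + self.nu > 1.0 + 1e-9:
            raise ValueError("need mu, nu >= 0 and mu + nu <= 1")

    @property
    def hesitation(self):
        # Remaining degree of uncertainty
        return 1.0 - self.mu - self.nu


def union_max_min(a, b):
    """Max of memberships, min of non-memberships."""
    return IFSValue(max(a.mu, b.mu), min(a.nu, b.nu))


def probabilistic_sum(a, b):
    """Algebraic sum: memberships accumulate, non-memberships multiply."""
    return IFSValue(a.mu + b.mu - a.mu * b.mu, a.nu * b.nu)


# Hypothetical coupling degrees, for illustration only.
first = IFSValue(mu=0.3, nu=0.6)
second = IFSValue(mu=0.2, nu=0.7)
print(union_max_min(first, second))
print(probabilistic_sum(first, second))
```

With these sample values the probabilistic sum yields a membership of about 0.44, whereas the max/min combination stays at 0.3, which shows how an accumulating operator can overstate the combined degree.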

Further research is needed to define the optimal operations (which will be different from disjunction) to be used for the calculation of indirect impact for components with multiple incoming arcs in high availability and workload management architectures.

8.1.2 LOOPBACKS IN THE DIRECTED DEPENDENCY GRAPH

As with most reliability engineering techniques, such as Fault Tree Analysis or Reliability Block Diagrams, it is recommended for IFCFIA that loopbacks in the directed graph be avoided.

If there are loopbacks, the impact calculation will also cycle in loops. Therefore the program routine calculating the indirect impact should stop when one of the following occurs:

a fixed equilibrium is reached for the Business Service impact level

a limit cycle is reached

chaotic behaviour is exhibited
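The stopping conditions above can be sketched as a small iteration routine. This is an illustrative sketch, not the routine used in this work: the function `iterate_until_settled`, the rounding precision and the iteration cap are assumptions, and chaotic behaviour is only approximated by the cap, since it cannot be proven within a finite number of steps.

```python
def iterate_until_settled(step, state, max_iter=100, ndigits=6):
    """Apply `step` repeatedly and report why the iteration stopped.

    Returns (kind, state, period), where kind is "fixed point",
    "limit cycle" or "no convergence" (iteration cap reached).
    """
    seen = {}  # rounded state -> iteration at which it first appeared
    for i in range(max_iter):
        key = tuple(round(x, ndigits) for x in state)
        if key in seen:
            period = i - seen[key]
            kind = "fixed point" if period == 1 else "limit cycle"
            return kind, state, period
        seen[key] = i
        state = step(state)
    return "no convergence", state, None


# Bi-modal states (1.0 = failed, 0.0 = working) for two components in a loop.
def spread(s):
    # A failure on either side propagates to the other and persists.
    return (max(s[0], s[1]), max(s[0], s[1]))


def swap(s):
    # Artificial example in which the states alternate forever.
    return (s[1], s[0])


print(iterate_until_settled(spread, (1.0, 0.0)))
print(iterate_until_settled(swap, (1.0, 0.0)))
```

In the first example the failure spreads once and the state then stays unchanged, which matches the observation that bi-modal states limit the effect of loopbacks.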

In practice, to reduce the complexity of operational monitoring, compliance for technical performance parameters will mostly be measured bi-modally (either they operate correctly or they fail). This avoids or limits the effects of loopbacks, as the bi-modal state will not change again when a loopback impact arrives. The measurement can be refined by distinguishing between different failure modes, such as outage or slow response, where each failure mode is again monitored as a bi-modal condition. Such binary measured and monitored failure modes may still change because of loopbacks; for example, the failure mode slow response may evolve into an outage.

8.2 CONCLUSION

In this paper we presented a fuzzy methodical framework that can be used to granularly relate performance metrics of the backstage in a service orchestration to the metrics used within Service Level Agreements at the frontstage. Service Level Agreements (SLAs) related to customer satisfaction or other front-end measures (response time, wait time, correctness, etc.) are critical, since the front end is what the customer or consumer experiences and is what the SLAs and contract terms for the composed service are typically defined on. The business SLAs between the client and the service provider are mostly used to manage delivery contracts, and failing them may have revenue impacts for service providers.

Component Failure Impact Analysis (CFIA) is an industry-standard quality technique and part of ITIL v3, with the objective of managing the risk of system failure through predictive analysis of the current state of the solution. We extended CFIA with intuitionistic fuzzy means to the Intuitionistic Fuzzy Component Failure Impact Analysis (IFCFIA) approach, which allows reasoning from single components to business services, or even further to the end-user experience. By including the fuzzy extensions of the coupling relationships, the IFCFIA method can unveil the interrelationships and dependencies between business applications and their supporting components and infrastructures even more granularly. Compared to other impact analysis methods, IFCFIA has the basic advantage that the model is more granular and therefore closer to reality. Traditional impact analyses will in most cases over- or underestimate the impacts. By considering a probable degradation level for component operation and the probability of a component failure, we can assess service usability granularly in terms of degradation level, failure risk and vagueness.

This new IFCFIA model, which relates a set of fuzzy-related components to a business service with corresponding performance parameters, can be utilized to support Service Management in predicting the impacts of monitored back-end component failures on business services. Further, it can guide the process of discovering the root cause of SLA violations and may help to provide the more accurate analyses that are needed to make appropriate adjustment decisions at runtime.

The basic IFCFIA grid has the advantage that it can be built primarily by auto-discovery tools, which can also automate and support the process of discovering and mapping applications to underlying infrastructure interdependencies. This makes the approach applicable also to systems for which full and extensive knowledge about the single components and relations is not available. When performing an IFCFIA assessment, the IT departments need to define only the “direct impact” relationships and the resilience capabilities of individual components. All the “indirect impact” relations are calculated automatically, including possible impact adaptations to stakeholders’ attitudes and component dependency types.
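How indirect impact relations can be derived from direct ones may be sketched with the max-min composition of intuitionistic fuzzy relations. This is one standard composition and is used here as an assumption; the operations actually applied in IFCFIA may differ depending on dependency type and stakeholder attitude. The component names and degrees are hypothetical.

```python
NO_IMPACT = (0.0, 1.0)  # (mu, nu) used when no direct relation exists


def compose(r, s):
    """Max-min composition of two intuitionistic fuzzy relations.

    r maps (a, b) -> (mu, nu) and s maps (b, c) -> (mu, nu).
    The result maps (a, c) -> (mu, nu) over all intermediate b.
    """
    sources = {a for a, _ in r}
    mids = {b for _, b in r} | {b for b, _ in s}
    targets = {c for _, c in s}
    out = {}
    for a in sources:
        for c in targets:
            mu = max(min(r.get((a, b), NO_IMPACT)[0],
                         s.get((b, c), NO_IMPACT)[0]) for b in mids)
            nu = min(max(r.get((a, b), NO_IMPACT)[1],
                         s.get((b, c), NO_IMPACT)[1]) for b in mids)
            out[(a, c)] = (mu, nu)
    return out


# Hypothetical direct impacts: infrastructure -> application -> service.
infra_to_app = {("db", "app"): (0.8, 0.1), ("cache", "app"): (0.3, 0.5)}
app_to_service = {("app", "shop"): (0.6, 0.2)}

indirect = compose(infra_to_app, app_to_service)
print(indirect[("db", "shop")])     # prints (0.6, 0.2)
print(indirect[("cache", "shop")])  # prints (0.3, 0.5)
```

Only the two direct relation tables are entered by hand in this sketch; the component-to-service relations follow from the composition.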

The consideration of uncertainty within the IFCFIA approach allows reasoning without a fully known model of relationships and resilience capabilities. These advantages significantly reduce the effort needed to perform impact analysis for larger systems and produce results that are close to reality. Based on the IFCFIA relationships we can further define two-sided rules for intuitionistic fuzzy reasoning that capture business insights into how service accounts as a whole (including IT-enabled services with larger human elements) can predict and improve on quality.

IFCFIA can help in implementing a flexible SLA model in which the organization benefits by projecting the capacity and paying only for what is required. This allows optimizing investments in the IT infrastructure and booked component capacity baselines. Instead of tightening SLAs across the board, which is a costly approach, component capacity levels can be driven by the strength of the individual impact on the business.

With the current trend towards server consolidation, it becomes even more difficult to provide accurate mappings of functions to machines or applications, because several instances of applications coexist on the same host machine, each capable of performing separate critical tasks in the environment. Using virtual machines is also a popular method of recovering CPU and memory that would otherwise be wasted, but it can complicate systems management. IFCFIA can help to determine the effect of taking a virtual machine or a host machine and its associated applications down for servicing. Without knowledge of such impacts, the IT datacentre becomes paralysed and vulnerable, as patches are delayed, applications are not updated, and hardware upgrades needed to meet increased capacity requirements are not performed.
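As a simplified, crisp illustration of such a servicing question, the following sketch lists everything that directly or indirectly depends on a host by traversing the dependency map in reverse. It ignores the coupling degrees that IFCFIA would attach to each arc; the function `impacted_by` and the sample landscape are hypothetical.

```python
from collections import deque


def impacted_by(depends_on, component):
    """Return all nodes that directly or indirectly depend on `component`.

    depends_on maps each node to the list of nodes it depends on.
    """
    # Invert the map: for each node, who depends on it?
    dependents = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(node)

    seen, queue = set(), deque([component])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in seen:  # also guards against loopbacks
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical consolidated landscape: two hosts, three VMs, two applications.
landscape = {
    "vm1": ["host1"],
    "vm2": ["host1"],
    "vm3": ["host2"],
    "crm": ["vm1"],
    "billing": ["vm2", "vm3"],
}
print(sorted(impacted_by(landscape, "host1")))
# prints ['billing', 'crm', 'vm1', 'vm2']
```

In this example billing also runs on vm3, so a graded assessment would presumably rate its impact lower than that of crm; the crisp traversal cannot express this difference.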


Impacts are complex, which gives rise to uncertainty. They involve a multitude of effects that cannot be easily assessed and may involve complex causalities, non-linear relationships and interactions between effects. This may render it difficult, if not impossible, to determine exactly what may happen. However, considering the risk resulting from interdependencies together with the mitigation and resilience capabilities provides a holistic view of complex system behaviour. In doing so, we naturally consider the positive and negative instances of the coupling and interdependency aspect separately. Loosely and tightly coupling measurements can thus be implemented via independent methods using the best-suited approaches and tools, which also allows integrating ordinary and probabilistic measurements within a single approach and result set. Applying different fuzzy probabilistic operations to calculate indirect impacts adapts the general approach to individual concerns, attitudes or types of relationships.

The IFCFIA approach can be established as an integral part of a broader Service Impact Management, which is important to SLAs as it increases the responsiveness of IT organizations to the business by defining the critical relationships between business services and IT assets and by detecting dependencies in dynamic IT environments. IFCFIA provides the ability to directly connect IT operations to business services; that is, it transforms availability and performance data into knowledge about the real-time status of business services, which allows understanding and communicating the true impact of incidents (such as IT component failures) on the business and vice versa. It supports service administrators in proactively tracking measures of individual components to gather the overall SLA quality status of the impacted business services by applying fuzzy mathematical models and methods.

Several ITIL v3 processes can leverage the IFCFIA framework and the methods proposed. IT Service Level Management can benefit from correlating high-level service qualities down to infrastructure level and relating monitored backstage performance to frontend quality parameters. Availability Management and Component Capacity Management can benefit from IFCFIA in the sense that the fulfilment of any higher-level objective requires proper capacities on multiple resources at several levels. IFCFIA can naturally extend Configuration Management to provide the relationships for advanced impact and dependency analysis. Change Management can leverage the approach for impact assessment in order to approve or reject a change request. In service operation, Incident/Problem Management can use IFCFIA for performing Root Cause Analysis to identify the probable root cause of an incident. Business Continuity Management can use IFCFIA to identify vital business functions and their dependencies on all coupled infrastructure components and resources.


A. REFERENCES

[Alghamdi, 2007] Alghamdi, J. S. (2007). Measuring software coupling. In Proceedings of the 6th WSEAS International Conference on Software Engineering, Parallel and Distributed Systems, SEPADS'07, pages 6-12, Stevens Point, Wisconsin, USA. World Scientific and Engineering Academy and Society (WSEAS).

[Atanassov 86;99] Atanassov K. On Intuitionistic Fuzzy Sets Theory (Studies in Fuzziness and Soft Computing) Springer Berlin Heidelberg; 1st ed. 1999 Edition: 2010 http://www.amazon.de/Intuitionistic-Fuzzy-Sets-Applications-Fuzziness/dp/3790824631/ref=sr_1_2?s=books-intl-de&ie=UTF8&qid=1335370781&sr=1-2

[Atanassov 08] Atanassov K. 25 years of intuitionistic fuzzy sets, or: The most important results and mistakes of mine, 7th International workshop on intuitionistic fuzzy sets and generalized nets. 17 Oct. 2008. Warsaw, Poland

[Bailey et al. 08] Three reliability engineering techniques and their application to evaluating the availability of IT systems, IBM Systems Journal, Vol 47, No 4, 2008

[Bianco et al. 08] P. Bianco, G. Lewis, P. Merson, Service Level Agreements in Service-Oriented Architecture Environments, Software Architecture Technology Initiative, CMU/SEI-2008-TN-021, Sept. 2008

[Craig, EMA ADDM Radar Dec 10] Craig, J. (2010). EMA Radar for application discovery and dependency mapping (ADDM): Q4 2010 summary and vendor profile. http://www.enterprisemanagement.com/research/asset-free.php/1906/


[Dhama 95] H. Dhama, Quantitative models of cohesion and coupling in software, Journal of Systems and Software, Volume 29, Issue 1, April 1995, pp. 65-74

[Eifert 12] J. Eifert 2012, Application and Infrastructure Dependency Mapping using Intuitionistic Fuzzy Sets, Bachelor Thesis, University of Fribourg, Switzerland, Department of Informatics

[Fenton and Melton, 1990] Fenton, N. and Melton, A. (1990). Deriving structurally based software measures. Journal of Systems and Software, 12:177-187.

[Hong Yang 2010] Hong Yul Yang 2010, Measuring Indirect Coupling. Doctoral Thesis, Department of Computer Science, University of Auckland, New Zealand

[Hui LI 2009] Hui Li, Challenges in SLA Translation – SLA@SOI, European Commission Seventh Framework Programme (2007-2013), SAP Research, Dec. 2009 http://sla-at-soi.eu/2009/12/challenges-in-sla-translation/

[IBM TADDM, 2012] IBM TADDM Infocenter: Dependencies between resources http://pic.dhe.ibm.com/infocenter/tivihelp/v46r1/index.jsp?topic=/com.ibm.taddm.doc_721/SDKDevGuide/c_cmdbsdk_modelobject_dependencies.html

[Intuitionistic Fuzzy Sets] IFS Wiki http://www.ifigenia.org/wiki/Intuitionistic_fuzzy_sets; bibliography of IFS and its applications http://www.clbme.bas.bg/projects/gnifs/ifs/publ.html

[ITIL Glossary] www.itil-officialsite.com/nmsruntime/saveasdialog.aspx?lID=1182

[ITIL Process Model] http://datalinkcontrol.net/misc/itil-v3-process-model.pdf

[ITIL Modules] ITIL Service Lifecycle Modules Source: http://krpm.wordpress.com/

[Jacob et al. 2009] Jacob, Bart; Adhia, Bhavesh; Badr, Karim; Huang, Qing Chun; Lawrence, Carol S.; Marino, Martin; Unglaub-Lloyd, Petra: IBM Tivoli Application Dependency Discovery Manager: Capabilities and Best Practices. http://www.redbooks.ibm.com/abstracts/sg247519.html


[Joshi et al. 2007] Integration of domain-specific IT processes and tools in IBM Service Management, IBM Systems Journal, May 2007

[Joshi et al. 2009] Joshi, Karuna; Joshi, Anupam; Yesha, Yelena; Kothari, Ravi: A Framework for Relating Frontstage and Backstage Quality in Virtualized Services. http://ebiquity.umbc.edu/paper/html/id/462/A-Framework-for-Relating-Frontstage-and-Backstage-Quality-in-Virtualized-Services-

[Joshi et al. 2011] Karuna P Joshi, Anupam Joshi, Yelena Yesha, Managing the Quality of Virtualized Services - Proceedings of the SRII Service Research Global Conference, 2011 http://ebiquity.umbc.edu/get/a/publication/541.pdf

[Kolev/Ivanov 2009] Kolev, Boyan; Ivanov, Ivaylo: Fault Tree Analysis in an Intuitionistic Fuzzy Configuration Management Database. http://www.ifigenia.org/w/images/d/de/NIFS-15-2-10-17.pdf

[Mahapatra 2010] Intuitionistic Fuzzy Fault Tree Analysis Using Intuitionistic Fuzzy Numbers, International Mathematical Forum, 5, 2010, no. 21, 1015 – 1024

[O'Brien, 2008] O'Brien, D. (2008). Best practices for discovering business applications. https://www.ibm.com/developerworks/wikis/display/tivoliaddm/Best+Practices+for+Discovering+Business+Applications

[Quynh, Thang 2009] Pham Thi Quynh, Huynh Quyet Thang, Dynamic Coupling Metrics for Service–Oriented Software www.waset.org/journals/ijeee/v3/v3-5-46.pdf

[Rafiee, Shabgahi 2011] Evaluating the Reliability of Communication Networks Using Fuzzy Fault Tree Analysis - A Case Study, TJMCS Vol. 2 No. 2 (2011) 262-270. http://www.TJMCS.com

[Rud et al. 2007] Dmytro Rud, Andreas Schmietendorf Reine. Resource Metrics for Service-Oriented Infrastructures. www.cs.uni-magdeburg.de/~rud/papers/Rud-13.pdf

[Service Level Agreement Zone, 2007] Service Level Agreement Zone (2007). The service level agreement. www.sla-zone.co.uk


[Szmidt and Kacprzyk 02,04] Analysis of Consensus under Intuitionistic Fuzzy Preferences, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

[Taixi Xu et al. 2006] T. Xu, K. Qian, and X. He, Service Oriented Dynamic Decoupling Metrics. Proceedings of the 2006 International Conference on Semantic Web & Web Services, 170-176.

[Tai et al. 2008] Tai, Ling; Baker, Ron; Edmiston, Elizabeth; Jeffcoat, Ben: IBM Tivoli Common Data Model: Guide to Best Practices. http://www.redbooks.ibm.com/abstracts/redp4389.html?Open&pdfbookmark

[Tyagi et al. 2010] S. K. Tyagi, D. Pandey and R. Tyagi, “Fuzzy Set Theoretic Approach to Fault Tree Analysis,” International Journal of Engineering, Science and Technology, Vol. 2, No. 5, 2010, pp. 276-283. www.ijest-ng.com

[The Open Group 2004] TeleManagement Forum: SLA Management Handbook – Volume 4: Enterprise Perspective, ISBN: 1-931624-51-8 www.afutt.org/Qostic/qostic1/SLA-DI-USG-TMF-060091-SLA_TMForum.pdf

[Van Haren 08] Continual Service Improvement Based On ITIL V3: A Management Guide (Best Practice (Van Haren Publishing)) http://www.openisbn.com/isbn/9087531281/

[Zadeh 65,94] Lotfi A. Zadeh, Soft Computing and Fuzzy Logic, University of California, Berkeley, USA

[Kosko 86] B. Kosko, ‘Fuzzy Cognitive Maps’, International Journal of Man-Machine Studies, Vol. 24, pp. 65-75, 1986.

[Stylios, Georgopoulos 1997] C. Stylios, V. Georgopoulos, P. Groumpos, The use of Fuzzy Cognitive Maps in Modeling Systems, Department of Electrical and Computer Engineering, University of Patras, 1997


B. TERMS AND DEFINITIONS

Availability

Ability of a component or service to perform its required function at a stated instant or over a stated period of time. It is usually expressed as the availability ratio, i.e. the proportion of time that the service is actually available for use by the customers within the agreed service hours: (Agreed - Unavailable) / Agreed * 100%.

Business Continuity

Business Continuity means the continued operation of business processes to, at least, a predetermined acceptable level in the event of a major business disruption.

Business Recovery Objective

The desired time within which business processes should be recovered, along with the minimum staff, assets and services required within this time.

Business Resilience

The ability of the business to rapidly adapt and respond to opportunities, regulations and risks, in order to maintain secure, continuous business operations, be a more trusted partner, and enable growth.

Business Resilience spans business strategy, organizational structure, business and IT processes, IT infrastructure, applications and data, and facilities. It arises from the implementation and management of a plan that ensures high availability through monitoring and automatic adjustment of redundant or virtualized (distributed) infrastructure.

Cold stand-by

Applicable to organizations that can function for a period of up to 72 hours, or longer, without a re-establishment of full IT facilities.

Continuous Availability (CA)

Attribute of a system to deliver non-disruptive service to the end user 7 days a week, 24 hours a day (there are no planned or unplanned outages).

Continuous Operations (CO)

Attribute of a system to continuously operate and mask planned outages from end users. It employs non-disruptive hardware and software changes, non-disruptive configuration and software coexistence.

Disaster Recovery

Given a major disruption caused by a disaster, fully recover essential data and operations. DR encompasses the ability of the total design to transfer data and workload offsite and to restart the workload at a new site. The switch often involves a system and service outage, but in the extreme the switch can be made completely seamless to important users.

Fault Tolerance

Fault Tolerance is the property of a component, sub-system or system whereby normal service continues even though a fault has occurred within the system.

Fault tree analysis

Fault tree analysis (FTA) is a deductive, top-down method of analyzing the causes of system failure. It involves specifying a top event, such as a system crash, followed by identifying all the probable causes for that top event. The entire system, including human interactions, is analyzed when performing a fault tree analysis. Fault trees provide a convenient symbolic representation of the combination of events or failures resulting in the top-level event.

Failure Mode, Effect and Criticality Analysis

Failure Mode, Effect and Criticality Analysis (FMEA or FMECA) is a bottom-up procedure to analyze each potential failure mode, to determine its effects on the system and to classify each potential failure mode according to its severity. The purpose is to provide a safer, more reliable initial design. FMECA also helps to identify and eliminate any single points of failure in the design process.

High Availability

The attribute of a system to provide service during defined periods at acceptable or agreed-upon levels and to mask unplanned outages from end users. It employs Fault Tolerance; Automated Failure Detection, Recovery, Bypass Reconfiguration, Testing, Problem and Change Management.

Hot stand-by

Provides for the immediate restoration of services following any irrecoverable incident. Hot stand-by previously referred to availability of services within a short timescale such as 2 or 4 hours.

Impact

Measure of the business criticality of an incident, problem or Request for Change. Often equal to the extent of a distortion of agreed or expected Service Levels.

Incident

An event or series of events that disrupts, or has the potential to disrupt, IT production services to the user.

IT Business Resilience

Control of underlying IT Resources so as to support the Business Resilience policies of the organization.

The objective is to ensure that the availability and resilience needs and actions are integrated into the everyday ongoing operations of information technology systems and therefore will lead to predictable, successful outcomes in case of planned or unplanned outage.

ITIL Glossary

www.itil-officialsite.com/nmsruntime/saveasdialog.aspx?lID=1182

Mean Time Between Failures (MTBF)

A Metric for measuring and reporting Reliability. MTBF is the average time that a Configuration Item or IT Service can perform its agreed Function without interruption. This is measured from when the CI or IT Service starts working, until it next fails.

Mean time to repair (MTTR)

Mean time to repair (MTTR) is a basic measure of the maintainability of repairable items. It represents the average time required to repair a failed component or device. Expressed mathematically, it is the total corrective maintenance time divided by the total number of corrective maintenance actions.

Mean-Time-To-Restore-Service (MTTRS)

MTTRS differs from MTTR in that MTTR is the mean time to repair a configuration item, whereas MTTRS is the mean time to restore service after the repair. For example, MTTR is the time to change the CPU of a node, while MTTRS is the time to restore all services provided by that node.

Recovery Point Objective (RPO)

A “recovery point objective” or “RPO” is defined by business continuity planning. It is the maximum tolerable period in which data might be lost from an IT service due to a major incident. The RPO gives systems designers a limit to work to.

Problem

A fault or defect that requires further analysis after recovery, to determine and eliminate the cause of the incident.

Recovery Time Objective (RTO)

The recovery time objective (RTO) is the duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.




Redundancy

Redundancy is the use of components / nodes in parallel. It can help improve mission reliability and generally decreases logistics reliability and increases support costs.

Resilience

The ability of an IT service to maintain its specified availability in spite of the incorrect operation of one or more components.

Reliability

The reliability of an IT Service can be qualitatively stated as freedom from operational failure. Reliability is the probability that an item will perform its intended function for a specified interval under stated conditions. Simply stated, it is how long the system can work.

Reliability Block Diagram

A Reliability Block Diagram (RBD) is a tool for analyzing the combinations of complex components and assemblies, often designed with various redundancies in order to improve the overall reliability of the system.

Scheduled Outage

A time period when the system is not ready for usage and the users do not expect it to be. That is: a planned outage. These are ‘Predefined Events’, often repetitive calendar outages.

Unscheduled Outage

A time period when the system is not ready for usage and the users expect it to be. That is: an unplanned outage. These are ‘Random Events’.

Warm stand-by

Typically involves the re-establishment of the critical systems and services within a 24 to 72 hour period and is used by organizations that need to recover IT facilities within a predetermined time to prevent impacts to the business process.
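The availability ratio defined above, together with the MTBF and MTTR definitions, can be illustrated with a short sketch. The first function implements the ratio (Agreed - Unavailable) / Agreed * 100% as given in the Availability entry. The second uses the common steady-state relation MTBF / (MTBF + MTTR), which is a standard reliability engineering formula and an addition for illustration, not part of the definitions above; the function names and sample figures are hypothetical.

```python
def availability_ratio(agreed_hours, unavailable_hours):
    """Availability in percent: (Agreed - Unavailable) / Agreed * 100."""
    if agreed_hours <= 0:
        raise ValueError("agreed service hours must be positive")
    if not 0 <= unavailable_hours <= agreed_hours:
        raise ValueError("unavailable hours must lie within agreed hours")
    return (agreed_hours - unavailable_hours) / agreed_hours * 100.0


def steady_state_availability(mtbf_hours, mttr_hours):
    """Long-run fraction of time in service, MTBF / (MTBF + MTTR).

    Assumes MTBF is measured as uptime between failures, as in the
    MTBF definition above.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("need MTBF > 0 and MTTR >= 0")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# 720 agreed service hours in a month, 3.6 hours of outage.
monthly = availability_ratio(720, 3.6)
print(f"{monthly:.1f}%")  # prints 99.5%

# A component that runs 990 hours between failures and takes 10 hours to repair.
print(f"{steady_state_availability(990, 10):.2f}")  # prints 0.99
```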
