Identifying the Root Cause of Failures in IT Changes: Novel Strategies and Trade-offs (IM 2013)

Identifying the Root Cause of Failures in IT Changes: Novel Strategies and Trade-offs Ricardo L. dos Santos, Juliano A. Wickboldt, Bruno L. Dalmazo, Lisandro Z. Granville and Luciano P. Gaspary Federal University of Rio Grande do Sul, Brazil Roben C. Lunardi Federal Institute of Rio Grande do Sul, Brazil


Post on 04-Jul-2015



Outline

• Introduction

• Proposed Solution

  • Diagnosis Process

  • Conceptual Architecture

  • Root Cause Analyzer

  • Strategies for Selecting Questions

• Case Study

• Final Considerations

• Future Work


Introduction

• Context

• The complexity of IT infrastructures makes IT processes mission-critical

• ITIL (Information Technology Infrastructure Library) has become the most widely accepted approach to IT process management worldwide

• IT Change Management

• Defines how the IT infrastructure must evolve in a consistent and safe way

• Defines how changes should be conducted


Introduction

• IT Problem Management

• Defines the lifecycle of IT problems

• The primary goals are

• To eliminate recurrent incidents

• To prevent the occurrence of IT problems

• To minimize the impact of problems which cannot be prevented

• To achieve these goals, identifying the root cause of failures and reusing the operator’s knowledge are fundamental

• To simplify the procedures

• To minimize financial losses

• To reduce maintenance costs


Introduction

• Current Scenario

• Changes and failures have been explored by several research efforts

• However, these efforts have some limitations, such as

• Historical data are often not considered

• They do not identify the root cause of failures

• They are specific solutions for detecting software failures


Introduction

• Our Goals

• Propose strategies that support root cause identification while keeping the approach interactive

• Each strategy must select a question, exploring a different criterion

• Compare the diagnostics generated by each strategy


Proposed Solution: Diagnosis Process – Our Approach

[Figure: interactive diagnosis involving the Help Desk, the Root Cause Analyzer and the Operator. Labels: Problem Report (PR), Question Selection, Question, Answered Question, Root Cause (RC).]

Proposed Solution: Conceptual Architecture

[Figure: conceptual architecture, built up over three slides. Components: Operator; Change Management System (Change Designer, Change Planner); Deployment System; Configuration Management Database; Diagnosis System (Root Cause Analyzer, Diagnosis Log Recorder). Inside the Root Cause Analyzer: Input Processor, Question Selector, Question Verifier. Artifacts exchanged: RFC, PR, CI, RC, Log.]

Proposed Solution: Strategies for Selecting Questions

• All strategies take the same inputs and return a single question as the result

• Four strategies are proposed

• Strategy 1 – Only completed diagnostics

• Strategy 2 – All diagnostics

• Strategy 3 – Age of diagnostics

• Strategy 4 – Questions’ popularity


Proposed Solution: Strategies for Selecting Questions

• Strategy 1 – Only completed diagnostics

• Only completed diagnostics are considered

• No penalty is applied to the calculated weights

• An element's weight is the number of completed diagnostics in which the RC was correctly identified

Root Causes   Questions   Answers   Completed Diagnostics
RC1           Q1, Q2      A1, A3    20
RC2           Q1, Q3      A2, A5    30

Resulting weights:

• Q1, associated with both RCs: 20 + 30 = 50

• RC2 (and thus Q3): 30

• RC1 (and thus Q2): 20
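The Strategy 1 arithmetic can be sketched in a few lines of Python. This is only an illustration of the example on the slide, not the authors' implementation: the `history` layout and the function name are assumptions, and it further assumes that a question's weight is the sum of the weights of the RCs it is associated with.

```python
# Hypothetical history layout: root cause -> (associated questions, completed diagnostics).
history = {
    "RC1": (["Q1", "Q2"], 20),
    "RC2": (["Q1", "Q3"], 30),
}


def strategy1_weights(history):
    """Sum, per question, the completed diagnostics of every RC linked to it."""
    weights = {}
    for questions, completed in history.values():
        for question in questions:
            weights[question] = weights.get(question, 0) + completed
    return weights


weights = strategy1_weights(history)
# The question with the greatest weight is the one that would be asked next.
best_question = max(weights, key=weights.get)

print(weights)        # {'Q1': 50, 'Q2': 20, 'Q3': 30}
print(best_question)  # Q1
```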


Proposed Solution: Strategies for Selecting Questions

• Strategy 2 – All diagnostics

• Completed and frustrated diagnostics are considered

• An element's weight is the number of completed diagnostics minus the number of frustrated diagnostics

• A diagnostic is frustrated when the system uses at least one question associated with an RC, but a different RC is identified at the end of the process


Proposed Solution: Strategies for Selecting Questions

• Strategy 2 – All diagnostics

Root Causes   Questions   Answers   Completed Diagnostics   Frustrated Diagnostics
RC1           Q1, Q2      A1, A3    20                      10
RC2           Q1, Q3      A2, A5    30                      15

Resulting weights:

• Q1, associated with both RCs: (20 + 30) – (10 + 15) = 25

• RC2 (and thus Q3): 30 – 15 = 15

• RC1 (and thus Q2): 20 – 10 = 10
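As a rough sketch of the Strategy 2 rule (completed minus frustrated), the example can be reproduced as follows. The `RootCauseHistory` class and the function name are hypothetical and used only for illustration; summing RC weights per question is an assumption taken from the worked numbers.

```python
from dataclasses import dataclass


@dataclass
class RootCauseHistory:
    """Hypothetical record of past diagnostics for one root cause."""
    questions: list
    completed: int
    frustrated: int

    @property
    def weight(self):
        # Strategy 2: completed diagnostics minus frustrated diagnostics.
        return self.completed - self.frustrated


history = {
    "RC1": RootCauseHistory(["Q1", "Q2"], completed=20, frustrated=10),
    "RC2": RootCauseHistory(["Q1", "Q3"], completed=30, frustrated=15),
}


def strategy2_question_weights(history):
    """Add each RC's net weight to every question associated with it."""
    weights = {}
    for record in history.values():
        for question in record.questions:
            weights[question] = weights.get(question, 0) + record.weight
    return weights


question_weights = strategy2_question_weights(history)
print(question_weights)  # {'Q1': 25, 'Q2': 10, 'Q3': 15}
```

Unlike Strategy 1, a weight can become negative here when frustrated diagnostics outnumber completed ones.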


Proposed Solution: Strategies for Selecting Questions

• Strategy 3 – Age of diagnostics

• Considers completed and frustrated diagnostics

• Element weights are penalized according to the age of the diagnostics

Age    Diagnostics Time             Penalty
1st    Up to 120 days               Not applicable
2nd    From 121 days to 150 days    10%
3rd    From 151 days to 180 days    20%
4th    From 181 days to 210 days    30%
5th    From 211 days to 240 days    40%
6th    From 241 days to 270 days    50%
7th    From 271 days to 300 days    60%
8th    From 301 days to 330 days    70%
9th    From 331 days to 360 days    80%
10th   More than 360 days           90%
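The age table maps directly to a small lookup. The sketch below is an assumption-laden illustration: the function names are invented, and day 360 is treated as belonging to the 9th age, with the 10th age starting at day 361.

```python
def age_group(days):
    """Return the age group (1..10) of a diagnostic that is `days` old."""
    if days <= 120:
        return 1
    if days > 360:
        return 10
    # Groups 2..9 are 30-day bands starting at day 121.
    return 2 + (days - 121) // 30


def usable_fraction(days):
    """Fraction of the weight that is kept: 100% minus a 10-point penalty per age after the first."""
    penalty_percent = 10 * (age_group(days) - 1)
    return (100 - penalty_percent) / 100


print(age_group(150), usable_fraction(150))  # 2 0.9
print(age_group(400), usable_fraction(400))  # 10 0.1
```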


Proposed Solution: Strategies for Selecting Questions

• Strategy 3 – Age of diagnostics

$$\mathit{elementWeight}_x = \sum_{i=1}^{10} \beta_i \,(\alpha_i - \omega_i)$$

• i – age group of the diagnostics

• βi – percentage of the weight to be used

• αi – number of completed diagnostics in age group i

• ωi – number of frustrated diagnostics in age group i

Example:

Root Causes   Questions   Answers   Completed Diagnostics     Frustrated Diagnostics
                                    1st age     10th age      1st age     10th age
RC1           Q1, Q2      A1, A3    1           24            4           8
RC2           Q1, Q3      A2, A5    4           15            1           2

Resulting weights:

• RC2: 100% × (4 − 1) + 10% × (15 − 2) = 4.3

• RC1: 100% × (1 − 4) + 10% × (24 − 8) = 1.6 (as on the slide; evaluated literally the expression gives −1.4, so the negative first-age term seems to be counted as zero)

• Q1, associated with both RCs: 4.3 + 1.6 = 5.9
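A minimal sketch of the Strategy 3 sum is given below. The function name and the `floor_at_zero` switch are assumptions: the switch is there only because the value 1.6 reported for RC1 is reproduced when a negative per-age difference is counted as zero, whereas the formula taken literally yields −1.4. The slides do not say which behaviour is intended.

```python
# beta_i per age group: 100% for the 1st age, 10 points less for each later age.
BETA = {i: (100 - 10 * (i - 1)) / 100 for i in range(1, 11)}


def element_weight(completed, frustrated, floor_at_zero=False):
    """Sum beta_i * (alpha_i - omega_i) over the ten age groups.

    `completed` and `frustrated` map an age group to a count (missing means 0).
    `floor_at_zero` is an assumed variant that ignores negative differences.
    """
    total = 0.0
    for age, beta in BETA.items():
        net = completed.get(age, 0) - frustrated.get(age, 0)
        if floor_at_zero:
            net = max(net, 0)
        total += beta * net
    return total


rc1_literal = element_weight({1: 1, 10: 24}, {1: 4, 10: 8})
rc1_floored = element_weight({1: 1, 10: 24}, {1: 4, 10: 8}, floor_at_zero=True)
rc2 = element_weight({1: 4, 10: 15}, {1: 1, 10: 2})

print(round(rc1_literal, 2))        # -1.4
print(round(rc1_floored, 2))        # 1.6
print(round(rc2, 2))                # 4.3
print(round(rc2 + rc1_floored, 2))  # 5.9
```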


Proposed Solution: Strategies for Selecting Questions

• Strategy 4 – Questions’ popularity

• The weights of RCs and categories are calculated according to Strategy 2

• A question's weight considers the weights of its associated RCs and the question's popularity

• A question's popularity is the ratio between the number of occurrences of the question and the number of diagnostic sets selected


Proposed Solution: Strategies for Selecting Questions

• Strategy 4 – Questions’ popularity

$$\mathit{questionWeight}_x = \frac{\dfrac{\alpha_x}{n} + \sum_{i=1}^{n} \left(\beta_{RC_i} \cdot \alpha_{RC_i,x}\right)}{2}$$

• αx – number of occurrences of question x in the diagnostic sets

• n – number of diagnostic sets

• βRCi – probability of identifying an RC

• αRCi,x – number of occurrences of question x in the diagnostic set of an RC

Example:

Root Causes   Questions   Answers   Completed Diagnostics     Frustrated Diagnostics
                                    1st age     10th age      1st age     10th age
RC1           Q1, Q2      A1, A3    1           24            4           8
RC2           Q1, Q3      A2, A5    4           15            1           2

The Strategy 2 weights are (1 + 24) − (4 + 8) = 13 for RC1 and (4 + 15) − (1 + 2) = 16 for RC2, 29 in total, which gives βRC1 = 13/29 and βRC2 = 16/29.

Resulting weights:

• Q1: (2/2 + ((13/29 × 1) + (16/29 × 1))) / 2 = 1

• Q2: (1/2 + ((13/29 × 1) + (16/29 × 0))) / 2 = 0.4741

• Q3: (1/2 + ((13/29 × 0) + (16/29 × 1))) / 2 = 0.5259
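The Strategy 4 example can be checked with a short script. This is a sketch under stated assumptions rather than the authors' code: the function name and data layout are hypothetical, and βRCi is taken to be each RC's Strategy 2 weight divided by the total (13/29 and 16/29), which is what the worked numbers suggest.

```python
# Hypothetical inputs: diagnostic set (list of questions) per RC and Strategy 2 weight per RC.
diagnostic_sets = {"RC1": ["Q1", "Q2"], "RC2": ["Q1", "Q3"]}
rc_weights = {"RC1": 13, "RC2": 16}


def question_weight(question, diagnostic_sets, rc_weights):
    """Average of the question's popularity and the weighted share of RCs that use it."""
    n = len(diagnostic_sets)
    total_rc_weight = sum(rc_weights.values())
    # alpha_x / n: occurrences of the question over the number of diagnostic sets.
    popularity = sum(qs.count(question) for qs in diagnostic_sets.values()) / n
    # Sum over RCs of beta_RCi * alpha_RCi,x.
    rc_term = sum(
        (rc_weights[rc] / total_rc_weight) * qs.count(question)
        for rc, qs in diagnostic_sets.items()
    )
    return (popularity + rc_term) / 2


for q in ("Q1", "Q2", "Q3"):
    print(q, round(question_weight(q, diagnostic_sets, rc_weights), 4))
# Q1 1.0
# Q2 0.4741
# Q3 0.5259
```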


Case Study

• Some constraints were defined for this case study

• There are no changes during the executions

• The operator always provides the same answer

• A company provides services on the Web

• The infrastructure consists of a DB Server and a Web Server

• To meet growing demand, 2 new servers will be installed

• Hosting Server – will be used to host the clients’ websites

• Mail Server – will be used to host the email services

Case Study

• The CP below aims to install 2 new servers and to migrate existing services

[Figure: the CP workflow, annotated on the second build of the slide with "A failure occurs".]


Case Study

• IT infrastructure state in the company

[Figure: IT infrastructure state, built up over three slides.]


Case Study

Calculated weights per category (sub-categories are indented under their parent):

Category              Level   Strat. 1   Strat. 2   Strat. 3   Strat. 4
Service               1       1083       242        157.30     242
  Web Page Server     2       558        82         33.20      82
  DataBase            2       519        195        127.60     195
Network               1       1058       345        188.10     345
  Services            2       512        189        113.40     189
  Devices             2       485        136        66.20      136
System                1       603        167        54.30      167
  Computer System     2       545        153        52.90      153
    Hosting Server    3       319        175        49.90      175
    DB Server         3       192        -22        3.00       -22
Software              1       1115       343        126.60     343
  Web Server          2       607        138        86.80      138
  DB Server           2       443        169        36.20      169
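To see how the weights above can lead the strategies down different paths, the sketch below picks the top-level category with the greatest weight for each strategy. It assumes that the selection starts among level-1 categories and simply takes the maximum; the names `LEVEL1` and `top_category` are invented, and the figures are the level-1 rows of the table with decimal commas read as decimal points.

```python
# (category, Strat. 1, Strat. 2, Strat. 3, Strat. 4) for the level-1 categories.
LEVEL1 = [
    ("Service", 1083, 242, 157.30, 242),
    ("Network", 1058, 345, 188.10, 345),
    ("System", 603, 167, 54.30, 167),
    ("Software", 1115, 343, 126.60, 343),
]


def top_category(rows, strategy):
    """Return the name of the category with the greatest weight for strategy 1..4."""
    return max(rows, key=lambda row: row[strategy])[0]


selected = {strategy: top_category(LEVEL1, strategy) for strategy in (1, 2, 3, 4)}
print(selected)  # {1: 'Software', 2: 'Network', 3: 'Network', 4: 'Network'}
```

Under these assumptions Strategy 1 would begin with Software while the other three would begin with Network. Strategies 2 and 4 share the same category weights, so they can only diverge once questions are weighted.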

Case Study

• Diagnostic workflows generated

[Figures: diagnostic workflows generated by the strategies, shown on two slides. Annotation on both: "The PHP configuration does not allow the use of the language in users’ websites".]


Final Considerations

• The proposed solution identifies the root cause of failures and offers the following features

• Reuse of the operator’s knowledge

• Interactivity between the solution and the operator

• Flexibility of the generated diagnostics

• Compatibility with the standards used by companies

• The modular structure of the solution allows organizations to adapt the system to their specific needs

• The proposed strategies generate different diagnostic workflows for the same infrastructure and failure

• Based on the results obtained, we have the following recommendations for IT operators

• Strategy 1 – histories with a small number of records

• Strategy 2 – large, recent histories

• Strategy 3 – histories that span at least 10 months

• Strategy 4 – data sets with a large number of popular questions

Future Work

• Explore new criteria for the selection of questions

• Confidence

• False positive and false negative rates

• Extend the process to identify root causes in other scopes

• Investigate the use of CIM classes (actions and checks) to improve system bootstrapping

• Automate root cause identification for certain kinds of failures


Thank you for your attention!

Questions?


Proposed Solution: Root Cause Analyzer

• Input Processor

  • Identification based on PR

  • Identification based on categories

  • Identification based on RCs

• Question Selector

  • Calculates the weights according to the strategy

  • Selects the category that has the greatest weight

  • Selects the question that has the greatest weight/level

• Question Verifier

  • Obvious? Threshold: 80% with the same answer

• Inputs and outputs shown in the figure: CIs, RCs, Log
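One possible reading of the Question Verifier's "Obvious?" check is sketched below: a question is treated as obvious when at least 80% of the answers recorded for it in past diagnostics are the same. The slide gives only the threshold, so the function name, the input format and the use of "at least" rather than "more than" are all assumptions.

```python
from collections import Counter

OBVIOUS_THRESHOLD = 0.80  # from the slide: "80% with the same answer"


def obvious_answer(past_answers, threshold=OBVIOUS_THRESHOLD):
    """Return the dominant past answer if its share reaches the threshold, else None."""
    if not past_answers:
        return None
    answer, count = Counter(past_answers).most_common(1)[0]
    if count / len(past_answers) >= threshold:
        return answer
    return None


print(obvious_answer(["yes"] * 8 + ["no"] * 2))  # yes
print(obvious_answer(["yes"] * 7 + ["no"] * 3))  # None
```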


Case Study

• Identified CIs and associated categories

CI                   Categories (from most general to most specific)
Hosted Sites         Service > Web Page Server
DataBase Access      Service > DataBase
Web Page Access      Service > Web Page Server
PHP Interpreter      Service > Web Page Server
CMS Service          Service > Web Page Server
Logical Connection   Network > Services
Joomla               Software > Web Server
PHP                  Software > Web Server
Apache               Software > Web Server
MySQL                Software > Web Server
DB Server            System > Computer System > DB Server
Hosting Server       System > Computer System > Hosting Server
Switch               Network > Devices

Proposed Solution: Information Model

[Figure: UML class diagrams of the information model, shown on two slides. Classes: ManagedElement, ExchangeElement, SolutionElement, SolutionCategory, ServiceProblem, ServiceIncident, Category, Question, Answer, Problem, RootCause, LogicalElement, EnabledLogicalElement, MessageLog, RecordLog. Associations: determinesProblem, possibleAnswers, determinesOthersQuestions, CategoryParentChild, QuestionCategory, recordedProblem, recordedQuestions, recordedAnswers.]