Problem Management Practitioners Forum
Thursday, January 19, 2012
Jon Dowell, Jorge A. Wong
Agenda
o Housekeeping & Introductions
o Define a successful investigation
o Makeup of a successful Problem Manager
o Proactive monitoring of automated alerts for trends/patterns
o Impact of Change Management on PbM
o Feedback & next steps
Housekeeping & Introductions
Fire & Washrooms
Name, Company, & Experience
Jon Dowell
• Senior Consultant with KSLD Consulting.
• 15 years of experience solving I.T. mysteries.
• Facilitation and critical thinking during:
  o Major Incidents
  o Problem investigations
  o Project quality assessments prior to go-live
  o Project warranty periods
• Training and mentoring
  o Critical thinking
  o Root cause analysis
  o Impact assessments
  o Potential risks associated with requests for change
KSLD Consulting specializes in I.T. Problem Management and problem solving for today’s busy world.
Jorge Wong
• Over 13 years in IT with Enmax and Accenture
  o Senior Systems Analyst
  o Applications Support Team Lead
  o Contact Center Technology Team Lead
  o Service Delivery Lead
  o Relationship Manager
  o Problem Manager
• ITIL Background
• Focuses on reactive and proactive problem management
• Facilitates and conducts problem investigations, using the cause mapping analysis method to capture the complete investigation and to:
  o Assess impact and cost
  o Identify root cause(s)
  o Identify the best solution(s) to prevent recurrence
• Reviews and analyzes data from incident management to pinpoint the problems whose resolution will give the best results.
Define a successful investigation
Jorge A. Wong
• Successful Problem Investigations
• Must first understand:
  • Why do we have problem investigations?
    • An investigation should be conducted to diagnose the root cause of the problem.
  • How long should it take?
    • The speed and nature of the investigation will vary depending upon the impact, severity, and urgency of the problem.
  • What resources are required?
    • The appropriate level of resources and expertise should be applied to finding a resolution corresponding to the priority and service levels targeted.
• Then, use your problem investigation toolkit.
  • There are many problem analysis, diagnosis, and solving techniques available, and much research has been done in this area.
• Successful Problem Investigations
• Some of the most useful and frequently used techniques include:
  • Chronological analysis
    • Timeline of events
  • Pain Value Analysis
    • What level of pain has been caused to the organization/business by these problems
  • Kepner and Tregoe
    • Deeper rooted problems
  • Cause Mapping
    • Deeper rooted problems
  • 5 Whys
    • Cause and effect
  • Brainstorming
    • Gather together the relevant people and brainstorm the problem
  • Ishikawa Diagrams
    • Document causes and effects which can be useful in helping identify where something may be going wrong, or be improved
  • Pareto Analysis
    • Separate important potential causes from more trivial issues
• Use what is appropriate and what you feel comfortable with.
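As an illustration of the Pareto Analysis technique listed above, here is a minimal Python sketch. It assumes each incident has already been tagged with a single cause category; the function name `pareto`, the 80% cutoff, and the sample data are hypothetical and not taken from the presentation.

```python
from collections import Counter

def pareto(causes, cutoff=0.8):
    """Return the most frequent causes that together cover `cutoff` of all incidents."""
    counts = Counter(causes).most_common()  # sorted by frequency, highest first
    total = sum(n for _, n in counts)
    vital, running = [], 0
    for cause, n in counts:
        # Stop once the causes collected so far already reach the cutoff share.
        if running / total >= cutoff:
            break
        vital.append((cause, n))
        running += n
    return vital

# Made-up incident data: one cause label per incident record.
incidents = (["disk full"] * 5 + ["expired certificate"] * 3
             + ["bad deploy"] * 1 + ["network"] * 1)
vital_few = pareto(incidents)
print(vital_few)
```

With this sample, two of the four cause categories account for 8 of 10 incidents, which is the kind of separation between important causes and more trivial issues that the technique aims for.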
• Successful Problem Investigations
o End results
  o Expected and desired outcome realized
  o Root cause(s) identified and/or validated
  o Corrective measure(s) identified and/or implemented
  o Effective use of resources throughout the investigation
o Which means
  o Increased benefits to the business and the IT organization of:
    o Decreased downtime
    o Increased business satisfaction
    o Decreased amount of IT resources spent on incident management
o Other benefits
  o Influences future cost avoidance
  o CMDB
  o Improved IT service quality
  o Incident volume reduction
  o Permanent solutions
  o Improved organizational learning
  o Better first time fix rate at the Service Desk
  o Improves existing processes and procedures
  o Happy Staff, including Problem Manager!
What makes a successful Problem Manager?
Jon Dowell
Root Cause
[Diagram built up over several slides: starting from the Event, each "Why?" adds a further link: Technical Failure, then People Failure, then Process Failure, and finally "Root Cause?"]
Kepner Tregoe has a process called "Incident Mapping" that follows a similar approach.
ThinkReliability also has a process called "Cause Mapping".
Brainstorm traits for a good Problem Manager…
Are these good Problem Managers?
What about these individuals…?
What are the traits of a Problem Manager?
• Listening
  o Ability to listen
  o Attention to detail… while listening
• Questioning
  o Open questions… to allow the story to flow
  o Closed questions… to confirm facts/details
  o Ability to ask tough questions and not be sidetracked by misdirection.
• Leadership
  o Ability to lead teams, resolve conflict, and drive resolution.
  o Prioritization with a focus on business, not technical, impact.
  o Strong organization & time management abilities.
• Business writing skills
And…
• Understanding of business terminology and concepts.
• Understanding of basic technical concepts, architecture, and methodologies.
Helpful educational opportunities?
• Dale Carnegie
• Kepner Tregoe
  • Problem Solving & Decision Making
  • Incident Mapping
• ThinkReliability
  • Cause Mapping
• FranklinCovey
  • Focus
• General Business Writing
Proactive monitoring of automated alerts for trends/patterns
Jorge Wong
Alerts and monitoring, why?
• Identify future problems.
• Prevent problems from happening.
• Manage technology infrastructure based on business.
• Anticipate and meet the needs of the business.
• Effectively manage an increasingly intricate and complex infrastructure.
• Predict and solve problems before they affect business.
• According to industry analyst reports, IT still discovers about 70% of problems through the service desk.
• Alerts and monitoring, why?
  o Reactive to Proactive
  o End-user experience
  o Application performance and availability
  o Service level commitments
  o Outages
  o Cost avoidance
  o Resources
  o Productivity
  o Efficiency
  o Capacity
  o Predictive analytics
  o MTTR
  o MTBF
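MTTR and MTBF appear in the list above without definitions. The sketch below uses one common convention, stated here as an assumption: MTTR is total downtime divided by the number of outages, and MTBF is total uptime divided by the number of outages. The function name `mttr_mtbf`, the reporting period, and the outage records are illustrative only.

```python
from datetime import datetime, timedelta

def mttr_mtbf(outages, period_start, period_end):
    """Compute (MTTR, MTBF) as timedeltas.

    outages: list of (start, end) datetimes, assumed non-overlapping and
    fully inside the reporting period.
    """
    if not outages:
        return None, None  # no failures in the period: both metrics undefined
    downtime = sum((end - start for start, end in outages), timedelta())
    uptime = (period_end - period_start) - downtime
    n = len(outages)
    return downtime / n, uptime / n

# Hypothetical 10-day reporting period with two outages (2 h and 1 h).
period_start = datetime(2012, 1, 1)
period_end = datetime(2012, 1, 11)
outages = [
    (datetime(2012, 1, 3, 10, 0), datetime(2012, 1, 3, 12, 0)),
    (datetime(2012, 1, 8, 9, 0), datetime(2012, 1, 8, 10, 0)),
]
mttr, mtbf = mttr_mtbf(outages, period_start, period_end)
print("MTTR:", mttr, "MTBF:", mtbf)
```

Tracking both values over successive periods is one way to see whether a move from reactive to proactive monitoring is paying off: MTTR should fall and MTBF should rise.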
• Alerts and monitoring, what?
  o Demand
  o Capacity
  o Availability
  o KPIs
  o Logs
  o Services
  o Network
  o Servers
  o User Defined Monitoring and Instant Alerts
    • Monitor the Windows Event log
    • Alert on hardware and software changes
    • Alert on specific file changes and protection violations
    • Know if disk space is running low on computers
    • Monitor computer online/offline status
    • Know if a server goes down
    • Know when traveling users with notebooks connect
    • Alert message and recipient configuration
• Alerts and monitoring, what?
  o Pro-active approach
    • Server's utilization exceeds predefined percentage of total capacity available... raise alert!
    • Server CPU breaches 90% utilization, or disk becomes 80% full.
  o Food For Thought
    • What happens when a server goes down?
    • Alarms, alerts, and notifications are triggered all over the place.
    • The application, database, and operating system may appear to be down.
    • However, this problem behavior may be due to a single point of failure elsewhere in the network.
    • What is the problem?
    • What is the impact?
    • What is or are the root causes?
    • What is or are the workarounds and resolutions?
    • Or... should we even be worried about it?
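The pro-active approach above (raise an alert when CPU breaches 90% or a disk becomes 80% full) can be sketched as a simple threshold check. This is a minimal Python illustration, not any monitoring product's API: the function name `check_server`, the server names, and the message format are hypothetical, and the utilization readings are assumed to come from whatever monitoring tool is already in place.

```python
# Thresholds taken from the slide: CPU breaches 90%, disk becomes 80% full.
CPU_LIMIT = 90.0
DISK_LIMIT = 80.0

def check_server(name, cpu_percent, disk_percent):
    """Return a list of alert messages for one server's current readings."""
    alerts = []
    if cpu_percent > CPU_LIMIT:
        alerts.append(f"{name}: CPU at {cpu_percent:.1f}% (limit {CPU_LIMIT:.1f}%)")
    if disk_percent >= DISK_LIMIT:
        alerts.append(f"{name}: disk {disk_percent:.1f}% full (limit {DISK_LIMIT:.1f}%)")
    return alerts

# Hypothetical readings: (server name, CPU %, disk %).
readings = [("app01", 95.0, 50.0), ("db01", 40.0, 85.0), ("web01", 10.0, 10.0)]
for name, cpu, disk in readings:
    for message in check_server(name, cpu, disk):
        print(message)
```

A check like this only flags individual symptoms. As the "Food For Thought" questions point out, many simultaneous alerts may trace back to a single point of failure, so the output is a starting point for investigation rather than a diagnosis.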
Problem Management Categories
• Re-active
• Pro-active
• Predictive Intelligence?
Feedback & next steps
Jorge A. Wong
• Next Steps
  o Future sessions
    • Problem Management Practitioner Forums 2012
      o January 19 (9am - Noon)
      o March 15 (9am - Noon)
      o June 7 (9am - Noon)
      o Followed by casual lunch
    • Change Management Practitioner Forum 2012
      o April 12 (9am - Noon) <Tentative>
    • Business Analyst World Conference 2012
      o May 7, 8, & 9
    • Practitioner Forums 2012
      o Looking for subject ideas
        • Configuration Management
        • Service Level Management
      o Looking for thought leaders and interested participants
15 Minute Break
Thank you!
Appendix
Problem Management: What it is? Is not?
Jorge A. Wong
IT Problem Management
• What is a Problem?
  o A cause of one or more Incidents. The cause is not usually known at the time a Problem Record is created.
• What is Problem Management?
  o The objective of Problem Management is to resolve the root cause of Incidents, and to prevent the recurrence of Incidents related to these errors.
• What does a Problem Manager do?
  o The Problem Manager is responsible for managing the lifecycle of all Problems. He researches the root causes of Incidents and thus ensures the enduring elimination of interruptions. His primary objectives are to prevent Incidents from happening, and to minimize the impact of Incidents that cannot be prevented.
What is Root Cause Analysis?
• A standard process of:
  o Identifying a problem
    • What happened?
  o Containing and analyzing the problem
    • What were the root causes of the problem?
  o Defining the root cause
    • What internal options are available to deal with the problem?
  o Defining and implementing the actions required to eliminate the root cause
    • What is the cost of acting upon the available options?
  o Validating that the corrective action prevented recurrence of problem
    • Which decision options will provide the most cost-effective solution?
[Diagram of process steps: Identify Problem, Identify Team, Immediate Action, Root Cause, Action Plan, Complete Plan, Follow Up Plan, Validate]
At a high level, problem investigation looks at:
• What were we doing? (Before Major Incident, Incident)
• What was the problem?
• Why did it happen?
• What should be done?
• What will we be doing now? (After Problem Investigation)