deal with production issues - the itil way

Deal with Production Issues

Suggestions from ITIL

Problems to solve

- Long resolution time
- Neglected issues
  - Issues we lose track of until our users remind us
- Recurring issues
- Inconsistency in response time
- Developers are constantly distracted by having to resolve issues

Goal

- Manage issues in a consistent manner
- Fast resolution
- Reduce client impact
- Proactively resolve issues before they impact clients

Basic Concepts

- Incidents
  - Any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service
- Problems
  - A problem is a condition often identified as the cause of multiple incidents that exhibit common symptoms.
- Known Errors
  - A known error is a condition identified by successful diagnosis of the root cause of a problem, and subsequent development of a Work-around

Relationship of the three

- A Problem is the root cause of the Incidents
- An Incident is the manifestation of an underlying Problem
- One Problem can cause many Incidents
- A Known Error is a Problem with a known root cause and a known workaround
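The relationship above can be pictured as a small data model. This is a minimal Python sketch, not something taken from ITIL itself: the class name `Problem` and all of its field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Problem:
    """A problem and the incidents it has caused (hypothetical field names)."""
    problem_id: str
    symptoms: str
    incident_ids: list[str] = field(default_factory=list)  # one problem, many incidents
    root_cause: Optional[str] = None
    workaround: Optional[str] = None

    @property
    def is_known_error(self) -> bool:
        # A known error is a problem with a known root cause and a known workaround.
        return self.root_cause is not None and self.workaround is not None


problem = Problem("P-1", "Nightly batch job hangs")
problem.incident_ids += ["INC-7", "INC-12"]   # two incidents traced to the same problem
problem.root_cause = "Connection pool exhausted"
problem.workaround = "Restart the batch service"
```

Once both `root_cause` and `workaround` are filled in, `problem.is_known_error` becomes true, which mirrors the definition on the slide.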

Manage Incident vs. Manage Problem

- Different goals
  - Incident Management focuses on restoring service operation as quickly as possible
  - Problem Management focuses on finding and eliminating the root cause
- Different actions
  - Incident Management applies workarounds or temporary fixes to quickly restore the services
  - Problem Management issues a change to fundamentally eliminate the root cause
- Incident Management is reactive and Problem Management is proactive
- Incident Management emphasizes speed and Problem Management emphasizes quality

Common mistakes

- Spending tremendous time and effort to find the root cause before the service level is recovered
- Stopping the investigation after an incident is fixed by a workaround
- The same incident occurs repeatedly without the root cause being understood

Solutions from ITIL

- Separate out Incident Management and Problem Management into two independent but related processes
- Handle incidents (restore service) as quickly as possible
- Proactively and independently work on resolving problems
- Wisely manage Known Errors

Incident Management

- Always remember the goal is to “Restore service level as quickly as possible”
- How to go fast?
  - Classification
  - Match known errors and known workarounds
  - Appropriate escalation
- Go fast, but don’t go crazy. Don’t miss:
  - Record
  - Prioritize
  - Follow up

Incident Management Process

Acceptance And Record

- Benefits of recording
  - Helps to diagnose new incidents based on known incidents
  - Helps Problem Management to find the root cause
  - Makes it easy to determine the impact
  - Makes it possible to track and control the issue resolution
- Incident Reporting Channels
  - User
  - System Monitor/Alert
  - IT person

Incident Record

- Unique ID
- Basic diagnosis info
  - Timestamp
  - Symptoms
  - User info (name, contact info)
  - Who’s responsible
- Additional information
  - Screenshots
  - Logs
- Status
  - New, Accepted, Scheduled, Assigned, Active, Suspended, Resolved, Terminated
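The fields listed on this slide map naturally onto a small record type. The sketch below is one possible shape in Python, offered as an illustration only: the class and field names, the running-number ID scheme, and the use of UTC timestamps are all assumptions, while the status values are the ones listed above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from itertools import count


class Status(Enum):
    # Status values as listed on the slide.
    NEW = "New"
    ACCEPTED = "Accepted"
    SCHEDULED = "Scheduled"
    ASSIGNED = "Assigned"
    ACTIVE = "Active"
    SUSPENDED = "Suspended"
    RESOLVED = "Resolved"
    TERMINATED = "Terminated"


_next_id = count(1)  # hypothetical ID scheme: a simple running number


@dataclass
class IncidentRecord:
    # Basic diagnosis info
    symptoms: str
    user_name: str
    user_contact: str
    owner: str  # who is responsible
    # Filled in automatically when the record is created
    incident_id: int = field(default_factory=lambda: next(_next_id))
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    # Additional information: paths to screenshots and logs
    attachments: list[str] = field(default_factory=list)
    status: Status = Status.NEW


incident = IncidentRecord(
    symptoms="Login page times out",
    user_name="J. Doe",
    user_contact="jdoe@example.com",
    owner="service-desk",
)
incident.attachments.append("login-timeout.png")
```

Every new record starts in the `New` status and receives its own ID and timestamp, so the Service Desk only has to supply the symptoms, the user details and the owner.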

Classification

- Classification
  - Possible reasons (application, network, database, business logic, etc.)
  - Supporting group (application group, database group, infrastructure group, network group, etc.)
- Prioritize
  - Priority = Impact × Urgency
  - Determine resolution timeline (resolve within X hours) based on Service Level Agreement
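As a worked example of the priority formula, the following Python sketch multiplies an impact rating by an urgency rating and maps the result to a priority level. The 1 to 3 rating scale and the score bands are assumptions made for illustration; the slides do not define them. The resolution timelines (2, 4, 6 and 8 hours) are the ones shown on the Escalation by Priorities slide.

```python
# Resolution timeline in hours per priority, as on the Escalation by Priorities slide.
RESOLUTION_HOURS = {1: 2, 2: 4, 3: 6, 4: 8}


def priority_score(impact: int, urgency: int) -> int:
    """Priority = Impact x Urgency, each rated 1 (low) to 3 (high) (assumed scale)."""
    if not (1 <= impact <= 3 and 1 <= urgency <= 3):
        raise ValueError("impact and urgency must be between 1 and 3")
    return impact * urgency


def priority_level(score: int) -> int:
    """Map a score to priority 1 (most urgent) to 4 using hypothetical bands."""
    if score >= 9:
        return 1
    if score >= 6:
        return 2
    if score >= 3:
        return 3
    return 4


def resolution_hours(impact: int, urgency: int) -> int:
    """Look up the resolution timeline for the computed priority."""
    return RESOLUTION_HOURS[priority_level(priority_score(impact, urgency))]
```

With these bands, a high-impact, high-urgency incident (3 × 3 = 9) is priority 1 and must be resolved within 2 hours, while a low-impact, medium-urgency incident (1 × 2 = 2) is priority 4 with an 8 hour timeline. The actual bands should come from the Service Level Agreement.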

Preliminary Support

- Preliminary Response
  - Acknowledge acceptance
  - Collect basic info
  - Provide basic help to the user
- Service Requests
  - A Service Request is a standard service such as checking status, resetting a password, etc.
  - Go through the standard procedure to handle service requests

Match

- Match known errors
  - Known solution
  - Known workaround
  - Known resolution procedure
- Match existing incidents
  - Link the new incident with the existing incidents
  - Increase the impact level of the existing incident
  - If the existing one is already being worked on, inform the responsible person/group
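One simple way to match a new incident against known errors is to compare the words in its symptom description with the symptoms recorded for each known error. The Python sketch below does this with a plain word-overlap ratio. The names `KnownError` and `match_known_errors` and the 0.5 threshold are hypothetical; a real service desk tool would likely use a more robust search.

```python
import re
from dataclasses import dataclass


@dataclass
class KnownError:
    error_id: str
    symptoms: str
    workaround: str


def _words(text: str) -> set[str]:
    """Lower-case the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_known_errors(incident_symptoms: str,
                       known_errors: list[KnownError],
                       min_overlap: float = 0.5) -> list[KnownError]:
    """Return known errors whose symptom words mostly appear in the incident,
    best match first."""
    incident_words = _words(incident_symptoms)
    scored = []
    for known_error in known_errors:
        error_words = _words(known_error.symptoms)
        if not error_words:
            continue
        overlap = len(incident_words & error_words) / len(error_words)
        if overlap >= min_overlap:
            scored.append((overlap, known_error))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [known_error for _, known_error in scored]


catalog = [
    KnownError("KE-1", "login timeout after password reset", "Clear the session cache"),
    KnownError("KE-2", "report export fails with large files", "Split the export"),
]
matches = match_known_errors("User reports login timeout after a password reset", catalog)
```

Here `matches` contains only `KE-1`, so the Service Desk can apply its recorded workaround straight away instead of starting a new investigation.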

Investigation and Diagnosis

- Escalation
  - Functional escalation (Technical escalation): Involve more technical experts, involve teams in other functional groups, or involve external suppliers
  - Hierarchical escalation (Management escalation): Escalate to a higher-level management team

Escalation by Priorities

Escalation levels:
- A (Service Desk)
- B (Second Line)
- C (Third Line, Supplier)
- D (Incident Manager)
- E (Division Management)
- F (Corporate Management)

Escalation checkpoints: 0 minute, 10 minute, 30% of timeline (the headers of the later checkpoint columns were not recoverable)

Priority | Resolution timeline | Escalation sequence
1        | 2 hr                | A, B, CD, EF
2        | 4 hr                | A, B, C, D, E,F
3        | 6 hr                | A, B, C, D
4        | 8 hr                | A, B, C
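A time-bound escalation rule like this one can be expressed as a lookup from elapsed time to support level. The Python sketch below covers only the three checkpoints whose headers are legible here (0 minutes, 10 minutes, and 30% of the resolution timeline). It assumes that levels A, B and C are engaged at those checkpoints for every priority, which is a simplification; later checkpoints and levels D to F are left out because their timing is not given.

```python
from datetime import timedelta

# Resolution timeline in hours per priority, as in the table above.
RESOLUTION_HOURS = {1: 2, 2: 4, 3: 6, 4: 8}


def checkpoints(priority: int) -> list[tuple[timedelta, str]]:
    """Return (elapsed time, support level) pairs for the legible checkpoints.

    Assumption: A at 0 minutes, B at 10 minutes, C at 30% of the timeline.
    """
    timeline = timedelta(hours=RESOLUTION_HOURS[priority])
    return [
        (timedelta(0), "A (Service Desk)"),
        (timedelta(minutes=10), "B (Second Line)"),
        (timeline * 3 / 10, "C (Third Line, Supplier)"),
    ]


def current_level(priority: int, elapsed: timedelta) -> str:
    """Return the highest support level that should be involved by now."""
    level = "A (Service Desk)"
    for threshold, name in checkpoints(priority):
        if elapsed >= threshold:
            level = name
    return level
```

For a priority 1 incident the 30% checkpoint falls at 36 minutes, so `current_level(1, timedelta(minutes=40))` returns the third-line level, whereas a priority 4 incident at the same point is still with the second line because its 30% checkpoint is at 144 minutes.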

Investigation Activities

- Assign dedicated support person
- Collect basic info
- Query historical data
  - Recent releases
  - Recent changes
  - Workload trend
- Analyze
  - Again, don’t spend too much time finding the root cause. Find a workaround as soon as possible!

Resolve and recover

- Resolution (workarounds or permanent fix)
  - Create a Request For Change (RFC)
  - Approve RFC
  - Implement Change
- Record the analysis, the root cause, the workaround and the solution
- Leave the incident in Open status when resolution hasn’t been found

Termination

- Contact the user to confirm the incident is resolved
- Change the Incident status to “Closed”
- Update the Incident record to reflect the final priority, impact, user and root cause

Track and Monitor

- Assign an owner to each incident. Usually it’s the Service Desk person.
- Provide feedback to the users after a change
- Enforce the escalation based on the priority

Problem Management

- Problem Control
  - Find the root cause of a problem
  - Turn a problem into a Known Error
- Error Control
  - Control and monitor the Known Errors until they are appropriately handled
- Proactive Problem Management
  - Resolve problems before they cause any incidents

Problem Control

Identify Problems

- Analyze the trends of incidents
  - Likely to reoccur
  - Likely more will occur
  - Likely to have larger impact
- Analyze the weakness of the infrastructure
  - Availability
  - Capability
- A significant incident (outage)
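Trend analysis can start very simply: count recent incidents per category and flag the categories that keep coming back. The Python sketch below is a minimal illustration. The function name, the 30-day window and the threshold of three incidents are assumptions, and the incident data is made up for the example.

```python
from collections import Counter
from datetime import datetime, timedelta


def recurring_categories(incidents, now, window_days=30, threshold=3):
    """Return categories with at least `threshold` incidents in the window.

    `incidents` is an iterable of (timestamp, category) pairs.
    """
    cutoff = now - timedelta(days=window_days)
    counts = Counter(category for timestamp, category in incidents
                     if timestamp >= cutoff)
    return sorted(category for category, n in counts.items() if n >= threshold)


# Made-up sample data for illustration only.
sample = [
    (datetime(2024, 1, 5), "database"),
    (datetime(2024, 1, 12), "database"),
    (datetime(2024, 1, 20), "database"),
    (datetime(2024, 1, 25), "network"),
    (datetime(2023, 11, 1), "network"),
    (datetime(2023, 11, 2), "network"),
]
candidates = recurring_categories(sample, now=datetime(2024, 1, 31))
```

In this sample only the database category is flagged: the network incidents are just as numerous overall, but two of them fall outside the 30-day window. Flagged categories are candidates for a problem record.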

Diagnosis

- Recreate the incident in a testing environment
- Link the modules with incidents
- Review the latest changes
- After the root cause of a problem is found, this problem becomes a Known Error

Temporary Fixes

- It’s important to find a temporary fix if the problem causes significant incidents
- If the temporary fix involves changes in the infrastructure, a Request For Change must be submitted. (Later, another RFC may be submitted to fix the root cause)
- For urgent problems, the Emergency Change Request Process should be initiated.

Error Control

Identify and Record Known Error

- Identify
  - Find the root cause of a problem
  - Link a problem with a known error
- Record
  - Assign an ID
  - Symptoms
  - Root cause
  - Status
- Notification
  - Notify the incident management team. They can associate new incidents with known errors

Determine the solution

- Evaluate based on
  - Service Level Agreement
  - Impact and Urgency
  - Cost and benefit
- Possible solutions
  - Temporary fixes
  - Permanent fixes
  - No fix (cost is greater than benefits)
- Record the decision in Problem Database

Known Errors from other environments

- Known errors from development environment
  - We may choose to release with some minor known issues
- Known errors from suppliers
  - Usually reported in the release notes
- Record, monitor and track those known errors
- Relate problems with those known errors

PIR (Post Implementation Review)

- Normal problems
  - Confirm all the related incidents are closed
  - Verify that the problem record is complete (symptoms, root cause and solutions)
  - Change the problem status to Resolved
- Significant problems
  - What went well?
  - What went wrong?
  - How to do better next time?
  - How to prevent similar issues from happening again?

Track and Monitor

- Track the full lifecycle of each known error
- Reevaluate impact and urgency. Adjust the priorities accordingly.
- Monitor the progress of the diagnosis and implementation of the solution. Monitor the implementation of the RFC.

Proactive Problem Management

- Focus on the quality of the service and the infrastructure
- Analyze operational trends
- Detect the potential incidents and prevent them from happening
- Find out the weak points of the infrastructure or the overloaded components

Ideas to improve our Production Support process

- Idea 1: Create an independent Problem Management Team.
- Idea 2: Create a Problem Database
- Idea 3: Define the Production Support Procedure
- Idea 4: Review and revise the procedures of using TeamTrack
- Idea 5: Enforce Post Implementation Review
- Idea 6: Proactively manage problems
- Idea 7 (optional): Acquire Service Desk software to facilitate the process

Create an independent Problem Management Team.

- Can be a full time team or a part time team
- Appoint a Problem Management Manager. Must be different from the Production Support Manager. Their goals, schedules and requirements are different.
- Responsible for managing all the production problems (not incidents) for multiple applications
  - Identify problems
  - Record problems
  - Find and evaluate solutions
  - Track the progress till closure
- Work closely with the existing Production Support team.

Create a Problem Database

- An easy-to-search knowledge database
- Include problems and known errors
- Track symptoms, root causes, temporary fixes, workarounds, and permanent solutions
- Include all the known errors in DEV and unresolved or deferred defects in QA/RATE environments
- Maintained by the Problem Management Team
- Will be used by the Production Support team for matching and fast resolution of incidents
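To make the idea concrete, here is a minimal Python sketch of such a database using the standard-library `sqlite3` module with an in-memory database. The table name, column names and helper functions are hypothetical; they simply mirror the items listed above (symptoms, root cause, workaround, permanent solution, environment).

```python
import sqlite3


def create_problem_db() -> sqlite3.Connection:
    """Create an in-memory database with one hypothetical `problem` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE problem (
            id TEXT PRIMARY KEY,
            environment TEXT NOT NULL,   -- e.g. DEV, QA, PROD
            symptoms TEXT NOT NULL,
            root_cause TEXT,
            workaround TEXT,
            permanent_solution TEXT
        )
        """
    )
    return conn


def add_problem(conn, problem_id, environment, symptoms,
                root_cause=None, workaround=None, permanent_solution=None):
    """Insert one problem or known error record."""
    conn.execute(
        "INSERT INTO problem VALUES (?, ?, ?, ?, ?, ?)",
        (problem_id, environment, symptoms, root_cause, workaround,
         permanent_solution),
    )


def search_by_symptom(conn, keyword):
    """Return (id, symptoms, workaround) rows whose symptoms contain the keyword."""
    cursor = conn.execute(
        "SELECT id, symptoms, workaround FROM problem "
        "WHERE symptoms LIKE ? ORDER BY id",
        (f"%{keyword}%",),
    )
    return cursor.fetchall()
```

The Production Support team would call something like `search_by_symptom(conn, "timeout")` while handling an incident and apply the returned workaround, while the Problem Management Team owns the inserts and updates.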

Define the Production Support Procedure (Work Instructions)

- Create a formal and detailed document. Train the Production Support Team to follow the new procedure
- Start with the ITIL Incident Management Process. Adjust it to our own situation and tools
- Clearly define how to calculate priorities
- Clearly define the time-bound escalation procedure
- Clearly define the monitoring and tracking steps

Review and define the procedure of using TeamTrack

- TeamTrack is our existing Incident Tracking system
- Review the functions of TeamTrack
- Redefine the incident escalation process according to ITIL suggestions
- Define the interface between PC Support and IT Production Support Team
  - Communication channel
  - Roles and responsibilities
  - Escalation
  - Track and Control
  - Knowledge sharing

Enforce PIR

- Contact each user to confirm all the incidents are closed
- Make sure the Problem record is complete and useful
- Identify issues in the Incident and Problem Management process. Add those to the Problem database.

Proactively Manage Problems

- Responsibility of the Problem Management Team.
- Perform the following activities:
  - Analyze incidents to find the trend
  - Analyze infrastructure to identify possible bottlenecks
  - Run fail-over and stress tests
  - Apply a problem solution across multiple related applications
  - Establish and maintain the Production Monitor System to proactively detect system anomalies
- Evaluate how many problems are proactively identified and resolved

Service Desk Software

- Evaluate the existing TeamTrack software and see if it covers our needs
- Other popular options
  - HP Openview Service Desk
  - Remedy Strategic Service Suite
  - CA Unicenter Service Desk
