Deal with Production Issues - The ITIL Way
DESCRIPTION
Introduces the Incident Management and Problem Management concepts of ITIL and describes how to manage production issues with ideas from ITIL.
TRANSCRIPT
Deal with Production Issues
Suggestions from ITIL
Problems to solve
  Long resolution time
  Neglected issues
    Issues we lose track of until our users remind us
  Recurring issues
  Inconsistency in response time
  Developers are constantly distracted by having to resolve issues
Goal
  Manage issues in a consistent manner
  Fast resolution
  Reduce client impact
  Proactively resolve issues before they impact clients
Basic Concepts
  Incidents
    Any event that is not part of the standard operation of a service and that causes, or may cause, an interruption to, or a reduction in, the quality of that service
  Problems
    A problem is a condition often identified as the cause of multiple incidents that exhibit common symptoms
  Known Errors
    A known error is a condition identified by successful diagnosis of the root cause of a problem and subsequent development of a workaround
Relationship of the three
  A Problem is the root cause of Incidents
  An Incident is the manifestation of an underlying Problem
  One Problem can cause many Incidents
  A Known Error is a Problem with a known root cause and a known workaround
Manage Incident vs. Manage Problem
  Different goals
    Incident Management focuses on restoring service operation as quickly as possible
    Problem Management focuses on finding and eliminating the root cause
  Different actions
    Incident Management applies workarounds or temporary fixes to quickly restore services
    Problem Management issues a change to fundamentally eliminate the root cause
  Incident Management is reactive and Problem Management is proactive
  Incident Management emphasizes speed and Problem Management emphasizes quality
Common mistakes
  Spending tremendous time and effort to find the root cause before the service level is recovered
  Stopping the investigation after an incident is fixed by a workaround
  Letting the same incident occur repeatedly without understanding the root cause
Solutions from ITIL
  Separate Incident Management and Problem Management into two independent but related processes
  Handle incidents (restore service) as quickly as possible
  Proactively and independently work on resolving problems
  Wisely manage Known Errors
Incident Management
  Always remember the goal is to “Restore service level as quickly as possible”
  How to go fast?
    Classification
    Match known errors and known workarounds
    Appropriate escalation
  Go fast, but don’t go crazy. Don’t miss:
    Record
    Prioritize
    Follow up
Incident Management Process
Acceptance and Record
  Benefits of recording
    Helps to diagnose new incidents based on known incidents
    Helps Problem Management to find the root cause
    Makes it easy to determine the impact
    Makes it possible to track and control the issue resolution
  Incident reporting channels
    User
    System monitor/alert
    IT person
Incident Record
  Unique ID
  Basic diagnosis info
    Timestamp
    Symptoms
    User info (name, contact info)
    Who’s responsible
  Additional information
    Screenshots
    Logs
  Status
    New, Accepted, Scheduled, Assigned, Active, Suspended, Resolved, Terminated
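A minimal sketch of how such an incident record could be represented, assuming Python dataclasses. The field names, the `IncidentRecord` class and the in-memory ID counter are illustrative assumptions, not part of ITIL or of any particular tracking tool; only the list of fields and status values comes from the slide.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from itertools import count


class Status(Enum):
    """Status values as listed on the slide."""
    NEW = "New"
    ACCEPTED = "Accepted"
    SCHEDULED = "Scheduled"
    ASSIGNED = "Assigned"
    ACTIVE = "Active"
    SUSPENDED = "Suspended"
    RESOLVED = "Resolved"
    TERMINATED = "Terminated"


# Hypothetical ID source; a real tracking system would issue its own IDs.
_next_id = count(1)


@dataclass
class IncidentRecord:
    symptoms: str
    user_name: str
    user_contact: str
    owner: str  # who is responsible for the incident
    timestamp: datetime = field(default_factory=datetime.now)
    screenshots: list = field(default_factory=list)  # additional information
    logs: list = field(default_factory=list)         # additional information
    status: Status = Status.NEW
    incident_id: int = field(default_factory=lambda: next(_next_id))


# Example data is made up for illustration.
incident = IncidentRecord(
    symptoms="Login page returns HTTP 500",
    user_name="Jane Doe",
    user_contact="jane@example.com",
    owner="service-desk",
)
incident.status = Status.ACCEPTED
```

Keeping the status as an enumeration makes it harder to record a value that is not in the agreed list.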
Classification
  Classification
    Possible reasons (application, network, database, business logic, etc.)
    Supporting group (application group, database group, infrastructure group, network group, etc.)
  Prioritize
    Priority = Impact × Urgency
    Determine resolution timeline (resolve within X hours) based on the Service Level Agreement
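One possible way to turn "Priority = Impact × Urgency" into a calculation is sketched below. The 1 to 3 scales (3 being the highest) and the score thresholds are assumptions made for illustration; the slides do not define them. The 2/4/6/8 hour timelines are taken from the Escalation by Priorities slide, where priority 1 is the most urgent.

```python
# Resolution timelines per priority level, from the escalation slide.
RESOLUTION_HOURS = {1: 2, 2: 4, 3: 6, 4: 8}


def priority(impact: int, urgency: int) -> int:
    """Map impact x urgency to a priority level (1 = highest).

    Assumption: impact and urgency are rated 1 (low) to 3 (high),
    and the thresholds below are illustrative, not an ITIL standard.
    """
    if impact not in (1, 2, 3) or urgency not in (1, 2, 3):
        raise ValueError("impact and urgency must be 1, 2 or 3")
    score = impact * urgency
    if score >= 9:
        return 1
    if score >= 6:
        return 2
    if score >= 3:
        return 3
    return 4


def resolution_deadline_hours(impact: int, urgency: int) -> int:
    """Look up the resolution timeline for the computed priority."""
    return RESOLUTION_HOURS[priority(impact, urgency)]
```

In practice the thresholds and timelines would be taken from the Service Level Agreement rather than hard-coded.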
Preliminary Support
  Preliminary response
    Acknowledge acceptance
    Collect basic info
    Provide basic help to the user
  Service Requests
    A Service Request is a standard service such as checking status, resetting a password, etc.
    Go through the standard procedure to handle service requests
Match
  Match known errors
    Known solution
    Known workaround
    Known resolution procedure
  Match existing incidents
    Link the new incident with the existing incidents
    Increase the impact level of the existing incident
    If the existing one is already being worked on, inform the responsible person/group
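The matching step could be sketched roughly as follows. This uses a deliberately simple keyword overlap, which is an assumption for illustration; the slides do not prescribe a matching technique, and the `KnownError` class, IDs and workaround texts are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KnownError:
    error_id: str
    keywords: frozenset  # lowercase words that characterize the symptoms
    workaround: str


def match_known_error(symptoms: str, known_errors: list) -> Optional[KnownError]:
    """Return the known error sharing the most keywords with the symptoms.

    Returns None when no known error shares any keyword.
    """
    words = set(symptoms.lower().split())
    best, best_hits = None, 0
    for ke in known_errors:
        hits = len(ke.keywords & words)
        if hits > best_hits:
            best, best_hits = ke, hits
    return best


# Hypothetical known-error entries.
KNOWN_ERRORS = [
    KnownError("KE-1", frozenset({"login", "timeout"}), "Restart the session service"),
    KnownError("KE-2", frozenset({"report", "blank"}), "Clear the report cache"),
]

hit = match_known_error("Users see a timeout on the login page", KNOWN_ERRORS)
```

When a match is found, the support person can apply the recorded workaround right away instead of starting a new investigation.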
Investigation and Diagnosis
  Escalation
    Functional escalation (technical escalation): involve more technical experts, teams in other functional groups, or external suppliers
    Hierarchical escalation (management escalation): escalate to a higher level of management
Escalation by Priorities
  A (Service Desk)
  B (Second Line)
  C (Third Line, Supplier)
  D (Incident Manager)
  E (Division Management)
  F (Corporate Management)

  Priority  Resolution timeline  0 min  10 min  30% timeline  60% timeline  100% timeline
  1         2 hr                 A      B       C, D          E, F
  2         4 hr                 A      B       C             D             E, F
  3         6 hr                 A      B       C             D
  4         8 hr                 A      B       C
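A time-bound escalation table like this one can be encoded as data and queried, as in the rough sketch below. It assumes each row's entries fill the checkpoint columns from left to right (0 minutes, 10 minutes, then 30%, 60% and 100% of the resolution timeline), which is a reading of the slide layout rather than something the slides state. The function and constant names are illustrative.

```python
# Checkpoints: absolute minutes, or a fraction of the resolution timeline.
CHECKPOINTS = [("min", 0), ("min", 10), ("pct", 0.30), ("pct", 0.60), ("pct", 1.00)]

# Resolution timeline in hours per priority (1 = highest).
RESOLUTION_HOURS = {1: 2, 2: 4, 3: 6, 4: 8}

# Levels brought in at each checkpoint, per priority.
# A = Service Desk, B = Second Line, C = Third Line/Supplier,
# D = Incident Manager, E = Division Management, F = Corporate Management.
ESCALATION = {
    1: [["A"], ["B"], ["C", "D"], ["E", "F"], []],
    2: [["A"], ["B"], ["C"], ["D"], ["E", "F"]],
    3: [["A"], ["B"], ["C"], ["D"], []],
    4: [["A"], ["B"], ["C"], [], []],
}


def escalation_levels(priority: int, elapsed_minutes: float) -> list:
    """Return every level that should be involved after elapsed_minutes."""
    total_minutes = RESOLUTION_HOURS[priority] * 60
    involved = []
    for (kind, value), levels in zip(CHECKPOINTS, ESCALATION[priority]):
        threshold = value if kind == "min" else value * total_minutes
        if elapsed_minutes >= threshold:
            involved.extend(levels)
    return involved
```

For example, a priority 1 incident that has been open for 40 minutes has passed the 30% mark of its 2 hour timeline, so `escalation_levels(1, 40)` includes levels A through D.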
Investigation Activities
  Assign a dedicated support person
  Collect basic info
  Query historical data
    Recent releases
    Recent changes
    Workload trend
  Analyze
    Again, don’t spend too much time finding the root cause. Find a workaround as soon as possible!
Resolve and Recover
  Resolution (workaround or permanent fix)
    Create a Request For Change (RFC)
    Approve the RFC
    Implement the change
  Record the analysis, the root cause, the workaround and the solution
  Leave the incident in Open status when a resolution hasn’t been found
Termination
  Contact the user to confirm the incident is resolved
  Change the incident status to “Closed”
  Update the incident record to reflect the final priority, impact, user and root cause
Track and Monitor
  Assign an owner to each incident. Usually it’s the Service Desk person.
  Provide feedback to the users after a change
  Enforce escalation based on the priority
Problem Management
  Problem Control
    Find the root cause of a problem
    Turn a problem into a Known Error
  Error Control
    Control and monitor the Known Errors until they are appropriately handled
  Proactive Problem Management
    Resolve problems before they cause any incidents
Problem Control
Identify Problems
  Analyze the trends of incidents
    Likely to reoccur
    Likely more will occur
    Likely to have larger impact
  Analyze the weaknesses of the infrastructure
    Availability
    Capability
  A significant incident (outage)
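Trend analysis of incidents can start very simply, for example by counting how often the same symptom recurs. The sketch below assumes incidents are available as dictionaries with a normalized `symptom` field and uses an arbitrary threshold of three occurrences; both are illustrative assumptions.

```python
from collections import Counter


def problem_candidates(incidents, min_occurrences=3):
    """Return (symptom, count) pairs that recur often enough to suggest a problem.

    Results are ordered from most to least frequent.
    """
    counts = Counter(i["symptom"] for i in incidents)
    return [(s, n) for s, n in counts.most_common() if n >= min_occurrences]


# Made-up incident history for illustration.
incidents = [
    {"id": 1, "symptom": "login timeout"},
    {"id": 2, "symptom": "report blank"},
    {"id": 3, "symptom": "login timeout"},
    {"id": 4, "symptom": "login timeout"},
]

candidates = problem_candidates(incidents)
```

Each candidate is a starting point for opening a problem record and looking for the common root cause.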
Diagnosis
  Recreate the incident in a testing environment
  Link the modules with incidents
  Review the latest changes
  After the root cause of a problem is found, the problem becomes a Known Error
Temporary Fixes
  It’s important to find a temporary fix if the problem causes a significant incident
  If the temporary fix involves changes to the infrastructure, a Request For Change must be submitted. (Later, another RFC may be submitted to fix the root cause.)
  For urgent problems, the Emergency Change Request process should be initiated.
Error Control
Identify and Record Known Errors
  Identify
    Find the root cause of a problem
    Link a problem with a known error
  Record
    Assign an ID
    Symptoms
    Root cause
    Status
  Notification
    Notify the incident management team so they can associate new incidents with known errors
Determine the solution
  Evaluate based on
    Service Level Agreement
    Impact and urgency
    Cost and benefit
  Possible solutions
    Temporary fixes
    Permanent fixes
    No fix (cost is greater than benefit)
  Record the decision in the Problem Database
Known Errors from other environments
  Known errors from the development environment
    We may choose to release with some minor known issues
  Known errors from suppliers
    Usually reported in the release notes
  Record, monitor and track those known errors
  Relate problems to those known errors
PIR (Post Implementation Review)
  Normal problems
    Confirm all the related incidents are closed
    Verify that the problem record is complete (symptoms, root cause and solutions)
    Change the problem status to Resolved
  Significant problems
    What went well?
    What went wrong?
    How to do better next time?
    How to prevent similar issues from happening again?
Track and Monitor
  Track the full lifecycle of each known error
  Reevaluate impact and urgency. Adjust the priorities accordingly.
  Monitor the progress of the diagnosis and implementation of the solution. Monitor the implementation of the RFC.
Proactive Problem Management
  Focus on the quality of the service and the infrastructure
  Analyze operational trends
  Detect potential incidents and prevent them from happening
  Find the weak points of the infrastructure or the overloaded components
Ideas to improve our Production Support process
  Idea 1: Create an independent Problem Management Team
  Idea 2: Create a Problem Database
  Idea 3: Define the Production Support Procedure
  Idea 4: Review and revise the procedures for using TeamTrack
  Idea 5: Enforce Post Implementation Review
  Idea 6: Proactively manage problems
  Idea 7 (optional): Acquire Service Desk software to facilitate the process
Create an independent Problem Management Team
  Can be a full-time team or a part-time team
  Appoint a Problem Management Manager. Must be a different person from the Production Support Manager. Their goals, schedules and requirements are different.
  Responsible for managing all the production problems (not incidents) for multiple applications
    Identify problems
    Record problems
    Find and evaluate solutions
    Track the progress until closure
  Work closely with the existing Production Support team
Create a Problem Database
  An easy-to-search knowledge database
  Include problems and known errors
  Track symptoms, root causes, temporary fixes, workarounds, and permanent solutions
  Include all the known errors in DEV and unresolved or deferred defects in QA/RATE environments
  Maintained by the Problem Management Team
  Will be used by the Production Support team for matching and fast resolution of incidents
Define the Production Support Procedure (Work Instructions)
  Create a formal and detailed document. Train the Production Support Team to follow the new procedure
  Start with the ITIL Incident Management process. Adjust it to our own situation and tools
  Clearly define how to calculate priorities
  Clearly define the time-bound escalation procedure
  Clearly define the monitoring and tracking steps
Review and define the procedure for using TeamTrack
  TeamTrack is our existing incident tracking system
  Review the functions of TeamTrack
  Redefine the incident escalation process according to ITIL suggestions
  Define the interface between PC Support and the IT Production Support Team
    Communication channel
    Roles and responsibilities
    Escalation
    Track and control
    Knowledge sharing
Enforce PIR
  Contact each user to confirm all the incidents are closed
  Make sure the Problem record is complete and useful
  Identify issues in the Incident and Problem Management process. Add those to the Problem Database.
Proactively Manage Problems
  Responsibility of the Problem Management Team
  Perform the following activities:
    Analyze incidents to find trends
    Analyze infrastructure to identify possible bottlenecks
    Run fail-over and stress tests
    Apply a problem solution across multiple related applications
    Establish and maintain the Production Monitor System to proactively detect system anomalies
  Evaluate how many problems are proactively identified and resolved
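One way to evaluate how many problems are proactively identified and resolved is to track a simple ratio, sketched below. It assumes that a problem counts as "proactive" when no incident was linked to it, and that each problem record carries `incident_count` and `resolved` fields; both are illustrative assumptions, not definitions from the slides.

```python
def proactive_ratio(problems):
    """Fraction of resolved problems that never caused a recorded incident.

    Returns 0.0 when there are no resolved problems.
    """
    resolved = [p for p in problems if p["resolved"]]
    if not resolved:
        return 0.0
    proactive = sum(1 for p in resolved if p["incident_count"] == 0)
    return proactive / len(resolved)


# Made-up problem records for illustration.
problems = [
    {"id": "P-1", "incident_count": 0, "resolved": True},
    {"id": "P-2", "incident_count": 4, "resolved": True},
    {"id": "P-3", "incident_count": 0, "resolved": False},
    {"id": "P-4", "incident_count": 0, "resolved": True},
]

ratio = proactive_ratio(problems)
```

Here two of the three resolved problems were caught before any incident, so the ratio is two thirds. Tracking this number over time shows whether the team is moving from reactive to proactive work.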
Service Desk Software
  Evaluate the existing TeamTrack software and see if it covers our needs
  Other popular options
    HP OpenView Service Desk
    Remedy Strategic Service Suite
    CA Unicenter Service Desk