Incident Management Revue
Strategic Process Planning
and Integration Management (SPPIM)
Sue Silkey, Thelma Simons
and Gail Schaplowsky
Best Practices
• Best practices serve as a guide to designing IT management processes that increase overall efficiency, reduce costs, and align IT with business needs.
• ITIL asks…
How ITIL best practices can help
• Faster incident recovery
• Fewer unplanned outages
• Better communication with users
• Information that enables better-informed management decisions
Incident Management
Goal
• Restore normal service operation as quickly as possible and minimize adverse impact on business operations
• Basically, this means using all available resources to get the user back to a productive state as quickly as possible
Incident Management
Benefits
• Minimize the disruption and downtime for our users
• Maintain a record during the entire Incident life-cycle. (This allows any member of the service team to obtain or provide an up-to-date progress report.)
• Build a knowledgebase of known issues to allow quicker resolution of frequent Incidents
Incident Management
How we implemented
• Began using process July 2006
• Continued regular meetings to review and tweak process
• Process formally adopted in December 2006
Current status
• Starting to develop metrics to create management reports (how many incidents, major incidents, etc.)
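The counts mentioned above (how many incidents, how many major incidents) could be tallied along these lines. This is only a minimal sketch: the record layout, the "type" field, and the sample cases are hypothetical and do not come from the case-tracking system used at the CSC.

```python
from collections import Counter

def case_counts(cases):
    """Count cases by type for a management report.

    Each case is assumed (hypothetically) to be a dict with a "type" key
    holding one of the case types named in this presentation.
    """
    return Counter(case["type"] for case in cases)

if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        {"id": 1, "type": "Incident"},
        {"id": 2, "type": "Major Incident"},
        {"id": 3, "type": "Incident"},
        {"id": 4, "type": "Service Request"},
    ]
    for case_type, count in sorted(case_counts(sample).items()):
        print(f"{case_type}: {count}")
```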
Definitions
• Incident - any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service
• Service Request - request for increased functionality for new services, not a failure in the IT infrastructure.
• Major Incident – an Incident for which the degree of impact on the User community is extreme, and which requires a response that is above and beyond that given to normal incidents.
• Problem - A condition identified by multiple incidents exhibiting common symptoms, or from one single significant incident, indicative of a single error, for which the cause is unknown
Incident Lifecycle
A day in the life… of an Incident
Our players
• Nervous Nellie – Gail Schaplowsky
• Incident/Major Incident – Dave Barnhill
• Support Staff – Mike Wright
• Major Incident Manager – Sue Silkey
• CSC Staff – Bill Farris
• Narrator – Thelma Simons
We begin on a bright and sunny day…
Case Types
• Incident - any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service
• Service Request - request for increased functionality for new services, not a failure in the IT infrastructure.
• Major Incident – an Incident for which the degree of impact on the User community is extreme, and which requires a response that is above and beyond that given to normal incidents.
• Problem - A condition identified by multiple incidents exhibiting common symptoms, or from one single significant incident, indicative of a single error, for which the cause is unknown
Incident Management
Goal
• Restore normal service operation as quickly as possible and minimize adverse impact on business operations
I+U=P
Impact + Urgency = Priority
I+U=P
Impact is defined as the number of people affected by a service outage.
• Low Impact: One customer affected, where no executive or executive staff are involved.
• Medium Impact: Several customers are affected, or an executive or executive staff are involved.
• High Impact: Whole organization, complete department or building affected, or revenue/financial systems affected.
I+U=P
Urgency is defined as the effect of the event on a customer’s ability to work. (This is not to be confused with how urgent the requestor believes the incident to be.)
• Low Urgency: Ability not impaired, the customer is requesting extra or additional functions or services (a service request).
• Medium Urgency: Abilities are partially impaired, and customers cannot use certain functions or services.
• High Urgency: Abilities are completely impaired and customers cannot work.
I+U=P
Priority is based on Impact and Urgency. The priority determines how quickly the issue needs to be addressed.
• Low Priority: Work to be completed in 4 business days.
• Medium Priority: Work to be completed in 2 business days.
• High Priority: Work to be completed in 4 hours.
• Urgent Priority: Work to be completed in 2 hours.
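One way to picture Impact + Urgency = Priority is as a small lookup. The sketch below is an assumption, not the team's actual matrix: the slides name the Impact, Urgency and Priority levels but do not say which combination yields which priority, so the numeric scores and cut-offs here are hypothetical. Only the completion targets are taken from the list above.

```python
# Hypothetical scores: the slides do not define how levels are combined.
LEVEL_SCORE = {"low": 1, "medium": 2, "high": 3}

# Completion targets as listed for each priority above.
TARGET = {
    "Low": "4 business days",
    "Medium": "2 business days",
    "High": "4 hours",
    "Urgent": "2 hours",
}

def priority(impact: str, urgency: str) -> str:
    """Add the impact and urgency scores and map the sum to a priority.

    The cut-offs (<=3 Low, 4 Medium, 5 High, 6 Urgent) are illustrative
    assumptions only.
    """
    total = LEVEL_SCORE[impact.lower()] + LEVEL_SCORE[urgency.lower()]
    if total <= 3:
        return "Low"
    if total == 4:
        return "Medium"
    if total == 5:
        return "High"
    return "Urgent"

if __name__ == "__main__":
    for impact, urgency in [("low", "medium"), ("high", "high")]:
        p = priority(impact, urgency)
        print(f"impact={impact}, urgency={urgency} -> {p} ({TARGET[p]})")
```

Under these assumed cut-offs, a single customer who cannot work at all (low impact, high urgency) would land at Medium priority; a different matrix could reasonably rank that case higher.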
Major Incident
I am the highest category of impact for an incident
I result in significant disruption to our business
In short, in matters technical on which we are dependent
I am the very model of an IT Major Incident!
(Sung to the tune of the Major-General’s Song from The Pirates of Penzance)
Case Types
• Incident: an event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in the quality of, that service (i.e., some piece of technology that I previously used is not working now).
• Major Incident: an Incident for which the degree of impact on the User community is extreme, or where the disruption is excessive, and which requires a response that is above and beyond that given to normal incidents.
Major Incident Responsibilities
Support Staff Major Incident Checklist
Assign the case to yourself (if not already done).
Updates:
• Hourly updates should be made to the work log or to the Major Incident Manager (MIM) at the CSC. If you do not make these hourly updates, the MIM or CSC will contact you for an update.
• Resolution updates should be called in to the MIM or CSC for verification.
Once verified, move the case to Resolved status and complete the information in the Solutions tab.
Major Incident Responsibilities
Major Incident Manager Checklist
1. Replicate or substantiate the failure (via monitoring equipment alerts)
2. Log the case
3. Consult the Call List (contact support staff, Service Owner, SCC)
4. Monitor the case
   a. Check activity log for updates hourly
   b. If activity log hasn’t been updated for an hour, contact support staff.
5. Upon “resolution” or moving the case to “Pending – Major Incident Cleared”
   a. Test that failure is resolved.
   b. Contact the SCC.
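The hourly check in step 4 of the Major Incident Manager checklist amounts to comparing the time of the last activity-log entry with the current time. A minimal sketch follows, assuming the last update is available as a timestamp; the function and constant names are made up for illustration and do not refer to any feature of the case system.

```python
from datetime import datetime, timedelta

# Hourly updates are expected during a Major Incident (per the checklist).
UPDATE_INTERVAL = timedelta(hours=1)

def needs_follow_up(last_update: datetime, now: datetime) -> bool:
    """Return True when the activity log has gone an hour or more
    without an update, i.e. support staff should be contacted."""
    return now - last_update >= UPDATE_INTERVAL

if __name__ == "__main__":
    # Made-up timestamps, for illustration only.
    last = datetime(2006, 12, 1, 9, 0)
    for check in (datetime(2006, 12, 1, 9, 30), datetime(2006, 12, 1, 10, 5)):
        action = "contact support staff" if needs_follow_up(last, check) else "keep monitoring"
        print(f"{check:%H:%M} -> {action}")
```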
Call List
Tune in next time…
• What will happen to Major Incident?
• Come back next month to see the continuing saga of Mr. Incident as he wafts his way through Change Management, Problem Management and Configuration Management.
Hope you had fun and…
Learned
• The difference between Incident and Major Incident
• How IM can minimize the disruption and downtime for our users
• The importance of maintaining a record during the entire Incident life-cycle
• That building a knowledgebase of known issues will allow quicker resolution of frequent Incidents
IM Wrap Up
• Where we are
• Where we want to be
• Metrics to tell us when we arrive
• Annual Review
• New committee based on reorganization
Upcoming Sessions
Future sessions are scheduled on:
• Change Management
• Problem Management
• Configuration Management
• Release Management
Questions?
More information at SPPIM (PSMO) website
www.technology.ku.edu/psmo
Also in IS/Process Management public folders