incident management revue strategic process planning and integration management (sppim) sue silkey,...

Incident Management Revue

Strategic Process Planning and Integration Management (SPPIM)

Sue Silkey, Thelma Simons and Gail Schaplowsky

Best Practices

• Best practices serve as a guide to designing IT management processes that increase the overall efficiency, reduce costs and align IT with business needs.

• ITIL asks…

How ITIL best practices can help

• Faster incident recovery
• Fewer unplanned outages
• Better communication with users
• Information that enables better-informed management decisions

Incident Management

Goal
• Restore normal service operation as quickly as possible and minimize adverse impact on business operations

• In practice, this means using all available resources to return the user to a productive state as quickly as possible


Benefits
• Minimize disruption and downtime for our users
• Maintain a record during the entire Incident life-cycle. (This allows any member of the service team to obtain or provide an up-to-date progress report.)
• Build a knowledge base of known issues to allow quicker resolution of frequent Incidents


How we implemented
• Began using the process in July 2006
• Continued regular meetings to review and tweak the process
• Process formally adopted in December 2006

Current status
• Starting to develop metrics to create management reports (how many incidents, major incidents, etc.)
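As a rough illustration of the counting behind such reports, the sketch below tallies cases by type. The `Case` record, its fields and the sample data are hypothetical; real figures would come from the case-tracking system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Case:
    # Hypothetical minimal case record; a real tracking system stores much more.
    case_id: str
    case_type: str  # e.g. "Incident", "Major Incident", "Service Request"


def count_by_type(cases):
    """Return a Counter mapping each case type to how many cases have it."""
    return Counter(case.case_type for case in cases)


cases = [
    Case("C-1", "Incident"),
    Case("C-2", "Major Incident"),
    Case("C-3", "Incident"),
    Case("C-4", "Service Request"),
]
report = count_by_type(cases)
for case_type, count in report.most_common():
    print(f"{case_type}: {count}")
```

A `Counter` returns 0 for types that never occur, so a report line for, say, "Problem" needs no special case.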

Definitions

• Incident - any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service

• Service Request - request for increased functionality for new services, not a failure in the IT infrastructure.

• Major Incident – an Incident for which the degree of impact on the User community is extreme, and which requires a response that is above and beyond that given to normal incidents.

• Problem - a condition identified from multiple incidents exhibiting common symptoms, or from a single significant incident, indicative of a single error for which the cause is unknown

Incident Lifecycle

A day in the life… of an Incident

Our players
• Nervous Nellie – Gail Schaplowsky
• Incident/Major Incident – Dave Barnhill
• Support Staff – Mike Wright
• Major Incident Manager – Sue Silkey
• CSC Staff – Bill Farris
• Narrator – Thelma Simons

We begin on a bright and sunny day…


I+U=P

Impact + Urgency = Priority


Impact is defined as the number of people affected by a service outage.

• Low Impact: One customer affected, where no executive or executive staff are involved.

• Medium Impact: Several customers are affected, or an executive or executive staff are involved.

• High Impact: Whole organization, complete department or building affected, or revenue/financial systems affected.
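The impact definitions above can be read as a small decision rule. This is a minimal sketch under stated assumptions: the parameter names are hypothetical, and "several customers" is taken to mean more than one.

```python
from enum import IntEnum


class Impact(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def classify_impact(customers_affected, executive_involved=False,
                    broad_scope=False, financial_system=False):
    """Apply the Low/Medium/High impact definitions.

    Hypothetical parameters:
    broad_scope      -- whole organization, complete department or building affected
    financial_system -- revenue/financial systems affected
    """
    # High: broad scope or revenue/financial systems, regardless of head count.
    if broad_scope or financial_system:
        return Impact.HIGH
    # Medium: "several" customers (assumed: more than one) or an executive involved.
    if customers_affected > 1 or executive_involved:
        return Impact.MEDIUM
    # Low: a single customer, no executive involvement.
    return Impact.LOW


print(classify_impact(1).name)                           # LOW
print(classify_impact(1, executive_involved=True).name)  # MEDIUM
print(classify_impact(40, broad_scope=True).name)        # HIGH
```

The High checks come first so that, for example, a single customer on a financial system is not classified as Low.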


Urgency is defined as the effect of the event on a customer’s ability to work. (This is not to be confused with how urgent the requestor believes the incident to be.)

• Low Urgency: Ability not impaired, the customer is requesting extra or additional functions or services (a service request).

• Medium Urgency: Abilities are partially impaired, and customers cannot use certain functions or services.

• High Urgency: Abilities are completely impaired and customers cannot work.
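Because urgency reflects the effect on the customer's ability to work rather than how urgent the requester feels the issue is, a lookup keyed only on the degree of impairment captures the rule. This is a sketch; the `WorkAbility` names and the `requester_says_urgent` parameter are hypothetical.

```python
from enum import Enum, IntEnum


class WorkAbility(Enum):
    # Degree to which the event impairs the customer's ability to work.
    NOT_IMPAIRED = "not impaired"
    PARTIALLY_IMPAIRED = "partially impaired"
    COMPLETELY_IMPAIRED = "completely impaired"


class Urgency(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


URGENCY_BY_ABILITY = {
    WorkAbility.NOT_IMPAIRED: Urgency.LOW,           # e.g. a service request
    WorkAbility.PARTIALLY_IMPAIRED: Urgency.MEDIUM,  # some functions unusable
    WorkAbility.COMPLETELY_IMPAIRED: Urgency.HIGH,   # customer cannot work
}


def classify_urgency(ability, requester_says_urgent=False):
    """Return urgency from the effect on work alone.

    requester_says_urgent is accepted but deliberately ignored: urgency is
    not how urgent the requester believes the incident to be.
    """
    return URGENCY_BY_ABILITY[ability]


print(classify_urgency(WorkAbility.NOT_IMPAIRED, requester_says_urgent=True).name)  # LOW
```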


Priority is based on Impact and Urgency. The priority determines how quickly the issue needs to be addressed.

• Low Priority: Work to be completed in 4 business days.

• Medium Priority: Work to be completed in 2 business days.

• High Priority: Work to be completed in 4 hours.

• Urgent Priority: Work to be completed in 2 hours.
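The slides do not say how Impact and Urgency combine into the four priority levels, so the sketch below assumes a literal reading of I+U=P: each level scores 1 to 3 and the sum (2 to 6) is bucketed into a priority. That mapping is an illustration only. The target times come from the list above; the sketch further assumes that "business days" skip only Saturdays and Sundays (no holidays) and that hour targets are clock hours.

```python
from datetime import datetime, timedelta
from enum import IntEnum


class Level(IntEnum):
    # Shared scale for Impact and Urgency.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Target times from the priority list: (unit, amount).
TARGETS = {
    "Low": ("business_days", 4),
    "Medium": ("business_days", 2),
    "High": ("hours", 4),
    "Urgent": ("hours", 2),
}


def priority(impact, urgency):
    """HYPOTHETICAL mapping: literal I + U score (2..6) bucketed into four levels."""
    score = int(impact) + int(urgency)
    if score == 6:
        return "Urgent"
    if score == 5:
        return "High"
    if score == 4:
        return "Medium"
    return "Low"


def add_business_days(start, days):
    """Step forward `days` weekdays, skipping Saturdays and Sundays (holidays ignored)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current


def due_by(priority_name, opened):
    """Return the completion target for a case opened at `opened`."""
    unit, amount = TARGETS[priority_name]
    if unit == "hours":
        return opened + timedelta(hours=amount)  # assumed: clock hours
    return add_business_days(opened, amount)


opened = datetime(2006, 12, 1, 9, 0)  # a Friday
p = priority(Level.HIGH, Level.MEDIUM)
print(p, due_by(p, opened))
print("Low", due_by("Low", opened))
```

For a case opened at 9:00 on a Friday, the Low target lands on the following Thursday at 9:00, because the weekend does not count toward the four business days.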

Major Incident

I am the highest category of impact for an incident

I result in significant disruption to our business

In short, in matters technical on which we are dependent

I am the very model of an IT Major Incident!

(Sung to the tune of the Major-General’s Song from The Pirates of Penzance)

Case Types

• Incident: an event which is not part of the standard operation of a service and which causes or may cause an interruption to, or a reduction in the quality of, that service; i.e., some piece of technology that I previously used is not working now.

• Major Incident: an Incident for which the degree of impact on the User community is extreme, or where the disruption is excessive, and which requires a response that is above and beyond that given to normal incidents.

Major Incident Responsibilities

Support Staff Major Incident Checklist

Assign the case to yourself (if this has not already been done).

Updates:
• Hourly updates should be made to the work log or to the Major Incident Manager at the CSC. If you do not make these hourly updates, the MIM or CSC will contact you for an update.
• Resolution updates should be called in to the MIM or CSC for verification.

Once verified, move the case to Resolved status and complete the information in the Solutions tab.
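The support staff checklist amounts to a guard on the final status change: a case should move to Resolved only after the MIM or CSC has verified the fix and the solution has been recorded. A minimal sketch, where the `MajorIncidentCase` class, its fields and its method names are all hypothetical:

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class MajorIncidentCase:
    # Hypothetical case record; field names are illustrative only.
    case_id: str
    assignee: str = ""
    status: str = "Open"
    work_log: list = field(default_factory=list)
    resolution_verified: bool = False
    solution: str = ""

    def assign_to(self, staff: str) -> None:
        # Assign only if nobody has taken the case yet.
        if not self.assignee:
            self.assignee = staff

    def log_update(self, when: datetime, note: str) -> None:
        # Hourly progress notes go into the work log.
        self.work_log.append((when, note))

    def verify_resolution(self) -> None:
        # Stands in for calling the resolution in to the MIM or CSC.
        self.resolution_verified = True

    def resolve(self, solution: str) -> None:
        if not self.resolution_verified:
            raise ValueError("resolution must be verified by the MIM or CSC first")
        if not solution.strip():
            raise ValueError("complete the solution information before resolving")
        self.solution = solution  # the "Solutions tab"
        self.status = "Resolved"


case = MajorIncidentCase("MI-1")
case.assign_to("support analyst")
case.log_update(datetime(2006, 12, 1, 10, 0), "Investigating server outage")
case.verify_resolution()
case.resolve("Restarted failed service")
print(case.status)  # Resolved
```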

Major Incident Responsibilities

Major Incident Manager Checklist

1. Replicate or substantiate the failure (via monitoring equipment alerts).
2. Log the case.
3. Consult the Call List (contact support staff, Service Owner, SCC).
4. Monitor the case:
   a. Check the activity log for updates hourly.
   b. If the activity log hasn’t been updated for an hour, contact support staff.
5. Upon “resolution” or moving the case to “Pending – Major Incident Cleared”:
   a. Test that the failure is resolved.
   b. Contact the SCC.
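Step 4 of the checklist is essentially a staleness check on each case's activity log. A minimal sketch, assuming a hypothetical mapping from case IDs to the time of the last log entry; the case IDs and times are illustrative only:

```python
from datetime import datetime, timedelta

# The checklist uses a one-hour update cadence.
UPDATE_INTERVAL = timedelta(hours=1)


def needs_follow_up(last_update, now, interval=UPDATE_INTERVAL):
    """True if the activity log has gone at least `interval` without an update."""
    return now - last_update >= interval


def cases_to_chase(last_updates, now):
    """Return, sorted, the case IDs whose support staff the MIM should contact."""
    return sorted(
        case_id
        for case_id, last_update in last_updates.items()
        if needs_follow_up(last_update, now)
    )


now = datetime(2006, 12, 1, 12, 0)
last_updates = {
    "MI-1": datetime(2006, 12, 1, 10, 30),
    "MI-2": datetime(2006, 12, 1, 11, 45),
    "MI-3": datetime(2006, 12, 1, 11, 0),
}
print(cases_to_chase(last_updates, now))  # ['MI-1', 'MI-3']
```

Using `>=` treats a log that is exactly one hour old as overdue, matching "hasn’t been updated for an hour".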

Call List

Tune in next time…

• What will happen to Major Incident?
• Come back next month to see the continuing saga of Mr. Incident as he wafts his way through Change Management, Problem Management and Configuration Management.

Hope you had fun and…

learned:
• The difference between Incident and Major Incident
• How IM can minimize the disruption and downtime for our users
• The importance of maintaining a record during the entire Incident life-cycle
• That building a knowledge base of known issues will allow quicker resolution of frequent Incidents

IM Wrap Up

• Where we are
• Where we want to be
• Metrics to tell us when we arrive
• Annual Review
• New committee based on reorganization

Upcoming Sessions

Future sessions are scheduled on:
• Change Management
• Problem Management
• Configuration Management
• Release Management

Questions?

More information at SPPIM (PSMO) website

www.technology.ku.edu/psmo

Also in IS/Process Management public folders
