ITIL® v3 Event Management: A Look at the Theory (from the Real World)
Brenda L. Peery, 14th September 2009, BCS Specialist Group Session
All copyrights acknowledged. ITIL® is a Registered Trade Mark of the Office of Government Commerce, and is Registered in the U.S. Patent and Trademark Office.
Not here for the tents and soundstage?
What is worth taking from that as we go forward to look at our idea of Event Management?
• It looks like it could be a bit muddy
– Very broad definition
– Obscure language
• But there is an idea of purpose …
– “to market themselves, build business relationships, raise money or celebrate” [from ‘Event Management’, Wikipedia]
Speaker’s Background
• 15+ years’ experience with IT service-related projects and roles, on both vendor and user sides, including Event & Systems Management work
• ITIL v2 Manager, v3 Expert, MSP and Prince2 Practitioner, ITIL instructor, APMG committee member developing new ITIL credentials
• As an independent consultant for the last 5 years, “IT Service Management Architect” is my favourite title thus far …
Main Topics / Goals
• Event Management – you may already know it and have it
– Monitoring and Event Management (key relationship)
• Event Management – the Basics according to ITIL®
• Where EM fits & what to consider in doing it
– First ask why – strategy
– Planning and managing
• Evaluation of the need
• What are you trying to solve / what need are you trying to serve?
• Define a model and develop a strategy
Initial Context – Familiarity?
• Event Management (EM) as a core process is new with v3 ITIL with some roots in v2
• What elements are familiar?
© Crown copyright. Reproduced from the OGC's ITIL® version 2 volume: ICT Infrastructure Management and version 3 Core volume Service Operation. All rights acknowledged.
Initial Context – Monitoring?
• Almost everyone has some familiarity with “Monitoring”
• Consider monitoring and management over the last decade:
– Systems Management software tools: IBM Tivoli (particularly TEC), CA NSM, BMC Patrol
– the reporting capability of underlying Operating Systems: log files and system utilities, Task Manager in Windows, the “top” command in Unix
– And never underestimate the diagnostic scripts that your SysAdmins have written or inherited
Monitoring
Other kinds of monitoring?
• Other IT?
• Other sector?
• Inventory?
• Business monitoring?
• Projects to bring in & manage that monitoring
• Why do we do it?
Initial Context – History
So even though Event Management is ‘new’, the back history that comes along with your infrastructure creates some challenges in building a process model:
• There may already be strategies in place and benefits being realised from monitoring programmes
• There are likely to be ‘competing understandings’ of:
– what events are
– what you are or are not doing about them
– at what levels you are engaging to monitor and utilise them
• Stakeholders may range from in-depth technical all the way up to non technical consumers of the information EM can produce
Your back history, embedded in your kit, will shape or constrain your EM possibilities
Best Practice Benefits
Develop a shared understanding and common language based on best practice recommendations, at least as your starting point …
EM Basics 1 – EM Process
“Event Management is the process that monitors all events that occur through the IT infrastructure to allow for normal operation and also to detect and escalate exception conditions” (SO p.35).
So it is about:
– Detecting events
– Making sense of them
– Determining appropriate control actions in response to them
But also:
– Acting as a basis for automating routine Operations Management, and
– Because it provides data for comparison, supporting:
• Service Assurance and Reporting
• Service Improvement
Event Management - Value
“Generally indirect” (SO p.39)
• EM provides mechanisms for early detection of incidents (possibly action before any impact felt)
• EM provides a basis for automated operations
• EM provides a basis for monitoring automated activity by exception
– Reducing the need for “expensive and resource intensive real-time monitoring while reducing downtime”
• Improves performance of other major Processes (early responses, more business benefit from more effective and efficient ITSM)
EM Basics 2 – Event Definition
What is an Event?
Any detectable or discernible occurrence that has significance for the management of the IT infrastructure or the delivery of IT service and the evaluation of the impact a deviation may cause to a service.
Events are typically notifications created by an IT service, Configuration Item (CI) or monitoring tool.
(SO, p.35-36)
EM Basics 3 – Event Definition (Breadth)
Checking the official scope doesn’t narrow it down much:
“Event Management can be applied to any aspect of Service Management that needs to be controlled and which can be automated” (SO p.36).
EM Basics 4 – Event Type
But there is more detail: the guidance suggests that you subdivide Events and “that at least these three broad categories be represented” in your Event Types:
1. Informational
• There is no action required
• Signifies regular operation (not an exception)
2. Warning
• Approaching a threshold
• Signifies unusual, but not exceptional, operation
3. Exception
• Abnormal operation
• Breach of parameters
Note also: Alert (to trigger human attention or intervention) [SO, p.40]
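The three categories can be sketched in a few lines of Python. This is only an illustration: the `classify` function and its `warn_at` / `breach_at` thresholds are hypothetical names, not part of the ITIL guidance, and a real tool would carry far more context per event.

```python
from enum import Enum


class EventType(Enum):
    INFORMATIONAL = "informational"  # regular operation, no action required
    WARNING = "warning"              # approaching a threshold
    EXCEPTION = "exception"          # abnormal operation, parameters breached


def classify(value: float, warn_at: float, breach_at: float) -> EventType:
    """Map a measured value to one of the three event types.

    warn_at and breach_at are hypothetical thresholds chosen for illustration.
    """
    if value >= breach_at:
        return EventType.EXCEPTION
    if value >= warn_at:
        return EventType.WARNING
    return EventType.INFORMATIONAL
```

For example, with a disk-usage warning threshold of 80% and a breach threshold of 95%, `classify(85, 80, 95)` yields `EventType.WARNING`.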
EM Basics 5 Process Flowchart
[Figure: Event Management process flowchart. Boxes, in approximate flow order: Event; Event Notification Generated; Event Detected; Event Filtered; Significance? (Informational / Warning / Exception); Event Correlation; Trigger; Event Logged / Auto Response / Alert; Human Intervention; Type? (Incident Management / Problem Management / Change Management); Review Actions; Effective? (Yes / No); Close Event; End]
© Crown copyright 2008. Reproduced from the OGC's ITIL® core volume: Service Operation. All rights acknowledged.
EM Basics 6 – Process Activities Summary
• Event Occurs – Notification / Detection
• Filtering (Categorisation)
• Correlation (Logic/rules). Note: Load
• Trigger / Response Selection. Note: Human Perception
• Review / Close
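The activities above can be strung together as a small pipeline. The sketch below is a minimal illustration under stated assumptions: the `Event` fields, the `repeat_limit` correlation rule (a warning repeated three times from one source is escalated) and the response names in `RESPONSES` are all hypothetical, chosen only to show filtering, correlation and response selection as separate steps.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str   # CI or tool that raised the notification
    kind: str     # "informational", "warning" or "exception"
    message: str


def filter_events(events):
    """Filtering: informational events are left to local logs in this sketch."""
    return [e for e in events if e.kind != "informational"]


def correlate(events, repeat_limit=3):
    """Correlation: collapse duplicates and escalate warnings that keep repeating.

    Collapsing duplicates also cuts the load passed on to later steps.
    """
    counts = Counter((e.source, e.kind, e.message) for e in events)
    correlated = []
    for (source, kind, message), count in counts.items():
        if kind == "warning" and count >= repeat_limit:
            kind = "exception"
        correlated.append(Event(source, kind, message))
    return correlated


# Hypothetical response table; real choices depend on your tooling.
RESPONSES = {"warning": "alert operator", "exception": "raise incident"}


def select_response(event):
    """Trigger / response selection: fall back to logging only."""
    return RESPONSES.get(event.kind, "log only")


def process(events):
    return [(e, select_response(e)) for e in correlate(filter_events(events))]
```

Review and close are left out here because they depend on a human (or a later check) confirming that the chosen response was effective.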
EM Basics 7 – Events and Infrastructure
Consider the extent to which your process design is and must be connected to your installed architecture
– Notification/Detection: How are you detecting and how are notifications sent or collected (and what impact does this have)?
– Filtering/Categorising: sort events into Informational (I), Warning (W) and Exception (E) streams, or ignore the event (or log/record it locally)
– Triggering an Alert, Auto Response, or related Process (does your architecture allow this?)
EM – Lifecycle & Summary
[Figure: ITIL v3 lifecycle: Service Strategy, Service Design, Service Transition, Service Operation, CSI]
In the Lifecycle concept that is at the heart of v3 ITIL, the Event Management process is seated in Service Operation with the full set of SO processes including:
– Event Management
– Incident Management
– Request Fulfilment
– Problem Management
– Access Management
– Operational aspects of other Processes
The EM & Monitoring Relationship
If we revisit the basic definition:
“Event Management is the process that monitors all events that occur through the IT infrastructure to allow for normal operation and also to detect and escalate exception conditions” (SO p.35).
While the Service Operation book provides a high level model of a ‘sample’ EM process, have we really looked at its key activity sufficiently ...
Designing EM – Alternate Lifecycle
[Figure: ITIL v3 lifecycle: Service Strategy, Service Design, Service Transition, Service Operation, CSI]
“In an ideal world, the Service Design process should define which events need to be generated and then specify how this can be done for each type of CI. During Service Transition, the event generation options would be set and tested”. (SO p.39)
Monitoring and Infrastructure
The base monitoring architecture:
• Agent based
• Agent less
• A sample of an evolved monitoring architecture
Agent based
[Figure: agent-based monitoring. Labels: Server (Windows/Unix) with app log, system error log, disks, Process 1 (Up? Mem? CPU?), Process 2, Process 3, script, CMD; Agent with stored config; config (once) sent to the agent; Alerts and Metrics returned; Hub / gateway / Monitoring Server holding Config and History (Alerts, Metrics); GUI Console]
Advantages:
• Technically more efficient
• Possible offline operation
• Often richer in functionality
Disadvantages:
• More complicated to install
• Agent disk footprint
Agent-less
Advantages:
• No agent to install, so easy to install
• No agent footprint
Disadvantages:
• More load on monitored machine
• Less resilient to network problems
[Figure: agent-less monitoring. Labels: Monitoring Server holding Config, History (Alerts, Metrics) and Schedules, running a Cross-Machine Scheduling Loop; Web Server and Web Console; new connection every cycle to the Server (Windows/Unix); per cycle: rescan app log file, rescan system error log file, check disks, relist processes & filter (Process 1, Process 2, Process 3), remote execute script / CMD; Alerts and Metrics returned]
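The trade-off between the two base architectures can be illustrated with a toy comparison. This is a deliberately simplified sketch: the function names are hypothetical, the agent is assumed to receive its configuration once and to contact the server only when a threshold is exceeded (real agents usually also ship metrics), and the agent-less server is assumed to open a new connection on every cycle.

```python
def agentless_poll(readings, threshold):
    """Monitoring server connects every cycle and evaluates the reading itself."""
    connections = 0
    alerts = []
    for cycle, value in enumerate(readings):
        connections += 1  # new connection every cycle
        if value > threshold:
            alerts.append((cycle, value))
    return connections, alerts


def agent_based(readings, threshold):
    """Local agent evaluates each reading and contacts the server only to alert."""
    connections = 1  # assumption: configuration is sent to the agent once
    alerts = []
    for cycle, value in enumerate(readings):
        if value > threshold:
            connections += 1  # one message per alert
            alerts.append((cycle, value))
    return connections, alerts
```

Fed the same readings, both functions raise the same alerts; what differs is how often the network and the monitored machine are touched, and whether checks can continue while the network is down.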
Design Considerations – Starting Systems
[Figure: starting systems. Unix Database Server (Oracle1, Oracle2, CRON); Unix Database Server (Sybase1, Sybase2, CRON); Windows Database Server (MSS1, MSS2); Unix Application Server and Windows Application Server running App 1 Proc 1, App 1 Proc 2, App 1 Proc 3, App 2 Proc 1, App 1 Proc 2. Each of the five servers exposes CPU, Disk, Mem and Logs]
Design Considerations – System Capacity
[Figure: the starting systems (same five servers, databases, application processes, and CPU / Disk / Mem / Logs per server) with capacity collection added: a Cap component on each server, an Open Source Capacity Tool, and an In House GUI]
Design Considerations – DB Mon. Capacity
[Figure: the starting systems with database monitoring added: DbMon on each of the three database servers (alongside CRON on the Unix ones), a Database Cap Plan, Web Reports, and a Database Monitoring component]
Design Considerations – App. Log Check
[Figure: the starting systems with an Agent on each of the five servers (covering CPU, Disk, Mem and Logs) and a Monitoring Server with thresholds & app-specific monitoring configuration]
ESM Arch (Generic)
[Figure: generic ESM architecture. Labels: Network Monitoring; Database Monitoring (CRON and DbMon on the Unix Database Servers, DbMon on the Windows Database Server); Agents on the Unix and Windows Application Servers; Monitoring Server with thresholds & app-specific monitoring configuration; Cap; Additional Departmental Monitoring (Application Specific); events flowing to a Central Event Server with Agent and Rules; Live Outage Report; Ticket with event details raised in the Incident Management System]
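The central piece of this architecture, a server that receives events from several monitoring sources and applies rules to decide which ones become tickets, can be sketched as follows. Everything here is hypothetical and for illustration only: the `CentralEventServer` class, the dictionary shape of an event, and the in-memory `tickets` list standing in for the Incident Management System.

```python
from dataclasses import dataclass, field


@dataclass
class CentralEventServer:
    rules: list = field(default_factory=list)    # (predicate, ticket summary) pairs
    tickets: list = field(default_factory=list)  # stands in for the incident system

    def add_rule(self, predicate, summary):
        self.rules.append((predicate, summary))

    def receive(self, event: dict) -> bool:
        """Apply rules in order; the first match raises a ticket with event details."""
        for predicate, summary in self.rules:
            if predicate(event):
                self.tickets.append({"summary": summary, "event": event})
                return True
        return False

    def live_outage_report(self):
        return [t["summary"] + ": " + t["event"]["host"] for t in self.tickets]
```

A usage sketch: register a rule such as `lambda e: e["source"] == "DbMon" and e["severity"] == "exception"` with the summary "Database outage", then pass each incoming event to `receive`. Keeping the rules in one place is what lets network, database and application monitoring share a single route into Incident Management.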
The Two Perspectives
Operations led and Design led
– Operations led delivers the everyday working process
– Operations led vision is really pre-Incident Incident management
– Design led establishes a conduit between IT Service Management and the underlying technology
– Design led has the potential to be a very effective front end and interface for traditionally less visible processes:
• Performance & Management Information (dashboards)
• Capacity
• Availability