The Enterprise IT Event Management Playbook

How To Reduce Incidents and Outages

This interactive workbook will help you to define your business plan, identify your team members, and specify technical requirements.

Table of Contents

Introduction
• This Blueprint is a framework
• How do I use this template?

Revision Chart

Project Overview
• Background
• Current State and Sample Problem Statements
• Current Landscape
• Key Business Goals
• Notional Architecture
• Project Team

Integration Requirements
• Integration Capabilities / API Support
• Event Data Sources
• ITSM Integration

Event Processing Requirements
• Normalization/Common Event Format
• Filtering Standards
• Data Enrichment
• Data Enrichment Sources
• Deduplication
• Correlation and Grouping
• Service Impact Determination
• Event Closure
• Incident Creation and Resolution
• Extensibility
• Visualization and Usability Requirements
• Consoles

Meeting and Requirements Session Log

Introduction

This Blueprint is a framework

If you are heading up an event management initiative, there are many things you’ll need to consider. They generally fall into three categories:

• Creating the business case

• Building a cross-functional team

• Defining your technical criteria

Your company and situation are undoubtedly unique, but event management projects should all follow some common best-practice guidelines. In this blueprint, we’ve codified some of the most important principles and suggestions encountered through decades of work in this field.

How do I use this template?

Follow the italicized instructions to tailor this document to your unique requirements. You can replace the Evanios logo (cover) with your own, then use the finished product to dictate your requirements and/or kickstart your project management plan. If you prefer a Microsoft Word version of this document (for easier editing), please email “[email protected]” (use subject line “Request Playbook”).

Revision Chart

[Update the revision chart to show your work]

Version | Primary Author(s) | Description of Version | Date
0.5 | Billy Joe | Initial Draft Document | 10/23/2017
1.0 | Joe Bob | Version 1.0 | 10/8/2017

Project Overview

Background

[Describe your current situation in human terms, something like...]

[Our company] intends to improve overall service levels by reducing incident volume, ensuring that individual incidents are actionable, and implementing automation to reduce diagnostic, repair and resolution times. To do so, we need to change how events and alerts are managed within our environment.

Adoption of new technologies including microservices and distributed applications has increased complexity and created “noise” that we can no longer resolve manually. [Our company] currently utilizes “X number of” “stand alone” monitoring tools (e.g. SolarWinds, AppDynamics, SCOM, Nagios, etc.). A centralized cross-correlation platform would substantially reduce overall alert counts, while also providing better data and context for upstream analytics and actions (e.g. assignment of priority levels based on service impact, identification of leading indicators, and automated root cause analysis).

Current State and Sample Problem Statements

[Insert your current architecture details here, including technical limitations and preferred future state. For example...]

1. [Our company’s] stated goal to reduce incidents and mean time to resolution is limited by access to shared resources. Primary diagnostic tools and related information are not accessible to responders as needed.

2. Notifications are problematic: lacking intelligent workflows, we are using group email boxes that do not clearly identify ownership or priority.

3. Because of the “stand alone” architecture, comparable alerts cannot be deduplicated or correlated. Resulting deficiencies include redundant events/incidents, an increased manual diagnostic burden, and weak or non-actionable analytics.

4. Alerts are generated even during known change periods.

5. No event/incident hierarchies are established.

Current Landscape:

[Insert sample diagrams and descriptions (e.g. email alerts from standalone tools go to a group inbox. Other alerts go to…)]

Fig 1: Illustration of [Company Name] current monitoring environment, notification method and ITSM connectivity

Key Business Goals

[Below is a list of key business goals usually met through the implementation of event management. Remove any irrelevant goals or examples, and add specific details if you have them]

1. Integrate all monitoring tools into a single pane of glass. Centralize metric and event data into a scalable cloud architecture where operators or automation can quickly react to restore service.

2. Reduce noise (eliminate or de-prioritize low-value events). Decrease the volume of events through deduplication and correlation algorithms, as well as hierarchical organization that enables operations teams to focus on what matters.

3. Reduce incidents by “X” percent. [insert your own target] Consolidate multiple disparate events into a single more meaningful incident, and trigger automated remediation actions.

4. Enable the team to work on incidents within the ITSM platform.

5. Improve routing of notifications and incidents. Develop workflows or automation to direct events/incidents to the right people, at the right time.

6. Automate incident creation & resolution - Automate event to incident creation and resolution in the service desk. Create intelligent incidents, where all relevant data is automatically consolidated onto a single incident.

7. Remove “X” hours of manual labor - Eliminate manual effort wherever possible. [insert your own target, or area of focus]

8. Reduce time to find and fix outages - Increase availability and accelerate service restoration by managing all services through an interactive, predictive impact map that clearly illustrates which business services are most significantly affected by failures or performance issues.

9. Predict outages before they occur - Proactively take action based on leading indicators, to avoid incidents, reduce MTTR, and mitigate risk to the organization.

10. Prioritize effort - Identify and focus remediation on the issues with the largest business impact.

Notional Architecture

Events will be consolidated from the multiple tools existing throughout the enterprise into a single, vendor-agnostic event management solution and process. From there, information will be filtered and correlated; then, where appropriate (based on business logic and severity), incidents will be automatically created.

Project Team

To align application and infrastructure monitoring with ITSM processes (e.g. Incident, Change, CMDB, Knowledge, etc.), the evaluation and implementation team will be cross-functional, representing stakeholders from complementary business units.

Getting people onto the same page (aligning schedules, determining common interests, evaluating ownership) is often the hardest part of the event management process. It’s a cultural change for many organizations, representing a shift from siloed workflows to a collaborative approach. We recommend identifying stakeholders as early in the process as possible. Gaining initial buy-in will help when refining the business goals, determining the best vendor(s), and mapping your technical requirements.

[Update the project team table to reflect your stakeholders]

Team Member | Title | Project Role
Sarah Connor | CIO | Executive Sponsor
Brad Johnson | VP Infrastructure | Project Owner
Brian Wilson | Operations Manager | Project Manager
Mike Jones | Infrastructure Architect | Project Lead
Frank Moses | Solarwinds Admin | Integrated Tool Owner
John McClane | Business Analyst | Business Analyst
Thomas Anderson | Director, Process Automation | Stakeholder
Michael Flatley | Operations Specialist | Stakeholder

Integration Requirements

Integration Capabilities / API Support

Integrations are at the heart of any event management solution. It’s important to collect all the necessary events to provide a clear picture of service health. Robust API support is a critical requirement that enables integration between the various monitoring tools and the event management and analytics layers. (A minimal REST submission sketch follows the table below.)

[Adjust priorities for your requirements as needed]

API | Description | Priority
Command Line | Ability to create events direct from a command line client | HIGH
REST/Web Services | Submit events via a web service | HIGH
SNMP | Create events via SNMP traps | MED
Pull/Query | Use scripts that run periodically to connect to databases, or check file systems for information | HIGH
TCP | Receive and decode a stream of raw TCP information | MED
Log File | Read logs or Windows event logs, turn those log entries into events | LOW
Email | Receive emails, transform those into events | LOW
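As a rough illustration of the REST/Web Services integration path, the minimal Python sketch below posts a raw event to a hypothetical collector endpoint. The URL, token, and payload field names are placeholder assumptions, not any specific vendor’s API.

# Minimal sketch: submit a raw event over REST.
# The endpoint URL, token, and payload fields are hypothetical placeholders.
import json
import urllib.request

EVENT_COLLECTOR_URL = "https://events.example.com/api/v1/events"   # placeholder endpoint
API_TOKEN = "REPLACE_ME"                                            # placeholder credential

def send_event(node, resource, severity, description):
    payload = {
        "node": node,                # object the event refers to
        "resource": resource,        # component on that object (e.g. CPU)
        "severity": severity,        # e.g. 1 = critical ... 5 = info
        "description": description,
        "source": "custom-script",
    }
    request = urllib.request.Request(
        EVENT_COLLECTOR_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + API_TOKEN},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status

if __name__ == "__main__":
    send_event("appserver01", "CPU", 2, "CPU utilization above 95% for 10 minutes")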

Event Data Sources

Identify all integration points; everything that generates events for consumption by the event management solution. Where possible, we recommend leveraging pre-packaged, supported integrations into these solutions. Benefits to doing so include reduced initial labor (why script, when you can plug and play?) and limited exposure to ongoing support issues (e.g. ensuring that all integrations are running on the most current versions).

[Insert your list of monitoring tools or solutions to be integrated]

Solution | Description | Priority
Solarwinds | Currently our primary network management tool | HIGH
SCOM | Currently our primary server management tool | HIGH
NewRelic | Partially adopted, collecting APM metrics from specific applications | MED
Sensu | Used by various business units | MED
[Our company] Chat | Direct log errors from application | LOW
[Our company] Order Entry | Errors kept in MSSQL database, these errors should become events | LOW

ITSM Integration

[Adjust as per your ITSM solution, and which ITSM data you expect to leverage]

Currently [Our company] uses [ServiceNow] as its IT Service Management system of record. ITSM context data will be used to process events as described below (a minimal change-window suppression sketch follows the list).

CMDB. Configuration data and relationships will be used for enrichment of events with valuable metadata, service impact analysis, and correlation

Change Management. Change Management data will be used to suppress events during known change windows and outage schedules

Maintenance Schedules. [ServiceNow] maintenance schedule will be leveraged in the creation of event management logic

Incident Management. Incidents will be generated automatically when appropriate. Multiple events can trigger a single correlated incident
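To make the Change Management item concrete, here is a minimal sketch of suppressing events whose configuration item falls inside an active change window. The in-memory change_windows list is a simplified stand-in for ITSM change records, not a ServiceNow API call.

# Minimal sketch: suppress events that arrive during an active change window.
# The change-window list is a simplified stand-in for ITSM change records.
from datetime import datetime

change_windows = [
    # (configuration item, window start, window end)
    ("dbserver07", datetime(2017, 10, 20, 22, 0), datetime(2017, 10, 21, 2, 0)),
]

def in_change_window(ci, event_time):
    return any(ci == item and start <= event_time <= end
               for item, start, end in change_windows)

def process(event):
    if in_change_window(event["node"], event["time"]):
        event["suppressed"] = True   # keep for audit, but do not alert or create an incident
    return event

event = {"node": "dbserver07", "time": datetime(2017, 10, 20, 23, 30), "severity": 2}
print(process(event).get("suppressed"))   # True: the event arrived inside the change window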

Event Processing Requirements

Normalization/Common Event Format

A common event format is based on the concept that if every message follows a consistent format, the data will be easier to correlate and understand. The idea is that events from multiple sources get normalized into the same set of fields and nomenclature, so that a CPU alarm from SCOM looks the same as a CPU alarm from SolarWinds. Typically, a common event format is made up of a set of fields that define the object and the situation occurring on that object.

Using normalized data makes it easier for administrators to create unified rule sets, and for machine learning features to work regardless of data source. Additionally, “apples to apples” data becomes critically important for advanced analytics, including capabilities such as scoring and prediction.

It is recommended that raw data still be retained as part of the normalized event. This way, technology specialists have access to all of the original data during the investigation process.
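The sketch below shows the idea in miniature: two source-specific payloads are mapped onto one common field set while the raw payload is retained for specialists. The field names are illustrative assumptions, not a standard schema.

# Minimal sketch: normalize events from two tools into one common format,
# retaining the raw payload. Field names are illustrative.
COMMON_FIELDS = ("node", "resource", "metric_value", "severity", "source")

def normalize_solarwinds(raw):
    return {"node": raw["NodeName"], "resource": raw["Component"],
            "metric_value": raw["Value"], "severity": raw["Severity"],
            "source": "SolarWinds", "raw": raw}   # keep original data for investigation

def normalize_scom(raw):
    return {"node": raw["Computer"], "resource": raw["MonitorObject"],
            "metric_value": raw["SampleValue"], "severity": raw["AlertSeverity"],
            "source": "SCOM", "raw": raw}

# A CPU alarm from either tool now looks the same downstream:
sw = normalize_solarwinds({"NodeName": "web01", "Component": "CPU", "Value": 97, "Severity": 2})
sc = normalize_scom({"Computer": "web01", "MonitorObject": "CPU", "SampleValue": 97, "AlertSeverity": 2})
assert all(sw[f] == sc[f] for f in COMMON_FIELDS if f != "source")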

Filtering Standards

Filtering is one of the most important disciplines in event management. It allows the organization to remove “noise” or unneeded events from the event stream, in order to focus on events that have significant impact on the business.

Types of events that will be filtered

In this section, try to identify as many filters as possible, understanding that your initial list will almost certainly expand during implementation. For this reason (and many others), we suggest an event management system based on pre-packaged filter logic that can be easily tuned. (A minimal filter-rule sketch follows the list below.)

[Tailor this for your needs, identifying common events that should be filtered]

• Events that cannot be filtered by the monitoring tool (possibly due to limitations from the source), but are not significant to our operations (e.g. development servers, planned maintenance, etc.)

• Events from business unit A, since we don’t support their efforts

• Routers supporting regional offices

• etc.
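The minimal sketch below expresses filters like these as declarative rules with a “drop” or “log” action, in line with the behaviors described later in this section. The rule criteria and field names are illustrative assumptions, not a product schema.

# Minimal sketch: declarative filter rules with a "drop" or "log" action.
# Rule criteria and field names are illustrative, not a product schema.
FILTER_RULES = [
    {"match": {"environment": "development"}, "action": "drop"},
    {"match": {"business_unit": "A"}, "action": "drop"},
    {"match": {"device_role": "regional-office-router"}, "action": "log"},
]

filtered_log = []   # stands in for a database table of logged-but-filtered events

def apply_filters(event):
    for rule in FILTER_RULES:
        if all(event.get(field) == value for field, value in rule["match"].items()):
            if rule["action"] == "log":
                filtered_log.append(event)   # keep a record, but stop processing
            return None                      # "drop" or "log": event leaves the active stream
    return event                             # no rule matched: continue normal processing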

Optimal filtering strategy

Filtering event data close to the source is an event management best practice. There is no reason to waste resources processing events at the top level, if those events could have been eliminated closer to the source.

Best: On-premise, in Source Monitoring Tool. If you can stop your monitoring tools from generating unnecessary events through threshold tuning or analytics, that is ideal. Where possible, use mature monitoring tools with advanced thresholding and configuration capabilities.

Better: On-premise, in Event Integration Layer. Many monitoring tools lack the ability to properly filter noise on their own. This may be due to limitations in the tool, or misconfiguration of the tool. In these cases, the event integration layer will need to take over and perform the filtering. Typically, filters at this level are rudimentary, since the integration layer doesn’t contain enrichment or contextual data such as the CMDB or change management.

Good: Cloud, in Event Correlation Layer. If an event requires logic or contextual data that is only available at the Event Correlation Layer, there is no choice but to filter at this point. An example would be if the system needs to look up device IP addresses in the CMDB, and the criterion is to filter any non-production devices. The CMDB lookup would typically only be available in the event correlation layer.

What should happen when events are filtered?

When events are filtered, typically two behaviors are desired. The most common behavior is to simply “drop” the event, never to be heard from again. In this case there is no record that the event ever occurred, and it cannot be queried or tracked.

A second, optional behavior available in many solutions is to “log” the event. This will cease any processing on the event, but still write the raw event to a database table or log record, such that there is some indication this event occurred. Typically this behavior must occur at the event correlation/database layer.

[If you have a case where logging certain filtered events is important, or not so important, modify the sections above and describe the use case when you want to apply each behavior]

Data Enrichment

What is data enrichment?

Data enrichment is the process of bringing in fields not provided in the native event from related data sources, in order to make the event more meaningful to the recipient or the correlation algorithms. For example, for a server-based event, you might want to look up its physical location in a database. Knowing the server’s location then enables the operations team or the correlation technology to draw conclusions based on the physical location of the server.
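As a small illustration, the sketch below enriches an event with an owner, support group, location, and rack looked up from reference data. The lookup tables are simplified stand-ins for a CMDB and a location database.

# Minimal sketch: enrich an event with CMDB and location data.
# The dictionaries are simplified stand-ins for real data sources.
CMDB = {"web01": {"owner": "Web Platform Team", "support_group": "Unix Support"}}
LOCATIONS = {"web01": {"site": "Dallas DC2", "rack": "R14"}}

def enrich(event):
    node = event["node"]
    event.update(CMDB.get(node, {}))        # add owner and support group, if known
    event.update(LOCATIONS.get(node, {}))   # add physical location and rack, if known
    return event

print(enrich({"node": "web01", "resource": "CPU", "severity": 2}))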

Data Enrichment Sources

[Describe your sources here]

Source | Description
CMDB | Lookup CI owner, Support Group
Change Management | Determine if it's part of an active change
Location Database | Find location and rack
Other |

Deduplication

Purpose

The primary goal of deduplication is to consolidate or collapse identical or similar events into a single event. For example, if a monitoring tool generates an identical event every minute as long as the error or condition still exists, you would want to deduplicate these into a single event and simply update the count. You may also want to create a single event if two separate monitoring tools send events describing the same condition.

Configuration

Deduplication is based on field matching. For example, if the object being monitored and the condition being reported are the same, the events could be considered duplicates. There may also be circumstances in which you’ll want to change the way duplicates are treated. (A minimal deduplication sketch follows the table below.)

[Adjust requirements to fit your needs]

Requirement | Description
Ability to use different fields for deduplication, based on the source or value of the data | This allows the use of unique deduplication logic for events from various sources. For example, we may want to use different deduplication logic for Solarwinds events when compared to SCOM events.
Ability to control what happens to duplicate events | There may be a desire to drop or log the duplicate event.
Ability to control how the existing event is updated by the duplicate | Ability to select which fields are updated, or to make an annotation on the existing event (e.g. that it was updated by a duplicate).
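As a rough sketch of the requirements above, the example below builds a per-source deduplication key and updates a count on the existing event. The key fields chosen per source are illustrative assumptions.

# Minimal sketch: deduplicate events by a per-source key and keep a count.
DEDUP_FIELDS = {
    "SolarWinds": ("node", "resource"),
    "SCOM": ("node", "resource", "monitor_name"),
}

open_events = {}   # dedup key -> consolidated event

def deduplicate(event):
    fields = DEDUP_FIELDS.get(event["source"], ("node", "resource"))
    key = (event["source"],) + tuple(event.get(f) for f in fields)
    if key in open_events:
        existing = open_events[key]
        existing["count"] += 1                  # collapse into the existing event
        existing["last_seen"] = event["time"]   # update only the chosen fields
        return existing
    event["count"] = 1
    open_events[key] = event
    return event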

Correlation and Grouping

Correlation and grouping technologies allow related events to be grouped or linked, providing a summarized view to operations staff in order to speed response. Correlation can also be used to present conclusions or take actions automatically.

How events are correlated

Events are typically correlated based on similar field values (“matching content”), time proximity, and defined relationships.

Method | Description
Matching Content | Events have similar event fields, indicating they represent a similar object or condition.
Time Proximity | Events occur within a similar timeframe. For example, if three events occur within 30 seconds, this may indicate a stronger correlation than if they occurred 3 hours apart.
Defined Relationships | Relationships can be defined between CIs in the CMDB, or through manual configuration that identifies a relationship between the events themselves. When leveraging defined relationships, it's good to understand where this data exists today, such that existing data can be leveraged, eliminating manual configuration.
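The minimal sketch below combines the three methods: events are grouped when their CIs share an upstream parent (defined relationships, which implies matching content) and arrive within a short window (time proximity). The parent map and window size are illustrative assumptions.

# Minimal sketch: group events whose CIs share an upstream parent CI and
# whose timestamps fall within a short window. Data structures are illustrative.
from datetime import timedelta

CMDB_PARENT = {"switch-dal-01": "router-dal", "web01": "router-dal"}   # child CI -> upstream CI
WINDOW = timedelta(seconds=30)

def correlate(events):
    groups = {}
    for event in sorted(events, key=lambda e: e["time"]):
        parent = CMDB_PARENT.get(event["node"], event["node"])
        group = groups.setdefault(parent, [])
        if group and event["time"] - group[0]["time"] > WINDOW:
            continue                 # outside the time window: left ungrouped in this sketch
        group.append(event)
    return groups                    # e.g. {"router-dal": [switch event, web server event]}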

How correlations are presented and used

Correlations performed by the solution can be presented as visual linkages or groups. This allows the operations staff to see connections between the events, and choose to take action. For example, a solution could display a “parent” event, associated with many “child” events.

Automation can also be triggered via correlation. You might want the solution to take action immediately, without waiting for human intervention. For example, if you are able to correlate that multiple application processes are down on a particular cluster, you may want to trigger a failover automatically.

Planned use cases for correlation

[Insert any use cases you can think of for correlation that have already been identified.]

Use Case | Description | Methods Leveraged
Location Down | Multiple failure events should be grouped when the router at that location fails | Matching content, Time, Defined relationships (from CMDB)
Temp Directory Full | Multiple application errors that indicate a full temp directory should trigger automation for temp directory deletion | Matching content, Defined relationships (manually)
Related Application Components | Failures related to [Our company] Chat regions within 5 minutes should be grouped by region | Matching content, Time, Defined relationships (from CMDB)

Service Impact Determination

When technology events occur, it’s important to understand which alerts/events impact key business services, and how. Clear data and comprehensive visualizations enable operations teams to prioritize service-impacting events, and lead to quicker restoration of key business services.

Service Topology

Most mature event management solutions embrace the concept of a “service topology” or “service model” which depicts the relationships between technology components and business services. Corresponding visualizations illustrate relationships between network devices, servers, applications, and business services (see Fig. 2).

Relationships should be configurable such that impact can be approximated as it flows through the model. For example, if we have a cluster of two servers, we should be able to indicate that the application above is “degraded” when only a portion of the cluster has failed.
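A minimal sketch of that propagation logic, under the assumption of a simple two-node cluster, might look like the following; the model and status names are illustrative.

# Minimal sketch: approximate impact as it flows up a simple service model.
# A clustered service is "down" only if all members fail, "degraded" if some fail.
SERVICE_MODEL = {"Order Entry": ["app-node-1", "app-node-2"]}   # service -> cluster members

def service_status(service, failed_nodes):
    members = SERVICE_MODEL[service]
    failed = [m for m in members if m in failed_nodes]
    if len(failed) == len(members):
        return "down"
    if failed:
        return "degraded"
    return "operational"

print(service_status("Order Entry", {"app-node-1"}))   # degraded: one of two nodes has failed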

Figure 2: Example of a service model topology view. Inner color represents actual service status, outer color represents predicted impact.

Leverage Existing Data

Successful service models are typically created from a combination of discovery technologies, existing data structures, and human validation. It’s important to understand where this relationship knowledge exists today, and where it must be imported from to create an accurate model.

Operations technologies (including an event correlation engine) should make it easy to import or leverage data from common sources, such as discovery tools or existing CMDBs. They should also ensure that manual adjustments to the models can be executed easily.

[Insert places where service structure knowledge exists today, so it can be factored into the implementation. If you’re not sure, that’s OK; just remove this section]

Data Source | Description
Discovery Tool A | Discovers network components
Microsoft SCOM | Contains server data
Application Group D | Has manual maps of the existing application

Start Small

Service modeling usually involves and affects a large number of stakeholders, so it’s important to start small and build a standardized, repeatable process. Most organizations choose 2-3 models to start. It’s also unlikely that there is value in modeling the entire environment, so a best practice is to focus efforts where they have the largest return on investment; this usually means critical or problematic business services.

[Insert key targets for building service models.]

Target Service | Description
[Our company] Chat | Our core chat application
[Our company] Finance | Key application used by the [Our company] finance team

Event Closure

Most event management processes recognize an “active” or “open” event as being different from a “closed” event. An “open” event indicates that this event represents a condition or state that is still occurring, while a “closed” event means the condition happened in the past and is no longer taking place. The “open” events should be a real-time reflection of the health of the environment, representing all the critical conditions occurring in the environment at that moment. In order to ensure that state is maintained, a strategy for event closure must be created. The following are common methods of closing events.

Auto-closed by state change event. Simply put, “UP” events should close “DOWN” events. When a system fails or is degraded, in an optimal environment, the monitoring tool will send a “DOWN” event. When the system is restored, or the degradation is no longer being experienced, it will send an “UP” message. The “UP” message should be configured to match the “DOWN”, and make the closure.

Auto-closed by incident closure. If events auto-generate incidents, and technicians are responding to and working incidents instead of events, it may make sense to auto-close events related to incidents when the incident is closed. When a technician closes an incident, he/she is indicating the situation has been resolved, as well as designating accountability for closure of the incident.

Manual close through event view. Some organizations triage and work events directly in the event console. In this case, operators can manually assign themselves to events or groups of events, and manually close events when the work is complete.

Auto-closed by timeout. Even for the most critical events, there is a point at which they should be auto-closed to maintain the integrity of the console. As a best practice, all events should be configured to auto-close eventually; for example, 1 hour for minor events, 48 hours for warnings, and 30 days for criticals. Timeouts should be configured even if one of the above strategies is defined as the primary.
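The sketch below shows two of these strategies side by side: an “UP” event closes the matching “DOWN” event, and a severity-based timeout closes anything left over. The timeout values mirror the examples above and are assumptions to be tuned.

# Minimal sketch: state-change closure ("UP" closes "DOWN") plus timeout closure.
from datetime import timedelta

TIMEOUTS = {"minor": timedelta(hours=1),
            "warning": timedelta(hours=48),
            "critical": timedelta(days=30)}

open_events = {}   # (node, resource) -> open event record

def handle(event):
    key = (event["node"], event["resource"])
    if event["state"] == "DOWN":
        open_events[key] = event
    elif event["state"] == "UP" and key in open_events:
        open_events.pop(key)["closed"] = True      # state-change closure

def expire(now):
    for key, event in list(open_events.items()):
        if now - event["time"] > TIMEOUTS[event["severity"]]:
            open_events.pop(key)["closed"] = True  # timeout closure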

Incident Creation and Resolution

Most organizations will want to create incidents for a subset of their event traffic. It’s important that the solution and process have controls around when and how incidents are created.

Multiple events into a single incident

Events should not normally have a one-to-one relationship with incidents. Incidents are meant to be the “higher level” roll-up of multiple events. For example, if a “Customer Facing Website Down” event is received, it is likely that this would generate an incident ticket immediately, without needing any correlation. However, it’s preferred that events related to the “Customer Facing Website Down” event also be linked to the incident ticket, to shorten the troubleshooting time required for the incident responder. It’s important that the solution and process support a many-to-one relationship.

Routing incidents

In many organizations, when an event is turned into an incident, this is the machine’s way of asking a human incident responder for assistance. The concept is that the machine-based correlation has taken things as far as it can, and now it’s time to present the “right” information to the “right” person. Getting the “right” person is absolutely critical, and fairly complex routing logic may be required to ensure that happens.

For example, let’s say there is an application outage on a specific cluster. While the initial instinct may be to automatically assign the related incident(s) to the Application team first and let them figure it out manually, this could squander critical minutes playing “hot potato” with the incident(s). It could also be very costly, using high-dollar engineers from multiple teams to troubleshoot something that could have been determined with correlation.

More optimal routing logic might be: if the outage is correlated to a network failure, route the incident to the Network team (while keeping other impacted teams informed); if the outage is caused by a server or OS failure, route it to the Server Support team; finally, if the network and server are healthy, route the outage to the Application team.
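A minimal sketch of that routing logic: assign the incident to the team responsible for the lowest failing layer, based on flags produced by the correlation step. The team names and flag fields are placeholders.

# Minimal sketch: route an incident to the team owning the lowest failing layer.
def route_incident(correlated):
    # correlated: health flags for each layer, produced by the correlation step
    if not correlated.get("network_healthy", True):
        return "Network Team"
    if not correlated.get("server_healthy", True):
        return "Server Support Team"
    return "Application Team"

print(route_incident({"network_healthy": False, "server_healthy": True}))   # Network Team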

Auto-resolution of incidents

Incidents can be resolved by human responders or by automation. Human-based resolution involves an incident responder receiving a ticket, reviewing it, taking some action, logging their resolution, and setting it to “resolved”. Then the incident process kicks in, and (depending on the organization) a resolved incident is usually set to auto-close within a few hours or days.

For automated resolution, it’s important to understand the implications. Most event management technologies that are able to open an incident, can also set an incident to “closed” or “resolved” automatically when the event condition is no longer occurring. For example, if an incident is created for “Application Down”, the system can automatically set the incident to “resolved” when the “Application Up” event is received. While this can be very helpful, it must be deployed with care.

Mirroring the transient nature of the event console in the incident management system can “dumb down” the incident system. It can also lead to confusion as incidents that are being actively worked are resolved automatically behind the scenes, or lead to apathy on behalf of incident responders who might just wait for the automated resolution to kick in. It’s important that the event management system signals incident management that the event condition is no longer present, but often it is desirable for a human to investigate why the event condition occurred in the first place, and log that investigation.

Use Cases

Below are the use cases that have currently been defined for event-to-incident handling.

[Fill out the table below with any specific routing use cases that have been defined. It’s OK if you don’t have them all right now; it’s good to get a representative sample of the types of use cases you expect]

Use Case | Description | Routing
[Our company] Chat Application Failure | All related chat failures should be on a single incident | Network related = [Our company] Chat Network Team; Server related = [Our company] Cloud Hosting Team; Application related = [Our company] Chat DevOps Team (all parties informed)
Office Connectivity Failure | All related connectivity failures for that office should be on the same incident. If multiple office connectivity failures occur within 5 minutes, consolidate all on the same incident | Network related = Local Network Engineer; If multiple office failures occur at one time = [Our company] Global Network Team

Extensibility

While you can (and should) do some planning around specific challenges your business currently faces, these will continuously change and evolve, forcing you to be prepared for the unexpected. For that reason, it’s important that your event management technology and process be extensible and configurable.

Event Processing Configuration

At the core of any event processing solution is the ability to configure logic. This is sometimes referred to as “rules”, “policies”, or “recipes”. It’s important to have extremely flexible out-of-the-box logic configuration that can easily accomplish (and adjust to) difficult event management tasks. Things like correlation, deduplication, and incident creation should not require coding or advanced training to implement. These are expected event management use cases, and a modern solution should provide configurable, pre-packaged capabilities in these areas.

However, there will be times when the vendor cannot foresee your requirement. In these cases, it’s important that there is some route for extensibility, such as the ability to insert scripts or code that enables the organization to deal with challenges that were not anticipated.

Algorithm Tuning

Event data and client needs can be very different, so it’s important that machine learning algorithms can be clearly understood and tuned as needed (without a dedicated data scientist). It’s rare for an organization to prefer a chatty system that constantly produces insights when those observations/predictions are of low accuracy. More often than not, low accuracy leads operations teams to mistrust the system. For that reason, it may be preferable to hold or filter insights that the system has not determined to be a “sure thing”. If your event management system includes predictive capabilities, for example, you may want to set confidence thresholds.
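As a simple illustration of such a threshold, the sketch below holds low-confidence insights rather than publishing them to operators. The threshold value and data shape are assumptions.

# Minimal sketch: publish only insights above a tunable confidence threshold.
CONFIDENCE_THRESHOLD = 0.85   # assumption: tune per organization and use case

def publish_insights(insights):
    # insights: list of (description, confidence between 0 and 1) from the analytics layer
    published, held = [], []
    for description, confidence in insights:
        (published if confidence >= CONFIDENCE_THRESHOLD else held).append(description)
    return published, held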

Visualization and Usability Requirements

Dashboards

Dashboards should be available to display visualizations of event and metric data. Typically, widgets of interest include active events by source, event volume in the last hour, ratio of events to incidents, percentage of events suppressed, number of events handled per user, etc.
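To illustrate, the sketch below computes a few of those widget values from a list of events and incidents. The field names are assumptions about the underlying event records.

# Minimal sketch: compute a few dashboard widget values from event records.
from collections import Counter

def dashboard_metrics(events, incidents):
    by_source = Counter(e["source"] for e in events)
    suppressed_pct = 100 * sum(e.get("suppressed", False) for e in events) / max(len(events), 1)
    events_per_incident = len(events) / max(len(incidents), 1)
    return {"active_events_by_source": dict(by_source),
            "percent_suppressed": round(suppressed_pct, 1),
            "events_per_incident": round(events_per_incident, 1)}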

It’s important that dashboards can be created to accommodate specific user needs. For example, if you were the manager over the PeopleSoft group, you may want a dashboard that identifies event statistics specific to PeopleSoft. Meanwhile, the manager over the Operations group will want to see performance of the operations team across all applications.

In some cases, certain application owners may not want their data shared with other application owners. Executives may also need to be prevented from reaching deeper drill-down views, which could prove confusing. In these cases, tight role-based security is required.

[Fill out the table below with any specific dashboarding use cases that have been defined. It’s OK if you don’t have them all right now; it’s good to get a representative sample of the types of use cases you expect]

Use Case | Description
PeopleSoft Application Owner Dashboard | Detailed event-based reports specific to the PeopleSoft application
[Our company] Chat Application Owner Dashboard | Detailed event-based reports specific to the [Our company] Chat application (note: this dashboard can only be accessed by managers of the [Our company] Chat team)
Executive Dashboard | High-level representation of event management process performance across all applications and services

Consoles

Event Console

Event management solutions typically have a dynamic view that shows all events in the system, in real time. The console allows operators to triage events by assigning, taking ownership, adding comments, or closing. The console should also make event correlations easy to understand.

The event console is a critical part of any modern event management solution. It must be flexible enough to accommodate various processes and use cases, including:

Use Case: Operations Team works from Events

In highly dynamic environments, it may be preferable to utilize the event console actively. That means having a 24x7 team that works in the console, triaging events immediately, drawing conclusions, and taking response actions. The benefit and vision of working from the console is to provide instant response and triage for critical system outages.

Use Case: Operations Team works from Incidents

In some cases, it may be preferable to work from incidents. This makes the event management layer more of an automation/correlation layer, with human based actions being triggered after incidents are created. The benefit and vision is to create a smaller number of “smart” incidents which need manual response. These “smart” incidents might represent correlations of hundreds of events.

Use Case: Hybrid - work from Events and Incidents

In some environments there is a need to use a combination of working from events and incidents. This split may be due to the nature of certain applications or team structures.

Planned Role for Event Console

[Talk about your organization’s historical use of event consoles, as well as the future of how the event response process will work. If you don’t have this ready yet, just remove this section]

Meeting and Requirements Session Log

[It’s often helpful to keep a log of who contributed to the information contained in this document, so stakeholders know their feedback was considered]

A log of meetings and workshops where requirements and design data was collected.

Objective | Date | Participants
SiteScope Integration Requirements | 10/20/2017 | Billy, Joe, Bob, Martha, Tod, Mike, Bill
HP OpenView Integration Requirements | 10/20/2017 | Billy, Joe, Martha, Tod, Mike, Bill, Tom, Brad
ITSM Integration Requirements | 10/20/2017 | Billy, Joe, Bob, Martha, Tod, Mike, Bill
Executive Session | 10/20/2017 | Billy Joe, Joe Bob, Sally
