Use Case: Help me from tons of alarms

OpenStack Summit Sydney, 6 Nov, 2017

Hisashi Osanai, Koji Nakazono, Daisuke Fujita (FUJITSU)

Copyright 2017 FUJITSU LIMITED


Share our use case

Alarm management

We got tons of alarms when trouble occurred in only a few components

Monasca and Monasca-analytics eliminated the unnecessary alarms

This saves the operator from long investigation work!

[Diagram labels: Fujitsu Cloud, Monasca]

Infrastructure operators need help!

Large-scale data

- 40 zettabytes in 2020 (Source: IDC’s Digital Universe Study, 2012)

Complex infrastructures

- Virtualized data centers (with SDx)

The interplay of data and SDx technologies leads to…

Slow problem resolution times

Hard to extract clear patterns to alleviate problems

Redundancy: very similar solutions applied everywhere

Difficult to generalize common situations and establish widely applicable best practices

Widely applicable best practices

Realize a long tail distribution

Infrastructure operators don’t want to use vendor-specific solutions

To achieve this, we follow two steps

- Make it easy to find “unseen” problems in a multitude of small/mid-sized installations
- Share expertise and establish common practices for those problems

Market Trends

AIOps Platforms (Artificial Intelligence for IT Operations)

Enhance IT operations using…

big data

modern machine learning

other advanced analytics

Enable the concurrent use of multiple…

Data sources

Data collection methods

Analytical technologies

Presentation technologies


Source: Gartner, 2017


AIOps capabilities

Monitoring related

- Historic data management
- Streaming data management
- Log data ingestion
- Wire data ingestion
- Metric data ingestion
- Document text ingestion

Analytics related

- Automated pattern discovery and prediction
- Anomaly detection
- Root cause determination

Delivery related

- On-premises delivery
- Software as a service
- Infrastructure as a service

Machine Learning for operational management

Slow problem resolution times

- Helps tame scale and complexity (easy to extract common patterns)
- Speedy resolution of problems

Achieving a long tail distribution

- There is no solution for sharing expertise and defining machine learning solution recipes

The OpenStack community and Monasca-analytics solve this

Potential use cases

Share our use case in detail

Monasca


What is Monasca

Official project of OpenStack

Monitoring-as-a-service solution that is well integrated with OpenStack

Collects, monitors and analyzes the metrics and logs of the entire OpenStack environment

Scalable, highly available (HA), high performance

Monasca consists of two features

Monitoring

- Metric monitoring

- Log management

Analytics

Monasca Architecture

[Architecture diagram. Component labels: Log Agent, Log API, Log Transformer, Log Persister, Log Metrics, Log DB, Metrics Agent, Metrics API, Threshold Engine, Notification Engine, Persister, Metrics DB, Config DB, Message Queue (Kafka), Analytics Engine, Monasca UI]

Metrics monitoring

Monitors many types of metrics

- System metrics, such as CPU usage and memory usage
- Many applications (OpenStack services, RabbitMQ, MariaDB...)
- Detects and monitors applications automatically

Monitors nova instances “cross tenant”

- Provides metrics of nova instances to each tenant user without installing the agent into the instances

Monasca Architecture – Metrics Agent

Collector

- Collects metrics periodically
- Plugin capability for applications

Statsd

- Collects metrics as a daemon

Forwarder

- Sends the collected metrics to the Monasca API

Supervisord

- Manages the processes above

[Diagram: Metrics Agent (Collector, Forwarder, Statsd, Supervisord) and the Metrics API]
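The split between Collector and Forwarder can be illustrated with a minimal sketch. This is not the Monasca agent's code: the plugin function, the metric names and the host name are hypothetical, and the measurement fields (name, dimensions, timestamp in milliseconds, value) are an assumption about the body shape the Metrics API accepts.

```python
import json
import time
from typing import Callable, Dict, List

# A "plugin" here is any callable returning {metric_name: value}.
Plugin = Callable[[], Dict[str, float]]


def system_plugin() -> Dict[str, float]:
    # Hypothetical check with fixed values; a real plugin would inspect the host.
    return {"cpu.idle_perc": 87.5, "mem.usable_mb": 2048.0}


def collect(plugins: List[Plugin], hostname: str, now_ms: int) -> List[dict]:
    """Collector role: run every plugin once and tag each value with dimensions."""
    measurements = []
    for plugin in plugins:
        for name, value in plugin().items():
            measurements.append({
                "name": name,
                "dimensions": {"hostname": hostname},
                "timestamp": now_ms,
                "value": value,
            })
    return measurements


def forward(measurements: List[dict]) -> str:
    """Forwarder role: serialize one batch; a real forwarder would send it to the API."""
    return json.dumps(measurements, sort_keys=True)


batch = collect([system_plugin], "compute-01", int(time.time() * 1000))
body = forward(batch)
```

Keeping collection and forwarding separate means a slow or unreachable API only delays the batch, not the periodic checks themselves.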

[Screenshot: Monasca UI]

[Screenshot: Monasca UI - Overview]

[Screenshot: Monasca UI - Alarms]

[Screenshot: Dashboard for metrics]


Log management

Collects logs from the whole environment

Analyzes the logs across all services with a powerful GUI

Raises alarms on error/critical log messages

Authenticates with Keystone

Supports multi-tenancy

Provides the infra logs to tenant users: https://www.openstack.org/videos/boston-2017/show-me-my-packet-log-neutron-packet-logging-with-monasca

Why is log management necessary?

Log files are distributed in micro-services systems

- OpenStack components are also provided as micro-services

Message monitoring is insufficient for micro-services systems

- It can detect the machine on which an error occurs, but…
- It is difficult to trace the logs across components

Centralized log management is needed

Dashboard for log


Introduce Monasca


Before introducing Monasca

We monitored Fujitsu Cloud with Zabbix...

- Could NOT detect slow responses of the REST API
- Checking the logs of each node was time-consuming

[Diagram labels: Zabbix, monitor, notify, investigate, Fujitsu Cloud]

Introduced Monasca

- To monitor the API response
- To introduce log management

[Diagram labels: Monasca, monitor, collect logs, notify, investigate logs, Fujitsu Cloud]

Monitor API response time

Transforms the REST API response time recorded in logs into a metric

OpenStack components output the API response time using oslo.log

2017-10-28 17:36:05.525 6248 INFO nova.osapi_compute.wsgi.server [req-id id id - default default] nova1 "GET /v2.1/id/servers/detail?all_tenants=True&changes-since=date HTTP/1.1" status: 200 len: 413 time: 0.9194989

The log agent picks up the value of “time” and sends it as a metric to the Monasca API via the Statsd of the metrics agent

[Diagram labels: nova log, neutron log, Log Agent, Statsd client, Metrics Agent (Forwarder, Statsd, Supervisord), Metrics API]
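As a rough illustration of the log-to-metric step, the sketch below extracts the “time” field from the sample log line above and formats it as a StatsD timing datagram. The regular expression, the metric name `api.response_time` and the choice of a timing (`|ms`) metric are assumptions made for this example; the slides do not show the actual log agent configuration.

```python
import re
from typing import Optional

# Matches the request and the trailing status/len/time fields of the access line.
ACCESS_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" '
    r'status: (?P<status>\d{3}) len: (?P<length>\d+) time: (?P<time>[\d.]+)'
)


def to_statsd_timing(line: str, metric: str = "api.response_time") -> Optional[str]:
    """Return a StatsD timing datagram in milliseconds, or None if the line does not match."""
    match = ACCESS_RE.search(line)
    if match is None:
        return None
    millis = float(match.group("time")) * 1000.0  # the log reports seconds
    return f"{metric}:{millis:.1f}|ms"


SAMPLE = (
    '2017-10-28 17:36:05.525 6248 INFO nova.osapi_compute.wsgi.server '
    '[req-id id id - default default] nova1 '
    '"GET /v2.1/id/servers/detail?all_tenants=True&changes-since=date HTTP/1.1" '
    'status: 200 len: 413 time: 0.9194989'
)

datagram = to_statsd_timing(SAMPLE)
```

A StatsD client would then send `datagram` over UDP to the Statsd process of the metrics agent; lines that do not match are simply skipped.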

[Screenshots: Dashboard for monitoring API response]

Notification with Monasca

Several notification types are supported

- E-mail, webhook and integration with external services
- Jira, Slack, PagerDuty and HipChat are supported

Slack notification can also be used with Mattermost, which is what we use

We set a notification for when the response time exceeds 5 seconds

[Slide note, translated from Japanese: replacement required]
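The threshold rule and the chat notification can be sketched as follows. In Monasca the threshold is evaluated by the threshold engine and the message is sent by the notification engine, so this is only an illustration of the logic: the function name and message wording are hypothetical, and the payload assumes the Slack-compatible incoming webhook format (a JSON object with a `text` field), which Mattermost also accepts.

```python
import json
from typing import Optional

THRESHOLD_SECONDS = 5.0  # from the slide: notify when response time is over 5 seconds


def build_notification(service: str, response_seconds: float) -> Optional[str]:
    """Return a Slack-compatible JSON payload, or None when within the threshold."""
    if response_seconds <= THRESHOLD_SECONDS:
        return None
    text = (
        f"ALARM: {service} API response time {response_seconds:.2f}s "
        f"exceeds {THRESHOLD_SECONDS:.0f}s"
    )
    return json.dumps({"text": text})


payload = build_notification("nova", 7.25)
```

Posting `payload` to an incoming webhook URL would produce one chat message per alarm, which is exactly why a fault that slows many APIs at once floods the channel.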

Tons of alarms in trouble

We became able to monitor API response time, but…

OpenStack APIs depend on each other across components

When only one component was in trouble, the other components also reported slow API responses

A huge number of alarms were raised for trouble in only one component!

Wasteful alarms bother me

Nearly 400 alarms per trouble, but

- NO need to care about the services that are not the root cause
- Checking all the alarms is a very bothersome task...

[Diagram labels: Monasca, monitor, collect logs, notify, investigate logs, Fujitsu Cloud]

Our challenge with Monasca-analytics


Motivation to solve the problem

Focus only on what matters during trouble

Reducing the tons of alarms for the operator in order to...

• Reduce the investigation time

[Diagram labels: Monasca, Notify, Reducing tons of alarms, Focus on root cause alarms]

Alarm management with Monasca-analytics

Succeeded in dramatically reducing the tons of alarms

Monasca-analytics helps to...

• Check whether an alarm is necessary or not

• Get root cause alarms

[Diagram labels: Alarms related to API response time (Cinder, Keystone, Nova, …), Monasca-analytics, Detect root cause alarms]

What is Monasca-analytics

One of Monasca’s components

Analyzes data and predicts anomalies/root causes with machine learning

Defines a machine learning solution as a “Recipe”

OpenStack project/repo: https://github.com/openstack/monasca-analytics

Monasca Architecture


Monasca-analytics Architecture: Framework

Monasca-analytics has a framework that can automatically execute a machine learning process as a data flow sequence

The data flow sequence is composed of the following modules; new modules can be added and existing modules customized

- Data Reception
- Data Preprocessing
- Data Aggregation
- Machine Learning Algorithm
- Data Prediction
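The idea of a data flow sequence built from replaceable modules can be sketched as plain function composition. The five stage functions below are hypothetical stand-ins for the module types named on the slide, with deliberately trivial logic; they are not Monasca-analytics classes.

```python
from functools import reduce
from typing import Any, Callable, Dict, List

Stage = Callable[[Any], Any]


def run_flow(stages: List[Stage], data: Any) -> Any:
    """Pass the data through each stage in order."""
    return reduce(lambda acc, stage: stage(acc), stages, data)


def receive(raw: List[str]) -> List[str]:
    # Data reception: drop empty records.
    return [line for line in raw if line.strip()]


def preprocess(lines: List[str]) -> List[Dict[str, str]]:
    # Data preprocessing: parse "service state" into a record.
    return [dict(zip(("service", "state"), line.split())) for line in lines]


def aggregate(events: List[Dict[str, str]]) -> Dict[str, int]:
    # Data aggregation: count ALARM events per service.
    counts: Dict[str, int] = {}
    for event in events:
        if event["state"] == "ALARM":
            counts[event["service"]] = counts.get(event["service"], 0) + 1
    return counts


def learn(counts: Dict[str, int]) -> Dict[str, Any]:
    # "Learning" placeholder: remember the mean alarm count.
    mean = sum(counts.values()) / len(counts) if counts else 0.0
    return {"counts": counts, "mean": mean}


def predict(model: Dict[str, Any]) -> List[str]:
    # Prediction placeholder: services with more alarms than the mean.
    return sorted(s for s, c in model["counts"].items() if c > model["mean"])


raw = ["nova ALARM", "cinder ALARM", "nova ALARM", "", "keystone OK", "nova ALARM"]
noisy = run_flow([receive, preprocess, aggregate, learn, predict], raw)
```

Because each stage only depends on the output of the previous one, swapping in a customized module means replacing one entry in the list passed to `run_flow`.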


Monasca-analytics Architecture: Data flow sequence

Training phase

Use past data

Prediction phase

Use streaming data


Monasca-analytics Architecture: Recipe

Configuration file which defines the data flow sequence

Store recipes as templates
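To show what "a configuration file which defines the data flow sequence" could look like in practice, here is a small sketch in which a recipe is plain data and a builder turns it into an executable flow. The schema (`name`, `flow`), the stage names and the registry are all hypothetical; the slides do not show the actual recipe format used by Monasca-analytics.

```python
from typing import Callable, Dict, List

# Hypothetical registry of available stages, keyed by the name used in recipes.
REGISTRY: Dict[str, Callable[[List[str]], List[str]]] = {
    "drop_empty": lambda rows: [r for r in rows if r],
    "uppercase": lambda rows: [r.upper() for r in rows],
    "dedupe": lambda rows: sorted(set(rows)),
}

# A recipe stored as a reusable template: only names, no code.
RECIPE_TEMPLATE = {
    "name": "alarm-cleanup",
    "flow": ["drop_empty", "uppercase", "dedupe"],
}


def build(recipe: dict) -> Callable[[List[str]], List[str]]:
    """Validate the recipe and return a function that runs its flow in order."""
    unknown = [s for s in recipe["flow"] if s not in REGISTRY]
    if unknown:
        raise ValueError(f"unknown stages: {unknown}")
    stages = [REGISTRY[s] for s in recipe["flow"]]

    def run(rows: List[str]) -> List[str]:
        for stage in stages:
            rows = stage(rows)
        return rows

    return run


result = build(RECIPE_TEMPLATE)(["nova", "", "Nova", "cinder"])
```

Since the recipe contains only names, it can be shared between installations and checked before anything runs, which fits the goal of exchanging solutions as templates.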


How to solve the problem

Find the causality between alarms and notify only the causal alarms

- Train the causality between alarms from the tons of alarms
- Predict causal and affected alarms

[Diagram labels: Alarms related to API response time (Cinder, Keystone, Nova, …), Causal analysis]

Training Phase

[Diagram: data flow fed from the Message Queue (Kafka)]

1. Consume from Kafka the list of alarms, restricted to those related to API response time

2. Aggregate the alarms and transform them into training data for the machine learning algorithm

3. Learn the causality between alarms
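Steps 2 and 3 can be illustrated with a deliberately simple stand-in. The slides do not name the learning algorithm, so the sketch below swaps in a precedence count: alarms are grouped into incidents separated by quiet gaps, and the "model" records how often one alarm fired before another within the same incident. The alarm names, the 60-second gap and the history are made-up illustration values.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alarm = Tuple[float, str]  # (timestamp in seconds, alarm name)


def to_incidents(alarms: List[Alarm], gap: float = 60.0) -> List[List[str]]:
    """Aggregate: split the time-ordered alarms into incidents at quiet gaps."""
    incidents: List[List[str]] = []
    last_ts = None
    for ts, name in sorted(alarms):
        if last_ts is None or ts - last_ts > gap:
            incidents.append([])
        if name not in incidents[-1]:
            incidents[-1].append(name)  # keeps first-firing order
        last_ts = ts
    return incidents


def learn_precedence(incidents: List[List[str]]) -> Dict[Tuple[str, str], int]:
    """Learn: count how often alarm a fired before alarm b in the same incident."""
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for names in incidents:
        for i, first in enumerate(names):
            for later in names[i + 1:]:
                counts[(first, later)] += 1
    return dict(counts)


HISTORY = [
    (0.0, "keystone.api"), (5.0, "nova.api"), (9.0, "cinder.api"),
    (500.0, "keystone.api"), (503.0, "cinder.api"), (510.0, "nova.api"),
    (1000.0, "nova.api"),
]
model = learn_precedence(to_incidents(HISTORY))
```

In this toy history the Keystone alarm precedes the Nova and Cinder alarms in both multi-alarm incidents, so the learned counts point to Keystone as the likely cause.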


Prediction Phase

[Diagram: data flow fed from the Message Queue (Kafka)]

1. Consume from Kafka the list of alarms, restricted to those related to API response time

2. Predict causal and affected alarms

3. Push the list of alarms to Kafka and notify Mattermost
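Step 2 can be sketched with the same kind of stand-in model: a table of how often one alarm preceded another in past incidents. An active alarm is treated as "affected" if some other active alarm has preceded it more often than the reverse, and as "causal" otherwise. The model values and alarm names are hypothetical, and this heuristic is only an illustration, not the algorithm used in the talk.

```python
from typing import Dict, List, Tuple

# Hypothetical learned model: (a, b) -> times a fired before b in past incidents.
MODEL: Dict[Tuple[str, str], int] = {
    ("keystone.api", "nova.api"): 2,
    ("keystone.api", "cinder.api"): 2,
    ("nova.api", "cinder.api"): 1,
    ("cinder.api", "nova.api"): 1,
}


def split_alarms(
    active: List[str], model: Dict[Tuple[str, str], int]
) -> Tuple[List[str], List[str]]:
    """Split the active alarms into (causal, affected) using precedence counts."""
    causal: List[str] = []
    affected: List[str] = []
    for name in active:
        dominated = any(
            model.get((other, name), 0) > model.get((name, other), 0)
            for other in active
            if other != name
        )
        if dominated:
            affected.append(name)
        else:
            causal.append(name)
    return causal, affected


causal, affected = split_alarms(["nova.api", "cinder.api", "keystone.api"], MODEL)
message = "Root cause candidates: " + ", ".join(causal)
```

Only `message` would be pushed on for notification, so the operator sees the causal alarm rather than every slow API that follows from it.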

Result

[Figure: Before / After]

Conclusion

Number of alarms per trouble: from 397 to 8

Time of investigation per trouble: from 5.5h to 0.5h

The operator is no longer bothered by unnecessary alarms!
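For reference, the relative reductions implied by the figures above (397 and 8 alarms, 5.5h and 0.5h of investigation) can be computed directly. The percentages are derived here and are not stated in the slides.

```python
def reduction_percent(before: float, after: float) -> float:
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100.0


alarm_reduction = reduction_percent(397, 8)    # alarms per trouble
time_reduction = reduction_percent(5.5, 0.5)   # investigation hours per trouble
```

Rounded to one decimal place, this is roughly a 98.0% reduction in alarms and a 90.9% reduction in investigation time.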


Any Questions?


Contacts

We would gladly explain more details

Feel free to contact us if you have any questions!

Hisashi Osanai: [email protected]

Koji Nakazono: [email protected]

Daisuke Fujita: [email protected]