R. Lange, M. Giacchini: Monitoring a Control System Using Nagios Monitoring a Control System Using Nagios Ralph Lange, BESSY – Mauro Giacchini, LNL

Post on 23-Dec-2015

Monitoring a Control System Using Nagios

Ralph Lange, BESSY – Mauro Giacchini, LNL

What is the Situation?

Machine Status vs. Controls Infrastructure Status
• Machine status:
  – usually handled in the Control Room by an operator
  – uses the Alarm Handler or other EPICS tools
  – based on Channel Access connections
• Control System infrastructure can be comparably complex; its status:
  – needs to be handled outside the Control Room
  – with tools that allow remote access
  – using different types of connections/checks: ping, SNMP, HTTP, Channel Access, disk usage, ...
• BESSY was starting to have an increasing number of failures due to ageing hardware
• One summer day Mauro (preparing an EPICS training in the hot Italian summer) asked me if I knew Nagios ...

What is Nagios?

Nagios (“nah-ghee-ose”)
• Open source monitoring framework
  – widely used & actively developed: www.nagios.org
• Detection of, and recovery from, host and service problems
• Provides a wide set of basic plugins (checks)
  – easy to develop custom plugins
• Active vs. passive checks
• Centralized vs. distributed deployment
  – also allows redundant Nagios daemons
• High configurability
  – service dependencies, fine-grained notification options
• Web interface
  – status view, administration (e.g. analysis, downtime scheduling)

The Plugin (Check) Interface

Plugins (Checks)
• Checks are command line programs that follow a convention for arguments, stdout output, and return code: nagiosplugins.org
  – Output: one line of status info
  – Return code: OK / WARNING / CRITICAL / UNKNOWN
• Can be written in any (i.e. your favourite) compiled or interpreted language
• Are configured into Nagios for local or remote execution
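
As an illustration of the plugin convention described above, here is a minimal sketch of a check written in Python. It is not one of the plugins from the talk: the disk-usage check, the function names and the 80%/90% thresholds are assumptions chosen for the example. The only parts taken from the convention are the single status line on stdout and the numeric return codes (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).

```python
#!/usr/bin/env python3
"""Sketch of a Nagios-style check: disk usage of one path (illustrative only)."""
import shutil
import sys

# Return codes defined by the plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(used_percent, warn, crit):
    """Map a usage percentage to a plugin return code."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path, warn=80.0, crit=90.0):
    """Return (return_code, one_line_status) for the file system holding path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    used_percent = 100.0 * usage.used / usage.total
    code = evaluate(used_percent, warn, crit)
    return code, f"DISK {LABELS[code]} - {path} is {used_percent:.1f}% full"


def main(argv):
    """Print the one-line status and return the plugin return code."""
    target = argv[1] if len(argv) > 1 else "/"
    code, line = check_disk(target)
    print(line)  # one line of status info on stdout
    return code


# When saved as a plugin script, finish with: sys.exit(main(sys.argv))
```

Nagios only looks at the printed line and the process exit status, which is why the same check could equally be a shell script or a compiled program.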

Passive Checks
• An external application can write check results (following a certain format) into a file (or a pipe)
• Nagios reads from this and accepts the results (if configured)
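
The passive-check mechanism above can be sketched as follows. The line format is the Nagios external command `PROCESS_SERVICE_CHECK_RESULT`; everything else is an assumption for illustration: the helper names, the host and service names in the usage note, and the location of the command file, which depends on the installation (it is set by `command_file` in the Nagios main configuration).

```python
import time

# Numeric states used in passive service check results.
STATES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def format_passive_result(host, service, state, output, timestamp=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line."""
    if timestamp is None:
        timestamp = int(time.time())
    # A newline ends the command, so flatten the free-text output to one line.
    text = output.replace("\n", " ")
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{STATES[state]};{text}\n"
    )


def submit(line, command_file):
    """Write the command to the Nagios external command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

A monitoring daemon would call, for example, `submit(format_passive_result("ioc-example", "CA heartbeat", "CRITICAL", "no update for 30 s"), "/path/to/nagios.cmd")`, where the host, service and path are placeholders. The "if configured" caveat on the slide corresponds to settings such as `check_external_commands` in the main configuration and `passive_checks_enabled` in the service definition.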

Nagios + CA Plugin = NAL

Nagios Channel Access Plugins
• caget type plugin (active check) by Mauro Giacchini (LNL)
• camonitor type daemon (passive check) by Debby Quock (APS)
• Integrate data available through CA into the Nagios monitoring framework
• Can check the health of EPICS integrated VME crates, VME IOCs, soft IOCs, PLCs, CA gateways, CA archivers, ... as well as OPI machine and server health, disk status, network device status, NTP, DNS, web services etc.
• Allows NAL (Nagios Alarm Handler) to be the central monitoring system for all control system infrastructure, whereas the ALH in the control room provides similar functionality for the controlled facility
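
To make the idea of a caget type active check concrete, the following is a rough sketch only; it is not the LNL plugin, whose interface is not shown in these slides. It assumes the EPICS `caget` command line tool is on the PATH, that the PV holds a scalar numeric value, and that higher values are worse. The function names and the PV name in the usage note are hypothetical.

```python
import math
import subprocess

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_caget(pv, timeout=2.0):
    """Read one PV with the EPICS caget tool; return its text output or None."""
    try:
        result = subprocess.run(
            ["caget", "-t", "-w", str(timeout), pv],  # -t: value only, -w: CA wait time
            capture_output=True, text=True, timeout=timeout + 5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None  # caget not installed, or it hung
    return result.stdout.strip() if result.returncode == 0 else None


def check_pv(pv, warn, crit, reader=run_caget):
    """Return (return_code, status_line) for a numeric PV with upper limits."""
    raw = reader(pv)
    try:
        value = float(raw)
    except (TypeError, ValueError):
        # No connection, caget missing, or a non-numeric value.
        return UNKNOWN, f"CA UNKNOWN - no numeric value for {pv}"
    if math.isnan(value):
        return UNKNOWN, f"CA UNKNOWN - {pv} is NaN"
    if value >= crit:
        return CRITICAL, f"CA CRITICAL - {pv} = {value:g}"
    if value >= warn:
        return WARNING, f"CA WARNING - {pv} = {value:g}"
    return OK, f"CA OK - {pv} = {value:g}"
```

A call such as `check_pv("EXAMPLE:IOC:LOAD", warn=70, crit=90)` would then yield the return code and status line for Nagios. Passing the reader as a parameter keeps the threshold logic testable without a running IOC.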

Current Configuration at BESSY

Servers
• All machines: ping, disk usage, load, processes, users, SSH
• Some: DNS (foreign and internal addresses), NTP

vxWorks IOCs
• Ping, CPU load, memory usage, FD usage

Services
• Wikis, web server, help pages, issue trackers (Trac/Redmine), elog
• Oracle servers: Ping, ODB Telnet, ODB TNS for important DBs

=> 296 checks on 111 hosts

Screen Shots: Tactical Overview

Screen Shots: Service Detail

Screen Shots: Service Detail

Screen Shots: Availability Report

Screen Shots: Service Trends

Firefox/Thunderbird Plugin

• Highly configurable, many filtering options

• New alarm starts blinking and may play sound

• Mouse-over opens a pop-up showing the current alarms

• Clicking an alarm opens the related Nagios page in a tab

Experiences

Nagios is a very stable and reliable framework; configuration is flexible, and there are many options and plugins

The off-control-room, web-based approach with email notification fits our controls group better than ALH

Manual configuration can be tedious; some parts could (should!) be generated from our RDB

Found some network problems, one running system clock, two disks filling up, IOC load and memory saturation on a number of mv162s (which were replaced by mv2100s)
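
The remark above about generating parts of the configuration could look roughly like the sketch below. It renders Nagios object definitions (`define host`, `define service`) from rows that would come from a database query. The row layout, the function names, the example host and the `generic-host` / `generic-service` templates (taken from the Nagios sample configuration) are all assumptions; the actual BESSY RDB schema is not described in these slides.

```python
def render_host(name, address, template="generic-host"):
    """Render one Nagios host object definition."""
    return (
        "define host {\n"
        f"    use        {template}\n"
        f"    host_name  {name}\n"
        f"    address    {address}\n"
        "}\n"
    )


def render_service(host, description, command, template="generic-service"):
    """Render one Nagios service object definition for a host."""
    return (
        "define service {\n"
        f"    use                  {template}\n"
        f"    host_name            {host}\n"
        f"    service_description  {description}\n"
        f"    check_command        {command}\n"
        "}\n"
    )


def render_config(rows):
    """rows: iterable of (host_name, address, [(description, check_command), ...])."""
    parts = []
    for name, address, checks in rows:
        parts.append(render_host(name, address))
        for description, command in checks:
            parts.append(render_service(name, description, command))
    return "\n".join(parts)


# Hypothetical inventory row, standing in for the result of an RDB query.
EXAMPLE_ROWS = [
    ("ioc-example", "192.0.2.10", [("PING", "check_ping!100.0,20%!500.0,60%")]),
]
```

Writing `render_config(EXAMPLE_ROWS)` to a file in a directory that Nagios reads object definitions from, then reloading Nagios, would replace that part of the manual editing.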

Next Steps

To be configured:
• Soft IOCs, CA Gateways, VME crates (Wiener), Embedded Controllers
• NFS shares usage, switches/routers, printers

Checks to be written:
• Conserver (IOC console access)
• CA Archiver (through ArchiveManager web interface)
• CA access rights (based on cainfo)

Collaborate:
• Integrate CA check plugin development
• Agree on a common place for our plugins (APS? Sourceforge? Nagios?)

LivEPICS Example

Live Example: Mauro Giacchini's LivEPICS distribution includes Nagios 3.0 (configured to look at the EPICS Base example app channels)

Go check it out – now!