osmc 2014: naemon 1, 2, 3, n | andreas ericsson

Upload: netways

Post on 02-Jul-2015

DESCRIPTION

How monitoring should be automated without jeopardizing accuracy. I will present a ready-to-use system that lets system admins set up their servers to be automagically picked up by Naemon, and that also lets them tweak their settings without access to the monitoring system and, most notably, without even restarting or reloading it. I will also present a working (I hope) demo of dynamic thresholds in Naemon, using various helpers in a request/response system.

TRANSCRIPT

Page 1: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

… more than software

Naemon & Nostalgia

Andreas Ericsson

[email protected]

Page 2: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

Agenda

● Agenda

● Ego slide

● About op5

● IT now and then

● Naemon

● Roadmap progress

● Up and coming

Page 3: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

Ego slide

● Programming since I was seven

● Core architect at op5 since 2003

● Nagios core developer 2009-2013

● Performance fanatic

● Author of Merlin and Nagios 4

● Naemon maintainer

● Voted “most likely to invent the lightsaber and then accidentally kill himself with it” when we last played that particular drinking game.

● Motivation: “He does a lot of dumb shit but he's really smart on the inside”

Page 4: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

About op5

● Founded 2003

● +900 customers

● 97% renewal rate

● Focus on large installations

● http://www.op5.com

Page 5: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

IT now and then

Page 6: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

“IT” ca 1970

● Computer performance measured in CPM (cards per minute) as often as in MHz

● 1 year of training before you were allowed to touch a machine

● Unix development began to create a multitasking, multiuser system

● First time a computer passes a college-level calculus course

● First successful ARPANET test

● IANA formed

● Computer-powered devices per person: 0.00000001

● Average CPU speed: 1MHz

● admin:computer ratio 300:1

Page 7: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

IT ca 1980

● One-man computers were gaining ground (Apple II)

● PacMan!

● Portable computers were being developed

● Ethernet standards introduced

● ARPANET, BitNet, CSNET et al began merging

● TCP/IP standards formalized

● SMTP, DNS etc quickly followed

● Average CPU speed: 8MHz

● admin:computer ratio 10:1

Page 8: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

IT ca 1990

● IBM PC style computers becoming popular

● GUIs become popular

● First web page created

● Linux invented

● Internet had suffered its first worm (Morris)

● “Software installation” was actually part of a job description

● Average CPU speed: 66MHz

● admin:computer ratio 1:1

Page 9: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

IT ca 2000

● WiFi standards emerge

● The dot-com era

● Google overtakes AltaVista as the most popular search engine

● It was ok for non-nerds to get computers

● Monitoring starts to become a thing

● Nagios development starts

● Average CPU speed: 800MHz

● admins:computers ratio 1:10

Page 10: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

IT ca now

● An average smartphone has 120 million times the computing power of the first general-purpose computer (Ferranti Mark 1)

● An average smartphone has 4 million times the amount of main memory

● Giant datacenters house 100,000+ servers

● Average CPU speed: 2.4GHz

● admins:computers ratio 1:300

Page 11: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

Hands up if...

● … the number of servers you manage has grown faster than the staff you have to monitor and manage them

● … you use more than two tools just to manage your servers

● … you have people dedicated to managing the servers that monitor and manage your servers

Page 12: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

Conclusions

● Manpower is getting scarce

● Sharing resources is more important than ever

● The most expensive resources are the most important to share

● Using what works now but doesn't lock one into a corner is key to remaining effective

● Developers have a duty to minimize the job they do (laziness is important! :-p)

● Developers have a duty to minimize the job sysadmins do

● Keep stuff simple. If it breaks, you not only get to keep the pieces, but you get to do the same job again in a different way

Page 13: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

Last year's Naemon roadmap

● Completed

● External commands via query handler

● Dropdir support

● Livestatus

● In progress

● Check result transformer

● Object creation/modification at runtime

● Scheduler-controlled helper daemons (well, kinda)

● Scrapped

● Runtime-modifiable main-config – no use case found

● Object extensions – custom variables fill the same role

Page 14: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

Up and coming

● Backlogged: Runtime object creation and modification

● Backlogged: Check result transformer

● Backlogged: Scheduler-controlled helper daemons

● Performance data handling

● Active agents

● Report data export

● Because http://www.youtube.com/watch?v=8yVFkMXy8rw

Page 15: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

Runtime object creation/modification

● User story:

● If users prefer to configure their monitoring on the monitored hosts, we should automagically add them to the monitoring config without reloading it.

● On-call schedule handover

● Added as a queryhandler extension

● Housekeeping events every X seconds

● Allows new stuff to call in and start getting monitored automagically

● Object creation may not happen if I can't get it stable in a reasonable timeframe, because config reload is superfast nowadays
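
The call-in and housekeeping idea above can be sketched in a few lines. This is only an illustration under stated assumptions: `CallInRegistry`, `call_in`, `housekeeping` and `max_age` are hypothetical names, and nothing here talks to the real query handler. It shows hosts registering themselves and a periodic housekeeping pass dropping the ones that have gone silent.

```python
from dataclasses import dataclass, field


@dataclass
class CallInRegistry:
    """Tracks hosts that have called in. All names here are hypothetical."""
    max_age: float                                  # seconds a host may stay silent
    last_seen: dict = field(default_factory=dict)   # host name -> last call-in time

    def call_in(self, host, now):
        """Record a call-in; return True if the host was not known before."""
        is_new = host not in self.last_seen
        self.last_seen[host] = now
        return is_new

    def housekeeping(self, now):
        """Drop hosts silent for longer than max_age; return their names."""
        stale = sorted(h for h, seen in self.last_seen.items()
                       if now - seen > self.max_age)
        for host in stale:
            del self.last_seen[host]
        return stale


registry = CallInRegistry(max_age=300)
registry.call_in("web01", now=0)
registry.call_in("db01", now=200)
print(registry.housekeeping(now=400))   # hosts silent for more than 300 s
```

A real extension would run `housekeeping` from a timed event every X seconds and turn a new call-in into an object creation request; both steps are left out here.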

Page 16: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

Check result transformer

● User story:

● Since anomalies in network and application behaviour often indicate errors, it's important that we can detect them and notify about them

● External helper connects to NERD

● Events zip to the helper

● Helper can alter state/output/perfdata/whatever

● Helper zaps result back to core via QH

● Allows for adaptive thresholds

● The monitoring system requires little or no configuration

● Inspired by Bischeck (which we will likely end up using as engine)
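
To make "adaptive thresholds" concrete, here is a minimal sketch of what a helper could do with each result before handing it back. It is not how Bischeck or Naemon do it: the rolling mean and standard deviation, the class name `AdaptiveThreshold` and every parameter are assumptions chosen for illustration, and the NERD and QH transport is left out entirely.

```python
from collections import deque
from statistics import mean, pstdev

OK, WARNING, CRITICAL = 0, 1, 2


class AdaptiveThreshold:
    """Hypothetical transformer: judges a value against its own recent history."""

    def __init__(self, window=20, warn_sigma=2.0, crit_sigma=3.0, min_samples=5):
        self.history = deque(maxlen=window)   # most recent values only
        self.warn_sigma = warn_sigma
        self.crit_sigma = crit_sigma
        self.min_samples = min_samples

    def transform(self, value):
        """Return the state for value, then add value to the history."""
        state = OK
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            deviation = abs(value - mu)
            # A perfectly flat history gives no scale to judge by, so stay OK.
            if sigma > 0:
                if deviation > self.crit_sigma * sigma:
                    state = CRITICAL
                elif deviation > self.warn_sigma * sigma:
                    state = WARNING
        self.history.append(value)
        return state


helper = AdaptiveThreshold()
for response_time in (10, 11, 9, 10, 11, 9, 10, 30):
    print(response_time, helper.transform(response_time))
```

No thresholds are configured per service; the only tuning knobs are the window size and the sigma multipliers, which is the "little or no configuration" point from the slide.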

Page 17: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

Scheduler-controlled helper daemon

● User story:

● Users must be able to trust that exported data is complete

● New twist: Naemon will connect to other systems' sockets instead

● Will most likely be implemented as a module

Page 18: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

Performance data handling

● User story:

● Users should be able to produce graphs of all metrics they monitor with as little impact on available resources as possible.

● Performance data can now be streamed from Naemon

● Reduces I/O from spoolfile writing

● Feeds data to PNP or Graphite

● Gets rid of the last synchronously executed system call

● Already completed

● Builds on top of already-implemented interfaces
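
As a rough sketch of the receiving end of such a stream, the snippet below turns a plugin perfdata string into lines for Graphite's plaintext protocol (`<path> <value> <timestamp>`). The function names and the `naemon.<host>.<service>.<label>` path scheme are assumptions made for this example, not something Naemon prescribes, and how the data arrives from Naemon is not shown.

```python
import re

# Perfdata items look like label=value[UOM];warn;crit;min;max.
# Labels containing spaces are wrapped in single quotes.
_PERF_RE = re.compile(r"(?:'([^']+)'|([^\s=']+))=(-?\d+(?:\.\d+)?)")
_UNSAFE = re.compile(r"[^A-Za-z0-9_-]+")


def parse_perfdata(perfdata):
    """Return (label, value) pairs; units and thresholds are ignored."""
    return [(quoted or bare, float(value))
            for quoted, bare, value in _PERF_RE.findall(perfdata)]


def to_graphite_lines(host, service, perfdata, timestamp, prefix="naemon"):
    """Format each metric as one Graphite plaintext line."""
    lines = []
    for label, value in parse_perfdata(perfdata):
        # Hypothetical naming scheme; anything unsafe in a path becomes "_".
        path = ".".join(_UNSAFE.sub("_", part)
                        for part in (prefix, host, service, label))
        lines.append(f"{path} {value} {int(timestamp)}")
    return lines


for line in to_graphite_lines("web01", "HTTP",
                              "time=0.25s;1;2;0 size=1024B;;;0", 1416300000):
    print(line)
```

Each line would be written, newline-terminated, to the Graphite listener; a PNP feed would need a different formatter on top of the same parsed pairs.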

Page 19: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

Perfdata handling design

[Diagram: Naemon, NERD, PNP/Graphite/?]

Page 20: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

Active agents

● User stories:

● Naemon should scale to as close to infinite sizes as possible

● To save time, new hosts should report what metrics they're offering and Naemon should automagically monitor them.

● collectd or a pushing version of check_mk

● Input for automagic host/service creation

● Large providers already write and ship compatible plugins

● Complements active checks, but doesn't replace them

● Improves security in many setups (especially over NRPE)

● Allows us to reuse existing code (i.e., be lazy and do less work)
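
A receiver/evaluator for pushed metrics could look roughly like the sketch below. It assumes static per-metric thresholds and a hypothetical `Evaluator` class; the only real interface used is the format of the `PROCESS_SERVICE_CHECK_RESULT` external command. Receiving data from collectd and delivering the command to Naemon are both left out.

```python
OK, WARNING, CRITICAL = 0, 1, 2
NO_LIMIT = (float("inf"), float("inf"))


class Evaluator:
    """Hypothetical receiver: evaluates pushed values and notes new metrics."""

    def __init__(self, thresholds):
        self.thresholds = thresholds   # metric name -> (warn, crit)
        self.known = set()             # (host, metric) pairs seen so far

    def receive(self, host, metric, value, now):
        """Return (is_new, command) for one pushed metric value."""
        # A first sighting is the hook for automagic host/service creation.
        is_new = (host, metric) not in self.known
        self.known.add((host, metric))

        warn, crit = self.thresholds.get(metric, NO_LIMIT)
        if value >= crit:
            state = CRITICAL
        elif value >= warn:
            state = WARNING
        else:
            state = OK

        # Passive result in external command format, using the metric name
        # as the service description (an assumption of this sketch).
        command = (f"[{int(now)}] PROCESS_SERVICE_CHECK_RESULT;"
                   f"{host};{metric};{state};{metric}={value}")
        return is_new, command


evaluator = Evaluator({"load": (4.0, 8.0)})
print(evaluator.receive("web01", "load", 5.5, now=1416300000))
```

Because the agents push, the monitoring server never has to open a connection to the monitored host, which is where the security point against NRPE comes from.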

Page 21: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

Active agents design

[Diagram: Naemon (query-handler, livestatus), Receiver/Evaluator, four collectd instances]

Page 22: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

Report data export

● User story:

● It should be easy to get performance data from Naemon into SystemX in order to facilitate the graphing power of SystemX

● Streams host and service state changes, downtime and start/stop events from Naemon

● Builds on top of existing interfaces

● Provides excellent performance

● Allows using extreme-performance data warehouse software to store (possibly huge amounts of) report data.
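
The consuming side of such an export might be as small as the sketch below. SQLite stands in for the data warehouse purely so the example is self-contained, and the `ReportEvent` fields, the `report_data` table and `export_events` are all hypothetical; the stream coming out of Naemon is not shown.

```python
import sqlite3
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class ReportEvent:
    """One streamed event. Field names are assumptions for this sketch."""
    timestamp: int
    kind: str       # e.g. "statechange", "downtime_start", "process_stop"
    host: str
    service: str    # empty string for host-level events
    state: int


def export_events(conn, events):
    """Append a batch of events to the report_data table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS report_data "
        "(timestamp INTEGER, kind TEXT, host TEXT, service TEXT, state INTEGER)"
    )
    with conn:   # one transaction per batch keeps write overhead low
        conn.executemany(
            "INSERT INTO report_data VALUES (?, ?, ?, ?, ?)",
            (astuple(event) for event in events),
        )


conn = sqlite3.connect(":memory:")   # stand-in for the real warehouse
export_events(conn, [
    ReportEvent(1416300000, "statechange", "web01", "HTTP", 2),
    ReportEvent(1416300300, "statechange", "web01", "HTTP", 0),
])
print(conn.execute("SELECT COUNT(*) FROM report_data").fetchone()[0])
```

Batching inserts per transaction is the main design choice here; any store with a bulk-load path can take the place of SQLite.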

Page 23: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

Report data export design

[Diagram: Naemon, NERD, Data warehouse]

Page 24: OSMC 2014: Naemon 1, 2, 3, N | Andreas Ericsson

Questions?

[email protected]

● http://www.naemon.org/

● http://www.youtube.com/watch?v=8yVFkMXy8rw

● Or just talk to me. I'm not dangerous until I have that lightsaber ;-)