mike guthrie - revamping your 10 year old nagios installation

Revamping Your 10 Year Old Nagios Installation

By Mike Guthrie

mguthrie@redventures.com

Case Study: Red Ventures

• Digital Marketing Company

• Acquire customers for our partners

– Optimize SEO for websites

– Take inbound call volume for sales calls

RV Technology Notes

• LAMP Environment - PHP and JS

• We LOVE data – Many TB of DB storage

• We move fast…think Agile development on steroids.

• 50-60 in-house developers

• Redundancy – CLT and ATL datacenters

• Almost everything is clustered

• Our speed often creates technical debt

March 2015 – Nagios Profile

• 2 Nagios Installations – CLT and ATL

• 1100 Hosts / 8000 Services (now 1500 / 13000)

– Linux servers (web, mysql, cron, load balancers)

– Windows servers (phone, terminal)

– Network (Routers, UPS, PDU)

• PNP4Nagios for Performance Data

• Thruk UI

Key Problems

• No system to the configs whatsoever

• No consistency between ATL and CLT in setup

• Adding one check to a server type meant touching hundreds of files

• Terrible alert storms

• Misdirected or missing alerts

• Lots of hosts not being monitored at all

• ATL latency problems

• Broken escalations

• No effective historical reporting

What Every Engineer Wants To Hear

• Manageable configuration

• Minimize time spent on maintenance

• Reporting / Dashboards / Visualization

• Scalability

• Noise reduction

Step 1: Fix Config Management

• Version controlled and synced:

– Contacts

– Templates

– Commands

– Hostgroups

– Escalations

– Dependencies

• Decoupled:

– Hosts

– One-off escalations and dependencies

Step 1: Fix Config Management

• All hosts / services use templates

• Almost all service checks are applied through Service -> Hostgroup relationships

• Hostgroup = roles / attributes

– linux-server (Load, Memory, Disk, Procs, etc)

– mysql-server (Mysql, Slaving, Storage partition)

– supervisord-server (supervisord procs running)

• Use host variables for differing ports, SNMP strings, active disk partitions, etc
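The role model above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Nagios resolves objects internally: the `SERVICES_BY_HOSTGROUP` table and the `services_for` helper are hypothetical names, and the check lists are copied from the hostgroup examples on the slide.

```python
# Hypothetical role -> checks table, using the hostgroups named on the slide.
SERVICES_BY_HOSTGROUP = {
    "linux-server": ["Load", "Memory", "Disk", "Procs"],
    "mysql-server": ["Mysql", "Slaving", "Storage partition"],
    "supervisord-server": ["supervisord procs running"],
}


def services_for(hostgroups, catalog=SERVICES_BY_HOSTGROUP):
    """Return the checks a host inherits from its hostgroups, in order, without duplicates."""
    seen = []
    for group in hostgroups:
        for service in catalog.get(group, []):
            if service not in seen:
                seen.append(service)
    return seen


if __name__ == "__main__":
    # A MySQL box only declares its roles; the checks follow from them.
    print(services_for(["linux-server", "mysql-server"]))
```

The point of the design is that adding a check to a role means editing one entry in one place, rather than touching every host file.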

Host Config Before

define host {
    host_name            rv-atl-serverl01
    alias                rv-atl-serverl01
    use                  rv-routers
    hostgroups           MsSQL
    contact_groups       phoneops
}

define service {
    host_name            rv-atl-serverl01
    use                  rv-routers-service
    service_description  PING
    contact_groups       phoneops
    check_command        check_ping!200.0,30%!300.0,70%
}

define service {
    host_name            rv-atl-serverl01
    use                  rv-windows-service
    service_description  DISK
    contact_groups       phoneops
    check_command        check_windows_disk!public!CHIJKM!80!85
}

define service {
    host_name            rv-atl-serverl01
    use                  rv-windows-service
    service_description  CPU LOAD
    check_command        check_snmp_load_windows!public!50!80
    contact_groups       phoneops
    check_period         24x7MinusSQLBackup
}

define service {
    host_name            rv-atl-serverl01
    use                  rv-windows-service
    service_description  SQL Service
    check_command        check_windows_service!public!SQL Server \$MSSQLSERVER\$
    contact_groups       phoneops
}

define service {
    host_name            rv-atl-serverl01
    use                  rv-windows-service
    service_description  VIRTUAL MEMORY USAGE
    check_command        check_snmp_misc!public!Virtual Memory!90!95
    contact_groups       phoneops
}

Host Config After

define host {
    host_name            rv-atl-serverl01
    alias                rv-atl-serverl01
    use                  mssql-server
    hostgroups           windows-server,mssql-server
    _SNMP                public
}

Step 2: Config Automation

• Most of our servers are Puppet-managed (transitioning to Salt)

• Linux machines need to be self-aware of what they need to have monitored

• Linux servers use passive checks to propagate themselves up to Nagios
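A minimal sketch of the "add the host if Nagios does not know it yet" step, assuming one config file per host. The function names, the file layout, and the idea of returning a flag to trigger the later verify / restart / notify steps are all assumptions made for illustration; the talk does not show the actual webhook code.

```python
from pathlib import Path


def render_host(host_name, template, hostgroups, custom_vars=None):
    """Render a Nagios host object definition as text."""
    rows = [
        ("host_name", host_name),
        ("alias", host_name),
        ("use", template),
        ("hostgroups", ",".join(hostgroups)),
    ]
    # Custom host variables are written with a leading underscore, e.g. _SNMP.
    for key, value in sorted((custom_vars or {}).items()):
        rows.append(("_" + key.upper(), str(value)))
    body = "\n".join(f"    {key:<12} {value}" for key, value in rows)
    return "define host {\n" + body + "\n}\n"


def add_host_if_missing(config_dir, host_name, definition):
    """Write <config_dir>/<host_name>.cfg unless it already exists.

    Returns True when a new file was written, so the caller knows that a
    verify / restart / notify step is needed.
    """
    path = Path(config_dir) / f"{host_name}.cfg"
    if path.exists():
        return False
    path.write_text(definition)
    return True
```

Because the second call for the same host returns False, a passive check that arrives repeatedly from an already-known host would not trigger repeated restarts in this sketch.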

[Diagram: config automation flow. Boxes: Config Manager (Puppet, Salt); Remote Host (NRPE Config, Passive Crontab); Nagios / Webhook (Does this host exist?); Process and Result steps (Add Host, Verify, Restart, Notify). Arrow labels: Enforces, Generate, Generate.]

Result

• CONSISTENCY!

• All Linux configs are now auto-generated

• Everything else is either cloned or generated from a custom webtool

• Maintenance time went from 10-20 hours per week to less than 1 hour most weeks

Step 3: Reporting / Visualization

• Need NOC-level visibility

• Need cluster-level views of performance data

• Historical view of state changes and notifications

[Diagram: Nagios writes to Perfdata (Carbon), Ndoutils (MySQL), and Server? (TODO); the front ends shown are Grafana Dashboards, Custom NOC Dashboard, and Thruk UI.]

NOC Dashboard

Graphite + Grafana = Awesome

• Opted not to use Graphios

service_perfdata_file_template=$TIMET$\t$HOSTNAME$\t$SERVICEDESC$\t$SERVICEPERFDATA$

service_perfdata_file_processing_command=<customScript>

• Nagios writes to buffer file

• A custom script grabs the buffer and flushes it to Carbon

• Multiline socket write over UDP

• Will send 1000 data points in less than 0.03 seconds

• Carbon can scale far beyond anything we can throw at it
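The buffer-and-flush approach above might look roughly like the following. This is a sketch, not the script from the talk: it assumes the tab-separated template shown on this slide, Carbon's plaintext line format (`path value timestamp`), a Carbon instance with its UDP listener enabled on port 2003, and a metric path of `<prefix>.<host>.<service>.<label>`. The function names and the 1400-byte datagram limit are illustrative choices, and quoted perfdata labels containing spaces are not handled.

```python
import re
import socket

# Matches "label=value" at the start of each perfdata item; thresholds
# (";warn;crit;min;max") and units are ignored.
PERF_RE = re.compile(r"([^\s=]+)=(-?\d+(?:\.\d+)?)")


def sanitize(name):
    """Replace characters that would break a dotted Graphite path."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", name)


def to_carbon_lines(buffer_text, prefix):
    """Convert perfdata buffer rows into Carbon plaintext lines."""
    lines = []
    for row in buffer_text.splitlines():
        parts = row.split("\t")
        if len(parts) != 4:
            continue  # skip malformed rows
        timet, host, service, perfdata = parts
        for label, value in PERF_RE.findall(perfdata):
            path = ".".join([prefix, sanitize(host), sanitize(service), sanitize(label)])
            lines.append(f"{path} {value} {timet}")
    return lines


def chunk_payloads(lines, max_bytes=1400):
    """Pack many lines into as few datagrams as fit under max_bytes each."""
    payloads, current, size = [], [], 0
    for line in lines:
        encoded = (line + "\n").encode("utf-8")
        if current and size + len(encoded) > max_bytes:
            payloads.append(b"".join(current))
            current, size = [], 0
        current.append(encoded)
        size += len(encoded)
    if current:
        payloads.append(b"".join(current))
    return payloads


def flush(payloads, host="127.0.0.1", port=2003):
    """Send each multiline payload to Carbon as one UDP datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for payload in payloads:
            sock.sendto(payload, (host, port))
```

Packing many lines into each datagram keeps the number of socket writes small, which is one plausible reason a flush of this kind stays fast. UDP gives no delivery guarantee, so a dropped datagram simply means a gap in the graph.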

Grafana

• Makes combining and templating graphs EASY

• Can combine all sorts of metrics on a graph and perform a variety of mathematical functions on them

atl.rv-atl-server*.CPU_Load.load5

*.rv-{atl,clt}-server*.CPU_Load.load

• Can set up new NOC dashboards in minutes

• Also using this for application monitoring data

• Allows us to easily spot performance anomalies

• Helps with event correlation
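The two metric patterns on this slide rely on Graphite-style wildcards: `*` matches within a single dot-separated segment, and `{a,b}` selects alternatives. The sketch below imitates that matching in plain Python to show which series a pattern would pick up. It is an approximation for illustration, not Graphite's own matcher, and the helper names and sample metric paths are hypothetical.

```python
import re
from fnmatch import fnmatchcase


def expand_braces(segment):
    """Expand {a,b} alternatives inside one path segment into plain globs."""
    match = re.search(r"\{([^{}]*)\}", segment)
    if not match:
        return [segment]
    head, tail = segment[:match.start()], segment[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        expanded.extend(expand_braces(head + option + tail))
    return expanded


def matches(pattern, metric):
    """Return True if a dotted metric path matches a dotted wildcard pattern.

    Matching is done segment by segment, so '*' never crosses a dot.
    """
    pattern_parts, metric_parts = pattern.split("."), metric.split(".")
    if len(pattern_parts) != len(metric_parts):
        return False
    return all(
        any(fnmatchcase(part, alternative) for alternative in expand_braces(glob))
        for glob, part in zip(pattern_parts, metric_parts)
    )


if __name__ == "__main__":
    # Hypothetical series names following the <dc>.<host>.<service>.<label> layout.
    series = [
        "atl.rv-atl-server01.CPU_Load.load5",
        "clt.rv-clt-server07.CPU_Load.load5",
    ]
    pattern = "*.rv-{atl,clt}-server*.CPU_Load.load5"
    print([name for name in series if matches(pattern, name)])
```

A single pattern like this is what lets one templated graph cover a whole cluster in both datacenters.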

This is OK

This is not OK

• Event Automation – create automatic response tasks to known issues with common fixes

• Better connectivity to application monitoring

• Adaptive monitoring for situations like this:

Implementation

• Left the old servers alone and running

• Spun up new servers with notifications and event handling disabled

• Migrated 600+ configs by hand

• 500+ generated automatically

• Problem states were perfect for identifying what wasn’t set up yet

• Took about 6 weeks to migrate configs to new machines

• Launch day was changing Thruk’s backend config to point to new servers, and switching over notifications

• Audit, design, migration, and stable implementation took about 90

Things I Learned

• Take the time to understand what you’re monitoring

• Lack of understanding will produce alert noise, which is ineffective monitoring

• In a complex system, log everything

• Small changes do the most damage

• Automation is cool except for when it automatically sends 60% of your environment into a CPU death spiral

• I wish Nagios Core allowed hostgroup exclusions in service definitions (hint, hint)

• Nagios is still the best monitoring tool out there

Thank you!

Any Questions?
