mike guthrie - revamping your 10 year old nagios installation
Post on 15-Apr-2017
Revamping Your 10 Year Old Nagios Installation
By Mike Guthrie
mguthrie@redventures.com
Case Study: Red Ventures
• Digital Marketing Company
• Acquire customers for our partners
– Optimize SEO for websites
– Take inbound call volume for sales calls
RV Technology Notes
• LAMP Environment - PHP and JS
• We LOVE data – Many TB of DB storage
• We move fast…think Agile development on steroids.
• 50-60 in-house developers
• Redundancy – CLT and ATL datacenters
• Almost everything is clustered
• Our speed often creates technical debt
March 2015 – Nagios Profile
• 2 Nagios Installations – CLT and ATL
• 1100 Hosts / 8000 Services (now 1500 / 13000)
– Linux servers (web, mysql, cron, load balancers)
– Windows servers (phone, terminal)
– Network (Routers, UPS, PDU)
• PNP4Nagios for Performance Data
• Thruk UI
Key Problems
• No system to the configs whatsoever
• No consistency between ATL and CLT in setup
• Adding one check to a server type meant touching hundreds of files
• Terrible alert storms
• Misdirected or missing alerts
• Lots of hosts not being monitored at all
• ATL latency problems
• Broken escalations
• No effective historical reporting
What Every Engineer Wants To Hear
Goals
• Manageable configuration
• Minimize time spent on maintenance
• Reporting / Dashboards / Visualization
• Scalability
• Noise reduction
Step 1: Fix Config Management
• Version controlled and synced:
– Contacts
– Templates
– Commands
– Hostgroups
– Escalations
– Dependencies
• Decoupled:
– Hosts
– One-off escalations and dependencies
Step 1: Fix Config Management
• All hosts / services use templates
• Almost all service checks are applied through Service -> Hostgroup relationships
• Hostgroup = roles / attributes– linux-server (Load, Memory, Disk, Procs, etc)
– mysql-server (Mysql, Slaving, Storage partition)
– supervisord-server (supervisord procs running)
• Use host variables for differing ports, SNMP strings, active disk partitions, etc
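The hostgroup-as-role idea above can be sketched in a few lines of Python. This is only an illustration of how service checks follow from hostgroup membership, not how Nagios resolves them internally. The mapping reuses the hostgroup examples from the slide (the linux-server list is abbreviated there with "etc"); the function name `services_for` is hypothetical.

```python
# Role-to-service mapping copied from the hostgroup examples on the slide.
# The linux-server list is abbreviated in the source ("etc").
SERVICES_BY_HOSTGROUP = {
    "linux-server": ["Load", "Memory", "Disk", "Procs"],
    "mysql-server": ["Mysql", "Slaving", "Storage partition"],
    "supervisord-server": ["supervisord procs running"],
}


def services_for(hostgroups):
    """Return the checks a host inherits from its hostgroups, in order, without duplicates."""
    services = []
    for group in hostgroups:
        # Unknown hostgroups contribute no services.
        for service in SERVICES_BY_HOSTGROUP.get(group, []):
            if service not in services:
                services.append(service)
    return services


print(services_for(["linux-server", "mysql-server"]))
```

Adding a check to a role then means editing one entry rather than every host that carries the role.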
Host Config Before

define host {
    host_name           rv-atl-serverl01
    alias               rv-atl-serverl01
    use                 rv-routers
    hostgroups          MsSQL
    contact_groups      phoneops
}

define service {
    host_name           rv-atl-serverl01
    use                 rv-routers-service
    service_description PING
    contact_groups      phoneops
    check_command       check_ping!200.0,30%!300.0,70%
}

define service {
    host_name           rv-atl-serverl01
    use                 rv-windows-service
    service_description DISK
    contact_groups      phoneops
    check_command       check_windows_disk!public!CHIJKM!80!85
}

define service {
    host_name           rv-atl-serverl01
    use                 rv-windows-service
    service_description CPU LOAD
    check_command       check_snmp_load_windows!public!50!80
    contact_groups      phoneops
    check_period        24x7MinusSQLBackup
}

define service {
    host_name           rv-atl-serverl01
    use                 rv-windows-service
    service_description SQL Service
    check_command       check_windows_service!public!SQL Server \\(MSSQLSERVER\\)
    contact_groups      phoneops
}

define service {
    host_name           rv-atl-serverl01
    use                 rv-windows-service
    service_description VIRTUAL MEMORY USAGE
    check_command       check_snmp_misc!public!Virtual Memory!90!95
    contact_groups      phoneops
}

Host Config After

define host {
    host_name           rv-atl-serverl01
    alias               rv-atl-serverl01
    use                 mssql-server
    hostgroups          windows-server,mssql-server
    _SNMP               public
}
Step 2: Config Automation
• Most of our servers are Puppet managed (transitioning to Salt)
• Linux machines need to be self-aware of what they need to have monitored
• Linux servers use passive checks to propagate themselves up to Nagios
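As a rough illustration of what a passive check submission looks like, the sketch below formats a result in the syntax of the Nagios `PROCESS_SERVICE_CHECK_RESULT` external command. The talk does not say how results travel from the remote host to Nagios, so the transport is left out; the function name `passive_result` and the sample service are assumptions.

```python
import time

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def passive_result(host, service, code, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    # An external command ends at the newline, so flatten multi-line output.
    clean = " ".join(output.splitlines())
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{clean}"


# Hypothetical sample: a cron job on the host reports its own load check.
print(passive_result("rv-atl-serverl01", "Load", OK, "OK - load average: 0.40"))
```

A crontab entry on each server could run a script like this on a schedule and hand the line to whatever receiver sits in front of Nagios.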
[Diagram: config automation flow. Boxes: Config Manager (Puppet, Salt); Remote Host (NRPE config, passive crontab); Nagios / Webhook (Does this host exist?); steps Add Host, Verify, Restart, Notify. Arrow labels: Generate, Generate, Enforces, Process, Result.]
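The webhook side of this flow (does this host exist, add host, verify, restart, notify) might look roughly like the sketch below. It is not the author's implementation: the function `register_host`, the one-file-per-host layout, and the injected `verify`, `restart`, and `notify` callables are all assumptions. In a real setup `verify` would typically wrap a config check such as `nagios -v` on the main config file.

```python
from pathlib import Path

# Minimal host definition; the service checks come from the hostgroups.
HOST_TEMPLATE = """define host {{
    host_name   {name}
    alias       {name}
    use         {template}
    hostgroups  {hostgroups}
}}
"""


def register_host(name, template, hostgroups, conf_dir, verify, restart, notify):
    """Add a host config if it is missing; return True when a new host was added."""
    cfg = Path(conf_dir) / f"{name}.cfg"
    if cfg.exists():  # Does this host exist?
        return False
    # Add Host
    cfg.write_text(HOST_TEMPLATE.format(
        name=name, template=template, hostgroups=",".join(hostgroups)))
    # Verify: roll back so a broken file never reaches a restart.
    if not verify():
        cfg.unlink()
        notify(f"config check failed for {name}; host not added")
        return False
    restart()  # Restart
    notify(f"added {name}")  # Notify
    return True
```

Passing the verify and restart steps in as callables keeps the sketch testable without a running Nagios instance.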
Result
• CONSISTENCY!
• All Linux configs are now auto-generated
• Everything else is either cloned or generated from a custom webtool
• Maintenance time went from 10-20 hours per week to less than 1 hour most weeks
Step 3: Reporting / Visualization
• Need NOC-level visibility
• Need cluster-level views of performance data
• Historical view of state changes and notifications
[Diagram: reporting architecture. Sources: Nagios Core (CLT), Nagios Core (ATL). Data stores: Perfdata (Carbon), Ndoutils (MySQL), Event Server? (TODO). Front ends: Grafana Dashboards, Custom NOC Dashboard, Thruk UI.]
NOC Dashboard
Graphite + Grafana = Awesome
• Opted not to use Graphios
service_perfdata_file_template=$TIMET$\t$HOSTNAME$\t$SERVICEDESC$\t$SERVICEPERFDATA$
service_perfdata_file_processing_command=<customScript>
• Nagios writes to buffer file
• A custom script grabs the buffer and flushes it to Carbon
• Multiline socket write over UDP
• Will send 1000 data points in less than 0.03 seconds
• Carbon can scale far beyond anything we can throw at it
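A minimal sketch of such a flush script follows, assuming the tab-separated buffer format above and Carbon's plaintext protocol (`path value timestamp`, one metric per line) with the UDP listener enabled. The talk does not show the real script. The function names, the datacenter `prefix` argument, and the `datacenter.host.service.label` path layout are assumptions, the last inferred from the metric names shown in the Grafana examples.

```python
import re
import socket

# Matches label=value at the start of each perfdata item; labels may be
# single-quoted. Units and thresholds after the number are ignored.
PERF_RE = re.compile(r"(?:'([^']+)'|([^\s=']+))=(-?\d+(?:\.\d+)?)")


def sanitize(part):
    """Make a string safe to use as one Graphite path node."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", part.strip())


def to_carbon_lines(buffer_line, prefix):
    """Convert one buffered perfdata line into Carbon plaintext lines."""
    fields = buffer_line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return []  # skip malformed lines
    timet, host, service, perfdata = fields
    lines = []
    for quoted, bare, value in PERF_RE.findall(perfdata):
        label = quoted or bare
        path = ".".join([prefix, sanitize(host), sanitize(service), sanitize(label)])
        lines.append(f"{path} {value} {timet}")
    return lines


def flush(lines, host="127.0.0.1", port=2003):
    """Send all lines to Carbon as one newline-separated UDP datagram."""
    if not lines:
        return
    payload = ("\n".join(lines) + "\n").encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

UDP keeps the write fire-and-forget, so a slow or absent Carbon never blocks Nagios. A large buffer would need to be split into several datagrams to stay under the datagram size limit.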
Grafana
• Makes combining and templating graphs EASY
• Can combine all sorts of metrics on a graph and perform a variety of mathematical functions on them
atl.rv-atl-server*.CPU_Load.load5
*.rv-{atl,clt}-server*.CPU_Load.load
• Can set up new NOC dashboards in minutes
• Also using this for application monitoring data
• Allows us to easily spot performance anomalies
• Helps with event correlation
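To make the two target patterns above concrete, here is a small sketch of how the `*` wildcard and `{a,b}` alternation select metric names. It is not Graphite's own matcher, only an approximation of the two features used on the slide; the function names and the sample metric names in the usage lines are hypothetical.

```python
import re


def node_to_regex(node):
    """Translate one dot-separated pattern node into a regex fragment."""
    out = []
    i = 0
    while i < len(node):
        ch = node[i]
        if ch == "*":
            out.append("[^.]*")  # a wildcard never crosses a dot
            i += 1
        elif ch == "{":
            end = node.index("}", i)
            options = node[i + 1:end].split(",")
            out.append("(?:" + "|".join(re.escape(o) for o in options) + ")")
            i = end + 1
        else:
            out.append(re.escape(ch))
            i += 1
    return "".join(out)


def matches(pattern, metric):
    """Return True if the whole metric name matches the pattern."""
    regex = r"\.".join(node_to_regex(n) for n in pattern.split("."))
    return re.fullmatch(regex, metric) is not None


# Hypothetical metric names for illustration.
print(matches("atl.rv-atl-server*.CPU_Load.load5", "atl.rv-atl-server01.CPU_Load.load5"))
print(matches("*.rv-{atl,clt}-server*.CPU_Load.load", "clt.rv-clt-server07.CPU_Load.load"))
```

One pattern per graph target is what lets a single dashboard panel cover a whole cluster, or both datacenters, without listing hosts.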
This is OK
This is not OK
TODO
• Event Automation – create automatic response tasks to known issues with common fixes
• Better connectivity to application monitoring
• Adaptive monitoring for situations like this:
Implementation
• Left the old servers alone and running
• Spun up new servers with notifications and event handling disabled
• Migrated 600+ configs by hand
• 500+ generated automatically
• Problem states were perfect for identifying what wasn't set up yet
• Took about 6 weeks to migrate configs to new machines
• Launch day was changing Thruk's backend config to point to new servers, and switching over notifications
• Audit, design, migration, and stable implementation took about 90 days
Things I Learned
• Take the time to understand what you're monitoring
• Lack of understanding will produce alert noise, which is ineffective monitoring
• In a complex system, log everything
• Small changes do the most damage
• Automation is cool except for when it automatically sends 60% of your environment into a CPU death spiral
• I wish Nagios Core allowed hostgroup exclusions in service definitions (hint, hint)
• Nagios is still the best monitoring tool out there
Thank you!
Any Questions?