Distributed monitoring at Hyves - Puppet
Welcome
Jeffrey Lensen
System Engineer
1
Hyves Infrastructure
3000+ Gentoo servers
190 function groups/types
3 datacenters
Database for server management
2
Using Puppet
Since: January 2007
Puppetmasters: 3 load-balanced, 1 for CA and development
Version: 2.6.1
MySQL backend for (thin_)storeconfigs
Nginx + 8 Mongrel instances per server
100+ modules
nodes.rb uses management database
Puppet run every morning on every server
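The nodes.rb mentioned above is presumably an external node classifier: Puppet calls such a script with a node name and reads YAML back. The following is a minimal sketch under that assumption. The MANAGEMENT_DB hash stands in for the real management database, and the host names, attributes and class names are invented for illustration.

```ruby
require 'yaml'

# Hypothetical stand-in for the management database: hostname => attributes.
MANAGEMENT_DB = {
  'web001' => { 'role' => 'webserver', 'systemstatus' => 'operational' },
  'db001'  => { 'role' => 'database',  'systemstatus' => 'fail' }
}.freeze

# Build the classification hash Puppet expects from an external node
# classifier: a list of classes plus top-scope parameters.
def classify(hostname, db = MANAGEMENT_DB)
  record = db[hostname]
  return nil if record.nil?

  {
    'classes'    => ["role::#{record['role']}"],
    'parameters' => {
      'role'         => record['role'],
      'systemstatus' => record['systemstatus']
    }
  }
end

# Puppet passes the node name as the first argument and parses YAML on stdout.
if __FILE__ == $PROGRAM_NAME
  node = classify(ARGV.fetch(0, 'web001'))
  abort('unknown node') if node.nil?
  puts node.to_yaml
end
```

Parameters returned this way become top-scope variables, which is one way values such as $role and $systemstatus could reach the manifests.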
3
Nagios
8 Nagios hosts in distributed setup
2500 hosts
1 Nagios master for web and alerting
Scripts to generate configuration
Management database for information
Templates for service checks
Died during large fallouts
4
Icinga
Switched to Icinga November 2010
Distributed Icinga setup doesn’t require centralized host
Very fast standalone Icinga-web interface
Uses database backend
REST API
Switching was easy due to similar configuration
5
Current monitoring setup
Monitoring hosts: 12 (4 per DC)
Services: over 83,000
Hosts: nearly 3,500
Average check interval: every 5 min
NOC monitoring host: 1
Overview checks using API
Command-line interface
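To put those numbers in perspective, here is a rough back-of-the-envelope calculation. It assumes every service runs at the average 5 minute interval and that the load is spread evenly over the 12 monitoring hosts, neither of which the slide states.

```ruby
# Figures from the slide; the even spread across hosts is an assumption.
SERVICES         = 83_000
MONITORING_HOSTS = 12
CHECK_INTERVAL_S = 5 * 60

# Average number of checks that must start each second.
def checks_per_second(services, interval_s)
  services.to_f / interval_s
end

total    = checks_per_second(SERVICES, CHECK_INTERVAL_S)
per_host = total / MONITORING_HOSTS

puts format('total: %.1f checks/s, per monitoring host: %.1f checks/s', total, per_host)
```

Under these assumptions the setup executes roughly 277 checks per second in total, or about 23 per monitoring host.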
6
Problems with monitoring
Adding new checks meant manually editing a lot of templates
Things that should be monitored aren’t
Won’t realize it until it’s too late
No monitoring makes it harder to find the problem
7
Using Puppet to configure Icinga
Puppet knows it all so why not use that information?
Exported resources from Naginator to define monitoring checks
Include the monitoring definitions in profiles
Running Puppet defines all necessary monitoring checks for that host
8
Example
modules/monitoring/manifests/init.pp:
class monitoring {
  service { "nrpe":
    ensure => running,
    enable => true
  }

  @@nagios_host { "$hostname":
    address => $ip
  }

  @@nagios_service { "NRPE $hostname":
    service_description => "NRPE",
    check_command       => "check_nrpe_scripts",
  }
}
Appending $hostname in nagios_service definition to prevent duplicate definitions on monitoring hosts
9
Example Nginx
modules/nginx/manifests/init.pp:
class nginx {
  service { "nginx":
    ensure => running,
    enable => true
  }

  @@nagios_service { "HTTP $hostname":
    service_description => "HTTP",
    check_command       => "check_http",
    event_handler       => "service_restart!nginx",
    contact_groups      => "admins_email, admins_sms"
  }
}
Automatically create HTTP checks when including Nginx
10
Predefining and distributing
manifests/defines.pp:
$__notifications_enabled = $systemstatus ? {
  operational => "1",
  fail        => "0"
}

Nagios_host {
  ensure                => present,
  host_name             => "${hostname}.${domain}",
  hostgroups            => $role,
  use                   => "generic-host", # our standard template
  alias                 => $hostname,
  notifications_enabled => $__notifications_enabled,
  target                => "/etc/icinga/puppetgenerated/hosts/$hostname.cfg",
  notes                 => $monitoringhost
}

Nagios_service {
  ensure                => present,
  host_name             => "${hostname}.${domain}",
  use                   => "generic-service", # our standard template
  notifications_enabled => $__notifications_enabled,
  target                => "/etc/icinga/puppetgenerated/services/$hostname.cfg",
  notes                 => $monitoringhost
}
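The resource defaults above are merged with whatever each exported resource sets itself, and the result ends up as an object definition in the target file. The sketch below illustrates that merge and a plain rendering of the result. It is not Naginator's actual output format, the helper names and the example.org domain are hypothetical, and treating every status other than "operational" as "0" is an assumption (the slide only lists operational and fail).

```ruby
# Defaults in the spirit of the Nagios_service { ... } block above.
def service_defaults(hostname, domain, systemstatus)
  {
    'host_name'             => "#{hostname}.#{domain}",
    'use'                   => 'generic-service',
    # Assumption: anything that is not "operational" disables notifications.
    'notifications_enabled' => systemstatus == 'operational' ? '1' : '0'
  }
end

# Render one Nagios/Icinga object definition as plain text.
def render_object(type, attributes)
  width = attributes.keys.map(&:length).max
  body  = attributes.map { |key, value| format("  %-#{width}s  %s", key, value) }
  (["define #{type} {"] + body + ['}']).join("\n") + "\n"
end

# Per-resource attributes override or extend the defaults.
attrs = service_defaults('web001', 'example.org', 'operational')
        .merge('service_description' => 'HTTP', 'check_command' => 'check_http')
puts render_object('service', attrs)
```

Keeping the shared attributes in one place is what lets a module such as nginx export a check with only a description and a command.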
11
Retrieving exported resources
modules/icinga/manifests/init.pp:
class icingacollect {
  Nagios_host <<| notes == "$hostname" |>> {
    require => File["/etc/icinga/puppetgenerated/hosts"]
  }
  Nagios_service <<| notes == "$hostname" |>> {
    require => File["/etc/icinga/puppetgenerated/services"]
  }
}
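Conceptually, the collectors above are a filter over everything stored in the storeconfigs database: each monitoring host picks up only the resources whose notes attribute carries its own hostname. A minimal sketch of that selection, with a hypothetical ExportedResource struct and invented host names:

```ruby
# Hypothetical exported resources as they might sit in the storeconfigs
# database; "notes" carries the name of the responsible monitoring host.
ExportedResource = Struct.new(:type, :title, :notes)

# Equivalent of `Type <<| notes == "$hostname" |>>` for one resource type.
def collect_for(monitoring_host, resources, type)
  resources.select { |r| r.type == type && r.notes == monitoring_host }
end

exported = [
  ExportedResource.new('nagios_host',    'web001',      'mon01'),
  ExportedResource.new('nagios_service', 'HTTP web001', 'mon01'),
  ExportedResource.new('nagios_service', 'HTTP web002', 'mon02')
]

collect_for('mon01', exported, 'nagios_service').each { |r| puts r.title }
```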
12
Why not Tags?
Using “notes” to assign monitoring host
Tagging caused problems when setting require in Nagios_host and Nagios_service
Tagging meant redefining, it’s not inherited
Solution: stages (?)
13
Fail-safes
modules/icinga/manifests/init.pp:
class icinga {
  include icingacollect

  # Only continue when Icinga accepts the freshly generated configuration
  exec { "verify new cfg":
    command => "/usr/bin/icinga -v /etc/icinga/verify-puppetgenerated.cfg",
    require => Class["icingacollect"]
  }

  exec { "mv cfgs":
    command => "rm -rf /etc/icinga/puppet/* ; mv /etc/icinga/puppetgenerated/* /etc/icinga/puppet/",
    require => Exec["verify new cfg"]
  }

  exec { "restart icinga":
    command => "/usr/bin/printf '[] RESTART_PROGRAM\n' > /var/icinga/rw/icinga.cmd",
    require => [ Exec["mv cfgs"], Service["icinga"] ]
  }
}
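The require chain above amounts to "verify, then move, then restart, and stop at the first failure". The sketch below mirrors that ordering with stand-in steps and also shows how an external command line is formatted: Nagios and Icinga expect a Unix timestamp in the brackets before the command name. The helper names are hypothetical, and the lambdas replace the real icinga -v and mv calls.

```ruby
# Format an external command line: "[<unix time>] <COMMAND>[;arg;...]".
def external_command(name, args = [], now = Time.now)
  (["[#{now.to_i}] #{name}"] + args.map(&:to_s)).join(';') + "\n"
end

# Run the steps in order and stop at the first one that fails, mirroring the
# require chain: verify -> move -> restart. Returns the completed step names.
def run_chain(steps)
  completed = []
  steps.each do |name, step|
    break unless step.call
    completed << name
  end
  completed
end

steps = [
  ['verify new cfg', -> { true }], # stand-in for `icinga -v ...`
  ['mv cfgs',        -> { true }], # stand-in for the rm/mv of generated files
  ['restart icinga', -> { print external_command('RESTART_PROGRAM'); true }]
]
p run_chain(steps)
```

If the verification step returns false, nothing is moved and Icinga keeps running with its previous configuration.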
14
Deploying monitoring
Deploy script starts Puppet run on all monitoring hosts
Threaded, with a small sleep between starts, to prevent a thundering herd on the Puppet masters
Waits for all puppet runs to finish and reports whether they were successful or not
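A deploy script along these lines could look roughly like the sketch below. It is not the actual Hyves script: the deploy function, the runner callback (which in practice would trigger a Puppet run on the given host) and the host names are all hypothetical.

```ruby
# Start one run per monitoring host, staggered by `delay` seconds, then wait
# for all of them and report success per host.
def deploy(hosts, delay: 0.5, runner:)
  threads = hosts.map do |host|
    thread = Thread.new do
      begin
        [host, runner.call(host) ? true : false]
      rescue StandardError
        [host, false] # a crashing run counts as a failed run
      end
    end
    sleep(delay) # stagger the starts to spare the Puppet masters
    thread
  end
  threads.map(&:value).to_h
end

# Stand-in runner: pretend the run on mon03 fails.
fake_runner = ->(host) { host != 'mon03' }

results = deploy(%w[mon01 mon02 mon03], delay: 0.01, runner: fake_runner)
results.each { |host, ok| puts "#{host}: #{ok ? 'ok' : 'FAILED'}" }
```

The sleep sits between thread starts rather than inside the threads, so the runs overlap but never begin at the same instant.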
15
Downsides
Puppet run on Icinga hosts takes about 20 minutes (using separate config files for each host helps)
Modifying a servicecheck requires a puppet run on all hosts with that servicecheck (solution: use --noop)
Cleaning up old resources
16
Cleaning up
$fqdn = $host_to_be_removed.$domain
puppet apply --certname $fqdn --node_name facter --thin_storeconfigs $dbsettings --execute 'resources { ["nagios_service","nagios_host"]: purge => true }'
17
What if something isn’t running Puppet?
Configcheck check
Compares management database with Icinga API
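At its core such a comparison is a set difference in both directions. A minimal sketch, assuming both sources can be reduced to plain lists of hostnames; the function name and the sample hosts are hypothetical.

```ruby
require 'set'

# Compare the hosts the management database knows about with the hosts
# Icinga actually monitors, and report the differences both ways.
def config_diff(managed_hosts, monitored_hosts)
  managed   = managed_hosts.to_set
  monitored = monitored_hosts.to_set
  {
    missing_from_monitoring: (managed - monitored).sort,
    unknown_to_management:   (monitored - managed).sort
  }
end

diff = config_diff(%w[web001 web002 db001], %w[web001 db001 old042])
diff.each { |kind, hosts| puts "#{kind}: #{hosts.join(', ')}" }
```

Hosts in the first list are the ones that are probably not running Puppet. Hosts in the second list are candidates for the clean-up step.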
18
Other cool stuff
Generating daemon checks
modules/role/lib/facter/customfacters.rb:
Facter.add("hyves_daemons") do
  setcode do
    # Fall back to "None" when the host has no daemon configuration
    daemons = ["None"]
    if File.exist?("/<path_to_config>/daemons.conf")
      # Keep only the value of every "* name:" line
      daemonconf = %x{grep name /<path_to_config>/daemons.conf}
      daemons = daemonconf.each_line.map { |daemon|
        daemon.sub(/.*\* name:/, '').chomp
      }.uniq
    end
    daemons
  end
end
19
Other cool stuff
Generating daemon checks
modules/daemons/manifests/init.pp:
class daemons {
  define add_daemon_check {
    @@nagios_service { "$name Daemon $hostname":
      use                 => "Daemon-check",
      service_description => "$name Daemon",
      check_command       => "check_daemon!$name"
    }
  }

  add_daemon_check { $hyves_daemons: }
}
20
Other cool stuff
Generating overview daemon checks
require 'net/http'

module Puppet::Parser::Functions
  newfunction(:get_daemons, :type => :rvalue, :doc => "\
    This function returns an array of all current hyves_autodaemons,
    based on the Icinga API
  ") do |args|
    domain = "<domain_of_icinga_web>"
    url = "/icinga-web/web/api/service/filter[AND(SERVICE_NAME%7Clike%7C*Daemon)]/columns[SERVICE_NAME]/order[SERVICE_NAME;ASC]/authkey=<api_key>/json"
    response = Net::HTTP.get_response(domain, url)
    data = response.body
    results = PSON.parse(data)
    # Strip the " Daemon" suffix so only the daemon names remain
    daemons = Array.new
    results.each { |result|
      daemon = result['SERVICE_NAME']
      daemon.sub!(/ Daemon/, '')
      daemons << daemon
    }
    daemons.uniq
  end
end
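The parsing step can be tried without a running Icinga. The sketch below assumes the API returns a JSON array of objects with a SERVICE_NAME key, which is inferred from how the function above iterates over the result. The sample payload is invented, the standard json library stands in for PSON, and the suffix match is anchored to the end of the name, which is slightly stricter than the original pattern.

```ruby
require 'json'

# Extract unique daemon names from an API response body.
def daemon_names(body)
  JSON.parse(body)
      .map { |row| row['SERVICE_NAME'].sub(/ Daemon\z/, '') }
      .uniq
end

# Invented sample payload: the same daemon reported by two hosts.
sample = '[{"SERVICE_NAME":"chat Daemon"},{"SERVICE_NAME":"mail Daemon"},{"SERVICE_NAME":"chat Daemon"}]'
p daemon_names(sample)
```

Duplicates disappear because the same daemon check exists once per host, while the overview needs each daemon only once.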
21
Other cool stuff
Generating overview daemon checks
modules/icinga/manifests/noc.pp:
$__daemons = get_daemons()
templatefile { "/etc/icinga/puppetgenerated/other/daemons.cfg":
  template => template("icinga/daemons.cfg.erb")
}
hyvesdaemons.cfg.erb:
define host{
  use        generic-host
  host_name  daemons
  alias      daemons
  address    www.hyves.nl
}

<% __daemons.each do |daemon| -%>
define service{
  use                  DaemonOverview-check
  host_name            daemons
  service_description  <%= daemon %>
}
<% end -%>
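The service loop of the template can be rendered outside Puppet to see what it produces. This is a sketch using Ruby's standard ERB library (Ruby 2.6 or newer for the trim_mode keyword). The variable is passed in as daemons rather than the Puppet scope variable __daemons, and the daemon names are invented.

```ruby
require 'erb'

# The service loop from the template, with "-%>" trimming the newline after
# each control tag so no blank lines end up in the generated config.
TEMPLATE = <<~'ERB_TEMPLATE'
  <% daemons.each do |daemon| -%>
  define service{
    use                  DaemonOverview-check
    host_name            daemons
    service_description  <%= daemon %>
  }
  <% end -%>
ERB_TEMPLATE

# Render one service definition per daemon name.
def render_daemons(daemons)
  ERB.new(TEMPLATE, trim_mode: '-').result_with_hash(daemons: daemons)
end

puts render_daemons(%w[chat mail])
```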
22