Sensu and Sensibility - PuppetConf 2014
Post on 29-Nov-2014
Sensu and Sensibility
Tomas Doran @bobtfish 2014-09-23
Cycle of failure and disappointment
• Manually edited and deployed monitoring
• Changes require two teams
• Low developer visibility about production
• Escalation of issues is hard
• Ops ignore alerts from services
• Postmortems
• High friction, low trust, low visibility.
“Normality”
- http://gunshowcomic.com/648
This is dysfunctional
Sensibility
“51% viewed their ERP implementation as unsuccessful”
The Robbins-Gioia Survey (2001)
“40% of the projects failed to achieve their business case within one year of going live”
The Conference Board Survey (2001)
• “17 percent of large IT projects go so badly that they can threaten the very existence of the company”
• “On average, large IT projects run 45 percent over budget and 7 percent over time, while delivering 56 percent less value than predicted”
McKinsey & Company in conjunction with the University of Oxford (2012)
Failure is an option
- blog.parasoft.com/single-greatest-barrier-with-sw-delivery
Why Sensu?
• Designed to be pluggable / extensible
• Arbitrary check metadata
• Simple model
• Components do exactly one thing
• Ruby
• Not afraid to extend (or fork!)
‘industry standard’ ‘enterprise class’
Cheap shot
status.dat cmd.dat
Centralized
How we use Sensu
• Don’t use all of this!
• ‘Standalone’ checks only
• Default in the Puppet module
Sensu data flow
• Sensu client runs checks on each machine
• Pushes results to RabbitMQ
• Clustered, clients/messages will fail over.
• Sensu server (multiple, HA)
• Processes check results, invokes handlers
• Writes state to Redis
• Redis + Sentinel
• Read by API (2 instances)
• All layers behind HAProxy
Quis custodiet ipsos custodes?
“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”
Mutually assured monitoring
• Multiple independent Sensu installs (per-datacenter)
• Monitor each other!
Machine readable config
• /etc/sensu/conf.d/checks/check_name.json
• Extensible with arbitrary metadata
• Hash merge
• Never edit by hand!
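The "hash merge" bullet can be illustrated with a small sketch: every JSON file under the config directory is loaded and folded into one dictionary, with nested keys merged rather than overwritten wholesale. This is only an approximation written for illustration. The function names `deep_merge` and `load_config` are hypothetical, and Sensu's own merge rules (for arrays in particular) may differ from what is shown here.

```python
import json
from pathlib import Path


def deep_merge(base, override):
    """Recursively merge two dicts; values from `override` win on conflict."""
    merged = dict(base)  # shallow copy so `base` is left untouched
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are hashes: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, or mismatched types: the later value replaces.
            merged[key] = value
    return merged


def load_config(conf_dir):
    """Merge every *.json file under conf_dir (sorted by path) into one dict."""
    config = {}
    for path in sorted(Path(conf_dir).rglob("*.json")):
        config = deep_merge(config, json.loads(path.read_text()))
    return config
```

Because each check lives in its own file and the result is a merge, a config management tool can add or remove a check by adding or removing a single file.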
monitoring_check
monitoring_check { 'systems-apache-external':
  page          => true,
  command       => "/usr/lib/nagios/plugins/check_tcp -H ${external_ip_address} -p 443",
  check_every   => '5m',
  alert_after   => '30m',
  realert_every => 10,
  runbook       => 'y/apache',
}
sensu::check
• monitoring_check wraps this
• Writes a JSON file for each check
• Comment safe
"disk_ro_mounts": {
  "standalone": true,
  "handlers": ["default"],
  "subscribers": [],
  "command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts",
  "interval": 60,
  "alert_after": 0,
  "realert_every": "-1",
  "dependencies": [],
  "runbook": "http://lmgtfy.com/?q=linux+read+only+disk",
  "annotation": "https://gitweb.yelpcorp.com/?p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80",
  "team": "operations",
  "irc_channels": "operations-notifications",
  "notification_email": "undef",
  "ticket": true,
  "project": "OPS",
  "page": false,
  "tip": false
}
Check scripts
• Same as Nagios checks
• Simple (text) output
• Exit code
• Result sent to server, along with check definition
• Including all the custom metadata
• Our handlers use the extra data.
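A check in the Nagios style boils down to one line of text plus an exit code, conventionally 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. The sketch below is a hypothetical disk usage check written for illustration, not one of the checks from the talk. The thresholds and the names `classify` and `run_check` are assumptions.

```python
import shutil

# Conventional Nagios-style exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_pct, warn_pct=80.0, crit_pct=90.0):
    """Map a usage percentage to (exit_status, one-line message)."""
    if used_pct >= crit_pct:
        return CRITICAL, f"CRITICAL: disk {used_pct:.0f}% full"
    if used_pct >= warn_pct:
        return WARNING, f"WARNING: disk {used_pct:.0f}% full"
    return OK, f"OK: disk {used_pct:.0f}% full"


def run_check(path="/"):
    """Print the one-line result for `path` and return the exit status."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"UNKNOWN: cannot read {path}: {exc}")
        return UNKNOWN
    if usage.total == 0:
        print(f"UNKNOWN: {path} reports zero total size")
        return UNKNOWN
    status, message = classify(100.0 * usage.used / usage.total)
    print(message)
    return status

# As a standalone script this would end with: sys.exit(run_check("/"))
```

The client only needs the printed line and the exit status. Everything else a handler acts on (team, runbook, paging policy) comes from the check definition that travels with the result.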
Handlers
• base
• JIRA
• email
• irc
• pagerduty
• awsprune
How do checks get run?
• Every machine runs the client.
• Client managed by Puppet
• Client has a TCP socket you can send JSON to
• Custom checks + pysensu-yelp
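As a rough illustration of the client socket, the sketch below builds a result and writes it as JSON to a TCP port on the local machine. It assumes the client listens on 127.0.0.1 port 3030 (the Sensu client default, which a given install may change) and that a result carries `name`, `status` and `output` keys. The helper names `build_result` and `send_result` are hypothetical and are not the pysensu-yelp API.

```python
import json
import socket


def build_result(name, status, output, **metadata):
    """Build a check result dict; extra keyword args ride along as custom metadata."""
    result = {"name": name, "status": status, "output": output}
    result.update(metadata)
    return result


def send_result(result, host="127.0.0.1", port=3030, timeout=5.0):
    """Send one JSON-encoded result to the local client socket (assumed address)."""
    payload = json.dumps(result).encode("utf-8")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
```

A cron job or application could then report its own state, for example `send_result(build_result("nightly_backup", 2, "CRITICAL: backup failed", team="operations", page=False))`, where the check name and metadata values are made up for the example.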
Situational awareness
Single source of truth
• DNS is canonical for Sensu servers
• Configure things in one place!
Automatic monitoring
• E.g. cron jobs - check successful recently!
• cron::d
Generate monitoring_check
User specified monitoring
• Data lives in the service config
• Next to the code to emit metrics!
• Next to metadata about SLAs and LB timeouts
• Simple checks for free!
• Developers can push without Ops
Cluster checks
• We’re working on this currently
• Assert some % of machines are healthy.
• Use to reduce alert noise.
• If a service becomes fully unavailable to clients, you want to page someone.
• If one machine goes belly up, you don’t (make a JIRA ticket for handling later!)
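The idea behind a cluster check can be sketched as a function that takes the latest status of one check across all hosts and alerts only when the healthy fraction drops below a threshold. The talk describes this as work in progress, so the code below is purely an illustration. The name `cluster_status`, the 50% default and the input shape are all assumptions.

```python
OK, CRITICAL = 0, 2


def cluster_status(host_statuses, min_healthy_pct=50.0):
    """Aggregate per-host results for one check.

    host_statuses maps hostname -> exit status of the same check on that host.
    Returns (exit_status, one-line message) for the cluster as a whole.
    """
    if not host_statuses:
        # No data at all is treated as a failure rather than silently passing.
        return CRITICAL, "CRITICAL: no results reported for cluster"
    total = len(host_statuses)
    healthy = sum(1 for status in host_statuses.values() if status == OK)
    pct = 100.0 * healthy / total
    summary = f"{healthy}/{total} hosts healthy ({pct:.0f}%)"
    if pct < min_healthy_pct:
        return CRITICAL, "CRITICAL: " + summary
    return OK, "OK: " + summary
```

With three hosts and one failing, `cluster_status({"web1": 0, "web2": 0, "web3": 2})` stays OK at the default threshold, so the single bad machine can be handled by a ticket while the page is reserved for the cluster-level result.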
WIP
• This is all still a work in progress.
• We’ve not 100% migrated off of Nagios
• Open sourcing the pieces
Thanks!
• Slides will be online shortly:
• slideshare.net/bobtfish
• @bobtfish
• Some (most?) of our code is open source:
• https://github.com/Yelp/sensu/commit/aa5c43c2fdfde5e8739952c0b8082000934f3ad2
• https://github.com/Yelp/puppet-monitoring_check
• https://github.com/Yelp/puppet-netstdlib
• https://github.com/Yelp/sensu_handlers
• https://github.com/Yelp/pysensu-yelp