Sensu and Sensibility - PuppetConf 2014

Post on 29-Nov-2014

DESCRIPTION

As the Yelp infrastructure and engineering team grew, so did the pain of managing Nagios. Problems like splitting alerting across multiple teams, providing high availability, and managing Nagios systems in multiple environments had become pressing. As we grew towards a service-oriented architecture and pushed some services out into the cloud, we rapidly needed more automated monitoring configuration. An evolutionary solution wasn’t going to solve all of our problems; we needed to revolutionize our monitoring. Sensu is built from the ground up to solve many of our issues and to be easy to extend. This talk covers our Puppet ‘monitoring_check’ API (which sets up monitoring for our services within Puppet), how and why we deploy Sensu, our custom handlers and escalations, how we provide automatic ‘self service’ monitoring for dynamic services, and how we deal with the challenges posed by the more ephemeral nature of cloud architectures.

TRANSCRIPT

Sensu and Sensibility

Tomas Doran @bobtfish 2014-09-23

Cycle of failure and disappointment

• Manually edited and deployed monitoring
• Changes require two teams
• Low developer visibility about production
• Escalation of issues is hard
• Ops ignore alerts from services
• Postmortems
• High friction, low trust, low visibility.

“Normality”

http://gunshowcomic.com/648

This is dysfunctional

Sensibility

“51% viewed their ERP implementation as unsuccessful”
The Robbins-Gioia Survey (2001)

“40% of the projects failed to achieve their business case within one year of going live”
The Conference Board Survey (2001)

• “17 percent of large IT projects go so badly that they can threaten the very existence of the company”
• “On average, large IT projects run 45 percent over budget and 7 percent over time, while delivering 56 percent less value than predicted”
McKinsey & Company in conjunction with the University of Oxford (2012)

Failure is an option

blog.parasoft.com/single-greatest-barrier-with-sw-delivery

Why Sensu?

• Designed to be pluggable / extensible
• Arbitrary check metadata
• Simple model
• Components do exactly one thing
• Ruby
• Not afraid to extend (or fork!)

‘industry standard’ ‘enterprise class’

Cheap shot

status.dat cmd.dat

Centralized

How we use Sensu

• Don’t use all of this!
• ‘Standalone’ checks only
• Default in the puppet module

Sensu data flow

• Sensu client runs checks on each machine
• Pushes results to RabbitMQ
• Clustered, clients/messages will fail over.

• Sensu server (multiple, HA)
• Processes check results, invokes handlers
• Writes state to Redis

• Redis + Sentinel
• Read by API (2 instances)

• All layers behind HAProxy
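The data flow above can be pictured with a small in-process simulation. This is only a sketch under stated assumptions: a `queue.Queue` stands in for RabbitMQ, a plain dict stands in for Redis, and `client_publish`, `server_process` and `print_handler` are hypothetical names, not Sensu APIs.

```python
import json
import queue

# Stand-ins for the real components (assumptions for illustration only):
# a queue.Queue plays the role of RabbitMQ, a dict plays the role of Redis.
transport = queue.Queue()
state_store = {}


def client_publish(host, check_name, status, output):
    """Client side: push one check result onto the transport."""
    result = {"client": host, "check": check_name,
              "status": status, "output": output}
    transport.put(json.dumps(result))


def server_process(handlers):
    """Server side: drain the transport, record state, invoke handlers.

    Simplification: handlers are only invoked for non-zero statuses.
    """
    handled = []
    while not transport.empty():
        result = json.loads(transport.get())
        state_store[(result["client"], result["check"])] = result["status"]
        if result["status"] != 0:
            for handler in handlers:
                handled.append(handler(result))
    return handled


def print_handler(result):
    """Hypothetical handler: format a one-line notification."""
    return "{client}/{check}: {output}".format(**result)


client_publish("web1", "check_disk", 0, "DISK OK")
client_publish("web2", "check_disk", 2, "DISK CRITICAL")
notifications = server_process([print_handler])
print(notifications)
```

The point of the sketch is the separation of roles: clients only publish, servers only consume and record, and anything that reads state goes through the store rather than through the clients.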

Quis custodiet ipsos custodes? (Who will guard the guards themselves?)

“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”

Mutually assured monitoring

• Multiple independent Sensu installs (per-datacenter)
• Monitor each other!

Machine readable config

• /etc/sensu/conf.d/checks/check_name.json

• Extensible with arbitrary metadata

• Hash merge

• Never edit by hand!
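
To illustrate what "hash merge" means for per-check JSON files, here is a minimal sketch that assumes nested objects are merged recursively and that later fragments win on conflicting scalar keys. `deep_merge` is a hypothetical helper, and the sketch does not try to reproduce Sensu's exact merge rules (for example, how arrays are combined).

```python
import copy


def deep_merge(base, override):
    """Return a new dict where nested dicts are merged key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Two fragments, as if loaded from two files under conf.d/ (made-up content).
fragment_a = {"checks": {"check_disk": {"command": "check_disk -w 10%",
                                        "interval": 60}}}
fragment_b = {"checks": {"check_disk": {"team": "operations"}}}

config = deep_merge(fragment_a, fragment_b)
print(config)
```

Because the merged result is built from many small generated files, hand edits would be overwritten or silently shadowed, which is one reason to leave the files to configuration management.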


monitoring_check

monitoring_check { 'systems-apache-external':
  page          => true,
  command       => "/usr/lib/nagios/plugins/check_tcp -H ${external_ip_address} -p 443",
  check_every   => '5m',
  alert_after   => '30m',
  realert_every => 10,
  runbook       => 'y/apache',
}
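As a rough illustration of what a wrapper like this has to do, the sketch below turns human-friendly durations such as '5m' into seconds and assembles a check definition. It assumes `check_every` maps to the check's `interval` in seconds; `to_seconds` and `build_check` are hypothetical helpers written for this example, not the module's real implementation.

```python
import re

# Seconds per unit suffix (the set of suffixes is an assumption).
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(value):
    """Convert '5m' style durations to seconds; integers pass through."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if match is None:
        raise ValueError("unrecognised duration: %r" % (value,))
    return int(match.group(1)) * _UNITS[match.group(2)]


def build_check(name, command, check_every, alert_after,
                realert_every, runbook, page=False):
    """Assemble a check definition keyed by check name."""
    return {name: {
        "standalone": True,
        "command": command,
        "interval": to_seconds(check_every),
        "alert_after": to_seconds(alert_after),
        "realert_every": realert_every,
        "runbook": runbook,
        "page": page,
    }}


check = build_check(
    "systems-apache-external",
    "/usr/lib/nagios/plugins/check_tcp -H 192.0.2.10 -p 443",  # example IP
    check_every="5m", alert_after="30m", realert_every=10,
    runbook="y/apache", page=True,
)
print(check)
```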

sensu::check

• monitoring_check wraps this

• Writes a JSON file for each check

• Comment safe

"disk_ro_mounts": {
  "standalone": true,
  "handlers": ["default"],
  "subscribers": [],
  "command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts",
  "interval": 60,
  "alert_after": 0,
  "realert_every": "-1",
  "dependencies": [],
  "runbook": "http://lmgtfy.com/?q=linux+read+only+disk",
  "annotation": "https://gitweb.yelpcorp.com/?p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80",
  "team": "operations",
  "irc_channels": "operations-notifications",
  "notification_email": "undef",
  "ticket": true,
  "project": "OPS",
  "page": false,
  "tip": false
}

Check scripts

• Same as Nagios checks
• Simple (text) output
• Exit code

• Result sent to server, along with check definition
• Including all the custom metadata
• Our handlers use the extra data.
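A check script in this style only needs to print a line of text and exit with the conventional Nagios plugin code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The sketch below is an illustrative free-disk check; the thresholds and the `classify` and `check_disk_free` names are assumptions made for this example.

```python
import shutil

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(free_pct, warn_pct=20.0, crit_pct=10.0):
    """Map a free-space percentage to (exit_code, one-line message)."""
    if free_pct < crit_pct:
        return CRITICAL, "CRITICAL: %.1f%% free" % free_pct
    if free_pct < warn_pct:
        return WARNING, "WARNING: %.1f%% free" % free_pct
    return OK, "OK: %.1f%% free" % free_pct


def check_disk_free(path):
    """Measure free space on path and classify it."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, "UNKNOWN: cannot inspect %s: %s" % (path, exc)
    if usage.total == 0:
        return UNKNOWN, "UNKNOWN: %s reports zero size" % path
    return classify(100.0 * usage.free / usage.total)


status, message = classify(35.0)
print(status, message)
```

When used as a real check, the script would print `message` and call `sys.exit(status)` with the result of `check_disk_free`, so the client can pick up both the text and the exit code.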

Handlers

• base
• JIRA
• email
• irc
• pagerduty
• awsprune
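To show how handlers can use the custom metadata carried with each result, here is a hedged sketch of a routing decision. It assumes the event carries the check definition (including fields such as `page`, `ticket`, `project`, `irc_channels` and `notification_email` from the example check) under a `check` key. `plan_notifications` is a hypothetical function for illustration; it is written in Python and is not the code of the real handlers.

```python
def plan_notifications(event):
    """Return a list of (handler_name, target) pairs for one event."""
    check = event["check"]
    actions = []
    if check.get("status", 0) == 0:
        return actions  # nothing to notify about when the check is OK
    if check.get("page"):
        actions.append(("pagerduty", check.get("team")))
    if check.get("ticket"):
        actions.append(("jira", check.get("project")))
    channels = check.get("irc_channels") or []
    if isinstance(channels, str):
        channels = [channels]
    for channel in channels:
        actions.append(("irc", channel))
    email = check.get("notification_email")
    if email and email != "undef":
        actions.append(("email", email))
    return actions


event = {
    "client": {"name": "web1"},
    "check": {
        "name": "disk_ro_mounts", "status": 2, "output": "CRITICAL",
        "team": "operations", "irc_channels": "operations-notifications",
        "notification_email": "undef", "ticket": True, "project": "OPS",
        "page": False,
    },
}
actions = plan_notifications(event)
print(actions)
```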

How do checks get run?

• Every machine runs the client.

• Client managed by puppet

• Client has a TCP socket you can send JSON to

• Custom checks + pysensu-yelp
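
A custom check can report its own result by writing a small JSON document to the client's local TCP socket. The sketch below assumes the socket listens on 127.0.0.1 port 3030 (verify this against your client configuration) and that a result needs at least `name`, `status` and `output`. `build_result` and `send_result` are hypothetical helpers, not the pysensu-yelp API.

```python
import json
import socket


def build_result(name, status, output, **metadata):
    """Build a check result; extra keyword arguments become custom metadata."""
    result = {"name": name, "status": status, "output": output}
    result.update(metadata)
    return result


def send_result(result, host="127.0.0.1", port=3030, timeout=5.0):
    """Send one JSON-encoded result to the local client socket."""
    payload = json.dumps(result).encode("utf-8")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


sample = build_result("my_service_queue_depth", 1, "WARNING: queue depth 1200",
                      team="operations", runbook="y/my_service")
print(json.dumps(sample))
# To actually submit it on a host running the client: send_result(sample)
```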


Situational awareness

Single source of truth

• DNS is canonical for Sensu servers
• Configure things in one place!

Automatic monitoring

• E.g. cron jobs - check successful recently!
• cron::d

Generate monitoring_check
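One way a "did this cron job succeed recently?" check could work is sketched below. The slides do not say how cron::d implements it, so this assumes the job touches a marker file when it succeeds and the check compares that file's modification time with a maximum age. `check_recent_success` and the marker-file convention are hypothetical.

```python
import os
import tempfile
import time

OK, CRITICAL = 0, 2


def check_recent_success(marker_path, max_age_seconds, now=None):
    """Return (exit_code, message) based on the marker file's age."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(marker_path)
    except OSError:
        return CRITICAL, "CRITICAL: no success marker at %s" % marker_path
    if age > max_age_seconds:
        return CRITICAL, "CRITICAL: last success %d seconds ago" % age
    return OK, "OK: last success %d seconds ago" % age


# Demonstration with a temporary file standing in for the marker.
with tempfile.NamedTemporaryFile() as marker:
    fresh = check_recent_success(marker.name, 3600)
    stale = check_recent_success(marker.name, 3600, now=time.time() + 7200)

print(fresh)
print(stale)
```

Because the check only depends on the job name and its schedule, a definition like this can be generated automatically alongside the cron entry instead of being written by hand.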

User specified monitoring

• Data lives in the service config
• Next to the code to emit metrics!
• Next to metadata about SLAs and LB timeouts
• Simple checks for free!
• Developers can push without Ops

Cluster checks

• We’re working on this currently
• Assert some % of machines are healthy.
• Use to reduce alert noise.
• If a service becomes fully unavailable to clients, you want to page someone.
• If one machine goes belly up, you don’t (make a JIRA ticket for handling later!)
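The cluster-check idea can be sketched as a function over per-host statuses. Since the slides describe this as unfinished work, everything here is an assumption made for illustration: the 50% default threshold, the `evaluate_cluster` name, and the shape of the returned dict.

```python
def evaluate_cluster(host_statuses, min_healthy_pct=50.0):
    """Decide whether to page, ticket, or stay quiet for a cluster.

    host_statuses maps host name -> check exit code (0 means healthy).
    """
    total = len(host_statuses)
    if total == 0:
        return {"status": 3, "page": False, "ticket": False,
                "output": "UNKNOWN: no hosts reporting"}
    healthy = sum(1 for code in host_statuses.values() if code == 0)
    pct = 100.0 * healthy / total
    summary = "%d/%d hosts healthy (%.0f%%)" % (healthy, total, pct)
    if pct < min_healthy_pct:
        # Too few healthy machines: the service is at risk, page someone.
        return {"status": 2, "page": True, "ticket": False,
                "output": "CRITICAL: " + summary}
    if healthy < total:
        # Some machines are down but the service holds: file a ticket.
        return {"status": 1, "page": False, "ticket": True,
                "output": "WARNING: " + summary}
    return {"status": 0, "page": False, "ticket": False,
            "output": "OK: " + summary}


one_down = evaluate_cluster({"web1": 0, "web2": 0, "web3": 2, "web4": 0})
all_down = evaluate_cluster({"web1": 2, "web2": 2})
print(one_down["output"])
print(all_down["output"])
```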

WIP

• This is all still a work in progress.

• We’ve not 100% migrated off of Nagios

• Open sourcing the pieces


Thanks!

• Slides will be online shortly:
• slideshare.net/bobtfish
• @bobtfish

• Some (most?) of our code is open source:
• https://github.com/Yelp/sensu/commit/aa5c43c2fdfde5e8739952c0b8082000934f3ad2
• https://github.com/Yelp/puppet-monitoring_check
• https://github.com/Yelp/puppet-netstdlib
• https://github.com/Yelp/sensu_handlers
• https://github.com/Yelp/pysensu-yelp