Sensu and Sensibility - PuppetConf 2014

Sensu and Sensibility Tomas Doran @bobtfish 20140923

Upload: tomas-doran

Post on 29-Nov-2014


DESCRIPTION

As the Yelp infrastructure and engineering team grew, so did the pain of managing Nagios. Problems like splitting alerting across multiple teams, providing high availability, and managing Nagios systems in multiple environments had become pressing. As we grew towards a service-oriented architecture and pushed some services out into the cloud, we rapidly needed more automated monitoring configuration. An evolutionary solution wasn’t going to solve all of our problems; we needed to revolutionize our monitoring. Sensu is built from the ground up to solve many of our issues and be easy to extend. This talk covers our Puppet ‘monitoring_check’ API (which sets up monitoring for our services within Puppet), how and why we deploy Sensu, our custom handlers and escalations, how we provide automatic ‘self service’ monitoring for dynamic services, and how we deal with the challenges posed by the more ephemeral nature of cloud architectures.

TRANSCRIPT

Page 1: Sensu and Sensibility - Puppetconf 2014

Sensu and Sensibility

Tomas Doran @bobtfish 2014-09-23

Page 2: Sensu and Sensibility - Puppetconf 2014

Sensu and Sensibility

Page 7: Sensu and Sensibility - Puppetconf 2014

Cycle of failure and disappointment

• Manually edited and deployed monitoring
• Changes require two teams
• Low developer visibility about production

• Escalation of issues is hard
• Ops ignore alerts from services
• Postmortems

• High friction, low trust, low visibility.

Page 9: Sensu and Sensibility - Puppetconf 2014

“Normality”

http://gunshowcomic.com/648

This is dysfunctional

Page 10: Sensu and Sensibility - Puppetconf 2014

Sensibility

Page 12: Sensu and Sensibility - Puppetconf 2014

“51% viewed their ERP implementation as unsuccessful”

The Robbins-Gioia Survey (2001)

Page 13: Sensu and Sensibility - Puppetconf 2014

“40% of the projects failed to achieve their business case within one year of going live”

The Conference Board Survey (2001)

Page 14: Sensu and Sensibility - Puppetconf 2014

• “17 percent of large IT projects go so badly that they can threaten the very existence of the company”

• “On average, large IT projects run 45 percent over budget and 7 percent over time, while delivering 56 percent less value than predicted”

McKinsey & Company in conjunction with the University of Oxford (2012)

Page 15: Sensu and Sensibility - Puppetconf 2014

Failure is an option

blog.parasoft.com/single-greatest-barrier-with-sw-delivery

Page 16: Sensu and Sensibility - Puppetconf 2014

Sensibility

Page 18: Sensu and Sensibility - Puppetconf 2014

Why Sensu?

• Designed to be pluggable / extensible
• Arbitrary check metadata
• Simple model
• Components do exactly one thing
• Ruby
• Not afraid to extend (or fork!)

Page 19: Sensu and Sensibility - Puppetconf 2014

‘industry standard’ ‘enterprise class’

Page 20: Sensu and Sensibility - Puppetconf 2014

Cheap shot

Page 22: Sensu and Sensibility - Puppetconf 2014

status.dat cmd.dat

Page 23: Sensu and Sensibility - Puppetconf 2014

cmd.dat

Page 24: Sensu and Sensibility - Puppetconf 2014

Centralized

Page 25: Sensu and Sensibility - Puppetconf 2014

25

Page 26: Sensu and Sensibility - Puppetconf 2014

How we use Sensu

• Don’t use all of this!
• ‘Standalone’ checks only
• Default in the puppet module

Page 27: Sensu and Sensibility - Puppetconf 2014

Sensu data flow

• Sensu client runs checks on each machine
• Pushes results to RabbitMQ
• Clustered, clients/messages will fail over.

• Sensu server (multiple, HA)
• Processes check results, invokes handlers
• Writes state to Redis

• Redis + Sentinel
• Read by API (2 instances)

• All layers behind HAProxy
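The data flow on this slide can be sketched as a toy pipeline. This is only an illustration under stated assumptions: an in-process `queue.Queue` stands in for RabbitMQ, a plain dict stands in for Redis, and the names `ServerSketch`, `publish_result` and `drain` are hypothetical, not part of Sensu. It also simplifies by passing only non-OK results to handlers.

```python
import queue


class ServerSketch:
    """Toy stand-in for a Sensu server: records state and invokes handlers."""

    def __init__(self, handlers):
        self.handlers = handlers  # handler name -> callable(result)
        self.state = {}           # stand-in for Redis: (client, check) -> status

    def process(self, result):
        check = result["check"]
        self.state[(result["client"], check["name"])] = check["status"]
        if check["status"] == 0:
            return  # simplification: only non-OK results reach handlers here
        for name in check.get("handlers", []):
            handler = self.handlers.get(name)
            if handler is not None:
                handler(result)


def publish_result(bus, client, name, status, output, handlers=("default",)):
    """What a client does after running a check: push the result onto the bus."""
    bus.put({
        "client": client,
        "check": {"name": name, "status": status, "output": output,
                  "handlers": list(handlers)},
    })


def drain(bus, server):
    """Feed every queued result to the server; return how many were processed."""
    processed = 0
    while True:
        try:
            result = bus.get_nowait()
        except queue.Empty:
            return processed
        server.process(result)
        processed += 1
```

The point of the sketch is the separation of roles: clients only publish, the server only consumes, and state lives in a store that a separate API could read.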

Page 28: Sensu and Sensibility - Puppetconf 2014

Quis custodiet ipsos custodes?

“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”

Page 29: Sensu and Sensibility - Puppetconf 2014

Mutually assured monitoring

• Multiple independent Sensu installs (per-datacenter)
• Monitor each other!

Page 30: Sensu and Sensibility - Puppetconf 2014

Machine readable config

• /etc/sensu/conf.d/checks/check_name.json
• Extensible with arbitrary metadata
• Hash merge
• Never edit by hand!
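The "hash merge" bullet can be illustrated with a small sketch of merging per-check JSON fragments into one configuration. This is an assumption-laden illustration, not Sensu's implementation: the function names are hypothetical, the two fragments are made up, and the rule that later scalars and lists replace earlier ones is a simplification that may differ from Sensu's real merge rules.

```python
import json
from copy import deepcopy


def deep_merge(base, overlay):
    """Return a new dict with overlay merged into base, recursing into nested dicts."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Assumption: scalars and lists from the later fragment win outright.
            merged[key] = deepcopy(value)
    return merged


def merge_fragments(fragments):
    """Merge JSON config fragments (strings), in order, into one configuration dict."""
    config = {}
    for text in fragments:
        config = deep_merge(config, json.loads(text))
    return config


# Two hypothetical fragments, as if read from separate files under conf.d/checks/.
FRAGMENTS = [
    '{"checks": {"disk_ro_mounts": {"interval": 60, "team": "operations"}}}',
    '{"checks": {"cron_backup": {"interval": 300, "team": "operations"}}}',
]
```

Because each check lives in its own file and the files merge into one structure, a tool such as Puppet can add or remove a check by managing a single file, which is what makes "never edit by hand" practical.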

Page 31: Sensu and Sensibility - Puppetconf 2014

monitoring_check

monitoring_check { 'systems-apache-external':
  page          => true,
  command       => "/usr/lib/nagios/plugins/check_tcp -H ${external_ip_address} -p 443",
  check_every   => '5m',
  alert_after   => '30m',
  realert_every => 10,
  runbook       => 'y/apache',
}
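One way to picture what a wrapper like monitoring_check has to do is to turn human-friendly durations into the numeric fields of a check definition. The sketch below is hypothetical: `to_seconds` and `build_check` are illustrative names, the defaults are invented, and the assumption that `check_every` becomes an `interval` in seconds is inferred only from the generated JSON shown elsewhere in this deck, not from the module's source.

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(value):
    """Convert '5m', '30m', '1h' or a bare number into integer seconds."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"(\d+)([smhd]?)", value.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    number, unit = match.groups()
    # A missing unit is treated as seconds.
    return int(number) * _UNITS.get(unit, 1)


def build_check(name, command, check_every="1m", alert_after="0s", **metadata):
    """Build a check definition dict; extra keyword arguments pass through as metadata."""
    check = {
        "standalone": True,
        "command": command,
        "interval": to_seconds(check_every),
        "alert_after": to_seconds(alert_after),
    }
    check.update(metadata)
    return {name: check}
```

With the parameters from the slide, `check_every => '5m'` would come out as an interval of 300 seconds and `alert_after => '30m'` as 1800 seconds, while `page`, `realert_every` and `runbook` ride along untouched as custom metadata.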

Page 35: Sensu and Sensibility - Puppetconf 2014

sensu::check

• monitoring_check wraps this
• Writes a JSON file for each check
• Comment safe

Page 36: Sensu and Sensibility - Puppetconf 2014

"disk_ro_mounts": {
  "standalone": true,
  "handlers": ["default"],
  "subscribers": [],
  "command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts",
  "interval": 60,
  "alert_after": 0,
  "realert_every": "-1",
  "dependencies": [],
  "runbook": "http://lmgtfy.com/?q=linux+read+only+disk",
  "annotation": "https://gitweb.yelpcorp.com/?p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80",
  "team": "operations",
  "irc_channels": "operations-notifications",
  "notification_email": "undef",
  "ticket": true,
  "project": "OPS",
  "page": false,
  "tip": false
}

Page 37: Sensu and Sensibility - Puppetconf 2014

"disk_ro_mounts": { "standalone": true, "handlers": [“default"], "subscribers": [], "command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts", "interval": 60, "alert_after": 0, "realert_every": “-1", "dependencies": [], "runbook": "http://lmgtfy.com/?q=linux+read+only+disk", "annotation": "https://gitweb.yelpcorp.com/?p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80", "team": "operations", "irc_channels": "operations-notifications", "notification_email": "undef", "ticket": true, "project": “OPS”, "page": false, "tip": false }

37

Page 38: Sensu and Sensibility - Puppetconf 2014

"disk_ro_mounts": { "standalone": true, "handlers": [“default"], "subscribers": [], "command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts", "interval": 60, "alert_after": 0, "realert_every": “-1", "dependencies": [], "runbook": "http://lmgtfy.com/?q=linux+read+only+disk", "annotation": "https://gitweb.yelpcorp.com/?p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80", "team": "operations", "irc_channels": "operations-notifications", "notification_email": "undef", "ticket": true, "project": “OPS”, "page": false, "tip": false }

38

Page 39: Sensu and Sensibility - Puppetconf 2014

"disk_ro_mounts": { "standalone": true, "handlers": [“default"], "subscribers": [], "command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts", "interval": 60, "alert_after": 0, "realert_every": “-1", "dependencies": [], "runbook": "http://lmgtfy.com/?q=linux+read+only+disk", "annotation": "https://gitweb.yelpcorp.com/?p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80", "team": "operations", "irc_channels": "operations-notifications", "notification_email": "undef", "ticket": true, "project": “OPS”, "page": false, "tip": false }

39

Page 40: Sensu and Sensibility - Puppetconf 2014

"disk_ro_mounts": { "standalone": true, "handlers": [“default"], "subscribers": [], "command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts", "interval": 60, "alert_after": 0, "realert_every": “-1", "dependencies": [], "runbook": "http://lmgtfy.com/?q=linux+read+only+disk", "annotation": "https://gitweb.yelpcorp.com/?p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80", "team": "operations", "irc_channels": "operations-notifications", "notification_email": "undef", "ticket": true, "project": “OPS”, "page": false, "tip": false }

40

Page 41: Sensu and Sensibility - Puppetconf 2014

"disk_ro_mounts": { "standalone": true, "handlers": [“default"], "subscribers": [], "command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts", "interval": 60, "alert_after": 0, "realert_every": “-1", "dependencies": [], "runbook": "http://lmgtfy.com/?q=linux+read+only+disk", "annotation": "https://gitweb.yelpcorp.com/?p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80", "team": "operations", "irc_channels": "operations-notifications", "notification_email": "undef", "ticket": true, "project": “OPS”, "page": false, "tip": false }

41

Page 42: Sensu and Sensibility - Puppetconf 2014

Check scripts

• Same as Nagios checks
• Simple (text) output
• Exit code

• Result sent to server, along with check definition
• Including all the custom metadata
• Our handlers use the extra data.
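A check in this style is just a program that prints one line of text and reports its result through the exit code, following the Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The sketch below looks for read-only mounts, loosely inspired by the `check_ro_mounts` name in the example definition; it is not Yelp's script, and the ignored filesystem types are an arbitrary assumption.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical list of filesystem types where a read-only mount is expected.
IGNORED_FSTYPES = {"proc", "sysfs", "tmpfs", "squashfs", "iso9660", "cgroup", "cgroup2"}


def find_ro_mounts(mounts_text):
    """Return mount points flagged 'ro' in text formatted like /proc/mounts."""
    read_only = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        _device, mountpoint, fstype, options = fields[:4]
        if fstype in IGNORED_FSTYPES:
            continue
        if "ro" in options.split(","):
            read_only.append(mountpoint)
    return read_only


def check_ro_mounts(mounts_text):
    """Return (exit_status, one_line_message) in the Nagios plugin style."""
    read_only = find_ro_mounts(mounts_text)
    if read_only:
        return CRITICAL, "CRITICAL: read-only mounts: " + ", ".join(read_only)
    return OK, "OK: no read-only mounts"


def main(path="/proc/mounts"):
    """Print the message and return the status for the caller to exit with."""
    try:
        with open(path) as handle:
            status, message = check_ro_mounts(handle.read())
    except OSError as exc:
        status, message = UNKNOWN, f"UNKNOWN: cannot read {path}: {exc}"
    print(message)
    return status
```

Run as a script, the last line would be `sys.exit(main())` so that the exit code reaches the Sensu client, which forwards it with the output and the check definition.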

Page 43: Sensu and Sensibility - Puppetconf 2014

Handlers

• base
• JIRA
• email
• irc
• pagerduty
• awsprune
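Since results arrive with the check's custom metadata attached, a handler layer can choose notification channels from that metadata alone. The following is a hypothetical sketch of such routing, not the logic of the handlers listed above: the function name is invented, the field names are taken from the example check definition in this deck, and treating the string "undef" as "no address configured" is an assumption.

```python
def select_handlers(check):
    """Return the names of notification handlers to run for one check definition."""
    selected = []
    if check.get("page"):
        selected.append("pagerduty")
    if check.get("ticket"):
        selected.append("jira")
    if check.get("irc_channels"):
        selected.append("irc")
    email = check.get("notification_email")
    # Assumption: the literal string "undef" means no address was configured.
    if email and email != "undef":
        selected.append("email")
    return selected


# Relevant fields copied from the disk_ro_mounts example definition.
DISK_RO_MOUNTS = {
    "team": "operations",
    "irc_channels": "operations-notifications",
    "notification_email": "undef",
    "ticket": True,
    "project": "OPS",
    "page": False,
}
```

Under these assumptions the example check would open a ticket and post to IRC but would not page anyone, which matches its `"page": false` and `"ticket": true` settings.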

Page 44: Sensu and Sensibility - Puppetconf 2014

How do checks get run?

• Every machine runs the client.
• Client managed by Puppet
• Client has a TCP socket you can send JSON to
• Custom checks + pysensu-yelp
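Sending a result to the local client socket can be sketched with the standard library alone. Treat the details as assumptions to verify against your own deployment: port 3030 is the customary default for the Sensu client socket in releases of this era, the `name`/`status`/`output` keys are the usual minimum for an external result, and `build_result` and `send_result` are hypothetical helper names rather than the pysensu-yelp API.

```python
import json
import socket


def build_result(name, status, output, **metadata):
    """Assemble a check result dict; extra keyword arguments become custom metadata."""
    result = {"name": name, "status": status, "output": output}
    result.update(metadata)
    return result


def send_result(result, host="localhost", port=3030, timeout=5.0):
    """Serialise the result as JSON and write it to the client's TCP socket."""
    payload = json.dumps(result).encode("utf-8")
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
```

A service could call `send_result(build_result("my_service_queue", 1, "WARNING: queue depth 5000", team="operations"))` from its own code, so the check logic lives inside the application instead of in a separate script.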

Page 46: Sensu and Sensibility - Puppetconf 2014

Situational awareness

Page 47: Sensu and Sensibility - Puppetconf 2014

Single source of truth

• DNS is canonical for sensu servers
• Configure things in one place!

Page 49: Sensu and Sensibility - Puppetconf 2014

Automatic monitoring

• E.g. cron jobs - check successful recently!
• cron::d

Page 51: Sensu and Sensibility - Puppetconf 2014

Generate monitoring_check
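A generated check for "did this cron job succeed recently" could look like the sketch below. The deck does not show how the freshness is actually measured, so this assumes a hypothetical convention: the job touches a marker file when it succeeds, and the check compares that file's modification time against a maximum age.

```python
import os
import time

OK, CRITICAL = 0, 2


def check_recent_success(marker_path, max_age_seconds, now=None):
    """Return (status, message) based on how recently the marker file was touched."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(marker_path)
    except OSError:
        return CRITICAL, f"CRITICAL: {marker_path} not found; no recorded success"
    if age > max_age_seconds:
        return CRITICAL, f"CRITICAL: last success {age:.0f}s ago (limit {max_age_seconds}s)"
    return OK, f"OK: last success {age:.0f}s ago"
```

Because the threshold follows from the job's schedule, a wrapper that already knows the schedule can emit this check without the job's author writing any monitoring configuration.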

Page 53: Sensu and Sensibility - Puppetconf 2014

User specified monitoring

• Data lives in the service config
• Next to the code to emit metrics!

Page 54: Sensu and Sensibility - Puppetconf 2014

User specified monitoring

• Simple checks for free!

Page 55: Sensu and Sensibility - Puppetconf 2014

User specified monitoring

• Data lives in the service config
• Next to the code to emit metrics
• Next to metadata about SLAs and LB timeouts
• Developers can push without OPS

Page 56: Sensu and Sensibility - Puppetconf 2014

Cluster checks

• We’re working on this currently
• Assert some % of machines are healthy.
• Use to reduce alert noise.

• If a service becomes fully unavailable to clients, you want to page someone.

• If one machine goes belly up, you don’t (make a JIRA ticket for handling later!)
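The cluster-check idea can be sketched as a function over per-host results: page when the healthy fraction drops below a threshold, file a ticket when only some hosts are unhealthy, and stay quiet otherwise. Since the slide describes this as unfinished work, everything here is an assumption, including the function name, the threshold semantics and the choice to page when no hosts report at all.

```python
OK = 0


def cluster_decision(statuses, min_healthy_pct):
    """Decide how to react to per-host check results for one service.

    statuses maps host name -> exit status of its latest check (0 is healthy).
    Returns (action, message) where action is "none", "ticket" or "page".
    """
    total = len(statuses)
    if total == 0:
        # Assumption: silence from every host is treated as an outage.
        return "page", "no hosts are reporting"
    unhealthy = sorted(host for host, status in statuses.items() if status != OK)
    healthy_pct = 100.0 * (total - len(unhealthy)) / total
    message = f"{healthy_pct:.0f}% of {total} hosts healthy"
    if healthy_pct < min_healthy_pct:
        return "page", message
    if unhealthy:
        return "ticket", message + "; unhealthy: " + ", ".join(unhealthy)
    return "none", message
```

For example, with three hosts and a 50% threshold, one failed host yields a ticket, while all three failing yields a page.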

Page 57: Sensu and Sensibility - Puppetconf 2014

WIP

• This is all still a work in progress.

• We’ve not 100% migrated off of Nagios

• Open sourcing the pieces

Page 58: Sensu and Sensibility - Puppetconf 2014

Thanks!

• Slides will be online shortly:
  • slideshare.net/bobtfish
  • @bobtfish

• Some (most?) of our code is open source:
  • https://github.com/Yelp/sensu/commit/aa5c43c2fdfde5e8739952c0b8082000934f3ad2
  • https://github.com/Yelp/puppet-monitoring_check
  • https://github.com/Yelp/puppet-netstdlib
  • https://github.com/Yelp/sensu_handlers
  • https://github.com/Yelp/pysensu-yelp