sense and sensu-bility: painless metrics and monitoring in the cloud with sensu
DESCRIPTION
Are you unhappy with the state of monitoring in your organization? Are you successfully automating “all the things” except your monitoring checks? Are you tired of looking at monitoring dashboards that hark from another era? Do you long to access your monitoring system via a REST API? Paperless Post recently solved these problems by replacing Nagios with Sensu, a new and awesome free monitoring and metrics router that is designed with configuration management and cloud deployments in mind. In my presentation we’ll take an in-depth look into why we chose Sensu and how we monitor our services and collect system metrics to send to Graphite. Subtopics will include how we planned for and executed the migration, mistakes we made along the way, how we knew when to scale and how we did it. I’ll also cover how we’re making our Sensu setup redundant and highly available, how we’re monitoring and collecting metrics about Sensu, and how we’ve integrated our internal tools with Sensu.
TRANSCRIPT
SENSE AND SENSU-BILITY
Painless Metrics And Monitoring In The Cloud with Sensu
Bethany Erskine
Velocity NYC 2013
http://github.com/skymob/sensu-tutorial
Monday, October 14, 13
BEFORE I BEGIN...
IF YOU DID NOT SET UP SENSU-TUTORIAL BEFORE THE CLASS:
1. grab a USB key
2. follow the instructions on the README
If you don’t have a computer, no sweat!
DO YOU LOVE YOUR MONITORING SETUP?
#MONITORINGLOVE
MY STORY
WHY SENSU
✓Ruby
✓Plugins can be written in any language
✓sensu-chef cookbook
✓community
WHY SENSU
✓re-use Nagios checks!
✓metrics and checks all collected by one system
✓Graphite integration
✓easy to scale
WHY SENSU
✓“Can I do X with Sensu?” probably!
WHY SENSU?
✓Sensu source is well-written and easy to parse
✓https://github.com/sensu
WHY SENSU?
✓sensu-community-plugins
✓80 contributors
✓over 600 plugins
✓https://github.com/sensu/sensu-community-plugins
TODAY at PAPERLESS
✓Two Sensu environments (prod/testing)
✓~250-275 instances of sensu-client
✓4-6 sensu-server instances
✓25k metrics/hour to Graphite
✓1 custom dashboard
✓1 custom CLI
RESOURCES
✓All of our Sensu infrastructure is virtualized.
✓We typically give a sensu-server box 1.5GB RAM and 2 processors, scaling up RAM for any box running more than one Sensu service on it.
✓4GB RAM for a monolithic Sensu install (RabbitMQ, Redis and all Sensu components on one box)
AS WE GREW
Growing pains and lessons learned...
NEEDS MORE SENSU
✓High load on Sensu server
✓Backed-up queues in RabbitMQ
✓TIP: set up a check to monitor the RabbitMQ ready queue size; you'll want an email when the queue grows above 10K and stays there
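One way to express the "grows and stays there" part of that tip is to alert only when several consecutive samples exceed the threshold. The sketch below is illustrative only: the function names, the 3-sample window and the way samples are gathered are assumptions, not the check we run. The queue depths could come from RabbitMQ's management API, which reports a `messages_ready` count per queue.

```python
def queue_backed_up(samples, threshold=10000, sustained=3):
    """Return True when the last `sustained` samples all exceed `threshold`."""
    if len(samples) < sustained:
        return False
    return all(depth > threshold for depth in samples[-sustained:])


def check_status(samples, threshold=10000, sustained=3):
    """Map recent queue depths to a Nagios-style (exit_code, message) pair."""
    # 0 = OK and 2 = CRITICAL, the exit-code convention Sensu checks share with Nagios.
    if queue_backed_up(samples, threshold, sustained):
        return 2, "CRITICAL: ready queue above %d for %d samples" % (threshold, sustained)
    latest = samples[-1] if samples else 0
    return 0, "OK: ready queue at %d" % latest
```

A single spike such as `[500, 12000, 800]` stays OK, while `[11000, 12000, 15000]` goes CRITICAL.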
HOW TO SCALE
✓Add more sensu-server instances
✓No special configuration needed
✓checks will be distributed in round-robin fashion to the sensu-servers
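As a toy model of round-robin distribution (not Sensu or RabbitMQ code; the server names and result labels are made up), each incoming item simply goes to the next server in turn:

```python
from itertools import cycle


def distribute(results, servers):
    """Assign each result to the next server in turn, wrapping around."""
    assignment = {server: [] for server in servers}
    for result, server in zip(results, cycle(servers)):
        assignment[server].append(result)
    return assignment
```

With five results and two servers, the first server handles three and the second handles two, which is why adding a server spreads the load without extra configuration.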
GRAPHITE PAINS
✓symptoms: backed up queues in RabbitMQ, spotty graphs
✓cluster couldn’t keep up with the large amount of metrics we were now serving it via AMQP
GRAPHITE PAINS
✓Solution: stop collecting metrics every 10 seconds (excessive!)
✓moved staging metrics to staging Graphite cluster
✓Moved prod Graphite cluster to SSD
THE MIGRATION
or, How To Quit Nagios in Ten Easy Steps
STEP 1: NUKE AND PAVE
STEP 2: PLAN
METRICS AND MONITORING SURVEY
STEP 3: DEFINE GLOBALS
✓CHECKS: must be actionable!
✓METRICS: go nuts
✓HANDLERS: EMAIL for everything initially, added Pagerduty later.
OUR GLOBALS
✓CHECKS: disk usage, swap usage, zombie processes, RO filesystems
✓METRICS: vmstat, disk usage, cpu, memory, interface and disk perf
✓HANDLERS: Email, Campfire, Pagerduty
STEP 4: DEFINE SPECIFICS
✓For each server role, define additional states to be checked and alerted on:
✓Process Checks
✓System Checks
✓Service Checks
✓Service Metrics
STEP 5: SET UP A PLACE TO TEST
✓Set up a permanent testing Sensu stack using your CM tool of choice
✓we used sensu-chef cookbook
STEP 6: SET A WORKFLOW
✓Develop and document a workflow for implementing, testing, deploying and signing off on checks
✓You’ll get the best coverage if anyone (developers or ops) can easily add checks and metrics to Sensu
EXAMPLE WORKFLOW
✓add new sensu_check definitions to the appropriate cookbook in Chef
✓deploy new check to staging env using Chef
✓Pull Request with sample graphs or alerts
✓Code Review from colleague
✓Deploy to Prod
SENSU IN CHEF
STEP 7: EXECUTE WORKFLOW
✓Starting with the low-hanging fruit (plugins that already existed in sensu-community-plugins repository), configure and deploy each check in the worksheet to the testing Sensu server
✓deploy sensu-client to a few select machines
STEP 8: WATCH THE WATCHER
✓Set up some bare-minimum 3rd party monitoring for the Sensu servers
✓We use Panopta’s agent to check for aliveness, disk usage and CPU usage.
MONITOR THE MONITOR
✓Other ideas: have Testing Sensu monitor Prod Sensu
✓Sensu can collect metrics about itself
STEP 9: ROLLOUT
✓Deploy your Production server infrastructure
✓Roll out the client and checks to the rest of your prod environments.
STEP 10: TUNE
✓Laissez le bon alertes roulent!
✓Expect to need to tune thresholds and alert occurrences.
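To make "alert occurrences" concrete, here is a generic sketch of occurrence-based filtering: stay quiet until a check has failed a minimum number of times in a row, then repeat the notification only every so often. This is not Sensu's own filtering code, and the parameter names and defaults are assumptions chosen for illustration.

```python
def should_notify(occurrences, min_occurrences=3, refresh_every=30):
    """Decide whether the Nth consecutive failure should produce a notification."""
    if occurrences < min_occurrences:
        # Too early: the failure may still be a blip.
        return False
    # Notify on the first qualifying occurrence, then once per refresh window.
    return (occurrences - min_occurrences) % refresh_every == 0
```

Raising `min_occurrences` trades slower alerts for fewer false alarms; raising `refresh_every` cuts down on repeat emails for a problem you already know about.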
SENSU ARCHITECTURE
OMNIBUS INSTALLER
is awesome
LET’S PLAY WITH SENSU
If you haven’t been able to get your sandboxes up and running,
please pair with someone near you.
SANDBOX GOALS
✓Get familiar with Sensu configuration
✓Install a Handler
✓Deploy a check
✓Trigger an alert on that check
✓Give you something to take home and hack on
OOPS
If you mess anything up:
vagrant halt; vagrant up
Worst case:
vagrant destroy; vagrant up
TWO VIRTUALBOXES
Sensu-Server and Sensu-Client
Vagrant/Chef
CentOS 6.4
Sensu Version 0.10.2
SENSU CONFIGURATION
✓Please open up a terminal and SSH into both your sensu-server and sensu-client VMs
✓sudo su -
✓cd /etc/sensu
SENSU CONFIGURATION
✓/etc/sensu/config.json - config for redis, rabbitmq, api and dashboard
✓/etc/sensu/conf.d/ - checks go here
✓/etc/sensu/conf.d/client.json - client configuration, subscriptions
✓/etc/sensu/{extensions|handlers|mutators|plugins}
TRIGGER AN ALERT!
On sensu-client:
service sensu-client stop
CHECK YOUR DASHBOARD
✓Open a web browser and go to http://10.254.254.10:8080
✓username: admin / password: secret
HANDLERS
✓A HANDLER takes action on an event using a pipe, TCP, UDP, AMQP, or a set of other handlers
✓Examples: send an email, send event to Pagerduty, send metrics to Graphite
✓Default is “debug”
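A pipe handler is just a program that receives the event as JSON on standard input. The sketch below shows the shape of such a program in Python; the function names are made up, and the fields it reads (`client.name`, `check.name`, `check.status`, `check.output`) are assumed from the event layout that Sensu's Ruby handlers use, so check them against a real event from your own install.

```python
import json
import sys


def describe_event(raw):
    """Turn one event (a JSON string) into a one-line summary."""
    event = json.loads(raw)
    client = event["client"]["name"]
    check = event["check"]
    return "%s/%s status=%s: %s" % (
        client, check["name"], check["status"], check["output"].strip())


def main():
    # A pipe handler would read the whole event from STDIN, then act on it.
    print(describe_event(sys.stdin.read()))
```

A real handler would replace the `print` with the action you care about, such as sending mail or calling a paging service.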
HANDLER EXAMPLES
✓BASIC: send an email to ops@
✓ADVANCED: attempt to remediate the alert (i.e. run a custom script that spins up additional ec2 instances)
HANDLERS
✓Let’s configure an EMAIL handler to send an informative email for an event.
✓The /etc/sensu/handlers/mailer.rb plugin is installed for you; we just need to configure it and set up the handler
CONFIGURE THE PLUGIN
ON SENSU SERVER:
vim /etc/sensu/conf.d/handlers/mailer.json
{
  "mailer": {
    "mail_from": "[email protected]",
    "mail_to": "[email protected]"
  }
}
CONFIGURE THE HANDLER
cp /etc/sensu/conf.d/handlers/default.json /etc/sensu/conf.d/handlers/email.json
vim /etc/sensu/conf.d/handlers/email.json
EMAIL.JSON
{
  "handlers": {
    "email": {
      "type": "pipe",
      "command": "/etc/sensu/handlers/mailer.rb"
    }
  }
}
CHECK GEM DEPENDENCIES
/opt/sensu/embedded/bin/gem list | grep mail
FIX PERMISSIONS
chown -R .sensu /etc/sensu/conf.d/
RESTART SERVICES
service sensu-server restart
tail -100 /var/log/sensu/sensu-server.log | grep mail
CHECKS
✓Sensu-client runs CHECKS that are defined and scheduled either locally (standalone) or on the sensu-server (subscription).
✓A CHECK sends a RESULT as an EVENT to a HANDLER - this applies to anything - service checks, metrics, etc
CHECK EXECUTION
✓Either scheduled by the server (subscription) or scheduled by the client (standalone)
✓Today we will configure a subscription-based check on the server that will run on our client
LET’S CONFIGURE A CHECK
✓Use check-procs.rb to make sure at least one instance of cornbread is running
DETERMINE OUR CHECK COMMAND
On your SENSU CLIENT:
/opt/sensu/embedded/bin/ruby /etc/sensu/plugins/check-procs.rb -p cornbread -W1
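For a feel of what a process check like this does, here is a rough Python equivalent of the logic: count the processes whose command line matches a pattern and warn when there are too few. It is not the check-procs.rb source; the function names are made up, and it assumes `-W1` means "warn when fewer than 1 match".

```python
import re


def count_matching(process_lines, pattern):
    """Count process command lines that match the regular expression."""
    regex = re.compile(pattern)
    return sum(1 for line in process_lines if regex.search(line))


def evaluate(count, warn_under=1):
    """Return a Nagios-style (exit_code, message) pair: 0 = OK, 1 = WARNING."""
    if count < warn_under:
        return 1, "WARNING: found %d matching processes, expected at least %d" % (
            count, warn_under)
    return 0, "OK: found %d matching processes" % count
```

In a real plugin the process list would come from the operating system (for example the output of `ps`), and the exit code is what tells Sensu whether to raise an event.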
INSTALL OUR CHECK
✓On your SENSU SERVER:
✓vim /etc/sensu/conf.d/checks/cornbread_process.json
CORNBREAD_PROCESS.JSON
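The file contents for this slide did not survive in the transcript. As a stand-in, the sketch below builds a plausible subscription check definition and prints it as JSON. Treat every value as an assumption: the `"all"` subscription and the 60-second interval are hypothetical, and the subscription name must match one listed in your client's client.json.

```python
import json


def cornbread_check_definition():
    """Build a hypothetical check definition for the cornbread process check."""
    return {
        "checks": {
            "cornbread_process": {
                # Same command we tested by hand on the client.
                "command": "/opt/sensu/embedded/bin/ruby /etc/sensu/plugins/check-procs.rb -p cornbread -W1",
                "subscribers": ["all"],   # hypothetical subscription name
                "interval": 60,           # hypothetical: run once a minute
                "handlers": ["email"],    # the handler configured earlier
            }
        }
    }


print(json.dumps(cornbread_check_definition(), indent=2, sort_keys=True))
```

Printing through `json.dumps` is a cheap way to be sure the file you save is valid JSON before restarting sensu-server.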
RESTART SERVICES
service sensu-server restart
tail -100 /var/log/sensu/sensu-server.log | grep cornbread
CHECK YOUR DASHBOARD
CHECK YOUR EMAIL
SENSU API
✓REST API
✓HTTP/4567
✓on SENSU SERVER try:
curl -l http://localhost:4567/events | python -mjson.tool
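The same request can be made from a script, which is how internal tools usually talk to the API. This is a minimal sketch using only the standard library; the function names are made up, and it assumes the API is reachable without authentication on localhost:4567 as in the sandbox.

```python
import json
from urllib.request import urlopen


def pretty(raw):
    """Re-indent a JSON string, much like `python -mjson.tool` does."""
    return json.dumps(json.loads(raw), indent=4, sort_keys=True)


def fetch_events(base_url="http://localhost:4567"):
    """GET /events from the Sensu API and return it pretty-printed."""
    with urlopen(base_url + "/events", timeout=5) as response:
        return pretty(response.read().decode("utf-8"))
```

Calling `print(fetch_events())` on the sensu-server VM should show the current events, or `[]` when nothing is alerting.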
SENSU SERVICES
✓Sensu API
✓Sensu Server
✓Sensu Client
✓Sensu Dashboard
EVERYTHING OK?
✓/etc/init.d/sensu-service {client|server|api|dashboard} {start|stop|status|restart}
✓ps -ef | grep sensu
✓tail -f /var/log/sensu/*.log
✓curl -l localhost:4567/info
COOL SENSU TRICKS
SEND DIRECTLY TO SENSU
netcat to: 127.0.0.1:3030
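The local client socket accepts a check result as JSON, so any script can report into Sensu without being a scheduled check. Here is a small sketch of the netcat trick in Python; the function names and the example check name are made up, and the payload uses the `name`, `output` and `status` fields the client socket expects.

```python
import json
import socket


def build_result(name, output, status=0):
    """Build the JSON payload for an externally generated check result."""
    return json.dumps({"name": name, "output": output, "status": status})


def send_result(payload, host="127.0.0.1", port=3030):
    """Send one result to the sensu-client socket on the local machine."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(payload.encode("utf-8"))
```

For example, a backup script could call `send_result(build_result("backup_job", "backup failed", 2))` when it fails, and the event flows through your handlers like any other check.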
AGGREGATE ALERTS
✓Handy for preventing alert floods
✓Alert when X% of checks are not OK
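The "X% not OK" rule boils down to a little arithmetic over the latest status of the same check on every client. This sketch is generic and is not Sensu's aggregate implementation; the function names and the 25% default are assumptions.

```python
def percent_not_ok(statuses):
    """Percentage of check results whose exit status is anything but 0 (OK)."""
    if not statuses:
        return 0.0
    failing = sum(1 for status in statuses if status != 0)
    return 100.0 * failing / len(statuses)


def aggregate_alert(statuses, max_percent_not_ok=25.0):
    """Alert only when more than the allowed share of results is failing."""
    return percent_not_ok(statuses) > max_percent_not_ok
```

With four web servers, one failing box is exactly 25% and stays quiet, while two failing boxes is 50% and sends a single alert instead of one per box.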
MY SENSU TIPS
✓install the RabbitMQ management web interface and bookmark it (see http://10.254.254.10:15672/#/ )
✓lock your plugins’ gem dependency versions
TIPS TIPS TIPS
✓have alternate ways to access your Dashboard information
✓we integrated our command-line developer tools with Sensu API
✓we also created our own Ops dashboard that queries Sensu, Graphite and our app for data
MORE TIPS
✓Put NGINX in front of sensu-dashboard
HA SENSU
✓Redundancy is easy (bring up more sensu-servers)
✓Making Redis and RabbitMQ HA is more challenging
✓We’re still running one solitary Redis and RabbitMQ but are OK with this risk for now
WHERE TO GO FOR HELP
✓http://docs.sensuapp.org
✓IRC: #sensu - freenode
✓sensu-users mailing list
QUESTIONS
THANK YOU
[email protected]
@skymob - twitter
robotwitharose - #sensu on IRC (freenode)