nagios conference 2012 - kishore jalleda - nagios in the agile devops continuous deployment world

41
Nagios in the Agile / DevOps / Continuous Deployment World Kishore Jalleda Director of Operations IMVU, Inc [email protected]

Upload: nagios

Post on 11-Nov-2014

1.566 views

Category:

Technology


0 download

DESCRIPTION

Kishore Jalleda's presentation on using Nagios in a continuous development environment. The presentation was given during the Nagios World Conference North America held Sept 25-28th, 2012 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna

TRANSCRIPT

Page 1: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

Nagios in the Agile / DevOps / Continuous Deployment World

Kishore Jalleda

Director of Operations

IMVU, Inc

[email protected]

Page 2: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

22012

About IMVU

Page 3: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

3

About IMVU

Avatar based Social Entertainment destination

$50+ Million Annual Revenue

100+ Million Registered Users

10+ Million Items in Virtual Catalog

2012

Page 4: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

42012

IMVU Engineering and Continuous Deployment

►Doing the Impossible 50 times a day

►Continuous deployment (CD) is real

►IMVU has been one of the pioneers of CD

►DevOps culture is big

►No approval needed to ship to 1% of customers

Check out our engineering blog http://engineering.imvu.com/

Page 5: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

52012

What does this mean ?

►Things change quickly

►New features add up instantly

►Can break frequently

►Failures can cascade rapidly

►Things can fall through the cracks

►Many things change at the same time

►Etc

Page 6: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

Insights into Nagios @IMVU

Page 7: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

72012

Overview

►Nagios Core 3.2.0

►800+ Hosts

►18000+ Service Checks

►Single Nagios Instance

►8 cores, 8GB RAM

Page 8: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

2012 8

Server Lifecycle Management

Purchase & Asset

Management

DHCP,

DNSPreseed,

CFEngine Opspush Nagios,

Cacti, Istatd CFEngine Production Decommissi

on

Page 9: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

[ Operations ] Continuous Integration and Deployment

2012 9

Page 10: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

102012

IMVU Asset Database ( AssetDB )

►Built internally by IMVU

►Simple but powerful concept

►Source of truth for everything asset related

►Has information on

►Class ( mysql, standard-http-server, redis )

►Role ( customer shard, clientdynweb )

►Tag (available, no-update )

►Attributes (cpu-cores, memory-size, mysql-role )

►Much more …

Page 11: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

112012

Auto generation of Nagios configuration files

#generate_nagios_conf.pl

( most configurations auto generated from AssetDB )

Page 12: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

122012

Ops Buildbot ( builds, builders/buildslaves )

# svn commit hosts.cfg hostgroups.cfg

Page 13: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

132012

Opspush ( Operations Push System )

# opspush --comment “xxxxxx” –role nagios

opspush

check status of “last build”

run “cfagent -v” on the box

--oncall-override ?

green

red

exit

yes

No

--use-last-green-rev

Page 14: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

2012 14

Product Development

Ideation, UI Design,

Usability Testing, etc

Tech Design

Monitoring and Alerting

Coverage.. Nagios

Production Maintenance

Page 15: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

15

Tech Designs & New Nagios Alert Requests

2012

Page 16: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

16

Nagios Alert Request Template

2012

Page 17: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

172012

Big Data / De-Sharding

► Data freshness is critical to help make the right business decisions

► Nagios used for ETL/DW status and error checking

► Nagios and Ops embeds can help empower your Data Infrastructure team

Page 18: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

Things will FAIL

2012 18

Page 19: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

2012 19

How we try to prevent and catch failures

Local Acceptance

Tests Hypo Builds Buildbot

Automated Cluster

Immunity (CI)

Manual QA using roll-out Nagios

3rd party like webmetrics, customers,

etc

Page 20: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

Push to X% of

servers

Monitor Critical Metrics

Push to rest

Auto Rollback

Good

Bad

w00t!, my change is

Live

Monitor Critical Metrics

Bad

Good

Cluster Immune System

Automated push monitoring and rollback !

Page 21: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

Don’t just rely on Standard Metrics

2012

Page 22: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

222012

Demystifying P1s ( Priority 1 )

P1: Priority 1 issue impacting live operations

Phases

► Identification (Nagios )

► Communication and Declaration

► Resolution

► Postmortem / 5 Whys / Root Cause Analysis

► P1 follow up

Page 23: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

232012

5 Why / Postmortem (PM) / Root Cause Analysis

► 5 Why process

► Amazing culture of running blameless postmortems

► New Nagios checks are the most common action Items .

► A lot of monitoring and alerting on business and application level metrics was originally the outcome of PMs

Page 24: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

242012

Example “5 Whys” Process

Page 25: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

252012

Monitor Business & Application Level Metrics

Page 26: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

262012

Monitor Response Times

Load Average is a meaningless number

Page 27: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

272012

Continuous Monitoring ( Istatd )

► Developed by IMVU

► Sub 10 sec resolution of data

► API to get average, SD, min, max sample count for each data point in a graph

► Ability to stack multiple graphs on the fly

► Long retention times

► Releasing as open source this week !!!

https://github.com/imvu-open/istatd/wiki

Page 28: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

282012

Istatd: 10 Second Resolution of Data

Page 29: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

292012

Istatd: Stacking graphs on the fly

Page 30: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

Have a “Strategy” for Monitoring and Alerting

Page 31: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

312012

Our (Nagios) Strategy

► Human element of Monitoring and Alerting ( Nagios )

► Nagios & Test Driven Development ( TDD )

► Decouple ( Nagios )

► Aggregated Checks

Page 32: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

322012

Human Element of Monitoring and Alerting

► Have zero tolerance towards False Positives. You do not want your ops staff to walk into the office next AM looking like zombies ;)

► Do not let people develop immunity to pages as very soon real issues will be ignored

► All pages are Actionable policy: If there is no action, it should not be paging

► Automatic enabling of alerting/notifications for improperly silenced ones.

► Ownership and accountability of issues/alerts

Page 33: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

332012

Daily Triage of Nagios Alerts and Interrupts

Page 34: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

342012

Nagios & Test Driven Development (TDD)

► Write tests for your Nagios Infrastructure

► Adopted heavily by Ops ( imp to keep pace with eng, DevOps culture is awesome )

► High degree of confidence in pushing changes

► Things will eventually change ( OS, libraries, logic, people, Nagios version, etc ). Tests will make the change much smoother.

► Functional testing can still be a challenge

Page 35: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

352012

Sample Nagios Test Output

Page 36: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

362012

Decouple Nagios

We do it using “Fact, Worker, Reporter & Aggregator” Model

Worker

Redis

Reporter

Aggregator

fact

fact

fact status

fact status

Page 37: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

372012

Why Decouple ?

For scalability and efficiency

Our model was higher performing compared to NRPE

Lets you make changes ( like thresholds ) in one place instead of on like a 1000 machines ( if using NRPE )

Lets you do aggregated checks, which is again a very simple but powerful concept to reduce paging levels by a ton

Page 38: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

Closing Remarks

Page 39: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

392012

Closing Remarks

► Monitoring and Alerting (M&A) is mission critical for any business, invest properly and smartly in it

► Don’t limit the usage of Nagios to just Ops. The secret to wide spread adoption is to make things frictionless

► Bathroom breaks can take 5-10 minutes, so don’t fret too much about Nagios performance

► Build some form of predictive monitoring and alerting to catch and alert on change in trends

► Invest in configuration automation, validation and compliance

► Finally, Nagios has been like a Honda, very reliable !!!

Page 40: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

Questions ?

Page 41: Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

412012

Thank You !!!

[email protected]

We are Hiring: imvu.com/jobs

Engineering Blog: http://engineering.imvu.com/