Nagios in the Agile / DevOps / Continuous Deployment World


Post on 30-Dec-2015


Page 1: Nagios in the Agile /  DevOps  / Continuous Deployment World

Nagios in the Agile / DevOps / Continuous Deployment World

Kishore Jalleda

Director of Operations

IMVU, Inc

[email protected]

Page 2: Nagios in the Agile /  DevOps  / Continuous Deployment World


About IMVU

Page 3: Nagios in the Agile /  DevOps  / Continuous Deployment World


About IMVU

Avatar-based Social Entertainment destination

$50+ Million Annual Revenue

100+ Million Registered Users

10+ Million Items in Virtual Catalog

2012

Page 4: Nagios in the Agile /  DevOps  / Continuous Deployment World


IMVU Engineering and Continuous Deployment

►Doing the Impossible 50 times a day

►Continuous deployment (CD) is real

►IMVU has been one of the pioneers of CD

►DevOps culture is big

►No approval needed to ship to 1% of customers

Check out our engineering blog http://engineering.imvu.com/

Page 5: Nagios in the Agile /  DevOps  / Continuous Deployment World


What does this mean ?

►Things change quickly

►New features add up instantly

►Things can break frequently

►Failures can cascade rapidly

►Things can fall through the cracks

►Many things change at the same time

►Etc

Page 6: Nagios in the Agile /  DevOps  / Continuous Deployment World

Insights into Nagios @IMVU

Page 7: Nagios in the Agile /  DevOps  / Continuous Deployment World


Overview

►Nagios Core 3.2.0

►800+ Hosts

►18000+ Service Checks

►Single Nagios Instance

►8 cores, 8GB RAM

Page 8: Nagios in the Agile /  DevOps  / Continuous Deployment World

Server Lifecycle Management

Purchase & Asset Management -> DHCP, DNS -> Preseed, CFEngine -> Opspush -> Nagios, Cacti, Istatd -> CFEngine -> Production -> Decommission

Page 9: Nagios in the Agile /  DevOps  / Continuous Deployment World

[ Operations ] Continuous Integration and Deployment


Page 10: Nagios in the Agile /  DevOps  / Continuous Deployment World


IMVU Asset Database ( AssetDB )

►Built internally by IMVU

►Simple but powerful concept

►Source of truth for everything asset related

►Has information on

►Class ( mysql, standard-http-server, redis )

►Role ( customer shard, clientdynweb )

►Tag (available, no-update )

►Attributes (cpu-cores, memory-size, mysql-role )

►Much more …

Page 11: Nagios in the Agile /  DevOps  / Continuous Deployment World


Auto generation of Nagios configuration files

#generate_nagios_conf.pl

( most configurations auto generated from AssetDB )
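As a rough illustration of the idea on this slide, the sketch below renders Nagios host definitions from asset records. It is not the actual generate_nagios_conf.pl, which the talk does not show. The record fields (name, address, asset_class, role, tags), the generic-host template name and the rule of skipping hosts not tagged "available" are all assumptions made for the example; only the Class / Role / Tag concepts come from the AssetDB slide.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """Hypothetical AssetDB record; field names are illustrative only."""
    name: str
    address: str
    asset_class: str                      # e.g. "mysql", "redis"
    role: str                             # e.g. "customer shard"
    tags: list = field(default_factory=list)


def host_definition(asset: Asset) -> str:
    """Render one Nagios 'define host' object for an asset."""
    return "\n".join([
        "define host {",
        "    use         generic-host",   # assumed template name
        f"    host_name   {asset.name}",
        f"    address     {asset.address}",
        f"    hostgroups  {asset.asset_class}",
        "}",
    ])


def generate_hosts_cfg(assets) -> str:
    """Emit definitions only for assets tagged 'available' (an assumed rule)."""
    live = [a for a in assets if "available" in a.tags]
    return "\n\n".join(host_definition(a) for a in sorted(live, key=lambda a: a.name))


if __name__ == "__main__":
    inventory = [
        Asset("db01", "10.0.0.11", "mysql", "customer shard", ["available"]),
        Asset("cache01", "10.0.0.21", "redis", "clientdynweb", ["no-update"]),
    ]
    print(generate_hosts_cfg(inventory))
```

Because the output is plain text, it can be committed and validated like any other source file, which is what the Buildbot step on the next slide relies on.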

Page 12: Nagios in the Agile /  DevOps  / Continuous Deployment World


Ops Buildbot ( builds, builders/buildslaves )

# svn commit hosts.cfg hostgroups.cfg

Page 13: Nagios in the Agile /  DevOps  / Continuous Deployment World

Opspush ( Operations Push System )

# opspush --comment "xxxxxx" --role nagios

opspush -> check status of "last build"
    green -> run "cfagent -v" on the box
    red -> --oncall-override ?
        yes -> run "cfagent -v" on the box
        no -> exit

Option also shown on the slide: --use-last-green-rev
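One reading of the flowchart on this slide can be sketched as a small decision function. The function name and the returned labels are made up for the example, and the --use-last-green-rev option is left out because the transcript does not show where it attaches in the flow.

```python
def opspush_action(last_build: str, oncall_override: bool = False) -> str:
    """Return the next step for a push, following one reading of the slide.

    last_build is "green" or "red". The return values are labels for this
    sketch, not real opspush output.
    """
    if last_build == "green":
        return "run cfagent -v"
    if last_build == "red":
        # A red build blocks the push unless the on-call engineer overrides.
        return "run cfagent -v" if oncall_override else "exit"
    raise ValueError(f"unknown build status: {last_build!r}")
```

The point of the gate is that a broken build never reaches a box by accident; pushing past it requires an explicit override.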

Page 14: Nagios in the Agile /  DevOps  / Continuous Deployment World

Product Development

Ideation, UI Design, Usability Testing, etc -> Tech Design -> Monitoring and Alerting Coverage.. Nagios -> Production Maintenance

Page 15: Nagios in the Agile /  DevOps  / Continuous Deployment World


Tech Designs & New Nagios Alert Requests


Page 16: Nagios in the Agile /  DevOps  / Continuous Deployment World


Nagios Alert Request Template


Page 17: Nagios in the Agile /  DevOps  / Continuous Deployment World


Big Data / De-Sharding

► Data freshness is critical to help make the right business decisions

► Nagios used for ETL/DW status and error checking

► Nagios and Ops embeds can help empower your Data Infrastructure team
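A data freshness check of the kind described above could look roughly like the following. This is a minimal sketch, not IMVU's check: the function name, the thresholds and the way the last load timestamp is obtained are assumptions. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin convention.

```python
import time

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3   # standard Nagios plugin exit codes


def check_freshness(last_load_epoch, now=None, warn_s=3600, crit_s=4 * 3600):
    """Return (exit_code, message) for the age of the newest ETL load."""
    if last_load_epoch is None:
        return UNKNOWN, "UNKNOWN - no load timestamp found"
    now = time.time() if now is None else now
    age = int(now - last_load_epoch)
    if age >= crit_s:
        return CRITICAL, f"CRITICAL - data is {age}s old"
    if age >= warn_s:
        return WARNING, f"WARNING - data is {age}s old"
    return OK, f"OK - data is {age}s old"
```

A real plugin would print the message and exit with the returned code, so Nagios can pick up both the status and the text.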

Page 18: Nagios in the Agile /  DevOps  / Continuous Deployment World

Things will FAIL


Page 19: Nagios in the Agile /  DevOps  / Continuous Deployment World

How we try to prevent and catch failures

Local Acceptance Tests -> Hypo Builds -> Buildbot -> Automated Cluster Immunity (CI) -> Manual QA using roll-out -> Nagios -> 3rd party like webmetrics, customers, etc

Page 20: Nagios in the Agile /  DevOps  / Continuous Deployment World

Cluster Immune System

Automated push monitoring and rollback !

Push to X% of servers -> Monitor Critical Metrics
    Bad -> Auto Rollback
    Good -> Push to rest -> Monitor Critical Metrics
        Bad -> Auto Rollback
        Good -> w00t!, my change is Live
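The push, monitor and rollback loop on this slide can be sketched as below. This is an illustration only: deploy, rollback and metrics_ok are caller-supplied stand-ins, and the 10% canary fraction is an assumed value for the slide's "X%".

```python
def staged_push(servers, deploy, rollback, metrics_ok, canary_fraction=0.1):
    """Push to a fraction of servers, check metrics, then push to the rest.

    deploy/rollback take a list of servers; metrics_ok() returns True when
    the critical metrics look healthy. All three are caller-supplied stand-ins.
    """
    cut = max(1, int(len(servers) * canary_fraction))
    canary, rest = servers[:cut], servers[cut:]

    deploy(canary)
    if not metrics_ok():
        rollback(canary)
        return "rolled back"

    deploy(rest)
    if not metrics_ok():
        rollback(canary + rest)
        return "rolled back"
    return "live"
```

Checking the metrics twice matters: a change that looks fine on a small slice of servers can still misbehave once it carries full traffic.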

Page 21: Nagios in the Agile /  DevOps  / Continuous Deployment World

Don’t just rely on Standard Metrics


Page 22: Nagios in the Agile /  DevOps  / Continuous Deployment World


Demystifying P1s ( Priority 1 )

P1: Priority 1 issue impacting live operations

Phases

► Identification (Nagios )

► Communication and Declaration

► Resolution

► Postmortem / 5 Whys / Root Cause Analysis

► P1 follow up

Page 23: Nagios in the Agile /  DevOps  / Continuous Deployment World


5 Whys / Postmortem (PM) / Root Cause Analysis

► 5 Whys process

► Amazing culture of running blameless postmortems

► New Nagios checks are the most common action items.

► A lot of monitoring and alerting on business and application level metrics was originally the outcome of PMs

Page 24: Nagios in the Agile /  DevOps  / Continuous Deployment World


Example “5 Whys” Process

Page 25: Nagios in the Agile /  DevOps  / Continuous Deployment World


Monitor Business & Application Level Metrics

Page 26: Nagios in the Agile /  DevOps  / Continuous Deployment World


Monitor Response Times

Load Average is a meaningless number

Page 27: Nagios in the Agile /  DevOps  / Continuous Deployment World


Continuous Monitoring ( Istatd )

► Developed by IMVU

► Sub-10-second resolution of data

► API to get average, standard deviation (SD), min, max and sample count for each data point in a graph

► Ability to stack multiple graphs on the fly

► Long retention times

► Releasing as open source this week !!!

https://github.com/imvu-open/istatd/wiki
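To make the per-data-point statistics above concrete, here is a small sketch that groups raw samples into 10-second buckets and computes average, SD, min, max and sample count for each. It is not istatd's API or implementation; the function name and the input format are assumptions for the example.

```python
import math
from collections import defaultdict


def bucket_stats(samples, bucket_s=10):
    """Group (epoch_seconds, value) samples into fixed-width buckets.

    Returns {bucket_start: {"avg", "sd", "min", "max", "count"}}.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[int(ts) - int(ts) % bucket_s].append(value)

    stats = {}
    for start, values in sorted(buckets.items()):
        n = len(values)
        avg = sum(values) / n
        # Population standard deviation of the samples in this bucket.
        sd = math.sqrt(sum((v - avg) ** 2 for v in values) / n)
        stats[start] = {"avg": avg, "sd": sd, "min": min(values),
                        "max": max(values), "count": n}
    return stats
```

Keeping min, max and SD next to the average means a short spike inside a bucket stays visible instead of being averaged away.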

Page 28: Nagios in the Agile /  DevOps  / Continuous Deployment World


Istatd: 10 Second Resolution of Data

Page 29: Nagios in the Agile /  DevOps  / Continuous Deployment World


Istatd: Stacking graphs on the fly

Page 30: Nagios in the Agile /  DevOps  / Continuous Deployment World

Have a “Strategy” for Monitoring and Alerting

Page 31: Nagios in the Agile /  DevOps  / Continuous Deployment World


Our (Nagios) Strategy

► Human element of Monitoring and Alerting ( Nagios )

► Nagios & Test Driven Development ( TDD )

► Decouple ( Nagios )

► Aggregated Checks

Page 32: Nagios in the Agile /  DevOps  / Continuous Deployment World


Human Element of Monitoring and Alerting

► Have zero tolerance for false positives. You do not want your ops staff to walk into the office the next morning looking like zombies ;)

► Do not let people develop immunity to pages, or real issues will soon be ignored

► "All pages are actionable" policy: if there is no action to take, it should not page

► Automatically re-enable alerts/notifications that were improperly silenced.

► Ownership and accountability of issues/alerts
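The slide does not say how "improperly silenced" is detected. As one possible interpretation, the sketch below treats any service whose notifications have been off for longer than a fixed window as a candidate, and builds the matching Nagios external commands. ENABLE_SVC_NOTIFICATIONS is a standard Nagios external command; the function name, the input tuples and the 8-hour window are assumptions.

```python
def reenable_commands(silenced, now, max_silence_s=8 * 3600):
    """Build Nagios external commands for notifications silenced too long.

    silenced: iterable of (host, service, silenced_at_epoch) tuples.
    Returns lines suitable for writing to the Nagios command file.
    """
    lines = []
    for host, service, silenced_at in silenced:
        if now - silenced_at > max_silence_s:
            lines.append(f"[{int(now)}] ENABLE_SVC_NOTIFICATIONS;{host};{service}")
    return lines
```

Run from cron, a job like this keeps a forgotten "temporary" silence from turning into a permanent blind spot.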

Page 33: Nagios in the Agile /  DevOps  / Continuous Deployment World


Daily Triage of Nagios Alerts and Interrupts

Page 34: Nagios in the Agile /  DevOps  / Continuous Deployment World


Nagios & Test Driven Development (TDD)

► Write tests for your Nagios Infrastructure

► Adopted heavily by Ops ( important for keeping pace with engineering; DevOps culture is awesome )

► High degree of confidence in pushing changes

► Things will eventually change ( OS, libraries, logic, people, Nagios version, etc ). Tests will make the change much smoother.

► Functional testing can still be a challenge
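A unit test for check logic might look like the following. The talk does not show IMVU's tests, so this is only a sketch: disk_status is a toy check written for the example, and its thresholds are arbitrary.

```python
import unittest


def disk_status(used_pct, warn=80, crit=90):
    """Toy check logic: map disk usage to a Nagios exit code."""
    if used_pct >= crit:
        return 2   # CRITICAL
    if used_pct >= warn:
        return 1   # WARNING
    return 0       # OK


class DiskStatusTest(unittest.TestCase):
    def test_ok_below_warning(self):
        self.assertEqual(disk_status(50), 0)

    def test_warning_at_threshold(self):
        self.assertEqual(disk_status(80), 1)

    def test_critical_at_threshold(self):
        self.assertEqual(disk_status(90), 2)
```

Saved to a file, the tests run with `python -m unittest <file>`. Keeping the threshold logic in a plain function, separate from the code that gathers the data, is what makes it testable without a live host.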

Page 35: Nagios in the Agile /  DevOps  / Continuous Deployment World


Sample Nagios Test Output

Page 36: Nagios in the Agile /  DevOps  / Continuous Deployment World


Decouple Nagios

We do it using “Fact, Worker, Reporter & Aggregator” Model

[Diagram: Worker, Redis, Reporter and Aggregator, connected by arrows labelled "fact" and "fact status"]

Page 37: Nagios in the Agile /  DevOps  / Continuous Deployment World


Why Decouple ?

For scalability and efficiency

Our model performed better than NRPE

Lets you make changes ( like thresholds ) in one place instead of on something like 1,000 machines ( if using NRPE )

Lets you do aggregated checks, a very simple but powerful concept that reduces paging levels by a ton
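The two benefits above, thresholds in one place and aggregated checks, can be illustrated with a toy version of the reporter and aggregator roles. This is a loose interpretation of the model, not IMVU's code: a plain dictionary stands in for the facts stored in Redis, and the metric name, threshold and 25% limit are invented for the example.

```python
OK, CRITICAL = 0, 2

# Thresholds live in one place (the reporter), not on every monitored host.
THRESHOLDS = {"response_ms": 500}


def report(facts):
    """Reporter: turn raw facts {host: {metric: value}} into per-host status."""
    return {
        host: CRITICAL if any(v > THRESHOLDS[m] for m, v in metrics.items()
                              if m in THRESHOLDS) else OK
        for host, metrics in facts.items()
    }


def aggregate(statuses, max_bad_fraction=0.25):
    """Aggregator: page only when too many hosts in the pool are unhealthy."""
    bad = sum(1 for s in statuses.values() if s != OK)
    if statuses and bad / len(statuses) > max_bad_fraction:
        return CRITICAL, f"CRITICAL - {bad}/{len(statuses)} hosts unhealthy"
    return OK, f"OK - {bad}/{len(statuses)} hosts unhealthy"
```

With an aggregated check, one slow web server out of many shows up in the status text without paging anyone; the page fires only when the pool as a whole is degraded.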

Page 38: Nagios in the Agile /  DevOps  / Continuous Deployment World

Closing Remarks

Page 39: Nagios in the Agile /  DevOps  / Continuous Deployment World


Closing Remarks

► Monitoring and Alerting (M&A) is mission critical for any business, invest properly and smartly in it

► Don’t limit the usage of Nagios to just Ops. The secret to widespread adoption is to make things frictionless

► Bathroom breaks can take 5-10 minutes, so don’t fret too much about Nagios performance

► Build some form of predictive monitoring and alerting to catch and alert on changes in trends

► Invest in configuration automation, validation and compliance

► Finally, Nagios has been like a Honda, very reliable !!!
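For the predictive monitoring remark above, a very simple starting point is to flag a value that falls far outside the recent spread of a metric. The sketch below is an assumption-laden stand-in for trend detection, not something described in the talk: the function name and the 3-standard-deviation rule are chosen for illustration.

```python
import statistics


def trend_alert(history, latest, k=3.0):
    """Return True when latest is more than k standard deviations from
    the mean of history (a simple stand-in for trend-change detection)."""
    if len(history) < 2:
        return False          # not enough data to estimate spread
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    if sd == 0:
        return latest != mean
    return abs(latest - mean) > k * sd
```

A check like this needs no hand-tuned threshold per metric, which makes it a cheap complement to fixed warning and critical levels.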

Page 40: Nagios in the Agile /  DevOps  / Continuous Deployment World

Questions ?

Page 41: Nagios in the Agile /  DevOps  / Continuous Deployment World


Thank You !!!

[email protected]

We are Hiring: imvu.com/jobs

Engineering Blog: http://engineering.imvu.com/