Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company
DESCRIPTION
Eric Loyd's case study presentation, Nagios Implementation Case: Eastman Kodak Company. The presentation was given during the Nagios World Conference North America, held Sept 25-28th, 2012 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna
TRANSCRIPT
![Page 1: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/1.jpg)
Nagios Implementation Case: Eastman Kodak Company
Eric Loyd
Founder & CEO
Bitnetix Incorporated
877.BITNETIX
![Page 2: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/2.jpg)
© 2012 Bitnetix Incorporated
About Eric Loyd and Bitnetix
Founder and CEO of Bitnetix Incorporated
VOIP services and IT/network consulting
25 Years in IT at places like
Eastman Kodak
Frontier Communications
Global Crossing
Bitnetix started its seventh year in July, 2012
2012 Digital Rochester GREAT Award Finalist in Communications Technology
Using Nagios to monitor our client equipment and VOIP platform, and still using it at Kodak, where it has been in place since 2004
![Page 3: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/3.jpg)
A History of Eastman Kodak’s kodak.com Web Server Infrastructure (non-confidential)
![Page 4: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/4.jpg)
History of kodak.com
Pre-2004
Machines located in Rochester, NY
Public Apache servers
Reverse proxy Apache servers
Application servers (ATG/Dynamo, Tomcat, etc)
Database boxes, Production Support, etc.
2004 – Moved ~80 machines from ROC -> ???
ROC <-> ??? Firewalls
Bandwidth requirements
Minimal user impact
Flipped the switch, went live
![Page 5: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/5.jpg)
History of kodak.com
Some of the things kodak.com did at the time
Consumer store and product information
B2B portal and wholesaler purchasing
“Picture Of The Day” (www.kodak.com/go/potd)
Warranty registration
Photo lab calibration strips
“Phone home” reports for printers, docks, cameras, etc
Software/firmware updates
Corporate press releases, bios, and regulatory information
Reverse proxy for internal information through secure channels
Dozens of sitelets for products and campaigns
![Page 6: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/6.jpg)
Why Kodak Chose Nagios to Monitor kodak.com
![Page 7: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/7.jpg)
Why Nagios?
No centralized corporate monitoring software
Nothing to compete with internally
Nothing to build on, either
Cost
No additional cost beyond existing human resources
Framework
Nagios worked with firewalls without needing agents
Leverage SSH, HTTP and other remote protocols
Custom checks and notifications (very important)
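To illustrate what a custom check can look like, here is a minimal sketch of a threshold check that follows the standard Nagios plugin convention (exit status 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN, plus one line of status text with optional performance data). The function name `check_threshold`, its arguments, and the `sessions` label are hypothetical and are not taken from Kodak's checks.

```shell
#!/bin/sh
# Hypothetical custom check: compare a numeric value against thresholds.
# Usage: check_threshold LABEL VALUE WARN CRIT
check_threshold() {
    label="$1"; value="$2"; warn="$3"; crit="$4"

    # Anything that is not a non-negative integer is reported as UNKNOWN.
    case "$value" in
        ''|*[!0-9]*) echo "UNKNOWN - $label: bad value '$value'"; return 3;;
    esac

    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - $label is $value | $label=$value;$warn;$crit"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - $label is $value | $label=$value;$warn;$crit"
        return 1
    fi
    echo "OK - $label is $value | $label=$value;$warn;$crit"
    return 0
}
```

A call such as `check_threshold sessions 85 80 95` would report WARNING. Because the result is just text and an exit status, a check like this can be run over SSH without installing an agent on the remote machine.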
![Page 8: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/8.jpg)
Initial Hurdles in the New Complex Server Environment
![Page 9: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/9.jpg)
kodak.com Network
![Page 10: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/10.jpg)
Initial hurdles
Firewalls
Public load balancers on external Internet IPs
Public Apaches in Zone 1, Kodak network
Reverse proxy, app servers in Zone 2, semi-secure
Nagios machine in internal Zone 3, most secure
Complex “top” and “bottom” checks for web site
Is the site working from the user’s perspective (top)?
From the application side (bottom)?
How to separate apparent from actual failure
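One way to think about separating apparent from actual failure is to combine the "top" and "bottom" results into a single diagnosis. The sketch below is an assumption about how that reasoning could be written down, not the logic used at Kodak; the function name `diagnose` and the wording of the messages are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: combine a "top" (user-facing) and a "bottom"
# (application-side) check result into a single diagnosis.
# Arguments are Nagios plugin exit codes: 0 = OK, anything else = failing.
diagnose() {
    top="$1"; bottom="$2"
    if [ "$top" -eq 0 ] && [ "$bottom" -eq 0 ]; then
        echo "healthy"
    elif [ "$top" -ne 0 ] && [ "$bottom" -eq 0 ]; then
        # The application answers locally but users cannot reach it:
        # look at the path in between (load balancer, Apache, proxy, firewall).
        echo "path failure: application up, front end unreachable"
    elif [ "$top" -eq 0 ] && [ "$bottom" -ne 0 ]; then
        # Users still get pages although a backend check is failing.
        echo "degraded: backend failing, site still serving"
    else
        echo "outage: application down"
    fi
}
```

For example, `diagnose 2 0` points at the path between the user and the application rather than at the application itself.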
![Page 11: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/11.jpg)
Initial hurdles
No Internal Nagios Knowledge
It was a contractor who set up Nagios (me)
Contractors typically have a finite lifespan at Kodak
Contractor made custom checks, event handlers, and all Nagios configurations. Uh-oh…
Escalation and Paging
Screw it – let’s email everyone, every time and let Thunderbird sort it all out
Paging done via texting gateway email address
Which means email gateway failure = notification failure
Twitter API as backup / current primary notification
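The single point of failure described above (email gateway down means no pages) is what a backup channel addresses. A minimal sketch of a notifier that tries channels in order follows. The function names are hypothetical, and the two channel functions are stand-ins that only simulate success and failure; real ones would call the mail gateway or a messaging API.

```shell
#!/bin/sh
# Hypothetical notifier chain: try each channel in order until one succeeds.
# Each channel is a command or function that takes the message as its
# only argument and returns 0 on success.
notify_with_fallback() {
    message="$1"; shift
    for channel in "$@"; do
        if "$channel" "$message"; then
            echo "sent via $channel"
            return 0
        fi
        echo "channel $channel failed, trying next" >&2
    done
    echo "all channels failed" >&2
    return 1
}

# Stand-in channels for illustration only.
send_pager_email() { return 1; }   # simulates a dead email gateway
send_backup_feed() { return 0; }   # simulates a working backup channel
```

With these stand-ins, `notify_with_fallback "CRITICAL: Dynamo down" send_pager_email send_backup_feed` reports that the message went out through the backup channel.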
![Page 12: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/12.jpg)
SSH to Remote Servers
![Page 13: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/13.jpg)
SSH to the rescue
One user, one key, infinite access
Software apps run as second user, with SSH auth
Additional robot accounts can be added at any time
Wrap existing checks in an SSH shell
Provides additional control, error handling, reporting
Allows all checks to submit results to SQL database
SQL database side note: all custom scripts ran command-line Perl code that locked a file, logged to it, and unlocked it. A Perl cron job woke up every 5 minutes, locked the file, read it, pushed the entries to Oracle, unlocked the file, and deleted it. A second cron job pruned Oracle daily to 400 days of data and collapsed checks older than 30 days so that successive checks with the same status were removed.
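The wrap-and-log idea can be sketched in shell. The original used Perl and Oracle, so this is a swapped-in illustration of the same pattern, not the author's code. The paths, the function names, and the pipe-separated log format are assumptions. The lock relies on `mkdir` being atomic.

```shell
#!/bin/sh
# Hypothetical wrapper around an existing check.
# SPOOL and LOCK are illustrative paths, not the ones used at Kodak.
SPOOL="${SPOOL:-/tmp/nagios-results.log}"
LOCK="${LOCK:-$SPOOL.lock}"

# Append one line to the spool file under a lock. mkdir is atomic, so
# only one process at a time gets past the loop.
log_result() {
    tries=0
    until mkdir "$LOCK" 2>/dev/null; do
        tries=$((tries + 1))
        [ "$tries" -ge 10 ] && return 1   # give up rather than block the check
        sleep 1
    done
    echo "$1" >> "$SPOOL"
    rmdir "$LOCK"
}

# Usage: wrapped_check HOST SERVICE COMMAND [ARGS...]
# In production COMMAND would typically be an ssh call to the remote host.
wrapped_check() {
    host="$1"; service="$2"; shift 2
    output=$("$@" 2>&1); status=$?
    # Plugins use 0-3; anything else (ssh returns 255 on connection
    # errors) is reported as UNKNOWN.
    [ "$status" -gt 3 ] && status=3
    log_result "$(date +%s)|$host|$service|$status|$output"
    echo "$output"
    return "$status"
}
```

A periodic job could then take the same lock, load the spool file into a database, and truncate it, which mirrors the five-minute cron job described in the side note.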
![Page 14: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/14.jpg)
Managing Nagios Configuration Files
![Page 15: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/15.jpg)
Configuration Management
SCCS
Solaris’s “poor man’s CVS”
Pre-installed, no additional cost, existing expertise
Current configuration is managed through SVN
Rsync – the workhorse to move config files
Configuration Repository and Push (CRaP) directory
Cfengine
Local versus remote execution
Post-install, ignore pid files, deploy/restart, etc.
Makefile – the “CLI” to the entire process
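The verify-then-push step of such a workflow can be sketched as a shell function that a Makefile target might call. This is an assumption about the general shape, not the Kodak tooling. `STAGE`, `TARGET`, and `NAGIOS_BIN` are hypothetical values, and `DRY_RUN=1` prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical deploy step: verify the staged configuration, then push it.
# STAGE, TARGET and NAGIOS_BIN are illustrative values.
STAGE="${STAGE:-./crap/nagios/}"
TARGET="${TARGET:-nagios@monitor.example.com:/usr/local/nagios/etc/}"
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"

# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

push_config() {
    # nagios -v performs a pre-flight check of the configuration.
    run "$NAGIOS_BIN" -v "${STAGE}nagios.cfg" || {
        echo "config check failed, not pushing" >&2
        return 1
    }
    # -a preserves permissions and times, -z compresses in transit.
    run rsync -az "$STAGE" "$TARGET"
}
```

Refusing to push when the pre-flight check fails keeps a broken configuration from reaching the monitoring host.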
![Page 16: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/16.jpg)
Common Event Handler
![Page 17: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/17.jpg)
Common Event Handler
EKrestart – That Which Does
Setup
• Arguments
• Conversions
• do_soft/hard?
• do_something?
• do_restart
do_restart
• Lock, logs, SQL
• send_nagios
• SSH to remote
• Remote EKrestart
• Process args
• do_<service>
• send_nagios
• Unlock, log, SQL
• Terminate
do_<service>
• Locks (level 2)
• Instance mapping
• Port mapping
• App restart
• Email & log
• Exit
![Page 18: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/18.jpg)
A Closer Look at EKrestart

    #!/bin/sh
    PATH=...

    # When invoked with -r we are running on the remote (client) side.
    [ "$1" = "-r" ] && client_code

    # Arguments passed in by the Nagios event handler definition.
    host="$1"
    service="$2"
    baseService=`echo "$service" | awk -F: '{print $1}'`
    state="$3"
    type="$4"
    tries="$5"
    perfdata="$6"
    class="<based on machine name, e.g., x-y-CLASS-nnn.kodak.com>"
    number="<based on machine name, e.g., x-y-class-NNN.kodak.com>"

    # Dispatch on the service state.
    case "$state" in
        OK)       do_fixit;;
        WARNING)  do_nothing;;
        UNKNOWN)  do_nothing;;
        CRITICAL) do_something;;
        *)        do_nothing;;
    esac
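The slides leave out how `class` and `number` are derived from the machine name. Assuming host names really do follow the `x-y-CLASS-NNN.kodak.com` shape shown in the placeholders, one possible way to extract the two fields is sketched below. The helper names and the sample host name are hypothetical.

```shell
#!/bin/sh
# Hypothetical parsing of class and number from a host name shaped like
# x-y-CLASS-NNN.kodak.com (the real naming scheme is not shown in the slides).
# The name is split on both "-" and "."; fields 3 and 4 are class and number.
host_class()  { echo "$1" | awk -F'[-.]' '{print $3}'; }
host_number() { echo "$1" | awk -F'[-.]' '{print $4}'; }
```

For a made-up name such as `web-us-dynamo-012.kodak.com`, `host_class` yields `dynamo` and `host_number` yields `012`.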
![Page 19: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/19.jpg)
A Closer Look at EKrestart

    do_fixit() {
        case "$baseService" in
            Workers) do_restart;;
            *)       do_nothing;;
        esac
    }

    do_nothing() {
        $debug && echo "$service is in $state state ($type) for $tries tries."
    }

    do_something() {
        case "$type" in
            SOFT) do_soft;;     # Take action before it's too late?
            HARD) do_restart;;  # Hard CRITICAL - Our last chance to take action
            *)    do_nothing;;
        esac
    }

    do_soft() {
        case "$tries" in
            3|4|5) do_restart;;  # Okay, let's restart it before it goes hard
            *)     do_nothing;;  # Don't restart yet
        esac
    }
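Because the handler restarts production services, it helps to be able to exercise the decision logic without side effects. The sketch below is a dry-run restatement of the dispatch shown on these slides: it prints the action instead of taking it. The function name `decide` and its argument order are hypothetical. Note that alternatives in a `case` pattern are separated with `|`; a pattern written as `3,4,5` would only match that literal string.

```shell
#!/bin/sh
# Dry-run version of the dispatch logic: prints the action instead of
# taking it. Usage: decide STATE TYPE TRIES BASE_SERVICE
decide() {
    state="$1"; type="$2"; tries="$3"; base="$4"
    case "$state" in
        OK)
            # Mirrors do_fixit: only "Workers" is restarted on recovery.
            case "$base" in Workers) echo restart;; *) echo nothing;; esac;;
        CRITICAL)
            case "$type" in
                HARD) echo restart;;
                SOFT) case "$tries" in 3|4|5) echo restart;; *) echo nothing;; esac;;
                *)    echo nothing;;
            esac;;
        *)  echo nothing;;
    esac
}
```

For instance, `decide CRITICAL SOFT 2 Dynamo` prints `nothing`, while `decide CRITICAL SOFT 3 Dynamo` prints `restart`.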
![Page 20: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/20.jpg)
A Closer Look at EKrestart

    do_restart() {
        # <figure some stuff out, set up lock files, send_nagios, log to SQL, etc>
        ssh $machine <EKrestart> -r do_$service <parameters>
        # <tear down, unlock, close log, send_nagios, log to SQL, etc>
        exit
    }

    # On the client side, we use the same EKrestart script, but start at client_code()
    client_code() {
        host=`hostname`
        function="$2"
        service="$3"
        # (etc)
        eval $function
        exit
    }

    # Example function
    do_Dynamo() {
        # lock file processing
        # turn off new sessions, wean existing ones
        # /etc/init.d/restart_dynamo_$instance
        # tear down
        return
    }
![Page 21: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/21.jpg)
Integrating Nagios into Operational Procedures
![Page 22: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/22.jpg)
Integration with Operations
Homebrew API
nchart, send_nagios, nlog – all portable to other installations of Nagios on other machines
Integrate with start/stop scripts
Lock files. Lots of lock files! TOO MANY lock files!!
The “Rippler”
Leverage EKrestart, cron, and send_nagios
Pager / Twitter and lots of private twitter feeds
Inter-group notifications
Predominately with procmail
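The slides do not show what `send_nagios` does internally. If it reports results back to Nagios, one common mechanism is a passive check result written to the Nagios external command file using the `PROCESS_SERVICE_CHECK_RESULT` command. The sketch below shows that mechanism only. The function name `submit_result` and the default `CMD_FILE` path are assumptions, and the real path is whatever `command_file` is set to in `nagios.cfg`.

```shell
#!/bin/sh
# Hypothetical stand-in for a send_nagios-style helper: submit a passive
# service check result through the Nagios external command file.
# CMD_FILE is an assumed path; use the command_file value from nagios.cfg.
CMD_FILE="${CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"

# Usage: submit_result HOST SERVICE RETURN_CODE "plugin output"
submit_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "$1" "$2" "$3" "$4" >> "$CMD_FILE"
}
```

A start/stop script could call something like `submit_result web01 Dynamo 0 "OK - restarted by operator"` so that Nagios reflects a planned restart right away.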
![Page 23: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/23.jpg)
Predictive Failure Recovery and a Good Night’s Sleep
![Page 24: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company](https://reader033.vdocument.in/reader033/viewer/2022042614/558cbfd3d8b42a7f788b4593/html5/thumbnails/24.jpg)
Predictive Failure Recovery
On ATG/Dynamo (and other) services
do_soft triggers do_restart on third failure
do_hard always triggers restart
Notifications on fourth failure
Escalation to pager only on fifth notification
Nagios has time to restart things that are bad, or are going bad, prior to sending out notifications
Service check dependencies allow us to know whether it’s a bad application, server, or user experience
Twitter – follow private tweets with smartphone, use apps to acknowledge problems, and get an even better night’s sleep!!
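To get a feel for how much time this gives Nagios to self-heal before anyone is paged, the arithmetic can be sketched as follows. It assumes the first notification goes out when the service turns HARD on its last check attempt, and that later notifications repeat at a fixed interval. The numbers in the example (4 attempts, 1-minute retries, 10-minute re-notification) are hypothetical and are not Kodak's settings; they correspond to the Nagios directives `max_check_attempts`, `retry_interval`, and `notification_interval`.

```shell
#!/bin/sh
# Minutes from the first failed check until the pager goes off.
# Usage: minutes_until_page MAX_ATTEMPTS RETRY_INTERVAL NOTIFICATION_INTERVAL PAGE_AT
minutes_until_page() {
    max_attempts="$1"; retry="$2"; renotify="$3"; page_at="$4"
    # Attempts 2..max are spaced RETRY minutes apart; the first
    # notification goes out when the state turns HARD.
    first_notification=$(( (max_attempts - 1) * retry ))
    echo $(( first_notification + (page_at - 1) * renotify ))
}

# Who is contacted for a given notification number, with paging
# starting at the fifth notification as described on the slide.
notification_targets() {
    if [ "$1" -ge 5 ]; then echo "email pager"; else echo "email"; fi
}
```

With the example numbers, `minutes_until_page 4 1 10 5` prints `43`: three minutes of retries, then four re-notification intervals before the fifth notification reaches the pager.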