Lee Myers - What to do when Nagios notifications don't meet your needs
What to do when Nagios notifications don't meet your needs?
You Push It
Background
Career Start
Intel - ASCI Red Supercomputer
• 1st TeraFlops Supercomputer
• 102 Cabinets - Drive & Compute clusters
• 4,536 Nodes
• 9,216 Processors (Pentium Pros)
• 9,216 Cores
• 1,600 Square Feet
Currently
NCAR - Yellowstone Computer
• 2012: 13th with 1.5 PetaFlops, Now 50th
• 94 Cabinets - 74 Compute & 10 Drive clusters
• 4,542 Nodes
• 9,036 Processors (Intel Xeon E5-2670)
• 72,288 Cores
• 2,000 Square Feet
Nagios Configuration
Primary Instance
• Hosts - 1289
• Services - 3235
Total Instances
• Hosts - 1410
• Services - 3867
Test Instance
• Hosts - 20,007
• Services - 40,045
• Passive Results from scripts
Primary Instance
• 4 Check_MK Monitored Servers
• 5 Remote Servers sending Passive Results
• 4 Sites being Monitored
Normal Load < 1 with 5 instances running.
Load with Test running < 4
Using OMD 1.2 (Nagios 3.5, Check_MK 1.2.4p5, Thruk 1.84-6, PNP4Nagios 0.6.24)
Nagios Notification Configuration
Host / Service
• notification_period
  – 24x7
  – workhours
• contact_groups
Contact
• service_notification_period
  – 24x7
  – workhours
• host_notification_period
  – 24x7
  – workhours
• service_notification_options
  – w,u,c,r,f
• host_notification_options
  – d,u,r
Standard Work Week
Simple distinction between work and home.
Non-Standard Rotating Work Week
Complex and Every Week is Different.
Since we have 24x7 coverage, why did we want notifications?
We are not always in our Operations Center at Night
• Doing nightly Visual Inspections
• Replacing hardware in the Supercomputer
• Working with facilities
• Talking with Security
• Eating a meal in our Kitchen
• Watching fireworks with facilities
• ...
Our initial Failure
No Sound from iPad Web or Apps
What We Needed
• Interface to Nagios Data
• Something to Parse for Unacknowledged Alerts
• Something to send out Notifications
• Program to give us our alerts on our Mobile Devices
Interface to Nagios Data
Check_MK Livestatus
• Nagios Broker Module
• Written by Mathias Kettner
• Direct Connection to Nagios through a UNIX Socket
• No Database to administer
• No Configuration needed
• Single line needs to be added to nagios.cfg
• Access it from the shell with unixcat
• Uses Livestatus Query Language
• http://mathias-kettner.com/checkmk_livestatus.html
Example:
root@linux# echo 'GET hosts' | unixcat /var/lib/nagios/rw/live
acknowledged;action_url;address;alias;check_command;check_period;checks_enabled;contacts;in_check_period;in_notification_period;is_flapping;last_check;last_state_change;name;notes;notes_url;notification_period;scheduled_downtime_depth;state;total_services
0;/nagios/pnp/index.php?host=$HOSTNAME$;127.0.0.1;Acht;check-mk-ping;;1;check_mk,hh;1;1;0;1256194120;1255301430;Acht;;;24X7;0;0;7
0;/nagios/pnp/index.php?host=$HOSTNAME$;127.0.0.1;DREI;check-mk-ping;;1;check_mk,hh;1;1;0;1256194120;1255301431;DREI;;;24X7;0;0;1
0;/nagios/pnp/index.php?host=$HOSTNAME$;127.0.0.1;Drei;check-mk-ping;;1;check_mk,hh;1;1;0;1256194120;1255301435;Drei;;;24X7;0;0;4
Something to Parse - Livestatus
LQL Queries
• “GET” and name of Table
• Arbitrary number of header lines consisting of a keyword, a colon and arguments.
• Empty line or ‘End of Transmission’
Tables
hosts           services         hostgroups
contacts        commands         servicegroups
log             timeperiods      contactgroups
status          downtimes        hostsbygroup
columns         statehist        comments
servicesbygroup servicesbyhostgroup
Columns
Columns: <list of column names to return in order>
Filters
Filter: <column name> <operator> <value>
Operators: =, ~, =~, ~~, <, >, <=, >=, !=, !~, !=~, !~~
Values: number, text
Combining filters
Or: <last X filters>
And: <last X filters>
Negate:
Others - Counting, Sums, Max, Min, Std Dev, and more
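To show how the headers above combine, here is a minimal sketch that prints an LQL query for unacknowledged WARNING or CRITICAL services. The function name lql_unacked_services is hypothetical, and the socket path in the usage comment is the one from the earlier example; it will differ between installations.

```shell
#!/bin/bash
# Hypothetical helper: print an LQL query for unacknowledged
# WARNING (state 1) or CRITICAL (state 2) services.
lql_unacked_services() {
    printf '%s\n' \
        'GET services' \
        'Columns: host_name description state plugin_output' \
        'Filter: state = 1' \
        'Filter: state = 2' \
        'Or: 2' \
        'Filter: acknowledged = 0'
    # An empty line ends the query.
    printf '\n'
}

# Usage (needs a running Livestatus socket; the path is an assumption):
#   lql_unacked_services | unixcat /var/lib/nagios/rw/live
```

"Or: 2" joins the two state filters; the remaining filter is then ANDed with that result, which is the default when several filters are listed.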
Send out Notifications
Pushbullet
• Free
• Several APIs
  – Android Extensions
  – iPhone
  – HTTP API
• https://docs.pushbullet.com
We're interested in the HTTP API, since we are not writing a custom mobile app.
HTTP API Calls
• Objects
  – /v2/pushes
  – /v2/devices
  – /v2/contacts
  – /v2/users/me
• Accounts
  – /oauth2
And more API calls which we don’t use.
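A minimal sketch of a note push through /v2/pushes, assuming the access token is in a PUSHBULLET_TOKEN environment variable (a hypothetical name). It sends the token in an Access-Token header with a JSON body, which is one of the forms the Pushbullet documentation describes. The helpers json_escape, note_payload and send_note are illustrative and not part of any Pushbullet tooling.

```shell
#!/bin/bash
# Escape backslashes and double quotes so a string can sit inside JSON quotes.
# (Sketch only: control characters such as newlines are not handled.)
json_escape() {
    local s=$1
    s=${s//\\/\\\\}
    s=${s//\"/\\\"}
    printf '%s' "$s"
}

# Build the JSON body for a "note" push from a title and a body.
note_payload() {
    printf '{"type":"note","title":"%s","body":"%s"}' \
        "$(json_escape "$1")" "$(json_escape "$2")"
}

# Send a note. PUSHBULLET_TOKEN is an assumed environment variable.
send_note() {
    curl --silent --fail \
        --header "Access-Token: ${PUSHBULLET_TOKEN}" \
        --header 'Content-Type: application/json' \
        --data-binary "$(note_payload "$1" "$2")" \
        https://api.pushbullet.com/v2/pushes > /dev/null
}

# Usage:
#   PUSHBULLET_TOKEN=... send_note "web01" "PING CRITICAL - Packet loss = 100%"
```

Building the payload in its own function keeps the escaping separate from the network call, so it can be checked without sending anything.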
Deliver to our Mobile Devices
Our Solution
nagios_push.sh
#!/bin/bash
# Get the person's access code for pushbullet
read AccessCode < /home/$USER/PushBulletAccessCode
# Query nagios for host alerts and send them to pushbullet
for i in $(/opt/omd/versions/1.00/bin/unixcat < /usr/local/sbin/PushBullet_query_hosts /omd/sites/noc/tmp/run/live | tr ' ' '_' | cut -f1,2 -d';'); do
    curl -u $AccessCode: https://api.pushbullet.com/v2/pushes -d type=note -d title="${i%;*}" -d body="${i#*;}" > /dev/null 2>&1
done
# Query nagios for service alerts and send them to pushbullet
for i in $(/opt/omd/versions/1.00/bin/unixcat < /usr/local/sbin/PushBullet_query_services /omd/sites/noc/tmp/run/live | tr ' ' '_' | cut -f1,2 -d';'); do
    curl -u $AccessCode: https://api.pushbullet.com/v2/pushes -d type=note -d title="${i%;*}" -d body="${i#*;}" > /dev/null 2>&1
done
PushBullet Command Files
/usr/local/sbin/PushBullet_query_hosts
GET hosts
Columns: name plugin_output state
Filter: state > 0
Filter: acknowledged = 0
Filter: host_scheduled_downtime_depth = 0
/usr/local/sbin/PushBullet_query_services
GET services
Columns: name plugin_output state
Filter: state > 0
Filter: acknowledged = 0
Filter: scheduled_downtime_depth = 0
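The for loops in nagios_push.sh replace spaces with underscores so that word splitting yields one alert per iteration, which means the pushed text arrives with underscores in it. As a sketch of one alternative (not the script the talk used), the loop below reads the Livestatus output line by line and splits on ';' instead. Here notify is a hypothetical stand-in for the curl call and only prints what it would send.

```shell
#!/bin/bash
# Hypothetical stand-in for the curl call; prints instead of pushing.
notify() {
    printf 'title=%s body=%s\n' "$1" "$2"
}

# Read "name;plugin_output;state" lines from stdin and notify once per line.
# Caveat: the split is on every ';', so plugin output that itself contains
# a semicolon is cut short at that point.
push_alerts() {
    local name output state
    while IFS=';' read -r name output state; do
        [ -n "$name" ] || continue   # skip blank lines
        notify "$name" "$output"
    done
}

# Usage with the query file and socket path from nagios_push.sh:
#   unixcat < /usr/local/sbin/PushBullet_query_hosts /omd/sites/noc/tmp/run/live | push_alerts
```

Because each line is read whole, spaces in the plugin output survive and no tr step is needed.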
Our Support Scripts
npush_on
#!/bin/bash
#Make sure it is not run as root
if [ $UID -eq 0 ]
then
    echo "Not to be run as root."
    exit
fi
if (crontab -l|grep -q nagios_push.sh)
then
    #UnComment out the crontab
    crontab -l | sed -e 's/#*\*\/4 \* \* \* \* \/usr\/local\/sbin\/nagios_push.sh/\*\/4 \* \* \* \* \/usr\/local\/sbin\/nagios_push.sh/'|crontab
else
    #Append the item to the crontab
    (crontab -l; echo "*/4 * * * * /usr/local/sbin/nagios_push.sh")|crontab
fi
#Let the user know when you are turning off the npush
hour=$(date +%H)
if [ "$hour" -lt 18 -a "$hour" -ge 6 ]; then
    /usr/bin/at -f /usr/local/bin/npush_off 7pm
    echo "Turning off npush at 7 PM"
else
    /usr/bin/at -f /usr/local/bin/npush_off 7am
    echo "Turning off npush at 7 AM"
fi
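The shift logic in npush_on can be pulled into a small function, which makes the 6 AM and 6 PM boundaries easy to check by hand. This is a sketch only; npush_off_time is a hypothetical name and is not part of the original scripts.

```shell
#!/bin/bash
# Print the `at` time spec for switching npush off, given an hour (00-23).
# Day shift (06:00-17:59) ends at 7pm; otherwise the night shift ends at 7am.
npush_off_time() {
    local hour=$((10#$1))   # force base 10 so "08" and "09" are not read as octal
    if [ "$hour" -ge 6 ] && [ "$hour" -lt 18 ]; then
        echo "7pm"
    else
        echo "7am"
    fi
}

# Usage:
#   /usr/bin/at -f /usr/local/bin/npush_off "$(npush_off_time "$(date +%H)")"
```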
npush_off
#!/bin/bash
#Comment out the crontab
crontab -l | sed -e 's/\*\/4 \* \* \* \* \/usr\/local\/sbin\/nagios_push.sh/#\*\/4 \* \* \* \* \/usr\/local\/sbin\/nagios_push.sh/'|crontab
Future Upgrades
• Read Google Calendar for our schedule, no more remembering to turn it on.
• Send email alerts to PushBullet. (Without false alerts)
• Remove the Crontab line, instead of commenting it out.
• Anything else we can think of.
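For the crontab item, one possible approach is to filter the entry out rather than comment it. The sketch below assumes every line that mentions nagios_push.sh should go, whether active or commented out; strip_npush is a hypothetical name.

```shell
#!/bin/bash
# Drop every crontab line (active or commented out) that mentions nagios_push.sh.
# Reads a crontab on stdin and writes the filtered crontab to stdout.
strip_npush() {
    grep -v 'nagios_push\.sh' || true   # grep exits 1 when nothing is left; not an error here
}

# Usage:
#   crontab -l | strip_npush | crontab -
```

With the line removed entirely, npush_on would always take its append branch, so the uncomment sed would no longer be needed.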
Questions