
Linux Clusters Institute: Monitoring

Kyle Hutson – System Administrator for Kansas State University [email protected]

Why monitoring?

• How should we get notified?
• What should we monitor?
• How often should we monitor?
• Internal vs. external
• Informational vs. urgent

2 May 20, 2015

How should we get notified?

• Urgent:
  • Email or text
  • Define this carefully
• Not-so-urgent:
  • Web page updates
    • Especially helpful for historical data
  • Email (filtered)
  • End-user support requests


What should we monitor?

• External: basic connectivity
• Internal:
  • The urgent
    • Power status
    • Scheduler/head node status
    • Cold-aisle temperatures
    • Storage system
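
An external basic-connectivity check can be as small as a TCP connection attempt from a machine outside the cluster. The sketch below is one possible way to do it, not a prescribed method; the function name `tcp_reachable` and the host name in the usage note are hypothetical.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and connects in one step.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

For example, `tcp_reachable("headnode.example.edu", 22)` (a placeholder host name) would report whether SSH on the head node answers from wherever the check runs.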


Lots of little things

• Overall cluster health
  • Queue size
  • Overall network usage
  • Number of responding nodes
• Individual node health
  • Load average
  • Memory used
  • Network bandwidth
  • CPU usage
  • Temperature
• Storage
  • Capacity
  • Degraded status
  • Connectivity
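
Two of the per-node items above, load average and memory used, can be read directly on a Linux node. The following is a minimal sketch under stated assumptions: it expects `/proc/meminfo`-style text with a `MemAvailable` field (present on reasonably recent Linux kernels), and the helper names are illustrative only.

```python
import os


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_used_fraction(meminfo: dict) -> float:
    """Fraction of memory in use; assumes MemTotal and MemAvailable exist."""
    total = meminfo["MemTotal"]
    return (total - meminfo["MemAvailable"]) / total


def load_per_core(load1: float, cores: int) -> float:
    """Normalize the 1-minute load average by the number of cores."""
    return load1 / max(cores, 1)
```

On a node, the inputs could come from `open("/proc/meminfo").read()`, `os.getloadavg()[0]`, and `os.cpu_count()`.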


Security

• Securing the cluster
• Security status updates
• Any failures
  • sudo reports
  • Network login failures (e.g. fail2ban)
  • crontab failures
  • Logfile errors (customize to fit)
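
Network login failures can also be summarized straight from the logs. The sketch below counts failed SSH password attempts per source address. It assumes the usual OpenSSH "Failed password for ... from ... port ..." message wording, which may differ on a given system, and it is not how fail2ban itself is implemented; the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed OpenSSH log wording; adjust the pattern to match local logs.
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?(\S+) from (\S+) port \d+"
)


def failed_logins_by_source(lines):
    """Count failed password attempts per source address."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


def sources_over_threshold(counts, threshold):
    """Return the sorted sources with at least `threshold` failures."""
    return sorted(src for src, n in counts.items() if n >= threshold)
```

Feeding it the lines of the system's authentication log and reporting `sources_over_threshold(counts, 5)` in a daily email would fit the "not-so-urgent" category.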


How often?

• You will quickly get a feel for this
• Too much info is often worse than too little info
• The “urgent”: continually
• The “not-so-urgent”: anywhere from a few times per day to once per week
• There’s nothing wrong with trial and error


How to make it happen

• Nagios/NRPE (Nagios Remote Plugin Executor)
  • Generic executable that runs “plugins”
  • Plugins can monitor just about anything you can think of monitoring
  • Even works with Windows
  • Nagios (http://www.nagios.org/) is by far the most common monitoring system
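
A plugin is simply a program that prints one status line and returns an exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN, with optional performance data after a `|`. The sketch below shows what a small load-average plugin might look like; the thresholds and function names are illustrative assumptions, not values from this talk.

```python
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_load(load1: float, warn: float, crit: float):
    """Map a 1-minute load average to (exit code, status line)."""
    if load1 >= crit:
        status = CRITICAL
    elif load1 >= warn:
        status = WARNING
    else:
        status = OK
    # Text before "|" is for humans; text after it is performance data.
    message = (
        f"LOAD {LABELS[status]} - load1={load1:.2f}"
        f" | load1={load1:.2f};{warn};{crit};0"
    )
    return status, message


def run_plugin(warn: float = 8.0, crit: float = 16.0) -> int:
    """Read the load average, print the status line, return the exit code."""
    try:
        load1 = os.getloadavg()[0]
    except OSError:
        print("LOAD UNKNOWN - could not read load average")
        return UNKNOWN
    status, message = evaluate_load(load1, warn, crit)
    print(message)
    return status

# A real plugin script would end with: sys.exit(run_plugin())
```

Keeping the threshold logic in `evaluate_load` separate from the system call makes the plugin easy to test without touching a live node.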



• Icinga (https://www.icinga.org/)
  • Can use NRPE
  • (New) version 2 has its own client
  • Uses database backend for history
  • Multi-threaded and multihomed



• Ganglia (http://ganglia.sourceforge.net/) - for historical and resource monitoring
  • Ours are public
  • RRD files give historical data (a.k.a. “lots of pretty graphs”)



• New alternative to Ganglia: Graphite (http://graphite.wikidot.com/)
  • Uses “whisper” instead of RRD (smaller files)
  • Scaling is better than Ganglia
  • Dynamic data points let you see exactly what you want (with some practice)
  • Still in beta
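
Feeding data points to Graphite is straightforward: its carbon listener accepts plaintext lines of the form `metric.path value timestamp`, by default on TCP port 2003. The sketch below is a minimal sender under those assumptions; the metric names and the host in the usage note are hypothetical.

```python
import socket
import time


def format_metric(path: str, value, timestamp=None) -> str:
    """Build one plaintext-protocol line: 'path value timestamp\\n'."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host: str, port: int = 2003, timeout: float = 5.0) -> None:
    """Send already-formatted lines to a carbon plaintext listener."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
```

For example, `send_metrics([format_metric("cluster.node01.load1", 0.42)], "graphite.example.edu")` would record one load sample, assuming a carbon listener is running on that placeholder host.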

