TRANSCRIPT
Linux Clusters Institute: Monitoring
Kyle Hutson – System Administrator for Kansas State University [email protected]
Why monitoring?
• How should we get notified?
• What should we monitor?
• How often should we monitor?
• Internal vs external
• Informational vs urgent
2 May 20, 2015
How should we get notified?
• Urgent:
  • Email or text
  • Define this carefully
• Not-so-urgent:
  • Web page updates
    • Especially helpful for historical data
  • Email (filtered)
  • End-user support requests
What should we monitor?
• External: Basic connectivity
• Internal:
  • The urgent
    • Power status
    • Scheduler/head node status
    • Cold-aisle temperatures
    • Storage system
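A basic external connectivity check can be as simple as trying to open a TCP connection to each critical host. The sketch below is one possible approach, not the presenter's tooling; the function names and the example host names in the usage note are hypothetical.

```python
import socket
from typing import Callable, Iterable, List, Tuple


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def unreachable_nodes(
    nodes: Iterable[Tuple[str, int]],
    probe: Callable[[str, int], bool] = is_reachable,
) -> List[str]:
    """Return the hosts whose probe failed, in input order."""
    return [host for host, port in nodes if not probe(host, port)]
```

Usage might look like unreachable_nodes([("headnode.example.edu", 22), ("storage.example.edu", 22)]), run from a machine outside the cluster so that it exercises the same path users take. The probe argument is injectable so the logic can be tested without touching the network.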
Lots of little things
• Overall cluster health
  • Queue size
  • Overall network usage
  • Number of responding nodes
• Individual node health
  • Load average
  • Memory used
  • Network bandwidth
  • CPU usage
  • Temperature
• Storage
  • Capacity
  • Degraded status
  • Connectivity
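Two of the per-node metrics listed above, load average and memory used, can be read directly on a Linux node. The following is a minimal sketch, assuming a kernel recent enough to report MemAvailable in /proc/meminfo; the function names are illustrative and not part of any monitoring package.

```python
import os
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_used_percent(meminfo: Dict[str, int]) -> float:
    """Percent of RAM in use, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 100.0 * (total - available) / total


def node_health() -> Dict[str, float]:
    """Collect a small health snapshot for the local Linux node."""
    load1, _load5, _load15 = os.getloadavg()
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    return {
        "load1": load1,
        # Load divided by core count is easier to compare across node types.
        "load_per_cpu": load1 / (os.cpu_count() or 1),
        "mem_used_pct": memory_used_percent(mem),
    }
```

Calling node_health() on a compute node returns a small dictionary that a collector could forward to whichever monitoring system is in use.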
Security
• Securing the cluster
• Security status updates
• Any failures
  • sudo reports
  • Network login failures (e.g. fail2ban)
  • crontab failures
  • Logfile errors (customize to fit)
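To illustrate what watching network login failures involves, here is a rough sketch that counts failed SSH password attempts per source address. It assumes the common OpenSSH "Failed password for ..." log message; the threshold and function names are hypothetical. It only counts and reports, whereas fail2ban also bans offenders.

```python
import re
from collections import Counter
from typing import Iterable, List

# Matches OpenSSH messages such as:
#   Failed password for root from 203.0.113.5 port 4242 ssh2
#   Failed password for invalid user admin from 203.0.113.5 port 4243 ssh2
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?(?P<user>\S+) "
    r"from (?P<ip>\S+) port \d+"
)


def count_failed_logins(lines: Iterable[str]) -> Counter:
    """Count failed password attempts per source IP address."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


def offenders(counts: Counter, threshold: int = 5) -> List[str]:
    """Return the IPs with at least `threshold` failures, sorted."""
    return sorted(ip for ip, n in counts.items() if n >= threshold)
```

Feeding it the lines of the system's authentication log (the path varies by distribution) gives a per-address tally that suits a filtered daily email better than an urgent alert.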
How often?
• You will quickly get a feel for this
• Too much info is often worse than too little info
• The “urgent” – continually
• The “not-so-urgent” – anywhere from a few times per day to once per week
• There’s nothing wrong with trial and error
How to make it happen
• Nagios/NRPE (Nagios Remote Plugin Executor)
  • Generic executable that runs “plugins”
  • Plugins can monitor just about anything you can think of monitoring
  • Even works with Windows
• Nagios (http://www.nagios.org/) is by far the most common monitoring system
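A Nagios plugin is just an executable that prints one status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The sketch below shows that convention with a hypothetical load check; the script name and thresholds are made up for illustration, and it is not one of the stock plugins.

```python
import os
import sys
from typing import List, Tuple

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_load(load1: float, warn: float, crit: float) -> Tuple[int, str]:
    """Map a 1-minute load average to a plugin status and output line."""
    if load1 >= crit:
        status = CRITICAL
    elif load1 >= warn:
        status = WARNING
    else:
        status = OK
    # Text after "|" is optional performance data: label=value;warn;crit
    message = (
        f"LOAD {LABELS[status]} - load1={load1:.2f} "
        f"| load1={load1:.2f};{warn};{crit}"
    )
    return status, message


def main(argv: List[str]) -> int:
    """Expects argv[1] = warning threshold, argv[2] = critical threshold."""
    try:
        warn, crit = float(argv[1]), float(argv[2])
        load1 = os.getloadavg()[0]
    except (IndexError, ValueError, OSError) as exc:
        print(f"LOAD UNKNOWN - {exc}")
        return UNKNOWN
    status, message = evaluate_load(load1, warn, crit)
    print(message)
    return status


# When saved as an executable script, finish with:
#     sys.exit(main(sys.argv))
```

Run by hand as, say, ./check_load_sketch.py 4 8, it prints a single line and sets the exit code. NRPE can then run the same script on a remote node and relay the result to the Nagios server.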
How to make it happen
• Icinga (https://www.icinga.org/)
  • Can use NRPE
  • (New) version 2 has its own client
  • Uses database backend for history
  • Multi-threaded and multihomed
How to make it happen
• Ganglia (http://ganglia.sourceforge.net/) - for historical and resource monitoring
  • Ours are public
  • RRD files give historical data (a.k.a. “lots of pretty graphs”)
How to make it happen
• New alternative to Ganglia: Graphite (http://graphite.wikidot.com/)
  • Uses “whisper” instead of RRD (smaller files)
  • Scaling is better than Ganglia
  • Dynamic data points let you see exactly what you want (with some practice)
  • Still in beta
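Graphite's carbon daemon accepts data points over a plaintext protocol: one line per point, "path value timestamp", normally on TCP port 2003. The sketch below shows how a script might feed it a metric; the metric path and the default host are assumptions for illustration.

```python
import socket
import time
from typing import Optional


def format_metric(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Build one plaintext-protocol line: '<path> <value> <timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"


def send_metric(
    path: str,
    value: float,
    host: str = "localhost",  # assumed location of the carbon daemon
    port: int = 2003,         # default plaintext listener port
    timeout: float = 3.0,
) -> None:
    """Send a single data point to carbon over TCP."""
    line = format_metric(path, value)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))
```

For example, send_metric("cluster.node01.load1", 0.42) records one point under a hypothetical dotted path. Graphite creates the whisper file for a new path on first write, which is part of what makes the data points dynamic.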