Evolution of the Monitoring in the LHCb Online System
Christophe Haen, E. Bonaccorsi and N. Neufeld
LHCb Online team (CERN), GENEVA, CH
10th October 2013
Plan
1 The LHCb Online System
2 Feedback of the current infrastructure
3 Alternatives: Nagios4, Shinken, Icinga2
4 Benchmark
5 Conclusion
The LHCb Online System
LHCb
One of the four large experiments of the LHC.
Relies on a large and heterogeneous IT infrastructure.
Thousands of servers, different hardware configurations, great variety of tasks
Future monitoring for LHCb, C. Haen
The LHCb Online System
Distributed monitoring infrastructure
Single Icinga 1.8.4 instance
ido2db with local MySQL (SSD disks)
mod_gearman 1.4.2
NRPE and NSClient++
Nand for mail aggregation
Positive aspects
Pros
Performance: 40 000 checks in a 5 min window without latency
Ease of scalability with mod_gearman
Group and template functionalities of the configuration: factorization
New web interface
Mail aggregation is good and necessary
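As a rough sense of scale for the performance figure above, the sustained check rate it implies can be worked out directly: about 133 checks per second. This is a back-of-the-envelope sketch only; the function name is illustrative, and it assumes the 40 000 checks are spread evenly over the 5-minute window.

```python
def required_check_rate(total_checks: int, window_seconds: float) -> float:
    """Average checks per second needed to run every check once per window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return total_checks / window_seconds


if __name__ == "__main__":
    # Figures from the slide: 40 000 checks in a 5-minute window.
    rate = required_check_rate(40_000, 5 * 60)
    print(f"{rate:.1f} checks/s")
```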
Negative aspects
Cons
Icinga instance is a single point of failure
Dependency system unsatisfactory
Performance with big environment failures
Very static: no easy access to live information, no configuration change while running
(Configuration parsing and loading time in the database)
Nagios4
Major release, currently in beta version
Performance improvements
Better algorithms
Gives up the fork model in favor of worker processes (mod_gearman-like)
Claim: -87% IOPS, -42% CPU, -64% memory
Configuration logic slightly changed (Beware!)
Shinken (part 1)
Pioneer of the next-generation tools
All in Python
Extends Nagios’ philosophy
Innovative technical design
Shinken (part 2)
Dynamic (“calculated” checks, cluster support, virtualization, etc.)
Extends Nagios’ configuration (services applied to templates, composition of templates, “foreach” macro, etc.)
Automatic configuration generation
Business oriented
Icinga2
Early development stages
Separate branch from Icinga 1.x
C++ with lots of Boost
Distributed core
Totally different configuration
Remote agent
Dynamic
Business oriented
Test bench
Candidates
Tuned Icinga + 15 remote Gearman workers
Shinken (slightly tuned)
Out-of-the-box Icinga2
Out-of-the-box Nagios4
Procedure
60 000 services on 2 000 hosts
No historical data
t=0: everything OK
t=1000s: 90% services fail
t=2000s: everything recovers
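The failure scenario above can be sketched as a simulated check. This is a hypothetical illustration, not the plugin used in the benchmark: the function names, the way the failing 90% is chosen (by service index), and the exact boundary handling at 1000 s and 2000 s are all assumptions. The exit codes follow the standard Nagios plugin convention (0 for OK, 2 for CRITICAL).

```python
OK, CRITICAL = 0, 2      # standard Nagios plugin exit codes
FAIL_START_S = 1000      # from the slide: 90% of services fail at t=1000s
FAIL_END_S = 2000        # from the slide: everything recovers at t=2000s
FAIL_PERCENT = 90


def simulated_status(service_index: int, elapsed_s: float) -> int:
    """Return the exit code a dummy check would report at a given elapsed time."""
    in_failure_window = FAIL_START_S <= elapsed_s < FAIL_END_S
    # Assumption: failing services are picked deterministically by index,
    # so that 90 out of every 100 consecutive indices fail.
    selected = service_index % 100 < FAIL_PERCENT
    return CRITICAL if in_failure_window and selected else OK


def count_failing(n_services: int, elapsed_s: float) -> int:
    """Number of services reporting CRITICAL at a given elapsed time."""
    return sum(
        simulated_status(i, elapsed_s) == CRITICAL for i in range(n_services)
    )


if __name__ == "__main__":
    n_services, n_hosts = 60_000, 2_000  # figures from the slide
    print(n_services // n_hosts, "services per host on average")
    for t in (0, 1000, 1500, 2000):
        print(f"t={t}s: {count_failing(n_services, t)} services failing")
```

Selecting the failing services by index rather than at random keeps the scenario reproducible across the four candidates, which is one plausible way to make their curves comparable.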
Latency
Icinga Gearman: knock-on effect
Shinken: increase during big failures
Nagios4: flat everywhere
Icinga2: bump at the beginning
Service checks
Icinga Gearman: slow increase at 1000s
Shinken: step increase
Nagios4: steep increase after 1000s
Icinga2: very fast at startup
Reaction time
Icinga Gearman: relatively slow
Shinken: step function, slow
Nagios4: fast
Icinga2: very fast
Conclusion
Icinga with Gearman was a good move
Still has some weaknesses
Nagios4 is not an option
Icinga2 is extremely promising performance-wise
Shinken seems slower, but is very dynamic and has many features
Further tests to be done once stable versions of both are available