log analysis in the accelerator sector

27
Log analysis in the accelerator sector Steen Jensen, BE-CO-DO

Upload: penney

Post on 24-Feb-2016

32 views

Category:

Documents


0 download

DESCRIPTION

Log analysis in the accelerator sector. Steen Jensen, BE-CO-DO. Outline. Introduction to the controls system Our motivation for log analysis Requirements The current setup Use cases Conclusions. Outline. Introduction to the controls system Our motivation for log analysis - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Log  analysis in  the accelerator sector

Log analysis in the accelerator sectorSteen Jensen, BE-CO-DO

Page 2: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 20122

Outline

Introduction to the controls system Our motivation for log analysis Requirements The current setup Use cases Conclusions

Page 3: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 20123

Outline

Introduction to the controls system Our motivation for log analysis Requirements The current setup Use cases Conclusions

Page 4: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 20124

CERN’s accelerator complex

Page 5: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 20125

Cern’s Control Centre, CCC

Page 6: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 20126

Control system infrastructureGPN

T

T T T T

OperatorConsoles

Fixed Displays

File Servers Appl Servers

PVSS Servers

Real-time Linux LynxOS Computers

WorldFIPSystems PLC

s

Cryo, Vacuum, …

Magnet Control, RFQuench Protection, …

Beam instrum. KickersMachine Protection, …

Other operationalConsoles

Oracle

Technet

Timing

Page 7: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 20127

Hardware

425 consoles 300 servers 1300 front ends 600 module types ~85.000 device instances

Page 8: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 20128

Software Java – operational code

400 GUIs, 200 servers ~8MLOC, > 1000 jars Up to 1000 processes 80 people, 10 groups

C/C++ > 1300 Front End servers Real-time processes Drivers (Linux/LynxOS) 80 people, 8 groups

Page 9: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 20129

Software environment

Operating systems Linux, LynxOS, Windows

Languages Java, C++, C, Scripts

Developers Staff, Fellows, Students – as primary or secondary activity CO, OP and Equipment Groups

Legacy 10+ years

Page 10: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 201210

Connections in middleware layer

Page 11: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 201211

Outline

Introduction to the controls system Our motivation for log analysis Requirements The current setup Use cases Conclusions

Page 12: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 201212

Motivation

Further improve quality and availability of the controls system

New responsibility model => non-experts must be able to quickly identify who to call

Increase diagnostic efficiency Widen and deepen diagnostic capabilities Proactive & preventive maintenance & evolution We already have many log files that can be exploited

better

Page 13: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 201213

Log data characteristics

Continuous stream of text messages ~2Gb per day, bursts > 10Gb per day Multiple transport mechanisms

syslog, log4J, JMS Scattered, system-specific log files Diversity in format and length Ambiguity in log level interpretation Variations in frequency Differences in relevance

Page 14: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 201214

Outline

Introduction to the controls system Our motivation for log analysis Requirements The current setup Use cases Conclusions

Page 15: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 201215

Use cases

Allow people to diagnose systems other than their own Facilitate correlation of messages across systems Detect what has changed if something does not work Observe trends over time Easily obtain customized views of one or more systems Receive notifications based on custom criteria

Page 16: Log  analysis in  the accelerator sector

16

Non-functional requirements

Easy to use and access by (non-)specialists Overviews and graphical visualisations Efficient data mining Allow for knowledge capture, sharing and re-use

grep/awk/sed does not provide this

Looking at Splunk, it is unrealistic to develop own system

Openlab workshop on data analytics, CERN, November 16th 2012

Page 17: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 201217

Outline

Introduction to the controls system Our motivation for log analysis Requirements The current setup Use cases Conclusions

Page 18: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 201218

Log data pre-processing

Centralization Filtering of irrelevant messages Throttling and burst protection

Down from 2+ Gb / day to ~100 Mb / day, not counting Java, i.e. likely to increase considerably

Page 19: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 201219

Splunk setup

Central instance receiving pre-filtered messages via syslog and JMS

Automated alerts Dashboards and saved searches Manual monitoring and follow-up

Page 20: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 201220

Splunk dashboard

Page 21: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 201221

RBAC denials on SET

Page 22: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 201222

Outline

Introduction to the controls system Our motivation for log analysis Requirements The current setup Use cases Conclusions

Page 23: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 201223

Detailed case: Leap second

On July 1st, 01:59:59 a 60th leap second was added globally Using Splunk, a system administrator discovered that

Some computers failed to add the leap second on time Certain types of computers added the leap second at a later moment There seems to be no pattern over time No computer added a leap second more than once

Page 24: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 201224

Use cases RBAC security

Tokens missing, malformed or expired CMW Client issues

Slow consumer, zombies, incorrect accesses, error handling Timing system

telegram layout issues Performance testing of new system

CO Frameworks Improvement: Token check prior to access Bug: applying wrong token in certain cases Bug: Loop on timing error

Page 25: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 201225

Use cases, continued

Design considerations Architectural decisions based on usage info

Driver error reporting, missing module System issues

yum updates misbehaving due to race condition Separation between operational and test environments

Page 26: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 201226

Outline

Introduction to the controls system Our motivation for log analysis Requirements The current setup Use cases Conclusions

Page 27: Log  analysis in  the accelerator sector

Openlab workshop on data analytics, CERN, November 16th 201227

Conclusions Log analysis is very valuable – we learned a lot using Splunk

Knowledge about control system usage Pro-active detection, trimming, improving Re-active diagnostics

It is a continuous discover-learn-improve process It involves a mix of

automated alerts manual monitoring ad hoc data mining

Experience allows improving log data quality Culture: Developer usage Structure: Key-value pairs

Splunk is a very promising tool in CO’s context, and projects show strong interest