
Page 1: Experiments Monitoring
Page 2: Experiments Monitoring

Experiments Monitoring Plans and Progress


Summary by

Alberto AIMAR – CERN

Input from ALICE, ATLAS, CMS, LHCb

Page 3: Experiments Monitoring

Outline

• ALICE

• ATLAS

• CMS

• LHCb

• Summary


Page 4: Experiments Monitoring

ALICE/CERN monitoring and QA

(slides from Jacek Otwinowski and Costin Grigoras)

ALICE - 1

Page 5: Experiments Monitoring

Data processing monitoring

Current situation
• Grid infrastructure monitored by MonALISA
  • Jobs (resources, status), data servers, status of distributed services, etc.
• Same solution used for the DAQ, for a custom real-time display
• Online (High Level Trigger) cluster monitored by Zabbix

Online/Offline (O2) in Run3/4
• Particular constraints on the performance of the future system:
  • 100k sources (processes), 600 kHz data points, <500 ms latency
• Considered solutions: Zabbix, MonALISA, a monitoring stack (collectd/flume/influxdb/grafana/kafka/spark/…); see the sketch below
• Final solution to be chosen after benchmarking the above
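The candidate stack above includes Kafka as a transport for metric data points. Purely as an illustration (not the actual O2 design), publishing one data point with the kafka-python client could look like the sketch below; the broker address, topic name, and metric fields are assumptions.

```python
# Illustrative only: publish one monitoring data point to Kafka as JSON.
# Broker address, topic name, and the metric fields are assumptions.
import json
import time

from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                        # assumed broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # JSON payloads
)

point = {
    "source": "flp-001/readout-42",    # one of the ~100k sources (made-up name)
    "metric": "buffer_usage",          # hypothetical metric name
    "value": 0.73,
    "timestamp": time.time(),
}

producer.send("o2-monitoring", value=point)  # assumed topic name
producer.flush()                             # block until delivered
```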

ALICE - 2

Page 6: Experiments Monitoring

Quality Assurance in Run2

Data Quality Monitoring / Online QA
• Detector performance monitored by the Automatic MOnitoRing Environment (AMORE)
  • Agents are monitoring processes (they receive and analyze raw data samples)
  • Visualization with a dedicated AMORE GUI
• Detector and tracking performance on the High Level Trigger
  • Data producers and mergers communicate via ZeroMQ (see the sketch below)
  • OVERWATCH system for detector performance monitoring via a web interface

Offline QA
• Processing on the grid with AliEn
• Post-processing on the QA cluster at CERN
• QA output (ROOT histograms and trees) stored on the AFS/EOS file systems
• Custom visualization tools
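The producer/merger exchange mentioned above can be illustrated with a minimal pyzmq publish/subscribe sketch; the endpoint, topic, and payload are placeholders, not the actual HLT configuration.

```python
# Minimal pyzmq PUB/SUB sketch of a data producer and a merger.
# Endpoint, topic, and payload are illustrative placeholders.
import time
import zmq

ctx = zmq.Context.instance()

# "Merger" side: subscribe to one topic (normally a separate process).
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://localhost:5556")
sub.setsockopt(zmq.SUBSCRIBE, b"qa.histograms")

# "Producer" side: bind a PUB socket and publish (topic, payload) frames.
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5556")
time.sleep(0.2)  # give the subscription time to propagate (slow-joiner effect)
pub.send_multipart([b"qa.histograms", b"<serialized histogram payload>"])

topic, payload = sub.recv_multipart()  # blocks until the message arrives
print(topic, len(payload))
```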

ALICE - 3

Page 7: Experiments Monitoring

Quality Assurance in Run3/4

Online/Offline (O2) in Run3/4
• Parallel synchronous and asynchronous data processing (ALFA framework with FairMQ)
• QA repository/database
  • NoSQL database (Elasticsearch, Cassandra or others) for metadata information
  • EOS file system for storing QA objects (ROOT histograms and trees)
• Data aggregation and enrichment
  • Custom solution based on ROOT (friend trees)
  • Apache Kafka is considered
• Interactive QA analysis
  • SWAN/Jupyter notebooks
  • Custom database queries using Python and C++ APIs
  • Custom visualization tools based on JSROOT

ALICE - 4

Page 8: Experiments Monitoring

Example QA in Run3/4

[Screenshot: Elasticsearch at CERN queried through the Python API from Jupyter/SWAN]
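A minimal sketch of this kind of interactive query, assuming a recent elasticsearch Python client and purely hypothetical endpoint, index, and field names:

```python
# Query QA metadata from Elasticsearch inside a Jupyter/SWAN notebook.
# Endpoint, index name, and fields are illustrative assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://es-alice-qa.cern.ch:9200")  # hypothetical endpoint

# Fetch recent QA records for one detector and list the EOS paths
# of the corresponding ROOT objects.
resp = es.search(
    index="qa-metadata",                                # hypothetical index
    query={"bool": {"filter": [
        {"term": {"detector": "TPC"}},
        {"range": {"run": {"gte": 500000}}},
    ]}},
    size=20,
)

for hit in resp["hits"]["hits"]:
    doc = hit["_source"]
    print(doc["run"], doc["eos_path"])                  # hypothetical fields
```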

ALICE - 5

Page 9: Experiments Monitoring

ATLAS Monitoring

(input from M.Lassnig, A.Di Girolamo, A.Filipcic, I.Vukotic)

Page 10: Experiments Monitoring

ATLAS Monitoring and Analytics

• Understand distributed systems
  – Usage characterisation: What do our users do? What do our systems do?
  – Performance characterisation: How long does it take to retrieve? And to insert? …

• Monitor resource usage
  – CPU consumption, efficiency, usage by activities, site efficiency
  – Storage usage, reliability and throughput
  – Data migration, transfers, archival

• Key capabilities
  – Correlate: data from multiple systems
  – Model: using raw and aggregated data, data mining and machine learning toolkits
  – Ad-hoc: analytics for user-requested questions
  – Support: documentation and expert help

ATLAS - 1

Page 11: Experiments Monitoring

Baseline Infrastructure for new monitoring

• ElasticSearch
  – Dedicated instance for ATLAS hosted by CERN IT (IT ES)
  – Instance hosted by University of Chicago
  – As part of MONIT, with curated and enriched data

• Notebooks
  – Zeppelin instance, part of MONIT, and dedicated to ADC

• Hadoop
  – HDFS to store raw and aggregated data
  – Preparation of data for ElasticSearch ingestion
  – Machine learning with Spark
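A rough sketch of the HDFS-plus-Spark part of this pipeline (read raw records, aggregate, write documents ready for ElasticSearch ingestion), with hypothetical paths and a hypothetical record schema:

```python
# Illustrative PySpark job: raw records on HDFS -> daily aggregates ->
# JSON documents for a separate ElasticSearch ingestion step.
# Paths, schema, and field names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("adc-transfer-aggregation").getOrCreate()

# Raw transfer records previously landed on HDFS (hypothetical location/format).
raw = spark.read.json("hdfs:///atlas/analytics/raw/transfers/2017-06-*")

# Daily aggregation per destination site: bytes moved and number of transfers.
daily = (
    raw.withColumn("day", F.to_date("timestamp"))
       .groupBy("day", "dst_site")
       .agg(F.sum("bytes").alias("bytes"),
            F.count("*").alias("n_transfers"))
)

# Write aggregated documents back to HDFS as JSON, ready for ES ingestion.
daily.write.mode("overwrite").json("hdfs:///atlas/analytics/agg/transfers_daily")
```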

ATLAS - 2

Page 12: Experiments Monitoring

ATLAS and MONIT

ATLAS ADC Analytics and Monitoring launched working groups to check, improve and implement the ATLAS MONIT dashboards and reports

• DDM and DDM Accounting
  – Initial dashboards from MONIT ongoing
  – Need checking and improvements by experts

• Running jobs
  – Initial dashboards by MONIT
  – First beta job monitoring / job accounting dashboards
  – Running job slots were needed to start with the dashboards

• Sites monitoring
  – SAM test results available in MONIT
  – A few features being developed (site grouping, availability profiles)
  – ASAP metrics (HammerCloud) modified to send results as JSON

ATLAS - 2

Page 13: Experiments Monitoring

ATLAS and ElasticSearch

ElasticSearch @ UChicago

• Hardware-based
  – 8 nodes total (each 8 cores, 64 GB RAM), 5 data nodes (3x1 TB SSD each)
  – 3 master nodes (3 masters, 1 indexer, 2 Kibana)
  – 10 Gbps NICs
• Contents: 15'000'000'000 docs
• Clients: Kibana, notebooks, embedded visualisations, cron jobs
• Indices in production across ITES (8) and UChicago (8) = 82 TB/year
• Few options available
  – Rolling buffers: cannot look further back than 30 days, not enough for our reports
  – Throw away data: painful to know upfront what could be important, esp. for operations
  – Redo all dashboards from Kibana/Grafana in Spark notebooks: not enough human capacity for this

ATLAS - 3

Page 14: Experiments Monitoring

ATLAS and Hadoop @ CERN

• Exclusively use the analytics cluster at CERN
  – 40 nodes, 2 TB RAM total, 2 PB storage total

• Rucio (Data Management)
  – Dumps from Oracle
  – Flume from DDM servers and daemons (REST calls, logs, …)
  – Custom servlet to serve HTTP streams directly from HDFS
  – Reporting jobs, with results shown via notebooks
  – Kerberos authentication via the (cumbersome) acron method

ATLAS - 5

Page 15: Experiments Monitoring

ES and Hadoop Feedback

• Users are much more comfortable now than half a year ago
  – However, inexperienced users can hit the infrastructure hard
  – Documentation is difficult, systems are changing fast; it is easier to ask colleagues

• ElasticSearch and Hadoop are both critical and used in production
  – Most of the tools/capabilities are available
  – Interplay between tools is not ideal or efficient; lots of custom-made duct tape

• Hadoop is working well, ES is sufficient for now (until late 2017)
  – Limiting for the applications (sliding windows, throwing away data, …)
  – Upgrades necessary w.r.t. storage space and IO rates, at least on the ITES cluster

• Run3/4 considerations
  – 10x increase of rates and volume, in line with the WFMS and DDM upgrades
  – Event-level processing workflows will require unprecedented instrumentation

ATLAS - 6

Page 16: Experiments Monitoring

Wishlist

• Monitoring data volume
  – Data volume is huge; throwing away old data limits the usability
  – Deeper studies require data in ES for at least a year, preferably longer

• Limited functionality
  – Grafana and Kibana are not fully featured, with crucial elements missing (e.g. log scale); how to extend them?

• Data reliability
  – Data/figure/plot validation (e.g. for the RRB) is mandatory; it needs a lot of experience and help from dedicated experts

• Presentation quality
  – Quality of presentation material needs to be at a professional level

ATLAS - 7

Page 17: Experiments Monitoring

CMS Grid Monitoring and Migration to MONIT

(slides from Carl Vuosalo for CMS Monitoring Forum)

Page 18: Experiments Monitoring

CMS Grid Monitoring and Migration to MONIT

CMS monitors its grid resources with a variety of tools.

It is in the process of migrating to CERN's standard MONIT framework.

CMS monitoring covers many kinds of resources:

• CMSWeb – front-end tool for displaying monitoring results

• Data Popularity – dataset usage

• SpaceMon – monitoring disk usage across CMS grid

• Site Status Board – shows status of CMS grid sites

• CRAB/ASO – monitors files transferred for user grid jobs

• HTCondor job monitoring

• WMAgent – production job submission and reporting

• WMArchive – grid job reports

CMS - 1

Page 19: Experiments Monitoring

CMSWeb and Data Popularity

CMSWeb

• CMSWeb hosts and displays other CMS services

• Already uses ElasticSearch and Kibana

• Planning to migrate

Data Popularity

• Data Popularity tracks usage of CMS datasets by CMS jobs

• It uses CMS tools to aggregate and display results

• Planning to migrate to CERN MONIT infrastructure this summer
  • Store and aggregate results using Hadoop file system and Spark
  • Display results with Grafana and Kibana

CMS - 2

Page 20: Experiments Monitoring

SpaceMon and Site Status Board

CMS SpaceMon

• SpaceMon tracks disk usage at CMS sites

• Developed a MONIT prototype that uses ElasticSearch and Kibana
• Data source will move to a message broker (ActiveMQ) instead of Flume

SSB

• Site Status Board displays status of CMS computing sites
• Migrating to MONIT

• Sending data as JSON instead of XML

• Will implement MONIT features (dashboards, reports and extract via APIs)

CMS - 3

Page 21: Experiments Monitoring

CMS SpaceMon

Page 22: Experiments Monitoring

SSB - GlideinWMS Metrics

Page 23: Experiments Monitoring

CRAB/ASO Ops and CMS Job Monitoring

CRAB and ASO Monitoring

• CRAB is CMS tool for user job submission

• ASO is used by CRAB for file transfer

• ASO has migrated all dashboards to MONIT

• Developed Grafana dashboards, fed by Spark jobs

CMS Job Monitoring (data via HTCondor)

• HTCondor job monitoring prototype

• Send raw data from jobs to MONIT via the AMQ message broker (see the sketch below)

• Perform aggregation in MONIT

• Use Grafana for dashboards

CMS - 4
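A hedged illustration of that reporting path using the stomp.py client; the broker host, credentials, destination, and record fields are placeholders, not the actual CMS/MONIT configuration.

```python
# Illustrative only: send one job record as JSON to an ActiveMQ broker
# over STOMP. Host, port, credentials, destination, and fields are made up.
import json
import stomp  # stomp.py

conn = stomp.Connection(host_and_ports=[("monit-broker.example.cern.ch", 61313)])
conn.connect("cms-jobmon", "********", wait=True)   # assumed credentials

record = {
    "job_id": "190623.0",          # hypothetical HTCondor job id
    "site": "T2_US_Wisconsin",
    "status": "Completed",
    "cpu_time_s": 35210,
    "wall_time_s": 39980,
}

conn.send(
    destination="/topic/cms.jobmon",     # assumed destination
    body=json.dumps(record),
    content_type="application/json",
)
conn.disconnect()
```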

Page 24: Experiments Monitoring

CRAB and ASO Monitoring

Page 25: Experiments Monitoring

CMS Job Monitoring (data via HTCondor)

Page 26: Experiments Monitoring

WMAgent and WMArchive

WMAgent

• WMAgent used for production job submission and reporting

• All dashboards now in MONIT

• Using Grafana – some fine-tuning still needed

WMArchive

• WMArchive stores Framework Job Reports from production jobs

• Migrating aggregated results into CERN MONIT system

• WMArchive and WMCore added support for ActiveMQ broker Stomp protocol

• Now streaming data to MONIT broker

• Developing periodic Spark job for sending data

• Creating Kibana displays and Grafana plots

CMS - 5

Page 27: Experiments Monitoring

WMAgent

Page 28: Experiments Monitoring

WMArchive (aggregated job reports)

Page 29: Experiments Monitoring

LHCb Monitoring

(input from F.Stagni and A. McNab)

LHCb - 1

Page 30: Experiments Monitoring

Monitoring system

• Based on the DIRAC framework and designed for:
  • real-time monitoring (WMS jobs, DIRAC components, etc.)
  • managing semi-structured data (JSON)
  • efficient data storage, data analysis and retrieval
  • providing high-quality reports

• Technologies used:
  • Elasticsearch distributed search and analytics engine
  • DIRAC Graph library for creating high-quality plots
  • DIRAC web framework (front end) for visualizing the plots
  • Message queue systems: AMQ (via STOMP) and RabbitMQ (see the sketch below)
  • Kibana flexible analytics and visualization framework (optional: for component monitoring)

LHCb - 2
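To complement the STOMP example in the CMS section, a small sketch of the RabbitMQ side of this message-queue layer using the pika client; the broker address, queue name, and record fields are assumptions, not LHCb's actual setup.

```python
# Illustrative only: publish one monitoring record as JSON to RabbitMQ.
# Broker, queue name, and fields are placeholders.
import json
import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="rabbitmq.example.cern.ch")  # assumed broker
)
channel = connection.channel()
channel.queue_declare(queue="dirac.monitoring", durable=True)   # assumed queue

record = {
    "component": "WorkloadManagement/JobManager",  # hypothetical component
    "metric": "queries_per_minute",
    "value": 418,
}

channel.basic_publish(
    exchange="",                      # default exchange routes by queue name
    routing_key="dirac.monitoring",
    body=json.dumps(record),
    properties=pika.BasicProperties(content_type="application/json"),
)
connection.close()
```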

Page 31: Experiments Monitoring

Overview of the system

LHCb - 3

Page 32: Experiments Monitoring

Monitoring web server

LHCb - 4

Page 33: Experiments Monitoring

Experience and Plans

• Experience:
  • LHCb has been using the Monitoring System since October 2016
  • Elasticsearch clusters are provided by CERN (one for production, one for certification)
  • Size: 108 GB, 1200 indexes, 1280 million documents
  • 8-node cluster: 3 master, 2 search and 3 data nodes

• Plans:
  • More monitoring data
  • Centralized logging of DIRAC components, based directly on Elasticsearch (currently in a pre-production version)
  • Job analytics: worker node, memory, CPU, execution time, etc.

LHCb - 5

Page 34: Experiments Monitoring

Pilot Monitoring (via NagiosProbes)

The WLCG SAM tests (ETF) are plugins for Nagios that check what is working
• Either on the SAM/ETF central machines
• Or inside jobs submitted by SAM/ETF

NagiosProbes is an extension to the next-generation Pilot which can run these tests
• Uses the Nagios plugin API, so the same scripts can be used (see the sketch below)
• Its test results are currently logged to an HTTPS server
• This will be replaced by logging to the pilot logging service
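For context, a Nagios-style plugin is simply a script that prints one status line and exits with a conventional return code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A minimal hedged sketch of such a probe follows; the check itself (free space in a scratch area) is a made-up placeholder, not one of the real SAM/ETF tests.

```python
#!/usr/bin/env python
# Minimal Nagios-plugin-style probe: print one status line and exit with the
# conventional return code. The check below is a placeholder example only.
import shutil
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def main():
    try:
        free_gb = shutil.disk_usage("/tmp").free / 1e9
    except OSError as exc:
        print("UNKNOWN: could not inspect /tmp: %s" % exc)
        return UNKNOWN

    if free_gb < 1:
        print("CRITICAL: only %.1f GB free in /tmp" % free_gb)
        return CRITICAL
    if free_gb < 5:
        print("WARNING: %.1f GB free in /tmp" % free_gb)
        return WARNING
    print("OK: %.1f GB free in /tmp" % free_gb)
    return OK

if __name__ == "__main__":
    sys.exit(main())
```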

LHCb - 6

Page 35: Experiments Monitoring

Pilot NagiosProbes for SAM/ETF

Output of the probes is retrieved by a script we supply to SAM/ETF
• Itself a Nagios plugin, which returns the original probe's output and return code
• We can hide the HTTPS-to-Pilot-Logging transition, since it is all inside our script

A prototype of this is now working
• Running on etf-lhcb-preprod
• Once in production, VM-based resources will immediately appear in the WLCG SAM dashboards / reports

LHCb - 7

Page 36: Experiments Monitoring

Summary

• All are looking at external technologies instead of fully in-house solutions
  • ElasticSearch, Kibana, Kafka, messaging, Spark, HDFS
  • Less custom-made means less control, but huge communities and free improvements

• ATLAS and CMS
  • have their own infrastructure for deeper analytics studies
  • rely on MONIT for monitoring, and for the migration from the WLCG dashboards
  • use MONIT curated data with aggregation, enrichment, processing, buffering, dashboards, etc.

• ALICE and LHCb
  • run their own infrastructure
  • based on central IT services (ES, HDFS)
  • could share (some) data in MONIT, if needed
