Experiments Monitoring Plans and Progress

Summary by Alberto AIMAR – CERN
Input from ALICE, ATLAS, CMS, LHCb
Outline
• ALICE
• ATLAS
• CMS
• LHCb
• Summary
ALICE/CERN Monitoring and QA
(slides from Jacek Otwinowski and Costin Grigoras)
Data Processing Monitoring

Current situation
• Grid infrastructure monitored by MonALISA
  • Jobs (resources, status), data servers, distributed services status, etc.
• Same solution for DAQ, with a custom real-time display
• Online (High Level Trigger) cluster monitored by Zabbix

Online/Offline (O2) in Run 3/4
• Particular constraints on the performance of the future system:
  • 100k sources (processes), 600 kHz data points, <500 ms latency
• Considered solutions: Zabbix, MonALISA, monitoring stack (collectd/Flume/InfluxDB/Grafana/Kafka/Spark/…)
• Final solution to be chosen after benchmarking the above
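The candidate stacks above all revolve around streaming flat metric records from many sources. As a rough illustration only (not the actual O2 data format; all names, topics and hosts below are hypothetical), one data point in such a pipeline could look like:

```python
import json
import time

def make_point(source, metric, value):
    """Build one monitoring data point as a flat, JSON-serializable dict."""
    return {
        "source": source,               # e.g. a process or node identifier
        "metric": metric,               # e.g. "cpu_load"
        "value": value,
        "timestamp_ms": int(time.time() * 1000),
    }

def serialize(point):
    """Encode a point for transport (e.g. as a Kafka message value)."""
    return json.dumps(point).encode("utf-8")

# Publishing side (requires a running broker and the kafka-python package;
# broker address and topic name are invented for this sketch):
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="o2-mon:9092",
#                            value_serializer=serialize)
#   producer.send("o2.monitoring", make_point("flp-01", "cpu_load", 0.42))
```

At 600 kHz a real producer would batch and compress; the point here is only the shape of the record.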
Quality Assurance in Run 2

Data Quality Monitoring / Online QA
• Detector performance monitored by the Automatic MOnitoRing Environment (AMORE)
  • Agents are monitoring processes (they receive and analyze raw data samples)
  • Visualization with a dedicated AMORE GUI
• Detector and tracking performance on the High Level Trigger
  • Data producers and mergers communicate via ZeroMQ
• OVERWATCH system for detector performance monitoring via a web interface

Offline QA
• Processing on the grid with AliEn
• Post-processing on the QA cluster at CERN
• QA output (ROOT histograms and trees) stored on the AFS/EOS file systems
• Custom visualization tools
Quality Assurance in Run 3/4

Online/Offline (O2) in Run 3/4
• Parallel synchronous and asynchronous data processing (ALFA framework with FairMQ)
• QA repository/database
  • NoSQL database (Elasticsearch, Cassandra, or others) for metadata information
  • EOS file system for storing QA objects (ROOT histograms and trees)
• Data aggregation and enrichment
  • Custom solution based on ROOT (friend trees)
  • Apache Kafka is under consideration
• Interactive QA analysis
  • SWAN/Jupyter notebooks
  • Custom database queries using Python and C++ APIs
  • Custom visualization tools based on JSROOT
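As a sketch of what such a custom database query could look like from a notebook (the index name, field names and host below are assumptions for illustration, not the actual ALICE schema):

```python
def qa_run_query(run_number, detector):
    """Build an Elasticsearch bool query selecting QA metadata
    for one run and detector, newest entries first (fields assumed)."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"run": run_number}},
                    {"term": {"detector": detector}},
                ]
            }
        },
        "sort": [{"timestamp": {"order": "desc"}}],
    }

# In a SWAN/Jupyter notebook this would be handed to the Python client
# (elasticsearch package; host and index names are hypothetical):
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch(["qa-es.example.cern.ch"])
#   hits = es.search(index="qa-metadata", body=qa_run_query(123456, "TPC"))
```

The returned hits would then be plotted in the notebook, e.g. via JSROOT, as the slide describes.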
Example QA in Run 3/4

(Screenshot: QA data in Elasticsearch at CERN, accessed via the Python API from Jupyter/SWAN)
ATLAS Monitoring
(input from M. Lassnig, A. Di Girolamo, A. Filipcic, I. Vukotic)
ATLAS Monitoring and Analytics
• Understand distributed systems
  – Usage characterisation: What do our users do? What do our systems do?
  – Performance characterisation: How long does it take to retrieve? And to insert? …
• Monitor resource usage
  – CPU consumption, efficiency, usage by activities, site efficiency
  – Storage usage, reliability and throughput
  – Data migration, transfers, archival
• Key capabilities
  – Correlate: data from multiple systems
  – Model: using raw and aggregated data, data mining and machine learning toolkits
  – Ad hoc: analytics for user-requested questions
  – Support: documentation and expert help
Baseline Infrastructure for new monitoring
• ElasticSearch
  – Dedicated instance for ATLAS hosted by CERN IT (IT ES)
  – Instance hosted by the University of Chicago
  – As part of MONIT, with curated and enriched data
• Notebooks
  – Zeppelin instance, part of MONIT, dedicated to ADC
• Hadoop
  – HDFS to store raw and aggregated data
  – Preparation of data for ElasticSearch ingestion
  – Machine learning with Spark
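A minimal sketch of the "prepare data for ElasticSearch ingestion" step (the record fields and HDFS paths below are invented for illustration; the actual ATLAS jobs differ):

```python
def flatten_job_record(rec):
    """Normalize one raw job record into the flat shape that
    ElasticSearch ingestion prefers (field names assumed)."""
    return {
        "site": rec.get("computingsite", "unknown"),
        "status": rec.get("jobstatus", "unknown"),
        "cpu_sec": float(rec.get("cpuconsumptiontime") or 0),
    }

# On the cluster the same shaping and aggregation would run in Spark
# over HDFS (pyspark package; app name and paths hypothetical):
#   from pyspark.sql import SparkSession, functions as F
#   spark = SparkSession.builder.appName("jobs-agg").getOrCreate()
#   df = spark.read.json("hdfs:///analytics/jobs/raw/")
#   agg = (df.groupBy("computingsite", "jobstatus")
#            .agg(F.sum("cpuconsumptiontime").alias("cpu_sec")))
```

The aggregated frame would then be written out for indexing into ElasticSearch, keeping the raw data on HDFS.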
ATLAS and MONIT
ATLAS ADC Analytics and Monitoring launched working groups to check, improve and implement the ATLAS MONIT dashboards and reports

• DDM and DDM Accounting
  – Initial dashboards from MONIT ongoing
  – Need checking and improvements by experts
• Running jobs
  – Initial dashboards by MONIT
  – First beta job monitoring / job accounting dashboards
  – Running job slots were needed to start with the dashboards
• Sites monitoring
  – SAM test results available in MONIT
  – A few features still being developed (site grouping, availability profiles)
  – ASAP metrics (HammerCloud) modified to send results as JSON
ATLAS and ElasticSearch
ElasticSearch @ UChicago
• Hardware-based
  – 8 nodes total (each 8 cores, 64 GB RAM), 5 data nodes (3×1 TB SSD each)
  – 3 master nodes (3 masters, 1 indexer, 2 Kibana)
  – 10 Gbps NICs
• Contents: 15,000,000,000 docs
• Clients: Kibana, notebooks, embedded visualisations, cron jobs
• Indices in production across IT ES (8) and UChicago (8) = 82 TB/year
• Few options available:
  – Rolling buffers: cannot look further back than 30 days, not enough for our reports
  – Throw away data: painful to know upfront what could be important, esp. for operations
  – Redo all dashboards from Kibana/Grafana in Spark notebooks: not enough human capacity for this
ATLAS and Hadoop @ CERN
• Exclusively use the analytics cluster at CERN
  – 40 nodes, 2 TB RAM total, 2 PB storage total
• Rucio (Data Management)
  – Dumps from Oracle
  – Flume from DDM servers and daemons (REST calls, logs, …)
  – Custom servlet to serve HTTP streams directly from HDFS
  – Reporting jobs, with results shown via notebooks
  – Kerberos authentication via the (cumbersome) acron method
ES and Hadoop Feedback
• Users are much more comfortable now than half a year ago
  – However, inexperienced users can hit the infrastructure hard
  – Documentation is difficult, as the systems are changing fast; easier to ask colleagues
• ElasticSearch and Hadoop are both critical and used in production
  – Most of the tools/capabilities are available
  – Interplay between tools is not ideal or efficient; lots of custom-made duct tape
• Hadoop is working well, ES is sufficient for now (until late 2017)
  – Limiting for the applications (sliding windows, throwing away data, …)
  – Upgrades necessary wrt. storage space and IO rates, at least on the IT ES cluster
• Run 3/4 considerations
  – 10× increase of rates and volume, in line with the WFMS and DDM upgrades
  – Event-level processing workflows will require unprecedented instrumentation
Wishlist
• Monitoring data volume
  – Data volume is huge; throwing away old data limits the usability
  – Deeper studies require data in ES for at least a year, preferably longer
• Limited functionality
  – Grafana and Kibana are not fully featured, with crucial elements missing (e.g. log scale); how to extend them?
• Data reliability
  – Data/figure/plot validation (e.g. for the RRB) is mandatory; it needs a lot of experience and help from dedicated experts
• Presentation quality
  – Quality of presentation material needs to be at a professional level
CMS Grid Monitoring and Migration to MONIT
(slides from Carl Vuosalo for CMS Monitoring Forum)
CMS Grid Monitoring and Migration to MONIT
CMS monitors its grid resources with a variety of tools and is in the process of migrating to CERN's standard MONIT framework.
CMS monitoring covers many kinds of resources:
• CMSWeb – front-end tool for displaying monitoring results
• Data Popularity – dataset usage
• SpaceMon – monitoring disk usage across CMS grid
• Site Status Board – shows status of CMS grid sites
• CRAB/ASO – monitors files transferred for user grid jobs
• HTCondor job monitoring
• WMAgent – production job submission and reporting
• WMArchive – grid job reports
CMSWeb and Data Popularity
CMSWeb
• CMSWeb hosts and displays other CMS services
• Already uses ElasticSearch and Kibana
• Planning to migrate
Data Popularity
• Data Popularity tracks usage of CMS datasets by CMS jobs
• It uses CMS tools to aggregate and display results
• Planning to migrate to the CERN MONIT infrastructure this summer
  • Store and aggregate results using the Hadoop file system and Spark
  • Display results with Grafana and Kibana
SpaceMon and Site Status Board
CMS SpaceMon
• SpaceMon tracks disk usage at CMS sites
• Developed a MONIT prototype that uses ElasticSearch and Kibana
• Data source will move to a message broker (ActiveMQ) instead of Flume
SSB
• Site Status Board displays the status of CMS computing sites
• Migrating to MONIT
• Sending data as JSON instead of XML
• Will implement MONIT features (dashboards, reports and extract via APIs)
CMS SpaceMon
SSB - GlideinWMS Metrics
CRAB/ASO Ops and CMS Job Monitoring
CRAB and ASO Monitoring
• CRAB is CMS tool for user job submission
• ASO is used by CRAB for file transfer
• ASO has migrated all dashboards to MONIT
• Developed Grafana dashboards, backed by Spark jobs
CMS Job Monitoring (data via HTCondor)
• HTCondor job monitoring prototype
• Send raw data from jobs to MONIT via AMQ message broker
• Perform aggregation in MONIT
• Use Grafana for dashboards
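A sketch of the "raw data via AMQ" step above (the envelope shape, destination, broker host and port are assumptions for illustration, not the real CMS/MONIT contract):

```python
import json

def monit_message(payload, producer="cms-jobmon"):
    """Wrap one raw job record in a producer-tagged JSON envelope
    (shape assumed for illustration)."""
    return json.dumps({"producer": producer, "payload": payload})

# Sending over the STOMP protocol (stomp.py package; broker host, port,
# credentials and destination are hypothetical):
#   import stomp
#   conn = stomp.Connection([("monit-broker.example.cern.ch", 61313)])
#   conn.connect("user", "password", wait=True)
#   conn.send(destination="/topic/cms.jobmon",
#             body=monit_message({"Site": "T2_XX_Example", "Status": "done"}))
```

On the MONIT side the consumer would aggregate these records before they reach the Grafana dashboards.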
CRAB and ASO Monitoring
CMS Job Monitoring (data via HTCondor)
WMAgent and WMArchive

WMAgent
• WMAgent used for production job submission and reporting
• All dashboards now in MONIT
• Using Grafana – some fine-tuning still needed
WMArchive
• WMArchive stores Framework Job Reports from production jobs
• Migrating aggregated results into CERN MONIT system
• WMArchive and WMCore added support for ActiveMQ broker Stomp protocol
• Now streaming data to MONIT broker
• Developing periodic Spark job for sending data
• Creating Kibana displays and Grafana plots
WMAgent
WMArchive (aggregated job reports)
LHCb Monitoring
(input from F. Stagni and A. McNab)
Monitoring system
• Based on the DIRAC framework and designed for:
  • real-time monitoring (WMS jobs, DIRAC components, etc.)
  • managing semi-structured data (JSON)
  • efficient data storage, analysis and retrieval
  • providing high-quality reports
• Technologies used:
  • Elasticsearch distributed search and analytics engine
  • DIRAC Graph library for creating high-quality plots
  • DIRAC web framework (front end) for visualizing the plots
  • Messaging queue systems: AMQ (via STOMP) and RabbitMQ
  • Kibana flexible analytics and visualization framework (optional: for component monitoring)
Overview of the system
Monitoring web server
Experience and Plans

• Experience:
  • LHCb has been using the Monitoring System since October 2016
  • Elasticsearch clusters are provided by CERN (one for production, one for certification)
  • Size: 108 GB, 1200 indexes, 1280 million documents
  • 8-node cluster: 3 master, 2 search and 3 data nodes
• Plans:
  • More monitoring data
  • Centralized logging of DIRAC components, based directly on Elasticsearch (currently in a pre-production version)
  • Job analytics: worker node, memory, CPU, execution time, etc.
Pilot Monitoring (via NagiosProbes)
• The WLCG SAM tests (ETF) are plugins for Nagios that check what is working
  • Either on the SAM/ETF central machines
  • Or inside jobs submitted by SAM/ETF
• NagiosProbes is an extension to the next-generation Pilot which can run these tests
  • Uses the Nagios plugin API, so the same scripts can be reused
  • Its tests are currently logged to an HTTPS server
  • This will be replaced by logging to the pilot logging service
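The Nagios plugin API the probes rely on is essentially a convention: one status line on stdout plus an exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A minimal probe following it might look like this (the probe name and check are invented for illustration):

```python
# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def probe_result(label, ok, detail=""):
    """Format the one-line plugin output and pick the matching exit code."""
    status = "OK" if ok else "CRITICAL"
    line = "%s %s" % (label, status)
    if detail:
        line += " - " + detail
    return line, (OK if ok else CRITICAL)

# As a plugin entry point one would print the line and exit with the code,
# which a wrapper script can relay to SAM/ETF unchanged:
#   import sys
#   line, code = probe_result("LHCb-pilot-payload", True, "payload completed")
#   print(line)
#   sys.exit(code)
```

Because both Pilot NagiosProbes and SAM/ETF speak this convention, the same test scripts run unmodified in either place, as the slide notes.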
Pilot NagiosProbes for SAM/ETF
• Output of the probes is retrieved by a script we supply to SAM/ETF
  • Itself a Nagios plugin, it returns the original probe's output and return code
  • We can hide the HTTPS-to-Pilot-Logging transition since it is all inside our script
• A prototype of this is now working
  • Running on etf-lhcb-preprod
  • Once in production, VM-based resources will immediately appear in the WLCG SAM dashboards/reports
Summary

• All experiments are looking at external technologies instead of fully in-house solutions
  • ElasticSearch, Kibana, Kafka, messaging, Spark, HDFS
  • Less custom-made means less control, but huge communities and free improvements
• ATLAS and CMS
  • have their own infrastructure for deeper analytics studies
  • rely on MONIT for monitoring, and as the migration path from the WLCG dashboards
  • use MONIT curated data with aggregation, enrichment, processing, buffering, dashboards, etc.
• ALICE and LHCb
  • run their own infrastructure
  • based on central IT services (ES, HDFS)
  • could share (some) data in MONIT, if needed