A monitoring tool for a GRID operation center Sergio Andreozzi (INFN CNAF), Sergio Fantinel (INFN Padova), David Rebatto (INFN Milano), Gennaro Tortone (INFN Napoli), Luca Vaccarossa (INFN Milano) CHEP2003 - March 24-28, 2003 - La Jolla, California

A monitoring tool for a GRID operation center

Sergio Andreozzi (INFN CNAF), Sergio Fantinel (INFN Padova),

David Rebatto (INFN Milano), Gennaro Tortone (INFN Napoli), Luca Vaccarossa (INFN Milano)

CHEP2003 - March 24-28, 2003 - La Jolla, California

Summary

- introduction to DataTAG project
- monitoring of grid elements
- first implementation: WorldGRID resources monitoring
- the evolution: DataTAG WP4 resources monitoring
- future activities

Introduction

DataTAG project

DataTAG is an EU-funded project that will create a large-scale intercontinental Grid testbed focused on advanced networking issues and on interoperability between the intercontinental Grid domains.

The project will address the issues which arise in the sector of high performance inter-Grid networking, including sustained and reliable high performance data replication, end-to-end advanced network services, and novel monitoring techniques. The project will also directly address the issues which arise in the sector of interoperability between the Grid middleware layers such as information and security services. The advances made will be disseminated into each of the associated Grid projects.

Detailed information: http://www.datatag.org

DataTAG Work Package 4

The task of DataTAG WP4 (Interoperability between Grid domains) is to address issues of middleware interoperability between the European and US Grid domains and to enable a selected set of applications to run on the Transatlantic Grid Testbed.

Main activities include:
- Grid Resource Model for Computing and Storage resources (GLUE schema)
- Virtual Organisation Membership Service implementation
- Grid Monitoring
- Resource Discovery
- LHC experiment applications integration

Monitoring of grid elements (1/2)

LOW LEVEL measurements:
- CPU load
- memory usage
- disk usage (per partition)
- network activity
- number of processes
- number of users (UI)
- …

Grid elements:
- Computing Element
- Storage Element
- Worker Node
- Resource Broker
- Information Index
- Replica Manager
- Replica Catalog
- […]

Grid services checks:
- gatekeeper
- gsiftp
- gris
- gdmp
- RB/LB
- …

“GRID” measurements:
- number of total CPUs
- number of free CPUs
- number of running jobs
- number of waiting jobs
- SE free disk space
- …

Monitoring of grid elements (2/2)

Sources of information:
- LOW LEVEL measurements: plugins/sensors installed on each machine
- SERVICE checks: sensors installed on monitoring server
- GRID measurements: sensors installed on monitoring server

Aggregate information (monitoring server side):
- per Virtual Organisation
- per site
- …
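
The server-side aggregation can be pictured as a group-by over the collected samples. The following is only a minimal sketch under stated assumptions: the sample layout, the field names (`site`, `vo`, `total_cpus`, ...) and the site names are hypothetical and are not taken from the tool itself.

```python
from collections import defaultdict

# Hypothetical per-resource samples; field names and values are illustrative only.
SAMPLES = [
    {"site": "site-a", "vo": "cms", "total_cpus": 40, "free_cpus": 10, "running_jobs": 30},
    {"site": "site-a", "vo": "atlas", "total_cpus": 20, "free_cpus": 5, "running_jobs": 15},
    {"site": "site-b", "vo": "cms", "total_cpus": 16, "free_cpus": 16, "running_jobs": 0},
]


def aggregate(samples, key):
    """Sum the numeric metrics of all samples that share the same `key` value."""
    totals = defaultdict(lambda: defaultdict(int))
    for sample in samples:
        group = sample[key]
        for name, value in sample.items():
            # Only numeric fields are metrics; "site" and "vo" are labels.
            if isinstance(value, (int, float)):
                totals[group][name] += value
    return {group: dict(metrics) for group, metrics in totals.items()}


per_site = aggregate(SAMPLES, "site")
per_vo = aggregate(SAMPLES, "vo")
```

The same function serves both views: `aggregate(SAMPLES, "site")` groups by site and `aggregate(SAMPLES, "vo")` groups by Virtual Organisation.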

first implementation: WorldGRID resources monitoring

WorldGRID testbed

WorldGRID is a “transatlantic grid” based on the existing European and American Grids, with the goal of offering transparent access to the distributed computing infrastructure required by modern “data-intensive” applications.

The WorldGRID testbed was successfully demonstrated during the WorldGRID demos at SuperComputing 2002 (Baltimore) and IST 2002 (Copenhagen), where real HEP application jobs were transparently submitted from the US and Europe and run wherever resources were available, independently of their location.

WorldGRID monitoring

Based on Nagios (a host and service monitoring engine)
[detailed information on: http://www.nagios.org]

Host local plug-ins collect info from the OS:
- CPU load
- RAM
- disk
- jobs

MDS plug-ins collect aggregate info from GRIS:
- number of running/waiting jobs
- number of total/free CPUs

History graphs for all monitoring metrics

Aggregate info/graphs per Site and Virtual Organisation
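
To illustrate what a host local plug-in looks like, here is a minimal sketch of a CPU load check following the general Nagios plug-in convention: one status line on standard output, optional performance data after a `|`, and an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). It is not one of the WorldGRID plug-ins; the thresholds and the `check_load` name are assumptions chosen for the example.

```python
import os

# Nagios plug-in exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_load(load1, warn=4.0, crit=8.0):
    """Map a 1-minute load average to an (exit_code, status_line) pair."""
    if load1 >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif load1 >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    # Text after "|" is performance data in the form label=value;warn;crit.
    line = f"LOAD {label} - load1={load1:.2f} | load1={load1:.2f};{warn};{crit}"
    return code, line


def main():
    """Read the local load average, print the status line, return the exit code."""
    try:
        load1 = os.getloadavg()[0]
    except OSError:
        print("LOAD UNKNOWN - load average unavailable")
        return UNKNOWN
    code, line = check_load(load1)
    print(line)
    return code

# A real plug-in script would finish with: sys.exit(main())
```

The performance data part of the line is what a graphing add-on can use to build the history graphs for each metric.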

the evolution: DataTAG-WP4 implementation for resources monitoring

Description

GOAL

The objective of the task is to develop software for use in the Grid Operations Centres in order to monitor the overall functioning of the grid. The software should enable the grid administrators to quickly identify problems in the operation of the grid and take appropriate action to rectify them.

People involved:
- Sergio Andreozzi (INFN CNAF)
- Vincenzo Ciaschini (INFN CNAF)
- Sergio Fantinel (INFN Padova)
- Antonia Ghiselli (INFN CNAF)
- Flavia Donno (CERN-LCG)
- Gennaro Tortone (INFN Napoli)
- Cristina Vistoli (INFN CNAF)

Requirements (1/2)

Features required:
- scalability
- very low intrusiveness
- automatic resource discovery
- fault detection and notification
- metrics graphs

The GOC administrator should be presented with an integrated view of the grid showing the overall functional status of the grid and the various sites with various levels of detail

Requirements (2/2)

The system should provide a facility for defining alarms on specific conditions as functions of the various monitored parameters. When such alarms are triggered, the administrator should be alerted appropriately.
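
One way to read "alarms as functions of the monitored parameters" is as named predicates evaluated against the latest metric values. The sketch below is an illustration under assumptions: the `Alarm` class, the metric names and the thresholds are all hypothetical and do not describe the actual alarm facility.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Alarm:
    name: str
    condition: Callable[[Metrics], bool]  # True means the alarm fires


def triggered(alarms: List[Alarm], metrics: Metrics) -> List[str]:
    """Return the names of the alarms whose condition holds for `metrics`."""
    return [alarm.name for alarm in alarms if alarm.condition(metrics)]


# Hypothetical alarm definitions; metric names and thresholds are illustrative.
ALARMS = [
    Alarm("no free CPUs", lambda m: m["free_cpus"] == 0),
    Alarm("queue backlog", lambda m: m["waiting_jobs"] > 2 * m["total_cpus"]),
    Alarm("SE almost full", lambda m: m["se_free_gb"] / m["se_total_gb"] < 0.05),
]
```

The list returned by `triggered` is what a notification step (e-mail, web page highlight) would consume; that step is left out here.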

The system should poll all the sites and gather the static and dynamic information about the resources in the site.

static information covers parameters like number of computing elements, total storage capacities, total memory etc.

dynamic information covers parameters like number of running jobs, number of jobs in queue, free memory, free storage space, load average etc.

All interfaces should be web based

Features provided by current implementation

The new Grid Monitoring Tool is based on the Grid Information System, implemented over the LDAP protocol with the GLUE schema.

It provides various monitoring levels:
- host level (by GLUE schema monitoring extension)
- fabric level (by DataGRID WP4 monitoring framework)
- Virtual Organisation level (by automatic resources discovery and checks scheduling)

It also provides a historical database in order to generate graphs or reports of some measurements.
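
The historical database can be thought of as a table of timestamped samples that is queried by time range when a graph or report is drawn. The sketch below uses SQLite purely as a stand-in; the slides do not name the database engine, and the table layout and function names are assumptions.

```python
import sqlite3


def open_history(path=":memory:"):
    """Open (or create) the sample store; ':memory:' keeps it in RAM."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        " resource TEXT, metric TEXT, ts INTEGER, value REAL)"
    )
    return db


def record(db, resource, metric, ts, value):
    """Store one measurement taken at Unix time `ts`."""
    db.execute(
        "INSERT INTO samples VALUES (?, ?, ?, ?)", (resource, metric, ts, value)
    )


def series(db, resource, metric, start, end):
    """Return (ts, value) pairs with start <= ts <= end, oldest first."""
    rows = db.execute(
        "SELECT ts, value FROM samples"
        " WHERE resource = ? AND metric = ? AND ts BETWEEN ? AND ?"
        " ORDER BY ts",
        (resource, metric, start, end),
    )
    return rows.fetchall()
```

A graph of running jobs for one CE over the last day would then be a call to `series` with the matching time window, with the result handed to whatever plotting tool is in use.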

GLUE schema (host level monitoring)

Conceptual model of grid resources to be used as a base schema of the GIS (Grid Information Service) for discovery and monitoring purposes:
- model of computing resources (CE)
- model of storage resources (SE)
- model of relationships among them (close CE/SE)

Implementation status (v. 1.0) (for Globus MDS):
- LDAP schema (DataTAG WP4.1)
- information providers (CE/SE)

We implemented an extension to include all monitoring metrics (“host level” added to GLUE schema)
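
The three parts of the conceptual model (CE, SE, and the "close" relationship between them) can be sketched as plain data classes. This is an illustration only: the class and field names below are hypothetical and are not the attribute names defined by the GLUE LDAP schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StorageElement:
    unique_id: str
    free_space_gb: float


@dataclass
class ComputingElement:
    unique_id: str
    total_cpus: int
    free_cpus: int
    running_jobs: int
    waiting_jobs: int
    # "Close" SEs: storage elements associated with this CE.
    close_storage: List[StorageElement] = field(default_factory=list)


def close_free_space(ce: ComputingElement) -> float:
    """Total free space on the storage elements close to `ce`."""
    return sum(se.free_space_gb for se in ce.close_storage)
```

Keeping the relationship on the CE side makes questions such as "how much storage is near this queue" a single traversal.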

DataGrid-WP4 monitoring framework (fabric level monitoring)

It provides a client (Monitoring Sensor Agent - MSA) running sensors (Monitoring Sensors - MS) on each monitored node, and a central server (Fabric Monitoring Server - fmonServer) to collect data.

The server receives samples as they are measured by the MSA, and stores them in a flat file or an Oracle database.

The client is provided with a sensor (sensorLinuxProc) which uses the /proc file system to measure various basic quantities on Linux (CPU load, network, etc.)
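
As an illustration of the kind of measurement a /proc-based sensor performs, the sketch below parses the contents of `/proc/loadavg` and `/proc/meminfo`. It is not the sensorLinuxProc code; the function names and the returned dictionaries are assumptions made for the example.

```python
def parse_loadavg(text):
    """Parse /proc/loadavg content into the 1, 5 and 15 minute load averages."""
    one, five, fifteen = text.split()[:3]
    return {"load1": float(one), "load5": float(five), "load15": float(fifteen)}


def parse_meminfo(text):
    """Parse /proc/meminfo content into a {field: number} dictionary.

    Most fields are reported in kB; the unit suffix is dropped.
    """
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[name.strip()] = int(parts[0])
    return values


def read_proc(path):
    """Read one /proc file, e.g. read_proc('/proc/loadavg') on Linux."""
    with open(path) as handle:
        return handle.read()
```

Separating parsing from file access keeps the parsers usable on any platform and easy to check against captured samples.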

EDG-WP4 monitoring framework

[Figure: EDG-WP4 monitoring framework diagram; recoverable labels: local farm element, computing element]

Discovery process (Virtual Organisation level monitoring)

Through the GIIS, via LDAP, we can obtain the CE/SE available at a specific time.

Using a DB we compare the info from the GIIS with previous status of resources availability (an object can be new, disappeared, re-available)

Through the GRIS of the CE/SE we can obtain SITE/HOSTS info (we repeat the discovery process at site level to get site resources/info: queues, worker nodes, network adapters, disk partitions, supported transfer protocols, …)
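
The comparison against the previously stored status can be sketched as follows. SQLite and the `resources` table are stand-ins chosen for the example (the slides only say "a DB"), and the function names are hypothetical; only the three states, new, disappeared and re-available, come from the description above.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the status store: one row per known resource."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS resources (id TEXT PRIMARY KEY, available INTEGER)"
    )
    return db


def update_availability(db, seen_now):
    """Compare the resources seen now with the stored status and record changes.

    `seen_now` is the set of CE/SE identifiers returned by the latest query.
    Returns {resource_id: event} with event in
    {"new", "disappeared", "re-available"}; unchanged resources are omitted.
    """
    known = dict(db.execute("SELECT id, available FROM resources"))
    events = {}
    for rid in seen_now:
        if rid not in known:
            events[rid] = "new"
        elif not known[rid]:
            events[rid] = "re-available"
    for rid, available in known.items():
        if available and rid not in seen_now:
            events[rid] = "disappeared"
    # Persist the new status so the next run compares against it.
    for rid, event in events.items():
        db.execute(
            "INSERT OR REPLACE INTO resources (id, available) VALUES (?, ?)",
            (rid, 0 if event == "disappeared" else 1),
        )
    return events
```

Keeping disappeared resources in the table, flagged as unavailable, is what allows a later run to report them as re-available rather than new.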

Discovery process: base schema

[Diagram: Monitoring Server, GIIS Server (GIIS), Computing Element/Storage Element (GRIS) and Monitoring DB; links labelled LDAP (twice) and SQL]

1: LDAP Query
2: available CE/SE
3: LDAP Query
4: CEIDs, WNs,

Steps 3,4 repeated for every CE/SE
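
The four numbered steps amount to one query to the GIIS followed by one query per CE/SE to its GRIS. The sketch below shows only that control flow: the two callables stand in for the LDAP searches, and their names, the returned dictionaries and the error handling are assumptions, not the tool's actual interface.

```python
def discover(query_giis, query_gris):
    """Run the two-level discovery and return {host: details}.

    query_giis()      -> iterable of CE/SE host names   (steps 1 and 2)
    query_gris(host)  -> dict of details for that host  (steps 3 and 4)
    Both callables are placeholders for LDAP searches.
    """
    inventory = {}
    for host in query_giis():
        try:
            inventory[host] = query_gris(host)
        except ConnectionError:
            # A GRIS that cannot be reached is recorded without details.
            inventory[host] = None
    return inventory
```

Passing the queries in as callables keeps the loop independent of any particular LDAP client library; writing the result to the Monitoring DB would follow as a separate step.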

[Figure: architecture diagram. Recoverable labels, grouped by box:]
- worker node (shown twice): EDG-WP4 monitoring agent, WP4 sensor, /proc filesystem (arrows: run, read, metric output)
- computing element: GRIS (GLUE schema), information providers, EDG-WP4 fmonserver, farm monitoring archive (arrows: run, ldif output, write, read)
- information index: GIIS (GLUE schema)
- monitoring server: discovery service, monitoring service, web interface, Central Monitoring Database (arrows: ldap query, ldap query)

Future activities

Future activities

- job monitoring
- evaluation of OGSA monitoring service(s)
- evaluation/usage of SOAP interface provided by DataGrid WP4 monitoring framework in order to implement a distributed archive for monitoring metrics