The SAM-Grid Fabric Services
Gabriele Garzoglio (for the SAM-Grid team)
Computing Division
Fermilab
Gabriele Garzoglio, ACAT 2003
Overview
• Introduction
• The grid-level services: an overview
  • Job Management
• The fabric-level services
  • Local batch system adaptation
  • Dynamic product retrieval
  • Local sandbox management
  • Job complex-status logging
Introduction
• SAM is a Data Handling System for HEP: the project was started in 1997 by DZero
• The SAM-Grid project started in 2001-2002 to handle DZero’s expanded needs for globally distributed computing
• CDF joined SAM-Grid at the end of 2002
• JIM complements the data handling system (SAM) with Job and Info Management: SAM-Grid = JIM + SAM
• JIM is funded by PPDG and GridPP
• Participated in SC02 and SC03
Job Management
[Diagram of the job management architecture. Labels: JOB, User Interface, Submission Client, Broker, Match Making Service, Information Collector, Execution Site #1 … Execution Site #n, Computing Element, Queuing System, Grid Sensors, Storage Element, Data Handling System.]
Running jobs on Grid resources: the trend
• Grid resources are not dedicated to a single experiment
• Translation:
  • no daemons running on the worker nodes of a batch system
  • no experiment-specific software installed
Running jobs on Grid resources: today
• The situation is transitioning:
  • Generally, experiments can install specific services on a node close to the cluster
  • Worker nodes typically access the software via shared FS: not scalable!
  • Local resource configuration is still too diverse to easily plug into the Grid
• Today, most of our efforts are directed to coping with (the lack of) standard local fabric services
Motivation
• Problem: “standard” grid batch system adapters (Globus job-managers) are too restrictive to fit all the local configurations
• Examples:
  • the terms of the agreement for using the batch system can be expressed with special directives to the batch system
  • system administrators end up writing wrappers around the standard batch system commands
SAM Batch System Adapter
• We factor out the local batch system configuration using an intermediate layer that abstracts the basic interactions with the batch system:
  • submit command
  • lookup command
  • remove command
• For each of the commands above, the administrator can specify how to parse the output to extract the relevant information, e.g. the local job id when submitting
• We have written JIM Globus job managers that use this layer
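The adapter layer described above could be sketched roughly as follows. This is an illustration only, not the SAM-Grid implementation: the class names, the `qsub`/`qstat`/`qdel` templates, the regular expressions and the fake runner are all assumptions chosen so the example runs without a real batch system.

```python
import re
import shlex
import subprocess
from dataclasses import dataclass


@dataclass
class CommandSpec:
    """One batch-system interaction: a command template plus an output parser."""
    template: str  # site-specific command line (hypothetical examples below)
    pattern: str   # regex whose named groups capture the relevant fields


def run_shell(argv):
    """Default runner: execute argv and return its standard output."""
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout


class BatchAdapter:
    """Uniform submit/lookup/remove interface over site-configured commands."""

    def __init__(self, specs, runner=run_shell):
        self.specs = specs
        self.runner = runner

    def _call(self, name, **params):
        spec = self.specs[name]
        # Split the template first so parameter values containing spaces stay intact.
        argv = [token.format(**params) for token in shlex.split(spec.template)]
        output = self.runner(argv)
        match = re.search(spec.pattern, output)
        if match is None:
            raise ValueError(f"cannot parse output of {name!r}: {output!r}")
        return match.groupdict()

    def submit(self, script):
        return self._call("submit", script=script)["job_id"]

    def lookup(self, job_id):
        return self._call("lookup", job_id=job_id)["status"]

    def remove(self, job_id):
        self._call("remove", job_id=job_id)


def fake_runner(argv):
    """Stand-in for a real batch system, with made-up output formats."""
    if argv[0] == "qsub":
        return "Your job 4711 has been submitted\n"
    if argv[0] == "qstat":
        return "job 4711 state: running\n"
    return ""


# What an administrator might configure for one site (all values hypothetical).
SITE_SPECS = {
    "submit": CommandSpec("qsub {script}", r"job (?P<job_id>\d+)"),
    "lookup": CommandSpec("qstat {job_id}", r"state: (?P<status>\w+)"),
    "remove": CommandSpec("qdel {job_id}", r""),
}

adapter = BatchAdapter(SITE_SPECS, runner=fake_runner)
job_id = adapter.submit("run_job.sh")
```

Keeping the command templates and parsing patterns in per-site data means a job manager written against `BatchAdapter` would not need to change when a site wraps or replaces its batch commands.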
Motivation
• Portability of the software for DZero and CDF is a problem that is still not completely solved
• Most of the CDF and DZero applications still rely on the offline software being preinstalled at the site
• Administrators need to install and maintain the software at each site
• A job submitted to the grid must be able to execute at a site where its dependencies are installed
Old solution: software advertisement
• Administrators install the software at each site
• The JIM advertisement framework senses the new product and advertises it to the broker as one of the characteristics of the site
• Drawbacks:
  • the administrators still need to install the software
  • increased complexity of the advertisement framework: it needs to know how to detect the list of installed products
  • increased complexity of the broker: it needs to enforce the matching to the eligible sites
  • jobs running on old software versions may not find an eligible site
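The matching step the broker has to enforce under this scheme can be pictured with a toy example. The site names, product names and versions are invented, and the real broker's matching is not described here; this only shows the set-inclusion idea and the last drawback in the list.

```python
# Products each site advertises as installed (all entries hypothetical).
advertised = {
    "site_a": {("reco", "v13"), ("reco", "v14")},
    "site_b": {("reco", "v14")},
}


def eligible_sites(required, advertised):
    """Return the sites that advertise every product the job requires."""
    return sorted(
        site for site, products in advertised.items() if required <= products
    )


current = eligible_sites({("reco", "v14")}, advertised)
older = eligible_sites({("reco", "v13")}, advertised)
obsolete = eligible_sites({("reco", "v11")}, advertised)  # no site keeps it installed
```

A job that needs a version no site still advertises gets an empty list, which is the "no eligible site" case.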
New solution: dynamic software retrieval
• Product developers store the software in SAM with appropriate metadata
• Before running a job at a site, the infrastructure asks SAM to deliver the products the job depends on
• The products live in the SAM cache and are automatically managed
• Drawbacks:
  • increased complexity of local job submission
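As a rough picture of "products live in a cache and are automatically managed", here is a minimal sketch of a site-local cache that fetches a product on first request and evicts the least recently used one when full. The LRU policy, the `ProductCache` class, the product names and the `fake_fetch` function are assumptions for illustration; the actual SAM cache management is not specified in these slides.

```python
from collections import OrderedDict


class ProductCache:
    """Toy site-local product cache with least-recently-used eviction (assumed policy)."""

    def __init__(self, fetch, capacity):
        self.fetch = fetch        # callable (name, version) -> payload
        self.capacity = capacity  # maximum number of cached products
        self.entries = OrderedDict()

    def deliver(self, name, version):
        key = (name, version)
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        payload = self.fetch(name, version)  # cache miss: retrieve the product
        self.entries[key] = payload
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return payload


fetched = []


def fake_fetch(name, version):
    """Stand-in for a data handling request; records what was retrieved."""
    fetched.append((name, version))
    return f"{name}-{version}.tar"


cache = ProductCache(fake_fetch, capacity=2)
for dependency in [("reco", "v14"), ("analysis", "v3"), ("reco", "v14")]:
    cache.deliver(*dependency)
cache.deliver("simulation", "v4")  # cache is full: ("analysis", "v3") is evicted
```

The second request for `("reco", "v14")` is served from the cache, so only three retrievals are recorded for four deliveries.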
Nomenclature
• Input sandbox:
  • from the client (user sandbox):
    • the executable
    • configuration files
    • special dependencies (libraries, products, …)
  • from the local site:
    • the product dependencies
• Output sandbox:
  • stdout, stderr
  • log files
  • small custom output (e.g. histograms)
Requirements
• We want an infrastructure that:
  • locally stores the user sandbox (from the Grid) at the site
  • transports the input sandbox to the worker node and installs it there
  • packages the output and hands it over to the Grid
Limitations to overcome
• the file transport mechanism of a batch system is site specific and needs to be factored out
• shared file systems have scalability limits: we want to rely on them as little as possible
• the worker nodes may have connectivity restrictions (firewalls)
The sandbox management 1
• It creates a sandbox area (reorganizing the native Globus GASS cache)
• It starts up a GridFTP server for the communications between worker nodes and head node (no shared FS)
• It requests the delivery of the product dependencies
• It creates a self-extracting archive that contains the GridFTP client and a bootstrapping script; when run, the script transfers and installs the product dependencies, then passes control to the application
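One common way to build a self-extracting archive is a short shell stub followed by a compressed tar payload. The sketch below assumes that layout; the stub, the member names and the placeholder contents are hypothetical and not taken from SAM-Grid. A real bootstrapping script would transfer and install the product dependencies before starting the application.

```python
import io
import tarfile

# Shell stub: unpack everything after the stub into a scratch directory,
# then hand control to the bootstrapping script shipped in the payload.
STUB_TEMPLATE = """#!/bin/sh
set -e
workdir=$(mktemp -d)
tail -n +{first_payload_line} "$0" | tar -xzf - -C "$workdir"
exec sh "$workdir/bootstrap.sh" "$@"
"""


def build_self_extracting(files):
    """Return the bytes of a self-extracting archive.

    files maps archive member names to their contents (bytes).
    """
    payload = io.BytesIO()
    with tarfile.open(fileobj=payload, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mode = 0o755
            tar.addfile(info, io.BytesIO(data))
    # The payload begins on the line right after the last stub line.
    first_payload_line = STUB_TEMPLATE.count("\n") + 1
    stub = STUB_TEMPLATE.format(first_payload_line=first_payload_line)
    return stub.encode("ascii") + payload.getvalue()


archive = build_self_extracting({
    # Placeholder contents only.
    "bootstrap.sh": b"#!/bin/sh\necho 'install products, then start the application'\n",
    "transfer-client": b"placeholder for the file transfer client\n",
})
```

Because the archive carries its own transfer client and bootstrap logic, the batch system only has to deliver and start a single file on the worker node.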
The sandbox management 2
• It submits parallel instances of the self-extracting archive to the batch system
• The job relies on SAM for large input/output file transfers
• When the job finishes, stdout/stderr and the custom output are packaged at the head node to be transported back to the submission site via grid mechanisms
Open problems
• Not all batch systems allow the selection of a node with sufficient scratch space to install the needed software
• We would greatly simplify this infrastructure if there were a “standard” local storage service at all the sites (e.g. DiskFarm)
Motivation
• Distributed logging of job status/history
• Web monitoring
• Statistics on historical data
• Grid scheduling based upon job status/history at a certain site
The XML DB Status Logger
• The status of the job is reported to an XML database deployed at each execution site
• The information comes from the local batch system (simple job status e.g. “idle”, “running”, …) AND from the application (complex status e.g. “Processing executable X in the chain”)
• The XML database gives flexible remote access via standard mechanisms, such as XPath
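To make the idea concrete, the sketch below logs simple and complex status records into an in-memory XML document and queries them with path expressions. It uses Python's `xml.etree.ElementTree`, which supports only a subset of XPath, as a stand-in for a real XML database; the element and attribute names (`joblog`, `job`, `status`, `source`) and the job id are invented for illustration.

```python
import xml.etree.ElementTree as ET

log = ET.Element("joblog")


def report(job_id, source, text):
    """Append a status record for a job, creating the job element if needed."""
    job = log.find(f"job[@id='{job_id}']")
    if job is None:
        job = ET.SubElement(log, "job", id=job_id)
    ET.SubElement(job, "status", source=source).text = text


# Simple status from the batch system, complex status from the application.
report("42", "batch", "idle")
report("42", "batch", "running")
report("42", "application", "Processing executable X in the chain")

# Path queries: full batch history, and everything the application reported.
batch_history = [s.text for s in log.findall("job[@id='42']/status[@source='batch']")]
latest_batch = batch_history[-1]
complex_status = [
    s.text for s in log.findall("job[@id='42']/status[@source='application']")
]
```

Keeping both kinds of record under the same job element lets one query select by job, by source, or both, which is the flexibility the slide attributes to XPath access.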
Conclusions
• The SAM-Grid offers an extensible working framework for Grid-level Job/Data/Info Management
• The SAM-Grid adopts configurable fabric-level solutions for batch system adaptation, product delivery, sandboxing and job complex-status logging
• The community needs to come up with standard fabric-level services to make any Grid usable
More info at…
http://www-d0.fnal.gov/computing/grid/
http://samgrid.fnal.gov:8080/
Morag Burgon-Lyon’s Talk on SAM-Grid for CDF!