Grid Job and Information Management (JIM) for D0 and CDF
Gabriele Garzoglio for the JIM Team

Overview
- Introduction
- Grid-level Management
  - SAM-Grid = SAM + JIM
  - Job Management
  - Information Management
- Fabric-level Management
  - Running jobs on grid resources
  - Local sandbox management
  - The DZero Application Framework
- Running MC at UWisc

Context
- The D0 Grid project started in 2001-2002 to handle D0's expanded needs for globally distributed computing
- JIM complements the data handling system (SAM) with job and information management
- JIM is funded by PPDG (our team here) and GridPP (Rod Walker in the UK)
- Collaborative effort with the experiments
- CDF joined later in 2002

History
- Delivered JIM prototype for D0, Oct 10, 2002:
  - Remote job submission
  - Brokering based on data cached
  - Web-based monitoring
- SC-2002 demo: 11 sites (D0, CDF), big success
- May 2003: started deployment of V1
- Now: working on running MC in production on the Grid

SAM-Grid Logistics
[Figure: SAM-Grid architecture diagram showing the flow of jobs, data, and meta-data between sites. Labeled components:
- Job handling: User Interface, Submission, Grid Client, Global Job Queue, Match Making, Resource Selector, Info Gatherer, Info Collector
- Global DH Services: SAM Naming Server, SAM Log Server, Resource Optimizer, SAM DB Server, RC MetaData Catalog, Bookkeeping Service
- Site: Grid Gateway, Local Job Handler (CAF, D0MC, BS, ...), JIM Advertise, Local Job Handling, Cluster, Worker Nodes, AAA, Dist. FS, SAM Station (+ other servs), SAM Stager(s), Data Handling, MSS, Cache
- Information: Info Manager, XML DB server (Site Conf., Glob/Loc JID map, ...), Info Providers, MDS, Web Serv, Grid Monitoring, User Tools]

Job Management Highlights
- We distinguish grid-level (global) job scheduling (selection of a cluster to run on) from local scheduling (distribution of the job within the cluster)
- We consider 3 types of jobs:
  - analysis: data intensive
  - Monte Carlo: CPU intensive
  - reconstruction: data and CPU intensive

Job Management – Distinct JIM Features
- Decision making is based on both:
  - Information existing irrespective of jobs (resource description)
  - Functions of (job, resource)
- Decision making is interfaced with data handling middleware
- Decision making is entirely in the Condor framework (no Resource Broker of our own): strong promotion of standards and interoperability
- Brokering algorithms can be extended via plug-ins
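
The brokering idea on this slide can be sketched in a few lines. This is not JIM or Condor code: the `Resource` and `Job` classes, the rank functions, and the site names are hypothetical and exist only for illustration. The sketch filters resources on a job-independent description, then ranks the survivors with a function of (job, resource) that is passed in as a plug-in.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Resource:
    """Job-independent description of a cluster (hypothetical fields)."""
    name: str
    free_cpus: int
    cached_files: set = field(default_factory=set)


@dataclass
class Job:
    """A grid job; kind is 'analysis', 'montecarlo', or 'reconstruction'."""
    kind: str
    cpus_needed: int
    input_files: set = field(default_factory=set)


# A rank plug-in is any function of (job, resource) returning a number.
RankFn = Callable[[Job, Resource], float]


def cached_fraction(job: Job, res: Resource) -> float:
    """Fraction of the job's input files already cached at the resource."""
    if not job.input_files:
        return 0.0
    return len(job.input_files & res.cached_files) / len(job.input_files)


def free_cpu_rank(job: Job, res: Resource) -> float:
    """Prefer clusters with more idle CPUs (useful for CPU-bound jobs)."""
    return float(res.free_cpus)


def match(job: Job, resources: list, rank: RankFn) -> Optional[Resource]:
    """Filter on requirements, then pick the resource with the highest rank."""
    eligible = [r for r in resources if r.free_cpus >= job.cpus_needed]
    if not eligible:
        return None
    return max(eligible, key=lambda r: rank(job, r))


sites = [
    Resource("site_a", free_cpus=50, cached_files={"f1", "f2"}),
    Resource("site_b", free_cpus=200, cached_files={"f3"}),
]
analysis = Job("analysis", cpus_needed=10, input_files={"f1", "f2", "f3"})
montecarlo = Job("montecarlo", cpus_needed=100)

print(match(analysis, sites, cached_fraction).name)
print(match(montecarlo, sites, free_cpu_rank).name)
```

A data-intensive analysis job would typically be ranked by `cached_fraction`, while a CPU-intensive Monte Carlo job would use something like `free_cpu_rank`. Swapping the rank function changes the brokering policy without touching `match`.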

Job Management
[Figure: job management diagram. Labeled components: JOB, User Interface, Submission Client, Broker (Match Making Service), Information Collector, and Execution Sites #1 to #n, each with a Computing Element, Queuing System, Grid Sensors, Storage Elements, and a Data Handling System.]

Information Management
- In JIM's view, this includes:
  - configuration framework
  - resource description for job brokering
  - infrastructure for monitoring
- Main features:
  - Sites (resources) and jobs monitoring
  - Distributed knowledge about jobs etc.
  - Incremental knowledge building
  - GMA for current state inquiries, logging for recent history studies
  - All Web based

Information Management via Site Configuration
[Figure: the main site/cluster configuration (XML DB) is transformed by XSLT stylesheets into several derived products. Labels also include Template XML.
- Resource advertisement: classad
- Monitoring configuration: LDIF
- Service instantiation: XML
- ...]
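
The idea of deriving several products from one site configuration can be sketched as follows. The slide describes XSLT transformations. This sketch swaps in Python's `xml.etree.ElementTree`, because the Python standard library has no XSLT processor. The XML element and attribute names, the DN layout, and the output formats are assumptions made for illustration: they only imitate ClassAd and LDIF syntax and are not JIM's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical site configuration; element and attribute names are invented.
SITE_XML = """
<site name="example_site">
  <cluster name="farm1" batch_system="condor" cpus="128"/>
  <cluster name="farm2" batch_system="pbs" cpus="32"/>
</site>
"""


def to_classads(root):
    """Render one ClassAd-like resource advertisement per cluster."""
    ads = []
    for c in root.findall("cluster"):
        ads.append(
            "[\n"
            f'  Site = "{root.get("name")}";\n'
            f'  Cluster = "{c.get("name")}";\n'
            f'  BatchSystem = "{c.get("batch_system")}";\n'
            f'  Cpus = {int(c.get("cpus"))};\n'
            "]"
        )
    return ads


def to_ldif(root):
    """Render LDIF-like monitoring entries; the DN layout is made up."""
    entries = []
    for c in root.findall("cluster"):
        entries.append(
            f'dn: cluster={c.get("name")},site={root.get("name")}\n'
            f'cpus: {c.get("cpus")}\n'
            f'batchsystem: {c.get("batch_system")}\n'
        )
    return "\n".join(entries)


root = ET.fromstring(SITE_XML)
classads = to_classads(root)
ldif = to_ldif(root)
print(classads[0])
print(ldif)
```

Keeping a single source of truth means the advertisement seen by the broker and the entries seen by monitoring cannot drift apart, which is presumably the motivation for the design on the slide.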

Running jobs on Grid resources
- The trend: Grid resources are not dedicated to a single experiment
- Translation:
  - no daemons running on the worker nodes of a batch system
  - no experiment-specific software installed

Running jobs on Grid resources
- The situation today is transitioning:
  - Worker nodes typically access the software via a shared FS: not scalable!
  - Generally, experiments can install specific services on a node close to the cluster
  - Local resource configuration is still too diverse to easily plug into the Grid

The JIM local sandbox management
- It keeps the job executable (from the Grid) at the head node and knows where its product dependencies are
- It transports and installs the software to the worker node
- It can instantiate services at the worker node
- It sets up the environment for the job to run
- It packages the output and hands it over to the Grid, so that it becomes available for download at the submission site
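
The sandbox life cycle listed on this slide can be sketched with standard-library tools. This is an illustration, not JIM's implementation: the "head node" and "worker node" are two temporary directories on one machine, and the function names, the `SANDBOX_DIR` and `OUTPUT_DIR` variables, and the stand-in job script are all hypothetical.

```python
import os
import subprocess
import sys
import tarfile
import tempfile
from pathlib import Path


def make_sandbox(head_dir: Path, files: list) -> Path:
    """Head node: bundle the job executable and its dependencies."""
    bundle = head_dir / "sandbox.tar.gz"
    with tarfile.open(bundle, "w:gz") as tar:
        for f in files:
            tar.add(f, arcname=f.name)
    return bundle


def run_in_sandbox(bundle: Path, worker_dir: Path, entry: str) -> Path:
    """Worker node: unpack, set up the environment, run, package the output."""
    with tarfile.open(bundle) as tar:
        tar.extractall(worker_dir)
    out_dir = worker_dir / "output"
    out_dir.mkdir()
    env = dict(os.environ, SANDBOX_DIR=str(worker_dir), OUTPUT_DIR=str(out_dir))
    subprocess.run([sys.executable, entry], cwd=worker_dir, env=env, check=True)
    result = worker_dir / "output.tar.gz"
    with tarfile.open(result, "w:gz") as tar:
        tar.add(out_dir, arcname="output")
    return result


with tempfile.TemporaryDirectory() as tmp:
    head, worker = Path(tmp, "head"), Path(tmp, "worker")
    head.mkdir()
    worker.mkdir()
    # A stand-in "job executable" and one "product dependency".
    (head / "job.py").write_text(
        "import os, pathlib\n"
        "dep = pathlib.Path(os.environ['SANDBOX_DIR'], 'dep.txt').read_text()\n"
        "pathlib.Path(os.environ['OUTPUT_DIR'], 'result.txt').write_text(dep.upper())\n"
    )
    (head / "dep.txt").write_text("calibration data")
    bundle = make_sandbox(head, [head / "job.py", head / "dep.txt"])
    result = run_in_sandbox(bundle, worker, "job.py")
    with tarfile.open(result) as tar:
        names = sorted(tar.getnames())
        content = tar.extractfile("output/result.txt").read().decode()
    print(names, content)
```

The job only sees the sandbox directory and the environment prepared for it, so nothing experiment-specific has to be installed on the worker node in advance.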

Running a DZero application
- We have the JIM sandbox: where is the problem now?
- The JIM sandbox could immediately use the DZero Run Time Environment (RTE), but:
  - Not all the DZero packages are RTE compliant
  - Users don't have experience with, or incentives for, using it today

Running Monte Carlo at UWisc
- The University of Wisconsin offered DZero the opportunity to use a 1000-node non-dedicated Condor cluster
- We are concentrating on putting it to use to run MC with mc_runjob (in production by year end)

The challenges I
- MC code is not RTE compliant today
- Chain of 3-5 stages. Each binary is 50-200 MB, dynamically linked
- The binaries are compiled from 40 packages (D0 total: 621). These packages are needed at run time for RPC files
- Root, Motif, X11, and Ace libraries are found as dependencies (for MC generators…)
- MC tarballs exist but are hand-crafted (and bug-prone) every time. Size unpacked: 2 GB (versus 12-15 GB for the full D0 app tree)
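
One way to picture the packaging problem is as a dependency-closure computation: given the packages a chain of MC stages is built from, collect everything they need at run time. The sketch below is only an illustration of that idea, not a tool the slides describe. The dependency map and the package names `mc_generator`, `detector_sim`, and `unrelated_pkg` are invented.

```python
from collections import deque

# Hypothetical dependency map: package -> packages it needs at run time.
DEPENDS_ON = {
    "mc_generator": ["root", "motif"],
    "detector_sim": ["root", "ace"],
    "motif": ["x11"],
    "root": [],
    "ace": [],
    "x11": [],
    "unrelated_pkg": ["root"],
}


def runtime_closure(stages, depends_on):
    """Return every package reachable from the given stage packages."""
    needed = set()
    queue = deque(stages)
    while queue:
        pkg = queue.popleft()
        if pkg in needed:
            continue  # already visited; also guards against cycles
        needed.add(pkg)
        queue.extend(depends_on.get(pkg, []))
    return needed


packages = runtime_closure(["mc_generator", "detector_sim"], DEPENDS_ON)
print(sorted(packages))
```

Computing the list from declared dependencies, rather than assembling it by hand, would be one route to a tarball that contains only what the job needs.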

The challenges II
- Just about every advanced C++ feature, every libc library call, and every system call is used
- One can get different results on two RedHat 7.2 systems
- The total release tree takes N hours (up to 20+) to build: not something easy to do dynamically at a remote site

Summary
- The SAM-Grid offers an extensible working framework for Grid-level Job/Data/Info Management
- JIM provides Fabric-level management tools for sandboxing
- The applications need to be improved to run on Grid resources