Workload Management WP Status and next steps Massimo Sgaravatto INFN Padova

Page 1:

Workload Management WP

Status and next steps

Massimo Sgaravatto, INFN Padova

Page 2: Where we are

- CMS-HLT use case (Monte Carlo production and reconstruction) analyzed in terms of GRID requirements and GRID tools availability
- Discussions with Globus team and Condor team
- Definition of a prototype architecture of the workload management system
  - Use of Globus and Condor mechanisms
  - But major developments needed

Page 3: Prototype workload management system architecture

[Architecture diagram. Recoverable labels: a Master uses the Grid Information Service (GIS) for Resource Discovery and submits jobs through Condor-G, via condor_submit (Globus Universe), to Site1, Site2 and Site3. Each site runs Globus GRAM in front of a Local Resource Management System (CONDOR, LSF, PBS). Other labels: Farms, Info.]

- Globus GRAM as uniform interface to different local resource management systems
- Condor-G able to provide a reliable/crashproof job submission service
- Master chooses in which Globus resources the jobs must be submitted
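
The layering on this slide can be sketched in a few lines of Python. This is only an illustration of the idea of one uniform submission interface in front of different local batch systems; the class names, the job identifier format, the pairing of sites with batch systems and the script name `mc_production.sh` are all assumptions, and nothing here calls Globus or Condor.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class LocalBatchSystem:
    """Hypothetical stand-in for a local system such as CONDOR, LSF or PBS."""
    name: str
    _ids: count = field(default_factory=lambda: count(1), repr=False)

    def enqueue(self, executable: str) -> str:
        # A real batch system would queue the executable; here we only hand out an id.
        return f"{self.name}-{next(self._ids)}"


@dataclass
class Gatekeeper:
    """Plays the role GRAM has on the slide: the same interface at every site."""
    site: str
    backend: LocalBatchSystem

    def submit(self, executable: str) -> str:
        return f"{self.site}/{self.backend.enqueue(executable)}"


def master_submit(executable, gatekeepers, choose):
    """The master picks a target site, then submits through the uniform interface."""
    target = choose(gatekeepers)
    return target.submit(executable)


# Assumed pairing of sites and batch systems, for illustration only.
sites = [
    Gatekeeper("Site1", LocalBatchSystem("condor")),
    Gatekeeper("Site2", LocalBatchSystem("lsf")),
    Gatekeeper("Site3", LocalBatchSystem("pbs")),
]
job_id = master_submit("mc_production.sh", sites, choose=lambda gks: gks[1])
print(job_id)
```

The `choose` callback is where the resource discovery step would plug in; the caller never needs to know which batch system sits behind a site.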

Page 4: Where we are

- Evaluating the existing components (D1.1) and “putting together” the various building blocks
- Evaluation of Globus
  - Collaboration with WP 1 of INFN-GRID project (Evaluation of the Globus toolkit): http://www.infn.it/globus
  - Evaluation of Globus GRAM
    - GRAM as uniform interface to different underlying resource management systems
    - Evaluation of RSL
    - “Cooperation” between GRAM and GIS
- Evaluation of Condor-G
  - The current implementation is a prototype
    - It works, but some problems must be solved
  - Globus + Condor-G tested with a real CMS MC production
    - Many memory leaks found in the Globus jobmanager!
    - Fixes (provided by Francesco Prelz) submitted to Globus team
    - Feedback received only on the bugs in the GAA and GSS modules (new fixes “merged” with the original ones)

Page 5: First deliverables

- Month 3: Report on current technology (report) D1.1
- Month 6: Definition of architecture for scheduling, resource management, security and job description (report) D1.2
- Month 9: Components and documentation for the 1st release: initial workload management system (prototype) D1.3

Page 6: Proposed work plan

- Let’s continue the implementation of the proposed prototype
  - Evaluation of current technologies (Globus, Condor) (D1.1)
  - Functionalities for the 1st release
- First release
  - We can propose the functionalities that could be implemented
  - “Negotiation” in the ATF
    - To understand if these functionalities “address” the proposed use cases
    - To understand if our module can be “plugged” together with the other “pieces”
    - To understand if the other WPs can provide the functionalities required by WP 1

Page 7: Proposed functionalities for the 1st release

- First version of job description language (JDL)
- First version of broker (master), that decides where to submit the jobs
- Job submission service
- First version of logging and bookkeeping services
- First user interface

Page 8: Job Description Language (JDL)

- Used when the job is submitted, to specify
  - The application
  - The input data set
    - File? Collection of files? “Logical” or “physical” names?
    - Needs to be discussed with WP 2, WP 8, ATF
  - Where the output data must be saved
  - (Required and preferable) resources
  - Info for bookkeeping
  - … ???
- Prototype: Condor ClassAds
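
To make the list above concrete, here is a minimal sketch of what a job description covering those items might look like, rendered as ClassAd-like `Name = value` lines. `Requirements` and `Rank` echo the Condor convention for required and preferred resources; every other attribute name, path and data set name is invented for illustration, and the output is not meant to be valid input for any real tool.

```python
class Expr(str):
    """Marks a value as an expression, emitted without surrounding quotes."""


def render(attrs):
    """Render attributes as 'Name = value' lines (ClassAd-like, illustrative only)."""
    lines = []
    for name, value in attrs.items():
        if isinstance(value, Expr):
            text = str(value)
        elif isinstance(value, str):
            text = '"' + value + '"'
        elif isinstance(value, (list, tuple)):
            text = "{" + ", ".join(f'"{v}"' for v in value) + "}"
        else:
            text = str(value)
        lines.append(f"{name} = {text}")
    return "\n".join(lines)


job = {
    "Executable": "/hypothetical/path/mc_production.sh",   # the application
    "InputData": ["lfn:hypothetical-dataset-001"],         # the input data set
    "OutputLocation": "hypothetical-storage-element",      # where to save output
    "Requirements": Expr('Arch == "INTEL" && ScratchSpaceMB >= 500'),  # required
    "Rank": Expr("FreeCPUs"),                               # preferable
    "BookkeepingTag": "hlt-test",                           # info for bookkeeping
}
text = render(job)
print(text)
```

Whether the input data set is a file or a collection, and whether names are logical or physical, is left open on the slide; the list-valued `InputData` with an `lfn:` prefix is just one possible choice.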

Page 9: Broker/Master

- Choice of resource (farm) where to submit job
  - Input: JDL expression
  - Output: computing resource choice
- Published resource access lists (gridmap-files in the Globus-based prototype) are checked as a first step in the resource match-making
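
The access-list check could be sketched as follows. The sketch assumes each farm publishes the text of a gridmap-style file in which every line pairs a quoted certificate subject with a local account name; the function names, the subjects and the way the lists are published (a plain dictionary here) are hypothetical.

```python
import shlex


def parse_gridmap(text):
    """Map each quoted certificate subject to its local account name."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        tokens = shlex.split(line)
        if len(tokens) < 2:
            continue  # ignore malformed lines
        mapping[tokens[0]] = tokens[1]
    return mapping


def accessible_farms(user_subject, published):
    """Return the farms whose access list contains the user's subject."""
    return [farm for farm, text in published.items()
            if user_subject in parse_gridmap(text)]


site1_map = '''# hypothetical access list for Site1
"/O=Grid/O=Example/CN=Alice Example" alice
"/O=Grid/O=Example/CN=Bob Example" bob
'''
site2_map = '"/O=Grid/O=Example/CN=Bob Example" cms001\n'

farms = accessible_farms("/O=Grid/O=Example/CN=Alice Example",
                         {"Site1": site1_map, "Site2": site2_map})
print(farms)
```

Only the farms that pass this filter would go on to the actual match-making against the job request.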

Page 10: Broker/Master

- The “accessible” computing resources are matched with the job request according to:
  - Availability of the requested input data set
    - In the 1st release the broker will have to choose a resource where this input data set is already available (we are not going to “trigger” the replica of the input data set)
  - Availability of the appropriate application “sandbox”
    - If the sandbox is not already available in the executing farm, it could be necessary to “copy” and install it there (“code migration”) (in the 1st release ???)
  - Queue characteristics and status (architecture, etc…) vs. job requests
    - Let’s start with a few, simple parameters
  - Availability of the requested amount of scratch space
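
As a rough sketch of these criteria, assuming the simplest first-release behaviour (the data set and the sandbox must already be present at the farm, and only architecture and scratch space are compared), the matching step might look like this. All class and field names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Farm:
    name: str
    datasets: frozenset      # input data sets already available at the farm
    sandboxes: frozenset     # application sandboxes already installed
    architecture: str
    free_scratch_mb: int


@dataclass(frozen=True)
class JobRequest:
    dataset: str
    sandbox: str
    architecture: str
    scratch_mb: int


def matches(farm, job):
    """Apply the four criteria from the slide, with no replication or code migration."""
    return (job.dataset in farm.datasets
            and job.sandbox in farm.sandboxes
            and job.architecture == farm.architecture
            and job.scratch_mb <= farm.free_scratch_mb)


def candidate_farms(farms, job):
    return [farm.name for farm in farms if matches(farm, job)]


farms = [
    Farm("Site1", frozenset({"ds-001"}), frozenset({"app-1"}), "INTEL", 200),
    Farm("Site2", frozenset({"ds-001"}), frozenset({"app-1"}), "INTEL", 2000),
    Farm("Site3", frozenset({"ds-002"}), frozenset({"app-1"}), "INTEL", 2000),
]
job = JobRequest(dataset="ds-001", sandbox="app-1", architecture="INTEL", scratch_mb=500)
print(candidate_farms(farms, job))
```

Site1 is rejected for lack of scratch space and Site3 because the data set is not there, which leaves Site2 as the only candidate.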

Page 11: Broker/Master

- We assume that all the information needed by the broker is “published” in one “Grid Information Space” (GIS in the Globus-based prototype) by the other WPs
- Prototype: Condor matchmaking library
  - Match between the info published in the GIS and the ClassAds defined in the JDL
  - A “translator” from GIS attributes to ClassAds is necessary
    - Some work already done by Globus team ???
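
A translator of this kind could be as small as the sketch below. It assumes the information service returns LDAP-style entries (attribute names mapped to lists of string values) and that a rename table says which attributes the broker cares about; the attribute names on both sides are made up, and the final predicate is only a stand-in for what the Condor matchmaking library would evaluate.

```python
def to_classad(entry, renames):
    """Turn an LDAP-style entry into a flat, typed attribute dictionary."""
    ad = {}
    for gis_name, values in entry.items():
        name = renames.get(gis_name)
        if name is None or not values:
            continue  # attribute not used by the broker, or empty
        raw = values[0]
        # Numeric strings become integers so they can be compared in requirements.
        ad[name] = int(raw) if raw.lstrip("-").isdigit() else raw
    return ad


# Hypothetical published entry and rename table.
entry = {
    "hn": ["farm1.example.org"],
    "freecpus": ["12"],
    "arch": ["INTEL"],
    "contact": ["someone@example.org"],
}
renames = {"hn": "Name", "freecpus": "FreeCPUs", "arch": "Arch"}

resource_ad = to_classad(entry, renames)


def requirements(ad):
    """Stand-in for the job's requirements expression."""
    return ad["Arch"] == "INTEL" and ad["FreeCPUs"] >= 4


print(resource_ad, requirements(resource_ad))
```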

Page 12: Job submission service

- Input: job to submit + computing resource choice (provided by broker)
- Reliable, fault tolerant, crash proof service
  - Reliability in the executing machines up to WP 4
- Prototype: Condor-G
  - Submission of jobs to Globus resources (farms)
  - New implementation of Condor-G (+ new Globus job manager) available soon
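
One generic way to get crash-proof behaviour on the submitting side is to write every accepted job to stable storage before acting on it, so that a restarted service can pick up where it left off. The sketch below illustrates that technique only; it is not a description of how Condor-G is implemented, and the class, file layout and state names are assumptions.

```python
import json
import os
import tempfile


class PersistentQueue:
    """Keeps the job table on disk so it survives a crash of the service."""

    def __init__(self, path):
        self.path = path
        self.jobs = {}
        if os.path.exists(path):
            with open(path) as fh:
                self.jobs = json.load(fh)  # recover state after a restart

    def _save(self):
        tmp = self.path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump(self.jobs, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, self.path)  # atomic swap: old or new file, never half of one

    def accept(self, job_id, job, resource):
        self.jobs[job_id] = {"job": job, "resource": resource, "state": "pending"}
        self._save()

    def mark_submitted(self, job_id):
        self.jobs[job_id]["state"] = "submitted"
        self._save()

    def pending(self):
        return sorted(j for j, rec in self.jobs.items() if rec["state"] == "pending")


path = os.path.join(tempfile.mkdtemp(), "queue.json")
queue = PersistentQueue(path)
queue.accept("job-1", {"Executable": "mc_production.sh"}, "Site2")
queue.accept("job-2", {"Executable": "mc_production.sh"}, "Site3")
queue.mark_submitted("job-1")

# Simulate a crash: forget the in-memory object and rebuild from disk.
recovered = PersistentQueue(path)
print(recovered.pending())
```

After the simulated restart only the job that was never handed to a resource is still pending, so it can be retried without resubmitting the other one.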

Page 13: “Code” migration

- Not easy at all!
  - Necessary to “install” in the target farm a complex run time environment
- A STRONG collaboration with WP 8 (and WP 4) is necessary to define an “application sandbox” that can easily be installed in one farm and doesn’t “conflict” with other sandboxes
- Use of “application repositories” ???
  - When an application must be installed on one farm, the sandbox is downloaded from such a repository

Page 14: Bookkeeping

- Necessary to “record” for each job
  - Submitting user identity
  - Input data
  - Output data
  - Status of processing
  - Where and when the processing has been done
  - Other bookkeeping info specified in the JDL
  - …???
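
The per-job record listed above maps naturally onto a small data structure. The following is a sketch under the assumption that one record is kept per job and updated as processing proceeds; the field names, status values and sample data are hypothetical.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass
class BookkeepingRecord:
    job_id: str
    user_identity: str                      # submitting user identity
    input_data: list                        # input data
    output_data: list = field(default_factory=list)
    status: str = "submitted"               # status of processing
    executed_at_farm: Optional[str] = None  # where the processing was done
    executed_on: Optional[datetime] = None  # when the processing was done
    extra: dict = field(default_factory=dict)  # other info taken from the JDL

    def mark_done(self, farm, when, outputs):
        self.status = "done"
        self.executed_at_farm = farm
        self.executed_on = when
        self.output_data = list(outputs)


record = BookkeepingRecord(
    job_id="job-1",
    user_identity="/O=Grid/O=Example/CN=Alice Example",
    input_data=["lfn:hypothetical-dataset-001"],
    extra={"BookkeepingTag": "hlt-test"},
)
record.mark_done(
    farm="Site2",
    when=datetime(2000, 1, 1, 12, 0, tzinfo=timezone.utc),
    outputs=["lfn:hypothetical-output-001"],
)
print(asdict(record))
```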

Page 15: Logging

- Necessary to keep track of the significant events that occurred in the system
  - Requests by users
  - Computing resource choice (by broker)
  - Submission to resource
  - …???
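
A minimal sketch of such an event log, assuming an append-only list of events keyed by job, is shown below. The event type names are taken from the three kinds of events on the slide; everything else is invented, and a sequence number stands in for the timestamp a real service would record.

```python
import itertools
from enum import Enum


class EventType(Enum):
    USER_REQUEST = "user_request"
    RESOURCE_CHOICE = "resource_choice"
    SUBMISSION = "submission"


class EventLog:
    """Append-only log of significant events."""

    def __init__(self):
        self._events = []
        self._sequence = itertools.count(1)

    def record(self, event_type, job_id, detail=""):
        event = {
            "seq": next(self._sequence),
            "type": event_type,
            "job_id": job_id,
            "detail": detail,
        }
        self._events.append(event)
        return event

    def history(self, job_id):
        """Return the event type names recorded for one job, in order."""
        return [e["type"].value for e in self._events if e["job_id"] == job_id]


log = EventLog()
log.record(EventType.USER_REQUEST, "job-1", "submit requested")
log.record(EventType.USER_REQUEST, "job-2", "submit requested")
log.record(EventType.RESOURCE_CHOICE, "job-1", "broker chose Site2")
log.record(EventType.SUBMISSION, "job-1", "submitted to Site2")
print(log.history("job-1"))
```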

Page 16: User Interface

- Job management
  - Job submission
  - Job removal
  - Job status monitoring
- Access to bookkeeping info
- Access to logging info
- …???
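
The operations listed for the user interface could be exposed as subcommands of a single command-line client. The sketch below only shows the argument parsing for such a client; the program name `wm`, the subcommand names and their arguments are hypothetical and do not correspond to any existing tool.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="wm", description="Hypothetical workload management client")
    commands = parser.add_subparsers(dest="command", required=True)

    submit = commands.add_parser("submit", help="submit a job described in a JDL file")
    submit.add_argument("jdl_file")

    # The remaining operations all act on an already submitted job.
    for name, text in [
        ("remove", "remove a job"),
        ("status", "show the status of a job"),
        ("bookkeeping", "show the bookkeeping info of a job"),
        ("logging", "show the logged events of a job"),
    ]:
        sub = commands.add_parser(name, help=text)
        sub.add_argument("job_id")
    return parser


args = build_parser().parse_args(["status", "job-1"])
print(args.command, args.job_id)
```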