BQS integration in gLite-CE, TCG meeting, CERN 01/11/2006, Sylvain Reynaud, Fabio Hernandez

Page 1: BQS integration in gLite-CE TCG meeting, CERN 01/11/2006 Sylvain Reynaud, Fabio Hernandez


Page 2: Context

We have been running a BQS-backed computing element since the early days of DataGrid
– BQS Information Provider
  • Maps BQS information data to the Glue Schema (LDIF)
– bqs-jobmanager
  • Maps Globus commands to BQS commands
  • Maps job queues to “BQS classes”, requests AFS tokens for jobs needing them, archives job information, logs job information for accounting purposes, creates the BQS job wrapper, caches job status information…

Currently trying to integrate BQS with gLite-CE
– STEP 1: develop a “BLAH-to-Globus jobmanager” adapter
  • So that we can reuse the bqs-jobmanager currently in production with LCG-CE
– STEP 2: develop a grid-neutral front-end to BQS and use it with several CEs (e.g. gLite-CE, CREAM, GT4 WS-GRAM)

(Slide callout: “We are here”)

Page 3: BQS integration in LCG-CE

[Architecture diagram. Labels: UI, RB, CE, Gatekeeper, BQS job-manager, BQS Information Provider, BDII, Local batch system, BQS. Arrow: Submit job. Legend: Provided, CC-IN2P3, To be done.]

Page 4: BQS integration in gLite-CE (STEP 1)

[Architecture diagram. Labels: UI, WMS, CE, Gatekeeper, fork job-manager, Condor-C, Blahpd, BLAH to Globus, BQS job-manager, BQS Information Provider, BDII, Local batch system, BQS. Arrows: Submit job, Launch Condor-C (twice). Legend: Provided, CC-IN2P3, To be done.]

Page 5: Purpose of this presentation

Provide feedback on the difficulties of integrating a new LRMS with gLite-CE
– These difficulties are not specific to BQS
– Integration is not impossible…
– …but it cannot be done efficiently!

Page 6: Overview

Difficulties
– gLite-CE installation
– Plug-in development
– Plug-in testing

BQS integration in CREAM

Discussion

Page 7: gLite-CE installation

On a standard Scientific Linux 3.0.5
– gLite 3.0.0 and 3.0.1: solutions to most bugs were found in mailing-list archives
– gLite 3.0.2 update 6: almost no installation bugs left

On our site-customized Scientific Linux 3.0.5
– Customization relates to
  • different releases of language interpreters (perl, python)
  • modified environment variables
– Sensitive to modifications of the execution environment
  • About 2/3 of the problems found were specific to this customization
– Such problems were not observed with other software packages (e.g. GT4)
– Some problems were hard to resolve (e.g. Globus fork-jobmanager script modified to set a specific and non-trivial order of directories in $PATH)
  • It seems to work now (with PBS), but there may be some remaining problems with untested features
– Not yet re-tested with gLite 3.0.2 update 6

Page 8: Plug-in development

BLAH expects 5 commands for interacting with the underlying LRMS
– One per action (submit, status, cancel, hold, resume)
– In the case of PBS and LSF, these commands are implemented as shell scripts

Lack of complete documentation is not a big issue
– The provided plug-ins for PBS and LSF are a good starting point
– Following the job lifecycle through testing is also instructive for understanding the system
  • But testing is the hard part (more on next slides)
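
The five-command contract can be pictured as a small dispatch table. The following Python sketch is only an illustration, not BLAH code: the `<lrms>_<action>.sh` script naming, the `bqs` prefix and the helpers `plugin_command` and `missing_actions` are all assumptions made for the example.

```python
from typing import List

# The five actions listed on the slide; one plug-in command per action.
ACTIONS = ("submit", "status", "cancel", "hold", "resume")


def plugin_command(lrms: str, action: str, args: List[str]) -> List[str]:
    """Build the argv for a hypothetical per-LRMS plug-in script.

    The "<lrms>_<action>.sh" naming is an assumption for illustration only.
    """
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    return [f"{lrms}_{action}.sh", *args]


def missing_actions(lrms: str, available_scripts: List[str]) -> List[str]:
    """Return the actions for which no plug-in script is available."""
    present = set(available_scripts)
    return [a for a in ACTIONS if f"{lrms}_{a}.sh" not in present]
```

A check such as `missing_actions("bqs", scripts_found)` could be run before any job is submitted, so that an incomplete plug-in is detected without going through a full job lifecycle.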

Page 9: Plug-in testing (1/4)

CANNOT TEST EFFICIENTLY BECAUSE…

The CE cannot be tested in standalone mode (without a WMS)
– This adds complexity and many opportunities for job failures
– We had to deploy a WMS locally
  • The WMS instances deployed on PPS were not stable enough (before summer)
  • We needed to understand where and why jobs fail

Each job submission test takes too long to complete
– Around 4’30” to execute a “hello-world” job on unloaded machines connected to the same LAN
– 15’ for an abnormally ended job
=> No test can be done in less than 5 minutes!
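
To put these durations in perspective, the short Python calculation below estimates how many sequential tests fit in a working day. The two durations come from the slide; the 8-hour day and the helper name `max_tests` are assumptions made for the example.

```python
def max_tests(window_minutes: float, minutes_per_test: float) -> int:
    """Upper bound on sequential tests that fit in a time window."""
    return int(window_minutes // minutes_per_test)


HELLO_WORLD_MIN = 4.5   # 4'30" per successful "hello-world" job (from the slide)
FAILED_JOB_MIN = 15.0   # 15' per abnormally ended job (from the slide)
WORKDAY_MIN = 8 * 60    # assumption: one 8-hour working day

print(max_tests(WORKDAY_MIN, HELLO_WORLD_MIN))  # 106
print(max_tests(WORKDAY_MIN, FAILED_JOB_MIN))   # 32
```

During plug-in development most test jobs are expected to fail, so the lower figure is the more realistic bound.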

Page 10: Plug-in testing (2/4)

Some services sometimes fail to start, start incorrectly or stop working (WMS, CE)
– (NOT security-related problems: time synchronization, CRL & gridmap file updates)
– Occurs after a configuration change or a simple service restart
  => restart the relevant services several times in different orders
– Sometimes unable to get back to a working configuration (even by restoring the original values)
  => reinstalling is the fastest solution

We have not been able to deactivate the automatic retry of jobs
– (setting RetryCount/ShallowRetryCount to 0 in the JDL does not do it)
– The lifecycle of failed jobs takes longer to complete
– Previously failed jobs continue to pollute the CE log files
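
For reference, the JDL setting mentioned above looks roughly like the output of the Python sketch below. The job attributes (executable, file names) are placeholder values and `render_jdl` is a helper written for this example; only `RetryCount` and `ShallowRetryCount` are taken from the slide, which reports that setting them to 0 did not disable retries in this setup.

```python
from typing import Dict, List, Union

JdlValue = Union[str, int, List[str]]


def render_jdl(attrs: Dict[str, JdlValue]) -> str:
    """Render attributes as 'Name = value;' lines in JDL (ClassAd-like) syntax."""
    def fmt(value: JdlValue) -> str:
        if isinstance(value, int):
            return str(value)
        if isinstance(value, list):
            return "{" + ", ".join(f'"{v}"' for v in value) + "}"
        return f'"{value}"'
    return "\n".join(f"{name} = {fmt(value)};" for name, value in attrs.items())


# Placeholder "hello-world" job; all values except the retry counts are assumptions.
job = {
    "Executable": "/bin/echo",
    "Arguments": "hello-world",
    "StdOutput": "std.out",
    "StdError": "std.err",
    "OutputSandbox": ["std.out", "std.err"],
    "RetryCount": 0,         # deep resubmission attempts
    "ShallowRetryCount": 0,  # shallow resubmission attempts
}
print(render_jdl(job))
```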

Page 11: Plug-in testing (3/4)

Job cancellation often does not work
– The glite-job-cancel command always returns “request has been successfully submitted”, but often has no effect on the job
– We do not know how to get the WMS & CE back to a “clean” state

The first submitted job almost always fails
– No longer systematic with the latest release, but still very frequent
– We often face this situation because the development phase implies frequent configuration changes, which often require restarting the gLite services

Page 12: Plug-in testing (4/4)

Hard to find the cause of failures
– Many silent failures or useless messages
  "The PeriodicHold expression 'Matched =!= TRUE && CurrentTime > QDate + 900' evaluated to TRUE".
– The command “glite-job-logging-info -v 2” often does not help to understand why the job has been retried for 900 seconds
– Need to follow the life of the job by looking at the log files, but they are scattered, and some are ephemeral (they disappear too quickly)
  • Several log files per component: Globus gatekeeper, Globus job-manager, Condor-C (ephemeral logs), BLAH (ephemeral logs), GridManager, …
  • Several directories contain logs: /var/log, $HOME, /tmp, …
– No error detection when the LRMS-specific BLAH scripts return unexpected output
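
The quoted PeriodicHold message is easier to read once the expression is unpacked. In ClassAd syntax `=!=` is the "is not identical to" operator, which yields TRUE when the left-hand attribute is undefined, so the expression becomes true once a job has been queued for more than 900 seconds without `Matched` being TRUE. The Python sketch below mimics that logic; `periodic_hold` and its parameters are illustrative names, not Condor code, and `None` stands in for an undefined attribute.

```python
from typing import Optional


def periodic_hold(matched: Optional[bool], current_time: int, qdate: int,
                  timeout: int = 900) -> bool:
    """Mimic 'Matched =!= TRUE && CurrentTime > QDate + 900'.

    matched=None models an undefined ClassAd attribute: '=!=' treats it
    as "not identical to TRUE", so the first clause is then true.
    """
    not_matched = matched is not True
    return not_matched and current_time > qdate + timeout
```

Read this way, the message only says that the job was not matched within 900 seconds of being queued; it does not say why.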

Page 13: BQS integration in CREAM

Currently exploring the integration of BQS with CREAM
– Have just started installing CREAM with PBS (27/10/2006)

CREAM installation (ongoing)
– Not yet automated, but not sensitive to modifications of the execution environment

Plug-in development (not started yet)
– STEP 1: Implementing a “BLAH Log Parser” is required
  => reusing code developed for LCG-CE may require modifications
– STEP 2: Develop a CREAM connector for BQS

Plug-in testing (not started yet)
– Seems to have none of the previously mentioned difficulties

Thanks to Massimo Sgaravatto for providing early access to CREAM for gLite 3.1

Page 14: BQS integration in CREAM (STEP 1)

[Architecture diagram. Labels: ICE, CE, CREAM, CEMon, BLAH connector, Blahpd, BLAH to Globus, BQS job-manager, BQS BLAH Log Parser, ???, BQS Information Provider, Local batch system, BQS. Arrow: Submit job. Legend: Provided, CC-IN2P3, To be done.]

Page 15: BQS integration in CREAM (STEP 2)

[Architecture diagram. Labels: ICE, CE, CREAM, CEMon, BLAH connector, BQS connector, BQS Information Provider, BQS grid-neutral front-end, Local batch system, BQS. Arrow: Submit job. Legend: Provided, CC-IN2P3, To be done.]

Page 16: References

gLite
– http://glite.web.cern.ch/glite/documentation/

BLAH
– http://egee-jra1-wm.mi.infn.it/egee-jra1-wm/ce_blahp.shtml

CREAM
– http://grid.pd.infn.it/cream/field.php?n=Main.HomePage

Page 17: Discussion

Are there tips for working more efficiently with the WMS and gLite-CE components?
– How can the WMS/gLite-CE be configured to reduce the time a job takes to complete?
– How can the automatic retry of jobs be deactivated?

What is the recommended way to proceed?
– Will the next releases of gLite-CE provide some answers to the problems reported in this talk?
– Should we instead concentrate on the integration of BQS with CREAM? (our preferred way)
  • Will the WMS support CREAM before support for LCG-CE is dropped?
– As a site, will we have to support both gLite-CE and CREAM?

Is there any plan to drop support for LCG-CE in the near future?