wp1 wms rel. 2.0 some issues massimo sgaravatto infn padova

WP1 WMS rel. 2.0: Some issues

Massimo Sgaravatto, INFN Padova

Outline
Some issues to discuss (and let’s try to decide):
  LB server choice
  New CondorG
  Proxy renewal
  RLS integration
  WP2 Optor integration
  Output data upload and registration
  LB issues
  Gangmatching
  Security of files on the WM node
  Disk quota management on the WM node
  VOMS integration
  Job exit code
  ISB/OSB transfer errors
  Accounting integration
  User vs host proxies
  … ?

LB server choice
Allow multiple LB servers for a single WM, for increased reliability and performance
Approach
  UI responsible for choosing the LB server (e.g. via round robin) ?
  List of available LB servers in the UI conf file, while waiting for this VO-specific info to be published in a “VO repository” (R-GMA/IS/VOMS) ?
  Move the list of available NSs into this VO repository as well, when available
  Not too clear yet what this VO repository could be (discussions within ATF)
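A minimal sketch of the round-robin choice mentioned above, assuming the UI simply reads a list of LB servers from its conf file. The class name, method name and host names are hypothetical and only illustrate the idea; they are not part of the WP1 software.

```python
from itertools import cycle


class LBServerChooser:
    """Hand out LB servers from a configured list in round-robin order."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one LB server is required")
        # cycle() repeats the list endlessly, which gives round-robin order.
        self._servers = cycle(list(servers))

    def next_server(self):
        return next(self._servers)


# Hypothetical host names, standing in for the entries of the UI conf file.
chooser = LBServerChooser(["lb1.example.org:9000", "lb2.example.org:9000"])
picks = [chooser.next_server() for _ in range(3)]
```

If the UI commands run as separate processes, the position in the list would have to be persisted between invocations (or the server picked at random); that part is not shown here.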

New CondorG
New CondorG negotiated with the Condor people (more details by Francesco P.)
Released by end of March, included in VDT, and to be used in rel. 2.0
Two proxies
  X509UserProxy
    One per job
  X509ManagementProxy
    One per user’s DN, or one “serving” n jobs for that user’s DN
    A CondorG <gridmanager, gahp-servers> pair for a given X509ManagementProxy
Details on the whole machinery to be discussed
  Where is the mapping between user’s DN and X509ManagementProxy kept and managed ?
  Proxy renewal ?
  …

Proxy renewal
Necessary to have a “persistent” proxy renewal daemon (i.e. if it is restarted it shouldn’t lose control of the “managed” jobs, as happens now)
Necessary to discuss and decide on various issues
  Renewal of X509UserProxy
    Done only if requested by the user (if MyProxyServer is specified in the JDL ?) ?
    No MyProxyServer in the WM conf file anymore ?
  And what about renewal of X509ManagementProxy ?
    If a new proxy “arrives” from the UI and extends the validity of the existing one, the new one replaces the old one ?
    Not enough: what if at least one job of that user asked for proxy renewal ?
      Necessary to renew the X509ManagementProxy as well
      Who does registration ? NS ? Who does un-registration ?
      …
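One possible reading of the “new proxy replaces the old one” rule under discussion, as a sketch only. The Proxy type and the function name are hypothetical; a real implementation would read the subject and expiry time from the X.509 proxy itself.

```python
from dataclasses import dataclass


@dataclass
class Proxy:
    subject: str      # user's DN
    not_after: float  # expiry time, seconds since the epoch


def select_management_proxy(current, incoming):
    """Return the proxy to keep: the incoming one only if it expires later."""
    if current is None:
        return incoming
    if incoming.subject != current.subject:
        raise ValueError("proxies belong to different DNs")
    return incoming if incoming.not_after > current.not_after else current


old = Proxy("/O=Grid/CN=Some User", not_after=1000.0)
longer = Proxy("/O=Grid/CN=Some User", not_after=5000.0)
shorter = Proxy("/O=Grid/CN=Some User", not_after=500.0)
kept_after_longer = select_management_proxy(old, longer)
kept_after_shorter = select_management_proxy(old, shorter)
```

This covers only the replacement on arrival of a new proxy; it does not address renewal via MyProxy, which the slide notes is still needed.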

RLS integration
At J+27 the RB/MM will have to query the WP2 RLS instead of the WP2 RC to get the SFNs given an LFN (or LCN, or GUID)
On-going negotiation of this WP1-WP2 interface
  New JDL attribute (VirtualOrganization) to make it possible to refer to the “official” VO’s RLS (needed by WP2 services)
    Not needed anymore once VOMS is integrated, since it will then be possible to get the VO from the user’s proxy
  Optional JDL attribute to make it possible to specify a “non-official” RLS ?
  edgReplicaManager::listReplicas to get the SFNs
  New BrokerInfo content (under negotiation)

Integration with WP2 Optor
Completely different approach from querying the RLS to get the PFNs (mutually exclusive) …
  RB calls getAccessCost for all the suitable CEs (the ones where the user is authorized to submit jobs and which match the JDL “Requirements” expression) and for all the specified input data (LFNs, LCNs, GUIDs)
  A “cost” is returned for each CE
  The RB chooses the CE, taking into account this cost and also the other Ranks (to be decided how)
  In some cases the WM also has to trigger the replication of files to the closeSE
Not too difficult, but very high impact on scheduling/planning performed by RB/MM
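How cost and Rank are combined is still to be decided. The sketch below shows one option only: a weighted difference, where the function name, the weight and the example values are all assumptions made for illustration. The access costs are taken as already obtained (e.g. from getAccessCost); no WP2 call is made here.

```python
def choose_ce(candidates, cost_weight=1.0):
    """Pick the CE with the highest score = rank - cost_weight * access_cost.

    candidates: dict mapping CE id -> (rank, access_cost).
    Returns None if there are no candidates.
    """
    if not candidates:
        return None

    def score(item):
        _ce_id, (rank, cost) = item
        return rank - cost_weight * cost

    return max(candidates.items(), key=score)[0]


# Hypothetical CEs that already passed authorization and Requirements.
candidates = {
    "ce-a": (10.0, 8.0),
    "ce-b": (7.0, 1.0),
    "ce-c": (9.0, 5.0),
}
best = choose_ce(candidates)
best_ignoring_cost = choose_ce(candidates, cost_weight=0.0)
```

With cost_weight set to 0 the choice falls back to plain Rank ordering, which is one way a conf-file switch could disable the Optor contribution.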

Integration WMS-Optor
Planned after J+27
However, according to WP2, this stuff will be ready and tested well before J+27
To discuss details of integration
  How ? A binary flag in the WM conf file to enable/disable Optor ?
  When ?

Output data upload and registration
Problem discussed and solution agreed in the ATF
Approach (details by Fabrizio P.):
  OutputData JDL attribute (optional) to specify output file names, output LFNs and output SEs
  At the end, the JobWrapper has to call the WP2 function copyAndRegister
Issues
  Some details about copyAndRegister to be sorted out
  Release date of this stuff not decided yet

LB
What happens exactly at J+27 wrt:
  “Advanced query to LB” ?
  “LB – RGMA integration” ?
How ? Interfaces (e.g. for advanced queries) ? Issues ?
Ales ??

Gangmatching
Problem: take into account both CE and SE information in the matchmaking
  For example, to require a job to run on a CE close to an SE with “enough space”
Salvo has been working on this for a while, also after some negotiations with the Condor team (A. Roy)
  Salvo’s talk for details (e.g. JDL) and discussions
When can this stuff be released ? J+27 ?
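The “CE close to an SE with enough space” example can be illustrated with a plain two-level match. This is not Condor gangmatching and not the JDL syntax from Salvo’s talk; the function name and data layout are hypothetical and only show what the combined CE/SE condition means.

```python
def match_ces(ces, ses, required_space):
    """Return ids of CEs that have at least one close SE with enough free space.

    ces: dict CE id -> list of close SE ids
    ses: dict SE id -> free space (same unit as required_space)
    """
    matches = []
    for ce_id, close_ses in ces.items():
        # An SE missing from the SE table is treated as having no free space.
        if any(ses.get(se_id, 0) >= required_space for se_id in close_ses):
            matches.append(ce_id)
    return matches


ces = {"ce-a": ["se-1"], "ce-b": ["se-2", "se-3"], "ce-c": []}
ses = {"se-1": 50, "se-2": 10, "se-3": 500}
matched = match_ces(ces, ses, required_space=100)
```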

Security of files on the WM node
Approach
  WP1 services (NS, …) running as edguser.edguser on the WM node
  Different users’ subjects mapped to different local users in the grid-mapfile: user1.user, user2.user, …
  Patched gridftp server (by Massimo M.) running on the NS node, so that the InputSandbox files are transferred to the NS node with edguser as group and rwxrwx--- as mask
  So a user cannot access files belonging to another user anymore
Issues
  When ? J+27 ?
  How ? Gridftp server RPM released by WP1 ?

Disk quota management on the WM node
Having different users’ DNs mapped to different local users in the grid-mapfile of the WM node makes it possible to set disk quotas for the various users
NS to be modified (for J+27) so that it rejects a job if there is not enough disk quota available to store the input sandbox files
Issues ?
Marco ??
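The quota check the NS would perform can be sketched as follows, assuming the sandbox file sizes, the user’s quota and the current usage are already known in bytes. The function name is hypothetical, and how the NS would obtain quota and usage for the mapped local user is left open.

```python
def accept_job(sandbox_sizes, quota, used):
    """Return True if the input sandbox fits into the user's remaining quota.

    sandbox_sizes: sizes in bytes of the input sandbox files
    quota, used:   disk quota and current usage in bytes for the local user
    """
    needed = sum(sandbox_sizes)
    return used + needed <= quota


# Hypothetical numbers: 1000-byte quota with 700 bytes already in use.
fits = accept_job([100, 150], quota=1000, used=700)
rejected = accept_job([100, 250], quota=1000, used=700)
```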

VOMS integration
E.g.: voms-proxy-init -vo CMS
  VO info in the generated proxy
Impact on WP1 software
  Retrieve VO from user’s proxy
    So not necessary to provide it anymore in the JDL, for querying the RLS
  Check for authorization not done anymore with a matchmaking considering User Cert Subject, but according to VO
  Proxy used by the various services (NS, LB, etc.) generated by VOMS ?
Issues
  VOMS deployed at J+37, but not too clear which integration will take place and when
  Not clear yet which VOMS APIs will be available

Job exit code
For release 2.0 we agreed to return the job exit code to the user with dg-job-status
What if the exit code <> 0 ?
  Done-ok in any case ?
  Done-failed (and therefore resubmission) ?

ISB/OSB transfer errors
In release 1.x a job is considered failed (and therefore resubmission is attempted) if the JobWrapper detects errors when transferring an ISB/OSB file between the RB node and the WN
  But the failure could simply be due to a user’s error when writing the ISB/OSB expressions in the JDL …
  And what if the job crashed because of “internal” problems and therefore some OSB files were not produced ?
    Is it ok to mark the job as failed and re-attempt the submission, or is it better to consider the job as done-ok ?
Approach in release 2.0
  JobAdapter should check and issue globus-url-copy only for ISB/OSB files which exist (simple for OSB, a bit more complex for ISB), and/or globus-url-copy errors ignored ?
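The “copy only the files which exist” idea for the OSB case can be sketched as a check run in the job’s working directory before any transfer is issued. The function name is hypothetical and no globus-url-copy command is actually invoked; the sketch only separates the declared OSB files into present and missing ones.

```python
import os
import tempfile


def existing_osb_files(workdir, osb_names):
    """Split the declared OSB names into those present in workdir and those missing."""
    present, missing = [], []
    for name in osb_names:
        path = os.path.join(workdir, name)
        (present if os.path.isfile(path) else missing).append(name)
    return present, missing


# Demo: the job produced out.txt but not err.txt.
with tempfile.TemporaryDirectory() as workdir:
    open(os.path.join(workdir, "out.txt"), "w").close()
    present, missing = existing_osb_files(workdir, ["out.txt", "err.txt"])
```

Only the names in present would then be handed to the transfer step; whether a non-empty missing list should mark the job as failed is exactly the open question above.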

Accounting integration
What exactly happens at J+27 (“Accounting infrastructure”) ?
And later, after release 2.0 (“Full integration of cost estimation/accounting into scheduling policies”) ?
Dependencies and interfaces with other components and other WPs at J+27 and later ?

Host vs user proxies
Can we rely on user’s proxies instead of host proxies for authentication when possible, as recommended ?
  E.g. in LB logging
  Other cases ?
