WP1 WMS rel. 2.0: Some issues
Massimo Sgaravatto, INFN Padova
Outline
Some issues to discuss (and let’s try to decide):
  LB server choice
  New CondorG
  Proxy renewal
  RLS integration
  WP2 Optor integration
  Output data upload and registration
  LB issues
  Gangmatching
  Security of files on the WM node
  Disk quota management in WM node
  VOMS integration
  Job exit code
  ISB/OSB transfer errors
  Accounting integration
  User vs host proxies
  … ?
LB server choice
Allow multiple LB servers for a single WM, for increased reliability and performance
Approach
  UI responsible for choosing the LB server (e.g. via round robin) ?
  List of available LB servers kept in the UI conf file, until this VO-specific info is published in a “VO repository” (R-GMA/IS/VOMS) ?
  Move the list of available NSs to this VO repository as well, when available
  Not too clear yet what this VO repository could be (discussions within ATF)
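The round-robin choice on the UI side could look like the minimal sketch below. The `LBServerSelector` class and the host names are hypothetical, and the real UI conf file format is not specified here; the list is simply assumed to have been read from it.

```python
from itertools import cycle


class LBServerSelector:
    """Hand out LB servers from a configured list in round-robin order."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one LB server must be configured")
        self._servers = list(servers)
        self._cycle = cycle(self._servers)

    def next_server(self):
        """Return the next LB server in the rotation."""
        return next(self._cycle)


# Hypothetical list, as it might be read from the UI conf file.
lb_servers = ["lb1.example.org:9000", "lb2.example.org:9000", "lb3.example.org:9000"]
selector = LBServerSelector(lb_servers)

# Four consecutive submissions wrap around to the first server again.
chosen = [selector.next_server() for _ in range(4)]
```

Each UI command typically runs as a separate process, so a real implementation would probably have to persist the rotation index between invocations, or fall back to a random choice.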
New CondorG
New CondorG negotiated with Condor people (more details by Francesco P.)
Released by end of March, included in VDT, and to be used in rel. 2.0
Two proxies
  X509UserProxy
    One per job
  X509ManagementProxy
    One per user’s DN, or one “serving” n jobs for that user’s DN
    A CondorG <gridmanager, gahp-servers> pair for a given X509ManagementProxy
Details on the whole machinery to be discussed
  Where is this user’s DN to X509ManagementProxy mapping kept and managed ?
  Proxy renewal ?
  …
Proxy renewal
Necessary to have a “persistent” proxy renewal daemon (i.e. if it is restarted it shouldn’t lose control of the “managed” jobs, as happens now)
Necessary to discuss and decide on various issues
  Renewal of X509UserProxy
    Done only if requested by the user (if MyProxyServer is specified in the JDL ?) ?
    No MyProxyServer in the WM conf file anymore ?
  And what about renewal of X509ManagementProxy ?
    If a new proxy “arrives” from the UI and extends the validity of the existing one, does the new one replace the old one ?
    Not enough: what if at least one job of that user asked for proxy renewal ?
      Then it is necessary to renew the X509ManagementProxy as well
  Who does registration ? NS ? Who does un-registration ?
  …
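One way to make the renewal daemon “persistent” is to keep its registrations on disk, so that a restarted daemon reloads the jobs it manages. The sketch below only illustrates that idea: the `RenewalRegistry` class, the JSON state file and all the names in it are assumptions, not the actual WP1 design.

```python
import json
import os
import tempfile


class RenewalRegistry:
    """Keep the jobs registered for proxy renewal in a file on disk."""

    def __init__(self, path):
        self._path = path
        self._jobs = {}
        # Reload any state left behind by a previous daemon instance.
        if os.path.exists(path):
            with open(path) as fh:
                self._jobs = json.load(fh)

    def _save(self):
        # Write to a temporary file and rename it, so that a crash
        # never leaves a half-written registry behind.
        directory = os.path.dirname(self._path) or "."
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as fh:
            json.dump(self._jobs, fh)
        os.replace(tmp, self._path)

    def register(self, job_id, user_dn, myproxy_server):
        self._jobs[job_id] = {"dn": user_dn, "myproxy": myproxy_server}
        self._save()

    def unregister(self, job_id):
        self._jobs.pop(job_id, None)
        self._save()

    def managed_jobs(self):
        return sorted(self._jobs)

    def management_proxy_needs_renewal(self, user_dn):
        # True if at least one job of this user asked for proxy renewal.
        return any(job["dn"] == user_dn for job in self._jobs.values())


# Hypothetical usage: register a job, then simulate a daemon restart.
state_file = os.path.join(tempfile.mkdtemp(), "renewal.json")
registry = RenewalRegistry(state_file)
registry.register("job-1", "/O=Grid/CN=Some User", "myproxy.example.org")
restarted = RenewalRegistry(state_file)
```

Who calls `register` and `unregister` (NS or another service) is exactly the open question raised above; the sketch takes no position on it.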
RLS integration
At J+27 RB/MM will have to query the WP2 RLS instead of the WP2 RC to get the SFNs given an LFN (or LCN, or a GUID)
  On-going negotiation of this WP1-WP2 interface
New JDL attribute (VirtualOrganization) to make it possible to refer to the “official” VO’s RLS (needed by WP2 services)
  Not needed anymore once VOMS is integrated, since it will then be possible to get the VO from the user’s proxy
Optional JDL attribute to make it possible to specify a “non-official” RLS ?
edgReplicaManager::listReplicas to get the SFNs
New BrokerInfo content (under negotiation)
Integration with WP2 Optor
Completely different approach than querying the RLS to get the PFNs (mutually exclusive) …
  RB calls getAccessCost for all the suitable CEs (the ones where the user is authorized to submit jobs and which match the JDL “Requirements” expression) and for all the specified input data (LFNs, LCNs, GUIDs)
  A “cost” is returned for each CE
  The RB chooses the CE, taking into account this cost and also the other Ranks (to be decided how)
  In some cases the WM also has to trigger the replication of files to the close SE
Not too difficult, but very high impact on scheduling/planning performed by RB/MM
Integration WMS-Optor
  Planned after J+27
  However, according to WP2, this stuff will be ready and tested well before J+27
To discuss details of integration
  How ? A binary flag in the WM conf file to enable/disable Optor ?
  When ?
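How cost and Rank are combined is still to be decided. Purely as an illustration, the sketch below subtracts a weighted access cost from the Rank and lets a boolean flag (standing in for the proposed conf file switch) disable Optor. The function name, the weight and the CE identifiers are hypothetical, and the `access_cost` dictionary only stands in for the values that getAccessCost would return.

```python
def choose_ce(candidates, access_cost, use_optor=True, cost_weight=1.0):
    """Pick the CE with the best score.

    candidates:  dict mapping CE id -> Rank (higher is better).
    access_cost: dict mapping CE id -> access cost (lower is better).
    use_optor:   when False, the access cost is ignored entirely.
    """
    if not candidates:
        return None

    def score(ce):
        value = candidates[ce]
        if use_optor:
            value -= cost_weight * access_cost[ce]
        return value

    # Iterate in sorted order so that ties are broken deterministically.
    return max(sorted(candidates), key=score)


# Hypothetical CEs that already passed the authorization and Requirements checks.
ranks = {"ce-a.example.org": 10.0, "ce-b.example.org": 8.0}
costs = {"ce-a.example.org": 5.0, "ce-b.example.org": 1.0}

with_optor = choose_ce(ranks, costs, use_optor=True)
without_optor = choose_ce(ranks, costs, use_optor=False)
```

With these numbers the cheaper data access outweighs the slightly lower Rank, so enabling Optor changes the chosen CE.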
Output data upload and registration
Problem discussed and solution agreed in the ATF
Approach (details by Fabrizio P.):
  OutputData JDL attribute (optional) to specify output file names, output LFNs and output SEs
  At the end, the JobWrapper has to call the WP2 function copyAndRegister
Issues
  Some details about copyAndRegister to be sorted out
  Release date of this stuff not decided yet
LB
What happens exactly at J+27 wrt:
  “Advanced query to LB” ?
  “LB – RGMA integration” ?
How ? Interfaces (e.g. for advanced queries) ? Issues ?
Ales ??
Gangmatching
Problem: take into account both CE and SE information in the matchmaking
  For example, to require a job to run on a CE close to an SE with “enough space”
Salvo has been working on this for a while, also after some negotiations with the Condor team (A. Roy)
  Salvo’s talk for details (e.g. JDL) and discussions
When can this stuff be released ? J+27 ?
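The “CE close to an SE with enough space” requirement can be illustrated with the small sketch below. It is plain Python rather than the ClassAd-based gangmatching being discussed, and the data layout, the names and the space units (GB) are assumptions made only for the example.

```python
def gangmatch(ces, ses, min_free_space):
    """Return (CE, SE) pairs where the SE is close to the CE and has enough free space."""
    matches = []
    for ce in ces:
        for se_name in ce["close_ses"]:
            se = ses.get(se_name)
            # Skip SEs for which no information is published.
            if se is not None and se["free_space"] >= min_free_space:
                matches.append((ce["name"], se_name))
    return matches


# Hypothetical information, as it might be published by the CEs and SEs.
ces = [
    {"name": "ce1", "close_ses": ["se1"]},
    {"name": "ce2", "close_ses": ["se2", "se3"]},
]
ses = {
    "se1": {"free_space": 5},
    "se2": {"free_space": 50},
}

# The job needs at least 10 GB on a close SE.
pairs = gangmatch(ces, ses, min_free_space=10)
```

Here only ce2 qualifies: se1 is too full, and nothing is known about se3.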
Security of files on the WM node
Approach
  WP1 services (NS, …) running as edguser.edguser on the WM node
  Different users’ subjects mapped to different local users in the grid-mapfile: user1.user, user2.user, …
  Patched gridftp server (by Massimo M.) running on the NS node, so that the InputSandbox files transferred to the NS node have edguser as group and rwxrwx--- as mask
  So a user cannot access files belonging to another user anymore
Issues
  When ? J+27 ?
  How ? Gridftp server RPM released by WP1 ?
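The rwxrwx--- mask corresponds to the octal mode 0770: full access for the owning user and the group, none for everybody else. The sketch below just applies and checks that mode on a temporary file, assuming a POSIX file system; the helper name is hypothetical, and setting the group to edguser is left out because it requires the corresponding privileges.

```python
import os
import stat
import tempfile

SANDBOX_MODE = 0o770  # rwxrwx---: owner and group full access, others none


def secure_sandbox_file(path, mode=SANDBOX_MODE):
    """Apply the sandbox permission mask and return it as an ls-style string."""
    os.chmod(path, mode)
    return stat.filemode(os.stat(path).st_mode)


# Apply the mask to a throwaway file standing in for an InputSandbox file.
fd, sandbox_file = tempfile.mkstemp()
os.close(fd)
mode_string = secure_sandbox_file(sandbox_file)

# No permission bit is left for users outside the owner and the group.
others_bits = stat.S_IMODE(os.stat(sandbox_file).st_mode) & stat.S_IRWXO
os.remove(sandbox_file)
```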
Disk quota management on the WM node
Having different user DNs mapped to different local users in the grid-mapfile of the WM node makes it possible to set disk quotas for the various users
NS to be modified (for J+27) so that it rejects a job if there is not enough disk quota available to store the input sandbox files
Issues ?
Marco ??
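The check that the NS would perform can be sketched as follows. The function name and the byte-based accounting are assumptions; how the NS actually obtains the quota and the current usage of the local user is not specified here.

```python
def accept_job(sandbox_sizes, quota_bytes, used_bytes):
    """Accept the job only if its input sandbox fits in the user's remaining quota."""
    needed = sum(sandbox_sizes)
    return used_bytes + needed <= quota_bytes


# Hypothetical numbers: a 1000-byte quota and two sandbox files of 100 and 200 bytes.
fits = accept_job([100, 200], quota_bytes=1000, used_bytes=600)
rejected = not accept_job([100, 200], quota_bytes=1000, used_bytes=800)
```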
VOMS integration
E.g.: voms-proxy-init -vo CMS
  VO info in the generated proxy
Impact on WP1 software
  Retrieve the VO from the user’s proxy
    So it is not necessary anymore to provide it in the JDL for querying the RLS
  Authorization check not done anymore with a matchmaking considering the User Cert Subject, but according to the VO
  Proxy used by the various services (NS, LB, etc.) generated by VOMS ?
Issues
  VOMS deployed at J+37, but not too clear which integration will take place and when
  Not clear yet which VOMS APIs are available
Job exit code
For release 2.0 we agreed to return the job exit code to the user with dg-job-status
What if exit code <> 0 ?
  Done-ok in any case ?
  Done-failed (and therefore resubmission) ?
ISB/OSB transfer errors
In release 1.x a job is considered failed (and therefore resubmission is attempted) if the JobWrapper detects errors when transferring an ISB/OSB file between the RB node and the WN
  But the failure could be simply due to a user’s error when writing the ISB/OSB expressions in the JDL …
  And what if the job crashed for “internal” problems and therefore some OSB files were not produced ?
  Is it ok to mark the job as failed and re-attempt the submission, or is it better to consider the job as done-ok ?
Approach in release 2.0
  JobAdapter should check and issue globus-url-copy only for ISB/OSB files which exist (simple for OSB, a bit more complex for ISB), and/or globus-url-copy errors ignored ?
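For the OSB side, the “copy only what exists” idea can be sketched as below. The function name, the destination URL and the file names are hypothetical; the sketch only builds the globus-url-copy command lines (source URL followed by destination URL) and does not run them.

```python
import os
import tempfile


def osb_copy_commands(workdir, osb_files, dest_base_url):
    """Build globus-url-copy commands only for the OSB files that exist."""
    commands, missing = [], []
    for name in osb_files:
        local = os.path.join(workdir, name)
        if os.path.isfile(local):
            src = "file://" + os.path.abspath(local)
            dst = dest_base_url.rstrip("/") + "/" + name
            commands.append(["globus-url-copy", src, dst])
        else:
            # Not produced by the job: report it instead of failing the transfer.
            missing.append(name)
    return commands, missing


# Hypothetical job directory where only one of the two declared OSB files was produced.
job_dir = tempfile.mkdtemp()
with open(os.path.join(job_dir, "out.txt"), "w") as fh:
    fh.write("result\n")

commands, missing = osb_copy_commands(
    job_dir, ["out.txt", "core.log"], "gsiftp://rb.example.org/sandbox/job1"
)
```

Whether a non-empty `missing` list should mark the job as failed or as done-ok is the policy question left open above.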
Accounting integration
What exactly happens at J+27 (“Accounting infrastructure”) ?
And later, after release 2.0 (“Full integration of cost estimation/accounting into scheduling policies”) ?
Dependencies and interfaces with other components and other WPs at J+27 and later ?
Host vs user proxies
Can we rely on users’ proxies instead of host proxies for authentication when possible, as recommended ?
  E.g. in LB logging
  Other cases ?