INFSO-RI-508833
Enabling Grids for E-sciencE
www.eu-egee.org
DAGs with data placement nodes: the “shish-kebab” jobs
Francesco Prelz, Enzo Martelli, INFN Milano
JRA1 All Hands meeting, Brno, June 20-22, 2005
Summary
• Why should we bother to schedule data jobs?
• Fundamental ingredients of data jobs:
– Quoting Ian Bird, the SRM functionality foreseen in LCG is:
   V1.1 + space management, pin/unpin, etc.
   Not the full set of V2.1
   V3 not required
   CMS still to confirm agreement with this set
– Should any additional low-level interface be considered?
• What interaction with matchmaking?
– We consider these scenarios:
   Job needing to reserve space (for output) on a given tactical (or even strategic) SE, and to release it at the end.
   Job needing to pre-stage a file in from a mass-storage system, and/or to keep the file pinned until the end of execution.
– Should anything else be considered?
The fundamental concept
A job as seen so far:
• Stage-in
• Execute the job
• Stage-out
The same job as a chain of nodes:
• Allocate space for input & output data
• Stage-in
• Execute the job
• Stage-out
• Release any temporary space used
Node types: Data Placement Jobs, Computational Jobs
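The chain on this slide can be written down as a small dependency graph. The sketch below is only an illustration: the node names come from the slide, while the `Node` class, the `linear_order` helper and the assignment of each node to a kind (only "Execute the job" treated as computational) are assumptions, not part of any existing workload management component.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Node:
    name: str
    kind: str  # "data-placement" or "computational" (assumed labelling)


ALLOCATE = Node("allocate-space", "data-placement")
STAGE_IN = Node("stage-in", "data-placement")
EXECUTE = Node("execute", "computational")
STAGE_OUT = Node("stage-out", "data-placement")
RELEASE = Node("release-space", "data-placement")

# Each node maps to the set of nodes it depends on.
DEPENDS_ON = {
    ALLOCATE: set(),
    STAGE_IN: {ALLOCATE},
    EXECUTE: {STAGE_IN},
    STAGE_OUT: {EXECUTE},
    RELEASE: {STAGE_OUT},
}


def linear_order(graph):
    """Return node names in an order that respects every dependency."""
    return [node.name for node in TopologicalSorter(graph).static_order()]
```

Because every node has exactly one predecessor, the graph has a single valid order: the skewer of the "shish-kebab".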
Just a few more details
Nodes: Match-making, Allocate space for input & output data, Stage-in, Execute the job, Stage-out
Open questions:
• Match-making: where, or rather when, should it occur? Should we deal with multiple matches?
• Allocation and file pinning: for how long? Probably file pinning should be renewed.
• Execute the job: how does the executable find the files?
– Always via POSIX, relative to CWD, with a mapping that is known in advance and is applied by the sites?
– Should the mapping be carried with the job?
• Stage-out: files should be secured to "strategic" storage, but how hard should we try to move them to their final destination?
SRM APIs
API, parameters and purpose:

srmPrepareToGet
  IN: arrayOfFileRequest, userRequestDescription
  OUT: requestToken, returnStatus
  PURPOSE: pins a file if the SRM already has the file; otherwise the SRM will allocate space, copy the file from its archive or a remote location, and pin the file.
  userRequestDescription: size of space required, lifetime, etc.
  requestToken: needed for further requests like “extendLifeTime”.

srmStatusOfGetRequest
  IN: requestToken
  OUT: returnStatus
  PURPOSE: tells us when files are available and pinned.

srmExtendFileLifeTime
  IN: requestToken, siteURL, newLifeTime
  OUT: returnStatus
  PURPOSE: extends the lifetime of a file accessible via URL.

srmReserveSpace
  IN: sizeOfTotalSpaceRequired, lifetimeOfSpaceToReserve
  OUT: spaceToken, returnStatus
  PURPOSE: allocates space with a lifetime policy.
  spaceToken: needed for further requests.

srmReleaseSpace
  IN: spaceToken
  OUT: returnStatus
  PURPOSE: releases space previously allocated.
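To make the token flow between these calls concrete, here is a minimal in-memory stand-in. It is not an SRM client: `FakeSRM`, its bookkeeping and the status strings ("SUCCESS", "FAILURE", "QUEUED", "PINNED") are hypothetical; only the method and parameter names are taken from the slide.

```python
import itertools


class FakeSRM:
    """In-memory stand-in for an SRM endpoint (hypothetical, illustration only)."""

    def __init__(self, capacity):
        self.free = capacity
        self.spaces = {}    # spaceToken -> (size, lifetime)
        self.requests = {}  # requestToken -> {"files": [...], "lifetime": ...}
        self._ids = itertools.count(1)

    def srmReserveSpace(self, sizeOfTotalSpaceRequired, lifetimeOfSpaceToReserve):
        if sizeOfTotalSpaceRequired > self.free:
            return None, "FAILURE"
        token = f"space-{next(self._ids)}"
        self.free -= sizeOfTotalSpaceRequired
        self.spaces[token] = (sizeOfTotalSpaceRequired, lifetimeOfSpaceToReserve)
        return token, "SUCCESS"

    def srmPrepareToGet(self, arrayOfFileRequest, userRequestDescription):
        token = f"req-{next(self._ids)}"
        self.requests[token] = {
            "files": list(arrayOfFileRequest),
            "lifetime": userRequestDescription.get("lifetime", 0),
        }
        return token, "QUEUED"

    def srmStatusOfGetRequest(self, requestToken):
        # The fake pins files immediately; a real SRM may take a long time.
        return "PINNED" if requestToken in self.requests else "FAILURE"

    def srmExtendFileLifeTime(self, requestToken, siteURL, newLifeTime):
        request = self.requests.get(requestToken)
        if request is None or siteURL not in request["files"]:
            return "FAILURE"
        request["lifetime"] = newLifeTime
        return "SUCCESS"

    def srmReleaseSpace(self, spaceToken):
        reservation = self.spaces.pop(spaceToken, None)
        if reservation is None:
            return "FAILURE"
        self.free += reservation[0]
        return "SUCCESS"
```

The point of the sketch is that `spaceToken` and `requestToken` are the only state a job has to carry between nodes in order to renew or release what an earlier node obtained.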
APIs used in each node
• Allocate space for input & output data: srmReserveSpace (either directly or via reservation framework)
• Stage-in: srmPrepareToGet, wait, and srmStatusOfGetRequest. The SRM pins the file if it already has it; otherwise it allocates space, copies the file and pins it. Previous allocation may be avoided.
• File pinning: srmExtendFileLifeTime
• Release any temporary space used: srmReleaseSpace (either directly or via reservation framework)
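The stage-in node amounts to "request, wait, check status". A minimal sketch of that loop follows, assuming the two SRM calls are passed in as plain Python callables and that the status values are the strings "PINNED" and "FAILURE"; these strings, the `stage_in` name and the polling parameters are hypothetical.

```python
import time


def stage_in(prepare_to_get, status_of_get_request, files, description,
             poll_interval=1.0, max_polls=10, sleep=time.sleep):
    """Issue the get request, then poll until the files are pinned.

    prepare_to_get(files, description) -> (request_token, status)
    status_of_get_request(request_token) -> status
    """
    request_token, status = prepare_to_get(files, description)
    polls = 0
    while status not in ("PINNED", "FAILURE") and polls < max_polls:
        sleep(poll_interval)
        status = status_of_get_request(request_token)
        polls += 1
    if status == "PINNED":
        return request_token  # needed later to extend the pin lifetime
    if status == "FAILURE":
        raise RuntimeError("stage-in request failed")
    raise TimeoutError("files were not pinned within the polling budget")
```

The `sleep` parameter is injectable so that the loop can be exercised without real waiting.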
How should pinning and reservation be renewed in the job flow?
• Should we add more ad-hoc machinery, as done for the proxy renewal?
• It is probably worth generalising the renewal solution so that it covers the allocation of various reservable resources.
• We are studying how to integrate an architecture for resource reservation (see T. Ferrari/E. Ronchieri's talk).
– We'll need to resolve the renewal issues in that context.
• Should we have a different approach just for data matchmaking jobs? How?
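One way to picture a generalised renewal is to treat a file pin, a space reservation and a proxy alike, as leases with an expiry time and a renew operation. The sketch below only illustrates that idea; `Lease`, `renew_expiring` and their parameters are hypothetical names, not an existing interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Lease:
    """A time-limited claim on some reservable resource (hypothetical model)."""
    name: str
    expires_at: float
    renew: Callable[[float], bool]  # asks the resource to accept a new expiry


def renew_expiring(leases, now, margin, extension):
    """Renew every lease expiring within `margin` seconds of `now`.

    Returns the names of the leases whose renewal was refused.
    """
    refused = []
    for lease in leases:
        if lease.expires_at - now > margin:
            continue  # still far from expiry, nothing to do
        new_expiry = now + extension
        if lease.renew(new_expiry):
            lease.expires_at = new_expiry
        else:
            refused.append(lease.name)
    return refused
```

A renewal daemon would call `renew_expiring` periodically; what to do with the refused leases is exactly the open question on this slide.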
Agreement Service Architecture
Agreement Initiators
Agreement Offer
Storage/Computing/Network Agreement Service
Just a DAG ? Really a DAG ?
Nodes: Match-making, Stage-in, File pinning, Execute job, Stage-out, Release any temporary space used
• A data placement node can also fail: what do we do? First release any temporary space used, then go back to match-making?
• Stage-out should likely be skipped in case of job failure, but we should not forget to release any temporary space used.
• File pinning: oh, this can fail, too!
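The failure questions on this slide can be made concrete with a small driver. The sketch assumes each node is a Python callable returning `True` on success; `run_chain` and its arguments are hypothetical, and the policy shown (release, then re-match after a failed stage-in) is just one possible answer to the question raised here.

```python
def run_chain(match, allocate, stage_in, execute, stage_out, release,
              max_attempts=3):
    """Run the chain, going back to match-making when stage-in fails.

    Returns a dict with the outcome of the job, stage-out and release steps.
    """
    release_ok = True
    for _ in range(max_attempts):
        site = match()
        space = allocate(site)
        staged = stage_in(site, space)
        job_ok = staged and execute(site)
        out_ok = job_ok and stage_out(site, space)  # skipped if the job failed
        release_ok = release(space) and release_ok  # never skipped
        if staged:
            return {"job": job_ok, "stage_out": out_ok, "release": release_ok}
        # Stage-in failed: the space is released, go back to match-making.
    return {"job": False, "stage_out": False, "release": release_ok}
```

The backward jump to `match()` is what a plain DAG cannot express, and a `False` in the `release` field is the "failed to release space" case whose status is still to be defined.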
More details about Match-making
What data attributes should contribute to the rank?
• Currently: the number of close (administratively local) files.
• Should prefetch time estimates contribute? Is srmGetReqEstTime going to be there?
• Should the possibility of remote access be taken into account? Estimated size and number of accesses if remote file access is allowed?
What data attributes should contribute to the requirements?
• This is the same as saying: should we allow a match to occur only after some independent data movement actions are taken?
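As an illustration of the rank question, the sketch below starts from the current rule (count of close files) and optionally subtracts a penalty derived from prefetch time estimates. The `data_rank` function, the `prefetch_estimate` callable (standing in for whatever srmGetReqEstTime might provide) and the `seconds_per_rank_point` weight are all assumptions.

```python
def data_rank(site, input_files, close_files_by_site, prefetch_estimate=None,
              seconds_per_rank_point=600.0):
    """Rank a site by its close files, optionally penalising prefetch time.

    close_files_by_site: dict mapping a site name to the set of files
    that are close (administratively local) to it.
    prefetch_estimate(site, file) -> estimated seconds to stage the file in.
    """
    wanted = set(input_files)
    close = close_files_by_site.get(site, set())
    rank = float(len(wanted & close))  # current rule: number of close files
    if prefetch_estimate is not None:
        missing = wanted - close
        total_seconds = sum(prefetch_estimate(site, f) for f in missing)
        rank -= total_seconds / seconds_per_rank_point
    return rank
```

For example, `max(sites, key=lambda s: data_rank(s, files, close_files_by_site))` would pick the site holding the most input files locally.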
Other details
• What should be the status of a job that failed to release space?
• "OK, but"?
• And who should be told about this?
Non-conclusive questions...
• Did we get a reasonable view of the functions beyond SRM v1.1 that are going to be there?
• We will be test-driving the generic reservation framework, applied to storage.
• This will require applying some renewal/extension semantics: should they be added ad hoc?
• Handling job flows with data seems to require capabilities beyond a DAG.
• Should we be implementing a state machine? A shell? Any other idea?
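To illustrate why a DAG may not be enough, the sketch below writes one possible job flow as a transition table and checks it for cycles. The states, the outcomes ("ok"/"fail") and the helper names are hypothetical; they show what a state-machine option could look like and are not a proposed design.

```python
from graphlib import CycleError, TopologicalSorter

# (state, outcome) -> next state; "match-making" can be re-entered.
TRANSITIONS = {
    ("match-making", "ok"): "allocate",
    ("allocate", "ok"): "stage-in",
    ("allocate", "fail"): "match-making",
    ("stage-in", "ok"): "execute",
    ("stage-in", "fail"): "release-then-rematch",
    ("release-then-rematch", "ok"): "match-making",
    ("release-then-rematch", "fail"): "match-making",
    ("execute", "ok"): "stage-out",
    ("execute", "fail"): "release",
    ("stage-out", "ok"): "release",
    ("stage-out", "fail"): "release",
    ("release", "ok"): "done",
    ("release", "fail"): "done-with-warning",
}


def is_dag(transitions):
    """Return True when the transition graph contains no cycle."""
    graph = {}
    for (state, _outcome), next_state in transitions.items():
        graph.setdefault(state, set())
        graph.setdefault(next_state, set()).add(state)
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        return False
    return True


def walk(outcomes, start="match-making"):
    """Follow the table for a sequence of outcomes; return the visited states."""
    state, path = start, [start]
    for outcome in outcomes:
        state = TRANSITIONS.get((state, outcome))
        if state is None:
            break  # no transition defined: stop here
        path.append(state)
    return path
```

With the "back to match-making" edges present `is_dag(TRANSITIONS)` is false; removing them makes the flow a DAG again, at the price of losing the retry behaviour.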
References
• SRM V1 API: http://sdm.lbl.gov/srm-wg/doc/SRM.Joint.Functional.Design.Jan2002.pdf
• SRM V2 API: http://sdm.lbl.gov/srm-wg/doc/SRM.spec.v2.1.1.html