Post on 22-Dec-2015
Evaluation of the Globus GRAM Service
Massimo Sgaravatto, INFN Padova
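Evaluation of GRAM Service
[Diagram: three sites (Site1, Site2, Site3), each with a GRAM in front of a local resource management system (Condor, LSF, PBS). Jobs are submitted using Globus tools. The GIS holds information on characteristics and status of local resources.]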
Evaluation of GRAM Service
Job submission tests using Globus tools (globusrun, globus-job-run, globus-job-submit)
GRAM as uniform interface to different underlying resource management systems
"Cooperation" between GRAM and GIS
Evaluation of RSL as uniform language to specify resources
Tests performed with Globus 1.1.2 and 1.1.3 and Linux machines
GRAM & fork system call
[Diagram: client and server (fork), both running Globus]
GRAM & Condor
[Diagram: client running Globus; server (Condor front-end machine) running Globus and Condor, in front of the Condor pool]
GRAM & Condor
Tests considering:
  Standard Condor jobs (relinked with Condor library)
    INFN WAN Condor pool configured as Globus resource
    ~ 200 machines spread across different sites
    Heterogeneous environment
    No single file system and UID domain
  Vanilla jobs ("normal" jobs)
    PC farm configured as Globus resource
    Single file system and UID domain
GRAM & LSF
[Diagram: client running Globus; server (LSF front-end machine) running Globus and LSF, in front of the LSF cluster]
Results
Some bugs found and fixed (fixes included in INFNGRID 1.1 distribution)
  Standard output and error for vanilla Condor jobs
  globus-job-status
  …
Some bugs can be solved without major re-design and/or re-implementation:
  For LSF the RSL parameter (count=x) is translated into: bsub -n x …
    This just allocates x processors and dispatches the job to the first one
    Used for parallel applications
    Should be: bsub … x times
    Maybe we don't need to solve this problem (see later…)
  …
Two major problems:
  Scalability
  Fault tolerance
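The two translations of (count=x) mentioned on this slide can be contrasted with a small sketch. It is illustrative only: the function names are hypothetical and are not taken from the Globus LSF scripts, and the other bsub options (shown as "…" on the slide) are simply omitted here. Only the `-n` option comes from the slide.

```python
from typing import List


def bsub_current(executable: str, count: int) -> List[List[str]]:
    """Behaviour reported on the slide: one bsub call that reserves `count` processors."""
    return [["bsub", "-n", str(count), executable]]


def bsub_expected(executable: str, count: int) -> List[List[str]]:
    """Behaviour the slide asks for: `count` independent bsub calls."""
    return [["bsub", executable] for _ in range(count)]


if __name__ == "__main__":
    # Print both variants for a count of 3 so the difference is visible.
    for cmd in bsub_current("/diskCms/startcmsim.sh", 3):
        print(" ".join(cmd))
    for cmd in bsub_expected("/diskCms/startcmsim.sh", 3):
        print(" ".join(cmd))
```

With the first form a single job holds several processors, which suits parallel applications; with the second form each submission is an independent job.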
Globus GRAM Architecture
[Diagram: the client (pc1) submits to the Globus front-end machine (pc2), where a jobmanager hands the job to LSF/Condor/PBS/…]
pc1% globusrun -b -r pc2.pd.infn.it/jobmanager-xyz \
    -f file.rsl
file.rsl:
&(executable=/diskCms/startcmsim.sh)(stdin=/diskCms/PythiaOut/filename)(stdout=/diskCms/Cmsim/filename)(count=1)
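An RSL string of the flat form used on this slide, `&(name=value)(name=value)…`, can be checked before submission with a few lines of Python. This is a minimal sketch under stated assumptions: `parse_simple_rsl` is a hypothetical helper, not part of Globus, and it handles only this flat conjunction, not the full RSL grammar.

```python
import re
from typing import Dict


def parse_simple_rsl(rsl: str) -> Dict[str, str]:
    """Parse '&(name=value)(name=value)...' into a dict; raise on malformed input."""
    text = rsl.strip()
    if not text.startswith("&"):
        raise ValueError("expected a conjunction starting with '&'")
    body = text[1:]
    # Each relation is '(name=value)' with no nested parentheses.
    relations = re.findall(r"\(([^()=]+)=([^()]*)\)", body)
    # Re-assemble what was matched; any difference means stray or unbalanced text.
    rebuilt = "".join(f"({name}={value})" for name, value in relations)
    if rebuilt != body:
        raise ValueError("malformed RSL (unbalanced parentheses?)")
    return dict(relations)


if __name__ == "__main__":
    rsl = ("&(executable=/diskCms/startcmsim.sh)"
           "(stdin=/diskCms/PythiaOut/filename)"
           "(stdout=/diskCms/Cmsim/filename)(count=1)")
    print(parse_simple_rsl(rsl))
```

A check of this kind rejects a string in which one relation is missing its closing parenthesis, which is an easy mistake to make when the RSL is written by hand.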
Scalability
One jobmanager for each globusrun
If I want to submit 1000 jobs ???
  1000 globusrun => 1000 jobmanagers running on the front-end machine !!!
  Or a single globusrun with (count=1000):
    % globusrun -b -r pc2.infn.it/jobmanager-xyz -f file.rsl
    file.rsl:
    &(executable=/diskCms/startcmsim.sh)(stdin=/diskCms/PythiaOut/filename)(stdout=/diskCms/CmsimOut/filename)(count=1000)
    It is not possible to specify in the RSL file 1000 different input files and 1000 different output files … (Condor has $(Process) for this)
    Problems with job monitoring (globus-job-status)
  Therefore (count=x) with x>1 is not very useful !
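Because a single RSL file cannot carry a per-job index, distinct input and output files have to be produced outside RSL, one request per job. The sketch below illustrates that workaround. The `file{i}` naming scheme and both function names are assumptions made for the example; only the executable and directory paths come from the slide.

```python
from typing import Dict, List


def build_rsl(attrs: Dict[str, object]) -> str:
    """Serialize attributes as a flat RSL conjunction: &(k=v)(k=v)..."""
    return "&" + "".join(f"({key}={value})" for key, value in attrs.items())


def per_job_rsl(n_jobs: int) -> List[str]:
    """Build one RSL string per job, each with its own stdin and stdout."""
    requests = []
    for i in range(n_jobs):
        requests.append(build_rsl({
            "executable": "/diskCms/startcmsim.sh",
            "stdin": f"/diskCms/PythiaOut/file{i}",    # hypothetical file naming
            "stdout": f"/diskCms/CmsimOut/file{i}",    # hypothetical file naming
            "count": 1,
        }))
    return requests


if __name__ == "__main__":
    requests = per_job_rsl(1000)
    print(requests[0])
    # Each request needs its own globusrun, hence its own jobmanager.
    print(f"{len(requests)} globusrun invocations required")
```

This gets the per-job files right but does nothing for the scalability problem: the number of jobmanagers on the front-end machine still grows with the number of jobs.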
Fault tolerance
The jobmanager is not persistent
If the jobmanager can't be contacted, Globus assumes that the job(s) have been completed
Example of problem
  Submission of n jobs on a cluster managed by a local resource management system
  Reboot of the front-end machine
  The jobmanager(s) don't restart
  Orphan jobs: Globus assumes that the jobs have been successfully completed
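The failure mode can be modelled in a few lines. This is a toy model, not Globus code: the status names and both functions are hypothetical. The first function encodes the behaviour reported on the slide, and the second shows one possible alternative in which an unreachable jobmanager yields an explicit "unknown" state.

```python
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    DONE = "done"
    UNKNOWN = "unknown"


def status_as_reported(jobmanager_reachable: bool, job_running: bool) -> Status:
    """Behaviour reported on the slide: an unreachable jobmanager is read as 'done'."""
    if not jobmanager_reachable:
        return Status.DONE
    return Status.ACTIVE if job_running else Status.DONE


def status_conservative(jobmanager_reachable: bool, job_running: bool) -> Status:
    """Hypothetical alternative: report uncertainty instead of assuming completion."""
    if not jobmanager_reachable:
        return Status.UNKNOWN
    return Status.ACTIVE if job_running else Status.DONE


if __name__ == "__main__":
    # Front-end machine rebooted: the jobmanager is gone, the job still runs in the cluster.
    print(status_as_reported(False, True))
    print(status_conservative(False, True))
```

In the reboot scenario the job is still running under the local resource management system, yet the first function reports it as done; that is the orphan-job case described above.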
GRAM & GIS
How do the local GRAMs provide the GIS with characteristics and status of local resources?
Tests performed considering:
  Condor pool
  LSF cluster
GRAM & Condor & GIS
[Figure not preserved in the transcript]
GRAM & LSF & GIS
[Figure not preserved in the transcript]
Must be fixed
Jobs & GIS
Info on Globus jobs published in the GIS:
  User
    Subject of certificate
    Local user name
  RSL string
  Globus job id
  LSF/Condor/… job id
  Status: Run/Pending/…
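The per-job information listed on this slide can be pictured as a simple record. The sketch below is only an illustration of the fields: the class and field names are hypothetical and do not reproduce the actual GIS schema, and every value in the example is made up.

```python
from dataclasses import dataclass, asdict


@dataclass
class GlobusJobRecord:
    """One published job entry; field names are illustrative, not the GIS schema."""
    certificate_subject: str   # user: subject of certificate
    local_user: str            # user: local user name
    rsl: str                   # RSL string used for the submission
    globus_job_id: str         # Globus job id
    local_job_id: str          # LSF/Condor/... job id
    status: str                # e.g. Run or Pending


if __name__ == "__main__":
    # All values below are placeholders invented for the example.
    record = GlobusJobRecord(
        certificate_subject="/O=Example/CN=Some User",
        local_user="someuser",
        rsl="&(executable=/diskCms/startcmsim.sh)(count=1)",
        globus_job_id="example-globus-id",
        local_job_id="example-local-id",
        status="Pending",
    )
    print(asdict(record))
```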
GRAM & GIS
The information on characteristics and status of local resources and on jobs is not enough
  As local resources we must consider farms and not the single workstations
  Other information (e.g. total and available CPU power) is needed
Fortunately the default schema can be integrated with other info provided by specific agents
The needed information must be identified first
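Treating a farm, rather than each workstation, as the resource amounts to aggregating per-node data before publishing it. A minimal sketch of such an aggregation follows. The `Workstation` record, the `farm_summary` function, and the attribute names are assumptions for illustration; the slides do not define units or a schema for CPU power.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Workstation:
    name: str
    cpu_power: float   # arbitrary units; the slides do not fix a unit
    busy: bool


def farm_summary(nodes: Iterable[Workstation]) -> Dict[str, float]:
    """Aggregate per-node data into farm-level totals."""
    nodes = list(nodes)
    total = sum(n.cpu_power for n in nodes)
    available = sum(n.cpu_power for n in nodes if not n.busy)
    return {
        "nodes": len(nodes),
        "total_cpu_power": total,
        "available_cpu_power": available,
    }


if __name__ == "__main__":
    farm = [
        Workstation("node1", 10.0, busy=True),
        Workstation("node2", 10.0, busy=False),
        Workstation("node3", 20.0, busy=False),
    ]
    print(farm_summary(farm))
```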
RSL
We need a uniform language to specify resources, between different resource management systems
The RSL syntax model seems suitable to define even complicated resource specification expressions
The common set of RSL attributes is often not sufficient
The attributes not belonging to the common set are ignored
RSL
More flexibility is required
  Resource administrators should be allowed to define new attributes and users should be allowed to use them in resource specification expressions (Condor Class-Ads model)
Same language to describe the offered resources and the requested resources (Condor Class-Ads model) seems a better approach
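The idea of one language for both offers and requests can be sketched as a symmetric match: each side carries attributes plus a requirement evaluated against the other side. The code below is loosely modelled on that idea only. It is not ClassAd syntax and not a Condor API, and the attribute names (including the site-defined `cms_software`) are invented for the example.

```python
from typing import Any, Callable, Dict

Ad = Dict[str, Any]


def matches(request: Ad, offer: Ad) -> bool:
    """Symmetric match: each ad's requirement must accept the other ad."""
    return bool(request["requirements"](offer)) and bool(offer["requirements"](request))


# Offer published by a resource administrator, with a site-defined attribute.
offer: Ad = {
    "arch": "INTEL",
    "opsys": "LINUX",
    "cms_software": True,   # hypothetical attribute defined by the site
    "requirements": lambda req: req.get("owner") != "banned_user",
}

# Request written by a user, referring to the site-defined attribute.
request: Ad = {
    "owner": "someuser",
    "requirements": lambda res: res.get("opsys") == "LINUX"
    and res.get("cms_software", False),
}

if __name__ == "__main__":
    print(matches(request, offer))
```

Because the requirement is evaluated against whatever attributes the other side publishes, a new attribute can take part in matching without any change to a fixed common set.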
Next steps
Bug fixes
  Modification of Globus LSF scripts for GIS
  Problem (count=x) with LSF ???
Tests with real applications and real environments (CMS fall production)
Define a small set of attributes of a Condor pool, LSF cluster, PBS cluster that should be reported to the GIS, and try to implement it
  Let's start with information provided by the underlying resource management system
Tests with GRAM API
Tests with other resource management systems are not necessary
Scalability and robustness problems
  Not so simple and straightforward !!!
  Up to Workload management WP, possible collaboration with Globus team and Condor team