HENP Grid Testbeds, Applications and Demonstrations
Rob Gardner, University of Chicago
CHEP03, March 29, 2003
Ruth Pordes, Fermilab
Overview
High altitude survey of contributions
– group, application, testbed, services/tools
Discuss common and recurring issues
– grid building, services development, use
Concluding thoughts
– Acknowledgement to all the speakers who gave fine presentations, and my apologies in advance for providing only this *very limited* sampling
Testbeds, applications, and development of tools and services
Testbeds:
– AliEn grids
– BaBar Grid
– CrossGrid
– DataTAG
– EDG Testbed(s)
– Grid Canada
– IGT Testbed (US CMS)
– Korean DataGrid
– NorduGrid(s)
– SAMGrid
– US ATLAS Testbed
– WorldGrid
Evaluations
– EDG testbed evaluations and experience in multiple experiments
– Testbed management experience
Applications
– ALICE production
– ATLAS production
– BaBar analysis, file replication
– CDF/D0 analysis
– CMS production
– LHCb production
– Medical applications in Italy
– PHENIX
– Sloan sky survey
Tools development
– Use cases (HEPCAL)
– PROOF/Grid analysis
– LCG POOL and grid catalogs
– SRM, Magda
– Clarens, Ganga, Genius, Grappa, JAS
EDG TB History
Version Date
1.1.2 27 Feb 2002
1.1.3 02 Apr 2002
1.1.4 04 Apr 2002
1.2.a1 11 Apr 2002
1.2.b1 31 May 2002
1.2.0 12 Aug 2002
1.2.1 04 Sep 2002
1.2.2 09 Sep 2002
1.2.3 25 Oct 2002
1.3.0 08 Nov 2002
1.3.1 19 Nov 2002
1.3.2 20 Nov 2002
1.3.3 21 Nov 2002
1.3.4 25 Nov 2002
1.4.0 06 Dec 2002
1.4.1 07 Jan 2003
1.4.2 09 Jan 2003
1.4.3 14 Jan 2003
1.4.4 18 Jan 2003
1.4.5 26 Feb 2003
1.4.6 4 Mar 2003
1.4.7 8 Mar 2003
Figure: EDG testbed release timeline, with annotations:
• Successes: Matchmaking/Job Mgt., Basic Data Mgt. Known Problems: High Rate Submissions, Long FTP Transfers
• Known Problems: GASS Cache Coherency, Race Conditions in Gatekeeper, Unstable MDS
• Intense Use by Applications! Limitations: Resource Exhaustion, Size of Logical Collections
• Successes: Improved MDS Stability, FTP Transfers OK. Known Problems: Interactions with RC
• ATLAS phase 1 start
• CMS stress test Nov. 30 – Dec. 20
• CMS, ATLAS, LHCb, ALICE
Emanuele Leonardi
Résumé of experiment DC use of EDG – see experiment talks elsewhere at CHEP
ATLAS were first, in August 2002. The aim was to repeat part of the Data Challenge. Found two serious problems, which were fixed in 1.3
CMS stress test production Nov-Dec 2002 – found more problems in the area of job submission and RC handling – led to 1.4.x
ALICE started on Mar 4: production of 5,000 central Pb-Pb events – 9 TB; 40,000 output files; 120k CPU hours
– Progressing with similar efficiency levels to CMS
– About 5% done by Mar 14
– "Pull" architecture
LHCb started mid Feb
– ~70K events for physics
– Like ALICE, using a pull architecture
BaBar/D0
– Have so far done small scale tests
– Larger scale planned with EDG 2
[Plot: No. of events (up to ~250k) vs. time over 21 days]
Stephen Burke
CMS Data Challenge 2002 on Grid
Two "official" CMS productions on the grid in 2002
– CMS-EDG Stress Test on EDG testbed + CMS sites
> ~260K events, CMKIN and CMSIM steps
> Top-down approach: more functionality but less robust, large manpower needed
– USCMS IGT Production in the US
> 1M events Ntuple-only (full chain in single job)
> 500K up to CMSIM (two steps in single job)
> Bottom-up approach: less functionality but more stable, little manpower needed
– See talk by P. Capiluppi
C. Grande
CMS production components interfaced to EDG
• Four submitting UIs: Bologna/CNAF (IT), Ecole Polytechnique (FR), Imperial College (UK), Padova/INFN (IT)
• Several Resource Brokers (WMS), CMS-dedicated and shared with other applications: one RB for each CMS UI + "backup"
• Replica Catalog at CNAF, MDS (and II) at CERN and CNAF, VO server at NIKHEF
[Diagram: CMS production tools (IMPALA/BOSS on the UI, with RefDB providing parameters and the BOSS DB recording runtime monitoring and job output filtering) interfaced to the EDG Workload Management System via JDL; the Replica Manager handles input data location and data registration; jobs run on CEs/WNs with CMS software installed and read/write data on SEs; arrows distinguish "push data or info" from "pull info" flows between the CMS and EDG sides]
CMS/EDG Production
~260K events produced
~7 sec/event average
~2.5 sec/event peak (12-14 Dec)
[Plot: # events vs. time, 30 Nov – 20 Dec; annotated: CMS Week, upgrade of MW, hit some limit of implementation]
P. Capiluppi talk
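[Note added in this summary: a quick consistency check on the rates above – ~260K events at ~7 sec/event is about 1.8×10^6 seconds, i.e. roughly 21 days of elapsed time, matching the 30 Nov – 20 Dec window; the sec/event figures are presumably aggregate throughput of the whole production, not per-CPU processing times.]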
US-CMS IGT Production
[Plot: events produced vs. time, 25 Oct – 28 Dec]
> 1 M events
4.7 sec/event average
2.5 sec/event peak (14-20 Dec 2002)
Sustained efficiency: about 44%
P. Capiluppi talk
Grid in ATLAS DC1*
US-ATLAS: part of Phase 1 production; full Phase 2 production
EDG Testbed Prod: reproduce part of Phase 1; several tests
NorduGrid: full Phase 1 & 2 data production
[ * See other ATLAS talks for more details]
G.Poulard
Contribution to the overall CPU-time (%) per country
[Pie chart: CPU-time share per country, 16 contributors]
ATLAS DC1 Phase 1: July-August 02
3200 CPUs
110 kSI95
71000 CPU days
5×10^7 events generated
1×10^7 events simulated
3×10^7 single particles
30 Tbytes
35 000 files
39 Institutes in 18 Countries
1. Australia  2. Austria  3. Canada  4. CERN  5. Czech Republic  6. France  7. Germany  8. Israel  9. Italy  10. Japan  11. Nordic  12. Russia  13. Spain  14. Taiwan  15. UK  16. USA
grid tools used at 11 sites
G.Poulard
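[Note added in this summary: for scale, 71000 CPU-days spread over the 3200 CPUs corresponds to roughly 22 days of continuous running, consistent with a production carried out over the July-August period.]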
Meta Systems
MCRunJob approach by CMS production team
Framework for dealing with multiple grid resources and testbeds (EDG, IGT)
[Diagram: the user request "I want to run applications A, B, and C" goes to the Framework (Make Job), which attaches Configurators A, B and C; each Configurator configures a ScriptGenerator that emits a /bin/sh script to run its application, and the Linker combines them into a single job script (#!/bin/env sh; scriptA; scriptB; scriptC)]
G.Graham
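The linked job script in the diagram is just the three generated scripts chained together. A minimal sketch of what such a script could look like follows; the file names scriptA/B/C come from the diagram, while the shebang, set -e and comments are assumptions added here, not actual MCRunJob output.

#!/bin/sh
# Sketch of the Linker's output: one job script that runs the per-application
# scripts produced by Configurators A, B and C in order.
set -e          # assumption: abort the chain if any application step fails
./scriptA       # run application A
./scriptB       # run application B
./scriptC       # run application C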
Hybrid production model
[Diagram: production requests enter MCRunJob in several ways – a Physics Group asks RefDB for an official dataset, the Production Manager defines assignments, a Site Manager starts an assignment, or a user starts a private production; MCRunJob then emits the jobs in the form each back-end needs: a DAG of jobs for DAGMan (MOP), JDL for the EDG Scheduler and LCG-1 testbed, shell scripts for a local batch manager on a computer farm or the user's site resources, and Chimera VDL feeding a Virtual Data Catalogue and Planner]
C. Grande
Interoperability: glue
[Diagram: two interoperating grids, each with UI, RB, CE, SE, Replica Catalog (RC) and Information Service (IS), joined through VDT Client and VDT Server components]
Integrated Grid Systems
Two examples of integrating advanced production and analysis across multiple grids
SamGrid AliEn
SamGrid Map
• CDF
– Kyungpook National University, Korea
– Rutgers State University, New Jersey, US
– Rutherford Appleton Laboratory, UK
– Texas Tech, Texas, US
– University of Toronto, Canada
• DØ
– Imperial College, London, UK
– Michigan State University, Michigan, US
– University of Michigan, Michigan, US
– University of Texas at Arlington, Texas, US
Physics with SAM-Grid
Standard CDF analysis job submitted via SAM-Grid and executed somewhere
[Plots: z0(µ1) vs. z0(µ2); J/ψ => µ+ µ-]
S. Stonjek
18
VO
RC
RB
CE
SE
WN
CE
SE
WNCE
SE
WN
CE
SE
WN CE SE WN
The BaBar Grid as of March 2003 D. Boutigny
special challenges faced by a running experimentwith heterogeneous data requirements, root, Objy
Grid Applications, Interfaces, Portals
Clarens, Ganga, Genius, Grappa, JAS-Grid, Magda, Proof-Grid
and higher level services
– Storage Resource Manager (SRM)
– Magda data management
– POOL-Grid interface
PROOF and Data Grids
Many services are a good fit
– Authentication
– File Catalog, replication services
– Resource brokers
– Monitoring
Use abstract interfaces
Phased integration
– Static configuration
– Use of one or multiple Grid services
– Driven by Grid infrastructure
Fons Rademakers
Different PROOF–GRID Scenarios
Static stand-alone
– Current version, static config file, pre-installed
Dynamic, PROOF in control
– Using grid file catalog and resource broker, pre-installed
Dynamic, AliEn in control
– Idem, but installed and started on the fly by AliEn
Dynamic, Condor in control
– Idem, but in addition allowing slave migration in a Condor pool
Fons Rademakers
GLUE Testbed
[Diagram: a job is submitted from the GENIUS portal on the UI as JDL to the RB/JSS; the RB consults the GLUE-Schema based Information System (II, top GIIS, CEs) and the Replica Catalog for the input data location, runs the job on a CE/WN carrying the ATLAS sw, reads input from an SE and performs data registration]
JDL, GLUE-aware files:
Executable = "/usr/bin/env";
Arguments = "zsh prod.dc1_wrc 00001";
VirtualOrganization = "datatag";
Requirements = Member(other.GlueHostApplicationSoftwareRunTimeEnvironment, "ATLAS-3.2.1");
Rank = other.GlueCEStateFreeCPUs;
InputSandbox = {"prod.dc1_wrc", "rc.conf", "plot.kumac"};
OutputSandbox = {"dc1.002000.test.00001.hlt.pythia_jet_17.log", "dc1.002000.test.00001.hlt.pythia_jet_17.his", "dc1.002000.test.00001.hlt.pythia_jet_17.err", "plot.kumac"};
ReplicaCatalog = "ldap://dell04.cnaf.infn.it:9211/lc=ATLAS,rc=GLUE,dc=dell04,dc=cnaf,dc=infn,dc=it";
InputData = {"LF:dc1.002000.evgen.0001.hlt.pythia_jet_17.root"};
StdOutput = "dc1.002000.test.00001.hlt.pythia_jet_17.log";
StdError = "dc1.002000.test.00001.hlt.pythia_jet_17.err";
DataAccessProtocol = "file";
see WorldGrid Poster this conf.
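For orientation, a rough sketch of how a JDL file like the one above is driven from an EDG User Interface; the file name dc1_00001.jdl is hypothetical and exact command options varied between releases, but the command names are those listed on the Ganga EDG interface slide later in this session.

#!/bin/sh
# Hedged sketch: driving the GLUE-aware DC1 JDL from an EDG 1.x UI.
grid-proxy-init                    # obtain a Globus proxy credential
dg-job-list-match dc1_00001.jdl    # ask the RB which CEs satisfy the GLUE Requirements
dg-job-submit dc1_00001.jdl        # submit; the RB returns a job identifier to keep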
Ganga: ATLAS and LHCb
[Diagram: the GANGA Core Module sits on a Python software bus together with a GUI, an OS module, Athena/GAUDI access via GaudiPython and PythonROOT, a local Job DB and a Job Configuration DB; a remote user (client) connects over LAN/WAN through an XML-RPC module and server; job back-ends include the EDG UI (GRID) and a local resource management system (LRMS); server-side Bookkeeping and Production DBs are also shown]
C. Tull
Ganga EDG Grid Interface
[Diagram: Ganga's Job class, JobsRegistry class and Job Handler class mapped onto EDG UI services and commands]
Job submission: dg-job-list-match, dg-job-submit, dg-job-cancel
Security service: grid-proxy-init, MyProxy
Job monitoring: dg-job-status, dg-job-get-logging-info, GRM/PROVE
Data management service: edg-replica-manager, dg-job-get-output, globus-url-copy, GDMP
EDG UI
C. Tull
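Read across, the mapping above is essentially the command sequence Ganga drives on the user's behalf. A rough lifecycle sketch using only the EDG UI commands named on the slide; the JDL file name and the example job identifier are hypothetical.

#!/bin/sh
# Illustrative EDG job lifecycle as Ganga's job handler might drive it.
grid-proxy-init                         # security service: create a Globus proxy (MyProxy for renewal)
dg-job-submit myjob.jdl                 # job submission; prints a job identifier
JOB_ID="https://rb.example.org:9000/x"  # assumption: the identifier returned above
dg-job-status "$JOB_ID"                 # job monitoring: poll the job state
dg-job-get-logging-info "$JOB_ID"       # detailed logging for debugging
dg-job-get-output "$JOB_ID"             # retrieve the OutputSandbox when the job is done
# data management goes through edg-replica-manager, globus-url-copy or GDMP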
Comment: Building Grid Applications
P is a dynamic configuration script
Turns an abstract bundle into a concrete one
Challenge:
– building integrated systems
– distributed developers and support
[Diagram: a Grid Component Library (CTL, ATL, GTL) supplies templates and abstract bundles; the configuration script P, given attributes (user info, grid info) for users U1 and U2, produces concrete bundles such as P1a and P1c]
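To make the "dynamic configuration script" idea concrete, here is a hedged sketch of the kind of thing P does: take an abstract bundle (a template with unresolved attributes) and emit a concrete bundle for one user and one grid. All file and variable names are hypothetical, not taken from the Grid Component Library.

#!/bin/sh
# Hedged illustration of a configuration script in the spirit of P.
USER_INFO=${1:?usage: configure.sh <user-info> <grid-info>}
GRID_INFO=${2:?usage: configure.sh <user-info> <grid-info>}

# substitute the attributes into the template to produce the concrete bundle
sed -e "s|@USER@|$USER_INFO|g" \
    -e "s|@GRID@|$GRID_INFO|g" \
    abstract-bundle.template > "concrete-bundle-$USER_INFO.sh"
chmod +x "concrete-bundle-$USER_INFO.sh"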
In summary… Common issues
Installation and configuration of MW
Application packaging, run time environments
Authentication mechanisms
Policies differing among sites
Private networks, firewalls, ports
Fragility of services, job submission chain
Inaccuracies, poor performance of information services
Monitoring at several levels
Debugging, site cleanup
Conclusions
Progress in the past 18 months has been dramatic!
– lots of experience gained in building integrated grid systems
– demonstrated functionality with large scale production
– more attention being given to analysis
Many pitfalls exposed, areas for improvement identified
– some of these are core middleware; feedback given to technology providers
– policy issues remain – using shared resources, authorization
– operation of production services – user interactions, support models to be developed
Many thanks to the contributors to this session