Experience with ATLAS Data Challenge Production
on the U.S. Grid Testbed
Kaushik De
University of Texas at Arlington
CHEP03
March 27, 2003
The ATLAS Experiment
• Multi-purpose experiment at the Large Hadron Collider, CERN
• 14 TeV c.m. pp collisions starting in 2007
• Physics: Higgs, SUSY, new searches...
• Petabytes/year of data analyzed by >2000 physicists worldwide - need the GRID
U.S. ATLAS Grid Testbed
BNL - U.S. Tier 1, 2000 nodes, 5% ATLAS, 10 TB
LBNL - pdsf cluster, 400 nodes, 5% ATLAS, 1 TB
Boston U. - prototype Tier 2, 64 nodes
Indiana U. - prototype Tier 2, 32 nodes
UT Arlington - 20 nodes
Oklahoma U. - 12 nodes
U. Michigan - 10 nodes
ANL - test nodes
SMU - 6 nodes
UNM - new site
U.S. Testbed Goals
Deployment
  Set up grid infrastructure and ATLAS software
  Test installation procedures (PACMAN)
Development & Testing
  Grid applications - GRAT, Grappa, Magda...
  Other software - monitoring, packaging...
Run Production
  For U.S. physics data analysis and tests
  Main focus - ATLAS Data Challenges: simulation, pile-up, reconstruction
Connection to GRID projects
  GriPhyN - Globus, Condor, Chimera... use & test
  iVDGL - VDT, glue schema testbed, WorldGrid testbed, demos... use & test
  EDG, LCG... testing & deployment
ATLAS Data Challenges
DC's - Generate and analyse simulated data (see talk by Gilbert Poulard on Tuesday)
Original Goals (Nov 15, 2001)
  Test the computing model, its software, its data model, and ensure the correctness of the technical choices to be made
  Data Challenges should be executed at the prototype Tier centres
  Data Challenges will be used as input for a Computing Technical Design Report due by the end of 2003 (?) and for preparing a MoU
Current Status
  Goals are evolving as we gain experience
  Sequence of increasing scale & complexity
  DC0 (completed), DC1 (underway)
  DC2, DC3, and DC4 planned
  Grid deployment and testing a major part of DC's
GRAT Software
GRid Applications Toolkit
Used for U.S. Data Challenge production
Based on Globus, Magda & MySQL
Shell & Python scripts, modular design
Rapid development platform
  Quickly develop packages as needed by DC
  Single particle production
  Higgs & SUSY production
  Pile-up production & data management
  Reconstruction
Test grid middleware, test grid performance
Modules can be easily enhanced or replaced by Condor-G, EDG resource broker, Chimera, replica catalogue, OGSA... (in progress)
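The pluggable design described above can be sketched as a thin submission layer where swapping Globus for Condor-G means changing one table entry rather than rewriting the production scripts. All names below (`submit_globus`, `BACKENDS`, the fake job handles) are illustrative, not actual GRAT code.

```python
# Illustrative sketch of a pluggable job-submission layer in the
# spirit of GRAT's modular design. Hypothetical names throughout.

def submit_globus(site, script):
    # Real code would shell out to the Globus job-submission tools;
    # here we just return a fake job handle for illustration.
    return f"globus://{site}/{script}"

def submit_condor_g(site, script):
    # Alternative backend: Condor-G submission (stubbed).
    return f"condorg://{site}/{script}"

# Swapping middleware = changing one dictionary entry.
BACKENDS = {"globus": submit_globus, "condor-g": submit_condor_g}

def run_job(backend, site, script):
    return BACKENDS[backend](site, script)

print(run_job("globus", "atlas.bnl.gov", "dc1_sim.sh"))
# -> globus://atlas.bnl.gov/dc1_sim.sh
```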
GRAT Execution Model
1. Resource Discovery
2. Partition Selection
3. Job Creation
4. Pre-stage
5. Batch Submission
6. Job Parameterization
7. Simulation
8. Post-stage
9. Cataloging
10. Monitoring

[Diagram: the steps above flow between the DC1 production database (UTA), a remote gatekeeper, a local replica catalogue, Magda (BNL), the parameter database (CERN), batch execution, and scratch space.]
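The ten steps can be read as a sequential driver in which each stage updates a shared job context and hands it to the next. The sketch below is a hypothetical rendering of that control flow, not GRAT itself; real stages would talk to the production database, Globus gatekeepers, and Magda.

```python
# Hedged sketch of the GRAT execution model as a sequential pipeline.
# Each stage is a stand-in function; values are illustrative only.

def discover_resources(ctx):  ctx["site"] = "lbnl-pdsf"; return ctx
def select_partition(ctx):    ctx["partition"] = 42; return ctx
def create_job(ctx):          ctx["job"] = f"dc1.{ctx['partition']}"; return ctx
def pre_stage(ctx):           ctx["staged_in"] = True; return ctx
def submit_batch(ctx):        ctx["submitted"] = True; return ctx
def parameterize(ctx):        ctx["seed"] = 12345; return ctx
def simulate(ctx):            ctx["output"] = ctx["job"] + ".zebra"; return ctx
def post_stage(ctx):          ctx["staged_out"] = True; return ctx
def catalog(ctx):             ctx["cataloged"] = True; return ctx
def monitor(ctx):             ctx["status"] = "done"; return ctx

PIPELINE = [discover_resources, select_partition, create_job, pre_stage,
            submit_batch, parameterize, simulate, post_stage, catalog, monitor]

ctx = {}
for stage in PIPELINE:
    ctx = stage(ctx)
print(ctx["output"], ctx["status"])  # dc1.42.zebra done
```

Because each stage only depends on the context dictionary, any single stage can be replaced (e.g. batch submission via Condor-G) without touching the others, which matches the modularity claim on the previous slide.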
Middleware Evolution of U.S. Applications
[Table: middleware components grouped by maturity; the component names appear only in the original figure]
- Used in current production software (GRAT & Grappa)
- Tested successfully (not yet used for large-scale production)
- Under development and testing
- Tested for simulation (will be used for large-scale reconstruction)
Databases used in GRAT
MySQL databases central to GRAT
Production database
  define logical job parameters & filenames
  track job status, updated periodically by scripts
Data management (Magda)
  file registration/catalogue
  grid based file transfers
Virtual Data Catalogue
  simulation job definition
  job parameters, random numbers
Metadata catalogue (AMI)
  post-production summary information
  data provenance
Similar scheme being considered ATLAS-wide by the Grid Technical Board
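A minimal sketch of the production-database pattern above (logical job definition up front, status updated periodically by scripts) can be written against sqlite3 as a stand-in for the MySQL server; the table and column names are invented for illustration, not the actual GRAT schema.

```python
import sqlite3

# Stand-in for GRAT's MySQL production database: define logical jobs,
# then let production scripts update status as partitions run.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE jobs (
    part_id INTEGER PRIMARY KEY,  -- partition number
    lfn     TEXT,                 -- logical filename, registered in Magda
    site    TEXT,
    status  TEXT)""")

# Job creation: logical parameters & filenames defined up front.
db.execute("INSERT INTO jobs VALUES (1, 'dc1.simul.0001.zebra', 'UTA', 'defined')")

# A script periodically updates status as the job progresses.
db.execute("UPDATE jobs SET status='done' WHERE part_id=1")

status, = db.execute("SELECT status FROM jobs WHERE part_id=1").fetchone()
print(status)  # done
```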
DC1 Production on U.S. Grid
August/September 2002
  3 week DC1 production run using GRAT
  Generated 200,000 events, using ~1,300 CPU days, 2000 files, 100 GB storage at 4 sites
December 2002
  Generated 75k SUSY and Higgs events for DC1
  Total DC1 files generated and stored >500 GB, total CPU used >1000 CPU days in 4 weeks
January 2003
  More SUSY sample
  Started pile-up production on the grid, both high and low luminosity, for 1-2 months at all sites
February/March 2003
  Discovered bug in software (non-grid part)
  Regenerating all SUSY, Higgs & pile-up samples
  ~15 TB data, 15k files, 2M events, 10k CPU days
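The quoted regeneration totals imply roughly the following per-event costs (simple arithmetic on the slide's own numbers):

```python
# Back-of-envelope per-event cost from the quoted DC1 regeneration
# totals: ~15 TB of data, 2M events, 10k CPU days.
events   = 2_000_000
cpu_days = 10_000
data_tb  = 15

cpu_sec_per_event = cpu_days * 86_400 / events  # ~432 s/event
mb_per_event      = data_tb * 1e6 / events      # ~7.5 MB/event
print(round(cpu_sec_per_event), round(mb_per_event, 1))  # 432 7.5
```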
DC1 Production Examples
Each production run requires development & deployment of new software at selected sites
[Chart: DC simulation, Aug/Sep 2002 - number of jobs per day (0-100), from 8/15 to 10/10, by site (UTA, OU, LBL)]
DC1 Production Experience
Grid paradigm works, using Globus
  Opportunistic use of existing resources, run anywhere, from anywhere, by anyone...
Successfully exercised grid middleware with increasingly complex tasks
  Simulation: create physics data from pre-defined parameters and input files, CPU intensive
  Pile-up: mix ~2500 min-bias data files into physics simulation files, data intensive
  Reconstruction: data intensive, multiple passes
  Data tracking: multiple steps, one -> many -> many more mappings
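The pile-up task above, overlaying min-bias events onto each signal event, can be illustrated schematically: each signal event gets a Poisson-distributed number of min-bias events drawn from a large pool of pre-simulated files. The pool size echoes the ~2500 files on this slide, but the mean pile-up value, file names, and function names are illustrative assumptions, not ATLAS code.

```python
import math
import random

# Schematic pile-up mixing; all names and numbers illustrative.
random.seed(1)

MIN_BIAS_POOL = [f"minbias.{i:04d}" for i in range(2500)]  # ~2500 files
MEAN_PILEUP = 23  # assumed high-luminosity mean; illustrative value

def poisson(mean):
    # Knuth's multiplication method, to stay stdlib-only.
    l, k, p = math.exp(-mean), 0, 1.0
    while p > l:
        k += 1
        p *= random.random()
    return k - 1

def mix(signal_event):
    # Overlay a Poisson number of min-bias events on one signal event.
    n = poisson(MEAN_PILEUP)
    overlay = random.sample(MIN_BIAS_POOL, n)
    return {"signal": signal_event, "pileup": overlay}

mixed = mix("higgs.0001")
print(len(mixed["pileup"]) > 0)
```

The data-intensive nature of the real task comes from the file handling this sketch elides: every overlay entry is a large pre-simulated file that must be staged to the worker node.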
Tested grid applications developed by U.S.
  For example, PACMAN (Saul Youssef - BU)
  Magda (see talk by Wensheng Deng)
  Virtual Data Catalogue (see poster by P. Nevski)
  GRAT (this talk), GRAPPA (see talk by D. Engh)
Grid Quality of Service
Anything that can go wrong, WILL go wrong
  During 18 days of grid production (in August), every system died at least once
  Local experts were not always accessible
  Examples: scheduling machines died 5 times (thrice power failure, twice system hung), network outages multiple times, gatekeeper died at every site at least 2-3 times
  Three databases used - production, Magda and virtual data. Each died at least once!
  Scheduled maintenance - HPSS, Magda server, LBNL hardware, LBNL RAID array...
  Poor cleanup, lack of fault tolerance in Globus
These outages should be expected on the grid - software design must be robust
We managed >100 files/day (~80% efficiency) in spite of these problems!
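Since every service died at least once, robust production scripts need retry logic around each grid operation. A minimal retry-with-backoff wrapper of the kind implied here might look like the following; it is a hypothetical sketch, not actual GRAT code.

```python
import time

# Minimal retry-with-exponential-backoff wrapper for flaky grid
# operations (gatekeeper submissions, DB updates, file transfers).
def with_retries(op, attempts=3, base_delay=0.01):
    for i in range(attempts):
        try:
            return op()
        except OSError:
            if i == attempts - 1:
                raise                      # give up: needs operator attention
            time.sleep(base_delay * 2**i)  # back off before retrying

# Simulate a gatekeeper that fails twice, then succeeds.
calls = {"n": 0}
def flaky_submit():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("gatekeeper down")
    return "job-42"

result = with_retries(flaky_submit)
print(result)  # job-42, after two retries
```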
Conclusion
The largest (>10 TB) grid based production in ATLAS was done by the U.S. testbed
Grid production is possible, but not easy right now - need to harden middleware, need higher level services
Many tools are missing - monitoring, operations center, data management
Requires iterative learning process, with rapid evolution of software design
Pile-up was a major data management challenge on the grid - moving >0.5 TB/day
Successful so far
Continuously learning and improving
Many more DC's coming up!