Experience with ATLAS Data Challenge Production
on the U.S. Grid Testbed
Kaushik De
University of Texas at Arlington
CHEP03
March 27, 2003
The ATLAS Experiment
• Multi-purpose experiment at the Large Hadron Collider, CERN
• 14 TeV c.m. pp collisions starting in 2007
• Physics: Higgs, SUSY, new searches...
• Petabytes/year of data analyzed by >2000 physicists worldwide - need the GRID
U.S. ATLAS Grid Testbed
BNL - U.S. Tier 1, 2000 nodes, 5% ATLAS, 10 TB
LBNL - pdsf cluster, 400 nodes, 5% ATLAS, 1 TB
Boston U. - prototype Tier 2, 64 nodes
Indiana U. - prototype Tier 2, 32 nodes
UT Arlington - 20 nodes
Oklahoma U. - 12 nodes
U. Michigan - 10 nodes
ANL - test nodes
SMU - 6 nodes
UNM - new site
U.S. Testbed Goals
Deployment
  Set up grid infrastructure and ATLAS software
  Test installation procedures (PACMAN)
Development & Testing
  Grid applications - GRAT, Grappa, Magda...
  Other software - monitoring, packaging...
Run Production
  For U.S. physics data analysis and tests
  Main focus - ATLAS Data Challenges: simulation, pile-up, reconstruction
Connection to GRID projects
  GriPhyN - Globus, Condor, Chimera... use & test
  iVDGL - VDT, glue schema testbed, WorldGrid testbed, demos... use & test
  EDG, LCG... testing & deployment
ATLAS Data Challenges
DC's - Generate and analyse simulated data (see talk by Gilbert Poulard on Tuesday)
Original Goals (Nov 15, 2001)
  Test the computing model, its software, its data model, and ensure the correctness of the technical choices to be made
  Data Challenges should be executed at the prototype Tier centres
  Data Challenges will be used as input for a Computing Technical Design Report due by the end of 2003 (?) and for preparing a MoU
Current Status
  Goals are evolving as we gain experience
  Sequence of increasing scale & complexity
  DC0 (completed), DC1 (underway)
  DC2, DC3, and DC4 planned
  Grid deployment and testing a major part of DC's
GRAT Software
GRid Applications Toolkit
Used for U.S. Data Challenge production
Based on Globus, Magda & MySQL
Shell & Python scripts, modular design
Rapid development platform
  Quickly develop packages as needed by DC
  Single particle production
  Higgs & SUSY production
  Pile-up production & data management
  Reconstruction
Test grid middleware, test grid performance
Modules can be easily enhanced or replaced by Condor-G, EDG resource broker, Chimera, replica catalogue, OGSA... (in progress)
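The pluggable design described above can be sketched as a thin submission layer where swapping Globus for Condor-G means changing one table entry rather than rewriting the production scripts. All names below (`submit_globus`, `BACKENDS`, the fake job handles) are illustrative, not actual GRAT code.

```python
# Illustrative sketch of a pluggable job-submission layer in the
# spirit of GRAT's modular design. Hypothetical names throughout.

def submit_globus(site, script):
    # Real code would shell out to the Globus job-submission tools;
    # here we just return a fake job handle for illustration.
    return f"globus://{site}/{script}"

def submit_condor_g(site, script):
    # Alternative backend: Condor-G submission (stubbed).
    return f"condorg://{site}/{script}"

# Swapping middleware = changing one dictionary entry.
BACKENDS = {"globus": submit_globus, "condor-g": submit_condor_g}

def run_job(backend, site, script):
    return BACKENDS[backend](site, script)

print(run_job("globus", "atlas.bnl.gov", "dc1_sim.sh"))
# -> globus://atlas.bnl.gov/dc1_sim.sh
```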
GRAT Execution Model
1. Resource Discovery
2. Partition Selection
3. Job Creation
4. Pre-stage
5. Batch Submission
6. Job Parameterization
7. Simulation
8. Post-stage
9. Cataloging
10. Monitoring

[Diagram: the steps above flow between the DC1 production database (UTA), a remote gatekeeper, a local replica catalogue, Magda (BNL), the parameter database (CERN), batch execution, and scratch space.]
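The ten steps can be read as a sequential driver in which each stage updates a shared job context and hands it to the next. The sketch below is a hypothetical rendering of that control flow, not GRAT itself; real stages would talk to the production database, Globus gatekeepers, and Magda.

```python
# Hedged sketch of the GRAT execution model as a sequential pipeline.
# Each stage is a stand-in function; values are illustrative only.

def discover_resources(ctx):  ctx["site"] = "lbnl-pdsf"; return ctx
def select_partition(ctx):    ctx["partition"] = 42; return ctx
def create_job(ctx):          ctx["job"] = f"dc1.{ctx['partition']}"; return ctx
def pre_stage(ctx):           ctx["staged_in"] = True; return ctx
def submit_batch(ctx):        ctx["submitted"] = True; return ctx
def parameterize(ctx):        ctx["seed"] = 12345; return ctx
def simulate(ctx):            ctx["output"] = ctx["job"] + ".zebra"; return ctx
def post_stage(ctx):          ctx["staged_out"] = True; return ctx
def catalog(ctx):             ctx["cataloged"] = True; return ctx
def monitor(ctx):             ctx["status"] = "done"; return ctx

PIPELINE = [discover_resources, select_partition, create_job, pre_stage,
            submit_batch, parameterize, simulate, post_stage, catalog, monitor]

ctx = {}
for stage in PIPELINE:
    ctx = stage(ctx)
print(ctx["output"], ctx["status"])  # dc1.42.zebra done
```

Because each stage only depends on the context dictionary, any single stage can be replaced (e.g. batch submission via Condor-G) without touching the others, which matches the modularity claim on the previous slide.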
Middleware Evolution of U.S. Applications
[Table: middleware components grouped by maturity; the component names appear only in the original figure]
- Used in current production software (GRAT & Grappa)
- Tested successfully (not yet used for large-scale production)
- Under development and testing
- Tested for simulation (will be used for large-scale reconstruction)
Databases used in GRAT
MySQL databases central to GRAT
Production database
  define logical job parameters & filenames
  track job status, updated periodically by scripts
Data management (Magda)
  file registration/catalogue
  grid based file transfers
Virtual Data Catalogue
  simulation job definition
  job parameters, random numbers
Metadata catalogue (AMI)
  post-production summary information
  data provenance
Similar scheme being considered ATLAS-wide by the Grid Technical Board
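A minimal sketch of the production-database pattern above (logical job definition up front, status updated periodically by scripts) can be written against sqlite3 as a stand-in for the MySQL server; the table and column names are invented for illustration, not the actual GRAT schema.

```python
import sqlite3

# Stand-in for GRAT's MySQL production database: define logical jobs,
# then let production scripts update status as partitions run.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE jobs (
    part_id INTEGER PRIMARY KEY,  -- partition number
    lfn     TEXT,                 -- logical filename, registered in Magda
    site    TEXT,
    status  TEXT)""")

# Job creation: logical parameters & filenames defined up front.
db.execute("INSERT INTO jobs VALUES (1, 'dc1.simul.0001.zebra', 'UTA', 'defined')")

# A script periodically updates status as the job progresses.
db.execute("UPDATE jobs SET status='done' WHERE part_id=1")

status, = db.execute("SELECT status FROM jobs WHERE part_id=1").fetchone()
print(status)  # done
```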
DC1 Production on U.S. Grid
August/September 2002
  3 week DC1 production run using GRAT
  Generated 200,000 events, using ~1,300 CPU days, 2000 files, 100 GB storage at 4 sites
December 2002
  Generated 75k SUSY and Higgs events for DC1
  Total DC1 files generated and stored >500 GB, total CPU used >1000 CPU days in 4 weeks
January 2003
  More SUSY sample
  Started pile-up production on the grid, both high and low luminosity, for 1-2 months at all sites
February/March 2003
  Discovered bug in software (non-grid part)
  Regenerating all SUSY, Higgs & pile-up samples
  ~15 TB data, 15k files, 2M events, 10k CPU days
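The quoted regeneration totals imply roughly the following per-event costs (simple arithmetic on the slide's own numbers):

```python
# Back-of-envelope per-event cost from the quoted DC1 regeneration
# totals: ~15 TB of data, 2M events, 10k CPU days.
events   = 2_000_000
cpu_days = 10_000
data_tb  = 15

cpu_sec_per_event = cpu_days * 86_400 / events  # ~432 s/event
mb_per_event      = data_tb * 1e6 / events      # ~7.5 MB/event
print(round(cpu_sec_per_event), round(mb_per_event, 1))  # 432 7.5
```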
DC1 Production Examples
Each production run requires development & deployment of new software at selected sites
[Chart: DC simulation, Aug/Sep 2002 - number of jobs per day (0-100), from 8/15 to 10/10, by site (UTA, OU, LBL)]
DC1 Production Experience
Grid paradigm works, using Globus
  Opportunistic use of existing resources, run anywhere, from anywhere, by anyone...
Successfully exercised grid middleware with increasingly complex tasks
  Simulation: create physics data from pre-defined parameters and input files, CPU intensive
  Pile-up: mix ~2500 min-bias data files into physics simulation files, data intensive
  Reconstruction: data intensive, multiple passes
  Data tracking: multiple steps, one -> many -> many more mappings
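The pile-up task above, overlaying min-bias events onto each signal event, can be illustrated schematically: each signal event gets a Poisson-distributed number of min-bias events drawn from a large pool of pre-simulated files. The pool size echoes the ~2500 files on this slide, but the mean pile-up value, file names, and function names are illustrative assumptions, not ATLAS code.

```python
import math
import random

# Schematic pile-up mixing; all names and numbers illustrative.
random.seed(1)

MIN_BIAS_POOL = [f"minbias.{i:04d}" for i in range(2500)]  # ~2500 files
MEAN_PILEUP = 23  # assumed high-luminosity mean; illustrative value

def poisson(mean):
    # Knuth's multiplication method, to stay stdlib-only.
    l, k, p = math.exp(-mean), 0, 1.0
    while p > l:
        k += 1
        p *= random.random()
    return k - 1

def mix(signal_event):
    # Overlay a Poisson number of min-bias events on one signal event.
    n = poisson(MEAN_PILEUP)
    overlay = random.sample(MIN_BIAS_POOL, n)
    return {"signal": signal_event, "pileup": overlay}

mixed = mix("higgs.0001")
print(len(mixed["pileup"]) > 0)
```

The data-intensive nature of the real task comes from the file handling this sketch elides: every overlay entry is a large pre-simulated file that must be staged to the worker node.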
Tested grid applications developed by U.S.
  For example, PACMAN (Saul Youssef - BU)
  Magda (see talk by Wensheng Deng)
  Virtual Data Catalogue (see poster by P. Nevski)
  GRAT (this talk), GRAPPA (see talk by D. Engh)
Grid Quality of Service
Anything that can go wrong, WILL go wrong
  During 18 days of grid production (in August), every system died at least once
  Local experts were not always accessible
  Examples: scheduling machines died 5 times (thrice power failure, twice system hung), network outages multiple times, gatekeeper died at every site at least 2-3 times
  Three databases used - production, Magda and virtual data. Each died at least once!
  Scheduled maintenance - HPSS, Magda server, LBNL hardware, LBNL RAID array...
  Poor cleanup, lack of fault tolerance in Globus
These outages should be expected on the grid - software design must be robust
We managed >100 files/day (~80% efficiency) in spite of these problems!
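Since every service died at least once, robust production scripts need retry logic around each grid operation. A minimal retry-with-backoff wrapper of the kind implied here might look like the following; it is a hypothetical sketch, not actual GRAT code.

```python
import time

# Minimal retry-with-exponential-backoff wrapper for flaky grid
# operations (gatekeeper submissions, DB updates, file transfers).
def with_retries(op, attempts=3, base_delay=0.01):
    for i in range(attempts):
        try:
            return op()
        except OSError:
            if i == attempts - 1:
                raise                      # give up: needs operator attention
            time.sleep(base_delay * 2**i)  # back off before retrying

# Simulate a gatekeeper that fails twice, then succeeds.
calls = {"n": 0}
def flaky_submit():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("gatekeeper down")
    return "job-42"

result = with_retries(flaky_submit)
print(result)  # job-42, after two retries
```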
Conclusion
The largest (>10 TB) grid based production in ATLAS was done by the U.S. testbed
Grid production is possible, but not easy right now - need to harden middleware, need higher level services
Many tools are missing - monitoring, operations center, data management
Requires iterative learning process, with rapid evolution of software design
Pile-up was a major data management challenge on the grid - moving >0.5 TB/day
Successful so far
Continuously learning and improving
Many more DC's coming up!