ATLAS Data Challenge 2: A Massive Monte Carlo Production on the Grid
Santiago González de la Hoz ([email protected])
on behalf of the ATLAS DC2 Collaboration
IFIC - Instituto de Física Corpuscular, CSIC-Universitat de València, Spain
EGC 2005, Amsterdam, 14 February 2005
Overview
- Introduction
  - The ATLAS experiment
  - The Data Challenge programme
- The ATLAS production system
- DC2 production phases
- The 3 Grid flavours (LCG, Grid3 and NorduGrid)
- ATLAS DC2 production
- Distributed analysis system
- Conclusions
Introduction: LHC/CERN
[Aerial photo: the LHC ring at CERN near Geneva, with Mont Blanc (4810 m) in the background.]
The challenge of LHC computing
- Storage: raw recording rate of 0.1-1 GB/s, accumulating 5-8 PB/year; about 10 PB of disk.
- Processing: the equivalent of 200,000 of today's fastest PCs.
Introduction: ATLAS
- A detector for the study of high-energy proton-proton collisions.
- The offline computing will have to deal with an output event rate of 100 Hz, i.e. 10^9 events per year with an average event size of 1 MB.
- Researchers are spread all over the world.
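The data-volume figures above follow from a simple multiplication; the ~10^7 seconds of live beam time per year assumed below is a standard LHC rule of thumb, not stated on the slide:

```python
# Back-of-the-envelope check of the ATLAS offline numbers quoted above.
event_rate_hz = 100          # offline event rate after the trigger
live_seconds_per_year = 1e7  # assumed LHC live time per year (rule of thumb)
event_size_mb = 1.0          # average event size

events_per_year = event_rate_hz * live_seconds_per_year
data_pb_per_year = events_per_year * event_size_mb / 1e9  # MB -> PB

print(events_per_year)   # 1e9 events, matching the 10^9 quoted above
print(data_pb_per_year)  # ~1 PB/year of raw event data
```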
Introduction: Data Challenges
Scope and goals:
- In 2002 ATLAS computing planned a first series of Data Challenges (DCs) in order to validate its:
  - Computing Model
  - Software
  - Data Model
- The major features of DC1 were:
  - the development and deployment of the software required for the production of large event samples;
  - the production of those samples, involving institutions worldwide.
- The ATLAS collaboration decided to perform DC2 (and, in the future, DC3) using the Grid middleware developed in several Grid projects (the "Grid flavours"):
  - the LHC Computing Grid project (LCG), to which CERN is committed
  - Grid3
  - NorduGrid
ATLAS production system
In order to handle the task of ATLAS DC2, an automated production system was designed. It consists of four components:
- The production database, which contains abstract job definitions.
- The Windmill supervisor, which reads the production database for job definitions and presents them to the different Grid executors in an easy-to-parse XML format.
- The executors, one for each Grid flavour, which receive the job definitions in XML format and convert them to the job description language of that particular Grid.
- Don Quijote, the ATLAS Data Management System, which moves files from their temporary output locations to their final destination on some Storage Element and registers the files in the Replica Location Service of that Grid.
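The supervisor-to-executor hand-off can be pictured as below. This is only an illustrative sketch: the element names, field names and JDL attributes are invented for the example and are not Windmill's actual schema.

```python
import xml.etree.ElementTree as ET

def job_to_xml(job_id, transformation, params):
    """Supervisor side: wrap an abstract job definition in easy-to-parse XML."""
    root = ET.Element("job", id=str(job_id))
    ET.SubElement(root, "transformation").text = transformation
    for name, value in params.items():
        ET.SubElement(root, "param", name=name).text = str(value)
    return ET.tostring(root, encoding="unicode")

def xml_to_lcg_jdl(xml_job):
    """Executor side: translate the XML job into the target Grid's own job
    description language (here a JDL-like text, as an LCG executor might)."""
    root = ET.fromstring(xml_job)
    args = " ".join(p.text for p in root.findall("param"))
    return (f'Executable = "{root.findtext("transformation")}";\n'
            f'Arguments = "{args}";')

xml_job = job_to_xml(4062, "atlas-g4sim.sh", {"events": 50, "seed": 12345})
print(xml_to_lcg_jdl(xml_job))
```

The point of the design is that only the executor layer knows each Grid's job language; the supervisor and database stay flavour-neutral.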
DC2 production phases
[Diagram: task flow for DC2 data. Physics events are generated with Pythia (Events, HepMC), passed through Geant4 detector simulation (Hits, MCTruth), digitized with and without pile-up (Digits/RDO, MCTruth), mixed and converted to byte-stream raw data, and finally reconstructed into ESD. Approximate data volumes for 10^7 events across the successive stages: ~5 TB, 20 TB, 30 TB, 20 TB and 5 TB. Persistency: Athena-POOL.]
DC2 production phases

Process                    | No. of events | Event size (MB) | CPU power (kSI2k-s) | Volume of data (TB)
Event generation           | 10^7          | 0.06            | 156                 |
Simulation                 | 10^7          | 1.9             | 504                 | 30
Pile-up/Digitization       | 10^7          | 3.3/1.9         | ~144/16             | ~35
Event mixing & byte-stream | 10^7          | 2.0             | ~5.4                | ~20

- ATLAS DC2 started in July 2004 and finished the simulation part at the end of September 2004.
- 10 million events (100,000 jobs) were generated and simulated using the three Grid flavours.
- The Grid technologies have provided the tools to generate large Monte Carlo simulation samples.
- The digitization and pile-up part was completed in December; the pile-up was done on a sub-sample of 2 million events.
- The event mixing and byte-stream production are ongoing.
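Assuming a 30-day month, the per-event simulation cost in the table is consistent with the total CPU usage reported at the end of the talk (~1470 kSI2k-months for ~7.94 million fully simulated events):

```python
# Cross-check: simulation CPU cost vs. the reported total of ~1470 kSI2k-months.
events = 7.94e6            # fully simulated events (from the summary slide)
sim_cost_ksi2k_s = 504     # Geant4 simulation cost per event (table above)
seconds_per_month = 30 * 24 * 3600  # assuming a 30-day month

total_ksi2k_months = events * sim_cost_ksi2k_s / seconds_per_month
print(round(total_ksi2k_months))  # ~1544, within ~5% of the reported ~1470
```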
The 3 Grid flavours
- LCG (http://lcg.web.cern.ch/LCG/)
  - The job of the LHC Computing Grid project (LCG) is to prepare the computing infrastructure for the simulation, processing and analysis of LHC data for all four LHC collaborations. This includes both the common infrastructure of libraries, tools and frameworks required to support the physics application software, and the development and deployment of the computing services needed to store and process the data, providing batch and interactive facilities for the worldwide community of physicists involved in the LHC.
- NorduGrid (http://www.nordugrid.org/)
  - The aim of the NorduGrid collaboration is to deliver a robust, scalable, portable and fully featured solution for a global computational and data Grid system. NorduGrid develops and deploys a set of tools and services, the so-called ARC middleware, which is free software.
- Grid3 (http://www.ivdgl.org/grid2003/)
  - The Grid3 collaboration has deployed an international Data Grid with dozens of sites and thousands of processors. The facility is operated jointly by the U.S. Grid projects iVDGL, GriPhyN and PPDG, and the U.S. participants in the LHC experiments ATLAS and CMS.
- Both Grid3 and NorduGrid take similar approaches, using the same foundations (Globus) as LCG but with slightly different middleware.
The 3 Grid flavours: LCG
- 82 sites in 22 countries (this number is evolving very fast); 6558 TB of storage; ~7269 (shared) CPUs.
- This infrastructure has been operating since 2003.
- The resources used (computational and storage) are installed at a large number of Regional Computing Centres, interconnected by fast networks.
The 3 Grid flavours: NorduGrid
- 11 countries, 40+ sites, ~4000 CPUs, ~30 TB of storage.
- NorduGrid is a research collaboration established mainly across the Nordic countries, but it includes sites from other countries as well.
- They contributed a significant part of DC1 (using the Grid, in 2002).
- It supports production on non-RedHat 7.3 platforms.
The 3 Grid flavours: Grid3
- Sep 04: 30 sites, multi-VO, shared resources, ~3000 (shared) CPUs.
- The deployed infrastructure has been in operation since November 2003.
- At this moment it is running 3 HEP and 2 biology applications.
- Over 100 users are authorized to run in Grid3.
ATLAS DC2 production on: LCG, Grid3 and NorduGrid
[Chart: number of validated Geant4-simulation jobs per day, from late June to mid-September 2004, shown separately for LCG, NorduGrid and Grid3 and as a total.]
Typical job distribution on: LCG, Grid3 and NorduGrid
[Pie charts: share of jobs per site on each Grid flavour.
- LCG sites: at.uibk, ca.triumf, ca.ualberta, ca.umontreal, ca.utoronto, ch.cern, cz.golias, cz.skurut, de.fzk, es.ifae, es.ific, es.uam, fr.in2p3, it.infn.cnaf, it.infn.lnf, it.infn.lnl, it.infn.mi, it.infn.na, it.infn.roma, it.infn.to, nl.nikhef, pl.zeus, tw.sinica, uk.bham, uk.ic, uk.lancs, uk.man, uk.ral, uk.shef, uk.ucl.
- Grid3 sites: BNL_ATLAS, BNL_ATLAS_BAK, BU_ATLAS_Tier2, CalTech_PG, FNAL_CMS, IU_ATLAS_Tier2, PDSF, Rice_Grid3, SMU_Physics_Cluster, UBuffalo_CCR, UCSanDiego_PG, UC_ATLAS_Tier2, UFlorida_PG, UM_ATLAS, UNM_HPC, UTA_dpcc, UWMadison.
- NorduGrid sites: SWEGRID, brenta.ijs.si, benedict.aau.dk, hypatia.uio.no, farm.hep.lu.se, fire.ii.uib.no, fe10.dcsc.sdu.dk, lxsrv9.lrz-muenchen.de, atlas.hpc.unimelb.edu.au, grid.uio.no, lheppc10.unibe.ch, morpheus.dcgc.dk, genghis.hpc.unimelb.edu.au, charm.hpc.unimelb.edu.au, lscf.nbi.dk, atlas.fzk.de, grid.fi.uib.no.]
[Pie chart: ATLAS DC2 CPU usage — LCG 41%, Grid3 30%, NorduGrid 29%.]
Distributed Analysis system: ADA
- The physicists want to use the Grid to perform the analysis of the data too.
- The ADA (ATLAS Distributed Analysis) project aims at putting together all the software components needed to facilitate end-user analysis.
[Diagram: the ADA architecture — GUI and command-line clients (ROOT, Python); high-level services for cataloguing, job submission and monitoring (AMI DB, DIAL, ATPROD, ARDA); workload management systems (LSF, Condor, gLite WMS); the components communicate via AJDL, the job description language.]
- DIAL: defines the job components (dataset, task, application, etc.). Together with LSF or Condor it provides "interactivity" (a low response time).
- ATPROD: the production system, to be used for low-scale mass production.
- ARDA: an analysis system to be interfaced to the EGEE middleware.
Lessons learned from DC2
Main problems:
- The production system was still under development during the DC2 phase.
- The beta status of the Grid services caused trouble while the system was in operation; for example, the Globus RLS, the Resource Broker and the information system were unstable in the initial phase.
- Especially on LCG, the lack of a uniform monitoring system.
- Site misconfiguration and site-stability problems.
Main achievements:
- An automatic production system making use of the Grid infrastructure.
- 6 TB (out of 30 TB) of data were moved among the different Grid flavours using the Don Quijote servers.
- 235,000 jobs were submitted by the production system.
- 250,000 logical files were produced, with 2500-3500 jobs per day distributed over the three Grid flavours.
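As a sanity check, the quoted sustained rate and job total imply a production window of roughly two to three months of full-rate running, which fits the DC2 timeline:

```python
# Consistency check of the DC2 throughput numbers quoted above.
total_jobs = 235_000
rate_low, rate_high = 2500, 3500  # jobs/day over the three Grid flavours

days_at_high_rate = total_jobs / rate_high  # ~67 days
days_at_low_rate = total_jobs / rate_low    # 94 days
print(round(days_at_high_rate), round(days_at_low_rate))  # roughly 67-94 days
```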
Conclusions
- The generation and simulation of events for ATLAS DC2 have been completed using three flavours of Grid technology.
- They have been proven usable in a coherent way for a real production, and this is a major achievement.
- This exercise has taught us that all the elements involved (Grid middleware, production system, deployment and monitoring tools) need improvements.
- Between the start of DC2 in July 2004 and the end of September 2004 (corresponding to the G4-simulation phase), the automatic production system submitted 235,000 jobs, which consumed ~1.5 million SI2k-months of CPU and produced more than 30 TB of physics data.
- ATLAS is also pursuing a model for distributed analysis which would improve the productivity of end users by profiting from the available Grid resources.
Supervisor-Executors
[Diagram: the Windmill supervisors communicate with the executors over a Jabber pathway, using the messages numJobsWanted, executeJobs, getExecutorData, getStatus, fixJob and killJob. The supervisors read job definitions from the Prod DB (jobs database); the executors (1. lexor, 2. dulcinea, 3. capone, 4. legacy) submit the jobs to the execution sites (grid); Don Quijote acts as the file catalog.]
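The message verbs in the diagram can be pictured as a simple dispatch interface. The sketch below is purely illustrative: the real protocol runs over Jabber, and the handler bodies here are invented placeholders, not real executor code — only the verb names come from the slide.

```python
# Toy model of an executor answering the Windmill supervisor's message verbs.
class ToyExecutor:
    def __init__(self):
        self.jobs = {}  # job id -> state

    def numJobsWanted(self):
        # How many new jobs this executor is willing to accept right now.
        return 10

    def executeJobs(self, job_ids):
        for jid in job_ids:
            self.jobs[jid] = "running"

    def getStatus(self, job_id):
        return self.jobs.get(job_id, "unknown")

    def killJob(self, job_id):
        self.jobs[job_id] = "killed"

ex = ToyExecutor()
ex.executeJobs([4062, 4063])
print(ex.getStatus(4062))  # running
ex.killJob(4063)
print(ex.getStatus(4063))  # killed
```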
NorduGrid: ARC features
- ARC is based on the Globus Toolkit, with core services replaced.
  - It currently uses Globus Toolkit 2.
- Alternative/extended Grid services:
  - a Grid Manager that checks user credentials and authorization, handles jobs locally on clusters (interfacing to the LRMS), and does stage-in and stage-out of files;
  - a lightweight User Interface with a built-in resource broker;
  - an Information System based on MDS with a NorduGrid schema;
  - the xRSL job description language (extended Globus RSL);
  - a Grid Monitor.
- Simple, stable and non-invasive.
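For illustration, an xRSL job description is a set of attribute=value clauses joined by a leading `&`. The sketch below is an invented example: the script name, file names, URLs and resource limits are hypothetical and not taken from DC2.

```
&(executable="atlas-g4sim.sh")
 (arguments="4062")
 (inputFiles=("evgen.root" "gsiftp://se.example.org/dc2/evgen.root"))
 (outputFiles=("hits.root" "rls://rls.nordugrid.org/dc2/hits.root"))
 (cpuTime="1440")
 (memory="1000")
 (stdout="sim.out")
 (stderr="sim.err")
```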
LCG software
- LCG-2 core packages:
  - VDT (Globus2, Condor)
  - EDG WP1 (Resource Broker, job submission tools)
  - EDG WP2 (Replica Management tools) + lcg tools: one central RMC and LRC for each VO, located at CERN, with an Oracle backend
  - several bits from other WPs (Config objects, InfoProviders, Packaging, ...)
  - GLUE 1.1 (information schema) + a few essential LCG extensions
  - an MDS-based Information System with significant LCG enhancements (replacements, simplified; see poster)
  - a mechanism for application (experiment) software distribution
- Almost all components have gone through some re-engineering for robustness, scalability, efficiency and adaptation to local fabrics.
- The services are now quite stable, and the performance and scalability have been significantly improved (within the limits of the current architecture).
Grid3 software
- A Grid environment built from core Globus and Condor middleware, as delivered through the Virtual Data Toolkit (VDT): GRAM, GridFTP, MDS, RLS, VDS ...
- ... equipped with VO and multi-VO security, monitoring and operations services ...
- ... allowing federation with other Grids where possible, e.g. the CERN LHC Computing Grid (LCG):
  - US ATLAS: GriPhyN VDS execution on LCG sites
  - US CMS: storage-element interoperability (SRM/dCache)
- Delivering the US LHC Data Challenges.
ATLAS DC2 (CPU)
[Pie chart: CPU usage — LCG 41%, Grid3 30%, NorduGrid 29%.]
Total: ~1470 kSI2k-months, ~100,000 jobs, ~7.94 million fully simulated events, ~30 TB of data.
Typical job distribution on LCG
[Pie chart: share of jobs per LCG site — at.uibk, ca.triumf, ca.ualberta, ca.umontreal, ca.utoronto, ch.cern, cz.golias, cz.skurut, de.fzk, es.ifae, es.ific, es.uam, fr.in2p3, it.infn.cnaf, it.infn.lnl, it.infn.mi, it.infn.na, it.infn.roma, it.infn.to, it.infn.lnf, jp.icepp, nl.nikhef, pl.zeus, ru.msu, tw.sinica, uk.bham, uk.ic, uk.lancs, uk.man, uk.rl, uk.shef, uk.ucl.]
Typical job distribution on Grid3
[Pie chart: share of jobs per Grid3 site — ANL_HEP, BNL_ATLAS, BNL_ATLAS_BAK, BU_ATLAS_Tier2, CalTech_PG, FNAL_CMS, FNAL_CMS2, IU_ATLAS_Tier2, OU_OSCER, PDSF, Rice_Grid3, SMU_Physics_Cluster, UBuffalo_CCR, UCSanDiego_PG, UC_ATLAS_Tier2, UFlorida_PG, UM_ATLAS, UNM_HPC, UTA_dpcc, UWMadison.]
Jobs distribution on NorduGrid (September 19)
[Pie chart: share of jobs per NorduGrid site, with SWEGRID alone at ~45% — SWEGRID, brenta.ijs.si, benedict.aau.dk, hypatia.uio.no, fe10.dcsc.sdu.dk, farm.hep.lu.se, fire.ii.uib.no, grid.uio.no, morpheus.dcgc.dk, atlas.hpc.unimelb.edu.au, lxsrv9.lrz-muenchen.de, atlas.fzk.de, lheppc10.unibe.ch, genghis.hpc.unimelb.edu.au, charm.hpc.unimelb.edu.au, lscf.nbi.dk, grid.fi.uib.no.]