ATLAS Data Challenge 2: A Massive Monte Carlo Production on the Grid
Santiago González de la Hoz ([email protected])
on behalf of the ATLAS DC2 Collaboration
IFIC - Instituto de Física Corpuscular, CSIC-Universitat de València, Spain
EGC 2005, Amsterdam, 14 February 2005
Overview
- Introduction
  - The ATLAS experiment
  - The Data Challenge programme
- The ATLAS production system
- DC2 production phases
- The 3 Grid flavours (LCG, Grid3 and NorduGrid)
- ATLAS DC2 production
- Distributed analysis system
- Conclusions
Introduction: LHC/CERN
[Aerial photo: the LHC ring at CERN near Geneva, with Mont Blanc (4810 m) in the background.]
The challenge of LHC computing
- Storage: raw recording rate of 0.1-1 GB/s, accumulating 5-8 PB/year; about 10 PB of disk.
- Processing: the equivalent of 200,000 of today's fastest PCs.
Introduction: ATLAS
- A detector for the study of high-energy proton-proton collisions.
- The offline computing will have to deal with an output event rate of 100 Hz, i.e. 10^9 events per year with an average event size of 1 MB.
- Researchers are spread all over the world.
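The data-volume figures above follow from a simple multiplication; the ~10^7 seconds of live beam time per year assumed below is a standard LHC rule of thumb, not stated on the slide:

```python
# Back-of-the-envelope check of the ATLAS offline numbers quoted above.
event_rate_hz = 100          # offline event rate after the trigger
live_seconds_per_year = 1e7  # assumed LHC live time per year (rule of thumb)
event_size_mb = 1.0          # average event size

events_per_year = event_rate_hz * live_seconds_per_year
data_pb_per_year = events_per_year * event_size_mb / 1e9  # MB -> PB

print(events_per_year)   # 1e9 events, matching the 10^9 quoted above
print(data_pb_per_year)  # ~1 PB/year of raw event data
```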
Introduction: Data Challenges
Scope and goals:
- In 2002 ATLAS computing planned a first series of Data Challenges (DCs) in order to validate its:
  - Computing Model
  - Software
  - Data Model
- The major features of DC1 were:
  - the development and deployment of the software required for the production of large event samples;
  - the production of those samples, involving institutions worldwide.
- The ATLAS collaboration decided to perform DC2 (and, in the future, DC3) using the Grid middleware developed in several Grid projects (the "Grid flavours"):
  - the LHC Computing Grid project (LCG), to which CERN is committed
  - Grid3
  - NorduGrid
ATLAS production system
In order to handle the task of ATLAS DC2, an automated production system was designed. It consists of four components:
- The production database, which contains abstract job definitions.
- The Windmill supervisor, which reads the production database for job definitions and presents them to the different Grid executors in an easy-to-parse XML format.
- The executors, one for each Grid flavour, which receive the job definitions in XML format and convert them to the job description language of that particular Grid.
- Don Quijote, the ATLAS Data Management System, which moves files from their temporary output locations to their final destination on some Storage Element and registers the files in the Replica Location Service of that Grid.
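The supervisor-to-executor hand-off can be pictured as below. This is only an illustrative sketch: the element names, field names and JDL attributes are invented for the example and are not Windmill's actual schema.

```python
import xml.etree.ElementTree as ET

def job_to_xml(job_id, transformation, params):
    """Supervisor side: wrap an abstract job definition in easy-to-parse XML."""
    root = ET.Element("job", id=str(job_id))
    ET.SubElement(root, "transformation").text = transformation
    for name, value in params.items():
        ET.SubElement(root, "param", name=name).text = str(value)
    return ET.tostring(root, encoding="unicode")

def xml_to_lcg_jdl(xml_job):
    """Executor side: translate the XML job into the target Grid's own job
    description language (here a JDL-like text, as an LCG executor might)."""
    root = ET.fromstring(xml_job)
    args = " ".join(p.text for p in root.findall("param"))
    return (f'Executable = "{root.findtext("transformation")}";\n'
            f'Arguments = "{args}";')

xml_job = job_to_xml(4062, "atlas-g4sim.sh", {"events": 50, "seed": 12345})
print(xml_to_lcg_jdl(xml_job))
```

The point of the design is that only the executor layer knows each Grid's job language; the supervisor and database stay flavour-neutral.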
DC2 production phases
[Diagram: task flow for DC2 data. Physics events are generated with Pythia (Events, HepMC), passed through Geant4 detector simulation (Hits, MCTruth), digitized with and without pile-up (Digits/RDO, MCTruth), mixed and converted to byte-stream raw data, and finally reconstructed into ESD. Approximate data volumes for 10^7 events across the successive stages: ~5 TB, 20 TB, 30 TB, 20 TB and 5 TB. Persistency: Athena-POOL.]
DC2 production phases

Process                    | No. of events | Event size (MB) | CPU power (kSI2k-s) | Volume of data (TB)
Event generation           | 10^7          | 0.06            | 156                 |
Simulation                 | 10^7          | 1.9             | 504                 | 30
Pile-up/Digitization       | 10^7          | 3.3/1.9         | ~144/16             | ~35
Event mixing & byte-stream | 10^7          | 2.0             | ~5.4                | ~20

- ATLAS DC2 started in July 2004 and finished the simulation part at the end of September 2004.
- 10 million events (100,000 jobs) were generated and simulated using the three Grid flavours.
- The Grid technologies have provided the tools to generate large Monte Carlo simulation samples.
- The digitization and pile-up part was completed in December; the pile-up was done on a sub-sample of 2 million events.
- The event mixing and byte-stream production are ongoing.
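Assuming a 30-day month, the per-event simulation cost in the table is consistent with the total CPU usage reported at the end of the talk (~1470 kSI2k-months for ~7.94 million fully simulated events):

```python
# Cross-check: simulation CPU cost vs. the reported total of ~1470 kSI2k-months.
events = 7.94e6            # fully simulated events (from the summary slide)
sim_cost_ksi2k_s = 504     # Geant4 simulation cost per event (table above)
seconds_per_month = 30 * 24 * 3600  # assuming a 30-day month

total_ksi2k_months = events * sim_cost_ksi2k_s / seconds_per_month
print(round(total_ksi2k_months))  # ~1544, within ~5% of the reported ~1470
```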
The 3 Grid flavours
- LCG (http://lcg.web.cern.ch/LCG/)
  - The job of the LHC Computing Grid project (LCG) is to prepare the computing infrastructure for the simulation, processing and analysis of LHC data for all four LHC collaborations. This includes both the common infrastructure of libraries, tools and frameworks required to support the physics application software, and the development and deployment of the computing services needed to store and process the data, providing batch and interactive facilities for the worldwide community of physicists involved in the LHC.
- NorduGrid (http://www.nordugrid.org/)
  - The aim of the NorduGrid collaboration is to deliver a robust, scalable, portable and fully featured solution for a global computational and data Grid system. NorduGrid develops and deploys a set of tools and services, the so-called ARC middleware, which is free software.
- Grid3 (http://www.ivdgl.org/grid2003/)
  - The Grid3 collaboration has deployed an international Data Grid with dozens of sites and thousands of processors. The facility is operated jointly by the U.S. Grid projects iVDGL, GriPhyN and PPDG, and the U.S. participants in the LHC experiments ATLAS and CMS.
- Both Grid3 and NorduGrid take similar approaches, using the same foundations (Globus) as LCG but with slightly different middleware.
The 3 Grid flavours: LCG
- 82 sites in 22 countries (this number is evolving very fast); 6558 TB of storage; ~7269 (shared) CPUs.
- This infrastructure has been operating since 2003.
- The resources used (computational and storage) are installed at a large number of Regional Computing Centres, interconnected by fast networks.
The 3 Grid flavours: NorduGrid
- 11 countries, 40+ sites, ~4000 CPUs, ~30 TB of storage.
- NorduGrid is a research collaboration established mainly across the Nordic countries, but it includes sites from other countries as well.
- They contributed a significant part of DC1 (using the Grid, in 2002).
- It supports production on non-RedHat 7.3 platforms.
The 3 Grid flavours: Grid3
- Sep 04: 30 sites, multi-VO, shared resources, ~3000 (shared) CPUs.
- The deployed infrastructure has been in operation since November 2003.
- At this moment it is running 3 HEP and 2 biology applications.
- Over 100 users are authorized to run in Grid3.
ATLAS DC2 production on: LCG, Grid3 and NorduGrid
[Chart: number of validated Geant4-simulation jobs per day, from late June to mid-September 2004, shown separately for LCG, NorduGrid and Grid3 and as a total.]
Typical job distribution on: LCG, Grid3 and NorduGrid
[Pie charts: share of jobs per site on each Grid flavour.
- LCG sites: at.uibk, ca.triumf, ca.ualberta, ca.umontreal, ca.utoronto, ch.cern, cz.golias, cz.skurut, de.fzk, es.ifae, es.ific, es.uam, fr.in2p3, it.infn.cnaf, it.infn.lnf, it.infn.lnl, it.infn.mi, it.infn.na, it.infn.roma, it.infn.to, nl.nikhef, pl.zeus, tw.sinica, uk.bham, uk.ic, uk.lancs, uk.man, uk.ral, uk.shef, uk.ucl.
- Grid3 sites: BNL_ATLAS, BNL_ATLAS_BAK, BU_ATLAS_Tier2, CalTech_PG, FNAL_CMS, IU_ATLAS_Tier2, PDSF, Rice_Grid3, SMU_Physics_Cluster, UBuffalo_CCR, UCSanDiego_PG, UC_ATLAS_Tier2, UFlorida_PG, UM_ATLAS, UNM_HPC, UTA_dpcc, UWMadison.
- NorduGrid sites: SWEGRID, brenta.ijs.si, benedict.aau.dk, hypatia.uio.no, farm.hep.lu.se, fire.ii.uib.no, fe10.dcsc.sdu.dk, lxsrv9.lrz-muenchen.de, atlas.hpc.unimelb.edu.au, grid.uio.no, lheppc10.unibe.ch, morpheus.dcgc.dk, genghis.hpc.unimelb.edu.au, charm.hpc.unimelb.edu.au, lscf.nbi.dk, atlas.fzk.de, grid.fi.uib.no.]
[Pie chart: ATLAS DC2 CPU usage — LCG 41%, Grid3 30%, NorduGrid 29%.]
Distributed Analysis system: ADA
- The physicists want to use the Grid to perform the analysis of the data too.
- The ADA (ATLAS Distributed Analysis) project aims at putting together all the software components needed to facilitate end-user analysis.
[Diagram: the ADA architecture — GUI and command-line clients (ROOT, Python); high-level services for cataloguing, job submission and monitoring (AMI DB, DIAL, ATPROD, ARDA); workload management systems (LSF, Condor, gLite WMS); the components communicate via AJDL, the job description language.]
- DIAL: defines the job components (dataset, task, application, etc.). Together with LSF or Condor it provides "interactivity" (a low response time).
- ATPROD: the production system, to be used for low-scale mass production.
- ARDA: an analysis system to be interfaced to the EGEE middleware.
Lessons learned from DC2
Main problems:
- The production system was still under development during the DC2 phase.
- The beta status of the Grid services caused trouble while the system was in operation; for example, the Globus RLS, the Resource Broker and the information system were unstable in the initial phase.
- Especially on LCG, the lack of a uniform monitoring system.
- Site misconfiguration and site-stability problems.
Main achievements:
- An automatic production system making use of the Grid infrastructure.
- 6 TB (out of 30 TB) of data were moved among the different Grid flavours using the Don Quijote servers.
- 235,000 jobs were submitted by the production system.
- 250,000 logical files were produced, with 2500-3500 jobs per day distributed over the three Grid flavours.
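As a sanity check, the quoted sustained rate and job total imply a production window of roughly two to three months of full-rate running, which fits the DC2 timeline:

```python
# Consistency check of the DC2 throughput numbers quoted above.
total_jobs = 235_000
rate_low, rate_high = 2500, 3500  # jobs/day over the three Grid flavours

days_at_high_rate = total_jobs / rate_high  # ~67 days
days_at_low_rate = total_jobs / rate_low    # 94 days
print(round(days_at_high_rate), round(days_at_low_rate))  # roughly 67-94 days
```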
Conclusions
- The generation and simulation of events for ATLAS DC2 have been completed using three flavours of Grid technology.
- They have been proven usable in a coherent way for a real production, and this is a major achievement.
- This exercise has taught us that all the elements involved (Grid middleware, production system, deployment and monitoring tools) need improvements.
- Between the start of DC2 in July 2004 and the end of September 2004 (corresponding to the G4-simulation phase), the automatic production system submitted 235,000 jobs, which consumed ~1.5 million SI2k-months of CPU and produced more than 30 TB of physics data.
- ATLAS is also pursuing a model for distributed analysis which would improve the productivity of end users by profiting from the available Grid resources.
Supervisor-Executors
[Diagram: the Windmill supervisors communicate with the executors over a Jabber pathway, using the messages numJobsWanted, executeJobs, getExecutorData, getStatus, fixJob and killJob. The supervisors read job definitions from the Prod DB (jobs database); the executors (1. lexor, 2. dulcinea, 3. capone, 4. legacy) submit the jobs to the execution sites (grid); Don Quijote acts as the file catalog.]
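The message verbs in the diagram can be pictured as a simple dispatch interface. The sketch below is purely illustrative: the real protocol runs over Jabber, and the handler bodies here are invented placeholders, not real executor code — only the verb names come from the slide.

```python
# Toy model of an executor answering the Windmill supervisor's message verbs.
class ToyExecutor:
    def __init__(self):
        self.jobs = {}  # job id -> state

    def numJobsWanted(self):
        # How many new jobs this executor is willing to accept right now.
        return 10

    def executeJobs(self, job_ids):
        for jid in job_ids:
            self.jobs[jid] = "running"

    def getStatus(self, job_id):
        return self.jobs.get(job_id, "unknown")

    def killJob(self, job_id):
        self.jobs[job_id] = "killed"

ex = ToyExecutor()
ex.executeJobs([4062, 4063])
print(ex.getStatus(4062))  # running
ex.killJob(4063)
print(ex.getStatus(4063))  # killed
```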
NorduGrid: ARC features
- ARC is based on the Globus Toolkit, with core services replaced.
  - It currently uses Globus Toolkit 2.
- Alternative/extended Grid services:
  - a Grid Manager that checks user credentials and authorization, handles jobs locally on clusters (interfacing to the LRMS), and does stage-in and stage-out of files;
  - a lightweight User Interface with a built-in resource broker;
  - an Information System based on MDS with a NorduGrid schema;
  - the xRSL job description language (extended Globus RSL);
  - a Grid Monitor.
- Simple, stable and non-invasive.
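For illustration, an xRSL job description is a set of attribute=value clauses joined by a leading `&`. The sketch below is an invented example: the script name, file names, URLs and resource limits are hypothetical and not taken from DC2.

```
&(executable="atlas-g4sim.sh")
 (arguments="4062")
 (inputFiles=("evgen.root" "gsiftp://se.example.org/dc2/evgen.root"))
 (outputFiles=("hits.root" "rls://rls.nordugrid.org/dc2/hits.root"))
 (cpuTime="1440")
 (memory="1000")
 (stdout="sim.out")
 (stderr="sim.err")
```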
LCG software
- LCG-2 core packages:
  - VDT (Globus2, Condor)
  - EDG WP1 (Resource Broker, job submission tools)
  - EDG WP2 (Replica Management tools) + lcg tools: one central RMC and LRC for each VO, located at CERN, with an Oracle backend
  - several bits from other WPs (Config objects, InfoProviders, Packaging, ...)
  - GLUE 1.1 (information schema) + a few essential LCG extensions
  - an MDS-based Information System with significant LCG enhancements (replacements, simplified; see poster)
  - a mechanism for application (experiment) software distribution
- Almost all components have gone through some re-engineering for robustness, scalability, efficiency and adaptation to local fabrics.
- The services are now quite stable, and the performance and scalability have been significantly improved (within the limits of the current architecture).
Grid3 software
- A Grid environment built from core Globus and Condor middleware, as delivered through the Virtual Data Toolkit (VDT): GRAM, GridFTP, MDS, RLS, VDS ...
- ... equipped with VO and multi-VO security, monitoring and operations services ...
- ... allowing federation with other Grids where possible, e.g. the CERN LHC Computing Grid (LCG):
  - US ATLAS: GriPhyN VDS execution on LCG sites
  - US CMS: storage-element interoperability (SRM/dCache)
- Delivering the US LHC Data Challenges.
ATLAS DC2 (CPU)
[Pie chart: CPU usage — LCG 41%, Grid3 30%, NorduGrid 29%.]
Total: ~1470 kSI2k-months, ~100,000 jobs, ~7.94 million fully simulated events, ~30 TB of data.
Typical job distribution on LCG
[Pie chart: share of jobs per LCG site — at.uibk, ca.triumf, ca.ualberta, ca.umontreal, ca.utoronto, ch.cern, cz.golias, cz.skurut, de.fzk, es.ifae, es.ific, es.uam, fr.in2p3, it.infn.cnaf, it.infn.lnl, it.infn.mi, it.infn.na, it.infn.roma, it.infn.to, it.infn.lnf, jp.icepp, nl.nikhef, pl.zeus, ru.msu, tw.sinica, uk.bham, uk.ic, uk.lancs, uk.man, uk.rl, uk.shef, uk.ucl.]
Typical job distribution on Grid3
[Pie chart: share of jobs per Grid3 site — ANL_HEP, BNL_ATLAS, BNL_ATLAS_BAK, BU_ATLAS_Tier2, CalTech_PG, FNAL_CMS, FNAL_CMS2, IU_ATLAS_Tier2, OU_OSCER, PDSF, Rice_Grid3, SMU_Physics_Cluster, UBuffalo_CCR, UCSanDiego_PG, UC_ATLAS_Tier2, UFlorida_PG, UM_ATLAS, UNM_HPC, UTA_dpcc, UWMadison.]
Jobs distribution on NorduGrid (September 19)
[Pie chart: share of jobs per NorduGrid site, with SWEGRID alone at ~45% — SWEGRID, brenta.ijs.si, benedict.aau.dk, hypatia.uio.no, fe10.dcsc.sdu.dk, farm.hep.lu.se, fire.ii.uib.no, grid.uio.no, morpheus.dcgc.dk, atlas.hpc.unimelb.edu.au, lxsrv9.lrz-muenchen.de, atlas.fzk.de, lheppc10.unibe.ch, genghis.hpc.unimelb.edu.au, charm.hpc.unimelb.edu.au, lscf.nbi.dk, grid.fi.uib.no.]