LHC Computing Models and Data Challenges
David Stickland
Princeton University
For the LHC Experiments:
ALICE, ATLAS, CMS and LHCb
(Though with entirely my own bias and errors)
Outline
Elements of a Computing Model
Recap of LHC Computing Scales
Recent/Current LHC Data Challenges
Critical Issues
The Real Challenge…
Quoting from David Williams' opening talk:
– Computing hardware isn't the biggest challenge
– Enabling the collaboration to bring their combined intellect to bear on the problems is the real challenge
Empowering the intellectual capabilities in very large collaborations to contribute to the analysis of the new energy frontier
– Not (just) from a sociological/political perspective
– You never know where the critical ideas or work will come from
– The simplistic vision of central control is elitist and illusory
Goals of An (Offline) Computing Model
Worldwide Collaborator access and ability to contribute to the analysis of the experiment:
Safe Data Storage
Feedback to the running experiment
Reconstruction according to a priority scheme, with graceful fall-back solutions
– Introduce no deadtime
Optimum (or at least acceptable) usage of computing resources
Efficient Data Management
Where are LHC Computing Models today?
In a state of flux!
– The next 12 months are the last chance to influence initial purchasing planning

Basic principles
– Were contained in MONARC and went into the first assessment of LHC computing requirements
– A.k.a. the "Hoffmann Report"; instrumental in the establishment of the LCG

But now we have two critical milestones to meet:
– Computing Model papers (Dec 2004)
– Experiment and LCG TDRs (July 2005)

And we more or less know what will, or won't, be ready in time:
– Restrict enthusiasm to attainable goals

A process is now underway to review the models in all four experiments:
– Very interested in running-experiment experience (this conference!)
– Need the maximum expertise and experience included in these final CMs
LHC Computing Scale (Circa 2008)
CERN T0/T1
– Disk Space [PB]: 5
– Mass Storage Space [PB]: 20
– Processing Power [MSI2K]: 20
– WAN [10 Gb/s]: ~5?

Tier-1s (sum of ~10)
– Disk Space [PB]: 20
– Mass Storage Space [PB]: 20
– Processing Power [MSI2K]: 45
– WAN [10 Gb/s per Tier-1]: ~1?

Tier-2s (sum of ~40)
– Disk Space [PB]: 12
– Mass Storage Space [PB]: 5
– Processing Power [MSI2K]: 40
– WAN [10 Gb/s per Tier-2]: ~0.2?

Cost sharing: 30% at CERN, 40% Tier-1s, 30% Tier-2s
[Pie charts: cost sharing among Disk, CPU, Tape and LAN/WAN for CERN T0/T1, Tier-1 and Tier-2 centres]
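As a quick sanity check on these scales, a minimal sketch (my own arithmetic, using only the figures quoted above) totalling the circa-2008 estimates across tiers:

```python
# Minimal sketch (illustrative, not from the talk): totalling the
# circa-2008 resource estimates quoted above across the three tiers.
tiers = {
    "CERN T0/T1": {"disk_PB": 5,  "tape_PB": 20, "cpu_MSI2k": 20},
    "Tier-1s":    {"disk_PB": 20, "tape_PB": 20, "cpu_MSI2k": 45},
    "Tier-2s":    {"disk_PB": 12, "tape_PB": 5,  "cpu_MSI2k": 40},
}
for key in ("disk_PB", "tape_PB", "cpu_MSI2k"):
    total = sum(t[key] for t in tiers.values())
    print(f"total {key}: {total}")
# -> 37 PB disk, 45 PB tape and 105 MSI2k of CPU in total
```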
Elements of a Computing Model (I)
Data Model
– Event data sizes, formats, streaming
– Data "tiers" (DST/ESD/AOD etc.)
  • Roles, accessibility, distribution, …
– Calibration/conditions data
  • Flow, latencies, update frequency
– Simulation: sizes, distribution
– File size

Analysis Model
– Canonical group needs in terms of data, streams, re-processing, calibrations
– Data movement, job movement, priority management
– Interactive analysis

Computing Strategy and Deployment
– Roles of the computing tiers
– Data distribution between tiers
– Data management architecture
– Databases
  • Masters, updates, hierarchy
– Active/passive experiment policy

Computing Specifications
– Profiles (Tier N & time):
  • Processors
  • Storage
  • Network (wide/local)
  • Database services
  • Specialized servers
  • Middleware requirements
Common Themes
Move a copy of the raw data away from CERN in "real time"
– Second secure copy:
  • 1 copy at CERN
  • 1 copy spread over N sites
– Flexibility:
  • Serve raw data even if the Tier-0 is saturated with DAQ
– Ability to run even primary reconstruction offsite

Streaming online and offline
– (Maybe not a common theme yet)

Tier-1 centers in-line to the Online system and Tier-0

Simulation at Tier-2 centers
– Except LHCb: if the simulation load remains high, use Tier-1

ESD distributed, n copies over N Tier-1 sites
– Tier-2 centers run complex selections at Tier-1 and download skims

AOD distributed to all (?) Tier-2 centers
– Maybe not a common theme:
  • How useful is AOD, and how early in LHC?
  • Some Run II experience indicates long-term usage of "raw" data

Horizontal streaming
– RAW, ESD, AOD, TAG

Vertical streaming
– Trigger streams, physics streams, analysis skims
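A hypothetical sketch of the two streaming axes above (the tier names are from the slide; the toy event records and the trigger classifier are invented for illustration):

```python
# Hypothetical sketch of the two streaming axes named on this slide.
# Horizontal: successively smaller per-event formats.
# Vertical: partitioning one format by trigger/physics selection.
HORIZONTAL_TIERS = ["RAW", "ESD", "AOD", "TAG"]  # decreasing event size

def vertical_streams(events, classify):
    """Split one data tier into named streams, e.g. trigger streams."""
    streams = {}
    for ev in events:
        streams.setdefault(classify(ev), []).append(ev)
    return streams

# Toy events; the 'trigger' field and its values are invented.
events = [{"id": 1, "trigger": "muon"},
          {"id": 2, "trigger": "jet"},
          {"id": 3, "trigger": "muon"}]
print(vertical_streams(events, lambda ev: ev["trigger"]))
# -> {'muon': [events 1 and 3], 'jet': [event 2]}
```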
Purpose and structure of ALICE PDC04
Test and validate the ALICE offline computing model:
– Produce and analyse ~10% of the data sample collected in a standard data-taking year
– Use the entire ALICE offline framework: AliEn, AliRoot, LCG, PROOF, …
– Experiment with Grid-enabled distributed computing
– Triple purpose: test of the middleware, test of the software, and physics analysis of the produced data for the ALICE PPR

Three phases:
– Phase I: distributed production of underlying Pb+Pb events with different centralities (impact parameters) and of p+p events
– Phase II: distributed production mixing different signal events into the underlying Pb+Pb events (reused several times)
– Phase III: distributed analysis

Principles:
– True Grid data production and analysis: all jobs are run on the Grid, using only AliEn for access and control of native computing resources and, through an interface, the LCG resources
– In Phase III, gLite+ARDA
Job structure and production (Phase I)

Central servers handle master job submission, the job optimizer (splitting into sub-jobs), the RB, the file catalogue, process monitoring and control, SE, …
Sub-jobs flow both to CEs controlled directly by AliEn and, through the AliEn-LCG interface and the LCG RB, to LCG CEs: LCG is one AliEn CE.
Output files go via the file transfer system (AIOD) to storage in CERN CASTOR (disk servers, tape).
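An illustrative sketch of the splitting step, not AliEn code (the SubJob fields, CE names and round-robin placement are my own simplifications):

```python
# Illustrative sketch: a master production job split into sub-jobs and
# dispatched to computing elements, with LCG treated as just another CE
# behind an interface, as in the Phase I diagram above.
from dataclasses import dataclass

@dataclass
class SubJob:
    master_id: int
    index: int
    events: int

def split(master_id, total_events, events_per_subjob):
    """Job-optimizer step: split a master job into sub-jobs."""
    n = -(-total_events // events_per_subjob)  # ceiling division
    return [SubJob(master_id, i, min(events_per_subjob,
                                     total_events - i * events_per_subjob))
            for i in range(n)]

CES = ["alien-ce-1", "alien-ce-2", "lcg-interface"]  # LCG is one AliEn CE

for job in split(master_id=42, total_events=1000, events_per_subjob=300):
    ce = CES[job.index % len(CES)]  # trivial round-robin placement
    print(f"sub-job {job.index} ({job.events} events) -> {ce}")
```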
Phase I CPU contributions

CEs: 15 directly controlled through AliEn, plus CERN-LCG and Torino-LCG (Grid.it)

[Chart of per-CE CPU contributions not reproduced here]
Issues
Too many files in the MSS stager (also seen by CMS)
– Solved by splitting the data over two stagers

Persistent problems with local configurations reducing the availability of Grid sites
– Frequent black holes
– Problems often come back (e.g. nfs mounts!)
– Local disk space on WNs

Quality of the information in the II (information index)

The Workload Management System does not ensure an even distribution of jobs across the different centres

Lack of support for bulk operations makes the WMS response time critical

The keyhole approach, and the lack of appropriate monitoring and reporting tools, make debugging difficult
Phase II (started 1 July) – statistics

In addition to Phase I:
– Distributed production of signal events and merging with Phase I events
– Stress of the network and file-transfer tools
– Storage at remote SEs and stability (crucial for Phase III)

Conditions, jobs, …:
– 110 conditions total
– 1 million jobs
– 10 TB produced data
– 200 TB transferred from CERN
– 500 MSI2k hours of CPU

End by 30 September
Signal                      S/UE    UE       MB/ev   kSI2k·s/ev   TB      MSI2k·h

Jets cent1, cycles: 2
 Jets PT 20-24 GeV/c         5      1666     5.2     940          0.09    4.35
 Jets PT 24-29 GeV/c         5      1666     5.2     946          0.09    4.38
 Jets PT 29-35 GeV/c         5      1666     5.3     952          0.09    4.41
 Jets PT 35-42 GeV/c         5      1666     5.3     958          0.09    4.43
 Jets PT 42-50 GeV/c         5      1666     5.4     964          0.09    4.46
 Jets PT 50-60 GeV/c         5      1666     5.4     970          0.09    4.49
 Jets PT 60-72 GeV/c         5      1666     5.5     976          0.09    4.52
 Jets PT 72-86 GeV/c         5      1666     5.5     982          0.09    4.54
 Jets PT 86-104 GeV/c        5      1666     5.6     988          0.09    4.57
 Jets PT 104-125 GeV/c       5      1666     5.6     994          0.09    4.60
 Jets PT 125-150 GeV/c       5      1666     5.7     1000         0.09    4.63
 Jets PT 150-180 GeV/c       5      1666     5.7     1006         0.09    4.66
 Total signal                       199920                        1.08    54.04

Jets with quenching cent1, cycles: 2
 Total signal                       199920                        1.08    54.04

Jets per1, cycles: 2
 Jets PT 20-24 GeV/c         5      1666     2.6     940          0.04    2.18
 Jets PT 24-29 GeV/c         5      1666     2.6     946          0.04    2.19
 Jets PT 29-35 GeV/c         5      1666     2.65    952          0.04    2.20
 Jets PT 35-42 GeV/c         5      1666     2.65    958          0.04    2.22
 Jets PT 42-50 GeV/c         5      1666     2.7     964          0.04    2.23
 Jets PT 50-60 GeV/c         5      1666     2.7     970          0.04    2.24
 Jets PT 60-72 GeV/c         5      1666     2.75    976          0.05    2.26
 Jets PT 72-86 GeV/c         5      1666     2.75    982          0.05    2.27
 Jets PT 86-104 GeV/c        5      1666     2.8     988          0.05    2.29
 Jets PT 104-125 GeV/c       5      1666     2.8     994          0.05    2.30
 Jets PT 125-150 GeV/c       5      1666     2.85    1000         0.05    2.31
 Jets PT 150-180 GeV/c       5      1666     2.85    1006         0.05    2.33
 Total signal                       199920                        0.54    27.02

Jets with quenching per1, cycles: 2
 Total signal                       199920                        0.54    27.02

PHOS cent1, cycles: 1
 Jet-Jet PHOS                1      20000    8.6     3130         0.17    17.39
 Gamma-jet PHOS              1      20000    8.6     3130         0.17    17.39
 Total signal                       40000                         0.34    34.78

D0 cent1, cycles: 1
 D0                          5      20000    2.3     820          0.23    22.77
 Total signal                       100000                        0.23    22.77

Charm & Beauty cent1, cycles: 1
 Charm (semi-e) + J/psi      5      20000    2.3     820          0.23    22.78
 Beauty (semi-e) + Y         5      20000    2.3     820          0.23    22.78
 Total signal                       200000                        0.46    45.56

MUON cent1, cycles: 1
 Muon cocktail cent1         100    20000    0.04    67           0.08    37.22
 Muon cocktail HighPT        100    20000    0.04    67           0.08    37.22
 Muon cocktail single        100    20000    0.04    67           0.08    37.22
 Total signal                       6000000                       0.24    111.66

MUON per1, cycles: 1
 Muon cocktail per1          100    20000    0.04    67           0.08    37.22
 Muon cocktail HighPT        100    20000    0.04    67           0.08    37.22
 Muon cocktail single        100    20000    0.04    67           0.08    37.22
 Total signal                       6000000                       0.24    111.66

All signals                                                       4.75    488.55

MUON per4, cycles: 1
 Muon cocktail per4          5      20000
 Muon cocktail single        100    20000

proton-proton (no merging), cycles: 1
 proton-proton                      100000

Legend: S/UE = signals per underlying event; UE = underlying events; MB/ev and kSI2k·s/ev are per signal event. "Total signal" rows give the total number of signal events.
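The totals follow from the per-event columns. A quick cross-check (my own arithmetic, not from the slides), using the "Muon cocktail cent1" row:

```python
# Cross-check of the table's totals from the per-event columns, using
# the "Muon cocktail cent1" row: 100 signals per underlying event,
# 20000 underlying events, 0.04 MB and 67 kSI2k·s per signal event.
signals_per_event = 100
underlying_events = 20000
mb_per_event = 0.04
ksi2k_s_per_event = 67

n_signals = signals_per_event * underlying_events          # 2,000,000
data_tb = n_signals * mb_per_event / 1e6                   # MB -> TB
cpu_msi2k_h = n_signals * ksi2k_s_per_event / 3600 / 1000  # kSI2k·s -> MSI2k·h
print(n_signals, round(data_tb, 2), round(cpu_msi2k_h, 2))
# -> 2000000 0.08 37.22, matching the 0.08 TB and 37.22 MSI2k·h in the row
```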
Structure of event production in Phase II

Central servers handle master job submission, the job optimizer (N sub-jobs), the RB, the file catalogue, process monitoring and control, SE, …
Sub-jobs run on AliEn CEs and, via the AliEn-LCG interface and the LCG RB, on LCG CEs.
Underlying-event input files are read from CERN CASTOR. Each job's output files are zipped into an archive; the primary copy is stored on the local SEs, with a backup copy in CERN CASTOR.
Outputs are registered in the AliEn file catalogue; for an LCG SE, the LCG LFN is recorded as the AliEn PFN (edg/lcg copy&register).
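An illustrative sketch (invented API, not AliEn/LCG code) of that last registration rule, where a file on an LCG SE gets its LCG LFN recorded as the AliEn PFN:

```python
# Illustrative sketch of the registration rule in the Phase II diagram:
# for output stored on an LCG SE, the AliEn file catalogue records the
# LCG LFN as the AliEn PFN, so the file can be handed back to LCG tools.
alien_catalogue = {}  # AliEn LFN -> physical location ("PFN")

def register(alien_lfn, storage, location):
    if storage == "lcg":
        # indirection: the AliEn PFN is itself an LCG logical name
        alien_catalogue[alien_lfn] = f"lcg-lfn:{location}"
    else:
        alien_catalogue[alien_lfn] = f"srm://{location}"  # direct PFN

# Hypothetical paths, for illustration only.
register("/alice/sim/pbpb/run42/out.zip", "local", "se.site.org/path/out.zip")
register("/alice/sim/pbpb/run43/out.zip", "lcg", "/grid/alice/run43/out.zip")
print(alien_catalogue)
```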
Structure of analysis in Phase III

A user query goes to the central servers (master job submission, job optimizer producing N sub-jobs, RB, file catalogue, process monitoring and control, SE, …).
The file catalogue and its metadata are queried for LFNs; the job splitter groups the LFNs into sub-jobs (e.g. lfn 1-3, lfn 4-6, lfn 7-8) and the corresponding PFNs are fetched.
Sub-jobs run on AliEn CEs and, via the AliEn-LCG interface and the LCG RB, on LCG CEs, reading input files from the local SEs holding the primary copies.
PFN resolution: PFN = (LCG SE:) LCG LFN, or PFN = AliEn PFN.
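A hypothetical sketch of this flow, with a toy catalogue and metadata store standing in for the real AliEn services:

```python
# Hypothetical sketch of the Phase III flow: metadata query -> LFNs,
# group LFNs into sub-jobs, then resolve each LFN to a PFN using the
# same AliEn/LCG indirection as in the Phase II registration.
catalogue = {
    "lfn1": "lcg-lfn:/grid/alice/f1", "lfn2": "srm://se1/f2",
    "lfn3": "srm://se1/f3", "lfn4": "lcg-lfn:/grid/alice/f4",
}
metadata = {"lfn1": "PbPb", "lfn2": "PbPb", "lfn3": "pp", "lfn4": "PbPb"}

def query(selection):                # user query on metadata
    return [lfn for lfn, tag in metadata.items() if tag == selection]

def split(lfns, per_job=2):          # job splitter
    return [lfns[i:i + per_job] for i in range(0, len(lfns), per_job)]

for i, chunk in enumerate(split(query("PbPb"))):
    pfns = [catalogue[lfn] for lfn in chunk]  # get PFNs for the sub-job
    print(f"sub-job {i}: {pfns}")
```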
ALICE DC04 Conclusions
The ALICE DC04 started out with (almost unrealistically) ambitious objectives.
They have come very close to reaching these objectives, and LCG has played an important role.
They are ready and willing to move to gLite as soon as possible and to contribute to its evolution with their feedback.
ATLAS-DC2 operation

Consider DC2 as a three-part operation:
– Part I: production of simulated data (July-September 2004)
  • Running on the "Grid"
  • Worldwide
– Part II: test of Tier-0 operation (November 2004)
  • Do in 10 days what "should" be done in 1 day when real data-taking starts
  • Input is "raw-data"-like
  • Output (ESD+AOD) will be distributed to Tier-1s in real time for analysis
– Part III: test of distributed analysis on the Grid
  • Access to event and non-event data from anywhere in the world, both in organized and chaotic ways

Requests:
– ~30 physics channels (~10 million events)
– Several million events for calibration (single particles and physics samples)
ATLAS Production system

A common production database (prodDB) and data-management system (Don Quijote, dms), with the AMI metadata catalogue, sit behind five supervisor instances (Windmill). Each supervisor drives an executor for one backend, communicating over Jabber or SOAP: two LCG executors (Lexor), an NG executor (Dulcinea) for NorduGrid, a Grid3 executor (Capone), and an LSF executor for legacy batch. Each Grid flavour has its own RLS replica catalogue.
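A schematic sketch of the supervisor/executor split (component names from the slide; the interfaces and job states are invented for illustration):

```python
# Schematic sketch: one supervisor loop pulls job definitions from the
# production DB and hands them to a flavour-specific executor behind a
# common interface, as in the Windmill/executor design above.
class Executor:                     # Lexor, Dulcinea, Capone, LSF wrap this
    flavour = "abstract"
    def submit(self, job): raise NotImplementedError

class Lexor(Executor):              # LCG executor
    flavour = "LCG"
    def submit(self, job):
        print(f"[{self.flavour}] submitting {job['id']} via the RB")

class Capone(Executor):             # Grid3 executor
    flavour = "Grid3"
    def submit(self, job):
        print(f"[{self.flavour}] submitting {job['id']}")

def supervisor(prod_db, executor):
    """Windmill-like loop: fetch pending jobs, dispatch, mark submitted."""
    for job in [j for j in prod_db if j["state"] == "pending"]:
        executor.submit(job)
        job["state"] = "submitted"

prod_db = [{"id": "dc2-0001", "state": "pending"},
           {"id": "dc2-0002", "state": "pending"}]
supervisor(prod_db, Lexor())
```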
CPU usage & Jobs

ATLAS DC2 CPU usage by Grid flavour: LCG 41%, NorduGrid 30%, Grid3 29%.

[Pie charts (7 September): per-site CPU/job shares within NorduGrid (SWEGRID, brenta.ijs.si, benedict.aau.dk, hypatia.uio.no, farm.hep.lu.se and ~11 further sites), Grid3 (18 sites, including BNL_ATLAS, BU_ATLAS_Tier2, IU_ATLAS_Tier2, UC_ATLAS_Tier2, UTA_dpcc and UWMadison) and LCG (~33 sites, including ch.cern, ca.triumf, de.fzk, fr.in2p3, it.infn.cnaf, nl.nikhef and uk.rl)]
ATLAS DC2 Status
Major efforts in the past few months:
– Redesign of the ATLAS Event Data Model and Detector Description
– Integration of the LCG components (G4, POOL, …)
– Introduction of the Production System
  • Interfaced with 3 Grid flavours (and "legacy" systems)

Delays in all activities have affected the schedule of DC2:
– Note that the Combined Test Beam is ATLAS's first priority
– And the DC2 schedule was revisited
  • To wait for the readiness of the software and of the Production System

DC2:
– About 80% of the Geant4 simulation foreseen for Phase I has been completed, using only the Grid and using the 3 flavours coherently
– The 3 Grids have been proven usable for a real production, and this is a major achievement

BUT:
– Phase I is progressing slower than expected, and it is clear that all the involved elements (Grid middleware, Production System, deployment and monitoring tools over the sites) need improvements
– It is a key goal of the Data Challenges to identify these problems as early as possible.
Testing the CMS Computing Model in DC04
Focused on organized (CMS-managed) data flow/access

Functional DST with streams for physics and calibration:
– DST size OK, almost usable by "all" analyses (new version ready now)

Tier-0 farm reconstruction:
– 500 CPUs; ran at 25 Hz; reconstruction time within estimates

Tier-0 buffer management and distribution to Tier-1s:
– TMDB: a CMS-built agent system communicating via a central database
– Manages dynamic dataset "state", not a file catalog

Tier-1 managed import of selected data from Tier-0:
– TMDB system worked

Tier-2 managed import of selected data from Tier-1:
– Metadata-based selection OK; local Tier-1 TMDB OK

Real-time analysis access at Tier-1 and Tier-2:
– Achieved 20-minute latency from Tier-0 reconstruction to job launch at Tier-1 and Tier-2

Catalog services, replica management:
– Significant performance problems found, and being addressed
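A conceptual sketch of the TMDB idea (invented schema and dataset name): the point is that transfer agents coordinate only through dataset state rows in a central database, not through a file catalog or direct messages:

```python
# Conceptual sketch of a TMDB-style agent system: each Tier-1 runs an
# agent that pulls work by flipping dataset *state* in a central DB.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE datasets (name TEXT, dest TEXT, state TEXT)")
# Hypothetical dataset name, for illustration only.
db.execute("INSERT INTO datasets VALUES ('ttbar_DST', 'T1_FNAL', 'assigned')")

def tier1_agent(dest):
    """Agent at one Tier-1: claim assigned datasets, mark them done."""
    cur = db.execute(
        "SELECT rowid, name FROM datasets WHERE dest=? AND state='assigned'",
        (dest,))
    for rowid, name in cur.fetchall():
        # ... transfer the dataset's files here ...
        db.execute("UPDATE datasets SET state='transferred' WHERE rowid=?",
                   (rowid,))
        print(f"{dest}: {name} transferred")

tier1_agent("T1_FNAL")
```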
DC04 Data Challenge
Focused on organized (CMS-managed) data flow/access

[Map: T0 at CERN; T1 centres at PIC (Barcelona), FZK (Karlsruhe), CNAF (Bologna), RAL (Oxford), IN2P3 (Lyon) and FNAL (Chicago); T2 centres at Legnaro, CIEMAT (Madrid), Florida, IC (London) and Caltech]

T0 at CERN in DC04
– 25 Hz reconstruction
– Events filtered into streams
– Record raw data and DST
– Distribute raw data and DST to T1s

T1 centres in DC04
– Pull data from T0 to T1 and store
– Make data available to PRS
– Demonstrate quasi-realtime analysis of DSTs

T2 centres in DC04
– Pre-challenge production at >30 sites
– Modest tests of DST analysis
DC04 layout

Tier-0: a fake on-line process writes data into the IB; ORCA RECO jobs, steered via RefDB, reconstruct events, register products in the POOL RLS catalogue and pass output through the GDB and EB; the Tier-0 data-distribution agents and the TMDB, with LCG-2 services and Castor, handle distribution.
Tier-1: a Tier-1 agent pulls data into T1 storage and MSS; ORCA analysis jobs run locally and ORCA Grid jobs run via LCG-2.
Tier-2: data flows on to T2 storage, where physicists run ORCA local jobs.
Next Steps
The Physics TDR requires physicist access to DC04 data:
– Re-reconstruction passes
– Alignment studies
– Luminosity effects
  • Estimate 10M events/month throughput required

CMS "summer timeout" to focus new effort on:
– DST format/contents
– Data Management "RTAG"
– Workload Management deployment for physicist data access now
– Cross-project coordination group focused on end-user analysis

Use the requirements of the Physics TDR to build understanding of the analysis model, while doing the analysis:
– Make it work for the Physics TDR

Component data challenges in 2005:
– Not a big bang where everything has to work at the same time

Readiness challenge in 2006:
– 100% startup scale
– Concurrent production, distribution, ordered and chaotic analysis
LHCb DC’04 aims
Gather information for the LHCb Computing TDR

Physics goals:
– HLT studies, consolidating efficiencies
– B/S studies, consolidating background estimates and background properties

Requires a quantitative increase in the number of signal and background events:
– 30 × 10⁶ signal events (~80 physics channels)
– 15 × 10⁶ specific backgrounds
– 125 × 10⁶ background (B inclusive + min. bias, 1:1.8)

Split DC'04 into 3 phases:
– Production: MC simulation (done)
– Stripping: event pre-selection (to start soon)
– Analysis (in preparation)
DIRAC Services & Resources

User interfaces: production manager, GANGA UI, user CLI, job monitor, BK query web page, file catalog browser.
DIRAC services: Job Management Service, JobMonitorSvc, JobAccountingSvc (with its accounting DB), InformationSvc, FileCatalogSvc, MonitoringSvc, BookkeepingSvc.
DIRAC resources: DIRAC sites, where agents front the DIRAC CEs (CE 1, CE 2, CE 3); the LCG Resource Broker as another route to CEs; and DIRAC storage, disk files served via gridftp, bbftp and rfio.
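A minimal sketch (invented API, not DIRAC code) of the pull model this architecture implies: a site agent polls the central job service for work matching its local CE, instead of a broker pushing jobs out:

```python
# Minimal sketch of the pull model: the agent at each DIRAC site fetches
# a matching job from the central Job Management Service and runs it on
# the local CE; unmatched jobs go back into the queue.
import queue

job_management_service = queue.Queue()   # stand-in for the central service
job_management_service.put({"id": 1, "requirements": "slc3"})

def site_agent(ce_platform):
    """Runs at the site: fetch a matching job, execute it locally."""
    try:
        job = job_management_service.get_nowait()
    except queue.Empty:
        return None
    if job["requirements"] == ce_platform:   # trivial matchmaking
        print(f"agent: running job {job['id']} on the local CE")
        return job
    job_management_service.put(job)          # not for us; put it back
    return None

site_agent("slc3")
```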
Phase 1 Completed
[Daily production-rate plot. Annotations, in sequence: DIRAC alone; LCG in action; 1.8 × 10⁶ events/day; LCG paused; 3-5 × 10⁶ events/day; LCG restarted. Total: 186 M produced events.]
LCG Performance (I)
[Plot of LCG jobs over time: submitted jobs, cancelled jobs, and jobs aborted before running]

211 k submitted jobs. After running: 113 k done (successful), 34 k aborted.
LCG Performance (II)
LCG Job Submission Summary Table

                    Jobs (k)   % of Submitted   % of Remaining
Submitted              211        100.0%
Cancelled               26         12.2%
Remaining              185         87.8%          100.0%
Aborted (not run)       37         17.6%           20.1%
Running                148         70.0%           79.7%
Aborted (run)           34         16.2%           18.5%
Done                   113         53.8%           61.2%
Retrieved              113         53.8%           61.2%

LCG efficiency: 61%
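Cross-checking the quoted efficiency from the table (my arithmetic, using the rounded k-counts):

```python
# "Done" jobs as a fraction of jobs remaining after cancellations.
submitted, cancelled, done = 211, 26, 113
remaining = submitted - cancelled       # 185 k
print(f"{done / remaining:.1%}")        # -> 61.1%, the quoted ~61%
print(f"{done / submitted:.1%}")        # -> ~53.6% of all submitted jobs
```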
LHCb DC’04 Status
LHCb DC'04 Phase 1 is over.

The production target has been achieved:
– 186 M events in 424 CPU-years
– ~50% on LCG resources (75-80% in the last weeks)

The right LHCb strategy:
– Submitting "empty" DIRAC agents to LCG has proven to be very flexible, allowing a good success rate (sketched below)

Big room for improvements, both in DIRAC and LCG:
– DIRAC needs to improve the reliability of its servers:
  • A big step was already taken during the DC
– LCG needs improvement in single-job efficiency:
  • ~40% aborted jobs
– In both cases, extra protections against external failures (network, unexpected shutdowns, …) must be built in

Congratulations and warm thanks to the complete LCG team for their support and dedication
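A hedged sketch of that "empty agent" (pilot) strategy, with invented function names: the payload is pulled only after the LCG slot proves healthy, so slot failures waste a pilot rather than a production job:

```python
# Sketch of an "empty" agent: the job submitted to LCG carries no
# physics payload; once running on a healthy worker node, it asks the
# DIRAC services for real work.
def pilot_main(fetch_task, report):
    """Body of the 'empty' job actually submitted to the LCG RB."""
    if not worker_node_ok():
        return                  # bad slot: no production job is lost
    task = fetch_task()         # pull real work from DIRAC
    if task is None:
        return                  # nothing to do; exit cleanly
    status = run(task)          # simulate/reconstruct events
    report(task, status)        # bookkeeping back to DIRAC

def worker_node_ok():
    import shutil
    return shutil.disk_usage(".").free > 2 * 1024**3  # e.g. >2 GB scratch

def run(task):
    print(f"running {task}")
    return "done"

# Hypothetical task name, for illustration only.
pilot_main(lambda: "simulation-chunk-17", lambda t, s: print(t, s))
```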
Personal Observations on Data Challenge Results

Tier-0 operations at 25% scale demonstrated
– Job couplings from the Objectivity era: gone

Directed data flow/management T0 → T1 → T2 worked (intermittently)

Massive simulation on LCG, Grid3 and NorduGrid worked

Beginning to get experience with input-data-intensive jobs

Not many users out there yet stressing the chaotic side
– The next 6 months are critical; we have to see broad and growing adoption. Not having a personal grid user certificate will have to seem odd

Many problems are classical computer-center ones
– Full disks, reboots, SW installation, dead disks, …
– Actually this is bad news: there is no middleware silver bullet, just hard work getting so many centers up to the required performance
Critical Issues for early 2005
Data Management
– Building experiment data-management solutions

Demonstrating end-user access to remote resources
– Data and processing

Managing conditions and calibration databases
– And their global distribution

Managing network expectations
– Analysis can place (currently) impossible loads on network and DM components
  • Planning for the future, while maintaining priority controls

Determining the pragmatic mix of Grid responsibilities and experiment responsibilities
– Recall the "Data" in DataGrid; LHC is data-intensive
– Configuring the experiment and grid software to use generic resources is wise
– But (I think) data location will require a more ordered approach in practice