LHC Computing Models and Data Challenges
David Stickland
Princeton University
For the LHC Experiments:
ALICE, ATLAS, CMS and LHCb
(Though with entirely my own bias and errors)
Outline
Elements of a Computing Model
Recap of LHC Computing Scales
Recent/Current LHC Data Challenges
Critical Issues
The Real Challenge…
Quoting from David Williams' opening talk:
– Computing hardware isn't the biggest challenge
– Enabling the collaboration to bring their combined intellect to bear on the problems is the real challenge
Empowering the intellectual capabilities in very large collaborations to contribute to the analysis of the new energy frontier
– Not (just) from a sociological/political perspective
– You never know where the critical ideas or work will come from
– The simplistic vision of central control is elitist and illusory
Goals of An (Offline) Computing Model
Worldwide Collaborator access and ability to contribute to the analysis of the experiment:
Safe Data Storage
Feedback to the running experiment
Reconstruction according to a priority scheme, with graceful fall-back solutions
– Introduce no deadtime
Optimum (or at least acceptable) usage of computing resources
Efficient Data Management
Where are LHC Computing Models today?
In a state of flux!
– The next 12 months are the last chance to influence initial purchasing planning

Basic principles
– Were contained in MONARC and went into the first assessment of LHC computing requirements
– A.k.a. the "Hoffmann Report"; instrumental in the establishment of the LCG

But now we have two critical milestones to meet:
– Computing Model papers (Dec 2004)
– Experiment and LCG TDRs (July 2005)

And we more or less know what will, or won't, be ready in time:
– Restrict enthusiasm to attainable goals

A process is now underway to review the models in all four experiments:
– Very interested in running-experiment experience (this conference!)
– Need the maximum expertise and experience included in these final CMs
LHC Computing Scale (Circa 2008)
CERN T0/T1
– Disk Space [PB]: 5
– Mass Storage Space [PB]: 20
– Processing Power [MSI2K]: 20
– WAN [10 Gb/s]: ~5?

Tier-1s (sum of ~10)
– Disk Space [PB]: 20
– Mass Storage Space [PB]: 20
– Processing Power [MSI2K]: 45
– WAN [10 Gb/s per Tier-1]: ~1?

Tier-2s (sum of ~40)
– Disk Space [PB]: 12
– Mass Storage Space [PB]: 5
– Processing Power [MSI2K]: 40
– WAN [10 Gb/s per Tier-2]: ~0.2?

Cost sharing: 30% at CERN, 40% Tier-1s, 30% Tier-2s
[Pie charts: cost sharing among Disk, CPU, Tape and LAN/WAN for CERN T0/T1, Tier-1 and Tier-2 centres]
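As a quick sanity check on these scales, a minimal sketch (my own arithmetic, using only the figures quoted above) totalling the circa-2008 estimates across tiers:

```python
# Minimal sketch (illustrative, not from the talk): totalling the
# circa-2008 resource estimates quoted above across the three tiers.
tiers = {
    "CERN T0/T1": {"disk_PB": 5,  "tape_PB": 20, "cpu_MSI2k": 20},
    "Tier-1s":    {"disk_PB": 20, "tape_PB": 20, "cpu_MSI2k": 45},
    "Tier-2s":    {"disk_PB": 12, "tape_PB": 5,  "cpu_MSI2k": 40},
}
for key in ("disk_PB", "tape_PB", "cpu_MSI2k"):
    total = sum(t[key] for t in tiers.values())
    print(f"total {key}: {total}")
# -> 37 PB disk, 45 PB tape and 105 MSI2k of CPU in total
```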
Elements of a Computing Model (I)
Data Model
– Event data sizes, formats, streaming
– Data "tiers" (DST/ESD/AOD etc.)
  • Roles, accessibility, distribution, …
– Calibration/conditions data
  • Flow, latencies, update frequency
– Simulation: sizes, distribution
– File size

Analysis Model
– Canonical group needs in terms of data, streams, re-processing, calibrations
– Data movement, job movement, priority management
– Interactive analysis

Computing Strategy and Deployment
– Roles of the computing tiers
– Data distribution between tiers
– Data management architecture
– Databases
  • Masters, updates, hierarchy
– Active/passive experiment policy

Computing Specifications
– Profiles (Tier N & time):
  • Processors
  • Storage
  • Network (wide/local)
  • Database services
  • Specialized servers
  • Middleware requirements
Common Themes
Move a copy of the raw data away from CERN in "real time"
– Second secure copy:
  • 1 copy at CERN
  • 1 copy spread over N sites
– Flexibility:
  • Serve raw data even if the Tier-0 is saturated with DAQ
– Ability to run even primary reconstruction offsite

Streaming online and offline
– (Maybe not a common theme yet)

Tier-1 centers in-line to the Online system and Tier-0

Simulation at Tier-2 centers
– Except LHCb: if the simulation load remains high, use Tier-1

ESD distributed, n copies over N Tier-1 sites
– Tier-2 centers run complex selections at Tier-1 and download skims

AOD distributed to all (?) Tier-2 centers
– Maybe not a common theme:
  • How useful is AOD, and how early in LHC?
  • Some Run II experience indicates long-term usage of "raw" data

Horizontal streaming
– RAW, ESD, AOD, TAG

Vertical streaming
– Trigger streams, physics streams, analysis skims
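A hypothetical sketch of the two streaming axes above (the tier names are from the slide; the toy event records and the trigger classifier are invented for illustration):

```python
# Hypothetical sketch of the two streaming axes named on this slide.
# Horizontal: successively smaller per-event formats.
# Vertical: partitioning one format by trigger/physics selection.
HORIZONTAL_TIERS = ["RAW", "ESD", "AOD", "TAG"]  # decreasing event size

def vertical_streams(events, classify):
    """Split one data tier into named streams, e.g. trigger streams."""
    streams = {}
    for ev in events:
        streams.setdefault(classify(ev), []).append(ev)
    return streams

# Toy events; the 'trigger' field and its values are invented.
events = [{"id": 1, "trigger": "muon"},
          {"id": 2, "trigger": "jet"},
          {"id": 3, "trigger": "muon"}]
print(vertical_streams(events, lambda ev: ev["trigger"]))
# -> {'muon': [events 1 and 3], 'jet': [event 2]}
```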
Purpose and structure of ALICE PDC04
Test and validate the ALICE offline computing model:
– Produce and analyse ~10% of the data sample collected in a standard data-taking year
– Use the entire ALICE offline framework: AliEn, AliRoot, LCG, PROOF, …
– Experiment with Grid-enabled distributed computing
– Triple purpose: test of the middleware, test of the software, and physics analysis of the produced data for the ALICE PPR

Three phases:
– Phase I: distributed production of underlying Pb+Pb events with different centralities (impact parameters) and of p+p events
– Phase II: distributed production mixing different signal events into the underlying Pb+Pb events (reused several times)
– Phase III: distributed analysis

Principles:
– True Grid data production and analysis: all jobs are run on the Grid, using only AliEn for access and control of native computing resources and, through an interface, the LCG resources
– In Phase III, gLite+ARDA
Job structure and production (Phase I)

Central servers handle master job submission, the job optimizer (splitting into sub-jobs), the RB, the file catalogue, process monitoring and control, SE, …
Sub-jobs flow both to CEs controlled directly by AliEn and, through the AliEn-LCG interface and the LCG RB, to LCG CEs: LCG is one AliEn CE.
Output files go via the file transfer system (AIOD) to storage in CERN CASTOR (disk servers, tape).
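An illustrative sketch of the splitting step, not AliEn code (the SubJob fields, CE names and round-robin placement are my own simplifications):

```python
# Illustrative sketch: a master production job split into sub-jobs and
# dispatched to computing elements, with LCG treated as just another CE
# behind an interface, as in the Phase I diagram above.
from dataclasses import dataclass

@dataclass
class SubJob:
    master_id: int
    index: int
    events: int

def split(master_id, total_events, events_per_subjob):
    """Job-optimizer step: split a master job into sub-jobs."""
    n = -(-total_events // events_per_subjob)  # ceiling division
    return [SubJob(master_id, i, min(events_per_subjob,
                                     total_events - i * events_per_subjob))
            for i in range(n)]

CES = ["alien-ce-1", "alien-ce-2", "lcg-interface"]  # LCG is one AliEn CE

for job in split(master_id=42, total_events=1000, events_per_subjob=300):
    ce = CES[job.index % len(CES)]  # trivial round-robin placement
    print(f"sub-job {job.index} ({job.events} events) -> {ce}")
```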
Phase I CPU contributions

CEs: 15 directly controlled through AliEn, plus CERN-LCG and Torino-LCG (Grid.it)

[Chart of per-CE CPU contributions not reproduced here]
Issues
Too many files in the MSS stager (also seen by CMS)
– Solved by splitting the data over two stagers

Persistent problems with local configurations reducing the availability of Grid sites
– Frequent black holes
– Problems often come back (e.g. nfs mounts!)
– Local disk space on WNs

Quality of the information in the II (information index)

The Workload Management System does not ensure an even distribution of jobs across the different centres

Lack of support for bulk operations makes the WMS response time critical

The keyhole approach, and the lack of appropriate monitoring and reporting tools, make debugging difficult
Phase II (started 1 July) – statistics

In addition to Phase I:
– Distributed production of signal events and merging with Phase I events
– Stress of the network and file-transfer tools
– Storage at remote SEs and stability (crucial for Phase III)

Conditions, jobs, …:
– 110 conditions total
– 1 million jobs
– 10 TB produced data
– 200 TB transferred from CERN
– 500 MSI2k hours of CPU

End by 30 September
Signal                      S/UE    UE       MB/ev   kSI2k·s/ev   TB      MSI2k·h

Jets cent1, cycles: 2
 Jets PT 20-24 GeV/c         5      1666     5.2     940          0.09    4.35
 Jets PT 24-29 GeV/c         5      1666     5.2     946          0.09    4.38
 Jets PT 29-35 GeV/c         5      1666     5.3     952          0.09    4.41
 Jets PT 35-42 GeV/c         5      1666     5.3     958          0.09    4.43
 Jets PT 42-50 GeV/c         5      1666     5.4     964          0.09    4.46
 Jets PT 50-60 GeV/c         5      1666     5.4     970          0.09    4.49
 Jets PT 60-72 GeV/c         5      1666     5.5     976          0.09    4.52
 Jets PT 72-86 GeV/c         5      1666     5.5     982          0.09    4.54
 Jets PT 86-104 GeV/c        5      1666     5.6     988          0.09    4.57
 Jets PT 104-125 GeV/c       5      1666     5.6     994          0.09    4.60
 Jets PT 125-150 GeV/c       5      1666     5.7     1000         0.09    4.63
 Jets PT 150-180 GeV/c       5      1666     5.7     1006         0.09    4.66
 Total signal                       199920                        1.08    54.04

Jets with quenching cent1, cycles: 2
 Total signal                       199920                        1.08    54.04

Jets per1, cycles: 2
 Jets PT 20-24 GeV/c         5      1666     2.6     940          0.04    2.18
 Jets PT 24-29 GeV/c         5      1666     2.6     946          0.04    2.19
 Jets PT 29-35 GeV/c         5      1666     2.65    952          0.04    2.20
 Jets PT 35-42 GeV/c         5      1666     2.65    958          0.04    2.22
 Jets PT 42-50 GeV/c         5      1666     2.7     964          0.04    2.23
 Jets PT 50-60 GeV/c         5      1666     2.7     970          0.04    2.24
 Jets PT 60-72 GeV/c         5      1666     2.75    976          0.05    2.26
 Jets PT 72-86 GeV/c         5      1666     2.75    982          0.05    2.27
 Jets PT 86-104 GeV/c        5      1666     2.8     988          0.05    2.29
 Jets PT 104-125 GeV/c       5      1666     2.8     994          0.05    2.30
 Jets PT 125-150 GeV/c       5      1666     2.85    1000         0.05    2.31
 Jets PT 150-180 GeV/c       5      1666     2.85    1006         0.05    2.33
 Total signal                       199920                        0.54    27.02

Jets with quenching per1, cycles: 2
 Total signal                       199920                        0.54    27.02

PHOS cent1, cycles: 1
 Jet-Jet PHOS                1      20000    8.6     3130         0.17    17.39
 Gamma-jet PHOS              1      20000    8.6     3130         0.17    17.39
 Total signal                       40000                         0.34    34.78

D0 cent1, cycles: 1
 D0                          5      20000    2.3     820          0.23    22.77
 Total signal                       100000                        0.23    22.77

Charm & Beauty cent1, cycles: 1
 Charm (semi-e) + J/psi      5      20000    2.3     820          0.23    22.78
 Beauty (semi-e) + Y         5      20000    2.3     820          0.23    22.78
 Total signal                       200000                        0.46    45.56

MUON cent1, cycles: 1
 Muon cocktail cent1         100    20000    0.04    67           0.08    37.22
 Muon cocktail HighPT        100    20000    0.04    67           0.08    37.22
 Muon cocktail single        100    20000    0.04    67           0.08    37.22
 Total signal                       6000000                       0.24    111.66

MUON per1, cycles: 1
 Muon cocktail per1          100    20000    0.04    67           0.08    37.22
 Muon cocktail HighPT        100    20000    0.04    67           0.08    37.22
 Muon cocktail single        100    20000    0.04    67           0.08    37.22
 Total signal                       6000000                       0.24    111.66

All signals                                                       4.75    488.55

MUON per4, cycles: 1
 Muon cocktail per4          5      20000
 Muon cocktail single        100    20000

proton-proton (no merging), cycles: 1
 proton-proton                      100000

Legend: S/UE = signals per underlying event; UE = underlying events; MB/ev and kSI2k·s/ev are per signal event. "Total signal" rows give the total number of signal events.
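The totals follow from the per-event columns. A quick cross-check (my own arithmetic, not from the slides), using the "Muon cocktail cent1" row:

```python
# Cross-check of the table's totals from the per-event columns, using
# the "Muon cocktail cent1" row: 100 signals per underlying event,
# 20000 underlying events, 0.04 MB and 67 kSI2k·s per signal event.
signals_per_event = 100
underlying_events = 20000
mb_per_event = 0.04
ksi2k_s_per_event = 67

n_signals = signals_per_event * underlying_events          # 2,000,000
data_tb = n_signals * mb_per_event / 1e6                   # MB -> TB
cpu_msi2k_h = n_signals * ksi2k_s_per_event / 3600 / 1000  # kSI2k·s -> MSI2k·h
print(n_signals, round(data_tb, 2), round(cpu_msi2k_h, 2))
# -> 2000000 0.08 37.22, matching the 0.08 TB and 37.22 MSI2k·h in the row
```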
Structure of event production in Phase II

Central servers handle master job submission, the job optimizer (N sub-jobs), the RB, the file catalogue, process monitoring and control, SE, …
Sub-jobs run on AliEn CEs and, via the AliEn-LCG interface and the LCG RB, on LCG CEs.
Underlying-event input files are read from CERN CASTOR. Each job's output files are zipped into an archive; the primary copy is stored on the local SEs, with a backup copy in CERN CASTOR.
Outputs are registered in the AliEn file catalogue; for an LCG SE, the LCG LFN is recorded as the AliEn PFN (edg/lcg copy&register).
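An illustrative sketch (invented API, not AliEn/LCG code) of that last registration rule, where a file on an LCG SE gets its LCG LFN recorded as the AliEn PFN:

```python
# Illustrative sketch of the registration rule in the Phase II diagram:
# for output stored on an LCG SE, the AliEn file catalogue records the
# LCG LFN as the AliEn PFN, so the file can be handed back to LCG tools.
alien_catalogue = {}  # AliEn LFN -> physical location ("PFN")

def register(alien_lfn, storage, location):
    if storage == "lcg":
        # indirection: the AliEn PFN is itself an LCG logical name
        alien_catalogue[alien_lfn] = f"lcg-lfn:{location}"
    else:
        alien_catalogue[alien_lfn] = f"srm://{location}"  # direct PFN

# Hypothetical paths, for illustration only.
register("/alice/sim/pbpb/run42/out.zip", "local", "se.site.org/path/out.zip")
register("/alice/sim/pbpb/run43/out.zip", "lcg", "/grid/alice/run43/out.zip")
print(alien_catalogue)
```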
Structure of analysis in Phase III

A user query goes to the central servers (master job submission, job optimizer producing N sub-jobs, RB, file catalogue, process monitoring and control, SE, …).
The file catalogue and its metadata are queried for LFNs; the job splitter groups the LFNs into sub-jobs (e.g. lfn 1-3, lfn 4-6, lfn 7-8) and the corresponding PFNs are fetched.
Sub-jobs run on AliEn CEs and, via the AliEn-LCG interface and the LCG RB, on LCG CEs, reading input files from the local SEs holding the primary copies.
PFN resolution: PFN = (LCG SE:) LCG LFN, or PFN = AliEn PFN.
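A hypothetical sketch of this flow, with a toy catalogue and metadata store standing in for the real AliEn services:

```python
# Hypothetical sketch of the Phase III flow: metadata query -> LFNs,
# group LFNs into sub-jobs, then resolve each LFN to a PFN using the
# same AliEn/LCG indirection as in the Phase II registration.
catalogue = {
    "lfn1": "lcg-lfn:/grid/alice/f1", "lfn2": "srm://se1/f2",
    "lfn3": "srm://se1/f3", "lfn4": "lcg-lfn:/grid/alice/f4",
}
metadata = {"lfn1": "PbPb", "lfn2": "PbPb", "lfn3": "pp", "lfn4": "PbPb"}

def query(selection):                # user query on metadata
    return [lfn for lfn, tag in metadata.items() if tag == selection]

def split(lfns, per_job=2):          # job splitter
    return [lfns[i:i + per_job] for i in range(0, len(lfns), per_job)]

for i, chunk in enumerate(split(query("PbPb"))):
    pfns = [catalogue[lfn] for lfn in chunk]  # get PFNs for the sub-job
    print(f"sub-job {i}: {pfns}")
```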
ALICE DC04 Conclusions
The ALICE DC04 started out with (almost unrealistically) ambitious objectives.
They have come very close to reaching these objectives, and LCG has played an important role.
They are ready and willing to move to gLite as soon as possible and to contribute to its evolution with their feedback.
ATLAS-DC2 operation

Consider DC2 as a three-part operation:
– Part I: production of simulated data (July-September 2004)
  • Running on the "Grid"
  • Worldwide
– Part II: test of Tier-0 operation (November 2004)
  • Do in 10 days what "should" be done in 1 day when real data-taking starts
  • Input is "raw-data"-like
  • Output (ESD+AOD) will be distributed to Tier-1s in real time for analysis
– Part III: test of distributed analysis on the Grid
  • Access to event and non-event data from anywhere in the world, both in organized and chaotic ways

Requests:
– ~30 physics channels (~10 million events)
– Several million events for calibration (single particles and physics samples)
ATLAS Production system

A common production database (prodDB) and data-management system (Don Quijote, dms), with the AMI metadata catalogue, sit behind five supervisor instances (Windmill). Each supervisor drives an executor for one backend, communicating over Jabber or SOAP: two LCG executors (Lexor), an NG executor (Dulcinea) for NorduGrid, a Grid3 executor (Capone), and an LSF executor for legacy batch. Each Grid flavour has its own RLS replica catalogue.
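A schematic sketch of the supervisor/executor split (component names from the slide; the interfaces and job states are invented for illustration):

```python
# Schematic sketch: one supervisor loop pulls job definitions from the
# production DB and hands them to a flavour-specific executor behind a
# common interface, as in the Windmill/executor design above.
class Executor:                     # Lexor, Dulcinea, Capone, LSF wrap this
    flavour = "abstract"
    def submit(self, job): raise NotImplementedError

class Lexor(Executor):              # LCG executor
    flavour = "LCG"
    def submit(self, job):
        print(f"[{self.flavour}] submitting {job['id']} via the RB")

class Capone(Executor):             # Grid3 executor
    flavour = "Grid3"
    def submit(self, job):
        print(f"[{self.flavour}] submitting {job['id']}")

def supervisor(prod_db, executor):
    """Windmill-like loop: fetch pending jobs, dispatch, mark submitted."""
    for job in [j for j in prod_db if j["state"] == "pending"]:
        executor.submit(job)
        job["state"] = "submitted"

prod_db = [{"id": "dc2-0001", "state": "pending"},
           {"id": "dc2-0002", "state": "pending"}]
supervisor(prod_db, Lexor())
```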
CPU usage & Jobs

ATLAS DC2 CPU usage by Grid flavour: LCG 41%, NorduGrid 30%, Grid3 29%.

[Pie charts (7 September): per-site CPU/job shares within NorduGrid (SWEGRID, brenta.ijs.si, benedict.aau.dk, hypatia.uio.no, farm.hep.lu.se and ~11 further sites), Grid3 (18 sites, including BNL_ATLAS, BU_ATLAS_Tier2, IU_ATLAS_Tier2, UC_ATLAS_Tier2, UTA_dpcc and UWMadison) and LCG (~33 sites, including ch.cern, ca.triumf, de.fzk, fr.in2p3, it.infn.cnaf, nl.nikhef and uk.rl)]
ATLAS DC2 Status
Major efforts in the past few months:
– Redesign of the ATLAS Event Data Model and Detector Description
– Integration of the LCG components (G4, POOL, …)
– Introduction of the Production System
  • Interfaced with 3 Grid flavours (and "legacy" systems)

Delays in all activities have affected the schedule of DC2:
– Note that the Combined Test Beam is ATLAS's first priority
– And the DC2 schedule was revisited
  • To wait for the readiness of the software and of the Production System

DC2:
– About 80% of the Geant4 simulation foreseen for Phase I has been completed, using only the Grid and using the 3 flavours coherently
– The 3 Grids have been proven usable for a real production, and this is a major achievement

BUT:
– Phase I is progressing slower than expected, and it is clear that all the involved elements (Grid middleware, Production System, deployment and monitoring tools over the sites) need improvements
– It is a key goal of the Data Challenges to identify these problems as early as possible.
Testing the CMS Computing Model in DC04
Focused on organized (CMS-managed) data flow/access

Functional DST with streams for physics and calibration:
– DST size OK, almost usable by "all" analyses (new version ready now)

Tier-0 farm reconstruction:
– 500 CPUs; ran at 25 Hz; reconstruction time within estimates

Tier-0 buffer management and distribution to Tier-1s:
– TMDB: a CMS-built agent system communicating via a central database
– Manages dynamic dataset "state", not a file catalog

Tier-1 managed import of selected data from Tier-0:
– TMDB system worked

Tier-2 managed import of selected data from Tier-1:
– Metadata-based selection OK; local Tier-1 TMDB OK

Real-time analysis access at Tier-1 and Tier-2:
– Achieved 20-minute latency from Tier-0 reconstruction to job launch at Tier-1 and Tier-2

Catalog services, replica management:
– Significant performance problems found, and being addressed
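A conceptual sketch of the TMDB idea (invented schema and dataset name): the point is that transfer agents coordinate only through dataset state rows in a central database, not through a file catalog or direct messages:

```python
# Conceptual sketch of a TMDB-style agent system: each Tier-1 runs an
# agent that pulls work by flipping dataset *state* in a central DB.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE datasets (name TEXT, dest TEXT, state TEXT)")
# Hypothetical dataset name, for illustration only.
db.execute("INSERT INTO datasets VALUES ('ttbar_DST', 'T1_FNAL', 'assigned')")

def tier1_agent(dest):
    """Agent at one Tier-1: claim assigned datasets, mark them done."""
    cur = db.execute(
        "SELECT rowid, name FROM datasets WHERE dest=? AND state='assigned'",
        (dest,))
    for rowid, name in cur.fetchall():
        # ... transfer the dataset's files here ...
        db.execute("UPDATE datasets SET state='transferred' WHERE rowid=?",
                   (rowid,))
        print(f"{dest}: {name} transferred")

tier1_agent("T1_FNAL")
```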
DC04 Data Challenge
Focused on organized (CMS-managed) data flow/access

[Map: T0 at CERN; T1 centres at PIC (Barcelona), FZK (Karlsruhe), CNAF (Bologna), RAL (Oxford), IN2P3 (Lyon) and FNAL (Chicago); T2 centres at Legnaro, CIEMAT (Madrid), Florida, IC (London) and Caltech]

T0 at CERN in DC04
– 25 Hz reconstruction
– Events filtered into streams
– Record raw data and DST
– Distribute raw data and DST to T1s

T1 centres in DC04
– Pull data from T0 to T1 and store
– Make data available to PRS
– Demonstrate quasi-realtime analysis of DSTs

T2 centres in DC04
– Pre-challenge production at >30 sites
– Modest tests of DST analysis
DC04 layout

Tier-0: a fake on-line process writes data into the IB; ORCA RECO jobs, steered via RefDB, reconstruct events, register products in the POOL RLS catalogue and pass output through the GDB and EB; the Tier-0 data-distribution agents and the TMDB, with LCG-2 services and Castor, handle distribution.
Tier-1: a Tier-1 agent pulls data into T1 storage and MSS; ORCA analysis jobs run locally and ORCA Grid jobs run via LCG-2.
Tier-2: data flows on to T2 storage, where physicists run ORCA local jobs.
Next Steps
The Physics TDR requires physicist access to DC04 data:
– Re-reconstruction passes
– Alignment studies
– Luminosity effects
  • Estimate 10M events/month throughput required

CMS "summer timeout" to focus new effort on:
– DST format/contents
– Data Management "RTAG"
– Workload Management deployment for physicist data access now
– Cross-project coordination group focused on end-user analysis

Use the requirements of the Physics TDR to build understanding of the analysis model, while doing the analysis:
– Make it work for the Physics TDR

Component data challenges in 2005:
– Not a big bang where everything has to work at the same time

Readiness challenge in 2006:
– 100% startup scale
– Concurrent production, distribution, ordered and chaotic analysis
LHCb DC’04 aims
Gather information for the LHCb Computing TDR

Physics goals:
– HLT studies, consolidating efficiencies
– B/S studies, consolidating background estimates and background properties

Requires a quantitative increase in the number of signal and background events:
– 30 × 10⁶ signal events (~80 physics channels)
– 15 × 10⁶ specific backgrounds
– 125 × 10⁶ background (B inclusive + min. bias, 1:1.8)

Split DC'04 into 3 phases:
– Production: MC simulation (done)
– Stripping: event pre-selection (to start soon)
– Analysis (in preparation)
DIRAC Services & Resources

User interfaces: production manager, GANGA UI, user CLI, job monitor, BK query web page, file catalog browser.
DIRAC services: Job Management Service, JobMonitorSvc, JobAccountingSvc (with its accounting DB), InformationSvc, FileCatalogSvc, MonitoringSvc, BookkeepingSvc.
DIRAC resources: DIRAC sites, where agents front the DIRAC CEs (CE 1, CE 2, CE 3); the LCG Resource Broker as another route to CEs; and DIRAC storage, disk files served via gridftp, bbftp and rfio.
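A minimal sketch (invented API, not DIRAC code) of the pull model this architecture implies: a site agent polls the central job service for work matching its local CE, instead of a broker pushing jobs out:

```python
# Minimal sketch of the pull model: the agent at each DIRAC site fetches
# a matching job from the central Job Management Service and runs it on
# the local CE; unmatched jobs go back into the queue.
import queue

job_management_service = queue.Queue()   # stand-in for the central service
job_management_service.put({"id": 1, "requirements": "slc3"})

def site_agent(ce_platform):
    """Runs at the site: fetch a matching job, execute it locally."""
    try:
        job = job_management_service.get_nowait()
    except queue.Empty:
        return None
    if job["requirements"] == ce_platform:   # trivial matchmaking
        print(f"agent: running job {job['id']} on the local CE")
        return job
    job_management_service.put(job)          # not for us; put it back
    return None

site_agent("slc3")
```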
Phase 1 Completed
[Daily production-rate plot. Annotations, in sequence: DIRAC alone; LCG in action; 1.8 × 10⁶ events/day; LCG paused; 3-5 × 10⁶ events/day; LCG restarted. Total: 186 M produced events.]
LCG Performance (I)
[Plot of LCG jobs over time: submitted jobs, cancelled jobs, and jobs aborted before running]

211 k submitted jobs. After running: 113 k done (successful), 34 k aborted.
LCG Performance (II)
LCG Job Submission Summary Table

                    Jobs (k)   % of Submitted   % of Remaining
Submitted              211        100.0%
Cancelled               26         12.2%
Remaining              185         87.8%          100.0%
Aborted (not run)       37         17.6%           20.1%
Running                148         70.0%           79.7%
Aborted (run)           34         16.2%           18.5%
Done                   113         53.8%           61.2%
Retrieved              113         53.8%           61.2%

LCG efficiency: 61%
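Cross-checking the quoted efficiency from the table (my arithmetic, using the rounded k-counts):

```python
# "Done" jobs as a fraction of jobs remaining after cancellations.
submitted, cancelled, done = 211, 26, 113
remaining = submitted - cancelled       # 185 k
print(f"{done / remaining:.1%}")        # -> 61.1%, the quoted ~61%
print(f"{done / submitted:.1%}")        # -> ~53.6% of all submitted jobs
```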
LHCb DC’04 Status
LHCb DC'04 Phase 1 is over.

The production target has been achieved:
– 186 M events in 424 CPU-years
– ~50% on LCG resources (75-80% in the last weeks)

The right LHCb strategy:
– Submitting "empty" DIRAC agents to LCG has proven to be very flexible, allowing a good success rate (sketched below)

Big room for improvements, both in DIRAC and LCG:
– DIRAC needs to improve the reliability of its servers:
  • A big step was already taken during the DC
– LCG needs improvement in single-job efficiency:
  • ~40% aborted jobs
– In both cases, extra protections against external failures (network, unexpected shutdowns, …) must be built in

Congratulations and warm thanks to the complete LCG team for their support and dedication
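A hedged sketch of that "empty agent" (pilot) strategy, with invented function names: the payload is pulled only after the LCG slot proves healthy, so slot failures waste a pilot rather than a production job:

```python
# Sketch of an "empty" agent: the job submitted to LCG carries no
# physics payload; once running on a healthy worker node, it asks the
# DIRAC services for real work.
def pilot_main(fetch_task, report):
    """Body of the 'empty' job actually submitted to the LCG RB."""
    if not worker_node_ok():
        return                  # bad slot: no production job is lost
    task = fetch_task()         # pull real work from DIRAC
    if task is None:
        return                  # nothing to do; exit cleanly
    status = run(task)          # simulate/reconstruct events
    report(task, status)        # bookkeeping back to DIRAC

def worker_node_ok():
    import shutil
    return shutil.disk_usage(".").free > 2 * 1024**3  # e.g. >2 GB scratch

def run(task):
    print(f"running {task}")
    return "done"

# Hypothetical task name, for illustration only.
pilot_main(lambda: "simulation-chunk-17", lambda t, s: print(t, s))
```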
Personal Observations on Data Challenge Results

Tier-0 operations at 25% scale demonstrated
– Job couplings from the Objectivity era: gone

Directed data flow/management T0 → T1 → T2 worked (intermittently)

Massive simulation on LCG, Grid3 and NorduGrid worked

Beginning to get experience with input-data-intensive jobs

Not many users out there yet stressing the chaotic side
– The next 6 months are critical; we have to see broad and growing adoption. Not having a personal grid user certificate will have to seem odd

Many problems are classical computer-center ones
– Full disks, reboots, SW installation, dead disks, …
– Actually this is bad news: there is no middleware silver bullet, just hard work getting so many centers up to the required performance
Critical Issues for early 2005
Data Management
– Building experiment data-management solutions

Demonstrating end-user access to remote resources
– Data and processing

Managing conditions and calibration databases
– And their global distribution

Managing network expectations
– Analysis can place (currently) impossible loads on network and DM components
  • Planning for the future, while maintaining priority controls

Determining the pragmatic mix of Grid responsibilities and experiment responsibilities
– Recall the "Data" in DataGrid; LHC is data-intensive
– Configuring the experiment and grid software to use generic resources is wise
– But (I think) data location will require a more ordered approach in practice