
1

LHCb on the Grid

Raja Nandakumar (with contributions from Greig Cowan)

GridPP21 3rd September 2008

2

LHCb computing model

➟ CERN (Tier-0) is the hub of all activity
   Full copy at CERN of all raw data and DSTs
   All T1s have a full copy of DSTs
➟ Simulation at all possible sites (CERN, T1, T2)
   LHCb has used about 120 sites on 5 continents so far
➟ Reconstruction, stripping and analysis at T0 / T1 sites only
   Some analysis may be possible at "large" T2 sites in the future
➟ Almost all the computing (except for development / tests) will be run on the grid
   Large productions: production team
   Ganga (DIRAC) grid user interface
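As a rough aid, the model on this slide can be restated as a small data structure; the Python sketch below is only an illustration of the mapping above (its names and layout are ours, not DIRAC's).

    # Illustrative restatement of the computing model as data.  The structure and
    # names are ours for the sketch, not part of DIRAC or LHCb software.

    ACTIVITY_TIERS = {
        "simulation":     ["Tier-0", "Tier-1", "Tier-2"],  # ~120 sites on 5 continents so far
        "reconstruction": ["Tier-0", "Tier-1"],
        "stripping":      ["Tier-0", "Tier-1"],
        "analysis":       ["Tier-0", "Tier-1"],            # "large" Tier-2s possible later
    }

    DATA_COPIES = {
        "RAW": ["CERN"],                    # full copy of all raw data at CERN
        "DST": ["CERN", "all Tier-1s"],     # every Tier-1 holds a full DST copy
    }

    def can_run(activity, tier):
        """Return True if the model allows this activity at the given tier."""
        return tier in ACTIVITY_TIERS.get(activity, [])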

3

LHCb on the grid

Small amount of activity over past year
▓ DIRAC3 has been under development
▓ Physics groups have not asked for new productions
▓ Situation has changed recently...

4

LHCb on the grid

➟ DIRAC3
   Nearing stable production release
   ▓ Extensive experience with CCRC08 and follow-up exercises
   ▓ Used as THE production system for LHCb
   Now testing of the interfaces by Ganga developers
➟ Generic pilot agent framework
   Critical problems found with the gLite WMS 3.0, 3.1
   ▓ Mixing of VOMS roles under certain reasonably common conditions (illustrated in the sketch below)
      Cannot have people with different VOMS roles!
   ▓ Savannah bug #39641
   ▓ Being worked on by developers
   Waiting for this to be solved before restarting tests
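To picture the role-mixing problem: a generic pilot should only pull payloads whose VOMS role matches the FQAN of its own proxy. The Python sketch below is a simplified illustration of that check, not DIRAC's actual matcher; the task-queue format and helper names are assumptions, and only the voms-proxy-info call is a real client tool.

    # Simplified illustration of why VOMS role mixing matters for generic pilots.
    # This is NOT DIRAC code: the task-queue layout and helper names are invented
    # for the sketch; only the voms-proxy-info client call is a real tool.

    import subprocess

    def pilot_fqans():
        """Read the VOMS FQANs of the pilot's own proxy (e.g. '/lhcb/Role=pilot')."""
        out = subprocess.run(["voms-proxy-info", "-fqan"],
                             capture_output=True, text=True, check=True)
        return [line.strip() for line in out.stdout.splitlines() if line.strip()]

    def match_payload(task_queue, fqans):
        """Pull the first waiting payload whose owner FQAN matches the pilot proxy.

        If the WMS mixes roles, a pilot carrying one role can end up with a
        payload that needs a different role, which then fails -- hence the bug
        above blocks the generic pilot tests.
        """
        for job in task_queue:          # e.g. [{"id": 1, "fqan": "/lhcb/Role=user"}, ...]
            if job["fqan"] in fqans:
                return job
        return None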

5

DIRAC3 Production

>90,000 jobs in past 2 months

Real production activity and testing of gLite WMS

6

DIRAC3 Job Monitor

https://lhcbweb.pic.es/DIRAC/jobs/JobMonitor/display

7

LHCb storage at RAL

➟ LHCb storage primarily on the Tier-1s and CERN
➟ CASTOR used as storage system at RAL
   Fully moved out of dCache in May 2008
   ▓ One tape damaged and the file on it marked lost
   Was stable (more or less) until 20 Aug 2008
   ▓ Has not been able to take a great load on the servers
      Low upper limit (8) on LSF job slots on various CASTOR disk servers
      Too many jobs (>500) can come into the batch system; the concerned service class then hangs
      Temporarily fixed for now; needs to be monitored (probably by the shifter on duty? - a sketch of such a check follows below)
      » Increase limit to >100 rfio jobs per server
      » Not all hardware can handle a limit of 200 jobs (they start using swap space)
      Problem seen many times now over the last few months
   ▓ CASTOR now in downtime
   ▓ This is worrying given how close we are to data taking
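One possible form of the shifter check mentioned above, as a hedged Python sketch; the thresholds are the numbers quoted on this slide, while the way the per-server job counts are obtained is left as a stub because it depends on the local CASTOR/LSF monitoring.

    # Hedged sketch of a per-diskserver load check using the numbers quoted above
    # (limit of 8 LSF slots today, a proposed limit of >100 rfio jobs per server,
    # ~200 as a ceiling beyond which some hardware starts swapping).  How the job
    # counts are collected is site-specific, so running_rfio_jobs() is only a stub.

    PROPOSED_LIMIT = 100   # rfio jobs per server, as proposed on the slide
    HARD_CEILING = 200     # some hardware starts using swap space beyond this

    def running_rfio_jobs(diskserver):
        """Stub: return the number of rfio jobs on a disk server (site-specific query)."""
        raise NotImplementedError

    def check_service_class(diskservers):
        """Warn the shifter when a disk server approaches the limits above."""
        for ds in diskservers:
            n = running_rfio_jobs(ds)
            if n >= HARD_CEILING:
                print(f"ALARM {ds}: {n} rfio jobs - risk of swapping / hung service class")
            elif n >= PROPOSED_LIMIT:
                print(f"WARN  {ds}: {n} rfio jobs - above the proposed per-server limit")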

8

LHCb at RAL

➟ Move to srm-v2 by LHCb
   Needed to retire srm-v1 endpoints and hardware at RAL
   When DIRAC3 becomes the baseline for user analysis
   ▓ Already used for almost all production
   ▓ Ganga working on submitting through DIRAC3
   ▓ Needs LHCb also to rename files in the LFC (see the sketch after this slide's bullets)
   All space tokens, etc. have been set up
   Target: turn off srm-v1 access by end of September
➟ Currently use srm-v1 for user analysis
   ▓ DIRAC2 does not support srm-v2
➟ Batch system: pausing of jobs during downtime?
   ▓ Not clear about the status of this
   For now, stop the batch system from accepting LHCb jobs a few hours before scheduled downtimes
   ▓ No LHCb job should run for >24 hours
   Announce beginning and end of downtimes
   ▓ Problems with broadcast tools
   ▓ GGUS ticket opened by Derek Ross
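The LFC renaming mentioned above is essentially a bulk old-LFN to new-LFN mapping applied with the standard LFC client; the Python sketch below assumes an lfc-rename command is available and that the mapping is supplied as a two-column text file (both assumptions, not an existing LHCb tool).

    # Minimal sketch of a bulk LFC rename, assuming the standard lfc-rename client
    # and a two-column mapping file ("old_lfn new_lfn" per line).  The file format,
    # the dry-run behaviour and the script itself are assumptions, not an LHCb tool.

    import subprocess
    import sys

    def bulk_rename(mapping_file, dry_run=True):
        with open(mapping_file) as f:
            for line in f:
                if not line.strip() or line.startswith("#"):
                    continue
                old_lfn, new_lfn = line.split()
                if dry_run:
                    print(f"would rename {old_lfn} -> {new_lfn}")
                else:
                    # lfc-rename takes the old and the new LFN as its two arguments
                    subprocess.run(["lfc-rename", old_lfn, new_lfn], check=True)

    if __name__ == "__main__":
        bulk_rename(sys.argv[1], dry_run="--apply" not in sys.argv)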

9

LHCb and CCRC08

➟ Planned tasks: test the LHCb computing model
   Raw data distribution from the pit to the T0 centre
   ▓ Use of rfcp into CASTOR from the pit - T1D0
   Raw data distribution from T0 to T1 centres
   ▓ Use of FTS - T1D0
   Reconstruction of raw data at CERN & T1 centres
   ▓ Production of rDST data - T1D0
   ▓ Use of SRM 2.2
   Stripping of data at CERN & T1 centres
   ▓ Input data: RAW & rDST - T1D0
   ▓ Output data: DST - T1D1
   ▓ Use of SRM 2.2
   Distribution of DST data to all other centres
   ▓ Use of FTS
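For clarity, the plan above can be summarised as a step / mechanism / storage-class table; the short Python sketch below restates it as data (the layout is ours, and the storage class of the final distribution step is inferred from the DST class above).

    # The CCRC08 data flow restated as plain data: each step paired with the
    # transfer/access mechanism and the storage class quoted on the slide.  The
    # layout is ours; the storage class of the final distribution step is inferred
    # from the DST class above.

    CCRC08_FLOW = [
        # (step,                                  mechanism,           storage class)
        ("RAW distribution, pit -> Tier-0",       "rfcp into CASTOR",  "T1D0"),
        ("RAW distribution, Tier-0 -> Tier-1s",   "FTS",               "T1D0"),
        ("Reconstruction at CERN & Tier-1s",      "SRM 2.2",           "T1D0 (rDST)"),
        ("Stripping at CERN & Tier-1s",           "SRM 2.2",           "T1D1 (DST)"),
        ("DST distribution to all other centres", "FTS",               "T1D1 (DST)"),
    ]

    for step, mechanism, storage in CCRC08_FLOW:
        print(f"{step:40s} | {mechanism:18s} | {storage}")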

10

LHCb and CCRC08

[Plots: Reconstruction and Stripping]

11

LHCb CCRC08 Problems

➟ CCRC08 highlighted areas to be improved
   File access problems
   ▓ Random or permanent failure to open files using gsidcap
      Request IN2P3 and NL-T1 to allow the dcap protocol for local read access
      Now using xroot at IN2P3 – appears to be successful
   ▓ Wrong file status returned by the dCache SRM after a put
      bringOnline was not doing anything
   Software area access problems
   ▓ Site banned for a while until the problem is fixed
   Application crashes
   ▓ Fixed with new software release and deployment
   Major issues with LHCb bookkeeping
   ▓ Especially for stripping
➟ Lessons learned
   Better error reporting in pilot logs and workflow
   Alternative forms of data access needed in emergencies (see the sketch below)
   ▓ Downloading of files to the WN (used at IN2P3, RAL)
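The fallback strategy behind the last lesson can be sketched as a simple protocol chain: try the site's preferred local read protocol (xroot, dcap or gsidcap) and, if all fail, copy the file to the worker node. The Python sketch below is only an illustration with placeholder helpers, not LHCb's actual job wrapper.

    # Illustration of the data-access fallback suggested by the "lessons learned"
    # bullet: try the local read protocols first, then download the file to the
    # worker node.  open_with_protocol() and copy_to_worker_node() are invented
    # placeholders, not LHCb or DIRAC functions.

    class DataAccessError(Exception):
        pass

    def open_with_protocol(lfn, protocol):
        """Placeholder: open an input file via gsidcap / dcap / xroot."""
        raise NotImplementedError

    def copy_to_worker_node(lfn):
        """Placeholder: stage the file into the job's local scratch area."""
        raise NotImplementedError

    def open_input_file(lfn, protocols=("xroot", "dcap", "gsidcap")):
        """Try remote-read protocols in order, then fall back to a local copy."""
        for protocol in protocols:
            try:
                return open_with_protocol(lfn, protocol)
            except DataAccessError:
                continue        # random/permanent open failures were seen in CCRC08
        # emergency fallback used at IN2P3 and RAL: download the file to the WN
        return copy_to_worker_node(lfn)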

12

LHCb Grid Operations

➟ Grid Operations and Production team has been created

13

Communications

➟ LHCb sites
   Grid operations team keeps track of problems
   Reports to sites via GGUS and eLogger
   ▓ All posts are reported on [email protected]
   ▓ Please subscribe if you want to know what is going on
➟ LHCb users
   Mailing lists
      [email protected] - all problems directed here
   ▓ Specific lists for each LHCb application and Ganga
   Ticketing systems (Savannah, GGUS) for DIRAC, Ganga and the applications
   ▓ Used by developers and "power" users
   Software weeks provide training sessions for using Grid tools
   Weekly distributed analysis meetings (starts Friday)
   ▓ DIRAC, Ganga and core software developers along with some users
   ▓ Aims to identify needs and coordinate release plans
   http://lblogbook.cern.ch/Operations (RSS feed available)

14

Summary

➟ Concerned about CASTOR stability close to data taking

➟ DIRAC3 workload and data management system now online
   Has been extensively tested when running LHCb productions
   Now moving it into the user analysis system

▓ Ganga needs some additional development

➟ Grid operations team working with sites, users and devs to identify and resolve problems quickly and efficiently

➟ LHCb looking forward to the imminent switch-on of the LHC!

15

Backup - CCRC08 Throughput