Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge
D. Bonacorsi (on behalf of INFN-CNAF Tier-1 staff and the CMS experiment)
ACAT 2005 - X Int. Workshop on Advanced Computing & Analysis Techniques in Physics Research
May 22nd-27th, 2005 - DESY, Zeuthen, Germany
Outline
The past: the CMS operational environment during the Data Challenge
 focus on INFN-CNAF Tier-1 resources and set-up
The present: the lessons learned from the challenge
The future: … try to apply what we (think we) learned…
The INFN-CNAF Tier-1
Located at the INFN-CNAF centre in Bologna (Italy): the computing facility for the INFN HENP community
 one of the main nodes of the GARR network
A multi-experiment Tier-1: LHC experiments + AMS, Argo, BaBar, CDF, Magic, Virgo, …
 evolution: dynamic sharing of resources among the involved exps
CNAF is a relevant Italian site from a Grid perspective: participating in the LCG, EGEE, INFN-GRID projects; support to R&D activities, developing/testing prototypes/components; “traditional” access to resources is also granted, but is more ‘manpower-consuming’
Tier-1 resources and services
computing power
 CPU farms for ~1300 kSI2k + a few dozen servers
 • dual-processor boxes [320 @ 0.8-2.4 GHz, 350 @ 3 GHz], hyper-threading activated
storage
 on-line data access (disks): IDE, SCSI, FC; 4 NAS systems [~60 TB], 2 SAN systems [~225 TB]
 custodial task on MSS (tapes in the Castor HSM system): STK L180 lib, overall ~18 TB; STK 5500 lib, 6 LTO-2 drives [~240 TB] + 2 9940B [~136 TB] (more to be installed; capacities summed up below)
networking
 T1 LAN: rack FE switches with 2×Gbps uplinks to the core switch (disk servers via GE to the core); upgrade to rack Gb switches foreseen
 1 Gbps T1 link to WAN (+1 Gbps is for the Service Challenge), will be 10 Gbps [Q3 2005]
More: infrastructure (electric power, UPS, etc.); system administration, database services administration, etc.; support to experiment-specific activities; coordination with Tier-0, other Tier-1’s, and Tier-n’s (n>1)
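For orientation, the quoted tape capacities add up to roughly (a back-of-the-envelope tally, taking the figures above at face value):

\[
\underbrace{18~\text{TB}}_{\text{L180}} + \underbrace{240~\text{TB}}_{6\times\text{LTO-2}} + \underbrace{136~\text{TB}}_{2\times\text{9940B}} \approx 394~\text{TB}
\]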
The CMS Data Challenge: what and how
CMS Pre-Challenge Production (PCP): up to digitization (needed as input for the DC); mainly non-grid productions…
• …but also grid prototypes (CMS/LCG-0, LCG-1, Grid3)
Generation → Simulation → Digitization: ~70M Monte Carlo events produced (20M with Geant-4), 750K jobs run, 3500 KSI2000 months, 80 TB of data
CMS Data Challenge (DC04): reconstruction and analysis on CMS data, sustained over 2 months at 5% of the LHC rate at full luminosity (25% of the start-up lumi)
• sustain a 25 Hz reconstruction rate in the Tier-0 farm
• register data and metadata to a world-readable catalogue
• distribute reconstructed data from Tier-0 to Tier-1/2’s
• analyze reconstructed data at the Tier-1/2’s as they arrive
• monitor/archive information on resources and processes
not a CPU challenge: aimed at demonstrating the feasibility of the full Reconstruction → Analysis chain
Validate the CMS computing model on a sufficient number of Tier-0/1/2’s: a large-scale test of the computing/analysis models
PCP set-up: a hybrid model
[Diagram: a Phys. Group asks for a new dataset in RefDB; the Production Manager defines assignments; a Site Manager starts an assignment via McRunjob, which prepares jobs either as shell scripts for a Local Batch Manager on a computer farm, as JDL for the Grid (LCG) Scheduler on LCG-x (with the RLS), or as a DAG for DAGMan (MOP) with Chimera VDL, a Virtual Data Catalogue and a Planner on Grid3; dataset metadata live in RefDB (data-level queries) and job metadata in the BOSS DB (job-level queries); arrows distinguish pushed data/info from pulled info]
PCP grid-based prototypes
EU-CMS: submit to LCG scheduler, CMS-LCG “virtual” Regional Center
 0.5 Mevts Generation [“heavy” pythia] (~2000 jobs, ~8 hours* each, ~10 KSI2000 months)
 ~2.1 Mevts Simulation [CMSIM+OSCAR] (~8500 jobs, ~10 hours* each, ~130 KSI2000 months)
 • CMSIM: ~1.5 Mevts on CMS/LCG-0; OSCAR: ~0.6 Mevts on LCG-1
 ~2 TB data
 (*) on a PIII 1 GHz
constant work of integration in CMS between the CMS software and production tools and the evolving EDG-X / LCG-Y middleware, in several phases:
 CMS “Stress Test” with EDG < 1.4, then:
 PCP on the CMS/LCG-0 testbed
 PCP on LCG-1
 … towards DC04 with LCG-2
Strong INFN contribution to the crucial PCP production, in both “traditional” production and the Grid-based prototypes:
 CMS prod. steps: INFN/CMS share [%]
 Generation: 13%
 Simulation: 14%
 Hitformatting: 21%
 Digitisation: 18%
Global DC04 layout and workflow
[Diagram: at the Tier-0, a fake on-line process feeds the IB; ORCA RECO jobs, driven by RefDB, run on the GDB with the LCG-2 services and register to the POOL RLS catalogue; T0 data distribution agents, steered via the TMDB, export data from the disk-SE EBs to the Tier-1’s (Castor-SE, disk-SE), and T1 data distribution agents move it on to the T2 disk-SEs, where physicists run ORCA jobs; Castor MSS sits behind the Tier-0 and the Tier-1]
Hierarchy of RCs & data distribution chains: 3 distinct scenarios deployed and tested
INFN-specific DC04 workflow
[Diagram: a TRA-Agent drains the disk-SE Export Buffer into the T1 Castor SE, backed by the LTO-2 tape library; a REP-Agent replicates the data to the T1 disk-SE and to the Legnaro T2 disk-SE; a SAFE-Agent follows the tape migration; the agents query/update the Transfer Management DB and a local MySQL db, with separate data-flow paths]
Basic issues addressed at T1:
 data movement T0→T1
 data custodial task: interface to MSS (the SAFE-labelling is sketched below)
 data movement T1→T2 for “real-time analysis”
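To make the custodial rule concrete, here is a minimal sketch of the SAFE-labelling logic (the table, the state names and the stubbed tape check are illustrative assumptions, not the actual DC04 agent code):

```python
# Sketch: a file may be marked "SAFE" in the Transfer Management DB only
# once the HSM confirms it reached tape. Schema and states are invented.
import sqlite3

def migrated_to_tape(lfn: str) -> bool:
    # Stub: the real SAFE-Agent asked the Castor stager whether the
    # file's tape copy existed before declaring it safe.
    return True

def safe_agent_pass(con: sqlite3.Connection) -> None:
    # Files already copied to the T1 Castor SE but not yet acknowledged:
    rows = con.execute(
        "SELECT id, lfn FROM transfers WHERE state = 'at_castor_se'"
    ).fetchall()
    for file_id, lfn in rows:
        if migrated_to_tape(lfn):
            con.execute("UPDATE transfers SET state = 'SAFE' WHERE id = ?",
                        (file_id,))
    con.commit()
```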
An example: data flow during just 1 day of DC04 (Apr 19th)
[Monitoring plots for Apr 19th: CNAF T1 disk-SE and CNAF T1 Castor SE (eth I/O input from the SE-EB, in green, and from the Castor SE; TCP connections; RAM memory); Legnaro T2 disk-SE (eth I/O input from the Castor SE)]
DC04 outcome (grand summary + focus on INFN T1)
reconstruction/data-transfer/analysis may run at 25 Hz
automatic registration and distribution of data, key role of the TMDB
 it was the embryonic PhEDEx!
support a (reasonable) variety of different data transfer tools and set-ups
 Tier-1’s: different performances, related to operational choices; SRB, LCG Replica Manager and SRM investigated: see the CHEP04 talk
 INFN T1: good performance of the LCG-2 chain (PIC T1 also)
register all data and metadata (POOL) to a world-readable catalogue
 RLS: good as a global file catalogue, bad as a global metadata catalogue
analyze the reconstructed data at the Tier-1’s as the data arrive
 LCG components: dedicated bdII+RB; UIs, CEs+WNs at CNAF and PIC
 real-time analysis at the Tier-2’s was demonstrated to be possible: ~15k jobs submitted
 the time window between reco-data availability and the start of the analysis jobs can be reasonably low (i.e. ~20 mins)
reduce the number of files (i.e. increase <#events>/<#files>): more efficient use of bandwidth, reduced command overhead
address the scalability of MSS systems (!)
Learn from DC04 lessons…
Some general considerations may apply:
 although a DC is experiment-specific, maybe its conclusions are not
 an “experiment-specific” problem is better addressed if conceived as a “shared” one in a shared Tier-1
 an experiment DC just provides hints; real work gives insight
Crucial role of the experiments at the Tier-1:
• find weaknesses of the CASTOR MSS system in particular operating conditions
• stress-test the new LSF farm with official CMS production jobs
• test DNS-based load-balancing by serving data for production and/or analysis from the CMS disk-servers (see the sketch after this list)
• test new components, newly installed/upgraded Grid tools, etc.
• find bottlenecks and scalability problems in DB services
• give feedback on monitoring and accounting activities
• …
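As a toy illustration of what DNS-based load-balancing buys (a minimal sketch: the alias name and the gridftp port are assumptions, and the actual balancing decision lives in the DNS server, not in the client):

```python
# Sketch: clients resolve one load-balanced alias; DNS hands back the
# disk-server addresses in rotating order, spreading the load.
import socket
from collections import Counter

ALIAS = "cms-diskserv.example.cnaf.infn.it"  # hypothetical alias

def pick_server(alias: str, port: int = 2811) -> str:
    # Take the first A record, as most clients do; with round-robin DNS,
    # successive lookups return the server list in a different order.
    infos = socket.getaddrinfo(alias, port, proto=socket.IPPROTO_TCP)
    return infos[0][4][0]

# Crude check of how evenly 100 clients would land on the servers:
hits = Counter(pick_server(ALIAS) for _ in range(100))
for addr, n in hits.items():
    print(f"{addr}: {n} clients")
```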
T1 today: farming. What changed since DC04?
[Plot: farm batch occupancy over time: running jobs, pending jobs, total nb. of jobs, max nb. of slots]
Analysis: “controlled” and “fake” (DC04) vs. “unpredictable” and “real” (now)
T1 provides one full LCG site + 2 dedicated RBs/bdII + support to CRAB users
Interoperability: always an issue, even harder in a transition period
 dealing with ~2-3 sub-farms in use by ~10 exps (in production)
 resource use optimization: still to be achieved
Migration in progress:
 OS: RH v.7.3 → SLC v.3.0.4
 middleware: upgrade to LCG v.2.4.0
 install/manage WNs/servers: lcfgng → Quattor (LCG-Quattor integration)
 batch scheduler: Torque+Maui → LSF v.6.0, with queues for prod/anal and managed Grid interfacing
see [N. De Filippis, session II, day 3]
T1 today: storage. What changed since DC04?
Storage issues (1/2): disks
 driven by the requirements of LHC data processing at the Tier-1, i.e. simultaneous access to ~PBs of data from ~1000 nodes at high rate
 main focus is on robust, load-balanced, redundant solutions that grant proficient and stable data access to distributed users, namely: “make both sw and data accessible from jobs running on WNs”
 • remote access (gridftp) and local access (rfiod, xrootd, GPFS) services; afs/nfs to share exps’ sw on WNs; filesystem tests; specific problem solving in analysts’ daily operations; CNAF participation in SC2/3, etc.
 a SAN approach with a parallel filesystem on top looks promising
Storage issues (2/2): tapes
 CMS DC04 helped to focus some problems:
 • LTO-2 drives not efficiently used by exps in production at the T1: performance degradation increases as file size decreases; hangs on locate/fskip after ~100 non-sequential reads; non-full tapes are labelled ‘RDONLY’ after only 50-100 GB written
 • CASTOR performance increases with clever pre-staging of files: some reliability achieved only on sequential/pre-staged reading
 solutions?
 • from the HSM sw side: fix coming with CASTOR v.2 (Q2 2005)?
 • from the HSM hw side: test 9940B drives in production (see PIC T1)
 • from the exp side: explore possible solutions
 ▪ e.g. file-merging in coupling the PhEDEx tool to the CMS production system (see the sketch below)
 ▪ e.g. a pure-disk buffer in front of the MSS, disentangled from CASTOR
see [P.P. Ricci, session II, day 3]
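A minimal sketch of the file-merging idea mentioned above (the ~2 GB target and all names are assumptions; the real mechanism was the coupling of PhEDEx to the CMS production system, not this code):

```python
# Sketch: pack many small event files into few large, tape-friendly
# archives before MSS migration, keeping a map so originals stay findable.
import os
import tarfile

TARGET_SIZE = 2 * 1024**3  # assumed ~2 GB per tape-bound file

def merge_for_tape(small_files, out_prefix):
    """Group files into tar archives of roughly TARGET_SIZE; return
    {original_path: archive_path} for the catalogue update."""
    mapping, batch, batch_size, n = {}, [], 0, 0
    for path in small_files:
        batch.append(path)
        batch_size += os.path.getsize(path)
        if batch_size >= TARGET_SIZE:
            n = _flush(batch, out_prefix, n, mapping)
            batch, batch_size = [], 0
    if batch:
        _flush(batch, out_prefix, n, mapping)
    return mapping

def _flush(batch, out_prefix, n, mapping):
    archive = f"{out_prefix}_{n:04d}.tar"
    with tarfile.open(archive, "w") as tar:  # plain tar, no compression
        for path in batch:
            tar.add(path, arcname=os.path.basename(path))
            mapping[path] = archive
    return n + 1
```

Fewer, larger files would let the LTO-2 drives stream sequentially instead of repositioning, which is exactly the failure mode listed above.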
CMS activities at the Tier-1
Current CMS set-up at the Tier-1
[Diagram: the Grid.it/LCG layer (CE, LSF, SEs) fronts the farm; storage comprises a Castor disk buffer + Castor MSS, production disks, analysis disks, and an Import-Export Buffer SE served by the PhEDEx agents; the CPUs (WNs) are logically grouped into a shared “overflow” pool and a CMS-local “core” pool, with remote “access” paths to storage; operations control runs through a gw/UI and a UI (local prod, PhEDEx agents, Grid prod/anal), plus resource-management “control” paths]
PhEDEx in CMS
PhEDEx (Physics Experiment Data Export), used by CMS as the overall infrastructure for data transfer management: allocation and transfers of CMS physics data among the Tier-0/1/2’s
• different datasets move on bidirectional routes among the Regional Centers
• data should reside on SEs (e.g. gsiftp or srm protocols)
components:
 the TMDB, inherited from DC04
 • files, topology, subscriptions...
 a coherent set of sw agents, loosely coupled, inter-operating and communicating through the TMDB blackboard
 • e.g. agents for data allocation (based on site data subscriptions), file import/export, migration to MSS, routing (based on the implemented topologies), monitoring, etc. (see the sketch after this list)
born, and growing fast… >70 TB known to PhEDEx, >150 TB total replicated
INFN T1 mainly on data transfer… and on prod/anal
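To make the blackboard pattern concrete, here is a toy sketch of such an agent loop (SQLite stands in for the central TMDB; the table, states and polling interval are invented for illustration and are not the PhEDEx schema):

```python
# Toy blackboard: each agent polls the shared DB for files in "its"
# state, acts on them, and advances the state; agents never talk directly.
import sqlite3
import time

con = sqlite3.connect("tmdb_toy.db")
con.execute("""CREATE TABLE IF NOT EXISTS transfers
               (id INTEGER PRIMARY KEY, lfn TEXT, state TEXT, agent TEXT)""")

def agent_loop(name, from_state, to_state, action, polls=1):
    for _ in range(polls):  # a real agent would loop forever
        rows = con.execute("SELECT id, lfn FROM transfers WHERE state = ?",
                           (from_state,)).fetchall()
        for file_id, lfn in rows:
            action(lfn)  # e.g. a gridftp copy, an MSS migration, ...
            con.execute(
                "UPDATE transfers SET state = ?, agent = ? WHERE id = ?",
                (to_state, name, file_id))
            con.commit()
        time.sleep(1)  # loose coupling: just re-poll the blackboard

# e.g. an export agent:
agent_loop("export", "subscribed", "transferred",
           lambda lfn: print("copying", lfn))
```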
PhEDEx transfer rates T0→INFN T1
[Plots, weekly and daily views: CNAF T1 diskserver I/O; rate out of CERN Tier-0]
PhEDEx at INFN
INFN-CNAF is a T1 ‘node’ in PhEDEx; the CMS DC04 experience was crucial to start up PhEDEx in INFN
 CNAF node operational since the beginning
First phase (Q3/4 2004): agent code development + focus on operations
 T0→T1 transfers: >1 TB/day T0→T1 demonstrated feasible
 • … but the aim is not to achieve peaks, but to sustain them in normal operations
Second phase (Q1 2005): PhEDEx deployment in INFN to the Tier-n’s, n>1
 “distributed” topology scenario
 • Tier-n agents run at the remote sites, not at the T1: know-how required, T1 support
 already operational at Legnaro, Pisa, Bari, Bologna
Third phase (Q>1 2005): many issues, e.g. stability of service, dynamic routing, coupling PhEDEx to the CMS official production system, PhEDEx involvement in SC3 phase II, etc.
An example: data flow to T2’s in daily operations (here: a test with ~2000 files, 90 GB, with no optimization): ~450 Mbps CNAF T1 → LNL T2, ~205 Mbps CNAF T1 → Pisa T2 (a quick consistency check follows)
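A rough consistency check on those figures (a sketch under the assumption that the quoted LNL rate was sustained for the whole 90 GB sample):

\[
\frac{90~\text{GB} \times 8~\text{bit/byte}}{450~\text{Mbit/s}} = \frac{720\,000~\text{Mbit}}{450~\text{Mbit/s}} = 1600~\text{s} \approx 27~\text{min}
\]

i.e. a no-optimization test of this size drains to a T2 in under half an hour.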
CMS Monte Carlo productions
The CMS production system is evolving into a permanent effort, with a strong contribution of the INFN T1 to CMS productions:
 252 ‘assignments’ in PCP-DC04, for all production steps [both local and Grid]
 plenty of assignments (simulation only) now running on LCG (Italy+Spain)
 • CNAF support for ‘direct’ submitters + backup SEs provided for Spain
 currently, digitization/DST production runs efficiently locally (mostly at the T1); the produced data are then injected into the CMS data distribution infrastructure
 future of T1 productions: rounds of “scheduled” reprocessing
DST production at INFN T1: ~11.8 Mevts produced out of ~12.9 Mevts assigned (ratio worked out below)
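Taking the two DST figures at face value, the completed fraction is

\[
\frac{11.8~\text{Mevts produced}}{12.9~\text{Mevts assigned}} \approx 0.91,
\]

i.e. roughly 91% of the assigned events were delivered.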
Coming next: Service Challenge (SC3)
data transfer and data serving in real use-cases: review the existing infrastructure/tools and give it a boost; details of the challenge are currently under definition
Two phases:
 Jul 05: SC3 “throughput” phase
 • Tier-0/1/2 simultaneous import/export, MSS involved; move real files, store on real hw
 >Sep 05: SC3 “service” phase: a small-scale replica of the overall system
 • modest throughput; the main focus is on testing in a quite complete environment, with all the crucial components
 • space for experiment-specific tests and inputs
Goals: test the crucial components, push to prod-quality, and measure: towards the next production service
INFN T1 participated in SC2, and is joining SC3
Conclusions
The INFN-CNAF T1 is quite young, but ramping up towards stable production-quality services: optimized use of resources + interfaces to the Grid; policy/HR to support the experiments at the Tier-1
The Tier-1 actively participated in CMS DC04: good hints, identified bottlenecks in managing resources, scalability, …
Learn the lessons: an overall revision of the CMS set-up at the T1, involving both Grid and non-Grid access; the first results are encouraging, with daily operations succeeding
 local/Grid productions + distributed analysis are running…
Go ahead: a long path… next step on it: preparation for SC3, also with CMS applications
Back-up slides
PhEDEx transfer rates T0→INFN T1
[Plots, weekly and daily views: CNAF T1 diskserver I/O; rate out of CERN Tier-0]
PhEDEx transfer rates T0→INFN T1
[Plots, weekly and daily views: CNAF T1 diskserver I/O; rate out of CERN Tier-0]
CNAF “autopsy” of DC04: lethal injuries only
Agents drain data from the SE-EB down to the CNAF/PIC T1’s and land directly on a Castor SE buffer; in DC04 these files turned out to be many and small. So: for any file on the Castor SE filesystem, a tape migration is foreseen with a given policy, regardless of size/number; this strongly affected data transfer at CNAF T1 (the MSS below is the STK tape lib with LTO-2 tapes).
Castor stager scalability issues: many small files (mostly 500 B-50 kB) in the stager db; bad performance of the stager db for >300-400k entries (may need more RAM?)
• CNAF fast set-up of an additional stager in DC04: basically worked
• the REP-Agent was cloned to transparently continue replication to the disk-SEs
Tape library LTO-2 issues: a high nb. of segments on tape leads to bad tape read/write performance, LTO-2 SCSI errors, repositioning failures, slow migration to tape with delays in the TMDB “SAFE”-labelling, and inefficient tape space usage
A-posteriori solutions: consider a disk-based Import Buffer in front of the MSS (a sketch of such a buffering policy follows)…
[see next slide]
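A minimal sketch of what such a disk-based Import Buffer policy could look like (thresholds, paths and the migrate() placeholder are assumptions, not the CNAF implementation):

```python
# Sketch: hold incoming files on disk and trigger tape migration only for
# large batches, so the drives see few, large, sequential writes instead
# of one migration per small file.
import os

MIN_BATCH_BYTES = 50 * 1024**3   # assumed: migrate in >= 50 GB chunks
BUFFER_DIR = "/import-buffer"    # hypothetical staging area

def migrate(paths):
    # Placeholder for the real MSS stage-out.
    print(f"migrating {len(paths)} files to tape")

def flush_if_ready(buffer_dir=BUFFER_DIR):
    paths = [os.path.join(buffer_dir, f) for f in os.listdir(buffer_dir)]
    total = sum(os.path.getsize(p) for p in paths)
    if total >= MIN_BATCH_BYTES:
        migrate(sorted(paths))  # sorted, so writes land in a tidy order
    # else: keep accumulating; small files never hit tape one by one
```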
CNAF “autopsy” of DC04: non-lethal injuries
constant and painful debugging…
minor (?) Castor/tape-library issues: Castor filename length (more info: Castor ticket CT196717); ext3 file-system corruption on a partition of the old stager; tapes blocked in the library
several crashes/hangs of the TRA-Agent (rate: ~3 times per week): created some backlogs from time to time, nevertheless fast to recover; post-mortem analysis in progress
experience with the Replica Manager interface: e.g. files of size 0 created at destination when trying to replicate from the Castor SE data which are temporarily not accessible due to stager (or other) problems on the Castor side; needs further tests to achieve reproducibility, and then Savannah reports
Globus-MDS Information System instabilities (rate: ~once per week): some temporary stops of data transfer (i.e. ‘no SE found’ means ‘no replicas’)
RLS instabilities (rate: ~once per week): some temporary stops of data transfer (cannot both list replicas and (de)register files)
SCSI driver problems on a CNAF disk-SE (rate: just once, but it affected the fake-analysis): disks mounted but no I/O; under investigation
CMS DC04: number and sizes of files
DC04 data time window: 51 (+3) days, March 11th - May 3rd
[Plots, May 1st-2nd: global CNAF network activity of ~340 Mbps (>42 MB/s) sustained for ~5 hours (max was 383.8 Mbps); >3k files for >750 GB]
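A quick unit check on the quoted rate, plus the average file size implied by that sample (both straight from the numbers above):

\[
\frac{340~\text{Mbit/s}}{8~\text{bit/byte}} = 42.5~\text{MB/s}, \qquad \frac{750~\text{GB}}{3000~\text{files}} \approx 250~\text{MB/file}
\]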
Description of RLS usage
[Diagram: the POOL RLS catalogue at the centre, surrounded by a Replica Manager, the RM/SRM/SRB EB agents with a Configuration agent, an XML Publication Agent with SRB GMCAT, and the TMDB; a CNAF RLS replica is kept via ORACLE mirroring. Workflow:]
1. register files
2. find the Tier-1 location (based on metadata)
3. copy/delete files to/from the export buffers
4. copy files to the Tier-1’s (Tier-1 Transfer agent)
5. submit the analysis job (Resource Broker, LCG ORCA Analysis Job)
6. process the DST and register private data (local POOL catalogue)
Specific client tools: POOL CLI, Replica Manager CLI, C++ LRC API based programs, LRC java API tools (SRB/GMCAT), Resource Broker
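A toy catalogue can illustrate why the RLS fared differently in its two roles (slide 11: good as a global file catalogue, bad as a global metadata catalogue); everything here is illustrative, not the RLS API, and the dataset name is made up:

```python
# Toy replica catalogue: LFN -> replicas is a cheap key lookup, while
# querying by metadata forces a full scan, the pattern that hurt at scale.
from collections import defaultdict

class ToyCatalogue:
    def __init__(self):
        self.replicas = defaultdict(list)   # LFN -> [PFN, ...]
        self.metadata = defaultdict(dict)   # LFN -> {attribute: value}

    def register(self, lfn, pfn, **attrs):
        self.replicas[lfn].append(pfn)
        self.metadata[lfn].update(attrs)

    def lookup(self, lfn):
        # File-catalogue use: key-based, fast.
        return self.replicas[lfn]

    def query(self, **attrs):
        # Metadata-catalogue use: scans every entry.
        return [lfn for lfn, md in self.metadata.items()
                if all(md.get(k) == v for k, v in attrs.items())]

cat = ToyCatalogue()
cat.register("/cms/dc04/evt001.root",
             "srm://t1.example.infn.it/evt001.root",  # hypothetical PFN
             dataset="dc04_example", step="DST")
print(cat.lookup("/cms/dc04/evt001.root"))
print(cat.query(dataset="dc04_example"))
```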
Tier-0 in DC04
Systems
• LSF batch system: 3 dedicated racks, 44 nodes each: tot 264 CPUs (dual P-IV Xeon 2.4 GHz, 1 GB mem, 100baseT); dedicated cmsdc04 batch queue, 500 RUN-slots
• Disk servers: DC04-dedicated stager, with 2 pools: IB and GDB, 10 + 4 TB
• Export Buffers: EB-SRM (4 servers, 4.2 TB total), EB-SRB (4 servers, 4.2 TB total), EB-SE (3 servers, 3.1 TB total)
Databases
• RLS (Replica Location Service)
• TMDB (Transfer Management DB)
Transfer steering
• agents steering data transfers, on a dedicated node (close monitoring…)
Monitoring services
Architecture built on:
[Diagram: a fake on-line process feeds the IB in Castor; RefDB drives ORCA RECO jobs on the GDB, with outputs registered in the POOL RLS catalogue; the Tier-0 data distribution agents, steered via the TMDB, export data to the EB]
CMS production tools (OCTOPUS)
RefDB: contains the production requests, with all the parameters needed to produce the dataset and the details about the production process
McRunJob: evolution of IMPALA, more modular (plug-in approach); a tool/framework for job preparation and job submission
BOSS: real-time job-dependent parameter tracking; the running job’s standard output/error are intercepted, and the filtered information is stored in the BOSS database; the remote updator is based on MySQL, but a remote updator based on R-GMA is being developed (a minimal sketch of the interception idea follows)
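A minimal sketch of that interception idea (the regex filters, the job command and the in-memory dict standing in for the BOSS MySQL database are all assumptions):

```python
# Sketch: run the job, watch its stdout line by line, and store whatever
# matches the user-declared filters for later job-level queries.
import re
import subprocess

FILTERS = {  # attribute -> regex capturing its value (assumed patterns)
    "events_done": re.compile(r"processed (\d+) events"),
    "run_number":  re.compile(r"run\s*=\s*(\d+)"),
}

def run_and_track(cmd):
    tracked = {}
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:                 # intercept stdout in real time
        for attr, rx in FILTERS.items():
            m = rx.search(line)
            if m:
                tracked[attr] = m.group(1)   # would be an UPDATE in MySQL
    proc.wait()
    return proc.returncode, tracked

# rc, info = run_and_track(["./orca_reco.sh"])  # hypothetical job wrapper
```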