Ian Bird, CERN IT
LCG Deployment Area Manager
EGEE Operations Manager
LCG Status Report
LHCC Open Session
CERN, 28th June 2006
Outline
Project status: organisation for Phase II
Applications Area
CERN Tier 0: Castor-2, Tier 0 infrastructure, LHC networking
Grid infrastructure status: Service Challenges – results and plans; regional centres; middleware status
Physics support & analysis
Summary
The Worldwide LHC Computing Grid
Purpose: develop, build and maintain a distributed computing environment for the storage and analysis of data from the four LHC experiments; ensure the computing service … and common application libraries and tools
Phase I – 2002-2005 – development & planning
Phase II – 2006-2008 – deployment & commissioning of the initial services
WLCG Collaboration
The Collaboration: ~130 computing centres; 12 large centres (Tier-0, Tier-1); 40-50 federations of smaller “Tier-2” centres; 29 countries
Memorandum of Understanding: agreed in October 2005, now being signed
Purpose: focuses on the needs of the 4 LHC experiments; commits resources (each October for the coming year, with a 5-year forward look); agrees on standards and procedures
Grid Deployment Board – chair: Kors Bos (NIKHEF)
With a vote: one person from a major site in each country; one person from each experiment. Without a vote: experiment computing coordinators; site service management representatives; Project Leader; Area Managers
Management Board – chair: Project Leader
Experiment computing coordinators; one person from the Tier-0 and each Tier-1 site; GDB chair; Project Leader; Area Managers; EGEE Technical Director
Architects Forum – chair: Pere Mato (CERN)
Experiment software architects; Applications Area Manager; Applications Area project managers
Collaboration Board – chair: Neil Geddes (RAL)
Sets the main technical directions. One person from the Tier-0 and each Tier-1 and Tier-2 (or Tier-2 federation); experiment spokespersons
Overview Board – chair: Jos Engelen (CERN CSO)
Committee of the Collaboration Board: oversees the project and resolves conflicts. One person from the Tier-0 and the Tier-1s; experiment spokespersons
More information on the collaboration
Boards and Committees
All boards except the OB have open access to agendas, minutes and documents
Planning data: MoU documents and resource data; Technical Design Reports; Phase 2 plans; status and progress reports; Phase 2 resources and costs at CERN
http://www.cern.ch/lcg
LCG Applications Area
Merge of SEAL and ROOT projects
Single team working together successfully for more than one year
~50% of SEAL functionality has been migrated to ROOT; in use by the experiments (will be in production for this year’s data challenges); what is left is easily maintainable (no new development)
Started to plan the migration of the second 50%: information collected from the experiments; detailed plan in preparation; in general no urgency from the experiments; will need to persuade experiments to migrate when the software is ready
AA Project status (1)
Software Process Infrastructure project (SPI): stable running and improvement of services (Savannah, HyperNews, software installations and software distributions); direct support to the experiments to provide complete software configurations; support for new platforms (SLC4, Mac OS X)
Core Libraries and Services project (ROOT): many developments for the integration of Reflex and CINT, with the new system planned for release this fall; consolidation of the new math libraries, plus new packages for multivariate analysis and Fast Fourier Transforms; many performance improvements in many areas, e.g. I/O and Trees (a small I/O sketch follows below); many new developments in PROOF (asynchronous queries, connect/disconnect mode, package manager, monitoring, etc.); improvements and new functionality in the GUI and graphics packages
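To make the ROOT I/O and TTree work above concrete, here is a minimal PyROOT sketch of writing and reading back a small tree. It assumes a local ROOT installation with Python bindings; the file, tree and branch names are invented for illustration and are not from the talk.

```python
# Minimal PyROOT sketch: write and read back a small TTree.
# Assumes ROOT with Python bindings is installed; all names are illustrative.
from array import array
import ROOT

# Write a toy tree with a single double-precision branch.
out = ROOT.TFile("toy_events.root", "RECREATE")
tree = ROOT.TTree("events", "toy events")
energy = array("d", [0.0])
tree.Branch("energy", energy, "energy/D")

rng = ROOT.TRandom3(42)
for _ in range(1000):
    energy[0] = rng.Gaus(50.0, 5.0)   # toy 'energy' value
    tree.Fill()
tree.Write()
out.Close()

# Read it back and fill a histogram.
f = ROOT.TFile("toy_events.root")
t = f.Get("events")
h = ROOT.TH1D("h_energy", "toy energy;E;entries", 50, 0.0, 100.0)
for event in t:
    h.Fill(event.energy)
print("entries:", int(h.GetEntries()), "mean:", round(h.GetMean(), 2))
f.Close()
```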
AA Project Status (2)
Persistency Framework project (POOL & COOL): CORAL, a reliable generic RDBMS interface for Oracle, MySQL, SQLite and FroNTier (LCG 3D project); provides database lookup, failover, connection pooling, authentication and monitoring (an illustrative failover sketch follows this slide); COOL and POOL can access all back-ends via CORAL; CORAL is also used as a separate package by ATLAS/CMS online; improved COOL versioning functionality (user tags and hierarchical tags)
Simulation project: improved tools for geometry model interchange (GDML); extended framework for interfacing test-beam simulations with Geant4 and Fluka, with physics analysis expected soon; considerable effort on the study of hadronic shower shapes to resolve discrepancies with test-beam data, and an improved regression suite to investigate and compare hadronic shower shapes; new C++ Monte Carlo generators (Pythia8, ThePEG/Herwig++) added to the generator library (GENSER); new precise elastic process for protons and neutrons in Geant4; new efficient method to detect overlaps in geometries when constructed, and support for parallel geometries; new Python interface module interfacing key Geant4 classes
CERN Tier 0 and LHC Networking Status
CERN CASTOR storage system
A CASTOR2 review took place at CERN on June 6th-9th. Members: John Harvey (CERN, chair), Miguel Branco (ATLAS), Don Petravick (FNAL), Shaun de Witt (RAL)
Details and the final report: http://indico.cern.ch/conferenceDisplay.py?confId=2916
Castor2 Highlights in 2006
ATLAS Tier 0 test in January at nominal rates (320 MB/s, no Tier 1 export)
Various large-scale data challenges: SC4 data export from a Castor2 disk pool at ~1.6 GB/s; Castor2 disk pool stress tests at 4.3 GB/s, cf. the expected load of 4.5 GB/s aggregate for all 4 experiments during pp running
Successful integration of 2 new tape storage systems from IBM and STK, with tested peak rates of 1.6 GB/s to tape
Successful transition of all 4 experiments from Castor1 to Castor2
Today ~1 PB of disk space in Castor2 disk pools, with ~2.5 million files on disk
Castor2 disk pool for CMS served analysis data successfully to ~1000 simultaneous clients with 1 GB/s aggregate performance
Second ATLAS Tier 0 test just started at nominal rates
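A quick back-of-the-envelope check on two of the figures above (the per-client rate and the stress-test margin). The calculation is only illustrative; it uses decimal units and the numbers quoted on the slide.

```python
# Back-of-the-envelope checks on the Castor2 figures quoted above.
aggregate_gb_s = 1.0          # CMS analysis pool aggregate throughput (GB/s)
clients = 1000                # simultaneous clients
per_client_mb_s = aggregate_gb_s * 1000 / clients
print(f"average per-client rate: ~{per_client_mb_s:.1f} MB/s")

stress_gb_s = 4.3             # measured disk-pool stress-test rate
expected_gb_s = 4.5           # expected aggregate load for 4 experiments (pp running)
print(f"stress test reached {stress_gb_s / expected_gb_s:.0%} of the expected pp load")
```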
Tier 0 ramp-up – 12,000 spinning disks, 2 PB of disk space

            May 2006            Sep 2006            Feb 2007
            space [TB] servers  space [TB] servers  space [TB] servers
Alice           78       20        231      ~60        500      -
Atlas          123       25        176      ~45        370      -
CMS            138       27        176      ~45        370      -
LHCb           121       26        188      ~45        370      -
total LHC      460       98        771     ~180       1610   ~480
SC4            187       40          -        -          -      -
ITDC           169       42        169       42        170    ~40
public           -        -          -        -       ~200   ~100
total          816      180        940      220      ~2000   ~600
Batch system
         boxes                  kSI2K
Today    2300                   4300
2007     -1000 +1200 = 2500     +5700 = 10000
2008                            25000
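As a quick consistency check of the disk ramp-up table above, the per-experiment disk-space figures can be summed and compared against the "total LHC" row. The sketch below only reproduces numbers already in the table.

```python
# Cross-check of the per-experiment disk-space figures in the table above (TB).
space_tb = {
    "ALICE": {"May 2006": 78,  "Sep 2006": 231, "Feb 2007": 500},
    "ATLAS": {"May 2006": 123, "Sep 2006": 176, "Feb 2007": 370},
    "CMS":   {"May 2006": 138, "Sep 2006": 176, "Feb 2007": 370},
    "LHCb":  {"May 2006": 121, "Sep 2006": 188, "Feb 2007": 370},
}
for phase in ("May 2006", "Sep 2006", "Feb 2007"):
    total = sum(per_expt[phase] for per_expt in space_tb.values())
    print(f"{phase}: total LHC experiment disk = {total} TB")
# Expected: 460, 771 and 1610 TB, matching the 'total LHC' row of the table.
```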
Computer Centre Electrical Infrastructure …
The new substation is operational
Two power cuts were caused by the new equipment in January (6th, 24th); the reasons were understood rapidly and fixed
Critical services were maintained as designed during the problems on May 16th; full services were back within 3 hours after power was restored
1st new UPS module being installed, to be commissioned by mid-July; no capacity increase, it replaces the current UPS only
Additional UPS capacity only at end-2006 – extremely tight schedule: requires removal/relocation of existing equipment from July 15th to August 15th, and a two-month period for the 2nd phase of foundation reinforcement
… and Cooling infrastructure
Work (much) delayed with respect to the initial plan: weather delays more than expected; many safety concerns
Three major cooling problems since end-March (and other minor problems); the focus has been to maintain critical lab services (network, admin services, email, …) – physics services were shut down to reduce the heat load
Production chillers being commissioned now: 1st unit in production June 19th, 2nd on June 23rd, 3rd unit in production by June 30th; final configuration in by mid-July
Future work: installation of sensors (2-3 per equipment row); completion of ducts on the right-hand (barn) side [4th chiller; yet to be funded]
The new European Network Backbone
LCG working group with Tier-1s and national/regional research network organisations
New GÉANT 2 – research network backbone
Strong correlation with major European LHC centres
Swiss PoP at CERN
Grid Infrastructure
LCG Service Hierarchy
Tier-0 – the accelerator centre: data acquisition & initial processing; long-term data curation; distribution of data to the Tier-1 centres
Tier-1 centres: Canada – TRIUMF (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands – Tier-1 (Amsterdam); Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Oxford); US – FermiLab (Illinois) and Brookhaven (NY)
Tier-1 – “online” to the data acquisition process, high availability: managed mass storage – grid-enabled data service; data-heavy analysis; national and regional support
Tier-2: ~120 centres (40-50 federations) in ~29 countries: simulation; end-user analysis – batch and interactive
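The tier roles above can be summarised as a small data structure, which is sometimes a convenient way to keep the architecture in view when writing site or workflow tooling. The sketch below only restates the roles listed on this slide.

```python
# Compact summary of the tier roles described above, as a small data structure.
TIER_ROLES = {
    "Tier-0": ["data acquisition & initial processing",
               "long-term data curation",
               "distribution of data to Tier-1 centres"],
    "Tier-1": ["'online' to the data acquisition process (high availability)",
               "managed mass storage (grid-enabled data service)",
               "data-heavy analysis",
               "national / regional support"],
    "Tier-2": ["simulation",
               "end-user analysis (batch and interactive)"],
}
for tier, roles in TIER_ROLES.items():
    print(tier)
    for role in roles:
        print("  -", role)
```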
LCG depends on 2 major science grid infrastructures …
The LCG service runs & relies on grid infrastructure provided by:
EGEE – Enabling Grids for E-sciencE; OSG – US Open Science Grid
EGEE Grid Sites: Q1 2006
EGEE: >180 sites, 40 countries; >24,000 processors, ~5 PB storage
[Charts: number of sites and CPU over the project lifetime] EGEE: steady growth over the lifetime of the project
A global, federated e-Infrastructure
EGEE infrastructure: ~200 sites in 39 countries; ~20,000 CPUs; >5 PB storage; >35,000 concurrent jobs per day; >60 Virtual Organisations
Related infrastructures shown: EUIndiaGrid, EUMedGrid, SEE-GRID, EELA, BalticGrid, EUChinaGrid, OSG, NAREGI
Use of the infrastructure
[Chart: K-Jobs/Day on the EGEE Grid, all VOs, Jun-05 to May-06; VOs shown include alice, atlas, cms, lhcb, geant4, biomed, compchem, egeode, egrid, esr, fusion, magic, ops, planck, dteam and other VOs]
More than 35K jobs/day on the EGEE Grid; the LHC VOs account for 30K jobs/day
Sustained & regular workloads of >35K jobs/day, spread across the full infrastructure; doubling/tripling in the last 6 months with no effect on operations
Several applications now depend on EGEE as their primary computing resource
Sustained & regular workloads of >35K jobs/day• spread across full infrastructure• doubling/tripling in last 6 months – no effect on operations
Several applications now depend on EGEE as their primary computing resource
EGEE Operations Process
Grid operator on duty: 6 teams working in weekly rotation (CERN, IN2P3, INFN, UK/I, Russia, Taipei); crucial in improving site stability and management; expanding to all ROCs in EGEE-II (a toy sketch of the rotation follows below)
Operations coordination: weekly operations meetings; regular ROC managers meetings; series of EGEE Operations Workshops (Nov 04, May 05, Sep 05, June 06)
Geographically distributed responsibility for operations: there is no “central” operation; tools are developed/hosted at different sites – GOC DB (RAL), SFT (CERN), GStat (Taipei), CIC Portal (Lyon)
Procedures described in the Operations Manual: introducing new sites; site downtime scheduling; suspending a site; escalation procedures; etc.
Highlights: distributed operation; evolving and maturing procedures; procedures being introduced into and shared with the related infrastructure projects
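The weekly operator-on-duty rotation can be sketched as a simple round-robin over the six teams. The start date below is an arbitrary assumption chosen for the example, not something stated in the talk.

```python
# Toy sketch of the weekly operator-on-duty rotation described above.
from datetime import date

TEAMS = ["CERN", "IN2P3", "INFN", "UK/I", "Russia", "Taipei"]
ROTATION_START = date(2006, 1, 2)   # assumed start (a Monday), for illustration only

def team_on_duty(day):
    """Round-robin assignment: one team per week since the rotation start."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return TEAMS[weeks_elapsed % len(TEAMS)]

print(team_on_duty(date(2006, 6, 28)))   # team on duty on the day of this talk
```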
Site testing
[Charts: daily site availability from the Site Functional Tests over one month, for a set of sites and for the average over all sites; per-site monthly averages range from about 58% to 89%, shown against a target]
Measuring response times and availability: Site Availability Monitor (SAM)
Based upon the Site Functional Test suite – monitoring services by running regular tests
Basic services: SRM, LFC, FTS, CE, RB, top-level BDII, site BDII, MyProxy, VOMS, R-GMA, …
VO environment – tests supplied by the experiments
Results stored in a database; displays & alarms for sites, grid operations and experiments; high-level metrics for management; integrated with the EGEE operations portal – the main tool for daily operations
Mechanism and tests shared with OSG
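The kind of availability metric SAM derives from its regular tests can be illustrated with a toy calculation: the fraction of critical-service tests a site currently passes. The site names and test results below are invented for the example; this is not SAM's actual algorithm or data model.

```python
# Toy illustration of an availability metric computed from regular test results.
# Site names and results are invented; this is not the real SAM implementation.
SAM_RESULTS = {
    "site-A": {"SRM": True, "CE": True,  "Site BDII": True, "RB": True},
    "site-B": {"SRM": True, "CE": False, "Site BDII": True, "RB": True},
}

def availability(results):
    """Fraction of critical service tests a site currently passes."""
    return sum(results.values()) / len(results)

for site, results in sorted(SAM_RESULTS.items()):
    failed = [test for test, ok in results.items() if not ok]
    print(f"{site}: {availability(results):.0%} available; failed:", failed or "none")
```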
Sustainability: Beyond EGEE-II
Need to prepare for a permanent Grid infrastructure: maintain Europe’s leading position in global science Grids; ensure reliable and adaptive support for all sciences; independent of short project funding cycles; modelled on the success of GÉANT – infrastructure managed in collaboration with national grid initiatives
Structure
Federated model bringing together National Grid Initiatives (NGIs) to build a European organisation; EGEE federations would evolve into NGIs
Each NGI is a national body: recognised at the national level; mobilises national funding and resources; contributes and adheres to international standards and policies; operates the national e-Infrastructure; application independent, open to new user communities and resource providers
OSG & WLCG
The OSG infrastructure is a core piece of the WLCG.
OSG delivers accountable resources and cycles for LHC experiment production and analysis.
OSG federates with other infrastructures.
Experiments see a seamless global computing facility.
Ramp-up of OSG use over the last 6 months [chart, with the OSG 0.4.0 and OSG 0.4.1 deployments marked]
Data transfer by VOs [chart: e.g. CMS]
Operations
Grid Operations Center.
Facility, Service and VO Support Centers.
Manual or automated flow of tickets within OSG, bridged to other Grids.
Ownership of problems at end-points and by the GOC.
Guided by the Operations Model, Standard Procedures and Support Center Agreements: http://osg.ivdgl.org/twiki/bin/view/Operations/WebHome
WLCG Interoperability
Cross-grid job submission: most advanced with OSG – cross job submission has been put in place for WLCG and used in production by US-CMS for several months
EGEE Generic Information Provider installed on OSG sites (now in VDT): allows all sites to be seen in the information system; monitoring (GStat and SFT) can run on OSG sites
EGEE clients installed on OSG-LCG sites; inversely, EGEE sites can run OSG jobs
All use SRM storage elements; file catalogues are an application choice – LFC is widely used
Support and operations: workflows and processes being put in place and tested; the operations workshop last week tried to finalise some of the open issues
LCG Service planning
Timeline across 2006-2008 (cosmics, first physics, full physics run):
Pilot services – stable service from 1 June 06
LHC service in operation – 1 Oct 06; over the following six months, ramp up to full operational capacity & performance
LHC service commissioned – 1 Apr 07
Service Challenges
Purpose: understand what it takes to operate a real grid service – run for weeks/months at a time (not just limited to experiment Data Challenges); trigger and verify Tier-1 & large Tier-2 planning and deployment, tested with realistic usage patterns; get the essential grid services ramped up to target levels of reliability, availability, scalability and end-to-end performance
Four progressive steps from October 2004 through September 2006:
End 2004 – SC1 – data transfer to a subset of Tier-1s
Spring 2005 – SC2 – include mass storage, all Tier-1s, some Tier-2s
2nd half 2005 – SC3 – Tier-1s, >20 Tier-2s – first set of baseline services
Jun-Sep 2006 – SC4 – pilot service
Autumn 2006 – LHC service in continuous operation – ready for data taking in 2007
SC4 – the Pilot LHC Service from June 2006
A stable service on which experiments can make a full demonstration of the experiment offline chain: DAQ → Tier-0 → Tier-1 (data recording, calibration, reconstruction); offline analysis – Tier-1 ↔ Tier-2 data exchange (simulation, batch and end-user analysis)
And sites can test their operational readiness: service metrics; MoU service levels; grid services; mass storage services, including magnetic tape
Extension to most Tier-2 sites
An evolution of SC3 rather than lots of new functionality
In parallel: development and deployment of distributed database services (3D project); testing and deployment of new mass storage services (SRM 2.2)
Sustained Data Distribution Rates: CERN Tier-1s
Centre                     ALICE  ATLAS  CMS  LHCb   Rate into T1, MB/s (pp run)
ASGC, Taipei                        X     X             100
CNAF, Italy                  X      X     X     X       200
PIC, Spain                          X     X     X       100
IN2P3, Lyon                  X      X     X     X       200
GridKA, Germany              X      X     X     X       200
RAL, UK                             X     X     X       150
BNL, USA                            X                   200
FNAL, USA                                 X             200
TRIUMF, Canada                      X                    50
NIKHEF/SARA, NL              X      X           X       150
Nordic Data Grid Facility    X      X                    50
Totals                                                 1,600

The design target is twice these rates, to enable catch-up after problems.
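A quick arithmetic check of the table above: the nominal per-site rates sum to the quoted 1,600 MB/s aggregate, and doubling that gives the design target for catch-up.

```python
# Sanity check of the nominal pp data-distribution rates listed above (MB/s).
RATES_MB_S = {
    "ASGC, Taipei": 100, "CNAF, Italy": 200, "PIC, Spain": 100,
    "IN2P3, Lyon": 200, "GridKA, Germany": 200, "RAL, UK": 150,
    "BNL, USA": 200, "FNAL, USA": 200, "TRIUMF, Canada": 50,
    "NIKHEF/SARA, NL": 150, "Nordic Data Grid Facility": 50,
}
total = sum(RATES_MB_S.values())
print(f"aggregate nominal rate out of CERN: {total} MB/s")      # 1600 MB/s
print(f"design target (2x, for catch-up):   {2 * total} MB/s")  # 3200 MB/s
```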
SC4 T0-T1: Results
Target: sustained disk-disk transfers at 1.6 GB/s out of CERN at full nominal rates for ~10 days
Result: just managed this rate on Easter Sunday (1/10)
[Chart: daily transfer rates, with the Easter weekend and the 10-day target period marked]
ATLAS SC4 tests
From last week: initial ATLAS SC4 work – rates to ATLAS Tier-1 sites close to target rates
[Chart: ATLAS transfers and background transfers]
Service readiness
An internal LCG review of services was held on 8-9th June
Mandate: assess the service readiness and preparations of the Tier-1 sites
Scope: all aspects of LCG except the applications area
First day: review of each Tier-1 – status, planning, issues
Second day: middleware plans and priorities (EGEE and OSG); interoperability; experiment views of the status of middleware; status of the storage interface (SRM)
Difficult to assess the overall status of sites – each Tier-1 is unique in its management, environment and issues
All are now taking the timescale seriously
The final report from the review is expected in July
Middleware: Baseline services
In June 2005 the set of baseline services was agreed: the basic set of middleware required from the grid infrastructures, agreed by all experiments with minor variations of priority
The baseline services group, and later workshops, documented missing features; LCG priorities for development were agreed at the Mumbai workshop in February and are now reflected in the EGEE & OSG middleware development plans
gLite 3.0 (released in May for SC4) contains all of the baseline services; SRM v2.2 for storage interfaces has a longer timescale (November); reliability, performance and management issues still need to be addressed
gLite 3.0 is an evolution of the previous LCG-2.7 and gLite 1.x middleware, deployed in production without disturbing the production environment; it forms the basis for evolution of the services to add missing features and improve performance and reliability
Several services (FTS, LFC, VOMS, BDII) are used everywhere (not just at EGEE sites)
Physics Support and Analysis
Supporting the experiments in grid activities
The original activity on the Grid focused on large productions: an essential activity, still requiring effort (middleware and experiment software are evolving)
There is now a genuine need for user analysis: a big step forward compared to production; preparation stages are still going on; tools are maturing; all components are being finalised; concrete signs of analysis activity
ALICE: support for production and analysis; integration and support
ATLAS: distributed analysis coordination and analysis (Ganga); experiment dashboard; integration and support; job reliability
CMS: experiment dashboard; integration and support; job reliability
LHCb: analysis support (Ganga); integration and support
Analysis efforts (CMS)
~6k analysis jobs/day – negligible less than 1 year ago, a factor of two increase since late 2005; jobs to finalise the Physics TDR
Analysis efforts (cont)
ALICE: 3 tutorials for users held since January 2006, with more than 50 attendees; typically 15-20 active users
ATLAS and LHCb: use a common tool to expose users to the grid (Ganga); several demos and tutorials; CHEP06 presentation (U. Egede)
[Chart: number of users over the last two months for the 2 services connecting users to the grid]
Experiment dashboard
Originally proposed by CMS, now in production
ATLAS dashboard: similar concept, re-using experience and software; preview available (ATLAS production)
Aggregates monitoring information from all sources
Allows the history of activity to be followed
Allows information to be correlated (e.g. data sets and sites)
Allows problems to be tracked down
Service reliability
Bring together the monitoring of experiment-specific services and applications with that of the middleware components, to study and improve the LCG service
Middleware weaknesses; infrastructure mis-configuration and instabilities
Feedback into LCG/EGEE deployment & middleware development
Example (20th June), top “good” sites by “grid” efficiency: MIT = 99.6%, DESY = 100%, Bari = 100%, Pisa = 100%, FNAL = 100%, ULB-VUB = 96.8%, KBFI = 100%, CNAF = 99.6%, ITEP = 100%
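One plausible reading of the per-site "grid" efficiency quoted above is the fraction of submitted jobs that finish successfully. The job counts below are invented purely to illustrate the calculation; only the resulting percentages echo the slide.

```python
# Toy calculation of a per-site "grid efficiency" metric:
# successfully finished jobs divided by all terminated jobs. Counts are invented.
JOB_COUNTS = {          # site: (succeeded, failed for grid/site reasons)
    "DESY": (500, 0),
    "MIT":  (498, 2),
    "CNAF": (996, 4),
}
for site, (succeeded, failed) in JOB_COUNTS.items():
    efficiency = succeeded / (succeeded + failed)
    print(f"{site}: grid efficiency {efficiency:.1%}")
```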
Summary
International science grid infrastructures are really operational, and are relied upon for daily production use at large scale: more than 200 sites in EGEE and OSG; real grid operations in place for over a year
LCG depends upon 2 major science grid infrastructures, EGEE and OSG: ~130 computer centres in 49 countries; excellent global networking
Good understanding now of: experiment computing models and requirements; agreement on the baseline grid services; experience of the problems and issues
But: reliability must be improved; the full computing models will be tested this year; a big ramp-up is needed in terms of capacity, number of jobs and Tier-2 sites participating; will there be a scaling problem? – this must be tested in the next 12 months
Data will arrive next year: no new developments – make what we have work absolutely reliably, and be scalable and performant