Ian Bird, CERN, LCG Deployment Manager
Overview & Status
Al-Ain, UAE – November 2007
Outline
Introduction – the computing challenge – why grid computing?
Overview of the LCG Project
Project Status
Challenges & Outlook
The LHC Computing Challenge
Signal/Noise ~10^-9
Data volume: high rate * large number of channels * 4 experiments
→ 15 PetaBytes of new data each year
Compute power: event complexity * number of events * thousands of users
→ 100k of (today's) fastest CPUs
Worldwide analysis & funding: computing funded locally in major regions & countries; efficient analysis everywhere
→ GRID technology
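As a rough, purely illustrative check of where the 15 PB figure comes from, the Python sketch below multiplies an assumed recorded-event rate and event size by an assumed running year across the four experiments; apart from the number of experiments and the 15 PB target, every number is an assumption, not a figure from this talk.

# Back-of-envelope check of the "15 PB of new data per year" figure.
# Only the number of experiments comes from the slide; the event rate,
# event size and running time are illustrative assumptions.

RECORDED_EVENT_RATE_HZ = 200       # events/s written to storage per experiment (assumed)
EVENT_SIZE_MB = 1.5                # average raw event size (assumed)
RUNNING_SECONDS_PER_YEAR = 1e7     # a typical accelerator year (assumed)
EXPERIMENTS = 4                    # ALICE, ATLAS, CMS, LHCb

bytes_per_year = (RECORDED_EVENT_RATE_HZ * EVENT_SIZE_MB * 1e6
                  * RUNNING_SECONDS_PER_YEAR * EXPERIMENTS)
print(f"~{bytes_per_year / 1e15:.0f} PB of new raw data per year")  # ~12 PB, the same order as the quoted 15 PB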
Timeline: LHC Computing
(Timeline, 1994–2008: LHC approved; ATLAS & CMS approved; ALICE approved; LHCb approved; ATLAS & CMS CTP – 10^7 MIPS, 100 TB disk; “Hoffmann” Review – 7×10^7 MIPS, 1,900 TB disk, the ATLAS (or CMS) requirements for the first year at design luminosity; Computing TDRs – 55×10^7 MIPS (140 MSI2K), 70,000 TB disk; LHC start.)
Evolution of CPU Capacity at CERN
(Chart: evolution of installed CPU capacity at CERN through the SC (0.6 GeV), PS (28 GeV), ISR (300 GeV), SPS (400 GeV), ppbar (540 GeV), LEP (100 GeV), LEP II (200 GeV) and LHC (14 TeV) eras. Costs in 2007 Swiss Francs, including infrastructure costs (computer centre, power, cooling, ...) and physics tapes.)
LHC Computing Multi-science Grid
1999 – MONARC project: first LHC computing architecture – hierarchical distributed model
2000 – growing interest in grid technology; the HEP community a main driver in launching the DataGrid project
2001-2004 – EU DataGrid project: middleware & testbed for an operational grid
2002-2005 – LHC Computing Grid (LCG): deploying the results of DataGrid to provide a production facility for the LHC experiments
The Worldwide LHC Computing Grid Purpose
Develop, build and maintain a distributed computing environment for the storage and analysis of data from the four LHC experiments
Ensure the computing service … and common application libraries and tools
Phase I – 2002-2005 – Development & planning
Phase II – 2006-2008 – Deployment & commissioning of the initial services
WLCG Collaboration
The Collaboration: 4 LHC experiments; ~250 computing centres; 12 large centres (Tier-0, Tier-1); 38 federations of smaller “Tier-2” centres; growing to ~40 countries; grids: EGEE, OSG, NorduGrid
Technical Design Reports: WLCG and the 4 experiments, June 2005
Memorandum of Understanding: agreed in October 2005
Resources: 5-year forward look
LCG Service Hierarchy
Tier-0 – the accelerator centre: data acquisition & initial processing; long-term data curation; distribution of data to the Tier-1 centres
Tier-1 – “online” to the data acquisition process: high availability; managed mass storage – grid-enabled data service; data-heavy analysis; national and regional support
Tier-2 – ~130 centres in ~35 countries: end-user (physicist, research group) analysis – where the discoveries are made; simulation
Tier-1 centres: Canada – TRIUMF (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands – NIKHEF/SARA (Amsterdam); Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Oxford); US – FermiLab (Illinois) and Brookhaven (NY)
(Pie charts – 2008 resource shares: CPU – CERN 18%, all Tier-1s 39%, all Tier-2s 43%; Disk – CERN 12%, all Tier-1s 55%, all Tier-2s 33%; Tape – CERN 34%, all Tier-1s 66%.)
Summary of Computing Resource Requirements – all experiments, 2008 (from the LCG TDR, June 2005)

                       CERN   All Tier-1s   All Tier-2s   Total
CPU (MSPECint2000s)      25            56            61     142
Disk (PetaBytes)          7            31            19      57
Tape (PetaBytes)         18            35             –      53
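The pie-chart shares above follow directly from this table; here is a minimal Python check using only the table's numbers (the small difference from the quoted disk share for the Tier-1s is rounding):

# Recompute the CERN / Tier-1 / Tier-2 shares from the 2008 requirements table.
# Tape is held only at CERN and the Tier-1s.

requirements = {
    "CPU":  {"CERN": 25, "All Tier-1s": 56, "All Tier-2s": 61},
    "Disk": {"CERN": 7,  "All Tier-1s": 31, "All Tier-2s": 19},
    "Tape": {"CERN": 18, "All Tier-1s": 35},
}

for resource, shares in requirements.items():
    total = sum(shares.values())
    split = ", ".join(f"{site} {100 * value / total:.0f}%" for site, value in shares.items())
    print(f"{resource}: total {total} -> {split}")

# CPU: total 142 -> CERN 18%, All Tier-1s 39%, All Tier-2s 43%
# Disk: total 57 -> CERN 12%, All Tier-1s 54%, All Tier-2s 33%
# Tape: total 53 -> CERN 34%, All Tier-1s 66%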
Distribution of Computing Services
about 100,000 CPU cores
New data will grow at about 15 PetaBytes per year – with two copies
A significant fraction of the resources is distributed over more than 120 computing centres
Grid Activity
Continuing increase in usage of the EGEE and OSG grids
All sites reporting accounting data (CERN, Tier-1, -2, -3)
Increase over the past 17 months: 5× the number of jobs, 3.5× the CPU usage
100K jobs/day
October 2007 CPU usage – CERN, Tier-1s, Tier-2s: > 85% of the CPU usage is external to CERN
(* NDGF usage is for September 2007)
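For a sense of scale, those growth factors correspond to roughly 10% and 8% compound growth per month – simple arithmetic on the numbers above, nothing else assumed:

# Implied monthly compound growth from the figures above:
# 5x more jobs and 3.5x more CPU usage over 17 months.
job_growth = 5.0 ** (1 / 17) - 1
cpu_growth = 3.5 ** (1 / 17) - 1
print(f"jobs: ~{job_growth:.0%} per month")       # ~10%
print(f"CPU usage: ~{cpu_growth:.0%} per month")  # ~8%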
Tier-2 Sites – October 2007
30 sites deliver 75% of the CPU; 30 sites deliver 1%
LHCOPN Architecture
Tier-2s and Tier-1s are inter-connected by the general-purpose research networks
Any Tier-2 may access data at any Tier-1
(Diagram: the Tier-1 centres – IN2P3, TRIUMF, ASCC, FNAL, BNL, Nordic, CNAF, SARA, PIC, RAL, GridKa – with Tier-2 centres attached to each of them.)
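A toy sketch of that access pattern – the Tier-1 names are taken from the diagram, but the dataset placement is invented purely for illustration:

# Any Tier-2 may read a dataset from any Tier-1 holding a replica of it.
# Hypothetical placement of two datasets across Tier-1s from the diagram.

replicas = {
    "atlas-raw-run1234": ["BNL", "IN2P3", "GridKa"],
    "cms-reco-run5678":  ["FNAL", "CNAF", "PIC"],
}

def sources_for(dataset):
    """Tier-1s from which any Tier-2 could fetch this dataset."""
    return replicas.get(dataset, [])

print(sources_for("atlas-raw-run1234"))  # ['BNL', 'IN2P3', 'GridKa']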
Data Transfer out of Tier-0
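The slide showed measured transfer rates out of the Tier-0. As a rough cross-check of the scale involved – using the 15 PB/year and second-copy figures quoted earlier plus an assumed ~10^7-second running year – the sustained export rate works out to order 1 GB/s:

# Average Tier-0 export rate implied by ~15 PB/year of new data, with one
# extra copy shipped to the Tier-1s, averaged over an assumed 1e7-second run.
NEW_DATA_PB_PER_YEAR = 15
EXPORTED_COPIES = 1          # the second copy (from "two copies" above)
RUN_SECONDS = 1e7            # assumed length of the data-taking year

rate_gb_per_s = NEW_DATA_PB_PER_YEAR * 1e6 * EXPORTED_COPIES / RUN_SECONDS
print(f"~{rate_gb_per_s:.1f} GB/s sustained out of the Tier-0")  # ~1.5 GB/s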
Middleware: Baseline Services
Storage Element: Castor, dCache, DPM (with SRM 1.1); StoRM added in 2007; SRM 2.2 – long delays incurred – now being deployed in production
Basic transfer tools – GridFTP, ...
File Transfer Service (FTS)
LCG File Catalog (LFC)
LCG data management tools – lcg-utils
Posix I/O – Grid File Access Library (GFAL)
Synchronised databases T0↔T1s – 3D project
Information System
Compute Elements – Globus/Condor-C, web services (CREAM)
gLite Workload Management – in production at CERN
VO Management System (VOMS)
VO Boxes
Application software installation
Job Monitoring Tools
The Basic Baseline Services – from the TDR (2005)
... continuing evolution: reliability, performance, functionality, requirements
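To make the catalogue's role concrete, here is a minimal conceptual sketch of what a file catalogue such as the LFC provides – a mapping from logical file names to physical replicas at the sites. The class, names and hostnames are purely illustrative; this is not the LFC API.

# Toy stand-in for a file catalogue: logical file names (LFNs) map to the
# physical replicas (storage URLs) held at the sites.  Not the LFC API;
# hostnames below are invented for illustration.

class FileCatalog:
    def __init__(self):
        self._replicas = {}   # LFN -> list of storage URLs

    def register(self, lfn, surl):
        """Record that a physical copy of this LFN exists at some storage element."""
        self._replicas.setdefault(lfn, []).append(surl)

    def replicas(self, lfn):
        """Return all known physical copies of the logical file."""
        return self._replicas.get(lfn, [])

catalog = FileCatalog()
catalog.register("/grid/atlas/raw/run1234/file001",
                 "srm://castorsrm.cern.ch/castor/atlas/raw/run1234/file001")
catalog.register("/grid/atlas/raw/run1234/file001",
                 "srm://srm.gridka.de/pnfs/atlas/raw/run1234/file001")
print(catalog.replicas("/grid/atlas/raw/run1234/file001"))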
Site Reliability – CERN + Tier-1s
“Site reliability” is a function of: grid services; middleware; site operations; storage management systems; networks; ...

Targets – CERN + Tier-1s
                Before July 07   July 07   Dec 07   Avg. of last 3 months
Each site             88%          91%       93%            89%
8 best sites          88%          93%       95%            93%
Tier-2 Site Reliability – October 2007
Average of all sites: 81%; top 50% of sites: 96%; top 20% of sites: 99%
83 Tier-2 sites being monitored
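Summary figures of this kind can be computed from per-site reliability measurements as in the sketch below; the sample values are invented, not the October 2007 data.

# Average reliability over all sites, and over the best 50% / 20% of sites.
# The sample values are invented for illustration only.

reliabilities = sorted([0.99, 0.98, 0.97, 0.95, 0.90, 0.85, 0.70, 0.55], reverse=True)

def average(values):
    return sum(values) / len(values)

def top_fraction(values, fraction):
    """Average reliability of the best `fraction` of sites (values sorted descending)."""
    n = max(1, round(len(values) * fraction))
    return average(values[:n])

print(f"all sites: {average(reliabilities):.0%}")
print(f"top 50%:   {top_fraction(reliabilities, 0.50):.0%}")
print(f"top 20%:   {top_fraction(reliabilities, 0.20):.0%}")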
Improving Reliability
Monitoring; metrics; workshops; data challenges; experience; systematic problem analysis; priority from the software developers
LCG depends on two major science grid infrastructures:
EGEE – Enabling Grids for E-sciencE; OSG – US Open Science Grid
LHC Computing Multi-science Grid
1999 – MONARC project: first LHC computing architecture – hierarchical distributed model
2000 – growing interest in grid technology; the HEP community a main driver in launching the DataGrid project
2001-2004 – EU DataGrid project: middleware & testbed for an operational grid
2002-2005 – LHC Computing Grid (LCG): deploying the results of DataGrid to provide a production facility for the LHC experiments
2004-2006, 2006-2008 – EU EGEE project: starts from the LCG grid; shared production infrastructure; expanding to other communities and sciences; now preparing the 3rd phase
Enabling Grids for E-sciencE
240 sites, 45 countries, 45,000 CPUs, 12 PetaBytes, > 5,000 users, > 100 VOs, > 100,000 jobs/day
Archeology, Astronomy, Astrophysics, Civil Protection, Computational Chemistry, Earth Sciences, Finance, Fusion, Geophysics, High Energy Physics, Life Sciences, Multimedia, Material Sciences, ...
Grid infrastructure project co-funded by the European Commission – now in its 2nd phase, with 91 partners in 32 countries
EGEE infrastructure use
> 90k jobs/day for LCG; > 143k jobs/day in total
Data from the EGEE accounting system
EGEE working with related infrastructure projects
Sustainability: Beyond EGEE-II
• Need to prepare a permanent, common Grid infrastructure
• Ensure the long-term sustainability of the European e-infrastructure, independent of short project funding cycles
• Coordinate the integration and interaction between National Grid Infrastructures (NGIs)
• Operate the European level of the production Grid infrastructure for a wide range of scientific disciplines, to link the NGIs
EGI – European Grid Initiative
www.eu-egi.org – EGI Design Study proposal to the European Commission (started Sept 07)
Supported by 37 National Grid Initiatives (NGIs)
2 year project to prepare the setup and operation of a new organizational model for a sustainable pan-European grid infrastructure after the end of EGEE-3
Challenges
Short timescale – preparation for start-up:
○ Resource ramp-up across Tier-1 and Tier-2 sites
○ Site and service reliability
Longer term:
○ Infrastructure – power and cooling
○ Multi-core CPUs – how will we make the best use of them?
○ Supporting large-scale analysis activities – just starting now – what will be the new problems that arise?
○ Migration from today’s grid to a model of national infrastructures – how to ensure that LHC gets what it needs
Combined Computing Readiness Challenge - CCRC
A combined challenge by all experiments and sites to validate the readiness of the WLCG computing infrastructure before the start of data taking, at a scale comparable to that needed for data taking in 2008
Should be done well in advance of the start of data taking to identify flaws and bottlenecks, and to allow time to fix them
Wide battery of tests run simultaneously by all experiments: driven from the DAQ with full Tier-0 processing; site-to-site data transfers, storage system to storage system; required functionality and performance; data access patterns similar to 2008 processing; CPU and data loads simulated as required to reach the 2008 scale
Coordination team in place; two test periods – February and May
Ramp-up Needed for Startup
(Charts: installed capacity and usage versus pledges and target usage between September 2006 / July 2007 and April 2008 – a ramp-up of roughly 2.3× to 3.7× is still required before start-up.)
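Read as a requirement to grow by about 3.7× between mid-2007 and April 2008 – call it nine months, which is an assumption about how to read the chart – this implies a compound growth rate of roughly 16% per month:

# Monthly compound growth needed for a 3.7x ramp-up over an assumed nine months.
months = 9
factor = 3.7
per_month = factor ** (1 / months) - 1
print(f"~{per_month:.0%} growth per month")  # ~16%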
Summary
We have an operational grid service for LHC
EGEE – the European grid infrastructure – is the world’s largest multi-disciplinary grid for science: ~240 sites, > 100 application groups
Over the next months, before the LHC comes on-line: ramp up resources to the MoU levels; improve service reliability and availability; run the full programme of “dress rehearsals” to demonstrate the complete computing system
Tier-1 Centres: TRIUMF (Canada); GridKa (Germany); IN2P3 (France); CNAF (Italy); SARA/NIKHEF (NL); Nordic Data Grid Facility (NDGF); ASCC (Taipei); RAL (UK); BNL (US); FNAL (US); PIC (Spain)
The Grid is now in operation, working on: reliability, scaling up, sustainability