
Page 1: LCG Deployment Status

Oliver Keeble, CERN IT GD-GIS
oliver.keeble@cern.ch
25 May 2004

Page 2: Overview

The LHC Computing Grid:
• LHC and its computing requirements
• Challenges for LCG
• The project
• Deployment and status
• LCG-EGEE
• Summary

Large Hadron Collider

Page 3: The Four Experiments at LHC

[Images of the four experiments: ALICE, ATLAS, CMS and LHCb; LHCb image from Federico Carminati, EU review presentation]

Page 4: Challenges for the LHC Computing Grid (http://lcg.web.cern.ch/lcg)

• LHC (Large Hadron Collider)
  – with 27 km of magnets, the largest superconducting installation
  – 40 million events per second from each of the 4 experiments
  – after triggers and filters, 100-1000 MBytes/second remain
  – every year ~15 PetaBytes of data will be stored
  – this data has to be reconstructed and analyzed by the users
  – in addition, a large computational effort to produce Monte Carlo data
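As a rough cross-check of these numbers, assuming a nominal accelerator year of about $10^7$ seconds of data taking (a common rule of thumb, not stated on the slide), the recorded rate and the annual volume are consistent:

\[
400\,\mathrm{MB/s} \times 10^{7}\,\mathrm{s/year} = 4\times10^{9}\,\mathrm{MB} \approx 4\,\mathrm{PB\ per\ experiment},
\]

so four experiments recording a few hundred MB/s each accumulate on the order of 15 PB/year.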

Page 5: Challenges for the LHC Computing Grid (http://lcg.web.cern.ch/lcg)

• CERN collaborators
  – a global effort
  – > 6000 users from 450 institutes
  – no single institute has the required computing
  – all have access to some computing

Europe: 267 institutes, 4603 users
Elsewhere: 208 institutes, 1632 users

Page 6: The LCG Project (and what it isn't)

• Mission: to prepare, deploy and operate the computing environment for the experiments to analyze the data from the LHC detectors
• Two phases:
  – Phase 1 (2002-2005): build a prototype based on existing grid middleware (LCG-2); deploy and run a production service; produce the Technical Design Report (TDR) for the final system
  – Phase 2 (2006-2008): build and commission the initial LHC computing environment

LCG is NOT a development project for middleware, but problem fixing is permitted (even if writing code is required).

Page 7: LCG Time Line

• 2003: open LCG-1 (achieved) – 15 Sept; testing, with simulated event productions
• 2004: LCG-2 – upgraded middleware, management and operations tools; principal service for the LHC data challenges; computing models
• 2005: TDR* for the Phase 2 grid; second generation middleware prototyping and development
• 2006: Phase 2 service acquisition, installation, commissioning; LCG-3 – second generation middleware, validation of computing models; experiment setup & preparation
• 2007: first data; Phase 2 service in production; physics computing service

* TDR – technical design report

Page 8: LCG-1 Experience (2003)

• Jan 2003: GDB agreed to take VDT and EDG components
• March 2003: LCG-0
  – existing middleware, waiting for the EDG-2 release
• September 2003: LCG-1
  – first production release: integrate sites and operate a grid
  – integrated 32 sites, ~300 CPUs
  – 3 months late -> reduced functionality
  – extensive middleware certification process
  – operated until early January; first use for production
  – introduced a hierarchical support model (primary and secondary sites)
    – worked well for some regions (less so for others)
    – communication/cooperation between sites needed to be established
  – installation and configuration was an issue
    – only time to package software for the LCFGng tool (problematic)
    – insufficient documentation (partially compensated by travel)
    – manual installation procedure documented when new staff arrived

Page 9: LCG-1 -> LCG-2

• Deployed MDS + EDG-BDII in a robust way
  – redundant regional GIISes – a BIG STEP FORWARD
  – vastly improved the scalability and robustness of the information system
• Upgrades, especially non-backward-compatible ones, took very long
  – not all sites showed the same dedication
• Still some problems with the reliability of some of the core services

Project Level 1 deployment milestones for 2003:
• July: introduce the initial publicly available LCG-1 global grid service
  – with 10 Tier 1 centres on 3 continents
• November: expanded LCG-1 service with resources and functionality sufficient for the 2004 Computing Data Challenges
  – additional Tier 1 centres, several Tier 2 centres, more countries
  – expanded resources at Tier 1s
  – agreed performance and reliability targets
  – around 30 sites

Page 10: Data Challenges

• 2004: the "LHC Data Challenges"
  – large-scale tests of the experiments' computing models, processing chains, operating infrastructure, and grid readiness
  – ALICE and CMS data challenges started at the beginning of March
  – LHCb and ATLAS started in May
  – the big challenge for this year is data: integrating mass storage (SRM)
• December 2003: LCG-2
  – full set of functionality for the DCs, but only the "classic SE"; first MSS integration
  – deployed in January
  – data challenges started in February -> testing in production
  – large sites integrate resources into LCG (MSS and farms)
• May 2004
  – improved services
  – SRM-enabled storage for disk and MSS systems
  – significant broadening of participation

Page 11: LCG - a Collaboration

• Building and operating the LHC Grid: an international collaboration between
  – the physicists and computing specialists from the experiments
  – the projects in Europe and the US that have been developing grid middleware
    – European DataGrid (EDG)
    – US Virtual Data Toolkit (Globus, Condor, PPDG, iVDGL, GriPhyN)
  – the regional and national computing centres that provide resources for LHC
    – some contribution from HP (Tier 2 centre)
  – the research networks

Researchers, software engineers, service providers.

Page 12: LCG Scale and Computing Model

[Map of participating sites: RAL, IN2P3, BNL, FZK, CNAF, PIC, ICEPP, FNAL, USC, NIKHEF, Krakow, CIEMAT, Rome, Taipei, TRIUMF, CSCS, Legnaro, UB, IFCA, IC, MSU, Prague, Budapest, Cambridge; hierarchy from Tier-1 centres through Tier-2 and small centres down to desktops and portables]

Sites classified by resources:
• Tier-0
  – reconstruct Experiment Summary Data (ESD)
  – record raw data and ESD
  – distribute data to Tier-1s
• Tier-1
  – data-heavy analysis
  – permanent, managed, grid-enabled storage (raw, analysis, ESD), MSS
  – reprocessing
  – regional support
• Tier-2
  – managed disk storage
  – CPU-intensive tasks, e.g. simulation
  – end user analysis
  – parallel interactive analysis

Current estimates of computing resources needed at major LHC centres, first full year of data (2008):

                                  Processing     Disk          Mass Storage
                                  (M SI2000**)   (PetaBytes)   (PetaBytes)
  CERN                            20             5             20
  Major data handling
  centres (Tier 1)                45             20            18
  Other large centres (Tier 2)    40             12            5
  Totals                          105            37            43

** a current fast processor is ~1K SI2000

Data distribution: ~70 Gbits/sec
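A trivial arithmetic check that the Totals row of the table is consistent with its rows; a minimal sketch in Python:

```python
# Resource estimates from the table above: (processing in M SI2000,
# disk in PB, mass storage in PB) per class of centre.
estimates = {
    "CERN (Tier 0)": (20, 5, 20),
    "Tier 1 centres": (45, 20, 18),
    "Tier 2 centres": (40, 12, 5),
}

# Sum each column and compare with the "Totals" row (105, 37, 43).
totals = tuple(sum(col) for col in zip(*estimates.values()))
assert totals == (105, 37, 43), totals
print("processing=%d MSI2000, disk=%d PB, mss=%d PB" % totals)
```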

Page 13: LCG-2

• Operate a large-scale production service
• Started with 8 "core" sites
  – each bringing significant resources
  – sufficient experience to react quickly
  – weekly meetings: core site phone conference, each of the experiments, joint meeting of the sites and the experiments (GDA)
• Introduced a testZone for new sites
  – LCG-1 showed that ill-configured sites can affect all sites
  – sites stay in the testZone until they have been stable for some time
• Further improved (simplified) information system, the "LCG BDII" (see the query sketch below)
  – addresses manageability; improves robustness and scalability
  – allows partitioning of the grid into independent views
• Introduced a local testbed for experiment integration
  – runs TAG N+1
  – rapid feedback on functionality from the experiments
  – triggered several changes to the RLS system
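The BDII is an LDAP server (conventionally on port 2170, with base DN mds-vo-name=local,o=grid), so any partitioned view of the grid can be inspected with an ordinary LDAP query. A minimal sketch using the python-ldap package; the hostname is a placeholder, not a real endpoint:

```python
import ldap  # python-ldap package

# Hypothetical BDII endpoint; real deployments publish their own hosts.
conn = ldap.initialize("ldap://bdii.example.org:2170")
conn.simple_bind_s()  # BDIIs allow anonymous read access

# Ask for all computing elements in this view, as described
# by the Glue schema (1.x attribute names).
results = conn.search_s(
    "mds-vo-name=local,o=grid",      # conventional BDII base DN
    ldap.SCOPE_SUBTREE,
    "(objectClass=GlueCE)",
    ["GlueCEUniqueID", "GlueCEInfoTotalCPUs"],
)
for dn, attrs in results:
    print(dn, attrs)
```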

Page 14: LCG-2

• Focus on integrating local resources
  – batch systems at CNAF, CERN, NIKHEF already integrated
  – MSS systems: CASTOR at several sites, Enstore at FNAL, RAL very soon
• Experiment software distribution mechanism
  – based on a shared file system with access for privileged users
  – a tool to publish the installed software (or compatibility) in the information system (see the publishing sketch after this list)
• Improved documentation and installation
  – sites have the choice to use LCFGng or follow a manual installation guide
  – LCFGng has a large overhead and is only appropriate for >10 nodes
  – full set of manual installation documentation
  – documentation includes simple tests, so sites join in a better state
  – install notes: http://markusw.home.cern.ch/markusw/LCG2InstallNotes.html
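In the Glue schema, installed experiment software is advertised as values of the GlueHostApplicationSoftwareRunTimeEnvironment attribute. A minimal sketch of how a publishing tool might turn a list of tags into an LDIF fragment for the site's information provider; the DN and tag names are illustrative only:

```python
# Hypothetical software tags an experiment has installed at this site.
tags = ["VO-atlas-release-8.0.1", "VO-cms-orca-7.6.1"]

# Illustrative DN; real providers publish under the site's GlueSubCluster.
dn = "GlueSubClusterUniqueID=ce.example.org,mds-vo-name=local,o=grid"

# Emit an LDIF "modify" fragment adding one value per installed tag.
ldif = [f"dn: {dn}", "changetype: modify",
        "add: GlueHostApplicationSoftwareRunTimeEnvironment"]
ldif += [f"GlueHostApplicationSoftwareRunTimeEnvironment: {t}" for t in tags]
print("\n".join(ldif))
```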

Page 15: Operations Services

• Operations service
  – RAL is leading the sub-project on developing operations services
  – initial prototype: http://goc.grid-support.ac.uk/
  – basic monitoring tools
  – mailing lists for problem resolution
  – GDB database containing contact and site information
  – the GOC is ultimately to be a distributed 24hr service
• Monitoring
  – GridICE (a development of the DataTag Nagios-based tools)
  – GridPP job submission monitoring
• User support service
  – FZK is leading the sub-project to develop user support services
  – draft user support policy
  – web portal for problem reporting: http://gus.fzk.de/

Page 16: Release Process

• Priorities for future releases
  – agreed in Grid Deployment Area meetings
  – based on the experiments' experience, problems and needs, and on operational experience
• Monthly coordinated releases
  – in the past everything that was not perfect was labelled a "showstopper", so releases took very long
  – we are now gradually reaching a more stable situation
  – all components have to pass C&T (certification and testing)
  – not all releases will be deployed
  – releases go first to the core sites

Page 17: From 28 to 50 sites in the last 3 weeks; from 2200 to 3340 CPUs in the last 2 weeks

[GIIS monitor snapshot, 16:40:45 05/17/04 GMT: a table of all 50 sites (CERN-LCG2, CNAF-LCG2, RAL-LCG2, nikhef.nl, FZK-LCG2, FNAL-LCG2, Taiwan-LCG2, PIC-LCG2, TRIUMF, ... JINR) with columns for zone, BDII status, sanity check, total CPUs, CPU usage, running and waiting jobs, and storage availability and usage; in total the grid reports 3337 CPUs, 593 running jobs and 558 waiting jobs]

Page 18: LCG-2 Data Challenges

• Most services stable, especially the information system
• Lessons learned
  – LHC experiments use multiple grids and additional services: integration and interoperability matter
  – Service provision: we planned to provide shared services (RBs, BDIIs, UIs etc.), but experiments need to augment the services on the UI and need to define their own super/subsets of the grid
    – individual RB/BDII/UIs for each VO (optionally on one node)
  – Resource usage: we expected uniform utilization (100% from the start), but it turned out to have some granularity
    – a steep build-up on an almost empty service, followed by a plateau and then a tapering off
  – Application-level software distribution: not ideal, but improved over LCG-1

Page 19: LCG-2 Data Challenges

• Lessons learned (continued)
  – Scalable services for data transport are needed
    – SRM is needed for more than Castor and Enstore
  – Performance issues with the RLS tools
    – bulk file registration with the RLS; understood, with a workaround and fix (see the sketch after this list)
  – Local Resource Managers (batch systems) are too smart for Glue
    – the GlueSchema cannot express the richness of batch systems (LSF etc.), so users cannot reliably anticipate loading
  – New, flatter IS architecture
    – first scaling problems encountered around 40 sites: the RB slows down; the solution will make it into the May release
  – DCs need resources
    – disk storage was not sufficient
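The slide does not say what the workaround was; the sketch below only illustrates the general technique for this kind of problem: amortising per-call overhead by registering files in batches rather than one at a time. The RLSClient class and its methods are hypothetical stand-ins, not the real RLS API:

```python
from typing import Iterable, List, Tuple

class RLSClient:
    """Hypothetical catalogue client; a stand-in for the real RLS tools."""
    def register_batch(self, entries: List[Tuple[str, str]]) -> None:
        ...  # one round trip registering many (guid, surl) pairs

def register_files(client: RLSClient,
                   entries: Iterable[Tuple[str, str]],
                   batch_size: int = 100) -> None:
    """Register (guid, surl) pairs in batches of batch_size.

    One request per batch instead of one per file cuts the number of
    network round trips by a factor of batch_size.
    """
    batch: List[Tuple[str, str]] = []
    for entry in entries:
        batch.append(entry)
        if len(batch) == batch_size:
            client.register_batch(batch)
            batch = []
    if batch:  # flush the final, possibly short, batch
        client.register_batch(batch)
```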

Page 20: Expected Developments in 2004

• General
  – LCG-2 will be an incrementally evolving, stable service
• Some functional improvements
  – extend access to MSS: tape systems and managed disk pools
  – GFAL: POSIX I/O to heterogeneous MSS (a large range is in use across the labs)
• Operational improvements
  – monitoring systems: move towards proactive problem finding, with the ability to take sites on/offline; experiment monitoring (R-GMA); accounting
  – a "cookbook" to cover planning, installation and operation
  – activate the regional centres more to provide and support services; this has improved over time, but in general there is too little sharing of tasks
• Address integration issues
  – with large clusters, with storage systems, with different OSs
  – sites will not run consistent and identical middleware
  – better integration of farms with non-routed networks for the WNs
  – regional centres are already supporting other experiments
  – CERN is integrating projects for the accelerator group
  – national grid infrastructures are coming

Page 21: Grid Guide Doc

• A tool to consolidate all site configuration parameters in one place
• Web interface
  – wizard-style interface upon initial registration
  – management of configuration
  – advice
• Full site configuration exported as an XML file (see the sketch after this list), providing
  – the input to installation scripts
  – customised documentation
  – a basis for building other tools
  – help with debugging
• Aims to
  – reduce the barriers to participation in the grid
  – enable sites to join in a more stable state
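The slide does not show the XML format itself; a minimal sketch of how such an export might drive an installation script, with an entirely hypothetical schema and site values:

```python
import xml.etree.ElementTree as ET

# Entirely hypothetical export format and values, for illustration only.
SITE_XML = """
<site name="EXAMPLE-LCG2">
  <ce host="ce.example.org" batch="pbs"/>
  <se host="se.example.org" size_gb="500"/>
  <bdii host="bdii.example.org" port="2170"/>
</site>
"""

# Turn the exported configuration into shell-style variables that an
# installation script could source.
site = ET.fromstring(SITE_XML)
print(f'SITE_NAME="{site.get("name")}"')
for node in site:
    for key, value in node.attrib.items():
        print(f'{node.tag.upper()}_{key.upper()}="{value}"')
```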

Page 22: LCG - EGEE

• EGEE is an EU project to build an e-science grid for Europe
• LCG-2 will be the production service during 2004
  – will also form the basis of the initial EGEE production service
  – will be maintained as a stable service
  – will continue to be developed
• Expect, in parallel, a development service (Q2 2004)
  – based on the EGEE middleware prototypes
  – run as a service on a subset of EGEE/LCG production sites
• The core infrastructure of the LCG and EGEE grids will be operated as a single service
  – LCG includes the US and Asia; EGEE includes other sciences
• The ROCs support Resource Centres and applications
  – similar to LCG primary sites
  – some ROCs and LCG primary sites will be merged
• The LCG Deployment Manager is the EGEE Operations Manager
  – a member of the PEB of both projects

[Diagram: EGEE and LCG as overlapping circles, each extending the other geographically and in applications]

Page 23: Summary

• LCG-2 is running as a production service
• Anticipate further improvements in infrastructure
• Broadening of participation and an increase in available resources
• In 2004 we must show that we can handle the data: meeting the Data Challenges is the key goal of 2004

Timeline:
• 2004: initial service in operation
• 2005: decisions on final core middleware; demonstrate core data handling and batch analysis
• 2006: installation and commissioning
• 2007: first data