CERN
LCG-1 Deployment Plan
Ian Bird
LCG Project Deployment Area Manager
IT Division, CERN
GridPP 7th Collaboration Meeting, Oxford
1 July 2003
Overview
• Milestones and goals for 2003
• LCG-1 Roll-out plan
  – Where, how, when
• Infrastructure Status
  – Middleware functionality & status
  – Security & operational issues
• Plans for rest of 2003
  – Additional resources
  – Additional functionality
  – Operational improvements
LCG - Goals
• The goal of the LCG project is to prototype and deploy the computing environment for the LHC experiments
• Two phases:
  – Phase 1: 2002 – 2005
    • Build a service prototype, based on existing grid middleware
    • Gain experience in running a production grid service
    • Produce the TDR for the final system
  – Phase 2: 2006 – 2008
    • Build and commission the initial LHC computing environment
LCG is not a development project – it relies on other grid projects for grid middleware development and support
LCG - Timescale
• Why such a rush – LHC won’t start until 2007?
• The TDR must be written in mid-2005:
  – Approval of TDR
  – Need 1 year to procure, build, test, deploy and commission the computing fabrics and infrastructure – to be in place end 2006
• In order to write the TDR, it is essential to have at least 1 year of experience
  – In running a production service
  – At a scale that is representative of the final system (50% of 1 expt)
  – Running data challenges – including analysis, not just simulations
• It can easily take 6 months to prepare such a service
We must start now … the goal is to have a service in place in July
LCG - Milestones
• The agreed Level 1 project milestones for Phase 1 are:
  – deployment milestones are in red
M1.1  - July 03      First Global Grid Service (LCG-1) available
M1.2  - June 03      Hybrid Event Store (Persistency Framework) available for general users
M1.3a - November 03  LCG-1 reliability and performance targets achieved
M1.3b - November 03  Distributed batch production using grid services
M1.4  - May 04       Distributed end-user interactive analysis from “Tier 3” centre
M1.5  - December 04  “50% prototype” (LCG-3) available
M1.6  - March 05     Full Persistency Framework
M1.7  - June 05      LHC Global Grid TDR
LCG Regional Centres
Tier 0
• CERN
Tier 1 Centres
• Brookhaven National Lab
• CNAF Bologna
• Fermilab
• FZK Karlsruhe
• IN2P3 Lyon
• Rutherford Appleton Lab (UK)
• University of Tokyo
• CERN
Other Centres
• Academica Sinica (Taipei)
• Barcelona
• Caltech
• GSI Darmstadt
• Italian Tier 2s (Torino, Milano, Legnaro)
• Manno (Switzerland)
• Moscow State University
• NIKHEF Amsterdam
• Ohio Supercomputing Centre
• Sweden (NorduGrid)
• Tata Institute (India)
• Triumf (Canada)
• UCSD
• UK Tier 2s
• University of Florida – Gainesville
• University of Prague
• ……
Confirmed resources: http://cern.ch/lcg/peb/rc_resources
Centres taking part in the LCG prototype service: 2003 – 2005
Elements of a Production LCG Service
• Middleware:
  – Testing and certification
  – Packaging, configuration, distribution and site validation
  – Support – problem determination and resolution; feedback to middleware developers
• Operations:
  – Grid infrastructure services
  – Site fabrics run as production services
  – Operations centres – trouble and performance monitoring, problem resolution – 24x7 globally
• Support:
  – Experiment integration – ensure optimal use of the system
  – User support – call centres/helpdesk – global coverage; documentation; training
2003 Milestones
Project Level 1 deployment milestones for 2003:
– July: Introduce the initial publicly available LCG-1 global grid service
  • With 10 Tier 1 centres in 3 continents
– November: Expanded LCG-1 service with resources and functionality sufficient for the 2004 Computing Data Challenges
  • Additional Tier 1 centres, several Tier 2 centres – more countries
  • Expanded resources at Tier 1s (e.g. at CERN make the LXBatch service grid-accessible)
  • Agreed performance and reliability targets
LCG Resource Commitments – 1Q04

Country          CPU (kSI2K)   Disk (TB)   Support (FTE)   Tape (TB)
CERN                     700         160            10.0        1000
Czech Republic            60           5             2.5           5
France                   420          81            10.2         540
Germany                  207          40             9.0          62
Holland                  124           3             4.0          12
Italy                    507          60            16.0         100
Japan                    220          45             5.0         100
Poland                    86           9             5.0          28
Russia                   120          30            10.0          40
Taiwan                   220          30             4.0         120
Spain                    150          30             4.0         100
Sweden                   179          40             2.0          40
Switzerland               26           5             2.0          40
UK                      1656         226            17.3         295
USA                      801         176            15.5        1741
Total                   5600        1169           120.0        4223
Deployment Goals for LCG-1
• Production service for Data Challenges in 2H03 & 2004
  – Initially focused on batch production work
  – But ’04 data challenges have (as yet undefined) interactive analysis
• Experience in close collaboration between the Regional Centres
  – Must have wide enough participation to understand the issues
• Learn how to maintain and operate a global grid
• Focus on a production-quality service
  – Robustness, fault-tolerance, predictability, and supportability take precedence; additional functionality gets prioritized
• LCG should be integrated into the sites’ physics computing services – it should not be something apart
  – This requires coordination between participating sites in:
    • Policies and collaborative agreements
    • Resource planning and scheduling
    • Operations and support
Middleware Deployment
• LCG-0 was deployed and installed at 10 Tier 1 sites
  – Installation procedure was straightforward and repeatable
  – Many local integration issues were addressed
• LCG-1 will be deployed to these 10 sites to meet the July milestone
  – Time is short – integrating the middleware components took much longer than anticipated
  – Planning under way to do the deployment in a short time once the middleware is packaged
  – LCG team will work directly with these sites during the deployment
  – Initially, testing activities to stabilise the service will take priority
  – Expect experiments to start to test the service by mid-August
LCG-0 Deployment Status

     Site              Scheduled    Status
Tier 1
 0   CERN              15/2/03      Done
 1   CNAF              28/2/03      Done
 2   RAL               28/2/03      Done
 3   FNAL              30/3/03      Done
 4   Taipei            15/4/03      Done
 5   FZK               30/4/03      Done
 6   IN2P3             7/5/03       In prep.
 7   BNL               15/5/03      Done
 8   Russia (Moscow)   21/5/03      In prep.
 9   Tokyo             21/5/03      Done
Tier 2
10   Legnaro (INFN)    After CNAF   Done

These sites deployed the LCG-0 pilot system and will be the first sites to deploy LCG-1
LCG-1 Distribution
• Packaging & configuration
  – Service machines – fully automated installation
    • LCFGng – either full or light version
  – Worker nodes – aim is to allow sites to use existing tools as required
    • LCFGng – provides automated installation
    • Installation scripts provided by us – manual installation
    • Instructions allowing system managers to use their existing tools
  – User interface
    • LCFGng
    • Installed on a cluster (e.g. Lxplus at CERN)
    • Pacman?
• Distribution
  – Distribution web site being set up now (updated from LCG-0)
    • Sets of rpm’s etc. organised by service and machine type
    • User guide, installation guides, release notes, etc., being written now
Middleware Status
• Integration work on EDG 2.0 has taken longer than hoped
  – EDG has not quite released version 2.0 – imminent
• LCG has a working system – able to run jobs:
  – Resource Broker: many changes since the previous version; needs significant testing to determine scalability and limitations
  – RLS: initial deployment will be a single instance (per VO) of LRC/RMC
    • Distributed service with many LRCs and indexes not yet debugged
    • Initially will run the LRC for all VOs at CERN with an Oracle service backend
  – Information system:
    • R-GMA is not yet stable
    • We will initially use MDS: work to improve stability (bug fixes) and redundancy – based on experience with EDG testbeds and NIKHEF, NorduGrid work
    • Intend to make a direct comparison between MDS and R-GMA on the certification testbed
• Waiting for bug fixes – of several components
• Still to do before release:
  – Reasonable level of testing
  – Packaging and preparation for deployment
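The two-level catalogue model behind the RLS deployment above – an RMC mapping logical file names to GUIDs, and an LRC mapping GUIDs to physical replicas – can be sketched as follows. The class and method names are illustrative only, not the actual RLS client API.

```python
# Minimal sketch of the RMC/LRC split in the RLS: the RMC maps logical
# file names (LFNs) to GUIDs, the LRC maps GUIDs to physical replica
# locations. Names are illustrative, not the real RLS interface.

import uuid

class ReplicaMetadataCatalog:
    """RMC: LFN -> GUID aliases."""
    def __init__(self):
        self._lfn_to_guid = {}

    def register(self, lfn):
        # Reuse the GUID if the LFN is already known.
        return self._lfn_to_guid.setdefault(lfn, str(uuid.uuid4()))

    def guid_of(self, lfn):
        return self._lfn_to_guid[lfn]

class LocalReplicaCatalog:
    """LRC: GUID -> set of physical replica URLs."""
    def __init__(self):
        self._replicas = {}

    def add_replica(self, guid, surl):
        self._replicas.setdefault(guid, set()).add(surl)

    def replicas_of(self, guid):
        return sorted(self._replicas.get(guid, set()))

# One RMC/LRC pair per VO, as in the initial single-instance deployment.
rmc, lrc = ReplicaMetadataCatalog(), LocalReplicaCatalog()
guid = rmc.register("lfn:/grid/atlas/run1234/hits.root")
lrc.add_replica(guid, "srm://castor.cern.ch/castor/atlas/hits.root")
lrc.add_replica(guid, "srm://gridka.de/pnfs/atlas/hits.root")
print(lrc.replicas_of(rmc.guid_of("lfn:/grid/atlas/run1234/hits.root")))
```

The later distributed service adds many LRCs plus an index (RLI) layer on top of the same two mappings.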
Certification & Testing
• This is the primary tool to stabilise and debug the system
  – Process and testbed have been set up
  – This is intended to run in parallel with the production service
• Certification testbed:
  – Set of 4 clusters at CERN – simulates a grid on a LAN
  – External sites that will be part of the certification testbed:
    • U. Wisconsin, FNAL – currently
    • Moscow, Italy – soon
• This testbed is being used to test the release candidate
  – Will be used to reproduce and resolve problems found in the production system, and to do regression testing for updated middleware components before deployment
Infrastructure for initial service - 2
• Security issues
  – Agreement on the set of CAs that all LCG sites will accept
    • EDG list of traditional CAs
    • FNAL on-line KCA
  – Agreement on a basic registration procedure for users
    • LCG VO, where users sign the Acceptable Usage Rules for LCG
    • 4 experiment VOs – will use the existing EDG services run by NIKHEF
    • Agreement on the basic set of information to be collected
  – All initial registrations will expire in 6 months – we know the procedures will change
  – Experiment VO managers will verify the bona fides of users
  – Acceptable Use Rules – adaptation based on EDG policy for now
  – Audit trails – basic set of tools and log correlations to provide the basic essential functions
Infrastructure – 3
• Operations service:
  – RAL is leading the sub-project on developing operations services
  – Initial prototype for July:
    • Basic monitoring tools
    • Mail lists and rapid communications/coordination for problem resolution
  – Monitoring:
    • GridICE (development of the DataTag Nagios-based tools) being integrated with the release candidate
    • Existing Nagios-based tools
    • GridPP job submission monitoring
    • Together these give reasonable coverage of basic operational issues
• User support
  – FZK is leading the sub-project to develop user support services
  – Initial prototype for July:
    • Web portal for problem reporting
    • Expectation that initially the experiments will triage problems and experts will submit LCG problems to the support service
Initial Deployment Services
[Diagram: Services at CERN – per-VO RLS instances (RMC & LRC) for ALICE, ATLAS, CMS, LHCb and the LCG team; a Proxy server; Resource Brokers RB-1 and RB-2; UIs (AFS users; UI-b on Lxplus); a disk SE; CE-1 and CE-2 in front of PBS-managed worker nodes, CE-3 and CE-4 in front of LSF-managed worker nodes; the LCG Registration Server and LCG CVS Server. VO servers for ALICE, ATLAS, CMS, LHCb, LCG and the LCG team are hosted at NIKHEF. Services at other sites – a UI, Proxy, RB, disk SE, and a CE with worker nodes under PBS or another local batch system.]
LCG-1 First Launch Information System Overview
[Diagram: GRISes on each site’s CEs and SEs register with the site GIIS (sites A–D); site GIISes register with the region GIISes (regions A1/A2 and B1/B2); two BDIIs (LDAP servers A and B, one primary and one secondary) query the region GIISes; the RB queries the BDII. Using multiple BDIIs requires RB changes.]
While serving the data from one directory (/dataCurrent/..) the BDII queries the regional GIISes to fill another directory structure (/dataNew/..). When this has finished the BDII is stopped, the directories are swapped, and the BDII is restarted; the restart takes less than 0.5 s. To improve availability during this window it was suggested (David) that the TCP port be switched off and the TCP protocol left to take care of the retry – this has to be tested. Another idea worth testing is to remove the site GIIS and configure the GRISes to register directly with the region GIISes.
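The swap-and-restart scheme described above can be sketched as a double-buffered directory with an atomic symlink flip. Paths, the `fetch_records` callable and the symlink mechanism are illustrative assumptions; the real BDII serves its data via an LDAP server that is stopped and restarted around the swap.

```python
# Sketch of the BDII double-buffer update: fill the idle directory with
# fresh query results while the other is being served, then swap with an
# atomic rename of a symlink. Paths and names are illustrative only.

import os
import tempfile

def refresh(base, fetch_records):
    """Write fresh data into the idle dir, then point 'current' at it."""
    live = os.path.realpath(os.path.join(base, "current"))
    idle = os.path.join(base,
                        "dataNew" if live.endswith("dataCurrent") else "dataCurrent")
    os.makedirs(idle, exist_ok=True)
    with open(os.path.join(idle, "ldif"), "w") as f:
        f.write(fetch_records())      # stands in for querying the region GIISes
    tmp = os.path.join(base, "current.tmp")
    os.symlink(idle, tmp)
    os.replace(tmp, os.path.join(base, "current"))   # atomic swap
    return idle

# Set up the initial state: 'current' serves /dataCurrent.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "dataCurrent"))
os.symlink(os.path.join(base, "dataCurrent"), os.path.join(base, "current"))

active = refresh(base, lambda: "dn: mds-vo-name=west,o=grid\n")
print(active.endswith("dataNew"))
```

Readers of `current/` never see a half-written directory, which is the point of the swap; the sub-second gap in the real system comes from restarting the LDAP daemon, not from the rename itself.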
LCG-1 First Launch Information System Sites and Regions
A region should not contain too many sites, since we have observed problems with MDS when a large number of sites are involved. To allow for future expansion, but not make the system too complex, the suggestion is to start with two regions and, if needed, split later into smaller regions. The regions are West of 0 degrees and East of it; the idea is to have one large region and one small one and see how they work. For the West, 2 region GIISes should be set up at the beginning, and for the East, 3.
[Diagram: West region – RAL, FNAL, BNL, served by the WEST1 and WEST2 region GIISes; East region – CERN, CNAF, Lyon, Moscow, FZK, Tokyo, Taiwan, served by the EAST1, EAST2 and EAST3 region GIISes.]
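The proposed split can be sketched as a simple partition of sites by longitude; the coordinates below are approximate and purely illustrative, not part of any LCG configuration.

```python
# Sketch of the two-region split: sites west of 0 degrees longitude go
# to the West region GIISes, the rest to the East ones. Longitudes are
# approximate and purely illustrative.

SITE_LONGITUDE = {            # degrees east (negative = west of 0)
    "RAL": -1.3, "FNAL": -88.3, "BNL": -72.9,
    "CERN": 6.1, "CNAF": 11.3, "LYON": 4.8, "MOSCOW": 37.6,
    "FZK": 8.4, "TOKYO": 139.7, "TAIWAN": 121.5,
}

def region_of(site):
    return "WEST" if SITE_LONGITUDE[site] < 0 else "EAST"

regions = {"WEST": [], "EAST": []}
for site in SITE_LONGITUDE:
    regions[region_of(site)].append(site)

print(sorted(regions["WEST"]))   # the small region
```

This reproduces the grouping in the diagram: three sites in the small West region, seven in the large East one.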
Plans for remainder of 2003
• Once the service has been deployed, priorities are:
  – Problem resolution and bug fixing – to address problems of reliability and scalability in the existing middleware
  – Incrementally adding additional functionality
  – Adding additional sites
  – Expanding the site resources accessible to the grid service
  – Addressing integration issues
    • Worker node WAN connectivity, etc.
  – Developing distributed prototypes of
    • Operations centres
    • User support services
    to provide a reasonable level of global coverage
  – Improving the security model
  – Developing tools to facilitate operating the service
Plans for 2003 – 2: Middleware functionality
• Top priority is problem resolution and issues of stability/scalability
• RLS developments
  – Distributed service – multiple LRCs, and RLI
  – Later: develop a service to replace the client command-line tools
• VOMS service
  – To permit user- and role-based authorization
• Validation of R-GMA
  – And then deployment of multiple registries – initial implementation has a singleton
• Grid File Access Library
  – LCG development: POSIX-like I/O layer to provide local file access
• Development of SRM/SE interfaces to other MSS
  – Work that must happen at each site with an MSS
• Basic upgrades
  – Compiler support
  – Move to Globus 2.4 (release supported through 2004)
• Cut-off for functionality improvements is October – in order to have a stable system for 2004
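The idea of a POSIX-like grid file access layer can be illustrated with a toy wrapper that resolves a logical file name through a catalogue and then falls back to ordinary file I/O. This is NOT the real GFAL interface; the catalogue, the `lfn:` handling and `grid_open` are invented for the sketch.

```python
# Toy illustration of a POSIX-like grid file access layer: resolve a
# logical file name through a replica catalogue, then open the chosen
# replica with ordinary file I/O. Not the real GFAL API.

import os
import tempfile

CATALOG = {}   # lfn -> list of physical paths (stand-in for the RLS)

def grid_open(name, mode="r"):
    """open() look-alike that accepts lfn: names as well as local paths."""
    if name.startswith("lfn:"):
        for replica in CATALOG.get(name, []):
            if os.path.exists(replica):   # pick the first reachable replica
                return open(replica, mode)
        raise FileNotFoundError(name)
    return open(name, mode)

# Register a local file as the single replica of a logical name.
fd, path = tempfile.mkstemp()
os.write(fd, b"event data")
os.close(fd)
CATALOG["lfn:/grid/cms/evt0"] = [path]

with grid_open("lfn:/grid/cms/evt0") as f:
    print(f.read())
```

The appeal of this shape is that application code keeps using the familiar open/read/close idiom whether the file is local or catalogued.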
Incremental Deployment
[Timeline of LCG middleware development: the July starting point contains as much as feasible – Globus, VDT, RLS (basic) and the RB – with continuous bug fixing & re-release thereafter. Planned additions before the October 1 cut-off, which defines the functionality for 2004: R-GMA, RLS (distributed), VOMS, a VDT upgrade, RH 8.x and gcc 3.2. EDG integration ends in September.]
Expansion of LCG resources
• Adding new sites
  – Will be a continuous process as sites become ready to join the service
  – Expect a minimum of 15 sites (15 countries have committed resources for LCG in 1Q04); reasonable to expect 18–20 sites by end 2003
  – LCG team will work directly with each Tier 1 (or the primary site in a region)
  – Tier 1s will provide first-level support for bringing Tier 2 sites into the service
    • Once the Tier 1s are stable this can go on in parallel in many regions
    • LCG team will provide second-level support for Tier 2s
• Increase the grid resources available at many sites
  – Requires LCG to demonstrate the utility of the service – experiments, in agreement with site managers, add resources to the LCG service
Operational plans for 2003
• Security
  – Develop full security policy
  – Develop longer-term user registration procedures and tools to support them
  – Develop an Acceptable Use policy for the longer term – requires legal review
• Operations
  – Develop distributed prototype operations centres/services
    • Monitoring developments driven by experience
  – Provide at least 16 hr/day global coverage – problem response
  – Basic level of resource-use accounting – by VO and user
  – Minimal level of security incident response and coordination
• User support
  – Development direction depends strongly on experience in the deployed system
  – Operations and user support must address the issues of interchanging problem reports – with each other and with sites, network ops, etc.
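Basic resource-use accounting by VO and user amounts to aggregating per-job usage records; the record fields below are illustrative, not an agreed LCG accounting format.

```python
# Sketch of basic resource-use accounting: aggregate per-job usage
# records by VO and by user. The record fields are illustrative.

from collections import defaultdict

jobs = [
    {"vo": "atlas", "user": "jsmith",  "cpu_hours": 12.0},
    {"vo": "atlas", "user": "jsmith",  "cpu_hours": 3.5},
    {"vo": "cms",   "user": "mrossi",  "cpu_hours": 8.0},
    {"vo": "atlas", "user": "kmuller", "cpu_hours": 1.5},
]

def usage_by(key, records):
    """Sum CPU hours over records, grouped by the given field."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r["cpu_hours"]
    return dict(totals)

print(usage_by("vo", jobs))     # totals per VO
print(usage_by("user", jobs))   # totals per user
```

The same grouping function serves both reports, which is why "by VO and user" is a modest extension once per-job records exist at all.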
Middleware roadmap
• Short term (2003)
  – Use what exists – try to stabilize, debug, fix problems, etc.
  – Exceptions may be needed – WN connectivity, client tools rather than services, user registration, …
• Medium term (2004 – ?)
  – Same middleware, but develop the missing services, remove the exceptions
  – Separate services from WNs – aim for more generic clusters
  – Initial tests of re-engineered middleware (service based, defined interfaces, protocols)
• Longer term (2005? – )
  – LCG service based on service definitions, interfaces and protocols – aim to be able to have interoperating, different implementations of a service
Inter-operability
• Since LCG will be VDT + higher-level EDG components:
• Sites running the same VDT version should be able to be part of LCG, or continue to work as now
• LCG (as far as possible) has the goal of appearing as a layer of services in front of a cluster, storage system, etc.
  – State of the art currently implies compromises …
Integration Issues
• LCG will try to be non-intrusive:
  – Will assume the base OS is already installed
  – Provide installation & configuration tool for service nodes
  – Provide recipes for installation of WNs – assume sites will use existing tools to manage their clusters
• No imposition of a particular batch system
  – As long as your batch system talks to Globus
    • (OK for LSF, PBS, Condor, BQS, FBSng)
• No longer a requirement for a shared filesystem between the gatekeeper and WNs – this was a problem for AFS, and NFS does not scale to large clusters
• Information publishing
  – Define what information a site should provide (accounting, status, etc.), rather than imposing tools
• But … maybe some compromises in the short term (2003)
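Defining what information a site must publish, rather than which tool must produce it, amounts to agreeing on a schema that published records can be checked against. The field names below are invented for illustration, not an agreed LCG information schema.

```python
# Sketch of "define the information, not the tool": an agreed set of
# required fields that a site's published record is validated against,
# whatever tool produced it. Field names are invented for illustration.

REQUIRED_FIELDS = {"site_name", "batch_system", "total_cpus",
                   "storage_tb", "status"}

def validate(record):
    """Return the sorted list of missing required fields (empty = valid)."""
    return sorted(REQUIRED_FIELDS - record.keys())

good = {"site_name": "RAL", "batch_system": "PBS", "total_cpus": 500,
        "storage_tb": 226, "status": "production"}
bad = {"site_name": "Tier2-X", "status": "testing"}

print(validate(good))   # []
print(validate(bad))
```

A check like this lets each site keep its own publishing tools while LCG verifies only the agreed content.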
Worker Node connectivity
• In general (and eventually) it cannot be assumed that the cluster nodes will have connectivity to remote sites
  – Many clusters are on non-routed networks (for many reasons)
  – Security issues
  – In any case this assumption will not scale
• BUT … to achieve this several things are necessary:
  – Some tools (e.g. replica management) must become services
  – Databases (e.g. the conditions DB) must either be replicated to each site (or equivalent), or accessed via a proxy service, or …
  – Analysis models must take this into account
  – Again, short-term exceptions (up to a point) are possible
• Current additions to LXbatch at CERN have this limitation
Timeline for the LCG services
[Timeline, 2003–2006: agree the LCG-1 spec; LCG-1 service opens – event simulation productions; stabilize, expand, develop; LCG-2 with upgraded middleware, management etc. – service for Data Challenges, batch analysis and simulation (CMS DC04); computing model TDRs and validation of the computing models; evaluation of 2nd-generation middleware; LCG-3 full multi-tier prototype batch + interactive service; TDR for Phase 2; acquisition, installation and testing of the Phase 2 service; Phase 2 service in production.]
Grid Deployment Organisation
[Organisation chart: the Grid Deployment Board (GDB) sets policies, strategy, scheduling, standards and recommendations for the Grid Deployment manager. CERN-based teams: LCG security group, grid infrastructure team, LCG operations team, experiment support team, core infrastructure, LCG toolkit integration & certification, and the joint Trillium/EDG/LCG testing team. Anticipated teams at other institutes: security tools, operations call centre, grid monitoring, and regional centre operations teams at many centres. A Grid Resource Coordinator handles resource requests (compute & storage resources and grid infrastructure services) from ALICE, ATLAS, CMS and LHCb.]
Conclusions
• Essential to start operating a service as soon as possible – we need 6 months to be able to develop this into a reasonably stable service
• Middleware components are late – but we will still deploy a service of reasonable functionality and scale
  – Much work will be necessary on testing and improving the basic service
• Several functional and operational improvements are expected during 3Q03
• Expansion of sites and resources foreseen during 2003 should provide adequate resources for the 2004 data challenges
• There are many issues to resolve and a lot of work to do – but this must be done incrementally on the running service
Conclusions
• From the point of view of the LCG plan, we are late in having testable middleware with the functionality we had hoped for
• We will keep to the July deployment schedule
  – We expect to have the major components – the user view of the middleware (i.e. via the RB) should not change
  – Expect to be able to do less testing and commissioning than planned
  – But hopefully, with a suitable process, we will incrementally improve and add functionality as it becomes available and is tested