CERN
LCG-1 Deployment Plan
Ian Bird
LCG Project Deployment Area Manager
IT Division, CERN
GridPP 7th Collaboration Meeting, Oxford
1 July 2003
Overview
• Milestones and goals for 2003
• LCG-1 Roll-out plan
  – Where, how, when
• Infrastructure Status
  – Middleware functionality & status
  – Security & operational issues
• Plans for rest of 2003
  – Additional resources
  – Additional functionality
  – Operational improvements
LCG - Goals
• The goal of the LCG project is to prototype and deploy the computing environment for the LHC experiments
• Two phases:
  – Phase 1: 2002 – 2005
    • Build a service prototype, based on existing grid middleware
    • Gain experience in running a production grid service
    • Produce the TDR for the final system
  – Phase 2: 2006 – 2008
    • Build and commission the initial LHC computing environment
LCG is not a development project – it relies on other grid projects for grid middleware development and support
LCG - Timescale
• Why such a rush – LHC won’t start until 2007?
• The TDR must be written in mid-2005:
  – Approval of TDR
  – Need 1 year to procure, build, test, deploy and commission the computing fabrics and infrastructure – to be in place end 2006
• In order to write the TDR, it is essential to have at least 1 year of experience
  – In running a production service
  – At a scale that is representative of the final system (50% of 1 expt)
  – Running data challenges – including analysis, not just simulations
• It can easily take 6 months to prepare such a service
We must start now … the goal is to have a service in place in July
LCG - Milestones
• The agreed Level 1 project milestones for Phase 1 are:
  – deployment milestones are in red
M1.1  - July 03      First Global Grid Service (LCG-1) available
M1.2  - June 03      Hybrid Event Store (Persistency Framework) available for general users
M1.3a - November 03  LCG-1 reliability and performance targets achieved
M1.3b - November 03  Distributed batch production using grid services
M1.4  - May 04       Distributed end-user interactive analysis from “Tier 3” centre
M1.5  - December 04  “50% prototype” (LCG-3) available
M1.6  - March 05     Full Persistency Framework
M1.7  - June 05      LHC Global Grid TDR
LCG Regional Centres
Tier 0
• CERN
Tier 1 Centres
• Brookhaven National Lab
• CNAF Bologna
• Fermilab
• FZK Karlsruhe
• IN2P3 Lyon
• Rutherford Appleton Lab (UK)
• University of Tokyo
• CERN
Other Centres
• Academica Sinica (Taipei)
• Barcelona
• Caltech
• GSI Darmstadt
• Italian Tier 2s (Torino, Milano, Legnaro)
• Manno (Switzerland)
• Moscow State University
• NIKHEF Amsterdam
• Ohio Supercomputing Centre
• Sweden (NorduGrid)
• Tata Institute (India)
• Triumf (Canada)
• UCSD
• UK Tier 2s
• University of Florida – Gainesville
• University of Prague
• ……
Confirmed resources: http://cern.ch/lcg/peb/rc_resources
Centres taking part in the LCG prototype service: 2003 – 2005
Elements of a Production LCG Service
• Middleware:
  – Testing and certification
  – Packaging, configuration, distribution and site validation
  – Support – problem determination and resolution; feedback to middleware developers
• Operations:
  – Grid infrastructure services
  – Site fabrics run as production services
  – Operations centres – trouble and performance monitoring, problem resolution – 24x7 globally
• Support:
  – Experiment integration – ensure optimal use of the system
  – User support – call centres/helpdesk – global coverage; documentation; training
2003 Milestones
Project Level 1 deployment milestones for 2003:
– July: Introduce the initial publicly available LCG-1 global grid service
  • With 10 Tier 1 centres in 3 continents
– November: Expanded LCG-1 service with resources and functionality sufficient for the 2004 Computing Data Challenges
  • Additional Tier 1 centres, several Tier 2 centres – more countries
  • Expanded resources at Tier 1s (e.g. at CERN make the LXBatch service grid-accessible)
  • Agreed performance and reliability targets
LCG Resource Commitments – 1Q04

Country          CPU (kSI2K)   Disk (TB)   Support (FTE)   Tape (TB)
CERN                     700         160            10.0        1000
Czech Republic            60           5             2.5           5
France                   420          81            10.2         540
Germany                  207          40             9.0          62
Holland                  124           3             4.0          12
Italy                    507          60            16.0         100
Japan                    220          45             5.0         100
Poland                    86           9             5.0          28
Russia                   120          30            10.0          40
Taiwan                   220          30             4.0         120
Spain                    150          30             4.0         100
Sweden                   179          40             2.0          40
Switzerland               26           5             2.0          40
UK                      1656         226            17.3         295
USA                      801         176            15.5        1741
Total                   5600        1169           120.0        4223
Deployment Goals for LCG-1
• Production service for Data Challenges in 2H03 & 2004
  – Initially focused on batch production work
  – But ’04 data challenges have (as yet undefined) interactive analysis
• Experience in close collaboration between the Regional Centres
  – Must have wide enough participation to understand the issues
• Learn how to maintain and operate a global grid
• Focus on a production-quality service
  – Robustness, fault-tolerance, predictability, and supportability take precedence; additional functionality gets prioritized
• LCG should be integrated into the sites’ physics computing services – it should not be something apart
  – This requires coordination between participating sites in:
    • Policies and collaborative agreements
    • Resource planning and scheduling
    • Operations and support
Middleware Deployment
• LCG-0 was deployed and installed at 10 Tier 1 sites
  – Installation procedure was straightforward and repeatable
  – Many local integration issues were addressed
• LCG-1 will be deployed to these 10 sites to meet the July milestone
  – Time is short – integrating the middleware components took much longer than anticipated
  – Planning under way to do the deployment in a short time once the middleware is packaged
  – LCG team will work directly with these sites during the deployment
  – Initially, testing activities to stabilise the service will take priority
  – Expect experiments to start to test the service by mid-August
LCG-0 Deployment Status

     Site              Scheduled    Status
Tier 1
 0   CERN              15/2/03      Done
 1   CNAF              28/2/03      Done
 2   RAL               28/2/03      Done
 3   FNAL              30/3/03      Done
 4   Taipei            15/4/03      Done
 5   FZK               30/4/03      Done
 6   IN2P3             7/5/03       In prep.
 7   BNL               15/5/03      Done
 8   Russia (Moscow)   21/5/03      In prep.
 9   Tokyo             21/5/03      Done
Tier 2
10   Legnaro (INFN)    After CNAF   Done

These sites deployed the LCG-0 pilot system and will be the first sites to deploy LCG-1
LCG-1 Distribution
• Packaging & configuration
  – Service machines – fully automated installation
    • LCFGng – either full or light version
  – Worker nodes – aim is to allow sites to use existing tools as required
    • LCFGng – provides automated installation
    • Installation scripts provided by us – manual installation
    • Instructions allowing system managers to use their existing tools
  – User interface
    • LCFGng
    • Installed on a cluster (e.g. Lxplus at CERN)
    • Pacman?
• Distribution
  – Distribution web site being set up now (updated from LCG-0)
    • Sets of rpm’s etc. organised by service and machine type
    • User guide, installation guides, release notes, etc., being written now
Middleware Status
• Integration work on EDG 2.0 has taken longer than hoped
  – EDG has not quite released version 2.0 – imminent
• LCG has a working system – able to run jobs:
  – Resource Broker: many changes since the previous version; needs significant testing to determine scalability and limitations
  – RLS: initial deployment will be a single instance (per VO) of LRC/RMC
    • Distributed service with many LRCs and indexes not yet debugged
    • Initially will run the LRC for all VOs at CERN with an Oracle service backend
  – Information system:
    • R-GMA is not yet stable
    • We will initially use MDS: work to improve stability (bug fixes) and redundancy – based on experience with EDG testbeds and NIKHEF, NorduGrid work
    • Intend to make a direct comparison between MDS and R-GMA on the certification testbed
• Waiting for bug fixes – of several components
• Still to do before release:
  – Reasonable level of testing
  – Packaging and preparation for deployment
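The two-level catalogue model behind the RLS deployment above – an RMC mapping logical file names to GUIDs, and an LRC mapping GUIDs to physical replicas – can be sketched as follows. The class and method names are illustrative only, not the actual RLS client API.

```python
# Minimal sketch of the RMC/LRC split in the RLS: the RMC maps logical
# file names (LFNs) to GUIDs, the LRC maps GUIDs to physical replica
# locations. Names are illustrative, not the real RLS interface.

import uuid

class ReplicaMetadataCatalog:
    """RMC: LFN -> GUID aliases."""
    def __init__(self):
        self._lfn_to_guid = {}

    def register(self, lfn):
        # Reuse the GUID if the LFN is already known.
        return self._lfn_to_guid.setdefault(lfn, str(uuid.uuid4()))

    def guid_of(self, lfn):
        return self._lfn_to_guid[lfn]

class LocalReplicaCatalog:
    """LRC: GUID -> set of physical replica URLs."""
    def __init__(self):
        self._replicas = {}

    def add_replica(self, guid, surl):
        self._replicas.setdefault(guid, set()).add(surl)

    def replicas_of(self, guid):
        return sorted(self._replicas.get(guid, set()))

# One RMC/LRC pair per VO, as in the initial single-instance deployment.
rmc, lrc = ReplicaMetadataCatalog(), LocalReplicaCatalog()
guid = rmc.register("lfn:/grid/atlas/run1234/hits.root")
lrc.add_replica(guid, "srm://castor.cern.ch/castor/atlas/hits.root")
lrc.add_replica(guid, "srm://gridka.de/pnfs/atlas/hits.root")
print(lrc.replicas_of(rmc.guid_of("lfn:/grid/atlas/run1234/hits.root")))
```

The later distributed service adds many LRCs plus an index (RLI) layer on top of the same two mappings.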
Certification & Testing
• This is the primary tool to stabilise and debug the system
  – Process and testbed have been set up
  – This is intended to run in parallel with the production service
• Certification testbed:
  – Set of 4 clusters at CERN – simulates a grid on a LAN
  – External sites that will be part of the certification testbed:
    • U. Wisconsin, FNAL – currently
    • Moscow, Italy – soon
• This testbed is being used to test the release candidate
  – Will be used to reproduce and resolve problems found in the production system, and to do regression testing for updated middleware components before deployment
Infrastructure for initial service - 2
• Security issues
  – Agreement on the set of CAs that all LCG sites will accept
    • EDG list of traditional CAs
    • FNAL on-line KCA
  – Agreement on a basic registration procedure for users
    • LCG VO, where users sign the Acceptable Usage Rules for LCG
    • 4 experiment VOs – will use the existing EDG services run by NIKHEF
    • Agreement on the basic set of information to be collected
  – All initial registrations will expire in 6 months – we know the procedures will change
  – Experiment VO managers will verify the bona fides of users
  – Acceptable Use Rules – adaptation based on EDG policy for now
  – Audit trails – basic set of tools and log correlations to provide the basic essential functions
Infrastructure – 3
• Operations service:
  – RAL is leading the sub-project on developing operations services
  – Initial prototype for July:
    • Basic monitoring tools
    • Mail lists and rapid communications/coordination for problem resolution
  – Monitoring:
    • GridICE (development of the DataTag Nagios-based tools) being integrated with the release candidate
    • Existing Nagios-based tools
    • GridPP job submission monitoring
    • Together these give reasonable coverage of basic operational issues
• User support
  – FZK is leading the sub-project to develop user support services
  – Initial prototype for July:
    • Web portal for problem reporting
    • Expectation that initially the experiments will triage problems and experts will submit LCG problems to the support service
Initial Deployment Services
[Diagram: Services at CERN – per-VO RLS instances (RMC & LRC) for ALICE, ATLAS, CMS, LHCb and the LCG team; a Proxy server; Resource Brokers RB-1 and RB-2; UIs (AFS users; UI-b on Lxplus); a disk SE; CE-1 and CE-2 in front of PBS-managed worker nodes, CE-3 and CE-4 in front of LSF-managed worker nodes; the LCG Registration Server and LCG CVS Server. VO servers for ALICE, ATLAS, CMS, LHCb, LCG and the LCG team are hosted at NIKHEF. Services at other sites – a UI, Proxy, RB, disk SE, and a CE with worker nodes under PBS or another local batch system.]
LCG-1 First Launch Information System Overview
[Diagram: GRISes on each site’s CEs and SEs register with the site GIIS (sites A–D); site GIISes register with the region GIISes (regions A1/A2 and B1/B2); two BDIIs (LDAP servers A and B, one primary and one secondary) query the region GIISes; the RB queries the BDII. Using multiple BDIIs requires RB changes.]
While serving the data from one directory (/dataCurrent/..) the BDII queries the regional GIISes to fill another directory structure (/dataNew/..). When this has finished the BDII is stopped, the directories are swapped, and the BDII is restarted; the restart takes less than 0.5 s. To improve availability during this window it was suggested (David) that the TCP port be switched off and the TCP protocol left to take care of the retry – this has to be tested. Another idea worth testing is to remove the site GIIS and configure the GRISes to register directly with the region GIISes.
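The swap-and-restart scheme described above can be sketched as a double-buffered directory with an atomic symlink flip. Paths, the `fetch_records` callable and the symlink mechanism are illustrative assumptions; the real BDII serves its data via an LDAP server that is stopped and restarted around the swap.

```python
# Sketch of the BDII double-buffer update: fill the idle directory with
# fresh query results while the other is being served, then swap with an
# atomic rename of a symlink. Paths and names are illustrative only.

import os
import tempfile

def refresh(base, fetch_records):
    """Write fresh data into the idle dir, then point 'current' at it."""
    live = os.path.realpath(os.path.join(base, "current"))
    idle = os.path.join(base,
                        "dataNew" if live.endswith("dataCurrent") else "dataCurrent")
    os.makedirs(idle, exist_ok=True)
    with open(os.path.join(idle, "ldif"), "w") as f:
        f.write(fetch_records())      # stands in for querying the region GIISes
    tmp = os.path.join(base, "current.tmp")
    os.symlink(idle, tmp)
    os.replace(tmp, os.path.join(base, "current"))   # atomic swap
    return idle

# Set up the initial state: 'current' serves /dataCurrent.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "dataCurrent"))
os.symlink(os.path.join(base, "dataCurrent"), os.path.join(base, "current"))

active = refresh(base, lambda: "dn: mds-vo-name=west,o=grid\n")
print(active.endswith("dataNew"))
```

Readers of `current/` never see a half-written directory, which is the point of the swap; the sub-second gap in the real system comes from restarting the LDAP daemon, not from the rename itself.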
LCG-1 First Launch Information System Sites and Regions
A region should not contain too many sites, since we have observed problems with MDS when a large number of sites are involved. To allow for future expansion, but not make the system too complex, the suggestion is to start with two regions and, if needed, split later into smaller regions. The regions are West of 0 degrees and East of it; the idea is to have one large region and one small one and see how they work. For the West, 2 region GIISes should be set up at the beginning, and for the East, 3.
[Diagram: West region – RAL, FNAL, BNL, served by the WEST1 and WEST2 region GIISes; East region – CERN, CNAF, Lyon, Moscow, FZK, Tokyo, Taiwan, served by the EAST1, EAST2 and EAST3 region GIISes.]
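The proposed split can be sketched as a simple partition of sites by longitude; the coordinates below are approximate and purely illustrative, not part of any LCG configuration.

```python
# Sketch of the two-region split: sites west of 0 degrees longitude go
# to the West region GIISes, the rest to the East ones. Longitudes are
# approximate and purely illustrative.

SITE_LONGITUDE = {            # degrees east (negative = west of 0)
    "RAL": -1.3, "FNAL": -88.3, "BNL": -72.9,
    "CERN": 6.1, "CNAF": 11.3, "LYON": 4.8, "MOSCOW": 37.6,
    "FZK": 8.4, "TOKYO": 139.7, "TAIWAN": 121.5,
}

def region_of(site):
    return "WEST" if SITE_LONGITUDE[site] < 0 else "EAST"

regions = {"WEST": [], "EAST": []}
for site in SITE_LONGITUDE:
    regions[region_of(site)].append(site)

print(sorted(regions["WEST"]))   # the small region
```

This reproduces the grouping in the diagram: three sites in the small West region, seven in the large East one.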
Plans for remainder of 2003
• Once the service has been deployed, priorities are:
  – Problem resolution and bug fixing – to address problems of reliability and scalability in the existing middleware
  – Incrementally adding additional functionality
  – Adding additional sites
  – Expanding the site resources accessible to the grid service
  – Addressing integration issues
    • Worker node WAN connectivity, etc.
  – Developing distributed prototypes of
    • Operations centres
    • User support services
    to provide a reasonable level of global coverage
  – Improving the security model
  – Developing tools to facilitate operating the service
Plans for 2003 – 2: Middleware functionality
• Top priority is problem resolution and issues of stability/scalability
• RLS developments
  – Distributed service – multiple LRCs, and RLI
  – Later: develop a service to replace the client command-line tools
• VOMS service
  – To permit user- and role-based authorization
• Validation of R-GMA
  – And then deployment of multiple registries – initial implementation has a singleton
• Grid File Access Library
  – LCG development: POSIX-like I/O layer to provide local file access
• Development of SRM/SE interfaces to other MSS
  – Work that must happen at each site with an MSS
• Basic upgrades
  – Compiler support
  – Move to Globus 2.4 (release supported through 2004)
• Cut-off for functionality improvements is October – in order to have a stable system for 2004
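The idea of a POSIX-like grid file access layer can be illustrated with a toy wrapper that resolves a logical file name through a catalogue and then falls back to ordinary file I/O. This is NOT the real GFAL interface; the catalogue, the `lfn:` handling and `grid_open` are invented for the sketch.

```python
# Toy illustration of a POSIX-like grid file access layer: resolve a
# logical file name through a replica catalogue, then open the chosen
# replica with ordinary file I/O. Not the real GFAL API.

import os
import tempfile

CATALOG = {}   # lfn -> list of physical paths (stand-in for the RLS)

def grid_open(name, mode="r"):
    """open() look-alike that accepts lfn: names as well as local paths."""
    if name.startswith("lfn:"):
        for replica in CATALOG.get(name, []):
            if os.path.exists(replica):   # pick the first reachable replica
                return open(replica, mode)
        raise FileNotFoundError(name)
    return open(name, mode)

# Register a local file as the single replica of a logical name.
fd, path = tempfile.mkstemp()
os.write(fd, b"event data")
os.close(fd)
CATALOG["lfn:/grid/cms/evt0"] = [path]

with grid_open("lfn:/grid/cms/evt0") as f:
    print(f.read())
```

The appeal of this shape is that application code keeps using the familiar open/read/close idiom whether the file is local or catalogued.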
Incremental Deployment
[Timeline of LCG middleware development: the July starting point contains as much as feasible – Globus, VDT, RLS (basic) and the RB – with continuous bug fixing & re-release thereafter. Planned additions before the October 1 cut-off, which defines the functionality for 2004: R-GMA, RLS (distributed), VOMS, a VDT upgrade, RH 8.x and gcc 3.2. EDG integration ends in September.]
Expansion of LCG resources
• Adding new sites
  – Will be a continuous process as sites become ready to join the service
  – Expect a minimum of 15 sites (15 countries have committed resources for LCG in 1Q04); reasonable to expect 18–20 sites by end 2003
  – LCG team will work directly with each Tier 1 (or the primary site in a region)
  – Tier 1s will provide first-level support for bringing Tier 2 sites into the service
    • Once the Tier 1s are stable this can go on in parallel in many regions
    • LCG team will provide second-level support for Tier 2s
• Increase the grid resources available at many sites
  – Requires LCG to demonstrate the utility of the service – experiments, in agreement with site managers, add resources to the LCG service
Operational plans for 2003
• Security
  – Develop full security policy
  – Develop longer-term user registration procedures and tools to support them
  – Develop an Acceptable Use policy for the longer term – requires legal review
• Operations
  – Develop distributed prototype operations centres/services
    • Monitoring developments driven by experience
  – Provide at least 16 hr/day global coverage – problem response
  – Basic level of resource-use accounting – by VO and user
  – Minimal level of security incident response and coordination
• User support
  – Development direction depends strongly on experience in the deployed system
  – Operations and user support must address the issues of interchanging problem reports – with each other and with sites, network ops, etc.
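Basic resource-use accounting by VO and user amounts to aggregating per-job usage records; the record fields below are illustrative, not an agreed LCG accounting format.

```python
# Sketch of basic resource-use accounting: aggregate per-job usage
# records by VO and by user. The record fields are illustrative.

from collections import defaultdict

jobs = [
    {"vo": "atlas", "user": "jsmith",  "cpu_hours": 12.0},
    {"vo": "atlas", "user": "jsmith",  "cpu_hours": 3.5},
    {"vo": "cms",   "user": "mrossi",  "cpu_hours": 8.0},
    {"vo": "atlas", "user": "kmuller", "cpu_hours": 1.5},
]

def usage_by(key, records):
    """Sum CPU hours over records, grouped by the given field."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r["cpu_hours"]
    return dict(totals)

print(usage_by("vo", jobs))     # totals per VO
print(usage_by("user", jobs))   # totals per user
```

The same grouping function serves both reports, which is why "by VO and user" is a modest extension once per-job records exist at all.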
Middleware roadmap
• Short term (2003)
  – Use what exists – try to stabilize, debug, fix problems, etc.
  – Exceptions may be needed – WN connectivity, client tools rather than services, user registration, …
• Medium term (2004 – ?)
  – Same middleware, but develop the missing services, remove the exceptions
  – Separate services from WNs – aim for more generic clusters
  – Initial tests of re-engineered middleware (service based, defined interfaces, protocols)
• Longer term (2005? – )
  – LCG service based on service definitions, interfaces and protocols – aim to be able to have interoperating, different implementations of a service
Inter-operability
• Since LCG will be VDT + higher-level EDG components:
• Sites running the same VDT version should be able to be part of LCG, or continue to work as now
• LCG (as far as possible) has the goal of appearing as a layer of services in front of a cluster, storage system, etc.
  – State of the art currently implies compromises …
Integration Issues
• LCG will try to be non-intrusive:
  – Will assume the base OS is already installed
  – Provide installation & configuration tool for service nodes
  – Provide recipes for installation of WNs – assume sites will use existing tools to manage their clusters
• No imposition of a particular batch system
  – As long as your batch system talks to Globus
    • (OK for LSF, PBS, Condor, BQS, FBSng)
• No longer a requirement for a shared filesystem between the gatekeeper and WNs – this was a problem for AFS, and NFS does not scale to large clusters
• Information publishing
  – Define what information a site should provide (accounting, status, etc.), rather than imposing tools
• But … maybe some compromises in the short term (2003)
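Defining what information a site must publish, rather than which tool must produce it, amounts to agreeing on a schema that published records can be checked against. The field names below are invented for illustration, not an agreed LCG information schema.

```python
# Sketch of "define the information, not the tool": an agreed set of
# required fields that a site's published record is validated against,
# whatever tool produced it. Field names are invented for illustration.

REQUIRED_FIELDS = {"site_name", "batch_system", "total_cpus",
                   "storage_tb", "status"}

def validate(record):
    """Return the sorted list of missing required fields (empty = valid)."""
    return sorted(REQUIRED_FIELDS - record.keys())

good = {"site_name": "RAL", "batch_system": "PBS", "total_cpus": 500,
        "storage_tb": 226, "status": "production"}
bad = {"site_name": "Tier2-X", "status": "testing"}

print(validate(good))   # []
print(validate(bad))
```

A check like this lets each site keep its own publishing tools while LCG verifies only the agreed content.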
Worker Node connectivity
• In general (and eventually) it cannot be assumed that the cluster nodes will have connectivity to remote sites
  – Many clusters are on non-routed networks (for many reasons)
  – Security issues
  – In any case this assumption will not scale
• BUT … to achieve this several things are necessary:
  – Some tools (e.g. replica management) must become services
  – Databases (e.g. the conditions DB) must either be replicated to each site (or equivalent), or accessed via a proxy service, or …
  – Analysis models must take this into account
  – Again, short-term exceptions (up to a point) are possible
• Current additions to LXbatch at CERN have this limitation
Timeline for the LCG services
[Timeline, 2003–2006: agree the LCG-1 spec; LCG-1 service opens – event simulation productions; stabilize, expand, develop; LCG-2 with upgraded middleware, management etc. – service for Data Challenges, batch analysis and simulation (CMS DC04); computing model TDRs and validation of the computing models; evaluation of 2nd-generation middleware; LCG-3 full multi-tier prototype batch + interactive service; TDR for Phase 2; acquisition, installation and testing of the Phase 2 service; Phase 2 service in production.]
Grid Deployment Organisation
[Organisation chart: the Grid Deployment Board (GDB) sets policies, strategy, scheduling, standards and recommendations for the Grid Deployment manager. CERN-based teams: LCG security group, grid infrastructure team, LCG operations team, experiment support team, core infrastructure, LCG toolkit integration & certification, and the joint Trillium/EDG/LCG testing team. Anticipated teams at other institutes: security tools, operations call centre, grid monitoring, and regional centre operations teams at many centres. A Grid Resource Coordinator handles resource requests (compute & storage resources and grid infrastructure services) from ALICE, ATLAS, CMS and LHCb.]
Conclusions
• Essential to start operating a service as soon as possible – we need 6 months to be able to develop this into a reasonably stable service
• Middleware components are late – but we will still deploy a service of reasonable functionality and scale
  – Much work will be necessary on testing and improving the basic service
• Several functional and operational improvements are expected during 3Q03
• Expansion of sites and resources foreseen during 2003 should provide adequate resources for the 2004 data challenges
• There are many issues to resolve and a lot of work to do – but this must be done incrementally on the running service
Conclusions
• From the point of view of the LCG plan, we are late in having testable middleware with the functionality we had hoped for
• We will keep to the July deployment schedule
  – We expect to have the major components – the user view of the middleware (i.e. via the RB) should not change
  – Expect to be able to do less testing and commissioning than planned
  – But hopefully, with a suitable process, we will incrementally improve and add functionality as it becomes available and is tested