GridPP12 Deployment Summary. Jeremy Coles ([email protected]), 1st February 2005
TRANSCRIPT
Contents
• LCG operations workshop
• EGEE structures
• Operations model
• Current status
• Support
• Planning
• Metrics
• Some of the recurring issues at GridPP12
• Future activities
Some operational issues
• Slow response from sites (central perception)
– Upgrades, response to problems, etc.
– Problems reported daily; some problems last for weeks
• Lack of staff available to fix problems
– All on vacation, …
• Misconfigurations (units, gridmap-file builds, user profiles, pools, …)
• Lack of configuration management: problems that are fixed reappear
• Lack of fabric management
– Is it GDA's responsibility to provide solutions to these problems?
• Lack of understanding (training?)
– Admins reformat disks of SE …
• Firewall issues: coordination between grid admins and firewall maintainers
• PBS problems
– Are we seeing the scaling limits of PBS?
• People not reading documentation …
Background to workshop
LCG Workshop Nov 2004
• Operational Security
– Incident Handling Process
– Variance in site support availability
– Reporting Channels
– Service Challenges
• Operational Support
– Workflow for operations & security actions
– What tools are needed to implement the model
– "24x7" global support: sharing operational load (CIC-on-duty)
– Communications (news)
– Problem Tracking System
– Defining responsibilities: problem follow-up, deployment of new releases
– Interface to User Support
LCG (EGEE) discussion on a superset of the topics discussed at GridPP11
LCG Workshop Nov 2004
• Fabric Management
– System installations (tools, interfacing tools with each other)
– Batch/scheduling systems (openPBS/Torque, MAUI, fair-share)
– Fabric monitoring
– Software installation
– Representation of site status (load) in the Information System
• Software Management
– Operations on and for VOs (add/remove/service discovery)
– Fault tolerance; operations on running services (stops, upgrades, restarts)
– Link to developers
– What level of intrusion can be tolerated on the WNs (farm nodes): application (experiment) software installation
– Removing/(re-adding) sites with (fixed) troubles
– Multiple views in the information system (maintenance)
GDB
LCG Grid Deployment Board
• One representative from each country (with a Regional Centre) involved in the LCG, and one representative from each experiment
• Chairman changes annually
• Meets in person once per month

What it does!
• Explores issues of global concern to the LCG community
• Makes decisions on deployment, operations and planning for LCG
• Provides mechanisms for resource forecasting

How?
• By calling upon experts to present the latest information on specific topics
• By creating and overseeing working groups to tackle important areas
• Currently three groups: Security, Networking and Quattor

Who is involved in UKI
• UK representative: John Gordon
• Security group coordinator: Dave Kelsey
• GDB secretary: Jeremy Coles
Proposed escalation procedure
• Because unstable and badly configured sites cause a big problem:
– Unstable sites that have frequent problems
• will appear on a list of bad sites
– Sites that do not respond to problem reports (including not upgrading middleware versions)
• will be removed from the information systems and maps
• will have to be re-certified to get back in
• will be reported to the GDB (LCG) or PMB (EGEE) representative as non-responsive
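The proposed escalation could be expressed as a simple policy function; a hypothetical sketch (the thresholds, state names and inputs are illustrative only — nothing here is an agreed procedure):

```python
# Hypothetical sketch of the proposed escalation procedure: sites with
# frequent problems go on a "bad sites" list; sites that do not respond
# are removed from the information system and must be re-certified.

def classify_site(open_problems: int, days_unresponsive: int,
                  bad_threshold: int = 3, response_limit_days: int = 14) -> str:
    """Return the escalation state for a site."""
    if days_unresponsive > response_limit_days:
        # Removed from the information systems and maps; must be
        # re-certified, and reported to the GDB/PMB representative.
        return "removed"
    if open_problems >= bad_threshold:
        # Appears on the list of bad sites.
        return "bad-list"
    return "ok"

print(classify_site(open_problems=1, days_unresponsive=2))    # ok
print(classify_site(open_problems=5, days_unresponsive=2))    # bad-list
print(classify_site(open_problems=1, days_unresponsive=30))   # removed
```

The thresholds would of course be set by the GDB, not hard-coded as here.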
ROCs
Regional Operations Centres (ROCs)
• Part of the EGEE SA1 activity (http://egee-sa1.web.cern.ch/egee%2Dsa1/)
• The regions are CERN, France, Italy, UK & Ireland, Germany & Switzerland, Northern Europe, South West Europe, South East Europe, Central Europe and Russia

What they do
• Coordinate regional efforts in all activities (support, operations representation, security)
• Take up operations and deployment issues at cross-project meetings
• Provide a forum for agreeing the work needed (pre-production service)

How?
• Set up ROC structures within the region
• Create common groups to work on areas like pre-production services and helpdesk interfaces
• Meet fortnightly via telephone (http://agenda.cern.ch/displayLevel.php?fid=339) to discuss regional issues and problems

Who is involved for the UK?
• General: John Gordon
• Support: Andy Richards
• Security: Romain Wartel
EGEE Background
CICs
Core Infrastructure Centres (CICs)
• The CICs cover more than one region and deal with operations issues. There are currently 4 CICs: France, Italy, UK & Ireland and CERN
• Coordinated by the Operations Management Centre team at CERN
• Meet weekly via telephone (http://agenda.cern.ch/displayLevel.php?fid=258)
• Each CIC is "on duty" for 1 week in 4

What they do!
• Operational and performance monitoring
• Troubleshooting and following up identified problems
• Operate general grid services (e.g. VO-related services)
• Provide information via the CIC portal http://cic.in2p3.fr/

How?
• Review monitoring data such as GStat and daily test results
• Enter identified problems into Savannah (moving to the GGUS portal soon)
• Follow up problems using email and telephone contacts
• Troubleshoot using experts, Wiki etc.

Who is involved in UKI
• Steve Traylen & Philippa Strange
CIC portal
http://cic.in2p3.fr/
• Regional Operations Centres (9)
– Act as front-line support for user and operations issues
– Provide local knowledge and adaptations
• User Support Centre (GGUS)
– At FZK; provides a single point of contact (service desk)
• Core Infrastructure Centres (4)
– CICs build on the LCG GOC at RAL
– Also run essential infrastructure services
– Provide support for other (non-LHC) applications
– Provide 2nd-level support to ROCs
• Coordination:
– At CERN (Operations Management Centre) and CIC for HEP
LCG-2/EGEE Operations
• Taipei provides an operations centre and a 2nd instance of GGUS, starting to build round-the-clock coverage
• Discussions with Grid3/OSG on how to collaborate on ops support
– Share coverage?
(New) Operations Model
• Operations Centre role rotates through the CICs
– CIC on duty for one week
– Procedures and tasks are currently defined
• First operations manual is available (a living document): tools, frequency of checks, escalation procedures, hand-over procedures
– CIC-on-duty website
– Problems are tracked with a tracking tool
• Now central in Savannah
• Migration to GGUS (Remedy) with links to the ROCs' PT tools
• Problems can be added at GGUS or ROC level
– CICs monitor the service, spot and track problems
• Interact with sites on short-term problems (service restarts etc.)
• Interact with ROCs on longer, non-trivial problems
• All communication with a site is visible to the ROC
• Build FAQs
– ROCs support
• Installation, first certification
• Resolving complex problems
Operations Model
[Diagram: the OMC at the centre, connected to the four CICs; the CICs connect to the ROCs, the ROCs to their Regional Centres (RCs), with links out to other Grids]
How does support map onto this?
[Diagram: the same OMC/CIC/ROC/RC structure, with a helpdesk at each ROC; problems are tracked centrally in Savannah, migrating to GGUS]
How does user support map onto this?
[Diagram: the same structure again, with VO1, VO2 and VO3 feeding into GGUS alongside the ROC helpdesks and Savannah]
We need to work out a better model for this in the UK.
Site updates
 No  Site                GIIS Host                     sanity  GridPP11   GridPP12
  1  BHAM-LCG2           epcf36.ph.bham.ac.uk          ok      LCG-2_2_0  LCG-2_2_0
  2  BitLab-LCG2         dgc-grid-35.brunel.ac.uk      ok      LCG-2_2_0  LCG-2_2_0
  3  CAVENDISH-LCG2      farm012.hep.phy.cam.ac.uk     warn    LCG-2_1_1  LCG-2_2_0
  4  IC-LCG2             gw39.hep.ph.ic.ac.uk          warn    LCG-2_2_0  LCG-2_3_0
  5  Lancs-LCG2          lunegw.lancs.ac.uk            ok      LCG-2_1_1  LCG-2_3_0
  6  LivHEP-LCG2         hepgrid2.ph.liv.ac.uk         warn    LCG-2_1_1  LCG-2_2_0
  7  ManHEP-LCG2         bohr0001.tier2.hep.man.ac.uk  ok      LCG-2_1_1  LCG-2_3_0
  8  OXFORD-01-LCG2      t2ce01.physics.ox.ac.uk       ok      LCG-2_1_1  LCG-2_3_0
  9  QMUL-eScience       ce01.ph.qmul.ac.uk            ok      LCG-2_1_0  LCG-2_1_0
 10  RAL-LCG2            lcgce02.gridpp.rl.ac.uk       warn    LCG-2_1_1  LCG-2_2_0
 11  RALPP-LCG           heplnx131.pp.rl.ac.uk         ok      LCG-2_2_0  LCG-2_3_0
 12  RHUL-LCG2           ce1.pp.rhul.ac.uk             warn    LCG-2_1_1  LCG-2_2_0
 13  ScotGRID-Edinburgh  glenlivet.epcc.ed.ac.uk       ok      LCG-2_0_0  LCG-2_2_0
 14  scotgrid-gla        ce1-gla.scotgrid.ac.uk        warn    LCG-2_2_0  LCG-2_2_0
 15  SHEFFIELD-LCG2      ce.gridpp.shef.ac.uk          warn    LCG-2_1_1  LCG-2_3_0
 16  UCL-CCC             ce-a.ccc.ucl.ac.uk            ok      LCG-2_2_0  LCG-2_2_0
 17  UCL-HEP             pc31.hep.ucl.ac.uk            warn    LCG-2_1_1  LCG-2_2_0
 18  Durham                                                               LCG-2_2_0
Most sites have stated an intention to move to SL3 and LCG 2.3 over the next few weeks
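The upgrade picture in the table can be summarised with a few lines of code; a minimal sketch, using the GridPP12 column transcribed above (a hand-copied snapshot, not a live GOCDB query):

```python
from collections import Counter

# Middleware versions reported at GridPP12, from the site-update table above.
gridpp12_versions = {
    "BHAM-LCG2": "LCG-2_2_0", "BitLab-LCG2": "LCG-2_2_0",
    "CAVENDISH-LCG2": "LCG-2_2_0", "IC-LCG2": "LCG-2_3_0",
    "Lancs-LCG2": "LCG-2_3_0", "LivHEP-LCG2": "LCG-2_2_0",
    "ManHEP-LCG2": "LCG-2_3_0", "OXFORD-01-LCG2": "LCG-2_3_0",
    "QMUL-eScience": "LCG-2_1_0", "RAL-LCG2": "LCG-2_2_0",
    "RALPP-LCG": "LCG-2_3_0", "RHUL-LCG2": "LCG-2_2_0",
    "ScotGRID-Edinburgh": "LCG-2_2_0", "scotgrid-gla": "LCG-2_2_0",
    "SHEFFIELD-LCG2": "LCG-2_3_0", "UCL-CCC": "LCG-2_2_0",
    "UCL-HEP": "LCG-2_2_0", "Durham": "LCG-2_2_0",
}

# Count how many sites run each release: 11 on 2_2_0, 6 on 2_3_0, 1 on 2_1_0.
counts = Counter(gridpp12_versions.values())
print(counts)
```

So a third of the sites are already on LCG 2.3, consistent with the stated intention to move to SL3 and LCG 2.3 shortly.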
Monitoring progress
http://goc.grid-support.ac.uk/gridsite/monitoring/
Produced: certification tests, GPPMon, maps, RSS feeds
Can we: have a single view? Integrate network info?
Today's functional test results

All sites below are in region UKI and were tested on 01/02/2005 06:05. The full matrix covers: Software Version, CA RPMs Version, BrokerInfo, R-GMA client, CSH test, BDII, LDAP (RM), copy-and-register WN -> defaultSE, copy defaultSE -> WN, replicate defaultSE to castorgrid, 3rd-party replicate castorgrid to defaultSE, 3rd-party copy castorgrid to WN, delete replica from defaultSE, GFAL infosys, and the lcg-* equivalents (lcg-cr -> defaultSE, lcg-cp defaultSE -> WN, lcg-rep defaultSE -> castorgrid, lcg-rep castorgrid to defaultSE, lcg-cp castorgrid to WN, lcg-del from defaultSE). Condensed per-site summary:

Site                CE                            Version    BDII                       Results
BHAM-LCG2           epcf36.ph.bham.ac.uk          LCG-2_2_0  lxn1189.cern.ch            all OK
BITLab-LCG2         dgc-grid-35.brunel.ac.uk      LCG-2_2_0  -                          all ??
CAVENDISH-LCG2      serv03.hep.phy.cam.ac.uk      LCG-2_2_0  -                          all ??
csTCDie             gridgate.cs.tcd.ie            LCG-2_2_0  cagraidsvr19.cs.tcd.ie     basic tests OK; data tests FAILED except GFAL infosys
Durham              helmsley.dur.scotgrid.ac.uk   LCG-2_2_0  -                          all ??
IC-LCG2             gw39.hep.ph.ic.ac.uk          LCG-2_3_0  -                          all ??
Lancs-LCG2          lunegw.lancs.ac.uk            LCG-2_3_0  lxn1189.cern.ch            all OK
LivHEP-LCG2         hepgrid2.ph.liv.ac.uk         FAILED     -                          all ??
ManHEP-LCG2         bohr0001.tier2.hep.man.ac.uk  LCG-2_3_0  lcgbdii02.gridpp.rl.ac.uk  all OK
OXFORD-01-LCG2      t2ce01.physics.ox.ac.uk       LCG-2_3_0  lcgbdii02.gridpp.rl.ac.uk  all OK
QMUL-eScience       ce01.ph.qmul.ac.uk            LCG-2_1_0  lxn1189.cern.ch            mostly OK; some ??; lcg-* tests n/a
RAL-LCG2            lcgce02.gridpp.rl.ac.uk       LCG-2_2_0  lcgbdii02.gridpp.rl.ac.uk  mostly OK; CSH and lcg-* tests ??
RALPP-LCG           heplnx201.pp.rl.ac.uk         LCG-2_3_0  lcgbdii02.gridpp.rl.ac.uk  all OK
RHUL-LCG2           ce1.pp.rhul.ac.uk             FAILED     -                          all ??
ScotGRID-Edinburgh  glenlivet.epcc.ed.ac.uk       LCG-2_2_0  lcgbdii02.gridpp.rl.ac.uk  all OK
scotgrid-gla        ce1-gla.scotgrid.ac.uk        FAILED     -                          all ??
SHEFFIELD-LCG2      lcgce0.shef.ac.uk             LCG-2_3_0  lcgbdii02.gridpp.rl.ac.uk  all OK except CSH test FAILED
UCL-CCC             ce-a.ccc.ucl.ac.uk            LCG-2_2_0  lxn1189.cern.ch            all OK

Legend: job list match failed / critical tests failed / OK / test job still waiting for execution / job submission failed (Job Manager) / scheduled downtime
• The tests show similar patterns across EGEE as a whole
• How can the tests be made more usable by those who can react?
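One answer to the usability question is to collapse each site's row of results into a short, actionable message for the site admin. A hypothetical sketch (the test names and the `OK`/`FAILED`/`??` encoding are illustrative, not the real test-page format):

```python
# Hypothetical: turn a row of functional-test results into a short
# message naming exactly what needs attention at a site.

def summarise(site: str, results: dict) -> str:
    failed = [t for t, r in results.items() if r == "FAILED"]
    missing = [t for t, r in results.items() if r == "??"]
    if not failed and not missing:
        return f"{site}: all tests OK"
    parts = []
    if failed:
        parts.append("failed: " + ", ".join(failed))
    if missing:
        parts.append("no result: " + ", ".join(missing))
    return f"{site}: " + "; ".join(parts)

print(summarise("SHEFFIELD-LCG2",
                {"BrokerInfo": "OK", "CSH test": "FAILED", "BDII": "OK"}))
# SHEFFIELD-LCG2: failed: CSH test
```

Something of this shape could feed site-specific emails or RSS items rather than one large colour-coded table.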
Accounting progress
http://goc.grid-support.ac.uk/gridsite/accounting/
Well done: Imperial College, Manchester, Oxford, RAL Tier-1, RAL PPD, Edinburgh, Glasgow, UCL-CCC, Durham
What next? More sites!! Provide older data. Analyse & use.
ALL sites need to keep their log files. Details in the accounting page FAQ.
Ganglia
Well done: Manchester, Edinburgh, Lancaster, QMUL, Sheffield, Bristol, Oxford, Liverpool
What next? We need all sites. Review against MoUs. Use data for warnings?
http://www.gridpp.ac.uk/ganglia/
Status of planning
We have developed a plan for deployment at a high level. The deliverables form part of the GridPP2 project map. Each area has consequences for Tiers 1, 2 and 3 in, for example:
• Service challenges
• Data challenges
• Networking
• Security
• Resource provision
• Core services
• MoU commitments
• Functionality
• Accounting
• Scheduling of use
• Support
• …
It is still evolving and there is a lot of work here!
What metrics and why?
• Number of sites in production (simple count based on GOCDB information?)
• Number of registered users (count of certificates issued?)
• Number of active users
• Number of supported VOs
• Percentage of available resources utilised
• Peak number of concurrent jobs (measured by GStat for grid jobs)
• Average number of concurrent jobs (measured by GStat for grid jobs)
• Number of jobs not terminated by themselves or the batch system
• Accumulated site downtime per week (scheduled and unscheduled)
• Total CPUs deployed
• CPUs available
• Storage available and used
• CPU hours per VO
• UK relative contribution to experiments
The list shared before…
Subject of DTEAM discussion 16:00-18:00 today. What is actually useful now?
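Several of the proposed metrics are simple ratios or sums once the raw numbers are collected; a minimal sketch with made-up figures (the numbers and function names are illustrative, not real site data):

```python
# Two of the proposed metrics as trivial computations over collected data.

def utilisation(cpu_hours_used: float, cpu_hours_available: float) -> float:
    """Percentage of available resources utilised."""
    return 100.0 * cpu_hours_used / cpu_hours_available

def weekly_downtime(outage_hours) -> float:
    """Accumulated site downtime per week (scheduled + unscheduled)."""
    return sum(outage_hours)

# Illustrative values only:
print(round(utilisation(6300.0, 10000.0), 1))   # 63.0 (percent)
print(weekly_downtime([2.0, 0.5, 4.0]))         # 6.5 (hours)
```

The hard part is not the arithmetic but agreeing the data sources (GOCDB, GStat, accounting logs) and definitions, which is exactly what the DTEAM discussion is for.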
LHCb DC feedback
                    Jobs (k)   %Sub     %Remain
Submitted           211        100.0%
Cancelled           26         12.2%
Remaining           185        87.8%    100.0%
Aborted (not run)   37         17.6%    20.1%
Running             148        70.0%    79.7%
Aborted (run)       34         16.2%    18.5%
Done                113        53.8%    61.2%
Retrieved           113        53.8%    61.2%
LCG Job Submission Summary Table
LCG Efficiency: 61 %
… but note Tony Cass's earlier comments on improving performance
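The quoted 61% follows directly from the job counts; a quick check of the arithmetic (counts in thousands, taken from the table above, so the percentages agree with the table only to within rounding):

```python
# Job counts (in thousands) from the LHCb DC summary table.
submitted, cancelled = 211, 26
remaining = submitted - cancelled   # 185: jobs not cancelled
done = 113

# "LCG Efficiency" is Done as a fraction of the non-cancelled jobs.
print(round(100.0 * done / remaining))   # 61, the quoted "LCG Efficiency: 61 %"

# The table's own percentages were computed from unrounded job counts,
# so recomputing from the thousands gives e.g. 87.7 vs the table's 87.8.
print(round(100.0 * remaining / submitted, 1))
```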
D0 MC performance
CE                               Success   Failed
bohr0001.tier2.hep.man.ac.uk     237       3
cclcgceli01.in2p3.fr             -         14
grid-ce.physik.uni-wuppertal.de  -         -
gridkap01.fzk.de                 2564      19
golias25.farm.particle.cz        198       15
heplnx131.pp.rl.ac.uk            246       4
lcgce02.gridpp.rl.ac.uk          293       10
mu6.matrix.sara.nl               397       7
tbn18.nikhef.nl                  154       2
Total                            4089      74

Efficiency: 98%. Is this "much less than production quality"?
(Per-CE success rates highlighted on the slide: bohr0001 98.8%, heplnx131 98.4%, lcgce02 96.7%)
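The 98% figure follows from the totals in the table; a quick check (counts taken from the table above, reading "-" as zero successes):

```python
# Per-CE (success, failed) counts from the D0 MC table.
results = {
    "bohr0001.tier2.hep.man.ac.uk": (237, 3),
    "cclcgceli01.in2p3.fr": (0, 14),
    "grid-ce.physik.uni-wuppertal.de": (0, 0),
    "gridkap01.fzk.de": (2564, 19),
    "golias25.farm.particle.cz": (198, 15),
    "heplnx131.pp.rl.ac.uk": (246, 4),
    "lcgce02.gridpp.rl.ac.uk": (293, 10),
    "mu6.matrix.sara.nl": (397, 7),
    "tbn18.nikhef.nl": (154, 2),
}

success = sum(s for s, f in results.values())   # 4089
failed = sum(f for s, f in results.values())    # 74
print(round(100.0 * success / (success + failed)))   # 98
```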
D0 MC performance
LCG Efficiency: 99%. We need to be careful with what we mean!

Status      Jobs   Explanation
Aborted     35     LCG error, e.g. file not found
Cancelled   21     Done by us for various reasons
Cleared     5      Done by us: enough events
Running     10     D0 software error: infinite loop
Scheduled   3      Can be OK; CZ disk crash
Total       74     Really 35 LCG errors
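The 99% comes from counting only the 35 genuine LCG errors among the 74 failures, while the earlier 98% charged all 74 against LCG; the arithmetic:

```python
success = 4089       # successful jobs from the D0 MC table
all_failures = 74    # every non-success outcome
lcg_errors = 35      # only the Aborted jobs were genuine LCG errors

raw = 100.0 * success / (success + all_failures)
corrected = 100.0 * success / (success + lcg_errors)
print(round(raw))        # 98
print(round(corrected))  # 99
```

This is exactly why "we need to be careful with what we mean": the same run yields 98% or 99% depending on which failures are attributed to the middleware.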
Concept behind the “pre-production” service:
• New middleware (gLite, …) can be demonstrated and validated before being deployed in production
• Understand the migration strategy to 2nd generation middleware
• Use the existing production service as the baseline comparison
GridPP12 Deployment issues
• Ability to plan (service challenges, networking, resources)
• Responsiveness of sites
• Security
• gLite, gLite, gLite
• Tier-2s operating as real Tier-2s
• Use of Tier-2s (experiment models)
• Metrics ("get fit" plan)
• Use of Tier-2 SEs
• SRM = Storage Really Matters!
• Engagement with experiments
• On-demand tests and other tools
• Support
• Communications
(All of these concern what is a "production" service.)
Deployment web-pages
WORK IN PROGRESS
Summary
• LCG workshop was useful. Some progress but not enough answers. Roadmaps proposed.
• EGEE has a deployment structure and GridPP deployment works within the UKI ROC/CIC
• We need to unravel the support problems and introduce something that works well for the UK
• Sites are responding to requests but sometimes slowly. Better communications are needed.
• We still have significant planning challenges to overcome: LCG SC1 failed, there is no clear gLite migration strategy (gLite could require a step back in deployment terms!), and the experiment computing models have implications
• By the next GridPP meeting we must be reporting on carefully defined metrics
• THANK YOU to everyone involved. Please remember - we need your feedback to improve the deployment mechanisms and GridPP service.