SouthGrid Status
Pete Gronbech: 2nd April 2009, GridPP22, UCL
UK Tier 2 reported CPU – Historical View to Q109
[Chart: monthly CPU delivered, in K SPECint2000 hours, Jan-08 to Mar-09, for UK-London-Tier2, UK-NorthGrid, UK-ScotGrid and UK-SouthGrid]
UK Tier 2 reported CPU – Q1 2009
SouthGrid Sites Accounting as reported by APEL
[Chart: monthly CPU delivered, in K SPECint2000 hours, Jan-08 to Mar-09, for JET, BHAM, BRIS, CAM, OX and RALPPD]
Job distribution
Site Upgrades since GridPP21
• RALPPD: increase of 640 cores (1568 kSI2K) + 380 TB
• Cambridge: 32 cores (83 kSI2K) + 20 TB
• Birmingham: 64 cores on the PP cluster and 128 cores on the HPC cluster, which add ~430 kSI2K
• Bristol: original cluster replaced by new quad-core systems (16 cores) + an increased share of the HPC cluster: 53 kSI2K + 44 TB
• Oxford: extra 208 cores, 540 kSI2K + 60 TB
• JET: extra 120 cores, 240 kSI2K
New Total Q109 SouthGrid

Site         Storage (TB)   CPU (kSI2K)   % of MoU CPU   % of MoU Disk
RALPPD       633            2815          329.63%        374.56%
Oxford       160            972           592.68%        363.64%
Cambridge    60             455           469.07%        230.77%
Bristol      55             120           96.77%         343.75%
Birmingham   90             700           304.35%        142.86%
EDFA-JET     1.5            483           -              -
Totals       999.5          5545          377.47%        314.31%

Percentages are delivered capacity relative to the GridPP MoU commitment; no MoU figures were shown for EDFA-JET.
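Each percentage is simply delivered capacity divided by the MoU commitment. As a quick sanity check on the table, the snippet below derives each site's implied MoU from the figures shown and confirms they sum to the value implied by the Totals row (the figures are copied from the table; the derived MoU values are implied rather than quoted on the slide):

```python
# Cross-check of the table above: derive each site's implied GridPP MoU
# commitment as delivered / (% of MoU), then sum the sites.
sites = {
    # name: (storage_tb, cpu_ksi2k, pct_mou_cpu, pct_mou_disk)
    "RALPPD":     (633, 2815, 329.63, 374.56),
    "Oxford":     (160,  972, 592.68, 363.64),
    "Cambridge":  ( 60,  455, 469.07, 230.77),
    "Bristol":    ( 55,  120,  96.77, 343.75),
    "Birmingham": ( 90,  700, 304.35, 142.86),
}

mou_cpu = sum(cpu / (pc / 100) for _, cpu, pc, _ in sites.values())
mou_disk = sum(tb / (pd / 100) for tb, _, _, pd in sites.values())
print(f"implied SouthGrid MoU: {mou_cpu:.0f} kSI2K CPU, {mou_disk:.0f} TB disk")
# Totals row implies the same: 5545 kSI2K at 377.47% -> 1469 kSI2K,
# and 999.5 TB at 314.31% -> 318 TB.
```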
Network rate capping
• Oxford recently had its network link rate-capped to 100 Mb/s.
• This was the result of continuous 300-350 Mb/s traffic caused by CMS commissioning testing.
– As it happens, this testing completed at the same time as we were capped, so we passed the test, and current normal use is not expected to be this high.
• Oxford's JANET link is actually 2 x 1 Gbit links, which had become saturated.
• The short-term solution is to rate-cap only JANET traffic, to 200 Mb/s; all other on-site traffic remains at 1 Gb/s.
• The long-term plan is to upgrade the JANET link to 10 Gb/s within the year.
SPEC Benchmarking
• Purchased the SPEC CPU2006 benchmark suite.
• Ran it using the HEPiX scripts, i.e. in the HEP-SPEC06 way (the aggregation step is sketched below).
• Using the HEP-SPEC06 benchmark should provide a level playing field.
• In the past, sites could choose any one of the many published values on the SPEC benchmark site.
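For illustration: HEP-SPEC06 is built on the C++ subset of SPEC CPU2006, run with fixed compiler flags and typically one copy per core, and SPEC combines per-benchmark ratios with a geometric mean. A minimal sketch of that aggregation step, assuming the per-benchmark ratios are already in hand (the benchmark names are the real CPU2006 C++ tests, but the values are invented):

```python
import math

# Hypothetical per-benchmark ratios from one copy of the C++ ("all_cpp")
# subset of SPEC CPU2006; the numbers are invented for illustration.
ratios = {
    "444.namd": 11.2,
    "447.dealII": 13.5,
    "450.soplex": 10.1,
    "453.povray": 14.8,
    "471.omnetpp": 8.9,
    "473.astar": 9.6,
    "483.xalancbmk": 12.3,
}

def geometric_mean(values):
    """Geometric mean, as SPEC uses to combine per-benchmark ratios."""
    vals = list(values)
    return math.exp(sum(math.log(v) for v in vals) / len(vals))

print(f"per-copy score: {geometric_mean(ratios.values()):.2f}")
```

The point of the HEPiX recipe is that every site runs the same subset the same way, so the resulting numbers are directly comparable across sites.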
Staff Changes
• Jon Wakelin and Yves Coppens left in Feb 09.
• Kashif Mohammad started in Jan 09 as the deputy coordinator for SouthGrid.
• Chris Curtis will replace Yves, starting in May. He is currently doing his PhD on the ATLAS project.
• The Bristol post will be advertised; it is jointly funded by IS and GridPP.
gridppnagios
Resilience
• What do we mean by resilience?
• The ability to maintain high availability and reliability of our grid service
• Guard against failures:
– Hardware
– Software
Availability / Reliability
Hardware Failures
• The hardware
– Critical servers
• Good-quality equipment
• Dual PSUs
• Dual mirrored system disks, and RAID for storage arrays
• All systems have 3-year maintenance with an on-site spares pool (disks, PSUs, IPMI cards)
• Similar kit bought for servers, so hardware can be swapped
• IPMI cards allow remote operation and control
– The environment
• UPS for critical servers
• Network-connected PDUs for monitoring and power switching
• Professional computer room / rooms
• Air conditioning: need to monitor the temperature
• Actions based on the above environmental monitoring (a watchdog sketch follows this list):
– Configure your UPS to shut down systems in the event of sustained power loss
– Shut down the cluster in the event of high temperature
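One way to act on the environmental monitoring above is a small temperature watchdog. A minimal sketch, assuming ipmitool is installed and the host exposes IPMI temperature sensors; the 35 C threshold and the shutdown action are illustrative placeholders, not SouthGrid's actual values:

```python
#!/usr/bin/env python3
"""Temperature watchdog sketch: read IPMI sensors, alert when hot."""
import re
import subprocess

THRESHOLD_C = 35.0  # illustrative trip point


def temperatures():
    """Yield (sensor_name, degrees_C) pairs parsed from ipmitool output."""
    out = subprocess.run(
        ["ipmitool", "sdr", "type", "Temperature"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        m = re.search(r"(\d+(?:\.\d+)?)\s*degrees C", line)
        if m:
            yield line.split("|")[0].strip(), float(m.group(1))


def main():
    hot = [(name, t) for name, t in temperatures() if t > THRESHOLD_C]
    for name, t in hot:
        print(f"ALERT: {name} at {t} C (threshold {THRESHOLD_C} C)")
    # A real watchdog would now drain the batch queues and power off
    # worker nodes via a site-specific shutdown script.


if __name__ == "__main__":
    main()
```

In practice a cron job or Nagios check would run something like this periodically, escalating to a cluster shutdown only after several consecutive high readings.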
Hardware continued
• So, having guarded against the hardware failing, if it does fail we need to ensure rapid replacement.
• Restore from backups, or reinstall.
• Automated installation system:
– PXE, kickstart, cfengine
– Good documentation
• Duplication of critical servers:
– Multiple CEs
– Virtualisation of some services allows migration to alternative VM servers (MON, BDII and CEs)
• Less reliance on external services:
– Could set up a local WMS and a top-level BDII
Software Failures
• The main cause of loss of availability is software failures:
– Misconfiguration
– Fragility of gLite middleware
– OS system problems (simple checks for these are sketched below)
• Disks filling up
• Service failures (e.g. ntp)
• Good communications can help solve problems quickly:
– Mailing lists, wikis, blogs, meetings
– Good monitoring and alerting (Nagios etc.)
– Learn from mistakes: update systems and procedures to prevent reoccurrence.
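The two OS-level failure modes named above (a filesystem filling up, a service such as ntp dying) are cheap to detect. A minimal sketch of the kind of host check a Nagios plugin would wrap; the path, threshold and process name are illustrative:

```python
#!/usr/bin/env python3
"""Host-check sketch: disk nearly full, or ntp daemon dead."""
import shutil
import subprocess


def disk_usage_ok(path="/var", limit=0.90):
    """Return False if the filesystem holding `path` is over `limit` full."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total < limit


def ntpd_running():
    """Crude liveness check: is an ntpd process present?"""
    return subprocess.run(["pgrep", "-x", "ntpd"],
                          capture_output=True).returncode == 0


if __name__ == "__main__":
    if not disk_usage_ok():
        print("WARNING: /var is over 90% full")
    if not ntpd_running():
        print("CRITICAL: ntpd is not running")
```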
Recent example
• Many SAM failures, with occasional passes
• All test jobs pass
• Almost all ATLAS jobs pass
• Error logs revealed messages about the proxy not being valid yet!
• ntp on the SE head node had stopped
• AND cfengine had been switched off on that node (so there was no automatic check and restart)
• A SAM test always gets a new proxy, so if it got through the WMS and onto our cluster, into a reserved express-queue slot, within 4 minutes, it would fail.
• In this case the SAM tests were not accurately reflecting the usability of our cluster, BUT they were showing a real problem (the clock-skew mechanism is illustrated below).
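The mechanism is worth spelling out: with ntp stopped, the SE head node's clock drifted behind, so a freshly issued proxy's start time appeared to lie in the future. A toy illustration of why the fast SAM jobs failed while slower jobs with older proxies passed (the 4-minute skew matches the incident described; the timestamps are invented):

```python
#!/usr/bin/env python3
"""Toy model: a node whose clock runs behind rejects a fresh proxy."""
from datetime import datetime, timedelta

CLOCK_SKEW = timedelta(minutes=4)  # how far the SE node's clock lagged


def node_accepts(proxy_not_before, true_arrival_time):
    """A proxy is unusable before its notBefore time, as judged by the
    node's own (skewed) clock."""
    node_clock = true_arrival_time - CLOCK_SKEW
    return node_clock >= proxy_not_before


issued = datetime(2009, 3, 20, 12, 0, 0)      # proxy notBefore timestamp
sam_job = issued + timedelta(minutes=3)       # fresh proxy, express queue
atlas_job = issued + timedelta(minutes=30)    # older proxy, normal queue

print("SAM test accepted: ", node_accepts(issued, sam_job))    # False
print("ATLAS job accepted:", node_accepts(issued, atlas_job))  # True
```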
Conclusions
• These systems are extremely complex.
• Automatic configuration and good monitoring can help, but systems need careful tending.
• Sites should adopt best practice and learn from others.
• We are improving, but it's an ongoing task.