SouthGrid Status
Pete Gronbech: 2nd April 2009, GridPP22, UCL
UK Tier 2 reported CPU – Historical View to Q109
[Chart: monthly CPU delivered, in K SPECint2000 hours, Jan-08 to Mar-09, for UK-London-Tier2, UK-NorthGrid, UK-ScotGrid and UK-SouthGrid]
UK Tier 2 reported CPU – Q1 2009
SouthGrid Sites Accounting as reported by APEL
[Chart: monthly CPU delivered, in K SPECint2000 hours, Jan-08 to Mar-09, for JET, BHAM, BRIS, CAM, OX and RALPPD]
Job distribution
Site Upgrades since GridPP21
• RALPPD: increase of 640 cores (1568 kSI2K) + 380 TB
• Cambridge: 32 cores (83 kSI2K) + 20 TB
• Birmingham: 64 cores on the PP cluster and 128 cores on the HPC cluster, which add ~430 kSI2K
• Bristol: original cluster replaced by new quad-core systems (16 cores) + an increased share of the HPC cluster: 53 kSI2K + 44 TB
• Oxford: extra 208 cores, 540 kSI2K + 60 TB
• JET: extra 120 cores, 240 kSI2K
New Total Q109 SouthGrid

Site         Storage (TB)   CPU (kSI2K)   % of MoU CPU   % of MoU Disk
RALPPD       633            2815          329.63%        374.56%
Oxford       160            972           592.68%        363.64%
Cambridge    60             455           469.07%        230.77%
Bristol      55             120           96.77%         343.75%
Birmingham   90             700           304.35%        142.86%
EDFA-JET     1.5            483           -              -
Totals       999.5          5545          377.47%        314.31%

Percentages are delivered capacity relative to the GridPP MoU commitment; no MoU figures were shown for EDFA-JET.
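Each percentage is simply delivered capacity divided by the MoU commitment. As a quick sanity check on the table, the snippet below derives each site's implied MoU from the figures shown and confirms they sum to the value implied by the Totals row (the figures are copied from the table; the derived MoU values are implied rather than quoted on the slide):

```python
# Cross-check of the table above: derive each site's implied GridPP MoU
# commitment as delivered / (% of MoU), then sum the sites.
sites = {
    # name: (storage_tb, cpu_ksi2k, pct_mou_cpu, pct_mou_disk)
    "RALPPD":     (633, 2815, 329.63, 374.56),
    "Oxford":     (160,  972, 592.68, 363.64),
    "Cambridge":  ( 60,  455, 469.07, 230.77),
    "Bristol":    ( 55,  120,  96.77, 343.75),
    "Birmingham": ( 90,  700, 304.35, 142.86),
}

mou_cpu = sum(cpu / (pc / 100) for _, cpu, pc, _ in sites.values())
mou_disk = sum(tb / (pd / 100) for tb, _, _, pd in sites.values())
print(f"implied SouthGrid MoU: {mou_cpu:.0f} kSI2K CPU, {mou_disk:.0f} TB disk")
# Totals row implies the same: 5545 kSI2K at 377.47% -> 1469 kSI2K,
# and 999.5 TB at 314.31% -> 318 TB.
```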
Network rate capping
• Oxford recently had its network link rate-capped to 100 Mb/s.
• This was the result of continuous 300-350 Mb/s traffic caused by CMS commissioning testing.
– As it happens, this testing completed at the same time as we were capped, so we passed the test, and current normal use is not expected to be this high.
• Oxford's JANET link is actually 2 x 1 Gbit links, which had become saturated.
• The short-term solution is to rate-cap only JANET traffic, to 200 Mb/s; all other on-site traffic remains at 1 Gb/s.
• The long-term plan is to upgrade the JANET link to 10 Gb/s within the year.
SPEC Benchmarking
• Purchased the SPEC CPU2006 benchmark suite.
• Ran it using the HEPiX scripts, i.e. in the HEP-SPEC06 way (the aggregation step is sketched below).
• Using the HEP-SPEC06 benchmark should provide a level playing field.
• In the past, sites could choose any one of the many published values on the SPEC benchmark site.
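For illustration: HEP-SPEC06 is built on the C++ subset of SPEC CPU2006, run with fixed compiler flags and typically one copy per core, and SPEC combines per-benchmark ratios with a geometric mean. A minimal sketch of that aggregation step, assuming the per-benchmark ratios are already in hand (the benchmark names are the real CPU2006 C++ tests, but the values are invented):

```python
import math

# Hypothetical per-benchmark ratios from one copy of the C++ ("all_cpp")
# subset of SPEC CPU2006; the numbers are invented for illustration.
ratios = {
    "444.namd": 11.2,
    "447.dealII": 13.5,
    "450.soplex": 10.1,
    "453.povray": 14.8,
    "471.omnetpp": 8.9,
    "473.astar": 9.6,
    "483.xalancbmk": 12.3,
}

def geometric_mean(values):
    """Geometric mean, as SPEC uses to combine per-benchmark ratios."""
    vals = list(values)
    return math.exp(sum(math.log(v) for v in vals) / len(vals))

print(f"per-copy score: {geometric_mean(ratios.values()):.2f}")
```

The point of the HEPiX recipe is that every site runs the same subset the same way, so the resulting numbers are directly comparable across sites.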
Staff Changes
• Jon Wakelin and Yves Coppens left in Feb 09.
• Kashif Mohammad started in Jan 09 as the deputy coordinator for SouthGrid.
• Chris Curtis will replace Yves, starting in May. He is currently doing his PhD on the ATLAS project.
• The Bristol post will be advertised; it is jointly funded by IS and GridPP.
gridppnagios
Resilience
• What do we mean by resilience?
• The ability to maintain high availability and reliability of our grid service
• Guard against failures:
– Hardware
– Software
Availability / Reliability
Hardware Failures
• The hardware
– Critical servers
• Good-quality equipment
• Dual PSUs
• Dual mirrored system disks, and RAID for storage arrays
• All systems have 3-year maintenance with an on-site spares pool (disks, PSUs, IPMI cards)
• Similar kit bought for servers, so hardware can be swapped
• IPMI cards allow remote operation and control
– The environment
• UPS for critical servers
• Network-connected PDUs for monitoring and power switching
• Professional computer room / rooms
• Air conditioning: need to monitor the temperature
• Actions based on the above environmental monitoring (a watchdog sketch follows this list):
– Configure your UPS to shut down systems in the event of sustained power loss
– Shut down the cluster in the event of high temperature
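One way to act on the environmental monitoring above is a small temperature watchdog. A minimal sketch, assuming ipmitool is installed and the host exposes IPMI temperature sensors; the 35 C threshold and the shutdown action are illustrative placeholders, not SouthGrid's actual values:

```python
#!/usr/bin/env python3
"""Temperature watchdog sketch: read IPMI sensors, alert when hot."""
import re
import subprocess

THRESHOLD_C = 35.0  # illustrative trip point


def temperatures():
    """Yield (sensor_name, degrees_C) pairs parsed from ipmitool output."""
    out = subprocess.run(
        ["ipmitool", "sdr", "type", "Temperature"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        m = re.search(r"(\d+(?:\.\d+)?)\s*degrees C", line)
        if m:
            yield line.split("|")[0].strip(), float(m.group(1))


def main():
    hot = [(name, t) for name, t in temperatures() if t > THRESHOLD_C]
    for name, t in hot:
        print(f"ALERT: {name} at {t} C (threshold {THRESHOLD_C} C)")
    # A real watchdog would now drain the batch queues and power off
    # worker nodes via a site-specific shutdown script.


if __name__ == "__main__":
    main()
```

In practice a cron job or Nagios check would run something like this periodically, escalating to a cluster shutdown only after several consecutive high readings.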
Hardware continued
• So, having guarded against the hardware failing, if it does fail we need to ensure rapid replacement.
• Restore from backups, or reinstall.
• Automated installation system:
– PXE, kickstart, cfengine
– Good documentation
• Duplication of critical servers:
– Multiple CEs
– Virtualisation of some services allows migration to alternative VM servers (MON, BDII and CEs)
• Less reliance on external services:
– Could set up a local WMS and a top-level BDII
Software Failures
• The main cause of loss of availability is software failures:
– Misconfiguration
– Fragility of gLite middleware
– OS system problems (simple checks for these are sketched below)
• Disks filling up
• Service failures (e.g. ntp)
• Good communications can help solve problems quickly:
– Mailing lists, wikis, blogs, meetings
– Good monitoring and alerting (Nagios etc.)
– Learn from mistakes: update systems and procedures to prevent reoccurrence.
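The two OS-level failure modes named above (a filesystem filling up, a service such as ntp dying) are cheap to detect. A minimal sketch of the kind of host check a Nagios plugin would wrap; the path, threshold and process name are illustrative:

```python
#!/usr/bin/env python3
"""Host-check sketch: disk nearly full, or ntp daemon dead."""
import shutil
import subprocess


def disk_usage_ok(path="/var", limit=0.90):
    """Return False if the filesystem holding `path` is over `limit` full."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total < limit


def ntpd_running():
    """Crude liveness check: is an ntpd process present?"""
    return subprocess.run(["pgrep", "-x", "ntpd"],
                          capture_output=True).returncode == 0


if __name__ == "__main__":
    if not disk_usage_ok():
        print("WARNING: /var is over 90% full")
    if not ntpd_running():
        print("CRITICAL: ntpd is not running")
```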
Recent example
• Many SAM failures, with occasional passes
• All test jobs pass
• Almost all ATLAS jobs pass
• Error logs revealed messages about the proxy not being valid yet!
• ntp on the SE head node had stopped
• AND cfengine had been switched off on that node (so there was no automatic check and restart)
• A SAM test always gets a new proxy, so if it got through the WMS and onto our cluster, into a reserved express-queue slot, within 4 minutes, it would fail.
• In this case the SAM tests were not accurately reflecting the usability of our cluster, BUT they were showing a real problem (the clock-skew mechanism is illustrated below).
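The mechanism is worth spelling out: with ntp stopped, the SE head node's clock drifted behind, so a freshly issued proxy's start time appeared to lie in the future. A toy illustration of why the fast SAM jobs failed while slower jobs with older proxies passed (the 4-minute skew matches the incident described; the timestamps are invented):

```python
#!/usr/bin/env python3
"""Toy model: a node whose clock runs behind rejects a fresh proxy."""
from datetime import datetime, timedelta

CLOCK_SKEW = timedelta(minutes=4)  # how far the SE node's clock lagged


def node_accepts(proxy_not_before, true_arrival_time):
    """A proxy is unusable before its notBefore time, as judged by the
    node's own (skewed) clock."""
    node_clock = true_arrival_time - CLOCK_SKEW
    return node_clock >= proxy_not_before


issued = datetime(2009, 3, 20, 12, 0, 0)      # proxy notBefore timestamp
sam_job = issued + timedelta(minutes=3)       # fresh proxy, express queue
atlas_job = issued + timedelta(minutes=30)    # older proxy, normal queue

print("SAM test accepted: ", node_accepts(issued, sam_job))    # False
print("ATLAS job accepted:", node_accepts(issued, atlas_job))  # True
```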
Conclusions
• These systems are extremely complex.
• Automatic configuration and good monitoring can help, but systems need careful tending.
• Sites should adopt best practice and learn from others.
• We are improving, but it's an ongoing task.