Page 1: WLCG Service Report ~~~ WLCG Management Board, 7th July 2009

WLCG Service Report

[email protected] [email protected] ~~~

WLCG Management Board, 7th July 2009

Page 2: WLCG Service Report ~~~ WLCG Management Board, 7th July 2009

Introduction

• Quiet week again
  • Decreasing participation
  • No alarm tickets
• Incidents leading to post-mortem
  • ATLAS post-mortem
  • FZK posted a post-mortem explaining their tape problems during STEP09
• RAL scheduled downtime for move to new Data Centre
• ASGC recovering?

Page 3: WLCG Service Report ~~~ WLCG Management Board, 7th July 2009

Decreasing participation

[Chart: local and remote participation in the daily operations meeting, 27-May-09 to 6-Jul-09 (scale 0-25), with the STEP09 period marked, showing the decreasing trend]

Page 4: WLCG Service Report ~~~ WLCG Management Board, 7th July 2009

GGUS summary

VO       User   Team   Alarm   Total
ALICE       2      1       0       3
ATLAS       9     13       0      22
CMS         3      0       0       3
LHCb        1     21       0      22
Totals     15     35       0      50

Page 5: WLCG Service Report ~~~ WLCG Management Board, 7th July 2009

LHCb Team tickets drifting up?

• Jobs failed or aborted at Tier 2: 8 tickets (5 of these 8 still open, all others closed)
• gLite WMS issues at Tier 1 (temporary): 5
• Data transfers to Tier 1 failing (disk full): 1
• Software area files owned by root: 1
• CE marked down but accepting jobs: 1

Nothing really unusual

Page 6: WLCG Service Report ~~~ WLCG Management Board, 7th July 2009

Page 7: WLCG Service Report ~~~ WLCG Management Board, 7th July 2009

PVSS2COOL incident 27-6 (1/3)

Incident report and affected services:
• On Sunday afternoon 27-6, Viatcheslav Khomutnikov (Slava) from ATLAS reported to the Physics DB service that the online reconstruction was stopped because an error was returned by the PVSS2COOL application (on the ATLAS offline DB). The error started appearing on Saturday (26-6) evening.

Page 8: WLCG Service Report ~~~ WLCG Management Board, 7th July 2009

PVSS2COOL incident 27-6 (2/3)

Issue analysis and actions taken:
• The error stack reported by ATLAS indicated that the error was generated by a 'drop table' operation being blocked by the custom trigger set up by ATLAS to prevent 'unwanted' segment drops. The trigger has been operational for several months. This information was fed back by the Physics DB service to ATLAS on Sunday evening.
• On Monday morning ATLAS still reported the blocking issue, and on further investigation they were not able to find which table the application (PVSS2COOL) wanted to drop (thereby causing the blocking error), as the issue appeared in a block of code responsible for inserting data.
• The Physics DB service, in collaboration with the ATLAS DBAs, then ran 'logmining' of the failed drop operation and found that the application was indeed trying to drop some segments in the recycle bin of the schema owner (ATLAS_COOLOFL_DCS).
• Further investigation with SQL trace by the DBAs showed that Oracle attempted to drop objects in the recycle bin when PVSS2COOL wanted to bulk insert data. This operation was then blocked by the custom ATLAS trigger that blocks drops in production, hence the error message originally reported.
• Metalink note 265253.1 further clarified that the issue was a side effect of the expected behaviour of Oracle's space reclamation process. (A sketch of the recycle-bin check and the kind of trigger involved follows below.)
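The interaction described above can be illustrated with a minimal sketch, assuming DBA access to the offline database; the connection string and the trigger name are hypothetical, but DBA_RECYCLEBIN and DDL triggers of this kind are standard Oracle features:

    # Sketch only: inspect the recycle-bin objects that Oracle tries to purge
    # during space reclamation. Connection details are hypothetical.
    import cx_Oracle

    conn = cx_Oracle.connect("dba_user/secret@atlas-offline-db")  # hypothetical DSN
    cur = conn.cursor()

    # List dropped segments still held in the schema owner's recycle bin.
    cur.execute(
        """SELECT object_name, original_name, type, droptime
             FROM dba_recyclebin
            WHERE owner = :owner""",
        owner="ATLAS_COOLOFL_DCS",
    )
    for object_name, original_name, obj_type, droptime in cur:
        print(object_name, original_name, obj_type, droptime)

    # A 'block drop in production' DDL trigger of the kind described above
    # would look roughly like this (name and exact logic are assumptions):
    #   CREATE OR REPLACE TRIGGER block_drop_in_prod
    #     BEFORE DROP ON DATABASE
    #   BEGIN
    #     IF ora_dict_obj_owner = 'ATLAS_COOLOFL_DCS' THEN
    #       RAISE_APPLICATION_ERROR(-20001, 'DROP blocked in production');
    #     END IF;
    #   END;
    # Oracle's internal purge of recycle-bin objects during the bulk insert
    # fires this trigger as well, which produced the error originally reported.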

Page 9: WLCG Service Report ~~~ WLCG Management Board, 7th July 2009

PVSS2COOL incident 27-6 (3/3)

Issue resolution and expected follow-up:
• In the evening of 29-6, Physics DB support, in collaboration with the ATLAS DBAs, extended the datafile of the PVSS2COOL application to circumvent this space reclamation issue (see the sketch below). ATLAS has reported that this fixed the issue.
• Further discussions on the role of the recycle bin and on possible improvements of ATLAS's 'block drop' trigger are currently in progress to avoid further occurrences of this issue.
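As a rough illustration of the fix, assuming the same hypothetical connection as above; the datafile path and sizes are made up for illustration, not the real values:

    # Sketch only: give the tablespace backing the application more room so
    # that bulk inserts no longer force recycle-bin space reclamation.
    import cx_Oracle

    conn = cx_Oracle.connect("dba_user/secret@atlas-offline-db")  # hypothetical DSN
    cur = conn.cursor()

    # Either resize the existing datafile outright...
    cur.execute(
        "ALTER DATABASE DATAFILE '/oradata/atlas_offline/cool_dcs_01.dbf' RESIZE 20G"
    )

    # ...or let it grow automatically up to a cap.
    cur.execute(
        "ALTER DATABASE DATAFILE '/oradata/atlas_offline/cool_dcs_01.dbf' "
        "AUTOEXTEND ON NEXT 512M MAXSIZE 32G"
    )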

Page 10: WLCG Service Report ~~~ WLCG Management Board, 7th July 2009

FZK tape problems during STEP09

• Jos posted a post-mortem analysis of the tape problems seen at FZK during STEP09: https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_storage_FZK_GridKa.pdf
• Too long to fit here, but in summary:
  • Before STEP09
    • An update to fix a minor problem in the tape library manager resulted in stability problems
    • Possible cause: SAN or library configuration. Both were tried and the problem disappeared, but which one was the root cause?
    • The second SAN had reduced connectivity to the dCache pools: not enough for CMS and ATLAS at the same time, so CMS was asked not to use tape
  • First week of STEP09
    • Many problems: hardware (disk, library, tape drives) and software (TSM)
  • Second week of STEP09
    • Adding two more dedicated stager hosts resulted in better stability
    • Finally getting stable rates of 100-150 MB/s (see the quick scale check below)
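For a sense of scale, a quick back-of-the-envelope calculation (assuming decimal units, 1 TB = 10^6 MB) of what those sustained rates mean per day:

    # Back-of-the-envelope: daily volume at the stable tape rates quoted above.
    # Assumes decimal units (1 TB = 1e6 MB); purely illustrative.
    seconds_per_day = 24 * 60 * 60  # 86,400 s
    for rate_mb_s in (100, 150):
        tb_per_day = rate_mb_s * seconds_per_day / 1e6
        print(f"{rate_mb_s} MB/s -> {tb_per_day:.1f} TB/day")
    # Prints roughly 8.6 TB/day at 100 MB/s and 13.0 TB/day at 150 MB/s.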

Page 11: WLCG Service Report ~~~ WLCG Management Board, 7th July 2009

RAL scheduled downtime for DC move

• Friday 3/7: reported still on schedule for restoring CASTOR and batch on Monday 6/7
• Despite presumably hectic activity with equipment movements, RAL continued to attend the daily conf call
• Planning and detailed progress reported at: http://www.gridpp.rl.ac.uk/blog/category/r89-migration

R89 Migration: Friday 3rd July (posted by Andrew Sansum, 12:00)
"Our last dash towards restoration of the production service is under way. All racks of disk servers have now had a first pass check. The faults list is currently 11 servers, although some of these may well be trivial. We expect to provide a large number of disk servers to the CASTOR team later today."

Page 12: WLCG Service Report ~~~ WLCG Management Board, 7th July 2009

ASGC instabilities

• ATLAS reported instabilities at the beginning of the week
  • Monday:
    • Functional tests worked, but still some problems with Tier-1 to Tier-2 transfers
    • Another unscheduled downtime (recabling of CASTOR disk servers)
• CMS allowed the full week as a grace period for ASGC to recover from all its problems
  • No new tickets, and open tickets put on hold
  • Resume on Monday 6/7
• Both ATLAS and CMS specific site tests changed from Red to Green during the week
• Friday 3/7: Gang reports that tape drives and servers are online

Page 13: WLCG Service Report ~~~ WLCG Management Board, 7th July 2009

Summary

• Daily meeting attendance is degrading: holidays…?
• No new serious site issues
• RAL long downtime for DC move is progressing to plan (Tuesday report: RAL back apart from CASTORATLAS, some network instability)
• Tape problems at FZK during STEP09 understood
• ASGC is recovering?