
Page 1: LHCOPN: Operations report

LHCOPN: Operations report

Guillaume.Cessieux @ cc.in2p3.fr

Network team, FR-CCIN2P3

LHCOPN meeting, CERN, 2010-10-08

Page 2: LHCOPN: Operations report

From last LHCOPN meeting, 2010-06-29, Barcelona

Conclusion on Operations

– Sites follow the processes unequally, because they lack a clear sense of their usefulness and evidence of network failures

– WLCG relationships are weak

– Monitoring and SLD are required to really assess Operations

Items not solved

– LHCOPN representatives

• How to push efficiently for proper resolution of some issues/administrative tasks

– In plain words: stress sites and escalate frozen issues

– Merging the LHCOPN helpdesk with standard GGUS


Page 3: LHCOPN: Operations report

Outline

Operation status

– TTS stats

– Long-standing issues & Ops phoneconf report

Operational exchanges with WLCG

– Post-mortem analysis of some issues

– Ease exchanges with WLCG

AOB


Page 4: LHCOPN: Operations report


Number of tickets put in the LHCOPN TTS per month

AVG: 23 tickets/month

Page 5: LHCOPN: Operations report

Kinds of tickets per month


Page 6: LHCOPN: Operations report

KPI-1: Infrastructure vs operations behavior


Page 7: LHCOPN: Operations report


Ticket ownership during [2010-07-01, 2010-09-30]

Joy of terminating 6 LHCOPN links

Page 8: LHCOPN: Operations report

Ownership of tickets per month per site


Page 9: LHCOPN: Operations report

Conclusion from TTS stats

Workflow is stable, but it is unclear whether this is good

– SLD and monitoring are missing to correlate and focus on service-impacting events

Lots of L2 events (80%), well handled

– Often clear-cut, easy to spot

Not used to complex issues

– These often turn into a long story

• Packet loss, MTU...


Page 10: LHCOPN: Operations report

Long-standing issues

Only administrative!

– Validate prefix acceptance etc.

– Waiting for the GGUS feature “clone this ticket and assign it to all impacted sitename” to follow this on a per-site basis

Followed during the LHCOPN Ops phoneconf, every 3 months

– Recurrent issue: hard to get administrative issues solved


Page 11: LHCOPN: Operations report

Issues highlighted by WLCG (1/4)

Painful to spot, and many are not related to the LHCOPN at all

1. #GGUS-54473: Transfer error from PIC_DATADISK to SARA-MATRIX_DATADISK

– Child issues: #GGUS-54416, #GGUS-54474, #GGUS-54500

– “The two LHCOPN routers at CERN were connected via a VLAN, and VLAN tagging adds 4 bytes to a packet. The MTU between these routers has been increased”

– Opened 2010-01-05 12:17, closed 2010-01-08 16:16

– No related LHCOPN tickets
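
The MTU remark quoted above can be checked with simple arithmetic. The following is a minimal sketch, not a description of the actual CERN configuration: the slide gives no frame sizes, so the 9018-byte limit (a 9000-byte packet plus untagged Ethernet overhead) and the helper names `max_payload` and `fits` are assumptions chosen for illustration. Only the 4-byte VLAN tag comes from the slide.

```python
# Standard Ethernet framing sizes, in bytes.
ETH_HEADER = 14  # destination MAC + source MAC + EtherType
ETH_FCS = 4      # frame check sequence
DOT1Q_TAG = 4    # 802.1Q VLAN tag: the "4 bytes" quoted on the slide


def max_payload(max_frame: int, tagged: bool) -> int:
    """Largest packet that fits in a frame of at most max_frame bytes."""
    overhead = ETH_HEADER + ETH_FCS + (DOT1Q_TAG if tagged else 0)
    return max_frame - overhead


def fits(packet_size: int, max_frame: int, tagged: bool) -> bool:
    """True if a packet of packet_size bytes can cross the hop unfragmented."""
    return packet_size <= max_payload(max_frame, tagged)


# Hypothetical hop whose frame limit was sized for untagged 9000-byte packets.
FRAME_LIMIT = 9018

print(max_payload(FRAME_LIMIT, tagged=False))            # 9000
print(max_payload(FRAME_LIMIT, tagged=True))             # 8996
print(fits(9000, FRAME_LIMIT, tagged=True))              # False
print(fits(9000, FRAME_LIMIT + DOT1Q_TAG, tagged=True))  # True
```

Under these assumed numbers, full-size packets stop fitting once the hop carries a VLAN tag, and raising the frame limit by the tag size restores them, which matches the fix described in the ticket.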


Page 12: LHCOPN: Operations report

Issues highlighted by WLCG (2/4)

2. #LHCOPN-58197: Poor performance between CERN and ASGC

– Opened 2010-05-12, closed 2010-05-17

– Never updated, only opened/closed for the record

• Only a communication problem, the issue was managed

• Network staff movement at TW-ASGC, solved

• SIR filed: https://twiki.cern.ch/twiki/bin/view/LCG/SIRCernAsgcLinkMay2010

3. #GGUS-59791: Transfer problem from INFN-T1_DATADISK to PIC_DATADISK

– Child issue: #GGUS-59697 “T0 export to INFN-T1_DATADISK failures: No valid space tokens”

– Opened 2010-07-07 00:06, closed 2010-07-14 18:05

– “Network issue of MTU black hole + route asymmetry at CNAF/GARR”

– No LHCOPN tickets


Page 13: LHCOPN: Operations report

Issues highlighted by WLCG (3/4)

4. #GGUS-61306: Functional test transfer errors to RAL-LCG2_DATADISK

– Related to

• #GGUS-61942 “NDGF-T1 transfer error from RAL-LCG2 and to BNL-OSG2”

• #GGUS-61835 “Transfer errors from NDGF-T1_DATADISK to RAL-LCG2_DATADISK”

• #GGUS-62287 “Transfer errors at NDGF-T1_SCRATCHDISK”

– Opened 2010-08-19 17:41, closed 2010-09-17 15:09

– #LHCOPN-62228, opened/closed 2010-09-17

• Symbolic, for the record, no information in it

– “The linecard terminating the RAL primary link on the CERN router was replaced and the issue was definitely solved”


Page 14: LHCOPN: Operations report

Issues highlighted by WLCG (4/4)

4 LHCOPN issues this year

– Nothing particularly wrong

– Problem is mainly around communication

Main mistake is forgetting to create a ticket in the LHCOPN helpdesk

– This was the agreed process

Not aware of any other LHCOPN-related issue from WLCG

– But other network issues (LAN, generic IP...)
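
The summary above can be reproduced from the four issues listed on the previous slides. This is a minimal sketch: the ticket numbers and dates are copied from the slides, while the tuple layout and the helper names `parse` and `days_open` are assumptions made for illustration. #LHCOPN-58197 only has day-level dates, so bare dates are read as midnight.

```python
from datetime import datetime

# (WLCG-side ticket, opened, closed, related LHCOPN ticket or None)
ISSUES = [
    ("GGUS-54473", "2010-01-05 12:17", "2010-01-08 16:16", None),
    ("LHCOPN-58197", "2010-05-12", "2010-05-17", "LHCOPN-58197"),
    ("GGUS-59791", "2010-07-07 00:06", "2010-07-14 18:05", None),
    ("GGUS-61306", "2010-08-19 17:41", "2010-09-17 15:09", "LHCOPN-62228"),
]


def parse(ts: str) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM', or a bare 'YYYY-MM-DD' read as midnight."""
    fmt = "%Y-%m-%d %H:%M" if " " in ts else "%Y-%m-%d"
    return datetime.strptime(ts, fmt)


def days_open(opened: str, closed: str) -> int:
    """Whole days between opening and closing of a ticket."""
    return (parse(closed) - parse(opened)).days


# Issues that never got a ticket in the LHCOPN helpdesk.
missing = [ticket for ticket, _, _, lhcopn in ISSUES if lhcopn is None]

for ticket, opened, closed, lhcopn in ISSUES:
    print(ticket, days_open(opened, closed), lhcopn or "no LHCOPN ticket")
print("Missing LHCOPN ticket:", missing)
```

Two of the four issues have no LHCOPN ticket at all, and the two that do were only opened and closed for the record, which is the communication gap the slide points at.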


Page 15: LHCOPN: Operations report

Separated LHCOPN helpdesk in GGUS, why? (1/3)

Key requirements, 2008-03

– Not doing user support, but coordinating network teams

– Match the operational model, particularly the responsibility and notification scheme

– Network issue ≠ Grid issue; lots of non-service-impacting events to be registered in it

• Avoid disturbing or misleading people

– Network teams have no access to standard GGUS

• And did not want it

– Centralize anything related to LHCOPN Ops

– Clear desire to be isolated/protected

• “If we use standard GGUS this will be a mess”

• Real fear of enquiries about anything

• Did not want to be considered a catch-all networking support; we should accept only selected LHCOPN-related enquiries going through storage teams

So we ended up with the LHCOPN helpdesk


Page 16: LHCOPN: Operations report

Separated LHCOPN helpdesk in GGUS, why? (2/3)

Now

– General workflow is agreed, discussion is on the way to implement it

– Lots of things have evolved

• GGUS support scheme, experience in applying processes etc.

– Several problems/concerns experienced

• Problems cannot be solved independently by the network team?

– Lots of interaction with storage, system etc.

– Aren’t iperf tests or monitoring sufficient?

• We lack a clear bridge with WLCG Ops

– Hope was put in the awaited parent/child relationship feature for GGUS tickets

– Cross-helpdesk access and exchanges required?

• Enquiries often already have a standard GGUS ticket

– “Why create an LHCOPN TT if there is already a GGUS one?”

» Competition between LHCOPN helpdesk and standard GGUS

– Tickets turning out to be network related after some time and investigation

– LHCOPN tickets: overhead or true advantage?

» Notification, responsibility, tracking etc.


Page 17: LHCOPN: Operations report

Separated LHCOPN helpdesk in GGUS, why? (3/3)

So create 12 related support units in the standard GGUS?

• LHCOPN_CA-TRIUMF etc.

– Will this add happy interactions with everybody?

– Can we keep the set of particular features we have and be smartly integrated in the current GGUS workflow?

• Particular view, non-service-impacting events hidden, categories, tickets for maintenances, notification and assignment scheme?

• Transparent for us? Can a standard ticket be turned into an LHCOPN one?

– Aren’t we doing more than user support?


Page 18: LHCOPN: Operations report

AOB (1/3)

Routing policies

– To be documented accurately through a routing matrix

– https://twiki.cern.ch/twiki/bin/view/LHCOPN/RoutingPolicies

Escalation process

– Existing, but never used

– https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#Escalated_incident_management_pr

– Give this privilege to WLCG people on LHCOPN tickets?

Scheme of responsibilities to be improved?

– Set on a per-link basis, so who’s responsible for an IT-INFN-CNAF ↔ US-T1-BNL issue?

• Can this really happen without problems between IT-INFN-CNAF ↔ CERN or US-T1-BNL ↔ CERN?
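
One possible reading of the question above, as a minimal sketch. Everything here is hypothetical: the owner assigned to each link in `LINK_OWNER` is a placeholder, not the real LHCOPN responsibility table, and the rule that Tier-1 to Tier-1 traffic without a direct link transits CERN is an assumption suggested by the slide, not a documented routing policy.

```python
from typing import Dict, FrozenSet, List

HUB = "CERN"

# Hypothetical per-link responsibility table (placeholder owners).
LINK_OWNER: Dict[FrozenSet[str], str] = {
    frozenset({"CERN", "IT-INFN-CNAF"}): "IT-INFN-CNAF",
    frozenset({"CERN", "US-T1-BNL"}): "US-T1-BNL",
}


def responsible_sites(a: str, b: str) -> List[str]:
    """Sites responsible for an issue between sites a and b.

    A direct link has a single owner. Without one, the path is assumed
    to be a - HUB - b, and the owners of both legs are returned.
    """
    direct = LINK_OWNER.get(frozenset({a, b}))
    if direct is not None:
        return [direct]
    owners: List[str] = []
    for leg in (frozenset({a, HUB}), frozenset({HUB, b})):
        if leg not in LINK_OWNER:
            raise KeyError(f"no responsibility defined for {sorted(leg)}")
        if LINK_OWNER[leg] not in owners:
            owners.append(LINK_OWNER[leg])
    return owners


print(responsible_sites("IT-INFN-CNAF", "US-T1-BNL"))
print(responsible_sites("CERN", "US-T1-BNL"))
```

With a purely per-link table, an IT-INFN-CNAF ↔ US-T1-BNL issue has no single owner: it resolves to the owners of the two legs through CERN, which is the gap in the scheme that the slide raises.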


Page 19: LHCOPN: Operations report

AOB (2/3)

Issues/requests related to MDM

– Must be visible, tracked and centralised like any other LHCOPN issue

• Must be in the LHCOPN TTS

– Maybe new problem categories etc. to support this

– How far? Track software bugs or only site implementations?

• DANTE/GN3 could have a login/password for GGUS if no certificate

– Any concern about this?

– Documentation about MDM boxes available?

• Should be on the LHCOPN twiki, even if very brief

– List and IP addresses of boxes enough?

• Hard to solve problems knowing only local boxes

• DANTE/GN3 should have R/W access to the LHCOPN twiki


Page 20: LHCOPN: Operations report

AOB (3/3)

Too many off-the-record e-mail exchanges about LHCOPN issues

– MUST be in the LHCOPN TTS

• Visible, followed, timestamped etc.

• Tickets in the LHCOPN TTS have a clear scheme of responsibilities… not an e-mail sleeping in an inbox

– If no LHCOPN ticket, no LHCOPN issue


Page 21: LHCOPN: Operations report

Conclusion

Awaiting monitoring to revitalise Ops

– And SLD to really know what matters

Main weakness of LHCOPN Ops: relationship with WLCG

– GGUS merging: to be investigated/discussed further

• Why not, if this solves issues

Be careful with the scope of our model

– LHCOPN only

– Key reason for having this so specific?

• But be careful before changing something that works

• Also wait for EGI networking support and Tier-2 networking to converge
