LHCOPN: Operations report
Guillaume.Cessieux @ cc.in2p3.fr
Network team, FR-CCIN2P3
LHCOPN meeting, CERN, 2010-10-08

From last LHCOPN meeting, 2010-06-29, Barcelona
Conclusion on Operations
– Sites follow the processes unevenly, because there is no clear sense of their usefulness and little evidence of network failures
– WLCG relationships are weak
– Monitoring and SLD required to really assess Operations
Items not solved
– LHCOPN representatives
  • How to push efficiently for proper solving of some issues/administrative tasks
    – In clear words: stress sites and escalate frozen issues
– Merging LHCOPN helpdesk with standard GGUS

Outline
Operation status
– TTS stats
– Long standing issues & Ops phoneconf report
Operational exchanges with WLCG
– Post mortem analysis of some issues
– Ease exchanges with WLCG
AOB

Number of tickets put in the LHCOPN TTS per month
AVG: 23 tickets/month

Kind of tickets per month

KPI-1: Infrastructure vs operations behavior

Ticket ownership during [2010-07-01, 2010-09-30]
Joy of terminating 6 LHCOPN links

Ownership of tickets per month per site

Conclusion from TTS stats
Workflow stable, but unclear if this is good
– SLD and monitoring are missing to correlate events and focus on service-impacting ones
Many L2 events (80%), well handled
– Often clear cut, easy to spot
Not used to complex issues
– Often turning into a long story
  • packet loss, MTU...

Long standing issues
Only administrative!
– Validate prefix acceptance etc.
– Waiting for the GGUS feature “clone this ticket and assign it to all impacted sitename” to follow this on a per-site basis
Followed during the LHCOPN Ops phoneconf, every 3 months
– Recurrent issue: hard to get administrative issues solved

Issues highlighted by WLCG (1/4)
Painful to spot, and many are not related to the LHCOPN at all
1. #GGUS-54473: Transfer error from PIC_DATADISK to SARA-MATRIX_DATADISK
– Child issues: #GGUS-54416, #GGUS-54474, #GGUS-54500
– “The two LHCOPN routers at CERN were connected via a VLAN, and VLAN tagging adds 4 bytes to a packet. The MTU between these routers has been increased”
– Opened 2010-01-05 12:17, closed 2010-01-08 16:16
– No related LHCOPN tickets

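The explanation quoted for #GGUS-54473 can be illustrated with a little arithmetic. This is a minimal sketch, not taken from the ticket: it uses the standard Ethernet overheads (14-byte header, 4-byte 802.1Q tag, 4-byte FCS) and assumes a 1500-byte payload purely for illustration, since the MTU values actually configured on the CERN routers are not given here. The function names are hypothetical.

```python
# Standard Ethernet overheads, in bytes.
ETH_HEADER = 14   # destination MAC + source MAC + EtherType
DOT1Q_TAG = 4     # 802.1Q VLAN tag inserted into the header
FCS = 4           # frame check sequence (trailer)


def frame_size(payload: int, tagged: bool) -> int:
    """Size on the wire of a frame carrying `payload` bytes."""
    return ETH_HEADER + (DOT1Q_TAG if tagged else 0) + payload + FCS


def max_payload(frame_limit: int, tagged: bool) -> int:
    """Largest payload that fits in a frame of at most `frame_limit` bytes."""
    return frame_limit - ETH_HEADER - (DOT1Q_TAG if tagged else 0) - FCS


# Assumed payload size (common Ethernet default, NOT a value from the ticket).
ASSUMED_PAYLOAD = 1500

untagged = frame_size(ASSUMED_PAYLOAD, tagged=False)
tagged = frame_size(ASSUMED_PAYLOAD, tagged=True)
# A hop whose frame limit was sized for untagged traffic loses 4 bytes
# of usable payload once VLAN tagging is enabled.
remaining = max_payload(untagged, tagged=True)

print(f"untagged frame: {untagged} bytes, tagged frame: {tagged} bytes")
print(f"payload left on an untagged-sized hop once tagged: {remaining} bytes")
```

Full-size packets that no longer fit are dropped unless the MTU of the hop is raised, which matches the fix reported in the ticket.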
Issues highlighted by WLCG (2/4)
2. #LHCOPN-58197: Poor performance between CERN and ASGC
– Opened 2010-05-12, closed 2010-05-17
– Never updated, only opened/closed for the record
  • Only a communication problem, the issue was managed
  • Network staff movement at TW-ASGC, solved
  • SIR filed: https://twiki.cern.ch/twiki/bin/view/LCG/SIRCernAsgcLinkMay2010
3. #GGUS-59791: Transfer problem from INFN-T1_DATADISK to PIC_DATADISK
– Child issue: #GGUS-59697 T0 export to INFN-T1_DATADISK failures: No valid space tokens
– Opened 2010-07-07 00:06, closed 2010-07-14 18:05
– “Network issue of MTU black hole + route asymetry at CNAF/GARR”
– No LHCOPN tickets

Issues highlighted by WLCG (3/4)
4. #GGUS-61306: Functional test transfer errors to RAL-LCG2_DATADISK
– Related to
  • #GGUS-61942 “NDGF-T1 transfer error from RAL-LCG2 and to BNL-OSG2”
  • #GGUS-61835 “Transfer errors from NDGF-T1_DATADISK to RAL-LCG2_DATADISK”
  • #GGUS-62287 “Transfer errors at NDGF-T1_SCRATCHDISK”
– Opened 2010-08-19 17:41, closed 2010-09-17 15:09
– #LHCOPN-62228, opened/closed 2010-09-17
  • Symbolic, for the record; no information in it
– “The linecard terminating the RAL primary link on the CERN router was replaced and the issue was definitely solved”

Issues highlighted by WLCG (4/4)
4 LHCOPN issues this year
– Nothing particularly wrong
– Problem is mainly around communication
Main mistake is forgetting to create a ticket in the LHCOPN helpdesk
– This was the agreed process
Not aware of any other LHCOPN related issue from WLCG
– But other network issues (LAN, generic IP...)

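To put the four issues in perspective, the time each GGUS ticket stayed open can be worked out from the opening and closing timestamps quoted on the previous slides. The sketch below is only an illustration: the dictionary and function names are hypothetical, and #LHCOPN-58197 is left out because only its dates, not its times, are given.

```python
from datetime import datetime

# Opening and closing timestamps as quoted on the slides.
TICKETS = {
    "GGUS-54473": ("2010-01-05 12:17", "2010-01-08 16:16"),
    "GGUS-59791": ("2010-07-07 00:06", "2010-07-14 18:05"),
    "GGUS-61306": ("2010-08-19 17:41", "2010-09-17 15:09"),
}

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M"


def open_duration(opened: str, closed: str):
    """Return how long a ticket stayed open, as a timedelta."""
    start = datetime.strptime(opened, TIMESTAMP_FORMAT)
    end = datetime.strptime(closed, TIMESTAMP_FORMAT)
    return end - start


durations = {ticket: open_duration(*span) for ticket, span in TICKETS.items()}

for ticket, delta in durations.items():
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    print(f"{ticket}: open for {delta.days} d {hours} h {minutes} min")
```

The three tickets stayed open for roughly 3, 8 and 29 days respectively, which is consistent with the remark that complex issues such as MTU problems often turn into a long story.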
Separated LHCOPN helpdesk in GGUS, why? (1/3)
Key requirements, 2008-03
– Not doing user support, but coordinating network teams
– Match the operational model, particularly the responsibility and notification scheme
– Network issue ≠ Grid issue, many non service impacting events to be registered
  • Avoid disturbing or misleading people
– Network teams have no access to standard GGUS
  • And did not want it
– Centralize anything related to LHCOPN Ops
– Clear desire to be isolated/protected
  • “If we use standard GGUS this will be a mess”
  • Real fear of enquiries about anything
  • Did not want to be considered a catch-all networking support; we should accept only selected LHCOPN related enquiries going through storage teams
So we ended up with the LHCOPN helpdesk

Separated LHCOPN helpdesk in GGUS, why? (2/3)
Now
– General workflow is agreed, discussion is on the way to implement it
– Many things have evolved
  • GGUS support scheme, experience in applying processes etc.
– Several problems/concerns experienced
  • Problem cannot be solved independently by the network team?
    – Lot of interaction with storage, system etc.
    – Aren’t iperf tests or monitoring sufficient?
  • We miss a clear bridge with WLCG Ops
    – Hope was put in the awaited parent/child relationship feature for GGUS tickets
    – Cross helpdesk accesses and exchanges required?
  • Enquiries often already have a standard GGUS ticket
    – “Why create an LHCOPN TT if there is already a GGUS one?”
      » Competition between LHCOPN helpdesk and standard GGUS
    – Tickets turning out to be network related after some time and investigations
    – LHCOPN tickets: overhead or true advantage?
      » Notification, responsibility, tracking etc.

Separated LHCOPN helpdesk in GGUS, why? (3/3)
So create 12 related support units in the standard GGUS?
  • LHCOPN_CA-TRIUMF etc.
– Will this add happy interactions with everybody?
– Can we keep the set of particular features we have and be smartly integrated in current GGUS’ workflow?
  • Particular view, non service impacting events hidden, categories, tickets for maintenances, notification and assignment scheme?
  • Transparent for us? Can a standard ticket be turned into a LHCOPN one?
– Aren’t we doing more than user support?

AOB (1/3)
Routing policies
– To be documented accurately through a routing matrix
– https://twiki.cern.ch/twiki/bin/view/LHCOPN/RoutingPolicies
Escalation process
– Existing, but never used
– https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#Escalated_incident_management_pr
– Give this privilege to WLCG people on LHCOPN tickets?
Scheme of responsibilities to be improved?
– Set on a per-link basis, so who’s responsible for an IT-INFN-CNAF ↔ US-T1-BNL issue?
  • Can this really happen without problems between IT-INFN-CNAF ↔ CERN or US-T1-BNL ↔ CERN?

AOB (2/3)
Issues/requests related to MDM
– Must be visible, tracked and centralised like any other LHCOPN issue
  • Must be in the LHCOPN TTS
    – Maybe new problem categories etc. to support this
    – How far? Track software bugs or only sites' implementation?
  • DANTE/GN3 could have login/pass to GGUS if no certificate
    – Any concern about this?
– Documentation about MDM boxes available?
  • Should be on the LHCOPN twiki, even if very brief
    – List and IP address of boxes enough?
  • Hard to solve problems only knowing local boxes
  • DANTE/GN3 should have R/W access to LHCOPN twiki

AOB (3/3)
Too many off the record e-mail exchanges about LHCOPN issues
– MUST be in the LHCOPN TTS
  • Visible, followed, timestamped etc.
  • Tickets in the LHCOPN TTS have a clear scheme of responsibilities… not an e-mail sleeping in an inbox
– If no LHCOPN ticket, no LHCOPN issue

Conclusion
Awaiting monitoring to revitalise Ops
– And SLD to really know what matters
Main weakness of LHCOPN Ops: relationship with WLCG
– GGUS merging: to be investigated/discussed further
  • Why not, if this solves issues
Be careful with the scope of our model
– LHCOPN only
– Key reason for having this so specific?
  • But be careful before changing something that works
  • Also wait for EGI networking support and Tier-2 networking to converge