LHCOPN operational model - 4 use-cases
Guillaume Cessieux (FR-CCIN2P3 / EGEE networking support) on behalf of the Ops WG
LHCOPN meeting, 2009-01-15, Berlin
Agenda
Focus on 4 use-cases:
• Incident Management
  1. L3: Power outage at DE-KIT leading to routers down
  2. L2: Fibre cut between London and Didcot affecting CERN-RAL-LHCOPN-001
• Change Management
  3. L3: New IP prefix for ES-PIC
• Maintenance Management
  4. L2: USLHCNET's scheduled power cut for devices in Chicago
GCX - LHCOPN meeting - 2009-01-15 2
Tools used
• CERN’s twiki
  – https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome
  – https://twiki.cern.ch/twiki/bin/view/LHCOPN/OpsModelUseCases
  – https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel
  – https://twiki.cern.ch/twiki/bin/view/LHCOPN/OpsContacts
• GGUS
  – Public release 2009-02-01
• Monitoring
  – MDM, e2emon, ASPDrawer...
POWER OUTAGE AT DE-KIT LEADING TO ROUTERS DOWN
L3 incident management
Scope
• 2 routers unexpectedly down
• Affected:
  – NL-T1, CH-CERN, IT-INFN-CNAF, FR-CCIN2P3, DE-KIT
  – 5 links
L3 incident management
https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#L3_incident_management_process
Scope: Router down, BGP filtering, bad routing...
The source site is the site where the problem lies.
1.1 A ticket is created on the LHCOPN Helpdesk by the router operator of the source site to report the problem. It is assigned to the source site itself.
1.2 The router operator contacts his counterpart on the distant site (site-site communication) to find out whether something is wrong there (power outage...). If the problem is on the distant site, the distant site will start this process (the ticket is then re-assigned to the distant site).
1.3 If the problem is related to an underlying layer (L2: dark fibre outage...) the router operator will start the L2 incident management process. The router operator will be responsible for managing the trouble with the L2NOC (open and follow the NOC's ticket...). He stays responsible for the LHCOPN ticket in GGUS.
1.4 Otherwise the router operator owns the problem and will contact his local Grid Data contact to report the impact. The distant router operator will also be informed.
2 The LHCOPN TTS notifies all impacted sites about the incident.
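The branching in steps 1.1 to 2 can be sketched as a small Python function. This is only an illustration of the process text above, not a GGUS interface: the `L3Incident` class, its field names and `l3_incident_steps` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class L3Incident:
    """Hypothetical record of one L3 incident (all names are illustrative)."""
    source_site: str                   # site where the problem lies
    affected_sites: List[str]          # sites the TTS has to notify
    cause_known_locally: bool = False  # outage already localised at the source site
    on_distant_site: bool = False      # counterpart reports the problem is on its side
    caused_by_l2: bool = False         # underlying layer problem (dark fibre outage...)


def l3_incident_steps(incident: L3Incident) -> List[str]:
    """List the process steps that apply to one L3 incident, in order."""
    site = incident.source_site
    steps = [f"1.1 {site} router operator opens a GGUS ticket assigned to {site}"]
    if not incident.cause_known_locally:
        steps.append("1.2 contact the counterpart on the distant site")
        if incident.on_distant_site:
            # The distant site becomes the source site and starts the process itself.
            steps.append("1.2 ticket re-assigned to the distant site")
            return steps
    if incident.caused_by_l2:
        steps.append("1.3 start the L2 incident management process")
    else:
        steps.append(f"1.4 warn the {site} Grid data contact and the distant router operator")
    steps.append("2 TTS notifies: " + ", ".join(incident.affected_sites))
    return steps


if __name__ == "__main__":
    outage = L3Incident(
        source_site="DE-KIT",
        affected_sites=["NL-T1", "CH-CERN", "IT-INFN-CNAF", "FR-CCIN2P3", "DE-KIT"],
        cause_known_locally=True,
    )
    for step in l3_incident_steps(outage):
        print(step)
```

With the DE-KIT power outage as input, only steps 1.1, 1.4 and 2 are listed, because the cause is already known at the source site and is not an L2 problem.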
[Figure: L3 Incident management process (V0.5 20081215 gcx). Source site involved: Grid Data contact, router operators. Site involved: router operators. Other boxes: affected sites, LHCOPN TTS (GGUS), L2 incident management. Numbered arrows: 1.1, 1.2, (1.3), 1.4, 2. Legend: A notifies B; A interacts with B; A reads and writes B; A goes to process B.]
Ticket opening
1.1 A DE-KIT router operator opens a trouble ticket in GGUS
[Figure: DE-KIT router operators open a ticket (1.1) in the LHCOPN TTS (GGUS)]
GGUS submit interface
Ticket opened
Other steps
• Outage is localised and noticed by the source site
  – No need to perform 1.2: Contact counterpart on distant site
• This is a power cut, not a real L2 problem
  – No need to go further on 1.3: L2 incident management process
Grid interaction
• 1.4: Grid data contact at DE-KIT is warned about the outage
  – GGUS TTid provided
  – He will compute impact on the Grid
  – He will warn the Grid
[Figure: DE-KIT router operators open the ticket in the LHCOPN TTS (GGUS) (1.1) and warn the DE-KIT Grid Data contact (1.4)]
Automatic broadcasting
2: The GGUS TTS will warn all affected sites
  – This is done when the ticket is submitted
[Figure: as before (1.1, 1.4), plus the LHCOPN TTS (GGUS) notifying (2) the affected sites CH-CERN, FR-CCIN2P3, IT-INFN-CNAF, NL-T1, DE-KIT]
Following/Closure
• Incident registration and broadcasting is terminated
• DE-KIT router operator is in charge of updating/closing the GGUS ticket
  – Affected sites will be notified
• Local Grid data contact has also to be warned
History
Conclusion for first use case
• Shortcut as the incident is quickly localised
  – Otherwise more interactions between sites
• Deeply organised around GGUS tickets
  – Could be opened by another site and assigned to DE-KIT
  – Put status from « assigned » to « in progress » to acknowledge
Fibre cut between London and Didcot affecting CERN-RAL-LHCOPN-001
L2 Incident management
Scope
• Router operator at UK-T1-RAL noticed that the link is down thanks to their monitoring system
• Affected
  – 1 link: CERN-RAL-LHCOPN-001
  – 2 sites: CH-CERN and UK-T1-RAL
• No clear idea of what and where the problem is
  – Router down at CH-CERN, fibre cut…
Global problem management process started
Quick investigation
1- Nothing seems to be occurring on site
2- Take an overview of the LHCOPN
  – e2emon monitoring system indicates that the L2 link is down in segment “UKERNA”
    • Now tracking a fibre cut
  – Nothing seems registered on GGUS about it
    • Unscheduled event = Incident
• Going to L2 incident management
L2 incident management
https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#L2_incident_management_process
Scope: Dark fibre outages...
1.1 An L2NOC and a router operator could notice an L2 incident. They will interact to confirm it or not. A router operator could also be warned by the L3 incident management process through an LHCOPN ticket assigned to his site.
1.2 If confirmed, the router operator of a linked site will put a ticket on the LHCOPN TTS. The router operator is in charge of dealing with the involved L2 network providers and of reflecting the ongoing resolution within the LHCOPN TTS.
1.3 It is the responsibility of linked and affected sites to warn their Grid data contact.
2 All impacted sites will be notified by the TTS.
3 If nothing is found at L2, the escalated incident management process is started.
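As a rough sketch, the L2 incident steps above can be written as one function. The function name, its parameters and the split between `confirmed` (outcome of step 1.1) and `found_at_l2` (trigger for step 3) are assumptions made for this illustration; the process text does not define them as data fields.

```python
from typing import List


def l2_incident_steps(
    reporting_site: str,
    l2_noc: str,
    affected_sites: List[str],
    confirmed: bool,
    found_at_l2: bool = True,
) -> List[str]:
    """List the L2 incident process steps for one suspected link outage."""
    steps = [
        f"1.1 {reporting_site} router operator and {l2_noc} interact to confirm the incident"
    ]
    if not confirmed:
        return steps  # nothing to register on the TTS
    steps.append(
        f"1.2 {reporting_site} opens a ticket on the LHCOPN TTS and follows the L2 provider"
    )
    steps.append("1.3 linked and affected sites warn their Grid data contact")
    steps.append("2 TTS notifies: " + ", ".join(affected_sites))
    if not found_at_l2:
        # Hand over to the escalated incident management process.
        steps.append("3 start the escalated incident management process")
    return steps


if __name__ == "__main__":
    for step in l2_incident_steps(
        "UK-T1-RAL", "JANET NOC", ["CH-CERN", "UK-T1-RAL"], confirmed=True
    ):
        print(step)
```

The example call uses the site and NOC names of the second use-case; any other linked site and provider could be passed instead.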
[Figure: L2 Incident management process (V0.5 20081215 gcx). Sites linked: L2 NOC, Grid Data contact, router operators (* End of L3 incident management). Other boxes: LHCOPN TTS (GGUS), affected sites, escalated incident management. Numbered arrows: 1.1, 1.2, 1.3, 2, (3). Legend: A notifies B; A interacts with B; A reads and writes B.]
Incident registration
• 1.1: Router operator at UK-T1-RAL will open a ticket to JANET for the outage
• 1.2: UK-T1-RAL noticed the outage so will open a ticket in GGUS for the LHCOPN community
  – Self-assigned to them because under their responsibility (T0-T1)
[Figure: UK-T1-RAL router operators open a ticket with the JANET NOC (1.1) and in the LHCOPN TTS (GGUS) (1.2)]
GGUS ticket submitted
Broadcasting
1.3: Grid interaction
  – Local Grid data contact warned (+ #GGUS-TTid)
2: Other affected sites automatically notified by GGUS
[Figure: sites linked. UK-T1-RAL router operators interact with the JANET NOC (1.1), open the ticket in the LHCOPN TTS (GGUS) (1.2) and warn the Grid Data contact (1.3); GGUS notifies CH-CERN (2)]
Following/Closure
• UK-T1-RAL will update GGUS tickets with information from JANET
  – Grid data contact and affected sites are kept updated
• Ticket will be closed by UK-T1-RAL
Conclusion for second use-case
• Accurate and reliable monitoring is required to really shortcut investigations
• Key communication between network provider and customer
  – We did not change the way this currently works
New IP prefix for ES-PIC
L3 Change management
Scope
• ES-PIC has a new IP prefix that must be included within the LHCOPN
• Affected:
  – All sites: Filters to update…
  – And monitoring systems
L3 change management
https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#L3_change_management_process
Scope: IP address changes, new prefix propagated, new filtering
The source actors for these changes are router operators.
1.1 The router operator will expose the change to its Grid data contact (change in performance, new resiliency possibility...)
1.2 The router operator will expose the change to affected sites (e.g. linked sites)
2.1 The change will be fully documented on the global web repository and some technical information should also be updated
2.2 An informational ticket summarizing the change will be put into the LHCOPN TTS. It will contain a link to the full documentation of the change (e.g. URL to the global web repository)
2.3 The L3 monitoring infrastructure may be adapted if needed (new p2p IPs to be watched...)
3 The LHCOPN TTS notifies all impacted sites
4 If the change has an impact, an L3 maintenance management process will be started to commit the changes. Otherwise the change can be done directly.
If some L3 changes impact the L2 (L3 VPN for instance), the L2 change management process should be started.
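The change process above has two optional parts (step 2.3 and the hand-over to L2 change management) and one decision (step 4). A minimal Python sketch of that logic follows; the `L3Change` class and its boolean fields are hypothetical names chosen for the example, not fields of a real ticket.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class L3Change:
    """Hypothetical description of one L3 change (names are illustrative)."""
    source_site: str
    summary: str
    needs_monitoring_update: bool = False  # step 2.3 applies only if needed
    impacts_service: bool = False          # decides how step 4 is carried out
    impacts_l2: bool = False               # e.g. an L3 VPN


def l3_change_steps(change: L3Change) -> List[str]:
    """List the L3 change management steps that apply to one change."""
    site = change.source_site
    steps = [
        f"1.1 {site} router operator exposes the change to its Grid data contact",
        "1.2 router operator exposes the change to affected sites",
        "2.1 document the change on the global web repository",
        "2.2 open an informational ticket on the LHCOPN TTS linking to the documentation",
    ]
    if change.needs_monitoring_update:
        steps.append("2.3 adapt the L3 monitoring infrastructure")
    steps.append("3 TTS notifies all impacted sites")
    if change.impacts_service:
        steps.append("4 start the L3 maintenance management process to commit the change")
    else:
        steps.append("4 commit the change directly")
    if change.impacts_l2:
        steps.append("start the L2 change management process")
    return steps


if __name__ == "__main__":
    new_prefix = L3Change("ES-PIC", "new IP prefix", needs_monitoring_update=True)
    for step in l3_change_steps(new_prefix):
        print(step)
```

Keeping the decisions as explicit flags makes it easy to see which steps a given change skips.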
[Figure: L3 Change Management (V0.5 20081215 gcx). Source site: Grid Data contact, router operators. Linked sites and affected sites: router operators. Other boxes: Global web repository (Twiki), Monitoring, LHCOPN TTS (GGUS), L3 maintenance management. Numbered arrows: 1.1, 1.2, 2.1, 2.2, (2.3), 3, (4). Legend: A notifies B; A interacts with B; A reads and writes B.]
Change registration
1.1: Grid data contact is warned about the change
  – New hosts will benefit from the LHCOPN?
1.2: This change is common and has no deep impact for others
  – No need to discuss with impacted sites
[Figure: ES-PIC router operators warn the ES-PIC Grid Data contact (1.1)]
Documentation and tool update
• 2.1:
  – The change will be documented in the change management database
    • https://twiki.cern.ch/twiki/bin/view/LHCOPN/ChangeManagementDatabase
  – Technical information will be updated
    • https://twiki.cern.ch/twiki/bin/view/LHCOPN/LhcopnIpAddresses
    • https://twiki.cern.ch/twiki/bin/view/LHCOPN/OverallNetworkMaps
[Figure: ES-PIC router operators warn the Grid Data contact (1.1) and update the Global web repository (Twiki), i.e. technical information and change management DB (2.1)]
Broadcasting
2.2: An « informational » GGUS ticket will be created
  – With link to the change management database entry
  – With link to the updated technical information
  – 3: All sites will be notified
• 3: DANTE Operation + ENOC are put in copy
  – New prefixes might also need to be monitored by MDM + ASPDrawer
Summary
[Figure: summary of the ES-PIC change. Boxes: ES-PIC (Grid Data contact, router operators), Global web repository (Twiki) with technical information and change management DB, Monitoring (MDM, BGP), LHCOPN TTS (GGUS), all sites, DANTE Operation, ENOC. Numbered arrows: 1.1, 2.1, 2.2, (2.3), 3.]
Committing the change (1/2)
• The change is documented and advertised but not yet committed
• Does the change, or its commitment, impact existing service?
  – No, so no need to commit it within a “true” maintenance
Committing the change (2/2)
• The change will be silently implemented by ES-PIC and reported with a GGUS ticket
  – Kind: Maintenance L3
  – To track implementation + statistics
Conclusion for third use-case
• Documenting and implementing are separated
  – 2 tickets: Informational & Maintenance
• Third party tools might need to be updated
  – MDM, e2emon, ASPDrawer, GGUS …
• Lighter process for non-impacting changes
USLHCNET's scheduled power cut for devices in Chicago
L2 maintenance management
Scope (1/2)
• USLHCNET will have power cut in Chicago
Scope (2/2)
• Fictional impact:
  – US-FNAL-CMS will be fully disconnected
L2 maintenance management
https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#L2_maintenance_management_proces
Sources for L2 maintenance are L2 network providers (optical transmitter to be changed, fibre physically rerouted, fibre to be cleaned...)
Often there will be no negotiation phase for L2 maintenance with L2 network providers. But if an event is really disturbing, negotiation should be attempted.
1.1 The L2NOC will send its maintenance notice to connected or affected router operators. The first router operator notified starts this process.
1.2 The router operator will warn its Grid data contact (and may check with him that the date is OK)
1.3 The router operator may check with distant affected sites, off the record, that the date is suitable
1.4 If a disturbing overlapping event is found, we should try to negotiate another date with the network provider and restart at step 1.1. Otherwise the maintenance is posted in the LHCOPN TTS by the router operator.
2 All impacted sites are notified.
3 The maintenance is performed and the LHCOPN TT is updated. Updates are broadcast to all impacted sites. The process ends when the LHCOPN TT is closed.
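Steps 1.1 to 1.4 form a loop: each date announced by the provider is checked against disturbing events, and the process restarts at 1.1 until a suitable date is found. The sketch below illustrates that loop only; the function name, the idea of representing disturbing events as a set of dates, and the dates in the example are assumptions for illustration.

```python
from datetime import date
from typing import Iterable, Optional, Set, Tuple


def negotiate_maintenance_date(
    proposed_dates: Iterable[date],
    disturbing_dates: Set[date],
) -> Tuple[Optional[date], int]:
    """Return the first acceptable date and the number of 1.1-1.4 rounds used."""
    rounds = 0
    for day in proposed_dates:      # 1.1: the L2NOC announces a date
        rounds += 1
        # 1.2 / 1.3: Grid data contact and distant sites report overlapping events
        if day in disturbing_dates:
            continue                # 1.4: negotiate another date, restart at 1.1
        return day, rounds          # 1.4: post the maintenance in the LHCOPN TTS
    return None, rounds             # the provider offered no suitable date


if __name__ == "__main__":
    # Fictional dates, used only to exercise the loop.
    first, second = date(2009, 2, 1), date(2009, 2, 8)
    print(negotiate_maintenance_date([first, second], {first}))
```

Since there is often no negotiation phase with L2 providers, the common case is a single round with an empty set of disturbing dates.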
[Figure: L2 Maintenance management process (V0.5 20081215 gcx). Boxes: L2 NOC; linked sites (Grid Data contact, router operators); LHCOPN TTS (GGUS); affected sites. Numbered arrows: 1.1, 1.2, 1.3, 1.4, 2. Legend: A notifies B; A interacts with B; A reads and writes B.]
Registering maintenance (1/2)
1.1: USLHCNET warns at least site US-FNAL-CMS
  • Not Grid, not all LHCOPN sites etc.
1.2: US-FNAL-CMS will warn its local Grid data contact
  – And may check with him that the date is OK
  – 1.3: Ideally also avoid overlap with CH-CERN’s events
[Figure: USLHCNET NOC warns US-FNAL-CMS router operators (1.1), who warn the Grid Data contact (1.2) and check with linked site CH-CERN (1.3)]
Registering maintenance (2/2)
• Affected sites:
  – US-FNAL-CMS, CH-CERN
  – US-FNAL-CMS is responsible for following this event
• 1.4: A FNAL router operator will put the maintenance into GGUS
GGUS submit interface
Summary
[Figure: summary of the USLHCNET maintenance. Boxes: USLHCNET NOC; US-FNAL-CMS (Grid Data contact, router operators); linked site CH-CERN (router operators); LHCOPN TTS (GGUS); notified site CH-CERN. Numbered arrows: 1.1, 1.2, 1.3, 1.4, 2.]
Following
• US-FNAL-CMS updates ticket according to USLHCNET reports
• US-FNAL-CMS is in charge of closing the ticket when the maintenance is terminated
Ticket’s handling
Conclusion for fourth use-case
• Light process for network providers
  – Like what currently happens
  – Warn only your customers
  – No Grid interaction
• Site acts as a relay for information from network providers
  – Propagated within LHCOPN community
Overall conclusion
Overall conclusion (1/2)
• Sample provided here
  – Many details could be adjusted
• Steps for incident management
  – Investigate, register, broadcast, follow
• Steps for change management
  – Document, register, broadcast, commit
• Steps for maintenance management
  – Register, broadcast, (commit), follow
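The three step sequences can be written down as a small lookup table, which makes the shared pattern (register, then broadcast) easy to see. The dictionary and the `next_step` helper are illustrative names only and are not part of any LHCOPN tool.

```python
from typing import Dict, Optional, Tuple

# Step sequences as listed on the slide. The maintenance "commit" step is
# shown in parentheses there, i.e. it does not always apply.
PROCESS_STEPS: Dict[str, Tuple[str, ...]] = {
    "incident": ("investigate", "register", "broadcast", "follow"),
    "change": ("document", "register", "broadcast", "commit"),
    "maintenance": ("register", "broadcast", "commit", "follow"),
}


def next_step(process: str, current: str) -> Optional[str]:
    """Return the step after `current`, or None when the process is finished."""
    steps = PROCESS_STEPS[process]
    position = steps.index(current)
    return steps[position + 1] if position + 1 < len(steps) else None


if __name__ == "__main__":
    print(next_step("change", "register"))
```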
Overall conclusion (2/2)
• Not really different from current way to carry network operations?
  – But formalised
• Feel free to ask details on processes
  – Propose interesting/embarrassing use-case
  – Everything is/will be on the twiki
• GGUS accesses/notifications are indispensable
  – Access table is a key thing to be accurately filled
Questions & discussion