LHCOPN operational model - 4 use-cases
LHCOPN operational model - 4 use-cases Guillaume Cessieux (FR-CCIN2P3 / EGEE networking support) on behalf of the Ops WG LHCOPN meeting, 2009-01-15, Berlin

Page 1: LHCOPN operational model - 4 use-cases

LHCOPN operational model - 4 use-cases

Guillaume Cessieux (FR-CCIN2P3 / EGEE networking support) on behalf of the Ops WG

LHCOPN meeting, 2009-01-15, Berlin

Page 2: LHCOPN operational model - 4 use-cases

Agenda

Focus on 4 use-cases:

• Incident Management
  1. L3: Power outage at DE-KIT leading to routers down
  2. L2: Fibre cut between London and Didcot affecting CERN-RAL-LHCOPN-001

• Change Management
  3. L3: New IP prefix for ES-PIC

• Maintenance Management
  4. L2: USLHCNET's scheduled power cut for devices in Chicago

GCX - LHCOPN meeting - 2009-01-15 2

Page 3: LHCOPN operational model - 4 use-cases

Tools used

• CERN’s twiki
  – https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome
  – https://twiki.cern.ch/twiki/bin/view/LHCOPN/OpsModelUseCases
  – https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel
  – https://twiki.cern.ch/twiki/bin/view/LHCOPN/OpsContacts

• GGUS
  – Public release 2009-02-01

• Monitoring
  – MDM, e2emon, ASPDrawer...

Page 4: LHCOPN operational model - 4 use-cases

POWER OUTAGE AT DE-KIT LEADING TO ROUTERS DOWN

L3 incident management


Page 5: LHCOPN operational model - 4 use-cases

Scope

• 2 routers unexpectedly down
• Affected:
  – NL-T1, CH-CERN, IT-INFN-CNAF, FR-CCIN2P3, DE-KIT
  – 5 links

Page 6: LHCOPN operational model - 4 use-cases

L3 incident management
https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#L3_incident_management_process

Scope: Router down, BGP filtering, bad routing... The source site is the site where the problem lies.

1.1 A ticket is created on the LHCOPN Helpdesk by the router operator of the source site to report the incident. It is assigned to the source site itself.

1.2 The router operator contacts their counterpart at the distant site (site-to-site communication) to find out whether something has gone wrong there (power outage...). If the problem is at the distant site, the distant site starts this process (the ticket is then re-assigned to the distant site).

1.3 If the problem is related to an underlying layer (L2: dark fibre outage...), the router operator starts the L2 incident management process. The router operator is responsible for managing the trouble with the L2 NOC (opening and following the NOC's ticket...) and remains responsible for the LHCOPN ticket in GGUS.

1.4 Otherwise the router operator owns the problem and contacts the local Grid data contact to report the impact. The distant router operator is also informed.

2 The LHCOPN TTS notifies all impacted sites about the incident.
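The branching in steps 1.1 to 2 can be sketched in a few lines of Python. This is only an illustration of the decision order described above, not GGUS code: the `L3Incident` class, its field names and the `l3_incident_actions` function are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class L3Incident:
    """Hypothetical record of what the router operator has found so far."""
    source_site: str
    distant_site: Optional[str] = None
    problem_at_distant_site: bool = False  # outcome of step 1.2
    caused_by_l2: bool = False             # outcome of step 1.3


def l3_incident_actions(incident: L3Incident) -> List[str]:
    """Return the ordered actions for an L3 incident, following steps 1.1 to 2."""
    actions = [f"1.1 open LHCOPN ticket, assigned to {incident.source_site}"]
    if incident.problem_at_distant_site:
        # The distant site takes over and runs the process from its side.
        actions.append(
            f"1.2 re-assign ticket to {incident.distant_site}, which restarts the process"
        )
        return actions
    if incident.caused_by_l2:
        actions.append("1.3 start L2 incident management, keep the LHCOPN ticket")
    else:
        actions.append(
            "1.4 report impact to local Grid data contact, inform distant router operator"
        )
    actions.append("2 TTS notifies all impacted sites")
    return actions


for line in l3_incident_actions(L3Incident(source_site="DE-KIT")):
    print(line)
```

For a locally identified outage such as the DE-KIT power cut, the sketch yields steps 1.1, 1.4 and 2 only.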

Page 7: LHCOPN operational model - 4 use-cases

L3 Incident management process

[Diagram, V0.5 20081215 gcx. Elements: source site involved (Grid data contact, router operators), site involved (router operators), LHCOPN TTS (GGUS), affected sites, L2 incident management. Step labels: 1.1, 1.2, (1.3), 1.4, 2. Legend: A notifies B; A interacts with B; A reads and writes B; A goes to process B.]

Page 8: LHCOPN operational model - 4 use-cases

Ticket opening

1.1 A DE-KIT router operator opens a trouble ticket in GGUS

[Diagram. Elements: DE-KIT router operators, LHCOPN TTS (GGUS). Step label: 1.1.]

Page 9: LHCOPN operational model - 4 use-cases

GGUS submit interface


Page 10: LHCOPN operational model - 4 use-cases

Ticket opened


Page 11: LHCOPN operational model - 4 use-cases

Other steps

• Outage is localised and noticed by the source site
  – No need to perform 1.2: Contact counterpart on distant site

• This is a power cut, not a real L2 problem
  – No need to go further on 1.3: L2 incident management process

Page 12: LHCOPN operational model - 4 use-cases

Grid interaction

• 1.4: Grid data contact at DE-KIT is warned about the outage
  – GGUS TTid provided
  – He will compute impact on the Grid
  – He will warn the Grid

[Diagram. Elements: DE-KIT (Grid data contact, router operators), LHCOPN TTS (GGUS). Step labels: 1.1, 1.4.]

Page 13: LHCOPN operational model - 4 use-cases

Automatic broadcasting

2: The GGUS TTS will warn all affected sites
  – This is done when the ticket is submitted

[Diagram. Elements: DE-KIT (Grid data contact, router operators), LHCOPN TTS (GGUS), notified sites CH-CERN, FR-CCIN2P3, IT-INFN-CNAF, NL-T1, DE-KIT. Step labels: 1.1, 1.4, 2.]

Page 14: LHCOPN operational model - 4 use-cases

Following/Closure

• Incident registration and broadcasting is terminated
• DE-KIT router operator is in charge of updating/closing the GGUS ticket
  – Affected sites will be notified
• Local Grid data contact must also be warned

Page 15: LHCOPN operational model - 4 use-cases

History


Page 16: LHCOPN operational model - 4 use-cases

Conclusion for first use case

• Shortcut as the incident is quickly localised
  – Otherwise more interactions between sites

• Deeply organised around GGUS tickets
  – Could be opened by another site and assigned to DE-KIT
  – Put status from « assigned » to « in progress » to acknowledge

Page 17: LHCOPN operational model - 4 use-cases

Fibre cut between London and Didcot affecting CERN-RAL-LHCOPN-001

L2 Incident management


Page 18: LHCOPN operational model - 4 use-cases

Scope

• Router operator at UK-T1-RAL noticed that the link is down thanks to their monitoring system

• Affected
  – 1 link: CERN-RAL-LHCOPN-001
  – 2 sites: CH-CERN and UK-T1-RAL

• No clear idea of what and where the problem is
  – Router down at CH-CERN, fibre cut…

Page 19: LHCOPN operational model - 4 use-cases

Global problem management process started


Page 20: LHCOPN operational model - 4 use-cases

Quick investigation

1- Nothing seems to be occurring on site
2- Take an overview of the LHCOPN
  – e2emon monitoring system indicates that the L2 link is down in segment “UKERNA”
    • Now tracking a fibre cut
  – Nothing seems to be registered on GGUS about it
    • Unscheduled event = Incident

• Going to L2 incident management

Page 21: LHCOPN operational model - 4 use-cases

L2 incident management
https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#L2_incident_management_process

Scope: Dark fibre outages...

1.1 An L2 NOC and a router operator may each notice an L2 incident. They interact to confirm it or not. A router operator may also be warned by the L3 incident management process through an LHCOPN ticket assigned to their site.

1.2 If confirmed, the router operator of a linked site puts a ticket on the LHCOPN TTS. The router operator is in charge of dealing with the L2 network providers involved and of reflecting the ongoing resolution in the LHCOPN TTS.

1.3 It is the responsibility of linked and affected sites to warn their Grid data contact.

2 All impacted sites are notified by the TTS.

3 If nothing is found at L2, the escalated incident management process is started.
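As a rough illustration of how steps 1.1 to 3 chain together, the process can be written as a small Python function. The function name `l2_incident_actions` and its two boolean inputs are assumptions made for this sketch; the real process is driven by people and GGUS tickets, not by code.

```python
from typing import List


def l2_incident_actions(linked_site: str, confirmed: bool, found_at_l2: bool) -> List[str]:
    """Return the ordered actions for a suspected L2 incident (steps 1.1 to 3)."""
    actions = ["1.1 L2 NOC and router operator interact to confirm the incident"]
    if not confirmed:
        # Nothing to register if the L2 incident is not confirmed.
        return actions
    actions.append(
        f"1.2 {linked_site} router operator opens an LHCOPN TTS ticket "
        "and follows the L2 network provider"
    )
    actions.append("1.3 linked and affected sites warn their Grid data contact")
    actions.append("2 TTS notifies all impacted sites")
    if not found_at_l2:
        actions.append("3 start escalated incident management")
    return actions


# Second use case: fibre cut confirmed and located at L2.
for line in l2_incident_actions("UK-T1-RAL", confirmed=True, found_at_l2=True):
    print(line)
```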

Page 22: LHCOPN operational model - 4 use-cases

L2 Incident management process

[Diagram, V0.5 20081215 gcx. Elements: sites linked (L2 NOC, Grid data contact, router operators), LHCOPN TTS (GGUS), affected sites, end of L3 incident management, escalated incident management. Step labels: 1.1, 1.2, 1.3, 2, (3). Legend: A notifies B; A interacts with B; A reads and writes B.]

Page 23: LHCOPN operational model - 4 use-cases

Incident registration

• 1.1: Router operator at UK-T1-RAL will open a ticket with JANET for the outage
• 1.2: UK-T1-RAL noticed the outage, so it will open a ticket in GGUS for the LHCOPN community
  – Self-assigned because the link is under their responsibility (T0-T1)

[Diagram. Elements: UK-T1-RAL router operators, JANET NOC, LHCOPN TTS (GGUS). Step labels: 1.1, 1.2.]

Page 24: LHCOPN operational model - 4 use-cases

GGUS ticket submitted

Page 25: LHCOPN operational model - 4 use-cases

Broadcasting

1.3: Grid interaction
  – Local Grid data contact warned (+ #GGUS-TTid)

2: Other affected sites automatically notified by GGUS

[Diagram. Elements: sites linked, UK-T1-RAL (Grid data contact, router operators), JANET NOC, LHCOPN TTS (GGUS), CH-CERN. Step labels: 1.1, 1.2, 1.3, 2.]

Page 26: LHCOPN operational model - 4 use-cases

Following/Closure

• UK-T1-RAL will update the GGUS ticket with information from JANET
  – Grid data contact and affected sites are kept updated

• Ticket will be closed by UK-T1-RAL

Page 27: LHCOPN operational model - 4 use-cases

Conclusion for second use-case

• Accurate and reliable monitoring is required to really shortcut investigations

• Key communication between network provider and customer
  – We did not change the way this currently works

Page 28: LHCOPN operational model - 4 use-cases

New IP prefix for ES-PIC

L3 change management

Page 29: LHCOPN operational model - 4 use-cases

Scope

• ES-PIC has a new IP prefix that must be included within the LHCOPN

• Affected:
  – All sites: Filters to update…
  – And monitoring systems

Page 30: LHCOPN operational model - 4 use-cases

L3 change management
https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#L3_change_management_process

Scope: IP address changes, new prefix propagated, new filtering. The source actors for these changes are router operators.

1.1 The router operator presents the change to the Grid data contact (change in performance, new resiliency possibility...).

1.2 The router operator presents the change to affected sites (e.g. linked sites).

2.1 The change is fully documented on the global web repository, and related technical information should also be updated.

2.2 An informational ticket summarising the change is put into the LHCOPN TTS. It contains a link to the full documentation of the change (e.g. a URL to the global web repository).

2.3 The L3 monitoring infrastructure may be adapted if needed (new p2p IPs to be watched...).

3 The LHCOPN TTS notifies all impacted sites.

4 If the change has an impact, an L3 maintenance management process is started to commit it. Otherwise the change can be made directly.

If an L3 change impacts L2 (L3 VPN for instance), the L2 change management process should be started.
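The ordering of the change management steps, and the fork at step 4, might be sketched as follows. The `L3Change` class and `plan_l3_change` function are hypothetical names used only for illustration; the documentation URL in the usage line is the change management database page referenced in this deck.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class L3Change:
    """Hypothetical description of an L3 change raised by a router operator."""
    site: str
    summary: str
    doc_url: str              # where the change is fully documented (step 2.1)
    has_impact: bool          # does committing it affect existing service? (step 4)
    impacts_l2: bool = False  # e.g. an L3 VPN change


def plan_l3_change(change: L3Change) -> List[str]:
    """Return the ordered steps for an L3 change (steps 1.1 to 4)."""
    steps = [
        "1.1 present change to Grid data contact",
        "1.2 present change to affected sites",
        f"2.1 document change at {change.doc_url}",
        f"2.2 informational ticket in LHCOPN TTS: {change.summary} ({change.doc_url})",
        "2.3 adapt L3 monitoring if needed",
        "3 TTS notifies all impacted sites",
    ]
    if change.has_impact:
        steps.append("4 start L3 maintenance management to commit the change")
    else:
        steps.append("4 commit the change directly")
    if change.impacts_l2:
        steps.append("also start L2 change management")
    return steps


change = L3Change(
    site="ES-PIC",
    summary="new IP prefix",
    doc_url="https://twiki.cern.ch/twiki/bin/view/LHCOPN/ChangeManagementDatabase",
    has_impact=False,
)
print("\n".join(plan_l3_change(change)))
```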

Page 31: LHCOPN operational model - 4 use-cases

L3 Change Management

[Diagram, V0.5 20081215 gcx. Elements: source site (Grid data contact, router operators), linked sites, affected sites (router operators), global web repository (Twiki), monitoring, LHCOPN TTS (GGUS), L3 maintenance management. Step labels: 1.1, 1.2, 2.1, 2.2, (2.3), 3, (4). Legend: A notifies B; A interacts with B; A reads and writes B.]

Page 32: LHCOPN operational model - 4 use-cases

Change registration

1.1: Grid data contact is warned about the change
  – Will new hosts benefit from the LHCOPN?

1.2: This change is common and has no deep impact on others
  – No need to discuss with impacted sites

[Diagram. Elements: ES-PIC (Grid data contact, router operators). Step label: 1.1.]

Page 33: LHCOPN operational model - 4 use-cases

Documentation and tool update

• 2.1:
  – The change will be documented on the change management database
    • https://twiki.cern.ch/twiki/bin/view/LHCOPN/ChangeManagementDatabase
  – Technical information will be updated
    • https://twiki.cern.ch/twiki/bin/view/LHCOPN/LhcopnIpAddresses
    • https://twiki.cern.ch/twiki/bin/view/LHCOPN/OverallNetworkMaps

[Diagram. Elements: ES-PIC (Grid data contact, router operators), global web repository (Twiki) holding technical information and the change management DB. Step labels: 1.1, 2.1.]

Page 34: LHCOPN operational model - 4 use-cases

Broadcasting

2.2: An « informational » GGUS ticket will be created
  – With link to the change management database entry
  – With link to the updated technical information

3: All sites will be notified
  – DANTE Operation + ENOC are put in copy
  – New prefixes might also need to be monitored by MDM + ASPDrawer

Page 35: LHCOPN operational model - 4 use-cases

GGUS submit interface


[email protected] + ENOC

Page 36: LHCOPN operational model - 4 use-cases

Summary

[Diagram. Elements: ES-PIC (Grid data contact, router operators), global web repository (Twiki) holding technical information and the change management DB, monitoring (MDM, BGP), LHCOPN TTS (GGUS), all sites, DANTE Operation, ENOC. Step labels: 1.1, 2.1, 2.2, (2.3), 3.]

Page 37: LHCOPN operational model - 4 use-cases

Committing the change (1/2)

• The change is documented and advertised but not yet committed

• Does the change, or committing it, have an impact on existing service?
  – No, so no need to commit it within a “true” maintenance

Page 38: LHCOPN operational model - 4 use-cases

Committing the change (2/2)

• The change will be silently implemented by ES-PIC and reported with a GGUS ticket
  – Kind: Maintenance L3
  – To track implementation + statistics

Page 39: LHCOPN operational model - 4 use-cases

Conclusion for third use-case

• Documenting and implementing are separated
  – 2 tickets: Informational & Maintenance

• Third party tools might need to be updated
  – MDM, e2emon, ASPDrawer, GGUS …

• Lighter process for non-impacting changes

Page 40: LHCOPN operational model - 4 use-cases

USLHCNET's scheduled power cut for devices in Chicago

L2 maintenance management


Page 41: LHCOPN operational model - 4 use-cases

Scope (1/2)

• USLHCNET will have a power cut in Chicago

Page 42: LHCOPN operational model - 4 use-cases

Scope (2/2)

• Fictional impact:
  – US-FNAL-CMS will be fully disconnected

Page 43: LHCOPN operational model - 4 use-cases

L2 maintenance management
https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#L2_maintenance_management_proces

Sources of L2 maintenance are L2 network providers (optical transmitter to be changed, fibre physically rerouted, fibre to be cleaned...).

Often there is no negotiation phase with L2 network providers for L2 maintenance. But if an event is really disruptive, negotiation should be attempted.

1.1 The L2 NOC sends its maintenance notice to connected or affected router operators. The first router operator to be notified starts this process.

1.2 The router operator warns the Grid data contact (and may check with them that the date is OK).

1.3 The router operator may check with distant affected sites, off the record, that the date is suitable.

1.4 If a disruptive overlapping event is found, we should try to negotiate another date with the network provider and restart at step 1.1. Otherwise the maintenance is posted in the LHCOPN TTS by the router operator.

2 All impacted sites are notified.

3 The maintenance is performed and the LHCOPN TT is updated. Updates are broadcast to all impacted sites. The process ends when the LHCOPN TT is closed.
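The overlap check in step 1.4 is the only part of this process that is close to an algorithm, so here is a minimal sketch of it. It assumes each event is a (start, end) pair of `datetime` values; the function names, the event name and the dates in the usage lines are hypothetical and do not come from the use case.

```python
from datetime import datetime
from typing import Dict, List, Tuple

Window = Tuple[datetime, datetime]  # (start, end) of an event


def overlaps(a: Window, b: Window) -> bool:
    """True if the two time windows share any instant (end points excluded)."""
    return a[0] < b[1] and b[0] < a[1]


def register_l2_maintenance(
    window: Window, known_events: Dict[str, Window]
) -> Tuple[str, List[str]]:
    """Step 1.4: renegotiate if another event overlaps, otherwise post to the TTS."""
    clashes = sorted(
        name for name, other in known_events.items() if overlaps(window, other)
    )
    if clashes:
        return ("renegotiate with provider, restart at 1.1", clashes)
    return ("post maintenance in LHCOPN TTS", [])


# Hypothetical dates, for illustration only.
power_cut = (datetime(2009, 3, 1, 8), datetime(2009, 3, 1, 12))
events = {"hypothetical CH-CERN event": (datetime(2009, 3, 1, 11), datetime(2009, 3, 1, 13))}
print(register_l2_maintenance(power_cut, events))
print(register_l2_maintenance(power_cut, {}))
```

Whether an overlap is "disturbing" enough to renegotiate remains a human judgement; the sketch treats every overlap as a clash.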

Page 44: LHCOPN operational model - 4 use-cases

L2 Maintenance management process

[Diagram, V0.5 20081215 gcx. Elements: L2 NOC, linked sites (Grid data contact, router operators), LHCOPN TTS (GGUS), affected sites. Step labels: 1.1, 1.2, 1.3, 1.4, 2. Legend: A notifies B; A interacts with B; A reads and writes B.]

Page 45: LHCOPN operational model - 4 use-cases

Registering maintenance (1/2)

1.1: USLHCNET warns at least site US-FNAL-CMS
  • Not Grid, not all LHCOPN sites etc.

1.2: US-FNAL-CMS will warn its local Grid data contact
  – And may check with him that the date is OK

1.3: Ideally also avoid overlap with CH-CERN’s events

[Diagram. Elements: USLHCNET NOC, US-FNAL-CMS (Grid data contact, router operators), linked site CH-CERN. Step labels: 1.1, 1.2, 1.3.]

Page 46: LHCOPN operational model - 4 use-cases

Registering maintenance (2/2)

• Affected sites:
  – US-FNAL-CMS, CH-CERN
  – US-FNAL-CMS is responsible for following this event

• 1.4: A FNAL router operator will put the maintenance into GGUS

Page 47: LHCOPN operational model - 4 use-cases

GGUS submit interface


Page 48: LHCOPN operational model - 4 use-cases

Summary

[Diagram. Elements: USLHCNET NOC, US-FNAL-CMS (Grid data contact, router operators), linked site CH-CERN (router operators), LHCOPN TTS (GGUS), notified site CH-CERN. Step labels: 1.1, 1.2, 1.3, 1.4, 2.]

Page 49: LHCOPN operational model - 4 use-cases

Following

• US-FNAL-CMS updates the ticket according to USLHCNET reports

• US-FNAL-CMS is in charge of closing the ticket when the maintenance is finished

Page 50: LHCOPN operational model - 4 use-cases

Ticket’s handling


Page 51: LHCOPN operational model - 4 use-cases

Conclusion for fourth use-case

• Light process for network providers
  – Like what currently happens
  – Warn only your customers
  – No Grid interaction

• Site acts as a relay for information from network providers
  – Propagated within LHCOPN community

Page 52: LHCOPN operational model - 4 use-cases

Overall conclusion


Page 53: LHCOPN operational model - 4 use-cases

Overall conclusion (1/2)

• Sample provided here
  – Many details could be adjusted

• Steps for incident management
  – Investigate, register, broadcast, follow

• Steps for change management
  – Document, register, broadcast, commit

• Steps for maintenance management
  – Register, broadcast, (commit), follow
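The three step lists above can be written down as plain data, which makes the shared core (register, broadcast) easy to see. This is only an illustrative sketch: `PROCESS_STEPS`, `OPTIONAL_STEPS` and `next_step` are names invented here, not part of the operational model.

```python
from typing import Dict, List, Optional

# Step order per process, as listed on this slide.
PROCESS_STEPS: Dict[str, List[str]] = {
    "incident": ["investigate", "register", "broadcast", "follow"],
    "change": ["document", "register", "broadcast", "commit"],
    "maintenance": ["register", "broadcast", "commit", "follow"],
}

# "(commit)" is shown in brackets for maintenance, so it is treated as optional.
OPTIONAL_STEPS = {("maintenance", "commit")}


def next_step(process: str, done: Optional[str]) -> Optional[str]:
    """Return the step that follows `done`, or the first step if nothing is done yet."""
    steps = PROCESS_STEPS[process]
    if done is None:
        return steps[0]
    position = steps.index(done)
    return steps[position + 1] if position + 1 < len(steps) else None


# Steps common to all three processes.
common = set.intersection(*(set(steps) for steps in PROCESS_STEPS.values()))
print(sorted(common))
```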


Page 54: LHCOPN operational model - 4 use-cases

Overall conclusion (2/2)

• Not really different from the current way of carrying out network operations?
  – But formalised

• Feel free to ask for details on processes
  – Propose interesting/embarrassing use-cases
  – Everything is/will be on the twiki

• GGUS access/notifications are indispensable
  – The access table is a key thing to be accurately filled in

Page 55: LHCOPN operational model - 4 use-cases

Questions & discussion
