
ICNP 2003 Tutorial

Availability and Survivability in IP Networks

Supratik Bhattacharyya, Sprint Advanced Technology Labs
[email protected]

Gianluca Iannaccone, Intel Research Cambridge
[email protected]

Copyright ©2003 Sprint. All rights reserved.


ICNP 2003, S. Bhattacharyya and G. Iannaccone, November 4th, 2003

Tutorial Outline

Part I: Introduction & Background

Part II: Common approaches to survivability

Part III: The Sprint experience

Part IV: Open Issues & Future Directions


Part I: Introduction & Background


Part I - Outline

IP Networks
Survivability & Availability
Scope of this tutorial


What is the Internet?

A set of networks running the IP protocol
Networks are loosely connected
Administered independently
For our purposes, the Internet ends at the borders of the Internet Service Providers (ISPs).


Loosely connected networks...

[Figure: loosely connected networks (UUnet, Sprint, AT&T, BT, a Tier 2 ISP, and a dial-up ISP) interconnected at peering points]


Sub-IP Technologies (Sprint network)

Dense Wavelength Division Multiplexing (DWDM)
[one fiber can carry up to 40 λ at OC-192 (10 Gbps) speed]

SONET framing
Cisco HDLC (for protocol multiplexing)

IP
IS-IS
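As a back-of-the-envelope check on the DWDM figure above, the sketch below computes per-fiber capacity from the SONET line rate. It assumes the standard OC-1 rate of 51.84 Mbit/s, so OC-192 is 192 × 51.84 Mbit/s ≈ 9.95 Gbit/s (the "10 Gbps" on the slide). It ignores SONET overhead and any wavelengths held in reserve.

```python
# Standard SONET OC-1 line rate in Mbit/s; an OC-n signal runs at n times this rate.
OC1_MBPS = 51.84

def oc_line_rate_gbps(n: int) -> float:
    """Line rate of an OC-n signal in Gbit/s (SONET overhead included)."""
    return n * OC1_MBPS / 1000.0

def fiber_capacity_gbps(wavelengths: int, oc_level: int) -> float:
    """Aggregate line rate of one fiber carrying `wavelengths` OC-n channels."""
    return wavelengths * oc_line_rate_gbps(oc_level)

if __name__ == "__main__":
    print(f"OC-192 line rate: {oc_line_rate_gbps(192):.2f} Gbit/s")         # 9.95
    print(f"40 x OC-192 on one fiber: {fiber_capacity_gbps(40, 192):.1f} Gbit/s")  # 398.1
```

In other words, a single fiber in this setup carries roughly 400 Gbit/s, which is why one fiber cut can affect many IP links at once.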


Topology of today’s tier-1 backbones

Points-of-Presence (PoPs) connected by long-haul fibers
– Each PoP contains several backbone routers and access routers to connect customers
– Two options: many small PoPs or few large PoPs
– Trade-off: sites to maintain vs. proximity to customers

Some examples
– Sprint: IP over SONET with few large PoPs
– AT&T: IP over SONET with many small PoPs
– UUNet: MPLS/IP/ATM with many small PoPs


The Sprint U.S. Topology


Protocols for IP networks

Exterior Gateway Protocol (EGP)
– Border Gateway Protocol (BGP)
– Announces reachability between networks
– Heavily affected by policies
  • “Hot-potato routing” is the norm
  • Based on trust relationships between ISPs
  • Difficult to provide availability/survivability guarantees on BGP (and no incentive to do so)
  • Visibility into problems via the NANOG mailing list...
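Hot-potato routing means an ISP hands traffic to a neighbor network at the exit point closest to where the traffic entered. Closeness is judged by the ISP's own IGP cost, not by what is best end to end. The sketch below illustrates only that final choice. The router names and IGP costs are hypothetical. Real BGP decision processes apply other criteria first (local preference, AS-path length, and more) and compare IGP cost only among routes still tied.

```python
from typing import Dict, List

def hot_potato_egress(igp_cost: Dict[str, int], candidate_egresses: List[str]) -> str:
    """Pick the egress router with the lowest IGP cost from the local router.

    All candidates are assumed to carry equally preferred BGP routes, so only
    the internal (IGP) distance decides. Ties are broken by name, an arbitrary
    choice made for this sketch.
    """
    return min(candidate_egresses, key=lambda r: (igp_cost[r], r))

# Hypothetical IGP distances from one ingress router to three peering points.
costs = {"peering-NYC": 12, "peering-CHI": 30, "peering-SJC": 75}
print(hot_potato_egress(costs, ["peering-CHI", "peering-SJC", "peering-NYC"]))  # peering-NYC
```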


Protocols for IP networks (cont)

Interior Gateway Protocols (IGP)
– Intermediate System to Intermediate System (IS-IS)
– Open Shortest Path First (OSPF)
– Routing based on shortest paths
  • The ISP assigns a weight (or metric) to each link in the network.
  • Routers run the Dijkstra algorithm over the resulting weighted undirected graph.
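The following is a minimal sketch of the shortest-path computation each router performs over the weighted undirected graph. It produces a routing table of next hops and keeps every equal-cost next hop, which is how an entry with two next hops (such as "B, D" in the IS-IS example later in Part II) can arise. The topology is hypothetical. The sketch assumes strictly positive weights (IS-IS metrics are at least 1). Real IS-IS implementations add many refinements, such as incremental SPF.

```python
import heapq
from typing import Dict, Set

Graph = Dict[str, Dict[str, int]]  # node -> {neighbor: link weight}

def routing_table(graph: Graph, source: str) -> Dict[str, Set[str]]:
    """Dijkstra from `source`; returns destination -> set of first-hop neighbors.

    Equal-cost paths are kept, so a destination may have several next hops.
    Assumes all link weights are positive.
    """
    dist = {source: 0}
    next_hops: Dict[str, Set[str]] = {source: set()}
    heap = [(0, source)]
    done: Set[str] = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        for v, w in graph[u].items():
            nd = d + w
            # First hop toward v: v itself if u is the source, else u's first hops.
            hops = {v} if u == source else next_hops[u]
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                next_hops[v] = set(hops)
                heapq.heappush(heap, (nd, v))
            elif nd == dist[v] and v not in done:
                next_hops[v] |= hops  # equal-cost path: merge first hops
    return next_hops

# Hypothetical four-router square with unit weights.
square = {
    "A": {"B": 1, "C": 1},
    "B": {"A": 1, "D": 1},
    "C": {"A": 1, "D": 1},
    "D": {"B": 1, "C": 1},
}
table = routing_table(square, "A")
print({dst: sorted(h) for dst, h in sorted(table.items())})
# {'A': [], 'B': ['B'], 'C': ['C'], 'D': ['B', 'C']}
```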


Definitions

Survivability: the ability to maintain uninterrupted service in the presence of failures
– It requires a definition of the failure scenarios

Availability: a measure of the disruption in packet forwarding due to failures
– Different from disruption due to traffic congestion
– It should be independent of traffic demand
– More discussion in Part IV


Scope of this tutorial

What we will cover in detail
– Sprint’s network design principles
– Performance of today’s routing equipment
– Failure characterization in IP networks

What we will cover briefly
– Optical protection/restoration
– MPLS-based approaches
– Many references can be found in the bibliography

What we will not cover
– Survivability across multiple autonomous systems


Part II: Common Approaches to Survivability


Part II - Outline

Protection vs. Restoration: fundamental trade-offs
Optical Protection/Restoration
MPLS-based Protection
IP Restoration
Multi-layer approaches


Protection vs. Restoration

Protection: fixed, pre-determined failure recovery
– wired into the network
– primary and backup paths are provisioned at the same time

Restoration: on-demand recovery
– no need to plan ahead where failures may occur
– the backup path is not defined a priori

They can co-exist...
– they operate at different timescales and layers

...but do we need both of them?


Trade-offs: Recovery Speed

Protection is inherently faster than restoration
But... how fast does it need to be?
– Outages between 50 and 200 ms will have minimal impact on services.
– Voice services are slightly affected by outages between 200 ms and 2 s; there are some concerns with video applications.
– [ANSI Technical Report T1.TR.68-2001]


Trade-offs: Deployment Costs

Restoration makes better use of network resources
Protection is less flexible than restoration
Example: optical protection
– Fiber provisioning cycle on the order of 12-18 months
– New fiber requires substantial capital investment
– Geography works “against” fiber path diversity

Flexibility is crucial if traffic demand grows very fast
– Late 90s: Internet traffic “doubling every year”
– [Coffman and Odlyzko, 2001]
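A quick calculation shows why a 12-18 month fiber provisioning cycle is a problem when traffic doubles every year. Demand grows by a factor of 2^(t/12) over t months. The sketch assumes smooth exponential growth at exactly the "doubling every year" rate quoted above, which is a simplification.

```python
def growth_factor(months: float, doubling_period_months: float = 12.0) -> float:
    """Traffic multiplier after `months`, assuming steady exponential growth."""
    return 2.0 ** (months / doubling_period_months)

for cycle in (12, 18):
    print(f"{cycle}-month provisioning cycle: traffic grows x{growth_factor(cycle):.2f}")
# 12-month provisioning cycle: traffic grows x2.00
# 18-month provisioning cycle: traffic grows x2.83
```

Under this assumption, traffic may nearly triple between the time new fiber is planned and the time it is lit.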


Trade-offs: Failure characteristics

The network is supposed to survive all failures
Understanding failures is crucial
– frequency and magnitude of failure events
– which equipment is more prone to fail

Lower layers cannot address higher layers’ failures
– e.g., failures of the IP forwarding engine
– the opposite is not (usually) true


Optical Protection schemes [Ramamurthy and Mukherjee, 1999]

Pre-configured backup route and wavelength
– Dedicated
– Shared


Dedicated vs. Shared Protection

[Figure: primary and backup paths for two connections, A and B, under dedicated protection and under shared protection]

Dedicated protection is more robust to failures
Shared protection uses resources more efficiently
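One way to see why shared protection is more efficient is to count the spare capacity each link must reserve. With dedicated protection, a link reserves capacity for every backup path that crosses it. With shared protection, backups whose primaries cannot fail together may reuse the same spare capacity, so a link reserves only for its worst single failure. The sketch below is a simplified model that assumes single-link failures. The paths and demands are hypothetical. It is not the formulation used in [Ramamurthy and Mukherjee, 1999].

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Link = FrozenSet[str]  # undirected link, e.g. frozenset({"X", "Y"})
Conn = Tuple[List[str], List[str], int]  # (primary path, backup path, demand)

def links(path: List[str]) -> List[Link]:
    """Undirected links traversed by a node path."""
    return [frozenset(pair) for pair in zip(path, path[1:])]

def spare_capacity(conns: List[Conn], shared: bool) -> Dict[Link, int]:
    """Spare capacity per link needed by the backups.

    Dedicated: sum of all backup demands crossing the link.
    Shared: max, over single failed links f, of the backup demand on this link
    that is activated when f fails.
    """
    if not shared:
        spare: Dict[Link, int] = defaultdict(int)
        for _, backup, demand in conns:
            for link in links(backup):
                spare[link] += demand
        return dict(spare)
    # need[link][f]: backup demand on `link` activated by the failure of f
    need: Dict[Link, Dict[Link, int]] = defaultdict(lambda: defaultdict(int))
    for primary, backup, demand in conns:
        for f in links(primary):
            for link in links(backup):
                need[link][f] += demand
    return {link: max(per_f.values()) for link, per_f in need.items()}

# Two hypothetical connections with disjoint primaries whose backups share X-Y.
conns = [
    (["A1", "A2"], ["A1", "X", "Y", "A2"], 1),
    (["B1", "B2"], ["B1", "X", "Y", "B2"], 1),
]
print(sum(spare_capacity(conns, shared=False).values()))  # 6
print(sum(spare_capacity(conns, shared=True).values()))   # 5
```

Link X-Y reserves 2 units under dedicated protection but only 1 under shared protection. The cost of that saving is that if both primaries failed at the same time, the shared link would be one unit short, which is the robustness the slide trades away.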


Optical Protection schemes (WDM) [Ramamurthy and Mukherjee, 1999]

Pre-configured backup route and wavelength
– Dedicated
  • Path protection
  • Link protection
– Shared
  • Path protection
  • Link protection


Path vs. Link Protection

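[Figure: path protection, with one backup for the whole path, vs. link protection, with separate backups (Backup 1, Backup 2) for individual links (Link 1, Link 2)]

Path protection uses resources more efficiently
Link protection gives longer backup paths
– Primary and backup paths use the same wavelength (may not be feasible)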


Comparison of Protection Schemes

Dedicated link protection is the least efficient
Usage of resources
– shared path << dedicated path < shared link
Susceptibility to two-link failures
– dedicated path = shared link < shared path
Protection switching time
– if the optical cross-connect (OXC) configuration time is low:
  shared link < dedicated path < shared path
– otherwise:
  dedicated path < shared link < shared path


Optical Restoration schemes

Dynamic discovery of path and wavelength
Again, it can be path-based or link-based

Restoration efficiency
– Measured as the fraction of connections restored after a failure
– Path restoration > link restoration

Restoration time
– Link restoration < path restoration


MPLS-based Protection

Multiprotocol Label Switching (MPLS)
– A Label Switched Path (LSP) uniquely defines the path between source and destination
– Routers (Label Switched Routers, LSRs) switch packets based on labels and can also assign a different label

Protection-like scheme
– Provision a primary LSP and a backup LSP
– In case of failure along the primary path, the first LSR assigns the backup label to incoming packets
– There can be multiple backup LSPs


Options for the backup LSP

There can be multiple backup LSPs
All backup paths are equal
– Selection based on listed order of configuration

Stand-by knob
– Maintains the backup path in ‘up’ condition
– Eliminates the call-setup delay of the secondary LSP
– Additional state information must be maintained
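The selection rule above can be sketched as follows. Backups are tried in the order they were configured. A backup kept in stand-by is used immediately, while any other backup must first be signaled. The data structures, the "usable path" check, and the setup delay value are hypothetical. They only illustrate the trade-off between switchover time and the extra state a stand-by LSP costs.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class BackupLSP:
    name: str
    path_up: bool   # can this backup's path currently be used?
    standby: bool   # stand-by knob: LSP already signaled and held 'up'

# Hypothetical call-setup delay for a backup that is not kept in stand-by.
SETUP_DELAY_MS = 300.0

def select_backup(backups: List[BackupLSP]) -> Optional[Tuple[str, float]]:
    """Return (backup name, extra setup delay in ms) for the first usable backup.

    All backups are equal, so the first usable one in configuration order wins.
    Returns None if no backup is usable.
    """
    for lsp in backups:
        if lsp.path_up:
            return lsp.name, 0.0 if lsp.standby else SETUP_DELAY_MS
    return None

configured = [
    BackupLSP("backup-1", path_up=False, standby=True),
    BackupLSP("backup-2", path_up=True, standby=False),
    BackupLSP("backup-3", path_up=True, standby=True),
]
print(select_backup(configured))  # ('backup-2', 300.0)
```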


MPLS Protection example

[Figure: LSR 0 reaches LSR 5 over a primary LSP (Label A) through LSR 1 and over a backup LSP (Label B) through LSR 3 and LSR 4; LSR 2 is also shown]

LSR 0: MPLS Table
Next Hop   Label
1          A
3          B

LSR 1: MPLS Table
Next Hop   Label
5          A

LSR 3: MPLS Table
Next Hop   Label
4          B

LSR 4: MPLS Table
Next Hop   Label
5          B
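The MPLS example above can be replayed in a few lines. The sketch encodes each LSR's table as shown. LSR 0 forwards with label A to LSR 1 on the primary LSP and with label B to LSR 3 on the backup. Two details are assumptions made for illustration: LSR 5 is treated as the egress, and the ingress is assumed to know about any failure on the primary LSP. Transit tables are keyed by incoming label only, a simplification of real LSRs.

```python
from typing import Dict, List, Set, Tuple

# Ingress LSR 0: (next hop, label to push); primary entry first, backup second.
INGRESS_ENTRIES: List[Tuple[int, str]] = [(1, "A"), (3, "B")]

# Transit LSRs: incoming label -> (next hop, outgoing label), from the tables above.
LABEL_TABLES: Dict[int, Dict[str, Tuple[int, str]]] = {
    1: {"A": (5, "A")},
    3: {"B": (4, "B")},
    4: {"B": (5, "B")},
}
EGRESS = 5                         # assumption: both LSPs end at LSR 5
PRIMARY_LINKS = {(0, 1), (1, 5)}   # links of the primary LSP, read off the tables

def lsp_path(failed_links: Set[Tuple[int, int]]) -> List[int]:
    """LSRs visited by a packet entering at LSR 0.

    The ingress switches to the backup label if any primary link has failed;
    transit LSRs only look up and swap labels.
    """
    use_backup = bool(PRIMARY_LINKS & failed_links)
    node, label = INGRESS_ENTRIES[1] if use_backup else INGRESS_ENTRIES[0]
    path = [0, node]
    while node != EGRESS:
        node, label = LABEL_TABLES[node][label]
        path.append(node)
    return path

print(lsp_path(set()))     # [0, 1, 5]
print(lsp_path({(1, 5)}))  # [0, 3, 4, 5]
```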


LSP Rerouting

Initiated by the ingress LSR (#0 in the example)
– Exception: Fast Re-Route (we will discuss it in Part IV)

Conditions that trigger a reroute
– A better route becomes available
– Failure along the primary path
– Preemption
– Manual configuration change

Recovery speed
– With the backup in stand-by: hundreds of ms to 1 s
– NANOG talk based on experiments in the Qwest network


IP Restoration: IS-IS protocol

Link-state protocol
– Each node has complete information on the topology
– Nodes flood the list of their neighbors and the cost to reach them
– Nodes independently compute their own routing tree

In case of failure
– The nodes that detect the failure flood an update message with the new list of neighbors
– Each node updates its routing tree accordingly
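The flooding step can be sketched as follows. A router that detects a failure originates a new LSP (link state PDU) with a higher sequence number. Every router that receives a newer LSP stores it and passes it on to its other neighbors, while stale or duplicate copies are dropped. This is a simplified model with a hypothetical topology. It omits real IS-IS details such as checksums, remaining lifetime, CSNP/PSNP synchronization, and pacing timers, and it does not show the route recomputation.

```python
from collections import deque
from typing import Dict, List

def flood(adjacency: Dict[str, List[str]], origin: str, seq: int,
          lsdb: Dict[str, Dict[str, int]]) -> int:
    """Flood `origin`'s LSP with sequence number `seq` through the network.

    lsdb[router][originator] is the newest sequence number `router` holds from
    `originator`. Returns the number of LSP copies transmitted.
    """
    lsdb[origin][origin] = seq
    queue = deque((origin, nbr) for nbr in adjacency[origin])
    sent = 0
    while queue:
        sender, receiver = queue.popleft()
        sent += 1
        if lsdb[receiver].get(origin, -1) >= seq:
            continue  # stale or duplicate copy: drop it
        lsdb[receiver][origin] = seq  # install (SPF recomputation would follow)
        queue.extend((receiver, n) for n in adjacency[receiver] if n != sender)
    return sent

# Hypothetical four-router ring A-B-C-D-A; A originates a new LSP after a change.
adjacency = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}
lsdb = {r: {} for r in adjacency}
print(flood(adjacency, "A", 2, lsdb))  # 5
```

Each router installs the update exactly once. The extra copies are duplicates that arrive over the second path around the ring and are discarded.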


IP Restoration: Example

[Figure: example topology with nodes A to H and IS-IS link weights]


IP Restoration: Example

[Figure: the example topology]

Routing table at A:
DST   NEXTHOP
A     -
B     B
C     B
D     D
E     D
F     B, D
G     D
H     D


IP Restoration: Example

[Figure: the example topology and the routing table at A, as above]

LSP: E-F down (an IS-IS link state PDU announces that link E-F is down)


IP Restoration: Example

[Figure: the example topology, following the E-F update]

Routing table at A:
DST   NEXTHOP
A     -
B     B
C     B
D     D
E     D
F     B, D
G     D
H     D


Multi-Layered Protection/Restoration

MPLS/IP over WDM needs coordination
– Multiple schemes should not compete
– Need an escalation strategy based on
  • explicit messaging, or
  • timer settings for detection and completion of restoration
– Higher layers should wait for lower layers first
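A timer-based escalation strategy can be sketched as a hold-off rule. The IP layer reacts to a failure only if the optical layer has not restored service within a hold-off interval after detection. The function name and timer values are hypothetical. They only illustrate the ordering constraint stated above.

```python
from typing import Optional

def ip_layer_action(detect_ms: float, holdoff_ms: float,
                    optical_restore_ms: Optional[float]) -> str:
    """Decide whether the IP layer must react to a failure.

    detect_ms: time at which the IP layer detects the failure.
    holdoff_ms: how long the IP layer waits for lower layers after detection.
    optical_restore_ms: time at which the optical layer restores service
    (None if it never does).
    """
    deadline = detect_ms + holdoff_ms
    if optical_restore_ms is not None and optical_restore_ms <= deadline:
        return "no action: lower layer restored service first"
    return "IP restoration: flood update and recompute routes"

print(ip_layer_action(10, 100, 50))    # no action: lower layer restored service first
print(ip_layer_action(10, 100, None))  # IP restoration: flood update and recompute routes
```

If the hold-off is shorter than the optical restoration time, both layers end up reacting to the same failure. That is the competition between schemes that the slide warns against.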


Part III: The Sprint Experience


Part III - Outline

IP-based Survivability
Performance


Part III - Outline

IP-based Survivability
– Design Requirements
– Components of design solutions

Performance


Requirements for Survivability

Availability of backup paths
– must have enough spare capacity
– must satisfy SLA requirements
  • low loss, end-to-end latency bounds

Localized failure recovery
– re-routing should be close to the point of failure

Prevent partitions
– multiple node/link-disjoint paths
– physical path diversity


Design Solution: Capacity Provisioning

No admission control
Redundant capacity for restoration paths
– rule of thumb: average link utilization < 50%

Must plan for widespread outages
– Baltimore tunnel fire, meltdowns in other ISPs, etc.

Added benefit
– SLAs satisfied: negligible loss, no queuing
– no need for QoS or service classes in the core
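The 50% rule of thumb follows from a simple failover argument. If the traffic of a failed link moves onto one surviving link of the same capacity, the survivor carries both loads, and that fits only if their combined utilization stays at or below 100%. Keeping both links below 50% guarantees this. The sketch checks the condition for hypothetical pairs of parallel links. It ignores how rerouted traffic may spread over several paths, so it illustrates the rule rather than serving as a capacity-planning tool.

```python
from typing import List, Tuple

def survives_single_failure(pairs: List[Tuple[float, float]]) -> List[bool]:
    """For each pair of equal-capacity parallel links with utilizations (u1, u2),
    report whether either link can absorb the other's traffic after a failure."""
    return [u1 + u2 <= 1.0 for u1, u2 in pairs]

# Hypothetical utilizations (fraction of link capacity) for three link pairs.
print(survives_single_failure([(0.45, 0.40), (0.30, 0.65), (0.60, 0.55)]))
# [True, True, False]
```

The second pair survives even though one link is above 50%, but with no headroom to spare. The rule of thumb is a conservative, per-link version of the same check.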


Design Solution: Network Topology

A fully meshed inter-PoP topology is not feasible
– each PoP is connected to a subset of other PoPs
  • between 2 and 10
– reduces the probability of network partitioning

Parallel links between adjacent PoPs
– terminate on different routers and run on different fibers
– Added benefit: load balancing
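One way to check whether a single link failure can partition a topology is to remove each link in turn and test whether the network is still connected. The sketch below does this by brute force on a hypothetical set of PoPs. Parallel links are listed separately, so a PoP pair joined by two links survives the loss of either one. This holds only if the two links really are independent, which is why they run on different fibers.

```python
from typing import List, Set, Tuple

def connected(nodes: Set[str], links: List[Tuple[str, str]]) -> bool:
    """True if every node is reachable from an arbitrary start node."""
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for a, b in links:
            for x, y in ((a, b), (b, a)):
                if x == u and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return seen == nodes

def partitioning_links(links: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Links whose individual failure would partition the network."""
    nodes = {n for link in links for n in link}
    return [links[i] for i in range(len(links))
            if not connected(nodes, links[:i] + links[i + 1:])]

# Hypothetical PoPs: a ring of five, plus KC attached to CHI by a single link.
links = [("NYC", "CHI"), ("CHI", "SEA"), ("SEA", "SJC"),
         ("SJC", "DAL"), ("DAL", "NYC"), ("CHI", "KC")]
print(partitioning_links(links))                     # [('CHI', 'KC')]
print(partitioning_links(links + [("CHI", "KC")]))   # []
```

The single-homed PoP (KC here) violates the "between 2 and 10" guideline above. Adding a parallel link removes that single point of partition.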


The Sprint U.S. Topology


Point-of-Presence (PoP) Design


Design Solution: IP-Based Restoration

No protection at the optical layer
– huge capital investment
– less flexible: provisioning cycle of 12-18 months
– cannot fix IP layer problems

Relies solely on the IS-IS protocol to re-route traffic around failures


IS-IS routing practices

Primary focus: inter-PoP paths
– Selection criterion: end-to-end latency
– primary and backup paths should traverse the same set of PoPs

Link weights
– inter-PoP links: ~10-63
– intra-PoP links: ~1-4

Updating weights
– set by hand, modified only for large-scale failures


Part III - Outline

IP-based Survivability
Performance
– Analysis of failure patterns
– Recovery Speed with IS-IS


Performance of IP-based Survivability

Question 1: Is it sufficient to handle the type, frequency, and scale of failures in the Sprint network?

Analyze and characterize failures


“Listening” to failures

All failures are visible at the IP layer
– Lower layers are not masking out events

IS-IS LSPs (link state PDUs) are flooded throughout the entire network
– A machine that sets up an adjacency with a router is enough to observe and record all the failure events
– Python Routing Toolkit (http://ipmon.sprint.com/pyrt)

Three locations: East and West Coast
– To monitor loss of LSPs
– To measure propagation delay of LSPs


Definition of failure event

“Failure”: any event that causes a topology change

[Figure: timeline of a failure of the link between routers A and B]
– start: A: Link to B is down
– B: Link to A is down
– A: Link to B is up
– end: B: Link to A is up
– Time-to-Repair (or “duration”): the interval from start to end
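Following this definition, failure events and their durations can be extracted from a stream of adjacency reports. A failure starts when the first end of a link reports it down, and it ends when the last end that reported it down reports it up again. The sketch assumes a hypothetical, already-parsed record format of (timestamp in seconds, reporting router, neighbor, state). It is not the format produced by the Python Routing Toolkit.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Report = Tuple[float, str, str, str]  # (time_s, router, neighbor, "up" or "down")

def failure_events(reports: List[Report]) -> List[Tuple[str, float, float]]:
    """Return (link, start_time, time_to_repair) for each completed failure.

    A failure starts when the first end of a link reports it down and ends
    when the last end that reported it down reports it up again. Failures
    still in progress at the end of the trace are not reported.
    """
    down: Dict[FrozenSet[str], Set[str]] = {}
    started: Dict[FrozenSet[str], float] = {}
    events = []
    for t, router, nbr, state in sorted(reports):
        link = frozenset((router, nbr))
        ends = down.setdefault(link, set())
        if state == "down":
            if not ends:
                started[link] = t  # first 'down' report: failure starts
            ends.add(router)
        elif router in ends:
            ends.discard(router)
            if not ends:  # last end back up: failure ends
                t0 = started.pop(link)
                events.append(("-".join(sorted(link)), t0, t - t0))
    return events

reports = [
    (100.0, "A", "B", "down"),
    (101.0, "B", "A", "down"),
    (160.0, "A", "B", "up"),
    (161.0, "B", "A", "up"),
]
print(failure_events(reports))  # [('A-B', 100.0, 61.0)]
```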


Failures are part of everyday operations

[Figure: failure events, in panels labeled Weekly, Daily, and Hourly]


Time between Failures (network-wide)

[Figure: distribution of the time between failures: 43% < 1 min, 81% < 20 min]
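The percentages above are points on an empirical distribution of inter-failure times. A minimal sketch of computing such points from a list of failure start times; the function names and the sample times are hypothetical.

```python
def time_between_failures(start_times):
    """Gaps between consecutive failure start times, network-wide."""
    ts = sorted(start_times)
    return [b - a for a, b in zip(ts, ts[1:])]

def fraction_below(gaps, threshold):
    """Empirical CDF value: share of gaps strictly shorter than threshold."""
    if not gaps:
        return 0.0
    return sum(g < threshold for g in gaps) / len(gaps)

starts = [0, 30, 90, 1290, 1320]          # hypothetical start times (s)
gaps = time_between_failures(starts)      # [30, 60, 1200, 30]
print(fraction_below(gaps, 60), fraction_below(gaps, 20 * 60))  # 0.5 0.75
```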


Sources of failures

Duration can provide hints, e.g.,
– long (> 1 hour): fiber cuts, severe failures
– medium (> 10 min): router/line card failures
– short (> 1 min): line card resets
– very short (< 1 min): software problems, optical equipment glitches

Other hints
– shared equipment (routers, optical)
– router logs (e.g., SONET alarms), etc.
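The duration hints can be written as a simple lookup. A sketch only: the thresholds come from the slide, treating exactly 60 s as "short" is an assumption, and in practice the other hints (shared equipment, router logs) are needed to confirm a cause.

```python
def duration_hint(duration_s):
    """Map a failure duration in seconds to a class and its likely causes.

    Thresholds follow the slide; exactly 60 s counts as 'short' (assumption).
    """
    if duration_s > 3600:
        return "long", "fiber cuts, severe failures"
    if duration_s > 600:
        return "medium", "router/line card failures"
    if duration_s >= 60:
        return "short", "line card resets"
    return "very short", "software problems, optical equipment glitches"

for d in (12, 240, 1800, 7200):
    print(d, duration_hint(d)[0])
```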


Network-wide Failure Duration

[Figure: cumulative fraction of failures vs. failure duration: 40% in 1–60 s, 40% in 1–15 min, 10% in 15–60 min, 10% > 1 h]


Classification of Failures

Remove maintenance
– maintenance windows (9 hrs/week) account for 20% of all failure events

Classify the remaining “unplanned” failures

[Figure: unplanned failures: Shared Link Failures 30.8%, Individual Link Failures 69.2%]
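Removing maintenance amounts to setting aside failures that start inside scheduled windows. A minimal sketch, assuming the windows are known as (start, end) pairs in hours since the start of the week; the function name and the schedule below are hypothetical, chosen only so that the windows add up to 9 hours.

```python
def split_by_maintenance(failure_starts, windows):
    """Split failure start times into (maintenance, unplanned) lists.

    failure_starts: failure start times, in hours since the start of the week
    windows: (start, end) maintenance windows, in the same unit
    """
    maintenance, unplanned = [], []
    for t in failure_starts:
        if any(lo <= t < hi for lo, hi in windows):
            maintenance.append(t)
        else:
            unplanned.append(t)
    return maintenance, unplanned

windows = [(1, 4), (25, 28), (49, 52)]    # hypothetical: 3 x 3 h = 9 h/week
planned, unplanned = split_by_maintenance([2, 3, 10, 26, 60], windows)
print(planned, unplanned)  # [2, 3, 26] [10, 60]
```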


Shared Link Failures: Simultaneous

2+ links going down at the exact same time
– example: failure of a line card with multiple interfaces

All such events turn out to involve a common router
– Router Related class (16.5%)

[Figure: Router 0 connects to Routers 1–4 through one line card; the IS-IS logs show an LSP from Router 0 reporting links 0–1, 0–2, 0–3 and 0–4 down, and an LSP from Router 0 reporting 3 links up (times t1, t2)]
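Detecting simultaneous failures reduces to grouping link-down reports by timestamp and checking for a shared router. A minimal sketch, assuming down events are available as (time, (router, router)) pairs and that "the exact same time" means identical timestamps; `simultaneous_groups` and the sample routers are hypothetical.

```python
from collections import defaultdict

def simultaneous_groups(down_events):
    """Group link-down events that share the exact same timestamp.

    down_events: iterable of (time, (router_x, router_y)) tuples.
    Returns (time, links, common_routers) for every group of 2+ links,
    where common_routers are the routers present on every link.
    """
    by_time = defaultdict(list)
    for t, link in down_events:
        by_time[t].append(link)
    groups = []
    for t in sorted(by_time):
        links = by_time[t]
        if len(links) >= 2:
            common = set(links[0]).intersection(*links[1:])
            groups.append((t, links, common))
    return groups

events = [(10.0, ("R0", "R1")), (10.0, ("R0", "R2")),
          (10.0, ("R0", "R3")), (55.0, ("R5", "R6"))]
for t, links, common in simultaneous_groups(events):
    print(t, len(links), common)  # 10.0 3 {'R0'}
```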


Shared Link Failures: Overlapping

Some links fail “almost simultaneously”

Possible causes
– a shared component fails, but the news reaches the listener with some delay, due to various timers
– it could be an optical component
– ...or a router component

[Figure: failures of Links 1, 2 and 3 overlapping in time (windows W1, W2); the three links share an optical part]
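Overlapping failures need a tolerance instead of exact timestamps. A minimal sketch, assuming W1 bounds how far apart the failure starts may be and W2 bounds how far apart the ends may be (one reading of the figure); the greedy clustering, the function name and the window values are hypothetical.

```python
def overlapping_groups(failures, w1, w2):
    """Cluster failures that start within w1 and end within w2 seconds
    of the first failure in the cluster (greedy, by start time).

    failures: list of (link, start, end). Returns clusters of 2+ failures.
    """
    clusters, current = [], []
    for f in sorted(failures, key=lambda f: f[1]):
        first = current[0] if current else None
        if first and f[1] - first[1] <= w1 and abs(f[2] - first[2]) <= w2:
            current.append(f)
        else:
            if len(current) >= 2:
                clusters.append(current)
            current = [f]
    if len(current) >= 2:
        clusters.append(current)
    return clusters

failures = [("L1", 0.0, 100.0), ("L2", 1.5, 101.0),
            ("L3", 2.0, 99.5), ("L4", 500.0, 600.0)]
for cluster in overlapping_groups(failures, w1=3.0, w2=2.0):
    print([f[0] for f in cluster])  # ['L1', 'L2', 'L3']
```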


Classification of Failures (updated)

Unplanned
– Shared Link Failures: 30.8%
• Router Related (simultaneous): 16.5%
• Optical Related (overlapping): 11.4%
• Unspecified: 2.9%
– Individual Link Failures: 69.2%


Individual Link Failures


High vs. Low Failure Links

[Figure: normalized number of failures per link]

High degree of heterogeneity
– a few links (2.5%) account for half of the independent failures
– roughly two power laws, with exponents -0.73 and -1.35
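A statement like "2.5% of links account for half of the failures" comes from ranking links by failure count and accumulating until the target share is reached. A minimal sketch; the function name and the counts are hypothetical.

```python
def top_links_for_share(failures_per_link, share=0.5):
    """Smallest set of most-failing links covering `share` of all failures.

    Returns (links, fraction_of_all_links).
    """
    if not failures_per_link:
        return [], 0.0
    ranked = sorted(failures_per_link, key=failures_per_link.get, reverse=True)
    total = sum(failures_per_link.values())
    chosen, covered = [], 0
    for link in ranked:
        if covered >= share * total:
            break
        chosen.append(link)
        covered += failures_per_link[link]
    return chosen, len(chosen) / len(failures_per_link)

counts = {"a": 50, "b": 30, "c": 10, "d": 5, "e": 5}   # hypothetical counts
print(top_links_for_share(counts))  # (['a'], 0.2)
```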


High Failure Links


Low failure links


Classification of Failures (updated)

Unplanned
– Shared Link Failures: 30.8%
• Router Related (simultaneous): 16.5%
• Optical Related (overlapping): 11.4%
• Unspecified: 2.9%
– Individual Link Failures: 69.2%
• High Failure: 38.5%
• Low Failure: 30.7%


Summary: Lessons learnt

Failures are part of everyday operations

Network topology is very dynamic
– links are reported down every 30 minutes on average
– in 80% of the cases, links come back up in < 10 min
– implications for QoS, accounting, TE

Importance of understanding the sources of failures
– Are routers really unreliable?
– What are the root causes of high failure links?
– Are there regions of the network more prone to failure?
• implement optical protection in those regions


Performance of IP-based Survivability

Question 1: Is it sufficient to handle the type, frequency, and scale of failures in the Sprint network?

Analyze and characterize failures

Question 2: Is it fast enough to provide a highly available service across the backbone?

Evaluate and fine-tune IS-IS


Restoration Steps in IS-IS

1. Failure detection
2. Failure notification
3. Forwarding path re-computation


Failure detection

For optical layer problems on point-to-point links
– the SONET layer detects the failure in 10–20 msec

Keep-alive messages are used in all other cases
– software/hardware failures on a router
– switched networks (e.g., Ethernet)
– detection on the order of seconds (up to 60 s)
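Keep-alive detection time is driven by the hold timer. A minimal sketch, assuming hellos arrive periodically and each one re-arms a hold timer equal to hello interval × multiplier; `detection_time` is a hypothetical helper, and the 10 s interval with a multiplier of 3 are example values, not Sprint's settings.

```python
import math

def detection_time(failure_time, hello_interval, multiplier, first_hello=0.0):
    """When a keep-alive based detector declares the neighbor down.

    Hellos are assumed to arrive every `hello_interval` seconds from
    `first_hello` on; each one re-arms a hold timer of
    hello_interval * multiplier seconds. The neighbor is declared down
    when the timer armed by the last hello before the failure expires.
    """
    hold_time = hello_interval * multiplier
    n = math.floor((failure_time - first_hello) / hello_interval)
    last_hello = first_hello + n * hello_interval
    return last_hello + hold_time

# Example timers: 10 s hellos, multiplier 3 (30 s hold time).
down_at = detection_time(25.0, 10.0, 3)
print(down_at, down_at - 25.0)  # 50.0 25.0
```

Under these assumptions the detection delay always falls between the hold time minus one hello interval and the full hold time, which is why keep-alive detection is measured in seconds while SONET detection is measured in milliseconds.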


Failure notification

For optical layer problems
– SONET alarm on adjacent routers
– a hold-off timer delays notification to the IS-IS process to mask out transients

Notifying other routers
– LSPs are flooded to the entire network
– regulated by a “generation” timer
– at each hop, LSP flooding is rate-limited and incurs processing delay
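Flooding delay grows with the number of hops from the originating router. A minimal sketch assuming a uniform per-hop delay, so that a breadth-first traversal gives each router's earliest arrival time; `flooding_arrival`, the four-router chain and the 50 ms / 10 ms inputs are hypothetical.

```python
from collections import deque

def flooding_arrival(adj, origin, generation_ms, per_hop_ms):
    """Earliest time (ms after the failure is notified) at which each router
    receives the LSP generated by `origin`, assuming a uniform per-hop delay.

    adj maps each router to the list of its IS-IS neighbors.
    """
    arrival = {origin: generation_ms}   # LSP leaves origin after the generation timer
    queue = deque([origin])
    while queue:                        # breadth-first order = fewest hops first
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in arrival:
                arrival[nbr] = arrival[node] + per_hop_ms
                queue.append(nbr)
    return arrival

adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(flooding_arrival(adj, "A", generation_ms=50, per_hop_ms=10))
# {'A': 50, 'B': 60, 'C': 70, 'D': 80}
```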


Forwarding path re-computation

Routing tree computation
– CPU intensive, depends on the number of nodes
– a hold-off timer aggregates multiple LSPs

The forwarding information update has to propagate to the interface cards
– only in distributed architectures (e.g., Cisco GSR)
– depends on the number of BGP prefixes
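The SPF hold-off timer trades a little delay for fewer SPF runs. A minimal sketch of a fixed hold-off timer: the first LSP after an idle period arms the timer, and every LSP arriving before it expires is folded into the same run. The function name and timings are hypothetical, and other timer schemes used by real routers are not modeled.

```python
def spf_runs(lsp_arrivals, hold_off):
    """Batch LSP arrival times into SPF runs with a fixed hold-off timer.

    Returns a list of (spf_time, number_of_lsps_in_the_run).
    """
    runs = []
    for t in sorted(lsp_arrivals):
        if runs and t <= runs[-1][0]:            # timer still pending: join the run
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:                                    # idle: arm a new timer
            runs.append((t + hold_off, 1))
    return runs

print(spf_runs([0, 20, 90, 300], hold_off=100))  # [(100, 3), (400, 1)]
```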


Generic GSR architecture

[Figure: a route processor holding the routing table (RIB) and line cards, each with a MAC, packet memory and forwarding table (FIB), connected by a switched backplane; other labels: line interface, CPU, memory]


Restoration Steps

1. Delete IS-IS adjacency
1a. Failure detection
1b. IS-IS notification
2. Generate LSP, flood it, and forward it at each hop
3. SPF & update RIB
4. Update FIBs on line cards

Protocol convergence = 1 + 2 + 3
Service convergence = Protocol convergence + 4


Seven Steps for Restoration

1. Detect link down: 10–20 ms
2. Wait to filter out transient flaps: 20 ms (2 s)
3. Wait before sending the update out: 50 ms
4. LSP flooding: ~10 ms/hop
5. Hold time before SPF: 100 ms (5.5 s)
6. Compute shortest paths: 100–400 ms
7. Update the routing tables: ~20 prefixes/ms

Expected service disruption: 400 ms–1.2 s (+7 s)
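The seven steps add up to the expected disruption. A minimal sketch that sums them, using the per-step values from the slide as defaults; `restoration_budget`, the 5-hop path and the 2000 prefixes are hypothetical, and passing in the parenthesized timer values shows that they account for roughly the extra 7 s.

```python
def restoration_budget(hops, prefixes, flap_ms=20, spf_hold_ms=100,
                       detect_ms=(10, 20), lsp_wait_ms=50, per_hop_ms=10,
                       spf_ms=(100, 400), prefixes_per_ms=20):
    """(best, worst) service disruption in ms, summing the seven steps."""
    common = (flap_ms                        # 2. filter transient flaps
              + lsp_wait_ms                  # 3. wait before sending the LSP
              + hops * per_hop_ms            # 4. LSP flooding
              + spf_hold_ms                  # 5. hold time before SPF
              + prefixes / prefixes_per_ms)  # 7. routing table update
    best = detect_ms[0] + spf_ms[0] + common   # 1. detection, 6. SPF
    worst = detect_ms[1] + spf_ms[1] + common
    return best, worst

# Hypothetical path of 5 hops and 2000 prefixes to update.
tuned = restoration_budget(hops=5, prefixes=2000)
slow = restoration_budget(hops=5, prefixes=2000, flap_ms=2000, spf_hold_ms=5500)
print(tuned, slow[0] - tuned[0])  # (430.0, 740.0) 7380.0
```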


Backbone Experiments

[Figure: experiment across four POPs (POP #1 and POP #3 on the West Coast, POP #2 and POP #4 on the East Coast), annotated “after 640 ms”]


Summary of Lessons learnt

Timers dominated the restoration times for historical reasons
– fear of instability in the network
– fear of overloading router CPUs

Sub-second convergence is achievable, but…
– need greater predictability
– need fine tuning, e.g., of the FIB update process


Part IV: Current Trends/Open Issues


Part IV - Outline

IP-based restoration
MPLS Fast Re-Route
Optical layer protection/restoration
Service Availability


Part IV - Outline

IP-based restoration
– Link weight assignment for transient failures
– Failure-Insensitive Routing (FIR)
– IGP protocol modifications
– Router architecture
MPLS Fast Re-Route
Optical layer protection/restoration
Service Availability


Link Weight Assignment (LWA)

Link weights are manually configured by ISP operators looking at the network statically

Operational goals
– low end-to-end delays
– prevent link overload/congestion

Main problem: “transient” failures
– failures repaired in less than 10 min (~80% of the total)
– IS-IS re-routes around the failure
– load balancing across the network is sub-optimal


LWA: looking at transient failures


LWA: optimization problem

[Nucci et al., ITC 2003]

Consider every single-link failure scenario

Find a set of weights that
– minimizes the maximum link utilization
– satisfies end-to-end delay guarantees (from the SLA)
– across all failure scenarios

Tabu search heuristic to explore the solution space
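The core of this optimization is the objective a search heuristic such as tabu search evaluates for each candidate weight vector. A minimal sketch under simplifying assumptions: each demand follows a single shortest path with arbitrary tie-breaking (no ECMP), the SLA delay constraint is omitted, and the three-node topology, capacities and demands are hypothetical. The tabu search itself is not shown.

```python
import heapq

def shortest_path(graph, weights, src, dst):
    """Dijkstra; returns the links of one shortest path, or None if cut off.

    Links absent from `weights` are treated as failed.
    """
    dist, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue                         # stale heap entry
        for v in graph[u]:
            link = frozenset((u, v))
            if link not in weights:
                continue                     # failed link
            nd = d + weights[link]
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path, node = [], dst
    while node != src:
        path.append(frozenset((node, prev[node])))
        node = prev[node]
    return path

def max_utilization(graph, weights, capacity, demands):
    """Highest link utilization when every demand follows one shortest path."""
    load = dict.fromkeys(weights, 0.0)
    for (s, t), volume in demands.items():
        path = shortest_path(graph, weights, s, t)
        if path is None:
            return float("inf")              # demand disconnected
        for link in path:
            load[link] += volume
    return max(load[link] / capacity[link] for link in load)

def worst_case_utilization(graph, weights, capacity, demands):
    """Objective: max utilization over no failure and every single-link failure."""
    worst = max_utilization(graph, weights, capacity, demands)
    for failed in weights:
        remaining = {link: w for link, w in weights.items() if link != failed}
        worst = max(worst, max_utilization(graph, remaining, capacity, demands))
    return worst

ab, ac, bc = frozenset("AB"), frozenset("AC"), frozenset("BC")
graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
weights = {ab: 1, ac: 1, bc: 1}
capacity = {ab: 10, ac: 10, bc: 10}
demands = {("A", "B"): 5, ("A", "C"): 5}
print(worst_case_utilization(graph, weights, capacity, demands))  # 1.0
```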


LWA: Worst case load spreading

Weight setting helps balance the load during the worst-case failure scenario

[Figure: link loads in the no-failure state and in the worst-case failure scenario, with weights set without and with taking transient failures into account]


Failure Insensitive Routing (FIR)

[Nelakuditi et al., IWQoS 2003]

Handle single transient failures with
– no network-wide re-convergence
– fast restoration of packet forwarding

Key idea
– detect a failure when a packet is returned through the outgoing interface
– upon failure detection
• suppress the LSP broadcast
• use a precomputed table to re-route the packet


FIR: Example

[Figure: example topology with nodes A, B, C, D, E, F; link weights 1, 2, 1, 2, 1, 3, 3]

Shortest path from A to F: A->B->E->F


FIR: Example (cont’d)

[Figure: the same A–F example topology]

A detects the failure of one of (B->E) or (E->F)

Key link set for F on interface (A->B) is {(B->E), (E->F)}

Interface-specific forwarding is pre-computed

A “backwarding” table is needed for sending packets backward
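A simplified reading of the key link set: for destination d and the interface towards next hop n, the key links are the links on n's shortest path to d. A minimal sketch of that computation only; the interface-specific tables and backwarding of the actual FIR scheme are not modeled, and because the slide's link weights cannot be recovered, the graph below is hypothetical (chosen so that A->B->E->F is the shortest path).

```python
import heapq

def shortest_path_nodes(graph, src, dst):
    """Dijkstra; graph maps node -> {neighbor: weight}. Returns the node list.

    Raises KeyError if dst is unreachable (kept simple for the sketch).
    """
    dist, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

def key_links(graph, next_hop, dest):
    """Directed links on next_hop's shortest path to dest."""
    path = shortest_path_nodes(graph, next_hop, dest)
    return list(zip(path, path[1:]))

graph = {"A": {"B": 1, "C": 2}, "B": {"A": 1, "E": 1}, "C": {"A": 2, "D": 2},
         "D": {"C": 2, "F": 3}, "E": {"B": 1, "F": 1}, "F": {"E": 1, "D": 3}}
print(shortest_path_nodes(graph, "A", "F"))  # ['A', 'B', 'E', 'F']
print(key_links(graph, "B", "F"))            # [('B', 'E'), ('E', 'F')]
```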


IS-IS/OSPF modifications

Content-based rules for LSP processing
– different priorities for important links and messages (up/down)
– which first: LSP propagation or SPF?

Second shortest paths
– avoid re-computing the paths (saves 100–400 ms)
– not clear how to guarantee loop-free alternatives


Router Architecture

Incremental SPF
– some updates don't change the routing tree
– no need to recompute the shortest path tree for each update

Reliable multicast for line-card updates
– a backbone router may have 16 line cards

Prioritize routing updates
– some prefixes are more important than others, e.g., voice/video gateways
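Incremental SPF relies on the observation that removing a link that is not in the shortest path tree cannot change any shortest path: every tree path survives at the same cost, and no path becomes cheaper. A minimal sketch of that check for link-down events only, assuming a single shortest path tree (no equal-cost multipath); link-up events and weight changes need more work. The parent-map representation and names are hypothetical.

```python
def spt_affected(parent, failed_link):
    """Does a failed link belong to this router's shortest path tree?

    parent maps every node to its parent in the SPT (the root maps to None).
    If the link is not a tree edge, the current tree stays valid and the
    SPF run can be skipped.
    """
    u, v = failed_link
    return parent.get(u) == v or parent.get(v) == u

parent = {"R": None, "A": "R", "B": "R", "C": "A"}   # hypothetical SPT rooted at R
print(spt_affected(parent, ("A", "C")), spt_affected(parent, ("B", "C")))  # True False
```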


Part IV - Outline

IP-based restoration
MPLS Fast Re-Route
Optical layer protection/restoration
Service Availability


MPLS Fast Re-Route

Basic idea: emulate optical link protection

Operators assign additional labels to re-route traffic around the failure of each protected link
– same mechanism that is used for path protection
– labels provide a temporary “patch” after a failure
– gives time to find a “more optimal” solution without interrupting traffic forwarding

One label is required per link to be protected
– more complex network management
– higher risk of configuration errors


MPLS Fast Re-Route: Example

[Figure: example network with nodes A, B, C, D, E, F]

Primary LSP from A to E


MPLS Fast Re-Route: Example

[Figure: example network with nodes A–F]

Fast re-route has to be enabled at the ingress
– A creates a detour around B, B around C, C around D


MPLS Fast Re-Route: Example

[Figure: example network with nodes A–F]

The B to C link fails
– B immediately detours around C
– B signals to A that a failure occurred


MPLS Fast Re-Route: Example

[Figure: example network with nodes A–F]

A calculates and signals a new primary path
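The local repair step in this example can be sketched as a forwarding entry holding both the primary next hop and a pre-established detour, switched locally when the protected link goes down while the ingress is notified. A minimal sketch only: it does not model label swapping, label stacking or the actual signalling protocol, and the class names, label 17 and detour next hop F are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class LabelEntry:
    primary: str   # next hop on the primary LSP
    detour: str    # next hop of the pre-established detour around it

@dataclass
class Lsr:
    name: str
    table: dict                                   # incoming label -> LabelEntry
    down: set = field(default_factory=set)        # neighbors whose link failed
    sent: list = field(default_factory=list)      # notifications to the ingress

    def link_down(self, neighbor, ingress):
        """Local repair: stop using the link and tell the ingress."""
        self.down.add(neighbor)
        self.sent.append((ingress, f"link {self.name}-{neighbor} down"))

    def next_hop(self, label):
        """Use the primary LSP unless its link is down, else the detour."""
        entry = self.table[label]
        return entry.detour if entry.primary in self.down else entry.primary

b = Lsr("B", {17: LabelEntry(primary="C", detour="F")})   # hypothetical label/detour
print(b.next_hop(17))          # C
b.link_down("C", ingress="A")
print(b.next_hop(17), b.sent)  # F [('A', 'link B-C down')]
```

The ingress would then compute and signal a new primary path, after which the detour is no longer needed.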


MPLS Fast Re-Route: Performance

NANOG presentation on the Qwest backbone
– Fast re-route: 10’s–100’s of msec
– Secondary LSP plus stand-by: 100’s of msec–1 sec
– Disruption of traffic is limited to the detection time and the propagation of the update to the FIBs

Question: Is the gain worth the complexity?


Part IV - Outline

IP-based restoration
MPLS Fast Re-Route
Optical layer protection/restoration
– Traffic grooming
– Multi-layered approach

Service Availability


Traffic Grooming

A single lightpath has high capacity
– 10 Gbps, soon to be 40 Gbps
– most customers need less

Fill up a lightpath with low-speed streams
– e.g., send 16 OC-3s over a single lightpath

Protection/restoration
– can be done at the level of the lightpath, or…
– find backup paths for individual streams
• needs sophisticated switches, costs more
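Filling lightpaths with low-speed streams is a bin-packing problem. A minimal sketch using first-fit, a classic bin-packing heuristic chosen here for illustration rather than taken from the tutorial; it ignores protection entirely. Rates are counted in STS-1 units (51.84 Mb/s, so OC-n carries n units), which keeps the arithmetic exact: the slide's 16 OC-3s fill exactly one OC-48 (about 2.5 Gb/s). The function name is hypothetical.

```python
# Rates in STS-1 units (51.84 Mb/s): OC-n carries n units.
OC3, OC48, OC192 = 3, 48, 192

def first_fit_groom(streams, lightpath=OC192):
    """Pack low-speed streams into lightpaths using first-fit.

    Returns one list of stream indices per lightpath.
    """
    free, lightpaths = [], []
    for i, rate in enumerate(streams):
        if rate > lightpath:
            raise ValueError(f"stream {i} exceeds the lightpath capacity")
        for k, room in enumerate(free):
            if rate <= room:             # first lightpath with enough room
                free[k] -= rate
                lightpaths[k].append(i)
                break
        else:                            # none fits: set up a new lightpath
            free.append(lightpath - rate)
            lightpaths.append([i])
    return lightpaths

print(len(first_fit_groom([OC3] * 16, lightpath=OC48)))   # 1
print([len(lp) for lp in first_fit_groom([OC3] * 65)])    # [64, 1]
```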


Traffic Grooming for Survivability

[Mukherjee et al., 2003]

Protect sub-lambda granularity connections

Approaches
– protection at lightpath level (PAL)
– protection at connection level (PAC)
• the capacity of a lightpath may be shared by the backup and primary paths of a connection (MPAC) or not (SPAC)

Better to groom working and backup paths separately

SPAC is best, but grooming ports are costly!


A Joint IP/WDM Technique

[Fumagalli et al., 2000]

Hybrid solution
– only part of the traffic is protected at the WDM layer
– use simulated annealing to optimize cost
• WDM cost = total miles of working and protection wavelengths
• IP cost = mileage of unprotected traffic streams and a “penalty” factor
– the cost function is used to tune the extent of WDM protection


Part IV - Outline

IP-based restoration
MPLS Fast Re-Route
Optical layer protection/restoration
Service Availability


Motivation

Traditional QoS objectives are less challenging in today’s backbones
– loss, delay, jitter, etc.

Need to capture the effect of failures
– known to cause disruptions

An availability-based metric in the SLA can provide a competitive advantage to ISPs


The challenge of defining availability

Telephone networks
– gold standard: five 9’s, in terms of blocked calls
– ANSI-defined outage index

IP networks
– no admission control, connectionless paradigm
– ISPs offer “port availability” as part of the SLA, but this ignores many factors, e.g.,
• Is a physical path established? Is there an IP route? Is the server up and responding? etc.
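For intuition about what "nines" mean in time, a minimal sketch converting an availability of n nines into allowed downtime per year. Note the hedge: the telephony five 9's are counted in blocked calls, so a time-based conversion is only an analogy; the function name is hypothetical and leap years are ignored.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # ignoring leap years

def downtime_minutes_per_year(nines):
    """Unavailable minutes per year allowed by `nines` nines (5 -> 99.999%)."""
    return MINUTES_PER_YEAR * 10 ** -nines

for n in (3, 4, 5):
    print(n, round(downtime_minutes_per_year(n), 2))
```

Five nines leaves only about 5 minutes of downtime per year, which is small compared with the failure durations measured earlier in this tutorial.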


IP service availability for an ISP

[Diot et al., 2003]

Definition: how often is packet forwarding available between two end-points?

Factors
– network topology
– IP-to-fiber mapping
– interdependence of IP-layer elements
– failure characteristics of links/routers
– network convergence times


A strawman definition

Assume
– uncongested network, P parallel paths between an O-D pair
– t_k = mean time between failures for any k paths
– D = constant forwarding outage due to convergence
– O_k = average failure duration affecting any k paths

$$A = 1 - \left[ \frac{D}{K} \sum_{k=1}^{P-1} \frac{k}{t_k} + \frac{O_P}{t_P} \right]$$

– first term: a subset of the paths fails, and forwarding is disrupted due to convergence
– second term: all paths fail simultaneously
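The strawman formula translates directly into code. A minimal sketch; the slide does not define K, so it is simply passed in as a parameter, and the function name and sample values are hypothetical.

```python
def strawman_availability(t, outage_all, D, K):
    """A = 1 - [ D/K * sum_{k=1}^{P-1} k / t_k  +  O_P / t_P ].

    t          : [t_1, ..., t_P], mean time between failures hitting any k paths
    outage_all : O_P, average duration of failures that hit all P paths
    D          : constant forwarding outage due to convergence
    K          : the constant K as written in the formula (not defined on the slide)
    All times must use the same unit.
    """
    P = len(t)
    partial = sum(k / t[k - 1] for k in range(1, P))   # k = 1 .. P-1
    return 1 - (D / K * partial + outage_all / t[P - 1])

# Two paths; hypothetical times in seconds.
print(round(strawman_availability([100.0, 10000.0], outage_all=10.0, D=1.0, K=2.0), 6))
```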


Failures in a congested network

[Farago et al., 2003]

A path can become unavailable due to congestion or failure
– ρ_j = traffic on link j; p_j = reliability of link j
– link availability function α_j(p_j, ρ_j)
• bounded by [0, 1], decreasing function of ρ_j
– a route is available if all links on the route are simultaneously available
– upper and lower bounds derived
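If each link's availability α_j is known, the route availability can at least be bracketed. A minimal sketch: the linear form of α_j is a hypothetical example that merely satisfies the stated properties, and the bounds shown are the generic Fréchet bounds on the probability that all events hold together, not necessarily the bounds derived by Farago et al.; the product under independent links is shown for comparison. `math.prod` needs Python 3.8+.

```python
from math import prod

def link_availability(p, rho):
    """Hypothetical alpha_j: reliability scaled down linearly with load (0 <= rho <= 1)."""
    return p * (1.0 - rho)

def route_availability_bounds(alphas):
    """(lower, independent, upper) for P(all links on the route are available).

    lower/upper are the Frechet bounds, valid without any independence
    assumption; the middle value assumes independent links.
    """
    lower = max(0.0, sum(alphas) - (len(alphas) - 1))
    upper = min(alphas)
    return lower, prod(alphas), upper

alphas = [link_availability(0.999, 0.2), link_availability(0.995, 0.5)]
print(route_availability_bounds(alphas))
```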


Conclusions

A wide range of techniques is available, but…
– need cost-benefit analysis
– need to understand network failure characteristics

Sprint experience
– optical layer mechanisms may be faster, but IP restoration is satisfactory
– need to fine-tune existing mechanisms

The future
– service availability will be an important metric for IP networks
– end-to-end availability is orders of magnitude harder!