routing convergence global routing internet routing convergence an experimental study of delayed...

Routing Convergence

Global Routing

Internet Routing Convergence

• An Experimental Study of Delayed Internet Routing Convergence

• Craig Labovitz, Abha Ahuja, Farnam Jahanian, Abhijit Bose

• ACM SIGCOMM, September 2000

Hierarchical Routing -- Review

scale: with 50 million destinations:

• can’t store all dest’s in routing tables!

• routing table exchange would swamp links!

administrative autonomy

• internet = network of networks

• each network admin may want to control routing in its own network

Untruths about Internet Routing:

• all routers identical

• network “flat”

… not true in practice

Hierarchical Routing

• aggregate routers into regions, “autonomous systems” (AS)

• routers in same AS run same routing protocol
– “intra-AS” routing protocol
– routers in different AS can run different intra-AS routing protocol

gateway routers

• special routers in AS

• run intra-AS routing protocol with all other routers in AS

• also responsible for routing to destinations outside AS
– run inter-AS routing protocol with other gateway routers

Intra-AS and Inter-AS routing

Gateways:

• perform inter-AS routing amongst themselves

• perform intra-AS routing with other routers in their AS

[Figure: inter-AS and intra-AS routing in gateway A.c, drawn at the network layer above the link and physical layers. ASes A, B and C; labelled gateways A.a, A.c, B.a and C.b]

Intra-AS and Inter-AS routing

[Figure: path between Host h1 and Host h2 across ASes A, B and C (gateways A.a, A.c, B.a, C.b): intra-AS routing within AS A, inter-AS routing between A and B, intra-AS routing within AS B]

AS graphs obscure topology!

The AS graph may look like this. Reality may be closer to this…

Tim Griffin, Leiden 2000

Inter-AS routing (cont)

• BGP (Border Gateway Protocol): the de facto standard

• Path Vector protocol: an extension of Distance Vector

• Each Border Gateway broadcasts to neighbors (peers) the entire path (i.e., sequence of ASs) to destination

• For example, Gateway X may store the following path to destination Z:

Path (X,Z) = X,Y1,Y2,Y3,…,Z


• Now, suppose Gwy X sends its path to peer Gwy W

• Gwy W may or may not select the path offered by Gwy X, because of cost, policy ($$$$) or loop prevention reasons.

• If Gwy W selects the path advertised by Gwy X, then:

Path (W,Z) = W, Path (X,Z)

Note: path selection is based not so much on cost (e.g., # of AS hops), but mostly on administrative and policy issues (e.g., do not route packets through competitor’s AS)
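The path-vector step above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `accept_path` is made up for this sketch, and cost and policy checks are deliberately left out, so only the loop-prevention rule and the prepend step are shown.

```python
from typing import List, Optional


def accept_path(local_as: str, advertised: List[str]) -> Optional[List[str]]:
    """Return Path(local_as, Z) if the advertised path is loop-free, else None."""
    # Loop prevention: reject any path that already contains our own AS.
    if local_as in advertised:
        return None
    # Path(W,Z) = W, Path(X,Z): prepend ourselves to the advertised path.
    return [local_as] + advertised


# Path(X,Z) from the slide, with three intermediate ASs for illustration.
path_x_z = ["X", "Y1", "Y2", "Y3", "Z"]

path_w_z = accept_path("W", path_x_z)
print(path_w_z)                     # ['W', 'X', 'Y1', 'Y2', 'Y3', 'Z']
print(accept_path("Y2", path_x_z))  # None, Y2 is already on the path
```

A real gateway would apply its policy filters before this step; the sketch only shows why carrying the full path makes loops easy to detect.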


• Peers exchange BGP messages using TCP.

• OPEN msg opens TCP connection to peer and authenticates sender

• UPDATE msg advertises new path (or withdraws old)

• KEEPALIVE msg keeps connection alive in absence of UPDATES; it also serves as ACK to an OPEN request

• NOTIFICATION msg reports errors in previous msg; also used to close a connection
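As a rough illustration of the four message types, the sketch below packs and parses the fixed BGP message header. The type codes and the 19-byte header layout (16-byte marker, 2-byte length, 1-byte type) are taken from the BGP-4 specification (RFC 4271), not from these slides; the helper names are made up for this example.

```python
import struct
from enum import IntEnum


class BgpType(IntEnum):
    # Type codes as defined in RFC 4271.
    OPEN = 1
    UPDATE = 2
    NOTIFICATION = 3
    KEEPALIVE = 4


# Network byte order: 16-byte marker, 2-byte length, 1-byte type.
HEADER = struct.Struct("!16sHB")


def build_keepalive() -> bytes:
    """A KEEPALIVE consists of the header only, so its length is 19."""
    return HEADER.pack(b"\xff" * 16, HEADER.size, BgpType.KEEPALIVE)


def parse_header(data: bytes):
    """Return (length, type) from the first 19 bytes of a BGP message."""
    _marker, length, msg_type = HEADER.unpack(data[:HEADER.size])
    return length, BgpType(msg_type)


msg = build_keepalive()
length, msg_type = parse_header(msg)
print(length, msg_type.name)  # 19 KEEPALIVE
```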

Why different Intra- and Inter-AS routing ?

• Policy: Inter is concerned with policies (which provider we must select/avoid, etc). Intra is contained in a single organization, so, no policy decisions necessary

• Scale: Inter provides an extra level of routing table size and routing update traffic reduction above the Intra layer

• Performance: Intra is focused on performance metrics; needs to keep costs low. In Inter it is difficult to propagate performance metrics efficiently (latency, privacy etc). Besides, policy related information is more meaningful.

We need BOTH!

What is Routing Policy?

• Description of the routing relationship between autonomous systems
– Who are the peers?
– What routes are
   • Originated by a peer?
   • Imported from each peer?
   • Exported to each peer?
   • Preferred when multiple routes exist?
– What to do if no route exists?
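One way to picture these questions is as a small data structure per peer. The sketch below is hypothetical: the class `PeerPolicy`, the peer names and the preference values are invented for illustration, and export filtering is omitted. It covers import filters, preference among multiple routes, and the "no route" case.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class PeerPolicy:
    import_ok: Set[str] = field(default_factory=set)  # prefixes we accept from this peer
    preference: int = 100                             # higher wins when several routes exist


def choose_peer(prefix: str,
                offers: Dict[str, Set[str]],
                policies: Dict[str, PeerPolicy]) -> Optional[str]:
    """Pick the preferred peer for a prefix, or None if no acceptable route exists."""
    candidates = [peer for peer, prefixes in offers.items()
                  if prefix in prefixes and prefix in policies[peer].import_ok]
    if not candidates:
        return None  # "What to do if no route exists?" is left to the caller
    return max(candidates, key=lambda peer: policies[peer].preference)


policies = {
    "provider_a": PeerPolicy(import_ok={"192.0.2.0/24"}, preference=200),
    "provider_b": PeerPolicy(import_ok={"192.0.2.0/24", "198.51.100.0/24"}, preference=100),
}
offers = {
    "provider_a": {"192.0.2.0/24", "198.51.100.0/24"},
    "provider_b": {"192.0.2.0/24", "198.51.100.0/24"},
}

print(choose_peer("192.0.2.0/24", offers, policies))     # provider_a (higher preference)
print(choose_peer("198.51.100.0/24", offers, policies))  # provider_b (provider_a filtered on import)
print(choose_peer("203.0.113.0/24", offers, policies))   # None
```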

The example I mentioned earlier

Date: Fri, 25 Apr 1997 20:16:47 -0500 (CDT)
Subject: ** ALERT – Massive Routing Failures ***

At about 10:30 AM today, one of Sprints customers (AS7007, Florida Internet Exchange) began announcing a /24 route for every CIDR block in the core routing table. This was due to a configuration problem in that they imported all their routing into a classfull interior routing protocol and then redistributed the route back into BGP, becoming a source for the first class C network in every CIDR block. Sprint does no border routing filters, so they happily accepted these routes and gave them away to all…
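Why a leaked /24 is so damaging follows from longest-prefix matching: a more specific route wins over the legitimate aggregate. The sketch below uses Python's standard `ipaddress` module; the 10.20.0.0/16 block and the next-hop labels are made-up values for illustration, not data from the incident.

```python
import ipaddress


def longest_prefix_match(table, address):
    """Return the next hop of the most specific route covering the address."""
    addr = ipaddress.ip_address(address)
    matches = [(net, hop) for net, hop in table.items() if addr in net]
    if not matches:
        return None
    # The route with the longest prefix length wins.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# A legitimate aggregate (hypothetical block).
aggregate = ipaddress.ip_network("10.20.0.0/16")
table = {aggregate: "legitimate origin"}
before = longest_prefix_match(table, "10.20.0.5")

# The leak: a /24 for the first "class C" network inside the block.
leaked = next(aggregate.subnets(new_prefix=24))
table[leaked] = "leaking AS"
after = longest_prefix_match(table, "10.20.0.5")
unaffected = longest_prefix_match(table, "10.20.1.5")

print(leaked)      # 10.20.0.0/24
print(before)      # legitimate origin
print(after)       # leaking AS
print(unaffected)  # legitimate origin
```

Only addresses inside the leaked /24 are hijacked here; the rest of the aggregate still follows the original route.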

Motivation

• Why should we care about convergence?

• Routing reliability/fault-tolerance on small time scales (minutes) not previously a priority

• Emerging transaction oriented and interactive applications (e.g. Internet Telephony) will require higher levels of end2end network reliability

• How well does the Internet routing infrastructure tolerate faults?

Conventional Routing Wisdom

• The Internet is designed to survive a nuclear cataclysm.

• Internet routing is robust under faults
– Supports path re-routing and restoral on the order of seconds

• The Internet supports fast path rerouting and restoral.

• BGP has good convergence properties
– Does not exhibit looping/bouncing problems of RIP

• Internet fail-over will improve with faster routers and faster links

• More redundant connections (multi-homing) to Internet will always improve site fault-tolerance

Contribution

Labovitz et al. show that most of the conventional wisdom about routing convergence is not accurate…

– Measurement of BGP convergence in the Internet

– Analysis/intuition behind delayed BGP routing convergence

– Modifications to BGP implementations which would improve convergence times

Motivation

• Why have fail-over and fault-tolerance not previously been a priority?
– Applications like email are not delay sensitive and possess fault-tolerance
– TCP/IP fault-tolerance (resend)
– Content replication helps improve reliability for static content

• Network support is required for emerging transaction oriented and interactive applications (e.g. Internet Telephony, QoS)

Building a Reliable Internet

• What Network support has been proposed already?

• Significant recent improvement on data-link fail-over (e.g. SRP, SONET). Solves some enterprise, intra-domain reliability problems

• Also significant research on QoS and resource reservation protocols for the Internet
– But, all of these protocols assume stable underlying IP forwarding path

Background

• Internet sites multi-home, or purchase connectivity from multiple Internet providers to improve fault tolerance
– Goal: tolerate a single link, router or ISP failure
– 35% of Internet end-sites currently multi-homed

Background: Multi-homing

[Figure: multi-homing. An Enterprise site holds BGP sessions with two providers, Sprint and Verio, which connect it to the Internet]

PSTN versus Internet

• Public Switched Telephone Network (PSTN) is the “other” network in place.

• Trade-off between
– scalability/extensibility/low cost and
– fault-tolerance/service guarantees/high cost

• PSTN retains significant intermediate state (i.e. circuit setup) and services on relatively few nodes. A “Smart Network”

• Internet places all intelligence on end-nodes. A “Stupid Network”

Trade-Offs

[Figure: trade-off chart placing PSTN, Internet, Enterprise with IntServ and Enterprise with DiffServ on a High-to-Low scale. Dimensions: Scalability, Flexibility, Distributed Operation, State, Reliability, Service Guarantees, Development Time, Switch Cost, Coordination]

http://www.supremevideo.com/callback.htm

Routing

• Unlike circuit-switched PSTN, packet-switched Internet uses hop-by-hop forwarding and next-hop selection

• Global state and circuit-setup used in PSTN – this is like owning an atlas and planning route

• Internet routers only keep local knowledge and routes learned from neighbors
– like asking directions at each stop

Internet Routing

• Inter-domain Internet routing protocols are distance vector (i.e. Bellman-Ford) algorithms. Unlike PSTN, no pre-computed backup paths!

• Distance vector protocols are problematic
– Require time to converge
– Suffer from “counting to infinity”

Problems with Distance Vector Protocols

Counting to Infinity

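[Figure: counting to infinity. Nodes A and B each keep a Node/Distance table (A: B 2, R 1; B: A 2, R 3). Once R becomes unreachable, the advertised distance to R keeps climbing: R=3, R=5, R=7, …]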
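The counting-to-infinity behaviour can be reproduced with a tiny simulation. This is a sketch under assumptions read off the figure labels: the A-B link costs 2, B last knew R at distance 3 (via A), and A has just lost its direct link to R. The value 16 for infinity is RIP's convention and is not stated on the slide; split horizon is deliberately not applied.

```python
INFINITY = 16  # RIP's "infinity" (assumption, not given on the slide)
LINK_AB = 2    # cost of the A-B link, as read from the figure


def count_to_infinity(dist_b_to_r: int = 3):
    """Return the sequence of (node, distance-to-R) updates after A loses R."""
    dist = {"A": INFINITY, "B": dist_b_to_r}  # A's direct route just failed
    history = []
    turn, other = "A", "B"
    while True:
        # Each node naively trusts its neighbor's stale distance.
        dist[turn] = min(dist[other] + LINK_AB, INFINITY)
        history.append((turn, dist[turn]))
        if dist["A"] >= INFINITY and dist["B"] >= INFINITY:
            break
        turn, other = other, turn
    return history


history = count_to_infinity()
print(history)
# [('A', 5), ('B', 7), ('A', 9), ('B', 11), ('A', 13), ('B', 15), ('A', 16), ('B', 16)]
```

The first two updates (5 and 7) match the R=5 and R=7 labels in the figure; the nodes only agree that R is unreachable once both reach the infinity value.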

Internet Routing

• The Internet inter-domain routing protocol, BGP, “solves” count-to-infinity problem by keeping record of path the route announcement has traveled through network

• Internet routing commonly (and incorrectly) believed to converge within 30 seconds

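BGP Routing

[Figure: the announcement for route R propagates over BGP through ISP 1, ISP 2 and ISP 3, and the recorded path grows at each hop: “AS1 R”, “AS2 AS1 R”, “AS3 AS2 AS1 R”]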

Open Question

After a fault in a path to multi-homed site, how long does it take for the majority of Internet routers to fail-over to the secondary path?

– Routing table convergence (backbone routers reach steady-state) after a fault

– End-to-end paths stable (“normal” levels of loss and latency)

[Figure: a Customer with BGP sessions to a Primary ISP and a Backup ISP; TRAFFIC must move to the backup path after a fault]

Internet Fail-Over Experiments

• Instrument the Internet
– Inject routes into geographically and topologically diverse provider BGP peering sessions (Mae-West, Japan, Michigan, London)
– Periodically fail and change these routes (i.e. send withdraws or new attributes)
– Monitor impact of faults through 1) recordings of BGP peering sessions with 20 tier1/tier2 ISPs and 2) active ICMP ECHO measurements (512 byte/second to 100 random web sites)
– Write lots of Perl scripts
– Wait two years… (125,000 routing events)

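Experiment (For the Last Two Years)

[Figure: measurement setup. A Fault Injection Server announces routes over BGP through Stub ASes to Upstream ISP1 and Upstream ISP2; the changes propagate across the Internet (ISP3, ISP4, ISP5, ISP6) and are recorded over BGP by RouteViews Data Collection, while a Probe sends ICMP Echos]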

Fault Scenarios

• Tup -- A new route is advertised

• Tdown -- A route is withdrawn (i.e. single-homed failure)

• Tshort -- Advertise a shorter/better ASPath (i.e. primary path repaired)

• Tlong -- Advertise a longer/worse ASPath (i.e. primary path fails)

Major Convergence Results

• Routing convergence requires an order of magnitude longer than expected (10s of minutes)

• Routes converge more quickly following Tup/Repair than Tdown/Failure events (“bad news travels more slowly”)

• Curiously, withdrawals (Tdown) trigger several times as many announcements as new-route announcements (Tup) do

Example of BGP Convergence

TIME      BGP Message/Event
10:40:30  Route Fails/Withdrawn by AS2129
10:41:08  2117 announce 5696 2129
10:41:32  2117 announce 1 5696 2129
10:41:50  2117 announce 2041 3508 3508 4540 7037 1239 5696 2129
10:42:17  2117 announce 1 2041 3508 3508 4540 7037 1239 5696 2129
10:43:05  2117 announce 2041 3508 3508 4540 7037 1239 6113 5696 2129
10:43:35  2117 announce 1 2041 3508 3508 4540 7037 1239 6113 5696 2129
10:43:59  2117 sends withdraw

• BGP log of updates from AS2117 for route via AS2129

• One BGP withdrawal triggers 6 announcements and one withdrawal from 2117

• Increasing ASPath length until final withdraw
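The two observations about this log (total delay and growing ASPath length) can be checked mechanically. The sketch below simply re-enters the log entries from the slide as tuples; the variable names are illustrative.

```python
from datetime import datetime

# (time, kind, ASPath announced by AS2117), copied from the log above.
events = [
    ("10:40:30", "fail", ""),
    ("10:41:08", "announce", "5696 2129"),
    ("10:41:32", "announce", "1 5696 2129"),
    ("10:41:50", "announce", "2041 3508 3508 4540 7037 1239 5696 2129"),
    ("10:42:17", "announce", "1 2041 3508 3508 4540 7037 1239 5696 2129"),
    ("10:43:05", "announce", "2041 3508 3508 4540 7037 1239 6113 5696 2129"),
    ("10:43:35", "announce", "1 2041 3508 3508 4540 7037 1239 6113 5696 2129"),
    ("10:43:59", "withdraw", ""),
]

fmt = "%H:%M:%S"
start = datetime.strptime(events[0][0], fmt)
end = datetime.strptime(events[-1][0], fmt)
convergence_seconds = int((end - start).total_seconds())

# Number of AS hops in each announced path, in order of arrival.
path_lengths = [len(path.split()) for _, kind, path in events if kind == "announce"]

print(convergence_seconds)  # 209
print(path_lengths)         # [2, 3, 8, 9, 9, 10]
```

So AS2117 needed about three and a half minutes and six progressively longer (never shorter) paths before it finally withdrew the route.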

CDF of BGP Routing Table Convergence Times

[Figure: CDF plot. x-axis: Seconds Until Convergence (0 to 160); y-axis: Cumulative Percentage of Events (0 to 100); curves: Tup (New Route), Tshort (Long->Short Fail-over), Tlong (Short->Long Fail-Over), Tdown (Failure)]

• Less than half of Tdown events converge within two minutes

• Tup/Tshort and Tdown/Tlong form equivalence classes

• Long tailed distribution (up to 15 minutes)

Impact of Delayed Convergence

• Why do we care about routing table convergence? It deleteriously impacts end-to-end Internet paths

• ICMP experiment results
– Loss of connectivity, packet loss, latency, and packet re-ordering for an average of 3-5 minutes after a fault
– Why? Routers drop packets for which they do not have a valid next hop. Also problems with cache flushing in some older routers.

End-to-End Impact Failover

• ICMP loss to 100 randomly chosen web sites with VIF source address of our probe

• Tlong/Tshort exhibit similar relationship as before

Delayed Convergence Background

• Well known that distance vector protocols exhibit poor convergence behaviors
– Counting to infinity, looping, bouncing problem

• RIP redefines infinity and adds split-horizon, poison reverse, etc.
– Still, slow convergence and not scalable

• BGP advertises ASPaths instead of distance
– Solves counting to infinity and RIP looping problem, but…
– BGP can still explore “invalid” paths during convergence (i.e. the bouncing problem)

BGP Convergence Example

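[Figure: four-AS example. Route R is reached through AS3 by AS0, AS1 and AS2. One BGP table lists the best route (*) “R via 3” with alternatives “R via 03” and “R via 23”; a later panel shows tables for AS0, AS1 and AS2 holding entries such as “R via 203”, “R via 013” and “R via 103”]

N > 4?

[Figure: larger topology with AS1, AS237, AS701, AS1239, AS1673, AS2497, AS2914, AS5000, AS5696, AS6113, AS6453 and AS6461, annotated with the following ASPaths toward AS237]

237
5696 237
5000 237
2914 237
1 5696 237
1239 5696 237
2497 5696 237
6461 5696 237
1673 5696 237
6113 2914 237
701 6461 5696 237
6453 1239 5696 237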

MinRouteAdver Rounds

• Implementation of MinRouteAdver timer and receiver-side loop detection timer leads to 30 second rounds

• Time complexity: O(n-3)*30 seconds

An Experiment with SSF.OS.BGP4

• The Model
– Topology: full mesh of N ASes, each with just 1 router
– No route filtering
– Shortest path is best

• Advertise, Withdraw, Wait and Watch
– Wait for system to reach stable state, then …
– AS #1 advertises a bogus destination to everyone else
– Wait for system to reach a stable state again, then …
– AS #1 tells everyone that the bogus route is not reachable through it any more
– Wait for system to reach a stable state again

[Figure: full mesh of ASes 1 to 5, with the bogus destination attached to AS 1]

N                                         10             20               30                40                 50
avg # updates due to withdrawal (range)   59.50 (35-84)  269.55 (58-397)  539.10 (118-892)  945.20 (160-1647)  1423.66 (196-2377)
longest path                              9              20               28                40                 46
convergence time after withdrawal (sec)   150            480              720               1080               1260

...
1610.040778415 bgp@38:1 snd update to bgp@2:1 wds=bogus
1610.040778415 bgp@38:1 snd update to bgp@20:1 wds=bogus
1610.040778415 bgp@38:1 snd update to bgp@32:1 wds=bogus
1610.040778415 bgp@38:1 snd update to bgp@44:1 wds=bogus
1610.040890567 bgp@32:1 snd update to bgp@38:1 nlri=bogus,asp=32 44 34 38 4 22 2 20 48 10 26 12 6 16 36 8 14 24 28 41 18 51 21 33 45 43 35 3 5 47 23 31 37 49 25 46 39 7 27 13 9 29 11 15 17 50 19 42 40 30 1
1610.040890567 bgp@32:1 snd update to bgp@44:1 wds=bogus
1610.040907352 bgp@44:1 snd update to bgp@38:1 wds=bogus
1610.040907352 bgp@44:1 snd update to bgp@34:1 nlri=bogus,asp=44 38 34 32 4 22 2 20 48 10 26 12 6 16 36 8 14 24 28 41 18 51 21 33 45 43 35 3 5 47 23 31 37 49 25 46 39 7 27 13 9 29 11 15 17 50 19 42 40 30 1
1610.050930294 bgp@44:1 snd update to bgp@32:1 wds=bogus
...

The Problem with BGP

• If we assume
– Unbounded delay on BGP processing and propagation
– Full mesh of BGP peers
– Constrained shortest path first selection algorithm

• BGP is O(N!), where N is the number of default-free BGP speakers

There exists a possible ordering of messages such that BGP will explore all possible ASPaths of all possible lengths
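To get a feel for the factorial growth, the sketch below counts the loop-free ASPaths one AS could hold toward a fixed destination AS in a full mesh of n ASes. This is an illustrative count of candidate paths under that assumption, not the paper's exact bound; the function names are made up.

```python
from itertools import permutations
from math import factorial


def count_simple_paths(n: int) -> int:
    """Enumerate loop-free paths from one AS to one destination in a full mesh of n ASes."""
    intermediates = range(n - 2)  # every AS except the source and the destination
    return sum(1
               for k in range(n - 1)                   # use 0..n-2 intermediate ASes
               for _ in permutations(intermediates, k))  # in every possible order


def closed_form(n: int) -> int:
    """Same count without enumeration: sum of m!/(m-k)! for m = n-2."""
    m = n - 2
    return sum(factorial(m) // factorial(m - k) for k in range(m + 1))


counts = [count_simple_paths(n) for n in range(4, 8)]
print(counts)  # [5, 16, 65, 326]
```

Each added AS multiplies the number of candidate paths by roughly the mesh size, which is the intuition behind the O(N!) worst case.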

BGP and RIP

• RIP precisely monotonically increasing. Can explore metrics (1…N)

• BGP monotonically increasing. Multiple (N!) ways to represent a path metric of N.

• BGP “solved” RIP routing table loop problem by making it exponentially worse…

2117 5696 2129

2117 1 5696 2129

2117 2041 3508 3508 4540 7037 1239 5696 2129

2117 1 2041 3508 3508 4540 7037 1239 5696 2129

2117 2041 3508 3508 4540 7037 1239 6113 5696 2129

2117 1 2041 3508 3508 4540 7037 1239 6113 5696 2129

BGP Best Case

What is the best we can expect from BGP?

• Implementation of MinRouteAdver timer leads to 30 second rounds

• Time complexity is O(n-3)*30 seconds

• State/Computational complexity O(n)

• At its best, BGP performs as well as RIP2 (but uses exponentially more memory in the process)

MinRouteAdver

• Minimum interval between successive updates sent to a peer for a given prefix
– Allow for greater efficiency/packing of updates
– Rate throttle

• Applied only to announcements (at least according to BGP RFC)

• Applied on (prefix destination, peer) basis, but implemented on (peer) basis

MinRouteAdver

• 30*(N-3) seconds of delay due to the creation of mutual dependencies. The paper provides a proof that N-3 rounds are necessarily created during bounded BGP MinRouteAdver convergence

• Rounds due to
– Ambiguity in the BGP RFC and lack of receiver loop detection
– Inclusion of BGP withdrawals with MinRouteAdver (in violation of RFC)

Simulation Results

         Unbounded Delay           MinRouteAdver             Modified MinRouteAdver
Nodes    Time  States  Messages    Time  States  Messages    Time  States  Messages
4        N/A   12      41          30    11      26          30    11      26
5        N/A   60      306         60    26      54          30    23      54
6        N/A   320     2571        90    50      92          30    39      92
7        N/A   1955    23823       120   85      140         30    59      140
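The MinRouteAdver column of this table lines up with the 30*(N-3) rule stated on the earlier slides. A short check, with the simulated times copied from the table and a made-up function name:

```python
MIN_ROUTE_ADVER = 30  # seconds per round, as stated on the slides


def predicted_convergence(n: int) -> int:
    """Convergence time in seconds for a full mesh of n nodes: (n-3) rounds of 30 s."""
    return MIN_ROUTE_ADVER * (n - 3)


# Time column for the (unmodified) MinRouteAdver case in the table above.
simulated = {4: 30, 5: 60, 6: 90, 7: 120}

for n, seconds in simulated.items():
    print(n, predicted_convergence(n), seconds)
# 4 30 30
# 5 60 60
# 6 90 90
# 7 120 120
```

With the modified MinRouteAdver, the table instead shows a constant 30 seconds for all four sizes, i.e. a single round.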

Intuition for Delayed BGP Convergence

• There exists a possible ordering of messages such that BGP will explore ALL possible ASPaths of ALL possible lengths
– BGP is O(N!), where N is the number of default-free BGP speakers in a complete graph with default policy

• Although seemingly very different protocols, BGP and RIP share very similar convergence behaviors. Major difference:
– RIP explores metrics (1…N)
– BGP ASPath provides multiple ways to represent metric (path) of length N, or (N-1)!

Lower Bound on BGP

• If we assume optimal ordering of messages, what is the best we can expect from BGP?

• In practice, BGP timers (MinRouteAdver) provide synchronization and limit possible orderings of messages
– MinRouteAdver timer specifies interval between successive updates sent to a peer for a given prefix
– Useful for bundling updates together
– According to RFC, MinRouteAdver applies only to announcements

• But, interaction of MinRouteAdver and vendor ASPath loop detection implementation introduces “artificial” delay

Conclusions

• Internet does not possess effective inter-domain fail-over (15 minutes is a long time for a phone call)

• Majority of BGP convergence delay due to vendor implementation decisions of MinRouteAdver and loop detection

• In practice, Internet is not a complete graph and same degree of message re-ordering unlikely. Our current work:
– What is the impact of ISP policy and topology on BGP convergence?
– Can we improve BGP convergence times?
