Routing Convergence
Global Routing
Internet Routing Convergence
• An Experimental Study of Delayed Internet Routing Convergence
• Craig Labovitz, Abha Ahuja, Farnam Jahanian, Abhijit Bose
• ACM SIGCOMM, September 2000
Hierarchical Routing -- Review
scale: with 50 million destinations:
• can’t store all dest’s in routing tables!
• routing table exchange would swamp links!
administrative autonomy
• internet = network of networks
• each network admin may want to control routing in its own network
Untruths about Internet Routing:
• all routers identical
• network “flat”
… not true in practice
Hierarchical Routing
• aggregate routers into regions, “autonomous systems” (AS)
• routers in same AS run same routing protocol
– “intra-AS” routing protocol
– routers in different AS can run different intra-AS routing protocol
gateway routers
• special routers in AS
• run intra-AS routing protocol with all other routers in AS
• also responsible for routing to destinations outside AS
– run inter-AS routing protocol with other gateway routers
Intra-AS and Inter-AS routing
Gateways:
• perform inter-AS routing amongst themselves
• perform intra-AS routing with other routers in their AS
[Figure: inter-AS and intra-AS routing in gateway A.c, shown over the network, link, and physical layers; ASes A, B, and C with gateway routers A.a, A.c, B.a, and C.b]
Intra-AS and Inter-AS routing
[Figure: path from Host h1 to Host h2 across ASes A, B, and C: intra-AS routing within AS A, inter-AS routing between A and B, intra-AS routing within AS B]
AS graphs obscure topology!
The AS graph may look like this. Reality may be closer to this…
Tim Griffin, Leiden 2000
Inter-AS routing (cont)
• BGP (Border Gateway Protocol): the de facto standard
• Path Vector protocol: an extension of Distance Vector
• Each Border Gateway broadcasts to neighbors (peers) the entire path (i.e., sequence of ASs) to destination
• For example, Gateway X may store the following path to destination Z:
Path (X,Z) = X,Y1,Y2,Y3,…,Z
Inter-AS routing (cont)
• Now, suppose Gwy X sends its path to peer Gwy W
• Gwy W may or may not select the path offered by Gwy X, because of cost, policy ($$$$) or loop prevention reasons.
• If Gwy W selects the path advertised by Gwy X, then:
Path (W,Z) = W, Path (X,Z)
Note: path selection is based not so much on cost (e.g., # of AS hops), but mostly on administrative and policy issues (e.g., do not route packets through competitor’s AS)
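The selection step described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `consider_advert`, the `banned` policy set, and the "shorter path wins" tie-break are hypothetical stand-ins, not BGP's actual decision process.

```python
def consider_advert(receiver, advertised_path, current_path=None, banned=frozenset()):
    """Return the path the receiver uses after hearing one advertisement.

    receiver        -- AS name of the receiving gateway (W on the slide)
    advertised_path -- list of ASes from the advertising peer to the destination
    current_path    -- the receiver's current path (starting with itself), or None
    banned          -- ASes the receiver refuses to route through (hypothetical policy)
    """
    # Loop prevention: reject any path that already contains the receiver.
    if receiver in advertised_path:
        return current_path
    # Policy: reject any path that crosses an AS we refuse to use.
    if any(asn in banned for asn in advertised_path):
        return current_path
    # Path(W,Z) = W, Path(X,Z)
    candidate = [receiver] + list(advertised_path)
    # Assumed cost rule for this sketch: prefer fewer AS hops.
    if current_path is None or len(candidate) < len(current_path):
        return candidate
    return current_path


path_xz = ["X", "Y1", "Y2", "Z"]
print(consider_advert("W", path_xz))                   # accepted, W is prepended
print(consider_advert("Y1", path_xz))                  # rejected: Y1 is already on the path
print(consider_advert("W", path_xz, banned={"Y2"}))    # rejected by policy
```

Carrying the whole path is what makes the loop check a simple membership test.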
Inter-AS routing (cont)
• Peers exchange BGP messages using TCP.
• OPEN msg opens TCP connection to peer and authenticates sender
• UPDATE msg advertises new path (or withdraws old)
• KEEPALIVE msg keeps connection alive in absence of UPDATES; it also serves as ACK to an OPEN request
• NOTIFICATION msg reports errors in previous msg; also used to close a connection
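As a rough illustration of how the four message types are told apart on the wire, the sketch below builds and parses the fixed BGP message header. The layout (16-byte marker of all ones, 2-byte length, 1-byte type) and the type codes come from the BGP-4 specification (RFC 4271), not from these slides; the helper names are hypothetical.

```python
import struct

BGP_MARKER = b"\xff" * 16          # 16 bytes, all ones
HEADER_LEN = 19                    # marker (16) + length (2) + type (1)
MSG_TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}


def build_keepalive():
    """A KEEPALIVE is just the 19-byte header with type 4."""
    return BGP_MARKER + struct.pack("!HB", HEADER_LEN, 4)


def parse_header(data):
    """Return (total message length, message type name) from a BGP header."""
    if len(data) < HEADER_LEN or data[:16] != BGP_MARKER:
        raise ValueError("not a BGP message header")
    length, msg_type = struct.unpack("!HB", data[16:HEADER_LEN])
    return length, MSG_TYPES.get(msg_type, "UNKNOWN")


print(parse_header(build_keepalive()))
```

Because the peers already share a TCP connection, the length field is all a receiver needs to split the byte stream back into messages.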
Why different Intra- and Inter-AS routing ?
• Policy: Inter is concerned with policies (which provider we must select/avoid, etc). Intra is contained in a single organization, so, no policy decisions necessary
• Scale: Inter provides an extra level of routing table size and routing update traffic reduction above the Intra layer
• Performance: Intra is focused on performance metrics; needs to keep costs low. In Inter it is difficult to propagate performance metrics efficiently (latency, privacy etc). Besides, policy related information is more meaningful.
We need BOTH!
What is Routing Policy?
• Description of the routing relationship between autonomous systems
– Who are the peers?
– What routes are
  • Originated by a peer?
  • Imported from each peer?
  • Exported to each peer?
  • Preferred when multiple routes exist?
– What to do if no route exists?
The example I mentioned earlier
Date: Fri, 25 Apr 1997 20:16:47 -0500 (CDT) Subject: ** ALERT – Massive Routing Failures ***
At about 10:30 AM today, one of Sprints customers (AS7007, Florida Internet Exchange) began announcing a /24 route for every CIDR block in the core routing table. This was due to a configuration problem in that they imported all their routing into a classfull interior routing protocol and then redistributed the route back into BGP, becoming a source for the first class C network in every CIDR block. Sprint does no border routing filters, so they happily accepted these routes and gave them away to all…
Motivation
• Why should we care about convergence?
• Routing reliability/fault-tolerance on small time scales (minutes) not previously a priority
• Emerging transaction oriented and interactive applications (e.g. Internet Telephony) will require higher levels of end2end network reliability
• How well does the Internet routing infrastructure tolerate faults?
Conventional Routing Wisdom
• The Internet is designed to survive a nuclear cataclysm.
• Internet routing is robust under faults
– Supports path re-routing and restoral on the order of seconds
• The Internet supports fast path rerouting and restoral.
• BGP has good convergence properties
– Does not exhibit looping/bouncing problems of RIP
• Internet fail-over will improve with faster routers and faster links
• More redundant connections (multi-homing) to Internet will always improve site fault-tolerance
Contribution
Labovitz et al. show that most of the conventional wisdom about routing convergence is not accurate…
– Measurement of BGP convergence in the Internet
– Analysis/intuition behind delayed BGP routing convergence
– Modifications to BGP implementations which would improve convergence times
Motivation
• Why has fail-over and fault-tolerance not previously been a priority?
– Applications like email not delay sensitive and possess fault-tolerance
– TCP/IP fault-tolerance (resend)
– Content replication helps improve reliability for static content
• Network support is required for emerging transaction oriented and interactive applications (e.g. Internet Telephony, QoS)
Building a Reliable Internet
• What Network support has been proposed already?
• Significant recent improvement on data-link fail-over (e.g. SRP, Sonet). Solves some enterprise, intra-domain reliability problems
• Also significant research on QoS and resource reservation protocols for the Internet
– But, all of these protocols assume stable underlying IP forwarding path
Background
• Internet sites multi-home, or purchase connectivity from multiple Internet providers, to improve fault tolerance
– Goal: tolerate a single link, router or ISP failure
– 35% of Internet end-sites currently multi-homed
Background: Multi-homing
[Figure: an Enterprise connected to the INTERNET through two providers, Sprint and Verio, with a BGP session on each connection]
PSTN versus Internet
• Public Switched Telephone Network (PSTN) is the “other” network in place.
• Trade-off between
– scalability/extensibility/low cost and
– fault-tolerance/service guarantees/high cost
• PSTN retains significant intermediate state (i.e. circuit setup) and services on relatively few nodes. A “Smart Network”
• Internet places all intelligence on end-nodes. A “Stupid Network”
Trade-Offs
[Figure: trade-off chart with High/Low axes placing PSTN, Internet, Enterprise with IntServ, and Enterprise with DiffServ. Axis labels: Scalability, Flexibility, Distributed Operation, State, Reliability, Service Guarantees, Development Time, Switch Cost, Coordination]
Routing
• Unlike circuit-switched PSTN, packet-switched Internet uses hop-by-hop forwarding and next-hop selection
• Global state and circuit-setup used in PSTN – this is like owning an atlas and planning route
• Internet routers only keep local knowledge and routes learned from neighbors
– like asking directions at each stop
Internet Routing
• Inter-domain Internet routing protocols are distance vector (i.e. Bellman-Ford) algorithms. Unlike PSTN, no pre-computed backup paths!
• Distance vector protocols are problematic
– Require time to converge
– Suffer from “counting to infinity”
Problems with Distance Vector Protocols
Counting to Infinity
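[Figure: nodes A and B joined by a link of cost 2, with destination R one hop (cost 1) beyond B. Distance tables: A lists B at 2 and R at 3; B lists A at 2 and R at 1. After the link to R fails, the advertised distance to R climbs: R=3, R=5, R=7, …]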
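The counting-to-infinity figure can be replayed with a small simulation. This is a minimal sketch, assuming the figure's numbers (A to B costs 2, A believes R is at distance 3) and assuming RIP's convention of treating 16 as "unreachable"; the function name and the strict turn-taking between the two nodes are simplifications for illustration.

```python
INFINITY = 16  # assumed "unreachable" metric, as in RIP


def count_to_infinity(link_cost=2, a_dist_to_r=3, infinity=INFINITY):
    """Replay the exchange after B loses its direct link to R.

    Returns the list of (node, new distance to R) updates, in order.
    """
    dist = {"A": a_dist_to_r, "B": infinity}  # B has just lost its route
    history = []
    turn = "B"                                # B reacts first, using A's stale route
    while True:
        other = "A" if turn == "B" else "B"
        # Believe the neighbor's advertisement and add the link cost.
        dist[turn] = min(dist[other] + link_cost, infinity)
        history.append((turn, dist[turn]))
        if dist["A"] >= infinity and dist["B"] >= infinity:
            return history
        turn = other


for node, d in count_to_infinity():
    print(node, "now thinks R is at distance", d)
```

Each node keeps trusting the other's stale route, so the metric creeps upward two units at a time until both sides hit the cap.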
Internet Routing
• The Internet inter-domain routing protocol, BGP, “solves” count-to-infinity problem by keeping record of path the route announcement has traveled through network
• Internet routing commonly (and incorrectly) believed to converge within 30 seconds
BGP Routing
[Figure: route R announced over BGP through ISP 1, ISP 2, and ISP 3; the recorded path grows at each hop: “AS1 R”, “AS2 AS1 R”, “AS3 AS2 AS1 R”]
Open Question
After a fault in a path to multi-homed site, how long does it take for the majority of Internet routers to fail-over to the secondary path?
– Routing table convergence (backbone routers reach steady-state) after a fault
– End-to-end paths stable (“normal” levels of loss and latency)
[Figure: a Customer connected by BGP to a Primary ISP and a Backup ISP, with TRAFFIC flowing toward the customer]
Internet Fail-Over Experiments
• Instrument the Internet
– Inject routes into geographically and topologically diverse provider BGP peering sessions (Mae-West, Japan, Michigan, London)
– Periodically fail and change these routes (i.e. send withdraws or new attributes)
– Monitor impact of faults through 1) recordings of BGP peering sessions with 20 tier1/tier2 ISPs and 2) active ICMP ECHO measurements (512 byte/second to 100 random web sites)
– Write lots of Perl scripts
– Wait two years… (125,000 routing events)
Experiment (For the Last Two Years)
[Figure: experiment topology. Labels: Fault Injection Server, Stub AS (two), Upstream ISP1, Upstream ISP2, ISP3, ISP4, ISP5, ISP6, Internet, RouteViews Data Collection, Probe, BGP sessions, ICMP Echos]
Fault Scenarios
• Tup -- A new route is advertised
• Tdown -- A route is withdrawn (i.e. single-homed failure)
• Tshort -- Advertise a shorter/better ASPath (i.e. primary path repaired)
• Tlong -- Advertise a longer/worse ASPath (i.e. primary path fails)
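The four scenarios can be told apart by comparing the route before and after an injected event. The sketch below is only an illustration: `classify_event` is a hypothetical helper, and it uses ASPath length as the sole measure of "shorter/better" versus "longer/worse".

```python
def classify_event(old_path, new_path):
    """Label a routing event using the slide's four fault scenarios.

    old_path / new_path are lists of AS numbers, or None when no route exists.
    """
    if old_path is None and new_path is None:
        raise ValueError("no route before or after: nothing happened")
    if old_path is None:
        return "Tup"        # a new route is advertised
    if new_path is None:
        return "Tdown"      # the route is withdrawn
    if len(new_path) < len(old_path):
        return "Tshort"     # shorter ASPath advertised
    if len(new_path) > len(old_path):
        return "Tlong"      # longer ASPath advertised
    return "same-length"    # not one of the four scenarios


print(classify_event(None, [1, 2]))          # Tup
print(classify_event([1, 2], None))          # Tdown
print(classify_event([1, 2, 3], [1, 3]))     # Tshort
print(classify_event([1, 3], [1, 2, 3]))     # Tlong
```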
Major Convergence Results
• Routing convergence requires an order of magnitude longer than expected (10s of minutes)
• Routes converge more quickly following Tup/Repair than Tdown/Failure events (“bad news travels more slowly”)
• Curiously, withdrawals (Tdown) generate several times as many update messages as announcements (Tup)
Example of BGP Convergence
TIME      BGP Message/Event
10:40:30  Route Fails/Withdrawn by AS2129
10:41:08  2117 announce 5696 2129
10:41:32  2117 announce 1 5696 2129
10:41:50  2117 announce 2041 3508 3508 4540 7037 1239 5696 2129
10:42:17  2117 announce 1 2041 3508 3508 4540 7037 1239 5696 2129
10:43:05  2117 announce 2041 3508 3508 4540 7037 1239 6113 5696 2129
10:43:35  2117 announce 1 2041 3508 3508 4540 7037 1239 6113 5696 2129
10:43:59  2117 sends withdraw
• BGP log of updates from AS2117 for route via AS2129
• One BGP withdrawal triggers 6 announcements and one withdrawal from 2117
• Increasing ASPath length until final withdraw
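The log on this slide is small enough to check by hand, but a short script makes the pattern explicit. The sketch below copies the eight log lines from the slide, counts the ASes in each announced path, and measures the time from the failure to the final withdrawal; the parsing helper is hypothetical and assumes exactly this line layout.

```python
from datetime import datetime

LOG = """\
10:40:30 Route Fails/Withdrawn by AS2129
10:41:08 2117 announce 5696 2129
10:41:32 2117 announce 1 5696 2129
10:41:50 2117 announce 2041 3508 3508 4540 7037 1239 5696 2129
10:42:17 2117 announce 1 2041 3508 3508 4540 7037 1239 5696 2129
10:43:05 2117 announce 2041 3508 3508 4540 7037 1239 6113 5696 2129
10:43:35 2117 announce 1 2041 3508 3508 4540 7037 1239 6113 5696 2129
10:43:59 2117 sends withdraw
"""


def parse(log):
    """Return a list of (timestamp, announced ASPath or None) per log line."""
    events = []
    for line in log.strip().splitlines():
        stamp, rest = line.split(" ", 1)
        when = datetime.strptime(stamp, "%H:%M:%S")
        words = rest.split()
        if "announce" in words:
            path = words[words.index("announce") + 1:]
        else:
            path = None  # the failure itself or the final withdrawal
        events.append((when, path))
    return events


events = parse(LOG)
lengths = [len(path) for _, path in events if path is not None]
elapsed = (events[-1][0] - events[0][0]).total_seconds()
print("ASPath lengths:", lengths)
print("seconds from failure to final withdraw:", elapsed)
```

The announced path never gets shorter: AS2117 works through progressively longer alternatives for about three and a half minutes before giving up on the route.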
CDF of BGP Routing Table Convergence Times
[Figure: cumulative percentage of events (0 to 100) against seconds until convergence (0 to 160), with one curve each for Tup (new route), Tshort (long to short fail-over), Tlong (short to long fail-over), and Tdown (failure)]
• Less than half of Tdown events converge within two minutes
• Tup/Tshort and Tdown/Tlong form equivalence classes
• Long tailed distribution (up to 15 minutes)
Impact of Delayed Convergence
• Why do we care about routing table convergence? It deleteriously impacts end-to-end Internet paths
• ICMP experiment results
– Loss of connectivity, packet loss, latency, and packet re-ordering for an average of 3-5 minutes after a fault
– Why? Routers drop packets for which they do not have a valid next hop. Also problems with cache flushing in some older routers.
End-to-End Impact Failover
• ICMP loss to 100 randomly chosen web sites with VIF source address of our probe
• Tlong/Tshort exhibit similar relationship as before
Delayed Convergence Background
• Well known that distance vector protocols exhibit poor convergence behaviors
– Counting to infinity, looping, bouncing problem
• RIP redefines infinity and adds split-horizon, poison reverse, etc.
– Still, slow convergence and not scalable
• BGP advertises ASPaths instead of distance
– Solves counting to infinity and RIP looping problem, but…
– BGP can still explore “invalid” paths during convergence (i.e. the bouncing problem)
BGP Convergence Example
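[Figure: AS0, AS1, AS2, and AS3 with route R. Initial BGP tables (* marks the selected route):
AS0: *B R via 3, B R via 13, B R via 23
AS1: *B R via 3, B R via 03, B R via 23
AS2: *B R via 3, B R via 03, B R via 13
Later tables shown during convergence list: *B R via 203; *B R via 013; B R via 103]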
N > 4?
[Figure: AS graph containing AS1, AS237, AS701, AS1239, AS1673, AS2497, AS2914, AS5000, AS5696, AS6113, AS6453, and AS6461. ASPaths shown toward AS237:]
237
5696 237
2914 237
5000 237
1 5696 237
1239 5696 237
1673 5696 237
2497 5696 237
6461 5696 237
6113 2914 237
701 6461 5696 237
6453 1239 5696 237
MinRouteAdver Rounds
• Implementation of the MinRouteAdver timer and receiver-side loop detection leads to 30 second rounds: time complexity is O(n-3)*30 seconds
An Experiment with SSF.OS.BGP4
• The Model
– Topology: full mesh of N ASes, each with just 1 router
– No route filtering
– Shortest path is best
• Advertise, Withdraw, Wait and Watch
– Wait for system to reach stable state, then …
– AS #1 advertises a bogus destination to everyone else
– Wait for system to reach a stable state again, then …
– AS #1 tells everyone that the bogus route is not reachable through it any more
– Wait for system to reach a stable state again
[Figure: full mesh of ASes 1 to 5 with the “bogus” destination attached]
N    avg # updates due to withdrawal (range)   longest path   convergence time after withdrawal (sec)
10   59.50 (35-84)                             9              150
20   269.55 (58-397)                           20             480
30   539.10 (118-892)                          28             720
40   945.20 (160-1647)                         40             1080
50   1423.66 (196-2377)                        46             1260
...
1610.040778415 bgp@38:1 snd update to bgp@2:1 wds=bogus
1610.040778415 bgp@38:1 snd update to bgp@20:1 wds=bogus
1610.040778415 bgp@38:1 snd update to bgp@32:1 wds=bogus
1610.040778415 bgp@38:1 snd update to bgp@44:1 wds=bogus
1610.040890567 bgp@32:1 snd update to bgp@38:1 nlri=bogus,asp=32 44 34 38 4 22 2 20 48 10 26 12 6 16 36 8 14 24 28 41 18 51 21 33 45 43 35 3 5 47 23 31 37 49 25 46 39 7 27 13 9 29 11 15 17 50 19 42 40 30 1
1610.040890567 bgp@32:1 snd update to bgp@44:1 wds=bogus
1610.040907352 bgp@44:1 snd update to bgp@38:1 wds=bogus
1610.040907352 bgp@44:1 snd update to bgp@34:1 nlri=bogus,asp=44 38 34 32 4 22 2 20 48 10 26 12 6 16 36 8 14 24 28 41 18 51 21 33 45 43 35 3 5 47 23 31 37 49 25 46 39 7 27 13 9 29 11 15 17 50 19 42 40 30 1
1610.050930294 bgp@44:1 snd update to bgp@32:1 wds=bogus
...
The Problem with BGP
• If we assume
– Unbounded delay on BGP processing and propagation
– Full mesh of BGP peers
– Constrained shortest path first selection algorithm
• BGP is O(N!), where N is the number of default-free BGP speakers
There exists a possible ordering of messages such that BGP will explore all possible ASPaths of all possible lengths
BGP and RIP
• RIP precisely monotonically increasing. Can explore metrics (1…N)
• BGP monotonically increasing. Multiple (N!) ways to represent a path metric of N.
• BGP “solved” RIP routing table loop problem by making it exponentially worse…
2117 5696 2129
2117 1 5696 2129
2117 2041 3508 3508 4540 7037 1239 5696 2129
2117 1 2041 3508 3508 4540 7037 1239 5696 2129
2117 2041 3508 3508 4540 7037 1239 6113 5696 2129
2117 1 2041 3508 3508 4540 7037 1239 6113 5696 2129
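To get a feel for why the number of candidate ASPaths grows factorially, the sketch below enumerates every loop-free path from one AS to a destination AS in a full mesh. It is an illustration only: the AS numbering (source 1, destination 0) and the function name are assumptions, and the counts are simply how many distinct paths exist, not the state or message counts measured in the simulations.

```python
from itertools import permutations


def loop_free_paths(n):
    """All loop-free ASPaths from AS 1 to destination AS 0 in a full mesh of n ASes.

    Every ordered selection of the other n-2 ASes is a valid set of
    intermediate hops, because a full mesh has a link between any two ASes.
    """
    others = range(2, n)
    paths = []
    for k in range(len(others) + 1):
        for middle in permutations(others, k):
            paths.append((1,) + middle + (0,))
    return paths


for n in (4, 5, 6, 7):
    print(n, "ASes:", len(loop_free_paths(n)), "loop-free paths")
```

RIP, by contrast, can only step through the metric values 1…N for a destination, which is the difference the slide is pointing at.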
BGP Best Case
What is the best we can expect from BGP?
• Implementation of MinRouteAdver timer leads to 30 second rounds
• Time complexity is O(n-3)*30 seconds
• State/Computational complexity O(n)
• At its best, BGP performs as well as RIP2 (but uses exponentially more memory in the process)
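The best-case figure is easy to tabulate. The sketch below takes the bound at face value as (n-3) rounds of 30 seconds each in a full mesh of n ASes; the function and constant names are hypothetical.

```python
ROUND_SECONDS = 30  # one MinRouteAdver round, per the slide


def best_case_convergence(n):
    """Seconds to converge in the best case for a full mesh of n ASes."""
    if n < 3:
        raise ValueError("the (n-3) bound only makes sense for n >= 3")
    return (n - 3) * ROUND_SECONDS


for n in (4, 5, 6, 7):
    print(n, "ASes:", best_case_convergence(n), "seconds")
```

For n = 4 to 7 this gives 30, 60, 90, and 120 seconds, which coincides with the Time values listed under MinRouteAdver in the Simulation Results slide later in the deck.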
MinRouteAdver
• Minimum interval between successive updates sent to a peer for a given prefix
– Allow for greater efficiency/packing of updates
– Rate throttle
• Applied only to announcements (at least according to BGP RFC)
• Applied on (prefix destination, peer) basis, but implemented on (peer) basis
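A minimal sketch of the timer as this slide describes it: announcements for the same (peer, prefix) pair are held back until 30 seconds have passed since the last one, while withdrawals go out immediately. The class and method names are hypothetical, the 30 second value is the round length quoted elsewhere in the deck, and real implementations differ (as the slide notes, many apply the timer per peer rather than per prefix).

```python
MIN_ROUTE_ADVER = 30.0  # seconds


class MinRouteAdverTimer:
    """Rate-limit announcements per (peer, prefix); never delay withdrawals."""

    def __init__(self, interval=MIN_ROUTE_ADVER):
        self.interval = interval
        self.last_sent = {}  # (peer, prefix) -> time of the last announcement

    def may_send(self, peer, prefix, now, is_withdrawal=False):
        """Return True if the update may be sent at time `now` (in seconds)."""
        if is_withdrawal:
            return True  # the timer applies only to announcements
        key = (peer, prefix)
        last = self.last_sent.get(key)
        if last is not None and now - last < self.interval:
            return False  # too soon: the update has to wait for the next round
        self.last_sent[key] = now
        return True


timer = MinRouteAdverTimer()
print(timer.may_send("peerA", "R", now=0.0))                        # first announcement
print(timer.may_send("peerA", "R", now=10.0))                       # held back
print(timer.may_send("peerA", "R", now=10.0, is_withdrawal=True))   # withdrawals pass
print(timer.may_send("peerA", "R", now=30.0))                       # next round
```

Keying the table on the peer alone instead of (peer, prefix) would reproduce the per-peer behavior the slide attributes to implementations.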
MinRouteAdver
• 30*(N-3) delay due to creation of mutual dependencies. The authors provide a proof that N-3 rounds are necessarily created during bounded BGP MinRouteAdver convergence
• Rounds due to
– Ambiguity in the BGP RFC and lack of receiver loop detection
– Inclusion of BGP withdrawals with MinRouteAdver (in violation of RFC)
Simulation Results
Unbounded Delay
Nodes  Time  States  Messages
4      N/A   12      41
5      N/A   60      306
6      N/A   320     2571
7      N/A   1955    23823

MinRouteAdver
Nodes  Time  States  Messages
4      30    11      26
5      60    26      54
6      90    50      92
7      120   85      140

Modified MinRouteAdver
Nodes  Time  States  Messages
4      30    11      26
5      30    23      54
6      30    39      92
7      30    59      140
Intuition for Delayed BGP Convergence
• There exists a possible ordering of messages such that BGP will explore ALL possible ASPaths of ALL possible lengths
– BGP is O(N!), where N is the number of default-free BGP speakers in a complete graph with default policy
• Although seemingly very different protocols, BGP and RIP share very similar convergence behaviors. Major difference:
– RIP explores metrics (1…N)
– BGP ASPath provides multiple ways to represent metric (path) of length N, or (N-1)!
Lower Bound on BGP
• If we assume optimal ordering of messages, what is the best we can expect from BGP?
• In practice, BGP timers (MinRouteAdver) provide synchronization and limit possible orderings of messages
– MinRouteAdver timer specifies interval between successive updates sent to a peer for a given prefix
– Useful for bundling updates together
– According to RFC, MinRouteAdver applies only to announcements
• But, interaction of MinRouteAdver and vendor ASPath loop detection implementation introduces “artificial” delay
Conclusions
• Internet does not possess effective inter-domain fail-over (15 minutes is a long time for a phone call)
• Majority of BGP convergence delay is due to vendor implementation decisions on MinRouteAdver and loop detection
• In practice, Internet is not a complete graph and the same degree of message re-ordering is unlikely. Our current work:
– What is the impact of ISP policy and topology on BGP convergence?
– Can we improve BGP convergence times?