Traffic-aware Inter-Domain Routing for Improved Internet Routing Stability
Zhenhai Duan
Florida State University
Outline
• Introduction and Background
• Motivation and Intuition
• Traffic-Aware Inter-Domain Routing (TIDR)
• Performance Studies
• Summary
Introduction and Background
• Internet consists of large number of network domains
  – Or Autonomous Systems (ASes)
  – Currently about 26K
  – Exchange network prefix reachability information using BGP
• In a system this big, things happen all the time
  – Fiber cuts, equipment outages, operator errors
• Direct consequence on routing system
  – Large number of BGP updates exchanged between ASes
  – Re-computing/propagating best routes
  – Events may be propagated through entire Internet
• Effects on user-perceived network performance
  – Long network delay, packet loss, even loss of network connectivity
Introduction and Background
• Implicit design assumption in BGP
  – Failure events of same importance to all users
• No explicit mechanisms to localize failure in BGP
• Internet global reachability == global propagation of failure
  – Is this valid?
  – A user (AS) in the US may not be interested in a failure in an Asian country
• Design of BGP failed to recognize two Internet properties
  – Internet access non-uniformity
  – Prevalence of transient failures
Motivation and Intuition
• Internet access non-uniformity
  – ARPANET (1970, Kleinrock and Naylor)
    • Top 12.6% responsible for 90% of traffic
  – NSFNET (1980, Rekhter and Chinoy)
    • Top 10% responsible for 85% of traffic
  – Fang and Peterson (1999), and Rexford (2002)
    • Non-uniform distribution nature of Internet traffic
• Model on network value [IEEE Spectrum 2006]
  – Zipf's law
Internet Access Non-Uniformity
• FSU Study
  – Study if Internet access locality holds from viewpoint of edge network
  – Bidirectional data traffic collected at border router at FSU for 16 days
FSU Data Traffic on other Days
BGP Updates (RouteViews Project)
Most of the updates are for the rest of the prefixes
Only a few updates are related to top prefixes at FSU
Motivation and Intuition
• Prevalence of transient failures
  – Sprint backbone measurement (2002)
    • 50% < 1 minute
    • 80% < 10 minutes
    • 90% < 20 minutes
  – BGP misconfigurations
    • 50% of misconfigurations lasted less than 10 minutes
Majority of network failures are transient
Motivation and Intuition
Internet Access Non-Uniformity
Users (networks) normally communicate with a small set of other network domains
Prevalence of Transient Failure
Majority of the network failures on the Internet are transient
TIDR
Traffic-aware Inter-Domain Routing (TIDR)
• Prefixes classified as either significant or insignificant
  – At AS v, with respect to neighbor n
• Treat propagation of significant and insignificant prefixes differently
  – Propagate BGP updates of significant prefixes with high priority
  – Aggressively slow down propagation of BGP updates of insignificant prefixes
• Localize effect of transient failures on insignificant prefixes
  – Hold propagation of transient failures if a valid alternative route exists
• BGP withdrawals always propagated
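
The classification and differentiated propagation above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the slides only say that prefixes are classified at AS v with respect to neighbor n. The use of byte counts as the significance criterion, the 90% coverage threshold, the function names `classify_prefixes` and `propagation_delay`, and the choice to apply the plain MRAI delay to significant prefixes are all assumptions.

```python
# Hypothetical sketch: names, the 90% coverage threshold, and byte counts
# as the significance criterion are assumptions, not taken from the slides.
MRAI_SECONDS = 30          # MRAI timer value from the simulation settings
TIDR_SECONDS = 10 * 60     # TIDR timer value from the simulation settings


def classify_prefixes(traffic_bytes, coverage=0.9):
    """Return the set of prefixes treated as significant.

    traffic_bytes maps prefix -> bytes exchanged via one neighbor n at AS v.
    The heaviest prefixes are marked significant until they jointly cover
    `coverage` of the total traffic; all others are insignificant.
    """
    total = sum(traffic_bytes.values())
    significant = set()
    covered = 0
    for prefix, volume in sorted(traffic_bytes.items(), key=lambda kv: -kv[1]):
        if total == 0 or covered >= coverage * total:
            break
        significant.add(prefix)
        covered += volume
    return significant


def propagation_delay(prefix, significant, is_withdrawal, has_valid_alternative):
    """Return how long (in seconds) to hold an update before propagating it."""
    if is_withdrawal:
        return 0                  # withdrawals are always propagated
    if prefix in significant:
        return MRAI_SECONDS       # assumed: no extra delay beyond MRAI
    if has_valid_alternative:
        return TIDR_SECONDS       # hold a possibly transient failure locally
    return MRAI_SECONDS           # nothing to fall back on, so do not hold
```

For example, with traffic volumes `{"a": 80, "b": 15, "c": 5}`, prefixes `a` and `b` cover 95% of the traffic and are marked significant, so an update for `c` would be held for the TIDR timer whenever a valid alternative route exists.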
[Figure: prefixes at AS v classified as Significant or Insignificant with respect to neighbor n]
TIDR Timers
[Figure: timers at an AS. MRAI timer: 15/30 sec.; TIDR timer: 10 min.; event label: Recovery]
TIDR Design
• How to avoid traffic black-holes?
  – If the alternative route held by the timer is invalid, the node becomes a black-hole that drops all packets it receives
  – Utilizing Root Cause Information (RCI)
    • Similar to EPIC and RCN
    • Flush out all local invalid alternative routes
    • Alternative route chosen can be guaranteed to be valid
• How to avoid slow propagation of long-term failures of insignificant prefixes?
  – Every node will hold propagation of the BGP update, if not designed carefully
  – Only one node will apply TIDR timer to insignificant prefixes
    • Nodes neighboring the failure
    • First node to have a valid alternative route
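
The two design answers above can be illustrated with a short sketch. It rests on stated assumptions rather than on the slides: the root cause is modeled as a single failed AS-level link, routes are AS paths stored as tuples, and the names `uses_link`, `flush_invalid`, `should_apply_tidr_timer`, and the `already_held` flag are hypothetical. How RCI is actually encoded in BGP updates is not specified here.

```python
# Hypothetical sketch: the data model (AS paths as tuples, root cause as a
# failed link) and all names are assumptions made for illustration.

def uses_link(as_path, failed_link):
    """True if the AS path traverses the failed link in either direction."""
    a, b = failed_link
    hops = list(zip(as_path, as_path[1:]))
    return (a, b) in hops or (b, a) in hops


def flush_invalid(candidate_paths, failed_link):
    """Drop every stored alternative route that the root cause invalidates."""
    return [p for p in candidate_paths if not uses_link(p, failed_link)]


def should_apply_tidr_timer(prefix_is_significant, valid_alternatives, already_held):
    """Decide whether this node holds the update under the TIDR timer.

    Only the first node with a valid alternative route holds the update;
    `already_held` models a marker saying another node has done so already.
    """
    if prefix_is_significant or already_held:
        return False
    return len(valid_alternatives) > 0


# Usage: link (2, 5) fails, so the path through it is flushed.
paths = [(1, 2, 5), (1, 3, 4, 5), (1, 2, 4, 5)]
valid = flush_invalid(paths, (2, 5))
hold = should_apply_tidr_timer(False, valid, already_held=False)
```

Flushing before choosing an alternative is what keeps the held route from turning the node into a black-hole, and the `already_held` check is one possible way to make sure the 10-minute delay is paid at most once along the path.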
TIDR Algorithm
Performance Studies
• Used simBGP simulator
• With both clique and Waxman random network topologies
• Simulated both link fail-down and fail-over events
  – Only the dummy node announces prefixes
    • 20% to be significant, 80% to be insignificant
  – Link failure
    • 20% to be long-term, 80% to be transient
• Settings
  – Link delay: randomly from 0.01 to 0.1 seconds
  – Processing delay: randomly from 0.001 to 0.01 seconds
  – MRAI timer: 30 seconds
  – TIDR timer: 10 minutes
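
As an illustration of the settings above, the following sketch draws one randomized scenario. It does not use simBGP's configuration format or API, which the slides do not show; the function `make_scenario`, its dictionary layout, and the choice of a uniform distribution for the delays are assumptions.

```python
import random

MRAI_SECONDS = 30
TIDR_SECONDS = 10 * 60


def make_scenario(links, prefixes, seed=0):
    """Draw one hypothetical simulation scenario from the listed settings."""
    rng = random.Random(seed)          # seeded for repeatable experiments
    shuffled = list(prefixes)
    rng.shuffle(shuffled)
    n_sig = round(0.2 * len(shuffled))  # 20% significant, 80% insignificant
    return {
        # Assumed uniform; the slides only say "randomly" within the range.
        "link_delay": {link: rng.uniform(0.01, 0.1) for link in links},
        # One sample shown; a simulator would likely draw this per message.
        "processing_delay": rng.uniform(0.001, 0.01),
        "significant": set(shuffled[:n_sig]),
        "insignificant": set(shuffled[n_sig:]),
        # 20% of link failures are long-term, 80% transient.
        "failure_kind": "long-term" if rng.random() < 0.2 else "transient",
        "mrai": MRAI_SECONDS,
        "tidr": TIDR_SECONDS,
    }


scenario = make_scenario([("A", "B"), ("B", "C")], [f"p{i}" for i in range(10)])
```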
Fail-down Events
Fail-Over Events
Summary and On-going Work
• TIDR: Traffic-aware Inter-Domain Routing
  – Capitalizing on two important properties
    • Internet access non-uniformity
    • Prevalence of transient failures
  – Differentiated BGP update propagation for significant and insignificant prefixes
    • Propagating updates of significant prefixes with higher priority
    • Aggressively slowing down propagation of updates of insignificant prefixes
• Performed simulation studies
  – Outperforms BGP and other existing enhancements