Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network
Jian Wu (University of Michigan)
Z. Morley Mao (University of Michigan)
Jennifer Rexford (Princeton University)
Jia Wang (AT&T Labs Research)
Motivation
[Figure: backbone routers A, B, C, D; neighboring networks AS1, AS2, AS3, AS4; a source and a destination. Labels: Failure, Disruption, Congestion, Mitigation.]
A backbone network is vulnerable to routing changes that occur in other domains.
Goal
Identify important routing anomalies:
•Lost reachability
•Persistent flapping
•Large traffic shifts
Contributions:
•Build a tool to identify a small number of important routing disruptions from a large volume of raw BGP updates in real time.
•Use the tool to characterize routing disruptions in an operational network
Interdomain Routing: Border Gateway Protocol
•Prefix-based: one route per prefix
•Path-vector: list of ASes in the path
•Incremental: every update indicates a change
•Policy-based: local ranking of routes
[Figure: AS 1, AS 2, and AS 3 connected by eBGP sessions, with iBGP inside an AS. Announcements: “I can reach 12.34.158.0/24” and “I can reach 12.34.158.0/24 via AS 1”. Data traffic flows toward 12.34.158.5 in 12.34.158.0/24.]
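The path-vector behavior on this slide can be illustrated with a minimal sketch. The `Route`, `export`, and `accept` names are hypothetical and chosen for illustration; the sketch only shows AS-path prepending on eBGP export and loop rejection on import, not the full BGP decision process.

```python
from typing import NamedTuple, Optional, Tuple


class Route(NamedTuple):
    prefix: str
    as_path: Tuple[int, ...]  # nearest AS first, origin AS last


def export(route: Route, local_as: int) -> Route:
    """Prepend the local AS number before announcing over eBGP."""
    return Route(route.prefix, (local_as,) + route.as_path)


def accept(route: Route, local_as: int) -> Optional[Route]:
    """Reject a route whose AS path already contains the local AS (loop)."""
    if local_as in route.as_path:
        return None
    return route


# AS 1 originates the prefix; AS 2 re-announces it "via AS 1".
local = Route("12.34.158.0/24", ())
to_as2 = export(local, 1)
to_as3 = export(accept(to_as2, 2), 2)
```

Here `to_as3.as_path` lists AS 2 followed by AS 1, which matches the “via AS 1” announcement on the slide.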
Capturing Routing Changes
[Figure: data collection. Labels: CPE, BGP Monitor, iBGP sessions, eBGP sessions, Updates, Best routes.]
A large operational network (8/16/2004 – 10/10/2004)
Challenges
Large volume of BGP updates
•Millions daily, very bursty
•Too much for an operator to manage
Different from root-cause analysis
•Identify changes and their effects
•Focus on actionable events rather than diagnosis
•Diagnose causes in/near the AS
System Architecture
[Figure: processing pipeline, with the approximate order of magnitude of items at each stage]
BGP Updates (10^6) → BGP Update Grouping → Events (10^5) → Event Classification → “Typed” Events → Event Correlation → Clusters (10^3) → Traffic Impact Prediction (using Netflow data) → Large Disruptions (10^1)
Side outputs: Persistent Flapping Prefixes (10^1) from BGP Update Grouping; Frequent Flapping Prefixes (10^1) from Event Correlation
From millions of updates to a few dozen reports
Grouping BGP Updates into Events
Challenge: a single routing change
•leads to multiple update messages
•affects routing decisions at multiple routers
Approach:
•Group together all updates for a prefix with inter-arrival time < 70 seconds
•Flag prefixes with changes lasting > 10 minutes
[Figure: BGP Updates → BGP Update Grouping → Events and Persistent Flapping Prefixes.]
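The grouping rule on this slide can be sketched as follows. This is a minimal offline illustration, assuming each update is reduced to a hypothetical `(timestamp_seconds, prefix)` pair; the `Event` class and function names are not from the original tool, which works online and may flag a long-running event as soon as it crosses the 10-minute mark rather than after the fact.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple

EVENT_TIMEOUT = 70          # seconds, from the slide
CONVERGENCE_TIMEOUT = 600   # 10 minutes, from the slide


@dataclass
class Event:
    prefix: str
    start: float
    end: float
    num_updates: int = 1

    @property
    def duration(self) -> float:
        return self.end - self.start


def group_updates(updates: Iterable[Tuple[float, str]]) -> List[Event]:
    """Group per-prefix updates whose inter-arrival time is below the timeout."""
    open_events: Dict[str, Event] = {}
    closed: List[Event] = []
    for ts, prefix in sorted(updates):
        current = open_events.get(prefix)
        if current is not None and ts - current.end < EVENT_TIMEOUT:
            # Same event: extend it to cover this update.
            current.end = ts
            current.num_updates += 1
        else:
            # Gap too large (or first update): close the old event, open a new one.
            if current is not None:
                closed.append(current)
            open_events[prefix] = Event(prefix, ts, ts)
    closed.extend(open_events.values())
    return closed


def persistent_flapping(events: Iterable[Event]) -> Set[str]:
    """Prefixes with at least one event lasting longer than the convergence timeout."""
    return {e.prefix for e in events if e.duration > CONVERGENCE_TIMEOUT}
```

A prefix that is updated every 60 seconds for 11 minutes forms a single event and is flagged, while a prefix with a 140-second quiet period produces two separate events.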
Grouping Thresholds
Based on our understanding of BGP and data analysis
Event timeout: 70 seconds
•2 * MRAI timer + 10 seconds
•98% of inter-arrival times < 70 seconds
Convergence timeout: 10 minutes
•BGP usually converges within a few minutes
•99.9% of events < 10 minutes
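The event timeout arithmetic can be checked with a short sketch. The 30-second MRAI value is an assumption (the commonly used eBGP default, and the only value for which the slide's formula gives 70 seconds); `fraction_below` is a hypothetical helper showing how a coverage figure such as the 98% could be measured on one's own inter-arrival data.

```python
MRAI_SECONDS = 30  # assumption: common eBGP default, not stated on the slide


def event_timeout(mrai: float = MRAI_SECONDS, slack: float = 10) -> float:
    """Event timeout from the slide: 2 * MRAI timer + 10 seconds."""
    return 2 * mrai + slack


def fraction_below(values, threshold: float) -> float:
    """Fraction of observed values strictly below the threshold."""
    values = list(values)
    return sum(v < threshold for v in values) / len(values)
```

For example, `fraction_below(gaps, event_timeout())` applied to a list of measured per-prefix inter-arrival gaps gives the share of gaps that the 70-second timeout would keep inside one event.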
Persistent Flapping Prefixes
Types of persistent flapping
•Conservative damping parameters (78.6%)
•Protocol oscillations due to MED (18.3%)
•Unstable interfaces or BGP sessions (3.0%)
A surprising finding: 15.2% of updates were caused by persistent-flapping prefixes even though flap damping is enabled.
Example: Unstable eBGP Session
[Figure: labels ISP, Peer, Customer, routers EA, EB, EC, ED, prefix p.]
•Flap damping parameters are session-based
•Damping is not implemented for iBGP sessions
Event Classification
Challenge: major concerns in network management
•Changes in reachability
•Heavy load of routing messages on the routers
•Changes in the flow of traffic through the network
[Figure: Events → Event Classification → “Typed” Events, e.g., Loss/Gain of Reachability.]
Solution: classify events by severity of their impact
Event Category – “No Disruption”
[Figure: ISP with border routers EA, EB, EC, ED, EE; neighbors AS1 and AS2; prefix p. Label: No Traffic Shift.]
“No Disruption”: no border router has any traffic shift (50.3% of events).
Event Category – “Internal Disruption”
[Figure: ISP with border routers EA, EB, EC, ED, EE; neighbors AS1 and AS2; prefix p. Label: Internal Traffic Shift.]
“Internal Disruption”: all traffic shifts are internal (15.6% of events).
Event Category – “Single External Disruption”
[Figure: ISP with border routers EA, EB, EC, ED, EE; neighbors AS1 and AS2; prefix p. Label: External Traffic Shift.]
“Single External Disruption”: only one of the traffic shifts is external (20.7% of events).
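The category definitions on these slides can be written as a small decision function. This is a sketch under several assumptions: each border router's traffic shift is already labeled as none, internal, or external (how that label is derived is not shown here); reachability changes take precedence over the other categories; and “Multiple External Disruption” means more than one external shift, which is inferred from the category name. The `Shift` enum and `classify_event` function are hypothetical names.

```python
from enum import Enum
from typing import Iterable


class Shift(Enum):
    NONE = "none"
    INTERNAL = "internal"
    EXTERNAL = "external"


def classify_event(shifts: Iterable[Shift], reachability_changed: bool = False) -> str:
    """Map the per-router traffic shifts of one event to an event category."""
    shifts = list(shifts)
    if reachability_changed:
        # Assumption: reachability changes are checked first.
        return "Loss/Gain of Reachability"
    external = sum(s is Shift.EXTERNAL for s in shifts)
    internal = sum(s is Shift.INTERNAL for s in shifts)
    if external == 0 and internal == 0:
        return "No Disruption"
    if external == 0:
        return "Internal Disruption"
    if external == 1:
        return "Single External Disruption"
    # Assumption: more than one external shift.
    return "Multiple External Disruption"
```

Ordering the checks from least to most external shifts mirrors the slides' idea of classifying events by the severity of their impact.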
Statistics on Event Classification
| Event category | Events | Updates |
|---|---|---|
| No Disruption | 50.3% | 48.6% |
| Internal Disruption | 15.6% | 3.4% |
| Single External Disruption | 20.7% | 7.9% |
| Multiple External Disruption | 7.4% | 18.2% |
| Loss/Gain of Reachability | 6.0% | 21.9% |
First 3 categories have significant day-to-day variations
Updates per event depend on the type of event and the number of affected routers
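The observation about updates per event can be made concrete from the table's two columns. Dividing a category's share of updates by its share of events gives that category's mean updates per event relative to the overall mean; the sketch below uses only the percentages from the table, and the `SHARES` and `relative_updates_per_event` names are hypothetical.

```python
# (share of events in %, share of updates in %), copied from the table
SHARES = {
    "No Disruption": (50.3, 48.6),
    "Internal Disruption": (15.6, 3.4),
    "Single External Disruption": (20.7, 7.9),
    "Multiple External Disruption": (7.4, 18.2),
    "Loss/Gain of Reachability": (6.0, 21.9),
}


def relative_updates_per_event(events_pct: float, updates_pct: float) -> float:
    """Category mean updates per event, relative to the mean over all events."""
    return updates_pct / events_pct


ratios = {name: relative_updates_per_event(e, u) for name, (e, u) in SHARES.items()}
```

A ratio above 1 means the category generates more updates per event than average; the multiple-external and reachability categories come out well above 1, while internal and single-external disruptions come out well below.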
Event Correlation
Challenge: A single routing change affects multiple destination prefixes
[Figure: “Typed” Events → Event Correlation → Clusters.]
Solution: group events of the same type that occur close together in time
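One simple way to group same-type events that occur close together is a sliding time window per event type. The sketch below is an illustration only: the 60-second window is a hypothetical value not given on the slide, events are reduced to `(start_time, event_type, prefix)` tuples, and the original system may use additional criteria.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

CLUSTER_WINDOW = 60.0  # seconds; hypothetical value, not from the slides

TypedEvent = Tuple[float, str, str]  # (start_time, event_type, prefix)


def correlate(events: Iterable[TypedEvent],
              window: float = CLUSTER_WINDOW) -> List[List[TypedEvent]]:
    """Cluster events of the same type whose start times are within the window."""
    by_type = defaultdict(list)
    for ev in events:
        by_type[ev[1]].append(ev)

    clusters: List[List[TypedEvent]] = []
    for evs in by_type.values():
        evs.sort()  # order by start time within one type
        current = [evs[0]]
        for ev in evs[1:]:
            if ev[0] - current[-1][0] <= window:
                current.append(ev)
            else:
                clusters.append(current)
                current = [ev]
        clusters.append(current)
    return clusters
```

Each resulting cluster holds events for possibly many prefixes, which reflects the challenge on the slide: a single routing change affects multiple destination prefixes.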
eBGP Session Reset
•Caused most of the “single external disruption” events
•Check whether the number of prefixes using that session as the best route changes dramatically
•Validation with Syslog router reports (95%)
[Figure: number of prefixes over time, with session failure and session recovery marked.]
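The session-reset check can be sketched as a scan over a time series of prefix counts. The slide does not say what counts as a dramatic change, so the 50% `drop_fraction` below is a hypothetical threshold, and `detect_session_changes` is an illustrative name rather than the original tool's interface.

```python
from typing import List, Tuple


def detect_session_changes(counts: List[Tuple[float, int]],
                           drop_fraction: float = 0.5) -> List[Tuple[float, str]]:
    """Flag large jumps in the number of prefixes whose best route uses a session.

    counts: (timestamp, number_of_prefixes) samples in time order.
    drop_fraction: hypothetical threshold for a "dramatic" change.
    """
    changes = []
    for (_, n0), (t1, n1) in zip(counts, counts[1:]):
        base = max(n0, n1)
        if base == 0:
            continue
        if n0 - n1 >= drop_fraction * base:
            changes.append((t1, "failure"))
        elif n1 - n0 >= drop_fraction * base:
            changes.append((t1, "recovery"))
    return changes
```

A session that carries about a thousand best routes, drops to a handful, and later returns would yield one failure entry and one recovery entry, matching the shape of the figure.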
Hot-Potato Changes
•Caused “internal disruption” events
•Validation with OSPF measurements (95%) [Teixeira et al., SIGMETRICS ’04]
[Figure: ISP with egress routers EA, EB, EC, prefix P, and numeric path labels.]
“Hot-potato routing” = route to closest egress point
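The definition above can be illustrated with a short sketch: among equally good egress points for a prefix, a router picks the one with the smallest internal (IGP) distance, so a change in internal distances alone can move traffic. The distances in the usage note are hypothetical values, and the function names are chosen for illustration.

```python
from typing import Dict, Iterable, Tuple


def closest_egress(igp_distance: Dict[str, float], candidates: Iterable[str]) -> str:
    """Pick the candidate egress router with the smallest IGP distance.

    Ties are broken by router name so the result is deterministic.
    """
    return min(sorted(candidates), key=lambda egress: igp_distance[egress])


def hot_potato_change(before: Dict[str, float],
                      after: Dict[str, float],
                      candidates: Iterable[str]) -> Tuple[bool, str, str]:
    """Report whether an IGP distance change moves traffic to another egress."""
    candidates = list(candidates)
    old = closest_egress(before, candidates)
    new = closest_egress(after, candidates)
    return old != new, old, new
```

With hypothetical distances `{"EA": 9, "EB": 10, "EC": 11}`, the router exits via EA; if the distance to EA rises to 12, the same BGP routes now lead it to exit via EB, which is the kind of internal traffic shift this slide describes.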
Traffic Impact Prediction
Challenge: routing changes have different impacts on the network, depending on the popularity of the destinations
[Figure: Clusters and Netflow data → Traffic Impact Prediction → Large Disruptions.]
Solution: weigh each cluster by traffic volume
Traffic Impact Prediction
Traffic weight
•Per-prefix measurement from Netflow
•10% of prefixes account for 90% of traffic
Traffic weight of a cluster
•The sum of the “traffic weight” of its prefixes
•A small number of large clusters have large traffic weight
•Mostly session resets and hot-potato changes
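The cluster weighting can be sketched in a few lines. This assumes per-prefix traffic volumes (for example, byte counts derived from Netflow) are already available as a dictionary, and that a prefix's traffic weight is its share of total traffic; the 1% reporting threshold and all function names are hypothetical.

```python
from typing import Dict, Iterable, List


def traffic_weights(volume_per_prefix: Dict[str, float]) -> Dict[str, float]:
    """Normalize per-prefix traffic volumes into fractions of total traffic."""
    total = sum(volume_per_prefix.values())
    return {prefix: volume / total for prefix, volume in volume_per_prefix.items()}


def cluster_weight(prefixes: Iterable[str], weights: Dict[str, float]) -> float:
    """Traffic weight of a cluster: the sum of the weights of its prefixes."""
    return sum(weights.get(prefix, 0.0) for prefix in set(prefixes))


def large_disruptions(clusters: Dict[str, List[str]],
                      weights: Dict[str, float],
                      threshold: float = 0.01) -> List[str]:
    """Names of clusters whose traffic weight reaches a (hypothetical) threshold."""
    return [name for name, prefixes in clusters.items()
            if cluster_weight(prefixes, weights) >= threshold]
```

Because traffic is heavily skewed toward a small share of prefixes, most clusters end up with negligible weight and only a few pass the threshold.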
Performance Evaluation
Memory
•Static memory: “current routes”, 600 MB
•Dynamic memory: “clusters”, 300 MB
Speed
•99% of 1-second intervals of updates can be processed within 1 second
•Occasional execution lag
•Every 70-second interval of updates can be processed within 70 seconds
Measurements were based on a 900 MHz CPU
Conclusion
BGP troubleshooting system
•Fast, online fashion
•Operators’ concerns (reachability, flapping, traffic)
•Significant information reduction: from millions of updates to a few dozen large disruptions
Uncovered important network behavior
•Hot-potato changes
•Session resets
•Persistent-flapping prefixes