A Framework for Measuring and Predicting the Impact of Routing Changes
Ying Zhang, Z. Morley Mao, Jia Wang
Internet routing changes
Various causes: link failures, configuration changes, topology changes, etc.
Direct influence on the data plane
Transient data-plane disruption: packet loss, increased delay, forwarding loops
[Figure: Source and Destination connected across the Internet by an old path and a new path]
Motivation
Frequent routing dynamics can cause transient disruption in the data plane
Inconsistent routes during convergence
Real-time applications can be affected
Predicting the performance impact can assist more intelligent route selection
Measuring and predicting the impact
Comprehensively measure the impact of routing changes
Characterize the properties of routing changes that cause traffic disruption
Search for patterns to help prediction
Outline
Motivation
Methodology
Characterization of data-plane failures
Failure prediction model
Methodology
Data collection
Control plane: local real-time BGP updates
Data plane: ping and traceroute probes for each update
A lightweight active probing methodology
A coarse-grained performance metric: reachability
Destination reachable: any ping reply is received
Scalable to many destinations with live IPs
Measurement-based approach
No simplifying assumptions
Empirical evidence
Our approach
Focus: measure data-plane failures caused by routing changes
Coarse-grained performance metrics
Methodology: lightweight active probing
Triggered by locally observed routing updates
Probing target: a live IP within the prefix
[Figure: Measurement Framework. The framework receives an update for prefix P with AS path A D B; old and new paths to P run across the Internet through AS A, AS B, AS C, and AS D]
[Figure: Measurement Framework sending ping and traceroute probes to live IP 1 within prefix P over the old and new paths]
Probing control
Background probing
Identifying persistent failures
Verifying the live IP's response
Resource control
Ignoring updates caused by table transfers
Imposing a maximum probing duration
Accuracy control
Imposing a maximum waiting duration
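One way the probing controls could be expressed, as a sketch only: the burst window and threshold used to spot a table transfer, the probing-duration cap, and the reply timeout are hypothetical parameters chosen for illustration, not values from the study. `looks_like_table_transfer` treats an unusually large burst of updates in a short window as a table transfer (for example after a session reset), `probe_times` stops scheduling probes once the maximum probing duration is reached, and `reply_counts` discards replies that arrive after the maximum waiting duration.

```python
from typing import List, Optional

# Hypothetical parameters, for illustration only.
BURST_WINDOW_S = 60.0          # window used to detect a burst of updates
BURST_THRESHOLD = 10_000       # updates within the window that suggest a table transfer
MAX_PROBE_DURATION_S = 600.0   # stop probing an update after this long
MAX_WAIT_S = 5.0               # a reply later than this is treated as lost


def looks_like_table_transfer(update_times: List[float], now: float,
                              window: float = BURST_WINDOW_S,
                              threshold: int = BURST_THRESHOLD) -> bool:
    """Resource control: flag a burst of updates in [now - window, now] as a table transfer."""
    recent = sum(1 for t in update_times if now - window <= t <= now)
    return recent >= threshold


def probe_times(start: float, interval: float,
                max_duration: float = MAX_PROBE_DURATION_S) -> List[float]:
    """Resource control: probe send times for one update, capped at max_duration."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    times = []
    t = start
    while t - start < max_duration:
        times.append(t)
        t += interval
    return times


def reply_counts(rtt: Optional[float], max_wait: float = MAX_WAIT_S) -> bool:
    """Accuracy control: count a reply only if it arrived within max_wait seconds (None = no reply)."""
    return rtt is not None and rtt <= max_wait


print(probe_times(0.0, 100.0, max_duration=300.0))  # [0.0, 100.0, 200.0]
```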
Characterization of data-plane failures
Failure types
Reachability failure: a ping reply is not received due to network problems
Forwarding loops: a subset of reachability failures in which transient loops are observed in the path
Failure properties
Affected networks
Failure duration
Failure predictability
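Since a forwarding loop is a reachability failure in which the traceroute path revisits a router, a loop check can be sketched as below. This is an illustrative heuristic, not the detection rule from the study: it assumes traceroute output is a list of hop IPs with None for unresponsive hops, and it ignores the same router answering twice in a row so that a single duplicated hop is not mistaken for a loop.

```python
from typing import List, Optional, Tuple


def find_forwarding_loop(hops: List[Optional[str]]) -> Optional[Tuple[int, int]]:
    """Return (first_index, repeat_index) for the first router that reappears after a
    different router has been seen, or None if the path has no such repeat."""
    first_seen = {}
    prev = None
    for i, hop in enumerate(hops):
        if hop is None:                  # unresponsive hop ("*"), skip it
            continue
        if hop in first_seen and hop != prev:
            return first_seen[hop], i
        first_seen.setdefault(hop, i)
        prev = hop
    return None


path = ["10.0.0.1", "10.0.1.1", "10.0.2.1", "10.0.1.1", "10.0.2.1"]
print(find_forwarding_loop(path))  # (1, 3)
```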
Overall reachability failure statistics
Internet experiments for 11 weeks
                       Incidences   Prefixes   ASes
Unreachable: loop           6%         23%      33%
Unreachable: other         36%         72%      38%
Unreachable: all           42%         73%      63%
Reachable                  57%         83%      98%
Affected network locations
Understanding the networks affected by routing changes
Most ASes are near the edge and in foreign countries
A small fraction of destinations experience many unreachable incidences
Failure durations
Short duration
Most last less than 300 seconds
Transient routing failures, convergence delay
10% of incidences have longer durations
Configuration errors or path failures
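Duration statistics such as the 300-second split can be derived from the time-ordered probe results for each update. The sketch below assumes each sample is a (timestamp in seconds, reachable) pair and measures an incidence from the first failed probe to the first later successful one, so a duration is only known to within one probing interval; an incidence still open when probing stops is left out. The function names are hypothetical.

```python
from typing import List, Tuple


def unreachable_durations(samples: List[Tuple[float, bool]]) -> List[float]:
    """Durations of unreachable incidences in time-ordered (timestamp, reachable) samples."""
    durations = []
    start = None
    for t, reachable in samples:
        if not reachable and start is None:
            start = t                      # first failed probe opens an incidence
        elif reachable and start is not None:
            durations.append(t - start)    # first successful probe closes it
            start = None
    return durations                       # an incidence still open at the end is not counted


def fraction_shorter_than(durations: List[float], cutoff: float = 300.0) -> float:
    """Fraction of incidences lasting less than cutoff seconds (0.0 if there are none)."""
    if not durations:
        return 0.0
    return sum(d < cutoff for d in durations) / len(durations)


samples = [(0, True), (10, False), (20, False), (30, True), (40, False), (400, True)]
d = unreachable_durations(samples)
print(d, fraction_shorter_than(d))  # [20, 360] 0.5
```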
Failure predictability
Destination prefix information
Appearance probability: the probability of an unreachable incidence for prefix D
Destination prefix and AS path segments
Conditional probability on AS path segments: the probability of an unreachable event occurring given a particular AS path segment
Responsible AS
The AS where the traceroute stops
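These three kinds of predictability information can be estimated by counting over historical measurements. This sketch assumes historical records of the form (prefix, AS path, unreachable) per routing update, takes "AS path segments" to mean runs of two adjacent ASes (the segment length is a hypothetical choice), and identifies the responsible AS as the AS of the last hop that answered the traceroute. All function names are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

Record = Tuple[str, List[str], bool]  # (prefix, AS path, unreachable?) for one update


def appearance_probability(records: Iterable[Record]) -> Dict[str, float]:
    """Per prefix D: fraction of routing updates followed by an unreachable incidence."""
    total, bad = Counter(), Counter()
    for prefix, _, unreachable in records:
        total[prefix] += 1
        bad[prefix] += int(unreachable)
    return {p: bad[p] / total[p] for p in total}


def segments(as_path: List[str], length: int = 2) -> List[Tuple[str, ...]]:
    """Contiguous AS path segments of the given length."""
    return [tuple(as_path[i:i + length]) for i in range(len(as_path) - length + 1)]


def segment_probability(records: Iterable[Record], length: int = 2) -> Dict[Tuple[str, ...], float]:
    """Per AS path segment: fraction of updates containing it that were unreachable."""
    total, bad = Counter(), Counter()
    for _, as_path, unreachable in records:
        for seg in set(segments(as_path, length)):  # count each segment once per update
            total[seg] += 1
            bad[seg] += int(unreachable)
    return {s: bad[s] / total[s] for s in total}


def responsible_as(hop_asns: List[Optional[str]]) -> Optional[str]:
    """AS of the last responding traceroute hop, i.e. where the traceroute stops."""
    answered = [asn for asn in hop_asns if asn is not None]
    return answered[-1] if answered else None


history = [("P", ["A", "D", "B"], True), ("P", ["A", "C", "B"], False)]
print(appearance_probability(history))           # {'P': 0.5}
print(segment_probability(history)[("D", "B")])  # 1.0
print(responsible_as(["A", "D", None, None]))    # D
```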
Prediction model
Prefix and AS segment information
The data-plane failure likelihood ratio
P(Y=1|R;D): the conditional probability of a data-plane failure given a routing update R for prefix D
Assuming failures on each AS are independent
x_i is the responsible AS in historical data
$$\lambda(Y) = \frac{P(Y=1 \mid R; D)}{P(Y=0 \mid R; D)}$$
$$P(Y=1 \mid R, x_1, x_2, \ldots, x_n; D) = 1 - \prod_{i=1}^{n} \bigl(1 - P(Y=1 \mid x_i; D)\bigr)$$
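The model's two quantities, the likelihood ratio λ(Y) = P(Y=1|R;D) / P(Y=0|R;D) and, under the independence assumption, P(Y=1|R, x_1, ..., x_n; D) = 1 - ∏(1 - P(Y=1|x_i;D)), can be computed directly. This sketch assumes the per-AS probabilities P(Y=1|x_i;D) have already been estimated from historical data (for example, how often x_i was the responsible AS for failures toward prefix D) and that a failure is predicted when the likelihood ratio exceeds a decision threshold; the default threshold of 1.0 is hypothetical.

```python
import math
from typing import Iterable


def failure_probability(per_as_probs: Iterable[float]) -> float:
    """P(Y=1 | R, x_1..x_n; D) = 1 - prod_i (1 - P(Y=1 | x_i; D)), assuming independent ASes."""
    return 1.0 - math.prod(1.0 - p for p in per_as_probs)


def likelihood_ratio(p_fail: float) -> float:
    """lambda(Y) = P(Y=1 | R; D) / P(Y=0 | R; D), with P(Y=0) = 1 - P(Y=1)."""
    return math.inf if p_fail >= 1.0 else p_fail / (1.0 - p_fail)


def predict_failure(per_as_probs: Iterable[float], threshold: float = 1.0) -> bool:
    """Predict a data-plane failure when the likelihood ratio exceeds the (hypothetical) threshold."""
    return likelihood_ratio(failure_probability(per_as_probs)) > threshold


p = failure_probability([0.5, 0.5])
print(p, likelihood_ratio(p))  # 0.75 3.0
```

With two ASes that each fail half the time, the combined failure probability is 1 - 0.5 × 0.5 = 0.75, giving odds of 3 to 1 in favor of a failure.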
Evaluation
The trade-off between selectivity and sensitivity
The decision threshold determines the false-positive and false-negative rates
Receiver operating characteristic (ROC) curve
Evaluation results
60% detection rate with 18% false positives
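Each decision threshold yields one (false-positive rate, detection rate) point on the ROC curve, and a reported operating point such as a 60% detection rate at 18% false positives is one such point. A minimal sketch, assuming each test update has a model score (such as the likelihood ratio) and a ground-truth label from the probes (1 = a data-plane failure was observed); the function name and the toy data are illustrative, not the study's evaluation data.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int],
               thresholds: Sequence[float]) -> List[Tuple[float, float, float]]:
    """For each threshold, predict failure when score > threshold and return
    (threshold, false-positive rate, detection rate)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for th in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s > th and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > th and y == 0)
        fpr = fp / negatives if negatives else 0.0
        tpr = tp / positives if positives else 0.0
        points.append((th, fpr, tpr))
    return points


for point in roc_points([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1], [0.0, 0.3, 0.5, 1.0]):
    print(point)
```

Lowering the threshold raises the detection rate and the false-positive rate together, which is the selectivity/sensitivity trade-off the ROC curve makes visible.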
Conclusion
Developed an efficient framework for measuring and predicting data-plane failures caused by routing changes
Identified patterns to accurately predict data-plane failures
Provided suggestions for more intelligent route selection