Minimizing Probing Cost for Detecting Interface Failures: Algorithms and Scalability Analysis
Hung Nguyen (Univ. of Adelaide, Australia)
Renata Teixeira (UPMC, France)
Patrick Thiran (EPFL, Switzerland)
Christophe Diot (Thomson, France)
The Internet is great, but problems happen
[Figure: the UoA network, Net1, Net2, Net3, and a host at 129.130.42.3]
How to automatically detect and identify problems?
◦ Is my connection ok?
◦ Is the server up?
◦ Is the problem in some of the networks in the path?
Current alarms are not enough
Network equipment already has many alarms
◦ SNMP traps
◦ Anomaly detection systems
But alarms may not reflect the user’s experience
◦ Hard to map users’ complaints to alarms
◦ A problem may not raise an alarm
[Figure: nodes A, B, C, D; hosts 129.130.42.3 and 13.110.42.5. C wrongly filters packets to 129.130.42.3/24]
Active monitoring system to detect faults
Network admins often resort to active measurements
◦ Active monitoring servers inside their network
◦ Subscribe to a third-party monitoring service, e.g., Keynote or RIPE TTM
Challenge: cannot continuously overload the network or the end users’ machines to detect faults, which are rare events
Problem definition
[Figure: monitors M1, M2; target hosts T1, T2, T3; subscriber network with nodes A, B, C, D]
Goal: detect failures of any of the interfaces in the subscriber’s network with minimum probing overhead
Simple solution: Coverage problem
[Figure: monitors M1, M2; target hosts T1, T2, T3; nodes A, B, C, D]
Instead of probing all paths, select the minimum set of paths that covers all interfaces in the subscriber’s network
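The coverage idea can be illustrated with a small sketch. It uses the standard greedy set-cover heuristic, which is an approximation and not necessarily the selection method used in the talk. The path names and the interfaces each path crosses are hypothetical, loosely modeled on the slide's labels, with node letters standing in for interfaces.

```python
def greedy_path_cover(paths):
    """Pick paths until every interface seen in `paths` is covered.

    `paths` maps a path name to the set of interfaces it traverses.
    Greedy set cover: an approximation, not guaranteed to be minimum.
    """
    uncovered = set().union(*paths.values())
    chosen = []
    while uncovered:
        # Take the path covering the most still-uncovered interfaces
        # (names are sorted first so ties resolve deterministically).
        best = max(sorted(paths), key=lambda p: len(paths[p] & uncovered))
        chosen.append(best)
        uncovered -= paths[best]
    return chosen


# Hypothetical monitor-to-target paths (assumption, not from the slides).
paths = {
    "M1-T1": {"A", "B"},
    "M1-T2": {"A", "C"},
    "M2-T2": {"D", "C"},
    "M2-T3": {"D", "B"},
    "M1-T3": {"A", "B", "D"},
}
selected = greedy_path_cover(paths)
print(selected)
```

In this toy input two of the five paths are enough to cross every interface, so the other three need not be probed if only coverage matters.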
Coverage solution doesn’t detect all types of failures
Detects full-stop failures
◦ Failures that affect all packets that traverse the faulty interface
  E.g., interface or router crashes, fiber cuts, bugs
But not path-specific failures
◦ Failures that affect only a subset of the paths that cross the faulty interface
  E.g., router misconfigurations
New formulation of failure detection problem
Simultaneously select the frequency to probe each path
◦ Lower-frequency per-path probing can achieve high-frequency probing of each interface
[Figure: monitors M1, M2; target hosts T1, T2, T3; nodes A, B, C, D; annotations "1 every 9 mins" and "1 every 3 mins"]
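A small arithmetic sketch of why low per-path rates can add up to a high per-interface rate. It assumes, as one reading of the figure's "1 every 9 mins" and "1 every 3 mins" annotations, that three paths cross the same interface and that their probes are evenly staggered. The function name and parameters are illustrative only.

```python
def probe_times(path_interval_min, n_paths, horizon_min):
    """Merged, sorted probe arrival times (in minutes) at a shared interface.

    Each of the n_paths paths is probed every path_interval_min minutes,
    and the paths' start times are evenly staggered (assumption).
    """
    offset = path_interval_min / n_paths
    times = []
    for k in range(n_paths):
        t = k * offset
        while t < horizon_min:
            times.append(t)
            t += path_interval_min
    return sorted(times)


times = probe_times(9, 3, 18)
gaps = [b - a for a, b in zip(times, times[1:])]
print(times)  # [0.0, 3.0, 6.0, 9.0, 12.0, 15.0]
print(gaps)   # every gap is 3.0 minutes
```

Each path carries one probe every 9 minutes, yet the shared interface is crossed by a probe every 3 minutes.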
Properties of solution
Probe minimization for failure detection is no longer NP-hard
◦ Can find the optimal solution using linear programming
Needs synchronization among monitors
◦ Monitors need to collaborate to probe an interface
  • Alternative probabilistic solution with Poisson probes avoids the synchronization overhead
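The transcript does not reproduce the linear program itself. As a rough sketch only, a rate-selection problem of this kind can be written as follows, where the symbols are assumptions introduced here: $P$ is the set of candidate paths, $I$ the set of interfaces, $\lambda_p$ the probing rate of path $p$, and $r_i$ the minimum rate at which interface $i$ must be probed. This sketch covers only a per-interface rate requirement; the talk's formulation also accounts for path-specific failures, which is not modeled here.

```latex
\begin{aligned}
\min_{\lambda} \quad & \sum_{p \in P} \lambda_p \\
\text{s.t.} \quad & \sum_{p \in P :\, i \in p} \lambda_p \ge r_i && \forall i \in I, \\
& \lambda_p \ge 0 && \forall p \in P.
\end{aligned}
```

Because the rates are continuous variables rather than a yes/no choice per path, the objective and constraints are linear, which is consistent with the slide's remark that the problem is no longer NP-hard.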
[Figure: monitors M1, M2; target hosts T1, T2, T3; nodes A, B, C, D; annotations "1 every 9 mins" and "1 every 3 mins"]
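The Poisson alternative can be illustrated with a short sketch. If each path through an interface is probed by an independent Poisson process, the merged probe stream at the interface is also Poisson with the summed rate, so no coordination between monitors is needed. The sketch assumes a full-stop failure (every probe crossing the interface during the failure detects it); the rates and the failure duration are hypothetical values.

```python
import math
import random


def detection_probability(path_rates, failure_duration):
    """P(at least one probe crosses the interface during the failure).

    path_rates: probes per minute on each path crossing the interface,
    each modeled as an independent Poisson process (assumption).
    """
    total_rate = sum(path_rates)  # superposition of Poisson processes
    return 1.0 - math.exp(-total_rate * failure_duration)


def simulate(path_rates, failure_duration, trials=20000, seed=1):
    """Monte Carlo check of detection_probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # By memorylessness, the wait until the next probe on each path
        # after the failure starts is exponential with that path's rate.
        first = min(rng.expovariate(r) for r in path_rates)
        hits += first <= failure_duration
    return hits / trials


rates = [1 / 9, 1 / 9, 1 / 9]  # three paths, one probe per 9 minutes each
p = detection_probability(rates, 3.0)
estimate = simulate(rates, 3.0)
print(round(p, 3), round(estimate, 3))
```

Unlike the synchronized schedule, detection within a given window is only probabilistic, so the rates would have to be raised to reach a target detection probability.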
Scaling law of probing cost
Probing cost (number of probes sent per second) scales almost linearly with the size of the subscriber’s network
◦ In our inferred Internet graphs
For a random power-law graph, probing cost is a linear function of the number of nodes (n)
Bounded by the isometric path number of a graph, i(G)
For other graphs:
Graph       i(G)
Cycle       2n/(n+1)
Complete    n/2
Hypercube   n/log n
Grid        n/2
Evaluation
Paths obtained using traceroutes
◦ From 750 PlanetLab nodes to 3,000 DNS servers
◦ From 12 RON nodes to 60,000 targets
Subscriber networks are probed ASes
◦ Map IPs to ASes using Mao et al.’s technique
◦ 1,366 ASes in PlanetLab
◦ 6,517 ASes in RON
Compute probing costs varying parameters
◦ Set of paths, failure durations, subscriber’s network
Probing costs varying size of subscriber network in PlanetLab
Duration
◦ Path-specific = 1000 sec
◦ Full-stop duration = 1 sec
Summary
Practical formulation of failure detection problem
◦ Incorporates both full-stop and path-specific failures
Solution minimizes probing cost
◦ Using linear programming
Inferred Internet graphs are among the most expensive to probe
◦ Probing cost scales almost linearly with network size
Next step
◦ Deploy a system based on these probing techniques
Probing costs
Duration
◦ Path-specific = 2 sec
◦ Full-stop duration = 1 sec
Varying Failure Durations
Full-stop duration = 10 sec
[Figure annotations: "Path-specific failures dominate the cost"; "Full-stop failures dominate the cost"]
Probing costs varying size of subscriber network in RON
Duration
◦ Path-specific = 1000 sec
◦ Full-stop duration = 1 sec