Challenges in Making Tomography Practical

Yiyi Huang, Georgia Tech
Nick Feamster, Georgia Tech
Renata Teixeira, LIP6
Christophe Diot, Thomson

Problem

• Network operators need to detect and isolate faults quickly, before customers complain

• Plenty of existing alarms
  – SNMP traps
  – Active probes
  – Anomaly detection systems

• Unfortunately, these alarms do not help operators locate and eliminate the faults that cause problems on end-to-end paths

Network Tomography to the Rescue

• Send end-to-end probes through the network

• Monitor paths for differences in reachability

• Infer location of reachability problem from these differences

[Figure: monitor and targets]

Some Problems

• Scalability vs. speed: Detection must be fast

• Ambiguity: Losses are one-way, but we do not always have access to both ends of the path

• Lack of synchronization: Different monitors see different conditions

• Dynamics: Topology can change, loss can be transient

Doppler: Making Tomography Practical

• Fast, scalable detection
  – Solution: Monitor selection algorithm to reduce the number of monitors and targets so that “cycle times” are fast

• Transient packet loss
  – Solution: Triggered confirmation of failed paths

• One-way losses
  – Solution: New algorithm based on IP spoofing

• Dynamic routing
  – Solution: Periodic snapshots of the network topology

Controlled evaluation on VINI, plus limited wide-area experiments.

Fast, Scalable Detection

• Select monitors, targets to satisfy two conditions
  – All interfaces are “covered” (or diagnosable)
  – The number of monitors is small enough to ensure a short round time

• Two goals
  – Coverage: When a failure occurs, system detects it
    • Every interface is covered by at least one path
  – Diagnosability: When a failure occurs, system locates it
    • Every interface is covered by a unique set of paths
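The two goals above can be stated as simple checks on a set of probe paths. The following is a minimal sketch, assuming each path is given as the list of interfaces it traverses; the function names and the toy topology are illustrative and do not come from Doppler itself.

```python
from collections import defaultdict

def path_signatures(paths):
    """Map each interface to the frozenset of path ids that traverse it."""
    sig = defaultdict(set)
    for path_id, ifaces in paths.items():
        for iface in ifaces:
            sig[iface].add(path_id)
    return {iface: frozenset(ids) for iface, ids in sig.items()}

def is_covered(paths, interfaces):
    """Coverage: every interface lies on at least one probed path."""
    sig = path_signatures(paths)
    return all(iface in sig for iface in interfaces)

def is_diagnosable(paths, interfaces):
    """Diagnosability: every interface lies on a unique set of probed paths."""
    sig = path_signatures(paths)
    if not all(iface in sig for iface in interfaces):
        return False
    signatures = [sig[iface] for iface in interfaces]
    return len(set(signatures)) == len(signatures)

# Hypothetical toy topology: three interfaces, three candidate paths.
interfaces = ["a", "b", "c"]
full = {"p1": ["a", "b"], "p2": ["b", "c"], "p3": ["c"]}
reduced = {"p1": ["a", "b"], "p3": ["c"]}

print(is_covered(full, interfaces), is_diagnosable(full, interfaces))
print(is_covered(reduced, interfaces), is_diagnosable(reduced, interfaces))
```

In the reduced set, interfaces a and b are crossed by exactly the same path, so a failure on either would be detected but could not be told apart.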

Offline Path Selection: Diagnosability

• Step 1: Compute the set of paths that cover all interfaces (greedy set cover heuristic)

• Step 2: Compute hitting set for each interface

• Step 3: Build equivalence classes for interfaces with common hitting set
  – For each interface in a set with more than one interface, find path that crosses only that interface
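The three steps can be sketched as follows. This is one possible reading of the slide, not the authors' implementation: "hitting set" is taken to mean the set of selected paths that traverse an interface, and "crosses only that interface" is taken to mean a path that crosses that interface and no other member of its equivalence class. All names and the candidate paths are hypothetical.

```python
def greedy_cover(candidates, interfaces):
    """Step 1: greedily pick the path covering the most uncovered interfaces."""
    uncovered = set(interfaces)
    chosen = []
    while uncovered:
        best = max(sorted(candidates),
                   key=lambda p: len(uncovered & set(candidates[p])))
        gained = uncovered & set(candidates[best])
        if not gained:
            break  # remaining interfaces lie on no candidate path
        chosen.append(best)
        uncovered -= gained
    return chosen

def hitting_sets(chosen, candidates, interfaces):
    """Step 2: for each interface, the chosen paths that traverse it."""
    return {i: frozenset(p for p in chosen if i in candidates[p])
            for i in interfaces}

def equivalence_classes(hits):
    """Step 3: group interfaces that share the same hitting set."""
    classes = {}
    for iface, hs in hits.items():
        classes.setdefault(hs, set()).add(iface)
    return list(classes.values())

def select_paths(candidates, interfaces):
    chosen = greedy_cover(candidates, interfaces)
    hits = hitting_sets(chosen, candidates, interfaces)
    for cls in equivalence_classes(hits):
        if len(cls) < 2:
            continue  # already distinguishable
        for iface in sorted(cls):
            for p in sorted(candidates):
                # Add a path that separates iface from the rest of its class.
                if p not in chosen and set(candidates[p]) & cls == {iface}:
                    chosen.append(p)
                    break
    return chosen

# Hypothetical candidate paths (path id -> interfaces traversed).
candidates = {"p1": ["a", "b", "c"], "p2": ["a"], "p3": ["b"], "p4": ["c", "d"]}
interfaces = ["a", "b", "c", "d"]
selected = select_paths(candidates, interfaces)
print(selected)
```

A single refinement pass is shown for brevity; a fuller version would recompute the classes after adding paths and repeat until no class has more than one member or no separating path exists.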

Detection, Confirmation, Correlation

• Periodic (once per 5 minutes) topology snapshot from all monitors to all destinations keeps track of underlying topology before the failure

• Detection: Periodic probes (once per “cycle time”) detect failure

• Confirmation: When a probe is lost, the monitor sends three additional probes. If all three are lost, path is determined to have failed.

• Correlation: Paths that fail within 10 seconds of one another are grouped.
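The confirmation and correlation rules above can be sketched in a few lines. This is a minimal illustration, assuming a caller-supplied `send_probe` function that returns True when a reply arrives, and assuming that "within 10 seconds of one another" means within 10 seconds of the previous failure in the same group; both are assumptions, not details from the slides.

```python
def confirm_failure(send_probe, extra_probes=3):
    """Confirm a failure only if every follow-up probe is lost.

    send_probe is a hypothetical callable returning True on a reply.
    Probing stops early as soon as one reply arrives.
    """
    return not any(send_probe() for _ in range(extra_probes))

def correlate(failures, window=10.0):
    """Group (timestamp_seconds, path) failures separated by <= window seconds."""
    groups = []
    last_time = None
    for ts, path in sorted(failures):
        if last_time is None or ts - last_time > window:
            groups.append([])  # gap too large: start a new group
        groups[-1].append(path)
        last_time = ts
    return groups

# Hypothetical confirmed failures: (time in seconds, path name).
failures = [(0.0, "A"), (4.0, "B"), (12.0, "C"), (30.0, "D")]
print(correlate(failures))
```

Under this chaining rule, paths A, B, and C fall into one group because each failed within 10 seconds of the one before it, while D starts a new group.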

Disambiguating One-Way Losses: Spoofing

• Monitor sends request to spoofer to send probe

• Probe carries the monitor's IP address as its source address

• If reply reaches the monitor, reverse path is working

Spoofer: Send spoofed packet with source address of M
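The inference behind the spoofing technique can be written as a small decision function. This is a sketch of the reasoning only, under the assumption that the monitor M knows two facts per target: whether its own probe was answered, and whether the reply to the spoofer's probe (sent with M's source address) reached M. The names are illustrative.

```python
from enum import Enum

class Diagnosis(Enum):
    PATH_OK = "forward and reverse paths working"
    FORWARD_FAILURE = "reverse path works, so the forward path is suspect"
    REVERSE_OR_UNKNOWN = "reverse path suspect; cannot rule out other causes"

def classify(direct_reply, spoofed_reply):
    """Classify a monitor-target path from two probe outcomes.

    direct_reply:  M's own probe to the target was answered.
    spoofed_reply: the target's reply to the spoofer's probe reached M.
    """
    if direct_reply:
        return Diagnosis.PATH_OK
    if spoofed_reply:
        # The target's reply reached M, so target -> M works;
        # the loss must be on M -> target.
        return Diagnosis.FORWARD_FAILURE
    # No reply either way: the reverse path may have failed, but the
    # spoofer -> target path (or the forward path too) could be at fault.
    return Diagnosis.REVERSE_OR_UNKNOWN

print(classify(direct_reply=False, spoofed_reply=True).name)
```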

Identification: NetDiagnoser

• Binary network tomography algorithm [Dhamdhere et al.]

• Input: hosts, destinations, topology before the failure

• Output: Set of possible locations for the fault
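To make the input and output concrete, here is a simplified sketch of binary tomography. It is not NetDiagnoser itself: it only assumes that each path is known as a list of interfaces from the pre-failure topology, that any interface on a working path is healthy, and that a small set of interfaces explaining all failed paths is preferred. The path names are hypothetical.

```python
def candidate_faults(paths, failed):
    """Interfaces on failed paths that appear on no working path."""
    working = set().union(*(paths[p] for p in paths if p not in failed))
    suspects = set()
    for p in failed:
        suspects |= set(paths[p]) - working
    return suspects

def smallest_explanation(paths, failed, suspects):
    """Greedily pick suspects until every failed path is explained."""
    if not suspects:
        return []
    unexplained = set(failed)
    chosen = []
    while unexplained:
        best = max(sorted(suspects),
                   key=lambda link: sum(link in paths[p] for p in unexplained))
        hit = {p for p in unexplained if best in paths[p]}
        if not hit:
            break  # some failed path has no remaining suspect
        chosen.append(best)
        unexplained -= hit
    return chosen

# Hypothetical topology snapshot: path id -> interfaces traversed.
paths = {"m1-t1": ["a", "b"], "m1-t2": ["a", "c"], "m2-t1": ["d", "b"]}
failed = {"m1-t1", "m2-t1"}

suspects = candidate_faults(paths, failed)
print(sorted(suspects))
print(smallest_explanation(paths, failed, suspects))
```

Interface a is cleared because it lies on a working path; b and d remain suspects, and b alone explains both failed paths.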

Evaluation of Detection Algorithms

• Controlled experiments on the VINI testbed
  – Emulated copy of Abilene network on wide-area paths
  – Probing strategy emulates the paths that would be probed in monitor selection algorithm
  – Compare reduced set of paths to “aggressive” measurement approach

• Varied failure location and duration
  – Duration varied from 5 to 80 seconds
  – Test repeated for each failed link

• Measure detection and false alarm rates

• Preliminary experiments using data from real-world networks

Detection: Scale and Speed

• Compute reduction in the number of paths required to achieve coverage and diagnosability
  – Reduction from about 27,000 paths to 151 paths

• For real-world networks, compute corresponding reduction in cycle time
  – Reduction from about 3.5 minutes to < 5 seconds

Single-Link Failures

• More selective probing identifies more of the shorter link failures (due to shorter cycle time)

• Also results in fewer false alarms

Single-Node Failures

• Similar results to single-link failures
  – Selective measurements result in faster detection, fewer false alarms

Does Failure Confirmation Reduce the Total Number of Alarms?

• Confirmation reduces the number of failures by > 35%

• Correlation further reduces the number of alarms (by about a factor of 10)

How Quickly can Doppler Identify Failures?

• Answer: Roughly 20 seconds using the reduced set of paths

• Two main components
  – Detection/Confirmation: Time from when failure was injected to the time Doppler could detect and confirm the failure
  – Correlation: Time to group failures and construct reachability matrix

Detection and Confirmation Delay

Most failures are detected within 3-5 seconds

Correlation Delay

Reducing the number of paths to probe significantly reduces total correlation time

Summary

• Making tomography practical is challenging
  – Asynchronous measurements
  – Scale and speed
  – Changing topologies
  – Ambiguity about forward and reverse paths

• Doppler: Set of techniques to address many of these problems

• Current analysis is still performed offline
  – Many additional challenges remain to coordinate online measurements
