
Post on 27-Mar-2015


Challenges in Making Tomography Practical

Yiyi Huang, Georgia Tech
Nick Feamster, Georgia Tech
Renata Teixeira, LIP6
Christophe Diot, Thomson

2

Problem

• Network operators need to detect and isolate faults quickly, before customers complain

• Plenty of existing alarms
  – SNMP traps
  – Active probes
  – Anomaly detection systems

• Unfortunately, these alarms do not help operators locate and eliminate the faults that disrupt end-to-end paths

3

Network Tomography to the Rescue

• Send end-to-end probes through the network
• Monitor paths for differences in reachability
• Infer location of reachability problem from these differences

[Figure: Monitor probing Targets; labels x and y]

4

Some Problems

• Scalability vs. speed: Detection must be fast

• Ambiguity: Losses are one-way, but operators do not always have access to both ends of the path

• Lack of synchronization: Different monitors see different conditions

• Dynamics: Topology can change, loss can be transient

5

Doppler: Making Tomography Practical

• Fast, scalable detection
  – Solution: Monitor selection algorithm to reduce the number of monitors and targets so that “cycle times” are fast
• Transient packet loss
  – Solution: Triggered confirmation of failed paths
• One-way losses
  – Solution: New algorithm based on IP spoofing
• Dynamic routing
  – Solution: Periodic snapshots of the network topology

Controlled evaluation on VINI, plus limited wide-area experiments.

6

Fast, Scalable Detection

• Select monitors, targets to satisfy two conditions
  – All interfaces are “covered” (or diagnosable)
  – The number of monitors is small enough to ensure a short round time
• Two goals
  – Coverage: When a failure occurs, system detects it
    • Every interface is covered by at least one path
  – Diagnosability: When a failure occurs, system locates it
    • Every interface is covered by a unique set of paths
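The two goals can be checked mechanically for any proposed set of probe paths. The following is a minimal sketch, assuming each path is given as the set of interfaces it traverses; the names `covering_paths` and `check_selection` and the data layout are illustrative and do not come from the slides.

```python
from collections import defaultdict

def covering_paths(paths):
    """Map each interface to the set of path ids that traverse it."""
    cover = defaultdict(set)
    for path_id, interfaces in paths.items():
        for iface in interfaces:
            cover[iface].add(path_id)
    return cover

def check_selection(paths, all_interfaces):
    """Return (uncovered interfaces, groups of interfaces that are not diagnosable)."""
    cover = covering_paths(paths)
    # Coverage: every interface must lie on at least one probed path.
    uncovered = {i for i in all_interfaces if not cover.get(i)}
    # Diagnosability: no two interfaces may share the same set of covering paths.
    by_signature = defaultdict(set)
    for iface in all_interfaces:
        if iface not in uncovered:
            by_signature[frozenset(cover[iface])].add(iface)
    ambiguous = sorted(sorted(g) for g in by_signature.values() if len(g) > 1)
    return uncovered, ambiguous

# Hypothetical example: "a" and "b" are only ever probed together, "d" never.
example_paths = {"p1": {"a", "b"}, "p2": {"c"}}
result = check_selection(example_paths, {"a", "b", "c", "d"})
```

In this example a failure on "a" or "b" would be detected but could not be told apart, and a failure on "d" would go unnoticed.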

7

Offline Path Selection: Diagnosability

• Step 1: Compute the set of paths that cover all interfaces (greedy set cover heuristic)

• Step 2: Compute hitting set for each interface

• Step 3: Build equivalence classes for interfaces with common hitting set
  – For each interface in a set with more than one interface, find path that crosses only that interface
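The three steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: "hitting set" is read here as the set of selected paths that traverse an interface, "crosses only that interface" is read as "crosses that interface and no other member of its equivalence class", and all function names are hypothetical.

```python
from collections import defaultdict

def greedy_cover(candidates, interfaces):
    """Step 1: greedy set cover over candidate paths (path id -> set of interfaces)."""
    remaining = set(interfaces)
    chosen = []
    while remaining:
        # Pick the path covering the most still-uncovered interfaces;
        # iterating in sorted order makes ties deterministic.
        best = max(sorted(candidates), key=lambda p: len(candidates[p] & remaining))
        gain = candidates[best] & remaining
        if not gain:
            break  # the leftover interfaces lie on no candidate path
        chosen.append(best)
        remaining -= gain
    return chosen

def equivalence_classes(candidates, chosen, interfaces):
    """Steps 2-3: group interfaces traversed by exactly the same chosen paths."""
    classes = defaultdict(set)
    for iface in interfaces:
        signature = frozenset(p for p in chosen if iface in candidates[p])
        classes[signature].add(iface)
    return [group for group in classes.values() if len(group) > 1]

def select_paths(candidates, interfaces):
    """Cover all interfaces, then add paths that split ambiguous classes."""
    chosen = greedy_cover(candidates, interfaces)
    for group in equivalence_classes(candidates, chosen, interfaces):
        for iface in sorted(group):
            others = group - {iface}
            for p in sorted(candidates):
                crosses_only_iface = iface in candidates[p] and not (candidates[p] & others)
                if p not in chosen and crosses_only_iface:
                    chosen.append(p)
                    break
    return chosen

# Hypothetical example: p1 alone covers everything but cannot tell a, b, c apart.
example_candidates = {"p1": {"a", "b", "c"}, "p2": {"a"}, "p3": {"b", "d"}}
selected = select_paths(example_candidates, {"a", "b", "c"})
```

The sketch may add more paths than strictly necessary, because it does not recompute the classes after each added path.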

8

Detection, Confirmation, Correlation

• Periodic (once per 5 minutes) topology snapshot from all monitors to all destinations keeps track of underlying topology before the failure

• Detection: Periodic probes (once per “cycle time”) detect failure

• Confirmation: When a probe is lost, the monitor sends three additional probes. If all three are lost, path is determined to have failed.

• Correlation: Paths that fail within 10 seconds of one another are grouped.
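The confirmation and correlation rules above are simple enough to sketch directly. In this sketch `send_probe` is a hypothetical callable that returns True when a reply arrives, and "within 10 seconds of one another" is interpreted as "within 10 seconds of the previous failure in the group", which is an assumption about how the window is applied.

```python
def confirm_failure(send_probe, path, extra_probes=3):
    """Declare `path` failed only if every follow-up probe is also lost.

    `any` stops at the first probe that gets a reply, so a path that
    recovers is cleared without sending the remaining probes.
    """
    return not any(send_probe(path) for _ in range(extra_probes))

def correlate(failures, window=10.0):
    """Group (timestamp_seconds, path) failures that occur close together."""
    groups = []
    last_time = None
    for ts, path in sorted(failures):
        if last_time is not None and ts - last_time <= window:
            groups[-1].append(path)   # close enough to the previous failure
        else:
            groups.append([path])     # start a new group
        last_time = ts
    return groups

# Hypothetical failure log: three failures in a burst, one much later.
example_failures = [(0.0, "A"), (4.0, "B"), (13.0, "C"), (40.0, "D")]
example_groups = correlate(example_failures)
```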

9

Disambiguating One-Way Losses: Spoofing

• Monitor sends request to spoofer to send probe
• Probe has source IP address of the monitor
• If reply reaches the monitor, reverse path is working

[Figure: monitor M, target T, and a spoofer that sends a spoofed packet with the source address of M]
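The inference behind the spoofed probe can be written as a small decision function. This is a sketch of the reasoning only; it sends no packets. The `Diagnosis` names are illustrative, and the treatment of the case where no spoofed reply arrives is an assumption, since the slide only states what a received reply implies.

```python
from enum import Enum

class Diagnosis(Enum):
    PATH_OK = "round trip works"
    FORWARD_LOSS = "reverse path works, so the loss is on the forward path"
    INCONCLUSIVE = "reverse path not shown to work"

def diagnose(direct_reply, spoofed_reply):
    """Classify a monitor-target pair from two observations.

    direct_reply:  the monitor's own probe to the target was answered.
    spoofed_reply: the target's reply to the spoofer's probe (sent with the
                   monitor's source address) reached the monitor.
    """
    if direct_reply:
        return Diagnosis.PATH_OK
    if spoofed_reply:
        # The reply travelled target -> monitor, so the reverse path is up.
        return Diagnosis.FORWARD_LOSS
    # Either the reverse path is down or the spoofed probe never reached
    # the target; this observation alone cannot separate the two.
    return Diagnosis.INCONCLUSIVE
```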

10

Identification: NetDiagnoser

• Binary network tomography algorithm [Dhamdhere et al.]

• Input: hosts, destinations, topology before the failure

• Output: Set of possible locations for the fault
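To make the input and output concrete, here is a generic sketch of the binary tomography idea: links on working paths are assumed good, and a small set of remaining links is chosen greedily to explain every failed path. It is not NetDiagnoser itself, and the function name and data layout are assumptions made for illustration.

```python
def candidate_faults(paths, failed):
    """paths: path id -> set of links (topology before the failure).
    failed: set of path ids that are currently unreachable.
    Returns a list of links that together explain the failed paths."""
    # Any link on a working path is exonerated.
    working_links = set()
    for pid, links in paths.items():
        if pid not in failed:
            working_links |= links
    suspects = {pid: paths[pid] - working_links for pid in failed}
    unexplained = {pid for pid, links in suspects.items() if links}
    hypothesis = []
    while unexplained:
        # Count how many unexplained paths each suspect link would explain.
        counts = {}
        for pid in unexplained:
            for link in suspects[pid]:
                counts[link] = counts.get(link, 0) + 1
        best = max(sorted(counts), key=counts.get)
        hypothesis.append(best)
        unexplained = {pid for pid in unexplained if best not in suspects[pid]}
    return hypothesis

# Hypothetical topology: both failed paths share link "l2".
example_topology = {
    "m1-t1": {"l1", "l2"},
    "m1-t2": {"l1", "l3"},
    "m2-t1": {"l4", "l2"},
}
example_faults = candidate_faults(example_topology, failed={"m1-t1", "m2-t1"})
```

Here "l1" is cleared by the working path m1-t2, and "l2" is preferred over "l4" because it explains both failed paths.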

11

Evaluation of Detection Algorithms

• Controlled experiments on the VINI testbed
  – Emulated copy of Abilene network on wide-area paths
  – Probing strategy emulates the paths that would be probed in monitor selection algorithm
  – Compare reduced set of paths to “aggressive” measurement approach
• Varied failure location and duration
  – Duration varied from 5 to 80 seconds
  – Test repeated for each failed link
• Measure detection and false alarm rates
• Preliminary experiments using data from real-world networks

12

Detection: Scale and Speed

• Compute reduction in the number of paths required to achieve coverage and diagnosability
  – Reduction from about 27,000 paths to 151 paths
• For real-world networks, compute corresponding reduction in cycle time
  – Reduction from about 3.5 minutes to < 5 seconds
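As a quick check on the scale of these reductions, the figures above can be turned into ratios. The inputs are the approximate numbers from the slide, so the results are rough, and the cycle-time ratio is a lower bound because the reduced cycle time is only given as "< 5 seconds".

```python
paths_before, paths_after = 27_000, 151        # "about 27,000" -> 151 paths
cycle_before_s, cycle_after_s = 3.5 * 60, 5    # "about 3.5 minutes" -> "< 5 seconds"

path_factor = paths_before / paths_after       # how many times fewer paths
cycle_factor = cycle_before_s / cycle_after_s  # lower bound on the speed-up

print(f"roughly {path_factor:.0f}x fewer paths")
print(f"at least {cycle_factor:.0f}x shorter cycle time")
```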

13

Single-Link Failures

• More selective probing identifies more of the shorter link failures (due to shorter cycle time)

• Also results in fewer false alarms

14

Single-Node Failures

• Similar results to single-link failures
  – Selective measurements result in faster detection, fewer false alarms

15

Does Failure Confirmation Reduce the Total Number of Alarms?

• Confirmation reduces the number of failures by > 35%
• Correlation further reduces the number of alarms (by about a factor of 10)

16

How Quickly can Doppler Identify Failures?

• Answer: Roughly 20 seconds using the reduced set of paths

• Two main components
  – Detection/Confirmation: Time from when failure was injected to the time Doppler could detect and confirm the failure
  – Correlation: Time to group failures and construct reachability matrix
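The reachability matrix built during correlation can be pictured as one row per monitor and one column per target. The sketch below assumes that layout and a set of confirmed failed (monitor, target) pairs as input; neither detail is specified in the slides.

```python
def reachability_matrix(monitors, targets, failed_pairs):
    """Rows are monitors, columns are targets; 1 = reachable, 0 = confirmed failed."""
    return [
        [0 if (m, t) in failed_pairs else 1 for t in targets]
        for m in monitors
    ]

# Hypothetical group of correlated failures: both monitors lost target t2.
example_matrix = reachability_matrix(
    ["m1", "m2"], ["t1", "t2", "t3"], {("m1", "t2"), ("m2", "t2")}
)
```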

17

Detection and Confirmation Delay

Most failures are detected within 3-5 seconds

18

Correlation Delay

Reducing the number of paths to probe significantly reduces total correlation time

19

Summary

• Making tomography practical is challenging
  – Asynchronous measurements
  – Scale and speed
  – Changing topologies
  – Ambiguity about forward and reverse paths

• Doppler: Set of techniques to address many of these problems

• Current analysis is still performed offline
  – Many additional challenges remain to coordinate online measurements
