internet measurements: fault detection, identification ... · internet measurements: fault...

87
Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS and UPMC Sorbonne Universités

Upload: nguyentruc

Post on 02-Jul-2018

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Internet measurements: Fault detection, identification,

and topology discovery

Renata Teixeira Laboratoire LIP6

CNRS and UPMC Sorbonne Universités

Page 2: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

2

Internet problems are hard to troubleshoot

LIP6 network

Net1 Net2

Net3

Is my connection ok?

Is it google?

Is the problem in one of the networks in path?

DNS? NAT?

DHCP? ???

Page 3: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Nobody has the full picture to diagnose problems

§ End-users can’t know what happens in network § Network operators can’t know user experience

3

Net1 Net2

Net3 LIP6 network

Page 4: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

End-to-end measurements to the rescue

§ Available data depends on who you are –  Network operators: Data from their own network –  End-users: Data from their machine

§ End-to-end measurements compensate for missing data –  Network operators: Deploy monitoring hosts –  End-users: Monitor some paths and cooperate with other users

§ Troubleshooting is mostly manual and ad-hoc

4

Page 5: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Goal Automatically troubleshoot

network faults and performance disruptions

§ Automatic detection: is there a problem? –  Detect problems before users –  How to detect problems that users care about?

§ Fault identification: where is the problem? –  Most common: traceroute –  Assist in pinpointing the location of the problem –  How to measure to obtain accurate location?

5

Page 6: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Overview § Topology discovery

–  Topology mapping with traceroutes –  Tracking topology changes

§ Detection –  Active versus passive fault detection techniques –  Performance problems as perceived by users

§ Identification

–  Network tomography for fault diagnosis

6

Page 7: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Overview § Topology discovery

–  Topology mapping with traceroutes –  Tracking topology changes

§ Detection –  Active versus passive fault detection techniques –  Performance problems as perceived by users

§ Identification

–  Network tomography for fault diagnosis

7

Page 8: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Measuring router topology

§ With access to routers (or “from inside”) –  Topology of one network –  Routing monitors (OSPF or IS-IS)

§ No access to routers (or “from outside”) –  Multi-network topology or from end-hosts –  Monitors issue active probes: traceroute

8

Page 9: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

9

Topology from inside

§ Routing protocols flood state of each link –  Periodically refresh link state –  Report any changes: link down, up, cost change

§ Monitor listens to link-state messages –  Acts as a regular router

•  AT&T’s OSPFmon or Sprint’s PyRT for IS-IS

§ Combining link states gives the topology –  Easy to maintain, messages report any changes

[Mortier] [Shaikh, 2004]

Page 10: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Inferring a path from outside: traceroute

10

A B

TTL = 1

A.1 A.2 B.2 B.1

TTL = 2

TTL exceeded from A.1

TTL exceeded from B.1

Actual path

Inferred path A.1 B.1

m t

m t

Page 11: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

A traceroute path can be incomplete

§ Load balancing is widely used –  Traceroute only probes one path

§ Sometimes taceroute has no answer (stars) –  ICMP rate limiting –  Anonymous routers

§ Tunnelling (e.g., MPLS) may hide routers –  Routers inside the tunnel may not decrement TTL

11

Page 12: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Paris traceroute

B. Augustin, X. Cuvellier, B. Orgogozo, F. Viger, T. Friedman, M. Latapy,

C. Magnien (LIP6) D. Veitch (U. Melbourne)

Page 13: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

13

Traceroute under load balancing

L

B

A C

D

L

A.1

D.1

C.1

TTL = 2

TTL = 3

B.1

E

E.2

Missing nodes and links

False link

Actual path

Inferred path

m

m t

t

Page 14: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

14

Errors happen even under per-flow load balancing

L

B

A C

D

TTL = 2 Port 2

TTL = 3 Port 3

E

§ Traceroute uses the destination port as identifier –  Needs to match probe to response –  Response only has the header of the issued probe

Flow 1

m t

Page 15: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

15

Paris traceroute: Removing false links

§ Solves the problem with per-flow load balancing –  Probes to a destination belong to same flow

§ Changes the location of the probe identifier –  Use the UDP checksum

L

B

A C

D

TTL = 2 Port 1

TTL = 3 Port 1 E Checksum 3 Checksum 2 m t

Page 16: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Paris traceroute: Tracing all the paths

§ Adaptive probing strategy –  Vary flow id of probes –  Send enough probes to enumerate all interfaces with

high confidence

16

L

B

A C

D

E m t

Port 1

Port 2

Port 3

Port 4

Port 5

Port 6

Page 17: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

4 2 1

1

Topology from traceroutes

§  Inferred nodes = interfaces, not routers

§ Coverage depends on monitors and targets –  Misses links and routers –  Some links and routers appear multiple times

17

1 A

D

3 B 2

3

2

3 1 m1

t1

m2

t2

C

Actual topology

A.1 m1 t1

m2 t2

Inferred topology

C.1 D.1

C.2

B.3

2

Page 18: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Alias resolution: Map interfaces to routers

§ Direct probing –  Probe an interface, may receive

response from another

–  Responses from the same router will have close IP identifiers and same TTL

§ Record-route IP option

–  Records up to nine IP addresses of routers in the path

18

A.1 m1 t1

m2 t2

Inferred topology

C.1 D.1

C.2

B.3

same router

[Spring, 2002] [Sherwood, 2008]

Page 19: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Large-scale topology measurements

§ Probing a large topology takes time –  E.g., probing 1200 targets from PlanetLab nodes

takes 5 minutes on average (using 30 threads) –  Probing more targets covers more links –  But, getting a topology snapshot takes longer

§ Snapshot may be inaccurate –  Paths may change during snapshot

§ Hard to get up-to-date topology –  To know that a path changed, need to re-probe

19

Page 20: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Faster topology snapshots

§ Probing redundancy –  Intra-monitor –  Inter-monitor

§ Doubletree –  Combines backward and

forward probing to eliminate redundancy

20

A

D

B

m1 t1

m2

t2

C

[Donnet, 2005]

Page 21: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Summary: Topology discovery § Network operators

–  Own network: routing messages –  Neighbor networks: traceroutes

§ End users: combining traceroutes –  Be aware of inaccuracies

•  False or missing links and nodes •  Hidden hops: stars, tunneling

21

Page 22: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Overview § Topology discovery

–  Topology mapping with traceroutes –  Tracking topology changes

§ Detection –  Active versus passive fault detection techniques –  Performance problems as perceived by users

§ Identification

–  Network tomography for fault diagnosis

22

Page 23: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Predicting and tracking Internet path changes

I. Cunha (UPMC/Technicolor) D. Veitch (U. Melbourne)

C. Diot (Technicolor)

Page 24: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Challenges of tracking topology with traceroute

§ Cannot measure fast enough to detect all changes –  Network and system limitations

§ Accurate measurements require extra probes –  Identify all paths under load balancing

24

Page 25: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Frequent vs. accurate measurements

25

Frequency

Accu

racy

Paris traceroute

Traceroute Tracetree Doubletree

High

Hig

h

Low

Page 26: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Our approach § Observation: Internet paths are mostly stable

–  Current techniques waste probes § Probe according to path stability

§ Separate tasks –  Change detection: lightweight probing for speed –  Path remapping: accuracy with Paris traceroute

26

Page 27: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Our contributions

§ NN4: Predicting Internet path changes –  Distinguish between stable and unstable paths

§ DTrack: Tracking Internet path changes –  Lightweight probing process to detect changes –  Allocates more probes to unstable paths

27

Page 28: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Predicting path changes

§ Prediction goals –  Time until the next change –  Number of changes in a time interval –  Whether a path will change in a time interval

§ Identify features that can help with prediction –  Features must be computable from traceroutes

•  Characteristics of the current path •  Characteristics of the last path change •  Behavior of the path in the recent past

28

Page 29: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Feature selection

§ RuleFit to identify feature importance –  Fraction of time path active in the past (prevalence) –  Number of changes in the past –  Number of previous occurrences of the current path –  Path age

§ Four most important features carry all the predictive information

29

Page 30: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

NN4 predictor § RuleFit is hard to integrate in other systems

§ NN4 is based on the nearest-neighbor scheme –  Compute neighbors by partitioning the path feature “state-space” –  Prediction computed as the average behavior of all neighbors

30

Changes in the past

Prev

alen

ce

Page 31: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

NN4: summary

§ NN4 is lightweight, easy to integrate, and as accurate as RuleFit

§ Prediction is not highly accurate –  It is possible to distinguish unstable/stable paths

31

Page 32: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

DTrack

§ Goal: Given a probing budget, detect as many changes as possible

§ Allocates probing rates per path using NN4’s predictions

§ Targets probes along each path –  Reduce redundant probes at shared links –  Spread probes over time

32

Page 33: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Probe rate allocation § Allocate rates that minimize number of missed changes

§ Model changes in each path as a Poisson process –  Estimate the rate of changes using NN4

§ Compute missed changes as function of probing rate

33

Time

Probing interval Path changes

Page 34: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Probe targeting overview

34 D1 D2

D3

Page 35: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

DTrack summary § Detects more changes than optimized traceroute

–  2.2 times more changes in trace-driven simulations –  5 times more changes in PlanetLab

35

Frequency

Accu

racy

Paris traceroute

Traceroute Tracetree Doubletree

High

Hig

h

Low

DTrack

Page 36: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Overview § Topology discovery

–  Topology mapping with traceroutes –  Tracking topology changes

§ Detection –  Active versus passive fault detection techniques –  Performance problems as perceived by users

§ Identification

–  Network tomography for fault diagnosis

36

Page 37: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Faults versus performance disruptions

§ Faults are persistent reachability problems –  Cannot connect

§ Performance disruptions –  Web browsing is slow –  Audio is choppy –  Video is stalling

37

Page 38: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Fault detection techniques

§ Active probing: ping –  Send probe, collect response –  From any end host

•  Works for network operators and end users

§ Passive analysis of user’s traffic –  Tap incoming and outgoing traffic

•  At user’s machines or servers: tcpdump, pcap •  Inside the network: DAG card

–  Monitor status of TCP connections

38

Page 39: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Detection with ping

§ If receives reply –  Then, path is good

§ If no reply before timeout –  Then, path is bad

39

m

t probe ICMP

echo request

reply ICMP

echo reply

Page 40: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Persistent failure or measurement noise?

§ Many reasons to lose probe or reply –  Timeout may be too short –  Rate limiting at routers –  Some end-hosts don’t respond to ICMP request –  Transient congestion –  Routing change

§ Need to confirm that failure is persistent –  Otherwise, may trigger false alarms

40

Page 41: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

§ Upon detection of a failure, trigger extra probes § Goal: minimize detection errors

–  Sending more probes –  Waiting longer between probes

§ Tradeoff: detection error and detection time

41

Failure confirmation

time

loss burst packets on

a path

Detection error [Cunha, 2009]

Page 42: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Passive detection at end hosts § tcpdump/pcap captures packets § Track status of each TCP connection

–  RTTs, timeouts, retransmissions § Multiple timeouts indicate path is bad

42

–  If current seq. number > last seq. number seen •  Path is good

–  If current seq. number = last seq. number seen •  Timeout has occurred •  After four timeouts, declare path as bad

[Zhang, 2004]

Page 43: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Passive detection inside the network is hard

§ Traffic volume is too high –  Need special hardware

•  DAG cards can capture packets at high speeds –  May lose packets

§ Tracking TCP connections is hard –  May not capture both sides of a connection –  Large processing and memory overhead

43

Page 44: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Passive vs. active detection Passive

+ No need to inject traffic + Detects all failures that affect user’s traffic

+ Responses from targets that don’t respond to ping

Active

+ No need to tap user’s traffic + Detects failures in any desired path

44

‒ Not always possible to tap user’s traffic ‒ Only detects failures in paths with traffic

‒ Probing overhead –  Cover a large number of paths –  Detect failures fast

Page 45: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Is failure in forward or reverse path?

§ Paths can be asymmetric –  Load balancing –  Hot-potato routing

45

m

t probe

reply

Page 46: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Disambiguating one-way losses: Spoofing

§ Monitor requests to spoofer to send probe

§ Spoofer sends spoofed probe with source address of the monitor

§  If reply reaches the monitor, reverse path is good

46

m

t

Spoofer

[Katz-Bassett, 2008]

Page 47: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Limits of spoofing

§ Network operators often drop spoofed packets –  Spoofed packets are normally used for attacks

47

m

t § Placement of spoofer

–  Paths from spoofer to targets need to be independent than paths from monitors

Page 48: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Summary: Fault detection § End users: passive plus active probing

–  Passive measurements capture user’s experience –  Active probes

•  When path has no traffic •  When TCP connections are too short

§ Network operators: alarms plus active probing –  Alarm systems directly report many faults –  Active monitoring to capture customer’s experience

•  Detect blackholes (i.e., faults that don’t appear in alarms) •  Detect faults in other networks

48

Page 49: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Overview § Topology discovery

–  Topology mapping with traceroutes –  Tracking topology changes

§ Detection –  Active versus passive fault detection techniques –  Performance problems as perceived by users

§ Identification

–  Network tomography for fault diagnosis

49

Page 50: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Understanding Users’ Perception of Performance Disruptions at End-Hosts

Diana Joumblatt Laboratoire LIP6 -- CNRS and UPMC Sorbonne Universités

Jaideep Chandrashekar, Nina Taft

Technicolor

Page 51: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

User experience vs. typical network performance metrics

§ Typical network performance metrics –  Round-trip times (RTTs) –  Throughput –  Losses

§ User perspective only in niche applications –  Voice –  Video –  Gaming

§ What is missing? –  User perspective of overall online experience

51

Page 52: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Why user perspective matters?

§  Automated diagnosis –  Avoid unnecessary diagnosis when users don’t care –  Improve diagnosis when network performance

seems ok and the user is annoyed

§ Performance management –  Adapt application performance to user needs

52

Page 53: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Challenges in measuring user perception

§ User perception varies –  Per user, per environment, per application –  For a given user according to external factors

§ Imbalance in number of samples –  Can’t collect frequent user feedback (~10 per day) –  Orders of magnitude more network measurements

(~103/104/105/…)

§ End-host data collection raises issues –  Privacy –  Machine overhead

53

Page 54: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Our approach

§ Measure network performance with user feedback

–  HostView: end-host data collection

§ Correlate network performance with user feedback

54

Page 55: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

HostView overview

55

End-host

User feedback

System performance

Front-end server

Upload traces

Network environment

Transfer to backend

Network traffic

Application context

Network environment

Traces Every 4 hours 100 kbps max.

Back-end server

Traces Traces

Privacy

Page 56: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

User feedback mechanisms § System-trigged feedback

–  Experience sampling methodology (ESM) –  Triggered based on state of machine –  5 short questions about network performance –  At most 3 times a day

§ User-triggered feedback –  “I’m annoyed” button –  Same questions as in ESM –  Can trigger as often as user wants

56

L

Page 57: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Example question

57

Page 58: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Deployment § Recruiting volunteers

–  Leaflets at IMC 2010 and CS Mailing lists –  50 USD Amazon gift cards –  Real-time feedback about network connection

§ Data: 40 users –  Nov 2010 – Feb 2011 –  26 Mac OS and 14 Linux –  14 countries –  Most users ran tool for one month

58

Page 59: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

User vs. network reporting

§ User perspective –  Good and poor performance epochs as flagged by

the user

§ Network and system Perspective –  Good and poor performance epochs as flagged by

lower-level metrics that trigger atypical (anomalous) behavior automatically

§ Question: Do these co-occur?

Page 60: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Anomalous performance

60

§ Metrics –  RTTs –  TCP retransmissions

–  Wireless noise level –  Machine CPU load –  Data rates

–  Wireless signal strength –  Any instance of TCP reset

Below 5th percentile

Above 95th percentile

Page 61: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

-0.1 0

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1 1.1

-2 -1 0 +1 +2

-0.1 0

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1 1.1

-2 -1 0 +1 +2

-0.1 0

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1 1.1

-2 -1 0 +1 +2

-0.1 0

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1 1.1

-2 -1 0 +1 +2

Can’t connect to some sites or services

61

Time in minutes

-0.1 0

-2 -1 0 +1 +2

0.5

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1.1 1

-0.1 0

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1 1.1

-2-10+1+2

RTT TCP RST TCP RETR

-0.1 0

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1 1.1

-2 -1 0 +1 +2

-0.1 0

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1 1.1

-2 -1 0 +1 +2

0 = 5th%

1 = 95th%

Page 62: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Everything is good!

62

-0.1 0

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1 1.1

-2 -1 0 +1 +2

Time in minutes

-0.1 0

1.1 1

-2 0 +1 +2

0.5

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

-1

RTT Data Rate

Page 63: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Matching performance reports

63

Yes No No feedback

Yes

User doesn’t

care

Missing user

feedback

No

Not reporting

correct system metrics

Missing user

feedback

C

User report

Machine anomaly

C

95 reports 352 reports

82 reports 313 reports

39 reports 13 reports

Page 64: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Summary: Correlating user feedback with network performance § Hard to get feedback from users

–  Many network performance samples without feedback

–  Users are diverse in how they report a problem

§ Raw network metrics alone are not enough –  Not all outliers affect the user perception

64

Page 65: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Next steps

§ Incorporate user context to define anomaly –  Application –  Environment –  What annoys the user

§ Apply machine learning techniques –  Use HostView data to train models of user

perception

§ Build online detector of poor user experience

65

Page 66: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Overview § Topology discovery

–  Topology mapping with traceroutes –  Tracking topology changes

§ Detection –  Active versus passive fault detection techniques –  Performance problems as perceived by users

§ Identification

–  Network tomography for fault diagnosis

66

Page 67: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Fault identification with traceroute

§ Forwarding loops and some reachability problems

§ Traceroute identifies the effect, not the cause

67

L

B

A C

D

E m t

Page 68: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Network tomography to infer link performance

§ What are the properties of network links? –  Loss rate –  Delay –  Bandwidth –  Connectivity

§ Given end-to-end measurements –  No access to routers

68

D F

E

A C

B

AS 2

AS 1

Page 69: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

The origins

§ MINC: Multicast-based Inference of Network-internal Characteristics

§ Key idea: multicast probes –  Exploit correlation in traces to

estimate link properties

69

probe sender

probe collectors

[MINC project, 1999]

Page 70: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Binary tomography § Given

–  Complete network topology –  Path reachability

§ Find the smallest set of links that explains bad paths

–  If link is bad, all paths that cross the links are bad

–  Given bad links are uncommon

70

m

t1 t2

bad

good bad

[Duffield, 2006]

Page 71: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Fault identification with binary tomography

§ Fault monitoring needs multiple sources and targets

§ Problem becomes NP-hard –  Minimum hitting set problem

§ Iterative greedy heuristic –  Given the set of links in bad paths

–  Iteratively choose link that explains the max number of bad paths

71

m2

t1 t2

m1

Hitting set of link = paths that traverse

the link

[Kompella, 2007] [Dhamdhere, 2007]

Page 72: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Practical issues § Topology is often unknown

–  Need to measure accurate topology § Multicast not available

–  Need to extract correlation from unicast probes –  Even using probes from different monitors

§ Control of targets is not always practical –  Need one-way performance from round-trip probes

§ Links can fail for some paths, but not all –  Need to extend tomography algorithms

72

Page 73: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Uncorrelated measurements lead to errors

§ Lack of synchronization leads to inconsistencies

–  Probes cross links at different times

–  Path may change between probes

73

m

t1 t2

mistakenly inferred failure

Page 74: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

74

Sources of inconsistencies

§ In measurements from a single monitor –  Probing all targets can take time

§ In measurements from multiple monitors –  Hard to synchronize monitors for all probes to reach

a link at the same time –  Impossible to generalize to all links

Page 75: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Inconsistent measurements with multiple monitors

75

m1

t1 tN

mK

mK,t1

mK, tN

… m1,t1

m1, tN

path reachability

good

good

… good

bad …

inconsistent

measurements

Page 76: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Solution: Reprobe paths after failure

76

§ Consistency has a cost –  Delays fault identification –  Cannot identify short failures

m1

t1 tN

mK

mK,t1

mK, tN …

m1,t1

m1, tN

path reachability

good

bad

… good

bad

[Cunha, 2009]

Page 77: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Summary: Tomography

§ Trade-off: consistency vs. identification speed –  Faster identification leads to false alarms –  Slower identification misses short failures

§ Network operators –  Too many false alarms are unmanageable –  Longer failures are the ones that need intervention

§ End users –  Even short failures affect performance

77

Page 78: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Thank you!

§ HostView: http://cmon.lip6.fr/EMD –  Measure end-host performance with user feedback –  Mac OS and Linux

§ HomeNet Profiler: http://cmon.lip6.fr/hnp –  One-shot test –  Home network configuration –  WiFi performance –  Internet access performance

78

Page 79: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

REFERENCES

79

Page 80: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Topology from inside

§  IS-IS monitoring –  R. Mortier, “Python Routeing Toolkit (`PyRT')”, https://

research.sprintlabs.com/pyrt/

§ OSPF monitoring –  A. Shaikh and A. Greenberg, “OSPF Monitoring: Architecture,

Design and Deployment Experience”, NSDI 2004

§ Commercial products –  Packet Design: http://www.packetdesign.com/

80

Page 81: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Topology with traceroute §  Tracing accurate paths under load-balancing

–  B. Augustin et al., “Avoiding traceroute anomalies with Paris traceroute”, IMC, 2006.

–  D. Veitch, B. Augustin, R. Teixeira, and T. Friedman, " Failure Control in Multipath Route Tracing", in Proc. of IEEE Infocom, April 2009.

§  Reducing overhead to trace topology of a network and alias resolution with direct probing

–  N. Spring, R. Mahajan, and D. Wetherall, “Measuring ISP Topologies with Rocketfuel”, SIGCOMM 2002.

§  Use of record route to obtain more accurate topologies –  R. Sherwood, A. Bender, N. Spring, “DisCarte: A Disjunctive Internet

Cartographer”, SIGCOMM, 2008.

81

Page 82: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Optimizing topology discovery

§ Reducing overhead to take a topology snapshot –  B. Donnet, P. Raoult, T. Friedman, and M. Crovella, “Efficient

Algorithms for Large-Scale Topology Discovery”, SIGMETRICS, 2005.

§ Tracking topology changes –  I. Cunha, R. Teixeira, D. Veitch, and C. Diot, "Predicting and Tracking

Internet Path Changes, in Proc. of ACM SIGCOMM, August 2011.

82

Page 83: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Reducing overhead of active fault detection

§ Selection of paths to probe –  H. Nguyen and P. Thiran, “Active measurement for multiple link

failures diagnosis in IP networks”, PAM, 2004. –  Yigal Bejerano and Rajeev Rastogi, “Robust monitoring of link

delays and faults in IP networks”, INFOCOM, 2003. § Selection of the frequency to probe paths

–  H. X. Nguyen , R. Teixeira, P. Thiran, and C. Diot, " Minimizing Probing Cost for Detecting Interface Failures: Algorithms and Scalability Analysis", INFOCOM, 2009.

83

Page 84: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Performance disruptions as perceived by users

§ HostView design –  D. Joumblatt, R. Teixeira, J. Chandrashekar, N. Taft,

"HostView: Annotating end-host performance measurements with user feedback", in Proc. of ACM HotMetrics Workshop, June 2010.

§ Correlation of network performance and user feedback –  D. Joumblatt, R. Teixeira, J. Chandrashekar, N. Taft,

"Performance of Networked Applications: The Challenges in Capturing the User's Perception", in Proc. of ACM SIGCOMM Workshop on Measurements Up the Stack, August 2011.

84

Page 85: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Network tomography theory

§  Survey on network tomography –  R. Castro, M. Coates, G. Liang, R. Nowak, and B. Yu, “Network

Tomography: Recent Developments”, Statistical Science, Vol. 19, No. 3 (2004), 499-517.

§  Traffic matrix estimation –  Y. Vardi, “Network Tomography: Estimating Source-Destination Traffic

Intensities from Link Data”, Journal of the American Statistical Association, Vol. 91, 1996.

§  Inference of link performance/connectivity –  MINC project: http://gaia.cs.umass.edu/minc/ –  A. Adams et al., “The Use of End-to-end Multicast Measurements for

Characterizing Internal Network Behavior”, IEEE Communications Magazine, May 2000.

85

Page 86: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Binary tomography §  Single-source tree algorithm

–  N. Duffield, “Network Tomography of Binary Network Performance Characteristics”, IEEE Transactions on Information Theory, 2006.

§  Applying tomography in one network –  R. R. Kompella, J. Yates, A. Greenberg, A. C. Snoeren, “Detection

and Localization of Network Blackholes”, IEEE INFOCOM, 2007.

§  Applying tomography in multiple network topology –  A. Dhamdhere, R. Teixeira, C. Dovrolis, and C. Diot,

“NetDiagnoser:Troubleshooting network unreachabilities using end-to-end probes and routing data”, CoNEXT, 2007.

§  Obtaining accurate path status for binary tomography –  I. Cunha, R. Teixeira, N. Feamster, C. Diot, "Measurement Methods

for Fast and Accurate Blackhole Identification with Binary Tomography", in Proc. of ACM Internet Measurement Conference, November 2009.

86

Page 87: Internet measurements: Fault detection, identification ... · Internet measurements: Fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS

Internet-wide fault detection systems

§  Detection with BGP monitoring plus continuous pings, spoofing to disambiguate one-way failures, traceroute to locate faults

–  E. Katz-Bassett, H. V. Madhyastha, J. P. John, A. Krishnamurthy, D. Wetherall, T. Anderson, “Studying Black Holes in the Internet with Hubble”, NSDI, 2008.

§  Detection with passive monitoring of traffic of peer-to-peer systems or content distribution networks, traceroutes to locate faults

–  M. Zhang, C. Zhang, V. Pai, L. Peterson, and R. Wang, “PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services”, OSDI, 2004.

87