Internet measurements: fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS and UPMC Paris Universitas

Post on 08-Jan-2016

TRANSCRIPT

Page 1: Internet measurements: fault detection, identification, and topology discovery

Internet measurements: fault detection, identification, and topology discovery

Renata Teixeira
Laboratoire LIP6
CNRS and UPMC Paris Universitas

Page 2: Internet measurements: fault detection, identification, and topology discovery

Internet monitoring is essential

For network operators
– Monitor service-level agreements
– Fault diagnosis
– Diagnose anomalous behavior

For users or content/application providers
– Verify network performance
– Verify network neutrality

Page 3: Internet measurements: fault detection, identification, and topology discovery

Network operators can’t know the user’s experience

Network operators only have data of one AS
– AS4 doesn’t detect any problem
– AS3 doesn’t know who is affected by the failure

[Figure: four networks, AS1, AS2, AS3, and AS4]

Page 4: Internet measurements: fault detection, identification, and topology discovery

End users can’t know what happens in the network

End-hosts can only monitor end-to-end paths

[Figure: four networks, AS1, AS2, AS3, and AS4]

Page 5: Internet measurements: fault detection, identification, and topology discovery

Network tomography to rescue

Network operators
– Monitor network paths
– From monitoring hosts
• In network
• Third-party monitoring services
– From home gateways

End users
– Cooperative monitoring
– Among end users
– From users to popular services

http://www.nanodatacenters.eu
http://cmon.grenouille.com

Inference of unknown network properties from measurable ones

Page 6: Internet measurements: fault detection, identification, and topology discovery

Fault diagnosis using end-to-end measurements

Faults are persistent reachability problems

– Detection: continuous path monitoring
– Identification: binary tomography

Page 7: Internet measurements: fault detection, identification, and topology discovery

Outline

Background in network tomography

Fault detection
– Active vs. passive measurements
– Reducing overhead of active measurements
– Disambiguating one-way failures

Fault identification using binary tomography
– Correlated path reachability
– Topology discovery

Open issues

Page 8: Internet measurements: fault detection, identification, and topology discovery

Network tomography to infer link performance

What are the properties of network links?
– Loss rate
– Delay
– Bandwidth
– Connectivity

Given end-to-end measurements
– No access to routers

[Figure: example topology with routers A to F spread over AS 1 and AS 2]

Page 9: Internet measurements: fault detection, identification, and topology discovery

The origins

MINC: Multicast-based Inference of Network-internal Characteristics

Key idea: multicast probes
– Exploit correlation in traces to estimate link properties

[Figure: a probe sender multicasting to several probe collectors]

[MINC project, 1999]

Page 10: Internet measurements: fault detection, identification, and topology discovery

Inferring link loss rates

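Assumptions
– Known, logical-tree topology
– Losses are independent
– Multicast probes

Method
– Maximum likelihood estimates for αk

[Figure: logical tree from sender m to receivers t1 and t2. The links carry success probabilities α1, α2, α3; the per-probe outcomes recorded at the receivers (sequences of 1s and 0s) yield the estimated success probabilities α̂1, α̂2, α̂3.]

[Adams, 2000]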
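
The estimation step can be illustrated on the smallest case, a tree with one shared link and two receivers. The sketch below is not the MINC code; it assumes (hypothetically) that α1 is the shared link out of m, α2 leads to t1, α3 leads to t2, and that losses are independent. Under those assumptions the observed receipt frequencies satisfy P(t1) = α1·α2, P(t2) = α1·α3 and P(both) = α1·α2·α3, which can be solved directly for the three link success probabilities.

```python
def estimate_link_success(outcomes):
    """Estimate (alpha1, alpha2, alpha3) on a two-receiver tree.

    outcomes: one (r1, r2) pair per multicast probe, where r1 / r2 is 1 if
    receiver t1 / t2 got the probe and 0 otherwise.
    Assumed labeling: alpha1 = shared link, alpha2 -> t1, alpha3 -> t2.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one probe")
    p1 = sum(r1 for r1, _ in outcomes) / n         # estimates alpha1 * alpha2
    p2 = sum(r2 for _, r2 in outcomes) / n         # estimates alpha1 * alpha3
    p12 = sum(r1 * r2 for r1, r2 in outcomes) / n  # estimates alpha1 * alpha2 * alpha3
    if p12 == 0:
        raise ValueError("no probe reached both receivers; estimates undefined")
    return p1 * p2 / p12, p12 / p2, p12 / p1


# 100 probes: 36 reach both receivers, 36 only t1, 4 only t2, 24 neither.
probes = [(1, 1)] * 36 + [(1, 0)] * 36 + [(0, 1)] * 4 + [(0, 0)] * 24
alpha1, alpha2, alpha3 = estimate_link_success(probes)
```

The estimate relies on both receivers seeing the same probe on the shared link, which is the correlation that multicast provides.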

Page 11: Internet measurements: fault detection, identification, and topology discovery

Binary tomography

Labels links as good or bad
– Loss-rate estimation requires tight correlation
– Instead, separate good/bad performance
– If link is bad, all paths that cross the link are bad

[Figure: tree from m to t1 and t2 with links α1, α2, α3; each path is classified as good or bad from its probe outcomes]

[Duffield, 2006]

Page 12: Internet measurements: fault detection, identification, and topology discovery

Single-source tree

“Smallest Consistent Failure Set” algorithm
– Assumes a single-source tree and known topology
– Find the smallest set of links that explains bad paths
• Given bad links are uncommon
• Bad link is the root of maximal bad subtree

[Figure: tree from m to t1 and t2 with good and bad paths; the link heading the bad subtree is marked bad]

[Duffield, 2006]
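
The "root of maximal bad subtree" rule can be sketched as a short tree walk. This is an illustrative reading of the slide, not the published algorithm's code; the `children` map, the node names, and the function name are hypothetical. A link is reported when every receiver below it is bad and the link above it still has at least one good receiver.

```python
def smallest_consistent_failure_set(children, root, bad_receivers):
    """Return the links (parent, child) that head maximal all-bad subtrees.

    children: dict mapping a node to the list of its child nodes.
    bad_receivers: set of leaf nodes whose path from the root is bad.
    """
    def all_bad(node):
        kids = children.get(node, [])
        if not kids:                      # a leaf is a receiver
            return node in bad_receivers
        return all(all_bad(kid) for kid in kids)

    blamed = []

    def walk(node):
        for kid in children.get(node, []):
            if all_bad(kid):
                blamed.append((node, kid))  # topmost link of a fully bad subtree
            else:
                walk(kid)                   # some receiver below is good: go deeper

    walk(root)
    return blamed


tree = {"m": ["a"], "a": ["b", "t3"], "b": ["t1", "t2"]}
print(smallest_consistent_failure_set(tree, "m", {"t1", "t2"}))  # [('a', 'b')]
```

Blaming the single link (a, b) explains both bad paths with one failure, whereas blaming the two leaf links would need two.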

Page 13: Internet measurements: fault detection, identification, and topology discovery

Fault identification with binary tomography

Fault monitoring needs multiple sources and targets

Problem becomes NP-hard
– Minimum hitting set problem

Iterative greedy heuristic
– Given the set of links in bad paths
– Iteratively choose link that explains the max number of bad paths

Hitting set of link = paths that traverse the link

[Figure: monitors m1 and m2 probing targets t1 and t2]

[Kompella, 2007] [Dhamdhere, 2007]
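
A minimal sketch of the greedy heuristic follows. It is not taken from the cited papers: the path and link names are made up, and the step that discards links seen on good paths is an assumption based on the binary-tomography rule that a bad link makes every path through it bad.

```python
def greedy_bad_links(paths, bad_paths):
    """Greedily pick links that explain the bad paths.

    paths: dict mapping a path name to the list of links it traverses.
    bad_paths: set of path names observed as bad.
    """
    # Assumption: a link that appears on any good path cannot be the faulty one.
    good_links = set()
    for name, links in paths.items():
        if name not in bad_paths:
            good_links.update(links)

    unexplained = set(bad_paths)
    chosen = []
    while unexplained:
        # Count, for each candidate link, how many unexplained bad paths cross it.
        score = {}
        for name in unexplained:
            for link in paths[name]:
                if link not in good_links:
                    score[link] = score.get(link, 0) + 1
        if not score:
            break  # remaining bad paths cannot be explained by any candidate
        best = max(sorted(score), key=score.get)  # sorted() makes ties deterministic
        chosen.append(best)
        unexplained = {n for n in unexplained if best not in paths[n]}
    return chosen


paths = {
    "m1-t1": ["a", "c"], "m1-t2": ["a", "d"],
    "m2-t1": ["b", "c"], "m2-t2": ["b", "d"],
}
print(greedy_bad_links(paths, {"m1-t1", "m2-t1"}))  # ['c']
```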

Page 14: Internet measurements: fault detection, identification, and topology discovery

Practical issues

Topology is often unknown
– Need to measure accurate topology

Multicast not available
– Need to extract correlation from unicast probes
– Even using probes from different monitors

Control of targets is not always practical
– Need one-way performance from round-trip probes

Links can fail for some paths, but not all
– Need to extend tomography algorithms

Page 15: Internet measurements: fault detection, identification, and topology discovery

Outline

Background in network tomography

Fault detection with no control of targets
– Active vs. passive measurements
– Reducing overhead of active measurements
– Disambiguating one-way failures

Fault identification using binary tomography
– Correlated path reachability without multicast
– Topology discovery

Open issues

Page 16: Internet measurements: fault detection, identification, and topology discovery

Detection techniques

Active probing: ping
– Send probe, collect response
– From any end host
• Works for network operators and end users

Passive analysis of user’s traffic
– Tap incoming and outgoing traffic
• At user’s machines or servers: tcpdump, pcap
• Inside the network: DAG card
– Monitor status of TCP connections

Page 17: Internet measurements: fault detection, identification, and topology discovery

Detection with ping

If the monitor receives a reply
– Then, path is good

If no reply arrives before the timeout
– Then, path is bad

[Figure: monitor m sends a probe (ICMP echo request) to target t, and t sends back a reply (ICMP echo reply)]

Page 18: Internet measurements: fault detection, identification, and topology discovery

Persistent failure or measurement noise?

Many reasons to lose probe or reply
– Timeout may be too short
– Rate limiting at routers
– Some end-hosts don’t respond to ICMP request
– Transient congestion
– Routing change

Need to confirm that failure is persistent
– Otherwise, may trigger false alarms

Page 19: Internet measurements: fault detection, identification, and topology discovery

Failure confirmation

Upon detection of a failure, trigger extra probes

Goal: minimize detection errors
– Sending more probes
– Waiting longer between probes

Tradeoff: detection error and detection time

[Figure: timeline of packets on a path with a loss burst, illustrating a detection error]

[Cunha, 2009]
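
The confirmation step can be sketched as a loop of extra probes. This is a simplified illustration, not the method of the cited report; `probe` is a hypothetical callable returning True when a reply arrives, and the default of three probes spaced one second apart is an arbitrary assumption.

```python
import time


def confirm_failure(probe, extra_probes=3, wait_s=1.0, sleep=time.sleep):
    """Return True only if every extra probe is also lost.

    probe: callable returning True when a reply is received (hypothetical).
    extra_probes, wait_s: more probes or longer waits lower the chance of
    mistaking a loss burst for a failure, but delay the verdict.
    """
    for _ in range(extra_probes):
        sleep(wait_s)          # spacing probes helps outlast short loss bursts
        if probe():
            return False       # path answered: the first loss was noise
    return True                # still unreachable after all extra probes
```

The worst-case confirmation time is `extra_probes * wait_s` plus the probe timeouts, which makes the tradeoff between detection error and detection time explicit.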

Page 20: Internet measurements: fault detection, identification, and topology discovery

Passive detection at end hosts

tcpdump/pcap captures packets

Track status of each TCP connection
– RTTs, timeouts, retransmissions

Multiple timeouts indicate path is bad
– If current seq. number > last seq. number seen
• Path is good
– If current seq. number = last seq. number seen
• Timeout has occurred
• After four timeouts, declare path as bad

[Zhang, 2004]
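
The sequence-number rule above can be sketched as a small per-connection state machine. The class and status names are hypothetical, sequence-number wraparound is ignored, and resetting the timeout count when the connection makes progress is an assumption rather than something the slide states.

```python
class ConnectionTracker:
    """Track one TCP connection from the sequence numbers of captured packets."""

    def __init__(self, max_timeouts=4):
        self.max_timeouts = max_timeouts
        self.last_seq = None
        self.timeouts = 0

    def observe(self, seq):
        """Update state with the sequence number of a newly captured packet."""
        if self.last_seq is None or seq > self.last_seq:
            self.last_seq = seq
            self.timeouts = 0          # assumption: progress clears past timeouts
            return "good"
        if seq == self.last_seq:
            self.timeouts += 1         # same data sent again: a timeout occurred
            if self.timeouts >= self.max_timeouts:
                return "bad"
        return "suspect"


tracker = ConnectionTracker()
statuses = [tracker.observe(seq) for seq in [100, 200, 200, 200, 200, 200]]
print(statuses[-1])  # bad
```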

Page 21: Internet measurements: fault detection, identification, and topology discovery

Passive detection inside the network is hard

Traffic volume is too high
– Need special hardware
• DAG cards can capture packets at high speeds
– May lose packets

Tracking TCP connections is hard
– May not capture both sides of a connection
– Large processing and memory overhead

Page 22: Internet measurements: fault detection, identification, and topology discovery

Passive vs. active detection

Passive
+ No need to inject traffic
+ Detects all failures that affect user’s traffic
+ Responses from targets that don’t respond to ping
‒ Not always possible to tap user’s traffic
‒ Only detects failures in paths with traffic

Active
+ No need to tap user’s traffic
+ Detects failures in any desired path
‒ Probing overhead
– Cover a large number of paths
– Detect failures fast

Page 24: Internet measurements: fault detection, identification, and topology discovery

Active monitoring: reducing probing overhead

Goal: detect failures of any of the interfaces in the target network with minimum probing overhead

[Figure: monitors M1 and M2 probe target hosts T1, T2, and T3 across a target network with routers A, B, C, and D]

Page 25: Internet measurements: fault detection, identification, and topology discovery

The coverage solution

Instead of probing all paths, select the minimum set of paths that covers all interfaces in target network

Coverage problem is NP-hard
– Solution: greedy set-cover heuristic

[Figure: monitors M1 and M2, targets T1, T2, and T3, and routers A, B, C, and D of the target network]

[Nguyen, 2004] [Bejerano, 2003]
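
The greedy set-cover heuristic can be sketched in a few lines. The sketch is generic rather than the cited authors' implementation, and the path and interface names in the example are invented: at each step it picks the path that covers the most interfaces not yet covered.

```python
def select_paths(path_interfaces):
    """Greedy set cover: choose paths until every interface is covered.

    path_interfaces: dict mapping a path name to the set of interfaces it crosses.
    """
    uncovered = set().union(*path_interfaces.values())
    selected = []
    while uncovered:
        # sorted() gives a deterministic choice when several paths tie.
        best = max(sorted(path_interfaces),
                   key=lambda p: len(uncovered & set(path_interfaces[p])))
        gain = uncovered & set(path_interfaces[best])
        if not gain:
            break
        selected.append(best)
        uncovered -= gain
    return selected


paths = {
    "M1-T1": {"A", "B"}, "M1-T2": {"A", "C"},
    "M2-T1": {"D", "B"}, "M2-T3": {"D", "C"},
}
print(select_paths(paths))  # ['M1-T1', 'M2-T3']
```

Two of the four paths are enough to cross all four interfaces, which is the saving the coverage solution aims for.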

Page 26: Internet measurements: fault detection, identification, and topology discovery

Coverage solution doesn’t detect all types of failures

Detects fail-stop failures
– Failures that affect all packets that traverse the faulty interface
• E.g., interface or router crashes, fiber cuts, bugs

But not path-specific failures
– Failures that affect only a subset of paths that cross the faulty interface
• E.g., router misconfigurations

[Nguyen, 2009]

Page 27: Internet measurements: fault detection, identification, and topology discovery

New formulation of failure detection problem

Select the frequency to probe each path
– Lower frequency per-path probing can achieve a high frequency probing of each interface

[Figure: monitors M1 and M2, targets T1, T2, and T3, and routers A, B, C, and D, annotated with probing rates of 1 every 9 mins and 1 every 3 mins]

[Nguyen, 2009]
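
The arithmetic behind this idea can be sketched as follows, under the assumption (a reading of the figure, not stated on the slide) that three paths crossing the same interface are each probed once every 9 minutes with evenly staggered probes, so the interface itself is probed once every 3 minutes. The path and interface names are hypothetical.

```python
from fractions import Fraction


def interface_probe_periods(path_interfaces, path_period_min):
    """Return, per interface, the average time in minutes between probes.

    path_interfaces: dict mapping a path name to the interfaces it crosses.
    path_period_min: dict mapping a path name to its probing period in minutes.
    """
    rate = {}
    for path, interfaces in path_interfaces.items():
        for iface in interfaces:
            # Probing rates add up over all paths that cross the interface.
            rate[iface] = rate.get(iface, Fraction(0)) + Fraction(1, path_period_min[path])
    return {iface: 1 / r for iface, r in rate.items()}


paths = {"p1": ["A", "B"], "p2": ["A"], "p3": ["A"]}
periods = interface_probe_periods(paths, {"p1": 9, "p2": 9, "p3": 9})
print(periods["A"], periods["B"])  # 3 9
```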

Page 29: Internet measurements: fault detection, identification, and topology discovery

Is failure in forward or reverse path?

Paths can be asymmetric
– Load balancing
– Hot-potato routing

[Figure: probe from m to t and reply from t back to m]

Page 30: Internet measurements: fault detection, identification, and topology discovery

Disambiguating one-way losses: Spoofing

Monitor asks the spoofer to send a probe

Spoofer sends spoofed probe with source address of the monitor

If reply reaches the monitor, reverse path is good

[Figure: monitor m, target t, and the spoofer]

[Katz-Bassett, 2008]
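
The inference made from a spoofed probe can be written out as a small decision function. This is an illustrative sketch with hypothetical names and labels, and it assumes the path from the spoofer to the target works and that the spoofed packet is not filtered along the way.

```python
def diagnose_direction(direct_reply_received, spoofed_reply_received):
    """Classify a path from the results of a direct ping and a spoofed probe.

    direct_reply_received: the monitor's own ping to the target was answered.
    spoofed_reply_received: the reply to the spoofer's probe, which carried the
        monitor's source address, arrived at the monitor.
    """
    if direct_reply_received:
        return "path good"
    if spoofed_reply_received:
        # The target's reply reached the monitor, so the reverse path works;
        # the lost direct ping points at the forward path.
        return "forward path failure"
    # No reply either way: reverse failure, spoofer-side failure, or filtering.
    return "undetermined"


print(diagnose_direction(False, True))  # forward path failure
```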

Page 31: Internet measurements: fault detection, identification, and topology discovery

Limits of spoofing

Network operators often drop spoofed packets
– Spoofed packets are normally used for attacks

Placement of spoofer
– Paths from spoofer to targets need to be independent of paths from monitors

[Figure: monitor m and target t]

Page 32: Internet measurements: fault detection, identification, and topology discovery

Summary: Fault detection

End users: passive plus active probing
– Passive measurements capture user’s experience
– Active probes
• When path has no traffic
• When TCP connections are too short

Network operators: alarms plus active probing
– Alarm systems directly report many faults
– Active monitoring to capture customer’s experience
• Detect blackholes (i.e., faults that don’t appear in alarms)
• Detect faults in other networks

Page 33: Internet measurements: fault detection, identification, and topology discovery

Outline

Background in network tomography

Fault detection with no control of targets
– Active vs. passive measurements
– Reducing overhead of active measurements
– Disambiguating one-way failures

Fault identification
– Correlated path reachability without multicast
– Topology discovery

Open issues

Page 34: Internet measurements: fault detection, identification, and topology discovery

Uncorrelated measurements lead to errors

Lack of synchronization leads to inconsistencies
– Probes cross links at different times
– Path may change between probes

[Figure: tree from m to t1 and t2 with a mistakenly inferred failure]

Page 35: Internet measurements: fault detection, identification, and topology discovery

Sources of inconsistencies

In measurements from a single monitor
– Probing all targets can take time

In measurements from multiple monitors
– Hard to synchronize monitors for all probes to reach a link at the same time
– Impossible to generalize to all links

Page 36: Internet measurements: fault detection, identification, and topology discovery

Inconsistent measurements with multiple monitors

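[Figure: monitors m1 … mK probe targets t1 … tN. The path reachability table lists the paths from (m1, t1) to (mK, tN) as good or bad; the mix of good and bad entries is labeled as inconsistent measurements.]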

Page 37: Internet measurements: fault detection, identification, and topology discovery

Solution: Reprobe paths after failure

Consistency has a cost
– Delays fault identification
– Cannot identify short failures

[Figure: monitors m1 … mK and targets t1 … tN with the path reachability table for paths (m1, t1) to (mK, tN), holding good and bad entries after reprobing]

[Cunha, 2009]
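
One simple way to picture reprobing is to feed tomography only the paths that stay bad across consecutive measurement rounds. The sketch below is an assumption made for illustration, not the procedure of the cited report; the function name, the round format, and the default of two rounds are hypothetical.

```python
def stable_bad_paths(rounds, k=2):
    """Return the paths reported bad in each of the last k probing rounds.

    rounds: list of dicts, oldest first, mapping a path name to True (good)
        or False (bad) for one probing round.
    """
    if k < 1 or len(rounds) < k:
        return set()  # not enough rounds yet to call anything consistent
    bad_sets = [{path for path, ok in r.items() if not ok} for r in rounds[-k:]]
    return set.intersection(*bad_sets)


rounds = [
    {"m1-t1": True, "m1-tN": False, "mK-t1": False},
    {"m1-t1": True, "m1-tN": False, "mK-t1": True},
]
print(stable_bad_paths(rounds))  # {'m1-tN'}
```

Raising `k` filters out more transient losses, but any failure shorter than `k` rounds is never reported, which is the cost listed above.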

Page 38: Internet measurements: fault detection, identification, and topology discovery

Summary: Correlated measurements

Trade-off: consistency vs. identification speed
– Faster identification leads to false alarms
– Slower identification misses short failures

Network operators
– Too many false alarms are unmanageable
– Longer failures are the ones that need intervention

End users
– Even short failures affect performance

Page 40: Internet measurements: fault detection, identification, and topology discovery

Measuring router topology

With access to routers (or “from inside”)
– Topology of one network
– Routing monitors (OSPF or IS-IS)

No access to routers (or “from outside”)
– Multi-AS topology or from end-hosts
– Monitors issue active probes: traceroute

Page 41: Internet measurements: fault detection, identification, and topology discovery

Topology from inside

Routing protocols flood state of each link
– Periodically refresh link state
– Report any changes: link down, up, cost change

Monitor listens to link-state messages
– Acts as a regular router
• AT&T’s OSPFmon or Sprint’s PyRT for IS-IS

Combining link states gives the topology
– Easy to maintain, messages report any changes

[Mortier] [Shaikh, 2004]
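
How link-state messages combine into a topology can be pictured with a toy update function. The message format below (keys `router`, `neighbor`, `state`, `cost`) is invented for illustration and is not the OSPF or IS-IS wire format, nor the interface of OSPFmon or PyRT.

```python
def apply_link_state(topology, message):
    """Update a topology map from one link-state message.

    topology: dict mapping (router, neighbor) to the current link cost.
    message: dict with hypothetical keys "router", "neighbor", "state"
        ("up" or "down") and, for "up", "cost".
    """
    link = (message["router"], message["neighbor"])
    if message["state"] == "down":
        topology.pop(link, None)          # link removed from the topology
    else:
        topology[link] = message["cost"]  # new link or cost change
    return topology


topology = {}
apply_link_state(topology, {"router": "A", "neighbor": "B", "state": "up", "cost": 10})
apply_link_state(topology, {"router": "A", "neighbor": "B", "state": "up", "cost": 20})
apply_link_state(topology, {"router": "B", "neighbor": "C", "state": "up", "cost": 5})
apply_link_state(topology, {"router": "B", "neighbor": "C", "state": "down"})
print(topology)  # {('A', 'B'): 20}
```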

Page 42: Internet measurements: fault detection, identification, and topology discovery

Inferring a path from outside: traceroute

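[Figure: monitor m probes target t through routers A (interfaces A.1, A.2) and B (interfaces B.1, B.2). The probe with TTL = 1 draws “TTL exceeded from A.1” and the probe with TTL = 2 draws “TTL exceeded from B.1”. The actual path crosses all four interfaces; the inferred path between m and t contains only A.1 and B.1.]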

Page 43: Internet measurements: fault detection, identification, and topology discovery

A traceroute path can be incomplete

Load balancing is widely used
– Traceroute only probes one path

Sometimes traceroute has no answer (stars)
– ICMP rate limiting
– Anonymous routers

Tunnelling (e.g., MPLS) may hide routers
– Routers inside the tunnel may not decrement TTL

Page 44: Internet measurements: fault detection, identification, and topology discovery

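Traceroute under load balancing

[Figure: actual and inferred paths between m and t. In the actual topology, load balancer L spreads traffic over branches through A, B, C, and D before E. The probes with TTL = 2 and TTL = 3 take different branches, so the inferred path has missing nodes and links and a false link.]

[Augustin, 2006]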

Page 45: Internet measurements: fault detection, identification, and topology discovery

Errors happen even under per-flow load balancing

Traceroute uses the destination port as identifier
– Needs to match probe to response
– Response only has the header of the issued probe

[Figure: topology with L, A, B, C, D, and E between m and t. The probe with TTL = 2 uses Port 2 and the probe with TTL = 3 uses Port 3, so only one of them belongs to Flow 1.]

[Augustin, 2006]

Page 46: Internet measurements: fault detection, identification, and topology discovery

Paris traceroute

Solves the problem with per-flow load balancing
– Probes to a destination belong to same flow

Changes the location of the probe identifier
– Use the UDP checksum

[Figure: topology with L, A, B, C, D, and E between m and t. The probes with TTL = 2 and TTL = 3 both use Port 1 and are told apart by Checksum 2 and Checksum 3.]

[Augustin, 2006]
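
The difference between the two probing styles can be reproduced with a toy simulation. Everything in it is an assumption made for illustration: the two branches L-A-C-E and L-B-D-E, the port numbers, and especially the load balancer's "hash" (parity of the port sum), which stands in for whatever per-flow hash a real router uses.

```python
UPPER = ["L", "A", "C", "E"]   # hypothetical branch taken by "even" flows
LOWER = ["L", "B", "D", "E"]   # hypothetical branch taken by "odd" flows


def branch_for(src_port, dst_port):
    """Toy per-flow load balancer: the flow's ports decide the branch."""
    return UPPER if (src_port + dst_port) % 2 == 0 else LOWER


def classic_probe(ttl):
    # Classic traceroute: the destination port changes with every probe
    # because it doubles as the probe identifier.
    return 5000, 33434 + ttl, None


def paris_probe(ttl):
    # Paris traceroute: ports stay fixed; the identifier lives in the checksum,
    # which the load balancer does not hash on.
    return 5000, 33434, ttl


def trace(make_probe, max_ttl=4):
    """Return the hop that answers each TTL-limited probe."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        src_port, dst_port, _probe_id = make_probe(ttl)
        hops.append(branch_for(src_port, dst_port)[ttl - 1])
    return hops


print(trace(classic_probe))  # ['L', 'A', 'D', 'E']  (false link A-D)
print(trace(paris_probe))    # ['L', 'A', 'C', 'E']  (one real branch)
```

Keeping every header field that feeds the hash constant is what keeps all probes on a single branch.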

Page 47: Internet measurements: fault detection, identification, and topology discovery

Topology from traceroutes

Inferred nodes = interfaces, not routers

Coverage depends on monitors and targets
– Misses links and routers
– Some links and routers appear multiple times

[Figure: actual topology with routers A, B, C, and D (numbered interfaces) between monitors m1, m2 and targets t1, t2, shown next to the inferred topology, whose nodes are the interfaces A.1, B.3, C.1, C.2, and D.1]

Page 48: Internet measurements: fault detection, identification, and topology discovery

Alias resolution: Map interfaces to routers

Direct probing
– Probe an interface, may receive response from another
– Responses from the same router will have close IP identifiers and same TTL

Record-route IP option
– Records up to nine IP addresses of routers in the path

[Figure: inferred topology between m1, m2 and t1, t2 with interface nodes A.1, B.3, C.1, C.2, and D.1; two interfaces are marked as belonging to the same router]

[Spring, 2002] [Sherwood, 2008]
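
The "close IP identifiers and same TTL" test can be sketched as follows. The sketch assumes three responses collected in the order first address, second address, first address again, and a closeness threshold of 200; both the probing order and the threshold are illustrative assumptions, not values given on the slide.

```python
def follows(first_id, second_id, max_gap):
    """True if second_id comes shortly after first_id on a 16-bit counter."""
    return 0 < (second_id - first_id) % 65536 <= max_gap


def likely_aliases(id_a1, id_b, id_a2, ttl_a, ttl_b, max_gap=200):
    """Guess whether two addresses belong to the same router.

    id_a1, id_b, id_a2: IP identifiers of responses from address A, then
        address B, then address A again.
    ttl_a, ttl_b: TTL values seen in the responses from A and B.
    """
    return (ttl_a == ttl_b
            and follows(id_a1, id_b, max_gap)
            and follows(id_b, id_a2, max_gap))


print(likely_aliases(65530, 2, 10, ttl_a=250, ttl_b=250))   # True (counter wrapped)
print(likely_aliases(100, 5000, 105, ttl_a=250, ttl_b=250))  # False
```

A shared counter only shows up when the router uses a single sequential IP ID for all its interfaces, so a negative result does not prove the addresses are on different routers.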

Page 49: Internet measurements: fault detection, identification, and topology discovery

Large-scale topology measurements

Probing a large topology takes time
– E.g., probing 1200 targets from PlanetLab nodes takes 5 minutes on average (using 30 threads)
– Probing more targets covers more links
– But, getting a topology snapshot takes longer

Snapshot may be inaccurate
– Paths may change during snapshot

Hard to get up-to-date topology
– To know that a path changed, need to re-probe

Page 50: Internet measurements: fault detection, identification, and topology discovery

Faster topology snapshots

Probing redundancy
– Intra-monitor
– Inter-monitor

Doubletree
– Combines backward and forward probing to eliminate redundancy

[Figure: monitors m1 and m2 reach targets t1 and t2 through routers A, B, C, and D]

[Donnet, 2005]
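
A simplified sketch of the backward-and-forward idea follows. It runs over a known list of hops instead of sending probes, and the stop-set bookkeeping is a simplification based on a general understanding of Doubletree, not a faithful copy of the published algorithm: a per-monitor local stop set ends backward probing (intra-monitor redundancy), and a shared global stop set of (interface, destination) pairs ends forward probing (inter-monitor redundancy).

```python
def doubletree_trace(path, dest, start_ttl, local_stop, global_stop):
    """Return the TTLs actually probed on one path, in probing order.

    path: hop interfaces from the monitor to dest (index 0 is TTL 1).
    local_stop: set of interfaces this monitor has already seen.
    global_stop: set of (interface, dest) pairs shared by all monitors.
    """
    probed = []
    # Forward phase: from the starting TTL towards the destination.
    for ttl in range(start_ttl, len(path) + 1):
        iface = path[ttl - 1]
        probed.append(ttl)
        if (iface, dest) in global_stop:
            break                      # another trace already covered the rest
        global_stop.add((iface, dest))
        local_stop.add(iface)
    # Backward phase: from the starting TTL back towards the monitor.
    for ttl in range(start_ttl - 1, 0, -1):
        iface = path[ttl - 1]
        probed.append(ttl)
        if iface in local_stop:
            break                      # this monitor already knows the way here
        local_stop.add(iface)
        global_stop.add((iface, dest))
    return probed


local_m1, shared = set(), set()
print(doubletree_trace(["a", "b", "c", "d", "t1"], "t1", 3, local_m1, shared))  # [3, 4, 5, 2, 1]
print(doubletree_trace(["a", "b", "c", "e", "t2"], "t2", 3, local_m1, shared))  # [3, 4, 5, 2]
```

The second trace stops its backward phase at hop `b`, which the same monitor discovered during the first trace.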

Page 51: Internet measurements: fault detection, identification, and topology discovery

Summary: Topology discovery

Network operators
– Own network: routing messages
– Neighbor networks: traceroutes

End users: combining traceroutes
– Be aware of inaccuracies
• False or missing links and nodes
• Hidden hops: stars, tunneling
– Fault identification with lower precision
• Determine the network to blame

Page 53: Internet measurements: fault detection, identification, and topology discovery

Tomography algorithms

Make robust to measurement noise

Make robust to topology uncertainties
– Multiple topologies close to the time of an event
– Multiple paths between a monitor and a target

Identify other types of faults
– Path specific
– Intermittent

Page 54: Internet measurements: fault detection, identification, and topology discovery

Monitoring techniques

Track dynamics of large-scale topologies
– Fast identification requires up-to-date topology

Passive detection inside a network
– High speed packet processing
– Detect faults with incomplete information

Large-scale deployment
– Consolidating measurements becomes bottleneck

Define changes to ease fault diagnosis
– Router reports or behavior
– Common monitoring infrastructure

Page 55: Internet measurements: fault detection, identification, and topology discovery

REFERENCES


Page 56: Internet measurements: fault detection, identification, and topology discovery

Network tomography theory

Survey on network tomography
– R. Castro, M. Coates, G. Liang, R. Nowak, and B. Yu, “Network Tomography: Recent Developments”, Statistical Science, Vol. 19, No. 3 (2004), 499-517.

Traffic matrix estimation
– Y. Vardi, “Network Tomography: Estimating Source-Destination Traffic Intensities from Link Data”, Journal of the American Statistical Association, Vol. 91, 1996.

Inference of link performance/connectivity
– MINC project: http://gaia.cs.umass.edu/minc/
– A. Adams et al., “The Use of End-to-end Multicast Measurements for Characterizing Internal Network Behavior”, IEEE Communications Magazine, May 2000.

Page 57: Internet measurements: fault detection, identification, and topology discovery

Binary tomography

Single-source tree algorithm
– N. Duffield, “Network Tomography of Binary Network Performance Characteristics”, IEEE Transactions on Information Theory, 2006.

Applying tomography in one network
– R. R. Kompella, J. Yates, A. Greenberg, A. C. Snoeren, “Detection and Localization of Network Blackholes”, IEEE INFOCOM, 2007.

Applying tomography in multiple network topology
– A. Dhamdhere, R. Teixeira, C. Dovrolis, and C. Diot, “NetDiagnoser: Troubleshooting network unreachabilities using end-to-end probes and routing data”, CoNEXT, 2007.

Obtaining accurate path status for binary tomography
– I. Cunha, R. Teixeira, N. Feamster, and C. Diot, “Measurement Methods for Fast and Accurate Blackhole Identification with Binary Tomography”, Thomson technical report CR-PRL-2009-05-006, 2009.

Page 58: Internet measurements: fault detection, identification, and topology discovery

Topology from inside

IS-IS monitoring
– R. Mortier, “Python Routeing Toolkit (`PyRT')”, https://research.sprintlabs.com/pyrt/

OSPF monitoring
– A. Shaikh and A. Greenberg, “OSPF Monitoring: Architecture, Design and Deployment Experience”, NSDI, 2004.

Commercial products
– Packet Design: http://www.packetdesign.com/

Page 59: Internet measurements: fault detection, identification, and topology discovery

Topology with traceroute

Tracing accurate paths under load-balancing
– B. Augustin et al., “Avoiding traceroute anomalies with Paris traceroute”, IMC, 2006.

Reducing overhead to trace topology of a network and alias resolution with direct probing
– N. Spring, R. Mahajan, and D. Wetherall, “Measuring ISP Topologies with Rocketfuel”, SIGCOMM, 2002.

Use of record route to obtain more accurate topologies
– R. Sherwood, A. Bender, N. Spring, “DisCarte: A Disjunctive Internet Cartographer”, SIGCOMM, 2008.

Reducing overhead to trace a multi-network topology
– B. Donnet, P. Raoult, T. Friedman, and M. Crovella, “Efficient Algorithms for Large-Scale Topology Discovery”, SIGMETRICS, 2005.

Page 60: Internet measurements: fault detection, identification, and topology discovery

Reducing overhead of active fault detection

Selection of paths to probe
– H. Nguyen and P. Thiran, “Active measurement for multiple link failures diagnosis in IP networks”, PAM, 2004.
– Y. Bejerano and R. Rastogi, “Robust monitoring of link delays and faults in IP networks”, INFOCOM, 2003.

Selection of the frequency to probe paths
– H. X. Nguyen, R. Teixeira, P. Thiran, and C. Diot, “Minimizing Probing Cost for Detecting Interface Failures: Algorithms and Scalability Analysis”, INFOCOM, 2009.

Page 61: Internet measurements: fault detection, identification, and topology discovery

Internet-wide fault detection systems

Detection with BGP monitoring plus continuous pings, spoofing to disambiguate one-way failures, traceroute to locate faults
– E. Katz-Bassett, H. V. Madhyastha, J. P. John, A. Krishnamurthy, D. Wetherall, T. Anderson, “Studying Black Holes in the Internet with Hubble”, NSDI, 2008.

Detection with passive monitoring of traffic of peer-to-peer systems or content distribution networks, traceroutes to locate faults
– M. Zhang, C. Zhang, V. Pai, L. Peterson, and R. Wang, “PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services”, OSDI, 2004.