
Computer Science

Demystifying the router-level topology

John Byers

Department of Computer Science and Topology Modeling Group, Boston University

CS: Mark Crovella, Anukool Lakhina, Ibrahim Matta

Physics: Paul Krapivsky, Sid Redner

Statistics: David Chiu, Eric Kolaczyk

Mystery and mystification

The Internet topology is shrouded in mystery.

Rapid, decentralized growth

Emergent behavior

Can treat it like a “found” object…

Need to approach the Internet scientifically

Complexity and massive scale…

Give me a break!

Why are we still mystified?

Many levels of abstraction

AS-level topology vs.

Topology seen by IP or traceroute vs.

“Physical” topology: switching elements

Obfuscation (intentional or otherwise) by ISPs.

Measurement tools are primitive.

Technical challenges are significant.

Refutation of theories not part of our culture (?)

Who is mystified?

I am mystified.

Most networking researchers are mystified, or are under the illusion that they are not mystified.

People in other communities are definitely mystified.

Network operators are presumably not mystified.

The Demystification Manifesto

(apologies to Varghese-Estan for blatantly ripping off part of their HotNets-II paper title)

Clear up technical confusion.

Articulate strengths and weaknesses of tools

Refute broken theories

Continue to conduct measurements, build more informed models, and validate them.

Declare success when:

No more tall tales about scale-free graphs in the router-level topology.

Or… ?

Outline

Demystification Manifesto

Case Study: Demystifying traceroute. [Lakhina, Byers, Crovella, Xie: Infocom ’03]

Next steps on demystification agenda

Internet mapping efforts

Goal: Discover the Internet router-level topology

• Vertices represent routers.

• Edges connect routers that are one IP hop apart.

Measurement Primitive: traceroute. Reports the IP path from A to B.

[Figure: example traceroute path between source and destination, showing interface addresses 212.12.5.77, 212.12.58.3, 163.55.221.98, 163.55.1.41, 163.55.1.10]

Fundamental limitations of traceroute

The IP path is not the router-level path.

Many-to-many relationships

One router may have many interfaces

A collection of switching gear may appear to be a single IP address

MPLS label switching, ATM, GigaPoP’s

Missing data and noisy data are the norm.

Most recent traceroute studies

(k,m)-traceroute study

• k sources: Few active sources, strategically located.

• m destinations: Many passive destinations, globally dispersed.

• Union of many traceroute paths.

[Figure: sources and destinations]

Heavy tails in topology measurements

A surprising finding: [FFF99]

Let $d$ be a given node degree. Let $f_d$ be the frequency of degree-$d$ vertices in a graph.

Power-law relationship: $f_d \propto d^{-c}$ for some constant $c > 0$.

[Figure: frequency vs. degree; dataset from [PG98]]

Subsequent measurements show that the degree distribution is heavy-tailed [GT00, BC01, …]:

$\Pr[X > x] \propto x^{-c}$

[Figure: log-log plot of Pr[X>x]]
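The heavy-tail plots in this talk show log(Pr[X>x]) against log(Degree). As a minimal sketch of how such a curve can be computed from a degree sequence, the function below returns the empirical complementary distribution. The name `degree_ccdf` and the toy degree list are hypothetical and are not taken from any dataset in the talk.

```python
from collections import Counter

def degree_ccdf(degrees):
    """Return sorted (x, Pr[X > x]) pairs for a list of node degrees."""
    n = len(degrees)
    counts = Counter(degrees)
    remaining = n
    ccdf = []
    for x in sorted(counts):
        remaining -= counts[x]          # nodes with degree strictly greater than x
        ccdf.append((x, remaining / n))
    return ccdf

# Hypothetical toy degree sequence, for illustration only.
toy_degrees = [1, 1, 1, 1, 2, 2, 3, 8]
for x, prob in degree_ccdf(toy_degrees):
    print(x, prob)
```

For a log-log plot, the final point (where Pr[X>x] is zero) has to be dropped before taking logarithms.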

Hmmm…..

We will argue that the evidence for power laws is at best insufficient.

Insufficient does not mean noisy or incomplete (which these datasets certainly are!).

For us, insufficient means that measurements are statistically biased.

We will show that (k,m)-traceroute studies likely exhibit significant sampling bias.

A thought experiment

Idea: Simulate topology measurements on a random graph.

1. Generate a sparse Erdős-Rényi random graph, G=(V,E). Each edge is present independently with probability p. Assign weights: $w(e) = 1 + \epsilon$, where $\epsilon$ in $[-1/|V|,\ 1/|V|]$.

2. Pick k unique source nodes, uniformly at random.

3. Pick m unique destination nodes, uniformly at random.

4. Simulate traceroute from k sources to m destinations, i.e. learn shortest paths between k sources and m destinations.

5. Let Ĝ be the union of shortest paths.

Ask: How does Ĝ compare with G?
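The thought experiment can be sketched in a few dozen lines. This is an illustrative sketch, not the code used in the study. It assumes the perturbation in w(e) is drawn uniformly from [-1/|V|, 1/|V|], uses Dijkstra's algorithm for shortest paths, and uses much smaller parameters than the talk's so that it runs quickly. All function names are hypothetical.

```python
import heapq
import random

def er_graph(n, p, rng):
    """Erdős-Rényi graph as {u: {v: weight}}, with w(e) = 1 + eps, eps in [-1/n, 1/n]."""
    adj = {v: {} for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                w = 1 + rng.uniform(-1 / n, 1 / n)
                adj[u][v] = w
                adj[v][u] = w
    return adj

def shortest_path_tree(adj, src):
    """Dijkstra from src; returns the predecessor of every reachable node."""
    dist = {src: 0.0}
    pred = {src: None}
    heap = [(0.0, src)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, w in adj[u].items():
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                pred[v] = u
                heapq.heappush(heap, (nd, v))
    return pred

def km_traceroute(adj, sources, dests):
    """Union of shortest-path edges from each source to each destination."""
    edges = set()
    for s in sources:
        pred = shortest_path_tree(adj, s)
        for t in dests:
            if t not in pred:           # destination unreachable from this source
                continue
            v = t
            while pred[v] is not None:  # walk back towards the source
                edges.add(frozenset((v, pred[v])))
                v = pred[v]
    return edges

def degrees_from_edges(edges):
    """Degree of each node in the measured graph."""
    deg = {}
    for e in edges:
        for v in e:
            deg[v] = deg.get(v, 0) + 1
    return deg

# Small illustrative parameters (assumed for speed, not the talk's values).
rng = random.Random(0)
n, p, k, m = 1000, 0.015, 3, 100
adj = er_graph(n, p, rng)
sources = rng.sample(range(n), k)
dests = rng.sample(range(n), m)
measured = km_traceroute(adj, sources, dests)
measured_deg = degrees_from_edges(measured)
print("nodes in measured graph:", len(measured_deg))
print("edges in measured graph:", len(measured))
print("max measured degree:", max(measured_deg.values()))
print("max true degree:", max(len(nbrs) for nbrs in adj.values()))
```

Comparing the distribution of `measured_deg` with the true degrees in `adj` is the comparison the question above asks for.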

Are heavy tails a measurement artifact?

Ĝ is a biased sample of G that looks heavy-tailed.

Underlying Graph: N=100000, p=0.00015

Measured Graph: k=3, m=1000

[Figure: log(Pr[X>x]) vs. log(Degree) for the measured graph, Ĝ, and the underlying random graph, G]

Understanding Bias

An intuitive explanation: When traces are run from few sources to lots of destinations, some portions of the underlying graph are explored more than others.

We now investigate the causes behind bias.

Are nodes sampled unevenly?

• Conjecture: Shortest-path routing favors higher-degree nodes ⇒ nodes sampled unevenly.

• Validation: Examine true degrees of nodes in measured graph, Ĝ.

Expect true degrees of nodes in Ĝ to be higher than degrees of nodes in G, on average.

[Figure: true degrees of nodes in Ĝ compared with degrees of all nodes in G; measured graph: k=5, m=1000]

• Conclusion: Difference between true degrees of Ĝ and degrees of G is insignificant; dismiss conjecture.

Are edges sampled unevenly?

• Conjecture: Edges selected incident to a node in Ĝ are not proportional to its true degree.

• Validation: For each node in Ĝ, plot true degree vs. measured degree.

If unbiased, ratio of true to measured degree should be constant. Points clustered around y=cx line (c<1).

• Conclusion: Edges incident to a node are sampled disproportionately; supports conjecture.

[Figure: observed degree vs. true degree]

Why: Analyzing Bias

• Question: Given some vertex in Ĝ that is h hops from the source, what fraction of its true edges are contained in Ĝ?

• Messages:

• As h increases, number of edges discovered falls off sharply.*

* We can prove exponential fall-off analytically, in a simplified model.

[Figure: fraction of node edges discovered vs. distance from source, for 100, 600, and 1000 destinations]

Result of adding more destinations: most new nodes and edges closer to the source.

What does this suggest?

Summary:

Edges are sampled unevenly by (k,m)-traceroute methods.

Edges close to the source are sampled more often than edges further away.

Intuitive Picture:

Neighborhood near sources is well explored, but visibility of edges declines sharply with hop distance from sources.

[Figure: log(Pr[X>x]) vs. log(Degree), with curves labeled Hop1, Hop2, Hop3, Hop4, Underlying Graph, and Measured Graph]

Inferring Bias

Goal: Given a measured Ĝ, does it appear to be biased?

Why this is difficult: Don’t have underlying graph. Don’t have formal criteria for checking bias.

General Approach: Examine statistical properties as a function of distance from nearest source.

• Unbiased sample ⇒ No change

• Change ⇒ Bias

Detecting Bias

Examine Pr[D=d|H=h], the conditional probability that a node has degree d, given that it is at distance h from the source.

Two observations:

1. Highest-degree nodes are near the source.

2. Degree distribution of nodes near the source is different from that of nodes far away.

[Figure: log(Pr[X>x]) vs. log(Degree) for the underlying graph and for Ĝ degrees given H=2 and H=3]
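A minimal sketch of estimating Pr[D=d|H=h] from a measured graph, assuming H is the hop count to the nearest source and that the graph is given as an adjacency map. The function names and the six-node toy graph are hypothetical.

```python
from collections import Counter, defaultdict, deque

def hops_to_nearest_source(adj, sources):
    """Multi-source BFS: hop distance from each reachable node to its nearest source."""
    hop = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in hop:
                hop[v] = hop[u] + 1
                queue.append(v)
    return hop

def degree_given_hop(adj, sources):
    """Return {h: {d: estimate of Pr[D=d | H=h]}} for the given graph."""
    hop = hops_to_nearest_source(adj, sources)
    by_hop = defaultdict(Counter)
    for v, h in hop.items():
        by_hop[h][len(adj[v])] += 1     # degree of v, grouped by its hop distance
    result = {}
    for h, counts in by_hop.items():
        total = sum(counts.values())
        result[h] = {d: c / total for d, c in counts.items()}
    return result

# Hypothetical toy measured graph with a single source at node 0.
toy_adj = {0: {1, 2, 3}, 1: {0, 4}, 2: {0, 5}, 3: {0}, 4: {1}, 5: {2}}
conditional = degree_given_hop(toy_adj, sources=[0])
for h in sorted(conditional):
    print(h, conditional[h])
```

If the sample were unbiased, the per-hop distributions would be expected to look alike; a systematic shift with h is the change that signals bias.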

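A Statistical Test for C1

C1: Are the highest-degree nodes near the source? If so, then consistent with bias.

$H_0^{C1}$: The 1% highest-degree nodes occur at random with distance to nearest source.

Cut vertex set in half: N (near) and F (far), by distance from nearest source.

Let $v = (0.01)\,|V|$, the number of highest-degree nodes considered.

$k$: fraction of these $v$ nodes that lies in N.

Can bound the likelihood that $k$ deviates from 1/2 using Chernoff bounds:

$$\Pr\left[k > \frac{1+\delta}{2}\right] < \left[\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right]^{v/2}$$

Reject hypothesis with confidence $1-\alpha$ if:

$$\left[\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right]^{v/2} \le \alpha$$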
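A hedged sketch of the C1 test as a computation. It assumes the standard multiplicative Chernoff bound with mean v/2 (each of the v highest-degree nodes falling in N independently with probability 1/2 under the null hypothesis) and a significance level alpha of 0.05. Both the function names and the choice of alpha are assumptions made for illustration. Here `k` is the fraction of top nodes in N, not the number of sources.

```python
import math

def c1_chernoff_bound(v, k):
    """Chernoff upper bound on seeing a fraction k or more of v top nodes in N under H0."""
    if k <= 0.5:
        return 1.0                      # no excess over the expected half; bound is trivial
    delta = 2 * k - 1                   # solves k = (1 + delta) / 2
    mu = v / 2                          # expected number of top nodes in N under H0
    log_bound = mu * (delta - (1 + delta) * math.log(1 + delta))
    return math.exp(log_bound)

def reject_h0_c1(v, k, alpha=0.05):
    """Reject H0 (top nodes placed at random) when the bound is at most alpha."""
    return c1_chernoff_bound(v, k) <= alpha

# Example using figures quoted in this talk for Pansiot-Grad:
# 3,888 nodes, so v is about 39, and 93% of the highest-degree nodes lie in N.
v = round(0.01 * 3888)
print(c1_chernoff_bound(v, 0.93))
print(reject_h0_c1(v, 0.93))
```

The bound is computed in log space so that large v does not underflow before the comparison with alpha.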

A Statistical Test for C2

C2: Is the degree distribution of nodes near the source different from those further away? If so, consistent with bias.

$H_0^{C2}$: Chi-Square Test succeeds on degree distribution for nodes near the source and far from the source.

Partition vertices across median distance: N (near) and F (far)

Compare degree distribution of nodes in N and F, using the Chi-Square Test:

$$\chi^2 = \sum_{i=1}^{l} \frac{(O_i - E_i)^2}{E_i}$$

where $O$ and $E$ are observed and expected degree frequencies and $l$ is the number of histogram bins.

Reject hypothesis with confidence $1-\alpha$ if:

$$\chi^2 > \chi^2_{[\alpha,\ l-1]}$$
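A minimal sketch of the chi-square statistic for C2. The talk does not say how the expected frequencies are formed, so this sketch assumes they come from one group's histogram (here F) rescaled to the size of the other (N). The bin counts are hypothetical, and the critical value 5.991 is the standard 5% value for 2 degrees of freedom (3 bins).

```python
def expected_from_reference(reference_counts, total):
    """Rescale a reference histogram so that its counts sum to `total`."""
    ref_total = sum(reference_counts)
    return [c * total / ref_total for c in reference_counts]

def chi_square_statistic(observed, expected):
    """Sum over bins of (O_i - E_i)^2 / E_i."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of bins")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical degree histograms with l = 3 bins (low, medium, high degree).
near_counts = [10, 30, 60]
far_counts = [50, 30, 20]
expected = expected_from_reference(far_counts, sum(near_counts))
statistic = chi_square_statistic(near_counts, expected)

critical_value = 5.991                  # chi-square, 2 degrees of freedom, alpha = 0.05
print(statistic, statistic > critical_value)
```

Bins with an expected count of zero would need to be merged with neighbours before calling `chi_square_statistic`, since the statistic divides by each expected count.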

Our Definition of Bias

• Bias (Definition): Failure of a sampled graph to meet statistical tests for randomness associated with C1 and C2.

• Disclaimers: Tests are not conclusive. Tests are binary and don’t tell us how biased datasets are.

• But a dataset that fails both tests is a poor choice for making generalizations about the underlying graph.

Introducing datasets

[Figure: log(Pr[X>x]) vs. log(Degree) for the Pansiot-Grad, Mercator, and Skitter datasets]

Dataset Name   Date   # Nodes   # Links   # Srcs   # Dsts

Pansiot-Grad   1995     3,888     4,857       12    1,270

Mercator       1999   228,263   320,149        1       NA

Skitter        2000     7,202    11,575        8    1,277

Testing C1

$H_0^{C1}$: The 1% highest-degree nodes occur at random with distance to source.

Pansiot-Grad: 93% of the highest-degree nodes are in N

Mercator: 90% of the highest-degree nodes are in N

Skitter: 84% of the highest-degree nodes are in N

Testing C2

[Figure: test of $H_0^{C2}$; log(Pr[X>x]) vs. log(Degree) for Near, Far, and All nodes in each of Pansiot-Grad, Mercator, and Skitter]

Some possible explanations

1. Degree distribution is uniform, but sampling is biased.

2. Degree distribution is non-uniform, and nodes further from the source really do have below-average degree.

3. Others?

Final Remarks on traceroute

• Mapping with (k,m)-traceroute methods is bias-prone.

• Rocketfuel [SMW:02] or similar methods may avoid some pitfalls of (k,m)-traceroute studies.

• Can we remove bias in a statistically sound way?

• An open question: Can we sample the degree of a router at random?

Outline

Demystification Manifesto

Case Study: Demystifying traceroute.

Next steps on demystification agenda

Demystification Agenda

0. Adopt a “no hype, no-nonsense” mindset.

1. Clear up technical misunderstandings and pitfalls associated with “router-level” measurements. Revisit incorrect conclusions, flawed methods, and broken theories.

2. Attempt to educate or re-educate as broad a community as possible. Arguably a focus for tonight’s discussion (?)

Breathing life into topology generation

Raw topology is analogous to a skeleton: presents coarse structure, but incomplete and inanimate; inadequate for conducting most simulations.

Flesh out by building annotated graphs:

• Label nodes with autonomous system (AS) ID’s.

• Label edges with link bandwidths.

• Label edges with latencies.

• Do this in a representative manner.

Animate the topology:

• Generate representative traffic workloads across the annotated graph.

• Consider other dynamic factors (churn, link failures).

Now we’re ready to conduct a simulation.
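One possible shape for an annotated graph, as a small sketch. The class names, field names, units, and toy values are all hypothetical; the AS numbers are taken from the private-use range. Nothing here is drawn from a measured topology.

```python
from dataclasses import dataclass, field

@dataclass
class Link:
    bandwidth_mbps: float               # edge label: link bandwidth
    latency_ms: float                   # edge label: latency

@dataclass
class AnnotatedTopology:
    as_id: dict = field(default_factory=dict)   # router -> AS ID
    links: dict = field(default_factory=dict)   # frozenset({u, v}) -> Link

    def add_router(self, router, as_id):
        self.as_id[router] = as_id

    def add_link(self, u, v, bandwidth_mbps, latency_ms):
        if u not in self.as_id or v not in self.as_id:
            raise KeyError("both endpoints must be added as routers first")
        self.links[frozenset((u, v))] = Link(bandwidth_mbps, latency_ms)

    def inter_as_links(self):
        """Edges whose endpoints carry different AS labels."""
        return [e for e in self.links
                if len({self.as_id[r] for r in e}) > 1]

# Hypothetical three-router example.
topo = AnnotatedTopology()
topo.add_router("r1", 64512)
topo.add_router("r2", 64512)
topo.add_router("r3", 64513)
topo.add_link("r1", "r2", bandwidth_mbps=1000, latency_ms=1)
topo.add_link("r2", "r3", bandwidth_mbps=100, latency_ms=20)
print(topo.inter_as_links())
```

Traffic workloads and dynamic factors such as churn or link failures would sit on top of a structure like this; they are not modeled in the sketch.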

Demystification Agenda

From my IPAM ’02 talk

3. Understanding the router-level topology alone is insufficient.

Much more insight from studying the annotated graph.
