1
Server-based Characterization and Inference of Internet Performance
Venkat Padmanabhan
Lili Qiu
Helen Wang
Microsoft Research
UCLA/IPAM Workshop March 2002
2
Outline
• Overview
• Server-based characterization of performance
• Server-based inference of performance
  – Passive Network Tomography
• Summary and future work
3
Overview
• Goals
  – characterize end-to-end performance
  – infer characteristics of interior links
• Approach: server-based monitoring
  – passive monitoring is relatively inexpensive
  – enables large-scale measurements
  – diversity of network paths
4
[Figure: a Web server sends DATA to clients and receives ACKs in return]
5
Research Questions
• Server-based characterization of end-to-end performance
  – correlation with topological metrics
  – spatial locality
  – temporal stability
• Server-based inference of internal link characteristics
  – identification of lossy links
6
Related Work
• Server-based passive measurement
  – 1996 Olympics Web server study (Berkeley, 1997 & 1998)
  – characterization of TCP properties (Allman 2000)
• Active measurement
  – NPD (Paxson 1997)
  – stationarity of Internet path properties (Zhang et al. 2001)
7
Experiment Setting
• Packet sniffer at microsoft.com
  – 550 MHz Pentium III
  – sits on spanning port of Cisco Catalyst 6509
  – packet drop rate < 0.3%
  – traces up to 2+ hours long, 20-125 million packets, 50-950K clients
• Traceroute source
  – sits on a separate Microsoft network, but all external hops are shared
  – infrequent and in the background
8
Topological Metrics and Loss Rate
[Figure: two plots of E2E packet loss rate, one against router hop count (0 to 30) and one against AS hop count (0 to 10)]
Topological distance is a poor predictor of packet loss rate.
All links are not equal => need to identify the lossy links
9
Spatial Locality
[Figure: cumulative probability (%) of the difference in loss rate (in buckets, 0 to 4) for clients grouped by Subnet, BGP Prefix, AS, and Random]
Spatial locality => there may be a shared cause for packet loss
• Do clients in the same cluster see similar loss rates?
• Loss rate is quantized into buckets
  – 0-0.5%, 0.5-2%, 2-5%, 5-10%, 10-20%, 20+%
  – suggested by Zhang et al. (IMW 2001)
• Focus on lossy clusters
  – average loss rate > 5%
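The bucket quantization above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and placing a rate that falls exactly on a boundary (for example 0.5%) in the higher bucket is an assumption, since the slide does not say how ties are handled.

```python
from bisect import bisect_right

# Upper edges of the first five buckets, as loss-rate fractions.
# Buckets: 0-0.5%, 0.5-2%, 2-5%, 5-10%, 10-20%, 20+%.
BUCKET_EDGES = [0.005, 0.02, 0.05, 0.10, 0.20]

def loss_bucket(loss_rate):
    """Return the bucket index (0..5) for a loss rate given as a fraction.

    Assumption: a rate exactly on an edge goes to the higher bucket.
    """
    return bisect_right(BUCKET_EDGES, loss_rate)

def bucket_difference(loss_a, loss_b):
    """Difference in loss rate between two clients, measured in buckets."""
    return abs(loss_bucket(loss_a) - loss_bucket(loss_b))
```

Comparing clients by bucket index rather than by raw loss rate keeps small fluctuations within a bucket from counting as a difference.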
10
Temporal Stability
• Loss rate again quantized into buckets
• Metric of interest: stability period (i.e., time until transition into new bucket)
• Median stability period ≈ 10 minutes
• Consistent with previous findings based on active measurements
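A minimal sketch of how a stability period could be measured from a time series of loss-rate samples, in Python. The input format (timestamped samples for one client or cluster), the function name, and the choice to drop the final unfinished period are all assumptions made for illustration; the bucket edges are the ones listed on the Spatial Locality slide.

```python
from bisect import bisect_right
from statistics import median

# Bucket edges as fractions: 0-0.5%, 0.5-2%, 2-5%, 5-10%, 10-20%, 20+%.
BUCKET_EDGES = [0.005, 0.02, 0.05, 0.10, 0.20]

def stability_periods(samples):
    """Return the times spent in a bucket before moving to a new one.

    samples: list of (timestamp_seconds, loss_rate) sorted by time.
    The last period is dropped because its end is not observed.
    """
    periods = []
    start_time = None
    current = None
    for timestamp, loss in samples:
        bucket = bisect_right(BUCKET_EDGES, loss)
        if current is None:
            start_time, current = timestamp, bucket
        elif bucket != current:
            periods.append(timestamp - start_time)
            start_time, current = timestamp, bucket
    return periods

# Made-up samples, one per minute or so, to show the call pattern.
example = [(0, 0.01), (60, 0.015), (120, 0.03), (180, 0.03), (300, 0.001)]
example_median = median(stability_periods(example))
```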
11
Putting it all together
• All links are not equal => need to identify the lossy links
• Spatial locality of packet loss rate => lossy links may well be shared
• Temporal stability => worthwhile to try and identify the lossy links
12
Passive Network Tomography
• Goal: determine characteristics of internal network links using end-to-end, passive measurements
• We focus on the link loss rate metric
  – primary goal: identifying lossy links
• Why is this interesting?
  – locating trouble spots in the network
  – keeping tabs on your ISP
  – server placement and server selection
13
[Figure: a Web server reached through several ISPs (Sprint, AT&T, UUNET, C&W, Qwest, AOL, Earthlink), with the captions "Darn, it's slow!" and "Why is it so slow?"]
14
Related Work
• MINC (Caceres et al. 1999)
  – multicast-based active probing
• Striped unicast (Duffield et al. 2001)
  – unicast-based active probing
• Passive measurement (Coates et al. 2002)
  – look for back-to-back packets
• Shared bottleneck detection
  – Padmanabhan 1999, Rubenstein et al. 2000, Katabi et al. 2001
15
Active Network Tomography
[Figure: two trees, each with source S and receivers A and B; one shows multicast probes, the other striped unicast probes]
16
Problem Formulation
[Figure: tree topology with the server at the root, links l1-l8, and clients at the leaves with end-to-end loss rates p1-p5]
Collapse linear chains into virtual links
(1-l1)*(1-l2)*(1-l4) = (1-p1)
(1-l1)*(1-l2)*(1-l5) = (1-p2)
…
(1-l1)*(1-l3)*(1-l8) = (1-p5)
Under-constrained system of equations
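To see why the system is under-constrained, the short Python sketch below evaluates the three path equations that are written out above for two different link assignments. The link loss values are made up for illustration, and only the paths for p1, p2 and p5 are used because those are the ones the slide spells out.

```python
import math

# Links on each path, taken from the three equations written out above.
PATHS = {"p1": [1, 2, 4], "p2": [1, 2, 5], "p5": [1, 3, 8]}

def path_loss(link_loss, links):
    """End-to-end loss rate: 1 minus the product of per-link success rates."""
    return 1.0 - math.prod(1.0 - link_loss[i] for i in links)

def end_to_end(link_loss):
    """Loss rate seen by each client for a given assignment of link losses."""
    return {client: path_loss(link_loss, links) for client, links in PATHS.items()}

# Two hypothetical assignments: all loss on the shared link l1,
# or the same loss moved one level down to l2 and l3.
shared = {1: 0.1, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 8: 0.0}
split = {1: 0.0, 2: 0.1, 3: 0.1, 4: 0.0, 5: 0.0, 8: 0.0}
```

Both assignments yield a 10% loss rate at every client, so the end-to-end observations alone cannot tell them apart.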
17
#1: Random Sampling
• Randomly sample the solution space
• Repeat this several times
• Draw conclusions based on overall statistics
• How to do random sampling?
  – determine loss rate bound for each link using best downstream client
  – iterate over all links:
    • pick loss rate at random within bounds
    • update bounds for other links
• Problem: little tolerance for estimation error
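The sampling procedure above might look roughly like the following Python sketch. Several details are assumptions rather than something the slide states: links are visited top-down from the server, a non-leaf link draws its loss rate uniformly within its bound, a leaf link takes whatever residual loss its client still needs, client loss rates are strictly below 1, and a link counts as lossy above a 5% threshold. The tree contains only the links named in the three path equations that are written out.

```python
import random

# Hypothetical tree: link -> child links; leaf links end at a client.
CHILDREN = {1: [2, 3], 2: [4, 5], 3: [8], 4: [], 5: [], 8: []}
LEAF_CLIENT = {4: "p1", 5: "p2", 8: "p5"}

def min_downstream_loss(link, client_loss):
    """Loss rate of the best (least lossy) client below this link."""
    if not CHILDREN[link]:
        return client_loss[LEAF_CLIENT[link]]
    return min(min_downstream_loss(c, client_loss) for c in CHILDREN[link])

def sample_solution(client_loss, rng, link=1, upstream_success=1.0, out=None):
    """Draw one random assignment of link loss rates consistent with the clients."""
    out = {} if out is None else out
    best = min_downstream_loss(link, client_loss)
    # Loss still unexplained for the best downstream client.
    bound = max(0.0, 1.0 - (1.0 - best) / upstream_success)
    if CHILDREN[link]:
        out[link] = rng.uniform(0.0, bound)   # pick at random within bounds
    else:
        out[link] = bound                     # leaf: residual is determined
    for child in CHILDREN[link]:
        # Passing the updated success rate tightens the bounds downstream.
        sample_solution(client_loss, rng, child,
                        upstream_success * (1.0 - out[link]), out)
    return out

def lossy_fraction(client_loss, threshold=0.05, runs=1000, seed=0):
    """Fraction of sampled solutions in which each link exceeds the threshold."""
    rng = random.Random(seed)
    counts = {link: 0 for link in CHILDREN}
    for _ in range(runs):
        for link, loss in sample_solution(client_loss, rng).items():
            counts[link] += loss > threshold
    return {link: n / runs for link, n in counts.items()}
```

Because every sampled solution reproduces the measured client loss rates exactly, any error in those measurements carries straight into the samples, which matches the weakness noted in the last bullet.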
[Figure: tree topology with the server at the root, links l1-l8, and clients p1-p5]
18
#2: Linear Optimization
Goals
• Parsimonious explanation
• Robust to estimation error
Li = log(1/(1-li)), Pj = log(1/(1-pj))
minimize Σi Li + Σj |Sj|
subject to
  L1+L2+L4 + S1 = P1
  L1+L2+L5 + S2 = P2
  …
  L1+L3+L8 + S5 = P5
  Li >= 0
Can be turned into a linear program
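One standard way to obtain a linear program is sketched below in LaTeX. Taking logarithms of a path equation such as (1-l1)*(1-l2)*(1-l4) = (1-p1) gives the additive form L1+L2+L4 = P1, and the slack Sj absorbs estimation error. The auxiliary variables T_j are an assumption introduced here to replace the absolute values; the slide does not say which reformulation was used.

```latex
\begin{aligned}
\text{minimize}\quad   & \sum_i L_i + \sum_j T_j \\
\text{subject to}\quad & \sum_{i \in \mathrm{path}(j)} L_i + S_j = P_j
                         && \text{for each client } j, \\
                       & -T_j \le S_j \le T_j
                         && \text{for each client } j, \\
                       & L_i \ge 0
                         && \text{for each link } i.
\end{aligned}
```

At the optimum each T_j equals |S_j|, so the objective matches the original one while every constraint stays linear.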
[Figure: tree topology with the server at the root, links l1-l8, and clients p1-p5]
19
#3: Bayesian Inference
• Basics:
  – D: observed data
    • sj: # packets successfully sent to client j
    • fj: # packets that client j fails to receive
  – Θ: unknown model parameters
    • li: packet loss rate of link i
  – Goal: determine the posterior P(Θ|D)
  – inference is based on loss events, not loss rates
• Bayes theorem
  – P(Θ|D) = P(D|Θ)P(Θ)/∫P(D|Θ)P(Θ)dΘ
  – hard to compute since Θ is multidimensional
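For concreteness, one natural form for the likelihood P(D|Θ) is written out below in LaTeX. It assumes that packet losses are independent across packets and links, which the slide does not state explicitly; the symbols sj, fj, li and pj are the ones defined above.

```latex
P(D \mid \Theta) = \prod_j (1 - p_j)^{s_j} \, p_j^{f_j},
\qquad
1 - p_j = \prod_{i \in \mathrm{path}(j)} (1 - l_i)
```

Under this form each client contributes its own counts of delivered and lost packets, which is what basing the inference on loss events rather than loss rates amounts to.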
[Figure: tree topology with the server at the root, links l1-l8, and clients at the leaves annotated with (s1,f1) through (s5,f5)]
20
Gibbs Sampling
• Markov Chain Monte Carlo (MCMC)
  – construct a Markov chain whose stationary distribution is P(Θ|D)
• Gibbs Sampling: defines the transition kernel
  – start with an arbitrary initial assignment of li
  – consider each link i in turn
  – compute P(li|D) assuming lj is fixed for j≠i
  – draw sample from P(li|D) and update li
  – after burn-in period, we obtain samples from the posterior P(Θ|D)
21
Gibbs Sampling Algorithm
1) Initialize link loss rates arbitrarily
2) For j = 1 : burn-in
     for each link i
       compute P(li|D, {li’}), where li is the loss rate of link i and {li’} = {lj : j ≠ i}
3) For j = 1 : realSamples
     for each link i
       compute P(li|D, {li’})
Use all the samples obtained at step 3 to approximate P(Θ|D)
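The algorithm above can be sketched in Python as follows. This is an illustrative variant, not the authors' implementation: it assumes a uniform prior, a likelihood in which each packet is lost independently, and a discretized grid of candidate loss rates so that each conditional P(li|D, {li’}) can be normalized and sampled directly. The names `log_likelihood`, `gibbs`, `paths`, and `data` are hypothetical.

```python
import math
import random

def log_likelihood(link_loss, paths, data):
    """log P(D|Theta), assuming independent per-packet losses."""
    total = 0.0
    for client, links in paths.items():
        s, f = data[client]                      # (# delivered, # lost)
        success = math.prod(1.0 - link_loss[i] for i in links)
        total += s * math.log(success) + f * math.log(1.0 - success)
    return total

def gibbs(paths, data, burn_in=200, real_samples=500, grid_size=50, seed=0):
    """Return real_samples draws of {link: loss rate} after burn-in."""
    rng = random.Random(seed)
    # Candidate loss rates strictly inside (0, 1), so the logs stay finite.
    grid = [(k + 0.5) / grid_size for k in range(grid_size)]
    links = sorted({i for links_j in paths.values() for i in links_j})
    loss = {i: rng.choice(grid) for i in links}  # step 1: arbitrary start
    samples = []
    for sweep in range(burn_in + real_samples):
        for i in links:
            # Conditional of l_i on the grid, all other links held fixed.
            logp = []
            for candidate in grid:
                loss[i] = candidate
                logp.append(log_likelihood(loss, paths, data))
            top = max(logp)
            weights = [math.exp(v - top) for v in logp]
            # Draw a sample from the conditional and update l_i.
            loss[i] = rng.choices(grid, weights=weights)[0]
        if sweep >= burn_in:                     # step 3: keep real samples
            samples.append(dict(loss))
    return samples
```

Averaging a link's value over the returned samples gives an estimate of its posterior mean loss rate; a finer grid gives better resolution at a proportionally higher cost per sweep.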
22
Experimental Evaluation
• Simulation experiments
• Internet traffic traces
23
Simulation Experiments
• Advantage: no uncertainty about link loss rate
• Methodology
  – Topologies used:
    • randomly-generated: 20 - 3000 nodes, max degree = 5-50
    • real topology obtained by tracing paths to microsoft.com clients
  – randomly-generated packet loss events at each link
    • a fraction f of the links are good, and the rest are “bad”
    • LM1: good links: 0 – 1%, bad links: 5 – 10%
    • LM2: good links: 0 – 1%, bad links: 1 – 100%
• Goodness metrics:
  – Coverage: # correctly inferred lossy links
  – False positives: # incorrectly inferred lossy links
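A small Python sketch of the two loss models and the two goodness metrics. The slide gives only the ranges, so drawing each rate uniformly within its range and marking each link good independently with probability f are assumptions; the function names are hypothetical.

```python
import random

# Loss-rate ranges (as fractions) for good and bad links in each model.
LOSS_MODELS = {
    "LM1": {"good": (0.0, 0.01), "bad": (0.05, 0.10)},
    "LM2": {"good": (0.0, 0.01), "bad": (0.01, 1.00)},
}

def assign_loss_rates(links, f, model, rng):
    """Return (loss rate per link, set of bad links) for one simulated network.

    Assumption: each link is good with probability f, and its rate is
    drawn uniformly within the model's range for its kind.
    """
    rates, bad = {}, set()
    for link in links:
        kind = "good" if rng.random() < f else "bad"
        low, high = LOSS_MODELS[model][kind]
        rates[link] = rng.uniform(low, high)
        if kind == "bad":
            bad.add(link)
    return rates, bad

def goodness(true_lossy, inferred_lossy):
    """Return (coverage, false positives) as counts of links."""
    true_lossy, inferred_lossy = set(true_lossy), set(inferred_lossy)
    coverage = len(true_lossy & inferred_lossy)
    false_positives = len(inferred_lossy - true_lossy)
    return coverage, false_positives
```

For example, if links 1, 2 and 3 are truly lossy and an inference method reports links 2, 3 and 7, then `goodness({1, 2, 3}, {2, 3, 7})` gives a coverage of 2 and 1 false positive.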
24
Simulation Results
1000-node random topologies (d=10, f=0.95)
[Figure: bar chart comparing Random, LP, and Gibbs; y-axis: # links (0 to 160); bars: # true lossy links, # correctly identified lossy links, # false positive]
25
Simulation Results
1000-node random topologies (d=10, f=0.5)
[Figure: bar chart comparing Random, LP, and Gibbs; y-axis: # links (0 to 600); bars: # true lossy links, # correctly identified lossy links, # false positive]
26
Simulation Results
Gibbs sampling for a 1000-node random topology (d = 10, f = 0.5)
[Figure: line plot; x-axis: 0 to 1000, y-axis: # links (0 to 600); curves: # correctly identified lossy links, # true lossy links, # false positive]
High confidence in top few inferences
27
Trade-off
Techniques        Coverage   False Positive   Computation
Random sampling   High       High             Low
LP                Medium     Low              Medium
Gibbs sampling    High       Low              High
28
Internet Traffic Traces
• Challenge: validation
  – Divide client traces into two: tomography set and validation set
  – Tomography data set => loss inference
  – Validation set => check if clients downstream of the inferred lossy links experience high loss
• Results
  – false positive rate is between 5 – 30%
  – likely candidates for lossy links:
    • links crossing an inter-AS boundary
    • links having a large delay (e.g. transcontinental links)
    • links that terminate at clients
  – example lossy links:
    • San Francisco (AT&T) to Indonesia (Indo.net)
    • Sprint to PacBell in California
    • Moscow to Tyumen, Siberia (Sovam Teleport)
29
Summary
• Poor correlation between topological metrics & performance
• Significant spatial locality and temporal stability
• Passive network tomography is feasible
• Tradeoff between computational cost and accuracy
• Future directions
  – real-time inference
  – selective active probing
• Acknowledgements:
  – MSR: Dimitris Achlioptas, Christian Borgs, Jennifer Chayes, David Heckerman, Chris Meek, David Wilson
  – Infrastructure: Rob Emanuel, Scott Hogan
http://www.research.microsoft.com/~padmanab