1
Internet Performance Monitoring for the HENP Community
Les Cottrell & Warren Matthews – SLAC
www.slac.stanford.edu/grp/scs/net/talk/mon-pam-mar00/
Presented at the Passive & Active Measurement Workshop, University of Waikato, New Zealand, April 3, 2000
Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM), also supported by IUPAP
2
Overview
• Requirements
• PingER
• Validations
• Results
• Quality of Service
• IPv6 Monitoring
• Summary
3
HENP Requirements
• Large experiments with collaborators in over 50 countries
  – Hundreds or even > 1000 people on an experiment
• Data volumes of PetaBytes or even ExaBytes (10^18 bytes)
• Distributed access:
  – Bulk transfer to regional centers
  – Fast database queries
  – Smooth interactive sessions
• ICFA created a standing committee to review Inter-regional Connectivity
• Mainly use National Research Networks
• Set expectations, help troubleshoot, provide planning input
4
PingER
• Measurements from
  – 30 monitors in 15 countries
  – Over 500 remote hosts
  – Over 70 countries
  – Over 2100 monitor-remote site pairs
• Over 50% of HENP collaborator sites are explicitly monitored as remote sites by the PingER project
  – Atlas (37%), BaBar (68%), Belle (23%), CDF (73%), CMS (31%), D0 (60%), LEP (44%), Zeus (35%), PPDG (100%), RHIC (64%)
• Remainder covered by Beacons
  – Currently 56, extending to 76
5
Beacons & UK seen from ESnet
Sites in UK track one another, so can represent with a single site
2 Beacons in UK Indicates common source of congestion
Increased capacity by 155 times in 5 years
Effect of ACLs
Direct peering between JANet and ESnet
6
PingER Deployment Jan-00
7
Validations: Ping vs. Surveyor
Scatter plot of Ping RTT vs Surveyor RTT gives R^2 ~ 0.92
www.slac.stanford.edu/comp/net/wan-mon/surveyor-vs-pinger.html
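One way to obtain an R^2 figure for such a scatter plot is the squared Pearson correlation of the paired RTT samples. The sketch below assumes that definition; the function name `r_squared` and the sample values are hypothetical and are not data from the talk.

```python
from statistics import mean


def r_squared(xs, ys):
    """Squared Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sample has no defined correlation")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical paired RTT values in ms (illustrative only).
ping_rtt = [58.1, 60.4, 75.0, 162.3, 170.9]
surveyor_rtt = [57.5, 61.0, 73.8, 160.0, 172.5]
print(round(r_squared(ping_rtt, surveyor_rtt), 3))
```

Values close to 1 mean the two tools rank and scale the paths almost identically; the short term comparisons on the following slides are cases where the same statistic comes out low.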
8
RIPE vs Surveyor 1/2
Little short term correlation, even for time differences of < 2 secs
Little structure; outliers don’t match
9
RIPE vs Surveyor 2/2
Optimum agreement if RIPE is displaced by ~ 0.2 ms (packet size difference)
10
PingER vs AMP
Little obvious short term agreement (R^2 < 0.1)
Same if compare ping vs. ping
Avg Ping distribution agrees with AMP
Both show >= 95% of samples are 58-59 msec
R^2 > 0.95 for min & avg
Time series
11
Rate Limiting 1/2
• Have identified about 2% of sites as probably rate limiting
• Using Sting (Stefan Savage) & SynAck (SLAC) tools to identify: loss(sting or synack probes) << loss(ping)
• www.vincy.bg.ac.yu blocked 884 rounds of 10 ICMP packets each, out of 903
• islamabad-server2.comsats.net.pk blocked 554 out of 903
• leonis.nus.edu.sg blocked all non 56 Byte packets
• All low loss with sting or synack
12
Rate Limiting 2/2
“Tail-drop” behavior
• Rate-limiting kicks in after the first few packets and hence later packets are more likely to be dropped
Calculate slope and histogram slope frequency for all nodes, look at outliers (8)
Added as PingER metric. Still validating: some sites are consistent, others vary from month to month
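The slope idea can be sketched as follows, assuming each measurement round records which of its packets got a reply. The function names, the data layout and the sample rounds are hypothetical; the talk does not give the exact fit used for the PingER metric, so an ordinary least-squares slope stands in here.

```python
def loss_by_position(rounds):
    """Fraction of rounds in which the packet at each position was lost.

    `rounds` is a list of rounds; each round is a list of booleans,
    True meaning a reply was received for the packet at that position.
    """
    if not rounds:
        raise ValueError("no rounds supplied")
    n = len(rounds[0])
    if any(len(r) != n for r in rounds):
        raise ValueError("all rounds must have the same number of packets")
    return [sum(1 for r in rounds if not r[i]) / len(rounds) for i in range(n)]


def slope(ys):
    """Least-squares slope of ys against positions 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least 2 points")
    mx = (n - 1) / 2
    my = sum(ys) / n
    sxy = sum((i - mx) * (y - my) for i, y in enumerate(ys))
    sxx = sum((i - mx) ** 2 for i in range(n))
    return sxy / sxx


# Hypothetical host: the first 3 packets of each 10-packet round get through,
# the rest are dropped (the "tail-drop" pattern).
limited = [[True] * 3 + [False] * 7 for _ in range(4)]
print(slope(loss_by_position(limited)))
```

A clearly positive slope means later packets in a round are lost more often than early ones, which is the signature described above; a host with position-independent loss gives a slope near zero.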
13
Results: How are the U.S. Nets doing?
In general performance is good (i.e. loss <= 1%).
Edu (vBNS/Abilene) is catching up with ESnet
XIWT (70% .com) 3-5 times worse than ESnet or I2
14
Europe seen from U.S.
[Map: RTT labels 650 ms, 200 ms; loss labels 7% loss, 10% loss, 1% loss]
Legend: Monitor site; Beacon site (~10% sites); HENP country; Not HENP; Not HENP & not monitored
15
Asia seen from U.S.
[Map: loss labels 3.6% loss, 10% loss, 0.1% loss; RTT labels 640 ms, 450 ms, 250 ms]
16
Latin America, Africa & Australasia
[Map: loss labels 4% Loss, 2% Loss; RTT labels 350 ms, 700 ms, 170 ms, 220 ms]
17
Quality of Service: How to improve
• More bandwidth
  – Keep network load low (< 30%)
  – Costs (at least in the W) are coming down dramatically, but non-trivial to keep up
• Reserved/managed bandwidth, generally on ATM via PVCs today
• Differentiated services
18
Effect of more & managed bandwidth
German Universities as good as DESY after Oct-99 upgrade
DFN closes Perryman POP, loses direct ESnet peering
Peering re-established via Dante @ 60 Hudson
RTT
Loss
19
RTT from ESnet to Groups of Sites
ITU G.114 300 ms RTT limit for voice
20
Loss seen from ESnet to groups of Sites
ITU limit for loss
21
Bulk transfer - Performance Trends
Bandwidth TCP < 1460/(RTT * sqrt(loss))
Note: E. Europe not catching up
ESnet flattening out
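The bound on this slide can be evaluated directly from a measured RTT and loss rate. The sketch below assumes RTT is given in seconds and loss as a fraction between 0 and 1, so the result is in bytes per second; the function name `tcp_bandwidth_bound` and the example path are hypothetical.

```python
import math


def tcp_bandwidth_bound(rtt_s, loss, mss_bytes=1460):
    """Upper bound on TCP throughput in bytes/second: MSS / (RTT * sqrt(loss)).

    rtt_s is the round trip time in seconds, loss the packet loss fraction.
    The 1460 byte default is the segment size used on the slide.
    """
    if rtt_s <= 0:
        raise ValueError("RTT must be positive")
    if not 0 < loss <= 1:
        raise ValueError("loss must be a fraction in (0, 1]")
    return mss_bytes / (rtt_s * math.sqrt(loss))


# Hypothetical path: 200 ms RTT with 1% loss.
bound = tcp_bandwidth_bound(0.200, 0.01)
print(f"{bound * 8 / 1000:.0f} kbit/s")
```

Because loss enters as a square root, halving the RTT helps as much as cutting loss by a factor of four, which is one reason long, lossy international paths fare so poorly for bulk transfer.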
22
Interactive apps - Jitter
SLAC<=>CERN two-way instantaneous packet delay variation
[Histogram: frequency vs. ping inter packet delay difference in msec. (-100 to 100), with Gaussian overlay]
Average = -0.03 msec.
Std dev = 35 msec.
Median = 0 msec.
IQR = 29 msec.
Loss = 0.3%
1000 samples
Gaussian-prob = 79*exp(-x**2/(2*(IQR/2)**2))
IPDD(i) = RTT(i) - RTT(i-1)
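The IPDD definition and its IQR can be computed from a series of ping RTTs as sketched below. Two things are assumptions rather than details from the talk: lost packets are represented as `None` and any difference that spans a loss is skipped, and the quartiles use the default method of Python's `statistics.quantiles` (Python 3.8 or later). The sample RTTs are hypothetical.

```python
from statistics import quantiles


def ipdd(rtts):
    """IPDD(i) = RTT(i) - RTT(i-1) for consecutive received packets.

    `rtts` holds RTTs in msec, with None for a lost packet. Pairs that
    include a lost packet are skipped (an assumption of this sketch).
    """
    return [b - a for a, b in zip(rtts, rtts[1:])
            if a is not None and b is not None]


def iqr(values):
    """Inter-quartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


# Hypothetical RTT series in msec, with one lost packet.
rtts = [160.0, 172.5, None, 158.0, 161.0, 190.0, 159.5]
diffs = ipdd(rtts)
print(diffs, iqr(diffs))
```

The IQR is used as the spread measure because it is insensitive to the occasional very large delay difference, unlike the standard deviation.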
24
SLAC-CERN Jitter
IQR(ipdv) between CERN & SLAC from Surveyor measurements (12/15/98 & medians for Dec-98)
[Plot: IQR(ipdv) in msec. (0.1 to 100) vs. time since midnight (GMT). Series: IQR(ipdv) CERN>SLAC; IQR(ipdv) SLAC>CERN; Monthly IQR(ipdv) CERN>SLAC; Monthly IQR(ipdv) SLAC>CERN. Marked: ITU/TIPHON delay jitter threshold (75 ms)]
25
Voice over IP: Reachability
Within N. America & W. Europe, loss, RTT and jitter are acceptable for VoIP
But what about reachability?
26
Availability – Outage Probability
Surveyor probes randomly 2/second
Measure time (outage length) for which consecutive probes don’t get through
http://www-iepm.slac.stanford.edu/monitoring/surveyor/outage.html
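Outage lengths can be extracted from a probe log as sketched below. This assumes each probe is recorded as a (send time in seconds, received) pair, and it measures an outage from the first lost probe to the next probe that gets through; that convention, the function name and the sample log are assumptions of this sketch, not the definition used on the page above.

```python
def outage_lengths(probes):
    """Durations in seconds of runs of consecutive lost probes.

    `probes` is a list of (send_time_s, received) tuples sorted by time.
    An outage runs from the first lost probe to the next received probe.
    A run of losses still open at the end of the log is not reported.
    """
    outages = []
    start = None
    for t, ok in probes:
        if not ok and start is None:
            start = t          # first lost probe opens an outage
        elif ok and start is not None:
            outages.append(t - start)  # next success closes it
            start = None
    return outages


# Hypothetical probe log at roughly 2 probes per second.
log = [(0.0, True), (0.5, False), (1.0, False), (1.5, True),
       (2.0, False), (2.5, True)]
print(outage_lengths(log))
```

A histogram of these lengths over many days gives an empirical outage probability as a function of outage duration.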
27
Error free seconds
Typical US phone company objectives are 99.999%
http://www-iepm.slac.stanford.edu/monitoring/surveyor/err-sec.html
What do we see for the Internet using Surveyor measurements?
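An error free seconds percentage can be derived from the same kind of probe log. The sketch assumes a second counts as error free when every probe sent in it got through, and that seconds in which no probe was sent are left out of the total; both choices, the function name and the sample log are hypothetical.

```python
import math
from collections import defaultdict


def error_free_seconds(probes):
    """Percentage of one-second intervals in which no probe was lost.

    `probes` is a list of (send_time_s, received) tuples. Seconds with
    no probes at all are not counted (an assumption of this sketch).
    """
    lost_in = defaultdict(bool)
    for t, ok in probes:
        sec = math.floor(t)
        lost_in[sec] = lost_in[sec] or not ok
    if not lost_in:
        raise ValueError("no probes supplied")
    clean = sum(1 for lost in lost_in.values() if not lost)
    return 100.0 * clean / len(lost_in)


# Hypothetical log covering four seconds, one of which has a lost probe.
log = [(0.1, True), (0.6, True), (1.2, False), (1.7, True),
       (2.3, True), (3.4, True)]
print(error_free_seconds(log))
```

With about 2 probes per second, a second with no observed loss may still have contained errors, so the figure is an optimistic estimate when compared against the 99.999% objective.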
28
IPv6 Monitoring
• Small amount of bandwidth carved off ESnet connection to provide native IPv6 service to SLAC
• Production IPv6 allocation
• 2001:400:0808::/48
• Addresses are in DNS
• VLAN allows deployment throughout SLAC
[Diagram labels: 6REN, RTR-IPv6, PingER6, Scylla, Charybdis, Switch, IPv6 VLAN, SLAC]
29
Porting PingER to PingER6
• Recompiled Linux 2.2.5-15 (Red Hat 6.0) kernel with IPv6 support
• Downloaded & installed inet-apps (including ping) from inner.net and patch for glibc-2.1 systems
• Wrote Perl module to provide IPv6 DNS lookup
• Got remote IPv6 sites to monitor
  – 10 countries, 40 sites
• Currently one monitoring site at SLAC
  – 6TAP to start soon
  – China?
Remote Sites
30
How does it look?
[Plot: % loss (0 to 40) by date; annotation: The weekend]
[Plot: RTT between SLAC and Purdue in Nov/Dec 1999 (0 to 400); series: IPv6, IPv4]
Much of current 6BONE is congested
31
Summary
• Long term agreement between AMP, PingER, Surveyor, & RIPE
  – need persistent structure (e.g. congestion or route changes) for short term point by point agreement
• Rate limiting still a minor effect, but could become a problem; trying to get a good signature
• International performance from US to sites outside W. Europe, JP, KR, SG, TW is generally poor to bad
• Managed bandwidth can be a big help
• ESnet & Internet 2 doing well, even for VoIP, except reachability has a way to go
• PingER ported to IPv6; 6BONE congested
32
More Information
• This talk:
  – www.slac.stanford.edu/grp/scs/net/talk/mon-pam-mar00/
• IEPM/PingER home site
  – www-iepm.slac.stanford.edu/
• Comparison of Surveyor & RIPE & PingER
  – www.slac.stanford.edu/comp/net/wan-mon/surveyor-vs-ripe.html
  – www.slac.stanford.edu/comp/net/wan-mon/surveyor-vs-pinger.html
• Detecting ICMP Rate Limiting
  – www.slac.stanford.edu/grp/scs/net/talk/limiting-feb00/
• IPv6 Monitoring
  – www.slac.stanford.edu/grp/scs/net/talk/pinger6/