network problem diagnosis for non-networkers
http://www.slac.stanford.edu/grp/scs/net/talk10/diagnosis.pptx
SPACE Weather School: Basic theory & hands-on experience
Network Problem Diagnosis for Non-networkers
Les Cottrell – SLAC
University of Helwan / Egypt, Sept 18 – Oct 3, 2010
Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM), also supported by IUPAP
Slide: 2 Les Cottrell, SLAC
Overview
Goal: provide a practical guide to debugging common problems
  Why is diagnosis difficult yet important?
  Local host
  Ping, Traceroute, PingRoute
  Looking at time series
  Locating bottlenecks
  Correlation of problems with routes
  More tools and problems
  Where is a node
  Who do you tell, what do you say?
  Case studies and More Information
Slide: 3
Why is diagnosis difficult?
Internet's evolution as a composition of independently developed and deployed protocols, technologies, and core applications
Diversity, highly unpredictable, hard to find “invariants”
Rapid evolution & change, no equilibrium so far
  Findings may be out of date
Measurement/diagnosis not high on vendors' list of priorities
  Resources/skill focus on more interesting and profitable issues
  Tools lacking or inadequate
  Implementations are flaky & not fully tested with new releases
Slide: 4
Add to that …
Distributed systems are very hard
  “A distributed system is one in which I can't get my work done because a computer I've never heard of has failed.” Butler Lampson
Network is deliberately transparent
The bottlenecks can be in any of the following components:
  the applications
  the OS
  the disks, NICs, bus, memory, etc. on sender or receiver
  the network switches and routers, and so on
Problems may not be logical
  Most problems are operator errors, configurations, bugs
When building distributed systems, we often observe unexpectedly low performance
  the reasons for which are usually not obvious
Just when you think you’ve cracked it, in steps security
  Firewall, NAT boxes etc.
  Block pings, traceroute looks like port scan, diagnostic tool ports are blocked …
  ISPs worried about providing access to core, making results public, & privacy issues
Slide: 5
Sources of problems
Host “errors”
  TCP buffers, heavy utilization …
Ethernet duplex and speed mismatch between your host and the network device
Misconfigured router/switches
  Including routing errors, especially for backup paths
Bad equipment, wiring/fiber problem
Congestion
Slide: 6
First steps
Command prompt, find out about network connection
  ipconfig /?
  ipconfig
    Default gives IP address, gateway/1st router, subnet mask of all your network devices (Ethernet, wireless, bluetooth…)
    Make a note of the gateway
Icon at bottom right of screen
  Allows asking of questions and tries to provide assistance
Go to Command prompt and type
  ping /?
Slide: 7
Ping on Windows
C:\Users\cottrell>ping -n 4 -l 32 mail.alex.edu.ca
Pinging mail.alex.edu.ca [67.215.65.132] with 32 bytes of data:
Reply from 67.215.65.132: bytes=32 time=80ms TTL=45
Reply from 67.215.65.132: bytes=32 time=85ms TTL=45
Reply from 67.215.65.132: bytes=32 time=83ms TTL=45
Reply from 67.215.65.132: bytes=32 time=90ms TTL=43
Ping statistics for 67.215.65.132:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 80ms, Maximum = 90ms, Average = 84ms
[Figure callouts: Specify number pings; Size of packet; target; IP address of target; RTT]
Try: ping -t. What use is ping -f?
Slide: 8
Anomalies
Pings blocked
C:\Users\cottrell>ping www.lbl.gov
Pinging www.lbl.gov [128.3.41.105] with 32 bytes of data:
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Ping statistics for 128.3.41.105:
    Packets: Sent = 4, Received = 0, Lost = 4 (100% loss),
Enable Telnet by following these steps:
  Start => Control Panel => Programs And Features => Turn Windows features on or off => Check Telnet Client
  Hit OK
Now try:
16cottrell@pinger:~>telnet www.lbl.gov 80
  Blank screen: web server waiting to talk to you
  Hit ctrl ] and type exit
Compare with another port (non-existent application)
C:\Users\cottrell>telnet www.lbl.gov 1010
Connecting To www.lbl.gov...Could not open connection to the host, on port 1010: Connect failed
C:\Users\cottrell>
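The telnet test above (does a TCP port answer even though pings are blocked?) can also be scripted. A minimal sketch, assuming Python 3 is available; the function name tcp_port_open and the 3 second timeout are illustrative choices, not part of the original talk:

```python
import socket

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    This mirrors 'telnet host port': success means some application is
    listening; failure means refused, filtered by a firewall, or unreachable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and DNS failures
        return False
```

For example, tcp_port_open("www.lbl.gov", 80) corresponds to the first telnet command on the slide and tcp_port_open("www.lbl.gov", 1010) to the second. A False result alone does not say whether the port is closed or blocked on the way.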
Slide: 9
Diversion on ports
Applications such as telnet (23), ssh (22), www (80, 443), DNS are assigned a “port” on the host
Sometimes written as for example www.slac.stanford.edu:80
See http://www.iana.org/assignments/port-numbers for what applications use which ports
Slide: 10
Try:
1. ping localhost
2. ping mail.alex.edu.eg
3. ping sohag-univ.edu.eg
4. ping www.minia.edu.eg
5. ping www.alex.edu.eg
Slide: 11
3rd party ping (via Looking Glass)
Find servers:
  http://www.cogentco.com/us/network_lookingglass.php
  http://www.ip.tiscali.net/lg/
  http://stat.qwest.net/cgi-bin/jlg-new-asia.pl
  http://www.slac.stanford.edu/comp/net/wan-mon/viper/tulip_map.htm
Slide: 12
RTT from California to world
[Figure: RTT (ms) versus longitude (degrees), plus a frequency histogram of RTT (ms). Source = Palo Alto CA. Labeled regions: W. Coast US, E. Coast US, Europe & S. America, Europe, Brazil. 300 ms is marked, and a line is labeled 0.3*0.6c.]
Data from CAIDA Skitter project
Slide: 13
Geostationary Satellite links
Each bar represents min RTT for 1 country
Satellite flies 24k miles high, RTT ~400 ms
Note cut off between satellite and terrestrial
[Figure: bar chart of min RTT (ms, 0 to 500) by country, with terrestrial and satellite groups marked.]
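Why a geostationary hop puts a floor under the RTT can be checked with a little arithmetic. A rough sketch, assuming an altitude of 35,786 km, the satellite directly overhead at both ground stations, and one satellite hop in each direction; these figures and the function name are assumptions for illustration, not taken from the slide:

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum
GEO_ALTITUDE_KM = 35_786           # assumed geostationary altitude above the equator

def min_satellite_rtt_ms(altitude_km=GEO_ALTITUDE_KM, hops_each_way=1):
    """Lower bound on RTT in ms, counting propagation delay only."""
    # Each direction goes up to the satellite and back down again.
    one_way_km = 2 * altitude_km * hops_each_way
    return 2 * one_way_km / SPEED_OF_LIGHT_KM_S * 1000.0
```

With these assumptions the bound comes out a little under 480 ms, which is why satellite paths separate so clearly from terrestrial ones. Real paths add queuing, processing and the terrestrial segments on top.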
Slide: 14
Traceroute
Rough traceroute algorithm
  ttl=1; # To 1st router
  port=33434; # Starting UDP port
  max=30; # default maximum number of hops
  while ttl <= max {
    send UDP packet to host:port with ttl
    get response
    if time exceeded: note roundtrip time
    else if UDP port unreachable: note roundtrip time, destination reached
    else (no response): print *
    print output
    if destination reached: quit
    ttl++; port++
  }
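The loop above can be tried without raw sockets or root access by simulating the network. A sketch only: the path list, the probe function and traceroute_sim are hypothetical stand-ins, and the destination is assumed to answer with "port unreachable":

```python
def probe(path, ttl):
    """Simulate one UDP probe sent with the given TTL.

    path lists the hops in order, ending with the destination.
    A hop of None models a router that never answers.
    """
    if ttl < len(path):
        hop = path[ttl - 1]
        if hop is None:
            return ("timeout", None)
        return ("time exceeded", hop)         # TTL ran out at this router
    return ("port unreachable", path[-1])     # probe reached the destination

def traceroute_sim(path, max_hops=30, start_port=33434):
    """Return (ttl, port, address or '*') for each hop, as traceroute would print."""
    results = []
    port = start_port
    for ttl in range(1, max_hops + 1):
        kind, addr = probe(path, ttl)
        results.append((ttl, port, "*" if kind == "timeout" else addr))
        if kind == "port unreachable":        # destination reached, stop
            break
        port += 1
    return results
```

Calling traceroute_sim(["10.13.11.1", "10.100.100.53", None, "193.227.16.29"]) gives one line per hop, with "*" for the silent third router, and stops at the destination.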
Slide: 15
Traceroute (tracert on Windows)
C:\Users\cottrell>tracert (gets help)
C:\Users\cottrell>tracert -h 30 mail.alex.edu.eg
Tracing route to mail.alex.edu.eg [193.227.16.29] over a maximum of 30 hops
  1    1 ms    1 ms    1 ms  10.13.11.1
  2    1 ms   <1 ms    1 ms  10.100.100.53
  3    1 ms   <1 ms   <1 ms  10.0.0.3
  4    1 ms    1 ms    1 ms  81.21.100.177
  5   53 ms   12 ms    1 ms  10.181.28.33
  6    2 ms   24 ms    2 ms  172.18.28.117
  7    5 ms    6 ms    6 ms  172.20.1.162
  8    6 ms    6 ms    8 ms  172.19.8.106
  9    *       *       *
 10    6 ms    6 ms    6 ms  mail.alex.edu.eg [193.227.16.29]
Try tracert www.lbl.gov
Why do the first hops take so long to reply?
Try tracert -d www.lbl.gov
[Figure callouts: Max hops; Target IP address; 3 RTTs; Router IP address; No response]
Slide: 16
Private address space
N.b. first few addresses are 10.x.y.z
Typically these are private (not known to the global Internet) IP addresses, that can be re-used at multiple sites
  See http://en.wikipedia.org/wiki/Private_network
Ranges
  10.0.0.0 – 10.255.255.255 (16M addresses, 24 bits)
  172.16.0.0 – 172.31.255.255 (1M addresses, 20 bits)
  192.168.0.0 – 192.168.255.255 (65K addresses, 16 bits)
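To check whether a hop in a traceroute falls in one of these ranges, the three blocks can be written in prefix notation and tested with Python's standard ipaddress module. A small sketch; the name is_rfc1918 is an illustrative choice, after RFC 1918, which defines these ranges:

```python
import ipaddress

# The three private ranges from the slide, in prefix notation.
PRIVATE_NETS = [
    ipaddress.ip_network("10.0.0.0/8"),       # 24 host bits
    ipaddress.ip_network("172.16.0.0/12"),    # 20 host bits
    ipaddress.ip_network("192.168.0.0/16"),   # 16 host bits
]

def is_rfc1918(addr):
    """Return True if addr lies in one of the three private ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in PRIVATE_NETS)

# Number of addresses in each range, for comparison with the slide.
SIZES = {str(net): net.num_addresses for net in PRIVATE_NETS}
```

In the tracert example, 10.13.11.1 and 172.20.1.162 are private, while 81.21.100.177 and 193.227.16.29 are not.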
Slide: 17
Traceroute from elsewhere
Traceroute to remote host
  Is the route direct, over commercial congested nets?
Reverse traceroute from remote host to you or 3rd party
  www.slac.stanford.edu/comp/net/wan-mon/traceroute-srv.html
  www.tracert.com/
  visualroute.visualware.com/ # requires Java
[Figure: Visualroute servers in Europe]
Slide: 18
Traceroute server results
Example: www.slac.stanford.edu/cgi-bin/nph-traceroute.pl
[Screenshot callouts: Security warning; Traceroute; Related info; Enter IP address or name; Your IP address; Your IP name]
Slide: 19
Warning
Some Linux versions have a bug that incorrectly IDs cksum error on MPLS links.
Make Pkt length >= 140, else get checksum errors (not a problem, just annoying).
e.g. on Linux: traceroute www.slac.stanford.edu 140
Slide: 20
Pingroute example
May help tell where losses start
Will need many pings if losses small
[Figure callouts: Routers may not respond; Start of losses?; But?; Start of sustained losses]
Slide: 21
Matt’s Traceroute (mtr)
Run traceroute, then ping each router n times
  helps identify where in route the problems start to occur
Routers may not respond to pings, or may treat pings directed at them differently to other packets
Get Matt’s TraceRoute (MTR) from www.bitwizard.nl/mtr/
  or pathping (built into Windows but inferior)
    Slower
    Less info
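Reading per-hop loss figures from mtr, pathping or pingroute comes down to finding the hop from which loss persists all the way to the destination. A sketch of that reasoning, assuming the per-hop loss percentages have already been collected; the function name and the 1% threshold are illustrative:

```python
def first_sustained_loss(loss_pct, threshold=1.0):
    """Return the 1-based hop where loss starts and persists to the last hop.

    loss_pct holds per-hop loss percentages in route order; None marks a
    router that did not respond at all. Returns None if no sustained loss.
    """
    start = None
    for hop, loss in enumerate(loss_pct, start=1):
        if loss is None:
            continue              # silent router: no evidence either way
        if loss > threshold:
            if start is None:
                start = hop       # candidate start of sustained loss
        else:
            start = None          # a later clean hop clears earlier candidates
    return start
```

Loss seen at one router but not at the hops beyond it is more likely that router treating pings differently than a real problem on the path, which is why a later clean hop resets the candidate.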
Slide: 22
Pathping
en.wikipedia.org/wiki/PathPing
Tracing route to mail.alex.edu.eg [193.227.16.29] over max 30 hops:
  0  CDIV-PC83982.win.slac.stanford.edu [10.13.250.215]
  1  10.13.11.1
  2  10.100.100.53
  3  10.0.0.3
  4  81.21.100.177
  5  10.181.28.33
  6  172.18.28.117
  7  172.20.1.162
  8  172.19.8.106
  9  10.191.8.30
 10  mail.alex.edu.eg [193.227.16.29]
Computing statistics for 250 seconds...
            Source to Here   This Node/Link
Hop  RTT    Lost/Sent = Pct  Lost/Sent = Pct  Address
  0                                           CDIV-PC83982.win.slac.stanford.edu [10.13.250.215]
                                0/ 100 =  0%   |
  1    1ms     0/ 100 =  0%     0/ 100 =  0%  10.13.11.1
                                0/ 100 =  0%   |
  2    1ms     0/ 100 =  0%     0/ 100 =  0%  10.100.100.53
                                0/ 100 =  0%   |
  3    0ms     0/ 100 =  0%     0/ 100 =  0%  10.0.0.3
                                0/ 100 =  0%   |
  4    2ms     0/ 100 =  0%     0/ 100 =  0%  81.21.100.177
                               13/ 100 = 13%   |
  5  ---     100/ 100 =100%    87/ 100 = 87%  10.181.28.33
                                0/ 100 =  0%   |
  6  ---     100/ 100 =100%    87/ 100 = 87%  172.18.28.117
                                0/ 100 =  0%   |
  7  ---     100/ 100 =100%    87/ 100 = 87%  172.20.1.162
                                0/ 100 =  0%   |
  8  ---     100/ 100 =100%    87/ 100 = 87%  172.19.8.106
                                0/ 100 =  0%   |
  9  ---     100/ 100 =100%    87/ 100 = 87%  10.191.8.30
                                0/ 100 =  0%   |
 10   10ms    13/ 100 = 13%     0/ 100 =  0%  mail.alex.edu.eg [193.227.16.29]
Trace complete.
[Figure callouts: Default probes/hop = 100; No RTT variance provided; | = Link; Router; for help try pathping]
Slide: 23
Look at time series
Look at history plots (PingER, ISPs, own border router etc.), when did problem start, how big an effect is it?
  Assumes you know “proximity” of paths for which there are archived active measurements to the path that you are interested in
  Also that relevant measurements exist
  www-iepm.slac.stanford.edu/pinger/
Collaboration between Internet2/ESnet/Geant to provide access to router measurements holds promise
Slide: 24
Example time series
Look for change in measured value
Note time
Correlate
[Figure callout: Italy disconnected]
Slide: 25
Moving towards application
Is the server application listening:
  telnet www.slac.stanford.edu 80
  Trying 134.79.18.188...
  Connected to www.slac.stanford.edu.
  Escape character is '^]'.
  ^]
  telnet> quit
  Connection closed.
Try user application (mem to mem & disk to disk)
  GridFTP, bbcp, bbftp …
Iperf or thrulay (also provides RTT) to test TCP or UDP throughput
  dast.nlanr.net/Projects/Iperf/, www.internet2.edu/~shalunov/thrulay/
NDT (http://www.internet2.edu/performance/ndt/)
  What are the interface speeds?
  What is the bottleneck?
  Is there a duplex mismatch?
  Are buffers set right (both ends)?
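Whether buffers are "set right" is usually judged against the bandwidth-delay product: a single TCP stream cannot move more than one window of data per round trip. A back-of-envelope sketch; the function names and the example link figures are assumptions for illustration, not measurements from the talk:

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bandwidth-delay product: bytes in flight needed to keep the path full."""
    return bandwidth_bits_per_s * (rtt_ms / 1000.0) / 8

def max_throughput_bits_per_s(window_bytes, rtt_ms):
    """Upper bound on single-stream TCP throughput for a given window size."""
    return window_bytes * 8 / (rtt_ms / 1000.0)
```

For a hypothetical 1 Gbit/s path with 150 ms RTT, bdp_bytes(1e9, 150) is about 18.75 MB, while a 64 KiB window on the same path caps throughput at roughly 3.5 Mbit/s whatever the link speed. Both ends need buffers at least as large as the product.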
Slide: 26
NDT example
Try: http://netspeed.stanford.edu/
Slide: 27
And then …
Wireless
  Avoid peer-to-peer/ad-hoc connections
    Disable connecting to ad-hoc (set infrastructure only)
    Disable bridging
    How to do it varies by OS (XP, OSX, Linux)
  Ad hoc can still interfere if on same channel
  Tools to locate an access point (e.g. Yellow-Jacket)
  See www2.slac.stanford.edu/comp/net/wireless/Wireless-Meeting-Handout.mht
NAT boxes may block or not support application
  Private addresses:
    10.0.0.0 – 10.255.255.255 a single class A net
    172.16.0.0 – 172.31.255.255 16 contiguous class Bs
    192.168.0.0 – 192.168.255.255 256 contiguous class Cs
Slide: 28
Strategy: divide & conquer
Ping to localhost, ping to gateway & to remote host
  Use IP address to avoid nameserver problems
  Look for connectivity, loss & RTT
  May need to run for a long time to see some pathologies (e.g. bursty loss due to DSL loss of sync)
  Use telnet host port to see if ping blocked
Traceroute to remote host
  Reverse traceroute from remote host to you
  Ping routers along route (mtr helps)
Look at history plots (PingER), when did problem start, how big an effect is it?
Look at own connectivity
  NDT (netspeed.stanford.edu)
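The first steps of this strategy (ping localhost, then the gateway, then the remote host) can be scripted so that the first failure marks where to look. A sketch under stated assumptions: the function names are hypothetical, the only flags used are ping -n (Windows) and ping -c (Unix), and a zero exit status is treated as "replied", which is a simplification because some platforms return zero for other ICMP answers too:

```python
import platform
import subprocess

def ping_command(host, count=4, system=None):
    """Build the ping command line for this OS (-n on Windows, -c elsewhere)."""
    system = system or platform.system()
    flag = "-n" if system == "Windows" else "-c"
    return ["ping", flag, str(count), host]

def traceroute_command(host, system=None):
    """Build the traceroute command line (tracert on Windows)."""
    system = system or platform.system()
    return ["tracert", host] if system == "Windows" else ["traceroute", host]

def run_checks(gateway, remote):
    """Ping localhost, gateway and remote host in order.

    Returns the label of the first step that fails, or None if all reply.
    """
    steps = (("localhost", "127.0.0.1"), ("gateway", gateway), ("remote host", remote))
    for label, host in steps:
        result = subprocess.run(ping_command(host), capture_output=True)
        if result.returncode != 0:   # simplification: non-zero means no reply
            return label
    return None
```

Passing IP addresses rather than names keeps name server problems out of the picture, as the slide recommends. If the remote host fails but the gateway answers, traceroute_command(remote) gives the next command to run.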
Slide: 29
“Where is” a host?
Beware: some of the following information is ephemeral; in general use heuristics with Google
Google “Internet country codes” for TLDs
  Host may not be in TLD country, especially developing regions often use proxies elsewhere
Location may be encoded in router name
  ipls=Indianapolis, snv=Sunnyvale …
Name server lookup (nslookup & dig) to find hostname given IP address
  47cottrell@netflow:~>nslookup 210.56.16.10
  Server: localhost
  Address: 127.0.0.1
  Name: lhr.comsats.net.pk
  Address: 210.56.16.10
Use a whois server (download www.gena01.com/win32whois/)
  www.networksolutions.com/cgi-bin/whois/whois (Americas & Africa)
  www.ripe.net/cgi-bin/whois (Europe)
  www.apnic.net/ (Asia)
  May identify site name, address, contact, etc.; not all domains are in databases (e.g. will not find comsats.net.pk)
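The nslookup step above, finding a hostname from an IP address, is a query for a PTR record under in-addr.arpa. A small sketch using only the Python standard library; the function names are illustrative, and reverse_lookup depends on whatever resolver the local machine is configured with:

```python
import ipaddress
import socket

def ptr_name(addr):
    """Return the DNS name that a reverse lookup for addr actually queries."""
    return ipaddress.ip_address(addr).reverse_pointer

def reverse_lookup(addr):
    """Return the hostname registered for addr, or None if there is none."""
    try:
        return socket.gethostbyaddr(addr)[0]
    except OSError:  # no PTR record, or the resolver could not be reached
        return None
```

For the address on the slide, ptr_name("210.56.16.10") is "10.16.56.210.in-addr.arpa". Many addresses have no PTR record at all, so a None result says little about the host itself.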
Slide: 30
“Where is” a host – cont.
Find the Autonomous System (AS) administering
  Form giving AS for domain name: http://www.fixedorbit.com/search.htm
    Gives AS number, name, adjacent AS’s, web page for AS
Given an AS find out more about it:
  Use http://bgp.potaroo.net/cidr/, go to bottom and enter AS into form:
    Gives ISP name, web page, phone number, email, hours etc.
  Review list of AS's ordered by Upstream AS Adjacency
    www.telstra.net/ops/bgp/bgp-as-upsstm.txt
    Tells what AS is upstream of an ISP
Slide: 31
“Where is” a host - cont.
Visit site’s www server, often location in home page
May be able to get lat & long from database:
  www.geoiptool.com/ or via: geotool.flagfox.net/
  http://www.hostip.info/index.html
  Networldmap determines geographical information by acquiring location information from willing participants.
  http://www.ip2location.com/
    But it is a subscriber service ($$$, but …), however it is probably best for developing regions
  Quova has a large (2.4 Billion addresses) database of IP addresses to locations that they can provide access to for organizations, but must subscribe ($$$).
Triangulate pings from landmarks: www.slac.stanford.edu/grp/scs/net/talk10/geolocation.pptx
Slide: 32
Who you gonna tell?
Local network support people
Internet Service Provider (ISP), usually done by local networker
  Usually will know immediate one, e.g. [email protected]
  Use puck.nether.net/netops/nocs.cgi to find ISP
  Use www.telstra.net/ops/bgp/bgp-as-upsstm.txt to find upstream ISPs
Well managed sites and ISPs maintain a list of email addresses such as abuse@ or postmaster@, that one can send email to, for example to complain about spam etc. This follows an Internet recommendation (RFC 2142).
  Some less helpful sites do not provide such services, for more on these, see RFC-ignorant.org
Slide: 33
What ya gonna tell ‘em?
Describe problem with details
  What is affected?
    Application, host OS (uname -a), NIC (ifconfig, route)
  How is it affected?
    Non responsiveness, unable to contact remote host
    Slow performance (see Brian’s talk), packet loss
  When did it start?
Send ping output between hosts
Send traceroute forward & reverse – if possible
  Maybe use -I (ICMP option)
NDT
Identify when it started
If complex think about creating web page with details
  Top, vmstat, pingroute, pipechar, application output (GridFTP, iperf)…
Slide: 34
More Information
Tutorial on monitoring
  www.slac.stanford.edu/comp/net/wan-mon/tutorial.html
RFC 2151 on Internet tools
  www.freesoft.org/CIE/RFC/Orig/rfc2151.txt
Network monitoring tools
  www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
  www.caida.org/tools/taxonomy/
Network Performance Tools: an I2 Cookbook
  e2epi.internet2.edu/network-perf-wk/tools-cookbook.pdf
Case Studies:
  confluence.slac.stanford.edu/display/IEPM/Problem+Cases
  e2epi.internet2.edu/case-studies/
Slide: 35
More slides
Slide: 36
Local Host (also see NDT later)
Usual Unix tools (uname -a, top, vmstat, iostat ..)
  Is the host overloaded, do you have a gateway (route), name server (nslookup), which interface are you using (mii-tool (needs root), gives duplex & speed = common error source)
  Net: ifconfig -a (look at errors), netstat -a
Is server running (if you know port)?
  >telnet localhost 2811
  Trying 127.0.0.1
  220 aftpexp04.bnl.gov GridFTP Server 1.12 GSSAPI type Globus/GSI wu-2.6.2 (gcc32dbg, 1069715860-42) ready.
  ^]
  telnet> quit
Slide: 37
Ping example
syrup:/home$ ping -c 6 -s 64 thumper.bellcore.com
PING thumper.bellcore.com (128.96.41.1): 64 data bytes
72 bytes from 128.96.41.1: icmp_seq=0 ttl=240 time=641.8 ms
72 bytes from 128.96.41.1: icmp_seq=2 ttl=240 time=1072.7 ms
72 bytes from 128.96.41.1: icmp_seq=3 ttl=240 time=1447.4 ms
72 bytes from 128.96.41.1: icmp_seq=4 ttl=240 time=758.5 ms
72 bytes from 128.96.41.1: icmp_seq=5 ttl=240 time=482.1 ms
--- thumper.bellcore.com ping statistics ---
6 packets transmitted, 5 packets received, 16% packet loss
round-trip min/avg/max = 482.1/880.5/1447.4 ms
[Figure callouts: Repeat count; Packet size; Remote host; RTT; Missing seq #; Summary]
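Spotting the missing sequence number by eye gets tedious over long runs. A sketch that pulls sequence numbers and RTTs out of Unix-style ping output like the example above; the regular expression and function names are illustrative and assume the "icmp_seq=N ... time=X ms" line format shown on the slide:

```python
import re

REPLY = re.compile(r"icmp_seq=(\d+).*?time=([\d.]+)\s*ms")

def parse_ping(lines):
    """Return (sequence numbers, RTTs in ms) for every reply line."""
    seqs, rtts = [], []
    for line in lines:
        match = REPLY.search(line)
        if match:
            seqs.append(int(match.group(1)))
            rtts.append(float(match.group(2)))
    return seqs, rtts

def missing_seqs(seqs, sent, first=0):
    """Return sequence numbers that were sent but never answered.

    first is the first sequence number used: 0 in the example above,
    but 1 in many current ping implementations.
    """
    return sorted(set(range(first, first + sent)) - set(seqs))
```

Applied to the output on the slide with sent=6, the reply lines give sequence numbers 0, 2, 3, 4, 5, so missing_seqs reports 1 as lost, matching the summary line.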
Slide: 38
Traceroute
UDP/ICMP tool to show route packets take from local to remote host
17cottrell@flora06:~>traceroute -q 1 -m 20 lhr.comsats.net.pk
traceroute to lhr.comsats.net.pk (210.56.16.10), 20 hops max, 40 byte packets
 1 RTR-CORE1.SLAC.Stanford.EDU (134.79.19.2) 0.642 ms
 2 RTR-MSFC-DMZ.SLAC.Stanford.EDU (134.79.135.21) 0.616 ms
 3 ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.66) 0.716 ms
 4 snv-slac.es.net (134.55.208.30) 1.377 ms
 5 nyc-snv.es.net (134.55.205.22) 75.536 ms
 6 nynap-nyc.es.net (134.55.208.146) 80.629 ms
 7 gin-nyy-bbl.teleglobe.net (192.157.69.33) 154.742 ms
 8 if-1-0-1.bb5.NewYork.Teleglobe.net (207.45.223.5) 137.403 ms
 9 if-12-0-0.bb6.NewYork.Teleglobe.net (207.45.221.72) 135.850 ms
10 207.45.205.18 (207.45.205.18) 128.648 ms
11 210.56.31.94 (210.56.31.94) 762.150 ms
12 islamabad-gw2.comsats.net.pk (210.56.8.4) 751.851 ms
13 *
14 lhr.comsats.net.pk (210.56.16.10) 827.301 ms
[Figure callouts: Probes/hop; Max hops (20); Remote host; No response: lost packet or router ignores; Long delay: satellite; location]
Slide: 39
Pingroute
Ping routers along route, e.g. a tool to install that helps:
  www.slac.stanford.edu/comp/net/fpingroute.pl or www.slac.stanford.edu/comp/net/fpingroute.pl if fping available
15cottrell@noric04:~>fpingroute.pl
fpingroute.pl does a traceroute to the selected host. For each of the hops along the route it then uses fping to ping each node (in parallel) 'count' times. Output includes traceroute information, RTTs, losses for 100 and 'size' byte pings.
Version=0.21, 8/24/04
Usage: fpingroute.pl [Opts] host
  where host is the remote host's IP address or name, e.g. www.slac.stanford.edu
Opts: [-c count default=10] [-s size default=1400] [-i initial default=1]
Example: fpingroute.pl -i 3 -c 10 -s 1400 www.triumf.ca
Slide: 40
Other tools
Ntop
  Summarizes libpcap (sniffer) info
Internet2 Detective:
  Tests connectivity to I2, bandwidth, multicast, IPv6
  Can run as Java applet
  http://detective.internet2.edu/
NLANR Internet Advisor
Ethereal, tcpdump, snoop for masochists
Passive tools:
  Netflow for characterizing network, spotting abnormalities, e.g. www.itec.oar.net/abilene-netflow
    www.slac.stanford.edu/comp/net/slac-netflow/html/SLAC-netflow.html
  SNMP based tools