Network Path and Application Diagnostics Matt Mathis John Heffner Ragu Reddy 4/24/06 http://www.psc.edu/~mathis/papers/ PathDiag20060424.ppt


Network Path and Application Diagnostics

Matt Mathis

John Heffner

Ragu Reddy

4/24/06

http://www.psc.edu/~mathis/papers/

PathDiag20060424.ppt

Outline

• What is the real problem?
  – Lessons from Web100
  – A new perspective
• Path and lower layer diagnosis
  – The pathdiag tool
  – A diagnostic server
• Application and upper layer diagnosis
  – LAN bench testing
• Future plans

TCP tuning requires expert knowledge

• By design, TCP/IP hides the ‘net from upper layers
  – TCP/IP provides basic reliable data delivery
  – The “hourglass” between applications and networks
• This is a good thing, because it allows:
  – Invisible recovery from data loss, etc.
  – Old applications to use new networks
  – New applications to use old networks
• But then (nearly) all problems have the same symptom
  – Less than expected performance
  – The details are hidden from nearly everyone

TCP tuning is painful debugging

• All problems reduce performance
  – But the specific symptoms are hidden
• Any one problem can prevent good performance
  – Completely masking all other problems
• Trying to fix the weakest link of an invisible chain
  – General tendency is to guess and “fix” random parts
  – Repairs are sometimes “random walks”
  – Repair one problem at a time, at best

The Web100 project

• When there is a problem, just ask TCP
  – TCP has the ideal vantage point
    • In between the application and the network
  – TCP already “measures” key network parameters
    • Round Trip Time (RTT), available data capacity, etc.
    • Can add many more
  – TCP can identify the bottleneck
    • Why did it stop sending data?
  – TCP can even adjust itself
    • “Autotuning” eliminates one major class of flaws

See: www.web100.org

The next step

• Web100 tools still require too much expertise
  – They are not really end-user tools
  – Too easy to overlook problems
  – Current diagnostic procedures are still cumbersome
• New insight from Web100 experience
  – Nearly all symptoms scale with round trip time
• New NSF-funded project: Network Path and Application Diagnosis (NPAD)

Nearly all symptoms scale with RTT

• For example
  – TCP buffer space, network loss and reordering, etc.
  – On a short path, TCP can compensate for the flaw
• Local client to server: all applications work
  – Including all standard diagnostics
• Remote client to server: all applications fail
  – Leading to faulty implication of other components

Examples of flaws that scale

• Chatty application (e.g., 50 transactions per request)
  – On a 1 ms LAN, this adds 50 ms to user response time
  – On a 100 ms WAN, this adds 5 s to user response time
• Fixed TCP socket buffer space (e.g., 32 kBytes)
  – On a 1 ms LAN, limits throughput to about 200 Mb/s
  – On a 100 ms WAN, limits throughput to about 2 Mb/s
• Packet loss (e.g., 0.1% loss at 1500 bytes)
  – On a 1 ms LAN, models predict about 300 Mb/s
  – On a 100 ms WAN, models predict about 3 Mb/s
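The three examples on this slide can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are made up, the loss case uses the well-known inverse-square-root TCP throughput model with its constant assumed to be 1, and the buffer is taken as 32 × 1024 bytes. The slide quotes rounded, order-of-magnitude figures, so the computed values land somewhat higher (roughly 260 Mb/s and 380 Mb/s on the 1 ms path). The point is the factor of 100 between the two RTTs.

```python
import math

def chatty_delay_s(round_trips: int, rtt_s: float) -> float:
    """Extra response time when a request needs several serialized round trips."""
    return round_trips * rtt_s

def window_limited_bps(window_bytes: int, rtt_s: float) -> float:
    """At most one window of data can be delivered per round trip."""
    return window_bytes * 8 / rtt_s

def loss_limited_bps(mss_bytes: int, rtt_s: float, loss: float, c: float = 1.0) -> float:
    """Inverse-square-root loss model; c is an order-one constant (assumed 1 here)."""
    return c * mss_bytes * 8 / (rtt_s * math.sqrt(loss))

for label, rtt in (("1 ms LAN", 0.001), ("100 ms WAN", 0.100)):
    delay = chatty_delay_s(50, rtt)
    buf = window_limited_bps(32 * 1024, rtt) / 1e6
    loss = loss_limited_bps(1500, rtt, 0.001) / 1e6
    print(f"{label}: chatty +{delay:.2f} s, 32 kB buffer {buf:.1f} Mb/s, "
          f"0.1% loss {loss:.1f} Mb/s")
```

Every quantity is linear in RTT (or in 1/RTT), which is why the same flaw is invisible on the LAN and crippling on the WAN.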

The confounded problems

• For nearly all network flaws
  – The only symptom is reduced performance
  – But the reduction is scaled by RTT
• On short paths, most flaws are undetectable
  – False pass for even the best conventional diagnostics
  – Leads to faulty inductive reasoning about flaw locations
  – This is the essence of the “end-to-end” problem
  – Current state-of-the-art diagnosis relies on tomography and complicated inference techniques

The solutions

• New diagnostic techniques to compensate for “symptom scaling”
• For path testing (and lower layers)
  – Test path sections using an instrumented application that can extrapolate test results to a long path
• For applications (and upper layers)
  – Bench test over an (emulated) ideal long path

Testing the path

• Need to test short path sections to localize a flaw
  – But “symptom scaling” normally hides a failing section
• New tool (“pathdiag”):
  – Measure the performance of each short section
    • Use Web100 to collect detailed statistics
    • Loss, delay, queuing properties, etc.
  – Use models to extrapolate results to the full path
    • Assume that the rest of the path is ideal
    • You have to specify the end-to-end performance goal
      – Data rate and RTT
  – Pass/fail on the basis of the extrapolated performance
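The extrapolation idea can be sketched as follows. This is not pathdiag's actual code or model set: the `Goal` class, the function names, the choice of the inverse-square-root loss model with constant 1, and the two checks shown (loss budget and sender buffer) are all assumptions made for illustration. The sketch takes a loss rate measured on a short section, assumes the rest of the path is ideal, and asks whether the end-to-end goal (data rate and RTT) would still be reachable.

```python
from dataclasses import dataclass

@dataclass
class Goal:
    """End-to-end target the user specifies (hypothetical structure)."""
    rate_bps: float
    rtt_s: float
    mss_bytes: int = 1460

    @property
    def window_pkts(self) -> float:
        # Packets that must be in flight to fill the target path.
        return self.rate_bps * self.rtt_s / (8 * self.mss_bytes)

def max_loss(goal: Goal, c: float = 1.0) -> float:
    """Largest loss rate the loss model allows at the target rate and RTT."""
    return (c / goal.window_pkts) ** 2

def evaluate_section(goal: Goal, measured_loss: float,
                     sender_buffer_bytes: int) -> list[str]:
    """Return failure messages; an empty list means the section passes."""
    failures = []
    needed = goal.window_pkts * goal.mss_bytes
    if sender_buffer_bytes < needed:
        failures.append(f"buffer {sender_buffer_bytes} B < required {needed:.0f} B")
    budget = max_loss(goal)
    if measured_loss > budget:
        failures.append(f"loss {measured_loss:.2e} > budget {budget:.2e}")
    return failures

goal = Goal(rate_bps=100e6, rtt_s=0.100)
print(evaluate_section(goal, measured_loss=1e-3, sender_buffer_bytes=64 * 1024))
print(evaluate_section(goal, measured_loss=1e-7, sender_buffer_bytes=4 * 1024 * 1024))
```

A 0.1% loss rate that is harmless on the short section fails here because the loss budget shrinks with the square of the target window.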

Deploy as a Diagnostic Server

• Use pathdiag in a Diagnostic Server (DS)
• Specify end-to-end target performance
  – From server (S) to client (C) (RTT and data rate)
• Measure the performance from DS to C
  – Use Web100 in the DS to collect detailed statistics
  – Extrapolate performance assuming ideal backbone
• Pass/fail on the basis of extrapolated performance

Example 1 - good news

Example 1, continued

Example 2 - not so good

Example 2, continued

Key pathdiag/DS features

• Results are intended for end users
  – Provides a list of specific items to be corrected
    • Failed tests are showstoppers for HPN apps
  – Includes explanations and tutorial information
  – Details for escalation to network or system admins
• Coverage for a majority of OS and network flaws
  – Most of the remaining flaws can be detected with pathdiag in the client or traceroute
  – Eliminates nearly all(?) false pass results
• Tests become more sensitive on shorter paths
  – Conventional diagnostics become less sensitive
  – Depending on models, perhaps too sensitive
    • New problem is false fail (e.g., queue space tests)

Key features, continued

• Flaws no longer completely mask other flaws
  – A single test often detects several flaws
    • E.g., find both OS and network flaws in the same test
  – They can be repaired concurrently
• Archived DS results include raw Web100 data
  – Can reprocess with updated reporting software
    • New reports from old data
  – Critical feedback for the NPAD project
    • We really want to collect “interesting” failures

Status

• Public servers are now online. See:
  – http://www.psc.edu/networking/projects/pathdiag/
• Version 1.0 available for download
  – Follow the download link
  – Requires current Web100 kernel patches
  – The server should be faster than its clients
• Version 1.1 is coming soon
  – Better support for non-local testing
  – Better support for TeraGrid-scale testing

Blast from the past

• Same base algorithm as “Windowed Ping” [Mathis, INET’94]
  – Aka “mping”
  – See http://www.psc.edu/~mathis/wping/
  – Killer diagnostic in use at PSC in the early 90s
  – Stopped working with the advent of “fast path” routers
• Use a simple fixed window protocol
  – Scan window size in 1 second steps
  – Measure data rate, loss rate, RTT, etc. as window changes
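To make the window scan concrete, here is a toy model of what such a scan would see. It is not mping or pathdiag code: `ToyPath` is a hypothetical, idealized single-bottleneck path with a drop-tail queue, and in a real scan each window size would be held for about a second while the statistics are collected. In the model, the data rate rises with the window until the pipe is full, then the RTT grows as the queue fills, and finally loss appears once the queue overflows.

```python
from dataclasses import dataclass

@dataclass
class ToyPath:
    """Idealized bottleneck path (an assumption, not a measured network)."""
    base_rtt_s: float      # RTT with an empty queue
    bottleneck_pps: float  # bottleneck rate in packets per second
    queue_pkts: int        # drop-tail queue size at the bottleneck

    def probe(self, window_pkts: int) -> tuple[float, float, float]:
        """Return (rate in pkts/s, RTT in s, loss fraction) for a fixed window."""
        pipe = self.bottleneck_pps * self.base_rtt_s   # packets the path itself holds
        excess = max(0.0, window_pkts - pipe)          # packets that must queue
        dropped = max(0.0, excess - self.queue_pkts)   # overflow is lost
        queued = excess - dropped
        rtt = self.base_rtt_s + queued / self.bottleneck_pps
        rate = (window_pkts - dropped) / rtt
        return rate, rtt, dropped / window_pkts

def scan(path: ToyPath, max_window: int, step: int = 10):
    """Step the window upward and record what each setting yields."""
    return [(w, *path.probe(w)) for w in range(step, max_window + 1, step)]

path = ToyPath(base_rtt_s=0.125, bottleneck_pps=800, queue_pkts=50)
results = scan(path, max_window=300)
knee = next(w for w, _, rtt, _ in results if rtt > path.base_rtt_s * 1.01)
first_loss = next(w for w, _, _, loss in results if loss > 0)
print(f"RTT starts rising at window {knee}, loss starts at window {first_loss}")
```

The window at which RTT begins to rise estimates the pipe size, and the gap between that point and the onset of loss estimates the queue space.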

Diagnosing applications

• Goal: Tools to “bench test” applications in the lab
  – Client and server on the same LAN
    • App developer has easy access to all components
  – Emulate a long ideal path between client and server
    • Also checks some OS and TCP features
    • Several different techniques (next topic)
• Developer gets first-hand experience with delay
  – If it fails in the lab, it will not work on a WAN
  – Cannot blame the network
  – Cannot repeal the speed of light
  – Has to fix the application

Emulating delay

• Multiple techniques to emulate long paths
  – Scenic routing via tunnels
  – Kernel delays (e.g., netem, nistnet, dummynet)
  – Application (pipe) delay via a proxy
• We have ~5 techniques prototyped/under test
  – Kernel hacking vs. non-privileged users
  – Ease of use/ease of installation
  – Maximum data rate
  – Authenticity of the delay
• Not ready for prime time
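As an illustration of the "application (pipe) delay via a proxy" option, the sketch below is a minimal user-space TCP proxy that adds a fixed one-way delay in each direction. It is not one of the NPAD prototypes; the function names, the loopback listen address, and the design are assumptions. Each chunk is timestamped on arrival and released after the delay, so the emulated pipe can stay full instead of being throttled to one chunk per delay interval.

```python
import asyncio

async def _pump(reader, writer, delay_s):
    """Copy bytes one way, releasing each chunk delay_s after it arrived."""
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue()

    async def read_side():
        while True:
            data = await reader.read(65536)
            await queue.put((loop.time() + delay_s, data))
            if not data:          # EOF: pass the marker on and stop reading
                return

    async def write_side():
        while True:
            release_at, data = await queue.get()
            await asyncio.sleep(max(0.0, release_at - loop.time()))
            if not data:          # delayed EOF: close the outgoing side
                writer.close()
                return
            writer.write(data)
            await writer.drain()

    try:
        await asyncio.gather(read_side(), write_side())
    except ConnectionError:
        writer.close()            # peer went away; drop this direction

async def start_delay_proxy(listen_port, target_host, target_port, one_way_delay_s):
    """Listen on 127.0.0.1 and relay each connection to the target with added delay."""
    async def handle(client_reader, client_writer):
        up_reader, up_writer = await asyncio.open_connection(target_host, target_port)
        await asyncio.gather(
            _pump(client_reader, up_writer, one_way_delay_s),
            _pump(up_reader, client_writer, one_way_delay_s),
        )
    return await asyncio.start_server(handle, "127.0.0.1", listen_port)
```

To use it, start the proxy inside an asyncio program (for example, await `start_delay_proxy(9000, "127.0.0.1", 8000, 0.05)` and then the returned server's `serve_forever()`), and point the client at the proxy port instead of the real server. A proxy like this needs no privileges, but it terminates the TCP connection, so it emulates delay for the application only and not for the end-to-end TCP dynamics. That is the "authenticity of the delay" trade-off listed on the slide.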


Try it!

http://www.psc.edu/networking/projects/pathdiag/