Pushing Up Performance for Everyone
Matt Mathis
7-Dec-99
Why do so few people get good network performance?
• Context and history
• Architectural origins
• Approaches
The Wizard Gap
[Figure: data rate (Mb/s, log scale from 0.1 to 1000) versus year; two curves, Expert and Default]
Past Performance Evolution
• Wizards wrote standards
  – Standard TCP could not go fast (1988)
• Wizards enhanced systems
  – Stock systems could not go fast (1995)
• Gurus tune systems (today)
  – Fast TCP is present
  – Badly mistuned by default
Ongoing Performance Evolution
• More disciples tune and debug (tomorrow)
  – All netadmins and sysadmins?
• Systems are tuned by default (future)
  – Web100 ...
• Debugging will become “easy” (?)
Architecture
• The good news
  – TCP hides the net from the application
• The bad news
  – TCP hides the net
  ... including ALL bugs everywhere.
• The only legal symptom is less than expected performance
You get poor performance if:
– The application is inefficient
– TCP is buggy
– TCP is mistuned
– The path is buggy
– The path is congested
– Routing is suboptimal
Especially on a long path.
– Think: weakest link of an invisible chain
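The "weakest link" picture can be sketched in a few lines: the end-to-end rate is bounded by the slowest component, and the application only sees the resulting number, not which link caused it. The component names and rates below are illustrative assumptions, not measurements from this talk.

```python
def end_to_end_limit(limits_mbps):
    """Return (component, rate) for the slowest component in the chain.

    limits_mbps maps a component name to the best rate (Mb/s) it allows.
    The names and values passed in are hypothetical examples.
    """
    weakest = min(limits_mbps, key=limits_mbps.get)
    return weakest, limits_mbps[weakest]


# Illustrative numbers only: a fast path throttled by a small TCP window.
chain = {
    "application": 400,
    "tcp window": 12,
    "path capacity": 155,
}
component, rate = end_to_end_limit(chain)
print(f"limited to {rate} Mb/s by: {component}")
```

The user only observes the final rate; finding which entry is the minimum is the debugging problem.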
Closing the Wizard gap
• Share the expertise
  – Train more disciples
• Require less expertise
  – Systems should tune themselves
• Better observability
  – Focused and efficient debugging
• Documentation
  – Show that the world is improving
Share the expertise
• Joint Techs meetings
• TCP Tuning
  – In-depth presentation by Matt Mathis
• DAST Application tutorials
  – See: dast.nlanr.net
Require less expertise
• TCP Autotuning
  – Presentation by Matt Mathis
• Web100
  – Presentation by Basil Irwin
• Online TCP debugging resources
  – See: http://www.ncne.nlanr.net/TCP
Better Observability (Instrumentation)
• Network Instrumentation and Visualization
  – Presentation by Mark Gates
• Trace Analysis and Auto-Diagnosis
  – Presentation by Kathy Benninger
• Better TCP instrumentation (Web100)
  – Just ask TCP why it is slow
Better Observability (Debugging methods)
• Sweden-Pittsburgh path
  – Presentation by Greg Miller & Jerry Sobieski
• iPerf tool
  – Presentation by Mark Gates
• Existing tools and tool repositories
  – See: http://www.ncne.nlanr.net/tools
• Still insufficient
Better Observability (Measurement)
• Measurements from Seattle I2 Meeting
  – Presentation by Matt Zekauskas
• Advanced Research and Engineering Atlas
  – Presentation by John Jamison
• Many distributed measurement efforts
  – AMP, Surveyor, NIMI, etc.
Documentation
• vBNS stats and measurement
  – Tutorial by Rick Wilder
• NLANR MOAT vBNS traffic on NAI
  – See: moat.nlanr.net
• Many benchmark efforts
  – Surveyor, AMP, NIMI, Web100 ...
• HPC host census(?)
Conclusion
• We need to find every bug that TCP hides
  – Now and always
• We need to eliminate all irrelevant controls
  – Autotune TCP (and RED, etc.)
Debugging flowchart
• http://www.ncne.nlanr.net/TCP/debugging
• Look at a trace and click to study symptoms
• Ongoing evolution
Testrig kit
• "Foolproof" TCP diagnosis starter kit with:
  – Simple diagnostic application
  – TCP trace collection tools
  – Visualization tools
  – Pointer to the debugging flowchart
• With wrapper scripts around everything
TCP Debugging In-depth
• Draft done at CAIDA this summer
• Future NCNE On-site
  – 1-, 2.5- and 5-hour versions
• Basis for the debugging flowchart
• Update from flowchart as it evolves
• Interactive: uses magicpoint and xplot
Trace Analysis and Auto-Diagnosis (TAAD)
• Scan GigaPop traffic for mistuned TCP connections that fail to meet the model:
  rate = (MSS/RTT) * (C/sqrt(p))
• Running prototype
• Use to direct other resources
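As a rough sketch of the check described above, the model can be evaluated per connection and compared with the measured rate. This is not the TAAD prototype: the function names, the default C = 1.0 (C is a constant of order one whose value depends on the loss and ACK assumptions), and the 0.5 threshold are all assumptions made for illustration.

```python
import math


def model_rate_bps(mss_bytes, rtt_s, loss_prob, c=1.0):
    """Model rate in bits/s: (MSS/RTT) * (C/sqrt(p)).

    c=1.0 is a placeholder default, not a value taken from the slides.
    """
    if rtt_s <= 0 or not (0 < loss_prob <= 1):
        raise ValueError("need rtt_s > 0 and 0 < loss_prob <= 1")
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_prob))


def fails_model(measured_bps, mss_bytes, rtt_s, loss_prob, threshold=0.5):
    """Flag a connection running well below the model rate.

    threshold is a hypothetical cut-off: 0.5 means "less than half of
    what the model predicts for this MSS, RTT and loss rate".
    """
    return measured_bps < threshold * model_rate_bps(mss_bytes, rtt_s, loss_prob)


# Example: 1460-byte MSS, 100 ms RTT, loss probability 1e-4.
expected = model_rate_bps(1460, 0.100, 1e-4)
print(f"model rate: {expected / 1e6:.2f} Mb/s")
print("suspect:", fails_model(1.0e6, 1460, 0.100, 1e-4))
```

A connection flagged this way is moving far less data than its path's RTT and loss rate would allow, which points at the host or application rather than the network.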
Autotuning
• Make TCP “do the right thing” by default
• No unneeded user controls
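For context, the manual control that autotuning aims to make unnecessary is typically the socket buffer size, which has to cover the path's bandwidth-delay product. The sketch below shows that hand calculation; the helper names and the example path (OC-12 at a nominal 622 Mb/s with a 70 ms RTT) are assumptions for illustration, and the operating system may clamp or adjust whatever size is requested.

```python
import socket


def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth-delay product in bytes (integer arithmetic).

    bits/s * ms / 1000 gives bits in flight; dividing by 8 gives bytes.
    """
    return bandwidth_bps * rtt_ms // 8000


def request_buffers(sock, size_bytes):
    """Ask the OS for send and receive buffers of size_bytes.

    The kernel is free to clamp or round this request, so the
    effective size may differ from what was asked for.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size_bytes)


# Hypothetical path: nominal OC-12 rate, 70 ms round-trip time.
needed = bdp_bytes(622_000_000, 70)
print(f"buffer needed to fill the path: {needed} bytes")
```

With autotuning, the stack would size these buffers per connection itself, so neither the calculation nor the `setsockopt` calls would be the user's job.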
Generate data points (AMP)
• Nearly 100 systems already
• Kernel TCP bug
  – Need to upgrade to FreeBSD 3.3
• Easy to create 100x1 data points
• Can create 100x100 data points
• Opportunity for NIMI
Generate OC-12 data points
• Max Okumoto working at PSC for SDSC
• Will start tuning selected paths
HPC Host Census
• Use existing data from MCI OC-Xmon
• Patterned after HWB big flow detection
• Measure the number of fast hosts
• Words needed to generalize to all of JET