pushing up performance for everyone matt mathis 7-dec-99

Pushing Up Performance for Everyone

Matt Mathis

7-Dec-99

Why do so few people get good network performance?

• Context and history

• Architectural origins

• Approaches

The Wizard Gap

[Figure: data rate (Mb/s), log scale from 0.1 to 1000, plotted against year for two curves, Expert and Default.]

Past Performance Evolution

• Wizards wrote standards
  – Standard TCP could not go fast (1988)

• Wizards enhanced systems
  – Stock systems could not go fast (1995)

• Gurus tune systems (today)
  – Fast TCP is present
  – Badly mistuned by default

Ongoing Performance Evolution

• More disciples tune and debug (tomorrow)
  – All netadmins and sysadmins?

• Systems are tuned by default (future)
  – Web100...

• Debugging will become “easy” (?)

Architecture

• The Good news
  – TCP hides the net from the application

• The Bad news
  – TCP hides the net

... including ALL bugs everywhere.

• The only legal symptom is less than expected performance

You get poor performance if:

  – The application is inefficient
  – TCP is buggy
  – TCP is mistuned
  – The path is buggy
  – The path is congested
  – Routing is suboptimal

Especially on a long path.
  – Think: weakest link of an invisible chain

Closing the Wizard gap

• Share the expertise
  – Train more disciples

• Require less expertise
  – Systems should tune themselves

• Better observability
  – Focused and efficient debugging

• Documentation
  – Show that the world is improving

Share the expertise

• Joint Techs meetings

• TCP Tuning
  – In-depth presentation by Matt Mathis

• DAST Application tutorials
  – See: dast.nlanr.net

Require less expertise

• TCP Autotuning
  – Presentation by Matt Mathis

• Web100
  – Presentation by Basil Irwin

• Online TCP debugging resources
  – See: http://www.ncne.nlanr.net/TCP

Better Observability (Instrumentation)

• Network Instrumentation and Visualization
  – Presentation by Mark Gates

• Trace Analysis and Auto-Diagnosis
  – Presentation by Kathy Benninger

• Better TCP instrumentation (Web100)
  – Just ask TCP why it is slow

Better Observability (Debugging methods)

• Sweden - Pittsburgh path
  – Presentation by Greg Miller & Jerry Sobieski

• iPerf tool
  – Presentation by Mark Gates

• Existing tools and tool repositories
  – See: http://www.ncne.nlanr.net/tools

• Still insufficient

Better Observability (Measurement)

• Measurements from Seattle I2 Meeting
  – Presentation by Matt Zekauskas

• Advanced Research and Engineering Atlas
  – Presentation by John Jamison

• Many distributed measurement efforts
  – AMP, Surveyor, NIMI, etc.

Documentation

• vBNS stats and measurement
  – Tutorial by Rick Wilder

• NLANR MOAT vBNS traffic on NAI
  – See: moat.nlanr.net

• Many benchmark efforts
  – Surveyor, AMP, NIMI, Web100...

• HPC host census (?)

Conclusion

• We need to find every bug that TCP hides
  – Now and always

• We need to eliminate all irrelevant controls
  – Autotune TCP (and RED, etc.)

Debugging flowchart

• http://www.ncne.nlanr.net/TCP/debugging

• Look at a trace and click to study symptoms

• Ongoing evolution

Testrig kit

• "Foolproof" TCP diagnosis starter kit with:
  – Simple diagnostic application
  – TCP trace collection tools
  – Visualization tools
  – Pointer to the debugging flowchart

• With wrapper scripts around everything

TCP Debugging In-depth

• Draft done at CAIDA this summer

• Future NCNE On-site
  – 1, 2.5 and 5 hour versions

• Basis for the debugging flowchart

• Update from flowchart as it evolves

• Interactive - Uses magicpoint/xplot

Trace Analysis and Auto-Diagnosis(TAAD)

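• Scan GigaPop traffic for mistuned TCP connections
  – that fail to meet the model

    rate = (MSS/RTT) * (C/sqrt(p))

    (MSS: maximum segment size; RTT: round-trip time; p: packet loss probability; C: a constant of order 1)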

• Running prototype

• Use to direct other resources
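The model check on this slide can be sketched in a few lines. This is a minimal illustration, not the TAAD prototype: the function names, the value C = sqrt(3/2) (a commonly quoted value for periodic loss with an ACK per segment), and the 0.5 cutoff are all assumptions introduced here.

```python
import math

# Assumption: C = sqrt(3/2), a commonly quoted value for periodic loss
# with an ACK for every segment. The slide only says "C".
C_ASSUMED = math.sqrt(3.0 / 2.0)


def model_rate_bps(mss_bytes, rtt_s, loss_prob, c=C_ASSUMED):
    """Return the model rate in bits per second: (MSS/RTT) * (C/sqrt(p))."""
    if mss_bytes <= 0 or rtt_s <= 0:
        raise ValueError("MSS and RTT must be positive")
    if not 0.0 < loss_prob <= 1.0:
        raise ValueError("loss probability must be in (0, 1]")
    # MSS is given in bytes, so multiply by 8 to get bits.
    return (mss_bytes * 8.0 / rtt_s) * (c / math.sqrt(loss_prob))


def fails_model(measured_bps, mss_bytes, rtt_s, loss_prob, threshold=0.5):
    """Flag a connection whose measured rate is below a fraction of the model.

    The threshold of 0.5 is a hypothetical cutoff, not taken from TAAD.
    """
    return measured_bps < threshold * model_rate_bps(mss_bytes, rtt_s, loss_prob)


if __name__ == "__main__":
    predicted = model_rate_bps(mss_bytes=1460, rtt_s=0.070, loss_prob=1e-4)
    print(f"model rate: {predicted / 1e6:.1f} Mb/s")
    print("flag a 2 Mb/s flow:", fails_model(2e6, 1460, 0.070, 1e-4))
```

A connection flagged this way is running well below what its own loss rate and RTT would allow, which suggests a host-side limit (such as a small window) rather than congestion.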

Autotuning

• Make TCP “do the right thing” by default

• No unneeded user controls
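The slide does not say which control autotuning removes. One common reading, assumed here, is the socket buffer (window) size, which must cover the path's bandwidth-delay product for a single connection to fill the pipe. The sketch below shows that arithmetic; the function names and example path are hypothetical.

```python
def bdp_bytes(bottleneck_bps, rtt_s):
    """Bandwidth-delay product in bytes: the window needed to fill the path."""
    if bottleneck_bps <= 0 or rtt_s <= 0:
        raise ValueError("bandwidth and RTT must be positive")
    return bottleneck_bps * rtt_s / 8.0


def window_limited_rate_bps(window_bytes, rtt_s):
    """Upper bound on throughput when at most one window is sent per RTT."""
    if window_bytes <= 0 or rtt_s <= 0:
        raise ValueError("window and RTT must be positive")
    return window_bytes * 8.0 / rtt_s


if __name__ == "__main__":
    rtt = 0.070  # hypothetical 70 ms path
    print(f"window needed for 100 Mb/s: {bdp_bytes(100e6, rtt) / 1024:.0f} KB")
    # 65535 bytes is the largest window TCP can advertise without window scaling.
    print(f"rate with a 64 KB window: {window_limited_rate_bps(65535, rtt) / 1e6:.1f} Mb/s")
```

On this example path a 64 KB window caps a connection at roughly 7.5 Mb/s no matter how fast the links are, which is the kind of setting a user should not have to adjust by hand.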

Generate data points (AMP)

• Nearly 100 systems already

• Kernel TCP bug
  – Need to upgrade to FreeBSD 3.3

• Easy to create 100x1 data points

• Can create 100x100 data points

• Opportunity for NIMI

Generate OC-12 data points

• Max Okumoto working at PSC for SDSC

• Will start tuning selected paths

HPC Host Census

• Use existing data from MCI OC-Xmon

• Patterned after HWB big flow detection

• Measure the number of fast hosts

• Words needed to generalize to all of JET
