linux tcp/ip tuning
Copyright 2004 OSDL, All rights reserved.
Analyzing TCP Performance
Stephen Hemminger
Sr. Staff Engineer
Linux Kongress 2004
2004-09-09
Agenda
■ Introduction
■ TCP for muggles
■ Engineering Process
■ Problem examples
■ Network Tools
■ Wrapup
Outside of scope
■ Non TCP protocols
  ■ SCTP, multicast, etc
■ Queuing theory - “no math”
■ Hardware and product comparisons
My Background
■ Did TCP back in the “old school”
  ■ BSD 4.2, Ethernet
  ■ SMP Unix versions of OSI, Netware, Appletalk, ...
  ■ Plan9 Hypercube communication
■ Linux
  ■ Incorporation of TCP research in 2.6 kernel
  ■ Performance tests for LWE
  ■ Wizard gap
Limits of my knowledge
■ Only worked with current Linux (2.4/2.6)
■ Will mention tools here that I have not used extensively
■ Involved in development of Linux, not deployment or research
TCP for “muggles”
■ connection establishment
■ slow start
■ windows
■ congestion control
■ silly window
Connection establishment
Client                            Server
connect   ---- SYN ----------->
          <--- SYN+ACK --------   accept
write     ---- Data 1(10) ---->   read
          <--- Ack 11 ---------
ethereal
tcpdump trace
13:28:21.745624 IP 172.20.1.60.38052 > 216.239.39.99.http: S 1765497548:1765497548(0) win 5840 <mss 1460,sackOK,timestamp 1563951453 0,nop,wscale 7>
13:28:21.831935 IP 216.239.39.99.http > 172.20.1.60.38052: S 227058185:227058185(0) ack 1765497549 win 8190 <mss 1460>
13:28:21.832035 IP 172.20.1.60.38052 > 216.239.39.99.http: . ack 1 win 5840
13:28:21.832321 IP 172.20.1.60.38052 > 216.239.39.99.http: P 1:126(125) ack 1 win 5840
13:28:21.939237 IP 216.239.39.99.http > 172.20.1.60.38052: . ack 126 win 31460
13:28:21.972448 IP 216.239.39.99.http > 172.20.1.60.38052: P 1:485(484) ack 126 win 31460
13:28:21.972529 IP 172.20.1.60.38052 > 216.239.39.99.http: . ack 485 win 6432
13:28:21.973016 IP 172.20.1.60.38052 > 216.239.39.99.http: F 126:126(0) ack 485 win 6432
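One quick thing to read off a trace like this is the round trip time seen during connection establishment: the gap between the client's SYN and the server's SYN+ACK. The sketch below is only an illustration, not part of the talk; the helper name handshake_rtt_ms is made up, and it assumes both lines start with a tcpdump HH:MM:SS.micro timestamp taken on the same day.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%H:%M:%S.%f"  # tcpdump's default time-of-day stamp


def handshake_rtt_ms(syn_line: str, synack_line: str) -> float:
    """Milliseconds between two tcpdump lines, using their leading timestamps.

    Hypothetical helper: it ignores midnight wrap-around and does not check
    that the lines really are a SYN and its SYN+ACK.
    """
    t_syn = datetime.strptime(syn_line.split()[0], TIMESTAMP_FORMAT)
    t_synack = datetime.strptime(synack_line.split()[0], TIMESTAMP_FORMAT)
    return (t_synack - t_syn).total_seconds() * 1000.0


# First two packets of the trace on this slide.
syn = ("13:28:21.745624 IP 172.20.1.60.38052 > 216.239.39.99.http: "
       "S 1765497548:1765497548(0) win 5840 "
       "<mss 1460,sackOK,timestamp 1563951453 0,nop,wscale 7>")
synack = ("13:28:21.831935 IP 216.239.39.99.http > 172.20.1.60.38052: "
          "S 227058185:227058185(0) ack 1765497549 win 8190 <mss 1460>")

print(f"handshake RTT: {handshake_rtt_ms(syn, synack):.3f} ms")
```

For this trace the two timestamps are about 86 ms apart, which is the kind of delay figure the bandwidth delay product slides later rely on.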
Flow control
Sender                                 Receiver
          <--- ACK 1010 (5000) -----
write
          ---- Data 1011 (1400) --->
          ---- Data 2411 (1400) --->
          ---- Data 3811 (1400) --->
          ---- Data 5211 (800) ---->
          <--- Ack 6010 (0) --------
                                       read (1000)
          <--- Ack 6010 (1000) -----
Retransmission
Data 1
Ack 1
Ack 1
write
Data 2
Multiple duplicate acks = fast retransmit
Tcptrace
http://tcptrace.org
Tool to convert captured data into graphs
■ Time sequence graph
■ Throughput
■ RTT
Lots more than there is time to cover here!
Xplot
http://xplot.org
■ Takes plot command scripts
■ Mouse
  ■ Zoom – drag with the left button
  ■ Zoom out – click the left button
  ■ Scroll – drag with middle button
  ■ Dump – shift-left button produces postscript
  ■ Shift-middle and shift-right also
Time Sequence Graph
Windows & Buffering
■ Used to isolate TCP from application read/write
■ Used for congestion control
■ Upper bound determined by system parameters
Congestion window
■ slow start
  ■ Window normally starts small
  ■ Grows in response to ack
■ congestion control
  ■ Packet loss = congestion
Silly Window
Data (2000)
Ack [10]
write 8k bytes
Ack [2000]
Read 8k bytes
“Hey, I am not going to try and send this data now; give me a bigger window first”
OK, thanks
Model of TCP networks
Sender [Send Window] ---- Data ---> Network ---> [Receive Window] Receiver
Sender [Send Window] <--- Ack ----- Network <--- [Receive Window] Receiver
BDP (bytes) = Bandwidth (bytes/sec) * Delay (sec)
BDP - Bandwidth Delay Product
■ BDP = amount of data in transit
■ Examples
  ■ DSL/Cable modem (international)
    1,000,000 bit/sec * 1/8 byte/bit * 500 ms = 62,500 bytes
  ■ Gigabit across US
    1,000,000,000 bit/sec * 1/8 byte/bit * 70 ms = 8.75 Mbytes
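The two examples are plain arithmetic, so they are easy to check or to redo for another link. A minimal sketch, assuming bandwidth is given in bits per second and delay in milliseconds as on the slide; the function name bdp_bytes is only illustrative.

```python
def bdp_bytes(bandwidth_bits_per_sec: float, delay_ms: float) -> float:
    """Bandwidth delay product in bytes.

    bits/sec * (1 byte / 8 bits) * (delay_ms / 1000 sec)
    """
    return bandwidth_bits_per_sec * delay_ms / 8000.0


# The two links from the slide.
dsl = bdp_bytes(1_000_000, 500)        # DSL/cable modem, international
gigabit = bdp_bytes(1_000_000_000, 70) # gigabit across the US

print(f"DSL/cable: {dsl:,.0f} bytes")
print(f"Gigabit:   {gigabit / 1e6:.2f} Mbytes")
```

A window smaller than this number cannot keep the link full, which is why the gigabit case needs window scaling and larger buffers.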
Bandwidth Delay Product (BDP)
[Figure: bandwidth (Mbits/sec, 0.1 to 1000) against delay (ms, 0.1 to 1000), with BDP lines at 8K, 64K and 1M and regions labelled LAN, Broadband and Research.]
Internet
■ Router queues
■ Delays
  ■ Speed of light (70ms coast/coast)
  ■ Slow routers
■ Packet correlation, sizes
■ DoS
Extensions for larger windows
■ TCP Selective Acknowledgement (SACK) (RFC2018)
  ■ Don't have to retransmit everything
■ Window scaling (RFC1323)
  ■ Window size multiplied by 2^n
■ Protection Against Wrapped Sequence (PAWS)
  ■ Timestamp inside each packet
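The window scaling rule can be written down directly: the 16-bit window field is shifted left by the negotiated scale n. The sketch below shows that arithmetic plus a rough way to pick the smallest n for a given buffer size. The function names are made up for illustration, and the search loop is an assumption about how one might choose a scale, not the kernel's actual code.

```python
MAX_UNSCALED_WINDOW = 65535  # largest value of the 16-bit TCP window field
MAX_WSCALE = 14              # upper limit on the shift count in RFC 1323


def effective_window(advertised: int, wscale: int) -> int:
    """Real window in bytes for an advertised value and a scale factor n."""
    return advertised << wscale  # same as advertised * 2**n


def wscale_needed(buffer_bytes: int) -> int:
    """Smallest scale that lets the window field describe buffer_bytes."""
    scale = 0
    while (MAX_UNSCALED_WINDOW << scale) < buffer_bytes and scale < MAX_WSCALE:
        scale += 1
    return scale


print(effective_window(1460, 2))   # an advertisement of 1460 with scale 2
print(wscale_needed(62_500))       # fits in the unscaled field
print(wscale_needed(8_750_000))    # gigabit across the US needs scaling
```

Without scaling the window tops out near 64K, so the gigabit example from the BDP slide cannot be served by an unscaled connection.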
TCP options negotiation 1
IP 172.20.1.60.32820 > 216.239.39.99.http: S 3599527174:3599527174(0) win 5840 <mss 1460,sackOK,timestamp 2519711 0,nop,wscale 2>
IP 216.239.39.99.http > 172.20.1.60.32820: S 3820474812:3820474812(0) ack 3599527175 win 8190 <mss 1460>
IP 172.20.1.60.32820 > 216.239.39.99.http: . ack 1 win 5840
IP 172.20.1.60.32820 > 216.239.39.99.http: P 1:126(125) ack 1 win 5840
Window scale by 4
But server doesn't support scaling
TCP options negotiation 2
IP 172.20.1.60.32823 > 65.172.181.13.http: S 4120108902:4120108902(0) win 5840 <mss 1460,sackOK,timestamp 3036627 0,nop,wscale 2>
IP 65.172.181.13.http > 172.20.1.60.32823: S 2295773021:2295773021(0) ack 4120108903 win 5792 <mss 1460,sackOK,timestamp 1818411318 3036627,nop,wscale 0>
IP 172.20.1.60.32823 > 65.172.181.13.http: . ack 1 win 1460 <nop,nop,timestamp 3036628 1818411318>
IP 172.20.1.60.32823 > 65.172.181.13.http: P 1:144(143) ack 1 win 1460 <nop,nop,timestamp 3036628 1818411318>
Window scale by 4
Your scaling is okay, but don't scale mine
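The difference between the two negotiation traces is whether the SYN+ACK carries a wscale option at all: scaling is only used when both sides offer it. A small sketch of that check follows, assuming tcpdump's older text format with options inside angle brackets; tcp_options and scaling_enabled are hypothetical helper names.

```python
import re


def tcp_options(line: str) -> dict:
    """Parse the <...> option list of a tcpdump line into {name: value}."""
    options = {}
    match = re.search(r"<([^>]*)>", line)
    if match:
        for item in match.group(1).split(","):
            name, _, value = item.strip().partition(" ")
            if name != "nop":  # nop is padding, not a real option
                options[name] = value
    return options


def scaling_enabled(syn: str, synack: str) -> bool:
    """Window scaling applies only if both SYN segments carry wscale."""
    return "wscale" in tcp_options(syn) and "wscale" in tcp_options(synack)


syn = "S 4120108902:4120108902(0) win 5840 <mss 1460,sackOK,timestamp 3036627 0,nop,wscale 2>"
synack_no_scale = "S 3820474812:3820474812(0) ack 3599527175 win 8190 <mss 1460>"
synack_scale = ("S 2295773021:2295773021(0) ack 4120108903 win 5792 "
                "<mss 1460,sackOK,timestamp 1818411318 3036627,nop,wscale 0>")

print(scaling_enabled(syn, synack_no_scale))  # first trace: server has no wscale
print(scaling_enabled(syn, synack_scale))     # second trace: both sides offer it
```

In the second trace the client's later advertisement of 1460 is scaled by its own factor of 4, which is the same 5840 bytes it announced in the unscaled SYN.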
Linux TCP window tuning
■ Send window - net.ipv4.tcp_wmem
  ■ three values: min default max
  ■ default is 4K 16K 128K
  ■ also limited by net.core.wmem_max
■ Receive window – net.ipv4.tcp_rmem
  ■ three values: min default max
  ■ default is 4K 85K 170K
  ■ also limited by net.core.rmem_max
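To see whether these limits are large enough for a given path, compare the third value against the bandwidth delay product. The sketch below parses a three-value setting and makes that comparison. The helper names are illustrative, the byte values "4096 87380 174760" are an assumption standing in for the slide's "4K 85K 170K", and read_sysctl only works on Linux because it reads /proc/sys.

```python
from pathlib import Path
from typing import Tuple


def parse_tcp_mem(setting: str) -> Tuple[int, int, int]:
    """Split a tcp_rmem/tcp_wmem style string into its three byte values."""
    first, default, maximum = (int(field) for field in setting.split())
    return first, default, maximum


def max_covers_bdp(setting: str, bdp_bytes: int) -> bool:
    """True if the largest allowed buffer can hold one BDP of data."""
    return parse_tcp_mem(setting)[2] >= bdp_bytes


def read_sysctl(name: str) -> str:
    """Read a sysctl such as 'net.ipv4.tcp_rmem' via /proc/sys (Linux only)."""
    return Path("/proc/sys", *name.split(".")).read_text().strip()


# Assumed byte values corresponding to the slide's receive defaults.
default_rmem = "4096 87380 174760"

print(max_covers_bdp(default_rmem, 62_500))     # DSL example from the BDP slide
print(max_covers_bdp(default_rmem, 8_750_000))  # gigabit across the US
```

On a live system the string would come from read_sysctl("net.ipv4.tcp_rmem") rather than a hard-coded value.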
Linux TCP window tuning
■ Overall memory – net.ipv4.tcp_mem
  ■ three values: low pressure max
  ■ automatic value based on system memory
■ Application window – net.ipv4.tcp_app_win
  ■ reserved space to handle slow applications
But!
■ Some firewalls and routers are buggy
  ■ Corrupt window scale: change N to 0
  ■ Forget to track state, or read RFC wrong
  ■ Connections will hang because initial window looks like a silly window
  ■ 1% of the net is buggy...
■ Linux 2.6.9 chooses window scale based on maximum possible receive window
  ■ Default tcp_rmem => window scale of 2
  ■ Buggy devices will see ¼ of the real window
Break
Performance Engineering process
■ Define your goal
■ Capture information
■ Analyze and form hypothesis
■ Prototype to validate hypothesis
■ If successful
  ■ Make changes on production system
  ■ Report problems or patches to others
Goal setting
■ Know what is possible:
  ■ bus bandwidth, network latency, etc.
■ Know your application
■ Compare with similar applications
TCP performance testing
■ Goal: Improve TCP performance over high bandwidth * delay links
■ Plan:
  ■ New TCP congestion control
  ■ Validate and test
Testing TCP over WAN
■ Want to test performance of TCP over high BDP links
■ Can't afford a 10Gbit trans-continental link
■ Proposal: emulate network delay over 1Gbit Ethernet
Existing network emulation tools
■ Dummynet
  http://info.iet.unipi.it/~luigi/ip_dummynet/
  I don't want to set up a separate FreeBSD machine
■ NISTnet
  http://snad.ncsl.nist.gov/itg/nistnet/
  Only on 2.4 and not ready to be in main tree
Netem
http://developer.osdl.org/shemminger/netem
■ Started out as simple delay only hack
■ Grown up to do all the functionality of NISTnet
Stack, top to bottom: TCP, IP, netem, Ethernet (eth0)
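For readers who want to reproduce the delay emulation, netem is driven through the tc command from iproute2. The slide does not show a command line, so the sketch below is an assumption about typical usage: it builds a "tc qdisc add dev DEVICE root netem delay Nms" invocation, and the helper names are made up. Running it for real needs root and a kernel with netem available.

```python
import subprocess
from typing import List


def netem_delay_command(device: str, delay_ms: int) -> List[str]:
    """Build the tc command that adds a fixed netem delay on a device."""
    return ["tc", "qdisc", "add", "dev", device, "root",
            "netem", "delay", f"{delay_ms}ms"]


def netem_clear_command(device: str) -> List[str]:
    """Build the tc command that removes the root qdisc again."""
    return ["tc", "qdisc", "del", "dev", device, "root"]


def run(command: List[str]) -> None:
    """Execute a command; raises CalledProcessError if tc reports failure."""
    subprocess.run(command, check=True)


# Print rather than execute, since applying the qdisc requires root.
print(" ".join(netem_delay_command("eth0", 100)))
print(" ".join(netem_clear_command("eth0")))
```

Sweeping delay_ms over a range such as 0 to 200 and measuring throughput at each step would give a throughput against delay curve for whichever congestion control is under test.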
Current TCP research
■ Alternative TCP congestion control
  ■ Vegas
  ■ Westwood
  ■ Binary Increase Congestion Control (BIC)
■ Research community based around Web100
TCP Reno
■ Standard default in 2.4/2.6
■ Adjusts congestion window based on packet loss
■ Slow start – window grows slowly
■ Additive Increase window on each Ack
■ Multiplicative Decrease on loss
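The shape these rules produce is easier to see in a toy model. The sketch below is a deliberately simplified, round-by-round simulation and not the kernel's implementation: the window is counted in segments, each loop iteration stands for one round trip, slow start doubles the window up to a threshold, congestion avoidance adds one segment per round, and a loss halves the window. The function name and the starting threshold of 8 are assumptions chosen for illustration.

```python
from typing import List, Set


def reno_window_trace(rounds: int, loss_rounds: Set[int],
                      ssthresh: int = 8) -> List[int]:
    """Congestion window (in segments) at the start of each round trip."""
    cwnd = 1
    trace = []
    for rnd in range(rounds):
        trace.append(cwnd)
        if rnd in loss_rounds:
            # Multiplicative decrease: halve the window on loss.
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow start: window doubles each round until the threshold.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Additive increase: one more segment per round trip.
            cwnd += 1
    return trace


# Eight round trips with a single loss detected in round 5.
print(reno_window_trace(8, {5}))
```

With a long round trip time each of those rounds is slow, which is why recovering the window after a loss takes so long on high delay links.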
TCP Vegas
■ Original work by Larry Peterson
■ Patches existed for 2.2, 2.4 and part of Web100
■ sysctl net.ipv4.tcp_cong_avoid
■ Measure bandwidth based on RTT
■ Adjust congestion window on bandwidth
■ Avoids packet loss
TCP Westwood
■ Work by Claudio Casetti
■ Patches for 2.4 by Angelo Dell'Aera
■ sysctl net.ipv4.tcp_westwood
■ Focused on wireless
  ■ packet loss != congestion
■ Measure bandwidth based on RTT
■ Use normal Reno till congestion then adjust congestion window based on bandwidth
Binary Increase Congestion Control (BIC)
■ Work by Lisong Xu
■ Patches for Web100 (2.4)
■ sysctl net.ipv4.tcp_bic
■ Designed for high speed networks
■ Modification of Reno
■ Use additive increase when congestion window is large
■ Binary search increase when window is small
Tuning
■ Default TCP parameters not big enough
  ■ Need bigger send and receive window
■ Send window autosized based on RTT already
■ Receive window autosizing was done in Web100
Receiver Tuning
■ Patches from John Heffner
■ sysctl net.ipv4.tcp_moderate_rcvbuf
■ Dynamic Right Sizing (DRS)
  ■ adjust receive window based on RTT
  ■ If application doesn't set window then do it for them
  ■ Window will grow from default to max
Receiver auto-tuning
[Figure: throughput (Mbits/sec, 0 to 1000) against delay (ms, 0 to 200) for the Default and Auto Tuned receive windows.]
Throughput vs Delay (initial run)
[Figure: bandwidth (Mbits/sec, 0 to 800) against delay (ms, 0 to 200) for Reno, Vegas, Westwood and BIC.]
What's happening
■ NAPI
  ■ Driver API to allow avoiding interrupts
  ■ Trades off latency for overall performance
■ E1000 driver
  ■ Uses NAPI for transmit
Answer: Transmit ring gets full and driver flow blocks
Solution: set TxDescriptors=1000
Throughput vs Delay (rerun)
[Figure: throughput (axis labelled bits/sec, 0 to 800) against delay (ms, 0 to 200) for Reno, Vegas, Westwood and BIC.]
Performance still slow
■ Vegas and Westwood are terrible
■ Not at full link speed
■ Performance falling off with delay
Vegas trace with 100ms delay
Vegas detail
Westwood (70ms)
Westwood detail
BIC trace (100ms)
BIC detail (100ms)
How to squeeze out more performance
■ Large MTU (4k): +63%
■ LAN driver not a module: up to 10%
■ Turn off timestamps: +4%
■ Bind IRQ to processor: varies
Congestion more work
■ Vegas doesn't use available window
  ■ Does it underestimate bandwidth?
■ Westwood
  ■ Another bandwidth problem
■ BIC
  ■ When does it make it into binary mode?
  ■ What is holding back window?
■ Netem
  ■ Higher resolution? Packet groups?
Break
Other tools
■ Information about
  ■ ISP connection
  ■ Sockets open
■ Testing infrastructure
■ More data capture
■ Monitoring
Tools: basic
■ Network path information
■ Ping – send ICMP echo
  ■ Measure of round trip time and loss
  ■ Can be blocked by firewall
■ Traceroute – use IP source routing
  ■ Usually blocked now
■ Pathcapture (pcap)
  ■ Bandwidth and delay measurement
Tools: Network interface
■ ifconfig
  ■ Basic statistics, packets sent/received/errors
■ ip -stats link
  ■ Newer alternative, may have more info
■ SNMP
  ■ Remote access to same information
  ■ Slightly more work
Tools: Sockets
■ Netstat
  ■ TCP statistics
  ■ Open sockets
■ Ss
  ■ More statistics available (rtt, etc)
■ Recvmsg
  ■ Application can see TCP info (cmsg)
Tools: test servers
■ SYN test
  telnet syntest.psc.edu 7960
■ TCP bandwidth
  http://www.epm.ornl.gov/~dunigan/java/misc/tcpbw.html
  http://dslreports.com
■ ANL network config
  http://miranda.ctd.anl.gov:7123
■ Path MTU
  http://www.ncne.org/jumbogram/mtu_discovery.php
Tools: testing
■ Ttcp
  ■ Basic send/receive throughput
■ Iperf
  ■ Longer running tests and turnaround
■ Netperf
  ■ Includes cpu and other statistics
■ Dbs
  ■ Multiclient testing
Tools: monitoring
■ Ntop
  ■ Measure of network activity by service
  ■ Nice web interface
■ Mailgraph
  ■ Long term mail statistics
■ Web server activity log analysis
Tools: data capture
■ Tcpdump
  ■ Filter packets by protocol, address, etc
  ■ Decode many protocols
■ Ethereal
  ■ GUI interface
■ RMON
  ■ Remote monitoring
■ Kismet
  ■ Wireless activity
Tools: generators
■ Pktgen
  ■ Kernel level packet generation
  ■ Can generate maximum hardware packet rate
■ Network packet generator
  ■ Application level
Tools: simulation
■ Ns
  ■ Describe overall system
  ■ Event based simulation
  ■ Used for protocol analysis
■ SSFnet
  ■ More detailed models of real hardware
Tools: client simulator
■ Web
  ■ SPECweb, Apache (ab), httpload
■ NFS
  ■ Nfsstone
■ FTP
  ■ Dkftpbench
Conclusion
■ Data capture can provide clues of:
  ■ Application problems
  ■ Device problems
  ■ TCP/IP problems
■ Nothing is ever simple