Anders Magnusson
TCP Tuning and E2E Performance
TREFpunkt - October 20, 2004
Anders Magnusson<[email protected]>
October 20, 2004
The speed-of-light problem
The sender must store every sent packet until it has received an ACK from the receiver
Because of speed-of-light limitations this might take a while, even in small countries like Sweden
Theoretical RTT Luleå-Stockholm is (1000 km / 300000 km/s) * 2 = 6.7 ms; in reality it is about 20 ms
TCP window size to keep up with 1 Gbit/s must then be (1000/8) Mbyte/s * 0.02 s = 2.5 Mbyte
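The window arithmetic above is the bandwidth-delay product. A minimal Python sketch of the same calculation, using the figures from the slide (the function names are illustrative and not from the talk):

```python
SPEED_OF_LIGHT_KM_S = 300_000  # rounded value used on the slide


def theoretical_rtt_s(distance_km: float) -> float:
    """Round-trip time if the signal travelled at the speed of light."""
    return (distance_km / SPEED_OF_LIGHT_KM_S) * 2


def required_window_bytes(bandwidth_bit_s: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return (bandwidth_bit_s / 8) * rtt_s


if __name__ == "__main__":
    print(f"Theoretical RTT: {theoretical_rtt_s(1000) * 1000:.1f} ms")
    print(f"Window at 20 ms: {required_window_bytes(1e9, 0.020) / 1e6:.1f} Mbyte")
```

The measured RTT, not the theoretical one, is what should go into the window calculation.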
Operating system buffers
Inside the operating system kernel there are usually a bunch of different buffers affecting performance
The term “buffers” is somewhat misleading: usually it is just some sort of data structure that is used to reference data in memory (but in theory it could just as well be real buffers)
TCP window buffers
The TCP window sizes can be adjusted on virtually all operating systems
There are two windows, send and receive
The window size for one direction of flow is set to MIN(sender’s send window, receiver’s receive window)
The send window must be large enough to hold all segments sent during one RTT
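Since at most one window of data can be in flight per RTT, the effective window also caps the throughput. A small Python sketch of that relation (the function names are hypothetical; the 65535-byte value is the largest window possible without window scaling):

```python
def effective_window_bytes(send_window: int, recv_window: int) -> int:
    """The usable window is the smaller of the two configured windows."""
    return min(send_window, recv_window)


def max_throughput_bit_s(send_window: int, recv_window: int, rtt_s: float) -> float:
    """Upper bound on throughput: one effective window per round trip."""
    return effective_window_bytes(send_window, recv_window) * 8 / rtt_s


if __name__ == "__main__":
    # A 2.5 Mbyte send window does not help if the receiver only offers 65535 bytes.
    limit = max_throughput_bit_s(2_500_000, 65_535, 0.020)
    print(f"Throughput ceiling at 20 ms RTT: {limit / 1e6:.1f} Mbit/s")
```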
Socket buffers
Socket buffers limit the amount of data an application may write to the kernel before being blocked
Often combined with the TCP send window; when ACKs are received, the data held in the socket buffer is adjusted accordingly
Must be >= the TCP window size, otherwise the socket buffer becomes the limiting factor
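An application can also ask for larger socket buffers on a single socket. A minimal Python sketch using the standard SO_SNDBUF and SO_RCVBUF socket options; the helper name and the 256 kbyte request are illustrative assumptions, and the value the kernel actually grants depends on the system-wide limits:

```python
import socket


def request_socket_buffers(sock: socket.socket, size_bytes: int) -> tuple:
    """Ask for send/receive buffers of size_bytes; return what the kernel reports."""
    for option in (socket.SO_SNDBUF, socket.SO_RCVBUF):
        try:
            sock.setsockopt(socket.SOL_SOCKET, option, size_bytes)
        except OSError:
            # Some systems reject oversized requests instead of clamping them.
            pass
    return (
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    )


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        snd, rcv = request_socket_buffers(s, 256 * 1024)
        print(f"send buffer: {snd} bytes, receive buffer: {rcv} bytes")
```

The reported sizes may differ from the request: the kernel may clamp the value to its configured maximum, and some systems report an adjusted figure that includes bookkeeping overhead.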
MBUF clusters
There are limits on how many network buffers (called MBUFs in many OSes) may be allocated
MBUFs may have external storage associated with them, allocated out of a separate (limited) area
These buffers are often sized at compile time, and it is not uncommon that physical memory is statically allocated for them
Other knobs to turn
RFC1323: Turns on “Window scaling option” needed to use larger TCP windows than 64k
Initial window size: Avoid slow-start by injecting many packets into the network at connection startup
Interface queues: Be able to store the packets that are ready to send until the network interface can transmit them
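The window field in the TCP header is 16 bits, which is where the 64k limit comes from; the window scaling option shifts that value left by up to 14 bits. A short Python sketch that finds the smallest shift needed for a given window (the function name is hypothetical):

```python
MAX_UNSCALED_WINDOW = 65_535  # largest value of the 16-bit TCP window field
MAX_SHIFT = 14                # largest shift count allowed by RFC 1323


def window_scale_shift(window_bytes: int) -> int:
    """Smallest shift count that lets the 16-bit field express window_bytes."""
    for shift in range(MAX_SHIFT + 1):
        if (MAX_UNSCALED_WINDOW << shift) >= window_bytes:
            return shift
    raise ValueError("window too large for RFC 1323 window scaling")


if __name__ == "__main__":
    for window in (65_535, 2_500_000, 25_000_000):
        print(f"{window} bytes needs shift {window_scale_shift(window)}")
```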
Problems often seen
Packet loss
On a long-distance high-speed connection, packet loss in a TCP flow will reduce the speed significantly
If the sender enters congestion avoidance, the congestion window will open linearly, and with large windows this will be really slow
With an RTT of 185ms and window size of 25MB it will take around 50 minutes to reach full speed
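The 50-minute figure can be reproduced with a rough model in which the congestion window grows by one segment per RTT. This Python sketch assumes 1460-byte segments and a 25 Mbyte window of 25,000,000 bytes; neither value is stated in the talk, and the function name is illustrative:

```python
def linear_recovery_time_s(window_bytes: float, rtt_s: float,
                           mss_bytes: int = 1460,
                           start_fraction: float = 0.0) -> float:
    """Time to grow the window linearly, one segment per RTT.

    start_fraction is the share of the window that is already open
    when linear growth begins (0.0 = start from nothing, 0.5 = halved).
    """
    segments_to_regain = window_bytes * (1 - start_fraction) / mss_bytes
    return segments_to_regain * rtt_s


if __name__ == "__main__":
    full = linear_recovery_time_s(25_000_000, 0.185)
    half = linear_recovery_time_s(25_000_000, 0.185, start_fraction=0.5)
    print(f"Growing the whole window: {full / 60:.0f} minutes")
    print(f"Growing back from half:   {half / 60:.0f} minutes")
```

Under these assumptions, growing the whole window takes a little over 50 minutes, and even recovering from a halved window takes roughly half that.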
Problems often seen
Packet bursts
During the startup of a TCP bulk flow, the exponential increase in packet injection into the network during slow-start may cause packet bursts on links with large bandwidth-delay product
The result may be that intermediate switches/routers must drop packets, even though the TCP self-clocking would not permit more packets to be sent than could be received
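To get a feel for the burst sizes, the doubling during slow-start can be sketched in Python. This is a simplified model that assumes the congestion window doubles every RTT, starts at one segment, and uses 1460-byte segments; these are assumptions, not figures from the talk:

```python
import math


def slow_start_windows(target_window_bytes: float, mss_bytes: int = 1460,
                       initial_segments: int = 1) -> list:
    """Congestion window (in segments) per RTT until the target is covered."""
    target_segments = math.ceil(target_window_bytes / mss_bytes)
    cwnd = initial_segments
    history = [cwnd]
    while cwnd < target_segments:
        cwnd *= 2  # idealised slow-start: the window doubles every RTT
        history.append(cwnd)
    return history


if __name__ == "__main__":
    windows = slow_start_windows(25_000_000)
    print(f"RTTs needed: {len(windows) - 1}")
    print(f"Segments per RTT in the last rounds: {windows[-3:]}")
```

In the last few rounds the sender injects thousands of additional segments per RTT, which is where queues in intermediate equipment are most likely to overflow.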
Problems often seen
ACK/window updates
Traditional approach for bulk flows is for the receiver to send an ACK for every second received packet
Window updates are sent as soon as data is delivered to the receiving process
This will cause the number of return packets to be more than half the number of transmitted packets
Interrupts and packet handling in the sending host may use a significant amount of CPU
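A rough Python estimate of the ACK load on the sender. The 1 Gbit/s rate and 1500-byte packets are example values only, and window updates come on top of the figure computed here:

```python
def ack_packets_per_s(data_bit_s: float, packet_bytes: int,
                      ack_every: int = 2) -> float:
    """ACKs per second when the receiver ACKs every ack_every-th data packet."""
    data_packets_per_s = data_bit_s / (packet_bytes * 8)
    return data_packets_per_s / ack_every


if __name__ == "__main__":
    acks = ack_packets_per_s(1e9, 1500)
    print(f"About {acks:.0f} ACKs/second, not counting window updates")
```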
Problems often seen
ARP timeouts
When an ARP entry times out, it is usually just removed from the ARP cache, and the next packet will initiate a new ARP request
If there is an ongoing packet flow, this approach may cause packets to be dropped until an ARP reply is received
Tuning of NetBSD
sysctl -w net.inet.tcp.rfc1323=1
Activate window scaling and timestamp options according to RFC 1323.
sysctl -w kern.somaxkva=[sbmax]
Set maximum size for all socket buffers together in the system.
sysctl -w kern.sbmax=[sbmax]
Set maximum size of socket buffer for one TCP flow.
sysctl -w net.inet.tcp.recvspace=[wstd]
sysctl -w net.inet.tcp.sendspace=[wstd]
Set max size of TCP windows.
sysctl kern.mbuf.nmbclusters
View maximum number of mbuf clusters. Used for storage of data packets to/from the network interface. Can only be set by recompiling your kernel.
Tuning of FreeBSD
sysctl net.inet.tcp.rfc1323=1
Activate window scaling and timestamp options according to RFC 1323.
sysctl kern.ipc.maxsockbuf=[sbmax]
Set maximum size of TCP window (the upper limit for a socket buffer).
sysctl net.inet.tcp.recvspace=[wstd]
sysctl net.inet.tcp.sendspace=[wstd]
Set max size of TCP windows.
sysctl kern.ipc.nmbclusters
View maximum number of mbuf clusters. Used for storage of data packets to/from the network interface. Can only be set at boot time.
Tuning of Linux
echo "1" > /proc/sys/net/ipv4/tcp_window_scaling
Activate window scaling according to RFC 1323.
echo [wmax] > /proc/sys/net/core/rmem_max
echo [wmax] > /proc/sys/net/core/wmem_max
Set maximum size of TCP windows.
echo [wmax] > /proc/sys/net/core/rmem_default
echo [wmax] > /proc/sys/net/core/wmem_default
Set default size of TCP windows.
echo "[wmin] [wstd] [wmax]" > /proc/sys/net/ipv4/tcp_rmem
echo "[wmin] [wstd] [wmax]" > /proc/sys/net/ipv4/tcp_wmem
Set min, default, max windows. Used by the autotuning function.
echo "[bmin] [bdef] [bmax]" > /proc/sys/net/ipv4/tcp_mem
Set maximum total TCP buffer-space allocatable (values are in memory pages, not bytes). Used by the autotuning function.
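As an illustration of how the placeholders might be filled in, this Python sketch builds the window-related commands for a given [wmin], [wstd] and [wmax]. The helper name and the example values are assumptions for illustration, with [wmax] taken from the 2.5 Mbyte window computed earlier in the talk. It only prints the commands; it does not change any system settings:

```python
def linux_tcp_window_commands(wmin: int, wstd: int, wmax: int) -> list:
    """Build the window-related tuning commands with concrete values."""
    triple = f"{wmin} {wstd} {wmax}"
    return [
        'echo "1" > /proc/sys/net/ipv4/tcp_window_scaling',
        f"echo {wmax} > /proc/sys/net/core/rmem_max",
        f"echo {wmax} > /proc/sys/net/core/wmem_max",
        f'echo "{triple}" > /proc/sys/net/ipv4/tcp_rmem',
        f'echo "{triple}" > /proc/sys/net/ipv4/tcp_wmem',
    ]


if __name__ == "__main__":
    # Example values only: 4096 and 87380 are illustrative, not recommendations.
    for command in linux_tcp_window_commands(4096, 87380, 2_500_000):
        print(command)
```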
Tuning of Windows (2k, XP, 2k3)
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Tcp1323Opts=1
Turn on window scaling option
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\TcpWindowSize=[wmax]
Set maximum size of TCP window
How to set a Land Speed Record
Recipe:
Really high-quality networks
Hardware capable of sending/receiving fast enough
Operating system without foolish bottlenecks
Enthusiasts that spend weekends sending an obscene amount of data between Luleå and San Jose
SUNET Internet Land Speed Record - Network setup
[Figure: network diagram. Labels: end host in Luleå, Sweden; GigaSunet OC-192 core; Sprintlink OC-192 core; end host in San Jose, CA; links marked 10GE, 10GE and OC192.]
Network path consists of 42(!) router hops, using paths shared with other users of the networks.
Records submitted September 12
1 966 080 000 000 bytes in 3648 real seconds = 4310 Mbit/second
1831 Gbytes in almost exactly an hour
120 000 packets/second transferred with an MTU of 4470 bytes
Record submitted for the IPv4 single and multiple stream class is 124.935 Petabit-meters/second (which is a 78% increase over our previous record)
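The record figures can be cross-checked with a few lines of Python. The distance used is the fiber path length of 28,983 km quoted later in the talk; treating "Gbytes" as 2^30 bytes and a packet as one full MTU are assumptions made here. The results come out close to, though not identical with, the rounded values on the slide:

```python
BYTES_SENT = 1_966_080_000_000
SECONDS = 3648
DISTANCE_M = 28_983_000  # fiber path Luleå - San Jose, from the talk
MTU_BYTES = 4470


def record_figures() -> dict:
    """Recompute the headline numbers from bytes sent, time and distance."""
    bit_s = BYTES_SENT * 8 / SECONDS
    return {
        "mbit_s": bit_s / 1e6,
        "gbytes": BYTES_SENT / 2**30,
        "packets_s": bit_s / 8 / MTU_BYTES,
        "petabit_m_s": bit_s * DISTANCE_M / 1e15,
    }


if __name__ == "__main__":
    for name, value in record_figures().items():
        print(f"{name}: {value:.2f}")
```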
Compared with others
Compared to the previous record, we can note that we achieved this using:
Less powerful end hosts
200% longer distance
Less than half the MTU size (which generates heavier CPU load on the end hosts)
The normal GigaSunet and Sprintlink production infrastructures
Fiber path for the Internet LSR
Distance from Luleå, Sweden to San Jose, CA is approximately 28,983 km (18,013 miles)
More to read…
http://proj.sunet.se/LSR Describes how the Land Speed Record(s) were achieved
http://proj.sunet.se/E2E About end-to-end performance in GigaSunet