Transport Layer and Data Center TCP
Hakim Weatherspoon, Assistant Professor, Dept. of Computer Science
CS 5413: High Performance Systems and Networking
September 5, 2014
Slides used and adapted judiciously from Computer Networking: A Top-Down Approach
Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
• Congestion control
• Data Center TCP
  – Incast Problem
• provide logical communication between app processes running on different hosts
• transport protocols run in end systems:
  – send side: breaks app messages into segments, passes to network layer
  – rcv side: reassembles segments into messages, passes to app layer
• more than one transport protocol available to apps:
  – Internet: TCP and UDP
(figure: application/transport/network/data link/physical stacks at the two end hosts, with logical end-end transport between them)
Transport Layer Services/Protocols
network layer: logical communication between hosts
transport layer: logical communication between processes; relies on, enhances network layer services
household analogy:
12 kids in Ann's house sending letters to 12 kids in Bill's house
• hosts = houses
• processes = kids
• app messages = letters in envelopes
• transport protocol = Ann and Bill, who demux to in-house siblings
• network-layer protocol = postal service
Transport vs Network Layer
• reliable, in-order delivery (TCP):
  – congestion control
  – flow control
  – connection setup
• unreliable, unordered delivery: UDP
  – no-frills extension of "best-effort" IP
• services not available:
  – delay guarantees
  – bandwidth guarantees
(figure: end hosts run application/transport/network/data link/physical; routers along the path run only network/data link/physical; logical end-end transport between the hosts)
Transport Layer Services/Protocols
TCP service:
• reliable transport between sending and receiving process
• flow control: sender won't overwhelm receiver
• congestion control: throttle sender when network overloaded
• does not provide: timing, minimum throughput guarantee, security
• connection-oriented: setup required between client and server processes
UDP service:
• unreliable data transfer between sending and receiving process
• does not provide: reliability, flow control, congestion control, timing, throughput guarantee, security, or connection setup
Q: why bother? Why is there a UDP?
Transport Layer Services/Protocols
Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
• Congestion control
• Data Center TCP
  – Incast Problem
multiplexing at sender: handle data from multiple sockets, add transport header (later used for demultiplexing)
demultiplexing at receiver: use header info to deliver received segments to correct socket
(figure: three hosts, each with an application/transport/network/link/physical stack; processes P1, P2, P3, P4 attached to sockets on the different hosts)
Transport Layer: Sockets, Multiplexing/Demultiplexing
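As a minimal sketch of the idea above: two UDP sockets on one host are bound to different ports, and the OS demultiplexes each arriving datagram to the socket whose bound port matches the segment's destination port. The loopback address and message strings are illustrative choices.

```python
import socket

# Two receiving UDP sockets on the same host, bound to different ports.
rx_a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx_a.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
rx_b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx_b.bind(("127.0.0.1", 0))
rx_a.settimeout(2.0)
rx_b.settimeout(2.0)

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# Multiplexing at the sender: both datagrams leave through one socket/link...
tx.sendto(b"for A", ("127.0.0.1", rx_a.getsockname()[1]))
tx.sendto(b"for B", ("127.0.0.1", rx_b.getsockname()[1]))

# ...demultiplexing at the receiver: the OS uses the destination port in
# each UDP header to deliver the datagram to the correct socket.
msg_a, _ = rx_a.recvfrom(1024)
msg_b, _ = rx_b.recvfrom(1024)
print(msg_a, msg_b)   # b'for A' b'for B'
```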
Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
• Congestion control
• Data Center TCP
  – Incast Problem
UDP segment format (32 bits wide):
  source port | dest port
  length | checksum
  application data (payload)
length: length, in bytes, of UDP segment, including header
why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small header size
• no congestion control: UDP can blast away as fast as desired
UDP Connectionless Transport: UDP Segment Header
Goal: detect "errors" (e.g., flipped bits) in transmitted segment
sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: addition (one's complement sum) of segment contents
• sender puts checksum value into UDP checksum field
receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  – NO: error detected
  – YES: no error detected. But maybe errors nonetheless? More later …
UDP Connectionless Transport: UDP Checksum
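The sender/receiver steps above can be sketched as a one's-complement sum over 16-bit words; the sample words are arbitrary illustrative values, not a real segment.

```python
def ones_complement_sum16(words):
    """One's-complement sum of 16-bit words (carry is folded back in)."""
    total = 0
    for w in words:
        total += w
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return total

def udp_checksum(words):
    # Sender: complement of the one's-complement sum goes in the header.
    return ~ones_complement_sum16(words) & 0xFFFF

words = [0x4500, 0x1C46, 0xB1E6]   # illustrative 16-bit words
csum = udp_checksum(words)

# Receiver: summing all words plus the checksum yields 0xFFFF if no error.
assert ones_complement_sum16(words + [csum]) == 0xFFFF
```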
Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
• Congestion control
• Data Center TCP
  – Incast Problem
important in application, transport, link layers: top-10 list of important networking topics!
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
Principles of Reliable Transport
send side / receive side interfaces:
• rdt_send(): called from above (e.g., by app); passed data to deliver to receiver upper layer
• udt_send(): called by rdt to transfer packet over unreliable channel to receiver
• rdt_rcv(): called when packet arrives on rcv side of channel
• deliver_data(): called by rdt to deliver data to upper layer
Principles of Reliable Transport
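The four interface functions above can be wired together in a minimal sketch: a trivial rdt over a channel that (unlike a real unreliable channel) never loses or corrupts packets, so no sequence numbers or ACKs are needed yet. The in-memory `channel` list is an assumption standing in for the network.

```python
delivered = []   # data handed up to the receiving app
channel = []     # stand-in for the channel (here: loss- and error-free)

def rdt_send(data):
    # Called from above by the sending app.
    udt_send({"payload": data})

def udt_send(packet):
    # Hand the packet to the channel; the channel delivers it immediately.
    channel.append(packet)
    rdt_rcv(channel.pop(0))

def rdt_rcv(packet):
    # Called when a packet arrives on the receive side of the channel.
    deliver_data(packet["payload"])

def deliver_data(data):
    # Hand data up to the receiving app.
    delivered.append(data)

rdt_send("hello")
rdt_send("world")
print(delivered)   # ['hello', 'world']
```

Once the channel can lose or corrupt packets, this skeleton grows checksums, sequence numbers, ACKs, and timers, which is exactly the progression TCP follows.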
TCP: Transmission Control Protocol
RFCs: 793, 1122, 1323, 2018, 2581
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP Reliable Transport
TCP segment structure (32 bits wide):
  source port | dest port
  sequence number
  acknowledgement number
  head len | not used | U | A | P | R | S | F | receive window
  checksum | urg data pointer
  options (variable length)
  application data (variable length)
field notes:
• sequence/acknowledgement numbers: counting by bytes of data (not segments)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now (generally not used)
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
TCP Reliable Transport: TCP Segment Structure
sequence numbers:
  – byte stream "number" of first byte in segment's data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how does receiver handle out-of-order segments?
A: TCP spec doesn't say; up to implementor
(figure: outgoing segment from sender and incoming segment to sender, each showing source port, dest port, sequence number, acknowledgement number, rwnd, checksum, urg pointer; sender sequence number space, window size N: sent & ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable)
TCP Reliable Transport: TCP Sequence numbers and Acks
simple telnet scenario (Host A, Host B):
• User types 'C': A sends Seq=42, ACK=79, data = 'C'
• host B ACKs receipt of 'C', echoes back 'C': Seq=79, ACK=43, data = 'C'
• host A ACKs receipt of echoed 'C': Seq=43, ACK=80
TCP Reliable Transport: TCP Sequence numbers and Acks
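The byte-counting rule behind the telnet example is small enough to check directly: an ACK carries the sequence number of the next byte expected, i.e. the received segment's seq plus its payload length. The helper name is ours, not TCP's.

```python
def next_ack(seq: int, nbytes: int) -> int:
    # ACK number = sequence number of the next byte expected.
    return seq + nbytes

# A sends 'C' with Seq=42 while expecting byte 79 from B (ACK=79).
a_seq, b_seq = 42, 79
ack_from_b = next_ack(a_seq, len(b"C"))   # B now expects byte 43 from A
ack_from_a = next_ack(b_seq, len(b"C"))   # after B echoes 'C', A expects 80
print(ack_from_b, ack_from_a)             # 43 80
```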
TCP: Transmission Control Protocol
RFCs: 793, 1122, 1323, 2018, 2581
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP Reliable Transport
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
client and server each end up with:
  connection state: ESTAB
  connection variables: seq # (client-to-server, server-to-client), rcvBuffer size at server, client
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
Connection Management: TCP 3-way handshake
TCP Reliable Transport
client state: LISTEN -> SYNSENT -> ESTAB; server state: LISTEN -> SYN RCVD -> ESTAB
• client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); enter SYNSENT
• server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); enter SYN RCVD
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; enter ESTAB
• server: received ACK(y) indicates client is live; enter ESTAB
TCP Reliable Transport: Connection Management: TCP 3-way handshake
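The sequence-number bookkeeping in the exchange above can be sketched with a hypothetical helper (not a real TCP implementation) that just records the three messages:

```python
def three_way_handshake(x: int, y: int):
    """Return the three handshake messages for client ISN x, server ISN y."""
    return [
        ("SYN",    {"seq": x}),                      # client -> server
        ("SYNACK", {"seq": y, "acknum": x + 1}),     # server -> client
        ("ACK",    {"acknum": y + 1}),               # client -> server
    ]

msgs = three_way_handshake(x=100, y=300)
assert msgs[1][1]["acknum"] == 101   # server ACKs client's ISN + 1
assert msgs[2][1]["acknum"] == 301   # client ACKs server's ISN + 1
```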
(FSM figure)
client: CLOSED -> [Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)] -> SYNSENT -> [rcv SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)] -> ESTAB
server: CLOSED -> [Socket connectionSocket = welcomeSocket.accept(); Λ] -> LISTEN -> [rcv SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client] -> SYN RCVD -> [rcv ACK(ACKnum=y+1); Λ] -> ESTAB
TCP Reliable Transport: Connection Management: TCP 3-way handshake
client, server each close their side of connection: send TCP segment with FIN bit = 1
respond to received FIN with ACK; on receiving FIN, ACK can be combined with own FIN
simultaneous FIN exchanges can be handled
TCP Reliable Transport: Connection Management: Closing connection
client state: ESTAB -> FIN_WAIT_1 -> FIN_WAIT_2 -> TIMED_WAIT -> CLOSED
server state: ESTAB -> CLOSE_WAIT -> LAST_ACK -> CLOSED
• client: clientSocket.close(); send FINbit=1, seq=x; can no longer send but can receive data (FIN_WAIT_1)
• server: send ACKbit=1, ACKnum=x+1; can still send data (CLOSE_WAIT)
• client: wait for server close (FIN_WAIT_2)
• server: send FINbit=1, seq=y; can no longer send data (LAST_ACK)
• client: send ACKbit=1, ACKnum=y+1; timed wait for 2 × max segment lifetime (TIMED_WAIT), then CLOSED
• server: on receiving that ACK, CLOSED
TCP Reliable Transport: Connection Management: Closing connection
TCP: Transmission Control Protocol
RFCs: 793, 1122, 1323, 2018, 2581
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP Reliable Transport
sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments
TCP Reliable Transport
lost ACK scenario (Host A, Host B):
• A sends Seq=92, 8 bytes of data; B's ACK=100 is lost (X)
• timeout: A retransmits Seq=92, 8 bytes of data; B re-sends ACK=100
• SendBase=100
premature timeout (Host A, Host B):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (SendBase=92)
• A's timer for Seq=92 expires before the ACKs arrive: A retransmits Seq=92, 8 bytes of data
• B replies with cumulative ACK=120 (it already has all 28 bytes); SendBase=120
TCP Reliable Transport: TCP Retransmission Scenarios
cumulative ACK scenario (Host A, Host B):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• ACK=100 is lost (X), but ACK=120 arrives before timeout
• cumulative ACK=120 covers both segments: A continues with Seq=120, 15 bytes of data
TCP Reliable Transport: TCP Retransmission Scenarios
TCP ACK generation [RFC 1122, RFC 2581]
event at receiver -> TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed -> delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending -> immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected -> immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap -> immediately send ACK, provided that segment starts at lower end of gap
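The first three table rows can be sketched as a decision function. This is a simplified model of our own making: it tracks only the next expected byte and whether an ACK is already pending, and it elides the real 500ms timer and the gap-fill case.

```python
def ack_action(seg_seq: int, next_expected: int, ack_pending: bool) -> str:
    if seg_seq == next_expected:                    # in-order arrival
        if ack_pending:
            return "send cumulative ACK now"        # ACKs both segments
        return "delay ACK up to 500ms"
    if seg_seq > next_expected:                     # gap detected
        return "send duplicate ACK for %d" % next_expected
    return "send ACK now"                           # old/overlapping data

assert ack_action(100, 100, ack_pending=False) == "delay ACK up to 500ms"
assert ack_action(100, 100, ack_pending=True) == "send cumulative ACK now"
assert ack_action(108, 100, ack_pending=False) == "send duplicate ACK for 100"
```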
• time-out period often relatively long: long delay before resending lost packet
• detect lost segments via duplicate ACKs:
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout
TCP Reliable Transport: TCP Fast Retransmit
fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data; the Seq=100 segment is lost (X)
• later segments each trigger another ACK=100, so A sees three duplicate ACK=100s
• A retransmits Seq=100, 20 bytes of data before its timeout expires
TCP Reliable Transport: TCP Fast Retransmit
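The trigger in the scenario above can be sketched directly: count repeats of the same ACK number and retransmit on the third duplicate (the fourth ACK overall for that byte). The function name is illustrative.

```python
def fast_retransmit_trigger(acks):
    """Return the byte offsets whose segments would be fast-retransmitted."""
    dup_count = {}
    retransmitted = []
    last_ack = None
    for ack in acks:
        if ack == last_ack:
            dup_count[ack] = dup_count.get(ack, 0) + 1
            if dup_count[ack] == 3:           # triple duplicate ACK
                retransmitted.append(ack)     # resend segment starting at ack
        else:
            dup_count[ack] = 0                # a new ACK number: reset
        last_ack = ack
    return retransmitted

print(fast_retransmit_trigger([100, 100, 100, 100]))   # [100]
```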
Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary; want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT
TCP Reliable Transport: TCP Round-trip time and timeouts
EstimatedRTT = (1 - α) · EstimatedRTT + α · SampleRTT
• exponentially weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
(figure: SampleRTT and EstimatedRTT in milliseconds, over roughly 100 seconds, for the RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; EstimatedRTT is the smoother curve)
TCP Reliable Transport: TCP Round-trip time and timeouts
• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  DevRTT = (1 - β) · DevRTT + β · |SampleRTT - EstimatedRTT|
  (typically, β = 0.25)
TimeoutInterval = EstimatedRTT + 4 · DevRTT
  (estimated RTT plus "safety margin")
TCP Reliable Transport: TCP Round-trip time and timeouts
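A sketch of the estimator defined by the two formulas above, using the typical gains α = 0.125 and β = 0.25. One assumption: DevRTT is updated from the previous EstimatedRTT, as RFC 6298 does; the slide does not pin down the order. The sample values are invented.

```python
def update_rtt(estimated, dev, sample, alpha=0.125, beta=0.25):
    """One SampleRTT observation -> new EstimatedRTT, DevRTT, TimeoutInterval."""
    dev = (1 - beta) * dev + beta * abs(sample - estimated)
    estimated = (1 - alpha) * estimated + alpha * sample
    return estimated, dev, estimated + 4 * dev   # last value: TimeoutInterval

est, dev = 100.0, 5.0                  # initial values, in ms
for sample in [110.0, 95.0, 130.0]:    # measured SampleRTTs (retransmits ignored)
    est, dev, timeout = update_rtt(est, dev, sample)
print(est, timeout)    # TimeoutInterval sits 4·DevRTT above EstimatedRTT
```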
TCP: Transmission Control Protocol
RFCs: 793, 1122, 1323, 2018, 2581
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP Reliable Transport
application may remove data from TCP socket buffers …
… slower than TCP receiver is delivering (sender is sending)
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
(figure: receiver protocol stack: application process above the OS; TCP socket receiver buffers between the application and the TCP code; IP code below; segments arriving from sender)
TCP Reliable Transport: Flow Control
receiver-side buffering:
(figure: RcvBuffer holding buffered data plus free buffer space (rwnd); TCP segment payloads arrive from the sender; the application process reads from the buffer)
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
TCP Reliable Transport: Flow Control
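The sender-side rule above reduces to simple arithmetic: never have more unACKed bytes in flight than the last advertised rwnd. A minimal sketch, with invented byte offsets:

```python
def sendable_bytes(last_byte_sent: int, last_byte_acked: int, rwnd: int) -> int:
    """Bytes the sender may still put in flight under the advertised window."""
    in_flight = last_byte_sent - last_byte_acked   # unACKed ("in-flight") data
    return max(0, rwnd - in_flight)

assert sendable_bytes(last_byte_sent=5000, last_byte_acked=3000, rwnd=4096) == 2096
assert sendable_bytes(5000, 3000, rwnd=1000) == 0   # window full: stop sending
```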
Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at
Principles of Congestion Control
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
(figure: TCP connections 1 and 2 share a bottleneck router of capacity R)
TCP Congestion Control: TCP Fairness
approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss
(figure: TCP sender congestion window size (cwnd) over time: AIMD saw tooth behavior, probing for bandwidth; additively increase window size … until loss occurs, then cut window in half)
TCP Congestion Control: TCP Fairness: Why is TCP Fair? (AIMD: additive increase, multiplicative decrease)
sender limits transmission:
  LastByteSent - LastByteAcked <= cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  rate ~ cwnd / RTT bytes/sec
(figure: sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent; cwnd spans the in-flight region)
TCP Congestion Control
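The AIMD rule above is easy to simulate in cwnd units of MSS: +1 per RTT, halved on a loss. A minimal sketch; the round numbers and loss schedule are invented.

```python
def aimd(rounds, loss_rounds, cwnd=1):
    """Trace cwnd (in MSS) over RTT rounds; loss_rounds halve the window."""
    trace = []
    for rtt in range(rounds):
        if rtt in loss_rounds:
            cwnd = cwnd / 2        # multiplicative decrease on loss
        else:
            cwnd = cwnd + 1        # additive increase: +1 MSS per RTT
        trace.append(cwnd)
    return trace

print(aimd(rounds=6, loss_rounds={3}))   # [2, 3, 4, 2.0, 3.0, 4.0]
```

The trace is the sawtooth from the figure: linear climbs punctuated by halvings.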
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
(figure: Connection 1 throughput vs Connection 2 throughput, each up to full bandwidth R, with the equal bandwidth share line; alternating congestion avoidance (additive increase) and loss (decrease window by factor of 2) moves the two connections toward the equal-share point)
TCP Congestion Control: TCP Fairness: Why is TCP Fair?
Fairness and UDP:
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
TCP Congestion Control: TCP Fairness
when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received
summary: initial rate is slow, but ramps up exponentially fast
(figure: Host A sends one segment to Host B, then two, then four, one round per RTT)
TCP Congestion Control: Slow Start
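The per-ACK rule above is what makes slow start exponential: with cwnd segments in flight and one ACK per segment, cwnd grows by cwnd each RTT, i.e. it doubles. A minimal sketch in MSS units, ignoring losses and ssthresh:

```python
def slow_start(rounds, cwnd=1):
    """cwnd (in MSS) at the end of each RTT round under slow start."""
    per_rtt = []
    for _ in range(rounds):
        acks = cwnd          # one ACK arrives per segment in the window
        cwnd += acks         # +1 MSS per ACK -> cwnd doubles each RTT
        per_rtt.append(cwnd)
    return per_rtt

print(slow_start(4))   # [2, 4, 8, 16]
```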
TCP Congestion Control
Detecting and Reacting to Loss:
• loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs (TCP RENO): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation: variable ssthresh; on loss event, ssthresh is set to 1/2 of cwnd just before loss event
TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
(FSM figure: TCP congestion control)
slow start:
• entry (Λ): cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0
• new ACK: cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• Λ, cwnd > ssthresh: -> congestion avoidance
congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd), dupACKcount = 0, transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
fast recovery:
• duplicate ACK: cwnd = cwnd + MSS, transmit new segment(s) as allowed
• new ACK: cwnd = ssthresh, dupACKcount = 0 -> congestion avoidance
from slow start or congestion avoidance, dupACKcount == 3:
• ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment -> fast recovery
timeout (from any state):
• ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment -> slow start
TCP Congestion Control
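The two loss reactions in the FSM above can be sketched in isolation (cwnd and ssthresh in MSS units; a simplified model, not a full Reno implementation):

```python
def on_timeout(cwnd, ssthresh):
    """Timeout: ssthresh = cwnd/2, cwnd back to 1 MSS (enter slow start)."""
    return 1, max(cwnd // 2, 1)

def on_triple_dup_ack(cwnd, ssthresh):
    """Triple duplicate ACK: halve, then cwnd = ssthresh + 3 (fast recovery)."""
    ssthresh = max(cwnd // 2, 1)
    return ssthresh + 3, ssthresh

cwnd, ssthresh = 16, 8
cwnd, ssthresh = on_triple_dup_ack(cwnd, ssthresh)
assert (cwnd, ssthresh) == (11, 8)   # milder reaction: some data got through
cwnd, ssthresh = on_timeout(cwnd, ssthresh)
assert (cwnd, ssthresh) == (1, 5)    # harsher reaction: back to slow start
```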
• avg TCP thruput as function of window size, RTT:
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs:
  – avg window size (# in-flight bytes) is 3/4 W
  – avg thruput is 3/4 W per RTT
avg TCP thruput = (3/4) · W / RTT bytes/sec
(figure: sawtooth window oscillating between W/2 and W)
TCP Throughput
TCP over "long, fat pipes"
• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  TCP throughput = 1.22 · MSS / (RTT · √L)
  -> to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
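The slide's two numbers can be checked directly from the Mathis et al. formula, throughput = 1.22 · MSS / (RTT · √L), rearranged for W and L:

```python
MSS = 1500 * 8          # bits per segment (1500-byte segments)
RTT = 0.100             # seconds
target = 10e9           # desired throughput: 10 Gbps

# In-flight segments needed: one window of target * RTT bits per round.
W = target * RTT / MSS

# Loss rate the formula permits at that throughput: L = (1.22*MSS/(RTT*thruput))^2.
L = (1.22 * MSS / (RTT * target)) ** 2

print(round(W))         # ~83333 segments in flight
print(L)                # ~2e-10 loss rate
```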
Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008
TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer
  – SRU size is fixed
• TCP used as the data transfer protocol
Cluster-based Storage Systems
(figure: Synchronized Read: a Client connected through a Switch to Storage Servers; the client sends a request (R) to each server, and servers 1–4 each return one Server Request Unit (SRU) of the Data Block; after receiving all four, the client sends the next batch of requests)
Link idle time due to timeouts
(figure: same synchronized read through the switch, but server 4's SRU is lost; servers 1–3 deliver their SRUs, and the link is idle until server 4 experiences a timeout)
TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts
(figure: goodput collapse as the number of servers increases)
TCP data-driven loss recovery
(figure: sender transmits Seq 1, 2, 3, 4, 5; packet 2 is lost; packets 3, 4, 5 each trigger Ack 1, giving 3 duplicate ACKs for 1 (packet 2 is probably lost); sender retransmits packet 2 immediately; receiver then sends Ack 5)
In SANs: recovery in μsecs after loss
TCP timeout-driven loss recovery
(figure: sender transmits Seq 1, 2, 3, 4, 5; only packet 1 arrives, so a single Ack 1 returns; with no duplicate ACKs, the sender must wait a Retransmission Timeout (RTO) before retransmitting)
• Timeouts are expensive (msecs to recover after loss)
TCP Loss recovery comparison
(figure, left: data-driven recovery: Seq 1–5 sent, triple duplicate Ack 1, retransmit 2, Ack 5; right: timeout-driven recovery: only packet 1 delivered, sender waits out a Retransmission Timeout (RTO) before resending)
Data-driven recovery is super fast (μs) in SANs
Timeout-driven recovery is slow (ms)
TCP Throughput Collapse: Summary
• Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP):
  – Limit in-network buffer (queue length) via both in-network signaling and end-to-end TCP modifications
principles behind transport layer services:
• multiplexing, demultiplexing
• reliable data transfer
• flow control
• congestion control
instantiation, implementation in the Internet:
• UDP
• TCP
Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"
Perspective
Before Next time
• Project Proposal
  – due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
• But review chapter 4 from the book: Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
bull Congestion control
bull Data Center TCPndash Incast Problem
provide logical communication between app processes running on different hosts
transport protocols run in end systems send side breaks app
messages into segments passes to network layer
rcv side reassembles segments into messages passes to app layer
more than one transport protocol available to apps Internet TCP and UDP
applicationtransportnetworkdata linkphysical
logical end-end transport
applicationtransportnetworkdata linkphysical
Transport Layer ServicesProtocols
Transport Layer ServicesProtocols
network layer logical communication between hosts
transport layer logical communication between processes relies on enhances
network layer services
12 kids in Annrsquos house sending letters to 12 kids in Billrsquos house
bull hosts = housesbull processes = kidsbull app messages = letters in
envelopesbull transport protocol = Ann
and Bill who demux to in-house siblings
bull network-layer protocol = postal service
household analogy
Transport vs Network Layer
bull reliable in-order delivery (TCP)ndash congestion control ndash flow controlndash connection setup
bull unreliable unordered delivery UDPndash no-frills extension of
ldquobest-effortrdquo IPbull services not available
ndash delay guaranteesndash bandwidth guarantees
applicationtransportnetworkdata linkphysical
applicationtransportnetworkdata linkphysical
networkdata linkphysical
networkdata linkphysical
networkdata linkphysical
networkdata linkphysical
networkdata linkphysical
networkdata linkphysical
networkdata linkphysical
logical end-end transport
Transport Layer ServicesProtocols
TCP servicebull reliable transport between
sending and receiving process
bull flow control sender wonrsquot overwhelm receiver
bull congestion control throttle sender when network overloaded
bull does not provide timing minimum throughput guarantee security
bull connection-oriented setup required between client and server processes
UDP servicebull unreliable data transfer
between sending and receiving process
bull does not provide reliability flow control congestion control timing throughput guarantee security or connection setup
Q why bother Why is there a UDP
Transport Layer ServicesProtocols
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
bull Congestion control
bull Data Center TCPndash Incast Problem
process
socket
use header info to deliverreceived segments to correct socket
demultiplexing at receiverhandle data from multiplesockets add transport header (later used for demultiplexing)
multiplexing at sender
transport
application
physical
link
network
P2P1
transport
application
physical
link
network
P4
transport
application
physical
link
network
P3
Transport LayerSockets MultiplexingDemultiplexing
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
bull Congestion control
bull Data Center TCPndash Incast Problem
source port dest port
32 bits
applicationdata (payload)
UDP segment format
length checksum
length in bytes of UDP segment
including header
no connection establishment (which can add delay)
simple no connection state at sender receiver
small header size no congestion control UDP
can blast away as fast as desired
why is there a UDP
UDP Connectionless TransportUDP Segment Header
UDP Connectionless Transport
senderbull treat segment contents
including header fields as sequence of 16-bit integers
bull checksum addition (onersquos complement sum) of segment contents
bull sender puts checksum value into UDP checksum field
receiverbull compute checksum of received
segmentbull check if computed checksum
equals checksum field value
ndash NO - error detectedndash YES - no error detected
But maybe errors nonetheless More later hellip
Goal detect ldquoerrorsrdquo (eg flipped bits) in transmitted segment
UDP Checksum
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
bull Congestion control
bull Data Center TCPndash Incast Problem
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
important in application, transport, link layers: top-10 list of important networking topics!
Principles of Reliable Transport
send side / receive side:
• rdt_send(): called from above (e.g., by app); passed data to deliver to receiver upper layer
• udt_send(): called by rdt to transfer packet over unreliable channel to receiver
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer
Principles of Reliable Transport
TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP Reliable Transport
TCP segment structure (32 bits wide):

    source port  |  dest port
    sequence number
    acknowledgement number
    head len | not used | U A P R S F | receive window
    checksum  |  urg data pointer
    options (variable length)
    application data (variable length)

• sequence/acknowledgement numbers: counting by bytes of data (not segments!)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• URG: urgent data (generally not used); ACK: ACK # valid; PSH: push data now (generally not used); RST, SYN, FIN: connection estab (setup, teardown commands)
TCP Reliable Transport: TCP Segment Structure
sequence numbers:
– byte stream "number" of first byte in segment's data
acknowledgements:
– seq # of next byte expected from other side
– cumulative ACK
Q: how does a receiver handle out-of-order segments?
A: TCP spec doesn't say; up to implementor
[Diagram: outgoing segment from sender and incoming segment to sender, each showing source port, dest port, sequence number, acknowledgement number, rwnd, checksum, urg pointer]
sender sequence number space (window size N):
  sent & ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable
TCP Reliable Transport: TCP Sequence Numbers and ACKs
simple telnet scenario (Host A and Host B):
1. User types 'C': A sends Seq=42, ACK=79, data = 'C'
2. B ACKs receipt of 'C', echoes back 'C': Seq=79, ACK=43, data = 'C'
3. A ACKs receipt of echoed 'C': Seq=43, ACK=80
TCP Reliable Transport: TCP Sequence Numbers and ACKs
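The ACK arithmetic in this scenario is just "sequence number of the next byte expected"; a minimal sketch reproducing the numbers above:

```python
def next_ack(seq: int, data: bytes) -> int:
    """Cumulative ACK: sequence number of the next byte expected."""
    return seq + len(data)

# Reproduce the telnet scenario: A sends 'C' at Seq=42; B echoes at Seq=79.
ack_from_b = next_ack(42, b"C")   # B expects byte 43 next, so ACK=43
ack_from_a = next_ack(79, b"C")   # A expects byte 80 next, so ACK=80
```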
TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP Reliable Transport
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
client: connection state: ESTAB; connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client
  application: Socket clientSocket = new Socket("hostname", port number);
server: connection state: ESTAB; connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client
  application: Socket connectionSocket = welcomeSocket.accept();
Connection Management: TCP 3-way handshake
TCP Reliable Transport
client state: LISTEN, then SYNSENT, then ESTAB; server state: LISTEN, then SYN RCVD, then ESTAB
1. client: choose init seq num x; send TCP SYN msg: SYNbit=1, Seq=x
2. server: choose init seq num y; send TCP SYNACK msg, acking SYN: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (this segment may contain client-to-server data): ACKbit=1, ACKnum=y+1
4. server: received ACK(y) indicates client is live
TCP Reliable Transport: Connection Management: TCP 3-way handshake
[State diagram]
client: closed; Socket clientSocket = new Socket("hostname", port number): send SYN(seq=x); SYNsent; on rcv SYNACK(seq=y, ACKnum=x+1): send ACK(ACKnum=y+1); ESTAB
server: closed; Socket connectionSocket = welcomeSocket.accept(); listen; on rcv SYN(x): send SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client; SYNrcvd; on rcv ACK(ACKnum=y+1); ESTAB
TCP Reliable Transport: Connection Management: TCP 3-way handshake
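In application code the whole 3-way handshake is hidden inside connect() and accept(). A self-contained sketch on the loopback interface (port chosen by the OS), using Python sockets rather than the Java-style sockets shown on the slide:

```python
import socket
import threading

# The kernel performs the SYN / SYN-ACK / ACK exchange inside connect()
# and accept(); the application only sees the established connection.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # OS picks a free port
server.listen(1)                     # state: LISTEN
port = server.getsockname()[1]

def run_client():
    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c.connect(("127.0.0.1", port))   # sends SYN, waits for SYN-ACK, sends ACK
    c.sendall(b"hello")
    c.close()

t = threading.Thread(target=run_client)
t.start()
conn, addr = server.accept()         # returns once handshake completes
data = conn.recv(1024)
t.join()
conn.close()
server.close()
```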
client, server each close their side of connection:
• send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
TCP Reliable Transport: Connection Management: Closing connection
client state: ESTAB, FIN_WAIT_1, FIN_WAIT_2, TIMED_WAIT, CLOSED; server state: ESTAB, CLOSE_WAIT, LAST_ACK, CLOSED
1. client: clientSocket.close(); send FINbit=1, seq=x; can no longer send but can receive data (FIN_WAIT_1)
2. server: ACKbit=1, ACKnum=x+1; can still send data (CLOSE_WAIT); client waits for server close (FIN_WAIT_2)
3. server: FINbit=1, seq=y; can no longer send data (LAST_ACK)
4. client: ACKbit=1, ACKnum=y+1; timed wait for 2 * max segment lifetime (TIMED_WAIT), then CLOSED; server CLOSED on receiving the ACK
TCP Reliable Transport: Connection Management: Closing connection
TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP Reliable Transport
data rcvd from app:
• create segment with seq #; seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments:
  – update what is known to be ACKed
  – start timer if there are still unacked segments
TCP Reliable Transport
lost ACK scenario (Host A and Host B):
A sends Seq=92, 8 bytes of data; B's ACK=100 is lost (X); timeout; A resends Seq=92, 8 bytes of data; B re-ACKs 100; SendBase moves 92 to 100
premature timeout scenario:
A sends Seq=92, 8 bytes and Seq=100, 20 bytes; ACK=100 arrives (SendBase=100), but the timer for Seq=92 fires before ACK=120 is processed; A resends Seq=92, 8 bytes; B re-ACKs 120; SendBase=120
TCP Reliable Transport: TCP Retransmission Scenarios
cumulative ACK scenario:
A sends Seq=92, 8 bytes and Seq=100, 20 bytes; ACK=100 is lost (X) but ACK=120 arrives before timeout; the cumulative ACK=120 covers both segments, so A continues with Seq=120, 15 bytes of data
TCP Reliable Transport: TCP Retransmission Scenarios
TCP ACK generation [RFC 1122, RFC 2581] (Reliable Transport)

event at receiver / TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed / delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending / immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected / immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap / immediately send ACK, provided that segment starts at lower end of gap
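The table can be sketched as a receiver-side decision function. This is a simplified sketch: segment lengths and exact gap bookkeeping are abstracted away, and the action strings are made up for illustration:

```python
def ack_action(seg_seq: int, expected_seq: int, ack_pending: bool) -> str:
    """Simplified receiver ACK policy from the RFC 1122/2581 table."""
    if seg_seq == expected_seq:
        if ack_pending:                 # a second in-order segment arrived
            return "send one cumulative ACK now"
        return "delay ACK up to 500 ms" # wait for a possible next segment
    if seg_seq > expected_seq:          # gap detected
        return "send duplicate ACK for expected byte"
    # seg_seq below expected: old or gap-filling data in this toy model
    return "send ACK now"

first = ack_action(100, expected_seq=100, ack_pending=False)
second = ack_action(108, expected_seq=108, ack_pending=True)
out_of_order = ack_action(150, expected_seq=108, ack_pending=False)
```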
time-out period often relatively long: long delay before resending lost packet
detect lost segments via duplicate ACKs:
• sender often sends many segments back-to-back
• if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
– likely that unacked segment was lost, so don't wait for timeout
TCP Reliable Transport: TCP Fast Retransmit
fast retransmit scenario: after sender receipt of triple duplicate ACK:
A sends Seq=92, 8 bytes and Seq=100, 20 bytes; the Seq=100 segment is lost (X); each later arrival makes B send another ACK=100; on the third duplicate ACK=100, A resends Seq=100, 20 bytes before the timeout expires
TCP Reliable Transport: TCP Fast Retransmit
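A minimal sketch of the sender-side trigger: count duplicate ACKs and retransmit on the third duplicate, without waiting for the timer:

```python
def fast_retransmit(acks):
    """Return seq numbers retransmitted early: after three duplicate ACKs
    for the same byte, resend the segment starting at that seq number."""
    retransmitted = []
    last_ack, dup_count = None, 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == 3:          # triple duplicate ACK
                retransmitted.append(ack)
        else:
            last_ack, dup_count = ack, 0
    return retransmitted
```

For the scenario above, the ACK stream 100, 100, 100, 100 (one original plus three duplicates) triggers exactly one early retransmission of the segment starting at byte 100.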
Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary; want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT
TCP Reliable Transport: TCP Round-trip time and timeouts
[Plot: RTT (milliseconds) vs time (seconds), gaia.cs.umass.edu to fantasia.eurecom.fr, showing SampleRTT and EstimatedRTT]
EstimatedRTT = (1 - α) · EstimatedRTT + α · SampleRTT
• exponential weighted moving average: influence of past sample decreases exponentially fast
• typical value: α = 0.125
TCP Reliable Transport: TCP Round-trip time and timeouts
• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  DevRTT = (1 - β) · DevRTT + β · |SampleRTT - EstimatedRTT|
  (typically, β = 0.25)
TimeoutInterval = EstimatedRTT + 4 · DevRTT
  (estimated RTT plus "safety margin")
TCP Reliable Transport: TCP Round-trip time and timeouts
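Putting the three formulas together, one EWMA update step per SampleRTT. The update order here follows the slide formulas; real implementations (RFC 6298) differ in some details, and the sample values below are made up:

```python
ALPHA, BETA = 0.125, 0.25   # typical values from the slides

def update_rtt(est, dev, sample):
    """One EWMA step for EstimatedRTT and DevRTT; returns new timeout."""
    est = (1 - ALPHA) * est + ALPHA * sample
    dev = (1 - BETA) * dev + BETA * abs(sample - est)
    return est, dev, est + 4 * dev      # TimeoutInterval

est, dev = 0.100, 0.0                   # seconds; arbitrary starting state
for sample in [0.106, 0.120, 0.140, 0.090]:
    est, dev, timeout = update_rtt(est, dev, sample)
```

Because DevRTT grows with RTT variation, a jittery path automatically gets a larger safety margin above EstimatedRTT.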
TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP Reliable Transport
[Diagram: receiver protocol stack: application process above TCP socket receiver buffers, TCP code, IP code; data arrives from sender]
application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
TCP Reliable Transport: Flow Control
receiver-side buffering: RcvBuffer holds buffered data; rwnd is the free buffer space; TCP segment payloads are delivered to the application process
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
TCP Reliable Transport: Flow Control
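The sender-side rule reduces to simple arithmetic: keep in-flight bytes at or below the advertised rwnd. A sketch (variable names are illustrative):

```python
def usable_window(rwnd: int, last_byte_sent: int, last_byte_acked: int) -> int:
    """Bytes the sender may still put in flight: unacked ("in-flight")
    data is capped at the receiver's advertised window rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)

# Receiver advertised rwnd = 4096 bytes; 1000 bytes already in flight.
room = usable_window(4096, last_byte_sent=1000, last_byte_acked=0)
```

When in-flight data reaches rwnd (the second case in the test below), the sender must pause until ACKs free up space or the receiver advertises a larger window.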
Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing / Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at
Principles of Congestion Control
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Diagram: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]
TCP Congestion Control: TCP Fairness
approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs:
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss
[Plot: TCP sender congestion window size (cwnd) over time: AIMD saw tooth behavior, probing for bandwidth; additively increase window size … until loss occurs (then cut window in half)]
TCP Congestion Control: TCP Fairness: Why is TCP Fair? AIMD: additive increase, multiplicative decrease
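The sawtooth can be reproduced with a toy per-RTT simulation (the loss timing here is chosen arbitrarily; in reality loss depends on the bottleneck queue):

```python
def aimd(rtts: int, loss_at: set, cwnd: float = 1.0, mss: float = 1.0):
    """Toy AIMD trace: +1 MSS per loss-free RTT, halve cwnd on loss."""
    trace = []
    for t in range(rtts):
        cwnd = cwnd / 2 if t in loss_at else cwnd + mss
        trace.append(cwnd)
    return trace

trace = aimd(10, loss_at={5})   # one loss event at RTT 5
```

Plotting such a trace gives exactly the saw tooth on the slide: a linear ramp, a halving at each loss, then another ramp.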
sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, a function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
• rate ≈ cwnd / RTT bytes/sec
[Diagram: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; window of cwnd bytes]
TCP Congestion Control
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Plot: Connection 1 throughput vs Connection 2 throughput, both axes up to R; repeated congestion avoidance (additive increase) and loss (decrease window by factor of 2) moves the operating point toward the equal bandwidth share line]
TCP Congestion Control: TCP Fairness: Why is TCP Fair?
Fairness and UDP: multimedia apps often do not use TCP
• do not want rate throttled by congestion control
• instead use UDP: send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
TCP Congestion Control: TCP Fairness
when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received
summary: initial rate is slow, but ramps up exponentially fast
[Diagram: Host A sends one segment, then two segments, then four segments, one round per RTT]
TCP Congestion Control: Slow Start
TCP Congestion Control: Detecting and Reacting to Loss
• loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs (TCP RENO): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation: variable ssthresh; on loss event, ssthresh is set to 1/2 of cwnd just before loss event
TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
TCP Congestion Control FSM (slow start / congestion avoidance / fast recovery):
slow start:
• init: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0
• new ACK: cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment; go to fast recovery
congestion avoidance:
• new ACK: cwnd = cwnd + MSS * (MSS/cwnd), dupACKcount = 0, transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment; go to fast recovery
fast recovery:
• duplicate ACK: cwnd = cwnd + MSS, transmit new segment(s) as allowed
• new ACK: cwnd = ssthresh, dupACKcount = 0; go to congestion avoidance
any state:
• timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment; go to slow start
TCP Congestion Control
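A toy per-RTT trace of Tahoe-style behaviour, simplified from the FSM above: cwnd doubles once per RTT in slow start rather than per ACK, and only timeout-style loss is modelled (ssthresh = cwnd/2, cwnd restarts at 1):

```python
def tcp_tahoe(rounds: int, losses: set, ssthresh: float = 8.0):
    """Per-RTT cwnd evolution, in MSS: slow start doubles cwnd up to
    ssthresh, congestion avoidance adds 1 per RTT; on loss, ssthresh
    becomes cwnd/2 and cwnd restarts at 1 (Tahoe behaviour)."""
    cwnd, trace = 1.0, []
    for t in range(rounds):
        if t in losses:
            ssthresh = cwnd / 2
            cwnd = 1.0
        elif cwnd < ssthresh:
            cwnd = min(2 * cwnd, ssthresh)   # slow start
        else:
            cwnd += 1                        # congestion avoidance
        trace.append(cwnd)
    return trace

trace = tcp_tahoe(10, losses={6})   # one timeout at RTT 6 (arbitrary)
```

The trace shows the three phases in order: exponential growth to ssthresh, linear growth, then a collapse to 1 MSS and a fresh slow start toward the halved threshold.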
• avg TCP throughput as function of window size, RTT
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg window size (# in-flight bytes) is 3/4 W
  – avg throughput is 3/4 W per RTT: avg TCP throughput = (3/4) · W / RTT bytes/sec
[Plot: window oscillating between W/2 and W]
TCP Throughput
TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  TCP throughput = 1.22 · MSS / (RTT · √L)
• to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
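The numbers on this slide can be checked by inverting the Mathis formula (throughput in bits/sec, MSS in bits):

```python
from math import sqrt

MSS_BITS = 1500 * 8     # 1500-byte segments, in bits
RTT = 0.100             # 100 ms round-trip time, in seconds
TARGET = 10e9           # desired throughput: 10 Gbps

# In-flight window needed to fill the pipe: bandwidth-delay product / MSS.
window_segments = TARGET * RTT / MSS_BITS            # ~83,333 segments

# Mathis et al. 1997: throughput ~ 1.22 * MSS / (RTT * sqrt(L)).
# Invert it to find the tolerable segment loss probability L.
L = (1.22 * MSS_BITS / (RTT * TARGET)) ** 2          # ~2e-10

# Sanity check: plugging L back in recovers the target throughput.
throughput_check = 1.22 * MSS_BITS / (RTT * sqrt(L))
```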
Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing / Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008
TCP Throughput Collapse: What happens when TCP is "too friendly"?
E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer
  – SRU size is fixed
• TCP used as the data transfer protocol
Cluster-based Storage Systems
[Diagram: Client connected through a Switch to Storage Servers 1-4; a synchronized read issues a request R to each server; each server returns its Server Request Unit (SRU); together the SRUs 1, 2, 3, 4 form one Data Block; the client then sends the next batch of requests]
Link idle time due to timeouts
[Diagram: the same synchronized read, but server 4's SRU is lost; the client has received SRUs 1-3 and the link is idle until server 4 experiences a timeout]
TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts
[Plot: goodput collapses as the number of servers increases]
TCP data-driven loss recovery
[Diagram: Sender transmits Seq 1-5; packet 2 is lost; Receiver sends Ack 1 for each of packets 3, 4, 5: 3 duplicate ACKs for 1 (packet 2 is probably lost); Sender retransmits packet 2 immediately; Receiver then sends Ack 5]
In SANs, data-driven recovery completes in microseconds after loss.
TCP timeout-driven loss recovery
[Diagram: Sender transmits Seq 1-5; packets 2-5 are lost; only Ack 1 returns; Sender must wait a full Retransmission Timeout (RTO) before resending]
• Timeouts are expensive (msecs to recover after loss)
TCP Loss recovery comparison
[Diagrams: data-driven recovery (3 duplicate Ack 1s trigger immediate retransmit of packet 2, then Ack 5) vs timeout-driven recovery (wait for the RTO)]
• Timeout-driven recovery is slow (ms)
• Data-driven recovery is super fast (us) in SANs
TCP Throughput Collapse: Summary
• Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP):
  – Limited in-network buffer (queue length) via both in-network signaling and end-to-end TCP modifications
principles behind transport layer services:
• multiplexing, demultiplexing
• reliable data transfer
• flow control
• congestion control
instantiation, implementation in the Internet:
• UDP
• TCP
Next time:
• Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"
Perspective
Before Next time
• Project Proposal
  – due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
• But review chapter 4 from the book, Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
bull Congestion control
bull Data Center TCPndash Incast Problem
process
socket
use header info to deliverreceived segments to correct socket
demultiplexing at receiverhandle data from multiplesockets add transport header (later used for demultiplexing)
multiplexing at sender
transport
application
physical
link
network
P2P1
transport
application
physical
link
network
P4
transport
application
physical
link
network
P3
Transport LayerSockets MultiplexingDemultiplexing
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
bull Congestion control
bull Data Center TCPndash Incast Problem
source port dest port
32 bits
applicationdata (payload)
UDP segment format
length checksum
length in bytes of UDP segment
including header
no connection establishment (which can add delay)
simple no connection state at sender receiver
small header size no congestion control UDP
can blast away as fast as desired
why is there a UDP
UDP Connectionless TransportUDP Segment Header
UDP Connectionless Transport
senderbull treat segment contents
including header fields as sequence of 16-bit integers
bull checksum addition (onersquos complement sum) of segment contents
bull sender puts checksum value into UDP checksum field
receiverbull compute checksum of received
segmentbull check if computed checksum
equals checksum field value
ndash NO - error detectedndash YES - no error detected
But maybe errors nonetheless More later hellip
Goal detect ldquoerrorsrdquo (eg flipped bits) in transmitted segment
UDP Checksum
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
bull Congestion control
bull Data Center TCPndash Incast Problem
important in application transport link layers top-10 list of important networking topics
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
Principles of Reliable Transport
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
important in application transport link layers top-10 list of important networking topics
Principles of Reliable Transport
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
important in application transport link layers top-10 list of important networking topics
Principles of Reliable Transport
sendside
receiveside
rdt_send() called from above (eg by app) Passed data to deliver to receiver upper layer
udt_send() called by rdtto transfer packet over unreliable channel to receiver
rdt_rcv() called when packet arrives on rcv-side of channel
deliver_data() called by rdt to deliver data to upper
Principles of Reliable Transport
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
source port dest port
32 bits
applicationdata (variable length)
sequence number
acknowledgement number
receive window
Urg data pointerchecksum
FSRPAUheadlen
notused
options (variable length)
URG urgent data (generally not used)
ACK ACK valid
PSH push data now(generally not used)
RST SYN FINconnection estab(setup teardown
commands)
bytes rcvr willingto accept
countingby bytes of data(not segments)
Internetchecksum
(as in UDP)
TCP Reliable TransportTCP Segment Structure
sequence numbers
ndashbyte stream ldquonumberrdquo of first byte in segmentrsquos data
acknowledgements
ndashseq of next byte expected from other side
ndashcumulative ACKQ how receiver handles out-of-order segments
ndashA TCP spec doesnrsquot say - up to implementor
source port dest port
sequence number
acknowledgement number
checksum
rwnd
urg pointer
incoming segment to sender
A
sent ACKed
sent not-yet ACKed(ldquoin-flightrdquo)
usablebut not yet sent
not usable
window size N
sender sequence number space
source port dest port
sequence number
acknowledgement number
checksum
rwnd
urg pointer
outgoing segment from sender
TCP Reliable TransportTCP Sequence numbers and Acks
Usertypes
lsquoCrsquo
host ACKsreceipt
of echoedlsquoCrsquo
host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo
simple telnet scenario
Host BHost A
Seq=42 ACK=79 data = lsquoCrsquo
Seq=79 ACK=43 data = lsquoCrsquo
Seq=43 ACK=80
TCP Reliable TransportTCP Sequence numbers and Acks
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing
to establish connection)bull agree on connection parameters
connection state ESTABconnection variables
seq client-to-server server-to-clientrcvBuffer size at serverclient
application
network
connection state ESTABconnection Variables
seq client-to-server server-to-clientrcvBuffer size at serverclient
application
network
Socket clientSocket = newSocket(hostnameport
number)
Socket connectionSocket = welcomeSocketaccept()
Connection Management TCP 3-way handshake
TCP Reliable Transport
SYNbit=1 Seq=x
choose init seq num xsend TCP SYN msg
ESTAB
SYNbit=1 Seq=yACKbit=1 ACKnum=x+1
choose init seq num ysend TCP SYNACKmsg acking SYN
ACKbit=1 ACKnum=y+1
received SYNACK(x) indicates server is livesend ACK for SYNACK
this segment may contain client-to-server data
received ACK(y) indicates client is live
SYNSENT
ESTAB
SYN RCVD
client state
LISTEN
server state
LISTEN
TCP Reliable TransportConnection Management TCP 3-way handshake
closed
L
listen
SYNrcvd
SYNsent
ESTAB
Socket clientSocket = newSocket(hostnameport
number)
SYN(seq=x)
Socket connectionSocket = welcomeSocketaccept()
SYN(x)
SYNACK(seq=yACKnum=x+1)create new socket for communication back to client
SYNACK(seq=yACKnum=x+1)
ACK(ACKnum=y+1)ACK(ACKnum=y+1)
L
TCP Reliable TransportConnection Management TCP 3-way handshake
client server each close their side of connection send TCP segment with FIN bit = 1
respond to received FIN with ACK on receiving FIN ACK can be combined with
own FINsimultaneous FIN exchanges can be
handled
TCP Reliable TransportConnection Management Closing connection
FIN_WAIT_2
CLOSE_WAIT
FINbit=1 seq=y
ACKbit=1 ACKnum=y+1
ACKbit=1 ACKnum=x+1 wait for server
close
can stillsend data
can no longersend data
LAST_ACK
CLOSED
TIMED_WAIT
timed wait for 2max
segment lifetime
CLOSED
FIN_WAIT_1 FINbit=1 seq=xcan no longersend but can receive data
clientSocketclose()
client state server state
ESTABESTAB
TCP Reliable TransportConnection Management Closing connection
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
TCP sender events:
• data rcvd from app: create segment with seq #; seq # is byte-stream number of first data byte in segment; start timer if not already running (think of timer as for oldest unacked segment; expiration interval: TimeOutInterval)
• timeout: retransmit segment that caused timeout; restart timer
• ack rcvd: if ack acknowledges previously unacked segments, update what is known to be ACKed; start timer if there are still unacked segments

TCP Reliable Transport
TCP Retransmission Scenarios

Lost ACK scenario: Host A sends Seq=92, 8 bytes of data; Host B's ACK=100 is lost (X). A's timer expires, so A retransmits Seq=92, 8 bytes of data, and B re-sends ACK=100. (SendBase=92 until the ACK arrives, then SendBase=100.)

Premature timeout: Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); B returns ACK=100 and ACK=120, but A's timer for Seq=92 expires first, so A retransmits Seq=92, 8 bytes of data; B replies with cumulative ACK=120. (SendBase advances 100 → 120.)

TCP Reliable Transport: TCP Retransmission Scenarios
Cumulative ACK scenario: Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); ACK=100 is lost (X), but ACK=120 arrives before the timeout, cumulatively acknowledging both segments, so A proceeds with Seq=120, 15 bytes of data — no retransmission needed.

TCP Reliable Transport: TCP Retransmission Scenarios
TCP ACK generation [RFC 1122, 2581]

event at receiver → TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq # (gap detected) → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap

Reliable Transport
TCP Fast Retransmit

• time-out period often relatively long: long delay before resending lost packet
• detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if a segment is lost, there will likely be many duplicate ACKs
• TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq # — likely that the unacked segment was lost, so don't wait for timeout

TCP Reliable Transport: TCP Fast Retransmit
Fast retransmit in action: Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); the Seq=100 segment is lost (X). Host B sends ACK=100 for the first segment, then three duplicate ACK=100s as later segments arrive. On receipt of the triple duplicate ACK, A retransmits Seq=100, 20 bytes of data — before its timeout expires.

TCP Reliable Transport: TCP Fast Retransmit
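The triple-duplicate-ACK trigger can be sketched in a few lines. This is a minimal illustration of the rule on the slide, not a real TCP implementation; the function name and input format are invented.

```python
def fast_retransmit_sender(acks):
    """Given cumulative ACK numbers in arrival order, return the seq #s
    the sender would fast-retransmit (on the 3rd duplicate ACK)."""
    dup_count = 0
    last_ack = None
    retransmitted = []
    for a in acks:
        if a == last_ack:
            dup_count += 1
            if dup_count == 3:           # triple duplicate ACK
                retransmitted.append(a)  # resend segment starting at this byte
        else:
            last_ack, dup_count = a, 0
    return retransmitted

# scenario from the slide: ACK=100, then three duplicates -> retransmit 100
print(fast_retransmit_sender([100, 100, 100, 100]))  # [100]
```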
TCP Round-trip time and timeouts

Q: how to set TCP timeout value?
• longer than RTT — but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt (ignore retransmissions)
• SampleRTT will vary; want estimated RTT "smoother": average several recent measurements, not just current SampleRTT

TCP Reliable Transport: TCP Round-trip time and timeouts
[Plot: RTT (milliseconds) vs. time (seconds), gaia.cs.umass.edu to fantasia.eurecom.fr — sampleRTT fluctuates between roughly 100 and 350 ms; EstimatedRTT tracks it smoothly.]

EstimatedRTT = (1 - α) · EstimatedRTT + α · SampleRTT

• exponentially weighted moving average: influence of past sample decreases exponentially fast
• typical value: α = 0.125

TCP Reliable Transport: TCP Round-trip time and timeouts
• timeout interval: EstimatedRTT plus "safety margin" — large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

DevRTT = (1 - β) · DevRTT + β · |SampleRTT - EstimatedRTT|    (typically β = 0.25)

TimeoutInterval = EstimatedRTT + 4 · DevRTT
                  (estimated RTT + "safety margin")

TCP Reliable Transport: TCP Round-trip time and timeouts
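The two EWMA updates and the timeout rule fit in one small function. A minimal sketch: the α and β values are the typical ones from the slide, while the starting EstimatedRTT/DevRTT and the sample sequence are made up for illustration.

```python
ALPHA, BETA = 0.125, 0.25   # typical values from the slide

def update_rto(estimated_rtt, dev_rtt, sample_rtt):
    """One update of EstimatedRTT and DevRTT for a new SampleRTT,
    returning the resulting TimeoutInterval as well (all in ms)."""
    estimated_rtt = (1 - ALPHA) * estimated_rtt + ALPHA * sample_rtt
    dev_rtt = (1 - BETA) * dev_rtt + BETA * abs(sample_rtt - estimated_rtt)
    return estimated_rtt, dev_rtt, estimated_rtt + 4 * dev_rtt

est, dev = 100.0, 10.0      # illustrative starting values, in ms
for sample in [120, 110, 180, 105]:
    est, dev, rto = update_rto(est, dev, sample)
    print(f"sample={sample}  EstimatedRTT={est:.1f}  TimeoutInterval={rto:.1f}")
```

Note how the spike to 180 ms inflates DevRTT, and hence the safety margin, much faster than it moves EstimatedRTT.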
Flow Control

receiver protocol stack: from sender → IP code → TCP code → TCP socket receiver buffers → application process (OS vs. application boundary)

• application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)
• flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

receiver-side buffering: RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads arrive from the network, and data drains to the application process
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

TCP Reliable Transport: Flow Control
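The sender-side rwnd rule above is a one-line check. A minimal sketch under invented names: the function and its byte-counter arguments are illustrative, not a real TCP API.

```python
def can_send(last_byte_sent, last_byte_acked, nbytes, rwnd):
    """True iff sending nbytes more keeps unacked ("in-flight") data
    within the receiver's advertised window rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd

rwnd = 4096                            # advertised by the receiver
print(can_send(1000, 0, 3000, rwnd))   # True: 1000 + 3000 <= 4096
print(can_send(1000, 0, 3500, rwnd))   # False: would overflow receiver buffer
```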
Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem
Principles of Congestion Control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

Principles of Congestion Control
TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Diagram: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]

TCP Congestion Control: TCP Fairness
AIMD: additive increase, multiplicative decrease

approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss

[Figure: TCP sender congestion window size (cwnd) vs. time — the AIMD "saw tooth" behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half).]

TCP Congestion Control / TCP Fairness: Why is TCP Fair?
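The sawtooth is easy to reproduce numerically. A minimal sketch: the loss rounds are scripted (not simulated from a queue), purely to show the window shape; all names are illustrative and cwnd is in units of MSS.

```python
MSS = 1

def aimd(rounds, loss_rounds, cwnd=1):
    """Trace cwnd over `rounds` RTTs: +1 MSS per RTT, halve on loss."""
    trace = []
    for r in range(rounds):
        if r in loss_rounds:
            cwnd = max(cwnd / 2, 1)   # multiplicative decrease
        else:
            cwnd += MSS               # additive increase
        trace.append(cwnd)
    return trace

# one scripted loss at round 4 produces the saw tooth:
print(aimd(8, loss_rounds={4}))   # [2, 3, 4, 5, 2.5, 3.5, 4.5, 5.5]
```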
TCP Congestion Control

• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is a dynamic function of perceived network congestion
• sender sequence number space: [last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent | …], with cwnd spanning the in-flight region

TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes:

rate ≈ cwnd / RTT   bytes/sec
Why is TCP fair? Consider two competing sessions:
• additive increase gives slope of 1 as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 1 throughput vs. Connection 2 throughput, each bounded by R, with the equal-bandwidth-share line. Repeated rounds of congestion avoidance (additive increase) followed by loss (decrease window by factor of 2) move the operating point toward the equal-share line.]

TCP Congestion Control / TCP Fairness: Why is TCP Fair?
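The convergence argument above can be checked numerically. A minimal sketch under simplifying assumptions not on the slide: both flows see loss at the same instant (whenever their combined rate exceeds R), losses are synchronized, and rates are in arbitrary units.

```python
def aimd_pair(x1, x2, R=100, steps=200):
    """Two AIMD flows sharing a bottleneck of capacity R:
    both add 1 while under capacity, both halve when over it."""
    for _ in range(steps):
        if x1 + x2 > R:
            x1, x2 = x1 / 2, x2 / 2    # synchronized multiplicative decrease
        else:
            x1, x2 = x1 + 1, x2 + 1    # additive increase
    return x1, x2

x1, x2 = aimd_pair(80.0, 10.0)   # start far from the equal-share line
print(abs(x1 - x2))              # gap shrinks toward 0
```

Each halving halves the gap between the two rates while additive increase leaves it unchanged, which is exactly the geometric argument behind the equal-share figure.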
Fairness and UDP: multimedia apps often do not use TCP — they do not want rate throttled by congestion control. Instead they use UDP: send audio/video at constant rate, tolerate packet loss.

Fairness and parallel TCP connections: an application can open multiple parallel connections between two hosts; web browsers do this. E.g., link of rate R with 9 existing connections:
• new app asks for 1 TCP, gets rate R/10
• new app asks for 11 TCPs, gets R/2

TCP Congestion Control: TCP Fairness
Slow Start

when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received

summary: initial rate is slow, but ramps up exponentially fast

[Diagram: Host A sends one segment, waits one RTT for Host B's ACK, then sends two segments, then four segments, …]

TCP Congestion Control: Slow Start
TCP Congestion Control: Detecting and Reacting to Loss

• loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs (TCP RENO): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

Switching from Slow Start to Congestion Avoidance (CA)

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout

Implementation: variable ssthresh; on loss event, ssthresh is set to 1/2 of cwnd just before the loss event
TCP congestion control FSM (slow start / congestion avoidance / fast recovery):

slow start:
• entry (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh (Λ): → congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; → fast recovery

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; → slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; → fast recovery

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s) as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; → congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; → slow start

TCP Congestion Control
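The slow-start/congestion-avoidance switch and the two loss reactions can be sketched per RTT. A minimal illustration in units of MSS: it collapses fast recovery into a simple Reno-style halving (the per-dup-ACK inflation by ssthresh + 3 is omitted), and all names are invented.

```python
def on_rtt(cwnd, ssthresh):
    """Window growth for one RTT of successful ACKs."""
    if cwnd < ssthresh:
        return cwnd * 2          # slow start: double every RTT
    return cwnd + 1              # congestion avoidance: +1 MSS per RTT

def on_loss(cwnd, kind):
    """Reno-style reaction: ssthresh remembers half the window."""
    ssthresh = cwnd // 2
    cwnd = 1 if kind == "timeout" else ssthresh   # 3 dup ACKs -> half
    return cwnd, ssthresh

cwnd, ssthresh = 1, 8
trace = []
for _ in range(5):
    trace.append(cwnd)
    cwnd = on_rtt(cwnd, ssthresh)
print(trace)                     # [1, 2, 4, 8, 9]: exponential, then linear
print(on_loss(16, "timeout"))    # (1, 8)
print(on_loss(16, "3dupacks"))   # (8, 8)
```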
TCP Throughput

• avg TCP throughput as function of window size, RTT (ignore slow start, assume always data to send)
• W: window size (measured in bytes) where loss occurs
  – avg window size (# in-flight bytes) is 3/4 W
  – avg throughput is 3/4 W per RTT:

avg TCP throughput = (3/4) · W / RTT   bytes/sec

[Figure: window oscillates between W/2 and W.]
TCP over "long, fat pipes"

• example: 1500-byte segments, 100ms RTT; want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability L [Mathis 1997]:

TCP throughput = 1.22 · MSS / (RTT · √L)

→ to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰ — a very small loss rate!
• new versions of TCP needed for high-speed operation
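Plugging the slide's numbers into the [Mathis 1997] formula reproduces the L ≈ 2·10⁻¹⁰ claim. A quick check (function names are illustrative; MSS is taken in bits so throughput comes out in bits/sec):

```python
import math

MSS = 1500 * 8          # segment size in bits
RTT = 0.100             # seconds

def mathis_throughput(loss_rate):
    """Average TCP throughput (bits/sec) from the Mathis formula."""
    return 1.22 * MSS / (RTT * math.sqrt(loss_rate))

def required_loss_rate(target_bps):
    """Invert the formula: loss rate needed to sustain target throughput."""
    return (1.22 * MSS / (RTT * target_bps)) ** 2

L = required_loss_rate(10e9)    # 10 Gbps, 1500-byte segments, 100 ms RTT
print(f"L = {L:.1e}")           # ~2e-10, matching the slide
```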
Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem
TCP Throughput Collapse

What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer
  – SRU size is fixed
• TCP used as the data transfer protocol

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems," A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008.
Cluster-based Storage Systems

[Diagram: a client connects through a switch to storage servers 1–4. A data block is striped across the servers, each holding one Server Request Unit (SRU). In a synchronized read, the client issues requests (R) to servers 1, 2, 3, 4 in parallel, and sends the next batch of requests only after all four responses arrive.]

Link idle time due to timeouts

[Diagram: same synchronized read, but server 4's response is lost; the client's link is idle until server 4 experiences a timeout, stalling the entire batch.]
TCP Throughput Collapse: Incast

• [Nagle04] called this Incast
• cause of throughput collapse: TCP timeouts

[Graph: goodput collapses as the number of servers increases.]
TCP data-driven loss recovery

[Diagram: sender transmits segments 1–5; segment 2 is lost. The receiver returns Ack 1 repeatedly — 3 duplicate ACKs for 1 (packet 2 is probably lost) — so the sender retransmits packet 2 immediately, and the receiver then sends Ack 5. In SANs, recovery in µsecs after loss.]

TCP timeout-driven loss recovery

[Diagram: sender transmits segments 1–5, but too few duplicate ACKs arrive to trigger fast retransmit, so the sender must wait for the Retransmission Timeout (RTO) before retransmitting. Timeouts are expensive: msecs to recover after loss.]
TCP loss recovery comparison

[Side-by-side diagrams of the two scenarios above.]
• Data-driven recovery is super fast (µs) in SANs
• Timeout-driven recovery is slow (ms)
TCP Throughput Collapse: Summary

• Synchronized reads and TCP timeouts cause TCP throughput collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP): limit in-network buffer (queue length) via both in-network signaling and end-to-end TCP modifications
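The slides only name DCTCP's idea. As background — a sketch simplified from the published DCTCP algorithm, not from these slides — switches ECN-mark packets once their queue exceeds a threshold, and the sender keeps a running estimate α of the fraction of marked ACKs, cutting cwnd in proportion to congestion instead of always halving. The gain g = 1/16 follows the paper's default; everything else here is illustrative.

```python
G = 1 / 16   # gain for the moving average of the marked fraction

def dctcp_update(cwnd, alpha, frac_marked):
    """One per-window DCTCP update: refresh alpha from the fraction of
    ECN-marked ACKs, then cut cwnd by alpha/2 if any marks were seen."""
    alpha = (1 - G) * alpha + G * frac_marked
    if frac_marked > 0:
        cwnd = cwnd * (1 - alpha / 2)   # mild congestion -> mild cut
    return cwnd, alpha

# heavy marking cuts deeply; no marking leaves cwnd untouched
print(dctcp_update(100.0, 0.0, 1.0))   # small first cut: alpha just ramped up
print(dctcp_update(100.0, 0.5, 0.0))   # no marks: cwnd stays at 100.0
```

This proportional cut is what lets DCTCP keep switch queues short without sacrificing throughput the way always-halving TCP would.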
Perspective

• principles behind transport layer services: multiplexing/demultiplexing, reliable data transfer, flow control, congestion control
• instantiation, implementation in the Internet: UDP, TCP

Next time:
• Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"

Before Next time

• Project Proposal
  – due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single-threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
  – But review chapter 4 from the book, Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule
Transport Layer ServicesProtocols
network layer logical communication between hosts
transport layer logical communication between processes relies on enhances
network layer services
12 kids in Annrsquos house sending letters to 12 kids in Billrsquos house
bull hosts = housesbull processes = kidsbull app messages = letters in
envelopesbull transport protocol = Ann
and Bill who demux to in-house siblings
bull network-layer protocol = postal service
household analogy
Transport vs Network Layer
bull reliable in-order delivery (TCP)ndash congestion control ndash flow controlndash connection setup
bull unreliable unordered delivery UDPndash no-frills extension of
ldquobest-effortrdquo IPbull services not available
ndash delay guaranteesndash bandwidth guarantees
applicationtransportnetworkdata linkphysical
applicationtransportnetworkdata linkphysical
networkdata linkphysical
networkdata linkphysical
networkdata linkphysical
networkdata linkphysical
networkdata linkphysical
networkdata linkphysical
networkdata linkphysical
logical end-end transport
Transport Layer ServicesProtocols
TCP servicebull reliable transport between
sending and receiving process
bull flow control sender wonrsquot overwhelm receiver
bull congestion control throttle sender when network overloaded
bull does not provide timing minimum throughput guarantee security
bull connection-oriented setup required between client and server processes
UDP servicebull unreliable data transfer
between sending and receiving process
bull does not provide reliability flow control congestion control timing throughput guarantee security or connection setup
Q why bother Why is there a UDP
Transport Layer ServicesProtocols
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
bull Congestion control
bull Data Center TCPndash Incast Problem
process
socket
use header info to deliverreceived segments to correct socket
demultiplexing at receiverhandle data from multiplesockets add transport header (later used for demultiplexing)
multiplexing at sender
transport
application
physical
link
network
P2P1
transport
application
physical
link
network
P4
transport
application
physical
link
network
P3
Transport LayerSockets MultiplexingDemultiplexing
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
bull Congestion control
bull Data Center TCPndash Incast Problem
source port dest port
32 bits
applicationdata (payload)
UDP segment format
length checksum
length in bytes of UDP segment
including header
no connection establishment (which can add delay)
simple no connection state at sender receiver
small header size no congestion control UDP
can blast away as fast as desired
why is there a UDP
UDP Connectionless TransportUDP Segment Header
UDP Connectionless Transport
senderbull treat segment contents
including header fields as sequence of 16-bit integers
bull checksum addition (onersquos complement sum) of segment contents
bull sender puts checksum value into UDP checksum field
receiverbull compute checksum of received
segmentbull check if computed checksum
equals checksum field value
ndash NO - error detectedndash YES - no error detected
But maybe errors nonetheless More later hellip
Goal detect ldquoerrorsrdquo (eg flipped bits) in transmitted segment
UDP Checksum
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
bull Congestion control
bull Data Center TCPndash Incast Problem
important in application transport link layers top-10 list of important networking topics
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
Principles of Reliable Transport
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
important in application transport link layers top-10 list of important networking topics
Principles of Reliable Transport
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
important in application transport link layers top-10 list of important networking topics
Principles of Reliable Transport
sendside
receiveside
rdt_send() called from above (eg by app) Passed data to deliver to receiver upper layer
udt_send() called by rdtto transfer packet over unreliable channel to receiver
rdt_rcv() called when packet arrives on rcv-side of channel
deliver_data() called by rdt to deliver data to upper
Principles of Reliable Transport
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
source port dest port
32 bits
applicationdata (variable length)
sequence number
acknowledgement number
receive window
Urg data pointerchecksum
FSRPAUheadlen
notused
options (variable length)
URG urgent data (generally not used)
ACK ACK valid
PSH push data now(generally not used)
RST SYN FINconnection estab(setup teardown
commands)
bytes rcvr willingto accept
countingby bytes of data(not segments)
Internetchecksum
(as in UDP)
TCP Reliable TransportTCP Segment Structure
sequence numbers
ndashbyte stream ldquonumberrdquo of first byte in segmentrsquos data
acknowledgements
ndashseq of next byte expected from other side
ndashcumulative ACKQ how receiver handles out-of-order segments
ndashA TCP spec doesnrsquot say - up to implementor
source port dest port
sequence number
acknowledgement number
checksum
rwnd
urg pointer
incoming segment to sender
A
sent ACKed
sent not-yet ACKed(ldquoin-flightrdquo)
usablebut not yet sent
not usable
window size N
sender sequence number space
source port dest port
sequence number
acknowledgement number
checksum
rwnd
urg pointer
outgoing segment from sender
TCP Reliable TransportTCP Sequence numbers and Acks
Usertypes
lsquoCrsquo
host ACKsreceipt
of echoedlsquoCrsquo
host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo
simple telnet scenario
Host BHost A
Seq=42 ACK=79 data = lsquoCrsquo
Seq=79 ACK=43 data = lsquoCrsquo
Seq=43 ACK=80
TCP Reliable TransportTCP Sequence numbers and Acks
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing
to establish connection)bull agree on connection parameters
connection state ESTABconnection variables
seq client-to-server server-to-clientrcvBuffer size at serverclient
application
network
connection state ESTABconnection Variables
seq client-to-server server-to-clientrcvBuffer size at serverclient
application
network
Socket clientSocket = newSocket(hostnameport
number)
Socket connectionSocket = welcomeSocketaccept()
Connection Management TCP 3-way handshake
TCP Reliable Transport
SYNbit=1 Seq=x
choose init seq num xsend TCP SYN msg
ESTAB
SYNbit=1 Seq=yACKbit=1 ACKnum=x+1
choose init seq num ysend TCP SYNACKmsg acking SYN
ACKbit=1 ACKnum=y+1
received SYNACK(x) indicates server is livesend ACK for SYNACK
this segment may contain client-to-server data
received ACK(y) indicates client is live
SYNSENT
ESTAB
SYN RCVD
client state
LISTEN
server state
LISTEN
TCP Reliable TransportConnection Management TCP 3-way handshake
closed
L
listen
SYNrcvd
SYNsent
ESTAB
Socket clientSocket = newSocket(hostnameport
number)
SYN(seq=x)
Socket connectionSocket = welcomeSocketaccept()
SYN(x)
SYNACK(seq=yACKnum=x+1)create new socket for communication back to client
SYNACK(seq=yACKnum=x+1)
ACK(ACKnum=y+1)ACK(ACKnum=y+1)
L
TCP Reliable TransportConnection Management TCP 3-way handshake
client server each close their side of connection send TCP segment with FIN bit = 1
respond to received FIN with ACK on receiving FIN ACK can be combined with
own FINsimultaneous FIN exchanges can be
handled
TCP Reliable TransportConnection Management Closing connection
FIN_WAIT_2
CLOSE_WAIT
FINbit=1 seq=y
ACKbit=1 ACKnum=y+1
ACKbit=1 ACKnum=x+1 wait for server
close
can stillsend data
can no longersend data
LAST_ACK
CLOSED
TIMED_WAIT
timed wait for 2max
segment lifetime
CLOSED
FIN_WAIT_1 FINbit=1 seq=xcan no longersend but can receive data
clientSocketclose()
client state server state
ESTABESTAB
TCP Reliable TransportConnection Management Closing connection
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
data rcvd from appcreate segment with seq
seq is byte-stream
number of first data byte in segment
start timer if not already running think of timer as for
oldest unacked segment expiration interval
TimeOutInterval
timeoutretransmit segment that
caused timeoutrestart timer ack rcvdif ack acknowledges
previously unacked segments update what is known to
be ACKed start timer if there are
still unacked segments
TCP Reliable Transport
lost ACK scenario
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8 bytes of data
Xtim
eout
ACK=100
premature timeout
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8bytes of data
tim
eout
ACK=120
Seq=100 20 bytes of data
ACK=120
SendBase=100
SendBase=120
SendBase=120
SendBase=92
TCP Reliable TransportTCP Retransmission Scenerios
X
cumulative ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=120 15 bytes of data
tim
eout
Seq=100 20 bytes of data
ACK=120
TCP Reliable TransportTCP Retransmission Scenerios
event at receiver
arrival of in-order segment withexpected seq All data up toexpected seq already ACKed
arrival of in-order segment withexpected seq One other segment has ACK pending
arrival of out-of-order segmenthigher-than-expect seq Gap detected
arrival of segment that partially or completely fills gap
TCP receiver action
delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK
immediately send single cumulative ACK ACKing both in-order segments
immediately send duplicate ACK indicating seq of next expected byte
immediate send ACK provided thatsegment starts at lower end of gap
TCP ACK generation [RFC 1122 2581]Reliable Transport
time-out period often relatively long long delay before
resending lost packetdetect lost segments via
duplicate ACKs sender often sends many
segments back-to-back if segment is lost there
will likely be many duplicate ACKs
if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq
likely that unacked segment lost so donrsquot wait for timeout
TCP fast retransmit
(ldquotriple duplicate ACKsrdquo)
TCP Reliable TransportTCP Fast Retransmit
X
fast retransmit after sender receipt of triple duplicate ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
tim
eout
ACK=100
ACK=100
ACK=100
Seq=100 20 bytes of data
Seq=100 20 bytes of data
TCP Reliable TransportTCP Fast Retransmit
Q how to set TCP timeout value
longer than RTT but RTT varies
too short premature timeout unnecessary retransmissions
too long slow reaction to segment loss
Q how to estimate RTT
bull SampleRTT measured time from segment transmission until ACK receipt
ndash ignore retransmissions
bull SampleRTT will vary want estimated RTT ldquosmootherrdquo
ndash average several recent measurements not just current SampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
RTT gaiacsumassedu to fantasiaeurecomfr
100
150
200
250
300
350
1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106
time (seconnds)
RTT
(mill
isec
onds
)
SampleRTT Estimated RTT
EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases
exponentially fast typical value = 0125
RTT (
mill
iseco
nds)
RTT gaiacsumassedu to fantasiaeurecomfr
sampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
time (seconds)
bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin
bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|
(typically = 025)
TimeoutInterval = EstimatedRTT + 4DevRTT
estimated RTT ldquosafety marginrdquo
TCP Reliable TransportTCP Roundtrip time and timeouts
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
applicationprocess
TCP socketreceiver buffers
TCPcode
IPcode
application
OS
receiver protocol stack
application may remove data from
TCP socket buffers hellip
hellip slower than TCP receiver is delivering(sender is sending)
from sender
receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast
flow control
TCP Reliable TransportFlow Control
buffered data
free buffer spacerwnd
RcvBuffer
TCP segment payloads
to application processbull receiver ldquoadvertisesrdquo free
buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via
socket options (typical default is 4096 bytes)
ndash many operating systems autoadjust RcvBuffer
bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value
bull guarantees receive buffer will not overflow
receiver-side buffering
TCP Reliable TransportFlow Control
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
congestionbull informally ldquotoo many sources sending too
much data too fast for network to handlerdquobull different from flow controlbull manifestations
ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control
end-end congestion control
no explicit feedback from network
congestion inferred from end-system observed loss delay
approach taken by TCP
network-assisted congestion control
routers provide feedback to end systemssingle bit indicating
congestion (SNA DECbit TCPIP ECN ATM)
explicit rate for sender to send at
Principles of Congestion Control
fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK
TCP connection 1
bottleneckroutercapacity RTCP connection 2
TCP Congestion ControlTCP Fairness
approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss

[Figure: TCP sender congestion window size vs. time – AIMD sawtooth behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half)]

TCP Congestion Control: TCP Fairness – Why is TCP Fair? AIMD: additive increase, multiplicative decrease
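The AIMD rule can be simulated one RTT at a time; a minimal sketch, assuming (for illustration) that loss occurs whenever cwnd reaches a fixed link capacity, with cwnd in MSS units:

```python
def aimd(rtts: int, capacity: int, cwnd: int = 1) -> list[int]:
    """Trace an AIMD congestion window over a number of RTTs.
    Loss is modeled as cwnd reaching the link capacity."""
    trace = []
    for _ in range(rtts):
        trace.append(cwnd)
        if cwnd >= capacity:
            cwnd = max(1, cwnd // 2)  # multiplicative decrease: cut cwnd in half
        else:
            cwnd += 1                 # additive increase: +1 MSS per RTT
    return trace

trace = aimd(rtts=20, capacity=8)
print(trace)  # sawtooth: climbs to 8, drops to 4, climbs again ...
```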
sender limits transmission: LastByteSent – LastByteAcked ≤ cwnd
• cwnd is a dynamic function of perceived network congestion

TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes

[Figure: sender sequence number space – last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent; at most cwnd bytes in flight]

rate ≈ cwnd / RTT  bytes/sec

TCP Congestion Control
two competing sessions: additive increase gives slope of 1 as throughput increases; multiplicative decrease decreases throughput proportionally

[Figure: Connection 1 throughput vs. Connection 2 throughput, both axes 0 to R – the trajectory alternates congestion avoidance (additive increase) with loss (window decreased by factor of 2), converging toward the equal bandwidth share line]

TCP Congestion Control: TCP Fairness – Why is TCP Fair?
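The convergence toward the equal-share line can be sketched by iterating two AIMD flows that share one bottleneck; this is an idealized fluid model (synchronized loss, both halving together) chosen to mirror the figure, not a packet-level simulation:

```python
def aimd_two_flows(x1: float, x2: float, R: float, steps: int):
    """Two AIMD flows sharing a bottleneck of capacity R.
    Each step: both add 1 (additive increase); when their combined
    rate exceeds R, both halve (multiplicative decrease on loss)."""
    for _ in range(steps):
        if x1 + x2 > R:
            x1, x2 = x1 / 2, x2 / 2
        else:
            x1, x2 = x1 + 1, x2 + 1
    return x1, x2

# start far from fair: flow 2 has almost all the bandwidth
x1, x2 = aimd_two_flows(1.0, 60.0, R=100.0, steps=1000)
print(round(x1, 3), round(x2, 3))  # near-equal share
```

Each multiplicative decrease halves the gap between the two flows while additive increase leaves it unchanged, so the gap shrinks geometrically toward zero.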
Fairness and UDP: multimedia apps often do not use TCP
• do not want rate throttled by congestion control
• instead use UDP: send audio/video at constant rate, tolerate packet loss

Fairness and parallel TCP connections: application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2

TCP Congestion Control: TCP Fairness
when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received

summary: initial rate is slow, but ramps up exponentially fast

[Figure: Host A / Host B timeline – one segment in the first RTT, then two segments, then four segments]

TCP Congestion Control: Slow Start
TCP Congestion Control
loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly

loss indicated by 3 duplicate ACKs (TCP RENO): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

Detecting and Reacting to Loss

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout

Implementation: variable ssthresh; on loss event, ssthresh is set to 1/2 of cwnd just before loss event

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
[FSM: TCP congestion control (Reno)]

slow start (initial: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0)
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• cwnd > ssthresh: go to congestion avoidance

congestion avoidance
• new ACK: cwnd = cwnd + MSS·(MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

fast recovery
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

TCP Congestion Control
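The FSM's two loss reactions can be sketched as a small helper (Reno-style updates, cwnd and ssthresh in MSS units; illustrative only):

```python
def on_loss(cwnd: float, ssthresh: float, event: str):
    """TCP Reno's two loss reactions, per the FSM above.
    'timeout': ssthresh = cwnd/2, cwnd back to 1 MSS (re-enter slow start)
    'dup3':    ssthresh = cwnd/2, cwnd = ssthresh + 3 (enter fast recovery)"""
    ssthresh = cwnd / 2
    if event == "timeout":
        cwnd = 1
    elif event == "dup3":
        cwnd = ssthresh + 3
    return cwnd, ssthresh

print(on_loss(cwnd=16, ssthresh=8, event="timeout"))  # (1, 8.0)
print(on_loss(cwnd=16, ssthresh=8, event="dup3"))     # (11.0, 8.0)
```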
• avg TCP thruput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg window size (# in-flight bytes) is ¾ W
  – avg thruput is ¾ W per RTT

[Figure: sawtooth of window size oscillating between W/2 and W]

avg TCP thruput = (3/4) · W / RTT  bytes/sec

TCP Throughput
TCP over "long, fat pipes"
• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

    TCP throughput = 1.22 · MSS / (RTT · √L)

• to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰ – a very small loss rate!
• new versions of TCP for high-speed
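The slide's numbers can be checked directly from the two formulas (window needed to fill the pipe, and the [Mathis 1997] model):

```python
from math import sqrt

MSS = 1500 * 8           # segment size in bits
RTT = 0.100              # seconds
target = 10e9            # 10 Gbps

# window needed to keep the pipe full: one RTT's worth of data in flight
W = target * RTT / MSS
print(round(W))          # ~83,333 in-flight segments

# Mathis et al. model: throughput = 1.22 * MSS / (RTT * sqrt(L));
# solve for the loss rate L that sustains the target rate
L = (1.22 * MSS / (RTT * target)) ** 2
rate = 1.22 * MSS / (RTT * sqrt(L))   # sanity check: reproduces the target
print(L, rate)
```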
Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008
TCP Throughput Collapse: what happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer
  – SRU size is fixed
• TCP used as the data transfer protocol
Cluster-based Storage Systems

[Figure: client connected through a switch to storage servers 1–4; a data block is striped into Server Request Units (SRUs) across the servers; in a synchronized read, the client requests SRUs 1–4 in parallel, and only after all responses arrive does the client send the next batch of requests]
Link idle time due to timeouts

[Figure: same synchronized read – server 4's response is lost; the link is idle until server 4 experiences a timeout]

TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts

[Figure: goodput vs. number of servers, showing the collapse]
TCP data-driven loss recovery

[Figure: sender transmits segments 1–5; segment 2 is lost; the receiver ACKs 1 for each later arrival, so the sender sees 3 duplicate ACKs for 1 (packet 2 is probably lost), retransmits packet 2 immediately, and then receives Ack 5]

In SANs, recovery in usecs after loss!
TCP timeout-driven loss recovery

[Figure: sender transmits segments 1–5; after a loss, no further ACKs arrive beyond Ack 1, and the sender must wait out the Retransmission Timeout (RTO) before retransmitting]

• Timeouts are expensive (msecs to recover after loss)
TCP Loss recovery comparison

[Figure, side by side: data-driven recovery – triple duplicate ACK triggers an immediate retransmit of segment 2, answered by Ack 5; timeout-driven recovery – the sender idles for a full Retransmission Timeout (RTO) before retransmitting]

Timeout-driven recovery is slow (ms); data-driven recovery is super fast (us) in SANs
TCP Throughput Collapse: Summary

• Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP)
  – Limit in-network buffer (queue length) via both in-network signaling and end-to-end TCP modifications
• principles behind transport layer services: multiplexing, demultiplexing; reliable data transfer; flow control; congestion control
• instantiation, implementation in the Internet: UDP, TCP

Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"

Perspective

Before Next time
• Project Proposal
  – due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
• But review chapter 4 from the book, Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer Services/Protocols
- Transport Layer Services/Protocols (2)
- Transport Layer Services/Protocols (3)
- Transport Layer Services/Protocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over "long, fat pipes"
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
• reliable, in-order delivery (TCP)
  – congestion control
  – flow control
  – connection setup
• unreliable, unordered delivery: UDP
  – no-frills extension of "best-effort" IP
• services not available:
  – delay guarantees
  – bandwidth guarantees
[Figure: two end hosts with full application / transport / network / data link / physical stacks, connected across routers that implement only network / data link / physical; logical end-end transport runs between the two transport layers]

Transport Layer Services/Protocols
TCP service:
• reliable transport between sending and receiving process
• flow control: sender won't overwhelm receiver
• congestion control: throttle sender when network overloaded
• does not provide: timing, minimum throughput guarantee, security
• connection-oriented: setup required between client and server processes

UDP service:
• unreliable data transfer between sending and receiving process
• does not provide: reliability, flow control, congestion control, timing, throughput guarantee, security, or connection setup

Q: why bother? Why is there a UDP?

Transport Layer Services/Protocols
Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
• Congestion control
• Data Center TCP
  – Incast Problem
multiplexing at sender: handle data from multiple sockets, add transport header (later used for demultiplexing)
demultiplexing at receiver: use header info to deliver received segments to correct socket

[Figure: three hosts, each with application / transport / network / link / physical layers; processes P1–P4 attach to sockets above the transport layer, which multiplexes outgoing data at the sender and demultiplexes incoming segments to the correct socket at the receiver]

Transport Layer: Sockets, Multiplexing/Demultiplexing
Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
• Congestion control
• Data Center TCP
  – Incast Problem
UDP segment format (32 bits wide): source port #, dest port #; length, checksum; application data (payload)
• length: in bytes of UDP segment, including header

why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small header size
• no congestion control: UDP can blast away as fast as desired

UDP Connectionless Transport: UDP Segment Header

UDP Connectionless Transport
Goal: detect "errors" (e.g., flipped bits) in transmitted segment

sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: addition (one's complement sum) of segment contents
• sender puts checksum value into UDP checksum field

receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  – NO: error detected
  – YES: no error detected. But maybe errors nonetheless? More later …

UDP Checksum
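The one's-complement sum can be sketched over 16-bit words (simplified: a few payload words only, no pseudo-header; at the receiver, the words plus the checksum must sum to 0xFFFF, i.e., all ones):

```python
def ones_complement_checksum(words):
    """One's-complement sum of 16-bit words, then complement:
    the scheme UDP applies over its segment contents."""
    total = 0
    for w in words:
        total += w
        total = (total & 0xFFFF) + (total >> 16)  # wrap the carry back in
    return ~total & 0xFFFF

words = [0x4500, 0x1234, 0xBEEF]
ck = ones_complement_checksum(words)

# receiver check: sum of words plus checksum must be 0xFFFF (all ones)
total = ck
for w in words:
    total += w
    total = (total & 0xFFFF) + (total >> 16)
print(hex(ck), hex(total))  # total is 0xffff: no error detected
```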
Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
• Congestion control
• Data Center TCP
  – Incast Problem
important in application, transport, link layers: top-10 list of important networking topics!
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

Principles of Reliable Transport

characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
important in application, transport, link layers: top-10 list of important networking topics!

Principles of Reliable Transport

characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
important in application, transport, link layers: top-10 list of important networking topics!

Principles of Reliable Transport
send side / receive side interfaces:
• rdt_send(): called from above (e.g., by app); passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer

Principles of Reliable Transport
TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver

TCP Reliable Transport
TCP segment structure (32 bits wide): source port #, dest port #; sequence number; acknowledgement number; head len, not used, flag bits (U A P R S F), receive window; checksum, urg data pointer; options (variable length); application data (variable length)
• URG: urgent data (generally not used); ACK: ACK # valid; PSH: push data now (generally not used); RST, SYN, FIN: connection estab (setup, teardown commands)
• sequence and acknowledgement numbers: counting by bytes of data (not segments!)
• receive window: # bytes rcvr willing to accept
• Internet checksum (as in UDP)

TCP Reliable Transport: TCP Segment Structure
sequence numbers: byte-stream "number" of first byte in segment's data
acknowledgements: seq # of next byte expected from other side; cumulative ACK
Q: how does receiver handle out-of-order segments?
A: TCP spec doesn't say – up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing source/dest ports, sequence number, acknowledgement number, rwnd, checksum, urg pointer; sender sequence number space partitioned into sent & ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable, with window size N]

TCP Reliable Transport: TCP Sequence numbers and Acks
[Figure: simple telnet scenario – user types 'C'; Host A sends Seq=42, ACK=79, data = 'C'; Host B ACKs receipt of 'C' and echoes back 'C' (Seq=79, ACK=43, data = 'C'); Host A ACKs receipt of echoed 'C' (Seq=43, ACK=80)]

TCP Reliable Transport: TCP Sequence numbers and Acks
TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver

TCP Reliable Transport
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server applications, each reaching connection state ESTAB with connection variables – seq #s client-to-server and server-to-client, rcvBuffer size at server/client; client side: Socket clientSocket = new Socket(hostname, port number); server side: Socket connectionSocket = welcomeSocket.accept()]

Connection Management: TCP 3-way handshake

TCP Reliable Transport
[Figure: 3-way handshake timeline – client (LISTEN → SYNSENT) chooses init seq num x and sends TCP SYN msg (SYNbit=1, Seq=x); server (LISTEN → SYN RCVD) chooses init seq num y and sends TCP SYNACK msg acking the SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); client (→ ESTAB): received SYNACK(x) indicates server is live, so it sends ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; server (→ ESTAB): received ACK(y) indicates client is live]
[Figure: handshake state machines – client: closed → (Socket clientSocket = new Socket(hostname, port number): send SYN(seq=x)) → SYN sent → (receive SYNACK(seq=y, ACKnum=x+1): send ACK(ACKnum=y+1)) → ESTAB; server: closed → (Socket connectionSocket = welcomeSocket.accept()) → listen → (receive SYN(x): send SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client) → SYN rcvd → (receive ACK(ACKnum=y+1)) → ESTAB]

TCP Reliable Transport: Connection Management – TCP 3-way handshake
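The handshake itself is performed by the kernel underneath the socket calls; a minimal loopback sketch (port 0 lets the OS choose a free port):

```python
import socket
import threading

# the kernel performs the SYN / SYNACK / ACK exchange underneath
# connect() and accept(); this just drives those calls on loopback
welcome = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
welcome.bind(("127.0.0.1", 0))       # port 0: let the OS pick one
welcome.listen(1)
port = welcome.getsockname()[1]

def server():
    conn, _ = welcome.accept()       # returns once the handshake completes
    conn.sendall(b"hello")           # data flows only once ESTAB
    conn.close()

t = threading.Thread(target=server)
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))  # sends SYN, waits for SYNACK, sends ACK
data = client.recv(5)
print(data)                          # b'hello'
client.close()
t.join()
welcome.close()
```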
closing a connection: client, server each close their side of connection
• send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP Reliable Transport: Connection Management – Closing connection
[Figure: connection teardown timeline – client (ESTAB → FIN_WAIT_1): clientSocket.close(), sends FINbit=1, seq=x; can no longer send but can receive data; server (ESTAB → CLOSE_WAIT): ACKs with ACKbit=1, ACKnum=x+1; can still send data; client → FIN_WAIT_2, waits for server close; server (→ LAST_ACK): sends FINbit=1, seq=y; can no longer send data; client (→ TIMED_WAIT): ACKs with ACKbit=1, ACKnum=y+1, does a timed wait for 2 × max segment lifetime, then → CLOSED; server → CLOSED]

TCP Reliable Transport: Connection Management – Closing connection
TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver

TCP Reliable Transport
data rcvd from app:
• create segment with seq #; seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments:
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP Reliable Transport
[Figure: lost ACK scenario – Host A sends Seq=92, 8 bytes of data; Host B's ACK=100 is lost (X); after a timeout, Host A retransmits Seq=92, 8 bytes of data; Host B ACKs 100 again; SendBase=100]

[Figure: premature timeout scenario – Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes (SendBase=92); the timeout fires before ACK=100 and ACK=120 are processed, so Host A retransmits Seq=92, 8 bytes; Host B replies ACK=120; SendBase advances 100 → 120]

TCP Reliable Transport: TCP Retransmission Scenarios
[Figure: cumulative ACK scenario – Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; ACK=100 is lost (X), but the cumulative ACK=120 arrives before timeout, so Host A does not retransmit and continues with Seq=120, 15 bytes]

TCP Reliable Transport: TCP Retransmission Scenarios
event at receiver → TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap

TCP ACK generation [RFC 1122, 2581] – Reliable Transport
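The cumulative-ACK behavior in the table can be sketched with whole segment numbers instead of byte seq #s (a simplification: no delayed-ACK timer, and out-of-order segments are buffered by the receiver):

```python
def ack_stream(expected: int, arrivals):
    """Return the cumulative ACK (next expected segment) emitted after
    each arrival: out-of-order arrivals re-ACK the expected segment
    (duplicate ACKs); a gap-filling arrival jumps the ACK forward."""
    buffered = set()
    acks = []
    for seg in arrivals:
        if seg == expected:
            expected += 1
            while expected in buffered:   # arrival filled a gap: advance
                buffered.discard(expected)
                expected += 1
        elif seg > expected:
            buffered.add(seg)             # out-of-order: gap detected
        acks.append(expected)             # cumulative ACK: next expected
    return acks

# segment 3 arrives last: two duplicate ACKs of 3, then a jump to 6
print(ack_stream(2, [2, 4, 5, 3]))  # [3, 3, 3, 6]
```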
time-out period often relatively long: long delay before resending lost packet

detect lost segments via duplicate ACKs:
• sender often sends many segments back-to-back
• if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
– likely that unacked segment lost, so don't wait for timeout

TCP Reliable Transport: TCP Fast Retransmit
[Figure: fast retransmit after sender receipt of triple duplicate ACK – Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; Seq=100 is lost (X); Host B sends ACK=100 repeatedly; after the third duplicate ACK=100, Host A resends Seq=100, 20 bytes before the timeout expires]

TCP Reliable Transport: TCP Fast Retransmit
Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary; want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

TCP Reliable Transport: TCP Roundtrip time and timeouts
EstimatedRTT = (1 – α) · EstimatedRTT + α · SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: SampleRTT and EstimatedRTT (milliseconds) vs. time (seconds), RTT from gaia.cs.umass.edu to fantasia.eurecom.fr]

TCP Reliable Transport: TCP Roundtrip time and timeouts
• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

    DevRTT = (1 – β) · DevRTT + β · |SampleRTT – EstimatedRTT|
    (typically, β = 0.25)

    TimeoutInterval = EstimatedRTT + 4 · DevRTT
    (estimated RTT plus "safety margin")

TCP Reliable Transport: TCP Roundtrip time and timeouts
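The estimator can be sketched directly from these formulas; here DevRTT is updated from the old EstimatedRTT, following the RFC 6298 ordering, with α = 0.125 and β = 0.25:

```python
def update_rtt(est, dev, sample, alpha=0.125, beta=0.25):
    """One RTT measurement step:
    DevRTT          = (1-beta)*DevRTT + beta*|SampleRTT - EstimatedRTT|
    EstimatedRTT    = (1-alpha)*EstimatedRTT + alpha*SampleRTT
    TimeoutInterval = EstimatedRTT + 4*DevRTT"""
    dev = (1 - beta) * dev + beta * abs(sample - est)
    est = (1 - alpha) * est + alpha * sample
    timeout = est + 4 * dev
    return est, dev, timeout

est, dev = 0.100, 0.005          # seconds
for sample in [0.120, 0.110, 0.250]:   # an RTT spike widens the safety margin
    est, dev, timeout = update_rtt(est, dev, sample)
    print(round(est, 4), round(dev, 4), round(timeout, 4))
```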
TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver

TCP Reliable Transport
[Figure: receiver protocol stack – application process above the OS; TCP socket receiver buffers sit between the application and the TCP code, with IP code below, receiving from the sender]

application may remove data from TCP socket buffers …
… slower than TCP receiver is delivering (sender is sending)
receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast
flow control
TCP Reliable TransportFlow Control
buffered data
free buffer spacerwnd
RcvBuffer
TCP segment payloads
to application processbull receiver ldquoadvertisesrdquo free
buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via
socket options (typical default is 4096 bytes)
ndash many operating systems autoadjust RcvBuffer
bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value
bull guarantees receive buffer will not overflow
receiver-side buffering
TCP Reliable TransportFlow Control
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
congestionbull informally ldquotoo many sources sending too
much data too fast for network to handlerdquobull different from flow controlbull manifestations
ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control
end-end congestion control
no explicit feedback from network
congestion inferred from end-system observed loss delay
approach taken by TCP
network-assisted congestion control
routers provide feedback to end systemssingle bit indicating
congestion (SNA DECbit TCPIP ECN ATM)
explicit rate for sender to send at
Principles of Congestion Control
fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK
TCP connection 1
bottleneckroutercapacity RTCP connection 2
TCP Congestion ControlTCP Fairness
approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1
MSS every RTT until loss detected multiplicative decrease cut cwnd in half
after loss
cwnd
TC
P s
end
er
cong
estio
n w
indo
w s
ize
AIMD saw toothbehavior probing
for bandwidth
additively increase window size helliphellip until loss occurs (then cut window in half)
time
TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease
sender limits transmission
cwnd is dynamic function of perceived network congestion
TCP sending rateroughly send cwnd
bytes wait RTT for ACKS then send more bytes
last byteACKed sent not-
yet ACKed(ldquoin-flightrdquo)
last byte sent
cwnd
LastByteSent-LastByteAcked
lt cwnd
sender sequence number space
rate ~~cwnd
RTTbytessec
TCP Congestion Control
two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally
R
R
equal bandwidth share
Connection 1 throughput
Con
nect
ion
2 th
roug
h pu t
congestion avoidance additive increaseloss decrease window by factor of 2
congestion avoidance additive increaseloss decrease window by factor of 2
TCP Congestion ControlTCP Fairness Why is TCP Fair
Fairness and UDPmultimedia apps
often do not use TCP do not want rate
throttled by congestion control
instead use UDP send audiovideo at
constant rate tolerate packet loss
Fairness parallel TCP connections
application can open multiple parallel connections between two hosts
web browsers do this eg link of rate R with 9
existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2
TCP Congestion ControlTCP Fairness
when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd
for every ACK receivedsummary initial rate is
slow but ramps up exponentially fast
Host A
one segment
RT
T
Host B
time
two segments
four segments
TCP Congestion ControlSlow Start
TCP Congestion Control
loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly
loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
Detecting and Reacting to Loss
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++
duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++
duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP Congestion Control
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
bull Timeouts are expensive(msecs to recover after loss)
TCP Loss recovery comparison
Sender Receiver
12345
Ack 1
Ack 1Ack 1Ack 1
Retransmit 2
Seq
Ack 5
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
Timeout driven recovery is slow (ms)
Data-driven recovery issuper fast (us) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
Perspective:
• principles behind transport layer services: multiplexing/demultiplexing, reliable data transfer, flow control, congestion control
• instantiation, implementation in the Internet: UDP, TCP
Next time:
• Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"
Before Next time
• Project Proposal
– due in one week
– Meet with groups, TA, and professor
• Lab 1
– Single-threaded TCP proxy
– Due in one week, next Friday
• No required reading and review due
• But review Chapter 4 from the book: Network Layer
– We will also briefly discuss data center topologies
• Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer Services/Protocols
- Transport Layer Services/Protocols (2)
- Transport Layer Services/Protocols (3)
- Transport Layer Services/Protocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over "long fat pipes"
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
Transport Layer Services/Protocols
TCP service:
• reliable transport between sending and receiving process
• flow control: sender won't overwhelm receiver
• congestion control: throttle sender when network overloaded
• does not provide: timing, minimum throughput guarantee, security
• connection-oriented: setup required between client and server processes
UDP service:
• unreliable data transfer between sending and receiving process
• does not provide: reliability, flow control, congestion control, timing, throughput guarantee, security, or connection setup
Q: why bother? Why is there a UDP?
Goals for Today
• Transport Layer
– Abstraction / services
– Multiplexing/Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport
• Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
• Congestion control
• Data Center TCP
– Incast Problem
Transport Layer: Sockets, Multiplexing/Demultiplexing
multiplexing at sender: handle data from multiple sockets, add transport header (later used for demultiplexing)
demultiplexing at receiver: use header info to deliver received segments to correct socket
[Figure: three hosts' protocol stacks (application / transport / network / link / physical) with processes P1–P4 attached to sockets; segments flow from P1 and P2 through the network to P3 and P4.]
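As a small illustration of demultiplexing (my own example, not from the slides): the OS delivers each UDP datagram to the socket bound to the datagram's destination port.

```python
import socket

# Two receiving sockets on loopback; the OS demultiplexes incoming
# datagrams to the correct socket by destination port number.
rx1 = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx2 = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx1.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
rx2.bind(("127.0.0.1", 0))

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"to rx1", rx1.getsockname())  # dest port selects the socket
tx.sendto(b"to rx2", rx2.getsockname())

msg1, _ = rx1.recvfrom(1024)
msg2, _ = rx2.recvfrom(1024)
```

Each message arrives only at the socket whose bound port matches the datagram's destination port, which is exactly the "use header info to deliver to correct socket" step above.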
Goals for Today
• Transport Layer
– Abstraction / services
– Multiplexing/Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport
• Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
• Congestion control
• Data Center TCP
– Incast Problem
UDP: Connectionless Transport – UDP Segment Header
UDP segment format (32 bits wide):
source port | dest port
length | checksum
application data (payload)
length: in bytes of UDP segment, including header
why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small header size
• no congestion control: UDP can blast away as fast as desired
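A minimal sketch of that 8-byte header layout, using Python's struct module (the port numbers are arbitrary examples): four 16-bit big-endian fields, with length covering header plus payload.

```python
import struct

def build_udp_segment(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """Pack source port, dest port, length, checksum (0 = not computed)."""
    length = 8 + len(payload)            # UDP header is 8 bytes
    return struct.pack("!HHHH", src_port, dst_port, length, 0) + payload

seg = build_udp_segment(5413, 53, b"query")
src, dst, length, checksum = struct.unpack("!HHHH", seg[:8])
```

Unpacking the first 8 bytes recovers the four header fields; everything after them is the application payload.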
UDP Checksum
Goal: detect "errors" (e.g., flipped bits) in transmitted segment
sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: addition (one's complement sum) of segment contents
• sender puts checksum value into UDP checksum field
receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
– NO: error detected
– YES: no error detected. But maybe errors nonetheless? More later…
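The one's-complement sum above can be sketched as follows (a standard RFC 1071-style implementation, written here for illustration):

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of 16-bit words, then complemented."""
    if len(data) % 2:
        data += b"\x00"                            # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold carry back in
    return ~total & 0xFFFF

# Receiver check: summing the data plus its checksum yields all ones,
# so recomputing over (data + checksum) gives 0 when nothing flipped.
data = b"CS 5413 UDP "                             # even-length sample
ok = internet_checksum(data + internet_checksum(data).to_bytes(2, "big"))
```

A flipped bit anywhere in `data` would make `ok` nonzero, which is the receiver's error-detected case.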
Goals for Today
• Transport Layer
– Abstraction / services
– Multiplexing/Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport
• Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
• Congestion control
• Data Center TCP
– Incast Problem
Principles of Reliable Transport
• important in application, transport, link layers; top-10 list of important networking topics!
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
(repeated on the next two slides, whose figures refine the reliable-channel service model)
Principles of Reliable Transport
[Figure: rdt interfaces on the send side and receive side, above an unreliable channel.]
• rdt_send(): called from above (e.g., by app); passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer
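Those four interfaces can be exercised with a toy stop-and-wait (alternating-bit) protocol. This is my own single-process sketch over a randomly lossy channel, not code from the course:

```python
import random

class LossyChannel:
    """Unreliable channel: udt_send() drops packets with probability `loss`."""
    def __init__(self, loss: float, rng: random.Random):
        self.loss, self.rng, self.buf = loss, rng, []

    def udt_send(self, pkt):
        if self.rng.random() >= self.loss:   # packet survives
            self.buf.append(pkt)

    def recv(self):
        return self.buf.pop(0) if self.buf else None

def rdt_transfer(messages, loss=0.3, seed=7):
    """Deliver messages in order despite loss; returns what the app got."""
    rng = random.Random(seed)
    data_ch, ack_ch = LossyChannel(loss, rng), LossyChannel(loss, rng)
    delivered, rbit = [], 0                  # receiver's expected seq bit
    for i, msg in enumerate(messages):
        bit = i % 2                          # alternating-bit seq number
        while True:                          # retransmit until ACKed
            data_ch.udt_send((bit, msg))     # rdt_send -> udt_send
            pkt = data_ch.recv()             # rdt_rcv on receive side
            if pkt is not None:
                if pkt[0] == rbit:           # new in-order packet:
                    delivered.append(pkt[1]) #   deliver_data() to app
                    rbit ^= 1
                ack_ch.udt_send(pkt[0])      # ACK whatever arrived
            if ack_ch.recv() == bit:
                break                        # ACKed: move to next message
    return delivered
```

A lost data packet produces no ACK and a lost ACK produces a retransmission that the receiver recognizes as a duplicate, so the application still sees each message exactly once and in order.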
TCP Reliable Transport
TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver
TCP Reliable Transport
TCP Segment Structure (32 bits wide):
source port | dest port
sequence number (counting by bytes of data, not segments)
acknowledgement number
head len | not used | U A P R S F | receive window (# bytes rcvr willing to accept)
checksum (Internet checksum, as in UDP) | Urg data pointer
options (variable length)
application data (variable length)
flags: URG: urgent data (generally not used); ACK: ACK # valid; PSH: push data now (generally not used); RST, SYN, FIN: connection estab (setup, teardown commands)
TCP Reliable Transport
TCP Sequence numbers and Acks
sequence numbers:
– byte-stream "number" of first byte in segment's data
acknowledgements:
– seq # of next byte expected from other side
– cumulative ACK
Q: how does receiver handle out-of-order segments?
A: TCP spec doesn't say; up to implementor
[Figure: outgoing and incoming segment headers (source/dest port, sequence number, acknowledgement number, rwnd, checksum, urg pointer), plus the sender sequence number space: a window of size N spanning bytes sent & ACKed, sent not-yet ACKed ("in-flight"), usable but not yet sent, and not usable.]
TCP Reliable Transport
TCP Sequence numbers and Acks
simple telnet scenario (Host A, Host B):
1. User types 'C': A sends Seq=42, ACK=79, data = 'C'
2. host B ACKs receipt of 'C', echoes back 'C': Seq=79, ACK=43, data = 'C'
3. host A ACKs receipt of echoed 'C': Seq=43, ACK=80
TCP Reliable Transport
TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver
TCP Reliable Transport
Connection Management: TCP 3-way handshake
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server applications, each ending in connection state ESTAB with connection variables: seq # client-to-server and server-to-client, rcvBuffer size at server/client.]
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP Reliable Transport
Connection Management: TCP 3-way handshake
client state: LISTEN → SYNSENT → ESTAB; server state: LISTEN → SYN RCVD → ESTAB
1. client chooses init seq num x, sends TCP SYN msg: SYNbit=1, Seq=x
2. server chooses init seq num y, sends TCP SYNACK msg acking SYN: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1 (received SYNACK(x) indicates server is live)
3. client sends ACK for SYNACK: ACKbit=1, ACKnum=y+1; this segment may contain client-to-server data (received ACK(y) indicates client is live)
TCP Reliable Transport
Connection Management: TCP 3-way handshake
handshake FSM:
client: CLOSED → [clientSocket = new Socket("hostname", "port number"): send SYN(seq=x)] → SYNSENT → [rcv SYNACK(seq=y, ACKnum=x+1): send ACK(ACKnum=y+1)] → ESTAB
server: CLOSED → [connectionSocket = welcomeSocket.accept(): Λ] → LISTEN → [rcv SYN(x): send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client] → SYN RCVD → [rcv ACK(ACKnum=y+1): Λ] → ESTAB
(Λ = no action)
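The exchange can be traced with a tiny script (a self-contained sketch; the dicts stand in for real TCP headers, and the assertions encode the ACKnum checks each side performs):

```python
def three_way_handshake(x: int, y: int):
    """Simulate SYN / SYNACK / ACK and record each side's state changes."""
    trace = []
    syn = {"SYN": 1, "seq": x}                     # client: CLOSED -> SYNSENT
    trace.append(("client", "SYNSENT"))
    assert syn["SYN"] == 1                         # server: LISTEN -> SYN RCVD
    synack = {"SYN": 1, "ACK": 1, "seq": y, "acknum": syn["seq"] + 1}
    trace.append(("server", "SYN RCVD"))
    assert synack["acknum"] == x + 1               # SYNACK acks the SYN
    ack = {"ACK": 1, "acknum": synack["seq"] + 1}  # client: SYNSENT -> ESTAB
    trace.append(("client", "ESTAB"))
    assert ack["acknum"] == y + 1                  # ACK acks the server's SYN
    trace.append(("server", "ESTAB"))
    return trace
```

Running it for any pair of initial sequence numbers yields the state sequence from the FSM above.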
TCP Reliable Transport
Connection Management: Closing connection
• client, server each close their side of connection: send TCP segment with FIN bit = 1
• respond to received FIN with ACK; on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
TCP Reliable Transport
Connection Management: Closing connection
[Figure: client and server states during close, both starting in ESTAB. Client calls clientSocket.close(): FIN_WAIT_1, sends FINbit=1, seq=x (can no longer send, but can receive data). Server ACKs (ACKbit=1, ACKnum=x+1) and enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (waits for server close). Server sends FINbit=1, seq=y (can no longer send data) and enters LAST_ACK. Client replies ACKbit=1, ACKnum=y+1 and enters TIMED_WAIT (timed wait for 2 × max segment lifetime); then both sides reach CLOSED.]
TCP Reliable Transport
TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver
TCP Reliable Transport
TCP sender events:
data rcvd from app:
• create segment with seq #; seq # is byte-stream number of first data byte in segment
• start timer if not already running (think of timer as for oldest unacked segment); expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments: update what is known to be ACKed; start timer if there are still unacked segments
TCP Reliable Transport
TCP Retransmission Scenarios
[Figure: lost ACK scenario. Host A sends Seq=92, 8 bytes of data (SendBase=92); the ACK=100 is lost (X). A times out and retransmits Seq=92, 8 bytes of data; B ACKs 100 again and SendBase becomes 100.]
[Figure: premature timeout. Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes of data). ACK=100 and ACK=120 arrive after A's timeout fires, so A needlessly retransmits Seq=92, 8 bytes; the cumulative ACK=120 then covers all data (SendBase=120).]
TCP Reliable Transport
TCP Retransmission Scenarios
[Figure: cumulative ACK. Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes of data); ACK=100 is lost (X), but ACK=120 arrives before the timeout, cumulatively acknowledging both segments, so nothing is retransmitted.]
TCP ACK generation [RFC 1122, RFC 2581]
event at receiver → TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap
TCP Reliable Transport
TCP Fast Retransmit
• time-out period often relatively long: long delay before resending lost packet
• detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if a segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
– likely that unacked segment lost, so don't wait for timeout
TCP Reliable Transport
TCP Fast Retransmit
[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes of data); the second segment is lost (X). Host B returns ACK=100 four times; on the triple duplicate ACK, A retransmits Seq=100, 20 bytes of data, before its timeout expires.]
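The sender-side bookkeeping amounts to a duplicate-ACK counter; here is a minimal sketch (state is a plain dict, and the field names are my own):

```python
def on_ack(state: dict, acknum: int):
    """Process one ACK; return the seq # to fast-retransmit, or None."""
    if acknum > state["send_base"]:        # new cumulative ACK
        state["send_base"] = acknum
        state["dup_acks"] = 0
        return None
    state["dup_acks"] += 1                 # duplicate ACK
    if state["dup_acks"] == 3:             # triple duplicate ACK:
        return state["send_base"]          #   resend smallest unacked seq #
    return None

s = {"send_base": 92, "dup_acks": 0}
results = [on_ack(s, a) for a in [100, 100, 100, 100]]
```

The first ACK=100 advances SendBase; the next three are duplicates, and the third duplicate triggers the fast retransmit of seq 100 without waiting for a timeout.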
TCP Reliable Transport
TCP Round-trip time and timeouts
Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
– ignore retransmissions
• SampleRTT will vary; want estimated RTT "smoother"
– average several recent measurements, not just current SampleRTT
TCP Reliable Transport
TCP Round-trip time and timeouts
[Figure: RTT (milliseconds) vs. time (seconds) from gaia.cs.umass.edu to fantasia.eurecom.fr; SampleRTT fluctuates between roughly 100 and 350 ms while EstimatedRTT tracks it smoothly.]
EstimatedRTT = (1 − α) × EstimatedRTT + α × SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
TCP Reliable Transport
TCP Round-trip time and timeouts
• timeout interval: EstimatedRTT plus "safety margin"
– large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
DevRTT = (1 − β) × DevRTT + β × |SampleRTT − EstimatedRTT|
(typically, β = 0.25)
TimeoutInterval = EstimatedRTT + 4 × DevRTT
(estimated RTT plus "safety margin")
TCP Reliable Transport
TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver
TCP Reliable Transport
Flow Control
[Figure: receiver protocol stack (application / OS). IP code hands segments from the sender to TCP code, which queues data in TCP socket receiver buffers; the application process may remove data from the buffers slower than the TCP receiver is delivering (i.e., slower than the sender is sending).]
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
TCP Reliable Transport
Flow Control
[Figure: receiver-side buffering. TCP segment payloads fill RcvBuffer; buffered data sits ahead of the application process, and rwnd is the remaining free buffer space.]
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
– RcvBuffer size set via socket options (typical default is 4096 bytes)
– many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
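The "socket options" knob mentioned above is SO_RCVBUF; a short sketch (the requested size is an arbitrary example, and the OS may round or scale what it actually grants):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
default_rcvbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1 << 16)  # request 64 KiB
tuned_rcvbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
s.close()
```

A larger RcvBuffer lets the receiver advertise a larger rwnd; note that some kernels (Linux, for instance) double the requested value to leave room for bookkeeping, so the readback need not equal the request.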
Goals for Today
• Transport Layer
– Abstraction / services
– Multiplexing/Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport
• Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
• Congestion control
• Data Center TCP
– Incast Problem
Principles of Congestion Control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
– lost packets (buffer overflow at routers)
– long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
– single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
– explicit rate for sender to send at
TCP Congestion Control: TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connections 1 and 2 sharing a bottleneck router of capacity R.]
TCP Congestion Control: TCP Fairness
Why is TCP fair? AIMD: additive increase, multiplicative decrease
approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss
[Figure: TCP sender congestion window size (cwnd) over time: the AIMD "saw tooth" behavior, probing for bandwidth; additively increase window size… until loss occurs (then cut window in half).]
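The sawtooth can be reproduced in a few lines (a toy model of my own: loss is assumed to occur whenever cwnd reaches a fixed capacity, and units are MSS per RTT):

```python
def aimd(rounds: int, capacity: int, cwnd: int = 1) -> list:
    """Trace cwnd over `rounds` RTTs: +1 MSS per RTT, halved on loss."""
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd >= capacity:            # loss this RTT (buffer overflow)
            cwnd = max(cwnd // 2, 1)    # multiplicative decrease
        else:
            cwnd += 1                   # additive increase
    return trace

sawtooth = aimd(rounds=8, capacity=4)
```

The trace climbs linearly to the capacity, halves, and climbs again, which is exactly the sawtooth in the figure.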
TCP Congestion Control
sender limits transmission: LastByteSent − LastByteAcked ≤ cwnd
cwnd: dynamic, function of perceived network congestion
TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes
rate ≈ cwnd / RTT bytes/sec
[Figure: sender sequence number space: last byte ACKed, bytes sent but not-yet ACKed ("in-flight"), last byte sent, all within cwnd.]
TCP Congestion Control: TCP Fairness
Why is TCP fair? two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 1 throughput vs. Connection 2 throughput, each bounded by R. Repeated rounds of congestion avoidance (additive increase) and loss (window decreased by factor of 2) move the operating point toward the equal bandwidth share line.]
TCP Congestion Control: TCP Fairness
Fairness and UDP:
• multimedia apps often do not use TCP: do not want rate throttled by congestion control
• instead use UDP: send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
– new app asks for 1 TCP, gets rate R/10
– new app asks for 11 TCPs, gets R/2
TCP Congestion Control: Slow Start
when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received
summary: initial rate is slow, but ramps up exponentially fast
[Figure: Host A sends one segment, then two segments, then four segments, each batch one RTT apart.]
TCP Congestion Control: Detecting and Reacting to Loss
• loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs (TCP Reno): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)
TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
TCP Congestion Control
congestion-control FSM (TCP Reno):
init: cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0 → slow start
slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• cwnd > ssthresh → congestion avoidance
congestion avoidance:
• new ACK: cwnd = cwnd + MSS × (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
slow start or congestion avoidance:
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment → fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start
fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0 → congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start
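The FSM above can be sketched directly (a simplified model of my own: cwnd is tracked in units of MSS, and the per-ACK congestion-avoidance growth approximates +1 MSS per RTT):

```python
def reno_step(s: dict, event: str) -> dict:
    """Apply one FSM transition to state s = {state, cwnd, ssthresh, dup}."""
    if event == "timeout":
        s.update(ssthresh=max(s["cwnd"] / 2, 1), cwnd=1, dup=0,
                 state="slow_start")
    elif event == "dup_ack":
        if s["state"] == "fast_recovery":
            s["cwnd"] += 1                        # inflate per dup ACK
        else:
            s["dup"] += 1
            if s["dup"] == 3:                     # triple duplicate ACK
                s["ssthresh"] = max(s["cwnd"] / 2, 1)
                s.update(cwnd=s["ssthresh"] + 3, state="fast_recovery")
    elif event == "new_ack":
        if s["state"] == "fast_recovery":
            s.update(cwnd=s["ssthresh"], state="congestion_avoidance")
        elif s["state"] == "slow_start":
            s["cwnd"] += 1                        # exponential growth per ACK
            if s["cwnd"] > s["ssthresh"]:
                s["state"] = "congestion_avoidance"
        else:
            s["cwnd"] += 1 / s["cwnd"]            # ~ +1 MSS per RTT
        s["dup"] = 0
    return s

s = {"state": "slow_start", "cwnd": 8, "ssthresh": 16, "dup": 0}
for _ in range(3):
    reno_step(s, "dup_ack")          # third dup ACK enters fast recovery
```

After three duplicate ACKs the state halves ssthresh, sets cwnd = ssthresh + 3, and enters fast recovery; the next new ACK deflates cwnd back to ssthresh and resumes congestion avoidance.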
TCP Throughput
• avg TCP thruput as function of window size, RTT
– ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
– avg window size (# in-flight bytes) is 3/4 W
– avg thruput is 3/4 W per RTT: avg TCP thruput = (3/4) W / RTT bytes/sec
[Figure: cwnd sawtooth oscillating between W/2 and W.]
TCP over "long fat pipes"
• example: 1500-byte segments, 100 ms RTT; want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
TCP throughput = 1.22 × MSS / (RTT × √L)
→ to achieve 10 Gbps throughput, need a loss rate of L = 2 × 10⁻¹⁰, a very small loss rate!
• new versions of TCP for high-speed
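Those two numbers follow directly from the formulas; a quick check (throughput = W × MSS / RTT for the window requirement, then the Mathis formula solved for L):

```python
mss_bits = 1500 * 8          # 1500-byte segments, in bits
rtt = 0.100                  # 100 ms RTT
target = 10e9                # 10 Gbps target throughput

# Window needed: throughput = W * MSS / RTT  =>  W = throughput * RTT / MSS
W = target * rtt / mss_bits

# Mathis: throughput = 1.22 * MSS / (RTT * sqrt(L))  =>  solve for L
L = (1.22 * mss_bits / (rtt * target)) ** 2
```

W comes out to 83,333 in-flight segments and L to roughly 2 × 10⁻¹⁰, matching the slide's figures.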
Goals for Today
• Transport Layer
– Abstraction / services
– Multiplexing/Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport
• Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
• Congestion control
• Data Center TCP
– Incast Problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008.
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
bull Timeouts are expensive(msecs to recover after loss)
TCP Loss recovery comparison
Sender Receiver
12345
Ack 1
Ack 1Ack 1Ack 1
Retransmit 2
Seq
Ack 5
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
Timeout driven recovery is slow (ms)
Data-driven recovery issuper fast (us) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control
instantiation implementation in the Internet UDP TCP
Next timebull Network Layerbull leaving the network
ldquoedgerdquo (application transport layers)
bull into the network ldquocorerdquo
Perspective
Before Next time
bull Project Proposalndash due in one weekndash Meet with groups TA and professor
bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday
bull No required reading and review duebull But review chapter 4 from the book Network Layer
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
-
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
bull Congestion control
bull Data Center TCPndash Incast Problem
process
socket
use header info to deliverreceived segments to correct socket
demultiplexing at receiverhandle data from multiplesockets add transport header (later used for demultiplexing)
multiplexing at sender
transport
application
physical
link
network
P2P1
transport
application
physical
link
network
P4
transport
application
physical
link
network
P3
Transport LayerSockets MultiplexingDemultiplexing
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
bull Congestion control
bull Data Center TCPndash Incast Problem
source port dest port
32 bits
applicationdata (payload)
UDP segment format
length checksum
length in bytes of UDP segment
including header
no connection establishment (which can add delay)
simple no connection state at sender receiver
small header size no congestion control UDP
can blast away as fast as desired
why is there a UDP
UDP Connectionless TransportUDP Segment Header
UDP Connectionless Transport
senderbull treat segment contents
including header fields as sequence of 16-bit integers
bull checksum addition (onersquos complement sum) of segment contents
bull sender puts checksum value into UDP checksum field
receiverbull compute checksum of received
segmentbull check if computed checksum
equals checksum field value
ndash NO - error detectedndash YES - no error detected
But maybe errors nonetheless More later hellip
Goal detect ldquoerrorsrdquo (eg flipped bits) in transmitted segment
UDP Checksum
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
bull Congestion control
bull Data Center TCPndash Incast Problem
important in application transport link layers top-10 list of important networking topics
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
Principles of Reliable Transport
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
important in application transport link layers top-10 list of important networking topics
Principles of Reliable Transport
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
important in application transport link layers top-10 list of important networking topics
Principles of Reliable Transport
sendside
receiveside
rdt_send() called from above (eg by app) Passed data to deliver to receiver upper layer
udt_send() called by rdtto transfer packet over unreliable channel to receiver
rdt_rcv() called when packet arrives on rcv-side of channel
deliver_data() called by rdt to deliver data to upper
Principles of Reliable Transport
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
source port dest port
32 bits
applicationdata (variable length)
sequence number
acknowledgement number
receive window
Urg data pointerchecksum
FSRPAUheadlen
notused
options (variable length)
URG urgent data (generally not used)
ACK ACK valid
PSH push data now(generally not used)
RST SYN FINconnection estab(setup teardown
commands)
bytes rcvr willingto accept
countingby bytes of data(not segments)
Internetchecksum
(as in UDP)
TCP Reliable TransportTCP Segment Structure
sequence numbers
ndashbyte stream ldquonumberrdquo of first byte in segmentrsquos data
acknowledgements
ndashseq of next byte expected from other side
ndashcumulative ACKQ how receiver handles out-of-order segments
ndashA TCP spec doesnrsquot say - up to implementor
source port dest port
sequence number
acknowledgement number
checksum
rwnd
urg pointer
incoming segment to sender
A
sent ACKed
sent not-yet ACKed(ldquoin-flightrdquo)
usablebut not yet sent
not usable
window size N
sender sequence number space
source port dest port
sequence number
acknowledgement number
checksum
rwnd
urg pointer
outgoing segment from sender
TCP Reliable TransportTCP Sequence numbers and Acks
Usertypes
lsquoCrsquo
host ACKsreceipt
of echoedlsquoCrsquo
host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo
simple telnet scenario
Host BHost A
Seq=42 ACK=79 data = lsquoCrsquo
Seq=79 ACK=43 data = lsquoCrsquo
Seq=43 ACK=80
TCP Reliable TransportTCP Sequence numbers and Acks
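The byte-stream numbering in this scenario can be sketched in Python; the helper name is illustrative, not part of any TCP API:

```python
# Sketch of TCP byte-stream numbering from the telnet scenario above:
# a segment's seq is the number of its first data byte, and the ACK
# returned is the number of the next byte expected (seq + length).

def next_expected(seq, data_len):
    """Cumulative ACK value after receiving an in-order segment."""
    return seq + data_len

# Host A sends 'C' with Seq=42; Host B echoes with Seq=79.
ack_from_b = next_expected(42, 1)   # B expects A's byte 43 next
ack_from_a = next_expected(79, 1)   # A expects B's byte 80 next
print(ack_from_b, ack_from_a)
```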
Before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other is willing to establish connection)
• agree on connection parameters
Connection state on each side: ESTAB; connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client
Socket clientSocket = new Socket("hostname", "port number");
Socket connectionSocket = welcomeSocket.accept();
TCP Reliable Transport: Connection Management, TCP 3-way handshake
client state: LISTEN; server state: LISTEN
1. Client: choose init seq num x; send TCP SYN msg (SYNbit=1, Seq=x); client state: SYNSENT
2. Server: choose init seq num y; send TCP SYNACK msg, ACKing SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state: SYN RCVD
3. Client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state: ESTAB
4. Server: received ACK(y) indicates client is live; server state: ESTAB
TCP Reliable Transport: Connection Management, TCP 3-way handshake
[Figure: handshake FSMs. Client: closed -> (clientSocket = new Socket(...); send SYN(seq=x)) SYNsent -> (receive SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)) ESTAB. Server: closed -> listen (connectionSocket = welcomeSocket.accept()) -> (receive SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client) SYNrcvd -> (receive ACK(ACKnum=y+1)) ESTAB]
TCP Reliable Transport: Connection Management, TCP 3-way handshake
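The handshake's sequence/ack arithmetic can be sketched as follows; initial sequence numbers x and y and the dictionary fields are illustrative:

```python
# Sketch of the 3-way handshake messages: SYN carries seq=x, the SYNACK
# acknowledges x+1 and carries seq=y, and the final ACK acknowledges y+1.

def three_way_handshake(x, y):
    syn    = {"SYN": 1, "seq": x}
    synack = {"SYN": 1, "ACK": 1, "seq": y, "acknum": x + 1}
    ack    = {"ACK": 1, "acknum": y + 1}
    return syn, synack, ack

syn, synack, ack = three_way_handshake(x=100, y=300)
print(synack["acknum"], ack["acknum"])
```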
Client, server each close their side of the connection: send TCP segment with FIN bit = 1; respond to received FIN with ACK. On receiving a FIN, the ACK can be combined with own FIN; simultaneous FIN exchanges can be handled.
TCP Reliable Transport: Connection Management, Closing connection
client state / server state: ESTAB / ESTAB
1. clientSocket.close(): client sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send, but can receive data)
2. Server ACKs (ACKbit=1, ACKnum=x+1) and enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
3. Server sends FINbit=1, seq=y (can no longer send data); enters LAST_ACK
4. Client ACKs (ACKbit=1, ACKnum=y+1); enters TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED; server enters CLOSED on receiving the ACK
TCP Reliable Transport: Connection Management, Closing connection
TCP sender events:
data rcvd from app:
• create segment with seq #; seq # is byte-stream number of first data byte in segment
• start timer if not already running (think of timer as for oldest unACKed segment); expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout; restart timer
ack rcvd:
• if ack acknowledges previously unACKed segments: update what is known to be ACKed; start timer if there are still unACKed segments
TCP Reliable Transport
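These sender events can be sketched as a minimal state machine; class and attribute names are illustrative, and segments are simplified to (seq, data) pairs:

```python
# Simplified TCP sender event handling, as listed above (a sketch):
# data from app -> segment with byte-stream seq #, start timer;
# timeout -> retransmit oldest unACKed segment;
# ACK -> slide SendBase forward, keep timer only while data is in flight.

class SimpleSender:
    def __init__(self):
        self.next_seq = 0
        self.send_base = 0          # oldest unACKed byte
        self.timer_running = False
        self.unacked = {}           # seq -> data still in flight

    def send(self, data):
        self.unacked[self.next_seq] = data
        self.next_seq += len(data)
        self.timer_running = True   # timer covers oldest unACKed segment

    def on_ack(self, acknum):
        if acknum > self.send_base:                      # new cumulative ACK
            self.send_base = acknum
            self.unacked = {s: d for s, d in self.unacked.items()
                            if s >= acknum}
            self.timer_running = bool(self.unacked)

    def on_timeout(self):
        return self.unacked[min(self.unacked)]           # retransmit oldest

s = SimpleSender()
s.send(b"12345678")     # Seq=0, 8 bytes
s.send(b"abcd")         # Seq=8, 4 bytes
s.on_ack(8)             # first segment ACKed
print(s.send_base, s.timer_running)
```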
TCP Reliable Transport: TCP Retransmission Scenarios
Lost ACK scenario (Host A -> Host B): A sends Seq=92, 8 bytes of data; B's ACK=100 is lost (X); on timeout, A resends Seq=92, 8 bytes of data, and B ACKs 100 again. SendBase stays at 92 until the second ACK=100 arrives (SendBase=100).
Premature timeout: A sends Seq=92 (8 bytes) then Seq=100 (20 bytes); B sends ACK=100 then ACK=120, but A's timer expires first, so A resends Seq=92, 8 bytes of data; B replies with cumulative ACK=120. SendBase advances 92 -> 100 -> 120.
Cumulative ACK scenario: A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); ACK=100 is lost (X), but cumulative ACK=120 arrives before timeout, covering both segments, so no retransmission is needed; A continues with Seq=120, 15 bytes of data.
TCP Reliable Transport: TCP Retransmission Scenarios
TCP ACK generation [RFC 1122, RFC 2581]
Event at receiver -> TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed -> delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending -> immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected -> immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap -> immediately send ACK, provided that segment starts at lower end of gap
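The receiver policy above can be written as a small lookup; the event names are illustrative labels for the four table rows, not RFC terminology:

```python
# Sketch of the receiver's ACK policy from RFC 1122/2581, as tabulated above.

def ack_action(event):
    actions = {
        "in_order_all_acked":   "delayed ACK: wait up to 500 ms, then send ACK",
        "in_order_one_pending": "immediately send single cumulative ACK",
        "out_of_order_gap":     "immediately send duplicate ACK",
        "fills_gap":            "immediately send ACK if segment starts at lower end of gap",
    }
    return actions[event]

print(ack_action("out_of_order_gap"))
```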
Timeout period is often relatively long: long delay before resending a lost packet. Instead, detect lost segments via duplicate ACKs: the sender often sends many segments back-to-back, so if a segment is lost there will likely be many duplicate ACKs.
TCP fast retransmit: if sender receives 3 ACKs for the same data ("triple duplicate ACKs"), resend unACKed segment with smallest seq #; it is likely that the unACKed segment was lost, so don't wait for timeout.
TCP Reliable Transport: TCP Fast Retransmit
Fast retransmit after sender receipt of triple duplicate ACK (Host A -> Host B): A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); the Seq=100 segment is lost (X); B repeatedly sends ACK=100 (the original plus duplicates); on the third duplicate ACK, A resends Seq=100, 20 bytes of data, before the timeout expires.
TCP Reliable Transport: TCP Fast Retransmit
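The sender-side trigger can be sketched as a duplicate-ACK counter; the function name is illustrative:

```python
# Fast-retransmit sketch: count duplicate cumulative ACKs; on the third
# duplicate for the same value, retransmit the missing segment without
# waiting for a timeout.

def fast_retransmit(acks):
    """Return seq numbers retransmitted, given a stream of cumulative ACKs."""
    retransmitted = []
    last_ack, dup_count = None, 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == 3:          # triple duplicate ACK
                retransmitted.append(ack)
        else:
            last_ack, dup_count = ack, 0
    return retransmitted

# Original ACK=100 plus three duplicates triggers a resend of byte 100.
print(fast_retransmit([100, 100, 100, 100, 120]))
```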
Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary; want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
TCP Reliable Transport: TCP Round-trip time and timeouts
[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; RTT (milliseconds, roughly 100-350 ms) vs. time (seconds), showing SampleRTT fluctuating around the smoother EstimatedRTT]
EstimatedRTT = (1 - α) * EstimatedRTT + α * SampleRTT
• exponential weighted moving average; influence of past samples decreases exponentially fast
• typical value: α = 0.125
TCP Reliable Transport: TCP Round-trip time and timeouts
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
DevRTT = (1 - β) * DevRTT + β * |SampleRTT - EstimatedRTT|   (typically β = 0.25)
TimeoutInterval = EstimatedRTT + 4 * DevRTT   (estimated RTT plus "safety margin")
TCP Reliable Transport: TCP Round-trip time and timeouts
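The update rules above can be sketched directly in Python; initial values and variable names are illustrative:

```python
# Sketch of TCP's RTT estimation and timeout computation (RFC 6298 style):
# EstimatedRTT is an exponential weighted moving average of SampleRTT,
# DevRTT tracks deviation, and the timeout adds a 4*DevRTT safety margin.

def update_rtt(estimated_rtt, dev_rtt, sample_rtt, alpha=0.125, beta=0.25):
    """Return (new EstimatedRTT, new DevRTT, TimeoutInterval)."""
    # DevRTT is updated with the deviation from the *old* estimate.
    dev_rtt = (1 - beta) * dev_rtt + beta * abs(sample_rtt - estimated_rtt)
    estimated_rtt = (1 - alpha) * estimated_rtt + alpha * sample_rtt
    timeout = estimated_rtt + 4 * dev_rtt
    return estimated_rtt, dev_rtt, timeout

# Steady 120 ms samples pull a 100 ms estimate toward 120 ms.
est, dev = 100.0, 10.0
for s in [120.0, 120.0, 120.0]:
    est, dev, timeout = update_rtt(est, dev, s)
print(round(est, 2), round(dev, 2), round(timeout, 2))
```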
Receiver protocol stack: application process reads from TCP socket receiver buffers; TCP code and IP code sit in the OS; segments arrive from the sender. The application may remove data from TCP socket buffers slower than the TCP receiver is delivering (i.e., than the sender is sending).
Flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast.
TCP Reliable Transport: Flow Control
Receiver-side buffering: RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads arrive from the sender and are read out to the application process.
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems auto-adjust RcvBuffer
• sender limits amount of unACKed ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
TCP Reliable Transport: Flow Control
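The sender-side rule can be sketched as a simple admission check; variable names are illustrative:

```python
# Flow-control sketch: the sender keeps unACKed ("in-flight") data at or
# below the receiver's advertised window rwnd, so the receive buffer
# cannot overflow.

def can_send(last_byte_sent, last_byte_acked, rwnd, nbytes):
    """True if nbytes more may be sent without exceeding rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd

# Receiver advertises rwnd = free buffer space.
rcv_buffer, buffered = 4096, 1000
rwnd = rcv_buffer - buffered                 # 3096 bytes free
print(can_send(5000, 3000, rwnd, 1000))      # 2000 in flight + 1000 fits
```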
Goals for Today
• Transport Layer
  - Abstraction / services
  - Multiplexing/Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  - Congestion control
• Data Center TCP
  - Incast Problem
Congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
Principles of Congestion Control
Two broad approaches towards congestion control:
• end-end congestion control:
  - no explicit feedback from network
  - congestion inferred from end-system observed loss, delay
  - approach taken by TCP
• network-assisted congestion control:
  - routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
Principles of Congestion Control
Fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have average rate of R/K.
[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]
TCP Congestion Control: TCP Fairness
AIMD: additive increase, multiplicative decrease
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss
[Figure: TCP sender congestion window size (cwnd) vs. time, showing the AIMD "sawtooth" behavior, probing for bandwidth: additively increase window size ... until loss occurs (then cut window in half)]
TCP Congestion Control: TCP Fairness, Why is TCP Fair?
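The sawtooth above can be reproduced with a few lines; the loss model (loss whenever cwnd reaches a fixed window) is a simplifying assumption:

```python
# Minimal AIMD sketch: each RTT the window grows by 1 MSS (additive
# increase); on loss it is halved (multiplicative decrease), producing
# the sawtooth shown above. Loss is assumed at a fixed window size.

def aimd(rounds, loss_window, cwnd=1):
    """Return the cwnd trace (in MSS) over the given number of RTTs."""
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd >= loss_window:
            cwnd = cwnd // 2      # multiplicative decrease after loss
        else:
            cwnd += 1             # additive increase per RTT
    return trace

print(aimd(12, loss_window=8))
```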
Sender limits transmission: LastByteSent - LastByteAcked <= cwnd
• cwnd is a dynamic function of perceived network congestion
TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes:
rate ≈ cwnd / RTT bytes/sec
[Figure: sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent; cwnd spans the in-flight region]
TCP Congestion Control
Why is TCP fair? Two competing sessions: additive increase gives slope of 1 as throughput increases; multiplicative decrease decreases throughput proportionally.
[Figure: Connection 1 throughput vs. Connection 2 throughput, both bounded by R, with the "equal bandwidth share" line; repeated rounds of congestion avoidance (additive increase) and loss (decrease window by factor of 2) converge toward the equal-share line]
TCP Congestion Control: TCP Fairness, Why is TCP Fair?
Fairness and UDP: multimedia apps often do not use TCP (do not want rate throttled by congestion control); instead they use UDP: send audio/video at constant rate, tolerate packet loss.
Fairness and parallel TCP connections: an application can open multiple parallel connections between two hosts; web browsers do this. E.g., link of rate R with 9 existing connections:
• new app asks for 1 TCP, gets rate R/10
• new app asks for 11 TCPs, gets R/2
TCP Congestion Control: TCP Fairness
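The parallel-connections example works out as below, assuming the idealized model that each connection gets an equal per-connection share of the bottleneck:

```python
# Fairness sketch: K TCP sessions over a bottleneck of rate R each get
# about R/K, so an app opening more parallel connections grabs a larger
# share. Function and parameter names are illustrative.

def share(rate_r, existing, new_conns):
    """Aggregate rate obtained by an app opening new_conns connections."""
    total = existing + new_conns
    return rate_r * new_conns / total

R = 1.0
print(share(R, existing=9, new_conns=1))    # 1 of 10 connections: R/10
print(share(R, existing=9, new_conns=11))   # 11 of 20 connections: R/2
```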
When a connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received
Summary: initial rate is slow but ramps up exponentially fast.
[Figure: Host A sends one segment, then two segments, then four segments to Host B, one batch per RTT]
TCP Congestion Control: Slow Start
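The exponential ramp-up can be sketched as follows, under the simplifying assumption that nothing is lost before ssthresh is reached:

```python
# Slow-start sketch: cwnd starts at 1 MSS and doubles every RTT
# (equivalently, grows by 1 MSS per ACK received) until it reaches
# the slow-start threshold.

def slow_start(ssthresh, cwnd=1):
    """Return the per-RTT cwnd trace (in MSS) up to ssthresh."""
    trace = [cwnd]
    while cwnd < ssthresh:
        cwnd *= 2                       # one doubling per RTT
        trace.append(min(cwnd, ssthresh))
    return trace

print(slow_start(ssthresh=16))
```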
TCP Congestion Control: Detecting and Reacting to Loss
• loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs (TCP Reno): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)
Switching from Slow Start to Congestion Avoidance (CA):
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation: variable ssthresh; on a loss event, ssthresh is set to 1/2 of cwnd just before the loss event.
[TCP congestion-control FSM (Reno):]
• initial state: slow start, with cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0
• slow start: on new ACK, cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed; when cwnd >= ssthresh, enter congestion avoidance
• congestion avoidance: on new ACK, cwnd = cwnd + MSS * (MSS/cwnd), dupACKcount = 0, transmit new segment(s) as allowed
• duplicate ACK (either state): dupACKcount++; when dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment, enter fast recovery
• fast recovery: on duplicate ACK, cwnd = cwnd + MSS, transmit new segment(s) as allowed; on new ACK, cwnd = ssthresh, dupACKcount = 0, return to congestion avoidance
• timeout (any state): ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment, enter slow start
TCP Congestion Control
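The loss reactions above can be condensed into one function; the event and variant labels are illustrative, and fast recovery's +3 inflation is omitted for brevity:

```python
# Sketch of TCP loss reactions (Reno vs. Tahoe), in MSS units:
# timeout        -> ssthresh = cwnd/2, cwnd = 1
# 3 dup ACKs     -> Reno: cwnd = ssthresh = cwnd/2; Tahoe: cwnd = 1 as well

def on_loss(cwnd, event, variant="reno"):
    """Return (new cwnd, new ssthresh) after a loss event."""
    ssthresh = max(cwnd // 2, 1)
    if event == "timeout" or variant == "tahoe":
        return 1, ssthresh                  # restart from slow start
    if event == "3dupack":
        return ssthresh, ssthresh           # Reno: halve and grow linearly
    raise ValueError(event)

print(on_loss(20, "timeout"), on_loss(20, "3dupack"))
```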
TCP Throughput
• avg TCP throughput as function of window size and RTT (ignore slow start, assume always data to send)
• W: window size (measured in bytes) where loss occurs
  - avg window size (# in-flight bytes) is 3/4 W
  - avg throughput is (3/4) W per RTT:
avg TCP throughput = (3/4) * W / RTT bytes/sec
[Figure: window oscillating between W/2 and W]
TCP over "long fat pipes"
• example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability L [Mathis 1997]:
TCP throughput = 1.22 * MSS / (RTT * sqrt(L))
• to achieve 10 Gbps throughput, need a loss rate of L = 2 * 10^-10, a very small loss rate
• new versions of TCP needed for high-speed
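The numbers in the example can be checked directly; function names are illustrative:

```python
from math import sqrt

# Sketch of the throughput formulas above:
#  - AIMD average: (3/4) * W / RTT, with W the window size (bytes) at loss
#  - Mathis et al. 1997: throughput ~= 1.22 * MSS / (RTT * sqrt(L))

def aimd_avg_throughput(w_bytes, rtt_s):
    return 0.75 * w_bytes / rtt_s

def mathis_throughput(mss_bytes, rtt_s, loss):
    return 1.22 * mss_bytes / (rtt_s * sqrt(loss))

# "Long fat pipe" example: 1500-byte segments, 100 ms RTT, 10 Gbps target.
mss, rtt = 1500, 0.100
target = 10e9 / 8                              # 10 Gbps in bytes/sec
w_segments = target * rtt / mss                # in-flight segments needed
loss_needed = (1.22 * mss / (rtt * target)) ** 2
print(round(w_segments), loss_needed)          # ~83,333 segments, L ~ 2e-10
```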
Goals for Today
• Transport Layer
  - Abstraction / services
  - Multiplexing/Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  - Congestion control
• Data Center TCP
  - Incast Problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008.
TCP Throughput Collapse: what happens when TCP is "too friendly"? E.g.:
• test on an Ethernet-based storage cluster
• client performs synchronized reads
• increase # of servers involved in transfer (SRU size is fixed)
• TCP used as the data transfer protocol
Cluster-based Storage Systems
[Figure: client connected via a switch to storage servers 1-4; the client issues requests (R) to each server; each server returns its Server Request Unit (SRU) portion of the data block in a synchronized read; the client then sends the next batch of requests]
Link idle time due to timeouts
[Figure: the same synchronized read over client/switch/servers 1-4, but server 4's response is lost; the link is idle until server 4 experiences a timeout]
TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• cause of throughput collapse: TCP timeouts
[Figure: goodput collapses as the number of servers increases]
TCP data-driven loss recovery
[Figure: sender transmits segments 1-5; segment 2 is lost; the receiver returns Ack 1 for segments 3, 4, and 5 (3 duplicate ACKs for 1: packet 2 is probably lost); the sender retransmits packet 2 immediately; the receiver then sends Ack 5]
• in SANs, data-driven recovery happens in microseconds after loss
TCP timeout-driven loss recovery
[Figure: sender transmits segments 1-5, but only segment 1 arrives; with no duplicate ACKs, the sender must wait out the Retransmission Timeout (RTO) before resending]
• timeouts are expensive (msecs to recover after loss)
TCP loss recovery comparison
[Figure: side by side, data-driven recovery (duplicate Ack 1s trigger retransmit of 2, then Ack 5) vs. timeout-driven recovery (sender idles through the RTO)]
• timeout-driven recovery is slow (ms)
• data-driven recovery is super fast (µs) in SANs
TCP Throughput Collapse: Summary
• synchronized reads and TCP timeouts cause TCP throughput collapse
• previously tried options:
  - increase buffer size (costly)
  - reduce RTOmin (unsafe)
  - use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP): limit in-network buffer (queue length) via both in-network signaling and end-to-end TCP modifications
Perspective
• principles behind transport layer services: multiplexing, demultiplexing; reliable data transfer; flow control; congestion control
• instantiation and implementation in the Internet: UDP, TCP
Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"
Before Next time
• Project Proposal
  - due in one week
  - meet with groups, TA, and professor
• Lab1
  - single-threaded TCP proxy
  - due in one week, next Friday
• No required reading and review due, but review chapter 4 from the book (Network Layer)
  - we will also briefly discuss data center topologies
• Check website for updated schedule
process
socket
use header info to deliverreceived segments to correct socket
demultiplexing at receiverhandle data from multiplesockets add transport header (later used for demultiplexing)
multiplexing at sender
transport
application
physical
link
network
P2P1
transport
application
physical
link
network
P4
transport
application
physical
link
network
P3
Transport LayerSockets MultiplexingDemultiplexing
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
bull Congestion control
bull Data Center TCPndash Incast Problem
source port dest port
32 bits
applicationdata (payload)
UDP segment format
length checksum
length in bytes of UDP segment
including header
no connection establishment (which can add delay)
simple no connection state at sender receiver
small header size no congestion control UDP
can blast away as fast as desired
why is there a UDP
UDP Connectionless TransportUDP Segment Header
UDP Connectionless Transport
senderbull treat segment contents
including header fields as sequence of 16-bit integers
bull checksum addition (onersquos complement sum) of segment contents
bull sender puts checksum value into UDP checksum field
receiverbull compute checksum of received
segmentbull check if computed checksum
equals checksum field value
ndash NO - error detectedndash YES - no error detected
But maybe errors nonetheless More later hellip
Goal detect ldquoerrorsrdquo (eg flipped bits) in transmitted segment
UDP Checksum
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
bull Congestion control
bull Data Center TCPndash Incast Problem
important in application transport link layers top-10 list of important networking topics
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
Principles of Reliable Transport
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
important in application transport link layers top-10 list of important networking topics
Principles of Reliable Transport
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
important in application transport link layers top-10 list of important networking topics
Principles of Reliable Transport
sendside
receiveside
rdt_send() called from above (eg by app) Passed data to deliver to receiver upper layer
udt_send() called by rdtto transfer packet over unreliable channel to receiver
rdt_rcv() called when packet arrives on rcv-side of channel
deliver_data() called by rdt to deliver data to upper
Principles of Reliable Transport
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
source port dest port
32 bits
applicationdata (variable length)
sequence number
acknowledgement number
receive window
Urg data pointerchecksum
FSRPAUheadlen
notused
options (variable length)
URG urgent data (generally not used)
ACK ACK valid
PSH push data now(generally not used)
RST SYN FINconnection estab(setup teardown
commands)
bytes rcvr willingto accept
countingby bytes of data(not segments)
Internetchecksum
(as in UDP)
TCP Reliable TransportTCP Segment Structure
sequence numbers
ndashbyte stream ldquonumberrdquo of first byte in segmentrsquos data
acknowledgements
ndashseq of next byte expected from other side
ndashcumulative ACKQ how receiver handles out-of-order segments
ndashA TCP spec doesnrsquot say - up to implementor
source port dest port
sequence number
acknowledgement number
checksum
rwnd
urg pointer
incoming segment to sender
A
sent ACKed
sent not-yet ACKed(ldquoin-flightrdquo)
usablebut not yet sent
not usable
window size N
sender sequence number space
source port dest port
sequence number
acknowledgement number
checksum
rwnd
urg pointer
outgoing segment from sender
TCP Reliable TransportTCP Sequence numbers and Acks
Usertypes
lsquoCrsquo
host ACKsreceipt
of echoedlsquoCrsquo
host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo
simple telnet scenario
Host BHost A
Seq=42 ACK=79 data = lsquoCrsquo
Seq=79 ACK=43 data = lsquoCrsquo
Seq=43 ACK=80
TCP Reliable TransportTCP Sequence numbers and Acks
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing
to establish connection)bull agree on connection parameters
connection state ESTABconnection variables
seq client-to-server server-to-clientrcvBuffer size at serverclient
application
network
connection state ESTABconnection Variables
seq client-to-server server-to-clientrcvBuffer size at serverclient
application
network
Socket clientSocket = newSocket(hostnameport
number)
Socket connectionSocket = welcomeSocketaccept()
Connection Management TCP 3-way handshake
TCP Reliable Transport
SYNbit=1 Seq=x
choose init seq num xsend TCP SYN msg
ESTAB
SYNbit=1 Seq=yACKbit=1 ACKnum=x+1
choose init seq num ysend TCP SYNACKmsg acking SYN
ACKbit=1 ACKnum=y+1
received SYNACK(x) indicates server is livesend ACK for SYNACK
this segment may contain client-to-server data
received ACK(y) indicates client is live
SYNSENT
ESTAB
SYN RCVD
client state
LISTEN
server state
LISTEN
TCP Reliable TransportConnection Management TCP 3-way handshake
closed
L
listen
SYNrcvd
SYNsent
ESTAB
Socket clientSocket = newSocket(hostnameport
number)
SYN(seq=x)
Socket connectionSocket = welcomeSocketaccept()
SYN(x)
SYNACK(seq=yACKnum=x+1)create new socket for communication back to client
SYNACK(seq=yACKnum=x+1)
ACK(ACKnum=y+1)ACK(ACKnum=y+1)
L
TCP Reliable TransportConnection Management TCP 3-way handshake
client server each close their side of connection send TCP segment with FIN bit = 1
respond to received FIN with ACK on receiving FIN ACK can be combined with
own FINsimultaneous FIN exchanges can be
handled
TCP Reliable TransportConnection Management Closing connection
FIN_WAIT_2
CLOSE_WAIT
FINbit=1 seq=y
ACKbit=1 ACKnum=y+1
ACKbit=1 ACKnum=x+1 wait for server
close
can stillsend data
can no longersend data
LAST_ACK
CLOSED
TIMED_WAIT
timed wait for 2max
segment lifetime
CLOSED
FIN_WAIT_1 FINbit=1 seq=xcan no longersend but can receive data
clientSocketclose()
client state server state
ESTABESTAB
TCP Reliable TransportConnection Management Closing connection
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
data rcvd from appcreate segment with seq
seq is byte-stream
number of first data byte in segment
start timer if not already running think of timer as for
oldest unacked segment expiration interval
TimeOutInterval
timeoutretransmit segment that
caused timeoutrestart timer ack rcvdif ack acknowledges
previously unacked segments update what is known to
be ACKed start timer if there are
still unacked segments
TCP Reliable Transport
lost ACK scenario
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8 bytes of data
Xtim
eout
ACK=100
premature timeout
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8bytes of data
tim
eout
ACK=120
Seq=100 20 bytes of data
ACK=120
SendBase=100
SendBase=120
SendBase=120
SendBase=92
TCP Reliable TransportTCP Retransmission Scenerios
X
cumulative ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=120 15 bytes of data
tim
eout
Seq=100 20 bytes of data
ACK=120
TCP Reliable TransportTCP Retransmission Scenerios
event at receiver
arrival of in-order segment withexpected seq All data up toexpected seq already ACKed
arrival of in-order segment withexpected seq One other segment has ACK pending
arrival of out-of-order segmenthigher-than-expect seq Gap detected
arrival of segment that partially or completely fills gap
TCP receiver action
delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK
immediately send single cumulative ACK ACKing both in-order segments
immediately send duplicate ACK indicating seq of next expected byte
immediate send ACK provided thatsegment starts at lower end of gap
TCP ACK generation [RFC 1122 2581]Reliable Transport
time-out period often relatively long long delay before
resending lost packetdetect lost segments via
duplicate ACKs sender often sends many
segments back-to-back if segment is lost there
will likely be many duplicate ACKs
if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq
likely that unacked segment lost so donrsquot wait for timeout
TCP fast retransmit
(ldquotriple duplicate ACKsrdquo)
TCP Reliable TransportTCP Fast Retransmit
X
fast retransmit after sender receipt of triple duplicate ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
tim
eout
ACK=100
ACK=100
ACK=100
Seq=100 20 bytes of data
Seq=100 20 bytes of data
TCP Reliable TransportTCP Fast Retransmit
Q how to set TCP timeout value
longer than RTT but RTT varies
too short premature timeout unnecessary retransmissions
too long slow reaction to segment loss
Q how to estimate RTT
bull SampleRTT measured time from segment transmission until ACK receipt
ndash ignore retransmissions
bull SampleRTT will vary want estimated RTT ldquosmootherrdquo
ndash average several recent measurements not just current SampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
RTT gaiacsumassedu to fantasiaeurecomfr
100
150
200
250
300
350
1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106
time (seconnds)
RTT
(mill
isec
onds
)
SampleRTT Estimated RTT
EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases
exponentially fast typical value = 0125
RTT (
mill
iseco
nds)
RTT gaiacsumassedu to fantasiaeurecomfr
sampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
time (seconds)
bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin
bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|
(typically = 025)
TimeoutInterval = EstimatedRTT + 4DevRTT
estimated RTT ldquosafety marginrdquo
TCP Reliable TransportTCP Roundtrip time and timeouts
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
applicationprocess
TCP socketreceiver buffers
TCPcode
IPcode
application
OS
receiver protocol stack
application may remove data from
TCP socket buffers hellip
hellip slower than TCP receiver is delivering(sender is sending)
from sender
receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast
flow control
TCP Reliable TransportFlow Control
[Figure: receiver-side buffering: RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads arrive from the network, data flows up to the application process]
- receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
- sender limits amount of unACKed ("in-flight") data to receiver's rwnd value
- guarantees receive buffer will not overflow
TCP Reliable Transport: Flow Control
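A minimal sketch of the sender-side rule: the 4096-byte rwnd mirrors the typical default mentioned above, while the byte counts are invented for illustration:

```python
# Flow control on the sender: unACKed ("in-flight") data must not
# exceed the receiver's advertised window rwnd, so the receive
# buffer cannot overflow.
def sendable_bytes(last_byte_sent, last_byte_acked, rwnd):
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)   # how much more may be sent right now

# Illustrative numbers: 4096-byte advertised window, 3000 bytes in flight.
print(sendable_bytes(last_byte_sent=3000, last_byte_acked=0, rwnd=4096))  # 1096
```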
Goals for Today
- Transport Layer
  - Abstraction / services
  - Multiplexing / Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport
    - Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  - Congestion control
- Data Center TCP
  - Incast Problem
congestion:
- informally: "too many sources sending too much data too fast for network to handle"
- different from flow control!
- manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control:
- end-end congestion control:
  - no explicit feedback from network
  - congestion inferred from end-system observed loss, delay
  - approach taken by TCP
- network-assisted congestion control:
  - routers provide feedback to end systems
    - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
    - explicit rate for sender to send at
Principles of Congestion Control
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]
TCP Congestion Control: TCP Fairness
AIMD: additive increase, multiplicative decrease
- approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
- additive increase: increase cwnd by 1 MSS every RTT until loss detected
- multiplicative decrease: cut cwnd in half after loss
[Figure: cwnd (TCP sender congestion window size) vs. time: AIMD saw tooth behavior, probing for bandwidth; additively increase window size … until loss occurs (then cut window in half)]
TCP Congestion Control: TCP Fairness, Why is TCP Fair?
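The sawtooth can be reproduced with a toy trace; the 20-MSS loss threshold is an invented stand-in for the bottleneck capacity, not a real link parameter:

```python
# Toy AIMD trace: +1 MSS per RTT, halve on loss. Loss is simulated
# whenever cwnd exceeds an invented 20-MSS "capacity".
def aimd_trace(rtts, capacity_mss=20, cwnd=1.0):
    trace = []
    for _ in range(rtts):
        trace.append(cwnd)
        if cwnd > capacity_mss:
            cwnd = cwnd / 2          # multiplicative decrease after loss
        else:
            cwnd = cwnd + 1          # additive increase: +1 MSS per RTT
    return trace

print(aimd_trace(30))                # rises 1, 2, ... 21, halves to 10.5, repeats
```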
sender limits transmission:
  LastByteSent - LastByteAcked ≤ cwnd
- cwnd is dynamic, function of perceived network congestion
[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; window of width cwnd]
TCP sending rate:
- roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
- rate ≈ cwnd / RTT bytes/sec
TCP Congestion Control
why is TCP fair? two competing sessions:
- additive increase gives slope of 1, as throughput increases
- multiplicative decrease decreases throughput proportionally
[Figure: Connection 1 throughput (x-axis) vs. Connection 2 throughput (y-axis), both bounded by R; repeated rounds of congestion avoidance (additive increase) and loss (decrease window by factor of 2) move the operating point toward the equal bandwidth share line]
TCP Congestion Control: TCP Fairness, Why is TCP Fair?
Fairness and UDP:
- multimedia apps often do not use TCP: do not want rate throttled by congestion control
- instead use UDP: send audio/video at constant rate, tolerate packet loss
Fairness and parallel TCP connections:
- application can open multiple parallel connections between two hosts
- web browsers do this
- e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
TCP Congestion Control: TCP Fairness
when connection begins, increase rate exponentially until first loss event:
- initially cwnd = 1 MSS
- double cwnd every RTT
- done by incrementing cwnd for every ACK received
summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment to Host B, then two segments, then four segments, one round per RTT]
TCP Congestion Control: Slow Start
TCP Congestion Control: Detecting and Reacting to Loss
- loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
- loss indicated by 3 duplicate ACKs (TCP Reno): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
- TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation:
- variable ssthresh
- on loss event, ssthresh is set to 1/2 of cwnd just before loss event
TCP Congestion Control FSM:
slow start (entry: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0):
- new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s) as allowed
- duplicate ACK: dupACKcount++
- dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; -> fast recovery
- cwnd > ssthresh: -> congestion avoidance
- timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
congestion avoidance:
- new ACK: cwnd = cwnd + MSS·(MSS/cwnd); dupACKcount = 0; transmit new segment(s) as allowed
- duplicate ACK: dupACKcount++
- dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; -> fast recovery
- timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; -> slow start
fast recovery:
- duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s) as allowed
- new ACK: cwnd = ssthresh; dupACKcount = 0; -> congestion avoidance
- timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; -> slow start
TCP Congestion Control
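A drastically simplified sketch of the slow-start/congestion-avoidance transitions above: fast recovery's window inflation is omitted, quantities are in MSS units, and the event sequence is invented for illustration:

```python
# Simplified slow start -> congestion avoidance, driven by per-RTT events.
# All quantities in MSS units; fast recovery is collapsed into the
# triple-duplicate-ACK reaction (halve cwnd, continue linearly).
def react(state, cwnd, ssthresh, event):
    if event == "timeout":
        return "slow_start", 1.0, max(cwnd / 2, 1.0)
    if event == "triple_dup_ack":            # Reno-style: halve, grow linearly
        return "congestion_avoidance", cwnd / 2, cwnd / 2
    # event == "new_ack_rtt": one RTT's worth of new ACKs arrived
    if state == "slow_start":
        cwnd *= 2                            # exponential growth
        if cwnd >= ssthresh:
            state = "congestion_avoidance"
    else:
        cwnd += 1                            # linear growth
    return state, cwnd, ssthresh

state, cwnd, ssthresh = "slow_start", 1.0, 8.0
for ev in ["new_ack_rtt"] * 4 + ["triple_dup_ack", "new_ack_rtt"]:
    state, cwnd, ssthresh = react(state, cwnd, ssthresh, ev)
print(state, cwnd, ssthresh)
```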
- avg TCP thruput as function of window size, RTT?
  - ignore slow start, assume always data to send
- W: window size (measured in bytes) where loss occurs
  - avg window size (# in-flight bytes) is 3/4 W
  - avg thruput is 3/4 W per RTT
avg TCP thruput = (3/4)·W / RTT bytes/sec
[Figure: window size saw tooth oscillating between W/2 and W]
TCP Throughput
TCP over "long, fat pipes"
- example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
- requires W = 83,333 in-flight segments
- throughput in terms of segment loss probability, L [Mathis 1997]:
  TCP throughput = (1.22·MSS) / (RTT·√L)
- to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰ – a very small loss rate!
- new versions of TCP for high-speed
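Plugging the slide's numbers into the two formulas checks the arithmetic:

```python
# Verify the "long, fat pipe" example: 1500-byte segments, 100 ms RTT,
# 10 Gbps target throughput.
mss = 1500 * 8            # segment size in bits
rtt = 0.100               # seconds
target = 10e9             # 10 Gbps

# Required window: throughput = W * MSS / RTT  =>  W = target * RTT / MSS
W = target * rtt / mss
print(round(W))           # ~83333 in-flight segments

# Mathis 1997: throughput = 1.22 * MSS / (RTT * sqrt(L)).
# Solve for the loss rate L that sustains the target throughput.
L = (1.22 * mss / (rtt * target)) ** 2
print(L)                  # ~2e-10
```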
Goals for Today
- Transport Layer
  - Abstraction / services
  - Multiplexing / Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport
    - Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  - Congestion control
- Data Center TCP
  - Incast Problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan. Proc. of USENIX File and Storage Technologies (FAST), February 2008.
TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
- Test on an Ethernet-based storage cluster
- Client performs synchronized reads
- Increase # of servers involved in transfer
  - SRU size is fixed
- TCP used as the data transfer protocol
Cluster-based Storage Systems
[Figure: synchronized read: the client sends a request (R) through a switch to each of 4 storage servers; each server returns its Server Request Unit (SRU), pieces 1-4 of the data block; once all arrive, the client sends the next batch of requests]
Link idle time due to timeouts
[Figure: same synchronized read, but server 4's SRU is lost; the link is idle until server 4 experiences a timeout]
TCP Throughput Collapse: Incast
- [Nagle04] called this Incast
- Cause of throughput collapse: TCP timeouts
[Figure: goodput collapses as the number of servers grows]
TCP data-driven loss recovery
[Figure: sender transmits segments 1-5; segment 2 is lost; the receiver ACKs 1 four times; 3 duplicate ACKs for 1 mean packet 2 is probably lost, so the sender retransmits packet 2 immediately; the receiver then sends Ack 5]
In SANs, recovery in usecs after loss.
TCP timeout-driven loss recovery
[Figure: sender transmits segments 1-5; segments 2-5 are lost, so only a single Ack 1 returns; with no duplicate ACKs, the sender must wait out the Retransmission Timeout (RTO) before resending]
- Timeouts are expensive (msecs to recover after loss)
TCP Loss recovery comparison
[Figure: side by side: data-driven recovery (3 duplicate ACKs trigger immediate retransmit of 2, then Ack 5) vs. timeout-driven recovery (sender idles for the full RTO)]
- Timeout-driven recovery is slow (ms)
- Data-driven recovery is super fast (us) in SANs
TCP Throughput Collapse: Summary
- Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
- Previously tried options:
  - Increase buffer size (costly)
  - Reduce RTOmin (unsafe)
  - Use Ethernet Flow Control (limited applicability)
- DCTCP (Data Center TCP): limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications
Perspective
- principles behind transport layer services: multiplexing, demultiplexing; reliable data transfer; flow control; congestion control
- instantiation, implementation in the Internet: UDP, TCP
Next time: Network Layer
- leaving the network "edge" (application, transport layers)
- into the network "core"
Before Next time
- Project Proposal
  - due in one week
  - Meet with groups, TA, and professor
- Lab1
  - Single threaded TCP proxy
  - Due in one week, next Friday
- No required reading and review due
- But, review chapter 4 from the book, Network Layer
  - We will also briefly discuss data center topologies
- Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
Goals for Today
- Transport Layer
  - Abstraction / services
  - Multiplexing / Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport
    - Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  - Congestion control
- Data Center TCP
  - Incast Problem
UDP Connectionless Transport: UDP Segment Header
[Figure: UDP segment format, 32 bits wide: source port #, dest port #; length, checksum; application data (payload)]
- length: in bytes of UDP segment, including header
why is there a UDP?
- no connection establishment (which can add delay)
- simple: no connection state at sender, receiver
- small header size
- no congestion control: UDP can blast away as fast as desired
UDP Connectionless Transport
UDP Checksum
Goal: detect "errors" (e.g., flipped bits) in transmitted segment
sender:
- treat segment contents, including header fields, as sequence of 16-bit integers
- checksum: addition (one's complement sum) of segment contents
- sender puts checksum value into UDP checksum field
receiver:
- compute checksum of received segment
- check if computed checksum equals checksum field value:
  - NO: error detected
  - YES: no error detected. But maybe errors nonetheless? More later …
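A minimal sketch of the one's-complement sum described above; the example bytes are invented, and real UDP also covers a pseudo-header, which is omitted here:

```python
# 16-bit one's complement checksum: sum 16-bit words, wrap carries
# back into the low 16 bits, then complement the result.
def ones_complement_checksum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"                           # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]     # next 16-bit word
        total = (total & 0xFFFF) + (total >> 16)  # wrap the carry around
    return ~total & 0xFFFF

data = bytes([0x00, 0x01, 0xf2, 0x03, 0xf4, 0xf5, 0xf6, 0xf7])  # invented words
checksum = ones_complement_checksum(data)
print(hex(checksum))
# Receiver check: summing the segment *with* its checksum yields all ones,
# so the checksum over it comes out zero when no bits flipped.
print(ones_complement_checksum(data + checksum.to_bytes(2, "big")))  # 0
```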
Goals for Today
- Transport Layer
  - Abstraction / services
  - Multiplexing / Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport
    - Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  - Congestion control
- Data Center TCP
  - Incast Problem
Principles of Reliable Transport
- important in application, transport, link layers: top-10 list of important networking topics!
- characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
[Figure: rdt interfaces, send side and receive side]
- rdt_send(): called from above (e.g., by app); passed data to deliver to receiver upper layer
- udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
- rdt_rcv(): called when packet arrives on rcv-side of channel
- deliver_data(): called by rdt to deliver data to upper layer
Principles of Reliable Transport
TCP Reliable Transport: TCP Segment Structure
[Figure: TCP segment format, 32 bits wide: source port #, dest port #; sequence number and acknowledgement number (counting by bytes of data, not segments!); head len, unused bits, flag bits U/A/P/R/S/F; receive window (# bytes rcvr willing to accept); checksum (Internet checksum, as in UDP); urg data pointer; options (variable length); application data (variable length)]
- URG: urgent data (generally not used)
- ACK: ACK # valid
- PSH: push data now (generally not used)
- RST, SYN, FIN: connection estab (setup, teardown commands)
sequence numbers:
- byte stream "number" of first byte in segment's data
acknowledgements:
- seq # of next byte expected from other side
- cumulative ACK
Q: how does receiver handle out-of-order segments?
A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender (source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer); sender sequence number space: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable; window of size N]
TCP Reliable Transport: TCP Sequence Numbers and ACKs
simple telnet scenario:
[Figure: Host A, Host B. User types 'C': Host A sends Seq=42, ACK=79, data = 'C'. Host B ACKs receipt of 'C' and echoes back 'C': Seq=79, ACK=43, data = 'C'. Host A ACKs receipt of echoed 'C': Seq=43, ACK=80.]
TCP Reliable Transport: TCP Sequence Numbers and ACKs
TCP Reliable Transport: Connection Management
before exchanging data, sender/receiver "handshake":
- agree to establish connection (each knowing the other is willing to establish connection)
- agree on connection parameters
connection state: ESTAB
connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP Reliable Transport: Connection Management, TCP 3-way handshake
client state: LISTEN -> SYNSENT -> ESTAB; server state: LISTEN -> SYN RCVD -> ESTAB
1. client: choose init seq num, x; send TCP SYN msg: SYNbit=1, Seq=x
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK: ACKbit=1, ACKnum=y+1; this segment may contain client-to-server data
4. server: received ACK(y) indicates client is live
[Figure: 3-way handshake FSMs. Client: closed -> (Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)) -> SYN sent -> (receive SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)) -> ESTAB. Server: closed -> (Socket connectionSocket = welcomeSocket.accept()) -> listen -> (receive SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client) -> SYN rcvd -> (receive ACK(ACKnum=y+1)) -> ESTAB.]
TCP Reliable Transport: Connection Management, TCP 3-way handshake
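The handshake itself is performed by the kernel inside connect()/accept(); a self-contained Python sketch mirroring the slide's Socket calls (the loopback address and one-byte echo payload are illustrative):

```python
# Connection setup as in the slides: server listens, client connects
# (SYN -> SYNACK -> ACK happens inside the OS), then one echo like the
# telnet example.
import socket
import threading

def echo_server(welcome: socket.socket) -> None:
    conn, _addr = welcome.accept()      # 3-way handshake completes here (ESTAB)
    conn.sendall(conn.recv(16))         # echo one segment back
    conn.close()

welcome = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
welcome.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
welcome.listen(1)                       # server state: LISTEN
port = welcome.getsockname()[1]

t = threading.Thread(target=echo_server, args=(welcome,))
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))     # SYN sent -> SYNACK -> ACK: ESTAB
client.sendall(b"C")
echoed = client.recv(16)
client.close()
t.join()
welcome.close()
print(echoed)                           # b'C', echoed back
```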
TCP Reliable Transport: Connection Management, Closing connection
- client, server each close their side of connection: send TCP segment with FIN bit = 1
- respond to received FIN with ACK; on receiving FIN, ACK can be combined with own FIN
- simultaneous FIN exchanges can be handled
[Figure: closing a connection. Client (ESTAB) calls clientSocket.close() and sends FINbit=1, seq=x, entering FIN_WAIT_1 (can no longer send, but can receive data). Server (ESTAB) replies ACKbit=1, ACKnum=x+1 and enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2, waiting for server close. Server sends FINbit=1, seq=y, entering LAST_ACK (can no longer send data). Client replies ACKbit=1, ACKnum=y+1 and enters TIMED_WAIT (timed wait for 2 × max segment lifetime); both sides reach CLOSED.]
TCP sender events:
data rcvd from app:
- create segment with seq #
- seq # is byte-stream number of first data byte in segment
- start timer if not already running
  - think of timer as for oldest unACKed segment
  - expiration interval: TimeOutInterval
timeout:
- retransmit segment that caused timeout
- restart timer
ack rcvd:
- if ack acknowledges previously unACKed segments:
  - update what is known to be ACKed
  - start timer if there are still unACKed segments
TCP Reliable Transport
TCP Reliable Transport: TCP Retransmission Scenarios
[Figure: lost ACK scenario. Host A sends Seq=92, 8 bytes of data; Host B's ACK=100 is lost (X); on timeout, Host A retransmits Seq=92, 8 bytes of data; Host B re-sends ACK=100. Premature timeout scenario: Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; the ACKs (ACK=100, ACK=120) are delayed, so the timer fires and Host A retransmits Seq=92, 8 bytes (SendBase advances 92 -> 100 -> 120); Host B replies with cumulative ACK=120.]
[Figure: cumulative ACK scenario. Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; ACK=100 is lost (X) but cumulative ACK=120 arrives before the timeout, so Host A proceeds directly to Seq=120, 15 bytes.]
TCP Reliable Transport: TCP Retransmission Scenarios
event at receiver -> TCP receiver action:
- arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed -> delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
- arrival of in-order segment with expected seq #; one other segment has ACK pending -> immediately send single cumulative ACK, ACKing both in-order segments
- arrival of out-of-order segment with higher-than-expected seq #; gap detected -> immediately send duplicate ACK, indicating seq # of next expected byte
- arrival of segment that partially or completely fills gap -> immediately send ACK, provided that segment starts at lower end of gap
TCP ACK generation [RFC 1122, RFC 2581], Reliable Transport
time-out period often relatively long:
- long delay before resending lost packet
detect lost segments via duplicate ACKs:
- sender often sends many segments back-to-back
- if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unACKed segment with smallest seq #
- likely that unACKed segment lost, so don't wait for timeout
TCP Reliable Transport: TCP Fast Retransmit
[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; the Seq=100 segment is lost (X); Host B sends ACK=100 four times; after the third duplicate ACK, Host A retransmits Seq=100, 20 bytes without waiting for the timeout.]
TCP Reliable Transport: TCP Fast Retransmit
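The triple-duplicate-ACK rule can be sketched as a sender-side counter; the ACK stream below mirrors the figure, everything else is invented for illustration:

```python
# Count duplicate ACKs; on the third duplicate, fast-retransmit the
# segment starting at that ACK number instead of waiting for a timeout.
def process_acks(acks):
    retransmitted = []
    last_ack, dup_count = None, 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == 3:               # triple duplicate ACK
                retransmitted.append(ack)    # resend unACKed segment w/ smallest seq #
        else:
            last_ack, dup_count = ack, 0     # new ACK resets the counter
    return retransmitted

# As in the figure: ACK=100 arrives four times (three duplicates).
print(process_acks([100, 100, 100, 100]))    # [100]
```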
Q: how to set TCP timeout value?
- longer than RTT, but RTT varies
- too short: premature timeout, unnecessary retransmissions
- too long: slow reaction to segment loss
Q: how to estimate RTT?
- SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
- SampleRTT will vary; want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
TCP Reliable Transport: TCP Round-trip Time and Timeouts
[Figure: RTT (milliseconds) vs. time (seconds), gaia.cs.umass.edu to fantasia.eurecom.fr: SampleRTT fluctuates between roughly 100 and 350 ms, while EstimatedRTT tracks it smoothly]
EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases
exponentially fast typical value = 0125
RTT (
mill
iseco
nds)
RTT gaiacsumassedu to fantasiaeurecomfr
sampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
time (seconds)
bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin
bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|
(typically = 025)
TimeoutInterval = EstimatedRTT + 4DevRTT
estimated RTT ldquosafety marginrdquo
TCP Reliable TransportTCP Roundtrip time and timeouts
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
applicationprocess
TCP socketreceiver buffers
TCPcode
IPcode
application
OS
receiver protocol stack
application may remove data from
TCP socket buffers hellip
hellip slower than TCP receiver is delivering(sender is sending)
from sender
receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast
flow control
TCP Reliable TransportFlow Control
buffered data
free buffer spacerwnd
RcvBuffer
TCP segment payloads
to application processbull receiver ldquoadvertisesrdquo free
buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via
socket options (typical default is 4096 bytes)
ndash many operating systems autoadjust RcvBuffer
bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value
bull guarantees receive buffer will not overflow
receiver-side buffering
TCP Reliable TransportFlow Control
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
congestionbull informally ldquotoo many sources sending too
much data too fast for network to handlerdquobull different from flow controlbull manifestations
ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control
end-end congestion control
no explicit feedback from network
congestion inferred from end-system observed loss delay
approach taken by TCP
network-assisted congestion control
routers provide feedback to end systemssingle bit indicating
congestion (SNA DECbit TCPIP ECN ATM)
explicit rate for sender to send at
Principles of Congestion Control
fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK
TCP connection 1
bottleneckroutercapacity RTCP connection 2
TCP Congestion ControlTCP Fairness
approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1
MSS every RTT until loss detected multiplicative decrease cut cwnd in half
after loss
cwnd
TC
P s
end
er
cong
estio
n w
indo
w s
ize
AIMD saw toothbehavior probing
for bandwidth
additively increase window size helliphellip until loss occurs (then cut window in half)
time
TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease
sender limits transmission
cwnd is dynamic function of perceived network congestion
TCP sending rateroughly send cwnd
bytes wait RTT for ACKS then send more bytes
last byteACKed sent not-
yet ACKed(ldquoin-flightrdquo)
last byte sent
cwnd
LastByteSent-LastByteAcked
lt cwnd
sender sequence number space
rate ~~cwnd
RTTbytessec
TCP Congestion Control
two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally
R
R
equal bandwidth share
Connection 1 throughput
Con
nect
ion
2 th
roug
h pu t
congestion avoidance additive increaseloss decrease window by factor of 2
congestion avoidance additive increaseloss decrease window by factor of 2
TCP Congestion ControlTCP Fairness Why is TCP Fair
Fairness and UDPmultimedia apps
often do not use TCP do not want rate
throttled by congestion control
instead use UDP send audiovideo at
constant rate tolerate packet loss
Fairness parallel TCP connections
application can open multiple parallel connections between two hosts
web browsers do this eg link of rate R with 9
existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2
TCP Congestion ControlTCP Fairness
when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd
for every ACK receivedsummary initial rate is
slow but ramps up exponentially fast
Host A
one segment
RT
T
Host B
time
two segments
four segments
TCP Congestion ControlSlow Start
TCP Congestion Control
loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly
loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
Detecting and Reacting to Loss
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++
duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++
duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP Congestion Control
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
bull Timeouts are expensive(msecs to recover after loss)
TCP Loss recovery comparison
Sender Receiver
12345
Ack 1
Ack 1Ack 1Ack 1
Retransmit 2
Seq
Ack 5
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
Timeout driven recovery is slow (ms)
Data-driven recovery issuper fast (us) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control
instantiation implementation in the Internet UDP TCP
Next timebull Network Layerbull leaving the network
ldquoedgerdquo (application transport layers)
bull into the network ldquocorerdquo
Perspective
Before Next time
• Project Proposal
– due in one week
– meet with groups, TA, and professor
• Lab 1
– single-threaded TCP proxy
– due in one week, next Friday
• No required reading and review due
• But review Chapter 4 from the book: Network Layer
– we will also briefly discuss data center topologies
• Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
source port dest port
32 bits
application data (payload)
UDP segment format
length checksum
length, in bytes, of UDP segment, including header
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small header size
• no congestion control: UDP can blast away as fast as desired
why is there a UDP?
UDP Connectionless Transport: UDP Segment Header
UDP Connectionless Transport
sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: addition (one's complement sum) of segment contents
• sender puts checksum value into UDP checksum field
receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
– NO - error detected
– YES - no error detected. But maybe errors nonetheless? More later …
Goal: detect "errors" (e.g., flipped bits) in transmitted segment
UDP Checksum
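The sender/receiver steps above can be sketched in a few lines of Python (a minimal illustration with hypothetical helper names, not code from the slides): 16-bit words are summed with end-around carry, then complemented.

```python
# Sketch of an Internet-checksum-style computation: one's-complement sum
# of 16-bit words, then bitwise complement. Helper names are hypothetical.
def internet_checksum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"                 # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]        # next 16-bit word
        total = (total & 0xFFFF) + (total >> 16)     # end-around carry
    return ~total & 0xFFFF              # one's complement of the sum

# Receiver check: recompute over data plus the transmitted checksum field;
# a result of 0 means no error detected (but maybe errors nonetheless).
def receiver_ok(data: bytes, cksum: int) -> bool:
    return internet_checksum(data + cksum.to_bytes(2, "big")) == 0
```

Note the one's-complement quirk: carries out of the top bit wrap around into the low bit, which is what makes the receiver's sum come out to all ones (complement zero) when nothing was corrupted.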
Goals for Today
• Transport Layer
– Abstraction / services
– Multiplexing/Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport
• Abstraction: Connection Management, Reliable Transport, Flow Control, timeouts
• Congestion control
• Data Center TCP
– Incast Problem
important in application, transport, link layers: top-10 list of important networking topics
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
Principles of Reliable Transport
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
important in application, transport, link layers: top-10 list of important networking topics
Principles of Reliable Transport
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
important in application, transport, link layers: top-10 list of important networking topics
Principles of Reliable Transport
sendside
receiveside
rdt_send(): called from above (e.g., by app); passed data to deliver to receiver upper layer
udt_send(): called by rdt to transfer packet over unreliable channel to receiver
rdt_rcv(): called when packet arrives on rcv-side of channel
deliver_data(): called by rdt to deliver data to upper layer
Principles of Reliable Transport
• point-to-point:
– one sender, one receiver
• reliable, in-order byte stream:
– no "message boundaries"
• pipelined:
– TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver
TCP: Transmission Control Protocol, RFCs 793, 1122, 1323, 2018, 2581
TCP Reliable Transport
source port, dest port
32 bits
application data (variable length)
sequence number
acknowledgement number
receive window
checksum, urg data pointer
head len, not used, flag bits: U A P R S F
options (variable length)
URG: urgent data (generally not used)
ACK: ACK # valid
PSH: push data now (generally not used)
RST, SYN, FIN: connection estab (setup, teardown commands)
receive window: # bytes rcvr willing to accept
counting by bytes of data (not segments!)
Internet checksum (as in UDP)
TCP Reliable Transport: TCP Segment Structure
sequence numbers:
– byte stream "number" of first byte in segment's data
acknowledgements:
– seq # of next byte expected from other side
– cumulative ACK
Q: how does receiver handle out-of-order segments?
A: TCP spec doesn't say - up to implementor
source port dest port
sequence number
acknowledgement number
checksum
rwnd
urg pointer
incoming segment to sender
A
sent, ACKed
sent, not-yet ACKed ("in-flight")
usable but not yet sent
not usable
window size N
sender sequence number space
source port dest port
sequence number
acknowledgement number
checksum
rwnd
urg pointer
outgoing segment from sender
TCP Reliable Transport: TCP Sequence Numbers and ACKs
User types 'C'
host ACKs receipt of echoed 'C'
host ACKs receipt of 'C', echoes back 'C'
simple telnet scenario
Host BHost A
Seq=42 ACK=79 data = lsquoCrsquo
Seq=79 ACK=43 data = lsquoCrsquo
Seq=43 ACK=80
TCP Reliable Transport: TCP Sequence Numbers and ACKs
• point-to-point:
– one sender, one receiver
• reliable, in-order byte stream:
– no "message boundaries"
• pipelined:
– TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver
TCP: Transmission Control Protocol, RFCs 793, 1122, 1323, 2018, 2581
TCP Reliable Transport
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
connection state: ESTAB
connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client
application / network
connection state: ESTAB
connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client
application / network
Socket clientSocket = new Socket("hostname", "port number");
Socket connectionSocket = welcomeSocket.accept();
Connection Management: TCP 3-way handshake
TCP Reliable Transport
SYNbit=1 Seq=x
choose init seq num xsend TCP SYN msg
ESTAB
SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
choose init seq num y; send TCP SYNACK msg, acking SYN
ACKbit=1, ACKnum=y+1
received SYNACK(x): indicates server is live; send ACK for SYNACK
this segment may contain client-to-server data
received ACK(y) indicates client is live
SYNSENT
ESTAB
SYN RCVD
client state
LISTEN
server state
LISTEN
TCP Reliable Transport: Connection Management: TCP 3-way handshake
closed
L
listen
SYNrcvd
SYNsent
ESTAB
Socket clientSocket = new Socket("hostname", "port number");
SYN(seq=x)
Socket connectionSocket = welcomeSocket.accept();
SYN(x)
SYNACK(seq=y, ACKnum=x+1): create new socket for communication back to client
SYNACK(seq=y, ACKnum=x+1)
ACK(ACKnum=y+1)
ACK(ACKnum=y+1)
Λ
TCP Reliable Transport: Connection Management: TCP 3-way handshake
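From the application's point of view, the kernel performs this whole exchange: a minimal loopback sketch (my own demo, not from the slides) in which connect() triggers the SYN / SYNACK / ACK handshake and accept() returns once it completes.

```python
import socket
import threading

# Loopback demo: the kernel performs the 3-way handshake when the client
# calls connect(); accept() returns a new per-connection server socket.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
server.listen(1)                     # server state: LISTEN
port = server.getsockname()[1]

def serve():
    conn, _addr = server.accept()    # completes after SYN / SYNACK / ACK
    conn.sendall(b"hello")
    conn.close()

t = threading.Thread(target=serve)
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))  # client: SYNSENT -> ESTAB
reply = b""
while len(reply) < 5:                # data flows only after ESTAB
    chunk = client.recv(5 - len(reply))
    if not chunk:
        break
    reply += chunk
t.join()
client.close()
server.close()
```

The per-connection socket returned by accept() is the slide's "new socket for communication back to client", distinct from the listening welcome socket.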
client, server each close their side of connection:
– send TCP segment with FIN bit = 1
respond to received FIN with ACK
– on receiving FIN, ACK can be combined with own FIN
simultaneous FIN exchanges can be handled
TCP Reliable Transport: Connection Management: Closing connection
FIN_WAIT_2
CLOSE_WAIT
FINbit=1 seq=y
ACKbit=1 ACKnum=y+1
ACKbit=1, ACKnum=x+1
wait for server close
can stillsend data
can no longersend data
LAST_ACK
CLOSED
TIMED_WAIT
timed wait for 2 × max segment lifetime
CLOSED
FIN_WAIT_1: FINbit=1, seq=x; can no longer send but can receive data
clientSocket.close()
client state server state
ESTABESTAB
TCP Reliable Transport: Connection Management: Closing connection
• point-to-point:
– one sender, one receiver
• reliable, in-order byte stream:
– no "message boundaries"
• pipelined:
– TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver
TCP: Transmission Control Protocol, RFCs 793, 1122, 1323, 2018, 2581
TCP Reliable Transport
data rcvd from app:
– create segment with seq #
– seq # is byte-stream number of first data byte in segment
– start timer if not already running (think of timer as for oldest unacked segment); expiration interval: TimeOutInterval
timeout:
– retransmit segment that caused timeout
– restart timer
ack rcvd:
– if ack acknowledges previously unacked segments: update what is known to be ACKed; start timer if there are still unacked segments
TCP Reliable Transport
lost ACK scenario
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8 bytes of data
X (loss)
timeout
ACK=100
premature timeout
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92, 8 bytes of data
timeout
ACK=120
Seq=100 20 bytes of data
ACK=120
SendBase=100
SendBase=120
SendBase=120
SendBase=92
TCP Reliable Transport: TCP Retransmission Scenarios
X
cumulative ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=120 15 bytes of data
timeout
Seq=100 20 bytes of data
ACK=120
TCP Reliable Transport: TCP Retransmission Scenarios
event at receiver → TCP receiver action:
– arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
– arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
– arrival of out-of-order segment with higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
– arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap
TCP ACK generation [RFC 1122, RFC 2581]: Reliable Transport
• time-out period often relatively long: long delay before resending lost packet
• detect lost segments via duplicate ACKs:
– sender often sends many segments back-to-back
– if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
– likely that unacked segment lost, so don't wait for timeout
TCP Reliable Transport: TCP Fast Retransmit
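The duplicate-ACK bookkeeping behind fast retransmit can be sketched as follows (a toy model, function and names are mine, not the slides'): the sender counts repeats of the same cumulative ACK and retransmits on the third duplicate rather than waiting for the retransmission timeout.

```python
# Toy fast-retransmit logic: acks is the stream of cumulative ACK numbers
# seen by the sender; returns the seq #s it would retransmit immediately.
def fast_retransmits(acks):
    last_ack = None
    dup_count = 0
    resends = []
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == 3:          # "triple duplicate ACK"
                resends.append(ack)     # resend unacked segment with smallest seq #
        else:
            last_ack, dup_count = ack, 0
    return resends
```

For the scenario pictured below (segment with Seq=100 lost, four ACK=100 arrivals), this logic fires exactly once, on the third duplicate.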
X
fast retransmit after sender receipt of triple duplicate ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
timeout
ACK=100
ACK=100
ACK=100
Seq=100 20 bytes of data
Seq=100 20 bytes of data
TCP Reliable Transport: TCP Fast Retransmit
Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
– ignore retransmissions
• SampleRTT will vary; want estimated RTT "smoother"
– average several recent measurements, not just current SampleRTT
TCP Reliable Transport: TCP Round-trip Time and Timeouts
[Plot: RTT (milliseconds) vs. time (seconds), gaia.cs.umass.edu to fantasia.eurecom.fr, showing SampleRTT and EstimatedRTT]
EstimatedRTT = (1 - α) × EstimatedRTT + α × SampleRTT
– exponential weighted moving average
– influence of past sample decreases exponentially fast
– typical value: α = 0.125
[Plot: SampleRTT (milliseconds) vs. time (seconds), gaia.cs.umass.edu to fantasia.eurecom.fr]
TCP Reliable Transport: TCP Round-trip Time and Timeouts
• timeout interval: EstimatedRTT plus "safety margin"
– large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
DevRTT = (1 - β) × DevRTT + β × |SampleRTT - EstimatedRTT|
(typically, β = 0.25)
TimeoutInterval = EstimatedRTT + 4 × DevRTT
(estimated RTT + "safety margin")
TCP Reliable Transport: TCP Round-trip Time and Timeouts
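The two estimators above can be written out directly (a toy Python version of the slides' formulas with the typical gains α = 0.125, β = 0.25; RFC 6298 orders the DevRTT and EstimatedRTT updates slightly differently, so treat this as illustrative):

```python
# EWMA RTT estimation: smooth the RTT estimate, track its deviation,
# and set the timeout to the estimate plus a 4-deviation safety margin.
def update_timeout(est_rtt, dev_rtt, sample_rtt, alpha=0.125, beta=0.25):
    est_rtt = (1 - alpha) * est_rtt + alpha * sample_rtt
    dev_rtt = (1 - beta) * dev_rtt + beta * abs(sample_rtt - est_rtt)
    timeout = est_rtt + 4 * dev_rtt       # estimated RTT + "safety margin"
    return est_rtt, dev_rtt, timeout
```

A steady stream of identical samples drives DevRTT toward zero, shrinking the safety margin; a sudden RTT jump widens it before the estimate catches up.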
• point-to-point:
– one sender, one receiver
• reliable, in-order byte stream:
– no "message boundaries"
• pipelined:
– TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver
TCP: Transmission Control Protocol, RFCs 793, 1122, 1323, 2018, 2581
TCP Reliable Transport
application process
TCP socket receiver buffers
TCP code
IP code
application
OS
receiver protocol stack
application may remove data from TCP socket buffers …
… slower than TCP receiver is delivering (sender is sending)
from sender
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
TCP Reliable Transport: Flow Control
buffered data
free buffer space: rwnd
RcvBuffer
TCP segment payloads
to application process
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
– RcvBuffer size set via socket options (typical default is 4096 bytes)
– many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
receiver-side buffering
TCP Reliable Transport: Flow Control
Goals for Today
• Transport Layer
– Abstraction / services
– Multiplexing/Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport
• Abstraction: Connection Management, Reliable Transport, Flow Control, timeouts
– Congestion control
• Data Center TCP
– Incast Problem
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
– lost packets (buffer overflow at routers)
– long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control:
end-end congestion control:
– no explicit feedback from network
– congestion inferred from end-system observed loss, delay
– approach taken by TCP
network-assisted congestion control:
– routers provide feedback to end systems
– single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
– explicit rate for sender to send at
Principles of Congestion Control
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
TCP connection 1
bottleneck router, capacity R
TCP connection 2
TCP Congestion Control: TCP Fairness
approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs:
– additive increase: increase cwnd by 1 MSS every RTT until loss detected
– multiplicative decrease: cut cwnd in half after loss
[Plot: TCP sender congestion window size (cwnd) vs. time: AIMD saw-tooth behavior, probing for bandwidth; additively increase window size … … until loss occurs (then cut window in half)]
TCP Congestion Control: TCP Fairness: Why is TCP Fair? AIMD: additive increase, multiplicative decrease
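The saw-tooth can be simulated in a few lines (a toy model of my own, not the slides' code: cwnd is in MSS units and the loss pattern is supplied as input rather than modeled):

```python
# AIMD per RTT: +1 MSS when the previous window was delivered cleanly,
# halve cwnd when the RTT ended in a detected loss.
def aimd_trace(cwnd, loss_events):
    trace = []
    for loss in loss_events:
        if loss:
            cwnd = max(cwnd // 2, 1)   # multiplicative decrease
        else:
            cwnd += 1                  # additive increase: +1 MSS per RTT
        trace.append(cwnd)
    return trace
```

Running this with occasional losses reproduces the probing behavior pictured above: slow linear climbs punctuated by halvings.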
• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, a function of perceived network congestion
• TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes
rate ≈ cwnd / RTT bytes/sec
[Diagram: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; cwnd]
TCP Congestion Control
two competing sessions: additive increase gives slope of 1 as throughput increases; multiplicative decrease decreases throughput proportionally
[Plot: Connection 2 throughput vs. Connection 1 throughput, both axes from 0 to R, showing the equal bandwidth share line; congestion avoidance: additive increase; loss: decrease window by factor of 2]
TCP Congestion Control: TCP Fairness: Why is TCP Fair?
Fairness and UDP:
– multimedia apps often do not use TCP: do not want rate throttled by congestion control
– instead use UDP: send audio/video at constant rate, tolerate packet loss
Fairness and parallel TCP connections:
– application can open multiple parallel connections between two hosts
– web browsers do this
– e.g., link of rate R with 9 existing connections: new app asks for 1 TCP, gets rate R/10; new app asks for 11 TCPs, gets R/2
TCP Congestion Control: TCP Fairness
when connection begins, increase rate exponentially until first loss event:
– initially cwnd = 1 MSS
– double cwnd every RTT
– done by incrementing cwnd for every ACK received
summary: initial rate is slow but ramps up exponentially fast
[Diagram: Host A sends one, two, then four segments to Host B, one RTT apart]
TCP Congestion Control: Slow Start
TCP Congestion Control
loss indicated by timeout:
– cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
loss indicated by 3 duplicate ACKs (TCP Reno):
– dup ACKs indicate network capable of delivering some segments
– cwnd is cut in half, window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)
Detecting and Reacting to Loss
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation:
– variable ssthresh
– on loss event, ssthresh is set to 1/2 of cwnd just before loss event
TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
slow start:
– Λ: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0
– new ACK: cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed
– duplicate ACK: dupACKcount++
– cwnd ≥ ssthresh: go to congestion avoidance
– timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment
congestion avoidance:
– new ACK: cwnd = cwnd + MSS × (MSS/cwnd), dupACKcount = 0, transmit new segment(s) as allowed
– duplicate ACK: dupACKcount++
– dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment, go to fast recovery
– timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment, go to slow start
fast recovery:
– duplicate ACK: cwnd = cwnd + MSS, transmit new segment(s) as allowed
– new ACK: cwnd = ssthresh, dupACKcount = 0, go to congestion avoidance
– timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment, go to slow start
TCP Congestion Control
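Ignoring fast recovery, the slow-start / congestion-avoidance interaction can be sketched per RTT (a simplified model with names of my own; real TCP grows cwnd per ACK, not per RTT):

```python
# Per-RTT cwnd update (MSS units): exponential growth below ssthresh,
# linear growth at or above it; on timeout, halve ssthresh, reset cwnd.
def next_window(cwnd, ssthresh, timeout):
    if timeout:
        return 1, max(cwnd // 2, 1)     # ssthresh = cwnd/2, cwnd = 1 MSS
    if cwnd < ssthresh:
        return cwnd * 2, ssthresh       # slow start: double every RTT
    return cwnd + 1, ssthresh           # congestion avoidance: +1 per RTT
```

Starting from cwnd = 1, this doubles up to ssthresh, creeps linearly past it, and on a timeout remembers half the window that caused the loss as the new threshold.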
• avg TCP thruput as function of window size, RTT
– ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
– avg window size (# in-flight bytes) is 3/4 W
– avg thruput is 3/4 W per RTT
avg TCP thruput = (3/4) × W / RTT bytes/sec
[Saw-tooth: window oscillates between W/2 and W]
TCP Throughput
TCP over "long, fat pipes"
• example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
TCP throughput = 1.22 × MSS / (RTT × √L)
→ to achieve 10 Gbps throughput, need a loss rate of L = 2×10^-10 – a very small loss rate!
• new versions of TCP for high-speed
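The slide's numbers can be checked directly from the [Mathis 1997] model, throughput ≈ 1.22 × MSS / (RTT × √L) (a back-of-the-envelope script of my own, not from the lecture):

```python
# Reproduce the "long fat pipe" example: 1500-byte segments, 100 ms RTT,
# 10 Gbps target. w_segments is the required in-flight window; loss_rate
# is the segment loss probability L that the Mathis formula permits.
mss = 1500.0                 # bytes per segment
rtt = 0.100                  # seconds
target = 10e9 / 8            # 10 Gbps expressed in bytes/sec

w_segments = target * rtt / mss                   # in-flight segments needed
loss_rate = (1.22 * mss / (rtt * target)) ** 2    # invert the Mathis formula

print(round(w_segments), loss_rate)               # ~83,333 segments, L ~ 2e-10
```

Both numbers match the slide: roughly 83,333 in-flight segments and a tolerable loss rate on the order of 2×10^-10, which motivates the newer high-speed TCP variants.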
Goals for Today
• Transport Layer
– Abstraction / services
– Multiplexing/Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport
• Abstraction: Connection Management, Reliable Transport, Flow Control, timeouts
– Congestion control
• Data Center TCP
– Incast Problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008
TCP Throughput Collapse: What happens when TCP is "too friendly"?
E.g., test on an Ethernet-based storage cluster:
• Client performs synchronized reads
• Increase # of servers involved in transfer
– SRU size is fixed
• TCP used as the data transfer protocol
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
UDP Connectionless Transport
senderbull treat segment contents
including header fields as sequence of 16-bit integers
bull checksum addition (onersquos complement sum) of segment contents
bull sender puts checksum value into UDP checksum field
receiverbull compute checksum of received
segmentbull check if computed checksum
equals checksum field value
ndash NO - error detectedndash YES - no error detected
But maybe errors nonetheless More later hellip
Goal detect ldquoerrorsrdquo (eg flipped bits) in transmitted segment
UDP Checksum
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
bull Congestion control
bull Data Center TCPndash Incast Problem
important in application transport link layers top-10 list of important networking topics
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
Principles of Reliable Transport
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
important in application transport link layers top-10 list of important networking topics
Principles of Reliable Transport
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
important in application transport link layers top-10 list of important networking topics
Principles of Reliable Transport
sendside
receiveside
rdt_send() called from above (eg by app) Passed data to deliver to receiver upper layer
udt_send() called by rdtto transfer packet over unreliable channel to receiver
rdt_rcv() called when packet arrives on rcv-side of channel
deliver_data() called by rdt to deliver data to upper
Principles of Reliable Transport
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
source port dest port
32 bits
applicationdata (variable length)
sequence number
acknowledgement number
receive window
Urg data pointerchecksum
FSRPAUheadlen
notused
options (variable length)
URG urgent data (generally not used)
ACK ACK valid
PSH push data now(generally not used)
RST SYN FINconnection estab(setup teardown
commands)
bytes rcvr willingto accept
countingby bytes of data(not segments)
Internetchecksum
(as in UDP)
TCP Reliable TransportTCP Segment Structure
sequence numbers
ndashbyte stream ldquonumberrdquo of first byte in segmentrsquos data
acknowledgements
ndashseq of next byte expected from other side
ndashcumulative ACKQ how receiver handles out-of-order segments
ndashA TCP spec doesnrsquot say - up to implementor
source port dest port
sequence number
acknowledgement number
checksum
rwnd
urg pointer
incoming segment to sender
A
sent ACKed
sent not-yet ACKed(ldquoin-flightrdquo)
usablebut not yet sent
not usable
window size N
sender sequence number space
source port dest port
sequence number
acknowledgement number
checksum
rwnd
urg pointer
outgoing segment from sender
TCP Reliable TransportTCP Sequence numbers and Acks
Usertypes
lsquoCrsquo
host ACKsreceipt
of echoedlsquoCrsquo
host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo
simple telnet scenario
Host BHost A
Seq=42 ACK=79 data = lsquoCrsquo
Seq=79 ACK=43 data = lsquoCrsquo
Seq=43 ACK=80
TCP Reliable TransportTCP Sequence numbers and Acks
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing
to establish connection)bull agree on connection parameters
connection state ESTABconnection variables
seq client-to-server server-to-clientrcvBuffer size at serverclient
application
network
connection state ESTABconnection Variables
seq client-to-server server-to-clientrcvBuffer size at serverclient
application
network
Socket clientSocket = newSocket(hostnameport
number)
Socket connectionSocket = welcomeSocketaccept()
Connection Management TCP 3-way handshake
TCP Reliable Transport
SYNbit=1 Seq=x
choose init seq num xsend TCP SYN msg
ESTAB
SYNbit=1 Seq=yACKbit=1 ACKnum=x+1
choose init seq num ysend TCP SYNACKmsg acking SYN
ACKbit=1 ACKnum=y+1
received SYNACK(x) indicates server is livesend ACK for SYNACK
this segment may contain client-to-server data
received ACK(y) indicates client is live
SYNSENT
ESTAB
SYN RCVD
client state
LISTEN
server state
LISTEN
TCP Reliable TransportConnection Management TCP 3-way handshake
closed
L
listen
SYNrcvd
SYNsent
ESTAB
Socket clientSocket = newSocket(hostnameport
number)
SYN(seq=x)
Socket connectionSocket = welcomeSocketaccept()
SYN(x)
SYNACK(seq=yACKnum=x+1)create new socket for communication back to client
SYNACK(seq=yACKnum=x+1)
ACK(ACKnum=y+1)ACK(ACKnum=y+1)
L
TCP Reliable TransportConnection Management TCP 3-way handshake
client server each close their side of connection send TCP segment with FIN bit = 1
respond to received FIN with ACK on receiving FIN ACK can be combined with
own FINsimultaneous FIN exchanges can be
handled
TCP Reliable TransportConnection Management Closing connection
FIN_WAIT_2
CLOSE_WAIT
FINbit=1 seq=y
ACKbit=1 ACKnum=y+1
ACKbit=1 ACKnum=x+1 wait for server
close
can stillsend data
can no longersend data
LAST_ACK
CLOSED
TIMED_WAIT
timed wait for 2max
segment lifetime
CLOSED
FIN_WAIT_1 FINbit=1 seq=xcan no longersend but can receive data
clientSocketclose()
client state server state
ESTABESTAB
TCP Reliable TransportConnection Management Closing connection
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
TCP sender events:
- data rcvd from app: create segment with seq #; seq # is byte-stream number of first data byte in segment; start timer if not already running (think of timer as for oldest unacked segment); expiration interval: TimeOutInterval
- timeout: retransmit segment that caused timeout; restart timer
- ack rcvd: if ack acknowledges previously unacked segments, update what is known to be ACKed; start timer if there are still unacked segments
TCP Reliable Transport
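The three sender events can be sketched as methods on a toy sender; the SimpleSender class and its "wire" list are hypothetical bookkeeping, a simplification of the real per-connection state:

```python
class SimpleSender:
    """Models the TCP sender events: data from app, timeout, ACK received."""
    def __init__(self):
        self.next_seq = 0       # byte-stream number for the next segment
        self.send_base = 0      # oldest unacked byte
        self.unacked = {}       # seq -> segment payload
        self.timer_running = False
        self.wire = []          # segments "transmitted" (for illustration)

    def data_from_app(self, data):
        # create segment with seq #: byte-stream number of first data byte
        seq = self.next_seq
        self.unacked[seq] = data
        self.wire.append((seq, data))
        self.next_seq += len(data)
        if not self.timer_running:   # timer covers the oldest unacked segment
            self.timer_running = True

    def timeout(self):
        # retransmit the segment that caused the timeout; restart timer
        seq = min(self.unacked)
        self.wire.append((seq, self.unacked[seq]))
        self.timer_running = True

    def ack_received(self, acknum):
        # cumulative ACK: everything below acknum is now known to be ACKed
        if acknum > self.send_base:
            self.send_base = acknum
            self.unacked = {s: d for s, d in self.unacked.items() if s >= acknum}
            self.timer_running = bool(self.unacked)  # timer only if data in flight

s = SimpleSender()
s.data_from_app(b'12345678')          # Seq=0, 8 bytes
s.data_from_app(b'abcdefghij' * 2)    # Seq=8, 20 bytes
s.ack_received(8)                     # first segment ACKed; timer keeps running
state_after_first = (s.send_base, s.timer_running)
s.ack_received(28)                    # everything ACKed; timer stops
print(state_after_first, s.send_base, s.timer_running)
```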
TCP retransmission scenarios:
- lost ACK scenario: Host A sends Seq=92, 8 bytes of data; Host B's ACK=100 is lost (X); on timeout, Host A retransmits Seq=92, 8 bytes of data and Host B ACKs again (ACK=100); SendBase advances 92 -> 100
- premature timeout: Host A sends Seq=92 (8 bytes) then Seq=100 (20 bytes); the timer fires before ACK=100 arrives, so Seq=92 is retransmitted; the cumulative ACK=120 then acknowledges both segments; SendBase advances 92 -> 100 -> 120
TCP Reliable Transport: TCP Retransmission Scenarios
- cumulative ACK scenario: Host A sends Seq=92 (8 bytes) then Seq=100 (20 bytes); ACK=100 is lost (X) but the cumulative ACK=120 arrives before the timeout, so Host A continues with Seq=120, 15 bytes of data; no retransmission is needed
TCP Reliable Transport: TCP Retransmission Scenarios
TCP ACK generation [RFC 1122, RFC 2581]
event at receiver -> TCP receiver action:
- arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed -> delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
- arrival of in-order segment with expected seq #; one other segment has ACK pending -> immediately send single cumulative ACK, ACKing both in-order segments
- arrival of out-of-order segment with higher-than-expected seq #; gap detected -> immediately send duplicate ACK, indicating seq # of next expected byte
- arrival of segment that partially or completely fills gap -> immediately send ACK, provided that segment starts at lower end of gap
TCP Reliable Transport
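The four table rows can be condensed into a receiver-side decision function; this is a simplification (delayed-ACK timing is reduced to a flag, and the function only names the action):

```python
def ack_action(seg_seq, expected_seq, ack_pending, gap_exists):
    """Pick the receiver action from the RFC 1122 / RFC 2581 table.
    seg_seq: sequence number of the arriving segment
    expected_seq: next in-order byte the receiver is waiting for
    ack_pending: an earlier in-order segment has a delayed ACK pending
    gap_exists: out-of-order data is buffered above expected_seq
    """
    if seg_seq > expected_seq:
        return 'immediately send duplicate ACK'        # gap detected
    if seg_seq == expected_seq and gap_exists:
        return 'immediately send ACK (fills gap)'      # segment fills the gap
    if seg_seq == expected_seq and ack_pending:
        return 'immediately send cumulative ACK'       # ACKs both segments
    if seg_seq == expected_seq:
        return 'delayed ACK (wait up to 500 ms)'       # nothing pending
    return 'immediately send duplicate ACK'            # below expected: re-ACK

print(ack_action(100, 100, False, False))  # in-order, nothing pending
print(ack_action(100, 100, True, False))   # in-order, ACK pending
print(ack_action(120, 100, False, False))  # out-of-order: gap detected
```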
- time-out period often relatively long: long delay before resending lost packet
- detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if a segment is lost, there will likely be many duplicate ACKs
- TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #; likely that the unacked segment was lost, so don't wait for timeout
TCP Reliable Transport: TCP Fast Retransmit
fast retransmit after sender receipt of triple duplicate ACK: Host A sends Seq=92 (8 bytes) then Seq=100 (20 bytes); Seq=100 is lost (X); each later arrival makes Host B repeat ACK=100; after the third duplicate ACK=100, Host A retransmits Seq=100, 20 bytes of data before its timeout expires
TCP Reliable Transport: TCP Fast Retransmit
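The dup-ACK bookkeeping behind fast retransmit fits in a few lines; process_acks is a hypothetical helper (real stacks keep this state per connection):

```python
def process_acks(acks):
    """Count duplicate ACKs; on the 3rd duplicate, record a fast retransmit."""
    last_ack = None
    dup_count = 0
    retransmitted = []
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == 3:             # triple duplicate ACK
                retransmitted.append(ack)  # resend segment starting at this seq #
        else:
            last_ack = ack
            dup_count = 0
    return retransmitted

# scenario from the figure: ACK=100 four times (one original + 3 duplicates)
print(process_acks([100, 100, 100, 100]))
print(process_acks([100, 120]))            # no duplicates, no retransmit
```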
Q: how to set TCP timeout value?
- longer than RTT, but RTT varies
- too short: premature timeout, unnecessary retransmissions
- too long: slow reaction to segment loss
Q: how to estimate RTT?
- SampleRTT: measured time from segment transmission until ACK receipt; ignore retransmissions
- SampleRTT will vary; want estimated RTT "smoother": average several recent measurements, not just current SampleRTT
TCP Reliable Transport: TCP Round-trip time and timeouts
[Plot: SampleRTT and EstimatedRTT for gaia.cs.umass.edu to fantasia.eurecom.fr; x-axis: time (seconds), y-axis: RTT (milliseconds), roughly 100-350 ms]
EstimatedRTT = (1 - α) * EstimatedRTT + α * SampleRTT
- exponential weighted moving average; influence of past sample decreases exponentially fast
- typical value: α = 0.125
[Plot: sampleRTT for gaia.cs.umass.edu to fantasia.eurecom.fr; x-axis: time (seconds), y-axis: RTT (milliseconds)]
TCP Reliable Transport: TCP Round-trip time and timeouts
- timeout interval: EstimatedRTT plus "safety margin"; large variation in EstimatedRTT -> larger safety margin
- estimate SampleRTT deviation from EstimatedRTT: DevRTT = (1 - β) * DevRTT + β * |SampleRTT - EstimatedRTT| (typically β = 0.25)
TimeoutInterval = EstimatedRTT + 4 * DevRTT
(estimated RTT plus "safety margin")
TCP Reliable Transport: TCP Round-trip time and timeouts
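The two EWMAs and the timeout rule above, as a small sketch with the slide's constants (α = 0.125, β = 0.25); the DevRTT initialization to half the first sample is a common choice, not from the slides:

```python
class RttEstimator:
    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2   # common initialization choice

    def update(self, sample_rtt):
        # EstimatedRTT = (1 - alpha) * EstimatedRTT + alpha * SampleRTT
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        # DevRTT = (1 - beta) * DevRTT + beta * |SampleRTT - EstimatedRTT|
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))

    def timeout_interval(self):
        # TimeoutInterval = EstimatedRTT + 4 * DevRTT ("safety margin")
        return self.estimated_rtt + 4 * self.dev_rtt

est = RttEstimator(first_sample=100.0)       # milliseconds
for sample in [120.0, 110.0, 250.0, 105.0]:  # a jittery path
    est.update(sample)
print(round(est.estimated_rtt, 1), round(est.timeout_interval(), 1))
```

Note how the single 250 ms spike widens DevRTT, and with it the safety margin, far more than it moves EstimatedRTT.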
TCP (Transmission Control Protocol), RFCs 793, 1122, 1323, 2018, 2581
- point-to-point: one sender, one receiver
- reliable, in-order byte stream: no "message boundaries"
- pipelined: TCP congestion and flow control set window size
- full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
- connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
- flow controlled: sender will not overwhelm receiver
TCP Reliable Transport
[Diagram: receiver protocol stack; the application process reads from TCP socket receiver buffers, above the TCP code and IP code in the OS; segments arrive from the sender]
- application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)
- flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
TCP Reliable Transport: Flow Control
[Diagram: receiver-side buffering; RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads arrive, and data is read out to the application process]
- receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
- RcvBuffer size set via socket options (typical default is 4096 bytes); many operating systems autoadjust RcvBuffer
- sender limits amount of unacked ("in-flight") data to receiver's rwnd value
- guarantees receive buffer will not overflow
TCP Reliable Transport: Flow Control
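The sender-side rule, in-flight data no larger than rwnd, is one inequality; a sketch (function and argument names are illustrative, and real TCP combines rwnd with the congestion window cwnd):

```python
def bytes_allowed_to_send(last_byte_sent, last_byte_acked, rwnd):
    """Flow control: keep LastByteSent - LastByteAcked <= rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)

# receiver advertised rwnd = 4096 bytes (a typical default RcvBuffer size)
print(bytes_allowed_to_send(last_byte_sent=1000, last_byte_acked=0, rwnd=4096))
print(bytes_allowed_to_send(last_byte_sent=5000, last_byte_acked=904, rwnd=4096))
```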
Goals for Today
- Transport Layer
  - Abstraction / services
  - Multiplexing / Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport: abstraction, connection management, reliable transport, flow control, timeouts
  - Congestion control
- Data Center TCP
  - Incast Problem
congestion:
- informally: "too many sources sending too much data too fast for network to handle"
- different from flow control
- manifestations: lost packets (buffer overflow at routers); long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control:
- end-end congestion control: no explicit feedback from network; congestion inferred from end-system observed loss, delay; approach taken by TCP
- network-assisted congestion control: routers provide feedback to end systems; a single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM), or an explicit rate for sender to send at
Principles of Congestion Control
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Diagram: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]
TCP Congestion Control: TCP Fairness
approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
- additive increase: increase cwnd by 1 MSS every RTT until loss detected
- multiplicative decrease: cut cwnd in half after loss
[Plot: TCP sender congestion window size (cwnd) over time; AIMD saw tooth behavior, probing for bandwidth: additively increase window size ... until loss occurs (then cut window in half)]
TCP Congestion Control: TCP Fairness. Why is TCP fair? AIMD: additive increase, multiplicative decrease
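The sawtooth can be reproduced by replaying the two AIMD rules against a synthetic loss pattern (the starting window and loss rounds below are arbitrary, for illustration):

```python
def aimd_trace(rtts, loss_rtts, mss=1):
    """cwnd per RTT: +1 MSS per RTT (additive increase),
    halved on loss (multiplicative decrease)."""
    cwnd = 10 * mss
    trace = []
    for rtt in range(rtts):
        if rtt in loss_rtts:
            cwnd = max(mss, cwnd // 2)   # multiplicative decrease
        else:
            cwnd += mss                  # additive increase
        trace.append(cwnd)
    return trace

trace = aimd_trace(rtts=12, loss_rtts={5, 10})
print(trace)   # climbs by 1, halves at rounds 5 and 10: the sawtooth
```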
- sender limits transmission: LastByteSent - LastByteAcked <= cwnd
- cwnd is dynamic, a function of perceived network congestion
- TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes; rate ~ cwnd / RTT bytes/sec
[Diagram: sender sequence number space; last byte ACKed, sent not-yet ACKed ("in-flight"), last byte sent; cwnd spans the in-flight region]
TCP Congestion Control
two competing sessions: additive increase gives slope of 1 as throughput increases; multiplicative decrease decreases throughput proportionally
[Plot: Connection 1 throughput vs Connection 2 throughput, both axes from 0 to R; the trajectory alternates congestion avoidance (additive increase) with loss (decrease window by factor of 2), converging toward the equal bandwidth share line]
TCP Congestion Control: TCP Fairness. Why is TCP fair?
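The convergence argument can be replayed numerically: two flows that both add 1 each round and both halve when their sum exceeds capacity have a gap that is preserved by additive increase but halved by every decrease, so the gap shrinks geometrically. This sketch assumes synchronized losses, a simplification of real behavior:

```python
def aimd_two_flows(x, y, capacity, rounds):
    """Two AIMD flows: additive increase, synchronized multiplicative decrease."""
    for _ in range(rounds):
        x += 1
        y += 1
        if x + y > capacity:   # shared bottleneck overflows: both see loss
            x /= 2             # each halving also halves the gap x - y
            y /= 2
    return x, y

# start far from fair: flow 1 holds almost the whole link
x, y = aimd_two_flows(x=90.0, y=2.0, capacity=100, rounds=200)
print(round(x, 1), round(y, 1))   # nearly equal shares
```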
Fairness and UDP: multimedia apps often do not use TCP; they do not want their rate throttled by congestion control; instead they use UDP: send audio/video at constant rate, tolerate packet loss
Fairness and parallel TCP connections: an application can open multiple parallel connections between two hosts; web browsers do this. e.g., link of rate R with 9 existing connections: new app asks for 1 TCP, gets rate R/10; new app asks for 11 TCPs, gets R/2
TCP Congestion Control: TCP Fairness
- when connection begins, increase rate exponentially until first loss event: initially cwnd = 1 MSS; double cwnd every RTT; done by incrementing cwnd for every ACK received
- summary: initial rate is slow, but ramps up exponentially fast
[Diagram: Host A sends one segment, then two segments, then four segments to Host B, one RTT per round]
TCP Congestion Control: Slow Start
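Doubling per RTT falls out of the per-ACK rule: a window of N segments produces N ACKs, each adding 1 MSS, so the next window is 2N. A sketch of that arithmetic:

```python
def slow_start(rtts, mss=1):
    """cwnd doubles each RTT via +1 MSS per ACK received."""
    cwnd = 1 * mss
    history = [cwnd]
    for _ in range(rtts):
        acks = cwnd // mss    # one ACK per segment in the window
        cwnd += acks * mss    # +1 MSS per ACK -> doubles the window
        history.append(cwnd)
    return history

print(slow_start(4))   # one, two, four segments ... as in the figure
```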
TCP Congestion Control: Detecting and Reacting to Loss
- loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
- loss indicated by 3 duplicate ACKs (TCP Reno): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
- TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation: variable ssthresh; on loss event, ssthresh is set to 1/2 of cwnd just before loss event
TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
TCP congestion control FSM (Reno):
slow start:
- entry (Λ, no event): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0
- new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
- duplicate ACK: dupACKcount++
- cwnd > ssthresh (Λ): enter congestion avoidance
- timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
- dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery
congestion avoidance:
- new ACK: cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
- duplicate ACK: dupACKcount++
- timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start
- dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery
fast recovery:
- duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
- new ACK: cwnd = ssthresh; dupACKcount = 0; enter congestion avoidance
- timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start
TCP Congestion Control
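The FSM condenses to an event handler with three states; this RenoFsm class is a sketch of the transitions above (ssthresh is kept in MSS units here rather than the slide's 64 KB, and retransmission itself is omitted):

```python
class RenoFsm:
    MSS = 1

    def __init__(self):
        self.state = 'slow_start'
        self.cwnd = 1 * self.MSS
        self.ssthresh = 64          # in MSS units in this sketch
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == 'slow_start':
            self.cwnd += self.MSS                     # exponential growth
            if self.cwnd >= self.ssthresh:
                self.state = 'congestion_avoidance'
        elif self.state == 'congestion_avoidance':
            self.cwnd += self.MSS * (self.MSS / self.cwnd)  # linear growth
        else:                                         # fast recovery ends
            self.cwnd = self.ssthresh
            self.state = 'congestion_avoidance'
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == 'fast_recovery':
            self.cwnd += self.MSS     # inflate window per extra dup ACK
            return
        self.dup_acks += 1
        if self.dup_acks == 3:        # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.MSS
            self.state = 'fast_recovery'

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * self.MSS
        self.dup_acks = 0
        self.state = 'slow_start'

fsm = RenoFsm()
for _ in range(9):
    fsm.on_new_ack()    # slow start: cwnd grows 1 -> 10
for _ in range(3):
    fsm.on_dup_ack()    # triple duplicate ACK -> fast recovery
print(fsm.state, fsm.cwnd, fsm.ssthresh)
```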
- avg TCP throughput as function of window size, RTT: ignore slow start, assume always data to send
- W: window size (measured in bytes) where loss occurs; avg window size (# in-flight bytes) is 3/4 W; avg throughput is 3/4 W per RTT
avg TCP throughput = (3/4) * W / RTT bytes/sec
[Plot: cwnd sawtooth oscillating between W/2 and W]
TCP Throughput
TCP over "long fat pipes"
- example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
- requires W = 83,333 in-flight segments
- throughput in terms of segment loss probability, L [Mathis 1997]:
TCP throughput = (1.22 * MSS) / (RTT * sqrt(L))
- to achieve 10 Gbps throughput, need a loss rate of L = 2*10^-10, a very small loss rate!
- new versions of TCP for high-speed
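Both numbers on the slide can be checked directly: W is the bandwidth-delay product in segments, and L comes from inverting the Mathis formula for the target throughput:

```python
from math import sqrt

mss_bytes = 1500
rtt = 0.100                  # seconds
target = 10e9                # 10 Gbps, in bits/sec

# in-flight window needed: bandwidth-delay product, in segments
w_segments = target * rtt / (mss_bytes * 8)
print(round(w_segments))     # the slide's 83,333 segments

# Mathis et al. 1997: throughput = 1.22 * MSS / (RTT * sqrt(L));
# invert for the loss rate L that sustains the target throughput
mss_bits = mss_bytes * 8
loss = (1.22 * mss_bits / (rtt * target)) ** 2
print(loss)                  # on the order of 2e-10

# sanity check: plugging L back in recovers the target rate
check = 1.22 * mss_bits / (rtt * sqrt(loss))
```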
Goals for Today
- Transport Layer
  - Abstraction / services
  - Multiplexing / Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport: abstraction, connection management, reliable transport, flow control, timeouts
  - Congestion control
- Data Center TCP
  - Incast Problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan. Proc. of USENIX File and Storage Technologies (FAST), February 2008.
TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
- Test on an Ethernet-based storage cluster
- Client performs synchronized reads
- Increase # of servers involved in transfer; SRU size is fixed
- TCP used as the data transfer protocol
Cluster-based Storage Systems
[Diagram: a client connects through a switch to storage servers 1-4; the client issues requests (R) to all servers; each server returns its Server Request Unit (SRU), and SRUs 1-4 together form a data block; synchronized read: the client sends the next batch of requests only after receiving 1, 2, 3, 4]
Link idle time due to timeouts
[Diagram: same synchronized read; server 4's SRU is lost, so the link is idle until server 4 experiences a timeout, stalling completion of the data block]
TCP Throughput Collapse: Incast
- [Nagle04] called this Incast
- Cause of throughput collapse: TCP timeouts
[Plot: goodput collapsing as the number of servers increases]
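Why timeouts collapse goodput: a synchronized read finishes only when the slowest server does, so one RTO-stalled server idles the link for the whole batch, and the chance that some server stalls grows with the number of servers. A toy model (all parameters are illustrative, and the transfer time is held fixed for simplicity, whereas the paper fixes the SRU size):

```python
def goodput_fraction(num_servers, transfer_ms=10.0, rto_ms=200.0,
                     loss_per_server=0.01):
    """Toy incast model: a batch finishes in transfer_ms unless any server
    suffers a timeout, which stalls the whole batch for rto_ms."""
    p_any_timeout = 1 - (1 - loss_per_server) ** num_servers
    expected_ms = transfer_ms + p_any_timeout * rto_ms
    return transfer_ms / expected_ms   # fraction of ideal goodput

for n in [1, 4, 16, 64]:
    print(n, round(goodput_fraction(n), 2))
```

Because the RTO (hundreds of ms) dwarfs the transfer time (ms), even a 1% per-server loss rate drives goodput toward zero as servers are added.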
TCP data-driven loss recovery
[Diagram: sender transmits Seq 1, 2, 3, 4, 5; packet 2 is lost; the receiver answers packets 1, 3, 4, 5 with Ack 1; 3 duplicate ACKs for 1 (packet 2 is probably lost) let the sender retransmit packet 2 immediately, after which the receiver sends Ack 5]
In SANs: recovery in usecs after loss
TCP timeout-driven loss recovery
[Diagram: sender transmits Seq 1, 2, 3, 4, 5; only packet 1 gets through (Ack 1); with too few duplicate ACKs, the sender must wait out a full Retransmission Timeout (RTO) before retransmitting]
- Timeouts are expensive (msecs to recover after loss)
TCP loss recovery comparison
[Diagrams: data-driven recovery (triple duplicate Ack 1 -> immediate retransmit of 2 -> Ack 5) beside timeout-driven recovery (wait for the RTO to fire)]
- Timeout-driven recovery is slow (ms)
- Data-driven recovery is super fast (us) in SANs
TCP Throughput Collapse: Summary
- Synchronized reads and TCP timeouts cause TCP throughput collapse
- Previously tried options: increase buffer size (costly); reduce RTOmin (unsafe); use Ethernet Flow Control (limited applicability)
- DCTCP (Data Center TCP): limit in-network buffer (queue length) via both in-network signaling and end-to-end TCP modifications
Perspective
- principles behind transport layer services: multiplexing, demultiplexing; reliable data transfer; flow control; congestion control
- instantiation, implementation in the Internet: UDP, TCP
Next time: Network Layer
- leaving the network "edge" (application, transport layers)
- into the network "core"
Before Next time
- Project Proposal: due in one week; meet with groups, TA, and professor
- Lab1: single-threaded TCP proxy; due in one week, next Friday
- No required reading and review due; but review chapter 4 from the book (Network Layer); we will also briefly discuss data center topologies
- Check website for updated schedule
received SYNACK(x) indicates server is livesend ACK for SYNACK
this segment may contain client-to-server data
received ACK(y) indicates client is live
SYNSENT
ESTAB
SYN RCVD
client state
LISTEN
server state
LISTEN
TCP Reliable TransportConnection Management TCP 3-way handshake
closed
L
listen
SYNrcvd
SYNsent
ESTAB
Socket clientSocket = newSocket(hostnameport
number)
SYN(seq=x)
Socket connectionSocket = welcomeSocketaccept()
SYN(x)
SYNACK(seq=yACKnum=x+1)create new socket for communication back to client
SYNACK(seq=yACKnum=x+1)
ACK(ACKnum=y+1)ACK(ACKnum=y+1)
L
TCP Reliable TransportConnection Management TCP 3-way handshake
client server each close their side of connection send TCP segment with FIN bit = 1
respond to received FIN with ACK on receiving FIN ACK can be combined with
own FINsimultaneous FIN exchanges can be
handled
TCP Reliable TransportConnection Management Closing connection
FIN_WAIT_2
CLOSE_WAIT
FINbit=1 seq=y
ACKbit=1 ACKnum=y+1
ACKbit=1 ACKnum=x+1 wait for server
close
can stillsend data
can no longersend data
LAST_ACK
CLOSED
TIMED_WAIT
timed wait for 2max
segment lifetime
CLOSED
FIN_WAIT_1 FINbit=1 seq=xcan no longersend but can receive data
clientSocketclose()
client state server state
ESTABESTAB
TCP Reliable TransportConnection Management Closing connection
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
data rcvd from appcreate segment with seq
seq is byte-stream
number of first data byte in segment
start timer if not already running think of timer as for
oldest unacked segment expiration interval
TimeOutInterval
timeoutretransmit segment that
caused timeoutrestart timer ack rcvdif ack acknowledges
previously unacked segments update what is known to
be ACKed start timer if there are
still unacked segments
TCP Reliable Transport
lost ACK scenario
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8 bytes of data
Xtim
eout
ACK=100
premature timeout
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8bytes of data
tim
eout
ACK=120
Seq=100 20 bytes of data
ACK=120
SendBase=100
SendBase=120
SendBase=120
SendBase=92
TCP Reliable TransportTCP Retransmission Scenerios
X
cumulative ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=120 15 bytes of data
tim
eout
Seq=100 20 bytes of data
ACK=120
TCP Reliable TransportTCP Retransmission Scenerios
event at receiver
arrival of in-order segment withexpected seq All data up toexpected seq already ACKed
arrival of in-order segment withexpected seq One other segment has ACK pending
arrival of out-of-order segmenthigher-than-expect seq Gap detected
arrival of segment that partially or completely fills gap
TCP receiver action
delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK
immediately send single cumulative ACK ACKing both in-order segments
immediately send duplicate ACK indicating seq of next expected byte
immediate send ACK provided thatsegment starts at lower end of gap
TCP ACK generation [RFC 1122 2581]Reliable Transport
time-out period often relatively long long delay before
resending lost packetdetect lost segments via
duplicate ACKs sender often sends many
segments back-to-back if segment is lost there
will likely be many duplicate ACKs
if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq
likely that unacked segment lost so donrsquot wait for timeout
TCP fast retransmit
(ldquotriple duplicate ACKsrdquo)
TCP Reliable TransportTCP Fast Retransmit
X
fast retransmit after sender receipt of triple duplicate ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
tim
eout
ACK=100
ACK=100
ACK=100
Seq=100 20 bytes of data
Seq=100 20 bytes of data
TCP Reliable TransportTCP Fast Retransmit
Q how to set TCP timeout value
longer than RTT but RTT varies
too short premature timeout unnecessary retransmissions
too long slow reaction to segment loss
Q how to estimate RTT
bull SampleRTT measured time from segment transmission until ACK receipt
ndash ignore retransmissions
bull SampleRTT will vary want estimated RTT ldquosmootherrdquo
ndash average several recent measurements not just current SampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
RTT gaiacsumassedu to fantasiaeurecomfr
100
150
200
250
300
350
1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106
time (seconnds)
RTT
(mill
isec
onds
)
SampleRTT Estimated RTT
EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases
exponentially fast typical value = 0125
RTT (
mill
iseco
nds)
RTT gaiacsumassedu to fantasiaeurecomfr
sampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
time (seconds)
bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin
bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|
(typically = 025)
TimeoutInterval = EstimatedRTT + 4DevRTT
estimated RTT ldquosafety marginrdquo
TCP Reliable TransportTCP Roundtrip time and timeouts
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
applicationprocess
TCP socketreceiver buffers
TCPcode
IPcode
application
OS
receiver protocol stack
application may remove data from
TCP socket buffers hellip
hellip slower than TCP receiver is delivering(sender is sending)
from sender
receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast
flow control
TCP Reliable TransportFlow Control
buffered data
free buffer spacerwnd
RcvBuffer
TCP segment payloads
to application processbull receiver ldquoadvertisesrdquo free
buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via
socket options (typical default is 4096 bytes)
ndash many operating systems autoadjust RcvBuffer
bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value
bull guarantees receive buffer will not overflow
receiver-side buffering
TCP Reliable TransportFlow Control
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
congestionbull informally ldquotoo many sources sending too
much data too fast for network to handlerdquobull different from flow controlbull manifestations
ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control
end-end congestion control
no explicit feedback from network
congestion inferred from end-system observed loss delay
approach taken by TCP
network-assisted congestion control
routers provide feedback to end systemssingle bit indicating
congestion (SNA DECbit TCPIP ECN ATM)
explicit rate for sender to send at
Principles of Congestion Control
fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK
TCP connection 1
bottleneckroutercapacity RTCP connection 2
TCP Congestion ControlTCP Fairness
approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1
MSS every RTT until loss detected multiplicative decrease cut cwnd in half
after loss
cwnd
TC
P s
end
er
cong
estio
n w
indo
w s
ize
AIMD saw toothbehavior probing
for bandwidth
additively increase window size helliphellip until loss occurs (then cut window in half)
time
TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease
sender limits transmission
cwnd is dynamic function of perceived network congestion
TCP sending rateroughly send cwnd
bytes wait RTT for ACKS then send more bytes
last byteACKed sent not-
yet ACKed(ldquoin-flightrdquo)
last byte sent
cwnd
LastByteSent-LastByteAcked
lt cwnd
sender sequence number space
rate ~~cwnd
RTTbytessec
TCP Congestion Control
two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally
R
R
equal bandwidth share
Connection 1 throughput
Con
nect
ion
2 th
roug
h pu t
congestion avoidance additive increaseloss decrease window by factor of 2
congestion avoidance additive increaseloss decrease window by factor of 2
TCP Congestion ControlTCP Fairness Why is TCP Fair
Fairness and UDPmultimedia apps
often do not use TCP do not want rate
throttled by congestion control
instead use UDP send audiovideo at
constant rate tolerate packet loss
Fairness parallel TCP connections
application can open multiple parallel connections between two hosts
web browsers do this eg link of rate R with 9
existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2
TCP Congestion ControlTCP Fairness
when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd
for every ACK receivedsummary initial rate is
slow but ramps up exponentially fast
Host A
one segment
RT
T
Host B
time
two segments
four segments
TCP Congestion ControlSlow Start
TCP Congestion Control
loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly
loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
Detecting and Reacting to Loss
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++
duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++
duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP Congestion Control
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
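The sawtooth average works out directly: the window oscillates between W/2 and W, so the mean is 3W/4 per RTT. A one-line check (the W and RTT values in the example are assumed, not from the slide):

```python
def avg_tcp_throughput(W, rtt):
    """Idealized sawtooth average: window oscillates between W/2 and W,
    so mean window is 3W/4 sent per RTT. W in bytes, rtt in seconds;
    returns bytes/sec."""
    return 0.75 * W / rtt

# e.g. loss occurs at W = 100 KB with RTT = 100 ms (illustrative values)
print(avg_tcp_throughput(100_000, 0.1))  # → 750000.0 bytes/sec
```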
TCP over "long fat pipes"
• example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
TCP throughput = 1.22 · MSS / (RTT · √L)
➜ to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰ – a very small loss rate!
• new versions of TCP needed for high-speed
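The Mathis bound can be inverted to reproduce the slide's loss-rate figure. A quick sketch (10 Gbps taken as 1.25e9 bytes/sec):

```python
from math import sqrt

def mathis_throughput(mss, rtt, loss):
    """Mathis et al. (1997) steady-state TCP throughput bound, bytes/sec.
    mss in bytes, rtt in seconds, loss as a probability."""
    return 1.22 * mss / (rtt * sqrt(loss))

def loss_for_target(mss, rtt, target_bps):
    """Invert the bound: loss rate needed to sustain target_bps."""
    return (1.22 * mss / (rtt * target_bps)) ** 2

# 1500-byte segments, 100 ms RTT, 10 Gbps = 1.25e9 bytes/sec target
L = loss_for_target(1500, 0.1, 1.25e9)
print(f"{L:.1e}")  # ≈ 2e-10, matching the slide
```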
Goals for Today
• Transport Layer
– Abstraction / services
– Multiplexing / Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport
• Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
– Congestion control
• Data Center TCP
– Incast Problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008
TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer
– SRU size is fixed
• TCP used as the data transfer protocol
Cluster-based Storage Systems
[Diagram: client issues a synchronized read through a switch; storage servers 1–4 each return one Server Request Unit (SRU) of the data block; once all four arrive, the client sends the next batch of requests]
Link idle time due to timeouts
[Diagram: same synchronized read, but server 4's SRU is lost; the link is idle until server 4 experiences a timeout]
TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts
[Graph: goodput collapses as the number of servers increases]
TCP data-driven loss recovery
[Diagram: sender transmits segments 1–5; packet 2 is lost; the receiver repeats Ack 1 for each later arrival; after 3 duplicate ACKs for 1 (packet 2 is probably lost) the sender retransmits packet 2 immediately and receives Ack 5. In SANs, recovery takes microseconds after loss]
TCP timeout-driven loss recovery
[Diagram: sender transmits segments 1–5; all but segment 1 are lost, so only Ack 1 arrives; with no duplicate ACKs, the sender must wait out a full Retransmission Timeout (RTO) before resending]
• Timeouts are expensive (msecs to recover after loss)
TCP Loss recovery comparison
[Diagrams side by side: data-driven recovery retransmits segment 2 on the third duplicate Ack 1; timeout-driven recovery waits for the RTO before resending]
• Timeout-driven recovery is slow (ms); data-driven recovery is super fast (µs) in SANs
TCP Throughput Collapse: Summary
• Synchronized reads and TCP timeouts cause TCP throughput collapse
• Previously tried options:
– Increase buffer size (costly)
– Reduce RTOmin (unsafe)
– Use Ethernet flow control (limited applicability)
• DCTCP (Data Center TCP)
– Limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications
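Why a single timeout collapses goodput can be seen with a back-of-the-envelope model. This toy calculation (link speed and RTOmin values are assumed, typical defaults) charges one RTO of idle link time per block:

```python
def goodput_with_timeout(block_bytes, link_bps, rto_s):
    """Toy incast model: a synchronized read of one data block completes
    only after a straggler recovers via a retransmission timeout, so the
    link sits idle for ~RTO per block. Returns achieved goodput, bytes/sec."""
    transfer_s = block_bytes / link_bps
    return block_bytes / (transfer_s + rto_s)

# 1 MB block on a 1 Gbps (125 MB/s) link, default RTOmin = 200 ms (assumed)
ideal = 125_000_000
with_rto = goodput_with_timeout(1_000_000, ideal, 0.2)
print(with_rto / ideal)  # utilization collapses to a few percent
```

This is exactly why reducing RTOmin helps (the idle term shrinks), and why DCTCP attacks the problem by keeping queues short so the loss never happens.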
principles behind transport layer services: multiplexing, demultiplexing, reliable data transfer, flow control, congestion control
instantiation, implementation in the Internet: UDP, TCP
Next time:
• Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"
Perspective
Before Next time
• Project Proposal
– due in one week
– Meet with groups, TA, and professor
• Lab1
– Single threaded TCP proxy
– Due in one week, next Friday
• No required reading and review due
• But review chapter 4 from the book: Network Layer
– We will also briefly discuss data center topologies
• Check website for updated schedule
important in application, transport, link layers: top-10 list of important networking topics!
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
Principles of Reliable Transport
[Diagram: send side and receive side of the rdt interface]
rdt_send(): called from above (e.g., by app); passes data to deliver to receiver's upper layer
udt_send(): called by rdt to transfer packet over unreliable channel to receiver
rdt_rcv(): called when packet arrives on rcv-side of channel
deliver_data(): called by rdt to deliver data to upper layer
Principles of Reliable Transport
TCP (Transmission Control Protocol), RFCs: 793, 1122, 1323, 2018, 2581
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver
TCP Reliable Transport
[TCP segment structure, 32 bits wide:]
source port | dest port
sequence number (counting by bytes of data, not segments)
acknowledgement number
head len | not used | flags U, A, P, R, S, F | receive window (# bytes rcvr willing to accept)
checksum (Internet checksum, as in UDP) | urg data pointer
options (variable length)
application data (variable length)
Flags: URG: urgent data (generally not used); ACK: ACK # valid; PSH: push data now (generally not used); RST, SYN, FIN: connection estab (setup, teardown commands)
TCP Reliable Transport: TCP Segment Structure
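The fixed 20-byte header above can be decoded with a few lines of Python. A sketch using the standard `struct` module; the field layout follows the slide, and the sample SYN segment is hand-built with illustrative values:

```python
import struct

def parse_tcp_header(segment: bytes):
    """Decode the fixed 20-byte TCP header shown in the slide.

    '!HHIIBBHHH' = src port, dst port, seq, ack, data-offset/reserved,
    flags, receive window, checksum, urgent pointer (network byte order).
    """
    (src, dst, seq, ack, off_res, flags, rwnd,
     checksum, urg) = struct.unpack('!HHIIBBHHH', segment[:20])
    return {
        'src_port': src, 'dst_port': dst,
        'seq': seq, 'ack': ack,
        'header_len': (off_res >> 4) * 4,   # data offset is in 32-bit words
        'flags': {name: bool(flags & bit) for name, bit in
                  [('FIN', 1), ('SYN', 2), ('RST', 4),
                   ('PSH', 8), ('ACK', 16), ('URG', 32)]},
        'rwnd': rwnd,
    }

# a hand-built SYN segment from port 12345 to 80 (illustrative values)
syn = struct.pack('!HHIIBBHHH', 12345, 80, 1000, 0, 5 << 4, 2, 65535, 0, 0)
hdr = parse_tcp_header(syn)
```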
sequence numbers:
– byte stream "number" of first byte in segment's data
acknowledgements:
– seq # of next byte expected from other side
– cumulative ACK
Q: how receiver handles out-of-order segments?
– A: TCP spec doesn't say, up to implementor
[Diagram: incoming and outgoing segment headers (ports, sequence number, acknowledgement number, rwnd, checksum, urg pointer); sender sequence number space of window size N divided into sent-and-ACKed, sent-not-yet-ACKed ("in-flight"), usable-but-not-yet-sent, and not-usable regions]
TCP Reliable Transport: TCP Sequence numbers and Acks
simple telnet scenario (Host A, Host B):
User types 'C' → A sends Seq=42, ACK=79, data = 'C'
B ACKs receipt of 'C', echoes back 'C' → Seq=79, ACK=43, data = 'C'
A ACKs receipt of echoed 'C' → Seq=43, ACK=80
TCP Reliable Transport: TCP Sequence numbers and Acks
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Both ends reach connection state ESTAB with connection variables: seq # client-to-server and server-to-client, rcvBuffer size at server/client]
Socket clientSocket = new Socket(hostname, port number);
Socket connectionSocket = welcomeSocket.accept();
Connection Management: TCP 3-way handshake
TCP Reliable Transport
client state: LISTEN → SYNSENT → ESTAB; server state: LISTEN → SYN RCVD → ESTAB
1. client chooses init seq num x, sends TCP SYN msg: SYNbit=1, Seq=x
2. server chooses init seq num y, sends TCP SYNACK msg acking SYN: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1; received SYNACK(x) indicates server is live
3. client sends ACK for SYNACK: ACKbit=1, ACKnum=y+1; this segment may contain client-to-server data; received ACK(y) indicates client is live
TCP Reliable Transport: Connection Management: TCP 3-way handshake
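The slide's `Socket` calls are Java; the same handshake can be driven from Python's standard `socket` module. A minimal sketch, run on loopback; note the kernel performs the SYN / SYNACK / ACK exchange inside `connect()`/`accept()`, so application code never sees the handshake segments:

```python
import socket
import threading

welcome = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
welcome.bind(('127.0.0.1', 0))        # port 0: let the OS pick a free port
welcome.listen(1)                     # server state: LISTEN
port = welcome.getsockname()[1]

def server():
    conn, _ = welcome.accept()        # returns once handshake completes (ESTAB)
    conn.sendall(b'hello')
    conn.close()

t = threading.Thread(target=server)
t.start()
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(('127.0.0.1', port))   # sends SYN; blocks until ESTAB
reply = client.recv(5)
client.close()
t.join()
welcome.close()
```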
[FSM: client: closed → (Socket clientSocket = new Socket(...): send SYN(seq=x)) → SYNsent → (rcv SYNACK(seq=y, ACKnum=x+1): send ACK(ACKnum=y+1)) → ESTAB; server: closed → listen (Socket connectionSocket = welcomeSocket.accept()) → (rcv SYN(x): send SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client) → SYN rcvd → (rcv ACK(ACKnum=y+1)) → ESTAB]
TCP Reliable Transport: Connection Management: TCP 3-way handshake
client, server each close their side of connection: send TCP segment with FIN bit = 1
respond to received FIN with ACK; on receiving FIN, ACK can be combined with own FIN
simultaneous FIN exchanges can be handled
TCP Reliable Transport: Connection Management: Closing connection
[Diagram: client (ESTAB) calls clientSocket.close(), sends FINbit=1, seq=x, enters FIN_WAIT_1 (can no longer send, but can receive data); server ACKs (ACKbit=1, ACKnum=x+1) and enters CLOSE_WAIT (can still send data); client in FIN_WAIT_2 waits for server close; server sends FINbit=1, seq=y (can no longer send data) and enters LAST_ACK; client ACKs (ACKbit=1, ACKnum=y+1), enters TIMED_WAIT, waits 2 × max segment lifetime, then CLOSED; server reaches CLOSED on receiving the ACK]
TCP Reliable Transport: Connection Management: Closing connection
data rcvd from app:
• create segment with seq # (seq # is byte-stream number of first data byte in segment)
• start timer if not already running (think of timer as for oldest unacked segment); expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments: update what is known to be ACKed; start timer if there are still unacked segments
TCP Reliable Transport
[Diagrams: (a) lost ACK scenario: Host A sends Seq=92, 8 bytes of data; Host B's ACK=100 is lost; A times out and resends Seq=92, 8 bytes; B re-ACKs 100; SendBase advances 92 → 100. (b) premature timeout: A sends Seq=92, 8 bytes and Seq=100, 20 bytes; ACK=100 arrives after A's timeout, so A resends Seq=92, 8 bytes; B replies ACK=120; SendBase advances 100 → 120]
TCP Reliable Transport: TCP Retransmission Scenarios
[Diagram: cumulative ACK scenario: Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; ACK=100 is lost, but ACK=120 arrives before timeout, acknowledging both segments, so no retransmission is needed]
TCP Reliable Transport: TCP Retransmission Scenarios
TCP ACK generation [RFC 1122, RFC 2581]: event at receiver → TCP receiver action
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap
Reliable Transport
time-out period often relatively long: long delay before resending lost packet
detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if a segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
– likely that unacked segment was lost, so don't wait for timeout
TCP Reliable Transport: TCP Fast Retransmit
[Diagram: fast retransmit after sender receipt of triple duplicate ACK: Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; the Seq=100 segment is lost; Host B sends ACK=100 four times; on the triple duplicate ACK, A retransmits Seq=100, 20 bytes before its timeout fires]
TCP Reliable Transport: TCP Fast Retransmit
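The sender-side trigger is just a duplicate-ACK counter. A sketch of the detection logic only (a real sender also manages the retransmission queue and timers):

```python
def fast_retransmit_trace(acks, threshold=3):
    """Return seq numbers retransmitted on the 3rd duplicate ACK.

    'acks' is the stream of cumulative ACK numbers seen by the sender;
    three duplicates of the same ACK trigger an immediate retransmit of
    the segment starting at that sequence number.
    """
    retransmits = []
    last, dup = None, 0
    for a in acks:
        if a == last:
            dup += 1
            if dup == threshold:
                retransmits.append(a)   # resend unacked segment at seq 'a'
        else:
            last, dup = a, 0            # new cumulative ACK resets the count
    return retransmits

# slide scenario: segment at seq 100 lost; later arrivals each elicit ACK=100
print(fast_retransmit_trace([100, 100, 100, 100, 120]))  # → [100]
```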
Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
– ignore retransmissions
• SampleRTT will vary; want estimated RTT "smoother"
– average several recent measurements, not just current SampleRTT
TCP Reliable Transport: TCP Roundtrip time and timeouts
[Plot: RTT (milliseconds) from gaia.cs.umass.edu to fantasia.eurecom.fr over time (seconds), ranging roughly 100–350 ms, showing both SampleRTT and EstimatedRTT curves]
EstimatedRTT = (1 − α) · EstimatedRTT + α · SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
TCP Reliable Transport: TCP Roundtrip time and timeouts
• timeout interval: EstimatedRTT plus "safety margin"
– large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
DevRTT = (1 − β) · DevRTT + β · |SampleRTT − EstimatedRTT|
(typically, β = 0.25)
TimeoutInterval = EstimatedRTT + 4 · DevRTT (estimated RTT plus "safety margin")
TCP Reliable Transport: TCP Roundtrip time and timeouts
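The two EWMA updates above translate directly to code. A sketch (starting values of 100 ms / 10 ms are assumed for illustration); DevRTT is updated with the old EstimatedRTT, as in RFC 6298:

```python
def update_rtt(estimated, dev, sample, alpha=0.125, beta=0.25):
    """One EWMA update from the slide (RFC 6298 style), all in seconds:
    DevRTT       = (1-b)*DevRTT + b*|SampleRTT - EstimatedRTT|
    EstimatedRTT = (1-a)*EstimatedRTT + a*SampleRTT
    TimeoutInterval = EstimatedRTT + 4*DevRTT
    """
    dev = (1 - beta) * dev + beta * abs(sample - estimated)
    estimated = (1 - alpha) * estimated + alpha * sample
    return estimated, dev, estimated + 4 * dev

est, dev = 0.100, 0.010                 # assumed starting values, seconds
est, dev, timeout = update_rtt(est, dev, sample=0.140)
print(round(timeout, 4))  # 0.175: 105 ms estimate plus 4 × 17.5 ms margin
```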
[Diagram: receiver protocol stack: application process reads from TCP socket receiver buffers; TCP code and IP code sit in the OS; data arrives from the sender]
application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
TCP Reliable Transport: Flow Control
[Diagram: receiver-side buffering: RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads flow in, data flows out to the application process]
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
– RcvBuffer size set via socket options (typical default is 4096 bytes)
– many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
TCP Reliable Transport: Flow Control
Goals for Today
• Transport Layer
– Abstraction / services
– Multiplexing / Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport
• Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
– Congestion control
• Data Center TCP
– Incast Problem
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
– lost packets (buffer overflow at routers)
– long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
– single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
– explicit rate for sender to send at
Principles of Congestion Control
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Diagram: TCP connections 1 and 2 sharing a bottleneck router of capacity R]
TCP Congestion Control: TCP Fairness
approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss
[Plot: TCP sender congestion window size (cwnd) vs time: AIMD saw tooth behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half)]
TCP Congestion Control: TCP Fairness: Why is TCP Fair? AIMD: additive increase, multiplicative decrease
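The sawtooth is easy to reproduce with a toy trace. This sketch works in units of MSS and fakes a periodic loss event (the `loss_every` schedule is an assumption for illustration, not how real loss arrives):

```python
def aimd(rtts, loss_every, cwnd=10.0):
    """Additive-increase (+1 MSS per RTT), multiplicative-decrease
    (halve on loss) sawtooth, in units of MSS."""
    trace = []
    for t in range(1, rtts + 1):
        if t % loss_every == 0:
            cwnd /= 2          # multiplicative decrease on loss
        else:
            cwnd += 1          # additive increase per loss-free RTT
        trace.append(cwnd)
    return trace

print(aimd(8, loss_every=4))
# → [11.0, 12.0, 13.0, 6.5, 7.5, 8.5, 9.5, 4.75]
```

The repeated climb-and-halve pattern is what makes two competing AIMD flows converge toward an equal share, as the next slide's trajectory plot shows.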
sender limits transmission: LastByteSent − LastByteAcked ≤ cwnd
cwnd is a dynamic function of perceived network congestion
TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes
rate ≈ cwnd / RTT bytes/sec
[Diagram: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent]
TCP Congestion Control
two competing sessions:
• additive increase gives slope of 1 as throughput increases
• multiplicative decrease decreases throughput proportionally
[Plot: Connection 1 throughput vs Connection 2 throughput, each up to R; the trajectory alternates congestion-avoidance additive increase (slope 1) with loss-triggered halving, converging toward the equal bandwidth share line]
TCP Congestion Control: TCP Fairness: Why is TCP Fair?
Fairness and UDP:
• multimedia apps often do not use TCP: do not want rate throttled by congestion control
• instead use UDP: send audio/video at constant rate, tolerate packet loss
Fairness and parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
– new app asks for 1 TCP, gets rate R/10
– new app asks for 11 TCPs, gets R/2
TCP Congestion Control: TCP Fairness
when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd
for every ACK receivedsummary initial rate is
slow but ramps up exponentially fast
Host A
one segment
RT
T
Host B
time
two segments
four segments
TCP Congestion ControlSlow Start
TCP Congestion Control
loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly
loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
Detecting and Reacting to Loss
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++
duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++
duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP Congestion Control
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
bull Timeouts are expensive(msecs to recover after loss)
TCP Loss recovery comparison
Sender Receiver
12345
Ack 1
Ack 1Ack 1Ack 1
Retransmit 2
Seq
Ack 5
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
Timeout driven recovery is slow (ms)
Data-driven recovery issuper fast (us) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control
instantiation implementation in the Internet UDP TCP
Next timebull Network Layerbull leaving the network
ldquoedgerdquo (application transport layers)
bull into the network ldquocorerdquo
Perspective
Before Next time
bull Project Proposalndash due in one weekndash Meet with groups TA and professor
bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday
bull No required reading and review duebull But review chapter 4 from the book Network Layer
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
-
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
important in application transport link layers top-10 list of important networking topics
Principles of Reliable Transport
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
important in application transport link layers top-10 list of important networking topics
Principles of Reliable Transport
sendside
receiveside
rdt_send() called from above (eg by app) Passed data to deliver to receiver upper layer
udt_send() called by rdtto transfer packet over unreliable channel to receiver
rdt_rcv() called when packet arrives on rcv-side of channel
deliver_data() called by rdt to deliver data to upper
Principles of Reliable Transport
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
source port dest port
32 bits
applicationdata (variable length)
sequence number
acknowledgement number
receive window
Urg data pointerchecksum
FSRPAUheadlen
notused
options (variable length)
URG urgent data (generally not used)
ACK ACK valid
PSH push data now(generally not used)
RST SYN FINconnection estab(setup teardown
commands)
bytes rcvr willingto accept
countingby bytes of data(not segments)
Internetchecksum
(as in UDP)
TCP Reliable TransportTCP Segment Structure
sequence numbers
ndashbyte stream ldquonumberrdquo of first byte in segmentrsquos data
acknowledgements
ndashseq of next byte expected from other side
ndashcumulative ACKQ how receiver handles out-of-order segments
ndashA TCP spec doesnrsquot say - up to implementor
source port dest port
sequence number
acknowledgement number
checksum
rwnd
urg pointer
incoming segment to sender
A
sent ACKed
sent not-yet ACKed(ldquoin-flightrdquo)
usablebut not yet sent
not usable
window size N
sender sequence number space
source port dest port
sequence number
acknowledgement number
checksum
rwnd
urg pointer
outgoing segment from sender
TCP Reliable TransportTCP Sequence numbers and Acks
Usertypes
lsquoCrsquo
host ACKsreceipt
of echoedlsquoCrsquo
host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo
simple telnet scenario
Host BHost A
Seq=42 ACK=79 data = lsquoCrsquo
Seq=79 ACK=43 data = lsquoCrsquo
Seq=43 ACK=80
TCP Reliable TransportTCP Sequence numbers and Acks
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing
to establish connection)bull agree on connection parameters
connection state ESTABconnection variables
seq client-to-server server-to-clientrcvBuffer size at serverclient
application
network
connection state ESTABconnection Variables
seq client-to-server server-to-clientrcvBuffer size at serverclient
application
network
Socket clientSocket = newSocket(hostnameport
number)
Socket connectionSocket = welcomeSocketaccept()
Connection Management TCP 3-way handshake
TCP Reliable Transport
SYNbit=1 Seq=x
choose init seq num xsend TCP SYN msg
ESTAB
SYNbit=1 Seq=yACKbit=1 ACKnum=x+1
choose init seq num ysend TCP SYNACKmsg acking SYN
ACKbit=1 ACKnum=y+1
received SYNACK(x) indicates server is livesend ACK for SYNACK
this segment may contain client-to-server data
received ACK(y) indicates client is live
SYNSENT
ESTAB
SYN RCVD
client state
LISTEN
server state
LISTEN
TCP Reliable TransportConnection Management TCP 3-way handshake
closed
L
listen
SYNrcvd
SYNsent
ESTAB
Socket clientSocket = newSocket(hostnameport
number)
SYN(seq=x)
Socket connectionSocket = welcomeSocketaccept()
SYN(x)
SYNACK(seq=yACKnum=x+1)create new socket for communication back to client
SYNACK(seq=yACKnum=x+1)
ACK(ACKnum=y+1)ACK(ACKnum=y+1)
L
TCP Reliable TransportConnection Management TCP 3-way handshake
Closing a connection: client and server each close their side of the connection by sending a TCP segment with FIN bit = 1, and respond to a received FIN with an ACK. On receiving a FIN, the ACK can be combined with the host's own FIN; simultaneous FIN exchanges can be handled.
client state: ESTAB; server state: ESTAB
1. client calls clientSocket.close(); sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send, but can receive data)
2. server ACKs (ACKbit=1, ACKnum=x+1) and enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (waits for server close)
3. server sends FINbit=1, seq=y; enters LAST_ACK (can no longer send data)
4. client ACKs (ACKbit=1, ACKnum=y+1) and enters TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED; server enters CLOSED on receiving the ACK
TCP Reliable Transport: Connection Management, closing a connection
TCP sender events:
• data rcvd from app: create segment with seq # (seq # is byte-stream number of first data byte in segment); start timer if not already running (think of the timer as for the oldest unacked segment); expiration interval: TimeOutInterval
• timeout: retransmit segment that caused timeout; restart timer
• ack rcvd: if ack acknowledges previously unacked segments, update what is known to be ACKed; start timer if there are still unacked segments
TCP Reliable Transport
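The three sender events can be sketched as a toy model (our own simplification, not real TCP: one logical timer, no windowing, and `ToySender` is a hypothetical name):

```python
class ToySender:
    """Toy model of the TCP sender events: send, timeout, ack received."""
    def __init__(self):
        self.next_seq = 0        # byte-stream number of the next byte to send
        self.send_base = 0       # oldest unacked byte
        self.timer_running = False
        self.unacked = {}        # seq -> segment length

    def send(self, nbytes):
        # data received from app: segment is numbered by its first data byte
        seq = self.next_seq
        self.unacked[seq] = nbytes
        self.next_seq += nbytes
        if not self.timer_running:   # timer covers the oldest unacked segment
            self.timer_running = True
        return seq

    def on_timeout(self):
        # retransmit the segment that caused the timeout; restart timer
        self.timer_running = True
        return self.send_base

    def on_ack(self, acknum):
        # cumulative ACK: acknum is the next byte the receiver expects
        if acknum > self.send_base:
            for seq in list(self.unacked):
                if seq + self.unacked[seq] <= acknum:
                    del self.unacked[seq]
            self.send_base = acknum
            self.timer_running = bool(self.unacked)  # only while data is in flight
```

For example, after sending two 8-byte segments, an ACK of 8 advances SendBase to 8 with the timer still running; an ACK of 16 clears everything.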
TCP retransmission scenarios (Host A sending to Host B):
• lost ACK: A sends Seq=92, 8 bytes of data; B's ACK=100 is lost; after a timeout, A retransmits Seq=92, 8 bytes, and B re-sends ACK=100
• premature timeout: A sends Seq=92 (8 bytes, SendBase=92) and Seq=100 (20 bytes); the timer for Seq=92 expires before its ACK arrives, so A retransmits Seq=92, 8 bytes; B replies with cumulative ACK=120, and SendBase advances 92 → 100 → 120
• cumulative ACK: A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); ACK=100 is lost, but ACK=120 arrives before the timeout; the cumulative ACK=120 covers both segments, so A continues with Seq=120, 15 bytes
TCP Reliable Transport: TCP retransmission scenarios
TCP ACK generation [RFC 1122, RFC 2581] (event at receiver → TCP receiver action):
• arrival of in-order segment with expected seq #, all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #, one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq # (gap detected) → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit:
• time-out period is often relatively long: long delay before resending a lost packet
• detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if a segment is lost, there will likely be many duplicate ACKs
• if sender receives 3 ACKs for the same data ("triple duplicate ACKs"), resend the unacked segment with the smallest seq #; it is likely that the unacked segment was lost, so don't wait for the timeout
TCP Reliable Transport: TCP Fast Retransmit
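The duplicate-ACK bookkeeping behind fast retransmit can be sketched as follows (a simplification; real TCP couples this counter with fast recovery, and the function name is ours):

```python
def make_fast_retransmit_detector():
    """Track duplicate ACKs; signal a retransmit on the 3rd duplicate."""
    state = {"last_ack": None, "dup_count": 0}

    def on_ack(acknum):
        if acknum == state["last_ack"]:
            state["dup_count"] += 1
            if state["dup_count"] == 3:
                # triple duplicate ACK: resend the unacked segment with the
                # smallest seq # (the one starting at acknum) immediately,
                # without waiting for the timeout
                return ("retransmit", acknum)
        else:
            state["last_ack"] = acknum   # new cumulative ACK resets the count
            state["dup_count"] = 0
        return ("wait", acknum)

    return on_ack
```

Feeding it the same ACK four times (the original plus three duplicates) triggers the retransmit signal on the fourth call.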
Example (fast retransmit after sender receipt of triple duplicate ACK): A sends Seq=92 (8 bytes) and then Seq=100 (20 bytes), which is lost; each later arrival at B triggers another ACK=100; after the third duplicate ACK=100, A retransmits Seq=100, 20 bytes, well before the timeout fires.
TCP Reliable Transport: TCP Fast Retransmit
Q: how to set the TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt (ignore retransmissions)
• SampleRTT will vary; want the estimated RTT "smoother": average several recent measurements, not just the current SampleRTT
TCP Reliable Transport: TCP round-trip time and timeouts
[plot: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; RTT (milliseconds, roughly 100-350 ms) vs time (seconds); SampleRTT scatters widely while EstimatedRTT is smoother]
EstimatedRTT = (1 - α) · EstimatedRTT + α · SampleRTT
• exponential weighted moving average: influence of a past sample decreases exponentially fast
• typical value: α = 0.125
TCP Reliable Transport: TCP round-trip time and timeouts
• timeout interval: EstimatedRTT plus a "safety margin"; large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT's deviation from EstimatedRTT:
DevRTT = (1 - β) · DevRTT + β · |SampleRTT - EstimatedRTT|   (typically β = 0.25)
TimeoutInterval = EstimatedRTT + 4 · DevRTT   (estimated RTT plus "safety margin")
TCP Reliable Transport: TCP round-trip time and timeouts
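One EWMA update step with the slide's constants can be written out directly (a sketch; following the convention that the deviation is measured against the previous estimate):

```python
ALPHA = 0.125   # weight of the new SampleRTT in EstimatedRTT
BETA = 0.25     # weight of the new deviation sample in DevRTT

def update_timeout(estimated_rtt, dev_rtt, sample_rtt):
    """One update; returns (EstimatedRTT, DevRTT, TimeoutInterval), all in ms."""
    # DevRTT = (1 - beta) * DevRTT + beta * |SampleRTT - EstimatedRTT|
    dev_rtt = (1 - BETA) * dev_rtt + BETA * abs(sample_rtt - estimated_rtt)
    # EstimatedRTT = (1 - alpha) * EstimatedRTT + alpha * SampleRTT
    estimated_rtt = (1 - ALPHA) * estimated_rtt + ALPHA * sample_rtt
    # timeout = estimate plus a 4x-deviation "safety margin"
    timeout = estimated_rtt + 4 * dev_rtt
    return estimated_rtt, dev_rtt, timeout
```

For instance, starting from EstimatedRTT = 100 ms, DevRTT = 5 ms, a 120 ms sample yields EstimatedRTT = 102.5 ms, DevRTT = 8.75 ms, TimeoutInterval = 137.5 ms.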
Flow control: the receiver controls the sender, so the sender won't overflow the receiver's buffer by transmitting too much, too fast.
The application may remove data from the TCP socket buffers slower than the TCP receiver is delivering it (and the sender is sending it).
[figure: receiver protocol stack: application process reads from TCP socket receiver buffers; TCP code and IP code sit in the OS; segments arrive from the sender]
TCP Reliable Transport: Flow Control
Receiver-side buffering: RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads are delivered from it to the application process.
• receiver "advertises" free buffer space by including the rwnd value in the TCP header of receiver-to-sender segments
– RcvBuffer size set via socket options (typical default is 4096 bytes)
– many operating systems autoadjust RcvBuffer
• sender limits the amount of unacked ("in-flight") data to the receiver's rwnd value
• guarantees the receive buffer will not overflow
TCP Reliable Transport: Flow Control
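The receive buffer behind rwnd can be inspected and set through socket options; a sketch (note the kernel may round, double, or cap the requested value, so only a lower bound is meaningful):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# default receive buffer size chosen by the OS (usually far above 4096 today)
default_rcvbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

# request a specific RcvBuffer; the advertised rwnd is bounded by this space
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 65536)
actual = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)  # e.g. Linux reports 2x
s.close()
```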
Goals for Today
• Transport Layer
– Abstraction / services
– Multiplexing / Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport (abstraction, connection management, reliable transport, flow control, timeouts)
– Congestion control
• Data Center TCP
– Incast Problem
congestion, informally: "too many sources sending too much data too fast for the network to handle"
• different from flow control
• manifestations:
– lost packets (buffer overflow at routers)
– long delays (queueing in router buffers)
Principles of Congestion Control
Two broad approaches towards congestion control:
• end-end congestion control: no explicit feedback from the network; congestion inferred from end-system observed loss and delay; approach taken by TCP
• network-assisted congestion control: routers provide feedback to end systems, either a single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM) or an explicit rate for the sender to send at
Principles of Congestion Control
Fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K.
[figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]
TCP Congestion Control: TCP Fairness
AIMD (additive increase, multiplicative decrease):
approach: the sender increases its transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss
[figure: AIMD sawtooth: TCP sender congestion window size (cwnd) vs time; additively increase window size until loss occurs, then cut window in half; probing for bandwidth]
TCP Congestion Control: TCP Fairness, why is TCP fair?
• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is a dynamic function of perceived network congestion
• TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes; rate ≈ cwnd / RTT bytes/sec
[figure: sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent; window of cwnd bytes]
TCP Congestion Control
Why is TCP fair? Consider two competing sessions: additive increase gives a slope of 1 as throughput increases; multiplicative decrease decreases throughput proportionally.
[figure: Connection 1 throughput vs Connection 2 throughput, each axis up to R; repeated congestion-avoidance additive increase and loss-triggered halving of the window move the operating point toward the equal-bandwidth-share line]
TCP Congestion Control: TCP Fairness, why is TCP fair?
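The convergence argument can be checked with a toy simulation of two synchronized AIMD flows sharing a capacity R (our own illustrative model, not a packet-level simulator): additive increase preserves the difference between the flows, while each synchronized multiplicative decrease halves it.

```python
def simulate_aimd(w1, w2, capacity, rounds):
    """Two synchronized AIMD flows: +1 each RTT; both halve when demand > capacity."""
    for _ in range(rounds):
        if w1 + w2 > capacity:
            # synchronized loss: multiplicative decrease halves both windows,
            # and therefore also halves the *difference* between them
            w1, w2 = w1 / 2, w2 / 2
        else:
            # additive increase preserves the difference
            w1, w2 = w1 + 1, w2 + 1
    return w1, w2

# start far from fair (80 vs 10) on a link of capacity 100
w1, w2 = simulate_aimd(80.0, 10.0, 100.0, 500)
```

After a few hundred rounds the two windows are essentially equal, illustrating why AIMD drives competing flows toward the equal-bandwidth share.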
Fairness and UDP: multimedia apps often do not use TCP, since they do not want their rate throttled by congestion control; instead they use UDP, sending audio/video at a constant rate and tolerating packet loss.
Fairness and parallel TCP connections: an application can open multiple parallel connections between two hosts; web browsers do this. E.g., on a link of rate R with 9 existing connections, a new app asking for 1 TCP connection gets rate R/10, while a new app asking for 11 TCP connections gets R/2.
TCP Congestion Control: TCP Fairness
Slow start: when a connection begins, increase the rate exponentially until the first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT, done by incrementing cwnd for every ACK received
summary: initial rate is slow but ramps up exponentially fast
[figure: Host A sends one segment to Host B, waits one RTT for its ACK, then sends two segments, then four segments, ...]
TCP Congestion Control: Slow Start
TCP Congestion Control: Detecting and Reacting to Loss
• loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to a threshold, then grows linearly
• loss indicated by 3 duplicate ACKs (TCP Reno): dup ACKs indicate the network is capable of delivering some segments; cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (whether timeout or 3 duplicate ACKs)
Switching from Slow Start to Congestion Avoidance (CA):
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation: a variable ssthresh; on a loss event, ssthresh is set to 1/2 of cwnd just before the loss event.
TCP congestion control FSM (states: slow start, congestion avoidance, fast recovery):
slow start (entry: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0)
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• cwnd ≥ ssthresh: go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment (stay in slow start)
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
congestion avoidance
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
fast recovery
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s) as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
TCP Congestion Control
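The Reno state machine can be sketched as an event handler (a simplification under stated assumptions: cwnd in MSS units, no actual transmission or retransmission, and `TcpRenoLite` is a hypothetical name):

```python
class TcpRenoLite:
    """Toy TCP Reno FSM: slow start / congestion avoidance / fast recovery."""
    def __init__(self, ssthresh=64):
        self.cwnd = 1.0                  # congestion window, in MSS units
        self.ssthresh = float(ssthresh)
        self.dup_acks = 0
        self.state = "slow_start"

    def on_new_ack(self):
        self.dup_acks = 0
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh        # deflate window, back to CA
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += 1                   # +1 MSS per ACK: doubles per RTT
            if self.cwnd >= self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += 1.0 / self.cwnd     # roughly +1 MSS per RTT

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += 1                   # window inflation
            return
        self.dup_acks += 1
        if self.dup_acks == 3:               # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1.0
        self.dup_acks = 0
        self.state = "slow_start"
```

Walking through it: with ssthresh = 8, seven ACKs grow cwnd exponentially to 8 and switch to congestion avoidance; three duplicate ACKs then set ssthresh = 4 and enter fast recovery; a new ACK deflates cwnd to 4; a timeout drops cwnd back to 1.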
TCP Throughput
• avg TCP throughput as a function of window size and RTT (ignore slow start; assume there is always data to send)
• W: window size (measured in bytes) where loss occurs
– avg window size (# in-flight bytes) is 3/4 W
– avg throughput is 3/4 W per RTT: avg TCP throughput = (3/4) · W / RTT bytes/sec
[figure: sawtooth oscillating between W/2 and W]
TCP over "long, fat pipes"
• example: 1500-byte segments, 100 ms RTT; want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability L [Mathis 1997]:
TCP throughput = 1.22 · MSS / (RTT · √L)
• to achieve 10 Gbps throughput, need a loss rate of L = 2 · 10⁻¹⁰, a very small loss rate
• new versions of TCP for high-speed
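The slide's loss-rate figure follows by inverting the Mathis formula for L; a quick check (MSS taken as the 1500-byte segment size, converted to bits to match the bps target):

```python
import math

def mathis_loss_rate(throughput_bps, mss_bytes, rtt_s):
    """Invert throughput = 1.22 * MSS / (RTT * sqrt(L)) for the loss rate L."""
    mss_bits = mss_bytes * 8
    sqrt_l = 1.22 * mss_bits / (rtt_s * throughput_bps)
    return sqrt_l ** 2

# 10 Gbps target, 1500-byte segments, 100 ms RTT -> L on the order of 2e-10
L = mathis_loss_rate(10e9, 1500, 0.100)
```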
Goals for Today
• Transport Layer
– Abstraction / services
– Multiplexing / Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport (abstraction, connection management, reliable transport, flow control, timeouts)
– Congestion control
• Data Center TCP
– Incast Problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems," A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008.
TCP Throughput Collapse: what happens when TCP is "too friendly"? E.g., a test on an Ethernet-based storage cluster:
• client performs synchronized reads
• increase # of servers involved in the transfer; SRU size is fixed
• TCP used as the data transfer protocol
Cluster-based Storage Systems
[figure: client connected through a switch to storage servers 1-4; the client issues a synchronized read by sending requests (R) to all servers; each server returns its Server Request Unit (SRU) of the data block; the client sends the next batch of requests only after receiving all of 1, 2, 3, 4]
Link idle time due to timeouts
[figure: same synchronized read over servers 1-4 with fixed Server Request Unit (SRU); server 4's response is lost, so the link is idle until server 4 experiences a timeout]
TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• cause of throughput collapse: TCP timeouts
[figure: goodput collapses as the number of servers increases]
TCP data-driven loss recovery
[figure: sender transmits Seq 1-5; packet 2 is lost; the receiver sends Ack 1 for each of packets 3, 4, 5, giving 3 duplicate ACKs for 1 (packet 2 is probably lost); the sender retransmits packet 2 immediately, and the receiver then sends Ack 5]
In SANs, data-driven recovery takes µsecs after a loss.
TCP timeout-driven loss recovery
[figure: sender transmits Seq 1-5; only packet 1 arrives and is ACKed; with too few duplicate ACKs for fast retransmit, the sender must wait a full Retransmission Timeout (RTO) before resending]
• timeouts are expensive (msecs to recover after loss)
TCP loss recovery comparison
• data-driven recovery is super fast (µs) in SANs: duplicate ACKs trigger retransmission of the lost packet almost immediately, and the receiver then ACKs the full window (Ack 5)
• timeout-driven recovery is slow (ms): the sender sits idle for a full RTO before retransmitting
TCP Throughput Collapse: Summary
• synchronized reads and TCP timeouts cause TCP throughput collapse
• previously tried options:
– increase buffer size (costly)
– reduce RTOmin (unsafe)
– use Ethernet flow control (limited applicability)
• DCTCP (Data Center TCP): limit in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications
Perspective
• principles behind transport layer services: multiplexing/demultiplexing, reliable data transfer, flow control, congestion control
• instantiation and implementation in the Internet: UDP, TCP
Next time: Network Layer; leaving the network "edge" (application, transport layers) and moving into the network "core".
Before Next Time
• Project proposal
– due in one week
– meet with groups, TA, and professor
• Lab 1
– single-threaded TCP proxy
– due in one week, next Friday
• No required reading and review due, but review chapter 4 from the book (Network Layer)
– we will also briefly discuss data center topologies
• Check website for updated schedule
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing
to establish connection)bull agree on connection parameters
connection state ESTABconnection variables
seq client-to-server server-to-clientrcvBuffer size at serverclient
application
network
connection state ESTABconnection Variables
seq client-to-server server-to-clientrcvBuffer size at serverclient
application
network
Socket clientSocket = newSocket(hostnameport
number)
Socket connectionSocket = welcomeSocketaccept()
Connection Management TCP 3-way handshake
TCP Reliable Transport
SYNbit=1 Seq=x
choose init seq num xsend TCP SYN msg
ESTAB
SYNbit=1 Seq=yACKbit=1 ACKnum=x+1
choose init seq num ysend TCP SYNACKmsg acking SYN
ACKbit=1 ACKnum=y+1
received SYNACK(x) indicates server is livesend ACK for SYNACK
this segment may contain client-to-server data
received ACK(y) indicates client is live
SYNSENT
ESTAB
SYN RCVD
client state
LISTEN
server state
LISTEN
TCP Reliable TransportConnection Management TCP 3-way handshake
closed
L
listen
SYNrcvd
SYNsent
ESTAB
Socket clientSocket = newSocket(hostnameport
number)
SYN(seq=x)
Socket connectionSocket = welcomeSocketaccept()
SYN(x)
SYNACK(seq=yACKnum=x+1)create new socket for communication back to client
SYNACK(seq=yACKnum=x+1)
ACK(ACKnum=y+1)ACK(ACKnum=y+1)
L
TCP Reliable TransportConnection Management TCP 3-way handshake
client server each close their side of connection send TCP segment with FIN bit = 1
respond to received FIN with ACK on receiving FIN ACK can be combined with
own FINsimultaneous FIN exchanges can be
handled
TCP Reliable TransportConnection Management Closing connection
FIN_WAIT_2
CLOSE_WAIT
FINbit=1 seq=y
ACKbit=1 ACKnum=y+1
ACKbit=1 ACKnum=x+1 wait for server
close
can stillsend data
can no longersend data
LAST_ACK
CLOSED
TIMED_WAIT
timed wait for 2max
segment lifetime
CLOSED
FIN_WAIT_1 FINbit=1 seq=xcan no longersend but can receive data
clientSocketclose()
client state server state
ESTABESTAB
TCP Reliable TransportConnection Management Closing connection
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
data rcvd from appcreate segment with seq
seq is byte-stream
number of first data byte in segment
start timer if not already running think of timer as for
oldest unacked segment expiration interval
TimeOutInterval
timeoutretransmit segment that
caused timeoutrestart timer ack rcvdif ack acknowledges
previously unacked segments update what is known to
be ACKed start timer if there are
still unacked segments
TCP Reliable Transport
lost ACK scenario
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8 bytes of data
Xtim
eout
ACK=100
premature timeout
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8bytes of data
tim
eout
ACK=120
Seq=100 20 bytes of data
ACK=120
SendBase=100
SendBase=120
SendBase=120
SendBase=92
TCP Reliable TransportTCP Retransmission Scenerios
X
cumulative ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=120 15 bytes of data
tim
eout
Seq=100 20 bytes of data
ACK=120
TCP Reliable TransportTCP Retransmission Scenerios
event at receiver
arrival of in-order segment withexpected seq All data up toexpected seq already ACKed
arrival of in-order segment withexpected seq One other segment has ACK pending
arrival of out-of-order segmenthigher-than-expect seq Gap detected
arrival of segment that partially or completely fills gap
TCP receiver action
delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK
immediately send single cumulative ACK ACKing both in-order segments
immediately send duplicate ACK indicating seq of next expected byte
immediate send ACK provided thatsegment starts at lower end of gap
TCP ACK generation [RFC 1122 2581]Reliable Transport
time-out period often relatively long long delay before
resending lost packetdetect lost segments via
duplicate ACKs sender often sends many
segments back-to-back if segment is lost there
will likely be many duplicate ACKs
if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq
likely that unacked segment lost so donrsquot wait for timeout
TCP fast retransmit
(ldquotriple duplicate ACKsrdquo)
TCP Reliable TransportTCP Fast Retransmit
X
fast retransmit after sender receipt of triple duplicate ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
tim
eout
ACK=100
ACK=100
ACK=100
Seq=100 20 bytes of data
Seq=100 20 bytes of data
TCP Reliable TransportTCP Fast Retransmit
Q how to set TCP timeout value
longer than RTT but RTT varies
too short premature timeout unnecessary retransmissions
too long slow reaction to segment loss
Q how to estimate RTT
bull SampleRTT measured time from segment transmission until ACK receipt
ndash ignore retransmissions
bull SampleRTT will vary want estimated RTT ldquosmootherrdquo
ndash average several recent measurements not just current SampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
RTT gaiacsumassedu to fantasiaeurecomfr
100
150
200
250
300
350
1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106
time (seconnds)
RTT
(mill
isec
onds
)
SampleRTT Estimated RTT
EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases
exponentially fast typical value = 0125
RTT (
mill
iseco
nds)
RTT gaiacsumassedu to fantasiaeurecomfr
sampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
time (seconds)
bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin
bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|
(typically = 025)
TimeoutInterval = EstimatedRTT + 4DevRTT
estimated RTT ldquosafety marginrdquo
TCP Reliable TransportTCP Roundtrip time and timeouts
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
applicationprocess
TCP socketreceiver buffers
TCPcode
IPcode
application
OS
receiver protocol stack
application may remove data from
TCP socket buffers hellip
hellip slower than TCP receiver is delivering(sender is sending)
from sender
receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast
flow control
TCP Reliable TransportFlow Control
buffered data
free buffer spacerwnd
RcvBuffer
TCP segment payloads
to application processbull receiver ldquoadvertisesrdquo free
buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via
socket options (typical default is 4096 bytes)
ndash many operating systems autoadjust RcvBuffer
bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value
bull guarantees receive buffer will not overflow
receiver-side buffering
TCP Reliable TransportFlow Control
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
congestionbull informally ldquotoo many sources sending too
much data too fast for network to handlerdquobull different from flow controlbull manifestations
ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control
end-end congestion control
no explicit feedback from network
congestion inferred from end-system observed loss delay
approach taken by TCP
network-assisted congestion control
routers provide feedback to end systemssingle bit indicating
congestion (SNA DECbit TCPIP ECN ATM)
explicit rate for sender to send at
Principles of Congestion Control
fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK
TCP connection 1
bottleneckroutercapacity RTCP connection 2
TCP Congestion ControlTCP Fairness
approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1
MSS every RTT until loss detected multiplicative decrease cut cwnd in half
after loss
cwnd
TC
P s
end
er
cong
estio
n w
indo
w s
ize
AIMD saw toothbehavior probing
for bandwidth
additively increase window size helliphellip until loss occurs (then cut window in half)
time
TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease
sender limits transmission
cwnd is dynamic function of perceived network congestion
TCP sending rateroughly send cwnd
bytes wait RTT for ACKS then send more bytes
last byteACKed sent not-
yet ACKed(ldquoin-flightrdquo)
last byte sent
cwnd
LastByteSent-LastByteAcked
lt cwnd
sender sequence number space
rate ~~cwnd
RTTbytessec
TCP Congestion Control
two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally
R
R
equal bandwidth share
Connection 1 throughput
Con
nect
ion
2 th
roug
h pu t
congestion avoidance additive increaseloss decrease window by factor of 2
congestion avoidance additive increaseloss decrease window by factor of 2
TCP Congestion ControlTCP Fairness Why is TCP Fair
Fairness and UDPmultimedia apps
often do not use TCP do not want rate
throttled by congestion control
instead use UDP send audiovideo at
constant rate tolerate packet loss
Fairness parallel TCP connections
application can open multiple parallel connections between two hosts
web browsers do this eg link of rate R with 9
existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2
TCP Congestion ControlTCP Fairness
when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd
for every ACK receivedsummary initial rate is
slow but ramps up exponentially fast
Host A
one segment
RT
T
Host B
time
two segments
four segments
TCP Congestion ControlSlow Start
TCP Congestion Control
loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly
loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
Detecting and Reacting to Loss
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
[FSM: TCP Reno congestion control, with states slow start, congestion avoidance, and fast recovery]
- initialization: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0
- slow start: on new ACK, cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed; when cwnd > ssthresh, move to congestion avoidance
- congestion avoidance: on new ACK, cwnd = cwnd + MSS * (MSS/cwnd), dupACKcount = 0, transmit new segment(s) as allowed
- duplicate ACK (in slow start or congestion avoidance): dupACKcount++; when dupACKcount == 3, set ssthresh = cwnd/2, cwnd = ssthresh + 3 MSS, retransmit missing segment, enter fast recovery
- fast recovery: on duplicate ACK, cwnd = cwnd + MSS and transmit new segment(s) as allowed; on new ACK, cwnd = ssthresh, dupACKcount = 0, return to congestion avoidance
- timeout (any state): ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment, enter slow start
TCP Congestion Control
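The three states and their transitions can be written as a small event-driven sketch (illustrative Python; class and method names are mine, with MSS normalized to 1):

```python
class Reno:
    """Minimal sketch of the Reno congestion-control state machine."""
    MSS = 1

    def __init__(self):
        self.state = "slow_start"
        self.cwnd = 1 * self.MSS
        self.ssthresh = 64
        self.dup_acks = 0

    def on_new_ack(self):
        self.dup_acks = 0
        if self.state == "slow_start":
            self.cwnd += self.MSS                   # exponential growth
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += self.MSS * self.MSS / self.cwnd  # ~ +1 MSS per RTT
        elif self.state == "fast_recovery":
            self.cwnd = self.ssthresh               # deflate window
            self.state = "congestion_avoidance"

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += self.MSS                   # inflate per extra dup ACK
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                      # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.MSS
            self.state = "fast_recovery"            # retransmit missing segment

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * self.MSS
        self.dup_acks = 0
        self.state = "slow_start"                   # retransmit missing segment
```

Feeding it events reproduces the transitions: three duplicate ACKs move it to fast recovery with cwnd = ssthresh + 3; a timeout always drops it back to slow start with cwnd = 1 MSS.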
- avg TCP throughput as a function of window size and RTT (ignore slow start, assume there is always data to send)
- W: window size (measured in bytes) where loss occurs
  - avg window size (# in-flight bytes) is 3/4 W
  - avg throughput is 3/4 W per RTT

  avg TCP throughput = (3/4) W / RTT  bytes/sec

[Figure: sawtooth of window size oscillating between W/2 and W]
TCP Throughput
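Plugging numbers into the (3/4)·W/RTT approximation (an illustrative sketch; the function name and example values are mine):

```python
def avg_throughput(w_bytes, rtt_s):
    """Average TCP throughput in bytes/sec, given the window size W
    (bytes) at which loss occurs and the RTT (seconds)."""
    return 0.75 * w_bytes / rtt_s

# e.g., W = 100 KB and RTT = 100 ms -> roughly 750 KB/s
print(avg_throughput(100_000, 0.1))
```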
TCP over "long fat pipes"
- example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
- requires W = 83,333 in-flight segments
- throughput in terms of segment loss probability L [Mathis 1997]:

  TCP throughput = (1.22 * MSS) / (RTT * sqrt(L))

- to achieve 10 Gbps throughput, need a loss rate of L = 2*10^-10, a very small loss rate!
- new versions of TCP are needed for high-speed links
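The Mathis relation, throughput ≈ 1.22·MSS/(RTT·√L), can be inverted to recover the loss rate required for the 10 Gbps example (a sketch; function names are mine, and the formula is applied with MSS in bytes and the target rate in bits/sec):

```python
import math

def mathis_throughput(mss_bytes, rtt_s, loss_rate):
    """Approximate TCP throughput in bytes/sec [Mathis 1997]."""
    return 1.22 * mss_bytes / (rtt_s * math.sqrt(loss_rate))

def required_loss_rate(target_bps, mss_bytes, rtt_s):
    """Invert the formula: the loss rate L that sustains target_bps."""
    return (1.22 * 8 * mss_bytes / (rtt_s * target_bps)) ** 2

L = required_loss_rate(10e9, 1500, 0.100)
print(f"{L:.1e}")  # on the order of 2e-10, as the slide states
```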
Goals for Today
- Transport Layer
  - Abstraction / services
  - Multiplexing/Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport
    - Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  - Congestion control
- Data Center TCP
  - Incast Problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems," A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
- Test on an Ethernet-based storage cluster
- Client performs synchronized reads
- Increase # of servers involved in transfer (SRU size is fixed)
- TCP used as the data transfer protocol
Cluster-based Storage Systems
[Figure: a client connected through a switch to storage servers 1-4. The client sends a request R to each server; each server returns its Server Request Unit (SRU) of the data block. In a synchronized read, the client sends the next batch of requests only after all of servers 1-4 have responded.]
Link idle time due to timeouts
[Figure: the same synchronized read, but server 4's response is lost; the link is idle until server 4 experiences a timeout and retransmits its SRU.]
TCP Throughput Collapse: Incast
- [Nagle04] called this Incast
- Cause of throughput collapse: TCP timeouts
[Figure: goodput collapses as the number of servers increases]
TCP data-driven loss recovery
[Figure: sender transmits segments 1-5; packet 2 is lost. Segments 3, 4, and 5 each trigger Ack 1, giving 3 duplicate ACKs for 1 (packet 2 is probably lost). The sender retransmits packet 2 immediately and then receives Ack 5.]
In SANs, data-driven recovery completes in microseconds after a loss.
TCP timeout-driven loss recovery
[Figure: sender transmits segments 1-5; segments 2-5 are lost, so only Ack 1 returns. The sender must wait out a full Retransmission Timeout (RTO) before resending.]
- Timeouts are expensive (msecs to recover after loss)
TCP Loss recovery comparison
[Figure: side by side. Data-driven: 3 duplicate ACKs trigger an immediate retransmit of segment 2, followed by Ack 5. Timeout-driven: the sender sits idle for a full Retransmission Timeout (RTO) before resending.]
- Timeout-driven recovery is slow (ms)
- Data-driven recovery is super fast (us) in SANs
TCP Throughput Collapse: Summary
- Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
- Previously tried options:
  - Increase buffer size (costly)
  - Reduce RTOmin (unsafe)
  - Use Ethernet Flow Control (limited applicability)
- DCTCP (Data Center TCP): limit in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications
Perspective
- principles behind transport layer services: multiplexing/demultiplexing, reliable data transfer, flow control, congestion control
- instantiation, implementation in the Internet: UDP, TCP
- Next time: Network Layer; leaving the network "edge" (application, transport layers) and going into the network "core"

Before Next time
- Project Proposal: due in one week; meet with groups, TA, and professor
- Lab1: single-threaded TCP proxy, due in one week (next Friday)
- No required reading and review due, but review chapter 4 from the book (Network Layer)
  - We will also briefly discuss data center topologies
- Check website for updated schedule
[Figure: reliable data transfer interface, send side and receive side]
- rdt_send(): called from above (e.g., by app); passes data to deliver to receiver's upper layer
- udt_send(): called by rdt to transfer packet over unreliable channel to receiver
- rdt_rcv(): called when packet arrives on rcv-side of channel
- deliver_data(): called by rdt to deliver data to upper layer
Principles of Reliable Transport
TCP (Transmission Control Protocol): RFCs 793, 1122, 1323, 2018, 2581
- point-to-point: one sender, one receiver
- reliable, in-order byte stream: no "message boundaries"
- pipelined: TCP congestion and flow control set window size
- full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
- connection-oriented: handshaking (exchange of control msgs) inits sender and receiver state before data exchange
- flow controlled: sender will not overwhelm receiver
TCP Reliable Transport
[Figure: TCP segment structure, 32 bits wide]
- source port #, dest port #
- sequence number: counting by bytes of data (not segments!)
- acknowledgement number
- head len, not used, flag bits, receive window: # bytes rcvr willing to accept
- checksum (Internet checksum, as in UDP), urgent data pointer
- options (variable length)
- application data (variable length)
Flag bits: URG: urgent data (generally not used); ACK: ACK # valid; PSH: push data now (generally not used); RST, SYN, FIN: connection estab (setup, teardown commands)
TCP Reliable Transport: TCP Segment Structure
sequence numbers:
- byte stream "number" of first byte in segment's data
acknowledgements:
- seq # of next byte expected from other side
- cumulative ACK
Q: how does the receiver handle out-of-order segments?
A: the TCP spec doesn't say; up to the implementor
[Figure: an outgoing segment from the sender carries the sequence number; the incoming segment's acknowledgement number tells the sender what has been received. Sender sequence number space, window size N: sent & ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable.]
TCP Reliable Transport: TCP Sequence numbers and Acks
[Figure: simple telnet scenario between Host A and Host B. User types 'C': A sends Seq=42, ACK=79, data='C'. B ACKs receipt of 'C' and echoes back 'C': Seq=79, ACK=43, data='C'. A ACKs receipt of the echoed 'C': Seq=43, ACK=80.]
TCP Reliable Transport: TCP Sequence numbers and Acks
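The bookkeeping in the telnet scenario reduces to one rule: a cumulative ACK carries the sequence number of the next byte expected, i.e. ACK = seq + len(data). A tiny worked check (illustrative; the helper name is mine):

```python
def ack_for(seq, data):
    """Cumulative ACK: sequence number of the next byte expected."""
    return seq + len(data)

# A sends 'C' at Seq=42; B echoes at Seq=79 with ACK=43; A then ACKs 80.
assert ack_for(42, b"C") == 43   # B acks A's one byte
assert ack_for(79, b"C") == 80   # A acks B's echoed byte
```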
before exchanging data, sender and receiver "handshake":
- agree to establish connection (each knowing the other is willing to establish connection)
- agree on connection parameters
Each side ends with connection state ESTAB and connection variables: seq # (client-to-server, server-to-client), rcvBuffer size at server and client.
Client: Socket clientSocket = new Socket(hostname, portNumber);
Server: Socket connectionSocket = welcomeSocket.accept();
Connection Management: TCP 3-way handshake
TCP Reliable Transport
[Figure: 3-way handshake. Client (LISTEN -> SYNSENT): chooses init seq num x, sends TCP SYN msg (SYNbit=1, Seq=x). Server (LISTEN -> SYN RCVD): chooses init seq num y, sends TCP SYNACK msg acking the SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1). Client (-> ESTAB): received SYNACK(x) indicates server is live; sends ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data. Server (-> ESTAB): received ACK(y) indicates client is live.]
TCP Reliable Transport: Connection Management: TCP 3-way handshake
[FSM: client side: closed -> (Socket clientSocket = new Socket(hostname, portNumber); send SYN(seq=x)) -> SYNsent -> (receive SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)) -> ESTAB. Server side: closed -> listen (Socket connectionSocket = welcomeSocket.accept()) -> (receive SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client) -> SYNrcvd -> (receive ACK(ACKnum=y+1)) -> ESTAB.]
TCP Reliable Transport: Connection Management: TCP 3-way handshake
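From the application's point of view, the whole 3-way handshake runs inside the connect()/accept() calls: by the time both return, each side is in ESTAB. A minimal loopback sketch (illustrative Python, paralleling the slide's Java Socket/accept() pair):

```python
import socket

# server side: bind to an OS-chosen port and listen
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # port 0: let the OS pick one
server.listen(1)
port = server.getsockname()[1]

# client side: SYN, SYN/ACK, ACK all happen inside connect()
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))

# accept() returns a new socket for communication back to the client
conn, addr = server.accept()

client.sendall(b"C")                 # data flows on the ESTAB connection
received = conn.recv(1)
print(received)                      # b'C'

conn.close(); client.close(); server.close()
```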
client, server each close their side of the connection:
- send TCP segment with FIN bit = 1
- respond to received FIN with ACK; on receiving a FIN, the ACK can be combined with own FIN
- simultaneous FIN exchanges can be handled
TCP Reliable Transport: Connection Management: Closing connection
[Figure: closing a connection, both sides starting in ESTAB. Client (clientSocket.close()): enters FIN_WAIT_1, sends FINbit=1, seq=x; can no longer send but can receive data. Server: enters CLOSE_WAIT, replies ACKbit=1, ACKnum=x+1; can still send data. Client: FIN_WAIT_2, waits for server close. Server (LAST_ACK): sends FINbit=1, seq=y; can no longer send data. Client (TIMED_WAIT): replies ACKbit=1, ACKnum=y+1; timed wait for 2 * max segment lifetime, then CLOSED. Server: CLOSED.]
TCP Reliable Transport: Connection Management: Closing connection
TCP sender events:
data rcvd from app:
- create segment with seq #; seq # is byte-stream number of first data byte in segment
- start timer if not already running; think of the timer as for the oldest unacked segment; expiration interval: TimeOutInterval
timeout:
- retransmit segment that caused timeout; restart timer
ack rcvd:
- if ack acknowledges previously unacked segments: update what is known to be ACKed; start timer if there are still unacked segments
TCP Reliable Transport
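The single-timer discipline above (one timer, conceptually attached to the oldest unACKed segment) can be sketched as event handlers (illustrative Python; class and method names are mine, and actual retransmission/timing is elided):

```python
class Sender:
    """Sketch of TCP sender-side event handling with one timer."""

    def __init__(self):
        self.unacked = {}            # seq -> segment data
        self.timer_running = False

    def on_send(self, seq, data):
        self.unacked[seq] = data
        if not self.timer_running:   # timer covers the oldest unACKed segment
            self.timer_running = True

    def on_ack(self, acknum):
        # cumulative ACK: everything before byte `acknum` is acknowledged
        self.unacked = {s: d for s, d in self.unacked.items() if s >= acknum}
        self.timer_running = bool(self.unacked)  # restart only if data remains

    def on_timeout(self):
        oldest = min(self.unacked)   # retransmit segment that caused timeout
        self.timer_running = True    # and restart the timer
        return oldest
```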
[Figure: retransmission scenarios. Lost ACK: Host A sends Seq=92, 8 bytes of data; B's ACK=100 is lost; A times out and resends Seq=92, 8 bytes; B re-ACKs 100 and A sets SendBase=100. Premature timeout: A sends Seq=92, 8 bytes and Seq=100, 20 bytes; B ACKs 100 and then 120, but A's timer for 92 fires first, so A resends Seq=92, 8 bytes; the cumulative ACK=120 then advances SendBase from 92 to 120.]
TCP Reliable Transport: TCP Retransmission Scenarios
[Figure: cumulative ACK. Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; ACK=100 is lost, but ACK=120 arrives before the timeout, cumulatively acknowledging both segments, so A proceeds with Seq=120, 15 bytes.]
TCP Reliable Transport: TCP Retransmission Scenarios
TCP ACK generation [RFC 1122, RFC 2581]

event at receiver                                   | TCP receiver action
----------------------------------------------------|--------------------------------------------------
arrival of in-order segment with expected seq #;    | delayed ACK: wait up to 500 ms for next segment;
all data up to expected seq # already ACKed         | if no next segment, send ACK
arrival of in-order segment with expected seq #;    | immediately send single cumulative ACK,
one other segment has ACK pending                   | ACKing both in-order segments
arrival of out-of-order segment with higher-than-   | immediately send duplicate ACK, indicating
expected seq #; gap detected                        | seq # of next expected byte
arrival of segment that partially or completely     | immediately send ACK, provided that segment
fills gap                                           | starts at lower end of gap

Reliable Transport
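The four rows of the table reduce to a small decision function. This is a rough sketch (names are mine; the gap-fill case is simplified to a segment below the next expected byte, ignoring out-of-order buffering details):

```python
def ack_action(seg_seq, expected_seq, ack_pending):
    """Pick the RFC 1122/2581 receiver action for one arriving segment."""
    if seg_seq == expected_seq and not ack_pending:
        return "delayed ACK: wait up to 500 ms"
    if seg_seq == expected_seq and ack_pending:
        return "send one cumulative ACK immediately"
    if seg_seq > expected_seq:
        return "send duplicate ACK immediately"   # gap detected
    return "send ACK immediately"                 # segment fills (part of) a gap
```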
- time-out period often relatively long: long delay before resending lost packet
- detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if a segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for the same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #; it is likely that the unacked segment was lost, so don't wait for the timeout
TCP Reliable Transport: TCP Fast Retransmit
[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; the second segment is lost. Subsequent segments each trigger ACK=100, and on the third duplicate ACK=100 Host A resends Seq=100, 20 bytes without waiting for the timeout.]
TCP Reliable Transport: TCP Fast Retransmit
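The triple-duplicate-ACK trigger is just a counter over the incoming ACK stream; a small sketch (illustrative Python, names mine):

```python
def fast_retransmits(acks):
    """Given the stream of ACK numbers, yield each byte offset whose
    segment should be fast-retransmitted (original ACK + 3 duplicates)."""
    counts = {}
    for ack in acks:
        counts[ack] = counts.get(ack, 0) + 1
        if counts[ack] == 4:   # the 3rd *duplicate* of an earlier ACK
            yield ack          # resend unacked segment starting at this byte

# Host B acks 100 four times: the segment starting at byte 100 is resent once.
print(list(fast_retransmits([100, 100, 100, 100])))  # [100]
```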
Q: how to set the TCP timeout value?
- longer than RTT, but RTT varies
- too short: premature timeout, unnecessary retransmissions
- too long: slow reaction to segment loss
Q: how to estimate RTT?
- SampleRTT: measured time from segment transmission until ACK receipt; ignore retransmissions
- SampleRTT will vary; want estimated RTT "smoother": average several recent measurements, not just current SampleRTT
TCP Reliable Transport: TCP Roundtrip time and timeouts
EstimatedRTT = (1 - α) * EstimatedRTT + α * SampleRTT
- exponential weighted moving average; influence of past samples decreases exponentially fast
- typical value: α = 0.125

[Plot: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; SampleRTT (milliseconds) fluctuates roughly between 100 and 350 ms over time (seconds), while EstimatedRTT tracks it smoothly.]
TCP Reliable Transport: TCP Roundtrip time and timeouts
- timeout interval: EstimatedRTT plus a "safety margin"; large variation in EstimatedRTT -> larger safety margin
- estimate SampleRTT deviation from EstimatedRTT:

  DevRTT = (1 - β) * DevRTT + β * |SampleRTT - EstimatedRTT|   (typically β = 0.25)

  TimeoutInterval = EstimatedRTT + 4 * DevRTT   (estimated RTT + "safety margin")

TCP Reliable Transport: TCP Roundtrip time and timeouts
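One update step of the two EWMA estimators, with the typical α and β (an illustrative sketch; function name and sample values are mine, and the EstimatedRTT update is applied before the DevRTT update):

```python
ALPHA, BETA = 0.125, 0.25

def rtt_update(est_rtt, dev_rtt, sample_rtt):
    """One EWMA step; returns (EstimatedRTT, DevRTT, TimeoutInterval)."""
    est_rtt = (1 - ALPHA) * est_rtt + ALPHA * sample_rtt
    dev_rtt = (1 - BETA) * dev_rtt + BETA * abs(sample_rtt - est_rtt)
    timeout = est_rtt + 4 * dev_rtt          # estimate plus safety margin
    return est_rtt, dev_rtt, timeout

# EstimatedRTT = 100 ms, DevRTT = 10 ms, then a 180 ms sample arrives:
est, dev, timeout = rtt_update(0.100, 0.010, 0.180)
print(round(est, 3), round(dev, 3), round(timeout, 3))
```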
[Figure: receiver protocol stack: from sender -> IP code -> TCP code -> TCP socket receiver buffers -> application process; the application (scheduled by the OS) may remove data from the TCP socket buffers slower than the TCP receiver is delivering (sender is sending).]
flow control: receiver controls sender, so the sender won't overflow the receiver's buffer by transmitting too much, too fast
TCP Reliable Transport: Flow Control

[Figure: receiver-side buffering: RcvBuffer holds buffered data plus free buffer space rwnd; TCP segment payloads arrive, and data is read out to the application process.]
- receiver "advertises" free buffer space by including the rwnd value in the TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
- sender limits amount of unacked ("in-flight") data to receiver's rwnd value
- guarantees receive buffer will not overflow
TCP Reliable Transport: Flow Control
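The sender-side constraint (in-flight bytes must not exceed the advertised rwnd) is one inequality; a minimal sketch (names mine):

```python
def can_send(last_byte_sent, last_byte_acked, rwnd, nbytes):
    """True iff sending nbytes more keeps in-flight data within rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd

# 400 bytes in flight, receiver advertises rwnd = 500:
assert can_send(1000, 600, 500, 100) is True    # 400 + 100 <= 500
assert can_send(1000, 600, 500, 200) is False   # would overflow the buffer
```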
Goals for Today
- Transport Layer
  - Abstraction / services
  - Multiplexing/Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport
    - Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  - Congestion control
- Data Center TCP
  - Incast Problem
congestion, informally: "too many sources sending too much data too fast for the network to handle"
- different from flow control!
- manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control:
- end-end congestion control:
  - no explicit feedback from the network
  - congestion inferred from end-system observed loss, delay
  - approach taken by TCP
- network-assisted congestion control:
  - routers provide feedback to end systems
  - a single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM), or an explicit rate for the sender to send at
Principles of Congestion Control
fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]
TCP Congestion Control: TCP Fairness
AIMD (additive increase, multiplicative decrease):
- approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
- additive increase: increase cwnd by 1 MSS every RTT until loss detected
- multiplicative decrease: cut cwnd in half after loss
[Figure: TCP sender congestion window size over time: AIMD "sawtooth" behavior, probing for bandwidth; additively increase window size until loss occurs, then cut window in half.]
TCP Congestion Control: TCP Fairness: Why is TCP Fair? AIMD: additive increase, multiplicative decrease
- sender limits transmission: LastByteSent - LastByteAcked <= cwnd
- cwnd is a dynamic function of perceived network congestion
- TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes:

  rate ~ cwnd / RTT  bytes/sec

[Figure: sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent; the in-flight span is at most cwnd.]
TCP Congestion Control
Why is TCP fair? Consider two competing sessions:
- additive increase gives a slope of 1 as throughput increases
- multiplicative decrease decreases throughput proportionally
[Figure: Connection 1 throughput vs Connection 2 throughput, both axes up to R; repeated rounds of congestion avoidance (additive increase) and loss (decrease window by factor of 2) move the operating point toward the equal bandwidth share line.]
TCP Congestion Control: TCP Fairness: Why is TCP Fair?
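The convergence argument above can be checked numerically: two AIMD flows sharing a link of capacity R each add 1 per RTT, and both halve when their sum exceeds R; halving shrinks the gap between them while additive increase preserves it, so the shares converge to the equal-bandwidth line. A sketch (illustrative Python; names and starting rates are mine):

```python
def aimd_two_flows(x1, x2, capacity, rounds):
    """Simulate two AIMD flows sharing one bottleneck link."""
    for _ in range(rounds):
        if x1 + x2 > capacity:
            x1, x2 = x1 / 2, x2 / 2   # multiplicative decrease on overload
        else:
            x1, x2 = x1 + 1, x2 + 1   # additive increase, +1 per RTT each
    return x1, x2

# start far from fair (5 vs 60 on a capacity-100 link)
x1, x2 = aimd_two_flows(5.0, 60.0, 100.0, 500)
print(abs(x1 - x2))   # the gap has been halved away: near-equal shares
```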
Fairness and UDPmultimedia apps
often do not use TCP do not want rate
throttled by congestion control
instead use UDP send audiovideo at
constant rate tolerate packet loss
Fairness parallel TCP connections
application can open multiple parallel connections between two hosts
web browsers do this eg link of rate R with 9
existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2
TCP Congestion ControlTCP Fairness
when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd
for every ACK receivedsummary initial rate is
slow but ramps up exponentially fast
Host A
one segment
RT
T
Host B
time
two segments
four segments
TCP Congestion ControlSlow Start
TCP Congestion Control
loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly
loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
Detecting and Reacting to Loss
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++
duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++
duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP Congestion Control
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
bull Timeouts are expensive(msecs to recover after loss)
TCP Loss recovery comparison
Sender Receiver
12345
Ack 1
Ack 1Ack 1Ack 1
Retransmit 2
Seq
Ack 5
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
Timeout driven recovery is slow (ms)
Data-driven recovery issuper fast (us) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control
instantiation implementation in the Internet UDP TCP
Next timebull Network Layerbull leaving the network
ldquoedgerdquo (application transport layers)
bull into the network ldquocorerdquo
Perspective
Before Next time
bull Project Proposalndash due in one weekndash Meet with groups TA and professor
bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday
bull No required reading and review duebull But review chapter 4 from the book Network Layer
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
-
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
source port dest port
32 bits
applicationdata (variable length)
sequence number
acknowledgement number
receive window
Urg data pointerchecksum
FSRPAUheadlen
notused
options (variable length)
URG urgent data (generally not used)
ACK ACK valid
PSH push data now(generally not used)
RST SYN FINconnection estab(setup teardown
commands)
bytes rcvr willingto accept
countingby bytes of data(not segments)
Internetchecksum
(as in UDP)
TCP Reliable TransportTCP Segment Structure
sequence numbers
ndashbyte stream ldquonumberrdquo of first byte in segmentrsquos data
acknowledgements
ndashseq of next byte expected from other side
ndashcumulative ACKQ how receiver handles out-of-order segments
ndashA TCP spec doesnrsquot say - up to implementor
source port dest port
sequence number
acknowledgement number
checksum
rwnd
urg pointer
incoming segment to sender
A
sent ACKed
sent not-yet ACKed(ldquoin-flightrdquo)
usablebut not yet sent
not usable
window size N
sender sequence number space
source port dest port
sequence number
acknowledgement number
checksum
rwnd
urg pointer
outgoing segment from sender
TCP Reliable TransportTCP Sequence numbers and Acks
Usertypes
lsquoCrsquo
host ACKsreceipt
of echoedlsquoCrsquo
host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo
simple telnet scenario
Host BHost A
Seq=42 ACK=79 data = lsquoCrsquo
Seq=79 ACK=43 data = lsquoCrsquo
Seq=43 ACK=80
TCP Reliable TransportTCP Sequence numbers and Acks
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing
to establish connection)bull agree on connection parameters
connection state ESTABconnection variables
seq client-to-server server-to-clientrcvBuffer size at serverclient
application
network
connection state ESTABconnection Variables
seq client-to-server server-to-clientrcvBuffer size at serverclient
application
network
Socket clientSocket = newSocket(hostnameport
number)
Socket connectionSocket = welcomeSocketaccept()
Connection Management TCP 3-way handshake
TCP Reliable Transport
SYNbit=1 Seq=x
choose init seq num xsend TCP SYN msg
ESTAB
SYNbit=1 Seq=yACKbit=1 ACKnum=x+1
choose init seq num ysend TCP SYNACKmsg acking SYN
ACKbit=1 ACKnum=y+1
received SYNACK(x) indicates server is livesend ACK for SYNACK
this segment may contain client-to-server data
received ACK(y) indicates client is live
SYNSENT
ESTAB
SYN RCVD
client state
LISTEN
server state
LISTEN
TCP Reliable TransportConnection Management TCP 3-way handshake
closed
L
listen
SYNrcvd
SYNsent
ESTAB
Socket clientSocket = newSocket(hostnameport
number)
SYN(seq=x)
Socket connectionSocket = welcomeSocketaccept()
SYN(x)
SYNACK(seq=yACKnum=x+1)create new socket for communication back to client
SYNACK(seq=yACKnum=x+1)
ACK(ACKnum=y+1)ACK(ACKnum=y+1)
L
TCP Reliable TransportConnection Management TCP 3-way handshake
client server each close their side of connection send TCP segment with FIN bit = 1
respond to received FIN with ACK on receiving FIN ACK can be combined with
own FINsimultaneous FIN exchanges can be
handled
TCP Reliable TransportConnection Management Closing connection
FIN_WAIT_2
CLOSE_WAIT
FINbit=1 seq=y
ACKbit=1 ACKnum=y+1
ACKbit=1 ACKnum=x+1 wait for server
close
can stillsend data
can no longersend data
LAST_ACK
CLOSED
TIMED_WAIT
timed wait for 2max
segment lifetime
CLOSED
FIN_WAIT_1 FINbit=1 seq=xcan no longersend but can receive data
clientSocketclose()
client state server state
ESTABESTAB
TCP Reliable TransportConnection Management Closing connection
TCP (Transmission Control Protocol) – RFCs 793, 1122, 1323, 2018, 2581
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full-duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver
TCP Reliable Transport
data rcvd from app: create segment with seq # (seq # is byte-stream number of first data byte in segment); start timer if not already running (think of timer as for oldest unacked segment; expiration interval: TimeOutInterval)
timeout: retransmit segment that caused timeout; restart timer
ack rcvd: if ack acknowledges previously unacked segments, update what is known to be ACKed; start timer if there are still unacked segments
TCP Reliable Transport
[retransmission timelines between Host A and Host B]
• lost ACK scenario: Host A sends Seq=92, 8 bytes of data; the ACK=100 is lost (X); after a timeout, Host A retransmits Seq=92 and Host B ACKs again (ACK=100); SendBase=100
• premature timeout: Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); the ACKs (ACK=100, ACK=120) are delayed; a timeout fires and Seq=92 is retransmitted; Host B replies with cumulative ACK=120; SendBase moves 92 to 100 to 120
• cumulative ACK: Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); ACK=100 is lost (X), but cumulative ACK=120 arrives before the timeout, so nothing is retransmitted; Host A continues with Seq=120, 15 bytes of data
TCP Reliable Transport – TCP Retransmission Scenarios
TCP ACK generation [RFC 1122, RFC 2581] – event at receiver / TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed – delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending – immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected – immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap – immediately send ACK, provided that segment starts at lower end of gap
• time-out period often relatively long: long delay before resending lost packet
• detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if a segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for the same data ("triple duplicate ACKs"), resend unacked segment with smallest seq # – likely that the unacked segment was lost, so don't wait for timeout
TCP Reliable TransportTCP Fast Retransmit
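The fast-retransmit rule can be sketched as a toy sender loop over incoming ACK numbers (function and variable names are illustrative, not a real TCP implementation):

```python
def process_acks(acks, send_base):
    """Toy fast-retransmit logic: a new ACK advances send_base and resets
    the duplicate counter; the third duplicate ACK triggers an immediate
    retransmission of the smallest unacked sequence number."""
    dup_count = 0
    retransmitted = []
    for ack in acks:
        if ack > send_base:            # new ACK: window advances
            send_base = ack
            dup_count = 0
        else:                          # duplicate ACK
            dup_count += 1
            if dup_count == 3:         # "triple duplicate ACK"
                retransmitted.append(send_base)
    return send_base, retransmitted
```

For example, the ACK stream 100, 100, 100, 100, 120 starting from send_base 92 causes one fast retransmission of byte 100 without any timer expiring.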
[fast-retransmit timeline: Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); the Seq=100 segment is lost (X); Host B sends ACK=100 four times; on receipt of the triple duplicate ACK, Host A retransmits Seq=100 before the timeout expires]
TCP Reliable TransportTCP Fast Retransmit
Q: how to set TCP timeout value?
• longer than RTT – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt (ignore retransmissions)
• SampleRTT will vary; want estimated RTT "smoother": average several recent measurements, not just current SampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
[plot: RTT (milliseconds) vs. time (seconds), gaia.cs.umass.edu to fantasia.eurecom.fr, showing SampleRTT and EstimatedRTT]
EstimatedRTT = (1 − α) · EstimatedRTT + α · SampleRTT
• exponentially weighted moving average: influence of a past sample decreases exponentially fast
• typical value: α = 0.125
TCP Reliable Transport – TCP Round-trip time and timeouts
• timeout interval: EstimatedRTT plus "safety margin" – large variation in EstimatedRTT means a larger safety margin
• estimate SampleRTT's deviation from EstimatedRTT:
DevRTT = (1 − β) · DevRTT + β · |SampleRTT − EstimatedRTT|   (typically β = 0.25)
TimeoutInterval = EstimatedRTT + 4 · DevRTT
(estimated RTT plus "safety margin")
TCP Reliable TransportTCP Roundtrip time and timeouts
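The two EWMA updates and the timeout rule above fit in a few lines. A sketch using the slides' constants (α = 0.125, β = 0.25); the update order here follows the slides, with DevRTT computed against the freshly updated EstimatedRTT:

```python
def update_rtt_estimates(sample_rtt, est_rtt, dev_rtt,
                         alpha=0.125, beta=0.25):
    """One RTT measurement step: EWMA of the RTT, EWMA of its deviation,
    and the resulting retransmission timeout."""
    est_rtt = (1 - alpha) * est_rtt + alpha * sample_rtt
    dev_rtt = (1 - beta) * dev_rtt + beta * abs(sample_rtt - est_rtt)
    timeout = est_rtt + 4 * dev_rtt          # TimeoutInterval
    return est_rtt, dev_rtt, timeout
```

For instance, with EstimatedRTT = 100 ms, DevRTT = 10 ms, and a new SampleRTT of 116 ms, EstimatedRTT becomes 102 ms, DevRTT 11 ms, and the timeout 146 ms.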
[receiver protocol stack: application process above the OS; TCP socket receiver buffers sit between TCP code and the application; IP code below; segments arrive from sender]
application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
TCP Reliable TransportFlow Control
[receiver-side buffering: RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads flow in, data flows out to the application process]
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
– RcvBuffer size set via socket options (typical default is 4096 bytes)
– many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
TCP Reliable TransportFlow Control
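The sender-side rule reduces to one inequality: bytes in flight plus the next segment must fit in the advertised rwnd. A sketch (names are illustrative):

```python
def can_send(last_byte_sent, last_byte_acked, rwnd, nbytes):
    """Flow-control check: keep unacked ("in-flight") data within the
    receiver's advertised window rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd
```

With 400 bytes in flight and rwnd = 500, a 100-byte segment may go out, but a 101-byte segment must wait for more ACKs (or a larger advertised window).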
Goals for Today
• Transport Layer
– Abstraction / services
– Multiplexing / Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport (abstraction, connection management, reliable transport, flow control, timeouts)
– Congestion control
• Data Center TCP
– Incast Problem
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations: lost packets (buffer overflow at routers); long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control:
• end-end congestion control: no explicit feedback from network; congestion inferred from end-system observed loss, delay; approach taken by TCP
• network-assisted congestion control: routers provide feedback to end systems – a single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM) or an explicit rate for the sender to send at
Principles of Congestion Control
fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K
[diagram: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]
TCP Congestion ControlTCP Fairness
approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss
[plot: TCP sender congestion window size (cwnd) vs. time – AIMD "saw tooth" behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half)]
TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease
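One AIMD step per RTT is a one-liner, and iterating it reproduces the saw tooth (a sketch in MSS units; the loss pattern below is made up for illustration):

```python
def aimd_step(cwnd_mss, loss):
    """One RTT of AIMD: additive increase (+1 MSS per RTT),
    multiplicative decrease (halve cwnd on loss)."""
    return cwnd_mss / 2 if loss else cwnd_mss + 1

# Trace a short saw tooth from cwnd = 10 MSS.
cwnd = 10
trace = []
for loss in [False, False, False, True, False]:
    cwnd = aimd_step(cwnd, loss)
    trace.append(cwnd)
```

The trace climbs 11, 12, 13, halves to 6.5 on the loss, then resumes the additive climb, which is exactly the saw-tooth shape in the plot above.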
sender limits transmission: LastByteSent − LastByteAcked ≤ cwnd
• cwnd is a dynamic function of perceived network congestion
• TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes
rate ≈ cwnd / RTT bytes/sec
[sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in flight") | last byte sent]
TCP Congestion Control
two competing sessions: additive increase gives a slope of 1 as throughput increases; multiplicative decrease cuts throughput proportionally
[plot: Connection 2 throughput vs. Connection 1 throughput, both bounded by R; repeated congestion-avoidance (additive increase) and loss (window cut by factor of 2) steps converge toward the equal-bandwidth-share line]
TCP Congestion Control – TCP Fairness: Why is TCP fair?
Fairness and UDP: multimedia apps often do not use TCP – they do not want their rate throttled by congestion control; instead they use UDP: send audio/video at a constant rate, tolerate packet loss
Fairness and parallel TCP connections: an application can open multiple parallel connections between two hosts; web browsers do this. E.g., on a link of rate R with 9 existing connections: a new app asking for 1 TCP gets rate R/10; a new app asking for 11 TCPs gets R/2
TCP Congestion ControlTCP Fairness
when connection begins, increase rate exponentially until first loss event: initially cwnd = 1 MSS; double cwnd every RTT (done by incrementing cwnd for every ACK received)
summary: initial rate is slow but ramps up exponentially fast
[timeline: Host A sends one segment; after one RTT, two segments; then four segments]
TCP Congestion ControlSlow Start
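Per-RTT, slow start is a doubling sequence capped at ssthresh. A sketch in MSS units (function name and trace format are illustrative):

```python
def slow_start_trace(ssthresh_mss, rtts):
    """cwnd starts at 1 MSS and doubles each RTT (one extra MSS per ACK
    received) until it reaches ssthresh; returns cwnd after each RTT."""
    cwnd, history = 1, [1]
    for _ in range(rtts):
        cwnd = min(cwnd * 2, ssthresh_mss)   # exponential growth, capped
        history.append(cwnd)
    return history
```

Starting below a 64-MSS threshold, four RTTs take cwnd through 1, 2, 4, 8, 16, matching the one/two/four-segment timeline above.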
TCP Congestion Control
loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
loss indicated by 3 duplicate ACKs (TCP Reno): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)
Detecting and Reacting to Loss
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation: variable ssthresh; on a loss event, ssthresh is set to 1/2 of cwnd just before the loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
[TCP congestion-control FSM]
slow start:
• entry (Λ): cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: enter congestion avoidance
congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
fast recovery (entered from slow start or congestion avoidance on dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment):
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s) as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; return to congestion avoidance
any state, timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; return to slow start
TCP Congestion Control
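The two Reno loss reactions from the FSM above can be sketched as a single function (a simplification: it covers only the loss transitions, not the per-ACK growth; names are illustrative):

```python
def on_loss(cwnd_mss, event):
    """Reno's loss reactions, in MSS units.
    Returns (new_cwnd, new_ssthresh)."""
    new_ssthresh = max(cwnd_mss // 2, 1)       # ssthresh = cwnd/2
    if event == "timeout":
        return 1, new_ssthresh                 # back to slow start
    if event == "triple_dup_ack":
        return new_ssthresh + 3, new_ssthresh  # enter fast recovery
    raise ValueError(f"unknown loss event: {event}")
```

From cwnd = 40 MSS, a timeout drops cwnd to 1 with ssthresh = 20, while a triple duplicate ACK leaves cwnd at 23 (ssthresh + 3), reflecting that dup ACKs prove the network is still delivering segments.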
• avg TCP throughput as a function of window size and RTT (ignore slow start; assume always data to send)
• W: window size (measured in bytes) where loss occurs
– avg window size (# in-flight bytes) is ¾ W
– avg throughput is ¾ W per RTT
avg TCP throughput = (3/4) · W / RTT bytes/sec
[saw-tooth diagram: window oscillates between W/2 and W]
TCP Throughput
TCP over "long, fat pipes"
• example: 1500-byte segments, 100 ms RTT; want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability L [Mathis 1997]:
TCP throughput = 1.22 · MSS / (RTT · √L)
• to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰ – a very small loss rate!
• new versions of TCP for high-speed
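The slides' 10 Gbps example follows directly from the Mathis formula; inverting it gives the loss rate a target throughput requires (a sketch; function names are illustrative):

```python
from math import sqrt

def mathis_throughput_Bps(mss_bytes, rtt_s, loss_rate):
    """Steady-state TCP throughput (bytes/sec): 1.22 * MSS / (RTT * sqrt(L))."""
    return 1.22 * mss_bytes / (rtt_s * sqrt(loss_rate))

def required_loss_rate(mss_bytes, rtt_s, target_bps):
    """Invert the Mathis formula: loss rate needed to sustain target_bps."""
    target_Bps = target_bps / 8               # bits/sec -> bytes/sec
    return (1.22 * mss_bytes / (rtt_s * target_Bps)) ** 2

# Slides' numbers: 1500-byte segments, 100 ms RTT, 10 Gbps target.
L = required_loss_rate(1500, 0.1, 10e9)       # on the order of 2e-10
```

Plugging the slides' numbers in lands at roughly L ≈ 2·10⁻¹⁰, i.e. at most one loss per ~5 billion segments, which standard TCP cannot expect on real links, hence the high-speed TCP variants.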
Goals for Today
• Transport Layer
– Abstraction / services
– Multiplexing / Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport (abstraction, connection management, reliable transport, flow control, timeouts)
– Congestion control
• Data Center TCP
– Incast Problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008
TCP Throughput Collapse: what happens when TCP is "too friendly"? E.g.:
• test on an Ethernet-based storage cluster
• client performs synchronized reads
• increase # of servers involved in transfer (SRU size is fixed)
• TCP used as the data transfer protocol
Cluster-based Storage Systems
[diagram: client connected through a switch to storage servers 1–4; a data block is striped across the servers in Server Request Units (SRUs); in a synchronized read, the client requests SRUs 1–4 in parallel and sends the next batch of requests only after all of 1, 2, 3, 4 have arrived]
Link idle time due to timeouts
[diagram: the same synchronized read, but server 4's response is lost; the link is idle until server 4 experiences a timeout]
TCP Throughput Collapse Incast
• [Nagle04] called this Incast
• cause of throughput collapse: TCP timeouts
[graph: goodput collapses as the number of servers increases]
TCP data-driven loss recovery
[timeline: sender transmits packets 1–5; packet 2 is lost; packets 3, 4, 5 each trigger a duplicate Ack 1; after 3 duplicate ACKs for 1 (packet 2 is probably lost), the sender retransmits packet 2 immediately; the receiver then sends Ack 5. In SANs, recovery takes microseconds after loss]
TCP timeout-driven loss recovery
[timeline: sender transmits packets 1–5; only packet 1 arrives, so a single Ack 1 comes back; with no duplicate ACKs, the sender must wait out the Retransmission Timeout (RTO) before retransmitting]
• timeouts are expensive (msecs to recover after loss)
TCP loss recovery comparison
[side-by-side timelines: data-driven recovery (triple duplicate ACK, immediate retransmit of 2, then Ack 5) is super fast (µs) in SANs; timeout-driven recovery (waiting for the RTO) is slow (ms)]
TCP Throughput Collapse Summary
• synchronized reads and TCP timeouts cause TCP throughput collapse
• previously tried options: increase buffer size (costly); reduce RTOmin (unsafe); use Ethernet flow control (limited applicability)
• DCTCP (Data Center TCP): limit in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications
Perspective
• principles behind transport layer services: multiplexing/demultiplexing, reliable data transfer, flow control, congestion control
• instantiation and implementation in the Internet: UDP, TCP
Next time: Network Layer – leaving the network "edge" (application, transport layers), into the network "core"
Before Next time
• Project Proposal – due in one week; meet with groups, TA, and professor
• Lab1 – single-threaded TCP proxy; due in one week, next Friday
• no required reading and review due, but review chapter 4 from the book (Network Layer) – we will also briefly discuss data center topologies
• check website for updated schedule
TCP segment structure
[32 bits wide: source port #, dest port #; sequence number; acknowledgement number; head len, not used, flag bits, receive window (# bytes rcvr willing to accept); checksum (Internet checksum, as in UDP), urg data pointer; options (variable length); application data (variable length)
flag bits – URG: urgent data (generally not used); ACK: ACK # valid; PSH: push data now (generally not used); RST, SYN, FIN: connection estab (setup, teardown commands); counting is by bytes of data (not segments)]
TCP Reliable TransportTCP Segment Structure
sequence numbers: byte-stream "number" of first byte in segment's data
acknowledgements: seq # of next byte expected from other side; cumulative ACK
Q: how does the receiver handle out-of-order segments? A: TCP spec doesn't say – up to the implementor
[diagrams: the outgoing segment from the sender carries the sequence number; the sender sequence number space splits into sent & ACKed | sent, not-yet ACKed ("in flight") | usable but not yet sent | not usable, with window size N; the incoming segment to the sender carries the acknowledgement number and rwnd]
TCP Reliable TransportTCP Sequence numbers and Acks
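The fixed 20-byte header laid out above maps directly onto a struct format string. A parsing sketch (the example segment bytes are fabricated for illustration, and options are ignored):

```python
import struct

def parse_tcp_header(segment):
    """Unpack the 20-byte fixed TCP header: two 16-bit ports, 32-bit seq
    and ack numbers, offset/flags, receive window, checksum, urgent ptr."""
    (src, dst, seq, ack, off_flags,
     rwnd, cksum, urg) = struct.unpack("!HHIIHHHH", segment[:20])
    return {
        "src_port": src, "dst_port": dst,
        "seq": seq, "ack": ack,
        "header_len": (off_flags >> 12) * 4,   # head len, in 32-bit words
        "flags": off_flags & 0x01FF,           # FIN, SYN, RST, PSH, ACK, URG...
        "rwnd": rwnd,
    }

# Fabricated example: a SYN (flag bit 0x002) from port 1234 to 80, seq=42.
hdr = struct.pack("!HHIIHHHH", 1234, 80, 42, 0, (5 << 12) | 0x002, 65535, 0, 0)
parsed = parse_tcp_header(hdr)
```

Note the network byte order (`!`) in the format string; all multi-byte TCP header fields are big-endian on the wire.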
simple telnet scenario
[timeline: user types 'C'; Host A sends Seq=42, ACK=79, data='C'; Host B ACKs receipt of 'C' and echoes back 'C' (Seq=79, ACK=43, data='C'); Host A ACKs receipt of the echoed 'C' (Seq=43, ACK=80)]
TCP Reliable TransportTCP Sequence numbers and Acks
before exchanging data, sender and receiver "handshake":
• agree to establish connection (each knowing the other is willing to establish the connection)
• agree on connection parameters
[each side then holds connection state ESTAB and connection variables: seq #s (client-to-server, server-to-client), rcvBuffer size at server, client]
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
-
TCP Reliable Transport: TCP sequence numbers and ACKs

sequence numbers:
– byte-stream "number" of first byte in segment's data
acknowledgements:
– seq # of next byte expected from other side
– cumulative ACK
Q: how does the receiver handle out-of-order segments?
– A: the TCP spec doesn't say; it is up to the implementor

[Figure: TCP segment header fields (source port, dest port, sequence number, acknowledgement number, checksum, rwnd, urg pointer), shown for an incoming segment to the sender and an outgoing segment from the sender]

[Figure: sender sequence number space, window size N: sent and ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable]
TCP Reliable Transport: TCP sequence numbers and ACKs

Simple telnet scenario (Host A to Host B):
– user types 'C': Host A sends Seq=42, ACK=79, data = 'C'
– Host B ACKs receipt of 'C' and echoes back 'C': Seq=79, ACK=43, data = 'C'
– Host A ACKs receipt of the echoed 'C': Seq=43, ACK=80
TCP Reliable Transport

TCP (Transmission Control Protocol), RFCs 793, 1122, 1323, 2018, 2581:
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full-duplex data: bi-directional data flow in the same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender and receiver state before data exchange
• flow controlled: sender will not overwhelm receiver
TCP Reliable Transport: Connection Management

Before exchanging data, sender and receiver "handshake":
• agree to establish connection (each knowing the other is willing to establish the connection)
• agree on connection parameters

Each side ends in connection state ESTAB, with connection variables: seq #s (client-to-server and server-to-client) and rcvBuffer size at server and client.

In application code:
– client: Socket clientSocket = new Socket(hostname, portNumber);
– server: Socket connectionSocket = welcomeSocket.accept();
TCP Reliable Transport: Connection Management, TCP 3-way handshake

– client (LISTEN → SYNSENT): chooses init seq num x; sends TCP SYN msg: SYNbit=1, Seq=x
– server (LISTEN → SYN RCVD): on receiving the SYN, chooses init seq num y; sends SYNACK msg acking the SYN: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1; creates a new socket for communication back to the client
– client (→ ESTAB): received SYNACK(x) indicates server is live; sends ACK for SYNACK: ACKbit=1, ACKnum=y+1; this segment may contain client-to-server data
– server (→ ESTAB): received ACK(y) indicates client is live

In socket terms, the client side runs Socket clientSocket = new Socket(hostname, portNumber); and the server side runs Socket connectionSocket = welcomeSocket.accept();
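The handshake is performed by the kernel when an application calls connect() and accept(). A minimal runnable sketch (using Python's socket API for brevity rather than the Java of the slides; the loopback address and the echo payload are illustrative):

```python
import socket
import threading

# Server side: welcome socket; accept() returns once the kernel has
# completed the SYN / SYNACK / ACK exchange with a client.
welcome = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
welcome.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
welcome.listen(1)
port = welcome.getsockname()[1]

received = {}
def serve():
    conn, _addr = welcome.accept()    # connection is ESTAB on both sides here
    received["data"] = conn.recv(1024)
    conn.sendall(b"echo:" + received["data"])
    conn.close()

t = threading.Thread(target=serve)
t.start()

# Client side: connect() sends the SYN and blocks until the SYNACK
# arrives and the final ACK is sent.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
client.sendall(b"C")                  # like the telnet 'C' in the earlier slide
reply = client.recv(1024)
client.close()
t.join()
welcome.close()
print(reply)                          # b'echo:C'
```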
TCP Reliable Transport: Connection Management, closing a connection

Client and server each close their side of the connection by sending a TCP segment with FIN bit = 1, and respond to a received FIN with an ACK. On receiving a FIN, the ACK can be combined with one's own FIN; simultaneous FIN exchanges can be handled.

Typical close, starting from ESTAB on both sides:
– client calls clientSocket.close(); sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send, but can receive data)
– server ACKs (ACKbit=1, ACKnum=x+1) and enters CLOSE_WAIT (can still send data); client waits in FIN_WAIT_2
– server closes: sends FINbit=1, seq=y; enters LAST_ACK (can no longer send data)
– client ACKs (ACKbit=1, ACKnum=y+1) and enters TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED; server enters CLOSED on receiving the ACK
TCP Reliable Transport: sender events

data rcvd from app:
– create segment with seq #; seq # is the byte-stream number of the first data byte in the segment
– start timer if not already running (think of the timer as for the oldest unACKed segment); expiration interval: TimeOutInterval

timeout:
– retransmit the segment that caused the timeout
– restart timer

ACK rcvd:
– if the ACK acknowledges previously unACKed segments, update what is known to be ACKed
– start timer if there are still unACKed segments
TCP Reliable Transport: TCP retransmission scenarios

– lost ACK: Host A sends Seq=92, 8 bytes of data; Host B's ACK=100 is lost; A times out and retransmits Seq=92, 8 bytes; B re-ACKs 100 and SendBase advances to 100
– premature timeout: A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); B ACKs 100 and 120, but A's timer for Seq=92 expires first, so A retransmits Seq=92; B's cumulative ACK=120 then covers everything (SendBase = 120)
– cumulative ACK: A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); ACK=100 is lost, but the cumulative ACK=120 arrives before the timeout, so A proceeds to send Seq=120 (15 bytes) without retransmitting
TCP Reliable Transport: TCP ACK generation [RFC 1122, RFC 2581]

event at receiver → TCP receiver action:
– arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for the next segment; if no next segment, send ACK
– arrival of in-order segment with expected seq #; one other segment has an ACK pending → immediately send a single cumulative ACK, ACKing both in-order segments
– arrival of out-of-order segment with higher-than-expected seq #; gap detected → immediately send a duplicate ACK, indicating the seq # of the next expected byte
– arrival of a segment that partially or completely fills the gap → immediately send ACK, provided the segment starts at the lower end of the gap
TCP Reliable Transport: TCP Fast Retransmit

The timeout period is often relatively long: a long delay before resending a lost packet. Instead, detect lost segments via duplicate ACKs: the sender often sends many segments back-to-back, so if a segment is lost, there will likely be many duplicate ACKs.

TCP fast retransmit: if the sender receives 3 ACKs for the same data ("triple duplicate ACKs"), resend the unACKed segment with the smallest seq #. It is likely that this segment was lost, so don't wait for the timeout.
Fast retransmit after sender receipt of triple duplicate ACK: Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); Seq=100 is lost; Host B sends ACK=100 four times; on receipt of the third duplicate ACK, A retransmits Seq=100, 20 bytes, before its timeout expires.
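The triple-duplicate-ACK rule can be sketched as a small counter (a hypothetical helper, not the full TCP sender state machine):

```python
def process_acks(acks):
    """Return the sequence numbers fast-retransmitted for a stream of ACKs."""
    retransmitted = []
    last_ack, dup_count = None, 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == 3:                 # triple duplicate ACK
                retransmitted.append(ack)      # resend segment starting at `ack`
        else:
            last_ack, dup_count = ack, 0       # new ACK value resets the counter
    return retransmitted

# Host B keeps ACKing 100 because Seq=100 was lost (as in the scenario above):
print(process_acks([100, 100, 100, 100, 120]))  # [100]
```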
TCP Reliable Transport: TCP round-trip time and timeouts

Q: how to set the TCP timeout value?
– longer than RTT, but RTT varies
– too short: premature timeout, unnecessary retransmissions
– too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary; want the estimated RTT "smoother"
  – average several recent measurements, not just the current SampleRTT
[Figure: SampleRTT and EstimatedRTT, RTT (milliseconds) vs. time (seconds), for gaia.cs.umass.edu to fantasia.eurecom.fr]

EstimatedRTT = (1 − α) · EstimatedRTT + α · SampleRTT
– exponential weighted moving average
– influence of a past sample decreases exponentially fast
– typical value: α = 0.125

TCP Reliable Transport: TCP round-trip time and timeouts
• timeout interval: EstimatedRTT plus a "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate the SampleRTT deviation from EstimatedRTT:
  DevRTT = (1 − β) · DevRTT + β · |SampleRTT − EstimatedRTT|   (typically β = 0.25)

TimeoutInterval = EstimatedRTT + 4 · DevRTT
(estimated RTT plus "safety margin")

TCP Reliable Transport: TCP round-trip time and timeouts
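The two moving averages and the timeout rule combine as in this sketch (gains α = 0.125 and β = 0.25 as on the slide; the sample values are made up):

```python
def make_rtt_estimator(alpha=0.125, beta=0.25):
    """Return an update function maintaining EstimatedRTT and DevRTT."""
    est, dev = None, 0.0
    def update(sample_rtt):
        nonlocal est, dev
        if est is None:
            est = sample_rtt                       # first sample seeds the average
        else:
            dev = (1 - beta) * dev + beta * abs(sample_rtt - est)
            est = (1 - alpha) * est + alpha * sample_rtt
        return est + 4 * dev                       # TimeoutInterval
    return update

update = make_rtt_estimator()
print(update(100.0))  # 100.0: EstimatedRTT seeded, DevRTT still 0
print(update(120.0))  # 122.5: EstimatedRTT = 102.5, DevRTT = 5.0
```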
TCP Reliable Transport: Flow Control

[Figure: receiver protocol stack: the application process reads from the TCP socket's receiver buffers; TCP code and IP code sit in the OS; segments arrive from the sender]

The application may remove data from the TCP socket buffers slower than the TCP receiver is delivering it (i.e., slower than the sender is sending).

flow control: the receiver controls the sender, so the sender won't overflow the receiver's buffer by transmitting too much, too fast.
TCP Reliable Transport: Flow Control

[Figure: receiver-side buffering: RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads flow in, the application process reads out]

• receiver "advertises" free buffer space by including the rwnd value in the TCP header of receiver-to-sender segments
  – RcvBuffer size is set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits the amount of unACKed ("in-flight") data to the receiver's rwnd value
• this guarantees the receive buffer will not overflow
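The sender-side constraint reduces to one arithmetic check; a sketch using the slide's LastByteSent/LastByteAcked naming (the byte counts are illustrative):

```python
def sendable_bytes(last_byte_sent, last_byte_acked, rwnd):
    """Bytes the sender may still transmit without overflowing the receiver."""
    in_flight = last_byte_sent - last_byte_acked   # sent but not yet ACKed
    return max(0, rwnd - in_flight)

# receiver advertised rwnd = 4096 bytes; 3000 bytes already in flight
print(sendable_bytes(last_byte_sent=8000, last_byte_acked=5000, rwnd=4096))  # 1096
```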
Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem
Principles of Congestion Control

congestion:
• informally: "too many sources sending too much data too fast for the network to handle"
• different from flow control
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
Principles of Congestion Control

two broad approaches towards congestion control:
• end-end congestion control:
  – no explicit feedback from the network
  – congestion inferred from end-system observed loss and delay
  – approach taken by TCP
• network-assisted congestion control:
  – routers provide feedback to end systems
  – a single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM), or an explicit rate for the sender to send at
TCP Congestion Control: TCP Fairness

fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K.

[Figure: TCP connections 1 and 2 sharing a bottleneck router of capacity R]
TCP Congestion Control: TCP Fairness. Why is TCP fair? AIMD: additive increase, multiplicative decrease

approach: the sender increases its transmission rate (window size), probing for usable bandwidth, until loss occurs:
– additive increase: increase cwnd by 1 MSS every RTT until loss is detected
– multiplicative decrease: cut cwnd in half after loss

[Figure: TCP sender congestion window size (cwnd) vs. time: AIMD sawtooth behavior, probing for bandwidth; additively increase window size ... until loss occurs (then cut window in half)]
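A toy simulation of the sawtooth (the loss RTTs are fixed by assumption here, rather than produced by a real network):

```python
MSS = 1  # count cwnd in segments for simplicity

def aimd(rounds, loss_rtts, cwnd=10):
    """Trace cwnd over `rounds` RTTs, with losses at the given RTT indices."""
    trace = []
    for rtt in range(rounds):
        if rtt in loss_rtts:
            cwnd = max(MSS, cwnd // 2)   # multiplicative decrease: halve
        else:
            cwnd += MSS                  # additive increase: +1 MSS per RTT
        trace.append(cwnd)
    return trace

print(aimd(8, loss_rtts={4}))  # [11, 12, 13, 14, 7, 8, 9, 10]
```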
TCP Congestion Control

sender limits transmission: LastByteSent − LastByteAcked ≤ cwnd

cwnd is a dynamic function of perceived network congestion.

[Figure: sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent; window of cwnd bytes]

TCP sending rate: roughly, send cwnd bytes, wait one RTT for ACKs, then send more bytes:

rate ≈ cwnd / RTT bytes/sec
TCP Congestion Control: TCP Fairness. Why is TCP fair?

two competing sessions:
– additive increase gives a slope of 1 as throughput increases
– multiplicative decrease decreases throughput proportionally

[Figure: Connection 1 throughput vs. Connection 2 throughput, each bounded by R; repeated congestion avoidance (additive increase) and loss (window decreased by a factor of 2) moves the two connections toward the equal bandwidth share line]
TCP Congestion Control: TCP Fairness

Fairness and UDP:
– multimedia apps often do not use TCP: they do not want their rate throttled by congestion control
– instead they use UDP: send audio/video at a constant rate, tolerate packet loss

Fairness and parallel TCP connections:
– an application can open multiple parallel connections between two hosts; web browsers do this
– e.g., on a link of rate R with 9 existing connections: a new app asking for 1 TCP connection gets rate R/10; a new app asking for 11 TCP connections gets R/2
TCP Congestion Control: Slow Start

when a connection begins, increase the rate exponentially until the first loss event:
– initially cwnd = 1 MSS
– double cwnd every RTT
– done by incrementing cwnd for every ACK received

summary: the initial rate is slow, but ramps up exponentially fast.

[Figure: Host A sends one segment, then two segments, then four segments to Host B, one round per RTT]
TCP Congestion Control: Detecting and Reacting to Loss

loss indicated by timeout:
– cwnd set to 1 MSS; the window then grows exponentially (as in slow start) to a threshold, then grows linearly

loss indicated by 3 duplicate ACKs (TCP Reno):
– dup ACKs indicate the network is capable of delivering some segments; cwnd is cut in half, then the window grows linearly

TCP Tahoe always sets cwnd to 1 MSS (whether timeout or 3 duplicate ACKs).

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
– variable ssthresh
– on a loss event, ssthresh is set to 1/2 of cwnd just before the loss event
TCP Congestion Control: FSM view

start: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0 → slow start

slow start:
– new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s) as allowed
– cwnd > ssthresh: → congestion avoidance
– duplicate ACK: dupACKcount++
– dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment → fast recovery
– timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start

congestion avoidance:
– new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s) as allowed
– duplicate ACK: dupACKcount++
– dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment → fast recovery
– timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start

fast recovery:
– duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s) as allowed
– new ACK: cwnd = ssthresh; dupACKcount = 0 → congestion avoidance
– timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start
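A simplified sketch of these transitions, tracking only cwnd growth per RTT (slow-start doubling, linear congestion avoidance, and timeout reset; fast recovery is omitted for brevity):

```python
MSS = 1  # count cwnd in segments for simplicity

def step(cwnd, ssthresh, event):
    """One RTT of the simplified state machine; returns (cwnd, ssthresh)."""
    if event == "timeout":
        return MSS, max(2 * MSS, cwnd // 2)  # reset cwnd, halve ssthresh
    if cwnd < ssthresh:                      # slow start: exponential growth
        return cwnd * 2, ssthresh
    return cwnd + MSS, ssthresh              # congestion avoidance: linear

cwnd, ssthresh = MSS, 8
trace = []
for ev in ["ack"] * 5 + ["timeout"] + ["ack"] * 2:
    cwnd, ssthresh = step(cwnd, ssthresh, ev)
    trace.append(cwnd)
print(trace)  # [2, 4, 8, 9, 10, 1, 2, 4]
```

Note the shape: exponential up to ssthresh = 8, linear to 10, reset to 1 MSS on timeout, then exponential again toward the new ssthresh of 5.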
TCP Throughput

• avg TCP throughput as a function of window size and RTT
  – ignore slow start; assume there is always data to send
• W: window size (measured in bytes) where loss occurs
  – avg window size (# in-flight bytes) is 3/4 W
  – avg throughput is 3/4 W per RTT:

avg TCP throughput = (3/4) · W / RTT bytes/sec

[Figure: window oscillates between W/2 and W]
TCP over "long, fat pipes"

• example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

TCP throughput = 1.22 · MSS / (RTT · √L)

→ to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰, a very small loss rate!
• new versions of TCP are needed for high-speed links
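Plugging the slide's numbers in (1500-byte segments, so 12,000 bits per segment; this is arithmetic only, inverting the formula above for L):

```python
MSS_BITS = 1500 * 8          # 1500-byte segments
RTT = 0.100                  # 100 ms
TARGET = 10e9                # 10 Gbps

# required window: bits in flight per RTT, expressed in segments
W = TARGET * RTT / MSS_BITS
print(round(W))              # 83333 in-flight segments

# invert throughput = 1.22 * MSS / (RTT * sqrt(L)) for the loss rate L
L = (1.22 * MSS_BITS / (RTT * TARGET)) ** 2
print(L)                     # ~2.1e-10, the slide's "very small loss rate"
```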
Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse

What happens when TCP is "too friendly"? E.g.:
• test on an Ethernet-based storage cluster
• client performs synchronized reads
• increase # of servers involved in the transfer
  – SRU size is fixed
• TCP used as the data transfer protocol
Cluster-based Storage Systems

[Figure: a client connected through a switch to storage servers 1-4; a data block is striped across the servers, each holding one Server Request Unit (SRU); in a synchronized read the client requests SRUs 1-4 in parallel, and only after all four responses arrive does the client send the next batch of requests]
Link idle time due to timeouts

[Figure: the same synchronized read of SRUs 1-4, but server 4's response is dropped at the switch; the link is idle until server 4 experiences a timeout]
TCP Throughput Collapse: Incast

• [Nagle04] called this Incast
• cause of throughput collapse: TCP timeouts

[Figure: goodput collapses as the number of servers increases]
TCP data-driven loss recovery

[Figure: the sender transmits Seq 1-5; packet 2 is lost; the receiver repeatedly sends Ack 1; 3 duplicate ACKs for 1 (packet 2 is probably lost) make the sender retransmit packet 2 immediately, after which the receiver sends Ack 5]

In SANs, data-driven recovery happens in microseconds after a loss.
TCP timeout-driven loss recovery

[Figure: the sender transmits Seq 1-5, but losses leave too few duplicate ACKs to trigger fast retransmit (only Ack 1 returns); the sender must wait for the Retransmission Timeout (RTO) before resending]

• timeouts are expensive (milliseconds to recover after a loss)
TCP Loss recovery comparison

[Figure: the two scenarios side by side: with data-driven recovery, the triple duplicate ACK triggers "Retransmit 2" and the transfer continues (Ack 5); with timeout-driven recovery, the sender idles for the full Retransmission Timeout (RTO)]

– timeout-driven recovery is slow (milliseconds)
– data-driven recovery is super fast (microseconds) in SANs
TCP Throughput Collapse: Summary

• synchronized reads and TCP timeouts cause TCP throughput collapse
• previously tried options:
  – increase buffer size (costly)
  – reduce RTOmin (unsafe)
  – use Ethernet flow control (limited applicability)
• DCTCP (Data Center TCP):
  – limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications
Perspective

• principles behind transport layer services: multiplexing/demultiplexing, reliable data transfer, flow control, congestion control
• instantiation and implementation in the Internet: UDP, TCP

Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"
Before Next time
• Project Proposal
  – due in one week
  – meet with groups, TA, and professor
• Lab1
  – single-threaded TCP proxy
  – due in one week, next Friday
• No required reading and review due
  – but review chapter 4 from the book (Network Layer)
  – we will also briefly discuss data center topologies
• Check website for updated schedule
Usertypes
lsquoCrsquo
host ACKsreceipt
of echoedlsquoCrsquo
host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo
simple telnet scenario
Host BHost A
Seq=42 ACK=79 data = lsquoCrsquo
Seq=79 ACK=43 data = lsquoCrsquo
Seq=43 ACK=80
TCP Reliable TransportTCP Sequence numbers and Acks
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing
to establish connection)bull agree on connection parameters
connection state ESTABconnection variables
seq client-to-server server-to-clientrcvBuffer size at serverclient
application
network
connection state ESTABconnection Variables
seq client-to-server server-to-clientrcvBuffer size at serverclient
application
network
Socket clientSocket = newSocket(hostnameport
number)
Socket connectionSocket = welcomeSocketaccept()
Connection Management TCP 3-way handshake
TCP Reliable Transport
SYNbit=1 Seq=x
choose init seq num xsend TCP SYN msg
ESTAB
SYNbit=1 Seq=yACKbit=1 ACKnum=x+1
choose init seq num ysend TCP SYNACKmsg acking SYN
ACKbit=1 ACKnum=y+1
received SYNACK(x) indicates server is livesend ACK for SYNACK
this segment may contain client-to-server data
received ACK(y) indicates client is live
SYNSENT
ESTAB
SYN RCVD
client state
LISTEN
server state
LISTEN
TCP Reliable TransportConnection Management TCP 3-way handshake
closed
L
listen
SYNrcvd
SYNsent
ESTAB
Socket clientSocket = newSocket(hostnameport
number)
SYN(seq=x)
Socket connectionSocket = welcomeSocketaccept()
SYN(x)
SYNACK(seq=yACKnum=x+1)create new socket for communication back to client
SYNACK(seq=yACKnum=x+1)
ACK(ACKnum=y+1)ACK(ACKnum=y+1)
L
TCP Reliable TransportConnection Management TCP 3-way handshake
client server each close their side of connection send TCP segment with FIN bit = 1
respond to received FIN with ACK on receiving FIN ACK can be combined with
own FINsimultaneous FIN exchanges can be
handled
TCP Reliable TransportConnection Management Closing connection
FIN_WAIT_2
CLOSE_WAIT
FINbit=1 seq=y
ACKbit=1 ACKnum=y+1
ACKbit=1 ACKnum=x+1 wait for server
close
can stillsend data
can no longersend data
LAST_ACK
CLOSED
TIMED_WAIT
timed wait for 2max
segment lifetime
CLOSED
FIN_WAIT_1 FINbit=1 seq=xcan no longersend but can receive data
clientSocketclose()
client state server state
ESTABESTAB
TCP Reliable TransportConnection Management Closing connection
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
data rcvd from appcreate segment with seq
seq is byte-stream
number of first data byte in segment
start timer if not already running think of timer as for
oldest unacked segment expiration interval
TimeOutInterval
timeoutretransmit segment that
caused timeoutrestart timer ack rcvdif ack acknowledges
previously unacked segments update what is known to
be ACKed start timer if there are
still unacked segments
TCP Reliable Transport
lost ACK scenario
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8 bytes of data
Xtim
eout
ACK=100
premature timeout
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8bytes of data
tim
eout
ACK=120
Seq=100 20 bytes of data
ACK=120
SendBase=100
SendBase=120
SendBase=120
SendBase=92
TCP Reliable TransportTCP Retransmission Scenerios
X
cumulative ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=120 15 bytes of data
tim
eout
Seq=100 20 bytes of data
ACK=120
TCP Reliable TransportTCP Retransmission Scenerios
event at receiver
arrival of in-order segment withexpected seq All data up toexpected seq already ACKed
arrival of in-order segment withexpected seq One other segment has ACK pending
arrival of out-of-order segmenthigher-than-expect seq Gap detected
arrival of segment that partially or completely fills gap
TCP receiver action
delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK
immediately send single cumulative ACK ACKing both in-order segments
immediately send duplicate ACK indicating seq of next expected byte
immediate send ACK provided thatsegment starts at lower end of gap
TCP ACK generation [RFC 1122 2581]Reliable Transport
time-out period often relatively long long delay before
resending lost packetdetect lost segments via
duplicate ACKs sender often sends many
segments back-to-back if segment is lost there
will likely be many duplicate ACKs
if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq
likely that unacked segment lost so donrsquot wait for timeout
TCP fast retransmit
(ldquotriple duplicate ACKsrdquo)
TCP Reliable TransportTCP Fast Retransmit
X
fast retransmit after sender receipt of triple duplicate ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
tim
eout
ACK=100
ACK=100
ACK=100
Seq=100 20 bytes of data
Seq=100 20 bytes of data
TCP Reliable TransportTCP Fast Retransmit
Q how to set TCP timeout value
longer than RTT but RTT varies
too short premature timeout unnecessary retransmissions
too long slow reaction to segment loss
Q how to estimate RTT
bull SampleRTT measured time from segment transmission until ACK receipt
ndash ignore retransmissions
bull SampleRTT will vary want estimated RTT ldquosmootherrdquo
ndash average several recent measurements not just current SampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
RTT gaiacsumassedu to fantasiaeurecomfr
100
150
200
250
300
350
1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106
time (seconnds)
RTT
(mill
isec
onds
)
SampleRTT Estimated RTT
EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases
exponentially fast typical value = 0125
RTT (
mill
iseco
nds)
RTT gaiacsumassedu to fantasiaeurecomfr
sampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
time (seconds)
bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin
bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|
(typically = 025)
TimeoutInterval = EstimatedRTT + 4DevRTT
estimated RTT ldquosafety marginrdquo
TCP Reliable TransportTCP Roundtrip time and timeouts
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
applicationprocess
TCP socketreceiver buffers
TCPcode
IPcode
application
OS
receiver protocol stack
application may remove data from
TCP socket buffers hellip
hellip slower than TCP receiver is delivering(sender is sending)
from sender
receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast
flow control
TCP Reliable TransportFlow Control
[Figure: receiver-side buffering. RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads arrive from sender; data drains to the application process]
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
 – RcvBuffer size set via socket options (typical default is 4096 bytes)
 – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
TCP Reliable Transport: Flow Control
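A toy sketch of the sender-side check implied above (the function name is mine, not a real API): keep LastByteSent − LastByteAcked at or below the advertised rwnd.

```python
def bytes_sender_may_send(last_byte_sent, last_byte_acked, rwnd):
    """Free space left in the receiver's advertised window."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)

# rwnd = 4096 (a typical default RcvBuffer size); 3000 bytes already
# in flight, so only 1096 more bytes may be sent before new ACKs arrive.
print(bytes_sender_may_send(13000, 10000, 4096))  # 1096
```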
Goals for Today
• Transport Layer
 – Abstraction / services
 – Multiplexing / Demultiplexing
 – UDP: Connectionless Transport
 – TCP: Reliable Transport
  • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
 – Congestion control
• Data Center TCP
 – Incast Problem
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control
• manifestations:
 – lost packets (buffer overflow at routers)
 – long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
 – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
 – explicit rate for sender to send at
Principles of Congestion Control
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]
TCP Congestion Control: TCP Fairness
approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss
[Figure: TCP sender congestion window size (cwnd) vs. time; AIMD saw tooth behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half)]
TCP Congestion Control: TCP Fairness. Why is TCP Fair? AIMD: additive increase, multiplicative decrease
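The saw tooth can be sketched in a few lines. The loss model here is mine (a loss is declared whenever cwnd hits a fixed ceiling), purely to make the shape visible:

```python
def aimd_trace(rtts, start=10, ceiling=16):
    """cwnd (in MSS) per RTT under AIMD with losses at a fixed ceiling."""
    cwnd, trace = start, []
    for _ in range(rtts):
        trace.append(cwnd)
        if cwnd >= ceiling:        # pretend a loss is detected here
            cwnd = cwnd // 2       # multiplicative decrease: halve
        else:
            cwnd += 1              # additive increase: +1 MSS per RTT
    return trace

print(aimd_trace(12))  # saw tooth: [10, 11, 12, 13, 14, 15, 16, 8, 9, 10, 11, 12]
```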
sender limits transmission: LastByteSent − LastByteAcked ≤ cwnd
cwnd is dynamic, a function of perceived network congestion
TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes
rate ≈ cwnd / RTT bytes/sec
[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight", ≤ cwnd); last byte sent]
TCP Congestion Control
two competing sessions: additive increase gives slope of 1 as throughput increases; multiplicative decrease decreases throughput proportionally
[Figure: Connection 1 throughput vs. Connection 2 throughput, both bounded by R, with the equal bandwidth share line; the trajectory alternates congestion avoidance (additive increase) and loss (decrease window by factor of 2), converging toward the equal-share line]
TCP Congestion Control: TCP Fairness. Why is TCP Fair?
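The convergence argument can be checked numerically. This is an idealized sketch, assuming both flows see synchronized loss whenever their combined rate exceeds capacity R:

```python
def aimd_pair(x, y, R=100, rounds=200):
    """Two AIMD flows sharing a bottleneck of capacity R."""
    for _ in range(rounds):
        if x + y > R:
            x, y = x / 2, y / 2     # multiplicative decrease for both
        else:
            x, y = x + 1, y + 1     # additive increase for both
    return x, y

# Start far from fair (5 vs 80): halving shrinks the gap, adding keeps it,
# so the flows drift toward the equal-share line.
x, y = aimd_pair(5.0, 80.0)
print(abs(x - y))   # small: near-equal shares after many cycles
```

Each loss halves the difference x − y while additive increase preserves it, so the gap decays geometrically; this is the slides' phase-plot argument in one loop.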
Fairness and UDP:
• multimedia apps often do not use TCP: do not want rate throttled by congestion control
• instead use UDP: send audio/video at constant rate, tolerate packet loss
Fairness and parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
 – new app asks for 1 TCP, gets rate R/10
 – new app asks for 11 TCPs, gets R/2
TCP Congestion Control: TCP Fairness
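The slide's arithmetic, checked: with n equal TCP connections on a link of rate R, each gets roughly R/n, so an app's aggregate share scales with how many connections it opens.

```python
def app_rate(R, existing, new_conns):
    """Aggregate rate of an app opening new_conns connections
    on a link already carrying `existing` connections."""
    return new_conns * R / (existing + new_conns)

R = 1.0
print(app_rate(R, 9, 1))    # 0.1  -> R/10
print(app_rate(R, 9, 11))   # 0.55 -> about R/2
```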
when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received
summary: initial rate is slow, but ramps up exponentially fast
[Figure: Host A sends one segment to Host B, waits one RTT, then sends two segments, then four segments]
TCP Congestion Control: Slow Start
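The per-ACK increment above is what produces doubling per RTT: every segment in the window earns one ACK, and each ACK adds one MSS. A minimal sketch:

```python
def slow_start(rtts, mss=1):
    """cwnd (in MSS) at the start of each RTT under slow start."""
    cwnd = 1 * mss
    sizes = [cwnd]
    for _ in range(rtts):
        acks = cwnd // mss      # one ACK per segment in the window
        cwnd += acks * mss      # +1 MSS per ACK => cwnd doubles per RTT
        sizes.append(cwnd)
    return sizes

print(slow_start(4))  # [1, 2, 4, 8, 16]
```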
TCP Congestion Control
• loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs (TCP Reno): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)
Detecting and Reacting to Loss
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation: variable ssthresh; on loss event, ssthresh is set to 1/2 of cwnd just before loss event
TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
[TCP congestion control FSM, reconstructed from the diagram]
slow start (entry: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0):
 – new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
 – duplicate ACK: dupACKcount++
 – cwnd > ssthresh: go to congestion avoidance
 – timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; stay in slow start
 – dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
congestion avoidance:
 – new ACK: cwnd = cwnd + MSS·(MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
 – duplicate ACK: dupACKcount++
 – timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
 – dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
fast recovery:
 – duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
 – new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
 – timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
TCP Congestion Control
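The FSM above, as a compact Python sketch (class and event names are mine; cwnd is in MSS units to keep the arithmetic visible):

```python
class Reno:
    """Sketch of the slide's Reno FSM: slow start / congestion
    avoidance / fast recovery. Events: new_ack, dup_ack, timeout."""
    def __init__(self, mss=1, ssthresh=64):
        self.mss, self.cwnd, self.ssthresh = mss, mss, ssthresh
        self.dup, self.state = 0, "slow_start"

    def on_event(self, ev):
        if ev == "timeout":                       # from any state
            self.ssthresh = self.cwnd / 2
            self.cwnd, self.dup, self.state = self.mss, 0, "slow_start"
        elif ev == "dup_ack":
            if self.state == "fast_recovery":
                self.cwnd += self.mss             # inflate per dup ACK
            else:
                self.dup += 1
                if self.dup == 3:                 # triple duplicate ACK
                    self.ssthresh = self.cwnd / 2
                    self.cwnd = self.ssthresh + 3 * self.mss
                    self.state = "fast_recovery"
        elif ev == "new_ack":
            if self.state == "fast_recovery":
                self.cwnd, self.state = self.ssthresh, "congestion_avoidance"
            elif self.state == "slow_start":
                self.cwnd += self.mss             # exponential growth
                if self.cwnd > self.ssthresh:
                    self.state = "congestion_avoidance"
            else:                                 # congestion avoidance
                self.cwnd += self.mss * (self.mss / self.cwnd)
            self.dup = 0

r = Reno()
for _ in range(3):
    r.on_event("dup_ack")
print(r.state, r.cwnd)   # fast_recovery 3.5  (ssthresh=0.5, cwnd=0.5+3)
```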
• avg TCP thruput as function of window size, RTT
 – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
 – avg window size (# in-flight bytes) is 3/4 W
 – avg thruput is 3/4 W per RTT
avg TCP thruput = (3/4)·W / RTT bytes/sec
[Figure: window saw-tooths between W/2 and W]
TCP Throughput
TCP over "long, fat pipes"
• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
 TCP throughput = 1.22·MSS / (RTT·√L)
→ to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
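The slide's numbers can be checked directly by rearranging the Mathis formula for L:

```python
import math

mss = 1500 * 8        # segment size in bits
rtt = 0.100           # seconds
target = 10e9         # 10 Gbps

# Required in-flight window: W = target * RTT / MSS
w = target * rtt / mss
print(round(w))       # 83333 in-flight segments

# throughput = 1.22 * MSS / (RTT * sqrt(L))  =>  L = (1.22*MSS/(RTT*T))^2
L = (1.22 * mss / (rtt * target)) ** 2
print(L)              # ~2.1e-10, the slide's 2*10^-10
```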
Goals for Today
• Transport Layer
 – Abstraction / services
 – Multiplexing / Demultiplexing
 – UDP: Connectionless Transport
 – TCP: Reliable Transport
  • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
 – Congestion control
• Data Center TCP
 – Incast Problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008
TCP Throughput Collapse: What happens when TCP is "too friendly"?
E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer
 – SRU size is fixed
• TCP used as the data transfer protocol
Cluster-based Storage Systems
[Figure: a Client connects through a Switch to Storage Servers 1-4. The client sends a request (R) to each server; each server returns its Server Request Unit (SRU) of the data block. Synchronized read: only after SRUs 1, 2, 3, 4 all arrive does the client send the next batch of requests]
Link idle time due to timeouts
[Figure: same synchronized read, but server 4's SRU is dropped at the switch; the link is idle until server 4 experiences a timeout]
TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts
[Figure: goodput collapses as the number of servers grows]
TCP data-driven loss recovery
[Figure: sender transmits Seq 1-5; packet 2 is lost. The receiver ACKs 1 for each of packets 3, 4, 5, giving 3 duplicate ACKs for 1 (packet 2 is probably lost); the sender retransmits packet 2 immediately, and the receiver then sends Ack 5]
In SANs: recovery in µsecs after loss
TCP timeout-driven loss recovery
[Figure: sender transmits Seq 1-5; only packet 1 gets through, producing a single Ack 1. The sender must wait for the Retransmission Timeout (RTO) before resending]
• Timeouts are expensive (msecs to recover after loss)
TCP loss recovery comparison
[Figure: side by side, data-driven recovery (3 duplicate ACKs trigger retransmit of 2, then Ack 5) vs. timeout-driven recovery (a single Ack 1, then waiting for the RTO)]
Timeout-driven recovery is slow (ms); data-driven recovery is super fast (µs) in SANs
TCP Throughput Collapse: Summary
• Synchronized reads and TCP timeouts cause TCP throughput collapse
• Previously tried options:
 – Increase buffer size (costly)
 – Reduce RTOmin (unsafe)
 – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP)
 – Limit in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications
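The slides stop at the bullet above. As a hedged sketch of what "in-network signaling plus end-to-end modifications" means, here is the sender window rule from the DCTCP paper (Alizadeh et al., SIGCOMM 2010), not from these slides: switches ECN-mark packets past a small queue threshold, the sender estimates the marked fraction α, and cuts cwnd in proportion to congestion. The gain g = 1/16 and the variable names are the paper's.

```python
def dctcp_update(cwnd, alpha, marked_fraction, g=1/16):
    """One round of DCTCP: update alpha from the fraction F of
    ECN-marked ACKs, then cut cwnd in proportion to congestion."""
    alpha = (1 - g) * alpha + g * marked_fraction
    if marked_fraction > 0:
        cwnd = cwnd * (1 - alpha / 2)   # mild cut if few marks
    return cwnd, alpha

cwnd, alpha = 100.0, 1.0
cwnd, alpha = dctcp_update(cwnd, alpha, marked_fraction=1.0)
print(cwnd)   # 50.0: with every packet marked, DCTCP halves like TCP
```

With only a small fraction marked, the cut is far gentler than Reno's halving, which is how DCTCP keeps queues short without collapsing throughput.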
Perspective
• principles behind transport layer services: multiplexing, demultiplexing; reliable data transfer; flow control; congestion control
• instantiation, implementation in the Internet: UDP, TCP
Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"
Before Next time
• Project Proposal
 – due in one week
 – Meet with groups, TA, and professor
• Lab1
 – Single threaded TCP proxy
 – Due in one week, next Friday
• No required reading and review due
• But review chapter 4 from the book: Network Layer
 – We will also briefly discuss data center topologies
• Check website for updated schedule
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other is willing to establish connection)
• agree on connection parameters
[Figure: after the handshake, client and server each hold connection state ESTAB and connection variables: seq #s (client-to-server, server-to-client) and rcvBuffer size at server, client]
client: Socket clientSocket = new Socket("hostname", port number);
server: Socket connectionSocket = welcomeSocket.accept();
TCP Reliable Transport: Connection Management, TCP 3-way handshake
[Figure: 3-way handshake message exchange; client state on the left, server state on the right, both starting in LISTEN.
 client: choose init seq num x; send TCP SYN msg (SYNbit=1, Seq=x) → SYNSENT
 server: choose init seq num y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1) → SYN RCVD
 client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data → ESTAB
 server: received ACK(y) indicates client is live → ESTAB]
TCP Reliable Transport: Connection Management, TCP 3-way handshake
[Figure: handshake FSM.
 client: CLOSED, on Socket clientSocket = new Socket("hostname", port number): send SYN(seq=x) → SYNSENT; on SYNACK(seq=y, ACKnum=x+1): send ACK(ACKnum=y+1) → ESTAB
 server: CLOSED, on Socket connectionSocket = welcomeSocket.accept() → LISTEN; on SYN(x): send SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client → SYNRCVD; on ACK(ACKnum=y+1) → ESTAB]
TCP Reliable Transport: Connection Management, TCP 3-way handshake
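The two Java lines on the slides, sketched end-to-end in Python: connect() performs the SYN / SYNACK / ACK exchange before returning, and accept() hands back a new per-connection socket once the server side reaches ESTAB. The loopback address, OS-chosen port, and message are illustrative.

```python
import socket, threading

# LISTEN: server creates the welcoming socket (mirrors welcomeSocket).
welcome = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
welcome.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
welcome.listen(1)
port = welcome.getsockname()[1]

def server():
    conn, _ = welcome.accept()        # returns after the 3-way handshake
    conn.sendall(b"hello")
    conn.close()

t = threading.Thread(target=server)
t.start()

# Client (mirrors new Socket(hostname, port)): connect() runs the handshake.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
data = client.recv(5)
print(data)                           # b'hello'
client.close()
t.join()
welcome.close()
```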
client, server each close their side of connection: send TCP segment with FIN bit = 1
respond to received FIN with ACK: on receiving FIN, ACK can be combined with own FIN
simultaneous FIN exchanges can be handled
TCP Reliable Transport: Connection Management, Closing a connection
[Figure: closing a connection; client and server both start in ESTAB.
 client: clientSocket.close(); send FINbit=1, seq=x → FIN_WAIT_1 (can no longer send, but can receive data)
 server: ACKbit=1, ACKnum=x+1 → CLOSE_WAIT (can still send data); client → FIN_WAIT_2 (wait for server close)
 server: send FINbit=1, seq=y → LAST_ACK (can no longer send data)
 client: ACKbit=1, ACKnum=y+1 → TIMED_WAIT (timed wait for 2 × max segment lifetime) → CLOSED
 server: on receiving the ACK → CLOSED]
TCP Reliable Transport: Connection Management, Closing a connection
TCP sender events:
data rcvd from app:
• create segment with seq #; seq # is byte-stream number of first data byte in segment
• start timer if not already running (think of timer as for oldest unacked segment); expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments: update what is known to be ACKed; start timer if there are still unacked segments
TCP Reliable Transport
[Figure: TCP retransmission scenarios.
 lost ACK scenario: Host A sends Seq=92, 8 bytes of data; Host B's ACK=100 is lost (X); A times out and resends Seq=92, 8 bytes; B re-ACKs 100; SendBase=100 once the ACK arrives.
 premature timeout: A sends Seq=92, 8 bytes then Seq=100, 20 bytes; ACK=100 and ACK=120 arrive after A's timeout, so A resends Seq=92, 8 bytes; B replies with cumulative ACK=120; SendBase moves 92 → 100 → 120]
TCP Reliable Transport: TCP retransmission scenarios
[Figure: cumulative ACK scenario. Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; ACK=100 is lost (X), but cumulative ACK=120 arrives before the timeout, so nothing is retransmitted; A next sends Seq=120, 15 bytes]
TCP Reliable Transport: TCP retransmission scenarios
TCP ACK generation [RFC 1122, RFC 2581]
event at receiver → TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap
TCP Reliable Transport
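A simplified sketch of the receiver rules above (the function and return strings are mine, and the gap-filling case is omitted for brevity):

```python
def receiver_action(seg_seq, expected, ack_pending):
    """Which ACK rule fires for an arriving segment (simplified)."""
    if seg_seq == expected:                    # in-order segment
        if ack_pending:
            return "cumulative ACK now"        # covers both segments
        return "delayed ACK (<=500ms)"         # wait for a companion
    if seg_seq > expected:                     # gap detected
        return "duplicate ACK now"             # re-ACK next expected byte
    return "already ACKed"                     # old / retransmitted data

print(receiver_action(100, 100, ack_pending=False))  # delayed ACK (<=500ms)
print(receiver_action(108, 108, ack_pending=True))   # cumulative ACK now
print(receiver_action(150, 108, ack_pending=False))  # duplicate ACK now
```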
time-out period often relatively long: long delay before resending lost packet
detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if a segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
 – likely that unacked segment lost, so don't wait for timeout
TCP Reliable Transport: TCP Fast Retransmit
[Figure: fast retransmit. Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; the Seq=100 segment is lost (X). Host B sends ACK=100 for each later arrival; after the triple duplicate ACK, A resends Seq=100, 20 bytes before its timeout expires]
TCP Reliable Transport: TCP Fast Retransmit
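The trigger condition is easy to sketch (the helper name is mine): scan the ACK stream and fire as soon as the same ACK value has repeated three times beyond its first appearance.

```python
def fast_retransmit(acks):
    """Return the seq # to retransmit after a triple duplicate ACK,
    or None if the trigger never fires."""
    dup, last = 0, None
    for a in acks:
        if a == last:
            dup += 1
            if dup == 3:            # triple duplicate ACK
                return a            # retransmit segment starting at seq a
        else:
            last, dup = a, 0        # new ACK value resets the count
    return None

print(fast_retransmit([100, 100, 100, 100]))  # 100
print(fast_retransmit([100, 120]))            # None
```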
Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
 – ignore retransmissions
• SampleRTT will vary; want estimated RTT "smoother"
 – average several recent measurements, not just current SampleRTT
TCP Reliable Transport: TCP round-trip time and timeouts
RTT gaiacsumassedu to fantasiaeurecomfr
100
150
200
250
300
350
1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106
time (seconnds)
RTT
(mill
isec
onds
)
SampleRTT Estimated RTT
EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases
exponentially fast typical value = 0125
RTT (
mill
iseco
nds)
RTT gaiacsumassedu to fantasiaeurecomfr
sampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
time (seconds)
bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin
bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|
(typically = 025)
TimeoutInterval = EstimatedRTT + 4DevRTT
estimated RTT ldquosafety marginrdquo
TCP Reliable TransportTCP Roundtrip time and timeouts
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
applicationprocess
TCP socketreceiver buffers
TCPcode
IPcode
application
OS
receiver protocol stack
application may remove data from
TCP socket buffers hellip
hellip slower than TCP receiver is delivering(sender is sending)
from sender
receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast
flow control
TCP Reliable TransportFlow Control
buffered data
free buffer spacerwnd
RcvBuffer
TCP segment payloads
to application processbull receiver ldquoadvertisesrdquo free
buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via
socket options (typical default is 4096 bytes)
ndash many operating systems autoadjust RcvBuffer
bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value
bull guarantees receive buffer will not overflow
receiver-side buffering
TCP Reliable TransportFlow Control
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
congestionbull informally ldquotoo many sources sending too
much data too fast for network to handlerdquobull different from flow controlbull manifestations
ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control
end-end congestion control
no explicit feedback from network
congestion inferred from end-system observed loss delay
approach taken by TCP
network-assisted congestion control
routers provide feedback to end systemssingle bit indicating
congestion (SNA DECbit TCPIP ECN ATM)
explicit rate for sender to send at
Principles of Congestion Control
fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK
TCP connection 1
bottleneckroutercapacity RTCP connection 2
TCP Congestion ControlTCP Fairness
approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1
MSS every RTT until loss detected multiplicative decrease cut cwnd in half
after loss
cwnd
TC
P s
end
er
cong
estio
n w
indo
w s
ize
AIMD saw toothbehavior probing
for bandwidth
additively increase window size helliphellip until loss occurs (then cut window in half)
time
TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease
sender limits transmission
cwnd is dynamic function of perceived network congestion
TCP sending rateroughly send cwnd
bytes wait RTT for ACKS then send more bytes
last byteACKed sent not-
yet ACKed(ldquoin-flightrdquo)
last byte sent
cwnd
LastByteSent-LastByteAcked
lt cwnd
sender sequence number space
rate ~~cwnd
RTTbytessec
TCP Congestion Control
two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally
R
R
equal bandwidth share
Connection 1 throughput
Con
nect
ion
2 th
roug
h pu t
congestion avoidance additive increaseloss decrease window by factor of 2
congestion avoidance additive increaseloss decrease window by factor of 2
TCP Congestion ControlTCP Fairness Why is TCP Fair
Fairness and UDPmultimedia apps
often do not use TCP do not want rate
throttled by congestion control
instead use UDP send audiovideo at
constant rate tolerate packet loss
Fairness parallel TCP connections
application can open multiple parallel connections between two hosts
web browsers do this eg link of rate R with 9
existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2
TCP Congestion ControlTCP Fairness
when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd
for every ACK receivedsummary initial rate is
slow but ramps up exponentially fast
Host A
one segment
RT
T
Host B
time
two segments
four segments
TCP Congestion ControlSlow Start
TCP Congestion Control
loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly
loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
Detecting and Reacting to Loss
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++
duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++
duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP Congestion Control
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
bull Timeouts are expensive(msecs to recover after loss)
TCP Loss recovery comparison
Sender Receiver
12345
Ack 1
Ack 1Ack 1Ack 1
Retransmit 2
Seq
Ack 5
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
Timeout driven recovery is slow (ms)
Data-driven recovery issuper fast (us) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control
instantiation implementation in the Internet UDP TCP
Next timebull Network Layerbull leaving the network
ldquoedgerdquo (application transport layers)
bull into the network ldquocorerdquo
Perspective
Before Next time
bull Project Proposalndash due in one weekndash Meet with groups TA and professor
bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday
bull No required reading and review duebull But review chapter 4 from the book Network Layer
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
-
before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing
to establish connection)bull agree on connection parameters
connection state ESTABconnection variables
seq client-to-server server-to-clientrcvBuffer size at serverclient
application
network
connection state ESTABconnection Variables
seq client-to-server server-to-clientrcvBuffer size at serverclient
application
network
Socket clientSocket = newSocket(hostnameport
number)
Socket connectionSocket = welcomeSocketaccept()
Connection Management TCP 3-way handshake
TCP Reliable Transport
SYNbit=1 Seq=x
choose init seq num xsend TCP SYN msg
ESTAB
SYNbit=1 Seq=yACKbit=1 ACKnum=x+1
choose init seq num ysend TCP SYNACKmsg acking SYN
ACKbit=1 ACKnum=y+1
received SYNACK(x) indicates server is livesend ACK for SYNACK
this segment may contain client-to-server data
received ACK(y) indicates client is live
SYNSENT
ESTAB
SYN RCVD
client state
LISTEN
server state
LISTEN
TCP Reliable TransportConnection Management TCP 3-way handshake
closed
L
listen
SYNrcvd
SYNsent
ESTAB
Socket clientSocket = newSocket(hostnameport
number)
SYN(seq=x)
Socket connectionSocket = welcomeSocketaccept()
SYN(x)
SYNACK(seq=yACKnum=x+1)create new socket for communication back to client
SYNACK(seq=yACKnum=x+1)
ACK(ACKnum=y+1)ACK(ACKnum=y+1)
L
TCP Reliable TransportConnection Management TCP 3-way handshake
client server each close their side of connection send TCP segment with FIN bit = 1
respond to received FIN with ACK on receiving FIN ACK can be combined with
own FINsimultaneous FIN exchanges can be
handled
TCP Reliable TransportConnection Management Closing connection
FIN_WAIT_2
CLOSE_WAIT
FINbit=1 seq=y
ACKbit=1 ACKnum=y+1
ACKbit=1 ACKnum=x+1 wait for server
close
can stillsend data
can no longersend data
LAST_ACK
CLOSED
TIMED_WAIT
timed wait for 2max
segment lifetime
CLOSED
FIN_WAIT_1 FINbit=1 seq=xcan no longersend but can receive data
clientSocketclose()
client state server state
ESTABESTAB
TCP Reliable TransportConnection Management Closing connection
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver
TCP (Transmission Control Protocol) RFCs: 793, 1122, 1323, 2018, 2581
TCP Reliable Transport
data rcvd from app: create segment with seq #; seq # is byte-stream number of first data byte in segment; start timer if not already running (think of timer as for oldest unacked segment); expiration interval: TimeOutInterval
timeout: retransmit segment that caused timeout; restart timer
ack rcvd: if ack acknowledges previously unacked segments, update what is known to be ACKed; start timer if there are still unacked segments
TCP Reliable Transport
lost ACK scenario
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8 bytes of data
X (loss)
timeout
ACK=100
premature timeout
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92, 8 bytes of data
timeout
ACK=120
Seq=100 20 bytes of data
ACK=120
SendBase=100
SendBase=120
SendBase=120
SendBase=92
TCP Reliable Transport: TCP Retransmission Scenarios
X
cumulative ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=120 15 bytes of data
timeout
Seq=100 20 bytes of data
ACK=120
TCP Reliable Transport: TCP Retransmission Scenarios
event at receiver -> TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed -> delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending -> immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq # (gap detected) -> immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap -> immediately send ACK, provided that segment starts at lower end of gap
TCP ACK generation [RFC 1122, RFC 2581]: Reliable Transport
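The four table rows above can be sketched as a tiny receiver-side classifier. A minimal sketch, not the RFC algorithm: it tracks the next expected byte plus a set of buffered out-of-order byte numbers, and the function name and signature are invented for illustration.

```python
# Sketch of the RFC 1122/2581 ACK-generation rules from the table above.
# `buffered` holds byte numbers received out of order (gap contents).

def classify_and_act(seg_seq, seg_len, expected, buffered):
    """Return (new expected byte, ACK action) for one arriving segment."""
    if seg_seq == expected:
        # In-order segment; it may also close the lower end of a gap.
        new_expected = seg_seq + seg_len
        filled = False
        while new_expected in buffered:      # consume contiguous buffered bytes
            buffered.discard(new_expected)
            new_expected += 1
            filled = True
        action = ("immediate ACK (segment filled gap)" if filled
                  else "delayed or cumulative ACK (in-order)")
        return new_expected, action
    if seg_seq > expected:
        # Out-of-order: buffer it, and duplicate-ACK the expected byte.
        for b in range(seg_seq, seg_seq + seg_len):
            buffered.add(b)
        return expected, "immediate duplicate ACK (gap detected)"
    return expected, "immediate ACK (old data)"

buf = set()
print(classify_and_act(100, 10, 100, buf))  # in-order: delayed/cumulative ACK
print(classify_and_act(120, 10, 110, buf))  # gap at 110: duplicate ACK
print(classify_and_act(110, 10, 110, buf))  # fills gap: immediate ACK, expected 130
```

Note how the third call advances `expected` past the buffered bytes in one step, which is what a cumulative ACK expresses on the wire.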
• time-out period often relatively long: long delay before resending lost packet
• detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #; likely that unacked segment was lost, so don't wait for timeout
TCP Reliable Transport: TCP Fast Retransmit
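The sender-side rule can be sketched as a duplicate-ACK counter. A toy illustration, not a full TCP sender; the function name is invented:

```python
# Sketch of fast retransmit: on the 3rd duplicate ACK for the same byte,
# retransmit the smallest unACKed segment without waiting for the timeout.

def process_acks(acks):
    """Return a list of (ack_num, action) for a stream of ACK numbers."""
    actions = []
    last_ack, dup_count = None, 0
    for a in acks:
        if a == last_ack:
            dup_count += 1
            if dup_count == 3:               # triple duplicate ACK
                actions.append((a, "fast retransmit seq %d" % a))
                continue
        else:
            last_ack, dup_count = a, 0       # new cumulative ACK resets count
        actions.append((a, "ok"))
    return actions

# Seq 100 was lost: the receiver keeps ACKing 100.
log = process_acks([100, 100, 100, 100])
print(log[-1])  # (100, 'fast retransmit seq 100')
```

The fourth ACK is the third *duplicate*, which is what triggers the retransmission; a new cumulative ACK resets the counter.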
X
fast retransmit after sender receipt of triple duplicate ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
timeout
ACK=100
ACK=100
ACK=100
Seq=100 20 bytes of data
Seq=100 20 bytes of data
TCP Reliable Transport: TCP Fast Retransmit
Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary; want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
TCP Reliable Transport: TCP Roundtrip time and timeouts
[Figure: SampleRTT and EstimatedRTT, gaia.cs.umass.edu to fantasia.eurecom.fr; y-axis: RTT (milliseconds), x-axis: time (seconds)]
EstimatedRTT = (1 - α)·EstimatedRTT + α·SampleRTT
exponentially weighted moving average; influence of past sample decreases exponentially fast; typical value: α = 0.125
TCP Reliable Transport: TCP Roundtrip time and timeouts
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  DevRTT = (1 - β)·DevRTT + β·|SampleRTT - EstimatedRTT|   (typically, β = 0.25)
TimeoutInterval = EstimatedRTT + 4·DevRTT
                  (estimated RTT)  (4 × "safety margin")
TCP Reliable Transport: TCP Roundtrip time and timeouts
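The two EWMA formulas above can be sketched as a small estimator with the textbook defaults α = 0.125, β = 0.25. A minimal sketch; initializing EstimatedRTT to the first sample is an assumption for illustration, and the helper name is invented:

```python
# EWMA timeout computation from the formulas above:
#   EstimatedRTT = (1 - alpha)*EstimatedRTT + alpha*SampleRTT
#   DevRTT       = (1 - beta)*DevRTT + beta*|SampleRTT - EstimatedRTT|
#   TimeoutInterval = EstimatedRTT + 4*DevRTT

def make_rtt_estimator(alpha=0.125, beta=0.25):
    est, dev = None, 0.0
    def update(sample):
        nonlocal est, dev
        if est is None:
            est = sample                 # first sample initializes EstimatedRTT
        else:
            dev = (1 - beta) * dev + beta * abs(sample - est)
            est = (1 - alpha) * est + alpha * sample
        return est + 4 * dev             # TimeoutInterval
    return update

timeout = make_rtt_estimator()
for s in [100.0, 120.0, 110.0]:          # SampleRTTs in ms
    t = timeout(s)
print(round(t, 1))                       # 125.9 ms after three samples
```

Note how a single 20 ms swing inflates DevRTT, so the 4·DevRTT term widens the timeout well past the smoothed RTT: exactly the "safety margin" the slide describes.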
[Diagram: receiver protocol stack: application process reads from TCP socket receiver buffers; TCP code and IP code in the OS; data arrives from the sender]
application may remove data from TCP socket buffers ...
... slower than TCP receiver is delivering (sender is sending)
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
TCP Reliable Transport: Flow Control
[Diagram: receiver-side buffering: RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads flow in; data is read out to the application process]
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
TCP Reliable Transport: Flow Control
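The sender-side rule, unACKed bytes in flight ≤ rwnd, can be sketched in one function (an illustration with an invented name, not a real TCP API):

```python
# Sketch: sender limited by the receiver's advertised window (rwnd).
# Keeping in-flight bytes <= rwnd is what guarantees the receive
# buffer cannot overflow.

def sendable(last_byte_sent, last_byte_acked, rwnd):
    """How many more bytes the sender may transmit right now."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)

print(sendable(5000, 3000, 4096))  # 2096 bytes may still be sent
print(sendable(7096, 3000, 4096))  # 0: window full, must wait for ACKs
```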
Goals for Today
• Transport Layer
  - Abstraction / services
  - Multiplexing/Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  - Congestion control
• Data Center TCP
  - Incast Problem
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
Principles of Congestion Control
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
TCP connection 1
bottleneck router, capacity R
TCP connection 2
TCP Congestion Control: TCP Fairness
AIMD (additive increase, multiplicative decrease)
approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss
[Figure: TCP sender congestion window size (cwnd) over time; AIMD sawtooth behavior, probing for bandwidth: additively increase window size ... until loss occurs (then cut window in half)]
TCP Congestion Control: TCP Fairness, Why is TCP Fair?
sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
cwnd is dynamic, a function of perceived network congestion
TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes
rate ≈ cwnd / RTT  bytes/sec
[Diagram: sender sequence number space: last byte ACKed; sent, not-yet-ACKed ("in-flight"); last byte sent; window of size cwnd]
TCP Congestion Control
two competing sessions: additive increase gives slope of 1 as throughput increases; multiplicative decrease decreases throughput proportionally
[Figure: Connection 1 throughput vs Connection 2 throughput, both axes 0..R, with "equal bandwidth share" line; trajectory alternates congestion avoidance (additive increase) with loss (decrease window by factor of 2), converging toward the fair share]
TCP Congestion Control: TCP Fairness, Why is TCP Fair?
Fairness and UDP:
• multimedia apps often do not use TCP: do not want rate throttled by congestion control
• instead use UDP: send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections:
• application can open multiple parallel connections between two hosts; web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
TCP Congestion Control: TCP Fairness
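The parallel-connection arithmetic above can be worked out directly. A sketch under the slide's idealized assumption that the bottleneck is shared equally per connection (the function name is invented):

```python
# Per-connection fair sharing: an app opening `app_conns` connections on a
# link with `existing` connections gets app_conns / (existing + app_conns)
# of the link rate R.

def app_share(R, existing, app_conns):
    return R * app_conns / (existing + app_conns)

R = 1.0                          # normalize link rate to 1
print(app_share(R, 9, 1))        # 0.1  -> R/10 with one connection
print(app_share(R, 9, 11))       # 0.55 -> ~R/2 with 11 connections
```

This is why per-connection fairness is not per-application fairness: an aggressive app can grab nearly half the link just by opening more connections.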
when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received
summary: initial rate is slow, but ramps up exponentially fast
[Diagram: Host A sends one segment, then two segments, then four segments to Host B, one RTT apart]
TCP Congestion Control: Slow Start
TCP Congestion Control
Detecting and Reacting to Loss
• loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs (TCP Reno): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation: variable ssthresh; on loss event, ssthresh is set to 1/2 of cwnd just before loss event
TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
TCP congestion-control FSM (slow start, congestion avoidance, fast recovery):
slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s) as allowed
• cwnd > ssthresh: enter congestion avoidance
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
congestion avoidance:
• new ACK: cwnd = cwnd + MSS·(MSS/cwnd); dupACKcount = 0; transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start
fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s) as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; enter congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start
initial state (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0
TCP Congestion Control
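The per-RTT effect of those rules can be traced with a toy simulation. This is a deliberately simplified sketch, not the full FSM: it works in whole MSS units, treats fast recovery as an instant halving (Reno-style), and the event labels are invented:

```python
# Toy per-RTT cwnd trace: slow start doubles cwnd up to ssthresh,
# congestion avoidance adds 1 MSS per RTT, triple-dup-ACK loss halves
# cwnd (Reno), timeout resets cwnd to 1 MSS. Units: MSS.

def step(cwnd, ssthresh, event):
    if event == "timeout":
        return 1, max(cwnd // 2, 2)
    if event == "3dupacks":                  # Reno: multiplicative decrease
        return max(cwnd // 2, 2), max(cwnd // 2, 2)
    if cwnd < ssthresh:                      # slow start: exponential growth
        return min(cwnd * 2, ssthresh), ssthresh
    return cwnd + 1, ssthresh                # congestion avoidance: linear

cwnd, ssthresh = 1, 8
trace = []
events = ["ack"] * 5 + ["3dupacks"] + ["ack"] * 3 + ["timeout", "ack"]
for e in events:
    cwnd, ssthresh = step(cwnd, ssthresh, e)
    trace.append(cwnd)
print(trace)  # exponential ramp, linear probe, halving, then restart at 1
```

The trace shows the sawtooth from the AIMD figure: 1 → 2 → 4 → 8 (slow start), 9 → 10 (linear), halved to 5 on triple dup-ACK, then back to 1 after a timeout.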
TCP Throughput
• avg TCP throughput as function of window size, RTT
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg window size (# in-flight bytes) is 3/4 W
  - avg throughput is 3/4 W per RTT
avg TCP throughput = (3/4)·W / RTT  bytes/sec
[Diagram: window oscillates between W/2 and W]
TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  TCP throughput = 1.22·MSS / (RTT·√L)
• to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰, a very small loss rate!
• new versions of TCP for high-speed
Goals for Today
• Transport Layer
  - Abstraction / services
  - Multiplexing/Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  - Congestion control
• Data Center TCP
  - Incast Problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008
TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer
  - SRU size is fixed
• TCP used as the data transfer protocol
Cluster-based Storage Systems
[Diagram: Client connected via Switch to Storage Servers; client sends a request (R) to each of servers 1-4; each returns its Server Request Unit (SRU) of the Data Block; synchronized read: after receiving SRUs 1, 2, 3, 4, client sends next batch of requests]
Link idle time due to timeouts
[Diagram: same synchronized read over Client/Switch; server 4's SRU is lost; link is idle until server 4 experiences a timeout]
• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts
[Figure: throughput collapse as # of servers increases]
TCP Throughput Collapse: Incast
TCP data-driven loss recovery
[Diagram: Sender transmits Seq 1-5; packet 2 is lost; receiver returns Ack 1 repeatedly; after 3 duplicate ACKs for 1 (packet 2 is probably lost), sender retransmits packet 2 immediately; receiver then sends Ack 5]
In SANs: recovery in µsecs after loss
TCP timeout-driven loss recovery
[Diagram: Sender transmits Seq 1-5; only packet 1 is ACKed; sender waits out the Retransmission Timeout (RTO) before resending]
• Timeouts are expensive (msecs to recover after loss)
TCP Loss recovery comparison
[Diagrams: data-driven recovery (triple duplicate ACK, Retransmit 2, then Ack 5) vs timeout-driven recovery (single Ack 1, wait for Retransmission Timeout (RTO))]
Timeout-driven recovery is slow (ms); data-driven recovery is super fast (µs) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
• principles behind transport layer services: multiplexing/demultiplexing, reliable data transfer, flow control, congestion control
• instantiation, implementation in the Internet: UDP, TCP
Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"
Perspective
Before Next time
• Project Proposal
  - due in one week
  - Meet with groups, TA, and professor
• Lab1
  - Single threaded TCP proxy
  - Due in one week, next Friday
• No required reading and review due
• But review chapter 4 from the book: Network Layer
  - We will also briefly discuss data center topologies
• Check website for updated schedule
SYNbit=1 Seq=x
choose init seq num xsend TCP SYN msg
ESTAB
SYNbit=1 Seq=yACKbit=1 ACKnum=x+1
choose init seq num ysend TCP SYNACKmsg acking SYN
ACKbit=1 ACKnum=y+1
received SYNACK(x) indicates server is livesend ACK for SYNACK
this segment may contain client-to-server data
received ACK(y) indicates client is live
SYNSENT
ESTAB
SYN RCVD
client state
LISTEN
server state
LISTEN
TCP Reliable TransportConnection Management TCP 3-way handshake
closed
L
listen
SYNrcvd
SYNsent
ESTAB
Socket clientSocket = newSocket(hostnameport
number)
SYN(seq=x)
Socket connectionSocket = welcomeSocketaccept()
SYN(x)
SYNACK(seq=yACKnum=x+1)create new socket for communication back to client
SYNACK(seq=yACKnum=x+1)
ACK(ACKnum=y+1)ACK(ACKnum=y+1)
L
TCP Reliable TransportConnection Management TCP 3-way handshake
client server each close their side of connection send TCP segment with FIN bit = 1
respond to received FIN with ACK on receiving FIN ACK can be combined with
own FINsimultaneous FIN exchanges can be
handled
TCP Reliable TransportConnection Management Closing connection
FIN_WAIT_2
CLOSE_WAIT
FINbit=1 seq=y
ACKbit=1 ACKnum=y+1
ACKbit=1 ACKnum=x+1 wait for server
close
can stillsend data
can no longersend data
LAST_ACK
CLOSED
TIMED_WAIT
timed wait for 2max
segment lifetime
CLOSED
FIN_WAIT_1 FINbit=1 seq=xcan no longersend but can receive data
clientSocketclose()
client state server state
ESTABESTAB
TCP Reliable TransportConnection Management Closing connection
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
data rcvd from appcreate segment with seq
seq is byte-stream
number of first data byte in segment
start timer if not already running think of timer as for
oldest unacked segment expiration interval
TimeOutInterval
timeoutretransmit segment that
caused timeoutrestart timer ack rcvdif ack acknowledges
previously unacked segments update what is known to
be ACKed start timer if there are
still unacked segments
TCP Reliable Transport
lost ACK scenario
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8 bytes of data
Xtim
eout
ACK=100
premature timeout
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8bytes of data
tim
eout
ACK=120
Seq=100 20 bytes of data
ACK=120
SendBase=100
SendBase=120
SendBase=120
SendBase=92
TCP Reliable TransportTCP Retransmission Scenerios
X
cumulative ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=120 15 bytes of data
tim
eout
Seq=100 20 bytes of data
ACK=120
TCP Reliable TransportTCP Retransmission Scenerios
event at receiver
arrival of in-order segment withexpected seq All data up toexpected seq already ACKed
arrival of in-order segment withexpected seq One other segment has ACK pending
arrival of out-of-order segmenthigher-than-expect seq Gap detected
arrival of segment that partially or completely fills gap
TCP receiver action
delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK
immediately send single cumulative ACK ACKing both in-order segments
immediately send duplicate ACK indicating seq of next expected byte
immediate send ACK provided thatsegment starts at lower end of gap
TCP ACK generation [RFC 1122 2581]Reliable Transport
time-out period often relatively long long delay before
resending lost packetdetect lost segments via
duplicate ACKs sender often sends many
segments back-to-back if segment is lost there
will likely be many duplicate ACKs
if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq
likely that unacked segment lost so donrsquot wait for timeout
TCP fast retransmit
(ldquotriple duplicate ACKsrdquo)
TCP Reliable TransportTCP Fast Retransmit
X
fast retransmit after sender receipt of triple duplicate ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
tim
eout
ACK=100
ACK=100
ACK=100
Seq=100 20 bytes of data
Seq=100 20 bytes of data
TCP Reliable TransportTCP Fast Retransmit
Q how to set TCP timeout value
longer than RTT but RTT varies
too short premature timeout unnecessary retransmissions
too long slow reaction to segment loss
Q how to estimate RTT
bull SampleRTT measured time from segment transmission until ACK receipt
ndash ignore retransmissions
bull SampleRTT will vary want estimated RTT ldquosmootherrdquo
ndash average several recent measurements not just current SampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
RTT gaiacsumassedu to fantasiaeurecomfr
100
150
200
250
300
350
1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106
time (seconnds)
RTT
(mill
isec
onds
)
SampleRTT Estimated RTT
EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases
exponentially fast typical value = 0125
RTT (
mill
iseco
nds)
RTT gaiacsumassedu to fantasiaeurecomfr
sampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
time (seconds)
bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin
bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|
(typically = 025)
TimeoutInterval = EstimatedRTT + 4DevRTT
estimated RTT ldquosafety marginrdquo
TCP Reliable TransportTCP Roundtrip time and timeouts
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
applicationprocess
TCP socketreceiver buffers
TCPcode
IPcode
application
OS
receiver protocol stack
application may remove data from
TCP socket buffers hellip
hellip slower than TCP receiver is delivering(sender is sending)
from sender
receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast
flow control
TCP Reliable TransportFlow Control
buffered data
free buffer spacerwnd
RcvBuffer
TCP segment payloads
to application processbull receiver ldquoadvertisesrdquo free
buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via
socket options (typical default is 4096 bytes)
ndash many operating systems autoadjust RcvBuffer
bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value
bull guarantees receive buffer will not overflow
receiver-side buffering
TCP Reliable TransportFlow Control
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
congestionbull informally ldquotoo many sources sending too
much data too fast for network to handlerdquobull different from flow controlbull manifestations
ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control
end-end congestion control
no explicit feedback from network
congestion inferred from end-system observed loss delay
approach taken by TCP
network-assisted congestion control
routers provide feedback to end systemssingle bit indicating
congestion (SNA DECbit TCPIP ECN ATM)
explicit rate for sender to send at
Principles of Congestion Control
fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK
TCP connection 1
bottleneckroutercapacity RTCP connection 2
TCP Congestion ControlTCP Fairness
approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1
MSS every RTT until loss detected multiplicative decrease cut cwnd in half
after loss
cwnd
TC
P s
end
er
cong
estio
n w
indo
w s
ize
AIMD saw toothbehavior probing
for bandwidth
additively increase window size helliphellip until loss occurs (then cut window in half)
time
TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease
sender limits transmission
cwnd is dynamic function of perceived network congestion
TCP sending rateroughly send cwnd
bytes wait RTT for ACKS then send more bytes
last byteACKed sent not-
yet ACKed(ldquoin-flightrdquo)
last byte sent
cwnd
LastByteSent-LastByteAcked
lt cwnd
sender sequence number space
rate ~~cwnd
RTTbytessec
TCP Congestion Control
two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally
R
R
equal bandwidth share
Connection 1 throughput
Con
nect
ion
2 th
roug
h pu t
congestion avoidance additive increaseloss decrease window by factor of 2
congestion avoidance additive increaseloss decrease window by factor of 2
TCP Congestion ControlTCP Fairness Why is TCP Fair
Fairness and UDPmultimedia apps
often do not use TCP do not want rate
throttled by congestion control
instead use UDP send audiovideo at
constant rate tolerate packet loss
Fairness parallel TCP connections
application can open multiple parallel connections between two hosts
web browsers do this eg link of rate R with 9
existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2
TCP Congestion ControlTCP Fairness
when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd
for every ACK receivedsummary initial rate is
slow but ramps up exponentially fast
Host A
one segment
RT
T
Host B
time
two segments
four segments
TCP Congestion ControlSlow Start
TCP Congestion Control
loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly
loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
Detecting and Reacting to Loss
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++
duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++
duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP Congestion Control
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
bull Timeouts are expensive(msecs to recover after loss)
TCP Loss recovery comparison
Sender Receiver
12345
Ack 1
Ack 1Ack 1Ack 1
Retransmit 2
Seq
Ack 5
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
Timeout driven recovery is slow (ms)
Data-driven recovery issuper fast (us) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control
instantiation implementation in the Internet UDP TCP
Next timebull Network Layerbull leaving the network
ldquoedgerdquo (application transport layers)
bull into the network ldquocorerdquo
Perspective
Before Next time
bull Project Proposalndash due in one weekndash Meet with groups TA and professor
bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday
bull No required reading and review duebull But review chapter 4 from the book Network Layer
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over "long fat pipes"
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
closed
Λ
listen
SYN rcvd
SYN sent
ESTAB
Socket clientSocket = new Socket(hostname, port number);
SYN(seq=x)
Socket connectionSocket = welcomeSocket.accept();
SYN(x)
SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
SYNACK(seq=y, ACKnum=x+1)
ACK(ACKnum=y+1)
Λ
TCP Reliable Transport: Connection Management, TCP 3-way handshake
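The handshake's sequence/ack arithmetic on this slide can be checked with a tiny sketch (illustrative only; `x` and `y` stand for the arbitrary initial sequence numbers each side picks):

```python
def three_way_handshake(x, y):
    """Return the three setup segments as (flags, seq, acknum) tuples."""
    syn    = ("SYN", x, None)        # client -> server, choose ISN x
    synack = ("SYNACK", y, x + 1)    # server ACKs client's ISN + 1, chooses y
    ack    = ("ACK", x + 1, y + 1)   # client ACKs server's ISN + 1
    return [syn, synack, ack]

segments = three_way_handshake(x=1000, y=5000)
```

Each side acknowledges the other's initial sequence number plus one, which is why the SYN itself consumes one sequence number.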
client, server each close their side of connection: send TCP segment with FIN bit = 1
respond to received FIN with ACK; on receiving FIN, ACK can be combined with own FIN
simultaneous FIN exchanges can be handled
TCP Reliable Transport: Connection Management, Closing connection
FIN_WAIT_2
CLOSE_WAIT
FINbit=1, seq=y
ACKbit=1, ACKnum=y+1
ACKbit=1, ACKnum=x+1; wait for server close
can still send data
can no longer send data
LAST_ACK
CLOSED
TIMED_WAIT
timed wait for 2 × max segment lifetime
CLOSED
FIN_WAIT_1: FINbit=1, seq=x; can no longer send but can receive data
clientSocket.close()
client state / server state
ESTAB / ESTAB
TCP Reliable Transport: Connection Management, Closing connection
full duplex data: bi-directional data flow in same connection
MSS: maximum segment size
connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
flow controlled: sender will not overwhelm receiver
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
TCP (Transmission Control Protocol): RFCs 793, 1122, 1323, 2018, 2581
TCP Reliable Transport
data rcvd from app: create segment with seq #
seq # is byte-stream number of first data byte in segment
start timer if not already running (think of timer as for oldest unacked segment); expiration interval: TimeOutInterval
timeout: retransmit segment that caused timeout; restart timer
ack rcvd: if ack acknowledges previously unacked segments, update what is known to be ACKed; start timer if there are still unacked segments
TCP Reliable Transport
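The three sender events above (data from app, timeout, ACK received) can be sketched as a toy sender. This is a simplification of the slide's logic, with the timer reduced to a boolean and segments left unbuffered:

```python
class ToySender:
    """Sketch of TCP sender bookkeeping: NextSeqNum / SendBase / one timer."""

    def __init__(self):
        self.next_seq = 0          # byte-stream number of next byte to send
        self.send_base = 0         # oldest unacked byte
        self.timer_running = False

    def data_from_app(self, nbytes):
        seg = (self.next_seq, nbytes)   # segment carries seq # of first byte
        self.next_seq += nbytes
        self.timer_running = True       # timer covers oldest unacked segment
        return seg

    def ack_received(self, acknum):
        if acknum > self.send_base:     # ACKs previously unacked data
            self.send_base = acknum
            # keep timer only while unacked data remains
            self.timer_running = self.send_base < self.next_seq

s = ToySender()
s.data_from_app(8)    # Seq=0, 8 bytes
s.data_from_app(20)   # Seq=8, 20 bytes
s.ack_received(8)     # first segment ACKed; timer restarts for the second
```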
lost ACK scenario
[Diagram: Host A sends Seq=92, 8 bytes of data; Host B's ACK=100 is lost (X); A times out and retransmits Seq=92, 8 bytes; B answers ACK=100 again. SendBase=92, then 100.]
premature timeout
[Diagram: Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; B returns ACK=100 and ACK=120, but A's timer expires first, so A retransmits Seq=92, 8 bytes; the cumulative ACK=120 then advances SendBase to 120.]
TCP Reliable Transport: TCP Retransmission Scenarios
cumulative ACK
[Diagram: Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; ACK=100 is lost (X), but ACK=120 arrives before the timeout and cumulatively acknowledges both segments, so nothing is retransmitted; A continues with Seq=120, 15 bytes of data.]
TCP Reliable Transport: TCP Retransmission Scenarios
TCP ACK generation [RFC 1122, RFC 2581]: Reliable Transport

event at receiver | TCP receiver action
arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed | delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
arrival of in-order segment with expected seq #; one other segment has ACK pending | immediately send single cumulative ACK, ACKing both in-order segments
arrival of out-of-order segment with higher-than-expected seq #; gap detected | immediately send duplicate ACK, indicating seq # of next expected byte
arrival of segment that partially or completely fills gap | immediately send ACK, provided that segment starts at lower end of gap
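The RFC 1122/2581 table above can be phrased as a small decision function. A sketch only: a real receiver tracks its reassembly queue rather than the two flags assumed here.

```python
def ack_action(seq, expected, ack_pending):
    """Map an arriving segment to the receiver action from the RFC 1122/2581 table.

    seq         -- first byte number of the arriving segment
    expected    -- next in-order byte the receiver is waiting for
    ack_pending -- True if an in-order segment is still awaiting its ACK
    """
    if seq == expected and not ack_pending:
        return "delay ACK up to 500ms"      # in-order, nothing pending
    if seq == expected and ack_pending:
        return "send one cumulative ACK"    # covers both in-order segments
    if seq > expected:
        return "send duplicate ACK"         # gap detected
    return "send ACK immediately"           # segment fills (part of) a gap

actions = [ack_action(100, 100, False),
           ack_action(120, 120, True),
           ack_action(200, 120, False)]
```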
time-out period often relatively long: long delay before resending lost packet
detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if a segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
likely that the unacked segment was lost, so don't wait for timeout
TCP Reliable Transport: TCP Fast Retransmit
fast retransmit after sender receipt of triple duplicate ACK
[Diagram: Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; the second segment is lost (X); Host B sends ACK=100 four times, and on the third duplicate A retransmits Seq=100, 20 bytes without waiting for the timeout.]
TCP Reliable Transport: TCP Fast Retransmit
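Counting duplicate ACKs, as in the diagram, is easy to sketch. The helper below is hypothetical; a real Reno sender would also enter fast recovery at this point:

```python
def find_fast_retransmit(acks):
    """Return the ack number that accumulates a triple duplicate (4 ACKs total),
    i.e. the seq # of the segment to resend immediately; None otherwise."""
    counts = {}
    for a in acks:
        counts[a] = counts.get(a, 0) + 1
        if counts[a] == 4:      # original ACK + 3 duplicates
            return a            # resend unacked segment starting at a
    return None

# Receiver keeps ACKing 100 because the segment starting at 100 was lost.
trigger = find_fast_retransmit([100, 100, 100, 100])
```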
Q: how to set TCP timeout value?
longer than RTT, but RTT varies
too short: premature timeout, unnecessary retransmissions
too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
– ignore retransmissions
• SampleRTT will vary; want estimated RTT "smoother"
– average several recent measurements, not just current SampleRTT
TCP Reliable Transport: TCP Roundtrip time and timeouts
[Plot: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; RTT (milliseconds, 100-350) vs. time (seconds, 1-106); SampleRTT and EstimatedRTT traces]
EstimatedRTT = (1 - α)·EstimatedRTT + α·SampleRTT
exponentially weighted moving average; influence of past sample decreases exponentially fast; typical value: α = 0.125
[Plot: SampleRTT, RTT (milliseconds) vs. time (seconds), gaia.cs.umass.edu to fantasia.eurecom.fr]
TCP Reliable Transport: TCP Roundtrip time and timeouts
• timeout interval: EstimatedRTT plus "safety margin"
– large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
DevRTT = (1 - β)·DevRTT + β·|SampleRTT - EstimatedRTT|
(typically, β = 0.25)
TimeoutInterval = EstimatedRTT + 4·DevRTT
(estimated RTT plus "safety margin")
TCP Reliable Transport: TCP Roundtrip time and timeouts
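The two EWMA formulas combine into a timeout estimator; the sketch below is a direct transcription of the slide's equations (RFC 6298-style smoothing) with the stated α = 0.125 and β = 0.25:

```python
def update_timeout(est_rtt, dev_rtt, sample, alpha=0.125, beta=0.25):
    """One SampleRTT update: returns (EstimatedRTT, DevRTT, TimeoutInterval)."""
    est_rtt = (1 - alpha) * est_rtt + alpha * sample
    dev_rtt = (1 - beta) * dev_rtt + beta * abs(sample - est_rtt)
    return est_rtt, dev_rtt, est_rtt + 4 * dev_rtt

# One 120 ms sample against an estimate of 100 ms (deviation estimate 5 ms).
est, dev, timeout = update_timeout(est_rtt=100.0, dev_rtt=5.0, sample=120.0)
```

A single outlier moves EstimatedRTT only 1/8 of the way, but inflates DevRTT and hence the safety margin, which is the intended asymmetry.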
full duplex data: bi-directional data flow in same connection
MSS: maximum segment size
connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
flow controlled: sender will not overwhelm receiver
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
TCP (Transmission Control Protocol): RFCs 793, 1122, 1323, 2018, 2581
TCP Reliable Transport
[Diagram: receiver protocol stack; the application process reads from TCP socket receiver buffers, which sit above the TCP code and IP code in the OS; data arrives from the sender]
application may remove data from TCP socket buffers …
… slower than TCP receiver is delivering (sender is sending)
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
TCP Reliable Transport: Flow Control
receiver-side buffering
[Diagram: RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads arrive from the sender and are read out to the application process]
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
– RcvBuffer size set via socket options (typical default is 4096 bytes)
– many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
TCP Reliable Transport: Flow Control
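Flow control reduces to one sender-side check; a minimal sketch of the rwnd limit described above:

```python
def can_send(last_byte_sent, last_byte_acked, nbytes, rwnd):
    """True if sending nbytes more keeps in-flight data within receiver's rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd

# 3000 bytes already in flight toward a 4096-byte advertised window:
ok      = can_send(last_byte_sent=3000, last_byte_acked=0, nbytes=1000, rwnd=4096)
blocked = can_send(last_byte_sent=3000, last_byte_acked=0, nbytes=1500, rwnd=4096)
```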
Goals for Today
• Transport Layer
– Abstraction / services
– Multiplexing/Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport
• Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
– Congestion control
• Data Center TCP
– Incast Problem
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
– lost packets (buffer overflow at routers)
– long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control:
end-end congestion control:
– no explicit feedback from network
– congestion inferred from end-system observed loss, delay
– approach taken by TCP
network-assisted congestion control:
– routers provide feedback to end systems
– single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
– explicit rate for sender to send at
Principles of Congestion Control
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
TCP connection 1
bottleneck router, capacity R
TCP connection 2
TCP Congestion Control: TCP Fairness
approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
additive increase: increase cwnd by 1 MSS every RTT until loss detected
multiplicative decrease: cut cwnd in half after loss
[Diagram: TCP sender congestion window size (cwnd) vs. time; AIMD sawtooth behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half)]
TCP Congestion Control: TCP Fairness; Why is TCP Fair? AIMD: additive increase, multiplicative decrease
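The sawtooth can be generated directly from the AIMD rule. A sketch, with loss modeled simply as cwnd exceeding a fixed capacity (a real path's loss point varies):

```python
def aimd(rtts, capacity=16, cwnd=1):
    """Trace cwnd (in MSS) per RTT: +1 each RTT, halved when capacity exceeded."""
    trace = []
    for _ in range(rtts):
        trace.append(cwnd)
        cwnd += 1                     # additive increase: +1 MSS per RTT
        if cwnd > capacity:           # loss detected
            cwnd = max(1, cwnd // 2)  # multiplicative decrease: cut in half
    return trace

trace = aimd(rtts=20)   # climbs 1..16, drops to 8, climbs again
```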
sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
cwnd is dynamic, a function of perceived network congestion
TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes
rate ≈ cwnd / RTT bytes/sec
[Diagram: sender sequence number space; last byte ACKed, sent not-yet-ACKed ("in-flight"), last byte sent]
TCP Congestion Control
two competing sessions: additive increase gives slope of 1 as throughput increases; multiplicative decrease decreases throughput proportionally
[Diagram: Connection 2 throughput vs. Connection 1 throughput, both bounded by R, with the equal bandwidth share line; repeated rounds of congestion avoidance (additive increase) and loss (decrease window by factor of 2) move the allocation toward the equal-share line]
TCP Congestion Control: TCP Fairness; Why is TCP Fair?
Fairness and UDP: multimedia apps often do not use TCP; they do not want their rate throttled by congestion control
instead use UDP: send audio/video at constant rate, tolerate packet loss
Fairness and parallel TCP connections: application can open multiple parallel connections between two hosts; web browsers do this
e.g., link of rate R with 9 existing connections: new app asks for 1 TCP, gets rate R/10; new app asks for 11 TCPs, gets R/2
TCP Congestion Control: TCP Fairness
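The slide's arithmetic, made explicit under the idealized assumption that every connection gets an equal share of the link:

```python
def app_share(rate, existing, app_conns):
    """Link rate an app receives with app_conns of (existing + app_conns) flows."""
    return rate * app_conns / (existing + app_conns)

r10 = app_share(rate=1.0, existing=9, app_conns=1)    # 1 of 10 flows -> R/10
r2  = app_share(rate=1.0, existing=9, app_conns=11)   # 11 of 20 flows -> ~R/2
```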
when connection begins, increase rate exponentially until first loss event: initially cwnd = 1 MSS; double cwnd every RTT; done by incrementing cwnd for every ACK received
summary: initial rate is slow, but ramps up exponentially fast
[Diagram: Host A sends one segment, waits one RTT for Host B's ACK, then sends two segments, then four segments]
TCP Congestion Control: Slow Start
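Incrementing cwnd by 1 MSS per ACK is exactly what makes the window double each RTT; a minimal sketch assuming every segment is ACKed:

```python
def slow_start(rtts, cwnd=1):
    """cwnd (in MSS) at the start of each RTT during slow start."""
    trace = []
    for _ in range(rtts):
        trace.append(cwnd)
        cwnd += cwnd   # one ACK arrives per in-flight segment; +1 MSS per ACK
    return trace

trace = slow_start(rtts=5)   # 1, 2, 4, 8, 16
```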
TCP Congestion Control
loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
loss indicated by 3 duplicate ACKs (TCP Reno): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)
Detecting and Reacting to Loss
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation: variable ssthresh
on loss event, ssthresh is set to 1/2 of cwnd just before the loss event
TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
[FSM diagram: TCP congestion control states and transitions]
slow start:
– entry (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0
– new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
– duplicate ACK: dupACKcount++
– cwnd > ssthresh (Λ): enter congestion avoidance
congestion avoidance:
– new ACK: cwnd = cwnd + MSS·(MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
– duplicate ACK: dupACKcount++
– dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery
fast recovery:
– duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
– new ACK: cwnd = ssthresh; dupACKcount = 0; return to congestion avoidance
any state, timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start
TCP Congestion Control
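The two loss reactions differ only in what happens to cwnd and where growth resumes. A compressed sketch of the Reno/Tahoe rules above (fast recovery's temporary +3 inflation is omitted):

```python
def on_loss(cwnd, kind, flavor="reno"):
    """Return (ssthresh, new_cwnd) after a loss event, in MSS units.

    kind   -- "timeout" or "dupacks" (3 duplicate ACKs)
    flavor -- "reno" halves on dup ACKs; "tahoe" always restarts at 1
    """
    ssthresh = max(cwnd // 2, 1)
    if kind == "timeout" or flavor == "tahoe":
        return ssthresh, 1          # back to slow start
    return ssthresh, ssthresh       # Reno: cut in half, then grow linearly

reno_dup  = on_loss(40, "dupacks")             # network still delivering
reno_to   = on_loss(40, "timeout")             # severe congestion signal
tahoe_dup = on_loss(40, "dupacks", "tahoe")    # Tahoe treats both the same
```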
• avg TCP thruput as function of window size, RTT
– ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
– avg window size (# in-flight bytes) is 3/4 W
– avg thruput is 3/4 W per RTT
avg TCP thruput = (3/4)·W / RTT bytes/sec
[Diagram: sawtooth oscillating between W/2 and W]
TCP Throughput
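Plugging numbers into the 3/4·W/RTT approximation (values chosen for illustration):

```python
def avg_tcp_throughput(w_bytes, rtt_s):
    """Average TCP throughput (bytes/sec) for a sawtooth between W/2 and W."""
    return 0.75 * w_bytes / rtt_s

# W = 80,000 bytes at RTT = 100 ms:
thruput = avg_tcp_throughput(w_bytes=80_000, rtt_s=0.1)   # 600,000 bytes/sec
```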
TCP over "long fat pipes"
• example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
TCP throughput = 1.22·MSS / (RTT·√L)
→ to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10; a very small loss rate!
• new versions of TCP for high-speed
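The slide's numbers check out under the Mathis et al. formula; a quick verification sketch (working in bits/sec):

```python
from math import sqrt

def mathis_throughput(mss_bytes, rtt_s, loss):
    """Throughput bound 1.22 * MSS / (RTT * sqrt(L)), in bits/sec."""
    return 1.22 * mss_bytes * 8 / (rtt_s * sqrt(loss))

def loss_for_rate(mss_bytes, rtt_s, rate_bps):
    """Invert the formula: loss probability needed to sustain rate_bps."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2

# 1500-byte segments, 100 ms RTT, target 10 Gbps:
L = loss_for_rate(mss_bytes=1500, rtt_s=0.1, rate_bps=10e9)   # ~2e-10
```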
Goals for Today
• Transport Layer
– Abstraction / services
– Multiplexing/Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport
• Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
– Congestion control
• Data Center TCP
– Incast Problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008.
TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer
– SRU size is fixed
• TCP used as the data transfer protocol
Cluster-based Storage Systems
[Diagram: a client connects through a switch to storage servers; a synchronized read issues requests (R) for a data block striped into Server Request Units (SRUs) 1-4; once all of 1, 2, 3, 4 arrive, the client sends the next batch of requests]
Link idle time due to timeouts
[Diagram: in a synchronized read of SRUs 1-4 through the switch, server 4's response is lost; the link is idle until that server experiences a timeout]
TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts
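Why timeouts collapse goodput: one 200 ms minimum RTO dwarfs the millisecond-scale transfer it stalls. A back-of-the-envelope sketch; the block size, link rate, and one-timeout-per-block assumption are illustrative, not figures from the paper:

```python
def goodput_fraction(transfer_s, rto_s, timeouts_per_transfer=1):
    """Link utilization when each data-block transfer stalls on RTOs."""
    return transfer_s / (transfer_s + timeouts_per_transfer * rto_s)

# A 256 KB block at 1 Gbps takes ~2 ms; a single 200 ms min-RTO stalls it.
transfer = 256 * 1024 * 8 / 1e9          # seconds on the wire
util = goodput_fraction(transfer, rto_s=0.200)   # ~1% utilization
```

This is why the options on the next slide (smaller RTOmin, bigger buffers, Ethernet flow control, DCTCP) all attack either the timeout itself or the loss that triggers it.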
client server each close their side of connection send TCP segment with FIN bit = 1
respond to received FIN with ACK on receiving FIN ACK can be combined with
own FINsimultaneous FIN exchanges can be
handled
TCP Reliable TransportConnection Management Closing connection
FIN_WAIT_2
CLOSE_WAIT
FINbit=1 seq=y
ACKbit=1 ACKnum=y+1
ACKbit=1 ACKnum=x+1 wait for server
close
can stillsend data
can no longersend data
LAST_ACK
CLOSED
TIMED_WAIT
timed wait for 2max
segment lifetime
CLOSED
FIN_WAIT_1 FINbit=1 seq=xcan no longersend but can receive data
clientSocketclose()
client state server state
ESTABESTAB
TCP Reliable TransportConnection Management Closing connection
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
data rcvd from appcreate segment with seq
seq is byte-stream
number of first data byte in segment
start timer if not already running think of timer as for
oldest unacked segment expiration interval
TimeOutInterval
timeoutretransmit segment that
caused timeoutrestart timer ack rcvdif ack acknowledges
previously unacked segments update what is known to
be ACKed start timer if there are
still unacked segments
TCP Reliable Transport
lost ACK scenario
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8 bytes of data
Xtim
eout
ACK=100
premature timeout
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8bytes of data
tim
eout
ACK=120
Seq=100 20 bytes of data
ACK=120
SendBase=100
SendBase=120
SendBase=120
SendBase=92
TCP Reliable TransportTCP Retransmission Scenerios
X
cumulative ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=120 15 bytes of data
tim
eout
Seq=100 20 bytes of data
ACK=120
TCP Reliable TransportTCP Retransmission Scenerios
event at receiver
arrival of in-order segment withexpected seq All data up toexpected seq already ACKed
arrival of in-order segment withexpected seq One other segment has ACK pending
arrival of out-of-order segmenthigher-than-expect seq Gap detected
arrival of segment that partially or completely fills gap
TCP receiver action
delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK
immediately send single cumulative ACK ACKing both in-order segments
immediately send duplicate ACK indicating seq of next expected byte
immediate send ACK provided thatsegment starts at lower end of gap
TCP ACK generation [RFC 1122 2581]Reliable Transport
time-out period often relatively long long delay before
resending lost packetdetect lost segments via
duplicate ACKs sender often sends many
segments back-to-back if segment is lost there
will likely be many duplicate ACKs
if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq
likely that unacked segment lost so donrsquot wait for timeout
TCP fast retransmit
(ldquotriple duplicate ACKsrdquo)
TCP Reliable TransportTCP Fast Retransmit
X
fast retransmit after sender receipt of triple duplicate ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
tim
eout
ACK=100
ACK=100
ACK=100
Seq=100 20 bytes of data
Seq=100 20 bytes of data
TCP Reliable TransportTCP Fast Retransmit
Q how to set TCP timeout value
longer than RTT but RTT varies
too short premature timeout unnecessary retransmissions
too long slow reaction to segment loss
Q how to estimate RTT
bull SampleRTT measured time from segment transmission until ACK receipt
ndash ignore retransmissions
bull SampleRTT will vary want estimated RTT ldquosmootherrdquo
ndash average several recent measurements not just current SampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
RTT gaiacsumassedu to fantasiaeurecomfr
100
150
200
250
300
350
1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106
time (seconnds)
RTT
(mill
isec
onds
)
SampleRTT Estimated RTT
EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases
exponentially fast typical value = 0125
RTT (
mill
iseco
nds)
RTT gaiacsumassedu to fantasiaeurecomfr
sampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
time (seconds)
bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin
bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|
(typically = 025)
TimeoutInterval = EstimatedRTT + 4DevRTT
estimated RTT ldquosafety marginrdquo
TCP Reliable TransportTCP Roundtrip time and timeouts
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
applicationprocess
TCP socketreceiver buffers
TCPcode
IPcode
application
OS
receiver protocol stack
application may remove data from
TCP socket buffers hellip
hellip slower than TCP receiver is delivering(sender is sending)
from sender
receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast
flow control
TCP Reliable TransportFlow Control
buffered data
free buffer spacerwnd
RcvBuffer
TCP segment payloads
to application processbull receiver ldquoadvertisesrdquo free
buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via
socket options (typical default is 4096 bytes)
ndash many operating systems autoadjust RcvBuffer
bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value
bull guarantees receive buffer will not overflow
receiver-side buffering
TCP Reliable TransportFlow Control
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
congestionbull informally ldquotoo many sources sending too
much data too fast for network to handlerdquobull different from flow controlbull manifestations
ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control
end-end congestion control
no explicit feedback from network
congestion inferred from end-system observed loss delay
approach taken by TCP
network-assisted congestion control
routers provide feedback to end systemssingle bit indicating
congestion (SNA DECbit TCPIP ECN ATM)
explicit rate for sender to send at
Principles of Congestion Control
fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK
TCP connection 1
bottleneckroutercapacity RTCP connection 2
TCP Congestion ControlTCP Fairness
approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1
MSS every RTT until loss detected multiplicative decrease cut cwnd in half
after loss
cwnd
TC
P s
end
er
cong
estio
n w
indo
w s
ize
AIMD saw toothbehavior probing
for bandwidth
additively increase window size helliphellip until loss occurs (then cut window in half)
time
TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease
sender limits transmission
cwnd is dynamic function of perceived network congestion
TCP sending rateroughly send cwnd
bytes wait RTT for ACKS then send more bytes
last byteACKed sent not-
yet ACKed(ldquoin-flightrdquo)
last byte sent
cwnd
LastByteSent-LastByteAcked
lt cwnd
sender sequence number space
rate ~~cwnd
RTTbytessec
TCP Congestion Control
two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally
R
R
equal bandwidth share
Connection 1 throughput
Con
nect
ion
2 th
roug
h pu t
congestion avoidance additive increaseloss decrease window by factor of 2
congestion avoidance additive increaseloss decrease window by factor of 2
TCP Congestion ControlTCP Fairness Why is TCP Fair
Fairness and UDPmultimedia apps
often do not use TCP do not want rate
throttled by congestion control
instead use UDP send audiovideo at
constant rate tolerate packet loss
Fairness parallel TCP connections
application can open multiple parallel connections between two hosts
web browsers do this eg link of rate R with 9
existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2
TCP Congestion ControlTCP Fairness
when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd
for every ACK receivedsummary initial rate is
slow but ramps up exponentially fast
Host A
one segment
RT
T
Host B
time
two segments
four segments
TCP Congestion ControlSlow Start
TCP Congestion Control
loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly
loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
Detecting and Reacting to Loss
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++
duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++
duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP Congestion Control
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts
[Figure: goodput versus number of servers, showing the collapse.]
TCP data-driven loss recovery
[Figure: sender transmits Seq 1–5; packet 2 is lost. The receiver returns Ack 1 for segments 3, 4, and 5 (3 duplicate ACKs for 1: packet 2 is probably lost), so the sender retransmits packet 2 immediately and then receives Ack 5. In SANs, recovery happens in usecs after loss.]
TCP timeout-driven loss recovery
[Figure: sender transmits Seq 1–5, but losses leave too few duplicate ACKs for data-driven recovery; the sender stalls for the Retransmission Timeout (RTO), retransmits, and only then receives Ack 1.]
• Timeouts are expensive (msecs to recover after loss)
TCP Loss recovery comparison
[Figure: side-by-side sender/receiver timelines. Left: data-driven recovery (triple duplicate Ack 1, retransmit 2, then Ack 5) is super fast (us) in SANs. Right: timeout-driven recovery must wait out the Retransmission Timeout (RTO) and is slow (ms).]
TCP Throughput Collapse: Summary
• Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP)
  – Limit in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications
Perspective
• principles behind transport layer services: multiplexing, demultiplexing, reliable data transfer, flow control, congestion control
• instantiation, implementation in the Internet: UDP, TCP
Next time
• Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"
Before Next time
• Project Proposal
  – due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
• But, review chapter 4 from the book, Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer Services/Protocols
- Transport Layer Services/Protocols (2)
- Transport Layer Services/Protocols (3)
- Transport Layer Services/Protocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over "long fat pipes"
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
[Figure: closing a TCP connection. Client and server both start in ESTAB. The client calls clientSocket.close(), sends FIN (FINbit=1, seq=x), and enters FIN_WAIT_1 (can no longer send, but can receive data). The server ACKs (ACKbit=1, ACKnum=x+1) and enters CLOSE_WAIT (can still send data), while the client waits in FIN_WAIT_2. The server then sends its own FIN (FINbit=1, seq=y) and enters LAST_ACK (can no longer send data). The client ACKs (ACKbit=1, ACKnum=y+1) and enters TIMED_WAIT, where it does a timed wait for 2× the max segment lifetime before both sides reach CLOSED.]
TCP Reliable Transport: Connection Management, Closing connection
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver
TCP: Transmission Control Protocol, RFCs 793, 1122, 1323, 2018, 2581
TCP Reliable Transport
TCP sender events:
data rcvd from app:
• create segment with seq #; seq # is byte-stream number of first data byte in segment
• start timer if not already running (think of timer as for oldest unacked segment); expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments: update what is known to be ACKed; start timer if there are still unacked segments
TCP Reliable Transport
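The three sender events above can be sketched as a toy class. This is a simplified illustration, not real TCP: `SimpleTCPSender` and its method names are hypothetical, and actual timer scheduling is reduced to a boolean.

```python
# Toy TCP sender: one timer for the oldest unacked segment, cumulative ACKs.
class SimpleTCPSender:
    def __init__(self):
        self.next_seq = 0            # byte-stream number of next byte to send
        self.send_base = 0           # oldest unacked byte
        self.unacked = {}            # seq -> segment payload
        self.timer_running = False

    def on_data_from_app(self, data):
        seq = self.next_seq          # seq # = byte-stream # of first data byte
        self.unacked[seq] = data
        self.next_seq += len(data)
        if not self.timer_running:   # timer covers the oldest unacked segment
            self.timer_running = True
        return (seq, data)           # segment to transmit

    def on_timeout(self):
        # retransmit the segment that caused the timeout, restart timer
        self.timer_running = True
        return (self.send_base, self.unacked[self.send_base])

    def on_ack(self, acknum):
        # cumulative ACK: acknum is the next byte the receiver expects
        if acknum > self.send_base:
            for seq in [s for s in self.unacked if s < acknum]:
                del self.unacked[seq]
            self.send_base = acknum
            self.timer_running = bool(self.unacked)  # only if data still unacked
```

For example, after sending two segments and receiving `on_ack(8)`, only the second segment remains unacked and the timer keeps running.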
[Figure: two retransmission timelines between Host A and Host B. Lost ACK scenario: A sends Seq=92, 8 bytes of data; B's ACK=100 is lost (X); A times out and retransmits Seq=92, 8 bytes; B ACKs 100 again and SendBase advances from 92 to 100. Premature timeout scenario: A sends Seq=92, 8 bytes and Seq=100, 20 bytes; the first ACK=100 is delayed past the timeout, so A retransmits Seq=92, 8 bytes; the cumulative ACK=120 then arrives and SendBase jumps to 120.]
TCP Reliable Transport: TCP Retransmission Scenarios
[Figure: cumulative ACK scenario: Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; ACK=100 is lost (X), but the cumulative ACK=120 arrives before the timeout, so A does not retransmit and proceeds with Seq=120, 15 bytes.]
TCP Reliable Transport: TCP Retransmission Scenarios
TCP ACK generation [RFC 1122, RFC 2581]

event at receiver → TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap
TCP Reliable Transport
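The table above can be exercised with a toy receiver. `ToyReceiver` is a hypothetical sketch (not from the slides) that tracks only the next expected byte and buffered out-of-order segments, and returns a label naming the action taken.

```python
# Toy receiver implementing the RFC 1122/2581 ACK-generation table.
class ToyReceiver:
    def __init__(self):
        self.expected = 0      # next expected byte number
        self.buffered = {}     # out-of-order segments: seq -> length
        self.delayed = False   # an in-order segment awaits a delayed ACK

    def on_segment(self, seq, length):
        if seq > self.expected:                  # out-of-order: gap detected
            self.buffered[seq] = length
            return f"duplicate ACK {self.expected}"
        if seq < self.expected:                  # old / retransmitted segment
            return f"duplicate ACK {self.expected}"
        # in-order segment with the expected seq #
        self.expected += length
        filled = False
        while self.expected in self.buffered:    # may close a gap
            self.expected += self.buffered.pop(self.expected)
            filled = True
        if self.delayed:                         # one ACK already pending
            self.delayed = False
            return f"immediate cumulative ACK {self.expected}"
        if filled or self.buffered:              # segment filled (part of) a gap
            return f"immediate ACK {self.expected}"
        self.delayed = True                      # wait up to 500 ms
        return "delayed ACK"
```

Feeding segments (0,8), (8,8), (24,8), (16,8) walks through all four rows of the table in order.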
• time-out period often relatively long: long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout
TCP Reliable Transport: TCP Fast Retransmit
[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; the second segment is lost (X). Host B returns ACK=100 four times; on receipt of the triple duplicate ACK, A retransmits Seq=100, 20 bytes before the timeout expires.]
TCP Reliable Transport: TCP Fast Retransmit
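The duplicate-ACK bookkeeping above fits in a few lines. A minimal sketch (the function name and inputs are hypothetical): given the stream of cumulative ACK numbers, trigger an early retransmission of the smallest unacked seq # on the third duplicate.

```python
# Fast retransmit check: three duplicate ACKs trigger an early resend.
def fast_retransmit_check(acks, send_base):
    """Return the seq # to retransmit early, or None.
    acks: cumulative ACK numbers received, in order.
    send_base: smallest unacked byte (the segment to resend)."""
    dup_count = 0
    last_ack = None
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == 3:          # "triple duplicate ACKs"
                return send_base        # resend unacked segment, smallest seq #
        else:
            last_ack, dup_count = ack, 0
    return None

# Receiver keeps ACKing 100 because the segment at byte 100 is lost:
print(fast_retransmit_check([100, 100, 100, 100], send_base=100))  # -> 100
```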
Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT
TCP Reliable Transport: TCP Round-trip time and timeouts
[Figure: RTT (milliseconds), roughly 100–350 ms, versus time (seconds) for gaia.cs.umass.edu to fantasia.eurecom.fr, plotting the noisy SampleRTT against the smoother EstimatedRTT.]
EstimatedRTT = (1 − α) · EstimatedRTT + α · SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[Figure: the same gaia.cs.umass.edu to fantasia.eurecom.fr trace, RTT (milliseconds) versus time (seconds), with EstimatedRTT overlaid on SampleRTT.]
TCP Reliable Transport: TCP Round-trip time and timeouts
• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  DevRTT = (1 − β) · DevRTT + β · |SampleRTT − EstimatedRTT|
  (typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4 · DevRTT
(estimated RTT plus "safety margin")
TCP Reliable Transport: TCP Round-trip time and timeouts
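The two estimators and the timeout rule above combine into a few lines. A sketch (not from the slides), assuming the typical α = 0.125 and β = 0.25:

```python
# EWMA RTT estimation and timeout computation, as on the slides.
def update_rtt(estimated_rtt, dev_rtt, sample_rtt, alpha=0.125, beta=0.25):
    estimated_rtt = (1 - alpha) * estimated_rtt + alpha * sample_rtt
    dev_rtt = (1 - beta) * dev_rtt + beta * abs(sample_rtt - estimated_rtt)
    timeout_interval = estimated_rtt + 4 * dev_rtt   # estimate + safety margin
    return estimated_rtt, dev_rtt, timeout_interval

est, dev = 100.0, 5.0                  # milliseconds
for sample in [110.0, 95.0, 160.0]:    # a late outlier widens the margin
    est, dev, timeout = update_rtt(est, dev, sample)
    print(f"sample={sample:6.1f}  EstimatedRTT={est:6.2f}  Timeout={timeout:6.2f}")
```

Note how a single large SampleRTT moves EstimatedRTT only slightly but inflates DevRTT, and hence the safety margin, much more.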
[Figure: receiver protocol stack. Segments from the sender pass through the IP code and TCP code into the TCP socket receiver buffers in the OS; the application process may remove data from the TCP socket buffers slower than the TCP receiver is delivering (i.e., slower than the sender is sending).]
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
TCP Reliable Transport: Flow Control
[Figure: receiver-side buffering. RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads arrive from the sender while data drains to the application process.]
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
TCP Reliable Transport: Flow Control
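The sender-side rule above amounts to one inequality: in-flight bytes must not exceed the advertised rwnd. A minimal sketch (the function name is hypothetical):

```python
# Flow control at the sender: cap unacked ("in-flight") data at rwnd.
def sendable_bytes(last_byte_sent, last_byte_acked, rwnd):
    """New bytes the sender may transmit without overflowing the
    receiver's advertised window."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)

print(sendable_bytes(last_byte_sent=5000, last_byte_acked=3000, rwnd=4096))  # 2096
print(sendable_bytes(last_byte_sent=8000, last_byte_acked=3000, rwnd=4096))  # 0
```

In the second call the sender already has more in-flight data than the window allows, so it must wait for ACKs before sending anything new.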
Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
• Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at
Principles of Congestion Control
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]
TCP Congestion Control: TCP Fairness
approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss
[Figure: TCP sender congestion window size (cwnd) over time, the AIMD "saw tooth" behavior: probing for bandwidth, additively increase window size … until loss occurs (then cut window in half).]
TCP Congestion Control: TCP Fairness. Why is TCP Fair? AIMD: additive increase, multiplicative decrease
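The saw tooth can be reproduced with a toy simulation. This is a sketch, not the slides' code: the bottleneck `capacity` is an assumed parameter here (a real sender never knows it; loss tells it implicitly), and cwnd is counted in segments.

```python
# AIMD saw tooth: +1 MSS per RTT, halve cwnd on loss.
MSS = 1          # count cwnd in segments for simplicity
capacity = 16    # assumed bottleneck: loss occurs above 16 segments per RTT

cwnd, trace = 8, []
for rtt in range(20):
    trace.append(cwnd)
    if cwnd > capacity:
        cwnd = cwnd // 2      # multiplicative decrease after loss
    else:
        cwnd += MSS           # additive increase: +1 MSS every RTT
print(trace)                  # saw tooth oscillating around the capacity
```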
sender limits transmission:
  LastByteSent − LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
[Figure: sender sequence number space: last byte ACKed, then bytes sent but not-yet ACKed ("in-flight", at most cwnd), then last byte sent.]
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  rate ≈ cwnd / RTT bytes/sec
TCP Congestion Control
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput versus Connection 1 throughput, both axes from 0 to R. From an unequal starting point, congestion avoidance (additive increase) moves along slope 1, and each loss (decrease window by factor of 2) halves both throughputs, so the trajectory converges toward the equal bandwidth share line.]
TCP Congestion Control: TCP Fairness. Why is TCP Fair?
Fairness and UDP:
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
TCP Congestion Control: TCP Fairness
when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received
summary: initial rate is slow, but ramps up exponentially fast
[Figure: Host A sends one segment to Host B, waits one RTT for its ACK, then sends two segments, then four segments.]
TCP Congestion Control: Slow Start
• loss indicated by timeout:
  – cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)
TCP Congestion Control: Detecting and Reacting to Loss
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
[Figure: TCP congestion control FSM with states slow start, congestion avoidance, and fast recovery.]
initialization (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0
slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• cwnd > ssthresh (Λ): enter congestion avoidance
• duplicate ACK: dupACKcount++
congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
from slow start or congestion avoidance, dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery
fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; enter congestion avoidance
from any state, timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start
TCP Congestion Control
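The FSM above can be sketched as a small class. This is a simplified illustration (it tracks only window state and omits actual retransmission and segment transmission), and the class name is hypothetical; cwnd and ssthresh are counted in MSS units.

```python
# Toy TCP-Reno-style FSM: slow start, congestion avoidance, fast recovery.
class CongestionControl:
    def __init__(self, mss=1, ssthresh=64):
        self.mss, self.cwnd, self.ssthresh = mss, mss, ssthresh
        self.state = "slow_start"
        self.dup_acks = 0

    def on_new_ack(self):
        self.dup_acks = 0
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh            # deflate window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += self.mss                # doubles cwnd each RTT
            if self.cwnd >= self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += self.mss * self.mss / self.cwnd   # ~+1 MSS per RTT

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += self.mss                # window inflation
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                   # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast_recovery"         # (retransmit missing segment)

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss                     # back to 1 MSS
        self.dup_acks = 0
        self.state = "slow_start"                # (retransmit missing segment)
```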
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
data rcvd from appcreate segment with seq
seq is byte-stream
number of first data byte in segment
start timer if not already running think of timer as for
oldest unacked segment expiration interval
TimeOutInterval
timeoutretransmit segment that
caused timeoutrestart timer ack rcvdif ack acknowledges
previously unacked segments update what is known to
be ACKed start timer if there are
still unacked segments
TCP Reliable Transport
lost ACK scenario
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8 bytes of data
Xtim
eout
ACK=100
premature timeout
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8bytes of data
tim
eout
ACK=120
Seq=100 20 bytes of data
ACK=120
SendBase=100
SendBase=120
SendBase=120
SendBase=92
TCP Reliable TransportTCP Retransmission Scenerios
X
cumulative ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=120 15 bytes of data
tim
eout
Seq=100 20 bytes of data
ACK=120
TCP Reliable TransportTCP Retransmission Scenerios
event at receiver
arrival of in-order segment withexpected seq All data up toexpected seq already ACKed
arrival of in-order segment withexpected seq One other segment has ACK pending
arrival of out-of-order segmenthigher-than-expect seq Gap detected
arrival of segment that partially or completely fills gap
TCP receiver action
delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK
immediately send single cumulative ACK ACKing both in-order segments
immediately send duplicate ACK indicating seq of next expected byte
immediate send ACK provided thatsegment starts at lower end of gap
TCP ACK generation [RFC 1122 2581]Reliable Transport
time-out period often relatively long long delay before
resending lost packetdetect lost segments via
duplicate ACKs sender often sends many
segments back-to-back if segment is lost there
will likely be many duplicate ACKs
if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq
likely that unacked segment lost so donrsquot wait for timeout
TCP fast retransmit
(ldquotriple duplicate ACKsrdquo)
TCP Reliable TransportTCP Fast Retransmit
X
fast retransmit after sender receipt of triple duplicate ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
tim
eout
ACK=100
ACK=100
ACK=100
Seq=100 20 bytes of data
Seq=100 20 bytes of data
TCP Reliable TransportTCP Fast Retransmit
Q how to set TCP timeout value
longer than RTT but RTT varies
too short premature timeout unnecessary retransmissions
too long slow reaction to segment loss
Q how to estimate RTT
bull SampleRTT measured time from segment transmission until ACK receipt
ndash ignore retransmissions
bull SampleRTT will vary want estimated RTT ldquosmootherrdquo
ndash average several recent measurements not just current SampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
RTT gaiacsumassedu to fantasiaeurecomfr
100
150
200
250
300
350
1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106
time (seconnds)
RTT
(mill
isec
onds
)
SampleRTT Estimated RTT
EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases
exponentially fast typical value = 0125
RTT (
mill
iseco
nds)
RTT gaiacsumassedu to fantasiaeurecomfr
sampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
time (seconds)
bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin
bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|
(typically = 025)
TimeoutInterval = EstimatedRTT + 4DevRTT
estimated RTT ldquosafety marginrdquo
TCP Reliable TransportTCP Roundtrip time and timeouts
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
applicationprocess
TCP socketreceiver buffers
TCPcode
IPcode
application
OS
receiver protocol stack
application may remove data from
TCP socket buffers hellip
hellip slower than TCP receiver is delivering(sender is sending)
from sender
receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast
flow control
TCP Reliable TransportFlow Control
buffered data
free buffer spacerwnd
RcvBuffer
TCP segment payloads
to application processbull receiver ldquoadvertisesrdquo free
buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via
socket options (typical default is 4096 bytes)
ndash many operating systems autoadjust RcvBuffer
bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value
bull guarantees receive buffer will not overflow
receiver-side buffering
TCP Reliable TransportFlow Control
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
congestionbull informally ldquotoo many sources sending too
much data too fast for network to handlerdquobull different from flow controlbull manifestations
ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control
end-end congestion control
no explicit feedback from network
congestion inferred from end-system observed loss delay
approach taken by TCP
network-assisted congestion control
routers provide feedback to end systemssingle bit indicating
congestion (SNA DECbit TCPIP ECN ATM)
explicit rate for sender to send at
Principles of Congestion Control
fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK
TCP connection 1
bottleneckroutercapacity RTCP connection 2
TCP Congestion ControlTCP Fairness
approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1
MSS every RTT until loss detected multiplicative decrease cut cwnd in half
after loss
cwnd
TC
P s
end
er
cong
estio
n w
indo
w s
ize
AIMD saw toothbehavior probing
for bandwidth
additively increase window size helliphellip until loss occurs (then cut window in half)
time
TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease
sender limits transmission
cwnd is dynamic function of perceived network congestion
TCP sending rateroughly send cwnd
bytes wait RTT for ACKS then send more bytes
last byteACKed sent not-
yet ACKed(ldquoin-flightrdquo)
last byte sent
cwnd
LastByteSent-LastByteAcked
lt cwnd
sender sequence number space
rate ~~cwnd
RTTbytessec
TCP Congestion Control
two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally
R
R
equal bandwidth share
Connection 1 throughput
Con
nect
ion
2 th
roug
h pu t
congestion avoidance additive increaseloss decrease window by factor of 2
congestion avoidance additive increaseloss decrease window by factor of 2
TCP Congestion ControlTCP Fairness Why is TCP Fair
Fairness and UDPmultimedia apps
often do not use TCP do not want rate
throttled by congestion control
instead use UDP send audiovideo at
constant rate tolerate packet loss
Fairness parallel TCP connections
application can open multiple parallel connections between two hosts
web browsers do this eg link of rate R with 9
existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2
TCP Congestion ControlTCP Fairness
when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd
for every ACK receivedsummary initial rate is
slow but ramps up exponentially fast
Host A
one segment
RT
T
Host B
time
two segments
four segments
TCP Congestion ControlSlow Start
TCP Congestion Control
loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly
loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
Detecting and Reacting to Loss
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++
duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++
duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP Congestion Control
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer
  – SRU size is fixed
• TCP used as the data transfer protocol
Cluster-based Storage Systems
[Diagram: a client, behind a switch, issues a synchronized read to storage servers 1–4; each server returns its Server Request Unit (SRU) of the data block, and the client sends the next batch of requests only once SRUs 1–4 have all arrived.]
Link idle time due to timeouts
[Diagram: the same synchronized read, but server 4's response is lost; the client's link is idle until server 4 experiences a timeout and retransmits its SRU.]
TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts
[Graph: goodput collapses as the number of servers increases.]
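A back-of-the-envelope model illustrates why a single timeout collapses goodput for a synchronized read. All constants below (1 Gbps link, 256 KB SRU, 200 ms RTOmin) are illustrative assumptions, not values from the paper:

```python
LINK = 1e9 / 8          # assumed 1 Gbps link, in bytes/sec
SRU = 256 * 1024        # assumed SRU size per server, in bytes
RTO_MIN = 0.200         # common default minimum retransmission timeout, sec

def goodput(n_servers, timeout=False):
    """Block transfer rate of a synchronized read over n servers.
    If any server hits a timeout, the whole block stalls for RTO_MIN,
    because the client cannot request the next block until all SRUs arrive."""
    block = n_servers * SRU
    elapsed = block / LINK + (RTO_MIN if timeout else 0.0)
    return block / elapsed

print(goodput(4) / 1e6)                # no loss: full link rate
print(goodput(4, timeout=True) / 1e6)  # one timeout: goodput collapses ~25x
```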
TCP data-driven loss recovery
[Diagram: sender transmits segments 1–5; packet 2 is lost. Each later arrival triggers another Ack 1, so the sender sees 3 duplicate ACKs for 1 (packet 2 is probably lost), retransmits packet 2 immediately, and then receives Ack 5.]
In SANs: recovery in µsecs after loss.
TCP timeout-driven loss recovery
[Diagram: sender transmits segments 1–5, but not enough duplicate ACKs come back; the sender must wait a full Retransmission Timeout (RTO) before retransmitting 1 and receiving Ack 1.]
• Timeouts are expensive (msecs to recover after loss)
TCP: Loss recovery comparison
[Diagrams, side by side: data-driven recovery – segments 1–5 sent, duplicate Ack 1s arrive, the sender retransmits 2 and receives Ack 5; timeout-driven recovery – the sender idles for a full Retransmission Timeout (RTO) before retransmitting.]
Timeout-driven recovery is slow (ms); data-driven recovery is super fast (µs) in SANs.
TCP Throughput Collapse: Summary
• Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP)
  – limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications
Perspective
• principles behind transport layer services: multiplexing/demultiplexing, reliable data transfer, flow control, congestion control
• instantiation, implementation in the Internet: UDP, TCP

Next time:
• Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"
Before Next time
• Project Proposal
  – due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single-threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
• But review chapter 4 from the book, Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer Services/Protocols
- Transport Layer Services/Protocols (2)
- Transport Layer Services/Protocols (3)
- Transport Layer Services/Protocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over "long fat pipes"
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
data rcvd from app:
• create segment with seq #; seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments
TCP Reliable Transport
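The three sender events above can be sketched as a tiny state machine. Class and method names are our own, and real TCP senders are far more involved:

```python
class TcpSender:
    """Toy model of the TCP sender events: send, timeout, ACK arrival."""

    def __init__(self, timeout_interval):
        self.next_seq = 0                 # byte-stream number of next byte
        self.send_base = 0                # oldest unACKed byte
        self.unacked = {}                 # seq -> segment data
        self.timer_running = False        # timer covers oldest unACKed segment
        self.timeout_interval = timeout_interval

    def send(self, data):
        # seq # is the byte-stream number of the segment's first byte
        self.unacked[self.next_seq] = data
        self.next_seq += len(data)
        if not self.timer_running:        # start timer if not already running
            self.timer_running = True

    def on_timeout(self):
        # retransmit the segment that caused the timeout, restart timer
        self.timer_running = True
        return self.unacked[self.send_base]

    def on_ack(self, ack):
        if ack > self.send_base:          # ACKs previously unacked segments
            for seq in [s for s in self.unacked if s < ack]:
                del self.unacked[seq]
            self.send_base = ack
            # restart timer only if segments are still unacked
            self.timer_running = bool(self.unacked)

s = TcpSender(timeout_interval=1.0)
s.send(b"A" * 8)       # seq 0..7
s.send(b"B" * 8)       # seq 8..15
s.on_ack(8)            # first segment ACKed; timer now covers seq 8
print(s.send_base, s.timer_running)  # -> 8 True
```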
TCP Reliable Transport: TCP Retransmission Scenarios
[Diagram 1 – lost ACK: Host A sends Seq=92, 8 bytes of data (SendBase=92); Host B's ACK=100 is lost; A times out, retransmits Seq=92, 8 bytes, and B ACKs 100 again (SendBase=100).]
[Diagram 2 – premature timeout: A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); the ACKs are delayed, so A times out and retransmits Seq=92; B's cumulative ACK=120 then advances SendBase to 120.]
[Diagram 3 – cumulative ACK: A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); ACK=100 is lost, but cumulative ACK=120 arrives before the timeout, so nothing is retransmitted and A continues with Seq=120, 15 bytes of data.]
TCP Reliable Transport: TCP Retransmission Scenarios
TCP ACK generation [RFC 1122, RFC 2581] – Reliable Transport

event at receiver                                        TCP receiver action
-------------------------------------------------------  --------------------------------------------------
arrival of in-order segment with expected seq #;         delayed ACK: wait up to 500 ms for next segment;
all data up to expected seq # already ACKed              if no next segment, send ACK
arrival of in-order segment with expected seq #;         immediately send single cumulative ACK,
one other segment has ACK pending                        ACKing both in-order segments
arrival of out-of-order segment with higher-than-        immediately send duplicate ACK, indicating
expected seq #; gap detected                             seq # of next expected byte
arrival of segment that partially or completely          immediately send ACK, provided that segment
fills gap                                                starts at lower end of gap
TCP Reliable Transport: TCP Fast Retransmit
• time-out period often relatively long: long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if a segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that the unacked segment was lost, so don't wait for timeout
[Diagram: Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); Seq=100 is lost. Four ACK=100s arrive; upon the triple duplicate ACK, A retransmits Seq=100, 20 bytes of data, before its timeout fires.]
TCP Reliable Transport: TCP Fast Retransmit – fast retransmit after sender receipt of triple duplicate ACK
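The triple-duplicate-ACK rule can be sketched in a few lines (a toy illustration, not a real TCP implementation):

```python
def fast_retransmit_events(acks):
    """Yield seq numbers retransmitted due to triple duplicate ACKs:
    on the 3rd duplicate, resend the smallest unACKed segment (= the ACK #)."""
    last_ack, dup_count = None, 0
    resent = []
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == 3:          # triple duplicate ACK
                resent.append(ack)      # resend segment starting at `ack`
        else:
            last_ack, dup_count = ack, 0
    return resent

# four ACK=100s = one original + three duplicates -> retransmit seq 100
print(fast_retransmit_events([100, 100, 100, 100, 120]))  # -> [100]
```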
TCP Reliable Transport: TCP Round-trip time and timeouts
Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary; want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT
[Plot: RTT samples from gaia.cs.umass.edu to fantasia.eurecom.fr over time (seconds); SampleRTT fluctuates between roughly 100 and 350 ms while EstimatedRTT tracks it smoothly.]

EstimatedRTT = (1 – α)·EstimatedRTT + α·SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
TCP Reliable Transport: TCP Round-trip time and timeouts
• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  DevRTT = (1 – β)·DevRTT + β·|SampleRTT – EstimatedRTT|
  (typically β = 0.25)

TimeoutInterval = EstimatedRTT + 4·DevRTT
                  (estimated RTT)  ("safety margin")
TCP Reliable Transport: TCP Round-trip time and timeouts
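The two moving averages combine into the timeout computation; a small sketch using the slide's α = 0.125 and β = 0.25 (the seed values and samples are arbitrary):

```python
ALPHA, BETA = 0.125, 0.25   # typical EWMA gains from the slides

def update(est_rtt, dev_rtt, sample_rtt):
    """Fold one SampleRTT into EstimatedRTT/DevRTT; return the new timeout."""
    # deviation uses the *old* estimate, as in RFC 6298
    dev_rtt = (1 - BETA) * dev_rtt + BETA * abs(sample_rtt - est_rtt)
    est_rtt = (1 - ALPHA) * est_rtt + ALPHA * sample_rtt
    timeout = est_rtt + 4 * dev_rtt        # estimate + safety margin
    return est_rtt, dev_rtt, timeout

est, dev = 0.100, 0.010            # seed: 100 ms estimate, 10 ms deviation
for s in (0.120, 0.150, 0.110):    # three SampleRTT measurements (seconds)
    est, dev, rto = update(est, dev, s)
print(round(rto, 4))               # -> 0.1739
```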
TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver
TCP Reliable Transport
TCP Reliable Transport: Flow Control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
[Diagram: receiver protocol stack – IP code delivers segments from the sender into the TCP socket receiver buffers; the application process may remove data from the buffers slower than the TCP receiver is delivering (i.e., slower than the sender is sending).]
TCP Reliable Transport: Flow Control
[Diagram: receiver-side buffering – RcvBuffer holds buffered data plus free buffer space of size rwnd; TCP segment payloads arrive from the network while data drains to the application process.]
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
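The sender-side check is just arithmetic on the advertised window; a sketch with illustrative names:

```python
def sendable_bytes(last_byte_sent, last_byte_acked, rwnd):
    """How much new data the sender may put in flight right now:
    in-flight (unACKed) bytes must stay within the advertised rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)

# receiver advertised rwnd = 4096; 3000 bytes already in flight -> 1096 sendable
print(sendable_bytes(last_byte_sent=10_000, last_byte_acked=7_000, rwnd=4096))
```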
Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem
Principles of Congestion Control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at
TCP Congestion Control: TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Diagram: TCP connections 1 and 2 share a bottleneck router of capacity R.]
TCP Congestion Control: TCP Fairness – Why is TCP Fair? AIMD: additive increase, multiplicative decrease
approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss
[Graph: cwnd (TCP sender congestion window size) vs. time – the AIMD "saw tooth" behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half).]
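Why AIMD converges to a fair share can be seen in a toy simulation of two flows at a shared bottleneck (the capacity and starting rates are made-up numbers; losses are modeled as synchronized, per-RTT):

```python
C = 100.0   # bottleneck capacity, in MSS per RTT (assumed)

def aimd_two_flows(x1, x2, rounds=200):
    """Two AIMD flows: +1 MSS/RTT each while under capacity; both halve on loss.
    Each halving shrinks the gap between the flows, so rates converge."""
    for _ in range(rounds):
        if x1 + x2 >= C:            # loss: multiplicative decrease for both
            x1, x2 = x1 / 2, x2 / 2
        else:                       # additive increase for both
            x1, x2 = x1 + 1, x2 + 1
    return x1, x2

x1, x2 = aimd_two_flows(80.0, 10.0)
print(round(x1 - x2, 6))            # gap shrinks toward 0 -> equal share
```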
TCP Congestion Control
sender limits transmission: LastByteSent – LastByteAcked ≤ cwnd
• cwnd is dynamic, a function of perceived network congestion
TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes
  rate ≈ cwnd / RTT  bytes/sec
[Diagram: sender sequence number space – last byte ACKed; bytes sent, not-yet ACKed ("in-flight"); last byte sent.]
TCP Congestion Control: TCP Fairness – Why is TCP Fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Graph: Connection 1 throughput vs. Connection 2 throughput, both bounded by R; repeated cycles of congestion avoidance (additive increase) and loss (window cut by factor of 2) move the operating point toward the equal-bandwidth-share line.]
TCP Congestion Control: TCP Fairness
Fairness and UDP:
• multimedia apps often do not use TCP: do not want rate throttled by congestion control
• instead use UDP: send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
TCP Congestion Control: Slow Start
when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received
summary: initial rate is slow but ramps up exponentially fast
[Diagram: Host A sends one segment to Host B, waits one RTT, then two segments, then four segments.]
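The "increment per ACK ⇒ double per RTT" step can be sketched directly (a toy model in MSS units):

```python
def slow_start(rtts, mss=1):
    """cwnd += 1 MSS per ACK; every segment in a round is ACKed,
    so cwnd doubles each RTT."""
    cwnd = 1 * mss                 # initially 1 MSS
    history = [cwnd]
    for _ in range(rtts):
        acks = cwnd // mss         # one ACK per segment sent this RTT
        cwnd += acks * mss         # +1 MSS per ACK  ->  cwnd doubles
        history.append(cwnd)
    return history

print(slow_start(4))   # -> [1, 2, 4, 8, 16]
```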
TCP Congestion Control: Detecting and Reacting to Loss
• loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs (TCP Reno): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
[FSM: TCP congestion control]
slow start (init Λ: cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0):
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: → congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
congestion avoidance:
• new ACK: cwnd = cwnd + MSS·(MSS/cwnd); dupACKcount = 0; transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment → fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start
fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s) as allowed
• New ACK: cwnd = ssthresh; dupACKcount = 0 → congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start
TCP Congestion Control
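A per-RTT toy trace ties the states together (MSS units; a coarse simplification of the per-ACK FSM, with made-up loss timing):

```python
def reno_trace(loss_rtts, rtts=12, ssthresh=8):
    """Per-RTT cwnd trace, Reno-style: slow start doubles up to ssthresh,
    congestion avoidance adds 1 MSS/RTT, triple-dup-ACK loss halves cwnd."""
    cwnd, trace = 1, []
    for t in range(rtts):
        trace.append(cwnd)
        if t in loss_rtts:                 # 3 dup ACKs: multiplicative decrease
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh
        elif cwnd < ssthresh:              # slow start: double per RTT
            cwnd = min(2 * cwnd, ssthresh)
        else:                              # congestion avoidance: +1 MSS per RTT
            cwnd += 1
    return trace

# exponential growth, linear growth, then a halving after the loss at RTT 8
print(reno_trace(loss_rtts={8}))
```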
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
-
TCP Reliable Transport: TCP Retransmission Scenarios
[Figure: cumulative ACK scenario. Host A sends Seq=92 (8 bytes of data) and Seq=100 (20 bytes of data); ACK=100 is lost (X), but the cumulative ACK=120 arrives before Host A's timeout, so Host A continues with Seq=120 (15 bytes of data) without retransmitting.]
Reliable Transport: TCP ACK generation [RFC 1122, RFC 2581]

Event at receiver -> TCP receiver action:
- arrival of in-order segment with expected seq#, all data up to expected seq# already ACKed -> delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
- arrival of in-order segment with expected seq#, one other segment has ACK pending -> immediately send single cumulative ACK, ACKing both in-order segments
- arrival of out-of-order segment with higher-than-expected seq# (gap detected) -> immediately send duplicate ACK, indicating seq# of next expected byte
- arrival of segment that partially or completely fills gap -> immediately send ACK, provided that segment starts at lower end of gap
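As a sanity check on the table above, the two in-order rules and the duplicate-ACK rule can be sketched as a tiny receiver model (hypothetical code: `receive` and its segment tuples are illustrative, and timers and gap-filling are ignored):

```python
# Simplified receiver: apply the delayed-ACK / cumulative-ACK / duplicate-ACK
# rules to a stream of segments given as (start_byte, length) tuples.
def receive(segments):
    expected, ack_pending, acks_sent = 0, False, []
    for start, length in segments:
        if start == expected and not ack_pending:
            expected += length
            ack_pending = True            # delayed ACK: wait for one more segment
        elif start == expected and ack_pending:
            expected += length
            acks_sent.append(expected)    # cumulative ACK covers both segments
            ack_pending = False
        else:
            acks_sent.append(expected)    # gap detected: duplicate ACK
    return acks_sent

# Two in-order segments trigger one cumulative ACK; a gap triggers a dup ACK:
assert receive([(0, 100), (100, 100)]) == [200]
assert receive([(0, 100), (100, 100), (300, 100)]) == [200, 200]
```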
TCP Reliable Transport: TCP Fast Retransmit
- time-out period often relatively long: long delay before resending a lost packet
- detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if a segment is lost, there will likely be many duplicate ACKs
- TCP fast retransmit: if sender receives 3 ACKs for the same data ("triple duplicate ACKs"), resend unACKed segment with smallest seq#
  - likely that the unACKed segment was lost, so don't wait for the timeout
TCP Reliable Transport: TCP Fast Retransmit
[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92 (8 bytes of data) and Seq=100 (20 bytes of data); the Seq=100 segment is lost (X). Host B returns ACK=100 repeatedly; on the triple duplicate ACK, Host A retransmits Seq=100 (20 bytes of data) before its timeout expires.]
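The sender side of fast retransmit reduces to counting duplicate ACKs; a minimal sketch (hypothetical `fast_retransmit_sender` helper, not a real stack's API):

```python
# Count duplicate ACKs; on the third duplicate, retransmit immediately
# instead of waiting for the retransmission timeout.
def fast_retransmit_sender(acks):
    """Yield the seq# retransmitted while processing a stream of ACK numbers."""
    last_ack, dup_count = None, 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == 3:            # triple duplicate ACK
                yield ack                 # resend unACKed segment with smallest seq#
        else:
            last_ack, dup_count = ack, 0  # a new ACK resets the counter

# Four ACK=100s (one original + three duplicates) trigger one retransmit:
assert list(fast_retransmit_sender([100, 100, 100, 100])) == [100]
```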
TCP Reliable Transport: TCP Round-trip Time and Timeouts
Q: how to set TCP timeout value?
- longer than RTT, but RTT varies
- too short: premature timeout, unnecessary retransmissions
- too long: slow reaction to segment loss
Q: how to estimate RTT?
- SampleRTT: measured time from segment transmission until ACK receipt (ignore retransmissions)
- SampleRTT will vary; want estimated RTT "smoother": average several recent measurements, not just the current SampleRTT
TCP Reliable Transport: TCP Round-trip Time and Timeouts
[Figure: SampleRTT and EstimatedRTT, gaia.cs.umass.edu to fantasia.eurecom.fr; RTT (milliseconds, 100-350) vs. time (seconds).]
EstimatedRTT = (1 - α)·EstimatedRTT + α·SampleRTT
- exponential weighted moving average
- influence of past sample decreases exponentially fast
- typical value: α = 0.125
TCP Reliable Transport: TCP Round-trip Time and Timeouts
- timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
- estimate SampleRTT deviation from EstimatedRTT:
  DevRTT = (1 - β)·DevRTT + β·|SampleRTT - EstimatedRTT|   (typically β = 0.25)
- TimeoutInterval = EstimatedRTT + 4·DevRTT   (estimated RTT plus "safety margin")
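The two EWMA updates and the timeout rule above can be run directly; a small sketch (the order of the DevRTT/EstimatedRTT updates follows the usual RFC 6298 convention, and the starting values are illustrative):

```python
ALPHA, BETA = 0.125, 0.25   # typical weights from the slides

def update_rtt(estimated, dev, sample):
    dev = (1 - BETA) * dev + BETA * abs(sample - estimated)  # EWMA of deviation
    estimated = (1 - ALPHA) * estimated + ALPHA * sample     # EWMA of RTT
    return estimated, dev, estimated + 4 * dev               # TimeoutInterval

est, dev = 100.0, 10.0            # ms, illustrative starting state
for s in (120.0, 95.0, 180.0):    # three SampleRTT measurements
    est, dev, timeout = update_rtt(est, dev, s)
print(round(est, 1), round(timeout, 1))   # the late 180 ms sample widens the safety margin
```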
TCP Reliable Transport
TCP (Transmission Control Protocol), RFCs 793, 1122, 1323, 2018, 2581:
- point-to-point: one sender, one receiver
- reliable, in-order byte stream: no "message boundaries"
- pipelined: TCP congestion and flow control set window size
- full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
- connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
- flow controlled: sender will not overwhelm receiver
TCP Reliable Transport: Flow Control
[Figure: receiver protocol stack: application process above TCP socket receiver buffers, TCP code, and IP code; data arrives from the sender into the socket buffers.]
- application may remove data from TCP socket buffers slower than the TCP receiver is delivering (i.e., slower than the sender is sending)
- flow control: receiver controls sender, so the sender won't overflow the receiver's buffer by transmitting too much, too fast
TCP Reliable Transport: Flow Control
[Figure: receiver-side buffering: RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads flow in, and data flows up to the application process.]
- receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
- sender limits amount of unACKed ("in-flight") data to receiver's rwnd value
- guarantees receive buffer will not overflow
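The "socket options" knob mentioned above is SO_RCVBUF; a minimal sketch (the OS may round or auto-tune the value; Linux, for example, reports back double the requested size):

```python
import socket

# Request a 64 KB receive buffer on a TCP socket; what the kernel actually
# grants (and later advertises as rwnd headroom) is OS-dependent.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 64 * 1024)
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))  # e.g. 131072 on Linux
s.close()
```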
Goals for Today
- Transport Layer
  - Abstraction / services
  - Multiplexing / Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport
    - Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  - Congestion control
- Data Center TCP
  - Incast Problem
Principles of Congestion Control
congestion:
- informally: "too many sources sending too much data too fast for the network to handle"
- different from flow control!
- manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control:
- end-end congestion control:
  - no explicit feedback from network
  - congestion inferred from end-system observed loss, delay
  - approach taken by TCP
- network-assisted congestion control:
  - routers provide feedback to end systems, either a single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM) or an explicit rate for the sender to send at
TCP Congestion Control: TCP Fairness
fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R.]
TCP Congestion Control: TCP Fairness. Why is TCP Fair?
AIMD: additive increase, multiplicative decrease
- approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
- additive increase: increase cwnd by 1 MSS every RTT until loss detected
- multiplicative decrease: cut cwnd in half after loss
[Figure: AIMD "sawtooth" behavior, probing for bandwidth: cwnd (TCP sender congestion window size) vs. time; additively increase window size until loss occurs, then cut window in half.]
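The sawtooth is easy to reproduce; a toy trace (hypothetical helper with cwnd in MSS units and a made-up loss pattern):

```python
# Additive increase (+1 MSS per RTT) with multiplicative decrease (halve on loss).
def aimd(rtts, loss_at, cwnd=10):
    trace = []
    for t in range(rtts):
        cwnd = cwnd / 2 if t in loss_at else cwnd + 1
        trace.append(cwnd)
    return trace

# Climb from 10, lose at RTT 4, halve, climb again: the AIMD sawtooth.
assert aimd(8, {4}) == [11, 12, 13, 14, 7.0, 8.0, 9.0, 10.0]
```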
TCP Congestion Control
- sender limits transmission: LastByteSent - LastByteAcked <= cwnd
- cwnd is a dynamic function of perceived network congestion
- TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes
  rate ≈ cwnd / RTT bytes/sec
[Figure: sender sequence number space: last byte ACKed; bytes sent but not-yet ACKed ("in-flight", at most cwnd); last byte sent.]
TCP Congestion Control: TCP Fairness. Why is TCP Fair?
two competing sessions:
- additive increase gives slope of 1 as throughput increases
- multiplicative decrease decreases throughput proportionally
[Figure: Connection 1 throughput vs. Connection 2 throughput, both axes up to R; alternating congestion avoidance (additive increase) and loss (decrease window by factor of 2) converges to the equal bandwidth share line.]
TCP Congestion Control: TCP Fairness
Fairness and UDP:
- multimedia apps often do not use TCP: do not want rate throttled by congestion control
- instead use UDP: send audio/video at constant rate, tolerate packet loss
Fairness and parallel TCP connections:
- application can open multiple parallel connections between two hosts; web browsers do this
- e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
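The R/10 vs. R/2 figures above are just per-connection fair sharing; a quick check (illustrative helper):

```python
# With n connections fairly sharing rate R, an app holding k of them gets k*R/n.
R = 1.0
def share(k, existing):
    return k * R / (existing + k)

assert abs(share(1, 9) - R / 10) < 1e-12      # 1 connection among 10: R/10
assert abs(share(11, 9) - 0.55 * R) < 1e-12   # 11 of 20 connections: 11R/20, about R/2
```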
TCP Congestion Control: Slow Start
- when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
- summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment, then two segments, then four segments to Host B, one RTT apart.]
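Doubling per RTT follows from the per-ACK increment: with cwnd segments in flight, cwnd ACKs arrive per RTT, each adding 1 MSS. A sketch (cwnd in MSS units):

```python
# Slow start: cwnd doubles each RTT because every ACK adds one MSS.
def slow_start(rtts, cwnd=1):
    trace = [cwnd]
    for _ in range(rtts):
        cwnd += cwnd          # cwnd ACKs per RTT, +1 MSS each
        trace.append(cwnd)
    return trace

assert slow_start(4) == [1, 2, 4, 8, 16]   # exponential ramp-up from 1 MSS
```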
TCP Congestion Control: Detecting and Reacting to Loss
- loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
- loss indicated by 3 duplicate ACKs (TCP Reno): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
- TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation: variable ssthresh; on loss event, ssthresh is set to 1/2 of cwnd just before the loss event
TCP Congestion Control
[Figure: TCP Reno FSM with states slow start, congestion avoidance, and fast recovery.]
- init: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0
- slow start, new ACK: cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed; when cwnd > ssthresh, enter congestion avoidance
- congestion avoidance, new ACK: cwnd = cwnd + MSS·(MSS/cwnd), dupACKcount = 0, transmit new segment(s) as allowed
- duplicate ACK (slow start or congestion avoidance): dupACKcount++; on dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment, enter fast recovery
- fast recovery, duplicate ACK: cwnd = cwnd + MSS, transmit new segment(s) as allowed
- fast recovery, new ACK: cwnd = ssthresh, dupACKcount = 0, return to congestion avoidance
- timeout (any state): ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment, enter slow start
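The transitions above can be collected into a small event-driven sketch (a hypothetical `Reno` class with cwnd counted in MSS units; real implementations track far more state):

```python
MSS = 1  # count cwnd in MSS units for readability

class Reno:
    """Toy TCP Reno FSM: slow start / congestion avoidance / fast recovery."""
    def __init__(self):
        self.state, self.cwnd, self.ssthresh, self.dups = "slow_start", 1 * MSS, 64, 0

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd, self.dups = self.ssthresh, 0
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                      # exponential growth
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += MSS * MSS / self.cwnd    # about +1 MSS per RTT

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS                      # window inflation
            return
        self.dups += 1
        if self.dups == 3:                        # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh, self.cwnd, self.dups = self.cwnd / 2, 1 * MSS, 0
        self.state = "slow_start"
```

For example, three new ACKs grow cwnd from 1 to 4; a triple duplicate ACK then sets ssthresh = 2 and enters fast recovery with cwnd = ssthresh + 3 = 5.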
TCP Throughput
- avg TCP throughput as function of window size, RTT
  - ignore slow start, assume always data to send
- W: window size (measured in bytes) where loss occurs
  - avg window size (# in-flight bytes) is 3/4 W
  - avg TCP throughput = (3/4) W / RTT bytes/sec
[Figure: sawtooth of window size oscillating between W/2 and W.]
TCP over "long, fat pipes"
- example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
- requires W = 83,333 in-flight segments
- throughput in terms of segment loss probability, L [Mathis 1997]:
  TCP throughput = (1.22 · MSS) / (RTT · √L)
- to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10 (a very small loss rate!)
- new versions of TCP needed for high-speed
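Both numbers above can be checked directly from the definitions (1500-byte MSS, 100 ms RTT, 10 Gbps target):

```python
from math import sqrt

mss_bits, rtt, target = 1500 * 8, 0.100, 10e9   # segment bits, RTT (s), target bps
W = target * rtt / mss_bits                     # in-flight segments needed
L = (1.22 * mss_bits / (rtt * target)) ** 2     # solve thruput = 1.22*MSS/(RTT*sqrt(L)) for L
assert abs(1.22 * mss_bits / (rtt * sqrt(L)) - target) < 1.0   # round-trip check
print(int(W), L)                                # ~83333 segments, L ~ 2.1e-10
```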
Goals for Today
- Transport Layer
  - Abstraction / services
  - Multiplexing / Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport
    - Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  - Congestion control
- Data Center TCP
  - Incast Problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems," A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008.
TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g., test on an Ethernet-based storage cluster:
- Client performs synchronized reads
- Increase # of servers involved in transfer
  - SRU size is fixed
- TCP used as the data transfer protocol
Cluster-based Storage Systems
[Figure: a client connected through a switch to storage servers 1-4. The client requests a data block with one Server Request Unit (SRU) request per server; the servers' responses form a synchronized read, and the client sends the next batch of requests only after all of 1, 2, 3, 4 arrive.]
Link idle time due to timeouts
[Figure: the same synchronized read, but server 4's response is lost; the link is idle until server 4 experiences a timeout and retransmits.]
TCP Throughput Collapse: Incast
- [Nagle04] called this Incast
- Cause of throughput collapse: TCP timeouts
[Figure: goodput collapses as the number of servers increases.]
TCP data-driven loss recovery
[Figure: sender transmits Seq 1-5; packet 2 is lost. The receiver returns Ack 1 four times; 3 duplicate ACKs for 1 mean packet 2 is probably lost, so the sender retransmits packet 2 immediately and the receiver then returns Ack 5. In SANs, recovery in usecs after loss.]
TCP timeout-driven loss recovery
[Figure: sender transmits Seq 1-5, but losses leave the receiver ACKing only packet 1; without enough duplicate ACKs, the sender must wait out the Retransmission Timeout (RTO) before retransmitting.]
- Timeouts are expensive (msecs to recover after loss)
TCP Loss recovery comparison
[Figure: side by side: data-driven recovery (three duplicate Ack 1s trigger an immediate retransmit of 2, answered by Ack 5) vs. timeout-driven recovery (the sender idles for the Retransmission Timeout (RTO) before resending).]
- Data-driven recovery is super fast (us) in SANs
- Timeout-driven recovery is slow (ms)
TCP Throughput Collapse: Summary
- Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
- Previously tried options:
  - Increase buffer size (costly)
  - Reduce RTOmin (unsafe)
  - Use Ethernet Flow Control (limited applicability)
- DCTCP (Data Center TCP): limit the in-network buffer (queue length) via both in-network signaling and end-to-end TCP modifications
Perspective
- principles behind transport layer services: multiplexing/demultiplexing, reliable data transfer, flow control, congestion control
- instantiation, implementation in the Internet: UDP, TCP
Next time:
- Network Layer: leaving the network "edge" (application, transport layers) and moving into the network "core"
Before Next time
- Project Proposal
  - due in one week
  - Meet with groups, TA, and professor
- Lab1
  - Single-threaded TCP proxy
  - Due in one week, next Friday
- No required reading and review due
  - But review chapter 4 from the book (Network Layer)
  - We will also briefly discuss data center topologies
- Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
event at receiver
arrival of in-order segment withexpected seq All data up toexpected seq already ACKed
arrival of in-order segment withexpected seq One other segment has ACK pending
arrival of out-of-order segmenthigher-than-expect seq Gap detected
arrival of segment that partially or completely fills gap
TCP receiver action
delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK
immediately send single cumulative ACK ACKing both in-order segments
immediately send duplicate ACK indicating seq of next expected byte
immediate send ACK provided thatsegment starts at lower end of gap
TCP ACK generation [RFC 1122 2581]Reliable Transport
time-out period often relatively long long delay before
resending lost packetdetect lost segments via
duplicate ACKs sender often sends many
segments back-to-back if segment is lost there
will likely be many duplicate ACKs
if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq
likely that unacked segment lost so donrsquot wait for timeout
TCP fast retransmit
(ldquotriple duplicate ACKsrdquo)
TCP Reliable TransportTCP Fast Retransmit
X
fast retransmit after sender receipt of triple duplicate ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
tim
eout
ACK=100
ACK=100
ACK=100
Seq=100 20 bytes of data
Seq=100 20 bytes of data
TCP Reliable TransportTCP Fast Retransmit
Q how to set TCP timeout value
longer than RTT but RTT varies
too short premature timeout unnecessary retransmissions
too long slow reaction to segment loss
Q how to estimate RTT
bull SampleRTT measured time from segment transmission until ACK receipt
ndash ignore retransmissions
bull SampleRTT will vary want estimated RTT ldquosmootherrdquo
ndash average several recent measurements not just current SampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
RTT gaiacsumassedu to fantasiaeurecomfr
100
150
200
250
300
350
1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106
time (seconnds)
RTT
(mill
isec
onds
)
SampleRTT Estimated RTT
EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases
exponentially fast typical value = 0125
RTT (
mill
iseco
nds)
RTT gaiacsumassedu to fantasiaeurecomfr
sampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
time (seconds)
bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin
bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|
(typically = 025)
TimeoutInterval = EstimatedRTT + 4DevRTT
estimated RTT ldquosafety marginrdquo
TCP Reliable TransportTCP Roundtrip time and timeouts
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
applicationprocess
TCP socketreceiver buffers
TCPcode
IPcode
application
OS
receiver protocol stack
application may remove data from
TCP socket buffers hellip
hellip slower than TCP receiver is delivering(sender is sending)
from sender
receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast
flow control
TCP Reliable TransportFlow Control
buffered data
free buffer spacerwnd
RcvBuffer
TCP segment payloads
to application processbull receiver ldquoadvertisesrdquo free
buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via
socket options (typical default is 4096 bytes)
ndash many operating systems autoadjust RcvBuffer
bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value
bull guarantees receive buffer will not overflow
receiver-side buffering
TCP Reliable TransportFlow Control
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
congestionbull informally ldquotoo many sources sending too
much data too fast for network to handlerdquobull different from flow controlbull manifestations
ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control
end-end congestion control
no explicit feedback from network
congestion inferred from end-system observed loss delay
approach taken by TCP
network-assisted congestion control
routers provide feedback to end systemssingle bit indicating
congestion (SNA DECbit TCPIP ECN ATM)
explicit rate for sender to send at
Principles of Congestion Control
fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK
TCP connection 1
bottleneckroutercapacity RTCP connection 2
TCP Congestion ControlTCP Fairness
approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1
MSS every RTT until loss detected multiplicative decrease cut cwnd in half
after loss
cwnd
TC
P s
end
er
cong
estio
n w
indo
w s
ize
AIMD saw toothbehavior probing
for bandwidth
additively increase window size helliphellip until loss occurs (then cut window in half)
time
TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease
sender limits transmission
cwnd is dynamic function of perceived network congestion
TCP sending rateroughly send cwnd
bytes wait RTT for ACKS then send more bytes
last byteACKed sent not-
yet ACKed(ldquoin-flightrdquo)
last byte sent
cwnd
LastByteSent-LastByteAcked
lt cwnd
sender sequence number space
rate ~~cwnd
RTTbytessec
TCP Congestion Control
two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally
R
R
equal bandwidth share
Connection 1 throughput
Con
nect
ion
2 th
roug
h pu t
congestion avoidance additive increaseloss decrease window by factor of 2
congestion avoidance additive increaseloss decrease window by factor of 2
TCP Congestion ControlTCP Fairness Why is TCP Fair
Fairness and UDPmultimedia apps
often do not use TCP do not want rate
throttled by congestion control
instead use UDP send audiovideo at
constant rate tolerate packet loss
Fairness parallel TCP connections
application can open multiple parallel connections between two hosts
web browsers do this eg link of rate R with 9
existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2
TCP Congestion ControlTCP Fairness
when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd
for every ACK receivedsummary initial rate is
slow but ramps up exponentially fast
Host A
one segment
RT
T
Host B
time
two segments
four segments
TCP Congestion ControlSlow Start
TCP Congestion Control
loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly
loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
Detecting and Reacting to Loss
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++
duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++
duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP Congestion Control
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
bull Timeouts are expensive(msecs to recover after loss)
TCP Loss recovery comparison
Sender Receiver
12345
Ack 1
Ack 1Ack 1Ack 1
Retransmit 2
Seq
Ack 5
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
Timeout driven recovery is slow (ms)
Data-driven recovery issuper fast (us) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control
instantiation implementation in the Internet UDP TCP
Next timebull Network Layerbull leaving the network
ldquoedgerdquo (application transport layers)
bull into the network ldquocorerdquo
Perspective
Before Next time
bull Project Proposalndash due in one weekndash Meet with groups TA and professor
bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday
bull No required reading and review duebull But review chapter 4 from the book Network Layer
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer Services/Protocols
- Transport Layer Services/Protocols (2)
- Transport Layer Services/Protocols (3)
- Transport Layer Services/Protocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP: Connectionless Transport
- UDP: Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over "long fat pipes"
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse: Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse: Summary
- Perspective
- Before Next time
• time-out period often relatively long: long delay before resending lost packet
• detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if a segment is lost, there will likely be many duplicate ACKs
• TCP fast retransmit: if sender receives 3 ACKs for the same data ("triple duplicate ACKs"), resend unacked segment with smallest seq
  – likely that the unacked segment was lost, so don't wait for timeout
TCP Reliable Transport: TCP Fast Retransmit
[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92 (8 bytes of data) and Seq=100 (20 bytes of data); the Seq=100 segment is lost; Host B repeats ACK=100; on the third duplicate ACK=100, Host A resends Seq=100 before its timeout expires]
TCP Reliable Transport: TCP Fast Retransmit
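A minimal sketch of the duplicate-ACK counting rule above; the variable names are illustrative, not those of a real TCP stack.

```python
# Fast retransmit sketch: on the 3rd duplicate cumulative ACK, resend the
# oldest unacked segment instead of waiting for the retransmission timeout.

DUP_ACK_THRESHOLD = 3

def on_ack(state, ack_no):
    """state holds 'last_ack' and 'dup_acks'; returns a seq to retransmit, or None."""
    if ack_no == state["last_ack"]:          # same cumulative ACK again: a duplicate
        state["dup_acks"] += 1
        if state["dup_acks"] == DUP_ACK_THRESHOLD:
            return ack_no                    # resend unacked segment with smallest seq
    else:                                    # new ACK: window advances, reset the count
        state["last_ack"] = ack_no
        state["dup_acks"] = 0
    return None

state = {"last_ack": 100, "dup_acks": 0}
results = [on_ack(state, 100) for _ in range(3)]   # -> [None, None, 100]
```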
Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary; want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT
TCP Reliable Transport: TCP Round-trip Time and Timeouts
[Figure: SampleRTT and EstimatedRTT (milliseconds) over time (seconds), for the RTT from gaia.cs.umass.edu to fantasia.eurecom.fr]
EstimatedRTT = (1 − α) · EstimatedRTT + α · SampleRTT
• exponentially weighted moving average
• influence of past samples decreases exponentially fast
• typical value: α = 0.125
TCP Reliable Transport: TCP Round-trip Time and Timeouts
• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
DevRTT = (1 − β) · DevRTT + β · |SampleRTT − EstimatedRTT|
(typically β = 0.25)
TimeoutInterval = EstimatedRTT + 4 · DevRTT
(estimated RTT plus "safety margin")
TCP Reliable Transport: TCP Round-trip Time and Timeouts
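The two EWMA updates above can be sketched directly, with the slides' typical gains α = 0.125 and β = 0.25 (this is the same estimator scheme RFC 6298 standardizes); the sample values in the usage loop are made up.

```python
# EstimatedRTT / DevRTT / TimeoutInterval updates, in the order the slides
# present them. All times are in seconds.

ALPHA, BETA = 0.125, 0.25

def update_rto(est_rtt, dev_rtt, sample_rtt):
    est_rtt = (1 - ALPHA) * est_rtt + ALPHA * sample_rtt        # smoothed RTT
    dev_rtt = (1 - BETA) * dev_rtt + BETA * abs(sample_rtt - est_rtt)
    timeout = est_rtt + 4 * dev_rtt          # estimated RTT + "safety margin"
    return est_rtt, dev_rtt, timeout

est, dev = 0.100, 0.010                      # illustrative starting estimates
for sample in (0.120, 0.080, 0.150):         # illustrative SampleRTT measurements
    est, dev, rto = update_rto(est, dev, sample)
```

Note how a jittery path inflates DevRTT, and with it the safety margin, exactly as the bullet above describes.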
• full-duplex data: bi-directional data flow in same connection
• MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender and receiver state before data exchange
• flow controlled: sender will not overwhelm receiver
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
TCP (Transmission Control Protocol): RFCs 793, 1122, 1323, 2018, 2581
TCP Reliable Transport
[Figure: receiver protocol stack: the application process sits above the TCP socket receiver buffers, with TCP code and IP code inside the OS]
the application may remove data from TCP socket buffers … slower than the TCP receiver is delivering (i.e., than the sender is sending)
flow control: receiver controls sender, so the sender won't overflow the receiver's buffer by transmitting too much, too fast
TCP Reliable Transport: Flow Control
[Figure: receiver-side buffering: RcvBuffer holds buffered data headed to the application process; rwnd is the free buffer space; TCP segment payloads arrive from the sender]
• receiver "advertises" free buffer space by including the rwnd value in the TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
TCP Reliable Transport: Flow Control
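As one concrete illustration of setting RcvBuffer via socket options: the 64 KB request below is an arbitrary example, and kernels may round, double, or autotune the value rather than honor it exactly.

```python
# Setting the receive buffer (RcvBuffer) on a TCP socket via SO_RCVBUF.
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 65536)    # request a 64 KB RcvBuffer
actual = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)  # kernel may report a different size
s.close()
```

On Linux, for example, the kernel doubles the requested value for bookkeeping overhead, and setting it explicitly disables receive-buffer autotuning for that socket.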
Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/demultiplexing
  – UDP: connectionless transport
  – TCP: reliable transport
    • abstraction, connection management, reliable transport, flow control, timeouts
  – Congestion control
• Data Center TCP
  – Incast problem
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control:
• end-end congestion control:
  – no explicit feedback from network
  – congestion inferred from end-system observed loss, delay
  – approach taken by TCP
• network-assisted congestion control:
  – routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at
Principles of Congestion Control
fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]
TCP Congestion Control: TCP Fairness
approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss
[Figure: AIMD "sawtooth" behavior: TCP sender congestion window size (cwnd) vs. time; additively increase window size … until loss occurs (then cut window in half), probing for bandwidth]
TCP Congestion Control: TCP Fairness; Why is TCP Fair? AIMD: additive increase, multiplicative decrease
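A toy trace of the AIMD sawtooth, counting cwnd in segments for simplicity; the loss round is fabricated just to produce the shape.

```python
# AIMD sketch: +1 MSS per RTT (additive increase), halve on loss
# (multiplicative decrease). Starting cwnd and loss timing are illustrative.

MSS = 1  # measure cwnd in segments

def aimd(rounds, loss_rounds):
    cwnd, trace = 4, []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_rounds:
            cwnd = max(cwnd // 2, 1)     # multiplicative decrease after loss
        else:
            cwnd += MSS                  # additive increase each RTT
    return trace

trace = aimd(10, loss_rounds={5})
# trace: [4, 5, 6, 7, 8, 9, 4, 5, 6, 7] -- the sawtooth probing for bandwidth
```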
• sender limits transmission: LastByteSent − LastByteAcked ≤ cwnd
• cwnd is dynamic, a function of perceived network congestion
• TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes
rate ≈ cwnd / RTT bytes/sec
[Figure: sender sequence number space: last byte ACKed; bytes sent but not-yet ACKed ("in-flight"), at most cwnd; last byte sent]
TCP Congestion Control
two competing sessions: additive increase gives a slope of 1 as throughput increases; multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput vs. Connection 1 throughput, both axes from 0 to R; the trajectory converges toward the "equal bandwidth share" line as congestion avoidance additively increases and each loss decreases the window by a factor of 2]
TCP Congestion Control: TCP Fairness; Why is TCP Fair?
Fairness and UDP: multimedia apps often do not use TCP; they do not want their rate throttled by congestion control. Instead they use UDP: send audio/video at a constant rate, tolerate packet loss.
Fairness and parallel TCP connections: an application can open multiple parallel connections between two hosts; web browsers do this. E.g., for a link of rate R with 9 existing connections: a new app asking for 1 TCP gets rate R/10; a new app asking for 11 TCPs gets R/2.
TCP Congestion Control: TCP Fairness
when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received
summary: initial rate is slow, but ramps up exponentially fast
[Figure: Host A sends one segment to Host B, then two segments, then four, each batch one RTT apart]
TCP Congestion Control: Slow Start
TCP Congestion Control
• loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs (TCP Reno): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)
Detecting and Reacting to Loss
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation: variable ssthresh; on a loss event, ssthresh is set to 1/2 of cwnd just before the loss event
TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
[Figure: TCP congestion-control FSM with states slow start, congestion avoidance, and fast recovery]
init: cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0
slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: go to congestion avoidance
congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
from slow start or congestion avoidance:
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s) as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
TCP Congestion Control
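The state machine condenses to a few handlers. This sketch measures cwnd and ssthresh in segments (rather than bytes with a 64 KB initial ssthresh) and keeps only the window arithmetic, not the retransmission bookkeeping; names are illustrative.

```python
# Condensed sketch of the Reno window rules: slow start grows cwnd by 1 MSS
# per ACK (doubling per RTT) until ssthresh, congestion avoidance grows it by
# MSS*(MSS/cwnd) per ACK, and the two loss signals are handled differently.

MSS = 1  # one segment

class RenoSketch:
    def __init__(self):
        self.cwnd = 1 * MSS
        self.ssthresh = 64        # in segments here, for simplicity

    def on_new_ack(self):
        if self.cwnd < self.ssthresh:
            self.cwnd += MSS                      # slow start: exponential per RTT
        else:
            self.cwnd += MSS * MSS / self.cwnd    # congestion avoidance: linear per RTT

    def on_triple_dup_ack(self):                  # Reno: enter fast recovery
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.ssthresh + 3 * MSS

    def on_timeout(self):                         # Tahoe-style reaction on timeout
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
```

For instance, after 8 new ACKs from a cold start, cwnd is 9 segments; a timeout then sets ssthresh to 4.5 and drops cwnd back to 1.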
• avg TCP throughput as a function of window size and RTT
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg window size (# in-flight bytes) is ¾ W
  – avg throughput is ¾ W per RTT
[Figure: sawtooth of window size oscillating between W/2 and W]
avg TCP throughput = (¾ W) / RTT bytes/sec
TCP Throughput
TCP over "long, fat pipes"
• example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
TCP throughput = (1.22 · MSS) / (RTT · √L)
• to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰, a very small loss rate!
• new versions of TCP needed for high-speed networks
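The slide's numbers can be checked directly against the [Mathis 1997] formula:

```python
# Verifying the "long, fat pipe" example: 1500-byte segments, 100 ms RTT,
# 10 Gbps target, using throughput = 1.22 * MSS / (RTT * sqrt(L)).

MSS = 1500 * 8          # segment size in bits
RTT = 0.100             # seconds
target = 10e9           # 10 Gbps in bits/sec

W = target * RTT / MSS                      # in-flight segments needed: ~83,333
L = (1.22 * MSS / (RTT * target)) ** 2      # required loss probability: ~2e-10
```

One lost segment in five billion: this is why stock AIMD cannot sustain 10 Gbps over such a path, and why high-speed TCP variants were developed.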
Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/demultiplexing
  – UDP: connectionless transport
  – TCP: reliable transport
    • abstraction, connection management, reliable transport, flow control, timeouts
  – Congestion control
• Data Center TCP
  – Incast problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008.
TCP Throughput Collapse: What happens when TCP is "too friendly"?
E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer; SRU size is fixed
• TCP used as the data transfer protocol
Cluster-based Storage Systems
[Figure: a client connects through a switch to storage servers 1–4; in a synchronized read, the client requests a data block and each server returns its Server Request Unit (SRU); once all four SRUs arrive, the client sends the next batch of requests]
Link idle time due to timeouts
[Figure: during a synchronized read across servers 1–4, server 4's Server Request Unit (SRU) is lost; the link is idle until that server experiences a timeout]
TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts
[Figure: goodput collapses as the number of servers increases]
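A crude, illustrative model of why the collapse happens: a synchronized read is a barrier, so a single timed-out server stalls the whole block. All parameters below (1 ms SRU time, 2% timeout probability, 200 ms RTO) are invented for illustration, not measurements from the paper.

```python
# Toy incast model: block completion time is gated by the slowest server,
# so the chance that at least one server eats a full RTO grows with fan-in.
import random

def block_time(n_servers, sru_time=0.001, p_timeout=0.02, rto=0.200, seed=0):
    rng = random.Random(seed)
    times = [sru_time + (rto if rng.random() < p_timeout else 0.0)
             for _ in range(n_servers)]
    return max(times)                # barrier: wait for the slowest server

# As servers are added, some batch is increasingly likely to stall for an RTO,
# dragging average goodput down even though each server is individually fast.
for n in (2, 8, 32):
    t = block_time(n)
```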
TCP data-driven loss recovery
[Figure: sender transmits packets 1–5; packet 2 is lost; the receiver repeats "Ack 1", and after 3 duplicate ACKs for 1 (packet 2 is probably lost) the sender retransmits packet 2 immediately, then receives Ack 5]
• In SANs, recovery in microseconds after loss
TCP timeout-driven loss recovery
[Figure: sender transmits packets 1–5; the losses leave no duplicate-ACK signal, so the sender waits a full Retransmission Timeout (RTO) before retransmitting packet 1, then receives Ack 1]
• Timeouts are expensive (msecs to recover after loss)
TCP Loss recovery comparison
[Figure: side-by-side timelines: with data-driven recovery, the sender retransmits packet 2 immediately after three duplicate ACKs and then receives Ack 5; with timeout-driven recovery, the sender waits a full Retransmission Timeout (RTO) before retransmitting]
Q how to set TCP timeout value
longer than RTT but RTT varies
too short premature timeout unnecessary retransmissions
too long slow reaction to segment loss
Q how to estimate RTT
bull SampleRTT measured time from segment transmission until ACK receipt
ndash ignore retransmissions
bull SampleRTT will vary want estimated RTT ldquosmootherrdquo
ndash average several recent measurements not just current SampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
RTT gaiacsumassedu to fantasiaeurecomfr
100
150
200
250
300
350
1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106
time (seconnds)
RTT
(mill
isec
onds
)
SampleRTT Estimated RTT
EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases
exponentially fast typical value = 0125
RTT (
mill
iseco
nds)
RTT gaiacsumassedu to fantasiaeurecomfr
sampleRTT
TCP Reliable TransportTCP Roundtrip time and timeouts
time (seconds)
bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin
bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|
(typically = 025)
TimeoutInterval = EstimatedRTT + 4DevRTT
estimated RTT ldquosafety marginrdquo
TCP Reliable TransportTCP Roundtrip time and timeouts
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
applicationprocess
TCP socketreceiver buffers
TCPcode
IPcode
application
OS
receiver protocol stack
application may remove data from
TCP socket buffers hellip
hellip slower than TCP receiver is delivering(sender is sending)
from sender
receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast
flow control
TCP Reliable TransportFlow Control
buffered data
free buffer spacerwnd
RcvBuffer
TCP segment payloads
to application processbull receiver ldquoadvertisesrdquo free
buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via
socket options (typical default is 4096 bytes)
ndash many operating systems autoadjust RcvBuffer
bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value
bull guarantees receive buffer will not overflow
receiver-side buffering
TCP Reliable TransportFlow Control
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Principles of Congestion Control
congestion:
- informally: "too many sources sending too much data too fast for the network to handle"
- different from flow control
- manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control:
- end-end congestion control:
  - no explicit feedback from the network
  - congestion inferred from end-system observed loss and delay
  - approach taken by TCP
- network-assisted congestion control:
  - routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP Congestion Control: TCP Fairness
fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K
(figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R)
TCP Congestion Control: TCP Fairness (Why is TCP Fair?)
AIMD: additive increase, multiplicative decrease
- approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
(figure: AIMD sawtooth behavior, probing for bandwidth: TCP sender congestion window size cwnd vs. time; additively increase window size until loss occurs, then cut window in half)
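The AIMD sawtooth can be simulated in a few lines. This is a toy model, not TCP itself: loss is simply assumed whenever cwnd exceeds a fixed path capacity, with the window measured in MSS and time in RTT ticks.

```python
# Toy AIMD simulation (assumption: loss happens whenever cwnd exceeds a
# fixed path capacity; window measured in MSS, time in RTT ticks).

def aimd(capacity_mss, rtts, cwnd=1):
    trace = []
    for _ in range(rtts):
        if cwnd > capacity_mss:
            cwnd = max(1, cwnd // 2)   # multiplicative decrease: halve on loss
        else:
            cwnd += 1                  # additive increase: +1 MSS per RTT
        trace.append(cwnd)
    return trace

trace = aimd(capacity_mss=16, rtts=40)   # climbs to 17, halves to 8, repeats
```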
TCP Congestion Control
- sender limits transmission: LastByteSent - LastByteAcked <= cwnd
- cwnd is a dynamic function of perceived network congestion
- TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes
  - rate ~ cwnd / RTT bytes/sec
(figure: sender sequence number space: last byte ACKed; bytes sent but not yet ACKed ("in-flight"); last byte sent; the in-flight span is at most cwnd bytes)
TCP Congestion Control: TCP Fairness (Why is TCP Fair?)
two competing sessions:
- additive increase gives slope of 1 as throughput increases
- multiplicative decrease decreases throughput proportionally
(figure: Connection 1 throughput vs. Connection 2 throughput, each bounded by R; repeated rounds of congestion avoidance (additive increase) and loss (window cut by factor of 2) move the operating point toward the equal-bandwidth-share line)
TCP Congestion Control: TCP Fairness
Fairness and UDP:
- multimedia apps often do not use TCP: do not want rate throttled by congestion control
- instead use UDP: send audio/video at a constant rate, tolerate packet loss
Fairness and parallel TCP connections:
- an application can open multiple parallel connections between two hosts
- web browsers do this
- e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
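A back-of-envelope check of the parallel-connections example, under the idealized assumption that the bottleneck of rate R is shared equally by all open connections:

```python
# Back-of-envelope for the parallel-connections example (idealized
# assumption: bottleneck rate R is shared equally by all connections).

def per_app_rate(R, existing_conns, new_conns):
    """Aggregate rate the new application gets with new_conns connections."""
    return R * new_conns / (existing_conns + new_conns)
```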
TCP Congestion Control: Slow Start
- when a connection begins, increase rate exponentially until the first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
- summary: initial rate is slow but ramps up exponentially fast
(figure: Host A sends one segment, then two segments, then four segments to Host B, one round per RTT, over time)
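The per-ACK increment, and why it doubles cwnd each RTT, in a toy sketch (cwnd in MSS units; an idealized model in which every segment in the window is ACKed each RTT):

```python
# Slow-start sketch (cwnd in MSS units): one ACK per delivered segment,
# +1 MSS per ACK, so cwnd doubles every RTT.

def slow_start(rtts, cwnd=1):
    for _ in range(rtts):
        cwnd += cwnd      # cwnd ACKs arrive, each adds 1 MSS => doubling
    return cwnd
```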
TCP Congestion Control: Detecting and Reacting to Loss
- loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to a threshold, then grows linearly
- loss indicated by 3 duplicate ACKs (TCP Reno): dup ACKs indicate the network is capable of delivering some segments; cwnd is cut in half, window then grows linearly
- TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)
TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation:
- variable ssthresh
- on a loss event, ssthresh is set to 1/2 of cwnd just before the loss event
TCP Congestion Control
(figure: TCP Reno congestion-control FSM with states slow start, congestion avoidance, and fast recovery)
- start: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0; enter slow start
- slow start:
  - new ACK: cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed
  - duplicate ACK: dupACKcount++
  - timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment
  - cwnd > ssthresh: enter congestion avoidance
- congestion avoidance:
  - new ACK: cwnd = cwnd + MSS * (MSS/cwnd), dupACKcount = 0, transmit new segment(s) as allowed
  - duplicate ACK: dupACKcount++
- slow start or congestion avoidance, on timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment; enter slow start
- slow start or congestion avoidance, on dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment; enter fast recovery
- fast recovery:
  - duplicate ACK: cwnd = cwnd + MSS, transmit new segment(s) as allowed
  - new ACK: cwnd = ssthresh, dupACKcount = 0; enter congestion avoidance
  - timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment; enter slow start
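The per-event reactions above can be condensed into a sketch. This is simplified, not a full Reno implementation: the per-dup-ACK window inflation during fast recovery is omitted, and the MSS value is just an assumed example.

```python
# Condensed sketch of the Reno reactions (simplified: fast-recovery
# window inflation is omitted; MSS is an assumed example value).

MSS = 1460  # assumed maximum segment size, bytes

def on_timeout(cwnd):
    """Timeout: ssthresh = cwnd/2, cwnd back to 1 MSS (enter slow start)."""
    return MSS, cwnd // 2                      # (new cwnd, new ssthresh)

def on_triple_dup_ack(cwnd):
    """Three dup ACKs: ssthresh = cwnd/2, cwnd = ssthresh + 3 MSS."""
    ssthresh = cwnd // 2
    return ssthresh + 3 * MSS, ssthresh

def on_new_ack(cwnd, ssthresh):
    """New ACK: exponential growth below ssthresh, linear above."""
    if cwnd < ssthresh:                        # slow start
        return cwnd + MSS
    return cwnd + MSS * MSS // cwnd            # congestion avoidance
```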
TCP Throughput
- avg TCP throughput as a function of window size and RTT
  - ignore slow start, assume there is always data to send
- W: window size (measured in bytes) where loss occurs
  - avg window size (# in-flight bytes) is 3/4 W
  - avg throughput is 3/4 W per RTT
    avg TCP throughput = (3/4) W / RTT bytes/sec
(figure: sawtooth of window size oscillating between W/2 and W)
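Plugging into the formula above with a trivial helper (W in bytes, RTT in seconds):

```python
# Helper for the average-throughput estimate above (W in bytes, RTT in
# seconds): the AIMD window averages 3/4 W, sent once per RTT.

def avg_tcp_throughput(W, rtt):
    return 0.75 * W / rtt   # bytes/sec
```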
TCP over "long, fat pipes"
- example: 1500-byte segments, 100 ms RTT; want 10 Gbps throughput
- requires W = 83,333 in-flight segments
- throughput in terms of segment loss probability L [Mathis 1997]:
    TCP throughput = 1.22 * MSS / (RTT * sqrt(L))
- to achieve 10 Gbps throughput, need a loss rate of L = 2*10^-10, a very small loss rate
- new versions of TCP are needed for high-speed networks
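Solving the Mathis formula for L reproduces the slide's number (MSS converted to bits so the units match bits per second; all values taken from the example):

```python
# Solving the Mathis estimate T = 1.22 * MSS / (RTT * sqrt(L)) for the
# loss rate L needed to sustain a target rate (MSS in bits to match bps).

def required_loss_rate(mss_bits, rtt_s, target_bps):
    return (1.22 * mss_bits / (rtt_s * target_bps)) ** 2

L = required_loss_rate(mss_bits=1500 * 8, rtt_s=0.100, target_bps=10e9)
# L comes out on the order of 2e-10, matching the slide
```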
Goals for Today
- Transport Layer
  - Abstraction / services
  - Multiplexing/Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport
    - Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  - Congestion control
- Data Center TCP
  - Incast Problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
- test on an Ethernet-based storage cluster
- client performs synchronized reads
- increase # of servers involved in transfer
  - SRU size is fixed
- TCP used as the data transfer protocol
Cluster-based Storage Systems
(figure: a client connected through a switch to storage servers 1-4; the client issues a request R to each server, and each returns its Server Request Unit (SRU); together the SRUs form a data block. This is a synchronized read: only after receiving all of 1, 2, 3, 4 does the client send the next batch of requests)
Link Idle Time Due to Timeouts
(figure: the same synchronized read of SRUs 1-4, but server 4's response is lost; the client's link is idle until server 4 experiences a timeout)
TCP Throughput Collapse: Incast
- [Nagle04] called this Incast
- cause of throughput collapse: TCP timeouts
(figure: goodput collapses as the number of servers increases)
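A toy model, with all parameters assumed rather than taken from the paper, of why timeouts collapse goodput for synchronized reads: the whole block waits on the slowest server, so an RTO-length stall is charged against every byte of the block.

```python
# Toy incast model (all parameters assumed, not from the paper): n servers
# each return one SRU over a shared link; if any server times out, the
# link idles for RTO before the synchronized read completes.

def goodput(n_servers, sru_bytes, link_bps, p_timeout_per_server, rto_s):
    block_bits = n_servers * sru_bytes * 8
    transfer_s = block_bits / link_bps
    p_any = 1 - (1 - p_timeout_per_server) ** n_servers  # P(some server stalls)
    return block_bits / (transfer_s + p_any * rto_s)     # bits/sec
```

Even a 2% per-server timeout probability makes a stall likely with 32 servers, and a 200 ms RTO then dwarfs the ~67 ms transfer time of the block itself.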
TCP data-driven loss recovery
(figure: sender transmits segments 1-5; segment 2 is lost; the receiver keeps ACKing 1. Three duplicate ACKs for 1 mean packet 2 is probably lost, so the sender retransmits packet 2 immediately and then receives Ack 5)
- in SANs, recovery in usecs after loss
TCP timeout-driven loss recovery
(figure: sender transmits segments 1-5; losses leave the sender without duplicate ACKs, so it must wait a full Retransmission Timeout (RTO) before retransmitting segment 1, after which Ack 1 arrives)
- timeouts are expensive (msecs to recover after loss)
TCP Loss recovery comparison
(figure: the two timelines side by side. Data-driven: segments 1-5 sent, three duplicate ACKs for 1 trigger an immediate retransmit of 2, followed by Ack 5. Timeout-driven: the sender waits out the Retransmission Timeout (RTO) before retransmitting)
- timeout-driven recovery is slow (ms)
- data-driven recovery is super fast (us) in SANs
TCP Throughput Collapse: Summary
- synchronized reads and TCP timeouts cause TCP throughput collapse
- previously tried options:
  - increase buffer size (costly)
  - reduce RTOmin (unsafe)
  - use Ethernet Flow Control (limited applicability)
- DCTCP (Data Center TCP): limit in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications
Perspective
- principles behind transport layer services:
  - multiplexing, demultiplexing
  - reliable data transfer
  - flow control
  - congestion control
- instantiation and implementation in the Internet: UDP, TCP
Next time: Network Layer
- leaving the network "edge" (application, transport layers)
- into the network "core"
Before Next time
- Project Proposal
  - due in one week
  - meet with groups, TA, and professor
- Lab1
  - single-threaded TCP proxy
  - due in one week, next Friday
- No required reading and review due
  - but review chapter 4 from the book: Network Layer
  - we will also briefly discuss data center topologies
- Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer Services/Protocols
- Transport Layer Services/Protocols (2)
- Transport Layer Services/Protocols (3)
- Transport Layer Services/Protocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over "long fat pipes"
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse: Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse: Summary
- Perspective
- Before Next time
TCP Reliable Transport: TCP Round-trip time and timeouts
(figure: SampleRTT and EstimatedRTT in milliseconds vs. time in seconds, for RTTs from gaia.cs.umass.edu to fantasia.eurecom.fr; SampleRTT varies between roughly 100 and 350 ms while EstimatedRTT is much smoother)
EstimatedRTT = (1 - α) * EstimatedRTT + α * SampleRTT
- exponential weighted moving average
- influence of a past sample decreases exponentially fast
- typical value: α = 0.125
- timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
- estimate SampleRTT deviation from EstimatedRTT:
    DevRTT = (1 - β) * DevRTT + β * |SampleRTT - EstimatedRTT|
  (typically β = 0.25)
- TimeoutInterval = EstimatedRTT + 4 * DevRTT
  (estimated RTT plus "safety margin")
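The estimator and timeout computation above, in a short sketch using the slide's typical values α = 0.125 and β = 0.25. One caveat: this follows the slide's presentation order; RFC 6298 actually updates the deviation term before the smoothed RTT.

```python
# EWMA RTT estimator and timeout interval as on the slide (alpha = 0.125,
# beta = 0.25). Note: RFC 6298 updates the deviation term before the
# smoothed RTT; this sketch follows the slide's order instead.

def update_rtt(estimated_rtt, dev_rtt, sample_rtt, alpha=0.125, beta=0.25):
    estimated_rtt = (1 - alpha) * estimated_rtt + alpha * sample_rtt
    dev_rtt = (1 - beta) * dev_rtt + beta * abs(sample_rtt - estimated_rtt)
    timeout = estimated_rtt + 4 * dev_rtt    # estimate plus "safety margin"
    return estimated_rtt, dev_rtt, timeout
```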
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
applicationprocess
TCP socketreceiver buffers
TCPcode
IPcode
application
OS
receiver protocol stack
application may remove data from
TCP socket buffers hellip
hellip slower than TCP receiver is delivering(sender is sending)
from sender
receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast
flow control
TCP Reliable TransportFlow Control
buffered data
free buffer spacerwnd
RcvBuffer
TCP segment payloads
to application processbull receiver ldquoadvertisesrdquo free
buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via
socket options (typical default is 4096 bytes)
ndash many operating systems autoadjust RcvBuffer
bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value
bull guarantees receive buffer will not overflow
receiver-side buffering
TCP Reliable TransportFlow Control
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
congestionbull informally ldquotoo many sources sending too
much data too fast for network to handlerdquobull different from flow controlbull manifestations
ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control
end-end congestion control
no explicit feedback from network
congestion inferred from end-system observed loss delay
approach taken by TCP
network-assisted congestion control
routers provide feedback to end systemssingle bit indicating
congestion (SNA DECbit TCPIP ECN ATM)
explicit rate for sender to send at
Principles of Congestion Control
fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK
TCP connection 1
bottleneckroutercapacity RTCP connection 2
TCP Congestion ControlTCP Fairness
approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1
MSS every RTT until loss detected multiplicative decrease cut cwnd in half
after loss
cwnd
TC
P s
end
er
cong
estio
n w
indo
w s
ize
AIMD saw toothbehavior probing
for bandwidth
additively increase window size helliphellip until loss occurs (then cut window in half)
time
TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease
sender limits transmission
cwnd is dynamic function of perceived network congestion
TCP sending rateroughly send cwnd
bytes wait RTT for ACKS then send more bytes
last byteACKed sent not-
yet ACKed(ldquoin-flightrdquo)
last byte sent
cwnd
LastByteSent-LastByteAcked
lt cwnd
sender sequence number space
rate ~~cwnd
RTTbytessec
TCP Congestion Control
two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally
R
R
equal bandwidth share
Connection 1 throughput
Con
nect
ion
2 th
roug
h pu t
congestion avoidance additive increaseloss decrease window by factor of 2
congestion avoidance additive increaseloss decrease window by factor of 2
TCP Congestion ControlTCP Fairness Why is TCP Fair
Fairness and UDPmultimedia apps
often do not use TCP do not want rate
throttled by congestion control
instead use UDP send audiovideo at
constant rate tolerate packet loss
Fairness parallel TCP connections
application can open multiple parallel connections between two hosts
web browsers do this eg link of rate R with 9
existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2
TCP Congestion ControlTCP Fairness
when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd
for every ACK receivedsummary initial rate is
slow but ramps up exponentially fast
Host A
one segment
RT
T
Host B
time
two segments
four segments
TCP Congestion ControlSlow Start
TCP Congestion Control
loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly
loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
Detecting and Reacting to Loss
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++
duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++
duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP Congestion Control
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
bull Timeouts are expensive(msecs to recover after loss)
TCP Loss recovery comparison
Sender Receiver
12345
Ack 1
Ack 1Ack 1Ack 1
Retransmit 2
Seq
Ack 5
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
Timeout driven recovery is slow (ms)
Data-driven recovery issuper fast (us) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control
instantiation implementation in the Internet UDP TCP
Next timebull Network Layerbull leaving the network
ldquoedgerdquo (application transport layers)
bull into the network ldquocorerdquo
Perspective
Before Next time
bull Project Proposalndash due in one weekndash Meet with groups TA and professor
bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday
bull No required reading and review duebull But review chapter 4 from the book Network Layer
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
-
bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin
bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|
(typically = 025)
TimeoutInterval = EstimatedRTT + 4DevRTT
estimated RTT ldquosafety marginrdquo
TCP Reliable TransportTCP Roundtrip time and timeouts
full duplex data bi-directional data flow in
same connection MSS maximum segment
sizeconnection-oriented
handshaking (exchange of control msgs) inits sender receiver state before data exchange
flow controlled sender will not overwhelm
receiver
bull point-to-pointndash one sender one
receiver bull reliable in-order
byte steamndash no ldquomessage
boundariesrdquobull pipelined
ndash TCP congestion and flow control set window size
TCP Transmission Control ProtocolRFCs 79311221323 2018 2581
TCP Reliable Transport
applicationprocess
TCP socketreceiver buffers
TCPcode
IPcode
application
OS
receiver protocol stack
application may remove data from
TCP socket buffers hellip
hellip slower than TCP receiver is delivering(sender is sending)
from sender
receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast
flow control
TCP Reliable TransportFlow Control
buffered data
free buffer spacerwnd
RcvBuffer
TCP segment payloads
to application processbull receiver ldquoadvertisesrdquo free
buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via
socket options (typical default is 4096 bytes)
ndash many operating systems autoadjust RcvBuffer
bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value
bull guarantees receive buffer will not overflow
receiver-side buffering
TCP Reliable TransportFlow Control
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
congestionbull informally ldquotoo many sources sending too
much data too fast for network to handlerdquobull different from flow controlbull manifestations
ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control
end-end congestion control
no explicit feedback from network
congestion inferred from end-system observed loss delay
approach taken by TCP
network-assisted congestion control
routers provide feedback to end systemssingle bit indicating
congestion (SNA DECbit TCPIP ECN ATM)
explicit rate for sender to send at
Principles of Congestion Control
fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK
TCP connection 1
bottleneckroutercapacity RTCP connection 2
TCP Congestion ControlTCP Fairness
approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1
MSS every RTT until loss detected multiplicative decrease cut cwnd in half
after loss
cwnd
TC
P s
end
er
cong
estio
n w
indo
w s
ize
AIMD saw toothbehavior probing
for bandwidth
additively increase window size helliphellip until loss occurs (then cut window in half)
time
TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease
sender limits transmission
cwnd is dynamic function of perceived network congestion
TCP sending rateroughly send cwnd
bytes wait RTT for ACKS then send more bytes
last byteACKed sent not-
yet ACKed(ldquoin-flightrdquo)
last byte sent
cwnd
LastByteSent-LastByteAcked
lt cwnd
sender sequence number space
rate ~~cwnd
RTTbytessec
TCP Congestion Control
two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally
R
R
equal bandwidth share
Connection 1 throughput
Con
nect
ion
2 th
roug
h pu t
congestion avoidance additive increaseloss decrease window by factor of 2
congestion avoidance additive increaseloss decrease window by factor of 2
TCP Congestion ControlTCP Fairness Why is TCP Fair
Fairness and UDPmultimedia apps
often do not use TCP do not want rate
throttled by congestion control
instead use UDP send audiovideo at
constant rate tolerate packet loss
Fairness parallel TCP connections
application can open multiple parallel connections between two hosts
web browsers do this eg link of rate R with 9
existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2
TCP Congestion ControlTCP Fairness
when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd
for every ACK receivedsummary initial rate is
slow but ramps up exponentially fast
Host A
one segment
RT
T
Host B
time
two segments
four segments
TCP Congestion ControlSlow Start
TCP Congestion Control
loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly
loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
Detecting and Reacting to Loss
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++
duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++
duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP Congestion Control
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
[Figure: sender transmits segments 1–5; only segment 1 is acked. With too few duplicate ACKs to trigger fast retransmit, the sender must wait out a full Retransmission Timeout (RTO) before resending.]
• Timeouts are expensive (msecs to recover after loss)
TCP loss recovery comparison
[Figure: side by side — data-driven recovery (3 duplicate ACKs for 1, immediate retransmit of 2, then Ack 5) vs. timeout-driven recovery (link idle until the RTO fires).]
• Timeout-driven recovery is slow (ms)
• Data-driven recovery is super fast (µs) in SANs
TCP Throughput Collapse: Summary
• Synchronized reads and TCP timeouts cause TCP throughput collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP)
  – Limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications
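The cost of a single timeout can be sketched numerically: transferring a batch of SRUs takes only milliseconds on a gigabit link, so one RTO of 200 ms leaves the link idle far longer than the useful transfer. A rough toy model (parameter values are illustrative, not measurements from the paper):

```python
def incast_goodput(sru_bytes, n_servers, link_bps, rto_s, timeouts):
    """Goodput for one synchronized read if `timeouts` servers hit an RTO."""
    transfer_s = n_servers * sru_bytes * 8 / link_bps  # back-to-back SRUs
    total_s = transfer_s + timeouts * rto_s            # link idle while waiting
    return n_servers * sru_bytes * 8 / total_s

# 4 servers, 256 KB SRUs, 1 Gbps link, 200 ms minimum RTO (illustrative numbers)
no_loss = incast_goodput(256_000, 4, 1e9, 0.2, timeouts=0)  # full link rate
one_rto = incast_goodput(256_000, 4, 1e9, 0.2, timeouts=1)  # collapses to tens of Mbps
```

With these numbers the 8 ms transfer plus one 200 ms stall drops goodput by more than an order of magnitude, which is the shape of the collapse curves in the FAST 2008 paper.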
Perspective
• principles behind transport layer services:
  – multiplexing, demultiplexing
  – reliable data transfer
  – flow control
  – congestion control
• instantiation, implementation in the Internet: UDP, TCP
Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"
Before Next time
• Project Proposal
  – Due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single-threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
• But review Chapter 4 from the book: Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over "long fat pipes"
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP Reliable Transport: Flow Control
[Figure: receiver protocol stack — IP code hands segments from the sender to TCP code, which queues data in TCP socket receiver buffers until the application process reads it.]
• application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)
• flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

TCP Reliable Transport: Flow Control
[Figure: receiver-side buffering — RcvBuffer holds buffered data headed to the application process; the free buffer space is rwnd; TCP segment payloads arrive from the sender.]
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
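The flow-control rule above can be sketched in a few lines: the sender keeps in-flight bytes at or below the receiver's advertised rwnd. A minimal sketch (class and method names are illustrative, not a real socket API):

```python
class Receiver:
    def __init__(self, rcv_buffer=4096):       # typical default buffer size
        self.rcv_buffer = rcv_buffer
        self.buffered = 0                      # received but not yet read by the app

    def rwnd(self):
        # advertised window = free space left in the receive buffer
        return self.rcv_buffer - self.buffered

    def deliver(self, nbytes):
        assert nbytes <= self.rwnd(), "sender violated flow control"
        self.buffered += nbytes

    def app_read(self, nbytes):
        self.buffered -= min(nbytes, self.buffered)

class Sender:
    def __init__(self):
        self.last_byte_sent = 0
        self.last_byte_acked = 0

    def can_send(self, nbytes, rwnd):
        # LastByteSent - LastByteAcked + nbytes must stay <= rwnd
        return (self.last_byte_sent - self.last_byte_acked + nbytes) <= rwnd

rx, tx = Receiver(), Sender()
assert tx.can_send(4096, rx.rwnd())        # empty buffer: full window available
rx.deliver(3000)                           # 3000 bytes arrive; app hasn't read yet
assert not tx.can_send(2000, rx.rwnd())    # only 1096 bytes of window remain
rx.app_read(3000)                          # slow application finally drains the buffer
assert tx.can_send(2000, rx.rwnd())
```

The key point the sketch shows: rwnd shrinks as the application falls behind, which throttles the sender before the buffer can overflow.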
Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing / Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem
Principles of Congestion Control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)

Principles of Congestion Control
two broad approaches towards congestion control:
• end-end congestion control:
  – no explicit feedback from network
  – congestion inferred from end-system observed loss, delay
  – approach taken by TCP
• network-assisted congestion control:
  – routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at
TCP Congestion Control: TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connections 1 and 2 share a bottleneck router of capacity R.]

TCP Congestion Control: TCP Fairness — Why is TCP Fair?
AIMD: additive increase, multiplicative decrease
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss
[Figure: TCP sender congestion window size over time — the AIMD "saw tooth" behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half).]
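The sawtooth on this slide can be reproduced with a toy simulation. A minimal sketch, with loss events assumed at fixed RTTs rather than modeled from a real network:

```python
def aimd(cwnd_mss, rtts, loss_at, mss=1):
    """Additive increase (+1 MSS per RTT), multiplicative decrease (halve on loss).
    Returns the cwnd trace, one value per RTT, in MSS units."""
    trace = []
    for rtt in range(rtts):
        if rtt in loss_at:
            cwnd_mss = max(cwnd_mss / 2, 1)    # multiplicative decrease
        else:
            cwnd_mss += mss                    # additive increase
        trace.append(cwnd_mss)
    return trace

trace = aimd(cwnd_mss=10, rtts=12, loss_at={5, 10})
# window climbs 11..15, is halved to 7.5, climbs again, is halved again: the sawtooth
```

Plotting the trace gives exactly the probing-for-bandwidth sawtooth sketched on the slide.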
TCP Congestion Control
• sender limits transmission: LastByteSent − LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
[Figure: sender sequence number space — last byte ACKed; bytes sent, not-yet ACKed ("in-flight"); last byte sent; the in-flight span is at most cwnd.]
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
• rate ≈ cwnd / RTT bytes/sec
TCP Congestion Control: TCP Fairness — Why is TCP Fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 1 throughput vs. Connection 2 throughput, each bounded by R; alternating congestion avoidance (additive increase) and loss (decrease window by factor of 2) moves the operating point toward the equal bandwidth share line.]
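Why the trajectory converges to the equal-share line can be seen in a two-flow toy model: additive increase moves both flows by the same amount (keeping their gap constant), while each multiplicative decrease halves the gap. A sketch under those idealized assumptions (synchronized losses, fluid rates):

```python
def aimd_two_flows(x1, x2, capacity, rounds):
    """Two AIMD flows sharing one bottleneck of the given capacity."""
    for _ in range(rounds):
        if x1 + x2 > capacity:       # bottleneck overloaded: both see loss
            x1, x2 = x1 / 2, x2 / 2  # multiplicative decrease halves the gap...
        else:
            x1, x2 = x1 + 1, x2 + 1  # ...while additive increase keeps it fixed
    return x1, x2

x1, x2 = aimd_two_flows(1.0, 40.0, capacity=50, rounds=500)
# the initial 40:1 imbalance has essentially vanished; each flow hovers near capacity/2
```

Each loss event halves the difference |x1 − x2|, so after enough sawtooth cycles the two flows are indistinguishable: this is the geometric argument behind the slide's diagram.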
TCP Congestion Control: TCP Fairness
Fairness and UDP:
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss
Fairness and parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets ≈ R/2
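The parallel-connection example is just share arithmetic, assuming bandwidth splits equally per connection. A quick check (the function name is illustrative):

```python
def app_share(existing_conns, new_conns, link_rate):
    """Fraction of the link a new app gets if bandwidth splits equally per TCP connection."""
    total = existing_conns + new_conns
    return link_rate * new_conns / total

R = 1.0
print(app_share(9, 1, R))    # one new connection on top of 9: R/10
print(app_share(9, 11, R))   # eleven new connections: 11R/20, roughly the R/2 the slide quotes
```

Note the slide rounds 11/20 of R to R/2; the exact share is 0.55·R.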
TCP Congestion Control: Slow Start
when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received
summary: initial rate is slow, but ramps up exponentially fast
[Figure: Host A sends one segment, then two segments, then four segments to Host B, one RTT apart.]
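The per-ACK increment is what produces the per-RTT doubling: every ACK in an RTT adds one MSS, and there are cwnd of them. A sketch of the ramp-up (no losses, one ACK per segment, illustrative only):

```python
def slow_start_rtts(ssthresh_mss, cwnd_mss=1):
    """RTTs of slow start until cwnd reaches ssthresh (cwnd doubles each RTT)."""
    rtts = 0
    while cwnd_mss < ssthresh_mss:
        # each of the cwnd ACKs received this RTT adds 1 MSS -> cwnd doubles
        cwnd_mss *= 2
        rtts += 1
    return rtts

print(slow_start_rtts(64))   # 1 -> 2 -> 4 -> 8 -> 16 -> 32 -> 64: six RTTs
```

Six RTTs to go from 1 MSS to 64 MSS is the "slow but exponentially fast" ramp the slide summarizes.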
TCP Congestion Control: Detecting and Reacting to Loss
• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
TCP Congestion Control
[Figure: the TCP congestion-control finite state machine, with states slow start, congestion avoidance, and fast recovery.]
• slow start (entry: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0):
  – new ACK: cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed
  – duplicate ACK: dupACKcount++
  – cwnd > ssthresh: move to congestion avoidance
• congestion avoidance:
  – new ACK: cwnd = cwnd + MSS·(MSS/cwnd), dupACKcount = 0, transmit new segment(s) as allowed
  – duplicate ACK: dupACKcount++
• from either state:
  – dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment, enter fast recovery
  – timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment, enter slow start
• fast recovery:
  – duplicate ACK: cwnd = cwnd + MSS, transmit new segment(s) as allowed
  – new ACK: cwnd = ssthresh, dupACKcount = 0, move to congestion avoidance
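The state machine above can be condensed into a small event-driven sketch. This is a teaching model of the Reno transitions only (real stacks track far more state, and cwnd is kept in MSS units here for clarity):

```python
MSS = 1  # count cwnd in MSS units for readability

class RenoSender:
    def __init__(self):
        self.cwnd, self.ssthresh = 1, 64
        self.dup_acks = 0
        self.state = "slow_start"

    def on_new_ack(self):
        self.dup_acks = 0
        if self.state == "fast_recovery":
            self.cwnd, self.state = self.ssthresh, "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                      # exponential growth
            if self.cwnd >= self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += MSS * MSS / self.cwnd    # ~ +1 MSS per RTT

    def on_dup_ack(self):
        self.dup_acks += 1
        if self.dup_acks == 3 and self.state != "fast_recovery":
            self.ssthresh = self.cwnd / 2         # fast retransmit
            self.cwnd = self.ssthresh + 3
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd, self.dup_acks = 1, 0
        self.state = "slow_start"
```

Feeding it ten new ACKs then three duplicate ACKs walks it from slow start into fast recovery, exactly following the FSM edges listed above.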
TCP Throughput
• avg TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg window size (# in-flight bytes) is ¾ W
  – avg throughput is ¾ W per RTT
avg TCP throughput = (3/4)·W / RTT bytes/sec
[Figure: window saw-tooths between W/2 and W.]
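The ¾W figure follows because the sawtooth oscillates between W/2 and W, averaging 3W/4. A quick numeric check (window sizes are illustrative):

```python
def avg_tcp_throughput(w_bytes, rtt_s):
    """Average steady-state throughput when the window saw-tooths between W/2 and W."""
    avg_window = (w_bytes / 2 + w_bytes) / 2   # = 3W/4 bytes in flight on average
    return avg_window / rtt_s                  # bytes per second

# e.g. W = 100 KB at loss, RTT = 100 ms -> 750 KB/s average
print(avg_tcp_throughput(100_000, 0.1))
```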
TCP over "long fat pipes"
• example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  TCP throughput = 1.22 · MSS / (RTT · √L)
• to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰ — a very small loss rate!
• new versions of TCP for high-speed
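The slide's numbers can be checked directly from the [Mathis 1997] approximation, throughput ≈ 1.22·MSS/(RTT·√L). A sketch (function names are illustrative):

```python
import math

def mathis_throughput(mss_bytes, rtt_s, loss_rate):
    """Steady-state TCP throughput approximation [Mathis 1997], in bits/sec."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss_rate))

def required_loss_rate(mss_bytes, rtt_s, target_bps):
    """Invert the formula: loss rate L needed to sustain target_bps."""
    return (1.22 * mss_bytes * 8 / (rtt_s * target_bps)) ** 2

# Slide's example: 1500-byte segments, 100 ms RTT, 10 Gbps target
L = required_loss_rate(1500, 0.100, 10e9)
print(L)  # roughly 2e-10, matching the slide's L = 2*10^-10
```

Working backwards with that L through mathis_throughput recovers ~10 Gbps, which is why sustaining such rates over classic TCP demands an almost loss-free path.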
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
bull Timeouts are expensive(msecs to recover after loss)
TCP Loss recovery comparison
Sender Receiver
12345
Ack 1
Ack 1Ack 1Ack 1
Retransmit 2
Seq
Ack 5
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
Timeout driven recovery is slow (ms)
Data-driven recovery issuper fast (us) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control
instantiation implementation in the Internet UDP TCP
Next timebull Network Layerbull leaving the network
ldquoedgerdquo (application transport layers)
bull into the network ldquocorerdquo
Perspective
Before Next time
bull Project Proposalndash due in one weekndash Meet with groups TA and professor
bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday
bull No required reading and review duebull But review chapter 4 from the book Network Layer
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
-
applicationprocess
TCP socketreceiver buffers
TCPcode
IPcode
application
OS
receiver protocol stack
application may remove data from
TCP socket buffers hellip
hellip slower than TCP receiver is delivering(sender is sending)
from sender
receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast
flow control
TCP Reliable TransportFlow Control
buffered data
free buffer spacerwnd
RcvBuffer
TCP segment payloads
to application processbull receiver ldquoadvertisesrdquo free
buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via
socket options (typical default is 4096 bytes)
ndash many operating systems autoadjust RcvBuffer
bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value
bull guarantees receive buffer will not overflow
receiver-side buffering
TCP Reliable TransportFlow Control
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
congestionbull informally ldquotoo many sources sending too
much data too fast for network to handlerdquobull different from flow controlbull manifestations
ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control
end-end congestion control
no explicit feedback from network
congestion inferred from end-system observed loss delay
approach taken by TCP
network-assisted congestion control
routers provide feedback to end systemssingle bit indicating
congestion (SNA DECbit TCPIP ECN ATM)
explicit rate for sender to send at
Principles of Congestion Control
fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK
TCP connection 1
bottleneckroutercapacity RTCP connection 2
TCP Congestion ControlTCP Fairness
approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1
MSS every RTT until loss detected multiplicative decrease cut cwnd in half
after loss
cwnd
TC
P s
end
er
cong
estio
n w
indo
w s
ize
AIMD saw toothbehavior probing
for bandwidth
additively increase window size helliphellip until loss occurs (then cut window in half)
time
TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease
sender limits transmission
cwnd is dynamic function of perceived network congestion
TCP sending rateroughly send cwnd
bytes wait RTT for ACKS then send more bytes
last byteACKed sent not-
yet ACKed(ldquoin-flightrdquo)
last byte sent
cwnd
LastByteSent-LastByteAcked
lt cwnd
sender sequence number space
rate ~~cwnd
RTTbytessec
TCP Congestion Control
two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally
R
R
equal bandwidth share
Connection 1 throughput
Con
nect
ion
2 th
roug
h pu t
congestion avoidance additive increaseloss decrease window by factor of 2
congestion avoidance additive increaseloss decrease window by factor of 2
TCP Congestion ControlTCP Fairness Why is TCP Fair
Fairness and UDPmultimedia apps
often do not use TCP do not want rate
throttled by congestion control
instead use UDP send audiovideo at
constant rate tolerate packet loss
Fairness parallel TCP connections
application can open multiple parallel connections between two hosts
web browsers do this eg link of rate R with 9
existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2
TCP Congestion ControlTCP Fairness
when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd
for every ACK receivedsummary initial rate is
slow but ramps up exponentially fast
Host A
one segment
RT
T
Host B
time
two segments
four segments
TCP Congestion ControlSlow Start
TCP Congestion Control
loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly
loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
Detecting and Reacting to Loss
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++
duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++
duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP Congestion Control
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
bull Timeouts are expensive(msecs to recover after loss)
TCP Loss recovery comparison
Sender Receiver
12345
Ack 1
Ack 1Ack 1Ack 1
Retransmit 2
Seq
Ack 5
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
Timeout driven recovery is slow (ms)
Data-driven recovery issuper fast (us) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control
instantiation implementation in the Internet UDP TCP
Next timebull Network Layerbull leaving the network
ldquoedgerdquo (application transport layers)
bull into the network ldquocorerdquo
Perspective
Before Next time
bull Project Proposalndash due in one weekndash Meet with groups TA and professor
bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday
bull No required reading and review duebull But review chapter 4 from the book Network Layer
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
-
buffered data
free buffer spacerwnd
RcvBuffer
TCP segment payloads
to application processbull receiver ldquoadvertisesrdquo free
buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via
socket options (typical default is 4096 bytes)
ndash many operating systems autoadjust RcvBuffer
bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value
bull guarantees receive buffer will not overflow
receiver-side buffering
TCP Reliable TransportFlow Control
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
congestionbull informally ldquotoo many sources sending too
much data too fast for network to handlerdquobull different from flow controlbull manifestations
ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control
end-end congestion control
no explicit feedback from network
congestion inferred from end-system observed loss delay
approach taken by TCP
network-assisted congestion control
routers provide feedback to end systemssingle bit indicating
congestion (SNA DECbit TCPIP ECN ATM)
explicit rate for sender to send at
Principles of Congestion Control
fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK
TCP connection 1
bottleneckroutercapacity RTCP connection 2
TCP Congestion ControlTCP Fairness
approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1
MSS every RTT until loss detected multiplicative decrease cut cwnd in half
after loss
cwnd
TC
P s
end
er
cong
estio
n w
indo
w s
ize
AIMD saw toothbehavior probing
for bandwidth
additively increase window size helliphellip until loss occurs (then cut window in half)
time
TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease
sender limits transmission
cwnd is dynamic function of perceived network congestion
TCP sending rateroughly send cwnd
bytes wait RTT for ACKS then send more bytes
last byteACKed sent not-
yet ACKed(ldquoin-flightrdquo)
last byte sent
cwnd
LastByteSent-LastByteAcked
lt cwnd
sender sequence number space
rate ~~cwnd
RTTbytessec
TCP Congestion Control
two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally
R
R
equal bandwidth share
Connection 1 throughput
Con
nect
ion
2 th
roug
h pu t
congestion avoidance additive increaseloss decrease window by factor of 2
congestion avoidance additive increaseloss decrease window by factor of 2
TCP Congestion ControlTCP Fairness Why is TCP Fair
Fairness and UDPmultimedia apps
often do not use TCP do not want rate
throttled by congestion control
instead use UDP send audiovideo at
constant rate tolerate packet loss
Fairness parallel TCP connections
application can open multiple parallel connections between two hosts
web browsers do this eg link of rate R with 9
existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2
TCP Congestion ControlTCP Fairness
when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd
for every ACK receivedsummary initial rate is
slow but ramps up exponentially fast
Host A
one segment
RT
T
Host B
time
two segments
four segments
TCP Congestion ControlSlow Start
TCP Congestion Control
loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly
loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
Detecting and Reacting to Loss
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
[FSM: TCP Reno congestion control, with states slow start, congestion avoidance, and fast recovery]
- initialization: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0
- slow start, new ACK: cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed; when cwnd > ssthresh, enter congestion avoidance
- congestion avoidance, new ACK: cwnd = cwnd + MSS * (MSS/cwnd), dupACKcount = 0, transmit new segment(s) as allowed
- duplicate ACK (slow start or congestion avoidance): dupACKcount++; when dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment, enter fast recovery
- fast recovery, duplicate ACK: cwnd = cwnd + MSS, transmit new segment(s) as allowed
- fast recovery, new ACK: cwnd = ssthresh, dupACKcount = 0, enter congestion avoidance
- timeout (any state): ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment, enter slow start
TCP Congestion Control
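The Reno state machine can be sketched in code. This is a simplified model of the transitions only (the class, method names, and event granularity are illustrative, not from any real TCP stack; actual stacks also handle retransmission, SACK, etc.):

```python
MSS = 1460  # assumed segment size in bytes, for illustration

class RenoSender:
    """Toy model of TCP Reno's congestion-control state machine:
    slow start -> congestion avoidance -> fast recovery."""

    def __init__(self):
        self.cwnd = 1 * MSS
        self.ssthresh = 64 * 1024
        self.dup_acks = 0
        self.state = "slow_start"

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh          # deflate window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                   # exponential growth
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:                                  # linear growth (~1 MSS/RTT)
            self.cwnd += MSS * MSS // self.cwnd
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS                   # inflate during recovery
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                 # fast retransmit
            self.ssthresh = self.cwnd // 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd // 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.state = "slow_start"
```

Feeding three duplicate ACKs to a fresh sender moves it into fast recovery with cwnd = ssthresh + 3 MSS, exactly the dupACKcount == 3 transition in the FSM above; a timeout from any state resets cwnd to 1 MSS and returns to slow start.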
- avg TCP throughput as a function of window size and RTT (ignore slow start, assume there is always data to send)
- W: window size (in bytes) at which loss occurs; the window oscillates between W/2 and W, so the avg window size (# in-flight bytes) is 3/4 W

avg TCP throughput = (3/4) W / RTT bytes/sec

TCP Throughput
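A one-line check of the sawtooth average, using the 3/4 W result above (the example values are illustrative, not measurements):

```python
def avg_tcp_throughput(w_bytes, rtt_sec):
    """AIMD sawtooth: the window oscillates between W/2 and W, so the
    average in-flight data is 3/4 W per RTT (result in bytes/sec)."""
    return 0.75 * w_bytes / rtt_sec

# e.g., loss occurs at W = 1 MB with a 100 ms RTT -> 7.5 MB/s average
print(avg_tcp_throughput(1_000_000, 0.1))  # 7500000.0
```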
TCP over "long, fat pipes"
- example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
- requires W = 83,333 in-flight segments
- throughput in terms of segment loss probability L [Mathis 1997]:

TCP throughput = 1.22 * MSS / (RTT * sqrt(L))

- to achieve 10 Gbps throughput, need a loss rate of L = 2*10^-10, a very small loss rate!
- hence new versions of TCP for high-speed networks
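Plugging the example numbers into the Mathis formula confirms both figures on the slide (`required_loss_rate` is my helper name for the algebraic inverse of the formula):

```python
import math

def mathis_throughput(mss_bytes, rtt_sec, loss):
    """Mathis et al. 1997: throughput ~= 1.22 * MSS / (RTT * sqrt(L)),
    in bytes/sec when MSS is in bytes."""
    return 1.22 * mss_bytes / (rtt_sec * math.sqrt(loss))

def required_loss_rate(mss_bytes, rtt_sec, target_bytes_per_sec):
    """Invert the formula: L = (1.22 * MSS / (RTT * target))^2."""
    return (1.22 * mss_bytes / (rtt_sec * target_bytes_per_sec)) ** 2

mss, rtt = 1500, 0.1            # bytes, seconds
target = 10e9 / 8               # 10 Gbps expressed in bytes/sec
print(target / mss)             # ~83333 segments in flight per second of RTT window
print(required_loss_rate(mss, rtt, target))  # ~2.1e-10
```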
Goals for Today
- Transport Layer
  - Abstraction / services
  - Multiplexing / demultiplexing
  - UDP: connectionless transport
  - TCP: reliable transport (abstraction, connection management, reliable transport, flow control, timeouts)
  - Congestion control
- Data Center TCP
  - Incast problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems," A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008
TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
- test on an Ethernet-based storage cluster
- client performs synchronized reads
- increase the # of servers involved in the transfer (SRU size is fixed)
- TCP used as the data transfer protocol
Cluster-based Storage Systems
[Figure: a client connects through a switch to storage servers 1-4; a data block is striped into Server Request Units (SRUs), one per server; the client issues a synchronized read, receives all four SRUs, and only then sends the next batch of requests]
Link idle time due to timeouts
[Figure: same synchronized-read setup; server 4's SRU is lost, so the link is idle until server 4 experiences a timeout, even though servers 1-3 have already delivered their SRUs]
TCP Throughput Collapse: Incast
- [Nagle04] called this Incast
- cause of throughput collapse: TCP timeouts
[Figure: goodput collapses as the number of servers increases]
TCP data-driven loss recovery
[Figure: sender transmits segments 1-5; segment 2 is lost; the receiver ACKs 1 for each later segment; after 3 duplicate ACKs for 1 (packet 2 is probably lost), the sender retransmits packet 2 immediately and then receives Ack 5]
In SANs, data-driven recovery completes in microseconds after a loss.
TCP timeout-driven loss recovery
[Figure: sender transmits segments 1-5; too few ACKs return to trigger fast retransmit, so the sender must wait for the retransmission timeout (RTO) before retransmitting]
- Timeouts are expensive (msecs to recover after loss)
TCP loss recovery comparison
[Figure: side-by-side timelines of data-driven recovery (retransmit 2 after 3 duplicate ACKs for 1, then Ack 5) and timeout-driven recovery (link idle until the RTO fires)]
Data-driven recovery is super fast (us) in SANs; timeout-driven recovery is slow (ms).
TCP Throughput Collapse: Summary
- synchronized reads and TCP timeouts cause TCP throughput collapse
- previously tried options: increase buffer size (costly); reduce RTO_min (unsafe); use Ethernet flow control (limited applicability)
- DCTCP (Data Center TCP): limits the in-network buffer (queue length) via both in-network signaling and end-to-end TCP modifications
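A deliberately crude toy model reproduces the collapse shape described above. All parameters and the buffer-overflow threshold are my assumptions, not the paper's methodology: the idea is only that once enough servers send simultaneously to overflow the shared switch buffer, one RTO (~200 ms) dwarfs the actual transfer time of the block.

```python
def goodput(n_servers, sru_bytes=256_000, link_bps=1e9,
            rto=0.2, switch_buf=512_000, per_server_window=64_000):
    """Goodput of one synchronized read of n_servers SRUs.
    If the servers' aggregate in-flight data exceeds the switch buffer,
    assume the resulting drops force one RTO stall before the block
    completes (gross simplification of the real loss process)."""
    block = n_servers * sru_bytes
    transfer = block * 8 / link_bps                           # seconds on the wire
    stall = rto if n_servers * per_server_window > switch_buf else 0.0
    return block / (transfer + stall)                         # bytes/sec

for n in (1, 4, 8, 16, 64):
    print(n, round(goodput(n)))
```

With these assumed numbers, goodput sits at the full link rate up to 8 servers, then drops by roughly an order of magnitude once the aggregate windows exceed the buffer, which is the qualitative incast cliff.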
Perspective
- principles behind transport layer services: multiplexing/demultiplexing, reliable data transfer, flow control, congestion control
- instantiation and implementation in the Internet: UDP, TCP
Next time: Network Layer
- leaving the network "edge" (application, transport layers)
- into the network "core"
Before Next time
- Project Proposal: due in one week; meet with your group, TA, and professor
- Lab1: single-threaded TCP proxy, due in one week (next Friday)
- No required reading and review due, but review chapter 4 from the book (Network Layer)
  - we will also briefly discuss data center topologies
- Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer Services/Protocols
- Transport Layer Services/Protocols (2)
- Transport Layer Services/Protocols (3)
- Transport Layer Services/Protocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP: Connectionless Transport
- UDP: Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over "long, fat pipes"
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse: Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse: Summary
- Perspective
- Before Next time
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
congestionbull informally ldquotoo many sources sending too
much data too fast for network to handlerdquobull different from flow controlbull manifestations
ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control
end-end congestion control
no explicit feedback from network
congestion inferred from end-system observed loss delay
approach taken by TCP
network-assisted congestion control
routers provide feedback to end systemssingle bit indicating
congestion (SNA DECbit TCPIP ECN ATM)
explicit rate for sender to send at
Principles of Congestion Control
fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK
TCP connection 1
bottleneckroutercapacity RTCP connection 2
TCP Congestion ControlTCP Fairness
approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1
MSS every RTT until loss detected multiplicative decrease cut cwnd in half
after loss
cwnd
TC
P s
end
er
cong
estio
n w
indo
w s
ize
AIMD saw toothbehavior probing
for bandwidth
additively increase window size helliphellip until loss occurs (then cut window in half)
time
TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease
sender limits transmission
cwnd is dynamic function of perceived network congestion
TCP sending rateroughly send cwnd
bytes wait RTT for ACKS then send more bytes
last byteACKed sent not-
yet ACKed(ldquoin-flightrdquo)
last byte sent
cwnd
LastByteSent-LastByteAcked
lt cwnd
sender sequence number space
rate ~~cwnd
RTTbytessec
TCP Congestion Control
two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally
R
R
equal bandwidth share
Connection 1 throughput
Con
nect
ion
2 th
roug
h pu t
congestion avoidance additive increaseloss decrease window by factor of 2
congestion avoidance additive increaseloss decrease window by factor of 2
TCP Congestion ControlTCP Fairness Why is TCP Fair
Fairness and UDPmultimedia apps
often do not use TCP do not want rate
throttled by congestion control
instead use UDP send audiovideo at
constant rate tolerate packet loss
Fairness parallel TCP connections
application can open multiple parallel connections between two hosts
web browsers do this eg link of rate R with 9
existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2
TCP Congestion ControlTCP Fairness
when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd
for every ACK receivedsummary initial rate is
slow but ramps up exponentially fast
Host A
one segment
RT
T
Host B
time
two segments
four segments
TCP Congestion ControlSlow Start
TCP Congestion Control
loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly
loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
Detecting and Reacting to Loss
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++
duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++
duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP Congestion Control
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
bull Timeouts are expensive(msecs to recover after loss)
TCP Loss recovery comparison
Sender Receiver
12345
Ack 1
Ack 1Ack 1Ack 1
Retransmit 2
Seq
Ack 5
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
Timeout driven recovery is slow (ms)
Data-driven recovery issuper fast (us) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control
instantiation implementation in the Internet UDP TCP
Next timebull Network Layerbull leaving the network
ldquoedgerdquo (application transport layers)
bull into the network ldquocorerdquo
Perspective
Before Next time
bull Project Proposalndash due in one weekndash Meet with groups TA and professor
bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday
bull No required reading and review duebull But review chapter 4 from the book Network Layer
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
-
congestionbull informally ldquotoo many sources sending too
much data too fast for network to handlerdquobull different from flow controlbull manifestations
ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)
Principles of Congestion Control
two broad approaches towards congestion control
end-end congestion control
no explicit feedback from network
congestion inferred from end-system observed loss delay
approach taken by TCP
network-assisted congestion control
routers provide feedback to end systemssingle bit indicating
congestion (SNA DECbit TCPIP ECN ATM)
explicit rate for sender to send at
Principles of Congestion Control
fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK
TCP connection 1
bottleneckroutercapacity RTCP connection 2
TCP Congestion ControlTCP Fairness
approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1
MSS every RTT until loss detected multiplicative decrease cut cwnd in half
after loss
cwnd
TC
P s
end
er
cong
estio
n w
indo
w s
ize
AIMD saw toothbehavior probing
for bandwidth
additively increase window size helliphellip until loss occurs (then cut window in half)
time
TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease
sender limits transmission
cwnd is dynamic function of perceived network congestion
TCP sending rateroughly send cwnd
bytes wait RTT for ACKS then send more bytes
last byteACKed sent not-
yet ACKed(ldquoin-flightrdquo)
last byte sent
cwnd
LastByteSent-LastByteAcked
lt cwnd
sender sequence number space
rate ~~cwnd
RTTbytessec
TCP Congestion Control
two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally
R
R
equal bandwidth share
Connection 1 throughput
Con
nect
ion
2 th
roug
h pu t
congestion avoidance additive increaseloss decrease window by factor of 2
congestion avoidance additive increaseloss decrease window by factor of 2
TCP Congestion ControlTCP Fairness Why is TCP Fair
Fairness and UDPmultimedia apps
often do not use TCP do not want rate
throttled by congestion control
instead use UDP send audiovideo at
constant rate tolerate packet loss
Fairness parallel TCP connections
application can open multiple parallel connections between two hosts
web browsers do this eg link of rate R with 9
existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2
TCP Congestion ControlTCP Fairness
when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd
for every ACK receivedsummary initial rate is
slow but ramps up exponentially fast
Host A
one segment
RT
T
Host B
time
two segments
four segments
TCP Congestion ControlSlow Start
TCP Congestion Control
loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly
loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
Detecting and Reacting to Loss
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++
duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++
duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP Congestion Control
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
bull Timeouts are expensive(msecs to recover after loss)
TCP Loss recovery comparison
Sender Receiver
12345
Ack 1
Ack 1Ack 1Ack 1
Retransmit 2
Seq
Ack 5
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
Timeout driven recovery is slow (ms)
Data-driven recovery issuper fast (us) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control
instantiation implementation in the Internet UDP TCP
Next timebull Network Layerbull leaving the network
ldquoedgerdquo (application transport layers)
bull into the network ldquocorerdquo
Perspective
Before Next time
bull Project Proposalndash due in one weekndash Meet with groups TA and professor
bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday
bull No required reading and review duebull But review chapter 4 from the book Network Layer
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
-
two broad approaches towards congestion control
end-end congestion control
no explicit feedback from network
congestion inferred from end-system observed loss delay
approach taken by TCP
network-assisted congestion control
routers provide feedback to end systemssingle bit indicating
congestion (SNA DECbit TCPIP ECN ATM)
explicit rate for sender to send at
Principles of Congestion Control
fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK
TCP connection 1
bottleneckroutercapacity RTCP connection 2
TCP Congestion ControlTCP Fairness
approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1
MSS every RTT until loss detected multiplicative decrease cut cwnd in half
after loss
cwnd
TC
P s
end
er
cong
estio
n w
indo
w s
ize
AIMD saw toothbehavior probing
for bandwidth
additively increase window size helliphellip until loss occurs (then cut window in half)
time
TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease
sender limits transmission
cwnd is dynamic function of perceived network congestion
TCP sending rateroughly send cwnd
bytes wait RTT for ACKS then send more bytes
last byteACKed sent not-
yet ACKed(ldquoin-flightrdquo)
last byte sent
cwnd
LastByteSent-LastByteAcked
lt cwnd
sender sequence number space
rate ~~cwnd
RTTbytessec
TCP Congestion Control
two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally
R
R
equal bandwidth share
Connection 1 throughput
Con
nect
ion
2 th
roug
h pu t
congestion avoidance additive increaseloss decrease window by factor of 2
congestion avoidance additive increaseloss decrease window by factor of 2
TCP Congestion ControlTCP Fairness Why is TCP Fair
Fairness and UDPmultimedia apps
often do not use TCP do not want rate
throttled by congestion control
instead use UDP send audiovideo at
constant rate tolerate packet loss
Fairness parallel TCP connections
application can open multiple parallel connections between two hosts
web browsers do this eg link of rate R with 9
existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2
TCP Congestion ControlTCP Fairness
when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd
for every ACK receivedsummary initial rate is
slow but ramps up exponentially fast
Host A
one segment
RT
T
Host B
time
two segments
four segments
TCP Congestion ControlSlow Start
TCP Congestion Control
loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly
loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
Detecting and Reacting to Loss
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++
duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++
duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP Congestion Control
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing / Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan. Proc. of USENIX File and Storage Technologies (FAST), February 2008

TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer
  – SRU size is fixed
• TCP used as the data transfer protocol
Cluster-based Storage Systems

[Figure: a client connects through a switch to storage servers 1–4. The client issues a synchronized read: one request (RR) per server, each answered with that server's Server Request Unit (SRU) of the data block. Once SRUs 1–4 have all arrived, the client sends the next batch of requests.]
Link idle time due to timeouts

[Figure: same setup. During a synchronized read, server 4's SRU is dropped at the switch; servers 1–3 complete, but the link is idle until server 4 experiences a timeout.]
TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts

[Figure: goodput collapses as the number of servers increases.]
TCP data-driven loss recovery

[Figure: sender transmits segments 1–5; segment 2 is lost. The receiver sends Ack 1 for each later segment; after 3 duplicate ACKs for 1 (packet 2 is probably lost), the sender retransmits packet 2 immediately and then receives Ack 5.]
• In SANs, data-driven recovery takes µsecs after loss
TCP timeout-driven loss recovery

[Figure: sender transmits segments 1–5, but only a single Ack 1 returns, so there are not enough duplicate ACKs to trigger fast retransmit. The sender sits idle for the full Retransmission Timeout (RTO) before retransmitting.]
• Timeouts are expensive (msecs to recover after loss)
TCP Loss recovery comparison

[Figure: the two timelines side by side. Data-driven: three duplicate Ack 1s trigger "Retransmit 2", followed by Ack 5. Timeout-driven: a lone Ack 1, then a full Retransmission Timeout (RTO) before retransmission.]
• Timeout-driven recovery is slow (ms)
• Data-driven recovery is super fast (µs) in SANs
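A back-of-the-envelope comparison of the two recovery paths, under assumed numbers: a ~100 µs SAN round trip and the common 200 ms RTO_min default. Both values are illustrative, not measurements from the paper.

```python
# data-driven: fast retransmit fires after ~3 duplicate ACKs, i.e. ~1 RTT
san_rtt = 100e-6

# timeout-driven: the sender sits idle for the retransmission timer
rto_min = 200e-3

slowdown = rto_min / san_rtt
print(f"timeout-driven recovery is ~{slowdown:.0f}x slower")  # ~2000x
```

The three orders of magnitude between the two paths are why a handful of timeouts is enough to collapse goodput on an otherwise fast fabric.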
TCP Throughput Collapse: Summary
• Synchronized reads and TCP timeouts cause TCP throughput collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTO_min (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP)
  – limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications
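The collapse mechanism can be illustrated with a toy expected-value model. The assumptions here are mine, not the paper's: a fixed-size block striped across N servers, a 1 Gbps link, a 200 ms RTO, and a fixed per-server probability of overflowing the switch buffer. The only point is that the probability of stalling a whole round for one RTO grows with fan-in.

```python
def incast_goodput_bps(n_servers, block_bytes=1_000_000,
                       link_bps=1e9, rto=0.2, p_server=0.02):
    """Expected goodput for one synchronized read of a striped block (toy model)."""
    # chance that at least one of the n responses is dropped at the switch
    p_timeout = 1 - (1 - p_server) ** n_servers
    data_bits = block_bytes * 8
    # serialize the block at line rate, plus an expected idle RTO on loss
    expected_time = data_bits / link_bps + p_timeout * rto
    return data_bits / expected_time

for n in (1, 4, 16, 64):
    print(n, round(incast_goodput_bps(n) / 1e6), "Mbps")
```

Even in this crude model, goodput falls steeply as servers are added, because the fixed RTO idle time comes to dominate the (microsecond-scale) transfer time.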
Perspective
• principles behind transport layer services: multiplexing/demultiplexing, reliable data transfer, flow control, congestion control
• instantiation, implementation in the Internet: UDP, TCP

Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"
Before Next time
• Project Proposal
  – due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single-threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
  – But review chapter 4 from the book: Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
TCP Congestion Control: TCP Fairness
fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K

[Figure: TCP connections 1 and 2 share a bottleneck router of capacity R.]
TCP Congestion Control: TCP Fairness. Why is TCP fair?
AIMD: additive increase, multiplicative decrease
approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss

[Figure: TCP sender congestion window size vs. time; AIMD "saw tooth" behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half).]
TCP Congestion Control
• sender limits transmission: LastByteSent − LastByteAcked ≤ cwnd
• cwnd is dynamic, a function of perceived network congestion
• TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes

  rate ≈ cwnd / RTT bytes/sec

[Figure: sender sequence number space; from left to right: last byte ACKed; bytes sent but not-yet ACKed ("in-flight"); last byte sent; the in-flight span has width cwnd.]
TCP Congestion Control: TCP Fairness. Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1 as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 1 throughput vs. Connection 2 throughput, both bounded by R, with the "equal bandwidth share" line; the trajectory alternates congestion avoidance (additive increase) with loss (decrease window by factor of 2), converging toward the equal-share line.]
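The convergence toward the equal-share line can be seen in a few lines of simulation. This is a toy model with simplifying assumptions of my own: both flows add one unit per RTT, and both see loss simultaneously whenever the bottleneck is exceeded.

```python
def aimd_two_flows(x1, x2, capacity=100.0, rounds=500):
    """Iterate AIMD for two flows sharing one bottleneck of the given capacity."""
    for _ in range(rounds):
        x1 += 1.0                  # additive increase, flow 1
        x2 += 1.0                  # additive increase, flow 2
        if x1 + x2 > capacity:     # bottleneck overflows: both see loss
            x1 /= 2.0              # multiplicative decrease
            x2 /= 2.0
    return x1, x2

x1, x2 = aimd_two_flows(5.0, 80.0)
# the rate gap halves at every loss event, so the flows converge
# toward the equal share (capacity / 2) regardless of starting point
```

Additive increase preserves the gap between the two rates, while each multiplicative decrease halves it, which is exactly the geometric argument behind the diagram.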
TCP Congestion Control: TCP Fairness
Fairness and UDP:
• multimedia apps often do not use TCP: do not want rate throttled by congestion control
• instead use UDP: send audio/video at constant rate, tolerate packet loss

Fairness and parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
TCP Congestion Control: Slow Start
• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment, then two segments, then four segments to Host B, one batch per RTT.]
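The per-ACK increment doubles the window each RTT, so the ramp-up takes a number of RTTs that is only logarithmic in the target window. A tiny sketch; the function name is mine:

```python
def slow_start_rtts(target_mss, cwnd=1):
    """RTTs for slow start to grow cwnd from 1 MSS up to target_mss."""
    rtts = 0
    while cwnd < target_mss:
        cwnd += cwnd    # one ACK per in-flight segment -> cwnd doubles per RTT
        rtts += 1
    return rtts

print(slow_start_rtts(64))   # 6 RTTs: 1 -> 2 -> 4 -> 8 -> 16 -> 32 -> 64
```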
TCP Congestion Control: Detecting and Reacting to Loss
• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs (TCP Reno):
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++
duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++
duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP Congestion Control
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
bull Timeouts are expensive(msecs to recover after loss)
TCP Loss recovery comparison
Sender Receiver
12345
Ack 1
Ack 1Ack 1Ack 1
Retransmit 2
Seq
Ack 5
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
Timeout driven recovery is slow (ms)
Data-driven recovery issuper fast (us) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control
instantiation implementation in the Internet UDP TCP
Next timebull Network Layerbull leaving the network
ldquoedgerdquo (application transport layers)
bull into the network ldquocorerdquo
Perspective
Before Next time
bull Project Proposalndash due in one weekndash Meet with groups TA and professor
bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday
bull No required reading and review duebull But review chapter 4 from the book Network Layer
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
-
approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1
MSS every RTT until loss detected multiplicative decrease cut cwnd in half
after loss
cwnd
TC
P s
end
er
cong
estio
n w
indo
w s
ize
AIMD saw toothbehavior probing
for bandwidth
additively increase window size helliphellip until loss occurs (then cut window in half)
time
TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease
sender limits transmission
cwnd is dynamic function of perceived network congestion
TCP sending rateroughly send cwnd
bytes wait RTT for ACKS then send more bytes
last byteACKed sent not-
yet ACKed(ldquoin-flightrdquo)
last byte sent
cwnd
LastByteSent-LastByteAcked
lt cwnd
sender sequence number space
rate ~~cwnd
RTTbytessec
TCP Congestion Control
two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally
R
R
equal bandwidth share
Connection 1 throughput
Con
nect
ion
2 th
roug
h pu t
congestion avoidance additive increaseloss decrease window by factor of 2
congestion avoidance additive increaseloss decrease window by factor of 2
TCP Congestion ControlTCP Fairness Why is TCP Fair
Fairness and UDPmultimedia apps
often do not use TCP do not want rate
throttled by congestion control
instead use UDP send audiovideo at
constant rate tolerate packet loss
Fairness parallel TCP connections
application can open multiple parallel connections between two hosts
web browsers do this eg link of rate R with 9
existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2
TCP Congestion ControlTCP Fairness
when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd
for every ACK receivedsummary initial rate is
slow but ramps up exponentially fast
Host A
one segment
RT
T
Host B
time
two segments
four segments
TCP Congestion ControlSlow Start
TCP Congestion Control
loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly
loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
Detecting and Reacting to Loss
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++
duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++
duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP Congestion Control
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
bull Timeouts are expensive(msecs to recover after loss)
TCP Loss recovery comparison
Sender Receiver
12345
Ack 1
Ack 1Ack 1Ack 1
Retransmit 2
Seq
Ack 5
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
Timeout driven recovery is slow (ms)
Data-driven recovery issuper fast (us) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control
instantiation implementation in the Internet UDP TCP
Next timebull Network Layerbull leaving the network
ldquoedgerdquo (application transport layers)
bull into the network ldquocorerdquo
Perspective
Before Next time
bull Project Proposalndash due in one weekndash Meet with groups TA and professor
bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday
bull No required reading and review duebull But review chapter 4 from the book Network Layer
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
-
sender limits transmission
cwnd is dynamic function of perceived network congestion
TCP sending rateroughly send cwnd
bytes wait RTT for ACKS then send more bytes
last byteACKed sent not-
yet ACKed(ldquoin-flightrdquo)
last byte sent
cwnd
LastByteSent-LastByteAcked
lt cwnd
sender sequence number space
rate ~~cwnd
RTTbytessec
TCP Congestion Control
two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally
R
R
equal bandwidth share
Connection 1 throughput
Con
nect
ion
2 th
roug
h pu t
congestion avoidance additive increaseloss decrease window by factor of 2
congestion avoidance additive increaseloss decrease window by factor of 2
TCP Congestion ControlTCP Fairness Why is TCP Fair
Fairness and UDPmultimedia apps
often do not use TCP do not want rate
throttled by congestion control
instead use UDP send audiovideo at
constant rate tolerate packet loss
Fairness parallel TCP connections
application can open multiple parallel connections between two hosts
web browsers do this eg link of rate R with 9
existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2
TCP Congestion ControlTCP Fairness
when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd
for every ACK receivedsummary initial rate is
slow but ramps up exponentially fast
Host A
one segment
RT
T
Host B
time
two segments
four segments
TCP Congestion ControlSlow Start
TCP Congestion Control
loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly
loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
Detecting and Reacting to Loss
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++
duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++
duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP Congestion Control
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
bull Timeouts are expensive(msecs to recover after loss)
TCP Loss recovery comparison
Sender Receiver
12345
Ack 1
Ack 1Ack 1Ack 1
Retransmit 2
Seq
Ack 5
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
Timeout driven recovery is slow (ms)
Data-driven recovery issuper fast (us) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control
instantiation implementation in the Internet UDP TCP
Next timebull Network Layerbull leaving the network
ldquoedgerdquo (application transport layers)
bull into the network ldquocorerdquo
Perspective
Before Next time
bull Project Proposalndash due in one weekndash Meet with groups TA and professor
bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday
bull No required reading and review duebull But review chapter 4 from the book Network Layer
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
TCP Congestion Control: TCP Fairness — Why is TCP fair?
Two competing sessions: additive increase gives a slope of 1 as throughput increases; multiplicative decrease cuts throughput proportionally.
[Figure: connection 1 throughput vs. connection 2 throughput, each axis from 0 to R. Starting from an unequal split, congestion avoidance (additive increase) moves the flows parallel to the equal-bandwidth-share line, and each loss (window cut by a factor of 2) moves them toward it, so the flows converge to the equal share.]
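The convergence argument above can be checked numerically. A toy simulation (all parameter values hypothetical): each flow adds one unit per RTT, and both flows halve their rate when the bottleneck is oversubscribed.

```python
# Toy AIMD model: two flows sharing a bottleneck of rate `capacity`.
# Additive increase: +1 per flow per RTT.
# Multiplicative decrease: both flows halve on a loss event.
def aimd(rate1, rate2, capacity, rounds):
    for _ in range(rounds):
        rate1 += 1.0
        rate2 += 1.0
        if rate1 + rate2 > capacity:   # bottleneck oversubscribed: loss
            rate1 /= 2.0
            rate2 /= 2.0
    return rate1, rate2

# Start far from fair: flow 1 at 90, flow 2 at 10, capacity 100.
r1, r2 = aimd(90.0, 10.0, 100.0, 1000)
print(r1, r2)   # the two rates end up nearly equal
```

The gap between the two flows is unchanged by additive increase but halved by every multiplicative decrease, which is exactly why the trajectory converges to the equal-bandwidth share.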
TCP Congestion Control: TCP Fairness
• Fairness and UDP: multimedia apps often do not use TCP, since they do not want their rate throttled by congestion control. Instead they use UDP, sending audio/video at a constant rate and tolerating packet loss.
• Fairness and parallel TCP connections: an application can open multiple parallel connections between two hosts; web browsers do this. E.g., on a link of rate R with 9 existing connections: a new app asking for 1 TCP connection gets rate R/10, but a new app asking for 11 TCP connections gets about R/2.
TCP Congestion Control: Slow Start
• When a connection begins, increase the rate exponentially until the first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT, done by incrementing cwnd by 1 MSS for every ACK received
• Summary: the initial rate is slow, but it ramps up exponentially fast.
[Figure: Host A sends one segment to Host B, then two, then four; the number of segments in flight doubles every RTT.]
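The per-ACK increment implies doubling per RTT. A minimal sketch, in units of MSS and under idealized assumptions (every segment is acked, no loss):

```python
# Slow start, idealized: cwnd segments in flight per RTT, one ACK per
# segment, and +1 MSS of cwnd per ACK, so cwnd doubles every RTT.
def slow_start_rounds(cwnd, rounds):
    sizes = [cwnd]
    for _ in range(rounds):   # one iteration = one RTT
        acks = cwnd           # one ACK for each in-flight segment
        cwnd += acks          # +1 MSS per ACK => doubling
        sizes.append(cwnd)
    return sizes

print(slow_start_rounds(1, 4))   # [1, 2, 4, 8, 16]
```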
TCP Congestion Control: Detecting and Reacting to Loss
• Loss indicated by timeout: cwnd is set to 1 MSS; the window then grows exponentially (as in slow start) to a threshold, then grows linearly.
• Loss indicated by 3 duplicate ACKs (TCP Reno): duplicate ACKs indicate the network is still capable of delivering some segments; cwnd is cut in half, and the window then grows linearly.
• TCP Tahoe always sets cwnd to 1 MSS (on either a timeout or 3 duplicate ACKs).

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
Q: When should the exponential increase switch to linear?
A: When cwnd reaches 1/2 of its value before the loss.
Implementation: a variable ssthresh; on a loss event, ssthresh is set to 1/2 of cwnd just before the loss.
TCP Congestion Control
[State diagram, reconstructed as a transition table:]
• Initial: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0 → slow start
• Slow start:
  – new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s) as allowed
  – duplicate ACK: dupACKcount++
  – cwnd ≥ ssthresh: → congestion avoidance
  – timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment (remain in slow start)
  – dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; → fast recovery
• Congestion avoidance:
  – new ACK: cwnd = cwnd + MSS·(MSS/cwnd); dupACKcount = 0; transmit new segment(s) as allowed
  – duplicate ACK: dupACKcount++
  – timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; → slow start
  – dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; → fast recovery
• Fast recovery:
  – duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s) as allowed
  – new ACK: cwnd = ssthresh; dupACKcount = 0; → congestion avoidance
  – timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; → slow start
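The transitions above can be sketched as a small state machine. This is an illustrative sketch only (class and method names are mine, units are MSS; it is not any real TCP stack's API):

```python
# Reno-style cwnd rules from the state diagram above, in units of MSS.
class RenoSketch:
    def __init__(self):
        self.cwnd, self.ssthresh, self.dup = 1.0, 64.0, 0
        self.state = "slow_start"

    def on_new_ack(self):
        self.dup = 0
        if self.state == "fast_recovery":
            self.cwnd, self.state = self.ssthresh, "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += 1.0                 # +1 MSS per ACK (exponential)
            if self.cwnd >= self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += 1.0 / self.cwnd     # ~ +1 MSS per RTT (linear)

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += 1.0                 # inflate for each extra dup ACK
            return
        self.dup += 1
        if self.dup == 3:                    # fast retransmit
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd, self.dup, self.state = 1.0, 0, "slow_start"
```

For example, after 10 new ACKs from the start, cwnd is 11 MSS and the sender is still in slow start; three duplicate ACKs then set ssthresh to 5.5, cwnd to 8.5, and enter fast recovery.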
TCP Throughput
• Average TCP throughput as a function of window size and RTT (ignore slow start; assume there is always data to send)
• W: window size (measured in bytes) at which loss occurs
  – average window size (# in-flight bytes) is 3/4 W
  – average throughput is 3/4 W per RTT:
    avg TCP throughput = (3/4) W / RTT  bytes/sec
[Figure: sawtooth of the window oscillating between W/2 and W.]
TCP over "long fat pipes"
• Example: 1500-byte segments, 100 ms RTT; want 10 Gbps throughput
• Requires W = 83,333 in-flight segments
• Throughput in terms of segment loss probability L [Mathis 1997]:
    TCP throughput = 1.22 · MSS / (RTT · √L)
• To achieve 10 Gbps throughput, we need a loss rate of L = 2·10^-10, a very small loss rate!
• Hence: new versions of TCP for high-speed networks
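The two numbers in this example follow directly from the formulas. A back-of-envelope check with the stated assumptions (MSS = 1500 bytes, RTT = 100 ms, target 10 Gbps):

```python
from math import sqrt

MSS = 1500 * 8   # segment size in bits
RTT = 0.100      # round-trip time in seconds
target = 10e9    # target throughput, bits/sec

# Window needed: throughput = W * MSS / RTT  =>  W = target * RTT / MSS
W = target * RTT / MSS
print(round(W))  # 83333 in-flight segments

# Mathis formula: throughput = 1.22 * MSS / (RTT * sqrt(L)); solve for L
L = (1.22 * MSS / (RTT * target)) ** 2
print(L)         # ~2e-10: the tolerable loss rate is tiny
```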
Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing / demultiplexing
  – UDP: connectionless transport
  – TCP: reliable transport
    • Abstraction, connection management, reliable transport, flow control, timeouts
  – Congestion control
• Data Center TCP
  – Incast problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems," A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008.
TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase the number of servers involved in the transfer
  – SRU size is fixed
• TCP is used as the data transfer protocol
Cluster-based Storage Systems
[Figure: a client connects through a switch to four storage servers. A synchronized read requests a data block striped across servers 1–4; each server returns its portion, a Server Request Unit (SRU). Once all four SRUs arrive, the client sends the next batch of requests.]
Link Idle Time Due to Timeouts
[Figure: the same synchronized read; server 4's SRU is lost, so after servers 1–3 finish, the link is idle until server 4 experiences a timeout and retransmits.]
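Rough arithmetic shows why one timeout per round is so damaging: the block transfer takes a few milliseconds, while the idle wait is a full RTO. The values below are hypothetical (1 Gbps link, 256 KB SRU per server, RTO_min = 200 ms), chosen only to illustrate the scale.

```python
# Link utilization of one synchronized-read round in which a single
# server times out: the round takes (transfer time + RTO), with the
# link idle for the whole RTO. All parameter values are assumptions.
def utilization(n_servers, sru_kb=256, link_mbps=1000, rto_ms=200):
    transfer_ms = n_servers * sru_kb * 8 / link_mbps
    return transfer_ms / (transfer_ms + rto_ms)

print(round(utilization(4), 3))   # 0.039: the link is ~4% utilized
```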
TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts
[Figure: goodput collapses as the number of servers grows.]
TCP Data-Driven Loss Recovery
[Figure: the sender transmits segments 1–5; segment 2 is lost. Each later segment triggers another Ack 1, so after 3 duplicate ACKs for 1 (packet 2 is probably lost) the sender retransmits packet 2 immediately, and the receiver answers with Ack 5.]
• In SANs, this recovers in microseconds after a loss.
TCP Timeout-Driven Loss Recovery
[Figure: the sender transmits segments 1–5, but too few duplicate ACKs come back to trigger fast retransmit; the sender must wait out the retransmission timeout (RTO) before retransmitting and finally receiving an ACK.]
• Timeouts are expensive (milliseconds to recover after a loss).
TCP Loss Recovery Comparison
[Figure: side by side. Data-driven recovery (3 duplicate ACKs → immediate retransmit of segment 2 → Ack 5) is super fast (microseconds) in SANs; timeout-driven recovery (waiting out the RTO) is slow (milliseconds).]
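The gap between the two paths is three to four orders of magnitude. Assuming a SAN round-trip time of roughly 100 µs and the common 200 ms RTO_min (both values illustrative, not from the slides):

```python
# Data-driven recovery costs on the order of one RTT; timeout-driven
# recovery waits at least RTO_min. Values below are assumptions.
rtt_us = 100            # ~RTT in a SAN, microseconds
rto_min_us = 200_000    # a common RTO_min of 200 ms, in microseconds

print(rto_min_us // rtt_us)   # 2000: timeout recovery is ~2000x slower
```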
TCP Throughput Collapse: Summary
• Synchronized reads and TCP timeouts cause TCP throughput collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTO_min (unsafe)
  – Use Ethernet flow control (limited applicability)
• DCTCP (Data Center TCP): limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications
TCP Congestion Control
loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly
loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly
TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
Detecting and Reacting to Loss
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Implementation variable ssthresh on loss event ssthresh is
set to 12 of cwnd just before loss event
TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++
duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++
duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP Congestion Control
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
bull Timeouts are expensive(msecs to recover after loss)
TCP Loss recovery comparison
Sender Receiver
12345
Ack 1
Ack 1Ack 1Ack 1
Retransmit 2
Seq
Ack 5
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
Timeout driven recovery is slow (ms)
Data-driven recovery issuper fast (us) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control
instantiation implementation in the Internet UDP TCP
Next timebull Network Layerbull leaving the network
ldquoedgerdquo (application transport layers)
bull into the network ldquocorerdquo
Perspective
Before Next time
bull Project Proposalndash due in one weekndash Meet with groups TA and professor
bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday
bull No required reading and review duebull But review chapter 4 from the book Network Layer
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over "long fat pipes"
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before the loss event
TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)

The FSM's three states and their transitions:

slow start
• entry: cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ (no action); move to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; move to fast recovery

congestion avoidance
• new ACK: cwnd = cwnd + MSS × (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; move to slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; move to fast recovery

fast recovery
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; move to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; move to slow start
TCP Congestion Control
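The state machine above can be sketched as a toy simulator. This is a simplified sketch only (the class name and event methods are my own, cwnd is counted in segments, and a real TCP stack tracks far more state):

```python
MSS = 1  # count cwnd in segments for simplicity

class CongestionFSM:
    """Toy model of the slow start / congestion avoidance / fast recovery FSM."""
    def __init__(self, ssthresh=64):
        self.state = "slow_start"
        self.cwnd = 1 * MSS
        self.ssthresh = ssthresh
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "slow_start":
            self.cwnd += MSS                    # exponential growth per RTT
            if self.cwnd > self.ssthresh:       # switch to linear growth
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += MSS * MSS / self.cwnd  # ~ +1 MSS per RTT
        else:                                   # new ACK ends fast recovery
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS                    # window inflation
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                  # triple dup ACK -> fast retransmit
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self):                       # same reaction from any state
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.state = "slow_start"
```

For example, with ssthresh = 4, four new ACKs grow cwnd to 5 and cross the cwnd > ssthresh boundary into congestion avoidance; three duplicate ACKs then halve ssthresh and enter fast recovery, exactly as in the FSM.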
• avg TCP throughput as a function of window size and RTT
 – ignore slow start; assume there is always data to send
• W: window size (measured in bytes) at which loss occurs
 – the window oscillates between W/2 and W, so avg window size (# in-flight bytes) is ¾ W
 – avg throughput is ¾ W per RTT

avg TCP throughput = (3/4) W / RTT bytes/sec
TCP Throughput
TCP over "long fat pipes"
• example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability L [Mathis 1997]:

TCP throughput = 1.22 · MSS / (RTT · √L)

➜ to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰ – a very small loss rate!
• new versions of TCP needed for high-speed
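Both numbers on this slide can be checked in a few lines (a sketch; the 1.22 constant comes from the Mathis et al. model cited above):

```python
MSS_BYTES = 1500
RTT = 0.100          # seconds
TARGET = 10e9        # bits per second

# In-flight window needed to fill the pipe: bandwidth-delay product / segment size.
w_segments = TARGET * RTT / (MSS_BYTES * 8)
print(round(w_segments))                      # 83333 in-flight segments

# Mathis model: throughput ≈ 1.22 * MSS / (RTT * sqrt(L)); solve for L.
L = (1.22 * MSS_BYTES * 8 / (RTT * TARGET)) ** 2
print(f"L ≈ {L:.1e}")                         # ≈ 2e-10
```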
Goals for Today
• Transport Layer
 – Abstraction / services
 – Multiplexing/Demultiplexing
 – UDP: Connectionless Transport
 – TCP: Reliable Transport
   • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
 – Congestion control
• Data Center TCP
 – Incast Problem
Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems," A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008.
TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in the transfer
 – SRU size is fixed
• TCP used as the data transfer protocol
Cluster-based Storage Systems
[Figure: a client connects through a switch to storage servers 1–4; a synchronized read issues a request R to each server, each server returns its Server Request Unit (SRU), the four SRUs form one data block, and the client then sends the next batch of requests]
Link idle time due to timeouts
[Figure: the same synchronized read, but server 4's SRU is lost; the client's link is idle until server 4 experiences a timeout]
TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts
[Figure: goodput collapses as the number of servers grows]
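A back-of-the-envelope model shows why timeouts alone are enough to produce the collapse. This is my own toy model, not the paper's methodology, and every parameter value is an assumption:

```python
def goodput_mbps(n_servers, block_bytes=1_000_000, link_gbps=1.0,
                 rto_ms=200, p_loss=0.02):
    """Toy incast model: a synchronized read stalls for a full RTO
    whenever any one of the n servers loses the tail of its SRU."""
    transfer_s = block_bytes * 8 / (link_gbps * 1e9)    # ideal block transfer time
    p_stall = 1 - (1 - p_loss) ** n_servers             # some server times out
    expected_s = transfer_s + p_stall * rto_ms / 1e3    # link idles until the RTO fires
    return block_bytes * 8 / expected_s / 1e6

for n in (1, 4, 16, 64):
    print(n, round(goodput_mbps(n)))  # goodput falls steeply as servers are added
```

Because the whole batch waits for the slowest flow, the probability of paying a 200 ms RTO grows with the server count while the ideal transfer time stays a few milliseconds, so goodput collapses.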
TCP data-driven loss recovery
[Figure: sender transmits segments 1–5; segment 2 is lost; segments 3, 4, and 5 each elicit "Ack 1", giving 3 duplicate ACKs for 1 (packet 2 is probably lost); the sender retransmits packet 2 immediately and then receives Ack 5]
In SANs, recovery in µsecs after loss.
TCP timeout-driven loss recovery
[Figure: sender transmits segments 1–5; segments 2–5 are lost, so only "Ack 1" returns; the sender sits through a full Retransmission Timeout (RTO) before retransmitting]
• Timeouts are expensive (msecs to recover after loss)
TCP Loss recovery comparison
[Figure: side-by-side timelines of data-driven recovery (retransmit 2 after 3 duplicate ACKs, then Ack 5) and timeout-driven recovery (wait out the RTO before retransmitting)]
Data-driven recovery is super fast (µs) in SANs; timeout-driven recovery is slow (ms).
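The µs-vs-ms gap is easy to quantify. The numbers below are assumptions for illustration: a ~100 µs SAN round-trip time and a 200 ms minimum RTO (a common default):

```python
rtt_us = 100                       # assumed SAN round-trip time
rto_ms = 200                       # assumed minimum retransmission timeout

data_driven_us = rtt_us            # ~1 RTT until 3 dup ACKs trigger fast retransmit
timeout_driven_us = rto_ms * 1000  # the link idles for the full RTO

print(timeout_driven_us // data_driven_us)  # timeout recovery ~2000x slower here
```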
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control
instantiation implementation in the Internet UDP TCP
Next timebull Network Layerbull leaving the network
ldquoedgerdquo (application transport layers)
bull into the network ldquocorerdquo
Perspective
Before Next time
bull Project Proposalndash due in one weekndash Meet with groups TA and professor
bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday
bull No required reading and review duebull But review chapter 4 from the book Network Layer
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
-
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++
duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++
duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP Congestion Control
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
bull Timeouts are expensive(msecs to recover after loss)
TCP Loss recovery comparison
Sender Receiver
12345
Ack 1
Ack 1Ack 1Ack 1
Retransmit 2
Seq
Ack 5
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
Timeout driven recovery is slow (ms)
Data-driven recovery issuper fast (us) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control
instantiation implementation in the Internet UDP TCP
Next timebull Network Layerbull leaving the network
ldquoedgerdquo (application transport layers)
bull into the network ldquocorerdquo
Perspective
Before Next time
bull Project Proposalndash due in one weekndash Meet with groups TA and professor
bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday
bull No required reading and review duebull But review chapter 4 from the book Network Layer
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
-
bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send
bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT
W
W2
avg TCP thruput = 34WRTTbytessec
TCP Throughput
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
bull Timeouts are expensive(msecs to recover after loss)
TCP Loss recovery comparison
Sender Receiver
12345
Ack 1
Ack 1Ack 1Ack 1
Retransmit 2
Seq
Ack 5
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
Timeout driven recovery is slow (ms)
Data-driven recovery issuper fast (us) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control
instantiation implementation in the Internet UDP TCP
Next timebull Network Layerbull leaving the network
ldquoedgerdquo (application transport layers)
bull into the network ldquocorerdquo
Perspective
Before Next time
bull Project Proposalndash due in one weekndash Meet with groups TA and professor
bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday
bull No required reading and review duebull But review chapter 4 from the book Network Layer
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
-
TCP over ldquolong fat pipesrdquo
bull example 1500 byte segments 100ms RTT want 10 Gbps throughput
bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L
[Mathis 1997]
to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate
bull new versions of TCP for high-speed
TCP throughput = 122 MSSRTT L
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
bull Timeouts are expensive(msecs to recover after loss)
TCP Loss recovery comparison
Sender Receiver
12345
Ack 1
Ack 1Ack 1Ack 1
Retransmit 2
Seq
Ack 5
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
Timeout driven recovery is slow (ms)
Data-driven recovery issuper fast (us) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control
instantiation implementation in the Internet UDP TCP
Next timebull Network Layerbull leaving the network
ldquoedgerdquo (application transport layers)
bull into the network ldquocorerdquo
Perspective
Before Next time
bull Project Proposalndash due in one weekndash Meet with groups TA and professor
bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday
bull No required reading and review duebull But review chapter 4 from the book Network Layer
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
-
Goals for Todaybull Transport Layer
ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport
bull Abstraction Connection Management Reliable Transport Flow Control timeouts
ndash Congestion control
bull Data Center TCPndash Incast Problem
Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
bull Timeouts are expensive(msecs to recover after loss)
TCP Loss recovery comparison
Sender Receiver
12345
Ack 1
Ack 1Ack 1Ack 1
Retransmit 2
Seq
Ack 5
Sender Receiver
123
4
5
1
RetransmissionTimeout(RTO)
Ack 1
Seq
Timeout driven recovery is slow (ms)
Data-driven recovery issuper fast (us) in SANs
TCP Throughput Collapse Summary
bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)
bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-
network signaling and end-to-end TCP modifications
principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control
instantiation implementation in the Internet UDP TCP
Next timebull Network Layerbull leaving the network
ldquoedgerdquo (application transport layers)
bull into the network ldquocorerdquo
Perspective
Before Next time
bull Project Proposalndash due in one weekndash Meet with groups TA and professor
bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday
bull No required reading and review duebull But review chapter 4 from the book Network Layer
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule
- Transport Layer and Data Center TCP
- Goals for Today
- Transport Layer ServicesProtocols
- Transport Layer ServicesProtocols (2)
- Transport Layer ServicesProtocols (3)
- Transport Layer ServicesProtocols (4)
- Goals for Today (2)
- Transport Layer
- Goals for Today (3)
- UDP Connectionless Transport
- UDP Connectionless Transport (2)
- Goals for Today (4)
- Principles of Reliable Transport
- Principles of Reliable Transport (2)
- Principles of Reliable Transport (3)
- Principles of Reliable Transport (4)
- TCP Reliable Transport
- TCP Reliable Transport (2)
- TCP Reliable Transport (3)
- TCP Reliable Transport (4)
- TCP Reliable Transport (5)
- TCP Reliable Transport (6)
- TCP Reliable Transport (7)
- TCP Reliable Transport (8)
- TCP Reliable Transport (9)
- TCP Reliable Transport (10)
- TCP Reliable Transport (11)
- TCP Reliable Transport (12)
- TCP Reliable Transport (13)
- TCP Reliable Transport (14)
- Reliable Transport
- TCP Reliable Transport (15)
- TCP Reliable Transport (16)
- TCP Reliable Transport (17)
- TCP Reliable Transport (18)
- TCP Reliable Transport (19)
- TCP Reliable Transport (20)
- TCP Reliable Transport (21)
- TCP Reliable Transport (22)
- Goals for Today (5)
- Principles of Congestion Control
- Principles of Congestion Control (2)
- TCP Congestion Control
- TCP Congestion Control (2)
- TCP Congestion Control (3)
- TCP Congestion Control (4)
- TCP Congestion Control (5)
- TCP Congestion Control (6)
- TCP Congestion Control (7)
- TCP Congestion Control (8)
- TCP Congestion Control (9)
- TCP Throughput
- TCP over ldquolong fat pipesrdquo
- Goals for Today (6)
- TCP Throughput Collapse
- Cluster-based Storage Systems
- Link idle time due to timeouts
- TCP Throughput Collapse Incast
- TCP data-driven loss recovery
- TCP timeout-driven loss recovery
- TCP Loss recovery comparison
- TCP Throughput Collapse Summary
- Perspective
- Before Next time
-
TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster
bull Client performs synchronized reads
bull Increase of servers involved in transferndash SRU size is fixed
bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008
Cluster-based Storage Systems
Client Switch
Storage Servers
RR
RR
1
2
Data Block
Server Request Unit(SRU)
3
4
Synchronized Read
Client now sendsnext batch of requests
1 2 3 4
Link idle time due to timeouts
Client Switch
RR
RR
1
2
3
4
Synchronized Read
4
Link is idle until server experiences a timeout
1 2 3 4 Server Request Unit(SRU)
TCP Throughput Collapse Incast
bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts
Collapse
TCP data-driven loss recovery
Sender Receiver
123
4
5
Ack 1
Ack 1
Ack 1
Ack 1
3 duplicate ACKs for 1(packet 2 is probably lost)
2
Seq
Retransmit packet 2 immediately
In SANsrecovery in usecsafter loss
Ack 5
TCP timeout-driven loss recovery
[Diagram: the sender transmits segments 1–5, but only segment 1 gets through, so too few duplicate ACKs arrive to trigger fast retransmit. The sender must wait out the retransmission timeout (RTO) before retransmitting.]
• Timeouts are expensive (milliseconds to recover after a loss)
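Why milliseconds? The RTO is derived from smoothed RTT estimates and clamped to a minimum (RTOmin). The sketch below follows the standard TCP estimator (RFC 6298); the 200 ms RTO_MIN mirrors a common Linux default and is an assumption for illustration.

```python
# Why timeouts are expensive: RTO = SRTT + 4*RTTVAR, clamped to RTOmin.
# On a microsecond-RTT SAN, the clamp dominates, so one timeout costs
# milliseconds on a network where transfers finish in microseconds.
ALPHA, BETA, K = 1 / 8, 1 / 4, 4   # RFC 6298 estimator gains
RTO_MIN = 0.200                     # seconds; common Linux default (assumed)

def rto_after_samples(rtt_samples):
    srtt = rttvar = None
    for rtt in rtt_samples:
        if srtt is None:
            srtt, rttvar = rtt, rtt / 2                       # first measurement
        else:
            rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - rtt)
            srtt = (1 - ALPHA) * srtt + ALPHA * rtt
    return max(RTO_MIN, srtt + K * rttvar)

# On a SAN with ~100 us RTTs, the computed RTO is pure RTOmin:
print(rto_after_samples([100e-6] * 10))  # 0.2
```

This is also why the later slide calls reducing RTOmin "unsafe": the clamp exists to tolerate delayed ACKs and RTT variance, and shrinking it risks spurious retransmissions on wide-area paths.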
TCP loss recovery comparison
[Diagrams, side by side: data-driven recovery, where duplicate ACKs for 1 trigger an immediate retransmit of segment 2, versus timeout-driven recovery, where the sender idles for the full RTO.]
• Timeout-driven recovery is slow (milliseconds)
• Data-driven recovery is super fast (microseconds) in SANs
TCP Throughput Collapse: Summary
• Synchronized reads and TCP timeouts cause TCP throughput collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet flow control (limited applicability)
• DCTCP (Data Center TCP)
  – Limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications
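DCTCP's end-host side can be sketched in a few lines: switches ECN-mark packets once the queue exceeds a threshold, and the sender keeps an EWMA `alpha` of the fraction of marked packets per window, cutting its congestion window in proportion to `alpha` instead of TCP's fixed halving. Parameter names here are ours, not from a real TCP stack.

```python
# Hedged sketch of DCTCP's sender reaction to ECN marks.
# alpha estimates the *extent* of congestion (fraction of marked
# packets), so mild congestion causes a mild window cut.
G = 1 / 16  # EWMA gain; a typical choice from the DCTCP literature

def dctcp_window_update(cwnd: float, alpha: float, marked_fraction: float):
    """Return (new_cwnd, new_alpha) after one window's worth of ACKs."""
    alpha = (1 - G) * alpha + G * marked_fraction  # update congestion estimate
    cwnd = cwnd * (1 - alpha / 2)                  # proportional cut, not halving
    return cwnd, alpha

# Light marking barely shrinks the window; classic TCP would have halved it.
cwnd, alpha = dctcp_window_update(cwnd=100.0, alpha=0.0, marked_fraction=0.1)
print(cwnd, alpha)
```

The proportional cut is what keeps switch queues short without sacrificing throughput: flows back off just enough to drain the queue, leaving headroom for incast bursts instead of filling the buffer and forcing timeouts.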
Perspective
• Principles behind transport layer services: multiplexing/demultiplexing, reliable data transfer, flow control, congestion control
• Instantiation and implementation in the Internet: UDP, TCP

Next time: Network Layer
• Leaving the network "edge" (application and transport layers)
• Into the network "core"
Before Next time
bull Project Proposalndash due in one weekndash Meet with groups TA and professor
bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday
bull No required reading and review duebull But review chapter 4 from the book Network Layer
ndash We will also briefly discuss data center topologiesndash
bull Check website for updated schedule