Xin Liu
Transport Layer
Our goals:
• understand principles behind transport layer services:
– multiplexing/demultiplexing
– reliable data transfer
– flow control
– congestion control
• learn about transport layer protocols in the Internet:
– UDP: connectionless transport
– TCP: connection-oriented transport
– TCP congestion control
Ref: slides by J. Kurose and K. Ross
Outline
• Transport-layer services
• Multiplexing and demultiplexing
• Connectionless transport: UDP
• Connection-oriented transport: TCP
– segment structure
– reliable data transfer
– flow control
– connection management
• TCP congestion control
Transport services and protocols
• provide logical communication between app processes running on different hosts
• transport protocols run in end systems
– send side: breaks app messages into segments, passes to network layer
– rcv side: reassembles segments into messages, passes to app layer
• more than one transport protocol available to apps
– Internet: TCP and UDP
[Figure: two end systems with full protocol stacks (application, transport, network, data link, physical) connected through routers (network, data link, physical); logical end-end transport]
Transport vs. network layer
• network layer: logical communication between hosts
• transport layer: logical communication between processes
– relies on, enhances, network layer services
Household analogy:
12 kids sending letters to 12 kids
• processes = kids
• app messages = letters in envelopes
• hosts = houses
• transport protocol = Ann and Bill, who collect and distribute the letters within each house
• network-layer protocol = postal service
Internet transport-layer protocols
• reliable, in-order delivery: TCP
– congestion control
– flow control
– connection setup
• unreliable, unordered delivery: UDP
– no-frills extension of “best-effort” IP
• services not available:
– delay guarantees
– bandwidth guarantees
[Figure: end-system and router protocol stacks; logical end-end transport]
Outline
• Transport-layer services
• Multiplexing and demultiplexing
• Connectionless transport: UDP
• Principles of reliable data transfer
• Connection-oriented transport: TCP
– segment structure
– reliable data transfer
– flow control
– connection management
• Principles of congestion control
• TCP congestion control
Multiplexing/demultiplexing
Demultiplexing at rcv host: delivering received segments to correct socket
Multiplexing at send host: gathering data from multiple sockets, enveloping data with header (later used for demultiplexing)
[Figure: host 1, host 2, and host 3, each with an application/transport/network/link/physical stack; processes P1 to P4 attach to the transport layer through sockets]
How demultiplexing works
• host receives IP datagrams
– each datagram has source IP address, destination IP address
– each datagram carries 1 transport-layer segment
– each segment has source, destination port number (recall: well-known port numbers for specific applications)
• host uses IP addresses & port numbers to direct segment to appropriate socket
[Figure: TCP/UDP segment format, 32 bits wide: source port #, dest port #; other header fields; application data (message)]
Connectionless demultiplexing
• Create sockets:
    sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
    bind(sock, (struct sockaddr *)&addr, sizeof(addr));
    /* dest, src, and srclen are placeholder names; the slide omits the address arguments */
    sendto(sock, buffer, size, 0, (struct sockaddr *)&dest, sizeof(dest));
    recvfrom(sock, buffer, buffersize, 0, (struct sockaddr *)&src, &srclen);
• UDP socket identified by two-tuple:
(dest IP address, dest port number)
• When host receives UDP segment:
– checks destination port number in segment
– directs UDP segment to socket with that port number
• IP datagrams with different source IP addresses and/or source port numbers are directed to the same socket
Connection-oriented demux
• TCP socket identified by 4-tuple:
– source IP address
– source port number
– dest IP address
– dest port number
• recv host uses all four values to direct segment to appropriate socket
• Server host may support many simultaneous TCP sockets:
– each socket identified by its own 4-tuple
• Web servers have different sockets for each connecting client
– non-persistent HTTP will have different socket for each request
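The difference between the two lookup keys can be sketched in a few lines of Python. This is a toy model under stated assumptions, not kernel code: the dictionaries `udp_sockets` and `tcp_sockets`, the function names, and all addresses are made up for illustration.

```python
# Toy demultiplexing tables (hypothetical names; real stacks do this in the kernel).
udp_sockets = {}   # key: (dest IP, dest port)
tcp_sockets = {}   # key: (source IP, source port, dest IP, dest port)

def demux_udp(src_ip, src_port, dst_ip, dst_port):
    # UDP ignores the source fields when choosing the socket.
    return udp_sockets.get((dst_ip, dst_port))

def demux_tcp(src_ip, src_port, dst_ip, dst_port):
    # TCP uses all four values, so each client connection gets its own socket.
    return tcp_sockets.get((src_ip, src_port, dst_ip, dst_port))

udp_sockets[("10.0.0.1", 53)] = "udp-socket-A"
tcp_sockets[("10.0.0.7", 5000, "10.0.0.1", 80)] = "tcp-socket-client-1"
tcp_sockets[("10.0.0.8", 5000, "10.0.0.1", 80)] = "tcp-socket-client-2"

# Two different senders reach the same UDP socket.
same_udp = demux_udp("10.0.0.7", 1111, "10.0.0.1", 53) == demux_udp("10.0.0.8", 2222, "10.0.0.1", 53)
# Two TCP clients of the same server port reach different sockets.
different_tcp = demux_tcp("10.0.0.7", 5000, "10.0.0.1", 80) != demux_tcp("10.0.0.8", 5000, "10.0.0.1", 80)
print(same_udp, different_tcp)
```

Keying the TCP table on the full 4-tuple is what lets one server port serve many clients at once.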
UDP: User Datagram Protocol [RFC 768]
• “no frills,” “bare bones” Internet transport protocol
• “best effort” service, UDP segments may be:
– lost
– delivered out of order to app
• connectionless:
– no handshaking between UDP sender, receiver
– each UDP segment handled independently of others
Why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small segment header
• no congestion control: UDP can blast away as fast as desired
DHCP client-server scenario
DHCP server: 223.1.2.5; arriving client
• DHCP discover: src: 0.0.0.0, 68; dest: 255.255.255.255, 67; yiaddr: 0.0.0.0; transaction ID: 654
• DHCP offer: src: 223.1.2.5, 67; dest: 255.255.255.255, 68; yiaddr: 223.1.2.4; transaction ID: 654; lifetime: 3600 secs
• DHCP request: src: 0.0.0.0, 68; dest: 255.255.255.255, 67; yiaddr: 223.1.2.4; transaction ID: 655; lifetime: 3600 secs
• DHCP ACK: src: 223.1.2.5, 67; dest: 255.255.255.255, 68; yiaddr: 223.1.2.4; transaction ID: 655; lifetime: 3600 secs
Applications and protocols
Application              App-layer protocol   Transport protocol
E-mail                   SMTP                 TCP
Remote terminal access   Telnet               TCP
Web                      HTTP                 TCP
File transfer            FTP                  TCP
Streaming                proprietary          typically UDP
IP-phone                 proprietary          typically UDP
Routing                  RIP                  typically UDP
Name translation         DNS                  typically UDP
Dynamic IP               DHCP                 typically UDP
Network mng.             SNMP                 typically UDP
UDP: more
• often used for streaming multimedia apps
– loss tolerant
– rate sensitive
• reliable transfer over UDP: add reliability at application layer
– application-specific error recovery!
[Figure: UDP segment format, 32 bits wide: source port #, dest port #; length, checksum; application data (message). Length is in bytes of the UDP segment, including header]
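As a rough illustration of the segment format, the four 16-bit header fields can be packed with Python's standard `struct` module. The function name `build_udp_segment` and the sample ports and payload are illustrative; the checksum is left at 0 here as a placeholder (in UDP over IPv4, a zero checksum field means the sender did not compute one).

```python
import struct

def build_udp_segment(src_port, dst_port, payload, checksum=0):
    # Length counts the 8-byte header plus the payload, in bytes.
    length = 8 + len(payload)
    # "!HHHH": network byte order, four unsigned 16-bit fields.
    header = struct.pack("!HHHH", src_port, dst_port, length, checksum)
    return header + payload

segment = build_udp_segment(68, 67, b"hello")
src, dst, length, checksum = struct.unpack("!HHHH", segment[:8])
print(src, dst, length, checksum)
```

The small fixed header is one of the reasons listed for UDP's existence: only 8 bytes of overhead per segment.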
Checksum
• Goal: detect “errors” (e.g., flipped bits) in transmitted segment
• Computed over:
– UDP header and data
– pseudo header: source/dest IP address, protocol, length
• Same procedure for TCP
UDP checksum
Sender:
• treat segment contents as sequence of 16-bit integers
• checksum: 1’s complement of the 1’s complement sum of segment contents
• sender puts checksum value into UDP checksum field
Receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
– NO - error detected
– YES - no error detected. But maybe errors nonetheless?
– may pass the damaged data
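A minimal sketch of the checksum arithmetic, assuming the input is just a byte string (the pseudo header is left out for brevity). The function names are illustrative. The last check shows one way errors can slip through: swapping two 16-bit words leaves the sum unchanged.

```python
def ones_complement_sum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"                          # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]    # next 16-bit word, big-endian
        total = (total & 0xFFFF) + (total >> 16) # wrap any carry back into the low bits
    return total

def internet_checksum(data: bytes) -> int:
    # Checksum is the 1's complement of the 1's complement sum.
    return ~ones_complement_sum(data) & 0xFFFF

def verify(data: bytes, checksum: int) -> bool:
    # Data sum plus checksum is all ones when no error is detected.
    return ones_complement_sum(data) + checksum == 0xFFFF

data = b"\x45\x00\x00\x1c"
checksum = internet_checksum(data)
print(hex(checksum))
print(verify(data, checksum))                    # intact data
print(verify(b"\x45\x00\x00\x1d", checksum))     # one flipped bit
print(verify(b"\x00\x1c\x45\x00", checksum))     # two words swapped
```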
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• full duplex data:
– bi-directional data flow in same connection
– MSS: maximum segment size
• connection-oriented:
– handshaking (exchange of control msgs) init’s sender, receiver state before data exchange
• flow controlled:
– sender will not overwhelm receiver
• point-to-point:
– one sender, one receiver
• reliable, in-order byte stream:
– no “message boundaries”
• pipelined:
– TCP congestion and flow control set window size
• send & receive buffers
[Figure: application writes data through the socket door into the TCP send buffer; segments carry it to the TCP receive buffer; application reads data through the socket door]
TCP segment structure
[Figure: TCP segment, 32 bits wide, with these fields]
• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F:
– URG: urgent data (generally not used)
– ACK: ACK # valid
– PSH: push data now
– RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• urgent data pointer
• options (variable length)
• application data (variable length)
TCP Connection Management
Recall: TCP sender, receiver establish “connection” before exchanging data segments
• initialize TCP variables:
– seq. #s
– buffers, flow control info (e.g. RcvWindow)
• client: connection initiator
– connect();
• server: contacted by client
– accept();
Three way handshake:
Step 1: client host sends TCP SYN segment to server
– specifies initial seq #
– no data
Step 2: server host receives SYN, replies with SYNACK segment
– server allocates buffers
– specifies server initial seq. #
Step 3: client receives SYNACK, replies with ACK segment, which may contain data
TCP Connection Management (cont.)
Closing a connection:
client closes socket: close();
Step 1: client end system sends TCP FIN control segment to server
Step 2: server receives FIN, replies with ACK. Closes connection, sends FIN.
[Figure: client sends FIN; server replies with ACK, then sends its own FIN; client replies with ACK and enters timed wait before closed]
TCP Connection Management (cont.)
Step 3: client receives FIN, replies with ACK.
– Enters “timed wait” - will respond with ACK to received FINs
Step 4: server receives ACK. Connection closed.
Note: with small modification, can handle simultaneous FINs.
[Figure: same FIN/ACK exchange with client states FIN_WAIT_1, FIN_WAIT_2, and TIME_WAIT (timed wait) before closed]
TCP Connection Management (cont.)
[Figures: TCP client lifecycle and TCP server lifecycle state diagrams]
TCP Connection Management
• Allow half-close, i.e., one end terminates its output but can still receive data
• Allow simultaneous open
• Allow simultaneous close
• Crashes?
[root@shannon liu]# tcpdump -S tcp port 22
tcpdump: listening on eth0
23:01:51.363983 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: S 3036713598:3036713598(0) win 5840 <mss 1460,sackOK,timestamp 13989220 0,nop,wscale 0> (DF)
23:01:51.364829 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: S 2462279815:2462279815(0) ack 3036713599 win 24616 <nop,nop,timestamp 626257407 13989220,nop,wscale 0,nop,nop,sackOK,mss 1460> (DF)
23:01:51.364844 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: . ack 2462279816 win 5840 <nop,nop,timestamp 13989220 626257407> (DF)
23:01:51.375451 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: P 2462279816:2462279865(49) ack 3036713599 win 24616 <nop,nop,timestamp 626257408 13989220> (DF)
23:01:51.375478 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: . ack 2462279865 win 5840 <nop,nop,timestamp 13989221 626257408> (DF)
23:01:51.379319 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: P 3036713599:3036713621(22) ack 2462279865 win 5840 <nop,nop,timestamp 13989221 626257408> (DF)
23:01:51.379570 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: . ack 3036713621 win 24616 <nop,nop,timestamp 626257408 13989221>(DF)
23:01:51.941616 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: P 3036714373:3036714437(64) ack 2462281065 win 7680 <nop,nop,timestamp 13989277 626257462> (DF)
23:01:51.952442 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: P 2462281065:2462282153(1088) ack 3036714437 win 24616 <nop,nop,timestamp 626257465 13989277> (DF)
23:01:51.991682 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: . ack 2462282153 win 9792 <nop,nop,timestamp 13989283 626257465> (DF)
23:01:54.699597 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: F 3036714437:3036714437(0) ack 2462282153 win 9792 <nop,nop,timestamp 13989553 626257465> (DF)
23:01:54.699880 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: . ack 3036714438 win 24616 <nop,nop,timestamp 626257740 13989553>(DF)
23:01:54.701129 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: F 2462282153:2462282153(0) ack 3036714438 win 24616 <nop,nop,timestamp 626257740 13989553> (DF)
23:01:54.701143 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: . ack 2462282154 win 9792 <nop,nop,timestamp 13989553 626257740> (DF)
26 packets received by filter
0 packets dropped by kernel
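The ACK numbers in this trace can be checked by hand: a SYN or FIN consumes one sequence number, and data consumes one per byte. The short sketch below replays that arithmetic on values copied from the trace; the helper name `next_expected` is illustrative, and the modulo reflects the 32-bit sequence space.

```python
def next_expected(seq, data_len, syn=False, fin=False):
    # SYN and FIN each occupy one sequence number; data occupies one per byte.
    return (seq + data_len + int(syn) + int(fin)) % 2**32

# (seq, data length, SYN?, FIN?, ACK number seen in the reply), copied from the trace.
trace = [
    (3036713598, 0,    True,  False, 3036713599),  # client SYN
    (2462279815, 0,    True,  False, 2462279816),  # server SYN
    (2462279816, 49,   False, False, 2462279865),  # server sends 49 bytes
    (3036713599, 22,   False, False, 3036713621),  # client sends 22 bytes
    (2462281065, 1088, False, False, 2462282153),  # server sends 1088 bytes
    (3036714437, 0,    False, True,  3036714438),  # client FIN
    (2462282153, 0,    False, True,  2462282154),  # server FIN
]
results = [next_expected(seq, n, syn, fin) == ack for seq, n, syn, fin, ack in trace]
print(results)
```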
TCP seq. #’s and ACKs
Seq. #’s:
– byte stream “number” of first byte in segment’s data
ACKs:
– seq # of next byte expected from other side
– cumulative ACK
Q: how receiver handles out-of-order segments
– A: TCP spec doesn’t say; up to implementor
[Figure: simple telnet scenario between Host A and Host B]
• User at Host A types ‘C’: A sends Seq=42, ACK=79, data = ‘C’
• Host B ACKs receipt of ‘C’, echoes back ‘C’: B sends Seq=79, ACK=43, data = ‘C’
• Host A ACKs receipt of echoed ‘C’: A sends Seq=43, ACK=80
TCP Round Trip Time and Timeout
Q: how to set TCP timeout value?
• longer than RTT
– but RTT varies
• too short: premature timeout
– unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
– ignore retransmissions
• SampleRTT will vary, want estimated RTT “smoother”
– average several recent measurements, not just current SampleRTT
TCP Round Trip Time and Timeout
EstimatedRTT = (1 - α)*EstimatedRTT + α*SampleRTT
• Exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
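To see why the influence of old samples decays exponentially, unroll the recursion: a sample taken k updates ago carries weight α(1 - α)^k in the current estimate. The sketch below shows both the update and those weights. The starting estimate and the SampleRTT values are made-up numbers in milliseconds, and the function names are illustrative.

```python
ALPHA = 0.125  # typical value from the slide

def update_estimated_rtt(estimated_rtt, sample_rtt, alpha=ALPHA):
    # One step of the exponential weighted moving average.
    return (1 - alpha) * estimated_rtt + alpha * sample_rtt

def sample_weight(age, alpha=ALPHA):
    # Weight in the current estimate of a sample taken `age` updates ago.
    return alpha * (1 - alpha) ** age

estimate = 100.0                          # hypothetical starting estimate (ms)
for sample in [120.0, 120.0, 120.0]:      # hypothetical SampleRTT values (ms)
    estimate = update_estimated_rtt(estimate, sample)
weights = [sample_weight(k) for k in range(4)]
print(estimate)
print(weights)
```

Each weight is 0.875 times the previous one, so the estimate moves toward new samples gradually instead of jumping.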
Example RTT estimation
[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; x-axis: time (seconds); y-axis: RTT (milliseconds), 100 to 350; curves: SampleRTT and EstimatedRTT]
TCP Round Trip Time and Timeout
Setting the timeout
• EstimatedRTT plus “safety margin”
– large variation in EstimatedRTT -> larger safety margin
• first estimate of how much SampleRTT deviates from EstimatedRTT:
DevRTT = (1 - β)*DevRTT + β*|SampleRTT - EstimatedRTT|
(typically, β = 0.25)
Then set timeout interval:
TimeoutInterval = EstimatedRTT + 4*DevRTT
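A small sketch of the timeout computation, under two assumptions the slide does not state: DevRTT starts at half the first sample, and DevRTT is updated with the EstimatedRTT from before the new sample (both follow RFC 6298). The class name and sample values are illustrative; times are in milliseconds.

```python
class RttEstimator:
    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha, self.beta = alpha, beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2      # assumed initial value

    def update(self, sample_rtt):
        # Assumption: deviation uses the estimate from before this sample.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        # EstimatedRTT plus a safety margin of four deviations.
        return self.estimated_rtt + 4 * self.dev_rtt

est = RttEstimator(first_sample=100.0)
initial_timeout = est.timeout_interval()
steady_timeout = est.update(100.0)   # sample matches the estimate: margin shrinks
jumpy_timeout = est.update(200.0)    # sample far from the estimate: margin grows
print(initial_timeout, steady_timeout, jumpy_timeout)
```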
RTT
• Timestamp can be used to measure RTT for each segment
• Better RTT estimate
• NO synchronization required
TCP reliable data transfer
• TCP creates reliable service on top of IP’s unreliable service
• Pipelined segments
• Cumulative acks
• TCP uses single retransmission timer
• Retransmissions are triggered by:
– timeout events
– duplicate acks
• Initially consider simplified TCP sender:
– ignore duplicate acks
– ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• Create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running (think of timer as for oldest unacked segment)
• expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
Ack rcvd:
• If acknowledges previously unacked segments
– update what is known to be acked
– start timer if there are outstanding segments
TCP sender (simplified)

NextSeqNum = InitialSeqNum
SendBase = InitialSeqNum

loop (forever) {
  switch(event)

  event: data received from application above
    create TCP segment with sequence number NextSeqNum
    if (timer currently not running)
      start timer
    pass segment to IP
    NextSeqNum = NextSeqNum + length(data)

  event: timer timeout
    retransmit not-yet-acknowledged segment with smallest sequence number
    start timer

  event: ACK received, with ACK field value of y
    if (y > SendBase) {
      SendBase = y
      if (there are currently not-yet-acknowledged segments)
        start timer
    }
} /* end of loop forever */

Comment:
• SendBase-1: last cumulatively ack’ed byte
Example:
• SendBase-1 = 71; y = 73, so the rcvr wants 73+; y > SendBase, so that new data is acked
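The event loop above can be turned into a small runnable sketch. This is a simplification under stated assumptions: the timer is only a boolean flag with no real clock, "pass segment to IP" is modeled by appending to a list, and sequence-number wraparound is ignored. The class and attribute names are illustrative.

```python
class SimplifiedTcpSender:
    def __init__(self, initial_seq):
        self.next_seq_num = initial_seq
        self.send_base = initial_seq
        self.unacked = {}            # seq # -> data of not-yet-acknowledged segments
        self.timer_running = False   # stand-in for the single retransmission timer
        self.sent = []               # log of (seq #, length) passed to IP

    def on_app_data(self, data):
        seq = self.next_seq_num
        self.unacked[seq] = data
        if not self.timer_running:
            self.timer_running = True
        self.sent.append((seq, len(data)))
        self.next_seq_num += len(data)

    def on_timeout(self):
        if not self.unacked:
            return
        seq = min(self.unacked)      # smallest not-yet-acknowledged seq #
        self.sent.append((seq, len(self.unacked[seq])))
        self.timer_running = True    # restart timer

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y
            # Cumulative ACK: drop every segment that ends at or before y.
            self.unacked = {s: d for s, d in self.unacked.items() if s + len(d) > y}
            self.timer_running = bool(self.unacked)

sender = SimplifiedTcpSender(initial_seq=92)
sender.on_app_data(b"a" * 8)     # Seq=92, 8 bytes
sender.on_app_data(b"b" * 20)    # Seq=100, 20 bytes
sender.on_timeout()              # retransmits only Seq=92
sender.on_ack(120)               # one cumulative ACK covers both segments
print(sender.sent, sender.send_base, sender.timer_running)
```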
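TCP: retransmission scenarios
Lost ACK scenario (Host A sends, Host B receives):
• A sends Seq=92, 8 bytes data
• B replies ACK=100, which is lost (X)
• A’s timer for Seq=92 times out; A retransmits Seq=92, 8 bytes data
• B replies ACK=100 again; SendBase = 100
Premature timeout:
• A sends Seq=92, 8 bytes data, then Seq=100, 20 bytes data
• B replies ACK=100, then ACK=120
• A’s timer for Seq=92 times out before the ACKs arrive; A retransmits Seq=92, 8 bytes data
• ACK=100 arrives: SendBase = 100; ACK=120 arrives: SendBase = 120
• B answers the retransmission with ACK=120; SendBase = 120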
TCP retransmission scenarios (more)
Cumulative ACK scenario:
• A sends Seq=92, 8 bytes data, then Seq=100, 20 bytes data
• B’s ACK=100 is lost (X)
• B’s ACK=120 arrives before the timeout, so nothing is retransmitted; SendBase = 120
TCP ACK generation [RFC 1122, RFC 2581]
Event at receiver -> TCP receiver action
• Arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed -> Delayed ACK. Wait up to 500 ms for next segment. If no next segment, send ACK
• Arrival of in-order segment with expected seq #. One other segment has ACK pending -> Immediately send single cumulative ACK, ACKing both in-order segments
• Arrival of out-of-order segment with higher-than-expected seq. #. Gap detected -> Immediately send duplicate ACK, indicating seq. # of next expected byte
• Arrival of segment that partially or completely fills gap -> Immediately send ACK, provided that segment starts at lower end of gap
TCP Flow Control
• receive side of TCP connection has a receive buffer
• app process may be slow at reading from buffer
• flow control: sender won’t overflow receiver’s buffer by transmitting too much, too fast
• speed-matching service: matching the send rate to the receiving app’s drain rate
TCP Flow control: how it works
(Suppose TCP receiver discards out-of-order segments)
• spare room in buffer = RcvWindow = RcvBuffer - [LastByteRcvd - LastByteRead]
• Rcvr advertises spare room by including value of RcvWindow in segments
• Sender limits unACKed data to RcvWindow
– guarantees receive buffer doesn’t overflow
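The window arithmetic can be written out directly. In this sketch the receiver-side function follows the RcvWindow formula above; the sender-side check `can_send` is a hypothetical helper that keeps unACKed data within the advertised window. All byte counts are made-up values.

```python
def rcv_window(rcv_buffer, last_byte_rcvd, last_byte_read):
    # Spare room = buffer size minus data received but not yet read by the app.
    return rcv_buffer - (last_byte_rcvd - last_byte_read)

def can_send(n_bytes, last_byte_sent, last_byte_acked, window):
    # Sender keeps unACKed data (in flight plus the new segment) within RcvWindow.
    return (last_byte_sent - last_byte_acked) + n_bytes <= window

window = rcv_window(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
small_ok = can_send(1000, last_byte_sent=5000, last_byte_acked=4000, window=window)
large_ok = can_send(1460, last_byte_sent=5000, last_byte_acked=4000, window=window)
print(window, small_ok, large_ok)
```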
More
• Slow receiver
– Ack new window
• Long fat pipeline: high speed link and/or long RTT
• Window scale option during handshaking
Header
[Figure: TCP segment header, 32 bits wide: source port #, dest port #; sequence number; acknowledgement number; head len, not used, flag bits U A P R S F, receive window; checksum, urg data pointer; options (variable length); application data (variable length)]
Principles of Congestion Control
Congestion:
• informally: “too many sources sending too much data too fast for network to handle”
• different from flow control!
• Who benefits?
• manifestations:
– lost packets (buffer overflow at routers)
– long delays (queueing in router buffers)
• a top-10 problem!
TCP Congestion Control
• end-end control (no network assistance)
• sender limits transmission:
LastByteSent - LastByteAcked ≤ cwnd
• Roughly,
rate = cwnd/RTT Bytes/sec
• cwnd is dynamic, function of perceived network congestion
How does sender perceive congestion?
• loss event = timeout or 3 duplicate acks
• TCP sender reduces rate (cwnd) after loss event
mechanisms:
– slow start
– congestion avoidance
– AIMD
TCP Slow Start
• When connection begins, cwnd = 1 MSS
– Example: MSS = 500 bytes & RTT = 200 msec
– initial rate = 20 kbps
• available bandwidth may be >> MSS/RTT
– desirable to quickly ramp up to respectable rate
• When connection begins, increase cwnd when an ack is received
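The 20 kbps figure follows from rate = cwnd/RTT with cwnd = 1 MSS: 500 bytes is 4000 bits, sent once per 0.2 s. A quick check, with an illustrative helper name:

```python
def rate_bps(cwnd_bytes, rtt_seconds):
    # Roughly one window of data is delivered per round-trip time.
    return cwnd_bytes * 8 / rtt_seconds

initial_rate = rate_bps(cwnd_bytes=500, rtt_seconds=0.2)
print(initial_rate / 1000, "kbps")
```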
TCP Slow Start (more)
• When connection begins, increase rate exponentially until first loss event:
– incrementing cwnd for every ACK received
– double cwnd every RTT
• Summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments]
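The link between "increment cwnd for every ACK" and "double cwnd every RTT" can be seen in a short simulation. It assumes no loss and one ACK per segment (no delayed ACKs); the function name is illustrative, and cwnd is counted in MSS unless another `mss` is given.

```python
def slow_start(rtts, mss=1):
    cwnd = mss                       # connection begins with cwnd = 1 MSS
    history = [cwnd]
    for _ in range(rtts):
        acks = cwnd // mss           # one ACK per segment sent in this RTT
        for _ in range(acks):
            cwnd += mss              # +1 MSS for every ACK received
        history.append(cwnd)
    return history

growth = slow_start(3)
print(growth)
```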
Congestion Avoidance
• ssthresh: when cwnd reaches ssthresh, congestion avoidance begins
• Congestion avoidance: increase cwnd by 1/cwnd (cwnd measured in MSS) each time an ACK is received, i.e., about 1 MSS per RTT
• When congestion happens: ssthresh = max(2 MSS, cwnd/2)
TCP AIMD
• multiplicative decrease: cut cwnd in half after loss event
• additive increase: increase cwnd by 1 MSS every RTT in the absence of loss events: probing
[Figure: congestion window of a long-lived TCP connection; x-axis: time; y-axis: congestion window (8, 16, 24 Kbytes)]
Reno vs. Tahoe
• After 3 dup ACKs (Reno):
– cwnd is cut in half
– window then grows linearly
• But after timeout event:
– cwnd instead set to 1 MSS;
– window then grows exponentially
– to ssthresh, then grows linearly
• Tahoe sets cwnd to 1 MSS after either kind of loss event
Philosophy:
• 3 dup ACKs indicates network capable of delivering some segments
• timeout before 3 dup ACKs is “more alarming”
Summary: TCP Congestion Control
• When cwnd is below ssthresh, sender is in slow-start phase, window grows exponentially.
• When cwnd is above ssthresh, sender is in congestion-avoidance phase, window grows linearly.
• When a triple duplicate ACK occurs, ssthresh is set to cwnd/2 and cwnd is set to ssthresh.
• When timeout occurs, ssthresh is set to cwnd/2 and cwnd is set to 1 MSS.
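The four rules in this summary fit in a small state-update sketch. It keeps cwnd and ssthresh in units of MSS, applies the max(2 MSS, cwnd/2) floor from the congestion-avoidance slide, and treats cwnd equal to ssthresh as congestion avoidance. The class name and the starting ssthresh are assumptions for illustration, not values from the slides.

```python
class CongestionWindow:
    """cwnd and ssthresh are kept in units of MSS for simplicity."""

    def __init__(self, ssthresh=8.0):
        self.cwnd = 1.0
        self.ssthresh = ssthresh     # hypothetical starting threshold

    def on_new_ack(self):
        if self.cwnd < self.ssthresh:
            self.cwnd += 1.0                 # slow start: doubles every RTT
        else:
            self.cwnd += 1.0 / self.cwnd     # congestion avoidance: about +1 MSS per RTT

    def on_triple_dup_ack(self):
        self.ssthresh = max(2.0, self.cwnd / 2)
        self.cwnd = self.ssthresh            # halve the window, continue linearly

    def on_timeout(self):
        self.ssthresh = max(2.0, self.cwnd / 2)
        self.cwnd = 1.0                      # restart from slow start

w = CongestionWindow(ssthresh=4.0)
for _ in range(3):
    w.on_new_ack()                   # slow start: 1 -> 2 -> 3 -> 4
after_slow_start = w.cwnd
w.on_new_ack()                       # congestion avoidance: 4 -> 4.25
after_avoidance = w.cwnd
w.on_triple_dup_ack()
after_dup_acks = (w.cwnd, w.ssthresh)
w.on_timeout()
after_timeout = (w.cwnd, w.ssthresh)
print(after_slow_start, after_avoidance, after_dup_acks, after_timeout)
```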
Trend
• Recent research proposes network-assisted congestion control: active queue management
• ECN: explicit congestion notification
– 2 bits: 6 & 7 in the IP TOS field
• RED: random early detection
– Implicit
– Can be adapted to explicit methods by marking instead of dropping
Wireless TCP
• Motivation
– Wireless channels are unreliable and time-varying
– Cause TCP timeout/Duplicate acks
• Approaches