cs 54001-1: large-scale networked systems
DESCRIPTION
CS 54001-1: Large-Scale Networked Systems. Professor: Ian Foster TAs: Xuehai Zhang, Yong Zhao Winter Quarter www.classes.cs.uchicago.edu/classes/archive/2003/winter/54001-1. CS 54001-1 Course Goals. Yes - PowerPoint PPT PresentationTRANSCRIPT
CS 54001-1: Large-Scale Networked Systems
Professor: Ian FosterTAs: Xuehai Zhang, Yong Zhao
Winter Quarter
www.classes.cs.uchicago.edu/classes/archive/2003/winter/54001-1
CS 54001-1 Winter Quarter 2
CS 54001-1 Course Goals Yes
– Gain understanding of fundamental issues that effect design, construction, and operation of large-scale networked systems
– Gain understanding of some significant future trends in network design and use
No– Learn how to write network applications
CS 54001-1 Winter Quarter 3
Remember I ask you to:
– Read Peterson and Davies Ch 1 and 2– Read “End to End Arguments in System
Design”– Use traceroute to determine paths to
following locations & build map of network> ANL, IIT, NWU, UIC, Loyola, UIUC, Purdue, Indiana
CS 54001-1 Winter Quarter 4
Last Week:Internet Design Principles & Protocols
An introduction to the mail system An introduction to the Internet Internet design principles and layering Brief history of the Internet Packet switching and circuit switching Protocols Addressing and routing Performance metrics A detailed FTP example
CS 54001-1 Winter Quarter 5
This Week: Routing and Transport Routing techniques
– Flooding– Distributed Bellman Ford Algorithm– Dijkstra’s Shortest Path First Algorithm
Routing in the Internet– Hierarchy and Autonomous Systems– Interior Routing Protocols: RIP, OSPF– Exterior Routing Protocol: BGP
Transport: achieving reliability Transport: achieving fair sharing of links
CS 54001-1 Winter Quarter 6
Recap:An Introduction to the Internet
Ian Dave
gargoyle.cs.uchicago.edu Athena.MIT.edu
Network Layer
Link Layer
Application Layer
Transport Layer
O.S. O.S.HeaderData HeaderData
HD
HD
HD
HD HD
HD
CS 54001-1 Winter Quarter 7
Characteristics of the Internet Each packet is individually routed No time guarantee for delivery No guarantee of delivery in sequence No guarantee of delivery at all!
– Things get lost– Acknowledgements– Retransmission
> How to determine when to retransmit? Timeout?> Need local copies of contents of each packet.> How long to keep each copy?> What if an acknowledgement is lost?
CS 54001-1 Winter Quarter 8
Characteristics of the Internet (2) No guarantee of integrity of data. Packets can be fragmented. Packets may be duplicated.
CS 54001-1 Winter Quarter 9
Size of the Routing Table at the core of the Internet
Source: http://www.telstra.net/ops/bgptable.html
CS 54001-1 Winter Quarter 10
This Week: Routing and Transport Routing techniques
– Flooding– Distributed Bellman Ford Algorithm– Dijkstra’s Shortest Path First Algorithm
Routing in the Internet– Hierarchy and Autonomous Systems– Interior Routing Protocols: RIP, OSPF– Exterior Routing Protocol: BGP
Transport: achieving reliability Transport: achieving fair sharing of links
CS 54001-1 Winter Quarter 11
The Problem“A” “B”
R1
R2
R3
R4
How does R1 choose a route to host B?
CS 54001-1 Winter Quarter 12
Technique 1: Flooding
Advantages: Every destination in the network is reachable. Useful when network topology is unknown.
Disadvantages: Some routers receive packet multiple times. Packets can go round in loops forever.
Routers forward packets to all portsexcept the ingress port.
R1
CS 54001-1 Winter Quarter 13
Technique 2: Bellman-Ford AlgorithmObjective: Determine the route from (R1, …, R7) to R8
that minimizes the cost.
R5R3
R7
R6R4R2R1
1 1 4
2
4
2 2 3
23
R8
Examples of link cost: Distance, data rate, price,
congestion/delay, …
CS 54001-1 Winter Quarter 14
Solution is simple by inspection... (in this case)
R3
R1
R5
R4
R8
R6
The solution is a spanning tree with R8 as the root of the tree. The Bellman-Ford Algorithm finds the spanning tree automatically.
R2
R7
1 1 4
2
4
2 23
2 3
CS 54001-1 Winter Quarter 15
The Distributed Bellman-Ford Algorithm
1 2 7 8
0
8
( )( , , ,..., ).
1. Let where: cost f rom to .2. Set 3. Every seconds, router sends to its neighbors.4. I f router is told of a lower cost path
to , it updates
n i i
i
X C , C , ..., C C R RX
T i C
iR C
1
1
( )(.)
. Hence, where determines the next step improvement.
5. I f then goto step (3).6. Stop.
n ni
n n
X f Xf
X X
CS 54001-1 Winter Quarter 16
Bellman-Ford Algorithm: Example
R5R3
R7
R8
R6R4R2R1
1 1 4
2
4
2 2 3
2 3
R2
R5R7
R4 R6
R8R3
R11 1 4
2 2 32
42 3
42 3
2R1 Inf
R2 Inf
R3 4, R8
R4 Inf
R5 2, R8
R6 2, R8
R7 3, R8
CS 54001-1 Winter Quarter 17
Bellman-Ford Algorithm
R3
R5R7
R8
R6R4R2R1
1 1 4
222
3
324
6 4 6 2
42 3
R1 5, R2
R2 4, R5
R3 4, R8
R4 5, R2
R5 2, R8
R6 2, R8
R7 3, R8
R1 6, R3
R2 4, R5
R3 4, R8
R4 6, R7
R5 2, R8
R6 2, R8
R7 3, R8
R8
R6R4R2R1
R3R5
1 1
42 3
2
5 4 5 2
4 2 32R7
Solution
CS 54001-1 Winter Quarter 18
Bellman-Ford AlgorithmQuestions:1. How long can the algorithm take to run?2. How do we know that the algorithm
always converges?3. What happens when link costs change, or
when routers/links fail?
CS 54001-1 Winter Quarter 19
A Problem with Bellman-Ford
R4R3R2R11 1 1
“Bad news travels slowly”
Consider the calculation of distances to R4:
…………5,R24,R35,R233,R24,R33,R223,R22,R33,R211, R42,R33,R20R3R2R1Time
“Counting to infinity”
R3 R4 fails
CS 54001-1 Winter Quarter 20
Counting to Infinity ProblemSolutions
1. Set infinity = “some small integer” (e.g., 16) Stop when count = 16
2. Split Horizon: Because R2 received lowest cost path from R3, it does not advertise cost to R3
3. Split-horizon with poison reverse: R2 advertises infinity to R3
CS 54001-1 Winter Quarter 21
Technique 3: Dijkstra’s Shortest Path First Algorithm
Routers send out update messages whenever the state of a link changes. Hence the name: “Link State” algorithm
Each router calculates lowest cost path to all others, starting from itself
At each step of the algorithm, router adds the next shortest (i.e., lowest-cost) path to the tree
Finds spanning tree routed on source router
CS 54001-1 Winter Quarter 22
Dijkstra’s Shortest Path First Algorithm: Example
R6
R8
R6
R8
R5
R8
R5
R5
R7
8 3 5 7 6
8 5
3 7 6 2
8 5 6
3 7 2 4
8 5 6 7
3 2 4
{ }. { , , , }{ , },
{ , , , }.
{ , , },{ , , , }.
{ , , , },{ , , }.
Step 1: Shortest path set, Candidate set, Step 2:
Step 3:
Step 4:
S R C R R R RS R R
C R R R R
S R R RC R R R R
S R R R RC R R R
CS 54001-1 Winter Quarter 23
Dijkstra’s SPF Algorithm
R5
R7
R3
R4R2
R8
R6R11 1
42 3
2
{}.},,,,,,,{ 4127658
CRRRRRRRS :8 Step
CS 54001-1 Winter Quarter 24
This Week: Routing and Transport Routing techniques
– Flooding– Distributed Bellman Ford Algorithm– Dijkstra’s Shortest Path First Algorithm
Routing in the Internet– Hierarchy and Autonomous Systems– Interior Routing Protocols: RIP, OSPF– Exterior Routing Protocol: BGP
Transport: achieving reliability Transport: achieving fair sharing of links
CS 54001-1 Winter Quarter 25
Routing in the Internet The Internet uses hierarchical routing Internet is split into Autonomous Systems (ASs)
– Examples of ASs: Stanford (32), HP (71), MCI Worldcom (17373)– Try: whois –h whois.arin.net ASN “MCI Worldcom”
Within an AS, the administrator chooses an Interior Gateway Protocol (IGP)– Examples of IGPs: RIP (rfc 1058), OSPF (rfc 1247).
Between ASs, the Internet uses an Exterior Gateway Protocol– ASs today use the Border Gateway Protocol, BGP-4 (rfc 1771)
CS 54001-1 Winter Quarter 26
Routing in the Internet
Stub AS Transit ASe.g. backbone service provider
Stub AS
AS ‘A’ AS ‘B’ AS ‘C’
Interior GatewayProtocol
Interior GatewayProtocol
Interior GatewayProtocol
BGP BGP
CS 54001-1 Winter Quarter 27
Routing within a Stub AS There is only one exit point, so routers
within the AS can use default routing – Each router knows all Network IDs within AS– Packets destined to another AS are sent to
the default router– Default router is the border gateway to the
next AS Routing tables in Stub ASs tend to be small
CS 54001-1 Winter Quarter 28
Interior Routing Protocols RIP (Routing Information Protocol)
– Uses distributed Bellman-Ford algorithm– Updates sent every 30 seconds– No authentication– Originally in BSD UNIX
OSPF (Open Shortest Path First)– Link-state updates sent (using flooding) as and when
required– Every router runs Dijkstra’s algorithm– Authenticated updates– Autonomous system may be partitioned into “areas”
CS 54001-1 Winter Quarter 29
Exterior Routing ProtocolsProblems:
– Topology: The Internet is a complex mesh of different ASs with very little structure
– Autonomy of ASs: Each AS defines link costs in different ways, so not possible to find lowest cost paths
– Trust: Some ASs can’t trust others to advertise good routes (e.g., two competing backbone providers), or to protect the privacy of their traffic (e.g., two warring nations)
– Policies: Different ASs have different objectives (e.g., route over fewest hops; use one provider rather than another)
CS 54001-1 Winter Quarter 30
Border Gateway Protocol (BGP-4) BGP is not a link-state or distance-vector routing
protocol BGP advertises complete paths (a list of ASs) Example of path advertisement:
– “The network 171.64/16 can be reached via the path {AS1, AS5, AS13}”.
Paths with loops are detected locally and ignored Local policies pick the preferred path among options When link/router fails, the path is “withdrawn”
CS 54001-1 Winter Quarter 31
This Week: Routing and Transport Routing techniques
– Flooding– Distributed Bellman Ford Algorithm– Dijkstra’s Shortest Path First Algorithm
Routing in the Internet– Hierarchy and Autonomous Systems– Interior Routing Protocols: RIP, OSPF– Exterior Routing Protocol: BGP
Transport: achieving reliability Transport: achieving fair sharing of links
CS 54001-1 Winter Quarter 32
Outline The Transport Layer The TCP Protocol
– TCP Characteristics– TCP Connection setup– TCP Segments– TCP Sequence Numbers– TCP Sliding Window– Timeouts and Retransmission– (Congestion Control and Avoidance)
The UDP Protocol
CS 54001-1 Winter Quarter 33
The Transport Layer What is the transport layer for? What characteristics might it have?
– Reliable delivery– Flow control– …
CS 54001-1 Winter Quarter 34
Review of the Transport Layer
Ian Dave
Gargoyle.cs.uchicago.edu Athena.MIT.edu
Network Layer
Link Layer
Application Layer
Transport Layer
O.S. O.S.HeaderData HeaderData
HD
HD
HD
HD HD
HD
CS 54001-1 Winter Quarter 35
Layering: FTP Example
Network
Link
Transport
Application
Presentation
SessionTransportNetwork
Link
Physical
The 7-layer OSI Model The 4-layer Internet model
ApplicationFTP
ASCII/Binary
IP
TCP
Ethernet
CS 54001-1 Winter Quarter 36
TCP Characteristics TCP is connection-oriented
– 3-way handshake used for connection setup TCP provides a stream-of-bytes service TCP is reliable:
– Acknowledgements indicate delivery of data– Checksums are used to detect corrupted data– Sequence numbers detect missing, or mis-sequenced data– Corrupted data is retransmitted after a timeout– Mis-sequenced data is re-sequenced– (Window-based) Flow control prevents over-run of receiver
TCP uses congestion control to share network capacity among users
CS 54001-1 Winter Quarter 37
TCP is connection-oriented
Connection Setup3-way handshake
(Active)Client
(Passive)Server
Syn
Syn + Ack
Ack
Connection Close/Teardown2 x 2-way handshake
(Active)Client
(Passive)Server
Fin
(Data +) Ack
Fin
Ack
CS 54001-1 Winter Quarter 38
TCP supports a “stream of bytes” service
Byte 0
Byte 1
Byte 2
Byte 3
Byte 0
Byte 1
Byte 2
Byte 3
Host A
Host B
Byte 8 0
Byte 8 0
CS 54001-1 Winter Quarter 39
…which is emulated using TCP “segments”
Byte 0
Byte 1
Byte 2
Byte 3
Byte 0
Byte 1
Byte 2
Byte 3
Host A
Host B
Byte 8 0
TCP Data
TCP Data
Byte 8 0
Segment sent when:1. Segment full (MSS
bytes),2. Not full, but times out, or3. “Pushed” by application.
CS 54001-1 Winter Quarter 40
The TCP Segment FormatIP Hdr
IP DataTCP HdrTCP Data
Src port Dst port
Sequence #
Ack Sequence #HLEN
4RSVD
6 URG
ACK
PSH
RST
SYN
FIN
Flags Window Size
Checksum Urg Pointer
(TCP Options)
0 15 31
TCP Data
TCP Header and Data + IP
Addresses
Src/dst port numbersand IP addresses uniquely identify
socket
CS 54001-1 Winter Quarter 41
Sequence NumbersHost A
Host B
TCP Data
TCP Data
TCP HDR
TCP HDR
ISN (initial sequence number)
Sequence number = 1st
byte Ack sequence number =
next expected byte
CS 54001-1 Winter Quarter 42
Initial Sequence Numbers
Connection Setup3-way handshake
(Active)Client
(Passive)Server
Syn +ISNA
Syn + Ack +ISNB
Ack
CS 54001-1 Winter Quarter 43
TCP Sliding Window How much data can a TCP sender have
outstanding in the network? How much data should TCP retransmit
when an error occurs? Just selectively repeat the missing data?
How does the TCP sender avoid over-running the receiver’s buffers?
CS 54001-1 Winter Quarter 44
TCP Sliding WindowWindow Size
OutstandingUn-ack’d data
Data OK to send
Data not OK to send yet
Data ACK’d
Retransmission policy is “Go Back N” Current window size is “advertised” by receiver (usually 4k – 8k Bytes when connection set-up)
CS 54001-1 Winter Quarter 45
TCP Sliding Window
Host A
Host B ACK
Window SizeRound-trip time
(1) RTT > Window sizeACK
Window Size
Round-trip time
(2) RTT = Window sizeACK
Window Size???
CS 54001-1 Winter Quarter 46
TCP: Retransmission and Timeouts
Host A
Host BACK
Round-trip time (RTT)
ACK
Retransmission TimeOut (RTO)
Estimated RTT
Data1 Data2
Guard
Band
TCP uses an adaptive retransmission timeout value:
CongestionChanges in Routing
RTT changes frequently
CS 54001-1 Winter Quarter 47
TCP: Retransmission and Timeouts Picking the RTO is important:
Pick a values that’s too big and it will wait too long to retransmit a packet,
Pick a value too small, and it will unnecessarily retransmit packets.
The original algorithm for picking RTO:1. EstimatedRTT = EstimatedRTT + (1 - ) SampleRTT2. RTO = 2 * EstimatedRTT
Characteristics of the original algorithm: Variance is assumed to be fixed. But in practice, variance increases as congestion
increases.
CS 54001-1 Winter Quarter 48
TCP: Retransmission and Timeouts
Newer Algorithm includes estimate of variance in RTT:
Difference = SampleRTT - EstimatedRTT EstimatedRTT = EstimatedRTT + (*Difference) Deviation = Deviation + *( |Difference| - Deviation )
RTO = * EstimatedRTT + * Deviation 1 4
CS 54001-1 Winter Quarter 49
TCP: Retransmission and TimeoutsKarn’s Algorithm
Retransmission
Wrong RTT Sample
Host A Host B
Retransmission
Wrong RTT Sample
Host A Host B
Problem: How can we estimate RTT when packets are retransmitted?Solution: On retransmission, don’t update estimated RTT (and double RTO)
CS 54001-1 Winter Quarter 50
User Datagram Protocol (UDP) Characteristics
UDP is a connectionless datagram service– There is no connection establishment: packets may show up at
any time UDP packets are self-contained UDP is unreliable:
– No acknowledgements to indicate delivery of data– Checksums cover the header, and only optionally cover the data– Contains no mechanism to detect missing or mis-sequenced
packets– No mechanism for automatic retransmission– No mechanism for flow control, and so can over-run the receiver
CS 54001-1 Winter Quarter 51
User-Datagram Protocol (UDP)
App AppA1 A2
App AppB1 B2
UDP
OS
IP Like TCP, UDP uses port number to demultiplex packets
CS 54001-1 Winter Quarter 52
SRC port DST port
checksum
length
DATA
User-Datagram Protocol (UDP)Packet format
Why do we have UDP? It is used by applications that don’t need reliable delivery, or Applications that have their own special needs, such as streaming of real-time audio/video.
By default, only covers the
header.
CS 54001-1 Winter Quarter 53
This Week: Routing and Transport Routing techniques
– Flooding– Distributed Bellman Ford Algorithm– Dijkstra’s Shortest Path First Algorithm
Routing in the Internet– Hierarchy and Autonomous Systems– Interior Routing Protocols: RIP, OSPF– Exterior Routing Protocol: BGP
Transport: achieving reliability Transport: achieving fair sharing of links
CS 54001-1 Winter Quarter 54
Main points Congestion is inevitable TCP sources detect congestion and, co-operatively,
reduce the rate at which they transmit The rate is controlled using the TCP window size TCP modifies the rate according to “Additive Increase,
Multiplicative Decrease (AIMD)” To jump start flows, TCP uses a fast restart
mechanism (called “slow start”!) TCP achieves high throughput by encouraging high
delay
CS 54001-1 Winter Quarter 55
CongestionH1
H2
R1 H3
A1(t)10Mb/s
D(t)1.5Mb/s
A2(t)100Mb/s
A1(t)
A2(t)X(t)
D(t)
A1(t)
A2(t)
D(t)
X(t)
Cumulativebytes
t
CS 54001-1 Winter Quarter 56
Congestion is unavoidableArguably it’s good!
We use packet switching because it makes efficient use of the links. Therefore, buffers in the routers are frequently occupied
If buffers are always empty, delay is low, but our usage of the network is low
If buffers are always occupied, delay is high, but we are using the network more efficiently
So how much congestion is too much?
CS 54001-1 Winter Quarter 57
Load, Delay and Power
AveragePacket delay
Load
Typical behavior of queueing systems with random arrivals:
Power
Load
A simple metric of how well the network is performing:
LoadPowerDelay
“optimalload”
Burstiness tends to moveasymptote to the left
CS 54001-1 Winter Quarter 58
Options for Congestion Control1. Implemented by host versus network2. Reservation-based, versus feedback-
based3. Window-based versus rate-based
CS 54001-1 Winter Quarter 59
TCP Congestion Control TCP implements host-based, feedback-
based, window-based congestion control. TCP sources attempts to determine how
much capacity is available TCP sends packets, then reacts to
observable events (loss)
CS 54001-1 Winter Quarter 60
TCP Congestion Control TCP sources change the sending rate by
modifying the window size:Window = min{Advertized window, Congestion Window}
In other words, send at the rate of the slowest component: network or receiver
“cwnd” follows additive increase/multiplicative decrease– On receipt of Ack: cwnd += 1– On packet loss (timeout): cwnd *= 0.5
Receiver Transmitter (“cwnd”)
CS 54001-1 Winter Quarter 61
Additive Increase
D A D D A A D D A AD A
Src
Dest
Actually, TCP uses bytes, not segments to count:When ACK is received:
MSScwnd MSScwnd
CS 54001-1 Winter Quarter 62
Leads to the TCP “sawtooth”
t
Rate
halved
Timeouts
Could take a long time to get started!
CS 54001-1 Winter Quarter 63
“Slow Start”Designed to cold-start connection quickly at startup or if
a connection has been halted (e.g. window dropped to zero,or window full, but ACK is lost).
How it works: increase cwnd by 1 for each ACK received.
D A D D A A D DA A
DA
Src
Dest
D
A
1 2 4 8
CS 54001-1 Winter Quarter 64
Slow Start
halved
Timeouts
Exponential “slow start” t
Rate
Why is it called slow-start? Because TCP originally hadno congestion control mechanism. The source would just
start by sending a whole window’s worth of data.
Slow start in operation until it reaches half of
previous cwnd.
CS 54001-1 Winter Quarter 65
Fast Retransmit & Fast Recovery TCP source can take advantage of an
additional hint: if a duplicate ACK arrives out of sequence, there was probably some data lost, even if it hasn’t yet timed out.
Upon 3 duplicate ACKs, TCP retransmits. Does not enter slow-start: there are
probably ACKs in the pipe that will continue correct AIMD operation.
CS 54001-1 Winter Quarter 66
Course Outline (Subject to Change)1. (January 9th) Internet design principles and protocols2. (January 16th) Internetworking, transport, routing 3. (January 23rd) Mapping the Internet and other networks4. (January 30th) Security (with guest lecturer: Gene Spafford)5. (February 6th) P2P technologies & applications (Matei Ripeanu)
(plus midterm)6. (February 13th) Optical networks (Charlie Catlett)7. *(February 20th) Web and Grid Services (Steve Tuecke)8. (February 27th) Network operations (Greg Jackson)9. *(March 6th) Advanced applications (with guest lecturers: Terry Disz, Mike
Wilde)10. (March 13th) Final exam
* Ian Foster is out of town.