Download - CompSci514/ECE558: Computer Networks
![Page 1: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/1.jpg)
CompSci514/ECE558: Computer NetworksLecture 22: Review
Xiaowei Yang
[email protected]://www.cs.duke.edu/~xwy
![Page 2: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/2.jpg)
Roadmap
• Summarize what we have learned in this semester– Design principles of computer networks– Congestion control– Routing– Datacenter networking: topology and congestion
control– SDN, NFV, Programmable Routers, RDMA,
Network measurement, DDoS, and DHT
![Page 3: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/3.jpg)
Architectural questions tend to dominate CS networking
research
![Page 4: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/4.jpg)
Decomposition of Function
Definition and placement of function– What to do, and where to do it
The “division of labor”– Between the host, network, and management
systems– Across multiple concurrent protocols and
mechanisms
4
![Page 5: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/5.jpg)
CompSci 514: Computer Networks
Lecture 3: The Design Philosophy of the DARPA Internet Protocols
Xiaowei [email protected]
![Page 6: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/6.jpg)
What is the paper about?• Where to place functions in a distributed computer system
– End point, networks, or a joint venture?
• Authors’ arguments:
“ The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the end points of the communication system. Therefore, providing that questioned function as a feature of the communication system itself is not possible. (Sometimes an incomplete version of the function provided by the communication system may be useful as a performance enhancement.)”
![Page 7: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/7.jpg)
End-to-End Argument
• Extremely influential
• �…functions placed at the lower levels may be redundant or of little value when compared to the cost of providing them at the lower level…�
• �…sometimes an incomplete version of the function provided by the communication system (lower levels) may be useful as a performance enhancement…�
7
![Page 8: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/8.jpg)
8
Example: Reliable File Transfer
• Solution 1: make each step reliable, and then concatenate them– Uneconomical if each step has small error
probability
OS
Appl.
OS
Appl.
Host A Host B
Network
![Page 9: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/9.jpg)
9
Example: Reliable File Transfer
• Solution 2: end-to-end check and retry– Correct and complete
OS
Appl.
OS
Appl.
Host A Host B
OKNetwork
![Page 10: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/10.jpg)
10
Example: Reliable File Transfer
• An intermediate solution: the communication system provides internally, a guarantee of reliable data transmission, e.g., a hop-by-hop reliable protocol– Only reducing end-to-end retries– No effect on correctness
OS
Appl.
OS
Appl.
Host A Host B
OKNetwork
![Page 11: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/11.jpg)
11
Question: should lower layer play a part in obtaining reliability?
![Page 12: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/12.jpg)
The Design Philosophy of the DARPA Internet Protocols
• Inter-networking: an IP layer– Alternative: A unified approach
• Can’t connect existing networks• Inflexible
• Packet switching vs circuit switching– Applications suitable for packet switching– Existing networks were packet switching
• Gateways– Chosen from ARPANET– Store and forward– Question: can we interconnect without gateways?
![Page 13: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/13.jpg)
Secondary goals
§ In order of importance1. Survivable of network failures2. Multiple services3. Varieties of networks4. Distributed management5. Cost effective6. Easy attachment7. Resource accountable
§ How will the order differ in a commercial environment?
![Page 14: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/14.jpg)
Design Goals of Congestion Control
• Congestion avoidance: making the system operate around the knee to obtain low latency and high throughput
• Congestion control: making the system operate left to the cliff to avoid congestion collapse
• Congestion avoidance: making the system operate around the knee to obtain low latency and high throughput
• Congestion control: making the system operate left to the cliff to avoid congestion collapse
![Page 15: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/15.jpg)
Key insight: packet conservation principle and self-clocking
• When pipe is full, the speed of ACK returns equals to the speed new packets should be injected into the network
![Page 16: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/16.jpg)
Solution: Dynamic window sizing
• Sending speed: SWS / RTT
• à Adjusting SWS based on available bandwidth
• The sender has two internal parameters:– Congestion Window (cwnd)– Slow-start threshold Value (ssthresh)
• SWS is set to the minimum of (cwnd, receiver advertised win)
![Page 17: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/17.jpg)
Two Modes of Congestion Control
1. Probing for the available bandwidth– slow start (cwnd < ssthresh)
2. Avoid overloading the network– congestion avoidance (cwnd >= ssthresh)
![Page 18: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/18.jpg)
Slow Start
• Initial value: Set cwnd = 1 MSS• Modern TCP implementation may set initial cwnd to a much larger
value
• When receiving an ACK, cwnd+= 1 MSS
![Page 19: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/19.jpg)
Congestion Avoidance
• If cwnd >= ssthresh then each time an ACK is received, increment cwnd as follows:
• cwnd += MSS * (MSS / cwnd) (cwnd measured in bytes)
• So cwnd is increased by one MSS only if all cwnd/MSS segments have been acknowledged.
![Page 20: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/20.jpg)
Example of Slow Start/Congestion Avoidance
Assume ssthresh = 8 MSS cwnd = 1
cwnd = 2
cwnd = 4
cwnd = 8
cwnd = 9
cwnd = 10
02468101214
t=0 t=2 t=4 t=6Roundtrip times
Cw
nd (i
n se
gmen
ts)
ssthresh
![Page 21: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/21.jpg)
21
The Sawtooth behavior of TCP
• For every ACK received– Cwnd += 1/cwnd
• For every packet lost– Cwnd /= 2
RTT
Cwnd
![Page 22: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/22.jpg)
22
Why does it work? [Chiu-Jain]
– A feedback control system– The network uses feedback y to adjust users� load åx_i
![Page 23: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/23.jpg)
23
Goals of Congestion Avoidance
– Efficiency: the closeness of the total load on the resource ot its knee– Fairness:
• When all x_i�s are equal, F(x) = 1• When all x_i�s are zero but x_j = 1, F(x) = 1/n
– Distributedness• A centralized scheme requires complete knowledge of the state of the
system– Convergence
• The system approach the goal state from any starting state
![Page 24: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/24.jpg)
24
Metrics to measure convergence
• Responsiveness• Smoothness
![Page 25: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/25.jpg)
25
Model the system as a linear control system
• Four sample types of controls• AIAD, AIMD, MIAD, MIMD
![Page 26: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/26.jpg)
26
Phase plane
x1
x2
![Page 27: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/27.jpg)
27
TCP congestion control is AIMD
• Problems:– Each source has to probe for its bandwidth– Congestion occurs first before TCP backs off– Unfair: long RTT flows obtain smaller bandwidth
shares
RTT
Cwnd
![Page 28: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/28.jpg)
28
Macroscopic behavior of TCP
pRTTMSS••5.1
• Throughput is inversely proportional to RTT:
• In a steady state, total packets sent in one sawtooth cycle:– S = w + (w+1) + … (w+w) = 3/2 w2
• the maximum window size is determined by the loss rate– 1/S = p– w =
• The length of one cycle: w * RTT• Average throughput: 3/2 w * MSS / RTT
11.5p
![Page 29: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/29.jpg)
29
Explicit Congestion Notification
• Use a Congestion Experience (CE) bit to signal congestion, instead of a packet drop
• Why is ECN better than a packet drop?
• AQM is used for packet marking
XCE=1
ECE=1
CWR=1
![Page 30: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/30.jpg)
Other Congestion Control Algorithms
• XCP• VCP• BBR• Cubic
![Page 31: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/31.jpg)
Design Space for resource allocation
• Router-based vs. Host-based
• Reservation-based vs. Feedback-based
• Window-based vs. Rate-based
![Page 32: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/32.jpg)
Fair Queuing Motivation
• End-to-end congestion control + FIFO queue (or AQM) has limitations– What if sources mis-behave?
• Approach 2: – Fair Queuing: a queuing algorithm that aims to �fairly� allocate buffer, bandwidth, latency among competing users
![Page 33: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/33.jpg)
Outline
• What is fair?• Weighted Fair Queuing• Other FQ variants
![Page 34: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/34.jpg)
One definition: Max-min fairness• Many fair queuing algorithms aim to achieve this
definition of fairness• Informally
– Allocate user with �small� demand what it wants, evenly divide unused resources to �big� users
• Formally– 1. No user receives more than its request– 2. No other allocation satisfies 1 and has a higher minimum
allocation• Users that have higher requests and share the same bottleneck link
have equal shares– Remove the minimal user and reduce the total resource
accordingly, 2 recursively holds
![Page 35: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/35.jpg)
Max-min example1. Increase all flows� rates equally, until some users� requests
are satisfied or some links are saturated2. Remove those users and reduce the resources and repeat step
1• Assume sources 1..n, with resource demands X1..Xn in an
ascending order• Assume channel capacity C.
– Give C/n to X1; if this is more than X1 wants, divide excess (C/n -X1) to other sources: each gets C/n + (C/n - X1)/(n-1)
– If this is larger than what X2 wants, repeat process
![Page 36: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/36.jpg)
Design of weighted fair queuing
• Resources managed by a queuing algorithm– Bandwidth: Which packets get transmitted– Promptness: When do packets get transmitted– Buffer: Which packets are discarded– Examples: FIFO
• The order of arrival determines all three quantities
• Goals: – Max-min fair– Work conserving: link�s not idle if there is work to do– Isolate misbehaving sources– Has some control over promptness
• E.g., lower delay for sources using less than their full share of bandwidth
• Continuity– On Average does not depend discontinuously on a packet’s time of arrival– Not blocked if no packet arrives
![Page 37: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/37.jpg)
Design goals• Max-min fair• Work conserving: link�s not idle if there is work to do• Isolate misbehaving sources• Has some control over promptness
– E.g., lower delay for sources using less than their full share of bandwidth
– Continuity• On Average does not depend discontinuously on a packet’s time of arrival• Not blocked if no packet arrives
![Page 38: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/38.jpg)
39
Implementing max-min Fairness
• Generalized processor sharing– Fluid fairness– Bitwise round robin among all queues
• WFQ:– Emulate this reference system in a packetized system– Challenges: bits are bundled into packets. Simple round robin
scheduling does not emulate bit- by-bit round robin
![Page 39: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/39.jpg)
Emulating Bit-by-Bit round robin
• Define a virtual clock: the round number R(t) as the number of rounds made in a bit-by-bit round-robin service discipline up to time t
• A packet with size P whose first bit serviced at round R(t0) will finish at round:– R(t) = R(t0) + P
• Schedule which packet gets serviced based on the finish round number
![Page 40: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/40.jpg)
Compute finish times
• Arrival time of packet i from flow α: tiα
• Pacet size: Piα
• Siα be the round number when the packet starts
service• Fi
α be the finish round number• Fi
α = Siα + Pi
α
• Siα = Max (Fi-1
α, R(tiα))
![Page 41: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/41.jpg)
Compute R(t) can be complicated
• Single flow: clock ticks when a bit is transmitted. For packet i:– Round number ≤ Arrival time Ai
– Fi = Si+Pi = max (Fi-1, Ai) + Pi
• Multiple flows: clock ticks when a bit from all active flows is transmitted– When the number of active flows vary, clock ticks
at different speed: ¶ R/¶ t = ¹/Nac(t)
![Page 42: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/42.jpg)
An example
• Two flows, unit link speed 1 bit per second
P=3P=5t=0t=4
P=4P=2t=1t=6
t
R(t)
0
P=6t=12
![Page 43: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/43.jpg)
Weighted Fair Queuing
• Different queues get different weights– Take wi amount of bits from a queue in each round– Fi = Si + Pi / wi
w=2
w=1
![Page 44: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/44.jpg)
What is routing?
End-hosts
Routers
![Page 45: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/45.jpg)
The Internet: Zooming In
• ASes: Independently owned & operated commercial entities
Duke
Comcast
Abilene
AT&T Cogent
Autonomous Systems (ASes)
BGP
IGPs
(OSPF, etc)
![Page 46: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/46.jpg)
ASes (or domains)• Autonomously administered• Economically motivated• All must cooperate to ensure reachability• Routing between: BGP• Routing inside: Up to the AS
– OSPF, E-IGRP, ISIS (You may have heard of RIP; almost nobody uses it)
• Inside an AS: Independent policies about nearly everything.
![Page 47: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/47.jpg)
Transit ASes vs Stub ASes
Duke
Comcast
Abilene
AT&T
Cogent
BGP
All ASes are not equal
![Page 48: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/48.jpg)
AS relationships
• Very complex economic landscape.• Simplifying a bit:
– Transit: �I pay you to carry my packets to everywhere�(provider-customer)
– Peering: �For free, I carry your packets to my customers only.� (peer-peer)
• Technical definition of tier-1 ISP: In the �default-free� zone. No transit.– Note that other �tiers� are marketing, but convenient. �Tier 3� may connect to tier-1.
![Page 49: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/49.jpg)
Zooming in 4
Tier 1 ISP
Tier 2Regional
Tier 2
Tier 1 ISP
Tier 2
Tier 3 (local)Tier 2: Regional/National Tier 3: Local
$$$$
$$
Default free,Has information on every prefix
Default: provider
![Page 50: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/50.jpg)
Who pays whom?• Transit: Customer pays the provider
– Who is who? Usually, the one who can �live without� the other. AT&T does not need Duke, but Duke needs some ISP.
• What if both need each other? Free Peering.– Instead of sending packets over $$ transit, set
up a direct connection and exchange traffic for free!
![Page 51: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/51.jpg)
• Tier 1s must all peer with each other by definition– Tier 1s form a full mesh Internet core
• Peering can give:– Better performance– Lower cost– More �efficient� routing (keeps packets local)
• But negotiating can be very tricky!
![Page 52: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/52.jpg)
Terms
• Route: a network prefix plus path attributes• Customer/provider/peer routes: route
advertisements heard from customers/providers/peers.
• Transit service: If A advertises a route to B, it implies that A will forward packets coming from B to any destination in the advertised prefix
Duke NC RegNet
UNC152.3/16 152.3/16
152.3.137.179152.2.3.4
![Page 53: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/53.jpg)
BGP version 4
• Design goals:– Scalability as more networks connect– Policy: ASes should be able to enforce
business/routing policies• Result: Flexible attribute structure, filtering
– Cooperation under competition:• ASes should have great autonomy for routing and
internal architecture• But BGP should provide global reachability
![Page 54: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/54.jpg)
BGP
Route Advertisement
Autonomous Systems (ASes)
Session (over TCP)
Traffic
BGP peers
![Page 55: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/55.jpg)
• BGP messages– OPEN– UPDATE
• Announcements– Dest Next-hop AS Path … other attributes …– 128.2.0.0/16 196.7.106.245 2905 701 1239 5050 9
• Withdrawals
– KEEPALIVE• Keepalive timer / hold timer
• Key thing: The Next Hop attribute
![Page 56: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/56.jpg)
Path Vector
• ASPATH Attribute– Records what ASes a route went through– Loop avoidance: Immediately discard– Short path heuristics
• Like distance vector, but fixes the count-to-infinity problem
![Page 57: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/57.jpg)
An example of BGP advertisement• BGP routing table entry for 152.3.0.0/16, version 1009002• Paths: (36 available, best #10, table default)• Not advertised to any peer• Refresh Epoch 1• 54728 20130 6939 11164 81 13371• 140.192.8.16 from 140.192.8.16 (140.192.8.16)• Origin IGP, localpref 100, valid, external• rx pathid: 0, tx pathid: 0• Refresh Epoch 1• 58901 51167 3356 209 81 13371• 93.104.209.174 from 93.104.209.174 (93.104.209.174)• Origin IGP, localpref 100, valid, external• rx pathid: 0, tx pathid: 0• Refresh Epoch 1
![Page 58: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/58.jpg)
Two Flavors of BGP
• External BGP (eBGP): exchanging routes between ASes– External peers typically directly connected
• Internal BGP (iBGP): disseminating routes to external destinations among the routers within an AS– Internal peers are not– Require IGP to find routes
eBGPiBGP
![Page 59: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/59.jpg)
BGP
Route Advertisement
Autonomous Systems (ASes)
Session (over TCP)
Traffic
AB
![Page 60: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/60.jpg)
Enforcing business relationships
• Two mechanisms:• Route export filters
– Control what routes you send to neighbors• Route import ranking
– Controls which route you prefer of those you hear.
– �LOCALPREF� – Local Preference. More later.
![Page 61: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/61.jpg)
Export Policies• Provider à Customer
– All routes so as to provide transit service• Customer à Provider
– Only customer routes– Why?– Only transit for those that pay
• Peer à Peer– Only customer routes
![Page 62: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/62.jpg)
Import policies
• Same routes heard from providers, customers, and peers, whom to choose?– customer > peer > provider– Why?– Choose the most economic routes!
• Customer route: charge $$ J• Peer route: free• Provider route: pay $$ L
![Page 63: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/63.jpg)
An annotated AS Graph
• Sibling: provide transit services for each other. – May belong to the same company64
AS5
AS1
AS2 AS3
AS4
peer-to-peer
sibling-to-sibling edge
edgeAS6
AS7
provider-to-customer edge
![Page 64: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/64.jpg)
The valley-free property
• Valley-free: After traversing a provider-to-customer or peer-to-peer edge, the AS path can not traverse a customer-to-provider or peer-to-peer edge
65
![Page 65: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/65.jpg)
Datacenter Networks
• Topology• Incast• DCTCP
![Page 66: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/66.jpg)
Exercise
• 24*10G silicon
• 12-line cards
• 288 port non-blocking switch
![Page 67: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/67.jpg)
Figure 11: Two ways to depopulate the fabric for 50% ca-pacity.
Figure 10 also depicts how we started connecting ourcluster fabric to the external inter cluster networking.We defer detailed discussion to Section 4.
While Watchtower cluster fabrics were substantiallycheaper and of greater scale than anything available forpurchase, the absolute cost remained substantial. Weused two observations to drive additional cost optimiza-tions. First, there is natural variation in the bandwidthdemands of individual clusters. Second, the dominantcost of our fabrics was in the optics and the associatedfiber.
Hence, we enabled Watchtower fabrics to support de-populated deployment, where we initially deployed only50% of the maximum bisection bandwidth. Impor-tantly, as the bandwidth demands of a depop clustergrew, we could fully populate it to 100% bisection inplace. Figure 11 shows two high-level options, (A) and(B), to depopulate switches, optics, and fiber, shown inred. (A) achieves 50% capacity by depopulating halfof the S2 switches and all fiber and optics touching anydepopulated S2 switch. (B) instead depopulates half S3switches and associated fiber and optics. (A) shows 2xmore depopulated elements vs. (B) for the same fabriccapacity.
(A) requires all spine S3 chassis to be deployed upfront even though edge aggregation blocks may be de-ployed slowly leading to higher initial cost. (B) has amore gradual upfront cost as all spine chassis are notdeployed initially. Another advantage of (B) over (A)is that each ToR has twice the burst bandwidth.
In Watchtower and Saturn (Section 3.4) fabrics, wechose option (A) because it maximized cost savings. ForJupiter fabrics (Section 3.5), we moved to option (B)because the upfront cost of deploying the entire spineincreased as we moved toward building-size fabrics andthe benefits of higher ToR bandwidth became more ev-ident.
3.4 Saturn: Fabric Scaling and 10G
Servers
Saturn was the next iteration of our cluster archi-tecture. The principal goals were to respond to con-tinued increases in server bandwidth demands and tofurther increase maximum cluster scale. Saturn wasbuilt from 24x10G merchant silicon building blocks. A
Figure 12: Components of a Saturn fabric. A 24x10G PlutoToR Switch and a 12-linecard 288x10G Saturn chassis (in-cluding logical topology) built from the same switch chip.Four Saturn chassis housed in two racks cabled with fiber(right).
Saturn chassis supports 12-linecards to provide a 288port non-blocking switch. These chassis are coupledwith new Pluto single-chip ToR switches; see Figure 12.In the default configuration, Pluto supports 20 serverswith 4x10G provisioned to the cluster fabric for an av-erage bandwidth of 2 Gbps for each server. For morebandwidth-hungry servers, we could configure the PlutoToR with 8x10G uplinks and 16x10G to servers provid-ing 5 Gbps to each server. Importantly, servers couldburst at 10Gbps across the fabric for the first time.
3.5 Jupiter: A 40G Datacenter-scale Fab-
ric
As bandwidth requirements per server continued togrow, so did the need for uniform bandwidth across allclusters in the datacenter. With the advent of dense40G capable merchant silicon, we could consider ex-panding our Clos fabric across the entire datacenter sub-suming the inter-cluster networking layer. This wouldpotentially enable an unprecedented pool of computeand storage for application scheduling. Critically, theunit of maintenance could be kept small enough relativeto the size of the fabric that most applications couldnow be agnostic to network maintenance windows un-like previous generations of the network.
Jupiter, our next generation datacenter fabric,needed to scale more than 6x the size of our largestexisting fabric. Unlike previous iterations, we set a re-quirement for incremental deployment of new networktechnology because the cost in resource stranding anddowntime was too high. Upgrading networks by sim-ply forklifting existing clusters stranded hosts alreadyin production. With Jupiter, new technology wouldneed to be introduced into the network in situ. Hence,the fabric must support heterogeneous hardware andspeeds. Because of the sheer scale, events in the net-work (both planned and unplanned) were expected to
![Page 68: CompSci514/ECE558: Computer Networks](https://reader033.vdocument.in/reader033/viewer/2022052105/6286fa706785534941633986/html5/thumbnails/68.jpg)
Summary
• Fundamental design philosophies• Congestion control• Routing• Datacenter networking• Other modern networking topics
– SDN, NFV, Programmable Routers, RDMA, Network measurement, DDoS, DHT