computer network (5)
![Page 1: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/1.jpg)
Data Center Networking
Stanford CS144 Lecture 17
Philip Levis, 11/30/11
![Page 2: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/2.jpg)
![Page 3: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/3.jpg)
• Low latencies: µs
• High capacity: GigE, 10 GigE
• Specialized traffic
• Centrally managed
![Page 4: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/4.jpg)
Topology
(picture courtesy of Al-Fares et al, “A Scalable, Commodity Data Center Network Architecture”)
![Page 5: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/5.jpg)
Storage Workload
(picture courtesy of Phanishayee et al, “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”)
![Page 6: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/6.jpg)
Query Workload
(picture courtesy of Alizadeh et al., “Data Center TCP (DCTCP)”)
![Page 7: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/7.jpg)
Problems
![Page 8: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/8.jpg)
Per-Pair Bandwidth
(picture courtesy of Al-Fares et al, “A Scalable, Commodity Data Center Network Architecture”)
![Page 9: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/9.jpg)
Incast
(from Phanishayee et al, “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”)
![Page 10: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/10.jpg)
Incast Details
(from Phanishayee et al, “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”)
![Page 11: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/11.jpg)
Mixed traffic
• Low latency for short flows
• High burst tolerance (incast)
• High throughput for long flows
![Page 12: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/12.jpg)
Recent Research
• New switching topology: Al-Fares et al.
• Fix TCP incast: Vasudevan et al.
• Data Center TCP: Alizadeh et al.
![Page 13: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/13.jpg)
Per-Pair Bandwidth
(picture courtesy of Al-Fares et al, “A Scalable, Commodity Data Center Network Architecture”)
![Page 14: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/14.jpg)
Fat Tree
![Page 15: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/15.jpg)
Fat Tree
Figure labels: (k/2)^2, k/2, k/2, k
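The figure labels are the sizing terms of a fat tree built from k-port switches. As a rough sketch, assuming the k-ary layout described by Al-Fares et al. (k pods, each with k/2 edge and k/2 aggregation switches, (k/2)^2 core switches, and k/2 hosts per edge switch), the counts can be computed as below. The function name `fat_tree_sizes` and the dictionary keys are made up for this illustration.

```python
def fat_tree_sizes(k: int) -> dict:
    """Element counts for a k-ary fat tree built from k-port switches.

    Assumes the layout from Al-Fares et al.; k must be a positive even number.
    """
    if k <= 0 or k % 2 != 0:
        raise ValueError("k must be a positive even port count")
    half = k // 2
    return {
        "pods": k,
        "edge_switches_per_pod": half,
        "aggregation_switches_per_pod": half,
        "core_switches": half * half,   # (k/2)^2
        "hosts_per_edge_switch": half,
        "hosts": k * half * half,       # k^3 / 4
    }


if __name__ == "__main__":
    for k in (4, 48):
        print(k, fat_tree_sizes(k))
```

The host count grows as k^3/4, which is why commodity switches with modest port counts can still reach tens of thousands of hosts.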
![Page 16: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/16.jpg)
Switching
Prefix         Port
10.2.0.0/24    0
10.2.1.0/24    1
0.0.0.0/0      (see suffix table)
Suffix         Port
0.0.0.2/8      2
0.0.0.3/8      3
TCAM entries: 10.2.0.X, 10.2.1.X, X.X.X.2, X.X.X.3
Encoder
Prefix    Next Hop    Port
00        10.2.0.1    0
01        10.2.1.1    1
10        10.4.1.1    2
11        10.4.1.2    3
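A minimal sketch of the two-level lookup this slide illustrates, using the prefix and suffix entries listed on it. It assumes that a destination is first matched against the /24 prefixes and, if none match (the 0.0.0.0/0 default), is then matched on its low 8 bits against the suffix entries. The names `PREFIX_TABLE`, `SUFFIX_TABLE`, and `lookup_port` are hypothetical, and a real switch would do this in a TCAM rather than in software.

```python
import ipaddress

# Entries copied from the slide; the data layout is an assumption.
PREFIX_TABLE = [
    (ipaddress.ip_network("10.2.0.0/24"), 0),
    (ipaddress.ip_network("10.2.1.0/24"), 1),
]
# Suffix entries 0.0.0.2/8 and 0.0.0.3/8 match on the last octet (host ID).
SUFFIX_TABLE = {2: 2, 3: 3}


def lookup_port(dst: str):
    """Return the output port for dst, or None if no entry matches."""
    addr = ipaddress.ip_address(dst)
    # First level: the prefixes do not overlap, so the first hit is the match.
    for network, port in PREFIX_TABLE:
        if addr in network:
            return port
    # Second level (reached via 0.0.0.0/0): match on the low 8 bits.
    host_id = int(addr) & 0xFF
    return SUFFIX_TABLE.get(host_id)


if __name__ == "__main__":
    for dst in ("10.2.0.3", "10.2.1.2", "10.3.0.2", "10.0.1.3"):
        print(dst, "->", lookup_port(dst))
```

Matching on the host suffix spreads traffic leaving the pod across the upward-facing ports, while intra-pod destinations are still routed by prefix.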
![Page 17: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/17.jpg)
Not Perfect
Figure labels: (k/2)^2, k/2, k/2, k
![Page 18: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/18.jpg)
Fat-Tree Status
![Page 19: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/19.jpg)
Incast
• RTO = SRTT + (4 X RTTVAR)
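To make the formula concrete, here is a small sketch of the RTO estimator in the style of RFC 6298, with smoothing gains of 1/8 for SRTT and 1/4 for RTTVAR. The class name `RtoEstimator` and the `rto_min` parameter are illustrative, and the clock-granularity term from the RFC is left out for brevity.

```python
class RtoEstimator:
    """Tracks SRTT and RTTVAR and derives RTO = SRTT + 4 * RTTVAR."""

    ALPHA = 1 / 8   # gain for SRTT
    BETA = 1 / 4    # gain for RTTVAR
    K = 4           # multiplier on RTTVAR

    def __init__(self, rto_min: float = 1.0):
        self.srtt = None
        self.rttvar = None
        self.rto_min = rto_min  # floor on the RTO, in seconds

    def update(self, rtt: float) -> float:
        """Feed one RTT sample (seconds) and return the new RTO."""
        if self.srtt is None:
            # First measurement: initialize both estimators.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the old SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        return max(self.rto_min, self.srtt + self.K * self.rttvar)


if __name__ == "__main__":
    est = RtoEstimator(rto_min=1.0)
    # A 100 µs data center RTT still yields the 1 s floor.
    print(est.update(100e-6))
```

With a floor in the hundreds of milliseconds or more, a single timeout costs thousands of data center round trips, which is the core of the incast problem.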
![Page 20: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/20.jpg)
Behavior
(from Phanishayee et al, “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”)
![Page 21: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/21.jpg)
RFC 6298 (2.4): "Whenever RTO is computed, if it is less than 1 second, then the RTO SHOULD be rounded up to 1 second."
• In practice, often 200ms
RFC 2581: "The delayed ACK algorithm specified in [Bra89] SHOULD be used by a TCP receiver. When used, a TCP receiver MUST NOT excessively delay acknowledgments. Specifically, an ACK SHOULD be generated for at least every second full-sized segment, and MUST be generated within 500 ms of the arrival of the first unacknowledged packet."
• In practice, often 40ms
![Page 22: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/22.jpg)
Solutions
• Proposal 1: Adjust RTO (Vasudevan et al.)
• Proposal 2: DCTCP (Alizadeh et al.)
![Page 23: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/23.jpg)
RTT
![Page 24: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/24.jpg)
RTT 2
![Page 25: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/25.jpg)
RTO
• Make RTOmin 200µs
• Timeout = (RTO + (rand(0.5) x RTO))
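The two bullets above can be sketched as follows. This assumes `rand(0.5)` means a uniform random value between 0 and 0.5, so the timer fires somewhere between 1.0 and 1.5 times the RTO. The function name `retransmit_timeout` and the `rng` parameter are hypothetical.

```python
import random

RTO_MIN = 200e-6  # 200 µs floor, from the slide


def retransmit_timeout(rto: float, rng=random) -> float:
    """Apply the RTO floor, then add up to 50% random delay.

    Assumes rand(0.5) on the slide is uniform over [0, 0.5].
    """
    rto = max(rto, RTO_MIN)
    return rto + rng.uniform(0.0, 0.5) * rto


if __name__ == "__main__":
    rng = random.Random(144)  # seeded only to make the demo repeatable
    print([retransmit_timeout(300e-6, rng) for _ in range(3)])
```

The randomization desynchronizes senders that all timed out together, so their retransmissions are less likely to collide at the same switch queue again.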
![Page 26: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/26.jpg)
Improvement
![Page 27: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/27.jpg)
Wide Area
![Page 28: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/28.jpg)
DCTCP
• Three goals
  • Low latency for short flows
  • High burst tolerance (incast)
  • High throughput for long flows
• Basic approach: keep switch queues short
![Page 29: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/29.jpg)
Queue Length
• RTT measurements are noisy
• At high speeds, very small
  • GigE: 10 packets is 120µs
  • 10GigE: 10 packets is 12µs
• Use ECN (explicit congestion notification)
  • RFC 3168
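The delay figures above follow from simple arithmetic, assuming 1500-byte packets (the packet size is an assumption, chosen because it reproduces the slide's numbers). The helper name `queue_delay_us` is illustrative.

```python
def queue_delay_us(packets: int, link_bps: float, packet_bytes: int = 1500) -> float:
    """Time in microseconds to drain `packets` queued packets at `link_bps`."""
    bits = packets * packet_bytes * 8
    return bits * 1e6 / link_bps


if __name__ == "__main__":
    print("GigE:   ", queue_delay_us(10, 1e9), "µs")
    print("10GigE: ", queue_delay_us(10, 10e9), "µs")
```

Delays this small are hard to distinguish from measurement noise in RTT samples, which motivates an explicit signal from the switch.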
![Page 30: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/30.jpg)
Setting ECN
K
Set ECN bit
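A toy model of the marking rule on this slide: the switch sets the ECN bit on an arriving packet when the queue already holds more than K packets. The class `MarkingQueue`, the `capacity` parameter, and the dictionary-based packet representation are assumptions made for illustration only.

```python
from collections import deque


class MarkingQueue:
    """FIFO queue that ECN-marks arrivals once occupancy exceeds K."""

    def __init__(self, k: int, capacity: int):
        self.k = k                # marking threshold, in packets
        self.capacity = capacity  # buffer size, in packets
        self.queue = deque()

    def enqueue(self, packet: dict) -> bool:
        """Return False if the packet is dropped, True if it is queued."""
        if len(self.queue) >= self.capacity:
            return False          # buffer full: tail drop
        # Mark based on the instantaneous queue length at arrival.
        packet["ecn"] = len(self.queue) > self.k
        self.queue.append(packet)
        return True

    def dequeue(self) -> dict:
        return self.queue.popleft()


if __name__ == "__main__":
    q = MarkingQueue(k=2, capacity=5)
    for seq in range(6):
        pkt = {"seq": seq}
        print(seq, q.enqueue(pkt), pkt.get("ecn"))
```

Because marking starts well before the buffer fills, senders can react while there is still headroom left to absorb bursts.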
![Page 31: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/31.jpg)
Monitoring α
• Per RTT, measure F, the fraction of packets sent that had the ECN bit set
  • DCTCP acks copy the ECN bit of the corresponding data packets into ECN-Echo field
• Compute α, EWMA of F
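As a sketch of the estimator, the moving average can be written as α ← (1 − g)·α + g·F, evaluated once per RTT. The gain g = 1/16 used as the default here is an assumption based on the value reported in the DCTCP paper, and the function name `update_alpha` is illustrative.

```python
def update_alpha(alpha: float, acked: int, marked: int, g: float = 1 / 16) -> float:
    """Update the EWMA of the marked fraction once per RTT.

    acked:  packets acknowledged during the last RTT
    marked: how many of those acks carried ECN-Echo
    g:      EWMA gain (1/16 is an assumed default)
    """
    if marked > acked:
        raise ValueError("marked cannot exceed acked")
    f = marked / acked if acked else 0.0   # F for this RTT
    return (1 - g) * alpha + g * f


if __name__ == "__main__":
    alpha = 0.0
    # Ten RTTs in which 3 of every 10 packets were marked.
    for _ in range(10):
        alpha = update_alpha(alpha, acked=10, marked=3)
    print(alpha)
```

The small gain means α reflects the extent of congestion over many RTTs instead of reacting to a single marked window.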
![Page 32: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/32.jpg)
Adjusting cwnd
• cwnd = cwnd x (1 - α/2)
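A short worked example of the window cut. When α is close to 1 (nearly every packet marked), the window is halved as in standard TCP; when α is small, the reduction is gentle. The function name `reduce_cwnd` and the `min_cwnd` floor are assumptions added for illustration.

```python
def reduce_cwnd(cwnd: float, alpha: float, min_cwnd: float = 1.0) -> float:
    """Scale the congestion window by (1 - alpha / 2), never below min_cwnd."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return max(min_cwnd, cwnd * (1 - alpha / 2))


if __name__ == "__main__":
    for alpha in (0.0, 0.1, 0.5, 1.0):
        print(alpha, reduce_cwnd(100.0, alpha))
```

Scaling the cut by the measured fraction of marked packets lets long flows keep throughput high while the switch queue stays short.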
![Page 33: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/33.jpg)
DCTCP Caveat
“We stress that DCTCP is designed for the data center environment. In this paper, we make no claims about suitability of DCTCP for wide area networks.”
![Page 34: Computer network (5)](https://reader030.vdocument.in/reader030/viewer/2022032422/55a9c3641a28ab967d8b46e3/html5/thumbnails/34.jpg)
Data Center Networks
• Very different from the wide-area Internet
  • Tiny RTTs
  • Different traffic patterns
  • Single administrative domain
• Standards (e.g., IETF) much less important
• A lot of very novel network design