
Page 1:

Data Center TCP (DCTCP)

Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, Murari Sridharan

Microsoft Research, Stanford University

Page 2:

Outline
• Introduction
• Communications in Data Centers
• The DCTCP Algorithm
• Analysis and Experiments
• Conclusion

Page 3:

• Introduction
• Communications in Data Centers
• The DCTCP Algorithm
• Analysis and Experiments
• Conclusion

Page 4:

Data Center Packet Transport

• Cloud computing service providers (Amazon, Microsoft, Google) need to build highly available, high-performance computing and storage infrastructure using low-cost, commodity components.

Page 5:

• We focus on soft real-time applications, supporting:
  Web search
  Retail
  Advertising
  Recommendation
• These require three things from the data center network:
  Low latency for short flows
  High burst tolerance
  High utilization for long flows

Page 6:

Two major contributions
– First
  Measure and analyze production traffic
  Extract application patterns and needs
  Identify impairments that hurt performance
– Second
  Propose Data Center TCP (DCTCP)
  Evaluate DCTCP at 1 and 10 Gbps speeds on ECN-capable commodity switches

Page 7:

Production traffic

• >150 TB of compressed data, collected over the course of a month from ~6,000 servers
• The measurements reveal that 99.91% of traffic in our data center is TCP traffic
• The traffic consists of three kinds of traffic:
  – query traffic (2 KB to 20 KB in size)
  – delay-sensitive short messages (100 KB to 1 MB)
  – throughput-sensitive long flows (1 MB to 100 MB)
• Our key learning from these measurements is that, to meet the requirements of such a diverse mix of short and long flows, switch buffer occupancies need to be persistently low, while maintaining high throughput for the long flows.
• DCTCP is designed to do exactly this.

Page 8:

TCP in the Data Center

• We'll see that TCP does not meet the demands of these applications.
  – Incast
    Suffers from bursty packet drops
    Not fast enough to utilize spare bandwidth
  – Builds up large queues:
    Adds significant latency.
    Wastes precious buffers, especially bad with shallow-buffered switches.
• Operators work around TCP problems.
  – Ad hoc, inefficient, often expensive solutions
• Our solution: Data Center TCP

Page 9:

• Introduction
• Communications in Data Centers
• The DCTCP Algorithm
• Analysis and Experiments
• Conclusion

Page 10:

Partition/Aggregate Application Structure

Page 11:

Partition/Aggregate

[Figure: a query ("Picasso") is partitioned from the Top-Level Aggregator (TLA) down to Mid-Level Aggregators (MLAs) and then to Worker Nodes; each worker returns partial results (here, Picasso quotes such as "Art is a lie that makes us realize the truth."), which are ranked and aggregated back up the tree.]

• Time is money
  – Strict deadlines (SLAs)
• Missed deadline → lower quality result
• Deadlines tighten down the tree: 250 ms, 50 ms, 10 ms

Page 12:

Workloads

• Partition/Aggregate (query): delay-sensitive
• Short messages [50KB-1MB] (coordination, control state): delay-sensitive
• Large flows [1MB-50MB] (data update): throughput-sensitive

Page 13:

Impairments

Page 14:

Switches

• Like most commodity switches, the switches in our clusters are shared-memory switches that aim to exploit statistical multiplexing gain through a logically common packet buffer available to all switch ports.
• Packets arriving on an interface are stored in a high-speed multi-ported memory shared by all the interfaces.
• Memory from the shared pool is dynamically allocated to arriving packets by an MMU, which attempts to give each interface as much memory as it needs while preventing unfairness by dynamically adjusting the maximum amount of memory any one interface can take (a toy sketch follows).
• Building large multi-ported memories is very expensive, so most cheap switches are shallow buffered, with packet buffer being the scarcest resource. The shallow packet buffers cause three specific performance impairments, which we discuss next.
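A toy sketch, only to make the shared-pool idea concrete: all ports draw from one buffer pool, and a per-port cap limits how much any one port can take. The class name, the cap policy, and the numbers are illustrative assumptions, not the switch's actual MMU algorithm.

```python
class SharedBufferSwitch:
    """Toy shared-memory switch: all ports draw from one packet-buffer pool."""
    def __init__(self, total_cells=1000, num_ports=48, per_port_share=0.25):
        self.free = total_cells
        # Per-port limit: no single port may take more than this share of the pool.
        self.per_port_cap = int(total_cells * per_port_share)
        self.occupancy = [0] * num_ports

    def enqueue(self, port, cells=1):
        # A port may grow only while the shared pool has room and the port is
        # still below the cap it is currently allowed.
        if self.free >= cells and self.occupancy[port] + cells <= self.per_port_cap:
            self.occupancy[port] += cells
            self.free -= cells
            return True
        return False  # dropped: buffer pressure from traffic on other ports

    def dequeue(self, port, cells=1):
        taken = min(cells, self.occupancy[port])
        self.occupancy[port] -= taken
        self.free += taken
        return taken
```

When long flows on other ports fill the shared pool, even a lightly loaded port can see drops; this is the "buffer pressure" impairment discussed on a later slide.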

Page 15:

Incast

[Figure: an Aggregator fans out a request to Workers 1-4; their synchronized responses collide at the switch port, causing a TCP timeout with RTOmin = 300 ms.]

• Synchronized mice collide. Caused by Partition/Aggregate.

Page 16:

Queue Buildup

[Figure: Sender 1 (a large flow) and Sender 2 (short flows) share the queue to the Receiver.]

• Big flows build up queues, increasing latency for short flows.
• Measurements in a Bing cluster:
  For 90% of packets: RTT < 1 ms
  For 10% of packets: 1 ms < RTT < 15 ms

Page 17:

Buffer Pressure

• The loss rate of short flows in this traffic pattern depends on the number of long flows traversing other ports.
• The result is packet loss and timeouts, as in incast, but without requiring synchronized flows.

Page 18:

Data Center Transport Requirements

1. High Burst Tolerance
   – Incast due to Partition/Aggregate is common.
2. Low Latency
   – Short flows, queries
3. High Throughput
   – Large file transfers

The challenge is to achieve these three together.

Page 19:

Balance Between Requirements

High burst tolerance and high throughput vs. low latency:
• Deep buffers: queuing delays increase latency
• Shallow buffers: bad for bursts and throughput
• Reduced RTOmin (SIGCOMM '09): doesn't help latency
• AQM (RED): average queue is not fast enough for incast

Objective: low queue occupancy and high throughput. DCTCP targets this balance.

Page 20:

• Introduction
• Communications in Data Centers
• The DCTCP Algorithm
• Analysis and Experiments
• Conclusion

Page 21:

Review: TCP Congestion Control

• Four stages: slow start, congestion avoidance, fast retransmit, fast recovery
• A router must maintain one or more queues per port, so it is important to control the queue.
  – Two kinds of queue control algorithms:
    Queue management algorithm: manages the queue length by dropping packets when necessary
    Queue scheduling algorithm: determines the next packet to send

Page 22:

Two kinds of queue management algorithms

• Passive queue management: drop packets only after the queue is full.
  – Traditional methods
    Drop tail
    Random drop
    Drop front
  – Some drawbacks
    Lock-out: a few flows occupy the queue exclusively, preventing packets from other flows from entering the queue
    Full queues: congestion signals are sent only when the queue is full, so the queue stays full for long periods
• Active queue management (AQM)

Page 23:

AQM (dropping packets before the queue is full)

• RED (Random Early Detection) [RFC 2309]
  Calculate the average queue length (avgQ) to estimate the degree of congestion.
  Calculate the probability of dropping packets (P) according to the degree of congestion, using two thresholds, minth and maxth:
    avgQ < minth: don't drop packets
    minth < avgQ < maxth: drop packets with probability P
    avgQ > maxth: drop all packets
  Drawback: sometimes drops packets even though the queue isn't full (see the sketch after this list).
• ECN (Explicit Congestion Notification) [RFC 3168]
  A method that uses explicit feedback to notify congestion instead of dropping packets.
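A minimal sketch of the RED decision described in the first bullet. The threshold values (min_th, max_th), the maximum probability, and the EWMA weight are illustrative assumptions, not values from the slides; with ECN enabled, "drop" would become "mark CE" below max_th.

```python
import random

class REDQueue:
    """Toy RED drop/mark decision; thresholds and weight are illustrative."""
    def __init__(self, min_th=5, max_th=15, max_p=0.1, w_q=0.002):
        self.min_th, self.max_th = min_th, max_th  # thresholds, in packets
        self.max_p = max_p                         # drop probability reached at max_th
        self.w_q = w_q                             # EWMA weight for the average queue
        self.avg_q = 0.0

    def on_arrival(self, inst_q):
        # Estimate the degree of congestion from the average queue length.
        self.avg_q = (1 - self.w_q) * self.avg_q + self.w_q * inst_q
        if self.avg_q < self.min_th:
            return "enqueue"                       # no congestion signal
        if self.avg_q >= self.max_th:
            return "drop"                          # or mark CE, when ECN is enabled
        # Between the thresholds: drop (or mark) with probability P that
        # grows linearly with the degree of congestion.
        p = self.max_p * (self.avg_q - self.min_th) / (self.max_th - self.min_th)
        return "drop" if random.random() < p else "enqueue"
```

Note how the decision is driven by the averaged queue, which is exactly why RED reacts slowly to the sudden bursts of incast.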

Page 24:

ECN

• Routers or switches must support it (ECN-capable).
  – Two bits in the ECN field of the IP packet header:
    ECT (ECN-Capable Transport): set by the sender, to indicate whether the sender's transport protocol supports ECN
    CE (Congestion Experienced): set by routers or switches, to indicate whether congestion has occurred
  – Two flag bits in the TCP header:
    ECN-Echo (ECE): the receiver notifies the sender that it has received a CE packet
    CWR (Congestion Window Reduced): the sender notifies the receiver that it has decreased the congestion window

ECN is integrated with RED. A small reference sketch of the codepoints follows.
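A short reference sketch of the codepoints named on this slide, using the standard values from RFC 3168; the helper function and its name are ours, purely for illustration.

```python
# IP header: the two-bit ECN field (values from RFC 3168).
NOT_ECT = 0b00  # transport does not support ECN
ECT_1   = 0b01  # ECN-Capable Transport, set by the sender
ECT_0   = 0b10  # ECN-Capable Transport, set by the sender
CE      = 0b11  # Congestion Experienced, set by a router/switch

def router_forward(ecn_bits, congested):
    """Router/switch side: mark instead of drop when the packet is ECN-capable."""
    if congested and ecn_bits in (ECT_0, ECT_1):
        return CE          # signal congestion without losing the packet
    return ecn_bits

# TCP header flags that close the loop:
#   ECN-Echo (ECE): receiver -> sender, "I received a CE-marked packet"
#   CWR: sender -> receiver, "I have reduced my congestion window"
```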

Page 25:

ECN working principle

[Figure: source, router, destination. (1) The source sets the ECT bit in the IP header. (2) A congested router sets the CE bit instead of dropping the packet. (3) The destination echoes the congestion back by setting ECN-Echo in the TCP header of its ACK. (4) The source reduces its congestion window and sets CWR in the TCP header.]

Page 26:

Review: The TCP/ECN Control Loop

[Figure: Sender 1 and Sender 2 share a bottleneck queue to the Receiver; the switch sets a 1-bit ECN mark, which the receiver echoes back to the senders.]

ECN = Explicit Congestion Notification

Page 27:

Two Key Ideas

1. React in proportion to the extent of congestion, not its presence.
   Reduces variance in sending rates, lowering queuing requirements.
2. Mark based on instantaneous queue length.
   Fast feedback to better deal with bursts.

ECN marks (last 10 packets)    TCP                  DCTCP
1 0 1 1 1 1 0 1 1 1            Cut window by 50%    Cut window by 40%
0 0 0 0 0 0 0 0 0 1            Cut window by 50%    Cut window by 5%
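A toy comparison of the two reactions in the table above, treating the fraction of marked packets in the last window directly as α (i.e. taking g = 1 for simplicity); the function names are ours, and this reproduces the 40% and 5% cuts shown.

```python
def tcp_cut(window, marks):
    # Standard TCP/ECN: any congestion signal in the window halves it.
    return window / 2 if any(marks) else window

def dctcp_cut(window, marks):
    # DCTCP: cut in proportion to the fraction of marked packets.
    alpha = sum(marks) / len(marks)   # with g = 1, alpha equals F
    return window * (1 - alpha / 2)

heavy = [1, 0, 1, 1, 1, 1, 0, 1, 1, 1]   # 80% marked
light = [0] * 9 + [1]                    # 10% marked
print(tcp_cut(100, heavy), dctcp_cut(100, heavy))  # 50.0 vs 60.0  (40% cut)
print(tcp_cut(100, light), dctcp_cut(100, light))  # 50.0 vs 95.0  ( 5% cut)
```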

Page 28:

Data Center TCP Algorithm

Switch side:
– Mark arriving packets with CE when the instantaneous queue length exceeds a threshold K (buffer of size B; mark above K, don't mark below).

Sender side:
– Maintain a running estimate of the fraction of packets marked (α). In each RTT:
    α ← (1 − g) × α + g × F    (1)
  where F is the fraction of packets that were marked in the last window of data, and 0 < g < 1 is the weight given to new samples against the past in the estimation of α.
– Adaptive window decrease:
    cwnd ← cwnd × (1 − α/2)    (2)
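A minimal sketch of the algorithm as stated on this slide. The per-RTT update (1), the adaptive decrease (2), and the threshold-K marking are from the slide; the class structure, the g = 1/16 default, and the per-window accounting are illustrative assumptions.

```python
class DCTCPSender:
    """Sender-side DCTCP state; g and the initial window are assumptions."""
    def __init__(self, g=1.0 / 16, cwnd=10.0):
        self.g = g          # 0 < g < 1, weight given to new samples
        self.alpha = 0.0    # running estimate of the fraction of marked packets
        self.cwnd = cwnd    # congestion window, in packets

    def on_window_acked(self, acked, marked):
        """Called once per window of data (roughly once per RTT)."""
        F = marked / acked if acked else 0.0           # fraction marked this window
        self.alpha = (1 - self.g) * self.alpha + self.g * F
        if marked:
            self.cwnd *= (1 - self.alpha / 2)          # adaptive decrease
        else:
            self.cwnd += 1                             # usual additive increase

def switch_should_mark(queue_len_pkts, K):
    """Switch side: mark CE based on the instantaneous queue length."""
    return queue_len_pkts > K
```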

Page 29:

DCTCP in Action

[Figure: instantaneous queue length (KBytes) over time for TCP and DCTCP.]

Setup: Windows 7, Broadcom 1 Gbps switch
Scenario: 2 long-lived flows, K = 30 KB

Page 30:

• Introduction
• Communications in Data Centers
• The DCTCP Algorithm
• Analysis and Experiments
• Conclusion

Page 31:

Why it Works

1. High Burst Tolerance
   Large buffer headroom → bursts fit.
   Aggressive marking → sources react before packets are dropped.
2. Low Latency
   Small buffer occupancies → low queuing delay.
3. High Throughput
   ECN averaging → smooth rate adjustments, low variance in cwnd.

Page 32:

Analysis

[Figure: window size of a single flow over time, a sawtooth oscillating between (W*+1)(1 − α/2) and W*+1; the packets sent in the RTT during which the window crosses W* are marked.]

Page 33:

Analysis

We are interested in computing the following quantities:
  The maximum queue size (Qmax)
  The amplitude of queue oscillations (A)
  The period of oscillations (TC)

Page 34:

Analysis

• Consider N infinitely long-lived flows with identical round-trip times RTT, sharing a single bottleneck link of capacity C. We further assume that the N flows are synchronized.
• The queue size at time t is given by
    Q(t) = N × W(t) − C × RTT    (3)
  where W(t) is the window size of a single source.
• To compute the fraction of marked packets, α, let S(W1, W2) denote the number of packets sent by a sender while its window size increases from W1 to W2 > W1. Since this takes W2 − W1 round-trip times, during which the average window size is (W1 + W2)/2,
    S(W1, W2) ≈ (W2² − W1²) / 2    (4)

Page 35:

Analysis

• Let W* = (C × RTT + K) / N. This is the critical window size at which the queue size reaches K, and the switch starts marking packets with the CE codepoint. During the RTT it takes for the sender to react to these marks, its window size increases by one more packet, reaching W* + 1. Hence
    α = S(W*, W* + 1) / S((W* + 1)(1 − α/2), W* + 1)    (5)
• Plugging (4) into (5) and rearranging, we get:
    α² (1 − α/4) = (2W* + 1) / (W* + 1)² ≈ 2 / W*    (6)
• Assuming α is small, this can be simplified as α ≈ √(2 / W*). We can now compute A, TC and Qmax.

Page 36:

Analysis

• Note that the amplitude of oscillation in window size of a single flow, D, is given by:
    D = (W* + 1) − (W* + 1)(1 − α/2) = (W* + 1) α / 2    (7)
• Since there are N flows in total, the amplitude of the queue oscillations and their period (in round-trip times) are
    A = N × D ≈ (1/2) √(2 N (C × RTT + K))    (8)
    TC = D ≈ (1/2) √(2 (C × RTT + K) / N)    (9)
• Finally, using (3), we have:
    Qmax = N (W* + 1) − C × RTT = K + N    (10)
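A small numeric sanity check of these closed forms under assumed parameters (1 Gbps link, RTT = 100 µs, 1500-byte packets, N = 2 flows, K = 20 packets); the parameter values are ours, not from the slides.

```python
from math import sqrt

C = 1e9 / (1500 * 8)       # link capacity in 1500-byte packets per second (1 Gbps)
RTT = 100e-6               # 100 microseconds
N, K = 2, 20               # number of flows and marking threshold, in packets

bdp = C * RTT                       # C x RTT, in packets (about 8.3)
W_star = (bdp + K) / N              # critical window where marking starts
alpha = sqrt(2 / W_star)            # small-alpha approximation
D = (W_star + 1) * alpha / 2        # single-flow window oscillation
A = N * D                           # amplitude of queue oscillations, packets
T_C = D                             # period of oscillations, in RTTs
Q_max = N * (W_star + 1) - bdp      # maximum queue size = K + N packets

print(round(alpha, 3), round(A, 1), round(T_C, 1), round(Q_max, 1))
# about 0.376, 5.7 packets, 2.8 RTTs, 22.0 packets (= K + N)
```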

Page 37:

Analysis

• How do we set the DCTCP parameters?

Marking threshold (K). The minimum value of the queue occupancy sawtooth is given by:
    Qmin = Qmax − A = K + N − (1/2) √(2 N (C × RTT + K))
Choose K so that this minimum is larger than zero, i.e. the queue does not underflow. This results in:
    K > (C × RTT) / 7

Estimation gain (g). The estimation gain g must be chosen small enough to ensure the exponential moving average (1) "spans" at least one congestion event. Since a congestion event occurs every TC round-trip times, we choose g such that:
    (1 − g)^TC > 1/2
Plugging in (9) with the worst case value N = 1 results in the following criterion:
    g < 1.386 / √(2 (C × RTT + K))

A worked example with assumed link speeds follows.
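A hedged worked example of the two guidelines. The link speeds, RTT, and packet size are assumptions; the K values match those used in the testbed slides later (K = 20 at 1 Gbps, K = 65 at 10 Gbps).

```python
from math import sqrt

def dctcp_params(link_bps, rtt_s, K_pkts, pkt_bytes=1500):
    c = link_bps / (pkt_bytes * 8)               # capacity, packets per second
    bdp = c * rtt_s                              # C x RTT, in packets
    k_lower = bdp / 7                            # guideline: K > (C x RTT) / 7
    g_upper = 1.386 / sqrt(2 * (bdp + K_pkts))   # gain bound, worst case N = 1
    return k_lower, g_upper

print(dctcp_params(1e9, 100e-6, K_pkts=20))    # ~ (1.2 pkts, 0.18)  at 1 Gbps
print(dctcp_params(10e9, 100e-6, K_pkts=65))   # ~ (11.9 pkts, 0.08) at 10 Gbps
```

Both testbed settings sit comfortably above the K lower bound, and a small gain such as g = 1/16 ≈ 0.0625 satisfies the g bound in both cases.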

Page 38:

Evaluation

• Implemented in the Windows stack.
• Real hardware, 1 Gbps and 10 Gbps experiments
  – 90-server testbed
  – Broadcom Triumph: 48 1G ports, 4 MB shared memory
  – Cisco Cat4948: 48 1G ports, 16 MB shared memory
  – Broadcom Scorpion: 24 10G ports, 4 MB shared memory
• Numerous benchmarks
  – Incast
  – Queue buildup
  – Buffer pressure

Page 39:

Experiment setup

• 45 1G servers connected to a Triumph switch, plus a 10G server for the external connection
  – 1 Gbps links: K = 20
  – 10 Gbps link: K = 65
• Generate query and background traffic
  – 10 minutes, 200,000 background flows, 188,000 queries
• Metric:
  – Flow completion time for queries and background flows.

We use RTOmin = 10 ms for both TCP and DCTCP.

Page 40:

Baseline

[Figure: flow completion times for background flows (95th percentile) and for query flows.]

Page 41:

Baseline

[Figure: flow completion times for background flows (95th percentile) and for query flows.]

Low latency for short flows.

Page 42:

Baseline

[Figure: flow completion times for background flows (95th percentile) and for query flows.]

Low latency for short flows. High burst tolerance for query flows.

Page 43:

Scaled Background & Query: 10x Background, 10x Query

[Figure: flow completion times (95th percentile) under the scaled workload.]

Page 44:

These results make three key points

• First, if our data center used DCTCP it could handle 10x larger query responses and 10x larger background flows while performing better than it does with TCP today.
• Second, while using deep-buffered switches (without ECN) improves performance of query traffic, it makes performance of short transfers worse, due to queue buildup.
• Third, while RED improves performance of short transfers, it does not improve the performance of query traffic, due to queue length variability.

Page 45:

Conclusions

• DCTCP satisfies all our requirements for data center packet transport.
  Handles bursts well
  Keeps queuing delays low
  Achieves high throughput
• Features:
  Very simple change to TCP and a single switch parameter K.
  Based on ECN mechanisms already available in commodity switches.
