Tagger: Practical PFC Deadlock Prevention in Data Center Networks. Shuihai Hu* (HKUST), Yibo Zhu, Peng Cheng, Chuanxiong Guo* (Toutiao), Kun Tan* (Huawei), Jitendra Padhye, Kai Chen (HKUST). Microsoft. * Work done while at Microsoft. CoNEXT 2017, Incheon, South Korea.


Page 1: Tagger: Practical PFC Deadlock Prevention in Data Center Networks

Shuihai Hu* (HKUST), Yibo Zhu, Peng Cheng, Chuanxiong Guo* (Toutiao), Kun Tan* (Huawei), Jitendra Padhye, Kai Chen (HKUST)

Microsoft

* Work done while at Microsoft

CoNEXT 2017, Incheon, South Korea

Page 2: RDMA is Being Widely Deployed

RDMA: Remote Direct Memory Access
  – High throughput, low latency with low CPU overhead
  – Microsoft, Google, etc. are deploying RDMA

[Figure: two hosts, each with an RDMA application, a kernel, and an RDMA NIC; each application reaches its NIC via kernel bypass, and the NICs are connected by a lossless network (with PFC)]

Page 3: Priority Flow Control (PFC)

• PAUSE the upstream switch when the PFC threshold is reached
  – Avoids packet drops due to buffer overflow

[Figure: congestion fills a switch buffer up to the PFC threshold (3 pkts); PAUSE is sent to the upstream switch]
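The PAUSE behavior on this slide can be sketched as follows. This is a minimal sketch, not how any particular switch implements PFC: the class name `IngressQueue`, its fields, and the rule of resuming as soon as occupancy drops below the threshold are assumptions made for illustration (real switches typically use separate pause and resume thresholds plus headroom).

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class IngressQueue:
    """One lossless ingress queue with a PFC threshold counted in packets."""
    pfc_threshold: int = 3  # the slide's example threshold: 3 packets
    packets: deque = field(default_factory=deque)
    upstream_paused: bool = False

    def enqueue(self, pkt) -> None:
        self.packets.append(pkt)
        # Reaching the threshold triggers a PAUSE toward the upstream switch.
        if len(self.packets) >= self.pfc_threshold:
            self.upstream_paused = True

    def dequeue(self):
        pkt = self.packets.popleft()
        # Assumption: resume once occupancy falls below the same threshold.
        if len(self.packets) < self.pfc_threshold:
            self.upstream_paused = False
        return pkt


q = IngressQueue()
q.enqueue("p1")
q.enqueue("p2")
paused_before = q.upstream_paused   # still below the threshold
q.enqueue("p3")
paused_at_threshold = q.upstream_paused
q.dequeue()
paused_after_drain = q.upstream_paused
```

The point of the sketch is that the upstream switch is paused only while this queue cannot drain; if the queue's own next hop is also paused, the PAUSE never clears.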

Page 4: A Simple Illustration of PFC Deadlock

• Due to Cyclic Buffer Dependency (CBD): A->B->C->A
• Not just a theoretical problem: we have seen it in our datacenters too!

[Figure: Switch A, Switch B, and Switch C form a cycle; buffers are at the PFC threshold and PAUSE frames are sent around the cycle]

Page 5: CBD in the Clos Network

[Figure: Clos topology with switches S1, S2; L1 to L4; T1 to T4]

Page 6: CBD in the Clos Network

Consider two flows that initially follow shortest UP-DOWN paths.

[Figure: Clos topology (S1, S2; L1 to L4; T1 to T4) with flow 1 and flow 2]

Page 7: CBD in the Clos Network

Due to link failures, both flows are locally rerouted to non-shortest paths.

[Figure: Clos topology (S1, S2; L1 to L4; T1 to T4) with flow 1 and flow 2 on their rerouted paths]

Page 8: CBD in the Clos Network

These two DOWN-UP bounced flows create a CBD: L2->S1->L3->S2->L2

[Figure: Clos topology with flow 1 and flow 2, next to the buffer dependency graph over the RX buffers of L2, S1, L3, and S2]
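A buffer dependency graph like the one on this slide can be derived from flow paths and checked for a cycle. The sketch below is illustrative only: the transcript does not give the exact paths of flow 1 and flow 2, so `FLOW_1` and `FLOW_2` are hypothetical bounced paths chosen to reproduce the cycle L2->S1->L3->S2->L2, and the function names are assumptions.

```python
def dependency_edges(paths):
    """Map each ingress (RX) buffer to the buffers its packets may wait on.

    A buffer is named (upstream, switch): the RX buffer at `switch`
    that is fed by `upstream`.
    """
    edges = {}
    for path in paths:
        links = list(zip(path, path[1:]))
        for held, needed in zip(links, links[1:]):
            edges.setdefault(held, set()).add(needed)
            edges.setdefault(needed, set())
    return edges


def find_cycle(edges):
    """Return one cycle as a list of buffers, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in edges}
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for nxt in sorted(edges[node]):
            if color[nxt] == GREY:            # back edge closes a cycle
                return stack[stack.index(nxt):]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in sorted(edges):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical paths: each flow bounces DOWN-UP once (at L3 and at L2).
FLOW_1 = ["T2", "L2", "S1", "L3", "S2", "L4", "T4"]
FLOW_2 = ["T3", "L3", "S2", "L2", "S1", "L1", "T1"]

cycle = find_cycle(dependency_edges([FLOW_1, FLOW_2]))
no_cycle = find_cycle(dependency_edges([["T2", "L2", "S1", "L3", "T4"],
                                        ["T3", "L3", "S2", "L2", "T1"]]))
```

With the two bounced paths, `cycle` contains the four RX buffers on L2, S1, L3, and S2; with plain UP-DOWN paths between the same switches, no cycle is found.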

Page 9: Real in Production Data Centers?

Packet reroute measurements in more than 20 data centers:

~100,000 DOWN-UP reroutes!

Page 10: Handling Deadlock is Important

• #1: transient problem -> PERMANENT deadlock
  – Transient loops due to link failures
  – Packet flooding
  – …
• #2: small deadlock can cause large deadlock

[Figure: a deadlock, with PAUSE frames propagating outward from it]

Page 11: Three Key Challenges

What are the challenges in designing a practical deadlock prevention solution?

• No change to existing routing protocols or hardware
• Link failures & routing errors are unavoidable at scale
• Switches support only a limited number of lossless priorities: at most 8 (and typically only two can be used)

Page 12: The Existing Deadlock Prevention Solutions

• #1: deadlock-free routing protocols
  – not supported by commodity switches (fail challenge #1)
  – do not work with link failures or routing errors (fail challenge #2)
• #2: buffer management schemes
  – require a lot of lossless priorities (fail challenge #3)

Our answer: Tagger

Page 13: TAGGER DESIGN

Page 14: Important Observation

• Fat-tree [Sigcomm’08], VL2 [Sigcomm’09]
  – desired path set: all shortest paths
• BCube [Sigcomm’09], HyperX [SC’09]
  – desired path set: dimension-order paths

Takeaway: In a data center, we can ask the operator to supply a set of expected lossless paths (ELP)!

Page 15: Basic Idea of Tagger

1. Ask operators to provide:
  – topology & expected lossless paths (ELP)

2. Packets carry tags while in the network

3. Pre-install match-action rules at switches for tag manipulation and packet queueing
  – packets travel over ELP: lossless queues & CBD never forms
  – packets deviate from ELP: lossy queue, thus PFC is not triggered

Page 16: Illustrating Tagger for Clos Topology

ELP = all shortest paths (CBD-free)

Root cause of CBD: packets deviate from UP-DOWN routing!

[Figure: Clos topology (S1, S2; L1 to L4; T1 to T4) with flow 1 and flow 2]
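Whether a path deviates from UP-DOWN routing can be checked by counting DOWN-UP bounces. The sketch below assumes the switch names on the slides encode their layer by first letter, ordered T < L < S; that ordering, the `LAYER` table, and the example paths are assumptions for illustration.

```python
# Assumed layer ranks for the switch names used on the slides: T < L < S.
LAYER = {"T": 0, "L": 1, "S": 2}


def rank(switch):
    """Layer rank of a switch, taken from the first letter of its name."""
    return LAYER[switch[0]]


def count_bounces(path):
    """Count DOWN-UP bounces: a downward hop immediately followed by an upward hop."""
    bounces = 0
    for prev, cur, nxt in zip(path, path[1:], path[2:]):
        if rank(prev) > rank(cur) < rank(nxt):
            bounces += 1
    return bounces


up_down = count_bounces(["T2", "L2", "S1", "L3", "T4"])              # shortest path
bounced = count_bounces(["T2", "L2", "S1", "L3", "S2", "L4", "T4"])  # bounce at L3
```

A path with zero bounces is a pure UP-DOWN path; any positive count means the packet left the CBD-free ELP described on this slide.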

Page 17: Illustrating Tagger for Clos Topology

• Under Tagger, packets carry tags when travelling in the network
• Initially, tag value = NoBounce
• At switches, Tagger pre-installs match-action rules for tag manipulation

Match-action rules installed at switches:

  Match                         Action
  Tag       InPort   OutPort    NewTag
  NoBounce  S1       S2         Bounced
  …         …        …          …

[Figure: Clos topology (S1, S2; L1 to L4; T1 to T4) with flow 1 carrying tag = NoBounce]

Page 18: Illustrating Tagger for Clos Topology

Packet received by switch L3 with tag = NoBounce.

Match-action rules installed at switches:

  Match                         Action
  Tag       InPort   OutPort    NewTag
  NoBounce  S1       S2         Bounced
  …         …        …          …

[Figure: Clos topology (S1, S2; L1 to L4; T1 to T4) with flow 1 arriving at L3]

Page 19: Illustrating Tagger for Clos Topology

• Rewrite the tag once a DOWN-UP bounce is detected
• Down-up bounce observed: the matching rule rewrites tag = NoBounce to Bounced

  Match                         Action
  Tag       InPort   OutPort    NewTag
  NoBounce  S1       S2         Bounced
  …         …        …          …

[Figure: Clos topology (S1, S2; L1 to L4; T1 to T4) with flow 1 bouncing at L3]

Page 20: Illustrating Tagger for Clos Topology

• S2 knows it is a bounced packet that deviates from ELP -> placed in the lossy queue
• No PFC PAUSE sent from S2 to L3 -> buffer dependency from L3 to S2 removed

[Figure: Clos topology (S1, S2; L1 to L4; T1 to T4) with flow 1 arriving at S2 with tag = Bounced]
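The tag rewrite at L3 and the queue choice at S2 can be sketched as a small lookup. This is a hedged illustration, not the switch configuration: the `RULES` layout, the function names, and the "lossless"/"lossy" labels are assumptions; only the rule (NoBounce, S1, S2) -> Bounced comes from the slides.

```python
NO_BOUNCE, BOUNCED = "NoBounce", "Bounced"

# Per-switch match-action rules: (tag, in_port, out_port) -> new_tag.
# Only the L3 rule shown on the slides is listed here.
RULES = {
    "L3": {(NO_BOUNCE, "S1", "S2"): BOUNCED},
}


def rewrite_tag(switch, tag, in_port, out_port):
    """Apply the switch's matching rule; keep the tag when nothing matches."""
    return RULES.get(switch, {}).get((tag, in_port, out_port), tag)


def select_queue(tag):
    """Packets still on the ELP stay lossless; bounced packets go to the lossy queue."""
    return "lossless" if tag == NO_BOUNCE else "lossy"


# Flow 1 enters L3 from S1 and is forwarded back up to S2 (a DOWN-UP bounce).
tag_after_l3 = rewrite_tag("L3", NO_BOUNCE, "S1", "S2")
queue_at_s2 = select_queue(tag_after_l3)

# A packet entering L3 from T3 on its way up to S2 is not a bounce.
tag_no_bounce = rewrite_tag("L3", NO_BOUNCE, "T3", "S2")
```

Because S2 enqueues the bounced packet in a lossy queue, congestion there leads to drops rather than a PAUSE toward L3, which is what removes the dependency edge.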

Page 21: Illustrating Tagger for Clos Topology

• Tagger will do the same for packets of flow 2
• 2 buffer dependency edges are removed -> CBD is eliminated

[Figure: Clos topology with flow 2; the buffer dependency graph over the RX buffers of L2, S1, L3, and S2, shown with the CBD L2->S1->L3->S2->L2 and again with the two edges removed]

Page 22: What If ELP Has CBD?

ELP = shortest paths + 1-bounce paths (ELP has CBD now!)

[Figure: Clos topology (S1, S2; L1 to L4; T1 to T4)]

Page 23: Segmenting ELP into CBD-free Subsets

• Two bounced paths are in ELP now
• Path segments before bounce: only UP-DOWN paths, no CBD
• Path segments after bounce: only UP-DOWN paths, no CBD

[Figure: three views of the Clos topology with flow 1 and flow 2: the full bounced paths, the segments before the bounce, and the segments after the bounce]

Page 24: Isolating Path Segments with Tags

• tag 1 -> path segments before bounce
• tag 2 -> path segments after bounce

[Figure: two views of the Clos topology with flow 1 and flow 2, one per tag]

Page 25: Isolating Path Segments with Tags

Adding a rule at switch L3: (Tag = 1, InPort = S1, OutPort = S2) -> NewTag = 2

[Figure: two views of the Clos topology; flow 1 is shown with tag = 1 in one and tag = 2 in the other]

Page 26: No CBD after Segmentation

• Packets with tag i -> i-th lossless queue

[Figure: buffer dependency graph over L2, S1, L3, and S2, with each buffer labeled by tag (1 or 2), shown against the earlier CBD L2->S1->L3->S2->L2; alongside, two views of the Clos topology with flow 1 and flow 2 split into tag 1 and tag 2 segments]
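The effect of segmentation can be checked by treating each (RX buffer, tag) pair as its own node, since packets with tag i use the i-th lossless queue. The sketch below is an illustration under assumptions: the two flow paths are hypothetical (the transcript does not list them), the layer order T < L < S is assumed, and the function names are invented. It uses the standard-library `graphlib` module (Python 3.9+) to detect cycles.

```python
from graphlib import CycleError, TopologicalSorter

LAYER = {"T": 0, "L": 1, "S": 2}  # assumed layer ranks: T < L < S


def tagged_buffers(path, use_tags=True):
    """List (upstream, switch, tag) for every ingress buffer along a path."""
    tag = 1
    buffers = []
    for i in range(1, len(path)):
        prev, cur = path[i - 1], path[i]
        buffers.append((prev, cur, tag))  # packet arrives with the current tag
        if use_tags and i + 1 < len(path):
            nxt = path[i + 1]
            if LAYER[prev[0]] > LAYER[cur[0]] < LAYER[nxt[0]]:
                tag += 1  # e.g. (Tag = 1, InPort = S1, OutPort = S2) -> NewTag = 2
    return buffers


def has_cbd(paths, use_tags=True):
    """True if the buffer dependency graph built from the paths has a cycle."""
    graph = {}
    for path in paths:
        bufs = tagged_buffers(path, use_tags)
        for held, needed in zip(bufs, bufs[1:]):
            graph.setdefault(held, set()).add(needed)
    try:
        tuple(TopologicalSorter(graph).static_order())
    except CycleError:
        return True
    return False


# Hypothetical 1-bounce paths that reproduce the cycle L2->S1->L3->S2->L2.
FLOW_1 = ["T2", "L2", "S1", "L3", "S2", "L4", "T4"]
FLOW_2 = ["T3", "L3", "S2", "L2", "S1", "L1", "T1"]

cbd_single_queue = has_cbd([FLOW_1, FLOW_2], use_tags=False)
cbd_with_tags = has_cbd([FLOW_1, FLOW_2], use_tags=True)
```

With a single lossless queue the two flows form a cycle; once the tag is raised at each bounce, the tag 1 and tag 2 buffers are separate nodes and the cycle disappears.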

Page 27: What If k-bounce Paths all in ELP?

ELP = shortest up-down paths + 1-bounce paths + … + k-bounce paths

Solution: just segment ELP into k+1 CBD-free subsets based on the number of bounces!

[Figure: Clos topology (S1, S2; L1 to L4; T1 to T4)]

Page 28: Summary: Tagger Design for Clos Topology

1. Initially, packets carry tag = 1

2. Pre-install match-action rules at switches:
  • DOWN-UP bounce: increase tag by 1
  • Enqueue packets with tag i to the i-th lossless queue (i <= k+1)
  • Enqueue packets with tag i to the lossy queue (i > k+1)

For Clos topology, Tagger is optimal in terms of # of lossless priorities.
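The per-switch behavior summarized on this slide can be sketched as two small functions. This is a minimal sketch: the function names, the boolean port flags, and the queue labels are assumptions; the tag increment and the i <= k+1 test come from the slide.

```python
def ingress_queue(tag, k):
    """Tag i maps to the i-th lossless queue while i <= k + 1, else to the lossy queue."""
    return f"lossless-{tag}" if tag <= k + 1 else "lossy"


def next_tag(tag, in_from_upper_layer, out_to_upper_layer):
    """Increase the tag by 1 on a DOWN-UP bounce; otherwise leave it unchanged."""
    return tag + 1 if in_from_upper_layer and out_to_upper_layer else tag


k = 1                                   # ELP allows up to one bounce
tag = 1                                 # packets start with tag = 1
q_initial = ingress_queue(tag, k)
tag = next_tag(tag, True, True)         # first DOWN-UP bounce
q_after_one_bounce = ingress_queue(tag, k)
tag = next_tag(tag, True, True)         # second bounce exceeds the ELP
q_after_two_bounces = ingress_queue(tag, k)
```

With k = 1 the packet uses two lossless queues in turn and falls to the lossy queue only after a second bounce, so k + 1 lossless priorities are enough.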

Page 29: How to Implement Tagger?

• Use the DSCP field in the IP header as the tag carried in packets

• Build a 3-step match-action pipeline with basic ACL rules available in commodity switches
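Carrying the tag in DSCP means writing it into the upper six bits of the IPv4 ToS (DS) byte, with the lower two bits left for ECN. The sketch below shows only that bit layout; using the tag value directly as the DSCP code point is an assumption for illustration, and the function names are hypothetical.

```python
DSCP_BITS = 6
DSCP_MAX = (1 << DSCP_BITS) - 1  # 63: the largest 6-bit DSCP value


def set_dscp(tos_byte, dscp):
    """Write a DSCP value into the upper six bits, preserving the two ECN bits."""
    if not 0 <= dscp <= DSCP_MAX:
        raise ValueError("DSCP must fit in 6 bits")
    return (dscp << 2) | (tos_byte & 0b11)


def get_dscp(tos_byte):
    """Read the DSCP value back from the ToS byte."""
    return tos_byte >> 2


# Assumption: the tag value itself is used as the DSCP code point.
tos = set_dscp(0b00000001, 2)   # tag 2, ECN bits 01 preserved
tag = get_dscp(tos)
```

Six bits allow 64 code points, which is far more than the handful of tags the Clos design needs.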

Page 30: Tagger Meets All the Three Challenges

1. Works with existing routing protocols & hardware

2. Works with link failures & routing errors

3. Works with a limited number of lossless queues

Page 31: More Details in the Paper

• Proof of deadlock freedom

• Analysis & Discussions
  – Algorithm complexity
  – Optimality
  – Compression of match-action rules
  – …

Page 32: Evaluation-1: Tagger prevents Deadlock

Scenario: two flows form a CBD.

Tagger avoids the CBD caused by bounced flows, and prevents deadlock!

[Figure: Clos topology (S1, S2; L1 to L4; T1 to T4) with flow 1 and flow 2; an annotation marks "deadlock!"]

Page 33: Evaluation-2: Scalability of Tagger

Tagger is scalable in terms of number of lossless priorities and ACL rules.

[Table: match-action rules and priorities required for Jellyfish topology]

* Last entry includes an additional 20,000 random paths.

Page 34: Evaluation-3: Overhead of Tagger

Tagger rules have no impact on throughput and latency.

Page 35: Conclusion

• Tagger: a tagging system that guarantees deadlock freedom
  – Practical:
      – requires no change to existing routing protocols
      – implementable with existing commodity switching ASICs
      – works with a limited number of lossless priorities
  – General:
      – works with any topology
      – works with any ELP

Page 36: Thanks!