zUpdate: Updating Data Center Networks with Zero Loss
Hongqiang Harry Liu (Yale University), Xin Wu (Duke University), Ming Zhang, Lihua Yuan, Roger Wattenhofer, Dave Maltz (Microsoft)
DCN is constantly in flux
[Figure: switches (upgrade, reboot, new switch) and the traffic flows crossing them]
DCN is constantly in flux
[Figure: switches, virtual machines, and the traffic flows between them]
Network updates are painful for operators
Bob: an operator performing a switch upgrade
Complex planning. Two weeks before the update, Bob has to:
• Coordinate with application owners
• Prepare a detailed update plan
• Review and revise the plan with colleagues
Unexpected performance faults. On the night of the update, Bob executes the plan by hand, but:
• Application alerts are triggered unexpectedly
• Switch failures force him to backpedal several times
Laborious process. Eight hours later, Bob is still stuck with the update:
• No sleep overnight
• Numerous application complaints
• No quick fix in sight
Bob: "Holy C**p"
Congestion-free DCN update is the key
• Applications want network updates to be seamless
  • Reachability
  • Low network latency (propagation, queuing)
  • No packet drops
• Congestion-free updates are hard
  • Many switches are involved
  • Multi-step plan
  • Different scenarios have distinct requirements
  • Interactions between network and traffic demand changes
[Figure callout: Congestion]
A Clos network with ECMP
[Figure: Clos topology with CORE 1-4, AGG 1-6, and ToR 1-5; link capacity: 1000; all switches use Equal-Cost Multi-Path (ECMP). Annotated loads: 600, 300, and 150.]
Load on one link: 620 + 150 + 150 = 920
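The arithmetic on this slide can be sketched as follows. This is a minimal illustration, assuming that ECMP halves a flow at each tier (two equal-cost next hops per switch); the function name `ecmp_split` is hypothetical and not taken from the slides.

```python
LINK_CAPACITY = 1000  # from the slide


def ecmp_split(load, num_next_hops):
    """Plain ECMP: divide a load evenly across equal-cost next hops."""
    return [load / num_next_hops] * num_next_hops


# Assumption for illustration: two equal-cost next hops at each tier.
tor_shares = ecmp_split(600, 2)            # 600 leaves the ToR as 300 + 300
agg_shares = ecmp_split(tor_shares[0], 2)  # each 300 leaves the AGG as 150 + 150

# The sum shown on the slide: 620 already on the link plus two 150 shares.
link_load = 620 + 150 + 150
print(tor_shares, agg_shares, link_load, link_load <= LINK_CAPACITY)
```

With these numbers the link stays below its capacity of 1000, which is the congestion-free starting point for the update scenarios that follow.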
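Switch upgrade: a naïve solution triggers congestion
[Figure: the same Clos topology with AGG1 drained; link capacity: 1000; one uplink now carries 600.]
Draining AGG1 raises the load on one link from 620 + 150 + 150 = 920 to 620 + 150 + 300 = 1070, which exceeds the link capacity.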
Switch upgrade: a smarter solution seems to be working
[Figure: the same Clos topology with AGG1 drained; link capacity: 1000; weighted ECMP splits one flow 500 / 100.]
With weighted ECMP, the link load becomes 620 + 300 + 50 = 970 instead of 620 + 300 + 150 = 1070.
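A minimal sketch of the weighted split, assuming the 500 / 100 labels in the figure are the two uplink shares of one 600-unit flow and that the smaller share is halved again before it reaches the loaded link. The weights `[5, 1]` and the function name `weighted_split` are hypothetical, chosen only to reproduce the numbers on the slide.

```python
LINK_CAPACITY = 1000  # from the slide


def weighted_split(load, weights):
    """Weighted ECMP: divide a load across next hops in proportion to weights."""
    total = sum(weights)
    return [load * w / total for w in weights]


# Hypothetical weights that reproduce the 500 / 100 split on the slide.
shares = weighted_split(600, [5, 1])

# If only the drained flow is rerouted, the link carries 620 + 300 + 150.
naive_load = 620 + 300 + 150

# With the weighted split, the 100 share contributes 50 to the link instead of 150.
weighted_load = 620 + 300 + shares[1] / 2
print(shares, naive_load, weighted_load)
```

Plain ECMP is the special case where all weights are equal.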
Traffic distribution transition
Initial traffic distribution (congestion-free): annotated loads 300, 300, 300, 300
Final traffic distribution (congestion-free): annotated loads 0, 600, 500, 100
Transition from initial to final with asynchronous switch updates: simple? NO!
[Figures: the Clos topology (CORE 1-4, AGG 1-6, ToR 1-5) shown once per distribution]
Asynchronous changes can cause transient congestion
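[Figure: the Clos topology with AGG1 being drained; link capacity: 1000. ToR1 already uses its new split (600 on one uplink); ToR5 is marked "Not Yet" and still uses its old split (300 / 300).]
When ToR1 is changed but ToR5 is not yet: 620 + 300 + 150 = 1070, which exceeds the link capacity.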
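The four possible orderings of the two switch changes can be enumerated directly. This is a sketch under stated assumptions: the per-switch contributions to the loaded link (150 before and 300 after for ToR1, 150 before and 50 after for ToR5) are read off the sums shown on the earlier slides, and the name `BACKGROUND` for the 620 term is hypothetical.

```python
from itertools import product

LINK_CAPACITY = 1000
BACKGROUND = 620  # the 620 term in the slide's sums

# Contribution of each ToR's flow to the loaded link, before and after its change.
contribution = {
    "ToR1": {"old": 150, "new": 300},
    "ToR5": {"old": 150, "new": 50},
}

# Every combination of "already changed" / "not yet changed" for the two switches.
loads = {
    (s1, s5): BACKGROUND + contribution["ToR1"][s1] + contribution["ToR5"][s5]
    for s1, s5 in product(("old", "new"), repeat=2)
}

congested = [state for state, load in loads.items() if load > LINK_CAPACITY]
print(loads)
print(congested)
```

Only the state in which ToR1 has changed and ToR5 has not exceeds the capacity, even though both the initial and the final state are congestion-free.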
Solution: introducing an intermediate step
Initial: annotated loads 300, 300, 300, 300
Intermediate: annotated loads 200, 400, 450, 150
Final: annotated loads 0, 600, 500, 100
Transition from initial to intermediate: congestion-free regardless of asynchronization
Transition from intermediate to final: congestion-free regardless of asynchronization
[Figures: the Clos topology (CORE 1-4, AGG 1-6, ToR 1-5) shown once per distribution]
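A rough check of why the intermediate step helps, under assumptions that are not stated on the slide: each flow's contribution to the loaded link is taken to be half of the relevant uplink share (as in the 300 to 150 and 100 to 50 figures on earlier slides), so the intermediate shares of 400 and 150 are assumed to contribute 200 and 75. The worst case for a transition takes, per flow, the larger of its contributions before and after.

```python
BACKGROUND, CAPACITY = 620, 1000

# (contribution of ToR1's flow, contribution of ToR5's flow) to the loaded link.
# The "intermediate" entry is an assumption derived by halving 400 and 150.
steps = {
    "initial": (150, 150),
    "intermediate": (200, 75),
    "final": (300, 50),
}


def worst_case(before, after):
    """Largest load the link can see while switches change asynchronously."""
    return BACKGROUND + sum(max(a, b) for a, b in zip(steps[before], steps[after]))


print(worst_case("initial", "final"))         # one-step update
print(worst_case("initial", "intermediate"))  # first step of the two-step plan
print(worst_case("intermediate", "final"))    # second step of the two-step plan
```

Under these assumptions the one-step update can exceed the capacity of 1000, while both steps of the two-step plan stay below it.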
How zUpdate performs congestion-free update
[Diagram: the operator describes an update scenario, which becomes a set of update requirements. zUpdate combines these with the current traffic distribution of the data center network and outputs a target traffic distribution plus the intermediate traffic distributions leading to it.]
Key technical issues
• Describing traffic distribution
• Representing update requirements
• Defining conditions for congestion-free transition
• Computing an update plan
• Implementing an update plan
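Describing traffic distribution
$l^f_{v,u}$: flow f’s load on the link from switch v to u
Traffic distribution: the set of $l^f_{v,u}$ values over all flows and links
[Figure: flow f (600) crossing switches s1-s5 in the ToR, AGG, and CORE tiers; annotated link loads of 300 and 150]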
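One way to hold such a traffic distribution in code is a mapping from (flow, v, u) to load. This is only a sketch: the wiring s1 to {s2, s3} to {s4, s5} is an assumption made for illustration, and the names `distribution` and `link_loads` are hypothetical.

```python
from collections import defaultdict

# distribution[(flow, v, u)] = load of `flow` on the link from switch v to u.
# The wiring s1 -> {s2, s3} -> {s4, s5} is an assumption for illustration.
distribution = {
    ("f", "s1", "s2"): 300, ("f", "s1", "s3"): 300,
    ("f", "s2", "s4"): 150, ("f", "s2", "s5"): 150,
    ("f", "s3", "s4"): 150, ("f", "s3", "s5"): 150,
}


def link_loads(dist):
    """Total load per link, summed over all flows."""
    loads = defaultdict(float)
    for (_flow, v, u), load in dist.items():
        loads[(v, u)] += load
    return dict(loads)


loads = link_loads(distribution)
print(loads[("s1", "s2")], loads[("s2", "s4")])
```

Keeping the flow in the key matters later: the congestion-free condition is stated per flow and per link, not only per link.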
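Representing update requirements
[Figure: flow f crossing switches s1-s5 in the ToR, AGG, and CORE tiers]
To upgrade switch s2: drain s2
Constraint: f’s load on the links through s2 = 0
To restore ECMP: when s2 recovers
Constraint: f’s load on one equal-cost link = f’s load on the other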
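The two requirements can be written as simple predicates over a traffic distribution. This is a hedged sketch, not the authors' formulation: the representation `dist[(flow, v, u)]`, the helper names `is_drained` and `is_ecmp`, and the choice of s2 and s3 as the equal-cost next hops of s1 are all assumptions.

```python
def is_drained(dist, switch):
    """True if no flow places load on any link into or out of `switch`."""
    return all(load == 0 for (_f, v, u), load in dist.items() if switch in (v, u))


def is_ecmp(dist, flow, v, next_hops):
    """True if `flow` is split evenly from v across the given next hops."""
    shares = [dist.get((flow, v, u), 0) for u in next_hops]
    return len(set(shares)) <= 1


# Hypothetical distributions for a 600-unit flow leaving s1.
drained = {("f", "s1", "s2"): 0, ("f", "s1", "s3"): 600}
restored = {("f", "s1", "s2"): 300, ("f", "s1", "s3"): 300}

print(is_drained(drained, "s2"), is_ecmp(restored, "f", "s1", ["s2", "s3"]))
```

In an optimization setting the same conditions would appear as equality constraints on the target distribution rather than as after-the-fact checks.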
Switch asynchronization exponentially inflates the possible load values
[Figure: flow f travels from ingress to egress through switches 1-8 during the transition from the old traffic distribution to the new one; $l^f_{7,8}$ is f’s load on link $e_{7,8}$.]
Asynchronous updates can result in $2^5$ possible load values on link $e_{7,8}$ during the transition.
In large networks, it is infeasible to check whether every possible load value stays within link capacity.
Two-phase commit reduces the possible load values to two
• With two-phase commit, f’s load on link $e_{v,u}$ only has two possible values throughout a transition: $l^f_{v,u}(\text{old})$ or $l^f_{v,u}(\text{new})$
[Figure: flow f travels from ingress to egress through switches 1-8; a version flip at the ingress moves f from the old traffic distribution to the new one.]
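The version flip can be illustrated with a toy rule table. This is a sketch with a hypothetical three-switch topology and hypothetical names (`rules`, `path`, `ingress_tag`); it only shows why every packet follows either the complete old path or the complete new path.

```python
# rules[switch][version] = next hop for flow f under that rule version.
# Topology and names are hypothetical.
rules = {
    "ingress": {1: "A"},
    "A": {1: "egress"},
    "B": {},
}


def path(version):
    """Follow the rules of a single version from ingress to egress."""
    node, hops = "ingress", ["ingress"]
    while node != "egress":
        node = rules[node][version]
        hops.append(node)
    return hops


# Phase 1: install version-2 rules next to the version-1 rules.
# The install order does not matter because no packet carries
# the version-2 tag yet.
rules["B"][2] = "egress"
rules["ingress"][2] = "B"
ingress_tag = 1
before_flip = path(ingress_tag)   # traffic still follows the old path

# Phase 2: flip the tag stamped at the ingress switch.
ingress_tag = 2
after_flip = path(ingress_tag)    # traffic now follows the new path

print(before_flip, after_flip)
```

Because a packet is matched against one version only, no packet can mix old rules at one switch with new rules at another, which is what collapses the per-flow load on a link to two possible values.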
Flow asynchronization exponentially inflates the possible load values
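[Figure: flows f1 and f2 share link $e_{7,8}$ in a topology of switches 1-8.]
Possible values of the f1 + f2 load on $e_{7,8}$:
• $l^{f_1}_{7,8}(\text{old}) + l^{f_2}_{7,8}(\text{old})$
• $l^{f_1}_{7,8}(\text{old}) + l^{f_2}_{7,8}(\text{new})$
• $l^{f_1}_{7,8}(\text{new}) + l^{f_2}_{7,8}(\text{old})$
• $l^{f_1}_{7,8}(\text{new}) + l^{f_2}_{7,8}(\text{new})$
Asynchronous updates to N independent flows can result in $2^N$ possible load values on link $e_{7,8}$.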
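The growth in possible load values is easy to reproduce by brute force. A minimal sketch with hypothetical per-flow loads; `possible_loads` is an illustrative helper, not part of zUpdate.

```python
from itertools import product


def possible_loads(old, new):
    """Distinct total loads on one link when every flow is independently
    on its old or its new load (old[i] / new[i] for flow i)."""
    return {sum(combo) for combo in product(*zip(old, new))}


# Hypothetical per-flow loads on one link for two flows.
two_flows = possible_loads(old=[150, 150], new=[300, 50])
print(sorted(two_flows))          # four combinations

# Three flows with loads chosen so that every combination is distinct.
three_flows = possible_loads(old=[1, 2, 4], new=[0, 0, 0])
print(len(three_flows))           # eight combinations
```

Enumeration like this is only practical for a handful of flows, which motivates a condition that can be checked without listing every combination.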
Handling flow asynchronization
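[Congestion-free transition constraint] There is no congestion throughout a transition if and only if:
$$\forall e_{v,u}: \sum_{\forall f} \max\left\{ l^f_{v,u}(\text{old}),\ l^f_{v,u}(\text{new}) \right\} \le c_{v,u}$$
where $c_{v,u}$ is the capacity of link $e_{v,u}$.
[Figure: flows f1 and f2 share link $e_{7,8}$ in a topology of switches 1-8.]
Basic idea: the largest of the four possible loads on $e_{7,8}$,
• $l^{f_1}_{7,8}(\text{old}) + l^{f_2}_{7,8}(\text{old})$
• $l^{f_1}_{7,8}(\text{old}) + l^{f_2}_{7,8}(\text{new})$
• $l^{f_1}_{7,8}(\text{new}) + l^{f_2}_{7,8}(\text{old})$
• $l^{f_1}_{7,8}(\text{new}) + l^{f_2}_{7,8}(\text{new})$
equals $\max\{ l^{f_1}_{7,8}(\text{old}), l^{f_1}_{7,8}(\text{new}) \} + \max\{ l^{f_2}_{7,8}(\text{old}), l^{f_2}_{7,8}(\text{new}) \}$.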
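The constraint translates into a short check over two traffic distributions. This is a sketch: the data layout `{link: {flow: load}}`, the function name `congestion_free`, and all numbers are assumptions for illustration.

```python
def congestion_free(old, new, capacity):
    """Check the constraint: on every link, the sum over flows of
    max(old load, new load) must not exceed the link capacity.

    old, new: {link: {flow: load}}; capacity: {link: capacity}.
    """
    for link, cap in capacity.items():
        old_flows = old.get(link, {})
        new_flows = new.get(link, {})
        flows = set(old_flows) | set(new_flows)
        worst = sum(max(old_flows.get(f, 0), new_flows.get(f, 0)) for f in flows)
        if worst > cap:
            return False
    return True


# Hypothetical loads on one link: both endpoints fit, the transition may not.
link = (7, 8)
old = {link: {"f1": 400, "f2": 300}}   # total 700
new = {link: {"f1": 200, "f2": 700}}   # total 900
print(congestion_free(old, new, {link: 1000}))  # worst case is 400 + 700
print(congestion_free(old, new, {link: 1100}))
```

The check is linear in the number of (flow, link) pairs, in contrast to enumerating all $2^N$ combinations of old and new flows.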
Computing congestion-free transition plan
Formulated as linear programming:
• Constant: current traffic distribution
• Variable: intermediate traffic distributions (one per intermediate step)
• Variable: target traffic distribution
• Constraint: congestion-free transition between consecutive distributions
• Constraint: update requirements
• Constraint: deliver all traffic; flow conservation
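One possible way to write these pieces down is sketched below. It is not taken from the slides: the step index $i$ (with $0$ for the current distribution and $n$ for the target), the ingress switch $s_f$, the demand $d_f$, and the auxiliary variables $m^f_{v,u}(i)$ are assumptions. The auxiliary variables are the standard trick for expressing a max inside a linear program.

```latex
\begin{aligned}
\text{given } & l^f_{v,u}(0) && \text{(current distribution, constant)} \\
\text{find } & l^f_{v,u}(1), \dots, l^f_{v,u}(n) && \text{(intermediate steps and target } n\text{)} \\
\text{s.t. } & \sum_{u} l^f_{s_f,u}(i) = d_f && \text{deliver all traffic} \\
& \sum_{w} l^f_{w,v}(i) = \sum_{u} l^f_{v,u}(i) && \text{flow conservation at transit switch } v \\
& l^f_{v,u}(n) \text{ meets the update requirements} \\
& m^f_{v,u}(i) \ge l^f_{v,u}(i-1), \quad m^f_{v,u}(i) \ge l^f_{v,u}(i) \\
& \sum_{f} m^f_{v,u}(i) \le c_{v,u} && \text{congestion-free, } i = 1, \dots, n
\end{aligned}
```

Every constraint is linear in the unknown loads, which is what makes a linear programming solver applicable.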
Implementing an update plan
• Computation time
• Switch table size limit
• Update overhead
• Failure during transition
• Traffic demand variation
Critical flows (flows traversing bottleneck links): Weighted-ECMP
Other flows: ECMP
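The split between critical and other flows can be sketched as a set intersection. How bottleneck links are identified is outside this sketch; the function name `assign_routing`, the link names, and the flow names are hypothetical.

```python
def assign_routing(flow_links, bottleneck_links):
    """Map each flow to 'weighted-ecmp' if it crosses a bottleneck link,
    otherwise to plain 'ecmp'."""
    return {
        flow: "weighted-ecmp" if links & bottleneck_links else "ecmp"
        for flow, links in flow_links.items()
    }


# Hypothetical flows, each described by the set of links it traverses.
flow_links = {
    "f1": {("ToR1", "AGG2"), ("AGG2", "CORE1")},
    "f2": {("ToR3", "AGG4")},
}
bottlenecks = {("AGG2", "CORE1")}

plan = assign_routing(flow_links, bottlenecks)
print(plan)
```

Restricting weighted splits to the critical flows presumably keeps the number of per-flow rules small, which relates to the switch table size limit listed above.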
Evaluations
• Testbed experiments
• Large-scale trace-driven simulations
Testbed setup
[Figure: Clos topology with CORE 1-4, AGG 1-6, and ToR 1-12; scenario: drain AGG1]
Switch: OpenFlow 1.0
Link: 10Gbps
Traffic generator: ToR5: 6Gbps; ToR8: 6Gbps; ToR6,7: 6.2Gbps (×4)
zUpdate achieves congestion-free switch upgrade
[Plot: real-time link utilization vs. time (sec) for links CORE1-AGG3 and CORE3-AGG4; x-axis from 0 to 25, y-axis from 0.8 to 1.05]
Initial: 3Gbps, 3Gbps, 3Gbps, 3Gbps
Intermediate: 2Gbps, 4Gbps, 4.5Gbps, 1.5Gbps
Final: 0, 6Gbps, 5Gbps, 1Gbps
One-step update causes transient congestion
[Plot: real-time link utilization vs. time (sec) for links CORE1-AGG3 and CORE3-AGG4; x-axis from -1 to 15, y-axis from 0.7 to 1.1]
Initial: 3Gbps, 3Gbps, 3Gbps, 3Gbps
Final: 0, 6Gbps, 5Gbps, 1Gbps
Large-scale trace-driven simulations
[Figure: a production DCN topology with ToR, AGG, and CORE tiers; scenario: new switch; legend: test flows (1%), flows]
zUpdate beats alternative solutions
[Bar chart: transition loss rate and post-transition loss rate (%), y-axis from 0 to 15, for zUpdate, zUpdate-OneStep, ECMP-OneStep, and ECMP-Planned]
Number of steps: zUpdate 2; zUpdate-OneStep 1; ECMP-OneStep 1; ECMP-Planned 300+
Conclusion
• Switch and flow asynchronization can cause severe congestion during DCN updates
• We present zUpdate for congestion-free DCN updates
  • Novel algorithms to compute update plan
  • Practical implementation on commodity switches
  • Evaluations in real DCN topology and update scenarios
The End
Thanks & Questions?
Updating DCN is a painful process
Operator (this is Bob), planning a switch upgrade
Interactive applications ask:
• Any performance disruption?
• How bad will the latency be?
• How long will the disruption last?
• What servers will be affected?
Bob: "Uh?…"
Network update: a tussle between applications and operators
• Applications want network updates to be fast and seamless
  • Update can happen on demand
  • No performance disruption during update
• Network update is time consuming
  • Nowadays, an update is planned and executed by hand
  • Rolling back in unplanned cases
• Network update is risky
  • Human errors
  • Accidents
Challenges in congestion-free DCN update
• Many switches are involved
• Multi-step plan
• Different scenarios have distinctive requirements
  • Switch upgrade/failure recovery
  • New switch on-boarding
  • Load balancer reconfiguration
  • VM migration
• Coordination between changes in routing (network) and traffic demand (application)
Help!
Related work
• SWAN [SIGCOMM’13]
  • Maximizing network utilization
  • Tunnel-based traffic engineering
• Reitblatt et al. [SIGCOMM’12]
  • Control plane consistency during network updates
  • Per-packet and per-flow consistency cannot guarantee “no congestion”
• Raza et al. [ToN’2011], Ghorbani et al. [HotSDN’12]
  • Only a specific scenario (IGP update, VM migration)
  • One link weight change or one VM migration at a time