Backward Congestion Notification Version 2.0
Davide Bergamasco
Rong Pan
Cisco Systems, Inc.
IEEE 802.1 Interim Meeting
Garden Grove, CA (USA)
September 22, 2005
Credits
• Valentina Alaria (Cisco)
• Andrea Baldini (Cisco)
• Flavio Bonomi (Cisco)
• Manoj K. Wadekar (Intel)
BCN v2.0
• Desire from Mick to see an analytical study of BCN stability
• BCN v2.0 improvements
• Linear control loop allows analysis of stability
• Simplified detection mechanism
• Reduced signaling rate
• Original BCN framework remains the same
BCN Background
[Figure: Data Center Network. Topology with End Nodes A, B, and C, Edge Switches A, B, and C, and a Core Switch; all links are 10 Gbps. Labels: Traffic, Congestion, BCN Message.]
Detection & Signaling
[Figure: congestion point queue between IN and OUT, spanning EMPTY QUEUE to FULL QUEUE, with thresholds Qeq and Qsc. Frames are sampled with probability P. Decision boxes: "Sampled Frame?" and "RL Tagged Frame?", with outcomes Send BCN and NOP. Messages to generate: BCN (Qoff, Qdelta), BCN (0,0), or No Message.]
Qoff = Qeq - Qlen [-Qeq, +Qeq]
Qdelta = #pktEnq - #pktDeq [-2Qeq, +2Qeq]
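The two message fields can be illustrated with a minimal sketch. It assumes that queue lengths are counted in packets, that both values saturate at the ranges shown in brackets, and that the enqueue/dequeue counters restart at every sample. The class and method names are hypothetical and not part of the proposal.

```python
from dataclasses import dataclass
from typing import Tuple


def clamp(value: int, low: int, high: int) -> int:
    """Saturate value to the closed interval [low, high]."""
    return max(low, min(high, value))


@dataclass
class CongestionPoint:
    """Hypothetical congestion point; queue lengths are counted in packets."""
    q_eq: int        # equilibrium queue length Qeq
    q_len: int = 0   # current queue length Qlen
    enq: int = 0     # packets enqueued since the last sample
    deq: int = 0     # packets dequeued since the last sample

    def enqueue(self) -> None:
        self.q_len += 1
        self.enq += 1

    def dequeue(self) -> None:
        if self.q_len > 0:
            self.q_len -= 1
            self.deq += 1

    def sample(self) -> Tuple[int, int]:
        """Return (Qoff, Qdelta), then restart the enqueue/dequeue counters."""
        q_off = clamp(self.q_eq - self.q_len, -self.q_eq, self.q_eq)
        q_delta = clamp(self.enq - self.deq, -2 * self.q_eq, 2 * self.q_eq)
        self.enq = 0
        self.deq = 0
        return q_off, q_delta
```

A negative Qoff means the queue is above equilibrium; a positive Qdelta means it grew since the previous sample.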
Reaction
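[Figure: reaction point at the edge node. Data IN is matched against entries labeled R1 F1, R2 F2, ..., Rn Fn before leaving on Data OUT, with a separate No Match path. Control IN carries BCN messages from the congested point in the network core. Rate-limited packets are marked with RATE_LIMITED_TAG.]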
* Feedback
Fb = (Qoff - W * Qdelta)
* Additive Increase (Fb > 0)
R = R + Gi * Fb * ru
* Multiplicative Decrease (Fb < 0)
R = R * ( 1 - Gd * |Fb| )
* Parameters
W = derivative weight
Gi = increase gain
Gd = decrease gain
ru = rate unit
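The reaction rules can be sketched as follows. The function names are hypothetical, and the clamping to a minimum and maximum rate is an assumption that the slide does not state.

```python
def feedback(q_off: float, q_delta: float, w: float) -> float:
    """Fb = Qoff - W * Qdelta."""
    return q_off - w * q_delta


def update_rate(rate: float, fb: float, gi: float, gd: float, ru: float,
                min_rate: float = 0.0, max_rate: float = float("inf")) -> float:
    """Apply one BCN message to a rate limiter and return the new rate."""
    if fb > 0:
        # Additive increase: R = R + Gi * Fb * ru
        rate = rate + gi * fb * ru
    elif fb < 0:
        # Multiplicative decrease: R = R * (1 - Gd * |Fb|)
        rate = rate * (1.0 - gd * abs(fb))
    # Assumption: keep the rate inside [min_rate, max_rate].
    return max(min_rate, min(max_rate, rate))
```

For example, with W = 2, Gi = 4, Gd = 1/64, and ru = 8 Mbps, a message carrying (Qoff, Qdelta) = (-1, 5) gives Fb = -11 and cuts a 1 Gbps limiter to 1 Gbps * (1 - 11/64).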
Suggested BCN Message Format
 0                            15                              31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+   DA = SA of sampled frame    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+    SA = MAC Address of CP     +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   IEEE 802.1Q Tag or S-Tag                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        EtherType = BCN        |Version|       Reserved        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                             CPID                              +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             Qoff              |            Qdelta             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Timestamp           |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
|                                                               |
|        First N bytes of sampled frame starting from DA        |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              FCS                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
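As an illustration of the suggested layout, the following sketch serializes the fields in order. Several points are assumptions: network byte order, signed 16-bit Qoff and Qdelta, Version in the top 4 bits of its 16-bit word, and the `BCN_ETHERTYPE` value, which is only a placeholder because the slides assign none. The function name `pack_bcn` is hypothetical, and the FCS is left to the MAC.

```python
import struct

# Placeholder only: 0x88B5 is an IEEE local experimental EtherType; the slides
# do not assign a value for BCN.
BCN_ETHERTYPE = 0x88B5


def pack_bcn(da: bytes, sa: bytes, vlan_tag: int, version: int, cpid: bytes,
             q_off: int, q_delta: int, timestamp: int,
             sampled_frame: bytes, n: int = 16) -> bytes:
    """Serialize a BCN message up to, but not including, the FCS."""
    if len(da) != 6 or len(sa) != 6:
        raise ValueError("MAC addresses must be 6 bytes")
    if len(cpid) != 8:
        raise ValueError("CPID must be 8 bytes")
    # Version occupies the top 4 bits; the 12 Reserved bits are zero.
    version_reserved = (version & 0xF) << 12
    header = struct.pack(
        "!6s6sIHH8shhH",
        da, sa, vlan_tag, BCN_ETHERTYPE, version_reserved,
        cpid, q_off, q_delta, timestamp,
    )
    # Append the first n bytes of the sampled frame, starting from its DA.
    return header + sampled_frame[:n]
```

The fixed part of the message is 34 bytes, so a message carrying the first 16 bytes of the sampled frame is 50 bytes before the FCS.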
Suggested RLT Tag Format
 0     3       7              15                              31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+   DA of rate-limited frame    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+   SA of rate-limited frame    +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        IEEE 802.1Q Tag or S-Tag of rate-limited frame         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        EtherType = RLT        |Version|       Reserved        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                             CPID                              +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Timestamp           |EtherType of rate limited frame|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                 Payload of rate-limited frame                 +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              FCS                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Simulation Environment (1)
[Figure: simulation topology. Nodes ST1-ST4, SU1-SU4, SR1, SR2, SJ, DT, DU, DR1, and DR2 attach to edge switches ES1-ES6 around a Core Switch. Traffic types: TCP Bulk and UDP On/Off. The congestion point is marked.]
Simulation Environment (2)
• Short Range, High Speed DC Network
• Link Capacity = 10 Gbps
• Switch latency = 1 μs
• Link Length = 100 m (0.5 μs propagation delay)
• Control loop
• Delay ~ 3 μs
• Parameters
• W = 2
• Gi = 4
• Gd = 1/64
• Ru = 8 Mbps
• Workload
• ST1-ST4: 10 parallel TCP connections transferring 1 MB each continuously
• SU1-SU4: 64 KB bursts of UDP traffic starting at t = 10 ms
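The propagation delay quoted for the 100 m link follows from the link length if the signal velocity is assumed to be about 2 x 10^8 m/s. That velocity is an assumption, typical of copper and fiber, and is not stated on the slide. A minimal check:

```python
def propagation_delay_us(length_m: float, velocity_m_per_s: float = 2.0e8) -> float:
    """One-way propagation delay in microseconds for a link of the given length."""
    return length_m / velocity_m_per_s * 1.0e6


delay_100m = propagation_delay_us(100.0)  # about 0.5 microseconds
```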
BCNv1.0
BCNv2.0
Higher Stability @ Steady State
Faster Transient Response
Simulation Environment (3)
• Long Range, High Speed DC Network
• Link Capacity = 10 Gbps
• Switch latency = 1 s
• Link Length = 20000 m (100 μs propagation delay)
• Control loop
• Delay ~ 200 μs
• Parameters
• W = 2
• Gi = 4
• Gd = 1/64
• Ru = 8 Mbps
• Workload
• ST1-ST4: 10 parallel TCP connections transferring 1 MB each continuously
• SU1-SU4: 64 KB bursts of UDP traffic starting at t = 10 ms
BCNv1.0
BCNv2.0
Much higher stability @ steady state with larger loop delays
Summary
• BCN v2 has a number of advantages …
• Can be studied analytically
• Better protection of TCP flows in mixed TCP and UDP traffic scenarios
• Detection algorithm independent of Switch implementation
• Better Performance
• Lower signaling frequency (from 10% to 1%)
• Better stability
• Increased tolerance to loop delays
• … and one disadvantage
• Slower convergence to fairness
A Control-Theoretic Approach to BCN Design and Analysis
Notation
N: Number of Flows
C: Link Capacity
τ: Round Trip Delay
w: Weight of the Derivative
Pm: Sampling Probability
Gi: Additive Increase Gain
Gd: Multiplicative Decrease Gain
Block Diagram of BCN Congestion Control
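[Figure: block diagram of the BCN control loop. Blocks: N, C, Time Delay, Pm, Gi, Gd. Signals: R, ΔR, q. The feedback block computes:]
$$Fb(T) = \bigl(q_{eq} - q(T)\bigr) - w \cdot \bigl(q(T) - q(T-1)\bigr)$$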
Non-linear Differential Equations
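Link Control
$$\frac{dq(t)}{dt} = N \cdot R(t) - C$$
$$Fb(t) = \bigl(q_{eq} - q(t)\bigr) - \frac{w}{C \cdot P_m} \cdot \frac{dq(t)}{dt}$$
Source Control (τ is the round trip delay)
If Fb(t - τ) > 0:
$$\frac{dR(t)}{dt} = G_i \cdot Fb(t - \tau) \cdot R(t) \cdot P_m$$
If Fb(t - τ) < 0:
$$\frac{dR(t)}{dt} = G_d \cdot Fb(t - \tau) \cdot R(t) \cdot R(t) \cdot P_m$$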
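The fluid model can be written as a pair of right-hand-side functions, which makes it easy to check that R = C/N and q = qeq is an equilibrium. This is a sketch under stated assumptions: queue lengths are in packets and rates in packets per second, the delayed feedback Fb(t - tau) is supplied by the caller, and the function names and numeric values are hypothetical.

```python
def feedback_signal(q, dq_dt, q_eq, w, c, pm):
    """Fb = (q_eq - q) - w / (C * Pm) * dq/dt."""
    return (q_eq - q) - w * dq_dt / (c * pm)


def fluid_rhs(rate, fb_delayed, n, c, gi, gd, pm):
    """Return (dq/dt, dR/dt) given the delayed feedback Fb(t - tau)."""
    dq_dt = n * rate - c
    if fb_delayed > 0:
        # Additive increase branch.
        dr_dt = gi * fb_delayed * rate * pm
    elif fb_delayed < 0:
        # Multiplicative decrease branch.
        dr_dt = gd * fb_delayed * rate * rate * pm
    else:
        dr_dt = 0.0
    return dq_dt, dr_dt


# Hypothetical values, chosen only to exercise the equations.
N, C, Q_EQ, W, PM, GI, GD = 50, 1.0e6, 16.0, 4.0, 0.01, 1.0, 1.0 / 128

# At the operating point the queue and the rate are both stationary.
fb_eq = feedback_signal(Q_EQ, 0.0, Q_EQ, W, C, PM)
dq_eq, dr_eq = fluid_rhs(C / N, fb_eq, N, C, GI, GD, PM)
```

If every source sends slightly faster than C/N, dq/dt becomes positive, Fb turns negative even with q = qeq because of the derivative term, and the decrease branch pulls the rate back down.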
Linearization Around Operating Point
• Using feedback control to analyze local stability
• Operating point:
R = C/N;
q’ = qeq – q = 0;
• Linearization
Difficulty: depending on sgn(Fb(t-d)), the system responses are different
– Luckily, a piecewise-linear function
Details are in the appendix
Block Diagram of BCN Feedback Control
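[Figure: block diagram of the linearized feedback loop. The rate R drives the queue q through the block N/s, the feedback block produces Fb, and the source controller closes the loop. Annotations: "lose 90° margin" and "add lead zero to compensate".]
Feedback:
$$Fb(s) = \left(1 + \frac{w \cdot s}{C \cdot P_m}\right) e^{-s\tau}$$
Multiplicative Decrease:
$$\frac{G_d \cdot P_m \cdot C^2 / N^2}{s + w \cdot G_d \cdot C / N}$$
Additive Increase:
$$\frac{G_i \cdot P_m \cdot C / N}{s + w \cdot G_i}$$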
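As a worked step, assume that the plant N/s, the feedback block Fb(s) = (1 + w s / (C Pm)) e^{-s τ}, and the additive-increase controller are connected in series around the loop. The open-loop transfer function, here given the hypothetical name L_i(s), would then be:

$$L_i(s) = \frac{N}{s} \cdot \left(1 + \frac{w \cdot s}{C \cdot P_m}\right) e^{-s\tau} \cdot \frac{G_i \cdot P_m \cdot C / N}{s + w \cdot G_i} = \frac{G_i \cdot (w \cdot s + C \cdot P_m) \cdot e^{-s\tau}}{s \cdot (s + w \cdot G_i)}$$

In this form the integrator 1/s contributes a constant 90° of phase lag, the zero at s = -C Pm / w contributes phase lead, and the delay term subtracts ωτ radians at frequency ω, so a larger τ leaves less margin.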
The Effect Of Zero From Time Domain’s Eyes
[Figure: time-domain plot of R and q, annotated "zero: dq/dt".]
Choosing Parameters – an example
• Network conditions (10G link)
N = 50
τ = 200us
• Choose parameters such that the feedback loop is stable with a 35° margin
w = 4
Gi = 2Mbps
Gd = 1/128
Pm = 0.01
Stability Result
[Figure: stability plot, annotated "lost 90° margin".]
1. With N = 50, delay = 200us, the system is stable
2. The phase margin allows network conditions as extreme as N -> 1000 flows or τ -> 1ms before oscillation
Simulation Result Shows A Stable System for N = 50; Delay = 200us
Simulation Result Shows System is stable, but on the verge of oscillation: N = 50, Delay = 1ms
Change W = 4 -> 1
1. When w = 1, a system with N = 50, delay = 200us already runs out of margin, on the verge of oscillation
2. With w = 1 the effect of the zero diminishes, and the system can’t cope with a wide range of network conditions
Indeed System is stable, but on the verge of oscillation even for N = 50, Delay = 200us when w = 1.0
Requests to 802.1
• Start a Task Force on Congestion Management
• Use BCN as a Baseline Proposal
Appendix
Linearizing…
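$$\dot{q}(t) = N \cdot R(t)$$
$$q(s) = \frac{N \cdot R(s)}{s}$$
$$Fb(t) = -G \cdot \left(q(t) + \frac{w \cdot \dot{q}(t)}{C \cdot P_m}\right)$$
$$Fb(s) = -G \cdot \left(1 + \frac{w \cdot s}{C \cdot P_m}\right) \cdot q(s)$$
(Here q and R denote deviations from the operating point. G is a scaling gain on Fb that appears only in this appendix; G = 1 gives the main-slide expressions.)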
Linearizing Additive Increase Function
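$$f:\quad \frac{dR(t)}{dt} = G_i \cdot R(t) \cdot P_m \cdot Fb(t)$$
$$\frac{\partial f}{\partial Fb(t)} = G_i \cdot R(t) \cdot P_m = \frac{G_i \cdot P_m \cdot C}{N}$$
$$\frac{\partial f}{\partial R(t)} = \frac{\partial \left( G_i \cdot R(t) \cdot P_m \cdot G \cdot \left( (q_{eq} - q(t)) - \frac{w}{C \cdot P_m} \cdot \frac{dq(t)}{dt} \right) \right)}{\partial R(t)}$$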
Linearizing Additive Increase Function
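$$\frac{\partial f}{\partial R(t)} = \frac{\partial \left( G_i \cdot R(t) \cdot P_m \cdot G \cdot \left( (q_{eq} - q(t)) - \frac{w}{C \cdot P_m} \cdot \frac{dq(t)}{dt} \right) \right)}{\partial R(t)}$$
$$= G_i \cdot P_m \cdot G \cdot \left( (q_{eq} - q(t)) - \frac{w}{C \cdot P_m} \cdot \frac{dq(t)}{dt} \right) + G \cdot G_i \cdot R(t) \cdot P_m \cdot \frac{\partial \left( (q_{eq} - q(t)) - \frac{w}{C \cdot P_m} \cdot \frac{dq(t)}{dt} \right)}{\partial R(t)}$$
At the operating point the first term vanishes, and dq(t)/dt = N R(t) - C, so
$$= G \cdot G_i \cdot R(t) \cdot P_m \cdot \frac{\partial \left( (q_{eq} - q(t)) - \frac{w}{C \cdot P_m} \cdot (N \cdot R(t) - C) \right)}{\partial R(t)}$$
$$= G \cdot G_i \cdot R(t) \cdot P_m \cdot \left( -\frac{w \cdot N}{C \cdot P_m} \right)$$
$$= -G \cdot G_i \cdot w \quad \text{(at } R(t) = C/N \text{)}$$
$$s \cdot R = \frac{G_i \cdot P_m \cdot C}{N} \cdot Fb - G \cdot G_i \cdot w \cdot R$$
$$R = \frac{G_i \cdot P_m \cdot C / N}{s + G \cdot G_i \cdot w} \cdot Fb$$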
Linearizing Multiplicative Decrease Function
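$$g:\quad \frac{dR(t)}{dt} = G_d \cdot R(t) \cdot R(t) \cdot P_m \cdot Fb(t)$$
$$\frac{\partial g}{\partial Fb(t)} = G_d \cdot R(t) \cdot R(t) \cdot P_m = \frac{G_d \cdot P_m \cdot C^2}{N^2}$$
$$\frac{\partial g}{\partial R(t)} = \frac{\partial \left( G_d \cdot R(t) \cdot R(t) \cdot P_m \cdot G \cdot \left( (q_{eq} - q(t)) - \frac{w}{C \cdot P_m} \cdot \frac{dq(t)}{dt} \right) \right)}{\partial R(t)}$$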
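Linearizing Multiplicative Decrease Function
$$\frac{\partial g}{\partial R(t)} = \frac{\partial \left( G_d \cdot R(t) \cdot R(t) \cdot P_m \cdot G \cdot \left( (q_{eq} - q(t)) - \frac{w}{C \cdot P_m} \cdot \frac{dq(t)}{dt} \right) \right)}{\partial R(t)}$$
$$= 2 \cdot G_d \cdot R(t) \cdot P_m \cdot G \cdot \left( (q_{eq} - q(t)) - \frac{w}{C \cdot P_m} \cdot \frac{dq(t)}{dt} \right) + G_d \cdot R(t)^2 \cdot P_m \cdot G \cdot \frac{\partial \left( (q_{eq} - q(t)) - \frac{w}{C \cdot P_m} \cdot \frac{dq(t)}{dt} \right)}{\partial R(t)}$$
At the operating point the first term vanishes, and dq(t)/dt = N R(t) - C, so
$$= G_d \cdot R(t)^2 \cdot P_m \cdot G \cdot \frac{\partial \left( (q_{eq} - q(t)) - \frac{w}{C \cdot P_m} \cdot (N \cdot R(t) - C) \right)}{\partial R(t)}$$
$$= G_d \cdot R(t)^2 \cdot P_m \cdot G \cdot \left( -\frac{w \cdot N}{C \cdot P_m} \right)$$
$$= -G \cdot w \cdot G_d \cdot \frac{C}{N} \quad \text{(at } R(t) = C/N \text{)}$$
$$s \cdot R = \frac{G_d \cdot P_m \cdot C^2}{N^2} \cdot Fb - G \cdot w \cdot G_d \cdot \frac{C}{N} \cdot R$$
$$R = \frac{G_d \cdot P_m \cdot C^2 / N^2}{s + G \cdot w \cdot G_d \cdot C / N} \cdot Fb$$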
Issue #1: Non-linearity
[Figure: queue length Q versus time t around Qeq, with regions marked + and -, and the annotation "Stop Generation of BCN Messages".]
• ISSUE: Overshoots and undershoots accumulate over time
• SOLUTION: Signal only when
• Q > Qeq && dQ/dt > 0
• Q < Qeq && dQ/dt < 0
• Easy to implement in hardware: just an Up/Down counter
• Increment @ every enqueue
• Decrement @ every dequeue
• Reduces signaling rate by 50%!!
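The signaling rule and the up/down counter can be sketched as follows. The names are hypothetical, and reading and resetting the counter once per sample is an assumption about how the sign of dQ/dt would be obtained.

```python
class UpDownCounter:
    """Net queue growth between two samples; its sign gives the sign of dQ/dt."""

    def __init__(self) -> None:
        self.count = 0

    def on_enqueue(self) -> None:
        self.count += 1

    def on_dequeue(self) -> None:
        self.count -= 1

    def read_and_reset(self) -> int:
        value = self.count
        self.count = 0
        return value


def should_signal(q_len: int, q_eq: int, dq: int) -> bool:
    """Signal only while the queue is moving away from equilibrium."""
    moving_away_above = q_len > q_eq and dq > 0
    moving_away_below = q_len < q_eq and dq < 0
    return moving_away_above or moving_away_below
```

A queue that is above Qeq but already draining produces no message, which is where the reduction in signaling comes from.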
Issue #2: Specific Detection Mechanism
[Figure: threshold-based detection mechanism. The queue between IN and OUT runs from EMPTY QUEUE through EQUILIBRIUM to FULL QUEUE, with thresholds T-4 ... T-1, T+0, and T+1 ... T+4. Frames are sampled with probability P. Messages to generate: BCN-4 ... BCN-1, BCN 0, BCN+1 ... BCN+4, or No Message. Flowchart labels: "Sampled Frame?", "RL Tagged Frame?", "RL Tag && Solicit Bit Set?", "BCN type" (+ or -), "dQ/dt > 0", "dQ/dt < 0?", Send BCN, NOP.]