1
High Performance Switching and Routing
Telecom Center Workshop: Sept 4, 1997.
EE384Y: Packet Switch Architectures, Part II
Load-balanced Switch
(Borrowed from Isaac Keslassy’s Defense Talk)
Nick McKeown
Professor of Electrical Engineering and Computer Science, Stanford University
[email protected]
http://www.stanford.edu/~nickm
2
The Arbitration Problem
A packet switch fabric is reconfigured for every packet transfer.
For example, at 160Gb/s, a new IP packet can arrive every 2ns.
The configuration is picked to maximize throughput and not waste capacity.
Known algorithms are probably too slow.
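The 2ns figure can be checked with a line of arithmetic. The sketch below assumes a 40-byte minimum-size IP packet, which the slide does not state; the function name is illustrative.

```python
def packet_time_ns(packet_bytes: int, line_rate_gbps: float) -> float:
    """Time to receive one packet, in nanoseconds, at the given line rate."""
    bits = packet_bytes * 8
    # bits divided by Gb/s gives nanoseconds
    return bits / line_rate_gbps


# Assumed 40-byte packet on a 160 Gb/s linecard
print(packet_time_ns(40, 160))
```

Under that assumption the scheduler has about 2ns to pick each fabric configuration.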
3
Approach
We know that a crossbar with VOQs, and uniform Bernoulli i.i.d. arrivals, gives 100% throughput for the following scheduling algorithms:
- Pick a permutation uniformly at random (uar) from all permutations.
- Pick a permutation uar from a set of N permutations in which each input-output pair (i,j) is connected exactly once.
- From the same set as above, repeatedly cycle through a fixed sequence of N different permutations.
Can we make non-uniform, bursty traffic uniform “enough” for the above to hold?
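One set of N permutations with the "each pair exactly once" property is the set of cyclic shifts. The sketch below is only an illustration of that property; the function names are not from the talk.

```python
def cyclic_permutations(n):
    """Return n permutations; permutation k connects input i to output (i + k) % n."""
    return [[(i + k) % n for i in range(n)] for k in range(n)]


def covers_each_pair_once(perms):
    """True if every input-output pair (i, j) appears in exactly one permutation."""
    n = len(perms)
    seen = [[0] * n for _ in range(n)]
    for perm in perms:
        for i, j in enumerate(perm):
            seen[i][j] += 1
    return all(count == 1 for row in seen for count in row)


perms = cyclic_permutations(4)
print(perms)
print(covers_each_pair_once(perms))
```

Cycling through these N permutations in a fixed order is the third scheduling algorithm in the list.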
4
Design Example
Goals
- Scale to High Linecard Speeds (160Gb/s): No Centralized Scheduler, Optical Switch Fabric, Low Packet-Processing Complexity
- Scale to High Number of Linecards (640)
- Provide Performance Guarantees: 100% Throughput Guarantee, No Packet Reordering
Stanford “Optics in Routers” project: http://yuba.stanford.edu/or/
Some challenging numbers: 100Tb/s, 160Gb/s linecards, 640 linecards
5
Outline
- Basic idea of load-balancing
- Packet mis-sequencing
- An optical switch fabric
- Scaling number of linecards
- Arbitrary arrangement of linecards
6
100% Throughput in a Mesh Fabric
[Figure: inputs and outputs at rate R joined by a full mesh; the rate needed on each mesh link is marked "?"]
Router capacity = NR; switch capacity = N^2 R
7
If Traffic Is Uniform
[Figure: mesh with inputs and outputs at rate R; each mesh link runs at rate R/N]
8
Real Traffic is Not Uniform
[Figure: the same mesh with R/N links; with non-uniform traffic, the load offered to individual links is marked "?"]
9
Load-Balanced Switch
[Figure: two meshes in series, a load-balancing stage followed by a forwarding stage; external lines run at rate R and every mesh link at rate R/N]
100% throughput for weakly mixing traffic (Valiant, C.-S. Chang)
10
Load-Balanced Switch
[Figure: the two-stage switch with packets labeled 1, 2, 3; link rates R and R/N]
11
Load-Balanced Switch
[Figure: the same two-stage switch with packets 1, 2, 3 in transit]
12
Intuition: 100% Throughput
[Figure: the two-stage switch; link rates R and R/N]
Arrivals to second mesh: $b = \frac{1}{N} U a$, where $U$ is the $N \times N$ all-ones matrix
Capacity of second mesh: $C = \frac{R}{N} U$
Second mesh: arrival rate < service rate: $C - b = \frac{1}{N}(R\,U - U a) > 0$
[C.-S. Chang]
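The intuition can be checked numerically. The sketch below assumes a is a rate matrix whose entry a[i][j] is the rate from input i to output j (an interpretation, not stated on the slide), and computes the per-link margin between capacity and arrivals in the second mesh.

```python
def second_mesh_margin(a, R):
    """Capacity minus arrival rate on every link of the second mesh.

    After uniform spreading, each middle port k receives 1/N of the total
    rate destined to output j, and the link (k, j) has capacity R / N.
    """
    n = len(a)
    col_sums = [sum(a[i][j] for i in range(n)) for j in range(n)]
    return [[(R - col_sums[j]) / n for j in range(n)] for _ in range(n)]


# Non-uniform traffic, no output oversubscribed (each column sums to less than R)
a = [[0.9, 0.0, 0.0],
     [0.0, 0.5, 0.4],
     [0.0, 0.4, 0.5]]
margin = second_mesh_margin(a, R=1.0)
print(all(m > 0 for row in margin for m in row))
```

The margin is positive on every link exactly when no output's total rate reaches R, whatever the split among inputs.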
13
Another way of thinking about it
[Figure: external inputs 1..N, internal inputs 1..N and external outputs 1..N, joined by a load-balancing cyclic shift followed by a switching cyclic shift]
Load Balancing
First stage load-balances incoming packets
Second stage is a cyclic shift
14
Load-Balanced Switch
[Figure: external inputs 1..N, internal inputs 1..N and external outputs 1..N, joined by a load-balancing cyclic shift and a switching cyclic shift; packets 1 and 2 are shown in transit]
15
Main Result [Chang et al.]:
1. Consider a periodic sequence of permutation matrices: $P(t) = P^{\hat{t}}$, where $P$ is a one-cycle permutation matrix (for example, a TDM sequence), and $\hat{t} = t \bmod N$.
2. If the 1st stage is scheduled by a sequence of permutation matrices: $P_1(t) = P(t - \phi_1)$, where $\phi_1$ is a random starting phase, and
3. The 2nd stage is scheduled by a sequence of permutation matrices: $P_2(t) = P(t - \phi_2)$,
4. Then the switch gives 100% throughput for a very broad range of traffic types.
Observation: 1st stage makes non-uniform traffic uniform, and breaks up burstiness.
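A periodic sequence built from powers of a one-cycle permutation matrix connects every input to every output once per period, for any starting phase. The sketch below builds such a sequence with plain lists; the helper names are illustrative and the cyclic shift is just one possible one-cycle permutation.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def one_cycle(n):
    """Permutation matrix sending input i to output (i + 1) % n (a single n-cycle)."""
    return [[1 if j == (i + 1) % n else 0 for j in range(n)] for i in range(n)]


def schedule(n, phase=0):
    """Matrices P(t - phase) for t = 0..n-1, where P(t) is the (t mod n)-th power."""
    p = one_cycle(n)
    powers = [[[1 if i == j else 0 for j in range(n)] for i in range(n)]]
    for _ in range(n - 1):
        powers.append(matmul(powers[-1], p))
    return [powers[(t - phase) % n] for t in range(n)]


def period_sum(mats):
    """Entrywise sum of the matrices over one period."""
    n = len(mats[0])
    return [[sum(m[i][j] for m in mats) for j in range(n)] for i in range(n)]


print(period_sum(schedule(4, phase=2)))
```

The period sum is the all-ones matrix, so the time-average connection rate between any pair of ports is 1/N regardless of the phase.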
16
Outline of Chang’s Proof
1. Let $a(t)$ be the matrix of arrivals at time $t$, where $a_{ij}(t) = 1$ indicates an arrival at $i$ for $j$.
2. Let $b(t) = P_1(t)\,a(t)$ be the input traffic to the second stage.
3. Let $q(t)$ be the queue length matrix:
$q(t+1) = \max\big(q(t) + b(t+1) - P_2(t+1),\ 0\big)$,
which (taking $q(0) = 0$) expands to
$q(t) = \max_{0 \le s \le t} \sum_{\tau = s+1}^{t} \big(b(\tau) - P_2(\tau)\big)$.
Theorem: If no output is oversubscribed, $q(t)$ converges to steady state: $q(t) \to q(\infty)$.
Proof: $E[b(t)] = E[P_1(t)\,a(t)] = E[P_1(t)]\,E[a(t)] = \frac{1}{N}\,e\,E[a(t)]$, where $e$ is the all-ones matrix, so
$\lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{t} \big(b(s) - P_2(s)\big) = \frac{1}{N}\,e\,E[a(t)] - \frac{1}{N}\,e < 0$.
Holds under some mild conditions on $a(t)$ (weakly mixing arrival processes).
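The queue recursion in the proof outline (add the cells delivered by the first stage, subtract the service offered by the second stage, floor at zero) can be exercised with a small simulation. This is a sketch under assumptions not taken from the slides: both stages use the same cyclic shift with phase 0, arrivals are Bernoulli, and the traffic is a deliberately non-uniform pattern in which input i only sends to output i.

```python
import random


def simulate(n, slots, load, seed=0):
    """Two-stage load-balanced switch with cyclic-shift schedules.

    Returns (arrivals, departures, backlog, peak_voq).
    """
    rng = random.Random(seed)
    # voq[k][j]: cells waiting at middle port k for output j
    voq = [[0] * n for _ in range(n)]
    arrivals = departures = peak = 0
    for t in range(slots):
        # Stage 1: input i is connected to middle port (i + t) % n.
        for i in range(n):
            if rng.random() < load:
                k = (i + t) % n
                voq[k][i] += 1  # assumed pattern: destination j equals i
                arrivals += 1
        # Stage 2: middle port k is connected to output (k + t) % n.
        for k in range(n):
            j = (k + t) % n
            if voq[k][j] > 0:
                voq[k][j] -= 1
                departures += 1
        peak = max(peak, max(max(row) for row in voq))
    backlog = sum(sum(row) for row in voq)
    return arrivals, departures, backlog, peak


arr, dep, back, peak = simulate(n=4, slots=1000, load=0.9)
print(arr, dep, back, peak)
```

With this pattern each middle-port queue receives at most one cell per N slots and is served once per N slots, so the queues stay bounded even though the offered traffic is far from uniform.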
17
Outline
- Basic idea of load-balancing
- Packet mis-sequencing
- An optical switch fabric
- Scaling number of linecards
- Arbitrary arrangement of linecards
18
Packet Reordering
[Figure: the two-stage switch with packets 1 and 2; link rates R and R/N]
19
Bounding Delay Difference Between Middle Ports
[Figure: the two-stage switch with packets 1 and 2 and a queue annotated in cells]
20
UFS (Uniform Frame Spreading)
[Figure: the two-stage switch with packets labeled 1, 2, 3; link rates R and R/N]
21
FOFF (Full Ordered Frames First)
[Figure: the two-stage switch with packets 1 and 2; link rates R and R/N]
22
FOFF (Full Ordered Frames First)
Input Algorithm
- N FIFO queues corresponding to the N output flows
- Spread each flow uniformly: if the last packet was sent to middle port k, send the next to k+1 (wrapping around after port N).
- Every N time-slots, pick a flow:
  - If a full frame exists, pick it and spread it like UFS
  - Else if all frames are partial, pick one in round-robin order and send it
[Figure: the N input FIFO queues]
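The input algorithm above can be sketched as a small class. This is a minimal sketch, not the talk's implementation: the class and method names are hypothetical, a frame is taken to be N packets of one flow, and choosing among several full frames in round-robin order is an assumption.

```python
from collections import deque


class FoffInput:
    """One input of the switch, following the three rules on the slide."""

    def __init__(self, n):
        self.n = n
        self.flows = [deque() for _ in range(n)]  # one FIFO per output
        self.next_port = [0] * n                  # per-flow spreading pointer
        self.rr = 0                               # round-robin pointer (assumed tie-break)

    def enqueue(self, output, packet):
        self.flows[output].append(packet)

    def _pick(self):
        order = [(self.rr + d) % self.n for d in range(self.n)]
        for j in order:                 # full frames first
            if len(self.flows[j]) >= self.n:
                return j
        for j in order:                 # otherwise a partial frame
            if self.flows[j]:
                return j
        return None

    def serve_frame(self):
        """Call once every n time-slots; returns (middle_port, output, packet) tuples."""
        j = self._pick()
        if j is None:
            return []
        self.rr = (j + 1) % self.n
        sent = []
        for _ in range(min(self.n, len(self.flows[j]))):
            k = self.next_port[j]
            sent.append((k, j, self.flows[j].popleft()))
            self.next_port[j] = (k + 1) % self.n  # next packet goes to port k+1
        return sent


inp = FoffInput(3)
for p in "abc":
    inp.enqueue(0, p)
inp.enqueue(1, "x")
print(inp.serve_frame())
print(inp.serve_frame())
```

Because the pointer is kept per flow, a partial frame leaves the flow's next packet aimed at the middle port right after the last one used.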
23
Bounding Reordering
[Figure: the two-stage switch with packets 1, 2, 3 and N middle ports; link rates R and R/N]
24
FOFF
Output properties
- N FIFO queues corresponding to the N middle ports
- Buffer size less than N^2 packets
- If there are N^2 packets, one of the head-of-line packets is in order
[Figure: the N output FIFO queues, holding packets from flows 1, 2, 3, feeding the output]
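A resequencing output with one FIFO per middle port can be sketched as follows. The slide does not say how the output recognizes the next in-order packet; the per-source sequence number used here is purely an illustrative assumption, as are the class and method names.

```python
from collections import deque


class Resequencer:
    """Output-side buffer: one FIFO per middle port, release only in-order heads."""

    def __init__(self, n):
        self.fifos = [deque() for _ in range(n)]
        self.expected = {}  # next sequence number per source (hypothetical header field)

    def receive(self, middle_port, src, seq, payload):
        self.fifos[middle_port].append((src, seq, payload))

    def release(self):
        """Pop head-of-line packets for as long as some head is next in order."""
        out = []
        progress = True
        while progress:
            progress = False
            for fifo in self.fifos:
                if fifo:
                    src, seq, payload = fifo[0]
                    if seq == self.expected.get(src, 0):
                        fifo.popleft()
                        self.expected[src] = seq + 1
                        out.append(payload)
                        progress = True
        return out


rs = Resequencer(2)
rs.receive(1, "A", 1, "p1")   # arrives early through middle port 1
print(rs.release())
rs.receive(0, "A", 0, "p0")
print(rs.release())
```

Only head-of-line packets are inspected, which matches the slide's statement that the output only needs one of the head-of-line packets to be in order.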
25
FOFF Properties
Property 1: FOFF maintains packet order.
Property 2: FOFF has O(1) complexity.
Property 3: Congestion buffers operate independently.
Property 4: FOFF maintains an average packet delay within a constant of that of an ideal output-queued router.
Corollary: FOFF has 100% throughput for any adversarial traffic.
26
Output-Queued Router?
[Figure: the mesh fabric with inputs and outputs at rate R; the mesh link rates are marked "?"]
27
Outline
- Basic idea of load-balancing
- Packet mis-sequencing
- An optical switch fabric
- Scaling number of linecards
- Arbitrary arrangement of linecards
28
From Two Meshes to One Mesh
[Figure: the two-stage switch with link rates R and R/N; the In and Out of one port are grouped as "One linecard"]
29
From Two Meshes to One Mesh
[Figure: four linecards, each with In and Out, connected to a first mesh and a second mesh by links at rate R; one linecard is highlighted]
30
From Two Meshes to One Mesh
[Figure: the four linecards (In, Out) connected to a single combined mesh by links at rate 2R]
31
Many Fabric Options
Options
- Space: Full uniform mesh
- Time: Round-robin crossbar
- Wavelength: Static WDM
[Figure: linecards (In, Out) attached to "any spreading device" over channels C1, C2, …, CN; one linecard is highlighted]
N channels each at rate 2R/N
32
AWGR (Arrayed Waveguide Grating Router)
A Passive Optical Component
Wavelength i on input port j goes to output port (i+j-1) mod N
Can shuffle information from different inputs
[Figure: Linecards 1 to N each send wavelengths 1, 2…N into an NxN AWGR, whose outputs 1 to N feed Linecards 1 to N]
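The routing rule can be tabulated to see why an AWGR behaves like a static full mesh. The sketch below uses the slide's formula with ports and wavelengths numbered 1..N; reading a result of 0 as port N is an assumption, since the slide does not say how the modulo is mapped back to port numbers.

```python
def awgr_output(wavelength, in_port, n):
    """Output port reached by a wavelength entering on in_port (both numbered 1..n)."""
    out = (wavelength + in_port - 1) % n
    return out if out != 0 else n  # assumption: 0 is read as port n


n = 4
for in_port in range(1, n + 1):
    print(in_port, [awgr_output(w, in_port, n) for w in range(1, n + 1)])
```

Each input reaches every output on a distinct wavelength, and each wavelength arrives at a given output from exactly one input, so the N wavelengths give N collision-free channels per linecard.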
33
Static WDM Switching: Packaging
[Figure: four linecards (In, Out) connected through an AWGR. Each linecard sends A, B, C, D into the AWGR; the four AWGR outputs carry A, A, A, A; B, B, B, B; C, C, C, C; and D, D, D, D]
AWGR: Passive and Almost Zero Power
N WDM channels, each at rate 2R/N
34
Outline
- Basic idea of load-balancing
- Packet mis-sequencing
- An optical switch fabric
- Scaling number of linecards
- Arbitrary arrangement of linecards
35
Scaling Problem
For N < 64, an AWGR is a good solution. We want N = 640. Need to decompose.
36
A Different Representation of the Mesh
[Figure: four linecards (In, Out) attached to the mesh, drawn on both sides of it; external lines at rate R, mesh attachments at rate 2R]
37
A Different Representation of the Mesh
[Figure: the linecards (In, Out) drawn on both sides with a link of rate 2R/N between each pair; external lines at rate R]
38
Example: N=8
[Figure: linecards 1 to 8 on each side, with a link of rate 2R/8 between each pair]
39
When N is Too Large
Decompose into groups (or racks)
[Figure: linecards 1 to 8 on each side, each attached at rate 2R and arranged in groups; the rates 4R and 4R/4 are marked on the group links]
40
When N is Too Large
Decompose into groups (or racks)
[Figure: Group/Rack 1 to Group/Rack G on each side, each holding linecards 1 to L attached at rate 2R; each group carries 2RL in aggregate, with links of rate 2RL/G between groups]
41
Outline
- Basic idea of load-balancing
- Packet mis-sequencing
- An optical switch fabric
- Scaling number of linecards
- Arbitrary arrangement of linecards
42
When Linecards Fail
[Figure: Group/Rack 1 to Group/Rack G on each side, each holding linecards 1 to L attached at rate 2R; group aggregate 2RL, inter-group links 2RL/G]
Solution: replace mesh with sum of permutations
[Figure: the mesh of 2RL/G links drawn as a sum of permutations ( = + + ), with the relation 2RL ≤ G * 2RL/G]
43
Hybrid Electro-Optical Architecture Using MEMS Switches
[Figure: Group/Rack 1 to Group/Rack G on each side, each holding linecards 1 to L attached at rate 2R, interconnected through MEMS switches]
44
When Linecards Fail
[Figure: Group/Rack 1 to Group/Rack G on each side, each holding linecards 1 to L attached at rate 2R, interconnected through MEMS switches]
45
Fiber Link Capacity
[Figure: the same group/rack arrangement with three MEMS switches; a Laser/Modulator and MUX are shown on the fiber link]
Link Capacity ≈ 64 λ’s * 5 Gb/s/λ = 320 Gb/s = 2R
46
Example: 2 Groups of 2 Linecards
[Figure: Group/Rack 1 and Group/Rack 2 on each side, each with linecards 1 and 2 attached at rate 2R; group aggregate 4R, inter-group links 2R]
47
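Number of MEMS Switches
G groups, $L_i$ linecards in group $i$, $N = \sum_{i=1}^{G} L_i$, $L = \max_k L_k$
Theorem: $M = L + G - 1$ MEMS switches are sufficient for bandwidth.
Examples:
$N = 640,\ L = 16,\ G = 40 \Rightarrow M = 55$
$L = G = \sqrt{N} \Rightarrow M \approx 2\sqrt{N}$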
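The bound in the theorem, L + G - 1 with L the largest group size and G the number of groups, is easy to evaluate for a given arrangement of linecards. A minimal sketch, with an illustrative function name:

```python
def mems_switches(linecards_per_group):
    """Number of MEMS switches sufficient for bandwidth: L + G - 1."""
    g = len(linecards_per_group)   # number of groups
    l = max(linecards_per_group)   # largest group size
    return l + g - 1


print(mems_switches([16] * 40))    # 640 linecards in 40 racks of 16
print(mems_switches([2, 2]))       # 2 groups of 2 linecards
```

The bound depends only on the largest group, so unevenly filled racks cost no more switches than full ones.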
48
Packet Schedule
[Figure: Group A and Group B on each side, each with linecards 1 and 2 attached at rate 2R; group aggregate 4R, inter-group links 2R]
49
Rules for Packet Schedule
At each time-slot:
- Each transmitting linecard sends one packet
- Each receiving linecard receives one packet
- (MEMS constraint) Each transmitting group i sends at most one packet to each receiving group j through each MEMS connecting them
In a schedule of N time-slots:
- Each transmitting linecard sends exactly one packet to each receiving linecard
50
Packet Schedule
Tx linecard   T+1  T+2  T+3  T+4
Tx LC A1      ?    ?    ?    ?     (Tx Group A)
Tx LC A2      ?    ?    ?    ?     (Tx Group A)
Tx LC B1      ?    ?    ?    ?     (Tx Group B)
Tx LC B2      ?    ?    ?    ?     (Tx Group B)
51
Packet Schedule
Tx linecard   T+1  T+2  T+3  T+4
Tx LC A1      A1   A2   B1   B2    (Tx Group A)
Tx LC A2      B2   A1   A2   B1    (Tx Group A)
Tx LC B1      B1   B2   A1   A2    (Tx Group B)
Tx LC B2      A2   B1   B2   A1    (Tx Group B)
52
Bad Packet Schedule
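Tx linecard   T+1  T+2  T+3  T+4
Tx LC A1      A1   A2   B1   B2    (Tx Group A)
Tx LC A2      B2   A1   A2   B1    (Tx Group A)
Tx LC B1      B1   B2   A1   A2    (Tx Group B)
Tx LC B2      A2   B1   B2   A1    (Tx Group B)
At T+2 both linecards of each group send to their own group, and at T+4 both send to the other group, so a group sends two packets to the same receiving group in one time-slot.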
53
Group Schedule
T+1 T+2 T+3 T+4
Tx Group A AB AB AB AB
Tx Group B AB AB AB AB
54
Good Packet Schedule
Tx linecard   T+1  T+2  T+3  T+4
Tx LC A1      A1   A2   B1   B2    (Tx Group A)
Tx LC A2      B2   B1   A2   A1    (Tx Group A)
Tx LC B1      B1   B2   A1   A2    (Tx Group B)
Tx LC B2      A2   A1   B2   B1    (Tx Group B)
Theorem: There exists a polynomial-time algorithm that finds the correct packet schedule.
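The rules for a packet schedule can be turned into a checker and applied to the two schedules shown in the slides. This is a sketch, not the polynomial-time construction from the theorem: it only verifies a given schedule. The limit of one packet per pair of groups per time-slot is an assumption taken from the group schedule for this 2x2 example, and the group of a linecard is read from the first letter of its name.

```python
from collections import Counter

GOOD = {"A1": ["A1", "A2", "B1", "B2"], "A2": ["B2", "B1", "A2", "A1"],
        "B1": ["B1", "B2", "A1", "A2"], "B2": ["A2", "A1", "B2", "B1"]}
BAD = {"A1": ["A1", "A2", "B1", "B2"], "A2": ["B2", "A1", "A2", "B1"],
       "B1": ["B1", "B2", "A1", "A2"], "B2": ["A2", "B1", "B2", "A1"]}


def group(linecard):
    return linecard[0]  # "A1" belongs to group "A"


def violations(schedule, per_pair_limit=1):
    """List the rule violations of a schedule {tx linecard: [rx linecard per slot]}."""
    problems = []
    cards = sorted(schedule)
    slots = len(schedule[cards[0]])
    for t in range(slots):
        receivers = sorted(schedule[tx][t] for tx in cards)
        if receivers != cards:
            problems.append(f"slot {t}: a linecard does not receive exactly one packet")
        pairs = Counter((group(tx), group(schedule[tx][t])) for tx in cards)
        for (g_tx, g_rx), count in sorted(pairs.items()):
            if count > per_pair_limit:  # MEMS constraint (assumed limit)
                problems.append(f"slot {t}: group {g_tx} sends {count} packets to group {g_rx}")
    for tx in cards:
        if sorted(schedule[tx]) != cards:
            problems.append(f"{tx} does not send exactly once to every linecard")
    return problems


print(violations(GOOD))
print(violations(BAD))
```

The bad schedule satisfies the per-linecard rules and fails only the group-level MEMS constraint, which is why the group schedule is fixed first.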
55
Outline
- Basic idea of load-balancing
- Packet mis-sequencing
- An optical switch fabric
- Scaling number of linecards
- Arbitrary arrangement of linecards