automated generation of round-robin arbitration and ... · automated generation of round-robin...
TRANSCRIPT
Automated Generation of Round-robin Arbitration and Crossbar Switch Logic
Eung S. ShinAdvisor: Professor Vincent J Mooney III
School of Electrical and Computer Engineering, Georgia Institute of Technology
2/12/2004 2
Overview
DSP
on-chip network&
arbiter perip
hera
lco
re
DSPGPPGPP
cust
omlo
gic
mem
ory
mod
ule
mem
ory
mod
ule
mem
ory
mod
ule
mem
ory
mod
ule
Crossbar (Xbar)
round-robin arbiter
Multiprocessor System-on-a Chip (SoC)
2/12/2004 3
Arbiter Problems
A fast and powerful arbiter for an SoCA fast arbiter for terabit switching speedsA tedious and error-prone task
Network Switch (16x16)
CrossbarSwitch Fabric
(16x16)x16
16 (16x16 arbiter)s
… … …
VOQ(0,16)
VOQ(0,0)
.
.
.input port 0
VOQ(16,0)
.
.
.
VOQ(16,16)
input port 16
.
.
.
.
.
.
output port 16
output port 0
...
.
.
.
...
req(0, 0)
req(16, 16)
grant(0, 0-16) grant(16, 0-16)
2/12/2004 4
Xbar Problems
Multiple communication channels demanded in a multiprocessor SoCChallenge: reducing productivity gapProductivity gap reduction techniques:
Enhancing IP core reusabilityDeveloping a CAD tool
2/12/2004 5
ObjectiveTo design and automate a fast round-robin arbiter logic generation for a bus or a network switch
The generated arbiter employed to crossbar (Xbar) switch arbitration logic
To automate Xbar generation providing multiple communication paths among masters
The generated Xbar customized according to user specifications
2/12/2004 6
OutlineTerminologyOrigin and history of problems:
Arbiter design: PPE and PPACrossbar switch design: “Smart” Memory
Arbiter designArbiter experimentsRAG: Round-robin Arbiter GeneratorX-Gt: Xbar GeneratorXbar experimentsConclusion
2/12/2004 7
Terminology
MxN Switch: M-input by N-output switchExample: A 32x32 switch − 32-input by 32-output switch with 1024 (322) possible connections between input ports and output ports
Virtual Output Queues (VOQs): to remove possible output port contention (Head of Line (HOL) blocking)VOQ (m, n): m − the input port index; n − the output port index
Example: VOQ (1, 0)
Network Switch (32x32)
CrossbarSwitch Fabric
(32x32)x32
32 (32x32 arbiter)s
… … …
VOQ(0,31)
VOQ(0,0)
.
.
.input port 0
VOQ(31,0)
.
.
.
VOQ(31,31)input port 31
.
.
.
.
.
.
output port 31
output port 0
...
.
.
.
...
req(0, 0)
req(31, 31)
grant(0, 0-31) grant(31, 0-31)
2/12/2004 8
Terminology (Continued)(MxV)xN Switch:
M − the number of input ports of an MxN switchV − the number of VOQs per input portN − the number of output ports of an MxN switchTypically, V = NThe total number of VOQs in an MxN switch − M∗N
Network Switch (32x32)
CrossbarSwitch Fabric
(32x32)x32
32 (32x32 arbiter)s
… … …
VOQ(0,31)
VOQ(0,0)
.
.
.input port 0
VOQ(31,0)
.
.
.
VOQ(31,31)
input port 31
.
.
.
.
.
.
output port 31
output port 0
...
.
.
.
...
req(0, 0)
req(31, 31)
grant(0, 0-31) grant(31, 0-31)
2/12/2004 9
Terminology (Continued)(MxV)xN crossbar switch fabric:
Connections between (MxV) inputs and N outputs
MxM Switch Arbiter (SA):
Controlling M specific transmission gates between M VOQs and a particular output port N MxM SAs in an MxN switch
Thirty-two 32x32 SAs
32 x 32 SA_0
. . .
gran
t (0,
0)
gran
t (1,
0)
gran
t (31
, 0)
VOQ (0, 0)
VOQ (1, 0)
VOQ (31, 0)
.
.
.
.
.
.
VOQ (0, 0)
VOQ (1, 0)
VOQ (31, 0)
.
.
.
.
.
.
32 x 32 SA_31
. . .
gran
t (0,
31)
gran
t (1,
31)
gran
t (31
, 31)
VOQ (0, 31)
VOQ (1, 31)
VOQ (31, 31)
.
.
.
.
.
.
(32x32)x32 Crossbar Switch Fabric
output port 0
output port 31
. . .
.
.
.
.
.
.
2/12/2004 10
Terminology (Continued)MxM distributed SA (MxM hierarchical SA):
Equivalent to an MxM SAConsisting of smaller switch arbiter in the form of a hierarchical tree structure
Bus Arbiter (BA): resolving bus conflicts
PriorityLogic 0
req[0]req[1]req[2]req[3]
Ring Counter
toke
n [0
]
toke
n [1
]
toke
n [2
]
toke
n [3
]
toke
n [0
]
toke
n [1
]
toke
n [2
]
toke
n [3
]
PriorityLogic 2
PriorityLogic 3
PriorityLogic 1
EN
EN
EN
EN
grant[0]
grant[1]
grant[2]
grant[3]
4x4 BA
ackreset
output[0]output[1]output[2]output[3]
in[0]in[1]in[2]in[3]
D-FFD-FFclock
PriorityLogic 0
req[0]req[1]req[2]req[3]
Ring Counter
toke
n [0
]
toke
n [1
]
toke
n [2
]
toke
n [3
]
toke
n [0
]
toke
n [1
]
toke
n [2
]
toke
n [3
]
PriorityLogic 2
PriorityLogic 3
PriorityLogic 1
EN
EN
EN
EN
grant[0]
grant[1]
grant[2]
grant[3]
4x4 BA
ackreset
output[0]output[1]output[2]output[3]
in[0]in[1]in[2]in[3]
D-FFD-FFclock
req0[0]req0[1]req0[2]req0[3]
ack0[0]ack0[1]
req1[0]req1[1]req1[2]req1[3]
2x2rootSA
grant0[0]grant0[1]grant0[2]grant0[3]
grant1[0]grant1[1]grant1[2]grant1[3]
req0req1
8x8 hierarchical SAclock
4x4ack-req
SA 0
counterD
4x4ack-req
SA 1
counterD
ack
req0[0]req0[1]req0[2]req0[3]
ack0[0]ack0[1]
req1[0]req1[1]req1[2]req1[3]
2x2rootSA
grant0[0]grant0[1]grant0[2]grant0[3]
grant1[0]grant1[1]grant1[2]grant1[3]
req0req1
8x8 hierarchical SAclock
4x4ack-req
SA 0
counterD
4x4ack-req
SA 1
counterD
ack
req0[0]req0[1]req0[2]req0[3]
req0[0]req0[1]req0[2]req0[3]
ack0[0]ack0[1]
req1[0]req1[1]req1[2]req1[3]
req1[0]req1[1]req1[2]req1[3]
2x2rootSA
grant0[0]grant0[1]grant0[2]grant0[3]
grant0[0]grant0[1]grant0[2]grant0[3]
grant1[0]grant1[1]grant1[2]grant1[3]
grant1[0]grant1[1]grant1[2]grant1[3]
req0req1
8x8 hierarchical SAclock
4x4ack-req
SA 0
counterD
4x4ack-req
SA 1
counterD
ack
req0[0]req0[1]req0[2]req0[3]
ack0[0]ack0[1]
req1[0]req1[1]req1[2]req1[3]
2x2rootSA
grant0[0]grant0[1]grant0[2]grant0[3]
grant1[0]grant1[1]grant1[2]grant1[3]
req0req1
8x8 hierarchical SAclock
4x4ack-req
SA 0
counterD
4x4ack-req
SA 1
counterD
ack
req0[0]req0[1]req0[2]req0[3]
req0[0]req0[1]req0[2]req0[3]
ack0[0]ack0[1]
req1[0]req1[1]req1[2]req1[3]
req1[0]req1[1]req1[2]req1[3]
2x2rootSA
grant0[0]grant0[1]grant0[2]grant0[3]
grant0[0]grant0[1]grant0[2]grant0[3]
grant1[0]grant1[1]grant1[2]grant1[3]
grant1[0]grant1[1]grant1[2]grant1[3]
req0req1
8x8 hierarchical SAclock
4x4ack-req
SA 0
counterD
4x4ack-req
SA 1
counterD
ack
req0[0]req0[1]req0[2]req0[3]
req0[0]req0[1]req0[2]req0[3]
ack0[0]ack0[1]
req1[0]req1[1]req1[2]req1[3]
req1[0]req1[1]req1[2]req1[3]
2x2rootSA
grant0[0]grant0[1]grant0[2]grant0[3]
grant0[0]grant0[1]grant0[2]grant0[3]
grant1[0]grant1[1]grant1[2]grant1[3]
grant1[0]grant1[1]grant1[2]grant1[3]
req0req1
8x8 hierarchical SAclock
4x4ack-req
SA 0
counterD
4x4ack-req
SA 1
counterD
ack
req0[0]req0[1]req0[2]req0[3]
req0[0]req0[1]req0[2]req0[3]
ack0[0]ack0[1]
req1[0]req1[1]req1[2]req1[3]
req1[0]req1[1]req1[2]req1[3]
2x2rootSA
grant0[0]grant0[1]grant0[2]grant0[3]
grant0[0]grant0[1]grant0[2]grant0[3]
grant1[0]grant1[1]grant1[2]grant1[3]
grant1[0]grant1[1]grant1[2]grant1[3]
req0req1
8x8 hierarchical SAclock
4x4ack-req
SA 0
counterD
4x4ack-req
SA 1
counterD
ack
2/12/2004 11
Requirements for a Terabit Switch ArbiterStarvation freeFast ArbitrationSimplicity to implementLow power:
Power budget of single rack router ~ 10kW
2/12/2004 12
OutlineTerminologyOrigin and history of problems:
Arbiter design: PPE and PPACrossbar switch design: “Smart” Memory
Arbiter designArbiter experimentsRAG: Round-robin Arbiter GeneratorX-Gt: Xbar GeneratorXbar experimentsConclusion
2/12/2004 13
History: Arbiter in PPECentralized Switch Arbiters:
Programmable Priority Encoder (PPE) implementing iterative round-robin algorithm (iSLIP)
P. Gupta and N. Mckeown, “Designing and Implementing a Fast Crossbar Scheduler,” IEEE Micro, 1999, pp. 20-28. N. Mckeown, P. Varaiya, and J. Warland, “The iSLIP Scheduling Algorithm for Input-Queued Switch,”IEEE Transaction on Networks, 1999, pp. 188-201.
tothermo
P_enc
log2 n
Req
n
Priority EncoderPriority Encoder_thermo
n new_Req
n
nn
any_Gnt_PE_thermo
n
Gnt_PE_thermo
Gnt_PE
n
Gnt
P_thermo
2/12/2004 14
History: Arbiter in PPADistributed Switch Arbiter:
Ping Pong Arbiter (PPA)H. J. Chao, C. H. Lam, and X. Guo, “A Fast Arbitration Scheme for Terabit Packet Switches,” Proceedings of IEEE Global Telecommunications Conference, 1999, pp. 1236-1243.
Comparison: our generated SA 2.3X faster than PPE and 1.8X faster than PPA
1613 14 15129 10 1185 6 741 2 31 2
external grant signals
layer 1
layer 2
layer 3
layer 4
root PPA intermediate PPA leaf PPA
r0r1Fi
Gg0
Gg1
g0 g1 Fo
2x2PPA
Q D
Clock
r0
r1
g0
g1
Fi Fo
Gg0 Gg1
2/12/2004 15
Why do we need an arbiter for an SoC?
Arbitration required by all busesOur arbiter applicable to anywhere requiring arbitrationThe generated arbiter utilized in our Xbar
2/12/2004 16
History: Crossbar Switch
Smart memory:Reconfigurable crossbar between bus masters (2 integer clusters and 1 floating point cluster) and slaves (SRAMs)
K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally and M. Horowitz, “Smart Memories: A Modular Reconfigurable Architecture,” Proceedings of International Symposium on Computer Architecture (ISCA), June 2000, pp. 161-171.
2/12/2004 17
OutlineTerminologyOrigin and history of problems:
Arbiter design: PPE and PPACrossbar switch design: “Smart” Memory
Arbiter designArbiter experimentsRAG: Round-robin Arbiter GeneratorX-Gt: Xbar GeneratorXbar experimentsConclusion
2/12/2004 18
Bus Arbiter DesignImplemented based on ring counter for a token and “priority logic”Priority Logic for 4 inputs:
output[0] = EN•in[0]output[1] = EN•in[0]'•in[1]output[2] = EN•in[0]'•in[1]'•in[2]output[3] = EN•in[0]'•in[1]'•in[2]'•in[3]
000000001
100010001
0100X1001
0010XX101
0001XXX11
0000XXXX0
output [3]output [2]output [1]output [0]in [3]in [2]in [1]in [0]EN
2/12/2004 19
Previous Bus Arbiter Approach
FIFO ArbiterRound-robin arbiter with fixed prioritiesRound-robin arbiter with priority rotation
M-inputPriority Logic
M-request M-grant M-grantM-inputPriority Encoder
M-requestro
tatio
n
deco
der
2/12/2004 20
Example: Our Bus Arbiter
Condition:Token=4’b0100 → Processor 2 with the highest priorityProcessors 0 and 1 requesting a bus
Result:Only Priority Logic 2 enabledProcessor 0 granted due to negated request signals of the higher priority parties (processor 2 and processor 3) Token rotated to 4’b1000 after the ring counter receiving ack signal
Processor 2
req[0]
Processor 0 Processor 1 Processor 3
Memory
PL 2
token[2]
ring counter
grant[0]
4x4 BA
req[1]output[2]
ack
2/12/2004 21
Example: Our Bus Arbiter (Continued)
PriorityLogic 0
Ring Counter
toke
n [0
]
toke
n [1
]
toke
n [2
]
toke
n [3
]
PriorityLogic 2
PriorityLogic 3
PriorityLogic 1
grant[2]
grant[3]
4x4 BA
ack
reset
output[0]output[1]output[2]output[3]
req[0]req[1]req[2]req[3]
EN
EN
EN
in[0]in[1]in[2]in[3]
D-FFclock
grant[0]
grant[1]
grant[0]
PriorityLogic 2
toke
n [2
]to
ken
[2]
PriorityLogic 2
EN
2/12/2004 22
Hierarchical SA DesignA hierarchical SA consisting of small switch arbiter blocksSix types of switch arbiter blocks:
2x2 ack-req SA3x3 ack-req SA4x4 ack-req SA2x2 root SA3x3 root SA4x4 root SA
A root SA: placed on the top of a hierarchy
2/12/2004 23
Switch Arbiter Blocks
2x2Bus
Arbiter
ack
req0[0]req0[1]
grant0[1]
grant0[2]
req0
2x2 ack-reqSAclock
reset3x3Bus
Arbiter
ack grant0[0]grant0[1]
grant0[2]req0[0]req0[1]req0[2]
req0
3x3 ack-reqSAclockreset 4x4
BusArbiter
ack grant0[0]grant0[1]
grant0[2]grant0[3]req0[3]
req0[0]req0[1]req0[2]
req0
4x4 ack-reqSAclockreset
2x2BA
without D flip-flop
ack0
ack1
req0
req1
2x2 root SAclock
ring counterresettoken[1:0] 3x3
BAwithout
D flip-flop
ack0ack1
ack2
req0req1req2
3x3 root SAclock ring counterreset
4x4BA
without D flip-flop
ack0ack1ack2ack3
req0req1req2req3
4x4 root SAclock ring counterreset
Example: 32x32
hierarch-ical SA
4x4ack-req
SAl0.sa0
req0[0]req0[1]req0[2]req0[3]
ack0[0]
4x4ack-req
SAl0.sa1
ack0[1]
4x4ack-req
SAl0.sa2
ack0[3]
4x4ack-req
SAl0.sa3
4x4ack-req
SAl1.sa0
req1[0]req1[1]req1[2]req1[3]
req2[0]req2[1]req2[2]req2[3]
req3[0]req3[1]req3[2]req3[3]
ack0[2]
4x4ack-req
SAl0.sa4
req4[0]req4[1]req4[2]req4[3]
ack1[0]
4x4ack-req
SAl0.sa5
ack1[1]
4x4ack-req
SAl0.sa6
ack1[3]
4x4ack-req
SAl0.sa7
4x4ack-req
SAl1.sa1
req5[0]req5[1]req5[2]req5[3]
req6[0]req6[1]req6[2]req6[3]
req7[0]req7[1]req7[2]req7[3]
ack1[2]
2x2rootSA
grant0[0]grant0[1]grant0[2]grant0[3]
grant1[0]grant1[1]grant1[2]grant1[3]
grant2[0]grant2[1]grant2[2]grant2[3]
grant3[0]grant3[1]grant3[2]grant3[3]
grant4[0]grant4[1]grant4[2]grant4[3]
grant5[0]grant5[1]grant5[2]grant5[3]
grant6[0]grant6[1]grant6[2]grant6[3]
grant7[0]grant7[1]grant7[2]grant7[3]
up_req[0]up_req[1]
req0
req1
req2 req3
req4req5
req6req7
up_ack0
up_ack1
clock
token = 2’b10
token = 4’b0010
token = 4’b0010
0111
1100
1111
1101
0011
1111
0011
1111
4x4ack-req
SAl0.sa0
req0[0]req0[1]req0[2]req0[3]
ack0[0]
4x4ack-req
SAl0.sa1
ack0[1]
4x4ack-req
SAl0.sa2
ack0[3]
4x4ack-req
SAl0.sa3
4x4ack-req
SAl1.sa0
req1[0]req1[1]req1[2]req1[3]
req2[0]req2[1]req2[2]req2[3]
req3[0]req3[1]req3[2]req3[3]
ack0[2]
4x4ack-req
SAl0.sa4
req4[0]req4[1]req4[2]req4[3]
ack1[0]
4x4ack-req
SAl0.sa5
ack1[1]
4x4ack-req
SAl0.sa6
ack1[3]
4x4ack-req
SAl0.sa7
4x4ack-req
SAl1.sa1
req5[0]req5[1]req5[2]req5[3]
req6[0]req6[1]req6[2]req6[3]
req7[0]req7[1]req7[2]req7[3]
ack1[2]
2x2rootSA
grant0[0]grant0[1]grant0[2]grant0[3]
grant1[0]grant1[1]grant1[2]grant1[3]
grant2[0]grant2[1]grant2[2]grant2[3]
grant3[0]grant3[1]grant3[2]grant3[3]
grant4[0]grant4[1]grant4[2]grant4[3]
grant5[0]grant5[1]grant5[2]grant5[3]
grant6[0]grant6[1]grant6[2]grant6[3]
grant7[0]grant7[1]grant7[2]grant7[3]
up_req[0]up_req[1]
req0
req1
req2 req3
req4req5
req6req7
up_ack0
up_ack1
clock
up_req[1]
req5[1]
4x4ack-req
SAl0.sa5
req5
grant5[1]
Example: 32x32
hierarch-ical SA
2x2rootSA
4x4ack-req
SAl1.sa1
up_ack1
ack1[1]
ack1[1]
D
D
up_ack1
req5[1]
up_req[1]
req5
2x2rootSA
grant5[1]
up_ack1
req5[1]
4x4ack-req
SAl0.sa5
req5
ack1[1]
up_req[1]
2x2rootSA
up_ack1
grant5[1]
ack1[1]
grant5[1]
clock
2x2rootSA
up_req[0]
up_req[1]
req5[0]
req5[1]req5[2]
req5[3]
4x4BA
counterD
req4req5req6req7
4x4BA
counterD
up_ack0
up_ack1
l1.sa1.output[1]
l0.sa5.output[1]
32 x 32 SA Critical Path:•Travels through two 4-input OR gates in series•Then through a 2x2 root SA•Finally through two 2-input AND gates in series•Results in 0.94ns delay using a TSMC 0.25µ standard cell library from LEDA Systems.
ack1[1]
D
D
up_ack1
req5[1]
up_req[1]
req5
2x2rootSA
grant5[1]
up_ack1
ack signals look like feedback path through the same logic block. In fact there is no input to the same logic gates.
2/12/2004 26
Hierarchical BAExtra ‘ack’ input: indicating the completion of a single use of the bus Root token rotation:
By an ‘ack’ signal for a hierarchical BABy clock signal for a hierarchical SA
Possession of a bus for multiple cycles
req0[0]
req0[1]
req0[2]
req0[3]
ack0[0]
ack0[1]
req1[0]
req1[1]
req1[2]
req1[3]
2x2rootSA
grant0[0]
grant0[1]
grant0[2]
grant0[3]
grant1[0]
grant1[1]
grant1[2]
grant1[3]
req0req1
8x8 hierarchical BA
clock
4x4ack-req
BA 0
counterD
4x4ack-req
BA 1
counterD
ack
counterD
D
4x4Bus
Arbiter
D
D
D
ackclock
req0[3]
req0[1]
req0[2]
req0[0]
reset
4x4 ack-req BA
grant0[3]
grant0[1]
grant0[2]
grant0[0]
req0
D latches triggered the positive edge of ack
2/12/2004 27
OutlineTerminologyOrigin and history of problems:
Arbiter design: PPE and PPACrossbar switch design: “Smart” Memory
Arbiter designArbiter experimentsRAG: Round-robin Arbiter GeneratorX-Gt: Xbar GeneratorXbar experimentsConclusion
2/12/2004 28
Comparisons with PPE and PPA
Using TSMC 0.25µ std. cell library from LEDA SystemsSynthesized by Design Compiler from Synopsys
0
1000
2000
3000
4000
5000
6000
0 50 100 150
MxM arbiter
Are
a of
arb
iter i
n th
e nu
mbe
r of
inve
rter
eq
uiva
lent
s
SAPPEPPA
0
0.5
1
1.5
2
2.5
3
3.5
0 50 100 150
MxM arbiter
Del
ay in
arb
iter w
ith T
SMC
.25u
m
SAPPEPPA
1.81X1.85X
1.94X
2.31X 2.03X
2.38X
2/12/2004 29
Comparisons (Continued)
The shortest delay results fromLimiting the size of switch arbiter blocks to 2x2, 3x3 and 4x4 to reduce the critical path delayDue to the expansion of priority logic blocks compared with PPEReducing the number of levels in a hierarchy by preferring to use more 4x4 switch arbiter blocks compared with PPA
2/12/2004 30
Key Insights
Our hierarchical SA vs. PPE:
Much larger logic equations of PPEMore constrained to a specific structure in our SAA distributed “state” kept by token
16x16 Hierarchical SA
4x44x4
4x44x4
4x416 requests
4
4
4
4
16 grants
16x16 PPE
16-inputPE16 requests 16 grants
No token
token
token
2/12/2004 31
Key Insights (Continued)
Our hierarchical SA vs. PPA:
PPA with token passing approach.MxM PPA implemented by 2-input basic arbiterOur hierarchical SA preferring utilizes 4-input switch arbiter blocks.
16 grants
16x16 Hierarchical SA
4x44x4
4x44x4
4x416 requests
4
4
4
4
16 grants
16x16 PPA
2x2
2x22x2
2x2
2x22x2
2x2
2x2
2x22x2
2x2
2x22x2
2x2
2x216 requests
2
2x2PPA
Q D
Clock
r0
r1
g0
g1
Fi Fo
Gg0 Gg1
2/12/2004 32
Power Comparison
For dynamic power estimation, a Switching Activity Information File (SAIF) is extracted from Synopsys VCS.Assume congested network: all input requests are asserted.
RTL VerilogRTL Simulation
using of Synopsys VCS
Power Compiler
Back AnnotationSAIF file
PowerEstimation
TechnologyLibrary
Design Compiler
2/12/2004 33
Power Comparison
Using TSMC 0.25µ std. cell library from LEDA SystemsEstimated by Power Compiler from Synopsys
Static Power Dissipation with TSMC 0.25um
0
2000
4000
6000
8000
10000
0 20 40 60 80 100 120 140
MxM
pW
SAPPAPPE
Dynamic Power Dissipation with TSMC 0.25um
05
101520253035
0 50 100 150
MxMm
W
SAPPAPPE
2/12/2004 34
Power ComparisonTotal Power Dissipation
05
101520253035
0 50 100 150
Number of Requests
mW
SAPPAPPE
2/12/2004 35
OutlineTerminologyOrigin and history of problems:
Arbiter design: PPE and PPACrossbar switch design: “Smart” Memory
Arbiter designArbiter experimentsRAG: Round-robin Arbiter GeneratorX-Gt: Xbar GeneratorXbar experimentsConclusion
RAGA user specification: an arbiter typeA user specification: the number of masters (M) to be arbitratedOutput of RAG: a synthesizable Verilog code for a Bus Arbiter or a Switch Arbiter at the RTL
User input:1. Type of the arbiter:3. Number of masters
Library2x2 ack-req SA3x3 ack-req SA4x4 ack-req SA2x2 ack-req BA3x3 ack-req BA4x4 ack-req BA2x2 root SA3x3 root SA4x4 root SA
integrate M x Mhierarchical switch arbiterinteg_arb();
integrate M x Mhierarchical bus arbiterinteg_bus_arb();
calculate the number oflevels and the number ofbasic arbiter blocks foreach levelInterpret_sa();
SABA
BA SA
number-of_levels ← 0;dividend←number_of_masters;remainder← (dividend modulo 4);
dividend=0?
No
dividend ←floor(dividend/4);number_of_levels ++;
dividend=1 and remainder=0 ?
dividend←number_of_masters;n← 0;
Yes
dividend>3?
remainder ← dividend modulo 4;dividend← (dividend/4);4by4_in_level[n]← dividend;
Yes
dividend modulo 3 =0 ?
No Yes
Yes
dividend modulo 4=0?
YesNo
remainder ← dividend modulo 3;dividend← (dividend/3);3by_in_level[n]← dividend;
remainder ← dividend modulo 4;dividend← floor (dividend/4);4by_in_level[n]← dividend;
No
remainder=3 ?
2by2_in_level[n]++; Yes
3by3_in_level[n]++;
No
dividend>2?
No
dividend← (dividend/3);3by3_in_level[n]← dividend;
Yes
dividend← floor (dividend/2);2by2-in_level[n]← dividend;
No
dividend = 4by4_in_level[n] + 3by3_in_level[n] + 2by2_in_level[n] ;
n = number_of_levels-1 ?
Yes
No
Generate Hierarchical Arbiter
Yes
num_level ← 0dividend←num_mastersremainder← 0
dividend=0?
No
dividend ←(integer) (dividend/4)n ←num_levelnum_level ++
dividend=0 and remainder=0 ?
dividend←num_mastersn← 0
Yes
dividend>2?
remainder ← dividend mod 4dividend←(integer) (dividend/4)num_4by4_level[n]← dividend
remainder ← dividend mod 2dividend←(integer) (dividend/2)num_2by2_level[n]← dividend
No Yes
remainder=0 ?
remainder>2 ?
No
num_4by4_level[n]++num_2by2_level[n]++
YesNo
n++
remainder=0 ?
No
Yes
No Yes
dividend←num_4by4_level[n]+num_2by2_level[n]n<num_level?
No
Yes Hierarchical SA
dividend=32
dividend=32n=0
Yes
RAG (Continued)dividend=8n=0num_4by4_level(0)=0num_2by2_level(0)=0num_level=1
dividend=2n=1num_4by4_level(1)=0num_2by2_level(1)=0num_level=2
dividend=0n=2num_4by4_level(2)=0num_2by2_level(2)=0num_level=3
remainder=0dividend=8num_4by4(0)=8
4x4
SA
4x4
SA
4x4
SA
4x4
SA
4x4
SA
4x4
SA
4x4
SA
4x4
SAn=1
dividend=8
remainder=0dividend=2num_4by4(1)=2
4x4
SA
4x4
SA
n=2
dividend=2
remainder=0dividend=1num_2by2(2)=12x2
root
SA
n=3
dividend=1
2/12/2004 38
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
2x2
roo t
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
2x2
roo t
S A
RAG (Continued)
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
2x2
roo t
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
4x4
S A
2x2
roo t
S A
4x4ack-req
SAl0.sa0
req0[0]req0[1]req0[2]req0[3]
req0[0]req0[1]req0[2]req0[3]
ack0[0]
4x4ack-req
SAl0.sa1
ack0[1]
4x4ack-req
SAl0.sa2
ack0[3]
4x4ack-req
SAl0.sa3
4x4ack-req
SAl1.sa0
req1[0]req1[1]req1[2]req1[3]
req1[0]req1[1]req1[2]req1[3]
req2[0]req2[1]req2[2]req2[3]
req2[0]req2[1]req2[2]req2[3]
req3[0]req3[1]req3[2]req3[3]
req3[0]req3[1]req3[2]req3[3]
ack0[2]
4x4ack-req
SAl0.sa4
req4[0]req4[1]req4[2]req4[3]
req4[0]req4[1]req4[2]req4[3]
ack1[0]
4x4ack-req
SAl0.sa5
ack1[1]
4x4ack-req
SAl0.sa6
ack1[3]
4x4ack-req
SAl0.sa7
4x4ack-req
SAl1.sa1
req5[0]req5[1]req5[2]req5[3]
req5[0]req5[1]req5[2]req5[3]
req6[0]req6[1]req6[2]req6[3]
req6[0]req6[1]req6[2]req6[3]
req7[0]req7[1]req7[2]req7[3]
req7[0]req7[1]req7[2]req7[3]
ack1[2]
2x2root
SA
grant0[0]grant0[1]grant0[2]grant0[3]
grant0[0]grant0[1]grant0[2]grant0[3]
grant1[0]grant1[1]grant1[2]grant1[3]
grant1[0]grant1[1]grant1[2]grant1[3]
grant2[0]grant2[1]grant2[2]grant2[3]
grant2[0]grant2[1]grant2[2]grant2[3]
grant3[0]grant3[1]grant3[2]grant3[3]
grant3[0]grant3[1]grant3[2]grant3[3]
grant4[0]grant4[1]grant4[2]grant4[3]
grant4[0]grant4[1]grant4[2]grant4[3]
grant5[0]grant5[1]grant5[2]grant5[3]
grant5[0]grant5[1]grant5[2]grant5[3]
grant6[0]grant6[1]grant6[2]grant6[3]
grant6[0]grant6[1]grant6[2]grant6[3]
grant7[0]grant7[1]grant7[2]grant7[3]
grant7[0]grant7[1]grant7[2]grant7[3]
up_req[0]up_req[1]
req0
req1
req2 req3
req4req5
req6req7
up_ack0
up_ack1
clock
User input:1. Type of the arbiter:3. Number of masters
Library2x2 ack-req SA3x3 ack-req SA4x4 ack-req SA2x2 ack-req BA3x3 ack-req BA4x4 ack-req BA2x2 root SA3x3 root SA4x4 root SA
integrate M x Mhierarchical switch arbiterinteg_arb();
integrate M x Mhierarchical bus arbiterinteg_bus_arb();
calculate the number oflevels and the number ofbasic arbiter blocks foreach levelInterpret_sa();
SABA
BA SA
integrate M x Mhierarchical switch arbiterinteg_arb();
2/12/2004 39
SpeedupThroughput comparison for 64-bit switching 32x32 network switch
The longest delay of 32x32 switch = 0.63ns in .25µTSMC → the maximum throughput of switch determined by arbitration delay
Experimental setup:Replacing SA in 32x32 network switch with our hierarchical SA, PPE and PPA
VO
Q C
ontro
llers
32 32x32 SAs
64-bit 32x32 Switch Fabric
322
64-b
it V
OQ
s
The floorplan of the 64-bit 32x32 switch fabric, VOQs, controllers and SAs
Area: 125.64mm2
2/12/2004 40
Speedups (Continued)Speedup in .25µ TSMC
Our hierarchical 32x32 SA: 64 [email protected] delay →2.18Tbps32x32 PPA: [email protected] delay → 1.20Tbps32x32 PPE: [email protected] delay → 0.94TbpsResults: The throughput achieved by our SA > 1.8X than PPA and > 2.3X than PPE
Network Switch (32x32)
CrossbarSwitch Fabric
(32x32)x32
32 (32x32 arbiter)s
… … …
VOQ(0,31)
VOQ(0,0)
.
.
.input port 0
VOQ(31,0)
.
.
.
VOQ(31,31)
input port 31
.
.
.
.
.
.
output port 31
output port 0
...
.
.
.
...
req(0, 0)
req(31, 31)
grant(0, 0-31) grant(31, 0-31)
2/12/2004 41
Perfectly Fair?
PriorityLogic 0
req[0]req[1]req[2]req[3]
Ring Counter
toke
n [0
]
toke
n [1
]
toke
n [2
]
toke
n [3
]
PriorityLogic 2
PriorityLogic 3
PriorityLogic 1
EN
EN
EN
EN
grant[0]
grant[1]
grant[2]
grant[3]
4x4 BA
ackreset
output[0]output[1]output[2]output[3]
in[0]in[1]in[2]in[3]
D-FFclock
EN
0110
# of grants
req #
11
10
# of grants
req #
PriorityLogic 0
EN
EN
PriorityLogic 1
ENEN
PriorityLogic 2
EN
11
20
# of grants
req #
PriorityLogic 3
EN11
30
# of grants
req #How about PPE and PPA?
2/12/2004 42
Fairness SimulationJoint work with Dr. Riley and Tom ChengSimulation with the 32x32 hierarchical SASeveral input patterns applied for uniform traffic simulationBursty and TCP traffics applied Traffic intensity, ρ: a measure of the demand on the switch for bursty and TCP traffics.
2/12/2004 43
Simulation ResultsUniform traffic: metrics are the number of grants
002499993
25000002500002
2500002500002500001
4999997490002500000
Grants,0, 1, 2 asserted
Grants,0 , 1 asserted
Grant, all asserted4x4 ack-req input
Run 3Run 2Run 1
2/12/2004 44
Simulation Results (Continued)
Bursty traffic: metrics are average delay
1.102.004.6316.4431.113
1.011.824.6117.1430.952
1.101.934.6015.5931.231
1.051.984.4616.6531.430
Delay,ρ = 0.1
Delay,ρ = 0.5
Delay,ρ = 0.9
Delay,ρ = 1.0
Delay,ρ = 2.0
4x4 ack-req input
2/12/2004 45
Simulation Results (Continued)
TCP traffic: metrics are average delay
1.101.94 12.13 17.54 19.22 3
1.10 1.93 10.69 19.71 20.44 2
1.10 1.95 11.2919.48 19.91 1
1.10 1.95 11.68 16.75 18.12 0
Delay,ρ = 0.1
Delay,ρ = 0.5
Delay,ρ = 0.9
Delay,ρ = 1.0
Delay,ρ = 2.0
4x4 ack-req input
2/12/2004 46
OutlineTerminologyOrigin and history of problems:
Arbiter design: PPE and PPACrossbar switch design: “Smart” Memory
Arbiter designArbiter experimentsRAG: Round-robin Arbiter GeneratorX-Gt: Xbar GeneratorXbar experimentsConclusion
2/12/2004 47
Why Xbar and X-Gt?
Xbar:Demand on multiple communication channelsConcurrent accesses to resources
X-Gt:Generation customized Xbar on-the-flyThe generated Xbar in RTL Verilog
2/12/2004 48
Xbar SwitchOne configuration for 4x4 caseEach 4x1 switches: Comparing physical addresses from PEs and judging if addresses belong to the address space of the attached memory blockMaximum of 4 concurrent memory access in one cycle
PE3
PE2
PE1
PE0
4x4 Xbar
4x1switch
34x1sw
itch2
mem0
mem1
mem3
mem2
4x1switch
04x1sw
itch1
2/12/2004 49
Xbar (Continued)An MxN switch consisting of N Mx1 switches, M = # of PEs & N = # of memory blocksAn assertion of ‘mem_req’ inside a Mx1 switch: depending on an ‘address’ input from a PEGrant: given by an arbiter by the assertion of ‘mem_on’ signalsArbitration in round-robin order: handling M requests from M processors
2/12/2004 50
4x1 Switch ExampleScenario:
Request from PE 0 and PE 3PE 3 granted
pe_req[0]
pe_req[3]
arbiter
comp...
pe_addr0
pe_addr3
addr bus
switch
mem_on[0]
mem_on[3]
mem_data
mem_req[3]
...
...
.
.
.
...
pe_data0
pe_data3
databus
switch
.
.
.
pe_re0
pe_re3
wireswitch
.
.
.
...
pe_we0
pe_we3
wireswitch
.
.
.
...
pe_ta0
pe_ta3
wire_taswitch
.
.
.
...
mem_we mem_ta
mem_req[0]
mem_addr mem_re
2/12/2004 51
Example of Xbar Configuration
4x1switch
74x1sw
itch4
mem0
mem1
mem2
mem3
mem7
mem6
mem5
mem4
4x8 Xbar:Supporting 4 PEs and 8 memory modulesConnecting a particular PE signals to a specific memory module by a 4x1 switch according to a physical address from a PE
4x1switch
04x1sw
itch1
4x1switch
24x1sw
itch3
4x1switch
64x1sw
itch5
2/12/2004 52
X-Gt (Xbar Generator)
Generation of customized Xbar in Verilog at the RTL
An arbiter generated by RAGParameterizable switch blocks in an Mx1 switchAll submodules connected by wire names
2/12/2004 53
X-Gt (Continued)Parameters:
The number of PEs that determines M in an MxN XbarThe number of memory blocks that determines N in an MxN XbarThe total global memory size that determines (pe_)address bus widthThe data bus width of each PE determined by PE typeThe (mem_)address bus width determined by the size of memory block
2/12/2004 54
parameters
m<M
pe_req[m]...
m++
gen_proc_wire(M)
n<N
mem_addrn...
n++
m<N
gen_mem_wire(M)
gen_addr_bus_switch(M)gen_data_bus_switch(M)
gen_wire_switch(M)gen_wre_ta_switch(M)
yes
yes
RAG generatingan arbiter
m++
gen_comp(M)
yesgen_Mx1(parameters)
MxN Xbar
Flowchart of X-Gt
2/12/2004 55
parameters
m<M
pe_req[m]...
m++
gen_proc_wire(M)
n<N
mem_addrn...
n++
m<N
gen_mem_wire(M)
gen_addr_bus_switch(M)gen_data_bus_switch(M)
gen_wire_switch(M)gen_wre_ta_switch(M)
yes
yes
RAG generatingan arbiter
n++
gen_comp(M)
yes
gen_Mx1(parameters)
4x4 Xbar
pe_req[0]pe_addr0pe_data0pe_read0pe_write0pe_ta0
m=0
M=4, N=4
pe_req[1]pe_addr1pe_data1pe_read1pe_write1pe_ta1
m=1pe_req[2]pe_addr2pe_data2pe_read2pe_write2pe_ta2
m=2pe_req[3]pe_addr3pe_data3pe_read3pe_write3pe_ta3
m=3
mem_addr0mem_data0mem_read0mem_write0mem_ta0
n=0mem_addr1mem_data1mem_read1mem_write1mem_ta1
n=1mem_addr2mem_data2mem_read2mem_write2mem_ta2
n=2mem_addr3mem_data3mem_read3mem_write3mem_ta3
n=3
4x1switch
0
m=1
4x1switch
1
m=2
4x1switch
2
m=3
4x1switch
3
m=4
4x1switch
34x1sw
itch2
4x1switch
04x1sw
itch1
Verilog in RTL
2/12/2004 56
OutlineTerminologyOrigin and history of problems:
Arbiter design: PPE and PPACrossbar switch design: “Smart” Memory
Arbiter designArbiter experimentsRAG: Round-robin Arbiter GeneratorX-Gt: Xbar GeneratorXbar experimentsConclusion
2/12/2004 57
Xbar Synthesis: Mx1 Gate Area
Using TSMC 0.25µ std. cell library from Artisan ComponentsEstimated by Design Compiler from Synopsys
0
1000
2000
3000
4000
5000
6000
2 3 4 5 6 7 8 9
Number of processors
Mx1
sw
itch
area
in th
e nu
mbe
r of
INVE
RTE
R e
quiv
alen
ts w
ith T
SMC
.2
5um
2/12/2004 58
Xbar Synthesis: MxM Area
Using TSMC 0.25µ std. cell library from Artisan ComponentsGate Area estimated by Design Compiler from SynopsysGate + Wire Area estimated by Silicon Ensemble from Cadence
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
2 4 6 8 10Number of processors and number of
memory blocks
MxM
Xba
r are
a in
squ
are
mm
with
TS
MC
.25u
m
Gate AreaGate + Wire Area
2/12/2004 59
Back Annotated Timing An RTL Verilog (output of X-Gt) synthesized by DC with TSMC .25µfrom Artisan ComponentsBack annotation of SDF and set_load files to logic synthesis stage after place and route using Synopsys Design Compiler and Cadence Silicon Ensemble
Logic Synthesis
Place & Route
ExtractRC values
Design Compiler
Silicon Ensemble
report timing
RTL VerilogX-Gt
Back annotate SDF and set load files
2/12/2004 60
MxM Xbar Delay
0
5
10
15
20
25
30
35
2 3 4 5 6 7 8 9
Number of processors and number of memory blocks
MxM
Xba
r del
ay in
ns
with
TSM
C
.25u
mw/o back annotation w/ back annotation
2/12/2004 61
ConclusionShowed how we design hierarchical SA and hierarchical BA.Demonstrated ≥ 1.8X speedups for our hierarchical SA over PPE and PPA.Illustrated how RAG generates a hierarchical SA.Demonstrated Xbar switch and showed how X-Gt generates an Xbar according to user specifications.Illustrated gap between delays from logic synthesis stage and delays from physical synthesis stage for MxM Xbars.
2/12/2004 62
PublicationsAs (co-) First Author:
K. Ryu, E. S. Shin, and V. Mooney, “A Comparison of Five DifferentMultiprocessor SoC Bus Architectures,” Proceedings of the 2001 Euromicro Symposium on Digital Systems Design (DSD’01), September 2001. E. S. Shin, V. Mooney, and G. Riley, “Round-robin Arbiter Design and Generation,” Proceedings of 15th International Symposium on System Synthesis (ISSS’02), October 2002. E. S. Shin, V. Mooney, and G. Riley, “Round-robin Arbiter Design and Generation,” Georgia Institute of Technology Technical Report, GIT-CC-02-38, Available HTTP: http://www.cc.gatech.edu/tech_reports/index.02.html M. Shalan, E. S. Shin and V. Mooney, “DX-Gt: Memory Management and Crossbar Switch Generator for Multriprocessor System-on-a-Chip,”Workshop on Synthesis And System Integration of Mixed Information technologies (SASIMI’03), April 2003. E. S. Shin, V. Mooney, and G. Riley, “Round-robin Arbiter Design and Generation,” submitted to IEEE Transactions on CAD.
2/12/2004 63
Publications (Continued)
Other publications:P. Cheng, Eung S. Shin, G. R. Riley and V. J. Mooney III, “SASim: Switch Arbiter Simulator,” Georgia Institute of Technology Technical Report, GIT-CC-03-38, Available HTTP: http://www.cc.gatech.edu/tech_reports/index.03.html.A. Talpasanu , J. A. Davis, E. S. Shin and V. J. Mooney, “Crossbar Switch Interconnect Delay Calculation,” Georgia Institute of Technology Technical Report, GIT-CC-03-37, Available HTTP: http://www.cc.gatech.edu/tech_reports/index.03.html.
Provisional Patent:E. S. Shin and V. Mooney, “Fast Distributed Switch Arbiter Design and Generation.”