A Fast Hierarchical Arbitration Scheme for Multi-Tb/s Packet Switches with Shared Memory Switching
TRANSCRIPT
◆ A Fast Hierarchical Arbitration Scheme for Multi-Tb/s Packet Switches with Shared Memory Switching
Daniel Popa, Georg Post, and Ludovic Noirie
One challenge in multi-terabit per second packet switches is the design of low latency and high performance arbitration schemes. In this regard, we propose a hierarchical multi-cell arbiter, which interconnects multiple parallel processing devices. By comparison with iterative non-hierarchical multi-cell arbiters, we show that our scheme significantly decreases the signaling overhead and arbitration processing time. Performance evaluation results show the proposed solution maximizes the switch throughput and delay performance. © 2009 Alcatel-Lucent.
Bell Labs Technical Journal 14(2), 81–96 (2009) © 2009 Alcatel-Lucent. Published by Wiley Periodicals, Inc. Published online in Wiley InterScience (www.interscience.wiley.com) • DOI: 10.1002/bltj.20374

Introduction

Combined input-output queued (CIOQ) architectures have been widely considered as a feasible solution for high-speed packet switches and Internet Protocol (IP) routers [3, 5]. CIOQ switches are attractive because they achieve high throughput under admissible traffic patterns using simple arbitration schemes. However, although CIOQ architectures deal efficiently with arbitration time constraints, since input and output port selection are independently performed, the arbitration speed issue still persists in Tb/s systems.

Packet switch arbitration is an application of bipartite graph matching, which attempts to find a maximal input-to-output bandwidth matching [1]. Conventional fabric arbiters use single-cell arbitration [3]. During each arbitration cycle, a single-cell arbiter matches one cell between any pair of input and output. In Tb/s packet switches, single-cell arbiters face the challenge of computing a maximal bandwidth matching within one cell duration and of signaling the results to all the inputs.

The arbitration time issue can be simplified by dividing it into two levels: 1) breaking the upper limit of one cell duration on computation time, and 2) reducing the linear dependence O(T) of the computation time of a maximal matching on the switch size (T).

The goal of the first level can be achieved by using either envelope or multi-cell arbiters. In envelope arbitration [2], instead of segmenting variable-size packets into small fixed-size cells, packets are aggregated into large fixed-size envelopes. During each arbitration cycle, an envelope arbiter matches a single envelope between any input and output. Although an envelope arbiter can considerably relax the constraint on arbitration time, since the arbitration cycle is equal to the envelope time length, it has two major drawbacks that make it unsuitable for practical implementations. First, it saves no time because packets have to wait at the “head of the line” for a filled envelope before being considered for arbitration. Second, some bandwidth is wasted because of the mismatch between the variable size of packets and the fixed size of the envelope. In multi-cell arbitration [11, 13], during each arbitration cycle, the arbiter matches a group of cells between any input and output. As a
consequence, an arbitration cycle can be several orders of magnitude longer than the cell duration.
This feature considerably relaxes the hard timing con-
straint on arbitration.
The goal of the second level has been somewhat
achieved by designing fast single-cell arbiters, such as
the ping pong arbiter (PPA) [4] and fast fair arbiter
(FFA) [15]. However, although FFA and PPA decrease
arbitration time from O(T) to O(log2(T)), and thus
keep computation time below cell duration, their
design is still challenging for line card rates greater
than 10 Gb/s. As an example of the stringent timing,
PPA and FFA must provide a maximal matching in
less than 51.2 nanoseconds (ns) at 10 Gb/s, 12.8 ns at
40 Gb/s, and 5.12 ns at 100 Gb/s, for a cell of 64 bytes!
An alternative solution is to make multi-cell arbiters
more scalable, by designing hierarchical approaches.
In this regard, we attempt to provide new
insights on multi-cell arbitration and introduce a
hierarchical centralized arbiter. The scope of this
paper is twofold. First, we present the proposed
arbiter and discuss its hardware complexity. In par-
ticular, we demonstrate that a single-device (i.e.,
non-hierarchical) centralized iterative arbiter cannot
accommodate the signaling requirements of a large
packet switch arbitration, while a multi-device (i.e.,
hierarchical) approach efficiently solves this issue.
Second, through simulation, we show that our
approach comes close to the fully centralized one in
terms of switch delay and throughput performance.
The results obtained are encouraging since only a few
results highlighting the benefit of multi-cell arbiters
in high speed switching systems are available in the
literature [11, 13].
CIOQ packet switches support the following fea-
tures and conventions:
• Segmentation of incoming variable-size packets
into cells of 64 to 80 bytes at the switch input.
This allows internal switching with the granular-
ity of a cell and the reassembly of packets at out-
put, before they are scheduled for departure from
the switch.
• Separate (virtual) queues at the point of input
and for each output, to avoid the well-known
problem of head-of-line blocking.
• Use of crossbar or shared-memory switching fab-
ric because of its non-blocking capability and mar-
ket availability.
• Time-slotted switching matrices.
• Identical number of inputs and outputs.
• Cyclic arbitration schemes.
In this paper, we apply the conventions above
and consider a CIOQ packet switch with shared mem-
ory switching [10].
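The segmentation and virtual-queueing conventions above can be sketched in a few lines of Python (an illustrative model, not the paper's implementation; the 64-byte cell size is one of the values the paper allows):

```python
# Sketch (illustrative, not from the paper): segmenting a variable-size
# packet into fixed-size cells and enqueuing them into per-output
# virtual output queues (VOQs) at an input line card.
from collections import deque

CELL_BYTES = 64  # the paper's switches use cells of 64 to 80 bytes

def segment(packet: bytes):
    """Split a packet into fixed-size cells, padding the last one."""
    cells = []
    for off in range(0, len(packet), CELL_BYTES):
        cell = packet[off:off + CELL_BYTES]
        cells.append(cell.ljust(CELL_BYTES, b"\x00"))  # pad the final cell
    return cells

class InputLineCard:
    """One input with a virtual output queue per output, which avoids
    head-of-line blocking (a packet to a busy output never blocks others)."""
    def __init__(self, num_outputs: int):
        self.voq = [deque() for _ in range(num_outputs)]

    def enqueue(self, packet: bytes, output: int):
        self.voq[output].extend(segment(packet))

    def requests(self):
        """Per-output request sizes in cells, as submitted to arbitration."""
        return [len(q) for q in self.voq]

card = InputLineCard(num_outputs=4)
card.enqueue(b"x" * 150, output=2)  # a 150-byte packet becomes 3 cells
print(card.requests())              # [0, 0, 3, 0]
```

The `requests()` vector is exactly the per-input row of the request matrix that the arbitration schemes discussed next operate on.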
The Packet Switch Arbitration: Distributed Versus Centralized Schemes
During every cycle, the arbitration procedure
between T inputs and T outputs involves the follow-
ing steps. Each active input submits request signals to
the outputs, indicating the quantity of data destined
for each of the appropriate outputs. Each output
arbiter collects the request signals and grants the inputs with active requests according to some priority order: A grant signal is sent back
to acknowledge the amount of data that an active
input can forward to the output.
Panel 1. Abbreviations, Acronyms, and Terms
AR—Arbiter
CIOQ—Combined input-output queued
FFA—Fast fair arbiter
GMA—Global matrix arbitration
HMA—Hierarchical matrix arbitration
I/O—Input/output
IP—Internet Protocol
IPP—Interrupted Poisson Process
MA—Matrix arbitration
OA—Output arbitration
PPA—Ping pong arbiter
SBW—Signaling bandwidth
VS—Virtual switch
WFA—Wavefront arbiter
WFBVND—Wavefront Birkhoff-von Neumann decomposition
DOI: 10.1002/bltj Bell Labs Technical Journal 83
The signals exchanged between input and output arbiters can be represented as a matrix:

              ⎡ p(1,1)  p(1,2)  …  p(1,T) ⎤
  Π(T,T)  =   ⎢ p(2,1)  p(2,2)  …  p(2,T) ⎥        (1)
              ⎢    ⋮       ⋮     ⋱    ⋮   ⎥
              ⎣ p(T,1)  p(T,2)  …  p(T,T) ⎦

where i and j are the indexes of input and output line cards, respectively; p(i, j) represents the signaling information (i.e., requests or grants) exchanged between input i and output j.

As packets are discarded only at input line cards, the grant matrix should respect the following input and output capacity constraints:

  Σ_{i=1..T} p(i, j) ≤ Φ(j),  ∀ j = 1, …, T        (2)

  Σ_{j=1..T} p(i, j) ≤ Φ(i),  ∀ i = 1, …, T        (3)

where Φ(i) and Φ(j) represent the capacity of input i and output j, respectively, expressed in either bits, bytes, or cells per cycle. In other words, equation 2 implies that an output cannot receive more bandwidth than its capacity, while equation 3 implies that an input cannot send more bandwidth than its capacity. In practical implementations, inputs Φ(i) and outputs Φ(j), ∀ i, j, have the same capacity. In the remainder of this work, we follow this practice and consider Φ(i) = Φ(j) = Φ, ∀ i, j.

Packet arbitration schemes can be divided into distributed and centralized approaches (for further details, the reader may refer to [3]). Distributed arbitration schemes use single request-grant exchanges between input and output arbiters. In this paper, we will call them output arbitration (OA). Centralized arbitration, called matrix arbitration (MA), contains schemes that introduce additional devices and steps of bandwidth matching, where grants provided by outputs are used to compute new grants (also called final grants). In this work, we further divide centralized arbitration schemes into non-hierarchical global matrix arbitration (GMA) and hierarchical matrix arbitration (HMA). When centralized arbiters are used,
a single-device arbiter collects grant signals and re-
adjusts them (e.g., by delaying some of them) to new
(final) grants.
Centralized arbiters have state information from
all inputs and outputs. This feature allows them to
efficiently arbitrate traffic matrices with different sta-
tistical properties, at the expense of some additional
complexity. Fast heuristic algorithms provide a maxi-
mal matching, where the maximal solution strictly
respects input and output capacity constraints, as
defined in equation 2 and equation 3. However, sin-
gle device (i.e., non-hierarchical) centralized arbiters
face scalability issues, in terms of computing time and
signaling bandwidth. In contrast to centralized
arbiters, distributed arbiters of the type OA perform no iterations and resolve contention based on local knowl-
edge alone. As there is no coordination between
output arbiters, i.e., each output arbiter independ-
ently shares the bandwidth among competing inputs,
the switch fabric bandwidth is less efficiently used.
In the OA model, constraints on input capacity (from
equation 3) are not resolved, and inputs frequently
cut off the grants that are received. As we shall illus-
trate later, this sub-optimal output arbitration can lead
to severe degradation of switch performance: The
lower efficiency must be compensated by a signifi-
cant matrix interconnection speedup (e.g., at least
150 percent [7] to asymptotically emulate an ideal
switching system), which increases costs and power
consumption.
Motivated by the performance advantages of cen-
tralized arbiters, we intend to make them more scala-
ble by designing hierarchical schemes and this is the
subject of the next section.
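A maximal matching "strictly respects input and output capacity constraints" in the sense of equations 2 and 3; the check itself is a simple row-sum/column-sum test, sketched below for the case where all capacities equal Φ (illustrative code, not from the paper):

```python
# Sketch: verifying that a grant matrix respects the input and output
# capacity constraints of equations 2 and 3, with all capacities equal
# to phi (expressed in cells per cycle).
def respects_capacity(grants, phi):
    """grants[i][j] = cells granted from input i to output j."""
    T = len(grants)
    inputs_ok = all(sum(grants[i][j] for j in range(T)) <= phi
                    for i in range(T))                       # equation 3
    outputs_ok = all(sum(grants[i][j] for i in range(T)) <= phi
                     for j in range(T))                      # equation 2
    return inputs_ok and outputs_ok

# A 2x2 illustration with phi = 30 cells/cycle:
print(respects_capacity([[10, 20], [20, 10]], phi=30))  # True
print(respects_capacity([[30, 10], [0, 0]], phi=30))    # False: input 0 sends 40
```

In the OA model this check can fail on the input side, which is precisely why inputs must cut off grants after the fact.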
A Multi-Cell Hierarchical Arbiter

Our solution splits the arbitration routine—done
by a single-device arbiter with T inputs and T outputs—
into several subroutines and disperses the subroutines
over separate arbitration devices. This allows the
design of arbitration devices with sizes much smaller
than T and parallelization of some of the arbitration
subroutines.
The idea above requires a spatial separation of the
physical switch into several virtual switches; each of
the virtual switches will be separately arbitrated by
one of the “small size” arbiters. The spatial switch vir-
tualization further requires that the number T of
inputs and outputs can be divided into a product
of two integers. In practical implementations, T is a
power of 2 and thus it can be decomposed as a product
T = N × M, where N and M are integers, also powers of 2.
We propose a spatial switch separation based on
virtual aggregation and grouping of individual line
cards. This virtual aggregation is the result of group-
ing and aggregating M individual line cards into a
higher-capacity virtual line card or shelf. Each one of
the resulting N virtual shelves has a capacity equiva-
lent to M individual line cards. As a consequence, the
physical packet switch can be intuitively viewed as a
packet switch made of N shelves with M line cards
per shelf. Subsets of the switch that involve the
shelves will be referred to as virtual switches.
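The grouping itself is just integer division: line card i belongs to shelf i // M at position i % M (a trivial sketch of our reading of the virtualization; the paper does not prescribe an index mapping):

```python
# Sketch: mapping a line card index to its (shelf, position) pair under
# the T = N * M virtual aggregation. The specific mapping (consecutive
# cards share a shelf) is an assumption for illustration.
def shelf_of(card: int, M: int):
    return divmod(card, M)  # (shelf index, position within shelf)

T, N, M = 32, 8, 4          # e.g. the 32-port switch evaluated later
assert T == N * M
print(shelf_of(0, M))   # (0, 0): first card of shelf 0
print(shelf_of(13, M))  # (3, 1): second card of shelf 3
```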
In Figure 1, we illustrate the possible separation
of a physical switch into three classes of virtual
switches. Each class contains one or several
virtual switches with identical size and capacity. Note
that we use the respective terms “virtual input” and
“input shelf,” and “virtual output” and “output shelf,”
interchangeably. Each input and output shelf therefore
has a capacity equivalent to M individual line cards.
The first class contains the virtual switch VS1, as
depicted in Figure 1b. It corresponds to an N × N
switch, with N input shelves and N output shelves.
Figure 1. A generic physical packet switch (a) and a possible separation of switch line cards into input and output shelves and groups of M individual line cards (b, c, d, e).
Each input and output, respectively, has a capacity of M × Φ cells/cycle (recall that each individual input and output has a capacity of Φ cells/cycle). The second class contains N virtual switches VS2(a), a = 1, …, N, as illustrated in Figure 2. A virtual switch VS2(a) corresponds to an M × N switch, with M individual inputs and N output shelves; each input and output has a capacity of Φ and M × Φ cells/cycle, respectively. The second class further contains N virtual switches VS2*(b), b = 1, …, N, as illustrated in Figure 3. A virtual switch VS2*(b) corresponds to an N × M switch, with N input shelves and M individual outputs; each input and output has a capacity of M × Φ and Φ cells/cycle, respectively. Finally, the third class contains N × N virtual switches VS3(a,b), as depicted in Figure 4. A virtual switch VS3(a,b) corresponds to an M × M switch, with M individual inputs and M individual outputs; each input and output, respectively, has a capacity of Φ cells/cycle.
Separation of the T-input/T-output physical
packet switch, as presented above, permits the design
of a centralized arbitration scheme in a three-layer
hierarchical arrangement. The arbiter architecture is
depicted in Figure 5. The arbiter AR1 arbitrates the
virtual switch VS1; each of the arbiters AR2(a) and
AR2*(b) arbitrates its corresponding virtual switch
VS2(a) and VS2*(b), respectively, and each of the
arbiters AR3(a,b) arbitrates its corresponding virtual
switch VS3(a,b).
The arbitration cycle is divided into three sub-
cycles, corresponding to the three classes of arbiters; a
sub-cycle length corresponds to the processing time
required by the arbiters belonging to its associated
class. During each arbitration cycle, the inputs to each
“small size” arbiter are the grant outputs from the pre-
ceding layer of arbiters.
The hierarchical arbiter functions as follows:
During the first sub-cycle, AR1 arbitrates output con-
tention in VS1; it provides an aggregated bandwidth
matching between all virtual inputs and outputs. The
outputs (i.e., grants) of AR1 are used as capacity con-
straints (see equations 2 and 3) by the next level
arbiters, i.e., AR2(a) and AR2*(b). During the second
sub-cycle, each element AR2(a) arbitrates output con-
tention in associated VS2(a); it matches aggregated
bandwidth between M individual inputs and N output
shelves. At the same time, each element AR2*(b) arbi-
trates output contention in its associated VS2*(b); it
matches aggregated bandwidth between N input
shelves and M individual outputs. The outputs of
AR2(a) and AR2*(b), respectively, are used as capacity
constraints by the last level arbiters. Finally, during
the third sub-cycle, each item AR3(a,b) arbitrates out-
put contention in its associated VS3(a,b); it matches
bandwidth between M individual inputs and M indi-
vidual outputs. The outputs of all AR3(a,b) are sent to
line card arbiters. They represent the final line card-
to-line card grants, which satisfy the constraint rela-
tionship established by equation 2 and equation 3.
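The aggregation feeding AR1 can be sketched as summing M × M blocks of the line-card request matrix (illustrative code under the consecutive-grouping assumption; the values below are the request matrix of the worked example later in the paper):

```python
# Sketch: collapsing a (N*M) x (N*M) line-card request matrix into the
# N x N shelf-to-shelf request matrix used by the top-level arbiter AR1.
def aggregate_requests(R, N, M):
    return [[sum(R[a * M + ii][b * M + jj]
                 for ii in range(M) for jj in range(M))
             for b in range(N)]
            for a in range(N)]

# The 4x4 request matrix of the worked example (N = M = 2):
R = [[10, 10, 10, 10],
     [20, 20, 20, 20],
     [30, 30, 30, 30],
     [40, 40, 40, 40]]
print(aggregate_requests(R, N=2, M=2))  # [[60, 60], [140, 140]]
```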
At the end of the arbitration cycle, feedback is
required between downstream (i.e., layer 3) and
upstream partners (i.e., layer 1 and layer 2). This feed-
back permits upstream arbiters to determine the dif-
ference between the bandwidth granted at the first
and second layers and that not yet granted at the third
layer. This difference has to be considered over the
next arbitration cycles. Consequently, all AR3(a,b) ele-
ments simultaneously send feedback signals, e.g.,
grants, to all their upstream partners.
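The feedback step amounts to simple deficit bookkeeping, sketched below (our illustration; the paper gives no pseudocode for this step):

```python
# Sketch: bandwidth granted by an upstream arbiter for one matrix entry
# but not yet granted by the layer-3 arbiters below it; the upstream
# arbiter carries this difference into the next arbitration cycles.
def feedback_deficit(upstream_grant, final_grants):
    return upstream_grant - sum(final_grants)

# Hypothetical numbers: AR2(a) granted 25 cells for one (input, shelf)
# pair, but the AR3 devices only placed 21 of them this cycle:
print(feedback_deficit(25, [10, 11]))  # 4 cells deferred to later cycles
```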
An Example of Hierarchical Arbitration

We assume a packet switch, with T = 4 inputs and outputs, and N = M = 2. We consider that each individual input and output line card has a capacity of Φ = 30 cells/cycle. As a result, each virtual input and output line card has a capacity of M × 30 = 60 cells/cycle. Let us consider the following request matrix between line cards denoted i and j:

              ⎡ 10 10 10 10 ⎤
  R(i, j)  =  ⎢ 20 20 20 20 ⎥        (4)
              ⎢ 30 30 30 30 ⎥
              ⎣ 40 40 40 40 ⎦

First, AR1 matches aggregated bandwidth for a 2 × 2 switch, with two virtual inputs and outputs. The aggregated request matrix R_AR1(a, b), used to provide bandwidth matching between virtual inputs a and virtual outputs b, is as follows:

  R_AR1(a, b)  =  ⎡  60  60 ⎤        (5)
                  ⎣ 140 140 ⎦

AR1 executes a wavefront arbitration algorithm on R_AR1(a, b) and provides a grant matrix G_AR1(a, b),
Figure 2. The separation of a T-input/T-output packet switch into N virtual switches VS2(a); each virtual switch VS2(a), a = 1, …, N, has M individual inputs and N output shelves.
Figure 3. The separation of a T-input/T-output packet switch into N virtual switches VS2*(b); each virtual switch VS2*(b), b = 1, …, N, has N input shelves and M individual outputs.
Figure 4. The decomposition of a T-input/T-output packet switch into N × N virtual switches VS3(a, b); each virtual switch VS3(a, b), a, b = 1, …, N, has M individual inputs and M individual outputs.
which respects the input and output capacity constraints, as described in equation 2 and equation 3:

  Σ_{a=1..N} G_AR1(a, b) ≤ M × Φ = 60 cells/cycle,  ∀ b = 1, …, N        (6)

  Σ_{b=1..N} G_AR1(a, b) ≤ M × Φ = 60 cells/cycle,  ∀ a = 1, …, N        (7)

Second, each AR2*(b) matches aggregated bandwidth for a 2 × 2 switch, with two virtual inputs a and two individual outputs jj belonging to b. The aggregated request matrices R_AR2*(1)(a, jj) and R_AR2*(2)(a, jj), used to compute the bandwidth matching, are as follows:

  R_AR2*(1)(a, jj)  =  ⎡ 30 30 ⎤    R_AR2*(2)(a, jj)  =  ⎡ 30 30 ⎤        (8)
                       ⎣ 70 70 ⎦                         ⎣ 70 70 ⎦

Each AR2*(b) in parallel executes a wavefront arbitration algorithm on its corresponding R_AR2*(b)(a, jj) and provides a grant matrix G_AR2*(b)(a, jj), which respects the following input and output capacity constraints, as described in equation 2 and equation 3:

  Σ_{jj=1..M} G_AR2*(b)(a, jj) ≤ G_AR1(a, b),  ∀ a, b = 1, …, N        (9)

  Σ_{a=1..N} G_AR2*(b)(a, jj) ≤ Φ = 30 cells/cycle,  ∀ b = 1, …, N, jj = 1, …, M        (10)

At the same time, each AR2(a) matches aggregated bandwidth for a 2 × 2 switch, with two individual inputs ii belonging to a and two virtual outputs b. The aggregated request matrices R_AR2(1)(ii, b) and R_AR2(2)(ii, b), used to compute the bandwidth matching, are as follows:

  R_AR2(1)(ii, b)  =  ⎡ 20 20 ⎤    R_AR2(2)(ii, b)  =  ⎡ 60 60 ⎤        (11)
                      ⎣ 40 40 ⎦                        ⎣ 80 80 ⎦

Each AR2(a) in parallel executes a wavefront arbitration algorithm on its corresponding R_AR2(a)(ii, b) and provides a grant matrix G_AR2(a)(ii, b), which respects the following input and output capacity constraints, as described in equation 2 and equation 3:

  Σ_{ii=1..M} G_AR2(a)(ii, b) ≤ G_AR1(a, b),  ∀ a, b = 1, …, N        (12)

  Σ_{b=1..N} G_AR2(a)(ii, b) ≤ Φ = 30 cells/cycle,  ∀ a = 1, …, N, ii = 1, …, M        (13)

Each AR3(a,b) matches the bandwidth of a 2 × 2 switch, with two individual inputs and outputs. The request matrices R_AR3(1,1)(ii, jj), R_AR3(1,2)(ii, jj), R_AR3(2,1)(ii, jj), and R_AR3(2,2)(ii, jj), used to compute the bandwidth matching, are as follows:
Figure 5. Logical blocks of the hierarchical arbiter; for the sake of clarity, we do not plot all signals exchanged between (input, output, and “small size” hierarchical) arbiters.
  R_AR3(1,1)(ii, jj)  =  ⎡ 10 10 ⎤    R_AR3(1,2)(ii, jj)  =  ⎡ 10 10 ⎤
                         ⎣ 20 20 ⎦                           ⎣ 20 20 ⎦        (14)
  R_AR3(2,1)(ii, jj)  =  ⎡ 30 30 ⎤    R_AR3(2,2)(ii, jj)  =  ⎡ 30 30 ⎤
                         ⎣ 40 40 ⎦                           ⎣ 40 40 ⎦

Each AR3(a,b) in parallel executes the wavefront arbitration algorithm on its corresponding R_AR3(a,b)(ii, jj) and provides a grant matrix G_AR3(a,b)(ii, jj), which respects the following input and output capacity constraints (as described in equation 2 and equation 3):

  Σ_{jj=1..M} G_AR3(a,b)(ii, jj) ≤ G_AR2(a)(ii, b),  ∀ a, b = 1, …, N, ii = 1, …, M        (15)

  Σ_{ii=1..M} G_AR3(a,b)(ii, jj) ≤ G_AR2*(b)(a, jj),  ∀ a, b = 1, …, N, jj = 1, …, M        (16)
Complexity Analysis

The complexity of the centralized and the hierar-
chical scheme can be compared using the same algo-
rithm used for the basic matrix matching. We find
that the hierarchical arbiter is considerably faster and
exchanges much less signaling traffic with the line
cards. However, the higher degree of parallel process-
ing implies a multiplication of (distributed) electronic
circuits.
Arbitration Time and Silicon Requirements

A typical (non-hierarchical) centralized arbiter
(GMA) uses a generalization of the wavefront arbiter
(WFA) [14] from bit to integer-number matching, or
a more general wavefront Birkhoff-von Neumann
decomposition (WFBVND) [12], to match T inputs to
T outputs. WFA needs T sub-circuits which run T
times per pass to allocate a (T,T) matrix by diagonals.
The number of bits in the registers is log2(Φ) and the number of iterations per diagonal is log2(Φ) (recall that Φ represents the input/output capacity expressed in cells/cycle). As a consequence, a single-device centralized arbiter scales as O(T·log2(Φ)) ≈ O(T) both in
silicon and runtime. We expect that this reference
model, i.e., GMA, comes closest to optimal matching because it can evaluate all the input and output constraints together at the lowest granularity.
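The diagonal-by-diagonal idea behind WFA can be sketched as follows (an illustrative reconstruction, not the WFA/WFBVND circuits of [12, 14]): entries on the same anti-diagonal touch distinct inputs and outputs, so one wave can be granted in parallel, each entry taking the largest grant that keeps its row and column within capacity.

```python
# Sketch: greedy integer matching by anti-diagonals, in the spirit of a
# wavefront arbiter generalized from bits to integer cell counts.
def wavefront_match(R, cap_in, cap_out):
    """R[i][j] = requested cells; cap_in/cap_out = per-port capacities."""
    T = len(R)
    G = [[0] * T for _ in range(T)]
    rem_in, rem_out = list(cap_in), list(cap_out)
    for wave in range(2 * T - 1):          # anti-diagonals i + j = wave
        for i in range(T):                 # entries of one wave are
            j = wave - i                   # row- and column-disjoint
            if 0 <= j < T:
                g = min(R[i][j], rem_in[i], rem_out[j])
                G[i][j] = g
                rem_in[i] -= g
                rem_out[j] -= g
    return G

# The aggregated 2x2 request matrix of the worked example, M * phi = 60:
G = wavefront_match([[60, 60], [140, 140]], [60, 60], [60, 60])
print(G)  # [[60, 0], [0, 60]]
```

In hardware the inner loop is the part done in parallel, which is where the O(T) wave count (rather than O(T²)) comes from.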
In the case of HMA, the proposed hierarchical scheme, small size arbiters use wavefront implementations. In this regard, AR1 and each AR3(a,b) match square matrices (i.e., (N,N) and (M,M)) with a latency of O(N·log2(Φ)) and O(M·log2(Φ)), respectively. Each AR2(a) and AR2*(b) matches rectangular matrices with a latency of O(max(N,M)·log2(Φ)) [14]. As both scenarios N > M and M > N lead to similar complexity, we only discuss the scenario N > M. In this context, the total computation latency of our arbiter is O(N·log2(Φ) + N·log2(Φ) + M·log2(Φ)) ≈ O(2·N + M). This arbitration time is much smaller than O(T) (recall that T = N·M).

The silicon required by the hierarchical approach scales as O(N·log2(Φ) + 2·N²·log2(Φ) + N²·M·log2(Φ)) ≈ O(N²·M), bigger than O(T) by the factor N. However, AR2(a), AR2*(b), and AR3(a,b) can be integrated into line cards and switching fabric, and thus small size devices can be used.
Signaling Bandwidth Requirements

As each of T inputs can send requests to each of T outputs (and receive grants from each of T outputs, respectively), the half-duplex signaling bandwidth (SBW) required by GMA scales as O(T²·log2(Φ)) ≈ O(T²), as depicted in Table I, where Y is the number of arbitration cycles per second.

In Table II, we present numerical results for terabit capacities. We consider cells of 64 bytes, line cards at 10 Gb/s, and an arbitration cycle of approximately 10 μs. As the input/output (I/O) half-duplex bandwidth of existing chips is typically a few tens of gigabits per second, the results presented here clearly show that a non-hierarchical centralized arbiter (GMA) is not scalable with T and cannot easily be implemented into a monolithic chip for large values of T.
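Plugging in the stated parameters reproduces the order of magnitude of Table II (a back-of-the-envelope sketch; the ~10 μs cycle, the per-entry field of ⌈log2(Φ)⌉ bits, and the rounding are our assumptions):

```python
# Sketch: numerical estimate of the GMA half-duplex signaling bandwidth
# SBW = T^2 * log2(phi) * Y, with phi derived from 64-byte cells on
# 10 Gb/s line cards over an assumed ~10 microsecond arbitration cycle.
import math

LINE_RATE = 10e9          # bits/s per line card
CELL_BITS = 64 * 8        # 64-byte cells
CYCLE = 10e-6             # assumed arbitration cycle duration (s)
Y = 1 / CYCLE             # arbitration cycles per second

phi = LINE_RATE * CYCLE / CELL_BITS        # ~195 cells/cycle per port
bits_per_entry = math.ceil(math.log2(phi)) # request/grant field width

def gma_sbw(T):
    """One field per (input, output) pair, every cycle."""
    return T * T * bits_per_entry * Y

print(round(gma_sbw(256) / 1e9, 1), "Gb/s")  # ~50 Gb/s for 2 Tb/s (T = 256)
```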
As for HMA, each of the small size arbiters within
the hierarchy arbitrates between small numbers (M or
N) of (individual or virtual) inputs and outputs,
instead of the large number T. This feature allows a
significant reduction in the signaling bandwidth per
arbitration device, as was illustrated in Table I. In
Table III, we present numerical results for signaling
bandwidth required by AR1, AR2(a), AR2*(b) and
AR3(a,b), respectively. For example, the half-duplex
signaling bandwidth required by each AR3(a,b) is
approximately 0.50 Gb/s, for a capacity of 2 Tb/s, and
approximately 2.6 Gb/s, for a capacity of 10 Tb/s.
We notice that different numerical choices for N and M will yield different values for the required signaling bandwidth. However, in all sce-
narios, the amount of signaling bandwidth required
for small size arbiters is significantly smaller than the
I/O bandwidth of existing chips.
Performance Evaluation

This section presents the performance evaluation
of three arbiters, each one using a distinct arbitration
scheme: HMA, GMA, or OA. GMA and each small
size arbiter within the hierarchical scheme use a WFB-
VND algorithm [12], which performs a parallelized
round robin–like selection. For OA, we consider
round robin selection, first at each output to serve the
input requests, then at each input to satisfy the output
grants. The performance evaluations are produced
through computer simulation for a 32 � 32 switch
with line cards running at 10 Gb/s and an arbitration
cycle of 10 μs. Each input has a shared buffer organ-
ized in virtual output queues, which can accommo-
date the amount of data received at full line card
speed during 1,000 arbitration cycles. Outputs have
small reassembly buffers and buffer overflow only
occurs at input line cards. Simulations ran for
1,000,000 cycles and 95 percent confidence intervals
have been computed.
Although long-range dependent and highly variable traffic has an adverse effect on switch perfor-
mance, it has been shown that its effect is not as
significant as the packet destination pattern [9]. In
this regard, we consider benchmark traffic models [8]
and generate traffic flows according to an Interrupted
Poisson Process (IPP), with ON and OFF periods
Table I. Signaling bandwidth: Hierarchical versus non-hierarchical arbiter.

                   Signaling bandwidth (bps)
Non-hierarchical   SBW = T²·log2(Φ)·Y;  T = N·M
AR1                SBW = N²·log2(Φ·M²)·Y
AR2(a)             SBW = T·log2(Φ·M)·Y + N²·log2(Φ·M²)·Y;  T = N·M
AR2*(b)            SBW = T·log2(Φ·M)·Y + N²·log2(Φ·M²)·Y;  T = N·M
AR3(a,b)           SBW = M²·log2(Φ)·Y + 2·T·log2(Φ·M)·Y;  T = N·M

AR—Arbiter
SBW—Signaling bandwidth

Table II. A numerical example of the half-duplex signaling bandwidth required by GMA, a non-hierarchical arbiter.

Capacity      2 Tb/s (T = 256;     10 Tb/s (T = 1024;
              M = 4; N = 64)       M = 16; N = 64)
SBW           ≈ 50 Gb/s            ≈ 760 Gb/s

GMA—Global matrix arbitration
SBW—Signaling bandwidth

Table III. A numerical example of the half-duplex signaling bandwidth required by small size arbiters.

Capacity      2 Tb/s (T = 256;     10 Tb/s (T = 1024;
              M = 4; N = 64)       M = 16; N = 64)
AR1           ≈ 5 Gb/s             ≈ 7 Gb/s
AR2(a)        ≈ 5.3 Gb/s           ≈ 8.2 Gb/s
AR2*(b)       ≈ 5.3 Gb/s           ≈ 8.2 Gb/s
AR3(a,b)      ≈ 0.5 Gb/s           ≈ 2.6 Gb/s

AR—Arbiter
exponentially distributed. The peak-to-mean ratio of
IPP is 5 and the average length of the ON period is
ten cycles. The size of incoming packets follows the
packet size distribution observed in the Internet [6],
and the packet destination pattern is described by
Zipf’s power law [8], with parameter z; z = 0 corresponds to packet destinations uniformly distributed among the outputs. For z > 0, each input has a different hot output destination, with a fraction of packets from an input destined to its hot output and the remaining packets randomly destined to other outputs. A parameter z = 5 represents quasi-directional traffic flows, where each input has traffic destined to a single output.
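One way to realize such a destination pattern (our interpretation; the benchmark of [8] may differ in detail) is for each input to rank the outputs with its hot output first and draw destinations with probability proportional to rank^(−z), so that z = 0 is uniform and large z is quasi-directional:

```python
# Sketch: Zipf-like destination sampling per input. The choice of hot
# output (output index == input index) is an illustrative assumption.
import random

def zipf_destination(input_id, num_outputs, z, rng=random):
    """Draw a destination output for one packet from input input_id."""
    ranked = [input_id] + [o for o in range(num_outputs) if o != input_id]
    weights = [(r + 1) ** (-z) for r in range(num_outputs)]
    return rng.choices(ranked, weights=weights, k=1)[0]

random.seed(1)
draws = [zipf_destination(0, 4, z=3.0) for _ in range(1000)]
print(draws.count(0) / 1000)  # most packets hit the hot output for large z
```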
In the following figures we show simulation
results, in terms of normalized throughput and delay,
for OA, GMA, and HMA. Simulation work has been
done for several values of N and M; we observed that
for both scenarios N > M and N < M, HMA achieves identical performance. In the following, we present only the scenario N > M.
As depicted in Figure 6, a load value below 0.9
shows no measurable delay difference between OA,
GMA, and HMA under uniform distribution of packet
destination. The simulation results also show that OA
achieves the lowest throughput (i.e., the load at
which the switch saturates), as expected from its mod-
est quality of matching. The positive effect of the near-
optimal matching by GMA is also visible.
As load values below 0.9 show no significant dif-
ference between the performance achieved by OA,
GMA, and HMA, Figures 7 and 8 only show results for
an input load value of 0.95 and non-uniform traffic
patterns. The results plotted in Figure 7 indicate that
both GMA and HMA maximize throughput to nearly
100 percent for non-uniform traffic. In particular,
results show that the performance achieved by HMA
is very close to that of GMA. We observe that OA
achieves poor throughput when the matrix band-
width becomes the main bottleneck, i.e., z ∈ (0, 3).
This is due to the non-cooperative way of sharing the
bandwidth, as was qualitatively explained in the sec-
tion titled “Packet Switch Arbitration.”
The present analysis reveals that the intercon-
nection of several wavefront arbiters presents a
slightly adverse effect on packet switch performance.
In the case of delay, as depicted in Figure 8, we
observe a slight degradation of the performance of
HMA compared to GMA, when inputs receive traffic
composed of a dominant amount of directional flows,
i.e., z ∈ (2, 4). We suspect this is a fairness problem,
which takes place as a consequence of arbitrating
bandwidth from virtual to individual inputs/outputs,
as well as of cascading round robin–like (wavefront)
arbiters. Here, the tendency is for the hierarchical
arbiter to favor the heaviest flows.
Conclusion

In this paper, we introduced a fast multi-cell cen-
tralized arbiter for contention resolution in combined
input-output queued packet switches with shared
memory switching. Our arbiter is hierarchically struc-
tured, consisting of a three-layer hierarchy and mul-
tiple “small size” arbiters at each layer.
The results obtained in this work provide several
useful insights. First, the analysis showed that a non-
hierarchical centralized arbiter is not easily scalable: In
Figure 6. Average packet delay as a function of offered load for OA, GMA, and HMA; incoming traffic is uniformly distributed among outputs (i.e., z = 0); N = 8, M = 4.
Figure 7. Normalized throughput under unbalanced traffic for OA, GMA, and HMA; N = 8, M = 4; offered load 0.95.
Figure 8. Average packet delay under unbalanced traffic for OA, GMA, and HMA; N = 8, M = 4; offered load 0.95.
terabit per second packet switches, the arbitration sig-
naling bandwidth exceeds the I/O bandwidth of exist-
ing chips. That is, at multi-terabit switching capacities,
a centralized arbiter cannot be implemented as a monolithic hardware device, unlike the hierarchical schemes. Second, we showed that the pro-
posed hierarchical arbiter provides significant timing
and signaling bandwidth relaxation at the expense of
an increase in parallel processing on distributed and
small size arbiters. Third, the performance evaluation
showed that our hierarchical arbiter does not degrade
the switch delay and throughput performance com-
pared to that of a fully centralized single-device
arbiter. In addition, results indicated that the per-
formance of both single-device and multi-device
arbiters significantly exceeds the one achieved by a
basic distributed arbiter. Finally, the performance
analysis revealed a possible fairness issue with the
hierarchical arbiter, with round robin–like selection
under non-uniform traffic patterns: The tendency is
for the arbiter to favor the heaviest flows.
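To make the scalability point concrete, a back-of-envelope calculation shows how the request/grant signaling into a single centralized arbiter grows quadratically with the port count. The numbers below (port count, bits per virtual output queue report, cell time) are illustrative assumptions of ours, not figures from the paper.

```python
def signaling_rate_bps(ports, bits_per_vq, cell_time_ns):
    """Aggregate signaling rate into a centralized arbiter that must import
    the state of all ports*ports virtual output queues every cell time."""
    bits_per_cycle = ports * ports * bits_per_vq  # N*N VOQ occupancy reports
    return bits_per_cycle / (cell_time_ns * 1e-9)

# Example (assumed values): 256 ports, 64-byte cells at 10 Gb/s per port
# (51.2 ns cell time), 4 bits of occupancy report per VOQ.
rate = signaling_rate_bps(ports=256, bits_per_vq=4, cell_time_ns=51.2)
print(f"{rate / 1e9:.0f} Gb/s of signaling")  # → 5120 Gb/s
```

Doubling the port count quadruples this rate, which is why a flat centralized arbiter quickly outgrows the I/O bandwidth of a single chip while the hierarchical scheme only exchanges aggregated per-group state between levels.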
We are currently analyzing the fairness issue via comparison to a hierarchical arbiter with proportional selection. Preliminary simulation results revealed that the fairness degradation is a consequence of cascading arbiters with round robin–like (i.e., "max-min fair") selection: the fairness issue is significantly alleviated when cascading arbiters with proportional selection.
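The cascading effect can be reproduced in a toy model (a sketch under our own simplifying assumptions, not the authors' simulator): one group carries a single heavy aggregate while another carries several light flows, all permanently backlogged. Round robin–like selection across groups gives each group an equal share, so the lone heavy flow receives far more than its per-flow fair share; weighting the group choice by its number of flows restores roughly equal per-flow shares.

```python
import random

def simulate(rounds, group_sizes, proportional):
    """Toy cascade of arbiters: first pick a group, then round-robin among
    that group's backlogged flows. Group selection is either round
    robin-like ("max-min fair" across groups) or proportional to the
    number of flows each group carries."""
    grants = [[0] * n for n in group_sizes]
    flow_ptrs = [0] * len(group_sizes)
    group_ptr = 0
    for _ in range(rounds):
        if proportional:
            g = random.choices(range(len(group_sizes)), weights=group_sizes)[0]
        else:
            g = group_ptr
            group_ptr = (group_ptr + 1) % len(group_sizes)
        f = flow_ptrs[g]                       # round-robin inside the group
        flow_ptrs[g] = (f + 1) % group_sizes[g]
        grants[g][f] += 1
    return grants

# One heavy aggregate alone in its group vs. seven light flows in another:
# max-min selection across groups gives the lone flow half of all grants.
rr = simulate(14000, [1, 7], proportional=False)
print(rr[0][0], rr[1][0])  # → 7000 1000
```

With `proportional=True`, each of the eight flows converges to roughly 1/8 of the grants, consistent with the alleviation observed when cascading arbiters with proportional selection.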
References

[1] T. E. Anderson, S. S. Owicki, J. B. Saxe, and C. P. Thacker, "High-Speed Switch Scheduling for Local-Area Networks," ACM Trans. Comput. Syst., 11:4 (1993), 319–352.
[2] M. Andrews and L. Zhang, "Scheduling Protocols for Switches With Large Envelopes," ACM J. Scheduling, 7:3 (2004), 171–186.
[3] H. J. Chao, "Next Generation Routers," Proc. IEEE, 90:9 (2002), 1518–1558.
[4] H. J. Chao, C. H. Lam, and X. Guo, "A Fast Arbitration Scheme for Terabit Packet Switches," Proc. IEEE Global Telecommun. Conf. (GLOBECOM '99) (Rio de Janeiro, Braz., 1999), vol. 2, pp. 1236–1243.
[5] F. M. Chiussi and A. Francini, "Scalable Electronic Packet Switches," IEEE J. Select. Areas Commun., 21:4 (2003), 486–500.
[6] Cooperative Association for Internet Data Analysis (CAIDA), http://www.caida.org/research/.
[7] A. S. Diwan, R. Guérin, and K. N. Sivarajan, "Performance Analysis of Speeded-Up High-Speed Packet Switches," J. High Speed Networks, 10:3 (2001), 161–186.
[8] I. Elhanany, D. Chiou, V. Tabatabaee, R. Noro, and A. Poursepanj, "The Network Processing Forum Switch Fabric Benchmark Specifications: An Overview," IEEE Network, 19:2 (2005), 5–9.
[9] S. Fong and S. Singh, "Performance Evaluation of Shared-Buffer ATM Switches Under Self-Similar Traffic," Proc. IEEE Internat. Perform., Comput., and Commun. Conf. (IPCCC '97) (Phoenix, AZ, 1997), pp. 252–258.
[10] J. Y. Hui, Switching and Traffic Theory for Integrated Broadband Networks, Kluwer Academic Publishers, Boston, MA, 1989.
[11] H. Kim, C. Oh, and K. Kim, "A High-Speed ATM Switch Architecture Using Random Access Input Buffers and Multi-Cell-Time Arbitration," Proc. IEEE Global Telecommun. Conf. (GLOBECOM '97) (Phoenix, AZ, 1997), vol. 1, pp. 536–540.
[12] J. Li and N. Ansari, "Enhanced Birkhoff-von Neumann Decomposition Algorithm for Input Queued Switches," IEE Proc. Commun., 148:6 (2001), 339–342.
[13] R. Rojas-Cessa and E. Oki, "Round-Robin Selection With Adaptable-Size Frame in a Combined Input-Crosspoint Buffered Switch," IEEE Commun. Lett., 7:11 (2003), 555–557.
[14] Y. Tamir and H.-C. Chi, "Symmetric Crossbar Arbiters for VLSI Communication Switches," IEEE Trans. Parallel Distrib. Syst., 4:1 (1993), 13–27.
[15] F. Wang and M. Hamdi, "Fast Fair Arbiter Design in Packet Switches," Proc. IEEE High Perform. Switching and Routing Workshop (HPSR '05) (Hong Kong, Ch., 2005), pp. 472–476.
(Manuscript approved March 2009)
DANIEL POPA was a research engineer in the Semantic and Autonomic Technologies department at Alcatel-Lucent Bell Labs France when this paper was written. He is now a senior research engineer in the Product Development department of ITRON. He received an engineering degree in electrical and telecommunication engineering from Polytechnic Institute, Bucharest, Romania, and a Ph.D. degree in computer science from Telecom-SudParis, France. During his tenure at Bell Labs France, he focused on work with scheduling mechanisms for high-speed switching systems. His experience includes extensive work on optical networking with a focus on traffic and QoS control, medium access control protocols, and performance evaluation. Dr. Popa also held a one-year post-doctoral position at Telecom-SudParis, where he participated in the design of a hybrid handover architecture deployed by the French National Railway in high-speed trains. His current research interests include traffic management, routing protocols, and medium access protocols for wireless networks and power line communications. He holds six patents, co-authored a book chapter on optical networking, and has published 22 technical papers in international conferences and journals.
GEORG POST is a research engineer in the Semantic and Autonomic Technologies department of the Networking and Networks research domain within Alcatel-Lucent Bell Labs in France. He received a doctoral degree in solid-state physics from Pierre et Marie Curie University, Paris. Dr. Post conducted research on III-V semiconductor device technologies at France Telecom CNET laboratories, in the Opto+ Alcatel/France Telecom joint venture, and later at Alcatel. He joined the network and system team to work on the "byte switch" research concept with its packet/time division multiplexing (TDM) agnostic fabric, which led to the Alcatel-Lucent 1850 Transport Service Switch. His current research interests include content-aware and flow-aware features of packet switches for the future of the Internet.
LUDOVIC NOIRIE is a research manager in the Semantic and Autonomic Technologies department of the Networking and Networks research domain within Alcatel-Lucent Bell Labs in France. After receiving diplomas in engineering from both Ecole Polytechnique and Telecom ParisTech, France, he joined Alcatel to work as a researcher on optical node and network architectures. He pioneered the optical multi-granularity concept that mixes wavelengths, wavebands, and fibers in the same network. His definition of the "byte switch" concept with its packet/time division multiplexing (TDM) agnostic fabric led to the Alcatel-Lucent 1850 Transport Service Switch (TSS), and he received the Bell Labs President's Award in 2006 as a member of the 1850 TSS project team. He is now leading the Tera-Scale Semantic Networking team within the Semantic and Autonomic Technologies department, which investigates new solutions for the future of the Internet. He is co-leader of the semantic networking activity of the joint research lab between Alcatel-Lucent Bell Labs and INRIA. He is a member of the Alcatel-Lucent Technical Academy. He has authored 40 publications and is the inventor of 30 patents. ◆