


A Fast Hierarchical Arbitration Scheme for Multi-Tb/s Packet Switches with Shared Memory Switching
Daniel Popa, Georg Post, and Ludovic Noirie

One challenge in multi-terabit per second packet switches is the design of low-latency and high-performance arbitration schemes. In this regard, we propose a hierarchical multi-cell arbiter, which interconnects multiple parallel processing devices. By comparison with iterative non-hierarchical multi-cell arbiters, we show that our scheme significantly decreases the signaling overhead and arbitration processing time. Performance evaluation results show the proposed solution maximizes the switch throughput and delay performance. © 2009 Alcatel-Lucent.

Bell Labs Technical Journal 14(2), 81–96 (2009) © 2009 Alcatel-Lucent. Published by Wiley Periodicals, Inc. Published online in Wiley InterScience (www.interscience.wiley.com) • DOI: 10.1002/bltj.20374

Introduction

Combined input-output queued (CIOQ) architectures have been widely considered as a feasible solution for high-speed packet switches and Internet Protocol (IP) routers [3, 5]. CIOQ switches are attractive because they achieve high throughput under admissible traffic patterns using simple arbitration schemes. However, although CIOQ architectures deal efficiently with arbitration time constraints, since input and output port selection are independently performed, the arbitration speed issue still persists in Tb/s systems.

Packet switch arbitration is an application of bipartite graph matching, which attempts to find a maximal input-to-output bandwidth matching [1]. Conventional fabric arbiters use single-cell arbitration [3]. During each arbitration cycle, a single-cell arbiter matches one cell between any pair of input and output. In Tb/s packet switches, single-cell arbiters face the challenge of computing a maximal bandwidth matching within one cell duration and of signaling the results to all the inputs.

The arbitration time issue can be simplified by dividing it into two levels: 1) breaking the upper limit of one cell duration on computation time, and 2) reducing the linear dependence O(T) of the computation time of a maximal matching on the switch size (T).

The goal of the first level can be achieved by using either envelope or multi-cell arbiters. In envelope arbitration [2], instead of segmenting variable-size packets into small fixed-size cells, packets are aggregated into large fixed-size envelopes. During each arbitration cycle, an envelope arbiter matches a single envelope between any input and output. Although an envelope arbiter can considerably relax the constraint on arbitration time, since the arbitration cycle is equal to the envelope time length, it has two major drawbacks that make it unsuitable for practical implementations. First, it saves no time because packets have to wait at the "head of the line" for a filled envelope before being considered for arbitration. Second, some bandwidth is wasted because of the mismatch between the variable size of packets and the fixed size of the envelope. In multi-cell arbitration [11, 13], during each arbitration cycle, the arbiter matches a group of cells between any input and output. As a


consequence, an arbitration cycle can be several orders of magnitude longer than the cell duration. This feature considerably relaxes the hard timing constraint on arbitration.

The goal of the second level has been somewhat achieved by designing fast single-cell arbiters, such as the ping pong arbiter (PPA) [4] and the fast fair arbiter (FFA) [15]. However, although FFA and PPA decrease arbitration time from O(T) to O(log2(T)), and thus keep computation time below cell duration, their design is still challenging for line card rates greater than 10 Gb/s. As an example of the stringent timing, PPA and FFA must provide a maximal matching in less than 51.2 nanoseconds (ns) at 10 Gb/s, 12.8 ns at 40 Gb/s, and 5.12 ns at 100 Gb/s, for a cell of 64 bytes. An alternative solution is to make multi-cell arbiters more scalable, by designing hierarchical approaches.

In this regard, we attempt to provide new insights on multi-cell arbitration and introduce a hierarchical centralized arbiter. The scope of this paper is twofold. First, we present the proposed arbiter and discuss its hardware complexity. In particular, we demonstrate that a single-device (i.e., non-hierarchical) centralized iterative arbiter cannot accommodate the signaling requirements of a large packet switch arbitration, while a multi-device (i.e., hierarchical) approach efficiently solves this issue. Second, through simulation, we show that our approach comes close to the fully centralized one in terms of switch delay and throughput performance. The results obtained are encouraging, since only a few results highlighting the benefit of multi-cell arbiters in high-speed switching systems are available in the literature [11, 13].
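The deadlines quoted above follow from plain arithmetic: a 64-byte cell is 512 bits, and the cell duration is those 512 bits divided by the line rate. A quick check of the numbers:

```python
# Cell duration = cell size in bits / line rate: this is the deadline a
# single-cell arbiter must meet to produce a maximal matching.
CELL_BITS = 64 * 8  # 512 bits per 64-byte cell

def cell_duration_ns(line_rate_bps: float) -> float:
    """Time to transmit one cell at the given line rate, in nanoseconds."""
    return CELL_BITS / line_rate_bps * 1e9

for rate in (10e9, 40e9, 100e9):
    print(f"{rate / 1e9:.0f} Gb/s -> {cell_duration_ns(rate):.2f} ns")
# 51.20 ns at 10 Gb/s, 12.80 ns at 40 Gb/s, 5.12 ns at 100 Gb/s
```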

CIOQ packet switches support the following features and conventions:

• Segmentation of incoming variable-size packets into cells of 64 to 80 bytes at the switch input. This allows internal switching with the granularity of a cell and the reassembly of packets at output, before they are scheduled for departure from the switch.
• Separate (virtual) queues at the point of input and for each output, to avoid the well-known problem of head-of-line blocking.
• Use of a crossbar or shared-memory switching fabric, because of its non-blocking capability and market availability.
• Time-slotted switching matrices.
• Identical number of inputs and outputs.
• Cyclic arbitration schemes.

In this paper, we apply the conventions above and consider a CIOQ packet switch with shared memory switching [10].
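The segmentation convention in the first bullet can be sketched as follows; this is an illustrative fragment only (the 64-byte cell size matches the lower bound above, and the zero-padding and length-tracking details are assumptions of the sketch, not taken from the paper):

```python
def segment(packet: bytes, cell_size: int = 64) -> list[bytes]:
    """Split a variable-size packet into fixed-size cells, zero-padding the
    last cell. In a real switch the original length would travel in a cell
    header so the output can strip the padding; here we pass it explicitly."""
    cells = [packet[i:i + cell_size] for i in range(0, len(packet), cell_size)]
    if cells and len(cells[-1]) < cell_size:
        cells[-1] = cells[-1].ljust(cell_size, b"\x00")
    return cells

def reassemble(cells: list[bytes], length: int) -> bytes:
    """Rebuild the packet at the output and drop the padding."""
    return b"".join(cells)[:length]

pkt = bytes(range(150))        # a 150-byte packet
cells = segment(pkt)           # three 64-byte cells
assert reassemble(cells, len(pkt)) == pkt
```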

The Packet Switch Arbitration: Distributed Versus Centralized Schemes

During every cycle, the arbitration procedure between T inputs and T outputs involves the following steps. Each active input submits request signals to the outputs, indicating the quantity of data destined for each of the appropriate outputs. Each output arbiter collects the request signals, among which the inputs with active requests are granted according to some priority order: A grant signal is sent back to acknowledge the amount of data that an active input can forward to the output.
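The per-output grant step described above can be sketched as a round-robin pass over the inputs; a minimal illustration (the pointer handling and the grant-as-much-as-fits policy are assumptions of this sketch, not the paper's exact scheme):

```python
def output_grant(requests: list[int], capacity: int, pointer: int) -> list[int]:
    """One output arbiter's grant step: visit inputs in round-robin order
    starting at `pointer` and grant each requested amount (in cells) until
    the output capacity for this cycle is exhausted."""
    T = len(requests)
    grants = [0] * T
    remaining = capacity
    for k in range(T):
        i = (pointer + k) % T
        g = min(requests[i], remaining)
        grants[i] = g
        remaining -= g
        if remaining == 0:
            break
    return grants

# Four inputs requesting cells from one output with capacity 30 cells/cycle:
print(output_grant([10, 20, 30, 40], capacity=30, pointer=0))  # [10, 20, 0, 0]
```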

Panel 1. Abbreviations, Acronyms, and Terms

AR—Arbiter
CIOQ—Combined input-output queued
FFA—Fast fair arbiter
GMA—Global matrix arbitration
HMA—Hierarchical matrix arbitration
I/O—Input/output
IP—Internet Protocol
IPP—Interrupted Poisson Process
MA—Matrix arbitration
OA—Output arbitration
PPA—Ping pong arbiter
SBW—Signaling bandwidth
VS—Virtual switch
WFA—Wavefront arbiter
WFBVND—Wavefront Birkhoff-von Neumann decomposition


The signals exchanged between input and output arbiters can be represented as a matrix:

\Pi(T,T) = \begin{bmatrix} p(1,1) & p(1,2) & \cdots & p(1,T) \\ p(2,1) & p(2,2) & \cdots & p(2,T) \\ \vdots & & \ddots & \vdots \\ p(T,1) & \cdots & \cdots & p(T,T) \end{bmatrix}    (1)

where i and j are the indexes of input and output line cards, respectively; p(i, j) represents the signaling information (i.e., requests or grants) exchanged between input i and output j.

As packets are discarded only at input line cards, the grant matrix should respect the following input and output capacity constraints:

\sum_{i=1}^{T} p(i, j) \le \Phi(j), \quad \forall j = 1, \ldots, T    (2)

\sum_{j=1}^{T} p(i, j) \le \Phi(i), \quad \forall i = 1, \ldots, T    (3)

where \Phi(i) and \Phi(j) represent the capacity of input i and output j, respectively, expressed in either bits, bytes, or cells per cycle. In other words, equation 2 implies that an output cannot receive more bandwidth than its capacity, while equation 3 implies that an input cannot send more bandwidth than its capacity. In practical implementations, inputs \Phi(i) and outputs \Phi(j), \forall i, j, have the same capacity. In the remainder of this work, we follow this practice and consider \Phi(i) = \Phi(j) = \Phi, \forall i, j.

Packet arbitration schemes can be divided into distributed and centralized approaches (for further details, the reader may refer to [3]). Distributed arbitration schemes use single request-grant exchanges between input and output arbiters. In this paper, we will call them output arbitration (OA). Centralized arbitration, called matrix arbitration (MA), contains schemes that introduce additional devices and steps of bandwidth matching, where grants provided by outputs are used to compute new grants (also called final grants). In this work, we further divide centralized arbitration schemes into non-hierarchical global matrix arbitration (GMA) and hierarchical matrix arbitration (HMA). When centralized arbiters are used, a single-device arbiter collects grant signals and re-adjusts them (e.g., by delaying some of them) to new (final) grants.
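The capacity constraints of equations 2 and 3 are simple row- and column-sum checks on the grant matrix; a small sketch with illustrative values:

```python
def respects_capacity(P, phi_in, phi_out) -> bool:
    """Check a T x T grant matrix P against equations 2 and 3: every
    column sum must not exceed the output capacity (eq. 2) and every
    row sum must not exceed the input capacity (eq. 3)."""
    T = len(P)
    cols_ok = all(sum(P[i][j] for i in range(T)) <= phi_out[j] for j in range(T))
    rows_ok = all(sum(P[i][j] for j in range(T)) <= phi_in[i] for i in range(T))
    return cols_ok and rows_ok

# Equal capacities Phi = 30 cells/cycle on every port, as assumed in the text:
P = [[10, 15], [20, 10]]
assert respects_capacity(P, [30, 30], [30, 30])            # all sums <= 30
assert not respects_capacity([[31, 0], [0, 0]], [30, 30], [30, 30])
```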

Centralized arbiters have state information from all inputs and outputs. This feature allows them to efficiently arbitrate traffic matrices with different statistical properties, at the expense of some additional complexity. Fast heuristic algorithms provide a maximal matching, where the maximal solution strictly respects input and output capacity constraints, as defined in equation 2 and equation 3. However, single-device (i.e., non-hierarchical) centralized arbiters face scalability issues, in terms of computing time and signaling bandwidth. In contrast to centralized arbiters, distributed arbiters of the type OA do no iterations and solve contentions based on local knowledge alone. As there is no coordination between output arbiters, i.e., each output arbiter independently shares the bandwidth among competing inputs, the switch fabric bandwidth is less efficiently used. In the OA model, constraints on input capacity (from equation 3) are not resolved, and inputs frequently cut off the grants that are received. As we shall illustrate later, this sub-optimal output arbitration can lead to severe degradation of switch performance: The lower efficiency must be compensated by a significant matrix interconnection speedup (e.g., at least 150 percent [7] to asymptotically emulate an ideal switching system), which increases costs and power consumption.

Motivated by the performance advantages of centralized arbiters, we intend to make them more scalable by designing hierarchical schemes; this is the subject of the next section.

A Multi-Cell Hierarchical Arbiter

Our solution splits the arbitration routine—done by a single-device arbiter with T inputs and T outputs—into several subroutines and disperses the subroutines over separate arbitration devices. This allows the design of arbitration devices with sizes much smaller than T and parallelization of some of the arbitration subroutines.

The idea above requires a spatial separation of the physical switch into several virtual switches; each of the virtual switches will be separately arbitrated by


one of the "small size" arbiters. The spatial switch virtualization further requires that the number T of inputs and outputs can be divided into a product of two integers. In practical implementations, T is a power of 2, and thus it can be decomposed as a product T = N × M, where N and M are integers, also powers of 2.

We propose a spatial switch separation based on virtual aggregation and grouping of individual line cards. This virtual aggregation is the result of grouping and aggregating M individual line cards into a higher-capacity virtual line card, or shelf. Each one of the resulting N virtual shelves has a capacity equivalent to M individual line cards. As a consequence, the physical packet switch can be intuitively viewed as a packet switch made of N shelves with M line cards per shelf. Subsets of the switch that involve the shelves will be referred to as virtual switches.
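One convenient way to pick the factors is to divide the exponent of T; this split is an assumption of the sketch below, since the paper only requires T = N × M with both factors powers of 2, and its numerical examples use other splits such as N = 64, M = 16 for T = 1024:

```python
def shelf_decomposition(T: int) -> tuple[int, int]:
    """Factor a power-of-2 port count T into T = N * M (N shelves of M
    line cards each), using a near-balanced split with N >= M. Other
    power-of-2 splits are equally valid."""
    assert T > 0 and T & (T - 1) == 0, "T must be a power of 2"
    k = T.bit_length() - 1        # T = 2**k
    N = 1 << (k - k // 2)         # N = 2**ceil(k/2)
    M = 1 << (k // 2)             # M = 2**floor(k/2)
    return N, M

print(shelf_decomposition(1024))  # (32, 32)
```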

In Figure 1, we illustrate the possible separation of a physical switch into three classes of virtual switches. Each class contains one or several virtual switches with identical size and capacity. Note that we use the respective terms "virtual input" and "input shelf," and "virtual output" and "output shelf," interchangeably. Each input and output shelf therefore has a capacity equivalent to M individual line cards.

The first class contains the virtual switch VS1, as depicted in Figure 1b. It corresponds to an N × N switch, with N input shelves and N output shelves.

Figure 1. A generic physical packet switch (a) and a possible separation of switch line cards into input and output shelves and groups of M individual line cards (b, c, d, e).


Each input and output, respectively, has a capacity of M·Φ cells/cycle (recall that each individual input and output has a capacity of Φ cells/cycle). The second class contains N virtual switches VS2(a), a = 1, …, N, as illustrated in Figure 2. A virtual switch VS2(a) corresponds to an M × N switch, with M individual inputs and N output shelves; each input and output has a capacity of Φ and M·Φ cells/cycle, respectively. The second class further contains N virtual switches VS2*(b), b = 1, …, N, as illustrated in Figure 3. A virtual switch VS2*(b) corresponds to an N × M switch, with N input shelves and M individual outputs; each input and output has a capacity of M·Φ and Φ cells/cycle, respectively. Finally, the third class contains N × N virtual switches VS3(a,b), as depicted in Figure 4. A virtual switch VS3(a,b) corresponds to an M × M switch, with M individual inputs and M individual outputs; each input and output, respectively, has a capacity of Φ cells/cycle.

Separation of the T-input/T-output physical packet switch, as presented above, permits the design of a centralized arbitration scheme in a three-layer hierarchical arrangement. The arbiter architecture is depicted in Figure 5. The arbiter AR1 arbitrates the virtual switch VS1; each of the arbiters AR2(a) and AR2*(b) arbitrates its corresponding virtual switch VS2(a) and VS2*(b), respectively; and each of the arbiters AR3(a,b) arbitrates its corresponding virtual switch VS3(a,b).

The arbitration cycle is divided into three sub-cycles, corresponding to the three classes of arbiters; a sub-cycle length corresponds to the processing time required by the arbiters belonging to its associated class. During each arbitration cycle, the inputs to each "small size" arbiter are the grant outputs from the preceding layer of arbiters.

The hierarchical arbiter functions as follows: During the first sub-cycle, AR1 arbitrates output contention in VS1; it provides an aggregated bandwidth matching between all virtual inputs and outputs. The outputs (i.e., grants) of AR1 are used as capacity constraints (see equations 2 and 3) by the next-level arbiters, i.e., AR2(a) and AR2*(b). During the second sub-cycle, each element AR2(a) arbitrates output contention in its associated VS2(a); it matches aggregated bandwidth between M individual inputs and N output shelves. At the same time, each element AR2*(b) arbitrates output contention in its associated VS2*(b); it matches aggregated bandwidth between N input shelves and M individual outputs. The outputs of AR2(a) and AR2*(b), respectively, are used as capacity constraints by the last-level arbiters. Finally, during the third sub-cycle, each item AR3(a,b) arbitrates output contention in its associated VS3(a,b); it matches bandwidth between M individual inputs and M individual outputs. The outputs of all AR3(a,b) are sent to line card arbiters. They represent the final line card-to-line card grants, which satisfy the constraint relationships established by equation 2 and equation 3.

At the end of the arbitration cycle, feedback is required between downstream (i.e., layer 3) and upstream partners (i.e., layer 1 and layer 2). This feedback permits upstream arbiters to determine the difference between the bandwidth granted at the first and second layers and that not yet granted at the third layer. This difference has to be considered over the next arbitration cycles. Consequently, all AR3(a,b) elements simultaneously send feedback signals, e.g., grants, to all their upstream partners.
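The three sub-cycles can be sketched end to end. The fragment below is not the authors' implementation: it replaces each wavefront pass with a simple greedy pass that respects the same row and column budgets, and it omits the feedback signals, but it shows how the grants of each layer cap the next:

```python
import numpy as np

def arbitrate(req, row_cap, col_cap):
    """Greedy stand-in for the wavefront pass used by each small arbiter:
    visit entries in order and grant as much as both budgets allow."""
    G = np.zeros_like(req)
    row = np.array(row_cap, dtype=req.dtype)
    col = np.array(col_cap, dtype=req.dtype)
    for i in range(req.shape[0]):
        for j in range(req.shape[1]):
            g = min(req[i, j], row[i], col[j])
            G[i, j] = g
            row[i] -= g
            col[j] -= g
    return G

def hierarchical_grants(R, N, M, phi):
    """Three sub-cycles: AR1 on the N x N shelf aggregate, AR2/AR2* on the
    mixed granularities, AR3 on each M x M block (feedback omitted)."""
    R4 = R.reshape(N, M, N, M)                 # (shelf a, card ii, shelf b, card jj)
    # Sub-cycle 1: AR1 on shelf-aggregated requests.
    G1 = arbitrate(R4.sum(axis=(1, 3)), [M * phi] * N, [M * phi] * N)
    # Sub-cycle 2: each AR2(a) matches M inputs to N output shelves,
    # with columns capped by AR1's grants.
    G2 = np.zeros((N, M, N), dtype=R.dtype)
    for a in range(N):
        G2[a] = arbitrate(R4[a].sum(axis=2), [phi] * M, G1[a])
    # Sub-cycle 2 (in parallel): each AR2*(b) matches N input shelves to M outputs.
    G2s = np.zeros((N, N, M), dtype=R.dtype)
    for b in range(N):
        G2s[b] = arbitrate(R4[:, :, b, :].sum(axis=1), G1[:, b], [phi] * M)
    # Sub-cycle 3: each AR3(a,b) is capped by both layer-2 grants.
    G = np.zeros_like(R)
    for a in range(N):
        for b in range(N):
            G[a*M:(a+1)*M, b*M:(b+1)*M] = arbitrate(
                R4[a, :, b, :], G2[a, :, b], G2s[b, a, :])
    return G
```

Running this on the paper's 4 × 4 example (N = M = 2, Φ = 30) produces final grants whose row and column sums never exceed 30 cells/cycle, as required by equations 2 and 3.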

An Example of Hierarchical Arbitration

We assume a packet switch with T = 4 inputs and outputs, and N = M = 2. We consider that each individual input and output line card has a capacity of 30 cells/cycle. As a result, each virtual input and output line card has a capacity of M × 30 = 60 cells/cycle. Let us consider the following request matrix between line cards denoted i and j:

R(i, j) = \begin{bmatrix} 10 & 10 & 10 & 10 \\ 20 & 20 & 20 & 20 \\ 30 & 30 & 30 & 30 \\ 40 & 40 & 40 & 40 \end{bmatrix}    (4)

First, AR1 matches aggregated bandwidth for a 2 × 2 switch, with two virtual inputs and outputs. The aggregated request matrix R_{AR1}(a, b), used to provide bandwidth matching between virtual inputs a and virtual outputs b, is as follows:

R_{AR1}(a, b) = \begin{bmatrix} 60 & 60 \\ 140 & 140 \end{bmatrix}    (5)

AR1 executes a wavefront arbitration algorithm on R_{AR1}(a, b) and provides a grant matrix G_{AR1}(a, b),
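The aggregated matrix in equation 5 is just the block sum of equation 4: entry (a, b) adds up the M × M block of requests from shelf a to shelf b. A quick check:

```python
# Aggregate the 4x4 line-card request matrix into the 2x2 shelf-level
# matrix used by AR1.
R = [[10, 10, 10, 10],
     [20, 20, 20, 20],
     [30, 30, 30, 30],
     [40, 40, 40, 40]]
N = M = 2

R_AR1 = [[sum(R[a * M + ii][b * M + jj] for ii in range(M) for jj in range(M))
          for b in range(N)] for a in range(N)]
print(R_AR1)  # [[60, 60], [140, 140]], matching equation 5
```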


VS—Virtual switch
Figure 2. The separation of a T-input/T-output packet switch into N virtual switches VS2(a); each virtual switch VS2(a), a = 1, …, N, has M individual inputs and N output shelves.


VS—Virtual switch
Figure 3. The separation of a T-input/T-output packet switch into N virtual switches VS2*(b); each virtual switch VS2*(b), b = 1, …, N, has N input shelves and M individual outputs.


VS—Virtual switch
Figure 4. The decomposition of a T-input/T-output packet switch into N × N virtual switches VS3(a, b); each virtual switch VS3(a, b), a, b = 1, …, N, has M individual inputs and M individual outputs.


which respects the input and output capacity constraints, as described in equation 2 and equation 3:

\sum_{a=1}^{N} G_{AR1}(a, b) \le M \cdot \Phi = 60 \text{ cells/cycle}, \quad \forall b = 1, \ldots, N    (6)

\sum_{b=1}^{N} G_{AR1}(a, b) \le M \cdot \Phi = 60 \text{ cells/cycle}, \quad \forall a = 1, \ldots, N    (7)

Second, each AR2*(b) matches aggregated bandwidth for a 2 × 2 switch, with two virtual inputs a and two individual outputs jj belonging to b. The aggregated request matrices R_{AR2*(1)}(a, jj) and R_{AR2*(2)}(a, jj), used to compute the bandwidth matching, are as follows:

R_{AR2*(1)}(a, jj) = \begin{bmatrix} 30 & 30 \\ 70 & 70 \end{bmatrix} \qquad R_{AR2*(2)}(a, jj) = \begin{bmatrix} 30 & 30 \\ 70 & 70 \end{bmatrix}    (8)

Each AR2*(b) in parallel executes a wavefront arbitration algorithm on its corresponding R_{AR2*(b)}(a, jj) and provides a grant matrix G_{AR2*(b)}(a, jj), which respects the following input and output capacity constraints, as described in equation 2 and equation 3:

\sum_{jj=1}^{M} G_{AR2*(b)}(a, jj) \le G_{AR1}(a, b), \quad \forall a, b = 1, \ldots, N    (9)

\sum_{a=1}^{N} G_{AR2*(b)}(a, jj) \le \Phi = 30 \text{ cells/cycle}, \quad \forall b = 1, \ldots, N, \; jj = 1, \ldots, M    (10)

At the same time, each AR2(a) matches aggregated bandwidth for a 2 × 2 switch, with two individual inputs ii belonging to a and two virtual outputs b. The aggregated request matrices R_{AR2(1)}(ii, b) and R_{AR2(2)}(ii, b), used to compute the bandwidth matching, are as follows:

R_{AR2(1)}(ii, b) = \begin{bmatrix} 20 & 20 \\ 40 & 40 \end{bmatrix} \qquad R_{AR2(2)}(ii, b) = \begin{bmatrix} 60 & 60 \\ 80 & 80 \end{bmatrix}    (11)

Each AR2(a) in parallel executes a wavefront arbitration algorithm on its corresponding R_{AR2(a)}(ii, b) and provides a grant matrix G_{AR2(a)}(ii, b), which respects the following input and output capacity constraints, as described in equation 2 and equation 3:

\sum_{ii=1}^{M} G_{AR2(a)}(ii, b) \le G_{AR1}(a, b), \quad \forall a, b = 1, \ldots, N    (12)

\sum_{b=1}^{N} G_{AR2(a)}(ii, b) \le \Phi = 30 \text{ cells/cycle}, \quad \forall a = 1, \ldots, N, \; ii = 1, \ldots, M    (13)

Each AR3(a,b) matches the bandwidth of a 2 × 2 switch, with two individual inputs and outputs. The request matrices R_{AR3(1,1)}(ii, jj), R_{AR3(1,2)}(ii, jj), R_{AR3(2,1)}(ii, jj), and R_{AR3(2,2)}(ii, jj), used to compute the bandwidth matching, are as follows:

AR—Arbiter
Figure 5. Logical blocks of the hierarchical arbiter (layer 1: AR1; layer 2: AR2(a) and AR2*(b), a, b = 1, …, N; layer 3: AR3(a,b)); for the sake of clarity, we do not plot all signals exchanged between (input, output, and "small size" hierarchical) arbiters.


R_{AR3(1,1)}(ii, jj) = \begin{bmatrix} 10 & 10 \\ 20 & 20 \end{bmatrix} \quad R_{AR3(1,2)}(ii, jj) = \begin{bmatrix} 10 & 10 \\ 20 & 20 \end{bmatrix} \quad R_{AR3(2,1)}(ii, jj) = \begin{bmatrix} 30 & 30 \\ 40 & 40 \end{bmatrix} \quad R_{AR3(2,2)}(ii, jj) = \begin{bmatrix} 30 & 30 \\ 40 & 40 \end{bmatrix}    (14)

Each AR3(a,b) in parallel executes the wavefront arbitration algorithm on its corresponding R_{AR3(a,b)} and provides a grant matrix G_{AR3(a,b)}, which respects the following input and output capacity constraints (as described in equation 2 and equation 3):

\sum_{jj=1}^{M} G_{AR3(a,b)}(ii, jj) \le G_{AR2(a)}(ii, b), \quad \forall a, b = 1, \ldots, N, \; ii = 1, \ldots, M    (15)

\sum_{ii=1}^{M} G_{AR3(a,b)}(ii, jj) \le G_{AR2*(b)}(a, jj), \quad \forall a, b = 1, \ldots, N, \; jj = 1, \ldots, M    (16)

Complexity Analysis

The complexity of the centralized and the hierarchical scheme can be compared using the same algorithm for the basic matrix matching. We find that the hierarchical arbiter is considerably faster and exchanges much less signaling traffic with the line cards. However, the higher degree of parallel processing implies a multiplication of (distributed) electronic circuits.

Arbitration Time and Silicon Requirements

A typical (non-hierarchical) centralized arbiter (GMA) uses a generalization of the wavefront arbiter (WFA) [14] from bit to integer-number matching, or a more general wavefront Birkhoff-von Neumann decomposition (WFBVND) [12], to match T inputs to T outputs. WFA needs T sub-circuits which run T times per pass to allocate a (T,T) matrix by diagonals. The number of bits in the registers is log2(Φ) and the number of iterations per diagonal is log2(Φ) (recall that Φ represents the input/output capacity expressed in cells/cycle). As a consequence, a single-device centralized arbiter scales as O(T·log2(Φ)) = O(T) both in silicon and runtime. We expect that this reference model, i.e., GMA, comes closest to optimal matching, because it can evaluate all the input and output constraints together at the lowest granularity.

straints together at the lowest granularity.

In the case of the proposed hierarchical scheme (HMA), the small size arbiters use wavefront implementations. In this regard, AR1 and each AR3(a,b) match square matrices (i.e., (N,N) and (M,M)) with a latency of O(N·log2(Φ)) and O(M·log2(Φ)), respectively. Each AR2(a) and AR2*(b) matches rectangular matrices with a latency of O(max(N,M)·log2(Φ)) [14]. As both scenarios N ≥ M and M ≥ N lead to similar complexity, we only discuss the scenario N ≥ M. In this context, the total computation latency of our arbiter is O(N·log2(Φ) + N·log2(Φ) + M·log2(Φ)) = O(2·N + M). This arbitration time is much smaller than O(T) (recall that T = N·M).

The silicon required by the hierarchical approach scales as O(N·log2(Φ) + 2·N·N·log2(Φ) + N²·M·log2(Φ)) = O(N²·M), bigger than O(T) by the factor N. However, AR2(a), AR2*(b), and AR3(a,b) can be integrated into line cards and switching fabric, and thus small size devices can be used.
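Counting diagonal passes (each costing log2(Φ) iterations) makes the speedup concrete; for the switch sizes used later in the signaling-bandwidth tables, plain arithmetic gives:

```python
# A T x T wavefront needs T diagonal passes; the three-layer hierarchy
# needs N (AR1) + max(N, M) (AR2/AR2*) + M (AR3) passes, i.e. 2*N + M
# when N >= M. Each pass costs log2(Phi) iterations in both cases.
def gma_passes(T: int) -> int:
    return T

def hma_passes(N: int, M: int) -> int:
    return N + max(N, M) + M

for T, N, M in [(256, 64, 4), (1024, 64, 16)]:
    print(f"T={T}: GMA {gma_passes(T)} passes vs. HMA {hma_passes(N, M)} passes")
# T=256: 256 vs. 132 passes; T=1024: 1024 vs. 144 passes
```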

Signaling Bandwidth Requirements

As each of T inputs can send requests to each of T outputs (and receive grants from each of T outputs, respectively), the half-duplex signaling bandwidth (SBW) required by GMA scales as O(T²·log2(Φ)) = O(T²), as depicted in Table I, where Y is the number of arbitration cycles per second.

In Table II, we present numerical results for terabit capacities. We consider cells of 64 bytes, line cards at 10 Gb/s, and an arbitration cycle of approximately 10 μs. As the input/output (I/O) half-duplex bandwidth of existing chips is typically a few tens of gigabits per second, the results presented here clearly show that a non-hierarchical centralized arbiter (GMA) is not scalable with T and cannot easily be implemented into a monolithic chip for large values of T.

As for HMA, each of the small size arbiters within the hierarchy arbitrates between small numbers (M or N) of (individual or virtual) inputs and outputs, instead of the large number T. This feature allows a significant reduction in the signaling bandwidth per arbitration device, as illustrated in Table I. In Table III, we present numerical results for the signaling bandwidth required by AR1, AR2(a), AR2*(b), and AR3(a,b), respectively. For example, the half-duplex signaling bandwidth required by each AR3(a,b) is approximately 0.50 Gb/s for a capacity of 2 Tb/s, and approximately 2.6 Gb/s for a capacity of 10 Tb/s.

We notice that the choice of different numerical values for N and M will achieve different values for the required signaling bandwidth. However, in all scenarios, the amount of signaling bandwidth required for small size arbiters is significantly smaller than the I/O bandwidth of existing chips.
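The Table I formulas can be evaluated directly. Assuming an arbitration cycle of 10 μs (so Y = 10^5 cycles/s) and Φ ≈ 195 cells/cycle per 10 Gb/s line card (10 Gb/s × 10 μs, 512 bits per cell), both of which are derived rather than quoted values, the GMA and AR3(a,b) formulas reproduce the magnitudes reported in Table II and Table III:

```python
from math import log2

Y = 1e5                      # arbitration cycles per second (10 us cycle)
PHI = 10e9 * 10e-6 / 512     # ~195 cells/cycle per 10 Gb/s line card

def sbw_gma(T: int) -> float:
    """Half-duplex SBW of the non-hierarchical arbiter: T^2 * log2(Phi) * Y."""
    return T**2 * log2(PHI) * Y

def sbw_ar3(T: int, M: int) -> float:
    """Half-duplex SBW of one AR3(a,b): M^2*log2(Phi)*Y + 2*T*log2(Phi*M)*Y."""
    return M**2 * log2(PHI) * Y + 2 * T * log2(PHI * M) * Y

for T, M in [(256, 4), (1024, 16)]:
    print(f"T={T}: GMA ~{sbw_gma(T) / 1e9:.0f} Gb/s, "
          f"AR3 ~{sbw_ar3(T, M) / 1e9:.2f} Gb/s")
# For T=256 this prints GMA ~50 Gb/s and AR3 ~0.50 Gb/s, matching the tables.
```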

Performance EvaluationThis section presents the performance evaluation

of three arbiters, each one using a distinct arbitration

scheme: HMA, GMA, or OA. GMA and each small

size arbiter within the hierarchical scheme use a WFB-

VND algorithm [12], which performs a parallelized

round robin–like selection. For OA, we consider

round robin selection, first at each output to serve the

input requests, then at each input to satisfy the output

grants. The performance evaluations are produced

through computer simulation for a 32 � 32 switch

with line cards running at 10 Gb/s and an arbitration

cycle of 10 ms. Each input has a shared buffer organ-

ized in virtual output queues, which can accommo-

date the amount of data received at full line card

speed during 1,000 arbitration cycles. Outputs have

small reassembly buffers and buffer overflow only

occurs at input line cards. Simulations ran for

1,000,000 cycles and 95 percent confidence intervals

have been computed.

Although long-range dependent and heavily varia-

ble traffic has an adverse effect on switch perfor-

mance, it has been shown that its effect is not as

significant as the packet destination pattern [9]. In

this regard, we consider benchmark traffic models [8]

and generate traffic flows according to an Interrupted

Poisson Process (IPP), with ON and OFF periods

Table II. A numerical example of the half-duplex signaling bandwidth required by GMA, a non-hierarchical arbiter.

            Capacity ≈ 2 Tb/s            Capacity ≈ 10 Tb/s
            (T = 256; M = 4; N = 64)     (T = 1024; M = 16; N = 64)
  SBW       ≈ 50 Gb/s                    ≈ 760 Gb/s

GMA—Global matrix arbitration
SBW—Signaling bandwidth

Table I. Signaling bandwidth: Hierarchical versus nonhierarchical arbiter.

                   Signaling bandwidth (bps)
  Nonhierarchical  SBW = T² × ⌈log₂(F)⌉ × Y; T = N × M
  AR1              SBW = N² × ⌈log₂(F × M²)⌉ × Y
  AR2(a)           SBW = T × ⌈log₂(F × M)⌉ × Y + N² × ⌈log₂(F × M²)⌉ × Y; T = N × M
  AR2*(b)          SBW = T × ⌈log₂(F × M)⌉ × Y + N² × ⌈log₂(F × M²)⌉ × Y; T = N × M
  AR3(a,b)         SBW = M² × ⌈log₂(F)⌉ × Y + 2 × T × ⌈log₂(F × M)⌉ × Y; T = N × M

AR—Arbiter
SBW—Signaling bandwidth

Table III. A numerical example of the half-duplex signaling bandwidth required by small size arbiters.

            Capacity ≈ 2 Tb/s            Capacity ≈ 10 Tb/s
            (T = 256; M = 4; N = 64)     (T = 1024; M = 16; N = 64)
  AR1       ≈ 5 Gb/s                     ≈ 7 Gb/s
  AR2(a)    ≈ 5.3 Gb/s                   ≈ 8.2 Gb/s
  AR2*(b)   ≈ 5.3 Gb/s                   ≈ 8.2 Gb/s
  AR3(a,b)  ≈ 0.5 Gb/s                   ≈ 2.6 Gb/s

AR—Arbiter
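To make the scaling comparison concrete, the formulas of Table I can be evaluated numerically. The snippet below uses Table I's notation; the numeric values plugged in for F (the request counter range) and Y (arbitration cycles per second) are placeholders chosen for illustration, not the paper's exact parameters, which are defined in an earlier section.

```python
from math import ceil, log2

def sbw_nonhier(N, M, F, Y):
    """Half-duplex signaling bandwidth of a flat T x T arbiter
    (Table I, nonhierarchical row): T^2 * ceil(log2(F)) * Y."""
    T = N * M
    return T * T * ceil(log2(F)) * Y

def sbw_ar3(N, M, F, Y):
    """Half-duplex signaling bandwidth of one AR3(a,b) small size
    arbiter: M^2 * ceil(log2(F)) * Y + 2 * T * ceil(log2(F * M)) * Y."""
    T = N * M
    return M * M * ceil(log2(F)) * Y + 2 * T * ceil(log2(F * M)) * Y

# Placeholder parameters: F (request range), Y (arbitration rate, Hz).
F, Y = 256, 1e5
for N, M in [(64, 4), (64, 16)]:
    print(f"T={N * M}: nonhierarchical {sbw_nonhier(N, M, F, Y) / 1e9:.1f} Gb/s, "
          f"AR3 {sbw_ar3(N, M, F, Y) / 1e9:.2f} Gb/s")
```

Even with these placeholder values, the orders of magnitude line up with Tables II and III: tens to hundreds of Gb/s for the flat arbiter versus a few Gb/s for an AR3(a,b) small size arbiter.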

Page 12: A fast hierarchical arbitration scheme for multi-tb/s packet switches with shared memory switching

92 Bell Labs Technical Journal DOI: 10.1002/bltj

exponentially distributed. The peak-to-mean ratio of

IPP is 5 and the average length of the ON period is

ten cycles. The size of incoming packets follows the

packet size distribution observed in the Internet [6],

and the packet destination pattern is described by

Zipf’s power law [8], with parameter z: z = 0 corresponds to packet destinations uniformly distributed among the outputs. For z > 0, each input has a

different hot output destination, with a fraction of

packets from an input destined to its hot output and

the remaining packets randomly destined to other

outputs. A parameter z � 5 represents quasi-directional

traffic flows, where each input has traffic destined to a

single output.
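The traffic model above can be sketched as follows. The IPP source alternates exponentially distributed ON and OFF periods (a mean ON of 10 cycles with a mean OFF of 40 cycles yields the peak-to-mean ratio of 5), and destinations are drawn under a Zipf-like bias toward the input's hot output. Mapping the heaviest Zipf rank onto the hot output by rotation is our own illustrative choice, not necessarily the benchmark's exact construction.

```python
import random
from itertools import islice

def ipp_states(mean_on=10.0, mean_off=40.0):
    """Yield per-cycle ON/OFF states of an Interrupted Poisson Process
    with exponentially distributed period lengths; the peak-to-mean
    ratio is (mean_on + mean_off) / mean_on = 5 for the defaults."""
    while True:
        for _ in range(max(1, round(random.expovariate(1.0 / mean_on)))):
            yield True    # source emits at peak rate while ON
        for _ in range(max(1, round(random.expovariate(1.0 / mean_off)))):
            yield False   # source is silent while OFF

def zipf_destination(n_outputs, z, hot):
    """Draw one destination under Zipf's power law with parameter z:
    z = 0 is uniform; larger z concentrates traffic on the input's
    hot output (rank 0, mapped to `hot` by rotation)."""
    weights = [1.0 / (k + 1) ** z for k in range(n_outputs)]
    r, acc = random.uniform(0.0, sum(weights)), 0.0
    for k, w in enumerate(weights):
        acc += w
        if r <= acc:
            return (hot + k) % n_outputs
    return hot

# Example: 20 cycles of ON/OFF state, and destinations for input 3
# of a switch with 8 outputs under mildly unbalanced traffic (z = 1).
states = list(islice(ipp_states(), 20))
dests = [zipf_destination(8, 1.0, hot=3) for _ in range(10)]
```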

In the following figures we show simulation

results, in terms of normalized throughput and delay,

for OA, GMA, and HMA. Simulation work has been

done for several values of N and M; we observed that for both scenarios N > M and N < M, HMA achieves identical performance. In the following, we present only the scenario N > M.

As depicted in Figure 6, a load value below 0.9

shows no measurable delay difference between OA,

GMA, and HMA under uniform distribution of packet

destination. The simulation results also show that OA

achieves the lowest throughput (i.e., the load at

which the switch saturates), as expected from its mod-

est quality of matching. The positive effect of the near-

optimal matching by GMA is also visible.

As load values below 0.9 show no significant dif-

ference between the performance achieved by OA,

GMA, and HMA, Figures 7 and 8 only show results for

an input load value of 0.95 and non-uniform traffic

patterns. The results plotted in Figure 7 indicate that

both GMA and HMA maximize throughput to nearly

100 percent for non-uniform traffic. In particular,

results show that the performance achieved by HMA

is very close to that of GMA. We observe that OA

achieves poor throughput when the matrix band-

width becomes the main bottleneck, i.e., z ∈ (0, 3).

This is due to the non-cooperative way of sharing the

bandwidth, as was qualitatively explained in the sec-

tion titled “Packet Switch Arbitration.”

The present analysis reveals that the interconnection of several wavefront arbiters has a

slightly adverse effect on packet switch performance.

In the case of delay, as depicted in Figure 8, we

observe a slight degradation of the performance of

HMA compared to GMA, when inputs receive traffic

composed of a dominant amount of directional flows,

i.e., z ∈ (2, 4). We suspect this is a fairness problem,

which takes place as a consequence of arbitrating

bandwidth from virtual to individual inputs/outputs,

as well as of cascading round robin–like (wavefront)

arbiters. Here, the tendency is for the hierarchical

arbiter to favor the heaviest flows.
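The round robin–like (wavefront) selection under discussion can be sketched as a diagonal sweep over the request matrix, in the spirit of the wavefront arbiters of [14]; this is a generic illustration, not the WFB-VND algorithm of [12].

```python
def wavefront_match(requests, offset=0):
    """One wavefront pass over an n x n boolean request matrix: sweep
    the wrapped diagonals starting at `offset`. Cells on the same
    diagonal share no row or column, so hardware can decide a whole
    diagonal in parallel. Returns a conflict-free (input, output) set."""
    n = len(requests)
    row_free, col_free = [True] * n, [True] * n
    matches = []
    for d in range(n):                  # diagonals, in priority order
        for i in range(n):
            j = (i + d + offset) % n    # cell (i, j) lies on diagonal d
            if requests[i][j] and row_free[i] and col_free[j]:
                matches.append((i, j))
                row_free[i], col_free[j] = False, False
    return matches

# The starting diagonal determines who wins contested bandwidth:
req = [[True, True], [True, False]]
print(wavefront_match(req, offset=0))   # favors the (0, 0) pairing
print(wavefront_match(req, offset=1))   # matches both inputs
```

Rotating `offset` each cycle gives the round robin–like fairness; as noted above, cascading such arbiters in a hierarchy is where the bias toward the heaviest flows appears to arise.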

Conclusion

In this paper, we introduced a fast multi-cell cen-

tralized arbiter for contention resolution in combined

input-output queued packet switches with shared

memory switching. Our arbiter is hierarchically struc-

tured, consisting of a three-layer hierarchy and mul-

tiple “small size” arbiters at each layer.

The results obtained in this work provide several

useful insights. First, the analysis showed that a non-

hierarchical centralized arbiter is not easily scalable: In

Figure 6. Average packet delay (in cycles) as a function of offered load (0.1 to 1.0) for OA, GMA, and HMA; incoming traffic is uniformly distributed among outputs (i.e., z = 0); N = 8, M = 4.

GMA—Global matrix arbitration
HMA—Hierarchical matrix arbitration
OA—Output arbitration


Figure 7. Normalized throughput as a function of the Zipf parameter z (0 to 5) for OA, GMA, and HMA; N = 8, M = 4, offered load ρ = 0.95.

GMA—Global matrix arbitration
HMA—Hierarchical matrix arbitration
OA—Output arbitration

Figure 8. Average packet delay (in cycles) as a function of the Zipf parameter z (0 to 5) for OA, GMA, and HMA; N = 8, M = 4, offered load ρ = 0.95.

GMA—Global matrix arbitration
HMA—Hierarchical matrix arbitration
OA—Output arbitration


terabit per second packet switches, the arbitration sig-

naling bandwidth exceeds the I/O bandwidth of exist-

ing chips. That is, at multi-terabit switching capacities, a centralized arbiter cannot be implemented in hardware as a monolithic device, in contrast to the hierarchical scheme. Second, we showed that the proposed hierarchical arbiter provides significant timing

and signaling bandwidth relaxation at the expense of

an increase in parallel processing on distributed and

small size arbiters. Third, the performance evaluation

showed that our hierarchical arbiter does not degrade

the switch delay and throughput performance com-

pared to that of a fully centralized single-device

arbiter. In addition, results indicated that the per-

formance of both single-device and multi-device

arbiters significantly exceeds that achieved by a

basic distributed arbiter. Finally, the performance

analysis revealed a possible fairness issue with the

hierarchical arbiter, with round robin–like selection

under non-uniform traffic patterns: The tendency is

for the arbiter to favor the heaviest flows.

We are currently analyzing the fairness issue via

comparison to a hierarchical arbiter with proportional

selection. The preliminary simulation results revealed

that the fairness degradation is a consequence of cas-

cading arbiters with round robin–like (i.e., “max-min

fair”) selection: The fairness issue is significantly alle-

viated when cascading arbiters with proportional

selection.

References

[1] T. E. Anderson, S. S. Owicki, J. B. Saxe, and C. P. Thacker, “High-Speed Switch Scheduling for Local-Area Networks,” ACM Trans. Comput. Syst., 11:4 (1993), 319–352.

[2] M. Andrews and L. Zhang, “Scheduling Protocols for Switches With Large Envelopes,” ACM J. Scheduling, 7:3 (2004), 171–186.

[3] H. J. Chao, “Next Generation Routers,” Proc. IEEE, 90:9 (2002), 1518–1558.

[4] H. J. Chao, C. H. Lam, and X. Guo, “A Fast Arbitration Scheme for Terabit Packet Switches,” Proc. IEEE Global Telecommun. Conf. (GLOBECOM ’99) (Rio de Janeiro, Braz., 1999), vol. 2, pp. 1236–1243.

[5] F. M. Chiussi and A. Francini, “Scalable Electronic Packet Switches,” IEEE J. Select. Areas Commun., 21:4 (2003), 486–500.

[6] Cooperative Association for Internet Data Analysis (CAIDA), http://www.caida.org/research/.

[7] A. S. Diwan, R. Guérin, and K. N. Sivarajan, “Performance Analysis of Speeded-Up High-Speed Packet Switches,” J. High Speed Networks, 10:3 (2001), 161–186.

[8] I. Elhanany, D. Chiou, V. Tabatabaee, R. Noro, and A. Poursepanj, “The Network Processing Forum Switch Fabric Benchmark Specifications: An Overview,” IEEE Network, 19:2 (2005), 5–9.

[9] S. Fong and S. Singh, “Performance Evaluation of Shared-Buffer ATM Switches Under Self-Similar Traffic,” Proc. IEEE Internat. Perform., Comput., and Commun. Conf. (IPCCC ’97) (Phoenix, AZ, 1997), pp. 252–258.

[10] J. Y. Hui, Switching and Traffic Theory for Integrated Broadband Networks, Kluwer Academic Publishers, Boston, MA, 1989.

[11] H. Kim, C. Oh, and K. Kim, “A High-Speed ATM Switch Architecture Using Random Access Input Buffers and Multi-Cell-Time Arbitration,” Proc. IEEE Global Telecommun. Conf. (GLOBECOM ’97) (Phoenix, AZ, 1997), vol. 1, pp. 536–540.

[12] J. Li and N. Ansari, “Enhanced Birkhoff-von Neumann Decomposition Algorithm for Input Queued Switches,” IEE Proc. Commun., 148:6 (2001), 339–342.

[13] R. Rojas-Cessa and E. Oki, “Round-Robin Selection With Adaptable-Size Frame in a Combined Input-Crosspoint Buffered Switch,” IEEE Commun. Lett., 7:11 (2003), 555–557.

[14] Y. Tamir and H.-C. Chi, “Symmetric Crossbar Arbiters for VLSI Communication Switches,” IEEE Trans. Parallel Distrib. Syst., 4:1 (1993), 13–27.

[15] F. Wang and M. Hamdi, “Fast Fair Arbiter Design in Packet Switches,” Proc. IEEE High Perform. Switching and Routing Workshop (HPSR ’05) (Hong Kong, Ch., 2005), pp. 472–476.

(Manuscript approved March 2009)

DANIEL POPA was a research engineer in the Semantic and Autonomic Technologies department at Alcatel-Lucent Bell Labs France when this paper was written. He is now a senior research engineer in the Product Development department of ITRON. He received an engineering degree in electrical and telecommunication engineering from Polytechnic


Institute, Bucharest, Romania, and a Ph.D. degree in computer science from Telecom-SudParis, France. During his tenure at Bell Labs France, he focused on scheduling mechanisms for high-speed switching systems. His experience includes extensive work on optical networking with a focus on traffic and QoS control, medium access control protocols, and performance evaluation. Dr. Popa also held a one-year post-doctoral position at Telecom-SudParis, where he participated in the design of a hybrid handover architecture deployed by the French National Railway in high-speed trains. His current research interests include traffic management, routing protocols, and medium access protocols for wireless networks and power line communications. He holds six patents, co-authored a book chapter on optical networking, and has published 22 technical papers in international conferences and journals.

GEORG POST is a research engineer in the Semantic and Autonomic Technologies department of the Networking and Networks research domain within Alcatel-Lucent Bell Labs in France. He received a doctoral degree in solid-state physics from Pierre et Marie Curie University, Paris. Dr. Post conducted research on III-V semiconductor device technologies at France Telecom CNET laboratories, in the Opto+ Alcatel/France Telecom joint venture, and later at Alcatel. He joined the network and system team to work on the “byte switch” research concept with its packet/time division multiplexing (TDM) agnostic fabric, which led to the Alcatel-Lucent 1850 Transport Service Switch. His current research interests include content-aware and flow-aware features of packet switches for the future of the Internet.

LUDOVIC NOIRIE is a research manager in the Semantic and Autonomic Technologies department of the Networking and Networks research domain within Alcatel-Lucent Bell Labs in France. After receiving diplomas in engineering from both Ecole Polytechnique and Telecom ParisTech, France, he joined Alcatel to work as a researcher on optical node and network architectures. He pioneered the optical multi-granularity concept that mixes wavelengths, wavebands, and fibers in the same network. His definition of the “byte switch” concept with its packet/time division multiplexing (TDM) agnostic fabric led to the Alcatel-Lucent 1850 Transport Service Switch (TSS), and he received the Bell Labs President’s Award in 2006 as a member of the 1850 TSS project team. He is now leading the Tera-Scale Semantic Networking team within the Semantic and Autonomic Technologies department, which investigates new solutions for the future of the Internet. He is co-leader of the semantic networking activity of the joint research lab between Alcatel-Lucent Bell Labs and INRIA. He is a member of the Alcatel-Lucent Technical Academy. He has authored 40 publications and is the inventor of 30 patents. ◆