anatomy of a high performance ip routercomlab/seminar/ovs1_ref5.pdf · anatomy of a high...

$: Anatomy of a High Performance IP Routercomlab/seminar/ovs1_Ref5.pdf · Anatomy of a High Performance IP Router Florian Brodersen Alexander Klimetschek {ﬂorian.brodersen, alexander.klimetschek}$
Anatomy of a High Performance IP Router

Florian BrodersenAlexander Klimetschek

{florian.brodersen, alexander.klimetschek}@hpi.uni-potsdam.de

Communication Networks Seminar 2003/04Hasso-Plattner-Institute, University of Potsdam

January 4, 2004

Abstract

In this paper we give an overview of the evolution of router architectures and show what

techniques are used to build high performance routers which can route packets at several

gigabit speeds. We start with an analysis of the jobs a router has to do and how traditional

or low-performance routers are built to accomplish these tasks. Therefore we describe the

traditional bus-based approach and show its limitations as well as the modern switch-based

layouts. Our focus lies on a detailed survey of the latter variants. The survey consists of de-

scribing the benefits and drawbacks of using packet-switching technology and a presentation

of different switch scheduling algorithms. Additionally, we show how the overall processing

power of routers can be increased by employing specific optimisations through the use of

intelligent hardware components or superior routing table data structures and lookup algo-

rithms. We conclude with a case-study of the Cisco 12416 internet router and an evaluation

of the latest efforts in router design.

Contents

1 Introduction 4

2 IP Router Analysis 5

2.1 IP Router Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1.1 Basic forwarding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.1.2 Complex forwarding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1.3 Router-specific tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1.4 Division into slow and fast path . . . . . . . . . . . . . . . . . . . . . . 7

2.1.5 Router components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2 Limits of Traditional IP Router Architecture . . . . . . . . . . . . . . . . . . . . 10

3 Modern IP Routers 13

3.1 Improvements at the component level . . . . . . . . . . . . . . . . . . . . . . . 13

3.1.1 Network Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.1.2 Forwarding Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.1.3 General Processing Module . . . . . . . . . . . . . . . . . . . . . . . . 14

3.1.4 Interconnection Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.1.5 Router architectures comparison table . . . . . . . . . . . . . . . . . . . 15

3.2 Bus-based Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.3 Switch-based Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.3.1 General switch idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.3.2 Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.3.3 Buffering and queueing in switches . . . . . . . . . . . . . . . . . . . . 20

3.3.4 Crossbar Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.3.5 Quality-of-service via hierarchical scheduling . . . . . . . . . . . . . . . 30

3.4 Routing table lookup optimisations . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.5 Case study: Cisco 12000 series . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2

4 Conclusion and Outlook 36

List of Figures

1 Interregional Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Simple router block diagram (FMC) . . . . . . . . . . . . . . . . . . . . . . . . 8

3 Traditional router architecture with single CPU and single memory (FMC) . . . . 10

4 Routing table entries in the BGP system . . . . . . . . . . . . . . . . . . . . . . 12

5 Detailed router block diagram (FMC) . . . . . . . . . . . . . . . . . . . . . . . 14

6 List of described router architectures . . . . . . . . . . . . . . . . . . . . . . . . 15

7 A 4-input crossbar interconnection fabric . . . . . . . . . . . . . . . . . . . . . 17

8 Modern switch based architecture with forwarding engines on line card (FMC) . 19

9 Two Schemes of the output queueing family . . . . . . . . . . . . . . . . . . . . 22

10 Scheme of crosspoint/distributed output queueing . . . . . . . . . . . . . . . . . 23

11 Scheme of input queueing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

12 Performance impact of head-of-line (HOL) blocking . . . . . . . . . . . . . . . 25

13 Scheme of combined input and output queueing . . . . . . . . . . . . . . . . . . 26

14 Summary of theN ×M switch queueing architectures . . . . . . . . . . . . . . 27

15 2DRR scheduling algorithm operating on the adjacency matrix . . . . . . . . . . 28

16 Example run of the Parallel Iterative Matching (PIM) algorithm . . . . . . . . . . 29

17 Cisco 12000 series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

18 Cisco 12416 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

19 Unicast and multicast queues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3

(a) 2001 (b) 2002

Figure 1: Interregional Bandwidth (http://www.telegeography.com)

1 Introduction

Within the last decade the internet has grown rapidly. As there are more and more users con-

nected to the internet and the per-user traffic is growing, networks are faced with increased

amount of traffic. This means that data has to be transmitted and processed at higher rates. Due

to advances in the research of transmission technology, currently more bandwidth is provided

than actually needed. Using Dense Wavelength Division Multiplexing (DWDM) technology in

fibre optic cables, transfer rates up to 80 Gbit/s per line can be achieved, which satisfies present

demands (figure 1). Being confronted with this high bandwidth, routers become the real bottle-

necks.

Traditional hardware routers could not handle this amount of data anymore because they

were limited by their architecture which typically consisted of only one processor and one central

memory unit. Routing and route processing was done purely in software and their performance

was limited by the processor.

In order to solve this problem, new router architectures are needed. Unlike traditional routers,

which used a bus for internal communication, present-day routers are based on a switch. On one

hand this enables fast routing at gigabit speed, but on the other hand this leads to new prob-

lems, such as input and output blocking. The purpose of this paper is to illustrate the different

approaches towards these new designs and to show up solutions for specific problems like fast

routing table lookup. To understand the central part of modern routers, the switch, we will take

4

an in-depth look at the problems of building a high-performance switch.

We will neither cover routing mechanisms nor routing protocols, such as RIP, BGP or OSPF.

For further information on routing protocols see [Bak95] [Mal98] [RL95] [Moy98].

Contents of this paper

Section 2 begins with analysing the functionality of a router. After explaining the steps the router

has to do, like packet forwarding and route processing, we identify the abstract router compo-

nents which provide this functionality. At last we will look at the traditional router architecture

and show its limitations to provide a motivation for new designs.

In section 3 we talk about the modern switch based router architectures. After giving a short

overview of how the different components can be optimised, we briefly discuss some variants of

the traditional bus-based router architecture. The main part consists of an in-depth analysis of

the switch concept. We will conclude with a presentation of a commercial router.

In the last section we will show some open problems and ask what the next router generation

will look like.

2 IP Router Analysis

2.1 IP Router Functionality

The basic IP router functionality is given by the path a typical IP packet takes through the router:

it is received at the inbound network interface, the appropriate outbound network interface is

determined by looking at the header of the packet and by consulting the routing table and is then

finally forwarded to the outbound network interface, which will transmit it to the next hop. This

is true for most of the packets, but there are also packets received at the router, which require

special treatment by the router itself.

Going into more detail, a classification of packets by their destination is helpful. Packets

destined for the router itself include automatic routing protocol messages (like RIP, BGP or

OSPF) as well as system configuration and management requests. All other packets, with a

destination outside the router, have to be either forwarded or rejected. Most of the time the

forwarding decision can easily be computed. In some cases it is more complex because packet

5

translation, traffic prioritisation and packet filtering need to take place before the packet can

eventually be delivered.

2.1.1 Basic forwarding

As described in “Requirements for IP Version 4 Routers” [Bak95] and “IP Router Architectures:

An Overview” [Awe99] the following actions must be done in order to properly forward a packet.

• IP Packet Validation:First of all the router has to validate the IP packet. Only well-formed

packets can be further processed, otherwise the packet is discarded. This test ensures that

(a) the version number is correct, (b) header length field is valid and (c) the value of the

header checksum field is equal to the calculated header checksum [BBP88].

• Packet Lifetime Control:Routers must decrement the time-to-live (TTL) field and discard

the packet if the value is zero or negative. In this case an ICMP error message is sent to

the original sender.

• Checksum Recalculation:Due to the fact that the TTL field has been modified, the header

checksum needs to be recalculated. When the TTL field is only decremented by one, it is

easy to compute the new checksum [MK90].

• Destination IP Address Parsing and Table Lookup:To determine the output port, the des-

tination IP address of the packet is taken to find a matching entry in the routing table. This

reveals whether the packet should be delivered locally, to a single output port (unicast) or

to a set of output ports (multicast).

• Fragmentation:To meet the Maximum Transmission Unit (MTU) of the outbound network

interface, packets might need to be split into smaller fragments. Because fragmentation

is considered harmful by virtue of its massive impact on performance, a method called

“Path MTU Discovery” was developed to avoid fragmentation on the path of an IP packet

[MD90].1

• Handling IP Options:Although very few packets require IP options handling and parsing,

their processing must be implemented [Pos81].

1The smallest MTU is discovered by sending packets with the “IP Don’t Fragment Bit” set. This forces a router,which would have to fragment the packet, to send an ICMP error message back to the sender. This error messageshould contain the value of the MTU possible at that router. The sender reduces the size of the packets and proceedsthis way until a packet successfully reaches its final destination.

6

2.1.2 Complex forwarding

Depending on the field of application, routers sometimes have to perform additional steps when

forwarding a packet. We call this complex forwarding and typical elements of it are:

• Packet Translation:Due to the lack of public IP addresses people are forced to use one

address for several hosts. A router acting as a gateway for local networks has to map local

addresses to the routers public IP address and vice versa. This feature is called network

address translation (NAT) and is achieved by maintaining a list of connected hosts and

translating incoming and outgoing packets.

• Traffic Prioritisation: In order to meet customer’s requirements, routers must be able to

guarantee a certain quality of service (QoS), for example delivering a fixed amount of

packets at a constant rate. New applications like multimedia streaming over the internet

demand this.

• Packet Filtering:Routers often act as firewalls for networks. Therefore they have to filter

packets according to firewalling rules by looking at the packets contents.

2.1.3 Router-specific tasks

In the last paragraphs we talked about packets not destined for the router, which make up the

largest part of arriving packets. A relative small amount of packets has to be interpreted by the

router itself.

• Routing Protocols:Routers need to support different routing protocols in order to be able

to send and receive route updates from other routers. These updates lead to automatic

modifications of the routing table.

• System Configuration and Management:Administrators can manually modify the config-

uration of the router. This includes modifications of the routing table and their implicit

propagation to other routers by sending route updates.

2.1.4 Division into slow and fast path

All tasks the router has to execute can be divided into time critical (fast path) and non-time

critical (slow path) operations depending on their frequency. Time critical operations are those

7

which affect the majority of packets and do not require special treatment. This includes IP packet

validation, packet lifetime control, checksum recalculation, destination IP address parsing and

table lookups. On the other hand, IP options handling, fragmentation and error handling as well

as all router-specific tasks are regarded as non-time critical. Complex forwarding tasks are not

commonly implemented in routers but none the less tend to be time critical operations. For

example when the router is also used as a firewall every packet must be filtered.

Otherwise, if packet translation is required, this does not necessarily need to be done with all

packets.

When building a high performance router it must be ensured that the fast path is implemented

in a highly efficient and optimised manner. It is also very important that the slow path does not

affect the fast path, in other words, time critical operations must have the highest priority under

any circumstances. Most of the architectures and optimisations described in the next section will

therefore concentrate on the fast path. The research in this area focuses on the fast path, too.

2.1.5 Router components

Figure 2: Simple router block diagram (FMC)

From the architectural point of view routers can be divided into several components. All

of the functionality mentioned above must be distributed among them. These components are

placed within the router and are connected to the environment via different network links that

8

are plugged into the router. Look at figure 2 to get the notion of a router in its environment. To

be able to route between the networks, the router must also maintain a routing table. This block

diagram provides the fundamental abstraction for the next diagrams that will show the internals

of the different architectures2.

According to [Awe99] four components can be identified: (a) network interfaces, (b) process-

ing module/s, (c) buffering module/s and (d) an internal interconnection unit. For our purposes it

is more convenient to identify two different kinds of processing modules as they are separated in

most router designs: theforwarding enginesare responsible for basic and complex forwarding,

all other processing is left to thegeneral processing module. This also aids in understanding

the division of functionality of a router into slow and fast path. The forwarding engine typically

is the main component of the fast path whereas the general processing module handles the slow

path. Furthermore the buffering module(s) depend(s) on the location of the processing module(s).

Therefore we will not state them explicitly.

• Network Interface(s):perform in- and outbound packet transmission to and from the con-

nected networks. This also implies handling different connection speeds and protocols.

• Forwarding Engine(s):determine the outgoing port for each packet by looking up route

entries. The lookups might either occur in the central routing table or in a derived data

structure such as a local route cache. The route cache entries only contain aggregated rout-

ing information. The Forwarding Engine is either implemented in hardware or software

and has the ability of performing basic and complex forwarding and requires a certain

amount of memory to buffer packets.

• General Processing Module:handles routing protocols as well as system configuration and

management requests. It controls the routing table, propagates route updates and performs

complex packet-by-packet operations. Memory is needed to store the routing table and as

a processing buffer. Due to the variety of tasks the general processing module can only be

implemented in software.

• Interconnection Unit:provides internal communication among the other components. The

aggregate bandwidth of all attached network interfaces defines the minimum transmission

speed of the interconnection unit.

2This, as well as the following diagrams marked with FMC, is a Fundamental Modeling Concepts (FMC) dia-gram. Information and notation references can be found at http://fmc.hpi.uni-potsdam.de

9

Figure 3: Traditional router architecture with single CPU and single memory (FMC)

These components will be used to further illustrate the various architectures. We will always

refer to those terms.

2.2 Limits of Traditional IP Router Architecture

The layout of traditional IP routers resembles the typical computer layout. It consists of a single

general processing module with an integrated forwarding engine, a simple bus representing the

interconnection unit and several low speed network interfaces as depicted in figure 3. All func-

tionality (packet processing and forwarding) is implemented in software and runs on a general-

10

purpose central processing unit (CPU). In the past this completely satisfied the requirements of

low traffic and low bandwidth networks and provided the necessary functionality at relatively low

cost. But since transmission technology and the demand for high speed communication quickly

evolved, routers operating at the backbones of the internet have to cope with more data in shorter

time. One can easily see that the traditional approach has various drawbacks when it comes to

handling high load. These will be discussed in the following.

First of all the overall processing speed is definitely limited by the single CPU. It has to

process every packet, but the processing rate of the CPU cannot be increased by large amounts,

because CPU speed is physically limited. More over the implementation of routing in software

is inefficient since the operations of the fast path are restricted to a small set which can be imple-

mented directly in hardware (as we will see later). General purpose CPUs are good at executing

a lot of different applications.

As we stated earlier, the slow path should not influence the fast path. Combining forwarding

engine and general processing module in a single CPU does not adhere to this policy.

The growth of networks in the last years resulted in increasing size of the routing tables

(figure 4 shows that the table sizes of routers using the BGP routing protocol has grown from

about 20,000 entries in 1994 to 160,000 entries today). This in turn means that the routing table

lookup is slowed down and delays the actual packet forwarding. In the traditional architecture,

which uses ordinary memory to store the routing table, an approach to improve the table lookup

time is missing.

The interconnection unit, in this case a conventional bus structure, is by any means the biggest

problem in this design because every packet has to pass it at least twice, once from the incoming

network interface to the memory and after being processed by the CPU to the outgoing network

interface. Though only the packet header is needed to determine the outgoing network interface,

the whole packet including the large data portion is transfered across the bus. Another constraint

of most bus systems is the missing support for selective multicasting. Broadcasting is easy to

achieve, because it targets all members on the bus, but selective multicasting, where only a subset

of all members are recipients of a message, is difficult to realise.

In order to give the reader an impression of how fast a router must operate these days we will

present a simple example of a 16-port IP router, each port operating at 2.5 Gbit/s (typical fibre

optic link speed)3. Further we take 250 Bytes as a mean value for the length of a standard TCP

segment. Considering that for every incoming packet a forwarding decision has to be computed

3Many fibre optic high speed links operate at 10 Gbit/s nowadays.

11

Figure 4: Routing table entries in the BGP system over the last 10 years (http://bgp.potaroo.net)

the router must be able to make 20 million decisions per second, a decision every 50 ns.

16 · 2.5 Gbit/s8 · 250 Bytes

= 20 · 106 Hz

Given that current SDRAM has an access time of 10 ns and large routing tables contain

tens of thousands entries, it is easy to see that this packet rate cannot be accomplished by this

traditional architecture.

Supposing that all 16 network interfaces may wish to transfer packets simultaneously the

interconnection unit has to support a bandwidth of16 · 2.5Gbit/s = 40 Gbit/s. This assumes an

improved design where the packets would only have to pass the bus once. According to [S+96]

the maximum bandwidth of buses is around 20 Gbit/s.

Considering the demands for high performance routing at multigigabit and terabit speed in

arbitrary complex networks, the traditional architecture is not suitable any more. It does not scale

well due to its inflexible structure and its described several bottlenecks.

12

3 Modern IP Routers

In this section we will present the new approaches in router design. In the last decade the ar-

chitectures of high performance routers changed from being bus based to switch based and in-

corporated many improvements over the old techniques. As a matter of fact routing at gigabit

speed is only possible by using a switch internally. In order to better understand the evolution of

router layouts we will briefly discuss the enhanced bus based layouts but our focus primarily lies

on switch based architecture and its particular problems. Therefore we will talk about queueing,

blocking, scheduling algorithms and different switch variants. After that we will show several

approaches to optimise the lookup of entries in the routing table. The section concludes with

a presentation of the popular Cisco 12000 series as an example for a current high speed switch

based router.

3.1 Improvements at the component level

The following will most importantly take a look at how to implement the “internal routing agent”,

as depicted in figure 5. This diagram is a direct refinement of figure 2. Each network link

is connected to a separate network interface card (NI), because these links can be of different

types. The actual routing agent is located between those network interfaces, that is why we call

it the “internal routing agent”.

3.1.1 Network Interface

Optimisations of network interfaces are not subject of this examination since they depend on

transmission technology and only connect the router with the networks. As we have seen in the

example calculation above, the router must be able to route packets at least at the combined rate

of all network interfaces.

3.1.2 Forwarding Engine

Due to the fact that the forwarding engine(s) are part of the fast path, they must be implemented

in a very efficient way. Their tasks are simple but time critical. In order to compute a forwarding

decision simple algorithms are used which can easily be coded in hardware. Therefore putting the

forwarding logic into hardware by using application specific integrated circuits (ASICs) gives an

13

Figure 5: Detailed router block diagram (FMC)

enormous performance boost. Additionally, the forwarding algorithms can be improved. Another

optimisation is combining network interfaces and forwarding engines in one unit, the so called

line card. This makes it possible to determine the destination of each incoming packet right after

it has reached the router and minimises the need for the packet to travel the interconnection unit

more than once. These forwarding engines need memory for the packets and special memory for

the route cache (see section 3.4), which both are placed directly on the line cards.

3.1.3 General Processing Module

As the general processing module is only responsible for executing non-time critical operations

no optimisation is needed. Its execution time does not affect the fast path. It only has to be

guaranteed that it does not affect other components execution. Nevertheless, it needs a large

memory for the main routing

3.1.4 Interconnection Unit

Generally, there are two different types of interconnection units, as we have already noted. The

bus is the traditional form and used in normal computers. A bus is a shared medium with multiple

members, but only one member can send at a time. This is sufficient for many applications where

14

high traffic across the bus is less important than the flexibility of adding or removing members.

However, in routers it is the central purpose to connect different network interfaces or line cards.

There are very often multiple independent connections between those, that can happen at the

same time. As a bus does not support this, a switch is needed. The layout of a switch allows

to route many connections in parallel, wherefore some kind of scheduler is needed, who decides

which routes to make. To implement a switch, various approaches have been made, which will

be presented in 3.3.

3.1.5 Router architectures comparison table

In order to give the reader an overview of all the different router architectures appearing in this

paper we have prepared a comparison table (see figure 6). The table lists the architectures at

the left and the possible combinations of components at the right. It is assumed that all layout

variants have a reasonable amount of network interfaces with links to other networks. Variants B

through D evolved from the traditional architecture A. E and F represent two types of the switch

based approach as there are many variants which differ only slightly. As we have stated earlier

our discussion focuses on the different switch implementation details and not on every possible

realisable architecture. In the following paragraphs we will shortly introduce the variants B

through F whereas A was already mentioned in section 2.

Architectures Forwarding General ProcessingInterconnectionEngine(s) Module(s) Unit

A (traditional) Central CPU BusB Multiple CPUs BusC on separate cardsCentral CPU Bus + Control BusD on line cards Central CPU BusE on separate cardsCentral CPU SwitchF on line cards Central CPU Switch

Figure 6: List of described router architectures

3.2 Bus-based Architectures

In this section we like to give an overview of the bus-based architectures (B,C and D) which

resulted in modifications of the traditional approach. The reason behind presenting them here

15

in this paper even though they are not used for high performance routers, is that it took quite a

while until the use of switches as interconnection units became common.

As we already identified in the previous section the traditional approach has several bottle-

necks. One of them which highly affects the overall processing power of a router is the routing

table lookup because it can only be performed by the single CPU. Therefore the first idea to

eliminate this bottleneck was to use multiple CPUs instead of one (architecture B). Still every

single CPU consisted of a general processing module with an integrated forwarding engine which

means that each CPU could perform forwarding and router-specific tasks. This resulted in su-

perior throughput compared to the traditional design because these tasks could be executed in

parallel.

But another issue was left open. The packets still had to pass the bus twice since the for-

warding decision could only be computed by the CPUs and the packets had to travel from the

incoming port to the memory of the CPU and then to their assigned output port. In order to solve

this problem it was realised that only the header portion of the packet which includes the desti-

nation address is needed to determine the output port. Hence, the idea was to split the packet into

two parts (header and data) and leave the data portion on the incoming port until the forwarding

decision was made. In order to quickly compute those decisions multiple independent forward-

ing engines were used. The router-specific tasks were handled by a single independent general

processing module (architecture C).

It still remained unsolved that the header portion had to travel the bus twice. But this problem

was addressed in the next step of the evolution of bus-based router architectures. It was taken

advantage of the fact that the forwarding decision can be computed as soon as the packets have

arrived at the router. For this reason the forwarding engines were placed right after the network

interfaces (architecture D). The combination of these two components is called a line card.

One must not forget that all these three designs were still based on a bus system which is a

serious bottleneck when it comes to transmission of vast amounts of packets. As we will see in

the following, superior architectures exist which use switches for internal communications.

3.3 Switch-based Architectures

3.3.1 General switch idea

A switch takes advantage of the fact that multiple members of the interconnection unit do not

always conflict with their requests. When member A has a message to send to member B, it has

16

Figure 7: A 4-input crossbar interconnection fabric [McK97]

no impact on a message from C to D, for example. These requests could be served in parallel

if there would be a dedicated line between the affected members. Exactly this is provided by

a switch: it basically connects each member with every other member and allows a dynamic

configuration of those connections. Aschedulercontrols the configuration of the switch by

establishing and releasing the connections inside the so-calledswitch fabric. The connections

to the outside are named input and output ports (to indicate the direction of the packet flow). A

router with network interfaces or line cards connected to the switch has one input port for each

card and one output port respectively.4

Several methods exist to connect each input with each output in a switch fabric, with the

crossbar as the most common. A crossbar is a net of interconnection lines, with switchable

crosspoints, as can be seen in figure 7. Crossbars can be implemented in CMOS operating at

high speed [MIM+97]. Other fabrics highly depend on the chosen general switch layout, which

are described in 3.3.3.

When using a switch for a router, it must be considered that switches operate with fixed

length packets, also called cells. Fixed length make the implementation much simpler because

all parts of the switch can work with the same clock rate and after one “time slot” all packet

transmissions have finished [McK97]. To be able to handle variable length packets, which is the

case with IP packets processed by a router, the switch must split those packets into cells at the

input and reassemble them at the output.

4One could also look at a time-multiplexed bus as a simple implementation of a switch [Awe99]. This sharedmedium switch is scheduled by continuously repeating time slots allocated for the members on the bus. This designis very limited in capacity, as we have seen in 2.2. Because it does not reflect the typical characteristics of the otherkinds of switches, we rather consider this as a typical bus and not as a switch implementation.

17

When talking about switches, we will useN for the number of input ports andM for the

number of output ports. The speed of the input lines will be denoted withS [bit/s], whereby we

simply assume they all have the same speed for a simpler examination.

3.3.2 Architectures

There are two basic architectures possible with a switch as the interconnection unit. The general

processing module is connected to the switch like the other components and is typically a CPU

with memory located on a special card. Note that the actual forwarding decision is still made in

the forwarding engines, not in the switch itself. It needs exact information of the output port(s)

for a given packet to be able to transfer it.

As in the bus-based architecture C, the forwarding engines are located on separate cards

in architecture E. The switch can provide higher throughput, but unfortunately the packets still

have to cross the interconnection twice. This does not take full advantage of the switch. The only

reason to use this architecture is to avoid merging network interfaces with forwarding engines on

line cards and to reduce the number of forwarding engines (which can be lower than the number

of network interfaces). That would make it easier and cheaper to add network interfaces at a later

time.

The modern reference architecture for high-performance routers is denoted as architecture F

in this paper (figure 8). In this design packets only have to cross the switch once which increases

the throughput of the switch a lot, compared to the previously described architecture. The general

packet flow is simple: after reaching the network interface the packets are put into a memory on

the line card, where the forwarding engine takes care of finding the next hop of the packet. These

hops can be another line card, where the according outgoing network interface is located, but also

the general processing module, which is connected to the switch like any other line card. Having

all components arranged as line cards around the central switch, helps in making the router more

flexible for later expansion, as it is done in the Cisco 12000 series (section 3.5), for example. The

“internal routing agent” found in diagram 5 is not directly visible, because the network interfaces

are not on separated cards, but part of the line cards, which also contain the forwarding engines.

One could imagine the “internal routing agent” as the dotted agent in figure 8.

18

Figure 8: Modern switch based architecture with forwarding engines on line card (FMC)

19

3.3.3 Buffering and queueing in switches

The benefit of a switch is to allow multiple transmissions in parallel. However, as the speeds of

the different connected links are different and the actual transmissions between these links vary

and might interfere with each other, not all outstanding transmissions can be handled at the same

time. This raises the need for some kind of memory inside the switch, which stores packets that

have to wait for their transmission. As we will see, the location of that memory is determinant

for the design of the switch, and therefore switches are categorised by it. The point that switches

handle packet flows, implies that those memories are queues5, thus we call the categorisation

“Queueing families”.

It is very convenient to see that there are three basic types: input queueing, output queueing

and shared buffer (“central queueing”). The memory is located either before the internal switch

fabric (what actually connects all inputs and outputs together) at the input ports, after the fabric

on the output ports or the fabric itself is a single large buffer.

There are still different variants of each basic type and an analysis reveals that the shared

buffer can be seen as a specific type of output queueing. The following will describe those

variants, examine their performances, exhibit what is needed in order to make it work properly

and provide their advantages and disadvantages.

Output Queueing

We start with the output queueing, because the simple nature of it can be seen as the “refer-

ence” design for a switch regarding throughput and delay. None the less certain issues prevent a

practical implementation wherefore the research in the last twenty years tried to reach the same

performance with other kinds of queueing. If output queueing would be easy to implement, less

would be known about input queueing and others.

Normal output queueingneeds a connection from every input port to every output port, con-

sisting of a filter (address or routing filter) and one output buffer for all input lines. All incoming

packets will be immediately transferred to all outputs, where the filter selects the packets destined

for that output and places them into the buffer (figure 9(a)).

First of all the output queueing has very good performance because it is internally non-

blocking (there is no buffer or delay between the input and output ports). Because packets are

5These are mostly First-In-First-Out (FIFO) queues, which can be simply implemented in hardware. It is alsopossible to take a random packet from those queues, but complexity is lower with strict FIFO-queues

20

queued directly at the output links to which they must be transmitted, this switch design is always

work conserving. This means that the output links will never suffer from starvation when there is

at least more than one packet about to be sent over it. As we will see later, other queueing types

do not necessarily ensure work conserving. The fact that the decision of which packet gets sent

next is made directly at a particular output link has another three advantages: multicasting, i.e.

broadcasting as well as selective multicasting, are easily achievable by selecting the appropriate

packets; delaying of packets can be controlled—this is more and more important for a sensible

handling of multimedia traffic; finally, general quality-of-service as demanded by customers can

be simply ensured by having multiple queues, one for each prioritisation level, that make up the

output buffer. A policy scheduler must decide to which level a packet belongs and put it into the

according queue, the output scheduler takes the packets from the prioritised queues according to

rules defined in a contract. More concerning scheduling can be found in section 3.3.5

With all these advantages, the question is: why is the normal output queueing not used in

real high performance routers or packet switchers? Because of two reasons: the memory used

must be working at(N + 1) · S speed6, because if an output receives packets from all inputs in

one cell time,N writes must occur to save the packets in this output queue and1 packet must be

read from the memory to transmit it to the output. Secondly, the internal crossbar must also work

N times faster than the output links in order to support the maximum ofN writes to the same

output buffer. This is called aspeedup(in this case a speedupK with K = N ). A speedup ofN

is impossible to built for largerN or higher line ratesS, whereas it would be best not to require

any speedup (K = 0). If we take the example presented in 2.2 withN = 16 ports operating at

S = 2.5 Gbit/s, one line in the crossbar would have to transfer16 · 2.5 = 40 Gbit/s, the entire

crossbar then needs a capacity of16 · 40 = 640 Gbit/s. These requirements are too high for the

implementation of both crossbars and memories.

Theshared bufferbelongs to the output queueing family, because it is a large memory shared

by all output links (figure 9(b)). This makes the memory better utilised than in the normal output

queueing. Packet contents can be distributed across the memory, only the pointers to the packet

locations must be stored in queues. These logical queues can be set up according to the particular

need: one queue for each output or queues per prioritisation and traffic flow. When exposed to

unicast traffic, the same performance as with normal output queueing is achieved.

Problems arise with multicast traffic, because there is generally not enough throughput to

store a packet in multiple queues (one for each output). Multicasting must therefore be imple-

6In the rest of the document, a memory throughput ofN will be measured inS line speed, so that we getN · S [bit/s] as the actual memory speed.

21

(a) Output queueing (b) Shared buffer

Figure 9: Two Schemes of the output queueing family

mented by copying a packet multiple times from a queue to the appropriate output ports. A

serious problem called Head-of-line blocking (HOL) will then occur. This prevents packets to

be transferred to a waiting output port because other packets in front of it can currently not be

transferred and therefore block the packets behind. See the discussion of input queueing below

for a more complete description of HOL and its serious impact on performance.

A shared buffer is impossible to implement for large and high performance switches, because

the large amount of required space and throughput cannot be achieved with today’s memory

technologies and HOL prevents a reasonable performance.

Crosspoint or distributed output queueingis an improved version of the normal output queue-

ing, as it has one queue for each input at each output. It is called crosspoint because the memory

parts are located at the joints of the internal crossbar (figure 10). A scheduler selects an appropri-

ate packet from one of theN buffers and passes it to the output link. No speedup is needed and

the memory only faces a maximum of two operations per cell time. But distributing the output

queues toN ·M memories is very inefficient, because the number of memories is quadratically

increasing (considering the common case ofN = M ) and we cannot share memory space due to

the split up.

There is another variant calledblock-crosspoint queueing, a combination of crosspoint and

shared memory queueing. Here, the crosspoint memories are combined to connect multiple

input and output ports to allow a sharing of memory, but there is still not one single memory as

in the shared buffer variant. This can be seen as a grouping of the memory blocks in crosspoint

queueing. It is useful for largerN , when a single shared buffer would need too much memory

throughput.

TheKnock-Out switch[YHA87] tries to combine the advantages of the (partly) shared mem-

22

Figure 10: Scheme of crosspoint/distributed output queueing

ory in normal output queueing and of less needed memory throughput in crosspoint queueing. A

buffer-less knock-out fabric puts the packets ofN incoming lines intoL queues, whereL <= N .

If more thenL packets arrive, onlyL will be stored and the rest will get lost. A good chosen

value forL can keep the packet loss at a very low rate, independent on the number of inputs.

Assuming uniform traffic, a value ofL = 8, for example, is sufficient to maintain the packet loss

rate at one packet per million.

Input Queueing

Not the actual dual of output queueing, the input queueing family has buffer memory at the input

ports and a large, buffer-less switching fabric7 which connects all inputs with all outputs. No

buffer is found at the output. Because the actual switching must happen inside the fabric, a

scheduler is needed, who decides which input lines are connected to which output lines (figure

11). These decisions can be made by looking at the input queues. The main advantage of input

queueing is that the internal switch fabric and the memories only have to operate at the line rate

S as opposed to the output queueing.

7typically a crossbar

23

Figure 11: Scheme of input queueing

There are two possibilities for the buffer at one input port: either one queue (simple input

queueing) or multiple queues, i.e. one per output or even more (advanced input queueing or

virtual output queues).

Simpleinput queueingsuffers from a serious problem, called Head-of-line (HOL) blocking.

This applies to any FIFO-queued packet systems, when only the first packet in the queue is

considered for the next action, i.e. taking that packet out of the queue and sending it out. If the

front packet (the head of the line) cannot be sent to an output port, because this output port is

blocked, the rest of the packets in the queue are also blocked. This is bad because very often

those packets are destined for a completely different output port and could be transferred, when

no other packets from other input lines contend for that output port.

An analysis of HOL blocking, assuming independent, identically distributed Bernoulli pro-

cesses with uniformly distributed destination ports, yields that under full load only2 −√

2 =

0.586 ≈ 58% throughput can be achieved (see figure 12).

HOL makes the simple input queueing unusable for high performance packet switches.Ad-

vanced input queueingwith virtual output queues(VOQ) remove the HOL blocking problem at

the cost of the complexity of the scheduler at the crossbar. If separate queues for each output port

exist at the input ports, no packet in front of others can block them any more, because they all

24

Figure 12: Performance impact of head-of-line (HOL) blocking [HK88]

have the same destination port. These virtual output queues can be compared with traffic lanes

at crossroads. It is obvious that cars (packets) with the same destination waiting in the same lane

(queue) cannot block each other. An important thing to consider is the maximum possible size

of the virtual output queues to prevent packet loss at the input.

Input queueing with VOQs is very popular and demands an intelligent scheduler for the

crossbar. Various algorithms are described in section 3.3.4.

Combined Input and Output Queueing

An interesting approach has been made to emulate an output queued switch by using input and

output buffers as well as a speedup in the switch fabric, called acombined input and output

queued(CIOQ) switch (see figure 13).

As written above, a speedup ofK means that the switch can transfer up toK packets from

its input ports to its output ports within one time slot. This time slot is the interval between

packet arrivals at the input ports. The output queued switch needs a speedup ofK = N , which

is very impractical. On the other hand, input queueing suffers from head-of-line blocking. The

question of interest is how much speedup is needed, so that a CIOQ switch behaves identically

25

Figure 13: Scheme of combined input and output queueing

to an output queued switch8. It has been proved, that a speedup of2 − 1N

is both necessary and

sufficient for this, considering anN ×N switch and any traffic pattern [CGMP99].

There are special scheduling algorithms needed for this kind of switch, because two buffers

are used and two scheduling phases are needed. Those proposed work with a speedup of2 and

are quite simple but still have a complexity ofO(N), which makes them not applicable for fast

switches. Future research might be able to find better algorithms, which can take advantage of

the output buffering emulation for simpler quality-of-service support. The implementation of a

low speedup like2 in switch fabrics is in the range of possibility.

3.3.4 Crossbar Scheduling

The problem crossbar schedulers have to deal with is finding a configuration of the switch where

each active input is connected to all necessary outputs in the shortest possible time. This chal-

lenge is quite popular in mathematics and can be reduced to the problem of finding the maximum

matching of a bipartite graph. The most difficult task is to develop an algorithm which has good

8This means that one cannot see a difference in the packets leaving the switches when both are fed with exactlythe same input

26

Maximum Memorymemory No. of space Complexity

Architecture throughput memories utilisation Performance cost (problems)

Shared high-throughputBuffer

N + M 1 BEST BESTsingle memory

Output high-throughput mem.,Queueing

N + 1 M MEDIUM BESTexpensive sw. fabric

Crosspoint manyQueueing

2 N ·M WORST BESTmemories

Input bad throughputQueueing

2 N MEDIUM WORSTunder high load

Input + VOQ MEDIUM (sh.sp.) VERY GOOD complexQueueing

2 NWORST (part.sp.) (if...) scheduler

CIOQ - VERY GOOD expensiveSpeedup of K

K + 1 N + M MEDIUM(if...) switching fabric

Figure 14: Summary of theN ×M switch queueing architectures [Kat03]

fairness properties which, in other words, is starvation free and leaves no input unserved indefi-

nitely. There exist several deterministic maximum matching algorithms with good runtimes but

they always produce the same matchings on equal or similar situation. An unmatched input will

therefore stay unmatched what is considered to be unfair. For iterative algorithms it is also nec-

essary to examine the behaviour under heavy load because usually only a single iteration can be

achieved under these circumstances. The last requirement for scheduling algorithms is that an

implementation in hardware must be easy to achieve.

Two Dimensional Round-Robin Scheduler - 2DRR

The basic idea behind this algorithm is to represent all inputs, all outputs and all requests in one

matrix as can be seen in figure 15. In mathematics this is an adjacency matrix of a bipartite graph

with the input ports as one set of vertexes and the output ports as the other set. The edges in this

graph symbolise the requests.

The algorithm starts with accepting all requests on the main diagonal and ignores all other

requests which are on the same rows and columns as the accepted requests. This procedures

now continues with all other diagonals in left-to-right order until every requests is accepted and

a maximum matching is found.

Considering fairness this approach has several drawbacks. First of all, it is not very clever

27

Figure 15: 2DRR scheduling algorithm operating on the adjacency matrix

to always start with the main diagonal. It would be better to select the first diagonal in a round-

robin fashion. It is also unfair to have a fixed left-to-right method for choosing the succeeding

diagonals. For further information see references [LS94] and [PT03].

Parallel Iterative Matching - PIM

Parallel Iterative Matching (PIM) was developed by DEC System Research Center at beginning

of the 1990s [AOST93]. The algorithm tries to find a maximum matching in several iterations.

Each iteration consists of three steps (see figure 16):

• Request:Each unmatched input sends a request to every output for which it has a queued

cell.

• Grant: If an unmatched output receives any requests, it chooses one randomly to grant.

28

Figure 16: Example run of the Parallel Iterative Matching (PIM) algorithm

• Accept:If an input receives a grant, it accepts one by selecting an output randomly among

those that granted to this input.

Tests have shown that this algorithm converges to a maximal match inO(logN) iterations. In

each iteration3/4 of the remaining requests are eliminated [AOST93]. Under heavy load when

only a single iteration is possible this algorithm does not perform well because the throughput is

limited to (1− 1e) for large N, which is approximately 63 % for N= 16 [PT03].

Iterative SLIP - iSLIP

The iSLIP algorithm is based on the PIM algorithm and was developed by Nick McKeown as

part of his thesis in 1995[McK95]. The main difference is that it uses a round-robin instead of

a random schedule for figuring out which input or output is matched next. This is achieved by

maintaining different kind of pointers. It is also possible to apply a prioritisation scheme in order

to support classes of services.

• Request:Each unmatched input sends a request to every output for which it has a queued

cell.

• Grant: If an output receives any requests, it chooses the one that appears next in a fixed

round-robin schedule starting from the highest priority element. The output notifies each

input whether or not its request was granted.

• Accept: If an input receives a grant, it accepts the one that appears next in a fixed round-

robin schedule starting from its own output pointer. The output pointer of the round-robin

schedule is incremented to one location after the accepted output.

The iSLIP has some advantages over the PIM approach. First of all, it performs better than

PIM even under heavy load and is relatively easy to implement [PT03]. It can achieve 100 %

29

throughput in a single iteration for uniform traffic [McK99]. PIM has the problem that random

selection is difficult to achieve. Another issue is that iSLIP is more fair by virtue of its round-

robin schedule.

McKeown has recently developed another enhanced variant of the iSLIP algorithm called

ESLIP [McK97] which can handle unicast and multicast traffic simultaneously. It is discussed in

the case-study at the end of section 4 as it is employed in the Cisco 12000 series internet router.

Lowest Occupancy Output First Algorithm - LOOFA

Another algorithm called LOOFA follows an iterative approach, too [KPC+99]. It consists of the

following two steps which are performed at every iteration:

1. Each unmatched input sends a request to an output with the lowest occupancy among those

for which it has at least one queued cell.

2. Each output, upon receiving requests from multiple inputs, selects one and sends a grant

to that input.

Unlike PIM or iSLIP this algorithm requiresO(N ) iterations to perform correctly. The be-

haviour under heavy load when there is only time for fewer iterations is unpredictable. This

makes LOOFA unsuitable for use in high speed implementations. However, it is used as the

basis for another algorithm which does not have this limitation and provides almost similar per-

formance. This derived algorithm is called Approximate LOOFA (A-LOOFA) and further in-

formation about it can be found in the paper “Stress-Resistant Scheduling Algorithms for CIOQ

Switches” [KPC+99].

3.3.5 Quality-of-service via hierarchical scheduling

The current development and research concentrates on more support for quality-of-service in

routers. Therefore we give some general information about possible methods for this support.

There are different aspects of resource control or quality-of-service in modern routers. These

are listed in the following with an example of what it means (taken from [GM01]):

• Packet Filtering:Deny all traffic from ISP3 (on interface X) destined to E2.

30

• Policy Routing:Send all voice-over-IP traffic arriving from E1 (on interface Y) and des-

tined to E2 via a separate ATM network.

• Accounting & Billing:Treat all video traffic to E1 (via interface Y) as highest priority and

perform accounting for the traffic sent this way.

• Traffic Rate Limiting:Ensure that ISP2 does not inject more than 10Mbps of email traffic

and 50Mbps of total traffic on interface X.

• Traffic Shaping:Ensure that no more than 50Mbps of web traffic is injected into ISP2 on

interface X.

Handling all these requirements claims for a special scheduling. As we have seen, input

queueing demands a scheduler inside the switch anyway, therefore it is good to introduce the

notion ofhierarchical scheduling. Hereby the work can be split into three stages on one hand,

and into multiple levels on the other hand [Kat03]. Looking at the stages, thepolicer is the first

who analyses incoming packets, drops any non-conforming packets and puts them in several

queues for the next steps. Then theregulatorwill ensure the correct delay of packets according

to the contract. Finally theschedulerwill select the appropriate packets among all those who are

eligible and send them out. Very often only the policer and scheduler are present, the regulator

is not as much required as the other two stages.

The levels are used because different algorithms are needed to implement different poli-

cies. These work on the different levels and compose the hierarchical scheduler. Some “sub-

schedulers” operate on distinct queues, e.g. sorted by traffic priority. The results of those sched-

ulers are passed to another scheduler selecting the actual packet for the transmission. This can

be carried on in multiple stages.

There are three prominent kinds of scheduling: static priority scheduling is the most simple

form, which always selects the packets with the highest priority. If higher priority queues never

get empty, packets with lower priority will never be served, which is not wanted. To overcome

this problem, the order of priorities can be changed into different time slots, which represent

the percentage of throughput allowed for certain packets. The round-robin scheduling is another

form with better results. Lastly, weighted round robin (WRR) can produce smoother output,

because different classes of packets do not come up in groups but change more often. This

improves the delay for most of the packets. Another name for that is Weighted Fair Queueing

(WFQ).

31

3.4 Routing table lookup optimisations

Performing routing lookups is one of the time critical tasks a router has to do. Typically, the data

structure for storing routes is a tree, where each path in the tree from root to leaf corresponds to

an entry in the routing table. In order to figure out where a given packet has to go, the longest

matching prefix of the destination address in the packet must be found in the tree. This is achieved

by finding the longest path in the tree that matches the destination address of the incoming packet.

For example, if the destination address is 192.168.1.2 and two prefixes, 192.168 and 192.168.1,

can be found in the tree, the longest matching prefix is 192.168.1.

The algorithms solving this problem9 have a complexity proportional to the number of ad-

dress bits of the destination address. This means that in the worst case it takes time proportional

to the length of the destination address to find the longest matching prefix. For example IP ver-

sion 4 addresses have 32 bits and it might take up to 32 bit comparisons to find a matching entry.

Performance even gets worse when the algorithms need to backtrack.

Different approaches exist to optimise the routing table lookups. They can be loosely cate-

gorised in (a) use of better algorithms (b) avoid routing table lookups by using caches (c) com-

press routing table (d) hardware based techniques.

As we have seen earlier routing tables contain tens of thousands of entries. One idea to opti-

mise the lookups in those big tables is to use a hash table as a front-end to the table. Waldvogel

[M. 97] has presented a scalable hash based algorithm that can look up the longest prefix match

for an N bit address inO(logN) steps. This is achieved by maintaining a hash table for every

possible prefix length.

In order to optimise the time consuming routing table lookups it also came up to avoid routing

lookups and use route caches instead. That means the routing table is only consulted when there

is no appropriate entry in the route cache. Looking up an entry in the route cache is far easier

because only exact matching is used and not best matching as for the routing table. Route caches

are only useful when there is enough locality in traffic in other words when the cache hit rate is

high enough because the destinations of packets do not change very often.

The third approach is to somehow compress the routing table in order to make the lookups

faster [M.D97]. This is achieved by exploiting the sparse distribution of route entries in the

space of all possible network addresses to build a complicated but compact data structure for

the routing table. The address space is partitioned into three levels. Each level uses a separate

9Patricia: Practical Algorithm to Retrieve Information Coded in Alphanumeric

32

encoding scheme to compress the tree structure. The resulting data structure requires only 150-

160 KBytes to represent a table with 40000 routes [Tze98]. This table is small enough to fit into

the data cache of high-end processors and allows route lookups at gigabit speeds.

Another alternative is to put the routing table lookup in a so called content-addressable-

memory (CAM) in order to provide a constant lookup rate. Unlike conventional memory this

type of memory is able to operate in two different modes. In one mode, they act like standard

memory and may be used to store binary data. The other mode is called matching mode because

it permits searching of all the data in the CAM in parallel [GLD00]. Still this kind of memory is

very expensive and therefore not useful for large routing tables. Aside from using CAMs, putting

the route table lookup logic in hardware is also under active investigation.

3.5 Case study: Cisco 12000 series

This case study is based on the paper by Nick McKeown [McK97] where he focuses on the

design of the Cisco 12000 series interconnection unit and on online resources found on the Cisco

web site (http://www.cisco.com).

As can be seen in figure 17 the Cisco 12000 series consists of five different models:

Model Slots Switching Capacity

12416 16 320 Gbit/s12410 10 200 Gbit/s12406 6 120 Gbit/s12404 4 80 Gbit/s12016 16 80 (upgradeable to 320) Gbit/s

Figure 17: Cisco 12000 series

In the following we will concentrate on the 12416 because it is the model with highest switch-

ing capacity. It has 16 slots each of them operating at 10 Gbit/s (20 Gbit/s in both directions)

which provide an aggregate switching capacity of 320 Gbit/s. Every slot is equipped with a

hardware forwarding engine in order to quickly compute the destination of packets as soon as

they arrive at the router. The architecture of the 12416 router resembles the variant F in our

comparison table 6.

The interconnection unit used is a crossbar switch which operates with fixed length packets

(“cells”) and a speedup of 2. Due to their simplicity, switches can operate at very high speed and

33

Figure 18: Cisco 12416

furthermore have the advantage of supporting multiple transactions simultaneously. As we have

seen above they are internally non-blocking which means that it allows parallel input and output

to and from the network interfaces.

A centralised scheduler controls the switch and is responsible for selecting a configuration

of the switch appropriate for delivering cells from inputs to outputs at any cell time. Each step

of the scheduler takes exactly one cell time. The scheduler uses the ESLIP algorithm, which we

will explain further down, to schedule unicast and multicast traffic. Scheduling multicast traffic

is achieved by using fanout-splitting where copies of the input cells may be delivered to output

ports at any cell time and not at the same cell time as it is the case when no fanout-splitting is

used. The fanout of a cell is the number of outputs to which an input cell is to be delivered.

Unicast cells always have a fanout of one and multicast cells have a fanout greater than one.

Fanout splitting can somehow be categorised as a best effort method because cells that can be

delivered to output ports will get delivered as soon as the output ports are free.

It is also worth mentioning that the switch has two different kinds of queues for unicast

and multicast traffic. Virtual output queues are used for unicast cells to eliminate head-of-line-

blocking and multicast queues are used to enable simple packet replication for multicast traffic

34

Figure 19: Unicast and multicast queues [McK97]

as depicted in figure 19.

ESLIP scheduling algorithm

The ESLIP scheduling algorithm is a modified version of the iSLIP algorithm. ESLIP handles

unicast and multicast queues at the same time [McK97] as opposed to iSLIP which can only

handle unicast traffic.

The problem which all scheduling algorithms have to solve is to determine which cell waiting

in an input queue is going to be delivered next. The algorithm at hand starts with an initial

switch configuration without any connections between ports. It is an iterative algorithm with

each iteration consisting of three steps (1) request, (2) grant and (3) accept. At each iteration the

configuration of the switch is modified. The main idea behind the algorithm is that each input

port with cells waiting to be delivered should ask the appropriate output ports whether they are

willing to receive. After the output ports have signalised their state, the input port sends the cell.

Due to the fact that either a multicast cell or a unicast cell can be delivered at one cell time,

the scheduler needs to remember which of the input ports had already sent a cell and which of

the output ports had already received a cell. For unicast cells this is achieved by maintaining

grant pointersugi to the input ports at the output ports and maintaining accept pointersuaj to the

output ports at the input ports. For multicast cells all output ports have a common grant pointer

mg and all input ports have a common accept pointerma.

Now follows a detailed description of the steps:

35

• Request:In this step each input which has cells waiting to be delivered sends requests to

every output for which it has data. This applies to unicast and multicast cells. If the switch

uses prioritisation the request can also be prioritised.

• Grant: When an output receives any requests it chooses the one that appears next in fixed

round-robin schedule starting from the highest priority element. For determining the next

input the pointersugi and the pointermg are used (see above). Each input then is informed

whether its requests is serviced or not.

• Accept: If an input receives a grant it accepts the one that appears next in a round-robin

order, starting from the highest priority element. If a unicast cell can now be delivered the

pointersuaj andugi are incremented (j is the current output which sent the grant and i the

current input). The common pointersmg andma only get incremented when all the cells

of a multicast request could be delivered.

The algorithm stops when a configuration of the switch is reached where all active inputs and

outputs are correctly connected to each other.

4 Conclusion and Outlook

The look into a high-performance IP router exposed several central insights. Routers need a

very specialised architecture to fulfil the demands of todays networks, the normal computer

architecture is only applicable to routers for local area networks with low bandwidth. The modern

layout is based on a central switch, which has forced research in this area, but also gained from

it. Likewise, to improve the process of forwarding, many ideas on how to decrease the time it

takes to look up an entry in the routing table lookup have been proposed.

Still some issues remain open. Like we have seen in section 3.3.5 better and simpler imple-

mentation of QoS schedulers are necessary in order to satisfy all customer requirements in this

area such as traffic rate limiting or traffic shaping. Especially accounting of a particular type of

packets is difficult to achieve.

As more and more devices (PDAs, MDAs) get connected to the internet IPv4 will be replaced

by IPv6 which in turn requires routing of IPv6 traffic at gigabit or terabit speed. Due to the fact

that IPv6 address are 128 bit long the routing table size increases and route lookups become more

complex. Moreover the table size also increases by virtue of the constant growth of the networks

36

especially of the internet. This will require efficient lookup algorithms and intelligent methods

like content-addressable-memory for storing the table.

From our point of view the next generation of routers will still be switch based because

packet switching technology using crossbar switches is not yet exhausted. We have seen that

the overall design using switches is scalable; faster memories and crossbars can simply provide

more performance, linearly. The usage of multiple parallel crossbars can increase the throughput

of switch fabrics and applying Combined Input-Output Queueing makes it possible to use WFQ

scheduling algorithms to support better QoS control.

In our paper we have only examined the properties of electronic packet-switched routers but

researchers are trying to use optical components to build even faster routers. Recently a project

at Stanford University has started to proof that building a 100Tb/s internet router with an optical

switch fabric is possible. This approach has several advantages over the electronic approach.

Since optical components are already used inside routers in order to interconnect components, it

is tried to reduce the number of conversions between the optical and the electronic domain and

therefore increase processing power [McK03].

Studying the anatomy of high performance routers is challenging and fascinating at the same

time as it requires deep knowledge of many different domains in order to fully understand the

decisions of the router architects.

Glossary

2DRR 2 Dimensional Round Robin

ASIC Application Specific Integrated Circuit

BGP Border Gateway Protocol - RFC1771

CAM Content Addressable Memory

CIOQ Combined Input Output Queueing

CoS Classes of Service

CPU Central Processing Unit

DWDM Dense Wavelength Division Multiplexing

37

HOL Head Of Line

ICMP Internet Control Message Protocol - RFC792

IP Internet Protocol - RFC791

LOOFA Lowest Occupancy Output First Algorithm

MDA Mobile Digital Assistant

MTU Maximum Transfer Unit

NAT Network Address Translation

OSPF Open Shortest Path First - RFC2328

PDA Personal Digital Assistant

PIM Parallel Iterative Matching

QoS Quality of Service

RIP Routing Information Protocol - RFC1058

SDRAM Synchronous Dynamic Random Access Memory

TTL Time To Live

UDP User Datagram Protocol - RFC768

VOQ Virtual Output Queue

WFQ Weighted Fair Queueing

WRED Weighted Random Early Detection

References

[AOST93] T. Anderson, S. Owicki, J. Saxe, and C. Thacker. High-speed switch scheduling for

localarea networks, 1993.

[Awe99] James Aweya. IP Router Architectures: An Overview, 1999.

38

[Bak95] F. Baker. Requirements for IP Version 4 Routers, June 1995. IETF RFC 1812.

[BBP88] R. Braden, D. Borman, and C. Partridge. Computing the Internet Checksum,

September 1988. IETF RFC 1071.

[CGMP99] Shang-Tse Chuang, Ashish Goel, Nick McKeown, and Balaji Prabhakar. Matching

Output Queueing with a Combined Input Output Queued Switch.IEEE Journal on

Selected Areas in Communications, 17(6):1030–1039, December 1999.

[GLD00] Steven Guccione, Delon Levi, and Daniel Downs. A Reconfigurable Content Ad-

dressable Memory. InProceedings of International Parallel and Distributed Pro-

cessing Symposium, May 2000.

[GM01] Pankaj Gupta and Nick McKeown. Algorithms for Packet Classification.IEEE

Network, March 2001.

[HK88] M. Hluchyj and M. Karol. Queueing in high-performance packet switching.IEEE

Journal on Selected Areas in Communication, 6(9):1587–1597, December 1988.

[Kat03] M. Katevenis. Packet Switch Architecture lecture, 2003.

[KPC+99] P. Krishna, N. Patel, A. Charny, et al. On the Speedup Required for Work-

Conserving Crossbar Switches.IEEE Journal on Selected Areas of Communica-

tions, June 1999.

[LS94] Richard LaMaire and Dimitrios Serpanons. Two-Dimensional Round-Robin Sched-

ulers for Packet Switches with Multiple Input Queues.IEEE Transactions on Net-

working, October 1994.

[M. 97] M. Waldvogel and G. Varghese and J.Turner and others. Scalable High Speed IP

Routing Lookups. InProceedings of ACM SIGCOMM 97, September 1997.

[Mal98] G. Malkin. RIP Version 2, November 1998. IETF RFC 2453.

[McK95] N. McKeown. Scheduling algorithms for input-queued cell switches, 1995.

[McK97] Nick McKeown. Fast Switched Backplane for a Gigabit Switched Router. Technical

report, Department of Electrical Engineering, Standford University, 1997.

[McK99] Nick McKeown. iSLIP: A Scheduling Algorithm for Input-Queued Switches.IEEE

Transactions on Networking, April 1999.

39

[McK03] Nick McKeown. Optics inside Routers, September 2003.

[MD90] J. Mogul and S. Deering. Path MTU Discovery, April 1990. IETF RFC 1191.

[M.D97] M.Degermark and A. Brodnik and S. Carlsson. Small Forwarding Tables for Fast

Route Lookups. InProceedings of ACM SIGCOMM 97, September 1997.

[MIM +97] Nick McKeown, Martin Izzard, Adisak Mekkittikul, William Ellersick, and Mark

Horowitz. Tiny Tera: A packet switch core — using new scheduling algorithms to

build a 1-terabits packet switch with a central hub no larger than a can of soda.IEEE

Micro, 17(1):26–33, /1997.

[MK90] T. Mallory and A. Kullberg. Incremental Updating the Internet Checksum, Januar

1990. IETF RFC 1141.

[Moy98] J. Moy. OSPF: Anatomy of an Internet Routing Protocol. Addison-Wesley, 1998.

[Pos81] J. Postel. Internet protocol, September 1981. IETF RFC 791.

[PT03] Prashantu Pappu and Jonathan Turner. Stress-Resistant Scheduling Algorithms for

CIOQ Switches. InProceedings of ICNF, November 2003.

[RL95] Y. Rekhter and T. Li. A Border Gateway Protocol 4 (BGP-4), March 1995. IETF

RFC 1771.

[S+96] A. Singhal et al. Gigaplane: A High Performance Bus for Large SMPs. InProc. of

Hot Interconnects IV, Stanford, California, USA, 1996.

[Tze98] H. Tzeng. Longest Prefix Search Using Compressed Trees. InProceedings of IEEE

Globecom, November 1998.

[YHA87] Y. Yeh, M. Hluchyj, and A. Acampora. The knockout switch: a simple, modular ar-

chitecture for high-performance packet switching.IEEE Journal on Selected Areas

in Communications, 5(8):1274–1283, October 1987.

40

anatomy of a high performance ip routercomlab/seminar/ovs1_ref5.pdf · anatomy of a high...

Documents