Switching and Router Design
2007. 10
A generic switch
Classification
Packet vs. circuit switches
  packets have headers; samples don't
Connectionless vs. connection-oriented
  connection-oriented switches need a call setup; setup is handled in the control plane by the switch controller
  connectionless switches deal with self-contained datagrams
                 Connectionless (router)    Connection-oriented (switching system)
Packet switch    Internet router            ATM switching system
Circuit switch   —                          Telephone switching system
Requirements
Capacity of a switch is the maximum rate at which it can move information, assuming all data paths are simultaneously active
Primary goal: maximize capacity, subject to cost and reliability constraints
Circuit switch: must reject a call if it can't find a path for samples from input to output
  goal: minimize call blocking
Packet switch: must reject a packet if it can't find a buffer to store it awaiting access to the output trunk
  goal: minimize packet loss
Don't reorder packets
Outline
Circuit switching
Packet switching
Switch generations
Switch fabrics
Buffer placement (Architecture)
Schedule the fabric / crossbar / backplane
Routing lookup
Packet switching
In a packet switch, for every packet you must:
  Do a routing lookup: where (which port) to send it
    Datagram: lookup based on the entire destination address (packets carry a destination field)
    Cell: lookup based on VCI
  Schedule the fabric / crossbar / backplane
  Maybe buffer, maybe QoS, maybe filtering by ACLs
Back-of-the-envelope numbers
  Line cards can be 40 Gbit/sec today (OC-768)
  To handle minimum-sized packets (~40 B): 125 Mpps, or 8 ns per packet
  But note that this can be deeply pipelined, at the cost of buffering and complexity. Some lookup chips do this, though still with SRAM, not DRAM. Good lookup algorithms are still needed.
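The back-of-the-envelope arithmetic above can be checked directly:

```python
line_rate = 40e9            # bits/sec on an OC-768 line card
pkt_bits = 40 * 8           # minimum-sized packet, ~40 bytes
pps = line_rate / pkt_bits  # packets per second the card must handle
ns_per_pkt = 1e9 / pps      # time budget per packet

print(pps, ns_per_pkt)      # 125 Mpps, 8.0 ns per packet
```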
Router Architecture
Control Plane
  How routing protocols establish routes, etc.
  Port mappers
Data Plane
  How packets get forwarded
Outline
Circuit switching
Packet switching
Switch generations
Switch fabrics
Buffer placement
Schedule the fabric / crossbar / backplane
Routing lookup
First generation switch - shared memory
Line card DMAs the packet into a buffer; CPU examines the header, has output DMA it out
Bottleneck can be the CPU, host adapter, or I/O bus, depending
Most Ethernet switches and cheap packet routers
Low-cost routers; speed: 300 Mbps – 1 Gbps
Example (First generation switch)
Today a router built with a 1.33 GHz CPU - bottleneck
Mean packet size 500 B; word (4 B) access takes 15 ns
Bus ~ 100 MHz, memory ~ 5 ns
1) Interrupt takes 2.5 µs per packet
2) Per-packet processing time (200 instructions) = 0.15 µs
3) Copying the packet takes 500/4 * 33 ns = 4.1 µs
   (4 instructions + 2 memory accesses = 33 ns per 4 B word)
Total time = 2.5 + 0.15 + 4.1 = 6.75 µs
=> speed is (500 x 8 / 6.75) ≈ 600 Mbps
Amortized interrupt cost balanced by routing protocol cost
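Redoing the slide's arithmetic (the slide rounds the copy time to 4.1 µs and the result to ~600 Mbps):

```python
words = 500 / 4                   # 4-byte words in a 500 B packet
copy_us = words * 33 / 1000       # 33 ns per word -> ~4.1 us
total_us = 2.5 + 0.15 + copy_us   # interrupt + per-packet processing + copy
mbps = 500 * 8 / total_us         # bits per microsecond = Mbit/s
print(total_us, mbps)             # ~6.775 us, ~590 Mbps
```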
Second generation switch - shared bus
Port mapping (forwarding decisions) in line cards
  direct transfer over the bus between line cards
  if no route in the line card --> CPU ("slow" operations)
Bottleneck is the bus
Medium-end routers, switches, ATM switches
Speed: 10 Gbps (> 8x a first-generation switch)
Third generation switches - point-to-point (switched) bus (fabric)
Third generation (contd.)
Bottleneck in the second generation switch is the bus
Third generation switch provides parallel paths (fabric)
Features
  self-routing fabric (+ tag)
  output buffer is a point of contention, unless we arbitrate access to the fabric
  potential for unlimited scaling, as long as we can resolve contention for the output buffer
High-end routers, switches
Speed: 1000 Gbps
Outline
Circuit switching
Packet switching
Switch generations
Switch fabrics
Buffer placement
Schedule the fabric / crossbar / backplane
Routing lookup
Buffered crossbar
What happens if packets at two inputs both want to go to the same output?
  Can defer one at an input buffer
  Or, buffer the crosspoints
Broadcast
Packets are tagged with the output port #
Each output matches tags
Need to match N addresses in parallel at each output
Useful only for small switches, or as a stage in a large switch
Switch fabric element
Can build complicated fabrics from a simple element
Self-routing rule: if tag = 0, send the packet to the upper output, else to the lower output
If both packets go to the same output, buffer or drop
Fabrics built with switching elements
An NxN switch with bxb elements has log_b N stages with N/b elements per stage
  ex: an 8x8 switch with 2x2 elements has 3 stages with 4 elements per stage
Fabric is self-routing
Recursive
Can be synchronous or asynchronous
Regular and suitable for VLSI implementation
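A minimal sketch of the self-routing rule above, assuming a log2(N)-stage banyan of 2x2 elements in which stage i examines bit i of the destination-port tag, most significant bit first:

```python
def banyan_route(dest_port: int, n_stages: int):
    """Output chosen at each 2x2 element: bit 0 -> upper output, 1 -> lower."""
    return [(dest_port >> (n_stages - 1 - i)) & 1 for i in range(n_stages)]

# 8x8 switch (3 stages): destination port 5 = 0b101 -> lower, upper, lower
print(banyan_route(5, 3))  # [1, 0, 1]
```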
The fabric for Batcher-banyan switch
An example using Batcher-banyan switch
Outline
Circuit switching
Packet switching
Switch generations
Switch fabrics
Buffer placement (Architecture)
Schedule the fabric / crossbar / backplane
Routing lookup
Speedup
C – input/output link capacity
R_I – maximum rate at which an input interface can send data into the backplane
R_O – maximum rate at which an output can read data from the backplane
B – maximum aggregate backplane transfer rate
(Figure: input and output interfaces, each on a link of capacity C, joined by the interconnection medium (backplane); inputs inject at up to R_I, outputs drain at up to R_O.)
Generic Router Architecture
Backplane speedup: B/C
Input speedup: R_I/C
Output speedup: R_O/C
Buffering - three router architectures
Where should we place buffers?
  Output queued (OQ)
  Input queued (IQ)
  Combined input-output queued (CIOQ)
Function division
  Input interfaces: must perform packet forwarding – need to know to which output interface to send packets; may enqueue packets and perform scheduling
  Output interfaces: may enqueue packets and perform scheduling
Output Queued (OQ) Routers
Only output interfaces store packets
Advantages
  Easy to design algorithms: only one congestion point
Disadvantages
  Requires an output speedup of N, where N is the number of interfaces --> not feasible
Input Queueing (IQ) Routers
Only input interfaces store packets
Advantages
  Easy to build: store packets at inputs if there is contention at outputs
  Relatively easy to design algorithms: only one congestion point, but not at the output… need to implement backpressure
Disadvantages
  In general, hard to achieve high utilization
  However, theoretical and simulation results show that for realistic traffic an input/output speedup of 2 is enough to achieve utilizations close to 1
Combined Input-Output Queueing (CIOQ) Routers
Both input and output interfaces store packets
Advantages
  Easy to build
  Utilization 1 can be achieved with an input/output speedup (<= 2)
Disadvantages
  Harder to design algorithms
  Two congestion points; need to design flow control
With an input/output speedup of 2, a CIOQ switch can emulate any work-conserving OQ switch [G+98, SZ98]
Generic Architecture of a High Speed Router Today
CIOQ - combined input-output queued architecture
  Input/output speedup <= 2
Input interface: performs packet forwarding (and classification)
Output interface: performs packet (classification and) scheduling
Backplane / fabric
  Point-to-point (switched) bus; speedup N
  Schedules packet transfers from inputs to outputs
Outline
Circuit switching
Packet switching
Switch generations
Switch fabrics
Buffer placement
Schedule the fabric / crossbar / backplane
Routing lookup
Backplane / Fabric / Crossbar
A point-to-point switch allows simultaneous transfer of packets between any disjoint pairs of input-output interfaces
Goal: come up with a schedule that
  Maximizes router throughput
  Meets flow QoS requirements
Challenges:
  Address head-of-line blocking at inputs
  Resolve input/output speedup contention
  Avoid packet dropping at outputs if possible
Note: packets are fragmented into fixed-size cells (why?) at inputs and reassembled at outputs
  In Partridge et al., a cell is 64 B (what are the trade-offs?)
Head-of-line Blocking
The cell at the head of an input queue cannot be transferred, thus blocking the cells behind it
  Cannot be transferred because the output buffer is full
  Cannot be transferred because it is blocked by the red cell
(Figure: inputs 1–3 feeding outputs 1–3; a blocked head cell holds up cells bound for idle outputs.)
To Avoid Head-of-line Blocking
Head-of-line blocking with only 1 queue per input
  Max throughput <= (2 - sqrt(2)) =~ 58%
Solution? Maintain at each input N virtual queues, i.e., one per output
  Requires N queues; more if QoS
  The way it's done now
(Figure: each of inputs 1–3 holds a separate virtual queue per output 1–3.)
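The 58% bound and the virtual-output-queue layout can be made concrete; the queue structure below is an illustrative sketch, not any particular router's implementation:

```python
import math
from collections import deque

# Saturation throughput of a single FIFO per input under uniform traffic:
hol_limit = 2 - math.sqrt(2)
print(round(hol_limit, 3))  # 0.586

# Virtual output queues: N queues per input, one per output, so a blocked
# head cell for one output never holds up cells bound for another.
N = 3
voq = [[deque() for _ in range(N)] for _ in range(N)]  # voq[input][output]
voq[0][2].append("cell-a")   # input 0, destined to output 2
voq[0][1].append("cell-b")   # same input, different output: independent queue
```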
Cell transfer
Schedule: ideally, find the maximum number of input-output pairs such that:
  Input/output contentions are resolved
  Packet drops at outputs are avoided
  Packets meet their time constraints (e.g., deadlines), if any
Example:
  Use stable matching
  Try to emulate an OQ switch
Stable Marriage Problem
Consider N women and N men
Each woman/man ranks each man/woman in the order of their preferences
Stable matching: a matching with no blocking pairs
Blocking pair: let p(i) denote the partner of i. There are matched pairs (k, p(k)) and (j, p(j)) such that k prefers p(j) to p(k), and p(j) prefers k to j
Gale-Shapley Algorithm (GSA)
As long as there is a free man m:
  m proposes to the highest-ranked woman w in his list to whom he hasn't proposed yet
  If w is free, m and w are engaged
  If w is engaged to m' and w prefers m to m', w releases m'
  Otherwise m remains free
A stable matching exists for every set of preference lists
Complexity: worst-case O(N^2)
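A runnable sketch of the algorithm as stated above, assuming preferences are given as lists of indices 0..N-1 and men propose:

```python
def gale_shapley(men_pref, women_pref):
    n = len(men_pref)
    # rank[w][m] = position of man m in woman w's list (lower = preferred)
    rank = [{m: r for r, m in enumerate(prefs)} for prefs in women_pref]
    next_choice = [0] * n     # next index into each man's preference list
    fiance = [None] * n       # fiance[w] = man currently engaged to woman w
    free_men = list(range(n))
    while free_men:
        m = free_men.pop()
        w = men_pref[m][next_choice[m]]   # highest-ranked not yet proposed to
        next_choice[m] += 1
        cur = fiance[w]
        if cur is None:
            fiance[w] = m                 # w was free: engage
        elif rank[w][m] < rank[w][cur]:
            fiance[w] = m                 # w prefers m: release m'
            free_men.append(cur)
        else:
            free_men.append(m)            # m stays free, tries his next choice
    match = [None] * n                    # match[m] = woman engaged to man m
    for w, m in enumerate(fiance):
        match[m] = w
    return match

# Both men prefer woman 0; woman 0 prefers man 1.
print(gale_shapley([[0, 1], [0, 1]], [[1, 0], [0, 1]]))  # [1, 0]
```

The result pairs man 1 with woman 0 and man 0 with woman 1; no blocking pair exists.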
OQ Emulation with a Speedup of 2
Each input and output maintains a preference list
  Input preference list: the cells at that input, ordered in the inverse order of their arrival
  Output preference list: all input cells to be forwarded to that output, ordered by the times they would be served in an output-queueing schedule
Use GSA to match inputs to outputs
  Outputs initiate the matching
Can emulate all work-conserving schedulers
Example with a Speedup of 2
(Figure: four snapshots (a)–(d) of cells a.1–a.2, b.1–b.3, c.1–c.3 queued at inputs a, b, c and matched to outputs 1–3; panel (d) shows the state after step 1.)
A Case Study [Partridge et al '98]
Goal: show that routers can keep pace with improvements in transmission link bandwidths
Architecture
  A CIOQ router
  15 (input/output) line cards: C = 2.4 Gbps (3.3 Gbps including packet headers)
  Each input card can handle up to 16 (input/output) interfaces
  Separate forward engines (FEs) perform routing
  Backplane: point-to-point (switched) bus, capacity B = 50 Gbps (32 Mpps)
  B/C = 50/2.4 ≈ 20
Router Architecture
(Figure: data enters and leaves through the 15 line cards (input/output interfaces); packet headers cross the backplane to the forward engines, while the network processor handles control data (e.g., routing), updates the routing tables, and sets scheduling (QoS) state.)
Router Architecture: Data Plane
Line cards
  Input processing: can handle input links up to 2.4 Gbps
  Output processing: uses a 52 MHz FPGA (Field Programmable Gate Array); implements QoS
Forward engine:
  415-MHz DEC Alpha 21164 processor, three-level cache to store recent routes
  Up to 12,000 routes in the second-level cache (96 kB); ~95% hit rate
  Entire routing table in the tertiary cache (16 MB, divided into two banks)
Router Architecture: Control Plane
Network processor: 233-MHz 21064 Alpha running NetBSD 1.1
  Updates routing
  Manages link status
  Implements reservation
Backplane allocator: implemented by an FPGA
  The allocator is the heart of the high-speed switch
  Schedules transfers between input/output interfaces
Control Plane: Backplane Allocator
Time divided into epochs: 16 ticks of the data clock (8 allocation clocks)
Transfer unit: 64 B (8 data clock ticks)
Up to 15 simultaneous transfers in an epoch
  One transfer: 128 B of data + 176 auxiliary bits
Minimum of 4 epochs to schedule and complete a transfer, but scheduling is pipelined:
  1. Source card signals that it has data to send to the destination card
  2. Switch allocator schedules the transfer
  3. Source and destination cards are notified and told to configure themselves
  4. Transfer takes place
Flow control through inhibit pins
Early Crossbar Scheduling Algorithm
Wavefront algorithm
  Problems: fairness, speed, …
Scanning every position in sequence is slow! (36 "cycles")
Observation: (2,1) and (1,2) don't conflict with each other (11 "cycles")
Do in groups, with groups in parallel (5 "cycles")
Can find the optimal group size, etc.
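A sketch of the wavefront idea over a simple NxN request matrix. Cells on the same anti-diagonal touch disjoint rows and columns, so hardware evaluates each wavefront in parallel; here the waves are just iterated:

```python
def wavefront(requests):
    """Grant (input, output) pairs wave by wave along anti-diagonals."""
    n = len(requests)
    row_free = [True] * n
    col_free = [True] * n
    grants = []
    for wave in range(2 * n - 1):          # anti-diagonals i + j = wave
        for i in range(n):
            j = wave - i
            if 0 <= j < n and requests[i][j] and row_free[i] and col_free[j]:
                grants.append((i, j))
                row_free[i] = col_free[j] = False
    return grants

# Inputs 0 and 1 both request output 0; input 1 also requests output 1.
print(wavefront([[1, 0], [1, 1]]))  # [(0, 0), (1, 1)]
```

Note how input 0 wins output 0 simply by being scanned first, which is exactly the low-numbered-source unfairness the next slide discusses.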
The Switch Allocator
Disadvantages of the simple allocator
  Unfair: a preference for low-numbered sources
  Requires evaluating 225 positions per epoch, which is too fast for an FPGA
Solution to the unfairness problem: random shuffling of sources and destinations
Solution to the timing problem: parallel evaluation of multiple locations
  Wavefront allocation requires 29 steps for a 15 x 15 matrix
  The MGR allocator uses 3 x 5 groups
Priority to requests from forwarding engines over line cards, to avoid header contention on the line cards
Alternatives to the Wavefront Scheduler
PIM: Parallel Iterative Matching
  Request: each input sends requests to all outputs for which it has packets
  Grant: each output selects a requesting input at random and grants it
  Accept: each input selects from its received grants
Problem: the matching may not be maximal
  Solution: run several iterations
Problem: the matching may not be "fair"
  Solution: grant/accept in round robin instead of at random
iSLIP – Round-robin Parallel Iterative Matching (PIM)
Each input maintains a round-robin list of outputs
Each output maintains a round-robin list of inputs
Request phase: inputs send requests to all desired outputs
Grant phase: each output picks the first requesting input in its round-robin sequence
Accept phase: each input picks the first granting output from its round-robin sequence
  An output updates its round-robin pointer only if its grant is accepted
Good fairness in simulation
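One iteration of the round-robin grant/accept phases can be sketched as follows (a simplified illustration, not the full iSLIP specification, which also distinguishes pointer updates across iterations):

```python
def islip_iteration(requests, grant_ptr, accept_ptr):
    """requests[i][j]: input i has a cell for output j; pointers are per-port."""
    n = len(requests)
    # Grant phase: each output picks the first requesting input
    # at or after its round-robin pointer.
    grants = {}
    for j in range(n):
        for k in range(n):
            i = (grant_ptr[j] + k) % n
            if requests[i][j]:
                grants.setdefault(i, []).append(j)
                break
    # Accept phase: each granted input picks the first granting output
    # at or after its round-robin pointer.
    matches = []
    for i, outs in grants.items():
        for k in range(n):
            j = (accept_ptr[i] + k) % n
            if j in outs:
                matches.append((i, j))
                grant_ptr[j] = (i + 1) % n   # advance only when the grant is accepted
                accept_ptr[i] = (j + 1) % n
                break
    return matches

gp, ap = [0, 0], [0, 0]
print(islip_iteration([[True, True], [True, True]], gp, ap))  # [(0, 0)]
```

In this first iteration both outputs grant input 0, so only one pair matches; running further iterations (or starting with desynchronized pointers) fills in the rest.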
100% Throughput?
Why does it matter?
  Guaranteed behavior regardless of load!
  Same reason we moved away from cache-based router architectures
Cool result:
  Dai & Prabhakar: any maximal matching scheme in a crossbar with 2x speedup gets 100% throughput
  Speedup: run the internal crossbar links at 2x the input & output link speeds
Summary: Design Decisions (Innovations)
1. Each FE has a complete set of routing tables
2. A switched fabric is used instead of the traditional shared bus
3. FEs are on boards distinct from the line cards
4. Use of an abstract link layer header
5. QoS processing included in the router
Outline
Circuit switching
Packet switching
Switch generations
Switch fabrics
Buffer placement
Schedule the fabric / crossbar / backplane
Routing lookup
Routing Lookup Problem / Port mappers
Identify the output interface on which to forward an incoming packet, based on the packet's destination address
Forwarding tables (a local version of the routing table) summarize information by maintaining a mapping between IP address prefixes and output interfaces
Route lookup: find the longest prefix in the table that matches the packet's destination address
Routing Lookups
Routing tables: 200,000 – 1M entries
Router must be able to handle the routing table load 5 years hence. Maybe 10.
So, how to do it?
  DRAM (Dynamic RAM, ~50 ns latency): cheap, slow
  SRAM (Static RAM, <5 ns latency): fast, $$
  TCAM (Ternary Content Addressable Memory – parallel lookups in hardware): really fast, quite $$, lots of power
Example
Packet with destination address 12.82.100.101 is sent to interface 2, as 12.82.100.xxx is the longest prefix matching the packet's destination address
Need to find the longest prefix match
A standard solution: tries

  prefix           interface
  …                …
  12.82.xxx.xxx    3
  128.16.120.xxx   1
  12.82.100.xxx    2

  (example lookups: 128.16.120.111 → 1; 12.82.100.101 → 2)

Use binary tree paths to encode prefixes
Advantage: simple to implement
Disadvantage: one lookup may take O(m), where m is the number of bits (32 in IPv4)
  Ex. m = 5
Two ways to improve performance:
  cache recently used addresses
  move common entries up to a higher level (match longer strings)
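A minimal binary-trie longest-prefix-match sketch; prefixes are written as bit strings, and the example prefixes/ports below are hypothetical, chosen only to exercise the longest-match rule:

```python
class TrieNode:
    def __init__(self):
        self.child = {}      # '0' / '1' -> TrieNode
        self.port = None     # output interface if a prefix ends here

def insert(root, prefix_bits, port):
    node = root
    for b in prefix_bits:
        node = node.child.setdefault(b, TrieNode())
    node.port = port

def lookup(root, addr_bits):
    """Walk the trie along the address, remembering the last prefix passed."""
    node, best = root, None
    for b in addr_bits:
        node = node.child.get(b)
        if node is None:
            break
        if node.port is not None:
            best = node.port        # longest match so far
    return best

root = TrieNode()
insert(root, "10", 3)
insert(root, "1011", 2)
print(lookup(root, "10110"))  # 2: the longer prefix "1011" wins over "10"
```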
Patricia Tries
(Figure: a Patricia trie over the prefixes 0100x, 1001xx, 10xxx, and 101100; internal nodes branch on 0/1, and the labels 2, 3, 5 mark bit positions tested, so chains of single-child nodes are skipped.)
How can we speed "longest prefix match" up?
Two general approaches:
  Shrink the table so it fits in really fast memory (cache)
    Degermark et al.; optional reading
    A complete prefix tree (every node has 2 or 0 children) can be compressed well. 3 stages: match 16 bits; match the next 8; match the last 8
  Drastically reduce the # of memory lookups
    WUSTL algorithm, ca. the same time (binary search on prefixes)
Lulea's Routing Lookup Algorithm
Small Forwarding Tables for Fast Routing Lookups (Sigcomm '97)
Minimize the number of memory accesses
Minimize the size of the data structure (why?)
Solution: use a three-level data structure
First Level: Bit-Vector
Cover all prefixes down to depth 16
Use one bit to encode each prefix
Memory requirements: 2^16 bits = 64 Kb = 8 KB
(Figure: the bit-vector marks both genuine heads and root heads.)
First Level: Pointers
Maintain 16-bit pointers
  2 bits encode the pointer type
  14 bits represent an index into the routing table or into an array containing level-two chunks
Pointers are stored at consecutive memory addresses
Problem: find the pointer
Example: find the pointer
(Figure: a bit vector 1 0 0 0 1 0 1 1 1 0 0 0 1 1 1 1; each set bit owns one entry in a pointer array (sample entries 0006abcd, 000acdef) that points either into the routing table or to a level-two chunk. Problem: find the pointer for a given bit.)
Code Word and Base Index Arrays
Split the bit-vector into bit-masks (16 bits each)
Find the corresponding bit-mask. How?
  Maintain a 16-bit code word for each bit-mask (10-bit value; 6-bit offset)
  Maintain a base index array (one 16-bit entry for each 4 code words)
(Figure: the code word array and base index array run alongside the bit-vector; the stored counts are the number of previous ones in the bit-vector.)
First Level: Finding the Pointer Group
Use the first 12 bits to index into the code word array
Use the first 10 bits to index into the base index array
(Figure: for address 004C, the first 12 bits select the code word and the first 10 bits select the base index; the pointer group index is 13 + 0 = 13.)
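The combination step can be sketched as below. This is a simplified stand-in with hypothetical names: the real scheme looks the per-bit offset up in maptable via the code word's 10-bit value, whereas here a popcount over the bit-mask plays that role:

```python
def pointer_index(base_index, codeword_offset, bitmask16, bit):
    """Index of the pointer for position `bit` (0-15) within a 16-bit mask.

    base_index      - from the base index array (per group of 4 code words)
    codeword_offset - 6-bit offset stored in this mask's code word
    bitmask16       - the 16-bit bit-mask itself
    """
    ones_before = bin(bitmask16 >> (16 - bit)).count("1")  # maptable stand-in
    return base_index + codeword_offset + ones_before

print(pointer_index(13, 2, 0b1000101110001111, 4))  # 16
```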
First Level: Encoding Bit-masks
Observation: not all 16-bit values are possible
  Example: the bit-mask 1001… is not possible (why not?)
Let a(n) be the number of non-zero bit-masks of length 2^n
Compute a(n) using the recurrence:
  a(0) = 1; a(n) = 1 + a(n-1)^2
For length 16, 678 possible values for bit-masks
This can be encoded in 10 bits
  These are the values r_i in the code words
Store all possible bit-masks in a table, called maptable
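Evaluating the recurrence gives a(4) = 677 non-zero bit-masks of length 16 (the slide's count of 678 presumably also admits the all-zero mask), which fits comfortably in 10 bits since 2^10 = 1024:

```python
def a(n):
    # a(0) = 1; a(n) = 1 + a(n-1)^2, as on the slide
    return 1 if n == 0 else 1 + a(n - 1) ** 2

print([a(i) for i in range(5)])  # [1, 2, 5, 26, 677]
```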
First Level: Finding the Pointer Index
Each entry in maptable is a 4-bit offset: the offset of the pointer within the group
Number of memory accesses: 3 (7 bytes accessed)
First Level: Memory Requirements
Code word array: one code word per bit-mask = 64 Kb
Base index array: one base index per four bit-masks = 16 Kb
Maptable: 677 x 16 entries, 4 bits each = ~43.3 Kb
Total: 123.3 Kb = 15.4 KB
First Level: Optimizations
Reduce the number of entries in maptable by two:
  Don't store bit-masks 0 and 1; instead encode the pointers directly into the code word
If the r value in a code word is larger than 676: direct encoding
For direct encoding, use the r value + the 6-bit offset
Levels 2 and 3
Levels 2 and 3 consist of chunks
A chunk covers a sub-tree of height 8: at most 256 heads
Three types of chunks
  Sparse: 1-8 heads; 8-bit indices, eight pointers (24 B)
  Dense: 9-64 heads; like level 1, but only one base index (< 162 B)
  Very dense: 65-256 heads; like level 1 (< 552 B)
Only 7 bytes are accessed to search each of levels 2 and 3
Notes
This data structure trades table construction time for lookup time (build time < 100 ms)
  A good trade-off, because routes are not supposed to change often
Lookup performance:
  Worst case: 101 cycles
    A 200 MHz (2 GHz) Pentium Pro can do at least 2 (20) million lookups per second
  On average: ~50 cycles
Open question: how effective is this data structure in the case of IPv6?
Other challenges in Routing
Routers do more than just "longest prefix match" & crossbar:
  Packet classification (L3 and up!)
  Counters & stats for measurement & debugging
  IPSec and VPNs
  QoS, packet shaping, policing
  IPv6
  Access control and filtering
  IP multicast
  AQM: RED, etc. (maybe)
  Serious QoS: DiffServ (sometimes), IntServ (not)
Going Forward
Today's highest end: multi-rack routers
  Measured in Tb/sec
  One scenario: a big optical switch connecting multiple electrical switches
    Cool design: McKeown Sigcomm 2003 paper
BBN MGR: normal CPU for forwarding
Modern routers: several ASICs