Switching and Router Design
2007. 10
A generic switch
Classification
Packet vs. circuit switches
  packets have headers; samples don't
Connectionless vs. connection-oriented
  connection-oriented switches need a call setup; setup is handled in the control plane by the switch controller
  connectionless switches deal with self-contained datagrams
                 Connectionless (router)    Connection-oriented (switching system)
Packet switch    Internet router            ATM switching system
Circuit switch   —                          Telephone switching system
Requirements
Capacity of a switch is the maximum rate at which it can move information, assuming all data paths are simultaneously active
Primary goal: maximize capacity, subject to cost and reliability constraints
Circuit switch: must reject a call if it can't find a path for samples from input to output
  goal: minimize call blocking
Packet switch: must reject a packet if it can't find a buffer to store it awaiting access to the output trunk
  goal: minimize packet loss
Don't reorder packets
Outline
Circuit switching
Packet switching
Switch generations
Switch fabrics
Buffer placement (Architecture)
Schedule the fabric / crossbar / backplane
Routing lookup
Packet switching
In a packet switch, for every packet you must:
  Do a routing lookup: where (which port) to send it
    Datagram: lookup based on the entire destination address (packets carry a destination field)
    Cell: lookup based on VCI
  Schedule the fabric / crossbar / backplane
  Maybe buffer, maybe QoS, maybe filtering by ACLs
Back-of-the-envelope numbers
  Line cards can be 40 Gbit/sec today (OC-768)
  To handle minimum-sized packets (~40 B): 125 Mpps, or 8 ns per packet
  But note that this can be deeply pipelined, at the cost of buffering and complexity. Some lookup chips do this, though still with SRAM, not DRAM. Good lookup algorithms are still needed.
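The back-of-the-envelope arithmetic above can be checked directly:

```python
line_rate = 40e9            # bits/sec on an OC-768 line card
pkt_bits = 40 * 8           # minimum-sized packet, ~40 bytes
pps = line_rate / pkt_bits  # packets per second the card must handle
ns_per_pkt = 1e9 / pps      # time budget per packet

print(pps, ns_per_pkt)      # 125 Mpps, 8.0 ns per packet
```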
Router Architecture
Control Plane
  How routing protocols establish routes, etc.
  Port mappers
Data Plane
  How packets get forwarded
Outline
Circuit switching
Packet switching
Switch generations
Switch fabrics
Buffer placement
Schedule the fabric / crossbar / backplane
Routing lookup
First generation switch - shared memory
Line card DMAs the packet into a buffer; CPU examines the header, has output DMA it out
Bottleneck can be the CPU, host adapter, or I/O bus, depending
Most Ethernet switches and cheap packet routers
Low-cost routers; speed: 300 Mbps – 1 Gbps
Example (First generation switch)
Today a router built with a 1.33 GHz CPU - bottleneck
Mean packet size 500 B; word (4 B) access takes 15 ns
Bus ~ 100 MHz, memory ~ 5 ns
1) Interrupt takes 2.5 µs per packet
2) Per-packet processing time (200 instructions) = 0.15 µs
3) Copying the packet takes 500/4 * 33 ns = 4.1 µs
   (4 instructions + 2 memory accesses = 33 ns per 4 B word)
Total time = 2.5 + 0.15 + 4.1 = 6.75 µs
=> speed is (500 x 8 / 6.75) ≈ 600 Mbps
Amortized interrupt cost balanced by routing protocol cost
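Redoing the slide's arithmetic (the slide rounds the copy time to 4.1 µs and the result to ~600 Mbps):

```python
words = 500 / 4                   # 4-byte words in a 500 B packet
copy_us = words * 33 / 1000       # 33 ns per word -> ~4.1 us
total_us = 2.5 + 0.15 + copy_us   # interrupt + per-packet processing + copy
mbps = 500 * 8 / total_us         # bits per microsecond = Mbit/s
print(total_us, mbps)             # ~6.775 us, ~590 Mbps
```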
Second generation switch - shared bus
Port mapping (forwarding decisions) in line cards
  direct transfer over the bus between line cards
  if no route in the line card --> CPU ("slow" operations)
Bottleneck is the bus
Medium-end routers, switches, ATM switches
Speed: 10 Gbps (> 8x a first-generation switch)
Third generation switches - point-to-point (switched) bus (fabric)
Third generation (contd.)
Bottleneck in the second generation switch is the bus
Third generation switch provides parallel paths (fabric)
Features
  self-routing fabric (+ tag)
  output buffer is a point of contention, unless we arbitrate access to the fabric
  potential for unlimited scaling, as long as we can resolve contention for the output buffer
High-end routers, switches
Speed: 1000 Gbps
Outline
Circuit switching
Packet switching
Switch generations
Switch fabrics
Buffer placement
Schedule the fabric / crossbar / backplane
Routing lookup
Buffered crossbar
What happens if packets at two inputs both want to go to the same output?
  Can defer one at an input buffer
  Or, buffer the crosspoints
Broadcast
Packets are tagged with the output port #
Each output matches tags
Need to match N addresses in parallel at each output
Useful only for small switches, or as a stage in a large switch
Switch fabric element
Can build complicated fabrics from a simple element
Self-routing rule: if tag = 0, send the packet to the upper output, else to the lower output
If both packets go to the same output, buffer or drop
Fabrics built with switching elements
An NxN switch with bxb elements has log_b N stages with N/b elements per stage
  ex: an 8x8 switch with 2x2 elements has 3 stages with 4 elements per stage
Fabric is self-routing
Recursive
Can be synchronous or asynchronous
Regular and suitable for VLSI implementation
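A minimal sketch of the self-routing rule above, assuming a log2(N)-stage banyan of 2x2 elements in which stage i examines bit i of the destination-port tag, most significant bit first:

```python
def banyan_route(dest_port: int, n_stages: int):
    """Output chosen at each 2x2 element: bit 0 -> upper output, 1 -> lower."""
    return [(dest_port >> (n_stages - 1 - i)) & 1 for i in range(n_stages)]

# 8x8 switch (3 stages): destination port 5 = 0b101 -> lower, upper, lower
print(banyan_route(5, 3))  # [1, 0, 1]
```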
The fabric for Batcher-banyan switch
An example using Batcher-banyan switch
Outline
Circuit switching
Packet switching
Switch generations
Switch fabrics
Buffer placement (Architecture)
Schedule the fabric / crossbar / backplane
Routing lookup
Speedup
C – input/output link capacity
R_I – maximum rate at which an input interface can send data into the backplane
R_O – maximum rate at which an output can read data from the backplane
B – maximum aggregate backplane transfer rate
(Figure: input and output interfaces, each on a link of capacity C, joined by the interconnection medium (backplane); inputs inject at up to R_I, outputs drain at up to R_O.)
Generic Router Architecture
Backplane speedup: B/C
Input speedup: R_I/C
Output speedup: R_O/C
Buffering - three router architectures
Where should we place buffers?
  Output queued (OQ)
  Input queued (IQ)
  Combined input-output queued (CIOQ)
Function division
  Input interfaces: must perform packet forwarding – need to know to which output interface to send packets; may enqueue packets and perform scheduling
  Output interfaces: may enqueue packets and perform scheduling
Output Queued (OQ) Routers
Only output interfaces store packets
Advantages
  Easy to design algorithms: only one congestion point
Disadvantages
  Requires an output speedup of N, where N is the number of interfaces --> not feasible
Input Queueing (IQ) Routers
Only input interfaces store packets
Advantages
  Easy to build: store packets at inputs if there is contention at outputs
  Relatively easy to design algorithms: only one congestion point, but not at the output… need to implement backpressure
Disadvantages
  In general, hard to achieve high utilization
  However, theoretical and simulation results show that for realistic traffic an input/output speedup of 2 is enough to achieve utilizations close to 1
Combined Input-Output Queueing (CIOQ) Routers
Both input and output interfaces store packets
Advantages
  Easy to build
  Utilization 1 can be achieved with an input/output speedup (<= 2)
Disadvantages
  Harder to design algorithms
  Two congestion points; need to design flow control
With an input/output speedup of 2, a CIOQ switch can emulate any work-conserving OQ switch [G+98, SZ98]
Generic Architecture of a High Speed Router Today
CIOQ - combined input-output queued architecture
  Input/output speedup <= 2
Input interface: performs packet forwarding (and classification)
Output interface: performs packet (classification and) scheduling
Backplane / fabric
  Point-to-point (switched) bus; speedup N
  Schedules packet transfers from inputs to outputs
Outline
Circuit switching
Packet switching
Switch generations
Switch fabrics
Buffer placement
Schedule the fabric / crossbar / backplane
Routing lookup
Backplane / Fabric / Crossbar
A point-to-point switch allows simultaneous transfer of packets between any disjoint pairs of input-output interfaces
Goal: come up with a schedule that
  Maximizes router throughput
  Meets flow QoS requirements
Challenges:
  Address head-of-line blocking at inputs
  Resolve input/output speedup contention
  Avoid packet dropping at outputs if possible
Note: packets are fragmented into fixed-size cells (why?) at inputs and reassembled at outputs
  In Partridge et al., a cell is 64 B (what are the trade-offs?)
Head-of-line Blocking
The cell at the head of an input queue cannot be transferred, thus blocking the cells behind it
  Cannot be transferred because the output buffer is full
  Cannot be transferred because it is blocked by the red cell
(Figure: inputs 1–3 feeding outputs 1–3; a blocked head cell holds up cells bound for idle outputs.)
To Avoid Head-of-line Blocking
Head-of-line blocking with only 1 queue per input
  Max throughput <= (2 - sqrt(2)) =~ 58%
Solution? Maintain at each input N virtual queues, i.e., one per output
  Requires N queues; more if QoS
  The way it's done now
(Figure: each of inputs 1–3 holds a separate virtual queue per output 1–3.)
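The 58% bound and the virtual-output-queue layout can be made concrete; the queue structure below is an illustrative sketch, not any particular router's implementation:

```python
import math
from collections import deque

# Saturation throughput of a single FIFO per input under uniform traffic:
hol_limit = 2 - math.sqrt(2)
print(round(hol_limit, 3))  # 0.586

# Virtual output queues: N queues per input, one per output, so a blocked
# head cell for one output never holds up cells bound for another.
N = 3
voq = [[deque() for _ in range(N)] for _ in range(N)]  # voq[input][output]
voq[0][2].append("cell-a")   # input 0, destined to output 2
voq[0][1].append("cell-b")   # same input, different output: independent queue
```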
Cell transfer
Schedule: ideally, find the maximum number of input-output pairs such that:
  Input/output contentions are resolved
  Packet drops at outputs are avoided
  Packets meet their time constraints (e.g., deadlines), if any
Example:
  Use stable matching
  Try to emulate an OQ switch
Stable Marriage Problem
Consider N women and N men
Each woman/man ranks each man/woman in the order of their preferences
Stable matching: a matching with no blocking pairs
Blocking pair: let p(i) denote the partner of i. There are matched pairs (k, p(k)) and (j, p(j)) such that k prefers p(j) to p(k), and p(j) prefers k to j
Gale-Shapley Algorithm (GSA)
As long as there is a free man m:
  m proposes to the highest-ranked woman w in his list to whom he hasn't proposed yet
  If w is free, m and w are engaged
  If w is engaged to m' and w prefers m to m', w releases m'
  Otherwise m remains free
A stable matching exists for every set of preference lists
Complexity: worst-case O(N^2)
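A runnable sketch of the algorithm as stated above, assuming preferences are given as lists of indices 0..N-1 and men propose:

```python
def gale_shapley(men_pref, women_pref):
    n = len(men_pref)
    # rank[w][m] = position of man m in woman w's list (lower = preferred)
    rank = [{m: r for r, m in enumerate(prefs)} for prefs in women_pref]
    next_choice = [0] * n     # next index into each man's preference list
    fiance = [None] * n       # fiance[w] = man currently engaged to woman w
    free_men = list(range(n))
    while free_men:
        m = free_men.pop()
        w = men_pref[m][next_choice[m]]   # highest-ranked not yet proposed to
        next_choice[m] += 1
        cur = fiance[w]
        if cur is None:
            fiance[w] = m                 # w was free: engage
        elif rank[w][m] < rank[w][cur]:
            fiance[w] = m                 # w prefers m: release m'
            free_men.append(cur)
        else:
            free_men.append(m)            # m stays free, tries his next choice
    match = [None] * n                    # match[m] = woman engaged to man m
    for w, m in enumerate(fiance):
        match[m] = w
    return match

# Both men prefer woman 0; woman 0 prefers man 1.
print(gale_shapley([[0, 1], [0, 1]], [[1, 0], [0, 1]]))  # [1, 0]
```

The result pairs man 1 with woman 0 and man 0 with woman 1; no blocking pair exists.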
OQ Emulation with a Speedup of 2
Each input and output maintains a preference list
  Input preference list: the cells at that input, ordered in the inverse order of their arrival
  Output preference list: all input cells to be forwarded to that output, ordered by the times they would be served in an output-queueing schedule
Use GSA to match inputs to outputs
  Outputs initiate the matching
Can emulate all work-conserving schedulers
Example with a Speedup of 2
(Figure: four snapshots (a)–(d) of cells a.1–a.2, b.1–b.3, c.1–c.3 queued at inputs a, b, c and matched to outputs 1–3; panel (d) shows the state after step 1.)
A Case Study [Partridge et al '98]
Goal: show that routers can keep pace with improvements in transmission link bandwidths
Architecture
  A CIOQ router
  15 (input/output) line cards: C = 2.4 Gbps (3.3 Gbps including packet headers)
  Each input card can handle up to 16 (input/output) interfaces
  Separate forward engines (FEs) perform routing
  Backplane: point-to-point (switched) bus, capacity B = 50 Gbps (32 Mpps)
  B/C = 50/2.4 ≈ 20
Router Architecture
(Figure: data enters and leaves through the 15 line cards (input/output interfaces); packet headers cross the backplane to the forward engines, while the network processor handles control data (e.g., routing), updates the routing tables, and sets scheduling (QoS) state.)
Router Architecture: Data Plane
Line cards
  Input processing: can handle input links up to 2.4 Gbps
  Output processing: uses a 52 MHz FPGA (Field Programmable Gate Array); implements QoS
Forward engine:
  415-MHz DEC Alpha 21164 processor, three-level cache to store recent routes
  Up to 12,000 routes in the second-level cache (96 kB); ~95% hit rate
  Entire routing table in the tertiary cache (16 MB, divided into two banks)
Router Architecture: Control Plane
Network processor: 233-MHz 21064 Alpha running NetBSD 1.1
  Updates routing
  Manages link status
  Implements reservation
Backplane allocator: implemented by an FPGA
  The allocator is the heart of the high-speed switch
  Schedules transfers between input/output interfaces
Control Plane: Backplane Allocator
Time divided into epochs: 16 ticks of the data clock (8 allocation clocks)
Transfer unit: 64 B (8 data clock ticks)
Up to 15 simultaneous transfers in an epoch
  One transfer: 128 B of data + 176 auxiliary bits
Minimum of 4 epochs to schedule and complete a transfer, but scheduling is pipelined:
  1. Source card signals that it has data to send to the destination card
  2. Switch allocator schedules the transfer
  3. Source and destination cards are notified and told to configure themselves
  4. Transfer takes place
Flow control through inhibit pins
Early Crossbar Scheduling Algorithm
Wavefront algorithm
  Problems: fairness, speed, …
Scanning every position in sequence is slow! (36 "cycles")
Observation: (2,1) and (1,2) don't conflict with each other (11 "cycles")
Do in groups, with groups in parallel (5 "cycles")
Can find the optimal group size, etc.
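A sketch of the wavefront idea over a simple NxN request matrix. Cells on the same anti-diagonal touch disjoint rows and columns, so hardware evaluates each wavefront in parallel; here the waves are just iterated:

```python
def wavefront(requests):
    """Grant (input, output) pairs wave by wave along anti-diagonals."""
    n = len(requests)
    row_free = [True] * n
    col_free = [True] * n
    grants = []
    for wave in range(2 * n - 1):          # anti-diagonals i + j = wave
        for i in range(n):
            j = wave - i
            if 0 <= j < n and requests[i][j] and row_free[i] and col_free[j]:
                grants.append((i, j))
                row_free[i] = col_free[j] = False
    return grants

# Inputs 0 and 1 both request output 0; input 1 also requests output 1.
print(wavefront([[1, 0], [1, 1]]))  # [(0, 0), (1, 1)]
```

Note how input 0 wins output 0 simply by being scanned first, which is exactly the low-numbered-source unfairness the next slide discusses.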
The Switch Allocator
Disadvantages of the simple allocator
  Unfair: a preference for low-numbered sources
  Requires evaluating 225 positions per epoch, which is too fast for an FPGA
Solution to the unfairness problem: random shuffling of sources and destinations
Solution to the timing problem: parallel evaluation of multiple locations
  Wavefront allocation requires 29 steps for a 15 x 15 matrix
  The MGR allocator uses 3 x 5 groups
Priority to requests from forwarding engines over line cards, to avoid header contention on the line cards
Alternatives to the Wavefront Scheduler
PIM: Parallel Iterative Matching
  Request: each input sends requests to all outputs for which it has packets
  Grant: each output selects a requesting input at random and grants it
  Accept: each input selects from its received grants
Problem: the matching may not be maximal
  Solution: run several iterations
Problem: the matching may not be "fair"
  Solution: grant/accept in round robin instead of at random
iSLIP – Round-robin Parallel Iterative Matching (PIM)
Each input maintains a round-robin list of outputs
Each output maintains a round-robin list of inputs
Request phase: inputs send requests to all desired outputs
Grant phase: each output picks the first requesting input in its round-robin sequence
Accept phase: each input picks the first granting output from its round-robin sequence
  An output updates its round-robin pointer only if its grant is accepted
Good fairness in simulation
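One iteration of the round-robin grant/accept phases can be sketched as follows (a simplified illustration, not the full iSLIP specification, which also distinguishes pointer updates across iterations):

```python
def islip_iteration(requests, grant_ptr, accept_ptr):
    """requests[i][j]: input i has a cell for output j; pointers are per-port."""
    n = len(requests)
    # Grant phase: each output picks the first requesting input
    # at or after its round-robin pointer.
    grants = {}
    for j in range(n):
        for k in range(n):
            i = (grant_ptr[j] + k) % n
            if requests[i][j]:
                grants.setdefault(i, []).append(j)
                break
    # Accept phase: each granted input picks the first granting output
    # at or after its round-robin pointer.
    matches = []
    for i, outs in grants.items():
        for k in range(n):
            j = (accept_ptr[i] + k) % n
            if j in outs:
                matches.append((i, j))
                grant_ptr[j] = (i + 1) % n   # advance only when the grant is accepted
                accept_ptr[i] = (j + 1) % n
                break
    return matches

gp, ap = [0, 0], [0, 0]
print(islip_iteration([[True, True], [True, True]], gp, ap))  # [(0, 0)]
```

In this first iteration both outputs grant input 0, so only one pair matches; running further iterations (or starting with desynchronized pointers) fills in the rest.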
100% Throughput?
Why does it matter?
  Guaranteed behavior regardless of load!
  Same reason we moved away from cache-based router architectures
Cool result:
  Dai & Prabhakar: any maximal matching scheme in a crossbar with 2x speedup gets 100% throughput
  Speedup: run the internal crossbar links at 2x the input & output link speeds
Summary: Design Decisions (Innovations)
1. Each FE has a complete set of routing tables
2. A switched fabric is used instead of the traditional shared bus
3. FEs are on boards distinct from the line cards
4. Use of an abstract link layer header
5. QoS processing included in the router
Outline
Circuit switching
Packet switching
Switch generations
Switch fabrics
Buffer placement
Schedule the fabric / crossbar / backplane
Routing lookup
Routing Lookup Problem / Port mappers
Identify the output interface on which to forward an incoming packet, based on the packet's destination address
Forwarding tables (a local version of the routing table) summarize information by maintaining a mapping between IP address prefixes and output interfaces
Route lookup: find the longest prefix in the table that matches the packet's destination address
Routing Lookups
Routing tables: 200,000 – 1M entries
Router must be able to handle the routing table load 5 years hence. Maybe 10.
So, how to do it?
  DRAM (Dynamic RAM, ~50 ns latency): cheap, slow
  SRAM (Static RAM, <5 ns latency): fast, $$
  TCAM (Ternary Content Addressable Memory – parallel lookups in hardware): really fast, quite $$, lots of power
Example
Packet with destination address 12.82.100.101 is sent to interface 2, as 12.82.100.xxx is the longest prefix matching the packet's destination address
Need to find the longest prefix match
A standard solution: tries

  prefix           interface
  …                …
  12.82.xxx.xxx    3
  128.16.120.xxx   1
  12.82.100.xxx    2

  (example lookups: 128.16.120.111 → 1; 12.82.100.101 → 2)

Use binary tree paths to encode prefixes
Advantage: simple to implement
Disadvantage: one lookup may take O(m), where m is the number of bits (32 in IPv4)
  Ex. m = 5
Two ways to improve performance:
  cache recently used addresses
  move common entries up to a higher level (match longer strings)
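A minimal binary-trie longest-prefix-match sketch; prefixes are written as bit strings, and the example prefixes/ports below are hypothetical, chosen only to exercise the longest-match rule:

```python
class TrieNode:
    def __init__(self):
        self.child = {}      # '0' / '1' -> TrieNode
        self.port = None     # output interface if a prefix ends here

def insert(root, prefix_bits, port):
    node = root
    for b in prefix_bits:
        node = node.child.setdefault(b, TrieNode())
    node.port = port

def lookup(root, addr_bits):
    """Walk the trie along the address, remembering the last prefix passed."""
    node, best = root, None
    for b in addr_bits:
        node = node.child.get(b)
        if node is None:
            break
        if node.port is not None:
            best = node.port        # longest match so far
    return best

root = TrieNode()
insert(root, "10", 3)
insert(root, "1011", 2)
print(lookup(root, "10110"))  # 2: the longer prefix "1011" wins over "10"
```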
Patricia Tries
(Figure: a Patricia trie over the prefixes 0100x, 1001xx, 10xxx, and 101100; internal nodes branch on 0/1, and the labels 2, 3, 5 mark bit positions tested, so chains of single-child nodes are skipped.)
How can we speed "longest prefix match" up?
Two general approaches:
  Shrink the table so it fits in really fast memory (cache)
    Degermark et al.; optional reading
    A complete prefix tree (every node has 2 or 0 children) can be compressed well. 3 stages: match 16 bits; match the next 8; match the last 8
  Drastically reduce the # of memory lookups
    WUSTL algorithm, ca. the same time (binary search on prefixes)
Lulea's Routing Lookup Algorithm
Small Forwarding Tables for Fast Routing Lookups (Sigcomm '97)
Minimize the number of memory accesses
Minimize the size of the data structure (why?)
Solution: use a three-level data structure
First Level: Bit-Vector
Cover all prefixes down to depth 16
Use one bit to encode each prefix
Memory requirements: 2^16 bits = 64 Kb = 8 KB
(Figure: the bit-vector marks both genuine heads and root heads.)
First Level: Pointers
Maintain 16-bit pointers
  2 bits encode the pointer type
  14 bits represent an index into the routing table or into an array containing level-two chunks
Pointers are stored at consecutive memory addresses
Problem: find the pointer
Example: find the pointer
(Figure: a bit vector 1 0 0 0 1 0 1 1 1 0 0 0 1 1 1 1; each set bit owns one entry in a pointer array (sample entries 0006abcd, 000acdef) that points either into the routing table or to a level-two chunk. Problem: find the pointer for a given bit.)
Code Word and Base Index Arrays
Split the bit-vector into bit-masks (16 bits each)
Find the corresponding bit-mask. How?
  Maintain a 16-bit code word for each bit-mask (10-bit value; 6-bit offset)
  Maintain a base index array (one 16-bit entry for each 4 code words)
(Figure: the code word array and base index array run alongside the bit-vector; the stored counts are the number of previous ones in the bit-vector.)
First Level: Finding the Pointer Group
Use the first 12 bits to index into the code word array
Use the first 10 bits to index into the base index array
(Figure: for address 004C, the first 12 bits select the code word and the first 10 bits select the base index; the pointer group index is 13 + 0 = 13.)
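The combination step can be sketched as below. This is a simplified stand-in with hypothetical names: the real scheme looks the per-bit offset up in maptable via the code word's 10-bit value, whereas here a popcount over the bit-mask plays that role:

```python
def pointer_index(base_index, codeword_offset, bitmask16, bit):
    """Index of the pointer for position `bit` (0-15) within a 16-bit mask.

    base_index      - from the base index array (per group of 4 code words)
    codeword_offset - 6-bit offset stored in this mask's code word
    bitmask16       - the 16-bit bit-mask itself
    """
    ones_before = bin(bitmask16 >> (16 - bit)).count("1")  # maptable stand-in
    return base_index + codeword_offset + ones_before

print(pointer_index(13, 2, 0b1000101110001111, 4))  # 16
```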
First Level: Encoding Bit-masks
Observation: not all 16-bit values are possible
  Example: the bit-mask 1001… is not possible (why not?)
Let a(n) be the number of non-zero bit-masks of length 2^n
Compute a(n) using the recurrence:
  a(0) = 1; a(n) = 1 + a(n-1)^2
For length 16, 678 possible values for bit-masks
This can be encoded in 10 bits
  These are the values r_i in the code words
Store all possible bit-masks in a table, called maptable
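Evaluating the recurrence gives a(4) = 677 non-zero bit-masks of length 16 (the slide's count of 678 presumably also admits the all-zero mask), which fits comfortably in 10 bits since 2^10 = 1024:

```python
def a(n):
    # a(0) = 1; a(n) = 1 + a(n-1)^2, as on the slide
    return 1 if n == 0 else 1 + a(n - 1) ** 2

print([a(i) for i in range(5)])  # [1, 2, 5, 26, 677]
```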
First Level: Finding the Pointer Index
Each entry in maptable is a 4-bit offset: the offset of the pointer within the group
Number of memory accesses: 3 (7 bytes accessed)
First Level: Memory Requirements
Code word array: one code word per bit-mask = 64 Kb
Base index array: one base index per four bit-masks = 16 Kb
Maptable: 677 x 16 entries, 4 bits each = ~43.3 Kb
Total: 123.3 Kb = 15.4 KB
First Level: Optimizations
Reduce the number of entries in maptable by two:
  Don't store bit-masks 0 and 1; instead encode the pointers directly into the code word
If the r value in a code word is larger than 676: direct encoding
For direct encoding, use the r value + the 6-bit offset
Levels 2 and 3
Levels 2 and 3 consist of chunks
A chunk covers a sub-tree of height 8: at most 256 heads
Three types of chunks
  Sparse: 1-8 heads; 8-bit indices, eight pointers (24 B)
  Dense: 9-64 heads; like level 1, but only one base index (< 162 B)
  Very dense: 65-256 heads; like level 1 (< 552 B)
Only 7 bytes are accessed to search each of levels 2 and 3
Notes
This data structure trades table construction time for lookup time (build time < 100 ms)
  A good trade-off, because routes are not supposed to change often
Lookup performance:
  Worst case: 101 cycles
    A 200 MHz (2 GHz) Pentium Pro can do at least 2 (20) million lookups per second
  On average: ~50 cycles
Open question: how effective is this data structure in the case of IPv6?
Other challenges in Routing
Routers do more than just "longest prefix match" & crossbar:
  Packet classification (L3 and up!)
  Counters & stats for measurement & debugging
  IPSec and VPNs
  QoS, packet shaping, policing
  IPv6
  Access control and filtering
  IP multicast
  AQM: RED, etc. (maybe)
  Serious QoS: DiffServ (sometimes), IntServ (not)
Going Forward
Today's highest end: multi-rack routers
  Measured in Tb/sec
  One scenario: a big optical switch connecting multiple electrical switches
    Cool design: McKeown Sigcomm 2003 paper
BBN MGR: normal CPU for forwarding
Modern routers: several ASICs