Jon Turner (and a cast of thousands)
Washington University
Design of a High Performance Active Router
Active Nets PI Meeting - 12/01
2 - Jonathan Turner - December 5, 2001
Washington University Active Router
[Figure: router block diagram. A Control Processor and six ports attach to the Switch Fabric (ATM Switch Core). Each port is labeled IPP, OPP, SPC and TI. Callouts: ATM Switch Core, Transmission Interfaces (TI), Embedded Processors (SPC). Smart Port Card inset: Sys. FPGA, 64 MB memory, Pentium, Cache, North Bridge, APIC.]
Control Processor
• global coordination & control
• routing protocols
• build routing tables and other information needed by SPCs
• active plugin code server
3 - Jonathan Turner - December 5, 2001
SPC Software Architecture
[Figure: SPC software block diagram in two halves.
Input Side Processing: Gen. Filters, Flow & Route Lookup, Plugin Control with plugins, virtual output queues, Distributed Queueing.
Output Side Processing: reassembly queues, Gen. Filters, Flow Lookup, Plugin Control with plugins, output queues, Rate Control.]
4 - Jonathan Turner - December 5, 2001
SPC Throughput - Packets Per Second
[Figure: Throughput (PPS, 80,000 to 150,000) vs. Input Rate (PPS, 0 to 500,000) for 40 byte packets. Curves: Word Swap, IP Lookup, Dist. Queueing, Complete Processing.]
5 - Jonathan Turner - December 5, 2001
Comparison with SPC 2
[Figure: Throughput (PPS, 80,000 to 150,000) vs. Input Rate (PPS, 0 to 500,000) for 40 byte packets. Curves: SPC 1 - Complete Processing; SPC 1 - minus Dist. Queueing, IP Lookup; SPC 2 - Complete Processing.]
6 - Jonathan Turner - December 5, 2001
SPC Throughput - Mb/s
[Figure: Throughput (Mb/s, 0 to 450) vs. Input Rate (Mb/s, 150 to 600) for 1500 byte packets. Curves: Word Swap, Dist. Queueing, IP Lookup, Complete Processing.]
7 - Jonathan Turner - December 5, 2001
SPC Throughput vs. Packet Length
[Figure: Throughput (Mb/s, 0 to 250) vs. Input IP Packet Size (Bytes, 0 to 1400). Curve: SPC 1 - complete processing. Input rate: 40 Kpps.]
Throughput constrained by word-swapping overhead.
8 - Jonathan Turner - December 5, 2001
Distributed Queueing
[Figure: Switch Fabric with a Control Processor and six ports. Each port has a TI, an input (I) side and an output (O) side, and its own Routing and Sched. blocks.]
• queue per output
• periodic queue length reports
• Scheduler paces each queue according to backlog share
9 - Jonathan Turner - December 5, 2001
Distributed Queueing Algorithm
Goal: avoid switch congestion and output queue underflow.
Let hi(i,j) be input i's share of input-side backlog to output j.
» can avoid switch congestion by sending from input i to output j at rate L·S·hi(i,j)
» where L is external link rate and S is switch speedup
Let lo(i,j) be input i's share of total backlog for output j.
» can avoid underflow of queue at output j by sending from input i to output j at rate L·lo(i,j)
» this works if L·(lo(i,1)+···+lo(i,n)) ≤ L·S for all i
Let wt(i,j) be the ratio of lo(i,j) to lo(i,1)+···+lo(i,n).
Let rate(i,j) = L·S·min{wt(i,j), hi(i,j)}.
Note: algorithm avoids congestion and avoids underflow for large enough S.
» what is the smallest value of S for which underflow cannot occur?
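The rate rule can be sketched in a few lines of Python. This is an illustration only, not the SPC implementation. It reads the rule as L·S times the smaller of wt(i,j) and hi(i,j). It also assumes that "total backlog for output j" means the input-side backlog destined for j plus the backlog already queued at output j. The function name `dq_rates`, the matrix layout and the treatment of empty queues (rate 0) are hypothetical choices.

```python
def dq_rates(in_backlog, out_backlog, link_rate, speedup):
    """Per-pair sending rates for one distributed queueing period.

    in_backlog[i][j]: backlog at input i destined for output j (assumed layout)
    out_backlog[j]:   backlog already queued at output j (assumed input)
    link_rate:        external link rate L
    speedup:          switch speedup S
    """
    n_in, n_out = len(in_backlog), len(out_backlog)

    def share(num, den):
        # Assumption: an empty denominator means nothing to send, so share 0.
        return num / den if den > 0 else 0.0

    # Input-side backlog destined for each output, summed over all inputs.
    col = [sum(in_backlog[i][j] for i in range(n_in)) for j in range(n_out)]

    rates = []
    for i in range(n_in):
        # hi(i,j): input i's share of input-side backlog to output j.
        hi = [share(in_backlog[i][j], col[j]) for j in range(n_out)]
        # lo(i,j): input i's share of total backlog for output j.
        lo = [share(in_backlog[i][j], col[j] + out_backlog[j])
              for j in range(n_out)]
        lo_sum = sum(lo)
        row = []
        for j in range(n_out):
            wt = share(lo[j], lo_sum)  # wt(i,j) = lo(i,j) / sum_k lo(i,k)
            row.append(link_rate * speedup * min(wt, hi[j]))
        rates.append(row)
    return rates


if __name__ == "__main__":
    # Two inputs, two outputs, empty output queues, L = 100 Mb/s, S = 2.
    for row in dq_rates([[10, 0], [10, 20]], [0, 0], 100.0, 2.0):
        print(row)
```

Because wt(i,·) sums to at most 1 over the outputs and hi(·,j) sums to at most 1 over the inputs, no input sends more than L·S in total and no output receives more than L·S in total, which is the congestion-avoidance property stated above.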
11 - Jonathan Turner - December 5, 2001
Stress Test Simulation - Min Rates
[Figure: Min Rate from In 0 (Mb/s, 0 to 120) vs. DQ Period (0 to 1100). Curves: to outputs 0, 1, 2, 3, 4. External link rate: 70 Mb/s.]
12 - Jonathan Turner - December 5, 2001
Stress Test Simulation - Actual Rates
[Figure: Actual Rates (0 to 160) vs. DQ Period (0 to 1200). Curves: to outputs 0, 1, 2, 3, 4. External link rate: 70 Mb/s.]
13 - Jonathan Turner - December 5, 2001
Stress Test Simulation - Backlog
[Figure: Backlog (KB, 0 to 3500) vs. DQ Period (0 to 1100). Curves: output 0, 1, 2, 3. Curve annotations: 0=>20=>10=>0]
15 - Jonathan Turner - December 5, 2001
Reconfigurable Hardware Extension
[Figure: router block diagram as before, with a Field Programmable Port Extender (FPX) added at each of the six ports, so that each port is labeled IPP, OPP, FPX, SPC and TI. Field Programmable Port Ext. inset: Network Interface Device, Reprogrammable Application Device, SDRAM 128 MB, SRAM 4 MB.]
16 - Jonathan Turner - December 5, 2001
Active Packet Processing
[Figure: router block diagram with FPX and SPC at each port (IPP, OPP, FPX, SPC, TI) plus the Control Processor. Numeric callouts 3, 5 and 6 appear on the diagram. Smart Port Card inset: Sys. FPGA, 32-64 MB memory, Pentium, Cache, North Bridge, APIC.]
17 - Jonathan Turner - December 5, 2001
Logical Port Architecture
[Figure: logical port block diagram in two halves, each split between FPX and SPC.
Input Side Processing: Gen. Filters, Flow & Route Lookup, active flow queues, return queues, virtual output queues (FPX); PCU and plugins (SPC).
Output Side Processing: Gen. Filters, Flow Lookup, active flow queues, return queues, output queues (FPX); PCU and plugins (SPC).]
18 - Jonathan Turner - December 5, 2001
Fast IP Lookup (Eatherton & Dittia)
Multibit trie with clever data encoding.
» small memory requirements (4-6 bytes per prefix typical)
» small memory bandwidth, simple lookup yields fast lookup rates
» updates have negligible impact on lookup performance
Avoid impact of external memory latency on throughput by interleaving several concurrent lookups.
» 8 lookup engine config. uses about 10% of Virtex 1000E logic cells
[Figure: example multibit trie. The address is split into 3-bit chunks (address: 101 100 101 000), and each trie node is drawn with an internal bit vector and an external bit vector.]
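To make the internal and external bit vectors concrete, here is a minimal Python sketch of a multibit trie whose nodes carry two bitmaps. It is written in the spirit of the encoding named on the slide, not taken from the FIPL design. The stride of 3, the bit ordering, the use of Python lists for results and children, and all names (`Node`, `insert`, `lookup`) are assumptions made for illustration.

```python
STRIDE = 3  # assumed stride: each node consumes 3 address bits


class Node:
    def __init__(self):
        self.internal = 0   # 2**STRIDE - 1 bits: prefixes ending inside this node
        self.external = 0   # 2**STRIDE bits: which child nodes exist
        self.results = []   # next hops, ordered by internal bit position
        self.children = []  # child nodes, ordered by external bit position


def _rank(bitmap, pos):
    """Count the set bits strictly below position pos."""
    return bin(bitmap & ((1 << pos) - 1)).count("1")


def _internal_pos(bits):
    """Bit position of a 0..STRIDE-1 bit prefix (given as a '0'/'1' string)."""
    return (1 << len(bits)) - 1 + (int(bits, 2) if bits else 0)


def insert(root, prefix, next_hop):
    """Insert a prefix given as a string of '0'/'1' characters."""
    node = root
    while len(prefix) >= STRIDE:
        chunk, prefix = int(prefix[:STRIDE], 2), prefix[STRIDE:]
        idx = _rank(node.external, chunk)
        if not (node.external >> chunk) & 1:
            node.external |= 1 << chunk
            node.children.insert(idx, Node())
        node = node.children[idx]
    pos = _internal_pos(prefix)
    idx = _rank(node.internal, pos)
    if (node.internal >> pos) & 1:
        node.results[idx] = next_hop        # overwrite an existing prefix
    else:
        node.internal |= 1 << pos
        node.results.insert(idx, next_hop)


def lookup(root, address):
    """Longest-prefix match for an address given as a '0'/'1' string."""
    node, best = root, None
    while True:
        chunk = address[:STRIDE]
        # Longest prefix stored in this node that matches the current chunk.
        for length in range(min(len(chunk), STRIDE - 1), -1, -1):
            pos = _internal_pos(chunk[:length])
            if (node.internal >> pos) & 1:
                best = node.results[_rank(node.internal, pos)]
                break
        if len(chunk) < STRIDE:
            return best                      # address bits exhausted
        value = int(chunk, 2)
        if not (node.external >> value) & 1:
            return best                      # no child covers this chunk
        node = node.children[_rank(node.external, value)]
        address = address[STRIDE:]


if __name__ == "__main__":
    root = Node()
    for prefix, hop in [("", "default"), ("1", "A"), ("101", "B"), ("1011", "C")]:
        insert(root, prefix, hop)
    print(lookup(root, "10110000"), lookup(root, "10100000"), lookup(root, "0111"))
```

Each node visit needs only the two bitmaps plus one indexed read, which is one way a lookup can keep memory traffic low. A hardware version would replace the Python lists with packed arrays addressed by the same population counts.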
19 - Jonathan Turner - December 5, 2001
Lookup Throughput & Latency
[Figure: Millions of lookups per second (left axis, 0 to 11) and Average Lookup Latency (ns, right axis, 0 to 1100) vs. # of FIPL engines (1 to 8). Curves: Worst-Case Throughput, Mae West Throughput, Worst-Case Avg. Lookup Latency, Mae West Avg. Lookup Latency.]
• linear throughput gain
• negligible latency increase
20 - Jonathan Turner - December 5, 2001
Update Performance
[Figure: Millions of lookups per second (0 to 11) vs. # of FIPL engines (1 to 8). Curves: No updates, 10K updates/sec, 100K updates/sec (1 update every 10 µs).]
• reasonable update rates have little impact
21 - Jonathan Turner - December 5, 2001
Performance of Combined Traffic
[Figure: Active Throughput (Kpps, left axis, 0 to 200) and Non-active Throughput (Kpps, right axis, 0 to 2000) vs. Fraction of input traffic that is active (0 to 0.2). Curves: active packet throughput, non-active packet throughput. 40 byte packets, 2 M input packets/s, 850 Mb/s.]
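A short calculation helps relate the axes of this chart to the stated input load. The sketch below splits 2 M input packets per second into active and non-active offered load for a given active fraction, and converts a packet rate to a line rate. The function names are hypothetical. The 53-byte figure is the standard ATM cell size; treating each 40 byte packet as one cell on the wire is an assumption that is not stated on the slide, offered only as one possible reading of the 850 Mb/s figure.

```python
ATM_CELL_BYTES = 53  # standard ATM cell: 5-byte header + 48-byte payload


def offered_load(total_pps, active_fraction):
    """Split a total packet rate into (active, non-active) offered rates."""
    active = total_pps * active_fraction
    return active, total_pps - active


def line_rate_mbps(pps, bytes_on_wire):
    """Convert a packet rate to Mb/s for a given per-packet size on the wire."""
    return pps * bytes_on_wire * 8 / 1e6


if __name__ == "__main__":
    total = 2_000_000  # 2 M input packets/s, as on the slide
    for fraction in (0.05, 0.10, 0.20):
        active, other = offered_load(total, fraction)
        print(fraction, round(active / 1e3), "Kpps active,",
              round(other / 1e3), "Kpps non-active")
    print(line_rate_mbps(total, 40), "Mb/s counting 40-byte packets only")
    print(line_rate_mbps(total, ATM_CELL_BYTES), "Mb/s counting one cell per packet")
```

Under the one-cell-per-packet assumption the result is 848 Mb/s, which is close to the 850 Mb/s quoted on the slide. Counting only the 40 packet bytes gives 640 Mb/s.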
22 - Jonathan Turner - December 5, 2001
Summary and Status
Latest version of SPC software nearly complete.
» additional testing of distributed queueing
» testing of new output queueing subsystem - QSDRR
» porting active applications to new plugin environment
SPC2 almost ready for production.
» finalizing details of PC board schematic and layout
» overload performance testing on development system
Completion of FPX design & integration with SPC.
» low level debugging of FPX interface circuit
» distributed queueing implementation in FPX
» FIPL extension for flow classification
» enhance active flow, output queueing subsystems