Jon Turner (and a cast of thousands)

Washington University
[email protected]

Design of a High Performance Active Router

Active Nets PI Meeting - 12/01

2 - Jonathan Turner - December 5, 2001

Washington University Active Router

[Figure: router block diagram. Six ports, each with IPP, OPP, SPC, and TI, attach to the Switch Fabric together with the Control Processor. Callouts: ATM Switch Core, Transmission Interfaces, Embedded Processors. Smart Port Card detail: Sys. FPGA, 64 MB, Pentium, Cache, North Bridge, APIC.]

Control Processor

• global coordination & control

• routing protocols

• build routing tables and other information needed by SPCs

• active plugin code server

SPC Software Architecture

[Figure: SPC software data path. Input Side Processing: Gen. Filters, Flow & Route Lookup, Plugin Control, plugins, virtual output queues, Distributed Queueing. Output Side Processing: reassembly queues, Gen. Filters, Flow Lookup, Plugin Control, plugins, output queues, Rate Control.]

SPC Throughput - Packets Per Second

[Figure: throughput (PPS, axis 80,000 to 150,000) versus input rate (PPS, axis 0 to 500,000) for 40 byte packets. Curves: Word Swap, Dist. Queueing, IP Lookup, Complete Processing.]

Comparison with SPC 2

[Figure: throughput (PPS, axis 80,000 to 150,000) versus input rate (PPS, axis 0 to 500,000) for 40 byte packets. Curves: SPC 1 - Complete Processing; SPC 1 - minus Dist. Queueing, IP Lookup; SPC 2 - Complete Processing.]

SPC Throughput - Mb/s

[Figure: throughput (Mb/s, axis 0 to 450) versus input rate (Mb/s, axis 150 to 600) for 1500 byte packets. Curves: Word Swap, Dist. Queueing, IP Lookup, Complete Processing.]

SPC Throughput vs. Packet Length

[Figure: throughput (Mb/s, axis 0 to 250) versus input IP packet size (bytes, axis 0 to 1400). SPC 1, complete processing; input rate: 40 Kpps.]

Throughput constrained by word-swapping overhead.

Distributed Queueing

[Figure: six ports, each with a TI, an input side (I), and an output side (O), attach to the Switch Fabric together with the Control Processor. Each port runs Routing and Sched. modules. Annotations: queue per output; periodic queue length reports; scheduler paces each queue according to backlog share.]

Distributed Queueing Algorithm

Goal: avoid switch congestion and output queue underflow.

Let hi(i,j) be input i’s share of input-side backlog to output j.
» can avoid switch congestion by sending from input i to output j at rate L·S·hi(i,j)
» where L is external link rate and S is switch speedup

Let lo(i,j) be input i’s share of total backlog for output j.
» can avoid underflow of queue at output j by sending from input i to output j at rate L·lo(i,j)
» this works if L·(lo(i,1)+···+lo(i,n)) ≤ L·S for all i

Let wt(i,j) be the ratio of lo(i,j) to lo(i,1)+···+lo(i,n).

Let rate(i,j) = L·S·min{wt(i,j), hi(i,j)}.

Note: algorithm avoids congestion, and avoids underflow for large enough S.
» what is the smallest value of S for which underflow cannot occur?
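
The rate rule on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SPC implementation: the function name `dq_rates` and the `in_backlog` / `out_backlog` layout are hypothetical, "total backlog for output j" is assumed to mean the input-side backlog for j plus the backlog already queued at output j, and the garbled rate formula is read as a minimum of wt(i,j) and hi(i,j).

```python
def dq_rates(in_backlog, out_backlog, L, S):
    """Return rate[i][j] = L*S*min(wt(i,j), hi(i,j)).

    in_backlog[i][j]: backlog queued at input i for output j (hypothetical layout)
    out_backlog[j]:   backlog queued at output j (assumed to count toward
                      the "total backlog for output j")
    L: external link rate, S: switch speedup
    """
    n = len(in_backlog)
    m = len(out_backlog)
    # Input-side backlog destined for each output, summed over all inputs.
    in_total = [sum(in_backlog[i][j] for i in range(n)) for j in range(m)]

    def share(x, total):
        # A share of an empty backlog is taken to be zero.
        return x / total if total > 0 else 0.0

    rates = []
    for i in range(n):
        hi = [share(in_backlog[i][j], in_total[j]) for j in range(m)]
        lo = [share(in_backlog[i][j], in_total[j] + out_backlog[j])
              for j in range(m)]
        lo_sum = sum(lo)
        # wt(i,j) normalizes lo(i,j) so that input i's weights sum to 1.
        rates.append([L * S * min(share(lo[j], lo_sum), hi[j])
                      for j in range(m)])
    return rates


# Two inputs, two outputs, 70 Mb/s links, speedup 2 (illustrative values).
example = dq_rates([[100, 0], [100, 100]], [0, 100], L=70, S=2)
print(example)
```

Because the hi(i,j) sum to at most 1 over inputs and the wt(i,j) sum to at most 1 over outputs, the rates into any output and out of any input each stay within L·S, which is the congestion-avoidance property the slide states.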

Stress Test

Stress Test Simulation - Min Rates

[Figure: min rate from In 0 (Mb/s, axis 0 to 120) versus DQ period (axis 0 to 1100). Curves: to 0, 1, 2, 3, 4. External link rate: 70 Mb/s.]

Stress Test Simulation - Actual Rates

[Figure: actual rates (axis 0 to 160) versus DQ period (axis 0 to 1200). Curves: to 0, 1, 2, 3, 4. External link rate: 70 Mb/s.]

Stress Test Simulation - Backlog

[Figure: backlog (KB, axis 0 to 3500) versus DQ period (axis 0 to 1100). Curves: output 0, 1, 2, 3, and 0=>2, 0=>1, 0=>0.]

Stress Test Measurement Results

Reconfigurable Hardware Extension

[Figure: router block diagram with a Field Programmable Port Extender (FPX) added at each of the six ports, alongside the IPP, OPP, SPC, and TI; the ports and the Control Processor attach to the Switch Fabric. Field Programmable Port Ext. detail: Network Interface Device, Reprogrammable Application Device, SDRAM 128 MB, SRAM 4 MB.]

Active Packet Processing

[Figure: router block diagram with IPP, OPP, FPX, SPC, and TI at each of the six ports, plus the Control Processor and Switch Fabric, annotated with numeric step callouts (3, 5, 6). Smart Port Card detail: Sys. FPGA, 32-64 MB, Pentium, Cache, North Bridge, APIC.]

Logical Port Architecture

[Figure: logical port data path, split between FPX and SPC. Input Side Processing: Gen. Filters, Flow & Route Lookup, active flow queues, return queues, virtual output queues, PCU, plugins. Output Side Processing: Gen. Filters, Flow Lookup, active flow queues, return queues, output queues, PCU, plugins.]

Fast IP Lookup (Eatherton & Dittia)

Multibit trie with clever data encoding.
» small memory requirements (4-6 bytes per prefix typical)
» small memory bandwidth, simple lookup yields fast lookup rates
» updates have negligible impact on lookup performance

Avoid impact of external memory latency on throughput by interleaving several concurrent lookups.
» 8 lookup engine config. uses about 10% of Virtex 1000E logic cells

[Figure: example multibit trie lookup for a sample address, with each trie node shown as an internal bit vector and an external bit vector.]
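
To make the internal and external bit vectors concrete, here is a small Python sketch of a bitmap-compressed multibit trie. It is a teaching model under assumptions, not the FIPL hardware encoding: the stride of 4 bits, the `Node` class, and the `insert` / `lookup` helpers are all hypothetical, and Python lists stand in for the packed result and child arrays a hardware engine would index by counting set bits.

```python
STRIDE = 4  # address bits consumed per trie level (assumed for illustration)


class Node:
    def __init__(self):
        self.internal = 0   # bit p set: a prefix ends inside this node at position p
        self.external = 0   # bit c set: a child exists for stride value c
        self.results = []   # next hops, ordered by internal bit position
        self.children = []  # child nodes, ordered by external bit position


def _rank(bitmap, pos):
    """Count the set bits strictly below position pos."""
    return bin(bitmap & ((1 << pos) - 1)).count("1")


def _position(bits):
    """Internal bit position for a prefix of fewer than STRIDE bits."""
    return (1 << len(bits)) - 1 + (int(bits, 2) if bits else 0)


def insert(root, prefix_bits, next_hop):
    """Insert a prefix given as a string of '0'/'1' characters."""
    node = root
    while len(prefix_bits) >= STRIDE:
        c = int(prefix_bits[:STRIDE], 2)
        idx = _rank(node.external, c)
        if not (node.external >> c) & 1:
            node.external |= 1 << c
            node.children.insert(idx, Node())
        node = node.children[idx]
        prefix_bits = prefix_bits[STRIDE:]
    pos = _position(prefix_bits)
    idx = _rank(node.internal, pos)
    if (node.internal >> pos) & 1:
        node.results[idx] = next_hop      # overwrite an existing prefix
    else:
        node.internal |= 1 << pos
        node.results.insert(idx, next_hop)


def lookup(root, addr_bits):
    """Return the next hop of the longest matching prefix, or None."""
    node, best = root, None
    while True:
        chunk = addr_bits[:STRIDE]
        # Longest match inside this node: try the longest candidate first.
        for length in range(min(len(chunk), STRIDE - 1), -1, -1):
            pos = _position(chunk[:length])
            if (node.internal >> pos) & 1:
                best = node.results[_rank(node.internal, pos)]
                break
        if len(chunk) < STRIDE:
            return best
        c = int(chunk, 2)
        if not (node.external >> c) & 1:
            return best
        node = node.children[_rank(node.external, c)]
        addr_bits = addr_bits[STRIDE:]


root = Node()
for prefix, hop in [("", "default"), ("10", "B"), ("1010", "A"), ("101011", "C")]:
    insert(root, prefix, hop)
print(lookup(root, "10101100"), lookup(root, "10010000"))
```

Each level needs only the two bitmaps plus one indexed read, which is the kind of small, regular memory access pattern the slide credits for fast lookup rates.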

Lookup Throughput & Latency

[Figure: millions of lookups per second (axis 0 to 11) and average lookup latency (ns, axis 0 to 1100) versus # of FIPL engines (1 to 8). Curves: Mae West Throughput, Worst-Case Throughput, Mae West Avg. Lookup Latency, Worst-Case Avg. Lookup Latency. Annotations: linear throughput gain; negligible latency increase.]

Update Performance

[Figure: millions of lookups per second (axis 0 to 11) versus # of FIPL engines (1 to 8). Curves: No updates, 10K updates/sec, 100K updates/sec. Annotations: reasonable update rates have little impact; 1 update every 10 µs.]

Performance of Combined Traffic

[Figure: active thruput (Kpps, axis 0 to 200) and non-active thruput (Kpps, axis 0 to 2000) versus fraction of input traffic that is active (0 to 0.2). Curves: active packet throughput, non-active packet throughput. Conditions: 40 byte packets, 2 M input packets/s, 850 Mb/s.]
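
The three figures on this slide (40 byte packets, 2 M input packets per second, 850 Mb/s) can be reconciled with a short calculation, sketched below. The reconciliation is an assumption rather than something the slide states: it supposes each packet crosses the ATM switch core as AAL5 cells, so a 40 byte packet plus an 8 byte trailer fills exactly one 53 byte cell. The helper names are hypothetical.

```python
ATM_CELL_BYTES = 53      # 5 byte header + 48 byte payload
ATM_PAYLOAD_BYTES = 48
AAL5_TRAILER_BYTES = 8   # assumption: packets are carried over AAL5


def cells_per_packet(ip_bytes):
    """Cells needed for one packet, rounding up to whole cells."""
    return -(-(ip_bytes + AAL5_TRAILER_BYTES) // ATM_PAYLOAD_BYTES)


def link_rate_mbps(pps, ip_bytes):
    """Link rate in Mb/s consumed by pps packets of ip_bytes each."""
    return pps * cells_per_packet(ip_bytes) * ATM_CELL_BYTES * 8 / 1e6


print(cells_per_packet(40))              # 1
print(link_rate_mbps(2_000_000, 40))     # 848.0
```

Under that assumption 2 M packets per second comes to 848 Mb/s at the cell level, close to the 850 Mb/s quoted, whereas counting only the 40 payload bytes would give 640 Mb/s.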

Summary and Status

Latest version of SPC software nearly complete.
» additional testing of distributed queueing
» testing of new output queueing subsystem - QSDRR
» porting active applications to new plugin environment

SPC2 almost ready for production.
» finalizing details of PC board schematic and layout
» overload performance testing on development system

Completion of FPX design & integration with SPC.
» low level debugging of FPX interface circuit
» distributed queueing implementation in FPX
» FIPL extension for flow classification
» enhance active flow, output queueing subsystems