Fabric on a Chip
TRANSCRIPT
-
8/8/2019 Fabric on a Chip
1/38
Presentation by: C. Annadurai, SSN College of Engineering
Fabric on a Chip: A Memory-management Perspective
-
Objectives
Distributed architecture challenges
Fabric Flow Control
CRS cell-based multi-stage Benes
Switch fabric challenges
Metro Architecture Basics
Reassembly Window
Challenges
-
Agenda
Cisco's high-end router: CRS-1
Future directions
CRS-1's NP Metro (SPP)
CRS-1's Fabric
CRS-1's Line Card
-
What drove the CRS? A sample taxonomy
OC-768
Multi-chassis
Improved BW/Watt & BW/space
New OS (IOS-XR)
Scalable control plane
-
Multiple router flavours: a sample taxonomy
Core: OC-12 (622 Mbps) and up (to OC-768 ~= 40 Gbps)
Big, fat, fast, expensive
E.g. Cisco HFR, Juniper T-640. HFR: 1.2 Tbps each, interconnect up to 72 giving 92 Tbps, starting at $450k
Transit/peering-facing: OC-3 and up, good GigE density
ACLs, full-on BGP, uRPF, accounting
Customer-facing: FR/ATM/
Feature set as above, plus fancy queues, etc.
Broadband aggregator: high scalability (sessions, ports, reconnections); feature set as above
Customer-premises (CPE): 100 Mbps
NAT, DHCP, firewall, wireless, VoIP
Low cost, low-end, perhaps just software on a PC
-
Routers are pushed to the edge: a sample taxonomy
Over time routers are pushed to the edge as:
BW requirements grow
# of interfaces scales
Different routers have different offerings:
Interface types (core is mostly Ethernet)
Features; sometimes the same feature is implemented differently
User interface
Redundancy models
Operating system
Customers look for:
Investment protection
Stable network topology
Feature parity
-
What does scaling mean? A sample taxonomy
Interfaces (BW, number, variance)
BW
Packet rate
Features (e.g. support link BW in a flexible manner)
More routes
Wider ecosystem
Effective management (e.g. capability to support more BGP peers and more events)
Fast control (e.g. distributing routing information)
Availability
Serviceability
Scaling is both up and down (logical routers)
-
Typical centralized architecture
[Figure: a shared CPU, route table, and buffer memory serving multiple line interfaces (MACs) over a common bus]
-
Typical high-BW distributed architecture
[Figure: line cards, each with a MAC, local buffer memory, and forwarding table, connected by a crossbar switched backplane; a CPU card holds the routing table and distributes forwarding tables to the line interfaces]
-
Distributed architecture challenges (examples)
HW-wise: switching fabric
High-BW switching, QoS, traffic loss, speedup
Data plane (SW)
High BW / packet rate; limited resources (CPU, memory)
Control plane (SW)
High event rate; routing-information distribution (e.g. forwarding tables)
-
Switch Fabric challenges
Scale: many ports
Fast distributed arbitration
Minimum disruption with QoS model
Minimum blocking
Balancing
Redundancy
-
Previous solution: GSR cell-based XBAR with centralized scheduling
Each LC has variable-width links to and from the XBAR, depending on its bandwidth requirement
Central scheduling is iSLIP-based: two request-grant-accept rounds
Each arbitration round lasts one cell time
Per-destination-LC virtual output queues
[Figure: linecard (emphasizing the fabric interface), XBAR switching matrix, and fabric scheduler, showing connections for just one linecard. The linecard holds virtual output queues (one output queue per destination linecard), reassembly queues (one per source linecard, and per unicast/multicast), request/grant control, and cell transmit control. Each linecard has 1 to 16 transmit and receive lanes to and from the fabric; the number of lanes varies per linecard type based on bandwidth. The scheduler exchanges requests and grants with the XBAR control using cell-availability information.]
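The request-grant-accept arbitration named above can be sketched as a single iSLIP-style round. This is an illustrative sketch, not the GSR scheduler itself: real iSLIP runs multiple iterations per cell time and updates its round-robin pointers only on the first iteration.

```python
# One request/grant/accept round for an N x N crossbar with virtual
# output queues (VOQs). grant_ptr/accept_ptr are the per-port
# round-robin pointers iSLIP uses to desynchronize the schedulers.

def islip_round(voq, grant_ptr, accept_ptr):
    """voq[i][j] > 0 means input i has cells queued for output j.
    Returns a list of (input, output) matches for this cell time."""
    n = len(voq)
    # Request phase: each input requests every output with a non-empty VOQ.
    requests = [[voq[i][j] > 0 for j in range(n)] for i in range(n)]
    # Grant phase: each output grants the requesting input nearest
    # (round-robin) to its grant pointer.
    grants = {}
    for j in range(n):
        for k in range(n):
            i = (grant_ptr[j] + k) % n
            if requests[i][j]:
                grants.setdefault(i, []).append(j)
                break
    # Accept phase: each input accepts the granting output nearest to
    # its accept pointer; both pointers advance past the match.
    matches = []
    for i, outs in grants.items():
        for k in range(n):
            j = (accept_ptr[i] + k) % n
            if j in outs:
                matches.append((i, j))
                accept_ptr[i] = (j + 1) % n
                grant_ptr[j] = (i + 1) % n
                break
    return matches
```

Each match then carries one cell across the crossbar during the cell time, which is why an arbitration round must finish within it.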
-
CRS Cell-based Multi-Stage Benes
Multiple paths to a destination
For a given input and output port, the number of paths equals the number of center-stage elements
Cell routing: distribution between the S1 and S2 stages; routing at S2 and S3
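The distribution/routing split above can be made concrete with a small sketch: S1 sprays each cell to a randomly chosen center-stage (S2) element for load balancing, while S2 and S3 route deterministically by destination port. Element counts here are illustrative, not CRS-1 values.

```python
import random

def route_cell(dest_port, num_center_stages, rng=random):
    """Return the (center_stage, dest_port) path one cell takes."""
    s2 = rng.randrange(num_center_stages)  # distribution step at S1
    return (s2, dest_port)                 # S2 and S3 route by destination

def num_paths(num_center_stages):
    # For a fixed input/output pair the path is determined entirely by
    # the chosen center-stage element, so #paths == #center stages.
    return num_center_stages
```

Spraying over the center stages balances load across the fabric, at the cost of cells from one packet arriving out of order (handled later by the reassembly window).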
-
Fabric speedup
The Q-fabric tries to approximate an output-buffered switch to minimize sub-port blocking
Buffering at the output allows better scheduling
In single-stage fabrics a 2x speedup very closely approximates an output-buffered fabric *
For multi-stage fabrics, the speedup factor to approximate output-buffered behavior is
-
Fabric Flow Control Overview
Discard: time constant in the 10s of ms range. Originates from the from-fab side and is directed at the to-fab side. A very fine level of granularity: discard down to the level of individual destination raw queues.
Back pressure: time constant in the 10s of µs range. Originates from the fabric and is directed at the to-fab side. Operates per priority at increasingly
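The two mechanisms above can be modeled with a toy queue: faster per-priority back pressure fires well before the queue is exhausted, and per-destination-queue discard is the last resort. The thresholds and the queue model are assumptions for the example, not CRS-1 internals.

```python
class FabricQueue:
    def __init__(self, backpressure_limit, discard_limit):
        assert backpressure_limit < discard_limit
        self.depth = 0
        self.backpressure_limit = backpressure_limit  # fast, per priority
        self.discard_limit = discard_limit            # slow, per raw queue

    def offer(self, cells=1):
        """Try to enqueue; returns (accepted, backpressure_asserted)."""
        if self.depth + cells > self.discard_limit:
            return (False, True)          # discard: queue exhausted
        self.depth += cells
        return (True, self.depth >= self.backpressure_limit)
```

The point of the two time constants is that back pressure throttles sources quickly enough that discard rarely triggers.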
-
Reassembly Window
Cells transiting the fabric take different paths between Sprayer and Sponge.
Cells for the same packet will arrive out of order.
The reassembly window for a given source is defined as the worst-case differential delay two cells from a packet encounter as they traverse the fabric.
The fabric limits the reassembly
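The reordering the Sponge must perform can be sketched as follows: cells from a given source carry sequence numbers, out-of-order arrivals are buffered, and cells are released in order. A real implementation bounds this buffer by the fabric's worst-case differential delay (the reassembly window); this sketch shows only the reordering itself.

```python
def reassemble(arrivals):
    """arrivals: iterable of (seq, cell) pairs in arrival order.
    Yields cells in sequence order."""
    pending = {}
    next_seq = 0
    for seq, cell in arrivals:
        pending[seq] = cell                # buffer out-of-order arrival
        while next_seq in pending:         # release any in-order run
            yield pending.pop(next_seq)
            next_seq += 1
```

Bounding the differential delay is what makes the pending buffer finite: the fabric guarantees a cell can never arrive more than one window behind its packet-mates.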
-
Linecard challenges
Power
COGS
Multiple interfaces
Intermediate buffering
Speedup
CPU subsystem
-
Cisco CRS-1 Line Card
[Figure: line-card block diagram. A Modular Services Card and a PLIM meet at the midplane. The PLIM carries OC-192 framers and optics and an Interface Module ASIC; the MSC carries the CPU (Squid GW), RX/TX Metro, ingress and egress queuing, and the From-Fabric ASIC. Numbered arrows (1-8) trace the ingress and egress packet flow, including the egress path from the fabric.]
-
Cisco CRS-1 Line Card
[Figure: board photo annotated with the LineCard CPU, Egress Metro, Ingress Metro, Ingress Queuing, power regulators, Fabric Serdes, From-Fabric interface, and Egress Queuing]
-
-
-
Metro Subsystem
-
Metro Subsystem
What is it?
A massively parallel NP, codename Metro
Marketing name: SPP (Silicon Packet Processor)
What were the goals?
Programmability
Scalability
Who designed & programmed it?
-
Metro Subsystem
Metro: 2500 balls, 250 MHz, 35 W
TCAM: 125 MSPS, 128k x 144-bit entries, 2 channels
FCRAM: 166 MHz DDR, 9 channels (lookups and table memory)
QDR2 SRAM: 250 MHz DDR, 5 channels (policing state, classification results, queue-length state)
-
Metro Top Level
Packet in: 96 Gb/s BW; packet out: 96 Gb/s BW
18 mm x 18 mm, IBM 0.13 µm
18M gates
8 Mbit of SRAM and RAs
Control processor interface: proprietary, 2 Gb/s
-
Gee-whiz numbers
188 32-bit embedded RISC cores
~50 BIPS
175
78 MPPS peak performance
-
Why Programmability?
Simple forwarding is not so simple. Example features:
MPLS, 3 labels
Link bundling (v4)
Load balancing L3 (v4)
1 policer check
Marking
TE/FRR
Sampled NetFlow
WRED
ACL
IPv4 multicast
IPv6 unicast
Per-prefix accounting
GRE/L2TPv3 tunneling
RPF check (loose/strict) v4
Load balancing L3 (v6)
Link bundling (v6)
Congestion control
IPv4 unicast lookup algorithm
[Figure: IPv4 unicast lookup chain. A lookup yields a leaf (SRAM/DRAM) pointing to an L3 load-balance entry, which points to an L2 adjacency; policy-based routing uses a TCAM table with 1:1 associative data in SRAM/DRAM. Scale: millions of routes, hundreds of load-balancing entries per route, 100k+ adjacencies, pointers to statistics counters. Increasing pressure to add 1-2 levels of extra indirection for high availability and increased update rates.]
Programmability also means:
Ability to juggle feature ordering
Support for heterogeneous mixes of feature chains
Rapid introduction of new features (feature velocity)
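The leaf-to-load-balance-to-adjacency chain described on this slide can be sketched with plain dictionaries. The table layout and field names here are illustrative assumptions, not Metro's actual structures.

```python
def forward(leaf, flow_hash, lb_table, adj_table):
    """Resolve leaf -> load-balance entry -> adjacency (L2 rewrite)."""
    lb = lb_table[leaf["lb_index"]]       # level of indirection
    adj_index = lb[flow_hash % len(lb)]   # pick an equal-cost path
    return adj_table[adj_index]           # L2 rewrite info
```

Because the leaf stores only an index, repointing one load-balance entry retargets every route sharing it in a single write: that is the extra level of indirection that buys high availability and high update rates.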
-
Metro Architecture Basics: Packet Distribution
[Figure: 188 PPEs between 96 Gb/s in and out paths, an on-chip packet buffer, and a resource fabric connecting shared resources]
Packet tails stored on-chip; ~100 bytes of packet context sent to the PPEs
Run-to-completion (RTC): simple SW model, efficient heterogeneous feature processing
RTC and non-flow-based packet distribution mean a scalable architecture
Costs: high instruction-BW supply; need RMW and flow-ordering solutions
-
Metro Architecture Basics: Packet Gather
[Figure: the same 188-PPE, packet-buffer, and resource-fabric diagram]
Gather of packets involves:
Assembly of final packets (at 100 Gb/s)
Packet ordering after variable-length processing
Gathering without new packet distribution
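The two gather tasks above, restoring arrival order after variable-length processing and splicing the processed head back onto the tail that never left the on-chip buffer, can be sketched together. The data layout is an assumption for the example, not the Metro gather engine.

```python
def gather(ppe_results, tail_buffer):
    """ppe_results: (arrival_seq, processed_head) pairs in completion
    order; tail_buffer: arrival_seq -> tail bytes held on-chip.
    Returns final packets in arrival order."""
    final = []
    for seq, head in sorted(ppe_results):      # restore arrival order
        final.append(head + tail_buffer.pop(seq, b""))
    return final
```

Splicing from the on-chip buffer is what lets gather assemble final packets without redistributing them through the PPE pool.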
-
Metro Architecture Basics: Packet Buffer Accessible as a Resource
[Figure: the same 188-PPE, packet-buffer, and resource-fabric diagram]
The resource fabric is a set of parallel, wide, multi-drop busses
Resources consist of: memories, read-modify-write operations, performance-heavy mechanisms
-
Metro Resources
[Figure: resources attached to the fabric: statistics (512k counters), TCAM interface tables, policing (100k+), lookup engine (2M prefixes), table DRAM (10s of MB), and queue-depth state; a TreeBitmap trie with a root node, child arrays, and child pointers]
Lookup engine uses the TreeBitmap algorithm
FCRAM and on-chip memory
High update rates; configurable performance vs density
Tree Bitmap: Hardware/Software IP Lookups with Incremental Updates, Will Eatherton et al., CCR April 2004 (vol. 34, no. 2), pp. 97-123
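TreeBitmap compresses a multibit trie with per-node bitmaps; the real structure is in the Eatherton et al. paper cited above. As a much simpler stand-in, here is a plain binary trie performing the same job, longest-prefix match, so the lookup behavior is concrete.

```python
class PrefixTrie:
    def __init__(self):
        self.root = {}

    def insert(self, prefix_bits, next_hop):
        """prefix_bits: string of '0'/'1' characters."""
        node = self.root
        for b in prefix_bits:
            node = node.setdefault(b, {})
        node["nh"] = next_hop

    def lookup(self, addr_bits):
        """Longest-prefix match: keep the deepest next hop seen."""
        node, best = self.root, None
        for b in addr_bits:
            if "nh" in node:
                best = node["nh"]
            if b not in node:
                return best
            node = node[b]
        return node.get("nh", best)
```

TreeBitmap gets the same answers while walking multiple bits per memory access and supporting incremental updates, which is what makes it suitable for an FCRAM-backed hardware lookup engine.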
-
Packet Processing Element (PPE)
16 PPE clusters
Each cluster of 12 PPEs
0.5 sq mm per PPE
-
Packet Processing Element (PPE)
[Figure: PPE block diagram. A 32-bit RISC processor core with ICACHE, data memory, Cisco DMA, memory-mapped registers, a distribution header, packet header, and scratch pad. Cluster instruction memory and global instruction memory feed 12 PPEs over the instruction bus; a cluster data-mux unit connects packet distribution (from resources) and packet gather (to resources) to the 12 PPEs.]
Tensilica Xtensa core with Cisco enhancements
32-bit, 5-stage pipeline
Code density: 16/24-bit instructions
Small instruction
-
Programming Model and Efficiency
Metro programming model:
Run-to-completion programming model
Queued descriptor interface to resources
Industry-leveraged tool flow
Efficiency data points:
1 ucoder for 6 months: IPv4 with common features (ACL, PBR, QoS, ..
-
-
Summary
Distributed architecture challenges
Fabric Flow Control
CRS cell-based multi-stage Benes
Switch fabric challenges
Metro Architecture Basics
Reassembly Window
Challenges