Department of Computer and IT Engineering
University of Kurdistan
Computer Networks II: Router Architecture
By: Dr. Alireza Abdollahpouri
What is Routing and Forwarding?

[Figure: example network with hosts A-F connected through routers R1-R5]

2
History …
Introduction
3
History …
And future trends!
Introduction
4
What a Router Looks Like

Cisco GSR 12416: 6 ft × 2 ft × 19 in, capacity 160 Gb/s, power 4.2 kW
Juniper M160: 3 ft × 2.5 ft × 19 in, capacity 80 Gb/s, power 2.6 kW

5
Packet Processing Functions

Basic network system functionality:
- Address lookup
- Packet forwarding and routing
- Fragmentation and re-assembly
- Security
- Queuing
- Scheduling
- Packet classification
- Traffic measurement
- …
6
1. Accept a packet arriving on an ingress line.
2. Look up the packet's destination address in the forwarding table to identify the outgoing interface(s).
3. Manipulate the packet header: e.g., decrement the TTL and update the header checksum.
4. Send the packet to the outgoing interface(s).
5. Queue it until the outgoing line is free.
6. Transmit the packet onto the outgoing line.
Per-packet Processing in a Router
7
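The six steps above can be sketched in a few lines of Python. The forwarding table, interface names, and packet representation are illustrative assumptions, and the lookup is a naive longest-prefix match rather than the trie or TCAM a real router would use:

```python
# Illustrative sketch of per-packet processing; table and names are hypothetical.
import ipaddress

# Hypothetical forwarding table: (destination network, egress interface)
FORWARDING_TABLE = [
    (ipaddress.ip_network("10.0.0.0/8"), "eth1"),
    (ipaddress.ip_network("10.1.0.0/16"), "eth2"),
    (ipaddress.ip_network("0.0.0.0/0"), "eth0"),   # default route
]

def lookup(dst):
    """Longest-prefix match: the most specific matching entry wins."""
    best = None
    for net, iface in FORWARDING_TABLE:
        if dst in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, iface)
    return best[1]

def forward(packet):
    """Steps 1-4: accept, look up, rewrite the header, pick the egress port."""
    if packet["ttl"] <= 1:
        return None                      # drop (a real router sends ICMP time exceeded)
    packet["ttl"] -= 1                   # step 3: decrement TTL
    # (a real router would also update the IP header checksum here)
    return lookup(ipaddress.ip_address(packet["dst"]))  # steps 2 and 4

pkt = {"dst": "10.1.2.3", "ttl": 64}
print(forward(pkt))   # "eth2": the /16 beats the /8 and the default route
```

Steps 5 and 6 (queueing and transmission) would then hand the packet to the egress line's scheduler.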
Basic Architecture of a Router

Control plane: may be slow, typically in software
- Routing: routing-table updates (OSPF, RIP, IS-IS)
- Admission control
- Congestion control
- Reservation
(how routing protocols establish routes, etc.)

Data plane (per-packet processing): must be fast, typically in hardware
- Switching: arbitration, scheduling
- Routing lookup
- Packet classifier
(how packets get forwarded)

8
9

Generic Router Architecture

[Figure: packets (header + data) arrive at line cards, each with a buffer manager and buffer memory, and cross a shared interconnect to the egress line cards]

Functions in a Packet Switch

[Figure: ingress line card (framing, route lookup, TTL processing, buffering), interconnect (interconnect scheduling), egress line card (buffering, QoS scheduling, framing), with separate data, control, and scheduling paths and a control plane]

Usually there are multiple kinds of memory in use (DRAM for the packet buffer, SRAM for queues and tables).

10
Line Card Picture
11
Major Components of Routers: Interconnect

The interconnect joins input ports to output ports; there are three basic modes:

Bus: all input ports transfer data over a shared bus. Problem: often causes congestion in the data flow.

Shared memory: input ports write data into the shared memory; after the destination lookup, the output port reads the data back out. Problem: requires fast memory read/write and management technology.

Crossbar: each of the N input ports has a dedicated data path to the N output ports, resulting in an N×N switching matrix. Problem: blocking (input, output, head-of-line HOL); the maximum switch load for random traffic is about 59%.

[Figure: bus, shared-memory, and crossbar interconnects]

12
Interconnects: Two basic techniques
Input Queueing Output Queueing
Usually a non-blocking switch fabric (e.g. crossbar)
13
Output Queued (OQ) Switch
How an OQ Switch Works
14
Input Queueing: Head of Line Blocking

[Figure: delay vs. load; delay grows without bound as the load approaches 58.6%, well before 100%]

15
Head of Line Blocking
16
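The ~58.6% limit can be reproduced with a small Monte Carlo sketch. The switch size, slot count, and seed below are arbitrary illustrative choices; 2 − √2 ≈ 0.586 is the large-N limit, and small switches land somewhat higher:

```python
# Monte Carlo sketch of head-of-line blocking in a FIFO input-queued
# crossbar under saturated uniform traffic. Parameters are illustrative.
import random

def fifo_throughput(n_ports=16, slots=20000, seed=1):
    random.seed(seed)
    # Inputs are always backlogged, so each queue is modelled by the
    # destination of its head-of-line packet only.
    heads = [random.randrange(n_ports) for _ in range(n_ports)]
    delivered = 0
    for _ in range(slots):
        contenders = {}
        for i, out in enumerate(heads):
            contenders.setdefault(out, []).append(i)
        for out, inputs in contenders.items():
            # Each output serves one of the inputs targeting it; the losers
            # stay blocked even if other outputs sit idle (HOL blocking).
            winner = random.choice(inputs)
            delivered += 1
            heads[winner] = random.randrange(n_ports)  # next packet's destination
    return delivered / (n_ports * slots)

print(f"saturation throughput ~ {fifo_throughput():.3f}")
```

For a 16-port switch this lands a little above the 0.586 asymptote; running the same experiment with virtual output queues (next slides) would push throughput toward 1.0.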
Virtual Output Queues (VoQ)

At each input port there are N queues, one associated with each output port.
Only one packet can leave an input port at a time.
Only one packet can be received by an output port at a time.
VoQs retain the scalability of FIFO input-queued switches.
VoQs eliminate the HoL problem of FIFO input queues.

19
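A minimal sketch of the VoQ structure, assuming a greedy first-fit matcher (real switches use iterative algorithms such as iSLIP); the class and packet names are hypothetical:

```python
# Sketch of virtual output queues: each input keeps N queues, one per output.
from collections import deque

class VoqPort:
    def __init__(self, n_outputs):
        # One FIFO per output: a packet for output j never blocks
        # a packet for output k queued behind it.
        self.voq = [deque() for _ in range(n_outputs)]

    def enqueue(self, packet, out_port):
        self.voq[out_port].append(packet)

def schedule(inputs):
    """One timeslot: match each output to at most one input holding a
    packet for it, and each input to at most one output (greedy first-fit)."""
    n = len(inputs)
    matches, busy_inputs = [], set()
    for out in range(n):
        for i in range(n):
            if i not in busy_inputs and inputs[i].voq[out]:
                matches.append((i, out, inputs[i].voq[out].popleft()))
                busy_inputs.add(i)
                break
    return matches

ports = [VoqPort(2) for _ in range(2)]
ports[0].enqueue("a->out0", 0)
ports[0].enqueue("b->out1", 1)   # separate VoQ, so not blocked behind "a"
ports[1].enqueue("c->out1", 1)
print(schedule(ports))           # [(0, 0, 'a->out0'), (1, 1, 'c->out1')]
```

Both outputs are served in the same slot; with a single FIFO per input this would still work here, but under HOL blocking a head packet for a busy output would idle the whole input.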
Input Queueing: Virtual output queues

20

[Figure: delay vs. load; with virtual output queues the switch remains stable at loads approaching 100%]

21
The Evolution of Router Architecture
First Generation Routers
Modern Routers
22
First Generation Routers

[Figure: a single CPU with route table and buffer memory on a shared backplane, connected to multiple line interfaces (MACs)]

Bus-based Router Architectures with Single Processor

23
Based on software implementations on a single CPU.
Limitations:
- Serious processing bottleneck in the central processor
- Memory-intensive operations (e.g., table lookups and data movement) limit the effectiveness of the processor's power
First Generation Routers
24
Second Generation Routers

[Figure: a CPU with route table on a shared bus; each line card has a MAC, buffer memory, and a forwarding cache]

Bus-based Router Architectures with Multiple Processors

25
Architectures with route caching distribute the packet-forwarding operations across:
- Network interface cards
- Processors
- Route caches
Packets are transmitted only once over the shared bus.
Limitations:
- The central routing table is a bottleneck at high speeds
- Throughput is traffic-dependent (cache)
- The shared bus is still a bottleneck
Second Generation Routers
26
Third Generation Routers

[Figure: a switched backplane connecting a CPU card (with the routing table) to line cards, each with a MAC, local buffer memory, forwarding table, and its own CPU]

Switch-based Architectures with Fully Distributed Processors

27
To avoid bottlenecks:
Processing power
Memory bandwidth
Internal bus bandwidth
Each network interface is equipped with appropriate processing power and buffer space.
Data vs. control plane
• Data plane – line cards
• Control plane - processor
Third Generation Routers
28
Fourth Generation Routers/Switches

[Figure: linecards connected to a switch core over optical links hundreds of metres long]

0.3-10 Tb/s routers in development
Optics inside a router for the first time

29
Do we still need higher processing power in networking devices?
Of course, YES. But why? And how?
Demand for More Powerful Routers
30
Processing Complexity

[Figure: processing complexity grows from hundreds of instructions per packet (Layer 2 switching, IPv4 routing) to thousands of instructions per packet (flow classification, encryption, intrusion detection)]

Packet inter-arrival time (for 40 Gb/s): big packet ~300 ns, small packet ~12 ns

Beyond Moore's law
Demands for Faster Routers (why?)
31
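The inter-arrival figures follow from back-of-the-envelope arithmetic. The packet sizes are assumptions (the slide does not state them): 1500 bytes for a full Ethernet frame payload and 64 bytes for a minimum-size packet, which actually gives 12.8 ns rather than a flat 12 ns:

```python
# Back-of-the-envelope check of the packet inter-arrival times quoted above.
# The 1500-byte and 64-byte packet sizes are assumed, not stated on the slide.
def interarrival_ns(packet_bytes, link_bps):
    """Time to serialize one packet onto the link, in nanoseconds."""
    return packet_bytes * 8 / link_bps * 1e9

LINK = 40e9  # 40 Gb/s
print(f"1500 B: {interarrival_ns(1500, LINK):.1f} ns")  # 300.0 ns
print(f"  64 B: {interarrival_ns(64, LINK):.1f} ns")    # 12.8 ns
```

At 12.8 ns per packet, even a single memory access to off-chip DRAM (tens of nanoseconds) already exceeds the per-packet budget, which is the point of the slide.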
Future applications will demand TIPS
Demands for Faster Routers (why?)
32
Future applications will demand TIPS Power? Heat?
Demands for Faster Routers (why?)
33
Technology push:
- Link bandwidth is scaling much faster than CPU and memory technology
- Transistor scaling and VLSI technology help, but not enough

Demands for Faster Routers (summary)

Application pull:
- More complex applications are required
- Processing complexity is defined as the number of instructions and the number of memory accesses needed to process one packet
34
“Future applications will demand TIPS”
“Think platform beyond a single processor”
“Exploit concurrency at multiple levels”
“Power will be the limiter due to complexity and leakage”
Distribute workload on multiple cores
Demands for faster routers (How?)
35
Symmetric multi-processors allow multi-threaded applications to achieve higher performance at less die area and power consumption than single-core processors
Asymmetric multi-processors consume power and provide increased computational power only on demand
Multi-Core Processors
36
Performance Bottlenecks

Memory: bandwidth is available, but access time is too slow; increasing delay for off-chip memory
I/O: high-speed interfaces are available; cost problem with optical interfaces
Internal bus: can be solved with an effective switch, allowing simultaneous transfers between network interfaces
Processing power: individual cores are getting more complex; problems with access to shared resources; the control processor can become a bottleneck
37
Different Solutions

- ASIC
- FPGA
- NP
- GPP

[Figure: flexibility vs. performance trade-off, with ASICs at the high-performance/low-flexibility end, GPPs at the high-flexibility/low-performance end, and NPs and FPGAs in between]
38
By: Niraj Shah
Different Solutions
39
"It is always something (corollary). Good, Fast, Cheap: Pick any two (you can't have all three)."
RFC 1925, "The Twelve Networking Truths"
40
Why not ASIC?

- High cost to develop, while network processing is a moderate-quantity market
- Long time to market, while network processing services change quickly
- Difficult to simulate complex protocols
- Expensive and time-consuming to change
- Little reuse across products and limited reuse across versions
- No consensus on a framework or supporting chips
- Requires expertise
41
• Introduced several years ago (1999+)
• A way to introduce flexibility and programmability
in network processing
• Many players were there (Intel, Motorola, IBM)
• Only a few players still there
Network Processors
42
Intel IXP 2800
Initial release: August 2003

43
CPU-level flexibility: a giant step forward compared to ASICs

How?
- Hardware coprocessors
- Memory hierarchies
- Multiple hardware threads (zero context-switching overhead)
- Narrow (and multiple) memory buses
- Other ad-hoc solutions for network processing, e.g., fast switching fabric and memory accesses
What Was Correct With NPs?
44
What Was Wrong With NPs?
Programmability issues:
- A completely new programming paradigm
- Developers are not familiar with the unprecedented parallelism of the NPU and do not know how best to exploit it
- New (proprietary) languages
- Poor portability across different network processor families
45
What Happened in NP Market?
Intel went out of the market in 2007
Many other small players disappeared
High risk in committing to an NP maker that may disappear
46
"Every old idea will be proposed again with a different name and a different presentation, regardless of whether it works."
RFC 1925, "The Twelve Networking Truths"
47
Processing in General-purpose CPUs
CPUs are optimized for a few threads with high performance per thread:
- High CPU frequencies
- Maximized instruction-level parallelism: pipelining, superscalar execution, out-of-order execution, branch prediction, speculative loads
Software Routers
48
Aim: low cost, flexibility, and extensibility
Linux on a PC with a bunch of NICs
Changing a functionality is as simple as a software upgrade
Software Routers
49
- RouteBricks [SOSP'09]: built on the Intel Nehalem architecture
- PacketShader [SIGCOMM'10]: GPU-accelerated, developed at KAIST, Korea
Software Routers (examples)
50
Intel Nehalem Architecture

[Figure: quad-core die with cores C0-C3 sharing a common L3 cache]
51
NUMA architecture: the latency to access local memory is approximately 65 ns; the latency to access remote memory is approximately 105 ns.
The bandwidth of the QPI link is 12.8 GB/s.
Three DDR3 channels to local DRAM support a bandwidth of 31.992 GB/s.
Intel Nehalem Architecture
52
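The 31.992 GB/s figure is consistent with DDR3-1333 memory; the exact DIMM speed is an assumption here, since the slide does not state it:

```python
# Rough check of the DRAM bandwidth figure quoted above, assuming
# DDR3-1333 (1333 MT/s over an 8-byte bus per channel).
def channel_gbs(mega_transfers_per_s, bus_bytes=8):
    """Per-channel bandwidth in GB/s: MT/s * bytes per transfer / 1000."""
    return mega_transfers_per_s * bus_bytes / 1000

per_channel = channel_gbs(1333)   # 10.664 GB/s per channel
total = 3 * per_channel           # three channels per socket on Nehalem
print(f"{total:.3f} GB/s")        # 31.992 GB/s, matching the slide
```

This local-memory bandwidth (~32 GB/s) is why crossing the QPI link (12.8 GB/s) to remote memory hurts both latency and bandwidth on this NUMA design.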
[Figure: Nehalem quad-core: cores 0-3, each with L1-I/L1-D and L2 caches, share an L3 cache; an integrated memory controller (IMC) drives three DDR3 channels; two QPI links and an I/O controller hub connect PCI slots, network cards, and disks; each core runs its own software stack (application, file system, communication system)]

Intel Nehalem Architecture

53
Other Possible Platforms

Intel Westmere-EP
Intel Jasper Forest
54
Workload Partitioning (parallelization)

Pipeline
Parallel
Hybrid
55
Questions!