Overview: QUB and ECIT


Sakir Sezer, NTU – Taipei, October 2007

Overview, September 2004


High Performance Network Processing

Sakir Sezer, Research Director – SoC Architectures and Programmable Systems

Institute of Electronics, Communications and Information Technology (ECIT), Queen’s University Belfast, N Ireland, UK

• 161 years old

• 30 schools & 9 research institutes in 3 faculties

• 3,300 staff, including 1,100 academics

• 12,700 undergraduates + 3,700 postgraduates = 16,400 students

• Research + support service income = $55 million p.a.

• Total income = $260 million p.a.

*Engineering - Humanities - Legal, social & educational sciences - Medicine & health sciences - Science & agriculture

Science and Engineering:

• Primary degrees: 1,000

• Master's degrees: 500

• Doctorates: 150

Institute of Electronics, Communications and Information Technology

ESIT


Titanic Quarter $4bn Investment - 20,000 jobs


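Figure: access technologies and service classes, current vs. future. Service classes range from best-effort to real-time interactive services across telecommunication, broadcast (multicast TV, radio) and computer communication. Technologies shown, from fixed to mobile: Dial-up, ADSL, ADSL2+, VDSL, GPON, WIFI, WIMAX, BWA, GSM, GPRS/EDGE, 3G, HS(D/U)PA (3.5G), 3G-LTE, 4G (Mobile-IP), on a bandwidth axis of 10 Kbps, 100 Kbps, 1 Mbps, 10 Mbps and 100 Mbps.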


Current Internet: static content (graphics, text, online library, e-businesses, online banking)

Next Generation Internet: streamed content (IP-TV, VoD, VoIP, HDTV, online gaming, video conferencing)


• Internet traffic is continuously doubling every 12 months

• Emerging services require:

– Higher bandwidth (VoD, DVB-IP, VoIP)

– Quality of Service (assured low end-to-end latency)

– Smaller packet size (reduce end-to-end latency for real-time and interactive services)

– Higher degree of security (Internet banking, Internet shopping, e-business); Internet crime in 2004 was estimated at around $350 billion worldwide, and Internet fraud cost merchants an average of 1.8% of online revenues ($2.6 billion) in 2004

• Network access is eventually expected to become wireless:

– Deployment of new frequency bands

– Space Division Multiple Access (frequency reuse, MIMO, Beam forming)

• Transmission capacity increase requires:

– Complex traffic aggregation and management

– Link bandwidth beyond 10Gbps at the edge and 40-100Gbps at the core.


Figure: technology gap and data processing gap on a 1990-2020 timeline (Source: Roberts – IEEE Computer, Jan 2000). Services move from the mainstream (1997: graphics, text, voice; voice over PSTN) to the frontier (IPTV, IP-HDTV, iTube, IP-Radio, web casting, VoD, 2D/3D multimedia, video telephony, other streamed services).

• Moore's Law: silicon integration capability doubles every 18 months

• Internet traffic doubles every 12 months; the widening distance to Moore's Law is the technology gap

• Data processing demand at network and access nodes doubles every 6-9 months; the widening distance to Moore's Law is the data processing gap
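To make the compounding concrete, here is a minimal Python sketch using the doubling periods quoted above. It assumes 9 months (the slow end of the 6-9 month range) for data processing demand; the function and constant names are illustrative only. Under these assumptions the technology gap itself doubles every 36 months (1/12 - 1/18 = 1/36) and the data processing gap every 18 months.

```python
# Minimal sketch: compound growth of silicon capability, Internet traffic and
# data processing demand, using the doubling periods quoted on the slide.
# Assumption: 9 months for data processing (slow end of the 6-9 month range).

SILICON_DOUBLING_MONTHS = 18     # Moore's Law
TRAFFIC_DOUBLING_MONTHS = 12     # Internet traffic
PROCESSING_DOUBLING_MONTHS = 9   # data processing demand (assumed value)


def growth_factor(years: float, doubling_months: float) -> float:
    """Multiplicative growth after `years` for a given doubling period."""
    return 2.0 ** (years * 12.0 / doubling_months)


def gaps(years: float) -> tuple[float, float]:
    """Return (technology gap, data processing gap) relative to silicon."""
    silicon = growth_factor(years, SILICON_DOUBLING_MONTHS)
    traffic = growth_factor(years, TRAFFIC_DOUBLING_MONTHS)
    processing = growth_factor(years, PROCESSING_DOUBLING_MONTHS)
    return traffic / silicon, processing / silicon


if __name__ == "__main__":
    for years in (3, 6, 10):
        tech_gap, proc_gap = gaps(years)
        print(f"after {years:2d} years: technology gap {tech_gap:6.1f}x, "
              f"data processing gap {proc_gap:6.1f}x")
```

For example, after 3 years silicon capability grows 4x, traffic 8x and processing demand 16x, giving gaps of 2x and 4x.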

• Networks (Edge & Core)

• Data and Network Security, Privacy

• Access Technologies & Multimedia Systems

• Wireless Technology


Challenge the widening technology and data processing gaps that exist for emerging and future services and applications of converging information and communication technologies.

Research targets novel SoC architectures, high-level SoC design methodologies and the programmability of systems, to satisfy the real-time computational and flexibility demands of emerging systems.

The division comprises over 35 academic, research and research-related support staff.

Research areas, positioned from applied to speculative research and from digital to analogue:

• Network Processing (frame processing, HW acceleration of lookup, TCP processing, packet classification, HW-based traffic management and packet scheduling)

• Network Security Processing (IPsec, Deep Packet Inspection, MPSoC security processing, wireless ad hoc network security protocols & architectures)

• Cryptography (public/private key algorithm architectures: AES, SHACAL-2; authentication architectures: SHA-384/SHA-512, Whirlpool; cryptography for constrained environments)

• Communication Signal Processing (processing architectures for MIMO beamforming and smart antennas)

• System Level Design Tools for DSP SoC (rapid prototyping, HW/SW co-design, system-level design capture)

• Video Signal Processing & Analysis (reconfigurable motion estimation for multiple standards, video analysis acceleration)

• Full-custom embedded associative memory circuits


Hardware Architectures for Network Processing

• Internet Traffic Management

• Network Security

Background

• Streaming applications (VoIP, IPTV): reducing packet sizes, rigid delay requirements

• Security threats: viruses, worms, trojans, etc.

• Deep Packet Inspection (DPI) is key

• Privacy and authentication require encryption and decryption, which are computationally intensive

• Networks are moving from unsecured best effort to delivering secure, high-quality, real-time streaming content

• Much of the current focus is on software-based multiple processor cores plus H/W acceleration

• Such solutions are typically unable to keep up with bandwidth demands

• Programmable, hardware-based packet scheduler for high-performance Internet traffic management

– Hardware-based Weighted Fair Queuing (WFQ)

– Packet retrieval using associative memory

– Memory-bandwidth-efficient packet storage

• High-performance pattern matching for Internet packet classification and security

– Novel pattern matching methods

– Novel reconfigurable CAM/TCAM circuits

• Critical limitations of current semiconductor technologies explored

• QoS for real-time interactive services (IPTV, VoIP, online gaming, etc.)

• Network resource utilisation (network bandwidth)

Figure: end-to-end QoS path linking subscriber, WIFI, ISP, core network and services (TV, shop), with DPI and IP traffic management at link rates of ~10-100 Gbps.

Programmable WFQ Packet Scheduler

Figure: block diagram from scheduler input to scheduler output. Packet scheduler: packet classification (traffic flow/class, QoS), finishing tag computation, tag lookup table with write and read control. Shared packet data buffering: shared buffer write control, external shared buffer, packet server.

Developed at ECIT for line rates above 40 Gbps.

• Hardware-based packet scheduler

• Programmable to support a range of scheduling algorithms, including Weighted Fair Queuing (WFQ)

• Scalable from 10 Gbps up to 100 Gbps

• Resource-efficient: link bandwidth, storage (memory)

• Based on available commercial silicon technology, e.g. standard-cell VLSI, FPGA

Scheduler Architecture

• The scheduler computes a “finishing tag” for each packet; the tags determine the order of service

• Associative memory returns the smallest available finishing tag within a guaranteed time at line speed

• Uses a linked list structure

• New tags are sorted using a lookup tree (trie) with a translation table

• Separating search from data storage allows the lookup function to be implemented very efficiently in H/W

• The matching circuit uses select and look-ahead logic

• Architecture demonstrated on an Altera Stratix II FPGA at 12.5 Gbps
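The finishing-tag idea can be illustrated with a minimal software sketch in Python; it is not the ECIT hardware. It assumes the textbook WFQ tag F = max(F_prev, V) + L / w per flow and a self-clocked (SCFQ-style) approximation of the virtual time V, i.e. V is the tag of the packet most recently selected for service; the slides do not say how the hardware tracks virtual time. A binary heap stands in for the associative memory that returns the smallest tag, and `WFQScheduler`, `enqueue` and `dequeue` are hypothetical names.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class TaggedPacket:
    finish_tag: float
    seq: int                               # arrival order, breaks ties between equal tags
    flow: str = field(compare=False)
    length: int = field(compare=False)     # bytes


class WFQScheduler:
    """Software model of WFQ finishing-tag scheduling (hypothetical, not the ECIT design)."""

    def __init__(self, weights: dict[str, float]) -> None:
        self.weights = weights
        self.last_finish = {flow: 0.0 for flow in weights}  # tag of each flow's previous packet
        self.virtual_time = 0.0                             # SCFQ-style: tag of last packet served
        self.queue: list[TaggedPacket] = []                 # min-heap stands in for associative memory
        self._seq = itertools.count()

    def enqueue(self, flow: str, length: int) -> float:
        """Compute F = max(F_prev, V) + L / w, store the packet and return F."""
        start = max(self.last_finish[flow], self.virtual_time)
        finish = start + length / self.weights[flow]
        self.last_finish[flow] = finish
        heapq.heappush(self.queue, TaggedPacket(finish, next(self._seq), flow, length))
        return finish

    def dequeue(self) -> TaggedPacket:
        """Serve the packet with the smallest finishing tag and advance virtual time."""
        packet = heapq.heappop(self.queue)
        self.virtual_time = packet.finish_tag
        return packet


if __name__ == "__main__":
    sched = WFQScheduler({"video": 3.0, "data": 1.0})
    for _ in range(3):
        sched.enqueue("video", 300)
        sched.enqueue("data", 300)
    print([sched.dequeue().flow for _ in range(6)])
```

With weights of 3:1 and equal 300-byte packets, the video flow receives three times the service of the data flow while both are backlogged; equal tags are served in arrival order.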

Cadence Encounter, UMC 130nm

Clock frequency: 143 MHz

Number of IOs: 478 pins

Total area: 14.4 mm²

Number of packets: external DDR, up to 30 million packets supported

Throughput: 35.8M packets/sec

Throughput: ~40 Gbps line rate (assuming a mean IP packet size of 130 bytes)

Figure: WFQ Packet Scheduler layout (UMC 130nm) showing the address translation table and search trie memory; 90% distributed embedded memory.

Patent Pending
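One way to picture a search trie with a translation table is the hypothetical Python model below; it is not the patented circuit. It assumes finishing tags are quantised to fixed-width integers (8 bits by default), keeps per-node occupancy counts in an implicit binary tree so that finding the smallest tag always takes exactly `width` steps, and uses a dictionary of `deque` FIFOs as the translation table, with each `deque` standing in for the per-tag linked list of packets. `MinTagTrie`, `insert` and `pop_min` are illustrative names.

```python
from collections import deque


class MinTagTrie:
    """Hypothetical model of a search trie plus translation table for finishing tags."""

    def __init__(self, width: int = 8) -> None:
        self.width = width                     # bits per (quantised) finishing tag
        self.leaf_base = 1 << width            # leaves sit at indices 2**width .. 2**(width+1)-1
        self.count = [0] * (2 << width)        # occupancy per trie node; node 1 is the root
        self.table: dict[int, deque] = {}      # translation table: tag -> FIFO of packets

    def insert(self, tag: int, packet: object) -> None:
        if not 0 <= tag < self.leaf_base:
            raise ValueError("tag out of range")
        node = self.leaf_base + tag
        while node:                            # increment counts from leaf up to the root
            self.count[node] += 1
            node //= 2
        self.table.setdefault(tag, deque()).append(packet)

    def pop_min(self) -> tuple[int, object]:
        if self.count[1] == 0:
            raise IndexError("no tags stored")
        node = 1
        for _ in range(self.width):            # one step per tag bit: fixed search time
            left = 2 * node
            node = left if self.count[left] else left + 1
        tag = node - self.leaf_base
        packets = self.table[tag]
        packet = packets.popleft()             # oldest packet with this tag first
        if not packets:
            del self.table[tag]
        while node:                            # decrement counts on the same path
            self.count[node] -= 1
            node //= 2
        return tag, packet
```

Because the search descends one tag bit per step, lookup time depends only on the tag width and not on the number of stored packets, which is the property that lets a hardware trie guarantee service at line speed.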

• Scalable to meet next-generation 100 Gbps line rates

• Scalable to support more than 1 million flows (virtual queues), each flow with its own weight

• Can outperform traditional software/NPU solutions by more than 5X

• Power dissipation lower by a factor of 100

• Circuit achieves up to 99.8% accuracy relative to the theoretical WFQ algorithm

• Many traffic management applications (core, edge, access)

• Enables customized service differentiation


Shared Packet Data Buffering

Memory Bottleneck

• Beyond 10 Gbps, SRAM-based memory is attractive in terms of speed and latency

• However, its high cost and low capacity make it unsuitable

• DDR II/III offers high density at lower cost, but suffers from random access latency

• RLDRAM (Reduced Latency DRAM) is attractive: lower random access latency, better but not ideal

• Challenge: optimisation of a shared buffer architecture for 20 Gbps based on RLDRAM

• Optimise the memory architecture to make the most efficient use of memory bandwidth and storage, taking packet size into account

• RLDRAM (Reduced Latency DRAM) multi-bank technology plus FPGA solution

• Memory space utilisation >90%

• Scalable to a wide range of network processing applications (traffic management, classification, security, etc.)

• Showed that 20 Gbps packet buffering is possible using RLDRAM with separate I/O
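Why packet size matters for storage and bandwidth efficiency can be shown with a back-of-envelope Python sketch. It assumes, hypothetically, that the buffer stores each packet in fixed-size memory segments (the slides give neither ECIT's segment size nor its layout; 64 bytes below is illustrative) and that a shared buffer must write and then read every packet once, so raw memory bandwidth is at least twice the line rate.

```python
import math


def segment_utilisation(packet_bytes: int, segment_bytes: int = 64) -> float:
    """Fraction of the allocated segments that actually holds packet data."""
    segments = math.ceil(packet_bytes / segment_bytes)
    return packet_bytes / (segments * segment_bytes)


def raw_bandwidth_gbps(line_rate_gbps: float, utilisation: float) -> float:
    """Memory bandwidth needed to write and read every packet once at line rate,
    inflated by the unused bytes in partially filled segments."""
    return 2.0 * line_rate_gbps / utilisation


if __name__ == "__main__":
    for size in (64, 65, 130, 1500):
        u = segment_utilisation(size)
        print(f"{size:5d}-byte packets: utilisation {u:6.1%}, "
              f"raw bandwidth for 20 Gbps {raw_bandwidth_gbps(20.0, u):5.1f} Gbps")
```

Under these assumptions a 65-byte packet leaves its two 64-byte segments only about 51% utilised, and a 20 Gbps buffer would need roughly 79 Gbps of raw memory bandwidth if every packet were that size; keeping utilisation above 90% avoids most of that overhead.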

Figure: FPGA shared buffer.

• There appears to be a lack of suitable memory technology to meet these emerging requirements

• Packets are becoming increasingly small in order to reduce overall end-to-end latency

• Memory utilisation is typically traded off against memory access latency to achieve performance

• However, this approach is ultimately limited in meeting future storage and access latency requirements

Pattern Matching for Deep Packet Inspection

• Real-time pattern matching for virus, worm, trojan, spam and intrusion detection at >10 Gbps

• Hybrid pattern matching methods used

• Combining embedded memory, reconfigurable logic and SoC technology

• Explored trade-offs and limitations of established parallel matching methods, including:

– Hash tables

– Content Addressable Memory

Deep Packet Inspection (DPI)

Figure: a DPI engine between the Internet (EU, UK) and a vulnerable computer.

• Checks packet header and payload for suspect content

• Flexible string matching for payload inspection is the most computationally expensive aspect of DPI

• An efficient string matching scheme achieves constant lookup time, O(1), on each input data chunk

Figure: hybrid Hash/CAM matching circuit (input, hash function, dual RAM and duplicate RAM holding pattern IDs, equality comparators, CAM, delays, MUX; outputs: 2-bit match and ID#).

• Hash and CAM circuits execute the lookup operation simultaneously

• The dual RAM establishes a dual-entry hash table for each hash key

• Expected collisions in the hashing module are stored in the Content Addressable Memory (CAM)

• Pipelined matching achieves O(1) lookup performance at low cost
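The hybrid scheme can be modelled in a few lines of Python. This is a behavioural sketch under stated assumptions, not the ECIT circuit: CRC-32 (`zlib.crc32`) stands in for the unspecified hash function, each bucket holds at most two entries to mirror the dual-entry RAM, overflowing entries go to a small CAM modelled as a dictionary, and only exact matches of aligned input chunks are handled. `HashCamMatcher`, `add` and `lookup` are hypothetical names.

```python
import zlib
from typing import Optional


class HashCamMatcher:
    """Behavioural model of a hybrid hash/CAM exact-match table (hypothetical)."""

    def __init__(self, num_buckets: int = 256, cam_capacity: int = 16) -> None:
        # Each bucket holds at most two (pattern, id) pairs, like a dual-entry RAM.
        self.buckets: list[list[tuple[bytes, int]]] = [[] for _ in range(num_buckets)]
        self.cam: dict[bytes, int] = {}        # overflow store for hash collisions
        self.cam_capacity = cam_capacity

    def _bucket(self, data: bytes) -> list[tuple[bytes, int]]:
        # CRC-32 is an arbitrary stand-in for the circuit's hash function.
        return self.buckets[zlib.crc32(data) % len(self.buckets)]

    def add(self, pattern: bytes, pattern_id: int) -> None:
        bucket = self._bucket(pattern)
        if len(bucket) < 2:
            bucket.append((pattern, pattern_id))
        elif len(self.cam) < self.cam_capacity:
            self.cam[pattern] = pattern_id     # third and later colliding entries go to the CAM
        else:
            raise OverflowError("bucket and CAM are both full")

    def lookup(self, chunk: bytes) -> Optional[int]:
        # Hardware reads the bucket and searches the CAM in parallel;
        # here the two checks are simply done one after the other.
        for pattern, pattern_id in self._bucket(chunk):
            if pattern == chunk:
                return pattern_id
        return self.cam.get(chunk)
```

Lookup work is bounded by two bucket comparisons plus one CAM search, whatever the number of stored patterns; in hardware the two paths run in parallel, which is where the constant O(1) lookup time comes from.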

• Demonstration of a prototype pattern matching circuit

• Hash/CAM hybrid architecture integrated on a single device using standard FPGA technology, with constant lookup time of O(1)

• Performance comparable to purely CAM-based circuits at a fraction of the CAM circuit cost

• Throughput of 13.7 Gbps (128-bit data path at 107 MHz) for approximately 1000 patterns

• Larger FPGA devices: >10K patterns at a 10 Gbps line rate

Deep Packet Inspection on High Speed Networks

Altera Stratix II device:

– 64-bit data path

– 500 MHz internal memory access

– 120 MHz data sampling rate

=> 64 × 80 MHz = 5.12 Gbps

– 128-bit data path

– 400 MHz internal memory access

– 100 MHz data sampling rate

=> 128 × 107 MHz = 13.7 Gbps

Constant lookup time O(1)

Performance equivalent to searching 300 Yellow Pages books for 5000 different business names per second!
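The throughput figures follow from data-path width × clock rate, assuming one data-path word is processed per clock cycle. A tiny Python helper (the name `throughput_gbps` is hypothetical) reproduces them from the clock rates used in the calculations above.

```python
def throughput_gbps(datapath_bits: int, clock_mhz: float) -> float:
    """Throughput in Gbps when one data-path word is processed per clock cycle."""
    return datapath_bits * clock_mhz * 1e6 / 1e9


if __name__ == "__main__":
    for bits, mhz in [(64, 80.0), (128, 107.0)]:
        print(f"{bits}-bit @ {mhz:.0f} MHz -> {throughput_gbps(bits, mhz):.2f} Gbps")
```

This prints 5.12 Gbps for the 64-bit configuration and 13.70 Gbps for the 128-bit one.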

• Function comparable to purely CAM-based circuits, but at a fraction of the CAM circuit cost

• However, even with only 64 entries, the embedded CAM requires 75% of the register/logic resources when implemented using FPGA LUTs or standard cells

• CAM hardware can be minimised by trading it off against hash memory

• However, this becomes increasingly expensive as the number of patterns to match increases

• Therefore, full-custom CAM is needed for area efficiency and high performance

Configurable Content Addressable Memory Architectures

Configurable CAM (CCAM) Cell

Figure: CCAM cell schematic with two SRAM storage cells (CAM1, CAM2), comparison logic and a select input (Sel); signals BL, BLN, WL, MWL, ML, MML.

• UMC 130nm CMOS technology

• 2 metal layers (M1/M2)

• Cell area = 4.47 µm × 9.07 µm

• Two bits per CCAM cell; 20 transistors

• Operating modes: Sel = 1 => BiCAM, Sel = 0 => TCAM
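The two operating modes can be illustrated with a behavioural model in Python. The bit encoding is an assumption, not taken from the cell design: in BiCAM mode (Sel = 1) both stored bits (CAM1, CAM2) act as independent binary CAM bits, while in TCAM mode (Sel = 0) CAM1 holds the value and CAM2 acts as a "don't care" mask (1 = ignore). `row_match` is a hypothetical helper that reports whether a row's match line would stay high.

```python
def row_match(cells: list[tuple[int, int]], key: list[int], sel: int) -> bool:
    """Return True if a row of CCAM cells matches `key` (match line stays high).

    Each cell stores two bits (cam1, cam2). Assumed encoding:
      sel == 1 (BiCAM): both bits are ordinary binary CAM bits, so the row
                        holds 2 * len(cells) bits and `key` needs as many.
      sel == 0 (TCAM):  cam1 is the stored value and cam2 a "don't care"
                        mask (1 = ignore), so `key` needs one bit per cell.
    """
    if sel == 1:
        stored = [bit for pair in cells for bit in pair]
        if len(key) != len(stored):
            raise ValueError("BiCAM mode expects two key bits per cell")
        return stored == key
    if sel == 0:
        if len(key) != len(cells):
            raise ValueError("TCAM mode expects one key bit per cell")
        return all(mask == 1 or value == bit for (value, mask), bit in zip(cells, key))
    raise ValueError("sel must be 0 or 1")
```

The same three cells hold six binary bits in BiCAM mode but only three ternary bits in TCAM mode, which is the capacity trade-off the configurable cell exposes.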

Configurable CAM

• Custom-designed embedded associative memory architecture

– Lookup, search/sort, indexing, classification, pattern matching

• CAM/TCAM cell for the design of a configurable memory array

• Supports basic SRAM, CAM and TCAM memory types

• Cell circuit cost equal to that of a TCAM cell

• If not used as a TCAM, the “don’t care” mask circuit can be used as an SRAM or CAM cell, thus doubling SRAM or CAM capacity

• (Embedded) memory density comparable to stand-alone memory chips

• Configurable to create custom-purpose concatenated memory arrays, i.e. trading off memory width against memory depth

• Can be optimized for low power, area or performance

• CCAM cell area of 4.3 µm × 8.3 µm (UMC 130nm)

• Simulated access times:

– WR access: 2.28 ns, RD access: 2.5 ns, search access: 2.5 ns

• Worst-case match-line delay = 0.298 ns

• Clock cycle = 2.5 ns => clock frequency 400 MHz

• 64 × 128 CCAM block designed for evaluation

• Cell array can be configured as SRAM, CAM, or local/global masked TCAM

Full-custom design:

– TCAM cell: 21 transistors, area 4.3 × 8.3 µm², CLK 400 MHz

– 64 × 128 block TCAM: 265,862 transistors, area 0.47 mm², CLK 320 MHz

FPGA design:

– TCAM cell: 3 ALUTs, 2 registers

– 128 × 64 block CAM: 8,891 ALUTs, 8,265 registers, 1 M4K RAM; CLK 107 MHz (pipelined)

Figure: 64-entry, 128-bit TCAM block with CAM data input (128 bits), write address (6 bits) and decoder, combinational logic, 128 × 64 TCAM cells, RAM and a 6-bit match ID output; cell signals include data write enable, mask write enable, clk, reset_N, input data and match.
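A search over a block like this can be sketched in Python as a 64-entry, 128-bit ternary table whose output is a 6-bit match ID. Assumptions not stated in the slides: each entry stores a value plus a "don't care" mask, the hardware compares all rows in parallel (a loop here), and a priority encoder reports the lowest matching address. `TCAMBlock`, `write` and `search` are hypothetical names.

```python
from typing import Optional

DEPTH = 64                     # entries in the block -> 6-bit match ID
WIDTH = 128                    # bits per entry
ALL_ONES = (1 << WIDTH) - 1


class TCAMBlock:
    """Behavioural model of a 64 x 128 TCAM block with a priority encoder (hypothetical)."""

    def __init__(self) -> None:
        self.value = [0] * DEPTH
        self.dont_care = [0] * DEPTH        # 1 bits are ignored during search
        self.valid = [False] * DEPTH

    def write(self, addr: int, value: int, dont_care: int = 0) -> None:
        if not 0 <= addr < DEPTH:
            raise ValueError("address needs 6 bits")
        if not (0 <= value <= ALL_ONES and 0 <= dont_care <= ALL_ONES):
            raise ValueError("value and mask must fit in 128 bits")
        self.value[addr], self.dont_care[addr], self.valid[addr] = value, dont_care, True

    def search(self, key: int) -> Optional[int]:
        """Return the lowest matching address (the match ID), or None if no row matches."""
        for addr in range(DEPTH):           # hardware compares all rows at once
            care = ALL_ONES ^ self.dont_care[addr]
            if self.valid[addr] and ((key ^ self.value[addr]) & care) == 0:
                return addr
        return None
```

For example, an entry that only cares about the top byte and an exact entry can both match the same key; the priority encoder then reports the lower address, formatted as a 6-bit ID with `format(addr, "06b")`.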


• Optimising the arrangement and configuration of clusters of CCAM banks into custom-purpose associative memory structures

• Reducing power dissipation at CCAM bank level by targeting the priority decoder and pre-charge circuitry

• Interconnect and interface technology for on-chip distributed multipurpose memory blocks


• Novel architectures and design studies for Network Processing

• Hardware parallelism allows scaling of functions beyond 40Gbps

• Costly, however, if distributed embedded memory is required (as is the case here)

• Conventional SRAM-based fast memory technology is expensive


• Current memory technology does not really support future embedded or external fast memory requirements

• PC-driven DDR II/III technology provides a low-cost, fast and dense alternative, but is hindered by unacceptable random access latency

• RLDRAM technology partially meets latency and density requirements in emerging applications (10-20Gbps shared buffer design study)

• FPGA technology is versatile for constructing such circuits, but limited by embedded memory size and configurable logic resources

• Introduced a novel hybrid Hash/CAM pattern matching circuit

• Configurable CAM (CCAM) cell can perform CAM, SRAM and TCAM functions

• Offers low lookup/search latency and memory cost

Network processing involves intensive, data-dependent memory operations:

• Switching, routing, content inspection, classification and protocol processing (FSM)

External solid-state memory technology:

• Increase of memory density

• Significant increase of memory bandwidth (packaging, interface)

• Reduced random access latency

Embedded on-chip memory technology:

• Variety of custom-purpose embedded SRAM technologies optimised for power, performance, density and latency

• Advanced and easily deployable embedded DRAM technology (optimised for power and density)

• Wide variety of custom-purpose configurable associative memory technology (full-custom)

On-chip/off-chip interconnect technology:

• Advancement of on-chip bus interconnect (higher bus bandwidth, wider buses)

• On-chip high-bandwidth memory interconnect

– > 1 Terabit/sec

• Novel low-power, high-performance off-chip interconnect


Questions?