overview: qub and ecit

Sakir Sezer, NTU – Taipei, October 2007

Overview, September 2004

Overview: QUB and ECIT

High Performance Network Processing

Sakir Sezer

Research Director – SoC Architectures and Programmable Systems

Institute of Electronics, Communications and Information Technology

Queen’s University Belfast, N Ireland, UK


• 161 years old

• 30 schools & 9 research institutes in 3 faculties

• 3,300 staff, including 1,100 academics

• 12,700 undergraduates + 3,700 postgraduates = 16,400 students

• Research + support service income = $55 million p.a.

• Total income = $260 million p.a.

*Engineering - Humanities - Legal, social & educational sciences - Medicine & health sciences - Science & agriculture



Science and Engineering

Primary Degrees 1,000

Master Degrees 500

Doctorates 150


Institute of Electronics Communications and Information Technology



ESIT


Titanic Quarter $4bn Investment - 20,000 jobs



[Figure: services and access technologies, current to future]

Services: best-effort services; real-time interactive services

Networks: telecommunication; broadcast, multicast (TV, radio); computer communication

Mobile: GSM, GPRS/EDGE, 3G, HS(D/U)PA (3.5G), 3G-LTE, 4G - Mobile-IP

Fixed: dial-up, ADSL, ADSL2+, VDSL, GPON

Wireless: WIFI, WIMAX, BWA

Bandwidth axis: 10 Kbps, 100 Kbps, 1 Mbps, 10 Mbps, 100 Mbps



Current Internet: static content (graphics, text, online library, e-businesses, online banking)

Next Generation Internet: streamed content (IP-TV, VoD, VoIP, HDTV, online gaming, video conferencing)



• Internet traffic is doubling every 12 months

• Emerging services require:

– Higher bandwidth (VoD, DVB-IP, VoIP)

– Quality of Service (assured low end-to-end latency)

– Smaller packet size (reduce end-to-end latency for real-time and interactive services)

– Higher degree of security (Internet banking, Internet shopping, e-business). Internet crime in 2004 was estimated at around $350 billion worldwide; Internet fraud cost merchants an average of 1.8% of online revenues, or $2.6 billion, in 2004.

• Network access is expected to become wireless eventually

– Deployment of new frequency bands

– Space Division Multiple Access (frequency reuse, MIMO, Beam forming)

• Transmission capacity increase requires:

– Complex traffic aggregation and management.

– Link bandwidth beyond 10Gbps at the edge and 40-100Gbps at the core.


[Figure: technology gap and data processing gap, 1990 to 2020]

Chart labels: Mainstream (graphics, text, voice); 1997; Voice (PSTN); Frontier (IPTV, IP-HDTV, iTube, IP-Radio, web casting, VoD, 2D/3D multimedia, video telephony, other streamed services)

• Moore’s Law: silicon integration capability doubles every 18 months

• Technology gap: Internet traffic doubles every 12 months

• Data processing gap: data processing demand at network and access nodes doubles every 6-9 months

Source: Robert – IEEE Computer - Jan 2000
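
As a rough illustration of how quickly these gaps widen, the doubling periods quoted on this slide can be turned into growth factors. The sketch below is only arithmetic on those periods; the function name `growth_factor` and the 10-year horizon are illustrative assumptions, not part of the original talk.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Growth multiple after `years` for a quantity that doubles every `doubling_months`."""
    return 2.0 ** (years * 12.0 / doubling_months)


YEARS = 10  # illustrative horizon (assumption, not from the slides)

silicon = growth_factor(YEARS, 18)      # Moore's Law: doubles every 18 months
traffic = growth_factor(YEARS, 12)      # Internet traffic: doubles every 12 months
demand_slow = growth_factor(YEARS, 9)   # processing demand, 9-month doubling
demand_fast = growth_factor(YEARS, 6)   # processing demand, 6-month doubling

print(f"silicon capability: x{silicon:,.0f}")
print(f"internet traffic:   x{traffic:,.0f}")
print(f"processing demand:  x{demand_slow:,.0f} to x{demand_fast:,.0f}")
print(f"traffic / silicon:  x{traffic / silicon:,.1f}")
```

Under these assumptions, traffic outgrows silicon capability by roughly an order of magnitude per decade, and processing demand by far more, which is the motivation for the dedicated hardware architectures discussed later.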



Networks (Edge & Core)

Data and Network Security, Privacy

Access Technologies & Multimedia Systems Wireless Technology



Challenge the widening technology and data processing gaps that exist for emerging and future services and applications of converging information and communication technologies.

Research targets novel SoC architectures, high-level SoC design methodologies and programmability of systems to satisfy the real-time computational and flexibility demands of emerging systems.

The division comprises over 35 academic, research and research-related support staff.



Axes: Applied Research to Speculative Research; Digital to Analogue

• Network Processing (frame processing, HW acceleration of lookup, TCP processing, packet classification, HW-based traffic management, packet scheduling)

• Network Security Processing (IP-Sec, Deep Packet Inspection, MPSoC security processing, wireless ad hoc network security protocols & architectures)

• Cryptography (public/private key algorithm architectures: AES, SHACAL-2; authentication architectures: SHA-384/SHA-512, Whirlpool; cryptography for constrained environments)

• Communication Signal Processing (processing architectures for MIMO beamforming and smart antenna)

• System Level Design Tools for DSP SoC (rapid prototyping, HW/SW co-design, system-level design capture)

• Video Signal Processing & Analysis (reconfigurable motion estimation for multi-standards, video analysis acceleration)

• Full-custom embedded associative memory circuits


Hardware Architectures for Network Processing

• Internet Traffic Management

• Network Security



• Streaming applications (VoIP, IPTV) are reducing packet sizes and impose rigid delay requirements

• Security threats: viruses, worms, trojans, etc.

• Deep Packet Inspection (DPI) is key

• Privacy and authentication require encryption and decryption, which are computationally intensive

• Networks are moving from unsecured best effort to delivering secure, high-quality, real-time streaming content

• Much of the current focus is on software-based multiple processor cores plus H/W acceleration

• These are typically unable to keep up with bandwidth demands



• Programmable, hardware-based packet scheduler for high-performance Internet traffic management

– Hardware-based Weighted Fair Queuing (WFQ)

– Packet retrieval using associative memory

– Memory bandwidth: efficient packet storage

• High-performance pattern matching for Internet packet classification and security

– Novel pattern matching methods

– Novel reconfigurable CAM/TCAM circuits

• Critical limitations of current semiconductor technologies explored



• QoS for real-time interactive services - IPTV, VoIP, On-line Gaming etc

• Network resource utilisation (network bandwidth)

[Figure labels: Subscriber, WIFI, TV, Shop, ISP, Core Network; DPI and IP Traffic Management; link rate ~10-100 Gbps; End-to-End QoS]



[Block diagram: packet scheduler with shared packet data buffering, developed at ECIT for line rates above 40 Gbps]

Blocks: Packet Classification (traffic flow/class, QoS); Finishing Tag Computation; Tag Lookup Table with Write Control and Read Control; Shared Buffer Write Control; External Shared Buffer; Packet Server

Ports: scheduler input, scheduler output



• Hardware-based packet scheduler

• Programmable to support a range of scheduling algorithms, including Weighted Fair Queuing (WFQ)

• Scalable from 10 Gbps up to 100 Gbps

• Resource efficient: link bandwidth, storage (memory)

• Based on available commercial silicon technology

– e.g. standard-cell VLSI, FPGA



• Scheduler determines a “finishing tag” for each packet -> order of service

• Associative memory returns the smallest available “finishing tag” in a guaranteed time at line speed

• Uses a linked-list structure

• New tags are sorted using a look-up tree (trie) with a translation table

• Separation of search and data storage allows the look-up function to be implemented very efficiently in H/W

• Matching circuit: select and look-ahead used

• Architecture demonstrated using an Altera Stratix II FPGA -> 12.5 Gbps
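
The finishing-tag idea can be sketched in software. The example below is a minimal illustration, not the ECIT circuit: a Python min-heap stands in for the associative memory and trie that return the smallest tag, and virtual time is simplified to the tag of the last packet served (a self-clocked approximation rather than exact WFQ virtual time). The class name `WfqScheduler`, the flow names and the weights are all hypothetical.

```python
import heapq
from collections import defaultdict


class WfqScheduler:
    """Software sketch of tag-based weighted fair queuing (illustrative only)."""

    def __init__(self, weights):
        self.weights = weights               # flow id -> weight (share of the link)
        self.last_tag = defaultdict(float)   # last finishing tag issued per flow
        self.virtual_time = 0.0              # simplified: tag of the last packet served
        self.heap = []                       # min-heap stands in for the associative memory
        self.seq = 0                         # tie-breaker: FIFO order for equal tags

    def enqueue(self, flow, length_bytes):
        # A packet starts no earlier than "now" or its flow's previous finish;
        # its service time is scaled down by the flow weight.
        start = max(self.virtual_time, self.last_tag[flow])
        tag = start + length_bytes / self.weights[flow]
        self.last_tag[flow] = tag
        heapq.heappush(self.heap, (tag, self.seq, flow, length_bytes))
        self.seq += 1
        return tag

    def dequeue(self):
        # Serve the packet with the smallest finishing tag.
        tag, _, flow, length_bytes = heapq.heappop(self.heap)
        self.virtual_time = tag
        return flow, length_bytes, tag


sched = WfqScheduler({"a": 2.0, "b": 1.0})
for _ in range(2):
    sched.enqueue("a", 100)
    sched.enqueue("a", 100)
    sched.enqueue("b", 120)

order = [sched.dequeue()[0] for _ in range(6)]
print(order)
```

In this toy run flow "a", with twice the weight, is served about twice as often as flow "b". A heap costs O(log n) per operation, whereas the slides describe hardware that returns the smallest tag in a guaranteed time at line speed.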


Implementation: Cadence Encounter, UMC 130nm

• Clock frequency: 143 MHz

• Number of IOs: 478 pins

• Total area: 14.4 mm²

• Number of packets: up to 30 million packets supported in external DDR

• Throughput: 35.8M packets/sec

• Throughput: ~40 Gbps line rate (assuming mean IP packet of 130 bytes)

• 90% distributed embedded memory

Layout labels: Address Translation Table, Search Trie Memory

Patent pending
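
The quoted line rate follows from the packet rate and the assumed mean packet size. The check below uses only figures from this slide; the cycles-per-packet value is an inference from dividing the clock frequency by the packet rate, not a number stated in the talk.

```python
PACKETS_PER_SEC = 35.8e6   # scheduler throughput from the slide
MEAN_PACKET_BYTES = 130    # mean IP packet size assumed on the slide
CLOCK_HZ = 143e6           # reported clock frequency

line_rate_gbps = PACKETS_PER_SEC * MEAN_PACKET_BYTES * 8 / 1e9
cycles_per_packet = CLOCK_HZ / PACKETS_PER_SEC

print(f"line rate: {line_rate_gbps:.1f} Gbps")
print(f"clock cycles per scheduled packet: {cycles_per_packet:.2f}")
```

This gives about 37 Gbps, consistent with the "~40 Gbps" on the slide, and suggests the scheduler completes a packet roughly every four clock cycles.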



• Scalable to meet Next Generation 100 Gbps line-rate.

• Scalable to support beyond 1 million flows (virtual queues), each flow with unique weight properties.

• Can outperform traditional software/NPU solutions by > 5X

• Power dissipation lower by a factor of 100

• Circuit achieves up to 99.8% accuracy relative to the theoretical WFQ algorithm

• Many traffic management applications (core, edge, access)

• Enables customized service-differentiation


Shared Packet Data Buffering



• Beyond 10 Gbps, SRAM-based memory is attractive in terms of speed and latency

• But its high cost and low capacity make it unsuitable

• DDR II/III offers high density at lower cost, but suffers from random access latency

• RLDRAM (Reduced Latency DRAM) is attractive: lower random access latency, better but not ideal

• Challenge: optimisation of a shared buffer architecture for 20 Gbps based on RLDRAM



• Optimise memory capability to make the most efficient use of memory bandwidth and storage in terms of packet size

• RLDRAM (Reduced Latency DRAM) multi-bank technology plus FPGA solution

• Memory space utilisation >90%

• Scalable to a wide range of network processing applications (traffic management, classification, security, etc.)

• 20 Gbps packet buffering shown to be possible using RLDRAM with separate I/O
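
One way to see why packet size matters for storage efficiency is to model the buffer as fixed-size cells, where a packet that does not fill its last cell wastes the remainder. This is an assumption made for illustration; the talk does not describe how the ECIT buffer segments packets, and the traffic mix and cell sizes below are hypothetical.

```python
import math


def storage_utilisation(packet_sizes, cell_bytes):
    """Fraction of allocated buffer memory that holds real packet data."""
    used = sum(packet_sizes)
    allocated = sum(math.ceil(size / cell_bytes) * cell_bytes for size in packet_sizes)
    return used / allocated


# Hypothetical traffic mix (bytes) and candidate cell sizes; not measured data.
packets = [40, 64, 130, 576, 1500]

for cell in (16, 32, 64, 128):
    print(f"{cell:>3}-byte cells: {storage_utilisation(packets, cell):.1%} utilised")

# A shared buffer writes and reads every packet once, so the memory must
# sustain at least twice the line rate.
LINE_RATE_GBPS = 20
min_memory_bandwidth_gbps = 2 * LINE_RATE_GBPS
print(f"minimum memory bandwidth: {min_memory_bandwidth_gbps} Gbps")
```

In this toy mix, smaller cells waste less space but mean more memory accesses per packet, which is the utilisation-versus-latency trade-off the slides refer to.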


[Figure: FPGA shared buffer]



• There appears to be a lack of suitable memory technology to meet these emerging requirements

• Packets are becoming increasingly smaller in order to reduce overall end-to-end latency

• Memory utilisation is typically traded off against memory access latency to achieve performance

• However, this approach is ultimately limited in meeting future storage and access latency requirements


Pattern Matching for Deep Packet Inspection



• Real-time pattern matching for virus, worm, trojan, spam and intrusion detection at >10 Gbps

• Hybrid pattern matching methods used

• Combining embedded memory, reconfigurable logic and SoC technology.

• Explored trade-offs and limitations of established parallel matching methods, including:

– Hash Tables

– Content Addressable Memory


Deep Packet Inspection (DPI)

[Figure labels: Internet EU, Internet UK, DPI Engine, Vulnerable Computer]

• Checks for suspect content in packet header and payload

• Flexible string matching on payload inspection - most computationally expensive aspect of DPI

• Efficient string matching scheme achieves constant lookup time O(1) on each input data-chunk



[Block diagram labels: Input, Hash Function, Dual RAM (ID#), Duplicate RAM, comparators (=), Delays, CAM (Match, ID#), MUX, Match (2 bits), ID#]

• Hash and CAM circuits execute look-up operation simultaneously

• Dual-RAM establishes dual-entry hash table for each hash key

• Expected collisions in hashing module are stored in the Content Addressable Memory (CAM)

• Pipelined matching with O(1) lookup performance can be achieved at low cost.
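
The hybrid scheme can be modelled in a few lines of software. The sketch below is a toy model under stated assumptions, not the ECIT design: CRC32 stands in for the hardware hash function, a Python dict stands in for the CAM, and the class name `HashCamMatcher`, the bucket count and the CAM capacity are hypothetical.

```python
import zlib


class HashCamMatcher:
    """Toy model of a dual-entry hash table backed by a small overflow CAM."""

    def __init__(self, num_buckets=256, cam_capacity=64):
        self.num_buckets = num_buckets
        self.cam_capacity = cam_capacity
        self.buckets = [[] for _ in range(num_buckets)]  # at most 2 (pattern, id) per bucket
        self.cam = {}                                    # overflow pattern -> id

    def _index(self, pattern: bytes) -> int:
        # CRC32 is only a stand-in for the hardware hash function.
        return zlib.crc32(pattern) % self.num_buckets

    def add(self, pattern: bytes, pattern_id: int) -> None:
        bucket = self.buckets[self._index(pattern)]
        if len(bucket) < 2:
            bucket.append((pattern, pattern_id))
        elif len(self.cam) < self.cam_capacity:
            self.cam[pattern] = pattern_id               # further collisions go to the CAM
        else:
            raise OverflowError("CAM full: more collisions than the CAM can hold")

    def lookup(self, chunk: bytes):
        # Hardware probes both structures in parallel; here they run back to back.
        # Either way the work per chunk is bounded: two comparisons plus one CAM probe.
        for pattern, pattern_id in self.buckets[self._index(chunk)]:
            if pattern == chunk:
                return pattern_id
        return self.cam.get(chunk)


matcher = HashCamMatcher(num_buckets=1, cam_capacity=4)  # one bucket forces collisions
for i, sig in enumerate([b"worm0001", b"virus002", b"trojan03"]):
    matcher.add(sig, i)

print(matcher.lookup(b"trojan03"), matcher.lookup(b"harmless"))
```

Because each bucket holds at most two entries and everything else lives in the CAM, the lookup cost does not grow with the number of stored patterns. The CAM only has to be large enough for the expected collisions.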



• Demonstration of prototype pattern matching circuit

• Hash/CAM hybrid architecture integrated on a single device using standard FPGA technology

• Constant look-up time of O(1)

• Comparable to purely CAM-based circuits, at a fraction of the CAM circuit cost

• Throughput rate of 13.7 Gbps (128-bit data-path at 107 MHz) for approximately 1000 patterns

• Larger FPGA device: >10K pattern matches at 10 Gbps line rate


Altera Stratix II device

64-bit data-path:

– 500 MHz internal memory access

– 120 MHz data sampling rate

=> 64 x 80 MHz = 5.12 Gbps

128-bit data-path:

– 400 MHz internal memory access

– 100 MHz data sampling rate

=> 128 x 107 MHz = 13.7 Gbps

Constant look-up time O(1)

Performance equivalent to searching 300 Yellow Pages books for 5000 different business names per second!
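
The throughput figures are data-path width multiplied by the rate at which words are accepted. The check below assumes one word per clock cycle and uses the clock values that appear in the slide's own products (80 MHz and 107 MHz); the helper name `datapath_gbps` is illustrative.

```python
def datapath_gbps(width_bits: int, clock_mhz: float) -> float:
    """Raw throughput of a data-path that accepts one word per clock cycle."""
    return width_bits * clock_mhz * 1e6 / 1e9


narrow = datapath_gbps(64, 80)    # 64-bit data-path figures from the slide
wide = datapath_gbps(128, 107)    # 128-bit data-path figures from the slide

print(f"64-bit  @  80 MHz: {narrow:.2f} Gbps")
print(f"128-bit @ 107 MHz: {wide:.2f} Gbps")
```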



• Function comparable to purely CAM-based circuits, but at a fraction of the CAM circuit cost

• However, even with only 64 entries, the embedded CAM requires 75% of register/logic resources when implemented using FPGA LUTs or standard cells

• CAM hardware can be minimised by trading it off against hash memory

• However, this becomes increasingly expensive as the number of matches increases

• Therefore a full-custom CAM is needed for area efficiency and high performance


Configurable Content Addressable Memory Architectures



[Schematic labels: CAM1, CAM2, SRAM (x2), Comparison Logic; signals BL, BLN, WL, MWL, ML, MML, Sel; supplies vdd!, gnd!]

• UMC 130nm CMOS technology

• 2 metal layers (M1/M2)

• Cell area = 4.47 µm x 9.07 µm

• Two bits per CCAM cell; 20 transistors

• Operating modes: Sel = 1 => BiCAM, Sel = 0 => TCAM



• Custom designed embedded associative memory architecture

– Lookup, search/sort, indexing, classification, pattern matching

• CAM/TCAM cell for design of a configurable memory array

• Supports basic SRAM, CAM and TCAM memory types

• Cell circuit cost is that of a TCAM cell

• If not used as TCAM, the “Don’t Care” mask circuit can be used as an SRAM or CAM cell, thus doubling SRAM or CAM capacity

• (Embedded) memory density comparable to stand-alone memory chips



• Configurable to create custom-purpose concatenated memory arrays, i.e. trade off memory width vs. memory depth

• Can be optimised for low power, area or performance

• CCAM cell area of 4.3 µm × 8.3 µm (UMC 130nm)

• Simulated access time:

– WR access: 2.28 ns, RD access: 2.5 ns, search access: 2.5 ns

– Worst-case match-line delay = 0.298 ns

– Clock cycle = 2.5 ns => clock frequency 400 MHz

• 64×128 CCAM block designed for evaluation

• Cell array can be configured as SRAM, CAM, local/global masked TCAM.
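
The difference between the binary CAM and TCAM modes can be illustrated with a behavioural sketch. It models only the matching function, not the circuit: each entry is a (value, care mask) pair, and a binary CAM is the special case in which every bit is compared. The names `tcam_search` and `as_binary_cam` and the 8-bit table are hypothetical.

```python
from typing import List, Optional, Tuple

Entry = Tuple[int, int]  # (value, care_mask); a 0 bit in care_mask means "don't care"


def tcam_search(entries: List[Entry], key: int) -> Optional[int]:
    """Return the index of the first matching entry, like a priority encoder."""
    for index, (value, care_mask) in enumerate(entries):
        # Hardware compares all rows at once; the loop only models the result.
        if ((key ^ value) & care_mask) == 0:
            return index
    return None


def as_binary_cam(values: List[int], width: int) -> List[Entry]:
    """Binary CAM is the special case where every bit is compared."""
    full_mask = (1 << width) - 1
    return [(v, full_mask) for v in values]


# Hypothetical 8-bit table: row 0 is exact, row 1 ignores the low nibble.
table = [(0b1010_1010, 0xFF), (0b1010_0000, 0xF0)]

print(tcam_search(table, 0b1010_1010))  # exact row wins
print(tcam_search(table, 0b1010_0111))  # matches only the masked row
print(tcam_search(table, 0b0000_0000))  # no match
```

In binary mode the mask is not needed, which is why the slides note that the mask storage can hold a second data bit and double the SRAM or CAM capacity.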


[Block diagram: 64 x 128 TCAM block. Inputs: data write enable, mask write enable, clk, reset_N, input data (128 bits), WR_Addr (6 bits). Internals: 128 x 64 TCAM cells (CAM1/CAM2 per cell, rows 1 to 64), combinational logic, decoder, RAM. Outputs: match, 6-bit match ID]

Full-custom design

• TCAM cell: 21 transistors; area 4.3 µm × 8.3 µm; CLK 400 MHz

• 64 x 128 block TCAM: 265,862 transistors; area 0.47 mm²; CLK 320 MHz

FPGA design

• TCAM cell: 3 x ALUTs; 2 x Reg

• 128 x 64 block CAM: 8891 ALUTs; 8265 Reg; 1 M4K RAM; CLK 107 MHz (pipelined)



• Optimising the arrangement and configuration of clusters of CCAM banks into custom-purpose associative memory structures

• Reducing power dissipation at CCAM bank level by targeting the priority decoder and pre-charge circuitry

• Interconnect and interface technology for on-chip distributed multipurpose memory blocks



• Novel architectures and design studies for Network Processing

• Hardware parallelism allows scaling of functions beyond 40Gbps

• Costly if distributed embedded memory is required (as is the case)

• Conventional SRAM-based fast memory technology is expensive

– Does not really support future embedded or external fast memory requirements

• PC driven DDR II/III technology provides a low-cost, fast and dense alternative, but hindered by unacceptable random access latency

• RLDRAM technology partially meets latency and density requirements in emerging applications (10-20Gbps shared buffer design study)


• FPGA technology versatile in constructing such circuits, but limited by embedded memory size and configurable logic resources

• Introduced a novel hybrid Hash/CAM pattern matching circuit

• Can perform CAM, SRAM and TCAM functions

• Offers low lookup/search latency and memory cost



Network processing: intensive data-dependent memory operations

• Switching, routing, content inspection, classification and protocol processing (FSM)

External solid-state memory technology

• Increase of memory density

• Significant increase of memory bandwidth (packaging, interface)

• Reduced random access latency

Embedded on-chip memory technology

• Variety of custom-purpose embedded SRAM technologies optimised for power, performance, density and latency

• Advanced and easily deployable embedded DRAM technology (optimised for power and density)

• Wide variety of custom-purpose configurable associative memory technology (full-custom)


On-chip/off-chip interconnect technology

• Advancement of on-chip bus interconnect (higher bus bandwidth, wider buses)

• On-chip high-bandwidth memory interconnect

– > 1 Terabit/sec

• Novel low-power, high-performance off-chip interconnect


Questions ???
