overview: qub and ecit
TRANSCRIPT
Sakir Sezer, NTU – Taipei, October 2007
OverviewSeptember 2004
Overview: QUB and ECIT
High Performance Network Processing
Sakir Sezer
Research Director – SoC Architectures and Programmable Systems
Institute of Electronics, Communications and Information Technology, Queen’s University Belfast, N Ireland, UK
• 161 years old
• 30 schools & 9 research institutes in 3 faculties
• 3,300 staff, of which 1,100 academics
• 12,700 undergrads + 3,700 postgrads = 16,400 students
• Research + support service income = $55 million p.a.
• Total income = $260 million p.a.
*Engineering - Humanities - Legal, social & educational sciences - Medicine & health sciences - Science & agriculture
Science and Engineering
Primary Degrees 1,000
Master Degrees 500
Doctorates 150
Institute of Electronics, Communications and Information Technology
ECIT
Titanic Quarter $4bn Investment - 20,000 jobs
[Figure: convergence of telecommunication, broadcast (multicast TV, radio) and computer communication, from best-effort services to real-time interactive services, current and future. Fixed access: dial-up, ADSL, ADSL2+, VDSL, GPON. Mobile and wireless access: GSM, GPRS/EDGE, 3G, HS(D/U)PA (3.5G), 3G-LTE, 4G Mobile-IP, WIFI, WIMAX, BWA. Bandwidth axis: 10 Kbps, 100 Kbps, 1 Mbps, 10 Mbps, 100 Mbps.]
Current Internet
Static content: graphics, text, online library, e-business, online banking
Next Generation Internet
Streamed content: IP-TV, VoD, VoIP, HDTV, online gaming, video conferencing
• Internet traffic doubles every 12 months
• Emerging services require:
– Higher bandwidth (VoD, DVB-IP, VoIP)
– Quality of Service (assured low end-to-end latency)
– Smaller packet size (to reduce end-to-end latency for real-time and interactive services)
– Higher degree of security (Internet banking, Internet shopping, e-business). Internet crime in 2004 was estimated at around $350 billion worldwide; Internet fraud cost merchants an average of 1.8% of online revenues, or $2.6 billion, in 2004
• Network access is eventually expected to become wireless
– Deployment of new frequency bands
– Space Division Multiple Access (frequency reuse, MIMO, beamforming)
• Transmission capacity increase requires:
– Complex traffic aggregation and management
– Link bandwidth beyond 10 Gbps at the edge and 40-100 Gbps at the core
[Figure: technology gap and data processing gap, time axis 1990 to 2020]
• Moore’s Law: silicon integration capability doubles every 18 months
• Technology gap: Internet traffic doubles every 12 months
• Data processing gap: data processing demand at network and access nodes doubles every 6-9 months
• Services shown: voice (PSTN); mainstream (1997): graphics, text, voice; frontier: IPTV, IP-HDTV, iTube, IP-Radio, web casting, VoD, 2D/3D multimedia, video telephony, other streamed services
Source: Robert – IEEE Computer - Jan 2000
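The three doubling periods quoted on this slide can be compared numerically. The sketch below is only an illustration: the 10-year horizon and the choice of 9 months (the slow end of the quoted 6-9 month range) are assumptions, and the function name is hypothetical.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Growth of a quantity that doubles every `doubling_months` months."""
    return 2.0 ** (years * 12.0 / doubling_months)


YEARS = 10  # illustrative horizon, not taken from the slides

silicon = growth_factor(YEARS, 18)    # Moore's Law
traffic = growth_factor(YEARS, 12)    # Internet traffic
processing = growth_factor(YEARS, 9)  # processing demand, slow end of 6-9 months

# A "gap" here is simply demand growth divided by silicon capability growth.
technology_gap = traffic / silicon
processing_gap = processing / silicon

print(f"silicon x{silicon:.0f}, traffic x{traffic:.0f}, processing x{processing:.0f}")
print(f"technology gap x{technology_gap:.1f}, data processing gap x{processing_gap:.1f}")
```

Because all three curves are exponential, the gaps themselves grow exponentially, which is the point the figure makes.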
Networks (Edge & Core)
Data and Network Security, Privacy
Access Technologies & Multimedia Systems Wireless Technology
Challenge the widening technology and data processing gaps that exist for emerging and future services and applications of converging information and communication technologies.
Research targets:
- novel SoC architectures,
- high-level SoC design methodologies,
- programmability of systems,
to satisfy the real-time computational and flexibility demands of emerging systems.
The division comprises over 35 academic, research and research-related support staff.
[Figure axes: Applied Research to Speculative Research; Digital to Analogue]
Network Processing: frame processing, HW acceleration of lookup, TCP processing, packet classification, HW-based traffic management, packet scheduling
Network Security Processing: IP-Sec, Deep Packet Inspection, MPSoC security processing, wireless ad hoc network security protocols & architectures
Cryptography: public/private key algorithm architectures (AES, SHACAL-2), authentication architectures (SHA-384/SHA-512, Whirlpool), cryptography for constrained environments
Communication Signal Processing: processing architectures for MIMO beamforming and smart antenna
System Level Design Tools for DSP SoC: rapid prototyping, HW/SW co-design, system-level design capture
Video Signal Processing & Analysis: reconfigurable motion estimation for multi-standards, video analysis acceleration
Full-custom embedded associative memory circuits
Hardware Architectures for Network Processing
• Internet Traffic Management
• Network Security
• Streaming applications (VoIP, IPTV) are reducing packet size and impose rigid delay requirements
• Security threats: viruses, worms, trojans etc.
• Deep Packet Inspection (DPI) is key
• Privacy and authentication require encryption and decryption, which are computationally intensive
• Networks are moving from unsecured best effort to delivering secure, high-quality real-time streaming content
• Much of the current focus is on software-based multiple processor cores plus H/W acceleration
• These are typically unable to keep up with bandwidth demands
• Programmable, hardware-based packet scheduler for high-performance Internet traffic management
– Hardware-based Weighted Fair Queuing (WFQ)
– Packet retrieval using associative memory
– Memory bandwidth: efficient packet storage
• High-performance pattern matching for Internet packet classification and security
– Novel pattern matching methods
– Novel reconfigurable CAM/TCAM circuits
• Critical limitations of current semiconductor technologies explored
• QoS for real-time interactive services - IPTV, VoIP, On-line Gaming etc
• Network resource utilisation (network bandwidth)
[Figure labels: WIFI, ISP, Core Network, TV, Shop, Subscriber, DPI, IP Traffic Management; link rate ~10-100 Gbps; end-to-end QoS]
[Figure: packet scheduler block diagram. Blocks: packet classification (traffic flow/class, QoS), finishing tag computation, tag lookup table with write and read control, shared buffer write control, external shared buffer, packet server, scheduler input and output. Groups: shared packet data buffering, packet scheduler.]
Developed at ECIT for line rates above 40 Gbps
• Hardware-based packet scheduler
• Programmable to support a range of scheduling algorithms, including Weighted Fair Queuing (WFQ)
• Scalable from 10 Gbps up to 100 Gbps
• Resource efficient: link bandwidth, storage (memory)
• Based on available commercial silicon technology, e.g. standard-cell VLSI, FPGA
• Scheduler determines a “finishing tag” for each packet, which sets the order of service
• Associative memory returns the smallest available “finishing tag” within a guaranteed time at line speed
• Uses a linked-list structure
• New tags are sorted using a look-up tree (trie) with a translation table
• Separating search from data storage allows the look-up function to be implemented very efficiently in H/W
• Matching circuit uses select and look-ahead
• Architecture demonstrated on an Altera Stratix II FPGA at 12.5 Gbps
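The finishing-tag idea can be sketched in software. This is not the ECIT circuit: it assumes the self-clocked simplification of WFQ (virtual time is taken as the tag of the packet last served), and a binary heap stands in for the associative memory and trie that return the smallest tag in hardware. The class name, flow names and weights are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class TagScheduler:
    weights: dict                       # flow id -> weight (relative link share)
    virtual_time: float = 0.0           # tag of the packet most recently served
    last_tag: dict = field(default_factory=dict)
    heap: list = field(default_factory=list)
    seq: int = 0                        # tie-breaker: FIFO order for equal tags

    def enqueue(self, flow, length_bytes):
        # Tag = start time in virtual units + service time scaled by the weight.
        start = max(self.virtual_time, self.last_tag.get(flow, 0.0))
        tag = start + length_bytes / self.weights[flow]
        self.last_tag[flow] = tag
        heapq.heappush(self.heap, (tag, self.seq, flow, length_bytes))
        self.seq += 1
        return tag

    def dequeue(self):
        # Serve the packet with the smallest finishing tag.
        tag, _, flow, length_bytes = heapq.heappop(self.heap)
        self.virtual_time = tag
        return flow, length_bytes


sched = TagScheduler(weights={"video": 3.0, "web": 1.0})
for _ in range(3):
    sched.enqueue("video", 300)
    sched.enqueue("web", 300)
order = [sched.dequeue()[0] for _ in range(6)]
print(order)
```

With equal packet sizes, the flow with three times the weight receives tags that advance three times more slowly, so its packets are served earlier. A heap needs O(log n) per operation; the slides describe a hardware structure that bounds the retrieval time at line speed instead.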
Packet scheduler layout (Cadence Encounter, UMC 130nm)
Clock frequency: 143 MHz
Number of IOs: 478 pins
Total area: 14.4 mm²
Number of packets: up to 30 million packets supported (external DDR)
Throughput: 35.8M packets/sec
Throughput: ~40 Gbps line rate (assuming mean IP packet of 130 bytes)
Layout blocks: address translation table, search trie memory
90% distributed embedded memory
Patent Pending
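The quoted throughput figures can be cross-checked with a few lines of arithmetic. The constants are taken from the slide; the cycles-per-packet value is an inference from those numbers and is not stated in the source.

```python
PACKETS_PER_SEC = 35.8e6   # from the slide
MEAN_PACKET_BYTES = 130    # from the slide
CLOCK_HZ = 143e6           # from the slide

# Line rate implied by the packet rate and the assumed mean packet size.
line_rate_gbps = PACKETS_PER_SEC * MEAN_PACKET_BYTES * 8 / 1e9

# Clock cycles available per scheduled packet (inferred, not stated).
cycles_per_packet = CLOCK_HZ / PACKETS_PER_SEC

print(f"{line_rate_gbps:.1f} Gbps, {cycles_per_packet:.2f} cycles per packet")
```

The product comes out slightly below 40 Gbps, consistent with the "~40 Gbps" on the slide, and the clock and packet rates suggest roughly four clock cycles per packet.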
• Scalable to meet next-generation 100 Gbps line rates
• Scalable to support beyond 1 million flows (virtual queues), each flow with unique weight properties
• Can outperform traditional software/NPU solutions by more than 5x
• Power dissipation lower by a factor of 100
• Circuit achieves up to 99.8% accuracy relative to the theoretical WFQ algorithm
• Many traffic management applications (core, edge, access)
• Enables customised service differentiation
Shared Packet Data Buffering
• Beyond 10 Gbps, SRAM-based memory is attractive in terms of speed and latency
• However, its high cost and low capacity make it unsuitable
• DDR II/III offers high density at lower cost, but suffers from random access latency
• RLDRAM (Reduced Latency DRAM) is attractive: lower random access latency, better but not ideal
• Challenge: optimisation of a shared buffer architecture for 20 Gbps based on RLDRAM
• Optimise memory usage to make the most efficient use of memory bandwidth and storage across packet sizes
• RLDRAM (Reduced Latency DRAM) multi-bank technology plus FPGA solution
• Memory space utilisation >90%
• Scalable to a wide range of network processing applications (traffic management, classification, security etc.)
• Shown that 20 Gbps packet buffering is possible using RLDRAM with separate I/O
[Figure: FPGA shared buffer]
• There appears to be a lack of suitable memory technology to meet these emerging requirements
• Packets are becoming increasingly smaller in order to reduce overall end-to-end latency
• Memory utilisation is typically traded off against memory access latency to achieve performance
• However, this approach is ultimately limited in meeting future storage and access latency requirements
Pattern Matching for
Deep Packet Inspection
• Real-time pattern matching for virus, worm, trojan, spam and intrusion detection at >10 Gbps
• Hybrid pattern matching methods used
• Combining embedded memory, reconfigurable logic and SoC technology
• Explored trade-offs and limitations of established parallel matching methods, including:
– Hash Tables
– Content Addressable Memory
[Figure labels: Internet EU, Internet UK, DPI Engine, Vulnerable Computer]
• Checks packet header and payload for suspect content
• Flexible string matching for payload inspection is the most computationally expensive aspect of DPI
• Efficient string matching scheme achieves constant lookup time, O(1), on each input data chunk
[Figure: hybrid Hash/CAM pattern matching circuit. Blocks: input with delays, hash function, dual RAM and duplicate RAM with comparators (=), CAM with CAM match, MUX, registers (R); outputs: Match (2 bits) and ID#.]
• Hash and CAM circuits execute look-up operation simultaneously
• Dual-RAM establishes dual-entry hash table for each hash key
• Collisions that overflow the hashing module are stored in the Content Addressable Memory (CAM)
• Pipelined matching with O(1) lookup performance can be achieved at low cost
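The bullets above can be modelled in software as follows. This is a sketch under assumptions, not the circuit itself: CRC32 is a placeholder hash function, a Python dict stands in for the small overflow CAM, and the class name, bucket count and CAM capacity are hypothetical. In hardware the hash path and the CAM are searched simultaneously; here they are checked one after the other.

```python
import zlib


class HashCamMatcher:
    def __init__(self, buckets=256, cam_capacity=64):
        self.buckets = buckets
        self.ram_a = [None] * buckets   # first entry for each hash key
        self.ram_b = [None] * buckets   # second entry (dual-entry hash table)
        self.cam = {}                   # stand-in for the overflow CAM
        self.cam_capacity = cam_capacity

    def _index(self, pattern: bytes) -> int:
        return zlib.crc32(pattern) % self.buckets  # placeholder hash function

    def add(self, pattern: bytes, pattern_id: int) -> None:
        i = self._index(pattern)
        if self.ram_a[i] is None:
            self.ram_a[i] = (pattern, pattern_id)
        elif self.ram_b[i] is None:
            self.ram_b[i] = (pattern, pattern_id)
        elif len(self.cam) < self.cam_capacity:
            self.cam[pattern] = pattern_id  # third collision goes to the CAM
        else:
            raise OverflowError("overflow CAM is full")

    def lookup(self, chunk: bytes):
        # At most two RAM comparisons plus one CAM search: constant work.
        i = self._index(chunk)
        for entry in (self.ram_a[i], self.ram_b[i]):
            if entry is not None and entry[0] == chunk:
                return entry[1]
        return self.cam.get(chunk)


matcher = HashCamMatcher(buckets=1, cam_capacity=2)  # one bucket forces collisions
for pid, pat in enumerate([b"worm", b"trojan", b"spam"]):
    matcher.add(pat, pid)
print(matcher.lookup(b"spam"), matcher.lookup(b"clean"))
```

The design choice illustrated is that the CAM only has to hold the patterns that do not fit in the two hash entries, so it can stay small while the lookup cost remains independent of the number of stored patterns.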
• Demonstration of a prototype pattern matching circuit
• Standard FPGA technology used to integrate the Hash/CAM hybrid architecture on a single device, with constant look-up time of O(1)
• Comparable to purely CAM-based circuits, at a fraction of the CAM circuit cost
• Throughput of 13.7 Gbps (128-bit data path at 107 MHz) for approximately 1000 patterns
• A larger FPGA device supports >10K pattern matches at 10 Gbps line rate
Altera Stratix II device
64-bit data path:
– 500 MHz internal memory access
– 120 MHz data sampling rate
=> 64 x 80 MHz = 5.2 Gbps
128-bit data path:
– 400 MHz internal memory access
– 100 MHz data sampling rate
=> 128 x 107 MHz = 13.7 Gbps
Constant look-up time O(1)
Performance equivalent to searching 300 Yellow Pages books for 5000 different business names per second!
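The throughput figures follow from data-path width multiplied by the rate at which words are processed. The sketch below only reproduces that arithmetic; the function names are hypothetical. Note that the plain product for the 64-bit case is 5.12 Gbps, while the slide quotes 5.2 Gbps; the source does not explain the difference.

```python
def throughput_gbps(width_bits: int, rate_mhz: float) -> float:
    """Raw throughput of a data path processing one word per cycle."""
    return width_bits * rate_mhz * 1e6 / 1e9


def required_mhz(target_gbps: float, width_bits: int) -> float:
    """Word rate needed to sustain a target line rate on a given width."""
    return target_gbps * 1e9 / width_bits / 1e6


wide = throughput_gbps(128, 107)    # 128-bit data path from the slide
narrow = throughput_gbps(64, 80)    # 64-bit data path from the slide
rate_for_10g = required_mhz(10, 128)

print(f"{wide:.3f} Gbps, {narrow:.2f} Gbps, {rate_for_10g:.3f} MHz for 10 Gbps")
```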
• Function comparable to purely CAM-based circuits, but at a fraction of the CAM circuit cost
• However, even with only 64 entries, the embedded CAM requires 75% of register/logic resources when implemented using FPGA LUTs or standard cells
• CAM hardware can be minimised by trading it off against hash memory
• However, this becomes increasingly expensive as the number of matches increases
• Therefore, full-custom CAM is needed for area efficiency and high performance
Configurable Content Addressable Memory
Architectures
[Figure: CCAM cell schematic with two storage elements (CAM1/SRAM, CAM2/SRAM) and comparison logic. Signals: BL, BLN, WL, MWL, ML, MML, Sel, vdd!, gnd!]
• UMC 130nm CMOS technology
• 2 metal layers (M1/M2)
• Cell area = 4.47 µm x 9.07 µm
• Two bits per CCAM cell; 20 transistors
• Operating modes: Sel = 1 => BiCAM, Sel = 0 => TCAM
• Custom-designed embedded associative memory architecture
– Lookup, search/sort, indexing, classification, pattern matching
• CAM/TCAM cell for the design of a configurable memory array
• Supports basic SRAM, CAM and TCAM memory types
• Cell circuit cost equals that of a TCAM cell
• If not used as TCAM, the “Don’t Care” mask circuit can be used as an SRAM or CAM cell, thus doubling SRAM or CAM capacity
• (Embedded) memory density comparable to stand-alone memory chips
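A behavioural sketch of the two operating modes may help. It models one word of cells in software and says nothing about the transistor-level circuit; the class and method names are hypothetical, and the encoding of the mask bit (1 = compare, 0 = "don't care") is an assumption.

```python
class CcamWord:
    """One word of configurable cells; each cell stores two bits."""

    def __init__(self, width: int):
        self.width = width
        self.full = (1 << width) - 1
        self.bits_a = 0   # first storage bit of every cell
        self.bits_b = 0   # second bit: care mask (TCAM) or extra data (CAM)

    def write_tcam(self, value: int, care_mask: int) -> None:
        self.bits_a = value & self.full
        self.bits_b = care_mask & self.full

    def match_tcam(self, key: int) -> bool:
        # Only bit positions whose care bit is 1 take part in the comparison.
        return ((key ^ self.bits_a) & self.bits_b) == 0

    def write_binary(self, word_a: int, word_b: int) -> None:
        # Mask storage reused as a second data word: capacity is doubled.
        self.bits_a = word_a & self.full
        self.bits_b = word_b & self.full

    def match_binary(self, key: int):
        key &= self.full
        return (key == self.bits_a, key == self.bits_b)


word = CcamWord(8)
word.write_tcam(0b10100000, 0b11110000)   # low four bits are "don't care"
ternary_hit = word.match_tcam(0b10101111)
ternary_miss = word.match_tcam(0b10110000)

word.write_binary(0x12, 0x34)             # same cells now hold two CAM words
binary_hits = word.match_binary(0x34)
print(ternary_hit, ternary_miss, binary_hits)
```

The same two bits per cell serve either as value plus mask (one ternary word) or as two independent binary words, which is the capacity doubling described in the bullets above.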
• Configurable to create custom-purpose concatenated memory arrays, i.e. trade off memory width against memory depth
• Can be optimised for low power, area or performance
• CCAM cell area of 4.3 µm x 8.3 µm (UMC 130nm)
• Simulated access time:
– WR access: 2.28 ns, RD access: 2.5 ns, search access: 2.5 ns
– Worst-case match-line delay = 0.298 ns
– Clock cycle = 2.5 ns => clock frequency 400 MHz
• 64 x 128 CCAM block designed for evaluation
• Cell array can be configured as SRAM, CAM, or local/global masked TCAM
[Figure: 64 x 128 TCAM block. Inputs: data write enable, mask write enable, clk, reset_N, input data (128 bits), WR_Addr (6 bits). Rows 1 to 64 of 128 TCAM cells (CAM1/CAM2), combinational logic, decoder and RAM; outputs: match, 6-bit match ID.]
Full-custom design
• TCAM cell: 21 transistors, area 4.3 x 8.3 µm², CLK 400 MHz
• 64 x 128 block TCAM: 265,862 transistors, area 0.47 mm², CLK 320 MHz
FPGA design
• TCAM cell: 3 ALUTs, 2 registers
• 128 x 64 block CAM: 8891 ALUTs, 8265 registers, 1 M4K RAM, CLK 107 MHz (pipelined)
• Optimising the arrangement and configuration of clusters of CCAM banks into custom-purpose associative memory structures
• Reducing power dissipation at CCAM bank level by targeting the priority decoder and pre-charge circuitry
• Interconnect and interface technology for on-chip distributed multipurpose memory blocks
• Novel architectures and design studies for Network Processing
• Hardware parallelism allows scaling of functions beyond 40Gbps
• Costly if distributed embedded memory is required (as is the case)
• Conventional SRAM based fast memory technology expensive
• Conventional SRAM does not really meet future embedded or external fast memory needs
• PC driven DDR II/III technology provides a low-cost, fast and dense alternative, but hindered by unacceptable random access latency
• RLDRAM technology partially meets latency and density requirements in emerging applications (10-20Gbps shared buffer design study)
• FPGA technology is versatile in constructing such circuits, but limited by embedded memory size and configurable logic resources
• Introduced a novel hybrid Hash/CAM pattern matching circuit
• Configurable CAM (CCAM) cell can perform CAM, SRAM and TCAM functions
• Offers low lookup/search latency and memory cost
Network processing: intensive data-dependent memory operations
• Switching, routing, content inspection, classification and protocol processing (FSM)
External solid-state memory technology
• Increase of memory density
• Significant increase of memory bandwidth (packaging, interface)
• Reduced random access latency
Embedded on-chip memory technology
• Variety of custom-purpose embedded SRAM technologies optimised for power, performance, density and latency
• Advanced and easily deployable embedded DRAM technology (optimised for power and density)
• Wide variety of custom-purpose configurable associative memory technology (full-custom)
On-chip/off-chip interconnect technology
• Advancement of on-chip bus-interconnect
(higher-bus-bandwidth, wider buses)
• On-chip high-bandwidth memory-interconnect
• > 1 Terabit/sec
• Novel low-power high-performance off-Chip interconnect
Questions ???