Performance Analysis of Packet Classification Algorithms on Network Processors
![Page 1: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/1.jpg)
Performance Analysis of Packet Classification Algorithms on Network Processors
Deepa Srinivasan, IBM Corporation
Wu-chang Feng, Portland State University
November 18, 2004, IEEE Local Computer Networks
![Page 2: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/2.jpg)
Network Processors
• Emerging platform for high-speed packet processing
  – Splice in a statistic here?
  – Provide device programmability while keeping performance
• Architectures differ, but common features include
  – Multiple processing units executing in parallel
  – Instruction set customized for network applications
  – Binary image pre-determined at compile time
![Page 3: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/3.jpg)
Example: Intel’s IXP
![Page 4: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/4.jpg)
IXP Architecture
• Multi-processor
  – StrongARM core for slow-path processing
  – 6 microengines for fast-path processing
• Hardware support for multithreading
  – Each microengine has 4 thread contexts
  – Zero or minimal overhead context switch
![Page 5: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/5.jpg)
Motivation for study
• NPs offer a programmable, parallel alternative, but current packet processing algorithms are
  – Written for sequential execution, or
  – Designed using custom, invariant ASICs
• To use them on NPs
  – Algorithms must be mapped onto NPs in different ways, with each mapping having varying performance
![Page 6: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/6.jpg)
Our study
• Examine several mappings of a packet classification algorithm onto NP hardware
• Identify general problems in performing such mappings
![Page 7: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/7.jpg)
Why packet classification?
• Fundamental function performed by all network devices
  – Routers, switches, bridges, firewalls, IDS
• Increasing complexity makes packet classification the bottleneck
  – Increase in size of rulesets
  – Increase in dimension of rulesets
  – Algorithms must perform at high speed on the fast path
![Page 8: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/8.jpg)
Picking an algorithm
• Many algorithms are sequential
  – Do not leverage inherent parallelism in NPs
• Several parallel algorithms
  – BitVector [Lakshman98]
    • Parallel lookup implemented via FPGA
    • Maps well onto NP platform
![Page 9: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/9.jpg)
Bit Vector algorithm
• T.V. Lakshman, D. Stiliadis, “High-speed policy-based packet forwarding using efficient multi-dimensional range matching”, SIGCOMM 1998.
  – Parallel search algorithm
  – Preprocessing phase
  – Two-stage classification phase
    • Perform lookup for each dimension in parallel
    • Combine results to determine matching rule
![Page 10: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/10.jpg)
Example ruleset
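| Rule | Field 1 | Field 2 | Field 3 | Action |
|------|----------|---------|----------|--------|
| r1 | (10, 11) | (2, 4) | (8, 11) | Allow |
| r2 | (4, 6) | (8, 11) | (1, 4) | Allow |
| r3 | (9, 11) | (5, 7) | (12, 14) | Deny |
| r4 | (6, 8) | (1, 3) | (5, 9) | Allow |
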
Number of rules (N) = 4
Number of dimensions (d) = 3
Width of dimension (W) = 4 (bits)
![Page 11: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/11.jpg)
BitVector example
Packet = {6, 10, 2}
Matching rule = r2
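
The example can be reproduced with a short sketch of the two phases. This is not the authors' microengine code: the slides list tries for the range lookup, while this sketch swaps in a sorted list of interval start points searched with `bisect`, and it assumes the lowest-numbered matching rule wins. The names `preprocess`, `classify`, and `TABLES` are hypothetical.

```python
from bisect import bisect_right

# Ruleset from the example slide: inclusive (low, high) range per field.
RULES = [
    ("r1", [(10, 11), (2, 4), (8, 11)], "Allow"),
    ("r2", [(4, 6), (8, 11), (1, 4)], "Allow"),
    ("r3", [(9, 11), (5, 7), (12, 14)], "Deny"),
    ("r4", [(6, 8), (1, 3), (5, 9)], "Allow"),
]
W = 4  # width of each dimension in bits


def preprocess(rules, width):
    """Per dimension, build sorted interval start points and one bit vector per interval."""
    tables = []
    for dim in range(len(rules[0][1])):
        # Interval boundaries: 0, every range start, and every range end + 1.
        starts = {0}
        for _, ranges, _ in rules:
            low, high = ranges[dim]
            starts.add(low)
            if high + 1 < (1 << width):
                starts.add(high + 1)
        starts = sorted(starts)
        vectors = []
        for start in starts:
            bits = 0
            for index, (_, ranges, _) in enumerate(rules):
                low, high = ranges[dim]
                if low <= start <= high:
                    bits |= 1 << index  # bit i set: rule i covers this interval
            vectors.append(bits)
        tables.append((starts, vectors))
    return tables


def classify(packet, tables, rules):
    """Look up each dimension, AND the bit vectors, return the first matching rule."""
    result = (1 << len(rules)) - 1
    for value, (starts, vectors) in zip(packet, tables):
        interval = bisect_right(starts, value) - 1  # interval containing value
        result &= vectors[interval]
    if result == 0:
        return None
    first = (result & -result).bit_length() - 1  # index of lowest set bit
    return rules[first][0], rules[first][2]


TABLES = preprocess(RULES, W)
print(classify((6, 10, 2), TABLES, RULES))
```

For the packet {6, 10, 2}, field 1 matches r2 and r4, while fields 2 and 3 match only r2, so the AND of the three bit vectors leaves r2.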
![Page 12: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/12.jpg)
Two design mappings
• Consider multiple mappings of BitVector onto Intel’s IXP1200 microengines
  – Option 1 (Parallel): all processing for a single packet handled by one microengine (μEngine)
  – Option 2 (Pipelined): processing for a single packet is split across μEngines
Recall: IXP has 6 μEngines
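
The difference between the two options can be sketched sequentially. This is only a structural model, not IXP1200 code: `HeaderStore` is a hypothetical stand-in for packet headers held in shared memory, and the pipelined version assumes one stage per dimension with each stage fetching the header itself, which the slides do not spell out.

```python
# Example ruleset ranges (inclusive), one list of three fields per rule.
RULES = [
    [(10, 11), (2, 4), (8, 11)],
    [(4, 6), (8, 11), (1, 4)],
    [(9, 11), (5, 7), (12, 14)],
    [(6, 8), (1, 3), (5, 9)],
]
ALL_RULES = (1 << len(RULES)) - 1


def lookup(dim, value):
    """Bit vector of rules whose range in dimension `dim` contains `value`."""
    bits = 0
    for index, ranges in enumerate(RULES):
        low, high = ranges[dim]
        if low <= value <= high:
            bits |= 1 << index
    return bits


class HeaderStore:
    """Hypothetical model of headers in shared memory; counts read accesses."""

    def __init__(self, headers):
        self.headers = headers
        self.reads = 0

    def read(self, packet_id):
        self.reads += 1
        return self.headers[packet_id]


def parallel_mapping(store, packet_id):
    """One engine does everything for a packet: a single header read."""
    header = store.read(packet_id)
    result = ALL_RULES
    for dim, value in enumerate(header):
        result &= lookup(dim, value)
    return result


def pipelined_mapping(store, packet_id):
    """Each stage re-reads the header, handles one dimension, forwards the partial result."""
    partial = ALL_RULES
    for dim in range(len(RULES[0])):
        header = store.read(packet_id)  # every stage fetches the header again
        partial &= lookup(dim, header[dim])
    return partial


parallel_store = HeaderStore({0: (6, 10, 2)})
pipelined_store = HeaderStore({0: (6, 10, 2)})
print(parallel_mapping(parallel_store, 0), parallel_store.reads)
print(pipelined_mapping(pipelined_store, 0), pipelined_store.reads)
```

Both mappings produce the same bit vector. Under these assumptions they differ only in how often the shared header is read per packet.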
![Page 13: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/13.jpg)
Parallel Mapping
![Page 14: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/14.jpg)
Pipelined Mapping
![Page 15: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/15.jpg)
Memory allocation
| Purpose | Type of memory |
|---------|----------------|
| Queue for inter-microengine communication | SRAM |
| List of rule actions | SRAM |
| Tries representing ranges | SDRAM |
| Bit vectors | SDRAM |
![Page 16: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/16.jpg)
Evaluation platform
• Intel IXP1200 Developer Workbench
  – Graphical IDE
  – Cycle-accurate simulator
  – Performance statistics
• All experiments run within simulator
  – Configurable
  – Logging facility
![Page 17: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/17.jpg)
Simulator configuration
• IXP1200 chip
  – 1K microstore
  – Core frequency (~165 MHz)
  – 4 ports receive data
• Simulations run until 75,000 packets received by IXP
  – Simulator sends packets as fast as possible
• Rulesets used
  – Experiments use a small, fixed set of rules
  – Availability of real-world firewall rulesets limited
![Page 18: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/18.jpg)
Performance metrics
| Performance metric | Description |
|--------------------|-------------|
| Transmit rate (Mbps) | The overall packet transmit rate of the IXP, for all the ports that are configured to send packets. |
| Microengine execution time (%) | The percentage of the total number of microengine cycles that a microengine spent performing useful tasks. |
| Microengine aborted time (%) | The percentage of the total time of a microengine that was wasted due to instructions in its pipeline being aborted, typically due to branch instructions. |
| Microengine idle time (%) | The percentage of the total time of a microengine that was wasted due to none of the 4 hardware threads being available to run, typically due to memory access wait time. |
| SDRAM access (%) | The total percentage of SDRAM bandwidth utilized by all microengines. |
| SRAM access (%) | The total percentage of SRAM bandwidth utilized by all microengines. |
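
The three microengine metrics are shares of one cycle budget, and a small sketch shows the arithmetic. It assumes executing, aborted, and idle cycles together account for all of a microengine's cycles. The function name and the cycle counts are illustrative, not simulator output.

```python
def microengine_time_split(executing, aborted, idle):
    """Convert raw cycle counts into the three percentage metrics."""
    total = executing + aborted + idle
    if total == 0:
        raise ValueError("no cycles recorded")
    return {
        "executing": 100.0 * executing / total,
        "aborted": 100.0 * aborted / total,
        "idle": 100.0 * idle / total,
    }


# Illustrative cycle counts only, not measured values.
print(microengine_time_split(executing=50, aborted=25, idle=25))
```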
![Page 19: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/19.jpg)
Results and Analysis
![Page 20: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/20.jpg)
Throughput
![Page 21: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/21.jpg)
Packets sent/received ratio
![Page 22: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/22.jpg)
Analysis
• Overall, Parallel performs better than Pipelined
• Pipelined: a single packet header in SDRAM is read multiple (3) times
![Page 23: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/23.jpg)
Microengine utilization
![Page 24: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/24.jpg)
Microengine aborted time
![Page 25: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/25.jpg)
Analysis
• Aborted time is typically caused by branch instructions
• Algorithms must reduce branch instructions to maximize throughput
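
One place branches show up in a bit-vector classifier is the final step that finds the first matching rule. The sketch below contrasts a loop that tests one bit per iteration with a bit-manipulation form that has no per-rule conditional. It illustrates the idea only: Python will not show the cycle difference, the slides do not say which form the authors used, and both function names are hypothetical.

```python
def first_match_loop(bits, num_rules):
    """Scan rule by rule: one conditional branch per rule examined."""
    for index in range(num_rules):
        if bits & (1 << index):
            return index
    return -1


def first_match_bit_trick(bits):
    """Isolate the lowest set bit arithmetically; returns -1 when bits == 0."""
    return (bits & -bits).bit_length() - 1


print(first_match_loop(0b1010, 4), first_match_bit_trick(0b1010))
```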
![Page 26: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/26.jpg)
Microengine idle time
![Page 27: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/27.jpg)
Distribution of microengine time
[Figure: stacked bar charts of microengine time (%), split into Executing, Aborted, and Idle, for microengines 1 to 6. Panels: Parallel, Pipelined.]
![Page 28: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/28.jpg)
Analysis
• High microengine idle time in Pipelined due to memory latency
• Lower microengine aborted time in Pipelined due to what?
![Page 29: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/29.jpg)
Discussion
• Pipelined mappings can bottleneck through memory
  – Repeated memory reads to send work from μEngine to μEngine
  – Direct hardware support for pipelining required
    • IXP2xxx = next-neighbor registers
    • Currently re-examining our results on IXP2400
• Algorithms with fewer branch instructions result in better microengine utilization (lower aborted time)
![Page 30: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/30.jpg)
Conclusion
• Packet classification is a fundamental function
• Parallel nature of NPs well-suited for parallel search algorithms
![Page 31: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/31.jpg)
Conclusion
• Network processors offer high packet processing speed and programmability
  – Performance of an algorithm depends on the design mapping chosen
• Contributions
  – Demonstrated that mapping has considerable impact on performance
    • Pipelined mappings benefit from hardware support
    • Algorithms with fewer branch instructions result in better processor utilization
![Page 32: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/32.jpg)
Future work
• Analyze other mappings
  – Split work across different hardware threads in a single microengine
  – Placement of data structures in different memory banks
• IXP2400
  – Examine how hardware features change trade-offs in algorithm mapping
• Algorithms designed specifically for network processors
![Page 33: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/33.jpg)
Backup Slides
![Page 34: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/34.jpg)
Definitions
• Packet classification: process of categorizing packets according to pre-defined rules
• Classifier or ruleset: collection of rules
• Dimension or field: packet header field used for classification
• Rule: range of values for each field, plus an action
![Page 35: Performance Analysis of Packet Classification Algorithms on Network Processors](https://reader030.vdocument.in/reader030/viewer/2022032605/56812c27550346895d9098e9/html5/thumbnails/35.jpg)
Packet classification algorithms
| Algorithm | Time complexity | Storage complexity |
|-----------|-----------------|--------------------|
| Linear Search | $Nd$ | $Nd$ |
| Set-pruning Tries | $dW$ | $N^{d}dW$ |
| Grid of tries | $W^{d-1}$ | $NdW$ |
| Cross producting | $dW$ | $N^{d}$ |
| Fat Inverted Segment tree | $(l + 1)W$ | $l \times N^{1+1/l}$ |
| Recursive Flow Classification | $d$ | $N^{d}$ |
| Hierarchical Intelligent Cuttings | $d$ | $N^{d}$ |
| Tuple Space Search | $m$ | $N$ |
| Bit Vector | $dW + N/\text{memory-width}$ | $dN^{2}$ |

N: number of rules
d: number of dimensions
W: maximum number of bits
l: number of levels occupied by a FIS-tree
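
To get a feel for the Bit Vector row, the two expressions can be evaluated for concrete parameters. The sketch treats the order-of-growth terms as if they were exact counts and rounds N/memory-width up to whole memory words. Both are simplifying assumptions, and the 32-bit memory width is an illustrative choice rather than a figure from the slides.

```python
import math


def bit_vector_costs(num_rules, num_dims, width_bits, memory_width_bits):
    """Evaluate dW + N/memory-width (time) and d*N^2 (storage) for given parameters."""
    time_steps = num_dims * width_bits + math.ceil(num_rules / memory_width_bits)
    storage_bits = num_dims * num_rules ** 2
    return time_steps, storage_bits


# Example ruleset parameters: N = 4, d = 3, W = 4; 32-bit memory width is assumed.
print(bit_vector_costs(4, 3, 4, 32))
# Same dimensions with 1000 rules: storage grows quadratically in N.
print(bit_vector_costs(1000, 3, 4, 32))
```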