A Scalable, Cache-Based Queue Management
Subsystem for Network Processors
Sailesh Kumar, Patrick Crowley
Dept. of Computer Science and Engineering
2 - Sailesh Kumar - 04/19/23
Packet processing systems
ASIC based
» High performance
» Low configurability
» May be expensive when volumes are low
Network processor (NP) based
» Very high degree of configurability
» High volumes can result in low cost
» Challenges in matching ASIC performance
In this paper we concentrate on queuing bottlenecks associated with NP based packet processors
Basic NP architecture
Chip-Multiprocessor (CMP) architecture
» A pool of relatively simple processors
» Dedicated hardware units for common cases
» High-speed interconnection between processors
» Integrated network interface and memory controllers
A group of processors collectively performs the packet processing task (queuing, scheduling, etc.)
Best-case performance is N times higher when all N processors operate in parallel
Example: Intel’s IXP architecture
Intel’s IXP2850 Introduction
CMP architecture
16 RISC-type processors called Microengines (MEs)
8 hardware thread contexts on each ME
SPI4.2 and CSIX interface cores
3 Rambus DRAM and 4 QDR SRAM controllers
Various hardware units, such as Hash, CAM, etc.
Typically, MEs are arranged in a pipeline, and groups of MEs collectively perform the packet processing task.
Why does PP need queues?
Routers and switch fabrics are packet processing systems which
» Receive packets at input ports
» Classify each packet and identify its next hop
» Transmit the packet at the appropriate output port
Ingress rate can exceed output link capacity due to
» Traffic from many input ports destined to one output port
» Bursty nature of Internet traffic
» Statistical oversubscription
Implications:
» Unbounded delay for all flows
» Packet loss across every active flow
» A single misbehaving flow can affect all other flows
Solution
Keep queues for every flow or group of flows
» Put arriving packets into the appropriate queue
» Treat each queue such that resources are allocated fairly to all flows
» Send packets from queues such that each queue receives a fair share of the aggregate link bandwidth
In fact, queues are the fundamental data structure in any packet processing system. They ensure
» Fair allocation of resources (bandwidth, buffer space, etc.)
» Isolation of misbehaving and high-priority flows
» Guaranteed traffic treatment: delay, bandwidth, QoS
Conclusion: any packet processing system must handle a large number of queues at very high speed
A simple queuing model
DRAM space is divided into regions, each of which can hold an arriving packet
» Each such region is called a buffer
SRAM keeps the queue descriptors (QDs) and next pointers
» A QD is the set of head address, tail address, and length of a queue
» A next pointer is the address of the next buffer in a queue
We need two categories of queues
» A queue to keep all the free buffers available (i.e. the free buffer queue)
» Another set of queues to keep the buffers which hold the packets belonging to various flows (i.e. virtual queues)
– These enable isolation of flows
A simple queuing model (cont)
[Figure: SRAM holds the queue descriptors (head, tail, count) and the next pointers; DRAM holds the stored cells. Two virtual queues are shown with descriptors (0, 7, 3) and (8, 4, 2): Q0 holds cells 1, 2, and 3, and Q1 holds cells 1 and 2. The next-pointer column reads 2, 7, X, X, 4. The remaining buffers belong to the free buffer queue.]
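The linked-list layout in the figure can be traced with a short Python sketch. The descriptor values (0, 7, 3) and (8, 4, 2) and the next-pointer column (2, 7, X, X, 4) come from the figure; the assignment of those next pointers to buffer addresses 0, 2, 4, 7, 8 is an assumption made here because it is the only assignment consistent with both descriptors. The names `next_ptr`, `queue_descriptors`, and `buffers_in_queue` are illustrative only.

```python
# Assumed reading of the figure: buffer addresses 0, 2, 4, 7, 8 are in use.
NIL = None  # end-of-queue marker (shown as X in the figure)

# SRAM: next pointers, indexed by buffer address (address mapping is assumed).
next_ptr = {0: 2, 2: 7, 4: NIL, 7: NIL, 8: 4}

# SRAM: queue descriptors as (head, tail, count), taken from the figure.
queue_descriptors = {"Q0": (0, 7, 3), "Q1": (8, 4, 2)}


def buffers_in_queue(qid):
    """Walk the next pointers from the head and return the buffer addresses."""
    head, tail, count = queue_descriptors[qid]
    addresses = []
    addr = head
    for _ in range(count):
        addresses.append(addr)
        addr = next_ptr[addr]
    # The last buffer visited must be the tail recorded in the descriptor.
    assert addresses[-1] == tail
    return addresses


print(buffers_in_queue("Q0"))  # [0, 2, 7]
print(buffers_in_queue("Q1"))  # [8, 4]
```

Only the descriptor needs to be touched to find either end of a queue, which is why descriptor access time dominates the cost of each queuing operation.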
Queuing Operation
For each arriving packet
» A buffer is dequeued from the free buffer queue
» The packet is written into it
» The buffer is enqueued into the appropriate queue
» Queue descriptors are updated
Thus an enqueue operation into any queue involves
» Update of the free queue descriptor
– Read followed by write
» Update of the virtual queue descriptor
– Read followed by write
The free queue descriptor is kept on-chip, so its updates are fast
However, virtual queue descriptors are off-chip, and hence their updates are slow
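The enqueue and dequeue steps listed above can be sketched as a toy software model. This is a minimal illustration, not the hardware design: the class `QueueMemory`, its fields, and the `offchip_qd_accesses` counter are hypothetical names, and plain Python lists stand in for SRAM and DRAM.

```python
class QueueMemory:
    """Toy model of the free buffer queue and the virtual queues."""

    def __init__(self, num_buffers, num_queues):
        # Initially every buffer is linked into the free buffer queue.
        self.next_ptr = [i + 1 for i in range(num_buffers)]
        self.next_ptr[-1] = None
        self.free_qd = {"head": 0, "tail": num_buffers - 1, "count": num_buffers}
        self.virtual_qd = [
            {"head": None, "tail": None, "count": 0} for _ in range(num_queues)
        ]
        self.dram = [None] * num_buffers   # packet storage
        self.offchip_qd_accesses = 0       # reads + writes of virtual QDs

    def _pop(self, qd):
        """Unlink and return the buffer at the head of a queue."""
        buf = qd["head"]
        qd["head"] = self.next_ptr[buf]
        qd["count"] -= 1
        if qd["count"] == 0:
            qd["tail"] = None
        self.next_ptr[buf] = None
        return buf

    def _push(self, qd, buf):
        """Link a buffer at the tail of a queue."""
        self.next_ptr[buf] = None
        if qd["count"] == 0:
            qd["head"] = buf
        else:
            self.next_ptr[qd["tail"]] = buf
        qd["tail"] = buf
        qd["count"] += 1

    def enqueue_packet(self, qid, packet):
        if self.free_qd["count"] == 0:
            raise MemoryError("no free buffers")
        buf = self._pop(self.free_qd)          # on-chip descriptor: fast
        self.dram[buf] = packet                # write packet into the buffer
        self._push(self.virtual_qd[qid], buf)  # off-chip descriptor: slow
        self.offchip_qd_accesses += 2          # one read, one write
        return buf

    def dequeue_packet(self, qid):
        qd = self.virtual_qd[qid]
        if qd["count"] == 0:
            return None
        buf = self._pop(qd)
        self.offchip_qd_accesses += 2          # one read, one write
        packet, self.dram[buf] = self.dram[buf], None
        self._push(self.free_qd, buf)          # return buffer to the free queue
        return packet


mem = QueueMemory(num_buffers=4, num_queues=2)
mem.enqueue_packet(0, "p0")
mem.enqueue_packet(1, "p1")
mem.enqueue_packet(0, "p2")
print(mem.dequeue_packet(0))    # p0
print(mem.dequeue_packet(0))    # p2
print(mem.free_qd["count"])     # 3
print(mem.offchip_qd_accesses)  # 10
```

Every operation costs one read and one write of an off-chip virtual queue descriptor, which is the cost the rest of the talk sets out to reduce.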
Queuing Operation in a NP
To achieve high throughput, a group of processors and their associated threads collectively perform the queuing
» Each thread handles one packet at a time and enqueues/dequeues it into/from the appropriate queue
When the arriving/departing packets all belong to different queues, such a scheme speeds up the operation linearly with the number of threads
However, when packets belong to the same queue, the entire operation gets serialized, and threads start competing for the same queue descriptor
» Multiple processors/threads then provide no benefit
Operation
[Figure: thread timelines without a queuing cache. If all threads access different queues, each thread (Thread 0 to Thread x) reads, updates, and writes its own QD (QD A, QD C, QD E, QD G, ...) in parallel. If all threads access the same queue, Thread 0 reads, updates, and writes QD A while the other threads wait; Thread 1 waits for Thread 0 before it can read, update, and write QD A, and so on for the remaining threads.]
Solution
Accelerate the serialized operations
» Use mechanisms which enable serialized operations to run faster
This can be done by adding a small on-chip cache to hold the queue descriptors currently being accessed
Thus all threads but the first will be able to update the queue descriptor much faster
» When threads access different queue descriptors, the operation proceeds as before
» When threads access the same queue descriptor, the operation is still serialized, but each step is very fast
Queuing cache
Thus, the queuing cache sits between the memory hierarchy and the MEs
» Whenever queue descriptors are accessed, they are put into the cache
Questions
» Size of the cache?
» Eviction policy?
Intuitively, the size of the cache should equal the maximum number of threads that collectively perform the queuing operation
» Because only that many QDs can be accessed at a time
The eviction policy can be Least Recently Used (LRU)
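A software sketch of such a descriptor cache, sized to the number of threads and using LRU eviction with write-back, might look as follows. It is only an illustration of the idea: `QueueDescriptorCache`, `access`, and the dictionary standing in for off-chip SRAM are hypothetical names, not part of the proposed hardware.

```python
from collections import OrderedDict


class QueueDescriptorCache:
    """Small LRU cache of queue descriptors with write-back on eviction."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity       # e.g. the number of queuing threads
        self.backing = backing_store   # stands in for off-chip SRAM
        self.entries = OrderedDict()   # qid -> descriptor, least recent first
        self.hits = 0
        self.misses = 0

    def access(self, qid):
        """Return the cached descriptor for qid, loading it on a miss."""
        if qid in self.entries:
            self.hits += 1
            self.entries.move_to_end(qid)  # mark as most recently used
        else:
            self.misses += 1
            if len(self.entries) >= self.capacity:
                # Evict the least recently used descriptor and write it back.
                victim, qd = self.entries.popitem(last=False)
                self.backing[victim] = qd
            empty = {"head": None, "tail": None, "count": 0}
            self.entries[qid] = dict(self.backing.get(qid, empty))
        return self.entries[qid]


sram = {}
cache = QueueDescriptorCache(capacity=4, backing_store=sram)

# Four threads hit the same queue: one slow miss, then three fast hits.
for _ in range(4):
    cache.access(7)["count"] += 1
print(cache.hits, cache.misses)  # 3 1

# Four accesses to other queues push queue 7 out and write it back.
for qid in (1, 2, 3, 4):
    cache.access(qid)
print(7 in cache.entries, sram[7]["count"])  # False 4
```

With the capacity equal to the thread count, a descriptor that some thread is still working on is never the least recently used entry at the moment another thread needs a slot, which is the intuition behind the sizing rule above.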
Operation with queuing cache
[Figure: thread timelines with the queuing cache. If all threads access different queues, each thread (Thread 0 to Thread x) reads its QD, updates it, and writes back the LRU QD, in parallel. If all threads access the same queue, Thread 0 reads QD A and updates it while the other threads wait for QD A; the remaining threads then perform their updates one after another.]
Performance comparison
Assumptions: a 200 MHz DDR SRAM with an access latency of 80 ns, a queuing cache access latency of 20 ns, and 10 ns of processor time to execute all the queuing-related instructions associated with a single packet
[Figure: million enqueues per second (0 to 50) versus number of threads (1 to 16). Four curves: best throughput with queuing cache, best throughput without queuing cache, worst throughput with queuing cache, worst throughput without queuing cache.]
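To get a feel for how the three latencies interact, here is a rough back-of-the-envelope model in Python. It is an assumption of this sketch, not necessarily the model behind the plot: each enqueue is taken to cost one descriptor read, the 10 ns of processing, and one descriptor write; the serialized (same-queue) case runs at cache speed when the cache is present; and memory bandwidth limits are ignored, so the parallel case scales without bound. The function names are illustrative.

```python
SRAM_NS = 80.0     # off-chip SRAM access latency (from the slide)
CACHE_NS = 20.0    # queuing cache access latency (from the slide)
COMPUTE_NS = 10.0  # per-packet queuing instructions (from the slide)


def enqueue_time_ns(access_ns):
    """Assumed cost of one enqueue: read QD, update, write QD."""
    return access_ns + COMPUTE_NS + access_ns


def throughput_meps(threads, same_queue, use_cache):
    """Estimated throughput in million enqueues per second."""
    if same_queue:
        # Serialized: one enqueue at a time, regardless of thread count.
        access = CACHE_NS if use_cache else SRAM_NS
        return 1e3 / enqueue_time_ns(access)
    # Different queues: threads proceed in parallel, each going to SRAM.
    return threads * 1e3 / enqueue_time_ns(SRAM_NS)


for n in (1, 4, 16):
    print(
        n,
        round(throughput_meps(n, same_queue=True, use_cache=False), 1),
        round(throughput_meps(n, same_queue=True, use_cache=True), 1),
        round(throughput_meps(n, same_queue=False, use_cache=True), 1),
    )
```

Under these assumptions the serialized case improves from roughly 5.9 to 20 million enqueues per second when descriptors are served from the cache, while the parallel case is unaffected; the absolute numbers should not be read as the values in the plot.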
Our design approach
Since queuing is so common in NPs, it may be a very good idea to add hardware-level support for enqueue and dequeue operations
The queuing cache is the best place for this functionality, because queuing will then be very fast in situations where it gets serialized
Thus each NP will support standard instructions such as enqueue and dequeue
» These instructions will be sent to the queuing cache
» The queuing cache will internally manage the pointers and also handle any contention when threads access the same queue
Threads themselves are also relieved of the burden of synchronization and pointer management and can operate independently
Implementation
[Figure: n processors, each with m threads, send enqueue/dequeue commands to the queuing cache. The queuing cache holds x entries, Q0 to Q(x-1), each with a tag and a queue descriptor (head/tail/count), selected by a cache index. External memory holds the queue descriptors of all q queues, Q0 to Q(q-1), and the links (next pointers), each stored as address and data.]
Intel’s Approach
Intel’s second-generation IXP network processors support queuing via
» The SRAM controller, which holds queue descriptors and implements queuing operations, and
» MEs, which support enqueue and dequeue instructions
Caching of queue descriptors is implemented using
» A Q-array in the memory controller
» Any queuing operation is preceded by a transfer of the queue descriptor from SRAM to the Q-array
» A CAM is kept in each ME
– To keep track of which QDs are cached and their positions in the Q-array
The CAM supports LRU, which is used to evict entries from the Q-array
Comparison
Reduced instruction count on each processor
» If we move all the logic associated with enqueues and dequeues to the queuing cache, software may become simpler
Simple and modular software code for queuing tasks
» No need for synchronization, etc.
A queuing cache built near the memory controller results in significantly reduced on-chip communication
» Since the queuing cache handles the pointer processing as well, the processors needn’t fetch the queue descriptors at all
» The only communication between processors and the queuing cache is instruction exchange
More scalable
» Any number of MEs can participate in queuing
» No local CAM per ME is needed, unlike Intel’s IXP approach
Conclusion
Contributions
» Brief qualitative and quantitative analysis of the queuing cache
» A proposal for an efficient and scalable design
Future work
» Comparison to other caching techniques
» Implementation to measure the real complexity
We believe that such a cache-based centralized queuing hardware unit will make future network processors more
» Scalable and
» Easy to program
Questions?