David M. Zar
Applied Research Laboratory
Computer Science and Engineering Department
ONL Stats Block
2 - David M. Zar - 04/19/23
Stats Engine
The Stats Engine is a single ME devoted to accepting messages from a scratch ring and performing increment and add operations on counters.
» All MEs that need to update counters will use the Stats Engine
» Operations supported will be
   Atomic increment (+1)
   Atomic add (+data)
» Format of the commands will be
   Opcode (4b) Data (12b) Index (16b)
ONL NP Router
[Block diagram. Blocks shown: Rx (2 ME), Mux (1 ME), Parse, Lookup, Copy (3 MEs), QM (1 ME), HdrFmt (1 ME), Tx (1 ME), Stats (1 ME), FreeList Mgr (1 ME), Plugin 0 through Plugin 4, and xScale. Memories shown: TCAM, SRAM, and Assoc. Data ZBT-SRAM. Interconnect labels: NN, NN Ring, SRAM Ring, Scratch Ring. Other labels: "Tx, QM, Parse, Plugin, XScale", "QM, Copy, Plugins", "64KW", and "64KW Each".]
MEs -> Stats Block
Stats message fields: Opcode (4b), Index (16b), Data (12b)
Opcodes
Opcode –
» 0011  +1, +data  pre-q counter specified in Index
» 0111  +1, +data  post-q counter specified in Index
» 0010  +1         pre-q counter specified in Index
» 0110  +1         post-q counter specified in Index
» 0001  +data      pre-q counter specified in Index
» 0101  +data      post-q counter specified in Index
» 1011  +1, +data  global register specified in Index
» 1010  +1         global register specified in Index
» 1001  +data      global register specified in Index
(not implemented – 4/23/07)
Opcode (4b) Data (12b) Index (16b)
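The opcode table follows a regular bit pattern: bit 3 selects a global register, bit 2 selects post-q, bit 1 requests +1, and bit 0 requests +data. The following is a minimal Python sketch of packing and decoding a command word under that reading. The bit positions within the 32-bit word (opcode in the top 4 bits, then data, then index) are an assumption taken from the field order listed above, and the function names are hypothetical.

```python
# Assumed layout (not confirmed by the slides): opcode in bits 31..28,
# data in bits 27..16, index in bits 15..0.
OPCODE_SHIFT = 28
DATA_SHIFT = 16


def pack_command(opcode: int, data: int, index: int) -> int:
    """Build one 32-bit stats command word from its three fields."""
    if not 0 <= opcode <= 0xF:
        raise ValueError("opcode must fit in 4 bits")
    if not 0 <= data <= 0xFFF:
        raise ValueError("data must fit in 12 bits")
    if not 0 <= index <= 0xFFFF:
        raise ValueError("index must fit in 16 bits")
    return (opcode << OPCODE_SHIFT) | (data << DATA_SHIFT) | index


def unpack_command(word: int) -> tuple:
    """Split a command word back into (opcode, data, index)."""
    return (word >> OPCODE_SHIFT) & 0xF, (word >> DATA_SHIFT) & 0xFFF, word & 0xFFFF


def decode_opcode(opcode: int) -> dict:
    """Interpret the four opcode bits as independent flags."""
    return {
        "is_global": bool(opcode & 0b1000),   # 1xxx: global register
        "post_q": bool(opcode & 0b0100),      # x1xx: post-q counter
        "increment": bool(opcode & 0b0010),   # xx1x: +1
        "add_data": bool(opcode & 0b0001),    # xxx1: +data
    }


word = pack_command(0b0111, 64, 300)  # +1, +data on post-q counter 300
print(hex(word), unpack_command(word), decode_opcode(0b0111))
```

Treating the opcode as four flags keeps the decoder free of a nine-way lookup table and matches the don't-care patterns used on the later slides.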
Stats Counters
Each Index specifies a group of four counters
» Pre-Q packet count
» Pre-Q byte count
» Post-Q packet count
» Post-Q byte count
The packet counters get updated when the +1 instructions are specified (opcodes 0-1-)
The byte counters get updated when the +data instructions are specified (opcodes 0--1)
For plug-ins, the use of each counter can be redefined, but the opcodes do not change (i.e., each stats index corresponds to two incrementers and two adders).
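To make the "two incrementers and two adders" structure concrete, here is a small Python sketch of applying a stats-counter opcode to one four-counter group. The in-memory order of the group (pre-Q packet, pre-Q byte, post-Q packet, post-Q byte) and the 32-bit wraparound are assumptions; the slides list the counters in this order but do not state the layout or the counter width.

```python
MASK32 = 0xFFFFFFFF  # assumed counter width of one 32-bit word


def update_group(group: list, opcode: int, data: int) -> None:
    """Apply a stats-counter opcode (0xxx) to a 4-entry counter group in place.

    Assumed layout: [pre-Q packets, pre-Q bytes, post-Q packets, post-Q bytes].
    """
    if opcode & 0b1000:
        raise ValueError("global-register opcodes do not address a stats group")
    base = 2 if opcode & 0b0100 else 0      # post-q pair or pre-q pair
    if opcode & 0b0010:                     # +1 goes to the packet counter
        group[base] = (group[base] + 1) & MASK32
    if opcode & 0b0001:                     # +data goes to the byte counter
        group[base + 1] = (group[base + 1] + data) & MASK32


group = [0, 0, 0, 0]
update_group(group, 0b0011, 76)   # pre-q: +1 and +76 bytes
update_group(group, 0b0110, 0)    # post-q: +1 only
update_group(group, 0b0101, 100)  # post-q: +100 bytes only
print(group)
```

A plug-in that redefines the counters would still see the same behavior: entries 0 and 2 can only be incremented, and entries 1 and 3 can only be added to.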
Global Registers
For system-wide counters, we define a separate set of global registers.
» RX (packet and byte, 5 ports: 10 words)
» TX (packet and byte, 5 ports: 10 words)
» Drop counts (10 words)
» Plug-in use (four per plug-in: 20 words)
» Per ME error counters (8 words)
» 10+10+10+20+8 = 58, so reserve 64 words for these
The register gets incremented when the +1 instructions are specified (opcodes 101-)
The register gets added to when the +data instructions are specified (opcodes 10-1)
The RX and TX counters will be assigned on even-word boundaries (lsb = 0) so that the packet and byte counters are paired and both can be updated with a single +1, +data command (1011 opcode)
For plug-ins, the use of each register is under the control of the plug-in
» Four independent counters
» Two sets of two counters
» One set of two and two independent
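The word budget and the even-word pairing rule can be checked with a short layout calculation. This Python sketch assumes the five regions are packed back to back in the order listed above and that each port's packet counter sits directly before its byte counter; neither detail is stated on the slide, and the names are hypothetical.

```python
# Region sizes in words, taken from the list above. The packing order is assumed.
REGIONS = [("rx", 10), ("tx", 10), ("drop", 10), ("plugin", 20), ("me_error", 8)]
RESERVED_WORDS = 64


def build_layout(regions):
    """Assign each region a starting word offset; return (offsets, words_used)."""
    offsets = {}
    next_word = 0
    for name, size in regions:
        offsets[name] = next_word
        next_word += size
    return offsets, next_word


offsets, words_used = build_layout(REGIONS)


def port_counter_words(block: str, port: int):
    """Return (packet_word, byte_word) for an RX or TX port (assumed pkt, byte order)."""
    pkt = offsets[block] + 2 * port
    return pkt, pkt + 1


print(words_used, RESERVED_WORDS - words_used)
print(port_counter_words("rx", 0), port_counter_words("tx", 4))
```

Because the RX and TX regions start on even offsets and each port takes two words, every packet counter lands on an even word, which is the condition the 1011 opcode relies on.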
ONL Router Counter Registers (in dl_system.h)
// RX Per Port registers: (Updated by MUX)
ONL_ROUTER_RX_PORT0_PKT_CNTR
ONL_ROUTER_RX_PORT0_BYTE_CNTR
ONL_ROUTER_RX_PORT1_PKT_CNTR
ONL_ROUTER_RX_PORT1_BYTE_CNTR
ONL_ROUTER_RX_PORT2_PKT_CNTR
ONL_ROUTER_RX_PORT2_BYTE_CNTR
ONL_ROUTER_RX_PORT3_PKT_CNTR
ONL_ROUTER_RX_PORT3_BYTE_CNTR
ONL_ROUTER_RX_PORT4_PKT_CNTR
ONL_ROUTER_RX_PORT4_BYTE_CNTR
// TX Per Port registers: (Updated by HF)
ONL_ROUTER_TX_PORT0_PKT_CNTR
ONL_ROUTER_TX_PORT0_BYTE_CNTR
ONL_ROUTER_TX_PORT1_PKT_CNTR
ONL_ROUTER_TX_PORT1_BYTE_CNTR
ONL_ROUTER_TX_PORT2_PKT_CNTR
ONL_ROUTER_TX_PORT2_BYTE_CNTR
ONL_ROUTER_TX_PORT3_PKT_CNTR
ONL_ROUTER_TX_PORT3_BYTE_CNTR
ONL_ROUTER_TX_PORT4_PKT_CNTR
ONL_ROUTER_TX_PORT4_BYTE_CNTR
// IP Drop registers (Updated by PLC)
ONL_ROUTER_IP_HEC_DROP_CNTR
ONL_ROUTER_IP_LENGTH_ERR_DROP_CNTR
ONL_ROUTER_IP_HDR_LENGTH_ERR_DROP_CNTR
ONL_ROUTER_IP_VERSION_ERR_DROP_CNTR
ONL Router Counter Registers (cont.)
// PLC Drop registers (Updated by Parse, Lookup or Copy)
ONL_ROUTER_PLC_TO_PLUGIN_DROP_CNTR
ONL_ROUTER_PLC_TO_XSCALE_DROP_CNTR
// QM Drop registers (Updated by QM)
ONL_ROUTER_QUEUE_OVERFLOW_DROP_CNTR
// XScale Drop registers (Updated by XScale)
ONL_ROUTER_XSCALE_DROP_CNTR
// Rx Drop registers (Updated by Rx)
ONL_ROUTER_RX__DROP_CNTR
// Tx Drop registers (Updated by Tx)
ONL_ROUTER_TX_DROP_CNTR
// Per Block Generic Error Counters
ONL_ROUTER_RX_GENERIC_ERROR_CNTR
ONL_ROUTER_MUX_GENERIC_ERROR_CNTR
ONL_ROUTER_PLC_GENERIC_ERROR_CNTR
ONL_ROUTER_QM_GENERIC_ERROR_CNTR
ONL_ROUTER_HF_GENERIC_ERROR_CNTR
ONL_ROUTER_TX_GENERIC_ERROR_CNTR
ONL_ROUTER_STATS_GENERIC_ERROR_CNTR
ONL_ROUTER_FREELISTMGR_GENERIC_ERROR_CNTR
ONL Router Counter Registers (cont.)
// Plugin 0 Counters (for use however Plugin writer wants to use them)
ONL_ROUTER_PLUGIN_0_CNTR_0
ONL_ROUTER_PLUGIN_0_CNTR_1
ONL_ROUTER_PLUGIN_0_CNTR_2
ONL_ROUTER_PLUGIN_0_CNTR_3
// Plugin 1 Counters (for use however Plugin writer wants to use them)
ONL_ROUTER_PLUGIN_1_CNTR_0
ONL_ROUTER_PLUGIN_1_CNTR_1
ONL_ROUTER_PLUGIN_1_CNTR_2
ONL_ROUTER_PLUGIN_1_CNTR_3
// Plugin 2 Counters (for use however Plugin writer wants to use them)
ONL_ROUTER_PLUGIN_2_CNTR_0
ONL_ROUTER_PLUGIN_2_CNTR_1
ONL_ROUTER_PLUGIN_2_CNTR_2
ONL_ROUTER_PLUGIN_2_CNTR_3
// Plugin 3 Counters (for use however Plugin writer wants to use them)
ONL_ROUTER_PLUGIN_3_CNTR_0
ONL_ROUTER_PLUGIN_3_CNTR_1
ONL_ROUTER_PLUGIN_3_CNTR_2
ONL_ROUTER_PLUGIN_3_CNTR_3
// Plugin 4 Counters (for use however Plugin writer wants to use them)
ONL_ROUTER_PLUGIN_4_CNTR_0
ONL_ROUTER_PLUGIN_4_CNTR_1
ONL_ROUTER_PLUGIN_4_CNTR_2
ONL_ROUTER_PLUGIN_4_CNTR_3
Stats Counter Priority
There are two levels of priority for Stats Counters
» High-priority (high-speed) are kept in local memory. There are 64 sets of counters for the router and 64 for the plug-ins
» Low-priority (low-speed) are in SRAM. There are 2^16 - 128 = 65408 of these.
Stats Counters 0-127 point to the high-priority counters while 128-65535 are low-priority counters.
Using low-priority Stats Counters to count events that happen at high speed may degrade system performance (being a pre-Q counter on a high-priority queue, for example)
Plug-ins need to be aware of the segmentation of priority so they can use the proper priority counters based on needs
Global Registers are always high-priority
Eight threads used
» Seven threads process messages from the input scratch ring
» One thread writes 8W chunks of the local memory counters/registers to SRAM so that each counter/register is updated in SRAM several times a second.
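A plug-in choosing a Stats Counter index can classify it with a simple range check. The Python sketch below follows the split described above; the assignment of indices 0-63 to the router and 64-127 to the plug-ins is an inference from the order in which the two sets are listed, not something the slide states explicitly, and the function name is hypothetical.

```python
TOTAL_INDICES = 1 << 16      # 16-bit Index field
HIGH_PRIORITY_SETS = 128     # indices 0-127 live in local memory
ROUTER_SETS = 64             # assumed: router sets come first, then plug-in sets


def classify(index: int):
    """Return (priority, owner) for a Stats Counter index."""
    if not 0 <= index < TOTAL_INDICES:
        raise ValueError("index must fit in 16 bits")
    if index < ROUTER_SETS:
        return ("high", "router")
    if index < HIGH_PRIORITY_SETS:
        return ("high", "plug-in")
    return ("low", "unspecified")  # SRAM-backed; the slide assigns no owner


low_priority_count = TOTAL_INDICES - HIGH_PRIORITY_SETS
print(low_priority_count, classify(5), classify(100), classify(128))
```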
Stats ME Local Memory Map
Region                                    First word   Last word
Global Registers                          0            63
Reserved                                  64           127
Stats Counters (router), 64*4W = 256W     128          383
Stats Counters (plug-ins), 64*4W = 256W   384          639
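The map implies a direct index-to-address calculation. This Python sketch works in word addresses and converts to byte addresses by multiplying by four; the 4-byte word size is an assumption, chosen because it reproduces the 512 + (index << 4) expression used elsewhere in the deck. The function names are hypothetical.

```python
GLOBAL_BASE_WORD = 0
STATS_BASE_WORD = 128
WORDS_PER_GROUP = 4
BYTES_PER_WORD = 4  # assumed word size


def global_register_word(reg: int) -> int:
    """Word address of a global register (0..63)."""
    if not 0 <= reg < 64:
        raise ValueError("global register number out of range")
    return GLOBAL_BASE_WORD + reg


def stats_group_word(index: int) -> int:
    """Word address of the first counter in a high-priority stats group (0..127)."""
    if not 0 <= index < 128:
        raise ValueError("only indices 0-127 are held in local memory")
    return STATS_BASE_WORD + index * WORDS_PER_GROUP


def to_byte_address(word: int) -> int:
    return word * BYTES_PER_WORD


print(stats_group_word(0), stats_group_word(64), to_byte_address(stats_group_word(5)))
```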
Stats Pseudocode
While (true and ctx={0:6}) {
    dl_source_scr_1word()
    decode_opcode()
    case (opcode) {
        Global Register:
            lm_addr = index << 2;
            do opcode;
        Stats Index:
            if (index > 127) {
                do slow_opcode;
            } else {
                lm_addr = (128*4) + (index << 4);
                do fast_opcode;
            }
    }
}
While (true and ctx=7) {
    offset = 0;
    for (l_mem=0; l_mem<(64*4); l_mem=l_mem+8) {
        sram_write(GLOBAL_REGS_BASE, offset, l_mem, 8);
        offset = offset + 32;
    }
    offset = 0;
    for (l_mem=(128*4); l_mem<(128*16); l_mem=l_mem+8) {
        sram_write(ONL_STATS_BASE, offset, l_mem, 8);
        offset = offset + 32;
    }
}
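The two loops above can be modeled in software to see how a command turns into a counter update and how the write-back thread mirrors local memory. This Python sketch is a behavioral model only, not the microengine code. It assumes the command layout opcode(31..28), data(27..16), index(15..0), word-addressed local memory, packet counter before byte counter in each pair, and that a lone +data on a global register targets the indexed word. All class and method names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
LM_WORDS = 640
STATS_BASE_WORD = 128
# 8-word chunks covering the global registers and the stats counters.
CHUNKS = list(range(0, 64, 8)) + list(range(128, LM_WORDS, 8))


class StatsModel:
    def __init__(self):
        self.local_mem = [0] * LM_WORDS      # high-priority counters/registers
        self.sram_slow = {}                  # (index, slot) -> low-priority counter
        self.sram_shadow = [0] * LM_WORDS    # written-back copy of local memory
        self._next_chunk = 0

    def _bump(self, word, amount):
        self.local_mem[word] = (self.local_mem[word] + amount) & MASK32

    def process(self, command):
        """Model of one count thread handling one scratch-ring message."""
        opcode = (command >> 28) & 0xF
        data = (command >> 16) & 0xFFF
        index = command & 0xFFFF
        inc, add = bool(opcode & 0b0010), bool(opcode & 0b0001)
        if opcode & 0b1000:                  # global register
            if index >= 64:
                raise ValueError("global register number out of range")
            if inc:
                self._bump(index, 1)
            if add:                          # paired byte word when combined with +1
                self._bump(index + 1 if inc else index, data)
            return
        slot = 2 if opcode & 0b0100 else 0   # post-q pair or pre-q pair
        if index > 127:                      # slow path: counter lives in SRAM
            if inc:
                key = (index, slot)
                self.sram_slow[key] = (self.sram_slow.get(key, 0) + 1) & MASK32
            if add:
                key = (index, slot + 1)
                self.sram_slow[key] = (self.sram_slow.get(key, 0) + data) & MASK32
        else:                                # fast path: counter lives in local memory
            word = STATS_BASE_WORD + index * 4 + slot
            if inc:
                self._bump(word, 1)
            if add:
                self._bump(word + 1, data)

    def write_back_one_chunk(self):
        """Model of the write-back thread copying one 8-word block to SRAM."""
        start = CHUNKS[self._next_chunk]
        self.sram_shadow[start:start + 8] = self.local_mem[start:start + 8]
        self._next_chunk = (self._next_chunk + 1) % len(CHUNKS)


model = StatsModel()
model.process((0b0011 << 28) | (76 << 16) | 5)    # fast pre-q +1, +76 on index 5
model.process((0b0110 << 28) | 300)               # slow post-q +1 on index 300
model.process((0b1011 << 28) | (76 << 16) | 2)    # global +1, +76 on register pair 2/3
for _ in range(len(CHUNKS)):
    model.write_back_one_chunk()
```

After one full pass of the write-back loop, `sram_shadow` matches `local_mem` for every global register and stats counter, which is the property the eighth thread is there to maintain.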
Stats Function Calls
Defined in counter_util.uc:
» _WU_preq_update(reg_num, tx_reg, data, update_sig, error_addr) // +1 & +data
» _WU_preq_register_add(reg_num, tx_reg, update_sig, error_addr) // +1
» _WU_preq_register_add(reg_num, tx_reg, data, update_sig, error_addr) // +data
» _WU_postq_update(reg_num, tx_reg, data, update_sig, error_addr) // +1 & +data
» _WU_postq_register_add(reg_num, tx_reg, update_sig, error_addr) // +1
» _WU_postq_register_add(reg_num, tx_reg, data, update_sig, error_addr) // +data
» _WU_global_register_add(reg_num, tx_reg, update_sig, error_addr) // +1
» _WU_global_register_add(reg_num, tx_reg, data, update_sig, error_addr) // +data
» _WU_global_register_update(reg_num, tx_reg, data, update_sig, error_addr) // +1 & +data
Performance Targets
How many packets processed per second?
» To hit 5 Gb rate:
   76B per min IPv4 packet (64B min Enet Frame + 12B IFS)
   1.4 GHz clock rate
   5 Gb/sec * 1B/8b * packet/76B = 8.23 Mp/sec
   1.4 Gcycle/sec * 1 sec/8.23 Mp = 170 cycles per packet
   compute budget: 170 cycles
   latency budget: (threads*170)
   7 threads: 1190 cycles
How many count requests per packet (typical packet)?
» RX per-port count
» TX per-port count
» Pre-Q stats index
» Post-Q stats index
Total counts = 8.23 Mp/sec * 4 counts/packet = 32.92 Mcounts/sec
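The budget arithmetic can be reproduced in a few lines of Python. The inputs are the figures from the slide; the variable names are hypothetical. Carrying full precision gives values slightly below the slide's rounded ones (the slide rounds the packet rate to 8.23 Mp/sec before multiplying).

```python
LINE_RATE_BPS = 5e9        # 5 Gb/sec target
MIN_PACKET_BYTES = 76      # 64B minimum Ethernet frame + 12B inter-frame spacing
CLOCK_HZ = 1.4e9           # 1.4 GHz ME clock
COUNT_THREADS = 7
COUNTS_PER_PACKET = 4      # RX, TX, pre-Q, post-Q

packets_per_sec = LINE_RATE_BPS / 8 / MIN_PACKET_BYTES
cycles_per_packet = CLOCK_HZ / packets_per_sec
latency_budget_cycles = COUNT_THREADS * int(cycles_per_packet)
counts_per_sec = packets_per_sec * COUNTS_PER_PACKET

print(f"{packets_per_sec / 1e6:.2f} Mp/sec")
print(f"{cycles_per_packet:.1f} cycles per packet")
print(f"{latency_budget_cycles} cycles latency budget")
print(f"{counts_per_sec / 1e6:.2f} Mcounts/sec")
```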
Stats Block Diagram
[Flowchart. Costs are given in clocks (C or CLK) and latency cycles (L).]
» Read Scratch Ring (SCR READ: 60L + 2C)
» Decode Opcode (3C)
» Global Register? (4 CLK)
   Y: LM_ADDR = (index << 2)
   N: Index > 127? (3 CLK)
      Y: Slow Counter
      N: LM_ADDR = 512 + (index << 4)
» +data? (3C)
   Y: LM_ADDR++ = *LM_ADDR + data
» +1? (3C)
   Y: LM_ADDR = *LM_ADDR + 1
Worst case (fast) is for Stats Counters: 20 Clocks + 60 Cycles Latency
Performance Results
Total fast counts:
» Count time is, effectively, 20 cycles (all 60 cycles of latency are hidden)
» 1400 Mcycles/sec / 20 cycles/count = 70 Mcounts/sec*
» Target is 32.92 Mcounts/sec.
Slow counts:
» Count time is about 150 – 60 = 90 cycles (the SRAM latency is not completely hidden)
» 1400/90 = 15.6 Mcounts/sec
SRAM Write-back
» After each count thread has had the chance to run, the write-back thread writes one 8-word block of local memory to SRAM.
» Measured performance is 20 ms for a full write-back (50 updates per second)
» This will slow down the counting, but only by 19 cycles every 7th count (when the counter is fully-loaded) or less than 3 instructions per count thread.
*In simulation, only 17 cycles were measured, for >82 Mcounts/sec
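The throughput figures above all follow from dividing the clock rate by the effective cycles per count. A short Python check, using only numbers from the slide (the helper name is hypothetical):

```python
CLOCK_MCYCLES_PER_SEC = 1400  # 1.4 GHz


def mcounts_per_sec(cycles_per_count: float) -> float:
    """Counts per second, in millions, at a given effective cost per count."""
    return CLOCK_MCYCLES_PER_SEC / cycles_per_count


fast_rate = mcounts_per_sec(20)            # latency fully hidden
simulated_rate = mcounts_per_sec(17)       # cost measured in simulation
slow_rate = mcounts_per_sec(150 - 60)      # SRAM latency only partly hidden
writeback_cycles_per_count = 19 / 7        # 19 cycles spread over 7 count threads
writebacks_per_sec = 1 / 0.020             # one full write-back takes 20 ms

print(f"fast: {fast_rate:.1f}, simulated: {simulated_rate:.1f}, slow: {slow_rate:.1f} Mcounts/sec")
print(f"write-back overhead: {writeback_cycles_per_count:.2f} cycles per count")
print(f"full write-backs per second: {writebacks_per_sec:.0f}")
```

At the fast rate the engine has roughly a twofold margin over the target count rate, while a workload made up entirely of slow counts would fall short of it, which is why high-rate events should use the high-priority counters.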
Lookup File locations
Code
» src/applications/ONL_Router/src/freelistMgr/freelistMgr.uc
» Src/library/dataplane/counter_util.uc
Include Paths
» src/applications/ONL_Router/src/dispatch_loop/ONL/
   dl_source.h and dl_source.uc
   dl_source() and dl_sink() functions
» Other, standard, include paths (Intel SDK provided)