David M. Zar Applied Research Laboratory Computer Science and Engineering Department ONL Stats Block



2 - David M. Zar - 04/19/23

Stats Engine
The Stats Engine is a single microengine (ME) devoted to accepting messages from a scratch ring and performing increment and add operations on counters.
» All MEs that need to update counters will use the Stats Engine
» Operations supported will be:
   Atomic increment (+1)
   Atomic add (+data)
» Format of the commands will be:
   Opcode (4b) | Data (12b) | Index (16b)
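The slides give the field widths (4 + 12 + 16 = 32 bits, one scratch-ring word) but not the bit positions. The sketch below assumes the fields are packed from the most significant bit down in the order listed (opcode in bits 31-28, data in 27-16, index in 15-0); the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMED layout: opcode bits 31-28, data bits 27-16, index bits 15-0. */
#define STATS_OPCODE_SHIFT 28
#define STATS_DATA_SHIFT   16
#define STATS_DATA_MASK    0xFFFu
#define STATS_INDEX_MASK   0xFFFFu

typedef struct {
    uint8_t  opcode; /* 4 bits  */
    uint16_t data;   /* 12 bits */
    uint16_t index;  /* 16 bits */
} stats_cmd;

/* Build the one-word command an ME would put on the Stats scratch ring. */
static uint32_t stats_cmd_pack(stats_cmd c)
{
    assert(c.opcode <= 0xF && c.data <= STATS_DATA_MASK);
    return ((uint32_t)c.opcode << STATS_OPCODE_SHIFT)
         | ((uint32_t)c.data << STATS_DATA_SHIFT)
         | ((uint32_t)c.index & STATS_INDEX_MASK);
}

/* Split a command word back into its three fields. */
static stats_cmd stats_cmd_unpack(uint32_t word)
{
    stats_cmd c;
    c.opcode = (uint8_t)(word >> STATS_OPCODE_SHIFT);
    c.data   = (uint16_t)((word >> STATS_DATA_SHIFT) & STATS_DATA_MASK);
    c.index  = (uint16_t)(word & STATS_INDEX_MASK);
    return c;
}
```

Whatever the exact bit positions, a 12-bit Data field means a single +data command can add at most 4095.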


[Figure: ONL NP Router block diagram. Blocks: Rx (2 MEs), Mux (1 ME), Parse, Lookup, Copy (3 MEs), QM (1 ME), HdrFmt (1 ME), Tx (1 ME), Stats (1 ME), FreeList Mgr (1 ME), Plugins 0-4, XScale, TCAM with associated-data ZBT-SRAM, and SRAM (64KW each). Blocks are connected by next-neighbor (NN) rings, scratch rings, and SRAM rings.]


MEs -> Stats Block

Stats command: Opcode (4b) | Data (12b) | Index (16b)


Opcodes
» 0011: +1, +data on the pre-Q counters specified by Index
» 0111: +1, +data on the post-Q counters specified by Index
» 0010: +1 on the pre-Q counter specified by Index
» 0110: +1 on the post-Q counter specified by Index
» 0001: +data on the pre-Q counter specified by Index
» 0101: +data on the post-Q counter specified by Index
» 1011: +1, +data on the global registers specified by Index
» 1010: +1 on the global register specified by Index
» 1001: +data on the global register specified by Index

(not implemented – 4/23/07)

Opcode (4b) | Data (12b) | Index (16b)
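Read bit by bit, the opcode table appears to follow a simple pattern: bit 3 selects a global register, bit 2 selects post-Q over pre-Q, bit 1 requests +1, and bit 0 requests +data. That reading is inferred from the table, not stated on the slides; the decoder below is a hypothetical sketch of it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit meanings INFERRED from the opcode table (not stated by the author). */
typedef struct {
    bool global; /* bit 3: global register (1) or stats index (0) */
    bool post_q; /* bit 2: post-Q (1) or pre-Q (0) counters       */
    bool inc;    /* bit 1: perform +1                              */
    bool add;    /* bit 0: perform +data                           */
} stats_op;

static stats_op decode_opcode(uint8_t opcode)
{
    assert(opcode <= 0xF);
    stats_op op;
    op.global = (opcode >> 3) & 1;
    op.post_q = (opcode >> 2) & 1;
    op.inc    = (opcode >> 1) & 1;
    op.add    = opcode & 1;
    return op;
}

/* True only for the nine opcodes listed in the table. */
static bool opcode_is_defined(uint8_t opcode)
{
    stats_op op = decode_opcode(opcode);
    if (!op.inc && !op.add)
        return false; /* x000 and x100: no operation requested */
    if (op.global && op.post_q)
        return false; /* 11xx: no post-Q variant for global registers */
    return true;
}
```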


Stats Counters
Each Index specifies a group of four counters:
» Pre-Q packet count
» Pre-Q byte count
» Post-Q packet count
» Post-Q byte count

The packet counters are updated when a +1 instruction is specified (opcodes 0-1-, where "-" means either bit value).

The byte counters are updated when a +data instruction is specified (opcodes 0--1).

For plug-ins, the use of each counter can be redefined, but the opcodes do not change (i.e., each stats index corresponds to two incrementers and two adders).
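A minimal sketch of how one stats-index opcode could act on its four-counter group, assuming the bit reading of the opcode table (bit 2 = post-Q, bit 1 = +1, bit 0 = +data). The struct layout follows the counter order listed above; the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One stats index: four 32-bit counters, in the order listed on the slide. */
typedef struct {
    uint32_t preq_pkts;
    uint32_t preq_bytes;
    uint32_t postq_pkts;
    uint32_t postq_bytes;
} stats_group;

/* Apply a stats-index opcode (bit 3 clear) to one group. */
static void stats_group_apply(stats_group *g, uint8_t opcode, uint16_t data)
{
    assert((opcode & 0x8) == 0); /* global-register opcodes are handled elsewhere */
    uint32_t *pkts  = (opcode & 0x4) ? &g->postq_pkts  : &g->preq_pkts;
    uint32_t *bytes = (opcode & 0x4) ? &g->postq_bytes : &g->preq_bytes;
    if (opcode & 0x2)
        *pkts += 1;     /* opcodes 0-1-: packet counter */
    if (opcode & 0x1)
        *bytes += data; /* opcodes 0--1: byte counter   */
}
```

For a plug-in that redefines the counters, the same code applies: only the meaning of the two "pkts" incrementers and two "bytes" adders changes.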


Global Registers
For system-wide counters, we define a separate set of global registers.
» RX (packet and byte counts, 5 ports: 10 words)
» TX (packet and byte counts, 5 ports: 10 words)
» Drop counts (10 words)
» Plug-in use (four per plug-in: 20 words)
» Per-ME error counters (8 words)
» 10 + 10 + 10 + 20 + 8 = 58, so we reserve 64 words for these

A register is incremented when a +1 instruction is specified (opcodes 101-).

A register is added to when a +data instruction is specified (opcodes 10-1).

The RX and TX packet counters are assigned to even-numbered words (lsb = 0), each followed by its byte counter, so a single command (opcode 1011) can apply +1 and +data to the pair.

For plug-ins, the use of each register is under the control of the plug-in, which may treat its four registers as:
» Four independent counters
» Two sets of two counters
» One set of two and two independent counters
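The word budget and the even-word pairing rule can be checked in a small C sketch. It assumes that for opcode 1011 the Index names the even (packet) word and the byte counter sits in the next word; the array, function name, and register numbers in the usage are hypothetical, not the values in dl_system.h.

```c
#include <assert.h>
#include <stdint.h>

/* Word budget from the slide: RX 10 + TX 10 + drops 10 + plug-ins 20 + ME errors 8. */
enum { GLOBAL_WORDS_USED = 10 + 10 + 10 + 20 + 8, GLOBAL_WORDS_RESERVED = 64 };
_Static_assert(GLOBAL_WORDS_USED == 58, "slide arithmetic");
_Static_assert(GLOBAL_WORDS_USED <= GLOBAL_WORDS_RESERVED, "fits in the reserved block");

static uint32_t global_regs[GLOBAL_WORDS_RESERVED];

/* Apply a global-register opcode (1011, 1010 or 1001) to register `reg`. */
static void global_apply(uint8_t opcode, uint16_t reg, uint16_t data)
{
    assert((opcode & 0xC) == 0x8);   /* 10xx */
    assert(reg < GLOBAL_WORDS_RESERVED);
    if ((opcode & 0x3) == 0x3) {     /* 1011: +1 and +data on an even/odd pair */
        assert((reg & 1) == 0);      /* packet counter must be on an even word */
        global_regs[reg]     += 1;
        global_regs[reg + 1] += data;
    } else if (opcode & 0x2) {       /* 1010: +1 */
        global_regs[reg] += 1;
    } else if (opcode & 0x1) {       /* 1001: +data */
        global_regs[reg] += data;
    }
}
```

The even-word rule is what makes the pair update safe: with an even `reg` below 64, `reg + 1` is always the matching odd word inside the block.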


ONL Router Counter Registers (in dl_system.h)

// RX Per Port registers: (Updated by MUX)
ONL_ROUTER_RX_PORT0_PKT_CNTR
ONL_ROUTER_RX_PORT0_BYTE_CNTR
ONL_ROUTER_RX_PORT1_PKT_CNTR
ONL_ROUTER_RX_PORT1_BYTE_CNTR
ONL_ROUTER_RX_PORT2_PKT_CNTR
ONL_ROUTER_RX_PORT2_BYTE_CNTR
ONL_ROUTER_RX_PORT3_PKT_CNTR
ONL_ROUTER_RX_PORT3_BYTE_CNTR
ONL_ROUTER_RX_PORT4_PKT_CNTR
ONL_ROUTER_RX_PORT4_BYTE_CNTR

// TX Per Port registers: (Updated by HF)
ONL_ROUTER_TX_PORT0_PKT_CNTR
ONL_ROUTER_TX_PORT0_BYTE_CNTR
ONL_ROUTER_TX_PORT1_PKT_CNTR
ONL_ROUTER_TX_PORT1_BYTE_CNTR
ONL_ROUTER_TX_PORT2_PKT_CNTR
ONL_ROUTER_TX_PORT2_BYTE_CNTR
ONL_ROUTER_TX_PORT3_PKT_CNTR
ONL_ROUTER_TX_PORT3_BYTE_CNTR
ONL_ROUTER_TX_PORT4_PKT_CNTR
ONL_ROUTER_TX_PORT4_BYTE_CNTR

// IP Drop registers (Updated by PLC)
ONL_ROUTER_IP_HEC_DROP_CNTR
ONL_ROUTER_IP_LENGTH_ERR_DROP_CNTR
ONL_ROUTER_IP_HDR_LENGTH_ERR_DROP_CNTR
ONL_ROUTER_IP_VERSION_ERR_DROP_CNTR


ONL Router Counter Registers (cont.)

// PLC Drop registers (Updated by Parse, Lookup or Copy)
ONL_ROUTER_PLC_TO_PLUGIN_DROP_CNTR
ONL_ROUTER_PLC_TO_XSCALE_DROP_CNTR

// QM Drop registers (Updated by QM)
ONL_ROUTER_QUEUE_OVERFLOW_DROP_CNTR

// XScale Drop registers (Updated by XScale)
ONL_ROUTER_XSCALE_DROP_CNTR

// Rx Drop registers (Updated by Rx)
ONL_ROUTER_RX__DROP_CNTR

// Tx Drop registers (Updated by Tx)
ONL_ROUTER_TX_DROP_CNTR

// Per Block Generic Error Counters
ONL_ROUTER_RX_GENERIC_ERROR_CNTR
ONL_ROUTER_MUX_GENERIC_ERROR_CNTR
ONL_ROUTER_PLC_GENERIC_ERROR_CNTR
ONL_ROUTER_QM_GENERIC_ERROR_CNTR
ONL_ROUTER_HF_GENERIC_ERROR_CNTR
ONL_ROUTER_TX_GENERIC_ERROR_CNTR
ONL_ROUTER_STATS_GENERIC_ERROR_CNTR
ONL_ROUTER_FREELISTMGR_GENERIC_ERROR_CNTR


ONL Router Counter Registers (cont.)

// Plugin 0 Counters (for use however the plug-in writer wants)
ONL_ROUTER_PLUGIN_0_CNTR_0
ONL_ROUTER_PLUGIN_0_CNTR_1
ONL_ROUTER_PLUGIN_0_CNTR_2
ONL_ROUTER_PLUGIN_0_CNTR_3

// Plugin 1 Counters (for use however the plug-in writer wants)
ONL_ROUTER_PLUGIN_1_CNTR_0
ONL_ROUTER_PLUGIN_1_CNTR_1
ONL_ROUTER_PLUGIN_1_CNTR_2
ONL_ROUTER_PLUGIN_1_CNTR_3

// Plugin 2 Counters (for use however the plug-in writer wants)
ONL_ROUTER_PLUGIN_2_CNTR_0
ONL_ROUTER_PLUGIN_2_CNTR_1
ONL_ROUTER_PLUGIN_2_CNTR_2
ONL_ROUTER_PLUGIN_2_CNTR_3

// Plugin 3 Counters (for use however the plug-in writer wants)
ONL_ROUTER_PLUGIN_3_CNTR_0
ONL_ROUTER_PLUGIN_3_CNTR_1
ONL_ROUTER_PLUGIN_3_CNTR_2
ONL_ROUTER_PLUGIN_3_CNTR_3

// Plugin 4 Counters (for use however the plug-in writer wants)
ONL_ROUTER_PLUGIN_4_CNTR_0
ONL_ROUTER_PLUGIN_4_CNTR_1
ONL_ROUTER_PLUGIN_4_CNTR_2
ONL_ROUTER_PLUGIN_4_CNTR_3


Stats Counter Priority
There are two levels of priority for Stats Counters:
» High-priority (high-speed) counters are kept in local memory. There are 64 sets of counters for the router and 64 for the plug-ins.
» Low-priority (low-speed) counters are in SRAM. There are 2^16 - 128 = 65408 of these.

Stats Counters 0-127 are the high-priority counters, while 128-65535 are low-priority counters. Using low-priority Stats Counters to count events that happen at high speed (for example, as the pre-Q counter of a high-priority queue) may degrade system performance.

Plug-ins need to be aware of this priority split so they can choose counters of the appropriate priority.

Global Registers are always high-priority.

Eight threads are used:
» Seven threads process messages from the input scratch ring
» One thread writes 8W chunks of the local-memory counters/registers to SRAM, so that each counter/register is updated in SRAM several times a second


Stats ME Local Memory Map (word addresses)
» 0-63: Global Registers
» 64-127: Reserved
» 128-383: Stats Counters (router), 64*4W = 256W
» 384-639: Stats Counters (plug-ins), 64*4W = 256W
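The map implies a direct formula from a high-priority stats index to its local-memory location: four words per index starting at word 128, which in bytes is (128*4) + (index << 4), the expression used in the Stats pseudocode. A small sketch (function names hypothetical) that checks the map boundaries:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LM_STATS_BASE_WORD 128 /* first word of "Stats Counters (router)" */
#define FAST_INDEX_LIMIT   128 /* indices 0-127 are high-priority */

/* True if a stats index is kept in local memory (high priority). */
static bool index_is_fast(uint16_t index)
{
    return index < FAST_INDEX_LIMIT;
}

/* Word address of the first (pre-Q packet) counter of a fast stats index.
   Router indices 0-63 land in words 128-383, plug-in indices 64-127 in 384-639. */
static unsigned fast_index_word(uint16_t index)
{
    assert(index_is_fast(index));
    return LM_STATS_BASE_WORD + 4u * index;
}

/* Same location as a byte address: (128*4) + (index << 4). */
static unsigned fast_index_byte(uint16_t index)
{
    return fast_index_word(index) * 4u;
}
```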


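Stats Pseudocode

While (true and ctx={0:6}) {                        // count threads
    dl_source_scr_1word();                          // one command word from the scratch ring
    decode_opcode();
    case (opcode) {
        Global Register: lm_addr = index << 2;      // LM byte address of word 0-63
                         do opcode;
        Stats Index:     if (index > 127) {
                             do slow_opcode;        // low-priority counter in SRAM
                         } else {
                             lm_addr = (128*4) + (index << 4);  // 4 words per index from word 128
                             do fast_opcode;
                         }
    }
}

While (true and ctx=7) {                            // write-back thread; l_mem is an LM byte address
    offset = 0;
    for (l_mem=0; l_mem<(64*4); l_mem=l_mem+32) {   // 64 global registers, 8 words (32 bytes) at a time
        sram_write(GLOBAL_REGS_BASE, offset, l_mem, 8); offset = offset + 32;
    }
    offset = 0;
    for (l_mem=(128*4); l_mem<(128*4)+(128*16); l_mem=l_mem+32) {  // fast indices 0-127
        sram_write(ONL_STATS_BASE, offset, l_mem, 8); offset = offset + 32;
    }
}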
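Treating l_mem in the write-back loop as a local-memory byte address (consistent with lm_addr = index << 2 in the decode path), one full pass should cover words 0-63 and 128-639 in 8-word (32-byte) chunks. The harness below checks that coverage; fake_sram_write() is a hypothetical stand-in for sram_write() that only records what would be written.

```c
#include <assert.h>

#define CHUNK_WORDS 8
#define CHUNK_BYTES (CHUNK_WORDS * 4)

/* Hypothetical stand-in for sram_write(): counts chunks and the LM range touched. */
static unsigned writes, words_written, lm_end;

static void fake_sram_write(unsigned sram_offset, unsigned lm_byte, unsigned nwords)
{
    (void)sram_offset;
    assert(lm_byte % CHUNK_BYTES == 0); /* chunks stay 8-word aligned */
    writes++;
    words_written += nwords;
    if (lm_byte + nwords * 4 > lm_end)
        lm_end = lm_byte + nwords * 4;  /* highest LM byte touched so far */
}

/* One full write-back pass with byte-address stepping. */
static void write_back_pass(void)
{
    unsigned offset, l_mem;

    offset = 0; /* global registers: LM words 0-63 */
    for (l_mem = 0; l_mem < 64 * 4; l_mem += CHUNK_BYTES) {
        fake_sram_write(offset, l_mem, CHUNK_WORDS);
        offset += CHUNK_BYTES;
    }

    offset = 0; /* fast stats indices 0-127: LM words 128-639 */
    for (l_mem = 128 * 4; l_mem < 128 * 4 + 128 * 16; l_mem += CHUNK_BYTES) {
        fake_sram_write(offset, l_mem, CHUNK_WORDS);
        offset += CHUNK_BYTES;
    }
}
```

Under this reading a full pass is 8 + 64 = 72 chunk writes, ending at byte 2560 (word 640), the end of the plug-in counter region.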


Stats Function Calls
Defined in counter_util.uc:
» _WU_preq_update(reg_num, tx_reg, data, update_sig, error_addr) // +1 & +data
» _WU_preq_register_add(reg_num, tx_reg, update_sig, error_addr) // +1
» _WU_preq_register_add(reg_num, tx_reg, data, update_sig, error_addr) // +data
» _WU_postq_update(reg_num, tx_reg, data, update_sig, error_addr) // +1 & +data
» _WU_postq_register_add(reg_num, tx_reg, update_sig, error_addr) // +1
» _WU_postq_register_add(reg_num, tx_reg, data, update_sig, error_addr) // +data
» _WU_global_register_add(reg_num, tx_reg, update_sig, error_addr) // +1
» _WU_global_register_add(reg_num, tx_reg, data, update_sig, error_addr) // +data
» _WU_global_register_update(reg_num, tx_reg, data, update_sig, error_addr) // +1 & +data


Performance Targets
How many packets must be processed per second?
» To hit a 5 Gb/s rate:
   76B per minimum IPv4 packet (64B minimum Ethernet frame + 12B IFS)
   1.4 GHz clock rate
   5 Gb/sec * 1B/8b * packet/76B = 8.23 Mp/sec
   1.4 Gcycle/sec * 1 sec/8.23 Mp = 170 cycles per packet
   Compute budget: 170 cycles
   Latency budget: threads * 170; with 7 threads: 1190 cycles

How many count requests per packet (typical packet)?
» RX per-port count
» TX per-port count
» Pre-Q stats index
» Post-Q stats index

Total counts = 8.23 Mp/sec * 4 counts/packet = 32.92 Mcounts/sec
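The budget arithmetic above can be reproduced directly; the constants are the slide's inputs and the helper names are only for illustration.

```c
#include <assert.h>

/* Inputs from the slide. */
#define LINE_RATE_BPS  5e9   /* 5 Gb/s */
#define MIN_PKT_BYTES  76.0  /* 64B minimum Ethernet frame + 12B IFS */
#define CLOCK_HZ       1.4e9 /* 1.4 GHz ME clock */
#define COUNT_THREADS  7
#define COUNTS_PER_PKT 4     /* RX, TX, pre-Q, post-Q */

static double packets_per_sec(void)   { return LINE_RATE_BPS / 8.0 / MIN_PKT_BYTES; }
static double cycles_per_packet(void) { return CLOCK_HZ / packets_per_sec(); }
static double latency_budget(void)    { return COUNT_THREADS * cycles_per_packet(); }
static double counts_per_sec(void)    { return packets_per_sec() * COUNTS_PER_PKT; }
```

Unrounded, the packet rate is about 8.22 Mp/s and the count rate about 32.9 Mcounts/s, so the slide's 8.23 and 32.92 are rounded slightly up; the 170-cycle budget is unchanged.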


Stats Block Diagram (flowchart)
» Read Scratch Ring (SCR READ: 60 cycles latency + 2 cycles)
» Decode Opcode (3 cycles)
» Global Register? (4 cycles): Y → LM_ADDR = (index << 2); N → Index > 127?
» Index > 127? (3 cycles): Y → Slow Counter; N → LM_ADDR = 512 + (index << 4)
» +1? (3 cycles): Y → LM_ADDR = *LM_ADDR + 1
» +data? (3 cycles): Y → LM_ADDR++ = *LM_ADDR + data

Worst case (fast path) for Stats Counters: 20 clocks + 60 cycles of latency


Performance Results
Total fast counts:
» Count time is effectively 20 cycles (all 60 cycles of latency are hidden)
» 1400 Mcycles/sec / 20 cycles/count = 70 Mcounts/sec*
» Target is 32.92 Mcounts/sec

Slow counts:
» Count time is about 150 - 60 = 90 cycles (the SRAM latency is not completely hidden)
» 1400/90 ≈ 15.6 Mcounts/sec

SRAM write-back:
» After each count thread has had the chance to run, the write-back thread writes one 8-word block of local memory to SRAM.
» Measured performance is 20 ms for a full write-back (50 updates per second).
» This slows counting by only 19 cycles every 7th count (when the counter is fully loaded), i.e., less than 3 cycles per count on average.

*In simulation, only 17 cycles per count were measured, for >82 Mcounts/sec.
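The throughput figures follow from dividing the 1400 Mcycles/sec clock by the per-count cost; this sketch (helper names hypothetical) reproduces them and the amortized write-back overhead.

```c
#include <assert.h>

#define CLOCK_MHZ 1400.0

/* Millions of counts per second for a given per-count cycle cost. */
static double mcounts_per_sec(double cycles_per_count)
{
    return CLOCK_MHZ / cycles_per_count;
}

/* Write-back cost amortized over counts: 19 cycles every 7th count. */
static double writeback_overhead_per_count(void)
{
    return 19.0 / 7.0;
}
```

With 20 cycles per fast count (70 Mcounts/s) or 17 in simulation (about 82 Mcounts/s), the fast path has roughly twice the 32.92 Mcounts/s target, while slow counts at 90 cycles (about 15.6 Mcounts/s) would not keep up on their own.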


Lookup File locations

Code
» src/applications/ONL_Router/src/freelistMgr/freelistMgr.uc
» Src/library/dataplane/counter_util.uc

Include Paths
» src/applications/ONL_Router/src/dispatch_loop/ONL/
   dl_source.h and dl_source.uc: dl_source() and dl_sink() functions
» Other standard include paths (provided by the Intel SDK)