1
TAGnet
link protocol for generating event-coherent DMA bursts in trigger farms
[Figure labels: input links, fragment buffers, PCI]
Hans Muller, Filipe Vinci dos Santos, Angel Guirao, Francois Bal, Sebastien Gonzalve
CERN ED Electronics
TAGnet is a protocol for the creation of event-coherent DMA transfers between hardware DMA engines of readout buffers and CPUs of a trigger farm. TAGnet interconnects slave DMA modules via twisted pair to a TAGnet scheduler which collects all data requests from CPUs.
[Diagram: three DMA slaves and the Scheduler, interconnected by TAGnet]
Definition: event-coherent DMA: interconnected hardware DMA engines are initiated to send specified event-fragments to one requesting CPU
[Diagram labels: CPU farm, readout network, requests, twisted pair]
TAGnet was developed as part of the LHCb (CERN) HS trigger project, started in February 2002 in collaboration with KIP Heidelberg, for use in the level-1 VELO trigger farm.
2
Features of Event-coherent DMA
• cheap implementation ( twisted pair, FPGA logic, PCI card )
• no worst-case destination buffer as needed for round-robin ( 2 buffers sufficient )
• no problem with crashing farm CPUs ( the farm just has 1 CPU less )
• no problem with very large variations in CPU time per event
• CPU requests new events from scheduler whilst processing previous one
• no spurious arrivals of event fragments: all arrive concatenated in time
• highest possible use of the network raw bandwidth ( hardware timing )
• added functionality via “message-TAGs” ( bufferchecks, common Xon/Xoff )
3
LHCb VELO trigger problem
Eventbuilding I/O requirement 4 kbyte @ 1 MHz ( 4 Gbyte/s):
1. very small payloads produced per link < 40 byte
2. very high frequency ( up to 1.1 MHz )
3. farm of COTS computers ( PCI bus I/O )
Baseline Approach :
1. aggregate 4 -> 1 links for network payloads >= 128 byte and use a minimal overhead format ( footnote 1 )
2. use “hardware eventbuilding” in “memory” ( shared memory farm )
3. use PCI memory access mechanism for memory-memory copy ( physical PCI address for DMA mapping to remote memory )
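The numbers on this slide can be checked with a few lines of arithmetic. The sketch below assumes "4 kbyte" means 4000 byte and treats 40 byte as the upper bound per link; the variable names are illustrative and not from the slides.

```python
# Event-building requirement: 4 kbyte per event at a 1 MHz trigger rate.
EVENT_SIZE_BYTES = 4_000        # assumption: "4 kbyte" taken as 4000 byte
TRIGGER_RATE_HZ = 1_000_000     # 1 MHz
LINK_PAYLOAD_MAX_BYTES = 40     # per-link payload is "< 40 byte"
AGGREGATION = 4                 # 4 -> 1 link aggregation
NETWORK_PAYLOAD_MIN_BYTES = 128

# Total event-building bandwidth in byte/s.
total_bandwidth = EVENT_SIZE_BYTES * TRIGGER_RATE_HZ

# Upper bound of an aggregated payload, and the average per-link payload
# needed for four links to reach the 128 byte network payload.
aggregated_payload_max = AGGREGATION * LINK_PAYLOAD_MAX_BYTES
per_link_needed = NETWORK_PAYLOAD_MIN_BYTES / AGGREGATION

print(f"total bandwidth: {total_bandwidth / 1e9:.1f} Gbyte/s")
print(f"aggregated payload upper bound: {aggregated_payload_max} byte")
print(f"average per-link payload needed for 128 byte: {per_link_needed:.0f} byte")
```

Four links therefore reach the 128 byte target only if they average at least 32 byte each, which is close to the stated per-link upper bound.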
TAGnet: schedule memory-directed DMAs in an “event-coherent” way
32 CPU-cluster prototype at KIP: COTS PCs, 6 Gbit/s network, shared memory
Measured at KIP cluster ( plot curves 1 to 3 ):
(1) hardware DMA via PCI to local memory
(2) hardware DMA via PCI to remote memory
(3) software DMA to local memory
[Plot annotations: > 128 byte payloads, 248 Mbyte/s]
Footnote 1: STF format ( 5 % overhead )
4
What is a TAG
TAG: 64 bits of transfer information
[Figure: 64 bit TAG layout, bit 63 at the left. Fields: class, type, instructions, 12 bit ID / data, 11 bit Buffer Adr., 7 bit Dcount, 7 bit SourceID, 7 bit Info, 7 bit Hamming. One-bit flags: First, Force, Done, Reserved, Reserved, Reserved]
Simplified: Command, CPU Identifier, local DMA buffer addr.
More:
• 4 TAG classes
• 4 TAG types
• 7 bit Done counter
• 7 bit Source Module ID
• 7 bit Coded information
• 7 bit Error correcting code
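The field list above can be turned into a small pack/unpack sketch. The slides give the field widths but not the exact bit positions, so the order below is an assumption. The listed widths add up to 61 bits, and the sketch assumes the remaining 3 bits hold a command code. The 7 bit Hamming field is left at zero here, because its computation is not described.

```python
# (name, width in bits), most significant field first.
# ASSUMPTION: field order and the 3 bit "command" field are illustrative;
# the slides only give the widths of the other fields.
FIELDS = [
    ("tag_class", 2), ("tag_type", 2), ("flags", 6), ("command", 3),
    ("ident", 12), ("buffer_addr", 11), ("dcount", 7),
    ("source_id", 7), ("info", 7), ("hamming", 7),
]
assert sum(width for _, width in FIELDS) == 64


def pack_tag(**values):
    """Pack named field values into one 64 bit integer (missing fields = 0)."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_tag(word):
    """Split a 64 bit integer back into a dict of field values."""
    fields = {}
    shift = 64
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


tag = pack_tag(tag_class=0b11, tag_type=0b00, ident=0x123, buffer_addr=5)
print(f"{tag:016x}", unpack_tag(tag))
```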
5
TAGs on a 16 bit bus
• 64 bit TAGs are transmitted in four 16 bit words followed by 1 idle
• 17th bit (Flag) used to delimit “TAG heartbeats” ( 1111011110.. )
• Error-correcting Hamming code in the last word
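A minimal sketch of the 16 bit transport described on this slide: one 64 bit TAG becomes four data words plus one idle beat, with the 17th bit following the 11110 pattern. It assumes the most significant word is sent first, that the flag is 1 during data words and 0 during the idle beat, and that the idle beat carries zeros. The function names are illustrative.

```python
HEARTBEAT_FLAGS = [1, 1, 1, 1, 0]   # 17th bit over the five beats of one heartbeat


def tag_to_beats(tag):
    """Return five (flag, word16) pairs: four data words, then one idle beat."""
    if not 0 <= tag < (1 << 64):
        raise ValueError("TAG must fit in 64 bits")
    # ASSUMPTION: most significant 16 bit word is transmitted first.
    words = [(tag >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]
    return [(1, word) for word in words] + [(0, 0)]


def beats_to_tag(beats):
    """Rebuild the 64 bit TAG from one heartbeat, checking the flag pattern."""
    if [flag for flag, _ in beats] != HEARTBEAT_FLAGS:
        raise ValueError("flag bits do not form a 11110 heartbeat")
    tag = 0
    for _, word in beats[:4]:
        tag = (tag << 16) | word
    return tag


beats = tag_to_beats(0xC123_4567_89AB_CDEF)
print([(flag, f"{word:04x}") for flag, word in beats])
print(f"{beats_to_tag(beats):016x}")
```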
6
TAGs over narrow links
Logical 64 bit bus -> 16 bit serial link ( CAT5 twisted pair )
3 * 175 Mbit/s + 1 * 25 Mbit/s
[Diagram: LHCb Readout Unit with FPGA, TAG in, TAG out]
7
TAGnet slave
• “Packet Reception” stores all incoming TAGs in a 64 bit bypass register
• “Packet FIFO” only stores TAGs which are directed to the slave
• “Decoding & Execution” takes the desired action
• “DMA-engine” gets loaded with source/destination + starts action
• “Packet-Transmission” copies used TAGs back into the TAGnet ring
simplified block diagram:
IN OUT
VHDL design and synthesis for FPGA
8
TAG heartbeats: transport of 16-bit words in 11110 clock beats
• synchronous “heartbeat” on link 1111011110.. (bit 17)
• Heartbeat is always on, carrying valid or invalid TAGs
• One “heartbeat” consists of 4 words + 1 Idle (= 5 clocks)
• 1st word contains important class/type/command bits
[Timing diagram: words 4 3 2 1 separated by Idle beats; 5 clocks / Hbeat; 25 MHz clock]
Heartbeat on TAGnet links:
• Hbeats contain valid or invalid TAGs
• Max. 5 MHz TAGs
[Diagram: Scheduler and slaves on the TAGnet ring, carrying heartbeats; heartbeat check at the physical link layer]
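The 5 MHz limit follows from the 25 MHz clock and the 5 clocks per heartbeat. The sketch below also shows one way a receiver could check the heartbeat on the flag bit; the checking function is a hypothetical illustration and not the slave's actual logic.

```python
CLOCK_HZ = 25_000_000
CLOCKS_PER_HEARTBEAT = 5            # 4 words + 1 idle
PATTERN = (1, 1, 1, 1, 0)           # flag bit (bit 17) over one heartbeat

max_tag_rate_hz = CLOCK_HZ // CLOCKS_PER_HEARTBEAT   # 5 MHz


def heartbeat_phase(flag_bits):
    """Return the index (0..4) where a heartbeat starts, or None.

    flag_bits is a sequence of sampled flag bits, one per clock. The stream
    is accepted only if it matches the repeating 11110 pattern at some phase.
    At least 5 samples are needed for an unambiguous answer.
    """
    for phase in range(CLOCKS_PER_HEARTBEAT):
        if all(bit == PATTERN[(i - phase) % CLOCKS_PER_HEARTBEAT]
               for i, bit in enumerate(flag_bits)):
            return phase
    return None


print(max_tag_rate_hz)
print(heartbeat_phase([1, 1, 1, 1, 0, 1, 1, 1, 1, 0]))   # aligned stream
print(heartbeat_phase([0, 1, 1, 1, 1, 0, 1, 1, 1, 1]))   # shifted by one clock
print(heartbeat_phase([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]))   # no idle beat: broken link
```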
9
TAGnet components:
• Scheduler ( 1 master )
• N slaves
• TAGnet ring
[Diagram labels, scheduler side: PC, PCI bus, PCI card, FPGA, SDRAM, SLINK, serializer, twisted pair, tags]
[Diagram labels, slave side: readout cards, deserializer, serializer, FPGA (DMA), subevent buffer, memory bus, PCI bus, network interface, readout network]
10
Tag classes
Valid  Consume  TAG class
0      0        invalid, not consumable. These filler TAGs have no other purpose than allowing the scheduler to control the level of usable TAGs. In order to clear the TAGnet ring the scheduler transmits only these TAGs.
0      1        invalid, consumable. These TAGs are freely consumable TAGs which can be used by any TAGnet slave to create valid TAGs at its output.
1      0        valid, not consumable. These TAGs fall into the type of directed scheduler messages, normally created by a TAGnet slave. They contain message information ( like errors ) for the scheduler and hence must not be consumed by other TAGnet slaves.
1      1        valid, consumable. These TAGs fall into the types undirected command, undirected message and directed slave message, and hence contain important scheduler information ( command / address / message ) to be consumed by TAGnet slaves.
[Figure: 64 bit TAG field layout with the class bits highlighted]
11
Tag types
[Figure: 64 bit TAG field layout with the type bits highlighted; annotations: directed / undirected, encoded messages]
TAG type ( only defined for valid TAGs ):
0 0  undirected slave command (C-TAG) of class VALID CONSUMABLE. These TAGs convey command and data to all slaves. These realtime TAGs are the large majority of all TAGs.
0 1  undirected slave message (M-TAG) of class VALID CONSUMABLE. These TAGs send an encoded message to all slaves.
1 0  directed slave message (M-TAG) of class VALID CONSUMABLE. These TAGs send command and data to one slave.
1 1  directed scheduler message (M-TAG) of class VALID NON CONSUMABLE. These TAGs send an encoded message ( error or other ) from a slave to the scheduler.
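The two class bits and two type bits give the 16 TAG flavors. A lookup sketch follows, assuming the type codes 01, 10 and 11 map to the three M-TAG types in the order in which the slide lists them; the function and table names are illustrative.

```python
# Class is given by the (Valid, Consume) bit pair.
TAG_CLASSES = {
    (0, 0): "invalid, not consumable (filler)",
    (0, 1): "invalid, consumable (free for slaves to fill)",
    (1, 0): "valid, not consumable",
    (1, 1): "valid, consumable",
}

# ASSUMPTION: codes 01/10/11 follow the listing order on the slide.
TAG_TYPES = {
    0b00: "undirected slave command (C-TAG)",
    0b01: "undirected slave message (M-TAG)",
    0b10: "directed slave message (M-TAG)",
    0b11: "directed scheduler message (M-TAG)",
}


def expected_class(tag_type):
    """Class bits a valid TAG of this type should carry, per the slide text."""
    # Only directed scheduler messages are valid but not consumable.
    return (1, 0) if tag_type == 0b11 else (1, 1)


def describe(valid, consume, tag_type):
    """Human-readable description; the type is ignored for invalid TAGs."""
    text = TAG_CLASSES[(valid, consume)]
    if not valid:
        return text
    if (valid, consume) != expected_class(tag_type):
        return f"{text}; {TAG_TYPES[tag_type]} (unexpected class for this type)"
    return f"{text}; {TAG_TYPES[tag_type]}"


print(describe(1, 1, 0b00))
print(describe(1, 0, 0b11))
print(describe(0, 0, 0b10))
```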
12
Event-coherent DMA transfer
C-TAGs -> event-coherent DMA
C-TAGs are the vast majority of TAGs. Each C-TAG creates 1 event-coherent DMA burst to a requesting CPU: all DMA-slaves are triggered to load identical Source/Destination in their DMA engines and to transmit their data. Result: a fast succession of subevents to the requester CPU.
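[Diagram, scheduler: CPUx sends Request x over the host PCI bus to the Tagnet Master FPGA ( C-TAG buffer, C-TAG hardware ), which puts a C-TAG on the TAGnet link]
[Diagram, slave: Tagnet Slave FPGA with 64 bit bypass, DMA command execution, event fragment buffer, slave output bus]
[Diagram, data path: RU source buffers -> readout network -> destination CPU]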
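A toy model of one event-coherent burst may help to picture the mechanism: a single C-TAG travels around the ring, every slave sends the fragment stored at the same buffer address to the same CPU, and the scheduler checks the returning TAG. The class and field names are hypothetical, and the use of the done counter ( each slave increments it, the scheduler compares it with the number of slaves ) is an assumption suggested by the scheduler block diagram.

```python
from dataclasses import dataclass


@dataclass
class Slave:
    source_id: int
    buffers: dict   # buffer address -> event fragment (bytes)

    def handle_ctag(self, ctag, network):
        """Send this slave's fragment to the requesting CPU, then pass the TAG on."""
        fragment = self.buffers[ctag["buffer_addr"]]
        network.append((ctag["cpu_id"], self.source_id, fragment))
        ctag["dcount"] += 1   # ASSUMPTION: each slave counts itself as done
        return ctag


def run_ctag(slaves, cpu_id, buffer_addr):
    """Circulate one C-TAG through the ring and return the resulting transfers."""
    ctag = {"cpu_id": cpu_id, "buffer_addr": buffer_addr, "dcount": 0}
    network = []
    for slave in slaves:          # ring order
        ctag = slave.handle_ctag(ctag, network)
    if ctag["dcount"] != len(slaves):
        raise RuntimeError("C-TAG returned with an unexpected done count")
    return network


ring = [Slave(i, {7: f"fragment-{i}".encode()}) for i in range(3)]
for cpu, source, data in run_ctag(ring, cpu_id=5, buffer_addr=7):
    print(cpu, source, data)
```

All three fragments of the event arrive at CPU 5 in ring order, which is the "fast succession of subevents" the slide describes.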
13
Message Tags ( M-TAG )
Message TAGs (M-TAGs) coexist with C-TAGs for messages between slaves and scheduler. Generated by the scheduler software, M-TAGs are not time critical.
[Figure: 64 bit TAG field layout with the command bits highlighted]
Message examples of “Directed 1 Slave” M-TAG:
001  FLUSH ALL: flush all buffers, reset pointers
000  SELECT: select a TAGnet slave operation mode contained in the 6 bit info field
Message examples of “Undirected Slave” M-TAG ( to all slaves ):
001  FLUSH ALL: flush buffers, reset their pointers
000  DIAG: request for Slave Info via M-TAG as specified in Coded INFO field
Message examples of “Directed Scheduler” M-TAG ( slave to scheduler ):
001  THROTTLE request
100  ERROR ( error type encoded in INFO field )
14
Tagnet in shared-memory farms
TAGs may be used for event-coherent event-building in any system. Shared memory: for high rate ( triggers ):
1.) perform high-rate eventbuilding using memory-memory copy ( may require blocksize aggregation )
2.) create TAGs at high rate on CPU demand
Shared memory eventbuilding features:
• DMAs “write-through” to shared CPU memory ( red )
• One event-coherent burst to 1 CPU per C-TAG
• events auto-closed by fixed Nr of event-frames
Shared memory TAGnet features:
• CPUx,y,z send request to memory block ( blue )
• CPUs share 1 single scheduler
[Diagram labels: input links, aggregation buffers, DMA, PCI, NIC, network, S/N bridge, shared memory, CPU farm ( CPUx,y,z ), Scheduler, TAGnet]
15
DMA measurements PCI to PCI over 6 Gbit/s network
Setup: 4 * Slink -> Readout Unit ( PCI 64 @ 66 MHz ) -> NIC -> 6 Gbit/s network -> NIC -> CPU memory ( PCI 32 @ 33 MHz ! )
Readout Unit: Buffer 1 and Buffer 2, read by DMA1 and DMA2 on a C-TAG
Sending side ( 2*DMA -> NIC -> Network, PCI64@66 ):
• DMA1 + DMA2: 2 * 64 byte
• NIC uses PCI write combining
• outgoing 128 byte payload in 200 ns
Receiving side ( Network IN: NIC -> PCI -> Memory ):
• NIC receives 128 byte payload
• PCI burst to memory
[Trace labels: PCI64@66, PCI32@33, Network 6 Gbit/s]
Results:
• 2 MHz E.C. DMA bursts
• Network BW used at 40% with 2 MHz of 128 byte bursts
• Extrapolated ¼ to PCI64@66MHz
16
TAGnet Scheduler
Hardware: FPGA logic in PCI card
• serialize TAGs to twisted pair link ( mezzanine card )
• monitor TAGnet ring alive status ( heart/errorbeats/clock )
• auto-generation of next event-buffer ( default +1 )
• monitor status of outstanding and returning C-TAGs
• timeout for C-TAG return ( programmable via a control register )
• decode errors received via M-TAGs from slaves
• error reporting via interrupts
• accumulation of log-files from returning M-TAGs ( SDRAM buffer )
Software: C-TAG PCI driver, M-TAG control, error handling
• PCI driver ( Linux & W2000 )
• initialize/configure all TAGnet slaves & disable ( throttle ) triggers during setup
• creation of C-TAGs from request table at rates >= trigger rate
• creation of special C-TAGs ( Reset, Align, Flush )
• use M-TAG functions for all setup / monitoring / diagnostic tasks
• read / check log-files from returning M-TAGs ( including error TAGs from slaves )
• routines for interrupt error handling
• regular source buffer verifications / flushing via M-TAGs
17
Scheduler hardware
[Block diagram of the scheduler FPGA. Host side: PC with RAM, host PCI bus, PCI master/slave + config. registers, address decode, IRQ. C-TAG path: CPU request array ( priority encoder ), ADD, next, Make C-TAG, scoreboard of busy network channels, Throttle FF / Throttle. M-TAG path: M-TAG buffer, SDRAM logfiles. Link side: Serializer and De-serializer to the TAGnet slaves, 4 * 16 bit TAG beats, 25 MHz heartbeat, Link HB check, TAGnet ring clear. Return path: TAG type decoder ( M / C ), C-TAG check, M-TAG check, “Done Count = 0 ?”, “Donebit = Max. Nr. Slaves”, clear, error handling]
18
Project status: Scheduler software
[Status labels next to the four items below, top to bottom: DONE, Under Progress, Under Progress, Planned]
C-TAG creation:
• CPU request = 12 bit Identifier
• at 1 MHz trigger rate ( LHCb ) minimum C-TAG request bandwidth is 2 Mbyte/s
• Burst-mode PCI driver: transmit CPU request from memory to scheduler’s buffer @ 1 MHz
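One way to arrive at the 2 Mbyte/s figure is to store each 12 bit CPU identifier in whole bytes. That storage size is an assumption; the slide gives only the identifier width and the resulting bandwidth.

```python
import math

ID_BITS = 12                    # CPU request = 12 bit identifier
TRIGGER_RATE_HZ = 1_000_000     # 1 MHz LHCb trigger rate

# ASSUMPTION: one request occupies a whole number of bytes (2 byte for 12 bit).
bytes_per_request = math.ceil(ID_BITS / 8)
min_request_bandwidth = bytes_per_request * TRIGGER_RATE_HZ   # byte/s

print(f"{bytes_per_request} byte per request")
print(f"{min_request_bandwidth / 1e6:.0f} Mbyte/s minimum request bandwidth")
```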
M-TAG creation:
• assemble any class/type of an M-TAG on user request
• send M-TAG
M-TAG result collection:
• readout of M-TAG logfile from SDRAM
• identify returned M-TAG (Type, ID , Command ) & read result
Error handling:
• PCI interrupt handler
• Interrupt code register PCI
19
C-TAG software loop for 2D shared memory cluster
Scheduler host: one PC of the cluster ( LINUX with shared memory )
Scheduler C-TAG software:
• Requests: CPUs of the 2 D CPU cluster ( Row Nr 0, Row Nr 1, ... Row Nr n ) copy to mapped memory
• Memory scan
• Make 32 bit bitmaps, suppress empty rows
• PCI burst over the PCI bus to the scheduler card
C-TAGs HARDWARE ( PCI card ): FPGA logic, PCI glue, SDRAM Ctrl., SDRAM, Slink32, Slink, serial / de-serial link to the Readout Units ( DMA )
Example, mapped memory:
0 0 1 0 0
0 0 1 0 0
0 0 0 0 0
1 0 0 1 0
0 0 0 0 0
Resulting bitmaps:
00100  Row Nr 0
00100  Row Nr 1
10010  Row Nr 3
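The memory scan can be sketched with the 5 x 5 example from this slide: each row of request bits becomes one bitmap, and rows without requests are dropped. The 32 bit word layout used here ( row number in the upper 8 bits, request bits in the lower 24 ) is an assumption, and the function name is illustrative.

```python
def make_request_words(matrix, row_bits=8, request_bits=24):
    """Turn a 2D request matrix into 32 bit words, one per non-empty row.

    ASSUMED word layout: row number in the upper `row_bits`, request bitmap
    of that row in the lower `request_bits`.
    """
    words = []
    for row_nr, row in enumerate(matrix):
        if len(row) > request_bits or row_nr >= (1 << row_bits):
            raise ValueError("matrix does not fit the assumed word layout")
        bitmap = 0
        for bit in row:                      # leftmost column = highest bit
            bitmap = (bitmap << 1) | (1 if bit else 0)
        if bitmap == 0:
            continue                         # suppress empty rows
        words.append((row_nr << request_bits) | bitmap)
    return words


mapped_memory = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
for word in make_request_words(mapped_memory):
    print(f"row {word >> 24}: {word & 0xFFFFFF:05b}")
```

For the example matrix this yields rows 0, 1 and 3 with the bitmaps 00100, 00100 and 10010, as listed on the slide.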
20
C-TAG loop timing result
Local segment (id=0x80400, size=1024) is created.
Local segment (id=0x80400, size=1024) is created.
Local segment (id=0x80400) is mapped to user space.
The physical address for local segment is :2f6000
Local segment (id=0x80400) is available for remote connections.
Waiting for the DMA transfer to be ready ....
Node 8 received interrupt (0x0)
DMA transfer done!
Client data: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1024 1024 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
…
Detecting Orca on 2:f....[ OK ]
Physical address = 2f6000
duration for the writing of 16*1000 32bits WORDS : 22843 usec
Emulation of request loop for a 16 * 16 farm on an “old PC” ( PCI bus 32 bit 33 MHz ):
• Make request blocks and send to PCI: PCI bursts, 16 * 32 bit
• [Word layout figure: 32 bit word, bits 31 to 0, with a Row-Nr field and the request bits of the CPUs in that row; bit 23 marked as a boundary]
• PCI bus activity: mixed bursts and single words
Measured Xfer to PCI:
• 1.4 µs for up to 24 CPU request bits
• Safe to say that >> 1 MHz applies for a faster PC with 64 bit 66 MHz PCI
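The 1.4 µs figure can be reproduced from the log line above. The request rate in the last step is an upper bound that assumes every 32 bit word carries 24 set request bits; that reading of the word layout is an assumption.

```python
WORDS_WRITTEN = 16 * 1000          # "16*1000 32bits WORDS" from the log
DURATION_US = 22843                # measured duration in microseconds
MAX_REQUEST_BITS_PER_WORD = 24     # "up to 24 CPU request bits"

us_per_word = DURATION_US / WORDS_WRITTEN
words_per_second = 1e6 / us_per_word
# Upper bound: every word fully populated with requests.
max_requests_per_second = words_per_second * MAX_REQUEST_BITS_PER_WORD

print(f"{us_per_word:.2f} us per 32 bit word")
print(f"{words_per_second / 1e6:.2f} M words/s")
print(f"up to {max_requests_per_second / 1e6:.1f} M requests/s")
```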
21
Summary
• TAGnet is a 64 bit protocol which sends TAGs at up to 5 MHz rate in a ring of DMA slaves
• Interconnection within a TAGnet ring is based on twisted pair (CAT5 )
• C-TAGs organize event-coherent DMA transfers on CPU demand
• M-TAGs serve for initialization, error reporting and control
• 4 TAGnet classes and 4 TAGnet types ( 16 flavors )
• TAGnet scheduler is a PCI card which receives CPU requests
• First experimental TAGnet slave implementation in LHCb Readout Unit ( FPGA )
• First experimental TAGnet master implementation via programmable PCI-FLIC card
• software loop “CPU-requests to scheduler” demonstrated to work at more than 1 MHz
• successive “event-coherent DMA” measured at rates up to 2 MHz for 128 byte payloads
22
PCI card with Tagnet mezzanine
FLIC card ( EP-ED ): FPGA, SDRAM, 64 bit PCI @ 66 MHz
• low-cost FPGA card
• very fast host bus IF
• 64 Mbyte SDRAM
• drivers for Linux/Windows
• programmable Slink IF
Slink I/O card ( EP-ED ), interfaced via Slink connector:
• reprogrammed for TAGnet
• 32 bit Slink connector
• RJ45 standard network link
• TAGnet IN, TAGnet OUT
23
TAGnet on LHCb Readout Unit
[Photo labels: TAGnet, input links ( 4 * Slink ), subevent buffer, dual DMA engines, networked embedded CPU, readout network]