TRANSCRIPT
RTC_4_AO workshop, Paris 2016
Low-latency data acquisition to GPUs using FPGA-based 3rd party devices
Denis Perret, LESIA / Observatoire de Paris
RTC for ELT AO
[Diagram: Sensors -> 40 GbE network (high bandwidth, low latency) -> Hard Real-Time pipeline -> Deformable Mirror; telemetry -> 40 GbE network (high bandwidth) -> Soft Real-Time pipeline]
Using FPGA to reduce acquisition latency
~ Interface: 10 GbE, low-latency acquisition interface
~ Without Direct Memory Access to the GPU RAM: multiple data copies = added latency
[Diagram: 10 GbE NIC -> PCIe -> Host RAM / Host CPU -> PCIe -> GPU RAM / GPU]
Using FPGA to reduce acquisition latency
~ Interface: 10 GbE, low-latency acquisition interface
~ With Direct Memory Access to the GPU RAM: single data transfer = low latency, maximum bandwidth
[Diagram: 10 GbE NIC -> PCIe -> GPU RAM / GPU, bypassing Host RAM and Host CPU]
No PCIe P2P at all.
[Diagram: encapsulated data arrive at the NIC and cross PCIe into Host RAM; the Host CPU app handles the interrupts and the data decapsulation; GPU / GPU RAM not involved yet]
~ The NIC DMA engine sends data to host memory and raises an interrupt to the CPU when the buffer is (half) full.
~ The CPU suspends its tasks and saves its current state (context switch).
~ The CPU decapsulates the data (Ethernet, TCP/UDP, GigE Vision…).
No PCIe P2P at all.
~ The CPU builds a CUDA stream containing GPU operations such as kernel launches and memory transfer orders.
~ The stream is sent to the GPU.
~ The pixels are copied to GPU RAM (GPU DMA, a read over PCIe); see the sketch after the diagram below.
[Diagram: the Host CPU app initiates the data and operations stream transfer; decapsulated data move from Host RAM over PCIe to GPU RAM]
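For reference, a minimal CUDA sketch of this conventional path; the frame size, kernel and buffer names are illustrative, not the actual RTC code:

    // Minimal sketch of the conventional (no-P2P) path described above.
    // 'process_pixels' and the 240x240 frame size are illustrative.
    #include <cuda_runtime.h>

    #define FRAME_PIXELS (240 * 240)
    #define FRAME_BYTES  (FRAME_PIXELS * sizeof(unsigned short))

    __global__ void process_pixels(const unsigned short *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = (float)in[i];   // placeholder per-pixel work
    }

    int main()
    {
        unsigned short *h_frame, *d_frame;  // pinned host buffer / device copy
        float *d_out;
        cudaStream_t stream;

        cudaHostAlloc((void **)&h_frame, FRAME_BYTES, cudaHostAllocDefault);
        cudaMalloc((void **)&d_frame, FRAME_BYTES);
        cudaMalloc((void **)&d_out, FRAME_PIXELS * sizeof(float));
        cudaStreamCreate(&stream);

        // ... NIC DMA + CPU decapsulation have filled h_frame at this point ...

        // The stream carries the copy order and the kernel launch; the GPU
        // DMA engine reads h_frame over PCIe (a second bus traversal).
        cudaMemcpyAsync(d_frame, h_frame, FRAME_BYTES,
                        cudaMemcpyHostToDevice, stream);
        process_pixels<<<(FRAME_PIXELS + 255) / 256, 256, 0, stream>>>
                      (d_frame, d_out, FRAME_PIXELS);
        cudaStreamSynchronize(stream);
        return 0;
    }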
No PCIe P2P at all.
~ The CPU waits for the computation to end (or does something else; see the sketch after the diagram below).
~ The results are sent back to CPU RAM.
~ The synchronization mechanisms between the GPU and the CPU are hidden (interrupts…).
[Diagram: computation runs on the GPU and results move back over PCIe to Host RAM; the Host CPU app does something else and gets interrupted periodically]
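Continuing the sketch above (reusing its stream), a hedged example of how the CPU can "do something else" while the GPU works: it polls a CUDA event instead of blocking; do_other_cpu_work is a placeholder.

    // Non-blocking completion check. cudaStreamSynchronize() would instead
    // block and hide the interrupt-based wait entirely; polling the event
    // lets the CPU interleave other work.
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
    cudaEventRecord(done, stream);            // recorded after the last GPU op

    while (cudaEventQuery(done) == cudaErrorNotReady)
        do_other_cpu_work();                  // placeholder for "something else"

    // The results can now be consumed (e.g. queued earlier as a D2H
    // cudaMemcpyAsync in the same stream, before the event).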
No PCIe P2P at all.
~ The CPU encapsulates the results (Ethernet, TCP/UDP, …).
~ The CPU initiates the transfer by configuring and launching the NIC DMA engine (a read over PCIe); a sketch of this send path follows the diagram below.
[Diagram: the Host CPU app encapsulates the data and initiates the data transfer; encapsulated results move from Host RAM over PCIe to the NIC]
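With a standard NIC this encapsulate-and-send step is usually just the OS socket path. A minimal sketch, with an illustrative destination address and port: sendto() makes the kernel build the UDP/IP/Ethernet headers and program the NIC DMA engine.

    // Result send path with a standard NIC: sendto() triggers header
    // encapsulation in the network stack and a NIC DMA read of the buffer.
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <unistd.h>

    static int send_results(const void *results, size_t len)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in dst{};                            // zero-initialized
        dst.sin_family = AF_INET;
        dst.sin_port   = htons(5000);                        // illustrative port
        inet_pton(AF_INET, "192.168.1.10", &dst.sin_addr);   // e.g. DM controller
        ssize_t n = sendto(fd, results, len, 0,
                           (struct sockaddr *)&dst, sizeof dst);
        close(fd);
        return n == (ssize_t)len ? 0 : -1;
    }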
With PCIe P2P and Custom NIC
~ The CPU gets the address and size of a dedicated GPU buffer and uses them to configure the NIC DMA engine.
~ The data are written directly to the GPU memory.
~ The GPU continuously detects new data by polling its own memory (busy loop), performs the computation and fills a local buffer with the results; see the polling-kernel sketch after the diagram below.
[Diagram: the custom NIC (TOE) decapsulates the data and writes them over PCIe straight into GPU RAM, where a polling kernel consumes them; the Host CPU app is minding its own business]
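A minimal sketch of such a polling (persistent) kernel. It assumes the FPGA DMA writes each frame into a GPU-resident ring buffer and then increments a write counter, the buffer's bus address having been handed to the NIC beforehand (e.g. via GPUDirect RDMA / nvidia_p2p_get_pages in a small kernel driver). Names and sizes are illustrative, and memory-ordering details are glossed over.

    // Persistent polling kernel: one block spins on a counter the FPGA
    // increments after each frame DMA. 'volatile' forces the loads to go to
    // memory so the busy loop observes the FPGA's writes.
    #define SLOTS        4
    #define FRAME_PIXELS (240 * 240)

    __global__ void polling_kernel(volatile unsigned long long *wr_count,
                                   const volatile unsigned short *ring,
                                   float *results)
    {
        unsigned long long seen = 0;
        while (true) {                          // runs for the whole session
            if (threadIdx.x == 0)
                while (*wr_count == seen) {}    // busy loop until a frame lands
            __syncthreads();                    // release all threads together
            seen++;
            const volatile unsigned short *frame =
                ring + (seen % SLOTS) * FRAME_PIXELS;
            // ... per-frame computation on 'frame', output into 'results' ...
            __syncthreads();                    // finish before polling again
        }
    }

    // Host side: launched once, before the acquisition starts, e.g.
    //   polling_kernel<<<1, 256>>>(d_count, d_ring, d_results);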
Getting back the results: 1st way
~ The CPU gets notified, in a hidden way, that the computation has finished.
~ The CPU initiates the transfer from GPU to FPGA (FPGA DMA engine -> a read over PCIe); a host-side sketch follows the diagram below.
[Diagram: the custom NIC's DMA engine reads the results from GPU RAM over PCIe and encapsulates them; the polling kernel keeps running; the Host CPU app is still minding its own business]
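A hedged host-side sketch of this first way: the FPGA's PCIe BAR is mapped through sysfs and its DMA engine is kicked by register writes. The register offsets and PCI address are hypothetical; a real design uses the FPGA's actual register map. The GPU buffer's bus address is assumed to come from the GPUDirect setup.

    // Way 1: the CPU programs the FPGA DMA engine to *read* the results out
    // of GPU RAM over PCIe. REG_* offsets and the PCI address are placeholders.
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <stdint.h>
    #include <unistd.h>

    enum { REG_SRC_LO = 0x00, REG_SRC_HI = 0x04,
           REG_LEN    = 0x08, REG_GO     = 0x0C };  // hypothetical register map

    static void start_fpga_readback(uint64_t gpu_bus_addr, uint32_t len)
    {
        int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR);
        volatile uint32_t *bar = (volatile uint32_t *)
            mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        bar[REG_SRC_LO / 4] = (uint32_t)gpu_bus_addr;
        bar[REG_SRC_HI / 4] = (uint32_t)(gpu_bus_addr >> 32);
        bar[REG_LEN / 4]    = len;
        bar[REG_GO / 4]     = 1;    // FPGA DMA engine starts the PCIe reads

        munmap((void *)bar, 4096);
        close(fd);
    }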
Getting back the results: 2nd way
~ The GPU sends the data directly by writing to the NIC's addressable space (GPU DMA -> a write over PCIe); a sketch follows the diagram below.
[Diagram: the GPU writes the results over PCIe into the custom NIC, which encapsulates them; the Host CPU app is still minding its own business]
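A hedged sketch of this second way: the FPGA BAR (mapped as in the previous sketch) is registered with CUDA as I/O memory, so a kernel can write results straight into the NIC. cudaHostRegisterIoMemory is the real CUDA flag for MMIO regions; the BAR size and the FIFO register are illustrative.

    // Way 2 setup: expose the FPGA BAR to the GPU so a kernel can write
    // results into the NIC directly (GPU DMA write transactions over PCIe).
    void *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    cudaHostRegister(bar, 4096,
                     cudaHostRegisterMapped | cudaHostRegisterIoMemory);

    unsigned int *d_bar;                      // device-visible alias of the BAR
    cudaHostGetDevicePointer((void **)&d_bar, bar, 0);

    // 'd_bar' is then passed to the polling kernel, which can finish a frame
    // with writes like:
    //     d_bar[RESULT_FIFO / 4] = result_word;   // hypothetical FIFO register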
Getting back the results: 3rd way
~ The GPU notifies the FPGA by writing a flag into the FPGA's addressable space.
~ The FPGA gets the data back (FPGA DMA engine -> a read over PCIe); a doorbell sketch follows the diagram below.
[Diagram: the GPU writes a flag into the custom NIC, which then reads the results from GPU RAM over PCIe and encapsulates them; the Host CPU app is still minding its own business]
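A short hedged sketch of this third way, reusing the d_bar mapping from the previous sketch; the doorbell offset is hypothetical. __threadfence_system() is the CUDA fence that makes the result writes visible to the rest of the system before the flag.

    // Way 3: the polling kernel ends a frame by ringing a doorbell in the
    // FPGA's address space; the FPGA then DMAs the results out of GPU RAM.
    __device__ void ring_doorbell(volatile unsigned int *d_bar)
    {
        __threadfence_system();   // order the result writes before the flag
        if (threadIdx.x == 0)
            d_bar[0] = 1;         // hypothetical doorbell register at offset 0
    }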
Using FPGA to reduce acquisition latency: First tests
~ Stratix V PCIe development board from PLDA (+ QuickPCIe, QuickUDP IP cores): 42 Gb/s demonstrated from board to GPU; 8.8 Gb/s per 10 GbE link in loopback mode
~ 10 GbE camera from Emergent Vision Technologies (8.9 Gb/s to GPU memory), GigE Vision protocol, 1.5 kFPS at 240x240 pixels coded on 10 bits (360 FPS at 2k x 1k on 8 bits or 1k x 1k on 10 bits)
Using FPGA to reduce acquisition latency
[Diagram of the test design: CUSTOM_NIC with two PHY/UDP front-ends, demux/decapsulation, vision protocol handling, a data generator, latency measurement, a DM controller (DMC) and several DMA engines behind a PCIe 3.0 interface; pixels, camera commands and answers flow through it; on the GPU: a pixels ring buffer, a polling kernel, computation kernels and a DM com buffer; a CPU app handles camera control; host RAM provides a loopback path; outputs: latency measurements and deformable mirror commands]
P2P from FPGA to GPU (way back still launched by the CPU)
[Plot: x-axis 1 to 4093, y-axis 600 to 1200; traces: no p2p & computation, p2p & computation, no p2p & no computation, p2p & no computation]
Interrupts vs polling (on the CPU)
[Diagram of the test setup: PC A and PC B connected over 10 GbE; a PLDA "QuickPlay" board (UDP0/UDP1 ports, DDR, QPCIE DMA, BAR2; HDL kernels ImageGen and LatencyMeas, C kernel CoG (or not)) and a non-QuickPlay PLDA board; a GPU board with GPU RAM running the polling kernel and computing kernel(s), all over PCIe]
Interrupts vs polling (on the CPU)
[Plot: x-axis 9300 to 9650, y-axis -100 to 700; traces: Optimized Interrupt + "stress -c 8", Memory polling, Non-Optimized Interrupt, Non-Optimized Interrupt + "stress -c 8"]
Interrupts vs polling (on the CPU)
[Plot: x-axis 9330 to 9360, y-axis -100 to 700; traces: Polling + isolated CPU, Polling + non-isolated CPU, Optimized Interrupt]
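For reference, a hedged sketch of the CPU-side memory polling these plots compare: the polling thread is pinned to a core isolated from the scheduler (e.g. booted with isolcpus), and spins on a word the NIC DMA writes last. Names are illustrative.

    // CPU memory polling pinned to an isolated core: no interrupt, no context
    // switch, hence the much tighter latency distribution measured above.
    #define _GNU_SOURCE
    #include <sched.h>

    static void poll_on_core(volatile unsigned *flag, int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        sched_setaffinity(0, sizeof set, &set);  // pin the calling thread

        unsigned seen = *flag;
        while (*flag == seen) {}                 // busy loop on the DMA'd word
        // new data has landed; handle it with deterministic latency
    }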
Using FPGA to reduce the load on GPU/PCIe/network…
[Diagram: N*N subimages dispatched over N parallel dual-port RAM -> CoG -> FIFO pipelines, DMA over PCIe (read direction), GPU RAM]
~ FPGAs are well suited for on-the-fly decapsulation, data rearrangement and highly parallelized computation; a reference CoG sketch follows this list.
~ Less load on the PCIe bus, smaller buffers in GPU RAM, less latency.
~ Could be done in the WFS, so we lower the load on the network.
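For reference, the per-subaperture computation those FPGA pipelines implement is a centre of gravity (CoG). A minimal CUDA equivalent, one thread per N x N subimage, with illustrative geometry:

    // Centre of gravity per subaperture, as the FPGA pipelines compute it:
    // slope = (sum(x*I), sum(y*I)) / sum(I) over each N x N subimage.
    __global__ void cog_kernel(const unsigned short *subimgs, float2 *slopes,
                               int n /* subaperture side */, int count)
    {
        int s = blockIdx.x * blockDim.x + threadIdx.x;
        if (s >= count) return;

        const unsigned short *img = subimgs + s * n * n;
        float sum = 0.f, sx = 0.f, sy = 0.f;
        for (int y = 0; y < n; y++)
            for (int x = 0; x < n; x++) {
                float v = (float)img[y * n + x];
                sum += v; sx += x * v; sy += y * v;
            }
        slopes[s] = (sum > 0.f) ? make_float2(sx / sum, sy / sum)
                                : make_float2(0.f, 0.f);
    }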
…or even do everything with FPGAs
~ Exploring the possibilities with high-level development environments:
~ Vendor-specific HLS: Xilinx Vivado, …
~ QuickPlay: based on KPN (Kahn process networks); has its own HLS and will be able to use other HLS tools; PCIe P2P integration is planned.
~ OpenCL: available for FPGAs (Altera), CPUs and GPUs. Can now handle network protocols: Altera introduced I/O channels allowing kernels to read and write network streams defined by the board designer. P2P over PCIe should be possible (synchronization…?). Integration of custom HDL blocks?
~ MATLAB to HDL: did someone try it? Maybe useful to help develop IPs.
-> Interactions with HPC tools such as MPI, DDS and CORBA are quite challenging