accelerated data processing on soc with fpga · marek vasut i software engineer at denx s.e. since...

31
Accelerated Data Processing on SoC with FPGA Marek Vaˇ sut <[email protected]> June 3, 2015 Marek Vaˇ sut <[email protected]> Accelerated Data Processing on SoC with FPGA

Upload: others

Post on 30-Apr-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Accelerated Data Processing on SoC with FPGA

Marek Vasut <[email protected]>

June 3, 2015

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 2: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Marek Vasut

I Software engineer at DENX S.E. since 2011I Embedded and Real-Time Systems Services, Linux kernel and

driver development, U-Boot development, consulting, training.

I Versatile Linux kernel hacker

I Custodian at U-Boot bootloader

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 3: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Structure of the talk

I Motivation

I Introduction to FPGAs

I Your first FPGA data cruncher

I Interfacing with Linux

I Speeding things up

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 4: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Why listen to this talk

I Get fresh ideas

I Learn something new

I Reduce energy envelope of your device

I Process data quickly and efficiently

You won’t learn marketing stuff or random benchmark numbers

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 5: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

FPGA

I Abbr. for Field Programmable Gate Array

I Programmable logicI Usually used for:

I Digital Signal Processing (DSP)I Data crunchingI Custom hardware interfacesI ASIC prototypingI . . .

I Common vendors – Xilinx, Altera, Lattice, Microsemi. . .

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 6: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Internal structure

W.T.Freemanhttp://www.vision.caltech.edu/CNS248/Fpga/fpga1a.gif

CC BY 2.5: http://creativecommons.org/licenses/by/2.5/

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 7: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

FPGA and the outside

I FPGA has plenty of I/O options:I Regular I/O with configurable voltage levelsI Differential I/OI High-speed SerDesI . . .

I Usual interface with host:I Stand-alone FPGA, usually PCIe, USB, . . .I FPGA on a CPU bus (PowerPCs, ie. ML507)I Built into CPU (SoCFPGA/Zynq), usually AMBA/AXI

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 8: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Programming the FPGA

I Each vendor has his own tools – Altera Quartus, Xilinx Vivado

I FPGA tools often closed source :-(

I FPGA bitstream format is closed :-(

I Basic vendor tools available free of charge

I Sufficient amount of functionality to implement data cruncher

I Vendor tools needed for place-and-route and assembler

I Third-party tools for synthesis are available

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 9: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Comparison to a GPU – I.

CPU GPU FPGA

Toolchain Open Closed ClosedHW design Proprietary Proprietary Your ownHW units Fixed Fixed As neededI/O Limited None As needed

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 10: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

HDL – Hardware Description Language

I FPGA content is written in HDLs

I HDL – Hardware Description Language

I HDLs are used to model behavior of logic block

I Two major HDLs – VHDL and Verilog

I Tools often allow seamless mixing of HDLs

I Many readily-available cores under acceptable license:OpenCores http://opencores.org/

OpenCores projects http://opencores.org/projects

CERN Open HW Repo http://www.ohwr.org/

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 11: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Modeling behavior

HW Behavior modeling vs. Writing CPU code:

I Vastly different and confusing to software people :-)

I CPU: Programmer implements an algorithm

I FPGA: Programmer implements hardware to run the algorithm

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 12: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Implicit parallelism

I Everything in a block is executed in parallel

I All conditions in a conditional statement are tested in parallelif, case – differs from C

1 if (foo == 1) bar <= 1’b0;

2 else bar <= 1’b1;

I Blocks are executed in parallel

1 begin

2 x <= 1’b0;

3 y <= 1’b1;

4 end

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 13: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Combinatorial vs. Sequential logic

I Combo – imm. value of var is the product of the imm. inputsof the function:

assign Z = X ^ Y;

I Seq logic is sync to clock (involves a latch)

always @(posedge clk)

Z <= DAT;

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 14: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Verilog example

I Looks like C, based on C, but behaves differently

I Used a lot in Europe

I Example: CRC5, polynomial x5 + x2 + x0

I Example modified from:http://www.asic-world.com/examples/verilog/

serial_crc.html

1 module crc5 (

2 /* SYSTEM I/O */

3 input reset,

4 input clk,

5 /* CRC5 I/O */

6 input data,

7 output reg [4:0] crc

8 );

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 15: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Verilog example II

1 always @(posedge clk) begin

2 if (reset) begin

3 crc <= 5’b00000;

4 end else begin

5 crc[0] <= data ^ crc[4];

6 crc[1] <= crc[0];

7 crc[2] <= crc[1] ^ data ^ crc[4];

8 crc[3] <= crc[2];

9 crc[4] <= crc[3];

10 end

11 end

12 endmodule

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 16: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

VHDL example

I Distinctive syntax based on Ada

I More explicit typing system than Verilog

I Used a lot in the USA

I Example: CRC5, polynomial x5 + x2 + x0

I Example from http://outputlogic.com/?page_id=321

1 library ieee;

2 use ieee.std_logic_1164.all;

3

4 entity crc is

5 port ( data_in : in std_logic_vector (0 downto 0);

6 rst, clk : in std_logic;

7 crc_out : out std_logic_vector (4 downto 0));

8 end crc;

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 17: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

VHDL example II

1 architecture imp_crc of crc is

2 signal lfsr_q: std_logic_vector (4 downto 0);

3 signal lfsr_c: std_logic_vector (4 downto 0);

4 begin

5 crc_out <= lfsr_q;

6 lfsr_c(0) <= lfsr_q(4) xor data_in(0);

7 lfsr_c(1) <= lfsr_q(0);

8 lfsr_c(2) <= lfsr_q(1) xor lfsr_q(4) xor data_in(0);

9 lfsr_c(3) <= lfsr_q(2);

10 lfsr_c(4) <= lfsr_q(3);

11

12 process (clk,rst) begin

13 if (rst = ’1’) then

14 lfsr_q <= b"11111";

15 elsif (clk’EVENT and clk = ’1’) then

16 lfsr_q <= lfsr_c;

17 end if;

18 end process;

19 end architecture imp_crc;

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 18: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Comparison to a GPU – II.

CPU GPU FPGA

Languages All OpenCL, CUDA OpenCL, HDLsDesign paradigm Sequential Seq/Par ParallelDesign granularity Instruction Instruction GateOpt. possibility Low Low HighOpt. difficulty Low Low High

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 19: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Development and debugging

I Simulation (on developer’s system)

I Probing (on-target)

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 20: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Simulation

I Simulation tools:Icarus Verilog http://iverilog.icarus.com/

ghdl http://home.gna.org/ghdl/

ModelSim http://en.wikipedia.org/wiki/ModelSim/

I Write testcase for a module in an augmented HDL

I Execute testcase

I Observe results

I View waveformsI Decode and inspect bussesI Trigger on complex conditionsI . . .

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 21: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Probing

I Used to observe design on target

I Think of this as a bus analyzer in the FPGA

I Probing tools (ie. SignalTap)

I Design is augmented with a probing IP, FPGA isreprogrammed

I Probing is controlled through a debug probe attached to theFPGA (JTAG or similar)

I Probe internal signals, observe waveforms, trigger on complexconditions. . .

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 22: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Structuring the design

I HDL files – lowest in the hierarchy

I IP block – collection of HDL files with an interface

I FPGA design – collection of IP blocks

I Vendor tools contain tools to assemble IP blocks into FPGAdesign – ie. Altera QSys.

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 23: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Comparison to a GPU – III.

CPU GPU FPGA

Simulation QEMU ? Icarus, ModelSimDebugger GDB CUDA-GDB, CodeXL SignalTap

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 24: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Linux interface

I No standard in-kernel FPGA interface due to variance ofdesigns

I Attempts do exist:I Device Tree Overlay(s) stored in FPGAI SDB –

http://www.ohwr.org/projects/fpga-config-space

I Usually there are control registers in the FPGA design

I Usually the DMA is involved (either on FPGA or CPU side)I Two options for controlling the FPGA:

I Custom Linux kernel driverI Userspace utility

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 25: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Custom kernel driver

I Driver written to match the particular FPGA bitstream

I Driver can crash the host machine if written badly :-(

I Driver usually exports custom userland I/O

I splice(2)

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 26: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Userland approach

I Userland accesses the FPGA registers via uio

I The uio is like a restricted devmem

I In case DMA is involved, kernel module to prepare the datafor the DMA (ie. assure cache coherency) is needed.

I CMA might be used to export large slab of custom kernelmemory to user

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 27: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Performance tuning

I FPGA is clocked at 50...200MHz, not much

I The fabric is usually rated at much more!

I Synthesize PLL, which generates faster clock

I Clock your design from the PLL

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 28: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Design tricks

I Use combo logic where possible

I Create pipelined designs and make sure the pipeline issaturated

I Synthesize multiple units and compute in parallel

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 29: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Altera OpenCL

I OpenCL EP 1.0 implementation for Altera FPGAs

I Easier for SW developers

I Only thin shim must be ported to the FPGA

I Closed source compiler :-(

I Even needs a license :-C

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 30: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

Conclusion

I FPGAs are strong in parallel, pipelined workloads

I FPGAs give the user almost gate-level performance

I FPGAs are manufactured using bleeding-edge process

I FPGAs deploy excellent Performance-per-Watt

I There is no simple unified Linux interface

I Developing an FPGA content can be difficult

I The FPGA ecosystem is still rather closed

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA

Page 31: Accelerated Data Processing on SoC with FPGA · Marek Vasut I Software engineer at DENX S.E. since 2011 I Embedded and Real-Time Systems Services, Linux kernel and driver development,

The End

Thank you for your attention!

Contact: Marek Vasut <[email protected]>

Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA