OpenCL for FPGAs
Dmitry Denisenko
Engineer, High-Level Design Team
Intel Programmable Solutions Group
Outline
(40 min)
Introduction to FPGAs
Brief introduction to OpenCL programming language
(10 min break)
(40 min)
How OpenCL concepts map to FPGA hardware
Examples of architectures for high-performance applications
Basics of Programmable Logic
FPGA Architecture
Why Programmable Logic?
Solution: Replace External Devices with Programmable Logic
[Figure: a board built from separate Flash, SDRAM, simple CPU, DSP, and I/O devices, next to a single FPGA that integrates the CPU, DSP, and I/O functions]
Programmable Logic is Found Everywhere!
Markets: Communications; Broadcast; Consumer; Automotive; Test, Measurement, & Medical; Military & Industrial; Computer & Storage
Segments: Wireless; Networking; Wireline; Entertainment; Broadcast; Automotive; Instrumentation; Military; Security & Energy Management; Computers; Storage; Office Automation
Example applications: Cellular; Basestations; Wireless LAN; Switches; Routers; Optical; Metro; Access; Broadband; Audio/video; Video display; Studio; Satellite; Broadcasting; Medical; Test equipment; Manufacturing; Card readers; Control systems; ATM; Navigation; Entertainment; Secure comm.; Radar; Guidance and control; Servers; Mainframe; RAID; SAN; Copiers; Printers; MFP
Agenda
FPGA Architecture
Design Methodology and Software
FPGA Logic blocks
FPGA logic is made up of Logic Elements (LEs) or Adaptive Logic Modules (ALMs)
Lookup Tables (LUTs)
Combinational functions created with programmed “tables” (cascaded multiplexers)
LUT inputs are mux select lines
[Figure: 4-input LUT with inputs A, B, C, D. Example function: X = AB + ABCD + ABCD. Programmed levels (EEPROM or SRAM): 0 1 1 1 0 1 0 0 0 1 0 0 0 0 0 1 = x9889]
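The lookup-table idea can be sketched in a few lines of C. This is only an illustration under stated assumptions: the function names (`lut4_program`, `lut4_eval`, `example_fn`), the bit ordering (input A is the least significant select bit), and the example function X = AB + CD are all hypothetical and are not taken from the slide.

```c
#include <stdint.h>

/* Any 4-input, 1-output Boolean function. */
typedef unsigned (*bool_fn4)(unsigned a, unsigned b, unsigned c, unsigned d);

/* "Program" the LUT: store the function's output for all 16 input
 * combinations. Bit i of the mask is the output for the inputs whose
 * select value is i (assumed ordering: a = bit 0, ..., d = bit 3). */
uint16_t lut4_program(bool_fn4 fn)
{
    uint16_t mask = 0;
    for (unsigned i = 0; i < 16; ++i) {
        unsigned a = i & 1u;
        unsigned b = (i >> 1) & 1u;
        unsigned c = (i >> 2) & 1u;
        unsigned d = (i >> 3) & 1u;
        if (fn(a, b, c, d) & 1u)
            mask |= (uint16_t)(1u << i);
    }
    return mask;
}

/* Evaluate the LUT: the inputs act as mux select lines that pick one
 * of the 16 programmed levels. No logic is computed at this point. */
unsigned lut4_eval(uint16_t mask, unsigned a, unsigned b, unsigned c, unsigned d)
{
    unsigned select = ((d & 1u) << 3) | ((c & 1u) << 2) | ((b & 1u) << 1) | (a & 1u);
    return (mask >> select) & 1u;
}

/* Hypothetical example function (not the one on the slide): X = AB + CD. */
unsigned example_fn(unsigned a, unsigned b, unsigned c, unsigned d)
{
    return (a & b) | (c & d);
}
```

With this ordering, `lut4_program(example_fn)` yields the 16-bit pattern that `lut4_eval` then indexes. Any other 4-input function would use the same evaluation hardware with a different pattern.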
Programmable register
Clock typically driven by global clock
Asynchronous control through other logic or I/O
Feedback into LUT
Bypass register or LUT
Carry and Register Chains
Chain carry bits between LEs
Register outputs can chain to other LE registers in LAB to form LUT-independent shift registers
Register Packing
Separate outputs from LUT and register create two outputs from one LE
Saves device resources
LABs and LEs: A Closer Look
[Figure: LE detail showing the LUT & carry logic and the register]
Adaptive Logic Modules (ALM)
Based on LE, but includes dedicated resources & adaptive LUT (ALUT)
Improves performance and resource utilization
[Figure: ALM block diagram with eight ALM inputs (1 to 8) feeding an adaptive LUT, two adders, and two registers]
Field Programmable Gate Array (FPGA)
LABs arranged in an array
Row and column programmable interconnect
Interconnect may span all or part of the array
[Figure: array of LABs with row interconnect, column interconnect, and segmented interconnects]
FPGA Routing
All device resources can feed into or be fed by any routing in device
Differing fixed lengths to adjust for timing
Scales linearly as density increases
Local interconnect
- Connects between LEs or ALMs within a LAB
- Can include direct connections between adjacent LABs
Row and column interconnect
- Fixed length routing segments
- Span a number of LABs or entire device
FPGA Embedded Memory
Memory blocks
- Create on-chip memory structures to support design
- Single/dual-port RAM
- ROM
- Shift registers or FIFO buffers
- Initialize RAM or ROM contents on power-on
Memory LABs (MLABs)
Typical sizes: a single memory block is 20 Kbits; one MLAB is 640 bits.
[Figure: embedded memory block with two ports, A and B, each with ADDR, DATAIN, WE, and CLK inputs and a DATAOUT output]
DSP Block
Useful for DSP functions
High-performance multiply/add/accumulate operations
New in latest FPGA family (Arria 10): 32-bit Floating Point Support
Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add.
Just like LABs, DSPs can be chained to implement large dot products.
[Figure: floating-point DSP block. Inputs ax[31:0], ay[31:0], az[31:0], and chainin[31:0] feed an FP multiplier (FP MULT) and an FP adder/subtractor (FPADD/SUB) through pipeline registers (R), with an accumulate control. Outputs are result[31:0] and chainout[31:0]. The result, chainin, and chainout paths each carry invalid, inexact, overflow, and underflow flags.]
FPGA I/O Elements
Advanced programmable logic blocks connect directly to row or column interconnect
Control available I/O features
- Input/output/bidirectional
- Multiple I/O standards
- Differential signaling
- Current drive strength
- Slew rate
- On-chip termination/pull-up resistors
- Open drain/tri-state
- etc.
Typical I/O Element Logic
[Figure: I/O element with an input path, an output path, and output enable control connected to the device pin]
High Speed Transceivers
High-speed transceivers
- Used for numerous high speed protocols: Ethernet, PCI Express, etc.
FPGA Clocking Structures
Dedicated input clock pins
Phase Locked Loops (PLLs) (see next slide)
Delay Locked Loops (DLLs)
- Dynamically phase-shift strobes for external memory interfaces
Clock control block(s)
- Select clocks to feed clock routing network
- Enable/disable clocks for power-up/down and for power savings
Clock routing network
- Special routing channels reserved for clocks driven by PLLs or clock control blocks
- Global clock network feeds entire device
- Regional or hierarchical networks feed certain device areas, such as device quadrants
FPGA PLLs
Programmable blocks that generate clocks (clock domains) from an input clock, for use throughout the device with minimal skew
[Figure: a 100 MHz input clock drives a PLL that produces a 100 MHz clock domain, a 200 MHz clock domain, and a 90° phase-shifted 200 MHz clock domain]
FPGA Programming
Most FPGAs use SRAM cell technology to program interconnect and LUT function levels
Volatile! Must be programmed at power-on!
[Figure: SRAM programming cells (latches), each with a program enable and a programming bit, driving the IC switches at row/column routing interconnect junctions]
FPGA Programming (cont.)
FPGA programming information must be stored somewhere to program device at power on
- Use external EEPROM, CPLD, or CPU to program
Two programming methods
- Active: FPGA controls programming sequence automatically at power on
- Passive: Intelligent host (typically CPU) controls programming
Also programmable through JTAG connection
FPGA Full Chip Architecture
Configuration
Example of Resource Counts for largest part (Arria 10 GX 1150)
1,150,000 Logic Elements
- 1.7 M registers and ~1M 4-input LUTs
About 54,000 Kbits of embedded memory in 20 Kbit blocks (roughly 2,700 blocks)
1,518 DSP blocks
36 17.4 Gbps Transceivers
2 Hard PCIe blocks
492 general I/O pins
12 Hard Memory Controllers
FPGA Advantages
High density to create many complex logic functions
Integration of many functions
Many available I/O standards and features
Data can go directly from I/O to computation engine and back out, all in one chip.
Altera FPGAs
- Max®, Cyclone®, Arria®, and Stratix® series devices
Agenda
FPGA Architecture
Design Methodology and Software
Typical Programmable Logic Design Flow (1/2)
Design specification
Design entry/RTL coding
- Behavioral or structural description of design
- Possibly with the help of high level tools
RTL functional simulation
- Mentor Graphics ModelSim® or other 3rd party simulators
- Verify logic model & data flow (no timing delays)
Synthesis (Mapping)
- Translate design into device specific primitives
- Optimization to meet required area & performance constraints
- Quartus® synthesis or those available from 3rd party vendors
- Result: Post-synthesis netlist
[Figure: device primitives LE, DSP, M9K, and I/O]
Typical Programmable Logic Design Flow (2/2)
Place & route (Fitting)
- Map primitives to specific locations inside target technology with reference to area & performance constraints
- Specify routing resources to be used
- Result: Post-fit netlist
Timing analysis
- Verify performance specifications were met
- Static timing analysis
PC board simulation & test
- Simulate board design
- Program & test device on board
- Use on-chip tools for debugging
Quartus Prime Design Software
Quartus Prime fully-integrated development tool
- Multiple design entry methods
- Logic synthesis
- Place & route
- Timing & power analysis
- Device programming
[Figure: a system description (FPGA with I/O, A/D, CPU, DSP, and Flash) and HDL code go into Quartus Prime, which produces the programming image]
Brief Introduction to OpenCL
Why OpenCL on FPGAs
OpenCL expands the number of application developers
[Figure: developer populations: ASIC, FPGA programmers, parallel programmers, standard CPU programmers]
Utilizing Software Engineering Resources
Altera OpenCL flow abstracts away the FPGA hardware flow, bringing the FPGA to low-level software programmers
- Software developers write, optimize and debug in familiar environment
- Quartus® II software runs behind the scenes
- Emulator and profiler are software development tools
What is OpenCL?
A software programming model for software engineers and a software methodology for system architects
- First industry standard for heterogeneous computing
Provides increased performance with hardware acceleration
- Low Level Programming language
- Based on ANSI C99
Open, royalty-free, standard
- Managed by Khronos Group
- Altera active member
- Conformance requirements
- V1.0 is current reference
- V2.0 is current release
- http://www.khronos.org
[Figure: the host is programmed through a C/C++ API; the accelerator is programmed in OpenCL C]
The BIG Idea behind OpenCL
OpenCL execution model …
- Define N-dimensional computation domain
- Execute a kernel at each point in computation domain

Traditional loops (Parallelism Must be Inferred)
void
trad_mul(int n,
         const float *a,
         const float *b,
         float *c)
{
    int i;
    for (i=0; i<n; i++)
        c[i] = a[i] * b[i];
}

Data Parallel OpenCL (Parallelism is Explicit)
kernel void
dp_mul(global const float *a,
       global const float *b,
       global float *c)
{
    int id = get_global_id(0);
    c[id] = a[id] * b[id];
} // execute over “n” work-items

Altera OpenCL SDK for FPGAs supports both styles.
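The relationship between the two styles can be checked on an ordinary CPU by running the kernel body once per work-item. The sketch below is plain C, not OpenCL: `dp_mul_work_item` and `launch_dp_mul` are hypothetical stand-ins, and the `id` parameter takes the place of `get_global_id(0)`. A real runtime is free to execute the work-items in any order or in parallel; the sequential loop here is only an emulation.

```c
#include <stddef.h>

/* Traditional loop: the compiler has to infer that iterations are independent. */
void trad_mul(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

/* Body of the data-parallel kernel for ONE work-item.
 * `id` stands in for get_global_id(0). */
static void dp_mul_work_item(size_t id, const float *a, const float *b, float *c)
{
    c[id] = a[id] * b[id];
}

/* Hypothetical emulation of launching the kernel over n work-items.
 * Each call is independent, so the order of execution does not matter. */
void launch_dp_mul(size_t n, const float *a, const float *b, float *c)
{
    for (size_t id = 0; id < n; ++id)
        dp_mul_work_item(id, a, b, c);
}
```

Both functions produce the same output array. The difference is where the independence of the iterations is stated: implicitly in the loop, explicitly in the per-work-item function.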
OpenCL Programming Model
Host and accelerator code are separate
[Figure: a host with global memory connected to several accelerators, each with its own local memory]

Kernel (accelerator code)
__kernel void
sum(__global float *a,
    __global float *b,
    __global float *y)
{
    int gid = get_global_id(0);
    y[gid] = a[gid] + b[gid];
}

Host code
main()
{
    read_data( … );
    manipulate( … );
    clEnqueueWriteBuffer( … );
    clEnqueueNDRangeKernel(…,sum,…);
    clEnqueueReadBuffer( … );
    display_result( … );
}
OpenCL Host Program
Pure software written in standard C / C++
Communicates with the Accelerator Device via a set of library routines which abstract the communication between the host processor and the kernels
main()
{
    read_data_from_file( … );
    manipulate_data( … );
    clEnqueueWriteBuffer( … );                 // Copy data from Host to FPGA
    clEnqueueNDRangeKernel(…, my_kernel, …);   // Ask the FPGA to run a particular kernel
    clEnqueueReadBuffer( … );                  // Copy data from FPGA to Host
    display_result_to_user( … );
}
OpenCL Kernels
Data-parallel function
- Defines many parallel threads of execution
- Each thread has an identifier specified by “get_global_id”
- Contains keyword extensions to specify parallelism and memory hierarchy
Executed by compute object
- CPU
- GPU
- FPGA
__kernel void
sum(__global const float *a,
    __global const float *b,
    __global float *answer)
{
    int xid = get_global_id(0);
    answer[xid] = a[xid] + b[xid];
}
float *a      = 0 1 2 3 4 5 6 7
float *b      = 7 6 5 4 3 2 1 0
__kernel void sum( … );
float *answer = 7 7 7 7 7 7 7 7
Thread ID space for NDRange kernels
Each thread knows its “id”. It is used to determine which slice of data the thread should work on.
Threads are partitioned into work-groups.
Only threads within one work-group can share local memory.
ND Range: Thread & Group identifiers
get_group_id(0):   0          | 1          | 2
get_local_id(0):   0 1 2 3 4  | 0 1 2 3 4  | 0 1 2 3 4
get_global_id(0):  0 1 2 3 4  | 5 6 7 8 9  | 10 11 12 13 14
get_global_id(0) == get_group_id(0) * get_local_size(0) + get_local_id(0)
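The identifier arithmetic can be reproduced in plain C for a one-dimensional range. This sketch assumes a zero global offset and a global size that is a multiple of the work-group size. The struct and function names are hypothetical and are not part of the OpenCL API.

```c
#include <stddef.h>

/* The three identifiers a work-item can query in a 1-D NDRange. */
typedef struct {
    size_t group_id;   /* what get_group_id(0) would return  */
    size_t local_id;   /* what get_local_id(0) would return  */
    size_t global_id;  /* what get_global_id(0) would return */
} work_item_ids;

/* Fill ids[0 .. global_size) for a 1-D NDRange split into work-groups of
 * local_size items. Returns the number of work-groups, or 0 (writing
 * nothing) if local_size is zero or does not divide global_size. */
size_t enumerate_work_items(size_t global_size, size_t local_size, work_item_ids *ids)
{
    if (local_size == 0 || global_size % local_size != 0)
        return 0;

    size_t num_groups = global_size / local_size;
    size_t n = 0;
    for (size_t group = 0; group < num_groups; ++group) {
        for (size_t local = 0; local < local_size; ++local) {
            ids[n].group_id  = group;
            ids[n].local_id  = local;
            ids[n].global_id = group * local_size + local;  /* the identity above */
            ++n;
        }
    }
    return num_groups;
}
```

Calling `enumerate_work_items(15, 5, ids)` reproduces the 15-item, 3-group layout shown on the slide.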
Memory Model
Private Memory
- Unique to thread. Usually registers.
Local Memory
- Shared within workgroup. On-chip.
Global & Constant Memory
- Visible to all workgroups and the host. Usually off-chip.
Host Memory
- On the host CPU. Usually DDR on CPU board.
[Figure: a kernel made of workgroups, each with its own local memory and work-items that each have private memory. All workgroups share global memory and constant memory, which connect to the host CPU and host memory over PCIE / QPI / AXI.]
Compiling OpenCL to FPGAs
[Figure: the OpenCL source (host program + kernels) is split in two. The host program goes through a standard C compiler to an x86 binary. The kernel programs go through the ACL compiler to an SOF. The x86 host talks over PCIe to the FPGA, where the kernel becomes replicated load/load/store pipelines connected to PCIe and DDRx.]

Host Program
main()
{
    read_data_from_file( … );
    manipulate_data( … );
    clEnqueueWriteBuffer( … );
    clEnqueueNDRangeKernel(…, sum, …);
    clEnqueueReadBuffer( … );
    display_result_to_user( … );
}

Kernel Programs
__kernel void
sum(__global const float *a,
    __global const float *b,
    __global float *answer)
{
    int xid = get_global_id(0);
    answer[xid] = a[xid] + b[xid];
}
OpenCL CAD Flow
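[Figure: my_kernel.cl goes through the OpenCL Compiler, which uses the Board Description to produce HDL. Quartus turns the HDL into the FPGA programming bitstream. my_host.c goes through a C++ compiler and is linked with the ACL runtime library to produce program.exe.]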
OpenCL Compiler Builds Complete FPGA
The Altera Offline Compiler (aoc) produces the complete FPGA design:
- accelerators
- data paths
- all memory structures
[Figure: a processor connected over PCIe to the FPGA. Inside the FPGA: the PCIe interface, two external memory controllers & PHYs connected to DDR, the global memory interconnect, two kernel datapaths, and on-chip memories with their local memory interconnects. One part is labelled “Everything in this part is generated based on user’s kernel.” The other is labelled “This part is fixed by Board Support Package (BSP) vendor.”]
OpenCL SDK comes with host-side Run-Time Environment: OS driver, lower-level HAL, OpenCL API implementation library.
10 minute break
Outline
(40 min)
Introduction to FPGAs
Brief introduction to OpenCL programming language
(10 min break)
(40 min)
How OpenCL concepts map to FPGA hardware
Examples of architectures for high-performance applications
Computation in Space
How computation is mapped to FPGAs
Mapping a simple program to an FPGA
High-level code
Mem[100] += 42 * Mem[101]
CPU instructions
R0 ← Load Mem[100]
R1 ← Load Mem[101]
R2 ← Load #42
R2 ← Mul R1, R2
R0 ← Add R2, R0
Store R0 → Mem[100]
Execution on a Simple CPU
[Figure: simple CPU datapath with instruction fetch (PC, Op, Val), a register file (Aaddr, Baddr, Caddr, CWriteEnable, CData), an ALU with inputs A and B and output C, and a load/store unit (LdAddr, StAddr, LdData, StData)]
Fixed and general architecture:
- General “cover-all-cases” data-paths
- Fixed data-widths
- Fixed operations
Load constant value into register
[Figure: the same simple CPU datapath (instruction fetch, register file, ALU, load/store unit), illustrating a constant value being loaded into a register]
CPU activity, step by step
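[Figure: the CPU datapath drawn once per instruction along a Time axis; every step reuses the same hardware]
R0 ← Load Mem[100]
R1 ← Load Mem[101]
R2 ← Load #42
R2 ← Mul R1, R2
R0 ← Add R2, R0
Store R0 → Mem[100]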
On the FPGA we unroll the CPU hardware…
[Figure: one copy of the CPU datapath per instruction, laid out along a Space axis]
R0 ← Load Mem[100]
R1 ← Load Mem[101]
R2 ← Load #42
R2 ← Mul R1, R2
R0 ← Add R2, R0
Store R0 → Mem[100]
… and specialize by position
[Figure: the hardware for the six instructions, laid out in space, after the step below]
1. Instructions are fixed. Remove instruction “Fetch”
… and specialize
[Figure: the hardware for the six instructions, laid out in space, after the steps below]
1. Instructions are fixed. Remove instruction “Fetch”
2. Remove unused ALU ops
… and specialize
[Figure: the hardware for the six instructions, laid out in space, after the steps below]
1. Instructions are fixed. Remove instruction “Fetch”
2. Remove unused ALU ops
3. Remove unused Load / Store
… and specialize
58
R0 Load Mem[100]
R1 Load Mem[101]
R2 Load #42
R2 Mul R1, R2
R0 Add R2, R0
Store R0 Mem[100]
1. Instructions are fixed. Remove instruction
“Fetch”
2. Remove unused ALU ops
3. Remove unused Load / Store
4. Wire up registers properly! And
propagate state.
![Page 58: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/58.jpg)
… and specialize
59
R0 Load Mem[100]
R1 Load Mem[101]
R2 Load #42
R2 Mul R1, R2
R0 Add R2, R0
Store R0 Mem[100]
1. Instructions are fixed. Remove instruction
“Fetch”
2. Remove unused ALU ops
3. Remove unused Load / Store
4. Wire up registers properly! And
propagate state.
5. Remove dead data.
![Page 59: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/59.jpg)
… and specialize
60
R0 Load Mem[100]
R1 Load Mem[101]
R2 Load #42
R2 Mul R1, R2
R0 Add R2, R0
Store R0 Mem[100]
1. Instructions are fixed. Remove instruction “Fetch”
2. Remove unused ALU ops
3. Remove unused Load / Store
4. Wire up registers properly! And propagate state.
5. Remove dead data.
6. Reschedule!
![Page 60: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/60.jpg)
Custom data-path on the FPGA matches your algorithm!
61
Build exactly what you need:
Operations
Data widths
Memory size & configuration
Efficiency:
Throughput / Latency / Power
High-level code:
Mem[100] += 42 * Mem[101]
Custom data-path:
[Figure: two load units and the constant 42 feed the arithmetic, whose result goes to a store unit]
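The six-instruction sequence and the custom data-path compute the same result. The following is a minimal C sketch of that equivalence, not the compiler's actual output: the function names `cpu_style` and `datapath_style` and the choice of `int32_t` as the word type are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CPU view: six fixed instructions that pass data through registers R0..R2. */
void cpu_style(int32_t *mem) {
    assert(mem != NULL);
    int32_t r0, r1, r2;
    r0 = mem[100];   /* R0 Load Mem[100]  */
    r1 = mem[101];   /* R1 Load Mem[101]  */
    r2 = 42;         /* R2 Load #42       */
    r2 = r1 * r2;    /* R2 Mul R1, R2     */
    r0 = r2 + r0;    /* R0 Add R2, R0     */
    mem[100] = r0;   /* Store R0 Mem[100] */
}

/* Data-path view: two loads, a multiply by the constant 42, an add, one store.
 * No instruction fetch and no general-purpose register file are needed. */
void datapath_style(int32_t *mem) {
    assert(mem != NULL);
    mem[100] += 42 * mem[101];
}
```

Both functions leave `mem[100]` with the same value; the second form is what the specialization steps converge to.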
![Page 61: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/61.jpg)
Data parallel kernel
62
__kernel void sum(__global const float *a,
                  __global const float *b,
                  __global float *answer)
{
    int xid = get_global_id(0);
    answer[xid] = a[xid] + b[xid];
}
float *a      = 0 1 2 3 4 5 6 7
float *b      = 7 6 5 4 3 2 1 0
__kernel void sum( … );
float *answer = 7 7 7 7 7 7 7 7
![Page 62: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/62.jpg)
Example Datapath for Vector Add
On each cycle, the portions of the datapath are processing different threads.
While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored.
63
[Figure: datapath with two Load units feeding an adder (+) that feeds a Store unit; 8 work items (IDs 0 to 7) for the vector add example wait to enter]
![Page 66: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/66.jpg)
Example Datapath for Vector Add
On each cycle, the portions of the datapath are processing different threads.
While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored.
67
[Figure: the Load, +, Store datapath with work items in flight in every stage; 8 work items (IDs 0 to 7) for the vector add example]
Silicon used efficiently at steady-state
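The overlap of work items in the datapath can be modeled in plain C. This is a simplified sketch that assumes exactly three single-cycle stages (load, add, store); real load and store units take more cycles, and the function name `pipelined_vector_add` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simulates a three-stage pipeline (load, add, store) cycle by cycle.
 * Returns the number of cycles needed to store all n results. */
int pipelined_vector_add(const float *a, const float *b, float *answer, int n) {
    assert(n >= 0);
    /* Pipeline registers; an id of -1 marks an empty stage (a bubble). */
    int load_id = -1, add_id = -1;
    float load_a = 0.0f, load_b = 0.0f, add_sum = 0.0f;
    int cycles = 0, next = 0, stored = 0;

    while (stored < n) {
        /* Stage 3: store the sum the adder produced in the previous cycle. */
        if (add_id >= 0) { answer[add_id] = add_sum; stored++; }
        /* Stage 2: add the operands that were loaded in the previous cycle. */
        add_id = load_id;
        if (load_id >= 0) add_sum = load_a + load_b;
        /* Stage 1: load the operands of the next work item, if any remain. */
        if (next < n) {
            load_id = next;
            load_a = a[next];
            load_b = b[next];
            next++;
        } else {
            load_id = -1;
        }
        cycles++;
    }
    return cycles;
}
```

In this model, the third cycle stores work item 0 while adding item 1 and loading item 2, and the 8 work items of the example finish in 8 + 2 = 10 cycles instead of 8 × 3 = 24.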
![Page 67: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/67.jpg)
Example Compiler Optimization: Branch Conversion
68
Control flow branching is expensive.
Instead, execute both sides of a branch, pick the result for the “true” path, and predicate operations that have side effects.
If a function has no loops, the whole function loses all branches.
Loops lose all internal branches.
Original code:
X = W;
if (cond) {
    X += 2;
    array[z] = Y;
}
After branch conversion (replaces the if block):
X_temp = X + 2;
X = cond ? X_temp : W;
array[z] = Y only if cond
The ?: operator is a mux in hardware.
The store is a single operation that writes only if the condition is true: “cond” is the enable on the store unit.
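A small C sketch of the transformation, under the assumption that the store unit's enable input can be modeled by a helper (`store_if`, a hypothetical name). The `if` inside the helper stands in for a write-enable signal in hardware, not for a branch in the datapath.

```c
#include <assert.h>
#include <stddef.h>

/* Branching form, as written by the programmer. */
int with_branch(int W, int cond, int Y, int *array, int z) {
    assert(array != NULL);
    int X = W;
    if (cond) {
        X += 2;
        array[z] = Y;
    }
    return X;
}

/* Models a store unit with an enable input: writes only when enable is set. */
void store_if(int enable, int *addr, int value) {
    if (enable) *addr = value;
}

/* Branch-converted form: both values exist, a mux selects one of them,
 * and the only side effect (the store) is predicated on cond. */
int branch_converted(int W, int cond, int Y, int *array, int z) {
    assert(array != NULL);
    int X = W;
    int X_temp = X + 2;           /* always computed */
    X = cond ? X_temp : W;        /* mux */
    store_if(cond, &array[z], Y); /* predicated store */
    return X;
}
```

Both functions return the same `X` and leave `array` in the same state for either value of `cond`.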
![Page 68: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/68.jpg)
Local memory
![Page 69: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/69.jpg)
Memory systems in hardware
70
[Figure: an FPGA containing several kernel pipelines, a local memory interconnect to on-chip memory blocks (M10K / M20K), and a global memory interconnect to two external memory controllers & PHYs, each attached to DDR3; a PCIe link connects the FPGA to an x86 / external processor]
External (off-chip) memory: DDR3
On-chip memory: M10K / M20K blocks
![Page 70: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/70.jpg)
Memory hierarchy
1. Register data: Registers in FPGA fabric
2. Private data: Registers in FPGA fabric or on-chip RAMs
3. Local memory: On-chip RAMs
4. Global memory: Off-chip external memory
![Page 71: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/71.jpg)
On-chip memory systems
“Local” and some “private” memories use on-chip block RAM resources.
Very high bandwidth, random accesses, limited capacity.
The memory system is customized to your application.
Huge value proposition over fixed-architecture accelerators.
Memory geometry (width, depth, number of banks, etc.) and interconnect are all customized for your kernel.
Automatically optimized to eliminate or minimize access contention.
Caveat: the compiler has to understand access patterns to minimize contention efficiently.
![Page 72: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/72.jpg)
On-chip memory architecture
73
Basic memory architectures map to M20Ks.
Each M20K is a dual-ported RAM.
Concurrently, #loads + #stores ≤ 2.
Kernels require complex accesses.
Compiler optimizes the kernel pipeline, the interconnect, and the memory system to achieve this.
Through splitting, coalescing, banking, replication, double-pumping, port sharing.
[Figure: kernel pipeline connected through the local memory interconnect to port 0 and port 1 of a group of M20K blocks]
![Page 73: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/73.jpg)
Interconnect
74
Interconnect includes access arbitration to memory ports.
With no optimization, sharing ports destroys performance.
Pipeline stalls due to arbitration for concurrent accesses.
Key to high kernel efficiency is never-stall memory accesses.
All possibly concurrent memory access sites in the datapath are guaranteed to access memory without contention.
[Figure: a store and several load sites in the datapath share Port0 and Port1 of a memory through arbitration nodes]
![Page 74: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/74.jpg)
Double pumping
75
[Figure: a block RAM clocked at Clk exposes Port 1 and Port 2; the same block RAM placed in a 2x clock domain (2xClk) exposes Port 1 to Port 4 to the 1x domain. With the memory on the 2x clock, the store and load sites are spread over Port0 to Port3.]
Pros: no M20K increase
Cons: Fmax penalty, register usage increase
![Page 75: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/75.jpg)
Replication
76
[Figure: replicated block RAMs presented as one memory with up to four ports (labels: “1-3 write”, “Y-read”); the store and load sites are spread over Port0 to Port3 of two memory copies, each on the 2x clock]
![Page 76: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/76.jpg)
Static coalescing (cont’d)
Original kernel:
kernel void example() {
    local int A[32][2], B[32][2];
    …
    int lid = get_local_id(0);
    A[lid][0] = B[lid][0];
    A[lid][1] = B[lid + x][1];
    …
}
With coalescing:
kernel void example() {
    local int A[32][2], B[32][2];
    …
    int lid = get_local_id(0);
    int2 v = (int2)(B[lid][0], B[lid + x][1]);
    *(local int2 *)(&A[lid][0]) = v;
    …
}
![Page 77: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/77.jpg)
Banking
Partition memory into logical banks.
An N-bank configuration can handle N requests per clock cycle as long as each request addresses a different bank.
Uses the lower bits of the memory address for bank selection.
78
[Figure: four Load/Store units connected through an arbitration network to eight banks (Bank0 to Bank7), one M20K per bank]
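Bank selection from the lower address bits can be sketched in C as follows. The sketch assumes word addressing and the 8-bank layout of the figure; `bank_of` and `conflict_free` are hypothetical helper names, not part of any SDK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_BANKS 8u /* as in the figure; assumed to be a power of two */

/* The lower bits of the word address select the bank. */
unsigned bank_of(unsigned word_addr) {
    return word_addr & (NUM_BANKS - 1u);
}

/* Returns true when all requests issued in the same cycle hit distinct banks,
 * i.e. when they can all be served without arbitration stalls. */
bool conflict_free(const unsigned *word_addrs, size_t n) {
    assert(word_addrs != NULL || n == 0);
    bool busy[NUM_BANKS] = { false };
    for (size_t i = 0; i < n; i++) {
        unsigned b = bank_of(word_addrs[i]);
        if (busy[b]) return false; /* two requests target the same bank */
        busy[b] = true;
    }
    return true;
}
```

With this mapping, consecutive words land in consecutive banks, so four accesses to addresses 0, 1, 2, 3 proceed in parallel, while addresses 0 and 8 collide on Bank0.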
![Page 78: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/78.jpg)
Forcing memory geometry
Compiler attributes can enforce a desired local memory configuration.
int __attribute__((numbanks(1)))
    __attribute__((bankwidth(128)))
    __attribute__((doublepump))
    __attribute__((numwriteports(1)))
    __attribute__((numreadports(4))) MyLocalMem[128];
79
![Page 79: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/79.jpg)
Global Memory
![Page 80: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/80.jpg)
Global Memory in OpenCL
81
‘global’ address space
Used to share data between host and device
Shared between workgroups
Generally allocated on host as cl_mem object
Created with clCreateBuffer
Data transferred with clEnqueueReadBuffer / clEnqueueWriteBuffer
cl_mem object assigned to global pointer argument in kernel
__kernel void Add(__global float* a,
__global float* b,
__global float* c)
{
int i = get_global_id(0);
c[i] = a[i] + b[i];
}
![Page 81: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/81.jpg)
OpenCL BSP Global Memory
82
[Figure: an FPGA with two kernel pipelines, each with its own local memory interconnect and on-chip memory; a global memory interconnect links the pipelines to two external memory controllers & PHYs attached to DDR, and a PCIe link connects the FPGA to the processor]
![Page 82: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/82.jpg)
Compiler’s View of Global Memory
83
Agnostic to the memory technology itself
DDR, QDR, HMC, QPI
Only a few pertinent parameters (provided by BSP)
How many interfaces
Width of the bus
Burst size (affinity for linear access)
Latency
Bandwidth
![Page 83: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/83.jpg)
Compiler Generated Hardware
84
[Figure: the AOC compiler takes the BSP and foo.cl (global int* x; … int y=x[k];) and generates a load unit in the kernel pipeline, connected through arbitration to global memory]
![Page 84: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/84.jpg)
Global Load/Store Unit (LSU)
85
Width Adaptation
User data (32-bit int) to memory word (512-bit DRAM word)
Possible static coalescing by compiler to avoid wasted bandwidth
Burst coalescing
Dynamically coalesces consecutive memory transactions into a large burst transaction
[Figure: Global Memory, Arbitration, Load Unit, Pipeline]
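The effect of burst coalescing can be illustrated with a simple model. This is a sketch under stated assumptions, not the actual LSU logic: it works on word addresses, ignores alignment and timeouts, and `count_bursts` and `max_burst` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the burst transactions needed for a stream of word addresses.
 * An access that follows its predecessor by exactly one word extends the
 * current burst, until the burst reaches max_burst words. */
size_t count_bursts(const unsigned *addr, size_t n, size_t max_burst) {
    assert(max_burst >= 1);
    size_t bursts = 0;
    size_t len = 0; /* length of the burst currently being built */
    for (size_t i = 0; i < n; i++) {
        int extends = (len > 0 && len < max_burst && addr[i] == addr[i - 1] + 1);
        if (extends) {
            len++;
        } else {
            bursts++; /* start a new burst with this access */
            len = 1;
        }
    }
    return bursts;
}
```

In this model a linear stream of 16 words becomes a single 16-word burst, while a stride-2 stream gains nothing from coalescing, which is why linear access patterns matter for global memory bandwidth.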
![Page 85: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/85.jpg)
Global Load/Store Unit (LSU) Flexibility
86
Compiler picks the most appropriate LSU based on static access analysis of the user’s kernel.
Simple
Passes transactions directly to interconnect from pipeline
Used for loads/stores that execute very infrequently
Burst-Coalesced
Most common global memory LSU
Specialized LSU that groups loads/stores into bursts
Load LSU can include a private data cache if the compiler determines that it will be beneficial
Streaming
Simplified LSU used if the compiler can determine that the access pattern is completely linear
Other types for special cases: extra-wide, unaligned, etc.
![Page 86: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/86.jpg)
Arbitration
87
Arbitrate to physical interfaces
Distribute (load balance) across physical interfaces
[Figure: Global Memory, Arbitration, Load Unit, Pipeline]
![Page 87: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/87.jpg)
Constant Cache
88
Constant buffer resides in global memory but is accessed via an on-chip cache shared by all work-groups
Constant cache optimized for high cache hit performance
Use for read-only data that all work-groups access
E.g. high-bandwidth table lookups
No __constant data, no constant cache.
![Page 88: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/88.jpg)
Complete Picture
89
[Figure: the kernel datapath contains several load units (coalescing, cached, streaming, decoupled) and constant load units that go through the constant cache; low-bandwidth and high-bandwidth paths meet at arbitration, which feeds two global memory controllers attached to DDR4]
![Page 89: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/89.jpg)
Channels
![Page 90: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/90.jpg)
Direct communication between Kernels and I/Os
91
[Figure: a kernel system in which kernels are linked by a kernel-to-kernel channel and by I/O channels that the BSP exposes, e.g. a 10G input and a PCIe output]
BSP can provide access to arbitrary streaming I/O interfaces as OpenCL channels.
A channel is just a FIFO in hardware.
Kernel-to-kernel channel: no need to go through global memory!
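Since a channel is a FIFO, its behavior can be sketched as a small ring buffer in C. This models only the ordering and back-pressure; the type `channel_t`, the depth of 4, and the function names are assumptions for illustration and are not the SDK's channel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEPTH 4u /* hypothetical FIFO depth */

typedef struct {
    int data[DEPTH];
    unsigned head;  /* index of the oldest element */
    unsigned tail;  /* index of the next free slot */
    unsigned count; /* number of elements stored   */
} channel_t;

void channel_init(channel_t *ch) {
    assert(ch != NULL);
    ch->head = ch->tail = ch->count = 0;
}

/* Producer side: returns false when the FIFO is full (the producer would stall). */
bool channel_write(channel_t *ch, int value) {
    assert(ch != NULL);
    if (ch->count == DEPTH) return false;
    ch->data[ch->tail] = value;
    ch->tail = (ch->tail + 1u) % DEPTH;
    ch->count++;
    return true;
}

/* Consumer side: returns false when the FIFO is empty (the consumer would stall). */
bool channel_read(channel_t *ch, int *out) {
    assert(ch != NULL && out != NULL);
    if (ch->count == 0) return false;
    *out = ch->data[ch->head];
    ch->head = (ch->head + 1u) % DEPTH;
    ch->count--;
    return true;
}
```

A producer kernel and a consumer kernel connected this way exchange data in order without either one touching global memory.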
![Page 91: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/91.jpg)
3. Loop Pipelining
92
![Page 92: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/92.jpg)
Data-Parallel Execution
93
On the FPGA, we use pipeline parallelism to achieve acceleration
Threads execute in an embarrassingly parallel manner.
Ideally, all parts of the pipeline are active at the same time.
kernel void sum(global const float *a,
                global const float *b,
                global float *c)
{
    int xid = get_global_id(0);
    c[xid] = a[xid] + b[xid];
}
[Figure: Load, +, Store datapath with work items 0, 1, 2 in flight]
![Page 93: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/93.jpg)
Data-Parallel Execution - drawbacks
94
Difficult to express programs which have partial dependencies during execution
Would require complicated hardware and new language semantics to describe the desired behavior
kernel void sum(global const float *a,
                global const float *b,
                global float *c)
{
    int xid = get_global_id(0);
    c[xid] = c[xid-1] + b[xid];
}
[Figure: Load, +, Store datapath with work items 0, 1, 2 in flight]
![Page 94: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/94.jpg)
Solution: Tasks and Loop-pipelining
95
Allow users to express programs as a single thread
Pipeline parallelism still leveraged to efficiently execute loops in Altera’s OpenCL
Parallel execution inferred by compiler
for (int i=1; i < n; i++) {
    c[i] = c[i-1] + b[i];
}
[Figure: loop pipelining; a Load, +, Store datapath with iterations i=0, i=1, i=2 in flight]
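To see why the dependency on `c[i-1]` needs single-thread semantics, the loop can be compared in C with an emulation of independent work items. The emulation is an assumption chosen for illustration: it takes the extreme case where no work item ever observes another's store, by reading from a snapshot of `c` taken before launch. The names and the `MAX_N` bound are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_N 64 /* hypothetical upper bound for the snapshot buffer */

/* Single work-item form: iteration i sees the value written by iteration i-1. */
void running_sum_task(float *c, const float *b, int n) {
    for (int i = 1; i < n; i++) {
        c[i] = c[i - 1] + b[i];
    }
}

/* Emulated independent work items: each one reads c[xid-1] from a snapshot
 * taken before any work item has stored its result. */
void running_sum_ndrange_snapshot(float *c, const float *b, int n) {
    assert(n >= 0 && n <= MAX_N);
    float before[MAX_N];
    memcpy(before, c, (size_t)n * sizeof(float));
    for (int xid = 1; xid < n; xid++) {
        c[xid] = before[xid - 1] + b[xid];
    }
}
```

With `c = {1, 0, 0, 0}` and `b = {0, 1, 1, 1}`, the loop form produces `{1, 2, 3, 4}`, whereas the snapshot emulation produces `{1, 2, 1, 1}`: the dependency chain is lost when iterations are treated as independent threads.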
![Page 95: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/95.jpg)
Loop Pipelining Example
96
No Loop Pipelining: iterations i0, i1, i2 run one after another along the clock-cycle axis. No overlap of iterations!
With Loop Pipelining: iterations i0 to i5 overlap along the clock-cycle axis. Finishes faster because iterations are overlapped. Looks almost like multi-threaded execution!
Loop Pipelining enables Pipeline Parallelism AND the communication of state information between iterations.
![Page 96: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/96.jpg)
Parallel Threads vs. Loop Pipelining
97
So what’s the difference between NDRange and loop pipelining?
Parallel Threads: parallel threads launch 1 thread per clock cycle in pipelined fashion.
Loop Pipelining: sometimes loop iterations cannot be started every cycle.
[Figure: threads t0 to t5 launched on consecutive cycles, next to loop iterations i0 to i5]
![Page 97: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/97.jpg)
Loop-Carried Dependencies
98
Loop-carried dependencies are dependencies where one iteration of the loop depends upon the results of another iteration of the loop.
The variable state in iteration 1 depends on the value from iteration 0. Similarly, iteration 2 depends on the value from iteration 1, etc.
kernel void state_machine(ulong n) {
    t_state_vector state = initial_state();
    for (ulong i=0; i<n; i++) {
        state = next_state( state );
        uint y = process( state );
        write_output(y);
    }
}
![Page 98: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/98.jpg)
Loop-Carried Dependencies
99
To achieve acceleration, we pipeline each iteration of a loop with loop-carried dependencies
Analyze any dependencies between iterations
Schedule these operations
Launch the next iteration as soon as possible
kernel void state_machine(ulong n) {
    t_state_vector state = initial_state();
    for (ulong i=0; i<n; i++) {
        state = next_state( state );
        uint y = process( state );
        write_output(y);
    }
}
[Callout on the code: At this point, we can launch the next iteration]
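How soon the next iteration can launch determines the total run time. A common way to reason about this, which is an assumption added here and not taken from the slides, is the initiation interval (II): the number of cycles between the starts of consecutive iterations. The function names below are hypothetical.

```c
#include <assert.h>

/* Cycles for n iterations when a new iteration starts every `ii` cycles and
 * each iteration takes `latency` cycles from start to finish. */
unsigned long pipelined_cycles(unsigned long n, unsigned long latency, unsigned long ii) {
    assert(latency >= 1 && ii >= 1);
    if (n == 0) return 0;
    return latency + (n - 1) * ii;
}

/* Cycles for n iterations when each iteration must finish before the next starts. */
unsigned long sequential_cycles(unsigned long n, unsigned long latency) {
    assert(latency >= 1);
    return n * latency;
}
```

For six iterations of a 3-cycle loop body, this model gives 18 cycles without pipelining, 8 cycles with II = 1, and 13 cycles with II = 2, which is the case where a loop-carried dependency prevents starting an iteration every cycle.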
![Page 99: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/99.jpg)
Design Examples
100
![Page 100: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/100.jpg)
FFT – Example of Feed-Forward Pipeline Architecture
101
Instead of loops, data is fed into a feed-forward pipeline.
The input and output data rate is two complex points per clock cycle.
Delay elements in the pipeline, together with rotators, ensure that all
Architecture and image above are taken from Mario Garrido, Jesús Grajal, M. A. Sanchez, Oscar Gustafsson, “Pipelined Radix-2^k Feedforward FFT Architectures”, IEEE Trans. VLSI Syst. 21(1): 23-32 (2013)
[Figure: feed-forward pipeline built from butterflies, data rotators, and delay elements (e.g. delay by 8 cycles)]
![Page 101: OpenCL for FPGAs - · PDF fileNew in latest FPGA family (Arria 10): 32-bit Floating Point Support 18 Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add](https://reader034.vdocument.in/reader034/viewer/2022051600/5a9e3af67f8b9a0d7f8beed0/html5/thumbnails/101.jpg)
Matrix Multiply in OpenCL – Small 4x4 variant
102
Systolic array architecture.
Great match for FPGAs.
Scalable to smaller and larger FPGA devices by simply resizing the grid.
[Figure: on an Arria 10 GX1150, a 4x4 grid of processing elements (PEs); Load A and Load B read from DDR4 into four feeders each along two edges of the grid, and four drains pass results to Drain C]
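A systolic matrix multiply can be simulated cycle by cycle in C. The sketch below uses one common formulation (each PE keeps its own element of C while A values move right and B values move down); this is an assumption, and the deck's actual PE design, feeders, and drains may differ. `SA_N` and `systolic_matmul` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define SA_N 4 /* grid size, matching the 4x4 variant */

/* Each PE(i,j) multiplies the A value arriving from its left neighbour by the
 * B value arriving from above, accumulates into C[i][j], and forwards both. */
void systolic_matmul(int A[SA_N][SA_N], int B[SA_N][SA_N], int C[SA_N][SA_N]) {
    assert(A != NULL && B != NULL && C != NULL);
    int a_reg[SA_N][SA_N] = {{0}}; /* A value held by each PE */
    int b_reg[SA_N][SA_N] = {{0}}; /* B value held by each PE */

    for (int i = 0; i < SA_N; i++)
        for (int j = 0; j < SA_N; j++)
            C[i][j] = 0;

    /* The last PE receives its final operands at cycle 3*SA_N - 3. */
    for (int t = 0; t < 3 * SA_N - 2; t++) {
        /* Visit PEs from bottom-right to top-left so that every PE still
         * reads the previous-cycle registers of its neighbours. */
        for (int i = SA_N - 1; i >= 0; i--) {
            for (int j = SA_N - 1; j >= 0; j--) {
                int k = t - i - j; /* operand index reaching PE(i,j) now */
                int valid = (k >= 0 && k < SA_N);
                /* Feeders inject skewed inputs at the left and top edges. */
                int a_in = (j == 0) ? (valid ? A[i][k] : 0) : a_reg[i][j - 1];
                int b_in = (i == 0) ? (valid ? B[k][j] : 0) : b_reg[i - 1][j];
                C[i][j] += a_in * b_in;
                a_reg[i][j] = a_in;
                b_reg[i][j] = b_in;
            }
        }
    }
}
```

Every PE only talks to its immediate neighbours, which is what makes the structure easy to place on an FPGA and to resize by changing the grid dimensions.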