18-643-F20-L11-S1, James C. Hoe, CMU/ECE/CALCM, ©2020
18-643 Lecture 11: Intel OpenCL for FPGA
Shashank Obla, Department of ECE
Carnegie Mellon University
Housekeeping
• Your goal today: understand Intel’s interpretation of OpenCL for FPGAs
• Notices
  – Handout #6: lab 2, due Monday noon, 10/12
  – Handout #7: lab 3 (look for it on Friday)
  – Project status report due each Friday
  3 weeks to project proposal!!
• Readings
  – skim Ch 10, Reconfigurable Computing
  – for concrete reference: Intel SDK for OpenCL: Programming Guide and Best Practices Guide
Khronos’ OpenCL
Two Parts to OpenCL
1. Platform model
   – host (processor & memory)
   – 1 or more accelerator devices
     + device-side mem hierarchy: global/local/private
   – APIs for host-thread to interact with devices
     • launch compute kernels to devices
     • prepare (load/unload) device memory
2. Kernel programming language
   – perfect triply-nested loops
   – no loop-carried dependence
(OpenCL terms introduced in bold)
OpenCL Platform Model
[Figure: OpenCL platform model – a host CPU with host memory, connected to multiple compute devices, each with its own global memory]
What are these “compute devices”???
Basic Host Program Example

main ( ) {
  . . . get device handle and queue . . .
  . . . allocate memory buf objects . . .
  . . . get kernel object . . .
  while ( ) {
    . . . initialize memory buf data . . .
    . . . bind buf objects to kernel arguments . . .
    . . . add kernel and buf objects to device queue . . .
    . . . wait for kernel to finish . . .
    . . . retrieve memory buf object for result . . .
  }
}

What are these “kernels”???
What are these kernels?
Specifically talking about OpenCL C
Conceptually . . .

for (int i=0; i < R0; i++)
  for (int j=0; j < R1; j++)
    for (int k=0; k < R2; k++) {
      << local variable declarations >>
      << arbitrary C-code with access to global memory >>
    }

• Loop body must be data-parallel
  – local variables limited to scope
  – disallow loop-carried dependencies through global memory
  ==> statements from different iterations can interleave in any order (using disambiguated local variables)
Concretely . . .
• Only specify loop body as a kernel function

__kernel void foo(<< pointers to global mem buf >>) {
  int i=get_global_id(2), j=get_global_id(1), k=get_global_id(0);
  << local variable declarations >>
  << arbitrary C-code with access to global memory >>
}

• Triply-nested loops hardwired as NDRange
  – specified as 3 integer constants, i.e., the loop bounds (R0, R1, R2)
  – 1 execution of kernel function is a work-item
    work-item has private memory for local var’s
  – 1 kernel execution is R0×R1×R2 work-items
Example: N-by-N MMM

__kernel void mmm(__global float *A, … *B, … *C) {
  int i=get_global_id(1);
  int j=get_global_id(0);
  for(int k=0; k<N; k++)
    C[i*N+j]=C[i*N+j]+A[i*N+k]*B[k*N+j];
}

• NDRange=(N, N, 1)
  – kernel function executed by N×N×1 work-items
  – each work-item sees a different combination of dimension-0 and dimension-1 global id’s
  – no assumption about work-items’ relative progress
(For Your Reference: N-by-N MMM in C)

float A[N][N], B[N][N], C[N][N];
for(int i=0; i<N; i++)
  for(int j=0; j<N; j++) {
    for(int k=0; k<N; k++)
      C[i][j]=C[i][j]+A[i][k]*B[k][j];
  }

• Note:
  – Loop body of the inner-most loop is not data-parallel: dependency through C[i][j]
  – Loop body of the second inner-most loop is
Work-Group
• Partition NDRange of R0×R1×R2 work-items into 3D work-groups of G0×G1×G2 work-items
  – G0/1/2 must divide R0/1/2 evenly
  – get_local_id(dim): id within group
  – get_group_id(dim): id of group
• Work-group signifies “locality” b/w work-items
  – execute together by a processing element
  – can share per-group local memory
  – can synchronize by barrier()
Example: NDRange=(9, 3, 6), Group Size=(3, 3, 3) ==> Group Range=(3, 1, 2)
Why do we need this?
OpenCL Kernels on GPGPUs
• A work-item is a CUDA thread
• A work-group executes as a thread block, broken down into 32-work-item SIMD warps
• Work-groups from same and different kernels are interleaved on a Streaming Processor
  – 128 SIMD lanes, 1 INT + 1 FMA per lane, 1.73GHz
• 1 kernel can execute on up to 20 StrmProc’s (as 1 compute device), peak 8,873 GFLOPS
• Global = GDDR; local = shared memory, 96KB SRAM; private = register file, 256KB SRAM
(Nvidia terms in italic-bold; numbers for GTX 1080)
Aside: Barrel Processor [HEP, Smith]
[Figure: barrel-processor pipeline – one instruction per warp/thread (InstT1 … InstT8) in flight, interleaved cycle by cycle across SIMD lanes A–H]
• Each cycle, select a “ready” thread from scheduling pool
  – only one instruction per thread in flight at once
  – on a long-latency stall, remove the thread from scheduling
• Simpler and faster pipeline implementation since
  – no data dependence, hence, no stall or forwarding
  – no penalty in making pipeline deeper
To get 8,873 GFLOPS . . .
• # work-items ≥ 128 x StrmProc pipeline depth x 20
• Computation entirely of Fused-Multiply-Add
  – interleaved warps, so no worries about RAW stalls
• No if-then-else (branch divergence)
BTW:
• ld’s and st’s take up inst. issue BW off the top
• 320 GB/sec DRAM BW ==> need AI > 108 SP FLOPs per float accessed
  – only certain access patterns can sustain peak BW
  – SIMD ld’s and st’s in a warp must go to the same memory line (memory coalescing)
Intel OpenCL for FPGA
OpenCL FPGA Platform Model

[Figure: host CPU with host memory connected to an FPGA; the FPGA holds multiple custom kernel devices, each attached to its own global memory (DRAM)]
Compute devices synthesized from kernel functions
Example: N-by-N MM“Add”

__kernel void mma(__global float *A, … *B, … *C) {
  int i=get_global_id(1);
  int j=get_global_id(0);
  C[i*N+j]=A[i*N+j]+B[i*N+j];
}

• NDRange=(N, N, 1)
• Note in this example:
  – data-parallel kernel function
  – no loop in kernel function
Fully-pipelined Kernel Datapath

[Figure: datapath for mma – gid0/gid1 feed address calculations for &A[i*N+j] and &B[i*N+j]; two load units fetch A[i*N+j] and B[i*N+j]; an adder produces the sum; a third address calculation and a store unit write C[i*N+j]. NDRange=(N,N,1), so the (gid0,gid1) stream is (0,0),(0,1),…(0,N-1),(1,0)…(1,N-1)……(N-1,0)…(N-1,N-1).]
(A fully unrolled loop in the kernel fxn is also okay.)
What about MMM?

__kernel void mmm(__global float *A, … *B, … *C) {
  int i=get_global_id(1);
  int j=get_global_id(0);
  for(int k=0; k<N; k++)
    C[i*N+j]=C[i*N+j]+A[i*N+k]*B[k*N+j];
}

• NDRange=(N, N, 1)
• Can’t easily pipeline work-items like before
• PE pipelines the k iterations
  – dependency on C[i*N+j]
  – kernel function scope limits the tricks we can play
Use of Single Work-Item Kernel

__kernel void mmm(__global float *A, … *B, … *C) {
  for(int i=0; i<N; i++)
    for(int j=0; j<N; j++)
      for(int k=0; k<N; k++)
        C[i*N+j]=C[i*N+j]+A[i*N+k]*B[k*N+j];
}

• NDRange=(1, 1, 1) (never do this on a GPU!!)
• Arbitrary control flow (loops, if’s) and dependencies
• Becomes just “regular” C-to-HW synthesis
  – pipeline and parallelize loops
  – schedule for initiation interval, resource, etc.
Only want OpenCL’s platform model and API; “work-group” & “work-item” not too meaningful
Use of Kernel-Kernel Channels
• GPU multi-kernel OpenCL program
  – kernel computes from global-mem to global-mem
  – next kernel continues from last kernel’s output buf
  – producer and consumer kernels fully serialized
• For streaming processing on FPGA, connect kernels with streaming channels to bypass DRAM
  – concurrent producer and consumer kernels
  – reduce DRAM and PCIe bandwidth requirements
[Figure: host →PCIe→ kernelA →chnl→ kernelB →chnl→ krnlC →PCIe→ host, with loads/stores to global mem (DRAM) only at the ends]
Different programming mindset from GPU
Program-to-HW, not Function-to-IP

[Figure: host CPU and host memory connected over PCIe to the FPGA. The FPGA “shell” contains the PCIe controller, the memory controller to device DRAM, and the global interconnect; the “user space” contains the HLS kernels, their local memories, and the local interconnect.]
Object of design/optimization is the entire FPGA system
Inside an example: Tiled MMM

for (int i = 0; i < HA; i += BHA) {
  for (int j = 0; j < WB; j += BWB) {
    for (int k = 0; k < WA; k += BWA) {
      for (int ii = 0; ii < BHA; ++ii) {
        for (int jj = 0; jj < BWB; ++jj) {
          #pragma unroll
          for (int kk = 0; kk < BWA; ++kk) {
            C[(i + ii) * WC + j + jj] +=
                A[(i + ii) * WA + k + kk] *
                B[(k + kk) * WB + j + jj];
          }
        }
      }
    }
  }
}
Performance-Geared Synthesis
• Automatic pipelining and unrolling
  – no pipeline directive, but disable_loop_pipelining so you still have control
  – unroll factor controls extent of unrolling
• Local memory optimizations
  – if you don’t say anything, the tool does its best
  – banks, replicates, private copies and interconnect
[Figure: a local memory split into banks (bank 0, bank 1), each with replicates (replicate 0, replicate 1); bytes/cycle = width × ports, with bank height shown]
Global Memory Interconnect
• Load-Store Units
  – automatically chosen based on access pattern:

                    Sequential        Random Access
    Area Efficient  Prefetching       Pipelined
    Area Expensive  Burst-Coalesced   Burst-Coalesced Cached (loads only)

• Constant cache memory
• DDR banking automatically handled
Revisiting the example: Tiled MMM

for (int i = 0; i < HA; i += BHA) {
  for (int j = 0; j < WB; j += BWB) {
    for (int k = 0; k < WA; k += BWA) {
      for (int ii = 0; ii < BHA; ++ii) {
        for (int jj = 0; jj < BWB; ++jj) {
          #pragma unroll
          for (int kk = 0; kk < BWA; ++kk) {
            C[(i + ii) * WC + j + jj] +=
                A[(i + ii) * WA + k + kk] *
                B[(k + kk) * WB + j + jj];
          }
        }
      }
    }
  }
}
Using Local Memory

for (int ii = 0; ii < BHA; ++ii) {
  #pragma unroll 1
  for (int jj = 0; jj < BWB; ++jj) {
    #pragma unroll
    for (int kk = 0; kk < BWA; ++kk) {
      C_local[ii][jj] += A_local[ii][kk] * B_local[jj][kk];
    }
  }
}

for (int ii = 0; ii < BHA; ++ii) {
  #pragma unroll
  for (int jj = 0; jj < BWB; ++jj) {
    #pragma unroll
    for (int kk = 0; kk < BWA; ++kk) {
      C_local[ii][jj] += A_local[ii][kk] * B_local[jj][kk];
    }
  }
}
Also, Systolic Array Style Coding
• Express structure
  – describe models such as systolic arrays
  – instantiate explicit registers (timing)
  – vectorize operations
• Not so readable as “program”

float PE(struct vec_float_t_bool valA, vec_float_t valB, float *accum) {
  float oldAcc = __fpga_reg(accum[0]);
  float sum = (valA.c ? 0.0f : oldAcc);
  #pragma unroll
  for (int i = 0; i < VEC; i++) {
    sum += valA.data[i] * valB[i];
  }
  #pragma unroll
  for (int i = 0; i < ACCUM_SIZE - 1; i++) {
    accum[i] = accum[i + 1];
  }
  accum[ACCUM_SIZE - 1] = sum;
  return oldAcc;
}

while (1) {
  ...
  #pragma unroll
  for (int row = 0; row < ROWS; row++) {
    #pragma unroll
    for (int col = 0; col < COLS; col++) {
      // compute and store outputs in shift register
      float result = PE(fedA[row], fedB[col], accum[row][col]);
      if (fedA[row].c) {
        drain[col][row * ACCUM_SIZE] = result;
      }
      fedA[row].data = __fpga_reg(__fpga_reg(fedA[row].data));
      fedA[row].c = __fpga_reg(__fpga_reg(fedA[row].c));
      fedB[col] = __fpga_reg(__fpga_reg(fedB[col]));
    }
  }
}

An excerpt from a 500-line device code
Parting Thoughts
• OpenCL = platform model/API + SIMD language
  – kernel language forces regular and explicit parallelism
  – SIMD parallelism different on GPUs vs FPGAs
• FPGA OpenCL = platform model/API + memory system + “regular” HLS
  – single-work-item kernel unlocks parallelism style
  – kernel-kernel channels alleviate DRAM and PCIe bottleneck for streaming use cases
  – develop/debug/analysis tools integral to appeal
GPUs great at SIMD; FPGAs good for more than SIMD