![Page 1: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/1.jpg)
High-Level Synthesis of Multiple
Dependent CUDA Kernels for FPGA
Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3,
Kyle Rupnow12, Deming Chen4
1Nanyang Technological University
2Advanced Digital Sciences Center, Illinois at Singapore 3Peking University
4Univ. of Illinois Urbana-Champaign
![Page 2: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/2.jpg)
High-Level Synthesis
Automatic generation of hardware
from algorithm descriptions
• RTL design time high for complex
designs
Different input languages
• Extensions to C/C++
(SystemC, ImpulseC)
• Functional (Haskel), GPGPU (CUDA),
Graphical (LabView)
High-level Synthesis Tools
C/C++ SystemC CUDA …
![Page 3: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/3.jpg)
High-Level Synthesis Tools
Facilitate design space exploration
• Compiler directives or language features
• Automate (partially) selection of design parameters
Challenge – extracting parallelism
• Require restructuring or reimplementation of code in
HLS specific manner
Data-parallel input languages provide inherent
advantage
![Page 4: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/4.jpg)
Parallel Computing & GPU Languages
Shift towards parallel computing & heterogeneous
CUDA programming model (NVIDIA)
• Minimal extensions to C/C++
• CUDA (GPU), MCUDA (Multi-core), FCUDA (FPGA)
CUDA advantages for HLS
• Easier analysis of
application parallelism
• Exploration of parallelism
granularity options
![Page 5: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/5.jpg)
Synthesis of CUDA Kernel
FCUDA - CUDA to FPGA [SASP’09], [FCCM ‘11]
Automates design space exploration of single CUDA kernel • Match GPU performance with significantly less power
Currently supports only single kernel synthesis
FPGA
Bitfile
HLS
&
Logic
Synthesis
AutoPilot
C Code
FCUDA
Translation
Annotated
CUDA FCUDA
Annotation
CUDA
Code
![Page 6: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/6.jpg)
Synthesis of Multiple CUDA Kernels
Possible to create single enclosing wrapper kernel
Single Enclosing Wrapper Kernel is not Ideal
• Must fully-buffer all sub-kernel communications on-chip
• Must use the same thread organization for sub-kernels
• Forces all sub-kernels to be CUDA device-only functions
K1
K2 K3
K4
K1
![Page 7: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/7.jpg)
Objective
Map multiple communicating CUDA kernels onto
FPGA
• Allow fine-grained communication
• Enable data streaming
• Handle different thread organizations
Key Contributions to synthesize communicating CUDA
kernels to RTL
• Manual step-by-step procedure
• Identify key challenges in automation
• Case study of stereo-matching algorithm
![Page 8: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/8.jpg)
Multi-Kernel Synthesis - Steps
Individual Kernel Synthesis
Communication Buffer Generation
Analytical Design Space Exploration
Implementation and Verification
![Page 9: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/9.jpg)
Individual Kernel Synthesis
Kernel extraction and FCUDA flow
Initial solution of cores to be minimal in area
• Perform joint design space exploration kernels later!
Measure resource usage and latency
CUDA Application Kernel
1
FPGA Cores
FPGA Cores Core
![Page 10: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/10.jpg)
Communication Buffers
Generate control flow graph (CFG) for kernels
• ASAP scheduling to determine execution critical path
Buffers between each pair of communicating
kernels
K1
K2 K3
K4
K1
K2 K3
K4
B1
B2
B3
B4
![Page 11: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/11.jpg)
Communication Buffers
Size of buffers? • Full-size buffers infeasible
Data access pattern analysis • Initial buffer size = minimal data processing quanta
• Bigger sizes explored in analytical model
Growth rate of communication buffer
Include overlap data size for correctness • Boundary data for algorithms with windowed
computations
![Page 12: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/12.jpg)
Buffering Schemes
Kernel A
Kernel B
Buffer
Kernel C
Buffer
A B C
Time
Kernel A
Kernel B
Buffer 0
Kernel C
Buffer 0
A B C
Time
Buffer 1
Buffer 1
A B
A
a. Single-Buffer Flow b. Dual-Buffer Flow
A
![Page 13: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/13.jpg)
Analytical Design Exploration Model
![Page 14: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/14.jpg)
Analytical Design Exploration
K1
K2
K3
2
8
6
nQuanta = 8 Max cores = (4,2,4)
Tim
e
![Page 15: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/15.jpg)
Analytical Design Exploration
K1
K2
K3
2
8
6
nQuanta = 8 Max cores = (4,2,4)
Tim
e
K1
K2
K3
2
4
6
K2
![Page 16: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/16.jpg)
Analytical Design Exploration
K1
K2
K3
2
8
6
nQuanta = 8 Max cores = (4,2,4)
Tim
e
K1
K2
K3
2
4
6
K2
K1
K2
K3
2
4
3
K2
K3
![Page 17: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/17.jpg)
Analytical Design Exploration
K1
K2
K3
2
8
6
nQuanta = 8 Max cores = (4,2,4)
Tim
e
K1
K2
K3
2
4
6
K2
K1
K2
K3
2
4
3
K2
K3
Only 2 cores of
K2 is possible!
![Page 18: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/18.jpg)
Analytical Design Exploration
K1
K2
K3
2
8
6
nQuanta = 16 Max cores = (8,4,8)
Tim
e
K1
K2
K3
2
4
6
K2
K1
K2
K3
2
4
3
K2
K3
Increase
nQuanta
![Page 19: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/19.jpg)
Analytical Design Exploration
K1
K2
K3
2
8
6
nQuanta = 16 Max cores = (8,4,8)
Tim
e
K1
K2
K3
2
4
6
K2
K1
K2
K3
2
4
3
K2
K3
K1
K2
K3
2
2
3
K2
K3
K2
K2
![Page 20: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/20.jpg)
Implementation and Verification
Core allocations from analytical model
• AutoPilot-C pragmas for suggested parallelism
Communication buffers and kernel-level
parallelism
SystemC simulation
Vivado Synthesis
![Page 21: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/21.jpg)
Stereo Matching
Two spatially separated color cameras
Distance in pixels between the same object in
the images infers depth
Complex algorithms to match pixels
![Page 22: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/22.jpg)
Case Study – Stereo Matcher
Census
Transform
Kernel
RGB to Lab
Conversion
Kernel
Left
GridBuilding
Kernel
Matching
Kernel
Cross
Correction
Kernel
Pre Filtering
Kernel
Median
Filtering Kernel
Census
Transform
Kernel
RGB to Lab
Conversion
Kernel
Right
GridBuilding
Kernel
Matching
Kernel
Pre Filtering
Kernel
Median
Filtering Kernel
Left Image Right Image
Left Depth Map Right Depth Map
![Page 23: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/23.jpg)
Design Space for Dual-Buffer Flow
1
10
100
1000
0 0.5 1 1.5 2 2.5 3
Lo
g L
ate
ncy (
Mcycle
s)
Sum of Normalized Resource Use
6x96(DB) 12x96(DB) 18x96(DB) 12X192(DB) 18X192(DB)
12x384(DB) 18x384(DB) 24x384(DB) 36x384(DB) 48x384(DB)
72x384(DB) 96x384(DB) 144x384(DB)
![Page 24: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/24.jpg)
Design Space for Dual-Buffer Flow
1
10
100
1000
0 0.5 1 1.5 2 2.5 3
Lo
g L
ate
ncy (
Mcycle
s)
Sum of Normalized Resource Use
6x96(DB) 12x96(DB) 18x96(DB) 12X192(DB) 18X192(DB)
12x384(DB) 18x384(DB) 24x384(DB) 36x384(DB) 48x384(DB)
72x384(DB) 96x384(DB) 144x384(DB)
![Page 25: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/25.jpg)
Design Space for Dual-Buffer Flow
1
10
100
1000
0 0.5 1 1.5 2 2.5 3
Lo
g L
ate
ncy (
Mcycle
s)
Sum of Normalized Resource Use
6x96(DB) 12x96(DB) 18x96(DB) 12X192(DB) 18X192(DB)
12x384(DB) 18x384(DB) 24x384(DB) 36x384(DB) 48x384(DB)
72x384(DB) 96x384(DB) 144x384(DB)
![Page 26: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/26.jpg)
Design Space for Dual-Buffer Flow
1
10
100
1000
0 0.5 1 1.5 2 2.5 3
Lo
g L
ate
ncy (
Mcycle
s)
Sum of Normalized Resource Use
6x96(DB) 12x96(DB) 18x96(DB) 12X192(DB) 18X192(DB)
12x384(DB) 18x384(DB) 24x384(DB) 36x384(DB) 48x384(DB)
72x384(DB) 96x384(DB) 144x384(DB)
![Page 27: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/27.jpg)
Design Space for Dual-Buffer Flow
1
10
100
1000
0 0.5 1 1.5 2 2.5 3
Lo
g L
ate
ncy (
Mcycle
s)
Sum of Normalized Resource Use
6x96(DB) 12x96(DB) 18x96(DB) 12X192(DB) 18X192(DB)
12x384(DB) 18x384(DB) 24x384(DB) 36x384(DB) 48x384(DB)
72x384(DB) 96x384(DB) 144x384(DB)
Selected solution
![Page 28: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/28.jpg)
Design Space for Dual and Single Buffer
Flow
1
10
100
1000
0 0.5 1 1.5 2 2.5 3
Lo
g L
ate
ncy (
MC
ycle
s)
Sum of Normalized Resource Use
SB
DB
![Page 29: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/29.jpg)
Design Space for Dual and Single Buffer
Flow
1
10
100
1000
0 0.5 1 1.5 2 2.5 3
Lo
g L
ate
ncy (
MC
ycle
s)
Sum of Normalized Resource Use
SB
DB
![Page 30: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/30.jpg)
Performance–Power Comparison
HLS of sequential code achieved
speedup of 6.9x over software
[FPT’11]
HLS of CUDA parallel code
achieved speedup of >50x over
sequential software **
Greater exposed parallelism
provides synthesis tool greater
opportunity for optimization
0
0.2
0.4
0.6
0.8
1
1.2
NormalizedLatency
Normalized Energy
GPU
FPGA
![Page 31: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/31.jpg)
Challenges in Automation – Single Kernel
Single kernel synthesis
• Critical: Replicating the initial solution for concurrency
Multi-kernel synthesis
![Page 32: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/32.jpg)
Optimize thread index computations
• Solution: Improved analytical techniques in FCUDA to
optimize index computations
Floating-point to fixed-point computations
• Solution: Automatic transformation with functional
verification of transform
Inefficient implementations of difficult operations
• Solution: Automatic instantiation of library elements
for common but challenging operations
Challenges in Automation – Single Kernel
![Page 33: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/33.jpg)
Challenges in Automation – Multiple Kernel
Selection of single-core implementation • Solution: Complex value function, knowledge of resource
criticality, and iteration of entire design flow
Automatic buffer-generation and insertion • Solution: Complex memory access pattern analysis and
transformations (See upcoming FPGA 13 paper)
Performance estimation within synthesis process • Solution: Improved analytical model for loop bounds, trip
counts, resource estimates
Sub-kernel optimizations to match pipeline stage latencies • Solution: Improved ability to combine
or split pipeline stages
S1
S2
S3
S1
S21
S22
S3
![Page 34: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/34.jpg)
Conclusion
Multi-kernel CUDA synthesis is important
Manual process for mapping multiple dependent CUDA kernels to FPGA
Performance parity with GPU consuming 16x less energy than GPU • Benefit of data-parallel input language for HLS
Fully automating multi-kernel synthesis is challenging
![Page 35: High-Level Synthesis of Multiple Dependent CUDA …High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Swathi Gurumani2, Hisham Cholakkai2, Yun Liang3, Kyle Rupnow12,](https://reader031.vdocument.in/reader031/viewer/2022011914/5fb884a118125670d5366653/html5/thumbnails/35.jpg)
Acknowledgement
A*STAR HSS Funding
Peking University
University of Illinois at Urbana-Champaign
ADSCs Lab Colleagues
• Hongbin Zheng
• Muhammad Teguh Satria