![Page 1: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/1.jpg)
A framework for optimizingOpenVX Applications on
Embedded Many‐Core AcceleratorsGiuseppe Tagliavini, DEI ‐ University of Bologna
Germain Haugou, IIS ‐ ETHZAndrea Marongiu, DEI ‐ University of Bologna & IIS ‐ ETHZ
Luca Benini, DEI ‐ University of Bologna & IIS ‐ ETHZ
![Page 2: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/2.jpg)
Outline
Introduction ADRENALINE: virtual platform ADRENALINE: OpenVX run‐time Conclusion
![Page 3: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/3.jpg)
Many‐core acceleratorsfor signal/image processing
1 > 1003 6
CPU GPGPU HW IP
GOPS/W
Accelerator Gap
SW HWMixed
ThroughputComputing
General‐purposeComputing
![Page 4: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/4.jpg)
Some examples: STM STHORM, Kalray MPPA, PULP
Clustered many‐core accelerators (CMA)
Cluster Cluster… L2
Host
L3
Cluster‐baseddesign
Cluster memory(optional)
Multi‐coreprocessor
DDR3 memory SoC design(optional)
CC
PE PEPE PE PE PEPE …
L1 DMA HWS
Cluster controller(optional)
MPMDProcessingElements
Low latencyshared TCDM memory
DMA engine
(L1 ↔L3)
HW synchronizer
Many‐core accelerator
![Page 5: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/5.jpg)
PULPParallel Ultra‐Low‐Power platform
L2MEMORY
PERIPHERALS
BRIDGE
BRIDGE
SoCVOLTAGE DOMAIN(0.8V)
INSTRUCTION BUS
I$ I$ I$PE#0
PE#1
PE#N‐1
BRIDGES
CLUSTER VOLTAGEDOMAIN(0.5V‐0.8V)
LOW LATENCY INTERCONNECT
DMA...
...
CLUSTER
BUS
PERIPH
ERAL
INTERC
ONNEC
T
PERIPH
ERAL
S
to RMUs
...
RMU RMURMU
SRAM#0
SRAM#1
SRAM #M‐1
SCM #0
SCM #M‐1
SCM #1
SRAM VOLTAGE DOMAIN (0.5V – 0.8V)
Hybridmemorysystem
PEAK EFFICIENCY: 400GOPS/W
![Page 6: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/6.jpg)
OpenVX overview
Foundational API for vision acceleration– Focus on mobile and embedded systems
Stand‐alone or complementary to other libraries
Enable efficientimplementations on different devices– CPUs, GPUs, DSPs, many‐core accelerators
![Page 7: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/7.jpg)
OpenVX programming model The OpenVX model is based on a directed acyclic graph of
nodes (kernels), with data (images) as linkage
vx_image imgs[] = {vxCreateImage(ctx, width, height, VX_DF_IMAGE_RGB),vxCreateVirtualImage(graph, 0, VX_DF_IMAGE_U8),…vxCreateImage(ctx, width, height, VX_DF_IMAGE_U8),
};vx_node nodes[] = {
vxColorConvertNode(graph, imgs[0], imgs[1]),vxSobel3x3Node(graph, imgs[1], imgs[2], imgs[3]),vxMagnitudeNode(graph, imgs[2], imgs[3], imgs[4]),vxThresholdNode(graph, imgs[4], thresh, imgs[5]),
};vxVerifyGraph(graph);vxProcessGraph(graph);
![Page 8: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/8.jpg)
OpenVX DAG
I CC TS MV1
V2
V3
V4 O
Virtual images are not required to actually reside in main memoryThey define a data dependency between kernels, but they cannot be read/writtenThey are the main target of our optimization efforts
An OpenVX program must be verified to guarantee some mandatory properties:Inputs and outputs compliant to the node interfaceNo cycles in the graphOnly a single writer node to any data object is allowedWrites have higher priorities than reads.Virtual image must be resolved into concrete types
![Page 9: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/9.jpg)
ADRENALINE
Platformconfiguration
Applicationmapping
Run‐timesupport
i G Sgx
y
TestApplicationsPE0 PEn
Mem
…Virtual Platform
Run‐time policies
ADRENALINE
http://www‐micrel.deis.unibo.it/adrenaline/
![Page 10: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/10.jpg)
Outline
Introduction ADRENALINE: virtual platform ADRENALINE: OpenVX run‐time Conclusion
![Page 11: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/11.jpg)
Virtual platform (1)
The virtual platform is written in Python and C++– Python is used for the architecture configuration– C++ is used to provide an efficient implementation of internal model
A library of basic components is available, but custom blocks can also be implemented and assembled
![Page 12: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/12.jpg)
Virtual platform (2) Standard configuration:
– OpenRISC core. An Instruction Set Simulator (ISS) for the OpenRISC ISA, extended with timing models to emulate pipeline stalls
– Memories. Multi‐bank, constant‐latency timing mode– L1 interconnect. One transaction per memory bank serviced at each cycle
– DMA. Single synchronous request to the interconnect for each line to be transferred
– Shared instruction cache. Dedicated interconnect and memory banks.
![Page 13: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/13.jpg)
Virtual platform (3)
Host
PE PEPE PE PE PEPE …
L1 DMA
OpenRISC ISS
L2
L3
Parametric PEs
Size parameter
Size parameter
Configurablebandwidth/latency
Size parameter
OpenRISC ISS
![Page 14: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/14.jpg)
Outline
Introduction ADRENALINE: virtual platform ADRENALINE: OpenVX run‐time Conclusion
![Page 15: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/15.jpg)
A first solution: using OpenCL to accelerate OpenVX kernels
OpenCL is a widely used programming model for many‐core accelerators
First solution: OpenVX kernel == OpenCL kernel– When a node is selected for execution, the related OpenCL kernel is enqueued on the device
Limiting factor:– too much code!– memory bandwidth
![Page 16: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/16.jpg)
OpenCL bandwidth Experiments performed with OpenCL runtime on a STHORM
evaluation board same results using the virtual platform
290
922
7138
71
307
31
15
199
1391779
1
10
100
1000
10000
MB/s
OpenCL Available BW
![Page 17: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/17.jpg)
OpenVX for CMA
We realized an OpenVX framework for many‐core accelerators coupling a tiling approach with algorithms for graph partition and scheduling
Main goals:– Reducing the memory bandwidth– Maximize the accelerator efficiency
Several steps are required:– Tile size propagation– Graph partitioning– Node scheduling– Buffer allocation– Buffer sizing
![Page 18: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/18.jpg)
L1
L3
Localized execution
Reads/writes on L1 do no stall the PEs In real platforms the L1 is often too small to contain a full image In addition, multiple kernels requires more L1 buffers During DMA transfers cores are waiting
PE PEPE …
DMA
RGB to Grayscale
Localized execution when a kernel is executed by a many‐core accelerator, read/write operations are always performed on local buffers in the L1 scratchpad memory
![Page 19: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/19.jpg)
L1
L3
Localized execution with tiling
Single tiles always fit L1 memory Transfer latency is hidden by computation Tiling is not so trivial for all algorithms data access patterns
PE PEPE …
DMA
RGB to Grayscale
Images are partitioned into smaller blocks (tiles) Double buffering overlap between data transfers and
computation
![Page 20: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/20.jpg)
Common access patternsfor image processing kernels
(A) POINT OPERATORSCompute the value of each output point from the corresponding input point
Support: Basic tiling
(B) LOCAL NEIGHBOR OPERATORSCompute the value of a point in theoutput image that corresponds to the input windowSupport: Tile overlapping
(C) RECURSIVE NEIGHBOR OPERATORSLike the previous ones, but alsoconsider the previously computed values in the output windowSupport: Persistent buffer
(D) GLOBAL OPERATORSCompute the value of a point in the output image using the whole input imageSupport: Host exec / Graph partitioning
(E) GEOMETRIC OPERATORSCompute the value of a point in the output image using a non‐rectangular input windowSupport: Host exec / Graph partitioning
(F) STATISTICAL OPERATORSCompute any statistical functions of the image points
Support: Graph partitioning
![Page 21: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/21.jpg)
Tile size propagation
I K1 K5K2
K3
V1
V2
V3
V4
O
K4 V5
![Page 22: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/22.jpg)
Example (1)
NESTED GRAPH
L3 L1 L3
MEMORY DOMAINS
ADAPTIVE TILINGACCELERATOR SUB‐GRAPH HOST NODE
![Page 23: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/23.jpg)
Example (2)
a
i1
b
S N P NM
c d e
i2
S N P NM …
o1 o2
i3 i4
PEs
Host/CC
DMAin
DMAout
time
B0B1B2B3
B5B4
L3 access
![Page 24: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/24.jpg)
CMA kernel
__kernel void threshold(__global unsigned char *src, int srcStride,__global unsigned char *dst, int dstStride,short width, short height,short bandWidth, char nbCores,__global unsigned char *params) {
int i, j;
int id = get_id();unsigned char threshold = params[4];
int srcIndex = 0, dstIndex = 0;for (j=0; j<height; ++j) {for (i=id*bandWidth; i<(id+1)*bandWidth && i<width; ++i) {unsigned char value = src[srcIndex+i];dst[dstIndex+i] = (value >= threshold? value: 0);
}srcIndex+=srcStride;dstIndex+=dstStride;
}}
The global space is remapped on local buffers!
![Page 25: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/25.jpg)
Bandwidth reduction
34
307
36
8
2444
815 18
359
22
290
922
7138
71
307
31
15
199
1391779
1
10
100
1000
10000
MB/s
OVX OpenCL Available BW
![Page 26: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/26.jpg)
Speed‐up w.r.t. OpenCL
6.73
3.86 3.46 3.502.81
5.64
2.92
1.00
3.12
5.04
9.61
Random graph
Edge detector
Object detection
Super resolution
FAST9 Disparity Pyramid Optical Canny Retina preproc.
Disparity S4
0.00
2.00
4.00
6.00
8.00
10.00
12.00
![Page 27: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/27.jpg)
Outline
Introduction ADRENALINE: virtual platform ADRENALINE: OpenVX run‐time Conclusion
![Page 28: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/28.jpg)
Next steps
Virtual platform– More accurate models– Multi‐cluster configuration
OpenVX runtime– Models evolution
FPGA emulator
![Page 29: A framework for optimizing OpenVX Applications on Many ... · A framework for optimizing OpenVX Applications on Embedded Many‐Core Accelerators Giuseppe Tagliavini, DEI ‐University](https://reader030.vdocument.in/reader030/viewer/2022021809/5c65b6f009d3f2876e8d0ff9/html5/thumbnails/29.jpg)
THANKS!!!
Work supported by EU‐funded research projects