Parallelization and Optimization of Feature Detection Algorithms on Embedded GPU

Seung Heon Kang, Seung-Jae Lee, and In Kyu Park

Department of Information and Communication Engineering, Inha University, Incheon 402-751, Korea

Email: {[email protected], [email protected], [email protected]}

Abstract—In this paper, we parallelize and optimize two popular feature detection algorithms, SIFT and SURF, on the latest embedded GPU. Using the conventional OpenGL shading language and the recently developed OpenCL as the GPGPU software platforms, we compare their implementation efficiency and speed against each other as well as between the GPU and the CPU. Experimental results show that the implementation on OpenCL is more efficient but has performance comparable to that of OpenGL. Compared with the embedded CPU in the same application processor, the embedded GPU runs 4 to 5 times faster. Furthermore, we measure and compare the power consumption of each implementation, which shows that OpenCL consumes less energy than OpenGL.

Index Terms—Parallelization, GPGPU, OpenGL, OpenCL, SIFT, SURF

I. INTRODUCTION

The recent embedded GPU (graphics processing unit) is a powerful workhorse for computer graphics as well as for GPGPU (general purpose computation on GPU) applications on a smartphone. Unlike desktop GPUs, which have thousands of processing elements, embedded GPUs do not yet support massive parallelism. However, recent embedded GPUs have advanced to have multiple cores and use multithreading to provide a small degree of parallelism, which is still useful for GPGPU on the smartphone. In particular, many image processing and computer vision algorithms fit well onto the GPGPU architecture of both desktop GPUs [5] and embedded GPUs [6]. It is expected that more camera and image related applications on smartphones will utilize the GPU’s computing power.

In this paper, we select SIFT (scale-invariant feature transform) [4] and SURF (speeded up robust features) [1] as case studies and implement them on an embedded GPU to show that the parallel implementation outperforms the usual serial implementation on an embedded CPU. The conventional OpenGL shading language (GLSL) [3] as well as the more recent OpenCL [2] are used in our implementation on the embedded GPU.

To the best of our knowledge, this is the first attempt to implement such complex computer vision algorithms on an embedded GPU using OpenCL. Note that GPUs in smartphone application processors (APs) have only very recently begun to support the OpenCL specification fully, and the OpenCL driver implementations are not yet optimized.

Fig. 1. Image processing frameworks on an embedded GPU. (a) Based on OpenGL shading language [3]. (b) Based on OpenCL [2].

This paper is organized as follows. In Section II, we briefly summarize the embedded GPGPU frameworks for image processing. In Section III, the parallelization and implementation of the SIFT and SURF algorithms on the embedded GPU are described in detail. Experimental results are provided in Section IV. Finally, we give concluding remarks in Section V.

II. GPGPU IMAGE PROCESSING FRAMEWORKS

A. Image Processing Using OpenGL Shading Language

The OpenGL shading language has been a conventional tool for GPGPU on embedded GPUs during the last decade. The graphics pipeline is customized by vertex and pixel shader programming, so that non-graphics applications can run on the GPU in parallel. In the case of image processing, per-pixel operations are performed in the fragment shader using off-screen rendering and a frame buffer object (FBO). As shown in Fig. 1 (a), the input image is first loaded as a texture. Using the texture as the input of the shader program, image processing is performed and the output is stored in another FBO in a single rendering pass. The pixel data in the FBO can then be switched to be the input for a subsequent shader program.

2014 International Workshop on Advanced Image Technology

TABLE I
OPENCL IMPLEMENTATION DETAILS OF SIFT FEATURE EXTRACTION ALGORITHM.

| Steps | # of Kernels | Parallelization Amount and Strategy | Work-Item Dimension (in a Work-Group) |
| --- | --- | --- | --- |
| RGB2Gray | 1 | Per-pixel | 16×16 |
| Pyramid | 1 | Per-pixel using local memory | 32×8, 64×4, 32×2 |
| DoG | 1 | Per-pixel | 8×8 |
| Keypoint Localization | 1 | Per-pixel | 16×16 |
| Orientation Computation | 1 | Per-feature using local memory | 16×1 |
| Descriptor Generation | 1 | Per-pixel | 16×16 |

The main drawbacks of GPGPU on OpenGL are that (1) the scattering operation is limited, i.e. a parallel thread can write only a few bytes of pixel value at a fixed location in the FBO, and (2) there is no synchronization during rendering. These constraints hinder the implementation of general algorithms on the GPU.
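The scattering limitation can be illustrated with a histogram, which is naturally a scatter (the write address depends on the data) but has to be recast as a gather when each thread may write only to its own fixed output location. The following Python sketch is only an illustration of that distinction; the function names and the 4-bin layout are assumptions made for the example and do not come from the implementation described here.

```python
# Sketch only: contrasts scatter and gather formulations of a histogram.
# Function names and the bin layout are illustrative assumptions.

def histogram_scatter(pixels, num_bins):
    """Scatter: each input pixel writes to a data-dependent output location."""
    bins = [0] * num_bins
    for value in pixels:
        bins[value * num_bins // 256] += 1   # write address depends on the data
    return bins


def histogram_gather(pixels, num_bins):
    """Gather: each output element reads all inputs and writes only its own slot."""
    bins = []
    for b in range(num_bins):                # one "fragment" per output bin
        count = sum(1 for value in pixels if value * num_bins // 256 == b)
        bins.append(count)                   # fixed write location
    return bins


pixels = [0, 10, 64, 100, 128, 200, 255, 255]
print(histogram_scatter(pixels, 4))
print(histogram_gather(pixels, 4))
```

Both functions produce the same result, but the gather form reads every input once per output element, which is the kind of extra work a fixed-location write model tends to impose.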

B. Image Processing Using OpenCL

OpenCL is the new standard GPGPU tool for embedded GPUs. Unlike OpenGL, which implements GPGPU applications as a corner case of computer graphics, OpenCL is a pure parallel computing library for heterogeneous platforms including CPUs, GPUs, and even DSPs. As shown in Fig. 1 (b), image processing is performed using thread-level parallelism on the image2d_t data structure. A kernel function usually works on a single pixel or on multiple pixels to produce the output image, which is copied to CPU memory at the end. OpenCL offers significantly improved flexibility over OpenGL, such as parallel thread synchronization and unrestricted memory writes. Therefore, most future GPGPU applications on smartphones are expected to be accelerated on the OpenCL platform.
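The per-pixel execution model can be sketched by emulating a two-dimensional index space serially in Python. This is a minimal illustration, not the paper's kernel code: the names rgb2gray_kernel and run_ndrange, the luma weights (0.299, 0.587, 0.114), and the list-of-rows image layout are all assumptions made for the example. The pair (gx, gy) plays the role that get_global_id(0) and get_global_id(1) play in an OpenCL kernel.

```python
# Sketch only: emulates an OpenCL-style 2D index space in plain Python.
# Kernel name, luma weights, and image layout are illustrative assumptions.

def rgb2gray_kernel(gx, gy, src, dst):
    """Body executed by one work-item for the pixel at (gx, gy)."""
    r, g, b = src[gy][gx]
    dst[gy][gx] = int(0.299 * r + 0.587 * g + 0.114 * b + 0.5)


def run_ndrange(kernel, width, height, *args):
    """Serial stand-in for launching one work-item per pixel."""
    for gy in range(height):
        for gx in range(width):
            kernel(gx, gy, *args)


src = [[(255, 255, 255), (255, 0, 0)],
       [(0, 255, 0), (0, 0, 255)]]
dst = [[0] * 2 for _ in range(2)]
run_ndrange(rgb2gray_kernel, 2, 2, src, dst)
print(dst)
```

Because every work-item touches only its own pixel, the loop iterations are independent and could run in any order, which is what makes this step straightforward to parallelize.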

III. IMPLEMENTING FEATURE EXTRACTION ALGORITHMS ON EMBEDDED GPU

In this section, the implementation issues are addressed for the SIFT and SURF algorithms. We do not describe the algorithms in detail; the reader should refer to the original papers [1], [4].

A. SIFT

SIFT is a widely used feature extraction algorithm which is robust to rotation, illumination, and viewpoint change. It consists of the following 6 steps.

1) RGB to grayscale conversion
2) Image pyramid construction
3) Difference of Gaussian (DoG) computation
4) Keypoint localization
5) Orientation computation
6) Descriptor generation

The OpenCL implementation details are listed in Table I. RGB to grayscale conversion is a trivial operation which can easily be done on the GPU on a per-pixel basis. In image pyramid construction, we use 4 octave levels. The difference of Gaussian operation is performed to detect the points of interest. Note that it is the most serious computational bottleneck when it is implemented serially on the CPU. We adjust the local work size, i.e. the work-item dimension within a work-group, adaptively. In the keypoint localization procedure, non-maximum suppression (NMS) as well as the suppression of edge responses are performed on a per-pixel basis.
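The difference of Gaussian step can be sketched as two Gaussian blurs followed by a per-pixel subtraction. The Python code below is a serial illustration under stated assumptions: the sigma values (1.0 and 1.6), the 3-sigma kernel radius, the clamped border handling, and all function names are choices made for the example, not details taken from the implementation in Table I.

```python
import math

# Sketch only: serial difference of Gaussians on a small grayscale image.
# Sigma values, kernel radius, and border handling are illustrative assumptions.

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian weights with a radius of about 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(image, sigma):
    """Separable Gaussian blur (horizontal pass, then vertical) with clamped borders."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    h, w = len(image), len(image[0])
    tmp = [[sum(k[i + r] * image[y][min(max(x + i, 0), w - 1)]
                for i in range(-r, r + 1))
            for x in range(w)] for y in range(h)]
    return [[sum(k[i + r] * tmp[min(max(y + i, 0), h - 1)][x]
                 for i in range(-r, r + 1))
             for x in range(w)] for y in range(h)]


def difference_of_gaussians(image, sigma1, sigma2):
    """Per-pixel difference between the wider and the narrower blur."""
    a, b = blur(image, sigma1), blur(image, sigma2)
    return [[b[y][x] - a[y][x] for x in range(len(image[0]))]
            for y in range(len(image))]


impulse = [[0.0] * 7 for _ in range(7)]
impulse[3][3] = 1.0
dog = difference_of_gaussians(impulse, 1.0, 1.6)
print(dog[3][3] < 0)
```

Every output pixel depends only on a fixed neighborhood of the input, so each pass maps onto one work-item per pixel; the subtraction itself needs no neighborhood at all.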

In the next step, the direction and magnitude of the keypoints are computed and accumulated at each octave. The representative orientation and scale are determined using the keypoints’ histogram. These are all done on a per-pixel basis too. Note that, in this procedure, local memory is used to reduce the memory access time.
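The histogram-based orientation selection can be sketched as follows. This is a simplified illustration, assuming a 36-bin histogram over 360 degrees as in the original SIFT paper [4], central-difference gradients, and no Gaussian weighting or peak interpolation; the function name dominant_orientation is hypothetical.

```python
import math

# Sketch only: gradient-orientation histogram for one patch.
# The 36-bin layout follows [4]; everything else is a simplifying assumption.

def dominant_orientation(patch, num_bins=36):
    """Return (peak orientation in degrees, histogram) for a grayscale patch."""
    hist = [0.0] * num_bins
    h, w = len(patch), len(patch[0])
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            dx = patch[y][x + 1] - patch[y][x - 1]     # central differences
            dy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(dx, dy)
            angle = math.degrees(math.atan2(dy, dx)) % 360.0
            hist[int(angle * num_bins / 360.0) % num_bins] += magnitude
    peak = max(range(num_bins), key=lambda b: hist[b])
    return peak * 360.0 / num_bins, hist


# A horizontal intensity ramp: the gradient points along +x everywhere.
ramp = [[float(x) for x in range(5)] for _ in range(5)]
orientation, hist = dominant_orientation(ramp)
print(orientation, hist[0])
```

The accumulation into shared bins is the part that benefits from fast on-chip storage, since many gradient samples update the same small histogram.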

B. SURF

SURF is another widely used feature extraction algorithm which has similar performance to SIFT but requires significantly less computation. It consists of the following 6 steps.

1) RGB to grayscale conversion
2) Integral image computation
3) Hessian determinant computation using box filters
4) Non-maximum suppression (NMS) to find local maxima where feature points are located
5) Orientation computation using Haar response
6) Descriptor generation

The OpenCL implementation details are listed in Table II. The first three steps are performed on a per-pixel basis, where the total number of parallel work-items is the same as the number of pixels. After NMS, the location of each feature point is determined. We copy the location information to CPU memory to construct a feature table on the CPU. The feature table is then transferred back to GPU memory. Therefore, in the last two steps, the total number of parallel work-items is the same as the number of feature points.
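The reason the integral image precedes the box-filter step can be shown with a short sketch: once the integral image exists, the sum over any axis-aligned box costs four lookups regardless of the box size. The Python code below is a serial illustration; the function names and the extra zero row and column are conventions assumed for the example.

```python
# Sketch only: serial integral image and constant-time box sums.
# Function names and the padded layout are illustrative assumptions.

def integral_image(image):
    """ii[y][x] holds the sum of all pixels above and to the left of (x, y)."""
    h, w = len(image), len(image[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]   # extra zero row and column
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += image[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x0, y0, x1, y1):
    """Sum over pixels with x0 <= x < x1 and y0 <= y < y1 using four lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
ii = integral_image(image)
print(box_sum(ii, 0, 0, 3, 3))   # whole image
print(box_sum(ii, 1, 1, 3, 3))   # lower-right 2x2 block
```

Since each box sum reads the integral image without modifying it, a per-pixel work-item can evaluate all the box filters it needs independently of its neighbors.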


TABLE II
OPENCL IMPLEMENTATION DETAILS OF SURF FEATURE EXTRACTION ALGORITHM.

| Steps | # of Kernels | Parallelization Amount and Strategy | Work-Item Dimension (in a Work-Group) |
| --- | --- | --- | --- |
| RGB2Gray | 1 | Per-pixel | 16×16 |
| Integral Image | 2 | Reduction using local memory | 64×1 |
| Hessian Determinant | 10 | Per-pixel | 8×8, 16×4, 16×2, 16×1 |
| NMS | 8 | Per-pixel | 16×8, 32×4, 32×2, 16×1 |
| Orientation Computation | 1 | Per-feature using local memory | 16×1 |
| Descriptor Generation | 1 | Per-feature using local memory | 2×2 |

In integral image computation, we employ the common parallel reduction algorithm. The intermediate steps can be synchronized inside the OpenCL kernel function. Therefore, we can reduce the number of kernels significantly (from 21 to 2) compared with the OpenGL shader implementation. Since the local memory is much faster than the global memory, we use it as much as possible, including in the integral image computation step. Note that the use of the local memory indeed improves the performance.
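One way to picture why in-kernel synchronization collapses many passes into few is a log-step prefix sum, where each pass would be separated by a barrier inside a single kernel instead of by a separate rendering pass. The Python sketch below emulates this with a Hillis-Steele style inclusive scan, applied first along rows and then along columns. The choice of this particular scan and the two-pass split are assumptions for illustration; the exact reduction scheme used in the implementation is not specified here.

```python
# Sketch only: log-step inclusive scan, emulated serially.
# The Hillis-Steele scheme and the row/column split are illustrative assumptions.

def inclusive_scan(values):
    """Each while-iteration is one pass; a barrier would separate the passes."""
    data = list(values)
    offset = 1
    while offset < len(data):
        # Every "work-item" i reads the previous buffer and writes its own slot.
        data = [data[i] + data[i - offset] if i >= offset else data[i]
                for i in range(len(data))]
        offset *= 2
    return data


def integral_image_by_scans(image):
    """Row scans followed by column scans, one conceptual kernel each."""
    rows = [inclusive_scan(row) for row in image]
    cols = [inclusive_scan(col) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]         # transpose back to row-major


print(inclusive_scan([1, 2, 3, 4, 5]))
print(integral_image_by_scans([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Without a barrier, each of the log-step passes would have to be issued as its own pass over the data, which is consistent with the larger pass count reported for the shader version.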

Compared with the corresponding OpenGL shader implementation, we reduce the number of kernels by more than half (from 51 to 23). This is mainly because of the improved flexibility of OpenCL over the OpenGL shader, which includes the synchronization capability inside a kernel and the absence of limitations on scattering (writing to memory) operations.

C. Performance Optimization

Unfortunately, OpenCL code optimization is not well addressed in the currently available documentation. In this paper, we employ the general techniques used in [6] and use the local memory whenever possible. The work-item dimension of a single work-group is an important factor in performance optimization; in our implementation, it is determined empirically. In addition, we try to reduce the number of kernels, as described in the previous subsection.
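To make the work-group dimension concrete, the following Python sketch computes how a given local size tiles a 1280×720 image, rounding the global size up to a multiple of the local size (OpenCL 1.x requires the global size to be divisible by the local size). The helper name ndrange_for is hypothetical, and the sketch only does the arithmetic; it does not predict which shape is fastest, which still has to be found empirically.

```python
# Sketch only: arithmetic for tiling an image with work-groups.
# The helper name is hypothetical; local sizes are taken from Tables I and II.

def ndrange_for(width, height, local_w, local_h):
    """Return the padded global size and the resulting number of work-groups."""
    global_w = -(-width // local_w) * local_w    # ceiling to a multiple of local_w
    global_h = -(-height // local_h) * local_h
    groups = (global_w // local_w) * (global_h // local_h)
    return (global_w, global_h), groups


for local in [(16, 16), (8, 8), (32, 8), (64, 4)]:
    print(local, ndrange_for(1280, 720, *local))
```

When the image size is not a multiple of the local size, the padded work-items need a bounds check inside the kernel so that they do not read or write outside the image.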

IV. EXPERIMENTAL RESULTS

We use the latest smartphone, i.e. Samsung’s Galaxy S4 LTE-A, as the implementation platform. It is equipped with Qualcomm’s Snapdragon 800 application processor, which has a Krait quad-core 2.3 GHz CPU and an Adreno 330 quad-core 450 MHz GPU. The operating system is Android 4.2. The test images are normalized to 1280 × 720.

First, we build the framework shown in Fig. 1 to capture the camera input and display the feature extraction result continuously. Then, the implemented SIFT and SURF algorithms are run on the device. Note that all background jobs on the smartphone are terminated before the experiment to measure the execution time as accurately as possible.

A. SIFT

Several visual results of SIFT feature extraction are shown in Fig. 3 (a). Running the SIFT algorithm on the embedded GPU shows similar average performance with the OpenGL shading language (229.6 ms/frame) and OpenCL (212.2 ms/frame), since the parallelization of the algorithm is almost the same in both languages. Note that OpenCV on Android runs at 1147 ms/frame. Although the detailed implementation of SIFT differs between OpenCV and our implementation, the GPU version runs about 5 times faster.

Fig. 2. Power measurement setup using Monsoon Solutions’ Power Monitor.

B. SURF

In Fig. 3 (b), we show the feature extraction result using SURF on the embedded GPU. The implemented SURF algorithm runs at 179.1 ms/frame and 255.3 ms/frame using the OpenGL shading language and OpenCL, respectively. OpenCV SURF runs at 965.5 ms/frame, which is significantly slower than the GPU implementation. Unlike the SIFT algorithm, the SURF algorithm has simple integer-based operations in the integral image and Hessian determinant computation. However, because the embedded GPU’s integer operation capability is not good enough, the speedup over the embedded CPU is slightly less than that of the SIFT algorithm.
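The speedup factors implied by the reported per-frame times can be recomputed directly. The short Python sketch below uses only the timings quoted in this section; the dictionary layout and the function name speedups are choices made for the example.

```python
# Sketch only: speedup over the CPU baseline from the reported ms/frame values.
# The data layout and function name are illustrative; the timings are from the text.

timings_ms = {
    "SIFT": {"OpenGL": 229.6, "OpenCL": 212.2, "OpenCV (CPU)": 1147.0},
    "SURF": {"OpenGL": 179.1, "OpenCL": 255.3, "OpenCV (CPU)": 965.5},
}


def speedups(times):
    """Ratio of the CPU time to each GPU time, rounded to two decimals."""
    cpu = times["OpenCV (CPU)"]
    return {name: round(cpu / t, 2)
            for name, t in times.items() if name != "OpenCV (CPU)"}


for algorithm, times in timings_ms.items():
    print(algorithm, speedups(times))
```

Note that these ratios compare different implementations of each algorithm, so they indicate overall throughput on the device rather than a like-for-like measure of the hardware.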

In addition, we measure the power consumption on the smartphone. A commercial power measurement device (Monsoon Solutions’ Power Monitor) is used for this purpose, as shown in Fig. 2. The per-frame average power consumption for the OpenGL shading language and OpenCL is 3333.3 mW and 2645.5 mW, respectively. This shows that OpenCL consumes less, or at least comparable, power than the OpenGL shading language.

Fig. 3. Feature extraction results using OpenGL on the embedded GPU (Adreno 330). They are captured on the smartphone’s screen. (a) SIFT. (b) SURF.

V. CONCLUSION

In this paper, we parallelized and optimized popular feature detection algorithms on the latest embedded GPU. We showed that the embedded GPU significantly outperformed the comparable embedded CPU. We also showed that the implementation on OpenCL is more efficient but has performance comparable to that of the OpenGL shading language. Furthermore, we measured the power consumption of each implementation and observed that OpenCL consumed less energy than OpenGL.

GPUs in smartphone application processors have only very recently begun to support the OpenCL specification fully. OpenCL offers significantly improved flexibility over OpenGL, such as parallel barriers and scattering operations. We expect that many GPGPU applications on smartphones will be accelerated on the OpenCL platform from now on.

ACKNOWLEDGMENT

This work was supported by the Industrial Strategic Technology Development Program (10041664, The Development of Fusion Processor based on Multi-Shader GPU) funded by the Ministry of Trade, Industry and Energy (MOTIE, Korea).

REFERENCES

[1] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool. Speeded-up robust features. Computer Vision and Image Understanding, 110(3):346–359, June 2008.

[2] Khronos Group. Open Computing Language (OpenCL). http://www.khronos.org/opencl/.

[3] Khronos Group. OpenGL ES. http://www.khronos.org/opengl/.

[4] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, November 2004.

[5] I. K. Park, N. Singhal, M. H. Lee, S. Cho, and C. Kim. Design and performance evaluation of image processing algorithms on GPUs. IEEE Trans. on Parallel and Distributed Systems, 22(1):91–104, January 2011.

[6] N. Singhal, J. W. Yoo, H. Y. Choi, and I. K. Park. Implementation and optimization of image processing algorithms on embedded GPU. IEICE Trans. on Information and Systems, E95-D(5):1475–1484, May 2012.

