fpga based parallelized architecture of efficient graph...

FPGA based Parallelized Architecture of Efficient Graph based

Image Segmentation Algorithm

by

Roopal Nahar, Akanksha Baranwal, Madhava Krishna

in

2017 IEEE Int. Conf. on Robotics and Biomimetics

Report No: IIIT/TR/2017/-1

Centre for RoboticsInternational Institute of Information Technology

Hyderabad - 500 032, INDIADecember 2017

FPGA based Parallelized Architecture of EfficientGraph based Image Segmentation Algorithm

Roopal Nahar1, Akanksha Baranwal1, K.Madhava Krishna1

Abstract—Efficient and real time segmentation of color imageshas a variety of importance in many fields of computer visionsuch as image compression, medical imaging, mapping andautonomous navigation. Being one of the most computationallyexpensive operation, it is usually done through software imple-mentation using high-performance processors. In robotic systems,however, with the constrained platform dimensions and the needfor portability, low power consumption and simultaneously theneed for real time image segmentation, we envision hardwareparallelism as the way forward to achieve higher acceleration.Field-programmable gate arrays (FPGAs) are among the bestsuited for this task as they provide high computing power in asmall physical area. They exceed the computing speed of softwarebased implementations by breaking the paradigm of sequentialexecution and accomplishing more per clock cycle operations byenabling hardware level parallelization at an architectural level.

In this paper, we propose three novel architectures of a wellknown Efficient Graph based Image Segmentation algorithm.These proposed implementations optimizes time and powerconsumption when compared to software implementations. Thehybrid design proposed, has notable furtherance of accelerationcapabilities delivering atleast 2X speed gain over other implemen-tations, which henceforth allows real time image segmentationthat can be deployed on Mobile Robotic systems.

I. INTRODUCTION

Image segmentation is a key component of robotic vi-sion systems and is used widely in applications that entailsuperpixelling and unsupervised segmentation. As a conse-quence, many different approaches have been proposed inthis area like clustering [1], region-based growing [2], graphcuts [3][4],super-pixeling [5]. However, nowadays, variousmethods of deep learning like CNN [6], SegNet [7], semanticimage segmentation [6] are commonly used.

FPGA (Field-programmable gate arrays) implementationsfind relevance in Robotics due to its small physical area, lightweight and its high capability of delivering tightly packed,energy efficient designs. FPGA provides real time paral-lelization, gate level control of system architecture allowingcontrol over minute details of the arithmetic design. Theyalso provide an opportunity to pipeline sequential processes.FPGA implementations of Image processing descriptors [8],collision avoidance [9], further certify, that use of FPGA is anideal solution for robotic systems with constrained dimensions.Hence, we can exploit the hardware flexibility, parallelization,logical, electrical and physical advantages provided by FPGA.And consequently, will help to achieve real time and energy

1 Robotics Research Center, IIIT-Hyderabad, [email protected]@[email protected]

Fig. 1: Input Image and Segmented Output Image usingEfficient Graph Based Segmentation approach on FPGA

efficient image segmentation as compared to the software andother hardware implementations.The research work that hasbeen done in the field of FPGA based Image Segmentationfinds its applications in robotics, computer vision and medicalimaging [10][11][12].

This paper focuses on three novel architectures - Sequential,Pipelined and Hybrid of the well known Efficient Graph basedSegmentation Algorithm[4] on FPGA. The sequential designexploits the gate level control of system architecture allowingthe control over minute details of arithmetic design whichis difficult in CPU implementation. Due to sequential natureof the sub-modules, this architecture is further modified to apipelined design, by allowing interleaving of the processingsteps of the sub-modules used in the algorithm. Parallelismand pipelining have been incorporated into the hybrid designby making multiple copies of elementary modules and us-ing them in parallel, along with scheduling them efficiently.These hardware implementations reduce power dissipation andachieve real time segmentation.

Comparative study of these three novel architectures hasbeen done in terms of clock cycles and power dissipationto bring out vividly the advantages of FPGA over softwareimplementation which generally follows sequential approach.The sequential approach shows a tangible improvement incomputation time when compared to CPU implementation,but the hybrid architecture delivers an acceleration gain of atleast 2X to the CPU implementation with much lesser powerdissipation. The other results for the same will be delineatedin more detail in the results section. The implementation isdone using Verilog HDL language and the same is simulatedand synthesized using Xilinx Vivado Design Suite.

II. EFFICIENT GRAPH BASED IMAGE SEGMENTATIONALGORITHM

Efficient Graph based Segmentation algorithm by Felzen-szwalb and Huttenlocher [4] has turned out to be populardue to its simplicity and high fidelity outputs. An importantcharacteristic of this algorithm is its ability to preserve detail

Fig. 2: Block diagram of Efficient Graph Based Image Segmentation.

in low variability image regions while ignoring detail in highvariability regions. The overall flow of the algorithm [4] canbe well-explained by the block diagram illustrated in Fig. 2.

Fig. 3: Instead of considering the 8 conventional neighborsaround a vertex Va, we consider only these 4 neighbors(Vb1, Vb2, Vb3, Vb4) for edge weight calculations to removeredundancy

In this algorithm, a color image is given as input in RGB for-mat which undergoes smoothening. Image is then representedas a weighted undirected graph G=(V, E). Here, V representsvertices (the set of pixels in the image) and E represents theset of edges defined between two adjacent vertices. For everyvertex say Va, there will be four vertices (Vb1, Vb2, Vb3 andVb4) that will be considered for graph formation as shown infigure3. Each edge (va,vb) ∈ E has a corresponding weightw(va,vb), which is a non-negative measure of the dissimilaritybetween neighboring elements va and vb. The weight of theedge E(va,vb) is the Euclidean distance between them in RGBcolor space. Preprocessing and graph initialization is done inthe sub-module named Preprocessing and Graph initialization.

In graph based approach, Segmentation S is partitioningof V into components such that each component C∈ Scorresponds to a connected component in a graph G’= (V,E’),where E’ ∈ E. In Threshold based Graph agglomerationsub-module, we define the internal difference of a component(Int(C)) C ⊆V to be the largest weight in the minimumspanning tree of the component, MST(C,E).

Int(C) = maxe∈MST (C,E)

w(e) (1)

We iterate over each edge to evaluate if there is anyevidence of boundary between a pair of components (Ci, Cj)joined by the edge. Pairwise comparison predicate (D(Ci,Cj))is used to verify that if the two components are disjointand the weight of the edge joining them is less than theminimum internal difference (MInt) of both the componentsthen they are merged on basis of threshold function τ using

Eq. 2. Pairwise comparison predicate D(Ci, Cj) is defined as

D(C i, C j) =

{true if Dif(Ci, Cj)>MInt(Ci, Cj)false otherwise

(2)

where the minimum internal difference, MInt, is defined as,

MInt(C i, C j) = min(Int(C i) + τ(C i), Int(C j) + τ(C j)). (3)

and the difference between the two components Dif (Ci, Cj)is defined as:

Dif(C i, C j) = minvi∈Ci,vj∈Ci,(vi,vj)∈E

w(vi, vj) (4)

The threshold function τ used in Eq. 3 is used to controlthe degree, to which Dif(Ci, Cj) must be larger than MInt(Ci,Cj). Threshold function τ , is defined as,

τ(C) = k/|C| (5)

where |C| is the size of the component and k is some constant.Size based Graph Agglomeration sub-module does merging

of the components based on min size factor. Componentsget merged with their neighboring components if theirsizes are less than min size which is defined by the user.Post-processing and rendering sub-module reconstructs themodified graph into an image. It also recolors the imagebased on the new labels assigned to the pixels. This newimage formed is the segmented output.

III. PROPOSED ARCHITECTURES FOR FPGA

Image segmentation using graph based approach involvessolving equations [1-5] for each component of the graph andfinally merging them based on threshold and min size. Thissection delineates the three proposed architectures - sequential,pipelined and hybrid in detail below:

A. Sequential Architecture

Figure 4 shows the architectural flow of the sequentialimplementation. After the initial preprocessing and graphinitialization, as explained in the previous section, the sortedvalues of Va, Vb and W are stored in three different DualPort BRAMs. Due to dimensional constraints, while storingin BRAM, each address of BRAM corresponds to four datasegments. Every vertex has four attributes namely rank, label,threshold and size. All labels are assigned a different valueand size 1 because initially each vertex is considered as adifferent component. Threshold for all vertices is assigned

Fig. 4: Block diagram of proposed Sequential Architecture

the same value as that of the global threshold. After variabledeclaration and initialization, these values are stored in fourdifferent Dual Port BRAMs. These seven BRAMs are sent asinput to Finite State Machine which updates the four attributesbased on threshold criterion as explained in Eq. 1-5. Ananalogous Finite State Machine is used for merging segmentedcomponents based on min size. The modified graph obtainedis then reconstructed into an image. Random color assignmentis done to the pixels, based on final updated labels.

Sub-modules used to implement this algorithm are as fol-lows:• Finite State Machine for Segmentation: The Finite

State Machine module(Fig. 5) is used for implementingthe blocks - threshold based graph agglomeration and sizebased graph agglomeration. As shown in Fig. 4, sevenBRAMs are sent as input to the block Threshold BasedGraph Agglomeration. Depending on the value of ad-dress, we read values from Va, Vb and W BRAMs. EachVa, Vb and W corresponds to the undirected weightededge between them. As soon as Va, Vb are read, findmodules are used to obtain their corresponding parentlabels.

Fig. 5: Finite State Machine Architecture

These two parent labels representing components aregiven as an input to the join module which decideswhether to merge the components or not based onthreshold criterion (Eq. 2). Find and Join modulesare explained in the sub-section below. When the joinmodule asserts the signal done, counter and address are

updated as per the requirement. Once all the edges havebeen traversed the state machine is terminated. This statemachine updates the four attributes - rank, label, sizeand threshold. These updated BRAMs are again sent asinput to a similar Finite State Machine which mergescomponents based on min size and re-updates thosefour attributes. Once all the edges have been traversedthe state machine is terminated.

• Find module: The Find module (Fig. 6) is used to searchthe parent label of the current component. A componentwhen merged with any other component, gets the newlabel using set union find algorithm[18]. This updatedlabel is stored in the Label BRAM and further given asan output for later modules.

Fig. 6: FIND Module Architecture

• Join module: The Join module (Fig. 7) is used for merg-ing the 2 components. This module is used in Thresholdbased Graph Agglomeration state machine (Fig. 4) and isused to merge components based on threshold criterion(Eq 2). In Size based graph agglomeration, this moduleis used for merging components based on min size cri-terion as explained in Section 2. Rank based set unionalgorithm[18] is used to implement this algorithm andreduce time complexity.

This sequential architecture uses these above modules asshown in Fig. 4. It exploits the gate level control providedby an FPGA and henceforth, allowing to control over minutedetails of the arithmetic design which is not possible in CPUimplementation.

Fig. 7: JOIN Module Architecture

B. Pipelined Architecture

Figure 7 shows the process flow for the pipelined imple-mentation of this algorithm. In the fig. , all steps in onehorizontal row are executed parallelly and as we move to nextrow, all previous tasks would be completed. Due to sequentialnature of the sub-modules, internal Pipeline processing ispossible. While preprocessing of image and graph formationis being done, simultaneously initialization of the four BRAMcan be done as these are mutually exclusive. Also, the edgeweights can be computed row wise in parallel thus, graphformation time can be reduced to O(n) from O(n*n). Evenfor random recoloring, since colors depend only on the labelassigned, similar reduction in time complexity is seen. Dualport BRAMs have two independent access ports. Exploitingthis as soon as the first set of sorted Va, Vb and W are storedin BRAM, the FSM for segmentation based on threshold canstart executing the first iteration. Using this also find and joinoperations can be pipelined with the write operation. Thissubstantially reduces the time. Further finding the parent labelsof Va and Vb are independent processes and can be executedtogether. Since there are different dual port BRAMs for label,rank and threshold, find and updating threshold operations arepipelined. Similarly, read and join operations are pipelined.Independent dual port BRAMs also allow reading the labelsbefore threshold based merging is finished. Thus, the lastiteration of threshold based merging is pipelined with the firstiteration of min size based merging. Thus by making elegantmodifications in the sequential architecture, it is possible tosave a tangible amount of clock cycles.

C. Hybrid Architecture

When there is no limit to the number of resources that canbe used, Hybrid architecture can be designed as shown in Fig9. This architecture is a full fledged parallel and pipelinedimplementation of the algorithm. By taking advantage ofparallelism provided by an FPGA, we use multiple copies ofsub-modules and use them in parallel.

An input image can be divided into n parts, and each partwill then undergo independent pre-processing, Graph initial-ization and Threshold based Graph agglomeration. Now, theseparts are merged using Horizontal and Vertical stitching. Thisnew stitched image undergoes size based graph agglomerationand random recolouring based on new labels. The value of n

Fig. 8: Process flow of the pipelined architecture

needs to be chosen judiciously. If n is very large then dueto the limited number of resources on FPGA, this hybridarchitecture will not work. Also, larger n implies smaller sub-images, which may result in loss of information while mergingbased on threshold and stitching.

This flow is illustrated with an example below.

Fig. 9: Block diagram of proposed Hybrid Architecture

As shown in a fig. 9, the input image is divided in 4parts. Each sub image undergoes pre-processing parallelly.This smoothened sub-images undergoes segmentation based

on threshold(Eq. 2) independently. Now sub-images 1-2 and3-4 are stitched horizontally.

For horizontal stitching, the rightmost column of the leftimage is compared with the leftmost column of the imageto the right. As the individual images have already beensegmented according to the threshold, while merging alongthe joining edges, instead of considering four neighbors foredges, only one neighbor is considered. We consider Va -Vb2and ignore Va - Vb1, Va - Vb3. The dotted line in (Fig 3) showsthe neighboring edges for horizontal stitching. This reduces atangible amount of computation. Similarly, vertical stitchingis implemented using the bottom most row of the top imageand the topmost row of the bottom image.

Now this image which we get after vertical stitching, is thesegmented output which we get after thresholding. Segmenta-tion based on min size is then done over the complete image.Random coloring of the image based on new labels results inthe final segmented image.

This hybrid architecture saves a lot of clock cycles and helpsin achieving real time segmented implementation.

IV. FPGA IMPLEMENTATION

Optimized for high-performance logic and DSP with lowpower serial connectivity, the design test platform, VirtexUltraScale XCVU190-2FLGC2104E FPGA delivers 4 CPF4Optical Interfaces, 28 Gbps Backplane Interface, Dual 512MBQuad-SPI flash memory, 2GB HMC Memory to meet higherbandwidth, performance and memory demands with lesspower. The table below provides system parameters.

The three architectures explained in the previous sectionsare implemented on this FPGA.The 7 BRAMs used in thedesign are simple dual port memories because they have twoindependent access ports to a common storage array. Themodules Find and Join are implemented using finite statemachines and various IP cores like Divider Generator aredeployed for computation of division. To reduce number of

TABLE I: Comparative study of resource utilization in thethree architectures

Architectures CLBs LUTs Power dissipation (mW)Sequential 1352 332 396Pipelined 2426 525 478Hybrid 4638 941 740

resources, serial operation is used at certain places. The designof system comprises both serial and parallel operations tomaintain an optimum level of utilization of resources as well asclock cycles required. The exact values involved are presentedin results section.

V. RESULTS

The three architectures discussed above have been im-plemented on FPGA. We realized that if FPGA has lesserresources, then pipelined architecture should be preferred overthe Hybrid one. On modern FPGA hardware, using Hybridarchitecture, an image can be split into more sub parts (alarge value of n say 16 or 32), exponential improvement in

computation time can be observed. The entire segmentationmodule - threshold based graph agglomeration and size basedgraph agglomeration runs on FPGA board, while the processoris used for pre-processing and rendering. We discuss theperformance of the three architectures designed based onacceleration in computation and power dissipation in detailbelow.

A. Acceleration in computation

Table 3 shows the computation time a CPU takes andvarious other proposed architectures. We can see that a lot ofclock cycles can be saved when we go for a hybrid architectureinstead sequential or pipelined design. We can see that asignificant amount of speed up is obtained, even though ithas high resource utilization but this issue can be addressed bythe ability of FPGA to dynamically reconfigure. At a particularinstance, we can have only those resources which are requiredfor processing and FPGA can be reconfigured before the nextstep. In this way, highly parallel architectures can be designedon FPGA.

As we can observe from Table 3, the sequential architec-ture shows a marginal improvement compared to the CPUimplementation. But by making the design more tightly packedand introducing pipelining at sub-modular level, we can seefurther more improvement in computation time. By breakingthe paradigm of sequential execution and accomplishing moreper clock cycle operations by enabling hardware level paral-lelization at an architectural level , we can achieve enormousspeed up.

We can infer from table IV, that as we keep on increasingthe value of n, we achieve more and more acceleration. But,the value of n needs to be chosen judiciously. It depends ontwo factors- size of image and the FPGA board that is beingused. For example, an image of size 128*72, if we split thisimage into 16 parts, the result obtained is very distorted. Also,if n is very large then due to the limited number of resourceson FPGA, this hybrid architecture will not work.

B. Power consumption by equivalent hardware

Simulation of power dissipation has also been done usingVivado Power analysis tool. The results are compiled in tableV. As evident from the results shown, sequential implementa-tion has lowest power dissipation and hybrid implementationhas highest. This is because, in the hybrid architecture, morecells and interconnects are active at any instance of time com-pared to sequential. Even in pipelined architecture, the numberof active interconnects are more than that in sequential becausethe design is tightly packed. Hence power dissipation of thesetwo architectures is more when compared to sequential.

The power dissipation in FPGA is in the order of milliwattwhich is nominal when compared to power dissipation in atypical CISC or RISC processor for which power dissipation isin the order of Watt. Hence, both in terms of power dissipationand computation time, hybrid architecture is way better thanthe CPU implementation.

(a) Input Road image1 (b) Segmented image after thresholding (c) Segmented image after min-size

Fig. 10: Input images (1920 X 1080 color image) and showing intermediate results and final segmentation using EfficientGraph Based Segmentation approach on FPGA

TABLE II: Computation time obtained for different architec-tures for different sizes of test images

Computation time (in ms)Image Size 128 X 72 256 X 144 512 X 288

CPU 49.2 96.31 261.14Sequential 36.23 102.663 257.649Pipelined 31.47 94.511 252.323Hybrid 17.83 53.888 175.314

TABLE III: Computation time hybrid architecture takes fordifferent sizes of test images and different values of n

Computation time (in ms)Image Size 128 X 72 256 X 144 512 X 288

n=1 31.47 94.511 252.323n=2 25.803 85.9 245.064n=4 21.109 66.183 212.57n=8 17.83 53.888 175.314

TABLE IV: Power consumption for different Architectures fordifferent sizes of test images

Power Dissipation (in mW)Image Size 128 X 72 256 X 144 512 X 28Sequential 396 530 767Pipelined 478 654 893Hybrid 740 968 1,332

VI. CONCLUSION AND FUTURE SCOPE

Implementing a software algorithm on an FPGA enablesus to exploit the hardware flexibility, parallelization, logical,electrical and physical advantages of it. Henceforth, it offersreal time performance and economical power consumptionon various mobile robotics applications. This paper presentedthree novel architectures of image segmentation, implementedon Virtex UltraScale XCVU190-2FLGC2104EES9847 FPGA.The sequential design exploits the gate level control of systemarchitecture allowing control over minute details of arithmeticdesign, whereas in pipelined design we exploit the hardwareflexibility of pipelining provided by an FPGA. The hybridarchitecture provides a full fledged parallel and pipelined im-plementation of the algorithm. The hybrid architecture overallprovides at least 2X gain in acceleration, when compared toother implementation. Even though hybrid design has morepower dissipation as compared to sequential but it is negligiblecompared to CPU.

Though currently we have been able to achieve only 2Xspeedup, this speedup is limited by the storage constraintsof the device we had access to. With a better device higheracceleration can be achieved as this hybrid architecture has the

potential to offer much better performance as it can execute inparallel more computations by exploiting the hardware paral-lelism thus allowing more number of subimages. However dueto unavailability of a bigger and more expensive FPGA board,we are not able to verify the full potential of this algorithm.

With the popularity of multicore programming and GPUsneural networks based image segmentation algorithms aregaining popularity. As a part of the future work, we would liketo explore deep learning based segmentation algorithms likesegnet and fully convoluted networks on FPGA. . Focus wouldbe more on maintaining the acceleration of the algorithmand quality of image segmentation without causing loss ofinformation and keeping minimal power consumption.

REFERENCES

[1] Swendsen, R.H., Wang, S.: Nonuniversal critical dynamics in MonteCarlo simulations. Physical Review Letters 76(18) (1987) 8688

[2] Shapiro, L.G., Stockman, G.C.: Computer Vision. Prentice Hall (2001)[3] Boykov, Y., Funka-Lea, G.: Graph cuts and efficient N-D image segmen-

tation. International Journal of Computer Vision 70(2) (2006) 109131[4] Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image seg-

mentation. International Journal of Computer Vision 59(2) (2004) 167181[5] Jyotsana Mehra, Nirvair Neeru: A Brief Review: Super Pixel Based Image

Segmentation Methods. Imperial Journal of Interdisciplinary Research(2016)

[6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Mur-phy, Alan L. Yuille: DeepLab: Semantic Image Segmentation with DeepConvolutional Nets, Atrous Convolution, and Fully Connected CRFs(2016). Computer Vision and Pattern Recognition.

[7] Vijay Badrinarayanan, Alex Kendall, Roberto Cipolla: SegNet: A DeepConvolutional Encoder-Decoder Architecture for Image Segmentation(2016). Computer Vision and Pattern Recognition

[8] Jan Fischer, Alexander Ruppel, Florian Weisshardt, and Alexander Verl. Arotation invariant feature descriptor o-daisy and its fpga implementation.In Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ InternationalConference on, pages 23652370. IEEE ( 2011)

[9] Roopak Dubey, Neeraj Pradhan, K. Madhava Krishna and Shubhajit RoyChowdhury: Field Programmable Gate Array (FPGA) Based CollisionAvoidance Using Acceleration Velocity Obstacles. International Confer-ence on Robotics and Biomimetics. IEEE (2012)

[10] Shumit Saha, Kazi Hasan Uddin, Md. Shajidul Islam, Md. Jahiruzzaman,A. B. M. Awolad Hossain: Implementation of Simplified NormalizedCut Graph Partitioning Algorithm on FPGA for Image Segmentation.The 8th International Conference on Software, Knowledge, InformationManagement and Applications (2014)

[11] K. Yamaoka,T. Morimoto, H. Adachi,T. Koide,H.J. Mattausch:Imagesegmentation and pattern matching based FPGA/ASIC implementationarchitecture of real-time object tracking. Asia and South Pacific Confer-ence on Design Automation (2006)

[12] P. Dillinger, J. F. Vogelbruch, J. Leinen, S. Suslov, R. Patzak, H. Winker,and K. Schwan: FPGA-Based Real-Time Image Segmentation for MedicalSystems and Data Processing. IEEE Transactions on Nuclear Science.

[13] http://www.geeksforgeeks.org/union-find-algorithm-set-2-union-by-rank/

fpga based parallelized architecture of efficient graph...

Documents