
Mälardalen University
School of Innovation Design and Engineering

Västerås, Sweden

Thesis for the Degree of Master of Science in Engineering - Robotics
30.0 credits

OBJECT RECOGNITION THROUGH CONVOLUTIONAL LEARNING FOR FPGA

Daniel Jonasson
[email protected]

Examiner: Masoud Daneshtalab
Mälardalen University, Västerås, Sweden

Supervisor: Ning Xiong
Mälardalen University, Västerås, Sweden

Company supervisor: Lars Asplund, Unibap, Västerås, Sweden

July 4, 2017


Abstract

In recent years, interest in deep networks and convolutional networks for object recognition has spiked. Few of these efforts, however, focus on the hardware side. This thesis was done in collaboration with Unibap to explore the feasibility of implementing such networks on an FPGA to speed up object recognition. A custom database is created to investigate whether a smaller database can be utilized with good results. This database, alongside the MNIST database, is tested in multiple configurations to find a suitable solution with good enough accuracy. The thesis focuses on reaching an accuracy applicable in today's industries and is therefore not as accuracy-driven as many other works. Furthermore, an FPGA implementation that is versatile and flexible enough to utilize regardless of network configuration is built and simulated. To achieve this, research was done on existing AI methods, and the focus landed on convolutional neural networks. The different configurations are all presented in terms of time, resource utilization and accuracy. The FPGA implementation in this work is only simulated, which leaves the need to synthesize it on an actual FPGA.


Acronyms

AI Artificial Intelligence.

ANN Artificial Neural Network.

CNN Convolutional Neural Network.

CPU Central Processing Unit.

DCNN Deep Convolutional Neural Network.

DNN Deep Neural Network.

FPGA Field Programmable Gate Array.

GD Gradient Descent.

GPGPU General-Purpose computing on Graphics Processing Units.

GPU Graphical Processing Unit.

IVS-70 Intelligent Vision System-70.

LUT Look-Up Table.

RCNN Recurrent Convolutional Neural Network.

RDCNN Recurrent Deep Convolutional Neural Network.

RLU Rectified Linear Units.

SOC System On Chip.


Table of Contents

1 Introduction
  1.1 Problem Formulation
    1.1.1 General research problem
    1.1.2 Problem formulation

2 Background
  2.1 Motivation
  2.2 Training data
  2.3 FPGA
  2.4 Existing methods
    2.4.1 Artificial Neural Network
    2.4.2 Convolutional Neural Network
    2.4.3 Recurrent Convolutional Neural Network
    2.4.4 Deep learning
    2.4.5 Related work for optimization and FPGA implementations
    2.4.6 Reasoning

3 Method
  3.1 Layout
    3.1.1 Deep Convolutional Neural Network
    3.1.2 Recurrent Deep Convolutional Neural Network
  3.2 Training
    3.2.1 Backpropagation - Fully connected layers
    3.2.2 Backpropagation - Max pooling layers
    3.2.3 Backpropagation - Convolutional layers
    3.2.4 Training data
  3.3 Implementation
  3.4 Analysis

4 Hardware
  4.1 Cameras
  4.2 Field Programmable Gate Array
  4.3 System on Chip

5 Implementation
  5.1 Testing
    5.1.1 MNIST
    5.1.2 Custom database
    5.1.3 Test setup
  5.2 Field Programmable Gate Array (FPGA) implementation
    5.2.1 Resources
    5.2.2 Timing

6 Results
  6.1 General research problem
    6.1.1 Question 1
    6.1.2 Question 2
    6.1.3 Question 3

7 Discussions
  7.1 Question 1
  7.2 Question 2
  7.3 Question 3

8 Future Work

9 Conclusion

10 Acknowledgments

References


1 Introduction

In recent years, interest in robotics has been growing. Although industry was the first to discover the advantages of using robots, more and more areas are becoming aware of the benefits of robotics, from production to health care. Even our homes are starting to make use of robots in the form of vacuum cleaners, lawn mowers and other applications. This has driven the research community to look into the major obstacles in the field. One of the biggest hindrances to increasing the versatility of today's robots is their lack of awareness. If a robot were given the ability to see its surroundings and make decisions accordingly, its versatility would increase dramatically. One example could be having the manipulators in an assembly line detect and pick up individual screws out of a box. One of the core features needed to achieve this is object recognition. Although multiple methods for object detection exist today, there is room for a lot of improvement. One such improvement is to speed up the detection without losing accuracy.

This thesis is focused on finding a suitable Artificial Intelligence (AI), more specifically an Artificial Neural Network (ANN) that may utilize deep learning [1], that is fast and reliable enough for implementation on an FPGA. The goal is to implement an AI that fulfills the requirements of an FPGA, fully or partly, on a simulated FPGA, so that the parts intended for it can be easily synthesized and transferred to a real FPGA. This implies that one focus of this thesis is to ensure that the part of the AI intended for the FPGA is executable on it.

Included in this thesis are research on existing algorithms and methods for object detection intended for deployment on an FPGA, and methods for reducing computational complexity and memory usage. The implementation is tested on a custom image training set as well as a well-known database, and a proper analysis is included. This thesis was conducted in collaboration with Unibap [2] and is intended to be integrated, in the future, with their existing platform Intelligent Vision System-70 (IVS-70). Contrary to existing solutions, this thesis focuses on finding a solution that can easily be re-trained on a small custom database, so that it is versatile enough to be deployed at many different companies with different applications, without the need for massive retraining or alteration. The training will be performed outside the FPGA program, as only the forward pass will be implemented on it. This saves resources, and the training is optimally performed on a large server park or a supercomputer.

1.1 Problem Formulation

Before work began, a hypothesis was formulated and a set of questions was posed to define the scope of the thesis.

1.1.1 General research problem

What methods are feasible when implementing object recognition on an FPGA, and which of these methods is the most suitable?

Not all methods will be considered, but rather a few chosen based on the research performed and on requests from Unibap.

1.1.2 Problem formulation

The following questions are formulated to solve the general research problem.

Q1 What limitations does the FPGA entail?
Since the goal of this thesis is to find a method that is fully or partly implementable on an FPGA, it is imperative that the chosen methods are analyzed and compared against the specifications of the hardware they are intended for.

Q2 Considering the limitations mentioned in Q1, what type of ANN is suitable?
With the limitations in mind, what type of AI would fit within them in regard to size, complexity, accuracy and method?


Q3 Is it possible to fit the entire network on the FPGA?
With the intended hardware in mind, is it plausible to fit the entire network on the FPGA, or only part of it? If only part of it fits, which parts of the network should be implemented in the FPGA?


2 Background

The background is divided into five sections. The first section describes the company Unibap, which proposed this thesis, and why it was proposed. The second section briefly describes the training data. The next section describes the FPGA, why it is suitable hardware, and presents the limitations that need to be considered when designing an ANN intended for the FPGA. The fourth section presents the research done on existing solutions, both those intended for FPGAs and general methods for object recognition utilizing ANNs; most of these are focused on optimization. The final section covers the reasoning behind the choice of method to implement in this work.

2.1 Motivation

Unibap is taking the step towards intelligent visual perception solutions and is already a world-class supplier of safety-critical solutions in vision processing. The company started out in the aerospace market and is now moving into the industrial machine vision market. One of the ideas it is working on is the implementation of solutions inspired by artificial visual cortexes. Unibap believes that the future of automation lies with a robust and reliable vision system. As quoted on their website:

”Artificial intelligence and machine vision are key enablers of the future automationindustry. The world market is rapidly growing according to almost all reports —intelligent machine vision for automation is the holy grail.”

This thesis was conceptualized because of this drive, and the potential applications of such a solution are massive.

2.2 Training data

To train an ANN, a training data set consisting of training and validation examples is required. The training examples are used to train the network, while the validation examples are used to validate the training. It is important to note that the sets need to contain both positives (the object sought after) and negatives (other objects) to prevent false positives. If a network is supposed to recognize nails and does not have negatives in its training data set, it might classify similar objects, such as screws, as nails. It is also important that there is a sufficient amount of training data. A deep network especially is prone to suffer from overfitting if the training data set is too small.

2.3 FPGA

There exist many solutions to object recognition utilizing ANNs. The problem, however, is that these networks demand a lot of computing power. This has spawned the need for reduced hardware complexity and for optimizations that lower their requirements. Many types of hardware accelerators have been suggested in recent years. Two of these are the FPGA and General-Purpose computing on Graphics Processing Units (GPGPU). GPGPU has good performance, but unfortunately it also consumes a lot of power. In later years the FPGA has become a viable alternative to GPGPU because of its low energy consumption as well as an increase in its computational power [3]. An FPGA is a hardware implementation of an algorithm, and hardware is generally faster than software. It is also more deterministic, which means that its latency is an order of magnitude lower than that of a Graphical Processing Unit (GPU): while the GPU operates at single-digit microseconds, the FPGA operates at hundreds of nanoseconds.

An FPGA consists of an array of logic blocks. The connections between these logic blocks are reconfigurable, so the user can program the FPGA to fit their needs. The logic blocks can be programmed as simple logic gates such as AND, NOR and so on, or to perform complex combinational functions, and they also contain memory. FPGAs further contain very fast inputs and outputs as well as bidirectional data buses, and they may contain analog-to-digital and digital-to-analog converters. The latest FPGAs also contain embedded microprocessors and related peripherals.


For the ANN to be easily transferable to an FPGA, there are a few limitations to consider. One of the major limitations is that the memory in the logic blocks is quite small. This implies that the chosen method needs to be limited in the amount of memory it utilizes during execution. It also needs a suitable configuration in terms of depth, layers and number of weights, since all of these govern the amount of memory and resources, e.g. logic blocks, the approach needs. Another consideration is that floating-point precision increases the load on the FPGA, so it is a good idea to use fixed-point precision when implementing on an FPGA. A basic neuron circuit, which is the backbone of ANNs, can be constructed from a multiplier, an accumulator and an activation function calculator. However, the multiplication and addition take up a lot of computation time. This can be solved by pipelining in the FPGA [4].
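As a rough sketch of the fixed-point arithmetic and the multiplier-accumulator-activation structure described above, the Python snippet below quantizes inputs and weights to integers with 8 fractional bits. The Q-format, the array sizes and the ReLU-style activation are illustrative assumptions, not the design used later in this thesis.

```python
import numpy as np

FRAC_BITS = 8  # assumed fixed-point format: 8 fractional bits

def to_fixed(x):
    # Quantize floats to an integer fixed-point representation.
    return np.round(np.asarray(x, dtype=float) * (1 << FRAC_BITS)).astype(np.int64)

def fixed_neuron(inputs_fx, weights_fx, bias_fx):
    # Multiplier + accumulator: each product carries 2*FRAC_BITS fractional
    # bits, so shift right once to realign before accumulating the bias.
    acc = int(np.sum((inputs_fx * weights_fx) >> FRAC_BITS)) + int(bias_fx)
    # Activation function calculator, here a simple max(0, x).
    return max(acc, 0)

x = to_fixed([0.5, -0.25, 1.0])
w = to_fixed([0.1, 0.4, -0.2])
b = to_fixed(0.05)
print(fixed_neuron(x, w, b) / (1 << FRAC_BITS))  # converted back to float
```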

An ANN also implements an activation function to perform the nonlinear transformation between the input and output space. Commonly, a sigmoid function is used. In an FPGA implementation this needs to be solved through a different method. Four such activation function solutions are [4]:

a) look-up tables method
b) piecewise linear approximation
c) CORDIC
d) table-driven linear interpolation

Each of these solutions has advantages and disadvantages, and research needs to be done with all limitations in mind to select an appropriate method.
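As an illustration of solution a), the following sketch replaces the sigmoid with a precomputed look-up table; the table size of 256 entries and the clamped input range of [-8, 8] are assumptions made for this example.

```python
import numpy as np

LUT_SIZE, IN_MIN, IN_MAX = 256, -8.0, 8.0

# Precompute the sigmoid over a clamped range; outside this range the
# function saturates, so clamping costs little accuracy.
_grid = np.linspace(IN_MIN, IN_MAX, LUT_SIZE)
_SIGMOID_LUT = 1.0 / (1.0 + np.exp(-_grid))

def sigmoid_lut(x):
    # Map the input to the nearest table index (no interpolation here;
    # table-driven linear interpolation, solution d, would refine this).
    x = np.clip(x, IN_MIN, IN_MAX)
    idx = int(round((x - IN_MIN) / (IN_MAX - IN_MIN) * (LUT_SIZE - 1)))
    return _SIGMOID_LUT[idx]

print(sigmoid_lut(0.0), sigmoid_lut(3.0))  # roughly 0.5 and 0.95
```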

2.4 Existing methods

2.4.1 Artificial Neural Network

The ANN simulates neuronal functions by utilizing nodes and activation functions. It originated in the 1940s and uses weights between nodes to simulate memory. A supervised learning model was later proposed, based on the concept of a perceptron, as described by Tao-ran Cheng et al. in [4]. This perceptron is used to build a multi-layer network which adjusts its weights through the back-propagation algorithm.

Figure 1: A basic ANN structure showing the different layers, inputs and weights

As seen in figure 1, a basic ANN is constructed from an input layer that receives an input x(n). The inputs are forwarded through weights w(ij), one per connection, which are multiplied with the input. The next layer is called the hidden layer, where the weighted inputs are summed and a bias weight is added before being forwarded to the output layer. The output is then compared to the expected output, and the error is backpropagated layer by layer through the network. The weights are then adjusted accordingly. This is repeated until a satisfying result, or a pre-defined number of iterations, is achieved. One such iteration is called an epoch. After this is done, the ANN is trained and can be used to solve the problem at hand.
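A minimal sketch of the forward pass just described, assuming one hidden layer with a sigmoid activation as in figure 1 (the layer sizes here are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    # Hidden layer: weighted sum of the inputs plus a bias weight, then
    # the nonlinear activation.
    h = sigmoid(w_hidden @ x + b_hidden)
    # Output layer: during training this is compared to the expected output
    # and the error is backpropagated to adjust the weights.
    return w_out @ h + b_out

rng = np.random.default_rng(0)
y = forward(rng.random(4),
            rng.standard_normal((8, 4)), np.zeros(8),
            rng.standard_normal((2, 8)), np.zeros(2))
print(y.shape)  # (2,)
```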

2.4.2 Convolutional Neural Network

A Convolutional Neural Network (CNN) is a feed-forward ANN whose structure is inspired by the visual cortex in animals. Every neuron (or node) operates in a restricted space called a receptive field. These receptive fields overlap to a certain extent, and each node's output can be approximated through convolution. The CNN was designed to require as little pre-processing as possible.

In a CNN, each neuron is designed to process a small part of the input image, which overlaps with the image parts fed into neighboring neurons. This arrangement gives the CNN a degree of tolerance to translation of the input image. The network may contain pooling layers, either local or global, which combine the outputs of neurons and scale the representation down. It also includes fully connected layers as well as convolutional layers. The advantage of CNNs is that they share weights within the convolutional layers. This makes it possible to detect the same feature across the entire image using the same pool of weights, also called a kernel, which minimizes the amount of memory required for the feature extraction. The pooling layers divide the input image into non-overlapping squares or rectangles. If max pooling is implemented, only the strongest response from a neuron in each such region is output, which reduces the spatial size of the representation. This is built on the assumption that the exact location of a feature is not as important as its location relative to other features. The pooling layer helps reduce the number of parameters required in the network as well as the number of computations. Finally, a fully connected layer is implemented to perform the main learning. The most commonly used activation function is the Rectified Linear Unit (RLU), since it is quite resistant to the gradient vanishing effect that occurs in the backpropagation algorithm. During recent years, CNNs have shown substantial improvements over other state-of-the-art approaches in object recognition [5, 6].

2.4.3 Recurrent Convolutional Neural Network

A Recurrent Convolutional Neural Network (RCNN) is implemented much like the CNN. The difference is that the convolution is performed a pre-defined number of times by the same filter. The result from each iteration is convolved again by another filter, and the results are summed with the feature map provided by the first filter on the last iteration. This method allows for an increase in depth with just a few added weights.

2.4.4 Deep learning

Deep Neural Network:
A Deep Neural Network (DNN) is very similar to the CNN in that it is also a feed-forward network, and it also contains convolutional and pooling operations in the hidden layers. However, the DNN often contains more layers, which allows it to realize more complex functions through its abstraction of the input data [1]. A big difference is that DNNs do not share weights between kernels but rather use unique weights for each, which may increase the risk of overfitting as well as memory usage and computational cost. The DNN may also contain normalization layers. Another big difference is that a DNN has no plain output layer but rather a classification layer.

Deep Convolutional Neural Network:
A Deep Convolutional Neural Network (DCNN) is a combination of the DNN and the CNN. It utilizes the strengths of both, as it contains both feature extraction and classification in itself [6]. It shares weights between kernels like the CNN, but with more layers and a classification in the output layer as in the DNN. It has been shown that it can perform even better than the CNN, although calculation complexity grows as the depth and size of the layers increase [7]. An important thing to consider is that a deep network increases the risk of overfitting. This can be solved with strategies such as dropout.

Recurrent Deep Convolutional Neural Network:
The Recurrent Deep Convolutional Neural Network (RDCNN) is essentially a DCNN with the same recurrent convolution steps as described in section 2.4.3. This allows for a deeper net with just a few added weights. The recurrence may increase the accuracy of the net but also increases the risk of overfitting due to the high level of abstraction of the input features [5].

2.4.5 Related work for optimization and FPGA implementations

There are multiple approaches to implementing AI for object recognition, and it is important to look into what exists today. When it comes to implementations on an FPGA, it is important to keep in mind the limitations mentioned in section 2.3.

Arunachalam Venkadesan et al. present in their article [8] a solution to three major problems related to the limitations of FPGAs. The first is the computational complexity, which is mainly driven by the non-linear activation function of an ANN. The most popular non-linear function is the tan-sigmoid function, defined as

    f(n) = \frac{e^n - e^{-n}}{e^n + e^{-n}}

This equation poses a problem in hardware implementation because it is an infinite series: to decrease the computational load it has to be truncated, which induces large truncation errors. If this is instead solved with Look-Up Tables (LUTs), the tables become large, and any interpolation between values also becomes complex, since these combinations are also powers of e. Arunachalam Venkadesan et al. propose a solution to this problem by implementing a different activation function, called the Elliot function, defined as

    f(n) = \frac{n}{1 + |n|}

Contrary to the tan-sigmoid function, it consists of only one adder and one divider. A benchmark test shows that the Elliot function performs as well as the tan-sigmoid function with less complexity and therefore faster execution.
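A small numerical comparison of the two activation functions (plain Python, for illustration only):

```python
import numpy as np

def tan_sigmoid(n):
    # (e^n - e^-n) / (e^n + e^-n), i.e. tanh: an infinite series in hardware.
    return np.tanh(n)

def elliot(n):
    # n / (1 + |n|): one adder and one divider.
    return n / (1.0 + np.abs(n))

n = np.linspace(-4.0, 4.0, 9)
print(np.round(tan_sigmoid(n), 3))
print(np.round(elliot(n), 3))  # same shape and saturation, far cheaper
```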

The second problem presented is the bit precision of the system. While lower bit precision decreases cost and memory usage, it also decreases the accuracy of the system. Therefore, it is crucial to find the optimal precision that lowers costs while maintaining an acceptable accuracy. The precision can be set by choosing, for example, an 8-bit signed or unsigned variable where the 4 most significant bits are the integer part and the 4 least significant bits represent the fractional part. This can be viewed in [9], where they obtain a resolution of 1/16 = 0.0625. Arunachalam et al. propose formula (1), which gives the precision Y_2 for a variable A:

    Y = A \cdot 2^N
    Y_1 = wholepart(Y)
    Y_2 = Y_1 \cdot 2^{-N}    (1)
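Formula (1) translates directly into code; a quick sketch:

```python
import numpy as np

def quantize(a, n):
    # Equation (1): scale up by 2^N, keep the whole part, scale back down.
    y = a * 2**n
    y1 = np.trunc(y)        # "wholepart"
    return y1 * 2**(-n)

print(quantize(0.2637, 4))  # 0.25, matching the 1/16 = 0.0625 resolution
print(quantize(0.2637, 8))  # 0.26171875, a finer 1/256 resolution
```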

The final problem addressed is how to conserve FPGA utilization. Instead of implementing the entire architecture, a multiplexing method is chosen and implemented, in which only the largest layer of the ANN is implemented on the FPGA. This layer is then reused for all the layers of the network, by means of a controller unit that feeds in the weights, inputs and biases corresponding to the layer currently being calculated. This method reduces the resources used in the FPGA.

To minimize memory usage, Ning Li et al. propose in their work [3] an implementation that utilizes multiple computing engines, so that each computing engine can be optimized for a different layer. All of the engines, or layers, are pipelined, which eliminates the need for a buffer to store the results that serve as inputs to the next layer. This significantly decreased memory usage and increased parallelism. They found that an RCNN, along with a strategy called global summation, combined with the aforementioned computing engines, leads to an implementation that fits entirely on a single chip, including all of the computations from start to end. The implementation also performed two times faster than the latest research at the time. The global summation method saves memory and DSP usage by not having fully connected layers. To pipeline all of the computing engines, a FIFO queue is implemented that stores the results and forwards them to the next layer. The FIFO queue makes it possible to process arbitrarily large images as long as there is room for a sufficient amount of weights. Ning Li et al. compared their hardware with a CPU and found that their method was 380 times faster than the CPU and approximately two times faster than the latest research at the time. They achieved 409.62 giga-operations per second and accuracies of 94.5% and 98.4% on two separate datasets.

In [6], Byungik Ahn presents a CNN implemented on an FPGA that recognizes objects in a real-time video stream. By down-scaling the input stream and extracting image blocks, it is able to classify more than 170,000 times per second and perform scale-invariant object recognition on a 60 frames-per-second video stream with a resolution of 720x480. The system consists of the CNN core, a pair of video encoders and decoders, and pre-processing and post-processing modules. The down-scaling is done in seven steps in the pre-processing module, where the size of the image blocks is halved in both the x and y dimensions every other scaling step. This allows the CNN to classify objects at any scale. After classification, the results are sent to the post-processing module, which marks the classified objects with a colored frame. Byungik Ahn's work uses only elementary components such as combinational logic, memories and arithmetic operators, meaning that neither main memory nor a processor is used. Multi-category recognition is also implemented, enabling the system to classify, or recognize, two different objects at the same time by using different weight sets in alternating frames. The more objects that are classified, the more the recognition frame rate must be lowered. The connection weights are stored as 25-bit signed integers plus two exponent bits; the exponent bits are used to bit-shift the product of large values in the connection weights, effectively eliminating the need for floating-point operators.

Sajid Anwar et al. present in [7] a fixed-point DCNN that is optimized to reduce computational complexity and to speed up classification. This is achieved by quantizing pre-trained parameters from a high-precision network through L2 error minimization, layer by layer. The network is then retrained with the quantized weights. The results indicate that sparsity is achieved within the network, which reduces the number of parameters and the memory usage to one tenth while obtaining better results than the high-precision networks. By lowering the precision to three and four bits, an 80% saving is achieved in hardware resources. Using the same input for each layer during the quantizing step, a sensitivity analysis is performed on each layer except the pooling layers, which are kept at high precision. This is done by keeping the other layers at high precision and computing the optimal weights one layer at a time. These weights are then added to the high-precision weights and compared against a validation set to find the optimal weights. Using this method, a network was obtained that performs better than, or comparably with, a high-precision network while using only 10% of its memory.

Ming Liang and Xiaolin Hu propose in [5] an RCNN that performs static object recognition. Their results showed that the network performed better than state-of-the-art CNNs while utilizing fewer parameters. Two GPUs with data parallelism were used to run the experiments, which were performed on four benchmark object classification datasets. The training was performed with the backpropagation-through-time algorithm [10] in combination with stochastic gradient descent. A regularizer is implemented through weight decay, and dropout is used to minimize overfitting.

A 3D CNN called VoxNet is proposed by Daniel Maturana and Sebastian Scherer in [11]. VoxNet differs from a standard CNN by utilizing a 3D point cloud, obtained from LiDAR, RGB-D and CAD data. Several authors have used RGB-D cameras in their work, but instead of utilizing the full spatial data they treat the depth merely as an additional input to their networks; such networks are called 2.5D networks. Daniel Maturana and Sebastian Scherer utilize a point cloud to find the spatial occupancy through a volumetric occupancy grid and predict class labels through a 3D CNN. The volumetric occupancy grid maintains a probabilistic estimate of the environment through random variables that correspond to voxels. This helps in estimating free, occupied and unknown space from range measurements. They implement their solution on a GPU, and their results show improvements on state-of-the-art results in various benchmarks while performing the classification in real time.


2.4.6 Reasoning

There exist many implementations of ANNs, each with its own strengths and weaknesses. Since the scope of this thesis is to investigate the feasibility of implementing real-time object recognition on an FPGA through a video feed, it is natural to consider the limitations stated in section 2.3. The FPGA was chosen over the GPU because of its latency and power consumption, and by request of Unibap. With this in mind, the method tested in this thesis will be a CNN. This is due to the weight sharing within the convolutional layers, the fact that the classification is integrated in the network, the fact that it utilizes fewer resources than a DCNN, as well as its promising results. It is also Unibap's request that this network be implemented. The focus will be on finding a good configuration in terms of depth, kernel size, the number of layers, and which methods are to be implemented. The implementation will also utilize the multiplexing method to conserve logic blocks and DSPs.


3 Method

This thesis was proposed by Unibap, whose request included a deep learning algorithm intended for use on an FPGA. An existing platform called IVS-70 is provided by Unibap, and from this platform some requirements can be drawn. With these in mind, research was performed on different deep learning algorithms and a choice was made, as described in section 2.4.6. The next steps are to design the layout of the algorithm, decide how to train the network, and decide on the implementation.

3.1 Layout

The layout of the networks needs to be decided upon to find an optimal solution. This is done using the program Tensorflow [12]. Different configurations, with different feature map sizes and different numbers of nodes in the fully connected layer, are implemented, tested and evaluated. The resource and timing demands of each of these configurations, with different variable sizes, are investigated. This gives an indication of which configurations perform well, and is used as a basis for deciding which configuration will be implemented on an FPGA (section 2.3) in the future.

3.1.1 Deep Convolutional Neural Network

Figure 2: Structure of a DCNN from [13].

The process of a CNN consists of two parts. The first part performs the feature extraction and consists of alternating convolution and pooling layers. The second part performs the classification and recognition, which is done through dense layers. The structure of a DCNN can be viewed in figure 2. In the figure, the convolution layers are called C-layers and the pooling layers are called MP-layers, while the dense layers can be seen at the end of the network.

The DCNN and CNN work on the principle of receptive fields. This means that a neuron in the network is only connected to a small region of the previous layer. In the first layer, the raw input image is divided into small regions. In figure 2 this region is set to 5x5. The resulting square is then shifted across the image to produce the inputs for subsequent neurons. The variable that governs how far the region shifts is called the stride and is set to 1 in the figure. Since the region used is 5x5 and the total size of the image is 32x32, a stride of 1 produces 28x28 different regions as inputs to the first convolution layer. The depth of the first convolution layer, in other words the number of feature maps used, governs how big the output becomes; the feature maps may all detect different features such as blobs, colors or edges. Each feature map shares weights across the entire input field, but different feature maps utilize different sets of weights. This is built on the assumption that if a neuron is able to detect a feature in one part of the picture, it should also be useful in finding the same feature in another part of the image. The activation function most commonly used is the RLU, due to the fact that it increases the non-linear properties of the network while leaving the receptive fields in the convolutional layer unaltered. As described by Ning Li et al. in [3], when three image channels (RGB) are used as input, the output of a convolutional layer and the RLU is given by equations 2 and 3, where y^{(l)}_{i,j,k} is the output of layer l; i, j and k form the 3D coordinate of the node; w^{(l-1,f)}_{a,b,c} are the weights of filter f, which is applied at layer (l-1); and a, b and c are the 3D coordinates of the filter weight. Finally, before being passed on to the pooling layer, the RLU function σ, shown in equation 4, is applied, producing the output of the layer.

    x^{(l)}_{i,j,k} = \sum_a \sum_b \sum_c w^{(l-1,f)}_{a,b,c} \, y^{(l-1)}_{i+a,j+b,k+c} + bias_f    (2)

    y^{(l)}_{i,j,k} = \sigma(x^{(l)}_{i,j,k})    (3)

    \sigma(x^{(l)}_{i,j,k}) = \max(0, x^{(l)}_{i,j,k})    (4)

After the initial convolutional layer, which sums the activations from the three channels, the subsequent convolutional layers may work in two dimensions instead of three. After the convolutional layer and RLU, the output is used as input to the pooling layer. Pooling can be done in multiple ways. The basic method is to divide the input into squares or rectangles of equal size that do not overlap. In a max pooling layer, the nodes in each pre-defined area are compared and the strongest (highest) response is kept. This reduces the size of the input, which can then be used as a new input to the next convolutional layer. If a max pooling layer with a 2x2 kernel is used on an input of size 10x10, the resulting output to the next convolutional layer will be of size 5x5. Another method is average pooling, which works much the same way as max pooling, except that instead of the strongest response, the average response is calculated and used.
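The following sketch implements equations 2-4 plus a max pooling step in plain Python. The 32x32 image, 5x5 kernel and stride 1 match the figure 2 example, while the random values are placeholders.

```python
import numpy as np

def conv_relu(image, weights, bias):
    # Equations 2-4: slide the kernel across the image, sum over the kernel
    # window and the three channels, add the bias, then apply max(0, x).
    H, W, _ = image.shape
    k = weights.shape[0]                    # square k x k x channels kernel
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            x = np.sum(weights * image[i:i+k, j:j+k, :]) + bias
            out[i, j] = max(0.0, x)         # the RLU of equation 4
    return out

def max_pool(fmap, size=2):
    # Non-overlapping size x size regions; keep the strongest response.
    H, W = fmap.shape
    return fmap[:H - H % size, :W - W % size] \
        .reshape(H // size, size, W // size, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
fmap = conv_relu(rng.random((32, 32, 3)), rng.standard_normal((5, 5, 3)), 0.1)
print(fmap.shape, max_pool(fmap).shape)     # (28, 28) (14, 14)
```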

At the end of the network, a fully connected layer is implemented much the same way as in a classic ANN. Its output can be calculated the same way as well, through a matrix multiplication and a bias offset, and takes the form of a vector. For example, if 6 classes are to be classified, the output will have the form shown in equation 5, where each number represents the degree to which the input belongs to the corresponding class. The output in equation 5 belongs to class 1 to a degree of 10%, to class 2 to a degree of 20%, and so on.

Output = [0.1, 0.2, 0.05, 0.7, 0.3, 0.0] (5)

Since the matrix multiplication of this layer utilizes a lot of resources, Ning Li et al. propose in their work [3] a solution called global summation. It is based on the same technique as global average pooling: instead of a fully connected layer, it requires as many feature maps as there are classes to be detected. This means it requires fewer resources, since only accumulators need to be utilized, and it also reduces overfitting.

To reduce the risk of overfitting, since a smaller training set will be used, a technique called dropout may be used. Dropout is performed by actively excluding nodes during training: every node has a probability of (1 - p) of being excluded. If excluded, the node and all its connections are removed during training and then included again afterwards. The training itself may be performed, as with an ANN, through Gradient Descent (GD).
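A sketch of dropout as described; the 1/p_keep scaling ("inverted dropout") is my addition, chosen so the expected activation stays unchanged and nothing special is needed at inference time.

```python
import numpy as np

def dropout(activations, p_keep, rng, training=True):
    # Each node is excluded with probability (1 - p_keep) during training
    # and included again afterwards.
    if not training:
        return activations
    mask = rng.random(activations.shape) < p_keep
    return activations * mask / p_keep

rng = np.random.default_rng(0)
print(dropout(np.ones(8), p_keep=0.5, rng=rng))
```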

3.1.2 Recurrent Deep Convolutional Neural Network

Figure 3: Depiction of the recurrence of an RDCNN, from [3].

An RDCNN is implemented the same way as a DCNN, with a minor alteration in the convolutional layers. Instead of being convolved once in each layer, the input is convolved a pre-defined number of times by the same filter F. The result from the convolution is then convolved again by another filter f and added to the next convolution by filter F. A depiction of the method, with three steps, can be viewed in figure 3, and the corresponding equations can be seen in equations 6, 7 and 8. It is based on the work of Ning Li et al. in [3].

    Output_{t=0} = F * Input    (6)

    Output_{t=1} = f * (F * Input) + F * Input    (7)

    Output_{t=2} = f * (f * (F * Input) + F * Input) + F * Input    (8)
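Equations 6-8 can be written as a short loop; this sketch uses scipy's 2D convolution with "same" padding purely for illustration, so the shapes align across steps.

```python
import numpy as np
from scipy.signal import convolve2d

def recurrent_conv(x, F, f, steps=3):
    # Equations 6-8: filter F is applied to the input once; at each further
    # step filter f re-convolves the running result and the F-path is added.
    conv = lambda a, k: convolve2d(a, k, mode="same")
    base = conv(x, F)                  # Output_{t=0} = F * Input
    out = base
    for _ in range(steps - 1):
        out = conv(out, f) + base      # Output_{t+1} = f * Output_t + F * Input
    return out

rng = np.random.default_rng(0)
print(recurrent_conv(rng.random((8, 8)),
                     rng.standard_normal((3, 3)),
                     rng.standard_normal((3, 3))).shape)  # (8, 8)
```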

3.2 Training

To achieve low error rates, it is recommended that a CNN be trained on a massive database of images. This is very time consuming, and therefore two approaches will be tested and evaluated in this thesis. First, a network that has been trained on a big database, such as the ImageNet database [14], will be implemented. The end product that Unibap works towards will operate in simplified surroundings where the objects are more easily recognized. This is why the second implementation will train on a small data set and be evaluated to see whether it is feasible to minimize the training. The two methods will then be compared and analyzed.

3.2.1 Backpropagation - Fully connected layers

To train the DCNN, a few different steps are required depending on which layer is being trained. In the fully connected layers, the backpropagation method is implemented. First the error, or cost function, denoted E(y^L), at the output layer needs to be calculated. This is done with the squared-error loss function, which can be viewed in equation 9:

    E^N = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{c} (target^n_k - y^n_k)^2    (9)


where N is the number of training examples, c is the number of classes to be identified, target^n_k is the target of class k for the n:th training example, and y^n_k is the actual output from the last layer for training example n's belonging to class k. Since the squared-error loss function is just a sum of individual errors across the training dataset, it can be simplified to a single training example, as shown in equation 10.

    E(y^L) = \frac{1}{2} \sum_{k}^{c} (target_k - y_k)^2    (10)

The partial derivative at the output layer is simply the derivative of the error function, as seen in equation 11.

    \frac{\partial E}{\partial y^L_i} = \frac{d}{dy^L_i} E(y^L)    (11)

After this, the partial derivatives of the error, commonly known as deltas, need to be calculated for each input to the current neuron, as in equation 12,

    \frac{\partial E}{\partial x^l_j} = \sigma'(x^l_j) \frac{\partial E}{\partial y^l_j}    (12)

where \partial E / \partial x^l_j is the delta for input x^l_j to the current neuron. This is done for all neurons. Once this is done, the errors at the previous layer need to be calculated; in other words, the error is backpropagated. This is done by equation 13,

    \frac{\partial E}{\partial y^{l-1}_i} = \sum_j w^{l-1}_{ij} \frac{\partial E}{\partial x^l_j}    (13)

where w^{l-1}_{ij} is the weight connected to the input x^l_j in the next layer. Equations 12 and 13 are then repeated through all fully connected layers in the network until the input to the first fully connected layer is reached. After this, the gradients for all of the weights in the fully connected part of the network are known. Each gradient is then multiplied with the negative learning rate and added to the corresponding weight, and thus the higher reasoning, or dense layers, of the network has trained on one training example. Equation 14 shows the term added to the weights,

    \Delta w^{l-1}_{ij} = -\eta \frac{\partial E}{\partial y^{l-1}_i}    (14)

where \eta is the learning rate.
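A sketch of one backpropagation step through a fully connected layer following equations 10-14; note that the weight update below uses the standard outer-product form of the weight gradient, an assumption that goes slightly beyond the notation of equation 14.

```python
import numpy as np

def fc_backward(delta_y, x, y_prev, w, sigma_prime, eta=0.01):
    # delta_y is dE/dy at this layer; for the output layer it is the
    # derivative of the squared-error loss, i.e. (y - target).
    delta_x = sigma_prime(x) * delta_y       # equation 12
    delta_y_prev = w.T @ delta_x             # equation 13: backpropagate
    w -= eta * np.outer(delta_x, y_prev)     # gradient-descent update (14)
    return delta_y_prev, w

relu_prime = lambda x: (x > 0).astype(float)
rng = np.random.default_rng(0)
w = rng.standard_normal((3, 5))
d_prev, w = fc_backward(rng.standard_normal(3), rng.standard_normal(3),
                        rng.random(5), w, relu_prime)
print(d_prev.shape)  # (5,)
```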

3.2.2 Backpropagation - Max pooling layers

Since the max pooling layers do not actually perform any calculation, but rather pick the neuron in the preceding layer with the highest activation, they do not perform any learning at all. This means that the error is simply forwarded to the position where the highest activation was found.
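Because max pooling has no weights, its backward pass just routes each gradient to the winning position; a short sketch makes this concrete.

```python
import numpy as np

def maxpool_backward(grad_out, fmap, size=2):
    # Forward each error to the position that held the highest activation
    # in the forward pass; every other position gets zero.
    grad_in = np.zeros(fmap.shape)
    for i in range(grad_out.shape[0]):
        for j in range(grad_out.shape[1]):
            win = fmap[i*size:(i+1)*size, j*size:(j+1)*size]
            r, c = np.unravel_index(np.argmax(win), win.shape)
            grad_in[i*size + r, j*size + c] = grad_out[i, j]
    return grad_in
```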

3.2.3 Backpropagation - Convolutional layers

The backpropagation in the convolutional layers differs from that in the fully connected layers. The error in a convolutional layer is known from the layers succeeding it. First, as in the fully connected layers, the gradient for each weight needs to be calculated for the current layer. To do this, the chain rule is utilized, and the sum must run over the contributions of all expressions in which the variable occurs. Since the convolutional layer shares weights, every single x^l_{ij} expression that includes the weight w_{ab} must be included, as in equation 15.

    \frac{\partial E}{\partial w_{ab}} = \sum_{i=0}^{N-m} \sum_{j=0}^{N-m} \frac{\partial E}{\partial x^l_{ij}} \frac{\partial x^l_{ij}}{\partial w_{ab}}    (15)


By looking at the forward pass of the algorithm described in section 3.1.1, we already know that

    \frac{\partial x^l_{ij}}{\partial w_{ab}} = y^{l-1}_{(i+a)(j+b)}    (16)

and therefore get equation 17.

    \frac{\partial E}{\partial w_{ab}} = \sum_{i=0}^{N-m} \sum_{j=0}^{N-m} \frac{\partial E}{\partial x^l_{ij}} \, y^{l-1}_{(i+a)(j+b)}    (17)

In order to calculate the gradient, the value of \partial E / \partial x^l_{ij} must be known. This can be calculated by applying the chain rule again, as in equation 18.

    \frac{\partial E}{\partial x^l_{ij}} = \frac{\partial E}{\partial y^l_{ij}} \frac{\partial y^l_{ij}}{\partial x^l_{ij}} = \frac{\partial E}{\partial y^l_{ij}} \frac{\partial}{\partial x^l_{ij}} \sigma(x^l_{ij}) = \frac{\partial E}{\partial y^l_{ij}} \sigma'(x^l_{ij})    (18)

Since we already know the error at the current layer, the deltas can be calculated easily by taking the derivative of the activation function. The activation function, which is max(0, x^l_{ij}), can only give a derivative of one or zero, except at x^l_{ij} = 0 where its derivative is undefined. After this, the error needs to be propagated back to the previous layer. Once again this is achieved by applying the chain rule, as seen in equation 19.

    \frac{\partial E}{\partial y^{l-1}_{ij}} = \sum_{a=0}^{m-1} \sum_{b=0}^{m-1} \frac{\partial E}{\partial x^l_{(i-a)(j-b)}} \frac{\partial x^l_{(i-a)(j-b)}}{\partial y^{l-1}_{ij}} = \sum_{a=0}^{m-1} \sum_{b=0}^{m-1} \frac{\partial E}{\partial x^l_{(i-a)(j-b)}} w_{ab}    (19)

Looking at this equation, we can see that it is a convolution where w_{ab} has been flipped along both axes. It is also important to note that this will not work for the top-most and left-most values; it is therefore necessary to pad the top and the left with zeros.

Another thing to note is that when the convolutional layer closest to the input is trained, the equations need to be expanded to three dimensions, since three channels (RGB) are summed from the input.
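Equation 17 in code form, for a single 2D feature map; the deltas ∂E/∂x are assumed to be already computed via equation 18, and the sizes are placeholders.

```python
import numpy as np

def conv_weight_grad(delta_x, y_prev, m):
    # Equation 17: because weights are shared, each w_ab accumulates a
    # contribution from every position (i, j) at which it touched the input.
    D = delta_x.shape[0]            # delta_x is (N - m + 1) x (N - m + 1)
    grad = np.zeros((m, m))
    for a in range(m):
        for b in range(m):
            grad[a, b] = np.sum(delta_x * y_prev[a:a+D, b:b+D])
    return grad

rng = np.random.default_rng(0)
print(conv_weight_grad(rng.standard_normal((24, 24)),
                       rng.standard_normal((28, 28)), m=5).shape)  # (5, 5)
```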

3.2.4 Training data

Unibap wants to deploy the work of this thesis in an industrial environment, where it is supposed to recognize a small number of objects in a simplified setting. These will be targeted objects for different industries, and therefore no training data sets exist for them. A training data set will be created by taking pictures of the objects from different angles. The objects will then be digitally placed in different rotations and on different backgrounds, and these images will be utilized as the training data set.

3.3 Implementation

The implementation on the FPGA will only contain the forward pass. This is because it is more productive to train the network on external hardware: the training can be sped up by utilizing server parks or supercomputers. After training, the only part needed on the FPGA for recognition is the forward pass. This will be implemented and simulated for an FPGA, and the results and timing will be analyzed and reported.

3.4 Analysis

The purpose of this work is to operate on targeted objects in an industrial environment. One example is that the hardware might be fitted to an assembly line, where this work would be used to identify a part that is packaged with identical parts. Therefore, it is not necessary to identify the object amongst different objects, but rather to make sure that just one of the objects is recognized. This entails that there are no previous results to compare against. Instead, the analysis focuses on the usefulness this work will have in its intended environments.


4 Hardware

The intended platform is provided by Unibap and is named IVS-70 [15]. The IVS-70 is fitted with two cameras and has stereovision. Through disparity mapping, it has the ability to estimate the depth, or distance, to the objects it sees. By always situating the camera at the same distance from the plane the objects are placed on, all of the objects meant for detection will always be at the same distance. Because of this, all of the objects will always appear the same size, and the work in this thesis does not have to implement any kind of scale invariance. The IVS-70 includes a GPU, a Central Processing Unit (CPU) and an FPGA.

4.1 Cameras

The IVS-70 is fitted with two color or monochrome 5.2-megapixel cameras with a resolution of 2560x2048. The frame rate is up to 25 frames per second when utilizing the full 5.2 megapixels, and up to 50 frames per second when utilizing 1 megapixel. They have a global shutter with programmable exposure time.

4.2 Field Programmable Gate Array

The FPGA in the IVS-70 is a SmartFusion2 M2S050T [16] with a 166 MHz ARM Cortex-M3 processor. Its logic elements each consist of a 4-input LUT and one DFF, and there are a total of 56,340 of these logic elements. It also contains 72 math blocks, which are used for multiplications. The multiplications can be performed on 17x17 unsigned variables or on 18x18 signed variables. The SmartFusion2 contains a total of 1,314 kbit of RAM.

4.3 System on Chip

On the IVS-70 there is a System On Chip (SOC) that integrates both a CPU and a GPU on one chip: the AMD G-series GX-415 SOC [17]. The CPU in the SOC has 4 cores and runs at a frequency of 1.5 GHz, while the GPU consists of 2 CUs and operates at 500 MHz.


5 Implementation

5.1 Testing

During the testing phase, the program Tensorflow [12] was used to figure out which configurations work best, and two tests were performed. The first test was performed on the MNIST database [18] and the second on a custom-created database.

5.1.1 MNIST

The MNIST database is a database of handwritten digits. It has a training set containing 60,000 examples and a test set containing 10,000 examples. It consists of size-normalized and centered images taken from the NIST database. The pictures were normalized to fit in a 20x20 pixel box and then centered in a 28x28 image by computing the center of mass of the pixels and translating it to the center of the image. This is the only size that was tested on MNIST. Some examples of the images in the MNIST database can be viewed in figure 4.

Figure 4: An example of MNIST images from [12].

5.1.2 Custom database

The custom database consists of images depicting different components of the IVS-70, taken on a white background from different angles. Examples of images from the custom database can be viewed in figure 5. The custom database contains 7,192 images, and the tests were performed with a leave-one-out style method: a random set of the images is used for validation, and this set is interchanged each epoch. The tests were performed on 3 different image sizes, namely 28x28, 56x56 and 112x112, and were then compared by validating them on 1,000 images.

5.1.3 Test setup

To get an idea of what the optimal configuration is, four test setups were investigated. All the tests involved two convolutional layers and one fully connected layer, due to the good performance of this structure on the MNIST database, which was used as a comparison to the custom database. The MNIST database was tested on 2D gray-scale images, while the custom database was tested on 3D RGB color images. In test one, the first convolutional layer has 32 feature maps and the second 64 feature maps, and the fully connected layer is built with 1024 nodes. In test two, the first convolutional layer consists of 16 feature maps and the second of 32 feature maps, with a fully connected layer of 512 nodes. In tests three and four, the convolutional layers are the same as in test two, but the fully connected layer consists of 300 and 100 nodes respectively. Every test was performed with 10,000 iterations, and the results of the tests can be viewed in table 1.

(a) Example one from custom database  (b) Example two from custom database

Figure 5: Two images from the custom database

Table 1: Test results (%)

Test        MNIST   Custom 28x28   Custom 56x56   Custom 112x112
Test one    98.96   93.1           74.2           86.3
Test two    99.32   95.4           78.1           88.8
Test three  99.2    91.8           88.9           86.7
Test four   99      92.6           91.8           84.2

As can be seen from the results, both the MNIST and the custom database perform best in test two, and in the case of the custom database at the size of 28x28. It should be noted that more training images and more iterations might improve the results on the custom database. The results also show that the MNIST accuracy is slightly better than that of the custom database, which might be because of the substantially smaller size of the custom database. The results show that the custom database gives acceptable accuracy for its intended use, and the expansion of the database is therefore left for future work.
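As an illustration, the test-two configuration could be expressed in Tensorflow roughly as follows. The 5x5 kernel size, ReLU activations and 10-class output are assumptions made for this sketch (modern Keras API, not the exact code used in the thesis).

```python
import tensorflow as tf

# Test two: 16 then 32 feature maps, each followed by 2x2 max pooling,
# and a 512-node fully connected layer before the classifier.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(16, 5, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(32, 5, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```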

5.2 FPGA implementation

During the course of this project, Unibap gave notice that a proper database would be provided, although late in the project. This led to the decision to implement the core modules of the network and verify their functionality, so that a network of any size can be constructed. Because of this, all of the test setups are presented in regard to the resources required as well as timing.

The first module is the multiplication and adder module. The multiplications in the network are fairly straightforward, and since there are 72 math blocks on the IVS-70, the number of multiplications per clock cycle is limited to 72. There is, however, a small problem with the adder modules. Since the addition blocks on the FPGA can only take two inputs, additions cost time: for example, adding 5 inputs takes 3 clock cycles, as can be viewed in figure 6. This is not the case in the multiplication and adder module itself, which only adds a bias to the result of the multiplication; the problem arises after that module, where all the results from the kernel need to be added together.

Figure 6: Adder tree.

Figure 7: Timing of module.

Because of this, the additions in the network will be pipelined. This means that the first five values to be added are input into the adder tree, visible in figure 6, at clock cycle one. On clock cycle two these are input into the next level of the adder tree, while the next 5 values to be added are input into the first level. Even though the first 5 values take 3 clock cycles to add, each subsequent addition only takes one extra clock cycle. This means that dataset n takes 2 + n clock cycles to compute.
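The cycle counts above follow from a simple recurrence, sketched here: each level of two-input adders halves (rounding up) the number of values still to combine, and once the pipeline is full each further dataset costs one cycle.

```python
def adder_tree_levels(n_inputs):
    # Each clock cycle, pairs of values are added; an odd value is piped
    # through to the next level unchanged.
    levels = 0
    while n_inputs > 1:
        n_inputs = (n_inputs + 1) // 2
        levels += 1
    return levels

def pipelined_cycles(n_datasets, n_inputs=5):
    # First result after `levels` cycles, then one more cycle per dataset.
    return adder_tree_levels(n_inputs) - 1 + n_datasets

print(adder_tree_levels(5))                      # 3 cycles for a 5-input sum
print(pipelined_cycles(1), pipelined_cycles(4))  # 3 and 6: dataset n costs 2 + n
```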

The next two modules are simple. Since the activation function used is the RLU, a single comparison block is enough to implement it. The max-pooling compares four values, two per clock-cycle, and outputs the highest; this can also be done with one comparison block. Both of these modules produce their result in one clock-cycle.

One decision made to simplify the first part of the network (the convolutional layers, the RLU and the max-pooling) was to implement a specific module that works on a per-kernel basis. This means that the FPGA will not be used to its full potential during this stage: with the chosen kernel size of 5x5, only two of these modules can be implemented simultaneously, since three of them would use 75 math blocks, which is not possible on the current FPGA. These modules input the first four nodes, namely the ones relevant to the first max-pooling, as a bit-sequence. This entails that the bit values from the image are not sent in left to right, but rather in the squares dictated by the max-pooling function. An example of this module, running fully through from kernel via the add-tree and RLU to max-pooling and outputting the values of two such passes, may be viewed in figure 7; the explanation of each row is given in table 2. In this example the values are stored as 4-bit signed integers.

As visible in the figure, the first pass, from the inputs of four different kernel positions to max-pooling output, takes 9 clock-cycles, while the second output from the second pass only takes 4 clock-cycles. This indicates that the pipelining is working. The figure also confirms that the correct results are produced by the multiplication, the add-tree and the max-pooling. In the first cycle the bitstreams in rows 2, 3 and 4 indicate that the first 4 bits in the three bit-streams representing the pixel values of the image, the weights and the biases all have the bit-value 0001, or 1 in decimal. For simplicity this is repeated for all 25 positions in the kernel. In rows 5 and 7 we can see that the result of the multiplication with the added bias is 2 for all positions, which can be verified by 1 * 1 + 1 = 2. In the next cycle the beginning of the pipelining is visible, where the next kernel inputs are calculated and forwarded to the add-tree with the result 6, which can be verified in the same manner. Since the add-tree needs to add 25 values together, it is known that the first layer of the add-tree consists of 12 additions plus one odd value that is pipelined along with the rest of the tree to achieve optimal timing. This means that 4 clock-cycles are needed to complete all of the additions. As depicted in the figure, four cycles after receiving the first results the tree outputs the value 50. This can be verified by 2 * 25 = 50, which shows that the obtained result is correct. Furthermore, each subsequent calculation takes only one clock-cycle, which verifies that the pipelining works as intended. The bottom row displays the current maximum in the max-pooling module. It clearly shows that the largest value obtained from the 4 positions in the image, 150, replaces the lower value of 50 and is output after 9 clock-cycles, as indicated by row 8.

Table 2: Row explanations

Row  Name                   Explanation
1    Clock                  The clock-cycles (rising edges)
2    Image input            The bitstream of the 25 pixel values from the image
3    Weight input           The weights stored in the FPGA
4    Bias input             The biases stored in the FPGA
5    Result/Add in          The result of the multiplications, and the input to the add-tree
6    Add out/Max-pool in    The result of the add-tree, and the input to the max-pooling
7    Add in decimal         The add-tree input in decimal values
8    Max-pool out           The result of the max-pooling
9    Current max            The current maximum value in the max-pooling module
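The arithmetic of the walkthrough can be reproduced in a few lines. This is only a sanity check of the simulated values, not FPGA code; the two unnamed kernel-position results (here 100 and 75) are illustrative fillers, since the text only gives 50 and 150.

```python
def relu(x: int) -> int:
    # The RLU block: a single comparison against zero.
    return x if x > 0 else 0

# One kernel position: 25 products of weight * pixel, each with a bias added.
products = [1 * 1 + 1 for _ in range(25)]
print(sum(products))   # 50 = 2 * 25, the add-tree output in the figure

# Max-pooling over the four kernel positions of one window; 150 replaces
# the earlier maximum of 50, as shown in rows 8 and 9 of the simulation.
positions = [50, 150, 100, 75]
print(max(relu(v) for v in positions))   # 150
```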

5.2.1 Resources

The resources utilized by this implementation are described in detail for test two on the custom database with an image resolution of 28x28, and summarized for the other test setups. It is worth mentioning that the software utilized in the tests, Tensorflow [12], only works on 32-bit float values. The FPGA, on the other hand, performs all of the multiplications on 18-bit values regardless of the size specified for the values, which implies that the most efficient representation is 17-bit signed integers. These values are intended as fixed-point numbers where the 6 least significant bits represent the fractional part of the value. This does not change the mathematics in the FPGA.
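As a sketch of this fixed-point convention, a real value x is stored as the signed integer round(x * 2^6), so the 6 least significant bits carry the fraction. The rounding mode is an assumption; the thesis does not specify one.

```python
FRACTIONAL_BITS = 6
SCALE = 1 << FRACTIONAL_BITS   # 64

def to_fixed(x: float) -> int:
    # Quantize to the nearest representable fixed-point value.
    return round(x * SCALE)

def from_fixed(n: int) -> float:
    return n / SCALE

print(to_fixed(0.734375), from_fixed(to_fixed(0.734375)))  # 47 0.734375
print(from_fixed(to_fixed(0.01)))  # 0.015625: quantization error at 6 bits
```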

As mentioned in section 5.1.3, test two has a configuration where the first convolutional layer has 16 feature maps. This means that this layer needs 25 * 16 weights. However, since the network handles 3D images this needs to be multiplied by three, giving 1200 required values. The layer also needs as many biases as feature maps, which sums up to 1216 values. The RLU and the max-pooling do not require any memory, but afterwards the new image values, now 14 * 14 = 196 per feature map, need to be stored. This means that 196 * 16 = 3136 values are required. By the same principle, except that the input is now treated as 2D, the next convolutional layer, with 32 feature maps, requires 25 * 32 + 32 = 832 values. After the second convolutional layer the new image representations need to be stored in 7 * 7 * 32 = 1568 values. For the fully connected layers the required number of values can be calculated by equation 20.

FC1 = SizeOfImageValues * NrOfFeatureMaps * NrOfNodes + NrOfNodes
FC2 = NrOfNodes * NrOfOutputs + NrOfOutputs
NrOfValuesRequired = FC1 + FC2    (20)

Since the network for the custom database has 512 nodes and 4 outputs, this comes to a total of 812,132 values needed for the network. If all of these values were stored internally on the FPGA at 18-bit precision, this would require 14,618,376 bits, or 14,618.376 kbit. As mentioned, the FPGA only has 1,314 kbit of RAM memory, so this is not possible. A summary of the resource requirements for all the tests can be viewed in table 3.
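The count above can be reproduced directly. The sketch below follows the thesis's own counting scheme, including treating the input of the second convolutional layer as 2D; the function and variable names are illustrative.

```python
def conv_values(kernel: int, in_channels: int, feature_maps: int) -> int:
    # Weights per feature map times input channels, plus one bias per map.
    return kernel * kernel * in_channels * feature_maps + feature_maps

conv1 = conv_values(5, 3, 16)   # 1216 weights and biases
fmap1 = 14 * 14 * 16            # 3136 stored activations after max-pooling
conv2 = conv_values(5, 1, 32)   # 832 (input counted as 2D, as stated above)
fmap2 = 7 * 7 * 32              # 1568 stored activations
fc1 = fmap2 * 512 + 512         # 803328, FC1 in equation 20
fc2 = 512 * 4 + 4               # 2052, FC2 in equation 20

total = conv1 + fmap1 + conv2 + fmap2 + fc1 + fc2
print(total)                    # 812132 values
print(total * 18 / 1000)        # 14618.376 kbit at 18-bit storage
```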


5.2.2 Timing

The timing is reported in the same way as the resources, with test two described in detail and the rest summarized. Since only two kernel modules may be implemented at the same time, this gives an indication of the time needed. The image is 28x28 and there are three channels, as the images are RGB, so the modules need to run 14 * 3 * 16 = 672 times. As described earlier in this section, the first pass takes 9 clock-cycles while the subsequent passes take 4 clock-cycles each. This means that 9 + (671 * 4) = 2693 clock-cycles pass before the second convolutional layer is reached. For the second convolutional layer the same principle applies, with the difference that the pipelining may continue, so every calculation takes 4 clock-cycles. This means that reaching the fully connected layer from convolutional layer two requires 7 * 16 * 32 * 4 = 14,336 clock-cycles. The fully connected layer calculates 72 multiplications per clock-cycle. The number of clock-cycles needed to calculate the initial values of the fully connected layers is given by equation 21.

ClkInitial = ceil((FC1 - NrOfNodes) / 72)    (21)

In this case this amounts to 11,151 clock-cycles. During this phase the results for each node need to be added together, meaning that 1568 values must be summed per node. If this is pipelined, the first addition takes 10 clock-cycles and each subsequent one a single clock-cycle, starting as soon as 1568 values have been produced by the multiplication module. However, producing them takes 22 clock-cycles, which is longer than the addition takes, so it is not necessary to pipeline the addition in this configuration. Instead the addition finishes 10 clock-cycles after the last node's inputs are calculated. The final calculation, which is the output from the fully connected layer and the input to the classification layer, consists of 512 * 4 = 2048 multiplications. This requires 29 passes through the multiplication modules; conveniently, these can start before the last 10-cycle addition of the previous layer finishes, so those 10 cycles may be omitted from the total time. The 29 passes take 29 clock-cycles. When 512 results have been obtained, which happens after 8 cycles, the first add-tree may commence; one such add-tree also takes 8 cycles, so the last addition finishes 8 cycles after the last multiplication. Summing all of these cycles gives a total of 28,217 clock-cycles. Since the FPGA has a maximum clock rate of 20 MHz, which translates to 0.00005 ms per cycle, one full forward pass through the network takes 1.41 ms. This corresponds to roughly 709 frames per second, far more than the IVS-70 can supply. The rest of the tests can be viewed in table 3.
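The 28,217-cycle total can be rebuilt term by term. A minimal sketch of the test-two timing budget, with variable names of our own choosing:

```python
from math import ceil

conv1 = 9 + (14 * 3 * 16 - 1) * 4   # 2693: first pass 9 cycles, rest 4 each
conv2 = 7 * 16 * 32 * 4             # 14336: pipelining continues throughout
fc1 = ceil((803328 - 512) / 72)     # 11151: equation 21, 72 mults per cycle
fc2 = ceil(512 * 4 / 72) + 8        # 37: 29 multiplication passes + add-tree

total = conv1 + conv2 + fc1 + fc2
print(total)                        # 28217 clock-cycles

ms_per_cycle = 1 / 20e6 * 1e3       # 20 MHz clock -> 0.00005 ms per cycle
print(round(total * ms_per_cycle, 2))        # 1.41 ms per forward pass
print(round(1000 / (total * ms_per_cycle)))  # ~709 frames per second
```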


Table 3: Test timing and resource demands

Test  Database             Clock-cycles     ms   Memory, 18-bit   Memory, 8-bit   Memory, 4-bit
                                                 (kbit)           (kbit)          (kbit)
1     MNIST/Custom 28x28        107,393   5.37   58,249/58,138    25,888/25,839   12,944/12,920
      Custom 56x56              303,916  15.20   232,054          103,135         51,567
      Custom 112x112            964,567  48.23   927,719          412,320         206,160
2     MNIST/Custom 28x28         28,217   1.41   14,674/14,618    6,522/6,497     3,261/3,248
      Custom 56x56               78,692   3.93   58,224           25,878          12,939
      Custom 112x112            246,542  12.33   232,649          103,399         51,700
3     MNIST/Custom 28x28         23,587   1.18   8,648/8,616      3,844/3,829     1,922/1,915
      Custom 56x56               60,211   3.01   34,271           15,232          7,616
      Custom 112x112            172,659   8.63   136,854          60,824          30,412
4     MNIST/Custom 28x28         19,220   0.96   2,964/2,953      1,317/1,312     659/656
      Custom 56x56               42,777   2.14   11,674           5,189           2,594
      Custom 112x112            102,958   5.14   46,559           20,693          10,346

6 Results

This section summarizes the results of this thesis. Although the FPGA implementation was only simulated, the simulation verified that the network behaves as expected. The majority of the results can be viewed in the table in the previous section. The results leave a lot to be desired, but unfortunately there was not enough time to explore all the possible solutions.

6.1 General research problem

What methods are feasible when implementing object recognition on a FPGA and which of thesemethods are the most suitable?

6.1.1 Question 1

The first question formulated was: What limitations does the FPGA entail? It is quite clear that when working with the IVS-70 the major obstacles are the number of multiplication blocks and the size of the internal memory. Furthermore, the limited size of the variables in the FPGA will probably lead to a loss in accuracy. Otherwise, FPGAs seem suitable for this purpose.

6.1.2 Question 2

The second question was formulated as: Considering the limitations in question 1, what type of ANN is suitable? Quite early on, both the research itself and indications from Unibap pointed to a CNN as a suitable choice, owing to its resource sharing and the accuracy proven in other works. There are, however, a few different methods explained in section 2 that were not properly investigated.

6.1.3 Question 3

Question three was formulated as: Is it possible to fit the entire network on the FPGA? This was investigated and the answer is yes, it is quite possible to fit an entire network on a FPGA. There are, however, some strong limitations on the network, especially on the IVS-70: the network is quite small and restricted in image dimensions, layer size, depth and variable size.


7 Discussions

This section discusses the results of this thesis, with extra attention to the questions formulated in the beginning.

Firstly, it should be pointed out that a lot of time in this thesis was spent learning the languages and programs pertaining to the specific hardware supplied by Unibap. Because of this the testing phase suffered, in the form of less time for training and simulating. The decision to simulate the hardware implementation instead of synthesizing it was taken to make sure that at least an accurate result, with regard to the simulation performed, could be achieved. In the beginning it was decided that the network would focus on accuracy on a major database, but after discussions with Unibap it was instead decided that a custom database would be tested as well. This further diminished the time for testing, since considerable time was put into creating the database. A while into the work Unibap announced that they would create a database that this thesis should include. It was later discovered that this database would not be ready until the very end, and therefore the custom database included in this thesis is not as extensive as it should be. The results of this thesis leave a lot to be desired, but they do lay a solid base for future work.

7.1 Question 1

Even though the FPGA in the IVS-70 imposes many restrictions on the network, Unibap is working on upgrading their hardware to a larger FPGA, which could allow the network to expand to larger dimensions. This is one of the reasons why tests of networks that do not fit on the current hardware are included in this thesis. The results may be moot as of today, but in the near future they will be relevant for the new hardware that is to be acquired. The new hardware will have more internal memory and more mathematical blocks, and further testing of network configurations will be necessary to find the most suitable one.

7.2 Question 2

The decision to implement a CNN would stand even after the hardware update. There are, however, some methods that could be relevant to test and that might improve accuracy and reduce the necessary resources, especially the global summation method mentioned in section 3. There also exist many other networks that, due to time restrictions, were never explored. Exploring them would have been very relevant to this thesis, but covering them all is impossible within the timeframe of this work. Other methods that would be interesting to explore are the RCNN and the average pooling technique, both omitted for lack of time.

7.3 Question 3

This is a very interesting part of this work. It is shown in section 5.2.1 that an entire network can fit on a FPGA. It is, however, a very small network that could be bigger if a better FPGA were utilized. There are probably quite a lot of optimizations that could be made, since the author of this work is relatively new to the Verilog language and to many concepts pertaining to FPGA programming. The result nevertheless shows that it is possible, and with someone more experienced both the time and the resource demands could perhaps be reduced.


8 Future Work

As mentioned before, the main thing to consider in future work is to implement and try methods other than the ones included in this work. There are many unexplored methods, such as the global summation method and recurrency in CNNs. Furthermore, testing different learning algorithms would be an expected part of future work. All of the tests in this work are built on the backpropagation method with a learning step of 0.01 over 10,000 iterations; it would be important to try different learning steps, numbers of iterations and algorithms. It would be especially important to investigate the backpropagation through time algorithm, since this work is ultimately intended for a live video stream. There are also completely different networks that were never considered for this work; to mention a few, there are deep belief networks, the neural history compressor and deep Boltzmann machines.

Another important part of future work would be to synthesize the network on a FPGA and confirm that everything works as intended, as well as to make a deep study of optimization in FPGA programming. There are many improvements to be made after the hardware upgrade, including testing different configurations of the method investigated in this work, since it only lays the foundation for bigger networks.

The custom database included in this work is something else that could be improved in the future. There is a whole area to explore concerning filtering, composition and the number of images required. Even though Unibap is currently working on producing such a database, some background work and tests could be included in future work. Another thing to consider for future work is a comparison between a GPU and a FPGA. This was intended to be part of this work, but since GPU programming is new to the author, the decision was taken to leave it for future work.

Unibap started two similar thesis collaborations, of which this was the first. The second one concentrates more on accuracy than this work. A merger of the two works would therefore be something to consider for future work, since this is the desire of Unibap.


9 Conclusion

The focus of this work was to investigate and answer the following questions: What methods are feasible when implementing object recognition on a FPGA and which of these methods are the most suitable? What limitations does the FPGA entail? Considering the limitations mentioned in question 1, what type of ANN is suitable? And is it possible to fit the entire network on the FPGA? This work mainly focused on finding a suitable method for FPGA implementation which could be applied in today's industry. Such an application could increase the productivity and speed of, for example, assembly, which according to Unibap's beliefs would revolutionize today's industry. However, because of time constraints many methods were bypassed without further investigation. A certain network was nevertheless decided upon in tandem with Unibap, and configurations of this network were tested with acceptable results. It was discovered that the CNN is suitable for FPGA implementation due to its resource sharing and accuracy. This network was implemented in a simple way on a FPGA, which showed that it is possible to implement the network on the hardware.

It was discovered that the method presented in this work has acceptable accuracy for the intended use and that the network can indeed fit entirely on a FPGA. It was also discovered that this network would be severely restricted in its configuration, but this could improve when the impending hardware update is realized.

A custom database was created and showed promising results, which could mean that a company interested in the end product may easily create smaller databases for their specific use with acceptable results.

It was also discovered that a huge part of this area is still unexplored and that much research could be done to improve the quality of this work, among other things by involving a more experienced FPGA developer.


10 Acknowledgments

I would like to thank my supervisors Ning Xiong and Lars Asplund for their help with ideas and their continuous support, their quickness to answer any questions and their genuine interest in this work.

I would also like to thank the co-workers at Unibap who have helped me in their areas of expertise.


References

[1] Y. Kaneko and K. Yada, "A deep learning approach for the prediction of retail store sales," in 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), Dec 2016, pp. 531–537.

[2] Unibap, "Unibap homepage," 2017, [Online; accessed 24-January-2017]. [Online]. Available: https://www.unibap.com/

[3] N. Li, S. Takaki, Y. Tomiokay, and H. Kitazawa, "A multistage dataflow implementation of a deep convolutional neural network based on fpga for high-speed object recognition," in 2016 IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI), March 2016, pp. 165–168.

[4] T. Cheng, P. Wen, and Y. Li, "Research status of artificial neural network and its application assumption in aviation," in 2016 12th International Conference on Computational Intelligence and Security (CIS), Dec 2016, pp. 407–410.

[5] M. Liang and X. Hu, "Recurrent convolutional neural network for object recognition," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 3367–3375.

[6] B. Ahn, "Real-time video object recognition using convolutional neural network," in 2015 International Joint Conference on Neural Networks (IJCNN), July 2015, pp. 1–7.

[7] S. Anwar, K. Hwang, and W. Sung, "Fixed point optimization of deep convolutional neural networks for object recognition," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2015, pp. 1131–1135.

[8] A. Venkadesan, S. Himavathi, K. Sedhuraman, and A. Muthuramalingam, "Design and field programmable gate array implementation of cascade neural network based flux estimator for speed estimation in induction motor drives," IET Electric Power Applications, vol. 11, no. 1, pp. 121–131, 2017.

[9] S. Li, K. Choi, and Y. Lee, "Artificial neural network implementation in fpga: A case study," in 2016 International SoC Design Conference (ISOCC), 2016, pp. 297–298.

[10] J. Mazumdar and R. G. Harley, "Recurrent neural networks trained with backpropagation through time algorithm to estimate nonlinear load harmonic currents," IEEE Transactions on Industrial Electronics, vol. 55, no. 9, pp. 3484–3491, Sept 2008.

[11] D. Maturana and S. Scherer, "Voxnet: A 3d convolutional neural network for real-time object recognition," in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sept 2015, pp. 922–928.

[12] "Tensorflow homepage," 2017, [Online; accessed 24-January-2017]. [Online]. Available: https://www.tensorflow.org/

[13] P. Bezak, Y. R. Nikitin, and P. Bozek, "Robotic grasping system using convolutional neural networks," American Journal of Mechanical Engineering, vol. 2, no. 7, pp. 216–218, Oct 2014. [Online]. Available: http://pubs.sciepub.com/ajme/2/7/9

[14] S. U. Stanford Vision Lab, "Imagenet homepage," 2016, [Online; accessed 24-January-2017]. [Online]. Available: http://image-net.org/

[15] Intelligent Vision System IVS-70, Unibap, January 2016, rev: 0.19.

[16] Microsemi, "Smartfusion2 soc fpga family," 2017, [Online; accessed 24-January-2017]. [Online]. Available: https://www.microsemi.com/products/fpga-soc/soc-fpga/smartfusion2

[17] AMD, "1st and 2nd generation amd embedded g-series system-on-chip (soc)," 2015, [Online; accessed 24-January-2017]. [Online]. Available: https://www.amd.com/Documents/AMDGSeriesSOCProductBrief.pdf


[18] Y. LeCun, C. Cortes, and C. J. C. Burges, "The mnist database," [Online; accessed 24-January-2017]. [Online]. Available: http://yann.lecun.com/exdb/mnist/
