NEURAL NETWORK-BASED FACE DETECTOR IMPLEMENTATION
ON A VIRTEX2 PRO FPGA PLATFORM
by
Christos Kyrkou
Submitted to the University of Cyprus in partial fulfilment
of the requirements for the degree of Bachelor of Science in Computer
Engineering
Department of Electrical and Computer Engineering
May 2008
Examination Committee:
Theocharis Theocharides, Lecturer, Department of ECE, Advisor
Athinodoros Georghiades, Visiting Assistant Professor, Department of ECE, Committee Member
Abstract
Face detection is a key step towards face recognition and a vital task in security
and intelligent vision-based human-computer interaction applications. Current software
face detection implementations lack the computational ability to support detection in
real-time video streams. Hence the need for hardware implementations of face detection
systems arises. Hardware implementations are desirable not only because their speed
allows for real-time video processing, but also because they can be optimized to achieve
the best possible results in terms of area and power consumption.
This thesis focuses on the design of a hardware system on an FPGA platform for the
purpose of performing upright frontal view face detection on an image frame. The
proposed design consists of a neural network that performs face detection on a 320x240
input frame. The neural network receives a 20x20 search window from the image and
classifies the window as a face or a non-face. The system output is an image in which the
windows that were classified as faces are marked. An important part of this thesis is the
effective allocation of the FPGA's resources in order to design a system capable of
parallel processing of frames.
The weights and thresholds for the neural network were provided in collaboration
with Video Mining Incorporated. The weights and thresholds are for upright frontal views
and for a specific data set. The neural network training was done offline and the detection
is done online.
The implemented face detection system can process approximately 30 frames per
second, whereas software implementations process between 15-22 frames per second
under favourable conditions. The detection frame rate also indicates that the system can
perform face detection on real-time video streams.
Περίληψη

Face detection in images is a very important component of applications such as face
recognition, security applications and intelligent human-computer interaction
applications. Current software face detection implementations lack the computational
capability to support real-time detection. Hence the need arises for implementing face
detection systems in hardware. Hardware implementations are desirable not only because
of their speed, which allows real-time processing, but also because they can be optimized
to achieve the best possible results in terms of circuit area and power consumption.

This thesis focuses on the design of a hardware system on an FPGA platform for the
purpose of detecting frontal-view faces in an image frame. The proposed design consists
of a neural network that performs face detection on a 320x240 input frame. The neural
network receives a 20x20 search window from the image and classifies the window as a
face or a non-face. The output of the system is an image in which the windows classified
as faces are enclosed in a frame. An important part of this thesis is the effective
allocation of the FPGA's resources in order to design a system capable of parallel
processing of frames.

The weights and threshold values for the neural network were provided by the
company Video Mining. The weights and thresholds are for frontal-view faces and for a
specific environment. The implemented face detection system can process approximately
30 frames per second, whereas software implementations process between 15-22 frames
per second under favourable conditions. The detection frame rate also indicates that the
system can be used for face detection in real time.
Acknowledgements
I would like to thank my family for their support and understanding throughout the four
years of my studies at the university. For the successful completion of this project I
would first of all like to thank my project supervisor, Theocharis Theocharides, for his
cooperation, guidance and support during the course of completing this project. I would
also like to thank committee member Athinodoros Georghiades for the helpful advice
that he gave me and the useful remarks that he made, which allowed me to complete a
comprehensive report on the project.
Table of Contents
Chapter 1: Motivation
1.1. Face detection .......................................................... 1
1.2. Challenges of face detection ............................................ 1
1.3. General approach for face detection ..................................... 1
1.4. Software Vs Hardware face detection ..................................... 3
1.5. Challenges of hardware face detection ................................... 3
1.6. Contribution ............................................................ 4
Chapter 2: Fundamentals
2.1. Digital Image Processing ................................................ 4
2.2. Image Pyramid Generation ................................................ 7
2.3. Neural Networks ......................................................... 9
Chapter 3: Related Work
3.1. Past work on face detection ............................................ 13
3.2. Hardware face detection ................................................ 16
3.2.1. ASIC Implementation of a Neural Network Based Face Detection ...... 16
3.2.2. FPGA Implementation of a Neural Network Based Face Detection ...... 17
Chapter 4: FPGA Hardware Implementation
4.1. Xilinx Virtex2 Pro XC2VP30 FPGA ........................................ 21
4.2. VGA Controller ......................................................... 22
4.3. Image Pyramid Generation ............................................... 23
4.3.1. IPG Implementation Strategy ....................................... 23
4.3.2. IPG Architecture .................................................. 26
4.4. Neural Network ......................................................... 28
4.4.1. NN implementation strategy ........................................ 31
4.4.2. NN architecture ................................................... 32
4.5. Overall System architecture ............................................ 43
Chapter 5: Experimental methodology
5.1. Experimental strategy .................................................. 47
5.2. Experimental Setup ..................................................... 48
5.3. Results and discussion ................................................. 49
Chapter 6: Discussion - Future Work and Improvement
6.1. Conclusions and Discussion ............................................. 54
6.2. Improvement ............................................................ 56
6.3. Future Work ............................................................ 57
References .................................................................. 59
Appendix A .................................................................. 62
Appendix B .................................................................. 68
Chapter 1: Motivation
1.1. Face detection
Face detection is the process of identifying the locations in an image that contain a face,
regardless of position, orientation, scale or the environmental conditions in the image
[13]. Face detection plays a major role in applications such as security, robotics,
computer vision, multimedia, and intelligent vision-based human-computer interaction
applications. Moreover, it is important because it is the first step in other processes
such as face recognition, face tracking and monitoring.
1.2. Challenges of face detection
Face detection as a problem has many different challenges. Probably the most
important challenge is that a face can appear with many variations. First, there are the
different poses that a face can have according to the relative camera-face angle. A
human face can also have many different facial features such as beards, mustaches and
glasses, as well as various skin tones and shapes. The facial expression of the face is
also a condition that needs to be taken into consideration. Some other factors that make
face detection a challenging task are related to the setting of the image. For instance,
a face may not be fully visible because part of it is hidden by another object. Finally,
the environment in which the image is taken is important, as lighting and weather
conditions affect the way a face appears in the image [8].
1.3. General approach for face detection
In general, a face detection procedure consists of receiving an image frame and trying
to locate the image regions that contain a face. This is done by examining small image
regions, m x n search windows generated from the original source image, and
determining whether they contain a face.
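To give a sense of the workload this implies, the number of window positions in a single frame can be counted (an illustrative sketch; exhaustive pixel-by-pixel scanning with a stride of one is assumed here):

```python
def window_count(M, N, m, n):
    # Number of m x n search-window positions in an M x N image,
    # sliding the window one pixel at a time in each direction
    return (M - m + 1) * (N - n + 1)

print(window_count(240, 320, 20, 20))  # 66521 windows at the original scale
```

Every one of these windows must be classified, which is one reason software detectors struggle to keep up with real-time frame rates.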
A face detection process typically consists of three stages [13]. The first is the image
pyramid generation (IPG) stage. Its purpose is, upon receiving an image frame, to create
downscaled versions of that image at predefined sizes. This is required in order to cover
the case where a face is larger than the examined region and would thus not be detectable
unless the image is scaled down.
The second stage is the preprocessing stage. The goal of this stage is to filter out
noise and lighting variations in the examined region in order to increase the probability
of accurate detection.
The third stage is the detection stage. In this stage the detection algorithm is applied
and returns the faces found in the image. There are three major categories of detection
algorithms. The first is the feature-based approach, where the algorithm tries to find
features that denote a face even under lighting and pose variations. Such features include
texture, skin color and facial features such as the eyes or the mouth. The second category
is the template matching approach, which is based on predefined face templates. Patterns
describing a face or facial features as a whole are used to locate a face. Finally there
is the appearance-based method, where the examined region is treated as data and
classified as containing a face or not. Appearance-based methods include eigenfaces,
neural networks and support vector machine models. These models use a training set to
capture the variety of different faces and facial features, and are then used for face
detection.
Face detectors make two types of errors. The first, called a false negative, occurs
when the detector classifies a face as a non-face. The other, called a false positive,
occurs when the detector classifies a non-face as a face. The latter is a much more
serious error, as it will be propagated to any subsequent processes such as face
recognition, face tracking and the others mentioned above.
1.4. Software Vs Hardware face detection

Modern software implementations have reached a very high level of detection rate,
effectiveness and robustness. However, even with this impressive performance, the best
throughput achievable by software detectors under good conditions is about 15 frames per
second. Hence software implementations are not well suited for real-time applications
that require a throughput of 30 frames per second. Moreover, the complexity of the
algorithms used for preprocessing and filtering is another problem for such methods.
Thus the need for hardware implementations of face detection arises. Hardware face
detectors, although more difficult to implement, can be faster than software face
detectors running on a general purpose processor, while achieving a high detection rate
as well, because they are optimized to perform face detection [13].
1.5. Challenges of hardware face detection
There are several challenges in implementing a hardware face detection system. The
first is how the system will get its input. The best choice would be for the hardware to
have a video input interface, so that input frames come from a digital camera. Next, the
constraints that the algorithm introduces need to be assessed. Such constraints include
the memory required for the algorithm to work, the processing power required to execute
it, the parallelism capabilities of the algorithm, and the mathematics the algorithm
requires and how easily they can be implemented in hardware.
The task of designing a hardware face detection system becomes much more
challenging when designing on an FPGA platform. The reason is that the resources and
capabilities of the FPGA add constraints to what can be done and how. Perhaps the most
important constraint is that the area of the design cannot exceed the area available on the
FPGA. This forces the designer to design the system in such a way that it does not
exceed the total available area. Also, the on-chip memory of an FPGA is predetermined
and often not enough for applications that require or produce a lot of data, such as face
detection. So if there is no access to an external memory and the on-chip memory is not
enough, the designer needs to find ways to share the on-chip memory, which often results
in a loss of speed because of memory interactions. Furthermore, when designing on an
FPGA the system speed is determined by the FPGA's system clock, so the system cannot
operate at its full potential.
1.6. Contribution
In this project a neural network face detector was implemented. The system receives
320x240 images, extracts 20x20 windows and sends them to the neural network for
processing. The neural network then classifies each 20x20 window as a face or a
non-face. Also in this project, an image pyramid generation unit was implemented that
produces smaller scales of the original 320x240 image. The two designs were not
integrated into one system due to time constraints.
Chapter 2: Fundamentals
2.1. Digital Image Processing
A digital image can be considered as a two dimensional array a[x, y] of N finite rows
and M finite columns, where x and y are spatial coordinates and a[x, y] is called the
intensity of the image at that point. An image is composed of picture elements, or pixels
(Figure 2.1). A pixel comprises three color-producing elements, each representing
one of the three primary colors: red, green and blue. This representation is called the
RGB model (Figure 2.2). These primary colors combine together, and their variable
combinations create the colors that the human eye can see. Each primary color takes
values in the range [0, L-1], where L is the number of intensity values. The number of
intensity values L is of the form 2^k, with k denoting the number of bits needed to
represent the intensity values. If k is 8, then L is 256, meaning that there are 256
possible intensity levels and 8 bits are required to represent all of them. The number of
bits needed to store an image is given by M x N x k for grayscale images and by
M x N x k x 3 for color images.
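For example, the formula above gives the storage footprint of the 320x240 frames used elsewhere in this thesis (a quick arithmetic sketch; the frame size is taken from the abstract):

```python
# Storage for a 320x240 image at k = 8 bits per intensity value
M, N, k = 240, 320, 8              # rows, columns, bits per sample
grayscale_bits = M * N * k         # M x N x k
color_bits = M * N * k * 3         # M x N x k x 3, one component per primary color
print(grayscale_bits // 8)         # 76800 bytes (75 KB) per grayscale frame
print(color_bits // 8)             # 230400 bytes per 24-bit color frame
```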
Figure 2.1: Illustration of a pixel in an image
Figure 2.2: The RGB Model [5]
A color image consists of three component images, one for each of the primary
colors. These three images combine on the phosphor screen to produce a composite color
image. The number of bits used to represent each pixel in the RGB model is called the
pixel depth. The standard used is a 24-bit representation for color images, 8-bits used for
each primary color. When all three components are 255, then the resulting color is white.
When all three components are 0, the resulting color is black [5].
Grayscale images, on the other hand, do not require three component images, as the
intensity values of red, green and blue are all equal in a grayscale image. Hence
grayscale images require only a third of the memory compared to color images, because
only 8 bits are used to represent each pixel [5]. This makes grayscale images suitable
for hardware implementations.
Digital images are obtained by sampling and quantization of analog images; this
process is called digitization. Sampling takes place in space: equally spaced samples are
taken along both the horizontal and vertical coordinates. After a sample has been taken,
quantization is required in order to turn the continuous intensity levels into discrete
intensity levels. This can be done by a mapping process that maps a continuous range of
values onto one discrete value. After quantization the discrete value is stored at a
position in the array. Once the whole analog image has been sampled and quantized, the
process of generating a digital image is complete. The sampling rate and the number of
bits used to store the data determine the quality of the digital image [4].
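The quantization mapping described above can be sketched as follows (an illustrative sketch; the normalized [0, 1] input range is an assumption, not part of the original text):

```python
def quantize(v, L=256):
    # Map a continuous intensity v in [0.0, 1.0] onto one of L discrete levels
    return min(int(v * L), L - 1)

print(quantize(0.0), quantize(0.5), quantize(1.0))  # 0 128 255
```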
The process of manipulating a digital image can be divided into three categories
according to the goal that is set. First there is the category of image processing, which
involves tasks such as image enhancement and noise removal. In image processing an
image is processed, and the result is another image that is improved with respect to the
goal. Next there is image analysis. Again the input is an image, but the results are
measurements that give some statistical analysis of the image. Finally there is the
category of computer vision and image understanding. Tasks in this category include
object matching and recognition. The goal here is, given an image, to extract a
high-level description of it.
2.2. Image Pyramid Generation
Image pyramid generation is the process of scaling down an image a number of times
to create smaller copies of that image. Face detection is one application that requires
image pyramid generation. The reason is that not all faces in an M x N image can fit in a
search window of size m x n, where m < M and n < N. Hence smaller versions of the
image are created in order to decrease the size of the faces in the image. This way it is
possible for large faces to fit in the search window and be detected by the system.
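As an illustration, the number of pyramid levels follows directly from the frame size, the window size and the scale factor (a sketch; the 1.2 scaling step is borrowed from the Rowley et al. system discussed in Chapter 3, not fixed by this section):

```python
def pyramid_sizes(w, h, step=1.2, win=20):
    # Downscale repeatedly until the image can no longer hold the search window
    sizes = []
    while w >= win and h >= win:
        sizes.append((int(w), int(h)))
        w, h = w / step, h / step
    return sizes

levels = pyramid_sizes(320, 240)
print(len(levels), levels[0])  # 14 levels, starting at (320, 240)
```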
The algorithm used for the image pyramid generation is very important for the quality
of the smaller images that will be created. At the same time, some applications do not
require the quality of the image to be preserved, and so simpler algorithms can be used.
The simplest algorithm uses scale factors to scale down the image. It does not
preserve quality, but it is quite simple since it requires only two multiplications, one
per coordinate. To find the new coordinates that a pixel value will move to in the new
image, we multiply the X coordinate by a scaling factor Sx to obtain the new horizontal
coordinate X'. The same procedure is followed for the vertical coordinate Y. An example
is illustrated in Figure 2.3. When the new vertical and horizontal coordinates are
calculated, the pixel value that was at X, Y is stored at position X', Y'. To summarize,
X' = X * Sx and Y' = Y * Sy.
Figure 2.3: Image Pyramid Generation Example
Obviously, some pixel values will be mapped to the same new coordinates. As a
result some pixel values will be overwritten by others. Which pixel value is preserved
makes a difference to the quality of the resulting image, and there are different
approaches to choosing it. One approach is to preserve the first pixel value mapped to
the new coordinates. Another is to keep replacing the old pixel value with the new one,
so that the last pixel value remains. A third approach is to use interpolation.
Interpolation achieves better results in terms of quality loss but is more complicated
than the former two approaches.
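The coordinate mapping and one of the collision policies described above (last pixel wins) can be sketched as follows; this is an illustrative sketch, not the hardware IPG unit of Chapter 4:

```python
def downscale(img, sx, sy):
    # Scale-factor downscaling: X' = X * Sx, Y' = Y * Sy, with the last
    # pixel mapped to a given cell overwriting any earlier ones.
    h, w = len(img), len(img[0])
    nh, nw = int(h * sy), int(w * sx)
    out = [[0] * nw for _ in range(nh)]
    for y in range(h):
        for x in range(w):
            xp, yp = int(x * sx), int(y * sy)
            if xp < nw and yp < nh:
                out[yp][xp] = img[y][x]
    return out
```

Keeping the first value instead would only require an extra check before writing; interpolation would combine the colliding values rather than discard them.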
2.3. Neural Networks
Neural networks (NN) are a very useful tool because they introduce a different
approach to problem solving than the traditional algorithmic approach. Their main
advantage is that they can be trained to perform certain tasks. By being trained, and
thus able to "learn", a NN can reorganize its structure and adapt to different
circumstances. Neural networks are separated into two distinct categories. The first
category includes the biological neural networks (BNN), such as the human nervous
system and the human brain. BNN consist of interconnected biological components
called neurons. The second category includes the artificial neural networks (ANN),
which are structured in the same manner as BNN but use artificial components (artificial
neurons) that mimic the operation of biological components and are less complex [1].
Biological neurons (Figure 2.4) are the heart of biological neural networks and are
built from four basic components: the dendrites, the soma, the synapse and the axon.
The dendrites serve as projections of a neuron. They act as conductors of electrical
stimulation, received from other neurons, to the neuron's cell body [2]. The soma
contains the cell's nucleus, whose role is to control the activities of the cell by
regulating gene expression [3]. The axon [23] is a long projection of the neuron that
conducts electrical impulses away from the neuron's soma. The axon can be thought of as
a transmission line that propagates electrical activity from one neuron to another. The
synapse converts the electrical activity from the axon into electrical effects that
inhibit or excite activity in other connected neurons [1].
Figure 2.4: Biological Neuron [16]
The simplest model of an artificial neuron (Figure 2.5) has many inputs and one
output, and the output is determined by the combination of the artificial neuron's
inputs. A more complex but much more dynamic model is the McCulloch and Pitts model
illustrated in Figure 2.6. In this model the inputs are weighted, which affects the
importance of each input on the overall computation. Additionally, a preset threshold
value is used which affects the output of the neuron. The inputs are multiplied by their
respective weights and then summed together. The threshold value is subtracted from the
sum and the artificial neuron outputs the resulting number. The addition of weights and a
threshold makes the artificial neuron a very dynamic and flexible tool that can be
adjusted for different tasks merely by altering the weights and threshold [1].
Figure 2.5: Simple Artificial Neuron [1]
Figure 2.6: McCulloch and Pitts artificial neuron model [1]
The behavior of an artificial neuron can also be modified with the addition of an
activation function. This function can be a linear function, a threshold function, or a
sigmoid function. With a linear function the output of the artificial neuron is
proportional to the weighted sum. With a threshold function the output is one of two
values, depending on whether the weighted sum is greater than, less than, or equal to a
threshold value. With a sigmoid function the output varies continuously as the input
changes, but not in a linear way. All three function types are approximations of the way
that a biological neuron behaves.
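The three activation function types can be sketched as follows (an illustrative sketch; the particular slope and threshold parameters are arbitrary choices, not values from this thesis):

```python
import math

def linear(s, a=1.0):
    # Output proportional to the weighted sum s
    return a * s

def threshold(s, t=0.0):
    # One of two values, depending on which side of the threshold s falls
    return 1 if s >= t else 0

def sigmoid(s):
    # Output varies continuously, but not linearly, with the input
    return 1.0 / (1.0 + math.exp(-s))
```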
A general mathematical definition of the artificial neuron operation is shown in
Equation 2.1, where y is the neuron output, the Xi are the neuron inputs, the Wi are
their respective weights, t is the preset threshold and f is the activation function.

y = f( Σ_{i=0}^{n} (Xi * Wi) − t )

Equation 2.1: Artificial neuron operation
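The neuron operation of Equation 2.1 can be sketched directly (an illustrative sketch; the example weights and threshold are arbitrary, not the VideoMining values used later):

```python
def neuron(inputs, weights, t, f):
    # Equation 2.1: y = f( sum(Xi * Wi) - t )
    s = sum(x * w for x, w in zip(inputs, weights)) - t
    return f(s)

step = lambda s: 1 if s >= 0 else 0  # hard-threshold activation

print(neuron([1, 0, 1], [0.5, 0.5, 0.5], 0.7, step))  # 1.0 - 0.7 >= 0, so 1
```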
An ANN may consist of more than one layer of neurons (Figure 2.7). Commonly it
consists of the input layer, which operates on the inputs of the system; the hidden
layers, which operate on the outputs of the input layer; and the output layer, which
operates on the outputs of the hidden layers and produces the output of the system.
Figure 2.7: Multilayer Neural Network [15]
There are different approaches to training a neural network. The first is
associative mapping. In this approach the NN basically learns to produce a particular
output pattern when a particular input pattern is applied. Then there is the regularity
detection approach, where the NN learns to respond to particular properties of the input
pattern. Such learning approaches are useful when feature discovery is important. For
regularity detection there are two types of networks: fixed networks, whose weights
cannot be changed, and adaptive networks, which have the ability to change their weights
to adapt to a different situation. All training methods used in adaptive networks are
categorised as either supervised or unsupervised. Supervised methods require an external
entity to inform each unit of what the expected output should be for the given inputs.
Unsupervised learning, on the other hand, requires no external interference, because the
network can self-organize the data given to it and detect its emergent collective
properties [1].
The way to teach a three-layer neural network to perform a particular task is to
first present training examples to the network, each containing an input sequence and the
desired output for that input. Then the difference between the actual and the desired
output is calculated, and the weight values of each neuron are changed so that the actual
output moves closer to the desired output.
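A minimal error-correction update along these lines can be sketched for a single hard-threshold neuron (a simplified illustration only; the thesis's own network was trained offline by VideoMining, not with this rule):

```python
def train_step(inputs, weights, t, target, lr=0.1):
    # Compute the actual output, then nudge each weight toward the target
    actual = 1 if sum(x * w for x, w in zip(inputs, weights)) - t >= 0 else 0
    error = target - actual  # difference between desired and actual output
    return [w + lr * error * x for x, w in zip(inputs, weights)]
```

Repeating such updates over many labeled examples is what gradually moves the actual outputs toward the desired ones.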
It is vital to note that the variety of inputs that a NN receives during training
determines its accuracy and its ability to perform as expected in operation.
Because of their parallel architecture, simplicity and fast computation times, NNs
are a preferred choice for real-time applications, especially when hardware
implementation is an option.
Chapter 3: Related Work
3.1. Past work on face detection

Face detection has attracted many researchers' attention because of its numerous
applications, and so various solutions have been proposed. In this section some of these
solutions are described and explained.
Rowley et al. [6] present a face detection system based on artificial neural
networks. The general method involves scanning the image pixel by pixel at various
scales. Initially the image pyramid is generated using scaling steps of 1.2. Each image
is then scanned with a 20x20 pixel window, and each window is passed to the neural
network for classification. Before a window is passed to the neural network it undergoes
histogram equalization. Histogram equalization increases the quality of the pixel window
and makes the facial features clearer, and so it increases the chances of accurate
detection by the neural network. After histogram equalization the window is given to a
router network, which is responsible for bringing the face into an upright frontal view
if it is rotated. The neural network [7] receives as input a 20x20 pixel region of the
image and generates an output ranging from 1 to -1, signifying the presence or absence of
a face. The network's hidden units are of three types: the first looks at the four 10x10
quadrants of the 20x20 window, the second at the 5x5 quadrants of those quadrants, and
the third at six overlapping horizontal regions of 5x20 pixels each [8]. The overlapping
horizontal regions allow the hidden units to detect features such as mouths or pairs of
eyes, while the quadrant units might detect features such as individual eyes, the nose,
or corners of the mouth. Rowley et al. also used arbitration between multiple neural
networks in order to eliminate false positives and increase accuracy. This works because
the different neural networks were trained under different conditions, so they have
different biases and make different errors. The basic algorithm is illustrated in
Figure 3.1.
Figure 3.1: Rowley et al. algorithm for Neural Network-based face detection [7]
The system proposed by Rowley et al. was able to detect 79.5% of faces over two large
test sets, with a small number of false positives [6].
Viola and Jones [9] presented a method for rapid object detection using a boosted
cascade of simple features. The work in [9] had three main contributions. The first is
the introduction of a new image representation called the integral image, which allows
very fast feature evaluation. Each value of the integral image is the sum of the pixels
above and to the left of the corresponding pixel in the original image. The second
contribution of [9] is a method for constructing a classifier by selecting a small
number of important features using AdaBoost. AdaBoost is a machine learning algorithm
that can be used with other algorithms to improve their performance [10]. The aim here
was to ensure fast classification by excluding a large number of features and focusing
on a small set of critical ones. The third and last contribution is a method for
combining successively more complex classifiers in a cascade structure. The cascade
structure dramatically increases the speed of the detector by focusing on promising
image regions. As explained in [9], it is often possible to determine where an object
might occur in an image, and hence it is more promising to look in that region of the
image.
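The integral image described above can be computed in a single pass over the input (a brief sketch of the representation from [9]):

```python
def integral_image(img):
    # ii[y][x] = sum of img over all pixels above and to the left of (x, y),
    # inclusive; computed in one pass using a running row sum.
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row = 0
        for x in range(w):
            row += img[y][x]
            ii[y][x] = row + (ii[y - 1][x] if y > 0 else 0)
    return ii

print(integral_image([[1, 2], [3, 4]]))  # [[1, 3], [4, 10]]
```

Once built, the sum of any rectangular region can be obtained with four lookups into this table, which is what makes the feature evaluation in [9] so fast.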
Yang and Huang [11] used a hierarchical knowledge-based method to detect faces.
The proposed system consists of three levels of rules. A scanning window passes over the
input image and a set of rules is applied to each region. The higher-level rules contain
a general description of what a face looks like, while the lower-level rules look for the
facial features of a human face.
Craw et al. [11] presented a localization method based on predefined face
templates. The templates are shapes of frontal faces (i.e. outlines of faces). First a
Sobel operator is used to extract edges from the image. The edges are then grouped
together to search for a face that matches a template. When the head contour is located,
the same process is repeated to locate the eyes, eyebrows, lips, etc.
Schneiderman and Kanade [8] suggest a statistical method for face detection. To
apply statistical methods to the problem, they represent visual attributes with wavelet
coefficients. A wavelet is a mathematical function used to divide a given function into
different frequency components [12]. This representation allows them to jointly model
image data that is localized in space, frequency and orientation. From this information
they are able to construct a histogram-based face detector. This method requires initial
histograms to be constructed.
Graf et al. [11] developed a method of locating facial features and faces in grayscale
images. The image first passes through a band-pass filter, and then morphological
operations are applied to enhance regions with high intensity and certain shapes. The
histogram of the processed image is then computed and should exhibit a prominent peak.
Based on the peak's value and width, adaptive threshold values are selected in order to
create two binary images. In both of these images, connected component regions are
identified and selected as candidates for facial features. These regions are combined and
evaluated with classifiers to determine if they contain a face. In this method it is not clear
how the morphological operations are performed or how the candidate facial feature
regions are combined to determine if they contain a face.
3.2. Hardware face detection
3.2.1. ASIC Implementation of a Neural Network Based Face Detection
An ASIC implementation of a neural network face detector is presented in [13]. The
algorithm used is the one proposed in [7] and described in section 3.1. The detection
process, which includes the image pyramid generation stage, the image enhancement
stage and the neural network stage, was implemented entirely on one chip. The
architecture operates on a 320 x 240 grayscale image frame.
The image pyramid generation unit acts as the interface of the system with the
outside world. The image pyramid generation unit utilizes a 64-bit bus to communicate
with the image data source and a 32-bit data bus to communicate with the image
enhancement unit. The unit requires 80 KB of memory. It creates the 20x20 windows that
are needed by the algorithm. The windows are generated in raster scan style and handed
to the image enhancement unit.
The image enhancement unit is responsible for improving the quality of the 20x20
windows it receives from the image pyramid generation unit. This unit performs
histogram equalization. The image enhancement unit creates the cumulative distribution
function needed for histogram equalization, which it performs on the whole window
rather than on sub-regions of the window. The module requires 503 clock cycles to fully
process a 20x20 window.
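The equalization step can be illustrated with a standard global histogram-equalization model in Python. This is a software sketch only; the hardware builds the cumulative distribution function in the same spirit, but its exact quantized mapping is not specified here:

```python
def histogram_equalize(window):
    """Equalize an 8-bit grayscale window (list of rows) by mapping each
    pixel through the CDF of the window's own histogram."""
    pixels = [p for row in window for p in row]
    n = len(pixels)
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    # Cumulative distribution function over all 256 gray levels.
    cdf, running = [0] * 256, 0
    for level in range(256):
        running += hist[level]
        cdf[level] = running
    cdf_min = min(c for c in cdf if c > 0)
    # Standard equalization mapping onto the full 0-255 range.
    lut = [round((cdf[level] - cdf_min) * 255 / (n - cdf_min))
           if n > cdf_min else level for level in range(256)]
    return [[lut[p] for p in row] for row in window]
```

Applying this per window stretches low-contrast windows over the full intensity range before they reach the classifier.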
The neural network unit in this implementation was designed for parallelism but
another goal was for the unit to remain as small as possible. This unit receives the 20x20
histogram equalized image from the image enhancement unit. The overall neural network
architecture is as follows. Each pixel of the 20x20 window is multiplied by a predefined
weight value. For each region of the 20x20 window, as described in section 3.1, the
multiplication results are accumulated and then pass through an activation function. The
activation function is a hyperbolic tangent implemented as a lookup table in a 16-bit
SRAM.
The input layer of the neural network consists of three regions. For the region that
divides the window into 10x10 sub-windows and for the 5x5 region, one multiply-
accumulate unit each was used. For the 5x20 region, two multiply-accumulate units were
used because of the overlapping regions. The three regions share the same SRAM unit.
For the hidden and output layers one shared multiply accumulator was used. They
also share a 4 KB SRAM that contains the weights.
The final hardware implementation operates at 125 MHz with a total power consumption
of 165 mW. Face detection was performed at 24 frames per second on a 320x240 image
frame.
3.2.2. FPGA Implementation of a Neural Network Based Face Detection
Another hardware implementation of a face detection system on an FPGA is
described in [13]. The algorithm used was again the one proposed by [7]. The FPGA
platform used was the Xilinx XUP2V-Pro development board.
The implementation assumes a 320x240 input image, 20x20 search windows and 15
pixel overlap between consecutive windows. With these constraints, up to sixteen
sub-windows (up to four in the horizontal direction times four in the vertical direction)
may be active for a given input pixel.
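The sixteen-window figure follows directly from the 5-pixel stride implied by 20-pixel windows with 15-pixel overlap: at most four window origins per axis can cover a given coordinate. A quick Python check (function name and test coordinate are illustrative):

```python
def windows_covering(coord, win=20, stride=5, limit=320):
    """Origins of the windows that contain a given coordinate along one
    axis, for windows of size `win` placed every `stride` pixels."""
    return [o for o in range(0, limit - win + 1, stride)
            if o <= coord < o + win]

# Away from the image border, four origins cover a pixel in each axis,
# giving 4 x 4 = 16 simultaneously active sub-windows.
x_hits = windows_covering(100)
y_hits = windows_covering(100, limit=240)
print(len(x_hits) * len(y_hits))   # 16
```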
The proposed architecture consists of a 32 node processing array. Sixteen of these 32
are layer-one multipliers. The implementation utilizes a packet-switched network on
programmable chip with sixteen multipliers that perform the layer-one multiplications of
the pixel and the respective weight value. There are 256 accumulators utilized in order to
sustain a maximum of 256 in-progress accumulations. The activation function is
implemented using lookup tables.
The neural network dataflow is controlled by registers called context registers. These
context registers are present in each processing node and are responsible for specifying
the destination of the result of the ongoing operation. There are two types of context
registers: multiplier context registers and accumulator context registers. The multiplier
context registers store, besides the destination, the value of the weight that will be used in
the multiplication. The accumulator context registers also store the number of
accumulations to be processed and the current number of accumulations. The architecture
of the implementation is shown in Figure 3.2.
Figure 3.2: Architecture of the neural network-based face detector on an FPGA board [13]
The FPGA implementation achieved almost twice the detection frame rate in
comparison to the ASIC implementation. The multiple multiply-accumulate units of the
FPGA, as well as the concurrent window processing, are the main reasons the FPGA
implementation performs better.
Chapter 4: FPGA Hardware Implementation
This senior design project implements the neural network-based face detector
proposed in [7] on a Xilinx Virtex2 Pro XC2VP30 FPGA platform using the Verilog
hardware description language. The purpose of this project was to explore and
investigate different implementation methods given the board's resources and
constraints. The algorithm performs upright frontal face detection on a 320x240
grayscale image frame. The hardware implementation involved an image pyramid
generation unit responsible for downscaling the original 320 x 240 image frame, the
neural network structure described in [7], and a VGA controller required to output the
results on a VGA screen.
The main goal of this project is the implementation of a prototype of the neural
network that occupies as little area as possible, with a view to integrating more neural
networks into the design. This will allow the concurrent processing of the downscaled
frames, thus achieving a frame rate near 30 frames per second, the rate required for
real-time video processing.
The implemented face detector system includes the VGA controller, a memory
containing a 320x240 image and one neural network. Due to time constraints the image
pyramid generation unit was implemented but not integrated into the face detection
system, and only one neural network was integrated into the design. Nevertheless, the
work done in this project and the decisions taken take into consideration the integration
of the IPG unit and the use of more than one neural network at a later stage. The weights
and thresholds for the neural network were given to us by Video Mining Incorporated, so
this implementation does not try to improve the accuracy of the neural network, only to
implement the design.
4.1. Xilinx Virtex2 Pro XC2VP30 FPGA
The Virtex2 Pro XC2VP30 FPGA board (Figure 4.1) has many features that make it
almost ideal for the task of face detection.
The Virtex2 Pro FPGA, on which the system is designed, meets the memory and
processing requirements of a face detection application. It offers a total of 306 KB of
on-chip block RAM, which is more than capable of storing all the images needed, so the
use of the on-chip block RAM is the preferred option. Processing-wise, the Virtex2 Pro
offers 136 18-bit embedded multipliers and 30,816 logic cells, which is exactly what is
needed for the intense multiplication and accumulation operations that the neural
network requires. It is also equipped with two PowerPC processors, which may be used
to run a software version of the neural network and thus process frames concurrently.
The board hosting the FPGA is also very capable in terms of interfacing with the
outside world and the memory it offers. It is equipped with a video decoder board that
makes it easy to get live video from a connected camcorder. Additionally, it has a
CompactFlash card port that can be used to provide the face detector with new image
frames. The board also has USB 2.0 ports that provide the option of delivering image
frames to the system on the Virtex2 Pro via a USB memory stick. Furthermore, the board
can be connected to a hard disk drive to retrieve frames from the disk. For output
purposes the board offers connectivity with a VGA monitor, through which it is possible
to view the results. The user can interact with the system through the push buttons,
switches and LEDs present on the board. The memory capabilities of the board also fit
the task: it has a 256 MB DDR SDRAM DIMM module. The board's system clock runs
at 100 MHz.
Using the on-chip block RAM for frame storage also makes frame accesses faster and
the design simpler.
Figure 4.1: Virtex2 Pro XC2VP30 FPGA board [18]
4.2. VGA Controller
A VGA controller implementation was required in order to output the system results
onto a VGA monitor. The VGA format used is the 640x480, 60Hz format. For this format
there are 480 rows and each consists of 640 pixels [19].
A VGA monitor outputs the pixel values that it has received in a raster scan fashion.
Beginning from the top left corner of the screen, position (0, 0), it outputs the pixel
values for each column in the row up to position (639, 0), and then moves to the next
row, starting from column zero at position (0, 1). The monitor returns to position (0, 0)
after it has output the pixel at position (639, 479). Figure 4.2 illustrates the order of
pixels on a VGA monitor.
Figure 4.2: Representation of pixels on a VGA monitor [19]
The VGA controller operates with a 25.175 MHz pixel clock. For this reason a clock
divider had to be implemented to divide the 100 MHz system clock by four, which
approximates the clock frequency needed by the VGA. At each negative edge of this
clock the VGA controller fetches the pixel values it needs and outputs them to the
screen.
4.3. Image Pyramid Generation
4.3.1. IPG Implementation Strategy
The Image Pyramid Generation (IPG) unit is responsible for taking a 320 x 240
image and producing downscaled versions of the original image that will be processed by
the neural network for face detection. In order to cover the vast majority of the possible
sizes that a human face can have, substantially downscaled images had to be generated.
The IPG process must conclude with an image frame of 20x20 pixels, to cover the case
where the face occupies the whole image.
The IPG unit could be the bottleneck of the face detection system if not implemented
carefully. If, for example, a downscaled version is generated and immediately processed,
and only then the next downscaled image is created and processed, the face detection
process would have to wait for each downscaled image to be generated and would
remain idle until it is.
The solution proposed in this project for the IPG unit is to create all the needed
downscaled images at once. This can be achieved by multiplying each coordinate with all
the scale factors in parallel to create the new coordinates. This way the face detection
stage only has to wait at the start, until all the required images have been created. Then,
if the hardware constraints allow it, it is possible to integrate as many neural networks as
there are images and have all frames processed at the same time. By doing this, all the
images are processed in the same time that is required to process only the 320 x 240
image.
To implement this idea, the following aspects had to be considered: the total memory
required, the processing resources that the downscale operation will occupy, and which
downscaled dimensions would be most efficient. Efficient in this case means that the
dimensions of the downscaled images are equally distributed, so that neither too many
large images nor too many small images are created. The methodology employed was to
first identify some candidate scales, then calculate the total memory to see if all scales fit
onto the FPGA and, finally, if the memory constraint was met, determine the scaling
factors that produce these scales.
The process of IPG, as mentioned in section 2.1, requires two multiplications to
calculate the new coordinates of a pixel value. This means that if 9 downscaled images
are produced, 18 multipliers would be required for the IPG process. But multipliers are
mostly needed for the face detection process and the neural network implementation.
This is why particular scales were chosen, so that their respective scaling factors do not
require multiplication to calculate the new coordinates. Instead, the new coordinates can
be calculated with shift operations. As an example, consider the scale factor 0.25 for
image scale 7 in Table 4.1. To find the new coordinates x' and y' we need to multiply x
and y by 0.25, which is a power of two. In binary arithmetic, multiplying or dividing by a
power of 2 is equivalent to a left or right shift respectively, so multiplying by 0.25, i.e.
dividing by 4, is a right shift by two positions. This way some multipliers were saved for
the neural network.
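The shift-based coordinate mapping can be stated in one line of code. The sketch below models the idea in Python (the hardware performs the same operation on the coordinate wires):

```python
def downscale_coord(coord, shift):
    """Map a source coordinate to a downscaled coordinate with a right
    shift, replacing a multiplication by 1 / 2**shift."""
    return coord >> shift

# Scale factor 0.25 (image scale 7 in Table 4.1) is a right shift by 2.
print(downscale_coord(200, 2))   # 50, same as int(200 * 0.25)
```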
The final scales chosen are illustrated in Table 4.1, along with the memory each
scale will occupy, the total memory for all images, the scale factors that produce the
respective scale and the number of windows that will be generated out of each scale.
After the scales were determined, it was observed that there is no need for as many
neural networks as there are images. This is because image scales 3 to 10 in Table 4.1
generate 2398 windows in total. The 320x240 image, which has the largest number of
windows, 2745, will therefore still be processing after all the others have finished, and
the other neural networks would remain idle until the next image arrives and is
downscaled. With this in mind, it is possible to process all images in the time it takes to
process a single 320x240 image by using only three neural networks: the first neural
network processes the 320 x 240 image, the second the 260 x 195 image, and the third all
the others.
Image Scale       Memory Required    Scale Factor(s)                       Windows Generated
 1. 320 x 240     76.800 KB          None                                  2745
 2. 260 x 195     50.700 KB          0.8125                                1764
 3. 200 x 150     30.000 KB          0.625                                 999
 4. 160 x 120     19.200 KB          0.5                                   609
 5. 120 x 90      10.800 KB          0.375                                 399
 6. 100 x 75       7.500 KB          0.3125                                204
 7. 80 x 60        4.800 KB          0.25                                  117
 8. 60 x 45        2.700 KB          0.1875                                54
 9. 40 x 30        1.200 KB          0.125                                 15
10. 20 x 20        0.400 KB          0.0625 (column), 0.083984375 (row)    1
                   Total: 204.500 KB
Table 4.1: Resolutions of the scaled-down images, the memory required for each one, the scaling factors
that create each scale, and the number of windows each scale will generate.
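The window counts in Table 4.1 follow from the 20x20 window size and the 5-pixel scan stride (inferred from the 15-pixel overlap stated in section 3.2.2). A short Python check for the two largest scales:

```python
def num_windows(w, h, win=20, stride=5):
    """Number of 20x20 window positions in a w x h image with a 5-pixel
    stride, i.e. 15-pixel overlap between consecutive windows."""
    if w < win or h < win:
        return 0
    return ((w - win) // stride + 1) * ((h - win) // stride + 1)

print(num_windows(320, 240))   # 2745, as in Table 4.1
print(num_windows(260, 195))   # 1764
```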
4.3.2. IPG Architecture
The IPG unit consists of 10 block RAM units and two address/coordinate generation
units. Each block RAM stores one of the ten image scales shown in Table 4.1. The first
address/coordinate generator is responsible for creating the x and y coordinates for the
pixels of the 320x240 image. At the same time it uses these x and y coordinates to
generate the address from which to read from the 320x240 image. The address is
generated using Equation 4.1. The second address/coordinate generator is responsible for
creating the x' and y' coordinates for the 9 downscaled versions of the 320x240 image,
by multiplying the x and y coordinates with the respective scale factors. For simplicity,
only the integer part of the multiplication is kept and no rounding is performed. Again,
the addresses for each image are generated using Equation 4.1 and each image's
respective x', y' coordinates. The IPG unit block diagram is shown in Figure 4.3.
Address = (image_columns * y) + x
Equation 4.1: How to create an address given the x and y coordinates; x represents the current column
and y the current row.
For a pixel value to be read from the 320x240 image RAM and stored in a
downscaled image RAM, 3 clock cycles are required. In the first clock cycle the
address/coordinates from which to read are produced. In the second cycle the pixel value
is loaded from the 320x240 RAM and, at the same time, the addresses to write to are
produced. In the final cycle the pixel value is stored in all the downscaled images' block
RAMs. In total, 230,400 clock cycles are required for all the downscaled images to be
produced and ready for processing.
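The read/scale/write flow above amounts to the following software model, a sketch rather than the Verilog: block RAMs are modelled as flat lists addressed with Equation 4.1, and, as in the hardware, only the integer part of the scaled coordinates is kept (later source pixels mapping to the same destination address simply overwrite earlier ones):

```python
def address(x, y, image_columns):
    """Equation 4.1: linear block-RAM address from (x, y) coordinates."""
    return image_columns * y + x

def downscale(src, src_w, src_h, dst_w, dst_h, sx, sy):
    """Push every source pixel to its scaled position in the destination
    image, truncating the scaled coordinates (no rounding)."""
    dst = [0] * (dst_w * dst_h)
    for y in range(src_h):
        for x in range(src_w):
            xp = int(x * sx)          # integer part only, as in the IPG
            yp = int(y * sy)
            if xp < dst_w and yp < dst_h:
                dst[address(xp, yp, dst_w)] = src[address(x, y, src_w)]
    return dst
```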
Figure 4.3: Block Diagram of the Image Pyramid Generation Unit
4.4. Neural Network
The final unit of the face detector system is the neural network unit. The unit takes as
input a 20x20 pixel window and classifies it as a face or not a face. The unit is partitioned
into three parallel stages. The first stage segments the window into four regions of 10x10
pixels each, searching for facial features such as the nose, eyebrows, glasses, etc. The
second stage segments the window into sixteen regions of 5x5 pixels each. The final
stage segments the window into six overlapping 5x20 regions and searches for pairs of
eyes and the mouth. Figure 4.4 shows in detail the segmentation of the window into these
three region types. Each segment of a region is assigned an artificial neuron that takes as
inputs the pixels of the respective region and the respective weight for each pixel in that
region. These three region types form the input layer of the neural network.
The 10x10 regions have 4 neurons, the 5x5 regions have 16 neurons and the 5x20
regions have six neurons. Each neuron's operation is a multiplication of the pixel values
by the weight values and an accumulation of all the multiplication results. The output of
each neuron is its accumulation result minus its threshold. The output of a neuron then
passes through an activation function, which in this case is a hyperbolic tangent.
Each region's neurons are fully connected to a hidden layer neuron. There are three
hidden layer neurons, one for each region type. The hidden layer neurons multiply the
outputs of the activation functions of all previous neurons with their respective weights.
These neurons also accumulate and threshold their output, and an activation function is
applied to their output as well.
Finally, the outputs of the three hidden layer neurons are propagated to an output
neuron that multiplies them by their respective weights, accumulates the multiplication
results and outputs the result. If the final number is greater than or equal to 0, the image
is classified as a face; otherwise it is classified as a non-face. A detailed description of
the neural network structure is presented in Figure 4.5.
(a)
(b)
(c)
Figure 4.4: The segmentation of the window for (a) 10x10 region (b) 5x5 region (c) 5x20 region
Figure 4.5: Basic Neural Network structure [7]
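The classification path just described can be summarised as a software model in Python. This is a structural sketch only: the region lists, weights and thresholds below are illustrative placeholders, not the trained VideoMining parameters, and the bit-exact fixed-point arithmetic of the hardware is replaced by floats:

```python
import math

def segment_sums(window, weights, regions):
    """One input-layer stage: for each (x0, y0, w, h) region, multiply-
    accumulate pixels with that region's weights, subtract the threshold,
    then apply the tanh activation."""
    outputs = []
    for (x0, y0, w, h), (wts, threshold) in zip(regions, weights):
        acc = sum(window[y0 + j][x0 + i] * wts[j * w + i]
                  for j in range(h) for i in range(w))
        outputs.append(math.tanh(acc - threshold))
    return outputs

def classify(window, stages, hidden, output):
    """stages: one (regions, weights) pair per input-layer stage;
    hidden: one (weights, threshold) pair per hidden neuron;
    output: (weights, threshold). Returns True for 'face'."""
    hidden_out = []
    for (regions, weights), (h_wts, h_thr) in zip(stages, hidden):
        acts = segment_sums(window, weights, regions)
        acc = sum(a * w for a, w in zip(acts, h_wts))
        hidden_out.append(math.tanh(acc - h_thr))
    o_wts, o_thr = output
    score = sum(h * w for h, w in zip(hidden_out, o_wts)) - o_thr
    return score >= 0   # face iff the output neuron result is >= 0
```

For the real detector, `stages` would hold the 10x10, 5x5 and overlapping 5x20 region lists of Figure 4.4 with their trained weights.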
4.4.1. NN implementation strategy
The goal when designing the NN was to minimize the area and resources needed by a
single NN, to make it possible for more than one NN to be integrated into the face
detection system. The factors that need to be considered are how many multipliers and
accumulators are going to be used; the mathematical model used for number
representation, e.g. two's complement or sign-magnitude, and the binary representation
of real and signed numbers; and how to share resources between units.
The first thing that had to be decided was the number of pixels the NN would process
at a time. For simplicity and faster implementation it was decided that the NN would
process one pixel of the 20x20 window at a time. Since only one pixel is processed at a
time, only one accumulator is active in the first two stages of the NN and a maximum of
two in the third stage. Under this observation, one multiplier suffices for each of the first
two stages and two for the third. Hence the neurons of the first and second stages each
share one multiplier, and the third-stage neurons share two because of the overlapping
regions. The number of accumulators has to equal the number of neurons, since all
neurons accumulate independently.
The next thing to consider was the representation of signed numbers in binary. The
two methods considered were two's complement and sign-magnitude, i.e. the absolute
value of the number together with a sign bit. After an early prototype was implemented
using sign-magnitude representation, it was observed that the neurons were much too
complex and the design occupied a lot of area considering it was just a prototype. The
reason was the extra hardware needed to manipulate the sign of each number, and the
implementation of both an adder and a subtractor in each neuron, to use whichever was
required according to the number's sign. For these reasons the number representation
was changed to two's complement, eliminating the need for the extra hardware.
Another factor to consider was the representation of real numbers in binary form. A
fixed-point representation was used: a number of bits was chosen for the integer part and
the remaining bits form the fractional part of the number. The conversion from decimal
to binary for each part was done accordingly.
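A Python sketch of this fixed-point convention, using the 1-bit-integer, 16-bit-fraction format of the weights (section 4.4.2.1) as an example; the function names are illustrative:

```python
def to_fixed(value, frac_bits=16, total_bits=17):
    """Encode a real number as a two's complement fixed-point word with
    the given number of fractional bits."""
    raw = round(value * (1 << frac_bits))
    return raw & ((1 << total_bits) - 1)     # wrap into two's complement

def from_fixed(word, frac_bits=16, total_bits=17):
    """Decode a two's complement fixed-point word back to a float."""
    if word >= 1 << (total_bits - 1):        # MSB set: negative value
        word -= 1 << total_bits
    return word / (1 << frac_bits)

print(from_fixed(to_fixed(-0.5)))   # -0.5 survives the round trip
```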
4.4.2. NN architecture
4.4.2.1. NN Weights and thresholds
Every pixel in each segment of the three region types has a respective weight that
determines its importance in the overall computation. Weights that are used by the same
neuron were grouped together and implemented as lookup tables in the on-chip block
RAMs. The weights are represented as 17-bit two's complement numbers.
Specifically, for the 10x10 regions 4 weight memories were created, each with 100
positions; for the 5x5 regions, 16 weight memories, each with 25 positions; and for the
5x20 regions, 6 weight memories, each with 100 positions. The 10x10 and 5x20 region
weight memories occupy 213 B each, and the 5x5 weight memories occupy 54 B each.
The separation of memories helped avoid conflicts, as each memory is accessed
independently according to the segment of the 20x20 window in which the pixel value is
located.
The thresholds for all neurons, together with the weights for the hidden layer neurons
and the output neuron, were converted to two's complement numbers and hardwired in
the system to save memory. The binary representations of the weights and thresholds for
all the neurons are shown in Table 4.2.
Table 4.2: Binary representation of the thresholds and weights.
  Thresholds for input layer neurons:             35 bits total (19-bit integer, 16-bit fractional)
  Thresholds for hidden and output layer neurons: 35 bits total (5-bit integer, 30-bit fractional)
  Weights:                                        17 bits total (1-bit integer, 16-bit fractional)
4.4.2.2. Input Layer Neurons
Each pixel in the 20x20 window is assigned x and y coordinates when it is read for
processing. The x and y coordinates represent its position in the 20x20 window. Using
them, the segment to which the pixel belongs can be identified, and therefore which
accumulator needs to be enabled in each of the three stages of the input layer.
The 10x10 stage of the input layer consists of one multiplier and 4 accumulators that
emulate the behaviour of the 4 neurons. The inputs to the multiplier are an 8-bit pixel
value and a 17-bit weight value. The multiplication result is a 35-bit number, with the 16
least significant bits being the fractional part and the other 19 bits the integer part. The
accumulators take this 35-bit number and add it to their current sum. The binary
representation of the multiplier and accumulator results is illustrated in Table 4.3. Once
the accumulation process is complete, the result is sent to the 10x10 input layer
controller. Each accumulator has a counter that lets it stop once all the necessary
accumulations have been performed and inform the system that its result is ready for
further processing. The 5x5 stage follows the same philosophy.
Table 4.3: Binary representation of the multiplier and accumulator results for the input layer neurons.
  Multiplier results:   35 bits total (19-bit integer, 16-bit fractional)
  Accumulator results:  35 bits total (19-bit integer, 16-bit fractional)
The implementation method for the 5x20 regions needed to change because of the
presence of overlapping segments. For the pixel values in overlapping areas, two
multiplications with different weights and two accumulations take place at the same
time. Figure 4.6 shows the architecture of the three region types.
(a)
(b)
(c)
Figure 4.6: (a) 10x10 region neurons (b) 5x5 region neurons (c) 5x20 region neurons
4.4.2.3. Activation Function
The activation function used is a hyperbolic tangent (Figure 4.7). Its input is the
output of the input layer neurons. The hyperbolic tangent unit was implemented as a
lookup table in an on-chip block RAM. The hyperbolic tangent function produces values
between [-1, 1]. An important property of the hyperbolic tangent is that it is an odd
function: its value at a number z is the negative of its value at -z, as illustrated in
Equation 4.2. So it is convenient to store only the values produced by positive numbers.
Hence, if the output of a neuron is negative, all that needs to be done is to take the
absolute value of that number and send it to the hyperbolic tangent lookup table, while
the sign of the number is propagated to the output of the hyperbolic tangent. If the sign is
negative, the hyperbolic tangent output is converted to its negative counterpart using the
two's complement method; otherwise the number remains as is. By exploiting this
property of the hyperbolic tangent, the storage needed for the hyperbolic tangent values
is cut in half.
tanh(z) = -tanh(-z)
tanh(-z) = -tanh(z)
Equation 4.2: Hyperbolic tangent property
Figure 4.7: Hyperbolic tangent used in the implementation, the inputs are numbers from -8 to 8 with
a step of 0.0625
The hyperbolic tangent unit consists of 128 positions representing inputs from 0 to 8
with a step of 0.0625. In each position the respective output for that input is stored as a
15-bit number in two's complement form. The memory required for one hyperbolic
tangent unit is 240 B. The binary representation of the hyperbolic tangent values is
shown in Table 4.4.
Table 4.4: Binary representation of the hyperbolic tangent values.
  Hyperbolic tangent values: 15 bits total (1-bit integer, 14-bit fractional)
There are three hyperbolic tangent units in the neural network. In the first stage, each
one produces the hyperbolic tangent values for one of the three region types of the input
layer. In the next stage, the first hyperbolic tangent unit produces the hyperbolic tangent
output for the hidden layer neurons, because it is no longer occupied by the 10x10 region
neurons. Furthermore, the three hidden layer neurons do not finish their operation at the
same time, so it is convenient to use one hyperbolic tangent unit for all three.
In all cases mentioned above, the input to the hyperbolic tangent unit is a 7-bit part of
the neuron's output, consisting of the lower three bits of the integer part and the four
most significant bits of the fractional part. The reason for this is that for inputs in [0, 8]
the hyperbolic tangent returns distinct values, but for inputs larger than 8 it effectively
returns 1. So before the output of a neuron is passed to the hyperbolic tangent, the result
is checked to verify that it is less than 8. If it is, the 7-bit part is extracted and sent to the
hyperbolic tangent unit; otherwise the hyperbolic tangent unit outputs the value 1.
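The lookup-table behaviour described above can be modelled in Python. The 128 entries, the 0.0625 step, the odd-symmetry trick and the saturation at |z| >= 8 follow the text, while the 15-bit quantization of the stored values is simplified to plain floats:

```python
import math

# 128-entry table covering inputs 0, 0.0625, ..., 7.9375.
STEP = 0.0625
TANH_LUT = [math.tanh(i * STEP) for i in range(128)]

def tanh_lookup(z):
    """Approximate tanh(z) with the LUT, exploiting tanh(-z) = -tanh(z)
    and saturating to +/-1 for |z| >= 8."""
    sign = -1.0 if z < 0 else 1.0
    mag = abs(z)
    if mag >= 8:
        return sign * 1.0
    index = int(mag / STEP)        # truncate to the 7-bit address
    return sign * TANH_LUT[index]
```

Storing only the non-negative half of the curve is what cuts the table from 256 to 128 entries.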
4.4.2.4. Controller Units
Four controllers were required to control the data flow between the shared
subtractors and hyperbolic tangent units. Sharing these units is possible because no two
neurons in the same region finish their accumulations at the same time. Three controllers
control the flow for the three region types of the input layer, one controller per region
type, and one controls the three neurons of the hidden layer. The controllers act as enable
units and multiplexers.
When an input layer neuron has finished accumulating, it informs its respective
controller via a ready signal. The controller then sends the neuron's result to the
subtractor along with the respective threshold, and the subtractor subtracts the threshold
from the accumulation result. The controller then sends the appropriate bits of the
subtraction result as an address to the hyperbolic tangent unit and initiates the load from
the hyperbolic tangent memory. The hyperbolic tangent output is sent to the respective
hidden layer neuron, where the multiplication with the weight is performed and the result
is accumulated.
The hidden layer controller performs the same operation for the three hidden layer
neurons. It sends each hidden layer neuron's result to a subtractor, the subtraction result
is sent to the hyperbolic tangent unit, and the output is given to the output neuron for its
multiplication and accumulation processes. Additionally, the hidden layer controller
controls a multiplexer that determines the input to the first hyperbolic tangent unit. This
is required in order to reuse that hyperbolic tangent unit for the hidden layer neurons
once the input layer neurons have finished processing.
4.4.2.5. Hidden Layer Neurons
The hidden layer neurons each consist of one multiplier and one accumulator. They
take as input the values returned by the hyperbolic tangent units, multiply these values
by the appropriate weights and accumulate the multiplication results.
The inputs to the multiplier are a 15-bit hyperbolic tangent value and a 17-bit weight
value. The multiplication result is a 35-bit number, with the 30 least significant bits
being the fractional part and the other 5 bits the integer part. The accumulator takes this
35-bit number and adds it to its current sum. Once the accumulation process is complete,
the result is passed to the hidden layer controller. The binary representation of the
accumulator and multiplier results is illustrated in Table 4.5.
The three neurons share the same activation function unit. This is possible because
the neuron following the 10x10 regions performs 4 accumulations, the neuron following
the 5x5 regions 16 accumulations and the neuron following the 5x20 regions 6
accumulations. So the activation function can be used first by the 10x10 hidden neuron,
then by the 5x20 hidden neuron and finally by the 5x5 hidden neuron, without any
conflicts. Figure 4.8 shows a model of the hidden layer neurons.
Figure 4.8: Implementation of a hidden layer neuron
4.4.2.6. Output Layer Neuron
The final output neuron determines the face detection system output. It consists of
one multiplier and one accumulator. The neuron takes as inputs a 15-bit hyperbolic
tangent value and a 17-bit weight value. The inputs are multiplied, producing a 35-bit
result with the 30 least significant bits being the fractional part and the remaining 5 bits
the integer part. The binary representation of the accumulator and multiplier results is
illustrated in Table 4.5. The multiplication result is then sent to the accumulator for the
accumulation process. Once the accumulation process is complete, the accumulator
subtracts its threshold value from the total sum. If the result of the output
neuron is greater than or equal to 0, the 20x20 window is classified as a face; otherwise it is
classified as a non-face. Figure 4.9 shows the implementation of the output neuron.
Figure 4.9: Implementation of the output neuron
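The final decision rule is simple enough to state directly in code. This is a sketch of the subtract-and-compare step described above, not the RTL itself; the values are in the 35-bit fixed-point format of Table 4.5.

```cpp
#include <cassert>
#include <cstdint>

// Output neuron decision: subtract the threshold from the accumulated sum
// and classify on the sign of the result (>= 0 -> face, < 0 -> non-face).
bool classify_window(int64_t accumulated_sum, int64_t threshold) {
    return (accumulated_sum - threshold) >= 0;
}
```

In the hardware the same test reduces to inspecting the sign bit (bit 34) of the subtraction result.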
Table 4.5: Binary representation of the multiplier and accumulator results for the
hidden and output layer neurons

Multiplier results:  35 bits total (5-bit integer part, 30-bit fractional part)
Accumulator results: 35 bits total (5-bit integer part, 30-bit fractional part)
4.5. Overall System architecture
The operation of the face detection system proceeds in two stages. In the first stage
each 20x20 window pixel is read and processed by the input layer of the neural network.
This includes generating the addresses to read a 20x20 window from the 320x240
memory, reading the weight value that corresponds to each pixel from the proper weight
memory, enabling the respective accumulator according to the pixel's x and y window
coordinates and, for each input layer neuron that completes its operation, sending its
result to the activation function. In the second stage the hidden layer neurons and output
layer neuron finish their operations. The two stages are repeated until all the windows
have been read and processed. An FSM diagram illustrating the face detection system
operation is shown in Figure 4.11.
The first stage requires 3 cycles for each window pixel to be read and processed. In
the first cycle the addresses needed for the 320x240 image memory and the weight
memory are generated. In the second cycle the pixel and weight values are loaded from
the memory units and multiplied. In the third cycle the accumulation takes place. Since a
window contains 400 pixels, the first stage takes 1,200 clock cycles to complete.
In the second stage, the result of each input layer neuron that completes its operation
requires two cycles to be multiplied and accumulated by the respective hidden layer
neuron. In the first cycle the hyperbolic tangent memory is accessed, and in the second
the hyperbolic tangent output is multiplied with the weight value and accumulated. The
same procedure is followed for the output layer neuron. The second stage requires 66
clock cycles to complete.
Figure 4.10: Face detection system architecture
To summarize, 2,745 windows are generated for the 320x240 frame, each requiring
1,266 cycles, so a frame is finished in 3,475,170 cycles, corresponding to 34 ms. Thirty
frames therefore take approximately 1.02 seconds to complete, so the goal set for real
time video processing has been achieved. The architecture of the face detection system
is illustrated in Figure 4.10. The memory requirements for the proposed face detection
system are shown in Table 4.6.
Unit Memory
Image Pyramid generation 204.500 KB
Hyperbolic tangent memories 2.160 KB
Weights memories 2.994 KB
Memories required for presentation 1.200 KB
Total: 210.854 KB
Table 4.6: Memory requirements when the Image pyramid generation unit is integrated and three neural networks are used.
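The window and cycle counts of section 4.5 can be checked with a few lines of arithmetic: a 20x20 window slid with a 5-pixel step over a 320x240 frame yields 61 x 45 = 2,745 positions, and each window costs 1,200 cycles in the first stage plus 66 in the second.

```cpp
#include <cassert>

// Number of window positions when a win x win window is slid over a
// frame_w x frame_h frame with the given pixel step.
int window_count(int frame_w, int frame_h, int win, int step) {
    return ((frame_w - win) / step + 1) * ((frame_h - win) / step + 1);
}

// Total cycles for one frame: every window pays both stage costs.
long total_cycles(int windows, int stage1, int stage2) {
    return static_cast<long>(windows) * (stage1 + stage2);
}
```

At the resulting 3,475,170 cycles per frame, 34 ms per frame implies a clock of roughly 100 MHz.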
[Figure content: FSM with three states. From "System idle", "Start" moves to the first
state, which enables the window generation unit (start generating the window) and
enables the neural network. On "Finished reading window" the FSM moves to the second
state, where it waits for the neural network to finish processing while the window
generation unit is on hold. On "Window processing finished" it returns to window
generation; on "Finished processing the last window" it returns to "System idle". A press
of the reset button resets the system.]
Figure 4.11: FSM illustrating the face detection system operation
Chapter 5: Experimental methodology
5.1. Experimental strategy
The face detection system was synthesized using Xilinx ISE 9.1i; the synthesis
report is appended in Appendix A. Debugging and error correction were done using
the ModelSim XE III 6.2c simulator in the early stages of the implementation, and later by
viewing the results on a VGA monitor to test the actual hardware. Simulation waveforms
are appended in Appendix A.
To test the correctness of the implemented hardware face detection system, it was
necessary to write a software equivalent of the face detection system and compare the
results. The software was written in C++, since in software it is easier to find errors in
the algorithm implementation and in the mathematical operations used. After the
software was verified, the results of the two systems were compared. Additionally, the
C++ software was extended to take a 320x240 frame as input and output the faces
found in the frame.
The hardware and software implementations of the face detection system were given
the same set of 20x20 frames, 15 faces and 15 non-faces, drawn from a constructed
database of various faces, in order to compare their results. The multiplication,
accumulation and tangent function results of the two systems were compared to verify
that the hardware implementation was correct. The results matched, and the only slight
differences were due to the loss of accuracy when representing real numbers in binary
format.
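The accuracy loss mentioned above can be demonstrated by rounding a real number to a fixed-point grid; the 16 fractional bits used here are an arbitrary illustration, and the resulting error is bounded by one unit in the last place.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Round a real value to the nearest multiple of 2^-frac_bits, the way a
// fixed-point hardware representation would store it.
double quantize(double v, int frac_bits) {
    const double scale = static_cast<double>(1LL << frac_bits);
    return std::round(v * scale) / scale;
}
```

Values exactly representable in the format (such as 0.5) survive unchanged; all others move by at most 2^-17 for a 16-fractional-bit format, which explains the small mismatches between the two implementations.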
In addition, the two implementations were given the same 320x240 frames to operate
on. This made it possible to check the same 320x240 frame at each window position and
make sure the two systems gave the same result, which made it easier to test the
hardware under working conditions. Again the two systems produced the same results.
5.2. Experimental Setup
The setup of the system is illustrated in Figure 5.1 and Figure 5.2. It consists of the
FPGA board and a VGA monitor on which the image and the results are viewed.
Figure 5.1: Experimental setup. The system consists of the FPGA board and a VGA monitor.
Figure 5.2: Output of the system.
5.3. Results and discussion
The faces and non faces of Table 5.1 were given as input to the neural network and it
classified each input frame as illustrated in the following table. The majority of faces
were classified as non faces which yield a high number of false negatives. Only one non
face from the non face test set was classified as a face. Table 5.2 shows some of the
320x240 images that were given to the system as input. In this case the system found the
majority of faces in all four images but also found a lot of false positives.
The conclusion from these tests is that the weights and thresholds produce many
false positives and false negatives, and so are not suitable for the generalized task of
face detection, as they were produced from a data set that was too specific. But since
they were only used to design the neural network prototype, it is possible to replace
them with better ones and produce better results. The addition of a pre-processing stage
may also reduce false detections. Another conclusion drawn from these tests is that the
system is very sensitive to small differences in the window frame and thus produces
many false positives.
Sample 20x20 images

Sample   Face image classified as   Non-face image classified as
1        Face                       Face
2        Face                       Non-face
3        Non-face                   Non-face
4        Non-face                   Non-face
5        Non-face                   Non-face
6        Non-face                   Non-face
7        Non-face                   Non-face
8        Face                       Non-face
9        Non-face                   Non-face
10       Non-face                   Non-face
11       Non-face                   Non-face
12       Non-face                   Non-face
13       Non-face                   Non-face
14       Face                       Non-face
15       Non-face                   Non-face

Table 5.1
Sample 320x240 images

Input image   Result
1             Found 4 out of 6 faces
2             Found 13 out of 17 faces
3             Found 15 out of 15 faces
4             Found 6 out of 7 faces

Table 5.2
Chapter 6: Discussion - Future Work and Improvement
6.1. Conclusions and Discussion
During the course of this project many issues needed to be addressed. First of all it
was necessary to decide the exact structure of the neural network, its processing
throughput, and the resource sharing of the network units. Next the binary
representations and the precision of the negative and fractional numbers needed to be
determined. After this the memory requirements for the weights had to be addressed,
with respect to the sharing of these memories between neurons and the parallelism that
this sharing would allow. Additionally, a decision had to be made on how the hyperbolic
tangent would be implemented: either as a unit that actually calculates the hyperbolic
tangent value, or as a lookup table in which these values are stored. Furthermore, the
implementation was heavily driven by what resources the FPGA had to offer and how
their utilization allows for further improvement of the system. All these issues have been
addressed and solved in the manner described in Chapter 3, and the project was
implemented successfully.
As noted in section 5.3, the system produced a high number of false positives and
false negatives. In order to balance them, the threshold of the output neuron was altered
to see how the system's behaviour would change. First the threshold was decreased in
an attempt to reduce the false negatives. As a result more windows were classified as
faces, and the number of false positives increased. Next the threshold value was
increased in order to decrease the number of false positives. The number of windows
classified as faces then decreased dramatically, but the correct detections decreased as
well. Hence this solution did not prove to be a good one, and other methods need to be
explored to improve and balance the system's performance.
The throughput of the system is approximately 30 frames per second, as mentioned
in section 4.5, so the implemented design is capable of real time video face detection.
The issue with the number of false positives and false negatives that the system
produces can be fixed by utilizing new and better weights.

The problems that arose during the implementation of the project mostly had to do
with the complexity of the system and the lack of experience in implementing such a
large design.

Through the process of implementing this project I gained valuable experience in
hardware design that will prove useful in the future. I also became familiar with the
paradigm of neural networks and the benefits they offer.
6.2. Improvement
There is a lot of future work that could be done to improve and advance this project.
The first necessary improvement is the completion of the project: at this point the image
pyramid generation (IPG) unit and the neural network (NN) unit are two separate
designs, and for the system to be complete they must be integrated into the same
system. After this is done, the neural network needs to be retrained on a much larger
and more varied database, because the given weights and threshold values did not
demonstrate the best results in terms of accuracy and correctness. This will improve the
detection performance of the system. After the system is complete, the next step is to
change the input of the design from one static 320x240 frame to a video stream from a
digital camera. This will allow for real time face detection.
The output images shown in Table 5.2 have many indications for the same face,
which can lead to the processing of window images that contain the same face. A way
around this problem is to introduce a filtering mechanism that tracks and eliminates
redundant indications of the same face. A vast number of square indicators would thus
be eliminated, reducing the number of false positive and false negative classifications.
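One possible form of the proposed filtering mechanism is sketched below. Treating two detections as duplicates when their top-left corners lie within a fixed distance of each other is an illustrative heuristic, not a design taken from the thesis.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

struct Detection { int x, y; };  // top-left corner of a 20x20 window

// Keep only the first detection in each cluster of nearby window positions;
// any detection within `dist` pixels of an already kept one is discarded.
std::vector<Detection> merge_detections(const std::vector<Detection>& dets,
                                        int dist) {
    std::vector<Detection> kept;
    for (const Detection& d : dets) {
        bool duplicate = false;
        for (const Detection& k : kept)
            if (std::abs(d.x - k.x) <= dist && std::abs(d.y - k.y) <= dist)
                duplicate = true;
        if (!duplicate) kept.push_back(d);
    }
    return kept;
}
```

A refinement would be to require a minimum number of overlapping detections before keeping a cluster, which also suppresses isolated false positives.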
6.3. Future Work
The image pyramid generation process can be changed from the nearest neighbour
approach to bilinear interpolation. Using bilinear interpolation better retains the quality of
the downscaled images, thus improving the chances of correct detection by the neural
network.
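Bilinear interpolation blends the four nearest input pixels instead of copying the single nearest one. A minimal sketch of the sampling step, on a grayscale image stored as rows of doubles:

```cpp
#include <cassert>
#include <vector>

// Sample the image at a fractional coordinate (x, y) by bilinear blending of
// the four surrounding pixels. Caller must keep x, y inside [0, w-1) x [0, h-1).
double bilinear(const std::vector<std::vector<double>>& img, double x, double y) {
    int x0 = static_cast<int>(x), y0 = static_cast<int>(y);
    int x1 = x0 + 1, y1 = y0 + 1;
    double fx = x - x0, fy = y - y0;  // fractional offsets within the cell
    return img[y0][x0] * (1 - fx) * (1 - fy) + img[y0][x1] * fx * (1 - fy)
         + img[y1][x0] * (1 - fx) * fy + img[y1][x1] * fx * fy;
}
```

Downscaling then samples this function at the scaled coordinates of each output pixel, which averages detail rather than discarding it as nearest neighbour does.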
Furthermore, a pre-processing stage can be introduced that receives the 20x20
image window and improves its quality via histogram equalization, in order to improve
the chances of accurate detection by the neural network. Such a stage can also correct
the image's lighting conditions to eliminate environmental effects in the image.
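Histogram equalization remaps 8-bit pixel values through the cumulative histogram so that their distribution spreads over the full 0-255 range. A minimal sketch of the proposed pre-processing step:

```cpp
#include <cassert>
#include <vector>

// Equalize 8-bit pixel values: each pixel is mapped to its cumulative
// histogram rank scaled to 0-255.
std::vector<int> equalize(const std::vector<int>& pixels) {
    int hist[256] = {0};
    for (int p : pixels) ++hist[p];
    int cdf[256];
    int running = 0;
    for (int v = 0; v < 256; ++v) { running += hist[v]; cdf[v] = running; }
    int n = static_cast<int>(pixels.size());
    std::vector<int> out;
    for (int p : pixels)
        out.push_back((cdf[p] * 255) / n);
    return out;
}
```

Because the mapping depends only on each window's own histogram, it compensates for global lighting differences between windows without any training.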
The problem of balancing the number of false positives and false negatives still
needs to be dealt with. A more orthodox approach than the one mentioned in section 6.1
would be to re-examine the network: specifically, to examine the input layer networks,
find which one has the biggest effect on the system's outcome in the cases of false
positive and false negative detections, and focus on changing that particular input layer
network.
More neural networks can be integrated into the system so that the frames created
can be processed in parallel. Also, the neural network implementation could be revised
so that it is not so dependent on the system clock cycles, which would improve its speed
and reduce the total area of the system.
The neural network can be combined with other algorithms to improve the system's
accuracy. For example, after a window has been classified as a face by the neural
network, it could be compared with a series of templates to verify that it really is a face.
Another possibility would be to place three neural networks trained with different
training sets in the same system. The three networks would process the same window,
and voting on their results would determine the final outcome.
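The proposed three-network voting scheme reduces to a majority function over the three verdicts; a trivial sketch:

```cpp
#include <cassert>

// Majority vote over three face/non-face verdicts: the window is declared a
// face when at least two of the three networks agree it is one.
bool majority_vote(bool a, bool b, bool c) {
    return (a && b) || (a && c) || (b && c);
}
```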
Other possible improvements are to change the neural network's structure and add
more neurons, to increase the capability and accuracy of the network and make it more
reliable. Also, the regions that are not processed due to the window 5-pixel overlap can
be classified by interpolating the classification results of their processed neighbours,
thus determining a result for the non-processed regions.
Appendix A
Part of the synthesis report produced by Xilinx ISE 9.1i
Advanced HDL Synthesis Report

Macro Statistics
# FSMs : 1
# Multipliers : 15
 10x5-bit multiplier : 1
 10x9-bit multiplier : 1
 5x3-bit multiplier : 1
 5x4-bit multiplier : 1
 5x5-bit multiplier : 1
 6x4-bit multiplier : 3
 6x5-bit multiplier : 1
 6x6-bit multiplier : 5
 8x10-bit multiplier : 1
# Adders/Subtractors : 82
 10-bit adder : 2
 11-bit subtractor : 1
 12-bit subtractor : 1
 15-bit adder : 4
 17-bit adder : 2
 3-bit adder : 4
 35-bit adder : 4
 35-bit subtractor : 6
 5-bit adder : 1
 5-bit subtractor : 1
 6-bit subtractor : 7
 7-bit adder : 33
 7-bit subtractor : 1
 8-bit adder : 1
 8-bit adder carry out : 3
 8-bit subtractor : 1
 9-bit adder : 4
 9-bit adder carry out : 3
 9-bit subtractor : 3
# Counters : 66
 12-bit up counter : 1
 3-bit up counter : 27
 30-bit up counter : 1
 4-bit up counter : 1
 5-bit up counter : 3
 7-bit up counter : 30
 8-bit up counter : 1
 9-bit up counter : 2
# Accumulators : 31
 35-bit up accumulator : 30
 8-bit up accumulator : 1
# Registers : 330
 Flip-Flops : 330
# Latches : 7
 1-bit latch : 3
 8-bit latch : 3
 9-bit latch : 1
# Comparators : 84
 10-bit comparator equal : 3
 10-bit comparator greatequal : 9
 10-bit comparator greater : 4
 10-bit comparator less : 10
 10-bit comparator lessequal : 10
 10-bit comparator not equal : 1
 11-bit comparator equal : 1
 12-bit comparator equal : 1
 5-bit comparator less : 5
 5-bit comparator lessequal : 6
 7-bit comparator equal : 29
 8-bit comparator greatequal : 1
 9-bit comparator equal : 1
 9-bit comparator greatequal : 1
 9-bit comparator greater : 1
 9-bit comparator not equal : 1
# Multiplexers : 14
 1-bit 4-to-1 multiplexer : 4
 35-bit 4-to-1 multiplexer : 10

=========================================================================
*                            Final Report                               *
=========================================================================
Final Results
RTL Top Level Output File Name : Face_detector.ngr
Top Level Output File Name : Face_detector
Output Format : NGC
Optimization Goal : Speed
Keep Hierarchy : NO

Design Statistics
# IOs : 43

Cell Usage:
# BELS : 9026
 # GND : 42
 # INV : 343
 # LUT1 : 254
 # LUT2 : 1177
 # LUT3 : 857
 # LUT3_D : 13
 # LUT3_L : 1
 # LUT4 : 2286
 # LUT4_D : 4
 # LUT4_L : 5
 # MULT_AND : 1
 # MUXCY : 1998
 # MUXF5 : 269
 # MUXF6 : 32
 # MUXF7 : 16
 # VCC : 34
 # XORCY : 1694
# FlipFlops/Latches : 1793
 # FD_1 : 20
 # FDC : 144
 # FDCE : 1383
 # FDCPE : 34
 # FDE : 96
 # FDE_1 : 20
 # FDP : 1
 # FDR : 10
 # FDR_1 : 18
 # FDRE : 30
 # FDRS : 1
 # LD : 2
 # LDC : 1
 # LDC_1 : 4
 # LDCP : 16
 # LDCPE_1 : 9
 # LDP_1 : 4
# RAMS : 70
 # RAMB16_S18 : 29
 # RAMB16_S4_S4 : 38
 # RAMB16_S9 : 2
 # RAMB16_S9_S9 : 1
# Clock Buffers : 8
 # BUFG : 7
 # BUFGP : 1
# IO Buffers : 42
 # IBUF : 9
 # OBUF : 33
# MULTs : 17
 # MULT18X18 : 17
=========================================================================

Device utilization summary:
---------------------------
Selected Device : 2vp30ff896-7
Number of Slices: 2647 out of 13696 (19%)
Number of Slice Flip Flops: 1793 out of 27392 (6%)
Number of 4 input LUTs: 4940 out of 27392 (18%)
Number of IOs: 43
Number of bonded IOBs: 43 out of 556 (7%)
Number of BRAMs: 70 out of 136 (51%)
Number of MULT18X18s: 17 out of 136 (12%)
Number of GCLKs: 8 out of 16 (50%)
Modelsim Simulation Waveforms
Figure A1: Region 1 neurons accumulation operation. The first two neurons have finished
accumulation. The last two are still accumulating.
Figure A2: Region 2 neurons accumulation operation. The first eleven neurons have finished
accumulation. The last five are still accumulating.
Figure A3: Region 3 neurons accumulation operation. The first four neurons have finished
accumulation. The last two are still accumulating.
Figure A4: Order in which the accumulation operation ends at each neuron.
Figure A5: Illustration of the neural network's final output. First the face_or_noface signal becomes 1,
which represents a non-face, and then the ready signal is asserted to inform the other units to send the next window.
Appendix B

Face Detector top level implementation Verilog code

module Face_detector(show, switches, systemClock, leds, up, right, left, down,
    vSync, hSync, Sync, blank, redOUT, greenOUT, blueOUT, clock25Mhz);

////////////////////////
// General Signals
////////////////////////
input show;           // Negative Logic
input[3:0] switches;  // Negative Logic
input systemClock;
wire address_clock;
wire mem_clock;
wire mem_clock_vga;
wire acc_clock;
output[3:0] leds;

/////////////////
// Reset signals
/////////////////
wire reset_window_gen;
wire reset_NN;
wire reset_enable_unit;
wire reset_clock_generator;
wire reset_addr_buffer;

//////////////////////////
// VGA Signals
//////////////////////////
input up, right, left, down;  // Negative Logic
wire[9:0] col_counter;
wire[9:0] row_counter;
wire[7:0] REDvalue, GREENvalue, BLUEvalue;
output wire clock25Mhz;
output vSync, hSync, Sync, blank;
output[7:0] redOUT, greenOUT, blueOUT;

////////////////////////////
// 10x10 neurons signals
////////////////////////////
wire[6:0] address_weight_mem_10x10;
wire[3:0] enable_10x10;
wire[16:0] weight_value_10x10;

///////////////////////
// 5x5 neurons signals
///////////////////////
wire[4:0] address_weight_mem_5x5;
wire[15:0] enable_5x5;
wire[16:0] weight_value_5x5;

/////////////////////////
// 20x5 neurons signals
/////////////////////////
wire[6:0] address_weight_mem_20x5_1;
wire[6:0] address_weight_mem_20x5_2;
wire[5:0] enable_20x5;
wire[16:0] weight_value_20x5_1;
wire[16:0] weight_value_20x5_2;

//////////////////////
// Hidden 2 neuron
//////////////////////
wire face_or_noface_320x240;
wire ready_NN_320x240;

/////////////////////////////////////////////
// Signals for window creation and viewing
/////////////////////////////////////////////
wire[8:0] base_column_joy;
wire[7:0] base_row_joy;
wire[8:0] base_column_rast_scan;
wire[7:0] base_row_rast_scan;
wire[8:0] base_column;
wire[7:0] base_row;
wire[8:0] address20x20;
wire[16:0] address320x240;
wire[8:0] w_address20x20;
wire[16:0] read_address320x240;
wire[7:0] pixel_value_20x20;
wire[7:0] pixel_value_320x240;
wire[7:0] pixel_value_face;
wire[7:0] pixel_value_noface;
wire[8:0] address_classification_image;
wire[7:0] read_pixel_value_320x240;
wire[7:0] doutb;
wire we;
wire stand_by;  // to leds
wire next_window;
wire[11:0] number_of_faces;

//////////////////////////////////////////////////////////
// Modules required for system monitoring and control
//////////////////////////////////////////////////////////

// Generates the clocks for the memories, the multipliers and the accumulators
system_clock_generator scg(reset_clock_generator, systemClock, address_clock,
    mem_clock_vga, mem_clock, acc_clock);

push_button_modifier pbm_right(systemClock, ~right, ~up, ~down, ~left,
    next_window);

timer time_track(systemClock, ready_NN_320x240, time_out);

// Speed at which the face detector works
mode_mux mode_m(~switches[2], time_out, next_window, mode);

enable_view view_pic(reset_NN, show, view);

controller ctrl(~switches[3], systemClock, done_window, done_frame, mode,
    enable_win_gen, stand_by, reset_NN, reset_enable_unit,
    reset_clock_generator, reset_window_gen, mem_clock, reset_addr_buffer,
    we, mode);

// Controller that enables the appropriate memories and accumulators
// according to the created pixel window id
controller_column_row_gen enable_unit(reset_enable_unit, address_clock,
    enable_10x10, enable_5x5, enable_20x5, address_weight_mem_10x10,
    address_weight_mem_5x5, address_weight_mem_20x5_1,
    address_weight_mem_20x5_2);

////////////////////////////////////////////////////
// Modules required to operate the Neural Network
////////////////////////////////////////////////////

// Contains the weights for the 10x10 neurons
weights_memory_module_10x10 mem10x10(address_weight_mem_10x10, enable_10x10,
    mem_clock, weight_value_10x10);

// Contains the weights for the 5x5 neurons
weights_memory_module_5x5 mem5x5(address_weight_mem_5x5, enable_5x5,
    mem_clock, weight_value_5x5);

// Contains the weights for the 20x5 neurons
weights_memory_module_20x5 mem20x5(address_weight_mem_20x5_1,
    address_weight_mem_20x5_2, enable_20x5, mem_clock,
    weight_value_20x5_1, weight_value_20x5_2);

// Neural Network
// Reads values from the 320x240 memory
neural_network NN(reset_NN, systemClock, enable_10x10, enable_5x5,
    enable_20x5, read_pixel_value_320x240, acc_clock, weight_value_10x10,
    weight_value_5x5, weight_value_20x5_1, weight_value_20x5_2,
    ready_NN_320x240, face_or_noface_320x240);

face_counter fac_count(~switches[3], done_frame, ready_NN_320x240, mode,
    face_or_noface_320x240, number_of_faces);

/////////////////////////////////////////////////////////////////
// Modules responsible for data extraction and image showing
/////////////////////////////////////////////////////////////////

// VGA clock divider
VGAclockDivider clock(systemClock, clock25Mhz);

// VGA controller
vga_controller vga(REDvalue, GREENvalue, BLUEvalue, clock25Mhz, vSync, hSync,
    Sync, blank, redOUT, greenOUT, blueOUT, col_counter, row_counter);

navigation_module nav_unit(~switches[3], ready_NN_320x240, ~up, ~right,
    ~left, ~down, base_column_joy, base_row_joy);
window_generator gen(~switches[3], ready_NN_320x240, base_column_rast_scan,
    base_row_rast_scan);

br_bc_mux bcbr(~switches[2], base_column_rast_scan, base_row_rast_scan,
    base_column_joy, base_row_joy, base_column, base_row);

new_window_generator new_gen(reset_window_gen, enable_win_gen, base_column,
    base_row, address_clock, read_address320x240, done_window, done_frame,
    mode);

address_to_write addr_w(reset_addr_buffer, address_clock, w_address20x20);

// Address to output to VGA
Address addr(col_counter, row_counter, address20x20, address320x240,
    address_classification_image);

Source sour(ready_NN_320x240, mode, view, face_or_noface_320x240,
    pixel_value_20x20, pixel_value_320x240, pixel_value_noface,
    pixel_value_face, col_counter, row_counter, base_column, base_row,
    REDvalue, GREENvalue, BLUEvalue);

//////////////////////////////////////////////
// Images and window buffers
//////////////////////////////////////////////

// Classification images
classification_face face(address_classification_image, systemClock,
    pixel_value_face);
classification_noface noface(address_classification_image, systemClock,
    pixel_value_noface);

// address320x240 -> address for vga
// read_address320x240 -> address by window generator
// read_pixel_value_320x240 -> for NN
//image_320x240 im(address320x240, read_address320x240, systemClock,
//    mem_clock_vga, 8'b00000000, pixel_value_320x240,
//    read_pixel_value_320x240, 1'b0);
//test_faces im(address320x240, read_address320x240, systemClock,
//    mem_clock_vga, 8'b00000000, pixel_value_320x240,
//    read_pixel_value_320x240, 1'b0);
Old_focks im(address320x240, read_address320x240, systemClock,
    mem_clock_vga, 8'b00000000, pixel_value_320x240,
    read_pixel_value_320x240, 1'b0);

window_buffer_20x20 buff(address20x20, w_address20x20, systemClock,
    mem_clock, read_pixel_value_320x240, pixel_value_20x20, doutb, we);
73
//////////////////////////////////////////////
// Output
//////////////////////////////////////////////
LED_unit led_un(~switches[1:0], number_of_faces, stand_by, done_frame, leds);

endmodule


Neural Network top level implementation Verilog code

module neural_network(reset, systemClock, enable_10x10, enable_5x5,
    enable_20x5, pixel_value, acc_clock, weight_value_10x10,
    weight_value_5x5, weight_value_20x5_1, weight_value_20x5_2,
    acc_ready_hidden2, verdict);

//////////////////////////
// General Signals
//////////////////////////
input reset;
input systemClock;
input[7:0] pixel_value;
input acc_clock;

//////////////////////////////
// 10x10 neurons signals
//////////////////////////////
input[3:0] enable_10x10;
input[16:0] weight_value_10x10;
wire[34:0] acc1_final_sum;
wire[34:0] acc2_final_sum;
wire[34:0] acc3_final_sum;
wire[34:0] acc4_final_sum;
wire acc1_ready;
wire acc2_ready;
wire acc3_ready;
wire acc4_ready;

//////////////////////////////
// 5x5 neurons signals
//////////////////////////////
input[15:0] enable_5x5;
input[16:0] weight_value_5x5;
wire[34:0] acc5_final_sum;
wire[34:0] acc6_final_sum;
wire[34:0] acc7_final_sum;
wire[34:0] acc8_final_sum;
wire[34:0] acc9_final_sum;
wire[34:0] acc10_final_sum;
wire[34:0] acc11_final_sum;
wire[34:0] acc12_final_sum;
wire[34:0] acc13_final_sum;
wire[34:0] acc14_final_sum;
wire[34:0] acc15_final_sum;
wire[34:0] acc16_final_sum;
wire[34:0] acc17_final_sum;
wire[34:0] acc18_final_sum;
wire[34:0] acc19_final_sum;
wire[34:0] acc20_final_sum;
wire acc5_ready;
wire acc6_ready;
wire acc7_ready;
wire acc8_ready;
wire acc9_ready;
wire acc10_ready;
wire acc11_ready;
wire acc12_ready;
wire acc13_ready;
wire acc14_ready;
wire acc15_ready;
wire acc16_ready;
wire acc17_ready;
wire acc18_ready;
wire acc19_ready;
wire acc20_ready;

/////////////////////////////////
// 20x5 neurons signals
/////////////////////////////////
input[5:0] enable_20x5;
input[16:0] weight_value_20x5_1;
input[16:0] weight_value_20x5_2;
wire[34:0] acc21_final_sum; wire[34:0] acc22_final_sum; wire[34:0] acc23_final_sum;
wire[34:0] acc24_final_sum; wire[34:0] acc25_final_sum; wire[34:0] acc26_final_sum;
wire acc21_ready; wire acc22_ready; wire acc23_ready;
wire acc24_ready; wire acc25_ready; wire acc26_ready;
////////////////////////////////////
// Hidden 1 - neuron 1 signals 10x10
///////////////////////////////////
wire clock_mem_10x10_to_hidden1;
wire clock_acc_10x10_to_hidden1;
wire[16:0] weight_10x10_to_hidden1;
wire[34:0] acc_final_sum_10x10_to_hidden1; // Hidden 1 - neuron 1 result
wire acc_ready_10x10_to_hidden1;
////////////////////////////////////
// Hidden 1 - neuron 2 signals 5x5
///////////////////////////////////
wire clock_mem_5x5_to_hidden1;
wire clock_acc_5x5_to_hidden1;
wire[16:0] weight_5x5_to_hidden1;
wire[34:0] acc_final_sum_5x5_to_hidden1; // Hidden 1 - neuron 2 result
wire acc_ready_5x5_to_hidden1;
////////////////////////////////////
// Hidden 1 - neuron 3 signals 20x5
///////////////////////////////////
wire clock_mem_20x5_to_hidden1;
wire clock_acc_20x5_to_hidden1;
wire[16:0] weight_20x5_to_hidden1;
wire[34:0] acc_final_sum_20x5_to_hidden1; // Hidden 1 - neuron 3 result
wire acc_ready_20x5_to_hidden1;
////////////////////////////////////
// Hidden 2 - neuron signals
///////////////////////////////////
wire clock_mem_hidden2;
wire clock_acc_hidden2;
wire[16:0] weight_hidden2;
wire[34:0] acc_final_sum_hidden2; // Hidden 2 - neuron result
// if acc_final_sum_hidden2[34] == 1 -> Not a face
// if acc_final_sum_hidden2[34] == 0 -> A face
output acc_ready_hidden2;
output verdict;
//////////////////////////
// Tanh memory signals
/////////////////////////
wire[34:0] tanh_address_10x10_to_hidden1;          // tanh memory address
wire[14:0] tanh_value_10x10_to_hidden1_to_hidden2; // Data
wire[34:0] tanh_address_5x5_to_hidden1;            // tanh memory address
wire[14:0] tanh_value_5x5_to_hidden1;              // Data
wire[34:0] tanh_address_20x5_to_hidden1;           // tanh memory address
wire[14:0] tanh_value_20x5_to_hidden1;             // Data
wire[34:0] tanh_address_hidden2;
/////////////////////////////////////////////////
// Multiplier and accumulators for neuron 10x10
/////////////////////////////////////////////////
neurons_10x10 group1(enable_10x10, acc_clock, reset, pixel_value,
    weight_value_10x10, acc1_final_sum, acc1_ready, acc2_final_sum, acc2_ready,
    acc3_final_sum, acc3_ready, acc4_final_sum, acc4_ready);
/////////////////////////////////////////////////
// Multiplier and accumulators for neuron 5x5
/////////////////////////////////////////////////
neurons_5x5 group2(enable_5x5, acc_clock, reset, pixel_value, weight_value_5x5,
    acc5_final_sum, acc5_ready, acc6_final_sum, acc6_ready,
    acc7_final_sum, acc7_ready, acc8_final_sum, acc8_ready,
    acc9_final_sum, acc9_ready, acc10_final_sum, acc10_ready,
    acc11_final_sum, acc11_ready, acc12_final_sum, acc12_ready,
    acc13_final_sum, acc13_ready, acc14_final_sum, acc14_ready,
    acc15_final_sum, acc15_ready, acc16_final_sum, acc16_ready,
    acc17_final_sum, acc17_ready, acc18_final_sum, acc18_ready,
    acc19_final_sum, acc19_ready, acc20_final_sum, acc20_ready);
//////////////////////////////////////////////////
// Multipliers and accumulators for neuron 20x5
//////////////////////////////////////////////////
neurons_20x5 group3(enable_20x5, acc_clock, reset, pixel_value,
    weight_value_20x5_1, weight_value_20x5_2,
    acc21_final_sum, acc21_ready, acc22_final_sum, acc22_ready,
    acc23_final_sum, acc23_ready, acc24_final_sum, acc24_ready,
    acc25_final_sum, acc25_ready, acc26_final_sum, acc26_ready);
/////////////////////////////////
// 10x10 to hidden1 controller
////////////////////////////////
controller_10x10_to_hidden1 ctrl_1(reset, systemClock,
    clock_mem_10x10_to_hidden1, clock_acc_10x10_to_hidden1,
    weight_10x10_to_hidden1, tanh_address_10x10_to_hidden1,
    acc1_final_sum, acc1_ready, acc2_final_sum, acc2_ready,
    acc3_final_sum, acc3_ready, acc4_final_sum, acc4_ready);
/////////////////////////////////
// 5x5 to hidden1 controller
/////////////////////////////////
controller_5x5_to_hidden1 ctrl_2(reset, systemClock,
    clock_mem_5x5_to_hidden1, clock_acc_5x5_to_hidden1,
    weight_5x5_to_hidden1, tanh_address_5x5_to_hidden1,
    acc5_final_sum, acc5_ready, acc6_final_sum, acc6_ready,
    acc7_final_sum, acc7_ready, acc8_final_sum, acc8_ready,
    acc9_final_sum, acc9_ready, acc10_final_sum, acc10_ready,
    acc11_final_sum, acc11_ready, acc12_final_sum, acc12_ready,
    acc13_final_sum, acc13_ready, acc14_final_sum, acc14_ready,
    acc15_final_sum, acc15_ready, acc16_final_sum, acc16_ready,
    acc17_final_sum, acc17_ready, acc18_final_sum, acc18_ready,
    acc19_final_sum, acc19_ready, acc20_final_sum, acc20_ready);
/////////////////////////////////
// 20x5 to hidden1 controller
/////////////////////////////////
controller_20x5_to_hidden1 ctrl_3(reset, systemClock,
    clock_mem_20x5_to_hidden1, clock_acc_20x5_to_hidden1,
    weight_20x5_to_hidden1, tanh_address_20x5_to_hidden1,
    acc21_final_sum, acc21_ready, acc22_final_sum, acc22_ready,
    acc23_final_sum, acc23_ready, acc24_final_sum, acc24_ready,
    acc25_final_sum, acc25_ready, acc26_final_sum, acc26_ready);
//////////////////
// Tanh Memory
//////////////////
tanh hyperbolic_tangent(tanh_address_10x10_to_hidden1,
    tanh_address_5x5_to_hidden1, tanh_address_20x5_to_hidden1,
    tanh_address_hidden2, acc_ready_20x5_to_hidden1,
    acc_ready_10x10_to_hidden1, clock_mem_10x10_to_hidden1,
    clock_mem_5x5_to_hidden1, clock_mem_20x5_to_hidden1, clock_mem_hidden2,
    tanh_value_10x10_to_hidden1_to_hidden2, tanh_value_5x5_to_hidden1,
    tanh_value_20x5_to_hidden1);
//////////////////////
// Hidden 1 neurons
/////////////////////
hidden1_neuron1 neuron10x10_to_hidden1(reset, clock_acc_10x10_to_hidden1,
    weight_10x10_to_hidden1, tanh_value_10x10_to_hidden1_to_hidden2,
    tanh_address_10x10_to_hidden1[34], acc_final_sum_10x10_to_hidden1,
    acc_ready_10x10_to_hidden1);
hidden1_neuron2 neuron5x5_to_hidden1(reset, clock_acc_5x5_to_hidden1,
    weight_5x5_to_hidden1, tanh_value_5x5_to_hidden1,
    tanh_address_5x5_to_hidden1[34], acc_final_sum_5x5_to_hidden1,
    acc_ready_5x5_to_hidden1);
hidden1_neuron3 neuron20x5_to_hidden1(reset, clock_acc_20x5_to_hidden1,
    weight_20x5_to_hidden1, tanh_value_20x5_to_hidden1,
    tanh_address_20x5_to_hidden1[34], acc_final_sum_20x5_to_hidden1,
    acc_ready_20x5_to_hidden1);
/////////////////////////////////////
// hidden 1 to hidden2 controller
/////////////////////////////////////
controller_hidden1_to_hidden2 ctrl_4(reset, systemClock,
    clock_mem_hidden2, clock_acc_hidden2, weight_hidden2, tanh_address_hidden2,
    acc_final_sum_10x10_to_hidden1, acc_ready_10x10_to_hidden1,
    acc_final_sum_5x5_to_hidden1, acc_ready_5x5_to_hidden1,
    acc_final_sum_20x5_to_hidden1, acc_ready_20x5_to_hidden1);
//////////////////////
// Hidden 2 neurons
/////////////////////
hidden2_neuron hidden2_n1(reset, clock_acc_hidden2, weight_hidden2,
    tanh_value_10x10_to_hidden1_to_hidden2, tanh_address_hidden2[34],
    acc_final_sum_hidden2, acc_ready_hidden2);
classification classify(acc_final_sum_hidden2, verdict);
endmodule
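The signal comments in the module above state that the final decision inspects only bit 34, the sign bit of the 35-bit two's-complement hidden-2 sum (1 means not a face, 0 means a face). A minimal software model of that decision follows; this is illustrative Python, not part of the thesis RTL, and the function name merely mirrors the `classification` module.

```python
# Software model of the final classification stage: the hidden-2 neuron
# produces a 35-bit two's-complement sum, and the classifier checks the
# sign bit (bit 34), per the Verilog comments:
#   acc_final_sum_hidden2[34] == 1 -> not a face
#   acc_final_sum_hidden2[34] == 0 -> a face

def classify(acc_final_sum_hidden2: int) -> bool:
    """Return True (face) when the 35-bit sum is non-negative."""
    sign_bit = (acc_final_sum_hidden2 >> 34) & 1
    return sign_bit == 0

# A positive sum keeps bit 34 clear -> face.
print(classify(0x1FFFF))          # True
# A negative sum sets bit 34 in 35-bit two's complement -> not a face.
print(classify((1 << 35) - 5))    # False (-5 encoded in 35 bits)
```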
New coordinates generator implementation: Verilog code for the coordinate scaling unit of the Image Pyramid generator module.
module new_coordinates(row_320x240, column_320x240, row_260x195, column_260x195,
    row_200x150, column_200x150, row_160x120, column_160x120,
    row_120x90, column_120x90, row_100x75, column_100x75,
    row_80x60, column_80x60, row_60x45, column_60x45,
    row_40x30, column_40x30, row_20x20, column_20x20);
/*
   factor1 = 0.8125  - 0.1101  - 260x195
   factor2 = 0.625   - 0.101   - 200x150
   factor3 = 0.5     - 0.1     - 160x120
   factor4 = 0.375   - 0.011   - 120x90
   factor5 = 0.3125  - 0.0101  - 100x75
   factor6 = 0.25    - 0.01    - 80x60
   factor7 = 0.1875  - 0.0011  - 60x45
   factor8 = 0.125   - 0.001   - 40x30
   factor9 = 0.0625  - 0.0001  &  0.083984375 - 0.000101011 - 20x20
*/
parameter factor1=4'b1101;
parameter factor2=3'b101;
parameter factor4=3'b011;
parameter factor5=4'b0101;
parameter factor7=4'b0011;
parameter factor9_row=9'b000101011;
// 320x240
input[7:0] row_320x240;
input[8:0] column_320x240;
//260x195
output[11:0] row_260x195;     reg[11:0] row_260x195=12'b000000000000;
output[12:0] column_260x195;  reg[12:0] column_260x195=13'b0000000000000;
//200x150
output[10:0] row_200x150;     reg[10:0] row_200x150=11'b00000000000;
output[11:0] column_200x150;  reg[11:0] column_200x150=12'b000000000000;
//160x120
output[6:0] row_160x120;      reg[6:0] row_160x120=7'b0000000;
output[7:0] column_160x120;
reg[7:0] column_160x120=8'b00000000;
//120x90
output[10:0] row_120x90;      reg[10:0] row_120x90=11'b00000000000;
output[11:0] column_120x90;   reg[11:0] column_120x90=12'b000000000000;
//100x75
output[11:0] row_100x75;      reg[11:0] row_100x75=12'b000000000000;
output[12:0] column_100x75;   reg[12:0] column_100x75=13'b0000000000000;
//80x60
output[5:0] row_80x60;        reg[5:0] row_80x60=6'b000000;
output[6:0] column_80x60;     reg[6:0] column_80x60=7'b0000000;
//60x45
output[11:0] row_60x45;       reg[11:0] row_60x45=12'b000000000000;
output[12:0] column_60x45;    reg[12:0] column_60x45=13'b0000000000000;
//40x30
output[4:0] row_40x30;        reg[4:0] row_40x30=5'b00000;
output[5:0] column_40x30;     reg[5:0] column_40x30=6'b000000;
//20x20
output[16:0] row_20x20;       reg[16:0] row_20x20=17'b00000000000000000;
output[4:0] column_20x20;     reg[4:0] column_20x20=5'b00000;
//260x195
always@(row_320x240 or column_320x240)begin
    column_260x195 = column_320x240 * factor1;
    row_260x195 = row_320x240 * factor1;
end
//200x150
always@(row_320x240 or column_320x240)begin
    column_200x150 = column_320x240 * factor2;
    row_200x150 = row_320x240 * factor2;
end
//160x120
always@(row_320x240 or column_320x240)begin
    column_160x120 = column_320x240[8:1];
    row_160x120 = row_320x240[7:1];
end
//120x90
always@(row_320x240 or column_320x240)begin
    column_120x90 = column_320x240 * factor4;
    row_120x90 = row_320x240 * factor4;
end
//100x75
always@(row_320x240 or column_320x240)begin
    column_100x75 = column_320x240 * factor5;
    row_100x75 = row_320x240 * factor5;
end
//80x60
always@(row_320x240 or column_320x240)begin
    column_80x60 = column_320x240[8:2];
    row_80x60 = row_320x240[7:2];
end
//60x45
always@(row_320x240 or column_320x240)begin
    column_60x45 = column_320x240 * factor7;
    row_60x45 = row_320x240 * factor7;
end
//40x30
always@(row_320x240 or column_320x240)begin
    column_40x30 = column_320x240[8:3];
    row_40x30 = row_320x240[7:3];
end
//20x20
always@(row_320x240 or column_320x240)begin
    column_20x20 = column_320x240[8:4];
    row_20x20 = row_320x240 * factor9_row;
end
endmodule
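The factor table in the module's comments maps each pyramid scale to a short binary fraction, so every coordinate remap is a small fixed-point multiply whose binary point is implied by the fraction width, while power-of-two scales (0.5, 0.25, 0.125, 0.0625) reduce to plain bit-slices such as `column_320x240[8:1]`. A quick software check of those factors follows; this is illustrative Python, not the thesis RTL, and the `factors` table and `scale` helper are names introduced only for this sketch.

```python
# Software check of the fixed-point scale factors used by new_coordinates.
# The Verilog stores only the fraction bits of each factor (e.g. 4'b1101
# for 0.1101b = 13/16 = 0.8125); multiplying a coordinate by those bits
# and shifting right by the fraction width gives the scaled coordinate.

factors = {
    "260x195": (0b1101, 4),       # 13/16  = 0.8125
    "200x150": (0b101, 3),        # 5/8    = 0.625
    "120x90":  (0b011, 3),        # 3/8    = 0.375
    "100x75":  (0b0101, 4),       # 5/16   = 0.3125
    "60x45":   (0b0011, 4),       # 3/16   = 0.1875
    "20x20r":  (0b000101011, 9),  # 43/512 = 0.083984375 (row factor)
}

def scale(coord: int, level: str) -> int:
    """Multiply by the stored fraction bits, then drop the fraction part."""
    bits, width = factors[level]
    return (coord * bits) >> width  # integer part of coord * factor

print(scale(320, "260x195"))   # 260: full-width column maps to the 260 frame
print(scale(240, "200x150"))   # 150: full-height row maps to the 150 frame
print(320 >> 1, 240 >> 2)      # 160 60: shift-only levels (0.5 and 0.25)
```

The 20x20 row factor is the odd one out: a pure 0.0625 shift would be exact for the column, but the comment block records that the rows use the finer 9-bit fraction 0.000101011b instead.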