NEURAL NETWORK-BASED FACE DETECTOR IMPLEMENTATION
ON A VIRTEX2 PRO FPGA PLATFORM
by
Christos Kyrkou
Submitted to the University of Cyprus in partial fulfilment
of the requirements for the degree of Bachelor of Science in Computer
Engineering
Department of Electrical and Computer Engineering
May 2008
Examination Committee:
Theocharis Theocharides, Lecturer, Department of ECE, Advisor
Athinodoros Georghiades, Visiting Assistant Professor, Department of ECE, Committee Member
Abstract
Face detection is a key step towards face recognition and a vital task in security
and intelligent vision-based human-computer interaction applications. Current software
face detection implementations lack the computational ability to support detection in
real-time video streams. Hence the need for hardware implementations of face detection
systems arises. Hardware implementations are desirable not only because their speed
allows for real-time video processing, but also because they can be optimized to achieve
the best possible results in terms of area and power consumption.
This thesis focuses on the design of a hardware system on an FPGA platform for the
purpose of performing upright frontal view face detection on an image frame. The
proposed design consists of a neural network that performs face detection on a 320x240
input frame. The neural network receives a 20x20 search window from the image and
classifies the window as a face or a non-face. The system output is an image in which the
windows that were classified as faces are marked. An important part of this thesis is the
effective allocation of the FPGA's resources in order to design a system capable of
parallel processing of frames.
The weights and thresholds for the neural network were provided in collaboration
with Video Mining Incorporated. The weights and thresholds are for upright frontal views
and for a specific data set. The neural network training was done offline and the detection
is done online.
The implemented face detection system can process approximately 30 frames per
second, whereas software implementations process between 15-22 frames per second
under favourable conditions. The detection frame rate also indicates that the system can
perform face detection on real-time video streams.
Περίληψη

Face detection in images is a very important component of applications such as face
recognition, security applications and intelligent human-computer interaction
applications. Current software face detection implementations lack the computational
capability to support real-time detection. Hence the need arises for implementing face
detection systems in hardware. Hardware implementations are desirable not only because
of their speed, which allows real-time processing, but also because they can be optimized
to achieve the best possible results in terms of circuit area and power consumption.

This thesis focuses on the design of a hardware system on an FPGA platform for the
purpose of detecting frontal-view faces in an image frame. The proposed design consists
of a neural network that performs face detection on a 320x240 input frame. The neural
network receives a 20x20 search window from the image and classifies the window as a
face or a non-face. The output of the system is an image in which the windows classified
as faces are enclosed in a frame. An important part of this thesis is the effective
allocation of the FPGA's resources in order to design a system capable of parallel
processing of frames.

The weights and threshold values for the neural network were provided by the
company Video Mining. The weights and thresholds are for frontal-view faces and for a
specific environment. The implemented face detection system can process approximately
30 frames per second, whereas software implementations process between 15-22 frames
per second under favourable conditions. The detection frame rate also indicates that the
system can be used for face detection in real time.
Acknowledgements
I would like to thank my family for their support and understanding throughout the four
years of my studies at the university. For the successful completion of this project I
would first of all like to thank my project supervisor, Theocharis Theocharides, for his
cooperation, guidance and support during the course of completing this project. I would
also like to thank committee member Athinodoros Georghiades for the helpful advice
that he gave me and the useful remarks that he made, which allowed me to complete a
comprehensive report on the project.
Table of Contents
Chapter 1: Motivation
1.1. Face detection .......................................................... 1
1.2. Challenges of face detection ............................................ 1
1.3. General approach for face detection ..................................... 1
1.4. Software Vs Hardware face detection ..................................... 3
1.5. Challenges of hardware face detection ................................... 3
1.6. Contribution ............................................................ 4
Chapter 2: Fundamentals
2.1. Digital Image Processing ................................................ 4
2.2. Image Pyramid Generation ................................................ 7
2.3. Neural Networks ......................................................... 9
Chapter 3: Related Work
3.1. Past work on face detection ............................................ 13
3.2. Hardware face detection ................................................ 16
3.2.1. ASIC Implementation of a Neural Network Based Face Detection ...... 16
3.2.2. FPGA Implementation of a Neural Network Based Face Detection ...... 17
Chapter 4: FPGA Hardware Implementation
4.1. Xilinx Virtex2 Pro XC2VP30 FPGA ........................................ 21
4.2. VGA Controller ......................................................... 22
4.3. Image Pyramid Generation ............................................... 23
4.3.1. IPG Implementation Strategy ....................................... 23
4.3.2. IPG Architecture .................................................. 26
4.4. Neural Network ......................................................... 28
4.4.1. NN implementation strategy ........................................ 31
4.4.2. NN architecture ................................................... 32
4.5. Overall System architecture ............................................ 43
Chapter 5: Experimental methodology
5.1. Experimental strategy .................................................. 47
5.2. Experimental Setup ..................................................... 48
5.3. Results and discussion ................................................. 49
Chapter 6: Discussion - Future Work and Improvement
6.1. Conclusions and Discussion ............................................. 54
6.2. Improvement ............................................................ 56
6.3. Future Work ............................................................ 57
References .................................................................. 59
Appendix A .................................................................. 62
Appendix B .................................................................. 68
Chapter 1: Motivation
1.1. Face detection
Face detection is the process of identifying the locations in an image that contain a face,
regardless of position, orientation, scale or the environmental conditions in the image
[13]. Face detection plays a major role in applications such as security, robotics,
computer vision, multimedia, and intelligent vision-based human-computer interaction
applications. Moreover, it is important because it is the first step in other processes
such as face recognition, face tracking and monitoring.
1.2. Challenges of face detection
Face detection as a problem has many different challenges. Probably the most
important challenge is that a face can appear with many variations. First, there are the
different poses that a face can have according to the relative camera-face angle. A
human face can also have many different facial features such as beards, mustaches and
glasses, as well as various skin tones and shapes. The facial expression of the face is
also a condition that needs to be taken into consideration. Some other factors that make
face detection a challenging task are related to the setting of the image. For instance,
a face may not be fully visible because part of it is hidden by another object. Finally,
the environment in which the image is taken is important, as lighting and weather
conditions affect the way a face appears in the image [8].
1.3. General approach for face detection
In general, a face detection procedure consists of receiving an image frame and trying
to locate the image regions that contain a face. This is done by examining small image
regions, m x n search windows generated from the original source image, and
determining whether they contain a face.
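To give a sense of the workload this implies, the number of window positions in a single frame can be counted (an illustrative sketch; exhaustive pixel-by-pixel scanning with a stride of one is assumed here):

```python
def window_count(M, N, m, n):
    # Number of m x n search-window positions in an M x N image,
    # sliding the window one pixel at a time in each direction
    return (M - m + 1) * (N - n + 1)

print(window_count(240, 320, 20, 20))  # 66521 windows at the original scale
```

Every one of these windows must be classified, which is one reason software detectors struggle to keep up with real-time frame rates.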
A face detection process typically consists of three stages [13]. The first is the image
pyramid generation (IPG) stage. Its purpose is, upon receiving an image frame, to create
downscaled versions of that image at predefined sizes. This is required in order to cover
the case where a face is larger than the examined region and would thus not be detectable
unless the image is scaled down.
The second stage is the preprocessing stage. The goal of this stage is to filter out
noise and lighting variations in the examined region in order to increase the probability
of accurate detection.
The third stage is the detection stage. In this stage the detection algorithm is applied
and returns the faces found in the image. There are three major categories of detection
algorithms. The first is the feature-based approach, where the algorithm tries to find
features that denote a face even under lighting and pose variations. Such features include
texture, skin color and facial features such as the eyes or the mouth. The second category
is the template matching approach, which is based on predefined face templates. Patterns
describing a face or facial features as a whole are used to locate a face. Finally there
is the appearance-based method, where the examined region is treated as data and
classified as containing a face or not. Appearance-based methods include eigenfaces,
neural networks and support vector machine models. These models use a training set to
capture the variety of different faces and facial features, and are then used for face
detection.
Face detectors make two types of errors. The first, called a false negative, occurs
when the detector classifies a face as a non-face. The other, called a false positive,
occurs when the detector classifies a non-face as a face. The latter is a much more
serious error, as it will be propagated to any subsequent processes such as face
recognition, face tracking and the others mentioned above.
1.4. Software Vs Hardware face detection

Modern software implementations have reached a very high level of detection rate,
effectiveness and robustness. However, even with this impressive performance, the best
throughput achievable by software detectors under good conditions is about 15 frames per
second. Hence software implementations are not well suited for real-time applications
that require a throughput of 30 frames per second. Moreover, the complexity of the
algorithms used for preprocessing and filtering is another problem for such methods.
Thus the need for hardware implementations of face detection arises. Hardware face
detectors, although more difficult to implement, can be faster than software face
detectors running on a general purpose processor, while achieving a high detection rate
as well, because they are optimized to perform face detection [13].
1.5. Challenges of hardware face detection
There are several challenges in implementing a hardware face detection system. The
first is how the system will get its input. The best choice would be for the hardware to
have a video input interface, so that input frames come from a digital camera. Next, the
constraints that the algorithm introduces need to be assessed. Such constraints include
the memory required for the algorithm to work, the processing power required to execute
it, the parallelism capabilities of the algorithm, and the mathematics the algorithm
requires and how easily they can be implemented in hardware.
The task of designing a hardware face detection system becomes much more
challenging when designing on an FPGA platform. The reason is that the resources and
capabilities of the FPGA add constraints to what can be done and how. Perhaps the most
important constraint is that the area of the design cannot exceed the area available on the
FPGA. This forces the designer to design the system in such a way that it does not
exceed the total available area. Also, the on-chip memory of an FPGA is predetermined
and often not enough for applications that require or produce a lot of data, such as face
detection. So if there is no access to an external memory and the on-chip memory is not
enough, the designer needs to find ways to share the on-chip memory, which often results
in a loss of speed because of memory interactions. Furthermore, when designing on an
FPGA the system speed is determined by the FPGA's system clock, so the system cannot
operate at its full potential.
1.6. Contribution
In this project a neural network face detector was implemented. The system receives
320x240 images, extracts 20x20 windows and sends them to the neural network for
processing. The neural network then classifies each 20x20 window as a face or a
non-face. Also in this project, an image pyramid generation unit was implemented that
produces smaller scales of the original 320x240 image. The two designs were not
integrated into one system due to time constraints.
Chapter 2: Fundamentals
2.1. Digital Image Processing
A digital image can be considered as a two dimensional array a[x, y] of N finite rows
and M finite columns, where x and y are spatial coordinates and a[x, y] is called the
intensity of the image at that point. An image is composed of picture elements, or pixels
(Figure 2.1). A pixel comprises three color-producing elements, each representing
one of the three primary colors: red, green and blue. This representation is called the
RGB model (Figure 2.2). These primary colors combine together, and their variable
combinations create the colors that the human eye can see. Each primary color takes
values in the range [0, L-1], where L is the number of intensity values. The number of
intensity values L is of the form 2^k, with k denoting the number of bits needed to
represent the intensity values. If k is 8, then L is 256, meaning that there are 256
possible intensity levels and 8 bits are required to represent all of them. The number of
bits needed to store an image is given by M x N x k for grayscale images and by
M x N x k x 3 for color images.
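For example, the formula above gives the storage footprint of the 320x240 frames used elsewhere in this thesis (a quick arithmetic sketch; the frame size is taken from the abstract):

```python
# Storage for a 320x240 image at k = 8 bits per intensity value
M, N, k = 240, 320, 8              # rows, columns, bits per sample
grayscale_bits = M * N * k         # M x N x k
color_bits = M * N * k * 3         # M x N x k x 3, one component per primary color
print(grayscale_bits // 8)         # 76800 bytes (75 KB) per grayscale frame
print(color_bits // 8)             # 230400 bytes per 24-bit color frame
```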
Figure 2.1: Illustration of a pixel in an image
Figure 2.2: The RGB Model [5]
A color image consists of three component images, one for each of the primary
colors. These three images combine on the phosphor screen to produce a composite color
image. The number of bits used to represent each pixel in the RGB model is called the
pixel depth. The standard used is a 24-bit representation for color images, 8-bits used for
each primary color. When all three components are 255, then the resulting color is white.
When all three components are 0, the resulting color is black [5].
Grayscale images, on the other hand, do not require three component images, as the
intensity values of red, green and blue are all equal in a grayscale image. Hence
grayscale images require only a third of the memory compared to color images, because
only 8 bits are used to represent each pixel [5]. This makes grayscale images suitable
for hardware implementations.
Digital images are obtained by sampling and quantization of analog images; this
process is called digitization. Sampling takes place in space: equally spaced samples are
taken along both the horizontal and vertical coordinates. After a sample has been taken,
quantization is required in order to turn the continuous intensity levels into discrete
intensity levels. This can be done by a mapping process that maps a continuous range of
values onto one discrete value. After quantization the discrete value is stored at a
position in the array. Once the whole analog image has been sampled and quantized, the
process of generating a digital image is complete. The sampling rate and the number of
bits used to store the data determine the quality of the digital image [4].
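The quantization mapping described above can be sketched as follows (an illustrative sketch; the normalized [0, 1] input range is an assumption, not part of the original text):

```python
def quantize(v, L=256):
    # Map a continuous intensity v in [0.0, 1.0] onto one of L discrete levels
    return min(int(v * L), L - 1)

print(quantize(0.0), quantize(0.5), quantize(1.0))  # 0 128 255
```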
The process of manipulating a digital image can be divided into three categories
according to the goal that is set. First there is the category of image processing, which
involves tasks such as image enhancement and noise removal. In image processing an
image is processed, and the result is another image that is improved with respect to the
goal. Next there is image analysis. Again the input is an image, but the results are
measurements that give some statistical analysis of the image. Finally there is the
category of computer vision and image understanding. Tasks in this category include
object matching and recognition. The goal here is, given an image, to extract a
high-level description of it.
2.2. Image Pyramid Generation
Image pyramid generation is the process of scaling down an image a number of times
to create smaller copies of that image. Face detection is one application that requires
image pyramid generation. The reason is that not all faces in an M x N image can fit in a
search window of size m x n, where m < M and n < N. Hence smaller versions of the
image are created in order to decrease the size of the faces in the image. This way it is
possible for large faces to fit in the search window and be detected by the system.
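As an illustration, the number of pyramid levels follows directly from the frame size, the window size and the scale factor (a sketch; the 1.2 scaling step is borrowed from the Rowley et al. system discussed in Chapter 3, not fixed by this section):

```python
def pyramid_sizes(w, h, step=1.2, win=20):
    # Downscale repeatedly until the image can no longer hold the search window
    sizes = []
    while w >= win and h >= win:
        sizes.append((int(w), int(h)))
        w, h = w / step, h / step
    return sizes

levels = pyramid_sizes(320, 240)
print(len(levels), levels[0])  # 14 levels, starting at (320, 240)
```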
The algorithm used for the image pyramid generation is very important for the quality
of the smaller images that will be created. At the same time, some applications do not
require the quality of the image to be preserved, and so simpler algorithms can be used.
The simplest algorithm uses scale factors to scale down the image. It does not
preserve quality, but it is quite simple since it requires only two multiplications, one
per coordinate. To find the new coordinates that a pixel value will move to in the new
image, we multiply the X coordinate by a scaling factor Sx to obtain the new horizontal
coordinate X'. The same procedure is followed for the vertical coordinate Y. An example
is illustrated in Figure 2.3. When the new vertical and horizontal coordinates are
calculated, the pixel value that was at X, Y is stored at position X', Y'. To summarize,
X' = X * Sx and Y' = Y * Sy.
Figure 2.3: Image Pyramid Generation Example
Obviously, some pixel values will be mapped to the same new coordinates. As a
result some pixel values will be overwritten by others. Which pixel value is preserved
makes a difference to the quality of the resulting image, and there are different
approaches to choosing it. One approach is to preserve the first pixel value mapped to
the new coordinates. Another is to keep replacing the old pixel value with the new one,
so that the last pixel value remains. A third approach is to use interpolation.
Interpolation achieves better results in terms of quality loss but is more complicated
than the former two approaches.
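The coordinate mapping and one of the collision policies described above (last pixel wins) can be sketched as follows; this is an illustrative sketch, not the hardware IPG unit of Chapter 4:

```python
def downscale(img, sx, sy):
    # Scale-factor downscaling: X' = X * Sx, Y' = Y * Sy, with the last
    # pixel mapped to a given cell overwriting any earlier ones.
    h, w = len(img), len(img[0])
    nh, nw = int(h * sy), int(w * sx)
    out = [[0] * nw for _ in range(nh)]
    for y in range(h):
        for x in range(w):
            xp, yp = int(x * sx), int(y * sy)
            if xp < nw and yp < nh:
                out[yp][xp] = img[y][x]
    return out
```

Keeping the first value instead would only require an extra check before writing; interpolation would combine the colliding values rather than discard them.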
2.3. Neural Networks
Neural networks (NN) are a very useful tool because they introduce a different
approach to problem solving than the traditional algorithmic approach. Their main
advantage is that they can be trained to perform certain tasks. By being trained, and
thus able to "learn", a NN can reorganize its structure and adapt to different
circumstances. Neural networks are separated into two distinct categories. The first
category includes the biological neural networks (BNN), such as the human nervous
system and the human brain. BNN consist of interconnected biological components
called neurons. The second category includes the artificial neural networks (ANN),
which are structured in the same manner as BNN but use artificial components (artificial
neurons) that mimic the operation of biological components and are less complex [1].
Biological neurons (Figure 2.4) are the heart of biological neural networks and are
built from four basic components: the dendrites, the soma, the synapse and the axon.
The dendrites serve as projections of a neuron. They act as conductors of electrical
stimulation, received from other neurons, to the neuron's cell body [2]. The soma
contains the cell's nucleus, whose role is to control the activities of the cell by
regulating gene expression [3]. The axon [23] is a long projection of the neuron that
conducts electrical impulses away from the neuron's soma. The axon can be thought of as
a transmission line that propagates electrical activity from one neuron to another. The
synapse converts the electrical activity from the axon into electrical effects that
inhibit or excite activity in other connected neurons [1].
Figure 2.4: Biological Neuron [16]
The simplest model of an artificial neuron (Figure 2.5) has many inputs and one
output, and the output is determined by the combination of the artificial neuron's
inputs. A more complex but much more dynamic model is the McCulloch and Pitts model
illustrated in Figure 2.6. In this model the inputs are weighted, which affects the
importance of each input on the overall computation. Additionally, a preset threshold
value is used which affects the output of the neuron. The inputs are multiplied by their
respective weights and then summed together. The threshold value is subtracted from the
sum and the artificial neuron outputs the resulting number. The addition of weights and a
threshold makes the artificial neuron a very dynamic and flexible tool that can be
adjusted for different tasks merely by altering the weights and threshold [1].
Figure 2.5: Simple Artificial Neuron [1]
Figure 2.6: McCulloch and Pitts artificial neuron model [1]
The behavior of an artificial neuron can also be modified with the addition of an
activation function. This function can be a linear function, a threshold function, or a
sigmoid function. With a linear function the output of the artificial neuron is
proportional to the weighted sum. With a threshold function the output is one of two
values, depending on whether the weighted sum is greater than, less than, or equal to a
threshold value. With a sigmoid function the output varies continuously as the input
changes, but not in a linear way. All three function types are approximations of the way
that a biological neuron behaves.
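The three activation function types can be sketched as follows (an illustrative sketch; the particular slope and threshold parameters are arbitrary choices, not values from this thesis):

```python
import math

def linear(s, a=1.0):
    # Output proportional to the weighted sum s
    return a * s

def threshold(s, t=0.0):
    # One of two values, depending on which side of the threshold s falls
    return 1 if s >= t else 0

def sigmoid(s):
    # Output varies continuously, but not linearly, with the input
    return 1.0 / (1.0 + math.exp(-s))
```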
A general mathematical definition of the artificial neuron operation is shown in
Equation 2.1, where y is the neuron output, the Xi are the neuron inputs, the Wi are
their respective weights, t is the preset threshold and f is the activation function.

y = f( Σ_{i=0}^{n} (Xi * Wi) − t )

Equation 2.1: Artificial neuron operation
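The neuron operation of Equation 2.1 can be sketched directly (an illustrative sketch; the example weights and threshold are arbitrary, not the VideoMining values used later):

```python
def neuron(inputs, weights, t, f):
    # Equation 2.1: y = f( sum(Xi * Wi) - t )
    s = sum(x * w for x, w in zip(inputs, weights)) - t
    return f(s)

step = lambda s: 1 if s >= 0 else 0  # hard-threshold activation

print(neuron([1, 0, 1], [0.5, 0.5, 0.5], 0.7, step))  # 1.0 - 0.7 >= 0, so 1
```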
An ANN may consist of more than one layer of neurons (Figure 2.7). Commonly it
consists of the input layer, which operates on the inputs of the system; the hidden
layers, which operate on the outputs of the input layer; and the output layer, which
operates on the outputs of the hidden layers and produces the output of the system.
Figure 2.7: Multilayer Neural Network [15]
There are different approaches to training a neural network. The first is
associative mapping. In this approach the NN basically learns to produce a particular
output pattern when a particular input pattern is applied. Then there is the regularity
detection approach, where the NN learns to respond to particular properties of the input
pattern. Such learning approaches are useful when feature discovery is important. For
regularity detection there are two types of networks: fixed networks, whose weights
cannot be changed, and adaptive networks, which have the ability to change their weights
to adapt to a different situation. All training methods used in adaptive networks are
categorised as either supervised or unsupervised. Supervised methods require an external
entity to inform each unit of what the expected output should be for the given inputs.
Unsupervised learning, on the other hand, requires no external interference, because the
network can self-organize the data given to it and detect its emergent collective
properties [1].
The way to teach a three-layer neural network to perform a particular task is to
first present training examples to the network, each containing an input sequence and the
desired output for that input. Then the difference between the actual and the desired
output is calculated, and the weight values of each neuron are changed so that the actual
output moves closer to the desired output.
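A minimal error-correction update along these lines can be sketched for a single hard-threshold neuron (a simplified illustration only; the thesis's own network was trained offline by VideoMining, not with this rule):

```python
def train_step(inputs, weights, t, target, lr=0.1):
    # Compute the actual output, then nudge each weight toward the target
    actual = 1 if sum(x * w for x, w in zip(inputs, weights)) - t >= 0 else 0
    error = target - actual  # difference between desired and actual output
    return [w + lr * error * x for x, w in zip(inputs, weights)]
```

Repeating such updates over many labeled examples is what gradually moves the actual outputs toward the desired ones.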
It is vital to note that the variety of inputs that a NN receives during training
determines its accuracy and its ability to perform as expected in operation.
Because of their parallel architecture, simplicity and fast computation times, NNs
are a preferred choice for real-time applications, especially when hardware
implementation is an option.
Chapter 3: Related Work
3.1. Past work on face detection

Face detection has attracted many researchers' attention because of its numerous
applications, and so various solutions have been proposed. In this section some of these
solutions are described and explained.
Rowley et al. [6] present a face detection system based on artificial neural
networks. The general method involves scanning the image pixel by pixel at various
scales. Initially the image pyramid is generated using scaling steps of 1.2. Each image
is then scanned with a 20x20 pixel window, and each window is passed to the neural
network for classification. Before a window is passed to the neural network it undergoes
histogram equalization. Histogram equalization increases the quality of the pixel window
and makes the facial features clearer, and so it increases the chances of accurate
detection by the neural network. After histogram equalization the window is given to a
router network, which is responsible for bringing the face into an upright frontal view
if it is rotated. The neural network [7] receives as input a 20x20 pixel region of the
image and generates an output ranging from 1 to -1, signifying the presence or absence of
a face. The network's hidden units are of three types: the first looks at the four 10x10
quadrants of the 20x20 window, the second at the 5x5 quadrants of those quadrants, and
the third at six overlapping horizontal regions of 5x20 pixels each [8]. The overlapping
horizontal regions allow the hidden units to detect features such as mouths or pairs of
eyes, while the quadrant units might detect features such as individual eyes, the nose,
or corners of the mouth. Rowley et al. also used arbitration between multiple neural
networks in order to eliminate false positives and increase accuracy. This works because
the different neural networks were trained under different conditions, so they have
different biases and make different errors. The basic algorithm is illustrated in
Figure 3.1.
Figure 3.1: Rowley et al. algorithm for Neural Network-based face detection [7]
The system proposed by Rowley et al. was able to detect 79.5% of faces over two large
test sets, with a small number of false positives [6].
Viola and Jones [9] presented a method for rapid object detection using a boosted
cascade of simple features. The work in [9] had three main contributions. The first is
the introduction of a new image representation called the integral image, which allows
very fast feature evaluation. Each value of the integral image is the sum of the pixels
above and to the left of the corresponding pixel in the original image. The second
contribution of [9] is a method for constructing a classifier by selecting a small
number of important features using AdaBoost. AdaBoost is a machine learning algorithm
that can be used with other algorithms to improve their performance [10]. The aim here
was to ensure fast classification by excluding a large number of features and focusing
on a small set of critical ones. The third and last contribution is a method for
combining successively more complex classifiers in a cascade structure. The cascade
structure dramatically increases the speed of the detector by focusing on promising
image regions. As explained in [9], it is often possible to determine where an object
might occur in an image, and hence it is more promising to look in that region of the
image.
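The integral image described above can be computed in a single pass over the input (a brief sketch of the representation from [9]):

```python
def integral_image(img):
    # ii[y][x] = sum of img over all pixels above and to the left of (x, y),
    # inclusive; computed in one pass using a running row sum.
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row = 0
        for x in range(w):
            row += img[y][x]
            ii[y][x] = row + (ii[y - 1][x] if y > 0 else 0)
    return ii

print(integral_image([[1, 2], [3, 4]]))  # [[1, 3], [4, 10]]
```

Once built, the sum of any rectangular region can be obtained with four lookups into this table, which is what makes the feature evaluation in [9] so fast.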
Yang and Huang [11] used a hierarchical knowledge-based method to detect faces.
The proposed system consists of three levels of rules. A scanning window passes over the
input image and a set of rules is applied to each region. The higher-level rules contain
a general description of what a face looks like, while the lower-level rules look for the
facial features of a human face.
Craw et al. [11] presented a localization method based on predefined face
templates. The templates are shapes of frontal faces (i.e. outlines of faces). First a
Sobel operator is used to extract edges from the image. The edges are then grouped
together to search for a face that matches a template. When the head contour is located,
the same process is repeated to locate the eyes, eyebrows, lips, etc.
Schneiderman and Kanade [8] suggest a statistical method for face detection. To
apply statistical methods to the problem, they represent visual attributes with wavelet
coefficients. A wavelet is a mathematical function used to divide a given function into
different frequency components [12]. This representation allows them to jointly model
image data that is localized in space, frequency and orientation. From this information
they are able to construct a histogram-based face detector. This method requires initial
histograms to be constructed.
Graf et al. [11] developed a method of locating facial features and faces in grayscale
images. The image first passes through a band-pass filter, and then morphological
operations are applied to enhance regions with high intensity and certain shapes. The
histogram of the processed image is then computed and should exhibit a prominent peak.
Based on the peak's value and width, adaptive threshold values are selected in order to
create two binary images. In both of these images, connected component regions are
identified and selected as candidates for facial features. These regions are combined and
evaluated with classifiers to determine if they contain a face. In this method it is not clear
how the morphological operations are performed or how the candidate facial feature
regions are combined to determine if they contain a face.
3.2. Hardware face detection
3.2.1. ASIC Implementation of a Neural Network Based Face Detection
An ASIC implementation of a neural network face detector is presented in [13]. The
algorithm used is the one proposed in [7] and described in section 3.1. The detection
process, which includes the image pyramid generation stage, the image enhancement
stage and the neural network stage, was implemented entirely on one chip. The
architecture operates on a 320 x 240 grayscale image frame.
The image pyramid generation unit acts as the interface of the system with the
outside world. The image pyramid generation unit utilizes a 64-bit bus to communicate
with the image data source and a 32-bit data bus to communicate with the image
enhancement unit. The unit requires 80 KB of memory. It creates the 20x20 windows that
are needed by the algorithm. The windows are generated in raster scan style and handed
to the image enhancement unit.
The image enhancement unit is responsible for improving the quality of the 20x20
windows it receives from the image pyramid generation unit. This unit performs
histogram equalization. The image enhancement unit creates the cumulative distribution
function needed for histogram equalization, which it performs on the whole window
rather than on sub-regions of the window. The module requires 503 clock cycles to fully
process a 20x20 window.
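The equalization step can be illustrated with a standard global histogram-equalization model in Python. This is a software sketch only; the hardware builds the cumulative distribution function in the same spirit, but its exact quantized mapping is not specified here:

```python
def histogram_equalize(window):
    """Equalize an 8-bit grayscale window (list of rows) by mapping each
    pixel through the CDF of the window's own histogram."""
    pixels = [p for row in window for p in row]
    n = len(pixels)
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    # Cumulative distribution function over all 256 gray levels.
    cdf, running = [0] * 256, 0
    for level in range(256):
        running += hist[level]
        cdf[level] = running
    cdf_min = min(c for c in cdf if c > 0)
    # Standard equalization mapping onto the full 0-255 range.
    lut = [round((cdf[level] - cdf_min) * 255 / (n - cdf_min))
           if n > cdf_min else level for level in range(256)]
    return [[lut[p] for p in row] for row in window]
```

Applying this per window stretches low-contrast windows over the full intensity range before they reach the classifier.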
The neural network unit in this implementation was designed for parallelism but
another goal was for the unit to remain as small as possible. This unit receives the 20x20
histogram equalized image from the image enhancement unit. The overall neural network
architecture is as follows. Each pixel of the 20x20 window is multiplied by a predefined
weight value. For each region of the 20x20 window, as described in section 3.1, the
multiplication results are accumulated and then pass through an activation function. The
activation function is a hyperbolic tangent implemented as a lookup table in a 16-bit
SRAM.
The input layer of the neural network consists of three regions. For the region that
divides the window into 10x10 sub-windows and for the 5x5 region, one multiply-
accumulate unit each was used. For the 5x20 region, two multiply-accumulate units were
used because of the overlapping regions. The three regions share the same SRAM unit.
For the hidden and output layers one shared multiply accumulator was used. They
also share a 4 KB SRAM that contains the weights.
The final hardware implementation operates at 125 MHz with a total power consumption
of 165 mW. Face detection was performed at 24 frames per second on a 320x240 image
frame.
3.2.2. FPGA Implementation of a Neural Network Based Face Detection
Another hardware implementation of a face detection system on an FPGA is
described in [13]. The algorithm used was again the one proposed by [7]. The FPGA
platform used was the Xilinx XUP2V-Pro development board.
The implementation assumes a 320x240 input image, 20x20 search windows and 15
pixel overlap between consecutive windows. With these constraints, up to sixteen
sub-windows (up to four in the horizontal direction times four in the vertical direction)
may be active for a given input pixel.
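The sixteen-window figure follows directly from the 5-pixel stride implied by 20-pixel windows with 15-pixel overlap: at most four window origins per axis can cover a given coordinate. A quick Python check (function name and test coordinate are illustrative):

```python
def windows_covering(coord, win=20, stride=5, limit=320):
    """Origins of the windows that contain a given coordinate along one
    axis, for windows of size `win` placed every `stride` pixels."""
    return [o for o in range(0, limit - win + 1, stride)
            if o <= coord < o + win]

# Away from the image border, four origins cover a pixel in each axis,
# giving 4 x 4 = 16 simultaneously active sub-windows.
x_hits = windows_covering(100)
y_hits = windows_covering(100, limit=240)
print(len(x_hits) * len(y_hits))   # 16
```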
The proposed architecture consists of a 32 node processing array. Sixteen of these 32
are layer-one multipliers. The implementation utilizes a packet-switched network on
programmable chip with sixteen multipliers that perform the layer-one multiplications of
the pixel and the respective weight value. There are 256 accumulators utilized in order to
sustain a maximum of 256 in-progress accumulations. The activation function is
implemented using lookup tables.
The neural network dataflow is controlled by registers called context registers. These
context registers are present in each processing node and are responsible for specifying
the destination of the result of the ongoing operation. There are two types of context
registers: multiplier context registers and accumulator context registers. The multiplier
context registers store, besides the destination, the value of the weight that will be used in
the multiplication. The accumulator context registers also store the number of
accumulations to be processed and the current number of accumulations. The architecture
of the implementation is shown in Figure 3.2.
Figure 3.2: Architecture of the neural network-based face detector on an FPGA board [13]
The FPGA implementation achieved almost twice the detection frame rate in
comparison to the ASIC implementation. The multiple multiply-accumulate units of the
FPGA, as well as the concurrent window processing, are the main reasons the FPGA
implementation performs better.
Chapter 4: FPGA Hardware Implementation
This senior design project implements the neural network-based face detector
proposed in [7] on a Xilinx Virtex2 Pro XC2VP30 FPGA platform using the Verilog
hardware description language. The purpose of this project was to explore and
investigate different implementation methods given the board's resources and
constraints. The algorithm performs upright frontal face detection on a 320x240
grayscale image frame. The hardware implementation involved an image pyramid
generation unit responsible for downscaling the original 320 x 240 image frame, the
neural network structure described in [7], and a VGA controller required to output the
results on a VGA screen.
The main goal of this project is the implementation of a prototype of the neural
network that occupies as little area as possible, with a view to integrating more neural
networks into the design. This will allow the concurrent processing of the downscaled
frames, thus achieving a frame rate near 30 frames per second, the rate required for
real-time video processing.
The implemented face detector system includes the VGA controller, a memory
containing a 320x240 image and one neural network. Due to time constraints the image
pyramid generation unit was implemented but not integrated into the face detection
system, and only one neural network was integrated into the design. Nevertheless, the
work done in this project and the decisions taken take into consideration the integration
of the IPG unit and the use of more than one neural network at a later stage. The weights
and thresholds for the neural network were given to us by Video Mining Incorporated, so
this implementation does not try to improve the accuracy of the neural network, only to
implement the design.
4.1. Xilinx Virtex2 Pro XC2VP30 FPGA
The Virtex2 Pro XC2VP30 FPGA board (Figure 4.1) has many features that make it
almost ideal for the task of face detection.
The Virtex2 Pro FPGA, on which the system is designed, meets the memory and
processing requirements of a face detection application. It offers a total of 306 KB of
on-chip block RAM, which is more than capable of storing all the images needed, so the
use of the on-chip block RAM is the preferred option. Processing-wise, the Virtex2 Pro
offers 136 18-bit embedded multipliers and 30,816 logic cells, which is exactly what is
needed for the intense multiplication and accumulation operations that the neural
network requires. It is also equipped with two PowerPC processors, which may be used
to run a software version of the neural network and thus process frames concurrently.
The board hosting the FPGA is also very capable in terms of interfacing with the
outside world and the memory it offers. It is equipped with a video decoder board that
makes it easy to get live video from a connected camcorder. Additionally, it has a
CompactFlash card port that can be used to provide the face detector with new image
frames. The board also has USB 2.0 ports that provide the option of delivering image
frames to the system on the Virtex2 Pro via a USB memory stick. Furthermore, the board
can be connected to a hard disk drive to retrieve frames from the disk. For output
purposes the board offers connectivity with a VGA monitor, through which it is possible
to view the results. The user can interact with the system through the push buttons,
switches and LEDs present on the board. The memory capabilities of the board also fit
the task: it has a 256 MB DDR SDRAM DIMM module. The board's system clock runs
at 100 MHz.
Using the on-chip block RAM for frame storage also makes frame accesses faster and
the design simpler.
Figure 4.1: Virtex2 Pro XC2VP30 FPGA board [18]
4.2. VGA Controller
A VGA controller implementation was required in order to output the system results
onto a VGA monitor. The VGA format used is the 640x480, 60Hz format. For this format
there are 480 rows and each consists of 640 pixels [19].
A VGA monitor outputs the pixel values that it has received in a raster scan fashion.
Beginning from the top left corner of the screen, position (0, 0), it outputs the pixel
values for each column in the row up to position (639, 0), and then moves to the next
row, starting from column zero at position (0, 1). The monitor returns to position (0, 0)
after it has output the pixel at position (639, 479). Figure 4.2 illustrates the order of
pixels on a VGA monitor.
Figure 4.2: Representation of pixels on a VGA monitor [19]
The VGA controller operates with a 25.175 MHz pixel clock. For this reason a clock
divider had to be implemented to divide the 100 MHz system clock by four, which
approximates the clock frequency needed by the VGA. At each negative edge of this
clock the VGA controller fetches the pixel values it needs and outputs them to the
screen.
4.3. Image Pyramid Generation
4.3.1. IPG Implementation Strategy
The Image Pyramid Generation (IPG) unit is responsible for taking a 320 x 240
image and producing downscaled versions of the original image that will be processed by
the neural network for face detection. In order to cover the vast majority of the possible
sizes that a human face can have, substantially downscaled images had to be generated.
The IPG process must conclude with an image frame of 20x20 pixels, to cover the case
where the face occupies the whole image.
The IPG unit could be the bottleneck of the face detection system if not implemented
carefully. If, for example, a downscaled version is generated and immediately processed,
and only then the next downscaled image is created and processed, the face detection
process would have to wait for each downscaled image to be generated and would
remain idle until it is.
The solution proposed in this project for the IPG unit is to create all the needed
downscaled images at once. This can be achieved by multiplying each coordinate with all
the scale factors in parallel to create the new coordinates. This way the face detection
stage only has to wait at the start, until all the required images have been created. Then,
if the hardware constraints allow it, it is possible to integrate as many neural networks as
there are images and have all frames processed at the same time. By doing this, all the
images are processed in the same time that is required to process only the 320 x 240
image.
To implement this idea, the following aspects had to be considered: the total memory
required, the processing resources that the downscale operation will occupy, and which
downscaled dimensions would be most efficient. Efficient in this case means that the
dimensions of the downscaled images are equally distributed, so that neither too many
large images nor too many small images are created. The methodology employed was to
first identify some candidate scales, then calculate the total memory to see if all scales fit
onto the FPGA and, finally, if the memory constraint was met, determine the scaling
factors that produce these scales.
The process of IPG, as mentioned in section 2.1, requires two multiplications to
calculate the new coordinates of a pixel value. This means that if 9 downscaled images
are produced, 18 multipliers would be required for the IPG process. But multipliers are
mostly needed for the face detection process and the neural network implementation.
This is why particular scales were chosen, so that their respective scaling factors do not
require multiplication to calculate the new coordinates. Instead, the new coordinates can
be calculated with shift operations. As an example, consider the scale factor 0.25 for
image scale 7 in Table 4.1. To find the new coordinates x' and y' we need to multiply x
and y by 0.25, which is a power of two. In binary arithmetic, multiplying or dividing by a
power of 2 is equivalent to a left or right shift respectively, so multiplying by 0.25, i.e.
dividing by 4, is a right shift by two positions. This way some multipliers were saved for
the neural network.
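The shift-based coordinate mapping can be stated in one line of code. The sketch below models the idea in Python (the hardware performs the same operation on the coordinate wires):

```python
def downscale_coord(coord, shift):
    """Map a source coordinate to a downscaled coordinate with a right
    shift, replacing a multiplication by 1 / 2**shift."""
    return coord >> shift

# Scale factor 0.25 (image scale 7 in Table 4.1) is a right shift by 2.
print(downscale_coord(200, 2))   # 50, same as int(200 * 0.25)
```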
The final scales chosen are illustrated in Table 4.1, along with the memory each
scale will occupy, the total memory for all images, the scale factors that produce the
respective scale and the number of windows that will be generated out of each scale.
After the scales were determined, it was observed that there is no need for as many
neural networks as there are images. This is because image scales 3 to 10 in Table 4.1
generate 2398 windows in total. The 320x240 image, which has the largest number of
windows, 2745, will therefore still be processing after all the others have finished, and
the other neural networks would remain idle until the next image arrives and is
downscaled. With this in mind, it is possible to process all images in the time it takes to
process a single 320x240 image by using only three neural networks: the first neural
network processes the 320 x 240 image, the second the 260 x 195 image, and the third all
the others.
Image Scale       Memory Required    Scale Factor(s)                       Windows Generated
 1. 320 x 240     76.800 KB          None                                  2745
 2. 260 x 195     50.700 KB          0.8125                                1764
 3. 200 x 150     30.000 KB          0.625                                 999
 4. 160 x 120     19.200 KB          0.5                                   609
 5. 120 x 90      10.800 KB          0.375                                 399
 6. 100 x 75       7.500 KB          0.3125                                204
 7. 80 x 60        4.800 KB          0.25                                  117
 8. 60 x 45        2.700 KB          0.1875                                54
 9. 40 x 30        1.200 KB          0.125                                 15
10. 20 x 20        0.400 KB          0.0625 (column), 0.083984375 (row)    1
                   Total: 204.500 KB
Table 4.1: Resolutions of the scaled-down images, the memory required for each one, the scaling factors
that create each scale, and the number of windows each scale will generate.
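The window counts in Table 4.1 follow from the 20x20 window size and the 5-pixel scan stride (inferred from the 15-pixel overlap stated in section 3.2.2). A short Python check for the two largest scales:

```python
def num_windows(w, h, win=20, stride=5):
    """Number of 20x20 window positions in a w x h image with a 5-pixel
    stride, i.e. 15-pixel overlap between consecutive windows."""
    if w < win or h < win:
        return 0
    return ((w - win) // stride + 1) * ((h - win) // stride + 1)

print(num_windows(320, 240))   # 2745, as in Table 4.1
print(num_windows(260, 195))   # 1764
```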
4.3.2. IPG Architecture
The IPG unit consists of 10 block RAM units and two address/coordinate generation
units. Each block RAM stores one of the ten image scales shown in Table 4.1. The first
address/coordinate generator is responsible for creating the x and y coordinates for the
pixels of the 320x240 image. At the same time it uses these x and y coordinates to
generate the address from which to read from the 320x240 image. The address is
generated using Equation 4.1. The second address/coordinate generator is responsible for
creating the x' and y' coordinates for the 9 downscaled versions of the 320x240 image,
by multiplying the x and y coordinates with the respective scale factors. For simplicity,
only the integer part of the multiplication is kept and no rounding is performed. Again,
the addresses for each image are generated using Equation 4.1 and each image's
respective x', y' coordinates. The IPG unit block diagram is shown in Figure 4.3.
Address = (image_columns * y) + x
Equation 4.1: How to create an address given the x and y coordinates; x represents the current column
and y the current row.
For a pixel value to be read from the 320x240 image RAM and stored in a
downscaled image RAM, 3 clock cycles are required. In the first clock cycle the
address/coordinates from which to read are produced. In the second cycle the pixel value
is loaded from the 320x240 RAM and, at the same time, the addresses to write to are
produced. In the final cycle the pixel value is stored in all the downscaled images' block
RAMs. In total, 230,400 clock cycles are required for all the downscaled images to be
produced and ready for processing.
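The read/scale/write flow above amounts to the following software model, a sketch rather than the Verilog: block RAMs are modelled as flat lists addressed with Equation 4.1, and, as in the hardware, only the integer part of the scaled coordinates is kept (later source pixels mapping to the same destination address simply overwrite earlier ones):

```python
def address(x, y, image_columns):
    """Equation 4.1: linear block-RAM address from (x, y) coordinates."""
    return image_columns * y + x

def downscale(src, src_w, src_h, dst_w, dst_h, sx, sy):
    """Push every source pixel to its scaled position in the destination
    image, truncating the scaled coordinates (no rounding)."""
    dst = [0] * (dst_w * dst_h)
    for y in range(src_h):
        for x in range(src_w):
            xp = int(x * sx)          # integer part only, as in the IPG
            yp = int(y * sy)
            if xp < dst_w and yp < dst_h:
                dst[address(xp, yp, dst_w)] = src[address(x, y, src_w)]
    return dst
```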
Figure 4.3: Block Diagram of the Image Pyramid Generation Unit
4.4. Neural Network
The final unit of the face detector system is the neural network unit. The unit takes as
input a 20x20 pixel window and classifies it as a face or not a face. The unit is partitioned
into three parallel stages. The first stage segments the window into four regions of 10x10
pixels each, searching for facial features such as the nose, eyebrows, glasses, etc. The
second stage segments the window into sixteen regions of 5x5 pixels each. The final
stage segments the window into six overlapping 5x20 regions and searches for pairs of
eyes and the mouth. Figure 4.4 shows in detail the segmentation of the window into these
three region types. Each segment of a region is assigned an artificial neuron that takes as
inputs the pixels of the respective region and the respective weight for each pixel in that
region. These three region types form the input layer of the neural network.
The 10x10 regions have 4 neurons, the 5x5 regions have 16 neurons and the 5x20
regions have six neurons. Each neuron's operation is a multiplication of the pixel values
by the weight values and an accumulation of all the multiplication results. The output of
each neuron is its accumulation result minus its threshold. The output of a neuron then
passes through an activation function, which in this case is a hyperbolic tangent.
Each region's neurons are fully connected to a hidden layer neuron. There are three
hidden layer neurons, one for each region type. The hidden layer neurons multiply the
outputs of the activation functions of all previous neurons with their respective weights.
These neurons also accumulate and threshold their output, and an activation function is
applied to their output as well.
Finally, the outputs of the three hidden layer neurons are propagated to an output
neuron that multiplies them by their respective weights, accumulates the multiplication
results and outputs the result. If the final number is greater than or equal to 0, the image
is classified as a face; otherwise it is classified as a non-face. A detailed description of
the neural network structure is presented in Figure 4.5.
(a)
(b)
(c)
Figure 4.4: The segmentation of the window for (a) 10x10 region (b) 5x5 region (c) 5x20 region
Figure 4.5: Basic Neural Network structure [7]
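The classification path just described can be summarised as a software model in Python. This is a structural sketch only: the region lists, weights and thresholds below are illustrative placeholders, not the trained VideoMining parameters, and the bit-exact fixed-point arithmetic of the hardware is replaced by floats:

```python
import math

def segment_sums(window, weights, regions):
    """One input-layer stage: for each (x0, y0, w, h) region, multiply-
    accumulate pixels with that region's weights, subtract the threshold,
    then apply the tanh activation."""
    outputs = []
    for (x0, y0, w, h), (wts, threshold) in zip(regions, weights):
        acc = sum(window[y0 + j][x0 + i] * wts[j * w + i]
                  for j in range(h) for i in range(w))
        outputs.append(math.tanh(acc - threshold))
    return outputs

def classify(window, stages, hidden, output):
    """stages: one (regions, weights) pair per input-layer stage;
    hidden: one (weights, threshold) pair per hidden neuron;
    output: (weights, threshold). Returns True for 'face'."""
    hidden_out = []
    for (regions, weights), (h_wts, h_thr) in zip(stages, hidden):
        acts = segment_sums(window, weights, regions)
        acc = sum(a * w for a, w in zip(acts, h_wts))
        hidden_out.append(math.tanh(acc - h_thr))
    o_wts, o_thr = output
    score = sum(h * w for h, w in zip(hidden_out, o_wts)) - o_thr
    return score >= 0   # face iff the output neuron result is >= 0
```

For the real detector, `stages` would hold the 10x10, 5x5 and overlapping 5x20 region lists of Figure 4.4 with their trained weights.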
4.4.1. NN implementation strategy
The goal when designing the NN was to minimize the area and resources needed by a
single NN, to make it possible for more than one NN to be integrated into the face
detection system. The factors that need to be considered are how many multipliers and
accumulators are going to be used; the mathematical model used for number
representation, e.g. two's complement or sign-magnitude, and the binary representation
of real and signed numbers; and how to share resources between units.
The first thing that had to be decided was the number of pixels the NN would process
at a time. For simplicity and faster implementation it was decided that the NN would
process one pixel of the 20x20 window at a time. Since only one pixel is processed at a
time, only one accumulator is active in the first two stages of the NN and a maximum of
two in the third stage. Under this observation, one multiplier suffices for each of the first
two stages and two for the third. Hence the neurons of the first and second stages each
share one multiplier, and the third-stage neurons share two because of the overlapping
regions. The number of accumulators has to equal the number of neurons, since all
neurons accumulate independently.
The next thing to consider was the representation of signed numbers in binary. The
two methods considered were two's complement and sign-magnitude, i.e. the absolute
value of the number together with a sign bit. After an early prototype was implemented
using sign-magnitude representation, it was observed that the neurons were much too
complex and the design occupied a lot of area considering it was just a prototype. The
reason was the extra hardware needed to manipulate the sign of each number, and the
implementation of both an adder and a subtractor in each neuron, to use whichever was
required according to the number's sign. For these reasons the number representation
was changed to two's complement, eliminating the need for the extra hardware.
Another factor to consider was the representation of real numbers in binary form. A
fixed-point representation was used: a number of bits was chosen for the integer part and
the remaining bits form the fractional part of the number. The conversion from decimal
to binary for each part was done accordingly.
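A Python sketch of this fixed-point convention, using the 1-bit-integer, 16-bit-fraction format of the weights (section 4.4.2.1) as an example; the function names are illustrative:

```python
def to_fixed(value, frac_bits=16, total_bits=17):
    """Encode a real number as a two's complement fixed-point word with
    the given number of fractional bits."""
    raw = round(value * (1 << frac_bits))
    return raw & ((1 << total_bits) - 1)     # wrap into two's complement

def from_fixed(word, frac_bits=16, total_bits=17):
    """Decode a two's complement fixed-point word back to a float."""
    if word >= 1 << (total_bits - 1):        # MSB set: negative value
        word -= 1 << total_bits
    return word / (1 << frac_bits)

print(from_fixed(to_fixed(-0.5)))   # -0.5 survives the round trip
```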
4.4.2. NN architecture
4.4.2.1. NN Weights and thresholds
Every pixel in each segment of the three region types has a respective weight that
determines its importance in the overall computation. Weights that are used by the same
neuron were grouped together and implemented as lookup tables in the on-chip block
RAMs. The weights are represented as 17-bit two's complement numbers.
Specifically, for the 10x10 regions 4 weight memories were created, each with 100
positions; for the 5x5 regions, 16 weight memories, each with 25 positions; and for the
5x20 regions, 6 weight memories, each with 100 positions. The 10x10 and 5x20 region
weight memories occupy 213 B each, and the 5x5 weight memories occupy 54 B each.
The separation of memories helped avoid conflicts, as each memory is accessed
independently according to the segment of the 20x20 window in which the pixel value is
located.
The thresholds for all neurons, together with the weights for the hidden layer neurons
and the output neuron, were converted to two's complement numbers and hardwired in
the system to save memory. The binary representations of the weights and thresholds for
all the neurons are shown in Table 4.2.
Table 4.2: Binary representation of the thresholds and weights.
  Thresholds for input layer neurons:             35 bits total (19-bit integer, 16-bit fractional)
  Thresholds for hidden and output layer neurons: 35 bits total (5-bit integer, 30-bit fractional)
  Weights:                                        17 bits total (1-bit integer, 16-bit fractional)
4.4.2.2. Input Layer Neurons
Each pixel in the 20x20 window is assigned x and y coordinates when it is read for
processing. The x and y coordinates represent its position in the 20x20 window. Using
them, the segment to which the pixel belongs can be identified, and therefore which
accumulator needs to be enabled in each of the three stages of the input layer.
The 10x10 stage of the input layer consists of one multiplier and 4 accumulators that
emulate the behaviour of the 4 neurons. The inputs to the multiplier are an 8-bit pixel
value and a 17-bit weight value. The multiplication result is a 35-bit number, with the 16
least significant bits being the fractional part and the other 19 bits the integer part. The
accumulators take this 35-bit number and add it to their current sum. The binary
representation of the multiplier and accumulator results is illustrated in Table 4.3. Once
the accumulation process is complete, the result is sent to the 10x10 input layer
controller. Each accumulator has a counter that lets it stop once all the necessary
accumulations have been performed and inform the system that its result is ready for
further processing. The 5x5 stage follows the same philosophy.
Table 4.3: Binary representation of the multiplier and accumulator results for the input layer neurons.
  Multiplier results:   35 bits total (19-bit integer, 16-bit fractional)
  Accumulator results:  35 bits total (19-bit integer, 16-bit fractional)
The implementation method for the 5x20 regions needed to change because of the
presence of overlapping segments. For the pixel values in overlapping areas, two
multiplications with different weights and two accumulations take place at the same
time. Figure 4.6 shows the architecture of the three region types.
(a)
(b)
(c)
Figure 4.6: (a) 10x10 region neurons (b) 5x5 region neurons (c) 5x20 region neurons
4.4.2.3. Activation Function
The activation function used is a hyperbolic tangent (Figure 4.7). Its input is the
output of the input layer neurons. The hyperbolic tangent unit was implemented as a
lookup table in an on-chip block RAM. The hyperbolic tangent function produces values
between [-1, 1]. An important property of the hyperbolic tangent is that it is an odd
function: its value at a number z is the negative of its value at -z, as illustrated in
Equation 4.2. So it is convenient to store only the values produced by positive numbers.
Hence, if the output of a neuron is negative, all that needs to be done is to take the
absolute value of that number and send it to the hyperbolic tangent lookup table, while
the sign of the number is propagated to the output of the hyperbolic tangent. If the sign is
negative, the hyperbolic tangent output is converted to its negative counterpart using the
two's complement method; otherwise the number remains as is. By exploiting this
property of the hyperbolic tangent, the storage needed for the hyperbolic tangent values
is cut in half.
tanh(z) = -tanh(-z)
tanh(-z) = -tanh(z)
Equation 4.2: Hyperbolic tangent property
Figure 4.7: Hyperbolic tangent used in the implementation, the inputs are numbers from -8 to 8 with
a step of 0.0625
The hyperbolic tangent unit consists of 128 positions representing inputs from 0 to 8
with a step of 0.0625. In each position the respective output for that input is stored as a
15-bit number in two's complement form. The memory required for one hyperbolic
tangent unit is 240 B. The binary representation of the hyperbolic tangent values is
shown in Table 4.4.
Table 4.4: Binary representation of the hyperbolic tangent values.
  Hyperbolic tangent values: 15 bits total (1-bit integer, 14-bit fractional)
There are three hyperbolic tangent units in the neural network. In the first stage, each
one produces the hyperbolic tangent values for one of the three region types of the input
layer. In the next stage, the first hyperbolic tangent unit produces the hyperbolic tangent
output for the hidden layer neurons, because it is no longer occupied by the 10x10 region
neurons. Furthermore, the three hidden layer neurons do not finish their operation at the
same time, so it is convenient to use one hyperbolic tangent unit for all three.
In all cases mentioned above, the input to the hyperbolic tangent unit is a 7-bit part of
the neuron's output, consisting of the lower three bits of the integer part and the four
most significant bits of the fractional part. The reason for this is that for inputs in [0, 8]
the hyperbolic tangent returns distinct values, but for inputs larger than 8 it effectively
returns 1. So before the output of a neuron is passed to the hyperbolic tangent, the result
is checked to verify that it is less than 8. If it is, the 7-bit part is extracted and sent to the
hyperbolic tangent unit; otherwise the hyperbolic tangent unit outputs the value 1.
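The lookup-table behaviour described above can be modelled in Python. The 128 entries, the 0.0625 step, the odd-symmetry trick and the saturation at |z| >= 8 follow the text, while the 15-bit quantization of the stored values is simplified to plain floats:

```python
import math

# 128-entry table covering inputs 0, 0.0625, ..., 7.9375.
STEP = 0.0625
TANH_LUT = [math.tanh(i * STEP) for i in range(128)]

def tanh_lookup(z):
    """Approximate tanh(z) with the LUT, exploiting tanh(-z) = -tanh(z)
    and saturating to +/-1 for |z| >= 8."""
    sign = -1.0 if z < 0 else 1.0
    mag = abs(z)
    if mag >= 8:
        return sign * 1.0
    index = int(mag / STEP)        # truncate to the 7-bit address
    return sign * TANH_LUT[index]
```

Storing only the non-negative half of the curve is what cuts the table from 256 to 128 entries.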
4.4.2.4. Controller Units
Four controllers were required to control the data flow between the shared
subtractors and hyperbolic tangent units. Sharing these units is possible because no two
neurons in the same region finish their accumulations at the same time. Three controllers
control the flow for the three region types of the input layer, one controller per region
type, and one controls the three neurons of the hidden layer. The controllers act as enable
units and multiplexers.
When an input layer neuron has finished accumulating, it informs its respective
controller via a ready signal. The controller then sends the neuron's result to the
subtractor along with the respective threshold, and the subtractor subtracts the threshold
from the accumulation result. The controller then sends the appropriate bits of the
subtraction result as an address to the hyperbolic tangent unit and initiates the load from
the hyperbolic tangent memory. The hyperbolic tangent output is sent to the respective
hidden layer neuron, where the multiplication with the weight is performed and the result
is accumulated.
The hidden layer controller performs the same operation for the three hidden layer
neurons. It sends each hidden layer neuron's result to a subtractor, the subtraction result
is sent to the hyperbolic tangent unit, and the output is given to the output neuron for its
multiplication and accumulation processes. Additionally, the hidden layer controller
controls a multiplexer that determines the input to the first hyperbolic tangent unit. This
is required in order to reuse that hyperbolic tangent unit for the hidden layer neurons
once the input layer neurons have finished processing.
4.4.2.5. Hidden Layer Neurons
The hidden layer neurons each consist of one multiplier and one accumulator. They
take as input the values returned by the hyperbolic tangent units, multiply these values
by the appropriate weights and accumulate the multiplication results.
The inputs to the multiplier are a 15-bit hyperbolic tangent value and a 17-bit weight
value. The multiplication result is a 35-bit number, with the 30 least significant bits
being the fractional part and the other 5 bits the integer part. The accumulator takes this
35-bit number and adds it to its current sum. Once the accumulation process is complete,
the result is passed to the hidden layer controller. The binary representation of the
accumulator and multiplier results is illustrated in Table 4.5.
The three neurons share the same activation function unit. This is possible because
the neuron following the 10x10 regions performs 4 accumulations, the neuron following
the 5x5 regions 16 accumulations and the neuron following the 5x20 regions 6
accumulations. So the activation function can be used first by the 10x10 hidden neuron,
then by the 5x20 hidden neuron and finally by the 5x5 hidden neuron, without any
conflicts. Figure 4.8 shows a model of the hidden layer neurons.
Figure 4.8: Implementation of a hidden layer neuron
4.4.2.6. Output Layer Neuron
The final output neuron determines the face detection system output. It consists of
one multiplier and one accumulator. The neuron takes as inputs a 15-bit hyperbolic
tangent value and a 17-bit weight value. The inputs are multiplied, producing a 35-bit
result with the 30 least significant bits being the fractional part and the remaining 5 bits
the integer part. The binary representation of the accumulator and multiplier results is
illustrated in Table 4.5. The multiplication result is then sent to the accumulator for the
accumulation process. Once the accumulation process is complete, the accumulator
subtracts its threshold value from the total sum. If the result of the output
neuron is greater than or equal to 0, the 20x20 window is classified as a face; otherwise it is
classified as a non-face. Figure 4.9 shows the implementation of the output neuron.
Figure 4.9: Implementation of the output neuron
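The final decision rule is simple enough to state directly in code. This is a sketch of the subtract-and-compare step described above, not the RTL itself; the values are in the 35-bit fixed-point format of Table 4.5.

```cpp
#include <cassert>
#include <cstdint>

// Output neuron decision: subtract the threshold from the accumulated sum
// and classify on the sign of the result (>= 0 -> face, < 0 -> non-face).
bool classify_window(int64_t accumulated_sum, int64_t threshold) {
    return (accumulated_sum - threshold) >= 0;
}
```

In the hardware the same test reduces to inspecting the sign bit (bit 34) of the subtraction result.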
Table 4.5: Binary representation of the multiplier and accumulator results for the
hidden and output layer neurons

Multiplier results:  35 bits total (5-bit integer part, 30-bit fractional part)
Accumulator results: 35 bits total (5-bit integer part, 30-bit fractional part)
4.5. Overall System architecture
The operation of the face detection system proceeds in two stages. In the first stage
each 20x20 window pixel is read and processed by the input layer of the neural network.
This includes generating the addresses to read a 20x20 window from the 320x240
memory, reading the weight value that corresponds to each pixel from the proper weight
memory, enabling the respective accumulator according to the pixel's x and y window
coordinates and, for each input layer neuron that completes its operation, sending its
result to the activation function. In the second stage the hidden layer neurons and output
layer neuron finish their operations. The two stages are repeated until all the windows
have been read and processed. An FSM diagram illustrating the face detection system
operation is shown in Figure 4.11.
The first stage requires 3 cycles for each window pixel to be read and processed. In
the first cycle the addresses needed for the 320x240 image memory and the weight
memory are generated. In the second cycle the pixel and weight values are loaded from
the memory units and multiplied. In the third cycle the accumulation takes place. Since a
window contains 400 pixels, the first stage takes 1,200 clock cycles to complete.
In the second stage, the result of each input layer neuron that completes its operation
requires two cycles to be multiplied and accumulated by the respective hidden layer
neuron. In the first cycle the hyperbolic tangent memory is accessed, and in the second
the hyperbolic tangent output is multiplied with the weight value and accumulated. The
same procedure is followed for the output layer neuron. The second stage requires 66
clock cycles to complete.
Figure 4.10: Face detection system architecture
To summarize, 2,745 windows are generated for the 320x240 frame, each requiring
1,266 cycles, so a frame is finished in 3,475,170 cycles, corresponding to 34 ms. Thirty
frames therefore take approximately 1.02 seconds to complete, so the goal set for real
time video processing has been achieved. The architecture of the face detection system
is illustrated in Figure 4.10. The memory requirements for the proposed face detection
system are shown in Table 4.6.
Unit Memory
Image Pyramid generation 204.500 KB
Hyperbolic tangent memories 2.160 KB
Weights memories 2.994 KB
Memories required for presentation 1.200 KB
Total: 210.854 KB
Table 4.6: Memory requirements when the Image pyramid generation unit is integrated and three neural networks are used.
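The window and cycle counts of section 4.5 can be checked with a few lines of arithmetic: a 20x20 window slid with a 5-pixel step over a 320x240 frame yields 61 x 45 = 2,745 positions, and each window costs 1,200 cycles in the first stage plus 66 in the second.

```cpp
#include <cassert>

// Number of window positions when a win x win window is slid over a
// frame_w x frame_h frame with the given pixel step.
int window_count(int frame_w, int frame_h, int win, int step) {
    return ((frame_w - win) / step + 1) * ((frame_h - win) / step + 1);
}

// Total cycles for one frame: every window pays both stage costs.
long total_cycles(int windows, int stage1, int stage2) {
    return static_cast<long>(windows) * (stage1 + stage2);
}
```

At the resulting 3,475,170 cycles per frame, 34 ms per frame implies a clock of roughly 100 MHz.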
[Figure content: FSM with three states. From "System idle", "Start" moves to the first
state, which enables the window generation unit (start generating the window) and
enables the neural network. On "Finished reading window" the FSM moves to the second
state, where it waits for the neural network to finish processing while the window
generation unit is on hold. On "Window processing finished" it returns to window
generation; on "Finished processing the last window" it returns to "System idle". A press
of the reset button resets the system.]
Figure 4.11: FSM illustrating the face detection system operation
Chapter 5: Experimental methodology
5.1. Experimental strategy
The face detection system was synthesized using Xilinx ISE 9.1i; the synthesis
report is appended in Appendix A. Debugging and error correction were done using
the ModelSim XE III 6.2c simulator in the early stages of the implementation, and later by
viewing the results on a VGA monitor to test the actual hardware. Simulation waveforms
are appended in Appendix A.
To test the correctness of the implemented hardware face detection system, it was
necessary to write a software equivalent of the face detection system and compare the
results. The software was written in C++, since in software it is easier to find errors in
the algorithm implementation and in the mathematical operations used. After the
software was verified, the results of the two systems were compared. Additionally, the
C++ software was extended to take a 320x240 frame as input and output the faces
found in the frame.
The hardware and software implementations of the face detection system were given
the same set of 20x20 frames, 15 faces and 15 non-faces, drawn from a constructed
database of various faces, in order to compare their results. The multiplication,
accumulation and tangent function results of the two systems were compared to verify
that the hardware implementation was correct. The results matched, and the only slight
differences were due to the loss of accuracy when representing real numbers in binary
format.
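The accuracy loss mentioned above can be demonstrated by rounding a real number to a fixed-point grid; the 16 fractional bits used here are an arbitrary illustration, and the resulting error is bounded by one unit in the last place.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Round a real value to the nearest multiple of 2^-frac_bits, the way a
// fixed-point hardware representation would store it.
double quantize(double v, int frac_bits) {
    const double scale = static_cast<double>(1LL << frac_bits);
    return std::round(v * scale) / scale;
}
```

Values exactly representable in the format (such as 0.5) survive unchanged; all others move by at most 2^-17 for a 16-fractional-bit format, which explains the small mismatches between the two implementations.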
In addition, the two implementations were given the same 320x240 frames to operate
on. This made it possible to check the same 320x240 frame at each window position and
make sure the two systems gave the same result, which made it easier to test the
hardware under working conditions. Again the two systems produced the same results.
5.2. Experimental Setup
The setup of the system is illustrated in Figure 5.1 and Figure 5.2. It consists of the
FPGA board and a VGA monitor on which the image and the results are viewed.
Figure 5.1: Experimental setup. The system consists of the FPGA board and a VGA monitor.
Figure 5.2: Output of the system.
5.3. Results and discussion
The faces and non faces of Table 5.1 were given as input to the neural network and it
classified each input frame as illustrated in the following table. The majority of faces
were classified as non faces which yield a high number of false negatives. Only one non
face from the non face test set was classified as a face. Table 5.2 shows some of the
320x240 images that were given to the system as input. In this case the system found the
majority of faces in all four images but also found a lot of false positives.
The conclusion from these tests is that the weights and thresholds produce many
false positives and false negatives, and so are not suitable for the generalized task of
face detection, as they were produced from a data set that was too specific. But since
they were only used to design the neural network prototype, it is possible to replace
them with better ones and produce better results. The addition of a pre-processing stage
may also reduce false detections. Another conclusion drawn from these tests is that the
system is very sensitive to small differences in the window frame and thus produces
many false positives.
Sample 20x20 images

Sample   Face image classified as   Non-face image classified as
1        Face                       Face
2        Face                       Non-face
3        Non-face                   Non-face
4        Non-face                   Non-face
5        Non-face                   Non-face
6        Non-face                   Non-face
7        Non-face                   Non-face
8        Face                       Non-face
9        Non-face                   Non-face
10       Non-face                   Non-face
11       Non-face                   Non-face
12       Non-face                   Non-face
13       Non-face                   Non-face
14       Face                       Non-face
15       Non-face                   Non-face

Table 5.1
Sample 320x240 images

Input image   Result
1             Found 4 out of 6 faces
2             Found 13 out of 17 faces
3             Found 15 out of 15 faces
4             Found 6 out of 7 faces

Table 5.2
Chapter 6: Discussion - Future Work and Improvement
6.1. Conclusions and Discussion
During the course of this project many issues needed to be addressed. First of all it
was necessary to decide the exact structure of the neural network, its processing
throughput, and the resource sharing of the network units. Next the binary
representations and the precision of the negative and fractional numbers needed to be
determined. After this the memory requirements for the weights had to be addressed,
with respect to the sharing of these memories between neurons and the parallelism that
this sharing would allow. Additionally, a decision had to be made on how the hyperbolic
tangent would be implemented: either as a unit that actually calculates the hyperbolic
tangent value, or as a lookup table in which these values are stored. Furthermore, the
implementation was heavily driven by what resources the FPGA had to offer and how
their utilization allows for further improvement of the system. All these issues have been
addressed and solved in the manner described in Chapter 3, and the project was
implemented successfully.
As noted in section 5.3, the system produced a high number of false positives and
false negatives. In order to balance them, the threshold of the output neuron was altered
to see how the system's behaviour would change. First the threshold was decreased in
an attempt to reduce the false negatives. As a result more windows were classified as
faces, and the number of false positives increased. Next the threshold value was
increased in order to decrease the number of false positives. The number of windows
classified as faces then decreased dramatically, but the correct detections decreased as
well. Hence this solution did not prove to be a good one, and other methods need to be
explored to improve and balance the system's performance.
The throughput of the system is approximately 30 frames per second, as mentioned
in section 4.5, so the implemented design is capable of real time video face detection.
The issue with the number of false positives and false negatives that the system
produces can be fixed by utilizing new and better weights.

The problems that arose during the implementation of the project mostly had to do
with the complexity of the system and the lack of experience in implementing such a
large design.

Through the process of implementing this project I gained valuable experience in
hardware design that will prove useful in the future. I also became familiar with the
paradigm of neural networks and the benefits they offer.
6.2. Improvement
There is a lot of future work that could be done to improve and advance this project.
The first necessary improvement is the completion of the project: at this point the image
pyramid generation (IPG) unit and the neural network (NN) unit are two separate
designs, and for the system to be complete they must be integrated into the same
system. After this is done, the neural network needs to be retrained on a much larger
and more varied database, because the given weights and threshold values did not
demonstrate the best results in terms of accuracy and correctness. This will improve the
detection performance of the system. After the system is complete, the next step is to
change the input of the design from one static 320x240 frame to a video stream from a
digital camera. This will allow for real time face detection.
The output images shown in Table 5.2 have many indications for the same face,
which can lead to the processing of window images that contain the same face. A way
around this problem is to introduce a filtering mechanism that tracks and eliminates
redundant indications of the same face. A vast number of square indicators would thus
be eliminated, reducing the number of false positive and false negative classifications.
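One possible form of the proposed filtering mechanism is sketched below. Treating two detections as duplicates when their top-left corners lie within a fixed distance of each other is an illustrative heuristic, not a design taken from the thesis.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

struct Detection { int x, y; };  // top-left corner of a 20x20 window

// Keep only the first detection in each cluster of nearby window positions;
// any detection within `dist` pixels of an already kept one is discarded.
std::vector<Detection> merge_detections(const std::vector<Detection>& dets,
                                        int dist) {
    std::vector<Detection> kept;
    for (const Detection& d : dets) {
        bool duplicate = false;
        for (const Detection& k : kept)
            if (std::abs(d.x - k.x) <= dist && std::abs(d.y - k.y) <= dist)
                duplicate = true;
        if (!duplicate) kept.push_back(d);
    }
    return kept;
}
```

A refinement would be to require a minimum number of overlapping detections before keeping a cluster, which also suppresses isolated false positives.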
6.3. Future Work
The image pyramid generation process can be changed from the nearest neighbour
approach to bilinear interpolation. Using bilinear interpolation better retains the quality of
the downscaled images, thus improving the chances of correct detection by the neural
network.
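Bilinear interpolation blends the four nearest input pixels instead of copying the single nearest one. A minimal sketch of the sampling step, on a grayscale image stored as rows of doubles:

```cpp
#include <cassert>
#include <vector>

// Sample the image at a fractional coordinate (x, y) by bilinear blending of
// the four surrounding pixels. Caller must keep x, y inside [0, w-1) x [0, h-1).
double bilinear(const std::vector<std::vector<double>>& img, double x, double y) {
    int x0 = static_cast<int>(x), y0 = static_cast<int>(y);
    int x1 = x0 + 1, y1 = y0 + 1;
    double fx = x - x0, fy = y - y0;  // fractional offsets within the cell
    return img[y0][x0] * (1 - fx) * (1 - fy) + img[y0][x1] * fx * (1 - fy)
         + img[y1][x0] * (1 - fx) * fy + img[y1][x1] * fx * fy;
}
```

Downscaling then samples this function at the scaled coordinates of each output pixel, which averages detail rather than discarding it as nearest neighbour does.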
Furthermore, a pre-processing stage can be introduced that receives the 20x20
image window and improves its quality via histogram equalization, in order to improve
the chances of accurate detection by the neural network. Such a stage can also correct
the image's lighting conditions to eliminate environmental effects in the image.
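Histogram equalization remaps 8-bit pixel values through the cumulative histogram so that their distribution spreads over the full 0-255 range. A minimal sketch of the proposed pre-processing step:

```cpp
#include <cassert>
#include <vector>

// Equalize 8-bit pixel values: each pixel is mapped to its cumulative
// histogram rank scaled to 0-255.
std::vector<int> equalize(const std::vector<int>& pixels) {
    int hist[256] = {0};
    for (int p : pixels) ++hist[p];
    int cdf[256];
    int running = 0;
    for (int v = 0; v < 256; ++v) { running += hist[v]; cdf[v] = running; }
    int n = static_cast<int>(pixels.size());
    std::vector<int> out;
    for (int p : pixels)
        out.push_back((cdf[p] * 255) / n);
    return out;
}
```

Because the mapping depends only on each window's own histogram, it compensates for global lighting differences between windows without any training.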
The problem of balancing the number of false positives and false negatives still
needs to be dealt with. A more orthodox approach than the one mentioned in section 6.1
would be to re-examine the network: specifically, to examine the input layer networks,
find which one has the biggest effect on the system's outcome in the cases of false
positive and false negative detections, and focus on changing that particular input layer
network.
More neural networks can be integrated into the system so that the frames created
can be processed in parallel. Also, the neural network implementation could be revised
so that it is not so dependent on the system clock cycles, which would improve its speed
and reduce the total area of the system.
The neural network can be combined with other algorithms to improve the system's
accuracy. For example, after a window has been classified as a face by the neural
network, it could be compared with a series of templates to verify that it really is a face.
Another possibility would be to place three neural networks trained with different
training sets in the same system. The three networks would process the same window,
and voting on their results would determine the final outcome.
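The proposed three-network voting scheme reduces to a majority function over the three verdicts; a trivial sketch:

```cpp
#include <cassert>

// Majority vote over three face/non-face verdicts: the window is declared a
// face when at least two of the three networks agree it is one.
bool majority_vote(bool a, bool b, bool c) {
    return (a && b) || (a && c) || (b && c);
}
```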
Other possible improvements are to change the neural network's structure and add
more neurons, to increase the capability and accuracy of the network and make it more
reliable. Also, the regions that are not processed due to the window 5-pixel overlap can
be classified by interpolating the classification results of their processed neighbours,
thus determining a result for the non-processed regions.
Appendix A
Part of the synthesis report produced by Xilinx ISE 9.1i
Advanced HDL Synthesis Report

Macro Statistics
# FSMs : 1
# Multipliers : 15
 10x5-bit multiplier : 1
 10x9-bit multiplier : 1
 5x3-bit multiplier : 1
 5x4-bit multiplier : 1
 5x5-bit multiplier : 1
 6x4-bit multiplier : 3
 6x5-bit multiplier : 1
 6x6-bit multiplier : 5
 8x10-bit multiplier : 1
# Adders/Subtractors : 82
 10-bit adder : 2
 11-bit subtractor : 1
 12-bit subtractor : 1
 15-bit adder : 4
 17-bit adder : 2
 3-bit adder : 4
 35-bit adder : 4
 35-bit subtractor : 6
 5-bit adder : 1
 5-bit subtractor : 1
 6-bit subtractor : 7
 7-bit adder : 33
 7-bit subtractor : 1
 8-bit adder : 1
 8-bit adder carry out : 3
 8-bit subtractor : 1
 9-bit adder : 4
 9-bit adder carry out : 3
 9-bit subtractor : 3
# Counters : 66
 12-bit up counter : 1
 3-bit up counter : 27
 30-bit up counter : 1
 4-bit up counter : 1
 5-bit up counter : 3
 7-bit up counter : 30
 8-bit up counter : 1
 9-bit up counter : 2
# Accumulators : 31
 35-bit up accumulator : 30
 8-bit up accumulator : 1
# Registers : 330
 Flip-Flops : 330
# Latches : 7
 1-bit latch : 3
 8-bit latch : 3
 9-bit latch : 1
# Comparators : 84
 10-bit comparator equal : 3
 10-bit comparator greatequal : 9
 10-bit comparator greater : 4
 10-bit comparator less : 10
 10-bit comparator lessequal : 10
 10-bit comparator not equal : 1
 11-bit comparator equal : 1
 12-bit comparator equal : 1
 5-bit comparator less : 5
 5-bit comparator lessequal : 6
 7-bit comparator equal : 29
 8-bit comparator greatequal : 1
 9-bit comparator equal : 1
 9-bit comparator greatequal : 1
 9-bit comparator greater : 1
 9-bit comparator not equal : 1
# Multiplexers : 14
 1-bit 4-to-1 multiplexer : 4
 35-bit 4-to-1 multiplexer : 10

=========================================================================
*                            Final Report                               *
=========================================================================
Final Results
RTL Top Level Output File Name : Face_detector.ngr
Top Level Output File Name : Face_detector
Output Format : NGC
Optimization Goal : Speed
Keep Hierarchy : NO

Design Statistics
# IOs : 43

Cell Usage:
# BELS : 9026
 # GND : 42
 # INV : 343
 # LUT1 : 254
 # LUT2 : 1177
 # LUT3 : 857
 # LUT3_D : 13
 # LUT3_L : 1
 # LUT4 : 2286
 # LUT4_D : 4
 # LUT4_L : 5
 # MULT_AND : 1
 # MUXCY : 1998
 # MUXF5 : 269
 # MUXF6 : 32
 # MUXF7 : 16
 # VCC : 34
 # XORCY : 1694
# FlipFlops/Latches : 1793
 # FD_1 : 20
 # FDC : 144
 # FDCE : 1383
 # FDCPE : 34
 # FDE : 96
 # FDE_1 : 20
 # FDP : 1
 # FDR : 10
 # FDR_1 : 18
 # FDRE : 30
 # FDRS : 1
 # LD : 2
 # LDC : 1
 # LDC_1 : 4
 # LDCP : 16
 # LDCPE_1 : 9
 # LDP_1 : 4
# RAMS : 70
 # RAMB16_S18 : 29
 # RAMB16_S4_S4 : 38
 # RAMB16_S9 : 2
 # RAMB16_S9_S9 : 1
# Clock Buffers : 8
 # BUFG : 7
 # BUFGP : 1
# IO Buffers : 42
 # IBUF : 9
 # OBUF : 33
# MULTs : 17
 # MULT18X18 : 17
=========================================================================

Device utilization summary:
---------------------------
Selected Device : 2vp30ff896-7
Number of Slices: 2647 out of 13696 (19%)
Number of Slice Flip Flops: 1793 out of 27392 (6%)
Number of 4 input LUTs: 4940 out of 27392 (18%)
Number of IOs: 43
Number of bonded IOBs: 43 out of 556 (7%)
Number of BRAMs: 70 out of 136 (51%)
Number of MULT18X18s: 17 out of 136 (12%)
Number of GCLKs: 8 out of 16 (50%)
Modelsim Simulation Waveforms
Figure A1: Region 1 neurons accumulation operation. The first two neurons have finished
accumulation. The last two are still accumulating.
Figure A2: Region 2 neurons accumulation operation. The first eleven neurons have finished
accumulation. The last five are still accumulating.
Figure A3: Region 3 neurons accumulation operation. The first four neurons have finished
accumulation. The last two are still accumulating.
Figure A4: Order in which the accumulation operation ends at each neuron.
Figure A5: Illustration of the neural network's final output. First the face_or_noface signal becomes 1,
which represents a non-face, and then the ready signal is asserted to inform the other units to send the next window.
Appendix B

Face Detector top level implementation Verilog code

module Face_detector(show, switches, systemClock, leds, up, right, left, down,
    vSync, hSync, Sync, blank, redOUT, greenOUT, blueOUT, clock25Mhz);

////////////////////////
// General Signals
////////////////////////
input show;           // Negative Logic
input[3:0] switches;  // Negative Logic
input systemClock;
wire address_clock;
wire mem_clock;
wire mem_clock_vga;
wire acc_clock;
output[3:0] leds;

/////////////////
// Reset signals
/////////////////
wire reset_window_gen;
wire reset_NN;
wire reset_enable_unit;
wire reset_clock_generator;
wire reset_addr_buffer;

//////////////////////////
// VGA Signals
//////////////////////////
input up, right, left, down;  // Negative Logic
wire[9:0] col_counter;
wire[9:0] row_counter;
wire[7:0] REDvalue, GREENvalue, BLUEvalue;
output wire clock25Mhz;
output vSync, hSync, Sync, blank;
output[7:0] redOUT, greenOUT, blueOUT;

////////////////////////////
// 10x10 neurons signals
////////////////////////////
wire[6:0] address_weight_mem_10x10;
wire[3:0] enable_10x10;
wire[16:0] weight_value_10x10;

///////////////////////
// 5x5 neurons signals
///////////////////////
wire[4:0] address_weight_mem_5x5;
wire[15:0] enable_5x5;
wire[16:0] weight_value_5x5;

/////////////////////////
// 20x5 neurons signals
/////////////////////////
wire[6:0] address_weight_mem_20x5_1;
wire[6:0] address_weight_mem_20x5_2;
wire[5:0] enable_20x5;
wire[16:0] weight_value_20x5_1;
wire[16:0] weight_value_20x5_2;

//////////////////////
// Hidden 2 neuron
//////////////////////
wire face_or_noface_320x240;
wire ready_NN_320x240;

/////////////////////////////////////////////
// Signals for window creation and viewing
/////////////////////////////////////////////
wire[8:0] base_column_joy;
wire[7:0] base_row_joy;
wire[8:0] base_column_rast_scan;
wire[7:0] base_row_rast_scan;
wire[8:0] base_column;
wire[7:0] base_row;
wire[8:0] address20x20;
wire[16:0] address320x240;
wire[8:0] w_address20x20;
wire[16:0] read_address320x240;
wire[7:0] pixel_value_20x20;
wire[7:0] pixel_value_320x240;
wire[7:0] pixel_value_face;
wire[7:0] pixel_value_noface;
wire[8:0] address_classification_image;
wire[7:0] read_pixel_value_320x240;
wire[7:0] doutb;
wire we;
wire stand_by;  // to leds
wire next_window;
wire[11:0] number_of_faces;

//////////////////////////////////////////////////////////
// Modules required for system monitoring and control
//////////////////////////////////////////////////////////

// Generates the clocks for the memories, the multipliers and the accumulators
system_clock_generator scg(reset_clock_generator, systemClock, address_clock,
    mem_clock_vga, mem_clock, acc_clock);

push_button_modifier pbm_right(systemClock, ~right, ~up, ~down, ~left,
    next_window);

timer time_track(systemClock, ready_NN_320x240, time_out);

// Speed at which the face detector works
mode_mux mode_m(~switches[2], time_out, next_window, mode);

enable_view view_pic(reset_NN, show, view);

controller ctrl(~switches[3], systemClock, done_window, done_frame, mode,
    enable_win_gen, stand_by, reset_NN, reset_enable_unit,
    reset_clock_generator, reset_window_gen, mem_clock, reset_addr_buffer,
    we, mode);

// Controller that enables the appropriate memories and accumulators
// according to the created pixel window id
controller_column_row_gen enable_unit(reset_enable_unit, address_clock,
    enable_10x10, enable_5x5, enable_20x5, address_weight_mem_10x10,
    address_weight_mem_5x5, address_weight_mem_20x5_1,
    address_weight_mem_20x5_2);

////////////////////////////////////////////////////
// Modules required to operate the Neural Network
////////////////////////////////////////////////////

// Contains the weights for the 10x10 neurons
weights_memory_module_10x10 mem10x10(address_weight_mem_10x10, enable_10x10,
    mem_clock, weight_value_10x10);

// Contains the weights for the 5x5 neurons
weights_memory_module_5x5 mem5x5(address_weight_mem_5x5, enable_5x5,
    mem_clock, weight_value_5x5);

// Contains the weights for the 20x5 neurons
weights_memory_module_20x5 mem20x5(address_weight_mem_20x5_1,
    address_weight_mem_20x5_2, enable_20x5, mem_clock,
    weight_value_20x5_1, weight_value_20x5_2);

// Neural Network
// Reads values from the 320x240 memory
neural_network NN(reset_NN, systemClock, enable_10x10, enable_5x5,
    enable_20x5, read_pixel_value_320x240, acc_clock, weight_value_10x10,
    weight_value_5x5, weight_value_20x5_1, weight_value_20x5_2,
    ready_NN_320x240, face_or_noface_320x240);

face_counter fac_count(~switches[3], done_frame, ready_NN_320x240, mode,
    face_or_noface_320x240, number_of_faces);

/////////////////////////////////////////////////////////////////
// Modules responsible for data extraction and image showing
/////////////////////////////////////////////////////////////////

// VGA clock divider
VGAclockDivider clock(systemClock, clock25Mhz);

// VGA controller
vga_controller vga(REDvalue, GREENvalue, BLUEvalue, clock25Mhz, vSync, hSync,
    Sync, blank, redOUT, greenOUT, blueOUT, col_counter, row_counter);

navigation_module nav_unit(~switches[3], ready_NN_320x240, ~up, ~right,
    ~left, ~down, base_column_joy, base_row_joy);
window_generator gen(~switches[3], ready_NN_320x240, base_column_rast_scan,
    base_row_rast_scan);

br_bc_mux bcbr(~switches[2], base_column_rast_scan, base_row_rast_scan,
    base_column_joy, base_row_joy, base_column, base_row);

new_window_generator new_gen(reset_window_gen, enable_win_gen, base_column,
    base_row, address_clock, read_address320x240, done_window, done_frame,
    mode);

address_to_write addr_w(reset_addr_buffer, address_clock, w_address20x20);

// Address to output to VGA
Address addr(col_counter, row_counter, address20x20, address320x240,
    address_classification_image);

Source sour(ready_NN_320x240, mode, view, face_or_noface_320x240,
    pixel_value_20x20, pixel_value_320x240, pixel_value_noface,
    pixel_value_face, col_counter, row_counter, base_column, base_row,
    REDvalue, GREENvalue, BLUEvalue);

//////////////////////////////////////////////
// Images and window buffers
//////////////////////////////////////////////

// Classification images
classification_face face(address_classification_image, systemClock,
    pixel_value_face);
classification_noface noface(address_classification_image, systemClock,
    pixel_value_noface);

// address320x240 -> address for vga
// read_address320x240 -> address by window generator
// read_pixel_value_320x240 -> for NN
//image_320x240 im(address320x240, read_address320x240, systemClock,
//    mem_clock_vga, 8'b00000000, pixel_value_320x240,
//    read_pixel_value_320x240, 1'b0);
//test_faces im(address320x240, read_address320x240, systemClock,
//    mem_clock_vga, 8'b00000000, pixel_value_320x240,
//    read_pixel_value_320x240, 1'b0);
Old_focks im(address320x240, read_address320x240, systemClock,
    mem_clock_vga, 8'b00000000, pixel_value_320x240,
    read_pixel_value_320x240, 1'b0);

window_buffer_20x20 buff(address20x20, w_address20x20, systemClock,
    mem_clock, read_pixel_value_320x240, pixel_value_20x20, doutb, we);
73
//////////////////////////////////////////////
// Output
//////////////////////////////////////////////
LED_unit led_un(~switches[1:0], number_of_faces, stand_by, done_frame, leds);

endmodule


Neural Network top level implementation Verilog code

module neural_network(reset, systemClock, enable_10x10, enable_5x5,
    enable_20x5, pixel_value, acc_clock, weight_value_10x10,
    weight_value_5x5, weight_value_20x5_1, weight_value_20x5_2,
    acc_ready_hidden2, verdict);

//////////////////////////
// General Signals
//////////////////////////
input reset;
input systemClock;
input[7:0] pixel_value;
input acc_clock;

//////////////////////////////
// 10x10 neurons signals
//////////////////////////////
input[3:0] enable_10x10;
input[16:0] weight_value_10x10;
wire[34:0] acc1_final_sum;
wire[34:0] acc2_final_sum;
wire[34:0] acc3_final_sum;
wire[34:0] acc4_final_sum;
wire acc1_ready;
wire acc2_ready;
wire acc3_ready;
wire acc4_ready;

//////////////////////////////
// 5x5 neurons signals
//////////////////////////////
input[15:0] enable_5x5;
input[16:0] weight_value_5x5;
wire[34:0] acc5_final_sum;
wire[34:0] acc6_final_sum;
wire[34:0] acc7_final_sum;
wire[34:0] acc8_final_sum;
wire[34:0] acc9_final_sum;
wire[34:0] acc10_final_sum;
wire[34:0] acc11_final_sum;
wire[34:0] acc12_final_sum;
wire[34:0] acc13_final_sum;
wire[34:0] acc14_final_sum;
wire[34:0] acc15_final_sum;
wire[34:0] acc16_final_sum;
wire[34:0] acc17_final_sum;
wire[34:0] acc18_final_sum;
wire[34:0] acc19_final_sum;
wire[34:0] acc20_final_sum;
wire acc5_ready;
wire acc6_ready;
wire acc7_ready;
wire acc8_ready;
wire acc9_ready;
wire acc10_ready;
wire acc11_ready;
wire acc12_ready;
wire acc13_ready;
wire acc14_ready;
wire acc15_ready;
wire acc16_ready;
wire acc17_ready;
wire acc18_ready;
wire acc19_ready;
wire acc20_ready;

/////////////////////////////////
// 20x5 neurons signals
/////////////////////////////////
input[5:0] enable_20x5;
input[16:0] weight_value_20x5_1;
input[16:0] weight_value_20x5_2;
wire[34:0] acc21_final_sum; wire[34:0] acc22_final_sum; wire[34:0] acc23_final_sum;
wire[34:0] acc24_final_sum; wire[34:0] acc25_final_sum; wire[34:0] acc26_final_sum;
wire acc21_ready; wire acc22_ready; wire acc23_ready;
wire acc24_ready; wire acc25_ready; wire acc26_ready;
////////////////////////////////////
// Hidden 1 - neuron 1 signals 10x10
///////////////////////////////////
wire clock_mem_10x10_to_hidden1;
wire clock_acc_10x10_to_hidden1;
wire[16:0] weight_10x10_to_hidden1;
wire[34:0] acc_final_sum_10x10_to_hidden1; // Hidden 1 - neuron 1 result
wire acc_ready_10x10_to_hidden1;
////////////////////////////////////
// Hidden 1 - neuron 2 signals 5x5
///////////////////////////////////
wire clock_mem_5x5_to_hidden1;
wire clock_acc_5x5_to_hidden1;
wire[16:0] weight_5x5_to_hidden1;
wire[34:0] acc_final_sum_5x5_to_hidden1; // Hidden 1 - neuron 2 result
wire acc_ready_5x5_to_hidden1;
////////////////////////////////////
// Hidden 1 - neuron 3 signals 20x5
///////////////////////////////////
wire clock_mem_20x5_to_hidden1;
wire clock_acc_20x5_to_hidden1;
wire[16:0] weight_20x5_to_hidden1;
wire[34:0] acc_final_sum_20x5_to_hidden1; // Hidden 1 - neuron 3 result
wire acc_ready_20x5_to_hidden1;
////////////////////////////////////
// Hidden 2 - neuron signals
///////////////////////////////////
wire clock_mem_hidden2;
wire clock_acc_hidden2;
wire[16:0] weight_hidden2;
wire[34:0] acc_final_sum_hidden2; // Hidden 2 - neuron result
// if acc_final_sum_hidden2[34] == 1 -> Not a face
// if acc_final_sum_hidden2[34] == 0 -> A face
output acc_ready_hidden2;
output verdict;
//////////////////////////
// Tanh memory signals
/////////////////////////
wire[34:0] tanh_address_10x10_to_hidden1;          // tanh memory address
wire[14:0] tanh_value_10x10_to_hidden1_to_hidden2; // Data
wire[34:0] tanh_address_5x5_to_hidden1;            // tanh memory address
wire[14:0] tanh_value_5x5_to_hidden1;              // Data
wire[34:0] tanh_address_20x5_to_hidden1;           // tanh memory address
wire[14:0] tanh_value_20x5_to_hidden1;             // Data
wire[34:0] tanh_address_hidden2;
/////////////////////////////////////////////////
// Multiplier and accumulators for neuron 10x10
/////////////////////////////////////////////////
neurons_10x10 group1(enable_10x10, acc_clock, reset, pixel_value,
    weight_value_10x10, acc1_final_sum, acc1_ready, acc2_final_sum, acc2_ready,
    acc3_final_sum, acc3_ready, acc4_final_sum, acc4_ready);
/////////////////////////////////////////////////
// Multiplier and accumulators for neuron 5x5
/////////////////////////////////////////////////
neurons_5x5 group2(enable_5x5, acc_clock, reset, pixel_value, weight_value_5x5,
    acc5_final_sum, acc5_ready, acc6_final_sum, acc6_ready,
    acc7_final_sum, acc7_ready, acc8_final_sum, acc8_ready,
    acc9_final_sum, acc9_ready, acc10_final_sum, acc10_ready,
    acc11_final_sum, acc11_ready, acc12_final_sum, acc12_ready,
    acc13_final_sum, acc13_ready, acc14_final_sum, acc14_ready,
    acc15_final_sum, acc15_ready, acc16_final_sum, acc16_ready,
    acc17_final_sum, acc17_ready, acc18_final_sum, acc18_ready,
    acc19_final_sum, acc19_ready, acc20_final_sum, acc20_ready);
//////////////////////////////////////////////////
// Multipliers and accumulators for neuron 20x5
//////////////////////////////////////////////////
neurons_20x5 group3(enable_20x5, acc_clock, reset, pixel_value,
    weight_value_20x5_1, weight_value_20x5_2,
    acc21_final_sum, acc21_ready, acc22_final_sum, acc22_ready,
    acc23_final_sum, acc23_ready, acc24_final_sum, acc24_ready,
    acc25_final_sum, acc25_ready, acc26_final_sum, acc26_ready);
/////////////////////////////////
// 10x10 to hidden1 controller
////////////////////////////////
controller_10x10_to_hidden1 ctrl_1(reset, systemClock,
    clock_mem_10x10_to_hidden1, clock_acc_10x10_to_hidden1,
    weight_10x10_to_hidden1, tanh_address_10x10_to_hidden1,
    acc1_final_sum, acc1_ready, acc2_final_sum, acc2_ready,
    acc3_final_sum, acc3_ready, acc4_final_sum, acc4_ready);
/////////////////////////////////
// 5x5 to hidden1 controller
/////////////////////////////////
controller_5x5_to_hidden1 ctrl_2(reset, systemClock,
    clock_mem_5x5_to_hidden1, clock_acc_5x5_to_hidden1,
    weight_5x5_to_hidden1, tanh_address_5x5_to_hidden1,
    acc5_final_sum, acc5_ready, acc6_final_sum, acc6_ready,
    acc7_final_sum, acc7_ready, acc8_final_sum, acc8_ready,
    acc9_final_sum, acc9_ready, acc10_final_sum, acc10_ready,
    acc11_final_sum, acc11_ready, acc12_final_sum, acc12_ready,
    acc13_final_sum, acc13_ready, acc14_final_sum, acc14_ready,
    acc15_final_sum, acc15_ready, acc16_final_sum, acc16_ready,
    acc17_final_sum, acc17_ready, acc18_final_sum, acc18_ready,
    acc19_final_sum, acc19_ready, acc20_final_sum, acc20_ready);
/////////////////////////////////
// 20x5 to hidden1 controller
/////////////////////////////////
controller_20x5_to_hidden1 ctrl_3(reset, systemClock,
    clock_mem_20x5_to_hidden1, clock_acc_20x5_to_hidden1,
    weight_20x5_to_hidden1, tanh_address_20x5_to_hidden1,
    acc21_final_sum, acc21_ready, acc22_final_sum, acc22_ready,
    acc23_final_sum, acc23_ready, acc24_final_sum, acc24_ready,
    acc25_final_sum, acc25_ready, acc26_final_sum, acc26_ready);
//////////////////
// Tanh Memory
//////////////////
tanh hyperbolic_tangent(tanh_address_10x10_to_hidden1,
    tanh_address_5x5_to_hidden1, tanh_address_20x5_to_hidden1,
    tanh_address_hidden2, acc_ready_20x5_to_hidden1,
    acc_ready_10x10_to_hidden1, clock_mem_10x10_to_hidden1,
    clock_mem_5x5_to_hidden1, clock_mem_20x5_to_hidden1, clock_mem_hidden2,
    tanh_value_10x10_to_hidden1_to_hidden2, tanh_value_5x5_to_hidden1,
    tanh_value_20x5_to_hidden1);
//////////////////////
// Hidden 1 neurons
/////////////////////
hidden1_neuron1 neuron10x10_to_hidden1(reset, clock_acc_10x10_to_hidden1,
    weight_10x10_to_hidden1, tanh_value_10x10_to_hidden1_to_hidden2,
    tanh_address_10x10_to_hidden1[34], acc_final_sum_10x10_to_hidden1,
    acc_ready_10x10_to_hidden1);
hidden1_neuron2 neuron5x5_to_hidden1(reset, clock_acc_5x5_to_hidden1,
    weight_5x5_to_hidden1, tanh_value_5x5_to_hidden1,
    tanh_address_5x5_to_hidden1[34], acc_final_sum_5x5_to_hidden1,
    acc_ready_5x5_to_hidden1);
hidden1_neuron3 neuron20x5_to_hidden1(reset, clock_acc_20x5_to_hidden1,
    weight_20x5_to_hidden1, tanh_value_20x5_to_hidden1,
    tanh_address_20x5_to_hidden1[34], acc_final_sum_20x5_to_hidden1,
    acc_ready_20x5_to_hidden1);
/////////////////////////////////////
// hidden 1 to hidden2 controller
/////////////////////////////////////
controller_hidden1_to_hidden2 ctrl_4(reset, systemClock,
    clock_mem_hidden2, clock_acc_hidden2, weight_hidden2, tanh_address_hidden2,
    acc_final_sum_10x10_to_hidden1, acc_ready_10x10_to_hidden1,
    acc_final_sum_5x5_to_hidden1, acc_ready_5x5_to_hidden1,
    acc_final_sum_20x5_to_hidden1, acc_ready_20x5_to_hidden1);
//////////////////////
// Hidden 2 neurons
/////////////////////
hidden2_neuron hidden2_n1(reset, clock_acc_hidden2, weight_hidden2,
    tanh_value_10x10_to_hidden1_to_hidden2, tanh_address_hidden2[34],
    acc_final_sum_hidden2, acc_ready_hidden2);
classification classify(acc_final_sum_hidden2, verdict);
endmodule
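The signal comments in the module above state that the final decision inspects only bit 34, the sign bit of the 35-bit two's-complement hidden-2 sum (1 means not a face, 0 means a face). A minimal software model of that decision follows; this is illustrative Python, not part of the thesis RTL, and the function name merely mirrors the `classification` module.

```python
# Software model of the final classification stage: the hidden-2 neuron
# produces a 35-bit two's-complement sum, and the classifier checks the
# sign bit (bit 34), per the Verilog comments:
#   acc_final_sum_hidden2[34] == 1 -> not a face
#   acc_final_sum_hidden2[34] == 0 -> a face

def classify(acc_final_sum_hidden2: int) -> bool:
    """Return True (face) when the 35-bit sum is non-negative."""
    sign_bit = (acc_final_sum_hidden2 >> 34) & 1
    return sign_bit == 0

# A positive sum keeps bit 34 clear -> face.
print(classify(0x1FFFF))          # True
# A negative sum sets bit 34 in 35-bit two's complement -> not a face.
print(classify((1 << 35) - 5))    # False (-5 encoded in 35 bits)
```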
New coordinates generator implementation: Verilog code for the coordinate scaling unit of the Image Pyramid generator module.
module new_coordinates(row_320x240, column_320x240, row_260x195, column_260x195,
    row_200x150, column_200x150, row_160x120, column_160x120,
    row_120x90, column_120x90, row_100x75, column_100x75,
    row_80x60, column_80x60, row_60x45, column_60x45,
    row_40x30, column_40x30, row_20x20, column_20x20);
/*
   factor1 = 0.8125  - 0.1101  - 260x195
   factor2 = 0.625   - 0.101   - 200x150
   factor3 = 0.5     - 0.1     - 160x120
   factor4 = 0.375   - 0.011   - 120x90
   factor5 = 0.3125  - 0.0101  - 100x75
   factor6 = 0.25    - 0.01    - 80x60
   factor7 = 0.1875  - 0.0011  - 60x45
   factor8 = 0.125   - 0.001   - 40x30
   factor9 = 0.0625  - 0.0001  &  0.083984375 - 0.000101011 - 20x20
*/
parameter factor1=4'b1101;
parameter factor2=3'b101;
parameter factor4=3'b011;
parameter factor5=4'b0101;
parameter factor7=4'b0011;
parameter factor9_row=9'b000101011;
// 320x240
input[7:0] row_320x240;
input[8:0] column_320x240;
//260x195
output[11:0] row_260x195;     reg[11:0] row_260x195=12'b000000000000;
output[12:0] column_260x195;  reg[12:0] column_260x195=13'b0000000000000;
//200x150
output[10:0] row_200x150;     reg[10:0] row_200x150=11'b00000000000;
output[11:0] column_200x150;  reg[11:0] column_200x150=12'b000000000000;
//160x120
output[6:0] row_160x120;      reg[6:0] row_160x120=7'b0000000;
output[7:0] column_160x120;
reg[7:0] column_160x120=8'b00000000;
//120x90
output[10:0] row_120x90;      reg[10:0] row_120x90=11'b00000000000;
output[11:0] column_120x90;   reg[11:0] column_120x90=12'b000000000000;
//100x75
output[11:0] row_100x75;      reg[11:0] row_100x75=12'b000000000000;
output[12:0] column_100x75;   reg[12:0] column_100x75=13'b0000000000000;
//80x60
output[5:0] row_80x60;        reg[5:0] row_80x60=6'b000000;
output[6:0] column_80x60;     reg[6:0] column_80x60=7'b0000000;
//60x45
output[11:0] row_60x45;       reg[11:0] row_60x45=12'b000000000000;
output[12:0] column_60x45;    reg[12:0] column_60x45=13'b0000000000000;
//40x30
output[4:0] row_40x30;        reg[4:0] row_40x30=5'b00000;
output[5:0] column_40x30;     reg[5:0] column_40x30=6'b000000;
//20x20
output[16:0] row_20x20;       reg[16:0] row_20x20=17'b00000000000000000;
output[4:0] column_20x20;     reg[4:0] column_20x20=5'b00000;
//260x195
always@(row_320x240 or column_320x240)begin
    column_260x195 = column_320x240 * factor1;
    row_260x195 = row_320x240 * factor1;
end
//200x150
always@(row_320x240 or column_320x240)begin
    column_200x150 = column_320x240 * factor2;
    row_200x150 = row_320x240 * factor2;
end
//160x120
always@(row_320x240 or column_320x240)begin
    column_160x120 = column_320x240[8:1];
    row_160x120 = row_320x240[7:1];
end
//120x90
always@(row_320x240 or column_320x240)begin
    column_120x90 = column_320x240 * factor4;
    row_120x90 = row_320x240 * factor4;
end
//100x75
always@(row_320x240 or column_320x240)begin
    column_100x75 = column_320x240 * factor5;
    row_100x75 = row_320x240 * factor5;
end
//80x60
always@(row_320x240 or column_320x240)begin
    column_80x60 = column_320x240[8:2];
    row_80x60 = row_320x240[7:2];
end
//60x45
always@(row_320x240 or column_320x240)begin
    column_60x45 = column_320x240 * factor7;
    row_60x45 = row_320x240 * factor7;
end
//40x30
always@(row_320x240 or column_320x240)begin
    column_40x30 = column_320x240[8:3];
    row_40x30 = row_320x240[7:3];
end
//20x20
always@(row_320x240 or column_320x240)begin
    column_20x20 = column_320x240[8:4];
    row_20x20 = row_320x240 * factor9_row;
end
endmodule
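The factor table in the module's comments maps each pyramid scale to a short binary fraction, so every coordinate remap is a small fixed-point multiply whose binary point is implied by the fraction width, while power-of-two scales (0.5, 0.25, 0.125, 0.0625) reduce to plain bit-slices such as `column_320x240[8:1]`. A quick software check of those factors follows; this is illustrative Python, not the thesis RTL, and the `factors` table and `scale` helper are names introduced only for this sketch.

```python
# Software check of the fixed-point scale factors used by new_coordinates.
# The Verilog stores only the fraction bits of each factor (e.g. 4'b1101
# for 0.1101b = 13/16 = 0.8125); multiplying a coordinate by those bits
# and shifting right by the fraction width gives the scaled coordinate.

factors = {
    "260x195": (0b1101, 4),       # 13/16  = 0.8125
    "200x150": (0b101, 3),        # 5/8    = 0.625
    "120x90":  (0b011, 3),        # 3/8    = 0.375
    "100x75":  (0b0101, 4),       # 5/16   = 0.3125
    "60x45":   (0b0011, 4),       # 3/16   = 0.1875
    "20x20r":  (0b000101011, 9),  # 43/512 = 0.083984375 (row factor)
}

def scale(coord: int, level: str) -> int:
    """Multiply by the stored fraction bits, then drop the fraction part."""
    bits, width = factors[level]
    return (coord * bits) >> width  # integer part of coord * factor

print(scale(320, "260x195"))   # 260: full-width column maps to the 260 frame
print(scale(240, "200x150"))   # 150: full-height row maps to the 150 frame
print(320 >> 1, 240 >> 2)      # 160 60: shift-only levels (0.5 and 0.25)
```

The 20x20 row factor is the odd one out: a pure 0.0625 shift would be exact for the column, but the comment block records that the rows use the finer 9-bit fraction 0.000101011b instead.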