
Smart Camera for Visual Tag Recognition

Ning Lin

M.Sc. Thesis

Supervisors:

dr.ir. Sabih Gerez
ir. Bert Molenkamp
dr.ir. Berend Jan van der Zwaag
ir. Mark Westmijze

September 2010

Summary

Smart cameras are becoming increasingly popular with the advances in both machine vision and semiconductor technology. A smart camera is a camera that integrates image processing hardware. This hardware can be a microprocessor, a digital signal processor, an FPGA or any combination of these. The goal of image processing is usually to detect some activity (e.g. suspect movements in a parking lot), to identify some object, etc. Instead of transmitting photographs or a video stream to a host computer, the smart camera only needs to transmit the result of image processing, such as the identity of a detected object. The goal of this project is to implement typical image processing algorithms on a low-cost smart camera hardware platform.

Visual tags, which are comparable to 2D barcodes, are used all over the world; more than 30 2D barcodes are in use today. In this project, the Target Recognition using Image Processing (TRIP) code is chosen as the target to identify, and the Xilinx ML510 board is chosen as the smart camera hardware platform. The image processing algorithm to recognize the TRIP code is based on Lopez de Ipina's PhD thesis [3]. The ML510 board has a Virtex-5 FX130T FPGA with two embedded PowerPC processors. Part of the image processing algorithm is done in the FPGA to reduce the processing time; the other parts are done by the PowerPC or the MicroBlaze processor. Hardware/software co-design is the main concern in this project.

Eventually, both the PowerPC system and the MicroBlaze system have been successfully implemented with the TRIP code recognition system on the ML510 board. The recognition rate of both systems is as good as that of the Matlab implementation, which is considered the ideal recognition rate of the algorithm. With input images of 640 × 480 pixels, the PowerPC system is able to process about 3 frames per second, and the MicroBlaze system about 0.27 frames per second.

Several methods are recommended for speeding up the system, based on the profiling results. A hardware resource analysis has also been made to find out whether this system can be ported to other, cheaper FPGA boards.


Contents

Summary i

List of Figures v

List of Tables vii

1 Introduction  1
  1.1 Image Processing Hardware  1
  1.2 Visual Tag  1
    1.2.1 TRIP code  1
    1.2.2 Other 2D barcodes  2
    1.2.3 Current Situation of 2D Barcode Use  3
  1.3 Smart Camera Structure  3
  1.4 Hardware Platform  5
  1.5 Challenges of this project  6

2 Image Processing Algorithm  7
  2.1 Adaptive Thresholding  7
    2.1.1 Sauvola's Threshold Algorithm  8
    2.1.2 Using integral images for computing means and variances  8
    2.1.3 Computational complexity  9
    2.1.4 Simpler ways for adaptive threshold  9
    2.1.5 Test Result  9
  2.2 Binary Edge Detection  10
    2.2.1 Computational Complexity and Test Result  11
  2.3 Edge Following and Filtering  12
    2.3.1 Computational Complexity and Test Result  12
  2.4 Ellipse Fitting  13
  2.5 Concentric Test  14
    2.5.1 Computational Complexity  15
  2.6 Code Recognition  15
  2.7 Experiment Result  16

3 Vision on FPGA  17
  3.1 State of the Art  17
  3.2 Convolution operations in FPGA  19
  3.3 Implement low-level processing algorithm for TRIP code  20
    3.3.1 Adaptive Thresholding Implementation  20
    3.3.2 Binary Edge Detection Implementation  22

4 Hardware Implementation  23
  4.1 Introduction  23
  4.2 System Structure  23
  4.3 Master Burst  25
  4.4 Processing Pipeline  26
    4.4.1 RGB to Grayscale Conversion  26
    4.4.2 Integral Image Calculation  26
    4.4.3 Local Mean Calculation  27
    4.4.4 Gray Data FIFO  28
    4.4.5 Adaptive Threshold  29
    4.4.6 Binary Edge Detection  29
    4.4.7 Inter-connection of all blocks  31
  4.5 Summary  31

5 Software Implementation  33
  5.1 Introduction  33
  5.2 Adding Floating Point Unit  33
  5.3 Debugging Methods  34
  5.4 Implementation Issues  34
    5.4.1 Edge Following and Filtering  34
    5.4.2 Ellipse Fitting  34
    5.4.3 Synchronization Sector Detection  35
    5.4.4 Decode and Validate  36

6 Performance Evaluation  37
  6.1 Recognition Rate  37
  6.2 Processing Speed  38
  6.3 Hardware Resource Utilization  40

7 Conclusions and Recommendations  43
  7.1 Conclusions  43
  7.2 Recommendations  44
    7.2.1 For Image Processing IP  44
    7.2.2 For PowerPC System  44
    7.2.3 For MicroBlaze System  46
    7.2.4 Other Recommendations  46

A Proof of Equation 2.6  49

B Mathematical Derivation of the five parameters of an ellipse  51
  B.1 Derive center point (x0, y0)  52
  B.2 Derive ellipse axis Ea, Eb  52
  B.3 Derive the rotation angle  53

C Control Flow Diagrams  55

D Proof of alternative method to derive desired eigenvector  59

E Testing Images and Results  61

Bibliography  65


List of Figures

1.1 TRIP code example [3]  2
1.2 Examples of other 2D Barcodes  3
1.3 Basic Smart Camera Architecture  4
1.4 System Architecture from [1]  4
1.5 Xilinx ML510 FPGA board  5
2.1 Target Recognition Process stage pipeline [3]  7
2.2 Original image  10
2.3 Binarization result using Sauvola's algorithm  10
2.4 Binarization result using Equation 2.7 with C = 0.06  10
2.5 Binarization result using Equation 2.8 with P = 0.85  10
2.6 Pixel neighborhood notation: a) 4-connectivity; b) 8-connectivity; c) 8-connected neighbors of a central pixel denoted by pn [3]  11
2.7 Binary Edge Detection Algorithm  11
2.8 Binary Edge Detection Result  11
2.9 Edge tracking of 1 pixel thick edge using 8-connectivity [3]  12
2.10 Edge Following and Filtering Result  12
2.11 Ellipse with center at (x0, y0), long axis Ea, short axis Eb, rotated by θ  14
2.12 Ellipse Fitting and Concentric Test Result  14
2.13 Dimensions of a TRIP target [3]  15
2.14 Code Recognition Result  16
3.1 Shift Register Buffer Implementation [46]  19
3.2 Block Buffering Method Example [47]  20
3.3 Proposed processing block diagram  20
3.4 Integral image construction and use  21
3.5 Binary edge detection method  22
3.6 Shift register buffer for binary edge detection  22
4.1 PowerPC Hardware System Structure  24
4.2 MicroBlaze Hardware System Structure  24
4.3 PLBv46 Master Burst Block Diagram [57]  25
4.4 Simulation waveform of 64 data burst read  25
4.5 Image Processing Pipeline  26
4.6 RGB2Gray block diagram  26
4.7 Integral Image Calculation Block Diagram  27
4.8 Local Mean Calculation Block Diagram  27
4.9 Area of valid pixel data  28
4.10 Top level gray data FIFO block  28
4.11 Adaptive Threshold Block  29
4.12 Binary Edge Detection Block Diagram  29
4.13 Combined Result of Adaptive Threshold and Binary Edge Detection  30
4.14 Inter-connection of pipeline blocks  31
5.1 Whole system software block diagram  33
5.2 Sector distribution for check points  35
5.3 Simplified Decode Algorithm  36
7.1 Dual PowerPC System Block Diagram  44
7.2 Possible Timing Graph for Dual PPC System  45
7.3 Block Diagram of Edge Following and Filtering stage  45
B.1 No rotation ellipse  51
C.1 Integral Image Calculation Control Flow Diagram  56
C.2 Local Mean Calculation Control Flow, a) FIFO Control; b) Output Control  57
E.1 Matlab result of Image01  61
E.2 ML510 result of Image01  61
E.3 Matlab result of Image02  61
E.4 ML510 result of Image02  61
E.5 Matlab result of Image03  62
E.6 ML510 result of Image03  62
E.7 Matlab result of Image04  62
E.8 ML510 result of Image04  62
E.9 Matlab result of Image05  62
E.10 ML510 result of Image05  62
E.11 Matlab result of Image06  63
E.12 ML510 result of Image06  63
E.13 Matlab result of Image07  63
E.14 ML510 result of Image07  63
E.15 Matlab result of Image08  63
E.16 ML510 result of Image08  63
E.17 Matlab result of Image09  64
E.18 ML510 result of Image09  64
E.19 Matlab result of Image10  64
E.20 ML510 result of Image10  64
E.21 Matlab result of Image11  64
E.22 ML510 result of Image11  64


List of Tables

6.1 TRIP code Recognition Result  38
6.2 Matlab System Profiling Result  38
6.3 PowerPC System Profiling Result  39
6.4 MicroBlaze Double Precision System Profiling Result  39
6.5 MicroBlaze Single Precision System Profiling Result  40
6.6 Resource Utilization for the whole MicroBlaze System  40
6.7 Resource Utilization for the whole PowerPC System  41
7.1 Data rate comparison  46


Chapter 1

Introduction

Smart cameras are becoming increasingly popular with the advances in both machine vision and semiconductor technology. In the past, a typical camera was only able to capture images; with the smart camera concept, a camera that integrates image processing hardware has the ability to generate specific information from the images it has captured.

So far, there seems to be no established definition of what a smart camera is. We use the definition from [1], which suits our project:

A smart camera is a vision system which can extract information from images and generate specific information for other devices such as a PC or a surveillance system without the need for an external processing unit.

The goal of this project is to implement typical image processing algorithms on a low-cost smart camera hardware platform. Visual tags, which are comparable to 2-dimensional barcodes (2D barcodes), are the objects to detect and identify in this project.

1.1 Image Processing Hardware

The image processing hardware, which is integrated in a smart camera, could be a microprocessor, a digital signal processor, an FPGA or any combination of these. The goal of image processing is usually to detect some activity (e.g. suspect movements in a parking lot), identify some object, etc. One challenge for this project is to find how typical image processing algorithms can be partitioned into a data-dominated part, which can be mapped on e.g. an FPGA, and a control-dominated part, which can be mapped on a microprocessor. The solution should have low cost in hardware and low power consumption while meeting the computational requirements.

1.2 Visual Tag

The object for the smart camera to recognize in this project is the "visual tag", which is a kind of 2D barcode. Recently, an M.Sc. project was carried out in the Pervasive Systems group [2]. That work investigated the recognition of "visual tags" in a scene captured by the camera. The "visual tag" proposed in [2] is based on the TRIP code, which is discussed in detail in Lopez de Ipina's PhD thesis [3]. That thesis introduces the complete image processing algorithm to detect and recognize the TRIP code. Lopez de Ipina also provides a TRIP code generator on his home page [4]. This TRIP code is used as the target to detect in this project.

1.2.1 TRIP code

TRIP (Target Recognition using Image Processing) is a vision-based sensor system that uses a combination of 2D circular barcode or ringcode tags (see Figure 1.1) and inexpensive low-resolution CCD cameras to identify and locate tagged objects in a camera's field of view. Relatively low CPU-demanding image processing and computer vision algorithms are applied to obtain, in real time, the identifier (TRIPcode).


Figure 1.1: TRIP code example [3]

A TRIP marker, or TRIPtag (see Figure 1.1), encoding a ternary identifier, presents the following main features [3]:

• A central bull's-eye makes the identification process easier due to its invariance to rotation and perspective and its high contrast.

• Two code rings around the bull's-eye encode its ternary identifier. Each ring code provides 16 ternary digits of information that are read in counterclockwise direction. To encode a '0' ternary digit, two blank areas are left within a sector. To encode a '1', the area corresponding to the innermost encoding ring is drawn black and the area corresponding to the second encoding ring is left white. Finally, to encode a '2' ternary digit, the reverse approach to the case of '1' is applied. The 16 sectors of the ringcode are utilized in the following form:

– The first (or synchronization) sector's special and elsewhere impossible configuration, i.e. two black sections in the encoding ring areas, indicates the beginning of the code. This excludes the use of quaternary codes, which could have given a larger address space.

– The second and third sectors implement an even parity error check. The second sector is the parity bit for '1', and the third sector is the parity bit for '2'. These parity bits make the total numbers of '1's and '2's even (a small sketch of this check is given after this list).

– The four subsequent sectors, in counterclockwise direction, encode (in ternary) the radius of the outermost border of the bull's-eye. This is basically for calculating the distance from the camera. These sectors are not practical in use, because the printed size of the same TRIP code can vary from person to person. So in the implementation of this project, the radius-encoding sectors are not used for the radius; these four sectors are used as part of the code space instead.

– The remaining nine sectors encode an identifier or TRIPcode in ternary. Therefore, the maximum number of valid codes is 3^9 = 19,683. In the implementation of this project, since the radius-encoding sectors are used for the TRIPcode as well, the maximum number of valid codes is 3^13 = 1,594,323.
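As an illustration of the parity rule above, the following is a minimal sketch (not the thesis code); the sector indexing and the decision to include the parity sectors themselves in the counts are assumptions made for the example:

```c
/* Even parity check on a decoded TRIPtag. sectors[0] is assumed to be
 * the synchronization sector, sectors[1] and sectors[2] the parity
 * sectors for the ternary digits '1' and '2', and sectors[3..15] the
 * code sectors. The check verifies that the total numbers of '1's and
 * '2's over all sectors except the sync sector are even. */
static int trip_parity_ok(const int sectors[16])
{
    int ones = 0, twos = 0;
    for (int i = 1; i < 16; i++) {   /* skip the synchronization sector */
        if (sectors[i] == 1) ones++;
        if (sectors[i] == 2) twos++;
    }
    return (ones % 2 == 0) && (twos % 2 == 0);
}
```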

1.2.2 Other 2D barcodes

There are more than 30 2D barcodes in use today. They can be divided into two categories: data-based 2D barcodes and index-based 2D barcodes [5].

The data-based 2D barcodes (QR Code, VSCode and Data Matrix) were invented to improve data capacity for industrial applications. However, when integrated into mobile phones with built-in cameras that can scan and decode data, these 2D barcodes can operate as portable databases. All the information is encoded in the 2D barcode itself (e.g. company name, address, email address, websites and text messages), and you can read this information directly on your decoder (e.g. a mobile phone) anytime, anywhere, regardless of network connectivity.

The index-based 2D barcodes (Visual Code, ShotCode and ColorCode) were developed for camera phones, so they take into account the reading limitations of these built-in cameras. They have a much lower data capacity than data-based 2D barcodes, but they offer more robust and reliable barcode reading. Each barcode basically works as an index that links the digital world to the real world. For instance, the decoder for index-based barcodes returns a series of digits (e.g. 1221002120), and these digits have to be looked up in a database to see which real-world objects they refer to.

Figure 1.2 shows what those barcodes look like.

Figure 1.2: Examples of other 2D Barcodes

1.2.3 Current Situation of 2D Barcode Use

Since the QR code (Quick Response code) was invented in Japan in 1994, many QR code applications have been developed and used in East Asian countries (e.g. Japan, South Korea and China).

Decoding software for QR code and some other codes has been developed for mobile phones, which can decode the barcode image captured by the phone camera. Right now, most camera phones sold in Japan and South Korea have built-in software decoders for QR code and Data Matrix code. Various applications, such as mobile money services, mobile ticket services, marketing and anti-counterfeiting, are already widely used in Japan and South Korea. Since January 2010, mainland China has used QR codes for its real-name train ticket system, which will introduce the 2D barcode to more than one billion Chinese people and bring another application boom for 2D barcodes.

Not only in Asia: 2D barcodes are also hitting the mainstream in Western countries. Huge QR codes can be seen on billboards in the streets of the UK, and there are also many QR codes on posters at bus stops there. A QR code appears in one episode of "CSI: NY" (Crime Scene Investigation: New York), a well-known TV series in the USA; the episode, "Dead Inside", aired on November 12th, 2008 [6]. The Data Matrix code is often seen on electronic devices, especially on PCI cards. The Aztec code is used by Deutsche Bahn (German Railways), Trenitalia (the primary train operator in Italy), Nederlandse Spoorwegen (Dutch Railways), Przewozy Regionalne (one of Poland's train operators), PKP Intercity and the Swiss Federal Railways for tickets sold online and printed out by customers [7]. The PDF417 code is often seen on airplane tickets all over Europe.

The ShotCode was first created in 1999 at the University of Cambridge during research on a low-cost vision-based method to track locations, and was later developed into the TRIP code. Further development was done to make it commercially usable; in 2005 it was renamed ShotCode and its commercial use really started. In 2006, ShotCode ran a successful campaign in Brazil with the Brazilian national football team. Nowadays, ShotCode is often seen on commercial posters in Japan and South Korea.

1.3 Smart Camera Structure

Figure 1.3 shows the basic structure of a smart camera system. Just like a typical digital camera, a smart camera captures an image using an image sensor, stores the captured image in memory and transfers it to another device or user using a communication interface. However, unlike the simple processor in a typical digital camera, the processor in a smart camera not only controls the camera functionality but is also able to process the captured images so that extra information can be obtained from them.


Figure 1.3: Basic Smart Camera Architecture

There are many smart camera products already available on the market today from a variety of manufacturers such as Tattile, Cognex, Matrix Vision, Sony, Philips, EyeSpector, PPT Vision, and Vision Components [1]. The field of smart cameras is still a very active area of research because of the wide range of smart camera features that could be improved.

Acquiring the right target hardware for a smart camera processor is an important issue. While an ASIC provides a high-performance and power-efficient platform, it suffers from a lack of flexibility and can be very expensive due to the high non-recurring engineering (NRE) cost. A DSP, on the other hand, only has a single flow of control, which can make it hard to meet real-time constraints. A general-purpose processor (GPP) also has problems meeting real-time constraints due to its poor execution-time predictability. The graphics processing unit (GPU) is widely used in video cards and game consoles for image and video processing. It can process images very fast due to its highly parallel structure, but it is not good for general-purpose processing, and developing software for a GPU is complex due to a restrictive programming model [8]. With the rising importance of GPU computing, GPU hardware and software are changing at a remarkable pace, and several changes that allow more flexibility and performance can be expected in future GPU computing systems [9]; GPUs might become suitable for general-purpose image processing in the near future. Garcia et al. [10] suggested reconfigurable hardware as the best and most cost-effective option for embedded systems. Presently, the FPGA is one of the most widely used and competitive reconfigurable hardware platforms on the market.

One of the key aspects of an FPGA is that it contains large arrays of parallel logic and registers, which enable the designer to produce an effective parallel architecture. Parallel processing is an important feature, especially for embedded systems that involve high-level computation in real time. Besides that, an FPGA allows a microprocessor to be incorporated on the chip.

Mustafah et al. [1] proposed a smart camera architecture that can be used as an aid for face recognition in crowd surveillance. It uses an FPGA-based processor to extract the region of interest (ROI), and only sends the ROI to the client processor for face recognition. Figure 1.4 shows the system architecture.

Figure 1.4: System Architecture from [1]


1.4 Hardware Platform

An FPGA board (the Xilinx ML510 development board) is available in the group and has been selected as the prototyping platform for this project. It has a Virtex-5 FPGA with two embedded PowerPC processors.

The Xilinx ML510 development board (ML510) has similarities with a PC motherboard. Instead of the general-purpose processor found on a PC motherboard, the ML510 has a Virtex-5 FX130T FPGA. A picture of the ML510 development board is shown in Figure 1.5.

Figure 1.5: Xilinx ML510 FPGA board

General FPGA development boards offer a large number of user I/O pins in combination with communication interfaces like UARTs and Ethernet. Besides these, the ML510 also offers features normally found in PCs. An overview of the features of the ML510 is shown below:

• FPGA: Virtex-5 FX130T

• Memory: 2× DIMM banks for DDR2 memory (by default each contains 512MB)

• Expansion slots: 4× PCI, 2× PCI-Express

• Storage: 1× CompactFlash interface, 256Mbit onboard Flash, 2× IDE, 2× Serial ATA

• LAN: 2× Gigabit Ethernet physical layers

• Video: 1× DVI-I, max resolution 640×480

• Audio: AC97 audio with headphone and microphone ports

• I/O ports: 2× RS232 UART, 2× PS/2, 2× USB 1.1

• User I/O: 124 GPIO pins, 4× debug LED/switch


As can be seen in the list, the ML510 offers a lot of peripherals which can be used for prototyping many applications. In the Smart Camera project, hardware/software co-design is our main concern, so the focus is on the Virtex-5 FX130T FPGA. An overview of the features of the Virtex-5 FX130T is given below:

• Slices: 20,480 (containing logic cells and flip-flops)

• Logic cells: 81,920 6-input LUTs

• Flip-flops: 81,920

• Maximum operating frequency: 550MHz

• DSP blocks: 320× DSP48E

• Processor block: 2× IBM PowerPC 440 cores

• Block RAM: 298 blocks of 36Kbit for a total of 10,728Kbit

• Hard-core memory controller with support for DDR2 using a soft-core bridge

• LAN: 6× 10/100/1000 Ethernet MAC blocks

• High-speed serial I/O: 20× RocketIO transceivers capable of 150Mbps to 6.5Gbps

1.5 Challenges of this project

The Virtex-5 FX130T FPGA has two embedded PowerPC hard cores, and a Xilinx FPGA can also be configured with a soft-core processor such as the MicroBlaze. Only one PowerPC processor is needed to implement the TRIP code recognition algorithm, but many questions remain to be answered: Can both PowerPC processors be used for the image processing algorithm, and how would they cooperate? Can a soft-core processor do the same job instead of the hard-core processors? How do these choices affect the performance?

The Virtex-5 FX130T FPGA is very powerful, but how many resources the algorithm will use is another issue. The challenge is how to downscale the whole system so that it can fit in other FPGA boards which have fewer resources and a lower price, without violating certain constraints.

Partitioning the processing between hardware and software is another challenge. The PowerPC processors on the ML510 operate at 300MHz, so the software implementation takes more processing time than on a normal PC. The algorithm for TRIP code recognition consists of adaptive thresholding, binary edge detection, edge following and filtering, ellipse fitting, and code recognition stages, which are described in detail in the following chapter. Since the ellipse fitting algorithm is not well suited to a hardware implementation, the ellipse fitting stage can be expected to take much longer on the board. Whether to use alternative ellipse fitting algorithms [54, 55, 56] should be considered, and whether this stage can be implemented in the FPGA using DSP blocks also needs to be investigated. There are 320 DSP48E blocks inside the Virtex-5 FX130T FPGA; to what extent these DSP blocks can be involved in the image processing algorithm is another interesting question.

The aim is to make the whole system real-time. Lopez de Ipina's previous work can process about three frames per second at a resolution of 640×480 on a PC, so increasing the processing speed is another challenge for this project. Another interesting question is what other applications could be implemented on this system; object tracking, face recognition and traffic sign recognition are some examples for further study.


Chapter 2

Image Processing Algorithm

The image processing algorithm for TRIP code detection and decoding is based on the PhD thesis of Diego Lopez de Ipina [3]. Figure 2.1 shows the process stage pipeline for the target recognition algorithm. The details of each step are discussed in this chapter.

Figure 2.1: Target Recognition Process stage pipeline [3]

2.1 Adaptive Thresholding

The simplest way of thresholding is to set all pixels below a certain gray level to black (0) and set all the others to white (255). This process is called global thresholding, because every pixel pn in the image is compared against the same fixed threshold value T:

B(n) = \begin{cases} 0 & \text{if } p_n < T \\ 255 & \text{otherwise} \end{cases} \qquad (2.1)

The key problem here is to select a suitable threshold value T. Using Otsu's method [11] to calculate the global threshold is a very common and fast approach. But if the illumination over the image is not uniform, global thresholding methods tend to produce more noise and lose important parts of the image.

To overcome this problem, local adaptive thresholding is used. For each pixel, the local thresholding method uses information from the local neighborhood of that pixel. This method is usually able to achieve good results even on severely degraded images, but it needs more computation time.

Bradley et al. [12] have proposed real-time adaptive thresholding using the mean of a local window, where the local mean is computed using an integral image. But Shafait et al. [13] later indicated that the local mean alone does not work as well as considering both the local mean and the local variance. Sauvola's binarization technique [14] is used in their paper; using integral images, it reduces the computation time for the standard deviation of each pixel.

Here, Sauvola's method is discussed following Shafait's paper. Because the calculation of the local mean value is part of Sauvola's algorithm anyway, it has to be decided later, by multiple tests, whether the full Sauvola algorithm is necessary.

2.1.1 Sauvola’s Threshold Algorithm

In Sauvola’s binarization method, the threshold t(x,y) is computed using the mean m(x,y) and standarddeviation s(x,y) of the pixel intensities in a W×W window centered around the pixel (x,y):

t(x, y) = m(x, y)

[1 + k

(s(x, y)

R− 1

)](2.2)

where R is the maximum value of the standard deviation (R=128 for a grayscale image), and k is aparameter which takes positive values in the range [0.2, 0.5]. The local mean m(x, y) and standarddeviation s(x, y) adapt the value of the threshold according to the contrast in the local neighborhood ofthe pixel. A value of k = 0.5 is normally used. In general, the algorithm is not very sensitive to the valueof k used.
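As an illustration (not taken from the thesis code), Equation 2.2 translates directly into a small helper function; the name and signature are assumptions made here:

```c
/* Sauvola threshold (Equation 2.2) for one pixel, given the local mean m
 * and local standard deviation s of its window. k is typically 0.5 and
 * R = 128 for 8-bit grayscale intensities (scale R accordingly if the
 * intensities are normalized to [0,1]). */
static double sauvola_threshold(double m, double s, double k, double R)
{
    return m * (1.0 + k * (s / R - 1.0));
}
```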

2.1.2 Using integral images for computing means and variances

An integral image I of an input image g is defined as the image in which the intensity at a pixel position is equal to the sum of the intensities of all the pixels above and to the left of that position in the original image. So the intensity at position (x, y) can be written as

I(x, y) = \sum_{i=1}^{x} \sum_{j=1}^{y} g(i, j) \qquad (2.3)

The integral image of any grayscale digital image can be efficiently computed in a single pass, as follows:

I(x, y) = g(x, y) - I(x-1, y-1) + I(x-1, y) + I(x, y-1) \qquad (2.4)

Here g is the input grayscale image and I is the desired integral image. In this way, each pixel only needs two additions and one subtraction for computing the integral image.

Once we have the integral image, the local mean m(x, y) for any window size w can be computed using just three additions/subtractions instead of the summation over all pixel values within that window:

m(x, y) = \frac{1}{w^2} \left[ I\left(x+\frac{w}{2},\, y+\frac{w}{2}\right) + I\left(x-\frac{w}{2},\, y-\frac{w}{2}\right) - I\left(x+\frac{w}{2},\, y-\frac{w}{2}\right) - I\left(x-\frac{w}{2},\, y+\frac{w}{2}\right) \right] \qquad (2.5)

Similarly, if we consider the computation of the local variance:

s^2(x, y) = \frac{1}{w^2} \sum_{i=x-w/2}^{x+w/2} \; \sum_{j=y-w/2}^{y+w/2} g^2(i, j) - m^2(x, y) \qquad (2.6)

The proof of Equation 2.6 is shown in Appendix A.

The first term in Equation 2.6 can be computed in a similar way as Equation 2.5 by using an integral image of the squared pixel intensities. Once the integral images are available, local means and variances can be computed very efficiently, independent of the local window size.
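To make the data flow concrete, the following is a minimal C sketch (added for illustration; it is not the thesis implementation, and names such as integral_image, window_sum and local_stats are assumptions) of Equations 2.3-2.6 on a row-major image of normalized intensities:

```c
#include <math.h>

#define IMG_W 640
#define IMG_H 480

/* Equation 2.4: single-pass construction of the integral image.
 * If sq is nonzero, the integral image of the squared intensities is
 * built instead (needed for the first term of Equation 2.6). */
static void integral_image(const double *g, double *I, int sq)
{
    for (int y = 0; y < IMG_H; y++)
        for (int x = 0; x < IMG_W; x++) {
            double v = g[y * IMG_W + x];
            if (sq) v *= v;
            double a = (x > 0 && y > 0) ? I[(y - 1) * IMG_W + (x - 1)] : 0.0;
            double b = (y > 0) ? I[(y - 1) * IMG_W + x] : 0.0;
            double c = (x > 0) ? I[y * IMG_W + (x - 1)] : 0.0;
            I[y * IMG_W + x] = v - a + b + c;
        }
}

/* Sum over the inclusive window [x0..x1] x [y0..y1], Equation 2.5 style. */
static double window_sum(const double *I, int x0, int y0, int x1, int y1)
{
    double s = I[y1 * IMG_W + x1];
    if (x0 > 0)           s -= I[y1 * IMG_W + (x0 - 1)];
    if (y0 > 0)           s -= I[(y0 - 1) * IMG_W + x1];
    if (x0 > 0 && y0 > 0) s += I[(y0 - 1) * IMG_W + (x0 - 1)];
    return s;
}

/* Equations 2.5 and 2.6: local mean and standard deviation of a w x w
 * window centered at (x, y). I is the integral image, I2 the integral
 * image of squared intensities. The window is clipped at the image
 * borders, so the divisor is the actual window area rather than w*w. */
static void local_stats(const double *I, const double *I2,
                        int x, int y, int w, double *m, double *s)
{
    int x0 = x - w / 2, y0 = y - w / 2;
    int x1 = x + w / 2, y1 = y + w / 2;
    if (x0 < 0) x0 = 0;
    if (y0 < 0) y0 = 0;
    if (x1 >= IMG_W) x1 = IMG_W - 1;
    if (y1 >= IMG_H) y1 = IMG_H - 1;
    double n = (double)(x1 - x0 + 1) * (y1 - y0 + 1);
    *m = window_sum(I, x0, y0, x1, y1) / n;
    double var = window_sum(I2, x0, y0, x1, y1) / n - (*m) * (*m);
    *s = (var > 0.0) ? sqrt(var) : 0.0;
}
```

Together with the sauvola_threshold() sketch above, local_stats() provides everything needed to binarize a pixel with Equation 2.2.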


2.1.3 Computational complexity

For an image of size M×N and a window size of w×w, the computational complexity analysis is shown below. Here, a subtraction is also counted as an addition; and if the window size is constant for each pixel, the division can be replaced by a multiplication, since division is expensive in hardware.

Mean value: computing the integral image needs 2MN additions. The mean value of a pixel then needs 3 additions plus 1 division. So in total, the mean values of an image need 5MN additions plus MN divisions, which means the computational complexity for the mean value is O(MN).

Standard deviation: computing the squared integral image needs MN multiplications plus 2MN additions. The variance of a pixel then needs 4 additions plus 1 division, and computing the standard deviation needs 1 square root. So in total, each pixel needs 6 additions, 1 multiplication, 1 division and 1 square root to compute the standard deviation. The computational complexity for the standard deviation is O(MN).

Threshold: based on Equation 2.2, computing the threshold of each pixel needs 13 additions, 3 multiplications, 2 divisions and 1 square root. The computational complexity of the whole adaptive thresholding step is O(MN).
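As a worked example (added here for illustration, not taken from the thesis), for the 640 × 480 images used in this project these per-pixel counts amount to roughly the following per-frame totals for the full Sauvola thresholding:

MN = 640 × 480 = 307,200
13 MN ≈ 4.0 × 10^6 additions
3 MN ≈ 9.2 × 10^5 multiplications
2 MN ≈ 6.1 × 10^5 divisions
MN ≈ 3.1 × 10^5 square roots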

2.1.4 Simpler ways for adaptive threshold

Instead of using Equation 2.2, a simpler equation can be used:

t(x, y) = m(x, y) - C \qquad (2.7)

where C is a constant. Compared to Equation 2.2, it skips the calculation of the standard deviation, which takes most of the computation time. For each pixel, it only needs 7 additions and 1 division, which is less than half of the operations needed for Equation 2.2.

Another simple adaptive threshold equation can be expressed as:

t(x, y) = m(x, y) \cdot P \qquad (2.8)

where P is also a constant, ranging from 0 to 1. It only needs 6 additions, 1 division and 1 multiplication for each pixel.

These two simpler adaptive thresholding methods remove most of the computation operations, and they avoid the square root, which is very expensive in hardware. The test result images in the following section show that both methods work properly, so they may be good algorithms for a hardware implementation. Some future testing work needs to be done to see whether Equation 2.7 has good enough performance.

Normally, when the intensity value of an image pixel is used, the value is an integer from 0 to 255. Since an integral image is used, the value could easily overflow when all these integers are added up. So it is better to use a floating-point number in the range 0 to 1 for the intensity value of a pixel. For a hardware implementation, fixed-point is preferred, and normalization should also be considered when calculating the integral image in hardware.
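A minimal sketch (illustrative, not the thesis implementation) of how Equation 2.7 or 2.8 could be applied per pixel, reusing window_sum(), IMG_W and IMG_H from the sketch in Section 2.1.2:

```c
/* Binarize with Equation 2.7 (t = m - C) or Equation 2.8 (t = m * P).
 * g: grayscale image with intensities in [0,1]; I: its integral image;
 * out: binary result, 0 (black) or 255 (white), as in Equation 2.1. */
static void adaptive_threshold(const double *g, const double *I,
                               unsigned char *out, int w,
                               double C, double P, int use_eq_2_8)
{
    for (int y = 0; y < IMG_H; y++)
        for (int x = 0; x < IMG_W; x++) {
            int x0 = x - w / 2, y0 = y - w / 2;
            int x1 = x + w / 2, y1 = y + w / 2;
            if (x0 < 0) x0 = 0;
            if (y0 < 0) y0 = 0;
            if (x1 >= IMG_W) x1 = IMG_W - 1;
            if (y1 >= IMG_H) y1 = IMG_H - 1;
            double n = (double)(x1 - x0 + 1) * (y1 - y0 + 1);
            double m = window_sum(I, x0, y0, x1, y1) / n;  /* Eq. 2.5 */
            double t = use_eq_2_8 ? m * P : m - C;         /* Eq. 2.8 / 2.7 */
            out[y * IMG_W + x] = (g[y * IMG_W + x] < t) ? 0 : 255;
        }
}
```

Called with w = 16 and C = 0.06 (or P = 0.85), this corresponds to the settings used for Figures 2.4 and 2.5 below.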

2.1.5 Test Result

Some experiments in Matlab have been done to test the adaptive thresholding algorithm. The intensity of each pixel is a floating-point number from 0 to 1. Figure 2.2 is the original input image, whose size is 640×480. Figure 2.3 shows the result of using Equation 2.2 with parameter k = 0.5. Figure 2.4 shows the result of using Equation 2.7 with window size w = 16 and constant C = 0.06. Figure 2.5 shows the result of using Equation 2.8 with window size w = 16 and constant P = 0.85.

Note: the window size w can be any integer; from the tests, 10 to 20 is a proper range. For the division, a power of 2 for w is a better option, because the division can then be done by shifting. So w = 8 or w = 16 is suggested.


Figure 2.2: Original image

Figure 2.3: Binarization result using Sauvola's algorithm

Figure 2.4: Binarization result using Equation 2.7 with C = 0.06

Figure 2.5: Binarization result using Equation 2.8 with P = 0.85

2.2 Binary Edge Detection

Edges correspond to the pixels at or around which the image values undergo a sharp variation. Conventionally, edge detection techniques apply the geometric interpretation of the gradient to an image, expressed as the rate of change of the gray levels in it. The Canny [15] edge detector is probably the most commonly used in today's machine vision community.

For TRIP, given that the output of the adaptive thresholding stage is a binary image with only black and white pixel intensity values, it is not necessary to apply a fully-fledged edge detection method. The binary edge detector presented here is not as precise as other morphological or gradient-based edge detection methods, only offering pixel accuracy in the edge location. However, the complexity of those more sophisticated detectors is high and requires much more computational power.

The process of finding the edges of a binary object may be carried out in different forms. Taking the assumption that the edge of an object has to lie within the object, the following edge-finding heuristic for binary images can be formulated:

"A pixel in a binary image is an edge point if it has black intensity value, a 4-connected neighbor with white intensity value and an 8-connected one with black intensity value"

Figure 2.6 illustrates what is meant by a 4-connected or 8-connected neighbor of a pixel. It also shows the numbering scheme used to name the 8-connected neighboring pixels of a central pixel, denoted by pn. This numbering scheme serves to calculate two essential parameters required to undertake the binary edge detection of an image, namely sigma and gamma. The sigma value of pixel pn is defined as the sum of the pixel values surrounding it, i.e. sigma = P0+P1+P2+P3+P4+P5+P6+P7. The gamma value of pixel pn is the sum of the pixel values corresponding to its diagonally adjacent neighbor pixels, i.e. gamma = P1+P3+P5+P7. The algorithm to find one-pixel-thick edges in a very efficient manner is shown in Figure 2.7.

Figure 2.6: Pixel neighborhood notation: a) 4-connectivity; b) 8-connectivity; c) 8-connected neighbors of a central pixel denoted by pn [3]

Figure 2.7: Binary Edge Detection Algorithm
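Since Figure 2.7 itself is not reproduced here, the following minimal sketch (added for illustration, not the thesis code) implements the quoted edge-finding heuristic directly; the sigma/gamma formulation of Figure 2.7 is an optimized form of the same test. Black is 0, white is 255, and IMG_W/IMG_H are as in the earlier sketches:

```c
/* Mark edge points of a binarized image according to the heuristic:
 * a black pixel is an edge point if it has a white 4-connected neighbor
 * and a black 8-connected neighbor. The one-pixel image border is
 * skipped for brevity. edge[] is set to 255 at edge points, 0 elsewhere. */
static void binary_edge_detect(const unsigned char *bin, unsigned char *edge)
{
    for (int y = 1; y < IMG_H - 1; y++)
        for (int x = 1; x < IMG_W - 1; x++) {
            edge[y * IMG_W + x] = 0;
            if (bin[y * IMG_W + x] != 0)   /* edge points must be black */
                continue;
            /* at least one white 4-connected neighbor */
            int white4 = bin[y * IMG_W + (x - 1)] == 255 ||
                         bin[y * IMG_W + (x + 1)] == 255 ||
                         bin[(y - 1) * IMG_W + x] == 255 ||
                         bin[(y + 1) * IMG_W + x] == 255;
            /* at least one black 8-connected neighbor */
            int black8 = 0;
            for (int dy = -1; dy <= 1 && !black8; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    if ((dx || dy) && bin[(y + dy) * IMG_W + (x + dx)] == 0) {
                        black8 = 1;
                        break;
                    }
            if (white4 && black8)
                edge[y * IMG_W + x] = 255;
        }
}
```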

2.2.1 Computational Complexity and Test Result

For the method used, only 7 additions are needed for each black pixel. Figure 2.8 is the result of applying this algorithm with Figure 2.4 as the input image. In this test, about 14.35% of all pixels are black pixels.

Figure 2.8: Binary Edge Detection Result


2.3 Edge Following and Filtering

In order to fit the boundary of a pattern to a model, e.g. an ellipse, it is necessary to store an ordered list of edge point (edgel) locations in a data structure. Thus, all the 8-connected chains of edgels previously located are followed in a clockwise fashion, producing for each edge a list of ordered point locations. Figure 2.9 illustrates how the previously calculated edge can be tracked. Starting from the first edgel encountered while scanning the image from top to bottom and left to right, this method searches for the next edge pixel to the current one in the order shown on the left-hand side of Figure 2.9. Note that every edge point that is assigned to an edge must be made black in order to prevent the method from looping through already visited edge points.

Figure 2.9: Edge tracking of 1 pixel thick edge using 8-connectivity [3]

The criterion used filters out edgel chains for which the ratio between the perimeter and the distance between the initial and final points is bigger than an empirical value larger than 1. This ratio equals 1 in the case of straight lines.

In the Matlab experiment, the Manhattan distance is used instead of the Euclidean distance in order to avoid the expensive square root operation. According to Lopez de Ipina's thesis [3], a ratio threshold of 5 is used to keep the edges which are likely to define elliptical arcs.

2.3.1 Computational Complexity and Test Result

The calculation of the computational complexity is based on the total number of tracked edges and the total number of edge points. For X sets of tracked edges in total, with n the total number of edge pixels in the input image, it needs n + 2X additions plus X divisions.

In the example image from the previous step, n = 22766 and X = 1546. Figure 2.10 shows the result of this algorithm with a ratio threshold of 5.

Figure 2.10: Edge Following and Filtering Result


2.4 Ellipse Fitting

The previous stage provides edges likely to define an elliptical shape. In this stage, an ellipse fitting [16] method is applied which seeks, for each of these edges, a conic function representing an ellipse, given by the following implicit equation:

F(\vec{a}, \vec{x}) = a x^2 + b x y + c y^2 + d x + e y + f = 0 \qquad (2.9)

where \vec{a} = [a\ b\ c\ d\ e\ f]^T is the conic curve parameter vector and \vec{x} = [x^2\ xy\ y^2\ x\ y\ 1], with (x, y) representing a point belonging to the curve. F(\vec{a}, \vec{x}_i) is called the "algebraic distance" of a point (x_i, y_i) to the conic F(\vec{a}, \vec{x}) = 0. The fitting of a general conic can be approached by minimizing the sum of squared algebraic distances:

D_A(\vec{a}) = \sum_{i=1}^{N} F(\vec{a}, \vec{x}_i)^2 \qquad (2.10)

of the curve to the N data points \vec{x}_i. Bookstein [17] showed that if a quadratic constraint is set on the parameters, the minimization can be solved by the rank-deficient generalized eigenvalue system:

S\vec{a} = \lambda C\vec{a} \qquad (2.11)

where

S = D^T D

is called the scatter matrix,

D = \begin{bmatrix} x_1^2 & x_1 y_1 & y_1^2 & x_1 & y_1 & 1 \\ x_2^2 & x_2 y_2 & y_2^2 & x_2 & y_2 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ x_n^2 & x_n y_n & y_n^2 & x_n & y_n & 1 \end{bmatrix}

is called the design matrix, and

C = \begin{bmatrix} 0 & 0 & 2 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 & 0 & 0 \\ 2 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}

is the matrix that expresses the Fitzgibbon et al. [16] constraint:

4ac - b^2 = \vec{a}^T C \vec{a} = 1 \qquad (2.12)

The purpose is to find the vector \vec{a} which defines the ellipse. The constraint is necessary to avoid the trivial solution \vec{a} = \vec{0}. Fitzgibbon et al. [16] also proved that the solutions to the conic fitting problem given by the generalized eigensystem (Eq. 2.11), subject to the constraint (Eq. 2.12), include one and only one elliptical solution, corresponding to the single positive generalized eigenvalue of Equation 2.11.

The six parameters defining a conic representing an ellipse returned by this method are transformed into the five parameters that define an ellipse in a 2D coordinate system (see Figure 2.11). This representation is more convenient for the two following stages of the target recognition process. Appendix B describes how to recover the five parameters of an ellipse from its implicit equation.


Figure 2.11: Ellipse with center at (x0, y0), long axis Ea, short axis Eb, rotated by θ
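The full recovery of the five parameters is derived in Appendix B. As a small illustration (not the thesis code, names are assumptions), the center alone follows from setting the gradient of Equation 2.9 to zero:

```c
/* Recover the ellipse center (x0, y0) from the conic parameters of
 * Equation 2.9 by solving dF/dx = 2ax + by + d = 0 and
 * dF/dy = bx + 2cy + e = 0. Returns 0 when 4ac - b^2 <= 0, i.e. when
 * the conic is not an ellipse (cf. the constraint of Equation 2.12). */
static int ellipse_center(double a, double b, double c,
                          double d, double e,
                          double *x0, double *y0)
{
    double det = 4.0 * a * c - b * b;
    if (det <= 0.0)
        return 0;
    *x0 = (b * e - 2.0 * c * d) / det;
    *y0 = (b * d - 2.0 * a * e) / det;
    return 1;
}
```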

2.5 Concentric Test

In the previous stage, each valid contour returns a vector of implicit-equation parameters for the fitted ellipse. The center of the ellipse is recovered from these parameters and stored in memory for this concentric testing stage.

Potentially, three concentric ellipses should be identified per TRIP target's central bull's-eye view. However, the innermost circle (the dot in the middle) is often too small and its pixel values are filtered out by the previous image processing stages. Here, concentric means approximately concentric: the returned center pixel may vary a little, so a tolerance is needed when testing whether the ellipses are concentric. Still, in order to avoid spurious candidates, the intensity value of the pixel at the center of the concentric ellipses in the outcome of the thresholding stage is tested to check whether it corresponds to a black intensity value. Figure 2.12 shows the result of ellipse fitting and the concentric test.

There is one target missing in Figure 2.12, because the missing TRIP tag's bull's-eye edges are connected after the binary edge detection stage, and the innermost circle was filtered out. So instead of returning two or three ellipse centers, it only returned one ellipse center; that is why this tag did not pass the concentric test. There is also one false target on the eye: because the ellipse fitting algorithm is robust, the eyebrow and eyeball contours return valid ellipses, and these ellipses passed the concentric test.

Figure 2.12: Ellipse Fitting and Concentric Test Result


2.5.1 Computational Complexity

One way to do the concentric test is to compare every pair of returned center points. If there are N returned points, we have to make N(N − 1)/2 comparisons, so the complexity is O(N²). In the example test, N = 111. When N is large, this approach is not acceptable; there are other ways available for concentric testing, and reducing N is another option. According to Lopez de Ipina's thesis [3], the concentric test stage takes very little time.
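As an illustration of this pairwise test (the names, the tolerance value and the output convention are assumptions made here, not the thesis code), reusing IMG_W from the earlier sketches:

```c
/* O(N^2) pairwise concentric test: two ellipse centers are a candidate
 * bull's-eye when they almost coincide and the binarized pixel at the
 * shared center is black (0). The index pairs of accepted candidates
 * are written to pairs[][2]; the number of pairs found is returned. */
#define TOLERANCE 2   /* pixels, illustrative value */

typedef struct { double x, y; } Center;

static int concentric_pairs(const Center *centers, int N,
                            const unsigned char *bin,
                            int (*pairs)[2], int max_pairs)
{
    int count = 0;
    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++) {
            double dx = centers[i].x - centers[j].x;
            double dy = centers[i].y - centers[j].y;
            if (dx < -TOLERANCE || dx > TOLERANCE ||
                dy < -TOLERANCE || dy > TOLERANCE)
                continue;
            int cx = (int)(centers[i].x + 0.5);
            int cy = (int)(centers[i].y + 0.5);
            if (bin[cy * IMG_W + cx] != 0)   /* center pixel must be black */
                continue;
            if (count < max_pairs) {
                pairs[count][0] = i;
                pairs[count][1] = j;
                count++;
            }
        }
    return count;
}
```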

2.6 Code Recognition

A code extraction stage is applied to the candidate TRIP sites. This method applies an efficient pixel-sampling mechanism on the binary image resulting from the adaptive thresholding stage, based on the parameters of the outermost ellipse of the projection of a candidate TRIPcode. The pixel-sampling mechanism requires the following two steps:

1. Identify the synchronization sector. It is necessary to locate the beginning of a TRIPcode. For that purpose, the bull's-eye ellipse of reference is transformed to the unit circle, since the ratios between the radius of the bull's-eye and the code ring circular borders are only known with respect to the TRIP target design. Figure 2.13 shows on the right-hand side the dimensions of the sector highlighted on the left-hand side. The measurements depicted represent the radii of each circular border within a TRIPtag. The transformation of an ellipse point p = (x, y) to one in the unit circle is given by:

• The translation of the reference ellipse center to the origin:

x' = x - x_0, \quad y' = y - y_0 \qquad (2.13)

• Rotating the ellipse by -θ:

x'' = \cos\theta\, x' + \sin\theta\, y', \quad y'' = -\sin\theta\, x' + \cos\theta\, y' \qquad (2.14)

• Compressing the axes:

x''' = x''/E_a, \quad y''' = y''/E_b \qquad (2.15)

Figure 2.13: Dimensions of a TRIP target [3]

Once this is done, the intersection points between an arbitrary line passing through the center of the unit circle and the two imaginary circumferences going through the middle of the ring codes are determined. These intersection points are transformed back to the corresponding image locations using the inverse of the transformation employed to convert the reference ellipse point to a unit circle point (a sketch of both transformations is given after this list). If the intensity values of these points, sampled on the output of the adaptive thresholding stage, simultaneously correspond to the black intensity value in the two ring codes, then the synchronization sector has been identified. This process is repeated iteratively by rotating the reference line by 15° and calculating the new intersection points. The reason for the value of 15° is that if the detection is missed in certain situations, it can be caught again, so no synchronization sector will be missed; further details on why 15° is used are discussed in Chapter 5. If the synchronization sector cannot be found after 12 iterations, the candidate bull's eye is spurious and rejected.

2. Decode the barcode identifier. After the previous operation is completed, sample points in the middle of each of the 22.5° code sectors are taken following the same transformations. If the black color intensity value is found in a point belonging to the first ring code, the ternary value in that sector is '1'; if found in the second, the value is '2'; otherwise it is '0'. The ternary codes obtained are validated against the two even parity check bits reserved in a TRIPcode (see Figure 1.1).
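The following is a minimal sketch (added for illustration; the struct and function names are assumptions, not the thesis code) of the forward transformation of Equations 2.13-2.15 and its inverse, which is what the sampling mechanism needs for both steps above:

```c
#include <math.h>

/* The five ellipse parameters of Figure 2.11. */
typedef struct { double x0, y0, Ea, Eb, theta; } Ellipse;

/* Image point -> unit-circle point (Equations 2.13, 2.14, 2.15). */
static void to_unit_circle(const Ellipse *el, double x, double y,
                           double *u, double *v)
{
    double xt = x - el->x0, yt = y - el->y0;                  /* Eq. 2.13 */
    double xr =  cos(el->theta) * xt + sin(el->theta) * yt;   /* Eq. 2.14 */
    double yr = -sin(el->theta) * xt + cos(el->theta) * yt;
    *u = xr / el->Ea;                                         /* Eq. 2.15 */
    *v = yr / el->Eb;
}

/* Unit-circle point -> image point (the inverse transformation). */
static void to_image(const Ellipse *el, double u, double v,
                     double *x, double *y)
{
    double xr = u * el->Ea, yr = v * el->Eb;
    double xt = cos(el->theta) * xr - sin(el->theta) * yr;
    double yt = sin(el->theta) * xr + cos(el->theta) * yr;
    *x = xt + el->x0;
    *y = yt + el->y0;
}
```

A sampling point on a ring of radius r at angle φ in the unit circle is (r cos φ, r sin φ); mapping it back with to_image() gives the pixel to look up in the thresholded image.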

2.7 Experiment Result

Figure 2.14 shows the experiment result of the code recognition stage. A square is drawn around each successfully recognized TRIP code, and the number beneath each square is the decimal number decoded from the image.

Figure 2.14: Code Recognition Result

Comparing Figure 2.14 and Figure 2.12, the false target is not decoded, because the code recognition stage found that the falsely detected TRIP tag is not valid.

The following chapter discusses the state of the art of vision on FPGAs and gives some proposals for implementing the low-level processing algorithms (the adaptive thresholding and binary edge detection stages) in hardware for the TRIP code.


Chapter 3

Vision on FPGA

The goal of computer vision is to automatically extract information about a given scene from an analysis of the sensed images of the scene. The sensed images can be a single image taken from a single camera, multiple views of the scene using multiple cameras, or a sequence of images of the same scene taken over a period of time (as in video sequences). The description of the scene consists of the identity and localization (position and orientation) of the objects present in the scene, based on their measured physical features.

Based on their computational and communication characteristics, computer vision tasks can be divided into a three-level hierarchy: low-level, intermediate-level and high-level. Low-level vision tasks consist of pixel-based operations such as filtering and edge detection. The tasks at this level are characterized by a large amount of data (pixels), small neighborhood operators and relatively simple operations (e.g. multiplication and addition). Pixel-grouping operations such as segmentation and region labeling are intermediate-level vision tasks; these tasks are characterized by local data access but more complex pixel operations. High-level vision tasks are more decision-oriented, such as point matching, tree matching and graph matching; these tasks are characterized by non-local data access and non-deterministic, complex algorithms [19].

Reconfigurable hardware, in the form of field programmable gate arrays (FPGAs), is becoming increasingly attractive for image processing and computer vision tasks. The ability to exploit the parallelism often found in these problems, as well as the ability to support different modes of operation on a single hardware substrate, gives these devices a particular advantage over fixed-architecture devices such as serial CPUs and DSPs. Further, development times are substantially shorter than for dedicated hardware in the form of application-specific ICs (ASICs), and small changes to a design can be prototyped in a matter of hours [20].

For low-level vision algorithms, an FPGA is a better hardware platform than a normal processor due to its highly parallel structure. Though the clock frequency of a general-purpose processor is about 20 times that of typical FPGA implementations, a processing speedup factor of 20 to 100 can be achieved by using an FPGA instead of a general-purpose processor [21].

Intermediate-level and high-level vision tasks on FPGAs have been investigated as well [19], but these algorithms are restricted to ones without high data dependencies and without random data access.

In this chapter, the state of the art is introduced, and low-level vision tasks, such as convolution, are discussed in detail. Some ideas for mapping the first two stages of the TRIP code detection algorithm to hardware are also presented.

3.1 State of the Art

Motivated by the high computational complexity of many computer vision algorithms, there have been many attempts to create hardware implementations in order to achieve high-performance vision systems. The target applications have ranged from localization and pose estimation of parts in manufacturing settings, to biometric identification in banking applications, and depth estimation and target tracking for navigation systems and robotic control.

Perhaps the most common computer vision algorithm implemented using FPGAs is stereo disparity estimation [22, 23, 24, 25, 26]. Estimation of a disparity field between a stereo image pair is a classical correspondence problem in computer vision. During disparity estimation, for each pixel in one image the corresponding pixel in the other image is sought, such that the corresponding pixels are the projections of the same 3D position. Afterwards, if the camera calibration is known, a depth map can be calculated from the disparity field. Several stereo vision algorithms have been introduced based on different correspondence methods; block matching, feature matching and gradient-based optimization are some examples.

In Hou et al. [26], a combination of FPGAs and digital signal processors (DSPs) is used to perform edge-based stereo vision. Their approach uses FPGAs to perform low-level tasks like edge detection and uses DSPs for higher-level integration tasks. In [24], a development system based on four Xilinx XCV2000E devices is used to implement a dense, multi-scale, multi-orientation, phase-correlation based stereo algorithm [27] running at video rate (30 fps).

Other vision applications are also widely investigated. Pattern recognition, face detection, object tracking and gesture recognition are some examples of vision tasks.

Face detection is another common application of computer vision. Many algorithms for face detection have been proposed in the past twenty years [28]. Farrugia et al. [29] introduced a methodology for designing a face detection system based on the Convolutional Face Finder algorithm [30]. They implemented this system on a Virtex-4 SX35 FPGA using internal DSP blocks. The system is able to process more than 1800 images of size 256×256 per second. C. He et al. [31] have introduced a novel SoC architecture for ultra-fast face detection on FPGA. The system is based on an efficient and robust algorithm that uses a cascade of Artificial Neural Network classifiers on AdaBoost-trained Haar features [32]. The system can process 625 frames of size 640×480 per second. These projects only use an FPGA for their tasks, and the achieved processing speed demonstrates the advantage of using FPGAs. Different algorithms have also been investigated: Zhang et al. [33] presented a real-time face detector using ellipse-like features running on a desktop PC.

In-vehicle smart cameras have been studied for different applications. They are hard to design because the in-vehicle environment can significantly degrade the performance of smart cameras. Two of the most common challenges for in-vehicle smart cameras are exposure control and motion-induced distortion. Lee and Tang [34] presented two image processing algorithms which are efficient and highly suitable for embedded real-time smart cameras. The algorithms introduced in [34] are for in-vehicle automatic license plate recognition, but the techniques are also applicable to other in-vehicle smart camera applications. Cao and Deng [35] successfully implemented a reliable stop sign recognition system built on a Virtex-4 FPGA. The system uses a Micron 752×480 pixel high frame rate (60 fps) grayscale camera, and has a detection rate of 81.25%.

Video-based object tracking can be described as a correspondence problem. Given a video stream or an image sequence, its goal consists of establishing correspondences between objects detected in consecutive image frames. There is a large variety of object tracking methods available in the field of computer vision [36]. Sen et al. [37] applied simple vision tasks on an FPGA to perform gesture recognition based on the algorithm in [38]. Schlessman et al. [39] presented a tracking system based on optical flow running on an FPGA. Optical flow (the same as optic flow) is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer (an eye or a camera) and the scene [40]. In [39], the optical flow pattern calculation is implemented in the FPGA. Price et al. [41] presented an FPGA-based vision system for detecting cars from unmanned aerial vehicles (e.g. helicopters). Dias et al. [42] presented a template tracking system as an example of an embedded application; some methodological aspects of heterogeneous systems design were addressed and a design methodology approach was sketched in [42] as well.

Window-based convolution and edge detection are two examples of low-level vision tasks. Edge detection is a basic image processing step. Many algorithms are in use today, but Canny edge detection [15] might be the most famous one. Quite some work has been done to improve the performance of edge detection [43, 44]. In the edge detection process, Gaussian filtering and deviation calculation are both based on a window-based convolution. Implementing convolution on FPGA has been done in many research projects [45, 46, 47].

Besides vision tasks, system structure has also been investigated. In [48, 49, 50], parallel structures are discussed. Jin et al. [51] presented a pipelined virtual camera configuration for increasing the performance of several low-level processing algorithms. Leon-Salas et al. [52] proposed to tile multiple lower-resolution, lower-cost embedded smart cameras instead of fabricating a higher-resolution sensor as a single chip. This scheme has lower read-out bandwidth, lower read-out circuit complexity, lower cost and higher robustness compared to a single-chip high-resolution sensor.

3.2 Convolution operations in FPGA

Image convolution has been and continues to be one of the fundamental processing elements in any digital image processing system. Image convolution, often denoted as two-dimensional FIR filtering, applies the following mathematical equation to every pixel of an image frame:

h_{x,y} = \sum_{i=-k/2}^{k/2} \sum_{j=-q/2}^{q/2} w_{i,j} \, a_{x+i,\,y+j}    (3.1)

where k×q is the size of the convolution kernel, a_{x,y} is an input pixel value, h_{x,y} is an output pixel value, and w_{i,j} is a coefficient of the convolution.
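As an illustration, the following is a minimal C sketch of Equation 3.1 executed directly in software; the row-major image layout and the skipping of border pixels are simplifying assumptions made here and are not part of the referenced hardware designs.

#include <stdint.h>

/* Direct software implementation of Equation 3.1 for an odd-sized k x q kernel.
 * img and out are width x height images stored in row-major order; border
 * pixels for which the kernel does not fit inside the image are skipped. */
void convolve2d(const uint8_t *img, int32_t *out, int width, int height,
                const int32_t *w, int k, int q)
{
    for (int y = q / 2; y < height - q / 2; y++) {
        for (int x = k / 2; x < width - k / 2; x++) {
            int32_t acc = 0;
            for (int i = -k / 2; i <= k / 2; i++)       /* horizontal kernel index */
                for (int j = -q / 2; j <= q / 2; j++)   /* vertical kernel index   */
                    acc += w[(i + k / 2) * q + (j + q / 2)] *
                           img[(y + j) * width + (x + i)];
            out[y * width + x] = acc;
        }
    }
}

For a 3×3 kernel this performs 9 multiplications and 8 additions per pixel; the FPGA implementations discussed below perform these operations in parallel and only keep a few image lines in buffers.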

For many digital cameras, data is sent from the camera as a serial bit stream. Each pixel is clocked out of the camera, and into the FPGA, in a raster pattern. In the absence of large frame buffers, it was decided to capitalize on the serial nature of the data and implement the image processing algorithms so that they can provide a useful result every clock cycle.

K. Wiatr and E. Jamro [46] presented an example of a 3×3 convolver for 512×512 (8-bit) pixel images, as shown in Figure 3.1. This convolver requires 3×2 pixel buffers (the Z−1 blocks in Figure 3.1), which can be implemented as 3×2×8 flip-flops. This method buffers two lines of pixel data, and this structure was used for the majority of the algorithms implemented.

The Look-Up-Table (LUT) based convolution and multiplierless convolution implementations in FPGA structures have also been investigated in [46] in order to reduce the processing time of each arithmetic block. The choice between these architectures depends on the given coefficient values.

Figure 3.1: Shift Register Buffer Implementation [46]

Instead of this line buffering method, a new method called the block buffering method is proposed in [47]. It can greatly reduce the buffer size while still keeping the data loading redundancy low. Figure 3.2 gives an illustration of the block buffering method; in this example the window size is 3×4 and the block size is 4×6. The disadvantage of the block buffering method is that it needs external memory to store image data. In this example, the external memory should initially store at least 4 lines of image data, which introduces more latency for each pixel.

Yu and Leeser's work [47] presented a simple design flow for window-based image processing applications. They also proposed a way to fully utilize the available resources on the board by finding three upper bounds according to area constraints, memory bandwidth constraints and on-chip memory constraints.


Figure 3.2: Block Buffering Method Example [47]

3.3 Implementing the low-level processing algorithms for the TRIP code

By analyzing all the image processing stages of the TRIP code detection algorithm, the adaptive thresholding and binary edge detection stages can be implemented successfully in the FPGA, because these two stages are basically window-based operations and do not have many data dependencies. According to [53], the first two stages take over 75% of the total time when they are done in software, so implementing these two stages in hardware would significantly reduce the processing time. The third stage (edge following and filtering) also takes much time. It is possible to implement this stage in hardware, but the complexity of its control makes it harder to implement. Ellipse fitting is very computationally intensive and needs floating point operations, so this stage is preferably done by a processor. The resulting processing pipeline is shown in Figure 3.3. The focus in the remainder of this chapter is on how to implement the low-level processing algorithms for the TRIP code on the FPGA.

Figure 3.3: Proposed processing block diagram

3.3.1 Adaptive Thresholding Implementation

The local mean value can be computed using a window-based convolution. However, the computational complexity increases rapidly when the window size increases, and from the Matlab simulation results, the window size for adaptive thresholding should be larger than 10 in order to get a high detection rate. Therefore the integral image is used to calculate the mean value of a window. In addition, the window size should preferably be a power of two in order to avoid a division operation. For instance, a window size of 16×16 is a good choice for the adaptive thresholding stage.


The concept of the integral image was introduced in Chapter 2. This section only describes how to calculate the local threshold using the integral image.

Calculate Integral Image

To calculate the integral image efficiently, neighboring entries are used. Referring to Figure 3.4(a), the integral image entry at point d, denoted as I(d), is calculated as:

I(d) = I(b) + I(c) - I(a) + P(d)    (3.2)

where points a, b, c are direct neighbors of point d in the integral image, I(a), I(b), I(c) are the already computed integral image values at a, b, c, and P(d) is the pixel value at the equivalent location d in the original image. In this way, two additions and one subtraction are needed for each pixel. By using the column integral image as in Equation 2.4, only two additions are needed per pixel, but more memory is needed to store the column integral image.
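A minimal C sketch of this incremental construction (Equation 3.2) is shown below; the 32-bit entry width and the row-major layout are assumptions made here for simplicity (32 bits is more than the 27 bits derived later in this section for a 640×480 8-bit image).

#include <stdint.h>

/* Build the integral image in raster order using Equation 3.2:
 * I(d) = I(b) + I(c) - I(a) + P(d), where b is the entry above d, c the
 * entry to the left of d and a the diagonal neighbour; entries outside
 * the image are treated as 0. */
void integral_image(const uint8_t *p, uint32_t *ii, int width, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            uint32_t above = (y > 0) ? ii[(y - 1) * width + x] : 0;
            uint32_t left  = (x > 0) ? ii[y * width + (x - 1)] : 0;
            uint32_t diag  = (x > 0 && y > 0) ? ii[(y - 1) * width + (x - 1)] : 0;
            ii[y * width + x] = above + left - diag + p[y * width + x];
        }
    }
}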

When the sum of all pixel values within a rectangular window bounded by the four points e, f, g, h in Figure 3.4(b) is required, the sum S is calculated by:

S = I(h) - I(g) - I(f) + I(e)    (3.3)

The mean value of the local window is the sum of all pixels in the window divided by the window size. If the window size is a power of two, the division can be done by shifting to the right, which has zero cost in hardware.

Figure 3.4: Integral image construction and use
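To make the use of Equation 3.3 concrete, the C sketch below computes the 16×16 local mean from the integral image above and applies a threshold of the form "pixel < mean − C"; this mean-minus-constant rule is only an assumed simple variant, since the exact rules (Equations 2.7 and 2.8) are defined in Chapter 2.

#include <stdint.h>

/* Sum of the w x w window with top-left corner (x,y), using Equation 3.3
 * on the integral image ii produced by integral_image(). */
static uint32_t window_sum(const uint32_t *ii, int width, int x, int y, int w)
{
    uint32_t h = ii[(y + w - 1) * width + (x + w - 1)];                 /* bottom-right */
    uint32_t g = (x > 0) ? ii[(y + w - 1) * width + (x - 1)] : 0;       /* bottom-left  */
    uint32_t f = (y > 0) ? ii[(y - 1) * width + (x + w - 1)] : 0;       /* top-right    */
    uint32_t e = (x > 0 && y > 0) ? ii[(y - 1) * width + (x - 1)] : 0;  /* top-left     */
    return h - g - f + e;
}

/* Adaptive threshold with a 16x16 window; the division by 256 is a shift.
 * bin is set to 1 for background pixels and 0 for foreground pixels. */
void adaptive_threshold(const uint8_t *p, const uint32_t *ii, uint8_t *bin,
                        int width, int height, int c)
{
    const int w = 16;
    for (int y = 0; y + w <= height; y++)
        for (int x = 0; x + w <= width; x++) {
            int mean = (int)(window_sum(ii, width, x, y, w) >> 8);   /* sum / 256 */
            int cx = x + w / 2, cy = y + w / 2;                      /* window centre */
            bin[cy * width + cx] = (p[cy * width + cx] < mean - c) ? 0 : 1;
        }
}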

Once the local mean value is available, the simple local thresholds of Equations 2.7 and 2.8 can be computed directly from it. If Sauvola's thresholding method is used, the standard deviation should also be calculated in a similar way, but then the expensive square root operation cannot be avoided and the memory requirement increases as well.

Memory Analysis

Using the integral image reduces the computational complexity from O(w²MN) to O(MN), where M×N is the image size and w is the window size. But it has disadvantages as well. One disadvantage of using the integral image is that it requires a large amount of memory to store the integral image: if the resolution of the image is large, more bits are required to represent one pixel value in the integral image.

Assume the resolution of the input image is 640×480 and each input pixel uses 8 bits for its grayscale value. The total number of bits needed to store one pixel of the integral image without losing any precision is then:

N = \lceil \log_2(640 \times 480) \rceil + 8 = 19 + 8 = 27 bits

Assuming the window size is 16×16, at least 16 lines of integral image data need to be stored, which requires 16 × 640 × 27 = 276480 bits of memory.

For calculating the standard deviation, the corresponding integral image needs 35 bits per pixel, so an extra 16 × 640 × 35 = 358400 bits of memory is needed to store 16 lines of that integral image.


The memory needed increases if the resolution or the window size increases. For large images, or when memory is scarce, the whole image can be divided into multiple blocks; the integral image and local mean value are then calculated per block. This reduces the memory requirement, but the pixels near the edges of each block must be handled carefully.

3.3.2 Binary Edge Detection Implementation

The algorithm described in Figure 2.7 for finding binary edges can easily be implemented as a typical 3×3 kernel convolution. Since the input image is a binary image, the binary edge detection can be done using only an OR gate and a multiplexer. The hardware implementation is shown in Figure 3.5.

Figure 3.5: Binary edge detection method

B(x,y) denotes the pixel value at position (x,y) in the binary image produced by the previous stage, and E(x,y) denotes the pixel value at the corresponding position (x,y) in the binary edge image.

The data-stream implementation is shown in Figure 3.6.

Figure 3.6: Shift register buffer for binary edge detection

Instead of the data-stream implementation, the block buffering method described in Section 3.2 can be used for binary edge detection. Since the binary image is obtained from the previous stage and each pixel only requires one bit, several lines of image data do not take much memory. If there is enough memory, the whole image can be stored in local memory, and processing it then takes much less time than the data-stream implementation.
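As a software reference for this stage, the C sketch below marks a foreground pixel (B = 0) as an edge pixel when at least one of its eight neighbours is a background pixel (B = 1); this interpretation of the OR-gate inputs in Figure 3.5 is an assumption.

#include <stdint.h>

/* Binary edge detection on a binary image: bin[x,y] is 1 for background and
 * 0 for foreground; edge[x,y] becomes 1 for edge pixels.  The outermost row
 * and column are skipped, as in the hardware block. */
void binary_edge(const uint8_t *bin, uint8_t *edge, int width, int height)
{
    for (int y = 1; y < height - 1; y++)
        for (int x = 1; x < width - 1; x++) {
            int neighbour_bg = 0;
            for (int dy = -1; dy <= 1; dy++)            /* OR over the 3x3 window */
                for (int dx = -1; dx <= 1; dx++)
                    if (dx != 0 || dy != 0)
                        neighbour_bg |= bin[(y + dy) * width + (x + dx)];
            /* the multiplexer: only a foreground pixel can become an edge pixel */
            edge[y * width + x] = (bin[y * width + x] == 0) ? (uint8_t)neighbour_bg : 0;
        }
}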

These methods for the adaptive thresholding and binary edge detection implementations are used in the real hardware implementation, which is discussed in detail in the following chapter.


Chapter 4

Hardware Implementation

4.1 Introduction

The PowerPC 440 processor integrated in the Virtex-5 FX130T FPGA originates from the Apple-IBM-Motorola alliance. It is a high-performance processor core and can be integrated with peripheral and application-specific macro cores using the IBM bus architecture to develop SoC solutions. The processor local bus (PLB) is a high-performance bus that provides a standard interface between the processor cores and their peripherals. Lower-performance peripherals can be attached to the on-chip peripheral bus (OPB); a bridge is needed between the PLB and the OPB to enable data transfers by PLB masters to and from OPB slaves.

The PowerPC 440 processor has one master PLB bus (MPLB) and two slave PLB buses (SPLB). The MPLB bus is always used to connect the processor core to the peripherals; the PowerPC processor behaves as a master on this bus. The two SPLB buses are optionally connected, and the processor behaves as a slave on them. In this image processing hardware design, both SPLB buses are used. The latest PLB version used in SoC design is PLB v4.6, often denoted as PLBv46, to which low-performance peripherals can be connected directly without using the OPB bus.

Unlike a hard core, the MicroBlaze is a low-cost, configurable 32-bit soft processor core designed for Xilinx FPGAs. The MicroBlaze has been widely used in the areas of data communication, telecommunication, military applications, etc. The MicroBlaze has a versatile interconnect system to support a variety of embedded applications. Its primary I/O bus is the PLB bus; vendor-supplied and third-party IPs can be connected directly to this bus (or through a PLB-to-OPB bus bridge). For access to local memory (FPGA BRAM), the MicroBlaze uses a dedicated local memory bus (LMB), which reduces loading on the other buses. User-defined co-processors are supported through a dedicated FIFO-style connection called the fast simplex link (FSL). The MicroBlaze is also able to run operating systems.

In this project, both a PowerPC system and a MicroBlaze system have been implemented for the TRIP code recognition algorithm.

4.2 System Structure

The hardware structure of the PowerPC system is shown in Figure 4.1, and the hardware structure of the MicroBlaze system is shown in Figure 4.2.

The image processing IP block is the custom IP block built for the image processing task. In the PowerPC system, it connects to both the SPLB1 and the MPLB bus of the PowerPC processor: it is a master on the SPLB1 bus, so it can initiate data bursts from and to the DDR2 memory, and it behaves as a slave on the MPLB bus. In the MicroBlaze system, the IP's slave and master ports connect to the same PLB bus, so it is both a slave and a master on that bus. The DVI controller IP is also both a master and a slave; it is used for image display. The BRAM memory block in the PowerPC system is used only for simulation purposes and can be removed from the system. The TIMER block in the MicroBlaze system provides a time reference, because the MicroBlaze core does not have an internal timer; the timer is used for profiling and for displaying images on the monitor.

Figure 4.1: PowerPC Hardware System Structure

Figure 4.2: MicroBlaze Hardware System Structure

This project mainly concerns implementing the TRIP code recognition algorithm on the FPGA, so no camera is connected to the system. The frame capture functionality is simulated by fetching image data from the DDR2 memory. The whole system works as follows: the images are stored on the compact flash (CF) card, and the processor (PowerPC or MicroBlaze) first reads the image files from the CF card and stores the data in the DDR2 memory. When all images have been loaded, the image processing IP block initiates a burst read from the DDR2 memory to fetch the image data. After the processing has finished, the processed data is transferred back to the DDR2 memory by a burst write. After all images have been processed, the processed image data is transferred to the DVI controller to display the images on a monitor.

The Xilinx software tools support simulation at the system level, but they do not have simulation models for external memories such as DDR2 and CF, so the behavior of transferring data from and to DDR2 cannot be simulated directly. That is why the "BRAM Memory" IP has been built into the PowerPC system. It is a custom IP core which implements memory using Virtex-5 block RAM. Since the DDR2 can be considered as a memory block on the MPLB bus, the behavior of transferring data from and to DDR2 can be simulated by transferring data from and to the BRAM Memory IP core. Note that although the behavior is the same, the timing of the data transfer may be different.


4.3 Master Burst

The speed of the hardware image processing in this design highly depends on the data input rate. Ideally, to achieve the fastest processing speed, new data should arrive every clock cycle. In reality, the data transfer speed is limited by the bus protocols, and the fastest way to transfer data is via the burst mechanism. Xilinx provides the master burst IP core for PLBv46. This IP core can be integrated in the custom IP block, and it provides data bursts from and to memories or other IP cores connected to the PLB bus.

The PLBv46 Master Burst is designed to provide the user with a quick way to implement a mastering interface between user logic and the IBM PLB v4.6. Figure 4.3 shows a block diagram of the PLBv46 Master Burst. The design allows for parameterization of both the master's internal data width (Native Data Width) and the PLB data width of 32, 64, or 128 bits. The transfer request protocol between the PLB and the user logic goes through the IPIC master interface, and the user logic receives data from and transmits data to the PLB via the Xilinx LocalLink interface protocol [57].

Figure 4.3: PLBv46 Master Burst Block Diagram [57]

The burst transfer support feature enables the transfer protocol for PLB fixed-length burst operations of 2 to 16 data beats. This means that if a burst transfer of more than 16 data beats is requested, every 16 data beats are transferred as a packet, and within each packet new data arrives every clock cycle. There is some idle time between the packet transfers.

Figure 4.4 shows the simulation waveform of a burst read transfer of 64 data beats. The user logic receives the data in 4 packets of 16 data beats each. More details of the master burst behavior can be found in [57].

Figure 4.4: Simulation waveform of 64 data burst read


4.4 Processing Pipeline

The image processing IP block is implemented as a pipeline, shown in Figure 4.5.

Figure 4.5: Image Processing Pipeline

The input data consists of 32-bit words containing RGB values; it has to be converted into an 8-bit grayscale image first. The pipeline uses data valid signals for its main control: for each block, the input data signal din has an enable signal din_en to indicate whether the input data is valid, and similarly the output data signal dout has an enable signal dout_en to indicate whether the output data is valid.

4.4.1 RGB to Grayscale Conversion

The block diagram of RGB2Gray block is shown in Figure 4.6.

Figure 4.6: RGB2Gray block diagram

The formula used to convert an RGB value to grayscale is gray = 0.3R + 0.59G + 0.11B [58]. For calculation within one clock cycle, three multipliers and two adders are needed. The input pixel data is 32 bits wide and contains the RGB color information; each of the red, green and blue signals is 8 bits. This conversion block has one clock cycle of delay to assign the color value signals to the correct multipliers. Overall, there is a 2 clock cycle delay between input and output data, so the output enable signal also has a 2 clock cycle delay with respect to the input valid signal.
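In software, this conversion can be sketched as below, assuming the 0x00RRGGBB pixel layout used for the display output; the fixed-point weights (77, 150 and 29, which sum to 256) are an assumed approximation of 0.3, 0.59 and 0.11 that turns the scaling into a shift, whereas the hardware block itself uses three multipliers and two adders on the 8-bit colour channels.

#include <stdint.h>

/* Convert one 32-bit 0x00RRGGBB pixel to an 8-bit gray value using
 * gray = 0.3R + 0.59G + 0.11B, approximated with fixed-point weights. */
static inline uint8_t rgb_to_gray(uint32_t pixel)
{
    uint32_t r = (pixel >> 16) & 0xFF;
    uint32_t g = (pixel >> 8)  & 0xFF;
    uint32_t b =  pixel        & 0xFF;
    return (uint8_t)((77u * r + 150u * g + 29u * b) >> 8);  /* 77/150/29 ~ 0.30/0.59/0.11 */
}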

4.4.2 Integral Image Calculation

The data flow diagram of the integral image calculation block is shown in Figure 4.7. The block has one line buffer and several registers. There are three input delay registers, because the corresponding pixel data (buf_out_after) that is read from the line buffer has a 3 clock cycle delay with respect to the input signal. The line buffer is implemented as a FIFO using one block RAM of the FPGA.


Figure 4.7: Integral Image Calculation Block Diagram

The control of this block works as follows. The input data signal (din) is delayed by 3 clock cycles before it reaches the ALU, and the input valid signal (din_en) has multiple delayed versions for internal control use. The buf_in signal is the input of both the line buffer and the ALU. When one line of image data has been loaded into the buffer, the control starts to read data from the buffer. The data read from the buffer is registered, so the signal buf_out_after has one clock cycle of delay with respect to the buffer read data. When one line of data has been processed, the buf_in, buf_out_after and corner signals have to be cleared. A counter is used to indicate when to clear these signals; it counts from 0 to 639, which is the number of pixels per line minus one, and when the count is 0 the buf_in, buf_out_after and corner signals are cleared. The read enable signal is triggered when the first line of data has been loaded into the FIFO. The FIFO provides a write count signal (wrcount) and a read count signal (rdcount) to tell the control logic how much data has been loaded. The FIFO read enable signal buf_rd_en equals the din_en signal delayed by one clock cycle. The FIFO write enable signal (buf_wr_en) and the output valid signal (dout_en) are both the din_en signal delayed by 4 clock cycles. The output data (dout) is the registered ALU result, so it is equal to the buf_in signal.

As Figure 4.7 shows, the valid output signal has a 4 clock cycle delay with respect to the valid input signal. The detailed control flow graph can be found in Appendix C, Figure C.1.

4.4.3 Local Mean Calculation

The local mean calculation block is quite simple, but it requires more memory space since about 16 lines of data have to be buffered. The block diagram is shown in Figure 4.8.

Figure 4.8: Local Mean Calculation Block Diagram


The large buffer is implemented as a FIFO created with the Xilinx CORE Generator tool [59]. The CORE Generator builds the large FIFO using block RAM of the FPGA, so the generated FIFO is easy to use. It provides a data count signal indicating how much data is in the FIFO, which can be used by the control logic.

The control flow of the local mean calculation block is more complex than that of the integral image calculation block, but the control of the FIFO is similar: it also uses multiple delayed versions of the input signals for control purposes. The control flow graph can be divided into two parts: the FIFO control part, shown in Appendix C, Figure C.2(a), and the output control part, shown in Appendix C, Figure C.2(b).

For the FIFO control part, when the FIFO has loaded a certain amount of data (16 lines of data minus 16 pixels), the control starts to read data from the FIFO.

For the output control part, since the window size is 16×16, the first valid output starts at the 17th pixel of the 17th line. For each line from the 17th line up to the last line, the valid output data ranges from the 17th pixel to the 640th pixel. Different counters are used for this output valid control. More detailed descriptions can be found in Appendix C.

Because this hardware structure simply ignores the border pixels, an input image of 640×480 is reduced to an output image of 624×464. From the block diagram of this block, it has 3 clock cycles of latency, which means that the last valid output data has a 3 clock cycle delay with respect to the last valid input data.

4.4.4 Gray Data FIFO

The main purpose of this block is to align the gray value of a pixel with its local mean value, so that the next block receives correct input data. For instance, the first valid output of the local mean calculation block is produced at pixel position (17,17), but this value is the local mean of the pixel at position (9,9). So the first valid output of this block should be the gray value at position (9,9), which means the FIFO needs to store about 8 lines of data. This FIFO is also built with the Xilinx CORE Generator.

Figure 4.9 shows that, for an input frame of size 640×480, the valid output pixels lie within the yellow rectangle. The size of the yellow rectangle corresponds to the output image size of the local mean calculation block (624×464).

The top level of the gray data FIFO block is shown in Figure 4.10. The control of the FIFO works as follows. The control block uses a counter to keep track of the row and column of the input pixel data; when the row number is between 9 and 472 and the column number is between 9 and 632, the control writes data into the FIFO. The control block uses the mean_out_en signal to decide when to read out the FIFO data. The mean_out_en signal is the output enable signal of the local mean calculation block and is an input of the gray data FIFO block. The FIFO read enable signal is the mean_out_en signal delayed by one clock cycle, so the valid output data of this block has a 2 clock cycle delay with respect to the corresponding local mean data. This delay is handled in the adaptive threshold block.

Figure 4.9: Area of valid pixel data

Figure 4.10: Top level gray data FIFO block


4.4.5 Adaptive Threshold

The adaptive threshold block is shown in Figure 4.11.

Figure 4.11: Adaptive Threshold Block

C is a constant which is stored in one of the custom IP slave registers. These slave registers are software-accessible, so it is easy to change this constant value from software during real tests. The gray pixel data arrives with a 2 clock cycle delay with respect to the corresponding local mean output, so there are two registers on the local mean value input. The control of this block is relatively easy to implement: the valid output data has one clock cycle of delay with respect to the valid gray FIFO data input, so the data out enable signal also has one clock cycle of delay with respect to the gray data FIFO output enable signal. The output data signal can be set to 32 bits in order to send it back to the DDR2 memory for display; the one-bit output signal is connected to the binary edge detection block. In total, this block introduces 3 clock cycles of latency.

4.4.6 Binary Edge Detection

The Binary Edge Detection block is shown in Figure 4.12.

Figure 4.12: Binary Edge Detection Block Diagram

The input signal is the 1-bit output signal of the adaptive threshold block. The input has three delay elements for FIFO control purposes, similar to the other hardware blocks explained above. There are two line buffers, built with the CORE Generator; their data width is set to 1 bit to reduce BRAM memory usage. The control of this block is similar to the control part of the local mean calculation block: it controls reading and writing of the two FIFOs, and it also controls the output enable signal.

For the FIFO read and write enable signals, all input data should be put into the FIFOs. After the first line of input data has been loaded into the first FIFO, both FIFO write enable signals equal the din_en signal delayed by 2 clock cycles, and both FIFO read enable signals equal the din_en signal delayed by one clock cycle. For the output enable signal control, a counter keeps track of the position of the input pixel data. When the column number is between 2 and 463 and the row number is between 2 and 623, the data out valid signal is equal to the din_en signal delayed by 4 clock cycles.

The algorithm used for binary edge detection can be considered as a 3×3 sliding window, so the valid output of this block is smaller than the input image. In this design, the input image is the output of the local mean calculation block, whose size is 624×464, so the output image size of the binary edge detection is 622×462. This means that all border pixels are ignored in the output.

Since the software implementation needs both the adaptive threshold and the binary edge detection result, it is better to combine these two results, so that the image processing IP only needs to send back one image instead of two. The Ctrl Block in Figure 4.12 implements the logic to combine these two result images. There are 2 output signals in this block: the 32-bit output is used to transfer the processed image data back to the DDR2 memory, and the 1-bit output is reserved for future use.

The B(x,y) signal value stores the pixel's adaptive threshold information. The method used to embed this information in the 32-bit output is as follows:

• If B(x,y) is 1, which means it is a background pixel in the adaptive threshold output image, set the output pixel to black (0x00000000).

• If B(x,y) is 0 and the output of the OR gate is 1, which means it is an edge pixel, set the output pixel to green (0x0000FF00).

• If B(x,y) is 0 and the output of the OR gate is 0, which means it is not an edge pixel but a valid pixel in the adaptive threshold result, set the output pixel to red (0x00FF0000).
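The combination rule of the Ctrl Block can be summarized by the following small C sketch, where b is the B(x,y) bit and or_out is the OR-gate output for the current pixel.

#include <stdint.h>

/* Combine the adaptive threshold bit (1 = background) and the OR-gate
 * output (1 = edge according to the rule above) into one display pixel. */
static inline uint32_t combine_pixel(int b, int or_out)
{
    if (b)           return 0x00000000u;   /* background pixel: black   */
    else if (or_out) return 0x0000FF00u;   /* edge pixel: green         */
    else             return 0x00FF0000u;   /* non-edge foreground: red  */
}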

Using this method, the adaptive threshold result and the binary edge detection result are combined properly, and the combined image can be used by the software implementation. Figure 4.13 shows the output for one test image from the ML510 board.

Figure 4.13: Combined Result of Adaptive Threshold and Binary Edge Detection

In total, the binary edge detection block introduces 5 clock cycles of latency, which means the last valid output data has a 5 clock cycle delay with respect to the last valid input data.


4.4.7 Inter-connection of all blocks

The inter-connection of all blocks is shown in Figure 4.14.

Figure 4.14: Inter-connection of pipeline blocks

The clk port of all blocks is connected to the system bus clock. The rst port of all blocks is connected to a custom reset signal. The reset signal of a FIFO instantiated in Virtex-5 BRAM needs to be asserted for at least 3 clock cycles; the custom reset signal therefore goes high for 5 clock cycles after each frame is finished, just to make sure that the image processing IP is correctly reset. The number in each block shows the latency it introduces.

By adding up all the numbers in Figure 4.14, we can see that the whole pipelined image processing IP block is very fast: it introduces only 17 clock cycles of latency in total, which means that after the last valid input data has arrived, the last valid output data is produced within 17 clock cycles.

4.5 Summary

This chapter described the hardware implementation of the TRIP code recognition system. It showed the hardware structure of both the PowerPC and the MicroBlaze system and described in detail the newly designed image processing IP block, which is used for the adaptive thresholding and binary edge detection stages. The resources used by the system and by the custom IP block are shown in Chapter 6 as part of the performance evaluation. Other hardware blocks related to the software implementation, such as the floating point unit, are introduced together with the software implementation in the next chapter.


Chapter 5

Software Implementation

5.1 Introduction

The hardware implementation covers the stages up to binary edge detection, so the remaining stages have to be implemented in software. These remaining stages are Edge Following and Filtering, Ellipse Fitting, Concentric Test and Code Recognition. The algorithms of all these stages have already been explained in Chapter 2. In the software implementation, each stage is implemented in C as a function or as a set of routine functions. Figure 5.1 shows the block diagram of the software of the whole system implementation.

Figure 5.1: Whole system software block diagram

The first two blocks are handled by the hardware implementation; from the software point of view, these two stages are basically a data transfer from one address to another. The remaining six blocks form the actual software implementation of the TRIP code recognition. The implementation strategy is to first make all blocks work separately and then combine them. C source code is written for each block as a function. This chapter first discusses adding a floating point unit to the hardware structure, which is necessary for the software implementation, and then discusses the issues encountered in implementing each block.

5.2 Adding Floating Point Unit

The algorithm used makes it necessary to use floating point data types in the calculations, so a floating point unit (FPU) should be added to the hardware system. The Virtex-5 auxiliary processor unit (APU) FPU is an optimized FPU designed for the PowerPC 440 embedded microprocessor of the Virtex-5 FXT FPGA family. The FPU implementation provides support for IEEE-754 floating-point arithmetic operations in single or double precision [60]. Another advantage of adding the APU is that it can accelerate the software application: Pellerin et al. [61] showed that with an APU, the Virtex-4 FX FPGA could accelerate the software application more than 10 times compared to the system without an APU.

In the software implementation, the APU is added to the hardware system and the PowerPC microprocessor is configured to enable floating point arithmetic.

For the MicroBlaze system, an FPU is also added, but it only supports single precision floating point operations.

5.3 Debugging Methods

For implementing the software application, debugging is very important and takes most of the time. In this project, multiple debugging methods are used.

The first method is to use the UART port to print the desired information on the serial console. The ML510 has two UART ports which can be connected to the PowerPC or the MicroBlaze core; in this project, only one UART is used. The UART port is the main connection between the PC and the ML510 board.

The second way of debugging is to use the monitor to display images. This is a more direct way of debugging, since it shows the image representation of the data sets. The monitor can be connected to the ML510's DVI port and is able to show color images; the DVI output can show RGB images of up to 24 bits, that is, 8 bits for red, 8 bits for green and 8 bits for blue.

The third and most important method is to test the C code on the local PC first. Running and debugging the code on the local PC first ensures that the source code functions properly. There are many other factors that can affect the result of running the software on the ML510 board, for instance the optimization level of the compiler and the serial port output speed. In the software implementation, all the source code of each block is first run on the local PC. When a problem is found after combining all the blocks on the board, intermediate results are ported back to the local PC for debugging, because the local PC reports system exceptions (such as divide by zero, overflow, etc.), while the board's default exception handler simply stops execution or restarts from the beginning when an exception occurs.

5.4 Implementation Issues

Different issues were encountered when implementing the software blocks. This section discusses these issues and describes, block by block, what has been done to deal with them.

5.4.1 Edge Following and Filtering

This stage uses exactly the algorithm described in Chapter 2. One issue is that one cannot see which contours are found to be valid, because the algorithm clears all edge pixels that have been processed. The solution is to store all pixels of a contour: if the contour is valid, these contour pixels are set to some other color (e.g. white), otherwise they are discarded. After processing, the white pixels thus mark the valid contours.

5.4.2 Ellipse Fitting

The main issue for ellipse fitting is the calculation of the valid eigenvector of a 6×6 matrix. The Jacobi method [62] is used for calculating the eigenvectors of a 6×6 real-symmetric matrix, because this method is accurate and fast for small matrices.

While calculating the eigenvectors, warnings pop up in Matlab saying that the scale is too small, so that the result loses precision. That is because, following Equation 2.11, the data we need is the eigenvector of the matrix S^{-1}C; since S is always large, S^{-1} can be very small.

This problem is solved as follows. Since S is a real-symmetric matrix, it can be decomposed as S = LL^T, where L is a lower triangular matrix; this is called the Cholesky decomposition [63]. We then calculate the only valid eigenvector of the matrix L^{-1}CL^{-T}, denoted as e. The desired eigenvector of S^{-1}C is then equal to L^{-T}e. The proof is given in Appendix D.
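This can be checked directly (the full proof is given in Appendix D): if L^{-1}CL^{-T}\vec{e} = \lambda\vec{e}, then

S^{-1}C\,(L^{-T}\vec{e}) = (LL^{T})^{-1}C\,L^{-T}\vec{e} = L^{-T}\left(L^{-1}CL^{-T}\vec{e}\right) = L^{-T}(\lambda\vec{e}) = \lambda\,(L^{-T}\vec{e}),

so L^{-T}\vec{e} is indeed an eigenvector of S^{-1}C with the same eigenvalue.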

In order to implement the Jacobi method, many routine functions need to be written, such as Cholesky decomposition, matrix inversion and matrix multiplication. The matrix inversion function uses Gaussian elimination and only needs to process the lower triangle of the input matrix, because the input matrix is lower triangular. The algorithm for Cholesky decomposition is relatively simple, but note that it is only valid for positive definite matrices. For an ellipse-like contour, the S matrix is decomposed successfully; if the decomposition fails, it means the detected contour is not a valid ellipse.

Some other issues arise in the derivation of the ellipse's long and short axes. Appendix B gives the mathematical derivation. Equation B.13 depends on the scale of a, b and c, but since the eigenvector is obtained by a numerical method, it always contains some error. Because a, b and c are on the order of 10^{-5} and they appear in the denominator of a fraction, even a tiny error may result in a large mistake after the calculation. The way to solve this is to put the same amount of error into the numerator of the fraction in Equation B.13. From Equations B.8 and B.5 we get

f = a x_0^2 + c y_0^2 + b x_0 y_0 - 1, \quad \text{so} \quad 1 = -(f - a x_0^2 - c y_0^2 - b x_0 y_0)    (5.1)

Mathematically speaking, we just multiply E_a and E_b by 1, but the computed results are very different. The new derivation equations used in the implementation are:

E_a = \sqrt{\frac{-2(f - a x_0^2 - c y_0^2 - b x_0 y_0)}{a + c - \sqrt{(a-c)^2 + b^2}}}, \qquad E_b = \sqrt{\frac{-2(f - a x_0^2 - c y_0^2 - b x_0 y_0)}{a + c + \sqrt{(a-c)^2 + b^2}}}    (5.2)
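A direct C transcription of Equation 5.2 could look as follows; the coefficients a, b, c, f and the centre (x0, y0) are assumed to come from the ellipse fitting stage following the conventions of Appendix B, which is not reproduced here.

#include <math.h>

/* Long and short semi-axes of the fitted ellipse, computed per Equation 5.2
 * so that the numerical error in the conic coefficients partly cancels. */
void ellipse_axes(double a, double b, double c, double f,
                  double x0, double y0, double *ea, double *eb)
{
    double num  = -2.0 * (f - a * x0 * x0 - c * y0 * y0 - b * x0 * y0);
    double root = sqrt((a - c) * (a - c) + b * b);
    *ea = sqrt(num / (a + c - root));
    *eb = sqrt(num / (a + c + root));
}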

5.4.3 Synchronization Sector Detection

As described in Chapter 2, the code recognition stage has two steps: identifying the synchronization sector and decoding the barcode identifier. In the first step, a rotation angle of 15° is used to make sure the check points lie in the middle of each sector. Since 15° is 2/3 of 22.5°, two situations can occur:

• Situation 1: the synchronization sector has been found, and the next iteration is not a synchronization sector. This means the check points lie in area b of Figure 5.2, i.e. in the middle area of the sector, so these check points can be used for the decode step.

• Situation 2: the synchronization sector has been found, and the next iteration is also found to be a synchronization sector. This means the current check points lie in area c and the previous check points lie in area a. The check points then need to be rotated back by 7.5° so that they fall in area b and can be used in the decode step.

Figure 5.2: Sector distribution for check points

In the real implementation, false detection of the synchronization sector is a problem. The algorithm to find a synchronization sector is as follows:

• If situation 2 is met, the returned check points are considered to indicate the real synchronization sector.

• If situation 1 is met, the search for a synchronization sector continues. If situation 2 is never met, the first check points of situation 1 are returned as the synchronization sector (see the sketch below).
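The search loop can be sketched in C as follows; is_sync_sector() is a hypothetical helper that evaluates the check points after rotating them by a given angle and reports whether they match the synchronization pattern, so this sketch only illustrates the handling of the two situations above.

/* Hypothetical predicate: do the check points match the synchronization
 * pattern after a rotation of 'angle' degrees? */
extern int is_sync_sector(double angle);

/* Returns the rotation angle (degrees) that puts the check points in the
 * middle of the synchronization sector, or -1.0 if no sector is found. */
double find_sync_sector(void)
{
    double first_hit = -1.0;
    for (int i = 0; i < 24; i++) {               /* 24 steps of 15 degrees = 360 degrees */
        double angle = i * 15.0;
        if (is_sync_sector(angle)) {
            if (is_sync_sector(angle + 15.0))    /* situation 2: two hits in a row  */
                return angle + 15.0 - 7.5;       /* rotate back 7.5 deg into area b */
            if (first_hit < 0.0)
                first_hit = angle;               /* remember the first situation-1 hit */
        }
    }
    return first_hit;                            /* situation 1, or -1.0 if never found */
}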

35

Chapter 5. Software Implementation

5.4.4 Decode and Validate

For the decode step, 3 sample check points are used to detect each ring sector instead of only 1 check point; this improves the correct recognition rate. Referring to Figure 2.13, the check points for the inner ring lie at radii of 1.3r, 1.4r and 1.5r, and the check points for the outer ring lie at radii of 1.9r, 2.0r and 2.1r, where r is the radius of the outermost circle of the bull's eye. The rule used is: if one of the three check points is black, the sector is black; otherwise, the sector is white. In this step, if another synchronization sector is detected, the ternary digit is written as "4" and the invalid flag is set to "1" to indicate that this code is not valid.

The code is also checked against the parity bits to see whether it is valid or not. The first parity bit makes the total number of '1's even, and the second parity bit makes the total number of '2's even. If the code does not pass the parity check, the invalid flag is also set to "1". The definition of the parity bits implies that the first parity bit can never be '2' and the second parity bit can never be '1'; if one of these situations occurs, the code's invalid flag is set to "1" as well.
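These checks can be summarized in a small C sketch; the digit layout assumed here (data digits followed by the two parity digits, with the value 4 marking a falsely detected synchronization sector) is only illustrative, the exact code layout being defined in Chapter 2.

/* Validity check of a decoded TRIP code given as n ternary digits (0..2),
 * where digits[n-2] and digits[n-1] are assumed to be the parity digits
 * and a value of 4 marks an extra synchronization sector. */
int code_is_valid(const int *digits, int n)
{
    int ones = 0, twos = 0;
    for (int i = 0; i < n; i++) {
        if (digits[i] == 4) return 0;            /* extra sync sector detected   */
        if (digits[i] == 1) ones++;
        if (digits[i] == 2) twos++;
    }
    if (digits[n - 2] == 2 || digits[n - 1] == 1)
        return 0;                                /* forbidden parity digit values */
    return (ones % 2 == 0) && (twos % 2 == 0);   /* both counts must be even      */
}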

The implementation shows that false detection of the synchronization sector does harm the correct recognition rate. The algorithm used in the synchronization sector detection block solves this problem in many cases, but an additional algorithm around the decode block is needed to improve the correct recognition rate further.

This new algorithm for obtaining the correct TRIP code is implemented as a loop; the simplified state diagram is shown in Figure 5.3.

Figure 5.3: Simplified Decode Algorithm

After the decode stage, there are multiple ways (such as the invalid flag check and the parity bit check) to determine whether the code is valid. If the code is valid, the next stage displays the code. If the code is invalid, another synchronization sector search is performed, starting from the previously found synchronization sector position. If the newly found synchronization sector is the same as the first one, the invalid flag of this code is set to '1' and the loop is left. Otherwise, the newly found synchronization sector is used to decode again, and the new code is checked for validity. Using this algorithm, the false detection problem has been solved and the correct recognition rate has been improved. More details about the recognition rate follow in the next chapter.


Chapter 6

Performance Evaluation

The TRIP code recognition system has been implemented on the Xilinx ML510 board. To evaluate the performance of this system, the correct recognition rate and the processing speed are the two crucial aspects we focus on. The resource utilization of the system is also listed at the end of this chapter to show how large the whole system is.

The correct detection rate is measured using a set of test pictures. The same set of pictures is tested by the PowerPC and MicroBlaze systems running on the ML510 board and by Matlab running on a local PC. The results are compared to measure the recognition rate of correct TRIP codes.

For the PowerPC system, the processing speed is measured using the PowerPC internal timer facilities. The PowerPC processor (PPC440) provides four timer facilities: a time base, a decrementer, a fixed interval timer, and a watchdog timer. These facilities, which share the same source clock frequency, can support various software timing functions [64]. In this TRIP code recognition system, the time base facility is used to measure how much time each processing block takes. The time base is a 64-bit register which increments once during each period of the source clock and thus provides a time reference. A few clock cycles are required to access the register, but this time is negligible in view of the roughly two hundred million clock cycles needed to process one image. The time base register does not need to be reloaded, because it can run for over 1900 years at a frequency of 300MHz without overflowing.

For the MicroBlaze system, the processing speed is measured using the xps_timer IP [65] attached to the PLB bus. This timer can also be considered as an incrementing register providing a time reference, but the timer register is only 32 bits long, which means it can only run for about 34 seconds at a frequency of 125MHz without overflowing. For profiling the MicroBlaze system, the timer register therefore needs to be reloaded after processing each image.
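Both overflow figures follow directly from the register widths and clock frequencies:

\frac{2^{64}}{300 \times 10^{6}\,\mathrm{Hz}} \approx 6.1 \times 10^{10}\,\mathrm{s} \approx 1950\ \mathrm{years}, \qquad \frac{2^{32}}{125 \times 10^{6}\,\mathrm{Hz}} \approx 34.4\,\mathrm{s}.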

6.1 Recognition Rate

The test result images are shown in Appendix E, Figure E.1 to Figure E.22. Based on those result images, the recognition results are summarized in Table 6.1.

The "Valid Tag" column in the table indicates how many detectable TRIP tags there are in each image. Theoretically, a detectable TRIP tag means that the bull's-eye area of the tag covers more than a certain number of pixels; if the TRIP tag is too small in the image, the algorithm is not able to detect the tag correctly. Based on experience with multiple images, 25 pixels is a good threshold. But since the image can be quite blurred at the edges, when the tag is neither very big nor very small one cannot tell exactly how many pixels the bull's eye covers. So the number 25 is a rough estimate, and whether a tag is detectable remains somewhat subjective. In the testing of all images, a tag is considered detectable if it can be recognized by eye without too much effort; only very small or very blurred tags are considered undetectable.

The TRIP code recognition system has also been implemented in Matlab, which is considered the reference system for the recognition rate. Based on the test results, the recognition rate of the PowerPC and MicroBlaze systems is as good as the Matlab result: of a total of 26 valid tags, the Matlab system correctly detects 22 (84.6%), and the PowerPC and MicroBlaze systems give exactly the same result.


Test Image | Valid Tag | Detected (Matlab) | Correctly Detected (Matlab) | Detected (PPC) | Correctly Detected (PPC) | Detected (MB) | Correctly Detected (MB)
image01    | 7  | 6  | 6  | 6  | 6  | 6  | 6
image02    | 1  | 1  | 1  | 1  | 1  | 1  | 1
image03    | 2  | 2  | 2  | 2  | 2  | 2  | 2
image04    | 1  | 1  | 1  | 1  | 1  | 1  | 1
image05    | 2  | 1  | 1  | 1  | 1  | 1  | 1
image06    | 3  | 2  | 2  | 2  | 2  | 2  | 2
image07    | 3  | 2  | 2  | 2  | 2  | 2  | 2
image08    | 1  | 1  | 1  | 1  | 1  | 1  | 1
image09    | 3  | 3  | 3  | 3  | 3  | 3  | 3
image10    | 1  | 1  | 1  | 1  | 1  | 1  | 1
image11    | 2  | 2  | 2  | 2  | 2  | 2  | 2
total      | 26 | 22 | 22 | 22 | 22 | 22 | 22

Table 6.1: TRIP code Recognition Result

6.2 Processing Speed

For a smart camera system, the processing speed is a crucial performance aspect. The profiling results of the Matlab system, the PowerPC system and the MicroBlaze system are listed in this section.

Matlab profiling is done on a computer with a Pentium(R) Dual-Core T4300 CPU (both cores running at 2.10GHz) and 4GB RAM running Windows XP, using the Matlab Profiler tool to measure the CPU time consumed by each stage.

PowerPC system profiling is performed on the ML510 board with one PPC440 core running at 300MHz and the PLB bus running at 100MHz, using the PPC internal time base facility to measure the consumed time.

MicroBlaze system profiling is performed on the ML510 board with one MicroBlaze core running at 125MHz and the PLB bus running at 125MHz, using the XPS Timer IP connected to the PLB bus to measure the consumed time.

The profiling data of the Matlab system is shown in Table 6.2.

Algorithm stage              | Average (ms) | Min (ms) | Max (ms) | Percentage (average) | Percentage (min) | Percentage (max)
Adaptive Threshold           | 239  | 219  | 250  | 13.3% | 10.6% | 15.9%
Edge Detection               | 1065 | 1047 | 1078 | 58.9% | 45.5% | 72.2%
Edge Following and Filtering | 398  | 129  | 875  | 22.0% | 8.6%  | 36.9%
Ellipse Fitting              | 54   | 16   | 125  | 3.0%  | 1.0%  | 5.2%
Concentric Test              | 7    | 0    | 31   | 0.4%  | 0.0%  | 1.5%
Code Recognition             | 45   | 15   | 106  | 2.5%  | 0.8%  | 5.1%
Total                        | 1810 | 1472 | 2369 | -     | -     | -

Table 6.2: Matlab System Profiling Result

The time data obtained from the Matlab profiler depends on how the programmer uses built-in functions, so the real significance of the Matlab profiling result lies in the percentage each stage takes. From Table 6.2 we can see that the adaptive threshold and edge detection stages together take about 72.2% of the total time, because these two stages are data intensive.

From the PowerPC profiling result shown in Table 6.3, the hardware processing time is the same for all images. The edge following and filtering stage takes approximately the same time for all images and takes most of the processing time. That is because the algorithm used needs to fetch all pixels in the image and write back whenever an edge pixel is found. This is the bottleneck of the system and should be the focus of future optimization. The ellipse fitting stage takes about 1/3 of the total time, because the ellipse fitting algorithm needs a lot of floating point calculations.


Algorithm stage                                             | Average (ms) | Min (ms) | Max (ms) | Percentage (average) | Percentage (min) | Percentage (max)
Hardware Processing (Adaptive Threshold and Edge Detection) | 15  | 15  | 15  | 6%  | 4%   | 8%
Edge Following and Filtering                                | 172 | 156 | 184 | 62% | 50%  | 80%
Ellipse Fitting                                             | 89  | 21  | 148 | 30% | 11%  | 43%
Concentric Test                                             | 2   | 0   | 5   | 1%  | 0.1% | 1.5%
Code Recognition                                            | 3   | 0   | 3   | 1%  | 0.2% | 2.2%
Total                                                       | 281 | 195 | 363 | -   | -    | -

Table 6.3: PowerPC System Profiling Result

The PowerPC system already uses an FPU to speed up the floating point calculations, so unless the ellipse fitting algorithm itself is changed, the processing time of this stage does not leave much room for improvement. In total, the system can process about 3 images per second.

The PowerPC has an FPU which supports double precision floating point calculations, but the FPU of the MicroBlaze system only supports single precision; any double precision calculation is done by software emulation. We can therefore expect the processing time of the MicroBlaze system to be longer. The profiling data for the all-double-precision implementation is shown in Table 6.4.

Algorithm stage                                             | Average (ms) | Min (ms) | Max (ms) | Percentage (average) | Percentage (min) | Percentage (max)
Hardware Processing (Adaptive Threshold and Edge Detection) | 18   | 18   | 18   | 0.3%  | 0.2%  | 1.6%
Edge Following and Filtering                                | 204  | 182  | 223  | 3.1%  | 3.1%  | 16.2%
Ellipse Fitting                                             | 3722 | 881  | 6570 | 91.2% | 78.3% | 93.8%
Concentric Test                                             | 58   | 3    | 143  | 2.0%  | 0.3%  | 2.1%
Code Recognition                                            | 82   | 18   | 244  | 3.4%  | 0.3%  | 5.7%
Total                                                       | 4088 | 1126 | 7204 | -     | -     | -

Table 6.4: MicroBlaze Double Precision System Profiling Result

From Table 6.4, we can see that the processing time grows dramatically on the MicroBlaze system. In order to use the FPU to shorten the processing time, a single precision floating point implementation has been made; only the ellipse fitting stage is still implemented in double precision, because a single precision implementation of the ellipse fitting stage cannot handle the large range of data in each image: with single precision output from the ellipse fitting stage, the system fails to detect any tag. The profiling data for the single precision implementation is shown in Table 6.5.

Comparing the two MicroBlaze profiling results in Tables 6.4 and 6.5, we can see that with single precision the processing time of the software implementation is reduced dramatically (except for the ellipse fitting stage). That is because the edge following and filtering, concentric test and code recognition stages contain many floating point operations (especially comparisons), and these operations are now executed by the FPU instead of being emulated in software, which saves much time.

Comparing Table 6.3 with Table 6.2, we can see that the ellipse fitting stage takes about 1/7 of the time of the edge following and filtering stage in the Matlab system, but about half the time of the edge following and filtering stage in the PowerPC system. That is because the Matlab system runs at a very high clock speed (2.1GHz) while the PowerPC core runs at only 300MHz, which shows that a high processor clock frequency increases the processing speed of the software implementation.


Algorithm stage                                             | Average (ms) | Min (ms) | Max (ms) | Percentage (average) | Percentage (min) | Percentage (max)
Hardware Processing (Adaptive Threshold and Edge Detection) | 18   | 17  | 18   | 0.5%  | 0.3%  | 1.8%
Edge Following and Filtering                                | 130  | 118 | 140  | 3.5%  | 2.2%  | 12.1%
Ellipse Fitting (double precision)                          | 3500 | 831 | 6219 | 95.1% | 85.8% | 96.2%
Concentric Test                                             | 32   | 2   | 79   | 0.9%  | 0.2%  | 1.3%
Code Recognition                                            | 1    | 0   | 3    | 0.0%  | 0.0%  | 0.1%
Total                                                       | 3680 | 969 | 6462 | -     | -     | -

Table 6.5: MicroBlaze Single Precision System Profiling Result

6.3 Hardware Resource Utilization

One of the benefits of the MicroBlaze system is that it can be transferred to other FPGA boards with little effort. The resources used by the MicroBlaze TRIP code recognition system in the Virtex-5 FPGA (xc5vfx130t-ff1738-2) are shown in Table 6.6. This table shows the minimum resources needed for the hardware structure shown in Figure 4.2.

IP module                     | Slice Registers | Slice LUTs | BRAMs | DSP48Es
Image Processing IP           | 1565 | 1881 | 18 | 0
MicroBlaze core (incl. FPU)   | 1795 | 1952 | 64 | 5
DDR2 Memory Controller (MPMC) | 3707 | 2524 | 5  | 0
Other IPs                     | 2785 | 2950 | 0  | 0
Total                         | 9852 | 9307 | 87 | 5

Table 6.6: Resource Utilization of the whole MicroBlaze System

Table 6.6 shows that the total system uses many resources. The DDR2 memory controller is implemented using the multi-port memory controller (MPMC) IP, which takes most of the resources, while the image processing IP does not use many resources. The "Other IPs" entry includes the interrupt controller, UART, GPIO, SYSACE, timer, DVI controller, etc.; these are standard IPs provided by Xilinx. The MicroBlaze core itself does not use BRAM resources; the 64 BRAMs listed in the table are used as local memory for the MicroBlaze core. The 5 DSP blocks are used by the FPU.

From the table, we can calculate how many resources are used without the MicroBlaze core, which gives some indication of how large the FPGA should be if the system were moved to another development board with a different kind of processor. Be aware, however, that the LUTs in the Virtex-5 FPGA have 6 inputs, whereas many older FPGAs only have 4-input LUTs. A 6-input LUT has a capacity of 2^6 entries, which is equivalent to four traditional 4-input LUTs of 2^4 entries each. Not all designs can take advantage of the larger LUTs: for example, basic bitwise operations that need fewer than six inputs do not use a 6-input LUT up to its maximum capacity. Xilinx claims that, on average, one Virtex-5 logic cell is equivalent to 1.6 traditional logic cells [66]. This factor should be considered when determining whether a specific FPGA is large enough for the whole system.

For the PowerPC system, the minimum resources used by the PowerPC system (Figure 4.1) is shownin Table 6.7. It is a little bit different from the MicroBlaze system:

1. The PowerPC processor is a hard core which uses almost no resources in the FPGA.

2. The DDR2 memory controller in the PowerPC system is a controller specially designed for the PPC440 processor, which takes fewer resources than the DDR2 memory controller of the MicroBlaze system.


3. The PowerPC has a built-in timer, so no timer IP is needed (the timer IP in the MicroBlaze system only uses 366 slice registers and 311 slice LUTs).

4. The PowerPC system has an FPU added which supports double precision floating point operations. This FPU takes a large share of the FPGA resources.

IP module                           Slice Registers  Slice LUTs  BRAMs  DSP48Es
Image Processing IP                            1565        1881     18        0
FPU (PPC440)                                   2629        4202      0       13
DDR2 Memory Controller (PPC440)                2336        1756      2        0
Other IPs                                      2594        2638     67        0
Total                                          9124       10477     85       13

Table 6.7: Resource Utilization for the whole PowerPC System

Similar to the MicroBlaze system, the PowerPC system also uses 64 BRAMs as processor local memory. The total resources used by the PowerPC system appear larger than those of the MicroBlaze system, mainly because the FPU of the PowerPC is much larger than the MicroBlaze core.


Chapter 7

Conclusions and Recommendations

7.1 Conclusions

This project has shown that the TRIP code recognition system can be implemented successfully on the ML510 board. Both the PowerPC system and the MicroBlaze system show the same correct recognition rate as the Matlab implementation.

The implemented PowerPC system is able to process about 3 frames per second, and the implemented MicroBlaze system about 0.27 frames per second.

By comparing the processing time of the adaptive threshold and binary edge detection stages in the ML510 system and the Matlab system, we can see that implementing the low-level image processing algorithm in hardware dramatically reduces the overall processing time.

For the PowerPC system, the edge following and filtering stage is the bottleneck. This stage requires many operations to fetch and send pixels stored in the DDR2 memory, which takes a lot of time; the time needed mainly depends on the PLB bus clock frequency.

The MicroBlaze system implementation shows that the double precision floating point data type is needed for the ellipse fitting stage. Since the MicroBlaze FPU only supports single precision floating point operations, all calculations in this stage are done by software emulation, which takes a lot of time. This makes the ellipse fitting stage the bottleneck of the MicroBlaze system.

According to the resource utilization result in Table 6.6, the hardware resources needed for the whole system are quite large, especially the BRAMs. The MicroBlaze core uses 64 BRAMs for its local memory because the software is large: with the compiler optimization level set to medium (-O2), the software is about 139 KB. This prevents the system from being fitted into many other FPGAs. Comparing the data sheets of some other Xilinx FPGAs [67, 68, 69, 70], the MicroBlaze TRIP code recognition system can be fitted into the following FPGAs:

• Spartan-6 Family

  – XC6SLX100, XC6SLX100T, XC6SLX150, XC6SLX150T

• Virtex-4 Family

  – XC4VLX80, XC4VLX100, XC4VLX160, XC4VLX200, XC4VSX35, XC4VSX55

The Spartan-3 family and Virtex-II family FPGAs are not suitable for this system because they do not have enough BRAMs.

In order to solve this problem, it is better to put the software executable in external memory (e.g. Compact Flash, DDR2 or Flash). For embedded systems with a hard core (e.g. Virtex-4 FX FPGAs), Xilinx provides the ACE file feature for this purpose: if the software size exceeds the processor local memory, an ACE file can be generated that contains both the software executable and the hardware configuration file, and the processor can fetch its instructions from the ACE file stored on the Compact Flash. A shortage of BRAMs would then no longer be a problem for the software implementation.

At the beginning of this project, we intended to do more research on implementing more algorithm stages in hardware and on how this affects the behaviour of the whole system, or on implementing other


applications on this system. Unfortunately, it took me quite a lot of time to implement the TRIP code recognition system on the ML510 board because of practical issues. For instance, the Xilinx software suite is quite complex, and much time was spent learning how to use the tools. Since the Xilinx tools only support system-level simulation with internal memories, it took about a month to investigate how to simulate the system with an external memory device model (DDR2 memory). Without the correct model for the DDR2 device used on the board this attempt failed, and an alternative method (using BRAM memory) was used for the data transfer simulation. Finding the correct DDR2 model to make the full system simulation run is therefore a recommendation for future work. Furthermore, adding a UART model and a Compact Flash model would make the system-level simulation complete.

7.2 Recommendations

Because only a little time was left, some ideas from the beginning of this project are left for future study; these ideas are discussed in the following sections as recommendations.

7.2.1 For Image Processing IP

From the synthesis result, 18 BRAMs are used in the image processing IP design, mainly by the adaptive threshold stage. If the window size is made smaller, the number of BRAMs could be reduced. This can become an important optimization when using this hardware IP core in other systems or on other boards with limited BRAMs. In Lopez's PhD thesis [3], the window size is 1×10, so the window size for a future implementation could be set to 1×16 or 1×8.

7.2.2 For PowerPC System

For the PowerPC system, the processing speed is about 3 frames per second. From the profiling result in Table 6.3, the bottleneck is the edge following and filtering stage, which takes 172 ms on average. If the system structure is changed to exploit parallelism, the processing speed would increase to about 5.8 frames per second. Two PowerPC cores would be needed; a possible block diagram is shown in Figure 7.1.

Figure 7.1: Dual PowerPC System Block Diagram

In this system, the two PowerPC cores cooperate through a communication block. The image processing IP is connected to PowerPC1. The two PowerPC cores also share a control block so that both processors can access the same addresses in the DDR2 memory. PowerPC2 only performs the edge following and filtering stage, and PowerPC1 performs all the other stages. A possible timing schedule for this system is shown in Figure 7.2; in the figure, HW stands for the hardware processing stages and C. stands for the concentric test and code recognition stages.

While PPC2 is doing edge following and filtering, PPC1 is doing ellipse fitting, concentric test and code recognition for the previous frame, and the hardware implemented stages for the next frame. This timing schedule makes the system able to process about 5.8 frames per second.

But 5.8 frames per second is still not good enough. Since the edge following and filtering stage is the bottleneck of the PowerPC system, it is better to investigate how to implement this stage in hardware.


Figure 7.2: Possible Timing Graph for Dual PPC System

Figure 7.3: Block Diagram of Edge Following and Filtering stage

Assuming the memory is not a bottleneck of the system, this edge following and filtering block is not hard to build. A possible block diagram for implementing this stage is shown in Figure 7.3.

"Memory1" stores the whole image coming from the edge detection stage; each pixel is stored as 1 bit to reduce the BRAM requirement. The RD Ctrl block reads data from "Memory1" and the WR Ctrl block writes data to "Memory1". When a pixel is read from "Memory1", it is sent to the detection ctrl block, which determines whether it is a contour pixel and sends the corresponding control signals to the address generator block. If a contour is found to be valid, the detection ctrl block sends all its contour pixels to "Memory2", the memory that the software can access for the following stages. The address generator block generates the read address for the next pixel and the write address for already detected contour pixels. In this case, the capacity needed for "Memory1" is about 281 Kb (622 × 462 / 1024 ≈ 280.6 Kb), which can be implemented with 9 BRAMs. Since the image size is 640 × 480, it is better to use 32 bits to store the position of a valid contour pixel. From multiple test results, the number of valid contour pixels varies from 2000 to 3500, so 4 BRAMs should be enough for "Memory2". With this edge following and filtering block added to the system, the adaptive threshold result should still be sent to the DDR2 memory for use by the code recognition stage.
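The BRAM counts above can be reproduced with the simple calculation sketched below. It assumes the 36 Kb Virtex-5 block RAMs are used in a 32K x 1 configuration for the 1-bit-per-pixel contour image and a 1K x 32 configuration for the contour-pixel positions; these configurations are assumptions made for illustration, not taken from the actual design files.

#include <stdio.h>

/* Sketch of the BRAM estimate for the proposed edge-following block.
   Assumed: 36Kb BRAMs with 32Kb of usable data per block
   (32K x 1 for Memory1, 1K x 32 for Memory2). */
int main(void)
{
    /* Memory1: 1 bit per pixel for the 622 x 462 edge image. */
    long mem1_bits  = 622L * 462L;                          /* = 287364 bits (~281 Kb) */
    int  mem1_brams = (int)((mem1_bits + 32767) / 32768);   /* round up */

    /* Memory2: up to ~3500 valid contour pixels, 32-bit position each. */
    long mem2_bits  = 3500L * 32L;                          /* = 112000 bits */
    int  mem2_brams = (int)((mem2_bits + 32767) / 32768);

    printf("Memory1: %ld bits -> %d BRAMs\n", mem1_bits, mem1_brams);   /* 9 */
    printf("Memory2: %ld bits -> %d BRAMs\n", mem2_bits, mem2_brams);   /* 4 */
    return 0;
}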

Suppose the edge following and filtering stage has been implemented in hardware and takes less than 60 ms. The bottleneck of a single-PPC system would then be the ellipse fitting stage, which takes 89 ms on average. Using the same parallel structure as in Figure 7.1, the system could then process about 11 frames per second.
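The frame-rate estimates in this section follow directly from the pipeline schedule of Figure 7.2: once the stages are overlapped, the throughput is set by the slowest stage. The sketch below merely reproduces the two estimates from the average stage times quoted above (172 ms for edge following and filtering, 89 ms for ellipse fitting); it is an illustration of the arithmetic, not a measurement.

#include <stdio.h>

/* Throughput of a pipelined system is limited by its slowest stage. */
static double pipelined_fps(double slowest_stage_ms)
{
    return 1000.0 / slowest_stage_ms;
}

int main(void)
{
    /* Dual-PPC proposal: edge following (172 ms) is the slowest stage. */
    printf("Dual PPC, edge following in software: %.1f fps\n", pipelined_fps(172.0));

    /* With edge following moved to hardware (< 60 ms), ellipse fitting
       (89 ms) becomes the slowest stage. */
    printf("Dual PPC, edge following in hardware: %.1f fps\n", pipelined_fps(89.0));
    return 0;
}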


7.2.3 For MicroBlaze System

For the MicroBlaze system, ellipse fitting is the real bottleneck: it takes 95.1% of the total time. This is because there is no double precision FPU support for the MicroBlaze, while the double precision floating point data type is needed for the ellipse fitting algorithm. There are three possible ways to address this:

1. Develop an FPU for the MicroBlaze which supports double precision operations.

2. Use other algorithms for ellipse fitting.

3. Increase the MicroBlaze clock frequency.

Developing an FPU which supports double precision operations for the MicroBlaze is quite a difficult job. There is a double precision floating point co-processor IP for MicroBlaze processors available online [71], but this IP has to be purchased and its performance still needs to be evaluated.

Changing the ellipse fitting algorithm seems more realistic. Finding an algorithm that detects ellipses using only single precision floating point would be a challenge. If the new ellipse fitting algorithm only requires fixed point operations, implementing this stage in hardware would be another interesting topic for the future.

Increasing the clock frequency of the MicroBlaze can speed up the software calculations, but the maximum clock frequency is restricted by the clock generator of the FPGA board, and a high frequency would cause other timing issues for the whole system. So increasing the clock frequency is not a good way to speed up the processing.

7.2.4 Other Recommendations

This project does not connect a real camera to the system; the capture-frame functionality is simulated by fetching image data stored in the DDR2 memory. For future work, connecting a camera to the system would be another challenge that would make the whole system a real "smart camera". There are multiple ways to connect a camera to the ML510 board: a USB camera on the USB port, a FireWire PCI card in the PCI slot, or transferring image data over the UART or the Ethernet port.

The data rates of these ports are shown in Table 7.1. The maximum frames per second is for a gray-scale image with a resolution of 640×480 and 8 bits per pixel.

Interface        Max. Data Rate   Max. Frames per Second
USB 1.1          12 Mbit/s        4.88
USB 2.0          480 Mbit/s       195.3
UART (RS232)     20 Kbit/s        0.008
Ethernet         1000 Mbit/s      406.9
FireWire 400     400 Mbit/s       162.76

Table 7.1: Data rate comparison
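The "Max. Frames per Second" column is simply the link data rate divided by the size of one uncompressed 640×480, 8-bit gray-scale frame. A minimal sketch of that calculation:

#include <stdio.h>

/* Maximum frame rate of an interface for uncompressed 640x480, 8-bit gray-scale. */
static double max_fps(double data_rate_bits_per_s)
{
    const double frame_bits = 640.0 * 480.0 * 8.0;   /* 2,457,600 bits per frame */
    return data_rate_bits_per_s / frame_bits;
}

int main(void)
{
    printf("USB 1.1 : %.2f fps\n", max_fps(12e6));     /*   4.88 */
    printf("USB 2.0 : %.1f fps\n", max_fps(480e6));    /* 195.3  */
    printf("Ethernet: %.1f fps\n", max_fps(1000e6));   /* 406.9  */
    return 0;
}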

From the table, we can see that the UART port is far too slow for transferring images, so it is not an option for implementation. The PCI, Ethernet and USB ports, however, are suitable for connecting a camera or transferring images.

Some previous work ported a Linux kernel to the ML510 board [18], which includes the USB driver and makes using a USB camera easier. Since cheap webcams (less than $60) are easy to buy and most of them are USB cameras, a USB camera would be the first choice to connect to the system in the future.

In this project, the system focused on processing individual frames. Once the processing speed has been improved, a sequence of frames could be processed as a short video and more information could be obtained. For instance, with a camera connected to the system, tracking the TRIP code would be an interesting topic for future investigation.

Ideally, an FPGA with a hard core would be the best combination for the TRIP code recognition system, because a hard core (e.g. PowerPC, ARM) can run at a high clock frequency and supports double precision floating point calculation in hardware. But a board with both a hard core and an FPGA is


expensive. For instance, the ML510 board costs $3995 [74], so transferring the whole system to a cheaper FPGA board is another challenge for the future. The conclusions section lists some possible FPGAs for the TRIP code recognition system. Further investigation shows that, in principle, the Spartan family FPGAs are cheaper than the Virtex family FPGAs: the Spartan-6 LX150T FPGA costs about $250 [72], while the FPGA on the ML510 board (Virtex-5 FX130T) costs $3453.33 [73]. Instead of the expensive ML510 board, the Spartan-6 LX150T Development Board ($995 [75]) would be a good alternative for smart camera system implementations. This board has a large enough FPGA and all the basic interfaces a smart camera system needs. With a USB camera connected, the board can be a good prototype platform for smart camera systems at a relatively low price ($995 + $60 = $1055).


Appendix A

Proof of Equation 2.6

Consider a pixel in an image g, centred within a w×w window. g(i,j) denotes the pixel intensity at position (i,j), m(x,y) the mean value of all pixels within this window, and s(x,y) their standard deviation. The variance can then be expressed as Equation 2.6:

\[
s^2(x,y) = \frac{1}{w^2}\sum_{i=x-w/2}^{x+w/2}\;\sum_{j=y-w/2}^{y+w/2} g^2(i,j) \;-\; m^2(x,y)
\qquad (2.6)
\]

Proof: According to the definition, the variance of these pixels is

\[
s^2(x,y) = \frac{1}{w^2}\sum_{i=x-w/2}^{x+w/2}\;\sum_{j=y-w/2}^{y+w/2}\bigl(g(i,j) - m(x,y)\bigr)^2
\qquad (A.1)
\]

The mean value m(x,y) is a constant within the window, so the equation can be expanded as

\[
s^2(x,y) = \frac{1}{w^2}\sum_{i=x-w/2}^{x+w/2}\;\sum_{j=y-w/2}^{y+w/2} g^2(i,j)
\;-\; 2m(x,y)\,\frac{1}{w^2}\sum_{i=x-w/2}^{x+w/2}\;\sum_{j=y-w/2}^{y+w/2} g(i,j)
\;+\; m^2(x,y)
\qquad (A.2)
\]

According to the definition of the mean value, m(x,y) is

\[
m(x,y) = \frac{1}{w^2}\sum_{i=x-w/2}^{x+w/2}\;\sum_{j=y-w/2}^{y+w/2} g(i,j)
\qquad (A.3)
\]

Substituting Equation A.3 into Equation A.2 gives

\[
s^2(x,y) = \frac{1}{w^2}\sum_{i=x-w/2}^{x+w/2}\;\sum_{j=y-w/2}^{y+w/2} g^2(i,j) \;-\; 2m^2(x,y) + m^2(x,y)
= \frac{1}{w^2}\sum_{i=x-w/2}^{x+w/2}\;\sum_{j=y-w/2}^{y+w/2} g^2(i,j) \;-\; m^2(x,y)
\qquad (A.4)
\]

This is exactly the same equation as Equation 2.6.
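For reference, the sketch below applies Equation 2.6 directly in software: the sum of intensities and the sum of squared intensities over the w×w window are accumulated once, and the variance follows without a second pass. It is a plain illustration of the formula, not the streaming integral-image hardware described in the main text; the function name and interface are chosen only for this example.

/* Local mean and variance of a w x w window centred at (x, y), following
   Equation 2.6: s^2 = (1/w^2) * sum(g^2) - m^2.  The caller must make sure
   the window lies completely inside the image. */
void window_mean_variance(const unsigned char *img, int width,
                          int x, int y, int w,
                          double *mean, double *variance)
{
    double sum = 0.0, sum_sq = 0.0;
    int half = w / 2;
    int i, j;

    for (j = 0; j < w; j++) {            /* rows of the window   */
        for (i = 0; i < w; i++) {        /* columns of the window */
            double g = img[(y - half + j) * width + (x - half + i)];
            sum    += g;                 /* sum of g(i, j)   */
            sum_sq += g * g;             /* sum of g^2(i, j) */
        }
    }
    *mean     = sum / ((double)w * w);                        /* m(x, y)       */
    *variance = sum_sq / ((double)w * w) - (*mean) * (*mean); /* Equation 2.6  */
}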


Appendix B

Mathematical Derivation of the five parameters of an ellipse

This appendix demonstrates the mathematical steps to derive the five parameters of an ellipse from the parameters of its implicit equation. To avoid misunderstanding, Ea denotes the long axis of the ellipse and Eb the short axis, while a and b are the parameters returned from the implicit equation; (x0, y0) is the centre of the ellipse.

For an ellipse shown in Figure B.1, each point (x,y) on the ellipse can be expressed by:

\[
\frac{(x-x_0)^2}{E_a^2} + \frac{(y-y_0)^2}{E_b^2} = 1
\qquad (B.1)
\]

When a point (x, y) is rotated around a point (x0, y0) by an angle θ (counterclockwise), the new position (x′, y′) is:

\[
\begin{aligned}
x' &= (x-x_0)\cos\theta - (y-y_0)\sin\theta \\
y' &= (x-x_0)\sin\theta + (y-y_0)\cos\theta
\end{aligned}
\qquad (B.2)
\]

Figure B.1: No rotation ellipse

So an arbitrary ellipse like the one in Figure 2.11 can be rotated by −θ, after which Equation B.1 applies. Every point (x, y) on the ellipse must therefore satisfy:

\[
\frac{\bigl[(x-x_0)\cos\theta + (y-y_0)\sin\theta\bigr]^2}{E_a^2}
+ \frac{\bigl[-(x-x_0)\sin\theta + (y-y_0)\cos\theta\bigr]^2}{E_b^2} = 1
\qquad (B.3)
\]

From Equation B.3 we can get:

\[
(x-x_0)^2\left(\frac{\cos^2\theta}{E_a^2}+\frac{\sin^2\theta}{E_b^2}\right)
+ (y-y_0)^2\left(\frac{\sin^2\theta}{E_a^2}+\frac{\cos^2\theta}{E_b^2}\right)
+ 2\sin\theta\cos\theta\left(\frac{1}{E_a^2}-\frac{1}{E_b^2}\right)(x-x_0)(y-y_0) - 1 = 0
\qquad (B.4)
\]

The ellipse implicit equation is

\[
F(\vec{a},\vec{x}) = ax^2 + bxy + cy^2 + dx + ey + f = 0
\qquad (B.5)
\]


Based on Equation B.5 and B.4, we can get:

\[
a = \frac{\cos^2\theta}{E_a^2}+\frac{\sin^2\theta}{E_b^2}, \qquad
b = 2\sin\theta\cos\theta\left(\frac{1}{E_a^2}-\frac{1}{E_b^2}\right), \qquad
c = \frac{\sin^2\theta}{E_a^2}+\frac{\cos^2\theta}{E_b^2}
\qquad (B.6)
\]

We can see here

\[
a - c = \frac{\cos^2\theta - \sin^2\theta}{E_a^2} - \frac{\cos^2\theta - \sin^2\theta}{E_b^2}
= \cos 2\theta\left(\frac{1}{E_a^2}-\frac{1}{E_b^2}\right)
\]

\[
b = 2\sin\theta\cos\theta\left(\frac{1}{E_a^2}-\frac{1}{E_b^2}\right)
= \sin 2\theta\left(\frac{1}{E_a^2}-\frac{1}{E_b^2}\right)
\]

So if $1/E_a^2 \neq 1/E_b^2$, then

\[
\frac{b}{a-c} = \tan 2\theta
\qquad (B.7)
\]

If $1/E_a^2 = 1/E_b^2$, then Ea = Eb, which means the curve is a circle, not an ellipse, and θ has no meaning at all.

B.1 Derive center point (x0, y0)

Given Equation B.6, Equation B.4 can be rewritten as

\[
a(x-x_0)^2 + c(y-y_0)^2 + b(x-x_0)(y-y_0) - 1 = 0
\]

which expands to

\[
ax^2 + bxy + cy^2 + (-2ax_0 - by_0)x + (-2cy_0 - bx_0)y + ax_0^2 + cy_0^2 + bx_0y_0 - 1 = 0
\qquad (B.8)
\]

From Equations B.8 and B.5 we get

\[
\begin{aligned}
d &= -2ax_0 - by_0 \\
e &= -2cy_0 - bx_0
\end{aligned}
\qquad (B.9)
\]

Since a, b, c, d and e are known, it is straightforward to solve Equations B.9 for the centre:

\[
x_0 = \frac{2cd - be}{b^2 - 4ac}, \qquad
y_0 = \frac{2ae - bd}{b^2 - 4ac}
\qquad (B.10)
\]

Note that this result for (x0, y0) holds under the constraint $b^2 - 4ac < 0$.

B.2 Derive ellipse axis Ea, Eb

From Equation B.6, we can see that

\[
a + c = \frac{1}{E_a^2} + \frac{1}{E_b^2}, \qquad
\frac{b}{\sin 2\theta} = \frac{1}{E_a^2} - \frac{1}{E_b^2}
\qquad (B.11)
\]

The value of $b/\sin 2\theta$ can be found as follows. Since

\[
\frac{b}{a-c} = \frac{\sin 2\theta}{\cos 2\theta}
\]


we have

\[
\frac{b^2}{(a-c)^2} = \frac{\sin^2 2\theta}{\cos^2 2\theta} = \frac{\sin^2 2\theta}{1-\sin^2 2\theta},
\qquad
\frac{(a-c)^2}{b^2} = \frac{1}{\sin^2 2\theta} - 1
\]

and therefore

\[
\frac{(a-c)^2 + b^2}{b^2} = \frac{1}{\sin^2 2\theta},
\qquad
(a-c)^2 + b^2 = \frac{b^2}{\sin^2 2\theta}
\]

Since we consider Ea the long axis and Eb the short axis, Ea > Eb is our constraint here. That means $1/E_a^2 - 1/E_b^2 < 0$, so according to Equation B.11, $b/\sin 2\theta$ must be negative:

\[
\frac{b}{\sin 2\theta} = -\sqrt{(a-c)^2 + b^2}
\qquad (B.12)
\]

Using Equations B.12 and B.11, we get

\[
E_a = \sqrt{\frac{2}{a + c - \sqrt{(a-c)^2 + b^2}}}, \qquad
E_b = \sqrt{\frac{2}{a + c + \sqrt{(a-c)^2 + b^2}}}
\qquad (B.13)
\]

Notice that in the derivation some conditions are assumed to hold, such as b ≠ 0 and a − c ≠ 0, so that the steps towards Ea and Eb can be carried out. Checking those conditions against the derived result (Equation B.13) shows that the result remains correct even if those conditions are not fulfilled.

Note: Directly using Equation B.13 does not work properly, because the eigenvectors are computed numerically and their values are not perfectly precise. Chapter 5 gives the explanation, and Equation 5.2 is used for the calculation in the real implementation.

B.3 Derive the rotation angle

Although θ is the rotation angle and one of the five ellipse parameters we need, in the real implementation cos θ and sin θ are what is actually used. This section therefore focuses on how to derive cos θ and sin θ.

The range of θ we use is defined as

\[
\theta \in [0, \pi)
\]

because ellipses are symmetric. From Equation B.6, since $1/E_a^2 - 1/E_b^2 < 0$: if b > 0 then cos θ < 0, and if b < 0 then cos θ > 0. If b = 0, θ is either π/2 or 0, and we need to compare a and c: if a < c, sin θ = 0; if a > c, sin θ = 1.

We rewrite Equation B.6

\[
a = \frac{1-\sin^2\theta}{E_a^2} + \frac{\sin^2\theta}{E_b^2}
= \frac{1}{E_a^2} - \sin^2\theta\left(\frac{1}{E_a^2} - \frac{1}{E_b^2}\right)
\;\Rightarrow\;
\sin^2\theta = \frac{E_b^2\,(aE_a^2 - 1)}{E_a^2 - E_b^2}
\qquad (B.14)
\]

Using Equation B.13, Equation B.14 becomes:

\[
\sin^2\theta = \frac{a - c + \sqrt{(a-c)^2 + b^2}}{2\sqrt{(a-c)^2 + b^2}}
\qquad (B.15)
\]


According to the conditions discussed above, we get

\[
\cos\theta =
\begin{cases}
\;\;\;\sqrt{\dfrac{-a + c + \sqrt{(a-c)^2+b^2}}{2\sqrt{(a-c)^2+b^2}}} & \text{if } b < 0 \\[2ex]
-\sqrt{\dfrac{-a + c + \sqrt{(a-c)^2+b^2}}{2\sqrt{(a-c)^2+b^2}}} & \text{if } b > 0 \\[2ex]
\;\;\;1 & \text{if } b = 0 \text{ and } a < c \\
\;\;\;0 & \text{if } b = 0 \text{ and } a > c
\end{cases}
\qquad (B.16)
\]

Similarly, we can derive sin θ:

\[
\sin\theta =
\begin{cases}
\sqrt{\dfrac{a - c + \sqrt{(a-c)^2+b^2}}{2\sqrt{(a-c)^2+b^2}}} & \text{if } b \neq 0 \\[2ex]
0 & \text{if } b = 0 \text{ and } a < c \\
1 & \text{if } b = 0 \text{ and } a > c
\end{cases}
\qquad (B.17)
\]

One condition is not covered by the derivation of sin θ and cos θ: what should θ be if b and a − c are both zero? From Equation B.6, if a − c = b = 0 then Ea equals Eb, which means the curve is a circle, not an ellipse. For a circle the rotation angle θ has no meaning, so sin θ and cos θ have no meaning either.
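As a compact summary of Equations B.10, B.13, B.16 and B.17, the sketch below converts the implicit coefficients into the five ellipse parameters. It assumes the coefficients are already scaled so that the constant term equals −1 once the centre is factored out (as in Equation B.8), and, as noted above, the axis formulas of Equation B.13 are replaced by Equation 5.2 in the real implementation; the function name and interface are illustrative only.

#include <math.h>

/* Five ellipse parameters from the implicit coefficients a..e of
   a*x^2 + b*x*y + c*y^2 + d*x + e*y + f = 0, assuming the scaling of
   Equation B.8 (constant term -1 after the centre is factored out).
   Returns 0 on success, -1 if the coefficients do not describe an ellipse. */
int ellipse_params(double a, double b, double c, double d, double e,
                   double *x0, double *y0, double *Ea, double *Eb,
                   double *cos_t, double *sin_t)
{
    double det  = b * b - 4.0 * a * c;      /* must be negative for an ellipse */
    double root = sqrt((a - c) * (a - c) + b * b);

    if (det >= 0.0)
        return -1;

    /* Centre, Equation B.10 (solution of Equation B.9). */
    *x0 = (2.0 * c * d - b * e) / det;
    *y0 = (2.0 * a * e - b * d) / det;

    /* Axes, Equation B.13 (Ea = long axis, Eb = short axis). */
    *Ea = sqrt(2.0 / (a + c - root));
    *Eb = sqrt(2.0 / (a + c + root));

    /* Rotation, Equations B.16 and B.17. */
    if (b == 0.0) {
        *cos_t = (a < c) ? 1.0 : 0.0;
        *sin_t = (a < c) ? 0.0 : 1.0;
    } else {
        *sin_t = sqrt((a - c + root) / (2.0 * root));
        *cos_t = sqrt((c - a + root) / (2.0 * root));
        if (b > 0.0)
            *cos_t = -(*cos_t);             /* theta lies in (pi/2, pi) */
    }
    return 0;
}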


Appendix C

Control Flow Diagrams

Here are some conventions used in the control flow diagrams.

• The rectangle blocks indicate states; the content of a state consists of signal assignments. The diamond blocks indicate conditions.

• The "<=" symbol and the ":=" symbol are both used for signal assignments. "<=" indicates one clock cycle of delay from the source signal to the target signal, while ":=" indicates no delay between the two signals.

• A signal name ending in dly denotes a delayed signal. For instance, the signals din_en_dly1 and din_en_dly2 are the signal din_en delayed by 1 and 2 clock cycles respectively.

• The buf_rd_en and buf_wr_en signals are the FIFO read enable and write enable control signals; buf_in and buf_out are the FIFO input and output data.

• Other signals that appear in a control flow graph can be found in the text related to the graph or in the related data flow graph.

• The numbers in the diagrams are based on an image size of 640×480; they can be changed to other parameters.

The control flow diagram of the Integral Image Calculation block is shown in Figure C.1, and the control flow diagram of the Local Mean Calculation block in Figure C.2.

Some remarks on Figure C.2. For the FIFO control part: when the FIFO has loaded a certain amount of data (16 lines of data minus 16), the control starts to read data from the FIFO. The fifo_count signal is an output of the large FIFO indicating how much data is in the FIFO. The buf_wr_en signal is equal to the din_en_dly2 signal except for the first 16 valid data of each frame.

For the output control part: since the window size is 16×16, the first valid output starts at the 17th pixel of the 17th line. For each line from the 17th line to the last line, the valid output data run from the 17th pixel to the 640th pixel. Different counters are used for this output-valid control. The signal line_count counts from 0 to 639. The data_outvalid signal is high only while the value of line_count is between 15 and 639. The signal count counts how many valid data have entered the local mean calculation block; when the first 16 pixels have arrived, the out_en_valid signal goes high. When both out_en_valid and data_outvalid are high, the out_en signal equals the signal din_en_dly2. This means that the output of the ALU is valid, so the mean value enable signal of this block (dout_en) is out_en delayed by one clock cycle.
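A small software model of the output-valid condition described above may make the counter roles clearer: with a 16×16 window over a 640×480 image, a mean value can only be produced once 16 complete lines and 16 pixels of the current line have been seen. The function below is only a behavioural sketch of that condition (0-based pixel coordinates), not the VHDL itself.

/* Behavioural model of the output-valid condition of the local mean block:
   with a 16x16 window, the first valid mean appears at the 17th pixel of the
   17th line (0-based: row 16, column 16), and every following pixel of a line
   up to column 639 is valid as well. */
int local_mean_output_valid(int row, int col)
{
    const int win = 16;
    return (row >= win) && (col >= win) && (col <= 639);
}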


Figure C.1: Integral Image Calculation Control Flow Diagram


Figure C.2: Local Mean Calculation Control Flow, a) FIFO Control; b) Output Control


Appendix D

Proof of alternative method to derive desired eigenvector

For a generalized eigenvalue system as shown in Equation 2.11:

\[
S\vec{a} = \lambda C\vec{a}
\]

where S is a real, symmetric, positive definite matrix. The valid eigenvector $\vec{a}$ equals $L^{-T}\vec{e}$, where L is the lower triangular matrix derived from S and $\vec{e}$ is the valid eigenvector of $L^{-1}CL^{-T}$.

Proof:

By the Cholesky decomposition, S can be decomposed as $LL^T$, where L is a lower triangular matrix. In this proof we assume $\vec{a} = L^{-T}\vec{e}$ and then prove that $\vec{e}$ is an eigenvector of $L^{-1}CL^{-T}$.

From Equation 2.11, we rewrite the system as a standard eigenvalue problem:

\[
S^{-1}C\vec{a} = \lambda_2\vec{a}
\qquad (D.1)
\]

where $\lambda_2 = 1/\lambda$. Since $\vec{a} = L^{-T}\vec{e}$, we get

\[
S^{-1}CL^{-T}\vec{e} = \lambda_2 L^{-T}\vec{e}
\qquad (D.2)
\]

Left-multiplying both sides by S, and using $S = LL^T$, gives

\[
SS^{-1}CL^{-T}\vec{e} = \lambda_2 LL^TL^{-T}\vec{e}
\qquad (D.3)
\]

which simplifies to

\[
CL^{-T}\vec{e} = \lambda_2 L\vec{e}
\qquad (D.4)
\]

Left-multiplying both sides by $L^{-1}$ then gives

\[
L^{-1}CL^{-T}\vec{e} = \lambda_2\vec{e}
\qquad (D.5)
\]

Equation D.5 is a standard eigenvalue problem in which $\vec{e}$ is an eigenvector of $L^{-1}CL^{-T}$. This is the result we expected.


Appendix E

Testing Images and Results

The testing images and results for the performance evaluation are shown in this appendix. All testing images are of size 640×480.

The Matlab image results and the ML510 image results are shown side by side for comparison.

Figure E.1: Matlab result of Image01 Figure E.2: ML510 result of Image01

Figure E.3: Matlab result of Image02 Figure E.4: ML510 result of Image02


Figure E.5: Matlab result of Image03 Figure E.6: ML510 result of Image03

Figure E.7: Matlab result of Image04 Figure E.8: ML510 result of Image04

Figure E.9: Matlab result of Image05 Figure E.10: ML510 result of Image05


Figure E.11: Matlab result of Image06 Figure E.12: ML510 result of Image06

Figure E.13: Matlab result of Image07 Figure E.14: ML510 result of Image07

Figure E.15: Matlab result of Image08 Figure E.16: ML510 result of Image08


Figure E.17: Matlab result of Image09 Figure E.18: ML510 result of Image09

Figure E.19: Matlab result of Image10 Figure E.20: ML510 result of Image10

Figure E.21: Matlab result of Image11 Figure E.22: ML510 result of Image11


Bibliography

[1] Y. M. Mustafah, A. W. Azman, A. Bigdeli, B. C. Lovell, "An Automated Face Recognition System for Intelligence Surveillance: Smart Camera Recognizing Faces in the Crowd", Distributed Smart Cameras, pp. 147-152, 2007.

[2] P. Koerhuis, "A Smart Camera Platform for Smart Signs, Visual Tag Recognition Using Computer Vision", M.Sc. Thesis, University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science, Pervasive Systems Group, July 2009.

[3] Diego Lopez de Ipina, "Visual Sensing and Middleware Support for Sentient Computing", PhD thesis, Cambridge University, Engineering Department, January 2002.

[4] http://paginaspersonales.deusto.es/dipina/cambridge/ (accessed March 2010)

[5] Kato H., Tan K.T., "Pervasive 2D Barcodes for Camera Phone Applications", Pervasive Computing IEEE, Volume 6, Issue 4, pp. 76-85, 2007.

[6] http://www.imdb.com/title/tt1292825 (accessed March 2010)

[7] http://en.wikipedia.org/wiki/Aztec_Code (accessed September 2010)

[8] P. McCormick, "Supernova collapse simulated on a GPU", posted June 2005, http://www.eetasia.com/ART_8800367816_480100_NT_055ce2a3.HTM (accessed March 2010).

[9] J.D. Owens, M. Houston, D. Luebke, S. Green, J.E. Stone and J.C. Phillips, "GPU Computing", Proceedings of the IEEE, Vol. 96, No. 5, May 2008.

[10] P. Garcia, K. Compton, M. Schulte, E. Blem, and W. Fu, "An Overview of Reconfigurable Hardware in Embedded Systems," EURASIP Journal on Embedded Systems, pp. 1-19, 2006.

[11] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Trans. Systems, Man, and Cybernetics 9(1), pp. 62-66, 1979.

[12] D. Bradley and G. Roth, "Adaptive thresholding using the integral image," Journal of Graphics Tools 12(2), pp. 13-21, 2007.

[13] F. Shafait, D. Keysers and T. M. Breuel, "Efficient Implementation of Local Adaptive Thresholding Techniques Using Integral Images," Conference on Document Recognition and Retrieval, January 29-31, 2008, San Jose, CA, USA.

[14] J. Sauvola and M. Pietikainen, "Adaptive document image binarization," Pattern Recognition 33(2), pp. 225-236, 2000.

[15] J. Canny, "A Computational approach to edge detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6), 679-698, 1986.

[16] A. Fitzgibbon, M. Pilu and R. Fisher, "Direct least squares fitting of ellipses", Proceedings of International Conference on Pattern Recognition, August 1996.

[17] F.L. Bookstein, "Fitting conic sections to scattered data", Computer Graphics and Image Processing, (9):56-71, 1979.


[18] R. Colenbrander, "On FPGAs with embedded processor cores for application in robotics", M.Sc. Thesis, University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science, Control Engineering Group, August 2009.

[19] N.K. Ratha, A.K. Jain, "FPGA-based computing in computer vision", Computer Architecture for Machine Perception, pp. 128-137, 1997.

[20] W.J. MacLean, "An Evaluation of the Suitability of FPGAs for Embedded Vision Systems", Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR, San Diego, CA, USA, pp. 131-131, 2005.

[21] Z. Guo, W. Najjar, F. Vahid, K. Vissers, "A quantitative analysis of the speedup factors of FPGAs over processors", Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays, pp. 162-170, Monterey, California, USA, 2004.

[22] S. Jin, J. Cho, X.D. Pham, K.M. Lee, S.K. Park, M. Kim, J.W. Jeon, "FPGA Design and Implementation of a Real-Time Stereo Vision System", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 20, Issue 1, pp. 15-26, 2010.

[23] C. Cuadrado, A. Zuloaga, J.L. Martin, J. Lazaro, J. Jimenez, "Real-Time Stereo Vision Processing System in a FPGA", IECON 2006 - 32nd Annual Conference on IEEE Industrial Electronics, pp. 3455-3460, 2006.

[24] A. Darabiha, J. Rose, W.J. MacLean, "Video-rate stereo depth measurement on programmable hardware", Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 203-210, Madison, WI, June 2003.

[25] A. Kjaer-Nielsen, L. Jensen, A.S. Sorensen, N. Kruger, "A Real-Time Embedded System for Stereo Vision Preprocessing Using an FPGA", International Conference on Reconfigurable Computing and FPGAs, pp. 37-42, 2008.

[26] K.M. Hou, A. Belloum, "A reconfigurable and flexible parallel 3d vision system for a mobile robot", In IEEE Workshop on Computer Architecture for Machine Perception, New Orleans, Louisiana, December 1993.

[27] D.J. Fleet, "Disparity from local weighted phase correlation", International Conference on Systems, Man and Cybernetics, Vol. 1, pp. 48-54, 1994.

[28] M.H. Yang, D.J. Kriegman, N. Ahuja, "Detecting faces in images: A survey", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 1, Jan 2002.

[29] N. Farrugia, F. Mamalet, S. Roux, F. Yang, M. Paindavoine, "A Parallel Face Detection System Implemented on FPGA", IEEE International Symposium on Circuits and Systems, pp. 3704-3707, 2007.

[30] C. Garcia, M. Delakis, "Convolutional face finder: A neural architecture for fast and robust face detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 26, No. 11, Nov 2004.

[31] C. He, A. Papakonstantinou, D. Chen, "A Novel SoC Architecture on FPGA for Ultra Fast Face Detection", IEEE International Conference on Computer Design, pp. 412-418, 2009.

[32] P. Viola, M. Jones, "Robust real-time face detection", International Journal of Computer Vision, 57(2), pp. 137-154, 2004.

[33] S. Zhang, Z. Liu, "A Real-time Face Detector Using Ellipse-like Features", Proceedings of 7th International Conference on Signal Processing, Vol. 2, pp. 1227-1230, 2004.

[34] K.F. Lee, B. Tang, "Image Processing for In-vehicle Smart Cameras", Intelligent Vehicles Symposium, pp. 76-81, 2006.


[35] T.P. Cao, G. Deng, "Real-Time Vision-based Stop Sign Detection System on FPGA", Digital Image Computing: Techniques and Applications, pp. 465-471, 2008.

[36] G.F. Dominguez, C. Beleznai, M. Litzenberger and T. Delbruck, "Object Tracking on Embedded Hardware", A. N. Belbachir, Smart Cameras, Springer, US, pp. 199-223, 2010.

[37] M. Sen, I. Corretjer, F.H.S. Saha, J. Schlessman, S.S. Bhattacharyya and W. Wolf, "Computer Vision on FPGAs: Design Methodology and its Application to Gesture Recognition", Proceedings of IEEE Workshop on Embedded Computer Vision, CVPR, San Diego, CA, USA, pp. 133-141, 2005.

[38] W. Wolf, B. Ozer, T. Lv, "Smart cameras as embedded systems", IEEE Computer Magazine, Vol. 35, Issue 9, Sept 2002.

[39] J. Schlessman, C. Chen, W. Wolf, B. Ozer, K. Fujino and K. Itoh, "Hardware/Software Co-Design of an FPGA-Based Embedded Tracking System", Proceedings of the 2006 Conference on Computer Vision and Pattern Recognition Workshop, Washington, DC, USA, pp. 123-130, 2006.

[40] http://en.wikipedia.org/wiki/Optical_flow (accessed March 2010)

[41] A. Price, J. Pyke, D. Ashiri, T. Cornall, "Real Time Object Detection for an Unmanned Aerial Vehicle using an FPGA based Vision System", Proceedings of the 2006 IEEE International Conference on Robotics and Automation, pp. 2854-2859, USA, 2006.

[42] F. Dias, F. Berry, J. Serot, F. Marmoiton, "Hardware, Design and Implementation Issues on a FPGA-Based Smart Camera", First ACM/IEEE International Conference on Distributed Smart Cameras, pp. 20-26, 2007.

[43] W. He, K. Yuan, "An Improved Canny Edge Detector and its Realization on FPGA", 7th World Congress on Intelligent Control and Automation, pp. 6561-6564, 2008.

[44] D.V. Rao and M. Venkatesan, "An Efficient Reconfigurable Architecture and Implementation of Edge Detection Algorithm using Handel-C", Proceedings of the International Conference on Information Technology: Coding and Computing, pp. 843-847, 2004.

[45] I. Bravo, A. Hernandez, A. Gardel, R. Mateos, J.L. Lizaro, V. Diaz, "Different Proposals To The Multiplication of 3×3 Vision Mask In VHDL For FPGAs", Proceedings of IEEE Conference of Emerging Technologies and Factory Automation, Vol. 2, pp. 208-211, 2003.

[46] K. Wiatr, E. Jamro, "Implementation Image Data Convolutions Operations in FPGA Reconfigurable Structures for Real-Time Vision Systems", Proceedings International Conference on Information Technology: Coding and Computing, pp. 152-157, 2000.

[47] H. Yu, M. Leeser, "Optimizing data intensive window-based image processing on reconfigurable hardware boards", IEEE Workshop on Signal Processing Systems Design and Implementation, pp. 491-496, 2005.

[48] T. Sugimura, J. Shim, H. Kurino, M. Koyanagi, "Parallel Image Processing Field Programmable Gate Array for Real Time Image Processing System", Proceedings of 2003 IEEE International Conference on Field-Programmable Technology (FPT), pp. 372-374, 2003.

[49] A. Tanwer, R. Singh, P.S. Reel, "Real Time Low Level Parallel Image Processing For Active Vision Systems", IEEE International Advance Computing Conference, pp. 1347-1352, 2009.

[50] S. McBader, P. Lee, "An FPGA Implementation of a Flexible, Parallel Image Processing Architecture Suitable for Embedded Vision Systems", Proceedings of International Parallel and Distributed Processing Symposium, 2003.

[51] S. Jin, J. Cho, J. Jeon, "Pipelined Virtual Camera Configuration for Real-time Image Processing based on FPGA", IEEE International Conference on Robotics and Biomimetics, pp. 183-188, 2007.

[52] W.D. Leon-Salas, S. Velipasalar, N. Schemm, S. Balkir, "A Low-Cost, Tiled Embedded Smart Camera System for Computer Vision Applications", First ACM/IEEE International Conference on Distributed Smart Cameras, pp. 20-26, 2007.


[53] Diego Lopez de Ipina, "TRIP: A Distributed vision-based Sensor System", PhD 1st Year Report, LCE, Cambridge University, Engineering Department, August 1999.

[54] S. Zhang, Z. Liu, "A new algorithm for real-time ellipse detection", International Conference on Machine Learning and Cybernetics, Vol. 1, pp. 602-607, 2003.

[55] Y. Xie, Q. Ji, "A New Efficient Ellipse Detection Method", Proceedings of 16th International Conference on Pattern Recognition, Vol. 2, pp. 957-960, 2002.

[56] Thanh Minh Nguyen, S. Ahuja, Q.M.J. Wu, "A real-time ellipse detection based on edge grouping", IEEE International Conference on Systems, Man and Cybernetics, pp. 3280-3286, 2009.

[57] http://www.xilinx.com/support/documentation/ipembedprocess_coreconnect_plbv46-master-burst.htm, "PLBV46 Master Burst (v1.00a)" (accessed May 2010)

[58] http://www.dfanning.com/ip_tips/color2gray.html (accessed May 2010)

[59] http://www.xilinx.com/tools/coregen.htm (accessed May 2010)

[60] “Virtex-5 APU Floating-Point Unit v1.01a”, Xilinx, April 2009.

[61] D. Pellerin, G. Edvenson, K. Shenoy, D. Isaacs, "Accelerating PowerPC Software Applications", http://www.trias-mikro.de/pdfs/news/IMP_Accelerating_PPC_XIL_PR_.Sept_01_05.pdf (accessed June, 2010)

[62] "Eigenvector Computing Algorithms", http://scholar.lib.vt.edu/theses/available/etd-62597-173629/unrestricted/chapter5a.PDF (accessed April, 2010)

[63] http://en.wikipedia.org/wiki/Cholesky_decomposition (accessed June 2010)

[64] "PPC440 Processor User's Manual", Applied Micro Circuits Corporation (AMCC), Revision 1.09, March, 2008. http://www.phxmicro.com/CourseNotes/AMCC_UM/PPC440_UM2013.pdf (accessed July, 2010)

[65] “Xilinx DS573 LogiCORE IP XPS Timer/Counter (v1.02a)”, April, 2010.

[66] Xilinx, “Advantages of the Virtex-5 FPGA 6-Input LUT Architecture”, http://www.xilinx.com/support/documentation/white_papers/wp284.pdf (accessed September 2010)

[67] Xilinx, “Spartan-6 Family Overview”, DS160(v1.5), August 2, 2010.

[68] Xilinx, “Virtex-4 Family Overview”, DS112(v3.1), August 30, 2010.

[69] Xilinx, “Virtex-II Platform FPGAs: Complete Data Sheet”, DS031(v3.5), November 5, 2007.

[70] Xilinx, “Spartan-3 FPGA Family Data Sheet”, DS099, December 4, 2009.

[71] http://www.hitechglobal.com/IPCores/FPMU-DP.htm (accessed September 2010)

[72] http://avnetexpress.avnet.com/store/em/EMController?langId=-1&storeId=500201&catalogId=500201&term=xc6slx150t&N=0&action=products (accessed September 2010)

[73] http://avnetexpress.avnet.com/store/em/EMController?langId=-1&storeId=500201&catalogId=500201&term=xc5vfx130t&N=0&action=products (accessed September 2010)

[74] http://avnetexpress.avnet.com/store/em/EMController/Kits-and-Tools/Development-Kits/_/N-100639?action=products&cat=1&catalogId=500201&cutTape=&inStock=&langId=-1&proto=&regionalStock=&rohs=&storeId=500201&term=ml510&topSellers= (accessed September 2010)

[75] http://avnetexpress.avnet.com/store/em/EMController?langId=-1&storeId=500201&catalogId=500201&term=spartan%252D6%2Bdevelopment%2Bkit&N=0&action=products (accessed September 2010)
