TRANSCRIPT
ABSTRACT OF THESIS
A NEW SCALABLE SYSTOLIC ARRAY PROCESSOR ARCHITECTURE FOR DISCRETE CONVOLUTION
Two-dimensional discrete convolution is an essential operation in digital image processing. A targeted performance goal is the ability to simultaneously convolve an (i×j)-pixel input image plane with more than one Filter Coefficient Plane (FC) in a scalable manner. Assuming k FCs, each of size (n×n), an additional goal is that the system be able to output k convolved pixels each clock cycle. To achieve these performance goals, an architecture that utilizes a new systolic array arrangement is developed, and the final architecture design is captured in the VHDL hardware description language. The architecture is shown to be scalable when convolving multiple FCs with the same input image plane. The architecture design is validated, both functionally and for performance, through VHDL post-synthesis and post-implementation simulation testing. In addition, the design was implemented on a Field Programmable Gate Array (FPGA) experimental hardware prototype for further functional and performance testing and evaluation.

KEYWORDS: Systolic Array Processor, Discrete Convolution, Hardware Prototyping, Scalable Architecture, Parallel Architecture.
________________________
________________________
A NEW SCALABLE SYSTOLIC ARRAY PROCESSOR ARCHITECTURE FOR DISCRETE CONVOLUTION
By
Albert Tung-Hoe Wong
____________________________ Director of Thesis
____________________________ Director of Graduate Studies
____________________________
RULES FOR THE USE OF THESIS
Unpublished theses submitted for the Master's degree and deposited in the University of Kentucky Library are as a rule open for inspection, but are to be used only with due regard to the rights of the authors. Bibliographical references may be noted, but quotations or summaries of parts may be published only with permission of the author, and with the usual scholarly acknowledgements.

Extensive copying or publication of the thesis in whole or in part also requires the consent of the Dean of the Graduate School of the University of Kentucky.

A library that borrows this thesis for use by its patrons is expected to secure the signature of each user.

Name                                                                Date
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
THESIS
Albert Tung-Hoe Wong

The Graduate School
University of Kentucky

2003
A NEW SCALABLE SYSTOLIC ARRAY PROCESSOR ARCHITECTURE FOR DISCRETE CONVOLUTION
THESIS
A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical
Engineering in the College of Engineering at the University of Kentucky
By
Albert Tung-Hoe Wong
Lexington, Kentucky
Director: Dr. J. Robert Heath, Associate Professor of Electrical and Computer Engineering
Lexington, Kentucky
2003
MASTER’S THESIS RELEASE
I authorize the University of Kentucky Libraries to reproduce this thesis in
whole or in part for purposes of research.
Signed: _____________________________________ Date: _______________________________________
ACKNOWLEDGEMENTS

The following thesis, while an individual work, benefited from the insights and direction of several people. First, my Thesis Chair, Dr. J. Robert Heath, exemplifies the high-quality scholarship to which I aspire. In addition, Dr. Heath provided timely and instructive comments and evaluation at every stage of the thesis process, allowing me to complete this project. Next, I wish to thank the complete Thesis Committee: Dr. J. Robert Heath, Dr. Hank Dietz, and Dr. William R. Dieter. Each individual provided insights that guided and challenged my thinking, substantially improving the finished product. I would also like to thank Dr. Michael Lhamon of Lexmark Inc. for his technical insights, guidance, and comments.

In addition to the technical and instrumental assistance above, I received equally important assistance from family and friends. My wife, Sze Ying Ng, provided ongoing support throughout the thesis process.
CONTENTS
Acknowledgements............................................................................................................ iii
List of Tables ..................................................................................................................... vi
List of Figures ................................................................................................................... vii
Chapter 1 Introduction .........................................................................................................1
Chapter 2 Background and Convolution Architecture Requirements .................................2
Chapter 3 Version 1 Convolution Architecture ...................................................................7
3.1. Arithmetic Unit (AU) .......................................................................................... 7
3.2. Coefficient Shifters (CSs) ................................................................................. 10
3.3. Input Data Shifters (IDSs)................................................................................. 11
3.3.1. Register Bank (RB) ............................................................................... 12
3.3.2. Pattern Generator Pointers (PGPs) ....................................................... 12
3.3.3. Delay Units (DU).................................................................................. 15
3.4. Systolic Flow of Version 1 Convolution Architecture. .................................... 15
3.5. Data Memory Interface (DM I/F) ..................................................................... 17
3.6. Output Correction Unit ..................................................................................... 19
3.7. Controller .......................................................................................................... 19
Chapter 4 Revised Architectural Requirements and Resulting Version 2 Convolution
Architecture...............................................................................................................21
4.1. Version 2 Convolution Architecture for (k = 1) ............................................... 21
4.2. Arithmetic Unit (AU) ........................................................................................ 22
4.2.1. Multiplication Unit (MU) of Multiplication and Add Unit (MAU) ...... 26
4.2.2. Delay Units (DU).................................................................................. 31
4.3. Input Data Shifters (IDS) .................................................................................. 31
4.4. Data Memory Interface (DM I/F) ..................................................................... 32
4.5. Memory Pointers Unit (MPU) .......................................................................... 32
4.6. Systolic Flow of Version 2 Convolution Architecture ..................................... 34
4.7. Controller .......................................................................................................... 35
4.8. Multiple Filter Coefficient Sets when (k > 1) ................................................... 43
Chapter 5 VHDL Description of Version 2 Convolution Architecture..............................45
Chapter 6 Version 2 Convolution Architecture Validation via Virtual Prototyping
(Post-Synthesis and Post-Implementation Simulation Experimentation).................47
6.1. Post-Synthesis Simulation ................................................................................ 48
6.1.1. Adders ................................................................................................... 48
6.1.2. Multiplication Unit................................................................................ 51
6.1.3. Version 2 Convolution Architecture (with k = 1) ................................. 52
6.2. Post-Implementation Simulation ...................................................................... 61
6.2.1. Synthesis and Implementation of Version 2 Convolution Architecture
(with k = 1)...................................................................................................... 61
6.2.2. Version 2 Convolution Architecture (with k = 1) ................................. 62
6.2.3. Synthesis and Implementation of Version 2 Convolution Architecture (k
= 3) .................................................................................................................. 65
6.2.4. Validation of Version 2 Convolution Architecture (with k = 3)........... 66
Chapter 7 Hardware Prototype Development and Testing ................................................72
7.1. Board Utilization Modules and Prototype Setup .............................................. 73
7.2. Hardware Prototyping Flow.............................................................................. 76
7.3. Test Cases ......................................................................................................... 80
Chapter 8 Conclusion.........................................................................................................84
Appendix A VHDL Code for Version 2 Discrete Convolution Architecture ....................86
Appendix B VHDL Codes, C++ Source Code, and Script File for Post-Synthesis
Simulation ...............................................................................................133
Appendix C C++ Source Codes for Programs Used During Post-Implementation
Simulation ...............................................................................................................140
Appendix D C++ Source Codes for Programs Used During Hardware Prototype
Implementation .......................................................................................................143
Appendix E VHDL Files for Modules External to the Convolution Architecture...........149
References........................................................................................................................157
Vita .................................................................................................................................159
LIST OF TABLES

Table 3.1. Filter coefficient array. .....................................................................................11
Table 3.2. 5×5 Filter size (with one output pointer). .........................................................13
Table 3.3. 5×5 Filter size (Convolution with two output pointers). ..................................14
Table 4.1. Gate count comparison between CSA and CLA................................................25
Table 4.2. A summary of the multiplication. .....................................................................26
Table 4.3. Partial Product Selection Table.........................................................................28
Table 4.4. Comparison between method I and method II..................................................30
Table 6.1. Details of FPGA on the XESS protoboard. .......................................................62
Table 6.2. Resource utilization of Version 2 Convolution Architecture (with k = 1)........62
Table 6.3. Resource utilization of Version 2 Convolution Architecture (with k = 3)........66
LIST OF FIGURES

Figure 2.1. Pictorial view of Input Image Plane (IP), Filter Coefficient Plane (FC), and
Output Image Plane (OI).............................................................................................3
Figure 2.2. Example showing how two consecutive output pixels are generated. This
example is shown with a 3×3 size FC. .......................................................................4
Figure 2.3. Example showing that only (n-1) previous rows plus n input image pixels
need to be stored. In this example, 2 previous rows (shaded rows in addition to
IP23,..,IP25, IP33,..,IP35) plus 3 additional input image pixels (IP43,..,IP45) are
needed for a 3×3 filter size..........................................................................................4
Figure 3.1. Top-level view of Version 1 of the convolution architecture (d is assumed
to be 8 in this example). ..............................................................................................8
Figure 3.2. A MAU and included functional units. ..............................................................8
Figure 3.3. Systolic array structures of MAUs, where IDSs are outputs from Input Data
Shifters and CSs are the outputs from Coefficient Shifters. .......................................9
Figure 3.4. Functional units within CSs.............................................................................10
Figure 3.5. Arrangement of the filter coefficients within the Coefficient Shifters............11
Figure 3.6. Functional units within IDSs. ..........................................................................11
Figure 3.7. Generalized RB for n×n filter size. (d denotes number of bits for the input
pixels)........................................................................................................................12
Figure 3.8. Additional hardware and modification for convolution of x output pixels in
parallel for (x ≤ n) (Functional units shaded in gray are additional hardware
required for processing two convolutions in parallel). .............................................14
Figure 3.9. Organization of Flip-flops within the Delay Unit (DU). R within the figure
denotes one flip-flop. ................................................................................................15
Figure 3.10. Pictorial view of the data flow within the MAUs for one output pixel..........16
Figure 3.11. Basic functional units within Data Memory I/F............................................17
Figure 3.12. A more detailed look at DM I/F. ...................................................................18
Figure 3.13. Time line for activities, where W denotes Write or R denotes Read (from
external memory device) registers indicated in boxes directly below......................19
Figure 3.14. Output pattern for two convolutions in parallel. ...........................................19
Figure 4.1. A top level view of Version 2 of the convolution architecture for one
distinct filter coefficient set with n = 5 and d = 8 (MAA denotes Multiplication
and Add Array and AT denotes Adder Tree). ...........................................................22
Figure 4.2. Functional units within the Multiplication and Add Array (MAA). ................23
Figure 4.3. A MAU and its functional units. ......................................................................23
Figure 4.4. A possible arrangement of the AT (R denotes a single flip-flop pipeline
stage; a pipeline stage is included within each CSA and CLA) for a 5×5 FC. ..........24
Figure 4.5. One possible arrangement of the AT when CLA is utilized within the
MAUs. .......................................................................................................................25
Figure 4.6. Illustration of the paper and pencil multiplication technique (s on each row
of the partial products denotes sign extension of that particular row of partial
product). ....................................................................................................................27
Figure 4.7. One possible arrangement of Multilevel CSA Tree for six partial products....28
Figure 4.8. Multiplier based on Modified Booth’s Algorithm and Wallace Tree
Structure....................................................................................................................29
Figure 4.9. Illustration of multiplication technique based on Modified Booth’s
Algorithm..................................................................................................................29
Figure 4.10. Partial Product’s sign extension reduced for hardware saving......................30
Figure 4.11. Functional units within the DU for the case of n = 5 and two pipeline
stages within each MAU (PL denotes a pipeline stage composed of flip-flop
registers)....................................................................................................................31
Figure 4.12. Structural view of the IDS with n = 5 and d = 8............................................32
Figure 4.13. External memory devices organization for n = 5 and d = 8. .........................33
Figure 4.14. Functional units within the Memory Pointers Unit (MPU)...........................34
Figure 4.15. Pictorial view of the data flow within the MAAs for one output pixel (td
denotes the time delay between each MAU). ............................................................35
Figure 4.16. Top level view of the Controller Unit (CU). .................................................36
Figure 4.17. Functional Units that receive control signals from the CU. ..........................37
Figure 4.18. System flow chart for Version 2 convolution architecture’s Controller
Unit (CU). .................................................................................................................39
Figure 4.19. Modified Version 2 system flow chart. .........................................................41
Figure 4.20. Version 2 architecture for k (n×n) filter coefficient sets (where k can be
any number). .............................................................................................................43
Figure 5.1. Version 2 Convolution Architecture organization. .........................................46
Figure 6.1. Testing model for lower level functional components. ...................................49
Figure 6.2. Post-Synthesis simulation for 14-bit CLA. ......................................................49
Figure 6.3. A close up view of one segment of Figure 6.2. ...............................................50
Figure 6.4. Post-Synthesis simulation for 15-bit CLA. ......................................................50
Figure 6.5. Post-Synthesis simulation for 16-bit CLA. ......................................................50
Figure 6.6. Post-Synthesis simulation for 17-bit CLA. ......................................................50
Figure 6.7. Post-Synthesis simulation for 19-bit CLA. ......................................................51
Figure 6.8. Post-Synthesis simulation for all possible inputs for the Multiplication Unit
(MU)..........................................................................................................................51
Figure 6.9. A close up view of one segment of the simulation in Figure 6.8 above..........52
Figure 6.10. Test case 1 with IP and OI of size 5×60 (however, only the first seven
columns of both IP and OI are shown due to report page width limit). ...................53
Figure 6.11. The source code for C++ program that generates test vectors to program
the filter coefficients into MAUs...............................................................................54
Figure 6.12. Arrangement of the Filter Coefficients within the Arithmetic Unit. .............55
Figure 6.13. First phase of operation; programming of FCs into MAUs. ..........................56
Figure 6.14. First phase of operation; receiving the first two rows of the IP (shown in
figure above is the beginning of the second row of the input pixels). ......................56
Figure 6.15. Second phase of operation; output pixels generated. ....................................57
Figure 6.16. Second phase of operation; output pixels of the second row of OI
(superimposed)..........................................................................................................58
Figure 6.17. Third phase of operation; output pixels of the last row of OI
(superimposed)..........................................................................................................58
Figure 6.18. Test case 2; IP, FCs and expected OI (the first seven columns). ..................59
Figure 6.19. First phase of operation for test case 2. .........................................................60
Figure 6.20. Second phase of operation for test case 2; output pixels shown are the
first six of row one of OI (superimposed).................................................................60
Figure 6.21. Third phase of operation for test case 2; output pixels shown are the first
six of the last row for OI (superimposed). ................................................................61
Figure 6.22. Second phase of operation for test case 1 (post-implementation
simulation); output pixels of the second row of OI (superimposed). .......................63
Figure 6.23. Third phase of operation for test case 1 (post-implementation simulation);
output pixels of the last row of OI (superimposed). .................................................64
Figure 6.24. Second phase of operation for test case 2 (post-implementation
simulation); output pixels shown are the first six of row one of OI
(superimposed)..........................................................................................................64
Figure 6.25. Third phase of operation for test case 2 (post-implementation simulation);
output pixels shown are the first six of the last row for OI (superimposed).............65
Figure 6.26. Test Case 1: FC planes, IP plane and the predicted OI planes......................67
Figure 6.27. Superimposed output image pixels (start from the 3rd pixel) for first row
of the OIs for test case 1. ..........................................................................................68
Figure 6.28. Superimposed output image pixels (from 3rd pixel onward) of the second
row of the OIs for test case 1. ...................................................................................68
Figure 6.29. Test case 2: FC planes, IP plane and the predicted OI planes. .....................69
Figure 6.30. Superimposed output image pixels (start from the 3rd pixel) for third row
of the OIs for test case 2. ..........................................................................................70
Figure 6.31. Superimposed output image pixels (from 3rd pixel onward) of the fourth
row of the OIs for test case 2. ...................................................................................70
Figure 6.32. A plot of equivalent system gates versus number of FC planes....................71
Figure 7.1. Convolution Architecture hardware implementation. .....................................72
Figure 7.2. XSV-800 prototype board featuring Xilinx Virtex 800 FPGA (picture
obtained from XESS Co. website, http://www.xess.com).........................................73
Figure 7.3. Top level view of the prototyping hardware. ..................................................74
Figure 7.4. Example of a VHDL file for creating an internal Block RAM containing
input image pixels for the convolution system (seed number of 1 is provided to
the program)..............................................................................................................75
Figure 7.5. FPGA configuration and bit stream download program, gxsload from
XESS Co. ...................................................................................................................77
Figure 7.6. Execution of the FCs configuration program..................................................78
Figure 7.7. Upload SRAM content using gxsload utility, the high address indicates the
upper bound of the SRAM address space whereas the low address indicates the
lower bound of the SRAM address space. .................................................................79
Figure 7.8. Uploaded SRAM contents stored in a file (Intel hex file format). There are
two segments due to the fact that the program wrote the right bank of the SRAM
(16-bit) first and the left bank of the SRAM next (16 MSB bits). .............................79
Figure 7.9. SRAM contents retrieved for first OI plane for test case 1. .............................81
Figure 7.10. SRAM contents retrieved for second OI plane for test case 1........................81
Figure 7.11. SRAM contents retrieved for third OI plane for test case 1. ..........................81
Figure 7.12. SRAM contents retrieved for first OI plane for test case 2. ...........................82
Figure 7.13. SRAM contents retrieved for second OI plane for test case 2........................82
Figure 7.14. SRAM contents retrieved for third OI plane for test case 2. ..........................83
Chapter 1
Introduction
Performance and cost are both important criteria for today's computing system components, whether the component is an entire computer or a computer accessory or peripheral such as a printer. The ever-increasing consumer demand for higher performance has driven printer manufacturers to develop and incorporate performance enhancements into their products at the lowest possible price. Cost is, most of the time, directly proportional to performance, but manufacturers constantly pursue higher performance at lower cost.

The ability to scan and print exceedingly clear images at a maximum page-per-minute rate and at minimum cost is a performance target printer manufacturers aim for. To produce highly enhanced, clear images, a "discrete convolutional-filtering algorithm" must be implemented within the scanner or printer. General-purpose signal processors from various vendors are widely used to implement the convolutional-filtering algorithm. Often, however, not all of the functionality offered by a general-purpose signal processor is needed, so the unused functionality becomes a cost overhead. Also, commercially available general-purpose processors often cannot meet the desired requirement of the highest performance at the lowest cost. Thus, a special-purpose signal/image processor architecture is desired to implement the discrete convolutional-filtering algorithm. The subject of this thesis is the development of an efficient, high-performance, special-purpose signal/image processor architecture which may be used to implement the discrete convolutional-filtering algorithm at the lowest cost.
Chapter 2
Background and Convolution Architecture Requirements
Convolution is one of the essential operations in digital image processing required
for image enhancements [15,16]. It is used in linear filtering operations such as
smoothing, denoising, edge detection and so on [15,16]. In general, image processing is
carried out in a two dimensional space/array [16]. A digital image can be represented
with an array of numbers in a two-dimensional space. Each number (or pixel) has an associated row and column indicating its coordinates (position) in the two-dimensional space, and the number's value represents the gray level at that coordinate [15]. The gray
levels are usually represented with a byte or 8-bit unsigned binary number, ranging from
0 to 255 in decimal. Equation 1 shows the two dimensional discrete convolution
algorithm, where IP is the Input Image Plane, FC is the Filter Coefficient Plane, and OI is
the Output Image Plane [16].
OI[x,y] = FC[x,y] * IP[x,y] = Σ(I=0 to n−1) Σ(J=0 to n−1) FC[I,J] · IP[x − I + (n−1)/2, y − J + (n−1)/2]        (1)
Figure 2.1 below shows the basic definitions for the Input Image Plane (IP), Filter
Coefficient Plane (FC), and Output Image Plane (OI). Assuming that the IP has a size of i×j pixels and the FC has a size of n×n pixels, the OI would also have a size of i×j pixels. In most cases, n < i and n < j.
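Equation 1 can be sketched directly in software as a reference model. The C++ fragment below is a minimal sketch, not the thesis hardware; in particular, zero-padding at the image borders is an assumption, since no border policy is fixed at this point in the text.

```cpp
#include <vector>

// Reference model of Equation 1: OI[x,y] = sum over I,J of
// FC[I,J] * IP[x - I + (n-1)/2, y - J + (n-1)/2].
// ip is an i x j image, fc an n x n filter plane with n odd.
// Window positions falling outside the IP contribute zero (assumed policy).
std::vector<std::vector<int>> convolve2d(const std::vector<std::vector<int>>& ip,
                                         const std::vector<std::vector<int>>& fc) {
    const int i = (int)ip.size(), j = (int)ip[0].size();
    const int n = (int)fc.size(), h = (n - 1) / 2;
    std::vector<std::vector<int>> oi(i, std::vector<int>(j, 0));
    for (int x = 0; x < i; ++x)
        for (int y = 0; y < j; ++y) {
            int sum = 0;
            for (int I = 0; I < n; ++I)
                for (int J = 0; J < n; ++J) {
                    int r = x - I + h;   // IP row index from Equation 1
                    int c = y - J + h;   // IP column index from Equation 1
                    if (r >= 0 && r < i && c >= 0 && c < j)
                        sum += fc[I][J] * ip[r][c];
                }
            oi[x][y] = sum;
        }
    return oi;
}
```

The index expressions implement the 180-degree rotation of the FC described below: the FC coefficient with the largest indices multiplies the IP pixel with the smallest indices in the window.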
[Figure 2.1 graphic: the IP (i×j, pixels IP(0,0)..IP(i-1,j-1)), the FC (n×n, with center point c), and the OI (i×j, pixels OI(0,0)..OI(i-1,j-1)).]
Figure 2.1. Pictorial view of Input Image Plane (IP), Filter Coefficient Plane (FC), and Output Image Plane (OI).
Digital convolution can be thought of as a moving window of operations [16]. As shown in Equation 1, one output pixel OI[x,y] can be obtained by rotating the FC 180 degrees around its center point (denoted c in Figure 2.1) and placing it over the IP with the center point on top of IP[x,y]. All of the overlapping IP pixels are multiplied by the corresponding filter coefficients of the FC, and all of the products are then summed to generate the single pixel OI[x,y]. The next output pixel can be obtained by sliding the FC plane one pixel to the right and then repeating the process. Figure 2.2 illustrates the idea of the moving window of operations. The FC is first centered at IP[3,4] to compute OI[3,4] and then moves to IP[3,5] for OI[3,5].
From Figure 2.2 below, one can deduce that when an output pixel is computed, access to entire previous rows, or portions of previously input rows, of input pixels is needed. Hence, previous input image pixels must be stored for this purpose. However, not all of the previous rows of input image pixels are necessarily needed. Instead, only (n-1) rows plus n input image pixels are required, as shown by example in Figure 2.3 below. Another important observation that can be made from Figure 2.2 is that between consecutive convolutions, only n input image pixels become obsolete and require replacement. This observation influences the design of the convolution architecture. Figure 2.3 shows an example for a 3×3 filter size where the shaded areas of
the IP and the area under the FC plane are the input image pixels that need to be stored. These pixels can be stored in a memory device, whether on-chip or off-chip.
Figure 2.2. Example showing how two consecutive output pixels are generated. This
example is shown with a 3×3 size FC.
[Figure 2.2 graphic: a 3×3 FC (coefficients FC00..FC22) shown centered first at IP[3,4] and then at IP[3,5] over the IP, whose pixels arrive row by row.]
Current output: OI[3,4] = FC00·IP45 + FC01·IP44 + FC02·IP43 + FC10·IP35 + FC11·IP34 + FC12·IP33 + FC20·IP25 + FC21·IP24 + FC22·IP23
Next output: OI[3,5] = FC00·IP46 + FC01·IP45 + FC02·IP44 + FC10·IP36 + FC11·IP35 + FC12·IP34 + FC20·IP26 + FC21·IP25 + FC22·IP24
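The operand overlap between the two expansions above can be checked mechanically. The helper below (window is an illustrative name, not from the thesis) enumerates the IP coordinates that Equation 1 touches for one output pixel; comparing the sets for OI[3,4] and OI[3,5] confirms that exactly n = 3 fresh input pixels are needed per step.

```cpp
#include <set>
#include <utility>

// Enumerate the IP coordinates (row, col) referenced by Equation 1
// when computing OI[x,y] with an n x n FC (n odd).
std::set<std::pair<int, int>> window(int x, int y, int n) {
    std::set<std::pair<int, int>> s;
    const int h = (n - 1) / 2;
    for (int I = 0; I < n; ++I)
        for (int J = 0; J < n; ++J)
            s.insert({x - I + h, y - J + h});
    return s;
}
```

For OI[3,4] the set covers rows 2..4 and columns 3..5 (matching the expansion above, e.g. FC00 multiplies IP45); sliding to OI[3,5] adds only the three pixels of column 6.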
[Figure 2.3 graphic: input image pixels arriving row by row, from the beginning of each row to its end; the shaded previous rows (IP23..IP25, IP33..IP35) plus the pixels under the FC up to the current input pixel (IP43..IP45) are the only pixels that must be stored.]
Figure 2.3. Example showing that only (n-1) previous rows plus n input image pixels
need to be stored. In this example, 2 previous rows (shaded rows in addition to IP23,..,IP25, IP33,..,IP35) plus 3 additional input image pixels (IP43,..,IP45) are needed
for a 3×3 filter size.
Convolution is a vital part of image processing, and it can be performed in both software and hardware [1]. Much effort has been directed toward speeding up the convolution process through hardware implementation [1,2,3,9,14], because convolution is a computation-intensive algorithm, as shown in Equation 1. For example, with a 5×5 filter size, each output pixel requires 25 Multiplication and Addition Operations (MAOPS). Thus, as the total number of pixels in an image increases, the required MAOPS increase substantially. Bosi and Bois in [1] propose the use of FPGAs programmed with a 2D convolver as a coprocessor to an existing digital signal processor (DSP) to speed up the convolution process. In [2,3,9,14], special-purpose convolution architectures are designed to meet real-time image processing requirements. Hsieh and Kim in [2] propose a highly pipelined VLSI convolution architecture; parallel one-dimensional convolutions and a circular processing module are used for performance gain, and the architecture requires n×n processing elements, each a multiplier and adder. Both [3] and [14] propose convolution architectures based on systolic arrays which operate on real-time images of size 512×512 pixels; both architectures perform bit-serial arithmetic. The architecture of [14] requires on-chip memory to store the necessary input pixels.
The focus of this thesis is the development of a high performance, real-time, special purpose convolution architecture for scanning and printing applications. The requirement for the final “production” version of this architecture (implemented with ASIC technology) is the capability to perform convolution with a 5×5 FC size on input images of size 8½”×11” at a rate of 60 Pages Per Minute (PPM) at 600 dpi (dots per inch). A total of 33.66M pixels are generated when a standard paper size of 8½”×11” is scanned at a resolution of 600 dpi; adding the requirement to process 60 scanned PPM results in 1.69G MAOPS per second. The multiplication operands are each 8 bits in width, generating 16-bit products which must be summed. The method proposed in [1] is not feasible from a cost standpoint, and some of the functionality of the DSP may not be required. The architectures presented in [2] and [9] are special purpose architectures for convolution, and both require n×n processing elements, which could potentially occupy a large chip area. The architectures of [3] and [14] are both systolic array architectures employing bit-serial arithmetic operations and hence may not be able to meet the performance requirements mentioned above. In [7] the authors point out the well-known fact that for most applications bit-parallel arithmetic has a performance edge over bit-serial arithmetic; however, the processing architecture of [7] is based on bit-serial arithmetic since it is sufficient for their requirements and has a lower gate count.
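For reference, the direct two-dimensional convolution of Equation 1 and the workload figures above can be checked with a short Python sketch (illustrative only; the index convention follows the OI[3,4] expansion of Figure 2.2, and the 1.69G figure assumes the multiplications and additions are counted separately):

```python
def convolve2d(image, fc):
    """Direct 2D discrete convolution with zero padding at the borders.
    The kernel is flipped: FC[0][0] pairs with the pixel below-right of
    the center, matching the OI[3,4] expansion of Figure 2.2."""
    n = len(fc)
    rows, cols = len(image), len(image[0])
    h = n // 2
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0
            for i in range(n):
                for j in range(n):
                    rr, cc = r + h - i, c + h - j
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += fc[i][j] * image[rr][cc]
            out[r][c] = acc
    return out

# Workload for the stated requirement: 8.5" x 11" at 600 dpi, 5x5 FC, 60 PPM.
pixels_per_page = int(8.5 * 600) * (11 * 600)   # 5100 x 6600 = 33,660,000
ops_per_page = pixels_per_page * 25 * 2         # 25 multiplies + 25 adds
ops_per_second = ops_per_page * 60 // 60        # 60 PPM = one page per second
```

With these assumptions, `ops_per_second` comes out to 1.683 × 10⁹, consistent with the 1.69G MAOPS figure quoted above.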
Two hardware architectures, Version 1 and Version 2, are proposed for the implementation of the two-dimensional discrete convolution algorithm shown as Equation 1. Version 1 of the architecture, proposed in the next section, is based on a linear systolic array structure. Version 2 of the architecture, based on an extension of Version 1, will be shown in a later section. Version 2 was developed to meet functional and performance requirements different from those of Version 1.
Version 1 is a special purpose convolution architecture. Unlike [2] and [9], the architecture does not use n×n processing elements, and it is scalable in order to meet variable performance requirements. A scalable architecture, when implemented in Field Programmable Gate Array (FPGA) technology, allows users to implement the architecture to meet their specific performance needs. Parallel arithmetic operations are utilized in the Version 1 architecture for performance gain.
Chapter 3
Version 1 Convolution Architecture
Specially designed hardware to implement convolution can offer a performance
gain over a general-purpose Digital Signal Processor (DSP). Since the convolution
algorithm requires a large number of multiplications and additions, a multiprocessor
architecture is desired. Multiprocessor architectures offer the benefit of processing
multiple operations at a given instance.
A specific type of multiprocessor architecture will be utilized for implementation
of convolution in hardware. It is referred to as a systolic array structure [6]. The
advantages of this type of structure include its modularity and regularity in structure and
the ease of pipelining. In other words, the systolic structure has the ability to fully and
simultaneously utilize all computational units within the architecture. A major challenge
in developing systolic array architectures can be in providing data simultaneously to the
multiple computational units in correct order. Figure 3.1 shows the top-level view of
version 1 of the developed hardware convolution architecture with the basic functional
units indicated. An external memory device (external to an FPGA or ASIC chip) will be
utilized to hold a portion of the scanned input image plane during the convolution
process. A more detailed description of the interface between the main system and the
external memory will be presented later. The functionality of all basic functional units of
the architecture of Figure 3.1 will now be described.
3.1. Arithmetic Unit (AU)
The Arithmetic Unit (AU) is the core of version 1 of the convolution architecture.
As shown in Figure 3.1 above, the AU consists of an Accumulator plus Multiplication
and Add Units (MAUs). As the name implies, the basic building blocks within MAUs are multiplication units and adders. Each MAU consists of one multiplication unit and one adder, as shown in Figure 3.2 below.
Figure 3.1. Top-level view of Version 1 of the convolution architecture (d is assumed to be 8 in this example).
Figure 3.2. A MAU and included functional units.
As depicted in Figure 3.2, the multiplication unit multiplies two 8-bit binary numbers, and the adder then adds the product to the input from the previous MAU. The output from the adder is used as the input to the next MAU. In order to achieve high performance, it is important to utilize high-speed adders and multiplication units within the architecture. It is also of interest to adopt a multiplication technique that is suitable for pipelining; for example, the Wallace tree and array multiplication techniques can be easily pipelined into multiple stages. The implementation platform influences performance as well: it is now common to find high-performance built-in core adders and multiplication units within FPGA technology chips.
Figure 3.3. Systolic array structure of the MAUs, where the IDSs are the outputs from the Input Data Shifters and the CSs are the outputs from the Coefficient Shifters.
The MAUs are arranged so that they form a systolic array structure. Figure 3.3 shows the systolic array structure of the MAUs. The total number of MAUs used is determined by the size of the coefficient filter: for an n×n filter size, n MAUs are utilized.
Use of the systolic array structure requires the outputs from the Input Data Shifter (IDS) and Coefficient Shifter (CS) functional units to be skewed. This ensures that the products are accumulated in the correct sequence. An accumulator is needed at the end of the structure to add the partial results generated by the MAUs to form an output pixel. For example, for an n×n filter size, the accumulator must accumulate n partial results generated by the MAUs to produce one output pixel. This requires n clock cycles if each MAU takes one clock cycle to complete its operation. The registers to the left of each MAU serve as pipeline stage registers. If necessary, additional pipeline stages can be implemented within each MAU to increase performance [8].
3.2. Coefficient Shifters (CSs)
Coefficient shifters are a group of parallel register shifters that can be programmed to retain the values of the filter coefficients. The CSs are responsible for generating a skewed output of filter coefficients to the MAU inputs. Figure 3.4 shows a structural view of the CSs. The number of shifters within the CSs is also dependent on the coefficient filter size: for an n×n filter size, there are n coefficient shifters, each storing n coefficients as seen in Figure 3.4. Once programmed, the CSs retain the filter coefficients throughout the convolution process. In order to provide the MAUs with the skewed input from the CSs, as the convolution process starts, CS0 shifts after the first clock cycle; on the next clock cycle CS1 and CS0 shift; on the following clock cycle CS2, CS1, and CS0 shift. The process continues until all the CSs are shifting every clock cycle. This ensures that the MAUs receive the required skewed input. Figure 3.5 shows the arrangement of the filter coefficients within the CSs corresponding to the filter coefficients shown in Table 3.1.
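The staggered start described above can be modeled behaviorally. The following Python sketch (an illustration only; circular shifters are assumed) shows that after the ramp-up, the output of each CS lags its left neighbor by one coefficient position:

```python
def cs_outputs(coeffs_by_shifter, cycles):
    """Behavioral model of the Coefficient Shifters' staggered start.
    coeffs_by_shifter[s] holds the n coefficients loaded into CS_s.
    CS0 starts shifting after the first cycle, CS1 one cycle later, and so
    on, producing the skew the MAU array needs."""
    n = len(coeffs_by_shifter)
    state = [list(cs) for cs in coeffs_by_shifter]
    outputs = []
    for t in range(cycles):
        # Each shifter presents its front coefficient this cycle.
        outputs.append([state[s][0] for s in range(n)])
        for s in range(n):
            if t >= s:  # CS_s joins the shifting one cycle after CS_{s-1}
                state[s].append(state[s].pop(0))  # circular shift
    return outputs
```

After cycle n-1 every shifter shifts on every clock, and column s of the output stream lags column s-1 by exactly one cycle, which is the skew the MAUs require.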
Figure 3.4. Functional units within CSs.
Table 3.1. Filter coefficient array.
FC(0, 0)    FC(0, 1)    ...  FC(0, n-2)    FC(0, n-1)
FC(1, 0)    ...                            FC(1, n-1)
...
FC(n-1, 0)  FC(n-1, 1)  ...  FC(n-1, n-2)  FC(n-1, n-1)
Figure 3.5. Arrangement of the filter coefficients within the Coefficient Shifters.
3.3. Input Data Shifters (IDSs)
The main function of the Input Data Shifters (IDSs) is to generate a proper
sequence of input image pixels for the MAUs. Figure 3.6 shows the basic functional units
within the IDSs of Figure 3.1.
Figure 3.6. Functional units within IDSs.
3.3.1. Register Bank (RB)
Due to the structure of the convolution algorithm, each successive output pixel requires access to the ((n×(n-1))+n-1) previously received input image pixels. Hence, the RB of Figure 3.6 is used to provide the correct input image pixels for successive convolutions. Figure 3.7 shows the detail of the RB. The RB consists of n registers, each with a length of n input image pixels, or (n×d) bits, assuming each input image pixel is d bits in length. Thus, the RB has the capacity to hold the n² input image pixels needed for each convolution. This functional unit and its structure also improve the scalability of the architecture.
Figure 3.7. Generalized RB for n×n filter size. (d denotes number of bits for the
input pixels).
3.3.2. Pattern Generator Pointers (PGPs)
In order to provide the MAUs with the correct sequence of input image pixels for each convolution, the Pattern Generator Pointers (PGPs) of Figure 3.6 are utilized. Once the RB fills up, only one register needs to be updated with new input image pixels for each convolution. Thus, the update sequence for the RB (input image pixels coming from the Data Memory Interface) repeatedly goes from top to bottom (repeating from zero to (n-1)). As for the output sequence from the RB, each convolution requires the contents of all registers to be fetched to the Delay Units. Hence, all units except the Data Memory Interface (DM I/F) run at a frequency n times faster than the input image pixel rate (for an n×n filter size); the DM I/F operates at the same frequency as the input image pixel rate. Table 3.2 below shows the output sequence for one output pointer. The output sequence is 0, 1, 2, 3, 4 for n = 5. The example in Table 3.2 is based on a 5×5 filter size; the output pattern repeats itself every five convolutions.
Table 3.2. 5×5 filter size (with one output pointer).
                    Convolutions
Reading order       1st  2nd  3rd  4th  5th
from the RB:         0    1    2    3    4
                     1    2    3    4    0
                     2    3    4    0    1
                     3    4    0    1    2
                     4    0    1    2    3
The architecture can be scaled up to process up to x convolutions in parallel, where (x ≤ n). This is made possible by adding (x-1) additional output pointer(s), Delay Unit(s) and AU(s) to the existing architecture. As in the example above, the output sequence for each pointer can be predetermined, and the sequences repeat after every five convolutions. Table 3.3 below shows an example for a 5×5 filter size with two output pointers. Figure 3.8 shows the additional hardware required if two convolutions are to be done in parallel, and the figure indicates the additional hardware required to convolute x = n points in parallel. In addition, an Output Correction Unit (OCU) is needed for convolution of two or more output pixels in parallel (see the OCU in Figure 3.1). The function of the OCU will be explained in a following section.
As the architecture is scaled up to process more than one convolution in parallel, all functional units within the architecture except the DM I/F (which runs at the same frequency as the input image pixel rate) can operate at a lower frequency. If the architecture is scaled up to process n convolutions in parallel, then the whole architecture operates at the same clock rate as the input image pixel rate. Thus, on average one convolution can be achieved every clock cycle. However, the RB needs modification in order to process n convolutions in parallel: on every clock cycle, all n registers within the RB are read at once. Thus, it is necessary for the current input from the DM I/F (used to update one of the registers within the RB) to be fetched to the MAU at the same instant. For this case, n pointers are utilized and each pointer has only one sequence instead of the five shown in Table 3.2 and Table 3.3.
Table 3.3. 5×5 filter size (convolution with two output pointers).
                    Convolutions
Reading order       1st  2nd  3rd  4th  5th
from the RB          0    2    4    1    3
(pointer 0):         1    3    0    2    4
                     2    4    1    3    0
                     3    0    2    4    1
                     4    1    3    0    2
                    Convolutions
Reading order       1st  2nd  3rd  4th  5th
from the RB          1    3    0    2    4
(pointer 1):         2    4    1    3    0
                     3    0    2    4    1
                     4    1    3    0    2
                     0    2    4    1    3
Figure 3.8. Additional hardware and modification for convolution of x output pixels in parallel for (x ≤ n). (Functional units shaded in gray are the additional hardware required for processing two convolutions in parallel.)
The PGPs can be synthesized using a finite state machine model. Another modeling possibility is to store the predetermined sequences in a RAM and read them out sequentially as needed.
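The predetermined sequences can also be generated on the fly. The closed form below is derived from the patterns of Tables 3.2 and 3.3 (it is an observation about those tables, not a formula stated in the thesis): the starting register advances by x positions per convolution, offset by the pointer index.

```python
def pgp_sequence(n, x, pointer, conv):
    """Reading order from the Register Bank for one convolution.
    n: filter size; x: convolutions processed in parallel; pointer: output
    pointer index (0..x-1); conv: 0-based convolution index.  The start
    register advances by x each convolution, so the pattern repeats every
    n convolutions, matching Tables 3.2 and 3.3."""
    start = (x * conv + pointer) % n
    return [(start + i) % n for i in range(n)]
```

For example, `pgp_sequence(5, 1, 0, 1)` reproduces the second column of Table 3.2, and `pgp_sequence(5, 2, 1, 0)` the first column of the pointer-1 half of Table 3.3.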
3.3.3. Delay Units (DU)
Output from the Register Bank (RB of Figure 3.7 and Figure 3.8) passes through the Delay Units (DU) of Figure 3.8 before being fetched into the MAUs within the AUs of Figure 3.8. The Delay Units consist of a series of flip-flops placed in a manner that generates a skewed input to the MAUs. This is necessary for the AU to generate the correct outputs. Figure 3.9 below shows the internal structure of a DU.
Figure 3.9. Organization of Flip-flops within the Delay Unit (DU). R within the
figure denotes one flip-flop.
3.4. Systolic Flow of Version 1 Convolution Architecture.
Figure 3.10 further demonstrates how data flows within the MAUs. As the input image pixels pass through the DU from the RB, skewed input image pixels are generated and fed to the MAUs of the AU. At the same time, skewed filter coefficients are input into the MAUs by the CSs. Figure 3.10 below shows an example of how an output pixel is obtained as it flows through the MAUs with a 5×5 filter coefficient size.
[Figure 3.10 diagram: the 5×5 FC plane (FC00..FC44) centered at IP[3,4], overlaid on the input image pixels it covers.]

Output, OI[3,4] = FC00·IP56 + FC01·IP55 + FC02·IP54 + FC03·IP53 + FC04·IP52 + FC10·IP46 + FC11·IP45 + FC12·IP44 + FC13·IP43 + FC14·IP42 + FC20·IP36 + FC21·IP35 + FC22·IP34 + FC23·IP33 + FC24·IP32 + FC30·IP26 + FC31·IP25 + FC32·IP24 + FC33·IP23 + FC34·IP22 + FC40·IP16 + FC41·IP15 + FC42·IP14 + FC43·IP13 + FC44·IP12

Partial results held in each MAU per clock cycle (c.c.):
t0:      MAU0 = FC44·IP12
(t0+1):  MAU1 = FC34·IP22 + FC44·IP12;  MAU0 = FC43·IP13
(t0+2):  MAU2 = FC24·IP32 + FC34·IP22 + FC44·IP12;  MAU1 = FC33·IP23 + FC43·IP13;  MAU0 = FC42·IP14
(t0+3):  MAU3 = FC14·IP42 + FC24·IP32 + FC34·IP22 + FC44·IP12;  MAU2 = FC23·IP33 + FC33·IP23 + FC43·IP13;  MAU1 = FC32·IP24 + FC42·IP14;  MAU0 = FC41·IP15
(t0+4):  MAU4 = FC04·IP52 + FC14·IP42 + FC24·IP32 + FC34·IP22 + FC44·IP12;  MAU3 = FC13·IP43 + FC23·IP33 + FC33·IP23 + FC43·IP13;  MAU2 = FC22·IP34 + FC32·IP24 + FC42·IP14;  MAU1 = FC31·IP25 + FC41·IP15;  MAU0 = FC40·IP16
(t0+5):  MAU4 = FC03·IP53 + FC13·IP43 + FC23·IP33 + FC33·IP23 + FC43·IP13;  MAU3 = FC12·IP44 + FC22·IP34 + FC32·IP24 + FC42·IP14;  MAU2 = FC21·IP35 + FC31·IP25 + FC41·IP15;  MAU1 = FC30·IP26 + FC40·IP16
(t0+6):  MAU4 = FC02·IP54 + FC12·IP44 + FC22·IP34 + FC32·IP24 + FC42·IP14;  MAU3 = FC11·IP45 + FC21·IP35 + FC31·IP25 + FC41·IP15;  MAU2 = FC20·IP36 + FC30·IP26 + FC40·IP16
(t0+7):  MAU4 = FC01·IP55 + FC11·IP45 + FC21·IP35 + FC31·IP25 + FC41·IP15;  MAU3 = FC10·IP46 + FC20·IP36 + FC30·IP26 + FC40·IP16
(t0+8):  MAU4 = FC00·IP56 + FC10·IP46 + FC20·IP36 + FC30·IP26 + FC40·IP16

Figure 3.10. Pictorial view of the data flow within the MAUs for one output pixel.
As shown in Figure 3.10 above, the convolution starts at clock cycle (c.c.) t0, when the first input image pixel (IP12) is multiplied by filter coefficient FC44 in MAU0. During the next clock cycle, (t0 + 1), the previous product from MAU0 is added to the product of the IP22 and FC34 multiplication in MAU1 while a new product is generated in MAU0 (FC43 and IP13). The sum of the two products in MAU1 is propagated into MAU2 on the next clock cycle, (t0 + 2), where it is summed with the product generated within MAU2. The process continues as shown in Figure 3.10 above. An output pixel is generated during (t0 + 9) c.c., when the partial results from MAU4 at (t0 + 4), (t0 + 5), (t0 + 6), (t0 + 7), and (t0 + 8) c.c. are summed by the accumulator. Once the first output pixel is generated on the 9th c.c., a new output pixel is generated every five clock cycles thereafter.
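The dataflow of Figure 3.10 can be checked with a small cycle-level Python model (an illustrative sketch, not the VHDL design). Here `fc_rows[k]` holds the coefficients MAU_k consumes in order, and `pixel_rows[k]` the matching skewed pixel stream; the accumulated result must equal the direct sum of products:

```python
def systolic_convolve(pixel_rows, fc_rows):
    """Cycle-level sketch of the linear MAU chain plus accumulator.
    MAU_k starts k cycles late (the Delay Unit skew); each cycle its product
    is added to the value latched from MAU_{k-1}, and the accumulator sums
    the n partial results that leave MAU_{n-1}."""
    n = len(fc_rows)
    pipe = [0] * (n + 1)          # pipe[k]: register feeding MAU_k
    acc = 0
    for t in range(2 * n - 1):
        new_pipe = [0] * (n + 1)
        for k in range(n):
            j = t - k             # local step of MAU_k (skewed start)
            if 0 <= j < n:
                new_pipe[k + 1] = pipe[k] + fc_rows[k][j] * pixel_rows[k][j]
        pipe = new_pipe
        if t >= n - 1:            # a partial result leaves MAU_{n-1}
            acc += pipe[n]
    return acc
```

For a 3×3 example the systolic result matches the direct dot product, mirroring how the accumulator of Figure 3.10 collects the n column sums.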
3.5. Data Memory Interface (DM I/F)
It is anticipated that external memory devices will be utilized for IP pixel storage, since the cost of having on-chip memory within the single-chip convolution architecture of Figure 3.1 is high for any implementation platform. (The following assessment assumes a 5×5 filter size.) The bus for data transfer between the external memory device and the DM I/F will be 40 bits wide. This ensures that each access to the external memory device yields five input image pixels. However, since memory devices such as the SRAM devices on the market only come in widths of 8, 16, and 32 bits, two memory devices will be used: one 8-bit and one 32-bit. Because only five input image pixels need to be updated for consecutive convolutions, only one access to the external memory devices is required. Figure 3.11 below shows the basic functional units within the DM I/F.
Figure 3.11. Basic functional units within Data Memory I/F.
A cache unit is utilized to reduce the penalty of accessing external memory devices. Figure 3.12 shows a more detailed view of the DM I/F. Register Files A and B each consist of four shift registers, namely Registers b, c, d, and e. Each register is 40 bits in size and holds five input image pixels. In order to prevent data starvation, Register Files A and B are used alternately: as Register File A is providing input image pixels to the IDSs, Register File B is being filled with input image data from the external memory devices, and vice versa. As either one of the Register Files outputs input image pixels to the IDSs, each internal register shifts an input image pixel out by shifting right.
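The alternating ("ping-pong") use of the two Register Files can be sketched as follows (a Python illustration only; one 40-bit word per file is modeled, and least-significant-byte-first pixel packing is an assumption, not a detail stated in the thesis):

```python
def ping_pong_stream(words):
    """Sketch of the alternating Register File A / B scheme: while one file
    feeds the IDSs, the other refills from external memory, so the shifters
    never starve.  Each 'file' here holds one 40-bit word (five pixels)."""
    files = [None, None]               # Register File A and Register File B
    it = iter(words)
    files[0] = next(it, None)          # prefill A, then B, before streaming
    files[1] = next(it, None)
    active = 0
    out = []
    while files[active] is not None:
        word = files[active]
        for i in range(5):             # shift one 8-bit pixel out per cycle
            out.append((word >> (8 * i)) & 0xFF)
        files[active] = next(it, None) # refill behind the draining file
        active ^= 1                    # swap the roles of A and B
    return out
```

While one file drains its five pixels, the refill of the other overlaps with it, which is exactly the data-starvation argument made above.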
Figure 3.12. A more detailed look at DM I/F.
Observing Equation 1, there are instances where references are made outside the range of the input image. For these accesses, zero pixel values are used. To address these boundary conditions, zero-padding hardware is incorporated into the DM I/F. Whenever the end of a row of input image pixels is reached, (n-1)/2 zero pixels are attached. Thus, a column counter is needed within the main controller of the architecture.
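The effect of the zero-padding hardware on one row can be sketched as below (a Python illustration; it pads both ends of the row with (n-1)/2 zeros each, as a centered n×n filter requires, whereas the hardware attaches the pixels as each row boundary is reached):

```python
def pad_row(row, n):
    """Zero padding at row boundaries for an odd n x n filter centered on
    the output pixel: (n - 1) // 2 zero pixels beyond each end of the row."""
    h = (n - 1) // 2
    return [0] * h + list(row) + [0] * h
```

With n = 5 a row gains two zeros at each end, so the filter window is always fully populated even at the first and last output columns.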
Register a of Figure 3.12 is a register that can hold up to five input image pixels. As Register a is filled, its contents are stored into the External Memory Devices. Five address pointers are also needed for addressing the External Memory Devices for storage of input image pixels. Each addressable location of the two External Memory Devices can hold up to five input image pixels (40 bits). Figure 3.13 shows the timeline of activities within the DM I/F. For every four reads from the External Memory Devices (one read from each of the read pointers) and one write to store the input image pixels held in Register a, five output pixels (OI) of Figure 3.1 are produced.
Figure 3.13. Timeline of DM I/F activities, where W denotes a Write and R denotes a Read (from the external memory device) of the registers indicated in the boxes directly below.
3.6. Output Correction Unit
This unit is responsible for correcting the output sequence when the architecture is scaled to process two or more convolutions in parallel. Figure 3.14 below shows an example of the output pixel sequence when two convolutions are processed in parallel. Instead of one output on each output clock cycle, two output pixels are generated in a single clock cycle every two output clock cycles. Thus, the Output Correction Unit may be needed to correct the output sequence back to one output pixel per output clock cycle. Whether the OCU is needed can be addressed at a later time.
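If re-serialization is required, the correction for x = 2 amounts to a one-deep holding register, as in this Python sketch (an illustration of the behavior, not the thesis implementation; the input alternates a pixel pair with an idle cycle):

```python
def output_correction(paired_outputs):
    """Sketch of the OCU for x = 2 parallel convolutions: the AUs deliver
    two output pixels together every second output clock cycle; a one-deep
    holding register re-serializes them to one pixel per cycle."""
    held = None
    serialized = []
    for cycle_output in paired_outputs:
        if cycle_output is not None:       # an (even, odd) pixel pair arrives
            even_pixel, odd_pixel = cycle_output
            serialized.append(even_pixel)  # first pixel passes straight through
            held = odd_pixel               # second pixel is held one cycle
        else:                              # idle cycle: release the held pixel
            serialized.append(held)
            held = None
    return serialized
```

The same idea generalizes to x > 2 with an (x-1)-deep holding queue drained one pixel per cycle.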
Figure 3.14. Output pattern for two convolutions in parallel.

3.7. Controller
This is the functional unit of Figure 3.1 that coordinates all the other functional units within the architecture. The main controller of the architecture will be implemented in finite state machine form. Within the main controller there are a row counter and a column counter to keep track of the row and column counts, so that the controller knows when the end of a row is encountered. There can be two separate controlling units within the main controller: one responsible for the DM I/F (controller DM) and the other responsible for the rest of the architecture (controller R). The DM I/F runs at the same rate as the input image pixels, while the rest of the architecture runs at least n times faster, assuming convolution of a single point. However, as the architecture is scaled up to handle x convolutions at the same time (see Figure 3.8), controller R can be run at a correspondingly lower frequency.
Basically, the controller can be divided into three main stages. The first stage is mainly devoted to storing the first few rows of input image pixels and waiting until there are enough input image pixels for convolution to start. The second stage is responsible for filling up the pipeline and making sure that the convolution starts in the correct manner. The last stage deals with shutting down the system.
Chapter 4
Revised Architectural Requirements and Resulting Version 2 Convolution Architecture
The convolution architecture proposed in the previous section is scalable and
suitable for applications that require scalable performance and hardware. In this section a
more stringent performance requirement is addressed for which a convoluted OI pixel is
expected on each clock cycle of 7.3 ns (for a final “production” model based on ASIC
technology). In addition, k distinct n×n FCs are required to be simultaneously convoluted
with each Input Image Plane (IP) resulting in a performance requirement of k OI pixels
on each 7.3 ns clock cycle. The performance requirement of k convoluted OI pixels on each 7.3 ns clock cycle can only be expected from final high-speed production technologies. In Version 1 of the architecture, filter coefficients (FCs) were assumed to
be 8 bits in length. Filter coefficients will now be 6 bits in length. Even though the
convolution architecture proposed in the previous section can be scaled up and pipeline
stages within the MAUs can also be increased to meet all the above requirements, a
specially tailored architecture can save hardware and reduce the architecture’s controller
complexity. For example, as shown in Figure 3.1, within each AU there is an accumulator
in front of the MAUs. As the architecture is scaled up to process n convolutions in
parallel, n accumulators within the architecture will be required which can be costly from
a hardware standpoint. Furthermore, a simplified controller for the IDSs can also
contribute to a hardware saving. Hence, in this section a modified and specially tailored
convolution architecture will be presented and it is referred to as Version 2 of the
convolution architecture.
4.1. Version 2 Convolution Architecture for (k = 1)
Since the desired output rate is the convolution of one OI pixel per clock cycle, for an n×n FC size a total of n² MAUs are needed for one distinct filter coefficient set. Figure 4.1 below shows a top-level view of Version 2 of the architecture, where n and d (the width of the input image pixels) are assumed to be 5 and 8, respectively. The buses shown in Figure 4.1 with a width of 40 bits result from (n×d). Each functional unit in this architecture implements required functionality, as will be addressed below.
Figure 4.1. A top-level view of Version 2 of the convolution architecture for one distinct filter coefficient set with n = 5 and d = 8 (MAA denotes Multiplication and Add Array and AT denotes Adder Tree).

4.2. Arithmetic Unit (AU)
As shown in Figure 4.1 above, the Arithmetic Unit (AU) consists of n
Multiplication and Add Arrays (MAAs) plus an Adder Tree Structure (AT) at the end of
the MAAs. Within each MAA there are n Multiplication and Add Units (MAUs) arranged
in a systolic array structure. Figure 4.2 shows the arrangement of the n MAUs within each
MAA.
The basic functional units within each MAU remain the same as in the previous
section. In Version 2 of the architecture, filter coefficients will be held within the MAUs,
therefore, an additional register is needed to hold the filter coefficient value assigned to a
specific MAU. Since Version 2 of the modified convolution architecture features n² MAUs, the Coefficient Shifters (CSs) of Version 1 of the architecture (see Figure 3.1, Figure 3.4 and Figure 3.5 of the previous section) can be eliminated. Hence, each of the n² filter coefficients is assigned to a specific MAU. Figure 4.3 shows the functional units within each MAU.
Figure 4.2. Functional units within the Multiplication and Add Array (MAA).
Figure 4.3. A MAU and its functional units.
In order to achieve the desired performance, it will be necessary to pipeline all MAUs beyond the minimum pipeline stages shown in Figure 4.2 (the register to the right of each MAU represents a pipeline stage). Thus it is important to employ multiplication techniques that can easily be pipelined into multiple stages. It is possible to combine the multiplication unit and the adder shown in Figure 4.3 into one unit: a multiplication unit usually consists of an adder tree that adds all the generated partial products, and, as shown in Figure 4.3, an adder is required to sum the previous MAU output with the product generated by the multiplier. It is possible to use a Carry Save Adder and generate the output as two separate outputs (a sum output and a carry output). This eliminates the need for another high-speed adder at the end of each MAU.
The Adder Tree (AT) within the AU is responsible for adding all the n partial
results from the MAAs to form the output image pixel. The AT can be constructed with
Carry Save Adders (CSAs) and a Carry Look Ahead Adder (CLA). In addition, the AT
will be pipelined into multiple stages as well for performance. Figure 4.4 shows a
possible arrangement of CSAs and CLA within the AT. This example is based on a 5×5
FC size.
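The carry-save technique underlying the AT can be illustrated at the word level in Python (a behavioral sketch, not the gate-level design; the final carry-propagate addition stands in for the CLA):

```python
def csa(a, b, c):
    """3:2 carry-save adder: compresses three operands into a sum word and
    a carry word whose total equals a + b + c, with no carry propagation."""
    return a ^ b ^ c, (a & b | b & c | a & c) << 1

def adder_tree(operands):
    """Reduce the MAA partial results with CSAs, then perform one final
    carry-propagate (CLA-style) addition, as in the AT of Figure 4.4."""
    vals = list(operands)
    while len(vals) > 2:
        s, cy = csa(vals[0], vals[1], vals[2])  # compress three into two
        vals = [s, cy] + vals[3:]
    return vals[0] + vals[1] if len(vals) == 2 else vals[0]
```

Because each CSA stage involves no carry chain, only the single CLA at the root limits the clock period, which is why the tree pipelines so well.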
Figure 4.4. A possible arrangement of the AT (R denotes a single flip-flop pipeline stage; a pipeline stage is included within each CSA and CLA) for a 5×5 FC.
Another basic functional unit within the AU is the set of Delay Units (DU), which are responsible for generating the skewed input image pixels for the MAUs. However, the DUs need to be pipelined as well, with the same number of pipeline stages as the MAA.
Upon further investigation, even though replacing a high-speed adder with a CSA within a MAU can save a small amount of hardware, the replacement is not as beneficial when the architecture is reviewed at the highest level. Table 4.1 shows a direct comparison of the number of gates required for a 14-bit CSA and a 14-bit CLA (an EX-OR gate is counted as five gates). The amount of hardware saved is not as significant as the resulting increase in hardware for the AT. Figure 4.5 shows a possible arrangement of the AT if a CLA is utilized within the MAUs.
Table 4.1. Gate count comparison between CSA and CLA.
             CSA    CLA
Gate Count   182    210
Figure 4.5. One possible arrangement of the AT when CLA is utilized within the MAUs.
First and foremost, comparison of Figure 4.4 and Figure 4.5 shows that a number of CSAs are saved, along with a number of pipeline stages. This results in a large amount of hardware savings. If a CSA is used within each MAU, the number of bits (or bus lines) running from one MAU to another is doubled; hence, when implemented, the CSA approach requires more real estate within the chip (especially when implemented as an ASIC) than the CLA approach, reinforcing the need to reduce the number of CSA units. Another important hardware reduction is the halving of the number of flip-flops required for pipelining, since only one bus (one output from each MAA) is required.
In conclusion, the adder within each MAU of Figure 4.3 will be a CLA type and
the Adder Tree (AT) of Figure 4.1 will be implemented as shown in Figure 4.5 for the
case of n = 5.
4.2.1. Multiplication Unit (MU) of Multiplication and Add Unit (MAU)
The Multiplication Unit (MU) of Figure 4.3 is one of the most important
arithmetic components within the proposed convolution architecture. Thus, it is important
that a high speed and area efficient multiplication technique be derived and implemented
since the architecture requires 25 MUs for one convolution set. For each MU, an 8-bit
unsigned binary number (IP) is to be multiplied by a 6-bit signed binary number (FC)
and a 14-bit signed binary output (OI) is generated. Table 4.2 below shows a summary of
all the elements involved in the multiplication. All signed binary numbers will be
represented as 2’s complement numbers.
Table 4.2. A summary of the multiplication.
Description   Representation                                Range (Decimal)
Multiplicand  8-bit unsigned binary number                  0 to 255
Multiplier    6-bit signed binary number (2's complement)   -32 to 31
Product       14-bit signed binary number (2's complement)  -8192 to 8191
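As a quick sanity check on the ranges in Table 4.2, the extreme products of an 8-bit unsigned multiplicand and a 6-bit two's complement multiplier can be enumerated. The short Python sketch below is illustrative only (the thesis design itself is captured in VHDL); it confirms that every product fits within the 14-bit signed result:

```python
# Extreme products of an 8-bit unsigned multiplicand (0..255)
# and a 6-bit two's-complement multiplier (-32..31).
products = [a * b for a in (0, 255) for b in (-32, 31)]
lo, hi = min(products), max(products)
# A 14-bit two's-complement number spans -8192 .. 8191.
assert -2**13 <= lo and hi <= 2**13 - 1
print(lo, hi)  # -8160 7905
```

The true extremes, -8160 and 7905, leave a small margin inside the 14-bit range of -8192 to 8191.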
Multiplication in binary can be done using the same technique as with the
commonly used paper and pencil method. Partial products are generated based on each
bit of the multiplier and then all the partial products are summed to generate the product.
The number of partial products required is dependent on the number of bits of the
multiplier. Hence, as shown in Table 4.2, a 6-bit signed binary number is used as the
multiplier instead of the 8-bit unsigned binary number, since fewer multiplier bits result in fewer partial products. However, because the multiplier in this case is a signed binary number, for the regular paper and pencil method to work when the multiplier is negative, both the multiplicand and the multiplier must be two's complemented before the multiplication. Otherwise all the partial products would be positive and the result generated would be positive as well, which is incorrect, since a negative multiplier should yield a negative result. Complementing both operands switches the signs between the two while leaving the product unchanged, because (-a)(-b) = ab. In addition, all the partial products need to be sign extended for the multiplication to be correct. Figure 4.6 illustrates this multiplication concept: a copy of the multiplicand is placed into the partial product, with sign extension(s), if the respective multiplier bit is one; otherwise all zeros are placed.
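The scheme just described (negate both operands when the multiplier is negative, then sum one shifted, sign-extended copy of the multiplicand per set multiplier bit) can be modeled behaviorally. The following Python sketch is illustrative only, and the function name is hypothetical; the thesis design itself is captured in VHDL:

```python
def pencil_multiply(b, a, a_bits=6):
    """Paper-and-pencil multiply of an unsigned multiplicand b by a
    two's-complement multiplier a, as in Figure 4.6: one shifted,
    sign-extended copy of the multiplicand per set multiplier bit."""
    if a < 0:
        a, b = -a, -b          # negate BOTH operands; product unchanged
    product = 0
    for i in range(a_bits):
        if (a >> i) & 1:       # multiplier bit i selects a partial product
            product += b << i  # Python ints model the sign extension
    return product

assert pencil_multiply(255, -32) == -8160
assert pencil_multiply(200, 17) == 3400
```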
[Figure 4.6 content: the 8-bit multiplicand B7..B0 is multiplied by the 6-bit multiplier A5..A0; six sign-extended partial products, one per multiplier bit A0 through A5, are summed into the 14-bit product P13..P0.]
Figure 4.6. Illustration of the paper and pencil multiplication technique (s on each row of the partial products denotes sign extension of that particular row of partial product).
As Figure 4.6 shows, the most expensive operation is summing all the partial products into the final result. An adder that adds six operands at once is difficult to design and would likely not be speed efficient. However, a method such as the Multilevel Carry Save Adder (CSA) Tree [10] can be employed to add all the partial products. The Multilevel CSA Tree uses multiple stages of CSAs to reduce the operands to two, and a final adder stage sums these two operands to generate the final result. Depending on the speed requirement, the Multilevel CSA Tree can easily be pipelined into multiple stages to increase throughput. In addition, the final stage adder can be replaced with a fast adder such as a Carry Lookahead Adder (CLA) to reduce latency. Figure 4.7 below shows a possible arrangement of the Multilevel CSA Tree for adding six operands; a five stage pipeline can be implemented with this configuration.
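The reduction performed by the Multilevel CSA Tree can be sketched behaviorally. The Python model below is illustrative only (function names are hypothetical): it reduces any number of non-negative operands to two using 3:2 carry-save steps and finishes with one ordinary addition, which plays the role of the final CLA in hardware:

```python
def csa(x, y, z):
    """3:2 carry-save adder: the sum word is the bitwise XOR, the
    carry word the bitwise majority shifted left one place; their
    total equals x + y + z (non-negative operands assumed)."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1

def csa_tree_sum(operands):
    """Reduce the operand list to two with repeated CSA stages,
    then finish with one conventional add (the CLA in hardware)."""
    ops = list(operands)
    while len(ops) > 2:
        s, c = csa(ops[0], ops[1], ops[2])
        ops = ops[3:] + [s, c]
    return sum(ops)

assert csa_tree_sum([3, 5, 7, 11, 13, 17]) == 56
```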
It is possible to reduce the hardware count if the number of partial products can be
reduced. This can be done thorough use of the Modified Booth’s Algorithm (MBA) [13].
The MBA inspects three multiplier bits at a time and generates respective partial product
selections. Compared to the MBA, the original Booth's Algorithm (BA) [4] inspects two bits of the multiplier at a time, so the number of partial products generated remains proportional to the number of multiplier bits. The MBA reduces the number of partial products required to (x/2 + 1), where x is the number of bits of the multiplier. Thus, for a 6-bit multiplier, the number of partial products generated is reduced from six to four. However, the Partial Products Generator's (PPG) complexity is increased due to the different possible outputs for each partial product. Table 4.3 below gives a summary of the possible outputs for a partial product based on the three multiplier bits examined.
[Figure 4.7 content: the Partial Products Generator produces PP0 through PP5 from the multiplicand and multiplier; four CSAs arranged over three stages, followed by a final CLA, reduce the six partial products to the product.]
Figure 4.7. One possible arrangement of Multilevel CSA Tree for six partial products.
Table 4.3. Partial Product Selection Table.
Multiplier Bits   Selection
000               0
001               + Multiplicand
010               + Multiplicand
011               + 2×Multiplicand
100               - 2×Multiplicand
101               - Multiplicand
110               - Multiplicand
111               0
As shown in Table 4.3, each partial product generated can have a different output and thus the hardware complexity of the PPG is increased. However, each selection is easily obtained: 2× the multiplicand is a left shift by one position, and the negative selections are formed by complementing and adding one. Figure 4.8 below illustrates the changes to the Wallace Tree when the MBA is employed; compared to Figure 4.7, two CSAs are saved.
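The selection logic of Table 4.3 can be modeled behaviorally. In the illustrative Python sketch below (the function name is hypothetical), the three Booth-recoded partial products for a 6-bit multiplier are generated; because Python's signed integers absorb the sign-correction row that the hardware carries as a fourth partial product, the three terms sum directly to the true product:

```python
def booth_partial_products(b, a, a_bits=6):
    """Radix-4 Modified Booth recoding: examine overlapping 3-bit
    groups of the multiplier a (with an implicit 0 below bit 0) and
    select 0, +/-b, or +/-2b for each group, per Table 4.3."""
    a &= (1 << a_bits) - 1                # two's-complement bit pattern
    select = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
              0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    pps = []
    for i in range(0, a_bits, 2):
        group = ((a << 1) >> i) & 0b111   # bits i+1, i, i-1 of a
        pps.append(select[group] * b << i)
    return pps

# Three recoded partial products instead of six, summing to the product:
assert sum(booth_partial_products(255, -32)) == -8160
assert sum(booth_partial_products(200, 17)) == 3400
```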
Figure 4.8. Multiplier based on Modified Booth’s Algorithm and Wallace Tree Structure.
Figure 4.9 below depicts the detailed multiplication technique when the MBA is employed to reduce the number of partial products required. Some hardware saving can be achieved where a Full Adder (FA) can be replaced with a Half Adder (HA) within the MU. Figure 4.10 shows the reduced sign extension within the partial products, which in turn contributes to hardware savings [12].
[Figure 4.9 content: the Partial Products Generator produces PP0 through PP3 (three Booth-recoded partial products, based on bit groups (0, A0, A1), (A1, A2, A3), and (A3, A4, A5), plus a row of sign corrections s1, s2, s3); two CSA stages and a final 14-bit CLA reduce them to the product P13..P0.]
Figure 4.9. Illustration of multiplication technique based on Modified Booth's Algorithm.
[Figure 4.10 content: the same partial product array as Figure 4.9, with the full sign extension of each row reduced to a few correction bits.]
Figure 4.10. Partial Product's sign extension reduced for hardware saving.
Table 4.4 below shows the hardware comparison between multiplication
techniques shown in Figure 4.7 and Figure 4.8. For simplicity, the multiplication
technique shown in Figure 4.7 is denoted as method I and method II denotes the
multiplication technique shown in Figure 4.8. Also, Exclusive-OR gates (EX-OR) within
both methods are counted as five gates. FF in Table 4.4 denotes Flip-Flops.
Table 4.4. Comparison between method I and method II.
            # FA   # HA   # FF   # CLA   Gate Count (excluding FF)
Method I    37     5      78     1       691
Method II   9      15     72     1       641
From Table 4.4, the total gate counts, excluding FFs, are nearly identical, but multiple pipeline stages are included in order to meet the speed requirement. The maximum gate delay for both methods is identical and is set by the CLA (10 gate delays) [11]. Even though the hardware saving between the two methods appears modest, the number of replications of the unit can make it significant. Note also that, for method I to handle two's complement, both the multiplicand and the multiplier must be complemented (complement each bit and add one to the result) whenever the multiplier is negative, so extra overhead is required for method I to function correctly. A workaround that avoids this complementing hardware is to reverse the roles of the multiplicand and the multiplier. This guarantees that the multiplier is always positive, but in return more partial products must be generated, since the multiplier is then an 8-bit unsigned number.
The multiplication technique of Figure 4.8 will be employed in the version 2 architecture because it requires less hardware than the technique of Figure 4.7.
4.2.2. Delay Units (DU)
The Delay Units (DU) are responsible for generating the skewed input image pixels for the MAA. The number of stages within a DU is directly proportional to n and to the number of pipeline stages employed within the MAA. Figure 4.11 below shows the organization of the flip-flops within the DU. The number of stages shown assumes n = 5 and two pipeline stages within each MAU (one pipeline stage after the Multiplication Unit and another after the Adder; the first MAU is excluded since it has only a Multiplication Unit and hence one pipeline stage), as shown in Figure 4.1.
[Figure 4.11 content: pipeline stages PL1 through PL7 within the DU; bus widths taper from 32 bits down to 8 bits through the chain, and 8-bit taps DU0 (7-0) through DU4 (39-32) present progressively delayed copies of the IDS output.]
Figure 4.11. Functional units within the DU for the case of n = 5 and two pipeline stages within each MAU (PL denotes a pipeline stage composed of flip-flop registers).
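The behavior of one DU tap can be sketched as a simple delay line. The Python model below is illustrative only; register depths and names are assumptions, not the VHDL implementation:

```python
from collections import deque

def make_delay_line(depth):
    """One DU tap: a chain of `depth` pipeline registers (flip-flops)
    that delays every input pixel by `depth` clock cycles."""
    regs = deque([0] * depth)        # registers reset to zero
    def clock(pixel_in):
        regs.append(pixel_in)        # shift in the new pixel
        return regs.popleft()        # shift out the oldest pixel
    return clock

# A two-register tap skews its pixel stream by two clock cycles:
tap = make_delay_line(2)
assert [tap(p) for p in [10, 20, 30, 40]] == [0, 0, 10, 20]
```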
4.3. Input Data Shifters (IDS)
This functional unit (see Figure 4.1) is responsible for providing the AU with the
correct input image pixel sequence. Figure 4.12 below shows the structural view of the
Input Data Shifters (IDS). There are n shift registers (S Registers) within the IDS with
each register capable of holding n input image pixels and with parallel load capability.
Each input image pixel is d bits wide. All shift registers also need to be able to shift all n
input image pixels in parallel.
[Figure 4.12 content: S Register0 through S Registern-1 chained by 40-bit parallel buses; the output from the DM I/F loads S Register0, and outputs IDS0 through IDSn-1 feed the AU.]
Figure 4.12. Structural view of the IDS with n = 5 and d = 8.
As shown in Figure 4.12 above, all data within the structure are shifted in parallel
(in this example, it is 40 bits) from one shift register to another shift register. For
example, S Register0 is loaded with 40 bits in parallel from the output of DM I/F and
shifts 40 bits in parallel into S Register1.
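This parallel shifting behavior can be modeled concisely. The Python sketch below is illustrative only (the actual unit is described in VHDL): n registers, each holding n d-bit pixels, perform a full n×d-bit (here 40-bit) parallel shift on every clock:

```python
def make_ids(n=5, d=8):
    """Behavioral model of the Input Data Shifters: n shift registers,
    each holding n d-bit pixels, shifted in parallel (n*d bits) from
    one register to the next on every clock."""
    regs = [[0] * n for _ in range(n)]
    def clock(row_in):                  # n pixels loaded from DM I/F
        for i in reversed(range(1, n)):
            regs[i] = regs[i - 1]       # n*d-bit parallel shift
        regs[0] = list(row_in)
        return regs                     # outputs to the AU
    return clock

ids = make_ids()
ids([1, 2, 3, 4, 5])
state = ids([6, 7, 8, 9, 10])
assert state[0] == [6, 7, 8, 9, 10] and state[1] == [1, 2, 3, 4, 5]
```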
4.4. Data Memory Interface (DM I/F)
The Data Memory Interface (DM I/F) of Figure 4.1 will remain unchanged from
section 3.5. See Figure 3.11 and Figure 3.12.
4.5. Memory Pointers Unit (MPU)
The external memory devices (see Figure 4.1) that are required by the architecture
are read and written through several memory pointers within the Memory Pointer Unit
(MPU). In order to achieve a minimum number of writes to the external memory devices,
MPU receives and stores n (five) input image pixels (for a 5×5 filter size) before it writes
all five input image pixels to the memory location pointed to by one of the memory
pointers. Thus, the bus width for the interconnection between the memory devices and
the architecture is 40 bits (n×d). If the memory accesses cannot keep up with the main
system clock, then the memory bandwidth can be increased to reduce the number of
accesses to the external memory devices.
[Figure 4.13 content: the address space of external memory devices a and b divided into segments of 1024 locations starting at addresses 0, 1024, 2048, 3072, and 4096, designated to ptr_a (ptr_0) through ptr_e (ptr_n-1).]
Figure 4.13. External memory devices organization for n = 5 and d = 8.
Figure 4.13 above shows the memory organization of the external memory devices for the case of n = 5 and d = 8, with a different segment of the memory designated for each memory pointer, while Figure 4.14 below shows the functional units within the MPU, again for the case of n = 5 and d = 8. Each of the n memory segments is capable of storing one row of input image pixels. For example, for an n×d (40) bit memory bus width and a paper width of 5100 pixels, each segment of the memory must hold at least 1020 locations (5100/5). In Figure 4.13 above, 1024 locations are allocated to each memory pointer. By allocating 1024 locations to each memory segment, the three most significant bits of each memory pointer can be used to differentiate the memory segments. Also, for every five output pixels generated, every memory pointer shares a common ten least significant bits, except for the memory pointer that is used to write the current pixels into the external memory devices; the other four memory pointers must pre-fetch all necessary input image pixels for the next convolution iteration into the cache memory. Thus, two 10-bit counters (col_cntr #1 and col_cntr #2) are needed to generate memory addresses, as shown in Figure 4.14.
[Figure 4.14 content: registers holding the three most significant bits of ptr_a through ptr_e, selected by the 3-bit reg_sel signal; the two 10-bit column counters col_cntr #1 and col_cntr #2 supply the least significant bits, producing the 13-bit address output add_out.]
Figure 4.14. Functional units within the Memory Pointers Unit (MPU).
As shown in Figure 4.14 above, the three most significant bits of each memory pointer are stored in registers. Note that when the architecture is reset, each register is initialized to a different value: Memory Pointer b (ptr_b) is initialized to 001, Memory Pointer c (ptr_c) to 010, Memory Pointer d (ptr_d) to 011, Memory Pointer e (ptr_e) to 100, and Memory Pointer a (ptr_a) to 000. This ensures that each memory pointer is designated to a specific memory segment. In addition, the memory pointers are rotated, as indicated in Figure 4.14 above, whenever a row of input image pixels is completed, because the least recent row of input image pixels stored within the external memory devices is no longer needed and can be overwritten with the current input image pixels.
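The segment addressing and pointer rotation can be sketched behaviorally. This Python model is illustrative only; the rotation order and the index-to-pointer mapping are assumptions based on Figure 4.13 and Figure 4.14:

```python
def make_mpu(seg_bits=10):
    """Behavioral model of the Memory Pointers Unit: each pointer owns
    a 1024-location segment selected by its three MSBs, and the
    pointers rotate when a row of input image pixels completes, so the
    oldest row's segment is overwritten by the incoming row."""
    msbs = [0b000, 0b001, 0b010, 0b011, 0b100]   # ptr_a .. ptr_e
    def address(ptr_index, col_count=0):
        # 13-bit address: 3 segment-select MSBs, 10 column-counter LSBs
        return (msbs[ptr_index] << seg_bits) | col_count
    def end_of_row():
        msbs.append(msbs.pop(0))                 # rotate the pointers
    return address, end_of_row

addr, end_of_row = make_mpu()
assert addr(1) == 1024        # ptr_b initially addresses segment 001
end_of_row()
assert addr(0) == 1024        # after rotation, ptr_a reuses that segment
```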
4.6. Systolic Flow of Version 2 Convolution Architecture
This section shows how the input image data flows through the AU of Figure 4.1
for the case of (n = 5) and how each output pixel (OI) is generated in Version 2 of the
Convolution Architecture. Figure 4.15 below shows the systolic flow of the data for
Version 2 of the Convolution Architecture. In order to simplify the figure, all pipeline
stages within the MAAs are ignored and the figure corresponds to Figure 3.10 with the
same convolution point and input image pixels. As can be seen in Figure 4.15 below, at
time t0 every MAA will multiply the input image pixel received with the filter coefficient
that is stored within the first MAU (within the MAA). The next time instant, t0 + 1td where
td denotes the pipeline delay between two MAUs within each MAA, the previous product
from MAU0 (in each MAA) is summed with the product of MAU1. This process continues
until time instant t0 + 4td when all input image pixels (for one convolution point) have
flowed through MAU4 within each MAA and they will be summed by the AT the next
cycle to generate one output pixel.
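Because every multiply-accumulate term appears exactly once, the systolic flow of Figure 4.15 computes the same value as a plain two-dimensional dot product. The Python sketch below is an illustrative behavioral model with a hypothetical function name, not the VHDL pipeline; it mimics the step-by-step accumulation:

```python
def systolic_convolve_point(FC, IP):
    """Each MAA k accumulates one FC[row][k] * IP[row][k] term per
    time step (rows entering bottom-up, as in Figure 4.15), and the
    Adder Tree sums the n MAA results into one output pixel."""
    n = len(FC)
    maa_sums = [0] * n
    for step in range(n):          # t0, t0+1td, ..., t0+(n-1)td
        row = n - 1 - step         # FC row active at this step
        for k in range(n):         # one MAU fires in each MAA
            maa_sums[k] += FC[row][k] * IP[row][k]
    return sum(maa_sums)           # Adder Tree output

FC = [[1, 0, -1]] * 3
IP = [[5, 6, 7], [8, 9, 10], [11, 12, 13]]
assert systolic_convolve_point(FC, IP) == sum(
    FC[r][c] * IP[r][c] for r in range(3) for c in range(3))
```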
4.7. Controller
The controller for version 2 of the architecture, shown in Figure 4.1, is responsible only for controlling the DM I/F and the Memory Pointers Unit (MPU) described above. The AU and the IDS need no controller to regulate their activities: both are clocked by the main clock, and all necessary input image pixels propagate through their pipeline stages as required.
At clock cycles t0, (t0 + 1td), (t0 + 2td), (t0 + 3td), and (t0 + 4td), each MAA holds the following running sums (each term is added to the sum of all terms to its left):
MAA0:  FC40IP16;  + FC30IP26;  + FC20IP36;  + FC10IP46;  + FC00IP56
MAA1:  FC41IP15;  + FC31IP25;  + FC21IP35;  + FC11IP45;  + FC01IP55
MAA2:  FC42IP14;  + FC32IP24;  + FC22IP34;  + FC12IP44;  + FC02IP54
MAA3:  FC43IP13;  + FC33IP23;  + FC23IP33;  + FC13IP43;  + FC03IP53
MAA4:  FC44IP12;  + FC34IP22;  + FC24IP32;  + FC14IP42;  + FC04IP52
Figure 4.15. Pictorial view of the data flow within the MAAs for one output pixel (td denotes the time delay between each MAU).
Figure 4.16 below shows the top level view of the Controller Unit (CU) with the
input and output control signals shown. The CU is responsible for generating control
signals to functional units within the DM I/F and the MPU.
[Figure 4.16 content: Controller Unit (CU) inputs: clock (clk); reset (rst); beginning of a row (bor); end of column (eoc); shut down signal (sds); row greater than (rgt). Outputs: f_sel, cache banks select line; z_pad, zero padding; reg_sel, registers select (3 bits); en_w, write enable for cache (2 bits); en_sf, shift enable for cache (2 bits); z_input, zero input; c_inc, column counter increment (2 bits); rot, rotate memory pointers; r_inc, row counter increment; r_w, memory read/write line; sd_inc, shut down counter increment.]
Figure 4.16. Top level view of the Controller Unit (CU).
Figure 4.17 below shows the functional units for which the CU generates control signals, for the case of n = 5 and d = 8. The functional units labeled C_BANK1 and REG_A are contained within the DM I/F, whereas the functional unit labeled MEMPTRS is the MPU referred to above. The MPU is responsible for generating memory addresses for the external memory devices, which store all the necessary input image pixels (IP) for each convolution. C_BANK1 supplies input image pixels to the Input Data Shifters (IDS) and pre-fetches the necessary input image pixels from the external memory devices for the next iteration of convolutions; in other words, it serves as a cache memory for the convolution system. The functional unit labeled REG_A stores the most recent input image pixels received from the external scanning device and later writes them to the external memory device when its register is full. In addition, this unit also supplies the most current input image pixels to the IDS.
Figure 4.17. Functional Units that receive control signals from the CU.

The convolution system is pipelined into multiple stages requiring synchronized
operation. Thus, the CU is modeled as a finite state machine. Figure 4.18 below shows
the system flow chart for the CU. The system flow chart describes micro-operations of
the system on a clock-cycle by clock-cycle basis and it also indicates values that must be
assigned to appropriate control signals of the architecture on each clock cycle of
operation. Operation of the system flow chart shown below can be divided into three
segments. The first segment runs from the beginning of the flow chart until the row counter (row_cntr) exceeds a count of one. This segment is active as the input image pixels of a scanned page begin to arrive; the convolution process starts only after the first two rows of input image pixels have been received. In this first segment, the two received rows of input image pixels are stored in the external memory device. The purpose of the tog signal within the flow chart is to alternate writing between the two Regfiles within C_BANK1 to avoid data starvation from the external memory device. There are two column counters (col_cntr #1 and col_cntr #2) within the MEMPTRS functional unit: the first generates addresses for ptr_a to write to the external memory device, while the second runs one count ahead of the first to pre-fetch the remaining rows (via ptr_b, ptr_c, ptr_d, and ptr_e) from the external memory device.
After the first two rows of input image pixels have been stored, the convolution process can start as the third row of input image pixels is received. The second operational segment of the flow chart, which starts at the decision box testing whether the row count is greater than one (row_cntr > 1) and ends at connector A in the figure, is active while the convolution process runs. This segment of the flow chart continues until all input image pixels for the entire scanned page have been received.
The last segment of the flow chart starts from the connector A and continues until
the end of the flow chart. This segment is mainly responsible for supplying the system
with zeros as input to the system until the last two rows of the output pixels are
completely generated. As can be seen from the flow chart, a special counter (sd_cntr) is
designated for counting to two for indication of the end of the convolution process.
Control signals such as tog, en_w and en_sf are expected to retain their latest value as the system transitions from one micro-operation to another, so memory elements such as latches are required. If this compromises the CU's speed and performance, then a modification to the system flow chart such as the one shown in Figure 4.19 below would be desirable.
The system flow chart shown in Figure 4.19 contains extra states added to eliminate the shared states (after each decision branch) shown in Figure 4.18. This modification removes the memory elements for the signals tog, en_w and en_sf, which would otherwise need to be toggled after each branch following the decision states, and thereby reduces the control signal generation delay.
Figure 4.18. System flow chart for Version 2 convolution architecture’s Controller Unit (CU).
Figure 4.18. (Continued) System flow chart for Version 2 convolution architecture’s Controller Unit (CU).
Figure 4.19. Modified Version 2 system flow chart.
Figure 4.19. (Continued) Modified Version 2 system flow chart.
4.8. Multiple Filter Coefficient Sets when (k > 1)
To address the need to simultaneously convolute k different sets of Filter
Coefficients (FC) with a single Input Image Plane (IP), such as when scanning and
printing color images, the version 2 architecture will require some hardware to be
replicated. Figure 4.20 below shows a high level view of the arrangement of the
additional required replicated hardware. For each additional FC set, one additional AU
will need to be added. However, not all the functional units within the AU need to be
replicated. A common DU (within the MAAs) can be shared among all the AUs for
additional FC sets (see Figure 4.2 for detail within a MAA). For example, all the MAA0s
within all the AUs can all share a common DU rather than each MAA0 having its own DU.
[Figure 4.20 content: the Input Image Plane (IP) feeds the Data Memory Interface (DM I/F) and the Input Data Shifters (IDS0 through IDSn-1), which drive AU0 through AUk-1 (each containing MAA0 through MAAn-1 and an AT); each AU holds its own filter coefficient set FC1 through FCk and produces outputs OI1 through OIk; a single Controller Unit (CU), clock (CLK), and Data Memory (external or internal) serve all AUs.]
Figure 4.20. Version 2 architecture for k (n×n) filter coefficient sets (where k can be any number).

The CU, DM I/F, and IDS functional units of Figure 4.20 are functionally and
operationally identical to the same units of Figure 4.1 for a given n and d and only one
instantiation of these units is required when k filter coefficient sets are used. This
enhances the scalability of the convolution architecture when expanded to handle
multiple FC planes. Likewise, the CU does not have to control any of the AUs of Figure
4.20. It only has to control the DM I/F and IDS units.
The version 2 convolution architecture of Figure 4.20, from a functional and
performance standpoint, can now simultaneously convolute a single IP with k (n×n) FCs
resulting in k convoluted OI pixels (OI1, OI2, … OIk) on each system clock cycle. This
functionality and performance of the version 2 architecture will first be validated via
HDL post-synthesis and post-implementation simulation in a later chapter of the thesis.
The functionality and performance will finally be validated in a later chapter via
development and experimental testing of a FPGA based hardware prototype.
Chapter 5
VHDL Description of Version 2 Convolution Architecture
This chapter describes the VHDL coding style and approach used to capture the Version 2 convolution architecture. After the design is captured in VHDL, it is synthesized and implemented on a targeted FPGA. Before a hardware prototype is built, functional and performance level simulations are run to validate the design's functionality and determine its performance.
Modular and bottom up hierarchical design approaches were employed during the VHDL design capture process. The modular design approach partitions the entire system into smaller modules or functional units that can be independently designed and described in VHDL. In addition, identical modules (with the same functionality) can share the same VHDL code or reuse a previously designed module. The bottom up hierarchical design approach allows a multiple level view of the entire system for ease of design. Hence, by employing these approaches, the smaller modules or functional units can be tested and validated before they are combined into the complete system.
For prototype purposes the Version 2 convolution architecture is captured with
three AUs instantiated (k = 3), no pipeline stages are built into the multiplication units,
and the architecture is tailored to an input image plane of size 5×60 pixels. The VHDL
described system has a total of 13 pipeline stages within each AU. In addition, the
external memory device as shown in Figure 4.1 is described in VHDL and incorporated
into the overall system and will thus, for the experimental hardware prototype, be
implemented within the FPGA chip containing the other functional units of the
convolution architecture.
Figure 5.1 below shows the organization of functional units within the convolution architecture. For simplicity of the chart, only the main functional units are shown; sub-modules within the main functional units are omitted. Both behavioral and structural level coding styles were used during the VHDL coding process. Behavioral level coding has the advantage that only the behavior of a module is described in the code and the CAD software must infer the internal logic blocks. However, this may introduce inconsistency, since different CAD software may infer different logic blocks for the same code. For this thesis, the behavioral level coding style was employed for most of the functional units; however, all the various sized adders and multiplication units were coded at the structural level, in order to validate the correctness of the multiplication and addition techniques proposed in the previous chapter.
Convolution Architecture
  External Memory Device
  Data Memory Interface
    Memory Pointers Unit
    Cache Unit
  Arithmetic Unit
    Multiplication and Add Array
      Multiplication and Add Unit
        Multiplication Unit
        Various sized Adders
    Adder Tree
  Controller Unit
  Input Data Shifters
Figure 5.1. Version 2 Convolution Architecture organization.
After the system is captured through VHDL, post-synthesis and post-
implementation HDL software simulation can determine if the system is functioning and
performing as it should. The next chapter presents post-synthesis and post-
implementation simulations of the convolution architecture. All VHDL code for Version
2 of the Discrete Convolution Architecture with three AUs (k = 3, see Figure 4.20) is
included in Appendix A. The code is appropriately commented such that one should be
able to identify the VHDL code describing all functional units of the convolution
architecture system.
Chapter 6
Version 2 Convolution Architecture Validation via Virtual Prototyping (Post-Synthesis and Post-Implementation Simulation Experimentation)
Hardware Description Language (HDL) simulation of an architecture design,
sometimes known as virtual prototyping, is an important step in the design flow for fine
tuning and detecting potential problem areas before the design is implemented or
manufactured. In this section, Post-Synthesis simulation results and Post-Implementation
simulation results of version 2 of the convolution architecture will be presented. Both
Post-Synthesis simulation and Post-Implementation simulation are utilities contained
within the Xilinx Foundation 4.1i CAD software packages utilized during this project
[18]. During the process of validating version 2 of the convolution architecture, the computer system used to run the software had the following configuration: an Intel Pentium III 450 MHz processor, 128 MB of memory, and the Windows 98 Second Edition operating system.
After a design has been captured either through schematic capture or via HDLs
such as VHDL or Verilog, software HDL simulation of the design is the next step in the
design flow for functional and timing validation. Software HDL simulation has the
advantage of identifying potential problem areas before a design is implemented (for
FPGA) or manufactured (for ASIC) and hence correction or modification can be made.
Both Post-Synthesis simulation and Post-Implementation simulation are used within the FPGA prototyping design flow because Post-Synthesis simulation validates the functionality of the design, whereas Post-Implementation simulation validates both the functionality and the timing (performance) of the design. Both utilities are therefore important tools for understanding the characteristics of a particular design.
The testing methodology employed in this project uses a bottom up approach: lower level functional components, such as the various types of adders and multipliers, were tested before being combined into higher level functional units. This approach is desirable because it assures that, when the lower level components are combined into higher level functional units, the lower level components are unlikely to be at fault if errors are detected.
6.1. Post-Synthesis Simulation
In order to be assured that version 2 of the convolution architecture functionally
performs as intended in the previous sections, Post-Synthesis simulation was utilized for
functional level validation. Post-Synthesis HDL simulation of a system is simulation of
the system as synthesized to netlist (gate-level) form and zero propagation delay is
assumed through gates. To determine the correctness of the functional unit under test, all
possible input vectors are required to be applied and checked against known correct
outputs or expected outputs from the functional unit under test. Thus, testbenches are
required and need to be developed for this purpose. However, if the number of inputs
to the functional unit under test is large, exhaustively testing all possible input
stimuli can be quite complex. Hence, automated generation of the testbenches is
preferred. To achieve this, a C++ program was written to generate testbenches in the
required format. For ease of re-running the simulation process, the script file editor,
a feature of the Xilinx Foundation Simulator, was used to eliminate manual entry of
test vectors after each simulation run. Figure 6.1 below shows the testing model used
for verifying the functionality of lower level functional components.
The testing model shown in Figure 6.1 below was captured in VHDL as an entity in
which the functional unit under test is instantiated and its output is compared to the
expected (theoretically correct) result from the testbench.
6.1.1. Adders
Different types of Carry Lookahead Adders (CLAs) are employed within the
convolution system. They differ mainly in the length of the operands they operate on
and, accordingly, are referred to as the 14-bit, 15-bit, 16-bit, 17-bit and 19-bit CLAs.
Post-Synthesis simulation was used to check that these lower level functional
components operate as intended. The most heavily utilized is the 14-bit CLA, which is
duplicated within each Multiplication Unit (MU) contained in the convolution system.
[Diagram: stimulus/test vectors from the testbench drive the Functional Unit Under
Test, whose output is compared against the expected (theoretically correct) result from
the testbench; the comparator's err output is zero when both results are identical and
one otherwise.]
Figure 6.1. Testing model for lower level functional components.
The testing methodology described in Figure 6.1 was used. The VHDL file that
contains the testbench entity and C++ program source code that generates the
theoretically correct outputs (in a file format that is acceptable to the script editor for
software simulation) can be found in Appendix B. Figure 6.2 shows the Post-Synthesis
simulation output of the testbench for the 14-bit CLA. Due to the length of an
exhaustive simulation, a selective set of test vectors was used rather than all possible
inputs. As can be seen from Figure 6.2, the signal err remains low throughout the
simulation, indicating that the outputs from the unit under test agree with the
theoretically correct outputs generated by the C++ program.
Figure 6.2. Post-Synthesis simulation for 14-bit CLA.
Figure 6.3 below shows a close up view for one segment of the testbench
simulation shown in Figure 6.2 above. Buses vec_a and vec_b are the two input operands,
while ans_ut is the output from the unit under test (14-bit CLA) and ans is the
theoretically correct output. For instance, at the left-most position of bus vec_a one
sees a hexadecimal value of 0008 (8 in decimal) and vec_b shows a hexadecimal value
of 2000 (-8192 in decimal as a signed 14-bit value); thus the sum should be 2008
(-8184 in decimal), which is the value shown on both buses ans and ans_ut.
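The signed 14-bit arithmetic in this example can be checked with a short C++ sketch; this is only a software reference model of the number interpretation, not the synthesized VHDL adder.

```cpp
#include <cassert>
#include <cstdint>

// 14-bit addition wraps modulo 2^14, matching the CLA's output width.
uint16_t add14(uint16_t a, uint16_t b) { return (uint16_t)((a + b) & 0x3FFF); }

// Interpret a 14-bit pattern as a two's complement signed value.
int to_signed14(uint16_t v) { return (v & 0x2000) ? (int)v - 0x4000 : (int)v; }
```

Here add14(0x0008, 0x2000) yields 0x2008, and to_signed14 maps 0x2000 to -8192 and 0x2008 to -8184, matching the waveform values above.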
Figure 6.3. A close up view of one segment of Figure 6.2.
The procedure for testing all the other CLAs with different operand lengths is the
same as shown above. Figure 6.4, Figure 6.5, Figure 6.6, and Figure 6.7 show Post-
Synthesis simulation results of the testbenches for each CLA. As can be seen from the
figures, the err signal stays low throughout, indicating that the outputs from each unit
under test agree with the predicted correct results generated by the C++ program.
Figure 6.4. Post-Synthesis simulation for 15-bit CLA.
Figure 6.5. Post-Synthesis simulation for 16-bit CLA.
Figure 6.6. Post-Synthesis simulation for 17-bit CLA.
Figure 6.7. Post-Synthesis simulation for 19-bit CLA.
6.1.2. Multiplication Unit
The Multiplication Unit (MU) is a lower level component that is replicated 25
times within each AU of version 2 of the convolution architecture for the case of n = 5.
Hence, it is important to determine that the MU functions correctly. In order to test the MU
with all possible inputs, a C++ program was written to generate the required testbench;
the program can be found in Appendix B. In addition, an entity was created in a VHDL
file with MU being instantiated and a comparator was also instantiated to compare the
output generated by MU with the theoretically correct output from the program (as an
input to the entity). This VHDL file can also be found in Appendix B.
Figure 6.8 below shows the complete run of the testbench. Bus coef is a 6-bit
signed filter coefficient bus, bus mag is an unsigned 8-bit magnitude input, bus product
is the output generated by the unit under test (the MU in this case), and bus t_ans is the
theoretically correct output. Signal err is the output from the comparator; it is one if
the outputs t_ans and product do not match. As shown in Figure 6.8 below, all the
buses are packed too closely to be distinguished due to the length of the simulation;
however, the err signal remains low for the entire simulation. Thus, buses t_ans and
product are identical throughout the simulation.
Figure 6.8. Post-Synthesis simulation for all possible inputs for the Multiplication
Unit (MU).
Figure 6.9 below shows a close up view of one segment of the simulation. For
instance, in the first part of Figure 6.9, bus coef has a value of 21 (-31 in decimal as a
signed 6-bit value) and bus mag has a value of FB (251 in decimal); thus the product
should have the 14-bit signed value 219B in hexadecimal (-7781 in decimal). Buses
product and t_ans both have this value, and thus the err signal has a value of zero,
indicating that the two agree.
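The MU's arithmetic can likewise be modeled in a few lines of C++; this is a software reference only (the function names are ours), not the hardware multiplier itself.

```cpp
#include <cassert>
#include <cstdint>

// Interpret a 6-bit pattern as a two's complement signed coefficient.
int to_signed6(uint8_t c) { return (c & 0x20) ? (int)c - 64 : (int)c; }

// Multiply a signed 6-bit coefficient by an unsigned 8-bit magnitude and
// return the product as a 14-bit two's complement pattern. The product
// range (-32*255 .. 31*255) fits within 14 signed bits.
uint16_t mu_product(uint8_t coef, uint8_t mag) {
    int p = to_signed6(coef) * (int)mag;
    return (uint16_t)(p & 0x3FFF);
}
```

For the worked example, mu_product(0x21, 0xFB) gives 0x219B, i.e. -31 × 251 = -7781.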
Figure 6.9. A close up view of one segment of the simulation in Figure 6.8 above.
6.1.3. Version 2 Convolution Architecture (with k = 1)
In the process of testing Version 2 of the convolution architecture as a whole unit
for the case of k = 1 (see Figure 4.20), a few minor modifications were made to the
system such that the Post-Synthesis simulation can be completed within a reasonable time
frame. However, these modifications do not affect the system’s intended characteristics.
For instance, the intended Input Image Plane (IP) has a size of 5100×6600 pixels; for
simulation purposes the IP size was reduced to 5×60 pixels. This reduction in no way
hampers the system's functional characteristics. The Filter Coefficient Plane (FC)
remains the same as in the previous sections: a 5×5 plane. Figure 6.10 below shows a test
case used to verify the functional correctness of Version 2 of the convolution
architecture. The IP has a size of 5×60, but due to the page width limit (this thesis) only
the first seven columns of the IP can be clearly shown. The same can be said about the
Output Image Plane (OI) in Figure 6.10.
A C++ program was written to generate the test vectors required to program all
MAUs with the correct filter coefficients. This C++ program can read a text file with
filter coefficients indicated within and then generate waveform vectors for the script
editor to use. Figure 6.11 below shows the source file for the program.
Input Image Plane (Decimal)
  0   1   2   3   4   5   6
 60  61  62  63  64  65  66
120 121 122 123 124 125 126
180 181 182 183 184 185 186
240 241 241 241 241 241 241
Filter Coefficient Plane (Decimal)
-1  2  3  0  1
 1 -1  2  1  0
 0  0  1  1  1
 1  1  0  1  1
 2 -2  1  0  1
Output Image Plane (Hexadecimal)
00259 0029C 0031D 00328 00333 0033E 00349
00400 004BD 005B9 005C8 005D7 005E6 005F5
0061F 00790 0093D 0094A 00956 00962 0096E
003C5 005E7 00756 0075F 00768 00771 0077A
002D5 0047D 0069E 006A5 006AB 006B1 006B7
Figure 6.10. Test case 1 with IP and OI of size 5×60 (however, only the first seven columns of both IP and OI are shown due to report page width limit).
The operation of the convolution architecture can be divided into three phases.
The first phase begins when the system commences operation and lasts until the first
two rows of the IP have been stored into the external memory devices; no OI is
generated during this phase. The second phase begins when the system has received
enough of the IP to commence generation of the OI, and ends when the IP has been
received completely. Finally, the third phase begins with the system being provided
with zeros as input and continues until the entire OI has been generated.
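As a point of reference, the computation the three phases carry out is a 5×5 windowed sum over the image. The C++ sketch below is a plain software model only: the zero padding at the image borders is an assumption made here so that the output plane has the same size as the input plane, and the sketch does not model the pipeline timing or the 90 degree pixel rotation of the actual architecture.

```cpp
#include <cassert>
#include <vector>

// Software reference: 5x5 windowed sum with assumed zero padding,
// producing an output plane the same size as the input plane.
std::vector<std::vector<long>>
convolve5x5(const std::vector<std::vector<int>>& ip, const int fc[5][5]) {
    int rows = (int)ip.size(), cols = (int)ip[0].size();
    std::vector<std::vector<long>> oi(rows, std::vector<long>(cols, 0));
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            long acc = 0;
            for (int i = 0; i < 5; ++i)
                for (int j = 0; j < 5; ++j) {
                    int rr = r + i - 2, cc = c + j - 2;  // window around (r, c)
                    if (rr >= 0 && rr < rows && cc >= 0 && cc < cols)
                        acc += (long)fc[i][j] * ip[rr][cc];
                }
            oi[r][c] = acc;
        }
    return oi;
}
```

With an all-ones 5×5 image and all-ones FC plane, the center output pixel sums all 25 products while a corner pixel sums only the 3×3 overlap, illustrating the border behavior under the assumed padding.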
#include <iostream.h>
#include <iomanip.h>
#include <fstream.h>

int main()
{
    ifstream in_file1;
    ofstream out_file1, out_file2;
    in_file1.open("coef.txt");
    out_file1.open("v_coef.dat");
    out_file2.open("v_c_reg.dat");

    int array[5][5];
    int time, count, temp, a, b;
    time = 40;
    count = 1;

    // Read the 5x5 filter coefficient plane, storing it rotated.
    for (a = 4; a >= 0; a--) {
        for (b = 0; b < 5; b++) {
            in_file1 >> temp;
            cout << temp << endl;
            array[b][a] = temp;
        }
    }

    out_file1 << "@" << 0 << "ns=" << 0 << "\\H +" << endl;
    out_file2 << "@" << 0 << "ns=" << 0 << "\\H +" << endl;

    // Emit one coefficient value and one register-select vector
    // every 20 ns in the script editor's waveform format.
    for (a = 0; a < 5; a++) {
        for (b = 0; b < 5; b++) {
            out_file1 << setiosflags(ios::uppercase) << "@" << dec << time
                      << "ns=" << hex << array[a][b] << "\\H +" << endl;
            out_file2 << setiosflags(ios::uppercase) << "@" << dec << time
                      << "ns=" << hex << count << "\\H +" << endl;
            time += 20;
            count++;
        }
    }

    out_file1 << "@" << dec << time << "ns=" << 0 << "\\H" << endl;
    out_file2 << "@" << dec << time << "ns=" << 0 << "\\H" << endl;

    in_file1.close();
    out_file1.close();
    out_file2.close();
    return 0;
}
Figure 6.11. The source code for C++ program that generates test vectors to program the filter coefficients into MAUs.
Figure 6.12 below shows the arrangement of the Filter Coefficients (FCs) within
each MAU contained in the Arithmetic Unit (AU). The FCs appear rotated by 90
degrees (clockwise) because the input image pixels are rotated by 90 degrees
(counterclockwise) before flowing through the AU.
MAU | FC     MAU | FC     MAU | FC     MAU | FC     MAU | FC
 1  | FC40    6  | FC41   11  | FC42   16  | FC43   21  | FC44
 2  | FC30    7  | FC31   12  | FC32   17  | FC33   22  | FC34
 3  | FC20    8  | FC21   13  | FC22   18  | FC23   23  | FC24
 4  | FC10    9  | FC11   14  | FC12   19  | FC13   24  | FC14
 5  | FC00   10  | FC01   15  | FC02   20  | FC03   25  | FC04
[Diagram: the 5×5 Filter Coefficient plane (FC00 through FC44) mapped onto the
5×5 grid of MAUs (numbered 1 through 25) within the Arithmetic Unit.]
Figure 6.12. Arrangement of the Filter Coefficients within the Arithmetic Unit.
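The mapping in Figure 6.12 follows a simple index rule, sketched below in C++ for clarity; the 0-based indexing and the function name are ours, not part of the design.

```cpp
#include <cassert>
#include <utility>

// Map a 0-based MAU index (0..24) to the (row, col) of the FC it holds,
// per Figure 6.12: MAU 1 holds FC40, MAU 2 holds FC30, ..., MAU 25 holds FC04.
std::pair<int, int> mau_to_fc(int m) {
    int col = m / 5;        // FC column advances every five MAUs
    int row = 4 - m % 5;    // FC row counts down 4..0 within each group
    return std::make_pair(row, col);
}
```

This is exactly the 90 degree rotation described above: walking the MAUs in order traverses the FC plane column by column, bottom row first.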
Figure 6.13 and Figure 6.14 below show the Post-Synthesis simulation output of
the first phase of operation for the version 2 convolution architecture based on test case 1
shown in Figure 6.10. Figure 6.13 shows the programming of the FCs into respective
MAUs. As shown in the figure, coef_regs bus acts as a write enable for each MAU within
the Arithmetic Unit (AU) and coef bus is the FC value which is given as input to each
MAU.
Figure 6.13. First phase of operation; programming of FCs into MAUs.
Figure 6.14. First phase of operation; receiving the first two rows of the IP (shown
in figure above is the beginning of the second row of the input pixels).
In Figure 6.14, the input image pixel values are shown in hexadecimal rather than
the decimal values of test case 1 in Figure 6.10. As can be seen in Figure 6.14, no
output pixels are generated in this phase of operation since the system must wait until
the first two rows of the IP have been received. Figure 6.14 also shows that the system
receives five input image pixels and then writes them to one memory location.
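The five-pixels-per-write behavior amounts to packing five 8-bit pixels into one 40-bit memory word. A C++ sketch of this packing is shown below; the byte ordering (first pixel in the most significant byte) is an assumption made for illustration, and the function name is ours.

```cpp
#include <cassert>
#include <cstdint>

// Pack five 8-bit pixels into one 40-bit memory word (held in a uint64_t).
// Placing the first pixel in the most significant byte is assumed here.
uint64_t pack_five_pixels(const uint8_t px[5]) {
    uint64_t word = 0;
    for (int i = 0; i < 5; ++i)
        word = (word << 8) | px[i];
    return word;
}
```

This also explains the 40-bit width of buses o1 through o5 discussed later: each carries one such five-pixel word.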
The second phase of the system operation is shown in the following figures.
Figure 6.15 below shows that the system is generating the first three output pixels for the
second row of the OI as compared to the test case 1 shown in Figure 6.10. Under normal
operation, the output pixels will be generated after multiple stages of pipeline delays
contained within the AU as shown in Figure 6.15. Figure 6.16 shows superimposed
(timing delay not included) output pixels with their corresponding input pixels for ease of
comparison. The output pixels shown in Figure 6.16 are the first six output pixels from
the second row of the OI. Buses o1, o2, o3, o4 and o5 are the output buses from
functional unit IDS to the AU (all bus values are shown in hexadecimal), and each
bus is 40 bits wide (five input image pixels). In Figure 6.16 below, the 25 input image
pixels that correspond to each output pixel are highlighted. As can be seen from Figure
6.16, all the output pixels are as predicted in Figure 6.10.
Figure 6.15. Second phase of operation; output pixels generated.
Figure 6.16. Second phase of operation; output pixels of the second row of OI
(superimposed).
The third phase of system operation is shown in Figure 6.17 below, in which the
system generates the first six output pixels for the last row of the OI. In this phase of
operation, the input image pixels have been completely received and zeros are inserted
into the system. The six output pixels shown in Figure 6.17 were compared against,
and validated by, the output pixels predicted for test case 1 shown in Figure 6.10.
Figure 6.17. Third phase of operation; output pixels of the last row of OI
(superimposed).
A second test (test case 2) was also run to further investigate the correctness of
operation of the system. Figure 6.18 shows the FCs, the IP (first seven columns) and
the OI (first seven columns) for test case 2. As in the previous test case (Figure 6.10),
due to the page width limit only the first seven of the 60 columns of the IP and OI are
displayed.
Input Image Plane (Decimal)
  0   1   2   3   4   5   6
 60  61  62  63  64  65  66
120 121 122 123 124 125 126
180 181 182 183 184 185 186
240 241 241 241 241 241 241
Filter Coefficient Plane (Decimal)
-1  2  1  0  1
 1 -2  1  1  0
 0  0  1 -1  2
 2  1  0  1 -2
 2 -2  1  0  1
Output Image Plane (Hexadecimal)
000F0 0012F 001AA 001B0 001B6 001BC 001C2
001A9 001EB 0031E 00326 0032E 00336 0033E
00314 00392 00500 00508 0050F 00516 0051D
0025E 00318 003D2 003D8 003DE 003E4 003EA
0038B 00354 00448 0044E 00452 00456 0045A
Figure 6.18. Test case 2; IP, FCs and expected OI (the first seven columns).
Figure 6.19, Figure 6.20 and Figure 6.21 show the Post-Synthesis simulation
results of all three phases of operation for version 2 of the convolution architecture
with the IP and FCs as specified in Figure 6.18. All the results from these figures
agree with the expected results shown in Figure 6.18 above.
Figure 6.19. First phase of operation for test case 2.
Figure 6.20. Second phase of operation for test case 2; output pixels shown are the
first six of row one of OI (superimposed).
Figure 6.21. Third phase of operation for test case 2; output pixels shown are the
first six of the last row for OI (superimposed).
6.2. Post-Implementation Simulation
After version 2 of the convolution architecture, for the case of (k = 1), was
functionally validated via post-synthesis HDL simulation, its functional and timing
characteristics were studied and validated through post-implementation simulation. The
following sections will describe and depict synthesis and implementation of the system to
a particular Field Programmable Gate Array (FPGA) chip and the post-implementation
simulations that have been done to validate the version 2 convolution architecture.
6.2.1. Synthesis and Implementation of Version 2 Convolution Architecture (with k = 1)
In general, when a system described in a HDL is synthesized to a specific FPGA
chip, the CAD packages (Xilinx Foundation Series in this case) invoke a process that
translates the system described in HDL to a specific gate level netlist. The gate level
netlist may consist of any gate level elements or functional units that are specific to a
certain family of FPGA; hence, a targeted (specific) FPGA must be specified before the
process begins. Following synthesis is the implementation process, which targets the
specific FPGA chip and includes mapping, placing and routing the netlist within that
chip. Within each FPGA chip there are a
certain number of Configurable Logic Blocks (CLBs), and within each of these CLBs
there are a number of Lookup Tables (LUTs) and memory elements such as Flip-Flops
(FFs). The mapping process implements the gate level netlist to the FPGA chip using all
the available resources. Then, the place and route process determines the best placement
and routing of all the resources used for the mapped system such that all the components
(resources) are connected according to the netlist.
For this project, a prototyping board (XSV800) manufactured by XESS Co. was
used. This protoboard features the Virtex family FPGA chip (XCV800) from Xilinx.
Table 6.1 below summarizes the resources available within the FPGA chip on the
protoboard. There are 4704 CLBs in this specific FPGA chip, and within each CLB
there are four 4-input LUTs and four FFs. Table 6.2 below shows the resource
utilization of the XCV800 chip when version 2 of the convolution architecture (with
k = 1) is implemented.
Table 6.1. Details of FPGA on the XESS protoboard.
FPGA           XCV800 (Virtex FPGA family)
System Gates   888,439
CLB Array      56×84
FF             18,816
4-Input LUT    18,816
Table 6.2. Resource utilization of Version 2 Convolution Architecture (with k = 1).
CLBs                     1,878
FF                       2,620
4-Input LUT              5,955
Equivalent System Gates  96,210
6.2.2. Version 2 Convolution Architecture (with k = 1)
The post-implementation simulations of version 2 of the convolution architecture
with k = 1 were conducted with the same test cases run in the post-synthesis simulations
in the previous section. The script file and C++ programs used in the post-synthesis
simulations were reused in the post-implementation simulation testing and validation
processes described here.
Figure 6.22 and Figure 6.23 below show the results of the second phase and third
phase of operation for post-implementation simulation of test case 1 (see Figure 6.10). As
can be seen from both of the figures, the highlighted output image pixels were as
predicted in Figure 6.10. Figure 6.24 and Figure 6.25 show the second and third phase of
operation of post-implementation simulation for test case 2 (see Figure 6.18). All the
output image pixels highlighted within these figures were in agreement with the predicted
output image pixels as shown in Figure 6.18.
In both Figure 6.22 and Figure 6.24, the second phase of operation is shown: after
the first two rows of the IP have been stored, the convolution architecture starts the
convolution process. Meanwhile, in Figure 6.23 and Figure 6.25 the third phase of
operation is shown: with the entire IP received, zeros are inserted as input for the
convolution system to process the last two rows of the OI.
A clock frequency (clk in all figures) of 12.5 MHz has been used in all the post-
implementation simulations (Figure 6.22, Figure 6.23, Figure 6.24 and Figure 6.25)
conducted thus far. The main objective of the simulation testing described in this section
was to validate the system's functionality and performance with respect to being able to
generate one OI pixel on each system clock cycle with a 5×5 FC. For the case of k = 1,
the convolution architecture met these stated functional and performance goals.
Figure 6.22. Second phase of operation for test case 1 (post-implementation
simulation); output pixels of the second row of OI (superimposed).
Figure 6.23. Third phase of operation for test case 1 (post-implementation
simulation); output pixels of the last row of OI (superimposed).
Figure 6.24. Second phase of operation for test case 2 (post-implementation
simulation); output pixels shown are the first six of row one of OI (superimposed).
Figure 6.25. Third phase of operation for test case 2 (post-implementation
simulation); output pixels shown are the first six of the last row for OI (superimposed).
6.2.3. Synthesis and Implementation of Version 2 Convolution Architecture (k = 3)
As shown in Figure 4.20, the architecture can be scaled up to perform k
convolutions in parallel. To validate this scalability, version 2 of the convolution
architecture with three AUs instantiated was synthesized and implemented to the
XCV800 FPGA chip. Table 6.3 below shows the XCV800 chip
resource utilization as the convolution architecture is implemented. Note that as the
system is scaled up to process three convolutions in parallel, the total number of system
gates does not increase proportionally. As can be seen from Table 6.3, the equivalent
system gate count for k = 3 is 173,170 compared to 96,210 for k = 1 (Table 6.2), an
increase of 80 percent, which is well below a factor of three. This is because when the
system is scaled up only the AUs are replicated; the rest of the system is shared, as
discussed earlier. Comparing the total number of CLBs utilized between the two
implementations does not yield a good measurement since not all the elements within
each CLB are utilized.
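The two measured points suggest a simple linear cost model: a fixed block of shared logic plus a per-AU increment. The C++ sketch below fits that model to the measured k = 1 and k = 3 gate counts; any value it predicts for other k is an extrapolation of ours, not a measured result.

```cpp
#include <cassert>

// Linear gate-count model fitted to the two measured points
// (k = 1: 96,210 gates; k = 3: 173,170 gates).
long estimated_gates(int k) {
    const long per_au = (173170L - 96210L) / 2;  // 38,480 gates per added AU
    const long shared = 96210L - per_au;         // 57,730 gates of shared logic
    return shared + (long)k * per_au;
}
```

Under this model, roughly 57,730 gates are shared infrastructure and each additional AU costs about 38,480 gates, which is why total hardware grows much more slowly than k.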
Table 6.3. Resource utilization of Version 2 Convolution Architecture (with k = 3).
CLBs                     4,613
FF                       5,226
4-Input LUT              15,307
Equivalent System Gates  173,170
6.2.4. Validation of Version 2 Convolution Architecture (with k = 3)
In order to validate that version 2 of the convolution architecture can be scaled up
to include more than one AU (k > 1 in Figure 4.20) and continue to operate correctly from
a functional and performance standpoint, this section presents post-implementation
simulation results of version 2 convolution architecture operating with three instantiated
AUs. All VHDL code for version 2 of the convolution architecture with three AUs
instantiated can be found in Appendix A.
To validate the output image planes (OIs) generated by the version 2 convolution
architecture for k = 3, a C++ program was written that generates different input image
planes of size 5×60 pixels depending on the seed number given. The program uses the
rand function to generate random numbers from the given seed, limited to the range
0 to 255. In addition, the program generates the three expected output image planes
from the three filter coefficient planes that it reads in. The source code of this program
can be found in Appendix C. Another program generates all the test vectors necessary
to program each individual MAU with its filter coefficients. This program reads in
three FC planes contained in a text file and then generates the test vectors in the script
editor format (source code for this program can also be found in Appendix C).
Two test cases were simulated post-implementation; each test case was run with a
single, distinct IP (generated by giving a different seed number) and three distinct FC
sets. This was done to further validate the correct operation and performance of the
version 2 convolution architecture with k = 3. Figure 6.26 below shows test case
number 1 with the inputs and expected outputs (generated by the C++ program
mentioned in the paragraph above). Again, due to the page width limitation, the figure
only shows part of the IP and the predicted OIs.
Figure 6.26. Test Case 1: FC planes, IP plane and the predicted OI planes.
Figure 6.27 and Figure 6.28 below show the results of the post-implementation
HDL simulation with the inputs of test case 1 (see Figure 6.26). Figure 6.27 shows the
output image pixels for the first row of the three OIs starting from the third output image
pixel (signals out_pxl1, out_pxl2, and out_pxl3 were output image pixels for the first OI,
second OI and third OI respectively). Figure 6.28 shows the second row of output image
pixels (start from the third pixel) for all three OIs. All output pixels generated by post-
implementation simulation of the version 2 convolution architecture system agreed with
the expected results shown in Figure 6.26.
As can be seen from Figure 6.27 and Figure 6.28 all the input image pixels
highlighted within each rectangle correspond to the 25 input image pixels required for all
three convolutions (one output image pixel per FC plane). Again, the clock frequency
that has been utilized in the post-implementation simulation run of test case 1 in Figure
6.27 and Figure 6.28 is 12.5 MHz.
Figure 6.27. Superimposed output image pixels (start from the 3rd pixel) for first
row of the OIs for test case 1.
Figure 6.28. Superimposed output image pixels (from 3rd pixel onward) of the
second row of the OIs for test case 1.
For test case number 2, the IP generator program was given a seed number of 2
and hence a different IP plane was produced as shown in Figure 6.29. Figure 6.30 (OIs
result for third row) and Figure 6.31 (OIs result for fourth row) show the post-
implementation simulation result for test case 2. The output results from both figures
agreed with the predicted results shown in Figure 6.29. The clock frequency for test
case 2 is the same as in test case 1. All the highlighted input image pixels within each
rectangle correspond to the 25 input image pixels required for each output pixel
generated. Each individual OI pixel is generated within a single system clock cycle.
Figure 6.29. Test case 2: FC planes, IP plane and the predicted OI planes.
Validation of version 2 of the convolution architecture has been accomplished
through Post-Synthesis and Post-Implementation HDL simulation utilizing the Xilinx
Foundation CAD software packages. All the simulations were done with the system
implemented to a Xilinx Virtex FPGA (XCV800). As the system is scaled up to process
k convolutions in parallel, the additional hardware grows linearly with k since only the
AUs are replicated. A graph of the equivalent system gate count versus the number of
FC planes is plotted in Figure 6.32 below. Since all the simulation results are correct
and as desired, the version 2 convolution architecture is functionally and performance
validated in that it can correctly generate three OI pixels (OI1, OI2, and OI3) within
one system clock cycle (with k = 3).
Figure 6.30. Superimposed output image pixels (start from the 3rd pixel) for third
row of the OIs for test case 2.
Figure 6.31. Superimposed output image pixels (from 3rd pixel onward) of the fourth
row of the OIs for test case 2.
[Plot: Equivalent System Gates versus Number of FC planes; the equivalent gate count
rises from roughly 96,000 gates at k = 1 to roughly 173,000 gates at k = 3, with the
vertical axis spanning 0 to 200,000 gates.]
Figure 6.32. A plot of equivalent system gates versus number of FC planes.
Chapter 7
Hardware Prototype Development and Testing
Hardware prototype development and testing were done to experimentally
validate the functionality and performance of the convolution architecture. Ideally, the
convolution architecture would be implemented in ASIC technology with external
SRAM (Data Memory) as shown in Figure 7.1 below. In the figure, b and l denote the
bus widths of the external SRAM address bus and of an output image pixel,
respectively. CE, OE, and WE are the chip enable, output enable and write enable
control signals for the external SRAM. For example, to implement the convolution
architecture with three 5×5 FC planes, a total of 113 IO (Input/Output) pins are needed
on the FPGA or ASIC.
[Diagram: the FPGA or ASIC implementation of the convolution architecture connects
to external SRAMs through an n × d data bus, a b-bit address bus and the CE, OE and
WE control signals, and drives k l-bit Output Image plane buses (OI1, OI2, ..., OIk).]
Figure 7.1. Convolution Architecture hardware implementation.
To further validate the convolution architecture functionality and performance
correctness, hardware emulation of version 2 of the convolution architecture is done
through the development and testing of a FPGA based prototype. A hardware prototyping
board manufactured by the XESS Corporation [17] which features Xilinx Virtex FPGA
(XCV800) technology was available and utilized. Figure 7.2 below shows a picture of the
XSV-800 prototype board. Even though the XCV800 FPGA has enough IO pins to handle
the convolution architecture configuration shown in Figure 7.1 above, the SRAM on the
prototyping board has lower data bandwidth than desired. Because of this, the Data
Memory was inferred or emulated within the FPGA.
Figure 7.2. XSV-800 prototype board featuring Xilinx Virtex 800 FPGA (picture obtained from XESS Co. website, http://www.xess.com).
The hardware emulation is carried out by programming the FPGA with the
convolution architecture through the parallel port of a computer. The Xilinx Foundation
series CAD software package [18] was utilized to generate the bit stream file (FPGA
configuration bit stream) necessary to program the FPGA with the desired convolution
architecture hardware description. A software utility package, XSTOOLs, is provided
by the XESS Corporation for use with the prototyping board. The package includes
programs for bit stream download (FPGA programming), clock frequency setting, and
on-board SRAM content retrieval or initialization.
7.1. Board Utilization Modules and Prototype Setup
As shown in Figure 7.2, the prototyping board contains many auxiliary parts such
as push buttons, LEDs, SRAMs, parallel port and so on. However, to utilize these parts,
the FPGA needs to be programmed with the appropriate driver or module. These drivers
or modules are implemented within the FPGA since all these parts are connected to the
FPGA. Thus, for the purpose of hardware emulation of the convolution architecture
system, a means of supplying the input image plane (IP) and storing of the output image
plane (OI) is necessary and must be developed. An internal Block RAM within the FPGA
is implemented to provide the convolution system with the input image pixels. The
internal RAM is initialized with input image pixels when the system is synthesized and
implemented. Figure 7.3 below shows a pictorial view of the prototyping hardware. All
functional blocks or modules within the FPGA were implemented with VHDL.
[Diagram: within the FPGA, a Block RAM module supplies 8-bit input pixels to the
Convolution System, which is driven by an external clock; an FC Programming
module, fed 6-bit filter coefficients from the computer through the parallel port,
initializes the coefficient registers; and an SRAM driver carries the output pixels to the
external SRAM (Data Memory) over 21 data lines, a 19-bit address bus and 6 control
signals.]
Figure 7.3. Top level view of the prototyping hardware.
In Figure 7.3 there is an FC Programming module. This module, as the name
implies, is responsible for initializing all the filter coefficients within the convolution
system. The filter coefficients are supplied from the computer through the parallel port.
A C++ program (which can be found in Appendix D) was written to read the filter
coefficients for each FC plane from a file and send them through the parallel port to the
module. The FC Programming module receives two bytes of data from the parallel port
to program one filter coefficient register. In addition, the FC Programming module also
sends the output image plane to the external SRAM for later storage and analysis. Due
to the width limit of the external SRAM data bus, only one OI can be retained from
each run.
Also shown in the figure is the Block RAM module within the FPGA chip. This
module is responsible for providing the convolution system with the input image pixels
and is implemented in VHDL. It uses the internal RAM within the FPGA to store the
input image pixels. To generate the VHDL file for this module easily, a C++ program
was written that produces the file from a provided random number seed. By using the
same seed numbers as those used in the previous chapter, the same input image pixels
can be generated. An example of the generated VHDL file is shown in Figure 7.4
below. The C++ program can
be found in Appendix D. library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity IN_RAM is port( clk: in std_logic; rst: in std_logic; req: in std_logic; dout: out std_logic_vector(7 downto 0) ); end entity IN_RAM; architecture STRUCT of IN_RAM is component RAMB4_S8 is port( DI: in std_logic_vector(7 downto 0); EN: in std_logic; WE: in std_logic; RST: in std_logic; CLK: in std_logic; ADDR: in std_logic_vector(8 downto 0); DO: out std_logic_vector(7 downto 0) ); end component RAMB4_S8; attribute INIT_00: string; attribute INIT_00 of IRM: label is "9fb6add75e5a290b8713f55f888c72ce37e06d31362b091dd779a254c5b24a30"; attribute INIT_01: string; attribute INIT_01 of IRM: label is "80b20000c7212be922035da80f3273676826daf8fd91c35f0b99e093f209c61c"; attribute INIT_02: string; attribute INIT_02 of IRM: label is "2f2d2f198cec76303797682ed5553d18eb5345050260bccdc4eed36fc92e4910"; attribute INIT_03: string; attribute INIT_03 of IRM: label is "e4392f650000e90330e84c421110aa1db0d9d34e544884c1b5a7ce5aeaff060d"; attribute INIT_04: string; attribute INIT_04 of IRM: label is "29af02af2cf91bfc241bd2ada5d75262f228d437b5d0fc6e8f18bb82b5216f9e"; attribute INIT_05: string; attribute INIT_05 of IRM: label is "8f552a661c42000095af0ea4bb1cfdb88c34cdf122f8ae8904447e84657c6a00"; attribute INIT_06: string; attribute INIT_06 of IRM: label is "e1ab005cf16e41b0d174d275a95fa85c177eb6d6ec1b2ef67cd891e33f88cfb7"; attribute INIT_07: string; attribute INIT_07 of IRM: label is "679780f0f91cba3e000007b707bc25aba634015c6ab3c55053130fd44d7f9ee8"; attribute INIT_08: string; attribute INIT_08 of IRM: label is "53bb6d2e0a67fd7071d852926a66ce6a617485308dc35ad9177c391f8f32a8b3"; attribute INIT_09: string; attribute INIT_09 of IRM: label is "00000000000000000000000066896a4dc61c8c7b1434242dcc20e505cb7aa7a9"; signal din : std_logic_vector(7 downto 0); signal addr: unsigned(8 downto 0);
Figure 7.4. Example of a VHDL file for creating an internal Block RAM containing input image pixels for the convolution system (seed number of 1 is provided to the
program).
    signal adr : std_logic_vector(8 downto 0);
    signal en  : std_logic;
    signal we  : std_logic;
begin
    L1: din <= (others=>'0');
    L2: en  <= '1';
    L3: we  <= '0';
    L4: adr <= std_logic_vector(addr);

    P1: process(clk, rst, req) is
    begin
        if (rst = '1') then
            addr <= (others=>'0');
        elsif (clk'event and clk = '1') then
            if (req = '1') then
                addr <= addr + 1;
            end if;
        end if;
    end process P1;

    IRM: RAMB4_S8 port map(DI=>din, EN=>en, WE=>we, RST=>rst, CLK=>clk,
                           ADDR=>adr, DO=>dout);
end architecture STRUCT;
Figure 7.4 (Continued). Example of a VHDL file for creating an internal Block RAM containing input image pixels for the convolution system (seed number of 1 is provided to the program).

In addition to the two modules mentioned above, there is another module that is
responsible for controlling the external SRAM (the SRAM chips are external to the FPGA). This module generates progressively ascending addresses for storing the output image pixels, along with the other signals, such as cen (chip enable) and wen (write enable), required for the proper functioning of the SRAM.
All modules mentioned in this section were developed and implemented via
VHDL description. The VHDL files for all these modules can be found in Appendix E.
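To illustrate the seed-driven Block RAM generation described above, the following C++ sketch shows how one INIT_xx attribute string (32 bytes, 64 hex digits) could be built from a seed; the linear congruential generator and helper name are placeholders for illustration, since the actual program appears in Appendix D, and the byte order follows the Xilinx convention of placing the lowest-addressed byte at the right of the string:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Illustrative sketch of how the Block RAM generator can derive an INIT_xx
// attribute string from a seed.  Each INIT string holds 32 bytes (64 hex
// digits); the byte at the lowest address appears at the right-hand end.
// The LCG below is a placeholder -- the thesis program (Appendix D) uses
// its own random number generator.
std::string makeInitString(uint32_t& state) {
    std::string s;
    for (int byte = 31; byte >= 0; --byte) {
        state = state * 1103515245u + 12345u;     // LCG step (placeholder)
        uint8_t pixel = (state >> 16) & 0xFF;     // one 8-bit input pixel
        char buf[3];
        std::snprintf(buf, sizeof buf, "%02x", pixel);
        s += buf;                                 // highest address first
    }
    return s;                                     // 64 hex characters
}
```

Calling the generator repeatedly with the same starting seed reproduces the same sequence of strings, which is how identical input image planes were regenerated across simulation and hardware runs.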
7.2. Hardware Prototyping Flow
After a design is synthesized and implemented through use of CAD packages, a
bit stream file (FPGA configuration bit file) for a specific FPGA chip is generated. In this
case, the bit stream file contains the configuration bits for the convolution architecture as
well as the auxiliary modules generated for a Xilinx XCV800 FPGA chip. Next, the bit
stream file is programmed into the FPGA through the parallel port of a computer. For this particular prototyping board from XESS Co., an FPGA configuration and download program, gxsload, is provided. Figure 7.5 shows the graphical interface of the gxsload
program once it is executed.
Figure 7.5. FPGA configuration and bit stream download program, gxsload from
XESS Co.
After the FPGA chip has been configured with the convolution architecture, it is ready for experimental hardware testing and validation. Since the input image pixels are stored in the Block RAM module within the FPGA, the only time the system requires input external to the FPGA is for filter coefficient programming. This is done through hardware (the FC Programming module) and software. The software program (found in Appendix D) was developed in C++ to read the filter coefficients from a text file, coef.txt (the same file as shown in Figure 6.26), and send all the data through the parallel port to the convolution architecture. This file also specifies which OI plane the external SRAM stores during each experimental run. Figure 7.6 below shows a segment of the verbose output from the execution of the FCs configuration program. The program enters the filter coefficients in the order shown in Figure 6.12. For each filter coefficient, two bytes of data are sent through the parallel port: the first byte indicates the position of the filter coefficient within the AU, and the following byte is the filter coefficient itself.
Figure 7.6. Execution of the FCs configuration program.
Next, the convolution process commences when the start push button on the prototyping board is pressed; one of the push buttons on the board is mapped to the start signal of the convolution system. Since the execution of the convolution architecture is transparent to the user, an LED on the prototyping board is mapped to the inverse of the SRAM's write enable signal (the inverse of the wen signal at the highest level of the VHDL description). Consequently, once the convolution architecture finishes its execution, the SRAM's write enable line is pulled low and the LED lights.
Then, the output image pixels stored in the external SRAM are retrieved by using
the gxsload program, which is the same program used to download the FPGA
configuration file. Figure 7.7 shows the graphical interface of the gxsload program when
used to upload SRAM contents to a file. The uploaded SRAM content is stored in an Intel
hex file format. Figure 7.8 below shows the uploaded SRAM contents in a file. There are two banks of SRAM on the prototyping board, a left bank and a right bank. Each bank has a 16-bit data bus and a 19-bit address bus. Since the output image pixels are 19 bits wide, both banks of the SRAM are utilized.
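The splitting of each 19-bit output pixel across the two 16-bit banks can be sketched as follows; the exact bit placement (low 16 bits to the right bank, the remaining three bits zero-extended into the left bank) is an assumption consistent with the Figure 7.8 caption:

```cpp
#include <cstdint>
#include <utility>

// Sketch of how a 19-bit output pixel could be split across the two 16-bit
// SRAM banks.  The assumed placement (low 16 bits to the right bank, the
// 3 high bits zero-extended into the left bank) follows the Figure 7.8
// caption, but the exact arrangement is the board driver's choice.
std::pair<uint16_t, uint16_t> splitPixel(uint32_t pixel19) {
    uint16_t right = static_cast<uint16_t>(pixel19 & 0xFFFF);      // bits 15..0
    uint16_t left  = static_cast<uint16_t>((pixel19 >> 16) & 0x7); // bits 18..16
    return { right, left };
}

uint32_t joinPixel(uint16_t right, uint16_t left) {
    return (static_cast<uint32_t>(left & 0x7) << 16) | right;      // rebuild 19 bits
}
```

The comparison program would apply the inverse operation (joinPixel here) when reassembling output pixels from the uploaded bank contents.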
As evident from Figure 7.8 below, it is tedious to trace and compare the uploaded
SRAM contents to the expected output results. As mentioned in the previous section, a
program was written to generate the theoretically correct output image pixels (shown in
Figure 6.26). In order to compare the uploaded results with the known correct results
efficiently, a C++ program was written to parse the Intel hex file and check it against the theoretically correct output. The source code for this program can be found
in Appendix D.
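A core piece of such a parser is validating each Intel hex record before extracting its data bytes; a minimal sketch follows (helper names are illustrative, not those of the Appendix D program):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Minimal Intel hex record check, of the kind needed when parsing the
// uploaded SRAM contents: every byte of the record (length, address, type,
// data, and the trailing checksum) must sum to zero modulo 256.  Helper
// names here are illustrative, not those of the thesis program.
static int hexNibble(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

bool recordChecksumOk(const std::string& line) {
    // A record is ':' followed by an even number of hex digits.
    if (line.empty() || line[0] != ':' || line.size() % 2 == 0) return false;
    uint8_t sum = 0;
    for (std::size_t i = 1; i + 1 < line.size(); i += 2) {
        int hi = hexNibble(line[i]), lo = hexNibble(line[i + 1]);
        if (hi < 0 || lo < 0) return false;
        sum = static_cast<uint8_t>(sum + ((hi << 4) | lo));
    }
    return sum == 0;  // the checksum byte makes a valid record sum to zero
}
```

For example, the standard end-of-file record ":00000001FF" passes this check, while any corrupted byte makes the sum nonzero.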
Figure 7.7. Uploading SRAM content using the gxsload utility; the high address indicates the upper bound of the SRAM address space, whereas the low address indicates the lower bound.
Figure 7.8. Uploaded SRAM contents stored in a file (Intel hex file format). There are two segments because the program wrote the right bank of the SRAM (16-bit) first and the left bank of the SRAM (16 MSB bits) next.
7.3. Test Cases
To validate correct functional and performance operation of the FPGA based
hardware prototype of the convolution architecture, two test cases were run. The
convolution architecture was run at a 2 kHz clock frequency for these test cases; maximum clock rate performance of the hardware prototype was not a goal for these two tests. The
performance metric of interest is whether the prototype can simultaneously convolute one
IP with k (n×n) FCs and generate k OI pixels on each system clock cycle. These two test
cases are shown in Figure 6.26 and Figure 6.29 of the previous chapter. Since each test case has three different Filter Coefficient (FC) planes, three experimental runs must be carried out. Three runs are needed, even though all three OI planes are generated during each run, because of the SRAM data bus width limitation: each OI pixel requires 19 bits, and the SRAM data bus is only 32 bits wide.
Figure 7.9 (first OI plane), Figure 7.10 (second OI plane) and Figure 7.11 (third OI plane) show the results obtained from the SRAM after each experimental run. The grayed areas of the figures are the Intel hex file header and checksums, and the highlighted boxes with arrows projecting to the bottom of the figure mark the first OI pixel of the respective OI plane. To obtain the second OI pixel, slide the window to the next column as marked. Comparison of all three obtained OI planes with the results shown in Figure 6.26 reveals that they match. The comparison program, executed on each of the experimental runs, showed that the obtained results are identical to the expected results.
Figure 7.9. SRAM contents retrieved for first OI plane for test case 1.
Figure 7.10. SRAM contents retrieved for second OI plane for test case 1.
Figure 7.11. SRAM contents retrieved for third OI plane for test case 1.
Figure 7.12 (first OI plane), Figure 7.13 (second OI plane) and Figure 7.14 (third
OI plane) below show the experimental runs for test case 2. After each experimental run, the OI plane retrieved from the SRAM was compared, using the comparison program, with the expected results shown in Figure 6.29 of the previous chapter. All three OI planes matched the expected results, thus again validating the correctness of the convolution architecture.
The results obtained from testing the hardware prototype thus further validate the functional and performance correctness of version 2 of the convolution architecture.
Figure 7.12. SRAM contents retrieved for first OI plane for test case 2.
Figure 7.13. SRAM contents retrieved for second OI plane for test case 2.
Figure 7.14. SRAM contents retrieved for third OI plane for test case 2.
Chapter 8
Conclusions

In summary, the main objective of this thesis research project was to develop the
architecture for and design, validate, and build a hardware prototype of a convolution
architecture capable of processing an input image plane such that an output image pixel is
obtained every clock cycle assuming convolution with one FC plane. In addition, the
convolution architecture needed to be scalable in both the filter coefficient plane size
(kernel size) and the number of filter coefficient planes which could be simultaneously
processed. The motivations behind this scalability were, first, that the convolution architecture can be tailored to any kernel size and still produce one output image pixel per clock cycle, and second, that the architecture can hold k kernels of any size while retaining the functional and performance capability to output k output image pixels on each system clock cycle.
The developed convolution architecture was captured through the use of the
VHDL hardware description language. Xilinx Foundation series CAD software packages
were used to synthesize and implement the architecture onto an FPGA chip. Before the architecture was loaded onto the prototyping board for experimental testing, it was validated functionally and for performance through post-synthesis and post-implementation HDL software simulations.
Experimental testing of the architecture was done on a prototyping board that featured a
Virtex family FPGA.
Post-synthesis and post-implementation HDL software simulation and experimental testing of the hardware prototype showed that the implemented convolution architecture was indeed functionally correct as intended. It
is felt that if the convolution architecture were implemented in high speed ASIC
“production” technology with a high speed external SRAM, the intent that a single IP
pixel can be convoluted with k (n×n) FCs and k OI pixels generated within a clock cycle
of 7.3 ns could be achieved. As earlier indicated, pipelining of the multiply unit within the MAUs of the AUs would, if required, further increase overall system performance.
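As an illustrative back-of-the-envelope check of what the 7.3 ns cycle-time target implies (the image size used below is a hypothetical example, not a figure from the thesis experiments):

```cpp
#include <cstdint>

// Illustrative throughput arithmetic for the 7.3 ns cycle-time target:
// one IP pixel is consumed and k OI pixels are produced per cycle, so an
// (i x j) image takes roughly i*j cycles regardless of k.  Any concrete
// image size plugged into these helpers is a hypothetical example.
double planeTimeSeconds(int64_t pixels, double cycle_s) {
    return static_cast<double>(pixels) * cycle_s;  // one pixel per cycle
}

double outputPixelsPerSecond(int k, double cycle_s) {
    return k / cycle_s;                            // k OI pixels per cycle
}
```

For example, a hypothetical 1024×1024 IP plane would take about 7.65 ms per pass, and with k = 3 the architecture would emit roughly 411 million output pixels per second.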
As a side note, a convolution program with three 5×5 FC planes and a 5100×6600 IP plane was run on a general purpose processor (a 650 MHz AMD Athlon) for a "loose" comparison to the performance of the new convolution architecture. The processor used on average 0.4 seconds of system time to convolute the one IP plane with the three FC planes, which indicates that a processor running at around 260 MHz would be able to meet the "production" requirements for the new convolution architecture system. However, the cost/performance ratio of the general purpose processor will be higher than that of the version 2 convolution architecture implemented in ASIC technology, considering the die sizes of the two architectures (the convolution architecture has roughly less than ten percent of the general purpose processor's transistor count).
In conclusion, the best cost/performance ratio can be obtained from implementing
the new convolution architecture in “production” ASIC technology which should allow
the system clock of the convolution architecture to have a desired cycle time of 7.3 ns or
less. Thus, the primary factors that determine the performance of the new convolution architecture are the speed of the implementation technology, the optimization of the layout/placement of the implementation to reduce longest-path delays, and the degree of pipelining one chooses to design into the system.
Appendix A
VHDL Code for Version 2 Discrete Convolution Architecture (Figure 4.20 for k = 3)
1. Version 2 Convolution Architecture -- sys.vhd (Top Level System of Version 2 Convolution Architecture) library IEEE; use IEEE.std_logic_1164.all; entity SYS is port( clk, rst, str: in std_logic; d_in: in std_logic_vector(7 downto 0); --(FIFO -> DM_IF) coef: in std_logic_vector(5 downto 0); --(FCs from parallel port) ld_reg: in std_logic_vector(4 downto 0); --(MAUs select from pp) au_sel: in std_logic_vector(1 downto 0); --(AU select from pp) o_sel: in std_logic; --(Output config pin from pp) req: out std_logic; --(Controller -> FIFO) sram_w: out std_logic; --(SYS -> SRAM) d_out: out std_logic_vector(18 downto 0) ); end entity SYS; architecture STRUCT of SYS is component CTR is port( clk, rst, str, bor, eoc, sds, rgt: in std_logic; f_sel: out std_logic; z_pad: out std_logic; reg_sel: out std_logic_vector(2 downto 0); en_w: out std_logic_vector(1 downto 0); en_sf: out std_logic_vector(1 downto 0); z_input: out std_logic; c_inc: out std_logic_vector(1 downto 0); rot: out std_logic; r_inc: out std_logic; req: out std_logic; r_w: out std_logic; s_inc: out std_logic; sd_inc: out std_logic; ans: out std_logic ); end component CTR; component RCNT is port( clk, rst, r_inc, sd_inc: in std_logic; eoc, sds, rgt: out std_logic ); end component RCNT; component REG_A is generic( n: integer := 8; -- denotes the data width d: integer := 5 );-- denotes the number of registers port( clk, rst, z_pad, z_input: in std_logic; reg_sel: in std_logic_vector(2 downto 0); d_in: in std_logic_vector(n-1 downto 0); d_out: out std_logic_vector((n*d)-1 downto 0); ids_out: out std_logic_vector(n-1 downto 0) ); end component REG_A; component C_BANK1 is port( clk, rst, f_sel, z_pad: in std_logic; en_sf, en_w: in std_logic_vector(1 downto 0); ld_reg: in std_logic_vector(2 downto 0); bseq: in std_logic_vector(3 downto 0); d_in: in std_logic_vector(39 downto 0); d_out: out std_logic_vector(31 downto 0)); end component C_BANK1; component RAM is port( wclk, r_w: in std_logic; d_in: in 
std_logic_vector(39 downto 0); addr: in std_logic_vector(6 downto 0);
d_out: out std_logic_vector(39 downto 0) ); end component RAM; component MEMPTR is port( clk, rst, rot, s_inc: in std_logic; inc: in std_logic_vector(1 downto 0); reg_sel: in std_logic_vector(2 downto 0); bor: out std_logic; -- begining of a new row bseq: out std_logic_vector(3 downto 0); add_out: out std_logic_vector(6 downto 0) ); end component MEMPTR; component IDS is port( clk, rst, ans: in std_logic; ids_in: in std_logic_vector(39 downto 0); o1, o2, o3, o4, o5: out std_logic_vector(39 downto 0)); end component IDS; component DU is port( clk, rst: in std_logic; ids_in: in std_logic_vector(39 downto 0); du_out: out std_logic_vector(39 downto 0)); end component DU; component AU is port( clk, rst: in std_logic; du0, du1, du2, du3, du4: in std_logic_vector(39 downto 0); p_en: in std_logic; ld_reg: in std_logic_vector(4 downto 0); coef: in std_logic_vector(5 downto 0); out_pxl: out std_logic_vector(18 downto 0); ovf: out std_logic); end component AU; -- Internal signals to connect components signal r_inc, sd_inc, eoc, sds, rgt: std_logic; --(Row counter <-> Controller) -- signal req : std_logic; signal z_pad, z_input : std_logic; --(Controller -> REG_A ) signal rot, s_inc, bor : std_logic; --(Controller -> MEMPTR ) signal r_w : std_logic; --(Controller -> RAM ) signal f_sel : std_logic; --(Controller -> BANK_1 ) signal ans : std_logic; --(Controller -> IDS ) signal a1, a2, a3 : std_logic; --(Controller->a1->a2->a3->ans) signal ovf1, ovf2, ovf3 : std_logic; --(Overflow from AUs) --(Controller -> MEMPTR) signal c_inc : std_logic_vector(1 downto 0); --(Controller -> BANK_1) signal en_w, en_sf : std_logic_vector(1 downto 0); --(Controller -> REG_A, MEMPTR) signal reg_sel : std_logic_vector(2 downto 0); --(REG_A -> RAM) signal rega_ram : std_logic_vector(39 downto 0); --(REG_A -> IDS) signal rega_ids : std_logic_vector(7 downto 0); --(MEMPTR -> C_BANK1) signal bseq : std_logic_Vector(3 downto 0); --(C_BANK1 -> IDS) signal cbank_ids : std_logic_vector(31 downto 0); --(RAM 
-> C_BANK1) signal ram_cbank : std_logic_vector(39 downto 0); --(MEMPTR -> RAM) Ram address signal memptr_ram : std_logic_vector(6 downto 0); --(IDS -> DUs) signal o1, o2, o3, o4, o5 : std_logic_vector(39 downto 0); --(Combined output from REG_A and C_BANK1 into ids_in) signal ids_in : std_logic_vector(39 downto 0); --(DUs -> AUs) signal du_au1, du_au2, du_au3 : std_logic_vector(39 downto 0); signal du_au4, du_au5 : std_logic_vector(39 downto 0); --(AUs -> Output Pixels)
signal out_pxl1 : std_logic_vector(18 downto 0); signal out_pxl2 : std_logic_vector(18 downto 0); signal out_pxl3 : std_logic_vector(18 downto 0); --(AU's select line for programming) signal a_sel : std_logic_vector(2 downto 0); --(Output select register for holding output selection from parallel port) signal op_sel_reg : std_logic_vector(1 downto 0); --(ans delays signals) signal ds : std_logic_vector(13 downto 0); begin -- Main Controller of Version 2 Convolution Architecture U0: CTR port map(clk=>clk, rst=>rst, str=>str, bor=>bor, eoc=>eoc, sds=>sds, rgt=>rgt, f_sel=>f_sel, z_pad=>z_pad, reg_sel=>reg_sel, en_w=>en_w, en_sf=>en_sf, z_input=>z_input, c_inc=>c_inc, rot=>rot, r_inc=>r_inc, req=>req, r_w=>r_w, s_inc=>s_inc, sd_inc=>sd_inc, ans=>a1); -- Row counter for the main controller U1: RCNT port map(clk=>clk, rst=>rst, r_inc=>r_inc, sd_inc=>sd_inc, eoc=>eoc, sds=>sds, rgt=>rgt); -- Register A of the DM_IF U2: REG_A port map(clk=>clk, rst=>rst, z_pad=>z_pad, z_input=>z_input, reg_sel=>reg_sel, d_in=>d_in, d_out=>rega_ram, ids_out=>rega_ids); -- C_BANK1 of the DM_IF U3: C_BANK1 port map(clk=>clk, rst=>rst, f_sel=>f_sel, z_pad=>z_pad, en_sf=>en_sf, en_w=>en_w, ld_reg=>reg_sel, bseq=>bseq, d_in=>ram_cbank, d_out=>cbank_ids); -- RAM U4: RAM port map(wclk=>clk, r_w=>r_w, d_in=>rega_ram, addr=>memptr_ram, d_out=>ram_cbank); -- MEMPTR (memory pointer) U5: MEMPTR port map(clk=>clk, rst=>rst, rot=>rot, s_inc=>s_inc, inc=>c_inc, reg_sel=>reg_sel, bor=>bor, bseq=>bseq, add_out=>memptr_ram); -- IDS L1: ids_in <= rega_ids & cbank_ids; -- Combine signals output from REG_A and CBANK u6: IDS port map(clk=>clk, rst=>rst, ans=>ans, ids_in=>ids_in, o1=>o1, o2=>o2, o3=>o3, o4=>o4, o5=>o5); -- This process is to create delays such that the output from IDS will outputs at -- the same time. Also take cares of the boundary outputs from line to line. 
D2: process (clk, rst, a1, a2) is begin if (rst = '1') then a2 <= '0'; a3 <= '0'; ans <= '0'; elsif (clk'event and clk = '1') then a2 <= a1; a3 <= a2; ans <= a3; end if; end process D2; -- This process is to propagate ans signal through out all the pipeline stages to -- the SRAM writer interface so it could strat "recording". D3: process (clk, rst, ans, ds) is begin if (rst = '1') then ds <= (others => '0'); elsif (clk'event and clk = '1') then ds(0) <= ans; ds(1) <= ds(0);
ds(2) <= ds(1); ds(3) <= ds(2); ds(4) <= ds(3); ds(5) <= ds(4); ds(6) <= ds(5); ds(7) <= ds(6); ds(8) <= ds(7); ds(9) <= ds(8); ds(10) <= ds(9); ds(11) <= ds(10); ds(12) <= ds(11); ds(13) <= ds(12); end if; end process D3; L2: sram_w <= ds(13) or ds(12) or ds(11) or ds(10); -- DUs u7: DU port map(clk=>clk, rst=>rst, ids_in=>o1, du_out=>du_au1); u8: DU port map(clk=>clk, rst=>rst, ids_in=>o2, du_out=>du_au2); u9: DU port map(clk=>clk, rst=>rst, ids_in=>o3, du_out=>du_au3); u10: DU port map(clk=>clk, rst=>rst, ids_in=>o4, du_out=>du_au4); u11: DU port map(clk=>clk, rst=>rst, ids_in=>o5, du_out=>du_au5); -- AUs u12: AU port map(clk=>clk, rst=>rst, du0=>du_au1, du1=>du_au2, du2=>du_au3, du3=>du_au4, du4=>du_au5, p_en=>a_sel(0), ld_reg=>ld_reg, coef=>coef, out_pxl=>out_pxl1, ovf=>ovf1); u13: AU port map(clk=>clk, rst=>rst, du0=>du_au1, du1=>du_au2, du2=>du_au3, du3=>du_au4, du4=>du_au5, p_en=>a_sel(1), ld_reg=>ld_reg, coef=>coef, out_pxl=>out_pxl2, ovf=>ovf2); u14: AU port map(clk=>clk, rst=>rst, du0=>du_au1, du1=>du_au2, du2=>du_au3, du3=>du_au4, du4=>du_au5, p_en=>a_sel(2), ld_reg=>ld_reg, coef=>coef, out_pxl=>out_pxl3, ovf=>ovf3); AUP_SEL: process (au_sel) is begin case (au_sel) is when "01" => a_sel <= "001"; when "10" => a_sel <= "010"; when "11" => a_sel <= "100"; when others => a_sel <= "000"; end case; end process AUP_SEL; -- Testing Purposes -- Output selection logic OP_SEL: process (clk, rst, o_sel) is begin if (rst = '1') then op_sel_reg <= (others => '0'); elsif (clk'event and clk = '1') then if (o_sel = '1') then op_sel_reg <= coef(1 downto 0); else op_sel_reg <= op_sel_reg; end if; end if; end process OP_SEL; -- Output selection mux OP_D: process (op_sel_reg, out_pxl1, out_pxl2, out_pxl3) is begin case (op_sel_reg) is when "00" => d_out <= out_pxl1; when "01" => d_out <= out_pxl2;
when "10" => d_out <= out_pxl3; when others => d_out <= out_pxl1; end case; end process OP_D; end architecture STRUCT; 2. Controller Unit (CU) -- ctr.vhd (Controller) library IEEE; use IEEE.std_logic_1164.all; entity CTR is port( clk, rst, str, bor, eoc, sds, rgt: in std_logic; f_sel: out std_logic; z_pad: out std_logic; reg_sel: out std_logic_vector(2 downto 0); en_w: out std_logic_vector(1 downto 0); en_sf: out std_logic_vector(1 downto 0); z_input: out std_logic; c_inc: out std_logic_vector(1 downto 0); rot: out std_logic; r_inc: out std_logic; req: out std_logic; r_w: out std_logic; s_inc: out std_logic; sd_inc: out std_logic; ans: out std_logic ); end entity CTR; architecture BEHAVIORAL of CTR is type statetype is (st0, st1, st2, st3, st4, st5, st6, st7, st8, st9, st10, st11, st12, st13, st14, st15, st16, st17, st18, st19, st20, st21, st22, st23, st24, st25, st26, st27, st28, st29, st30, st31, st32, st33, st34, st35, st36, st37, st38, st39, st40, st41, st42, st43, st44); signal c_st, n_st: statetype; signal tog : std_logic; begin NXTSTPROC: process (c_st, str, bor, eoc, rgt, tog, sds) is begin case c_st is when st0 => if (str = '1') then n_st <= st1; else n_st <= st0; end if; when st1 => if (rgt = '0' and tog = '0') then n_st <= st9; elsif (rgt = '0' and tog = '1') then n_st <= st2; elsif (rgt = '1' and tog = '0') then n_st <= st23; else n_st <= st16; end if; when st2 => n_st <= st3; when st3 => n_st <= st4; when st4 => n_st <= st5; when st5 => n_st <= st6;
when st6 => if (bor = '1') then n_st <= st7; else if (rgt = '0' and tog = '0') then n_st <= st9; elsif (rgt = '0' and tog = '1') then n_st <= st2; elsif (rgt = '1' and tog = '0') then n_st <= st23; else n_st <= st16; end if; end if; when st7 => n_st <= st8; when st8 => if (rgt = '0' and tog = '0') then n_st <= st9; elsif (rgt = '0' and tog = '1') then n_st <= st2; elsif (rgt = '1' and tog = '0') then n_st <= st23; else n_st <= st16; end if; when st9 => n_st <= st10; when st10 => n_st <= st11; when st11 => n_st <= st12; when st12 => n_st <= st13; when st13 => if (bor = '1') then n_st <= st14; else if (rgt = '0' and tog = '0') then n_st <= st9; elsif (rgt = '0' and tog = '1') then n_st <= st2; elsif (rgt = '1' and tog = '0') then n_st <= st23; else n_st <= st16; end if; end if; when st14 => n_st <= st15; when st15 => if (rgt = '0' and tog = '0') then n_st <= st9; elsif (rgt = '0' and tog = '1') then n_st <= st2; elsif (rgt = '1' and tog = '0') then n_st <= st23; else n_st <= st16; end if; when ST16 => n_st <= st17; when st17 => n_st <= st18; when st18 => n_st <= st19; when st19 => n_st <= st20; when st20 => if (bor = '1') then
n_st <= st21; else if (tog = '0') then n_st <= st23; else n_st <= st16; end if; end if; when st21 => n_st <= st22; when st22 => if (eoc = '1') then if (tog = '0') then n_st <= st37; else n_st <= st30; end if; else if (tog = '0') then n_st <= St23; else n_st <= st16; end if; end if; when ST23 => n_st <= st24; when st24 => n_st <= st25; when st25 => n_st <= st26; when st26 => n_st <= st27; when st27 => if (bor = '1') then n_st <= st28; else if (tog = '0') then n_st <= st23; else n_st <= st16; end if; end if; when st28 => n_st <= st29; when st29 => if (eoc = '1') then if (tog = '0') then n_st <= st37; else n_st <= st30; end if; else if (tog = '0') then n_st <= St23; else n_st <= st16; end if; end if; when st30 => n_st <= st31; when st31 => n_st <= st32; when st32 => n_st <= st33; when st33 => n_st <= st34; when st34 => if (bor = '1') then n_st <= st35; else
n_st <= st37; end if; when st35 => n_st <= st36; when st36 => if (sds = '1') then n_st <= st44; else n_st <= st37; end if; when st37 => n_st <= st38; when st38 => n_st <= st39; when st39 => n_st <= st40; when st40 => n_st <= st41; when st41 => if (bor = '1') then n_st <= st42; else n_st <= st30; end if; when st42 => n_st <= st43; when st43 => if (sds = '1') then n_st <= st44; else n_st <= st30; end if; when st44 => n_st <= st44; when others => null; end case; end process NXTSTPROC; CURSTPROC: process (clk, rst) is begin if (rst = '1') then c_st <= st0; elsif (clk'event and clk = '0') then c_st <= n_st; end if; end process CURSTPROC; OUTCONPROC: process (c_st) is begin case c_st is when st0 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '0'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '0';
ans <= '0'; when st1 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '0'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '0'; ans <= '0'; when st2 => reg_sel <= "001"; f_sel <= '0'; en_w <= "01"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "01"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st3 => reg_sel <= "010"; f_sel <= '0'; en_w <= "01"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st4 => reg_sel <= "011"; f_sel <= '0'; en_w <= "01"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st5 => reg_sel <= "100"; f_sel <= '0'; en_w <= "01"; en_sf <= "00";
z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st6 => reg_sel <= "101"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "10"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '1'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st7 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '1'; r_inc <= '1'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st8 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st9 => reg_sel <= "001"; f_sel <= '0'; en_w <= "10"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "01"; rot <= '0'; r_inc <= '0'; req <= '1';
tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st10 => reg_sel <= "010"; f_sel <= '0'; en_w <= "10"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st11 => reg_sel <= "011"; f_sel <= '0'; en_w <= "10"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st12 => reg_sel <= "100"; f_sel <= '0'; en_w <= "10"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st13 => reg_sel <= "101"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "10"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '1'; sd_inc <= '0'; s_inc <= '1'; ans <= '0';
when st14 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '1'; r_inc <= '1'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st15 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st16 => reg_sel <= "001"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '0'; c_inc <= "01"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st17 => reg_sel <= "010"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st18 => reg_sel <= "011"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '0';
c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st19 => reg_sel <= "100"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st20 => reg_sel <= "101"; f_sel <= '1'; en_w <= "00"; en_sf <= "10"; z_pad <= '0'; z_input <= '0'; c_inc <= "10"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '1'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st21 => reg_sel <= "000"; f_sel <= '1'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '1'; r_inc <= '1'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st22 => reg_sel <= "000"; f_sel <= '1'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0';
  sd_inc <= '0'; s_inc <= '1'; ans <= '0';
when st23 =>
  reg_sel <= "001"; f_sel <= '0'; en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '0'; c_inc <= "01"; rot <= '0';
  r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st24 =>
  reg_sel <= "010"; f_sel <= '0'; en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '0'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st25 =>
  reg_sel <= "011"; f_sel <= '0'; en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '0'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st26 =>
  reg_sel <= "100"; f_sel <= '0'; en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '0'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st27 =>
  reg_sel <= "101"; f_sel <= '0'; en_w <= "00"; en_sf <= "01"; z_pad <= '0'; z_input <= '0'; c_inc <= "10"; rot <= '0';
  r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '1'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st28 =>
  reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '1';
  r_inc <= '1'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0';
when st29 =>
  reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0';
when st30 =>
  reg_sel <= "001"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '1'; c_inc <= "01"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st31 =>
  reg_sel <= "010"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '1';
  c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st32 =>
  reg_sel <= "011"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '1'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st33 =>
  reg_sel <= "100"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '1'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st34 =>
  reg_sel <= "101"; f_sel <= '1'; en_w <= "00"; en_sf <= "10"; z_pad <= '0'; z_input <= '1'; c_inc <= "10"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '1'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st35 =>
  reg_sel <= "000"; f_sel <= '1'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '1'; c_inc <= "00"; rot <= '1';
  r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0';
  sd_inc <= '1'; s_inc <= '1'; ans <= '0';
when st36 =>
  reg_sel <= "000"; f_sel <= '1'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '1'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0';
when st37 =>
  reg_sel <= "001"; f_sel <= '0'; en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '1'; c_inc <= "01"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st38 =>
  reg_sel <= "010"; f_sel <= '0'; en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '1'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st39 =>
  reg_sel <= "011"; f_sel <= '0'; en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '1'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st40 =>
  reg_sel <= "100"; f_sel <= '0';
  en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '1'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st41 =>
  reg_sel <= "101"; f_sel <= '0'; en_w <= "00"; en_sf <= "01"; z_pad <= '0'; z_input <= '1'; c_inc <= "10"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '1'; r_w <= '1'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st42 =>
  reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '1'; c_inc <= "00"; rot <= '1';
  r_inc <= '0'; req <= '0'; tog <= '1'; r_w <= '0'; sd_inc <= '1'; s_inc <= '1'; ans <= '0';
when st43 =>
  reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '1'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0';
when st44 =>
  reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0';
when others => null;
end case;
end process OUTCONPROC;
end architecture BEHAVIORAL;

3. Memory Pointers Unit (MPU)

-- mem_ptr.vhd (Memory Pointers)
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

entity MEMPTR is
  port(
    clk, rst, rot: in std_logic;
    inc: in std_logic_vector(1 downto 0);
    reg_sel: in std_logic_vector(2 downto 0);
    bor: out std_logic; -- beginning of a new row
    bseq: out std_logic_vector(3 downto 0);
    add_out: out std_logic_vector(12 downto 0));
end entity MEMPTR;

architecture BEHAVIORAL of MEMPTR is
  signal count1, count2: unsigned(9 downto 0);
  signal ptr_a, ptr_b, ptr_c, ptr_d, ptr_e: unsigned(2 downto 0);
  signal b_seq: unsigned(3 downto 0);
  signal eor: std_logic;
begin
  CNTR1: process (clk, rst, inc) is
  begin
    if (rst = '1') then
      count1 <= to_unsigned(0, 10);
    elsif (clk'event and clk = '1') then
      if (inc(0) = '1') then
        if (count1 = to_unsigned(1020, 10)) then
          count1 <= to_unsigned(0, 10);
        else
          count1 <= count1 + 1;
        end if;
      end if;
    end if;
  end process CNTR1;

  CNTR2: process (clk, rst, inc) is
  begin
    if (rst = '1') then
      count2 <= to_unsigned(1, 10);
    elsif (clk'event and clk = '1') then
      if (inc(1) = '1') then
        if (count2 = to_unsigned(1020, 10)) then
          count2 <= to_unsigned(0, 10);
        else
          count2 <= count2 + 1;
        end if;
      end if;
    end if;
  end process CNTR2;

  CODOUT: process (count1, count2) is
  begin
    if (count2 = to_unsigned(0, 10)) then eor <= '1'; else eor <= '0'; end if;
    if (count1 = to_unsigned(0, 10)) then bor <= '1'; else bor <= '0'; end if;
  end process CODOUT;

  BLK: process (clk, rst, rot) is
  begin
    if (rst = '1') then
      b_seq <= to_unsigned(0, 4);
    elsif (clk'event and clk = '1') then
      if (rot = '1') then
        b_seq(0) <= '1'; b_seq(1) <= b_seq(0); b_seq(2) <= b_seq(1); b_seq(3) <= b_seq(2);
      end if;
    end if;
  end process BLK;

  L1: bseq <= std_logic_vector(b_seq);

  PTRS: process (clk, rst, rot) is
  begin
    if (rst = '1') then
      ptr_a <= to_unsigned(0, 3); ptr_b <= to_unsigned(1, 3); ptr_c <= to_unsigned(2, 3);
      ptr_d <= to_unsigned(3, 3); ptr_e <= to_unsigned(4, 3);
    elsif (clk'event and clk = '1') then
      if (rot = '1') then
        ptr_b <= ptr_a; ptr_c <= ptr_b; ptr_d <= ptr_c; ptr_e <= ptr_d; ptr_a <= ptr_e;
      end if;
    end if;
  end process PTRS;

  MUX: process (reg_sel, count1, count2, ptr_a, ptr_b, ptr_c, ptr_d, ptr_e, eor) is
  begin
    case reg_sel is
      when "001" =>
        if (eor = '1') then
          add_out <= std_logic_vector(ptr_d) & std_logic_vector(count2);
        else
          add_out <= std_logic_vector(ptr_e) & std_logic_vector(count2);
        end if;
      when "010" =>
        if (eor = '1') then
          add_out <= std_logic_vector(ptr_c) & std_logic_vector(count2);
        else
          add_out <= std_logic_vector(ptr_d) & std_logic_vector(count2);
        end if;
      when "011" =>
        if (eor = '1') then
          add_out <= std_logic_vector(ptr_b) & std_logic_vector(count2);
        else
          add_out <= std_logic_vector(ptr_c) & std_logic_vector(count2);
        end if;
      when "100" =>
        if (eor = '1') then
          add_out <= std_logic_vector(ptr_a) & std_logic_vector(count2);
        else
          add_out <= std_logic_vector(ptr_b) & std_logic_vector(count2);
        end if;
      when "101" =>
        add_out <= std_logic_vector(ptr_a) & std_logic_vector(count1);
      when others =>
        add_out <= std_logic_vector(to_unsigned(0, 13));
    end case;
  end process MUX;
end architecture BEHAVIORAL;
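As a sanity check on the pointer scheme above, the behavior of MEMPTR's five rotating row pointers can be modeled in software: on each `rot` pulse the pointers circulate one step so the oldest row buffer is recycled, and `add_out` is a 3-bit row pointer concatenated with a 10-bit column count. A minimal Python sketch (class and method names are hypothetical, not part of the design):

```python
class MemPtrModel:
    """Software sketch of MEMPTR's rotating row pointers (hypothetical model)."""

    def __init__(self):
        # ptr_a..ptr_e reset to row buffers 0..4
        self.ptrs = [0, 1, 2, 3, 4]

    def rotate(self):
        # On rot = '1': ptr_a <= ptr_e, ptr_b <= ptr_a, ... -- a one-step
        # circular shift that recycles the oldest row buffer for the next row.
        self.ptrs = [self.ptrs[-1]] + self.ptrs[:-1]

    def address(self, which, count):
        # add_out = 3-bit row pointer & 10-bit column count (13-bit address)
        return (self.ptrs[which] << 10) | (count & 0x3FF)


m = MemPtrModel()
print(m.address(4, 7))   # ptr_e selects row buffer 4: 4*1024 + 7 = 4103
m.rotate()
print(m.ptrs)            # [4, 0, 1, 2, 3]
```

After five rotations the pointers return to their reset order, which is what lets the same five physical row buffers serve an arbitrarily long stream of image rows.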
4. Data Memory Interface (DM I/F)

-- dm_if.vhd (Data Memory Interface)
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

entity C_BANK1 is
  port(
    clk, rst, f_sel, z_pad: in std_logic;
    en_sf, en_w: in std_logic_vector(1 downto 0);
    ld_reg: in std_logic_vector(2 downto 0);
    bseq: in std_logic_vector(3 downto 0);
    d_in: in std_logic_vector(39 downto 0);
    d_out: out std_logic_vector(31 downto 0));
end entity C_BANK1;

architecture STRUCTURAL of C_BANK1 is
  component REGFILE is
    generic(
      n: integer := 8;  -- n denotes the data width
      d: integer := 5); -- d denotes the number of registers
    port(
      clk, rst, en_sf, en_w: in std_logic;
      ld_reg: in std_logic_vector(2 downto 0);
      bseq: in std_logic_vector(3 downto 0);
      d_in: in std_logic_vector(39 downto 0);
      d_out: out std_logic_vector(31 downto 0));
  end component REGFILE;
  signal f_a, f_b, f_mux: std_logic_vector(31 downto 0);
begin
  RF1: REGFILE port map(clk=>clk, rst=>rst, en_sf=>en_sf(0), en_w=>en_w(0), bseq=>bseq, ld_reg=>ld_reg, d_in=>d_in, d_out=>f_a);
  RF2: REGFILE port map(clk=>clk, rst=>rst, en_sf=>en_sf(1), en_w=>en_w(1), bseq=>bseq, ld_reg=>ld_reg, d_in=>d_in, d_out=>f_b);
  MUX1: f_mux <= f_a when f_sel = '0' else f_b;
  Z_P: d_out <= f_mux when z_pad = '0' else std_logic_vector(to_unsigned(0, 32));
end architecture STRUCTURAL;
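The output stage of C_BANK1 is a two-level selection: `f_sel` steers one of the two register-file banks to the array while the other can be reloaded, and `z_pad` overrides the result with the all-zero word used for image-border padding. A minimal Python sketch of that selection (function name hypothetical):

```python
def c_bank_out(f_a, f_b, f_sel, z_pad):
    """Model of C_BANK1's output stage: bank select followed by zero-padding gate."""
    if z_pad:                     # z_pad = '1' forces the all-zero word
        return 0
    return f_b if f_sel else f_a  # f_sel chooses between the two register files


print(c_bank_out(0x11, 0x22, f_sel=1, z_pad=0))  # 34 (bank B selected)
```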
-- regfile.vhd
library IEEE;
use IEEE.std_logic_1164.all;

entity REGFILE is
  generic(
    n: integer := 8;  -- n denotes the data width
    d: integer := 5); -- d denotes the number of registers
  port(
    clk, rst, en_sf, en_w: in std_logic;
    ld_reg: in std_logic_vector(2 downto 0);
    bseq: in std_logic_vector(3 downto 0);
    d_in: in std_logic_vector(39 downto 0);
    d_out: out std_logic_vector(31 downto 0));
end entity REGFILE;

architecture STRUCTURAL of REGFILE is
  component PLS_REG is
    generic(
      n: integer := 8;  -- n denotes the data width
      d: integer := 5); -- d denotes the number of registers
    port(
      clk, rst, en_ld, en_sf: in std_logic;
      d_in: in std_logic_vector((n*d)-1 downto 0);
      d_out: out std_logic_vector(n-1 downto 0));
  end component PLS_REG;
  signal reg_sel: std_logic_vector(3 downto 0);
begin
  LF: for f in 1 to 4 generate
    PR_F: PLS_REG generic map(n=>n, d=>d)
      port map(clk=>clk, rst=>rst, en_ld=>reg_sel(f-1), en_sf=>en_sf,
               d_in=>d_in, d_out=>d_out((f*n)-1 downto ((f-1)*n)));
  end generate LF;

  SEL: process (ld_reg, en_w, bseq) is
  begin
    if (en_w = '0') then
      reg_sel <= "0000";
    else
      case ld_reg is
        when "001" => if (bseq(0) = '1') then reg_sel <= "0001"; else reg_sel <= "0000"; end if;
        when "010" => if (bseq(1) = '1') then reg_sel <= "0010"; else reg_sel <= "0000"; end if;
        when "011" => if (bseq(2) = '1') then reg_sel <= "0100"; else reg_sel <= "0000"; end if;
        when "100" => if (bseq(3) = '1') then reg_sel <= "1000"; else reg_sel <= "0000"; end if;
        when others => reg_sel <= "0000";
      end case;
    end if;
  end process SEL;
end architecture STRUCTURAL;

-- pls_reg.vhd
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
use BFULIB.bfu_pckg.all;

entity PLS_REG is
  generic(
    n: integer := 8;  -- n denotes the data width
    d: integer := 5); -- d denotes the number of registers
  port(
    clk, rst, en_ld, en_sf: in std_logic;
    d_in: in std_logic_vector((n*d)-1 downto 0);
    d_out: out std_logic_vector(n-1 downto 0));
end entity PLS_REG;

architecture STRUCTURAL of PLS_REG is
  signal s: std_logic_vector((n*(d+1))-1 downto 0);
begin
  L1: s(n-1 downto 0) <= std_logic_vector(to_unsigned(0, n));
  LK: for k in 1 to d generate
    REGK: S_REG generic map(n=>n)
      port map(clk=>clk, rst=>rst, en_ld=>en_ld, en_sf=>en_sf,
               p_in=>d_in((n*k)-1 downto n*(k-1)),
               d_in=>s((n*k)-1 downto n*(k-1)),
               d_out=>s((n*(k+1))-1 downto (n*k)));
  end generate LK;
  L2: d_out <= s((n*(d+1))-1 downto n*d);
end architecture STRUCTURAL;

-- reg_a.vhd
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

entity REG_A is
  generic(
    n: integer := 8;  -- denotes the data width
    d: integer := 5); -- denotes the number of registers
  port(
    clk, rst, z_pad: in std_logic;
    reg_sel: in std_logic_vector(2 downto 0);
    d_in: in std_logic_vector(n-1 downto 0);
    d_out: out std_logic_vector((n*d)-1 downto 0);
    ids_out: out std_logic_vector(n-1 downto 0));
end entity REG_A;

architecture BEHAVIORAL of REG_A is
  signal reg1, reg2, reg3, reg4, reg5, regt: unsigned(n-1 downto 0);
begin
  -- Register write with conditions
  REGSEL: process (clk, rst, reg_sel) is
  begin
    if (rst = '1') then
      reg1 <= to_unsigned(0, n); reg2 <= to_unsigned(0, n); reg3 <= to_unsigned(0, n);
      reg4 <= to_unsigned(0, n); reg5 <= to_unsigned(0, n); regt <= to_unsigned(0, n);
    elsif (clk'event and clk = '1') then
      case reg_sel is
        when "001" => reg1 <= unsigned(d_in); regt <= unsigned(d_in);
        when "010" => reg2 <= unsigned(d_in); regt <= unsigned(d_in);
        when "011" => reg3 <= unsigned(d_in); regt <= unsigned(d_in);
        when "100" => reg4 <= unsigned(d_in); regt <= unsigned(d_in);
        when "101" => reg5 <= unsigned(d_in); regt <= unsigned(d_in);
        when others => null;
      end case;
    end if;
  end process REGSEL;

  -- Output Logic
  L1: d_out <= std_logic_vector(reg5) & std_logic_vector(reg4) & std_logic_vector(reg3) &
               std_logic_vector(reg2) & std_logic_vector(reg1);
  L2: ids_out <= std_logic_vector(regt) when z_pad = '0' else std_logic_vector(to_unsigned(0, n));
end architecture BEHAVIORAL;

5. Input Data Shifters (IDS)

-- ids.vhd (Input Data Shifters)
-- This is the functional unit that is responsible for providing inputs to the five MAAs.
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;

entity IDS is
  port(
    clk, rst: in std_logic;
    ids_in: in std_logic_vector(39 downto 0);
    o1, o2, o3, o4, o5: out std_logic_vector(39 downto 0));
end entity IDS;

architecture STRUCTURAL of IDS is
  signal s1, s2, s3, s4: std_logic_vector(39 downto 0);
begin
  R1: REG_P generic map(n=>40) port map(clk=>clk, rst=>rst, d_in=>ids_in, d_out=>s1);
  L1: o1 <= s1;
  R2: REG_P generic map(n=>40) port map(clk=>clk, rst=>rst, d_in=>s1, d_out=>s2);
  L2: o2 <= s2;
  R3: REG_P generic map(n=>40) port map(clk=>clk, rst=>rst, d_in=>s2, d_out=>s3);
  L3: o3 <= s3;
  R4: REG_P generic map(n=>40) port map(clk=>clk, rst=>rst, d_in=>s3, d_out=>s4);
  L4: o4 <= s4;
  R5: REG_P generic map(n=>40) port map(clk=>clk, rst=>rst, d_in=>s4, d_out=>o5);
end architecture STRUCTURAL;

6. Arithmetic Unit (AU)

-- au.vhd (Arithmetic Unit)
-- This is the combination of all the arithmetic units, which includes all the
-- MAUs (25 of them).
library IEEE;
use IEEE.std_logic_1164.all;

entity AU is
  port(
    clk, rst: in std_logic;
    ids0, ids1, ids2, ids3, ids4: in std_logic_vector(39 downto 0);
    ld_reg: in std_logic_vector(4 downto 0);
    coef: in std_logic_vector(5 downto 0);
    out_pxl: out std_logic_vector(18 downto 0);
    ovf: out std_logic);
end entity AU;

architecture STRUCTURAL of AU is
  component MAA is
    port(
      clk, rst: in std_logic;
      ld_reg: in std_logic_vector(4 downto 0);
      coef: in std_logic_vector(5 downto 0);
      img_pxl: in std_logic_vector(39 downto 0);
      p_rst: out std_logic_vector(16 downto 0));
  end component MAA;
  component AT is
    port(
      clk, rst: in std_logic;
      maa0, maa1, maa2, maa3, maa4: in std_logic_vector(16 downto 0);
      ovf: out std_logic;
      out_pxl: out std_logic_vector(18 downto 0));
  end component AT;
  signal maa0, maa1, maa2, maa3, maa4: std_logic_vector(16 downto 0);
  signal ld_coef: std_logic_vector(24 downto 0);
begin
  MAA_0: MAA port map(clk=>clk, rst=>rst, ld_reg=>ld_coef(4 downto 0), coef=>coef, img_pxl=>ids0, p_rst=>maa0);
  MAA_1: MAA port map(clk=>clk, rst=>rst, ld_reg=>ld_coef(9 downto 5), coef=>coef, img_pxl=>ids1, p_rst=>maa1);
  MAA_2: MAA port map(clk=>clk, rst=>rst, ld_reg=>ld_coef(14 downto 10), coef=>coef, img_pxl=>ids2, p_rst=>maa2);
  MAA_3: MAA port map(clk=>clk, rst=>rst, ld_reg=>ld_coef(19 downto 15), coef=>coef, img_pxl=>ids3, p_rst=>maa3);
  MAA_4: MAA port map(clk=>clk, rst=>rst, ld_reg=>ld_coef(24 downto 20), coef=>coef, img_pxl=>ids4, p_rst=>maa4);
  U1: AT port map(clk=>clk, rst=>rst, maa0=>maa0, maa1=>maa1, maa2=>maa2, maa3=>maa3, maa4=>maa4, ovf=>ovf, out_pxl=>out_pxl);

  MUX: process (ld_reg, coef) is
  begin
    case (ld_reg) is
      when "00001" => ld_coef <= "0000000000000000000000001"; -- 1
      when "00010" => ld_coef <= "0000000000000000000000010"; -- 2
      when "00011" => ld_coef <= "0000000000000000000000100"; -- 3
      when "00100" => ld_coef <= "0000000000000000000001000"; -- 4
      when "00101" => ld_coef <= "0000000000000000000010000"; -- 5
      when "00110" => ld_coef <= "0000000000000000000100000"; -- 6
      when "00111" => ld_coef <= "0000000000000000001000000"; -- 7
      when "01000" => ld_coef <= "0000000000000000010000000"; -- 8
      when "01001" => ld_coef <= "0000000000000000100000000"; -- 9
      when "01010" => ld_coef <= "0000000000000001000000000"; -- 10
      when "01011" => ld_coef <= "0000000000000010000000000"; -- 11
      when "01100" => ld_coef <= "0000000000000100000000000"; -- 12
      when "01101" => ld_coef <= "0000000000001000000000000"; -- 13
      when "01110" => ld_coef <= "0000000000010000000000000"; -- 14
      when "01111" => ld_coef <= "0000000000100000000000000"; -- 15
      when "10000" => ld_coef <= "0000000001000000000000000"; -- 16
      when "10001" => ld_coef <= "0000000010000000000000000"; -- 17
      when "10010" => ld_coef <= "0000000100000000000000000"; -- 18
      when "10011" => ld_coef <= "0000001000000000000000000"; -- 19
      when "10100" => ld_coef <= "0000010000000000000000000"; -- 20
      when "10101" => ld_coef <= "0000100000000000000000000"; -- 21
      when "10110" => ld_coef <= "0001000000000000000000000"; -- 22
      when "10111" => ld_coef <= "0010000000000000000000000"; -- 23
      when "11000" => ld_coef <= "0100000000000000000000000"; -- 24
      when "11001" => ld_coef <= "1000000000000000000000000"; -- 25
      when others => ld_coef <= "0000000000000000000000000";
    end case;
  end process MUX;
end architecture STRUCTURAL;

-- at.vhd (Adding Tree)
-- This is the Adding Tree that is responsible for adding five 17-bit words from
-- five different MAAs. The structure includes four levels of pipeline stages.
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;

entity AT is
  port(
    clk, rst: in std_logic;
    maa0, maa1, maa2, maa3, maa4: in std_logic_vector(16 downto 0);
    ovf: out std_logic;
    out_pxl: out std_logic_vector(18 downto 0));
end entity AT;

architecture STRUCTURAL of AT is
  signal low, ovf_r: std_logic;
  signal sum1: std_logic_vector(16 downto 0);
  signal sum2, sum3, carry1, carry2: std_logic_vector(17 downto 0);
  signal carry3: std_logic_vector(18 downto 0);
  signal sum4: std_logic_vector(19 downto 0);
  signal pl1_r1: std_logic_vector(17 downto 0);
  signal pl1_r2, pl1_r3, pl1_r4: std_logic_vector(16 downto 0);
  signal pl2_r5, pl2_r6: std_logic_vector(17 downto 0);
  signal pl2_r7: std_logic_vector(16 downto 0);
  signal pl3_r8: std_logic_vector(18 downto 0);
  signal pl3_r9: std_logic_vector(17 downto 0);
  signal pl4_r10: std_logic_vector(19 downto 0);
begin
  L1: low <= '0';
  U1: CSA generic map(n=>17) port map(a=>maa0, b=>maa1, c=>maa2, sum=>sum1, carry=>carry1);
  R1: REG_P generic map(n=>18) port map(clk=>clk, rst=>rst, d_in=>carry1, d_out=>pl1_r1);
  R2: REG_P generic map(n=>17) port map(clk=>clk, rst=>rst, d_in=>sum1, d_out=>pl1_r2);
  R3: REG_P generic map(n=>17) port map(clk=>clk, rst=>rst, d_in=>maa3, d_out=>pl1_r3);
  R4: REG_P generic map(n=>17) port map(clk=>clk, rst=>rst, d_in=>maa4, d_out=>pl1_r4);
  U2: CSA generic map(n=>17) port map(a=>pl1_r1(16 downto 0), b=>pl1_r2, c=>pl1_r3, sum=>sum2(16 downto 0), carry=>carry2);
  L2: sum2(17) <= pl1_r1(17); -- This is the most significant bit from carry1 above
  R5: REG_P generic map(n=>18) port map(clk=>clk, rst=>rst, d_in=>carry2, d_out=>pl2_r5);
  R6: REG_P generic map(n=>18) port map(clk=>clk, rst=>rst, d_in=>sum2, d_out=>pl2_r6);
  R7: REG_P generic map(n=>17) port map(clk=>clk, rst=>rst, d_in=>pl1_r4, d_out=>pl2_r7);
  U3: CSA generic map(n=>17) port map(a=>pl2_r5(16 downto 0), b=>pl2_r6(16 downto 0), c=>pl2_r7, sum=>sum3(16 downto 0), carry=>carry3(17 downto 0));
  L3: HA port map(a=>pl2_r5(17), b=>pl2_r6(17), s=>sum3(17), cout=>carry3(18));
  R8: REG_P generic map(n=>19) port map(clk=>clk, rst=>rst, d_in=>carry3, d_out=>pl3_r8);
  R9: REG_P generic map(n=>18) port map(clk=>clk, rst=>rst, d_in=>sum3, d_out=>pl3_r9);
  U4: CLA_19 port map(a(17 downto 0)=>pl3_r9, a(18)=>low, b=>pl3_r8, s=>sum4(18 downto 0), ovf=>sum4(19));
  R10: REG_P generic map(n=>20) port map(clk=>clk, rst=>rst, d_in=>sum4, d_out=>pl4_r10);
  L4: out_pxl <= pl4_r10(18 downto 0);
  L5: ovf <= pl4_r10(19);
end architecture STRUCTURAL;

-- maa.vhd (This is the systolic array of five MAUs with a DU)
library IEEE;
use IEEE.std_logic_1164.all;

entity MAA is
  port(
    clk, rst: in std_logic;
    ld_reg: in std_logic_vector(4 downto 0);
    coef: in std_logic_vector(5 downto 0);
    img_pxl: in std_logic_vector(39 downto 0);
    p_rst: out std_logic_vector(16 downto 0));
end entity MAA;

architecture STRUCTURAL of MAA is
  component DU is
    port(
      clk, rst: in std_logic;
      ids_in: in std_logic_vector(39 downto 0);
      du_out: out std_logic_vector(39 downto 0));
  end component DU;
  component MAUS is
    port(
      clk, rst: in std_logic;
      ld_reg: in std_logic_vector(4 downto 0);
      coef: in std_logic_vector(5 downto 0);
      img_pxl: in std_logic_vector(39 downto 0);
      p_rst: out std_logic_vector(16 downto 0));
  end component MAUS;
  signal s: std_logic_vector(39 downto 0);
begin
  U1: DU port map(clk=>clk, rst=>rst, ids_in=>img_pxl, du_out=>s);
  U2: MAUS port map(clk=>clk, rst=>rst, ld_reg=>ld_reg, coef=>coef, img_pxl=>s, p_rst=>p_rst);
end architecture STRUCTURAL;

-- du.vhd
-- This is the Delay Unit for the propagation of the image data
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;

entity DU is
  port(
    clk, rst: in std_logic;
    ids_in: in std_logic_vector(39 downto 0);
    du_out: out std_logic_vector(39 downto 0));
end entity DU;

architecture STRUCTURAL of DU is
  signal p1: std_logic_vector(31 downto 0);
  signal p2, p3: std_logic_vector(23 downto 0);
  signal p4, p5: std_logic_vector(15 downto 0);
  signal p6, p7: std_logic_vector(7 downto 0);
begin
  L1: du_out(7 downto 0) <= ids_in(7 downto 0);
  PL1: REG_P generic map(n=>32) port map(clk=>clk, rst=>rst, d_in=>ids_in(39 downto 8), d_out=>p1);
  L2: du_out(15 downto 8) <= p1(7 downto 0);
  PL2: REG_P generic map(n=>24) port map(clk=>clk, rst=>rst, d_in=>p1(31 downto 8), d_out=>p2);
  PL3: REG_P generic map(n=>24) port map(clk=>clk, rst=>rst, d_in=>p2, d_out=>p3);
  L3: du_out(23 downto 16) <= p3(7 downto 0);
  PL4: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>p3(23 downto 8), d_out=>p4);
  PL5: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>p4, d_out=>p5);
  L4: du_out(31 downto 24) <= p5(7 downto 0);
  PL6: REG_P generic map(n=>8) port map(clk=>clk, rst=>rst, d_in=>p5(15 downto 8), d_out=>p6);
  PL7: REG_P generic map(n=>8) port map(clk=>clk, rst=>rst, d_in=>p6, d_out=>p7);
  L5: du_out(39 downto 32) <= p7;
end architecture STRUCTURAL;

-- maus.vhd
library IEEE;
use IEEE.std_logic_1164.all;

entity MAUS is
  port(
    clk, rst: in std_logic;
    ld_reg: in std_logic_vector(4 downto 0);
    coef: in std_logic_vector(5 downto 0);
    img_pxl: in std_logic_vector(39 downto 0);
    p_rst: out std_logic_vector(16 downto 0));
end entity MAUS;

architecture STRUCTURAL of MAUS is
  component MAU_0 is
    port(
      clk, rst, ld_reg: in std_logic;
      coef: in std_logic_vector(5 downto 0);
      img_pxl: in std_logic_vector(7 downto 0);
      p_res: out std_logic_vector(13 downto 0));
  end component MAU_0;
  component MAU_1 is
    port(
      clk, rst, ld_reg: in std_logic;
      coef: in std_logic_vector(5 downto 0);     -- filter coefficient
      img_pxl: in std_logic_vector(7 downto 0);  -- image pixel from DU
      p_mau: in std_logic_vector(13 downto 0);   -- previous MAU output
      p_res: out std_logic_vector(14 downto 0)); -- partial result to next MAU
  end component MAU_1;
  component MAU_2 is
    port(
      clk, rst, ld_reg: in std_logic;
      coef: in std_logic_vector(5 downto 0);     -- filter coefficient
      img_pxl: in std_logic_vector(7 downto 0);  -- image pixel from DU
      p_mau: in std_logic_vector(14 downto 0);   -- previous MAU output
      p_res: out std_logic_vector(15 downto 0)); -- partial result to next MAU
  end component MAU_2;
  component MAU_3 is
    port(
      clk, rst, ld_reg: in std_logic;
      coef: in std_logic_vector(5 downto 0);     -- filter coefficient
      img_pxl: in std_logic_vector(7 downto 0);  -- image pixel from DU
      p_mau: in std_logic_vector(15 downto 0);   -- previous MAU output
      p_res: out std_logic_vector(15 downto 0)); -- partial result to next MAU
  end component MAU_3;
  component MAU_4 is
    port(
      clk, rst, ld_reg: in std_logic;
      coef: in std_logic_vector(5 downto 0);     -- filter coefficient
      img_pxl: in std_logic_vector(7 downto 0);  -- image pixel from DU
      p_mau: in std_logic_vector(15 downto 0);   -- previous MAU output
      p_res: out std_logic_vector(16 downto 0)); -- partial result to next MAU
  end component MAU_4;
  signal p_res1: std_logic_vector(13 downto 0);
  signal p_res2: std_logic_vector(14 downto 0);
  signal p_res3, p_res4: std_logic_vector(15 downto 0);
begin
  U0: MAU_0 port map(clk=>clk, rst=>rst, ld_reg=>ld_reg(0), coef=>coef, img_pxl=>img_pxl(7 downto 0), p_res=>p_res1);
  U1: MAU_1 port map(clk=>clk, rst=>rst, ld_reg=>ld_reg(1), coef=>coef, img_pxl=>img_pxl(15 downto 8), p_mau=>p_res1, p_res=>p_res2);
  U2: MAU_2 port map(clk=>clk, rst=>rst, ld_reg=>ld_reg(2), coef=>coef, img_pxl=>img_pxl(23 downto 16), p_mau=>p_res2, p_res=>p_res3);
  U3: MAU_3 port map(clk=>clk, rst=>rst, ld_reg=>ld_reg(3), coef=>coef, img_pxl=>img_pxl(31 downto 24), p_mau=>p_res3, p_res=>p_res4);
  U4: MAU_4 port map(clk=>clk, rst=>rst, ld_reg=>ld_reg(4), coef=>coef, img_pxl=>img_pxl(39 downto 32), p_mau=>p_res4, p_res=>p_rst);
end architecture STRUCTURAL;

-- mau_0.vhd
-- This is the first MAU of the MAUs (for one systolic array).
-- This MAU only contains a Multiplication unit and no Adder since there is no previous
-- MAU output that needs to be accumulated. The range of multiplication is within
-- -8160 to 7905 (decimal), hence the output (p_res) is a 14-bit word.
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;

entity MAU_0 is
  port(
    clk, rst, ld_reg: in std_logic;
    coef: in std_logic_vector(5 downto 0);     -- Filter coefficient
    img_pxl: in std_logic_vector(7 downto 0);  -- Image pixels
    p_res: out std_logic_vector(13 downto 0)); -- Partial result to next MAU
end entity MAU_0;

architecture BEHAVIORAL of MAU_0 is
  signal coef_reg: std_logic_vector(5 downto 0);
  signal product: std_logic_vector(13 downto 0);
begin
  U1: MULT port map(a=>img_pxl, b=>coef_reg, p=>product);
  R1: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>product, d_out=>p_res);

  STORE: process (clk, rst, ld_reg) is
  begin
    if (rst = '1') then
      coef_reg <= "000000";
    elsif (rising_edge(clk)) then
      if (ld_reg = '1') then
        coef_reg <= coef;
      else
        coef_reg <= coef_reg;
      end if;
    end if;
  end process STORE;
end architecture BEHAVIORAL;
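The 14-bit claim in the MAU_0 comment can be checked directly: an 8-bit unsigned pixel (0 to 255) times a 6-bit two's-complement coefficient (-32 to 31) spans exactly -8160 to 7905, which fits a 14-bit signed word (-8192 to 8191). A quick Python check (a sketch for cross-checking, not part of the design):

```python
# Product range of MAU_0: 8-bit unsigned pixel x 6-bit signed coefficient.
pixels = range(0, 256)   # 8-bit unsigned image pixel
coefs = range(-32, 32)   # 6-bit two's-complement filter coefficient

lo = min(p * c for p in pixels for c in coefs)
hi = max(p * c for p in pixels for c in coefs)
print(lo, hi)                              # -8160 7905, as stated in the comment
print(-2**13 <= lo and hi <= 2**13 - 1)    # True: fits a 14-bit signed word
```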
-- mau_1.vhd
-- This is the second MAU of the MAUs. The range of this MAU is between
-- -16320 and 15810 (decimal), hence a 15-bit word is used for the partial result.
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;

entity MAU_1 is
  port(
    clk, rst, ld_reg: in std_logic;
    coef: in std_logic_vector(5 downto 0);     -- filter coefficient
    img_pxl: in std_logic_vector(7 downto 0);  -- image pixel from DU
    p_mau: in std_logic_vector(13 downto 0);   -- previous MAU output
    p_res: out std_logic_vector(14 downto 0)); -- partial result to next MAU
end entity MAU_1;

architecture BEHAVIORAL of MAU_1 is
  signal coef_reg: std_logic_vector(5 downto 0);
  signal pl1, pl2, product: std_logic_vector(13 downto 0);
  signal sum: std_logic_vector(14 downto 0);
begin
  R1: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>p_mau, d_out=>pl1);
  U1: MULT port map(a=>img_pxl, b=>coef_reg, p=>product);
  R2: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>product, d_out=>pl2);
  U2: CLA_15 port map(a=>pl1, b=>pl2, s=>sum);
  R3: REG_P generic map(n=>15) port map(clk=>clk, rst=>rst, d_in=>sum, d_out=>p_res);

  STORE: process (clk, rst, ld_reg) is
  begin
    if (rst = '1') then
      coef_reg <= "000000";
    elsif (rising_edge(clk)) then
      if (ld_reg = '1') then
        coef_reg <= coef;
      else
        coef_reg <= coef_reg;
      end if;
    end if;
  end process STORE;
end architecture BEHAVIORAL;

-- mau_2.vhd
-- This is the third MAU of the MAUs systolic array.
-- The range of the MAU is between -24480 and 23715 (decimal), thus
-- a 16-bit word is used for the partial result bus.
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;

entity MAU_2 is
  port(
    clk, rst, ld_reg: in std_logic;
    coef: in std_logic_vector(5 downto 0);     -- filter coefficient
    img_pxl: in std_logic_vector(7 downto 0);  -- image pixel from DU
    p_mau: in std_logic_vector(14 downto 0);   -- previous MAU output
    p_res: out std_logic_vector(15 downto 0)); -- partial result to next MAU
end entity MAU_2;

architecture BEHAVIORAL of MAU_2 is
  signal coef_reg: std_logic_vector(5 downto 0);
  signal product: std_logic_vector(13 downto 0);
  signal pl1, pl2, sum: std_logic_vector(15 downto 0);
begin
  L1: pl2(15 downto 14) <= "00";
  L2: pl1(15) <= '0';
  R1: REG_P generic map(n=>15) port map(clk=>clk, rst=>rst, d_in=>p_mau, d_out=>pl1(14 downto 0));
  U1: MULT port map(a=>img_pxl, b=>coef_reg, p=>product);
  R2: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>product, d_out=>pl2(13 downto 0));
  U2: CLA_16 port map(a=>pl1, b=>pl2, s=>sum);
  R3: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>sum, d_out=>p_res);

  STORE: process (clk, rst, ld_reg) is
  begin
    if (rst = '1') then
      coef_reg <= "000000";
    elsif (rising_edge(clk)) then
      if (ld_reg = '1') then
        coef_reg <= coef;
      else
        coef_reg <= coef_reg;
      end if;
    end if;
  end process STORE;
end architecture BEHAVIORAL;

-- mau_3.vhd
-- This is the fourth MAU within the MAUs. The range of the MAU is between
-- -32640 and 31620 (decimal), thus a 16-bit word bus is used for the partial
-- result coming out of this MAU.
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;

entity MAU_3 is
  port(
    clk, rst, ld_reg: in std_logic;
    coef: in std_logic_vector(5 downto 0);     -- filter coefficient
    img_pxl: in std_logic_vector(7 downto 0);  -- image pixel from DU
    p_mau: in std_logic_vector(15 downto 0);   -- previous MAU output
    p_res: out std_logic_vector(15 downto 0)); -- partial result to next MAU
end entity MAU_3;

architecture BEHAVIORAL of MAU_3 is
  signal coef_reg: std_logic_vector(5 downto 0);
  signal product: std_logic_vector(13 downto 0);
  signal pl1, pl2, sum: std_logic_vector(15 downto 0);
begin
  L1: pl2(15 downto 14) <= "00";
  R1: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>p_mau, d_out=>pl1);
  U1: MULT port map(a=>img_pxl, b=>coef_reg, p=>product);
  R2: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>product, d_out=>pl2(13 downto 0));
  U2: CLA_16 port map(a=>pl1, b=>pl2, s=>sum);
  R3: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>sum, d_out=>p_res);

  STORE: process (clk, rst, ld_reg) is
  begin
    if (rst = '1') then
      coef_reg <= "000000";
    elsif (rising_edge(clk)) then
      if (ld_reg = '1') then
        coef_reg <= coef;
      else
        coef_reg <= coef_reg;
      end if;
    end if;
  end process STORE;
end architecture BEHAVIORAL;

-- mau_4.vhd
-- This is the last MAU within the MAUs systolic array. The range for this MAU
-- is between -40800 and 39525 (decimal), thus a 17-bit word bus is used for the
-- partial result coming out of this MAU.
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;

entity MAU_4 is
  port(
    clk, rst, ld_reg: in std_logic;
    coef: in std_logic_vector(5 downto 0);     -- filter coefficient
    img_pxl: in std_logic_vector(7 downto 0);  -- image pixel from DU
    p_mau: in std_logic_vector(15 downto 0);   -- previous MAU output
    p_res: out std_logic_vector(16 downto 0)); -- partial result to next MAU
end entity MAU_4;

architecture BEHAVIORAL of MAU_4 is
  signal coef_reg: std_logic_vector(5 downto 0);
  signal product: std_logic_vector(13 downto 0);
  signal pl1, pl2: std_logic_vector(15 downto 0);
  signal sum: std_logic_vector(16 downto 0);
begin
  L1: pl2(15 downto 14) <= "00";
  R1: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>p_mau, d_out=>pl1);
  U1: MULT port map(a=>img_pxl, b=>coef_reg, p=>product);
  R2: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>product, d_out=>pl2(13 downto 0));
  U2: CLA_17 port map(a=>pl1, b=>pl2, s=>sum);
  R3: REG_P generic map(n=>17) port map(clk=>clk, rst=>rst, d_in=>sum, d_out=>p_res);

  STORE: process (clk, rst, ld_reg) is
  begin
    if (rst = '1') then
      coef_reg <= "000000";
    elsif (rising_edge(clk)) then
      if (ld_reg = '1') then
        coef_reg <= coef;
      else
        coef_reg <= coef_reg;
      end if;
    end if;
  end process STORE;
end architecture BEHAVIORAL;
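The bus widths quoted in the MAU_0 through MAU_4 comments (14, 15, 16, 16, 17 bits) follow from the fact that stage k of the chain holds k accumulated products, each in -8160..7905. A short Python check of those ranges, and of the 19-bit adding-tree output (a sketch; the width helper is not part of the design):

```python
def signed_bits(lo, hi):
    """Smallest two's-complement width holding all of [lo, hi] (hypothetical helper)."""
    n = 1
    while not (-2**(n - 1) <= lo and hi <= 2**(n - 1) - 1):
        n += 1
    return n

# One product spans -8160..7905; stage k of the MAU chain holds k such products.
widths = [signed_bits(k * -8160, k * 7905) for k in range(1, 6)]
print(widths)   # [14, 15, 16, 16, 17] -> matches p_res widths of MAU_0..MAU_4

# The adding tree sums five 17-bit MAA outputs (25 products in total).
print(signed_bits(25 * -8160, 25 * 7905))   # 19 -> matches the 19-bit out_pxl
```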
7. Multiplication and Adder Units (These functional units have been defined as a library package) library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; package bfu_pckg is
component CLA_15 is port( a, b: in std_logic_vector(13 downto 0); s: out std_logic_vector(14 downto 0)); end component CLA_15; component CLA_16 is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(15 downto 0)); end component CLA_16; component CLA_17 is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(16 downto 0)); end component CLA_17; component CLA_19 is port(a, b: in std_logic_vector(18 downto 0); s: out std_logic_vector(18 downto 0); ovf: out std_logic); end component CLA_19; component MULT is port(a: in std_logic_vector(7 downto 0); b: in std_logic_vector(5 downto 0); p: out std_logic_vector(13 downto 0)); end component MULT; component FA is port(a, b, cin: in std_logic; s, cout: out std_logic); end component FA; component HA is port( a, b: in std_logic; s, cout: out std_logic); end component HA; component CSA is generic(n: positive := 5); port( a, b, c: in std_logic_vector(n-1 downto 0); sum: out std_logic_vector(n-1 downto 0); carry: out std_logic_vector(n downto 0)); end component CSA; component REG_P is generic(n: positive := 5); port( clk, rst: in std_logic; d_in: in std_logic_vector(n-1 downto 0); d_out: out std_logic_vector(n-1 downto 0)); end component REG_P; component REG_N is generic(n: positive := 5); port( clk, rst: in std_logic; d_in: in std_logic_vector(n-1 downto 0); d_out: out std_logic_vector(n-1 downto 0)); end component REG_N; end package bfu_pckg; -- mult.vhd (Multiplication Unit) library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity MULT is
port(a: in std_logic_vector(7 downto 0); b: in std_logic_vector(5 downto 0); p: out std_logic_vector(13 downto 0)); end entity MULT; architecture STRUCT of MULT is component PPG is generic(n: integer := 8); port( a: in std_logic_vector(n-1 downto 0); mult: in std_logic_vector(2 downto 0); pp: out std_logic_vector(n downto 0); spp: out std_logic); end component PPG; component R3_2C is generic(n: integer := 14); port(a, b, c: in std_logic_vector(n-1 downto 0); sum: out std_logic_vector(n-1 downto 0); carry: out std_logic_vector(n downto 0)); end component R3_2C; component S3_2C is port(pp1, pp2, pp3: in std_logic_vector(8 downto 0); sp1, sp2, sp3: in std_logic; sum, carry: out std_logic_vector(13 downto 0)); end component S3_2C; component CLA_14 is port( a, b: in std_logic_vector(13 downto 0); s: out std_logic_vector(13 downto 0)); end component CLA_14; signal sp1, sp2, sp3: std_logic; signal ls: std_logic_vector(2 downto 0); signal pp1, pp2, pp3: std_logic_vector(8 downto 0); signal pp4, sum1, sum2, carry1: std_logic_vector(13 downto 0); signal carry2: std_logic_vector(14 downto 0); begin L1: ls <= b(1 downto 0) & '0'; U1: PPG port map(a=>a, mult=>ls, pp=>pp1, spp=>sp1); U2: PPG port map(a=>a, mult=>b(3 downto 1), pp=>pp2, spp=>sp2); U3: PPG port map(a=>a, mult=>b(5 downto 3), pp=>pp3, spp=>sp3); U4: S3_2C port map(pp1=>pp1, pp2=>pp2, pp3=>pp3, sp1=>sp1, sp2=>sp2, sp3=>sp3, sum=>sum1, carry=>carry1); L2: pp4 <= "000000000" & sp3 & '0' & sp2 & '0' & sp1; U5: R3_2C port map(a=>sum1, b=>carry1, c=>pp4, sum=>sum2, carry=>carry2); U6: CLA_14 port map(a=>sum2, b=>carry2(13 downto 0), s=>p); end architecture STRUCT; -- ppg.vhd (Partial Product Generator) library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity PPG is generic(n: integer := 8); port( a: in std_logic_vector(n-1 downto 0); mult: in std_logic_vector(2 downto 0); pp: out std_logic_vector(n downto 0); spp: out std_logic); end entity PPG;
architecture BEHAVIORAL of PPG is begin PP_PROC: process(mult, a) begin case mult is when "000" => for k in n downto 0 loop pp(k) <= '0'; end loop; spp <= '0'; when "001" => pp <= '0' & a; spp <= '0'; when "010" => pp <= '0' & a; spp <= '0'; when "011" => pp <= a & '0'; spp <= '0'; when "100" => pp <= not(a & '0'); spp <= '1'; when "101" => pp <= not('0' & a); spp <= '1'; when "110" => pp <= not('0' & a); spp <= '1'; when "111" => for l in n downto 0 loop pp(l) <= '0'; end loop; spp <= '0'; when others => null; end case; end process PP_PROC; end architecture BEHAVIORAL; -- r3_2c.vhd (Second level 3 to 2 Counter for Multiplier) library IEEE; use IEEE.std_logic_1164.all; entity R3_2C is generic(n: integer := 14); port(a, b, c: in std_logic_vector(n-1 downto 0); -- a->sum, b->carry c->pp4 sum: out std_logic_vector(n-1 downto 0); carry: out std_logic_vector(n downto 0)); end entity R3_2C; architecture STRUCTURAL of R3_2C is component HA is port( a, b: in std_logic; s, cout: out std_logic); end component HA; component FA is port(a, b, cin: in std_logic; s, cout: out std_logic); end component FA; begin L1: carry(0) <= '0'; LK: for k in 2 downto 0 generate HAK: HA port map(a=>a(k), b=>c(k), s=>sum(k), cout=>carry(k+1)); end generate LK; L2: HA port map(a=>a(3), b=>b(3), s=>sum(3), cout=>carry(4)); L3: FA port map(a=>a(4), b=>b(4), cin=>c(4), s=>sum(4), cout=>carry(5)); LF: for f in n-1 downto 5 generate
HAF: HA port map(a=>a(f), b=>b(f), s=>sum(f), cout=>carry(f+1)); end generate LF; end architecture STRUCTURAL; -- s3_2c.vhd (Special 3 to 2 Counter) library IEEE; use IEEE.std_logic_1164.all; entity S3_2C is port(pp1, pp2, pp3: in std_logic_vector(8 downto 0); sp1, sp2, sp3: in std_logic; sum, carry: out std_logic_vector(13 downto 0)); end entity S3_2C; architecture STRUCTURAL of S3_2C is component HA is port( a, b: in std_logic; s, cout: out std_logic); end component HA; component FA is port(a, b, cin: in std_logic; s, cout: out std_logic); end component FA; signal high, n_sp1, n_sp2, n_sp3: std_logic; begin G0: high <= '1'; G1: n_sp1 <= not sp1; G2: n_sp2 <= not sp2; G3: n_sp3 <= not sp3; L1: sum(1 downto 0) <= pp1(1 downto 0); L2: carry(2 downto 0) <= "000"; L3: sum(13) <= n_sp3; LK: for k in 1 downto 0 generate HAA: HA port map(a=>pp1(k+2), b=>pp2(k), s=>sum(k+2), cout=>carry(k+3)); end generate LK; LG: for g in 4 downto 0 generate FAG: FA port map(a=>pp1(g+4), b=>pp2(g+2), cin=>pp3(g), s=>sum(g+4), cout=>carry(g+5)); end generate LG; LF: for f in 6 downto 5 generate FAF: FA port map(a=>sp1, b=>pp2(f+2), cin=>pp3(f), s=>sum(f+4), cout=>carry(f+5)); end generate LF; FA7: FA port map(a=>n_sp1, b=>n_sp2, cin=>pp3(7), s=>sum(11), cout=>carry(12)); HA2: HA port map(a=>high, b=>pp3(8), s=>sum(12), cout=>carry(13)); end architecture STRUCTURAL; -- cla_16.vhd (Carry Lookahead Adder ~ 16 Bits) library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity CLA_16 is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(15 downto 0)); end entity CLA_16; architecture STRUCTURAL of CLA_16 is
component CLA_4_1 is port( a, b: in std_logic_vector(3 downto 0); s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4_1; component CLA_4 is port( a, b: in std_logic_vector(3 downto 0); cin: in std_logic; s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4; component CLL_2L is port(p, g: in std_logic_vector(2 downto 0); cout: out std_logic_vector(3 downto 1)); end component CLL_2L; signal p, g: std_logic_vector(3 downto 0); signal c: std_logic_vector(3 downto 1); begin U1: CLA_4_1 port map(a=>a(3 downto 0), b=>b(3 downto 0), s=>s(3 downto 0), p_out=>p(0), g_out=>g(0)); U2: CLA_4 port map(a=>a(7 downto 4), b=>b(7 downto 4), cin=>c(1), s=>s(7 downto 4), p_out=>p(1), g_out=>g(1)); U3: CLA_4 port map(a=>a(11 downto 8), b=>b(11 downto 8), cin=>c(2), s=>s(11 downto 8), p_out=>p(2), g_out=>g(2)); U4: CLA_4 port map(a=>a(15 downto 12), b=>b(15 downto 12), cin=>c(3), s=>s(15 downto 12), p_out=>p(3), g_out=>g(3)); U5: CLL_2L port map(p=>p(2 downto 0), g=>g(2 downto 0), cout=>c); end architecture STRUCTURAL; -- cla_19.vhd (19-bit Carry Lookahead adder) library IEEE; use IEEE.std_logic_1164.all; entity CLA_19 is port(a, b: in std_logic_vector(18 downto 0); s: out std_logic_vector(18 downto 0); ovf: out std_logic); end entity CLA_19; architecture STRUCTURAL of CLA_19 is component CLA_3S is port(a, b: in std_logic_vector(2 downto 0); cin: in std_logic; s: out std_logic_vector(2 downto 0); ovf: out std_logic); end component CLA_3S; component CLA_17 is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(16 downto 0)); end component CLA_17; signal cout, high, low, ovf0, ovf1: std_logic; signal s1, s2: std_logic_vector(2 downto 0); begin L1: high <= '1'; L2: low <= '0'; U1: CLA_17 port map(a=>a(15 downto 0), b=>b(15 downto 0), s(15 downto 0)=>s(15 downto 0), s(16)=>cout); U2: CLA_3S port map(a=>a(18 downto 16), b=>b(18 downto 16), cin=>high, s=>s1, ovf=>ovf1);
U3: CLA_3S port map(a=>a(18 downto 16), b=>b(18 downto 16), cin=>low, s=>s2, ovf=>ovf0); L3: s(18 downto 16) <= s1 when cout = '1' else s2; L4: ovf <= ovf1 when cout='1' else ovf0; end architecture STRUCTURAL; -- cla_17.vhd (Carry Lookahead Adder ~ 17 Bits) -- This Carry Lookahead adder adds two 16-bit numbers and generates a 17-bit sum. library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity CLA_17 is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(16 downto 0)); end entity CLA_17; architecture STRUCTURAL of CLA_17 is component CLA_16_P is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(15 downto 0); p_out, g_out: out std_logic); end component CLA_16_P; signal p_4, g_4: std_logic; begin U1: CLA_16_P port map(a=>a, b=>b, s=>s(15 downto 0), p_out=>p_4, g_out=>g_4); L3: s(16) <= g_4; end architecture STRUCTURAL; -- cla_16_p.vhd (Carry Lookahead Adder ~ 16 Bits) with p and g library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity CLA_16_P is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(15 downto 0); p_out, g_out: out std_logic); end entity CLA_16_P; architecture STRUCTURAL of CLA_16_P is component CLA_4_1 is port( a, b: in std_logic_vector(3 downto 0); s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4_1; component CLA_4 is port( a, b: in std_logic_vector(3 downto 0); cin: in std_logic; s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4; component CLL_2 is port( p, g: in std_logic_vector(3 downto 0); cout: out std_logic_vector(3 downto 1); p_out, g_out: out std_logic); end component CLL_2; signal p, g: std_logic_vector(3 downto 0);
signal c: std_logic_vector(3 downto 1); begin U1: CLA_4_1 port map(a=>a(3 downto 0), b=>b(3 downto 0), s=>s(3 downto 0), p_out=>p(0), g_out=>g(0)); U2: CLA_4 port map(a=>a(7 downto 4), b=>b(7 downto 4), cin=>c(1), s=>s(7 downto 4), p_out=>p(1), g_out=>g(1)); U3: CLA_4 port map(a=>a(11 downto 8), b=>b(11 downto 8), cin=>c(2), s=>s(11 downto 8), p_out=>p(2), g_out=>g(2)); U4: CLA_4 port map(a=>a(15 downto 12), b=>b(15 downto 12), cin=>c(3), s=>s(15 downto 12), p_out=>p(3), g_out=>g(3)); U5: CLL_2 port map(p=>p, g=>g, cout=>c, p_out=>p_out, g_out=>g_out); end architecture STRUCTURAL; -- -- cll_2.vhd (2nd Level of Carry Lookahead Logic - for 4 bits) with P, G output library IEEE; use IEEE.std_logic_1164.all; entity CLL_2 is port( p, g: in std_logic_vector(3 downto 0); cout: out std_logic_vector(3 downto 1); p_out, g_out: out std_logic); end entity CLL_2; architecture BEHAVIORAL of CLL_2 is begin L1: cout(1) <= g(0); L2: cout(2) <= g(1) or (p(1) and g(0)); L3: cout(3) <= g(2) or (p(2) and g(1)) or (p(2) and p(1) and g(0)); L4: p_out <= p(3) and p(2) and p(1) and p(0); L5: g_out <= g(3) or (p(3) and g(2)) or (p(3) and p(2) and g(1)) or (p(3) and p(2) and p(1) and g(0)); end architecture BEHAVIORAL; -- cla_15.vhd (Carry Lookahead Adder ~ 15 Bits) -- This adder adds two 14-bit numbers and produces a 15-bit sum. library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity CLA_15 is port( a, b: in std_logic_vector(13 downto 0); s: out std_logic_vector(14 downto 0)); end entity CLA_15; architecture STRUCTURAL of CLA_15 is component CLA_4_1 is port( a, b: in std_logic_vector(3 downto 0); s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4_1; component CLA_4 is port( a, b: in std_logic_vector(3 downto 0); cin: in std_logic; s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4; component CLA_3 is port(a, b: in std_logic_vector(1 downto 0); cin: in
std_logic;
s: out std_logic_vector(2 downto 0)); end component CLA_3; component CLL_2L is port(p, g: in std_logic_vector(2 downto 0); cout: out std_logic_vector(3 downto 1)); end component CLL_2L; signal p, g: std_logic_vector(2 downto 0); signal c: std_logic_vector(3 downto 1); begin U1: CLA_4_1 port map(a=>a(3 downto 0), b=>b(3 downto 0), s=>s(3 downto 0), p_out=>p(0), g_out=>g(0)); U2: CLA_4 port map(a=>a(7 downto 4), b=>b(7 downto 4), cin=>c(1), s=>s(7 downto 4), p_out=>p(1), g_out=>g(1)); U3: CLA_4 port map(a=>a(11 downto 8), b=>b(11 downto 8), cin=>c(2), s=>s(11 downto 8), p_out=>p(2), g_out=>g(2)); U4: CLA_3 port map(a=>a(13 downto 12), b=>b(13 downto 12), cin=>c(3), s=>s(14 downto 12)); U5: CLL_2L port map(p=>p, g=>g, cout=>c); end architecture STRUCTURAL; -- -- cla_3.vhd (Carry Lookahead Adder ~ 3 Bits (Last 3 Bits)) library IEEE; use IEEE.std_logic_1164.all; entity CLA_3 is port(a, b: in std_logic_vector(1 downto 0); cin: in std_logic; s: out std_logic_vector(2 downto 0)); end entity CLA_3; architecture STRUCTURAL of CLA_3 is component SCLL_3 is port( cin: in std_logic; p, g: in std_logic_vector(1 downto 0); cout: out std_logic_vector(2 downto 0)); end component SCLL_3; component PFA is port(a, b, c: in std_logic; s, g, p: out std_logic); end component PFA; signal p, g: std_logic_vector(1 downto 0); signal c: std_logic_vector(2 downto 0); begin U0: SCLL_3 port map(cin=>cin, p=>p, g=>g, cout=>c); U1: PFA port map(a=>a(0), b=>b(0), c=>c(0), s=>s(0), g=>g(0), p=>p(0)); U2: PFA port map(a=>a(1), b=>b(1), c=>c(1), s=>s(1), g=>g(1), p=>p(1)); L1: s(2) <= c(2); end architecture STRUCTURAL; -- -- scll_3.vhd (Carry Lookahead Logic for Bit position 13 & 14) library IEEE; use IEEE.std_logic_1164.all; entity SCLL_3 is port( cin: in std_logic; p, g: in std_logic_vector(1 downto 0); cout: out std_logic_vector(2 downto 0)); end entity SCLL_3;
architecture BEHAVIORAL of SCLL_3 is begin L1: cout(0) <= cin; L2: cout(1) <= g(0) or (p(0) and cin); L3: cout(2) <= g(1) or (p(1) and g(0)) or (p(1) and p(0) and cin); end architecture BEHAVIORAL; -- cla_14.vhd (Carry Lookahead Adder ~ 14 Bits) library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity CLA_14 is port( a, b: in std_logic_vector(13 downto 0); s: out std_logic_vector(13 downto 0)); end entity CLA_14; architecture STRUCTURAL of CLA_14 is component CLA_4_1 is port( a, b: in std_logic_vector(3 downto 0); s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4_1; component CLA_4 is port( a, b: in std_logic_vector(3 downto 0); cin: in std_logic; s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4; component CLA_2 is port(a, b: in std_logic_vector(1 downto 0); cin: in std_logic; s: out std_logic_vector(1 downto 0)); end component CLA_2; component CLL_2L is port(p, g: in std_logic_vector(2 downto 0); cout: out std_logic_vector(3 downto 1)); end component CLL_2L; signal p, g: std_logic_vector(2 downto 0); signal c: std_logic_vector(3 downto 1); begin U1: CLA_4_1 port map(a=>a(3 downto 0), b=>b(3 downto 0), s=>s(3 downto 0), p_out=>p(0), g_out=>g(0)); U2: CLA_4 port map(a=>a(7 downto 4), b=>b(7 downto 4), cin=>c(1), s=>s(7 downto 4), p_out=>p(1), g_out=>g(1)); U3: CLA_4 port map(a=>a(11 downto 8), b=>b(11 downto 8), cin=>c(2), s=>s(11 downto 8), p_out=>p(2), g_out=>g(2)); U4: CLA_2 port map(a=>a(13 downto 12), b=>b(13 downto 12), cin=>c(3), s=>s(13 downto 12)); U5: CLL_2L port map(p=>p, g=>g, cout=>c); end architecture STRUCTURAL; -- cla_3s.vhd (Carry Lookahead Adder ~ 3 Bits (For CLA_19)) library IEEE; use IEEE.std_logic_1164.all; entity CLA_3S is
port(a, b: in std_logic_vector(2 downto 0); cin: in std_logic; s: out std_logic_vector(2 downto 0); ovf: out std_logic); end entity CLA_3S; architecture STRUCTURAL of CLA_3S is component SCLL_4 is port( cin: in std_logic; p, g: in std_logic_vector(2 downto 0); cout: out std_logic_vector(3 downto 0)); end component SCLL_4; component PFA is port(a, b, c: in std_logic; s, g, p: out std_logic); end component PFA; signal p, g: std_logic_vector(2 downto 0); signal c: std_logic_vector(3 downto 0); begin U0: SCLL_4 port map(cin=>cin, p=>p, g=>g, cout=>c); U1: PFA port map(a=>a(0), b=>b(0), c=>c(0), s=>s(0), g=>g(0), p=>p(0)); U2: PFA port map(a=>a(1), b=>b(1), c=>c(1), s=>s(1), g=>g(1), p=>p(1)); U3: PFA port map(a=>a(2), b=>b(2), c=>c(2), s=>s(2), g=>g(2), p=>p(2)); L1: ovf <= c(2) xor c(3); end architecture STRUCTURAL; -- scll_4.vhd (Carry Lookahead Logic for Bit position 17, 18, & 19) library IEEE; use IEEE.std_logic_1164.all; entity SCLL_4 is port( cin: in std_logic; p, g: in std_logic_vector(2 downto 0); cout: out std_logic_vector(3 downto 0)); end entity SCLL_4; architecture BEHAVIORAL of SCLL_4 is begin L1: cout(0) <= cin; L2: cout(1) <= g(0) or (p(0) and cin); L3: cout(2) <= g(1) or (p(1) and g(0)) or (p(1) and p(0) and cin); L4: cout(3) <= g(2) or (p(2) and g(1)) or (p(2) and p(1) and g(0)) or (p(2) and p(1) and p(0) and cin); end architecture BEHAVIORAL; -- cla_4.vhd (Carry Lookahead Adder ~ 4 bits) library IEEE; use IEEE.std_logic_1164.all; entity CLA_4 is port( a, b: in std_logic_vector(3 downto 0); cin: in std_logic; s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end entity CLA_4; architecture STRUCTURAL of CLA_4 is component PFA is port(a, b, c: in std_logic; s, g, p: out std_logic); end component PFA;
component CLL is port( cin: in std_logic; p, g: in std_logic_vector(3 downto 0); cout: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLL; signal c, g, p: std_logic_vector(3 downto 0); begin L1: CLL port map(cin=>cin, p=>p, g=>g, cout=>c, p_out=>p_out, g_out=>g_out); LK: for k in 3 downto 0 generate PFAK: PFA port map(a=>a(k), b=>b(k), c=>c(k), s=>s(k), g=>g(k), p=>p(k)); end generate LK; end architecture STRUCTURAL; -- cla_4_1.vhd (Carry Lookahead Adder ~ 4 bits / for the first CLA_4) library IEEE; use IEEE.std_logic_1164.all; entity CLA_4_1 is port( a, b: in std_logic_vector(3 downto 0); s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end entity CLA_4_1; architecture STRUCTURAL of CLA_4_1 is component PFA is port(a, b, c: in std_logic; s, g, p: out std_logic); end component PFA; component CLL_1 is port( p, g: in std_logic_vector(3 downto 0); cout: out std_logic_vector(2 downto 0); p_out, g_out: out std_logic); end component CLL_1; signal g, p: std_logic_vector(3 downto 0); signal c: std_logic_vector(3 downto 0); begin L0: g(0) <= a(0) and b(0); L1: p(0) <= a(0) xor b(0); L2: s(0) <= p(0); L3: CLL_1 port map(p=>p, g=>g, cout=>c(3 downto 1), p_out=>p_out, g_out=>g_out); LK: for k in 3 downto 1 generate PFAK: PFA port map(a=>a(k), b=>b(k), c=>c(k), s=>s(k), g=>g(k), p=>p(k)); end generate LK; end architecture STRUCTURAL; -- -- cll_1.vhd (Carry Lookahead Logic - for first 4 bits) library IEEE; use IEEE.std_logic_1164.all; entity CLL_1 is port( p, g: in std_logic_vector(3 downto 0);
cout: out std_logic_vector(2 downto 0); p_out, g_out: out std_logic); end entity CLL_1; architecture BEHAVIORAL of CLL_1 is begin L1: cout(0) <= g(0); L2: cout(1) <= g(1) or (p(1) and g(0)); L3: cout(2) <= g(2) or (p(2) and g(1)) or (p(2) and p(1) and g(0)); L4: p_out <= p(3) and p(2) and p(1) and p(0); L5: g_out <= g(3) or (p(3) and g(2)) or (p(3) and p(2) and g(1)) or (p(3) and p(2) and p(1) and g(0)); end architecture BEHAVIORAL; -- cll.vhd (Carry Lookahead Logic - for 4 bits) library IEEE; use IEEE.std_logic_1164.all; entity CLL is port( cin: in std_logic; p, g: in std_logic_vector(3 downto 0); cout: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end entity CLL; architecture BEHAVIORAL of CLL is begin L1: cout(0) <= cin; L2: cout(1) <= g(0) or (p(0) and cin); L3: cout(2) <= g(1) or (p(1) and g(0)) or (p(1) and p(0) and cin); L4: cout(3) <= g(2) or (p(2) and g(1)) or (p(2) and p(1) and g(0)) or (p(2) and p(1) and p(0) and cin); L5: p_out <= p(3) and p(2) and p(1) and p(0); L6: g_out <= g(3) or (p(3) and g(2)) or (p(3) and p(2) and g(1)) or (p(3) and p(2) and p(1) and g(0)); end architecture BEHAVIORAL; -- cll_2l.vhd (2nd Level of Carry Lookahead Logic - for 4 bits) library IEEE; use IEEE.std_logic_1164.all; entity CLL_2L is port(p, g: in std_logic_vector(2 downto 0); cout: out std_logic_vector(3 downto 1)); end entity CLL_2L; architecture BEHAVIORAL of CLL_2L is begin L1: cout(1) <= g(0); L2: cout(2) <= g(1) or (p(1) and g(0)); L3: cout(3) <= g(2) or (p(2) and g(1)) or (p(2) and p(1) and g(0)); end architecture BEHAVIORAL; -- csa.vhd (Carry Save Adder) library IEEE; use IEEE.std_logic_1164.all; entity CSA is generic(n: positive := 5); port( a, b, c: in std_logic_vector(n-1 downto 0);
sum: out std_logic_vector(n-1 downto 0); carry: out std_logic_vector(n downto 0)); end entity CSA; architecture STRUCTURAL of CSA is component FA is port(a, b, cin: in std_logic; s, cout: out std_logic); end component FA; begin L1: carry(0) <= '0'; KL: for k in n-1 downto 0 generate FAK: FA port map(a=>a(k), b=>b(k), cin=>c(k), s=>sum(k), cout=>carry(k+1)); end generate KL; end architecture STRUCTURAL; -- -- cla_2.vhd (Carry Lookahead Adder ~ 2 Bits (Last 2 Bits)) library IEEE; use IEEE.std_logic_1164.all; entity CLA_2 is port(a, b: in std_logic_vector(1 downto 0); cin: in std_logic; s: out std_logic_vector(1 downto 0)); end entity CLA_2; architecture STRUCTURAL of CLA_2 is component SCLL is port(cin, p, g: in std_logic; cout: out std_logic); end component SCLL; component PFA is port(a, b, c: in std_logic; s, g, p: out std_logic); end component PFA; signal c, p, g: std_logic; begin U1: PFA port map(a=>a(0), b=>b(0), c=>cin, s=>s(0), g=>g, p=>p); U2: SCLL port map(cin=>cin, p=>p, g=>g, cout=>c); L1: s(1) <= a(1) xor b(1) xor c; end architecture STRUCTURAL; -- -- scll.vhd (Carry Lookahead Logic for Bit position 13) library IEEE; use IEEE.std_logic_1164.all; entity SCLL is port(cin, p, g: in std_logic; cout: out std_logic); end entity SCLL; architecture BEHAVIORAL of SCLL is begin L1: cout <= g or (p and cin);
end architecture BEHAVIORAL; -- ha.vhd (Half Adder) library IEEE; use IEEE.std_logic_1164.all; entity HA is port( a, b: in std_logic; s, cout: out std_logic); end entity HA; architecture BEHAVIORAL of HA is begin L1: s <= a xor b; L2: cout <= a and b; end architecture BEHAVIORAL; -- fa.vhd (full adder) library IEEE; use IEEE.std_logic_1164.all; entity FA is port(a, b, cin: in std_logic; s, cout: out std_logic); end entity FA; architecture BEHAVIORAL of FA is begin L1: s <= a xor b xor cin; L2: cout <= (a and b) or (a and cin) or (b and cin); end architecture BEHAVIORAL; -- pfa.vhd (Partial Full Adder) library IEEE; use IEEE.std_logic_1164.all; entity PFA is port(a, b, c: in std_logic; s, g, p: out std_logic); end entity PFA; architecture BEHAVIORAL of PFA is begin L1: s <= a xor b xor c; L2: g <= a and b; L3: p <= a xor b; end architecture BEHAVIORAL; -- reg_p.vhd (Positive edge clocked registers) library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; entity REG_P is generic(n: positive := 5); port( clk, rst: in std_logic; d_in: in std_logic_vector(n-1 downto 0); d_out: out std_logic_vector(n-1 downto 0)); end entity REG_P;
architecture BEHAVIORAL of REG_P is signal d_reg: signed(n-1 downto 0); begin STORE: process (clk, rst, d_in) is begin if (rst = '1') then d_reg <= conv_signed('0', n); elsif (rising_edge(clk)) then d_reg <= signed(d_in); end if; end process STORE; L1: d_out <= std_logic_vector(d_reg); end architecture BEHAVIORAL; -- reg_n.vhd (Negative edge clocked registers) library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; entity REG_N is generic(n: positive := 5); port( clk, rst: in std_logic; d_in: in std_logic_vector(n-1 downto 0); d_out: out std_logic_vector(n-1 downto 0)); end entity REG_N; architecture BEHAVIORAL of REG_N is signal d_reg: signed(n-1 downto 0); begin STORE: process (clk, rst, d_in) is begin if (rst = '1') then d_reg <= conv_signed('0', n); elsif (falling_edge(clk)) then d_reg <= signed(d_in); end if; end process STORE; L1: d_out <= std_logic_vector(d_reg); end architecture BEHAVIORAL;
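The PPG case table above is a radix-4 (modified Booth) recoding of the multiplier: each overlapping 3-bit group of the 6-bit coefficient selects 0, ±a, or ±2a as a partial product, with spp supplying the +1 that completes the two's-complement negation. A small behavioural model of that recoding (Python, for illustration only; not part of the thesis sources) reproduces MULT's a×b over the full operand range:

```python
# Behavioural model of the radix-4 (modified Booth) recoding implemented by
# PPG/MULT: the 6-bit coefficient is scanned in three overlapping 3-bit groups,
# and each group selects 0, +a, +2a, -a or -2a as a partial product.
RECODE = {0: 0, 1: +1, 2: +1, 3: +2, 4: -2, 5: -1, 6: -1, 7: 0}

def booth_radix4_mult(a, b, b_bits=6):
    """a: unsigned multiplicand; b: two's-complement multiplier, b_bits wide."""
    ext = (b & ((1 << b_bits) - 1)) << 1   # implicit '0' appended below bit 0,
                                           # as in "ls <= b(1 downto 0) & '0'"
    acc = 0
    for i in range(b_bits // 2):
        group = (ext >> (2 * i)) & 7       # bits (2i+1, 2i, 2i-1) of b
        acc += RECODE[group] * a * (4 ** i)
    return acc

# Exhaustive check over MULT's operand ranges (8-bit pixel, 6-bit coefficient)
assert all(booth_radix4_mult(a, b) == a * b
           for a in range(256) for b in range(-32, 32))
```

Halving the number of partial products (three groups instead of six bit-serial terms) is what allows MULT to reduce them with just the two 3-to-2 counter levels (S3_2C, R3_2C) before the final CLA_14.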
Appendix B
VHDL codes, C++ source codes and Script file for Post-Synthesis simulation
Adders C++ source code // This program generates all possible inputs to the Adders // with a selectable step (increment) #include <iostream.h> #include <iomanip.h> #include <fstream.h> int main() { ofstream out_file1, out_file2, out_file3; out_file1.open("v_a.dat"); out_file2.open("v_b.dat"); out_file3.open("v_ans.dat"); int time, delay, a, b, choice; int lo, hi, step; time = 20; delay = 0; cout << "Please enter the selection by number:" << endl; cout << "-------------------------------------" << endl; cout << "(1) CLA 14" << endl; cout << "(2) CLA 15" << endl; cout << "(3) CLA 16" << endl; cout << "(4) CLA 17" << endl; cout << "(5) CLA 19" << endl; cin >> choice; cout << endl << "Please enter the step: "; cin >> step; switch (choice) { case 1: lo = -8192; hi = 8192; break; case 2: lo = -16384; hi = 16384; break; case 3: lo = -32768; hi = 32768; break; case 4: lo = -65536; hi = 65536; break; case 5: lo = -262144; hi = 262144; break; default: lo = -8192;
hi = 8192; break; } out_file1 << "@" << 0 << "ns=" << 0 << "\\H +" << endl; out_file2 << "@" << 0 << "ns=" << 0 << "\\H +" << endl; out_file3 << "@" << 0 << "ns=" << 0 << "\\H +" << endl; for (a=lo; a<hi; a+=step) { out_file1 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << a << "\\H +" << endl; for (b=lo; b<hi; b+=step) { out_file2 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << b << "\\H +" << endl; out_file3 << setiosflags(ios::uppercase) << "@" << dec << (time + delay) << "ns=" << hex << (a + b) << "\\H +" << endl; time = time + 20; } } out_file1 << "@" << dec << time << "ns=" << 0 << "\\H" << endl; out_file2 << "@" << dec << time << "ns=" << 0 << "\\H" << endl; out_file3 << "@" << dec << (time + delay) << "ns=" << 0 << "\\H" << endl; out_file1.close(); out_file2.close(); out_file3.close(); return 0; } VHDL file for 14-bit CLA Testbench library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity TB_CLA14 is port( a, b: in std_logic_vector(13 downto 0); ans: in std_logic_vector(13 downto 0); t: inout std_logic_vector(13 downto 0); err: out std_logic ); end entity TB_CLA14; architecture BEHAV of TB_CLA14 is component CLA_14 is port( a, b: in std_logic_vector(13 downto 0); s: out std_logic_vector(13 downto 0)); end component CLA_14; begin U1: CLA_14 port map(a=>a, b=>b, s=>t); COMP: process(t, ans) is begin if (t = ans) then err <= '0'; else err <= '1'; end if; end process COMP; end architecture BEHAV;
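The four-bit adder slices used throughout the library all follow the same generate/propagate scheme: g = a AND b and p = a XOR b per bit (pfa.vhd), with the carries expanded by the lookahead equations in cll.vhd. A bit-level model of one 4-bit slice (Python, illustrative; not part of the thesis sources) confirms that the expanded carry equations reproduce ordinary addition:

```python
# Bit-level model of one 4-bit carry-lookahead slice: per-bit generate
# g = a AND b and propagate p = a XOR b (as in pfa.vhd), with the carry
# equations expanded exactly as in cll.vhd.
def cla4(a, b, cin=0):
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
    c = [cin, 0, 0, 0, 0]
    c[1] = g[0] | (p[0] & c[0])
    c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0])
    c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c[0])
    c[4] = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
            | (p[3] & p[2] & p[1] & p[0] & c[0]))
    s = sum((p[i] ^ c[i]) << i for i in range(4))  # PFA: s = p XOR carry-in
    return s, c[4]

# The lookahead equations reproduce ordinary 4-bit addition exhaustively.
assert all(cla4(a, b, c) == ((a + b + c) & 0xF, (a + b + c) >> 4)
           for a in range(16) for b in range(16) for c in (0, 1))
```

Because every carry is a two-level AND-OR function of the inputs, no carry ripples through the slice; the second-level logic (CLL_2, CLL_2L) applies the same idea across the four slices of the 14- to 17-bit adders.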
VHDL file for 15-bit CLA Testbench library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity TB_CLA15 is port( a, b: in std_logic_vector(14 downto 0); ans: in std_logic_vector(14 downto 0); t: inout std_logic_vector(14 downto 0); err: out std_logic ); end entity TB_CLA15; architecture BEHAV of TB_CLA15 is component CLA_15 is port( a, b: in std_logic_vector(14 downto 0); s: out std_logic_vector(14 downto 0)); end component CLA_15; begin U1: CLA_15 port map(a=>a, b=>b, s=>t); COMP: process(t, ans) is begin if (t = ans) then err <= '0'; else err <= '1'; end if; end process COMP; end architecture BEHAV;
VHDL file for 16-bit CLA Testbench library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity TB_CLA16 is port( a, b: in std_logic_vector(15 downto 0); ans: in std_logic_vector(15 downto 0); t: inout std_logic_vector(15 downto 0); err: out std_logic ); end entity TB_CLA16; architecture BEHAV of TB_CLA16 is component CLA_16 is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(15 downto 0)); end component CLA_16; begin U1: CLA_16 port map(a=>a, b=>b, s=>t); COMP: process(t, ans) is begin if (t = ans) then err <= '0'; else err <= '1'; end if; end process COMP;
end architecture BEHAV;
VHDL file for 17-bit CLA Testbench library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity TB_CLA17 is port( a, b: in std_logic_vector(16 downto 0); ans: in std_logic_vector(16 downto 0); t: inout std_logic_vector(16 downto 0); err: out std_logic ); end entity TB_CLA17; architecture BEHAV of TB_CLA17 is component CLA_17 is port( a, b: in std_logic_vector(16 downto 0); s: out std_logic_vector(16 downto 0)); end component CLA_17; begin U1: CLA_17 port map(a=>a, b=>b, s=>t); COMP: process(t, ans) is begin if (t = ans) then err <= '0'; else err <= '1'; end if; end process COMP; end architecture BEHAV;
VHDL file for 19-bit CLA Testbench -- tb_cla_19.vhd library IEEE, BFULIB; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; use BFULIB.bfu_pckg.all; entity TB_CLA19 is port( a, b: in std_logic_vector(18 downto 0); ans: in std_logic_vector(18 downto 0); t: inout std_logic_vector(18 downto 0); ovf: out std_logic; err: out std_logic ); end entity TB_CLA19; architecture BEHAV of TB_CLA19 is begin U1: CLA_19 port map(a=>a, b=>b, s=>t, ovf=>ovf); COMP: process(t, ans) is begin if (t = ans) then err <= '0'; else err <= '1'; end if; end process COMP; end architecture BEHAV;
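CLA_19, the adder exercised by this last testbench, computes its top three bits twice, once for each possible carry out of the low 16-bit addition, and selects the correct result with cout (a carry-select arrangement); overflow is flagged as the XOR of the carries into and out of the sign bit, matching c(2) xor c(3) in cla_3s.vhd. A behavioural model of that scheme (Python, illustrative; not part of the thesis sources):

```python
# Behavioural model of cla_19.vhd: the low 16 bits go through CLA_17 (whose
# bit 16 is the carry out); the top 3 bits are computed for both possible
# carry-ins and selected by cout (carry-select).  Overflow is the XOR of the
# carries into and out of the sign bit, as in cla_3s.vhd (c(2) xor c(3)).
def add3_with_ovf(a3, b3, cin):
    """Add two 3-bit values with carry-in; return (sum mod 8, overflow flag)."""
    c, s, carries = cin, 0, []
    for i in range(3):
        ai, bi = (a3 >> i) & 1, (b3 >> i) & 1
        s |= (ai ^ bi ^ c) << i
        c = (ai & bi) | (ai & c) | (bi & c)
        carries.append(c)
    return s, carries[1] ^ carries[2]      # carry into MSB xor carry out of MSB

def cla19(a, b):
    """Add two 19-bit two's-complement bit patterns; return (sum, overflow)."""
    lo = (a & 0xFFFF) + (b & 0xFFFF)
    cout = lo >> 16
    s1, ovf1 = add3_with_ovf((a >> 16) & 7, (b >> 16) & 7, 1)
    s0, ovf0 = add3_with_ovf((a >> 16) & 7, (b >> 16) & 7, 0)
    hi, ovf = (s1, ovf1) if cout else (s0, ovf0)
    return (hi << 16) | (lo & 0xFFFF), ovf
```

For example, adding 200000 to itself exceeds the 19-bit signed range (+262143) and raises the overflow flag, which is exactly the condition the testbench's ovf port exposes.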
Multiplication Unit C++ source code // This program generates all possible inputs to the Multiplication // unit and the correct output results corresponding to all the inputs. #include <iostream.h> #include <iomanip.h> #include <fstream.h> int main() { ofstream out_file1, out_file2, out_file3; out_file1.open("coef.dat"); out_file2.open("mag.dat"); out_file3.open("x_ans.dat"); int delay, time, m, n;
time = 0; delay = 40;
out_file1 << "@" << time << "ns=" << 0 << "\\H +" << endl; out_file2 << "@" << time << "ns=" << 0 << "\\H +" << endl; out_file3 << "@" << time << "ns=" << 0 << "\\H +" << endl;
for (m=-32; m<32; m++) { out_file1 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << m << "\\H +" << endl; for (n=0; n<256; n++) { out_file2 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << n << "\\H +" << endl; out_file3 << setiosflags(ios::uppercase) << "@" << dec << time + delay << "ns=" << hex << (m*n) << "\\H +" << endl; time = time + 20; }
}
out_file1 << "@" << dec << time << "ns=" << 0 << "\\H" << endl; out_file2 << "@" << dec << time << "ns=" << 0 << "\\H" << endl; out_file3 << "@" << dec << time + delay << "ns=" << 0 << "\\H" << endl; out_file1.close(); out_file2.close(); out_file3.close(); return 0; } VHDL code for Multiplication Unit testbench -- testbench for multiplier (tb_mult.vhd) library IEEE, BFULIB; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; use BFULIB.bfu_pckg.all; entity TB_MULT is port( clk, rst: in std_logic; coef: in std_logic_vector(5 downto 0); mag: in std_logic_vector(7 downto 0); pro: in std_logic_vector(13 downto 0); p: out std_logic_vector(13 downto 0); err: out std_logic ); end entity TB_MULT; architecture STRUCT of TB_MULT is signal product: std_logic_vector(13 downto 0);
begin M1: MULT port map(a=>mag, b=>coef, p=>product); L1: p <= product; COMP: process(clk, rst) is begin if (rst = '1') then err <= '0'; elsif (clk'event and clk = '1') then if (product = pro) then err <= '0'; else err <= '1'; end if; end if;
end process COMP; end architecture STRUCT; Script File | The file has been automatically generated by | the Script Editor File Wizard version 2.0.1.89 | | Copyright © 1998 Aldec, Inc. | Initial settings delete_signals set_mode functional restart stepsize 10 ns | Vector Definitions | | Add your vector definition commands here vector coef coef5 coef4 coef3 coef2 coef1 coef0 radix hex coef vector mag mag7 mag6 mag5 mag4 mag3 mag2 mag1 mag0 radix hex mag vector product p[13:0] radix hex product vector t_ans pro[13:0] radix hex t_ans | Watched Signals and Vectors
| Define your signal and vector watch list here watch coef mag product err | Stimulators Assignment | | Select and/or define your own stimulators | and assign them to the selected signals wfm coef < coef.dat wfm mag < mag.dat wfm t_ans < x_ans.dat | Set Breakpoint Conditions | | Define breakpoint conditions and | breakpoint actions for selected signals here break err 1-0 do (print err.out)
| Perform Simulation
|
| Run simulation for a selected number of
| clock cycles or a time range
sim
Appendix C
C++ Source Codes for Programs Used During Post-Implementation Simulation Program 1 (Input Image Plane and Output Image Planes generator) #include <iostream.h> #include <stdlib.h> #include <iomanip.h> #include <fstream.h> int main() { ifstream in_file1; ofstream out_file1, out_file2; in_file1.open("coef.txt"); out_file1.open("v_input.dat"); out_file2.open("input_mag.txt");
int i_mag[5][60];
int row, col, nfc; int a, b, k, m, n, mag;
int fc[3][5][5]; unsigned int seed; cout << "Seed number: "; cin >> seed; in_file1 >> nfc; row = 5; col = 60; k = 0; // Reading in the FC planes in coef.txt file while (k < nfc) { for (a = 0; a < 5; a++) { for (b = 0; b < 5; b++) { in_file1 >> mag; fc[k][a][b] = mag; } } k++; } in_file1.close(); // Generate Randomized Input Image plane with rand function srand(seed); out_file2 << "Input Image Plane (" << row << "x" << col << ")" << endl; out_file2 << "------------------------" << endl; for (a = 0; a < row; a++) { for (b = 0; b < col; b++) { mag = 256; while (mag > 255) { mag = rand(); } i_mag[a][b] = mag; out_file2 << setiosflags(ios::uppercase) << setw(3) << hex << mag << " "; }
        out_file2 << endl;
    }

    // This segment of the code generates the input image plane for simulation
    for (a = 0; a < row; a++) {
        for (b = 0; b < col; b++) {
            out_file1 << setiosflags(ios::uppercase) << "assign input " << hex << i_mag[a][b] << "\\h" << endl;
            out_file1 << "cycle 1" << endl;
        }
        if (a < 2) {
            out_file1 << "cycle 1" << endl;
        } else {
            out_file1 << "cycle 2" << endl;
        }
    }

    // Generate Expected Output
    for (k = 0; k < nfc; k++) {
        out_file2 << endl << endl << "Output Image Plane " << k+1 << endl
                  << "--------------------" << endl;
        for (a = 0; a < row; a++) {
            for (b = 0; b < col; b++) {
                mag = 0;
                for (m = 0; m < 5; m++) {
                    for (n = 0; n < 5; n++) {
                        if (!(((a-m+2) < 0) | ((a-m+2) >= row) | ((b-n+2) < 0) | ((b-n+2) >= col)))
                            mag = mag + (i_mag[a-m+2][b-n+2]*fc[k][m][n]);
                    }
                }
                mag = mag & 0x7FFFF;
                out_file2 << setw(5) << hex << setiosflags(ios::uppercase) << mag << " ";
            }
            out_file2 << endl;
        }
    }
    out_file1.close();
    out_file2.close();
    return 0;
}

Program 2 (To generate test vectors that will program the FCs into each MAU)

#include <iostream.h>
#include <iomanip.h>
#include <fstream.h>

int main()
{
    ifstream in_file1;
    ofstream out_file1, out_file2, out_file3;
    in_file1.open("coef.txt");      //Filter Coefficient file
    out_file1.open("v_coef.dat");
    out_file2.open("v_c_reg.dat");
    out_file3.open("v_au_sel.dat");
int array[5][5]; int time, count, temp, a, b, k, nfc; time = 40; count = 1; k = 1; in_file1 >> nfc; out_file1 << "@" << 0 << "ns=" << 0 << "\\H +" << endl; out_file2 << "@" << 0 << "ns=" << 0 << "\\H +" << endl; out_file3 << "@" << 0 << "ns=" << 0 << "\\H +" << endl; while (k-1 < nfc) { out_file3 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << k << "\\H +" << endl; for (a=4; a>=0; a--) { for (b=0; b<5; b++) { in_file1 >> temp; cout << temp << endl; array[b][a] = temp; } } for (a=0; a<5; a++) { for (b=0; b<5; b++) { out_file1 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << array[a][b] << "\\H +" << endl; out_file2 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << count << "\\H +" << endl; time += 80; count++; } } count = 1; k++; } out_file1 << "@" << dec << time << "ns=" << 0 << "\\H" << endl; out_file2 << "@" << dec << time << "ns=" << 0 << "\\H" << endl; out_file3 << "@" << dec << time << "ns=" << 0 << "\\H" << endl; in_file1.close(); out_file1.close(); out_file2.close(); out_file3.close(); return 0; }
Appendix D
C++ Source Codes for Programs Used During Hardware Prototype Implementation

Program 1 (This is the program responsible for sending the FC values)

//This is the driver for system without FIFO
#include <iostream.h>
#include <stdlib.h>
#include <conio.h>
#include <iomanip.h>
#include <fstream.h>

#define DATA 0x0378
#define STATUS DATA+1
#define CONTROL DATA+2

void delay(int);
void sentdata(int &);

main()
{
    int reg_coef[3][25];
    int reg_cfg[3][25];
    int fc[3][5][5];
    int k, nfc, a, b, count, wait, d, o_sel, o_cfg;

    /* Reading the Filter Coefficient from the coef.txt */
    ifstream in_file;
    in_file.open("coef.txt");   // Open the coef.txt file
    in_file >> nfc >> o_sel;    // Read in the number of FC planes
o_cfg = 0x80; /* Make Sure Parallel Port is in forward mode and set strobe */ _outp(CONTROL, _inp(CONTROL) & 0xDE); /* Make Sure the write enable (ppc(3)) is at low */ _outp(CONTROL, _inp(CONTROL) | 0x08); k = 1; count = 1; while (k-1 < nfc) // Repeat for the number of plane indicated { for (a=4; a>=0; a--) { for (b=0; b<5; b++) { in_file >> fc[k-1][b][a]; // Reading in the filter coefficients } // in the arrangement of FC in the AU } for (a=0; a<5; a++) { for (b=0; b<5; b++) { reg_coef[k-1][count-1] = fc[k-1][a][b] & 0x3F; reg_cfg[k-1][count-1] = ((k & 0x03) << 5); reg_cfg[k-1][count-1] = reg_cfg[k-1][count-1] | (count & 0x1F); cout << hex << reg_cfg[k-1][count-1] << " " << reg_coef[k-1][count-1] << endl; count++; } } count = 1; k++; } k = 1; while (k-1 < nfc) { a = 1;
while(a<26) { while(((_inp(STATUS) & 0x08) == 8)) //Detect high on pps(3) { _outp(CONTROL, _inp(CONTROL) & 0xF7); //Assert ppc(3) while(!((_inp(STATUS) & 0x20) == 32)){} //Detect high on pps(5) sentdata(reg_coef[k-1][a-1]); while(!((_inp(STATUS) & 0x40) == 64)){} //Detect high on pps(6) sentdata(reg_cfg[k-1][a-1]); a++; while((_inp(STATUS) & 0x40) == 64){} //Detect high on pps(6) } } k++; } //Configuration of the MAUs are done _outp(CONTROL, _inp(CONTROL) | 0x08); //Desert ppc(3) k = 0; while (k<1) { //Program output selection according to o_sel read in while(((_inp(STATUS) & 0x08) == 8)) //Detect high on pps(3) { _outp(CONTROL, _inp(CONTROL) & 0xF7); //Assert ppc(3) while(!((_inp(STATUS) & 0x20) == 32)){} //Detect high on pps(5) sentdata(o_sel); while(!((_inp(STATUS) & 0x40) == 64)){} //Detect high on pps(6) sentdata(o_cfg); while((_inp(STATUS) & 0x40) == 64){} //Detect high on pps(6) k++; } } _outp(CONTROL, _inp(CONTROL) | 0x08); //Desert ppc(3) cout << "Configuration done!" << endl; in_file.close(); //Close coef.txt // Programming of the Filter Coefficient is done // exit(1); // This section starts sending input datas to the system ifstream in_file1; in_file1.open("input.txt"); //Open the input data file wait = 1; while (wait==1) { if ((_inp(STATUS) & 0x08) == 0) //Detect low on pps(3) { //Run the following segment of code if pps(3)==0 while(!((_inp(STATUS) & 0x08) == 8)) //Detect high on pps(3) { //Run the following segment of code if pps(4)==1 if ((_inp(STATUS) & 0x10) == 16) //Detect high on pps(4) { in_file1 >> d; //Read in the data from file sentdata(d); while((_inp(STATUS) & 0x08) == 0) //Detect high on pps(3) {} } } } } return 0; }; //end of main void sentdata(int &c) { cout << c << " sent..." << endl; _outp(DATA, c^0x03); /* sending the data with the two LSB toggled */ _outp(CONTROL, _inp(CONTROL) & 0xFB); /* set strobe ~ one to zero */ delay(1000); _outp(CONTROL, _inp(CONTROL) | 0x04); /* reset strobe ~ zero to one */
    delay(1000);
};

/* A function to create delay */
void delay(int k)
{
    int i;
    for (i=0; i<=k; i++){}
};

Program 2 (This is the program that generates the VHDL file for the internal RAM holding the input image pixels)

#include <iostream.h>
#include <fstream.h>
#include <iomanip.h>
#include <stdlib.h>
main()
{
    int a, b, mag, row, col, i, j, k;
    int i_mag[5][62];
    int temp[32];
    unsigned int seed;
    ofstream outfile("input_ram.vhd", ios::out);

    row = 5;
    col = 62;
    cout << "Seed number: ";
    cin >> seed;
    srand(seed);
    for (a = 0; a < row; a++) {
        for (b = 0; b < (col-2); b++) {
            mag = 256;
            while (mag > 255) {
                mag = rand();
            }
            i_mag[a][b] = mag;
        }
        i_mag[a][col-1] = 0;
        i_mag[a][col-2] = 0;
    }
    outfile << "library IEEE;\n"
            << "use IEEE.std_logic_1164.all;\n"
            << "use IEEE.numeric_std.all;\n"
            << "\nentity IN_RAM is\n"
            << " port( clk: in std_logic;\n"
            << " rst: in std_logic;\n"
            << " req: in std_logic;\n"
            << " dout: out std_logic_vector(7 downto 0) );\n"
            << "end entity IN_RAM;\n\n";
    outfile << "architecture STRUCT of IN_RAM is\n\n"
            << " component RAMB4_S8 is\n"
            << " port( DI: in std_logic_vector(7 downto 0);\n"
            << " EN: in std_logic;\n"
            << " WE: in std_logic;\n"
            << " RST: in std_logic;\n"
            << " CLK: in std_logic;\n"
            << " ADDR: in std_logic_vector(8 downto 0);\n"
            << " DO: out std_logic_vector(7 downto 0) );\n"
            << " end component RAMB4_S8;\n\n";

    a = b = j = k = 0;
    while (k < (row*col)) {
        i = 0;
        while (i < 32) {
            if (a < 5) {
                temp[31-i] = i_mag[a][b];
            } else {
                temp[31-i] = 0;
            }
            if (b == (col-1)) {
                b = 0;
                a++;
                k++;
                i++;
            } else {
                k++;
                b++;
                i++;
            }
        }
        outfile << " attribute INIT_0" << setw(1) << hex << j << ": string;\n"
                << " attribute INIT_0" << setw(1) << hex << j << " of IRM: label is \"";
        for (i=0; i<32; i++) {
            if (temp[i] < 16) {
                outfile << "0" << setw(1) << hex << temp[i];
            } else {
                outfile << setw(2) << hex << temp[i];
            }
        }
        outfile << "\";\n";
        j++;
    }
    outfile << "\n signal din : std_logic_vector(7 downto 0);\n"
            << " signal addr: unsigned(8 downto 0);\n"
            << " signal adr : std_logic_vector(8 downto 0);\n"
            << " signal en : std_logic;\n"
            << " signal we : std_logic;\n";
    outfile << "\n begin\n\n"
            << " L1: din <= (others=>'0');\n"
            << " L2: en <= '1';\n"
            << " L3: we <= '0';\n"
            << " L4: adr <= std_logic_vector(addr);\n\n";
    outfile << " P1: process(clk, rst, req) is\n"
            << " begin\n"
            << " if (rst = '1') then\n"
            << " addr <= (others=>'0');\n"
            << " elsif (clk'event and clk = '1') then\n"
            << " if (req = '1') then\n"
            << " addr <= addr + 1;\n"
            << " end if;\n"
            << " end if;\n"
            << " end process P1;\n\n";
    outfile << " IRM: RAMB4_S8 port map(DI=>din, EN=>en, WE=>we, RST=>rst, CLK=>clk, \n"
<< " ADDR=>adr, DO=>dout);\n";
outfile << "\nend architecture STRUCT;\n"; outfile.close(); return 0; }
Program 3 (Test Program: this program is responsible for comparing the uploaded output with the theoretically correct outputs)

#include <iostream.h>
#include <stdlib.h>
#include <iomanip.h>
#include <fstream.h>

void hex2file(ofstream &, int);

int main()
{
    ifstream in_file1;
    ofstream out_file1, out_file2;
    in_file1.open("coef.txt");
    out_file1.open("input.txt");
    out_file2.open("exp_res.txt");
int row, col, nfc; int a, b, k, m, n, mag, o_sel; int i_mag[5][60]; int fc[3][5][5]; unsigned int seed; cout << "Seed number: "; cin >> seed; in_file1 >> nfc >> o_sel; row = 5; col = 60; k = 0; // Reading in the FC planes in coef.txt file while (k < nfc) { for (a = 0; a < 5; a++) { for (b = 0; b < 5; b++) { in_file1 >> mag; fc[k][a][b] = mag; } } k++; } in_file1.close(); // Generate Randomized Input Image plane with rand function srand(seed); out_file2 << "Input Image Plane (" << row << "x" << col << ") Seed: " << seed << endl; out_file2 << "---------------------------------" << endl; for (a = 0; a < row; a++) { for (b = 0; b < col; b++) { mag = 256; while (mag > 255) {
mag = rand(); } i_mag[a][b] = mag; out_file1 << dec << setw(3) << mag << " "; out_file2 << setiosflags(ios::uppercase) << setw(3) << hex << mag << " "; } out_file1 << endl; out_file2 << endl; } // Generate Expected Output for (k = 0; k < nfc; k++) { out_file2 << endl << endl << "Output Image Plane " << k+1 << endl << "--------------------" << endl; for (a = 0; a < row; a++) { for (b = 0; b < col; b++) { mag = 0; for (m = 0; m < 5; m++) { for (n = 0; n < 5; n++) { if (!(((a-m+2) < 0) | ((a-m+2) >= row) | ((b-n+2) < 0) | ((b-n+2) >= col))) mag = mag + (i_mag[a-m+2][b-n+2]*fc[k][m][n]); } } mag = mag & 0x7FFFF; hex2file(out_file2, mag); } out_file2 << endl; } } out_file1.close(); out_file2.close(); return 0; } void hex2file(ofstream &outfile, int mag) { int i, temp; for (i=0; i<5; i++) { temp = mag & 0xF0000; temp = temp >> 16; outfile << hex << setiosflags(ios::uppercase) << temp; mag = mag << 4; } outfile << " "; }
Appendix E
VHDL Files for Modules External to the Convolution Architecture
1. Top Level Description of the whole system (the convolution architecture is included) -- par2brd.vhd library IEEE, BRDMOD; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; use BRDMOD.brd_util.all; entity PAR2BRD is port( -- inputs from board sw1: in std_logic; -- start signal sw2: in std_logic; -- global reset sw3: in std_logic; -- mux select for internal clock clk: in std_logic; -- external clock -- from parallel port ppd: in std_logic_vector(7 downto 0); ppc: in std_logic_vector(3 downto 2); pps: out std_logic_vector(6 downto 3); -- output to external SRAM (right bank) cen_r : out std_logic; wen_r : out std_logic; oen_r : out std_logic; addr_r: out std_logic_vector(18 downto 0); data_r: out std_logic_vector(15 downto 0); -- output to external SRAM (left bank) cen_l : out std_logic; wen_l : out std_logic; oen_l : out std_logic; addr_l: out std_logic_vector(18 downto 0); data_l: out std_logic_vector(15 downto 0); -- output from the interface clk_led: out std_logic; -- sl: out std_logic_vector(6 downto 0); -- sr: out std_logic_vector(6 downto 0) ); done: out std_logic ); end entity PAR2BRD; architecture STRUCT of PAR2BRD is component SYS is port( clk, rst, str: in std_logic; d_in: in std_logic_vector(7 downto 0); --(FIFO -> DM_IF) coef: in std_logic_vector(5 downto 0); --(FCs from parallel port) ld_reg: in std_logic_vector(4 downto 0); --(MAUs select from pp) au_sel: in std_logic_vector(1 downto 0); --(AU select from pp) o_sel: in std_logic; --(Output config from pp) req: out std_logic; --(Controller -> FIFO) sram_w: out std_logic; --(SYS -> SRAM) d_out: out std_logic_vector(18 downto 0) ); end component SYS; component IN_RAM is port( clk: in std_logic; rst: in std_logic; req: in std_logic; dout: out std_logic_vector(7 downto 0) ); end component IN_RAM; component IBUF is port( i: in std_logic;
o: out std_logic ); end component IBUF; component BUFG is port( i: in std_logic; o: out std_logic ); end component BUFG; -- These signals are from parallel port interface signal d_clk, strobe, strobe_b: std_logic; signal nsw1, nsw2, nsw3 : std_logic; -- Internal connection signals signal req: std_logic; --(SYS -> IN_RAM) signal v_d : std_logic_vector(7 downto 0); --(PINTFC -> REG_A) signal v_t : std_logic_vector(18 downto 0); --(SYS -> SRAM) signal v_in : std_logic_vector(7 downto 0); --(IN_RAM -> SYS) -- signal v_led : std_logic_vector(7 downto 0); --(MUX -> SVNSEG)
    signal c_out : std_logic_vector(5 downto 0);  --(coefficient output from FC_MOD)
    signal cf_out: std_logic_vector(7 downto 0);  --(MAUs configuration output from FC_MOD)
    signal sram_w: std_logic;                     --(SYS -> OUT_RAM)
    signal cen : std_logic;
    signal wen : std_logic;
    signal oen : std_logic;
    signal data : std_logic_vector(18 downto 0);
    signal addr : std_logic_vector(18 downto 0);
    -- Clock selection signal
    signal c_sel : std_logic;
    signal p_clk : std_logic;  -- filter coefficients programming clk
begin
    -- External strobe buffering and padding
    B1: IBUF port map(i=>ppc(2), o=>strobe_b);
    B2: BUFG port map(i=>strobe_b, o=>strobe);
    -- Inverting the logic level of the push buttons.
    S1: nsw1 <= not sw1;
    S2: nsw2 <= not sw2;
    S3: nsw3 <= not sw3;
    -- Clock counter to reduce the clock frequency of the external clock
    C1: C_CNTR generic map(n=>12500) port map(clk=>clk, rst=>nsw2, co=>p_clk);
    -- First In First Out queue after the parallel port
    -- F1: FIFO port map(rst=>nsw2, r_clk=>d_clk, r_en=>req, w_clk=>strobe, w_en=>ppc(3),
    --                   din=>ppd, dout=>v_d, empty=>pps(3));
    -- This parallel port interface is aimed to replace the FIFO queue
    P1: PINTFC port map(clk=>strobe, rst=>nsw2, ppd=>ppd, d_out=>v_d);
    -- Drivers to the two seven segment LEDs
    -- SV1: SVNSG port map(ldg=>v_led(7 downto 4), rdg=>v_led(3 downto 0), sl=>sl, sr=>sr);
    -- Filter Coefficient Programming Module
    FC1: FC_MOD port map(clk=>d_clk, rst=>nsw2, ppc=>ppc(3), ppd=>v_d, pps1=>pps(5),
                         pps2=>pps(6), coef_out=>c_out, cfg_out=>cf_out);
    -- SRAM Interface module (Responsible for writing output pixels to the external SRAM)
    SRM: OUT_RAM port map(clk=>d_clk, rst=>nsw2, w=>sram_w, d_in=>v_t,
                          cen=>cen, wen=>wen, oen=>oen, addr=>addr, data=>data);
    L1 : cen_l <= cen;
    L2 : cen_r <= cen;
    L3 : wen_l <= wen;
    L4 : wen_r <= wen;
    L5 : oen_l <= oen;
    L6 : oen_r <= oen;
    L7 : addr_l <= addr;
    L8 : addr_r <= addr;
    L9 : data_l <= "0000000000000" & data(18 downto 16);
    L10: data_r <= data(15 downto 0);
-- Clock LED on the bar(6) LED L11: pps(3) <= d_clk; L12: done <= sram_w; -- MUX Select for SVNSEG LEDs display -- MUX: v_led <= v_t(7 downto 0) when nsw3 = '0' else v_t(15 downto 8); -- Convolution System (req is replaced by pps(4) to the parallel port pin) -- (o_sel is the output config pin to select which output plane to output to svnseg) U0: SYS port map(clk=>d_clk, rst=>nsw2, str=>nsw1, coef=>c_out, ld_reg=>cf_out(4 downto 0), au_sel=>cf_out(6 downto 5), o_sel=>cf_out(7), req=>req, d_in=>v_in, sram_w=>sram_w, d_out=>v_t); -- Input RAM to provide input image pixels to the convolution system U1: IN_RAM port map(clk=>d_clk, rst=>nsw2, req=>req, dout=>v_in); -- MUX select for the internal operation clock by SW3 -- SELP: process (nsw3, nsw2) is -- begin -- if (nsw2 = '1') then -- c_sel <= '0'; -- elsif (nsw3 = '1') then -- c_sel <= '1'; -- end if; -- end process SELP; -- assign the clock as indicated by the c_sel signal L13: d_clk <= p_clk when nsw3 = '0' else clk; L14: clk_led <= c_sel; end architecture STRUCT; 2. Block RAM module (initialized with input image plane) library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity IN_RAM is port( clk: in std_logic; rst: in std_logic; req: in std_logic; dout: out std_logic_vector(7 downto 0) ); end entity IN_RAM; architecture STRUCT of IN_RAM is component RAMB4_S8 is port( DI: in std_logic_vector(7 downto 0); EN: in std_logic; WE: in std_logic; RST: in std_logic; CLK: in std_logic; ADDR: in std_logic_vector(8 downto 0); DO: out std_logic_vector(7 downto 0) ); end component RAMB4_S8;
    attribute INIT_00: string;
    attribute INIT_00 of IRM: label is "b127037aa60fa2fcd6d2ecfba0b29b472f795754f8c707915e72910a387ba32d";
    attribute INIT_01: string;
    attribute INIT_01 of IRM: label is "54c7000034647cf42036263aea806ef629ca8017f15167ae072fa406aa3c5f8e";
    attribute INIT_02: string;
    attribute INIT_02 of IRM: label is "29e3340c8d55d63104709180640dde6598abfa87834f3f8b862521afb27b05da";
    attribute INIT_03: string;
    attribute INIT_03 of IRM: label is "8c059a5e0000d10f29ef58e2343094fdf00c4875d7132b775a9d90eb0a2af23a";
    attribute INIT_04: string;
    attribute INIT_04 of IRM: label is "f4ffbba026cb9a95e76073e78cf8bb71ec7adb544571aba14108426ce0b65b48";
    attribute INIT_05: string;
    attribute INIT_05 of IRM: label is "562f3c0012a50000c449b579657f3430b517d329250b06e193764decbb8d2fae";
attribute INIT_06: string; attribute INIT_06 of IRM: label is "0e1445292bf6efc7ca1a168a0b28ac6c95e4c5e1eafa97d4a739e68847e36db2"; attribute INIT_07: string; attribute INIT_07 of IRM: label is "766d9cd4a6f5a327000025adece722128cec7c0d9ec1ecf1cb50f064f402ac29"; attribute INIT_08: string; attribute INIT_08 of IRM: label is "48acd73b6335198a049f8709c6e40466c6c633d89e44e024272e09b37771f914"; attribute INIT_09: string; attribute INIT_09 of IRM: label is "0000000000000000000000003f1b75a632e7d585d0d36448ec00ce97f55e0ad8"; signal din : std_logic_vector(7 downto 0); signal addr: unsigned(8 downto 0); signal adr : std_logic_vector(8 downto 0); signal en : std_logic; signal we : std_logic; begin L1: din <= (others=>'0'); L2: en <= '1'; L3: we <= '0'; L4: adr <= std_logic_vector(addr); P1: process(clk, rst, req) is begin if (rst = '1') then addr <= (others=>'0'); elsif (clk'event and clk = '1') then if (req = '1') then addr <= addr + 1; end if; end if; end process P1; IRM: RAMB4_S8 port map(DI=>din, EN=>en, WE=>we, RST=>rst, CLK=>clk, ADDR=>adr, DO=>dout); end architecture STRUCT; 3. 
FC Programming module -- fc_pg_mod.vhd (Filter Coefficient Programming Module) library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity FC_MOD is port( clk, rst: in std_logic; ppc: in std_logic; -- Control Pin from parallel port ppd: in std_logic_vector(7 downto 0); -- Input data from parallel port pps1: out std_logic; -- Status pin for request data pps2: out std_logic; -- Status pin for request cfg coef_out: out std_logic_vector(5 downto 0); -- Filter Coefficients to program cfg_out: out std_logic_vector(7 downto 0) ); -- MAUs configuration signals end entity FC_MOD; architecture STRUCTURAL of FC_MOD is component FC_REG is port( clk, rst: in std_logic; rec_0: in std_logic; rec_1: in std_logic; prog: in std_logic; d_in: in std_logic_vector(7 downto 0); -- Input data from the parallel port coef_out: out std_logic_vector(5 downto 0); -- Filter Coefficients to program cfg_out: out std_logic_vector(7 downto 0) ); -- MAUs configuration signals end component FC_REG; component FC_FSM is port( clk, rst: in std_logic;
ctr_pin: in std_logic; rec_0: out std_logic; -- receive_data state enable pin rec_1: out std_logic; -- receive_config state enable pin prog: out std_logic ); -- program state enable pin end component FC_FSM; signal rec_0, rec_1, prog: std_logic; begin FSM: FC_FSM port map(clk=>clk, rst=>rst, ctr_pin=>ppc, rec_0=>rec_0, rec_1=>rec_1, prog=>prog); FCG: FC_REG port map(clk=>clk, rst=>rst, rec_0=>rec_0, rec_1=>rec_1, prog=>prog, d_in=>ppd, coef_out=>coef_out, cfg_out=>cfg_out); -- Status pins to the parallel port O1: pps1 <= rec_0; O2: pps2 <= rec_1; end architecture STRUCTURAL; -- fc_pg_fsm.vhd (Filter Coefficient Programming Finite State Machine) library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity FC_FSM is port( clk, rst: in std_logic; ctr_pin: in std_logic; rec_0: out std_logic; -- receive_data state enable pin rec_1: out std_logic; -- receive_config state enable pin prog: out std_logic ); -- program state enable pin end entity FC_FSM; architecture BEHAVIORAL of FC_FSM is type states is (idle, receive_data, receive_config, program); signal c_state: states; -- Current State signal n_state: states; -- Next State begin NST_PROC: process(c_state, ctr_pin) is begin case c_state is when idle => if (ctr_pin = '0') then n_state <= idle; else n_state <= receive_data; end if; when receive_data => n_state <= receive_config; when receive_config => n_state <= program; when program => if (ctr_pin = '0') then n_state <= idle; else n_state <= receive_data; end if; end case; end process NST_PROC; CST_PROC: process(clk, rst, n_state) is begin if(rst='1') then c_state <= idle;
elsif (clk'event and clk='0') then c_state <= n_state; end if; end process CST_PROC; OUT_PROC: process(c_state) is begin case c_state is when idle => rec_0 <= '0'; rec_1 <= '0'; prog <= '0'; when receive_data => rec_0 <= '1'; rec_1 <= '0'; prog <= '0'; when receive_config => rec_0 <= '0'; rec_1 <= '1'; prog <= '0'; when program => rec_0 <= '0'; rec_1 <= '0'; prog <= '1'; end case; end process OUT_PROC; end architecture BEHAVIORAL; -- fc_pg_reg.vhd (Filter COefficient Programming Registers) library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity FC_REG is port( clk, rst: in std_logic; rec_0: in std_logic; rec_1: in std_logic; prog: in std_logic; d_in: in std_logic_vector(7 downto 0); -- Input data from the parallel port coef_out: out std_logic_vector(5 downto 0); -- Filter Coefficients to program cfg_out: out std_logic_vector(7 downto 0) ); -- MAUs configuration signals end entity FC_REG; architecture BEHAVIORAL of FC_REG is signal d_reg: std_logic_vector(5 downto 0); signal c_reg: std_logic_vector(7 downto 0); begin -- Registers for storing Filter Coefficients REC_D: process(clk, rst, rec_0) is begin if (rst = '1') then d_reg <= (others=>'0'); elsif (clk'event and clk='1') then if (rec_0='1') then d_reg <= d_in(5 downto 0); end if; end if; end process REC_D; -- MAUs configuration signals REC_C: process(clk, rst, rec_1) is begin if (rst = '1') then
c_reg <= (others=>'0'); elsif (clk'event and clk='1') then if (rec_1='1') then c_reg <= d_in(7 downto 0); end if; end if; end process REC_C; -- Enable Output from the registers O1: coef_out <= d_reg when prog = '1' else (others=>'0'); O2: cfg_out <= c_reg when prog = '1' else (others=>'0'); end architecture BEHAVIORAL; 4. SRAM Driver -- out_ram.vhd (Output Ram for storing output pixels from the architecture) library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity OUT_RAM is port( clk, rst: in std_logic; w: in std_logic; -- read or write to SRAM d_in: in std_logic_vector(18 downto 0); -- Input data from architecture cen, wen: out std_logic; -- cen=chip enable, wen=write enable (both active low) oen: out std_logic; -- oen=out enable (active low) addr: out std_logic_vector(18 downto 0); -- SRAM address bus data: out std_logic_vector(18 downto 0) ); -- SRAM Data bus end entity OUT_RAM; architecture BEHAV of OUT_RAM is signal w_address: unsigned(18 downto 0); -- signal i_data : unsigned(7 downto 0); begin -- Asynchronous Reset and positive edge trigger events P1: process (clk, rst, w) is begin if (rst = '1') then wen <= '1'; oen <= '1'; addr <= (others => '0'); -- initial address during reset elsif (clk'event and clk = '1') then if (w = '1') then wen <= '0'; oen <= '1'; addr <= std_logic_vector(w_address); end if; end if; end process P1; -- Address counter P2: process(clk, rst, w) is begin if (rst = '1') then w_address <= (others => '0'); elsif (clk'event and clk = '1') then if (w = '1') then w_address <= w_address + 1; end if; end if;
end process P2; -- Chip enable signal L1: cen <= clk; -- Data Bus L2: data <= d_in; --std_logic_vector(i_data); -- L3: i_data <= unsigned(w_address(7 downto 0)) + 4; end architecture BEHAV; 5. Parallel Port Interface Module library IEEE; use IEEE.std_logic_1164.all; entity PINTFC is port( clk, rst: in std_logic; ppd: in std_logic_vector(7 downto 0); d_out: out std_logic_vector(7 downto 0) ); end entity PINTFC; architecture BEHAVIORAL of PINTFC is signal data: std_logic_vector(7 downto 0); begin REC: process(clk, rst) is begin if (rst = '1') then data <= "00000000"; elsif (clk'event and clk = '1') then data <= ppd; end if; end process REC; L1: d_out <= data; end architecture BEHAVIORAL;
References
[1] Bernard Bosi and Guy Bois, "Reconfigurable Pipelined 2-D Convolvers for Fast Digital Signal Processing", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 7, No. 3, pp. 299-308, Sep. 1999.
[2] Cheng-The Hsieh and Seung P. Kim, "A Highly-Modular Pipelined VLSI Architecture for 2-D FIR Digital Filter", Proceedings of the 1996 IEEE 39th Midwest Symposium on Circuits and Systems, Part 1, pp. 137-140, Aug. 1996.
[3] D. D. Haule and A. S. Malowany, "High-speed 2-D Hardware Convolution Architecture Based on VLSI Systolic Arrays", IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, pp. 52-55, Jun. 1989.
[4] D. Patterson and J. Hennessy, Computer Organization & Design: The Hardware/Software Interface, Morgan Kaufmann, 1994.
[5] GSI Technology, Product Datasheet, http://www.gsitechnology.com.
[6] H. T. Kung, "Why Systolic Architectures?", IEEE Computer, Vol. 15, pp. 37-46, Jan. 1982.
[7] Hyun Man Chang and Myung H. Sunwoo, "An Efficient Programmable 2-D Convolver Chip", Proceedings of the 1998 IEEE International Symposium on Circuits and Systems (ISCAS), Part 2, pp. 429-432, May 1998.
[8] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Second Edition, Morgan Kaufmann, 1996.
[9] K. Hsu, L. J. D'Luna, H. Yeh, W. A. Cook and G. W. Brown, "A Pipelined ASIC for Color Matrixing and Convolution", Proceedings of the 3rd Annual IEEE ASIC Seminar and Exhibit, Sep. 1990.
[10] Kai Hwang, Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, 1979.
[11] M. Morris Mano and Charles R. Kime, Logic and Computer Design Fundamentals, Prentice Hall, 1997.
[12] Michael J. Flynn and Stuart F. Oberman, Advanced Computer Arithmetic Design, Wiley-Interscience, 2001.
[13] O. L. MacSorley, "High-Speed Arithmetic in Binary Computers", Proceedings of the IRE, Vol. 49, pp. 67-91, Jan. 1961.
[14] V. Hecht, K. Rönner and P. Pirsch, "An Advanced Programmable 2D-Convolution Chip for Real Time Image Processing", Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1897-1900, 1991.
[15] Vijay K. Madisetti and Douglas B. Williams, The Digital Signal Processing Handbook, CRC Press and IEEE Press, 1998.
[16] Wayne Niblack, An Introduction to Digital Image Processing, Prentice/Hall International, 1986.
[17] Xess Co., XSV Board User Manual, http://www.xess.com/manuals/xsv-manual-v1_1.pdf
[18] Xilinx Co., Foundation 4.1i Software Manual, http://toolbox.xilinx.com/docsan/xilinx4/pdf/manuals.pdf
Vita
Albert Wong was born on January 1st 1975 in Sibu, Sarawak, Malaysia. He
attended SMB Methodist secondary school in Sibu and graduated in 1993. He obtained
his Bachelor of Science in Electrical Engineering Degree in May of 1998 from the
University of Kentucky, Lexington, Kentucky. He enrolled in the University of
Kentucky’s Graduate School in the fall semester of 1999.