TRANSCRIPT
ABSTRACT OF THESIS
A NEW SCALABLE SYSTOLIC ARRAY PROCESSOR ARCHITECTURE FOR DISCRETE CONVOLUTION
Two-dimensional discrete convolution is an essential operation in digital image processing. A targeted performance goal is the ability to simultaneously convolve an (i×j)-pixel input image plane with more than one Filter Coefficient Plane (FC) in a scalable manner. Assuming k FCs, each of size (n×n), an additional goal is that the system be able to output k convolved pixels each clock cycle. To achieve these performance goals, an architecture that utilizes a new systolic array arrangement is developed, and the final architecture design is captured in the VHDL hardware description language. The architecture is shown to be scalable when convolving multiple FCs with the same input image plane. The architecture design is validated, both functionally and for performance, through VHDL post-synthesis and post-implementation simulation testing. In addition, the design was implemented on a Field Programmable Gate Array (FPGA) experimental hardware prototype for further functional and performance testing and evaluation.

KEYWORDS: Systolic Array Processor, Discrete Convolution, Hardware Prototyping, Scalable Architecture, Parallel Architecture.
________________________
________________________
A NEW SCALABLE SYSTOLIC ARRAY PROCESSOR ARCHITECTURE FOR DISCRETE CONVOLUTION
By
Albert Tung-Hoe Wong
____________________________ Director of Thesis
____________________________ Director of Graduate Studies
____________________________
RULES FOR THE USE OF THESIS
Unpublished theses submitted for the Master's degree and deposited in the University of Kentucky Library are as a rule open for inspection, but are to be used only with due regard to the rights of the authors. Bibliographical references may be noted, but quotations or summaries of parts may be published only with permission of the author, and with the usual scholarly acknowledgements.

Extensive copying or publication of the thesis in whole or in part also requires the consent of the Dean of the Graduate School of the University of Kentucky.

A library that borrows this thesis for use by its patrons is expected to secure the signature of each user.

Name                                                                Date
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
THESIS
Albert Tung-Hoe Wong

The Graduate School
University of Kentucky

2003
A NEW SCALABLE SYSTOLIC ARRAY PROCESSOR ARCHITECTURE FOR DISCRETE CONVOLUTION
THESIS
A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical
Engineering in the College of Engineering at the University of Kentucky
By
Albert Tung-Hoe Wong
Lexington, Kentucky
Director: Dr. J. Robert Heath, Associate Professor of Electrical and Computer Engineering
Lexington, Kentucky
2003
MASTER’S THESIS RELEASE
I authorize the University of Kentucky Libraries to reproduce this thesis in
whole or in part for purposes of research.
Signed: _____________________________________ Date: _______________________________________
ACKNOWLEDGEMENTS

The following thesis, while an individual work, benefited from the insights and direction of several people. First, my Thesis Chair, Dr. J. Robert Heath, exemplifies the high-quality scholarship to which I aspire. In addition, Dr. Heath provided timely and instructive comments and evaluation at every stage of the thesis process, allowing me to complete this project. Next, I wish to thank the complete Thesis Committee: Dr. J. Robert Heath, Dr. Hank Dietz, and Dr. William R. Dieter. Each individual provided insights that guided and challenged my thinking, substantially improving the finished product. I would also like to thank Dr. Michael Lhamon of Lexmark Inc. for his technical insights, guidance, and comments.

In addition to the technical and instrumental assistance above, I received equally important assistance from family and friends. My wife, Sze Ying Ng, provided ongoing support throughout the thesis process.
CONTENTS
Acknowledgements............................................................................................................ iii
List of Tables ..................................................................................................................... vi
List of Figures ................................................................................................................... vii
Chapter 1 Introduction .........................................................................................................1
Chapter 2 Background and Convolution Architecture Requirements .................................2
Chapter 3 Version 1 Convolution Architecture ...................................................................7
3.1. Arithmetic Unit (AU) .......................................................................................... 7
3.2. Coefficient Shifters (CSs) ................................................................................. 10
3.3. Input Data Shifters (IDSs)................................................................................. 11
3.3.1. Register Bank (RB) ............................................................................... 12
3.3.2. Pattern Generator Pointers (PGPs) ....................................................... 12
3.3.3. Delay Units (DU).................................................................................. 15
3.4. Systolic Flow of Version 1 Convolution Architecture. .................................... 15
3.5. Data Memory Interface (DM I/F) ..................................................................... 17
3.6. Output Correction Unit ..................................................................................... 19
3.7. Controller .......................................................................................................... 19
Chapter 4 Revised Architectural Requirements and Resulting Version 2 Convolution
Architecture...............................................................................................................21
4.1. Version 2 Convolution Architecture for (k = 1) ............................................... 21
4.2. Arithmetic Unit (AU) ........................................................................................ 22
4.2.1. Multiplication Unit (MU) of Multiplication and Add Unit (MAU) ...... 26
4.2.2. Delay Units (DU).................................................................................. 31
4.3. Input Data Shifters (IDS) .................................................................................. 31
4.4. Data Memory Interface (DM I/F) ..................................................................... 32
4.5. Memory Pointers Unit (MPU) .......................................................................... 32
4.6. Systolic Flow of Version 2 Convolution Architecture ..................................... 34
4.7. Controller .......................................................................................................... 35
4.8. Multiple Filter Coefficient Sets when (k > 1) ................................................... 43
Chapter 5 VHDL Description of Version 2 Convolution Architecture..............................45
Chapter 6 Version 2 Convolution Architecture Validation via Virtual Prototyping
(Post-Synthesis and Post-Implementation Simulation Experimentation).................47
6.1. Post-Synthesis Simulation ................................................................................ 48
6.1.1. Adders ................................................................................................... 48
6.1.2. Multiplication Unit................................................................................ 51
6.1.3. Version 2 Convolution Architecture (with k = 1) ................................. 52
6.2. Post-Implementation Simulation ...................................................................... 61
6.2.1. Synthesis and Implementation of Version 2 Convolution Architecture
(with k = 1)...................................................................................................... 61
6.2.2. Version 2 Convolution Architecture (with k = 1) ................................. 62
6.2.3. Synthesis and Implementation of Version 2 Convolution Architecture (k
= 3) .................................................................................................................. 65
6.2.4. Validation of Version 2 Convolution Architecture (with k = 3)........... 66
Chapter 7 Hardware Prototype Development and Testing ................................................72
7.1. Board Utilization Modules and Prototype Setup .............................................. 73
7.2. Hardware Prototyping Flow.............................................................................. 76
7.3. Test Cases ......................................................................................................... 80
Chapter 8 Conclusion.........................................................................................................84
Appendix A VHDL Code for Version 2 Discrete Convolution Architecture ....................86
Appendix B VHDL Codes, C++ Source Code, and Script File for Post-Synthesis
Simulation ...............................................................................................133
Appendix C C++ Source Codes for Programs Used During Post-Implementation
Simulation ...............................................................................................................140
Appendix D C++ Source Codes for Programs Used During Hardware Prototype
Implementation .......................................................................................................143
Appendix E VHDL Files for Modules External to the Convolution Architecture...........149
References........................................................................................................................157
Vita .................................................................................................................................159
LIST OF TABLES

Table 3.1. Filter coefficient array. .....................................................................................11
Table 3.2. 5×5 Filter size (with one output pointer). .........................................................13
Table 3.3. 5×5 Filter size (Convolution with two output pointers). ..................................14
Table 4.1. Gate count comparison between CSA and CLA................................................25
Table 4.2. A summary of the multiplication. .....................................................................26
Table 4.3. Partial Product Selection Table.........................................................................28
Table 4.4. Comparison between method I and method II..................................................30
Table 6.1. Details of FPGA on the XESS protoboard. .......................................................62
Table 6.2. Resource utilization of Version 2 Convolution Architecture (with k = 1)........62
Table 6.3. Resource utilization of Version 2 Convolution Architecture (with k = 3)........66
LIST OF FIGURES

Figure 2.1. Pictorial view of Input Image Plane (IP), Filter Coefficient Plane (FC), and
Output Image Plane (OI).............................................................................................3
Figure 2.2. Example showing how two consecutive output pixels are generated. This
example is shown with a 3×3 size FC. .......................................................................4
Figure 2.3. Example showing that only (n-1) previous rows plus n input image pixels
need to be stored. In this example, 2 previous rows (shaded rows in addition to
IP23,..,IP25, IP33,..,IP35) plus 3 additional input image pixels (IP43,..,IP45) are
needed for a 3×3 filter size..........................................................................................4
Figure 3.1. Top-level view of Version 1 of the convolution architecture (d is assumed
to be 8 in this example). ..............................................................................................8
Figure 3.2. A MAU and included functional units. ..............................................................8
Figure 3.3. Systolic array structures of MAUs, where IDSs are outputs from Input Data
Shifters and CSs are the outputs from Coefficient Shifters. .......................................9
Figure 3.4. Functional units within CSs.............................................................................10
Figure 3.5. Arrangement of the filter coefficients within the Coefficient Shifters............11
Figure 3.6. Functional units within IDSs. ..........................................................................11
Figure 3.7. Generalized RB for n×n filter size. (d denotes number of bits for the input
pixels)........................................................................................................................12
Figure 3.8. Additional hardware and modification for convolution of x output pixels in
parallel for (x ≤ n) (Functional units shaded in gray are additional hardware
required for processing two convolutions in parallel). .............................................14
Figure 3.9. Organization of Flip-flops within the Delay Unit (DU). R within the figure
denotes one flip-flop. ................................................................................................15
Figure 3.10. Pictorial view of the data flow within the MAUs for one output pixel..........16
Figure 3.11. Basic functional units within Data Memory I/F............................................17
Figure 3.12. A more detailed look at DM I/F. ...................................................................18
Figure 3.13. Time line for activities, where W denotes Write or R denotes Read (from
external memory device) registers indicated in boxes directly below......................19
Figure 3.14. Output pattern for two convolutions in parallel. ...........................................19
Figure 4.1. A top level view of Version 2 of the convolution architecture for one
distinct filter coefficient set with n = 5 and d = 8 (MAA denotes Multiplication
and Add Array and AT denotes Adder Tree). ...........................................................22
Figure 4.2. Functional units within the Multiplication and Add Array (MAA). ................23
Figure 4.3. A MAU and its functional units. ......................................................................23
Figure 4.4. A possible arrangement of the AT (R denotes a single flip-flop pipeline
stage; a pipeline stage is included within each CSA and CLA) for a 5×5 FC. ..........24
Figure 4.5. One possible arrangement of the AT when CLA is utilized within the
MAUs. .......................................................................................................................25
Figure 4.6. Illustration of the paper and pencil multiplication technique (s on each row
of the partial products denotes sign extension of that particular row of partial
product). ....................................................................................................................27
Figure 4.7. One possible arrangement of Multilevel CSA Tree for six partial products....28
Figure 4.8. Multiplier based on Modified Booth’s Algorithm and Wallace Tree
Structure....................................................................................................................29
Figure 4.9. Illustration of multiplication technique based on Modified Booth’s
Algorithm..................................................................................................................29
Figure 4.10. Partial Product’s sign extension reduced for hardware saving......................30
Figure 4.11. Functional units within the DU for the case of n = 5 and two pipeline
stages within each MAU (PL denotes a pipeline stage composed of flip-flop
registers)....................................................................................................................31
Figure 4.12. Structural view of the IDS with n = 5 and d = 8............................................32
Figure 4.13. External memory devices organization for n = 5 and d = 8. .........................33
Figure 4.14. Functional units within the Memory Pointers Unit (MPU)...........................34
Figure 4.15. Pictorial view of the data flow within the MAAs for one output pixel (td
denotes the time delay between each MAU). ............................................................35
Figure 4.16. Top level view of the Controller Unit (CU). .................................................36
Figure 4.17. Functional Units that receive control signals from the CU. ..........................37
Figure 4.18. System flow chart for Version 2 convolution architecture’s Controller
Unit (CU). .................................................................................................................39
Figure 4.19. Modified Version 2 system flow chart. .........................................................41
Figure 4.20. Version 2 architecture for k (n×n) filter coefficient sets (where k can be
any number). .............................................................................................................43
Figure 5.1. Version 2 Convolution Architecture organization. .........................................46
Figure 6.1. Testing model for lower level functional components. ...................................49
Figure 6.2. Post-Synthesis simulation for 14-bit CLA. ......................................................49
Figure 6.3. A close up view of one segment of Figure 6.2. ...............................................50
Figure 6.4. Post-Synthesis simulation for 15-bit CLA. ......................................................50
Figure 6.5. Post-Synthesis simulation for 16-bit CLA. ......................................................50
Figure 6.6. Post-Synthesis simulation for 17-bit CLA. ......................................................50
Figure 6.7. Post-Synthesis simulation for 19-bit CLA. ......................................................51
Figure 6.8. Post-Synthesis simulation for all possible inputs for the Multiplication Unit
(MU)..........................................................................................................................51
Figure 6.9. A close up view of one segment of the simulation in Figure 6.8 above..........52
Figure 6.10. Test case 1 with IP and OI of size 5×60 (however, only the first seven
columns of both IP and OI are shown due to report page width limit). ...................53
Figure 6.11. The source code for C++ program that generates test vectors to program
the filter coefficients into MAUs...............................................................................54
Figure 6.12. Arrangement of the Filter Coefficients within the Arithmetic Unit. .............55
Figure 6.13. First phase of operation; programming of FCs into MAUs. ..........................56
Figure 6.14. First phase of operation; receiving the first two rows of the IP (shown in
figure above is the beginning of the second row of the input pixels). ......................56
Figure 6.15. Second phase of operation; output pixels generated. ....................................57
Figure 6.16. Second phase of operation; output pixels of the second row of OI
(superimposed)..........................................................................................................58
Figure 6.17. Third phase of operation; output pixels of the last row of OI
(superimposed)..........................................................................................................58
Figure 6.18. Test case 2; IP, FCs and expected OI (the first seven columns). ..................59
Figure 6.19. First phase of operation for test case 2. .........................................................60
Figure 6.20. Second phase of operation for test case 2; output pixels shown are the
first six of row one of OI (superimposed).................................................................60
Figure 6.21. Third phase of operation for test case 2; output pixels shown are the first
six of the last row for OI (superimposed). ................................................................61
Figure 6.22. Second phase of operation for test case 1 (post-implementation
simulation); output pixels of the second row of OI (superimposed). .......................63
Figure 6.23. Third phase of operation for test case 1 (post-implementation simulation);
output pixels of the last row of OI (superimposed). .................................................64
Figure 6.24. Second phase of operation for test case 2 (post-implementation
simulation); output pixels shown are the first six of row one of OI
(superimposed)..........................................................................................................64
Figure 6.25. Third phase of operation for test case 2 (post-implementation simulation);
output pixels shown are the first six of the last row for OI (superimposed).............65
Figure 6.26. Test Case 1: FC planes, IP plane and the predicted OI planes......................67
Figure 6.27. Superimposed output image pixels (start from the 3rd pixel) for first row
of the OIs for test case 1. ..........................................................................................68
Figure 6.28. Superimposed output image pixels (from 3rd pixel onward) of the second
row of the OIs for test case 1. ...................................................................................68
Figure 6.29. Test case 2: FC planes, IP plane and the predicted OI planes. .....................69
Figure 6.30. Superimposed output image pixels (start from the 3rd pixel) for third row
of the OIs for test case 2. ..........................................................................................70
Figure 6.31. Superimposed output image pixels (from 3rd pixel onward) of the fourth
row of the OIs for test case 2. ...................................................................................70
Figure 6.32. A plot of equivalent system gates versus number of FC planes....................71
Figure 7.1. Convolution Architecture hardware implementation. .....................................72
Figure 7.2. XSV-800 prototype board featuring Xilinx Virtex 800 FPGA (picture
obtained from XESS Co. website, http://www.xess.com).........................................73
Figure 7.3. Top level view of the prototyping hardware. ..................................................74
Figure 7.4. Example of a VHDL file for creating an internal Block RAM containing
input image pixels for the convolution system (seed number of 1 is provided to
the program)..............................................................................................................75
Figure 7.5. FPGA configuration and bit stream download program, gxsload from
XESS Co. ...................................................................................................................77
Figure 7.6. Execution of the FCs configuration program..................................................78
Figure 7.7. Upload SRAM content using gxsload utility, the high address indicates the
upper bound of the SRAM address space whereas the low address indicates the
lower bound of the SRAM address space. .................................................................79
Figure 7.8. Uploaded SRAM contents stored in a file (Intel hex file format). There are
two segments due to the fact that the program wrote the right bank of the SRAM
(16-bit) first and the left bank of the SRAM next (16 MSB bits). .............................79
Figure 7.9. SRAM contents retrieved for first OI plane for test case 1. .............................81
Figure 7.10. SRAM contents retrieved for second OI plane for test case 1........................81
Figure 7.11. SRAM contents retrieved for third OI plane for test case 1. ..........................81
Figure 7.12. SRAM contents retrieved for first OI plane for test case 2. ...........................82
Figure 7.13. SRAM contents retrieved for second OI plane for test case 2........................82
Figure 7.14. SRAM contents retrieved for third OI plane for test case 2. ..........................83
Chapter 1
Introduction
Performance and cost are both important criteria for today's computing system components, whether the component is an entire computer or a computer accessory or peripheral such as a printer. The ever-increasing consumer demand for higher performance has driven printer manufacturers to develop and incorporate performance enhancements into their products at the lowest possible price. Cost is, most of the time, directly proportional to performance, but manufacturers constantly pursue higher performance at lower cost.

The ability to scan and print exceedingly clear images at a maximum page-per-minute rate and at minimum cost is a performance target printer manufacturers aim for. To produce highly enhanced, clear images, a "discrete convolutional-filtering algorithm" must be implemented within the scanner or printer. General-purpose signal processors from various vendors are widely used to implement the convolutional-filtering algorithm. Often, however, not all of the functionality offered by a general-purpose signal processor is needed, so the unused functionality becomes a cost overhead. Also, commercially available general-purpose processors often cannot meet the desired requirement of the highest performance at the lowest cost. Thus, a special-purpose signal/image processor architecture is desired to implement the discrete convolutional-filtering algorithm. The subject of this thesis is the development of an efficient, high-performance, special-purpose signal/image processor architecture which may be used to implement the discrete convolutional-filtering algorithm at the lowest cost.
Chapter 2
Background and Convolution Architecture Requirements
Convolution is one of the essential operations in digital image processing required
for image enhancements [15,16]. It is used in linear filtering operations such as
smoothing, denoising, edge detection and so on [15,16]. In general, image processing is
carried out in a two dimensional space/array [16]. A digital image can be represented
with an array of numbers in a two-dimensional space. Each number (or pixel) has an associated row and column indicating its coordinates (position) in the two-dimensional space, and the number's value represents the gray level at that coordinate [15]. The gray
levels are usually represented with a byte or 8-bit unsigned binary number, ranging from
0 to 255 in decimal. Equation 1 shows the two dimensional discrete convolution
algorithm, where IP is the Input Image Plane, FC is the Filter Coefficient Plane, and OI is
the Output Image Plane [16].
OI[x,y] = FC[x,y] * IP[x,y] = Σ(I=0 to n−1) Σ(J=0 to n−1) FC[I,J] · IP[x − I + (n−1)/2, y − J + (n−1)/2]        (1)
Figure 2.1 below shows the basic definitions for the Input Image Plane (IP), Filter
Coefficient Plane (FC), and Output Image Plane (OI). Assuming that the IP has a size of i×j pixels and the FC has a size of n×n pixels, the OI would also have a size of i×j pixels. In most cases, n < i and n < j.
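Equation 1 can be sketched directly in software as a reference model. The C++ fragment below is a minimal sketch, not the thesis hardware; in particular, zero-padding at the image borders is an assumption, since no border policy is fixed at this point in the text.

```cpp
#include <vector>

// Reference model of Equation 1: OI[x,y] = sum over I,J of
// FC[I,J] * IP[x - I + (n-1)/2, y - J + (n-1)/2].
// ip is an i x j image, fc an n x n filter plane with n odd.
// Window positions falling outside the IP contribute zero (assumed policy).
std::vector<std::vector<int>> convolve2d(const std::vector<std::vector<int>>& ip,
                                         const std::vector<std::vector<int>>& fc) {
    const int i = (int)ip.size(), j = (int)ip[0].size();
    const int n = (int)fc.size(), h = (n - 1) / 2;
    std::vector<std::vector<int>> oi(i, std::vector<int>(j, 0));
    for (int x = 0; x < i; ++x)
        for (int y = 0; y < j; ++y) {
            int sum = 0;
            for (int I = 0; I < n; ++I)
                for (int J = 0; J < n; ++J) {
                    int r = x - I + h;   // IP row index from Equation 1
                    int c = y - J + h;   // IP column index from Equation 1
                    if (r >= 0 && r < i && c >= 0 && c < j)
                        sum += fc[I][J] * ip[r][c];
                }
            oi[x][y] = sum;
        }
    return oi;
}
```

The index expressions implement the 180-degree rotation of the FC described below: the FC coefficient with the largest indices multiplies the IP pixel with the smallest indices in the window.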
[Figure 2.1 graphic: the IP (i×j, pixels IP(0,0)..IP(i-1,j-1)), the FC (n×n, with center point c), and the OI (i×j, pixels OI(0,0)..OI(i-1,j-1)).]
Figure 2.1. Pictorial view of Input Image Plane (IP), Filter Coefficient Plane (FC), and Output Image Plane (OI).
Digital convolution can be thought of as a moving window of operations [16]. As shown in Equation 1, one output pixel OI[x,y] can be obtained by rotating the FC 180 degrees around its center point (denoted c in Figure 2.1) and placing it over the IP with the center point on top of IP[x,y]. All of the overlapping IP pixels are multiplied by the corresponding filter coefficients of the FC, and all of the products are then summed to generate the single pixel OI[x,y]. The next output pixel can be obtained by sliding the FC plane one pixel to the right and then repeating the process. Figure 2.2 illustrates the idea of the moving window of operations. The FC is first centered at IP[3,4] to compute OI[3,4] and then moves to IP[3,5] for OI[3,5].
From Figure 2.2 below, one can deduce that when an output pixel is computed, access to entire previous rows, or portions of previously input rows, of input pixels is needed. Hence, previous input image pixels must be stored for this purpose. However, not all of the previous rows of input image pixels are necessarily needed. Instead, only (n-1) rows plus n input image pixels are required, as shown by example in Figure 2.3 below. Another important observation that can be made from Figure 2.2 is that between consecutive convolutions, only n input image pixels become obsolete and require replacement. This observation influences the design of the convolution architecture. Figure 2.3 shows an example for a 3×3 filter size where the shaded areas of
the IP and the area under the FC plane are the input image pixels that need to be stored. These pixels can be stored in a memory device, whether on-chip or off-chip.
Figure 2.2. Example showing how two consecutive output pixels are generated. This
example is shown with a 3×3 size FC.
[Figure 2.2 graphic: a 3×3 FC (coefficients FC00..FC22) shown centered first at IP[3,4] and then at IP[3,5] over the IP, whose pixels arrive row by row.]
Current output: OI[3,4] = FC00·IP45 + FC01·IP44 + FC02·IP43 + FC10·IP35 + FC11·IP34 + FC12·IP33 + FC20·IP25 + FC21·IP24 + FC22·IP23
Next output: OI[3,5] = FC00·IP46 + FC01·IP45 + FC02·IP44 + FC10·IP36 + FC11·IP35 + FC12·IP34 + FC20·IP26 + FC21·IP25 + FC22·IP24
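The operand overlap between the two expansions above can be checked mechanically. The helper below (window is an illustrative name, not from the thesis) enumerates the IP coordinates that Equation 1 touches for one output pixel; comparing the sets for OI[3,4] and OI[3,5] confirms that exactly n = 3 fresh input pixels are needed per step.

```cpp
#include <set>
#include <utility>

// Enumerate the IP coordinates (row, col) referenced by Equation 1
// when computing OI[x,y] with an n x n FC (n odd).
std::set<std::pair<int, int>> window(int x, int y, int n) {
    std::set<std::pair<int, int>> s;
    const int h = (n - 1) / 2;
    for (int I = 0; I < n; ++I)
        for (int J = 0; J < n; ++J)
            s.insert({x - I + h, y - J + h});
    return s;
}
```

For OI[3,4] the set covers rows 2..4 and columns 3..5 (matching the expansion above, e.g. FC00 multiplies IP45); sliding to OI[3,5] adds only the three pixels of column 6.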
[Figure 2.3 graphic: input image pixels arriving row by row, from the beginning of each row to its end; the shaded previous rows (IP23..IP25, IP33..IP35) plus the pixels under the FC up to the current input pixel (IP43..IP45) are the only pixels that must be stored.]
Figure 2.3. Example showing that only (n-1) previous rows plus n input image pixels
need to be stored. In this example, 2 previous rows (shaded rows in addition to IP23,..,IP25, IP33,..,IP35) plus 3 additional input image pixels (IP43,..,IP45) are needed
for a 3×3 filter size.
Convolution is a vital part of image processing, and it can be performed in both software and hardware [1]. Much effort has been directed toward speeding up the convolution process through hardware implementation [1,2,3,9,14], because convolution is a computation-intensive algorithm, as shown in Equation 1. For example, with a 5×5 filter size, each output pixel requires 25 Multiplication and Addition Operations (MAOPS). Thus, as the total number of pixels in an image increases, the required MAOPS increase substantially. Bosi and Bois in [1] propose the use of FPGAs programmed with a 2D convolver as a coprocessor to an existing digital signal processor (DSP) to speed up the convolution process. In [2,3,9,14], special-purpose convolution architectures are designed to meet real-time image processing requirements. Hsieh and Kim in [2] propose a highly pipelined VLSI convolution architecture; parallel one-dimensional convolutions and a circular processing module are used for performance gain, and the architecture requires n×n processing elements, each a multiplier and adder. Both [3] and [14] propose convolution architectures based on systolic arrays which operate on real-time images of size 512×512 pixels; both architectures perform bit-serial arithmetic. The architecture of [14] requires on-chip memory to store the necessary input pixels.
The focus of this thesis is the development of a high performance, real-time, special purpose convolution architecture for scanning and printing applications. The requirement for the final “production” version of this architecture (implemented with ASIC technology) is the capability to perform convolution with a 5×5 FC size on input images of size 8½”×11” at a rate of 60 Pages Per Minute (PPM) at 600 dpi (dots per inch). A total of 33.66M pixels are generated when a standard paper size of 8½”×11” is scanned at a resolution of 600 dpi; adding the requirement to process 60 scanned PPM results in 1.69G MAOPS per second. The multiplication operands are each 8 bits in width, generating 16-bit products which must be summed. The method proposed in [1] is not feasible from a cost standpoint, and some of the functionality of the DSP may not be required. The architectures presented in [2] and [9] are special purpose architectures for convolution, and both require n×n processing elements, which could potentially occupy a large chip area. The architectures of [3] and [14] are both systolic array architectures employing bit-serial arithmetic operations and hence may not be able to meet the performance requirements mentioned above. In [7] the authors point out the well-known fact that for most applications bit-parallel arithmetic has a performance edge over bit-serial arithmetic; however, the processing architecture of [7] is based on bit-serial arithmetic since it is sufficient for their requirements and has a lower gate count.
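For reference, the direct two-dimensional convolution of Equation 1 and the workload figures above can be checked with a short Python sketch (illustrative only; the index convention follows the OI[3,4] expansion of Figure 2.2, and the 1.69G figure assumes the multiplications and additions are counted separately):

```python
def convolve2d(image, fc):
    """Direct 2D discrete convolution with zero padding at the borders.
    The kernel is flipped: FC[0][0] pairs with the pixel below-right of
    the center, matching the OI[3,4] expansion of Figure 2.2."""
    n = len(fc)
    rows, cols = len(image), len(image[0])
    h = n // 2
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0
            for i in range(n):
                for j in range(n):
                    rr, cc = r + h - i, c + h - j
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += fc[i][j] * image[rr][cc]
            out[r][c] = acc
    return out

# Workload for the stated requirement: 8.5" x 11" at 600 dpi, 5x5 FC, 60 PPM.
pixels_per_page = int(8.5 * 600) * (11 * 600)   # 5100 x 6600 = 33,660,000
ops_per_page = pixels_per_page * 25 * 2         # 25 multiplies + 25 adds
ops_per_second = ops_per_page * 60 // 60        # 60 PPM = one page per second
```

With these assumptions, `ops_per_second` comes out to 1.683 × 10⁹, consistent with the 1.69G MAOPS figure quoted above.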
Two hardware architectures, Version 1 and Version 2, are proposed for the implementation of the two-dimensional discrete convolution algorithm shown as Equation 1. Version 1 of the architecture, proposed in the next section, is based on a linear systolic array structure. Version 2 of the architecture, based on an extension of Version 1, will be shown in a later section. Version 2 was developed to meet functional and performance requirements different from those of Version 1.
Version 1 is a special purpose convolution architecture. Unlike [2] and [9], the architecture does not use n×n processing elements, and it is scalable in order to meet variable performance requirements. A scalable architecture, when implemented in Field Programmable Gate Array (FPGA) technology, allows users to implement the architecture to meet their specific performance needs. Parallel arithmetic operations are utilized in the Version 1 architecture for performance gain.
Chapter 3
Version 1 Convolution Architecture
Specially designed hardware to implement convolution can offer a performance
gain over a general-purpose Digital Signal Processor (DSP). Since the convolution
algorithm requires a large number of multiplications and additions, a multiprocessor
architecture is desired. Multiprocessor architectures offer the benefit of processing
multiple operations at a given instance.
A specific type of multiprocessor architecture will be utilized for implementation
of convolution in hardware. It is referred to as a systolic array structure [6]. The
advantages of this type of structure include its modularity and regularity in structure and
the ease of pipelining. In other words, the systolic structure has the ability to fully and
simultaneously utilize all computational units within the architecture. A major challenge
in developing systolic array architectures can be in providing data simultaneously to the
multiple computational units in correct order. Figure 3.1 shows the top-level view of
version 1 of the developed hardware convolution architecture with the basic functional
units indicated. An external memory device (external to an FPGA or ASIC chip) will be
utilized to hold a portion of the scanned input image plane during the convolution
process. A more detailed description of the interface between the main system and the
external memory will be presented later. The functionality of all basic functional units of
the architecture of Figure 3.1 will now be described.
3.1. Arithmetic Unit (AU)
The Arithmetic Unit (AU) is the core of version 1 of the convolution architecture.
As shown in Figure 3.1 above, the AU consists of an Accumulator plus Multiplication
and Add Units (MAUs). As the name implies, the basic building blocks within MAUs are multiplication units and adders. Each MAU consists of one multiplication unit and one adder, as shown in Figure 3.2 below.
Figure 3.1. Top-level view of Version 1 of the convolution architecture (d is assumed to be 8 in this example).
Figure 3.2. A MAU and included functional units.
As depicted in Figure 3.2, the multiplication unit multiplies two 8-bit binary numbers, and the adder then adds the product to the input from the previous MAU. The output from the adder is used as the input to the next MAU. In order to achieve high performance, it is important to utilize high-speed adders and multiplication units within the architecture. It is also of interest to adopt a multiplication technique that is suitable for pipelining; for example, the Wallace tree and array multiplication techniques can be easily pipelined into multiple stages. The implementation platform influences performance as well: it is now common to find high-performance built-in core adders and multiplication units within FPGA technology chips.
Figure 3.3. Systolic array structure of the MAUs, where the IDSs are the outputs from the Input Data Shifters and the CSs are the outputs from the Coefficient Shifters.
The MAUs are arranged so that they form a systolic array structure. Figure 3.3 shows the systolic array structure of the MAUs. The total number of MAUs used is determined by the size of the coefficient filter: for an n×n filter size, n MAUs are utilized.
Use of the systolic array structure requires the outputs from the Input Data Shifter (IDS) and Coefficient Shifter (CS) functional units to be skewed. This ensures that the products are accumulated in the correct sequence. An accumulator is needed at the end of the structure to add the partial results generated by the MAUs to form an output pixel. For example, for an n×n filter size, the accumulator must accumulate n partial results generated by the MAUs to produce one output pixel. This requires n clock cycles if each MAU takes one clock cycle to complete its operation. The registers to the left of each MAU serve as pipeline stage registers. If necessary, additional pipeline stages can be implemented within each MAU to increase performance [8].
3.2. Coefficient Shifters (CSs)
Coefficient shifters are a group of parallel register shifters that can be programmed to retain the values of the filter coefficients. The CSs are responsible for generating a skewed output of filter coefficients to the MAU inputs. Figure 3.4 shows a structural view of the CSs. The number of shifters within the CSs is also dependent on the coefficient filter size: for an n×n filter size, there are n coefficient shifters, each storing n coefficients as seen in Figure 3.4. Once programmed, the CSs retain the filter coefficients throughout the convolution process. In order to provide the MAUs with the skewed input from the CSs, as the convolution process starts, CS0 shifts after the first clock cycle; on the next clock cycle CS1 and CS0 shift; on the following clock cycle CS2, CS1, and CS0 shift. The process continues until all the CSs are shifting every clock cycle. This ensures that the MAUs receive the required skewed input. Figure 3.5 shows the arrangement of the filter coefficients within the CSs corresponding to the filter coefficients shown in Table 3.1.
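The staggered start described above can be modeled behaviorally. The following Python sketch (an illustration only; circular shifters are assumed) shows that after the ramp-up, the output of each CS lags its left neighbor by one coefficient position:

```python
def cs_outputs(coeffs_by_shifter, cycles):
    """Behavioral model of the Coefficient Shifters' staggered start.
    coeffs_by_shifter[s] holds the n coefficients loaded into CS_s.
    CS0 starts shifting after the first cycle, CS1 one cycle later, and so
    on, producing the skew the MAU array needs."""
    n = len(coeffs_by_shifter)
    state = [list(cs) for cs in coeffs_by_shifter]
    outputs = []
    for t in range(cycles):
        # Each shifter presents its front coefficient this cycle.
        outputs.append([state[s][0] for s in range(n)])
        for s in range(n):
            if t >= s:  # CS_s joins the shifting one cycle after CS_{s-1}
                state[s].append(state[s].pop(0))  # circular shift
    return outputs
```

After cycle n-1 every shifter shifts on every clock, and column s of the output stream lags column s-1 by exactly one cycle, which is the skew the MAUs require.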
Figure 3.4. Functional units within CSs.
Table 3.1. Filter coefficient array.
FC(0, 0)    FC(0, 1)    ...  FC(0, n-2)    FC(0, n-1)
FC(1, 0)    ...                            FC(1, n-1)
...
FC(n-1, 0)  FC(n-1, 1)  ...  FC(n-1, n-2)  FC(n-1, n-1)
Figure 3.5. Arrangement of the filter coefficients within the Coefficient Shifters.
3.3. Input Data Shifters (IDSs)
The main function of the Input Data Shifters (IDSs) is to generate a proper
sequence of input image pixels for the MAUs. Figure 3.6 shows the basic functional units
within the IDSs of Figure 3.1.
Figure 3.6. Functional units within IDSs.
3.3.1. Register Bank (RB)
Due to the structure of the convolution algorithm, each successive output pixel requires access to the ((n×(n-1))+n-1) previously received input image pixels. Hence, the RB of Figure 3.6 is used to provide the correct input image pixels for successive convolutions. Figure 3.7 shows the detail of the RB. The RB consists of n registers, each with a length of n input image pixels, or (n×d) bits, assuming each input image pixel is d bits in length. Thus, the RB has the capacity to hold the n² input image pixels needed for each convolution. This functional unit and its structure also improve the scalability of the architecture.
Figure 3.7. Generalized RB for n×n filter size. (d denotes number of bits for the
input pixels).
3.3.2. Pattern Generator Pointers (PGPs)
In order to provide the MAUs with the correct sequence of input image pixels for each convolution, the Pattern Generator Pointers (PGPs) of Figure 3.6 are utilized. Once the RB fills up, only one register needs to be updated with new input image pixels for each convolution. Thus, the update sequence for the RB (input image pixels coming from the Data Memory Interface) repeatedly goes from top to bottom (repeating from zero to (n-1)). As for the output sequence from the RB, each convolution requires the contents of all registers to be fetched to the Delay Units. Hence, all units except the Data Memory Interface (DM I/F) run at a frequency n times faster than the input image pixel rate (for an n×n filter size); the DM I/F operates at the same frequency as the input image pixel rate. Table 3.2 below shows the output sequence for one output pointer. The output sequence is 0, 1, 2, 3, 4 for n = 5. The example in Table 3.2 is based on a 5×5 filter size; the output pattern repeats itself every five convolutions.
Table 3.2. 5×5 filter size (with one output pointer).
                    Convolutions
Reading order       1st  2nd  3rd  4th  5th
from the RB:         0    1    2    3    4
                     1    2    3    4    0
                     2    3    4    0    1
                     3    4    0    1    2
                     4    0    1    2    3
The architecture can be scaled up to process up to x convolutions in parallel, where (x ≤ n). This is made possible by adding (x-1) additional output pointer(s), Delay Unit(s) and AU(s) to the existing architecture. As in the example above, the output sequence for each pointer can be predetermined, and the sequences repeat after every five convolutions. Table 3.3 below shows an example for a 5×5 filter size with two output pointers. Figure 3.8 shows the additional hardware required if two convolutions are to be done in parallel, and the figure indicates the additional hardware required to convolute x = n points in parallel. In addition, an Output Correction Unit (OCU) is needed for convolution of two or more output pixels in parallel (see the OCU in Figure 3.1). The function of the OCU will be explained in a following section.
As the architecture is scaled up to process more than one convolution in parallel, all functional units within the architecture except the DM I/F (which runs at the same frequency as the input image pixel rate) can operate at a lower frequency. If the architecture is scaled up to process n convolutions in parallel, then the whole architecture operates at the same clock rate as the input image pixel rate. Thus, on average one convolution can be achieved every clock cycle. However, the RB needs modification in order to process n convolutions in parallel: on every clock cycle, all n registers within the RB are read at once. Thus, it is necessary for the current input from the DM I/F (used to update one of the registers within the RB) to be fetched to the MAU at the same instant. For this case, n pointers are utilized and each pointer has only one sequence instead of the five shown in Table 3.2 and Table 3.3.
Table 3.3. 5×5 filter size (convolution with two output pointers).
                    Convolutions
Reading order       1st  2nd  3rd  4th  5th
from the RB          0    2    4    1    3
(pointer 0):         1    3    0    2    4
                     2    4    1    3    0
                     3    0    2    4    1
                     4    1    3    0    2
                    Convolutions
Reading order       1st  2nd  3rd  4th  5th
from the RB          1    3    0    2    4
(pointer 1):         2    4    1    3    0
                     3    0    2    4    1
                     4    1    3    0    2
                     0    2    4    1    3
Figure 3.8. Additional hardware and modification for convolution of x output pixels in parallel for (x ≤ n). (Functional units shaded in gray are the additional hardware required for processing two convolutions in parallel.)
The PGPs can be synthesized using a finite state machine model. Another modeling possibility is to store the predetermined sequences in a RAM and read them out sequentially as needed.
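The predetermined sequences can also be generated on the fly. The closed form below is derived from the patterns of Tables 3.2 and 3.3 (it is an observation about those tables, not a formula stated in the thesis): the starting register advances by x positions per convolution, offset by the pointer index.

```python
def pgp_sequence(n, x, pointer, conv):
    """Reading order from the Register Bank for one convolution.
    n: filter size; x: convolutions processed in parallel; pointer: output
    pointer index (0..x-1); conv: 0-based convolution index.  The start
    register advances by x each convolution, so the pattern repeats every
    n convolutions, matching Tables 3.2 and 3.3."""
    start = (x * conv + pointer) % n
    return [(start + i) % n for i in range(n)]
```

For example, `pgp_sequence(5, 1, 0, 1)` reproduces the second column of Table 3.2, and `pgp_sequence(5, 2, 1, 0)` the first column of the pointer-1 half of Table 3.3.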
3.3.3. Delay Units (DU)
Output from the Register Bank (RB of Figure 3.7 and Figure 3.8) passes through the Delay Units (DU) of Figure 3.8 before being fetched into the MAUs within the AUs of Figure 3.8. The Delay Units consist of a series of flip-flops placed in a manner that generates a skewed input to the MAUs. This is necessary for the AU to generate the correct outputs. Figure 3.9 below shows the internal structure of a DU.
Figure 3.9. Organization of Flip-flops within the Delay Unit (DU). R within the
figure denotes one flip-flop.
3.4. Systolic Flow of Version 1 Convolution Architecture.
Figure 3.10 further demonstrates how data flows within the MAUs. As the input image pixels pass through the DU from the RB, skewed input image pixels are generated and fed to the MAUs of the AU. At the same time, skewed filter coefficients are input into the MAUs by the CSs. Figure 3.10 below shows an example of how an output pixel is obtained as it flows through the MAUs with a 5×5 filter coefficient size.
[Figure 3.10 diagram: the 5×5 FC plane (FC00..FC44) centered at IP[3,4], overlaid on the input image pixels it covers.]

Output, OI[3,4] = FC00·IP56 + FC01·IP55 + FC02·IP54 + FC03·IP53 + FC04·IP52 + FC10·IP46 + FC11·IP45 + FC12·IP44 + FC13·IP43 + FC14·IP42 + FC20·IP36 + FC21·IP35 + FC22·IP34 + FC23·IP33 + FC24·IP32 + FC30·IP26 + FC31·IP25 + FC32·IP24 + FC33·IP23 + FC34·IP22 + FC40·IP16 + FC41·IP15 + FC42·IP14 + FC43·IP13 + FC44·IP12

Partial results held in each MAU per clock cycle (c.c.):
t0:      MAU0 = FC44·IP12
(t0+1):  MAU1 = FC34·IP22 + FC44·IP12;  MAU0 = FC43·IP13
(t0+2):  MAU2 = FC24·IP32 + FC34·IP22 + FC44·IP12;  MAU1 = FC33·IP23 + FC43·IP13;  MAU0 = FC42·IP14
(t0+3):  MAU3 = FC14·IP42 + FC24·IP32 + FC34·IP22 + FC44·IP12;  MAU2 = FC23·IP33 + FC33·IP23 + FC43·IP13;  MAU1 = FC32·IP24 + FC42·IP14;  MAU0 = FC41·IP15
(t0+4):  MAU4 = FC04·IP52 + FC14·IP42 + FC24·IP32 + FC34·IP22 + FC44·IP12;  MAU3 = FC13·IP43 + FC23·IP33 + FC33·IP23 + FC43·IP13;  MAU2 = FC22·IP34 + FC32·IP24 + FC42·IP14;  MAU1 = FC31·IP25 + FC41·IP15;  MAU0 = FC40·IP16
(t0+5):  MAU4 = FC03·IP53 + FC13·IP43 + FC23·IP33 + FC33·IP23 + FC43·IP13;  MAU3 = FC12·IP44 + FC22·IP34 + FC32·IP24 + FC42·IP14;  MAU2 = FC21·IP35 + FC31·IP25 + FC41·IP15;  MAU1 = FC30·IP26 + FC40·IP16
(t0+6):  MAU4 = FC02·IP54 + FC12·IP44 + FC22·IP34 + FC32·IP24 + FC42·IP14;  MAU3 = FC11·IP45 + FC21·IP35 + FC31·IP25 + FC41·IP15;  MAU2 = FC20·IP36 + FC30·IP26 + FC40·IP16
(t0+7):  MAU4 = FC01·IP55 + FC11·IP45 + FC21·IP35 + FC31·IP25 + FC41·IP15;  MAU3 = FC10·IP46 + FC20·IP36 + FC30·IP26 + FC40·IP16
(t0+8):  MAU4 = FC00·IP56 + FC10·IP46 + FC20·IP36 + FC30·IP26 + FC40·IP16

Figure 3.10. Pictorial view of the data flow within the MAUs for one output pixel.
As shown in Figure 3.10 above, the convolution starts at clock cycle (c.c.) t0, when the first input image pixel (IP12) is multiplied by filter coefficient FC44 in MAU0. During the next clock cycle, (t0 + 1), the previous product from MAU0 is added to the product of the IP22 and FC34 multiplication in MAU1 while a new product is generated in MAU0 (FC43 and IP13). The sum of the two products in MAU1 is propagated into MAU2 on the next clock cycle, (t0 + 2), where it is summed with the product generated within MAU2. The process continues as shown in Figure 3.10 above. An output pixel is generated during (t0 + 9) c.c., when the partial results from MAU4 at (t0 + 4), (t0 + 5), (t0 + 6), (t0 + 7), and (t0 + 8) c.c. are summed by the accumulator. Once the first output pixel is generated on the 9th c.c., a new output pixel is generated every five clock cycles thereafter.
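The dataflow of Figure 3.10 can be checked with a small cycle-level Python model (an illustrative sketch, not the VHDL design). Here `fc_rows[k]` holds the coefficients MAU_k consumes in order, and `pixel_rows[k]` the matching skewed pixel stream; the accumulated result must equal the direct sum of products:

```python
def systolic_convolve(pixel_rows, fc_rows):
    """Cycle-level sketch of the linear MAU chain plus accumulator.
    MAU_k starts k cycles late (the Delay Unit skew); each cycle its product
    is added to the value latched from MAU_{k-1}, and the accumulator sums
    the n partial results that leave MAU_{n-1}."""
    n = len(fc_rows)
    pipe = [0] * (n + 1)          # pipe[k]: register feeding MAU_k
    acc = 0
    for t in range(2 * n - 1):
        new_pipe = [0] * (n + 1)
        for k in range(n):
            j = t - k             # local step of MAU_k (skewed start)
            if 0 <= j < n:
                new_pipe[k + 1] = pipe[k] + fc_rows[k][j] * pixel_rows[k][j]
        pipe = new_pipe
        if t >= n - 1:            # a partial result leaves MAU_{n-1}
            acc += pipe[n]
    return acc
```

For a 3×3 example the systolic result matches the direct dot product, mirroring how the accumulator of Figure 3.10 collects the n column sums.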
3.5. Data Memory Interface (DM I/F)
It is anticipated that external memory devices will be utilized for IP pixel storage, since the cost of having on-chip memory within the single-chip convolution architecture of Figure 3.1 is high for any implementation platform. (The following assessment assumes a 5×5 filter size.) The bus for data transfer between the external memory device and the DM I/F will be 40 bits wide. This ensures that each access to the external memory device yields five input image pixels. However, since memory devices such as the SRAM devices on the market only come in widths of 8, 16, and 32 bits, two memory devices will be used: one 8-bit and one 32-bit. Because only five input image pixels need to be updated for consecutive convolutions, only one access to the external memory devices is required. Figure 3.11 below shows the basic functional units within the DM I/F.
Figure 3.11. Basic functional units within Data Memory I/F.
A cache unit is utilized to reduce the penalty of accessing external memory devices. Figure 3.12 shows a more detailed view of the DM I/F. Register Files A and B each consist of four shift registers, namely Registers b, c, d, and e. Each register is 40 bits in size and holds five input image pixels. In order to prevent data starvation, Register Files A and B are used alternately: as Register File A is providing input image pixels to the IDSs, Register File B is being filled with input image data from the external memory devices, and vice versa. As either one of the Register Files outputs input image pixels to the IDSs, each internal register shifts an input image pixel out by shifting right.
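The alternating ("ping-pong") use of the two Register Files can be sketched as follows (a Python illustration only; one 40-bit word per file is modeled, and least-significant-byte-first pixel packing is an assumption, not a detail stated in the thesis):

```python
def ping_pong_stream(words):
    """Sketch of the alternating Register File A / B scheme: while one file
    feeds the IDSs, the other refills from external memory, so the shifters
    never starve.  Each 'file' here holds one 40-bit word (five pixels)."""
    files = [None, None]               # Register File A and Register File B
    it = iter(words)
    files[0] = next(it, None)          # prefill A, then B, before streaming
    files[1] = next(it, None)
    active = 0
    out = []
    while files[active] is not None:
        word = files[active]
        for i in range(5):             # shift one 8-bit pixel out per cycle
            out.append((word >> (8 * i)) & 0xFF)
        files[active] = next(it, None) # refill behind the draining file
        active ^= 1                    # swap the roles of A and B
    return out
```

While one file drains its five pixels, the refill of the other overlaps with it, which is exactly the data-starvation argument made above.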
Figure 3.12. A more detailed look at DM I/F.
Observing Equation 1, there are instances where references are made outside the range of the input image. For these accesses, zero pixel values are used. To address these boundary conditions, zero-padding hardware is incorporated into the DM I/F. Whenever the end of a row of input image pixels is reached, (n-1)/2 zero pixels are attached. Thus, a column counter is needed within the main controller of the architecture.
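The effect of the zero-padding hardware on one row can be sketched as below (a Python illustration; it pads both ends of the row with (n-1)/2 zeros each, as a centered n×n filter requires, whereas the hardware attaches the pixels as each row boundary is reached):

```python
def pad_row(row, n):
    """Zero padding at row boundaries for an odd n x n filter centered on
    the output pixel: (n - 1) // 2 zero pixels beyond each end of the row."""
    h = (n - 1) // 2
    return [0] * h + list(row) + [0] * h
```

With n = 5 a row gains two zeros at each end, so the filter window is always fully populated even at the first and last output columns.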
Register a of Figure 3.12 is a register that can hold up to five input image pixels. As Register a is filled, its contents are stored into the External Memory Devices. Five address pointers are also needed for addressing the External Memory Devices for storage of input image pixels. Each addressable location of the two External Memory Devices can hold up to five input image pixels (40 bits). Figure 3.13 shows the timeline of activities within the DM I/F. For every four reads from the External Memory Devices (one read from each of the read pointers) and one write to store the input image pixels held in Register a, five output pixels (OI) of Figure 3.1 are produced.
Figure 3.13. Timeline of DM I/F activities, where W denotes a Write and R denotes a Read (from the external memory device) of the registers indicated in the boxes directly below.
3.6. Output Correction Unit
This unit is responsible for correcting the output sequence when the architecture is scaled to process two or more convolutions in parallel. Figure 3.14 below shows an example of the output pixel sequence when two convolutions are processed in parallel. Instead of one output on each output clock cycle, two output pixels are generated in a single clock cycle every two output clock cycles. Thus, the Output Correction Unit may be needed to correct the output sequence back to one output pixel per output clock cycle. Whether the OCU is needed can be addressed at a later time.
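If re-serialization is required, the correction for x = 2 amounts to a one-deep holding register, as in this Python sketch (an illustration of the behavior, not the thesis implementation; the input alternates a pixel pair with an idle cycle):

```python
def output_correction(paired_outputs):
    """Sketch of the OCU for x = 2 parallel convolutions: the AUs deliver
    two output pixels together every second output clock cycle; a one-deep
    holding register re-serializes them to one pixel per cycle."""
    held = None
    serialized = []
    for cycle_output in paired_outputs:
        if cycle_output is not None:       # an (even, odd) pixel pair arrives
            even_pixel, odd_pixel = cycle_output
            serialized.append(even_pixel)  # first pixel passes straight through
            held = odd_pixel               # second pixel is held one cycle
        else:                              # idle cycle: release the held pixel
            serialized.append(held)
            held = None
    return serialized
```

The same idea generalizes to x > 2 with an (x-1)-deep holding queue drained one pixel per cycle.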
Figure 3.14. Output pattern for two convolutions in parallel.

3.7. Controller
This is the functional unit of Figure 3.1 that coordinates all the other functional units within the architecture. The main controller of the architecture will be implemented in finite state machine form. Within the main controller there are a row counter and a column counter to keep track of the row and column counts, so that the controller knows when the end of a row is encountered. There can be two separate controlling units within the main controller: one responsible for the DM I/F (controller DM) and the other responsible for the rest of the architecture (controller R). The DM I/F runs at the same rate as the input image pixels, while the rest of the architecture runs at least n times faster, assuming convolution of a single point. However, as the architecture is scaled up to handle x convolutions at the same time (see Figure 3.8), controller R can be run at a correspondingly lower frequency.
Basically, the controller can be divided into three main stages. The first stage is mainly devoted to storing the first few rows of input image pixels and waiting until there are enough input image pixels for convolution to start. The second stage is responsible for filling up the pipeline and making sure that the convolution starts in the correct manner. The last stage deals with shutting down the system.
Chapter 4
Revised Architectural Requirements and Resulting Version 2 Convolution Architecture
The convolution architecture proposed in the previous section is scalable and
suitable for applications that require scalable performance and hardware. In this section a
more stringent performance requirement is addressed for which a convoluted OI pixel is
expected on each clock cycle of 7.3 ns (for a final “production” model based on ASIC
technology). In addition, k distinct n×n FCs are required to be simultaneously convoluted
with each Input Image Plane (IP) resulting in a performance requirement of k OI pixels
on each 7.3 ns clock cycle. The performance requirement of k convoluted OI pixels on each 7.3 ns clock cycle can only be expected from final high-speed production technologies. In Version 1 of the architecture, filter coefficients (FCs) were assumed to
be 8 bits in length. Filter coefficients will now be 6 bits in length. Even though the
convolution architecture proposed in the previous section can be scaled up and pipeline
stages within the MAUs can also be increased to meet all the above requirements, a
specially tailored architecture can save hardware and reduce the architecture’s controller
complexity. For example, as shown in Figure 3.1, within each AU there is an accumulator
in front of the MAUs. As the architecture is scaled up to process n convolutions in
parallel, n accumulators within the architecture will be required which can be costly from
a hardware standpoint. Furthermore, a simplified controller for the IDSs can also
contribute to a hardware saving. Hence, in this section a modified and specially tailored
convolution architecture will be presented and it is referred to as Version 2 of the
convolution architecture.
4.1. Version 2 Convolution Architecture for (k = 1)
Since the desired output rate is the convolution of one OI pixel per clock cycle, for an n×n FC size a total of n² MAUs are needed for one distinct filter coefficient set. Figure 4.1 below shows a top-level view of Version 2 of the architecture, where n and d (the width of the input image pixels) are assumed to be 5 and 8, respectively. The buses shown in Figure 4.1 with a width of 40 bits result from (n×d). Each functional unit in this architecture implements required functionality, as will be addressed below.
Figure 4.1. A top-level view of Version 2 of the convolution architecture for one distinct filter coefficient set with n = 5 and d = 8 (MAA denotes Multiplication and Add Array and AT denotes Adder Tree).

4.2. Arithmetic Unit (AU)
As shown in Figure 4.1 above, the Arithmetic Unit (AU) consists of n
Multiplication and Add Arrays (MAAs) plus an Adder Tree Structure (AT) at the end of
the MAAs. Within each MAA there are n Multiplication and Add Units (MAUs) arranged
in a systolic array structure. Figure 4.2 shows the arrangement of the n MAUs within each
MAA.
The basic functional units within each MAU remain the same as in the previous
section. In Version 2 of the architecture, filter coefficients will be held within the MAUs,
therefore, an additional register is needed to hold the filter coefficient value assigned to a
specific MAU. Since Version 2 of the modified convolution architecture features n² MAUs, the Coefficient Shifters (CSs) of Version 1 of the architecture (see Figure 3.1, Figure 3.4 and Figure 3.5 of the previous section) can be eliminated. Hence, each of the n² filter coefficients is assigned to a specific MAU. Figure 4.3 shows the functional units within each MAU.
Figure 4.2. Functional units within the Multiplication and Add Array (MAA).
Figure 4.3. A MAU and its functional units.
In order to achieve the desired performance, it will be necessary to pipeline all MAUs beyond the minimum pipeline stages shown in Figure 4.2 (the register to the right of each MAU represents a pipeline stage). Thus it is important to employ multiplication techniques that can easily be pipelined into multiple stages. It is possible to combine the multiplication unit and the adder shown in Figure 4.3 into one unit: a multiplication unit usually consists of an adder tree that adds all the generated partial products, and, as shown in Figure 4.3, an adder is required to sum the previous MAU output with the product generated by the multiplier. It is possible to use a Carry Save Adder and generate the output as two separate outputs (a sum output and a carry output). This eliminates the need for another high-speed adder at the end of each MAU.
The Adder Tree (AT) within the AU is responsible for adding all the n partial
results from the MAAs to form the output image pixel. The AT can be constructed with
Carry Save Adders (CSAs) and a Carry Look Ahead Adder (CLA). In addition, the AT
will be pipelined into multiple stages as well for performance. Figure 4.4 shows a
possible arrangement of CSAs and CLA within the AT. This example is based on a 5×5
FC size.
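The carry-save technique underlying the AT can be illustrated at the word level in Python (a behavioral sketch, not the gate-level design; the final carry-propagate addition stands in for the CLA):

```python
def csa(a, b, c):
    """3:2 carry-save adder: compresses three operands into a sum word and
    a carry word whose total equals a + b + c, with no carry propagation."""
    return a ^ b ^ c, (a & b | b & c | a & c) << 1

def adder_tree(operands):
    """Reduce the MAA partial results with CSAs, then perform one final
    carry-propagate (CLA-style) addition, as in the AT of Figure 4.4."""
    vals = list(operands)
    while len(vals) > 2:
        s, cy = csa(vals[0], vals[1], vals[2])  # compress three into two
        vals = [s, cy] + vals[3:]
    return vals[0] + vals[1] if len(vals) == 2 else vals[0]
```

Because each CSA stage involves no carry chain, only the single CLA at the root limits the clock period, which is why the tree pipelines so well.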
Figure 4.4. A possible arrangement of the AT (R denotes a single flip-flop pipeline stage; a pipeline stage is included within each CSA and CLA) for a 5×5 FC.
Another basic functional unit within the AU is the set of Delay Units (DU), which are responsible for generating the skewed input image pixels for the MAUs. However, the DUs need to be pipelined as well, with the same number of pipeline stages as the MAA.
Upon further investigation, even though replacing a high-speed adder with a CSA within a MAU can save a small amount of hardware, the replacement is not as beneficial when the architecture is reviewed at the highest level. Table 4.1 shows a direct comparison of the number of gates required for a 14-bit CSA and a 14-bit CLA (an EX-OR gate is counted as five gates). The amount of hardware saved is not as significant as the resulting increase in hardware for the AT. Figure 4.5 shows a possible arrangement of the AT if a CLA is utilized within the MAUs.
Table 4.1. Gate count comparison between CSA and CLA.
             CSA    CLA
Gate Count   182    210
Figure 4.5. One possible arrangement of the AT when CLA is utilized within the MAUs.
First and foremost, comparison of Figure 4.4 and Figure 4.5 shows that a number of CSAs are saved, along with a number of pipeline stages. This results in a large amount of hardware savings. If a CSA is used within each MAU, the number of bits (or bus lines) running from one MAU to another is doubled; hence, when implemented, the CSA approach requires more real estate within the chip (especially when implemented as an ASIC) than the CLA approach, reinforcing the need to reduce the number of CSA units. Another important hardware reduction is the halving of the number of flip-flops required for pipelining, since only one bus (one output from each MAA) is required.
In conclusion, the adder within each MAU of Figure 4.3 will be a CLA type and
the Adder Tree (AT) of Figure 4.1 will be implemented as shown in Figure 4.5 for the
case of n = 5.
4.2.1. Multiplication Unit (MU) of Multiplication and Add Unit (MAU)
The Multiplication Unit (MU) of Figure 4.3 is one of the most important
arithmetic components within the proposed convolution architecture. Thus, it is important
that a high speed and area efficient multiplication technique be derived and implemented
since the architecture requires 25 MUs for one convolution set. For each MU, an 8-bit
unsigned binary number (IP) is to be multiplied by a 6-bit signed binary number (FC)
and a 14-bit signed binary output (OI) is generated. Table 4.2 below shows a summary of
all the elements involved in the multiplication. All signed binary numbers will be
represented as 2’s complement numbers.
Table 4.2. A summary of the multiplication.
Description   Representation                                Range (Decimal)
Multiplicand  8-bit unsigned binary number                  0 to 255
Multiplier    6-bit signed binary number (2's complement)   -32 to 31
Product       14-bit signed binary number (2's complement)  -8192 to 8191
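As a quick sanity check on the ranges in Table 4.2, the extreme products of an 8-bit unsigned multiplicand and a 6-bit two's complement multiplier can be enumerated. The short Python sketch below is illustrative only (the thesis design itself is captured in VHDL); it confirms that every product fits within the 14-bit signed result:

```python
# Extreme products of an 8-bit unsigned multiplicand (0..255)
# and a 6-bit two's-complement multiplier (-32..31).
products = [a * b for a in (0, 255) for b in (-32, 31)]
lo, hi = min(products), max(products)
# A 14-bit two's-complement number spans -8192 .. 8191.
assert -2**13 <= lo and hi <= 2**13 - 1
print(lo, hi)  # -8160 7905
```

The true extremes, -8160 and 7905, leave a small margin inside the 14-bit range of -8192 to 8191.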
Multiplication in binary can be done using the same technique as with the
commonly used paper and pencil method. Partial products are generated based on each
bit of the multiplier and then all the partial products are summed to generate the product.
The number of partial products required is dependent on the number of bits of the
multiplier. Hence, as shown in Table 4.2, a 6-bit signed binary number is used as the
multiplier instead of the 8-bit unsigned binary number, since fewer multiplier bits result in fewer partial products. However, because the multiplier in this case is a signed binary number, for the regular paper and pencil method to work when the multiplier is negative, both the multiplicand and the multiplier must be two's complemented before the multiplication. Otherwise all the partial products would be positive and the result generated would be positive as well, which is incorrect, since a negative multiplier should yield a negative result. Complementing both operands switches the signs between the two while leaving the product unchanged, because (-a)(-b) = ab. In addition, all the partial products need to be sign extended for the multiplication to be correct. Figure 4.6 illustrates this multiplication concept: a copy of the multiplicand is placed into the partial product, with sign extension(s), if the respective multiplier bit is one; otherwise all zeros are placed.
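The scheme just described (negate both operands when the multiplier is negative, then sum one shifted, sign-extended copy of the multiplicand per set multiplier bit) can be modeled behaviorally. The following Python sketch is illustrative only, and the function name is hypothetical; the thesis design itself is captured in VHDL:

```python
def pencil_multiply(b, a, a_bits=6):
    """Paper-and-pencil multiply of an unsigned multiplicand b by a
    two's-complement multiplier a, as in Figure 4.6: one shifted,
    sign-extended copy of the multiplicand per set multiplier bit."""
    if a < 0:
        a, b = -a, -b          # negate BOTH operands; product unchanged
    product = 0
    for i in range(a_bits):
        if (a >> i) & 1:       # multiplier bit i selects a partial product
            product += b << i  # Python ints model the sign extension
    return product

assert pencil_multiply(255, -32) == -8160
assert pencil_multiply(200, 17) == 3400
```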
[Figure 4.6 content: the 8-bit multiplicand B7..B0 is multiplied by the 6-bit multiplier A5..A0; six sign-extended partial products, one per multiplier bit A0 through A5, are summed into the 14-bit product P13..P0.]
Figure 4.6. Illustration of the paper and pencil multiplication technique (s on each row of the partial products denotes sign extension of that particular row of partial product).
As Figure 4.6 shows, the most expensive operation is summing all the partial products into the final result. An adder that adds six operands at once is difficult to design and would likely not be speed efficient. However, a method such as the Multilevel Carry Save Adder (CSA) Tree [10] can be employed to add all the partial products. The Multilevel CSA Tree uses multiple stages of CSAs to reduce the operands to two, and a final adder stage sums these two operands to generate the final result. Depending on the speed requirement, the Multilevel CSA Tree can easily be pipelined into multiple stages to increase throughput. In addition, the final stage adder can be replaced with a fast adder such as a Carry Lookahead Adder (CLA) to reduce latency. Figure 4.7 below shows a possible arrangement of the Multilevel CSA Tree for adding six operands; a five stage pipeline can be implemented with this configuration.
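The reduction performed by the Multilevel CSA Tree can be sketched behaviorally. The Python model below is illustrative only (function names are hypothetical): it reduces any number of non-negative operands to two using 3:2 carry-save steps and finishes with one ordinary addition, which plays the role of the final CLA in hardware:

```python
def csa(x, y, z):
    """3:2 carry-save adder: the sum word is the bitwise XOR, the
    carry word the bitwise majority shifted left one place; their
    total equals x + y + z (non-negative operands assumed)."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1

def csa_tree_sum(operands):
    """Reduce the operand list to two with repeated CSA stages,
    then finish with one conventional add (the CLA in hardware)."""
    ops = list(operands)
    while len(ops) > 2:
        s, c = csa(ops[0], ops[1], ops[2])
        ops = ops[3:] + [s, c]
    return sum(ops)

assert csa_tree_sum([3, 5, 7, 11, 13, 17]) == 56
```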
It is possible to reduce the hardware count if the number of partial products can be
reduced. This can be done thorough use of the Modified Booth’s Algorithm (MBA) [13].
The MBA inspects three multiplier bits at a time and generates respective partial product
selections. Compared to the MBA, the original Booth's Algorithm (BA) [4] inspects two bits of the multiplier at a time, so the number of partial products generated remains proportional to the number of multiplier bits. The MBA reduces the number of partial products required to (x/2 + 1), where x is the number of bits of the multiplier. Thus, for a 6-bit multiplier, the number of partial products generated is reduced from six to four. However, the Partial Products Generator's (PPG) complexity is increased due to the different possible outputs for each partial product. Table 4.3 below gives a summary of the possible outputs for a partial product based on the three multiplier bits examined.
[Figure 4.7 content: the Partial Products Generator produces PP0 through PP5 from the multiplicand and multiplier; four CSAs arranged over three stages, followed by a final CLA, reduce the six partial products to the product.]
Figure 4.7. One possible arrangement of Multilevel CSA Tree for six partial products.
Table 4.3. Partial Product Selection Table.
Multiplier Bits   Selection
000               0
001               + Multiplicand
010               + Multiplicand
011               + 2×Multiplicand
100               - 2×Multiplicand
101               - Multiplicand
110               - Multiplicand
111               0
As shown in Table 4.3, each partial product generated can have a different output and thus the hardware complexity of the PPG is increased. However, each selection is easily obtained: 2× the multiplicand is a left shift by one position, and the negative selections are formed by complementing and adding one. Figure 4.8 below illustrates the changes to the Wallace Tree when the MBA is employed; compared to Figure 4.7, two CSAs are saved.
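The selection logic of Table 4.3 can be modeled behaviorally. In the illustrative Python sketch below (the function name is hypothetical), the three Booth-recoded partial products for a 6-bit multiplier are generated; because Python's signed integers absorb the sign-correction row that the hardware carries as a fourth partial product, the three terms sum directly to the true product:

```python
def booth_partial_products(b, a, a_bits=6):
    """Radix-4 Modified Booth recoding: examine overlapping 3-bit
    groups of the multiplier a (with an implicit 0 below bit 0) and
    select 0, +/-b, or +/-2b for each group, per Table 4.3."""
    a &= (1 << a_bits) - 1                # two's-complement bit pattern
    select = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
              0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    pps = []
    for i in range(0, a_bits, 2):
        group = ((a << 1) >> i) & 0b111   # bits i+1, i, i-1 of a
        pps.append(select[group] * b << i)
    return pps

# Three recoded partial products instead of six, summing to the product:
assert sum(booth_partial_products(255, -32)) == -8160
assert sum(booth_partial_products(200, 17)) == 3400
```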
Figure 4.8. Multiplier based on Modified Booth’s Algorithm and Wallace Tree Structure.
Figure 4.9 below depicts the detailed multiplication technique when the MBA is employed to reduce the number of partial products required. Some hardware saving can be achieved where a Full Adder (FA) can be replaced with a Half Adder (HA) within the MU. Figure 4.10 shows the reduced sign extension within the partial products, which in turn contributes to hardware savings [12].
[Figure 4.9 content: the Partial Products Generator produces PP0 through PP3 (three Booth-recoded partial products, based on bit groups (0, A0, A1), (A1, A2, A3), and (A3, A4, A5), plus a row of sign corrections s1, s2, s3); two CSA stages and a final 14-bit CLA reduce them to the product P13..P0.]
Figure 4.9. Illustration of multiplication technique based on Modified Booth's Algorithm.
[Figure 4.10 content: the same partial product array as Figure 4.9, with the full sign extension of each row reduced to a few correction bits.]
Figure 4.10. Partial Product's sign extension reduced for hardware saving.
Table 4.4 below shows the hardware comparison between multiplication
techniques shown in Figure 4.7 and Figure 4.8. For simplicity, the multiplication
technique shown in Figure 4.7 is denoted as method I and method II denotes the
multiplication technique shown in Figure 4.8. Also, Exclusive-OR gates (EX-OR) within
both methods are counted as five gates. FF in Table 4.4 denotes Flip-Flops.
Table 4.4. Comparison between method I and method II.
            # FA   # HA   # FF   # CLA   Gate Count (excluding FF)
Method I    37     5      78     1       691
Method II   9      15     72     1       641
From Table 4.4, the total gate counts, excluding FFs, are nearly identical, but multiple pipeline stages are included in order to meet the speed requirement. The maximum gate delay for both methods is identical and is set by the CLA (10 gate delays) [11]. Even though the hardware saving between the two methods appears modest, the number of replications of the unit can make it significant. Note also that, for method I to handle two's complement, both the multiplicand and the multiplier must be complemented (complement each bit and add one to the result) whenever the multiplier is negative, so extra overhead is required for method I to function correctly. A workaround that avoids this complementing hardware is to reverse the roles of the multiplicand and the multiplier. This guarantees that the multiplier is always positive, but in return more partial products must be generated, since the multiplier is then an 8-bit unsigned number.
The multiplication technique of Figure 4.8 will be employed in the version 2 architecture because it requires less hardware than the technique of Figure 4.7.
4.2.2. Delay Units (DU)
The Delay Units (DU) are responsible for generating the skewed input image pixels for the MAA. The number of stages within a DU is directly proportional to n and to the number of pipeline stages employed within the MAA. Figure 4.11 below shows the organization of the flip-flops within the DU. The number of stages shown assumes n = 5 and two pipeline stages within each MAU (one pipeline stage after the Multiplication Unit and another after the Adder; the first MAU is excluded since it has only a Multiplication Unit and hence one pipeline stage), as shown in Figure 4.1.
[Figure 4.11 content: pipeline stages PL1 through PL7 within the DU; bus widths taper from 32 bits down to 8 bits through the chain, and 8-bit taps DU0 (7-0) through DU4 (39-32) present progressively delayed copies of the IDS output.]
Figure 4.11. Functional units within the DU for the case of n = 5 and two pipeline stages within each MAU (PL denotes a pipeline stage composed of flip-flop registers).
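The behavior of one DU tap can be sketched as a simple delay line. The Python model below is illustrative only; register depths and names are assumptions, not the VHDL implementation:

```python
from collections import deque

def make_delay_line(depth):
    """One DU tap: a chain of `depth` pipeline registers (flip-flops)
    that delays every input pixel by `depth` clock cycles."""
    regs = deque([0] * depth)        # registers reset to zero
    def clock(pixel_in):
        regs.append(pixel_in)        # shift in the new pixel
        return regs.popleft()        # shift out the oldest pixel
    return clock

# A two-register tap skews its pixel stream by two clock cycles:
tap = make_delay_line(2)
assert [tap(p) for p in [10, 20, 30, 40]] == [0, 0, 10, 20]
```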
4.3. Input Data Shifters (IDS)
This functional unit (see Figure 4.1) is responsible for providing the AU with the
correct input image pixel sequence. Figure 4.12 below shows the structural view of the
Input Data Shifters (IDS). There are n shift registers (S Registers) within the IDS with
each register capable of holding n input image pixels and with parallel load capability.
Each input image pixel is d bits wide. All shift registers also need to be able to shift all n
input image pixels in parallel.
[Figure 4.12 content: S Register0 through S Registern-1 chained by 40-bit parallel buses; the output from the DM I/F loads S Register0, and outputs IDS0 through IDSn-1 feed the AU.]
Figure 4.12. Structural view of the IDS with n = 5 and d = 8.
As shown in Figure 4.12 above, all data within the structure are shifted in parallel
(in this example, it is 40 bits) from one shift register to another shift register. For
example, S Register0 is loaded with 40 bits in parallel from the output of DM I/F and
shifts 40 bits in parallel into S Register1.
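This parallel shifting behavior can be modeled concisely. The Python sketch below is illustrative only (the actual unit is described in VHDL): n registers, each holding n d-bit pixels, perform a full n×d-bit (here 40-bit) parallel shift on every clock:

```python
def make_ids(n=5, d=8):
    """Behavioral model of the Input Data Shifters: n shift registers,
    each holding n d-bit pixels, shifted in parallel (n*d bits) from
    one register to the next on every clock."""
    regs = [[0] * n for _ in range(n)]
    def clock(row_in):                  # n pixels loaded from DM I/F
        for i in reversed(range(1, n)):
            regs[i] = regs[i - 1]       # n*d-bit parallel shift
        regs[0] = list(row_in)
        return regs                     # outputs to the AU
    return clock

ids = make_ids()
ids([1, 2, 3, 4, 5])
state = ids([6, 7, 8, 9, 10])
assert state[0] == [6, 7, 8, 9, 10] and state[1] == [1, 2, 3, 4, 5]
```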
4.4. Data Memory Interface (DM I/F)
The Data Memory Interface (DM I/F) of Figure 4.1 will remain unchanged from
section 3.5. See Figure 3.11 and Figure 3.12.
4.5. Memory Pointers Unit (MPU)
The external memory devices (see Figure 4.1) that are required by the architecture
are read and written through several memory pointers within the Memory Pointer Unit
(MPU). In order to achieve a minimum number of writes to the external memory devices,
MPU receives and stores n (five) input image pixels (for a 5×5 filter size) before it writes
all five input image pixels to the memory location pointed to by one of the memory
pointers. Thus, the bus width for the interconnection between the memory devices and
the architecture is 40 bits (n×d). If the memory accesses cannot keep up with the main
system clock, then the memory bandwidth can be increased to reduce the number of
accesses to the external memory devices.
[Figure 4.13 content: the address space of external memory devices a and b divided into segments of 1024 locations starting at addresses 0, 1024, 2048, 3072, and 4096, designated to ptr_a (ptr_0) through ptr_e (ptr_n-1).]
Figure 4.13. External memory devices organization for n = 5 and d = 8.
Figure 4.13 above shows the memory organization of the external memory devices for the case of n = 5 and d = 8, with a different segment of the memory designated for each memory pointer, while Figure 4.14 below shows the functional units within the MPU, again for the case of n = 5 and d = 8. Each of the n memory segments is capable of storing one row of input image pixels. For example, for an n×d (40) bit memory bus width and a paper width of 5100 pixels, each segment of the memory must hold at least 1020 locations (5100/5). In Figure 4.13 above, 1024 locations are allocated to each memory pointer. By allocating 1024 locations to each memory segment, the three most significant bits of each memory pointer can be used to differentiate the memory segments. Also, for every five output pixels generated, every memory pointer shares a common ten least significant bits, except for the memory pointer that is used to write the current pixels into the external memory devices; the other four memory pointers must pre-fetch all necessary input image pixels for the next convolution iteration into the cache memory. Thus, two 10-bit counters (col_cntr #1 and col_cntr #2) are needed to generate memory addresses, as shown in Figure 4.14.
[Figure 4.14 content: registers holding the three most significant bits of ptr_a through ptr_e, selected by the 3-bit reg_sel signal; the two 10-bit column counters col_cntr #1 and col_cntr #2 supply the least significant bits, producing the 13-bit address output add_out.]
Figure 4.14. Functional units within the Memory Pointers Unit (MPU).
As shown in Figure 4.14 above, the three most significant bits of each memory pointer are stored in registers. Note that when the architecture is reset, each register is initialized to a different value: Memory Pointer b (ptr_b) is initialized to 001, Memory Pointer c (ptr_c) to 010, Memory Pointer d (ptr_d) to 011, Memory Pointer e (ptr_e) to 100, and Memory Pointer a (ptr_a) to 000. This ensures that each memory pointer is designated to a specific memory segment. In addition, the memory pointers are rotated, as indicated in Figure 4.14 above, whenever a row of input image pixels is completed, because the least recent row of input image pixels stored within the external memory devices is no longer needed and can be overwritten with the current input image pixels.
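The segment addressing and pointer rotation can be sketched behaviorally. This Python model is illustrative only; the rotation order and the index-to-pointer mapping are assumptions based on Figure 4.13 and Figure 4.14:

```python
def make_mpu(seg_bits=10):
    """Behavioral model of the Memory Pointers Unit: each pointer owns
    a 1024-location segment selected by its three MSBs, and the
    pointers rotate when a row of input image pixels completes, so the
    oldest row's segment is overwritten by the incoming row."""
    msbs = [0b000, 0b001, 0b010, 0b011, 0b100]   # ptr_a .. ptr_e
    def address(ptr_index, col_count=0):
        # 13-bit address: 3 segment-select MSBs, 10 column-counter LSBs
        return (msbs[ptr_index] << seg_bits) | col_count
    def end_of_row():
        msbs.append(msbs.pop(0))                 # rotate the pointers
    return address, end_of_row

addr, end_of_row = make_mpu()
assert addr(1) == 1024        # ptr_b initially addresses segment 001
end_of_row()
assert addr(0) == 1024        # after rotation, ptr_a reuses that segment
```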
4.6. Systolic Flow of Version 2 Convolution Architecture
This section shows how the input image data flows through the AU of Figure 4.1
for the case of (n = 5) and how each output pixel (OI) is generated in Version 2 of the
Convolution Architecture. Figure 4.15 below shows the systolic flow of the data for
Version 2 of the Convolution Architecture. In order to simplify the figure, all pipeline
stages within the MAAs are ignored and the figure corresponds to Figure 3.10 with the
same convolution point and input image pixels. As can be seen in Figure 4.15 below, at
time t0 every MAA will multiply the input image pixel received with the filter coefficient
that is stored within the first MAU (within the MAA). The next time instant, t0 + 1td where
td denotes the pipeline delay between two MAUs within each MAA, the previous product
from MAU0 (in each MAA) is summed with the product of MAU1. This process continues
until time instant t0 + 4td when all input image pixels (for one convolution point) have
flowed through MAU4 within each MAA and they will be summed by the AT the next
cycle to generate one output pixel.
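Because every multiply-accumulate term appears exactly once, the systolic flow of Figure 4.15 computes the same value as a plain two-dimensional dot product. The Python sketch below is an illustrative behavioral model with a hypothetical function name, not the VHDL pipeline; it mimics the step-by-step accumulation:

```python
def systolic_convolve_point(FC, IP):
    """Each MAA k accumulates one FC[row][k] * IP[row][k] term per
    time step (rows entering bottom-up, as in Figure 4.15), and the
    Adder Tree sums the n MAA results into one output pixel."""
    n = len(FC)
    maa_sums = [0] * n
    for step in range(n):          # t0, t0+1td, ..., t0+(n-1)td
        row = n - 1 - step         # FC row active at this step
        for k in range(n):         # one MAU fires in each MAA
            maa_sums[k] += FC[row][k] * IP[row][k]
    return sum(maa_sums)           # Adder Tree output

FC = [[1, 0, -1]] * 3
IP = [[5, 6, 7], [8, 9, 10], [11, 12, 13]]
assert systolic_convolve_point(FC, IP) == sum(
    FC[r][c] * IP[r][c] for r in range(3) for c in range(3))
```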
4.7. Controller
The controller for version 2 of the architecture, shown in Figure 4.1, is responsible only for controlling the DM I/F and the Memory Pointers Unit (MPU) described above. The AU and the IDS need no controller to regulate their activities: both are clocked by the main clock, and all necessary input image pixels propagate through their pipeline stages as required.
At clock cycles t0, (t0 + 1td), (t0 + 2td), (t0 + 3td), and (t0 + 4td), each MAA holds the following running sums (each term is added to the sum of all terms to its left):
MAA0:  FC40IP16;  + FC30IP26;  + FC20IP36;  + FC10IP46;  + FC00IP56
MAA1:  FC41IP15;  + FC31IP25;  + FC21IP35;  + FC11IP45;  + FC01IP55
MAA2:  FC42IP14;  + FC32IP24;  + FC22IP34;  + FC12IP44;  + FC02IP54
MAA3:  FC43IP13;  + FC33IP23;  + FC23IP33;  + FC13IP43;  + FC03IP53
MAA4:  FC44IP12;  + FC34IP22;  + FC24IP32;  + FC14IP42;  + FC04IP52
Figure 4.15. Pictorial view of the data flow within the MAAs for one output pixel (td denotes the time delay between each MAU).
Figure 4.16 below shows the top level view of the Controller Unit (CU) with the
input and output control signals shown. The CU is responsible for generating control
signals to functional units within the DM I/F and the MPU.
[Figure 4.16 content: Controller Unit (CU) inputs: clock (clk); reset (rst); beginning of a row (bor); end of column (eoc); shut down signal (sds); row greater than (rgt). Outputs: f_sel, cache banks select line; z_pad, zero padding; reg_sel, registers select (3 bits); en_w, write enable for cache (2 bits); en_sf, shift enable for cache (2 bits); z_input, zero input; c_inc, column counter increment (2 bits); rot, rotate memory pointers; r_inc, row counter increment; r_w, memory read/write line; sd_inc, shut down counter increment.]
Figure 4.16. Top level view of the Controller Unit (CU).
Figure 4.17 below shows the functional units for which the CU generates control signals, for the case of n = 5 and d = 8. The functional units labeled C_BANK1 and REG_A are contained within the DM I/F, whereas the functional unit labeled MEMPTRS is the MPU referred to above. The MPU is responsible for generating memory addresses for the external memory devices, which store all the necessary input image pixels (IP) for each convolution. C_BANK1 supplies input image pixels to the Input Data Shifters (IDS) and pre-fetches the necessary input image pixels from the external memory devices for the next iteration of convolutions; in other words, it serves as a cache memory for the convolution system. The functional unit labeled REG_A stores the most recent input image pixels received from the external scanning device and later writes them to the external memory device when its register is full. In addition, this unit also supplies the most current input image pixels to the IDS.
Figure 4.17. Functional Units that receive control signals from the CU.

The convolution system is pipelined into multiple stages requiring synchronized
operation. Thus, the CU is modeled as a finite state machine. Figure 4.18 below shows
the system flow chart for the CU. The system flow chart describes micro-operations of
the system on a clock-cycle by clock-cycle basis and it also indicates values that must be
assigned to appropriate control signals of the architecture on each clock cycle of
operation. Operation of the system flow chart shown below can be divided into three
segments. The first segment runs from the beginning of the flow chart until the row counter (row_cntr) exceeds a count of one. This segment is active as the input image pixels of a scanned page begin to arrive; the convolution process starts only after the first two rows of input image pixels have been received. In this first segment, the two received rows of input image pixels are stored in the external memory device. The purpose of the tog signal within the flow chart is to alternate writing between the two Regfiles within C_BANK1 to avoid data starvation from the external memory device. There are two column counters (col_cntr #1 and col_cntr #2) within the MEMPTRS functional unit: the first generates addresses for ptr_a to write to the external memory device, while the second runs one count ahead of the first to pre-fetch the remaining rows (via ptr_b, ptr_c, ptr_d, and ptr_e) from the external memory device.
After the first two rows of input image pixels have been stored, the convolution process can start as the third row of input image pixels is received. The second operational segment of the flow chart, which starts at the decision box testing whether the row count is greater than one (row_cntr > 1) and ends at connector A in the figure, is active while the convolution process runs. This segment of the flow chart continues until all input image pixels for the entire scanned page have been received.
The last segment of the flow chart starts from the connector A and continues until
the end of the flow chart. This segment is mainly responsible for supplying the system
with zeros as input to the system until the last two rows of the output pixels are
completely generated. As can be seen from the flow chart, a special counter (sd_cntr) is
designated for counting to two for indication of the end of the convolution process.
Control signals such as tog, en_w and en_sf are expected to retain their latest value as the system transitions from one micro-operation to another, so memory elements such as latches are required. If this compromises the CU's speed and performance, then a modification to the system flow chart such as the one shown in Figure 4.19 below would be desirable.
The system flow chart shown in Figure 4.19 contains extra states added to eliminate the shared states (after each decision branch) shown in Figure 4.18. This modification removes the memory elements for the signals tog, en_w and en_sf, which would otherwise need to be toggled after each branch following the decision states, and thereby reduces the control signal generation delay.
Figure 4.18. System flow chart for Version 2 convolution architecture’s Controller Unit (CU).
Figure 4.18. (Continued) System flow chart for Version 2 convolution architecture’s Controller Unit (CU).
Figure 4.19. Modified Version 2 system flow chart.
Figure 4.19. (Continued) Modified Version 2 system flow chart.
4.8. Multiple Filter Coefficient Sets when (k > 1)
To address the need to simultaneously convolute k different sets of Filter
Coefficients (FC) with a single Input Image Plane (IP), such as when scanning and
printing color images, the version 2 architecture will require some hardware to be
replicated. Figure 4.20 below shows a high level view of the arrangement of the
additional required replicated hardware. For each additional FC set, one additional AU
will need to be added. However, not all the functional units within the AU need to be
replicated. A common DU (within the MAAs) can be shared among all the AUs for
additional FC sets (see Figure 4.2 for detail within a MAA). For example, all the MAA0s
within all the AUs can all share a common DU rather than each MAA0 having its own DU.
[Figure 4.20 content: the Input Image Plane (IP) feeds the Data Memory Interface (DM I/F) and the Input Data Shifters (IDS0 through IDSn-1), which drive AU0 through AUk-1 (each containing MAA0 through MAAn-1 and an AT); each AU holds its own filter coefficient set FC1 through FCk and produces outputs OI1 through OIk; a single Controller Unit (CU), clock (CLK), and Data Memory (external or internal) serve all AUs.]
Figure 4.20. Version 2 architecture for k (n×n) filter coefficient sets (where k can be any number).

The CU, DM I/F, and IDS functional units of Figure 4.20 are functionally and
operationally identical to the same units of Figure 4.1 for a given n and d and only one
instantiation of these units is required when k filter coefficient sets are used. This
enhances the scalability of the convolution architecture when expanded to handle
multiple FC planes. Likewise, the CU does not have to control any of the AUs of Figure
4.20. It only has to control the DM I/F and IDS units.
The version 2 convolution architecture of Figure 4.20, from a functional and
performance standpoint, can now simultaneously convolute a single IP with k (n×n) FCs
resulting in k convoluted OI pixels (OI1, OI2, … OIk) on each system clock cycle. This
functionality and performance of the version 2 architecture will first be validated via
HDL post-synthesis and post-implementation simulation in a later chapter of the thesis.
The functionality and performance will finally be validated in a later chapter via
development and experimental testing of a FPGA based hardware prototype.
Chapter 5
VHDL Description of Version 2 Convolution Architecture
This chapter describes the VHDL coding style and approach used to capture the Version 2 convolution architecture. After the design is captured in VHDL, it is synthesized and implemented on a targeted FPGA. Before a hardware prototype is built, functional and performance level simulations are run to validate the design's functionality and determine its performance.
Modular and bottom up hierarchical design approaches were employed during the VHDL design capture process. The modular design approach partitions the entire system into smaller modules or functional units that can be independently designed and described in VHDL. In addition, identical modules (with the same functionality) can share the same VHDL code or reuse a previously designed module. The bottom up hierarchical design approach allows a multiple level view of the entire system for ease of design. Hence, by employing these approaches, the smaller modules or functional units can be tested and validated before they are combined into the complete system.
For prototype purposes the Version 2 convolution architecture is captured with
three AUs instantiated (k = 3), no pipeline stages are built into the multiplication units,
and the architecture is tailored to an input image plane of size 5×60 pixels. The VHDL
described system has a total of 13 pipeline stages within each AU. In addition, the
external memory device as shown in Figure 4.1 is described in VHDL and incorporated
into the overall system and will thus, for the experimental hardware prototype, be
implemented within the FPGA chip containing the other functional units of the
convolution architecture.
Figure 5.1 below shows the organization of functional units within the convolution architecture. For simplicity of the chart, only the main functional units are shown; sub-modules within the main functional units are omitted. Both behavioral and structural level coding styles were used during the VHDL coding process. Behavioral level coding has the advantage that only the behavior of a module is described in the code and the CAD software must infer the internal logic blocks. However, this may introduce inconsistency, since different CAD software may infer different logic blocks for the same code. For this thesis, the behavioral level coding style was employed for most of the functional units; however, all the various sized adders and multiplication units were coded at the structural level, in order to validate the correctness of the multiplication and addition techniques proposed in the previous chapter.
Convolution Architecture
  External Memory Device
  Data Memory Interface
    Memory Pointers Unit
    Cache Unit
  Arithmetic Unit
    Multiplication and Add Array
      Multiplication and Add Unit
        Multiplication Unit
        Various sized Adders
    Adder Tree
  Controller Unit
  Input Data Shifters
Figure 5.1. Version 2 Convolution Architecture organization.
After the system is captured through VHDL, post-synthesis and post-
implementation HDL software simulation can determine if the system is functioning and
performing as it should. The next chapter presents post-synthesis and post-
implementation simulations of the convolution architecture. All VHDL code for Version
2 of the Discrete Convolution Architecture with three AUs (k = 3, see Figure 4.20) is
included in Appendix A. The code is appropriately commented such that one should be
able to identify the VHDL code describing all functional units of the convolution
architecture system.
Chapter 6
Version 2 Convolution Architecture Validation via Virtual Prototyping (Post-Synthesis and Post-Implementation Simulation Experimentation)
Hardware Description Language (HDL) simulation of an architecture design,
sometimes known as virtual prototyping, is an important step in the design flow for fine
tuning and detecting potential problem areas before the design is implemented or
manufactured. In this section, Post-Synthesis simulation results and Post-Implementation
simulation results of version 2 of the convolution architecture will be presented. Both
Post-Synthesis simulation and Post-Implementation simulation are utilities contained
within the Xilinx Foundation 4.1i CAD software packages utilized during this project
[18]. During the process of validating version 2 of the convolution architecture, the computer system used to run the software had the following configuration: an Intel Pentium III 450 MHz processor, 128 MB of memory, and the Windows 98 Second Edition operating system.
After a design has been captured either through schematic capture or via HDLs
such as VHDL or Verilog, software HDL simulation of the design is the next step in the
design flow for functional and timing validation. Software HDL simulation has the
advantage of identifying potential problem areas before a design is implemented (for
FPGA) or manufactured (for ASIC) and hence correction or modification can be made.
Both Post-Synthesis simulation and Post-Implementation simulation are used within the FPGA prototyping design flow because Post-Synthesis simulation validates the functionality of the design, whereas Post-Implementation simulation validates both the functionality and the timing (performance) of the design. Both utilities are therefore important tools for understanding the characteristics of a particular design.
The testing methodology employed in this project uses a bottom up approach: lower level functional components, such as the various types of adders and multipliers, were tested before being combined into higher level functional units. This approach is desirable because it assures that, when the lower level components are combined into higher level functional units, the lower level components are unlikely to be at fault if errors are detected.
6.1. Post-Synthesis Simulation
In order to be assured that version 2 of the convolution architecture functionally
performs as intended in the previous sections, Post-Synthesis simulation was utilized for
functional level validation. Post-Synthesis HDL simulation of a system is simulation of
the system as synthesized to netlist (gate-level) form and zero propagation delay is
assumed through gates. To determine the correctness of the functional unit under test, all
possible input vectors are required to be applied and checked against known correct
outputs or expected outputs from the functional unit under test. Thus, testbenches are
required and need to be developed for this purpose. However, if the number of inputs
to the functional unit under test is large, exhaustively testing all possible input
stimuli can be quite complex. Hence, automated generation of the testbenches is
preferred. To achieve this, a C++ program was written to generate testbenches in the
required format. For ease of re-running the simulation process, the script file editor,
a feature of the Xilinx Foundation Simulator, was used to eliminate manual entry of
test vectors after each simulation run. Figure 6.1 below shows the testing model used
for verifying the functionality of lower level functional components.
The testing model shown in Figure 6.1 below was captured in VHDL as an entity in
which the functional unit under test is instantiated and its output is compared to the
expected (theoretically correct) result from the testbench.
6.1.1. Adders
Different types of Carry Lookahead Adders (CLAs) are employed within the
convolution system. They differ mainly in the length of the operands they operate on
and, accordingly, are referred to as the 14-bit, 15-bit, 16-bit, 17-bit and 19-bit CLAs.
Post-Synthesis simulation was used to check that these lower level functional
components operate as intended. The most heavily utilized is the 14-bit CLA, which is
duplicated within each Multiplication Unit (MU) contained in the convolution system.
[Diagram: stimulus/test vectors from the testbench drive the Functional Unit Under
Test, whose output is compared against the expected (theoretically correct) result from
the testbench; the comparator's err output is zero when both results are identical and
one otherwise.]
Figure 6.1. Testing model for lower level functional components.
The testing methodology described in Figure 6.1 was used. The VHDL file that
contains the testbench entity and C++ program source code that generates the
theoretically correct outputs (in a file format that is acceptable to the script editor for
software simulation) can be found in Appendix B. Figure 6.2 shows the Post-Synthesis
simulation output of the testbench for the 14-bit CLA. Due to the length of an
exhaustive simulation, a selective set of test vectors was used rather than all possible
inputs. As can be seen from Figure 6.2, the signal err remains low throughout the
simulation, indicating that the outputs from the unit under test agree with the
theoretically correct outputs generated by the C++ program.
Figure 6.2. Post-Synthesis simulation for 14-bit CLA.
Figure 6.3 below shows a close up view for one segment of the testbench
simulation shown in Figure 6.2 above. Buses vec_a and vec_b are the two input operands,
while ans_ut is the output from the unit under test (14-bit CLA) and ans is the
theoretically correct output. For instance, at the left-most position of bus vec_a one
sees a hexadecimal value of 0008 (8 in decimal) and vec_b shows a hexadecimal value
of 2000 (-8192 in decimal as a signed 14-bit value); thus the sum should be 2008
(-8184 in decimal), which is the value shown on both buses ans and ans_ut.
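The signed 14-bit arithmetic in this example can be checked with a short C++ sketch; this is only a software reference model of the number interpretation, not the synthesized VHDL adder.

```cpp
#include <cassert>
#include <cstdint>

// 14-bit addition wraps modulo 2^14, matching the CLA's output width.
uint16_t add14(uint16_t a, uint16_t b) { return (uint16_t)((a + b) & 0x3FFF); }

// Interpret a 14-bit pattern as a two's complement signed value.
int to_signed14(uint16_t v) { return (v & 0x2000) ? (int)v - 0x4000 : (int)v; }
```

Here add14(0x0008, 0x2000) yields 0x2008, and to_signed14 maps 0x2000 to -8192 and 0x2008 to -8184, matching the waveform values above.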
Figure 6.3. A close up view of one segment of Figure 6.2.
The procedure for testing all the other CLAs with different operand lengths is the
same as shown above. Figure 6.4, Figure 6.5, Figure 6.6, and Figure 6.7 show Post-
Synthesis simulation results of the testbenches for each CLA. As can be seen from the
figures, the err signal stays low throughout, indicating that the outputs from each unit
under test agree with the predicted correct results generated by the C++ program.
Figure 6.4. Post-Synthesis simulation for 15-bit CLA.
Figure 6.5. Post-Synthesis simulation for 16-bit CLA.
Figure 6.6. Post-Synthesis simulation for 17-bit CLA.
Figure 6.7. Post-Synthesis simulation for 19-bit CLA.
6.1.2. Multiplication Unit
The Multiplication Unit (MU) is a lower level component that is replicated 25
times within each AU of version 2 of the convolution architecture for the case of n = 5.
Hence, it is important to determine that the MU functions correctly. In order to test the MU
with all possible inputs, a C++ program was written to generate the required testbench;
the program can be found in Appendix B. In addition, an entity was created in a VHDL
file with MU being instantiated and a comparator was also instantiated to compare the
output generated by MU with the theoretically correct output from the program (as an
input to the entity). This VHDL file can also be found in Appendix B.
Figure 6.8 below shows the complete run of the testbench. Bus coef is a 6-bit
signed filter coefficient bus, bus mag is an unsigned 8-bit magnitude input, bus product
is the output generated by the unit under test (the MU in this case), and bus t_ans is the
theoretically correct output. Signal err is the output from the comparator; it is one if
the outputs t_ans and product do not match. As shown in Figure 6.8 below, all the
buses are packed too closely to be distinguished due to the length of the simulation;
however, the err signal remains low for the entire simulation. Thus, buses t_ans and
product are identical throughout the simulation.
Figure 6.8. Post-Synthesis simulation for all possible inputs for the Multiplication
Unit (MU).
Figure 6.9 below shows a close up view of one segment of the simulation. For
instance, in the first part of Figure 6.9, bus coef has a value of 21 (-31 in decimal as a
signed 6-bit value) and bus mag has a value of FB (251 in decimal); thus the product
should have the 14-bit signed value 219B in hexadecimal (-7781 in decimal). Buses
product and t_ans both have this value, and thus the err signal has a value of zero,
indicating that the two agree.
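The MU's arithmetic can likewise be modeled in a few lines of C++; this is a software reference only (the function names are ours), not the hardware multiplier itself.

```cpp
#include <cassert>
#include <cstdint>

// Interpret a 6-bit pattern as a two's complement signed coefficient.
int to_signed6(uint8_t c) { return (c & 0x20) ? (int)c - 64 : (int)c; }

// Multiply a signed 6-bit coefficient by an unsigned 8-bit magnitude and
// return the product as a 14-bit two's complement pattern. The product
// range (-32*255 .. 31*255) fits within 14 signed bits.
uint16_t mu_product(uint8_t coef, uint8_t mag) {
    int p = to_signed6(coef) * (int)mag;
    return (uint16_t)(p & 0x3FFF);
}
```

For the worked example, mu_product(0x21, 0xFB) gives 0x219B, i.e. -31 × 251 = -7781.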
Figure 6.9. A close up view of one segment of the simulation in Figure 6.8 above.
6.1.3. Version 2 Convolution Architecture (with k = 1)
In the process of testing Version 2 of the convolution architecture as a whole unit
for the case of k = 1 (see Figure 4.20), a few minor modifications were made to the
system such that the Post-Synthesis simulation can be completed within a reasonable time
frame. However, these modifications do not affect the system’s intended characteristics.
For instance, the intended Input Image Plane (IP) has a size of 5100×6600 pixels; for
simulation purposes the IP size was reduced to 5×60 pixels. This reduction in no way
hampers the system's functional characteristics. The Filter Coefficient Plane (FC)
remains the same as in the previous sections: a 5×5 plane. Figure 6.10 below shows a test
case used to verify the functional correctness of Version 2 of the convolution
architecture. The IP has a size of 5×60, but due to the page width limit (this thesis) only
the first seven columns of the IP can be clearly shown. The same can be said about the
Output Image Plane (OI) in Figure 6.10.
A C++ program was written to generate the test vectors required to program all
MAUs with the correct filter coefficients. This C++ program can read a text file with
filter coefficients indicated within and then generate waveform vectors for the script
editor to use. Figure 6.11 below shows the source file for the program.
Input Image Plane (Decimal)
  0   1   2   3   4   5   6
 60  61  62  63  64  65  66
120 121 122 123 124 125 126
180 181 182 183 184 185 186
240 241 241 241 241 241 241
Filter Coefficient Plane (Decimal)
-1  2  3  0  1
 1 -1  2  1  0
 0  0  1  1  1
 1  1  0  1  1
 2 -2  1  0  1
Output Image Plane (Hexadecimal)
00259 0029C 0031D 00328 00333 0033E 00349
00400 004BD 005B9 005C8 005D7 005E6 005F5
0061F 00790 0093D 0094A 00956 00962 0096E
003C5 005E7 00756 0075F 00768 00771 0077A
002D5 0047D 0069E 006A5 006AB 006B1 006B7
Figure 6.10. Test case 1 with IP and OI of size 5×60 (however, only the first seven columns of both IP and OI are shown due to report page width limit).
The operation of the convolution architecture can be divided into three phases.
The first phase begins when the system commences operation and lasts until the first
two rows of the IP have been stored into the external memory devices; no OI is
generated during this phase. The second phase begins when the system has received
enough of the IP to commence generation of the OI, and ends when the IP has been
received completely. Finally, the third phase begins with the system being provided
with zeros as input and continues until the entire OI has been generated.
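As a point of reference, the computation the three phases carry out is a 5×5 windowed sum over the image. The C++ sketch below is a plain software model only: the zero padding at the image borders is an assumption made here so that the output plane has the same size as the input plane, and the sketch does not model the pipeline timing or the 90 degree pixel rotation of the actual architecture.

```cpp
#include <cassert>
#include <vector>

// Software reference: 5x5 windowed sum with assumed zero padding,
// producing an output plane the same size as the input plane.
std::vector<std::vector<long>>
convolve5x5(const std::vector<std::vector<int>>& ip, const int fc[5][5]) {
    int rows = (int)ip.size(), cols = (int)ip[0].size();
    std::vector<std::vector<long>> oi(rows, std::vector<long>(cols, 0));
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            long acc = 0;
            for (int i = 0; i < 5; ++i)
                for (int j = 0; j < 5; ++j) {
                    int rr = r + i - 2, cc = c + j - 2;  // window around (r, c)
                    if (rr >= 0 && rr < rows && cc >= 0 && cc < cols)
                        acc += (long)fc[i][j] * ip[rr][cc];
                }
            oi[r][c] = acc;
        }
    return oi;
}
```

With an all-ones 5×5 image and all-ones FC plane, the center output pixel sums all 25 products while a corner pixel sums only the 3×3 overlap, illustrating the border behavior under the assumed padding.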
#include <iostream.h>
#include <iomanip.h>
#include <fstream.h>

int main()
{
    ifstream in_file1;
    ofstream out_file1, out_file2;
    in_file1.open("coef.txt");
    out_file1.open("v_coef.dat");
    out_file2.open("v_c_reg.dat");

    int array[5][5];
    int time, count, temp, a, b;
    time = 40;
    count = 1;

    // Read the 5x5 filter coefficient plane, storing it rotated.
    for (a = 4; a >= 0; a--) {
        for (b = 0; b < 5; b++) {
            in_file1 >> temp;
            cout << temp << endl;
            array[b][a] = temp;
        }
    }

    out_file1 << "@" << 0 << "ns=" << 0 << "\\H +" << endl;
    out_file2 << "@" << 0 << "ns=" << 0 << "\\H +" << endl;

    // Emit one coefficient value and one register-select vector
    // every 20 ns in the script editor's waveform format.
    for (a = 0; a < 5; a++) {
        for (b = 0; b < 5; b++) {
            out_file1 << setiosflags(ios::uppercase) << "@" << dec << time
                      << "ns=" << hex << array[a][b] << "\\H +" << endl;
            out_file2 << setiosflags(ios::uppercase) << "@" << dec << time
                      << "ns=" << hex << count << "\\H +" << endl;
            time += 20;
            count++;
        }
    }

    out_file1 << "@" << dec << time << "ns=" << 0 << "\\H" << endl;
    out_file2 << "@" << dec << time << "ns=" << 0 << "\\H" << endl;

    in_file1.close();
    out_file1.close();
    out_file2.close();
    return 0;
}
Figure 6.11. The source code for C++ program that generates test vectors to program the filter coefficients into MAUs.
Figure 6.12 below shows the arrangement of the Filter Coefficients (FCs) within
each MAU contained in the Arithmetic Unit (AU). The FCs appear rotated by 90
degrees (clockwise) because the input image pixels are rotated by 90 degrees
(counterclockwise) before flowing through the AU.
MAU | FC     MAU | FC     MAU | FC     MAU | FC     MAU | FC
 1  | FC40    6  | FC41   11  | FC42   16  | FC43   21  | FC44
 2  | FC30    7  | FC31   12  | FC32   17  | FC33   22  | FC34
 3  | FC20    8  | FC21   13  | FC22   18  | FC23   23  | FC24
 4  | FC10    9  | FC11   14  | FC12   19  | FC13   24  | FC14
 5  | FC00   10  | FC01   15  | FC02   20  | FC03   25  | FC04
[Diagram: the 5×5 Filter Coefficient plane (FC00 through FC44) mapped onto the
5×5 grid of MAUs (numbered 1 through 25) within the Arithmetic Unit.]
Figure 6.12. Arrangement of the Filter Coefficients within the Arithmetic Unit.
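The mapping in Figure 6.12 follows a simple index rule, sketched below in C++ for clarity; the 0-based indexing and the function name are ours, not part of the design.

```cpp
#include <cassert>
#include <utility>

// Map a 0-based MAU index (0..24) to the (row, col) of the FC it holds,
// per Figure 6.12: MAU 1 holds FC40, MAU 2 holds FC30, ..., MAU 25 holds FC04.
std::pair<int, int> mau_to_fc(int m) {
    int col = m / 5;        // FC column advances every five MAUs
    int row = 4 - m % 5;    // FC row counts down 4..0 within each group
    return std::make_pair(row, col);
}
```

This is exactly the 90 degree rotation described above: walking the MAUs in order traverses the FC plane column by column, bottom row first.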
Figure 6.13 and Figure 6.14 below show the Post-Synthesis simulation output of
the first phase of operation for the version 2 convolution architecture based on test case 1
shown in Figure 6.10. Figure 6.13 shows the programming of the FCs into respective
MAUs. As shown in the figure, coef_regs bus acts as a write enable for each MAU within
the Arithmetic Unit (AU) and coef bus is the FC value which is given as input to each
MAU.
Figure 6.13. First phase of operation; programming of FCs into MAUs.
Figure 6.14. First phase of operation; receiving the first two rows of the IP (shown
in figure above is the beginning of the second row of the input pixels).
In Figure 6.14, the input image pixel values are shown in hexadecimal rather than
the decimal values of test case 1 in Figure 6.10. As can be seen in Figure 6.14, no
output pixels are generated in this phase of operation since the system must wait until
the first two rows of the IP have been received. Figure 6.14 also shows that the system
receives five input image pixels and then writes them to one memory location.
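The five-pixels-per-write behavior amounts to packing five 8-bit pixels into one 40-bit memory word. A C++ sketch of this packing is shown below; the byte ordering (first pixel in the most significant byte) is an assumption made for illustration, and the function name is ours.

```cpp
#include <cassert>
#include <cstdint>

// Pack five 8-bit pixels into one 40-bit memory word (held in a uint64_t).
// Placing the first pixel in the most significant byte is assumed here.
uint64_t pack_five_pixels(const uint8_t px[5]) {
    uint64_t word = 0;
    for (int i = 0; i < 5; ++i)
        word = (word << 8) | px[i];
    return word;
}
```

This also explains the 40-bit width of buses o1 through o5 discussed later: each carries one such five-pixel word.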
The second phase of the system operation is shown in the following figures.
Figure 6.15 below shows that the system is generating the first three output pixels for the
second row of the OI as compared to the test case 1 shown in Figure 6.10. Under normal
operation, the output pixels will be generated after multiple stages of pipeline delays
contained within the AU as shown in Figure 6.15. Figure 6.16 shows superimposed
(timing delay not included) output pixels with their corresponding input pixels for ease of
comparison. The output pixels shown in Figure 6.16 are the first six output pixels from
the second row of the OI. Buses o1, o2, o3, o4 and o5 are the output buses from
functional unit IDS to the AU (all bus values are shown in hexadecimal), and each
bus is 40 bits wide (five input image pixels). In Figure 6.16 below, the 25 input image
pixels that correspond to each output pixel are highlighted. As can be seen from Figure
6.16, all the output pixels are as predicted in Figure 6.10.
Figure 6.15. Second phase of operation; output pixels generated.
Figure 6.16. Second phase of operation; output pixels of the second row of OI
(superimposed).
The third phase of system operation is shown in Figure 6.17 below, in which the
system generates the first six output pixels for the last row of the OI. In this phase of
operation, the input image pixels have been completely received and zeros are inserted
into the system. The six output pixels shown in Figure 6.17 were compared against,
and validated by, the output pixels predicted for test case 1 shown in Figure 6.10.
Figure 6.17. Third phase of operation; output pixels of the last row of OI
(superimposed).
A second test (test case 2) was also run to further investigate the correctness of
operation of the system. Figure 6.18 shows the FCs, the IP (first seven columns) and
the OI (first seven columns) for test case 2. As in the previous test case (Figure 6.10),
due to the page width limit only the first seven of the 60 columns of the IP and OI are
displayed.
Input Image Plane (Decimal)
  0   1   2   3   4   5   6
 60  61  62  63  64  65  66
120 121 122 123 124 125 126
180 181 182 183 184 185 186
240 241 241 241 241 241 241
Filter Coefficient Plane (Decimal)
-1  2  1  0  1
 1 -2  1  1  0
 0  0  1 -1  2
 2  1  0  1 -2
 2 -2  1  0  1
Output Image Plane (Hexadecimal)
000F0 0012F 001AA 001B0 001B6 001BC 001C2
001A9 001EB 0031E 00326 0032E 00336 0033E
00314 00392 00500 00508 0050F 00516 0051D
0025E 00318 003D2 003D8 003DE 003E4 003EA
0038B 00354 00448 0044E 00452 00456 0045A
Figure 6.18. Test case 2; IP, FCs and expected OI (the first seven columns).
Figure 6.19, Figure 6.20 and Figure 6.21 show the Post-Synthesis simulation
results of all three phases of operation for version 2 of the convolution architecture
with the IP and FCs as specified in Figure 6.18. All the results from these figures
agree with the expected results shown in Figure 6.18 above.
Figure 6.19. First phase of operation for test case 2.
Figure 6.20. Second phase of operation for test case 2; output pixels shown are the
first six of row one of OI (superimposed).
Figure 6.21. Third phase of operation for test case 2; output pixels shown are the
first six of the last row for OI (superimposed).
6.2. Post-Implementation Simulation
After version 2 of the convolution architecture, for the case of (k = 1), was
functionally validated via post-synthesis HDL simulation, its functional and timing
characteristics were studied and validated through post-implementation simulation. The
following sections will describe and depict synthesis and implementation of the system to
a particular Field Programmable Gate Array (FPGA) chip and the post-implementation
simulations that have been done to validate the version 2 convolution architecture.
6.2.1. Synthesis and Implementation of Version 2 Convolution Architecture (with k = 1)
In general, when a system described in a HDL is synthesized to a specific FPGA
chip, the CAD packages (Xilinx Foundation Series in this case) invoke a process that
translates the system described in HDL to a specific gate level netlist. The gate level
netlist may consist of any gate level elements or functional units that are specific to a
certain family of FPGA; hence, a targeted (specific) FPGA must be specified before the
process begins. Following synthesis is the implementation process, which targets the
specific FPGA chip and includes mapping, placing and routing the netlist within that
chip. Within each FPGA chip there are a
certain number of Configurable Logic Blocks (CLBs), and within each of these CLBs
there are a number of Lookup Tables (LUTs) and memory elements such as Flip-Flops
(FFs). The mapping process implements the gate level netlist to the FPGA chip using all
the available resources. Then, the place and route process determines the best placement
and routing of all the resources used for the mapped system such that all the components
(resources) are connected according to the netlist.
For this project, a prototyping board (XSV800) manufactured by XESS Co. was
used. This protoboard features the Virtex family FPGA chip (XCV800) from Xilinx.
Table 6.1 below summarizes the resources available within the FPGA chip on the
protoboard. There are 4704 CLBs in this specific FPGA chip, and within each CLB
there are four 4-input LUTs and four FFs. Table 6.2 below shows the resource
utilization of the XCV800 chip when version 2 of the convolution architecture (with
k = 1) is implemented.
Table 6.1. Details of FPGA on the XESS protoboard.
FPGA           XCV800 (Virtex FPGA family)
System Gates   888,439
CLB Array      56×84
FF             18,816
4-Input LUT    18,816
Table 6.2. Resource utilization of Version 2 Convolution Architecture (with k = 1).
CLBs                     1,878
FF                       2,620
4-Input LUT              5,955
Equivalent System Gates  96,210
6.2.2. Version 2 Convolution Architecture (with k = 1)
The post-implementation simulations of version 2 of the convolution architecture
with k = 1 were conducted with the same test cases run in the post-synthesis simulations
in the previous section. The script file and C++ programs used in the post-synthesis
simulations were reused in the post-implementation simulation testing and validation
processes described here.
Figure 6.22 and Figure 6.23 below show the results of the second phase and third
phase of operation for post-implementation simulation of test case 1 (see Figure 6.10). As
can be seen from both of the figures, the highlighted output image pixels were as
predicted in Figure 6.10. Figure 6.24 and Figure 6.25 show the second and third phase of
operation of post-implementation simulation for test case 2 (see Figure 6.18). All the
output image pixels highlighted within these figures were in agreement with the predicted
output image pixels as shown in Figure 6.18.
In both Figure 6.22 and Figure 6.24, the second phase of operation is shown: after
the first two rows of the IP have been stored, the convolution architecture starts the
convolution process. Meanwhile, in Figure 6.23 and Figure 6.25 the third phase of
operation is shown: with the entire IP received, zeros are inserted as input for the
convolution system to process the last two rows of the OI.
A clock frequency (clk in all figures) of 12.5 MHz has been used in all the post-
implementation simulations (Figure 6.22, Figure 6.23, Figure 6.24 and Figure 6.25)
conducted thus far. The main objective of the simulation testing described in this section
was to validate the system's functionality and performance with respect to being able to
generate one OI pixel on each system clock cycle with a 5×5 FC. For the case of k = 1,
the convolution architecture met these stated functional and performance goals.
Figure 6.22. Second phase of operation for test case 1 (post-implementation
simulation); output pixels of the second row of OI (superimposed).
Figure 6.23. Third phase of operation for test case 1 (post-implementation
simulation); output pixels of the last row of OI (superimposed).
Figure 6.24. Second phase of operation for test case 2 (post-implementation
simulation); output pixels shown are the first six of row one of OI (superimposed).
Figure 6.25. Third phase of operation for test case 2 (post-implementation
simulation); output pixels shown are the first six of the last row for OI (superimposed).
6.2.3. Synthesis and Implementation of Version 2 Convolution Architecture (k = 3)
As shown in Figure 4.20, the architecture can be scaled up to perform k
convolutions in parallel. To validate this scalability, version 2 of the convolution
architecture with three AUs instantiated was synthesized and implemented to the
XCV800 FPGA chip. Table 6.3 below shows the XCV800 chip
resource utilization as the convolution architecture is implemented. Note that as the
system is scaled up to process three convolutions in parallel, the total number of system
gates does not increase proportionally. As can be seen from Table 6.3, the equivalent
system gate count for k = 3 is 173,170 compared to 96,210 for k = 1 (Table 6.2), an
increase of 80 percent, which is well below a factor of three. This is because when the
system is scaled up only the AUs are replicated; the rest of the system is shared, as
discussed earlier. Comparing the total number of CLBs utilized between the two
implementations does not yield a good measurement since not all the elements within
each CLB are utilized.
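The two measured points suggest a simple linear cost model: a fixed block of shared logic plus a per-AU increment. The C++ sketch below fits that model to the measured k = 1 and k = 3 gate counts; any value it predicts for other k is an extrapolation of ours, not a measured result.

```cpp
#include <cassert>

// Linear gate-count model fitted to the two measured points
// (k = 1: 96,210 gates; k = 3: 173,170 gates).
long estimated_gates(int k) {
    const long per_au = (173170L - 96210L) / 2;  // 38,480 gates per added AU
    const long shared = 96210L - per_au;         // 57,730 gates of shared logic
    return shared + (long)k * per_au;
}
```

Under this model, roughly 57,730 gates are shared infrastructure and each additional AU costs about 38,480 gates, which is why total hardware grows much more slowly than k.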
Table 6.3. Resource utilization of Version 2 Convolution Architecture (with k = 3).
CLBs                     4,613
FF                       5,226
4-Input LUT              15,307
Equivalent System Gates  173,170
6.2.4. Validation of Version 2 Convolution Architecture (with k = 3)
In order to validate that version 2 of the convolution architecture can be scaled up
to include more than one AU (k > 1 in Figure 4.20) and continue to operate correctly from
a functional and performance standpoint, this section presents post-implementation
simulation results of version 2 convolution architecture operating with three instantiated
AUs. All VHDL code for version 2 of the convolution architecture with three AUs
instantiated can be found in Appendix A.
To validate the output image planes (OIs) generated by the version 2 convolution
architecture for k = 3, a C++ program was written that generates different input image
planes of size 5×60 pixels depending on the seed number given. The program uses the
rand function to generate random numbers from the given seed, limited to the range
0 to 255. In addition, the program generates the three expected output image planes
from the three filter coefficient planes that it reads in. The source code of this program
can be found in Appendix C. Another program generates all the test vectors necessary
to program each individual MAU with its filter coefficients. This program reads in
three FC planes contained in a text file and then generates the test vectors in the script
editor format (source code for this program can also be found in Appendix C).
Two test cases were simulated post-implementation; each test case was run with a
single, distinct IP (generated by giving a different seed number) and three distinct FC
sets. This was done to further validate the correct operation and performance of the
version 2 convolution architecture with k = 3. Figure 6.26 below shows test case
number 1 with the inputs and expected outputs (generated by the C++ program
mentioned in the paragraph above). Again, due to the page width limitation, the figure
only shows part of the IP and the predicted OIs.
Figure 6.26. Test Case 1: FC planes, IP plane and the predicted OI planes.
Figure 6.27 and Figure 6.28 below show the results of the post-implementation
HDL simulation with the inputs of test case 1 (see Figure 6.26). Figure 6.27 shows the
output image pixels for the first row of the three OIs starting from the third output image
pixel (signals out_pxl1, out_pxl2, and out_pxl3 were output image pixels for the first OI,
second OI and third OI respectively). Figure 6.28 shows the second row of output image
pixels (start from the third pixel) for all three OIs. All output pixels generated by post-
implementation simulation of the version 2 convolution architecture system agreed with
the expected results shown in Figure 6.26.
As can be seen from Figure 6.27 and Figure 6.28 all the input image pixels
highlighted within each rectangle correspond to the 25 input image pixels required for all
three convolutions (one output image pixel per FC plane). Again, the clock frequency
that has been utilized in the post-implementation simulation run of test case 1 in Figure
6.27 and Figure 6.28 is 12.5 MHz.
Figure 6.27. Superimposed output image pixels (start from the 3rd pixel) for first
row of the OIs for test case 1.
Figure 6.28. Superimposed output image pixels (from 3rd pixel onward) of the
second row of the OIs for test case 1.
For test case number 2, the IP generator program was given a seed number of 2
and hence a different IP plane was produced as shown in Figure 6.29. Figure 6.30 (OIs
result for third row) and Figure 6.31 (OIs result for fourth row) show the post-
implementation simulation result for test case 2. The output results from both figures
agreed with the predicted results shown in Figure 6.29. The clock frequency for test
case 2 is the same as in test case 1. All the highlighted input image pixels within each
rectangle correspond to the 25 input image pixels required for each output pixel
generated. Each individual OI pixel is generated within a single system clock cycle.
Figure 6.29. Test case 2: FC planes, IP plane and the predicted OI planes.
Validation of version 2 of the convolution architecture has been accomplished
through Post-Synthesis and Post-Implementation HDL simulation utilizing the Xilinx
Foundation CAD software packages. All the simulations were done with the system
implemented to a Xilinx Virtex FPGA (XCV800). As the system is scaled up to process
k convolutions in parallel, the additional hardware grows linearly with k since only the
AUs are replicated. A graph of the equivalent system gate count versus the number of
FC planes is plotted in Figure 6.32 below. Since all the simulation results are correct
and as desired, the version 2 convolution architecture is functionally and performance
validated in that it can correctly generate three OI pixels (OI1, OI2, and OI3) within
one system clock cycle (with k = 3).
Figure 6.30. Superimposed output image pixels (start from the 3rd pixel) for third
row of the OIs for test case 2.
Figure 6.31. Superimposed output image pixels (from 3rd pixel onward) of the fourth
row of the OIs for test case 2.
[Plot: Equivalent System Gates versus Number of FC planes; the equivalent gate count
rises from roughly 96,000 gates at k = 1 to roughly 173,000 gates at k = 3, with the
vertical axis spanning 0 to 200,000 gates.]
Figure 6.32. A plot of equivalent system gates versus number of FC planes.
Chapter 7
Hardware Prototype Development and Testing
Hardware prototype development and testing were done to experimentally
validate the functionality and performance of the convolution architecture. Ideally, the
convolution architecture would be implemented in ASIC technology with external
SRAM (Data Memory) as shown in Figure 7.1 below. In the figure, b and l denote the
bus widths of the external SRAM address bus and of an output image pixel,
respectively. CE, OE, and WE are the chip enable, output enable and write enable
control signals for the external SRAM. For example, to implement the convolution
architecture with three 5×5 FC planes, a total of 113 IO (Input/Output) pins are needed
on the FPGA or ASIC.
[Diagram: the FPGA or ASIC implementation of the convolution architecture connects
to external SRAMs through an n × d data bus, a b-bit address bus and the CE, OE and
WE control signals, and drives k l-bit Output Image plane buses (OI1, OI2, ..., OIk).]
Figure 7.1. Convolution Architecture hardware implementation.
To further validate the convolution architecture functionality and performance
correctness, hardware emulation of version 2 of the convolution architecture is done
through the development and testing of a FPGA based prototype. A hardware prototyping
board manufactured by the XESS Corporation [17] which features Xilinx Virtex FPGA
(XCV800) technology was available and utilized. Figure 7.2 below shows a picture of the
XSV-800 prototype board. Even though the XCV800 FPGA has enough IO pins to handle
the convolution architecture configuration shown in Figure 7.1 above, the SRAM on the
prototyping board has lower data bandwidth than desired. Because of this, the Data
Memory was inferred or emulated within the FPGA.
Figure 7.2. XSV-800 prototype board featuring Xilinx Virtex 800 FPGA (picture obtained from XESS Co. website, http://www.xess.com).
The hardware emulation is carried out by programming the FPGA with the
convolution architecture through the parallel port of a computer. The Xilinx Foundation
series CAD software package [18] was utilized to generate the bit stream file (FPGA
configuration bit stream) necessary to program the FPGA with the desired convolution
architecture hardware description. A software utility package, XSTOOLs, is provided
by the XESS Corporation for use with the prototyping board. The package includes
programs for bit stream download (FPGA programming), clock frequency setting, and
on-board SRAM content retrieval or initialization.
7.1. Board Utilization Modules and Prototype Setup
As shown in Figure 7.2, the prototyping board contains many auxiliary parts such
as push buttons, LEDs, SRAMs, parallel port and so on. However, to utilize these parts,
the FPGA needs to be programmed with the appropriate driver or module. These drivers
or modules are implemented within the FPGA since all these parts are connected to the
FPGA. Thus, for the purpose of hardware emulation of the convolution architecture
system, a means of supplying the input image plane (IP) and storing of the output image
plane (OI) is necessary and must be developed. An internal Block RAM within the FPGA
is implemented to provide the convolution system with the input image pixels. The
internal RAM is initialized with input image pixels when the system is synthesized and
implemented. Figure 7.3 below shows a pictorial view of the prototyping hardware. All
functional blocks or modules within the FPGA were implemented with VHDL.
[Diagram: within the FPGA, a Block RAM module supplies 8-bit input pixels to the
Convolution System, which is driven by an external clock; an FC Programming
module, fed 6-bit filter coefficients from the computer through the parallel port,
initializes the coefficient registers; and an SRAM driver carries the output pixels to the
external SRAM (Data Memory) over 21 data lines, a 19-bit address bus and 6 control
signals.]
Figure 7.3. Top level view of the prototyping hardware.
In Figure 7.3 there is an FC Programming module. This module, as the name
implies, is responsible for initializing all the filter coefficients within the convolution
system. The filter coefficients are supplied from the computer through the parallel port.
A C++ program (which can be found in Appendix D) was written to read the filter
coefficients for each FC plane from a file and send them through the parallel port to the
module. The FC Programming module receives two bytes of data from the parallel port
to program one filter coefficient register. In addition, the FC Programming module also
sends the output image plane to the external SRAM for later storage and analysis. Due
to the width limit of the external SRAM data bus, only one OI can be retained from
each run.
Also shown in the figure is the Block RAM module within the FPGA chip. This
module is responsible for providing the convolution system with the input image pixels
and is implemented in VHDL. It uses the internal RAM within the FPGA to store the
input image pixels. To generate the VHDL file for this module easily, a C++ program
was written that produces the file from a provided random number seed. By using the
same seed numbers as those used in the previous chapter, the same input image pixels
can be generated. An example of the generated VHDL file is shown in Figure 7.4
below. The C++ program can
be found in Appendix D. library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity IN_RAM is port( clk: in std_logic; rst: in std_logic; req: in std_logic; dout: out std_logic_vector(7 downto 0) ); end entity IN_RAM; architecture STRUCT of IN_RAM is component RAMB4_S8 is port( DI: in std_logic_vector(7 downto 0); EN: in std_logic; WE: in std_logic; RST: in std_logic; CLK: in std_logic; ADDR: in std_logic_vector(8 downto 0); DO: out std_logic_vector(7 downto 0) ); end component RAMB4_S8; attribute INIT_00: string; attribute INIT_00 of IRM: label is "9fb6add75e5a290b8713f55f888c72ce37e06d31362b091dd779a254c5b24a30"; attribute INIT_01: string; attribute INIT_01 of IRM: label is "80b20000c7212be922035da80f3273676826daf8fd91c35f0b99e093f209c61c"; attribute INIT_02: string; attribute INIT_02 of IRM: label is "2f2d2f198cec76303797682ed5553d18eb5345050260bccdc4eed36fc92e4910"; attribute INIT_03: string; attribute INIT_03 of IRM: label is "e4392f650000e90330e84c421110aa1db0d9d34e544884c1b5a7ce5aeaff060d"; attribute INIT_04: string; attribute INIT_04 of IRM: label is "29af02af2cf91bfc241bd2ada5d75262f228d437b5d0fc6e8f18bb82b5216f9e"; attribute INIT_05: string; attribute INIT_05 of IRM: label is "8f552a661c42000095af0ea4bb1cfdb88c34cdf122f8ae8904447e84657c6a00"; attribute INIT_06: string; attribute INIT_06 of IRM: label is "e1ab005cf16e41b0d174d275a95fa85c177eb6d6ec1b2ef67cd891e33f88cfb7"; attribute INIT_07: string; attribute INIT_07 of IRM: label is "679780f0f91cba3e000007b707bc25aba634015c6ab3c55053130fd44d7f9ee8"; attribute INIT_08: string; attribute INIT_08 of IRM: label is "53bb6d2e0a67fd7071d852926a66ce6a617485308dc35ad9177c391f8f32a8b3"; attribute INIT_09: string; attribute INIT_09 of IRM: label is "00000000000000000000000066896a4dc61c8c7b1434242dcc20e505cb7aa7a9"; signal din : std_logic_vector(7 downto 0); signal addr: unsigned(8 downto 0);
Figure 7.4. Example of a VHDL file for creating an internal Block RAM containing input image pixels for the convolution system (seed number of 1 is provided to the
program).
    signal adr : std_logic_vector(8 downto 0);
    signal en  : std_logic;
    signal we  : std_logic;
begin
    L1: din <= (others=>'0');
    L2: en  <= '1';
    L3: we  <= '0';
    L4: adr <= std_logic_vector(addr);

    P1: process(clk, rst, req) is
    begin
        if (rst = '1') then
            addr <= (others=>'0');
        elsif (clk'event and clk = '1') then
            if (req = '1') then
                addr <= addr + 1;
            end if;
        end if;
    end process P1;

    IRM: RAMB4_S8 port map(DI=>din, EN=>en, WE=>we, RST=>rst, CLK=>clk,
                           ADDR=>adr, DO=>dout);
end architecture STRUCT;
Figure 7.4 (Continued). Example of a VHDL file for creating an internal Block RAM containing input image pixels for the convolution system (seed number of 1 is provided to the program).

In addition to the two modules mentioned above, there is another module that is
responsible for controlling the external SRAM (the SRAM chips are external to the FPGA). This module generates progressively ascending addresses for storing the output image pixels, along with the other signals, such as cen (chip enable) and wen (write enable), required for the proper functioning of the SRAM.
All modules mentioned in this section were developed and implemented via
VHDL description. The VHDL files for all these modules can be found in Appendix E.
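To illustrate the seed-driven Block RAM generation described above, the following C++ sketch shows how one INIT_xx attribute string (32 bytes, 64 hex digits) could be built from a seed; the linear congruential generator and helper name are placeholders for illustration, since the actual program appears in Appendix D, and the byte order follows the Xilinx convention of placing the lowest-addressed byte at the right of the string:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Illustrative sketch of how the Block RAM generator can derive an INIT_xx
// attribute string from a seed.  Each INIT string holds 32 bytes (64 hex
// digits); the byte at the lowest address appears at the right-hand end.
// The LCG below is a placeholder -- the thesis program (Appendix D) uses
// its own random number generator.
std::string makeInitString(uint32_t& state) {
    std::string s;
    for (int byte = 31; byte >= 0; --byte) {
        state = state * 1103515245u + 12345u;     // LCG step (placeholder)
        uint8_t pixel = (state >> 16) & 0xFF;     // one 8-bit input pixel
        char buf[3];
        std::snprintf(buf, sizeof buf, "%02x", pixel);
        s += buf;                                 // highest address first
    }
    return s;                                     // 64 hex characters
}
```

Calling the generator repeatedly with the same starting seed reproduces the same sequence of strings, which is how identical input image planes were regenerated across simulation and hardware runs.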
7.2. Hardware Prototyping Flow
After a design is synthesized and implemented through use of CAD packages, a
bit stream file (FPGA configuration bit file) for a specific FPGA chip is generated. In this
case, the bit stream file contains the configuration bits for the convolution architecture as
well as the auxiliary modules generated for a Xilinx XCV800 FPGA chip. Next, the bit
stream file is programmed into the FPGA through the parallel port of a computer. For this particular prototyping board from XESS Co., an FPGA configuration and download program, gxsload, is provided. Figure 7.5 shows the graphical interface of the gxsload
program once it is executed.
Figure 7.5. FPGA configuration and bit stream download program, gxsload from
XESS Co.
After the FPGA chip has been configured with the convolution architecture, it is ready for experimental hardware testing and validation. Since the input image pixels are stored in the Block RAM module within the FPGA, the only time the system requires input external to the FPGA is for filter coefficient programming. This is done through hardware (the FC Programming module) and software. The software program (found in Appendix D) was developed in C++ to read the filter coefficients from a text file, coef.txt (the same file as shown in Figure 6.26), and send all the data through the parallel port to the convolution architecture. This file also specifies which OI plane the external SRAM stores during each experimental run. Figure 7.6 below shows a segment of the verbose output from the execution of the FCs configuration program. The program enters the filter coefficients in the order shown in Figure 6.12. For each filter coefficient, two bytes of data are sent through the parallel port: the first byte indicates the position of the filter coefficient within the AU, and the following byte is the filter coefficient itself.
Figure 7.6. Execution of the FCs configuration program.
Next, the convolution process commences when the start push button on the prototyping board is pressed; one of the push buttons on the board is mapped to the start signal of the convolution system. Since the execution of the convolution architecture is transparent to the user, an LED on the prototyping board is mapped to the inverse of the SRAM's write enable signal (the inverse of the wen signal at the highest level of the VHDL description). Consequently, once the convolution architecture finishes its execution, the SRAM's write enable line is pulled low and the LED lights.
Then, the output image pixels stored in the external SRAM are retrieved by using
the gxsload program, which is the same program used to download the FPGA
configuration file. Figure 7.7 shows the graphical interface of the gxsload program when
used to upload SRAM contents to a file. The uploaded SRAM content is stored in an Intel
hex file format. Figure 7.8 below shows the uploaded SRAM contents in a file. There are two banks of SRAM on the prototyping board, a left bank and a right bank. Each bank has a 16-bit data bus and a 19-bit address bus. Since the output image pixels are 19 bits wide, both banks of the SRAM are utilized.
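The splitting of each 19-bit output pixel across the two 16-bit banks can be sketched as follows; the exact bit placement (low 16 bits to the right bank, the remaining three bits zero-extended into the left bank) is an assumption consistent with the Figure 7.8 caption:

```cpp
#include <cstdint>
#include <utility>

// Sketch of how a 19-bit output pixel could be split across the two 16-bit
// SRAM banks.  The assumed placement (low 16 bits to the right bank, the
// 3 high bits zero-extended into the left bank) follows the Figure 7.8
// caption, but the exact arrangement is the board driver's choice.
std::pair<uint16_t, uint16_t> splitPixel(uint32_t pixel19) {
    uint16_t right = static_cast<uint16_t>(pixel19 & 0xFFFF);      // bits 15..0
    uint16_t left  = static_cast<uint16_t>((pixel19 >> 16) & 0x7); // bits 18..16
    return { right, left };
}

uint32_t joinPixel(uint16_t right, uint16_t left) {
    return (static_cast<uint32_t>(left & 0x7) << 16) | right;      // rebuild 19 bits
}
```

The comparison program would apply the inverse operation (joinPixel here) when reassembling output pixels from the uploaded bank contents.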
As evident from Figure 7.8 below, it is tedious to trace and compare the uploaded
SRAM contents to the expected output results. As mentioned in the previous section, a
program was written to generate the theoretically correct output image pixels (shown in
Figure 6.26). In order to compare the uploaded results with the known correct results
efficiently, a C++ program was written to parse the Intel hex file and check it against the theoretically correct output. The source code for this program can be found
in Appendix D.
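A core piece of such a parser is validating each Intel hex record before extracting its data bytes; a minimal sketch follows (helper names are illustrative, not those of the Appendix D program):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Minimal Intel hex record check, of the kind needed when parsing the
// uploaded SRAM contents: every byte of the record (length, address, type,
// data, and the trailing checksum) must sum to zero modulo 256.  Helper
// names here are illustrative, not those of the thesis program.
static int hexNibble(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

bool recordChecksumOk(const std::string& line) {
    // A record is ':' followed by an even number of hex digits.
    if (line.empty() || line[0] != ':' || line.size() % 2 == 0) return false;
    uint8_t sum = 0;
    for (std::size_t i = 1; i + 1 < line.size(); i += 2) {
        int hi = hexNibble(line[i]), lo = hexNibble(line[i + 1]);
        if (hi < 0 || lo < 0) return false;
        sum = static_cast<uint8_t>(sum + ((hi << 4) | lo));
    }
    return sum == 0;  // the checksum byte makes a valid record sum to zero
}
```

For example, the standard end-of-file record ":00000001FF" passes this check, while any corrupted byte makes the sum nonzero.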
Figure 7.7. Uploading SRAM content using the gxsload utility; the high address indicates the upper bound of the SRAM address space, whereas the low address indicates the lower bound.
Figure 7.8. Uploaded SRAM contents stored in a file (Intel hex file format). There are two segments because the program wrote the right bank of the SRAM (16-bit) first and the left bank of the SRAM (16 MSB bits) next.
7.3. Test Cases
To validate correct functional and performance operation of the FPGA based
hardware prototype of the convolution architecture, two test cases were run. The
convolution architecture was run at a 2 kHz clock frequency for these test cases; maximum clock rate performance of the hardware prototype was not a goal for these two tests. The
performance metric of interest is whether the prototype can simultaneously convolute one
IP with k (n×n) FCs and generate k OI pixels on each system clock cycle. These two test
cases are shown in Figure 6.26 and Figure 6.29 of the previous chapter. Since each test case has three different Filter Coefficient (FC) planes, three experimental runs must be carried out. Three runs are needed, even though all three OI planes are generated during each run, because of the SRAM data bus width limitation: each OI pixel requires 19 bits, and the SRAM data bus is only 32 bits wide.
Figure 7.9 (first OI plane), Figure 7.10 (second OI plane) and Figure 7.11 (third OI plane) show the results obtained from the SRAM after each experimental run. The grayed areas of the figures are the Intel hex file header and checksums, and the highlighted boxes with arrows projecting to the bottom of the figure mark the first OI pixel of the respective OI plane. To obtain the second OI pixel, slide the window to the next column as marked. Comparison of all three obtained OI planes with the results shown in Figure 6.26 reveals that they match. The comparison program, executed on each of the experimental runs, showed that the obtained results are identical to the expected results.
Figure 7.9. SRAM contents retrieved for first OI plane for test case 1.
Figure 7.10. SRAM contents retrieved for second OI plane for test case 1.
Figure 7.11. SRAM contents retrieved for third OI plane for test case 1.
Figure 7.12 (first OI plane), Figure 7.13 (second OI plane) and Figure 7.14 (third
OI plane) below show the experimental runs for test case 2. After each experimental run, the OI plane retrieved from the SRAM was compared, using the comparison program, with the expected results shown in Figure 6.29 of the previous chapter. All three OI planes matched the expected results, thus again validating the correctness of the convolution architecture.
The results obtained from testing the hardware prototype thus further validate the functional and performance correctness of version 2 of the convolution architecture.
Figure 7.12. SRAM contents retrieved for first OI plane for test case 2.
Figure 7.13. SRAM contents retrieved for second OI plane for test case 2.
Figure 7.14. SRAM contents retrieved for third OI plane for test case 2.
Chapter 8
Conclusions

In summary, the main objective of this thesis research project was to develop the
architecture for and design, validate, and build a hardware prototype of a convolution
architecture capable of processing an input image plane such that an output image pixel is
obtained every clock cycle assuming convolution with one FC plane. In addition, the
convolution architecture needed to be scalable in both the filter coefficient plane size
(kernel size) and the number of filter coefficient planes which could be simultaneously
processed. The motivations behind this scalability were, first, that the convolution architecture can be tailored to any kernel size and still produce one output image pixel per clock cycle, and second, that the architecture can hold k kernels of any size while retaining the functional and performance capability to output k output image pixels on each system clock cycle.
The developed convolution architecture was captured through the use of the
VHDL hardware description language. Xilinx Foundation series CAD software packages
were used to synthesize and implement the architecture onto an FPGA chip. Before the architecture was loaded onto the prototyping board for experimental testing, it was validated functionally and for performance through post-synthesis and post-implementation HDL software simulations.
Experimental testing of the architecture was done on a prototyping board that featured a
Virtex family FPGA.
Post-synthesis and post-implementation HDL software simulation and experimental testing of the hardware prototype showed that the implemented convolution architecture was indeed functionally correct as intended. It
is felt that if the convolution architecture were implemented in high speed ASIC
“production” technology with a high speed external SRAM, the intent that a single IP
pixel can be convoluted with k (n×n) FCs and k OI pixels generated within a clock cycle
of 7.3 ns could be achieved. As earlier indicated, pipelining of the multiply unit within the MAUs of the AUs would, if required, further increase overall system performance.
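As an illustrative back-of-the-envelope check of what the 7.3 ns cycle-time target implies (the image size used below is a hypothetical example, not a figure from the thesis experiments):

```cpp
#include <cstdint>

// Illustrative throughput arithmetic for the 7.3 ns cycle-time target:
// one IP pixel is consumed and k OI pixels are produced per cycle, so an
// (i x j) image takes roughly i*j cycles regardless of k.  Any concrete
// image size plugged into these helpers is a hypothetical example.
double planeTimeSeconds(int64_t pixels, double cycle_s) {
    return static_cast<double>(pixels) * cycle_s;  // one pixel per cycle
}

double outputPixelsPerSecond(int k, double cycle_s) {
    return k / cycle_s;                            // k OI pixels per cycle
}
```

For example, a hypothetical 1024×1024 IP plane would take about 7.65 ms per pass, and with k = 3 the architecture would emit roughly 411 million output pixels per second.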
As a side note, a convolution program with three 5×5 FC planes and a 5100×6600 IP plane was run on a general purpose processor (a 650 MHz AMD Athlon) for a "loose" comparison to the performance of the new convolution architecture. The processor used on average 0.4 seconds of system time to convolute the one IP plane with the three FC planes, which indicates that a processor running at around 260 MHz would be able to meet the "production" requirements for the new convolution architecture system. However, the cost/performance ratio of the general purpose processor will be higher than that of the version 2 convolution architecture implemented in ASIC technology, considering the die sizes of the two architectures (the convolution architecture has roughly less than ten percent of the general purpose processor's transistor count).
In conclusion, the best cost/performance ratio can be obtained from implementing
the new convolution architecture in “production” ASIC technology which should allow
the system clock of the convolution architecture to have a desired cycle time of 7.3 ns or
less. Thus, the primary factors that determine the performance of the new convolution architecture are the speed of the implementation technology, the optimization of the layout/placement of the implementation to reduce longest-path delays, and the degree of pipelining one chooses to design into the system.
Appendix A
VHDL Code for Version 2 Discrete Convolution Architecture (Figure 4.20 for k = 3)
1. Version 2 Convolution Architecture -- sys.vhd (Top Level System of Version 2 Convolution Architecture) library IEEE; use IEEE.std_logic_1164.all; entity SYS is port( clk, rst, str: in std_logic; d_in: in std_logic_vector(7 downto 0); --(FIFO -> DM_IF) coef: in std_logic_vector(5 downto 0); --(FCs from parallel port) ld_reg: in std_logic_vector(4 downto 0); --(MAUs select from pp) au_sel: in std_logic_vector(1 downto 0); --(AU select from pp) o_sel: in std_logic; --(Output config pin from pp) req: out std_logic; --(Controller -> FIFO) sram_w: out std_logic; --(SYS -> SRAM) d_out: out std_logic_vector(18 downto 0) ); end entity SYS; architecture STRUCT of SYS is component CTR is port( clk, rst, str, bor, eoc, sds, rgt: in std_logic; f_sel: out std_logic; z_pad: out std_logic; reg_sel: out std_logic_vector(2 downto 0); en_w: out std_logic_vector(1 downto 0); en_sf: out std_logic_vector(1 downto 0); z_input: out std_logic; c_inc: out std_logic_vector(1 downto 0); rot: out std_logic; r_inc: out std_logic; req: out std_logic; r_w: out std_logic; s_inc: out std_logic; sd_inc: out std_logic; ans: out std_logic ); end component CTR; component RCNT is port( clk, rst, r_inc, sd_inc: in std_logic; eoc, sds, rgt: out std_logic ); end component RCNT; component REG_A is generic( n: integer := 8; -- denotes the data width d: integer := 5 );-- denotes the number of registers port( clk, rst, z_pad, z_input: in std_logic; reg_sel: in std_logic_vector(2 downto 0); d_in: in std_logic_vector(n-1 downto 0); d_out: out std_logic_vector((n*d)-1 downto 0); ids_out: out std_logic_vector(n-1 downto 0) ); end component REG_A; component C_BANK1 is port( clk, rst, f_sel, z_pad: in std_logic; en_sf, en_w: in std_logic_vector(1 downto 0); ld_reg: in std_logic_vector(2 downto 0); bseq: in std_logic_vector(3 downto 0); d_in: in std_logic_vector(39 downto 0); d_out: out std_logic_vector(31 downto 0)); end component C_BANK1; component RAM is port( wclk, r_w: in std_logic; d_in: in 
std_logic_vector(39 downto 0); addr: in std_logic_vector(6 downto 0);
d_out: out std_logic_vector(39 downto 0) ); end component RAM; component MEMPTR is port( clk, rst, rot, s_inc: in std_logic; inc: in std_logic_vector(1 downto 0); reg_sel: in std_logic_vector(2 downto 0); bor: out std_logic; -- begining of a new row bseq: out std_logic_vector(3 downto 0); add_out: out std_logic_vector(6 downto 0) ); end component MEMPTR; component IDS is port( clk, rst, ans: in std_logic; ids_in: in std_logic_vector(39 downto 0); o1, o2, o3, o4, o5: out std_logic_vector(39 downto 0)); end component IDS; component DU is port( clk, rst: in std_logic; ids_in: in std_logic_vector(39 downto 0); du_out: out std_logic_vector(39 downto 0)); end component DU; component AU is port( clk, rst: in std_logic; du0, du1, du2, du3, du4: in std_logic_vector(39 downto 0); p_en: in std_logic; ld_reg: in std_logic_vector(4 downto 0); coef: in std_logic_vector(5 downto 0); out_pxl: out std_logic_vector(18 downto 0); ovf: out std_logic); end component AU; -- Internal signals to connect components signal r_inc, sd_inc, eoc, sds, rgt: std_logic; --(Row counter <-> Controller) -- signal req : std_logic; signal z_pad, z_input : std_logic; --(Controller -> REG_A ) signal rot, s_inc, bor : std_logic; --(Controller -> MEMPTR ) signal r_w : std_logic; --(Controller -> RAM ) signal f_sel : std_logic; --(Controller -> BANK_1 ) signal ans : std_logic; --(Controller -> IDS ) signal a1, a2, a3 : std_logic; --(Controller->a1->a2->a3->ans) signal ovf1, ovf2, ovf3 : std_logic; --(Overflow from AUs) --(Controller -> MEMPTR) signal c_inc : std_logic_vector(1 downto 0); --(Controller -> BANK_1) signal en_w, en_sf : std_logic_vector(1 downto 0); --(Controller -> REG_A, MEMPTR) signal reg_sel : std_logic_vector(2 downto 0); --(REG_A -> RAM) signal rega_ram : std_logic_vector(39 downto 0); --(REG_A -> IDS) signal rega_ids : std_logic_vector(7 downto 0); --(MEMPTR -> C_BANK1) signal bseq : std_logic_Vector(3 downto 0); --(C_BANK1 -> IDS) signal cbank_ids : std_logic_vector(31 downto 0); --(RAM 
-> C_BANK1) signal ram_cbank : std_logic_vector(39 downto 0); --(MEMPTR -> RAM) Ram address signal memptr_ram : std_logic_vector(6 downto 0); --(IDS -> DUs) signal o1, o2, o3, o4, o5 : std_logic_vector(39 downto 0); --(Combined output from REG_A and C_BANK1 into ids_in) signal ids_in : std_logic_vector(39 downto 0); --(DUs -> AUs) signal du_au1, du_au2, du_au3 : std_logic_vector(39 downto 0); signal du_au4, du_au5 : std_logic_vector(39 downto 0); --(AUs -> Output Pixels)
signal out_pxl1 : std_logic_vector(18 downto 0); signal out_pxl2 : std_logic_vector(18 downto 0); signal out_pxl3 : std_logic_vector(18 downto 0); --(AU's select line for programming) signal a_sel : std_logic_vector(2 downto 0); --(Output select register for holding output selection from parallel port) signal op_sel_reg : std_logic_vector(1 downto 0); --(ans delays signals) signal ds : std_logic_vector(13 downto 0); begin -- Main Controller of Version 2 Convolution Architecture U0: CTR port map(clk=>clk, rst=>rst, str=>str, bor=>bor, eoc=>eoc, sds=>sds, rgt=>rgt, f_sel=>f_sel, z_pad=>z_pad, reg_sel=>reg_sel, en_w=>en_w, en_sf=>en_sf, z_input=>z_input, c_inc=>c_inc, rot=>rot, r_inc=>r_inc, req=>req, r_w=>r_w, s_inc=>s_inc, sd_inc=>sd_inc, ans=>a1); -- Row counter for the main controller U1: RCNT port map(clk=>clk, rst=>rst, r_inc=>r_inc, sd_inc=>sd_inc, eoc=>eoc, sds=>sds, rgt=>rgt); -- Register A of the DM_IF U2: REG_A port map(clk=>clk, rst=>rst, z_pad=>z_pad, z_input=>z_input, reg_sel=>reg_sel, d_in=>d_in, d_out=>rega_ram, ids_out=>rega_ids); -- C_BANK1 of the DM_IF U3: C_BANK1 port map(clk=>clk, rst=>rst, f_sel=>f_sel, z_pad=>z_pad, en_sf=>en_sf, en_w=>en_w, ld_reg=>reg_sel, bseq=>bseq, d_in=>ram_cbank, d_out=>cbank_ids); -- RAM U4: RAM port map(wclk=>clk, r_w=>r_w, d_in=>rega_ram, addr=>memptr_ram, d_out=>ram_cbank); -- MEMPTR (memory pointer) U5: MEMPTR port map(clk=>clk, rst=>rst, rot=>rot, s_inc=>s_inc, inc=>c_inc, reg_sel=>reg_sel, bor=>bor, bseq=>bseq, add_out=>memptr_ram); -- IDS L1: ids_in <= rega_ids & cbank_ids; -- Combine signals output from REG_A and CBANK u6: IDS port map(clk=>clk, rst=>rst, ans=>ans, ids_in=>ids_in, o1=>o1, o2=>o2, o3=>o3, o4=>o4, o5=>o5); -- This process is to create delays such that the output from IDS will outputs at -- the same time. Also take cares of the boundary outputs from line to line. 
D2: process (clk, rst, a1, a2) is begin if (rst = '1') then a2 <= '0'; a3 <= '0'; ans <= '0'; elsif (clk'event and clk = '1') then a2 <= a1; a3 <= a2; ans <= a3; end if; end process D2; -- This process is to propagate ans signal through out all the pipeline stages to -- the SRAM writer interface so it could strat "recording". D3: process (clk, rst, ans, ds) is begin if (rst = '1') then ds <= (others => '0'); elsif (clk'event and clk = '1') then ds(0) <= ans; ds(1) <= ds(0);
ds(2) <= ds(1); ds(3) <= ds(2); ds(4) <= ds(3); ds(5) <= ds(4); ds(6) <= ds(5); ds(7) <= ds(6); ds(8) <= ds(7); ds(9) <= ds(8); ds(10) <= ds(9); ds(11) <= ds(10); ds(12) <= ds(11); ds(13) <= ds(12); end if; end process D3; L2: sram_w <= ds(13) or ds(12) or ds(11) or ds(10); -- DUs u7: DU port map(clk=>clk, rst=>rst, ids_in=>o1, du_out=>du_au1); u8: DU port map(clk=>clk, rst=>rst, ids_in=>o2, du_out=>du_au2); u9: DU port map(clk=>clk, rst=>rst, ids_in=>o3, du_out=>du_au3); u10: DU port map(clk=>clk, rst=>rst, ids_in=>o4, du_out=>du_au4); u11: DU port map(clk=>clk, rst=>rst, ids_in=>o5, du_out=>du_au5); -- AUs u12: AU port map(clk=>clk, rst=>rst, du0=>du_au1, du1=>du_au2, du2=>du_au3, du3=>du_au4, du4=>du_au5, p_en=>a_sel(0), ld_reg=>ld_reg, coef=>coef, out_pxl=>out_pxl1, ovf=>ovf1); u13: AU port map(clk=>clk, rst=>rst, du0=>du_au1, du1=>du_au2, du2=>du_au3, du3=>du_au4, du4=>du_au5, p_en=>a_sel(1), ld_reg=>ld_reg, coef=>coef, out_pxl=>out_pxl2, ovf=>ovf2); u14: AU port map(clk=>clk, rst=>rst, du0=>du_au1, du1=>du_au2, du2=>du_au3, du3=>du_au4, du4=>du_au5, p_en=>a_sel(2), ld_reg=>ld_reg, coef=>coef, out_pxl=>out_pxl3, ovf=>ovf3); AUP_SEL: process (au_sel) is begin case (au_sel) is when "01" => a_sel <= "001"; when "10" => a_sel <= "010"; when "11" => a_sel <= "100"; when others => a_sel <= "000"; end case; end process AUP_SEL; -- Testing Purposes -- Output selection logic OP_SEL: process (clk, rst, o_sel) is begin if (rst = '1') then op_sel_reg <= (others => '0'); elsif (clk'event and clk = '1') then if (o_sel = '1') then op_sel_reg <= coef(1 downto 0); else op_sel_reg <= op_sel_reg; end if; end if; end process OP_SEL; -- Output selection mux OP_D: process (op_sel_reg, out_pxl1, out_pxl2, out_pxl3) is begin case (op_sel_reg) is when "00" => d_out <= out_pxl1; when "01" => d_out <= out_pxl2;
when "10" => d_out <= out_pxl3; when others => d_out <= out_pxl1; end case; end process OP_D; end architecture STRUCT; 2. Controller Unit (CU) -- ctr.vhd (Controller) library IEEE; use IEEE.std_logic_1164.all; entity CTR is port( clk, rst, str, bor, eoc, sds, rgt: in std_logic; f_sel: out std_logic; z_pad: out std_logic; reg_sel: out std_logic_vector(2 downto 0); en_w: out std_logic_vector(1 downto 0); en_sf: out std_logic_vector(1 downto 0); z_input: out std_logic; c_inc: out std_logic_vector(1 downto 0); rot: out std_logic; r_inc: out std_logic; req: out std_logic; r_w: out std_logic; s_inc: out std_logic; sd_inc: out std_logic; ans: out std_logic ); end entity CTR; architecture BEHAVIORAL of CTR is type statetype is (st0, st1, st2, st3, st4, st5, st6, st7, st8, st9, st10, st11, st12, st13, st14, st15, st16, st17, st18, st19, st20, st21, st22, st23, st24, st25, st26, st27, st28, st29, st30, st31, st32, st33, st34, st35, st36, st37, st38, st39, st40, st41, st42, st43, st44); signal c_st, n_st: statetype; signal tog : std_logic; begin NXTSTPROC: process (c_st, str, bor, eoc, rgt, tog, sds) is begin case c_st is when st0 => if (str = '1') then n_st <= st1; else n_st <= st0; end if; when st1 => if (rgt = '0' and tog = '0') then n_st <= st9; elsif (rgt = '0' and tog = '1') then n_st <= st2; elsif (rgt = '1' and tog = '0') then n_st <= st23; else n_st <= st16; end if; when st2 => n_st <= st3; when st3 => n_st <= st4; when st4 => n_st <= st5; when st5 => n_st <= st6;
when st6 => if (bor = '1') then n_st <= st7; else if (rgt = '0' and tog = '0') then n_st <= st9; elsif (rgt = '0' and tog = '1') then n_st <= st2; elsif (rgt = '1' and tog = '0') then n_st <= st23; else n_st <= st16; end if; end if; when st7 => n_st <= st8; when st8 => if (rgt = '0' and tog = '0') then n_st <= st9; elsif (rgt = '0' and tog = '1') then n_st <= st2; elsif (rgt = '1' and tog = '0') then n_st <= st23; else n_st <= st16; end if; when st9 => n_st <= st10; when st10 => n_st <= st11; when st11 => n_st <= st12; when st12 => n_st <= st13; when st13 => if (bor = '1') then n_st <= st14; else if (rgt = '0' and tog = '0') then n_st <= st9; elsif (rgt = '0' and tog = '1') then n_st <= st2; elsif (rgt = '1' and tog = '0') then n_st <= st23; else n_st <= st16; end if; end if; when st14 => n_st <= st15; when st15 => if (rgt = '0' and tog = '0') then n_st <= st9; elsif (rgt = '0' and tog = '1') then n_st <= st2; elsif (rgt = '1' and tog = '0') then n_st <= st23; else n_st <= st16; end if; when ST16 => n_st <= st17; when st17 => n_st <= st18; when st18 => n_st <= st19; when st19 => n_st <= st20; when st20 => if (bor = '1') then
n_st <= st21; else if (tog = '0') then n_st <= st23; else n_st <= st16; end if; end if; when st21 => n_st <= st22; when st22 => if (eoc = '1') then if (tog = '0') then n_st <= st37; else n_st <= st30; end if; else if (tog = '0') then n_st <= St23; else n_st <= st16; end if; end if; when ST23 => n_st <= st24; when st24 => n_st <= st25; when st25 => n_st <= st26; when st26 => n_st <= st27; when st27 => if (bor = '1') then n_st <= st28; else if (tog = '0') then n_st <= st23; else n_st <= st16; end if; end if; when st28 => n_st <= st29; when st29 => if (eoc = '1') then if (tog = '0') then n_st <= st37; else n_st <= st30; end if; else if (tog = '0') then n_st <= St23; else n_st <= st16; end if; end if; when st30 => n_st <= st31; when st31 => n_st <= st32; when st32 => n_st <= st33; when st33 => n_st <= st34; when st34 => if (bor = '1') then n_st <= st35; else
n_st <= st37; end if; when st35 => n_st <= st36; when st36 => if (sds = '1') then n_st <= st44; else n_st <= st37; end if; when st37 => n_st <= st38; when st38 => n_st <= st39; when st39 => n_st <= st40; when st40 => n_st <= st41; when st41 => if (bor = '1') then n_st <= st42; else n_st <= st30; end if; when st42 => n_st <= st43; when st43 => if (sds = '1') then n_st <= st44; else n_st <= st30; end if; when st44 => n_st <= st44; when others => null; end case; end process NXTSTPROC; CURSTPROC: process (clk, rst) is begin if (rst = '1') then c_st <= st0; elsif (clk'event and clk = '0') then c_st <= n_st; end if; end process CURSTPROC; OUTCONPROC: process (c_st) is begin case c_st is when st0 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '0'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '0';
ans <= '0'; when st1 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '0'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '0'; ans <= '0'; when st2 => reg_sel <= "001"; f_sel <= '0'; en_w <= "01"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "01"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st3 => reg_sel <= "010"; f_sel <= '0'; en_w <= "01"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st4 => reg_sel <= "011"; f_sel <= '0'; en_w <= "01"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st5 => reg_sel <= "100"; f_sel <= '0'; en_w <= "01"; en_sf <= "00";
z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st6 => reg_sel <= "101"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "10"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '1'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st7 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '1'; r_inc <= '1'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st8 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st9 => reg_sel <= "001"; f_sel <= '0'; en_w <= "10"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "01"; rot <= '0'; r_inc <= '0'; req <= '1';
tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st10 => reg_sel <= "010"; f_sel <= '0'; en_w <= "10"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st11 => reg_sel <= "011"; f_sel <= '0'; en_w <= "10"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st12 => reg_sel <= "100"; f_sel <= '0'; en_w <= "10"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st13 => reg_sel <= "101"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "10"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '1'; sd_inc <= '0'; s_inc <= '1'; ans <= '0';
when st14 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '1'; r_inc <= '1'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st15 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st16 => reg_sel <= "001"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '0'; c_inc <= "01"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st17 => reg_sel <= "010"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st18 => reg_sel <= "011"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '0';
c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st19 => reg_sel <= "100"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st20 => reg_sel <= "101"; f_sel <= '1'; en_w <= "00"; en_sf <= "10"; z_pad <= '0'; z_input <= '0'; c_inc <= "10"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '1'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st21 => reg_sel <= "000"; f_sel <= '1'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '1'; r_inc <= '1'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st22 => reg_sel <= "000"; f_sel <= '1'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0';
  sd_inc <= '0'; s_inc <= '1'; ans <= '0';
when st23 =>
  reg_sel <= "001"; f_sel <= '0'; en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '0'; c_inc <= "01"; rot <= '0';
  r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st24 =>
  reg_sel <= "010"; f_sel <= '0'; en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '0'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st25 =>
  reg_sel <= "011"; f_sel <= '0'; en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '0'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st26 =>
  reg_sel <= "100"; f_sel <= '0'; en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '0'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st27 =>
  reg_sel <= "101"; f_sel <= '0'; en_w <= "00"; en_sf <= "01"; z_pad <= '0'; z_input <= '0'; c_inc <= "10"; rot <= '0';
  r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '1'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st28 =>
  reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '1';
  r_inc <= '1'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0';
when st29 =>
  reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0';
when st30 =>
  reg_sel <= "001"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '1'; c_inc <= "01"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st31 =>
  reg_sel <= "010"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '1';
  c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st32 =>
  reg_sel <= "011"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '1'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st33 =>
  reg_sel <= "100"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '1'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st34 =>
  reg_sel <= "101"; f_sel <= '1'; en_w <= "00"; en_sf <= "10"; z_pad <= '0'; z_input <= '1'; c_inc <= "10"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '1'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st35 =>
  reg_sel <= "000"; f_sel <= '1'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '1'; c_inc <= "00"; rot <= '1';
  r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0';
  sd_inc <= '1'; s_inc <= '1'; ans <= '0';
when st36 =>
  reg_sel <= "000"; f_sel <= '1'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '1'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0';
when st37 =>
  reg_sel <= "001"; f_sel <= '0'; en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '1'; c_inc <= "01"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st38 =>
  reg_sel <= "010"; f_sel <= '0'; en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '1'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st39 =>
  reg_sel <= "011"; f_sel <= '0'; en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '1'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st40 =>
  reg_sel <= "100"; f_sel <= '0';
  en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '1'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st41 =>
  reg_sel <= "101"; f_sel <= '0'; en_w <= "00"; en_sf <= "01"; z_pad <= '0'; z_input <= '1'; c_inc <= "10"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '1'; r_w <= '1'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';
when st42 =>
  reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '1'; c_inc <= "00"; rot <= '1';
  r_inc <= '0'; req <= '0'; tog <= '1'; r_w <= '0'; sd_inc <= '1'; s_inc <= '1'; ans <= '0';
when st43 =>
  reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '1'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0';
when st44 =>
  reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0';
  r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0';
when others => null;
end case;
end process OUTCONPROC;
end architecture BEHAVIORAL;

3. Memory Pointers Unit (MPU)

-- mem_ptr.vhd (Memory Pointers)
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

entity MEMPTR is
  port(
    clk, rst, rot: in std_logic;
    inc: in std_logic_vector(1 downto 0);
    reg_sel: in std_logic_vector(2 downto 0);
    bor: out std_logic; -- beginning of a new row
    bseq: out std_logic_vector(3 downto 0);
    add_out: out std_logic_vector(12 downto 0));
end entity MEMPTR;

architecture BEHAVIORAL of MEMPTR is
  signal count1, count2: unsigned(9 downto 0);
  signal ptr_a, ptr_b, ptr_c, ptr_d, ptr_e: unsigned(2 downto 0);
  signal b_seq: unsigned(3 downto 0);
  signal eor: std_logic;
begin
  CNTR1: process (clk, rst, inc) is
  begin
    if (rst = '1') then
      count1 <= to_unsigned(0, 10);
    elsif (clk'event and clk = '1') then
      if (inc(0) = '1') then
        if (count1 = to_unsigned(1020, 10)) then
          count1 <= to_unsigned(0, 10);
        else
          count1 <= count1 + 1;
        end if;
      end if;
    end if;
  end process CNTR1;

  CNTR2: process (clk, rst, inc) is
  begin
    if (rst = '1') then
      count2 <= to_unsigned(1, 10);
    elsif (clk'event and clk = '1') then
      if (inc(1) = '1') then
        if (count2 = to_unsigned(1020, 10)) then
          count2 <= to_unsigned(0, 10);
        else
          count2 <= count2 + 1;
        end if;
      end if;
    end if;
  end process CNTR2;

  CODOUT: process (count1, count2) is
  begin
    if (count2 = to_unsigned(0, 10)) then eor <= '1'; else eor <= '0'; end if;
    if (count1 = to_unsigned(0, 10)) then bor <= '1'; else bor <= '0'; end if;
  end process CODOUT;

  BLK: process (clk, rst, rot) is
  begin
    if (rst = '1') then
      b_seq <= to_unsigned(0, 4);
    elsif (clk'event and clk = '1') then
      if (rot = '1') then
        b_seq(0) <= '1'; b_seq(1) <= b_seq(0); b_seq(2) <= b_seq(1); b_seq(3) <= b_seq(2);
      end if;
    end if;
  end process BLK;

  L1: bseq <= std_logic_vector(b_seq);

  PTRS: process (clk, rst, rot) is
  begin
    if (rst = '1') then
      ptr_a <= to_unsigned(0, 3); ptr_b <= to_unsigned(1, 3); ptr_c <= to_unsigned(2, 3);
      ptr_d <= to_unsigned(3, 3); ptr_e <= to_unsigned(4, 3);
    elsif (clk'event and clk = '1') then
      if (rot = '1') then
        ptr_b <= ptr_a; ptr_c <= ptr_b; ptr_d <= ptr_c; ptr_e <= ptr_d; ptr_a <= ptr_e;
      end if;
    end if;
  end process PTRS;

  MUX: process (reg_sel, count1, count2, ptr_a, ptr_b, ptr_c, ptr_d, ptr_e, eor) is
  begin
    case reg_sel is
      when "001" =>
        if (eor = '1') then
          add_out <= std_logic_vector(ptr_d) & std_logic_vector(count2);
        else
          add_out <= std_logic_vector(ptr_e) & std_logic_vector(count2);
        end if;
      when "010" =>
        if (eor = '1') then
          add_out <= std_logic_vector(ptr_c) & std_logic_vector(count2);
        else
          add_out <= std_logic_vector(ptr_d) & std_logic_vector(count2);
        end if;
      when "011" =>
        if (eor = '1') then
          add_out <= std_logic_vector(ptr_b) & std_logic_vector(count2);
        else
          add_out <= std_logic_vector(ptr_c) & std_logic_vector(count2);
        end if;
      when "100" =>
        if (eor = '1') then
          add_out <= std_logic_vector(ptr_a) & std_logic_vector(count2);
        else
          add_out <= std_logic_vector(ptr_b) & std_logic_vector(count2);
        end if;
      when "101" =>
        add_out <= std_logic_vector(ptr_a) & std_logic_vector(count1);
      when others =>
        add_out <= std_logic_vector(to_unsigned(0, 13));
    end case;
  end process MUX;
end architecture BEHAVIORAL;
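As a sanity check on the pointer scheme above, the behavior of MEMPTR's five rotating row pointers can be modeled in software: on each `rot` pulse the pointers circulate one step so the oldest row buffer is recycled, and `add_out` is a 3-bit row pointer concatenated with a 10-bit column count. A minimal Python sketch (class and method names are hypothetical, not part of the design):

```python
class MemPtrModel:
    """Software sketch of MEMPTR's rotating row pointers (hypothetical model)."""

    def __init__(self):
        # ptr_a..ptr_e reset to row buffers 0..4
        self.ptrs = [0, 1, 2, 3, 4]

    def rotate(self):
        # On rot = '1': ptr_a <= ptr_e, ptr_b <= ptr_a, ... -- a one-step
        # circular shift that recycles the oldest row buffer for the next row.
        self.ptrs = [self.ptrs[-1]] + self.ptrs[:-1]

    def address(self, which, count):
        # add_out = 3-bit row pointer & 10-bit column count (13-bit address)
        return (self.ptrs[which] << 10) | (count & 0x3FF)


m = MemPtrModel()
print(m.address(4, 7))   # ptr_e selects row buffer 4: 4*1024 + 7 = 4103
m.rotate()
print(m.ptrs)            # [4, 0, 1, 2, 3]
```

After five rotations the pointers return to their reset order, which is what lets the same five physical row buffers serve an arbitrarily long stream of image rows.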
4. Data Memory Interface (DM I/F)

-- dm_if.vhd (Data Memory Interface)
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

entity C_BANK1 is
  port(
    clk, rst, f_sel, z_pad: in std_logic;
    en_sf, en_w: in std_logic_vector(1 downto 0);
    ld_reg: in std_logic_vector(2 downto 0);
    bseq: in std_logic_vector(3 downto 0);
    d_in: in std_logic_vector(39 downto 0);
    d_out: out std_logic_vector(31 downto 0));
end entity C_BANK1;

architecture STRUCTURAL of C_BANK1 is
  component REGFILE is
    generic(
      n: integer := 8;  -- n denotes the data width
      d: integer := 5); -- d denotes the number of registers
    port(
      clk, rst, en_sf, en_w: in std_logic;
      ld_reg: in std_logic_vector(2 downto 0);
      bseq: in std_logic_vector(3 downto 0);
      d_in: in std_logic_vector(39 downto 0);
      d_out: out std_logic_vector(31 downto 0));
  end component REGFILE;
  signal f_a, f_b, f_mux: std_logic_vector(31 downto 0);
begin
  RF1: REGFILE port map(clk=>clk, rst=>rst, en_sf=>en_sf(0), en_w=>en_w(0), bseq=>bseq, ld_reg=>ld_reg, d_in=>d_in, d_out=>f_a);
  RF2: REGFILE port map(clk=>clk, rst=>rst, en_sf=>en_sf(1), en_w=>en_w(1), bseq=>bseq, ld_reg=>ld_reg, d_in=>d_in, d_out=>f_b);
  MUX1: f_mux <= f_a when f_sel = '0' else f_b;
  Z_P: d_out <= f_mux when z_pad = '0' else std_logic_vector(to_unsigned(0, 32));
end architecture STRUCTURAL;
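The output stage of C_BANK1 is a two-level selection: `f_sel` steers one of the two register-file banks to the array while the other can be reloaded, and `z_pad` overrides the result with the all-zero word used for image-border padding. A minimal Python sketch of that selection (function name hypothetical):

```python
def c_bank_out(f_a, f_b, f_sel, z_pad):
    """Model of C_BANK1's output stage: bank select followed by zero-padding gate."""
    if z_pad:                     # z_pad = '1' forces the all-zero word
        return 0
    return f_b if f_sel else f_a  # f_sel chooses between the two register files


print(c_bank_out(0x11, 0x22, f_sel=1, z_pad=0))  # 34 (bank B selected)
```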
-- regfile.vhd
library IEEE;
use IEEE.std_logic_1164.all;

entity REGFILE is
  generic(
    n: integer := 8;  -- n denotes the data width
    d: integer := 5); -- d denotes the number of registers
  port(
    clk, rst, en_sf, en_w: in std_logic;
    ld_reg: in std_logic_vector(2 downto 0);
    bseq: in std_logic_vector(3 downto 0);
    d_in: in std_logic_vector(39 downto 0);
    d_out: out std_logic_vector(31 downto 0));
end entity REGFILE;

architecture STRUCTURAL of REGFILE is
  component PLS_REG is
    generic(
      n: integer := 8;  -- n denotes the data width
      d: integer := 5); -- d denotes the number of registers
    port(
      clk, rst, en_ld, en_sf: in std_logic;
      d_in: in std_logic_vector((n*d)-1 downto 0);
      d_out: out std_logic_vector(n-1 downto 0));
  end component PLS_REG;
  signal reg_sel: std_logic_vector(3 downto 0);
begin
  LF: for f in 1 to 4 generate
    PR_F: PLS_REG generic map(n=>n, d=>d)
      port map(clk=>clk, rst=>rst, en_ld=>reg_sel(f-1), en_sf=>en_sf,
               d_in=>d_in, d_out=>d_out((f*n)-1 downto ((f-1)*n)));
  end generate LF;

  SEL: process (ld_reg, en_w, bseq) is
  begin
    if (en_w = '0') then
      reg_sel <= "0000";
    else
      case ld_reg is
        when "001" => if (bseq(0) = '1') then reg_sel <= "0001"; else reg_sel <= "0000"; end if;
        when "010" => if (bseq(1) = '1') then reg_sel <= "0010"; else reg_sel <= "0000"; end if;
        when "011" => if (bseq(2) = '1') then reg_sel <= "0100"; else reg_sel <= "0000"; end if;
        when "100" => if (bseq(3) = '1') then reg_sel <= "1000"; else reg_sel <= "0000"; end if;
        when others => reg_sel <= "0000";
      end case;
    end if;
  end process SEL;
end architecture STRUCTURAL;

-- pls_reg.vhd
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
use BFULIB.bfu_pckg.all;

entity PLS_REG is
  generic(
    n: integer := 8;  -- n denotes the data width
    d: integer := 5); -- d denotes the number of registers
  port(
    clk, rst, en_ld, en_sf: in std_logic;
    d_in: in std_logic_vector((n*d)-1 downto 0);
    d_out: out std_logic_vector(n-1 downto 0));
end entity PLS_REG;

architecture STRUCTURAL of PLS_REG is
  signal s: std_logic_vector((n*(d+1))-1 downto 0);
begin
  L1: s(n-1 downto 0) <= std_logic_vector(to_unsigned(0, n));
  LK: for k in 1 to d generate
    REGK: S_REG generic map(n=>n)
      port map(clk=>clk, rst=>rst, en_ld=>en_ld, en_sf=>en_sf,
               p_in=>d_in((n*k)-1 downto n*(k-1)),
               d_in=>s((n*k)-1 downto n*(k-1)),
               d_out=>s((n*(k+1))-1 downto (n*k)));
  end generate LK;
  L2: d_out <= s((n*(d+1))-1 downto n*d);
end architecture STRUCTURAL;

-- reg_a.vhd
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

entity REG_A is
  generic(
    n: integer := 8;  -- denotes the data width
    d: integer := 5); -- denotes the number of registers
  port(
    clk, rst, z_pad: in std_logic;
    reg_sel: in std_logic_vector(2 downto 0);
    d_in: in std_logic_vector(n-1 downto 0);
    d_out: out std_logic_vector((n*d)-1 downto 0);
    ids_out: out std_logic_vector(n-1 downto 0));
end entity REG_A;

architecture BEHAVIORAL of REG_A is
  signal reg1, reg2, reg3, reg4, reg5, regt: unsigned(n-1 downto 0);
begin
  -- Register write with conditions
  REGSEL: process (clk, rst, reg_sel) is
  begin
    if (rst = '1') then
      reg1 <= to_unsigned(0, n); reg2 <= to_unsigned(0, n); reg3 <= to_unsigned(0, n);
      reg4 <= to_unsigned(0, n); reg5 <= to_unsigned(0, n); regt <= to_unsigned(0, n);
    elsif (clk'event and clk = '1') then
      case reg_sel is
        when "001" => reg1 <= unsigned(d_in); regt <= unsigned(d_in);
        when "010" => reg2 <= unsigned(d_in); regt <= unsigned(d_in);
        when "011" => reg3 <= unsigned(d_in); regt <= unsigned(d_in);
        when "100" => reg4 <= unsigned(d_in); regt <= unsigned(d_in);
        when "101" => reg5 <= unsigned(d_in); regt <= unsigned(d_in);
        when others => null;
      end case;
    end if;
  end process REGSEL;

  -- Output Logic
  L1: d_out <= std_logic_vector(reg5) & std_logic_vector(reg4) & std_logic_vector(reg3) &
               std_logic_vector(reg2) & std_logic_vector(reg1);
  L2: ids_out <= std_logic_vector(regt) when z_pad = '0' else std_logic_vector(to_unsigned(0, n));
end architecture BEHAVIORAL;

5. Input Data Shifters (IDS)

-- ids.vhd (Input Data Shifters)
-- This is the functional unit that is responsible for providing inputs to the five MAAs.
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;

entity IDS is
  port(
    clk, rst: in std_logic;
    ids_in: in std_logic_vector(39 downto 0);
    o1, o2, o3, o4, o5: out std_logic_vector(39 downto 0));
end entity IDS;

architecture STRUCTURAL of IDS is
  signal s1, s2, s3, s4: std_logic_vector(39 downto 0);
begin
  R1: REG_P generic map(n=>40) port map(clk=>clk, rst=>rst, d_in=>ids_in, d_out=>s1);
  L1: o1 <= s1;
  R2: REG_P generic map(n=>40) port map(clk=>clk, rst=>rst, d_in=>s1, d_out=>s2);
  L2: o2 <= s2;
  R3: REG_P generic map(n=>40) port map(clk=>clk, rst=>rst, d_in=>s2, d_out=>s3);
  L3: o3 <= s3;
  R4: REG_P generic map(n=>40) port map(clk=>clk, rst=>rst, d_in=>s3, d_out=>s4);
  L4: o4 <= s4;
  R5: REG_P generic map(n=>40) port map(clk=>clk, rst=>rst, d_in=>s4, d_out=>o5);
end architecture STRUCTURAL;

6. Arithmetic Unit (AU)

-- au.vhd (Arithmetic Unit)
-- This is the combination of all the arithmetic units, which includes all the
-- MAUs (25 of them).
library IEEE;
use IEEE.std_logic_1164.all;

entity AU is
  port(
    clk, rst: in std_logic;
    ids0, ids1, ids2, ids3, ids4: in std_logic_vector(39 downto 0);
    ld_reg: in std_logic_vector(4 downto 0);
    coef: in std_logic_vector(5 downto 0);
    out_pxl: out std_logic_vector(18 downto 0);
    ovf: out std_logic);
end entity AU;

architecture STRUCTURAL of AU is
  component MAA is
    port(
      clk, rst: in std_logic;
      ld_reg: in std_logic_vector(4 downto 0);
      coef: in std_logic_vector(5 downto 0);
      img_pxl: in std_logic_vector(39 downto 0);
      p_rst: out std_logic_vector(16 downto 0));
  end component MAA;
  component AT is
    port(
      clk, rst: in std_logic;
      maa0, maa1, maa2, maa3, maa4: in std_logic_vector(16 downto 0);
      ovf: out std_logic;
      out_pxl: out std_logic_vector(18 downto 0));
  end component AT;
  signal maa0, maa1, maa2, maa3, maa4: std_logic_vector(16 downto 0);
  signal ld_coef: std_logic_vector(24 downto 0);
begin
  MAA_0: MAA port map(clk=>clk, rst=>rst, ld_reg=>ld_coef(4 downto 0), coef=>coef, img_pxl=>ids0, p_rst=>maa0);
  MAA_1: MAA port map(clk=>clk, rst=>rst, ld_reg=>ld_coef(9 downto 5), coef=>coef, img_pxl=>ids1, p_rst=>maa1);
  MAA_2: MAA port map(clk=>clk, rst=>rst, ld_reg=>ld_coef(14 downto 10), coef=>coef, img_pxl=>ids2, p_rst=>maa2);
  MAA_3: MAA port map(clk=>clk, rst=>rst, ld_reg=>ld_coef(19 downto 15), coef=>coef, img_pxl=>ids3, p_rst=>maa3);
  MAA_4: MAA port map(clk=>clk, rst=>rst, ld_reg=>ld_coef(24 downto 20), coef=>coef, img_pxl=>ids4, p_rst=>maa4);
  U1: AT port map(clk=>clk, rst=>rst, maa0=>maa0, maa1=>maa1, maa2=>maa2, maa3=>maa3, maa4=>maa4, ovf=>ovf, out_pxl=>out_pxl);

  MUX: process (ld_reg, coef) is
  begin
    case (ld_reg) is
      when "00001" => ld_coef <= "0000000000000000000000001"; -- 1
      when "00010" => ld_coef <= "0000000000000000000000010"; -- 2
      when "00011" => ld_coef <= "0000000000000000000000100"; -- 3
      when "00100" => ld_coef <= "0000000000000000000001000"; -- 4
      when "00101" => ld_coef <= "0000000000000000000010000"; -- 5
      when "00110" => ld_coef <= "0000000000000000000100000"; -- 6
      when "00111" => ld_coef <= "0000000000000000001000000"; -- 7
      when "01000" => ld_coef <= "0000000000000000010000000"; -- 8
      when "01001" => ld_coef <= "0000000000000000100000000"; -- 9
      when "01010" => ld_coef <= "0000000000000001000000000"; -- 10
      when "01011" => ld_coef <= "0000000000000010000000000"; -- 11
      when "01100" => ld_coef <= "0000000000000100000000000"; -- 12
      when "01101" => ld_coef <= "0000000000001000000000000"; -- 13
      when "01110" => ld_coef <= "0000000000010000000000000"; -- 14
      when "01111" => ld_coef <= "0000000000100000000000000"; -- 15
      when "10000" => ld_coef <= "0000000001000000000000000"; -- 16
      when "10001" => ld_coef <= "0000000010000000000000000"; -- 17
      when "10010" => ld_coef <= "0000000100000000000000000"; -- 18
      when "10011" => ld_coef <= "0000001000000000000000000"; -- 19
      when "10100" => ld_coef <= "0000010000000000000000000"; -- 20
      when "10101" => ld_coef <= "0000100000000000000000000"; -- 21
      when "10110" => ld_coef <= "0001000000000000000000000"; -- 22
      when "10111" => ld_coef <= "0010000000000000000000000"; -- 23
      when "11000" => ld_coef <= "0100000000000000000000000"; -- 24
      when "11001" => ld_coef <= "1000000000000000000000000"; -- 25
      when others => ld_coef <= "0000000000000000000000000";
    end case;
  end process MUX;
end architecture STRUCTURAL;

-- at.vhd (Adding Tree)
-- This is the Adding Tree that is responsible for adding five 17-bit words from
-- five different MAAs. The structure includes four levels of pipeline stages.
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;

entity AT is
  port(
    clk, rst: in std_logic;
    maa0, maa1, maa2, maa3, maa4: in std_logic_vector(16 downto 0);
    ovf: out std_logic;
    out_pxl: out std_logic_vector(18 downto 0));
end entity AT;

architecture STRUCTURAL of AT is
  signal low, ovf_r: std_logic;
  signal sum1: std_logic_vector(16 downto 0);
  signal sum2, sum3, carry1, carry2: std_logic_vector(17 downto 0);
  signal carry3: std_logic_vector(18 downto 0);
  signal sum4: std_logic_vector(19 downto 0);
  signal pl1_r1: std_logic_vector(17 downto 0);
  signal pl1_r2, pl1_r3, pl1_r4: std_logic_vector(16 downto 0);
  signal pl2_r5, pl2_r6: std_logic_vector(17 downto 0);
  signal pl2_r7: std_logic_vector(16 downto 0);
  signal pl3_r8: std_logic_vector(18 downto 0);
  signal pl3_r9: std_logic_vector(17 downto 0);
  signal pl4_r10: std_logic_vector(19 downto 0);
begin
  L1: low <= '0';
  U1: CSA generic map(n=>17) port map(a=>maa0, b=>maa1, c=>maa2, sum=>sum1, carry=>carry1);
  R1: REG_P generic map(n=>18) port map(clk=>clk, rst=>rst, d_in=>carry1, d_out=>pl1_r1);
  R2: REG_P generic map(n=>17) port map(clk=>clk, rst=>rst, d_in=>sum1, d_out=>pl1_r2);
  R3: REG_P generic map(n=>17) port map(clk=>clk, rst=>rst, d_in=>maa3, d_out=>pl1_r3);
  R4: REG_P generic map(n=>17) port map(clk=>clk, rst=>rst, d_in=>maa4, d_out=>pl1_r4);
  U2: CSA generic map(n=>17) port map(a=>pl1_r1(16 downto 0), b=>pl1_r2, c=>pl1_r3, sum=>sum2(16 downto 0), carry=>carry2);
  L2: sum2(17) <= pl1_r1(17); -- This is the most significant bit from carry1 above
  R5: REG_P generic map(n=>18) port map(clk=>clk, rst=>rst, d_in=>carry2, d_out=>pl2_r5);
  R6: REG_P generic map(n=>18) port map(clk=>clk, rst=>rst, d_in=>sum2, d_out=>pl2_r6);
  R7: REG_P generic map(n=>17) port map(clk=>clk, rst=>rst, d_in=>pl1_r4, d_out=>pl2_r7);
  U3: CSA generic map(n=>17) port map(a=>pl2_r5(16 downto 0), b=>pl2_r6(16 downto 0), c=>pl2_r7, sum=>sum3(16 downto 0), carry=>carry3(17 downto 0));
  L3: HA port map(a=>pl2_r5(17), b=>pl2_r6(17), s=>sum3(17), cout=>carry3(18));
  R8: REG_P generic map(n=>19) port map(clk=>clk, rst=>rst, d_in=>carry3, d_out=>pl3_r8);
  R9: REG_P generic map(n=>18) port map(clk=>clk, rst=>rst, d_in=>sum3, d_out=>pl3_r9);
  U4: CLA_19 port map(a(17 downto 0)=>pl3_r9, a(18)=>low, b=>pl3_r8, s=>sum4(18 downto 0), ovf=>sum4(19));
  R10: REG_P generic map(n=>20) port map(clk=>clk, rst=>rst, d_in=>sum4, d_out=>pl4_r10);
  L4: out_pxl <= pl4_r10(18 downto 0);
  L5: ovf <= pl4_r10(19);
end architecture STRUCTURAL;

-- maa.vhd (This is the systolic array of five MAUs with a DU)
library IEEE;
use IEEE.std_logic_1164.all;

entity MAA is
  port(
    clk, rst: in std_logic;
    ld_reg: in std_logic_vector(4 downto 0);
    coef: in std_logic_vector(5 downto 0);
    img_pxl: in std_logic_vector(39 downto 0);
    p_rst: out std_logic_vector(16 downto 0));
end entity MAA;

architecture STRUCTURAL of MAA is
  component DU is
    port(
      clk, rst: in std_logic;
      ids_in: in std_logic_vector(39 downto 0);
      du_out: out std_logic_vector(39 downto 0));
  end component DU;
  component MAUS is
    port(
      clk, rst: in std_logic;
      ld_reg: in std_logic_vector(4 downto 0);
      coef: in std_logic_vector(5 downto 0);
      img_pxl: in std_logic_vector(39 downto 0);
      p_rst: out std_logic_vector(16 downto 0));
  end component MAUS;
  signal s: std_logic_vector(39 downto 0);
begin
  U1: DU port map(clk=>clk, rst=>rst, ids_in=>img_pxl, du_out=>s);
  U2: MAUS port map(clk=>clk, rst=>rst, ld_reg=>ld_reg, coef=>coef, img_pxl=>s, p_rst=>p_rst);
end architecture STRUCTURAL;

-- du.vhd
-- This is the Delay Unit for the propagation of the image data
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;

entity DU is
  port(
    clk, rst: in std_logic;
    ids_in: in std_logic_vector(39 downto 0);
    du_out: out std_logic_vector(39 downto 0));
end entity DU;

architecture STRUCTURAL of DU is
  signal p1: std_logic_vector(31 downto 0);
  signal p2, p3: std_logic_vector(23 downto 0);
  signal p4, p5: std_logic_vector(15 downto 0);
  signal p6, p7: std_logic_vector(7 downto 0);
begin
  L1: du_out(7 downto 0) <= ids_in(7 downto 0);
  PL1: REG_P generic map(n=>32) port map(clk=>clk, rst=>rst, d_in=>ids_in(39 downto 8), d_out=>p1);
  L2: du_out(15 downto 8) <= p1(7 downto 0);
  PL2: REG_P generic map(n=>24) port map(clk=>clk, rst=>rst, d_in=>p1(31 downto 8), d_out=>p2);
  PL3: REG_P generic map(n=>24) port map(clk=>clk, rst=>rst, d_in=>p2, d_out=>p3);
  L3: du_out(23 downto 16) <= p3(7 downto 0);
  PL4: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>p3(23 downto 8), d_out=>p4);
  PL5: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>p4, d_out=>p5);
  L4: du_out(31 downto 24) <= p5(7 downto 0);
  PL6: REG_P generic map(n=>8) port map(clk=>clk, rst=>rst, d_in=>p5(15 downto 8), d_out=>p6);
  PL7: REG_P generic map(n=>8) port map(clk=>clk, rst=>rst, d_in=>p6, d_out=>p7);
  L5: du_out(39 downto 32) <= p7;
end architecture STRUCTURAL;

-- maus.vhd
library IEEE;
use IEEE.std_logic_1164.all;

entity MAUS is
  port(
    clk, rst: in std_logic;
    ld_reg: in std_logic_vector(4 downto 0);
    coef: in std_logic_vector(5 downto 0);
    img_pxl: in std_logic_vector(39 downto 0);
    p_rst: out std_logic_vector(16 downto 0));
end entity MAUS;

architecture STRUCTURAL of MAUS is
  component MAU_0 is
    port(
      clk, rst, ld_reg: in std_logic;
      coef: in std_logic_vector(5 downto 0);
      img_pxl: in std_logic_vector(7 downto 0);
      p_res: out std_logic_vector(13 downto 0));
  end component MAU_0;
  component MAU_1 is
    port(
      clk, rst, ld_reg: in std_logic;
      coef: in std_logic_vector(5 downto 0);     -- filter coefficient
      img_pxl: in std_logic_vector(7 downto 0);  -- image pixel from DU
      p_mau: in std_logic_vector(13 downto 0);   -- previous MAU output
      p_res: out std_logic_vector(14 downto 0)); -- partial result to next MAU
  end component MAU_1;
  component MAU_2 is
    port(
      clk, rst, ld_reg: in std_logic;
      coef: in std_logic_vector(5 downto 0);     -- filter coefficient
      img_pxl: in std_logic_vector(7 downto 0);  -- image pixel from DU
      p_mau: in std_logic_vector(14 downto 0);   -- previous MAU output
      p_res: out std_logic_vector(15 downto 0)); -- partial result to next MAU
  end component MAU_2;
  component MAU_3 is
    port(
      clk, rst, ld_reg: in std_logic;
      coef: in std_logic_vector(5 downto 0);     -- filter coefficient
      img_pxl: in std_logic_vector(7 downto 0);  -- image pixel from DU
      p_mau: in std_logic_vector(15 downto 0);   -- previous MAU output
      p_res: out std_logic_vector(15 downto 0)); -- partial result to next MAU
  end component MAU_3;
  component MAU_4 is
    port(
      clk, rst, ld_reg: in std_logic;
      coef: in std_logic_vector(5 downto 0);     -- filter coefficient
      img_pxl: in std_logic_vector(7 downto 0);  -- image pixel from DU
      p_mau: in std_logic_vector(15 downto 0);   -- previous MAU output
      p_res: out std_logic_vector(16 downto 0)); -- partial result to next MAU
  end component MAU_4;
  signal p_res1: std_logic_vector(13 downto 0);
  signal p_res2: std_logic_vector(14 downto 0);
  signal p_res3, p_res4: std_logic_vector(15 downto 0);
begin
  U0: MAU_0 port map(clk=>clk, rst=>rst, ld_reg=>ld_reg(0), coef=>coef, img_pxl=>img_pxl(7 downto 0), p_res=>p_res1);
  U1: MAU_1 port map(clk=>clk, rst=>rst, ld_reg=>ld_reg(1), coef=>coef, img_pxl=>img_pxl(15 downto 8), p_mau=>p_res1, p_res=>p_res2);
  U2: MAU_2 port map(clk=>clk, rst=>rst, ld_reg=>ld_reg(2), coef=>coef, img_pxl=>img_pxl(23 downto 16), p_mau=>p_res2, p_res=>p_res3);
  U3: MAU_3 port map(clk=>clk, rst=>rst, ld_reg=>ld_reg(3), coef=>coef, img_pxl=>img_pxl(31 downto 24), p_mau=>p_res3, p_res=>p_res4);
  U4: MAU_4 port map(clk=>clk, rst=>rst, ld_reg=>ld_reg(4), coef=>coef, img_pxl=>img_pxl(39 downto 32), p_mau=>p_res4, p_res=>p_rst);
end architecture STRUCTURAL;

-- mau_0.vhd
-- This is the first MAU of the MAUs (for one systolic array).
-- This MAU only contains a Multiplication unit and no Adder since there is no previous
-- MAU output that needs to be accumulated. The range of multiplication is within
-- -8160 to 7905 (decimal), hence the output (p_res) is a 14-bit word.
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;

entity MAU_0 is
  port(
    clk, rst, ld_reg: in std_logic;
    coef: in std_logic_vector(5 downto 0);     -- Filter coefficient
    img_pxl: in std_logic_vector(7 downto 0);  -- Image pixels
    p_res: out std_logic_vector(13 downto 0)); -- Partial result to next MAU
end entity MAU_0;

architecture BEHAVIORAL of MAU_0 is
  signal coef_reg: std_logic_vector(5 downto 0);
  signal product: std_logic_vector(13 downto 0);
begin
  U1: MULT port map(a=>img_pxl, b=>coef_reg, p=>product);
  R1: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>product, d_out=>p_res);

  STORE: process (clk, rst, ld_reg) is
  begin
    if (rst = '1') then
      coef_reg <= "000000";
    elsif (rising_edge(clk)) then
      if (ld_reg = '1') then
        coef_reg <= coef;
      else
        coef_reg <= coef_reg;
      end if;
    end if;
  end process STORE;
end architecture BEHAVIORAL;
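The 14-bit claim in the MAU_0 comment can be checked directly: an 8-bit unsigned pixel (0 to 255) times a 6-bit two's-complement coefficient (-32 to 31) spans exactly -8160 to 7905, which fits a 14-bit signed word (-8192 to 8191). A quick Python check (a sketch for cross-checking, not part of the design):

```python
# Product range of MAU_0: 8-bit unsigned pixel x 6-bit signed coefficient.
pixels = range(0, 256)   # 8-bit unsigned image pixel
coefs = range(-32, 32)   # 6-bit two's-complement filter coefficient

lo = min(p * c for p in pixels for c in coefs)
hi = max(p * c for p in pixels for c in coefs)
print(lo, hi)                              # -8160 7905, as stated in the comment
print(-2**13 <= lo and hi <= 2**13 - 1)    # True: fits a 14-bit signed word
```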
-- mau_1.vhd
-- This is the second MAU of the MAUs. The range of this MAU is between
-- -16320 and 15810 (decimal), hence a 15-bit word is used for the partial result.
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;

entity MAU_1 is
  port(
    clk, rst, ld_reg: in std_logic;
    coef: in std_logic_vector(5 downto 0);     -- filter coefficient
    img_pxl: in std_logic_vector(7 downto 0);  -- image pixel from DU
    p_mau: in std_logic_vector(13 downto 0);   -- previous MAU output
    p_res: out std_logic_vector(14 downto 0)); -- partial result to next MAU
end entity MAU_1;

architecture BEHAVIORAL of MAU_1 is
  signal coef_reg: std_logic_vector(5 downto 0);
  signal pl1, pl2, product: std_logic_vector(13 downto 0);
  signal sum: std_logic_vector(14 downto 0);
begin
  R1: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>p_mau, d_out=>pl1);
  U1: MULT port map(a=>img_pxl, b=>coef_reg, p=>product);
  R2: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>product, d_out=>pl2);
  U2: CLA_15 port map(a=>pl1, b=>pl2, s=>sum);
  R3: REG_P generic map(n=>15) port map(clk=>clk, rst=>rst, d_in=>sum, d_out=>p_res);

  STORE: process (clk, rst, ld_reg) is
  begin
    if (rst = '1') then
      coef_reg <= "000000";
    elsif (rising_edge(clk)) then
      if (ld_reg = '1') then
        coef_reg <= coef;
      else
        coef_reg <= coef_reg;
      end if;
    end if;
  end process STORE;
end architecture BEHAVIORAL;

-- mau_2.vhd
-- This is the third MAU of the MAUs systolic array.
-- The range of the MAU is between -24480 and 23715 (decimal), thus
-- a 16-bit word is used for the partial result bus.
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;

entity MAU_2 is
  port(
    clk, rst, ld_reg: in std_logic;
    coef: in std_logic_vector(5 downto 0);     -- filter coefficient
    img_pxl: in std_logic_vector(7 downto 0);  -- image pixel from DU
    p_mau: in std_logic_vector(14 downto 0);   -- previous MAU output
    p_res: out std_logic_vector(15 downto 0)); -- partial result to next MAU
end entity MAU_2;

architecture BEHAVIORAL of MAU_2 is
  signal coef_reg: std_logic_vector(5 downto 0);
  signal product: std_logic_vector(13 downto 0);
  signal pl1, pl2, sum: std_logic_vector(15 downto 0);
begin
  L1: pl2(15 downto 14) <= "00";
  L2: pl1(15) <= '0';
  R1: REG_P generic map(n=>15) port map(clk=>clk, rst=>rst, d_in=>p_mau, d_out=>pl1(14 downto 0));
  U1: MULT port map(a=>img_pxl, b=>coef_reg, p=>product);
  R2: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>product, d_out=>pl2(13 downto 0));
  U2: CLA_16 port map(a=>pl1, b=>pl2, s=>sum);
  R3: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>sum, d_out=>p_res);

  STORE: process (clk, rst, ld_reg) is
  begin
    if (rst = '1') then
      coef_reg <= "000000";
    elsif (rising_edge(clk)) then
      if (ld_reg = '1') then
        coef_reg <= coef;
      else
        coef_reg <= coef_reg;
      end if;
    end if;
  end process STORE;
end architecture BEHAVIORAL;

-- mau_3.vhd
-- This is the fourth MAU within the MAUs. The range of the MAU is between
-- -32640 and 31620 (decimal), thus a 16-bit word bus is used for the partial
-- result coming out of this MAU.
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;

entity MAU_3 is
  port(
    clk, rst, ld_reg: in std_logic;
    coef: in std_logic_vector(5 downto 0);     -- filter coefficient
    img_pxl: in std_logic_vector(7 downto 0);  -- image pixel from DU
    p_mau: in std_logic_vector(15 downto 0);   -- previous MAU output
    p_res: out std_logic_vector(15 downto 0)); -- partial result to next MAU
end entity MAU_3;

architecture BEHAVIORAL of MAU_3 is
  signal coef_reg: std_logic_vector(5 downto 0);
  signal product: std_logic_vector(13 downto 0);
  signal pl1, pl2, sum: std_logic_vector(15 downto 0);
begin
  L1: pl2(15 downto 14) <= "00";
  R1: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>p_mau, d_out=>pl1);
  U1: MULT port map(a=>img_pxl, b=>coef_reg, p=>product);
  R2: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>product, d_out=>pl2(13 downto 0));
  U2: CLA_16 port map(a=>pl1, b=>pl2, s=>sum);
  R3: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>sum, d_out=>p_res);

  STORE: process (clk, rst, ld_reg) is
  begin
    if (rst = '1') then
      coef_reg <= "000000";
    elsif (rising_edge(clk)) then
      if (ld_reg = '1') then
        coef_reg <= coef;
      else
        coef_reg <= coef_reg;
      end if;
    end if;
  end process STORE;
end architecture BEHAVIORAL;

-- mau_4.vhd
-- This is the last MAU within the MAUs systolic array. The range for this MAU
-- is between -40800 and 39525 (decimal), thus a 17-bit word bus is used for the
-- partial result coming out of this MAU.
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;

entity MAU_4 is
  port(
    clk, rst, ld_reg: in std_logic;
    coef: in std_logic_vector(5 downto 0);     -- filter coefficient
    img_pxl: in std_logic_vector(7 downto 0);  -- image pixel from DU
    p_mau: in std_logic_vector(15 downto 0);   -- previous MAU output
    p_res: out std_logic_vector(16 downto 0)); -- partial result to next MAU
end entity MAU_4;

architecture BEHAVIORAL of MAU_4 is
  signal coef_reg: std_logic_vector(5 downto 0);
  signal product: std_logic_vector(13 downto 0);
  signal pl1, pl2: std_logic_vector(15 downto 0);
  signal sum: std_logic_vector(16 downto 0);
begin
  L1: pl2(15 downto 14) <= "00";
  R1: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>p_mau, d_out=>pl1);
  U1: MULT port map(a=>img_pxl, b=>coef_reg, p=>product);
  R2: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>product, d_out=>pl2(13 downto 0));
  U2: CLA_17 port map(a=>pl1, b=>pl2, s=>sum);
  R3: REG_P generic map(n=>17) port map(clk=>clk, rst=>rst, d_in=>sum, d_out=>p_res);

  STORE: process (clk, rst, ld_reg) is
  begin
    if (rst = '1') then
      coef_reg <= "000000";
    elsif (rising_edge(clk)) then
      if (ld_reg = '1') then
        coef_reg <= coef;
      else
        coef_reg <= coef_reg;
      end if;
    end if;
  end process STORE;
end architecture BEHAVIORAL;
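The bus widths quoted in the MAU_0 through MAU_4 comments (14, 15, 16, 16, 17 bits) follow from the fact that stage k of the chain holds k accumulated products, each in -8160..7905. A short Python check of those ranges, and of the 19-bit adding-tree output (a sketch; the width helper is not part of the design):

```python
def signed_bits(lo, hi):
    """Smallest two's-complement width holding all of [lo, hi] (hypothetical helper)."""
    n = 1
    while not (-2**(n - 1) <= lo and hi <= 2**(n - 1) - 1):
        n += 1
    return n

# One product spans -8160..7905; stage k of the MAU chain holds k such products.
widths = [signed_bits(k * -8160, k * 7905) for k in range(1, 6)]
print(widths)   # [14, 15, 16, 16, 17] -> matches p_res widths of MAU_0..MAU_4

# The adding tree sums five 17-bit MAA outputs (25 products in total).
print(signed_bits(25 * -8160, 25 * 7905))   # 19 -> matches the 19-bit out_pxl
```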
7. Multiplication and Adder Units (These functional units have been defined as a library package) library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; package bfu_pckg is
component CLA_15 is port( a, b: in std_logic_vector(13 downto 0); s: out std_logic_vector(14 downto 0)); end component CLA_15; component CLA_16 is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(15 downto 0)); end component CLA_16; component CLA_17 is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(16 downto 0)); end component CLA_17; component CLA_19 is port(a, b: in std_logic_vector(18 downto 0); s: out std_logic_vector(18 downto 0); ovf: out std_logic); end component CLA_19; component MULT is port(a: in std_logic_vector(7 downto 0); b: in std_logic_vector(5 downto 0); p: out std_logic_vector(13 downto 0)); end component MULT; component FA is port(a, b, cin: in std_logic; s, cout: out std_logic); end component FA; component HA is port( a, b: in std_logic; s, cout: out std_logic); end component HA; component CSA is generic(n: positive := 5); port( a, b, c: in std_logic_vector(n-1 downto 0); sum: out std_logic_vector(n-1 downto 0); carry: out std_logic_vector(n downto 0)); end component CSA; component REG_P is generic(n: positive := 5); port( clk, rst: in std_logic; d_in: in std_logic_vector(n-1 downto 0); d_out: out std_logic_vector(n-1 downto 0)); end component REG_P; component REG_N is generic(n: positive := 5); port( clk, rst: in std_logic; d_in: in std_logic_vector(n-1 downto 0); d_out: out std_logic_vector(n-1 downto 0)); end component REG_N; end package bfu_pckg; -- mult.vhd (Multiplication Unit) library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity MULT is
port(a: in std_logic_vector(7 downto 0); b: in std_logic_vector(5 downto 0); p: out std_logic_vector(13 downto 0)); end entity MULT; architecture STRUCT of MULT is component PPG is generic(n: integer := 8); port( a: in std_logic_vector(n-1 downto 0); mult: in std_logic_vector(2 downto 0); pp: out std_logic_vector(n downto 0); spp: out std_logic); end component PPG; component R3_2C is generic(n: integer := 14); port(a, b, c: in std_logic_vector(n-1 downto 0); sum: out std_logic_vector(n-1 downto 0); carry: out std_logic_vector(n downto 0)); end component R3_2C; component S3_2C is port(pp1, pp2, pp3: in std_logic_vector(8 downto 0); sp1, sp2, sp3: in std_logic; sum, carry: out std_logic_vector(13 downto 0)); end component S3_2C; component CLA_14 is port( a, b: in std_logic_vector(13 downto 0); s: out std_logic_vector(13 downto 0)); end component CLA_14; signal sp1, sp2, sp3: std_logic; signal ls: std_logic_vector(2 downto 0); signal pp1, pp2, pp3: std_logic_vector(8 downto 0); signal pp4, sum1, sum2, carry1: std_logic_vector(13 downto 0); signal carry2: std_logic_vector(14 downto 0); begin L1: ls <= b(1 downto 0) & '0'; U1: PPG port map(a=>a, mult=>ls, pp=>pp1, spp=>sp1); U2: PPG port map(a=>a, mult=>b(3 downto 1), pp=>pp2, spp=>sp2); U3: PPG port map(a=>a, mult=>b(5 downto 3), pp=>pp3, spp=>sp3); U4: S3_2C port map(pp1=>pp1, pp2=>pp2, pp3=>pp3, sp1=>sp1, sp2=>sp2, sp3=>sp3, sum=>sum1, carry=>carry1); L2: pp4 <= "000000000" & sp3 & '0' & sp2 & '0' & sp1; U5: R3_2C port map(a=>sum1, b=>carry1, c=>pp4, sum=>sum2, carry=>carry2); U6: CLA_14 port map(a=>sum2, b=>carry2(13 downto 0), s=>p); end architecture STRUCT; -- ppg.vhd (Partial Product Generator) library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity PPG is generic(n: integer := 8); port( a: in std_logic_vector(n-1 downto 0); mult: in std_logic_vector(2 downto 0); pp: out std_logic_vector(n downto 0); spp: out std_logic); end entity PPG;
architecture BEHAVIORAL of PPG is begin PP_PROC: process(mult, a) begin case mult is when "000" => for k in n downto 0 loop pp(k) <= '0'; end loop; spp <= '0'; when "001" => pp <= '0' & a; spp <= '0'; when "010" => pp <= '0' & a; spp <= '0'; when "011" => pp <= a & '0'; spp <= '0'; when "100" => pp <= not(a & '0'); spp <= '1'; when "101" => pp <= not('0' & a); spp <= '1'; when "110" => pp <= not('0' & a); spp <= '1'; when "111" => for l in n downto 0 loop pp(l) <= '0'; end loop; spp <= '0'; when others => null; end case; end process PP_PROC; end architecture BEHAVIORAL; -- r3_2c.vhd (Second level 3 to 2 Counter for Multiplier) library IEEE; use IEEE.std_logic_1164.all; entity R3_2C is generic(n: integer := 14); port(a, b, c: in std_logic_vector(n-1 downto 0); -- a->sum, b->carry c->pp4 sum: out std_logic_vector(n-1 downto 0); carry: out std_logic_vector(n downto 0)); end entity R3_2C; architecture STRUCTURAL of R3_2C is component HA is port( a, b: in std_logic; s, cout: out std_logic); end component HA; component FA is port(a, b, cin: in std_logic; s, cout: out std_logic); end component FA; begin L1: carry(0) <= '0'; LK: for k in 2 downto 0 generate HAK: HA port map(a=>a(k), b=>c(k), s=>sum(k), cout=>carry(k+1)); end generate LK; L2: HA port map(a=>a(3), b=>b(3), s=>sum(3), cout=>carry(4)); L3: FA port map(a=>a(4), b=>b(4), cin=>c(4), s=>sum(4), cout=>carry(5)); LF: for f in n-1 downto 5 generate
HAF: HA port map(a=>a(f), b=>b(f), s=>sum(f), cout=>carry(f+1)); end generate LF; end architecture STRUCTURAL; -- s3_2c.vhd (Special 3 to 2 Counter) library IEEE; use IEEE.std_logic_1164.all; entity S3_2C is port(pp1, pp2, pp3: in std_logic_vector(8 downto 0); sp1, sp2, sp3: in std_logic; sum, carry: out std_logic_vector(13 downto 0)); end entity S3_2C; architecture STRUCTURAL of S3_2C is component HA is port( a, b: in std_logic; s, cout: out std_logic); end component HA; component FA is port(a, b, cin: in std_logic; s, cout: out std_logic); end component FA; signal high, n_sp1, n_sp2, n_sp3: std_logic; begin G0: high <= '1'; G1: n_sp1 <= not sp1; G2: n_sp2 <= not sp2; G3: n_sp3 <= not sp3; L1: sum(1 downto 0) <= pp1(1 downto 0); L2: carry(2 downto 0) <= "000"; L3: sum(13) <= n_sp3; LK: for k in 1 downto 0 generate HAA: HA port map(a=>pp1(k+2), b=>pp2(k), s=>sum(k+2), cout=>carry(k+3)); end generate LK; LG: for g in 4 downto 0 generate FAG: FA port map(a=>pp1(g+4), b=>pp2(g+2), cin=>pp3(g), s=>sum(g+4), cout=>carry(g+5)); end generate LG; LF: for f in 6 downto 5 generate FAF: FA port map(a=>sp1, b=>pp2(f+2), cin=>pp3(f), s=>sum(f+4), cout=>carry(f+5)); end generate LF; FA7: FA port map(a=>n_sp1, b=>n_sp2, cin=>pp3(7), s=>sum(11), cout=>carry(12)); HA2: HA port map(a=>high, b=>pp3(8), s=>sum(12), cout=>carry(13)); end architecture STRUCTURAL; -- cla_16.vhd (Carry Lookahead Adder ~ 16 Bits) library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity CLA_16 is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(15 downto 0)); end entity CLA_16; architecture STRUCTURAL of CLA_16 is
component CLA_4_1 is port( a, b: in std_logic_vector(3 downto 0); s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4_1; component CLA_4 is port( a, b: in std_logic_vector(3 downto 0); cin: in std_logic; s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4; component CLL_2L is port(p, g: in std_logic_vector(2 downto 0); cout: out std_logic_vector(3 downto 1)); end component CLL_2L; signal p, g: std_logic_vector(3 downto 0); signal c: std_logic_vector(3 downto 1); begin U1: CLA_4_1 port map(a=>a(3 downto 0), b=>b(3 downto 0), s=>s(3 downto 0), p_out=>p(0), g_out=>g(0)); U2: CLA_4 port map(a=>a(7 downto 4), b=>b(7 downto 4), cin=>c(1), s=>s(7 downto 4), p_out=>p(1), g_out=>g(1)); U3: CLA_4 port map(a=>a(11 downto 8), b=>b(11 downto 8), cin=>c(2), s=>s(11 downto 8), p_out=>p(2), g_out=>g(2)); U4: CLA_4 port map(a=>a(15 downto 12), b=>b(15 downto 12), cin=>c(3), s=>s(15 downto 12), p_out=>p(3), g_out=>g(3)); U5: CLL_2L port map(p=>p(2 downto 0), g=>g(2 downto 0), cout=>c); end architecture STRUCTURAL; -- cla_19.vhd (19-bit Carry Lookahead adder) library IEEE; use IEEE.std_logic_1164.all; entity CLA_19 is port(a, b: in std_logic_vector(18 downto 0); s: out std_logic_vector(18 downto 0); ovf: out std_logic); end entity CLA_19; architecture STRUCTURAL of CLA_19 is component CLA_3S is port(a, b: in std_logic_vector(2 downto 0); cin: in std_logic; s: out std_logic_vector(2 downto 0); ovf: out std_logic); end component CLA_3S; component CLA_17 is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(16 downto 0)); end component CLA_17; signal cout, high, low, ovf0, ovf1: std_logic; signal s1, s2: std_logic_vector(2 downto 0); begin L1: high <= '1'; L2: low <= '0'; U1: CLA_17 port map(a=>a(15 downto 0), b=>b(15 downto 0), s(15 downto 0)=>s(15 downto 0), s(16)=>cout); U2: CLA_3S port map(a=>a(18 downto 16), b=>b(18 downto 16), cin=>high, s=>s1, ovf=>ovf1);
U3: CLA_3S port map(a=>a(18 downto 16), b=>b(18 downto 16), cin=>low, s=>s2, ovf=>ovf0); L3: s(18 downto 16) <= s1 when cout = '1' else s2; L4: ovf <= ovf1 when cout='1' else ovf0; end architecture STRUCTURAL; -- cla_17.vhd (Carry Lookahead Adder ~ 17 Bits) -- This Carry Lookahead adder adds two 16-bit numbers and generates a 17-bit sum. library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity CLA_17 is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(16 downto 0)); end entity CLA_17; architecture STRUCTURAL of CLA_17 is component CLA_16_P is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(15 downto 0); p_out, g_out: out std_logic); end component CLA_16_P; signal p_4, g_4: std_logic; begin U1: CLA_16_P port map(a=>a, b=>b, s=>s(15 downto 0), p_out=>p_4, g_out=>g_4); L3: s(16) <= g_4; end architecture STRUCTURAL; -- cla_16_p.vhd (Carry Lookahead Adder ~ 16 Bits) with p and g library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity CLA_16_P is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(15 downto 0); p_out, g_out: out std_logic); end entity CLA_16_P; architecture STRUCTURAL of CLA_16_P is component CLA_4_1 is port( a, b: in std_logic_vector(3 downto 0); s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4_1; component CLA_4 is port( a, b: in std_logic_vector(3 downto 0); cin: in std_logic; s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4; component CLL_2 is port( p, g: in std_logic_vector(3 downto 0); cout: out std_logic_vector(3 downto 1); p_out, g_out: out std_logic); end component CLL_2; signal p, g: std_logic_vector(3 downto 0);
signal c: std_logic_vector(3 downto 1); begin U1: CLA_4_1 port map(a=>a(3 downto 0), b=>b(3 downto 0), s=>s(3 downto 0), p_out=>p(0), g_out=>g(0)); U2: CLA_4 port map(a=>a(7 downto 4), b=>b(7 downto 4), cin=>c(1), s=>s(7 downto 4), p_out=>p(1), g_out=>g(1)); U3: CLA_4 port map(a=>a(11 downto 8), b=>b(11 downto 8), cin=>c(2), s=>s(11 downto 8), p_out=>p(2), g_out=>g(2)); U4: CLA_4 port map(a=>a(15 downto 12), b=>b(15 downto 12), cin=>c(3), s=>s(15 downto 12), p_out=>p(3), g_out=>g(3)); U5: CLL_2 port map(p=>p, g=>g, cout=>c, p_out=>p_out, g_out=>g_out); end architecture STRUCTURAL; -- -- cll_2.vhd (2nd Level of Carry Lookahead Logic - for 4 bits) with P, G output library IEEE; use IEEE.std_logic_1164.all; entity CLL_2 is port( p, g: in std_logic_vector(3 downto 0); cout: out std_logic_vector(3 downto 1); p_out, g_out: out std_logic); end entity CLL_2; architecture BEHAVIORAL of CLL_2 is begin L1: cout(1) <= g(0); L2: cout(2) <= g(1) or (p(1) and g(0)); L3: cout(3) <= g(2) or (p(2) and g(1)) or (p(2) and p(1) and g(0)); L4: p_out <= p(3) and p(2) and p(1) and p(0); L5: g_out <= g(3) or (p(3) and g(2)) or (p(3) and p(2) and g(1)) or (p(3) and p(2) and p(1) and g(0)); end architecture BEHAVIORAL; -- cla_15.vhd (Carry Lookahead Adder ~ 15 Bits) -- This adder adds two 14-bit numbers and produces a 15-bit sum. library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity CLA_15 is port( a, b: in std_logic_vector(13 downto 0); s: out std_logic_vector(14 downto 0)); end entity CLA_15; architecture STRUCTURAL of CLA_15 is component CLA_4_1 is port( a, b: in std_logic_vector(3 downto 0); s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4_1; component CLA_4 is port( a, b: in std_logic_vector(3 downto 0); cin: in std_logic; s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4; component CLA_3 is port(a, b: in std_logic_vector(1 downto 0); cin: in
std_logic;
s: out std_logic_vector(2 downto 0)); end component CLA_3; component CLL_2L is port(p, g: in std_logic_vector(2 downto 0); cout: out std_logic_vector(3 downto 1)); end component CLL_2L; signal p, g: std_logic_vector(2 downto 0); signal c: std_logic_vector(3 downto 1); begin U1: CLA_4_1 port map(a=>a(3 downto 0), b=>b(3 downto 0), s=>s(3 downto 0), p_out=>p(0), g_out=>g(0)); U2: CLA_4 port map(a=>a(7 downto 4), b=>b(7 downto 4), cin=>c(1), s=>s(7 downto 4), p_out=>p(1), g_out=>g(1)); U3: CLA_4 port map(a=>a(11 downto 8), b=>b(11 downto 8), cin=>c(2), s=>s(11 downto 8), p_out=>p(2), g_out=>g(2)); U4: CLA_3 port map(a=>a(13 downto 12), b=>b(13 downto 12), cin=>c(3), s=>s(14 downto 12)); U5: CLL_2L port map(p=>p, g=>g, cout=>c); end architecture STRUCTURAL; -- -- cla_3.vhd (Carry Lookahead Adder ~ 3 Bits (Last 3 Bits)) library IEEE; use IEEE.std_logic_1164.all; entity CLA_3 is port(a, b: in std_logic_vector(1 downto 0); cin: in std_logic; s: out std_logic_vector(2 downto 0)); end entity CLA_3; architecture STRUCTURAL of CLA_3 is component SCLL_3 is port( cin: in std_logic; p, g: in std_logic_vector(1 downto 0); cout: out std_logic_vector(2 downto 0)); end component SCLL_3; component PFA is port(a, b, c: in std_logic; s, g, p: out std_logic); end component PFA; signal p, g: std_logic_vector(1 downto 0); signal c: std_logic_vector(2 downto 0); begin U0: SCLL_3 port map(cin=>cin, p=>p, g=>g, cout=>c); U1: PFA port map(a=>a(0), b=>b(0), c=>c(0), s=>s(0), g=>g(0), p=>p(0)); U2: PFA port map(a=>a(1), b=>b(1), c=>c(1), s=>s(1), g=>g(1), p=>p(1)); L1: s(2) <= c(2); end architecture STRUCTURAL; -- -- scll_3.vhd (Carry Lookahead Logic for Bit position 13 & 14) library IEEE; use IEEE.std_logic_1164.all; entity SCLL_3 is port( cin: in std_logic; p, g: in std_logic_vector(1 downto 0); cout: out std_logic_vector(2 downto 0)); end entity SCLL_3;
architecture BEHAVIORAL of SCLL_3 is begin L1: cout(0) <= cin; L2: cout(1) <= g(0) or (p(0) and cin); L3: cout(2) <= g(1) or (p(1) and g(0)) or (p(1) and p(0) and cin); end architecture BEHAVIORAL; -- cla_14.vhd (Carry Lookahead Adder ~ 14 Bits) library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity CLA_14 is port( a, b: in std_logic_vector(13 downto 0); s: out std_logic_vector(13 downto 0)); end entity CLA_14; architecture STRUCTURAL of CLA_14 is component CLA_4_1 is port( a, b: in std_logic_vector(3 downto 0); s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4_1; component CLA_4 is port( a, b: in std_logic_vector(3 downto 0); cin: in std_logic; s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4; component CLA_2 is port(a, b: in std_logic_vector(1 downto 0); cin: in std_logic; s: out std_logic_vector(1 downto 0)); end component CLA_2; component CLL_2L is port(p, g: in std_logic_vector(2 downto 0); cout: out std_logic_vector(3 downto 1)); end component CLL_2L; signal p, g: std_logic_vector(2 downto 0); signal c: std_logic_vector(3 downto 1); begin U1: CLA_4_1 port map(a=>a(3 downto 0), b=>b(3 downto 0), s=>s(3 downto 0), p_out=>p(0), g_out=>g(0)); U2: CLA_4 port map(a=>a(7 downto 4), b=>b(7 downto 4), cin=>c(1), s=>s(7 downto 4), p_out=>p(1), g_out=>g(1)); U3: CLA_4 port map(a=>a(11 downto 8), b=>b(11 downto 8), cin=>c(2), s=>s(11 downto 8), p_out=>p(2), g_out=>g(2)); U4: CLA_2 port map(a=>a(13 downto 12), b=>b(13 downto 12), cin=>c(3), s=>s(13 downto 12)); U5: CLL_2L port map(p=>p, g=>g, cout=>c); end architecture STRUCTURAL; -- cla_3s.vhd (Carry Lookahead Adder ~ 3 Bits (For CLA_19)) library IEEE; use IEEE.std_logic_1164.all; entity CLA_3S is
port(a, b: in std_logic_vector(2 downto 0); cin: in std_logic; s: out std_logic_vector(2 downto 0); ovf: out std_logic); end entity CLA_3S; architecture STRUCTURAL of CLA_3S is component SCLL_4 is port( cin: in std_logic; p, g: in std_logic_vector(2 downto 0); cout: out std_logic_vector(3 downto 0)); end component SCLL_4; component PFA is port(a, b, c: in std_logic; s, g, p: out std_logic); end component PFA; signal p, g: std_logic_vector(2 downto 0); signal c: std_logic_vector(3 downto 0); begin U0: SCLL_4 port map(cin=>cin, p=>p, g=>g, cout=>c); U1: PFA port map(a=>a(0), b=>b(0), c=>c(0), s=>s(0), g=>g(0), p=>p(0)); U2: PFA port map(a=>a(1), b=>b(1), c=>c(1), s=>s(1), g=>g(1), p=>p(1)); U3: PFA port map(a=>a(2), b=>b(2), c=>c(2), s=>s(2), g=>g(2), p=>p(2)); L1: ovf <= c(2) xor c(3); end architecture STRUCTURAL; -- scll_4.vhd (Carry Lookahead Logic for Bit position 17, 18, & 19) library IEEE; use IEEE.std_logic_1164.all; entity SCLL_4 is port( cin: in std_logic; p, g: in std_logic_vector(2 downto 0); cout: out std_logic_vector(3 downto 0)); end entity SCLL_4; architecture BEHAVIORAL of SCLL_4 is begin L1: cout(0) <= cin; L2: cout(1) <= g(0) or (p(0) and cin); L3: cout(2) <= g(1) or (p(1) and g(0)) or (p(1) and p(0) and cin); L4: cout(3) <= g(2) or (p(2) and g(1)) or (p(2) and p(1) and g(0)) or (p(2) and p(1) and p(0) and cin); end architecture BEHAVIORAL; -- cla_4.vhd (Carry Lookahead Adder ~ 4 bits) library IEEE; use IEEE.std_logic_1164.all; entity CLA_4 is port( a, b: in std_logic_vector(3 downto 0); cin: in std_logic; s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end entity CLA_4; architecture STRUCTURAL of CLA_4 is component PFA is port(a, b, c: in std_logic; s, g, p: out std_logic); end component PFA;
component CLL is port( cin: in std_logic; p, g: in std_logic_vector(3 downto 0); cout: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLL; signal c, g, p: std_logic_vector(3 downto 0); begin L1: CLL port map(cin=>cin, p=>p, g=>g, cout=>c, p_out=>p_out, g_out=>g_out); LK: for k in 3 downto 0 generate PFAK: PFA port map(a=>a(k), b=>b(k), c=>c(k), s=>s(k), g=>g(k), p=>p(k)); end generate LK; end architecture STRUCTURAL; -- cla_4_1.vhd (Carry Lookahead Adder ~ 4 bits / for the first CLA_4) library IEEE; use IEEE.std_logic_1164.all; entity CLA_4_1 is port( a, b: in std_logic_vector(3 downto 0); s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end entity CLA_4_1; architecture STRUCTURAL of CLA_4_1 is component PFA is port(a, b, c: in std_logic; s, g, p: out std_logic); end component PFA; component CLL_1 is port( p, g: in std_logic_vector(3 downto 0); cout: out std_logic_vector(2 downto 0); p_out, g_out: out std_logic); end component CLL_1; signal g, p: std_logic_vector(3 downto 0); signal c: std_logic_vector(3 downto 0); begin L0: g(0) <= a(0) and b(0); L1: p(0) <= a(0) xor b(0); L2: s(0) <= p(0); L3: CLL_1 port map(p=>p, g=>g, cout=>c(3 downto 1), p_out=>p_out, g_out=>g_out); LK: for k in 3 downto 1 generate PFAK: PFA port map(a=>a(k), b=>b(k), c=>c(k), s=>s(k), g=>g(k), p=>p(k)); end generate LK; end architecture STRUCTURAL; -- -- cll_1.vhd (Carry Lookahead Logic - for first 4 bits) library IEEE; use IEEE.std_logic_1164.all; entity CLL_1 is port( p, g: in std_logic_vector(3 downto 0);
cout: out std_logic_vector(2 downto 0); p_out, g_out: out std_logic); end entity CLL_1; architecture BEHAVIORAL of CLL_1 is begin L1: cout(0) <= g(0); L2: cout(1) <= g(1) or (p(1) and g(0)); L3: cout(2) <= g(2) or (p(2) and g(1)) or (p(2) and p(1) and g(0)); L4: p_out <= p(3) and p(2) and p(1) and p(0); L5: g_out <= g(3) or (p(3) and g(2)) or (p(3) and p(2) and g(1)) or (p(3) and p(2) and p(1) and g(0)); end architecture BEHAVIORAL; -- cll.vhd (Carry Lookahead Logic - for 4 bits) library IEEE; use IEEE.std_logic_1164.all; entity CLL is port( cin: in std_logic; p, g: in std_logic_vector(3 downto 0); cout: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end entity CLL; architecture BEHAVIORAL of CLL is begin L1: cout(0) <= cin; L2: cout(1) <= g(0) or (p(0) and cin); L3: cout(2) <= g(1) or (p(1) and g(0)) or (p(1) and p(0) and cin); L4: cout(3) <= g(2) or (p(2) and g(1)) or (p(2) and p(1) and g(0)) or (p(2) and p(1) and p(0) and cin); L5: p_out <= p(3) and p(2) and p(1) and p(0); L6: g_out <= g(3) or (p(3) and g(2)) or (p(3) and p(2) and g(1)) or (p(3) and p(2) and p(1) and g(0)); end architecture BEHAVIORAL; -- cll_2l.vhd (2nd Level of Carry Lookahead Logic - for 4 bits) library IEEE; use IEEE.std_logic_1164.all; entity CLL_2L is port(p, g: in std_logic_vector(2 downto 0); cout: out std_logic_vector(3 downto 1)); end entity CLL_2L; architecture BEHAVIORAL of CLL_2L is begin L1: cout(1) <= g(0); L2: cout(2) <= g(1) or (p(1) and g(0)); L3: cout(3) <= g(2) or (p(2) and g(1)) or (p(2) and p(1) and g(0)); end architecture BEHAVIORAL; -- csa.vhd (Carry Save Adder) library IEEE; use IEEE.std_logic_1164.all; entity CSA is generic(n: positive := 5); port( a, b, c: in std_logic_vector(n-1 downto 0);
sum: out std_logic_vector(n-1 downto 0); carry: out std_logic_vector(n downto 0)); end entity CSA; architecture STRUCTURAL of CSA is component FA is port(a, b, cin: in std_logic; s, cout: out std_logic); end component FA; begin L1: carry(0) <= '0'; KL: for k in n-1 downto 0 generate FAK: FA port map(a=>a(k), b=>b(k), cin=>c(k), s=>sum(k), cout=>carry(k+1)); end generate KL; end architecture STRUCTURAL; -- -- cla_2.vhd (Carry Lookahead Adder ~ 2 Bits (Last 2 Bits)) library IEEE; use IEEE.std_logic_1164.all; entity CLA_2 is port(a, b: in std_logic_vector(1 downto 0); cin: in std_logic; s: out std_logic_vector(1 downto 0)); end entity CLA_2; architecture STRUCTURAL of CLA_2 is component SCLL is port(cin, p, g: in std_logic; cout: out std_logic); end component SCLL; component PFA is port(a, b, c: in std_logic; s, g, p: out std_logic); end component PFA; signal c, p, g: std_logic; begin U1: PFA port map(a=>a(0), b=>b(0), c=>cin, s=>s(0), g=>g, p=>p); U2: SCLL port map(cin=>cin, p=>p, g=>g, cout=>c); L1: s(1) <= a(1) xor b(1) xor c; end architecture STRUCTURAL; -- -- scll.vhd (Carry Lookahead Logic for Bit position 13) library IEEE; use IEEE.std_logic_1164.all; entity SCLL is port(cin, p, g: in std_logic; cout: out std_logic); end entity SCLL; architecture BEHAVIORAL of SCLL is begin L1: cout <= g or (p and cin);
end architecture BEHAVIORAL; -- ha.vhd (Half Adder) library IEEE; use IEEE.std_logic_1164.all; entity HA is port( a, b: in std_logic; s, cout: out std_logic); end entity HA; architecture BEHAVIORAL of HA is begin L1: s <= a xor b; L2: cout <= a and b; end architecture BEHAVIORAL; -- fa.vhd (full adder) library IEEE; use IEEE.std_logic_1164.all; entity FA is port(a, b, cin: in std_logic; s, cout: out std_logic); end entity FA; architecture BEHAVIORAL of FA is begin L1: s <= a xor b xor cin; L2: cout <= (a and b) or (a and cin) or (b and cin); end architecture BEHAVIORAL; -- pfa.vhd (Partial Full Adder) library IEEE; use IEEE.std_logic_1164.all; entity PFA is port(a, b, c: in std_logic; s, g, p: out std_logic); end entity PFA; architecture BEHAVIORAL of PFA is begin L1: s <= a xor b xor c; L2: g <= a and b; L3: p <= a xor b; end architecture BEHAVIORAL; -- reg_p.vhd (Positive edge clocked registers) library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; entity REG_P is generic(n: positive := 5); port( clk, rst: in std_logic; d_in: in std_logic_vector(n-1 downto 0); d_out: out std_logic_vector(n-1 downto 0)); end entity REG_P;
architecture BEHAVIORAL of REG_P is signal d_reg: signed(n-1 downto 0); begin STORE: process (clk, rst, d_in) is begin if (rst = '1') then d_reg <= conv_signed('0', n); elsif (rising_edge(clk)) then d_reg <= signed(d_in); end if; end process STORE; L1: d_out <= std_logic_vector(d_reg); end architecture BEHAVIORAL; -- reg_n.vhd (Negative edge clocked registers) library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; entity REG_N is generic(n: positive := 5); port( clk, rst: in std_logic; d_in: in std_logic_vector(n-1 downto 0); d_out: out std_logic_vector(n-1 downto 0)); end entity REG_N; architecture BEHAVIORAL of REG_N is signal d_reg: signed(n-1 downto 0); begin STORE: process (clk, rst, d_in) is begin if (rst = '1') then d_reg <= conv_signed('0', n); elsif (falling_edge(clk)) then d_reg <= signed(d_in); end if; end process STORE; L1: d_out <= std_logic_vector(d_reg); end architecture BEHAVIORAL;
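The PPG case table above is a radix-4 (modified Booth) recoding of the multiplier: each overlapping 3-bit group of the 6-bit coefficient selects 0, ±a, or ±2a as a partial product, with spp supplying the +1 that completes the two's-complement negation. A small behavioural model of that recoding (Python, for illustration only; not part of the thesis sources) reproduces MULT's a×b over the full operand range:

```python
# Behavioural model of the radix-4 (modified Booth) recoding implemented by
# PPG/MULT: the 6-bit coefficient is scanned in three overlapping 3-bit groups,
# and each group selects 0, +a, +2a, -a or -2a as a partial product.
RECODE = {0: 0, 1: +1, 2: +1, 3: +2, 4: -2, 5: -1, 6: -1, 7: 0}

def booth_radix4_mult(a, b, b_bits=6):
    """a: unsigned multiplicand; b: two's-complement multiplier, b_bits wide."""
    ext = (b & ((1 << b_bits) - 1)) << 1   # implicit '0' appended below bit 0,
                                           # as in "ls <= b(1 downto 0) & '0'"
    acc = 0
    for i in range(b_bits // 2):
        group = (ext >> (2 * i)) & 7       # bits (2i+1, 2i, 2i-1) of b
        acc += RECODE[group] * a * (4 ** i)
    return acc

# Exhaustive check over MULT's operand ranges (8-bit pixel, 6-bit coefficient)
assert all(booth_radix4_mult(a, b) == a * b
           for a in range(256) for b in range(-32, 32))
```

Halving the number of partial products (three groups instead of six bit-serial terms) is what allows MULT to reduce them with just the two 3-to-2 counter levels (S3_2C, R3_2C) before the final CLA_14.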
Appendix B
VHDL codes, C++ source codes and Script file for Post-Synthesis simulation
Adders C++ source code // This program generates all possible inputs to the Adders // with a selectable step (increment) #include <iostream.h> #include <iomanip.h> #include <fstream.h> int main() { ofstream out_file1, out_file2, out_file3; out_file1.open("v_a.dat"); out_file2.open("v_b.dat"); out_file3.open("v_ans.dat"); int time, delay, a, b, choice; int lo, hi, step; time = 20; delay = 0; cout << "Please enter the selection by number:" << endl; cout << "-------------------------------------" << endl; cout << "(1) CLA 14" << endl; cout << "(2) CLA 15" << endl; cout << "(3) CLA 16" << endl; cout << "(4) CLA 17" << endl; cout << "(5) CLA 19" << endl; cin >> choice; cout << endl << "Please enter the step: "; cin >> step; switch (choice) { case 1: lo = -8192; hi = 8192; break; case 2: lo = -16384; hi = 16384; break; case 3: lo = -32768; hi = 32768; break; case 4: lo = -65536; hi = 65536; break; case 5: lo = -262144; hi = 262144; break; default: lo = -8192;
hi = 8192; break; } out_file1 << "@" << 0 << "ns=" << 0 << "\\H +" << endl; out_file2 << "@" << 0 << "ns=" << 0 << "\\H +" << endl; out_file3 << "@" << 0 << "ns=" << 0 << "\\H +" << endl; for (a=lo; a<hi; a+=step) { out_file1 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << a << "\\H +" << endl; for (b=lo; b<hi; b+=step) { out_file2 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << b << "\\H +" << endl; out_file3 << setiosflags(ios::uppercase) << "@" << dec << (time + delay) << "ns=" << hex << (a + b) << "\\H +" << endl; time = time + 20; } } out_file1 << "@" << dec << time << "ns=" << 0 << "\\H" << endl; out_file2 << "@" << dec << time << "ns=" << 0 << "\\H" << endl; out_file3 << "@" << dec << (time + delay) << "ns=" << 0 << "\\H" << endl; out_file1.close(); out_file2.close(); out_file3.close(); return 0; } VHDL file for 14-bit CLA Testbench library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity TB_CLA14 is port( a, b: in std_logic_vector(13 downto 0); ans: in std_logic_vector(13 downto 0); t: inout std_logic_vector(13 downto 0); err: out std_logic ); end entity TB_CLA14; architecture BEHAV of TB_CLA14 is component CLA_14 is port( a, b: in std_logic_vector(13 downto 0); s: out std_logic_vector(13 downto 0)); end component CLA_14; begin U1: CLA_14 port map(a=>a, b=>b, s=>t); COMP: process(t, ans) is begin if (t = ans) then err <= '0'; else err <= '1'; end if; end process COMP; end architecture BEHAV;
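The four-bit adder slices used throughout the library all follow the same generate/propagate scheme: g = a AND b and p = a XOR b per bit (pfa.vhd), with the carries expanded by the lookahead equations in cll.vhd. A bit-level model of one 4-bit slice (Python, illustrative; not part of the thesis sources) confirms that the expanded carry equations reproduce ordinary addition:

```python
# Bit-level model of one 4-bit carry-lookahead slice: per-bit generate
# g = a AND b and propagate p = a XOR b (as in pfa.vhd), with the carry
# equations expanded exactly as in cll.vhd.
def cla4(a, b, cin=0):
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
    c = [cin, 0, 0, 0, 0]
    c[1] = g[0] | (p[0] & c[0])
    c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0])
    c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c[0])
    c[4] = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
            | (p[3] & p[2] & p[1] & p[0] & c[0]))
    s = sum((p[i] ^ c[i]) << i for i in range(4))  # PFA: s = p XOR carry-in
    return s, c[4]

# The lookahead equations reproduce ordinary 4-bit addition exhaustively.
assert all(cla4(a, b, c) == ((a + b + c) & 0xF, (a + b + c) >> 4)
           for a in range(16) for b in range(16) for c in (0, 1))
```

Because every carry is a two-level AND-OR function of the inputs, no carry ripples through the slice; the second-level logic (CLL_2, CLL_2L) applies the same idea across the four slices of the 14- to 17-bit adders.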
VHDL file for 15-bit CLA Testbench library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity TB_CLA15 is port( a, b: in std_logic_vector(14 downto 0); ans: in std_logic_vector(14 downto 0); t: inout std_logic_vector(14 downto 0); err: out std_logic ); end entity TB_CLA15; architecture BEHAV of TB_CLA15 is component CLA_15 is port( a, b: in std_logic_vector(14 downto 0); s: out std_logic_vector(14 downto 0)); end component CLA_15; begin U1: CLA_15 port map(a=>a, b=>b, s=>t); COMP: process(t, ans) is begin if (t = ans) then err <= '0'; else err <= '1'; end if; end process COMP; end architecture BEHAV;
VHDL file for 16-bit CLA Testbench library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity TB_CLA16 is port( a, b: in std_logic_vector(15 downto 0); ans: in std_logic_vector(15 downto 0); t: inout std_logic_vector(15 downto 0); err: out std_logic ); end entity TB_CLA16; architecture BEHAV of TB_CLA16 is component CLA_16 is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(15 downto 0)); end component CLA_16; begin U1: CLA_16 port map(a=>a, b=>b, s=>t); COMP: process(t, ans) is begin if (t = ans) then err <= '0'; else err <= '1'; end if; end process COMP;
end architecture BEHAV;
VHDL file for 17-bit CLA Testbench library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity TB_CLA17 is port( a, b: in std_logic_vector(16 downto 0); ans: in std_logic_vector(16 downto 0); t: inout std_logic_vector(16 downto 0); err: out std_logic ); end entity TB_CLA17; architecture BEHAV of TB_CLA17 is component CLA_17 is port( a, b: in std_logic_vector(16 downto 0); s: out std_logic_vector(16 downto 0)); end component CLA_17; begin U1: CLA_17 port map(a=>a, b=>b, s=>t); COMP: process(t, ans) is begin if (t = ans) then err <= '0'; else err <= '1'; end if; end process COMP; end architecture BEHAV;
VHDL file for 19-bit CLA Testbench -- tb_cla_19.vhd library IEEE, BFULIB; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; use BFULIB.bfu_pckg.all; entity TB_CLA19 is port( a, b: in std_logic_vector(18 downto 0); ans: in std_logic_vector(18 downto 0); t: inout std_logic_vector(18 downto 0); ovf: out std_logic; err: out std_logic ); end entity TB_CLA19; architecture BEHAV of TB_CLA19 is begin U1: CLA_19 port map(a=>a, b=>b, s=>t, ovf=>ovf); COMP: process(t, ans) is begin if (t = ans) then err <= '0'; else err <= '1'; end if; end process COMP; end architecture BEHAV;
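CLA_19, the adder exercised by this last testbench, computes its top three bits twice, once for each possible carry out of the low 16-bit addition, and selects the correct result with cout (a carry-select arrangement); overflow is flagged as the XOR of the carries into and out of the sign bit, matching c(2) xor c(3) in cla_3s.vhd. A behavioural model of that scheme (Python, illustrative; not part of the thesis sources):

```python
# Behavioural model of cla_19.vhd: the low 16 bits go through CLA_17 (whose
# bit 16 is the carry out); the top 3 bits are computed for both possible
# carry-ins and selected by cout (carry-select).  Overflow is the XOR of the
# carries into and out of the sign bit, as in cla_3s.vhd (c(2) xor c(3)).
def add3_with_ovf(a3, b3, cin):
    """Add two 3-bit values with carry-in; return (sum mod 8, overflow flag)."""
    c, s, carries = cin, 0, []
    for i in range(3):
        ai, bi = (a3 >> i) & 1, (b3 >> i) & 1
        s |= (ai ^ bi ^ c) << i
        c = (ai & bi) | (ai & c) | (bi & c)
        carries.append(c)
    return s, carries[1] ^ carries[2]      # carry into MSB xor carry out of MSB

def cla19(a, b):
    """Add two 19-bit two's-complement bit patterns; return (sum, overflow)."""
    lo = (a & 0xFFFF) + (b & 0xFFFF)
    cout = lo >> 16
    s1, ovf1 = add3_with_ovf((a >> 16) & 7, (b >> 16) & 7, 1)
    s0, ovf0 = add3_with_ovf((a >> 16) & 7, (b >> 16) & 7, 0)
    hi, ovf = (s1, ovf1) if cout else (s0, ovf0)
    return (hi << 16) | (lo & 0xFFFF), ovf
```

For example, adding 200000 to itself exceeds the 19-bit signed range (+262143) and raises the overflow flag, which is exactly the condition the testbench's ovf port exposes.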
Multiplication Unit C++ source code // This program generates all possible inputs to the Multiplication // unit and the correct output results corresponding to all the inputs. #include <iostream.h> #include <iomanip.h> #include <fstream.h> int main() { ofstream out_file1, out_file2, out_file3; out_file1.open("coef.dat"); out_file2.open("mag.dat"); out_file3.open("x_ans.dat"); int delay, time, m, n;
time = 0; delay = 40;
out_file1 << "@" << time << "ns=" << 0 << "\\H +" << endl; out_file2 << "@" << time << "ns=" << 0 << "\\H +" << endl; out_file3 << "@" << time << "ns=" << 0 << "\\H +" << endl;
for (m=-32; m<32; m++) { out_file1 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << m << "\\H +" << endl; for (n=0; n<256; n++) { out_file2 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << n << "\\H +" << endl; out_file3 << setiosflags(ios::uppercase) << "@" << dec << time + delay << "ns=" << hex << (m*n) << "\\H +" << endl; time = time + 20; }
}
out_file1 << "@" << dec << time << "ns=" << 0 << "\\H" << endl; out_file2 << "@" << dec << time << "ns=" << 0 << "\\H" << endl; out_file3 << "@" << dec << time + delay << "ns=" << 0 << "\\H" << endl; out_file1.close(); out_file2.close(); out_file3.close(); return 0; } VHDL code for Multiplication Unit testbench -- testbench for multiplier (tb_mult.vhd) library IEEE, BFULIB; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; use BFULIB.bfu_pckg.all; entity TB_MULT is port( clk, rst: in std_logic; coef: in std_logic_vector(5 downto 0); mag: in std_logic_vector(7 downto 0); pro: in std_logic_vector(13 downto 0); p: out std_logic_vector(13 downto 0); err: out std_logic ); end entity TB_MULT; architecture STRUCT of TB_MULT is signal product: std_logic_vector(13 downto 0);
begin M1: MULT port map(a=>mag, b=>coef, p=>product); L1: p <= product; COMP: process(clk, rst) is begin if (rst = '1') then err <= '0'; elsif (clk'event and clk = '1') then if (product = pro) then err <= '0'; else err <= '1'; end if; end if;
end process COMP; end architecture STRUCT; Script File | The file has been automatically generated by | the Script Editor File Wizard version 2.0.1.89 | | Copyright © 1998 Aldec, Inc. | Initial settings delete_signals set_mode functional restart stepsize 10 ns | Vector Definitions | | Add your vector definition commands here vector coef coef5 coef4 coef3 coef2 coef1 coef0 radix hex coef vector mag mag7 mag6 mag5 mag4 mag3 mag2 mag1 mag0 radix hex mag vector product p[13:0] radix hex product vector t_ans pro[13:0] radix hex t_ans | Watched Signals and Vectors
| Define your signal and vector watch list here watch coef mag product err | Stimulators Assignment | | Select and/or define your own stimulators | and assign them to the selected signals wfm coef < coef.dat wfm mag < mag.dat wfm t_ans < x_ans.dat | Set Breakpoint Conditions | | Define breakpoint conditions and | breakpoint actions for selected signals here break err 1-0 do (print err.out)
| Perform Simulation
|
| Run simulation for a selected number of
| clock cycles or a time range
sim
Appendix C
C++ Source Codes for Programs Used During Post-Implementation Simulation Program 1 (Input Image Plane and Output Image Planes generator) #include <iostream.h> #include <stdlib.h> #include <iomanip.h> #include <fstream.h> int main() { ifstream in_file1; ofstream out_file1, out_file2; in_file1.open("coef.txt"); out_file1.open("v_input.dat"); out_file2.open("input_mag.txt");
int i_mag[5][60];
int row, col, nfc; int a, b, k, m, n, mag;
int fc[3][5][5]; unsigned int seed; cout << "Seed number: "; cin >> seed; in_file1 >> nfc; row = 5; col = 60; k = 0; // Reading in the FC planes in coef.txt file while (k < nfc) { for (a = 0; a < 5; a++) { for (b = 0; b < 5; b++) { in_file1 >> mag; fc[k][a][b] = mag; } } k++; } in_file1.close(); // Generate Randomized Input Image plane with rand function srand(seed); out_file2 << "Input Image Plane (" << row << "x" << col << ")" << endl; out_file2 << "------------------------" << endl; for (a = 0; a < row; a++) { for (b = 0; b < col; b++) { mag = 256; while (mag > 255) { mag = rand(); } i_mag[a][b] = mag; out_file2 << setiosflags(ios::uppercase) << setw(3) << hex << mag << " "; }
        out_file2 << endl;
    }

    // This segment of the code generates the input image plane for simulation
    for (a = 0; a < row; a++) {
        for (b = 0; b < col; b++) {
            out_file1 << setiosflags(ios::uppercase) << "assign input " << hex << i_mag[a][b] << "\\h" << endl;
            out_file1 << "cycle 1" << endl;
        }
        if (a < 2) {
            out_file1 << "cycle 1" << endl;
        } else {
            out_file1 << "cycle 2" << endl;
        }
    }

    // Generate Expected Output
    for (k = 0; k < nfc; k++) {
        out_file2 << endl << endl << "Output Image Plane " << k+1 << endl
                  << "--------------------" << endl;
        for (a = 0; a < row; a++) {
            for (b = 0; b < col; b++) {
                mag = 0;
                for (m = 0; m < 5; m++) {
                    for (n = 0; n < 5; n++) {
                        if (!(((a-m+2) < 0) | ((a-m+2) >= row) | ((b-n+2) < 0) | ((b-n+2) >= col)))
                            mag = mag + (i_mag[a-m+2][b-n+2]*fc[k][m][n]);
                    }
                }
                mag = mag & 0x7FFFF;
                out_file2 << setw(5) << hex << setiosflags(ios::uppercase) << mag << " ";
            }
            out_file2 << endl;
        }
    }
    out_file1.close();
    out_file2.close();
    return 0;
}

Program 2 (To generate test vectors that will program the FCs into each MAU)

#include <iostream.h>
#include <iomanip.h>
#include <fstream.h>

int main()
{
    ifstream in_file1;
    ofstream out_file1, out_file2, out_file3;
    in_file1.open("coef.txt");      //Filter Coefficient file
    out_file1.open("v_coef.dat");
    out_file2.open("v_c_reg.dat");
    out_file3.open("v_au_sel.dat");
int array[5][5]; int time, count, temp, a, b, k, nfc; time = 40; count = 1; k = 1; in_file1 >> nfc; out_file1 << "@" << 0 << "ns=" << 0 << "\\H +" << endl; out_file2 << "@" << 0 << "ns=" << 0 << "\\H +" << endl; out_file3 << "@" << 0 << "ns=" << 0 << "\\H +" << endl; while (k-1 < nfc) { out_file3 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << k << "\\H +" << endl; for (a=4; a>=0; a--) { for (b=0; b<5; b++) { in_file1 >> temp; cout << temp << endl; array[b][a] = temp; } } for (a=0; a<5; a++) { for (b=0; b<5; b++) { out_file1 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << array[a][b] << "\\H +" << endl; out_file2 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << count << "\\H +" << endl; time += 80; count++; } } count = 1; k++; } out_file1 << "@" << dec << time << "ns=" << 0 << "\\H" << endl; out_file2 << "@" << dec << time << "ns=" << 0 << "\\H" << endl; out_file3 << "@" << dec << time << "ns=" << 0 << "\\H" << endl; in_file1.close(); out_file1.close(); out_file2.close(); out_file3.close(); return 0; }
Appendix D
C++ Source Codes for Programs Used During Hardware Prototype Implementation

Program 1 (This is the program responsible for sending the FC values)

//This is the driver for system without FIFO
#include <iostream.h>
#include <stdlib.h>
#include <conio.h>
#include <iomanip.h>
#include <fstream.h>

#define DATA 0x0378
#define STATUS DATA+1
#define CONTROL DATA+2

void delay(int);
void sentdata(int &);

main()
{
    int reg_coef[3][25];
    int reg_cfg[3][25];
    int fc[3][5][5];
    int k, nfc, a, b, count, wait, d, o_sel, o_cfg;

    /* Reading the Filter Coefficient from the coef.txt */
    ifstream in_file;
    in_file.open("coef.txt");   // Open the coef.txt file
    in_file >> nfc >> o_sel;    // Read in the number of FC planes
o_cfg = 0x80; /* Make Sure Parallel Port is in forward mode and set strobe */ _outp(CONTROL, _inp(CONTROL) & 0xDE); /* Make Sure the write enable (ppc(3)) is at low */ _outp(CONTROL, _inp(CONTROL) | 0x08); k = 1; count = 1; while (k-1 < nfc) // Repeat for the number of plane indicated { for (a=4; a>=0; a--) { for (b=0; b<5; b++) { in_file >> fc[k-1][b][a]; // Reading in the filter coefficients } // in the arrangement of FC in the AU } for (a=0; a<5; a++) { for (b=0; b<5; b++) { reg_coef[k-1][count-1] = fc[k-1][a][b] & 0x3F; reg_cfg[k-1][count-1] = ((k & 0x03) << 5); reg_cfg[k-1][count-1] = reg_cfg[k-1][count-1] | (count & 0x1F); cout << hex << reg_cfg[k-1][count-1] << " " << reg_coef[k-1][count-1] << endl; count++; } } count = 1; k++; } k = 1; while (k-1 < nfc) { a = 1;
while(a<26) { while(((_inp(STATUS) & 0x08) == 8)) //Detect high on pps(3) { _outp(CONTROL, _inp(CONTROL) & 0xF7); //Assert ppc(3) while(!((_inp(STATUS) & 0x20) == 32)){} //Detect high on pps(5) sentdata(reg_coef[k-1][a-1]); while(!((_inp(STATUS) & 0x40) == 64)){} //Detect high on pps(6) sentdata(reg_cfg[k-1][a-1]); a++; while((_inp(STATUS) & 0x40) == 64){} //Detect high on pps(6) } } k++; } //Configuration of the MAUs are done _outp(CONTROL, _inp(CONTROL) | 0x08); //Desert ppc(3) k = 0; while (k<1) { //Program output selection according to o_sel read in while(((_inp(STATUS) & 0x08) == 8)) //Detect high on pps(3) { _outp(CONTROL, _inp(CONTROL) & 0xF7); //Assert ppc(3) while(!((_inp(STATUS) & 0x20) == 32)){} //Detect high on pps(5) sentdata(o_sel); while(!((_inp(STATUS) & 0x40) == 64)){} //Detect high on pps(6) sentdata(o_cfg); while((_inp(STATUS) & 0x40) == 64){} //Detect high on pps(6) k++; } } _outp(CONTROL, _inp(CONTROL) | 0x08); //Desert ppc(3) cout << "Configuration done!" << endl; in_file.close(); //Close coef.txt // Programming of the Filter Coefficient is done // exit(1); // This section starts sending input datas to the system ifstream in_file1; in_file1.open("input.txt"); //Open the input data file wait = 1; while (wait==1) { if ((_inp(STATUS) & 0x08) == 0) //Detect low on pps(3) { //Run the following segment of code if pps(3)==0 while(!((_inp(STATUS) & 0x08) == 8)) //Detect high on pps(3) { //Run the following segment of code if pps(4)==1 if ((_inp(STATUS) & 0x10) == 16) //Detect high on pps(4) { in_file1 >> d; //Read in the data from file sentdata(d); while((_inp(STATUS) & 0x08) == 0) //Detect high on pps(3) {} } } } } return 0; }; //end of main void sentdata(int &c) { cout << c << " sent..." << endl; _outp(DATA, c^0x03); /* sending the data with the two LSB toggled */ _outp(CONTROL, _inp(CONTROL) & 0xFB); /* set strobe ~ one to zero */ delay(1000); _outp(CONTROL, _inp(CONTROL) | 0x04); /* reset strobe ~ zero to one */
    delay(1000);
};

/* A function to create delay */
void delay(int k)
{
    int i;
    for (i=0; i<=k; i++){}
};

Program 2 (This is the program that generates the VHDL file for the internal RAM holding the input image pixels)

#include <iostream.h>
#include <fstream.h>
#include <iomanip.h>
#include <stdlib.h>
main()
{
    int a, b, mag, row, col, i, j, k;
    int i_mag[5][62];
    int temp[32];
    unsigned int seed;
    ofstream outfile("input_ram.vhd", ios::out);

    row = 5;
    col = 62;
    cout << "Seed number: ";
    cin >> seed;
    srand(seed);
    for (a = 0; a < row; a++) {
        for (b = 0; b < (col-2); b++) {
            mag = 256;
            while (mag > 255) {
                mag = rand();
            }
            i_mag[a][b] = mag;
        }
        i_mag[a][col-1] = 0;
        i_mag[a][col-2] = 0;
    }
    outfile << "library IEEE;\n"
            << "use IEEE.std_logic_1164.all;\n"
            << "use IEEE.numeric_std.all;\n"
            << "\nentity IN_RAM is\n"
            << " port( clk: in std_logic;\n"
            << " rst: in std_logic;\n"
            << " req: in std_logic;\n"
            << " dout: out std_logic_vector(7 downto 0) );\n"
            << "end entity IN_RAM;\n\n";
    outfile << "architecture STRUCT of IN_RAM is\n\n"
            << " component RAMB4_S8 is\n"
            << " port( DI: in std_logic_vector(7 downto 0);\n"
            << " EN: in std_logic;\n"
            << " WE: in std_logic;\n"
            << " RST: in std_logic;\n"
            << " CLK: in std_logic;\n"
            << " ADDR: in std_logic_vector(8 downto 0);\n"
            << " DO: out std_logic_vector(7 downto 0) );\n"
            << " end component RAMB4_S8;\n\n";

    a = b = j = k = 0;
    while (k < (row*col)) {
        i = 0;
        while (i < 32) {
            if (a < 5) {
                temp[31-i] = i_mag[a][b];
            } else {
                temp[31-i] = 0;
            }
            if (b == (col-1)) {
                b = 0;
                a++;
                k++;
                i++;
            } else {
                k++;
                b++;
                i++;
            }
        }
        outfile << " attribute INIT_0" << setw(1) << hex << j << ": string;\n"
                << " attribute INIT_0" << setw(1) << hex << j << " of IRM: label is \"";
        for (i=0; i<32; i++) {
            if (temp[i] < 16) {
                outfile << "0" << setw(1) << hex << temp[i];
            } else {
                outfile << setw(2) << hex << temp[i];
            }
        }
        outfile << "\";\n";
        j++;
    }
    outfile << "\n signal din : std_logic_vector(7 downto 0);\n"
            << " signal addr: unsigned(8 downto 0);\n"
            << " signal adr : std_logic_vector(8 downto 0);\n"
            << " signal en : std_logic;\n"
            << " signal we : std_logic;\n";
    outfile << "\n begin\n\n"
            << " L1: din <= (others=>'0');\n"
            << " L2: en <= '1';\n"
            << " L3: we <= '0';\n"
            << " L4: adr <= std_logic_vector(addr);\n\n";
    outfile << " P1: process(clk, rst, req) is\n"
            << " begin\n"
            << " if (rst = '1') then\n"
            << " addr <= (others=>'0');\n"
            << " elsif (clk'event and clk = '1') then\n"
            << " if (req = '1') then\n"
            << " addr <= addr + 1;\n"
            << " end if;\n"
            << " end if;\n"
            << " end process P1;\n\n";
    outfile << " IRM: RAMB4_S8 port map(DI=>din, EN=>en, WE=>we, RST=>rst, CLK=>clk, \n"
<< " ADDR=>adr, DO=>dout);\n";
outfile << "\nend architecture STRUCT;\n"; outfile.close(); return 0; }
Program 3 (Test Program: this program is responsible for comparing the uploaded output with the theoretically correct outputs)

#include <iostream.h>
#include <stdlib.h>
#include <iomanip.h>
#include <fstream.h>

void hex2file(ofstream &, int);

int main()
{
    ifstream in_file1;
    ofstream out_file1, out_file2;
    in_file1.open("coef.txt");
    out_file1.open("input.txt");
    out_file2.open("exp_res.txt");
int row, col, nfc; int a, b, k, m, n, mag, o_sel; int i_mag[5][60]; int fc[3][5][5]; unsigned int seed; cout << "Seed number: "; cin >> seed; in_file1 >> nfc >> o_sel; row = 5; col = 60; k = 0; // Reading in the FC planes in coef.txt file while (k < nfc) { for (a = 0; a < 5; a++) { for (b = 0; b < 5; b++) { in_file1 >> mag; fc[k][a][b] = mag; } } k++; } in_file1.close(); // Generate Randomized Input Image plane with rand function srand(seed); out_file2 << "Input Image Plane (" << row << "x" << col << ") Seed: " << seed << endl; out_file2 << "---------------------------------" << endl; for (a = 0; a < row; a++) { for (b = 0; b < col; b++) { mag = 256; while (mag > 255) {
mag = rand(); } i_mag[a][b] = mag; out_file1 << dec << setw(3) << mag << " "; out_file2 << setiosflags(ios::uppercase) << setw(3) << hex << mag << " "; } out_file1 << endl; out_file2 << endl; } // Generate Expected Output for (k = 0; k < nfc; k++) { out_file2 << endl << endl << "Output Image Plane " << k+1 << endl << "--------------------" << endl; for (a = 0; a < row; a++) { for (b = 0; b < col; b++) { mag = 0; for (m = 0; m < 5; m++) { for (n = 0; n < 5; n++) { if (!(((a-m+2) < 0) | ((a-m+2) >= row) | ((b-n+2) < 0) | ((b-n+2) >= col))) mag = mag + (i_mag[a-m+2][b-n+2]*fc[k][m][n]); } } mag = mag & 0x7FFFF; hex2file(out_file2, mag); } out_file2 << endl; } } out_file1.close(); out_file2.close(); return 0; } void hex2file(ofstream &outfile, int mag) { int i, temp; for (i=0; i<5; i++) { temp = mag & 0xF0000; temp = temp >> 16; outfile << hex << setiosflags(ios::uppercase) << temp; mag = mag << 4; } outfile << " "; }
Appendix E
VHDL Files for Modules External to the Convolution Architecture
1. Top Level Description of the whole system (the convolution architecture is included) -- par2brd.vhd library IEEE, BRDMOD; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; use BRDMOD.brd_util.all; entity PAR2BRD is port( -- inputs from board sw1: in std_logic; -- start signal sw2: in std_logic; -- global reset sw3: in std_logic; -- mux select for internal clock clk: in std_logic; -- external clock -- from parallel port ppd: in std_logic_vector(7 downto 0); ppc: in std_logic_vector(3 downto 2); pps: out std_logic_vector(6 downto 3); -- output to external SRAM (right bank) cen_r : out std_logic; wen_r : out std_logic; oen_r : out std_logic; addr_r: out std_logic_vector(18 downto 0); data_r: out std_logic_vector(15 downto 0); -- output to external SRAM (left bank) cen_l : out std_logic; wen_l : out std_logic; oen_l : out std_logic; addr_l: out std_logic_vector(18 downto 0); data_l: out std_logic_vector(15 downto 0); -- output from the interface clk_led: out std_logic; -- sl: out std_logic_vector(6 downto 0); -- sr: out std_logic_vector(6 downto 0) ); done: out std_logic ); end entity PAR2BRD; architecture STRUCT of PAR2BRD is component SYS is port( clk, rst, str: in std_logic; d_in: in std_logic_vector(7 downto 0); --(FIFO -> DM_IF) coef: in std_logic_vector(5 downto 0); --(FCs from parallel port) ld_reg: in std_logic_vector(4 downto 0); --(MAUs select from pp) au_sel: in std_logic_vector(1 downto 0); --(AU select from pp) o_sel: in std_logic; --(Output config from pp) req: out std_logic; --(Controller -> FIFO) sram_w: out std_logic; --(SYS -> SRAM) d_out: out std_logic_vector(18 downto 0) ); end component SYS; component IN_RAM is port( clk: in std_logic; rst: in std_logic; req: in std_logic; dout: out std_logic_vector(7 downto 0) ); end component IN_RAM; component IBUF is port( i: in std_logic;
o: out std_logic ); end component IBUF; component BUFG is port( i: in std_logic; o: out std_logic ); end component BUFG; -- These signals are from parallel port interface signal d_clk, strobe, strobe_b: std_logic; signal nsw1, nsw2, nsw3 : std_logic; -- Internal connection signals signal req: std_logic; --(SYS -> IN_RAM) signal v_d : std_logic_vector(7 downto 0); --(PINTFC -> REG_A) signal v_t : std_logic_vector(18 downto 0); --(SYS -> SRAM) signal v_in : std_logic_vector(7 downto 0); --(IN_RAM -> SYS) -- signal v_led : std_logic_vector(7 downto 0); --(MUX -> SVNSEG)
    signal c_out : std_logic_vector(5 downto 0);  --(coefficient output from FC_MOD)
    signal cf_out: std_logic_vector(7 downto 0);  --(MAUs configuration output from FC_MOD)
    signal sram_w: std_logic;                     --(SYS -> OUT_RAM)
    signal cen : std_logic;
    signal wen : std_logic;
    signal oen : std_logic;
    signal data : std_logic_vector(18 downto 0);
    signal addr : std_logic_vector(18 downto 0);
    -- Clock selection signal
    signal c_sel : std_logic;
    signal p_clk : std_logic;  -- filter coefficients programming clk
begin
    -- External strobe buffering and padding
    B1: IBUF port map(i=>ppc(2), o=>strobe_b);
    B2: BUFG port map(i=>strobe_b, o=>strobe);
    -- Inverting the logic level of the push buttons.
    S1: nsw1 <= not sw1;
    S2: nsw2 <= not sw2;
    S3: nsw3 <= not sw3;
    -- Clock counter to reduce the clock frequency of the external clock
    C1: C_CNTR generic map(n=>12500) port map(clk=>clk, rst=>nsw2, co=>p_clk);
    -- First In First Out queue after the parallel port
    -- F1: FIFO port map(rst=>nsw2, r_clk=>d_clk, r_en=>req, w_clk=>strobe, w_en=>ppc(3),
    --                   din=>ppd, dout=>v_d, empty=>pps(3));
    -- This parallel port interface is aimed to replace the FIFO queue
    P1: PINTFC port map(clk=>strobe, rst=>nsw2, ppd=>ppd, d_out=>v_d);
    -- Drivers to the two seven segment LEDs
    -- SV1: SVNSG port map(ldg=>v_led(7 downto 4), rdg=>v_led(3 downto 0), sl=>sl, sr=>sr);
    -- Filter Coefficient Programming Module
    FC1: FC_MOD port map(clk=>d_clk, rst=>nsw2, ppc=>ppc(3), ppd=>v_d, pps1=>pps(5),
                         pps2=>pps(6), coef_out=>c_out, cfg_out=>cf_out);
    -- SRAM Interface module (Responsible for writing output pixels to the external SRAM)
    SRM: OUT_RAM port map(clk=>d_clk, rst=>nsw2, w=>sram_w, d_in=>v_t,
                          cen=>cen, wen=>wen, oen=>oen, addr=>addr, data=>data);
    L1 : cen_l <= cen;
    L2 : cen_r <= cen;
    L3 : wen_l <= wen;
    L4 : wen_r <= wen;
    L5 : oen_l <= oen;
    L6 : oen_r <= oen;
    L7 : addr_l <= addr;
    L8 : addr_r <= addr;
    L9 : data_l <= "0000000000000" & data(18 downto 16);
    L10: data_r <= data(15 downto 0);
-- Clock LED on the bar(6) LED L11: pps(3) <= d_clk; L12: done <= sram_w; -- MUX Select for SVNSEG LEDs display -- MUX: v_led <= v_t(7 downto 0) when nsw3 = '0' else v_t(15 downto 8); -- Convolution System (req is replaced by pps(4) to the parallel port pin) -- (o_sel is the output config pin to select which output plane to output to svnseg) U0: SYS port map(clk=>d_clk, rst=>nsw2, str=>nsw1, coef=>c_out, ld_reg=>cf_out(4 downto 0), au_sel=>cf_out(6 downto 5), o_sel=>cf_out(7), req=>req, d_in=>v_in, sram_w=>sram_w, d_out=>v_t); -- Input RAM to provide input image pixels to the convolution system U1: IN_RAM port map(clk=>d_clk, rst=>nsw2, req=>req, dout=>v_in); -- MUX select for the internal operation clock by SW3 -- SELP: process (nsw3, nsw2) is -- begin -- if (nsw2 = '1') then -- c_sel <= '0'; -- elsif (nsw3 = '1') then -- c_sel <= '1'; -- end if; -- end process SELP; -- assign the clock as indicated by the c_sel signal L13: d_clk <= p_clk when nsw3 = '0' else clk; L14: clk_led <= c_sel; end architecture STRUCT; 2. Block RAM module (initialized with input image plane) library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity IN_RAM is port( clk: in std_logic; rst: in std_logic; req: in std_logic; dout: out std_logic_vector(7 downto 0) ); end entity IN_RAM; architecture STRUCT of IN_RAM is component RAMB4_S8 is port( DI: in std_logic_vector(7 downto 0); EN: in std_logic; WE: in std_logic; RST: in std_logic; CLK: in std_logic; ADDR: in std_logic_vector(8 downto 0); DO: out std_logic_vector(7 downto 0) ); end component RAMB4_S8;
    attribute INIT_00: string;
    attribute INIT_00 of IRM: label is "b127037aa60fa2fcd6d2ecfba0b29b472f795754f8c707915e72910a387ba32d";
    attribute INIT_01: string;
    attribute INIT_01 of IRM: label is "54c7000034647cf42036263aea806ef629ca8017f15167ae072fa406aa3c5f8e";
    attribute INIT_02: string;
    attribute INIT_02 of IRM: label is "29e3340c8d55d63104709180640dde6598abfa87834f3f8b862521afb27b05da";
    attribute INIT_03: string;
    attribute INIT_03 of IRM: label is "8c059a5e0000d10f29ef58e2343094fdf00c4875d7132b775a9d90eb0a2af23a";
    attribute INIT_04: string;
    attribute INIT_04 of IRM: label is "f4ffbba026cb9a95e76073e78cf8bb71ec7adb544571aba14108426ce0b65b48";
    attribute INIT_05: string;
    attribute INIT_05 of IRM: label is "562f3c0012a50000c449b579657f3430b517d329250b06e193764decbb8d2fae";
attribute INIT_06: string; attribute INIT_06 of IRM: label is "0e1445292bf6efc7ca1a168a0b28ac6c95e4c5e1eafa97d4a739e68847e36db2"; attribute INIT_07: string; attribute INIT_07 of IRM: label is "766d9cd4a6f5a327000025adece722128cec7c0d9ec1ecf1cb50f064f402ac29"; attribute INIT_08: string; attribute INIT_08 of IRM: label is "48acd73b6335198a049f8709c6e40466c6c633d89e44e024272e09b37771f914"; attribute INIT_09: string; attribute INIT_09 of IRM: label is "0000000000000000000000003f1b75a632e7d585d0d36448ec00ce97f55e0ad8"; signal din : std_logic_vector(7 downto 0); signal addr: unsigned(8 downto 0); signal adr : std_logic_vector(8 downto 0); signal en : std_logic; signal we : std_logic; begin L1: din <= (others=>'0'); L2: en <= '1'; L3: we <= '0'; L4: adr <= std_logic_vector(addr); P1: process(clk, rst, req) is begin if (rst = '1') then addr <= (others=>'0'); elsif (clk'event and clk = '1') then if (req = '1') then addr <= addr + 1; end if; end if; end process P1; IRM: RAMB4_S8 port map(DI=>din, EN=>en, WE=>we, RST=>rst, CLK=>clk, ADDR=>adr, DO=>dout); end architecture STRUCT; 3. 
FC Programming module -- fc_pg_mod.vhd (Filter Coefficient Programming Module) library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity FC_MOD is port( clk, rst: in std_logic; ppc: in std_logic; -- Control Pin from parallel port ppd: in std_logic_vector(7 downto 0); -- Input data from parallel port pps1: out std_logic; -- Status pin for request data pps2: out std_logic; -- Status pin for request cfg coef_out: out std_logic_vector(5 downto 0); -- Filter Coefficients to program cfg_out: out std_logic_vector(7 downto 0) ); -- MAUs configuration signals end entity FC_MOD; architecture STRUCTURAL of FC_MOD is component FC_REG is port( clk, rst: in std_logic; rec_0: in std_logic; rec_1: in std_logic; prog: in std_logic; d_in: in std_logic_vector(7 downto 0); -- Input data from the parallel port coef_out: out std_logic_vector(5 downto 0); -- Filter Coefficients to program cfg_out: out std_logic_vector(7 downto 0) ); -- MAUs configuration signals end component FC_REG; component FC_FSM is port( clk, rst: in std_logic;
ctr_pin: in std_logic; rec_0: out std_logic; -- receive_data state enable pin rec_1: out std_logic; -- receive_config state enable pin prog: out std_logic ); -- program state enable pin end component FC_FSM; signal rec_0, rec_1, prog: std_logic; begin FSM: FC_FSM port map(clk=>clk, rst=>rst, ctr_pin=>ppc, rec_0=>rec_0, rec_1=>rec_1, prog=>prog); FCG: FC_REG port map(clk=>clk, rst=>rst, rec_0=>rec_0, rec_1=>rec_1, prog=>prog, d_in=>ppd, coef_out=>coef_out, cfg_out=>cfg_out); -- Status pins to the parallel port O1: pps1 <= rec_0; O2: pps2 <= rec_1; end architecture STRUCTURAL; -- fc_pg_fsm.vhd (Filter Coefficient Programming Finite State Machine) library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity FC_FSM is port( clk, rst: in std_logic; ctr_pin: in std_logic; rec_0: out std_logic; -- receive_data state enable pin rec_1: out std_logic; -- receive_config state enable pin prog: out std_logic ); -- program state enable pin end entity FC_FSM; architecture BEHAVIORAL of FC_FSM is type states is (idle, receive_data, receive_config, program); signal c_state: states; -- Current State signal n_state: states; -- Next State begin NST_PROC: process(c_state, ctr_pin) is begin case c_state is when idle => if (ctr_pin = '0') then n_state <= idle; else n_state <= receive_data; end if; when receive_data => n_state <= receive_config; when receive_config => n_state <= program; when program => if (ctr_pin = '0') then n_state <= idle; else n_state <= receive_data; end if; end case; end process NST_PROC; CST_PROC: process(clk, rst, n_state) is begin if(rst='1') then c_state <= idle;
elsif (clk'event and clk='0') then c_state <= n_state; end if; end process CST_PROC; OUT_PROC: process(c_state) is begin case c_state is when idle => rec_0 <= '0'; rec_1 <= '0'; prog <= '0'; when receive_data => rec_0 <= '1'; rec_1 <= '0'; prog <= '0'; when receive_config => rec_0 <= '0'; rec_1 <= '1'; prog <= '0'; when program => rec_0 <= '0'; rec_1 <= '0'; prog <= '1'; end case; end process OUT_PROC; end architecture BEHAVIORAL; -- fc_pg_reg.vhd (Filter COefficient Programming Registers) library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity FC_REG is port( clk, rst: in std_logic; rec_0: in std_logic; rec_1: in std_logic; prog: in std_logic; d_in: in std_logic_vector(7 downto 0); -- Input data from the parallel port coef_out: out std_logic_vector(5 downto 0); -- Filter Coefficients to program cfg_out: out std_logic_vector(7 downto 0) ); -- MAUs configuration signals end entity FC_REG; architecture BEHAVIORAL of FC_REG is signal d_reg: std_logic_vector(5 downto 0); signal c_reg: std_logic_vector(7 downto 0); begin -- Registers for storing Filter Coefficients REC_D: process(clk, rst, rec_0) is begin if (rst = '1') then d_reg <= (others=>'0'); elsif (clk'event and clk='1') then if (rec_0='1') then d_reg <= d_in(5 downto 0); end if; end if; end process REC_D; -- MAUs configuration signals REC_C: process(clk, rst, rec_1) is begin if (rst = '1') then
c_reg <= (others=>'0'); elsif (clk'event and clk='1') then if (rec_1='1') then c_reg <= d_in(7 downto 0); end if; end if; end process REC_C; -- Enable Output from the registers O1: coef_out <= d_reg when prog = '1' else (others=>'0'); O2: cfg_out <= c_reg when prog = '1' else (others=>'0'); end architecture BEHAVIORAL; 4. SRAM Driver -- out_ram.vhd (Output Ram for storing output pixels from the architecture) library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity OUT_RAM is port( clk, rst: in std_logic; w: in std_logic; -- read or write to SRAM d_in: in std_logic_vector(18 downto 0); -- Input data from architecture cen, wen: out std_logic; -- cen=chip enable, wen=write enable (both active low) oen: out std_logic; -- oen=out enable (active low) addr: out std_logic_vector(18 downto 0); -- SRAM address bus data: out std_logic_vector(18 downto 0) ); -- SRAM Data bus end entity OUT_RAM; architecture BEHAV of OUT_RAM is signal w_address: unsigned(18 downto 0); -- signal i_data : unsigned(7 downto 0); begin -- Asynchronous Reset and positive edge trigger events P1: process (clk, rst, w) is begin if (rst = '1') then wen <= '1'; oen <= '1'; addr <= (others => '0'); -- initial address during reset elsif (clk'event and clk = '1') then if (w = '1') then wen <= '0'; oen <= '1'; addr <= std_logic_vector(w_address); end if; end if; end process P1; -- Address counter P2: process(clk, rst, w) is begin if (rst = '1') then w_address <= (others => '0'); elsif (clk'event and clk = '1') then if (w = '1') then w_address <= w_address + 1; end if; end if;
end process P2; -- Chip enable signal L1: cen <= clk; -- Data Bus L2: data <= d_in; --std_logic_vector(i_data); -- L3: i_data <= unsigned(w_address(7 downto 0)) + 4; end architecture BEHAV; 5. Parallel Port Interface Module library IEEE; use IEEE.std_logic_1164.all; entity PINTFC is port( clk, rst: in std_logic; ppd: in std_logic_vector(7 downto 0); d_out: out std_logic_vector(7 downto 0) ); end entity PINTFC; architecture BEHAVIORAL of PINTFC is signal data: std_logic_vector(7 downto 0); begin REC: process(clk, rst) is begin if (rst = '1') then data <= "00000000"; elsif (clk'event and clk = '1') then data <= ppd; end if; end process REC; L1: d_out <= data; end architecture BEHAVIORAL;
References
[1] Bernard Bosi and Guy Bois, "Reconfigurable Pipelined 2-D Convolvers for Fast Digital Signal Processing", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 7, No. 3, pp. 299-308, Sep. 1999.
[2] Cheng-The Hsieh and Seung P. Kim, "A Highly-Modular Pipelined VLSI Architecture for 2-D FIR Digital Filter", Proceedings of the 1996 IEEE 39th Midwest Symposium on Circuits and Systems, Part 1, pp. 137-140, Aug. 1996.
[3] D. D. Haule and A. S. Malowany, "High-speed 2-D Hardware Convolution Architecture Based on VLSI Systolic Arrays", IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, pp. 52-55, Jun. 1989.
[4] D. Patterson and J. Hennessy, Computer Organization & Design: The Hardware/Software Interface, Morgan Kaufmann, 1994.
[5] GSI Technology, Product Datasheet, http://www.gsitechnology.com.
[6] H. T. Kung, "Why Systolic Architectures?", IEEE Computer, Vol. 15, pp. 37-46, Jan. 1982.
[7] Hyun Man Chang and Myung H. Sunwoo, "An Efficient Programmable 2-D Convolver Chip", Proceedings of the 1998 IEEE International Symposium on Circuits and Systems (ISCAS), Part 2, pp. 429-432, May 1998.
[8] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Second Edition, Morgan Kaufmann, 1996.
[9] K. Hsu, L. J. D'Luna, H. Yeh, W. A. Cook and G. W. Brown, "A Pipelined ASIC for Color Matrixing and Convolution", Proceedings of the 3rd Annual IEEE ASIC Seminar and Exhibit, Sep. 1990.
[10] Kai Hwang, Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, 1979.
[11] M. Morris Mano and Charles R. Kime, Logic and Computer Design Fundamentals, Prentice Hall, 1997.
[12] Michael J. Flynn and Stuart F. Oberman, Advanced Computer Arithmetic Design, Wiley-Interscience, 2001.
[13] O. L. MacSorley, "High-Speed Arithmetic in Binary Computers", Proceedings of the IRE, Vol. 49, pp. 67-91, Jan. 1961.
[14] V. Hecht, K. Rönner and P. Pirsch, "An Advanced Programmable 2D-Convolution Chip for Real Time Image Processing", Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1897-1900, 1991.
[15] Vijay K. Madisetti and Douglas B. Williams, The Digital Signal Processing Handbook, CRC Press and IEEE Press, 1998.
[16] Wayne Niblack, An Introduction to Digital Image Processing, Prentice/Hall International, 1986.
[17] Xess Co., XSV Board User Manual, http://www.xess.com/manuals/xsv-manual-v1_1.pdf
[18] Xilinx Co., Foundation 4.1i Software Manual, http://toolbox.xilinx.com/docsan/xilinx4/pdf/manuals.pdf
Vita
Albert Wong was born on January 1st 1975 in Sibu, Sarawak, Malaysia. He
attended SMB Methodist secondary school in Sibu and graduated in 1993. He obtained
his Bachelor of Science in Electrical Engineering Degree in May of 1998 from the
University of Kentucky, Lexington, Kentucky. He enrolled in the University of
Kentucky’s Graduate School in the fall semester of 1999.