Memory Optimization Techniques
Dept. of ECE,ASIET Page 1
CHAPTER 1
INTRODUCTION
Along with progressive device scaling, semiconductor memory has become cheaper,
faster, and more power-efficient. Moreover, embedded memories are expected to have a
dominating presence in system-on-chips (SoCs), where they may exceed 90% of the total SoC
content. It has also been found that the transistor packing density of memory components is not
only higher than that of logic components, but is also increasing much faster. Several
optimization techniques have been adopted in response to this change.[1]
A lookup table is an array containing precalculated values that can be retrieved from
memory whenever needed. The project optimizes the memory space needed to store data in a
look up table by combining the methods of Antisymmetric Product Code (APC) and Odd
Multiple Storage (OMS). The existing methodology reduces the LUT size by half, whereas the
combined effect of APC and OMS reduces the LUT size to one fourth. The scheme relies on
selective sign reversal, occupies less area, and simplifies multiplication in memory, saving
30%-50% of the area-delay product. LUT optimization is a memory-based multiplication
technique in which only the odd multiples of a fixed coefficient need to be stored; this is known
as Odd Multiple Storage (OMS). If a conventional LUT holds 2^L words, then after OMS
optimization only 2^(L-1) (i.e., 2^L/2) words are stored, and the even multiples are derived by
left-shift operations using a barrel shifter. Further, the product codes are recoded as
antisymmetric pairs (APC), which reduces the LUT size by another factor of two.
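As a concrete illustration of the OMS idea described above, the following Python sketch stores only the odd multiples of a fixed coefficient and recovers the even multiples by left shifts. The coefficient A, the width L, and all names here are illustrative assumptions, not taken from the report.

```python
# Sketch of Odd Multiple Storage (OMS): for a fixed coefficient A and an
# L-bit input X, store only the odd multiples of A; every even multiple is
# an odd multiple shifted left by the number of trailing-zero bits of X.

A, L = 7, 5  # illustrative fixed coefficient and input word length

# LUT holds only the odd multiples A, 3A, 5A, ..., i.e. 2^(L-1) entries
lut = {m: A * m for m in range(1, 2**L, 2)}

def oms_multiply(x):
    """Multiply x by A using the odd-multiple LUT plus left shifts."""
    if x == 0:
        return 0
    shifts = 0
    while x % 2 == 0:          # strip trailing zeros: x = odd * 2^shifts
        x //= 2
        shifts += 1
    return lut[x] << shifts    # barrel-shifter stage

# Every product matches direct multiplication
assert all(oms_multiply(x) == A * x for x in range(2**L))
```

The shift loop plays the role of the barrel shifter mentioned in the text: the stored odd multiple supplies the product's odd factor, and the shifts supply the powers of two.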
The other optimization technique discussed here is the FPGA implementation of cache
memory. The cache memory design is suitable for use in an FPGA-based cache controller and
processor. A cache is an on-chip memory element used to store data; it serves as a buffer
between a CPU and its main memory and helps match the data transfer rate of the CPU to that
of main memory. Because cache memory is closer to the microprocessor, it is faster than main
memory (RAM). The advantage of storing data in cache, compared to RAM, is faster retrieval;
the disadvantage is on-chip energy consumption. Implementing the cache on the FPGA
eliminates the need for separate data memory and thus optimizes memory.[2]
CHAPTER 2
LITERATURE REVIEW
2.1 DISTRIBUTED ARITHMETIC
Distributed Arithmetic (DA) is an efficient technique for calculating inner products, i.e.,
multiply-and-accumulate (MAC) operations. The MAC operation is common in digital signal
processing algorithms. The direct method uses dedicated multipliers, which are fast but
consume considerable hardware. The DA technique is bit-serial in nature and can therefore
appear slow. It turns out, however, that when the number of elements in a vector is nearly the
same as the word size, DA is quite fast. It replaces explicit multiplication by ROM lookups,
which is an efficient technique to implement on field-programmable gate
arrays.[4]
DA-based computation is widely used in filters, where the filter output is computed as the
inner product of the input sample vector with the coefficient vector. In the DA approach, an
LUT (lookup table) is used to store all possible values of the inner products for a fixed N-point
vector; the drawback is that the memory size of the LUT increases exponentially with the word
length of the inner products, so DA-based computation can consume considerable area.
The DA approach makes use of a pipelined structure (adder and shifter). On each cycle, an
eight-bit sample is fed to a word-serial, bit-parallel converter, from which a pair of
consecutive bits is transferred to its own DA structure. However, the Distributed Arithmetic
approach is less efficient than the LUT approach in terms of area complexity and latency.[4]
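The shift-accumulate mechanism of DA can be sketched numerically: the inner product with fixed coefficients is rebuilt bit-serially from a LUT of coefficient partial sums. The coefficients, word length, and names below are illustrative assumptions.

```python
# Sketch of Distributed Arithmetic (DA): the inner product of an input
# vector with fixed coefficients is computed bit-serially, replacing
# multipliers with a 2^K-word LUT of coefficient partial sums.

COEFFS = [3, 5, -2, 7]   # fixed coefficients A_k (illustrative)
K = len(COEFFS)
BITS = 8                 # unsigned input word length

# LUT entry at address a is the sum of coefficients selected by bits of a
lut = [sum(c for k, c in enumerate(COEFFS) if (a >> k) & 1)
       for a in range(2**K)]

def da_inner_product(xs):
    """Bit-serial shift-accumulate inner product using the LUT."""
    acc = 0
    for b in reversed(range(BITS)):          # MSB first
        address = sum(((x >> b) & 1) << k for k, x in enumerate(xs))
        acc = (acc << 1) + lut[address]      # shift-accumulate
    return acc

xs = [10, 20, 30, 40]
assert da_inner_product(xs) == sum(c * x for c, x in zip(COEFFS, xs))
```

Each cycle slices one bit position across all inputs, forms a LUT address from that bit slice, and shift-accumulates the stored partial sum, exactly as the pipelined adder/shifter structure described above.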
2.1.1 Parallel Implementation Of DA [4]
An inner product containing many terms can be partitioned into a number of smaller
inner products, which can be computed and summed using either distributed arithmetic or an
adder tree.
2.1.2 Reducing the Memory Size
One of the several possible ways to reduce the overall memory requirement is to
partition the memory into smaller pieces that are added before the shift accumulator.
The second approach is based on a special coding of the ROM content, which can halve the
memory size. One can also divide the N address bits of the ROM into N/K groups of K bits, so
that a single ROM of size 2^N is replaced by N/K ROMs of size 2^K each.
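The partitioning idea can be checked numerically: because DA partial sums are additive across coefficient groups, the outputs of the N/K small tables add up to the single large table's output. The coefficients, N, K, and names below are illustrative assumptions.

```python
# Sketch of the LUT-partitioning idea: an N-input table of coefficient
# partial sums splits into N/K smaller tables whose outputs are added,
# because the partial sums are additive across coefficient groups.

COEFFS = [3, 5, -2, 7, 1, 4, -6, 2]   # N = 8 fixed coefficients (illustrative)
N, K = len(COEFFS), 4                  # split into N/K = 2 groups of K bits

def partial_sum_lut(coeffs):
    """2^len(coeffs)-word LUT of all partial sums of these coefficients."""
    return [sum(c for i, c in enumerate(coeffs) if (a >> i) & 1)
            for a in range(2**len(coeffs))]

big_lut = partial_sum_lut(COEFFS)                    # one 2^8 = 256-word ROM
small_luts = [partial_sum_lut(COEFFS[g:g + K])       # two 2^4 = 16-word ROMs
              for g in range(0, N, K)]

def partitioned_lookup(address):
    """Sum of group-LUT outputs equals the single big-LUT output."""
    return sum(lut[(address >> (g * K)) & (2**K - 1)]
               for g, lut in enumerate(small_luts))

assert all(partitioned_lookup(a) == big_lut[a] for a in range(2**N))
```

Here 256 stored words shrink to 2 x 16 = 32 words at the cost of one extra adder before the shift accumulator, which is the trade-off the section describes.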
2.1.3 Comparison of Initial Years of DA
The traditional comparisons of DA are with multiplier-based solutions for problems
pertaining to filters. When DA was devised, the comparisons were given in terms of the number
of TTL ICs required to mechanize a certain type of filter.
Multiplication is the most expensive operation here because it amounts to repeated
addition; it requires a large portion of the chip area and consumes more power. Memory-based
structures are more regular than multiply-and-accumulate structures and have many advantages:
greater potential for high-throughput and reduced-latency implementation, and lower expected
dynamic power consumption owing to less switching activity in memory-based operations.
Memory-based structures are well suited to many digital signal processing algorithms that
involve multiplication with a fixed set of coefficients.[4]
2.2 LOOK UP TABLE
A lookup table is an array that replaces a runtime computation with a simpler array-indexing
operation. The savings in processing time can be significant, since retrieving a value from
memory is often faster than carrying out an expensive computation or input/output operation.
The tables may be precalculated and stored in static program storage, or even stored in
hardware on application-specific platforms. Lookup tables are also used extensively to validate
input values by matching them against a list of valid items in an array.
In data analysis applications, such as image processing, a lookup table (LUT) is used to
transform the input data into a more desirable output format. For example, a grayscale picture of
the planet Saturn will be transformed into a colour image to emphasize the differences in its
rings.
A classic example of reducing run-time computations using lookup tables is to obtain
the result of a trigonometry calculation, such as the sine of a value. Calculating trigonometric
functions can substantially slow a computing application. The same application can finish much
sooner if it first precalculates the sine for a set of values, for example for each whole number of
degrees. When the program later requires the sine of a value, it can use the lookup table to
retrieve the closest precomputed sine from a memory address, possibly interpolating to the sine
of the desired value, instead of calculating it by mathematical formula. Lookup tables are used
in this way by mathematics coprocessors in computer systems.
There are intermediate solutions that use tables in combination with a small amount of
computation, often using interpolation. Pre-calculation combined with interpolation can produce
higher accuracy for values that fall between two precomputed values. This technique requires
slightly more time to be performed but can greatly enhance accuracy in applications that require
the higher accuracy. Depending on the values being precomputed, pre-computation with
interpolation can also be used to shrink the lookup table size while maintaining accuracy.
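The degree-spaced sine table with linear interpolation described above can be sketched directly. The table granularity and function names are illustrative assumptions.

```python
import math

# Sketch of a precomputed sine table with linear interpolation, as
# described above: one entry per whole degree, interpolated in between.

SINE_TABLE = [math.sin(math.radians(d)) for d in range(361)]

def fast_sin(degrees):
    """Approximate sin(degrees) from the table, interpolating linearly."""
    degrees %= 360
    lo = int(degrees)
    frac = degrees - lo
    return SINE_TABLE[lo] + frac * (SINE_TABLE[lo + 1] - SINE_TABLE[lo])

# Table hits are exact; interpolated values carry only a tiny error
assert abs(fast_sin(30.0) - 0.5) < 1e-9
assert abs(fast_sin(45.5) - math.sin(math.radians(45.5))) < 1e-4
```

This also illustrates the size/accuracy trade-off mentioned above: a coarser table with interpolation can replace a much larger table of directly stored values.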
There are hardware LUTs as well. In digital logic, an n-bit lookup table can be implemented
with a multiplexer whose select lines are the inputs of the LUT and whose data inputs are
constants. An n-bit LUT can encode any n-input Boolean function by modeling such functions
as truth tables. This is an efficient way of encoding Boolean logic functions, and LUTs with
4-6 inputs are in fact the key component of modern Field Programmable Gate Arrays (FPGAs).
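The truth-table encoding above can be demonstrated in software: tabulate an n-input Boolean function into 2^n bits, then "read" it by using the inputs as select lines, the way an FPGA logic cell does. The majority function and all names are illustrative choices.

```python
# Sketch of a hardware-style LUT: an n-input Boolean function stored as a
# truth table and read by concatenating the inputs into an address.

def make_lut(func, n):
    """Tabulate an n-input Boolean function into a 2^n-entry truth table."""
    return [func(*(((a >> i) & 1) for i in range(n))) for a in range(2**n)]

# A 3-input majority function as the example function to encode
majority = lambda a, b, c: int(a + b + c >= 2)
lut = make_lut(majority, 3)

def lut_eval(lut, *bits):
    """Select-line read: the input bits form the LUT address."""
    address = sum(b << i for i, b in enumerate(bits))
    return lut[address]

assert lut_eval(lut, 1, 1, 0) == 1   # two ones: majority holds
assert lut_eval(lut, 1, 0, 0) == 0   # one one: majority fails
```

Any 3-input function, not just majority, fits the same 8-entry table, which is why a LUT cell can implement arbitrary logic.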
Multiplication, which is used in many processors for performing arithmetic operations,
has long been a subject of research aimed at smaller and faster architectures.
Multiplication, being repeated addition, forms a basis of processor arithmetic, and the speed,
reliability, and efficiency of the underlying adder architecture can dominate overall
performance. Nowadays a prime design aim is to minimize the core area.
A conventional lookup-table (LUT)-based multiplier is shown in Figure 2.1, where A is a
fixed coefficient, and X is an input word to be multiplied with A. Assuming X to be a positive
binary number of word length L, there can be 2^L possible values of X, and accordingly, there
can be 2^L possible values of the product C = A·X.
Figure 2.1: Conventional LUT-based multiplier

Therefore, for memory-based multiplication, an LUT of 2^L words, consisting of
precomputed product values corresponding to all possible values of X, is conventionally used.
The product word A·Xi is stored at location Xi for 0 ≤ Xi ≤ 2^L − 1, so that if the L-bit binary
value of Xi is used as the address for the LUT, the corresponding product value A·Xi is
available as its output.
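The conventional scheme of Figure 2.1 has a one-line software analogue: precompute all 2^L products and use the input word itself as the memory address. The coefficient and width are illustrative assumptions.

```python
# Sketch of the conventional LUT-based multiplier of Figure 2.1: for a
# fixed coefficient A and L-bit input X, all 2^L products are precomputed
# and the input word is used directly as the memory address.

A, L = 13, 5                         # illustrative coefficient and width

lut = [A * x for x in range(2**L)]   # 2^L precomputed product words

def lut_multiply(x):
    """X itself is the LUT address; the stored word is A*X."""
    return lut[x]

assert lut_multiply(21) == 13 * 21
assert len(lut) == 32                # 2^L words: the size APC/OMS attack
```

The 32-word table here is exactly the baseline that the APC and OMS techniques of the next section reduce.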
2.3 APC AND OMS
Multiplication is a major arithmetic operation in signal processing and in ALUs, and a
memory-based multiplier uses a look-up table for its computations. However, little significant
work exists on LUT optimization for memory-based multiplication. Antisymmetric Product
Coding (APC) is a technique for LUT-based multiplication that reduces the size of the
conventional LUT by 50%; it is based on the antisymmetry that two's complement pairs of
inputs exhibit in their products.

For the multiplication of any binary word X of size L with a fixed coefficient A, instead
of storing all 2^L possible values of C = A·X, only 2^(L-1) words corresponding to the odd
multiples of A need be stored in the LUT, while all the even multiples of A can be derived by
left-shift operations on one of those odd multiples. This approach is known as Odd Multiple
Storage (OMS). A barrel shifter producing left shifts can be used to derive all the even
multiples of A.
The implementation of the proposed APC-OMS combined LUT for memory based
multiplier uses two techniques, APC and OMS method. This method is supposed to reduce the
area to one fourth. The multiplier uses four blocks. The address-generation block converts the
input into the address bits d0, d1, d2, d3, which combine both the APC and OMS mappings.
A 4-to-9 line address decoder converts the address d0, d1, d2, d3 into LUT word-select lines,
the memory array is the LUT itself, and a barrel shifter converts the LUT output into the
desired product.

When the APC approach is combined with the OMS technique, the two's complement
operations can be simplified, since the input address and the LUT output can always be
transformed into odd integers; this reduces the LUT size to one fourth of the conventional
LUT. The proposed LUT-based multiplier is found to involve comparable area and time
complexity for a word size of 5 bits.
2.4 FIELD PROGRAMMABLE GATE ARRAY
Field Programmable Gate Arrays (FPGAs) are semiconductor devices that are based
around a matrix of configurable logic blocks (CLBs) connected via programmable
interconnects. FPGAs can be reprogrammed to desired application or functionality requirements
after manufacturing. This feature distinguishes FPGAs from Application Specific Integrated
Circuits (ASICs), which are custom manufactured for specific design tasks.
The need for programmable devices was recognized as early as the 1970s, with the design of
the PLD by Ron Cline of Signetics. Digital ICs such as TTL or CMOS parts have fixed
functionality: the user has no option to change or modify how they work, since they behave
exactly as designed by the manufacturer. To change this, designers began looking for a
methodology by which the functionality of an IC could be modified, and the concept of using
fuses in ICs gained momentum.

Blowing a fuse between two contacts, or keeping it intact, could be controlled by
software, and devices whose function could be modified this way were called Programmable
Logic Devices (PLDs). Many digital chips fall under the category of PLDs, but the most
fundamental and primitive were memories such as ROM and PROM. PLAs were
introduced in the early 1970s by Philips, but their main drawbacks were that they were
expensive to manufacture and offered somewhat poor speed-performance. Both disadvantages
were due to the two levels of configurable logic, because programmable logic planes were
difficult to manufacture and introduced significant propagation delays.
To overcome these problems, Programmable Array Logic (PAL) devices were
developed.

Memory: Memory is used to store, provide access to, and allow modification of data
and program code within a processor-based electronic circuit or system. The two basic
types of memory are ROM (read-only memory) and RAM (random-access memory).

ROM is used to hold program code that must be retained when power is removed; it
provides non-volatile storage. The code can be fixed when the memory is fabricated
(mask-programmable ROM), electrically programmed once (PROM, programmable ROM), or
programmed multiple times. Multiple-programming capability requires the ability to erase prior
contents, which is available with EPROM (erasable programmable ROM, erased using
ultraviolet [UV] light), EEPROM or E2PROM (electrically erasable PROM), and flash (also
electrically erased). PROM is sometimes considered to be in the same category of circuit as
programmable logic, although in this text PROM is treated in the memory category only.
RAM is used to hold data and program code that require fast access and the ability
to modify the contents during normal operation. RAM differs from read-only memory (ROM)
in that it can be both read from and written to in the normal circuit application; flash memory,
by contrast, is sometimes referred to as non-volatile RAM (NVRAM). RAM provides volatile
storage: unlike ROM, the contents of RAM are lost when the power is removed. There are two
main types of RAM: static RAM (SRAM) and dynamic RAM (DRAM).
ROM (Read-Only Memory): A ROM is essentially a storage device in which a fixed set
of binary information is stored. The user first specifies the binary information to be stored, and
it is then embedded in the unit to form the required interconnection pattern. A ROM contains
special internal links that can be fused or broken; certain links are blown out to realize the
desired interconnections for a particular application and to form the required circuit path. Once
a pattern is established for a
ROM, it remains fixed even if the power supply to the circuit is switched off and then
switched on again. A ROM has n input lines and m output lines. Each bit combination of the
input variables is called an address, and each bit combination formed at the output lines is
called a word. Thus, an address is essentially a binary number denoting one of the minterms of
n variables, and the number of bits per word equals the number of output lines m. From n input
variables it is possible to generate p = 2^n distinct addresses; since there are 2^n distinct
addresses, there are 2^n distinct words said to be stored in the device, and an output word is
selected by its unique address. The address applied to the input lines specifies which word
appears at the output lines at any given time.
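The n-input, m-output addressing just described can be modeled in a few lines of Python. The sizes and the stored contents are illustrative assumptions.

```python
# Toy ROM model for the description above: n input lines give 2^n
# addresses, each selecting one m-bit output word (contents illustrative).

n, m = 3, 4                                   # 3 address bits, 4-bit words
rom = [(addr * 5) % (2**m) for addr in range(2**n)]   # fixed contents

assert len(rom) == 2**n                       # 2^n distinct stored words
assert all(0 <= word < 2**m for word in rom)  # each word fits m output lines
assert rom[0b101] == 25 % 16                  # address 5 selects word 5
```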
Programmable Logic Device (PLD): Logic devices outside the fixed-function TTL and CMOS
families, whose logical operation is specified by the user through a process called
programming, are called Programmable Logic Devices. A programmable logic device is thus
an IC that contains digital logic cells and programmable interconnect. The idea of the PLD was
first conceived by Ron Cline of Signetics in 1975, with programmable AND and OR planes.
The basic idea of these devices is to let the designer configure the logic cells and interconnect
to form a digital electronic circuit within a single IC package: the hardware resources are
configured to implement a required functionality, and by changing the hardware configuration,
the PLD performs a different function.
Three types of PLD are available: the simple programmable logic device (SPLD), the
complex programmable logic device (CPLD), and the field-programmable gate array (FPGA).
A PLD with simple architectural features is called an SPLD. The SPLD was introduced prior to
the CPLD and FPGA, and based on architecture, SPLDs are classified into three types: the
programmable logic array (PLA), programmable array logic (PAL), and generic array logic
(GAL).

PLA (Programmable Logic Array): The PLA is a type of LSI device, conceptually similar
to a ROM. However, a PLA does not contain all the AND gates needed to form a full decoder,
and so does not generate all the minterms as a ROM does.
In the PLA, the decoder is replaced by a group of AND gates with buffers/inverters,
each of which can be programmed to generate some product terms of input variable
combinations that are essential to realize the output functions. The AND and OR gates inside
the PLA are initially fabricated with the fusible links among them. The required Boolean
functions are implemented in sum of the products form by opening the appropriate links and
retaining the desired connections. So, the PLA consists of two programmable planes AND and
OR planes.
The AND plane consists of programmable interconnect feeding AND gates, and the OR
plane consists of programmable interconnect feeding OR gates. In a typical depiction there are
four inputs to the PLA and four outputs from it. Each input can be combined with any of the
other inputs in an AND gate by connecting the crossover point of the vertical and horizontal
interconnect lines in the AND-plane programmable interconnect. Initially, the crossover points
are not electrically connected, but configuring the PLA connects particular crossover points
together. By convention, each AND gate is drawn with a single line to its input, but this line
stands for all of the inputs (vertical lines) that can be connected; hence, for four PLA inputs,
each AND gate effectively has four inputs. The single output from each AND gate is applied to
the OR-gate programmable interconnect.
Again, the crossover points are initially not electrically connected, but configuring the
PLA connects particular crossover points together. Each OR gate is likewise drawn with a
single input line, standing for all the AND-gate outputs that can be connected to it; hence, for
four AND gates, each OR gate effectively has four inputs. The function is implemented in
AND-OR form when the output link across the inverter is in place, or in AND-OR-INVERT
form when the link is blown off.
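The two-plane structure can be sketched behaviorally: "programming" is simply choosing which crosspoints connect. The particular product terms, outputs, and names below are illustrative assumptions.

```python
# Sketch of a PLA as two programmable planes: the AND plane forms product
# terms from selected inputs, and the OR plane sums selected product terms.

# Each product term lists its (input_index, required_value) crosspoints.
AND_PLANE = [
    [(0, 1), (1, 1)],   # term0 = a AND b
    [(1, 0), (2, 1)],   # term1 = (NOT b) AND c
]
# Each output lists which product terms its OR gate connects to.
OR_PLANE = [
    [0, 1],             # f0 = term0 OR term1
    [0],                # f1 = term0
]

def pla_eval(inputs):
    """Evaluate all PLA outputs for a tuple of input bits."""
    terms = [int(all(inputs[i] == v for i, v in term)) for term in AND_PLANE]
    return [int(any(terms[t] for t in out)) for out in OR_PLANE]

assert pla_eval((1, 1, 0)) == [1, 1]   # term0 fires, driving both outputs
assert pla_eval((0, 0, 1)) == [1, 0]   # only term1 fires
```

Reprogramming the device corresponds to editing the two plane tables, with the gates themselves unchanged, which is the essence of the sum-of-products PLA.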
Programmable Array Logic (PAL): The first PAL device was developed by Monolithic
Memories Inc. (MMI). The PAL is similar to the PLA, but in a PAL device only the AND
gates are programmable; the OR array is fixed by the manufacturer. This makes PAL devices
easier to program and less expensive than PLAs, but, since the OR array is fixed, also less
flexible than a PLA device.
Field-Programmable Gate Arrays: The FPGA concept emerged in 1985
with the XC2064 FPGA family from Xilinx. An FPGA is an integrated circuit that
contains many (from 64 to over 10,000) identical logic cells that can be viewed as standard
components. The individual cells are interconnected by a matrix of wires and programmable
switches. A user's design is implemented by specifying the simple logic function for each cell
and selectively closing the switches in the interconnect matrix. The array of logic cells and
interconnects form a fabric of basic building blocks for logic circuits. Complex designs are
created by combining these basic blocks to create the desired circuit. Unlike CPLDs (Complex
Programmable Logic Devices) FPGAs contain neither AND nor OR planes.
The FPGA architecture consists of configurable logic blocks, configurable I/O blocks,
and programmable interconnect. Also, there will be clock circuitry for driving the clock signals
to each logic block, and additional logic resources such as ALUs, memory, and decoders may
be available. The two basic types of programmable elements for an FPGA are Static RAM and
antifuses. Each logic block in an FPGA has a small number of inputs and one output. A look up
table (LUT) is the most commonly used type of logic block used within FPGAs. There are two
types of FPGAs.(i) SRAM based FPGAs and (ii) Antifse technology based(OTP) Every FPGA
consists of the following elements Configurable logic blocks(CLBs) Configurable input output
blocks(IOBs) Two layer metal network of vertical and horizontal lines for interconnecting the
CLBS.
2.4.1 Difference between an ASIC and an FPGA
ASICs and FPGAs have different value propositions, and they must be carefully
evaluated before choosing one over the other. Abundant information exists comparing the two
technologies. While FPGAs used to be selected only for lower-speed, lower-complexity,
lower-volume designs, today's FPGAs easily push past the 500 MHz performance barrier. With
unprecedented increases in logic density and a host of other features, such as embedded
processors, DSP blocks, clocking, and high-speed serial I/O at ever lower price points, FPGAs
are a compelling proposition for almost any type of design.
2.4.2 FPGA Applications
Due to their programmable nature, FPGAs are an ideal fit for many different markets. As
the industry leader, Xilinx provides comprehensive solutions consisting of FPGA devices,
advanced software, and configurable, ready-to-use IP cores for markets and applications such
as:
Aerospace & Defense - Radiation-tolerant FPGAs along with intellectual property for
image processing, waveform generation, and partial reconfiguration for SDRs.
ASIC Prototyping - ASIC prototyping with FPGAs enables fast and accurate SoC
system modeling and verification of embedded software
Audio - Xilinx FPGAs and targeted design platforms enable higher degrees of
flexibility, faster time-to-market, and lower overall non-recurring engineering costs
(NRE) for a wide range of audio, communications, and multimedia applications.
Automotive - Automotive silicon and IP solutions for gateway and driver assistance
systems, comfort, convenience, and in-vehicle infotainment.
Broadcast - Adapt to changing requirements faster and lengthen product life cycles with
Broadcast Targeted Design Platforms and solutions for high-end professional broadcast
systems.
Consumer Electronics - Cost-effective solutions enabling next generation, full-featured
consumer applications, such as converged handsets, digital flat panel displays,
information appliances, home networking, and residential set top boxes.
Data Center - Designed for high-bandwidth, low-latency servers, networking, and
storage applications to bring higher value into cloud deployments.
High Performance Computing and Data Storage - Solutions for Network Attached
Storage (NAS), Storage Area Network (SAN), servers, and storage appliances.
Industrial - Xilinx FPGAs and targeted design platforms for Industrial, Scientific and
Medical (ISM) enable higher degrees of flexibility, faster time-to-market, and lower
overall non-recurring engineering costs (NRE) for a wide range of applications such as
industrial imaging and surveillance, industrial automation, and medical imaging
equipment.
Medical - For diagnostic, monitoring, and therapy applications, the Virtex FPGA and
Spartan FPGA families can be used to meet a range of processing, display, and I/O
interface requirements.
CHAPTER 3
PROJECT DESCRIPTION
The Antisymmetric Product Coding is an approach in which the LUT size can be
reduced to half, where the product words are recoded as antisymmetric pairs. The APC
approach, although providing a reduction in LUT size by a factor of two, incorporates
substantial overhead of area and time to perform the two’s complement operation of LUT
output for sign modification and that of the input operand for input mapping.
For simplicity of presentation, we assume both X and A to be positive integers. The
product words for different values of X for L = 5 are shown in Table 3.1. It may be observed in
this table that the input word X in the first column of each row is the two's complement of that
in the third column of the same row. In addition, the sum of the product values corresponding
to these two input values in the same row is 32A.[1]
Table 3.1: APC words for different input values for L = 5
Let the product values in the second and fourth columns of a row be u and v,
respectively. Since u = (u + v)/2 − (v − u)/2 and v = (u + v)/2 + (v − u)/2, for
(u + v) = 32A, we have

u = 16A − (v − u)/2 (3.1)
v = 16A + (v − u)/2 (3.2)
The product values in the second and fourth columns of Table 3.1 therefore exhibit a negative
mirror symmetry. This behaviour of the product words can be used to reduce the LUT size:
instead of storing both u and v, only (v − u)/2 is stored for each pair of inputs on a given row.
The 4-bit LUT addresses and the corresponding coded words are listed in the fifth and sixth
columns of the table, respectively. Since this representation of the product is derived from the
antisymmetric behaviour of the products, it is named the antisymmetric product code. The
4-bit address X' = (x3' x2' x1' x0') of the APC word is given by

X' = XL, if x4 = 1
X' = XL', if x4 = 0

where XL = (x3 x2 x1 x0) is formed by the four less significant bits of X, and XL' is the two's
complement of XL. The desired product can be obtained by adding or subtracting the stored
value (v − u)/2 to or from the fixed value 16A when x4 is 1 or 0, respectively, i.e.,

Product word = 16A + (sign value) × (APC word) (3.3)

where sign value = 1 for x4 = 1 and sign value = −1 for x4 = 0. The product value for X =
(10000) corresponds to the APC value "zero," which can be derived by resetting the LUT
output, instead of storing it in the LUT.

Figure 3.1: LUT-based multiplier for L = 5 using the APC technique
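The APC identity behind Table 3.1 is easy to verify numerically: a 5-bit input and its two's complement partner have products summing to 32A, so each product is 16A plus or minus the stored word. The coefficient A below is an illustrative assumption.

```python
# Numerical check of the APC identity used above for L = 5: a 5-bit input
# X and its two's complement partner X* satisfy A*X + A*X* = 32A, so the
# product is 16A plus or minus the stored APC word (v - u)/2.

A = 11                                     # illustrative fixed coefficient

for x in range(1, 16):                     # paired rows of Table 3.1
    x_pair = 32 - x                        # two's complement partner (x4 = 1)
    u, v = A * x, A * x_pair
    assert u + v == 32 * A                 # antisymmetric pair property
    apc_word = (v - u) // 2
    assert u == 16 * A - apc_word          # x4 = 0 side: subtract (eq. 3.1)
    assert v == 16 * A + apc_word          # x4 = 1 side: add (eq. 3.2)

# X = 16 (10000) maps to APC word 0: the product is exactly 16A
assert A * 16 == 16 * A
```

Storing only `apc_word` per row is precisely the halving of the LUT that APC delivers.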
3.2 COMBINED APC-OMS TECHNIQUE
For the multiplication of any binary word X of size L with a fixed coefficient A, instead
of storing all the 2^L possible values of C = A·X, only 2^(L−1) words corresponding to the odd
multiples of A need be stored in the LUT, while all the even multiples of A can be derived by
left-shift operations on one of those odd multiples. On this basis, the LUT for the
multiplication of an L-bit input with a W-bit coefficient can be designed by the following
strategy.
A memory unit of [2^(L−1) + 1] words of (W + L)-bit width is used to store the product
values, where the first 2^(L−1) words are odd multiples of A, and the last word is zero.
A barrel shifter producing a maximum of (L − 1) left shifts is used to derive all the
even multiples of A.
The L-bit input word is mapped to the (L − 1)-bit address of the LUT by an address
encoder, and the control bits for the barrel shifter are derived by a control circuit.
Consider Table 3.2, which has eight memory locations storing the eight odd multiples
A × (2i + 1) as Pi, for i = 0, 1, 2, . . . , 7. The even multiples 2A, 4A, and 8A are derived by
left-shift operations on A. Similarly, 6A and 12A are derived by left-shifting 3A, while 10A and
14A are derived by left-shifting 5A and 7A, respectively. A barrel shifter producing a maximum
of three left shifts can be used to derive all the even multiples of A.
Suppose the word stored for X = (00000) is not 0 but 16A; 16A can be obtained from A by
four left shifts with a barrel shifter. However, if 16A is not derived from A, only a maximum of
three left shifts is required to obtain all the other even multiples of A. A maximum of three bit
shifts can be implemented by a two-stage logarithmic barrel shifter, whereas four shifts would
require a three-stage barrel shifter. Therefore, it is a more efficient strategy to store 2A for input
X = (00000), so that the product 16A can be derived by three arithmetic left shifts.[1]
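The shift budget described above can be checked directly: with the odd multiples stored, plus 2A parked in the extra slot, every even multiple through 16A needs at most three left shifts. The coefficient A is an illustrative assumption.

```python
# Check of the shift strategy above (A illustrative): every even multiple
# up to 14A comes from at most three left shifts of a stored odd multiple,
# and 16A comes from three left shifts of the stored word 2A.

A = 9  # illustrative fixed coefficient

stored = {m: A * m for m in range(1, 16, 2)}   # odd multiples A..15A
stored[0] = 2 * A                               # extra word for X = (00000)

def even_multiple(m):
    """Derive A*m for even m by shifting a stored word."""
    if m == 16:
        return stored[0] << 3                   # 2A << 3 = 16A
    shifts = 0
    while m % 2 == 0:
        m //= 2
        shifts += 1
    assert shifts <= 3                          # the text's shifter bound
    return stored[m] << shifts

for m in (2, 4, 6, 8, 10, 12, 14, 16):
    assert even_multiple(m) == A * m
```

Without the 2A trick, 16A = A << 4 would force a three-stage barrel shifter; with it, the two-stage (three-shift) shifter suffices, which is the design point the paragraph makes.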
The product values and encoded words for the input words X = (00000) and (10000) are
shown separately in Table 3.3. For X = (00000), the desired encoded word 16A is derived by a
3-bit left shift of 2A. For X = (10000), the APC word "0" is derived by resetting the LUT
output, using an active-high RESET signal given by RESET = (x0 + x1 + x2 + x3)' · x4.

It may be seen from Tables 3.1 and 3.2 that the 5-bit input word X can be mapped into a
4-bit LUT address (d3 d2 d1 d0) by a simple set of mapping relations:

di = xi+1'' for i = 0, 1, 2, and d3 = (x0'')'

where X'' = (x3'' x2'' x1'' x0'') is generated by shifting out all the trailing zeros of X' by an
arithmetic right shift, followed by the address mapping, i.e.,

Table 3.2: OMS-based design of the LUT of APC words for L = 5

Table 3.3: Products and encoded words for X = (00000) and (10000)
X'' = YL, if x4 = 1
X'' = YL', if x4 = 0 (3.4)

where YL and YL' are derived by shifting out all the trailing zeros of XL and XL',
respectively.
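The mapping relations above can be exercised end to end in a behavioral model: APC mapping, trailing-zero stripping, the nine-word LUT, the barrel shift, RESET, and the add/subtract sign together must reconstruct A·X for every 5-bit input. This is a functional sketch with an illustrative coefficient, not a gate-level description of the circuit.

```python
# End-to-end behavioral check of the combined APC-OMS scheme for L = 5:
# address, shift count, RESET and sign together reconstruct A*X exactly.

A = 11                                        # illustrative coefficient

# LUT: odd multiples A*(2i+1) at addresses 0..7, word 2A at address 8
lut = [A * (2 * i + 1) for i in range(8)] + [2 * A]

def decode(x):
    """Map 5-bit input X to (lut_address, shifts, reset, sign)."""
    x4, xl = (x >> 4) & 1, x & 0b1111
    reset = int(x4 == 1 and xl == 0)          # X = (10000)
    sign = +1 if x4 else -1                   # add or subtract from 16A
    xp = xl if x4 else (16 - xl) & 0b1111     # APC mapping: X'
    if xp == 0:
        return 8, 3, reset, sign              # 2A << 3 supplies 16A
    shifts = 0
    while xp % 2 == 0:                        # strip trailing zeros: X''
        xp //= 2
        shifts += 1
    return (xp - 1) // 2, shifts, reset, sign # odd X'' -> address (X''-1)/2

def apc_oms_multiply(x):
    addr, shifts, reset, sign = decode(x)
    word = 0 if reset else lut[addr] << shifts
    return 16 * A + sign * word

assert all(apc_oms_multiply(x) == A * x for x in range(32))
```

Only nine words are stored here, versus 32 in the conventional LUT, which is the one-fourth reduction claimed for the combined technique.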
3.2.1 Implementation of the LUT Multiplier Using APC for L = 5
The structure and function of the LUT-based multiplier for L = 5 using the APC
technique is shown in Figure 3.1. It consists of a four-input LUT of 16 words to store the APC
values of product words as given in the sixth column of Table 3.1, except on the last row, where
2A is stored for input X = (00000) instead of storing a “0” for input X = (10000). Besides, it
consists of an address-mapping circuit and an add/subtract circuit.
The address-mapping circuit generates the desired address (x3' x2' x1' x0'). A
straightforward implementation of the address mapping multiplexes XL and
XL' using x4 as the control bit. The address-mapping circuit, however, can be optimized to use
three XOR gates, three AND gates, two OR gates, and a NOT gate, as in Figure 3.1. The
RESET signal can be generated by a control circuit. The output of the LUT is added to or
subtracted from 16A, for x4 = 1 or 0 respectively, by the add/subtract cell; hence, x4 is used as
the control for the add/subtract cell.
3.2.2 Implementation of the Optimized LUT Using Modified OMS
The APC–OMS combined design of the LUT for L = 5 and for any coefficient width W
is given in Figure 3.2. It consists of an LUT of nine words of (W + 4)-bit width, a four-to-nine-line
address decoder, a barrel shifter, an address-generation circuit, and a control circuit for
generating the RESET signal and the control word (s1 s0) for the barrel shifter. The precomputed
values of A × (2i + 1) are stored as Pi, for i = 0, 1, 2, . . . , 7, at the eight consecutive locations of
the memory array, as specified in Table 3.2, while 2A is stored for input X = (00000) at LUT
address "1000," as specified in Table 3.3.
Figure 3.2: Proposed APC–OMS combined LUT design for the multiplication of W-bit fixed coefficient A with 5-bit input X.
The decoder takes the 4-bit address from the address generator and generates nine word-select
signals, i.e., {wi, for 0 ≤ i ≤ 8}, to select the referenced word from the LUT. The 4-to-9-line
decoder is a simple modification of the 3-to-8-line decoder, as shown in Figure 3.3.
Figure 3.3: Four-to-nine-line address-decoder
The control bits s0 and s1 to be used by the barrel shifter to produce the desired number
of shifts of the LUT output are generated by the control circuit, according to the relations
s0 = x0′ · (x1 + x2′) (3.5)
s1 = (x0 + x1)′ (3.6)
where ′ denotes the logical complement: a word ending in x0 = 1 needs no shift; x0 = 0, x1 = 1
needs one; x0 = x1 = 0, x2 = 1 needs two; and x0 = x1 = x2 = 0 needs three. (s1 s0) is the 2-bit
binary equivalent of the required number of shifts specified in Tables 3.2 and 3.3. The control
circuit that generates the control word and RESET is shown in Figure 3.4.
The address-generator circuit receives the 5-bit input operand X and maps it onto the 4-bit
address word (d3 d2 d1 d0).
Figure 3.4: Control circuit
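The addressing, shifting, and RESET handling just described can be modelled functionally. The Python sketch below is a behavioural model under the same assumptions as Tables 3.2 and 3.3, not the gate-level circuit; the handling of the inputs X = (00000) and (10000) follows Table 3.3.

```python
def build_oms_lut(A):
    """Nine words: Pi = A*(2i+1) for i = 0..7, plus 2A at address 8 ("1000")."""
    return [A * (2 * i + 1) for i in range(8)] + [2 * A]

def apc_oms_multiply(A, X):
    """Functional model of the combined APC-OMS LUT multiplier for L = 5."""
    assert 0 <= X < 32
    lut = build_oms_lut(A)
    x4 = (X >> 4) & 1
    u = (X & 0xF) if x4 else (16 - X)    # APC magnitude, 0..16
    if u == 0:                 # X = 10000: RESET forces the LUT output to 0
        apc = 0
    elif u == 16:              # X = 00000: 2A at address "1000", shifted by 3
        apc = lut[8] << 3
    else:
        s = (u & -u).bit_length() - 1    # shifts = trailing zeros of u
        odd = u >> s                     # u = odd * 2^s
        apc = lut[(odd - 1) // 2] << s   # barrel-shift the stored odd multiple
    return 16 * A + apc if x4 else 16 * A - apc
```

Only eight odd multiples plus the 2A word are ever read; even multiples are recovered entirely by the barrel shift, which is the point of the OMS scheme.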
3.2.3 Optimized LUT Design for Signed and Unsigned Operands
The APC–OMS combined optimization of the LUT can also be performed for signed
values of A and X. When both operands are in sign-magnitude form, the multiples of magnitude
of the fixed coefficient are to be stored in the LUT, and sign of the product could be obtained by
the XOR operation on the sign bits of the two multiplicands. When both operands are in two's
complement form, a two's complement operation on the LUT output is required for x4 = 1.
There is no need to add the fixed value 16A in this case, because the product values are
naturally in antisymmetric form. The add/subtract circuit of Figure 3.1 is therefore not
required; instead, a circuit is needed to perform the two's complement operation on the LUT
output. For the multiplication of an unsigned input X with a signed, as well as an unsigned,
coefficient A, the products could be stored in two's complement representation, and the
add/subtract circuit in Figure 3.1 could be modified as in Figure 3.5.
Figure 3.5: Modification of the add/subtract cell for the two's complement representation of product words.
Except the last word, all other words in the LUT are odd multiples of A. The fixed
coefficient could be even or odd, but assuming A to be an odd number, all the stored product
words (except the last one) would be odd. If the stored value P is an odd number, it can be
expressed as
P = PD−1 PD−2 · · · P1 1 (3.7)
and its two's complement is given by
P′ = P′D−1 P′D−2 · · · P′1 1 (3.8)
where P′i is the one's complement of Pi for 1 ≤ i ≤ D − 1, and D = W + L − 1 is the width of the
stored words. Since two's complementing an odd word leaves its LSB unchanged, the sign of
the LUT output can be changed for x4 = 1 simply by inverting the upper D − 1 bits. This gives
the simple sign-modification circuit of Figure 3.6.
Figure 3.6: Optimized implementation of the sign modification of the odd LUT output
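The identity behind this sign-modification circuit — the two's complement of an odd word keeps the LSB and one's-complements only the upper bits — can be checked with a short Python sketch (the function name is illustrative):

```python
def sign_modify(P, D):
    """Two's complement of an odd D-bit word, done as in Figure 3.6:
    keep the LSB and one's-complement only the upper D-1 bits."""
    assert P & 1, "all stored words (except the last) are odd"
    mask = (1 << D) - 1
    upper = (~P) & mask & ~1      # one's complement of bits D-1 .. 1
    return upper | 1              # LSB of an odd word is unchanged

# the shortcut agrees with the ordinary two's complement for every odd word
D = 8
for P in range(1, 1 << D, 2):
    assert sign_modify(P, D) == (-P) & ((1 << D) - 1)
```

The shortcut works because the +1 of the two's complement stops its carry at the LSB when that bit of the one's complement is 0, i.e., exactly when P is odd.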
However, the fixed coefficient A could be even as well. When A is a nonzero even
integer, we can express it as A = A′ × 2^l, where A′ is an odd integer and 1 ≤ l ≤ D − 1.
Instead of storing multiples of A, we can store multiples of A′ in the LUT, and the LUT output
can be left-shifted by l bits by a hardwired shifter. Similarly, an address-generation circuit can
be formed as in Figure 3.7, since every shifted address YL (except the last one) is an odd integer.
Figure 3.7: Address-generation circuit.
3.3 MEMORY-BASED MULTIPLICATION BY INPUT OPERAND
DECOMPOSITION
3.3.1 Wallace Tree
Multipliers play an important role in today's digital signal processing and various other
applications. With advances in technology, many researchers have tried, and are still trying, to
design multipliers that offer one or more of the following design targets: high speed, low power
consumption, and regularity of layout (and hence less area), making them suitable for high-speed,
low-power, and compact VLSI implementations. The common multiplication method is the
"add and shift" algorithm. In parallel multipliers, the number of partial products to be added is
the main parameter that determines the performance of the multiplier. To improve speed, the
Wallace tree algorithm can be used to reduce the number of sequential adding stages.
If the multiplicand is N bits wide and the multiplier is M bits wide, then there are N × M
partial-product bits. The way the partial products are generated and summed up is what
differentiates the various multiplier architectures. Multiplication of binary numbers can be
decomposed into additions. Consider the multiplication of two 8-bit numbers A and B to
generate the 16-bit product P.
A Wallace multiplier is a parallel multiplier that uses the carry-save addition algorithm to
reduce latency. The main aim of the Wallace tree is to achieve higher speed and lower power
consumption.
The steps involved are:
1. Multiply each bit of one of the arguments by each bit of the other, yielding n² partial-product bits.
2. Reduce the number of partial products to two by layers of full and half adders.
3. Group the wires into two numbers, and add them with a conventional adder.
The second phase works as follows. As long as there are three or more wires with the same
weight, add a further layer:
- Take any three wires with the same weight and input them into a full adder. For each three
input wires, the result is one output wire of the same weight and one output wire of the next
higher weight.
- If two wires of the same weight are left, input them into a half adder.
- If just one wire is left, connect it to the next layer.
In the conventional 8-bit Wallace tree multiplier design, a large number of addition operations
is required. Using carry-save adders, three partial-product terms can be added at a time to form
a sum and a carry. The sum signal is used by the full adder of the next level, while the carry
signal is used by the adder involved in generating the next output bit, giving an overall delay
proportional to log3/2(n) for n rows. In the first and second stages of the Wallace structure, the
partial products depend only on the inputs obtained from the AND array. For the higher stages,
however, the final value (PP3) depends on the carry-out of the previous stage, and this
operation is repeated for the consecutive stages. A Wallace tree multiplier is considered faster
than a simple array multiplier and is an efficient implementation of a digital circuit that
multiplies two integers. The Wallace tree multiplier is shown in Figure 3.8.
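The three steps can be sketched as a behavioural Python model: column-wise reduction with full and half adders. This models the dataflow only, and it applies a half adder to every two-wire column, which is a simplification of the classic Wallace heuristic.

```python
from collections import defaultdict

def wallace_multiply(a, b, n=8):
    """Behavioural model of an n-bit Wallace tree multiplier."""
    # Step 1: AND-array partial-product bits, grouped by weight (column).
    cols = defaultdict(list)
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    # Step 2: reduce with layers of full/half adders until every column
    # holds at most two wires.
    while any(len(bits) > 2 for bits in cols.values()):
        nxt = defaultdict(list)
        for w, bits in cols.items():
            while len(bits) >= 3:      # full adder: 3 wires -> sum + carry
                x, y, z = bits.pop(), bits.pop(), bits.pop()
                nxt[w].append(x ^ y ^ z)
                nxt[w + 1].append((x & y) | (y & z) | (x & z))
            if len(bits) == 2:         # half adder: 2 wires -> sum + carry
                x, y = bits.pop(), bits.pop()
                nxt[w].append(x ^ y)
                nxt[w + 1].append(x & y)
            nxt[w].extend(bits)        # a lone wire passes to the next layer
        cols = nxt
    # Step 3: add the two remaining rows with a conventional adder.
    return sum(bit << w for w, bits in cols.items() for bit in bits)
```

Each adder preserves the column-weighted sum of the wires, so the final conventional addition recovers the exact product.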
3.3.2 Adder Tree
Although the memory core of the LUT multiplier is reduced to nearly one-fourth by the
proposed optimization technique, it is not efficient for operands of small widths, since it
requires an adder to add the offset value. However, it could be used for multiplication with
input of large word size by an input decomposition scheme. When the width of the input
multiplicand X is large, direct implementation of LUT multiplier involves a very large LUT.
Therefore, the input word could be decomposed into a certain number of segments or subwords,
and the partial products pertaining to different subwords could be shift added to obtain the
desired product. Let the input operand X be decomposed into T subwords, i.e., {X1, X2, . . . ,
XT}. The product word C = A · X can be written as the sum of partial products as
C = Σ (i = 1 to T) 2^{S(i−1)} · Ci (3.9)
Figure 3.8: Wallace tree multiplier
Ci = A · Xi, for 1 ≤ i ≤ T (3.10)
where each Xi = {x(i−1)S x(i−1)S+1 . . . xiS−1} is an S-bit subword for 1 ≤ i ≤ T − 1,
XT = {x(T−1)S x(T−1)S+1 . . . x(T−1)S+S′−1} is the last subword of S′ bits, where S′ ≤ S, and xi is
the (i + 1)th bit of X.
A generalized structure for parallel implementation of LUT multiplication for input size
L = 5(T − 1) + S′, where S′ ≤ 5, is shown in Figure 3.9. The input multiplicand X is
decomposed into (T − 1) more-significant subwords X1, X2, . . . , XT−1 and a less-significant
S′-bit subword XT. The partial products Ci = A · Xi, for 1 ≤ i ≤ T − 1, are obtained from
(T − 1) LUT multipliers optimized by APC and OMS. The T-th LUT multiplier is a
conventional LUT, not optimized by APC and OMS, since it is required to store the sum of the
offset values, V = 16A × 2^S′ × [(2^{5(T−1)} − 1)/(2^5 − 1)], pertaining to all the (T − 1)
optimized LUTs for partial-product generation. The T-th LUT therefore stores the values
(A · XT + V). The sign of each optimized LUT output is modified according to the value of the
most significant bit of the corresponding subword Xi, for 1 ≤ i ≤ T − 1, and all the LUT
outputs are added together by an adder tree, as shown in Figure 3.9. With this adder tree,
products for large input operands can be computed with little additional delay: the input is
divided into blocks, and each block is fed to its own LUT.
Figure 3.9: Proposed LUT-based multiplier for L = 5
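Equations (3.9) and (3.10) can be checked with a small Python model. Each subword here indexes a plain, non-optimized product table, so the shared offset term V of the full design is not modelled; the sketch verifies only the shift-add identity, and the function name is illustrative.

```python
def lut_decomposed_multiply(A, X, L, S=5):
    """Shift-add the per-subword partial products Ci = A*Xi, per Eq. (3.9)."""
    lut = [A * v for v in range(1 << S)]     # conventional per-subword LUT
    C, i = 0, 0
    while i * S < L:
        width = min(S, L - i * S)            # last subword may be S' < S bits
        Xi = (X >> (i * S)) & ((1 << width) - 1)
        C += lut[Xi] << (S * i)              # weight 2^{S*i} (i counted from 0)
        i += 1
    return C
```

Because the subwords partition the bits of X, the weighted sum of the partial products equals A·X for any operand width L.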
3.4 FPGA BASED CACHE MEMORY
Cache systems are on-chip memory elements used to store data. A cache controller is
used for tracking the induced miss rate in the cache memory. If data requested by the
microprocessor is present in the cache memory, it is called a "cache hit". The advantage of
storing data in a cache, as compared to RAM, is faster retrieval, but it comes at the cost of
on-chip energy consumption. This project deals with the design of an efficient cache memory
for detecting the miss rate with low power consumption.[2] The cache controller communicates
between the microprocessor and the cache memory to carry out memory-related operations.
The functionality of the design is explained below.
The cache controller receives the address that the microprocessor wants to access and
looks for it in the L1 cache. If the address is present in L1, the data from that location is
provided to the microprocessor via the data bus. If the address is not found in L1, a cache miss
occurs, and the cache controller looks for the same address in the L2 cache. If the address is
present in L2, the data from that location is provided to the microprocessor, and the same data
replaces an entry in L1. If the address is not found in L2 either, a further cache miss occurs.[6]
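The lookup sequence above can be sketched as a small Python model. All names here (CacheLevel, lookup) are illustrative assumptions, and the model ignores replacement-policy details.

```python
class CacheLevel:
    def __init__(self):
        self.store = {}              # tag -> data

    def read(self, addr):
        return self.store.get(addr)  # None models a tag-compare miss

def lookup(l1, l2, main_memory, addr):
    """Return (data, 'L1 hit' | 'L2 hit' | 'miss') for one access."""
    data = l1.read(addr)
    if data is not None:
        return data, "L1 hit"
    data = l2.read(addr)
    if data is not None:
        l1.store[addr] = data        # replace into L1 on an L2 hit
        return data, "L2 hit"
    data = main_memory[addr]         # both levels missed
    l1.store[addr] = data
    l2.store[addr] = data
    return data, "miss"
```

A first access to an address misses both levels; a repeated access then hits L1, mirroring the controller behaviour described above.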
The cache memory consists of a cache tag memory, a cache data memory, a cache tag
comparator, and a counter. This project deals with the detection of cache misses. The proposed
design implements the cache tag memory, the cache tag comparator, and a 10-bit counter, and
omits the cache data memory, since addresses are stored in the tag memory while data would
be stored in the data memory, and only misses are to be detected.
Figure 3.10: Functionality design (the microprocessor communicates over the system bus with the cache controller, which manages the Level 1 and Level 2 cache memories)
The address requested by the microprocessor is compared with the main-memory
addresses of the cached data, which are stored in the tag memory. The cache tag memory,
shown in Figure 3.11, stores the main-memory addresses of the data.[6]
The 10-bit counter, shown in Figure 3.12, generates the 10-bit addresses for the write
and read operations of the cache tag memory, and is incremented by one on each clock pulse.
The 22-bit cache tag comparator, shown in Figure 3.13, compares the 22-bit address requested
by the microprocessor with the 22-bit address stored in the tag memory.
Figure 3.11: Tag memory
Figure 3.12: Ten bit counter
The implementation of the cache memory requires the cache tag memory, the 10-bit counter,
and the cache tag comparator, as shown in Figure 3.14.
Figure 3.13: Tag comparator
Figure 3.14: Implementation of cache memory
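The tag-memory arrangement can be modelled functionally as below. The sizes follow the text (2^10 tag locations addressed by the counter, 22-bit tags); the counter-driven sweep and the Python names are illustrative assumptions.

```python
TAG_BITS, ADDR_BITS = 22, 10                    # 22-bit tags, 10-bit counter

class TagMemory:
    def __init__(self):
        self.tags = [None] * (1 << ADDR_BITS)   # 1024 tag entries

def detect(tag_mem, requested):
    """Sweep the tag memory with the 10-bit counter; the 22-bit comparator
    reports a hit on a tag match, otherwise the access is a miss."""
    assert 0 <= requested < (1 << TAG_BITS)
    for counter in range(1 << ADDR_BITS):       # counter increments per clock
        if tag_mem.tags[counter] == requested:  # 22-bit tag comparison
            return "hit"
    return "miss"
```

A hardware direct-mapped cache would index the tag memory rather than sweep it; the sweep here simply mirrors the counter-addressed read described in the text.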
CHAPTER 4
PROJECT IMPLEMENTATION
4.1 FLOW CHART
4.1.1 Antisymmetric Product Coding
Flow: Start → read input X → is start = 0? If yes, the product is zero and is output directly.
If no → compute address → read APC word from LUT → is MSB = 1? If yes,
product = 16A + APC word; if no, product = 16A − APC word → output product → stop.
In the APC flow chart, the first step is to check whether start is zero or one. If it is zero, the
product value, zero, is obtained directly. Otherwise, the next step is the computation of the
address. After computing the address, the APC word is obtained from the LUT and the MSB
bit is checked to select addition or subtraction with 16A.
4.1.2 Combined APC and OMS
Flow: Start → read inputs clock, start, and X → if start = 0, go directly to the output;
otherwise compute (s1 s0) → from (s1 s0), compute the value of y (the number of shifts) →
assign each value to the memory array → output → stop.
In combined APC and OMS, the s0 and s1 values are calculated first. The value obtained is
used to find the number of shifts, which is performed by a shifter. The product value is
obtained from the LUT after address computation; even multiples are obtained after shifting.
4.1.3 Adder Tree
Flow: Start → read input X → divide input X into 5-bit subwords → feed each 5-bit subword
to its own LUT → sum the partial products → output product → stop.
The adder tree is used for easy computation of the product. A large input word is divided into
smaller subwords, and each subword is fed into a different LUT. Finally, the partial products
are added together to obtain the product value.
4.1.4 FPGA Based Cache Memory
Flow: Start → get the requested address from the microprocessor → is the address available
in the tag memory? If yes, it is a hit; if no, it is a miss → stop.
In the FPGA-based cache memory, the tags are used to check the availability of data in the
cache. The comparator compares the requested address with the addresses available in the tag
memory. If the compared values match, a cache hit occurs; otherwise, a cache miss occurs. A
cache hit means the data is available in the memory; if the data is not available, it is a cache
miss. In this scheme, the data memory is not used.
4.2 SOFTWARE IMPLEMENTATION
4.2.1 ModelSim DE 6.5e
Verilog HDL is a hardware description language used to design and document electronic
systems. A hardware description language looks much like a programming language such as C;
it is a textual description consisting of expressions, statements and control structures. ModelSim
is a powerful simulator that can be used to simulate the behavior and performance of logic
circuits. It supports behavioral, register transfer level, and gate-level modeling. The simulator
allows the user to apply inputs to the designed circuit, usually referred to as test vectors, and to
observe the outputs generated in response. It shows how the simulator can be used to perform
functional simulation of a circuit specified in Verilog HDL. The version of ModelSim used in
this project is ModelSim DE 6.5e.
4.2.2 Xilinx ISE
The Xilinx ISE tools allow a design to be entered in several ways, including graphical
schematics, state machines, VHDL, and Verilog. Xilinx ISE (Integrated Synthesis
Environment) is a software tool produced by Xilinx for synthesis and analysis of HDL designs,
enabling the developer to compile designs, perform timing analysis, examine RTL diagrams,
simulate a design's reaction to different stimuli, and configure the target device with the
programmer. Xilinx ISE is the design environment for FPGA products from Xilinx. The
primary user interface of the ISE is the Project Navigator, which includes the design hierarchy
(Sources), a source code editor (Workplace), an output console (Transcript), and a processes
tree (Processes). A comparative study was made by analysing the Xilinx results, which give a
detailed description of the area used by the system.
The design hierarchy consists of design files (modules), whose dependencies are
interpreted by the ISE and displayed as a tree structure. For single-chip designs there may be
one main module, with other modules included by the main module, similar to the
main() subroutine in C++ programs. Design constraints, which include pin configuration and
mapping, are specified in a user constraints file (UCF).
CHAPTER 5
FUTURE SCOPE
The calculations in this project are performed with reduced word length. The memory is
further reduced by storing only the odd multiples in the look-up table; Antisymmetric Product
Coding combined with Odd Multiple Storage thus modifies the look-up table. This approach
provides a reduction in LUT size. The proposed system can be used for efficient
implementation of high-precision multiplication by input-operand decomposition. It is found
that the LUT-based design yields significantly less area and a shorter multiplication time.
This project brings out the possibility of using LUT-based multipliers to implement
constant multiplication for DSP-based applications, the higher orders of which can be
implemented for DSP-based filters. The idea of LUT optimization can be extended to obtain
maximum advantage in area and speed by designing the LUT from NAND or NOR read-only
memories, with the arithmetic shifts implemented by an array of barrel shifters using
metal-oxide-semiconductor transistors. Further work could derive OMS-APC based LUTs for
higher input sizes, with different forms of decomposition and parallel and pipelined addition
schemes for suitable area-delay tradeoffs.
The idea of FPGA implementation of cache memory would be of great utility to many
modern embedded applications, for which both high performance and low power are of great
importance. This cache memory may be used in future work to design an FPGA-based
set-associative cache memory and a cache controller for tracking the induced miss rate.
Further, this cache memory can be extended to higher or multiple levels of cache to track the
miss rate in large applications.[4],[2]
CHAPTER 6
SIMULATION RESULTS
6.1 APC
The number of slice flip-flops utilized is 25, and the number of input LUTs utilized is 92.
6.2 COMBINED APC–OMS
In the combined method, the number of slice flip-flops is 16 and the number of input LUTs is
18, much reduced compared with APC alone.
6.3 ADDER TREE
6.4 FPGA BASED CACHE MEMORY
CHAPTER 7
CONCLUSION
Two different techniques for memory optimization were implemented. The first is LUT
optimization using the combined APC-OMS technique, and the second is the FPGA
implementation of cache memory.
The LUT optimization using the combined APC and OMS techniques reduces on-chip
area compared with using APC alone, i.e., the number of slice flip-flops is reduced from 25 to
16, and thus memory optimization is achieved. This shows the possibility of using LUT-based
multipliers to implement constant multiplication for DSP applications. The full advantages of
this LUT-based design, however, could be derived if the LUTs are implemented as NAND or
NOR read-only memories and the arithmetic shifts are implemented by an array barrel shifter
using metal-oxide-semiconductor transistors.
The FPGA implementation of cache memory was designed for detecting the miss rate in
the cache. The use of a data memory was eliminated, and the memory was thereby optimized;
this also helps in reducing design complexity and power consumption. This approach would be
of great utility to many modern embedded applications, for which both high performance and
low power are of great importance.