Memory Optimization Techniques
Dept. of ECE,ASIET Page 1
CHAPTER 1
INTRODUCTION
Along with progressive device scaling, semiconductor memory has become cheaper,
faster, and more power-efficient. Moreover, embedded memories are expected to have a
dominating presence in system-on-chips (SoCs), where they may exceed 90% of the total SoC
content. It has also been found that the transistor packing density of memory components is not
only higher than that of logic components, but is also increasing much faster. Several
optimization techniques have been adopted in response to this change.[1]
A lookup table is an array containing precalculated values that can be retrieved from
memory whenever needed. The project optimizes the memory space needed to store data in a
look up table by combining the methods of Antisymmetric Product Code (APC) and Odd
Multiple Storage (OMS). The existing methodology reduces the LUT size by half, whereas the
combined effect of APC and OMS reduces the LUT size to one fourth. The scheme relies on
selective sign reversal, occupies less area, and simplifies multiplication in memory, saving
30%-50% of the area-delay product. LUT optimization is a memory-based multiplication
technique in which only the odd multiples of a fixed coefficient need to be stored; this is known
as Odd Multiple Storage (OMS). If a conventional LUT holds 2^L words, then after OMS
optimization only 2^(L-1) (i.e., 2^L/2) words are stored, and the even multiples are derived by
left-shift operations using a barrel shifter. Further, the product codes are recoded as
antisymmetric pairs (APC), which reduces the LUT size by another factor of two.
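As a concrete illustration of the OMS idea described above, the following Python sketch stores only the odd multiples of a fixed coefficient and recovers the even multiples by left shifts. The coefficient A, the width L, and all names here are illustrative assumptions, not taken from the report.

```python
# Sketch of Odd Multiple Storage (OMS): for a fixed coefficient A and an
# L-bit input X, store only the odd multiples of A; every even multiple is
# an odd multiple shifted left by the number of trailing-zero bits of X.

A, L = 7, 5  # illustrative fixed coefficient and input word length

# LUT holds only the odd multiples A, 3A, 5A, ..., i.e. 2^(L-1) entries
lut = {m: A * m for m in range(1, 2**L, 2)}

def oms_multiply(x):
    """Multiply x by A using the odd-multiple LUT plus left shifts."""
    if x == 0:
        return 0
    shifts = 0
    while x % 2 == 0:          # strip trailing zeros: x = odd * 2^shifts
        x //= 2
        shifts += 1
    return lut[x] << shifts    # barrel-shifter stage

# Every product matches direct multiplication
assert all(oms_multiply(x) == A * x for x in range(2**L))
```

The shift loop plays the role of the barrel shifter mentioned in the text: the stored odd multiple supplies the product's odd factor, and the shifts supply the powers of two.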
The other optimization technique discussed here is the FPGA implementation of cache
memory. The cache memory design is suitable for use in an FPGA-based cache controller and
processor. A cache is an on-chip memory element used to store data; it serves as a buffer
between a CPU and its main memory and helps match the data transfer rate of the CPU to that
of main memory. Because cache memory is closer to the microprocessor, it is faster than main
memory (RAM). The advantage of storing data in cache, compared to RAM, is faster retrieval;
the disadvantage is on-chip energy consumption. Implementing the cache on the FPGA
eliminates the need for separate data memory and thus optimizes memory.[2]
CHAPTER 2
LITERATURE REVIEW
2.1 DISTRIBUTED ARITHMETIC
Distributed Arithmetic (DA) is an efficient technique for calculating inner products, i.e.,
multiply-and-accumulate (MAC) operations. The MAC operation is common in digital signal
processing algorithms. The direct method uses dedicated multipliers, which are fast but
consume considerable hardware. The DA technique is bit-serial in nature and can therefore
appear slow. It turns out, however, that when the number of elements in a vector is nearly the
same as the word size, DA is quite fast. It replaces explicit multiplication by ROM lookups,
which is an efficient technique to implement on field-programmable gate
arrays.[4]
DA-based computation is widely used in filters, where the filter output is computed as the
inner product of the input sample vector with the coefficient vector. In the DA approach, an
LUT (lookup table) is used to store all possible values of the inner products for a fixed N-point
vector; the drawback is that the memory size of the LUT increases exponentially with the word
length of the inner products, so DA-based computation can consume considerable area.
The DA approach makes use of a pipelined structure (adder and shifter). On each cycle, an
eight-bit sample is fed to a word-serial, bit-parallel converter, from which a pair of
consecutive bits is transferred to its own DA structure. However, the Distributed Arithmetic
approach is less efficient than the LUT approach in terms of area complexity and latency.[4]
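The shift-accumulate mechanism of DA can be sketched numerically: the inner product with fixed coefficients is rebuilt bit-serially from a LUT of coefficient partial sums. The coefficients, word length, and names below are illustrative assumptions.

```python
# Sketch of Distributed Arithmetic (DA): the inner product of an input
# vector with fixed coefficients is computed bit-serially, replacing
# multipliers with a 2^K-word LUT of coefficient partial sums.

COEFFS = [3, 5, -2, 7]   # fixed coefficients A_k (illustrative)
K = len(COEFFS)
BITS = 8                 # unsigned input word length

# LUT entry at address a is the sum of coefficients selected by bits of a
lut = [sum(c for k, c in enumerate(COEFFS) if (a >> k) & 1)
       for a in range(2**K)]

def da_inner_product(xs):
    """Bit-serial shift-accumulate inner product using the LUT."""
    acc = 0
    for b in reversed(range(BITS)):          # MSB first
        address = sum(((x >> b) & 1) << k for k, x in enumerate(xs))
        acc = (acc << 1) + lut[address]      # shift-accumulate
    return acc

xs = [10, 20, 30, 40]
assert da_inner_product(xs) == sum(c * x for c, x in zip(COEFFS, xs))
```

Each cycle slices one bit position across all inputs, forms a LUT address from that bit slice, and shift-accumulates the stored partial sum, exactly as the pipelined adder/shifter structure described above.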
2.1.1 Parallel Implementation Of DA [4]
An inner product containing many terms can be partitioned into a number of smaller
inner products, which can be computed and summed using either distributed arithmetic or an
adder tree.
2.1.2 Reducing the Memory Size
One of the several possible ways to reduce the overall memory requirement is to
partition the memory into smaller pieces that are added before the shift accumulator.
The second approach is based on a special coding of the ROM content, which can halve the
memory size. One can also divide the N address bits of the ROM into N/K groups of K bits, so
that a single ROM of size 2^N is replaced by N/K ROMs of size 2^K each.
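The partitioning idea can be checked numerically: because DA partial sums are additive across coefficient groups, the outputs of the N/K small tables add up to the single large table's output. The coefficients, N, K, and names below are illustrative assumptions.

```python
# Sketch of the LUT-partitioning idea: an N-input table of coefficient
# partial sums splits into N/K smaller tables whose outputs are added,
# because the partial sums are additive across coefficient groups.

COEFFS = [3, 5, -2, 7, 1, 4, -6, 2]   # N = 8 fixed coefficients (illustrative)
N, K = len(COEFFS), 4                  # split into N/K = 2 groups of K bits

def partial_sum_lut(coeffs):
    """2^len(coeffs)-word LUT of all partial sums of these coefficients."""
    return [sum(c for i, c in enumerate(coeffs) if (a >> i) & 1)
            for a in range(2**len(coeffs))]

big_lut = partial_sum_lut(COEFFS)                    # one 2^8 = 256-word ROM
small_luts = [partial_sum_lut(COEFFS[g:g + K])       # two 2^4 = 16-word ROMs
              for g in range(0, N, K)]

def partitioned_lookup(address):
    """Sum of group-LUT outputs equals the single big-LUT output."""
    return sum(lut[(address >> (g * K)) & (2**K - 1)]
               for g, lut in enumerate(small_luts))

assert all(partitioned_lookup(a) == big_lut[a] for a in range(2**N))
```

Here 256 stored words shrink to 2 x 16 = 32 words at the cost of one extra adder before the shift accumulator, which is the trade-off the section describes.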
2.1.3 Comparison of Initial Years of DA
The traditional comparisons of DA are with multiplier-based solutions for problems
pertaining to filters. When DA was devised, the comparisons were given in terms of the number
of TTL ICs required to mechanize a certain type of filter.
Multiplication is the most expensive operation here because it amounts to repeated
addition; it requires a large portion of the chip area and consumes more power. Memory-based
structures are more regular than multiply-and-accumulate structures and have many advantages:
greater potential for high-throughput and reduced-latency implementation, and lower expected
dynamic power consumption owing to less switching activity in memory-based operations.
Memory-based structures are well suited to many digital signal processing algorithms that
involve multiplication with a fixed set of coefficients.[4]
2.2 LOOK UP TABLE
A lookup table is an array that replaces a runtime computation with a simpler array-indexing
operation. The savings in processing time can be significant, since retrieving a value from
memory is often faster than carrying out an expensive computation or input/output operation.
The tables may be precalculated and stored in static program storage, or even stored in
hardware on application-specific platforms. Lookup tables are also used extensively to validate
input values by matching them against a list of valid items in an array.
In data analysis applications, such as image processing, a lookup table (LUT) is used to
transform the input data into a more desirable output format. For example, a grayscale picture of
the planet Saturn will be transformed into a colour image to emphasize the differences in its
rings.
A classic example of reducing run-time computations using lookup tables is to obtain
the result of a trigonometry calculation, such as the sine of a value. Calculating trigonometric
functions can substantially slow a computing application. The same application can finish much
sooner if it first precalculates the sine for a set of values, for example for each whole number of
degrees. When the program later requires the sine of a value, it can use the lookup table to
retrieve the closest precomputed sine from a memory address, possibly interpolating to the sine
of the desired value, instead of calculating it by mathematical formula. Lookup tables are used
in this way by mathematics coprocessors in computer systems.
There are intermediate solutions that use tables in combination with a small amount of
computation, often using interpolation. Pre-calculation combined with interpolation can produce
higher accuracy for values that fall between two precomputed values. This technique requires
slightly more time to be performed but can greatly enhance accuracy in applications that require
the higher accuracy. Depending on the values being precomputed, pre-computation with
interpolation can also be used to shrink the lookup table size while maintaining accuracy.
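The degree-spaced sine table with linear interpolation described above can be sketched directly. The table granularity and function names are illustrative assumptions.

```python
import math

# Sketch of a precomputed sine table with linear interpolation, as
# described above: one entry per whole degree, interpolated in between.

SINE_TABLE = [math.sin(math.radians(d)) for d in range(361)]

def fast_sin(degrees):
    """Approximate sin(degrees) from the table, interpolating linearly."""
    degrees %= 360
    lo = int(degrees)
    frac = degrees - lo
    return SINE_TABLE[lo] + frac * (SINE_TABLE[lo + 1] - SINE_TABLE[lo])

# Table hits are exact; interpolated values carry only a tiny error
assert abs(fast_sin(30.0) - 0.5) < 1e-9
assert abs(fast_sin(45.5) - math.sin(math.radians(45.5))) < 1e-4
```

This also illustrates the size/accuracy trade-off mentioned above: a coarser table with interpolation can replace a much larger table of directly stored values.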
There are hardware LUTs as well. In digital logic, an n-bit lookup table can be implemented
with a multiplexer whose select lines are the inputs of the LUT and whose data inputs are
constants. An n-bit LUT can encode any n-input Boolean function by modeling such functions
as truth tables. This is an efficient way of encoding Boolean logic functions, and LUTs with
4-6 inputs are in fact the key component of modern Field Programmable Gate Arrays (FPGAs).
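The truth-table encoding above can be demonstrated in software: tabulate an n-input Boolean function into 2^n bits, then "read" it by using the inputs as select lines, the way an FPGA logic cell does. The majority function and all names are illustrative choices.

```python
# Sketch of a hardware-style LUT: an n-input Boolean function stored as a
# truth table and read by concatenating the inputs into an address.

def make_lut(func, n):
    """Tabulate an n-input Boolean function into a 2^n-entry truth table."""
    return [func(*(((a >> i) & 1) for i in range(n))) for a in range(2**n)]

# A 3-input majority function as the example function to encode
majority = lambda a, b, c: int(a + b + c >= 2)
lut = make_lut(majority, 3)

def lut_eval(lut, *bits):
    """Select-line read: the input bits form the LUT address."""
    address = sum(b << i for i, b in enumerate(bits))
    return lut[address]

assert lut_eval(lut, 1, 1, 0) == 1   # two ones: majority holds
assert lut_eval(lut, 1, 0, 0) == 0   # one one: majority fails
```

Any 3-input function, not just majority, fits the same 8-entry table, which is why a LUT cell can implement arbitrary logic.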
Multiplication, which is used in many processors for performing arithmetic operations,
has long been a subject of research aimed at smaller and faster architectures.
Multiplication, being repeated addition, forms a basis of processor arithmetic, and the speed,
reliability, and efficiency of the underlying adder architecture can dominate overall
performance. Nowadays a prime design aim is to minimize the core area.
A conventional lookup-table (LUT)-based multiplier is shown in Figure 2.1, where A is a
fixed coefficient, and X is an input word to be multiplied with A. Assuming X to be a positive
binary number of word length L, there can be 2^L possible values of X, and accordingly, there
can be 2^L possible values of the product C = A·X.
Figure 2.1: Conventional LUT-based multiplier

Therefore, for memory-based multiplication, an LUT of 2^L words, consisting of
precomputed product values corresponding to all possible values of X, is conventionally used.
The product word A·Xi is stored at location Xi for 0 ≤ Xi ≤ 2^L − 1, so that if the L-bit binary
value of Xi is used as the address for the LUT, the corresponding product value A·Xi is
available as its output.
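The conventional scheme of Figure 2.1 has a one-line software analogue: precompute all 2^L products and use the input word itself as the memory address. The coefficient and width are illustrative assumptions.

```python
# Sketch of the conventional LUT-based multiplier of Figure 2.1: for a
# fixed coefficient A and L-bit input X, all 2^L products are precomputed
# and the input word is used directly as the memory address.

A, L = 13, 5                         # illustrative coefficient and width

lut = [A * x for x in range(2**L)]   # 2^L precomputed product words

def lut_multiply(x):
    """X itself is the LUT address; the stored word is A*X."""
    return lut[x]

assert lut_multiply(21) == 13 * 21
assert len(lut) == 32                # 2^L words: the size APC/OMS attack
```

The 32-word table here is exactly the baseline that the APC and OMS techniques of the next section reduce.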
2.3 APC AND OMS
Multiplication is a major arithmetic operation in signal processing and in ALUs, and a
memory-based multiplier uses a look-up table for its computations. However, little significant
work exists on LUT optimization for memory-based multiplication. Antisymmetric Product
Coding (APC) is a technique for LUT-based multiplication that reduces the size of the
conventional LUT by 50%; it is based on the antisymmetry that two's complement pairs of
inputs exhibit in their products.

For the multiplication of any binary word X of size L with a fixed coefficient A, instead
of storing all 2^L possible values of C = A·X, only 2^(L-1) words corresponding to the odd
multiples of A need be stored in the LUT, while all the even multiples of A can be derived by
left-shift operations on one of those odd multiples. This approach is known as Odd Multiple
Storage (OMS). A barrel shifter producing left shifts can be used to derive all the even
multiples of A.
The implementation of the proposed APC-OMS combined LUT for memory based
multiplier uses two techniques, APC and OMS method. This method is supposed to reduce the
area to one fourth. The multiplier uses four blocks. The address-generation block converts the
input into the address bits d0, d1, d2, d3, which combine both the APC and OMS mappings.
A 4-to-9 line address decoder converts the address d0, d1, d2, d3 into LUT word-select lines,
the memory array is the LUT itself, and a barrel shifter converts the LUT output into the
desired product.

When the APC approach is combined with the OMS technique, the two's complement
operations can be simplified, since the input address and the LUT output can always be
transformed into odd integers; this reduces the LUT size to one fourth of the conventional
LUT. The proposed LUT-based multiplier is found to involve comparable area and time
complexity for a word size of 5 bits.
2.4 FIELD PROGRAMMABLE GATE ARRAY
Field Programmable Gate Arrays (FPGAs) are semiconductor devices that are based
around a matrix of configurable logic blocks (CLBs) connected via programmable
interconnects. FPGAs can be reprogrammed to desired application or functionality requirements
after manufacturing. This feature distinguishes FPGAs from Application Specific Integrated
Circuits (ASICs), which are custom manufactured for specific design tasks.
The need for programmable devices was recognized as early as the 1970s, with the design of
the PLD by Ron Cline of Signetics. Digital ICs such as TTL or CMOS parts have fixed
functionality: the user has no option to change or modify how they work, since they behave
exactly as designed by the manufacturer. To change this, designers began looking for a
methodology by which the functionality of an IC could be modified, and the concept of using
fuses in ICs gained momentum.

Blowing a fuse between two contacts, or keeping it intact, could be controlled by
software, and devices whose function could be modified this way were called Programmable
Logic Devices (PLDs). Many digital chips fall under the category of PLDs, but the most
fundamental and primitive were memories such as ROM and PROM. PLAs were
introduced in the early 1970s by Philips, but their main drawbacks were that they were
expensive to manufacture and offered somewhat poor speed-performance. Both disadvantages
were due to the two levels of configurable logic, because programmable logic planes were
difficult to manufacture and introduced significant propagation delays.
To overcome these problems, Programmable Array Logic (PAL) devices were
developed.

Memory: Memory is used to store, provide access to, and allow modification of data
and program code within a processor-based electronic circuit or system. The two basic
types of memory are ROM (read-only memory) and RAM (random-access memory).

ROM is used to hold program code that must be retained when power is removed; it
provides non-volatile storage. The code can be fixed when the memory is fabricated
(mask-programmable ROM), electrically programmed once (PROM, programmable ROM), or
programmed multiple times. Multiple-programming capability requires the ability to erase prior
contents, which is available with EPROM (erasable programmable ROM, erased using
ultraviolet [UV] light), EEPROM or E2PROM (electrically erasable PROM), and flash (also
electrically erased). PROM is sometimes considered to be in the same category of circuit as
programmable logic, although in this text PROM is treated in the memory category only.
RAM is used to hold data and program code that require fast access and the ability
to modify the contents during normal operation. RAM differs from read-only memory (ROM)
in that it can be both read from and written to in the normal circuit application; flash memory,
by contrast, is sometimes referred to as non-volatile RAM (NVRAM). RAM provides volatile
storage: unlike ROM, the contents of RAM are lost when the power is removed. There are two
main types of RAM: static RAM (SRAM) and dynamic RAM (DRAM).
ROM (Read-Only Memory): A ROM is essentially a storage device in which a fixed set
of binary information is stored. The user first specifies the binary information to be stored, and
it is then embedded in the unit to form the required interconnection pattern. A ROM contains
special internal links that can be fused or broken; certain links are blown out to realize the
desired interconnections for a particular application and to form the required circuit path. Once
a pattern is established for a
ROM, it remains fixed even if the power supply to the circuit is switched off and then
switched on again. A ROM has n input lines and m output lines. Each bit combination of the
input variables is called an address, and each bit combination formed at the output lines is
called a word. Thus, an address is essentially a binary number denoting one of the minterms of
n variables, and the number of bits per word equals the number of output lines m. From n input
variables it is possible to generate p = 2^n distinct addresses; since there are 2^n distinct
addresses, there are 2^n distinct words said to be stored in the device, and an output word is
selected by its unique address. The address applied to the input lines specifies which word
appears at the output lines at any given time.
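The n-input, m-output addressing just described can be modeled in a few lines of Python. The sizes and the stored contents are illustrative assumptions.

```python
# Toy ROM model for the description above: n input lines give 2^n
# addresses, each selecting one m-bit output word (contents illustrative).

n, m = 3, 4                                   # 3 address bits, 4-bit words
rom = [(addr * 5) % (2**m) for addr in range(2**n)]   # fixed contents

assert len(rom) == 2**n                       # 2^n distinct stored words
assert all(0 <= word < 2**m for word in rom)  # each word fits m output lines
assert rom[0b101] == 25 % 16                  # address 5 selects word 5
```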
Programmable Logic Device (PLD): Logic devices outside the fixed-function TTL and CMOS
families, whose logical operation is specified by the user through a process called
programming, are called Programmable Logic Devices. A programmable logic device is thus
an IC that contains digital logic cells and programmable interconnect. The idea of the PLD was
first conceived by Ron Cline of Signetics in 1975, with programmable AND and OR planes.
The basic idea of these devices is to let the designer configure the logic cells and interconnect
to form a digital electronic circuit within a single IC package: the hardware resources are
configured to implement a required functionality, and by changing the hardware configuration,
the PLD performs a different function.
Three types of PLD are available: the simple programmable logic device (SPLD), the
complex programmable logic device (CPLD), and the field-programmable gate array (FPGA).
A PLD with simple architectural features is called an SPLD. The SPLD was introduced prior to
the CPLD and FPGA, and based on architecture, SPLDs are classified into three types: the
programmable logic array (PLA), programmable array logic (PAL), and generic array logic
(GAL).

PLA (Programmable Logic Array): The PLA is a type of LSI device, conceptually similar
to a ROM. However, a PLA does not contain all the AND gates needed to form a full decoder,
and so does not generate all the minterms as a ROM does.
In the PLA, the decoder is replaced by a group of AND gates with buffers/inverters,
each of which can be programmed to generate some product terms of input variable
combinations that are essential to realize the output functions. The AND and OR gates inside
the PLA are initially fabricated with the fusible links among them. The required Boolean
functions are implemented in sum of the products form by opening the appropriate links and
retaining the desired connections. So, the PLA consists of two programmable planes AND and
OR planes.
The AND plane consists of programmable interconnect feeding AND gates, and the OR
plane consists of programmable interconnect feeding OR gates. In a typical depiction there are
four inputs to the PLA and four outputs from it. Each input can be combined with any of the
other inputs in an AND gate by connecting the crossover point of the vertical and horizontal
interconnect lines in the AND-plane programmable interconnect. Initially, the crossover points
are not electrically connected, but configuring the PLA connects particular crossover points
together. By convention, each AND gate is drawn with a single line to its input, but this line
stands for all of the inputs (vertical lines) that can be connected; hence, for four PLA inputs,
each AND gate effectively has four inputs. The single output from each AND gate is applied to
the OR-gate programmable interconnect.
Again, the crossover points are initially not electrically connected, but configuring the
PLA connects particular crossover points together. Each OR gate is likewise drawn with a
single input line, standing for all the AND-gate outputs that can be connected to it; hence, for
four AND gates, each OR gate effectively has four inputs. The function is implemented in
AND-OR form when the output link across the inverter is in place, or in AND-OR-INVERT
form when the link is blown off.
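The two-plane structure can be sketched behaviorally: "programming" is simply choosing which crosspoints connect. The particular product terms, outputs, and names below are illustrative assumptions.

```python
# Sketch of a PLA as two programmable planes: the AND plane forms product
# terms from selected inputs, and the OR plane sums selected product terms.

# Each product term lists its (input_index, required_value) crosspoints.
AND_PLANE = [
    [(0, 1), (1, 1)],   # term0 = a AND b
    [(1, 0), (2, 1)],   # term1 = (NOT b) AND c
]
# Each output lists which product terms its OR gate connects to.
OR_PLANE = [
    [0, 1],             # f0 = term0 OR term1
    [0],                # f1 = term0
]

def pla_eval(inputs):
    """Evaluate all PLA outputs for a tuple of input bits."""
    terms = [int(all(inputs[i] == v for i, v in term)) for term in AND_PLANE]
    return [int(any(terms[t] for t in out)) for out in OR_PLANE]

assert pla_eval((1, 1, 0)) == [1, 1]   # term0 fires, driving both outputs
assert pla_eval((0, 0, 1)) == [1, 0]   # only term1 fires
```

Reprogramming the device corresponds to editing the two plane tables, with the gates themselves unchanged, which is the essence of the sum-of-products PLA.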
Programmable Array Logic (PAL): The first PAL device was developed by Monolithic
Memories Inc. (MMI). The PAL is similar to the PLA, but in a PAL device only the AND
gates are programmable; the OR array is fixed by the manufacturer. This makes PAL devices
easier to program and less expensive than PLAs, but, since the OR array is fixed, also less
flexible than a PLA device.
Field-Programmable Gate Arrays: The FPGA concept emerged in 1985
with the XC2064 FPGA family from Xilinx. An FPGA is an integrated circuit that
contains many (from 64 to over 10,000) identical logic cells that can be viewed as standard
components. The individual cells are interconnected by a matrix of wires and programmable
switches. A user's design is implemented by specifying the simple logic function for each cell
and selectively closing the switches in the interconnect matrix. The array of logic cells and
interconnects form a fabric of basic building blocks for logic circuits. Complex designs are
created by combining these basic blocks to create the desired circuit. Unlike CPLDs (Complex
Programmable Logic Devices) FPGAs contain neither AND nor OR planes.
The FPGA architecture consists of configurable logic blocks, configurable I/O blocks,
and programmable interconnect. Also, there will be clock circuitry for driving the clock signals
to each logic block, and additional logic resources such as ALUs, memory, and decoders may
be available. The two basic types of programmable elements for an FPGA are Static RAM and
antifuses. Each logic block in an FPGA has a small number of inputs and one output. A look up
table (LUT) is the most commonly used type of logic block used within FPGAs. There are two
types of FPGAs.(i) SRAM based FPGAs and (ii) Antifse technology based(OTP) Every FPGA
consists of the following elements Configurable logic blocks(CLBs) Configurable input output
blocks(IOBs) Two layer metal network of vertical and horizontal lines for interconnecting the
CLBS.
2.4.1 Difference between an ASIC and an FPGA
ASICs and FPGAs have different value propositions, and they must be carefully
evaluated before choosing one over the other. Abundant information exists comparing the two
technologies. While FPGAs used to be selected only for lower-speed, lower-complexity,
lower-volume designs, today's FPGAs easily push past the 500 MHz performance barrier. With
unprecedented increases in logic density and a host of other features, such as embedded
processors, DSP blocks, clocking, and high-speed serial I/O at ever lower price points, FPGAs
are a compelling proposition for almost any type of design.
2.4.2 FPGA Applications
Due to their programmable nature, FPGAs are an ideal fit for many different markets. As
the industry leader, Xilinx provides comprehensive solutions consisting of FPGA devices,
advanced software, and configurable, ready-to-use IP cores for markets and applications such
as:
Aerospace & Defense - Radiation-tolerant FPGAs along with intellectual property for
image processing, waveform generation, and partial reconfiguration for SDRs.
ASIC Prototyping - ASIC prototyping with FPGAs enables fast and accurate SoC
system modeling and verification of embedded software
Audio - Xilinx FPGAs and targeted design platforms enable higher degrees of
flexibility, faster time-to-market, and lower overall non-recurring engineering costs
(NRE) for a wide range of audio, communications, and multimedia applications.
Automotive - Automotive silicon and IP solutions for gateway and driver assistance
systems, comfort, convenience, and in-vehicle infotainment.
Broadcast - Adapt to changing requirements faster and lengthen product life cycles with
Broadcast Targeted Design Platforms and solutions for high-end professional broadcast
systems.
Consumer Electronics - Cost-effective solutions enabling next generation, full-featured
consumer applications, such as converged handsets, digital flat panel displays,
information appliances, home networking, and residential set top boxes.
Data Center - Designed for high-bandwidth, low-latency servers, networking, and
storage applications to bring higher value into cloud deployments.
High Performance Computing and Data Storage - Solutions for Network Attached
Storage (NAS), Storage Area Network (SAN), servers, and storage appliances.
Industrial - Xilinx FPGAs and targeted design platforms for Industrial, Scientific and
Medical (ISM) enable higher degrees of flexibility, faster time-to-market, and lower
overall non-recurring engineering costs (NRE) for a wide range of applications such as
industrial imaging and surveillance, industrial automation, and medical imaging
equipment.
Medical - For diagnostic, monitoring, and therapy applications, the Virtex FPGA and
Spartan FPGA families can be used to meet a range of processing, display, and I/O
interface requirements.
CHAPTER 3
PROJECT DESCRIPTION
The Antisymmetric Product Coding is an approach in which the LUT size can be
reduced to half, where the product words are recoded as antisymmetric pairs. The APC
approach, although providing a reduction in LUT size by a factor of two, incorporates
substantial overhead of area and time to perform the two’s complement operation of LUT
output for sign modification and that of the input operand for input mapping.
For simplicity of presentation, we assume both X and A to be positive integers. The
product words for different values of X for L = 5 are shown in Table 3.1. It may be observed in
this table that the input word X in the first column of each row is the two's complement of that
in the third column of the same row. In addition, the sum of the product values corresponding
to these two input values in the same row is 32A.[1]
Table 3.1: APC words for different input values for L = 5
Let the product values in the second and fourth columns of a row be u and v,
respectively. Since u = (u + v)/2 − (v − u)/2 and v = (u + v)/2 + (v − u)/2, for
(u + v) = 32A, we have

u = 16A − (v − u)/2 (3.1)
v = 16A + (v − u)/2 (3.2)
The product values in the second and fourth columns of Table 3.1 therefore exhibit a negative
mirror symmetry. This behaviour of the product words can be used to reduce the LUT size:
instead of storing both u and v, only (v − u)/2 is stored for each pair of inputs on a given row.
The 4-bit LUT addresses and the corresponding coded words are listed in the fifth and sixth
columns of the table, respectively. Since this representation of the product is derived from the
antisymmetric behaviour of the products, it is named the antisymmetric product code. The
4-bit address X' = (x3' x2' x1' x0') of the APC word is given by

X' = XL, if x4 = 1
X' = XL', if x4 = 0

where XL = (x3 x2 x1 x0) is formed by the four less significant bits of X, and XL' is the two's
complement of XL. The desired product can be obtained by adding or subtracting the stored
value (v − u)/2 to or from the fixed value 16A when x4 is 1 or 0, respectively, i.e.,

Product word = 16A + (sign value) × (APC word) (3.3)

where sign value = 1 for x4 = 1 and sign value = −1 for x4 = 0. The product value for X =
(10000) corresponds to the APC value "zero," which can be derived by resetting the LUT
output, instead of storing it in the LUT.

Figure 3.1: LUT-based multiplier for L = 5 using the APC technique
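The APC identity behind Table 3.1 is easy to verify numerically: a 5-bit input and its two's complement partner have products summing to 32A, so each product is 16A plus or minus the stored word. The coefficient A below is an illustrative assumption.

```python
# Numerical check of the APC identity used above for L = 5: a 5-bit input
# X and its two's complement partner X* satisfy A*X + A*X* = 32A, so the
# product is 16A plus or minus the stored APC word (v - u)/2.

A = 11                                     # illustrative fixed coefficient

for x in range(1, 16):                     # paired rows of Table 3.1
    x_pair = 32 - x                        # two's complement partner (x4 = 1)
    u, v = A * x, A * x_pair
    assert u + v == 32 * A                 # antisymmetric pair property
    apc_word = (v - u) // 2
    assert u == 16 * A - apc_word          # x4 = 0 side: subtract (eq. 3.1)
    assert v == 16 * A + apc_word          # x4 = 1 side: add (eq. 3.2)

# X = 16 (10000) maps to APC word 0: the product is exactly 16A
assert A * 16 == 16 * A
```

Storing only `apc_word` per row is precisely the halving of the LUT that APC delivers.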
3.2 COMBINED APC-OMS TECHNIQUE
For the multiplication of any binary word X of size L with a fixed coefficient A, instead
of storing all the 2^L possible values of C = A·X, only 2^(L−1) words corresponding to the odd
multiples of A need be stored in the LUT, while all the even multiples of A can be derived by
left-shift operations on one of those odd multiples. On this basis, the LUT for the
multiplication of an L-bit input with a W-bit coefficient can be designed by the following
strategy.
A memory unit of [2^(L−1) + 1] words of (W + L)-bit width is used to store the product
values, where the first 2^(L−1) words are odd multiples of A, and the last word is zero.
A barrel shifter producing a maximum of (L − 1) left shifts is used to derive all the
even multiples of A.
The L-bit input word is mapped to the (L − 1)-bit address of the LUT by an address
encoder, and the control bits for the barrel shifter are derived by a control circuit.
Consider Table 3.2, which has eight memory locations storing the eight odd multiples
A × (2i + 1) as Pi, for i = 0, 1, 2, . . . , 7. The even multiples 2A, 4A, and 8A are derived by
left-shift operations on A. Similarly, 6A and 12A are derived by left-shifting 3A, while 10A and
14A are derived by left-shifting 5A and 7A, respectively. A barrel shifter producing a maximum
of three left shifts can be used to derive all the even multiples of A.
Suppose the word stored for X = (00000) is not 0 but 16A; 16A can be obtained from A by
four left shifts with a barrel shifter. However, if 16A is not derived from A, only a maximum of
three left shifts is required to obtain all the other even multiples of A. A maximum of three bit
shifts can be implemented by a two-stage logarithmic barrel shifter, whereas four shifts would
require a three-stage barrel shifter. Therefore, it is a more efficient strategy to store 2A for input
X = (00000), so that the product 16A can be derived by three arithmetic left shifts.[1]
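The shift budget described above can be checked directly: with the odd multiples stored, plus 2A parked in the extra slot, every even multiple through 16A needs at most three left shifts. The coefficient A is an illustrative assumption.

```python
# Check of the shift strategy above (A illustrative): every even multiple
# up to 14A comes from at most three left shifts of a stored odd multiple,
# and 16A comes from three left shifts of the stored word 2A.

A = 9  # illustrative fixed coefficient

stored = {m: A * m for m in range(1, 16, 2)}   # odd multiples A..15A
stored[0] = 2 * A                               # extra word for X = (00000)

def even_multiple(m):
    """Derive A*m for even m by shifting a stored word."""
    if m == 16:
        return stored[0] << 3                   # 2A << 3 = 16A
    shifts = 0
    while m % 2 == 0:
        m //= 2
        shifts += 1
    assert shifts <= 3                          # the text's shifter bound
    return stored[m] << shifts

for m in (2, 4, 6, 8, 10, 12, 14, 16):
    assert even_multiple(m) == A * m
```

Without the 2A trick, 16A = A << 4 would force a three-stage barrel shifter; with it, the two-stage (three-shift) shifter suffices, which is the design point the paragraph makes.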
The product values and encoded words for the input words X = (00000) and (10000) are
shown separately in Table 3.3. For X = (00000), the desired encoded word 16A is derived by a
3-bit left shift of 2A. For X = (10000), the APC word "0" is derived by resetting the LUT
output, using an active-high RESET signal given by RESET = (x0 + x1 + x2 + x3)' · x4.

It may be seen from Tables 3.1 and 3.2 that the 5-bit input word X can be mapped into a
4-bit LUT address (d3 d2 d1 d0) by a simple set of mapping relations:

di = xi+1'' for i = 0, 1, 2, and d3 = (x0'')'

where X'' = (x3'' x2'' x1'' x0'') is generated by shifting out all the trailing zeros of X' by an
arithmetic right shift, followed by the address mapping, i.e.,

Table 3.2: OMS-based design of the LUT of APC words for L = 5

Table 3.3: Products and encoded words for X = (00000) and (10000)
X'' = YL, if x4 = 1
X'' = YL', if x4 = 0 (3.4)

where YL and YL' are derived by shifting out all the trailing zeros of XL and XL',
respectively.
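The mapping relations above can be exercised end to end in a behavioral model: APC mapping, trailing-zero stripping, the nine-word LUT, the barrel shift, RESET, and the add/subtract sign together must reconstruct A·X for every 5-bit input. This is a functional sketch with an illustrative coefficient, not a gate-level description of the circuit.

```python
# End-to-end behavioral check of the combined APC-OMS scheme for L = 5:
# address, shift count, RESET and sign together reconstruct A*X exactly.

A = 11                                        # illustrative coefficient

# LUT: odd multiples A*(2i+1) at addresses 0..7, word 2A at address 8
lut = [A * (2 * i + 1) for i in range(8)] + [2 * A]

def decode(x):
    """Map 5-bit input X to (lut_address, shifts, reset, sign)."""
    x4, xl = (x >> 4) & 1, x & 0b1111
    reset = int(x4 == 1 and xl == 0)          # X = (10000)
    sign = +1 if x4 else -1                   # add or subtract from 16A
    xp = xl if x4 else (16 - xl) & 0b1111     # APC mapping: X'
    if xp == 0:
        return 8, 3, reset, sign              # 2A << 3 supplies 16A
    shifts = 0
    while xp % 2 == 0:                        # strip trailing zeros: X''
        xp //= 2
        shifts += 1
    return (xp - 1) // 2, shifts, reset, sign # odd X'' -> address (X''-1)/2

def apc_oms_multiply(x):
    addr, shifts, reset, sign = decode(x)
    word = 0 if reset else lut[addr] << shifts
    return 16 * A + sign * word

assert all(apc_oms_multiply(x) == A * x for x in range(32))
```

Only nine words are stored here, versus 32 in the conventional LUT, which is the one-fourth reduction claimed for the combined technique.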
3.2.1 Implementation of the LUT Multiplier Using APC for L = 5
The structure and function of the LUT-based multiplier for L = 5 using the APC
technique is shown in Figure 3.1. It consists of a four-input LUT of 16 words to store the APC
values of product words as given in the sixth column of Table 3.1, except on the last row, where
2A is stored for input X = (00000) instead of storing a “0” for input X = (10000). Besides, it
consists of an address-mapping circuit and an add/subtract circuit.
The address-mapping circuit generates the desired address (x3' x2' x1' x0'). A
straightforward implementation of the address mapping multiplexes XL and
XL' using x4 as the control bit. The address-mapping circuit, however, can be optimized to use
three XOR gates, three AND gates, two OR gates, and a NOT gate, as in Figure 3.1. The
RESET signal can be generated by a control circuit. The output of the LUT is added to or
subtracted from 16A, for x4 = 1 or 0 respectively, by the add/subtract cell; hence, x4 is used as
the control for the add/subtract cell.
3.2.2 Implementation of the Optimized LUT Using Modified OMS
The APC–OMS combined design of the LUT for L = 5 and for any coefficient width W
is given in Figure 3.2. It consists of an LUT of nine words of (W + 4)-bit width, a four-to-nine-line
address decoder, a barrel shifter, an address-generation circuit, and a control circuit for
generating the RESET signal and the control word (s1 s0) for the barrel shifter. The precomputed
values of A × (2i + 1) are stored as Pi, for i = 0, 1, 2, . . . , 7, at the eight consecutive locations of
the memory array, as specified in Table 3.2, while 2A is stored for input X = (00000) at LUT
address "1000," as specified in Table 3.3.
Figure 3.2: Proposed APC–OMS combined LUT design for the multiplication of W-bit fixed coefficient A with 5-bit input X.
The decoder takes the 4-bit address from the address generator and generates nine word-select
signals, i.e., {wi, for 0 ≤ i ≤ 8}, to select the referenced word from the LUT. The 4-to-9-line
decoder is a simple modification of the 3-to-8-line decoder, as shown in Figure 3.3.
Figure 3.3: Four-to-nine-line address-decoder
The control bits s0 and s1 to be used by the barrel shifter to produce the desired number
of shifts of the LUT output are generated by the control circuit, according to the relations
s0 = x0′ · (x1 + x2′) (3.5)
s1 = (x0 + x1)′ (3.6)
where ′ denotes the logical complement: a word ending in x0 = 1 needs no shift; x0 = 0, x1 = 1
needs one; x0 = x1 = 0, x2 = 1 needs two; and x0 = x1 = x2 = 0 needs three. (s1 s0) is the 2-bit
binary equivalent of the required number of shifts specified in Tables 3.2 and 3.3. The control
circuit that generates the control word and RESET is shown in Figure 3.4.
The address-generator circuit receives the 5-bit input operand X and maps it onto the 4-bit
address word (d3 d2 d1 d0).
Figure 3.4: Control circuit
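The addressing, shifting, and RESET handling just described can be modelled functionally. The Python sketch below is a behavioural model under the same assumptions as Tables 3.2 and 3.3, not the gate-level circuit; the handling of the inputs X = (00000) and (10000) follows Table 3.3.

```python
def build_oms_lut(A):
    """Nine words: Pi = A*(2i+1) for i = 0..7, plus 2A at address 8 ("1000")."""
    return [A * (2 * i + 1) for i in range(8)] + [2 * A]

def apc_oms_multiply(A, X):
    """Functional model of the combined APC-OMS LUT multiplier for L = 5."""
    assert 0 <= X < 32
    lut = build_oms_lut(A)
    x4 = (X >> 4) & 1
    u = (X & 0xF) if x4 else (16 - X)    # APC magnitude, 0..16
    if u == 0:                 # X = 10000: RESET forces the LUT output to 0
        apc = 0
    elif u == 16:              # X = 00000: 2A at address "1000", shifted by 3
        apc = lut[8] << 3
    else:
        s = (u & -u).bit_length() - 1    # shifts = trailing zeros of u
        odd = u >> s                     # u = odd * 2^s
        apc = lut[(odd - 1) // 2] << s   # barrel-shift the stored odd multiple
    return 16 * A + apc if x4 else 16 * A - apc
```

Only eight odd multiples plus the 2A word are ever read; even multiples are recovered entirely by the barrel shift, which is the point of the OMS scheme.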
3.2.3 Optimized LUT Design for Signed and Unsigned Operands
The APC–OMS combined optimization of the LUT can also be performed for signed
values of A and X. When both operands are in sign-magnitude form, the multiples of magnitude
of the fixed coefficient are to be stored in the LUT, and sign of the product could be obtained by
the XOR operation on the sign bits of the two multiplicands. When both operands are in two's
complement form, a two's complement operation on the LUT output is required for x4 = 1.
There is no need to add the fixed value 16A in this case, because the product values are
naturally in antisymmetric form. The add/subtract circuit of Figure 3.1 is therefore not
required; instead, a circuit is needed to perform the two's complement operation on the LUT
output. For the multiplication of an unsigned input X with a signed, as well as an unsigned,
coefficient A, the products could be stored in two's complement representation, and the
add/subtract circuit in Figure 3.1 could be modified as in Figure 3.5.
Figure 3.5: Modification of the add/subtract cell for the two's complement representation of product words.
Except the last word, all other words in the LUT are odd multiples of A. The fixed
coefficient could be even or odd, but assuming A to be an odd number, all the stored product
words (except the last one) would be odd. If the stored value P is an odd number, it can be
expressed as
P = PD−1 PD−2 · · · P1 1 (3.7)
and its two's complement is given by
P′ = P′D−1 P′D−2 · · · P′1 1 (3.8)
where P′i is the one's complement of Pi for 1 ≤ i ≤ D − 1, and D = W + L − 1 is the width of the
stored words. Since two's complementing an odd word leaves its LSB unchanged, the sign of
the LUT output can be changed for x4 = 1 simply by inverting the upper D − 1 bits. This gives
the simple sign-modification circuit of Figure 3.6.
Figure 3.6: Optimized implementation of the sign modification of the odd LUT output
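The identity behind this sign-modification circuit — the two's complement of an odd word keeps the LSB and one's-complements only the upper bits — can be checked with a short Python sketch (the function name is illustrative):

```python
def sign_modify(P, D):
    """Two's complement of an odd D-bit word, done as in Figure 3.6:
    keep the LSB and one's-complement only the upper D-1 bits."""
    assert P & 1, "all stored words (except the last) are odd"
    mask = (1 << D) - 1
    upper = (~P) & mask & ~1      # one's complement of bits D-1 .. 1
    return upper | 1              # LSB of an odd word is unchanged

# the shortcut agrees with the ordinary two's complement for every odd word
D = 8
for P in range(1, 1 << D, 2):
    assert sign_modify(P, D) == (-P) & ((1 << D) - 1)
```

The shortcut works because the +1 of the two's complement stops its carry at the LSB when that bit of the one's complement is 0, i.e., exactly when P is odd.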
However, the fixed coefficient A could be even as well. When A is a nonzero even
integer, we can express it as A = A′ × 2^l, where A′ is an odd integer and 1 ≤ l ≤ D − 1.
Instead of storing multiples of A, we can store multiples of A′ in the LUT, and the LUT output
can be left-shifted by l bits by a hardwired shifter. Similarly, an address-generation circuit can
be formed as in Figure 3.7, since every shifted address YL (except the last one) is an odd integer.
Figure 3.7: Address-generation circuit.
3.3 MEMORY-BASED MULTIPLICATION BY INPUT OPERAND
DECOMPOSITION
3.3.1 Wallace Tree
Multipliers play an important role in today's digital signal processing and various other
applications. With advances in technology, many researchers have tried, and are still trying, to
design multipliers that offer one or more of the following design targets: high speed, low power
consumption, and regularity of layout (and hence less area), making them suitable for high-speed,
low-power, and compact VLSI implementations. The common multiplication method is the
"add and shift" algorithm. In parallel multipliers, the number of partial products to be added is
the main parameter that determines the performance of the multiplier. To improve speed, the
Wallace tree algorithm can be used to reduce the number of sequential adding stages.
If the multiplicand is N bits wide and the multiplier is M bits wide, then there are N × M
partial-product bits. The way the partial products are generated and summed up is what
differentiates the various multiplier architectures. Multiplication of binary numbers can be
decomposed into additions. Consider the multiplication of two 8-bit numbers A and B to
generate the 16-bit product P.
A Wallace multiplier is a parallel multiplier that uses the carry-save addition algorithm to
reduce latency. The main aim of the Wallace tree is to achieve higher speed and lower power
consumption.
The steps involved are:
1. Multiply each bit of one of the arguments by each bit of the other, yielding n² partial-product bits.
2. Reduce the number of partial products to two by layers of full and half adders.
3. Group the wires into two numbers, and add them with a conventional adder.
The second phase works as follows. As long as there are three or more wires with the same
weight, add a further layer:
- Take any three wires with the same weight and input them into a full adder. For each three
input wires, the result is one output wire of the same weight and one output wire of the next
higher weight.
- If two wires of the same weight are left, input them into a half adder.
- If just one wire is left, connect it to the next layer.
In the conventional 8-bit Wallace tree multiplier design, a large number of addition operations
is required. Using carry-save adders, three partial-product terms can be added at a time to form
a sum and a carry. The sum signal is used by the full adder of the next level, while the carry
signal is used by the adder involved in generating the next output bit, giving an overall delay
proportional to log3/2(n) for n rows. In the first and second stages of the Wallace structure, the
partial products depend only on the inputs obtained from the AND array. For the higher stages,
however, the final value (PP3) depends on the carry-out of the previous stage, and this
operation is repeated for the consecutive stages. A Wallace tree multiplier is considered faster
than a simple array multiplier and is an efficient implementation of a digital circuit that
multiplies two integers. The Wallace tree multiplier is shown in Figure 3.8.
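The three steps can be sketched as a behavioural Python model: column-wise reduction with full and half adders. This models the dataflow only, and it applies a half adder to every two-wire column, which is a simplification of the classic Wallace heuristic.

```python
from collections import defaultdict

def wallace_multiply(a, b, n=8):
    """Behavioural model of an n-bit Wallace tree multiplier."""
    # Step 1: AND-array partial-product bits, grouped by weight (column).
    cols = defaultdict(list)
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    # Step 2: reduce with layers of full/half adders until every column
    # holds at most two wires.
    while any(len(bits) > 2 for bits in cols.values()):
        nxt = defaultdict(list)
        for w, bits in cols.items():
            while len(bits) >= 3:      # full adder: 3 wires -> sum + carry
                x, y, z = bits.pop(), bits.pop(), bits.pop()
                nxt[w].append(x ^ y ^ z)
                nxt[w + 1].append((x & y) | (y & z) | (x & z))
            if len(bits) == 2:         # half adder: 2 wires -> sum + carry
                x, y = bits.pop(), bits.pop()
                nxt[w].append(x ^ y)
                nxt[w + 1].append(x & y)
            nxt[w].extend(bits)        # a lone wire passes to the next layer
        cols = nxt
    # Step 3: add the two remaining rows with a conventional adder.
    return sum(bit << w for w, bits in cols.items() for bit in bits)
```

Each adder preserves the column-weighted sum of the wires, so the final conventional addition recovers the exact product.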
3.3.2 Adder Tree
Although the memory core of the LUT multiplier is reduced to nearly one-fourth by the
proposed optimization technique, it is not efficient for operands of small widths, since it
requires an adder to add the offset value. However, it could be used for multiplication with
input of large word size by an input decomposition scheme. When the width of the input
multiplicand X is large, direct implementation of LUT multiplier involves a very large LUT.
Therefore, the input word could be decomposed into a certain number of segments or subwords,
and the partial products pertaining to different subwords could be shift added to obtain the
desired product. Let the input operand X be decomposed into T subwords, i.e., {X1, X2, . . . ,
XT}. The product word C = A · X can be written as the sum of partial products as
C = Σ (i = 1 to T) 2^{S(i−1)} · Ci (3.9)
Figure 3.8: Wallace tree multiplier
Ci = A · Xi, for 1 ≤ i ≤ T (3.10)
where each Xi = {x(i−1)S x(i−1)S+1 . . . xiS−1} is an S-bit subword for 1 ≤ i ≤ T − 1,
XT = {x(T−1)S x(T−1)S+1 . . . x(T−1)S+S′−1} is the last subword of S′ bits, where S′ ≤ S, and xi is
the (i + 1)th bit of X.
A generalized structure for parallel implementation of LUT multiplication for input size
L = 5(T − 1) + S′, where S′ ≤ 5, is shown in Figure 3.9. The input multiplicand X is
decomposed into (T − 1) more-significant subwords X1, X2, . . . , XT−1 and a less-significant
S′-bit subword XT. The partial products Ci = A · Xi, for 1 ≤ i ≤ T − 1, are obtained from
(T − 1) LUT multipliers optimized by APC and OMS. The T-th LUT multiplier is a
conventional LUT, not optimized by APC and OMS, since it is required to store the sum of the
offset values, V = 16A × 2^S′ × [(2^{5(T−1)} − 1)/(2^5 − 1)], pertaining to all the (T − 1)
optimized LUTs for partial-product generation. The T-th LUT therefore stores the values
(A · XT + V). The sign of each optimized LUT output is modified according to the value of the
most significant bit of the corresponding subword Xi, for 1 ≤ i ≤ T − 1, and all the LUT
outputs are added together by an adder tree, as shown in Figure 3.9. With this adder tree,
products for large input operands can be computed with little additional delay: the input is
divided into blocks, and each block is fed to its own LUT.
Figure 3.9: Proposed LUT-based multiplier for L = 5
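Equations (3.9) and (3.10) can be checked with a small Python model. Each subword here indexes a plain, non-optimized product table, so the shared offset term V of the full design is not modelled; the sketch verifies only the shift-add identity, and the function name is illustrative.

```python
def lut_decomposed_multiply(A, X, L, S=5):
    """Shift-add the per-subword partial products Ci = A*Xi, per Eq. (3.9)."""
    lut = [A * v for v in range(1 << S)]     # conventional per-subword LUT
    C, i = 0, 0
    while i * S < L:
        width = min(S, L - i * S)            # last subword may be S' < S bits
        Xi = (X >> (i * S)) & ((1 << width) - 1)
        C += lut[Xi] << (S * i)              # weight 2^{S*i} (i counted from 0)
        i += 1
    return C
```

Because the subwords partition the bits of X, the weighted sum of the partial products equals A·X for any operand width L.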
3.4 FPGA BASED CACHE MEMORY
Cache systems are on-chip memory elements used to store data. A cache controller is
used for tracking the induced miss rate in the cache memory. If data requested by the
microprocessor is present in the cache memory, it is called a "cache hit". The advantage of
storing data in a cache, as compared to RAM, is faster retrieval, but it comes at the cost of
on-chip energy consumption. This project deals with the design of an efficient cache memory
for detecting the miss rate with low power consumption.[2] The cache controller communicates
between the microprocessor and the cache memory to carry out memory-related operations.
The functionality of the design is explained below.
The cache controller receives the address that the microprocessor wants to access and
looks for it in the L1 cache. If the address is present in L1, the data from that location is
provided to the microprocessor via the data bus. If the address is not found in L1, a cache miss
occurs, and the cache controller looks for the same address in the L2 cache. If the address is
present in L2, the data from that location is provided to the microprocessor, and the same data
replaces an entry in L1. If the address is not found in L2 either, a further cache miss occurs.[6]
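The lookup sequence above can be sketched as a small Python model. All names here (CacheLevel, lookup) are illustrative assumptions, and the model ignores replacement-policy details.

```python
class CacheLevel:
    def __init__(self):
        self.store = {}              # tag -> data

    def read(self, addr):
        return self.store.get(addr)  # None models a tag-compare miss

def lookup(l1, l2, main_memory, addr):
    """Return (data, 'L1 hit' | 'L2 hit' | 'miss') for one access."""
    data = l1.read(addr)
    if data is not None:
        return data, "L1 hit"
    data = l2.read(addr)
    if data is not None:
        l1.store[addr] = data        # replace into L1 on an L2 hit
        return data, "L2 hit"
    data = main_memory[addr]         # both levels missed
    l1.store[addr] = data
    l2.store[addr] = data
    return data, "miss"
```

A first access to an address misses both levels; a repeated access then hits L1, mirroring the controller behaviour described above.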
The cache memory consists of a cache tag memory, a cache data memory, a cache tag
comparator, and a counter. This project deals with the detection of cache misses. The proposed
design implements the cache tag memory, the cache tag comparator, and a 10-bit counter, and
omits the cache data memory, since addresses are stored in the tag memory while data would
be stored in the data memory, and only misses are to be detected.
Figure 3.10: Functionality design (the microprocessor communicates over the system bus with the cache controller, which manages the Level 1 and Level 2 cache memories)
The address requested by the microprocessor is compared with the main-memory
addresses of the cached data, which are stored in the tag memory. The cache tag memory,
shown in Figure 3.11, stores the main-memory addresses of the data.[6]
The 10-bit counter, shown in Figure 3.12, generates the 10-bit addresses for the write
and read operations of the cache tag memory, and is incremented by one on each clock pulse.
The 22-bit cache tag comparator, shown in Figure 3.13, compares the 22-bit address requested
by the microprocessor with the 22-bit address stored in the tag memory.
Figure 3.11: Tag memory
Figure 3.12: Ten bit counter
The implementation of the cache memory requires the cache tag memory, the 10-bit counter,
and the cache tag comparator, as shown in Figure 3.14.
Figure 3.13: Tag comparator
Figure 3.14: Implementation of cache memory
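The tag-memory arrangement can be modelled functionally as below. The sizes follow the text (2^10 tag locations addressed by the counter, 22-bit tags); the counter-driven sweep and the Python names are illustrative assumptions.

```python
TAG_BITS, ADDR_BITS = 22, 10                    # 22-bit tags, 10-bit counter

class TagMemory:
    def __init__(self):
        self.tags = [None] * (1 << ADDR_BITS)   # 1024 tag entries

def detect(tag_mem, requested):
    """Sweep the tag memory with the 10-bit counter; the 22-bit comparator
    reports a hit on a tag match, otherwise the access is a miss."""
    assert 0 <= requested < (1 << TAG_BITS)
    for counter in range(1 << ADDR_BITS):       # counter increments per clock
        if tag_mem.tags[counter] == requested:  # 22-bit tag comparison
            return "hit"
    return "miss"
```

A hardware direct-mapped cache would index the tag memory rather than sweep it; the sweep here simply mirrors the counter-addressed read described in the text.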
CHAPTER 4
PROJECT IMPLEMENTATION
4.1 FLOW CHART
4.1.1 Antisymmetric Product Coding
Flow: Start → read input X → is start = 0? If yes, the product is zero and is output directly.
If no → compute address → read APC word from LUT → is MSB = 1? If yes,
product = 16A + APC word; if no, product = 16A − APC word → output product → stop.
In the APC flow chart, the first step is to check whether start is zero or one. If it is zero, the
product value, zero, is obtained directly. Otherwise, the next step is the computation of the
address. After computing the address, the APC word is obtained from the LUT and the MSB
bit is checked to select addition or subtraction with 16A.
4.1.2 Combined APC and OMS
Flow: Start → read inputs clock, start, and X → if start = 0, go directly to the output;
otherwise compute (s1 s0) → from (s1 s0), compute the value of y (the number of shifts) →
assign each value to the memory array → output → stop.
In combined APC and OMS, the s0 and s1 values are calculated first. The value obtained is
used to find the number of shifts, which is performed by a shifter. The product value is
obtained from the LUT after address computation; even multiples are obtained after shifting.
4.1.3 Adder Tree
Flow: Start → read input X → divide input X into 5-bit subwords → feed each 5-bit subword
to its own LUT → sum the partial products → output product → stop.
The adder tree is used for easy computation of the product. A large input word is divided into
smaller subwords, and each subword is fed into a different LUT. Finally, the partial products
are added together to obtain the product value.
4.1.4 FPGA Based Cache Memory
Flow: Start → get the requested address from the microprocessor → is the address available
in the tag memory? If yes, it is a hit; if no, it is a miss → stop.
In the FPGA-based cache memory, the tags are used to check the availability of data in the
cache. The comparator compares the requested address with the addresses available in the tag
memory. If the compared values match, a cache hit occurs; otherwise, a cache miss occurs. A
cache hit means the data is available in the memory; if the data is not available, it is a cache
miss. In this scheme, the data memory is not used.
4.2 SOFTWARE IMPLEMENTATION
4.2.1 ModelSim DE 6.5e
Verilog HDL is a hardware description language used to design and document electronic
systems. A hardware description language looks much like a programming language such as C;
it is a textual description consisting of expressions, statements and control structures. ModelSim
is a powerful simulator that can be used to simulate the behavior and performance of logic
circuits. It supports behavioral, register transfer level, and gate-level modeling. The simulator
allows the user to apply inputs to the designed circuit, usually referred to as test vectors, and to
observe the outputs generated in response. It shows how the simulator can be used to perform
functional simulation of a circuit specified in Verilog HDL. The version of ModelSim used in
this project is ModelSim DE 6.5e.
4.2.2 Xilinx ISE
The Xilinx ISE tools allow a design to be entered in several ways, including graphical
schematics, state machines, VHDL, and Verilog. Xilinx ISE (Integrated Synthesis
Environment) is a software tool produced by Xilinx for synthesis and analysis of HDL designs,
enabling the developer to compile designs, perform timing analysis, examine RTL diagrams,
simulate a design's reaction to different stimuli, and configure the target device with the
programmer. Xilinx ISE is the design environment for FPGA products from Xilinx. The
primary user interface of the ISE is the Project Navigator, which includes the design hierarchy
(Sources), a source code editor (Workplace), an output console (Transcript), and a processes
tree (Processes). A comparative study was made by analysing the Xilinx results, which give a
detailed description of the area used by the system.
The design hierarchy consists of design files (modules), whose dependencies are
interpreted by the ISE and displayed as a tree structure. For single-chip designs there may be
one main module, with other modules included by the main module, similar to the
main() subroutine in C++ programs. Design constraints, which include pin configuration and
mapping, are specified in a user constraints file (UCF).
CHAPTER 5
FUTURE SCOPE
The calculations in this project are performed with reduced word length. The memory is
further reduced by storing only the odd multiples in the look-up table; Antisymmetric Product
Coding combined with Odd Multiple Storage thus modifies the look-up table. This approach
provides a reduction in LUT size. The proposed system can be used for efficient
implementation of high-precision multiplication by input-operand decomposition. It is found
that the LUT-based design yields significantly less area and a shorter multiplication time.
This project brings out the possibility of using LUT-based multipliers to implement
constant multiplication for DSP-based applications, the higher orders of which can be
implemented for DSP-based filters. The idea of LUT optimization can be extended to obtain
maximum advantage in area and speed by designing the LUT from NAND or NOR read-only
memories, with the arithmetic shifts implemented by an array of barrel shifters using
metal-oxide-semiconductor transistors. Further work could derive OMS-APC based LUTs for
higher input sizes, with different forms of decomposition and parallel and pipelined addition
schemes for suitable area-delay tradeoffs.
The idea of FPGA implementation of cache memory would be of great utility to many
modern embedded applications, for which both high performance and low power are of great
importance. This cache memory may be used in future work to design an FPGA-based
set-associative cache memory and a cache controller for tracking the induced miss rate.
Further, this cache memory can be extended to higher or multiple levels of cache to track the
miss rate in large applications.[4],[2]
CHAPTER 6
SIMULATION RESULTS
6.1 APC
The number of slice flip-flops utilized is 25, and the number of input LUTs utilized is 92.
6.2 COMBINED APC–OMS
In the combined method, the number of slice flip-flops is 16 and the number of input LUTs is
18, much reduced compared with APC alone.
6.3 ADDER TREE
6.4 FPGA BASED CACHE MEMORY
CHAPTER 7
CONCLUSION
Two different techniques for memory optimization were implemented. The first is LUT
optimization using the combined APC-OMS technique, and the second is the FPGA
implementation of cache memory.
The LUT optimization using the combined APC and OMS techniques reduces on-chip
area compared with using APC alone, i.e., the number of slice flip-flops is reduced from 25 to
16, and thus memory optimization is achieved. This shows the possibility of using LUT-based
multipliers to implement constant multiplication for DSP applications. The full advantages of
this LUT-based design, however, could be derived if the LUTs are implemented as NAND or
NOR read-only memories and the arithmetic shifts are implemented by an array barrel shifter
using metal-oxide-semiconductor transistors.
The FPGA implementation of cache memory was designed for detecting the miss rate in
the cache. The use of a data memory was eliminated, and the memory was thereby optimized;
this also helps in reducing design complexity and power consumption. This approach would be
of great utility to many modern embedded applications, for which both high performance and
low power are of great importance.