LOW POWER VLSI ARCHITECTURE FOR IMAGE COMPRESSION
USING DISCRETE COSINE TRANSFORM
A THESIS
Submitted by
VIJAYAPRAKASH A M
For the award of the degree
of
DOCTOR OF PHILOSOPHY
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
DR. M.G.R EDUCATIONAL AND RESEARCH INSTITUTE
UNIVERSITY (Declared U/S 3 of the UGC Act, 1956)
CHENNAI – 600 095
JULY 2012
DECLARATION
I declare that the thesis entitled “LOW POWER VLSI ARCHITECTURE FOR
IMAGE COMPRESSION USING DISCRETE COSINE TRANSFORM”
submitted by me for the degree of Doctor of Philosophy is the record of work
carried out by me during the period from August 2004 to July 2012 under the
guidance of Dr. K. S. GURUMURTHY and has not formed the basis for the
award of any degree, diploma, associateship, fellowship or other titles in this or
any other university or other similar institution of higher learning.
Signature of the candidate
BONAFIDE CERTIFICATE
Certified that this thesis titled “LOW POWER VLSI ARCHITECTURE FOR
IMAGE COMPRESSION USING DISCRETE COSINE TRANSFORM” is
the bonafide work of Mr. VIJAYAPRAKASH A M, who carried out the
research work under my supervision. Certified further, that to the best of my
knowledge, the work reported herein does not form part of any other thesis or
dissertation on the basis of which a degree or award was conferred on an earlier
occasion on this or any other candidate.
Dr. K. S. GURUMURTHY
Professor, DOS in Electronics and Communication Engineering,
University Visvesvaraya College of Engineering,
Bangalore - 560 001
ABSTRACT
Image data compression refers to a process in which the amount of data
used to represent an image is reduced to meet a bit rate requirement (below or at
most equal to the maximum available bit rate), while the quality of the
reconstructed image satisfies a requirement for a certain application and the
complexity of the computation involved is affordable for the application.
Data compression methods play an important role in data storage and
transmission. Compression is the process of converting an input data stream into
another data stream that has a smaller size. In image compression, irrelevant and
redundant image data are reduced in order to store or transmit the data in an
efficient form. Image coding algorithms and techniques are developed to optimize
the bit rate and the quality of the image. Image compression has applications in
many fields such as digital video, video conferencing, and video over wireless
networks and the Internet. In image compression, redundant information is
removed, which is possible due to the high correlation of the image data. For the
increasing number of portable wireless devices, the key design constraint is power
dissipation. Limited battery life constrains portable devices to low power
dissipation; advances in battery life do not grow as fast as the density and the
operating frequency of ASICs. The ever growing circuit densities and operating
frequencies of ASICs only result in higher power dissipation. Since early studies
focused only on high throughput DCTs with variable length coders, low-power
DCTs and low-power variable length coders have not received much attention.
The target of multimedia systems is moving towards portable applications like
laptops, mobiles and iPods. These systems demand low power operation, and thus
require low power functional units.
The proposed work is a realization of a low power two dimensional
Discrete Cosine Transform for image compression. The architecture uses
row-column decomposition, so the number of calculations for processing an 8x8
block of pixels is reduced. The 1-D DCT operation is expressed as a sum of
vector-scalar products, and basic common computations are identified and shared
to reduce computational complexity. Compared to a distributed arithmetic based
architecture, the proposed DCT consumes less power. The DCT and IDCT cores
are implemented as ASICs for low power consumption.
The proposed work is a design and implementation of a low power
architecture for two dimensional DCT and variable length coding for image
compression. The 2-D DCT calculation is performed by using the separability
property of the 2-D DCT, such that the whole architecture is divided into two
1-D DCT calculations connected by a transpose RAM. Vector processing using
parallel multipliers is the method used for the implementation of the DCT. The
advantages of the vector processing method are regular structures, simple control
and interconnect, and a good balance between performance and implementation
complexity. The DCT and IDCT cores are implemented in an ASIC that consumes
less power.
Variable length coding maps input source data onto code words of variable
length; it is an efficient method to minimize the average code length.
Compression is achieved by assigning short code words to input symbols of high
probability and long code words to those of low probability. Variable length
coding can be successfully used to relax the bit-rate requirements and storage
spaces for many multimedia applications. For example, a variable length coder
(VLC) employed in MPEG-2, along with the Discrete Cosine Transform (DCT),
results in very good compression efficiency.
In this work the researcher pursues an ASIC design for the image
compression system. Firstly, the system compresses the image using DCT and
quantization. Next, variable length coding is applied to the compressed image
so that further compression is achieved; finally, the IDCT and variable length
decoding are used to retrieve the image. Compression algorithms require
different amounts of processing power to encode and decode; some high
compression algorithms require high processing power. So in this work, the
present researcher concentrates on a low power VLSI design for the image
compression system while at the same time obtaining a good compression ratio.
The development of low power compression algorithms and architecture is not
only challenging but also intellectually stimulating.
In this research work, algorithms and architecture have been developed for
Discrete Cosine Transform, Quantization, Variable Length Encoding and
Decoding for image compression with an emphasis on low power consumption.
These algorithms have been subsequently verified and the corresponding hardware
architectures are proposed so that they are suitable for ASIC implementation.
The DCT and IDCT architecture was first coded in Matlab in order to prove
the concepts and design methodology proposed for the work. After it was
successfully coded and tested, the VLSI design of the architecture was coded in
Verilog, a popular hardware description language used in industry, conforming to
RTL coding guidelines. The proposed hardware architecture for image
compression was synthesized using RTL Compiler and mapped using 65nm
node standard cells. Simulation was done using the Modelsim simulator. Detailed
analysis of power and area was done using Design Compiler (DC), a Synopsys
EDA tool. The power consumption of the DCT and IDCT is limited to 0.4350 mW
and 0.5519 mW with cell areas of 34983.35 µm² and 34903.79 µm² respectively.
The variable length encoder is mapped using 90nm node standard cells; its power
consumption is limited to 1.5790 µW with a minimum cell area of 5409.922 µm².
The physical design of the proposed hardware in this research was done using IC
Compiler.
ACKNOWLEDGEMENT
The joy and satisfaction that accompany the successful completion of
any task would be incomplete without the mention of those who made it possible.
I am grateful to now have the opportunity to thank all those people who have
helped me in different capacities to complete this thesis work successfully.
I would like to thank Dr. Thirunavakarasu, Dean Research, Dr. M.G.R
Educational and Research Institute, University, for his inspiration and support
during the period of this thesis work.
I express my deep sense of gratitude towards my guide Dr. K.S. Gurumurthy,
Professor and Chairman, Department of Electronics and Communication
Engineering, UVCE, Bangalore University, Bangalore, for giving me his invaluable
guidance, motivation, confidence and support for the speedy completion of this
thesis work.
I sincerely thank Dr. S. Ravi, Professor and HOD, Department of ECE,
Dr. M.G.R Educational and Research Institute, University, who has given constant
support and motivation in the completion of this thesis.
I thank wholeheartedly my wife Mrs. Geetha S and my daughter
Jahnavi V for their support and encouragement.
Also, my sincere thanks to my industry professional friends and all my
colleagues for their constant encouragement and moral support. They have been in
some way or the other responsible for the successful completion of this thesis.
VijayaPrakash A M
TABLE OF CONTENTS

CHAPTER NO    TITLE

              ABSTRACT
              LIST OF TABLES
              LIST OF FIGURES
              LIST OF ABBREVIATIONS

1   INTRODUCTION
    1.1  IMAGE DATA COMPRESSION
    1.2  NEED FOR IMAGE COMPRESSION
    1.3  PRINCIPLES BEHIND COMPRESSION
    1.4  DIFFERENT TYPES OF REDUNDANCIES IN IMAGE
         1.4.1  Coding Redundancy
         1.4.2  Interpixel Redundancy
         1.4.3  Psychovisual Redundancy
    1.5  TYPES OF JPEG COMPRESSION
         1.5.1  Sequential DCT Based
         1.5.2  Progressive DCT Based
         1.5.3  Lossless Mode
         1.5.4  Hierarchical Mode
    1.6  LOSSLESS VERSUS LOSSY COMPRESSION
         1.6.1  Predictive Versus Transform Coding
    1.7  DCT PROCESS
    1.8  JPEG IMAGE COMPRESSION
         1.8.1  Input Transformer
         1.8.2  Quantization
         1.8.3  Entropy Coding
         1.8.4  Run-Length Encoding
         1.8.5  Huffman Encoding
    1.9  APPLICATIONS OF DCT
    1.10 DCT ALGORITHMS
         1.10.1 One-Dimensional DCT
         1.10.2 Two-Dimensional DCT
    1.11 DCT ARCHITECTURES
         1.11.1 Two-Dimensional Approaches
         1.11.2 Row-Column Decomposition
         1.11.3 Direct Method
         1.11.4 Distributed Arithmetic Algorithms
    1.12 PROPERTIES OF DCT
    1.13 ORGANIZATION OF THE THESIS
    SUMMARY

2   REVIEW OF LITERATURE
    2.1  INTRODUCTION

3   VLSI DESIGN FLOW AND LOW POWER VLSI DESIGN
    3.1  INTRODUCTION
    3.2  ASIC DESIGN FLOW
    3.3  DESIGN DESCRIPTION
    3.4  DESIGN OPTIMIZATION
    3.5  BEHAVIORAL SIMULATION
         3.5.1  Specification for the Design
         3.5.2  Behavioral or RTL Design
    3.6  VERIFICATION OF THE DESIGN
    3.7  SYNTHESIS OF THE DESIGN
    3.8  VLSI PHYSICAL DESIGN
         3.8.1  Floor Planning
         3.8.2  Placement
         3.8.3  Routing
         3.8.4  Physical Verification
    3.9  POST LAYOUT SIMULATION
    3.10 LOW POWER VLSI DESIGN
    3.11 SOURCES OF POWER DISSIPATION
    3.12 STATIC POWER CONSUMPTION
    3.13 DYNAMIC POWER DISSIPATION
    3.14 LOAD CAPACITANCE TRANSIENT DISSIPATION
         3.14.1 Internal Capacitance Transient Dissipation
         3.14.2 Current Spiking During Switching
    3.15 POWER REDUCTION TECHNIQUES IN VLSI DESIGN
         3.15.1 Clock Gating
         3.15.2 Asynchronous Logic
         3.15.3 Multi Vdd Techniques
         3.15.4 Architectural Level
    SUMMARY

4   LOW POWER VLSI ARCHITECTURE FOR DISCRETE COSINE TRANSFORM
    4.1  INTRODUCTION
    4.2  DCT MODULE
         4.2.1  Mathematical Description of the DCT
    4.3  BLOCK DIAGRAM OF DCT CORE
    4.4  TWO DIMENSIONAL DCT CORE
         4.4.1  Behavioral Model for Vector Processing
         4.4.2  Transpose Buffer
         4.4.3  Two Dimensional DCT Architecture
    4.5  INVERSE DISCRETE COSINE TRANSFORM
         4.5.1  Storage / RAM Section
         4.5.2  IDCT Core Block Diagram
    SUMMARY

5   LOW POWER ARCHITECTURE FOR VARIABLE LENGTH ENCODING AND DECODING
    5.1  INTRODUCTION
    5.2  VARIABLE LENGTH ENCODING
         5.2.1  Zigzag Scanning
         5.2.2  Run Length Encoder
         5.2.3  Huffman Encoding
         5.2.4  Interconnection of VLE Blocks
    5.3  VARIABLE LENGTH DECODER
         5.3.1  Huffman Decoder
         5.3.2  Block Diagram of FIFO
         5.3.3  Run Length Decoder
         5.3.4  Zigzag Inverse Scanner
    SUMMARY

6   SIMULATION AND SYNTHESIS RESULTS OF DCT AND IDCT MODULES
    6.1  INTRODUCTION
    6.2  MATLAB IMPLEMENTATION OF DCT AND IDCT MODULES
         6.2.1  DCT Methodology
         6.2.2  Quantization
    6.3  MATLAB RESULTS
    6.4  VLSI DESIGN OF THE PROPOSED ARCHITECTURE
    6.5  SIMULATION RESULTS USING VERILOG
    6.6  COMPARISON OF MATLAB AND HDL SIMULATION RESULTS
         6.6.1  2-D DCT Simulation Results Using Matlab and Verilog
         6.6.2  IDCT Simulation Results
    6.7  SYNTHESIS RESULTS OF DCT AND IDCT
    6.8  IMAGE COMPRESSION USING DISCRETE COSINE TRANSFORM AND QUANTIZATION
    6.9  RECONSTRUCTION OF IMAGE USING IDCT
    6.10 SIMULATION RESULT OF IMAGE AFTER COMPRESSION
    6.11 FPGA IMPLEMENTATION OF THE DCTQ
         6.11.1 Device Utilization Summary
         6.11.2 HDL Synthesis Results
    SUMMARY

7   SIMULATION AND SYNTHESIS RESULTS OF VARIABLE LENGTH ENCODING AND DECODING MODULE
    7.1  INTRODUCTION
    7.2  ZIGZAG SCANNING
    7.3  RUN LENGTH ENCODING
    7.4  HUFFMAN ENCODING
    7.5  HUFFMAN DECODING
    7.6  RUN LENGTH DECODING
    7.7  ZIGZAG INVERSE SCANNING
    7.8  POWER AND AREA REPORTS
    7.9  PERFORMANCE COMPARISON
         7.9.1  Power Comparison of Huffman Decoder
         7.9.2  Power Comparison of Run Length and Huffman Encoder
         7.9.3  Percentage of Power Saving
    SUMMARY

8   CONCLUSIONS AND SCOPE FOR FUTURE WORK
    8.1  CONCLUSION
    8.2  ORIGINAL CONTRIBUTIONS
    8.3  SCOPE FOR FUTURE WORK

APPENDIX
REFERENCES
LIST OF PUBLICATIONS
VITAE
LIST OF TABLES

TABLE NO    TITLE

5.1  Comparison between conventional and proposed RLE
6.1  Signal Description of DCT core
6.2  Signal Description of IDCT core
6.3  Power and Area Characteristics of DCT and IDCT using 65nm standard cells
7.1  Power and Area Parameters of VLC and VLD blocks
7.2  Power Comparison for Huffman decoders
7.3  Power Comparison for Run Length and Huffman Encoders
7.4  Percentage of Power Savings
LIST OF FIGURES

FIGURE NO    TITLE

1.1  Sequential Coding and Progressive coding
1.2  Hierarchical multi resolution coding
1.3  DCT Process for compression
1.4  DCT Coefficients
1.5  Image compression model
1.6  Image Decompression model
1.7  Row-Column Decomposition
1.8  2-D DCT model
3.1  Major activities in ASIC design
3.2  ASIC design and development flow
3.3  RTL Verification flow with Linting and Code coverage
3.4  Synthesis flow with Low power and UPF
3.5  VLSI Physical Design Hierarchy
3.6  VLSI Physical Design Flow
3.7  CMOS Circuit in subthreshold
3.8  CMOS Inverter model for Static power consumption
3.9  Simple CMOS Inverter Driving a Capacitive External Load
3.10 Parasitic Internal Capacitors Associated with Two Inverters
3.11 Equivalent schematic of a CMOS inverter whose input is between logic levels
3.12 Different Clock gating schemes
4.1  Block diagram of 2-D DCT Architecture
4.2  Top level schematic for DCT core
4.3  One Dimensional DCT Architecture
4.4  2-D DCT Architecture
4.5  1-D IDCT Architecture
4.6  2-D IDCT Block diagram
4.7  Top level schematic of IDCT Architecture
5.1  Block diagram of variable length encoder
5.2  Block diagram of Zigzag Scanner
5.3  Zigzag Scanning order
5.4  Zigzag Scanning example
5.5  Internal Architecture of the Zigzag Scanner
5.6  Block diagram of Run-Length Encoder
5.7  Internal architecture of run-length encoder
5.8  Block diagram of Huffman Encoding
5.9  Internal architecture of Huffman encoder
5.10 Interconnection of zigzag scanning, run-length and Huffman encoding blocks
5.11 Block diagram of variable length decoder
5.12 Block diagram of Huffman decoder
5.13 Block diagram of FIFO
5.14 Block diagram of Run-Length Decoder
5.15 Block diagram of Zigzag Inverse Scanner
6.1  Matlab Design flow
6.2  Matlab Simulation Results for 8x8 image
6.3  Matlab Simulation Results for full image
6.4  Input image in color and Gray Scale
6.5  Reconstructed Image after proposed DCT and IDCT
6.6  2-D DCT Matlab results
6.7  DCT HDL simulation results
6.8  2-D IDCT HDL simulation results
6.9  2-D IDCT Matlab results
6.10 Layout of 2-D DCT
6.11 Zoomed Version of 2-D DCT Layout
6.12 Layout of 2-D IDCT
6.13 Zoomed Version of 2-D IDCT Layout
6.14 Original image for compression
6.15 Original image with the pixel values
6.16 DCT coefficients before quantization
6.17 Image after compression
6.18 Reconstructed Image using IDCT
6.19 Comparison of original and reconstructed image
6.20 Simulation results after compression
7.1  Waveform obtained after simulating the Zigzag Scanning block
7.2  Waveform of Run Length Encoding block
7.3  Waveform of Huffman Encoder block
7.4  Layout of Variable Length Coding
7.5  Zoomed Version Layout of VLC
7.6  Waveform of Huffman Decoder block
7.7  Waveform of run-length decoder block
7.8  Waveform of zigzag inverse scanner block
7.9  Layout of Variable Length Decoding
7.10 Bar Chart of Power and Area parameters
7.11 Bar chart showing the power comparison of Huffman decoders
7.12 Bar chart showing the power comparison of RL-Huffman Encoder combination
LIST OF ABBREVIATIONS

DCT Discrete Cosine Transform
VLSI Very Large Scale Integration
JPEG Joint Photographic Expert Group
IDCT Inverse Discrete Cosine Transform
ISO International Organization for Standardization
IEC International Electrotechnical Commission
MPEG Moving Picture Expert Group
HDTV High Definition Television
HVS Human Visual System
KLH Karhunen-Loeve-Hotelling
CR Compression Ratio
DPCM Differential Pulse Code Modulation
DWT Discrete Wavelet Transform
RLE Run Length Encoding
DA Distributed Arithmetic
LUT Look-up table
HSM Highly Scalable Multiplier
PSNR Peak Signal to Noise Ratio
ASIC Application Specific Integrated Circuit
HDL Hardware Description Language
FPGA Field Programmable Gate Array
RTL Register Transfer Level
STA Static Timing Analysis
DRC Design Rule Check
ERC Electrical Rule Checking
SVC Static Voltage Scaling
MVS Multi-level Voltage Scaling
DVFS Dynamic Voltage and Frequency Scaling
AVS Adaptive Voltage Scaling
MAC Multiply and Accumulate
VLE Variable Length Encoding
FIFO First In First Out
RLD Run-Length Decoder
IUS Incisive Unified Simulator
CHT Condensed Huffman Table
SGHT Single-Side Growing Huffman Table
RVLC Reversible Variable-Length Coding
PE Processing Elements
CHAPTER 1
INTRODUCTION
The purpose of this chapter is to provide an overview of compression, in
particular image compression using the discrete cosine transform (DCT). It also
introduces the different types of architecture for computing the 2-dimensional
DCT and the methodology followed for image compression using the DCT.
1.1 IMAGE DATA COMPRESSION

Image data compression has been an active research area for image
processing over the last decade and has been used in a variety of applications.
Compression is the process of reducing the size of data by encoding its
information more efficiently. By doing this, the result is a reduction in the number
of bits and bytes used to store the information; in effect, it reduces the bandwidth
required for transmission and the storage requirements [97]. This research
investigates the implementation of an image data compression method with low
power VLSI hardware that could be used in practical coding systems to compress
image signals.

In practical situations, an image is originally defined over a large matrix of
picture elements (pixels), with each pixel represented by an 8- or 16-bit gray scale
value. This representation could be so large that it is difficult to store or transmit.
The purpose of image compression is to reduce the size of the representation and,
at the same time, to keep most of the information contained in the original image.
DCT based coding/decoding systems play a dominant role in real-time
applications. However, the DCT is computationally intensive. In addition, the
2D-DCT has been recommended by standards organizations such as the Joint
Photographic Experts Group (JPEG) [95]. The standards developed by these
groups aid industry manufacturers in developing real-time 2D-DCT chips for use
in various image transmission and storage systems [46].
DCT based coding and decoding systems play a dominant role in real-time
applications in science and engineering, such as audio and images. VLSI DCT
processor chips have become indispensable in real time coding systems because of
their fast processing speed and high reliability. JPEG has defined an international
standard for coding and compression of continuous-tone still images. This
standard is commonly referred to as the JPEG standard [55]. The primary aim of
the JPEG standard is to propose an image compression algorithm that would be
generic, application independent and aid VLSI implementation of data
compression. As the DCT core becomes a critical part in an image compression
system, close studies of its performance and implementation are worthwhile and
important. Application specific requirements are the basic concern in its design. In
the last decade the advancement in data communication techniques has been
significant; during the explosive growth of the Internet the demand for multimedia
has increased. Video and audio data streams require a huge bandwidth to be
transferred in an uncompressed form. Several ways of compressing multimedia
streams have evolved, some of which use the Discrete Cosine Transform (DCT)
for transform coding and its inverse (IDCT) for transform decoding.
Image compression is a useful topic in the digital world. A digital image
bitmap can contain considerably large amounts of data, causing exceptional
overhead in both computational complexity and data processing. Storage media
have exceptional capacity; however, access speeds are typically inversely
proportional to capacity. Compression is a must to manage large amounts of data
for networks, the Internet, or storage media. Compression techniques have been
studied for years and will continue to improve. Typically, image and video
compressors and decompressors (CODECs) are implemented mainly in software,
as signal processors can manage these operations without incurring too much
overhead in computation. However, the complexity of these operations can be
efficiently implemented in hardware. Hardware specific CODECs can be
integrated into digital systems fairly easily. Improvements in speed occur primarily
because the hardware is tailored to the compression algorithm rather than to
handle a broad range of operations like a digital signal processor. Data
compression itself is the process of reducing the amount of information into a
smaller data set that can be used to represent, and reproduce, the information.
Types of image compression include lossless compression and lossy compression
techniques that are used to meet the needs of specific applications.
JPEG compression can be used as a lossless or a lossy process depending
on the requirements of the application. Both lossless and lossy compression
techniques employ reduction of redundant data. Work in standardization has been
controlled by the International Organization for Standardization (ISO) in
cooperation with the International Electrotechnical Commission (IEC). The Joint
Photographic Experts Group produced JPEG, a widely used image format. JPEG
provides a solid baseline compression algorithm that can be modified in numerous
ways to suit any desired application. The JPEG specification was released initially
in 1991, although it does not specify a particular implementation.
In signal processing applications, the DCT is the most widely used
transform after the discrete Fourier transform. The DCT and IDCT are important
components in many static picture compression and decompression standards [95,
55, 54], including MPEG, HDTV, and JPEG. The applications for these standards
range from still pictures on the Internet to low quality videophones to high
definition television. The DCT transforms data from the spatial domain to the
spatial frequency domain. The DCT attempts to de-correlate the image data, which
is typically highly correlated for small areas of an image. Heavily correlated data
samples provide much redundant information, whereas just a few pieces of
uncorrelated information can represent the same data much more efficiently.
To compress data, it is important to recognize redundancies in data, in the
form of coding redundancy, inter-pixel redundancy, and psycho-visual
redundancy. Data redundancies occur when unnecessary data is used to represent
source information. Compression is achieved when one or more of these types of
redundancies are reduced. Intuitively, removing unnecessary data will decrease the
size of the data without losing any important information. However, this is not the
case for psycho-visual redundancy. The most obvious way to increase compression
is to reduce the coding redundancy. This refers to the entropy of an image, in the
sense that more data is used than necessary to convey the information. Lossless
redundancy removal compression techniques are classified as entropy coding.
Further compression can be obtained through inter-pixel redundancy removal.
Each pixel is highly related to its neighbors, and thus can be differentially encoded
rather than sending the entire value of the pixel. Similarly, adjacent blocks have
the same property, although not to the extent of pixels.
In order to produce lossless compression, it is recommended that only
coding redundancy is reduced or eliminated. This means that the source image will
be exactly the same as the decompressed image. However, inter-pixel
redundancies can also be removed, as the exact pixel value can be reconstructed
from differential coding or through run length coding. Psycho-visual redundancy
refers to the fact that the human visual system will interpret an image in a way
such that removal of this redundancy will create an image that is nearly
indistinguishable from the original by human viewers. The main way to reduce this
redundancy is through quantization. Quantizing data will reduce it to levels defined
by the quantization value. Psycho-visual properties are taken advantage of through
studies performed on the human visual system. The Human Visual System (HVS)
describes the way that the human eye processes an image and relays it to the brain.
By taking advantage of some properties of the HVS, a lot of compression can be
achieved. In general, the human eye is more sensitive to low frequency
components and to the overall brightness, or luminance, of the image.

Images contain both low frequency and high frequency components. Low
frequencies correspond to slowly varying color, whereas high frequencies
represent fine detail within the image. Intuitively, low frequencies are more
important to create a good representation of an image. Higher frequencies can
largely be ignored to a certain degree. The human eye is more sensitive to the
luminance (brightness) than the chrominance (color difference) of an image. Thus,
during compression, chrominance values are less important and quantization can
be used to reduce the amount of psycho-visual redundancy [92, 101, 18].
Luminance data can also be quantized, but less coarsely, to ensure that important
data is not lost.
Several compression algorithms use transforms to change the image from
pixel values representing color to frequencies dealing with lightness and darkness
of an image. Many forms of the JPEG compression algorithm make use of the
discrete cosine transform. Other transforms, such as wavelets, are employed by
other compression algorithms. These models take advantage of subjective
redundancy by exploiting the human visual system's sensitivity to image
characteristics. Another form of compression technique, aside from exploiting
redundancies in data, is known as transform coding or block quantization.
Transform coding employs techniques such as differential pulse code modulation
as well as other predictive compression measures. Transform coding works by
moving data from the spatial domain to a transform space such that the data is
reduced to a smaller number of samples. Transform coders effectively create an
output that has a majority of the energy compacted into a smaller number of
transform coefficients.
The JPEG image compression algorithm makes use of a discrete cosine
transform to move pixel data representing color intensities to a frequency domain.
Most multimedia systems combine both transform coding and entropy coding into
a hybrid coding technique. The most efficient transform coding technique employs
the Karhunen-Loeve-Hotelling (KLH) transform. The KLH transform has the best
results of any studied transform regarding the best energy compaction. However,
the KLH transform has no fast algorithm or effective hardware implementation.
Thus, JPEG compression replaces the KLH transform with the discrete cosine
transform, which is closely related to the discrete Fourier transform. Transform
coders typically make use of quantizers to scale the transform coefficients to
achieve greater compression.
As a majority of the energy from the source image is compacted into a few
coefficients, vector quantizers can be used to coarsely quantize the less important
components and finely quantize the more important coefficients [20]. A major
benefit of transform coding is that distortion or noise produced by quantization and
rounding gets evenly distributed over the resulting image through the inverse
transform [89].
1.2 NEED FOR IMAGE COMPRESSION

The need for image compression becomes apparent when the number of
bits per image is computed from typical sampling rates and quantization methods.
For example, the amount of storage required for given images is: (i) a low
resolution, TV quality, color video image, which has 512×512 pixels/color,
8 bits/pixel, and 3 colors, consists of approximately 6×10⁶ bits; (ii) a 24×36 mm
negative photograph scanned at 12×10⁻⁶ m: 3000×2000 pixels/color, 8 bits/pixel,
and 3 colors, contains nearly 144×10⁶ bits; (iii) a 14×17 inch radiograph scanned
at 70×10⁻⁶ m: 5000×6000 pixels, 12 bits/pixel, contains nearly 360×10⁶ bits. Thus
storage of even a few images could cause a problem. As another example of the
need for image compression, consider the transmission of a low resolution
512×512×8 bits/pixel×3-color video image over telephone lines. Using a 9600
baud (bits/sec) modem, the transmission would take approximately 11 minutes for
just a single image, which is unacceptable for most applications.
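These figures can be reproduced with a short MATLAB calculation (the sizes and the 9600 bits/sec line rate follow the examples above):

    tv_bits    = 512*512*8*3;      % TV-quality colour image: about 6.3e6 bits
    photo_bits = 3000*2000*8*3;    % scanned negative: 144e6 bits
    xray_bits  = 5000*6000*12;     % radiograph: 360e6 bits
    minutes    = tv_bits/9600/60;  % about 10.9 minutes over a 9600 baud modem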
1.3 PRINCIPLES BEHIND COMPRESSION

The number of bits required to represent the information in an image can be
minimized by removing the redundancy present in it. There are three types of
redundancies: (i) spatial redundancy, which is due to the correlation or
dependence between neighboring pixel values; (ii) spectral redundancy, which
is due to the correlation between different color planes or spectral bands; (iii)
temporal redundancy, which is present because of correlation between
different frames in a sequence of images. Image compression research aims to
reduce the number of bits required to represent an image by removing the spatial
and spectral redundancies as much as possible.

Data redundancy is a central issue in digital image compression. If n1 and
n2 denote the number of information carrying units in the original and compressed
image respectively, then the compression ratio CR can be defined as

CR = n1/n2

and the relative data redundancy RD of the original image can be defined as

RD = 1 - (1/CR)

Three possibilities arise here:

(1) If n1 = n2, then CR = 1 and hence RD = 0, which implies that the original
image does not contain any redundancy between the pixels.

(2) If n1 >> n2, then CR → ∞ and hence RD → 1, which implies a considerable
amount of redundancy in the original image.

(3) If n1 << n2, then CR → 0 and hence RD → -∞, which indicates that the
compressed image contains more data than the original image.
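A small MATLAB illustration of these definitions, with hypothetical bit counts:

    n1 = 6291456;    % original image: 512x512 pixels, 8 bits/pixel, 3 colours
    n2 = n1/8;       % hypothetical compressed size (8:1 compression)
    CR = n1/n2;      % compression ratio = 8
    RD = 1 - 1/CR;   % relative data redundancy = 0.875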
1.4 DIFFERENT TYPES OF REDUNDANCIES IN IMAGE

Image compression and coding techniques explore three types of
redundancies: coding redundancy, interpixel (spatial) redundancy and
psychovisual redundancy. Compression is achieved when one or more of these
types of redundancies are reduced.
1.4.1 Coding Redundancy
This refers to the entropy of the image, in the sense that more data is used
than necessary to convey the information. This can be overcome by variable length
coding. Examples of image coding schemes that explore coding redundancy are
Huffman codes and the arithmetic coding technique.
1.4.2 Interpixel Redundancy
This is also known as spatial redundancy, interframe redundancy or
geometric redundancy. It exploits the fact that an image often contains strongly
correlated pixels, in other words, large regions where the pixel values are the same
or almost the same. Examples of compression techniques that explore interpixel
redundancy include Constant Area Coding, Run Length Encoding and many
predictive coding algorithms such as Differential Pulse Code Modulation.
1.4.3 Psychovisual Redundancy
This refers to the fact that the human visual system will interpret an image
in a way such that removal of this redundancy will create an image that is nearly
indistinguishable from the original by human viewers. The main way to reduce this
type of redundancy is through quantization. Psychovisual properties are taken
advantage of through studies performed on the human visual system. Most of the
image coding algorithms in use today exploit this type of redundancy, such as the
Discrete Cosine Transform (DCT) based algorithm at the heart of the JPEG
encoding standard.
1.5 TYPES OF JPEG COMPRESSION
Still image coding is an important application of data compression. When
an analog image or picture is digitized, each pixel is represented by a fixed number
of bits, which correspond to a certain number of gray levels. In this uncompressed
format, the digitized image requires a large number of bits to be stored or
transmitted. As a result, compression becomes necessary due to the limited
communication bandwidth or storage size.
The JPEG standard allows for both lossy and lossless encoding of still
images. The algorithm for lossy coding is a discrete cosine transform (DCT)
based coding scheme. This is the baseline of JPEG and is sufficient for many
applications. However, to meet the needs of applications that cannot tolerate loss,
e.g., compression of medical images, a lossless coding scheme is also provided and
is based on a predictive coding scheme. From the algorithmic point of view, JPEG
includes four distinct modes of operation: sequential DCT-based mode,
progressive DCT-based mode, lossless mode, and hierarchical mode.
1.5.1 Sequential DCT based
The sequential DCT based mode of operation comprises the baseline JPEG
algorithm. This technique can produce very good compression ratios, while
sacrificing image quality. The sequential DCT based mode achieves much of its
compression through quantization, which removes entropy from the data set.
Although this baseline algorithm is transform based, it does use some measure of
predictive coding called differential pulse code modulation (DPCM) [8]. After
each input 8x8 block of pixels is transformed to frequency space using the DCT,
the resulting block contains a single DC component and 63 AC components. The
DC component is predictively encoded through the difference between the current
DC value and the previous one. This mode only uses Huffman coding models, not
the arithmetic coding models which are used in JPEG extensions. This mode is the
most basic, but still has a wide acceptance for its high compression ratios, which
can fit many general applications very well.
1.5.2 Progressive DCT based
In the progressive mode, by contrast, the quantized DCT coefficients are first
stored in a buffer before the encoding is performed. The DCT coefficients in the
buffer are then encoded by a multiple scanning process. In each scan, the quantized
DCT coefficients are partially encoded by either spectral selection or successive
approximation. In the method of spectral selection, the quantized DCT coefficients
are divided into multiple spectral bands according to a zigzag order. In each scan, a
specified band is encoded. In the method of successive approximation, a specified
number of most significant bits of the quantized coefficients are first encoded,
followed by the least significant bits in later scans. The difference between
sequential coding and progressive coding is shown in Figure 1.1. In sequential
coding, an image is encoded part-by-part according to the scanning order, while in
progressive coding the image is encoded by a multi-scanning process and in each
scan the full image is encoded to a certain quality level.
Figure 1.1 (a) Sequential coding (b) Progressive coding
1.5.3 Lossless Mode
Lossless coding is achieved by a predictive coding scheme. In this scheme,
three neighboring pixels are used to predict the current pixel to be coded. The
prediction difference is entropy coded using either Huffman or arithmetic coding.
Because the prediction is not quantized, the coding is lossless, which also rules out
the use of quantization. This method does not achieve high compression ratios, but
some applications do require extremely precise image reproduction, as in medical
scan images.
1.5.4 Hierarchical Mode
In the hierarchical mode, an image is first spatially down-sampled to a
multilayered pyramid, resulting in a sequence of frames as shown in Figure 1.2.
Figure 1.2 Hierarchical multi resolution coding
This sequence of frames is encoded by a predictive coding scheme. Except
for the first frame, the predictive coding process is applied to the differential
frames, i.e., the differences between the frame to be coded and the predictive
reference frame. It is important to note that the reference frame is equivalent to the
earlier frame that would be reconstructed in the decoder. The coding method for
the difference frame may either use the DCT-based coding method, the lossless
coding method, or the DCT-based processes with a final lossless process.
Down-sampling and up-sampling filters are used in the hierarchical mode. The
hierarchical coding mode provides a progressive presentation similar to the
progressive DCT-based mode, but is also useful in applications that have
multi-resolution requirements. The hierarchical coding mode also provides the
capability of progressive coding to a final lossless stage.
1.6 LOSSLESS VERSUS LOSSY COMPRESSION
In lossless compression schemes, the reconstructed image, after
compression, is numerically identical to the original image. However, lossless
compression can only achieve a modest amount of compression. Lossless
compression is preferred for archival purposes and often for medical imaging,
technical drawings, clip art or comics. This is because lossy compression methods,
especially when used at low bit rates, introduce compression artifacts. An image
reconstructed following lossy compression contains degradation relative to the
original. Often this is because the compression scheme completely discards
redundant information. However, lossy schemes are capable of achieving much
higher compression. Lossy methods are especially suitable for natural images
such as photos in applications where minor (sometimes imperceptible) loss of
fidelity is acceptable to achieve a substantial reduction in bit rate. Lossy
compression that produces imperceptible differences can be called visually
lossless.
1.6.1 Predictive Versus Transform Coding
In predictive coding, information already sent or available is used to predict
future values, and the difference is coded. Since this is done in the image or spatial
domain, it is relatively simple to implement and is readily adapted to local
image characteristics. Differential Pulse Code Modulation (DPCM) is one
particular example of predictive coding. Transform coding, on the other hand, first
transforms the image from its spatial domain representation to a different type
of representation using some well-known transform and then codes the
transformed values (coefficients). This method provides greater data compression
compared to predictive methods, although at the expense of greater
computational requirements.
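As a toy MATLAB illustration of predictive coding (the pixel values here are made up), the previous sample predicts the current one and only the differences are coded:

    x     = [100 102 101 105 110 110 108];  % a row of pixel values
    d     = [x(1), diff(x)];                % first value, then the differences
    x_rec = cumsum(d);                      % decoder rebuilds the row exactly
    assert(isequal(x_rec, x));              % lossless round trip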
1.7 DCT PROCESS
Figure 1.3 DCT Process for compression
[Block diagram: 8x8 input image → DCT → Quantizer (using quantization tables) → quantized DCT coefficients]
The DCT process is shown in Figure 1.3. The input image is divided into
non-overlapping blocks of 8 x 8 pixels, and input to the baseline encoder. The
pixel values are converted from unsigned integer format to signed integer format,
and DCT computation is performed on each block. DCT transforms the pixel data
into a block of spatial frequencies that are called the DCT coefficients. Since the
pixels in the 8 x 8 neighborhood typically have small variations in gray levels, the
output of the DCT will result in most of the block energy being stored in the lower
spatial frequencies [80, 83, 60]. On the other hand, the higher frequencies will
have values equal to or close to zero and hence, can be ignored during encoding
without significantly affecting the image quality.
The selection of which frequencies are most important and which ones are
less important can affect the quality of the final image. JPEG allows for this by
letting the user predefine the quantization tables
used in the quantization step that follows the DCT computation. The selection of
quantization values is critical since it affects both the compression efficiency [70],
and the reconstructed image quality.
The 2-D DCT transforms an 8x8 block of spatial data samples into an 8x8
block of spatial frequency components. The IDCT performs the inverse of DCT,
transforming spatial frequency components back into the spatial domain. Figure
1.4 shows the frequency components represented by each coefficient in the output
matrix. The low frequency coefficients occur in the top left side of the output
matrix, while the remaining higher frequency coefficients occur in the bottom
right side.

Figure 1.4 DCT Coefficients

The DC coefficient at
position (0,0) gives an idea of the average intensity (for luminance blocks) or hue
(for chrominance blocks) of an entire block. Moving horizontally from position
(0,0) to position (0,7), the coefficients give the contributions of increasing vertical
frequency components to the overall 8x8 block. The coefficients from position
(1,0) to position (7,0) have similar meaning for horizontal frequency components.
Moving diagonally through the matrix gives the combined contribution of
horizontal and vertical frequency components. The original block is rebuilt by the
IDCT with these discrete frequency components. High frequency coefficients have
small magnitude for typical image data, which usually does not change
dramatically between neighboring pixels. Additionally, the human eye is not as
sensitive to high frequencies as to low frequencies. It is difficult for the human eye
to discern changes in intensity or colour that occur between successive pixels. The
human eye tends to blur these rapid changes into an average hue and intensity.
However, gradual changes over the 8 pixels in a block are much more discernible
than rapid changes. When the DCT is used for compression purposes, the quantizer
unit attempts to force the insignificant high frequency coefficients to zero while
retaining the important low frequency coefficients.
The DCT transforms information from the time or space domain to the
frequency domain, such that other tools and transmission media can be used
more efficiently to reach application goals: compact representation [13], fast
transmission, memory savings, and so on. The JPEG image compression standard
was developed by the Joint Photographic Experts Group. The JPEG compression
principle is the use of controllable losses to reach high compression rates. In this
context, the information is transformed to the frequency domain through the DCT.
Since neighboring pixels in an image have a high likelihood of showing small
variations in color, the DCT output will group the higher amplitudes in the lower
spatial frequencies. The higher spatial frequencies can then be discarded,
generating a high compression rate and a small perceptible loss in image
quality. JPEG compression is recommended for photographic images, since drawn
images are richer in high frequency areas that are distorted by the application of
JPEG compression.
1.8 JPEG IMAGE COMPRESSION
The JPEG image compression standard uses the DCT (Discrete Cosine
Transform). The discrete cosine transform is a fast transform. It is a widely used
and robust method for image compression. It has excellent compaction for highly
correlated data, and it has fixed basis images. The DCT gives a good compromise
between information packing ability and computational complexity.

The JPEG 2000 image compression standard makes use of the DWT
(Discrete Wavelet Transform). The DWT can be used to reduce the image size
without losing much resolution: computed values less than a pre-specified
threshold are discarded, thus reducing the amount of memory required to represent
a given image. The DWT provides lower quality than JPEG at low compression
rates and requires longer compression time [12, 52].
The name "JPEG" stands for Joint Photographic Experts Group. JPEG is a
method of lossy compression for digitized photographic images. JPEG can achieve
a good compression with little perceptible loss in image quality. It works with
color and grayscale images and finds applications in satellite imaging, medical
imaging, etc.
JPEG encoding consists of the following stages; the major steps in JPEG
coding involve:
• Image/Block Preparation
• DCT (Discrete Cosine Transformation)
• Quantization
• Entropy Coding
• Zigzag Scanning (Vectoring)
• Run Length Encoding (RLE)
The baseline JPEG compression algorithm is the most basic form of
sequential DCT based compression. By using transform coding, quantization, and
entropy coding, at an 8-bit pixel resolution, a high-level of compression can be
achieved. However, the compression ratio achieved is due to the sacrifices made in
quality. The baseline specification assumes that 8-bit pixels are the source image,
but extensions can use higher pixel resolutions. JPEG assumes that each block of
data input is 8x8 pixels; the pixels within each block, and the blocks themselves,
are input serially in raster order.
Baseline JPEG compression has some configurable portions, such as
quantization tables, and Huffman tables, which can individually be specified in the
JPEG file header. By studying the source images to be compressed, Huffman
codes and quantization codes can be optimized to reach a higher level of
compression without losing more quality than is acceptable. Although this mode of
JPEG is not highly configurable, it still allows a considerable amount of
compression. Further compression can be achieved by subsampling the
chrominance portions of the input image, which is a useful technique playing on
the human visual system.
Figure 1.5 Image compression model
Figure 1.6 Image Decompression model
The image compression model shown in Figure 1.5 consists of a transformer, a
quantizer and an encoder.
1.8.1 Input Transformer
It transforms the input data into a format that reduces inter-pixel redundancies
in the input image. Transform coding techniques use a reversible, linear
mathematical transform to map the pixel values onto a set of coefficients, which
are then quantized and encoded. The key factor behind the success of transform-
based coding schemes is that many of the resulting coefficients for most natural
images have small magnitudes and can be quantized without causing significant
distortion in the decoded image. For compression purpose, the higher the
capability of compressing information in fewer coefficients, the better the
transform; for that reason, the Discrete Cosine Transform (DCT) and Discrete
Wavelet Transform (DWT) have become the most widely used transform coding
techniques.
Transform coding algorithms usually start by partitioning the original
image into sub images (blocks) of small size (usually 8 × 8). For each block the
transform coefficients are calculated, effectively converting the original 8 × 8 array
of pixel values into an array of coefficients within which the coefficients closer to
the top-left corner usually contain most of the information needed to quantize and
encode (and eventually perform the reverse process at the decoder’s side) the
image with little perceptual distortion. The resulting coefficients are then quantized
and the output of the quantizer is used by symbol encoding techniques to produce
the output bit stream representing the encoded image. In the image decompression
model, at the decoder's side, the reverse process takes place, as shown in Figure 1.6
[65, 75, 16], with the obvious difference that the dequantization stage will only
generate an approximated version of the original coefficient values, i.e., whatever
loss was introduced by the quantizer in the encoder stage is not reversible.
In order to make the data fit the discrete cosine transform, each pixel value
is level shifted by subtracting 128 from its value. The result of this is 8-bit pixels
that have the range of -128 to 127, making the data symmetric across 0. This is
good for DCT as any symmetry that is exposed will lead towards better entropy
compression [22, 90]. Effectively this shifts the DC coefficient to fall more in line
with the value of the AC coefficients. The AC coefficients produced by the DCT
are not affected in any way by this level shifting.
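As an illustrative MATLAB sketch of this front end (dct2 and blockproc are Image Processing Toolbox functions; the image name is only an example):

    img  = double(imread('cameraman.tif')) - 128;  % level shift: 0..255 -> -128..127
    fun  = @(blk) dct2(blk.data);                  % 2-D DCT of one 8x8 block
    coef = blockproc(img, [8 8], fun);             % blockwise DCT coefficients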
1.8.2 Quantization
The human eye responds to the DC coefficient and the lower spatial
frequency coefficients. If the magnitude of a higher frequency coefficient is below
a certain threshold, the eye will not detect it. During quantization, the frequency
coefficients in the transformed matrix whose amplitudes are less than a defined
threshold are set to zero (these coefficients cannot be recovered during decoding),
and the sizes of the DC and AC coefficients are reduced. A division operation is
performed using the predefined threshold value as the divisor.
DCT-based image compression relies on two techniques to reduce the data
required to represent the image. Quantization is the process of reducing the number
of possible values of a quantity, thereby reducing the number of bits needed to
represent it. Entropy coding is a technique for representing the quantized data as
compactly as possible. We will develop functions to quantize images and to
calculate the level of compression provided by different degrees of quantization.
We will not implement the entropy coding required to create a compressed image
file. A simple example of quantization is the rounding of reals into integers. To
represent a real number between 0 and 7 to some specified precision takes many
bits. Rounding the number to the nearest integer gives a quantity that can be
represented by just three bits. In this process, we reduce the number of possible
values of the quantity (and thus the number of bits needed to represent it) at the
cost of losing information. A “finer” quantization, which allows more values and
loses less information, can be obtained by dividing the number by a weight factor
before rounding. In the JPEG image compression standard, each DCT coefficient is
quantized using a weight that depends on the frequency of that coefficient.
The coefficients in each 8 x 8 block are divided by a corresponding entry of
an 8 x 8 quantization matrix, and the result is rounded to the nearest integer. In
general, higher spatial frequencies are less visible to the human eye than low
frequencies. Therefore, the quantization factors are usually chosen to be larger for
the higher frequencies. The quantization matrix is widely used for monochrome
images and for the luminance component of a color image. Our quantization
function blocks the image, divides each block (element-by-element) by the
quantization matrix, reassembles the blocks, and then rounds the entries to the
nearest integer. The dequantization function blocks the matrix, multiplies each
block by the quantization factors, and reassembles the matrix.
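A minimal MATLAB sketch of this step, assuming the widely published JPEG luminance quantization matrix (any 8x8 weight matrix would serve) and an arbitrary stand-in block:

    Q = [16 11 10 16  24  40  51  61;
         12 12 14 19  26  58  60  55;
         14 13 16 24  40  57  69  56;
         14 17 22 29  51  87  80  62;
         18 22 37 56  68 109 103  77;
         24 35 55 64  81 104 113  92;
         49 64 78 87 103 121 120 101;
         72 92 95 98 112 100 103  99];
    blk  = dct2(double(magic(8))*4 - 128);  % DCT of a stand-in 8x8 block
    qblk = round(blk ./ Q);                 % quantize: element-wise divide, round
    deq  = qblk .* Q;                       % dequantize: only approximates blk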
1.8.3 Entropy Coding
Entropy coding creates a fixed or variable-length code to represent the
quantizer's output and maps the output in accordance with the code. In most cases,
a variable-length code is used. An entropy encoder compresses the quantized
values further to provide more efficient compression. The most important types of
entropy encoders used in lossy image compression techniques are the arithmetic
encoder, the Huffman encoder and the run-length encoder. In vectoring, the 2-D
matrix of quantized DCT coefficients is represented in the form of a
one-dimensional vector [76, 61, 50]. After quantization, most of the high
frequency coefficients (lower right corner) are zero. To exploit the number of
zeros, a zigzag scan of the matrix is used. The zigzag scan allows [73] all the DC
coefficients and lower frequency AC coefficients to be scanned first. DC
coefficients are encoded using differential encoding and AC coefficients are
encoded using run-length encoding; Huffman coding is used to encode both after
that. In differential encoding, the DC coefficient, which is the largest in the
transformed matrix and varies slowly from one block to the next, is handled by
encoding only the difference in value between the DC coefficients of successive
blocks, reducing the number of bits required. The difference values are encoded in
the form (SSS, value), where the SSS field indicates the number of bits needed to
encode the value and the value field indicates the binary form [41]. A sketch of the
zigzag scan follows.
1.8.4 Run-Length Encoding
Run length coding is the first step in entropy coding. It is a simple
technique accomplished by assigning a code, run length and size, to every non-zero
value in the quantized data stream. The run length is a count of the zero values
before the non-zero value occurred. The size is a category given to the non-zero
value which is used to recover the value later. The DC value of the block is
omitted in this process [69, 25]. Additionally, with every non-zero value a
magnitude is generated which determines the number of bits that are necessary to
reconstruct the value; it indicates the possible values in the size category that can
be correct. Run length coding is a basic form of lossless compression. Essentially,
this process is a generalization of zero suppression techniques [99, 1]. Zero
suppression assumes that one symbol or value appears often in a data stream. After
quantization the goal is that most high frequency components, which are less
important to the human visual system, are set to zero. The zigzag process
organizes the sequence so that the lower frequency components, which are less
likely to be zero, appear in the first part of the data stream. This effectively
organizes the data to have larger runs of zeros, especially at the end, making the
run length coding very efficient. The 63 values of the AC coefficients contain long
strings of zeros because of the zigzag scan [28]. Each AC coefficient is encoded as
a pair of values (skip, value), where skip indicates the number of zeros in the run
and value is the next non-zero coefficient.
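A sketch of this (skip, value) pairing in MATLAB, on an illustrative zigzag-ordered vector (JPEG additionally emits an end-of-block code for the trailing zeros):

    vec   = [52 3 0 0 -2 1 zeros(1, 58)];  % example zigzag-ordered 1x64 block
    ac    = vec(2:end);                    % the 63 AC coefficients (DC omitted)
    pairs = zeros(0, 2); run = 0;
    for v = ac
        if v == 0
            run = run + 1;                 % extend the current run of zeros
        else
            pairs(end+1, :) = [run v];     %#ok<AGROW> emit (skip, value)
            run = 0;
        end
    end
    % pairs is [0 3; 2 -2; 0 1]: each non-zero AC value with its zero run.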
1.8.5 Huffman Encoding
It is a technique which will assign a variable length codeword to an input
data item. Huffman coding assigns a smaller codeword to an input that occurs
more frequently. It is very similar to Morse code, which assigned smaller pulse
combinations to letters that occurred more frequently [59]. Huffman coding is
variable length coding, where characters are not coded to a fixed number of bits.
This is the last step in the encoding process. It organizes the data stream into a
smaller number of output data packets by assigning unique code words that later
during decompression can be reconstructed without loss [78, 42]. For the JPEG
process, each combination of run length and size category, from the run length
coder is assigned a Huffman codeword. Long strings of binary digits are replaced
by shorter code words. The prefix property of the Huffman code words enables
the encoded bit stream to be decoded unambiguously.
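As a toy illustration (using huffmandict/huffmanenco/huffmandeco from MATLAB's Communications Toolbox; the symbols and probabilities are made up, not the JPEG code tables):

    symbols = [0 1 2 3];                  % e.g., size categories
    p       = [0.5 0.25 0.15 0.1];        % frequent symbols get short code words
    dict    = huffmandict(symbols, p);    % build the variable length prefix code
    enc     = huffmanenco([0 0 1 3 0 2], dict);
    dec     = huffmandeco(enc, dict);     % prefix property: decodes unambiguously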
Figure 1.6 shows the decompression model of the image, in which the
reverse operation of the compression model is performed so that the original
image can be reconstructed.
1.9 APPLICATIONS OF DCT
The DCT core can be utilized for a variety of multimedia applications
including
• Office automation equipment (multifunction printers, digital copiers, etc.)
• Digital cameras & camcorders
• Video production, video conference
• Surveillance systems
Like other transforms, the Discrete Cosine Transform (DCT) attempts to
decorrelate the image data. After decorrelation each transform coefficient can be
encoded independently without losing compression efficiency. The next section
describes the DCT and some of its important properties.
1.10 DCT ALGORITHMS
The DCT algorithm is very effective due to its symmetry and simplicity. It is a good replacement for the FFT because it considers only the real component of the image data [75, 60]. In the DCT we leave out the unwanted frequency components, retaining only the required frequency components of the image. The image is divided into blocks and each block is compressed using quantization. Moreover, many simulation tools, such as MATLAB, are available to estimate the results prior to realization of the design in real time. Equations (1.1) and (1.2) are the standard one-dimensional DCT equations.
1.10.1 One-Dimensional DCT
The most common DCT definition of a 1-D sequence of length N is
C(u) = \alpha(u) \sum_{x=0}^{N-1} f(x) \cos\left[ \frac{\pi (2x+1) u}{2N} \right]   …… (1.1)
for u = 0,1,2,…,N −1. Similarly, the inverse transformation is defined as
f(x) = \sum_{u=0}^{N-1} \alpha(u)\, C(u) \cos\left[ \frac{\pi (2x+1) u}{2N} \right]   …… (1.2)
for x = 0,1,2,…,N −1. In both equations (1.1) and (1.2), α(u) is defined as

\alpha(u) = \begin{cases} \sqrt{1/N}, & u = 0 \\ \sqrt{2/N}, & u \neq 0 \end{cases}   …… (1.3)
It is clear from Equation (1.1) that for
u = 0, \quad C(u = 0) = \sqrt{1/N} \sum_{x=0}^{N-1} f(x)   …… (1.4)
Thus, the first transform coefficient is the average value of the sample
sequence. In literature, this value is referred to as the DC Coefficient. All other
transform coefficients are called the AC Coefficients.
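Equations (1.1), (1.3) and (1.4) can be checked with a direct floating-point reference (an illustrative Python model only; the proposed low power datapath is the subject of Chapter 4):

    import math

    # Direct evaluation of Equation (1.1) with alpha(u) from (1.3).
    def dct_1d(f):
        N = len(f)
        def alpha(u):
            return math.sqrt((1.0 if u == 0 else 2.0) / N)
        return [alpha(u) *
                sum(f[x] * math.cos(math.pi * (2 * x + 1) * u / (2 * N))
                    for x in range(N))
                for u in range(N)]

    samples = [8, 16, 24, 32, 40, 48, 56, 64]
    coeffs = dct_1d(samples)
    # Equation (1.4): the DC coefficient is sqrt(1/N) times the sample sum.
    assert abs(coeffs[0] - sum(samples) / math.sqrt(len(samples))) < 1e-9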
1.10.2 Two-Dimensional DCT
The 2-D DCT is a direct extension of the 1-Dimensional DCT and is given
by
C(u,v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y) \cos\left[ \frac{\pi (2x+1) u}{2N} \right] \cos\left[ \frac{\pi (2y+1) v}{2N} \right]   …… (1.5)
For u, v = 0,1,2,…,N −1 and α(u) and α(v) are defined in (1.3).
The inverse transform, defined for x, y = 0, 1, 2, …, N −1, is

f(x,y) = \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} \alpha(u)\,\alpha(v)\, C(u,v) \cos\left[ \frac{\pi (2x+1) u}{2N} \right] \cos\left[ \frac{\pi (2y+1) v}{2N} \right]   …… (1.6)

The 2-D basis
functions can be generated by multiplying the horizontally oriented 1-D basis
functions with a vertically oriented set of the same functions.
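A direct double-sum evaluation of Equation (1.5), useful only as a golden reference model given its O(N^4) cost, can be sketched as follows:

    import math

    # Direct 2-D DCT per Equation (1.5); the architectures discussed
    # in Section 1.11 reduce this arithmetic cost substantially.
    def dct_2d(f):
        N = len(f)
        a = lambda u: math.sqrt((1.0 if u == 0 else 2.0) / N)
        c = lambda k, u: math.cos(math.pi * (2 * k + 1) * u / (2 * N))
        return [[a(u) * a(v) * sum(f[x][y] * c(x, u) * c(y, v)
                                   for x in range(N) for y in range(N))
                 for v in range(N)]
                for u in range(N)]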
1.11 DCT ARCHITECTURES
Different architectures are available to compute the 2-D DCT of an image matrix; some of these architectures are discussed in the following sections.
1.11.1 Two-Dimensional Approaches
The implementation of the 2-D DCT directly from the theoretical equation
results in 1024 multiplications and 896 additions. Fast algorithms exploit the
symmetry within the DCT to achieve dramatic computational savings.
1.11.2 Row – Column Decomposition
This algorithm computes the 2-D DCT by row-column decomposition. In
this approach, the separability property of the DCT is exploited. An 8-point, 1-D
DCT is applied to each of the 8 rows, and then again to each of the 8 columns. The
1-D algorithm that is applied to both the rows and columns is the same. Therefore,
it could be possible to use identical pieces of hardware to do the row computation
as well as the column computation. A transposition matrix separates the row and column computations, as shown in Figure 1.7. The bulk of the design and computation is
in the 8 point 1-D DCT block, which can potentially be reused 16 times, 8 times
for each row, and 8 times for each column. Therefore, an algorithm for computing
the 1-D DCT is usually selected. The high regularity of this approach is very
attractive for reduced cell count and low power consumption with ASIC
implementation.
Figure 1.7 Row – Column Decomposition
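In software terms the decomposition amounts to two passes of the 1-D transform with a transposition between them, as the following sketch shows (illustrative Python, repeating the 1-D reference of Section 1.10.1; the hardware instead time-shares one 1-D DCT unit):

    import math

    def dct_1d(f):  # reference 1-D DCT of Equation (1.1)
        N = len(f)
        a = lambda u: math.sqrt((1.0 if u == 0 else 2.0) / N)
        return [a(u) * sum(f[x] * math.cos(math.pi * (2 * x + 1) * u / (2 * N))
                           for x in range(N))
                for u in range(N)]

    def dct_2d_row_column(block):
        rows = [dct_1d(list(r)) for r in block]       # 8 row transforms
        cols = [dct_1d(list(c)) for c in zip(*rows)]  # transpose, 8 column transforms
        return [list(r) for r in zip(*cols)]          # transpose back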
1.11.3 Direct Method
This approach to computation of the 2-D DCT is by a direct method using
the results of a polynomial transform. Computational complexity is greatly
reduced, but regularity is sacrificed. Instead of the 16 1-D DCTs used in the conventional row-column decomposition, this method uses all real arithmetic, including eight 1-D DCTs and stages of pre-adds and post-adds (a total of 234 additions), to
compute the 2-D DCT. Thus, the number of multiplications for most
implementations should be halved as multiplication only appears within the 1-D
DCT. Although this direct method of extension into two dimensions creates an
irregular relationship between inputs and outputs of the system, the savings in
computational power may be significant with the use of certain 1-D DCT
algorithms. With this direct approach, large chunks of the design cannot be reused
to the same extent as in the conventional row-column decomposition approach.
Thus, the direct approach will lead to more hardware, more complex control, and
much more intensive debugging. Although the direct approach used 278 fewer additions than the row-column approach, it had much greater complexity.
Therefore, the number of computations alone could not determine which
implementation would result in the lowest power design.
1.11.4 Distributed Arithmetic Algorithms
Distributed Arithmetic (DA) is a bit level rearrangement of a multiply
accumulate to recast the multiplications as additions. The DA method is designed
for inner (dot) products of a constant vector with a variable, or input, vector. It is
the order of operations that distinguishes distributed arithmetic from conventional
arithmetic [94]. The DA technique forms partial products with one bit of data from
the input vector at a time. The partial products are shifted according to weight and
summed together. Look-up tables (LUTs) are essential to the DA method. LUTs
store all the possible sums of the elements in the constant vector. The LUT grows exponentially in size with the dimension of the input, so four-element inputs (a 16-entry LUT) are a practical choice. DA is implemented with the least possible resources by computing it
in a fully bit-serial manner. The key elements required to implement the DA are a
16-element LUT and decoder, an adder/ subtractor, and a shifter. These elements
are grouped together in a ROM Accumulate (RAC) structure. A shift register
inputs one column of input bits per clock cycle to the LUT. It begins by inputting
the most significant column of bits and rotates to finally input the least significant
column of variable bits. The contents from the LUT get summed with the shifted
contents of the previous look up. In this fully bit-serial approach, the answer
converges in as many clock cycles as the bit length of the input elements. While
the serial inputs limit the performance of the RAC, it requires the least possible
resources. Greater performance can be achieved with an increase in hardware.
With an increase in resources, the result of the RAC can converge quicker. The
speed of the calculation is increased by replicating the LUT. In a fully parallel approach, the result of the DA converges at the maximum speed, the clock rate. In this case, the LUT must be replicated as many times as there are input bits.
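The following sketch illustrates the RAC idea for a four-element constant vector (unsigned inputs only; two's complement sign handling, which subtracts the final LUT term, is omitted for brevity):

    # Bit-serial distributed arithmetic for y = sum(C[k] * x[k]):
    # a 16-entry LUT holds every possible sum of the constants, and
    # one bit-column of the inputs addresses it per clock cycle.
    C = [3, 5, 7, 9]                       # constant coefficient vector
    LUT = [sum(c for k, c in enumerate(C) if (i >> k) & 1)
           for i in range(16)]

    def da_dot(x, nbits=8):
        acc = 0
        for b in reversed(range(nbits)):   # most significant column first
            addr = sum(((x[k] >> b) & 1) << k for k in range(4))
            acc = (acc << 1) + LUT[addr]   # shift partial result, add LUT
        return acc

    x = [1, 2, 3, 4]
    assert da_dot(x) == sum(c * v for c, v in zip(C, x))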
1.12 PROPERTIES OF DCT
Some properties of the DCT which are of particular value to image
processing applications:
Decorrelation: The principal advantage of image transformation is the
removal of redundancy between neighboring pixels. This leads to uncorrelated
transform coefficients which can be encoded independently. It can be inferred that
DCT exhibits excellent decorrelation properties.
Energy Compaction: Efficiency of a transformation scheme can be
directly gauged by its ability to pack input data into as few coefficients as possible.
This allows the quantizer to discard coefficients with relatively small amplitudes
without introducing visual distortion in the reconstructed image. DCT exhibits
excellent energy compaction for highly correlated images.
Separability: The DCT transform equation can be expressed as
D(i,j) = \alpha(i)\,\alpha(j) \sum_{x=0}^{N-1} \left[ \sum_{y=0}^{N-1} f(x,y) \cos\frac{(2y+1) j \pi}{2N} \right] \cos\frac{(2x+1) i \pi}{2N}   …… (1.7)
This property, known as separability, has the principal advantage that D(i,
j) can be computed in two steps by successive 1-D operations on rows and
columns of an image. The arguments presented can be identically applied for the
inverse DCT computation.
Figure 1.8 2-D DCT model
Symmetry: Examination of the row and column operations in the DCT equation reveals that these operations are functionally identical. Such a transformation is called a symmetric transformation. A separable and symmetric transform can be expressed in the form

D = TMT′   …… (1.8)

where M is an N ×N symmetric transformation matrix and T is the DCT matrix.
This is an extremely useful property since it implies that the transformation
matrix can be precomputed offline and then applied to the image thereby providing
orders of magnitude improvement in computation efficiency.
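The matrix form of Equation (1.8) is easily exercised with a precomputed T (an illustrative NumPy sketch; T here is the orthonormal DCT matrix, so its transpose also inverts the transform):

    import numpy as np

    N = 8
    # Rows of T are the 1-D DCT basis vectors: T[u, x] = alpha(u) cos(...).
    T = np.array([[np.sqrt((1 if u == 0 else 2) / N) *
                   np.cos((2 * x + 1) * u * np.pi / (2 * N))
                   for x in range(N)] for u in range(N)])

    M = np.arange(64.0).reshape(N, N)      # stand-in 8x8 image block
    D = T @ M @ T.T                        # forward 2-D DCT, D = T M T'
    assert np.allclose(T.T @ D @ T, M)     # inverse recovers the block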
1.13 ORGANIZATION OF THE THESIS
This thesis consists of eight chapters, an appendix and references. The framework of the thesis is as follows. Chapter 1 gives an introduction to the DCT and VLC and also discusses the different DCT architectures. In Chapter 2, the researcher discusses the literature review related to the present research. Chapter 3 describes the VLSI design flow, the importance of low power, and different techniques for low power VLSI design. Chapter 4 describes the proposed low power VLSI architecture for DCT and IDCT. In Chapter 5, the present researcher discusses the low power architecture for VLC and VLD. Chapter 6 and Chapter 7 present the results and discussion of the present research towards achieving good compression with a low power approach. Chapter 8 deals with the conclusion of the present thesis and the possibilities of future work.
SUMMARY
This chapter describes the concepts involved in image compression: the methodology adopted in transforming the image pixels from the spatial domain to the frequency domain by performing the DCT, and the different architectures available to compute the DCT of 8x8 pixels represented in the form of an image matrix. The output of the DCT is quantized to perform the compression, and the quantizer output prepares the ground for lossy compression. After quantization, variable length coding is performed to achieve lossless compression, and finally the image is reconstructed using the decompression process. The next chapter presents the review of the literature related to the present research.
CHAPTER 2
REVIEW OF LITERATURE
2.1 INTRODUCTION
The previous chapter described the concepts involved in image compression, where the methodology adopted is transforming the image pixels from the spatial domain to the frequency domain by performing the DCT. It also described the basic concepts required to reconstruct the image by performing the IDCT. This chapter presents a review of the previous works related to the present research.
Image and Video Compression has been a very active field of research and
development for over 20 years and many different systems and algorithms for
compression and decompression have been proposed and developed. In order to
encourage interworking, competition and increased choice, it has been necessary to
define standard methods of compression, encoding and decoding to allow products
from different manufacturers to communicate effectively. This has led to
development of a number of key international standards for image and video
compression, including the JPEG, MPEG and H.26X series of standards.
The Discrete Cosine Transform (DCT) was first proposed by Ahmed et al.
(1974), and it has been more and more important in recent years. DCT has been
widely used in signal processing of image data, especially in coding for
compression, for its near-optimal performance. Because of the wide-spread use of
DCT's, research into fast algorithms for their implementation has been rather
active.
1. Ricardo Castellanos, Hari Kalva and Ravi Shankar (2009), “Low Power
DCT using Highly Scalable Multipliers”, 16th IEEE International
Conference on Image Processing ,pp.1925-1928.
In this paper the authors have implemented a low power DCT using a highly scalable multiplier. Scalable multipliers are used since the operand bit width
varies in each stage of DCT implementation. Due to the use of the variable sized
multipliers power saving is obtained. A highly scalable multiplier (HSM) allows
dynamic configuration of multiplier for each stage. The authors have calculated
PSNR and SSIM on a set of images which are JPEG encoded based on the use of
scalable multipliers. The authors conclude that the use of a scalable multiplier with variable size HSM reduces the power consumption more than the other algorithms compared.
2. Vimal P. Singh Thoudam, Prof. B. Bhaumik, Dr. S. Chatterjee (2010),
“Ultra Low Power Implementation of 2-D DCT for Image/Video Compression”,
International Conference on Computer Applications and Industrial Electronics
(ICCAIE), pp.532-536.
The proposed DCT architecture is based on the Loeffler algorithm and uses CSD (Canonic Signed Digit) representation. In the CSD number system each digit is allowed to take a value from {-1, 0, +1}, and no two consecutive digits are nonzero. Multiplications can therefore be replaced by shift and add operations. The proposed scheme targets reduced power by minimizing arithmetic operations and using the clock gating technique, since a considerable amount of dynamic power is consumed in the clock distribution network. Moreover, flip-flops dissipate some dynamic power even if their state does not change, so an EN (enable) input is used for effective power reduction.
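As a generic illustration of CSD recoding (a sketch of the number system only, not the circuit of the cited paper), a constant can be converted so that no two consecutive digits are nonzero, making each nonzero digit one shift-and-add or shift-and-subtract term:

    # Recode a non-negative constant into CSD digits {-1, 0, +1},
    # least significant digit first; e.g. 7 = 8 - 1 needs one subtract
    # instead of two adds.
    def to_csd(n):
        digits = []
        while n:
            if n & 1:
                d = 2 - (n & 3)   # +1 if n % 4 == 1, -1 if n % 4 == 3
                n -= d
            else:
                d = 0
            digits.append(d)
            n >>= 1
        return digits

    assert to_csd(7) == [-1, 0, 0, 1]
    assert sum(d << i for i, d in enumerate(to_csd(7))) == 7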
3. M. Jridi and A. Alfalou (2010), “A Low-Power, High-Speed DCT
architecture for image compression: principle and implementation” 18th IEEE/IFIP
International Conference on VLSI and System-on-Chip (VLSI-Soc 2010) pp.304-
309.
A low power and high speed DCT for image compression is implemented
on FPGA. The DCT optimization is achieved through the hardware simplification
of the multipliers used to compute the DCT coefficients. In this work the authors
have implemented the DCT with constant multiplier by making use of Canonic
signed digit encoding to perform constant multiplication. The canonic signed digit
representation is the signed data representation containing the fewest number of
nonzero bits. Thus, for the constant multipliers, the number of additions and subtractions will be minimum. A common sub-expression elimination technique has also been used for further DCT optimization.
4. M. El Aakif, S. Belkouch, N. Chabini, M. M. Hassani (2011), “Low Power
and Fast DCT Architecture Using Multiplier-Less Method”, IEEE International
Conference on Faible Tension Faible Consommation (FTFC). pp.63-66.
In this paper a new modified flow graph algorithm (FGA) of the DCT, based on the Loeffler algorithm with a hardware implementation of multiplier-less operation, has been proposed. The proposed FGA uses unsigned constant coefficient multiplication. The multiplier-less method is widely used for VLSI realization because of improvements in speed, area overhead and power consumption. The Loeffler fast algorithm is used to implement the 1-D DCT, and two 1-D DCT steps are used to generate the 2-D DCT coefficients.
5. Yongli Zhu, Zhengya Xu (2006) “Adaptive Context Based Coding for
Lossless Color Image Compression”, IMACS Multiconference on Computational
Engineering in Systems Applications (CESA), Beijing, China, pp.1310-1314.
An adaptive low complexity lossless segment (context) based color image
compression method is proposed by the authors. Adaptive context based division of an image is conducted, using a modified greedy algorithm, to find optimal fixed length codes that are used for storing the relative values of every pixel in each segment of the image and the length of each segment. A modified Huffman coding is then applied to the result of the first compression. The compression ratios obtained
were high.
6. Sunil Bhooshan, Shipra Sharma (2009), “An Efficient and Selective Image
Compression Scheme using Huffman and Adaptive Interpolation”, 24th
International Conference on Image and Vision Computing New Zealand (IVCNZ),
pp.1-3.
In this paper the authors have made use of lossy and lossless compression techniques. Different blocks are compressed in one of the two ways depending on the information content in that block. The process of compression begins with passing the image through a high pass filter, and the image matrix is divided into a number of non-overlapping sub-blocks. Each of the sub-blocks is checked for the number of zeros by setting a threshold. If the number of zeros in a particular block is more than the threshold, it implies that the block contains less information, and that particular block from the original image matrix is taken for lossy compression using adaptive interpolation. On the other hand, if the number of zeros is less than the threshold, it implies that the information content is more, and thus the corresponding block of the original image matrix is subjected to lossless compression using Huffman coding. The authors conclude that the computational complexity of this approach is low and the compression ratios obtained by this method are also high.
7. Piyush Kumar Shukla, Pradeep Rusiya, Deepak Agrawal, Lata Chhablani,
Balwant Singh (2009), “Multiple Subgroup Data Compression Technique Based On Huffman Coding”, First International Conference on Computational
Intelligence, Communication Systems and Networks (CICSYN), pp.397-402.
In this paper the authors have proposed data compression for mathematical text files based on Adaptive Huffman coding; the compression ratio obtained is higher than that of standard Adaptive Huffman coding. In the method used by the authors, the encoding process encodes the frequently occurring characters with shorter bit codes and the infrequently occurring characters with longer bit codes.
The algorithm proceeds as follows: subgroups of 256 characters are made. These
subgroups are again grouped into three groups of alphabets, numbers and some
operators and remaining symbols. The symbols of each group are arranged in the
decreasing order of probability of occurrence. Codeword for each character is
provided and the data file is thus encoded.
8. Dr. Muhammad Younus Javed and Abid Nadeem (2000), “Data
Compression Through Adaptive Huffman Coding Scheme”, IEEE Proceedings on TENCON, Vol.2, pp.187-190.
In this paper the authors describe the development of a data compression
system that employs the Adaptive Huffman method for generating variable length
codes. Adaptive Huffman coding has been used for dynamic data compression in
order to decrease the average code length used to represent the symbols of
character sets. This encoding system is used on text files where the characters in
the text are encoded with variable length codes. Shorter codes are used to encode
the character that appears frequently and longer length codes are used to encode
characters that appear infrequently. The authors conclude that this scheme is well
suited for online encoding/decoding in data networks and compression for larger
files in order to reduce storage and transmission.
9. A.P. Vinod, D. Rajan and A. Singla (2007), “ Differential pixel-based low-
power and high-speed implementation of DCT for on-board satellite image
processing”, IET international Journals on Circuits, Devices and Systems, pp. 444-
450.
In this paper the authors have presented the techniques for minimizing the
complexity of multiplication by employing the Differential Pixel Image (DPI). The DPI is the matrix obtained by taking the difference of intensities of adjacent pixels in the
input image matrix. The use of DPI instead of the original image matrix results in
significant reduction in the number of operations and hence the power consumed.
The intensity of the pixel in the DPI is obtained as fd (x,y)= [f(x,y) – f(x,y-1)],
where f(x,y) is the intensity at (x,y) in the original image. Also the intensity of the
first pixel of every sub-block of the DPI will be the same as the original matrix. In
this work the DCT coefficient matrix is represented using canonic signed digits.
The authors have also used a common sub-expression elimination method, in which multiple occurrences of identical bit patterns are identified in the DCT matrix, thereby reducing the necessary resources, which can be shared.
10. Muhammed Yusuf Khan, Ekram Khan, M.Salim Beg (2008),”
Performance Evaluation of 4x4 DCT Algorithms For Low Power Wireless
Applications”. First International Conference on Emerging Trends in Engineering
and Technology, pp.1284-1286.
In this paper the authors have compared the performance of 4x4 DCT with
8x8 DCT, since small size DCT is suitable for mobile applications using low
power devices as fast computation speed is required for real time applications. The
authors have compared the performance of 4x4 transforms with the conventional
8x8 DCT in floating point. Firstly, the authors have compared the conventional
4x4 DCT in floating point with conventional 8x8 DCT in floating point. Next, the
4x4 integer transform is compared with the conventional 8x8 DCT in floating
point. The comparison was done on computation time of the transform and inverse
transform and objective quality, based on the calculation of PSNR between input
and reconstructed image. The authors have concluded that the integer transform
approximation of the DCT will reduce the computational time considerably.
11. A.Pradini, T.M.Roffi, R.Dirza, T.Adiono (2011), “VLSI Design of a High-
Throughput Discrete Cosine Transform for Image Compression System”,
International Conference on Electrical Engineering and Informatics, Indonesia
(ICEEI), pp.1-6.
In this paper the authors have proposed a unique 2D DCT architecture
based on regular butterfly structure. The architecture employs eight 1D DCT
processors and four post-addition stages to obtain the 2D DCT coefficients. Each
1D DCT processor is designed using Algebraic Integer Encoding architecture
which requires no multipliers, therefore the entire 2D DCT architecture is
multiplier less. The multiplier less design approach has given the high throughput
to the architecture.
12. S.V.V.Sateesh, R.Sakthivel, K.Nirosha, Harisha M.Kittur (2011), “An
Optimized Architecture to Perform Image Compression and Encryption
Simultaneously Using Modified DCT Algorithm”, IEEE International Conference
on Signal Processing, Communication, Computing and Network Technologies,
pp.442-447.
In this paper the authors have designed an architecture for the DCT based on the Loeffler scheme. The 1-D DCT is calculated for only the first two terms and the remaining six terms are taken as zero. The architecture takes in 8 pixels as input every clock cycle and generates only 2 outputs against the 8 outputs of the traditional Loeffler DCT. Thus the architecture needs only 4 multipliers and 14 adders. The
adders used are carry select adders and the multipliers used are high performance
multipliers.
13. Hatim Anas, Said Belkouch, M. El Aakif, Noureddine Chabini (2011), “FPGA Implementation of a Pipelined 2D DCT and Simplified Quantization for Real Time Applications”, IEEE International Conference on Multimedia Computing and Systems, pp.1-6.
This paper explores a new approach to obtain the DCT using flow graph
algorithm. All of the multiplications are merged in the quantization block. To
avoid the reduction in the operating frequency during the division at the
quantization process, all of the elements in the quantization matrix are represented
in the nearest powers of 2. Authors have shown that this method outperforms any
of the approaches in terms of operating frequency and resource requirements.
14. Chi-Chia Sun, Benjamin Heyne, Juergen Goetze (2006), “A Low Power and High Quality CORDIC Based Loeffler DCT”, IEEE conference.
In this paper a low power, quality preserving DCT architecture is presented. It is obtained by optimizing the Loeffler DCT based on the CORDIC algorithm. The architecture design starts with the basic Loeffler DCT, which requires 11 multiplications. This paper mainly concentrates on area and power
reduction. At the same time it maintains the same transformation quality as
original Loeffler DCT. It is very suitable for low-power and high quality CODECs,
especially for battery based systems.
15. Hai Huang, Tze-Yun Sung, Yaw-shih Shieh (2010), “A Novel VLSI
Linear Array for 2-D DCT/IDCT”, IEEE 3rd International Congress on Image and Signal Processing, pp.3680-3690.
This paper proposes an efficient 1-D DCT and IDCT architectures using
sub-band decomposition algorithm. The orthonormal property of DCT/IDCT
transformation matrices is fully used to simplify the hardware complexities. The proposed architectures have low computation and hardware complexity for both the DCT and the IDCT, and are fully pipelined and scalable for variable length 2-D DCT/IDCT computation.
The proposed architecture requires 3 multipliers and 21 adders. In addition the
proposed architecture is highly regular, scalable, and flexible.
16. Byoung-Il Kim, Sotirios G. Ziavras (2009), “Low Power Multiplierless
DCT for Image/Video Coders”, IEEE 13th International Symposium on Consumer
Electronics, pp.133-136.
This paper mainly focuses on the power efficiency of coders. Power reduction is achieved by minimising the number of arithmetic operations and their bit-width. To minimise arithmetic operation redundancy, the DCT design focuses on Chen's factorization approach and the constant matrix multiplication (CMM) problem. The 8×1 DCT is decomposed using six two-input butterfly networks. Each butterfly performs a 2×2 matrix multiplication and requires a maximum of eight adders/subtractors with 13-bit cosine coefficients. An extended canonic signed digit (CSD) format is used for expressing the binary constant coefficients. An adaptive companding scheme is used for bit-width reduction. It consists of a bit-width compressor and a bit-width expander, and provides the butterflies with reduced bit-width while minimising image/video quality degradation.
17. Tze-Yun Sung, Yaw-Shih Shieh, Chun-Wang Yu and Hsi-Chin Hsin (2006), “High Efficiency and Low Power Architectures for 2-D DCT and IDCT Based on CORDIC Rotation”, Seventh International Conference on Parallel, Distributed Computing, Applications and Technologies (PDCAT), pp.191-196.
Multiplication is the key operation for both DCT and IDCT. In the
CORDIC based processor, multipliers can be replaced by simple shifters and
adders. The double rotation CORDIC algorithm has even better latency compared to the conventional CORDIC based algorithm. Hardware implementation of the 8-point 2-D
DCT requires two SRAM banks (128 words), two 8 point DCT\IDCT processors,
two multiplexers and a control unit. By taking into account the symmetry
properties of the fast DCT/IDCT algorithm, high efficiency architecture with a
parallel – pipelined structure have been proposed to implement DCT and IDCT
processors. In the constituent 1-D DCT/IDCT processors, the double rotation
CORDIC algorithm with rotation mode in the circular co-ordinate system has been
utilised for the arithmetic unit for both DCT/IDCT i.e. multiplication computation.
Thus they are very much suited to VLSI implementation with design tradeoffs.
18. Gopal Lakhani (2004), “Optimal Huffman Coding of DCT Blocks”, IEEE
transactions on circuits and systems for video technology, Vol.14, issue.4. pp 522-
527.
A minor modification to the Huffman coding for JPEG image compression is made. During the run length coding, instead of pairing the non-zero coefficient with the preceding number of zeros, the authors have designed an encoder that pairs it with the subsequent number of zeros. This change in the run length encoder is utilized in forming a Huffman table that is optimized for the position of the non-zero coefficient denoted by the pair. The advantage of this coding method is that no EOB (end-of-block) marker is needed to represent the end of a block.
19. Raymond K.W.Chan, Moon-Chuen Lee (2006), “Multiplierless
approximation of fast DCT algorithms”, IEEE International conference on
multimedia and Expo, pp.1925-1928.
This paper presents an effective method to convert any floating-point 1-D DCT into an approximate multiplier-less version with shift and add operations, and converts AAN's fast DCT algorithm to its multiplier-less version. Experimental results show that AAN's fast DCT algorithm, approximated by the proposed method using an optimized configuration, can be used to reconstruct images with high visual quality in terms of peak signal to noise ratio (PSNR). The constant coefficients are represented in MSD form. All the butterflies present in AAN's algorithm are converted to lifting structures before the proposed method is applied.
20. Kamrul Hasan Talukder and Koichi Harada (2007), “Discrete Wavelet
Transform for Image Compression and A Model of Parallel Image Compression
Scheme for Formal Verification”, Proceedings of the World Congress on
Engineering.
The use of discrete wavelet for image compression and a model of the
scheme of verification of parallelizing the compression have been presented in this
paper. The wavelet transform exploits both the spatial and frequency
correlation of data by dilations (or contractions) and translations of mother wavelet
on the input data. It supports the multi resolution analysis of data i.e. it can be
applied to different scales according to the details required, which allows
progressive transmission and zooming of the image without the need of extra
storage. The DWT characteristics are therefore well suited for image compression and include the ability to take account of the Human Visual System's (HVS) characteristics, very good energy compaction capability, robustness under transmission, high compression ratio, etc. The implementation of the wavelet
compression scheme is very similar to that of subband coding scheme: the signal is
decomposed using filter banks. The output of the filter banks is down-sampled,
quantized, and encoded. The decoder decodes the coded representation, up-
samples and recomposes the signal. A model for parallelizing the compression
technique has also been proposed here.
21. Abdullah Al Muhit, Md. Shabiul Islam and Masuri Othman (2004), “VLSI
Implementation of Discrete Wavelet Transform (DWT) for Image Compression”,
2nd International conference on Autonomous Robots and Agents, New Zealand.
December pp.391-395.
This paper presents an approach towards VLSI implementation of the
Discrete Wavelet Transform (DWT) for image compression. The design follows
the JPEG2000 standard and can be used for both lossy and lossless compression.
In order to reduce complexities of the design, linear algebra view of DWT and
IDWT has been used in this paper. The DWT algorithm consists of Forward DWT
(FDWT) and Inverse DWT (IDWT). The FDWT can be performed on a signal
using different types of filters such as db7, db4 or Haar. The Forward transform
can be done in two ways, such as matrix multiply method and linear equations.
After the FDWT stage, the resulting average and detail wavelet values can be
compressed using thresholding method. In the IDWT process, to get the
reconstructed image, the wavelet details and averages can be used in the matrix
multiply method and linear equations. This design can be used for image
compression in a robotic system.
22. En-Hui Yang, Longji Wang (2009), “Joint Optimization of Run-Length Coding, Huffman Coding, and Quantization Table with Complete Baseline JPEG Decoder Compatibility”, IEEE Transactions on Image Processing, pp.63-74.
The authors have presented a graph-based R-D optimal algorithm for JPEG
run-length coding. It finds the optimal run size pairs in the R-D sense among all
possible candidates. Based on this algorithm, they have proposed an iterative
algorithm to optimize run-length coding, Huffman coding and quantization table
jointly. The proposed iterative joint optimization algorithm results in up to 30% bit
rate compression improvement for the test images, compared to baseline JPEG.
The algorithms are not only computationally efficient but completely compatible
with existing JPEG and MPEG decoders. They can be applied to the application
areas such as web image acceleration, digital camera image compression, MPEG
frame optimization and transcoding, etc.
23. D.A. Karras, S.A. Karkanis and B.G. Mertzios (1998), “Image Compression Using the Wavelet Transform on Textural Regions of Interest”, 24th IEEE International Euromicro Conference, pp.633-639.
This paper suggests a new image compression scheme, using the discrete wavelet transform (DWT), which attempts to preserve the texturally important image characteristics. The main point of the proposed methodology is that the image is divided into regions of textural significance, employing textural descriptors as criteria and fuzzy clustering methodologies. These textural descriptors include co-occurrence matrix based measures and coherence analysis derived features. More specifically, the DWT is applied separately to each region into which the original image is partitioned and, depending on how a region has been texturally clustered, the relative number of wavelet coefficients to keep is then determined. Therefore, different compression ratios are
applied to the image regions. The reconstruction process of the original image
involves the linear combination of its corresponding reconstructed regions.
24. Muhammad Bilal Akhtar, Adil Masoud Qureshi, Qamar-Ul-Islam (2011), “Optimized Run Length Coding for JPEG Image Compression Used in Space Research Program of IST”, IEEE International Conference on Computer Networks and Information Technology (ICCNIT), pp.81-85.
In this paper the authors have proposed a new scheme for run length coding
to minimize the error during the transmission. The optimized run length coding
uses a pair of (RUN, LEVELS) only when a pattern of consecutive zeros occur at
the input of the encoder. The non zero digits are encoded as their respective values
in LEVELS parameters. The RUN parameter is eliminated from the final encoded
message for non zero digits.
25. Gregory K. Wallace (1991), “The JPEG Still Picture Compression Standard”, IEEE Transactions on Consumer Electronics, Vol.38, Issue.1, pp.18-38.
In this paper the author has given the brief description for JPEG image
compression standard, which uses both lossy and lossless compression methods.
For JPEG images the lossy compression is based on DCT followed by
quantization. The lossless method is based on entropy coding which is a
completely reversible process. The procedures of run length coding are also given
by the author.
26. Mustafa Safa Al-Wahaiba, KokSheik Wong, “A Lossless Image
Compression Algorithm Using Duplication Free Run-Length Coding”, Second
International Conference on Network Applications, Protocols and Services
(NETAPPS) , pp.245-250.
In this paper authors propose a novel lossless image compression algorithm
using duplication free run-length coding. An entropy rule-based generative coding
method was proposed to produce code words that are capable of encoding both
intensity level and different flag values into a single codeword. The proposed
method gains compression by reducing a run of two pixels to only one codeword.
The algorithm has no duplication problem, and the number of pixels that can be
encoded by a single run is infinite.
27. Jason McNeely and Magdi Bayoumi (2007), “Low Power Look-Up Tables
for Huffman Decoding”, IEEE International Conference on Image Processing,
pp.465-468.
This work includes a study of different lookup tables for Huffman decoding. It was found that the PLA type architecture is common among Huffman lookup tables because of its speed and simplicity advantages. In certain
situations, it was determined that the tree structure for lookup tables can be a low
power alternative to the PLA structure. These situations accounted for 56% of the
total simulation runs, and of these runs, the average power savings of the tree in
those situations was found to be 78%. The work also shows the effect of varying
the table size and varying the probability distributions of a table on power, area
and delay.
28. Bao Ergude, Li Weisheng, Fan Dongrui, Ma Xiaoyu (2008), “A Study and
Implementation of the Huffman Algorithm Based on Condensed Huffman Table”,
IEEE International Conference on Computer Science and Software Engineering,
pp.42-45.
They have used the property of canonical Huffman tree to study and
implement a new Huffman algorithm based on condensed Huffman table, which
greatly reduces the expense of Huffman table and increases the compression ratio.
The binary sequence in the paper requires only a small space and under some
special circumstances, the level without leaf is marked to be 1, which can further
reduce the required size.
29. Sung-Wen Wang, Shang-Chih Chuang, Chih-Chieh Hsiao, Yi-Shin Tung
and Ja-ling Wu (2008), “An efficient Memory Construction Scheme for an
Arbitrary Side Growing Huffman table”, IEEE International conference on
multimedia and Expo.
To speed up Huffman decoding, a memory efficient Huffman table is constructed on the basis of an arbitrary-side growing Huffman tree (AGH-tree), by grouping the common prefixes of a Huffman tree, instead of the commonly used single-side growing Huffman tree (SGH-tree). Simulation results show that, in Huffman decoding, an AGH-tree based Huffman table is 2.35 times faster than Hashemian's method (an SGH-tree based one) and needs only one-fifth of the corresponding memory size.
30. Reza Hashemian (2003), “Direct Huffman Coding and Decoding using the
Table of Code-Lengths”, IEEE International Conference on Information Technology: Coding and Computing, pp.237-241.
This work includes developing of a memory efficient and high-speed
search technique for encoding and decoding symbols using Huffman coding. This
technique is based on a Condensed Huffman Table (CHT) for decoding purposes,
and it is shown that a CHT is significantly smaller than the ordinary Huffman
Table. In addition, the procedure is shown to be faster in searching for a code-word
and its corresponding symbol in the symbol-list. An efficient technique is also
proposed for encoding symbols by using the code-word properties in a Single-Side
Growing Huffman Table (SGHT), where code-word values are ordered in the
ascending order exactly in contrast with the probabilities that are in descending
order.
31. Jia-Yu Lin, Ying Liu, and Ke-Chu Yi (2004), “Balance of 0, 1 Bits for Huffman and Reversible Variable-Length Coding”, IEEE Journal on Communications, pp.359-361.
The authors have proposed an effective algorithm to make the bit
probabilities balanced. This algorithm can be embedded in the construction of the
Huffman codebook, with little complexity. In the analysis of RVLCs based on the
Huffman codes, it was shown that the bidirectionally decodable stream had good
performance under the bit-balance criterion, and it could be combined with the
proposed algorithm to further decrease the bit-probability difference. For
symmetrical and asymmetrical RVLCs, probability differences could be decreased
by reassigning code words to source symbols after the creation of codebooks.
32. Da An, Xin Tong, Bingqiang Zhu and Yun He (2009), “A Novel Fast DCT
Coefficient Scan Architecture”, IEEE Picture Coding Symposium I, Beijing
100084, China, pp.1-4.
A novel, fast and configurable architecture for zigzag scan and optional
scans in multiple video coding standards, including H.261, MPEG-1,2,4,
H.264/AVC, and AVS is proposed. Arbitrary scan patterns could be supported by
configuring the ROM data, and the architecture can largely reduce the processing
cycles. The experimental results show the proposed architecture is able to reduce
up to 80% of total scanning cycles on average.
33. Pablo Montero, Javier Taibo Gulias, Samuel Rivas (2010), “Parallel
Zigzag Scanning and Huffman Coding for a GPU-Based MPEG-2 Encoder”, IEEE
International Symposium on multimedia pp.97-104.
This work describes three approaches to compute the zigzag scan, run-
level, and Huffman codes in a GPU based MPEG-2 encoder. The most efficient
method exploits the parallel configuration used for DCT computation and
quantization in the GPU using the same threads to perform the last encoding steps:
zigzag scan and Huffman coding. In the experimental results, the optimized
version averaged a 10% reduction of the compression time, including the
transference to the CPU.
34. Pei-Yin Chen, Yi-Ming Lin, and Min-Yi Cho (2008), “An Efficient Design of Variable Length Decoder for MPEG-1/2/4”, IEEE Transactions on Multimedia, Vol.16, Issue.9, pp.1307-1315.
The authors propose an area-efficient variable length decoder (VLD) for
MPEG-1/2/4. They employed an efficient clustering-merging technique to reduce
both the size of a single LUT and the total number of LUTs required for MPEG-
1/2/4, rather than using one dedicated lookup table for carrying out variable length coding. Synthesis results show that the VLD occupies 10666 gate counts and operates at 125 MHz using standard cells from Artisan TSMC's 0.18µm process. The proposed design outperforms other VLDs with less hardware cost.
35. Basant K. Mohanty and Pramod K. Meher (2010), “Parallel and Pipelined
Architectures for High Throughput Computation of Multilevel 3-D DWT”.
In this paper, the authors present a throughput-scalable parallel and pipelined architecture for high-throughput computation of the multilevel 3-D DWT. The computation of the 3-D DWT for each level of decomposition is split into three distinct stages, and all three stages are implemented in parallel by a processing unit consisting of an array of processing modules. The throughput rate of the proposed structure can easily be scaled, without increasing the on-chip storage and frame-memory, by using more processing modules, and it provides a greater advantage over the existing designs for higher frame-rates and higher input block-sizes. The full-parallel implementation of the proposed scalable structure provides the best performance.
36. Anirban Das, Anindya Hazra and Swapna Banerjee (2010), “An Efficient Architecture for 3-D Discrete Wavelet Transform (DWT)”.
An architecture for the 3-D DWT, a powerful image compression algorithm, is implemented using a lifting based approach. This architecture enjoys reduced memory referencing, correspondingly low power consumption, low latency and high throughput. The lifting based technique is carried out in two stages, namely a spatial transform stage and a temporal transform stage.
37. Michael Weeks and Magdy A Bayoumi (2002), “Three-Dimensional
Discrete Wavelet Transform Architectures”.
Here two different architectures for the Three-Dimensional Discrete Wavelet Transform are implemented: the 3DW-I and the 3DW-II. The first
architecture (3DW-I) is based on folding, whereas the 3DW-II architecture is
block-based. The 3DW-I architecture is an implementation of the 3-D DWT
similar to folded 1-D and 2-D designs. It allows even distribution of the processing
load onto 3 sets of filters, with each set performing the calculations for one
dimension. The control for this design is very simple, since the data are operated
on in a row-column-slice fashion. The 3DW-II architecture uses block inputs to
reduce the requirement of on-chip memory. It has a central control unit to select
which coefficients to pass on to the lowpass and highpass filters. Finally, the
3DW-I and 3DW-II architectures are compared according to memory
requirements, number of clock cycles, and processing of frames per second.
38. B. Das and Swapna Banerjee (2002), “Low power architecture of running
3-D wavelet transform for medical imaging application”.
In this paper a real-time 3-D DWT algorithm and its architecture realization
is proposed. A reduced buffer and low wait-time are the salient features which make it fit for bidirectional videoconferencing, mostly in real-time biomedical applications. Reduced hardware complexity and 100% hardware utilization are ensured in this design. The architecture is implemented in 0.25µ BiCMOS technology.
39. B.Das and Swapna Banerjee (2003), “A Memory Efficient 3-D DWT
Architecture”.
This paper proposes a memory efficient real-time 3-D DWT algorithm and its architectural implementation. Parallelism, an added advantage for fast processing, has been used with three pipelined stages in this architecture. The architecture proposed here is memory efficient and has a high throughput rate of one result per clock, with a low latency period. Daubechies wavelet filters are used for coefficient mapping, exploiting the correlation between the low pass and high pass filters. The 3-D DWT has been implemented for the 8-tap Daubechies filter; however, this algorithm can be extended to any number of frames at the cost of wait time. The architecture requires a simple, regular data-flow pattern; thus, the control circuitry overhead reduces, making the circuit efficient for high speed, low power applications. An optimization between parallelism and the pipelined structure has been used for making the circuit applicable to low power domains.
40. Basant K. Mohanty and Pramod K. Meher (2008), “Concurrent Systolic
Architecture for High-Throughput Implementation of 3- Dimensional Discrete
Wavelet Transform”.
In this paper, the authors present a novel systolic architecture for high-throughput computation of the 3-dimensional (3-D) discrete wavelet transform (DWT). The entire 3-D DWT computation is decomposed into three distinct stages and
implemented concurrently in a linear array of fully pipelined processing elements
(PE). The proposed structure for the 3-D DWT provides higher throughput than the existing architecture, involves nearly half or fewer of the multipliers and adders, and uses less on-chip memory (when normalized for unit throughput rate) than the other. The systolic design involves on-chip and off-chip storage of size O(MKN) and O(N²), respectively, and computes the 3-D DWT of input data of size (M × N × N) in approximately (MN²)/7 cycles, where K is the row and column size of the subband filter coefficient matrix, (N×N) is the size of each frame and M is the frame rate of the video input.
Higher computational throughput rate is achieved in this architecture which
does not involve any off-chip memory and involves either the same or less on-chip
memory than the existing design.
41. Basant K. Mohanty and Pramod K. Meher (2011), “Memory-Efficient
Architecture for 3-D DWT Using Overlapped Grouping of Frames”.
In this paper the authors present a memory efficient architecture for the 3-D DWT using overlapped grouping of frames (GOFs). It involves only a frame-buffer of size O(MN) to compute the multilevel 3-D DWT, unlike the existing folded structures, which involve a frame-buffer of size O(MNR).
two types of hardware components: (i) combinational component and (ii)
memory/storage component. The combinational component consists mainly of
arithmetic circuits and multiplexors; and the memory component consists of a
frame-memory, temporal-memory, registers and transposition-memory. Frame-
memory is usually external to the chip, while temporal-memory may either be on-
chip or external.
Due to less memory complexity, this architecture dissipates significantly
less dynamic power than the existing structures. It can compute multilevel running 3-D DWT on an infinite number of GOFs and involves much less memory and fewer resources than the existing designs. It could, therefore, be used for high-performance video processing applications.
42. Erdal Oruklu, Sonali Maharishi and Jafar Saniie (2007), “Analysis of
Ultrasonic 3-D Image Compression Using Non-Uniform, Separable Wavelet
Transforms”.
Here Discrete Wavelet Transform (DWT) is used for compression of 3-D
ultrasound data. Different wavelet kernels are analyzed and benchmarked for
compression of experimental signals. In order to reduce computational complexity,
non-uniform DWT method is utilized where different wavelet filters are applied to
ultrasonic axial resolution and spatial resolutions. The axial resolution contains more information than the spatial resolutions; therefore simple wavelet filters such as Haar can be used for the spatial resolutions, reducing the computations significantly. A thresholding concept is also used, in which a threshold is applied to the transform coefficients of the original ultrasonic signal for data compression.
CHAPTER 3
VLSI DESIGN FLOW AND LOW POWER VLSI DESIGN
3.1 INTRODUCTION
The previous chapter presented the literature review relevant to the present research, in which the present researcher went through the different approaches presented by different authors to implement image compression with a low power VLSI design approach. In this chapter, the author discusses the VLSI design flow with the concepts of verification, i.e. how to adopt linting and code coverage, synthesis of the design with low power concepts, and finally the physical design. The author also discusses the need for low power in VLSI design and the different low power techniques applied in VLSI chip design.
The complexity of Very Large Scale Integrated circuits (VLSI) being
designed and used today makes the manual approach to design impractical. Design
automation is the order of the day. With the rapid technological developments in
the last two decades, the status of VLSI technology is characterized by the
following:
A steady increase in the size and hence the functionality of the ICs.
A steady reduction in feature size, and hence an increase in the speed of operation as well as in gate or transistor density.
A steady improvement in the predictability of circuit behavior.
A steady increase in the variety and size of software tools for VLSI design.
The above developments have resulted in a proliferation of approaches to
VLSI design. An abstraction based model is the basis of the automated design.
3.2 ASIC DESIGN FLOW
As with any other technical activity, development of an Application Specific Integrated Circuit (ASIC) starts with an idea and takes tangible shape through the stages of development shown in Figure 3.1; Figure 3.2 shows the details of the design flow. The first step in the process is to expand the idea in terms of the behavior of the target circuit. Through stages of programming, this is fully developed into a design description in terms of well defined standard constructs and conventions.
Figure 3.1 Major activities in ASIC design
The design is tested through a simulation process to check, verify, and ensure that what is described is what is wanted. Simulation is carried out through dedicated simulation tools. With every simulation run, the simulation results are studied to identify errors in the design description. The errors are corrected and another simulation run is carried out. Simulation and changes to the design description together form a cyclic iterative process, repeated until an error-free design is evolved.
Design description is an activity independent of the target technology or manufacturer. It results in a description of the digital circuit. To translate it into a tangible circuit, one goes through the physical design process, which constitutes a set of activities closely linked to the manufacturer and the target technology.
3.3 DESIGN DESCRIPTION
Figure 3.2 ASIC design and development flow
Figure 3.2 describes the details of the design flow, which is carried out in different stages. The process of transforming the idea into a detailed circuit description in terms of elementary circuit components constitutes design description. The final circuit of such an IC can have up to a billion such components; it is arrived at in a step-by-step manner.
The first step in evolving the design description is to describe the circuit in terms of its behavior. The description looks like a program in a high level language like C. Once the behavioral level design description is ready, it is tested extensively with the help of a simulation tool, which checks and confirms that all the expected functions are carried out satisfactorily. If necessary, this behavioral level routine is edited and modified. Finally, one has a design for the expected system described at the behavioral level. The behavioral design forms the input to the synthesis tools for circuit synthesis. The behavioral constructs not supported by the synthesis tools are replaced by data flow and gate level constructs. To summarize, the designer has to develop synthesizable code for the design.
The design at the behavioral level is to be elaborated in terms of known and
acknowledged functional blocks. It forms the next detailed level of design
description. Once again the design is to be tested through simulation and iteratively
corrected for errors. The elaboration can be continued one or two steps further. It
leads to a detailed design description in terms of logic gates and transistor
switches.
3.4 DESIGN OPTIMIZATION
The circuit at the gate level, in terms of gates and flip-flops, can be redundant in nature. It can be minimized with the help of minimization tools. The minimized logical design is converted to a circuit in terms of switch level cells from the standard libraries provided by the foundries; in the proposed work the designer has used both 90nm and 65nm standard cells. The cell based design generated by the tool is the last step in the logical design process, and it forms the input to the first level of physical design.
3.5 BEHAVIORAL SIMULATION
The design descriptions are tested for their functionality at every level: behavioral, data flow, and gate. One has to check here whether all the functions are carried out as expected, and rectify them if not. All such activities are carried out by the verification tool. The tool also has an editor to carry out any corrections to the source code. Simulation involves testing the design for all its functions, functional sequences, timing constraints, and specifications. Normally, testing and simulation at all levels, from behavioral to switch level, are carried out by a single tool. Figure 3.3 shows the RTL verification flow with linting.
3.5.1 Specification for the Design
During this stage, all the sub systems required to meet the specifications are captured in a block diagram. These block diagrams are at the system level, realized at a higher level of abstraction to obtain sustainable performance, and are observed at different transaction levels.
3.5.2 Behavioral or RTL Design
All the subsystems and sub blocks of the design are coded in an HDL, either Verilog or VHDL. These sub-systems may also be obtained in the form of Intellectual Property (IP), since many optimized and efficient RTL codes are available from previous designs within the company or from IP vendors. All of these are put together to create a verification environment to verify the total functionality. Verification is done with many techniques and is based on the design stage.
3.6 VERIFICATION OF THE DESIGN
Verification is carried out with simulators and also with emulation as well as FPGA prototyping. Initial verification is carried out using industry standard simulators, predominantly event based simulators. The verification is done in different stages, as shown in Figure 3.3. Simulation is software based, is less expensive and can be done with a quick setup time. Emulation is hardware based verification; it can be carried out when the full RTL code/design is available, but it is very expensive and also time consuming for the initial setup. The FPGA based approach is efficient, but again it can be carried out only when the full RTL code/design is available, and additional resources and skill sets are required.
In general, Register Transfer Level (RTL) simulation and verification is one of the initial steps to be done. This step ensures that the design is logically correct and without major timing errors. It is advantageous to perform this step, especially in the early stages of the design, because long synthesis and place-and-route times can be avoided when an error is discovered at this stage. This step also eliminates all the syntactical errors from the HDL code. Simulation tools are used to perform RTL verification. More specifically, all simulators first check the correctness of the code with compilation (an HDL parser); secondly, they elaborate the design, the stage in which the actual implementation of the hardware structure is done, specific to the platform (OS). The elaborated code is then loaded into the simulator along with the test stimuli, and all input and output waveforms can be observed.
Figure 3.3 RTL Verification flow with Linting and Code coverage
As the verification continues, the quality of the tests needs to be checked, so the adoption of newer techniques is required. This improvement in test quality is provided by code coverage tools, which supply statistics about the test coverage achieved. Code coverage is a process of validating or finding the quality of the test bench for the RTL code of a particular design. It is a measurement which tells how well the design has been exercised by the test bench / test cases. Code coverage is applied to the DUT (Design Under Test) to check how thoroughly the HDL is exercised by the test suite. Code coverage points out the portions of code which were not exercised, and also points out corner/edge cases and unused/dead code.
Before simulation, certain checks are done to clean up the RTL. In general,
lint tools flag suspicious and non-portable/non-synthesizable usage of the
Verilog/VHDL language and point out code that is likely to contain bugs. In the
industry, lint tools are also referred to as design rule checkers. They check the
cleanliness and portability of the HDL code against the IEEE standards. A compiler
usually does not show the errors and warnings that are detected by lint tools, which
gives linting many advantages.
Many standards and lint check rules are available. Some example rule categories
are listed below, followed by a short sketch of code that such rules typically flag:
1. Coding style
2. DFT rule checks
3. Design style
4. Language constructs for synthesizability
5. Race condition detection
6. Combinational logic loops
7. Custom rule checking
8. Documentation to select/deselect the rules/standards during lint checking
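As an illustration (a minimal sketch, not taken from the thesis RTL), the following
Verilog fragment contains two constructs that typical lint tools flag: an incomplete
sensitivity list and an unintended inferred latch.

module lint_example (input a, b, sel,
                     output reg y, q);
  // Lint finding 1: 'b' is missing from the sensitivity list, so
  // simulation and synthesis can disagree; always @(*) is the usual fix.
  always @(a or sel)
    y = sel ? a : b;
  // Lint finding 2: a combinational 'if' without an 'else' branch, so a
  // level-sensitive latch is inferred for 'q'.
  always @(*)
    if (sel)
      q = a;
endmodule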
3.7 SYNTHESIS OF THE DESIGN
With the availability of design at the gate (switch) level, the logical design
is complete. The corresponding circuit hardware realization is carried out by a
synthesis tool. The synthesis flow with Low power approach is shown in Figure
3.4.
Figure 3.4 Synthesis flow with Low power and UPF
There are two common approaches used in the synthesis of VLSI systems.
In the first, the circuit is realized through an FPGA. The gate level design
description is the starting point for the synthesis. The FPGA vendors provide an
interface to the synthesis tool, through which the gate level design is realized as a
final circuit. With many synthesis tools, one can directly use the design description
at the data flow level itself to realize the final circuit through an FPGA. The FPGA
route is attractive for limited volume production or a fast development cycle.
In the second approach, the circuit is realized as an ASIC. A typical ASIC
vendor will have a standard library for a particular technology, containing basic
components like elementary gates and flip-flops. Eventually the circuit is realized
by selecting such components and interconnecting them conforming to the required
design; this constitutes the physical design. Being an elaborate and costly process,
a physical design may call for an intermediate functional verification through the
FPGA route: the circuit realized through the FPGA is tested as a prototype,
providing another opportunity to test the design closer to the final circuit. The
present researcher's proposed work was synthesized using the ASIC approach, by
selecting the different standard cells.
The automated process of converting HDL code into a technology-specific
structural/gate-level netlist using EDA tools is called synthesis. Synthesis requires
the RTL code, a constraints file (design constraints and timing constraints), a
delay/parasitic file, and a technology-specific timing library in the industry de facto
standard Liberty format, commonly called the .lib file. The synthesis engine has a
parser to understand the HDL code, specifically the 1995 and 2001 standards. It
also has an inbuilt STA (static timing analysis) engine to calculate timing and
synthesize for the required frequency/speed.
The outputs of the synthesis engine are the following.
1. Netlist
2. Delay file
3. SDC file (Synopsys design constraints)
Reports:
1. Timing
2. Area
3. Gate
4. Gated clock
5. Power
6. Generated clock
Output files like the technology-specific netlist and the SDC file will be the
inputs for the physical design.
3.8 VLSI PHYSICAL DESIGN
A fully tested and error-free design at the switch level can be the starting
point for a physical design. The hierarchy of the physical design is shown in
Figure 3.5, and the complete flow is shown in Figure 3.6. The physical design is to
be realized as the final circuit using (typically) millions of components from the
foundry's library. The step-by-step activities in the process are described briefly
as follows.
Figure 3.5 VLSI Physical Design Hierarchy
3.8.1 Floor Planning
Floor-planning means placing the IPs (standard cells, blocks) based on their
connectivity, placing the memories, creating the pad-ring, and placing the pads
(signal/power/transfer cells to switch between voltage domains, and corner pads)
with proper accessibility for routing, while meeting the simultaneous switching
noise requirements so that switching of a high-speed bus does not create any
noise-related activity. An optimized floor plan achieves effective utilization of the
target chip.
3.8.2 Placement
The selected components from the ASIC library are placed in position on
the “Silicon floor”. It is done with each of the blocks used in the design. During
the placement, rows are cut, blockages are created where the tool is prevented from
placing the cells, and then the physical placement of the cells is performed based
on the timing/area requirements. The power-grid is built to meet the power targets
of the Chip.
3.8.3 Routing
The components placed as described above are to be interconnected to the
rest of the block. It is done with each of the blocks by suitably routing the
interconnects. Once the routing is complete, the physical design is taken as
complete. The final mask for the design can be made at this stage and the ASIC
manufactured in the foundry. At first, global routing is performed to check
placement legalization and routability with the present placement, without routing
congestion. Detailed routing is performed after clock tree synthesis; it is the actual
routing between all the cells/modules and blocks, with delay calculation and static
timing analysis. In a timing-driven methodology, the router is run together with
static timing analysis and all possible delay calculations. Finally, STA (Static
Timing Analysis) is performed with the SPEF file and the routed netlist file, to
check whether the design meets the timing requirements.
Figure 3.6 VLSI Physical Design Flow
3.8.4 Physical Verification
This stage involves checking the design for all manufacturability and
fabrication requirements.
1. DRC
2. ERC
3. LVS
To perform DRC (Design Rule Check), verification is done with a standard
rule file called a runset, which has all the physical rules pertaining to the specific
technology library. It is provided by the foundry to confirm that the design meets
the fabrication requirements.
ERC (Electrical Rule Checking) is performed to confirm that the design meets the
ERC requirements, i.e., to check for any open circuits, short circuits, or floating
nets in the layout.
One of the important stages in physical design is the LVS (Layout vs.
Schematic) check; this part of the verification compares the routed netlist against
the synthesized netlist to confirm that the two match.
For effective timing closure, a separate Static Timing Analysis needs to be run at
every stage to verify the signal integrity of the chip. STA is important because
signal-integrity effects can cause cross-talk delay and cross-talk noise, affecting
both the functionality and the timing of the design.
3.9 POST LAYOUT SIMULATION
Once the placement and routing are completed, the performance
specifications like silicon area, power consumed, path delays, etc., can be
computed. An equivalent circuit can be extracted at the component level and
performance analysis carried out. This constitutes the final stage, called
“verification.” One may have to go through the placement and routing activity
once again to improve performance.
3.10 LOW POWER VLSI DESIGN
Historically, VLSI designers have used circuit speed as the "performance"
metric. As design sizes grew, the demands on performance and silicon area
increased, and with them the power consumption; as the number of integrated
devices increased, the demand for power also increased exponentially. In fact,
power considerations have long been the ultimate design criterion in portable
applications such as mobile phones, music systems like MP3 and MP4 players,
wristwatches, and pacemakers. The objective in these applications is minimum
power for maximum battery life time. In almost all recent applications, power
dissipation is becoming an important constraint in large-scale integration design.
These low power requirements continue with other battery-powered systems such
as laptops, notebooks, digital readers, digital cameras, electronic organizers, etc.
In general, power reduction can be implemented at different levels of design
abstraction:
1. System,
2. Architectural,
3. Gate,
4. Circuit and
5. Technology level.
At the system level, inactive modules may be turned off to save power. At
the architectural level, parallel hardware may be used to reduce global interconnect
and allow a reduction in supply voltage without degrading system throughput.
Clock gating is commonly used at the gate level. Many design techniques can be
used at the circuit level to reduce both dynamic and static power. For a design
specification, designers have many choices to make at different levels of
abstraction. Based on particular design requirement and constraints (such as
power, performance, cost), the designer can select a particular algorithm,
architecture and determine various parameters such as supply voltage and clock
frequency. This multi-dimensional design space offers a wide range of possible
trade-offs. The most effective design decisions derive from choosing and
optimizing architectures and algorithms at those levels.
Since CMOS power consumption is proportional to the clock frequency,
dynamically turning off the clock to unused logic or peripherals is an obvious way
to reduce power consumption. Control can be done at the hardware level, or it can
be managed by the operating system or application software. For example, some
systems and hardware devices have sleep or idle modes. Typically, in these modes,
the clocks to most of the sections are turned off to reduce power consumption. In
sleep mode the device is not working, and a wake-up event rouses the device from
sleep mode. Devices may require different amounts of time to wake up from
different sleep modes.
3.11 SOURCES OF POWER DISSIPATION
Reduction of power consumption makes a device more reliable. The need
for devices that consume a minimum amount of power was a major driving force
behind the development of CMOS technologies. As a result, CMOS devices are
best known for low power consumption.
Two components determine the power consumption in a CMOS circuit:
Static power consumption
Dynamic power consumption
CMOS devices have very low static power consumption, which is the result
of leakage current. This power consumption occurs when all inputs are held at
some valid logic level and the circuit is not in charging states. But, when switching
at a high frequency, dynamic power consumption can contribute significantly to
overall power consumption. Charging and discharging a capacitive output load
further increases this dynamic power consumption.
Figure 3.7 CMOS Circuit in sub threshold region
Power dissipation in CMOS devices or circuits involves both static and
dynamic power dissipations. In the deep submicron technologies as shown in
figure 3.7, the static power dissipation, caused by leakage currents and sub
threshold currents contribute a considerably small percentage to the total power
consumption, while the dynamic power dissipation, resulting from charging and
discharging of parasitic capacitive loads of interconnects and devices dominates
the overall power consumption. But as technologies scale down into the deep
submicron regime, static power dissipation becomes more dominant relative to the
dynamic power consumption. The aggressive scaling of device dimensions and the
reduction of supply voltages reduce the power consumption of the individual
transistors, but the exponential increase of operating frequencies results in a steady
increase of the total power consumption.
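In standard first-order notation (a textbook summary consistent with equation 3.1
given later, not a formula from this thesis), the total power can be written as

\[
P_{total} \;=\; \underbrace{\alpha\, C_L V_{DD}^{2} f \;+\; V_{DD} I_{sc}}_{\text{dynamic}} \;+\; \underbrace{V_{DD} I_{leak}}_{\text{static}}
\]

where \(\alpha\) is the switching activity factor, \(C_L\) the switched capacitance,
\(f\) the clock frequency, \(I_{sc}\) the short-circuit current during transitions,
and \(I_{leak}\) the leakage current.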
3.12 STATIC POWER CONSUMPTION
Typically, all low-voltage devices have a CMOS inverter in the input and
output stage. Therefore, for a clear understanding of static power consumption,
refer to the CMOS inverter modes shown in Figure 3.8.
Figure 3.8 CMOS Inverter mode for Static power consumption
As shown in Figure 3.8, if the input is at logic 0, the n-MOS device is OFF,
and the p-MOS device is ON. The output voltage is VCC, or logic 1. Similarly,
when the input is at logic 1, the associated n-MOS device is biased ON and the p-
MOS device is OFF. The output voltage is GND, or logic 0. Note that one of the
transistors is always OFF when the gate is in either of these logic states. Since no
current flows into the gate terminal, and there is no DC current path from VCC to
GND, the resultant quiescent (steady-state) current is zero; hence, static power
consumption is zero. However, there is a small amount of static power
consumption due to reverse-bias leakage between diffused regions and the
substrate. This leakage current leads to the static power dissipation.
3.13 DYNAMIC POWER DISSIPATION
Dynamic power consumption is basically the result of charging and
discharging capacitances. It can be broken down into three fundamental
components, which are:
1. Load capacitance transient dissipation
2. Internal capacitance transient dissipation
3. Current spiking during switching
3.14 LOAD CAPACITANCE TRANSIENT DISSIPATION
The first contributor to power consumption is the charging and discharging
of external load capacitances. Figure 3.9 is a schematic diagram of a simple CMOS
inverter driving a capacitive load. A simple expression for power dissipation as a
function of load capacitance is given by the equation 3.1.
Figure 3.9 Simple CMOS Inverter Driving a Capacitive External Load
\( P_D = C_L V_{CC}^{2} f \)  ….. (3.1)
Dynamic switching current is used in charging and discharging the circuit load
capacitance, which is composed of gate and interconnect capacitance. The greater
this dynamic switching current is, the faster the capacitive loads can be charged
and discharged, and the better the circuit will perform.
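As a quick worked example of equation 3.1 (illustrative numbers, not measurements
from this work), with \(C_L = 10\,\mathrm{pF}\), \(V_{CC} = 1.2\,\mathrm{V}\) and
\(f = 100\,\mathrm{MHz}\),

\[
P_D = C_L V_{CC}^{2} f = (10\times 10^{-12})(1.2)^{2}(100\times 10^{6}) \approx 1.44\,\mathrm{mW}.
\]

The quadratic dependence on \(V_{CC}\) is why supply-voltage reduction is the
single most effective lever on this term.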
3.14.1 Internal Capacitance Transient Dissipation
Internal capacitance transient dissipation is similar to load capacitance
dissipation, except that the internal parasitic “on-chip” capacitance is being
charged and discharged. Figure 3.10 is a circuit diagram of the parasitic nodal
capacitances associated with two CMOS inverters.
Figure 3.10 Parasitic Internal Capacitors Associated with Two Inverters
C1 and C2 are capacitances associated with the overlap of the gate area and
the source and channel regions of the P and N-channel transistors, respectively. C3
is due to the overlap of the gate and the drain (output), and is known as the Miller
capacitance. C4 and C5 are capacitances of the parasitic diodes from the output to
VCC and ground, respectively. Thus the total internal capacitance seen by inverter
1 driving inverter 2 is given by the equation 3.2.
\( C_L = C_1 + C_2 + C_3 + C_4 + C_5 \)  ….. (3.2)
Since an internal capacitance may be treated identically to an external load
capacitor for power consumption calculations, the same equation 3.1 can be used.
3.14.2 Current Spiking During Switching
The final contributor to power consumption is current spiking during
switching. While the input to a gate is making a transition between logic levels,
both the P-and N-channel transistors are turned partially on. This creates a low
impedance path for supply current to flow from VCC to ground, as illustrated in
Figure 3.11.
Figure 3.11 Equivalent schematic of a CMOS inverter whose
input is between logic levels
For fast input rise and fall times (shorter than 50 ns), the resulting power
consumption is frequency dependent. This is due to the fact that the more often a
device is switched, the more often the input is situated between logic levels,
causing both transistors to be partially turned on. Since this power consumption is
proportional to input frequency and specific to a given device in any application,
as is CL, it can be combined with CL. The resulting term is called “CPD”, the no-
load power dissipation capacitance.
Switching power is the dominant source of power drain on a CMOS chip.
Each time a transistor switches state from 1 to 0 and 0 to 1, it either pumps charge
into a capacitor or drains the capacitor to ground. In a full cycle, charge is taken
from the rail and pumped to ground. This produces a current, which, multiplied by
the rail voltage, equals power. If the CMOS transistor does not switch as often, less
charge moves and the power is lower. If the switching is controlled, the activity of
the transistors invariably reduces; this is an excellent way to limit power.
From all the above discussion, it is quite clear that significant work lies
ahead for reducing and managing power dissipation in CMOS devices. With the
market desire to shrink size and put systems on chips, advances in circuit design
and materials are clearly required for managing this situation, since neither speed
nor size is helping power reduction.
3.15 POWER REDUCTION TECHNIQUES IN VLSI DESIGN
There are many methods of achieving low power consumption. Systems
designers have developed several techniques to save power at the logic,
architecture, and systems levels.
3.15.1 Clock Gating
A significant amount of power in a chip is in the distribution network of the
clock. Clock buffers consume more than 50% of the dynamic power. This is
because these buffers have the highest toggle rate in the system. Also the flops
receiving the clock dissipate dynamic power even when the input and output
remain the same.
In Clock Gating method of power reduction, clocks are turned off when
they are not required. Modern design tools support automatic clock gating. These
tools identify circuits where clock gating can be inserted without changing the
functionality. Figure 3.12 shows the different clock gating scheme to reduce the
power consumption in VLSI chip design. This clock gating method also results in
an area saving, because a single clock gating cell takes the place of multiple
multiplexers. The two schemes shown are multiplexer based clock gating and the
latch based clocking approach.
Figure 3.12 Different Clock gating schemes
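A minimal Verilog sketch of the latch based clock gating approach of Figure 3.12
is given below. The signal names are illustrative; production flows normally infer
or instantiate the library's integrated clock gating (ICG) cell rather than
hand-coded logic.

module clock_gate_latch (
  input  wire clk,      // free-running clock
  input  wire enable,   // functional enable from the clock-gating logic
  input  wire test_en,  // scan/test override keeps the clock on during DFT
  output wire gclk      // gated clock driven to the idle block
);
  reg en_latched;
  // Latch the enable while the clock is low so the AND gate below cannot
  // produce glitches on the gated clock.
  always @(clk or enable or test_en)
    if (!clk)
      en_latched = enable | test_en;
  assign gclk = clk & en_latched;
endmodule

When enable is low, gclk stays low and the flops downstream stop toggling,
removing their share of the dynamic power.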
3.15.2 Asynchronous Logic
Proponents of asynchronous logic have pointed out that, because their
systems do not have a clock, they save the considerable power that a clock tree
requires. However, asynchronous logic design suffers from the drawback of
generating the completion signals. This requirement means that additional logic
must be used at each register transfer and, in some cases, a double-rail
implementation, which can increase the amount of logic and wiring. Other
drawbacks include testing difficulty and an
absence of design tools. Further, the asynchronous designer works at a
disadvantage because today’s design tools are geared for synchronous design.
Ultimately, asynchronous design does not offer sufficient advantages to merit a
wholesale switch from synchronous designs. However, asynchronous techniques
can play an important role in globally asynchronous, locally synchronous systems.
Such systems reduce clock power and help with the growing problem of clock
skew across a large chip, while allowing the use of conventional design techniques
for most of the chip.
3.15.3 Multi-Vdd Techniques
The different multi-voltage strategies for low power used in VLSI chip design are:
Static Voltage Scaling (SVS): different supply voltages for different blocks of the
system.
Multi-level Voltage Scaling (MVS): an extension of SVS; each block or subsystem
can be supplied with different voltages (discrete, fixed in number, dependent on
the mode).
Dynamic Voltage and Frequency Scaling (DVFS): an extension of MVS; a large
number of voltage levels are switched dynamically based on the workload.
Adaptive Voltage Scaling (AVS): an extension of DVFS; the voltage is adjusted
with the help of a control loop.
3.15.4 Architectural Level
Power reduction at the architectural level includes exploiting parallelism
and using speculation. Parallelism can reduce power, whereas speculation can
permit computations to proceed beyond dependent instructions that may not have
completed. If the speculation is wrong, executing useless instructions can waste
energy. But this is not necessarily so. Branch prediction is perhaps the best-known
example of speculation. New architectural ideas can contribute most profitably to
reducing the dynamic power consumption term, specifically the activity factor.
The activity factor accounts for the number of bit transitions per unit time.
The architect or systems designer can do little to limit leakage except shut
down the memory. This is only practical if the memory will remain unused for a
long time.
SUMMARY
In this chapter, the VLSI design flow for the chip design is discussed in
detail. This describes how the digital front end design with the RTL code is
performed. The simulation of the design with verification using linting and code
coverage is also discussed. The VLSI design flow describes how to generate the
gate level netlist from the HDL code by performing the synthesis of the design.
The physical design describes the floor planning, placement and routing of the
design on the silicon core. This chapter also describes why low power is required
in any chip design, the different sources of power consumption, and the different
methodologies to reduce power consumption in VLSI chip design.
CHAPTER 4
LOW POWER VLSI ARCHITECTURE FOR DISCRETE COSINE TRANSFORM
4.1 INTRODUCTION
In the previous chapter, the basic concepts of the VLSI design flow, the
different low power techniques in VLSI design, and the low power techniques
adopted in the present work were described at length.
In this chapter, the present researcher describes the proposed new
architectural design of the low power two dimensional Discrete Cosine Transform
core, and the implementation of the proposed architecture is presented in detail.
Compression reduces the volume of data to be transmitted (text, fax, and
images) and also reduces the bandwidth required for transmission and the storage
requirements of speech, audio, and video, which is much needed. In a digital
image, neighboring samples on a scanning line are normally similar; this is spatial
redundancy. Spatial frequency is the rate of change of magnitude as one traverses
the image matrix. Useful image content changes relatively slowly across the
image, so much of the information in an image is repeated; hence the spatial
redundancy. The human eye is less sensitive to the higher spatial frequency
components than to the lower frequency components.
4.2 DCT MODULE
In signal processing applications, the DCT is the most widely used
transform after the discrete Fourier transform. The DCT and IDCT are important
components in many picture compression and decompression standards, including
H.263, MPEG, HDTV and JPEG. The applications for these standards range from
still pictures on the Internet to low quality videophones to high definition
television. The DCT and IDCT also have applications in such wide ranging areas
as filtering, speech coding and pattern recognition [56].
The Discrete Cosine Transform is the most complex operation that needs to
be performed in the baseline JPEG process. This subsection starts with an
introduction to our chosen DCT architecture, followed by a detailed mathematical
explanation of the principles involved. Our implementation of the Discrete Cosine
Transform stage is based on a vector processing architecture. Our choice of this
particular architecture was due to a multiple reasons. The design uses a concurrent
architecture that incorporates distributed arithmetic [94] and a memory oriented
structure to achieve high speed, low power, high accuracy, and efficient hardware
realization of the 2-D DCT.
4.2.1 Mathematical Description of the DCT
The Discrete Cosine Transform is an orthogonal transform consisting of a
set of basis vectors that are sampled cosine functions. The 2-D DCT of a data
matrix is defined as
\( Y = C \cdot X \cdot C^{T} \)  ….. (4.1)
where X is the data matrix, C is the matrix of DCT coefficients, and \(C^{T}\) is
the transpose of C. The 2-D DCT (8 × 8 DCT) is implemented by the row-column
decomposition technique.
We first compute the 1-D DCT (8 × 1 DCT) of each column of the input
data matrix X to yield \(C^{T}X\). After appropriate rounding or truncation, the
transpose of the resulting matrix \(C^{T}X\) is stored in an intermediate memory.
We then compute another 1-D DCT (8 × 1 DCT) of each row of \(C^{T}X\) to yield
the desired 2-D DCT as defined in equation 4.1. The block diagram of the proposed
design is shown in Figure 4.1.
4.3 BLOCK DIAGRAM OF DCT CORE
The block diagram presented below describes the top level of the design.
The DCT core architecture is based on two 1-D DCT units connected through a
transpose matrix RAM. The transposition RAM is double buffered; that is, while
the 2nd DCT stage reads data out of transposition memory one, the 1st DCT stage
can populate the 2nd transposition memory with new data. This enables the
creation of a dual-stage global pipeline where every stage consists of a 1-D DCT
and a transposition memory. The 1-D DCT units are not internally pipelined; they
use parallel distributed arithmetic with butterfly computation to compute the DCT
values. Because of the parallel DA they need a considerable amount of ROM
memory to compute one DCT value in a single clock cycle. A design based on
distributed arithmetic does not use any multipliers for computing the MAC
(multiply and accumulate) operations; instead, it stores precomputed MAC results
in a ROM memory and fetches them as needed.
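The idea can be sketched in Verilog as follows (a minimal illustration with
placeholder coefficient sums, not the ROM contents of the present design): for a
four-term inner product, the 16 possible sums of the fixed coefficients are
precomputed, and one bit-plane of the four inputs addresses the ROM each clock.

module da_rom (
  input  wire [3:0]        bits,  // bit i of each of the four input samples
  output reg signed [15:0] psum   // precomputed b3*C3 + b2*C2 + b1*C1 + b0*C0
);
  // Placeholder coefficients: C0 = 23, C1 = 44, C2 = 60, C3 = 70.
  always @(*)
    case (bits)
      4'b0000: psum = 16'sd0;    4'b0001: psum = 16'sd23;
      4'b0010: psum = 16'sd44;   4'b0011: psum = 16'sd67;
      4'b0100: psum = 16'sd60;   4'b0101: psum = 16'sd83;
      4'b0110: psum = 16'sd104;  4'b0111: psum = 16'sd127;
      4'b1000: psum = 16'sd70;   4'b1001: psum = 16'sd93;
      4'b1010: psum = 16'sd114;  4'b1011: psum = 16'sd137;
      4'b1100: psum = 16'sd130;  4'b1101: psum = 16'sd153;
      4'b1110: psum = 16'sd174;  4'b1111: psum = 16'sd197;
    endcase
endmodule

The returned partial sum is shifted and accumulated over the bit-planes of the
inputs, so the whole multiply-accumulate needs only a ROM, a shifter, and an
adder; in a fully parallel form, one such ROM per bit-plane lets a complete value
be assembled in a single clock cycle.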
Figure 4.1 Block diagram of 2-D DCT Architecture
4.4 TWO DIMENSIONAL DCT CORE
This section describes the proposed architecture of the two dimensional
DCT. Figure 4.2, shown below, is the input/output schematic of the DCT core.
This core performs the forward 2-Dimensional Discrete Cosine Transform.
Figure 4.2 Top level schematic for DCT core
The DCT core is a two dimensional Discrete Cosine Transform
implementation designed for use in compression systems like JPEG. The
architecture is based on parallel distributed arithmetic with butterfly computation;
it is organized as a row/column decomposition where two 1-D DCT units are
connected through a transposition matrix memory. The core has a simple input
interface. The DCT core takes 8 bit input data and produces a 12 bit output using
12 bit DCT matrix coefficients. The 8 bit signed input pixels provide a 12 bit
signed output coefficient for the DCT: the data bits multiplied by 3 bits can result
in 11 bits, which with the sign bit gives a total of 12 bits.
Vector processing using parallel multipliers is one method of implementing
the DCT. The advantages of the vector processing method are a regular structure
[13, 93], simple control and interconnect, and a good balance between performance
and implementation complexity.
4.4.1 Behavioral Model for Vector Processing
The DCT is widely used in current international image and video coding
standards such as JPEG, the Motion Picture Experts Group standards, and the
H.261 and H.263 video-telephony coding schemes. As shown in Figure 4.1, the
8×8 2-D DCT can be implemented with two 1-D DCTs by the row-column
decomposition method. In the 8×1 row DCT, each column of the input data X is
computed and the outputs are stored in the transposition memory. Another 8×1
column DCT is performed to yield the desired 64 DCT coefficients [20]. The
output bit-width is one of the parameters that determine the quality of the image.
The outputs of the row DCT are truncated to N bits before they are stored in the
transposition memory and sent to the input of the column DCT.
The 8-point 1-D DCT transform is expressed as

\[
z(k) \;=\; \frac{c(k)}{2}\sum_{n=0}^{7} x(n)\cos\!\left(\frac{(2n+1)k\pi}{16}\right),
\qquad k = 0, 1, 2, \ldots, 7
\]
….. (4.2)

where \(c(k) = 1/\sqrt{2}\) for \(k = 0\) and \(c(k) = 1\) otherwise.
This equation can be represented in vector-matrix form as

\( z = T \cdot x^{T} \)  ….. (4.3)

where T is an 8 × 8 matrix whose elements are cosine functions, and x and z are
the row and column vectors, respectively. Since the 8 × 8 coefficient matrix T in
equation 4.3 is symmetrical, the 1-D DCT matrix can be rearranged and expressed
as
\[
\begin{bmatrix} z_0 \\ z_1 \\ z_2 \\ z_3 \\ z_4 \\ z_5 \\ z_6 \\ z_7 \end{bmatrix}
=
\begin{bmatrix}
d & d & d & d & d & d & d & d \\
a & c & e & g & -g & -e & -c & -a \\
b & f & -f & -b & -b & -f & f & b \\
c & -g & -a & -e & e & a & g & -c \\
d & -d & -d & d & d & -d & -d & d \\
e & -a & g & c & -c & -g & a & -e \\
f & -b & b & -f & -f & b & -b & f \\
g & -e & c & -a & a & -c & e & -g
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix}
\]
….. (4.4)

where \(c_k = \cos(k\pi/16)\) and \(a = c_1,\; b = c_2,\; c = c_3,\; d = c_4,\;
e = c_5,\; f = c_6,\; g = c_7\). Grouping the even- and odd-indexed outputs and
exploiting the (anti)symmetry of the rows gives

\[
\begin{bmatrix} z_0 \\ z_2 \\ z_4 \\ z_6 \end{bmatrix}
=
\begin{bmatrix}
d & d & d & d \\
b & f & -f & -b \\
d & -d & -d & d \\
f & -b & b & -f
\end{bmatrix}
\begin{bmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{bmatrix},
\qquad
\begin{bmatrix} z_1 \\ z_3 \\ z_5 \\ z_7 \end{bmatrix}
=
\begin{bmatrix}
a & c & e & g \\
c & -g & -a & -e \\
e & -a & g & c \\
g & -e & c & -a
\end{bmatrix}
\begin{bmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix}
\]
….. (4.5)
As equation 4.5 shows, the odd and even components of the DCT separate
cleanly, and the 8 × 8 DCT matrix multiplication can be expressed as additions of
vector scaling operations, which allows us to apply our proposed low complexity
design technique to the DCT implementation.
Figure 4.3 One Dimensional DCT Architecture
(The figure shows the input Xin[7:0] feeding a shift register; the registered pixels
xk0–xk7 are paired into four add/sub blocks selected by a toggle flip-flop,
multiplied by ROM-stored coefficients, and summed in a final adder to produce
Zk0–Zk7.)
The One Dimensional DCT implementation for the input pixels is shown in
Figure 4.3. A 1-D DCT is performed on the input pixels first. Its output is an
intermediate value that is stored in a RAM. The 2nd 1-D DCT operation is done
on this stored value to give the final 2-D DCT output dct_2d. The inputs are 8 bits
wide and the 2d-dct outputs are 9 bits wide. In the 1-Dimensional section the input
signals are taken one pixel at a time in the order x00 to x07, x10 to x17 and so on
up to x77. These inputs are fed into an 8 bit shift register. The outputs of the 8 bit
shift registers are registered by div8clk, which is the clock signal divided by 8.
This enables us to register in 8 pixels (one row) at a time. The pixels are paired up
in an adder/subtractor block in the order xk0:xk7, xk1:xk6, xk2:xk5, xk3:xk4. The
adder/subtractor is tied to the clock [20, 8, 49]; for every clock, the adder/subtractor
module alternately chooses addition and subtraction, the selection being made with
a toggle flip-flop. The output of the add/sub is fed into a multiplier whose other
input is connected to stored values in registers which act as memory. The outputs
of the 4 multipliers are added at every clock in the final adder. The outputs Zk(0-7)
of the adder are the 1-D DCT values, given out in the order in which the inputs
were read in.
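The add/sub stage with its toggle flip-flop can be sketched as follows (a minimal
illustration; the widths and names are assumptions, not the thesis RTL):

module addsub_stage (
  input  wire              clk, rst,
  input  wire signed [7:0] xa, xb,  // a pixel pair, e.g. xk0 and xk7
  output reg signed [8:0]  y        // alternates between xa+xb and xa-xb
);
  reg sel;                          // toggle flip-flop
  always @(posedge clk or posedge rst)
    if (rst) begin
      sel <= 1'b0;
      y   <= 9'sd0;
    end else begin
      sel <= ~sel;                  // alternate add / subtract each clock
      y   <= sel ? (xa - xb) : (xa + xb);
    end
endmodule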
4.4.2 Transpose Buffer
The outputs of the adder are stored in RAMs. When the WR signal is high,
a write operation takes place at the corresponding RAM address; otherwise, the
contents of the RAM address are read. The period of the address signals is 64
times that of the input clock. Two RAMs are used so that data writes can be
continuous. The 1st valid input for RAM1 is available at the 15th clock, so the
RAM1 enable is active after 15 clocks. After this the write operation continues for
64 clocks. At the 65th clock, since z_out is continuous, we get the next valid
z_out_00. This 2nd set of valid 1D-DCT coefficients is written into RAM2, which
is enabled at 15+64 clocks. So at the 65th clock, RAM1 goes into read mode for
the next 64 clocks while RAM2 is in write mode. After this, the read and write
roles switch between the 2 RAMs every 64 clocks.
After a RAM is enabled, data is written into it for 64 clock cycles, into
consecutive locations. After 64 locations have been written, RAM1 goes into read
mode and RAM2 goes into write mode, and the cycle then repeats. For either
RAM, data is written into consecutive locations but read out in a different order: if
data is assumed to be written one row at a time into an 8x8 matrix, it is read out
one column at a time. When RAM1 is full, the 2nd 1-D calculations can start.
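A minimal sketch of this ping-pong (double-buffered) transpose RAM is shown
below. The 12-bit width and the row/column address swap follow the 8x8 case
described in the text; the names and the single shared address counter are
assumptions.

module pingpong_transpose (
  input  wire        clk,
  input  wire        bank_sel,  // toggles every 64 clocks
  input  wire [5:0]  wr_addr,   // row-major write address {row, col}
  input  wire [11:0] wr_data,
  output wire [11:0] rd_data
);
  reg [11:0] ram0 [0:63];
  reg [11:0] ram1 [0:63];
  // Column-major read: swap the row and column fields of the counter.
  wire [5:0] rd_addr = {wr_addr[2:0], wr_addr[5:3]};
  always @(posedge clk)
    if (bank_sel) ram1[wr_addr] <= wr_data;  // write one bank...
    else          ram0[wr_addr] <= wr_data;
  assign rd_data = bank_sel ? ram0[rd_addr]  // ...while reading the other
                            : ram1[rd_addr];
endmodule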
4.4.3 Two Dimensional DCT architecture
The implementation of the 2-D DCT is shown in Figure 4.4. The second
1-Dimensional implementation is the same as the 1st 1-D implementation, with
the inputs now coming from either RAM1 or RAM2. Also, the inputs are read in
one column at a time, in the order z00 to z70, z01 to z71, up to z77. These inputs
are fed into an 8 bit shift register. The outputs of the 8 bit shift registers are
registered by div8clk, which is the clock signal divided by 8. This enables us to
register in 8 values (one column) at a time. The values are paired up in an
adder/subtractor block in the order Zk0:Zk7, Zk1:Zk6, Zk2:Zk5, Zk3:Zk4 [17].
The adder/subtractor is tied to the clock; for every clock, the adder/subtractor
module alternately chooses addition and subtraction, the selection being made
with the toggle flip-flop. The output of the add/sub is fed into a multiplier whose
other input is connected to stored values in registers which act as memory. The
outputs of the 4 multipliers are added at every clock in the final adder [96]. The
outputs from the adder in the 2nd section are the 2D-DCT coefficient values, given
out in the order in which the inputs were read in. When the 2-D DCT computation
is over, the rdy_out signal goes high, indicating valid output.
Figure 4.4 2-D DCT Architecture
4.5 INVERSE DISCRETE COSINE TRANSFORM
A 1-D IDCT, shown in Figure 4.5, is implemented on the input DCT values.
Its output, called the intermediate value, is stored in a RAM. The 2nd 1-D IDCT
operation is done on this stored value to give the final 2-D IDCT output idct_2d.
The inputs are 12 bits wide and the 2d-idct outputs are 8 bits wide. In the 1st 1-D
section, the input signals are taken one value at a time in the
order x00 to x07, x10 to x17 and so on up to x77. These inputs are fed into an 8 bit
shift register. The outputs of the 8 bit shift registers are registered at every 8th
clock [14, 93]. This enables us to register in 8 values (one row) at a time. The
values are fed into a multiplier whose other input is connected to stored values in
registers which act as memory. The outputs of the 8 multipliers are added at every
clock in the final adder. The output z_out of the adder is the 1-D IDCT values,
given out in the order in which the inputs were read in.
The 1-D IDCT values are first calculated and stored in a RAM. The second
1-D implementation is the same as the 1st, with the inputs now coming from the
RAM. Also, the inputs are read in one column at a time, in the order z00 to z70,
z01 to z71, up to z77. The outputs from the adder in the 2nd section are the
2D-IDCT coefficients. The 2-Dimensional IDCT block diagram is shown in
figure 4.6.
Figure 4.5 1-D IDCT Architecture
Figure 4.6 2-D IDCT Block diagram
4.5.1 Storage / RAM Section
The outputs z_out of the adder are stored in RAMs. Two RAMs are used so
that data writes can be continuous. The 1st valid input for RAM1 is available at
the 12th clock, so the RAM1 enable is active after 11 clocks. After this the write
operation continues for 64 clocks. At the 65th clock, since z_out is continuous, we
get the next valid z_out_00. This 2nd set of valid 1D-IDCT coefficients is written
into RAM2, which is enabled at 12+64 clocks. So at the 65th clock, RAM1 goes
into read mode for the next 64 clocks while RAM2 is in write mode. After this, the
read and write roles switch between the 2 RAMs every 64 clocks. The 2nd 1-D
IDCT section starts when RAM1 is full. The second 1-D implementation is the
same as the 1st, with the inputs now coming from either RAM1 or RAM2. Also,
the inputs are read in one column at a time, in the order z00 to z70, z01 to z71, up
to z77. The outputs from the adder in the 2nd section are the 2-D IDCT
coefficients.
4.5.2 IDCT Core Block Diagram
Figure 4.7 Top level schematic of IDCT Architecture
The schematics for the IDCT implementation are given in Figure 4.5, and
the block diagram in Figure 4.7. The 1D-IDCT values are first calculated and
stored in a RAM. The 2nd 1D-IDCT is done on the values stored in the RAM. For
each 1-D implementation, the input data are loaded into the first input of a
multiplier. The constant coefficient multiplication values are stored in a ROM and
fed into the 2nd input of the multiplier. At each clock, 8 input values are multiplied
with 8 constant coefficients. The outputs of the eight multipliers are then added
together to give the 1-D coefficient values, which are stored in the RAM. The
values stored in the intermediate RAM are read out one column at a time (i.e.,
every 8th value is read out every clock), and this is the input for the 2nd IDCT.
The rdy_in signal is used as a handshake signal between the DCT and the IDCT
functions [14]. When the signal is high, it indicates to the IDCT that there is valid
input to the core.
SUMMARY
This chapter describes the concepts involved in the ASIC implementation
of the 2-D DCT and IDCT cores for low power consumption. Because this
architecture uses row-column decomposition to find the 2-D DCT, the number of
computations for processing an 8x8 block of pixels is reduced. The 1-D DCT
operation is expressed as additions of vector-scalar products, and basic common
computations are identified and shared to reduce computational complexity
compared to other architectures. For low power, this architecture employs parallel
processing units that enable power savings through a reduction in clock speed. It
also saves power by disabling units that are not in use; when the units are in
standby mode, they consume a minimum amount of power. A common low power
design technique is to reorder the input data so that a minimum number of
transitions occur on the input data lines. The DCT architecture based on the
proposed low complexity vector-scalar design technique effectively removes
redundant computations, thus reducing the computational complexity of the DCT
operations. This DCT architecture shows lower power consumption, mainly due
to the efficient computation sharing.
CHAPTER 5
LOW POWER ARCHITECTURE FOR VARIABLE LENGTH ENCODING AND DECODING
5.1 INTRODUCTION
The previous chapter described the concepts involved in the design of the
low power DCT and IDCT architectures as implemented in the present work. In
this chapter, a novel VLSI algorithm is presented for low power Variable Length
Encoding and Decoding for image compression applications.
Several designs have been proposed for the variable length coding method
in the past decades, and almost all of them concentrated on a high performance
variable length coding process, i.e., their main concern was to make variable
length coding faster. These designs were less concerned about power consumption,
the only intention being high performance. But with the increasing demand for
portable multimedia systems like mobiles, iPods, etc., the demand for low power
multimedia systems is growing, making designers concentrate on low power
approaches along with high performance.
In this thesis, some low power approaches have been adopted to efficiently
reduce the power consumption of the variable length coding process. The proposed
architecture consists of three major blocks, namely, a zigzag scanning block, a run
length coding block, and a Huffman coding block [42]. Low power approaches
have been employed in each of these blocks individually to make the whole
variable length coding process consume less power.
The zigzag scanning requires that all 64 DCT coefficients are available
before starting, so we have to wait for all 64 DCT coefficients to arrive before
scanning begins. To overcome this problem, two separate RAMs are used in the
present design to make the process faster: one RAM stores the incoming DCT
coefficients while the other is used for zigzag scanning of the 64 DCT coefficients
stored earlier. The two RAMs alternate between these roles. When we use this
parallel approach there are two possibilities: we can either make the process faster
at the same operating frequency, or decrease the operating frequency while
maintaining the same speed as before. Reducing the operating frequency is one of
several approaches to achieving low power, since power dissipation is directly
proportional to operating frequency.
The zigzag scanning block scans the quantized DCT coefficients in such a
way that all the low frequency components are scanned first, followed by the high
frequency components. The high frequency components, being quantized to zero,
all accumulate at the end, so that effective compression can be done in the run
length coding block.
The DCT coefficients are quantized to approximate the high frequency
components to zero, which leaves many intermediate zeros between the non-zero
DCT coefficients. The proposed run-length encoder architecture exploits this
property of the DCT coefficients. Unlike the traditional run-length encoder, which
counts repeated character strings in the form {run_value, symbol}, we count the
number of zeros between the non-zero DCT coefficients. In this way we can
achieve more compression than the traditional method when the DCT coefficients
are approximated to zero by using a proper quantization matrix in the DCT
process. More compression means fewer bits to process in the following stages,
and fewer bits to process means less switching activity; minimizing switching
activity is one of the well known approaches to achieving low power.
The Huffman coding is done by making use of a lookup table. The lookup
table is formed by arranging the different run-length combinations in the order of
their probabilities of occurrence, with the corresponding variable length Huffman
codes. This approach to designing the Huffman coder not only simplifies the
design but also results in less power consumption: since we are using the lookup
table approach, only the part of the encoder corresponding to the current
run-length combination will be active, and the inactive parts of the encoder will
not consume any power. Turning off the inactive components of a circuit is thus
the low power approach adopted while designing the Huffman coder.
5.2 VARIABLE LENGTH ENCODING
Variable Length Encoding (VLE) is the final lossless stage of the image
compression unit. VLE is done to further compress the quantized image. VLE
consists of the following three stages:
1. Zigzag Scanning
2. Run- length Encoding
3. Huffman Encoding
Figure 5.1 Block diagram of Variable Length Encoder
The figure 5.1 shows the block diagram of the Variable Length Encoder
(VLE). In the following sections we will discuss the detailed working of each sub
blocks.
5.2.1 Zig Zag Scanning
Figure 5.2 Block diagram of Zigzag Scanner
Figure 5.2 shows the block diagram of the zigzag scanner. The
quantized DCT coefficients obtained after applying the Discrete Cosine
Transformation to 8×8 block of pixels are fed as input to the Variable Length
Encoder (VLE). These quantized DCT coefficients will have non-zero low
frequency components in the top left corner of the 8×8 block and higher frequency
components in the remaining places [71, 62]. The higher frequency components
are approximated to zero after quantization. The low frequency DCT coefficients
are more important than higher frequency DCT coefficients. Even if we ignore
some of the higher frequency coefficients, we can successfully reconstruct the
image from the low frequency coefficients [73]. The Zigzag Scanner block
exploits this property.
Figure 5.3 Zigzag Scanning order
(The figure shows the 8×8 block of coefficient positions 0–63, with horizontal
frequency increasing to the right and vertical frequency increasing downwards;
the scan traverses the positions diagonally in zigzag order.)
In zigzag scanning, the quantized DCT coefficients are read out in a zigzag
order, as shown in the figure 5.3. By arranging the coefficients in this manner,
RLE and Huffman coding can be done to further compress the data. The scan puts
the high-frequency components together. These components are usually zeroes.
The following figure 5.4 explains the zigzag scanning process with a typical
quantization matrix.
Input to the zigzag scanner (a typical quantized 8×8 DCT matrix, horizontal
frequency increasing to the right, vertical frequency increasing downwards):

31  0  1  0  0  0  0  0
 1  2  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  2  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0

Row-major order (positions 0, 1, 2, 3, …, 63):
31 0 1 0 0 0 0 0 1 2 0 0 … 0 0

Output from the zigzag scanner (scan positions 0, 1, 8, 16, 9, 2, 3, 10, 17, 24, 32,
25, …, 62, 63):
31 0 1 0 2 1 0 0 0 0 0 2 … 0 0

Figure. 5.4 Zigzag Scanning example
Since zigzag scanning requires that all 64 DCT coefficients are available
before scanning, we need to store the serially incoming DCT coefficients in a
temporary memory, and this procedure has to be repeated for every 8×8 block of
pixels. So at any given time either scanning is performed or the incoming DCT
coefficients are stored [71, 62], which slows down the scanning process. In order
to overcome this problem and to
make scanning faster, the present researcher proposes in this work a new
architecture for the zigzag scanner, shown in the following Figure 5.5.
Figure 5.5 Internal Architecture of the Zigzag Scanner
In the proposed architecture, two RAM memories are used in the zigzag
scanner: zigzag register 1 and zigzag register 2. One of the two RAM memories is
busy storing the serially incoming DCT coefficients while scanning is performed
from the other RAM memory. Two 2:1 multiplexers are used in this architecture:
one multiplexer (on the left side) switches the input alternately between the two
register sets, and the other connects the output alternately from one of the two
register sets. Either zigzag scanning or storing of incoming DCT coefficients is
done on a given zigzag register set, never both simultaneously. So while scanning
is performed on one register set, the other register set is used to store the incoming
DCT coefficients. Thus, except for the first 64 clock cycles, i.e., until the 64 DCT
coefficients of the first 8×8 block become available [28], zigzag scanning and
storing of the serially incoming DCT coefficients are performed simultaneously.
By using two RAM memories we are able to scan one DCT coefficient in each
clock cycle, except for the first 64 clock cycles. A counter is used to count the
clocks, counting up to 64. This counter generates a switch-memory signal to the
multiplexers every 64 clock cycles. The counter also generates the ready out signal
after the first 64 clock cycles, to indicate to the next block that the zigzag block is
ready to output values. This signal acts as the ready in signal to the next,
run-length, block, which starts operating upon receiving it.
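A minimal sketch of this control counter, assuming the signal names used above:

module zigzag_ctrl (
  input  wire clk, rst,
  output reg  switch_mem,  // selects which zigzag register set is written
  output reg  rdy_out      // high once the first 64 coefficients are stored
);
  reg [5:0] cnt;
  always @(posedge clk or posedge rst)
    if (rst) begin
      cnt <= 6'd0; switch_mem <= 1'b0; rdy_out <= 1'b0;
    end else begin
      cnt <= cnt + 6'd1;            // wraps 63 -> 0, i.e. every 64 clocks
      if (cnt == 6'd63) begin
        switch_mem <= ~switch_mem;  // ping-pong the two RAMs
        rdy_out    <= 1'b1;         // first full block is now available
      end
    end
endmodule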
5.2.2 Run Length Encoder
Figure 5.6 Block diagram of Run-Length Encoder
The quantized coefficients are read out in zigzag order, from the DC
component to the highest frequency component. RLE is used to code the string of
data from the zigzag scanner. A conventional run-length encoder codes the
coefficients in the quantized block into a run length (or number of occurrences)
and a level or amplitude. For example, consider transmitting four coefficients of
value 10: {10,10,10,10}. With RLE, the level is 10 and the run of the value 10 is
four, so {4,10} is transmitted, reducing the amount of data transmitted. Typically,
RLE encodes a run of symbols into two bytes, a count and a symbol. Without
RLE, 64 coefficients are needed to define an 8 x 8 block [69, 26]. Many of the
quantized coefficients in the 8 x 8 block are zero, and coding can be terminated
when there are no more non-zero coefficients in the zigzag sequence; the
"end-of-block" code terminates the coding.
But in a typical quantized DCT matrix, zeros are far more common than
repeated non-zero coefficients. So the proposed run-length encoder architecture
exploits this abundance of zeros in the DCT matrix. In the proposed architecture,
the number of intermediate zeros between non-zero DCT coefficients is counted,
unlike in the conventional run-length encoder, where the number of occurrences
of repeated symbols is counted. The following Figure 5.7 shows the internal
architecture of the proposed run-length encoder.
Figure. 5.7 The internal architecture of run-length encoder.
The following Table 5.1 illustrates the difference between the conventional
run-length encoding and the proposed method of run-length encoding.
Output from the zigzag scanner (scan positions 0, 1, 8, 16, 9, 2, 3, 10, 17, 24, 32,
25, …, 62, 63):
31 0 1 0 2 1 0 0 0 0 0 2 … 0 0

Table 5.1: Comparison between conventional and proposed RLE

Conventional RLE    Proposed RLE
31                  31
1,0                 1,1
1,1                 1,2
1,0                 0,1
1,2                 5,2
1,1                 EOB
5,0
1,1
EOB
The above example clearly shows that the proposed RLE design in the
present work yields better compression than the conventional RLE.
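The behaviour of the proposed zero-counting encoder can be sketched as follows
(a minimal illustration; the widths and names are assumptions, and end-of-block
generation is omitted for brevity):

module rle_zero_count (
  input  wire               clk, rst,
  input  wire               in_valid,
  input  wire signed [11:0] coeff_in,  // zigzag-ordered DCT coefficient
  output reg  [5:0]         run,       // zeros preceding the level
  output reg  signed [11:0] level,     // the non-zero coefficient
  output reg                out_valid
);
  reg [5:0] zero_cnt;
  always @(posedge clk or posedge rst)
    if (rst) begin
      zero_cnt <= 6'd0; run <= 6'd0; level <= 12'sd0; out_valid <= 1'b0;
    end else begin
      out_valid <= 1'b0;
      if (in_valid) begin
        if (coeff_in == 0)
          zero_cnt <= zero_cnt + 6'd1;  // lengthen the current run of zeros
        else begin
          run       <= zero_cnt;        // emit the {run, level} pair
          level     <= coeff_in;
          zero_cnt  <= 6'd0;
          out_valid <= 1'b1;
        end
      end
    end
endmodule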
5.2.3 Huffman Encoding
Figure 5.8 Block diagram of Huffman Encoding
Conventional Huffman coding is used to code values statistically according
to their probability of occurrence. Short code words are assigned to highly
probable values and long code words to less probable values.
The procedure for Huffman coding involves the pairing of run/value
combinations. The input run/value combinations are written out in order of
decreasing probability: the run/value combination with the highest probability is
written at the top [22, 89, 63], and the least probable is written last. The two least
probable entries are then paired and added, and a new probability list is formed
with the previously added pair as a single entry. The two least probable run/value
combinations in the new list are then paired [78, 42]. This process continues until
the list consists of only two probability values. The values "0" and "1" are
arbitrarily assigned to the two elements in each of the lists.
Figure 5.9 Internal architecture of Huffman encoder
In the architecture proposed in the present work, the Huffman encoding is
done using a lookup table approach. The lookup table is formed by arranging the
different run/value combinations in the order of their probabilities of occurrence,
together with the corresponding variable length Huffman codes. When the output
of the run-length encoder, in the form of a run/value combination, is fed to the
Huffman encoder [6], the received run/value combination is searched in the
lookup table; when it is found, its corresponding variable length Huffman code is
sent to the output.
This approach to designing the Huffman encoder not only simplifies the
design but also results in less power consumption. Since we are using the lookup
table approach, only the part of the encoder corresponding to the current run/value
combination is active, and the other parts of the encoder do not consume any
power. Turning off the inactive components of the circuit in this way is what gives
the Huffman encoder its low power consumption.
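A minimal Verilog sketch of such a lookup table is shown below; the field widths
and the code words are placeholders, not the VLC table of the present work.

module huffman_lut (
  input  wire [5:0]  run,
  input  wire [3:0]  level,    // level (or its size category), assumed 4 bits
  output reg  [15:0] code,     // variable length code word, left-aligned
  output reg  [4:0]  code_len  // number of valid bits in 'code'
);
  always @(*)
    case ({run, level})
      {6'd0, 4'd1}: begin code = 16'b0000000000000000; code_len = 5'd2;  end
      {6'd0, 4'd2}: begin code = 16'b0100000000000000; code_len = 5'd2;  end
      {6'd1, 4'd1}: begin code = 16'b1100000000000000; code_len = 5'd4;  end
      {6'd2, 4'd1}: begin code = 16'b1101100000000000; code_len = 5'd5;  end
      default:      begin code = 16'hFFFF;             code_len = 5'd16; end
    endcase
endmodule

Only the selected case branch toggles for a given input, which is the sense in
which the unused entries of the table consume no dynamic power.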
5.2.4 Interconnection of VLE blocks
Figure 5.10 Interconnection of zigzag scanning, run-length & Huffman Encoding blocks
The above Figure 5.10 shows the interconnection of the zigzag scanning,
run-length encoding, and Huffman encoding blocks in the variable length encoder
[85, 9]. The clock and reset signals are shared among all three blocks in the VLE.
The output of the zigzag scanner is connected to the run-length encoder, and a
ready out signal from the zigzag scanner is connected to the run-length encoder as
its ready in signal, to indicate to the RLE when to start the encoding process [72].
The output of the run-length encoder is given as input to the Huffman encoder
block, along with the ready out signal from the RLE to initiate the encoding
process. Together, the interconnected blocks act as a variable length encoder,
taking quantized DCT coefficients as input and processing them to give variable
length codes as output.
5.3 VARIABLE LENGTH DECODER
The variable length decoder is the first block on the decoder side. It
decodes the variable length encoder output to yield the quantized DCT
coefficients. The basic block diagram of the Variable Length Decoder is shown in
Figure 5.11. The variable length decoder consists of three major blocks, namely,
1. Huffman Decoding.
2. Run- length Decoding.
3. Zigzag Inverse Scanning.
Figure 5.11 Block diagram of variable length decoder
5.3.1 Huffman Decoder
Figure 5.12 Block diagram of Huffman decoder
The Huffman decoder forms the front end of the variable length decoder.
The block diagram of the Huffman decoder is shown in Figure 5.12. The internal
architecture of the Huffman decoder is the same as that of the Huffman encoder
[78]. The same VLC Huffman coding table used in the Huffman encoder is also
used in the Huffman decoder. The input encoded data is taken and a search is done
for the corresponding run/value combination in the VLC table. Once the
corresponding run/value combination is found, it is sent as output, and the
Huffman decoder starts decoding the next incoming input.
Using the same VLC Huffman coding table in both the Huffman encoder
and the Huffman decoder reduces the complexity of the Huffman decoder [39, 14].
It not only reduces the complexity but also reduces the dynamic power in the
Huffman decoder, since only part of the circuit is active at a time.
5.3.2 Block Diagram of FIFO
Figure 5.13 Block diagram of FIFO
The First-In First-Out (FIFO) buffer also forms part of the decoder and is
shown in Figure 5.13. The FIFO is used between the Huffman decoder and the
run-length decoder, to match the operating speeds of the two. The Huffman
decoder sends a decoded output to the run-length decoder in the form of a
run/value combination. The run-length decoder takes this as input and starts
decoding. Since the run in the run/value combination represents the number of
zeros between consecutive non-zero coefficients, a zero '0' is sent as output for the
next 'run' number of clock cycles. Until then, the run-length decoder cannot
accept another run/value combination, while the Huffman decoder decodes one
input into one run/value combination in every clock cycle. So the Huffman
decoder cannot be connected directly to the run-length decoder; otherwise the
run-length decoder cannot decode correctly. To match the speed between the
Huffman decoder and the run-length decoder, the FIFO is used. The output of the
Huffman decoder is stored in the FIFO, and the run-length decoder takes one
decoded Huffman output from the FIFO when it has finished decoding its present
input. After the run-length decoder finishes decoding the present input, it sends a
signal to the FIFO to feed it a new input; this signal is sent to the FOutN pin,
which is the read-out pin of the FIFO. The FInN pin is used to write to the FIFO;
the Huffman decoder generates this signal when it has to write a new input to the
FIFO. Thus, the FIFO acts as a synchronizing device between the Huffman
decoder and the run-length decoder.
5.3.3 Run Length Decoder
Figure 5.14 Block diagram of Run-Length Decoder
The run-length decoder forms the middle part of the variable length
decoder. Its block diagram is shown in Figure 5.14. It takes the decoded output
from the Huffman decoder through the FIFO. When the Huffman decoder decodes
one input and stores the decoded output in the FIFO, the FIFO becomes non-empty
(the condition when at least one element is stored in the FIFO) and generates the
signal F_EmptyN. This signal is used as the rdy_in signal to the run-length
decoder. So when the Huffman decoder decodes one input and stores it in the
FIFO, a ready signal is generated to the run-length decoder to initiate the decoding
process. The run-length decoder takes its input in the form of a run/value
combination [74] and separates the run and value parts. The run represents the
number of zeros to output before sending out the non-zero level 'value' of the
run/value combination. For example, if {5,2} is input to the run-length decoder,
it sends five zeros (i.e., 5 '0') before transmitting the non-zero level '2' to the
output. Once the run-length decoder sends out a non-zero level, it has finished
decoding the present run/value combination and is ready for the next. It therefore
generates the rdy_out signal to the FIFO, to indicate that it has finished decoding
the present input and is ready to decode the next run/value combination. This
rdy_out is connected to the FOutN pin of the FIFO, the read-out pin. Upon
receiving this signal, the FIFO sends a new run/value combination to the
run-length decoder, initiating the run-length decoding process for the new pair.
An example of the run-length decoding process is shown below.
104
Input to the run-length decoder:   31  1,1  1,2  0,1  5,2  EOB
Output sequence:                   31  0  1  0  2  1  0  0  0  0  0  2  ...  0  0
Block position (zigzag order):      0  1  8 16  9  2  3 10 17 24 32 25  ... 62 63
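The behaviour just described can be sketched as a small state machine: latch a {run, value} pair, emit 'run' zeroes, emit the value, then pulse rdy_out to fetch the next pair. The bit widths and control details below are assumptions, not the exact thesis RTL.

// A behavioral sketch of the run-length decoder just described: latch a
// {run, value} pair, emit 'run' zeroes, emit the value, then pulse rdy_out
// to fetch the next pair. Widths and control details are assumptions.
module rl_decoder (
    input  wire       clk,
    input  wire       rst,
    input  wire       rdy_in,        // FIFO not empty (F_EmptyN)
    input  wire [3:0] run_in,        // run part of the pair from the FIFO
    input  wire [7:0] value_in,      // value part of the pair from the FIFO
    output reg        rdy_out,       // pulse requesting the next pair (FOutN)
    output reg  [7:0] rle_out
);
    reg [3:0] zeros_left;
    reg       busy;

    always @(posedge clk) begin
        if (rst) begin
            busy    <= 1'b0;
            rdy_out <= 1'b0;
            rle_out <= 8'd0;
        end else if (!busy && rdy_in) begin   // accept a new {run, value} pair
            zeros_left <= run_in;
            busy       <= 1'b1;
            rdy_out    <= 1'b0;
        end else if (busy) begin
            if (zeros_left != 0) begin
                rle_out    <= 8'd0;           // one zero per clock cycle
                zeros_left <= zeros_left - 4'd1;
            end else begin
                rle_out <= value_in;          // then the non-zero level
                busy    <= 1'b0;
                rdy_out <= 1'b1;              // request the next pair
            end
        end else
            rdy_out <= 1'b0;
    end
endmodule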
5.3.4 ZigZag Inverse Scanner
Figure 5.15 Block diagram of Zigzag Inverse Scanner
Input to the zigzag inverse scanner (zigzag order):
31  0  1  0  2  1  0  0  0  0  0  2  ...  0  0   (block positions: 0 1 8 16 9 2 3 10 17 24 32 25 ... 62 63)
Output from the zigzag inverse scanner (raster order):
31  0  1  0  0  0  0  0  1  2  0  0  ...  0  0   (positions: 0 1 2 3 4 5 6 7 8 9 10 11 ... 62 63)
The zigzag inverse scanner, shown in figure 5.15, forms the last stage of
the variable length decoder. Its working and architecture are similar to those
of the zigzag scanner, except that the scanning order is different. The zigzag
inverse scanner gets its input from the run-length decoder and starts storing the coefficients in
one of the two RAMs, until it receives all 64 coefficients. Once it has received
all 64 coefficients, it starts inverse scanning to decode back the original DCT
coefficient order. Meanwhile the incoming DCT coefficients are stored in the
other RAM. Once scanning from one RAM is finished, scanning starts from the
other RAM, while the incoming DCT coefficients are stored in the first RAM. This
process is repeated until all the DCT coefficients are scanned. There is a delay
of 64 clock cycles before the output appears; after that, one element is scanned
on every clock cycle. The example above illustrates the working of the zigzag
inverse scanner.
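The ping-pong control behind this double-buffering reduces to a free-running 7-bit counter: the low six bits address the 64 coefficient locations and the seventh bit selects which RAM is being written, while the other bank is read out through a 64-entry zigzag-order ROM. The sketch below illustrates this idea; it is an illustration under those assumptions, not the thesis RTL, and the read side and ROM are omitted.

// Ping-pong write control sketch: a free-running counter whose low six bits
// address 64 coefficient locations and whose seventh bit selects the RAM
// being written. The read side (the other bank, addressed through a 64-entry
// zigzag-order ROM) is omitted.
module pingpong_ctrl (
    input  wire       clk,
    input  wire       rst,
    input  wire       in_valid,   // one coefficient accepted per clock
    output wire       bank_wr,    // which of the two RAMs is being written
    output wire [5:0] wr_addr     // 0..63 within the active bank
);
    reg [6:0] count;
    always @(posedge clk) begin
        if (rst)
            count <= 7'd0;
        else if (in_valid)
            count <= count + 7'd1;  // wraps every 128 samples, swapping banks
    end
    assign bank_wr = count[6];
    assign wr_addr = count[5:0];
endmodule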
SUMMARY
This chapter described the algorithm and architecture details of variable
length coding. Variable length coding, which maps input source data onto code
words of variable length, is an efficient method to minimize the average code
length. Compression is achieved by assigning short code words to input symbols
of high probability and long code words to those of low probability. Variable
length coding can be successfully used to relax the bit-rate requirements and
storage spaces of many multimedia compression systems. For example, a variable
length coder (VLC) employed in MPEG-2, along with the discrete cosine transform
(DCT), results in very good compression efficiency. To reconstruct the image
before dequantization, the Variable Length Decoder is designed with a low power
approach.
Since early studies focused only on high throughput variable length
coders, low-power variable length coders have not received much attention. This
trend is rapidly changing as the target of multimedia systems moves towards
portable applications like laptops, mobiles and iPods. These systems highly
demand low-power operation and thus require low power functional units.
CHAPTER 6
SIMULATION AND SYNTHESIS RESULTS OF DCT AND IDCT MODULES
6.1 INTRODUCTION
In the previous chapters, the development of algorithms and design of
architectures for various functional modules of Low power DCT and Variable
Length Coder for image compression were presented in detail. These functional
modules were transformation, DCT, IDCT, Zigzag Scanning, Run Length
Encoding, Huffman coding and different blocks of Variable Length Decoding.
This chapter describes the verification of the DCT and IDCT modules. These
modules are implemented using Matlab, simulated using Modelsim, and finally
synthesized using RTL Compiler.
The image data is divided into 8x8 blocks of pixels. The DCT is applied to
each 8x8 block of the image, converting the spatial image representation into a
frequency map: the low-order or "DC" term represents the average value in the
block, while successive higher-order ("AC") terms represent the strength of more
and more rapid changes across the width or height of the block. The highest AC
term represents the strength of a cosine wave alternating from maximum to
minimum at adjacent pixels.
The DCT calculation is fairly complex; in fact, it is the most costly step in
JPEG compression. The point of doing it is that we have now separated out the
high- and low-frequency information present in the image. We can discard high-
frequency data easily without losing low-frequency information. The DCT step
itself is lossless except for round-off errors. To discard an appropriate amount of
information, the compressor divides each DCT output value by a "quantization
coefficient" and rounds the result to an integer. The larger the quantization
coefficient, the more data is lost, because the actual DCT value is represented
less and less accurately. Each of the 64 positions of the DCT output block has its own
quantization coefficient, with the higher-order terms being quantized more heavily
than the low-order terms (that is, the higher-order terms have larger quantization
coefficients). Furthermore, separate quantization tables are employed for
luminance and chrominance data, with the chrominance data being quantized more
heavily than the luminance data. This allows JPEG to exploit further the eye's
differing sensitivity to luminance and chrominance.
The complete quantization tables actually used are recorded in the
compressed file so that the decompressor will know how to (approximately)
reconstruct the DCT coefficients.
Several previous proposals for DCT/IDCT architectures attempt to reduce
power dissipation. The architectures use some common power reduction
techniques and some power reduction techniques specific to DCT/IDCT
architectures. Because of the wide-ranging applications of the DCT and IDCT, the
input data pattern is widely varied for DCT/IDCT architectures in different
applications. Various data-dependent power reduction techniques have therefore
been proposed. Some architectures reduce calculations for visually irrelevant
DCT coefficients. Other architectures ignore redundant sign bits to save power in
the arithmetic units. To save power, some designs disable arithmetic units for the
many zero-valued operands in the IDCT. Many other implementations focus on
power reduction in the multiplier units.
The low power design in this thesis employs parallel processing units that
enable power savings from a reduction in clock speed. The architecture saves
power by disabling units that are not in use; when the units are in standby
mode, they consume a minimum amount of power. A common low power design
technique is to reorder the input data so that a minimum number of transitions
occur on the input data lines. The DCT architecture based on the proposed low
complexity vector scalar design technique effectively removes redundant
computations, thus reducing the computational complexity of DCT operations. This
DCT architecture shows lower power consumption, mainly due to the efficient
computation sharing and the advantage of using carry save adders as final
adders.
6.2 MATLAB IMPLEMENTATION OF DCT AND IDCT MODULES
The representation of the colors in the image is converted from RGB to
YCbCr, consisting of one luma component (Y), representing brightness, and two
chroma components, (Cb and Cr), representing color. The resolution of the chroma
data is reduced, usually by a factor of 2. This reflects the fact that the eye is less
sensitive to fine color details than to fine brightness details. The image is split into
blocks of 8×8 pixels, and for each block, each of the Y, Cb, and Cr data undergoes
a discrete cosine transform (DCT). A DCT is similar to a Fourier transform in the
sense that it produces a kind of spatial frequency spectrum. The amplitudes of the
frequency components are quantized. Human vision is much more sensitive to
small variations in color or brightness over large areas than to the strength of high-
frequency brightness variations. Therefore, the magnitudes of the high-frequency
components are stored with a lower accuracy than the low-frequency components.
If an excessively low quality setting is used, the high-frequency components are
discarded altogether.
6.2.1 DCT Methodology
• Each pixel value in the 2-D matrix is quantized using 8 bits, which produces a
value in the range of 0 to 255 for the intensity/luminance values and the range
of -128 to +127 for the chrominance values. All values are shifted to the range
of -128 to +127 before computing the DCT.
• All 64 values in the input matrix contribute to each entry in the transformed
matrix.
• The value in the location F [0,0] of the transformed matrix is called the DC
coefficient and is the average of all 64 values in the matrix.
• The other 63 values are called the AC coefficients and have a frequency
coefficient associated with them.
• The spatial frequency increases as we move from left to right (horizontally)
or from top to bottom (vertically). Low spatial frequencies are clustered in the
top left corner.
• The Discrete Cosine Transform (DCT) separates the frequencies contained in
an image.
• The original data can be reconstructed by the Inverse DCT.
Before we begin, it should be noted that the pixel values of a black-and-white
image range from 0 to 255 in steps of 1, where pure black is represented by 0, and
pure white by 255. Thus it can be seen how a photo, illustration, etc. can be
accurately represented by these 256 shades of gray. Since an image comprises
hundreds or even thousands of 8x8 blocks of pixels, the following description
shows what happens to one 8x8 block in the JPEG process; what is done to one
block of image pixels is done to all of them, in the order specified earlier.
Now, let's start with a block of image pixel values. This particular block was
chosen from the very upper-left-hand corner of an image.
Original image input matrix =
168 161 161 150 154 168 164 154
171 154 161 150 157 171 150 164
171 168 147 164 164 161 143 154
164 171 154 161 157 157 141 132
161 161 157 154 143 161 154 132
164 161 161 154 150 157 154 140
161 174 157 154 161 140 140 132
154 161 157 150 140 132 136 128
Because the DCT is designed to work on pixel values ranging from -128 to
127, the original block is “leveled off” by subtracting 128 from each entry. This
results in the following matrix M, which is now ready for the Discrete Cosine
Transform, accomplished by matrix multiplication.
D = T M T'                                                         ... (6.1)

M =
40 33 33 22 26 40 36 26
43 26 33 22 29 43 22 36
43 40 19 36 36 33 15 26
36 43 26 33 29 29 13  4
33 33 29 26 15 33 26  4
36 33 33 26 22 29 26 12
33 46 29 26 33 12 12  4
26 33 29 22 12  4  8  0
In equation 6.1, the image matrix M is first multiplied on the left by the
DCT matrix T, where T is the 8 x 8 matrix from equation 4.3 whose elements are
cosine functions; this transforms the rows. The columns are then transformed by
multiplying on the right by the transpose of the DCT matrix. Applying the
proposed algorithm and architecture to the image matrix in Matlab yields the
DCT matrix D.
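For reference, the cosine-valued matrix T described above takes the standard orthonormal form (assuming equation 4.3 follows the usual definition):

$$T_{ij} = \begin{cases} \dfrac{1}{2\sqrt{2}}, & i = 0 \\[4pt] \dfrac{1}{2}\cos\dfrac{(2j+1)\,i\,\pi}{16}, & 1 \le i \le 7 \end{cases} \qquad (0 \le j \le 7)$$

so that equation 6.1 transforms the rows via T and the columns via its transpose T'.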
This block matrix now consists of 64 DCT coefficients, c (i, j), where i and
j range from 0 to 7. The top-left coefficient, c (0, 0) correlates to the low
frequencies of the original image block. As we move away from c(0,0) in all
directions, the DCT coefficients correlate to higher and higher frequencies of
the image block, where c(7,7) corresponds to the highest frequency. Higher
frequencies are mainly represented by coefficients of smaller magnitude, and
lower frequencies by coefficients of larger magnitude [56, 58, 59]. It is
important to note that the human eye is most sensitive to the lower frequencies.
6.2.2 Quantization
The 8x8 block of DCT coefficients is now ready for compression by
quantization. A remarkable and highly useful feature of the JPEG process is that in
this step, varying levels of image compression and quality are obtainable through
selection of specific quantization matrices. This enables the user to decide on
quality levels ranging from 1 to 100, where 1 gives the poorest image quality and
highest compression, while 100 gives the best quality and lowest compression. As
a result, the quality/compression ratio can be tailored to suit different needs.
Subjective experiments involving the human visual system have resulted in the
JPEG standard quantization matrix. With a quality level of 50, this matrix
renders both high compression and excellent decompressed image quality.
Q50 =
16  11  10  16  24  40  51  61
12  12  14  19  26  58  60  55
14  13  16  24  40  57  69  56
14  17  22  29  51  87  80  62
18  22  37  56  68 109 103  77
24  35  55  64  81 104 113  92
49  64  78  87 103 121 120 101
72  92  95  98 112 100 103  99
If however, another level of quality and compression is desired, scalar
multiples of the JPEG standard quantization matrix may be used. For a quality
level greater than 50 (less compression, higher image quality), the standard
quantization matrix is multiplied by (100-quality level)/50. For a quality level less
than 50 (more compression, lower image quality), the standard quantization matrix
is multiplied by 50/quality level. The scaled quantization matrix is then rounded
and clipped to have positive integer values ranging from 1 to 255. For example,
scaling Q50 in this way yields the quantization matrices for quality levels 10
and 90.
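Written out from the scaling rules above (including the rounding and clipping just described), the matrix for a quality level q is

$$Q_{q} = \operatorname{clip}\!\big(\operatorname{round}(s_{q}\,Q_{50}),\,1,\,255\big), \qquad s_{q} = \begin{cases} 50/q, & q < 50 \\ (100-q)/50, & q > 50 \end{cases}$$

For instance, the (0,0) entry of Q50 is 16, which scales to 16 x 5 = 80 at quality 10 and to round(16 x 0.2) = 3 at quality 90.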
Quantization is achieved by dividing each element in the transformed image
matrix D by the corresponding element in the quantization matrix, and then
rounding to the nearest integer value as indicated by the equation 6.2. For the
quantization process the quantization matrix Q50 is used.
C(i,j) = round( D(i,j) / Q(i,j) )                                  ... (6.2)
Recall that the coefficients situated near the upper-left corner correspond to
the lower frequencies of the image block, to which the human eye is most
sensitive. In addition, the zeros represent the less important, higher frequencies that
have been discarded, giving rise to the lossy part of compression. As mentioned
earlier, only the remaining nonzero coefficients will be used to reconstruct the
image.
The IDCT is next applied to the dequantized matrix, and the result is
rounded to the nearest integer. Finally, 128 is added to each element of that
result, giving us the decompressed JPEG version N of our original 8x8 image
block M. Comparing the output of the IDCT with the input image matrix gives a
remarkable result, considering that nearly 70% of the DCT coefficients were
discarded prior to image block decompression/reconstruction. Given that similar
results occur with the rest of the blocks that constitute the entire image, it
is no surprise that the JPEG image is scarcely distinguishable from the
original. Remember, there are 256 possible shades of gray in a black-and-white
picture, and a difference of, say, 10 is barely noticeable to the human eye
[15]. The DCT takes advantage of redundancies in the data by grouping pixels
with similar frequencies together. Moreover, when the resolution of the image is
very high, even after substantial compression and decompression there is very
little change between the original and decompressed images. Thus, we can also
conclude that, at the same compression ratio, the difference between the
original and decompressed images decreases as the image resolution increases.
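In the notation of equations 6.1 and 6.2, the reconstruction just described amounts to dequantizing the coefficient block C and inverting the transform (T is orthogonal, so its transpose undoes it):

$$R_{i,j} = Q_{i,j}\,C_{i,j}, \qquad N = \operatorname{round}\!\big(T'\,R\,T\big) + 128$$

where N is the decompressed 8x8 block approximating the original M.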
6.3 MATLAB RESULTS
The procedure followed for image compression using Matlab is given in the
form of a flow chart, shown in Figure 6.1.
Figure 6.1 Matlab Design flow
To find the DCT of the given image, the image is read in the form of frames;
each frame consists of 8x8 pixels, represented as an 8x8 image matrix. The
coefficients of the image matrix vary from 0 to 255 and are shifted into the
range -128 to +127 by subtracting 128 from each coefficient. The proposed DCT
algorithm is applied to individual rows and columns to find the DCT.
Quantization is performed using the Q50 quantization matrix to achieve the
compression, and the compressed image is then displayed. The Matlab simulation
results are compared with the Verilog HDL simulation results. Results obtained after
performing the DCT on the original images are shown in Figure 6.2. Figure 6.3
shows the original image, the image obtained after applying the DCT, and the
reconstructed image obtained by applying the IDCT.
Figure 6.2 Matlab Simulation Results for 8x8 image
The above results show that, with the proposed low power DCT and IDCT
algorithm and architecture of the present work, it is possible to reconstruct
the original image without losing the original image quality.
Figure 6.3 Matlab Simulation Results for full image
Figure 6.3 shows how the complete image is reconstructed with the proposed
DCT and IDCT algorithms using Matlab. Figures 6.4 and 6.5 show one more typical
image before and after applying the proposed DCT and IDCT algorithms.
Figure 6.4 Input image in color and Gray Scale
Figure 6.5 Reconstructed Image after proposed DCT and IDCT
6.4 VLSI DESIGN OF THE PROPOSED ARCHITECTURE
The main objective of the researcher is to achieve low power with the DCT
and IDCT core designs. The proposed DCT and IDCT architectural blocks of the
design were written as RTL code using Verilog HDL. For functional verification
of the design, test benches were written and simulation was done using the
Cadence IUS simulator and the Modelsim simulator. Behavioral simulation was also
done using Matlab from MathWorks. Matlab was used to generate the input image
matrix for the input of the DCT core; it reads both input and output, calculates
the DCT/IDCT coefficients of the input file, and compares them to the output of
the core.
After behavioral verification, the RTL Compiler EDA tool was used to
synthesize the Verilog code. IC Compiler was used to generate the layout and to
perform the place and route of the design for both the DCT and IDCT chips. After
synthesis, parameters like area and power were analysed by mapping the design
onto the 65nm standard cell technology library. Power optimization was done by
applying low power techniques while mapping onto the 65nm standard cells. The
results show that with the proposed design it is possible to achieve low power
with good performance. The timing results showed that the design can operate up
to a maximum speed of 100 MHz.
6.5 SIMULATION RESULTS USING VERILOG
Simulation of the DCT core was done using the Cadence Incisive Unified
Simulator (IUS) and the Modelsim simulator. The DCT core was simulated by
applying the 8x8 image matrices taken from Matlab. A Verilog test bench was
written to simulate the DCT core. When the DCT computation is over, the rdy_out
signal goes high. After 92 clock cycles, the 2-D DCT output for all 64 values is
continuous. Table 6.1 shows the signal description of the DCT core.
Table 6.1 Signal Description of DCT core
Signal Name     I/O      Function
clk             Input    Clock
rst             Input    Reset
Xin[7:0]        Input    8-bit input
rdy_out         Output   Ready signal to indicate DCT output
dct_2d[11:0]    Output   12-bit 2-D DCT output
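A test bench along the lines described might look like the following sketch. The module name dct_core and the data file block.hex are placeholder assumptions; only the port list follows Table 6.1.

// Illustrative Verilog test bench for the DCT core interface of Table 6.1.
// The module name dct_core and the file block.hex are placeholder
// assumptions; pixel data would come from the Matlab-generated image matrix.
`timescale 1ns / 1ps
module tb_dct;
    reg         clk = 1'b0;
    reg         rst = 1'b1;
    reg  [7:0]  Xin;
    wire        rdy_out;
    wire [11:0] dct_2d;

    dct_core dut (.clk(clk), .rst(rst), .Xin(Xin),
                  .rdy_out(rdy_out), .dct_2d(dct_2d));

    always #5 clk = ~clk;              // 100 MHz clock (10 ns period)

    integer i;
    reg [7:0] block [0:63];            // one 8x8 block, row by row
    initial begin
        $readmemh("block.hex", block); // pixel data exported from Matlab
        repeat (2) @(posedge clk);
        rst = 1'b0;
        for (i = 0; i < 64; i = i + 1)
            @(posedge clk) Xin <= block[i];
        wait (rdy_out);                // goes high when the 2-D DCT is ready
        repeat (64) @(posedge clk)
            $display("dct_2d = %0d", $signed(dct_2d));
        $finish;
    end
endmodule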
6.6 COMPARISON OF MATLAB AND HDL SIMULATION RESULTS
An 8x8 image matrix is given to Matlab, and the DCT block performs the 2-D
DCT. The 2-D DCT output obtained from Matlab is given in figure 6.6. The same
image input is given to the test bench of the Verilog code to obtain the 2-D
DCT; the output from the Verilog HDL simulation of the DCT core is shown in
figure 6.7. The resultant DCT matrices obtained in the two cases are compared.
Only a few coefficients differ, which does not affect the output image much. The
IDCT core outputs are also compared with the Matlab results and the HDL
simulation results, and they are comparable; hence, with the proposed IDCT
architecture it is possible to reconstruct the image without much degradation.
6.6.1 2-D DCT Simulation Results Using Matlab and Verilog
Figure 6.6 2-D DCT Matlab results
Figure 6.6 shows the Matlab results for the 2-D DCT of an 8x8 image matrix
computed using the proposed low power architecture. Figure 6.7 shows the
Modelsim simulation result for the 2-D DCT with the same architecture.
Figure 6.7 DCT HDL simulation results
The Matlab output for the 8x8 image matrix is compared with the Verilog HDL
simulation results. The 64 values of the 8x8 image matrix are forced as inputs
to the DCT core using the test bench, and the 64 DCT coefficients are observed
at the output. The results obtained from the Matlab and HDL simulations are the
same; hence the proposed low power hardware implementation of the DCT core works
correctly.
6.6.2 IDCT Simulation Results
The IDCT core was simulated with its input taken from the dct_2d signal of
the DCT core. The signal description of the IDCT core is shown in Table 6.2. The
output obtained from the IDCT is given to Matlab to reconstruct the image. A
Verilog test bench was written to simulate the IDCT core. When the IDCT
computation starts, the rdy_in signal is made high. After an initial latency of
84 clock cycles, the 2-D IDCT output for all 64 values is continuous. The
simulation steps were followed using the Cadence IUS simulator and the Modelsim
simulator.
Table 6.2 Signal Description of IDCT core
Signal Name     I/O      Function
clk             Input    Clock
rst             Input    Reset
dct_2d[11:0]    Input    12-bit input
rdy_in          Input    Ready signal to indicate dct_2d as input
idct_2d[7:0]    Output   8-bit 2-D IDCT output
Figure 6.8 2-D IDCT HDL simulation results
Figure 6.9 2-D IDCT MATLAB results
The IDCT simulation steps were carried out exactly like those for the DCT
core. Matlab was used for generating the test vectors and analysing the output.
The core was used to obtain the IDCT coefficients, which were compared with the
Matlab results. The IDCT was processed in 8x8 blocks, and the whole image was
handled using block processing in Matlab. Figures 6.8 and 6.9 show the HDL
simulation and Matlab simulation of the IDCT, respectively.
6.7 SYNTHESIS RESULTS OF DCT AND IDCT
The analysis of the features of both the DCT and IDCT cores was done using
RTL Compiler from Cadence. The design was synthesized and the core was mapped
onto tcbtcbn65lphvtwcl_ccs standard cells. The features like area, power and
number of cells required for the design are tabulated in table 6.3. The layout
and the place and route of each design were done using IC Compiler.
Table 6.3 Power and Area Characteristics of DCT and IDCT using 65nm standard cells

Features           DCT                 IDCT
Internal Power     0.2708 mW           0.3376 mW
Switching Power    0.1641 mW           0.2142 mW
Leakage Power      103.5371 nW         101.7760 nW
Total Power        0.4350 mW           0.5519 mW
Area               34983.359724 µm2    34903.799800 µm2
Block size         8x8                 8x8
No. of Cells       6494                7171

Figure 6.10 Layout of 2D-DCT
Figure 6.11 Zoomed Version of 2D-DCT Layout
Figure 6.12 Layout of 2D-IDCT
Figure 6.13 Zoomed Version of 2D-IDCT Layout
The layouts of both the DCT and IDCT cores show that the design can be
comfortably placed within the physical area using the place and route tool
without any congestion. The zoomed versions show how the design is placed using
the standard cell approach.
6.8 COMPRESSION USING DISCRETE COSINE TRANSFORM AND
QUANTIZATION
The starting point for the proposed research was an FPGA implementation
of image compression using the DCT. This section gives the details of that
implementation. The FPGA implementation of image compression using Discrete
Cosine Transformation and Quantization was done with a different architecture;
this design was implemented on the FPGA device 2vp30ff896-7 and simulated using
the Modelsim simulator. The different stages of the image compression process
using DCTQ are shown in the following figures. Figure 6.14 shows the original
image, which is to be compressed using DCTQ and reconstructed using the IDCT.
Figure 6.14 Original image for compression
The image generated in Matlab, along with the pixel values and the image
matrix, is given in figure 6.15.
Figure 6.15 Original image with the pixel values
In the matrix representation of the gray image of figure 6.15, the maximum
value shows the pure white color and the minimum value shows the pure black
color. The matrix C is the DCT matrix obtained from the design using Matlab. As
can be noticed from the DCT matrix, the values lie between -1 and +1. But to
implement the design on the FPGA device, each of the DCT coefficients is
multiplied by 128 and level shifted by 128.

Figure 6.16 DCT coefficients before quantization

The obtained DCT coefficients are divided by the JPEG standard quantization
matrix for image compression, shown as quant_mat. Figure 6.17 shows the image
after compression.

Figure 6.17 Image after compression

6.9 RECONSTRUCTION OF IMAGE USING IDCT

On the receiver side the image is then denormalized by multiplying with the
quantization matrix, and the de-transformed image is obtained, given in figure
6.18.
Figure 6.18 Reconstructed Image using IDCT
Figure 6.19 shows the comparison of the original and final images using the
DCT and IDCT respectively.
Figure 6.19 Comparison of original and reconstructed image
6.10 SIMULATION RESULT OF IMAGE AFTER COMPRESSION
Figure 6.20 illustrates the process of compression by quantization. It also
shows the general agreement between the results generated in Matlab and the
simulated results in Modelsim. The compressed image with its pixel values is
also shown.
Figure 6.20 Simulation results after compression
6.11 FPGA IMPLEMENTATION OF THE DCTQ
The various modules of DCTQ described in the previous sections were coded
in Verilog, simulated using Modelsim, and synthesized using Xilinx ISE 9.2i. The
target device chosen was the Xilinx Virtex-II Pro XC2VP30-7 FPGA with package
FF896 [2, 3]. The Verilog code developed for this design is fully RTL compliant
and technology independent. As a result it can work on any FPGA or
ASIC without needing to change any code.
The synthesis results and the device utilization summary of DCTQ for the
targeted device are presented below.
6.11.1 Device Utilization Summary
Number of slices            : 390 out of 13696     2%
Number of slice Flip Flops  : 560 out of 27392     2%
Number of 4 input LUTs      : 225 out of 27392     0%
Number of IOs               : 387
Number of bonded IOBs       : 387 out of 556      69%
Number of MULT18X18s        : 16 out of 136       11%
Number of GCLKs             : 1 out of 16          6%
6.11.2 HDL Synthesis Results
The synthesis results of the main module are as follows:
=======================================
Macro Statistics
# Multipliers : 16
16x9-bit multiplier : 8
9x9-bit multiplier : 8
# Adders/Subtractors : 30
16-bit adder : 7
26-bit adder : 7
9-bit subtractor : 16
# Registers : 48
16-bit register : 16
26-bit register : 8
9-bit register : 24
SUMMARY
The simulation and synthesis of the Discrete Cosine Transform and Inverse
Discrete Cosine Transforms were described in the present chapter. The proposed
architecture of the DCT and IDCT were first implemented in Matlab in order to
estimate the quality of the reconstructed image and the compression that can be
achieved. In addition Matlab output serves as a reference for verifying the Verilog
output. The core Modules of the DCT and IDCT were realized using Verilog for
ASIC implementation. The quality of the reconstructed pictures using Matlab and
Verilog were compared. The synthesis of the proposed architecture was done using
the ASIC synthesis tool. The synthesis results show that the design achieves low
power in both the DCT and IDCT modules. The design was implemented using 65nm
standard cells with a low power approach. The physical design and place and
route results for the DCT and IDCT were also presented.
CHAPTER 7
SIMULATION AND SYNTHESIS RESULTS OF VARIABLE LENGTH ENCODING AND DECODING MODULE
7.1 INTRODUCTION
In the previous chapter, the simulation and synthesis of the DCT and IDCT
modules for image compression and decompression were presented in detail. This
chapter describes the verification of the different modules used in Low Power
Variable Length Encoding and Decoding. Each of the blocks involved in Variable
Length Encoding and Decoding is written in synthesizable Verilog code. The
simulation of the design was done using the Modelsim simulator tool. The
synthesis of each block of the design was done using Cadence RTL Compiler, and
place and route using the Synopsys IC Compiler EDA tool. Finally the design was
mapped onto 90nm standard cells. The synthesis report shows the reduction in the
power consumption of the design. The following sections discuss the simulation
and synthesis of each individual block used in the Encoder and Decoder designs.
7.2 ZIGZAG SCANNING
The methodology involved in the working of the zigzag scanning is explained
in chapter 5. The simulation of zigzag scanning is done using the following test
sequence:

31  0  1  0  0  0  0  0  1  2  0  0  ...  0  0   (positions 0 1 2 3 4 5 6 7 8 9 10 11 ... 62 63)
Figure 7.1 shows the simulation waveform of the zigzag scanning block for
the above test sequence. The main purpose of the zigzag scanning is to scan the
low frequency components before the high frequency components [28]. The
simulation results show that the low frequency components are scanned before the
high frequency components, so that even if the high frequency components are
later neglected it is still possible to reconstruct the image.
Figure 7.1 Waveform obtained after simulating the Zigzag Scanning block
7.3 RUN LENGTH ENCODING
The simulation of the run-length encoding block is done using the output
sequence obtained in the zigzag scanning process, which appears as the input of
the run-length encoder; the pattern is as shown below.
Figure 7.2 shows the simulation waveform of the proposed Run-Length
Encoding block for the above scanned input sequence. In a typical quantized DCT
matrix, the number of zeros is large compared to the number of repeated non-zero
coefficients. So in the present research the run length encoder [68, 25]
exploits this property and counts the zeros between non-zero DCT coefficients,
unlike the conventional run length encoder, where the number of occurrences of
repeated symbols is counted. The results show that 31 appears first, then one
zero before 1, then one more zero before 2, and the process repeats. The
proposed RLE architecture requires fewer digits to represent the same data,
which yields better compression than the conventional RLE.
Figure 7.2 Waveform of Run Length Encoding block
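The zero-run counting idea can be sketched in a few lines of Verilog: a counter lengthens the current run on every zero coefficient, and a {run, value} pair is emitted at every non-zero coefficient. The bit widths and the omitted EOB handling are simplifying assumptions, not the exact thesis RTL.

// A sketch of the zero-run counting idea described above: a counter grows on
// every zero coefficient, and a {run, value} pair is emitted at each non-zero
// coefficient. Bit widths are assumptions and EOB handling is omitted.
module rl_encoder (
    input  wire               clk,
    input  wire               rst,
    input  wire               in_valid,
    input  wire signed [11:0] coeff_in,   // zigzag-ordered quantized coefficient
    output reg                pair_valid,
    output reg         [5:0]  run_out,
    output reg  signed [11:0] value_out
);
    reg [5:0] zero_count;
    always @(posedge clk) begin
        if (rst) begin
            zero_count <= 6'd0;
            pair_valid <= 1'b0;
        end else if (in_valid) begin
            if (coeff_in == 0) begin
                zero_count <= zero_count + 6'd1;  // lengthen the current run
                pair_valid <= 1'b0;
            end else begin
                run_out    <= zero_count;         // zeros since the last level
                value_out  <= coeff_in;
                zero_count <= 6'd0;
                pair_valid <= 1'b1;               // emit the {run, value} pair
            end
        end else
            pair_valid <= 1'b0;
    end
endmodule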
7.4 HUFFMAN ENCODING
The output of the run-length encoder is used as the test sequence for the
Huffman encoder. The output of the run-length encoder appears as shown below.

31  1,1  1,2  0,1  5,2  EOB

Figure 7.3 shows the simulation waveform of the Huffman encoding block for
the above test input sequence. In the proposed look-up table approach for the
Huffman encoding block, the received run/value combination is searched in the
look-up table; when the run/value combination is found, its corresponding
Huffman code is generated [22, 29, 63]. The waveform shows that 1,1 is coded as
0003, 1,2 is coded as 0004, and so on.

Figure 7.3 Waveform of Huffman Encoder block
The simulation results of the Variable Length Encoder block show the
further compression of the image after the lossy compression using the DCT;
this additional compression achieved by the Variable Length Encoder is lossless.
After synthesis of the VLC design, the physical implementation of the Variable
Length Encoder was done using the Synopsys IC Compiler EDA tool. Figures 7.4 and
7.5 show the layout diagrams of the VLC with good placement and routing. The
layout of the Variable Length Encoder shows that the proposed design is easily
implementable with the available standard cells.
Figure 7.4 Layout of Variable Length Coding
Figure 7.5 Zoomed Version Layout of VLC
7.5 HUFFMAN DECODING
The compressed data from the variable length encoder is fed to the Huffman
decoding block as input. The output obtained from the Huffman encoding block is
used as the test input to the Huffman decoding block [84, 88]. Figure 7.6 shows
the simulation waveform for the Huffman decoding block. The waveform shows that
0003 is decoded as 1,1, 0004 is decoded as 1,2, and so on.
Figure 7.6 Waveform of Huffman Decoder block
7.6 RUN LENGTH DECODING
Compressed data sent by the encoder has to go through decompression at the
receiver. The first stage of decompression takes place in the Huffman decoder,
and the run length decoder performs the second stage. The run length decoder is
verified using the output of the Huffman decoder, which it decodes [74]. Figure
7.7 shows the simulation waveform of the run-length decoder. The decoded output
is the same as the output of the zigzag scanner.
Figure 7.7 Waveform of run-length decoder block
7.7 ZIGZAG INVERSE SCANNING
The output of the run-length decoder is given as input to the zigzag inverse
scanner, which outputs the quantized DCT coefficients. The original quantized
coefficients and the inverse scanner outputs match. Hence it is possible to
obtain the original pixel coefficients by performing the IDCT, and
reconstruction of the image can then be achieved easily.
Figure 7.8 Waveform of zigzag inverse scanner block
Figure 7.9 Layout of Variable Length Decoding
With the proposed architecture it is possible to produce a good physical
design, with place and route completing without any congestion. The physical
design of the Variable Length Decoder is shown in figure 7.9. This result shows
that the proposed architecture can be easily implemented with the available 90nm
standard cell libraries.
7.8 POWER AND AREA REPORTS
The VLC and VLD blocks have been designed using 90nm standard cells with a
0.7 V global operating voltage. The power and area requirements of all the
different blocks used in the VLC and VLD designs have been obtained from the
tool. They are tabulated in table 7.1 and also shown in the form of a bar chart
in figure 7.10.
Table 7.1 Power and Area Parameters of VLC and VLD blocks

Design                   Power (mW)   Area (µm2)   No. of cells
Zigzag Scanner           0.7319       33496.95     3063
Run Length Encoder       0.0208        1039.35      127
Huffman Encoder          0.0451        1718.14      347
Zigzag Inverse Scanner   0.7285       33597.85     3094
FIFO                     0.2744       12877.20     1184
Run Length Decoder       0.1110         715.48      106
Huffman Decoder          0.0451        1718.14      347
Figure 7.10 Bar Chart of Power and Area parameters
7.9 PERFORMANCE COMPARISON
In the synthesis process, the RTL code of the proposed architecture is
targeted to a specific standard cell library, and parameters like the area and
power consumption of the design are analysed. The power of the proposed
architecture has been compared with some existing systems, and the comparison
results are tabulated in the following sections.
7.9.1 Power Comparison of Huffman Decoder
Table 7.2 compares the results of the Huffman decoder of [40] with the
proposed architecture; the comparison is also represented as a graph in figure
7.11.
Table 7.2 Power Comparison for Huffman decoders
Table size: 100

Huffman decoder type                                                  Power (µW)
Power Analysis of the Huffman Decoding Tree, by Jason McNeely [40]    95
Proposed                                                              45
Figure 7.11 Bar chart showing the power comparison of Huffman decoders
7.9.2 Power Comparison of Run Length and Huffman Encoder
The power consumed by the proposed design is compared with [59]. The
compared results for Run Length Encoding and Huffman coding are shown in Table
7.3 and also in the form of a bar chart in figure 7.12.
Table 7.3 Power Comparison for Run Length and Huffman Encoders
RL-Huffman Encoding Type                                              Power (µW)
RL-Huffman Encoding for Test Compression and Power Reduction
in Scan Applications [59]                                             90
Proposed                                                              65.9
Figure 7.12 Bar chart showing the power comparison of RL-Huffman Encoder combination
7.9.3 Percentage of Power Saving
The percentage power savings of the proposed design are calculated and
tabulated in table 7.4.
Table 7.4 Percentage of Power Savings
Features             Comparison                                              Percentage of Power Saving
Huffman decoder      Power Analysis of the Huffman Decoding Tree [40]        52.63%
RL-Huffman Encoder   RL-Huffman Encoding for Test Compression and Power
                     Reduction in Scan Applications [59]                     26.77%
Each of the comparison results shows that, with the proposed design mapped
to 90nm standard cell libraries and low power techniques adopted, this research
work is able to achieve low power consumption without compromising the
performance of the design.
SUMMARY
The simulation and synthesis of the Variable Length Encoder and Variable
Length Decoder were described in the present chapter. The proposed architectures
of the VLC and VLD modules were designed using Verilog code and simulated using
the Modelsim simulator. The synthesis of the design was done using the RTL
Compiler EDA tool and implemented using 90nm standard cells. The physical design
of both the VLC and VLD was done using the IC Compiler back end VLSI tool. The
power analysis of each design was done and the results are tabulated. The
tabulated results show that the present research work is able to achieve low
power with the proposed architecture. Finally the compression ratio was
computed; it shows good compression without compromising performance.
CHAPTER 8
CONCLUSIONS AND SCOPE FOR FUTURE WORK
8.1 CONCLUSIONS
In the present work, the researcher has implemented a low-power
architecture for an image compression system based on the DCT and Variable
Length Coding. All of the modules in the design were modeled using synthesizable
Verilog code.
An ASIC implementation of the 2-D DCT core was designed to meet low power
constraints. The DCT and IDCT algorithms were coded using Verilog HDL, and the
design was then simulated and synthesized successfully. Simulation of the design
was done using Modelsim and the Cadence IUS simulator. The image coefficients
are obtained using MATLAB and applied as inputs to the proposed Low Power DCT
architecture. The compressed image is finally reconstructed using the IDCT. The
compressed image coefficients thus obtained are compared with the simulation
results from MATLAB and the Verilog simulator.
A novel low power architecture has been developed for the variable length
encoder and decoder for image processing applications. The design and modeling
of all the blocks in the VLC are done using synthesizable Verilog HDL. The
proposed architecture was synthesized using RTL Compiler and mapped onto
standard cells. The simulation of all blocks in the design was done using
Modelsim. The power analysis of the design was done using RTL Compiler, and low
power was achieved in all the different novel architectural blocks used in this
design.
The power consumption of the Variable Length Encoder and Decoder with 90nm
standard cells is limited to 0.798 mW and 0.884 mW respectively, with minimum
area. A 53% power saving is achieved in the dynamic power of Huffman decoding
[40] by including the look-up table approach, and a 27% power saving is achieved
in the RL-Huffman encoder [59].
8.2 ORIGINAL CONTRIBUTIONS
In this research work, algorithms and architecture have been developed for
Discrete Cosine Transform, Quantization, Variable Length Encoding and
Decoding for image compression with an emphasis on low power consumption.
These algorithms have been subsequently verified and the corresponding hardware
architectures are proposed so that they are suitable for ASIC implementation. Six
novel ideas have been presented in this thesis in order to speed up the Low
Power Image Compression implementation on ASIC; they are as follows:
• Development of algorithm and architecture for Low Power DCT and
Quantization to achieve the required compression.
• Algorithm for processing Variable Length Coding (VLC).
• Architecture of Zigzag Scanning which is used in VLC.
• Architectural design of Low Power Run Length Encoding (RLE).
• Design of Architecture for Huffman Encoding.
• Architecture of Variable Length Decoding (VLD).
8.3 SCOPE FOR FUTURE WORK
In this thesis, novel algorithms, their architectural design and the ASIC
implementation of image compression using the DCT have been presented. The
present work opens up a number of directions that may be pursued by researchers
in the near future.
The design of the proposed work is modular and scalable; therefore, it can
be upgraded to accommodate more compatible standards without appreciable
increase in hardware. Future work may also include making the processing of
images faster by including parallel architectures. Some applications, such as
medical image processing, require maintaining high image quality; where high
quality is preferred over compression, lossless variable length coding can play
a vital role. As part of future enhancement, an ultra low power architecture,
suitable for portable systems, can be implemented for both the DCT and IDCT by
redesigning the multiplier and adder circuits, which consume the major share of
power in the design presented. The power requirement can also be scaled down
drastically by reducing the picture size as well as the operating frequency and
voltage.
APPENDIX
CADENCE NCVERILOG & SIM VISION TOOLS TUTORIAL
Getting started
Incisive is a suite of tools from Cadence Design Systems for the design and
verification of ASICs, SoCs, and FPGAs. Incisive is commonly referred to by the
name NCsim, in reference to its core simulation engine. In the late 1990s, the
tool suite was known as ldv (logic design and verification).
Depending on the design requirements, Incisive has many different bundling
options of the following tools:
Tool            Command     Description
NC Verilog      ncvlog      Compiler for Verilog 95, Verilog 2001 and SystemVerilog
NC VHDL         ncvhdl      Compiler for VHDL 87, VHDL 93
NC SystemC      ncsc        Compiler for SystemC
NC Elaborator   ncelab      Unified linker/elaborator for Verilog, VHDL, and SystemC
                            libraries. Generates a simulation object file referred
                            to as a snapshot image.
NC Sim          ncsim       Unified simulation engine for Verilog, VHDL, and
                            SystemC. Loads snapshot images generated by NC
                            Elaborator. Can be run in GUI mode or batch command-line
                            mode. In GUI mode, ncsim is similar to the debug
                            features of ModelSim's vsim.
SimVision       simvision   A standalone graphical waveform viewer and netlist
                            tracer, very similar to Novas Software's Debussy.
The methodology for debugging a project design involves the following steps:
• Compiling the Verilog source code
• Running the simulation
• Viewing the generated waveforms
In this thesis, compiling and simulation are done by invoking the ncverilog tool, and the waveforms are viewed with the SimVision tool.
Pre Setup
If you are using Mac OS X or Windows, please refer to the links shown at the end for the software required to connect to the UNIX server.
For Windows:
• PuTTY
To connect with PuTTY, simply type in the domain name under the Session section, make sure that the connection type is SSH with port 22, and click Open.
You will then be prompted to input your user name and password, which should
be provided by your system admin.
• SSH Tectia Client
With SSH Tectia Client, click on Quick Connect and you will be prompted for
the domain name and user name. Make sure the port number is 22.
Click Connect and you will be prompted for your password.
• Cygwin
Run Cygwin and type in the following:
ssh -l username hafez.sfsu.edu
Here username refers to your user name. Hit enter and you will be prompted
for your password.
Setting up the Verilog environment
Once you have connected to the server, check whether you have the file
ius55.csh; if you do not, report to your system admin.
Next type in the following:
csh
source ius55.csh
Press enter. The Verilog environment is now set up successfully.
Creating/Editing Verilog Source Code:
There are many editors to choose from; here the vi editor, i.e. the gvim editor, is used.
Creating a Verilog file with gvim:
To write a verilog source file using gvim type in the following command:
gvim test1.v
The file extension must be .v, otherwise the compiler will not recognize the file.
After editing, the Verilog code looks as follows in the editor.
Compiling and Simulating:
This is the easy part. To just compile your code, do the following:
ncverilog -C test1.v
This is what you should see if everything is done correctly.
To run both the compiler and the simulator, type the following:
ncverilog test1.v
If done correctly, you should see the following output.
After running the compiler and simulator, you should notice that a folder
called INCA_libs has been created in your current directory, which holds
snapshots of the simulation. To invoke the snapshot, simply type the following
for the current program:
ncsim worklib.test1_tb:v
In this argument, test1_tb is your test bench module. If all goes well you
should see the following:
Furthermore, you will notice that in your current directory ncsim and
ncverilog have written logs of the past activities. Use any editor to view the
files.
WAVEFORM VIEWING:
Type the command as shown below:
bsub simvision
Here bsub indicates the submission of the job to the server. The waveform
terminal then opens as shown below.
You can also access the other tools in the SimVision analysis environment
through menu choices and toolbar buttons, as follows.
Next, the design browser lets you move through the design hierarchy to view
objects. You can use the design browser to select the signals that you want to
display in the waveform window.
To display the selected signals in the waveform window, select the required
signals (e.g. load, clk, reset), then click on the waveform button.
Synthesis of Variable Length Coding:
Power Analysis from Synthesis Tool:
Layout from IC Compiler:
REFERENCES
1. A. Mukherjee, J.W. Flieder, N. Ranganathan (1992), “A VLSI Chip for Variable Length Encoding and Decoding”, IEEE International Conference on Industrial Electronics and Applications (ICIEA), pp.1-4.
2. A.H. Taherinia, M. Jamzad (2009), “A Robust Image Watermarking using Two Level DCT and Wavelet Packets Denoising”, International Conference on Availability, Reliability and Security, pp.150-157.
3. A.P. Vinod, D. Rajan and A. Singla (2007), “Differential pixel-based low- power and high-speed implementation of DCT for on-board satellite image processing”, IET international Journals on Circuits, Devices and Systems, pp.444-450.
4. A.Pradini, T.M.Roffi, R.Dirza, T.Adiono (2011), “VLSI Design of a High-Throughput Discrete Cosine Transform for Image Compression System”, International Conference on Electrical Engineering and Informatics, Indonesia (ICEEI), pp.1-6.
5. Abdullah Al Muhit, Md. Shabiul Islam and Masuri Othman [2004], “VLSI Implementation of Discrete Wavelet Transform (DWT) for Image Compression”, 2nd International conference on Autonomous Robots and Agents New Zealand December,pp.391-395.
6. Ahmed Desoky and Mark Gregory (1988), “Compression of Text and Binary Files Using Adaptive Huffman Coding Technique”, Southeastcon IEEE Conference Proceedings pp.660-663.
7. Amir Z. Averbuch, F. Meyer, J.-O. Stromberg, R. Coifman and A. Vassiliou (2001), “Low Bit-Rate Efficient Compression for Seismic Data”, IEEE Transactions on Image Processing, Vol.10, Issue.12, pp.1801-1804.
8. An-Yeu Wu and K.J. Ray Liu (1994), “A Low-Power and Low-Complexity DCT/IDCT VLSI Architecture Based on Backward Chebyshev Recursion”, IEEE International Symposium on Circuits and Systems, Vol.4, pp.155-158.
9. Aslan Tchamkerten and I.Emre Telatar (2006), “Variable Length Coding Over an Unknown Channel” IEEE Transactions on Information Theory, Vol. 52, No. 5, pp. 2126-2145.
10. B. Heyne and J. Goetz (2007), “A low-power and high-quality implementation of the discrete cosine transformation”, Adv. Radio Sci., 5, 305–311, 2007.
11. Bao Ergude, Li Weisheng, Fan Dongrui, Ma Xiaoyu, (2008), “A Study and Implementation of the Huffman Algorithm Based on Condensed Huffman Table”, IEEE International Conference on Computer Science and Software Engineering, pp.42-45.
12. Basant K. Mohanty, Pramod K. Meher (2011), “Memory-Efficient Architecture for 3-D DWT Using Overlapped Grouping of Frames”, IEEE Transactions on Signal Processing, pp.5605-5616.
13. Byoung-2 Kim, Sotirios. G. Ziavras (2009), “Low Power Multiplier less DCT for Image/Video Coders”, IEEE 13th International Symposium on Consumer Electronics, pp.133-136.
14. Chi-Chia Sun, Benjamin Heyne, Juergen Goetze (2006), “A Low Power and High Quality CORDIC based Loeffler DCT”, IEEE conference.
15. Chi-chia Sun, Ce Zhang and Juergen Goetze (2010), “A Configurable IP Core for Inverse Quantized Discrete Cosine and Integer Transforms with arbitrary Accuracy”, Proceedings of IEEE Asia Pacific International Conference on Circuits and Systems (APCCAS), pp.915-918.
16. Chi-Chia Sun, Philipp Donner and Jurgen Gotze (2009), “Low-Complexity Multi-Purpose IP Core for Quantized Discrete Cosine and Integer Transform”, IEEE International Symposium on Circuits and Systems, pp.3014-3017.
17. Chin-Teng Lin, Yuan-Chu Yu, Lan-Da Van (2008), “Cost-Effective Triple-Mode Reconfigurable Pipeline FFT/IFFT/2-D DCT Processor”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 16, No.8, pp.1058-1071.
18. Chunrong Zhang, Shibao Zheng (2005),“A Novel Low-complexity and High-performance Frame-skipping Transcoder in DCT Domain”, IEEE Transactions on Consumer Electronics, Vol.51, NO.4, pp.1306-1312.
19. D.A. Karras, S.A. Karkanis and B.G. Mertzios (1998), “Image Compression Using the Wavelet Transform on Textural Regions of Interest”, 24th IEEE International Euromicro Conference, pp.633-639.
20. Da An, Xin Tong, Bingqiang Zhu and Yun He(2009), “A Novel Fast DCT Coefficient Scan Architecture”, IEEE Picture Coding Symposium I Beijing 100084, China, pp.1-4.
21. David A. Maluf, Peter B. Tran, David Tran (2008), “Effective Data Representation and Compression in Ground Data Systems”, International Conference on Aerospace, pp 1-7.
22. Ding Xing hao, Qian Kun, Xiao Quan, Liao Ying hao, Guo Dong hui, Wang Shou jue (2009),“Low Bit Rate compression of Facial Images Based on Adaptive Over-complete Sparse Representation” , IEEE 2nd International congress on Image and Signal Processing, pp.1-3.
23. Dr. Muhammad Younus Javed and Abid Nadeem (2000), “Data Compression Through Adaptive Huffman Coding Scheme”, IEEE Proceedings on TENCON ,Vol.2. pp.187-190.
24. Emy Ramola, J. Samuel Manoharan (2011), “An Area Efficient VLSI Realization of Discrete Wavelet Transform for Multi resolution Analysis”, IEEE International Conference on Electronics Computer Technology (ICECT) pp.377-381.
25. En-hui Yang, and Longji Wang (2009), “Joint Optimization of Run-Length Coding, Huffman Coding, and Quantization Table With Complete Baseline JPEG Decoder Compatibility”, IEEE Transactions on Image processing, Vol.8, Issue 1, pp.63-74.
26. En-Hui Yang, Longji Wang (2009), “Joint Optimization of Run-Length Coding, Huffman Coding, and Quantization Table with Complete Baseline JPEG Decoder Compatibility”, IEEE Transactions on Image Processing, pp.63-74.
27. F.M.Bayer and R.J.Cintra (2010),“Image Compression via a Fast DCT Approximation” IEEE Latin America Transactions, VOL. 8, NO. 6, pp.708-713.
28. Giridhar Mandyam, Nasir Ahmed and Samuel D. Stearns (1995), “A Two-Stage Scheme for Lossless Compression of Images”, IEEE International Symposium on Circuits and Systems.Vol.2 , pp.1102-1105.
29. Gopal Lakhani (2004), “Optimal Huffman Coding of DCT Blocks”, IEEE transactions on circuits and systems for video technology, Vol.14,issue.4. pp 522-527.
30. Gregory K.Wallace (1991), “The JPEG Still Picture Compression Standard”, IEEE transactions on consumer electronics, Vol.38, Issue.1, pp.18-38.
31. Hai Huang, Tze-Yun Sung, Yaw-shih Shieh (2010), “A Novel VLSI Linear array for 2-D DCT/IDCT”, IEEE 3rd International Congress on Image and Signal Processing, pp.3680-3690.
32. Hassan Shojania and Subramania Sudarsanan (2005), “A High Performance CABAC Encoder”, Proceedings of the 3rd International Conference, pp.315-318.
33. Hatim Anas, Said Belkouch, M.El Aakif, Noureddine Chabini(2010), “FPGA Implementation of a Pipelined 2D DCT and Simplified Quantization for Real Time Applications”, IEEE International Conference on Multimedia Computing and Systems , pp.1-6.
34. He Cuiqun, Liu Guodong, Xie Zhihua (2010),“Infrared Face Recognition Based on Blood Perfusion and weighted block-DCT in Wavelet Domain” International Conference on Computational Intelligence and Security,pp.283-287.
35. Hitomi Murakami, Shuichi Matsumoto, Hideo Yamamoto(1984), “Algorithm for Construction of Variable Length Code with Limited Maximum Word Length”, IEEE Transactions on Communications, Vol. Com-32, No.10,pp.1157-1159.
36. Hyoung Joong Kim (2009), “A New Lossless Data Compression Method”, IEEE International Conference on Multimedia and Expo (ICME), pp.1740-1743.
37. Jack Venbrux, Pen-Shu and Muye (1992), “A VLSI Chip Set for High-Speed Lossless Data Compression”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 2, Issue.4, pp. 381-391.
38. Jaehwan Jeon, Jinhee Lee, and Joonki Paik (2011), “Robust Focus Measure for Unsupervised Auto-Focusing Based on Optimum Discrete Cosine Transform Coefficients”, IEEE Transactions on Consumer Electronics, Vol. 57, No. 1, pp.1-5.
39. Jason McNeely and Magdi Bayoumi (2007), “Low Power Look-Up Tables for Huffman Decoding”, IEEE International Conference on Image Processing , pp.465-468.
40. Jason McNeely, Yasser Ismail, Magdy A. Bayoumi and Peiyi Zaho (2008).” Power Analysis of the Huffman Decoding Tree”, 15th IEEE International Conference on Image Processing, pp.1416-1419.
41. Jer Min Jou and Pei-Yin Chen (1999), “A Fast and Efficient Lossless Data-Compression Method”, IEEE Transactions on communications, Vol.47.Issue.9, pp.1278-1283.
42. Jia-Yu Lin, Ying Liu, and Ke-Chu Yi(2004), “ Balance of 0,1 Bits for Huffman and Reversible Variable-Length Coding”, IEEE Journal on Communications, pp. 359-361.
43. Jin Li Weiwei Chen Moncef Gabbouj Jarmo Takala Hexin Chen (2011) “Prediction of Discrete Cosine Transformed Coefficients in Resized Pixel Blocks”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp.1045-1048.
44. Jin Li, Moncef Gabbouj, Jarmo Takala and Hexin Chen (2009), “Direct 3-D DCT-to-DCT Resizing Algorithm for Video Coding”, Proceedings of the 6th International Symposium on Image and Signal Processing and Analysis, pp.105-110.
45. Jing-Ming Guo and Chia-Hao Chang (2009), “Prediction-Based Watermarking Schemes for DCT-Based Image Coding”, 5th IEEE International Conference on Information Assurance and Security, pp.619-621.
46. Jongsun Park, Kaushik Roy (2008), “A Low Complexity Reconfigurable DCT Architecture to Trade off Image Quality for Power Consumption”, IEEE Transactions on Speech and Image Processing, Vol.5, pp.17-20.
47. Kamrul Hasan Talukder and Koichi Harada (2007), “Discrete Wavelet Transform for Image Compression and A Model of Parallel Image Compression Scheme for Formal Verification”, Proceedings of the World Congress on Engineering.
48. Koen Denecker, Jeroen Van Overloop and Ignace Lemahieu (1997), “An Experimental Comparison of Several Lossless Image Coders for Medical Images”, IEEE International Conference on Data Compression.
49. Kyeounsoo Kim and Peter A. Beerel (1999), “A High-Performance Low-Power Asynchronous Matrix-Vector Multiplier for Discrete Cosine Transform”, IEEE Asia Pacific International Conference on ASICs, pp.135-138.
50. L.Y. Liu, J.F. Wang, R.J. Wang, J.Y. Lee (1995), “Design and Hardware Architectures for Dynamic Huffman Coding”, IEE Proceedings on Computers and Digital Techniques, Vol.142, Issue 6, pp.411-418.
51. Laurentiu Acasandrei, Marius Neag (2008), “A Fast Parallel Huffman Decoder for FPGA Implementation”, ACTA TECHNICA NAPOCENSIS, Volume 49, Number 1, pp.8-15.
52. Li Wenna, Gao Yang, Yi Yufeng, Gao Liqun (2011), “Medical Image Coding Based on Wavelet Transform and Distributed Arithmetic Coding”, Chinese Control and Decision Conference (CCDC), pp.4159-4162.
53. Liang-Wie, Liang-Ying Liu, Jhing-Fa Wang and Jau-Yien Lee (1993), “Dynamic Mapping Technique for Adaptive Huffman Code”, IEEE International Conference on Computer, Communication, Control and Power Engineering, Vol.3, pp.653-656.
54. Lili Liu, Hexin Chen, Aijun Sang, Haojing Bao (2011), “Four-Dimensional Vector Matrix DCT Integer Transform Codec Based on Multi-Dimensional Vector Matrix Theory”, IEEE Fourth International Conference on Intelligent Computation Technology and Automation (ICICTA), pp.552-555.
55. Lin Ma, Songnan Li, Fan Zhang and King Ngi Ngan (2011), “Reduced Reference Image Quality Assessment Using Reorganized DCT Based Image Representation”, IEEE Transactions on Multimedia, Vol.13, No.4, pp.824-829.
56. M. El Aakif, S. Belkouch, N. Chabini, M.M. Hassani (2011), “Low Power and Fast DCT Architecture Using Multiplier-Less Method”, IEEE International Conference on Faible Tension Faible Consommation (FTFC), pp.63-66.
57. M. El Aakif, S. Belkouch, N. Chabini, M.M. Hassani (2011), “Low Power and Fast DCT Architecture Using Multiplier-Less Method”, International Journal on Faible Tension Faible Consommation, pp.63-66.
58. M. Jridi and A. Alfalou (2010), “A Low-Power, High-Speed DCT Architecture for Image Compression: Principle and Implementation”, 18th IEEE/IFIP International Conference on VLSI and System-on-Chip (VLSI-SoC 2010), pp.304-309.
59. M.H. Tehranipour, M. Nourani, K. Arabi, A. Afzali-Kusha (2004), “Mixed RL-Huffman Encoding for Power Reduction and Data Compression in Scan Test”, Proceedings of the International Symposium on Circuits and Systems, Vol.2, pp.681-684.
60. M.R.M. Rizk (2007), “Low Power Small Area High Performance 2D-DCT Architecture”, 2nd International Design and Test Workshop (IDT), pp.120-125.
61. Majdi Elhaji, Abdelkrim Zitouni, Samy Meftali, Jean-Luc Dekeyser and Rached Tourki (2011), “A Low Power and Highly Parallel Implementation of the H.264 8x8 Transform and Quantization”, Proceedings of 10th IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), pp.528-531.
62. Marcelo J. Weinberger, Gadiel Seroussi, Guillermo Sapiro (1996), “LOCO-I: A Low Complexity, Context-Based, Lossless Image Compression Algorithm”, IEEE International Conference on Data Compression, pp.140-149.
63. Md. Al Mamun, Xiuping Jia and Michael Ryan (2009), “Adaptive Data Compression for Efficient Sequential Transmission and Change Updating of Remote Sensing Images”, IEEE International Symposium on Geoscience and Remote Sensing (IGARSS), pp.498-501.
64. Med Lassaad Kaddachi Nouira, Adel Soudani, Vincent Lecuire, Kholdoun Torki (2010), “Efficient Hardware Solution for Low Power and Adaptive Image Compression in WSN”, Proceedings of IEEE 17th International Conference on Electronics, Circuits and Systems (ICECS), pp.583-586.
65. Muhammad Bilal Akhtar, Adil Masoud Qureshi, Qamar-Ul-Islam (2011), “Optimized Run Length Coding for JPEG Image Compression Used in Space Research Program of IST”, IEEE International Conference on Computer Networks and Information Technology (ICCNIT), pp.81-85.
66. Muhammed Yusuf Khan, Ekram Khan and M. Salim Beg (2008), “Performance Evaluation of 4x4 DCT Algorithms for Low Power Wireless Applications”, International Conference on Emerging Trends in Engineering and Technology, pp.1284-1286.
67. Muhammed Yusuf Khan, Ekram Khan, M. Salim Beg (2008), “Performance Evaluation of 4x4 DCT Algorithms for Low Power Wireless Applications”, First International Conference on Emerging Trends in Engineering and Technology, pp.1284-1286.
68. Munish Jindal, RSV Prasad and K. Ramkishor (2003), “Fast Video Coding at Low Bit-Rates for Mobile Devices”, International Conference on Information, Communication and Signal Processing, Vol.1, pp.483-487.
69. Mustafa Safa Al-Wahaiba, Kosshiek Wong (2010), “A Lossless Image Compression Algorithm Using Duplication Free Run-Length Coding”, Second International Conference on Network Applications, Protocols and Services (NETAPPS), pp.245-250.
70. N. Venugopal and S. Ramachandran (2009), “Design and FPGA Implementation of Fast Variable Length Coder for a Video Encoder”, International Journal of Computer Science and Network Security (IJCSNS), Vol.9, No.7, pp.178-184.
71. Pablo Montero, Javier Taibo (2010), “Parallel Zigzag Scanning and Huffman Coding for a GPU-Based MPEG-2 Encoder”, IEEE International Symposium on Multimedia, pp.97-104.
72. Paul G. Howard and Jeffrey Scott Vitter (1991), “Analysis of Arithmetic Coding for Data Compression”, IEEE International Conference on Data Compression, pp.3-12.
73. Paulo Roberto Rosa Lopes Nunes (2006), “Segmented Optimal Linear Prediction applied to Lossless Image Coding”, IEEE International Symposium on Telecommunications, pp.524-528.
74. Pei-Yin Chen, Yi-Ming Lin and Min-Yi Cho (2008), “An Efficient Design of Variable Length Decoder for MPEG-1/2/4”, IEEE Transactions on Multimedia, Vol.16, Issue 9, pp.1307-1315.
75. Peng Wu, Chuangbai Xiao, Shoudao Wang, Mu Ling (2009), “An Efficient Method for Early Detecting All-Zero Quantized DCT Coefficients for H.264/AVC”, IEEE International Conference on Systems, Man and Cybernetics, San Antonio, USA, pp.3797-3800.
76. Piyush Kumar Shukla, Pradeep Rusiya, Deepak Agrawal, Lata Chhablani, Balwant Singh (2009), “Multiple Subgroup Data Compression Technique Based on Huffman Coding”, First International Conference on Computational Intelligence, Communication Systems and Networks (CICSYN), pp.397-402.
77. Raymond K.W. Chan, Moon-Chuen Lee (2006), “Multiplierless Approximation of Fast DCT Algorithms”, IEEE International Conference on Multimedia and Expo, pp.1925-1928.
78. Reza Hashemian (1994), “Design and Hardware Implementation of a Memory Efficient Huffman Decoding”, IEEE Transactions on Consumer Electronics, Vol.40, No.3, pp.345-350.
79. Reza Hashemian (2003), “Direct Huffman Coding and Decoding Using the Table of Code-Lengths”, IEEE International Conference on Information Technology: Coding, Computers and Communication, pp.237-241.
80. Ricardo Castellanos, Hari Kalva and Ravi Shankar (2009), “Low Power DCT using Highly Scalable Multipliers”, 16th IEEE International Conference on Image Processing, pp.1925-1928.
81. S. Ramachandran and S. Srinivasan (2002), “A Novel, Automatic Quality Control Scheme for Real Time Image Transmission”, VLSI Design, Vol.14, No.4, pp.329-335.
82. S.V.V. Sateesh, R. Sakthivel, K. Nirosha, Harish M. Kittur (2011), “An Optimized Architecture to Perform Image Compression and Encryption Simultaneously Using Modified DCT Algorithm”, IEEE International Conference on Signal Processing, Communication, Computing and Network Technologies, pp.442-447.
83. S.Vijay, D. Anchit (2009), “Low Power Implementation of DCT for On-Board Satellite Image Processing Systems”, 52nd IEEE International Symposium on Circuits and Systems, pp.774-777.
84. Shang Xue and Bengt Oelmann (2003), “Efficient VLSI Implementation of a VLC Decoder for Universal Variable Length Code”, Proceedings of the IEEE Computer Society Annual Symposium on VLSI.
85. Stephen Molloy and Rajeev Jain (1997), “Low Power VLSI Architectures for Variable-Length Encoding and Decoding”, Proceedings of the 40th International Midwest Symposium on Circuits and Systems, pp.997-1000.
86. Sungwook Yu and Earl E. Swartzlander Jr (2001), “DCT Implementation with Distributed Arithmetic”, IEEE Transactions on Computers, Vol.50, Issue 9, pp.985-991.
87. Sung-Wen Wang, Shang-Chih Chuang, Chih-Chieh Hsiao, Yi-Shin Tung and Ja-Ling Wu (2008), “An Efficient Memory Construction Scheme for an Arbitrary Side Growing Huffman Table”, IEEE International Conference on Multimedia and Expo, pp.141-144.
88. Sung-Won Lee (2003), “A Low-Power Variable Length Decoder for MPEG-2 Based on Successive Decoding of Short Codewords”, IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, Vol.50, No.2, pp.73-82.
89. Sunil Bhooshan, Shipra Sharma (2009), “An Effective and Selective Image Compression Scheme Using Huffman and Adaptive Interpolation”, 24th IEEE International Conference on Image and Vision Computing New Zealand, pp.197-202.
90. Sunil Bhooshan, Shipra Sharma (2009), “An Efficient and Selective Image Compression Scheme Using Huffman and Adaptive Interpolation”, 24th International Conference on Image and Vision Computing New Zealand (IVCNZ), pp.1-3.
91. Taizo Suzuki and Masaaki Ikehara (2011), “Integer Fast Lapped Orthogonal Transform Based on Direct-Lifting of DCTs for Lossless-to-Lossy Image Coding”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.1525-1528.
92. Thucydides Xanthopoulos, A.P. Chandrakasan (2000), “A Low-Power DCT Core Using Adaptive Bitwidth and Arithmetic Activity Exploiting Signal Correlations and Quantization”, IEEE Journal of Solid-State Circuits, Vol.35, No.5, pp.740-750.
93. Tze-Yun Sung, Yaw-Shih Shieh, Chun-Wang Yu and Hsi-Chin Hsin (2006), “High Efficiency and Low Power Architectures for 2-D DCT and IDCT Based on CORDIC Rotation”, Seventh International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), pp.191-196.
94. Vijay Kumar Sharma, K. K. Mahapatra and Umesh C. Pati (2011) “An Efficient Distributed Arithmetic based VLSI Architecture for DCT”, Proceedings of IEEE International Conference on Devices and Communications, pp.1-5.
95. Vijay Kumar Sharma, U.C. Pati and K.K. Mahapatra (2010), “A Study of Removal of Subjective Redundancy in JPEG for Low Cost, Low Power, Computation Efficient Circuit Design and High Compression Image”, Proceedings of IEEE International Conference on Power, Control and Embedded Systems (ICPCES), pp.1-6.
96. Vimal P. Singh Thoudam, B. Bhaumik, S. Chatterjee (2010), “Ultra Low Power Implementation of 2-D DCT for Image/Video Compression”, International Conference on Computer Applications and Industrial Electronics (ICCAIE), pp.532-536.
97. Wei-Yeo Chiu, Yu-Ming Lee and Yinyi Lin (2010), “Advanced Zero-Block Mode Decision Algorithm for H.264/AVC Video Coding”, Proceedings of IEEE International Conference (TENCON), pp.687-690.
98. Wenna Li, Zhaohua Cui (2010), “Low Bit Rate Image Coding Based on Wavelet Transform and Color Correlative Coding”, International Conference on Computer Design and Applications (ICCDA), pp.479-482.
99. Y.M. Lin and P.Y. Chen (2006), “A Low-Cost VLSI Implementation for VLC”, IEEE International Conference on Industrial Electronics and Applications (ICIEA), pp.1-4.
100. Y.P. Lee, Chen (1997), “A Cost Effective Architecture for 8x8 Two-Dimensional DCT/IDCT Using Direct Method”, IEEE Transactions on Circuits and Systems for Video Technology, Vol.7, Issue 9, pp.459-467.
101. Y. Wongsawat, H. Ochoa, K.R. Rao (2004), “A Modified Hybrid DCT-SVD Image-Coding System”, International Symposium on Communications and Information Technologies (ISCIT), pp.766-769.
102. Yan Lu, Wen Gao and Feng Wu (2003), “Efficient Video Coding with Fractional Resolution Sprite Prediction Technique”, Electronics Letters, pp.279-280.
103. Yongli Zhu, Zhengya Xu (2006), “Adaptive Context Based Coding for Lossless Color Image Compression”, IMACS Multiconference on Computational Engineering in Systems Applications (CESA), Beijing, China, pp.1310-1314.
104. Yushi Chen, Yuhang Zhang, Ye Zhang, Zhixin Zhou (2011), “Fast Vector Quantization Algorithm for Hyperspectral Image Compression”, IEEE International Conference on Data Compression, p.450.
LIST OF PUBLICATIONS
NATIONAL CONFERENCES
1. VijayaPrakash A M and K.S. Gurumurthy (2010), “Design and Implementation of AES Algorithm (AES-128)”, National Conference CMNE-2010 at SIT, Tumkur, Karnataka.
2. VijayaPrakash A M, K.S. Gurumurthy and Sindura Prakash (2010), “Design and Implementation of Low Power SDRAM Controller”, National Conference 2010 at VCET, Bellary, Karnataka.
INTERNATIONAL CONFERENCES
1. VijayaPrakash A M and K.S. Gurumurthy (2008), “Design and Implementation of a High Speed and Low Power Constant Multiplier”, International Conference on Emerging Microelectronics and Interconnection Technologies (EMIT-2008) at National Institute of Advanced Studies (NIAS), Bangalore, India.
2. VijayaPrakash A M, Anoop R. Katti and Shakeeb Ahabed Pasha B K (2011), “Novel VLSI Architecture for Real Time Blind Source Separation”, IEEE International Conference ARTCOM-2011 at Reva College of Engineering, Bangalore, India.
INTERNATIONAL JOURNALS
1. VijayaPrakash A M and K.S. Gurumurthy (September 2010), “A Novel VLSI Architecture for Digital Image Compression Using Discrete Cosine Transform and Quantization”, International Journal of Computer Science and Network Security (IJCSNS) [ISSN: 1738-7906], Vol.10, No.9. [Impact Factor: 1.047]
2. VijayaPrakash A M and K.S. Gurumurthy (December 2010), “A Novel VLSI Architecture for Image Compression Model Using Low Power Discrete Cosine Transform”, International Journal of World Academy of Science, Engineering and Technology (WASET) [ISSN: 1307-6892], Year 6, Issue 72. [Impact Factor: 1.0]
3. VijayaPrakash A M and K.S. Gurumurthy (December 2011), “A Novel VLSI Architecture for Low Power FIR Filter”, International Journal of Advanced Engineering and Applications (IJAEA) [ISSN: 0975-7791].
4. VijayaPrakash A M and K.S. Gurumurthy (January 2012), “VLSI Architecture for Low Power Variable Length Encoding and Decoding for Image Processing Applications”, International Journal of Advances in Engineering and Technology (IJAET) [ISSN: 2231-1963], Vol.2, Issue 1. [Impact Factor: 1.96]
5. VijayaPrakash A M and D. Preethi (June 2012), “A Low Power VLSI Architecture for Image Compression System Using DCT and IDCT”, International Journal of Engineering and Advanced Technology (IJEAT) [ISSN: 2249-8958], Vol.1, Issue 5.
VITAE
VIJAYAPRAKASH A M, No. 233, Third Stage, Fourth Block, Basaveswaranagar, Bangalore – 560 079. Ph: 9844658446, Email: [email protected]
Educational Qualification:
B.E. in Electronics Engineering from University Visvesvaraya College of Engineering, Bangalore, affiliated to Bangalore University, with First Class, in the year 1992.
M.E. in Digital Electronics from S.D.M College of Engineering and Technology, Dharwad, affiliated to Karnataka University, with First Class with Distinction, in the year 1997.
Pursuing Ph.D. (“Low Power VLSI Architecture for Image Compression Using Discrete Cosine Transform”) from Dr. M.G.R University, Chennai.
Software skills:
VHDL, VERILOG, SYSTEM VERILOG, C Language and MATLAB.
EDA tools: Cadence, Synopsys, Xilinx Synthesis Tool, ModelSim.
Working Experience:
Presently working as an Associate Professor and P.G. Coordinator in the Department of Electronics and Communication Engineering at Bangalore Institute of Technology, Bangalore, since December 1997.
Worked as Lecturer in the Department of Electronics and Communication
Engineering at SJC Institute of Technology, Chikkaballapur from November 1994
to August 1995 and February 1997 to December 1997.
Worked as Apprentice Trainee at Bharat Electronics Ltd, Bangalore from August 1992 to July 1993.
Worked as Lecturer in the Department of Electronics and Communication
Engineering at Vidyavikas Polytechnic, Bangalore from August 1993 to October
1994.
PERSONAL DETAILS:
NAME: VIJAYAPRAKASH A.M
DATE OF BIRTH: 18-05-1967
PERMANENT ADDRESS: No. 233, Third Stage, Fourth Block, Basaveswaranagar, Bangalore – 560 079
DECLARATION
I declare that the above particulars are true to the best of my knowledge and belief.
(VIJAYAPRAKASH A M)