LOW POWER VLSI ARCHITECTURE FOR IMAGE COMPRESSION
USING DISCRETE COSINE TRANSFORM
A THESIS
Submitted by
VIJAYAPRAKASH A M
For the award of the degree
of
DOCTOR OF PHILOSOPHY
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
DR. M.G.R EDUCATIONAL AND RESEARCH INSTITUTE
UNIVERSITY (Declared U/S 3 of the UGC Act, 1956)
CHENNAI – 600 095
JULY 2012
DECLARATION
I declare that the thesis entitled “LOW POWER VLSI ARCHITECTURE FOR
IMAGE COMPRESSION USING DISCRETE COSINE TRANSFORM”
submitted by me for the degree of Doctor of Philosophy is the record of work
carried out by me during the period from August 2004 to July 2012 under the
guidance of Dr. K. S. GURUMURTHY and has not formed the basis for the
award of any degree, diploma, associateship, fellowship or other titles in this or
any other university or other similar institution of higher learning.
Signature of the candidate
BONAFIDE CERTIFICATE
Certified that this thesis titled “LOW POWER VLSI ARCHITECTURE FOR
IMAGE COMPRESSION USING DISCRETE COSINE TRANSFORM” is
the bonafide work of Mr. VIJAYAPRAKASH A M, who carried out the
research work under my supervision. Certified further, that to the best of my
knowledge, the work reported herein does not form part of any other thesis or
dissertation on the basis of which a degree or award was conferred on an earlier
occasion on this or any other candidate.
Dr. K. S. GURUMURTHY
Professor, DOS in Electronics and Communication Engineering,
University Visvesvaraya College of Engineering,
Bangalore - 560 001
ABSTRACT
Image data compression refers to a process in which the amount of data
used to represent an image is reduced to meet a bit rate requirement (below or at
most equal to the maximum available bit rate), while the quality of the
reconstructed image satisfies a requirement for a certain application and the
complexity of the computation involved is affordable for the application.
Data compression methods play an important role in data storage and
transmission. Compression is the process of converting an input data stream into
another data stream that has a smaller size. In image compression, irrelevant and
redundant image data are reduced in order to store or transmit the data in an
efficient form. Image coding algorithms and techniques are developed to optimize
the bit rate and the quality of the image. Image compression has applications in
many fields such as digital video, video conferencing, and video over wireless
networks and the Internet. In image compression, redundant information is
removed, which is possible due to the high correlation of the image data. For the
increasing number of portable wireless devices, the key design constraint is power
dissipation. Limited battery life constrains portable devices to low power
dissipation; advances in battery life do not grow as fast as the density and the
operating frequency of ASICs. The ever growing circuit densities and operating
frequencies of ASICs only result in higher power dissipation. Since early studies
focused only on high throughput DCTs with variable length coders, low-power
DCTs and low-power variable length coders have not received much attention.
The target of multimedia systems is moving towards portable applications like
laptops, mobiles and iPods. These systems demand low power operation, and thus
require low power functional units.
The proposed work is a realization of a low power two dimensional
Discrete Cosine Transform for image compression. The architecture uses
row-column decomposition, so the number of calculations for processing an 8x8
block of pixels is reduced. The 1-D DCT operation is expressed as a sum of
vector-scalar products, and basic common computations are identified and shared
to reduce computational complexity. Compared to a distributed arithmetic based
architecture, the proposed DCT consumes less power. The DCT and IDCT cores
are implemented as ASICs for low power consumption.
The proposed work is a design and implementation of a low power
architecture for two dimensional DCT and variable length coding for image
compression. The 2-D DCT calculation is performed by using the separability
property of the 2-D DCT, such that the whole architecture is divided into two
1-D DCT calculations connected by a transpose RAM. Vector processing using
parallel multipliers is the method used for the implementation of the DCT. The
advantages of the vector processing method are regular structures, simple control
and interconnect, and a good balance between performance and implementation
complexity. The DCT and IDCT cores are implemented in an ASIC that consumes
less power.
Variable length coding maps input source data onto code words of variable
length; it is an efficient method to minimize the average code length.
Compression is achieved by assigning short code words to input symbols of high
probability and long code words to those of low probability. Variable length
coding can be successfully used to relax the bit-rate requirements and storage
spaces for many multimedia applications. For example, a variable length coder
(VLC) employed in MPEG-2, along with the Discrete Cosine Transform (DCT),
results in very good compression efficiency.
In this work the researcher pursues an ASIC design for the image
compression system. Firstly, the system compresses the image using DCT and
quantization. Next, variable length coding is applied to the compressed image
so that further compression is achieved; finally, the IDCT and variable length
decoding are used to retrieve the image. Compression algorithms require
different amounts of processing power to encode and decode; some high
compression algorithms require high processing power. So in this work, the
present researcher concentrates on a low power VLSI design for the image
compression system while at the same time obtaining a good compression ratio.
The development of low power compression algorithms and architecture is not
only challenging but also intellectually stimulating.
In this research work, algorithms and architecture have been developed for
Discrete Cosine Transform, Quantization, Variable Length Encoding and
Decoding for image compression with an emphasis on low power consumption.
These algorithms have been subsequently verified and the corresponding hardware
architectures are proposed so that they are suitable for ASIC implementation.
The DCT and IDCT architecture was first coded in Matlab in order to prove
the concepts and design methodology proposed for the work. After it was
successfully coded and tested, the VLSI design of the architecture was coded in
Verilog, a popular hardware description language used in industry, conforming to
RTL coding guidelines. The proposed hardware architecture for image
compression was synthesized using RTL Compiler and mapped using 65nm
node standard cells. Simulation was done using the Modelsim simulator. Detailed
analysis of power and area was done using Design Compiler (DC), a Synopsys
EDA tool. The power consumption of the DCT and IDCT is limited to 0.4350 mW
and 0.5519 mW with cell areas of 34983.35 µm² and 34903.79 µm² respectively.
The variable length encoder is mapped using 90nm node standard cells; its power
consumption is limited to 1.5790 µW with a minimum cell area of 5409.922 µm².
The physical design of the proposed hardware in this research was done using IC
Compiler.
ACKNOWLEDGEMENT
The joy and satisfaction that accompany the successful completion of
any task would be incomplete without the mention of those who made it possible.
I am grateful to now have the opportunity to thank all those people who have
helped me in different capacities to complete this thesis work successfully.
I would like to thank Dr. Thirunavakarasu, Dean Research, Dr. M.G.R
Educational and Research Institute, University, for his inspiration and support
during the period of this thesis work.
I express my deep sense of gratitude towards my guide Dr. K.S. Gurumurthy,
Professor and Chairman, Department of Electronics and Communication
Engineering, UVCE, Bangalore University, Bangalore, for giving me his invaluable
guidance, motivation, confidence and support for the speedy completion of this
thesis work.
I sincerely thank Dr. S. Ravi, Professor and HOD, Department of ECE,
Dr. M.G.R Educational and Research Institute, University, who has given constant
support and motivation in the completion of this thesis.
I thank wholeheartedly my wife Mrs. Geetha S and my daughter
Jahnavi V for their support and encouragement.
Also, my sincere thanks to my industry professional friends and all my
colleagues for their constant encouragement and moral support. They have been in
some way or the other responsible for the successful completion of this thesis.
VijayaPrakash A M
TABLE OF CONTENTS

CHAPTER NO    TITLE

              ABSTRACT
              LIST OF TABLES
              LIST OF FIGURES
              LIST OF ABBREVIATIONS

1   INTRODUCTION
    1.1  IMAGE DATA COMPRESSION
    1.2  NEED FOR IMAGE COMPRESSION
    1.3  PRINCIPLES BEHIND COMPRESSION
    1.4  DIFFERENT TYPES OF REDUNDANCIES IN IMAGE
         1.4.1  Coding Redundancy
         1.4.2  Interpixel Redundancy
         1.4.3  Psychovisual Redundancy
    1.5  TYPES OF JPEG COMPRESSION
         1.5.1  Sequential DCT Based
         1.5.2  Progressive DCT Based
         1.5.3  Lossless Mode
         1.5.4  Hierarchical Mode
    1.6  LOSSLESS VERSUS LOSSY COMPRESSION
         1.6.1  Predictive Versus Transform Coding
    1.7  DCT PROCESS
    1.8  JPEG IMAGE COMPRESSION
         1.8.1  Input Transformer
         1.8.2  Quantization
         1.8.3  Entropy Coding
         1.8.4  Run-Length Encoding
         1.8.5  Huffman Encoding
    1.9  APPLICATIONS OF DCT
    1.10 DCT ALGORITHMS
         1.10.1 One-Dimensional DCT
         1.10.2 Two-Dimensional DCT
    1.11 DCT ARCHITECTURES
         1.11.1 Two-Dimensional Approaches
         1.11.2 Row-Column Decomposition
         1.11.3 Direct Method
         1.11.4 Distributed Arithmetic Algorithms
    1.12 PROPERTIES OF DCT
    1.13 ORGANIZATION OF THE THESIS
    SUMMARY

2   REVIEW OF LITERATURE
    2.1  INTRODUCTION

3   VLSI DESIGN FLOW AND LOW POWER VLSI DESIGN
    3.1  INTRODUCTION
    3.2  ASIC DESIGN FLOW
    3.3  DESIGN DESCRIPTION
    3.4  DESIGN OPTIMIZATION
    3.5  BEHAVIORAL SIMULATION
         3.5.1  Specification for the Design
         3.5.2  Behavioral or RTL Design
    3.6  VERIFICATION OF THE DESIGN
    3.7  SYNTHESIS OF THE DESIGN
    3.8  VLSI PHYSICAL DESIGN
         3.8.1  Floor Planning
         3.8.2  Placement
         3.8.3  Routing
         3.8.4  Physical Verification
    3.9  POST LAYOUT SIMULATION
    3.10 LOW POWER VLSI DESIGN
    3.11 SOURCES OF POWER DISSIPATION
    3.12 STATIC POWER CONSUMPTION
    3.13 DYNAMIC POWER DISSIPATION
    3.14 LOAD CAPACITANCE TRANSIENT DISSIPATION
         3.14.1 Internal Capacitance Transient Dissipation
         3.14.2 Current Spiking During Switching
    3.15 POWER REDUCTION TECHNIQUES IN VLSI DESIGN
         3.15.1 Clock Gating
         3.15.2 Asynchronous Logic
         3.15.3 Multi Vdd Techniques
         3.15.4 Architectural Level
    SUMMARY

4   LOW POWER VLSI ARCHITECTURE FOR DISCRETE COSINE TRANSFORM
    4.1  INTRODUCTION
    4.2  DCT MODULE
         4.2.1  Mathematical Description of the DCT
    4.3  BLOCK DIAGRAM OF DCT CORE
    4.4  TWO DIMENSIONAL DCT CORE
         4.4.1  Behavioral Model for Vector Processing
         4.4.2  Transpose Buffer
         4.4.3  Two Dimensional DCT Architecture
    4.5  INVERSE DISCRETE COSINE TRANSFORM
         4.5.1  Storage / RAM Section
         4.5.2  IDCT Core Block Diagram
    SUMMARY

5   LOW POWER ARCHITECTURE FOR VARIABLE LENGTH ENCODING AND DECODING
    5.1  INTRODUCTION
    5.2  VARIABLE LENGTH ENCODING
         5.2.1  Zigzag Scanning
         5.2.2  Run Length Encoder
         5.2.3  Huffman Encoding
         5.2.4  Interconnection of VLE Blocks
    5.3  VARIABLE LENGTH DECODER
         5.3.1  Huffman Decoder
         5.3.2  Block Diagram of FIFO
         5.3.3  Run Length Decoder
         5.3.4  Zigzag Inverse Scanner
    SUMMARY

6   SIMULATION AND SYNTHESIS RESULTS OF DCT AND IDCT MODULES
    6.1  INTRODUCTION
    6.2  MATLAB IMPLEMENTATION OF DCT AND IDCT MODULES
         6.2.1  DCT Methodology
         6.2.2  Quantization
    6.3  MATLAB RESULTS
    6.4  VLSI DESIGN OF THE PROPOSED ARCHITECTURE
    6.5  SIMULATION RESULTS USING VERILOG
    6.6  COMPARISON OF MATLAB AND HDL SIMULATION RESULTS
         6.6.1  2-D DCT Simulation Results Using Matlab and Verilog
         6.6.2  IDCT Simulation Results
    6.7  SYNTHESIS RESULTS OF DCT AND IDCT
    6.8  IMAGE COMPRESSION USING DISCRETE COSINE TRANSFORM AND QUANTIZATION
    6.9  RECONSTRUCTION OF IMAGE USING IDCT
    6.10 SIMULATION RESULT OF IMAGE AFTER COMPRESSION
    6.11 FPGA IMPLEMENTATION OF THE DCTQ
         6.11.1 Device Utilization Summary
         6.11.2 HDL Synthesis Results
    SUMMARY

7   SIMULATION AND SYNTHESIS RESULTS OF VARIABLE LENGTH ENCODING AND DECODING MODULE
    7.1  INTRODUCTION
    7.2  ZIGZAG SCANNING
    7.3  RUN LENGTH ENCODING
    7.4  HUFFMAN ENCODING
    7.5  HUFFMAN DECODING
    7.6  RUN LENGTH DECODING
    7.7  ZIGZAG INVERSE SCANNING
    7.8  POWER AND AREA REPORTS
    7.9  PERFORMANCE COMPARISON
         7.9.1  Power Comparison of Huffman Decoder
         7.9.2  Power Comparison of Run Length and Huffman Encoder
         7.9.3  Percentage of Power Saving
    SUMMARY

8   CONCLUSIONS AND SCOPE FOR FUTURE WORK
    8.1  CONCLUSION
    8.2  ORIGINAL CONTRIBUTIONS
    8.3  SCOPE FOR FUTURE WORK

APPENDIX
REFERENCES
LIST OF PUBLICATIONS
VITAE
LIST OF TABLES

TABLE NO    TITLE

5.1  Comparison between conventional and proposed RLE
6.1  Signal Description of DCT core
6.2  Signal Description of IDCT core
6.3  Power and Area Characteristics of DCT and IDCT using 65nm standard cells
7.1  Power and Area Parameters of VLC and VLD blocks
7.2  Power Comparison for Huffman decoders
7.3  Power Comparison for Run Length and Huffman Encoders
7.4  Percentage of Power Savings
LIST OF FIGURES

FIGURE NO    TITLE

1.1  Sequential Coding and Progressive coding
1.2  Hierarchical multi resolution coding
1.3  DCT Process for compression
1.4  DCT Coefficients
1.5  Image compression model
1.6  Image Decompression model
1.7  Row-Column Decomposition
1.8  2-D DCT model
3.1  Major activities in ASIC design
3.2  ASIC design and development flow
3.3  RTL Verification flow with Linting and Code coverage
3.4  Synthesis flow with Low power and UPF
3.5  VLSI Physical Design Hierarchy
3.6  VLSI Physical Design Flow
3.7  CMOS Circuit in subthreshold
3.8  CMOS Inverter model for Static power consumption
3.9  Simple CMOS Inverter Driving a Capacitive External Load
3.10 Parasitic Internal Capacitors Associated with Two Inverters
3.11 Equivalent schematic of a CMOS inverter whose input is between logic levels
3.12 Different Clock gating schemes
4.1  Block diagram of 2-D DCT Architecture
4.2  Top level schematic for DCT core
4.3  One Dimensional DCT Architecture
4.4  2-D DCT Architecture
4.5  1-D IDCT Architecture
4.6  2-D IDCT Block diagram
4.7  Top level schematic of IDCT Architecture
5.1  Block diagram of variable length encoder
5.2  Block diagram of Zigzag Scanner
5.3  Zigzag Scanning order
5.4  Zigzag Scanning example
5.5  Internal Architecture of the Zigzag Scanner
5.6  Block diagram of Run-Length Encoder
5.7  Internal architecture of run-length encoder
5.8  Block diagram of Huffman Encoding
5.9  Internal architecture of Huffman encoder
5.10 Interconnection of zigzag scanning, run-length and Huffman encoding blocks
5.11 Block diagram of variable length decoder
5.12 Block diagram of Huffman decoder
5.13 Block diagram of FIFO
5.14 Block diagram of Run-Length Decoder
5.15 Block diagram of Zigzag Inverse Scanner
6.1  Matlab Design flow
6.2  Matlab Simulation Results for 8x8 image
6.3  Matlab Simulation Results for full image
6.4  Input image in color and Gray Scale
6.5  Reconstructed Image after proposed DCT and IDCT
6.6  2-D DCT Matlab results
6.7  DCT HDL simulation results
6.8  2-D IDCT HDL simulation results
6.9  2-D IDCT Matlab results
6.10 Layout of 2-D DCT
6.11 Zoomed Version of 2-D DCT Layout
6.12 Layout of 2-D IDCT
6.13 Zoomed Version of 2-D IDCT Layout
6.14 Original image for compression
6.15 Original image with the pixel values
6.16 DCT coefficients before quantization
6.17 Image after compression
6.18 Reconstructed Image using IDCT
6.19 Comparison of original and reconstructed image
6.20 Simulation results after compression
7.1  Waveform obtained after simulating the Zigzag Scanning block
7.2  Waveform of Run Length Encoding block
7.3  Waveform of Huffman Encoder block
7.4  Layout of Variable Length Coding
7.5  Zoomed Version Layout of VLC
7.6  Waveform of Huffman Decoder block
7.7  Waveform of run-length decoder block
7.8  Waveform of zigzag inverse scanner block
7.9  Layout of Variable Length Decoding
7.10 Bar Chart of Power and Area parameters
7.11 Bar chart showing the power comparison of Huffman decoders
7.12 Bar chart showing the power comparison of RL-Huffman Encoder combination
LIST OF ABBREVIATIONS

DCT Discrete Cosine Transform
VLSI Very Large Scale Integration
JPEG Joint Photographic Expert Group
IDCT Inverse Discrete Cosine Transform
ISO International Organization for Standardization
IEC International Electrotechnical Commission
MPEG Moving Picture Expert Group
HDTV High Definition Television
HVS Human Visual System
KLH Karhunen-Loeve-Hotelling
CR Compression Ratio
DPCM Differential Pulse Code Modulation
DWT Discrete Wavelet Transform
RLE Run Length Encoding
DA Distributed Arithmetic
LUT Look-up table
HSM Highly Scalable Multiplier
PSNR Peak Signal to Noise Ratio
ASIC Application Specific Integrated Circuit
HDL Hardware Description Language
FPGA Field Programmable Gate Array
RTL Register Transfer Level
STA Static Timing Analysis
DRC Design Rule Check
ERC Electrical Rule Checking
SVC Static Voltage Scaling
MVS Multi-level Voltage Scaling
DVFS Dynamic Voltage and Frequency Scaling
AVS Adaptive Voltage Scaling
MAC Multiply and Accumulate
VLE Variable Length Encoding
FIFO First In First Out
RLD Run-Length Decoder
IUS Incisive Unified Simulator
CHT Condensed Huffman Table
SGHT Single-Side Growing Huffman Table
RVLC Reversible Variable-Length Coding
PE Processing Elements
CHAPTER 1
INTRODUCTION
The purpose of this chapter is to provide an overview of compression, in
particular image compression using the discrete cosine transform (DCT). It also
introduces the different types of architecture for computing the 2-dimensional
DCT and the methodology followed for image compression using the DCT.
1.1 IMAGE DATA COMPRESSION

Image data compression has been an active research area for image
processing over the last decade and has been used in a variety of applications.
Compression is the process of reducing the size of data by encoding its
information more efficiently. By doing this, the result is a reduction in the number
of bits and bytes used to store the information; in effect, it reduces the bandwidth
required for transmission and the storage requirements [97]. This research
investigates the implementation of an image data compression method with low
power VLSI hardware that could be used in practical coding systems to compress
image signals.

In practical situations, an image is originally defined over a large matrix of
picture elements (pixels), with each pixel represented by an 8- or 16-bit gray scale
value. This representation could be so large that it is difficult to store or transmit.
The purpose of image compression is to reduce the size of the representation and,
at the same time, to keep most of the information contained in the original image.
DCT based coding/decoding systems play a dominant role in real-time
applications. However, the DCT is computationally intensive. In addition, the
2D-DCT has been recommended by standards organizations such as the Joint
Photographic Experts Group (JPEG) [95]. The standards developed by these
groups aid industry manufacturers in developing real-time 2D-DCT chips for use
in various image transmission and storage systems [46].
DCT based coding and decoding systems play a dominant role in real-time
applications in science and engineering, such as audio and images. VLSI DCT
processor chips have become indispensable in real time coding systems because of
their fast processing speed and high reliability. JPEG has defined an international
standard for coding and compression of continuous-tone still images. This
standard is commonly referred to as the JPEG standard [55]. The primary aim of
the JPEG standard is to propose an image compression algorithm that would be
generic, application independent and aid VLSI implementation of data
compression. As the DCT core becomes a critical part in an image compression
system, close studies of its performance and implementation are worthwhile and
important. Application specific requirements are the basic concern in its design. In
the last decade the advancement in data communication techniques has been
significant; during the explosive growth of the Internet the demand for multimedia
has increased. Video and audio data streams require a huge bandwidth to be
transferred in an uncompressed form. Several ways of compressing multimedia
streams have evolved, some of which use the Discrete Cosine Transform (DCT)
for transform coding and its inverse (IDCT) for transform decoding.
Image compression is a useful topic in the digital world. A digital image
bitmap can contain considerably large amounts of data, causing exceptional
overhead in both computational complexity and data processing. Storage media
have exceptional capacity; however, access speeds are typically inversely
proportional to capacity. Compression is a must to manage large amounts of data
for networks, the Internet, or storage media. Compression techniques have been
studied for years and will continue to improve. Typically, image and video
compressors and decompressors (CODECs) are implemented mainly in software,
as signal processors can manage these operations without incurring too much
overhead in computation. However, the complexity of these operations can be
efficiently implemented in hardware. Hardware specific CODECs can be
integrated into digital systems fairly easily. Improvements in speed occur primarily
because the hardware is tailored to the compression algorithm rather than to
handle a broad range of operations like a digital signal processor. Data
compression itself is the process of reducing the amount of information into a
smaller data set that can be used to represent, and reproduce, the information.
Types of image compression include lossless compression and lossy compression
techniques that are used to meet the needs of specific applications.
JPEG compression can be used as a lossless or a lossy process depending
on the requirements of the application. Both lossless and lossy compression
techniques employ reduction of redundant data. Work in standardization has been
controlled by the International Organization for Standardization (ISO) in
cooperation with the International Electrotechnical Commission (IEC). The Joint
Photographic Experts Group produced JPEG, a widely used image format. JPEG
provides a solid baseline compression algorithm that can be modified in numerous
ways to suit any desired application. The JPEG specification was released initially
in 1991, although it does not specify a particular implementation.
In signal processing applications, the DCT is the most widely used
transform after the discrete Fourier transform. The DCT and IDCT are important
components in many static picture compression and decompression standards [95,
55, 54], including MPEG, HDTV, and JPEG. The applications for these standards
range from still pictures on the Internet to low quality videophones to high
definition television. The DCT transforms data from the spatial domain to the
spatial frequency domain. The DCT attempts to de-correlate the image data, which
is typically highly correlated for small areas of an image. Heavily correlated data
samples provide much redundant information, whereas just a few pieces of
uncorrelated information can represent the same data much more efficiently.
To compress data, it is important to recognize redundancies in data, in the
form of coding redundancy, inter-pixel redundancy, and psycho-visual
redundancy. Data redundancies occur when unnecessary data is used to represent
source information. Compression is achieved when one or more of these types of
redundancies are reduced. Intuitively, removing unnecessary data will decrease the
size of the data without losing any important information. However, this is not the
case for psycho-visual redundancy. The most obvious way to increase compression
is to reduce the coding redundancy. This refers to the entropy of an image, in the
sense that more data is used than necessary to convey the information. Lossless
redundancy removal compression techniques are classified as entropy coding.
Further compression can be obtained through inter-pixel redundancy removal.
Each pixel is highly related to its neighbors, and thus can be differentially encoded
rather than sending the entire value of the pixel. Similarly, adjacent blocks have
the same property, although not to the extent of pixels.
In order to produce lossless compression, it is recommended that only
coding redundancy is reduced or eliminated. This means that the source image will
be exactly the same as the decompressed image. However, inter-pixel
redundancies can also be removed, as the exact pixel value can be reconstructed
from differential coding or through run length coding. Psycho-visual redundancy
refers to the fact that the human visual system will interpret an image in a way
such that removal of this redundancy will create an image that is nearly
indistinguishable from the original by human viewers. The main way to reduce this
redundancy is through quantization. Quantizing data will reduce it to levels defined
by the quantization value. Psycho-visual properties are taken advantage of through
studies performed on the human visual system. The Human Visual System (HVS)
describes the way that the human eye processes an image and relays it to the brain.
By taking advantage of some properties of the HVS, a lot of compression can be
achieved. In general, the human eye is more sensitive to low frequency
components and to the overall brightness, or luminance, of the image.

Images contain both low frequency and high frequency components. Low
frequencies correspond to slowly varying color, whereas high frequencies
represent fine detail within the image. Intuitively, low frequencies are more
important to create a good representation of an image. Higher frequencies can
largely be ignored to a certain degree. The human eye is more sensitive to the
luminance (brightness) than the chrominance (color difference) of an image. Thus,
during compression, chrominance values are less important and quantization can
be used to reduce the amount of psycho-visual redundancy [92, 101, 18].
Luminance data can also be quantized, but less coarsely, to ensure that important
data is not lost.
Several compression algorithms use transforms to change the image from
pixel values representing color to frequencies dealing with lightness and darkness
of an image. Many forms of the JPEG compression algorithm make use of the
discrete cosine transform. Other transforms, such as wavelets, are employed by
other compression algorithms. These models take advantage of subjective
redundancy by exploiting the human visual system's sensitivity to image
characteristics. Another form of compression technique, aside from exploiting
redundancies in data, is known as transform coding or block quantization.
Transform coding employs techniques such as differential pulse code modulation
as well as other predictive compression measures. Transform coding works by
moving data from the spatial domain to a transform space such that the data is
reduced to a smaller number of samples. Transform coders effectively create an
output that has a majority of the energy compacted into a smaller number of
transform coefficients.
The JPEG image compression algorithm makes use of a discrete cosine
transform to move pixel data representing color intensities to a frequency domain.
Most multimedia systems combine both transform coding and entropy coding into
a hybrid coding technique. The most efficient transform coding technique employs
the Karhunen-Loeve-Hotelling (KLH) transform. The KLH transform has the best
results of any studied transform regarding the best energy compaction. However,
the KLH transform has no fast algorithm or effective hardware implementation.
Thus, JPEG compression replaces the KLH transform with the discrete cosine
transform, which is closely related to the discrete Fourier transform. Transform
coders typically make use of quantizers to scale the transform coefficients to
achieve greater compression.
As a majority of the energy from the source image is compacted into a few
coefficients, vector quantizers can be used to coarsely quantize the less important
components and finely quantize the more important coefficients [20]. A major
benefit of transform coding is that distortion or noise produced by quantization and
rounding gets evenly distributed over the resulting image through the inverse
transform [89].
1.2 NEED FOR IMAGE COMPRESSION

The need for image compression becomes apparent when the number of
bits per image is computed from typical sampling rates and quantization methods.
For example, the amount of storage required for given images is: (i) a low
resolution, TV quality, color video image, which has 512×512 pixels/color,
8 bits/pixel, and 3 colors, consists of approximately 6×10⁶ bits; (ii) a 24×36 mm
negative photograph scanned at 12×10⁻⁶ m: 3000×2000 pixels/color, 8 bits/pixel,
and 3 colors, contains nearly 144×10⁶ bits; (iii) a 14×17 inch radiograph scanned
at 70×10⁻⁶ m: 5000×6000 pixels, 12 bits/pixel, contains nearly 360×10⁶ bits. Thus
storage of even a few images could cause a problem. As another example of the
need for image compression, consider the transmission of a low resolution
512×512×8 bits/pixel×3-color video image over telephone lines. Using a 9600
baud (bits/sec) modem, the transmission would take approximately 11 minutes for
just a single image, which is unacceptable for most applications.
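These figures can be reproduced with a short MATLAB calculation (the sizes and the 9600 bits/sec line rate follow the examples above):

    tv_bits    = 512*512*8*3;      % TV-quality colour image: about 6.3e6 bits
    photo_bits = 3000*2000*8*3;    % scanned negative: 144e6 bits
    xray_bits  = 5000*6000*12;     % radiograph: 360e6 bits
    minutes    = tv_bits/9600/60;  % about 10.9 minutes over a 9600 baud modem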
1.3 PRINCIPLES BEHIND COMPRESSION

The number of bits required to represent the information in an image can be
minimized by removing the redundancy present in it. There are three types of
redundancies: (i) spatial redundancy, which is due to the correlation or
dependence between neighboring pixel values; (ii) spectral redundancy, which
is due to the correlation between different color planes or spectral bands; (iii)
temporal redundancy, which is present because of correlation between
different frames in a sequence of images. Image compression research aims to
reduce the number of bits required to represent an image by removing the spatial
and spectral redundancies as much as possible.

Data redundancy is a central issue in digital image compression. If n1 and
n2 denote the number of information carrying units in the original and compressed
image respectively, then the compression ratio CR can be defined as

CR = n1/n2

and the relative data redundancy RD of the original image can be defined as

RD = 1 - (1/CR)

Three possibilities arise here:

(1) If n1 = n2, then CR = 1 and hence RD = 0, which implies that the original
image does not contain any redundancy between the pixels.

(2) If n1 >> n2, then CR → ∞ and hence RD → 1, which implies a considerable
amount of redundancy in the original image.

(3) If n1 << n2, then CR → 0 and hence RD → -∞, which indicates that the
compressed image contains more data than the original image.
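A small MATLAB illustration of these definitions, with hypothetical bit counts:

    n1 = 6291456;    % original image: 512x512 pixels, 8 bits/pixel, 3 colours
    n2 = n1/8;       % hypothetical compressed size (8:1 compression)
    CR = n1/n2;      % compression ratio = 8
    RD = 1 - 1/CR;   % relative data redundancy = 0.875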
1.4 DIFFERENT TYPES OF REDUNDANCIES IN IMAGE

Image compression and coding techniques explore three types of
redundancies: coding redundancy, interpixel (spatial) redundancy and
psychovisual redundancy. Compression is achieved when one or more of these
types of redundancies are reduced.
1.4.1 Coding Redundancy
This refers to the entropy of the image, in the sense that more data is used
than necessary to convey the information. This can be overcome by variable length
coding. Examples of image coding schemes that explore coding redundancy are
Huffman codes and the arithmetic coding technique.
1.4.2 Interpixel Redundancy
This is also known as spatial redundancy, interframe redundancy or
geometric redundancy. It exploits the fact that an image often contains strongly
correlated pixels, in other words, large regions where the pixel values are the same
or almost the same. Examples of compression techniques that explore interpixel
redundancy include Constant Area Coding, Run Length Encoding and many
predictive coding algorithms such as Differential Pulse Code Modulation.
1.4.3 Psychovisual Redundancy
This refers to the fact that the human visual system will interpret an image
in a way such that removal of this redundancy will create an image that is nearly
indistinguishable from the original by human viewers. The main way to reduce this
type of redundancy is through quantization. Psychovisual properties are taken
advantage of through studies performed on the human visual system. Most of the
image coding algorithms in use today exploit this type of redundancy, such as the
Discrete Cosine Transform (DCT) based algorithm at the heart of the JPEG
encoding standard.
1.5 TYPES OF JPEG COMPRESSION
Still image coding is an important application of data compression. When
an analog image or picture is digitized, each pixel is represented by a fixed number
of bits, which correspond to a certain number of gray levels. In this uncompressed
format, the digitized image requires a large number of bits to be stored or
transmitted. As a result, compression becomes necessary due to the limited
communication bandwidth or storage size.
The JPEG standard allows for both lossy and lossless encoding of still
images. The algorithm for lossy coding is a discrete cosine transform (DCT)
based coding scheme. This is the baseline of JPEG and is sufficient for many
applications. However, to meet the needs of applications that cannot tolerate loss,
e.g., compression of medical images, a lossless coding scheme is also provided and
is based on a predictive coding scheme. From the algorithmic point of view, JPEG
includes four distinct modes of operation: sequential DCT-based mode,
progressive DCT-based mode, lossless mode, and hierarchical mode.
1.5.1 Sequential DCT based
The sequential DCT based mode of operation comprises the baseline JPEG
algorithm. This technique can produce very good compression ratios, while
sacrificing image quality. The sequential DCT based mode achieves much of its
compression through quantization, which removes entropy from the data set.
Although this baseline algorithm is transform based, it does use some measure of
predictive coding called differential pulse code modulation (DPCM) [8]. After
each input 8x8 block of pixels is transformed to frequency space using the DCT,
the resulting block contains a single DC component and 63 AC components. The
DC component is predictively encoded through the difference between the current
DC value and the previous one. This mode only uses Huffman coding models, not
the arithmetic coding models which are used in JPEG extensions. This mode is the
most basic, but still has a wide acceptance for its high compression ratios, which
can fit many general applications very well.
1.5.2 Progressive DCT based
In the progressive mode, by contrast, the quantized DCT coefficients are first
stored in a buffer before the encoding is performed. The DCT coefficients in the
buffer are then encoded by a multiple scanning process. In each scan, the quantized
DCT coefficients are partially encoded by either spectral selection or successive
approximation. In the method of spectral selection, the quantized DCT coefficients
are divided into multiple spectral bands according to a zigzag order. In each scan, a
specified band is encoded. In the method of successive approximation, a specified
number of most significant bits of the quantized coefficients are first encoded,
followed by the least significant bits in later scans. The difference between
sequential coding and progressive coding is shown in Figure 1.1. In sequential
coding, an image is encoded part-by-part according to the scanning order, while in
progressive coding the image is encoded by a multi-scanning process and in each
scan the full image is encoded to a certain quality level.
Figure 1.1 (a) Sequential coding (b) Progressive coding
1.5.3 Lossless Mode
Lossless coding is achieved by a predictive coding scheme. In this scheme,
three neighboring pixels are used to predict the current pixel to be coded. The
prediction difference is entropy coded using either Huffman or arithmetic coding.
Because the prediction is not quantized, the coding is lossless, which also rules out
the use of quantization. This method does not achieve high compression ratios, but
some applications do require extremely precise image reproduction, as in medical
scan images.
1.5.4 Hierarchical Mode
In the hierarchical mode, an image is first spatially down-sampled to a
multilayered pyramid, resulting in a sequence of frames as shown in Figure 1.2.
Figure 1.2 Hierarchical multi resolution coding
This sequence of frames is encoded by a predictive coding scheme. Except
for the first frame, the predictive coding process is applied to the differential
frames, i.e., the differences between the frame to be coded and the predictive
reference frame. It is important to note that the reference frame is equivalent to the
earlier frame that would be reconstructed in the decoder. The coding method for
the difference frame may either use the DCT-based coding method, the lossless
coding method, or the DCT-based processes with a final lossless process.
Down-sampling and up-sampling filters are used in the hierarchical mode. The
hierarchical coding mode provides a progressive presentation similar to the
progressive DCT-based mode, but is also useful in applications that have
multi-resolution requirements. The hierarchical coding mode also provides the
capability of progressive coding to a final lossless stage.
1.6 LOSSLESS VERSUS LOSSY COMPRESSION
In lossless compression schemes, the reconstructed image, after
compression, is numerically identical to the original image. However, lossless
compression can only achieve a modest amount of compression. Lossless
compression is preferred for archival purposes and often for medical imaging,
technical drawings, clip art or comics. This is because lossy compression methods,
especially when used at low bit rates, introduce compression artifacts. An image
reconstructed following lossy compression contains degradation relative to the
original. Often this is because the compression scheme completely discards
redundant information. However, lossy schemes are capable of achieving much
higher compression. Lossy methods are especially suitable for natural images
such as photos in applications where minor (sometimes imperceptible) loss of
fidelity is acceptable to achieve a substantial reduction in bit rate. Lossy
compression that produces imperceptible differences can be called visually
lossless.
1.6.1 Predictive Versus Transform Coding
In predictive coding, information already sent or available is used to predict
future values, and the difference is coded. Since this is done in the image or spatial
domain, it is relatively simple to implement and is readily adapted to local
image characteristics. Differential Pulse Code Modulation (DPCM) is one
particular example of predictive coding. Transform coding, on the other hand, first
transforms the image from its spatial domain representation to a different type
of representation using some well-known transform and then codes the
transformed values (coefficients). This method provides greater data compression
compared to predictive methods, although at the expense of greater
computational requirements.
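As a toy MATLAB illustration of predictive coding (the pixel values here are made up), the previous sample predicts the current one and only the differences are coded:

    x     = [100 102 101 105 110 110 108];  % a row of pixel values
    d     = [x(1), diff(x)];                % first value, then the differences
    x_rec = cumsum(d);                      % decoder rebuilds the row exactly
    assert(isequal(x_rec, x));              % lossless round trip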
1.7 DCT PROCESS
Figure 1.3 DCT Process for compression
[Block diagram: 8x8 input image → DCT → Quantizer (using quantization tables) → quantized DCT coefficients]
The DCT process is shown in Figure 1.3. The input image is divided into
non-overlapping blocks of 8 x 8 pixels, and input to the baseline encoder. The
pixel values are converted from unsigned integer format to signed integer format,
and DCT computation is performed on each block. DCT transforms the pixel data
into a block of spatial frequencies that are called the DCT coefficients. Since the
pixels in the 8 x 8 neighborhood typically have small variations in gray levels, the
output of the DCT will result in most of the block energy being stored in the lower
spatial frequencies [80, 83, 60]. On the other hand, the higher frequencies will
have values equal to or close to zero and hence, can be ignored during encoding
without significantly affecting the image quality.
The selection of which frequencies are most important and which ones are
less important can affect the quality of the final image. JPEG allows for this by
letting the user predefine the quantization tables
used in the quantization step that follows the DCT computation. The selection of
quantization values is critical since it affects both the compression efficiency [70],
and the reconstructed image quality.
The 2-D DCT transforms an 8x8 block of spatial data samples into an 8x8
block of spatial frequency components. The IDCT performs the inverse of DCT,
transforming spatial frequency components back into the spatial domain. Figure
1.4 shows the frequency components represented by each coefficient in the output
matrix. The low frequency coefficients occur in the top left side of the output
matrix, while the remaining higher frequency coefficients occur in the bottom
right side.

Figure 1.4 DCT Coefficients

The DC coefficient at
position (0,0) gives an idea of the average intensity (for luminance blocks) or hue
(for chrominance blocks) of an entire block. Moving horizontally from position
(0,0) to position (0,7), the coefficients give the contributions of increasing vertical
frequency components to the overall 8x8 block. The coefficients from position
(1,0) to position (7,0) have similar meaning for horizontal frequency components.
Moving diagonally through the matrix gives the combined contribution of
horizontal and vertical frequency components. The original block is rebuilt by the
IDCT with these discrete frequency components. High frequency coefficients have
small magnitude for typical image data, which usually does not change
dramatically between neighboring pixels. Additionally, the human eye is not as
sensitive to high frequencies as to low frequencies. It is difficult for the human eye
to discern changes in intensity or colour that occur between successive pixels. The
human eye tends to blur these rapid changes into an average hue and intensity.
However, gradual changes over the 8 pixels in a block are much more discernible
than rapid changes. When the DCT is used for compression purposes, the quantizer
unit attempts to force the insignificant high frequency coefficients to zero while
retaining the important low frequency coefficients.
The DCT transforms information from the time or space domain to the
frequency domain, such that other tools and transmission media can be used
more efficiently to reach application goals: compact representation [13], fast
transmission, memory savings, and so on. The JPEG image compression standard
was developed by the Joint Photographic Experts Group. The JPEG compression
principle is the use of controllable losses to reach high compression rates. In this
context, the information is transformed to the frequency domain through the DCT.
Since neighboring pixels in an image have a high likelihood of showing small
variations in color, the DCT output will group the higher amplitudes in the lower
spatial frequencies. The higher spatial frequencies can then be discarded,
generating a high compression rate and a small perceptible loss in image
quality. JPEG compression is recommended for photographic images, since drawn
images are richer in high frequency areas that are distorted by the application of
JPEG compression.
1.8 JPEG IMAGE COMPRESSION
The JPEG image compression standard uses the DCT (Discrete Cosine
Transform). The discrete cosine transform is a fast transform. It is a widely used
and robust method for image compression. It has excellent compaction for highly
correlated data, and it has fixed basis images. The DCT gives a good compromise
between information packing ability and computational complexity.

The JPEG 2000 image compression standard makes use of the DWT
(Discrete Wavelet Transform). The DWT can be used to reduce the image size
without losing much resolution: computed values less than a pre-specified
threshold are discarded, thus reducing the amount of memory required to represent
a given image. The DWT provides lower quality than JPEG at low compression
rates and requires longer compression time [12, 52].
The name "JPEG" stands for Joint Photographic Experts Group. JPEG is a
method of lossy compression for digitized photographic images. JPEG can achieve
a good compression with little perceptible loss in image quality. It works with
color and grayscale images and finds applications in satellite imaging, medical
imaging, etc.
JPEG encoding consists of the following stages; the major steps in JPEG
coding involve:
• Image/Block Preparation
• DCT (Discrete Cosine Transformation)
• Quantization
• Entropy Coding
• Zigzag Scanning (Vectoring)
• Run Length Encoding (RLE)
The baseline JPEG compression algorithm is the most basic form of
sequential DCT based compression. By using transform coding, quantization, and
entropy coding, at an 8-bit pixel resolution, a high-level of compression can be
achieved. However, the compression ratio achieved is due to the sacrifices made in
quality. The baseline specification assumes that 8-bit pixels are the source image,
but extensions can use higher pixel resolutions. JPEG assumes that each block of
data input is 8x8 pixels; the pixels within each block, and the blocks themselves,
are input serially in raster order.
Baseline JPEG compression has some configurable portions, such as
quantization tables, and Huffman tables, which can individually be specified in the
JPEG file header. By studying the source images to be compressed, Huffman
codes and quantization codes can be optimized to reach a higher level of
compression without losing more quality than is acceptable. Although this mode of
JPEG is not highly configurable, it still allows a considerable amount of
compression. Further compression can be achieved by subsampling the
chrominance portions of the input image, which is a useful technique playing on
the human visual system.
Figure 1.5 Image compression model
Figure 1.6 Image Decompression model
The image compression model shown in Figure 1.5 consists of a transformer, a
quantizer and an encoder.
1.8.1 Input Transformer
It transforms the input data into a format that reduces inter-pixel redundancies
in the input image. Transform coding techniques use a reversible, linear
mathematical transform to map the pixel values onto a set of coefficients, which
are then quantized and encoded. The key factor behind the success of transform-
based coding schemes is that many of the resulting coefficients for most natural
images have small magnitudes and can be quantized without causing significant
distortion in the decoded image. For compression purpose, the higher the
capability of compressing information in fewer coefficients, the better the
transform; for that reason, the Discrete Cosine Transform (DCT) and Discrete
Wavelet Transform (DWT) have become the most widely used transform coding
techniques.
Transform coding algorithms usually start by partitioning the original
image into sub images (blocks) of small size (usually 8 × 8). For each block the
transform coefficients are calculated, effectively converting the original 8 × 8 array
of pixel values into an array of coefficients within which the coefficients closer to
the top-left corner usually contain most of the information needed to quantize and
encode (and eventually perform the reverse process at the decoder’s side) the
image with little perceptual distortion. The resulting coefficients are then quantized
and the output of the quantizer is used by symbol encoding techniques to produce
the output bit stream representing the encoded image. In the image decompression
model, at the decoder's side, the reverse process takes place, as shown in Figure 1.6
[65, 75, 16], with the obvious difference that the dequantization stage will only
generate an approximated version of the original coefficient values, i.e., whatever
loss was introduced by the quantizer in the encoder stage is not reversible.
In order to make the data fit the discrete cosine transform, each pixel value
is level shifted by subtracting 128 from its value. The result of this is 8-bit pixels
that have the range of -128 to 127, making the data symmetric across 0. This is
good for DCT as any symmetry that is exposed will lead towards better entropy
compression [22, 90]. Effectively this shifts the DC coefficient to fall more in line
with the value of the AC coefficients. The AC coefficients produced by the DCT
are not affected in any way by this level shifting.
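As an illustrative MATLAB sketch of this front end (dct2 and blockproc are Image Processing Toolbox functions; the image name is only an example):

    img  = double(imread('cameraman.tif')) - 128;  % level shift: 0..255 -> -128..127
    fun  = @(blk) dct2(blk.data);                  % 2-D DCT of one 8x8 block
    coef = blockproc(img, [8 8], fun);             % blockwise DCT coefficients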
1.8.2 Quantization
The human eye responds to the DC coefficient and the lower spatial
frequency coefficients. If the magnitude of a higher frequency coefficient is below
a certain threshold, the eye will not detect it. During quantization, the frequency
coefficients in the transformed matrix whose amplitudes are less than a defined
threshold are set to zero (these coefficients cannot be recovered during decoding),
and the sizes of the DC and AC coefficients are reduced. A division operation is
performed using the predefined threshold value as the divisor.
DCT-based image compression relies on two techniques to reduce the data
required to represent the image. Quantization is the process of reducing the number
of possible values of a quantity, thereby reducing the number of bits needed to
represent it. Entropy coding is a technique for representing the quantized data as
compactly as possible. We will develop functions to quantize images and to
calculate the level of compression provided by different degrees of quantization.
We will not implement the entropy coding required to create a compressed image
file. A simple example of quantization is the rounding of reals into integers. To
represent a real number between 0 and 7 to some specified precision takes many
bits. Rounding the number to the nearest integer gives a quantity that can be
represented by just three bits. In this process, we reduce the number of possible
values of the quantity (and thus the number of bits needed to represent it) at the
cost of losing information. A “finer” quantization, which allows more values and
loses less information, can be obtained by dividing the number by a weight factor
before rounding. In the JPEG image compression standard, each DCT coefficient is
quantized using a weight that depends on the frequency of that coefficient.
The coefficients in each 8 x 8 block are divided by a corresponding entry of
an 8 x 8 quantization matrix, and the result is rounded to the nearest integer. In
general, higher spatial frequencies are less visible to the human eye than low
frequencies. Therefore, the quantization factors are usually chosen to be larger for
the higher frequencies. The quantization matrix is widely used for monochrome
images and for the luminance component of a color image. Our quantization
function blocks the image, divides each block (element-by-element) by the
quantization matrix, reassembles the blocks, and then rounds the entries to the
nearest integer. The dequantization function blocks the matrix, multiplies each
block by the quantization factors, and reassembles the matrix.
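A minimal MATLAB sketch of this step, assuming the widely published JPEG luminance quantization matrix (any 8x8 weight matrix would serve) and an arbitrary stand-in block:

    Q = [16 11 10 16  24  40  51  61;
         12 12 14 19  26  58  60  55;
         14 13 16 24  40  57  69  56;
         14 17 22 29  51  87  80  62;
         18 22 37 56  68 109 103  77;
         24 35 55 64  81 104 113  92;
         49 64 78 87 103 121 120 101;
         72 92 95 98 112 100 103  99];
    blk  = dct2(double(magic(8))*4 - 128);  % DCT of a stand-in 8x8 block
    qblk = round(blk ./ Q);                 % quantize: element-wise divide, round
    deq  = qblk .* Q;                       % dequantize: only approximates blk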
1.8.3 Entropy Coding
Entropy coding creates a fixed or variable-length code to represent the
quantizer's output and maps the output in accordance with the code. In most cases,
a variable-length code is used. An entropy encoder compresses the quantized
values further to provide more efficient compression. The most important types of
entropy encoders used in lossy image compression techniques are the arithmetic
encoder, the Huffman encoder and the run-length encoder. In vectoring, the 2-D
matrix of quantized DCT coefficients is represented in the form of a
one-dimensional vector [76, 61, 50]. After quantization, most of the high
frequency coefficients (lower right corner) are zero. To exploit the number of
zeros, a zigzag scan of the matrix is used. The zigzag scan allows [73] all the DC
coefficients and lower frequency AC coefficients to be scanned first. DC
coefficients are encoded using differential encoding and AC coefficients are
encoded using run-length encoding; Huffman coding is used to encode both after
that. In differential encoding, the DC coefficient, which is the largest in the
transformed matrix and varies slowly from one block to the next, is handled by
encoding only the difference in value between the DC coefficients of successive
blocks, reducing the number of bits required. The difference values are encoded in
the form (SSS, value), where the SSS field indicates the number of bits needed to
encode the value and the value field indicates the binary form [41]. A sketch of the
zigzag scan follows.
1.8.4 Run-Length Encoding
Run length coding is the first step in entropy coding. It is a simple
technique accomplished by assigning a code, run length and size, to every non-zero
value in the quantized data stream. The run length is a count of the zero values
before the non-zero value occurred. The size is a category given to the non-zero
value which is used to recover the value later. The DC value of the block is
omitted in this process [69, 25]. Additionally, with every non-zero value a
magnitude is generated which determines the number of bits that are necessary to
reconstruct the value; it indicates the possible values in the size category that can
be correct. Run length coding is a basic form of lossless compression. Essentially,
this process is a generalization of zero suppression techniques [99, 1]. Zero
suppression assumes that one symbol or value appears often in a data stream. After
quantization the goal is that most high frequency components, which are less
important to the human visual system, are set to zero. The zigzag process
organizes the sequence so that the lower frequency components, which are less
likely to be zero, appear in the first part of the data stream. This effectively
organizes the data to have larger runs of zeros, especially at the end, making the
run length coding very efficient. The 63 values of the AC coefficients contain long
strings of zeros because of the zigzag scan [28]. Each AC coefficient is encoded as
a pair of values (skip, value), where skip indicates the number of zeros in the run
and value is the next non-zero coefficient.
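A sketch of this (skip, value) pairing in MATLAB, on an illustrative zigzag-ordered vector (JPEG additionally emits an end-of-block code for the trailing zeros):

    vec   = [52 3 0 0 -2 1 zeros(1, 58)];  % example zigzag-ordered 1x64 block
    ac    = vec(2:end);                    % the 63 AC coefficients (DC omitted)
    pairs = zeros(0, 2); run = 0;
    for v = ac
        if v == 0
            run = run + 1;                 % extend the current run of zeros
        else
            pairs(end+1, :) = [run v];     %#ok<AGROW> emit (skip, value)
            run = 0;
        end
    end
    % pairs is [0 3; 2 -2; 0 1]: each non-zero AC value with its zero run.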
1.8.5 Huffman Encoding
It is a technique which will assign a variable length codeword to an input
data item. Huffman coding assigns a smaller codeword to an input that occurs
more frequently. It is very similar to Morse code, which assigned smaller pulse
combinations to letters that occurred more frequently [59]. Huffman coding is
variable length coding, where characters are not coded to a fixed number of bits.
This is the last step in the encoding process. It organizes the data stream into a
smaller number of output data packets by assigning unique code words that later
during decompression can be reconstructed without loss [78, 42]. For the JPEG
process, each combination of run length and size category, from the run length
coder is assigned a Huffman codeword. Long strings of binary digits are replaced
by shorter code words. The prefix property of the Huffman code words enables
the encoded bit stream to be decoded unambiguously.
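As a toy illustration (using huffmandict/huffmanenco/huffmandeco from MATLAB's Communications Toolbox; the symbols and probabilities are made up, not the JPEG code tables):

    symbols = [0 1 2 3];                  % e.g., size categories
    p       = [0.5 0.25 0.15 0.1];        % frequent symbols get short code words
    dict    = huffmandict(symbols, p);    % build the variable length prefix code
    enc     = huffmanenco([0 0 1 3 0 2], dict);
    dec     = huffmandeco(enc, dict);     % prefix property: decodes unambiguously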
Figure 1.6 shows the decompression model of the image, in which the
reverse operation of the compression model is performed so that the original
image can be reconstructed.
1.9 APPLICATIONS OF DCT
The DCT core can be utilized for a variety of multimedia applications
including
• Office automation equipment (multifunction printers, digital copiers, etc.)
• Digital cameras & camcorders
• Video production, video conference
• Surveillance systems
Like other transforms, the Discrete Cosine Transform (DCT) attempts to
decorrelate the image data. After decorrelation each transform coefficient can be
encoded independently without losing compression efficiency. The next section
describes the DCT and some of its important properties.
1.10 DCT ALGORITHMS
The DCT algorithm is very effective due to its symmetry and simplicity. It is a good replacement for the FFT because it considers only the real component of the image data [75, 60]. In the DCT we leave out the unwanted frequency components, retaining only the required frequency components of the image. The image is divided into blocks and each block is compressed using quantization. Moreover, many simulation tools, such as MATLAB, are available to estimate the results prior to realization of the design in real time. Equations (1.1) and (1.2) are the standard one-dimensional DCT equations.
1.10.1 One-Dimensional DCT
The most common DCT definition of a 1-D sequence of length N is
C(u) = \alpha(u) \sum_{x=0}^{N-1} f(x) \cos\left[ \frac{\pi (2x+1) u}{2N} \right]   …… (1.1)
for u = 0,1,2,…,N −1. Similarly, the inverse transformation is defined as
f(x) = \sum_{u=0}^{N-1} \alpha(u)\, C(u) \cos\left[ \frac{\pi (2x+1) u}{2N} \right]   …… (1.2)
for x = 0,1,2,…,N −1. In both equations (1.1) and (1.2), α(u) is defined as

\alpha(u) = \begin{cases} \sqrt{1/N}, & u = 0 \\ \sqrt{2/N}, & u \neq 0 \end{cases}   …… (1.3)
It is clear from Equation (1.1) that for
u = 0, \quad C(u = 0) = \sqrt{1/N} \sum_{x=0}^{N-1} f(x)   …… (1.4)
Thus, the first transform coefficient is the average value of the sample
sequence. In literature, this value is referred to as the DC Coefficient. All other
transform coefficients are called the AC Coefficients.
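Equations (1.1), (1.3) and (1.4) can be checked with a direct floating-point reference (an illustrative Python model only; the proposed low power datapath is the subject of Chapter 4):

    import math

    # Direct evaluation of Equation (1.1) with alpha(u) from (1.3).
    def dct_1d(f):
        N = len(f)
        def alpha(u):
            return math.sqrt((1.0 if u == 0 else 2.0) / N)
        return [alpha(u) *
                sum(f[x] * math.cos(math.pi * (2 * x + 1) * u / (2 * N))
                    for x in range(N))
                for u in range(N)]

    samples = [8, 16, 24, 32, 40, 48, 56, 64]
    coeffs = dct_1d(samples)
    # Equation (1.4): the DC coefficient is sqrt(1/N) times the sample sum.
    assert abs(coeffs[0] - sum(samples) / math.sqrt(len(samples))) < 1e-9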
1.10.2 Two-Dimensional DCT
The 2-D DCT is a direct extension of the 1-Dimensional DCT and is given
by
C(u,v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y) \cos\left[ \frac{\pi (2x+1) u}{2N} \right] \cos\left[ \frac{\pi (2y+1) v}{2N} \right]   …… (1.5)
For u, v = 0,1,2,…,N −1 and α(u) and α(v) are defined in (1.3).
The inverse transform, defined for x, y = 0, 1, 2, …, N −1, is

f(x,y) = \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} \alpha(u)\,\alpha(v)\, C(u,v) \cos\left[ \frac{\pi (2x+1) u}{2N} \right] \cos\left[ \frac{\pi (2y+1) v}{2N} \right]   …… (1.6)

The 2-D basis
functions can be generated by multiplying the horizontally oriented 1-D basis
functions with a vertically oriented set of the same functions.
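A direct double-sum evaluation of Equation (1.5), useful only as a golden reference model given its O(N^4) cost, can be sketched as follows:

    import math

    # Direct 2-D DCT per Equation (1.5); the architectures discussed
    # in Section 1.11 reduce this arithmetic cost substantially.
    def dct_2d(f):
        N = len(f)
        a = lambda u: math.sqrt((1.0 if u == 0 else 2.0) / N)
        c = lambda k, u: math.cos(math.pi * (2 * k + 1) * u / (2 * N))
        return [[a(u) * a(v) * sum(f[x][y] * c(x, u) * c(y, v)
                                   for x in range(N) for y in range(N))
                 for v in range(N)]
                for u in range(N)]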
1.11 DCT ARCHITECTURES
Different architectures are available to compute the 2-D DCT of an image matrix; some of these architectures are discussed in the following sections.
1.11.1 Two-Dimensional Approaches
The implementation of the 2-D DCT directly from the theoretical equation
results in 1024 multiplications and 896 additions. Fast algorithms exploit the
symmetry within the DCT to achieve dramatic computational savings.
1.11.2 Row – Column Decomposition
This algorithm computes the 2-D DCT by row-column decomposition. In
this approach, the separability property of the DCT is exploited. An 8-point, 1-D
DCT is applied to each of the 8 rows, and then again to each of the 8 columns. The
1-D algorithm that is applied to both the rows and columns is the same. Therefore,
it could be possible to use identical pieces of hardware to do the row computation
as well as the column computation. A transposition matrix separates the row and column computations, as shown in Figure 1.7. The bulk of the design and computation is
in the 8 point 1-D DCT block, which can potentially be reused 16 times, 8 times
for each row, and 8 times for each column. Therefore, an algorithm for computing
the 1-D DCT is usually selected. The high regularity of this approach is very
attractive for reduced cell count and low power consumption with ASIC
implementation.
Figure 1.7 Row – Column Decomposition
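In software terms the decomposition amounts to two passes of the 1-D transform with a transposition between them, as the following sketch shows (illustrative Python, repeating the 1-D reference of Section 1.10.1; the hardware instead time-shares one 1-D DCT unit):

    import math

    def dct_1d(f):  # reference 1-D DCT of Equation (1.1)
        N = len(f)
        a = lambda u: math.sqrt((1.0 if u == 0 else 2.0) / N)
        return [a(u) * sum(f[x] * math.cos(math.pi * (2 * x + 1) * u / (2 * N))
                           for x in range(N))
                for u in range(N)]

    def dct_2d_row_column(block):
        rows = [dct_1d(list(r)) for r in block]       # 8 row transforms
        cols = [dct_1d(list(c)) for c in zip(*rows)]  # transpose, 8 column transforms
        return [list(r) for r in zip(*cols)]          # transpose back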
1.11.3 Direct Method
This approach to computation of the 2-D DCT is by a direct method using
the results of a polynomial transform. Computational complexity is greatly
reduced, but regularity is sacrificed. Instead of the 16 1-D DCTs used in the conventional row-column decomposition, this method uses all real arithmetic, including eight 1-D DCTs and stages of pre-adds and post-adds (a total of 234 additions), to
compute the 2-D DCT. Thus, the number of multiplications for most
implementations should be halved as multiplication only appears within the 1-D
DCT. Although this direct method of extension into two dimensions creates an
irregular relationship between inputs and outputs of the system, the savings in
computational power may be significant with the use of certain 1-D DCT
algorithms. With this direct approach, large chunks of the design cannot be reused
to the same extent as in the conventional row-column decomposition approach.
Thus, the direct approach will lead to more hardware, more complex control, and
much more intensive debugging. Although the direct approach used 278 fewer additions than the row-column approach, it had much greater complexity.
Therefore, the number of computations alone could not determine which
implementation would result in the lowest power design.
1.11.4 Distributed Arithmetic Algorithms
Distributed Arithmetic (DA) is a bit level rearrangement of a multiply
accumulate to recast the multiplications as additions. The DA method is designed
for inner (dot) products of a constant vector with a variable, or input, vector. It is
the order of operations that distinguishes distributed arithmetic from conventional
arithmetic [94]. The DA technique forms partial products with one bit of data from
the input vector at a time. The partial products are shifted according to weight and
summed together. Look-up tables (LUTs) are essential to the DA method. LUTs
store all the possible sums of the elements in the constant vector. The LUT grows exponentially in size with the dimension of the input, so four-element inputs (a 16-entry LUT) are a practical choice. DA is implemented with the least possible resources by computing it
in a fully bit-serial manner. The key elements required to implement the DA are a
16-element LUT and decoder, an adder/ subtractor, and a shifter. These elements
are grouped together in a ROM Accumulate (RAC) structure. A shift register
inputs one column of input bits per clock cycle to the LUT. It begins by inputting
the most significant column of bits and rotates to finally input the least significant
column of variable bits. The contents from the LUT get summed with the shifted
contents of the previous look up. In this fully bit-serial approach, the answer
converges in as many clock cycles as the bit length of the input elements. While
the serial inputs limit the performance of the RAC, it requires the least possible
resources. Greater performance can be achieved with an increase in hardware.
With an increase in resources, the result of the RAC can converge quicker. The
speed of the calculation is increased by replicating the LUT. In a fully parallel approach, the result of the DA converges at the maximum speed, the clock rate. In this case, the LUT must be replicated as many times as there are input bits.
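The following sketch illustrates the RAC idea for a four-element constant vector (unsigned inputs only; two's complement sign handling, which subtracts the final LUT term, is omitted for brevity):

    # Bit-serial distributed arithmetic for y = sum(C[k] * x[k]):
    # a 16-entry LUT holds every possible sum of the constants, and
    # one bit-column of the inputs addresses it per clock cycle.
    C = [3, 5, 7, 9]                       # constant coefficient vector
    LUT = [sum(c for k, c in enumerate(C) if (i >> k) & 1)
           for i in range(16)]

    def da_dot(x, nbits=8):
        acc = 0
        for b in reversed(range(nbits)):   # most significant column first
            addr = sum(((x[k] >> b) & 1) << k for k in range(4))
            acc = (acc << 1) + LUT[addr]   # shift partial result, add LUT
        return acc

    x = [1, 2, 3, 4]
    assert da_dot(x) == sum(c * v for c, v in zip(C, x))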
1.12 PROPERTIES OF DCT
Some properties of the DCT which are of particular value to image
processing applications:
Decorrelation: The principal advantage of image transformation is the
removal of redundancy between neighboring pixels. This leads to uncorrelated
transform coefficients which can be encoded independently. It can be inferred that
DCT exhibits excellent decorrelation properties.
Energy Compaction: Efficiency of a transformation scheme can be
directly gauged by its ability to pack input data into as few coefficients as possible.
This allows the quantizer to discard coefficients with relatively small amplitudes
without introducing visual distortion in the reconstructed image. DCT exhibits
excellent energy compaction for highly correlated images.
Separability: The DCT transform equation can be expressed as
D(i,j) = \alpha(i)\,\alpha(j) \sum_{x=0}^{N-1} \left[ \sum_{y=0}^{N-1} f(x,y) \cos\frac{(2y+1) j \pi}{2N} \right] \cos\frac{(2x+1) i \pi}{2N}   …… (1.7)
This property, known as separability, has the principal advantage that D(i,
j) can be computed in two steps by successive 1-D operations on rows and
columns of an image. The arguments presented can be identically applied for the
inverse DCT computation.
Figure 1.8 2-D DCT model
Symmetry: Examination of the row and column operations in the DCT equation reveals that these operations are functionally identical. Such a transformation is called a symmetric transformation. A separable and symmetric transform can be expressed in the form

D = TMT′   …… (1.8)

where M is an N ×N symmetric transformation matrix and T is the DCT matrix.
This is an extremely useful property since it implies that the transformation
matrix can be precomputed offline and then applied to the image thereby providing
orders of magnitude improvement in computation efficiency.
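The matrix form of Equation (1.8) is easily exercised with a precomputed T (an illustrative NumPy sketch; T here is the orthonormal DCT matrix, so its transpose also inverts the transform):

    import numpy as np

    N = 8
    # Rows of T are the 1-D DCT basis vectors: T[u, x] = alpha(u) cos(...).
    T = np.array([[np.sqrt((1 if u == 0 else 2) / N) *
                   np.cos((2 * x + 1) * u * np.pi / (2 * N))
                   for x in range(N)] for u in range(N)])

    M = np.arange(64.0).reshape(N, N)      # stand-in 8x8 image block
    D = T @ M @ T.T                        # forward 2-D DCT, D = T M T'
    assert np.allclose(T.T @ D @ T, M)     # inverse recovers the block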
1.13 ORGANIZATION OF THE THESIS
This thesis consists of eight chapters, an appendix and references. The framework of the thesis is as follows. Chapter 1 gives an introduction to the DCT and VLC and also discusses the different DCT architectures. In Chapter 2, the researcher discusses the literature review related to the present research. Chapter 3 describes the VLSI design flow, the importance of low power, and different techniques for low power VLSI design. Chapter 4 describes the proposed low power VLSI architecture for DCT and IDCT. In Chapter 5, the present researcher discusses the low power architecture for VLC and VLD. Chapter 6 and Chapter 7 present the results and discussion of the present research towards achieving good compression with a low power approach. Chapter 8 deals with the conclusion of the present thesis and the possibilities of future work.
SUMMARY
This chapter describes the concepts involved in image compression: the methodology adopted in transforming the image pixels from the spatial domain to the frequency domain by performing the DCT, and the different architectures available to compute the DCT of 8x8 pixels represented in the form of an image matrix. The output of the DCT is quantized to perform the compression, and the quantizer output prepares the ground for lossy compression. After quantization, variable length coding is performed to achieve lossless compression, and finally the image is reconstructed using the decompression process. The next chapter presents the review of the literature related to the present research.
CHAPTER 2
REVIEW OF LITERATURE
2.1 INTRODUCTION
The previous chapter described the concepts involved in image compression, where the methodology adopted is transforming the image pixels from the spatial domain to the frequency domain by performing the DCT. It also described the basic concepts required to reconstruct the image by performing the IDCT. This chapter presents a review of the previous works related to the present research.
Image and Video Compression has been a very active field of research and
development for over 20 years and many different systems and algorithms for
compression and decompression have been proposed and developed. In order to
encourage interworking, competition and increased choice, it has been necessary to
define standard methods of compression, encoding and decoding to allow products
from different manufacturers to communicate effectively. This has led to
development of a number of key international standards for image and video
compression, including the JPEG, MPEG and H.26X series of standards.
The Discrete Cosine Transform (DCT) was first proposed by Ahmed et al.
(1974), and it has been more and more important in recent years. DCT has been
widely used in signal processing of image data, especially in coding for
compression, for its near-optimal performance. Because of the wide-spread use of
DCT's, research into fast algorithms for their implementation has been rather
active.
1. Ricardo Castellanos, Hari Kalva and Ravi Shankar (2009), “Low Power
DCT using Highly Scalable Multipliers”, 16th IEEE International
Conference on Image Processing ,pp.1925-1928.
In this paper the authors have implemented a low power DCT using a highly scalable multiplier. Scalable multipliers are used since the operand bit width
varies in each stage of DCT implementation. Due to the use of the variable sized
multipliers power saving is obtained. A highly scalable multiplier (HSM) allows
dynamic configuration of multiplier for each stage. The authors have calculated
PSNR and SSIM on a set of images which are JPEG encoded based on the use of
scalable multipliers. The authors conclude that the use of a scalable multiplier with variable size HSM reduces the power consumption more than the other algorithms compared.
2. Vimal P. Singh Thoudam, Prof. B. Bhaumik, Dr. S. Chatterjee (2010),
“Ultra Low Power Implementation of 2-D DCT for Image/Video Compression”,
International Conference on Computer Applications and Industrial Electronics
(ICCAIE), pp.532-536.
The proposed DCT architecture is based on the Loeffler algorithm and uses CSD (Canonic Signed Digit) representation. In the CSD number system each digit is allowed to take a value from {-1, 0, +1}, and no two consecutive digits are nonzero. Multiplications can therefore be replaced by shift and add operations. The proposed scheme targets reduced power by minimizing arithmetic operations and using the clock gating technique, since a considerable amount of dynamic power is consumed in the clock distribution network. Moreover, flip-flops dissipate some dynamic power even if their state does not change, so an EN (enable) input is used for effective power reduction.
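As a generic illustration of CSD recoding (a sketch of the number system only, not the circuit of the cited paper), a constant can be converted so that no two consecutive digits are nonzero, making each nonzero digit one shift-and-add or shift-and-subtract term:

    # Recode a non-negative constant into CSD digits {-1, 0, +1},
    # least significant digit first; e.g. 7 = 8 - 1 needs one subtract
    # instead of two adds.
    def to_csd(n):
        digits = []
        while n:
            if n & 1:
                d = 2 - (n & 3)   # +1 if n % 4 == 1, -1 if n % 4 == 3
                n -= d
            else:
                d = 0
            digits.append(d)
            n >>= 1
        return digits

    assert to_csd(7) == [-1, 0, 0, 1]
    assert sum(d << i for i, d in enumerate(to_csd(7))) == 7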
3. M. Jridi and A. Alfalou (2010), “A Low-Power, High-Speed DCT
architecture for image compression: principle and implementation” 18th IEEE/IFIP
International Conference on VLSI and System-on-Chip (VLSI-Soc 2010) pp.304-
309.
A low power and high speed DCT for image compression is implemented
on FPGA. The DCT optimization is achieved through the hardware simplification
of the multipliers used to compute the DCT coefficients. In this work the authors
have implemented the DCT with constant multiplier by making use of Canonic
signed digit encoding to perform constant multiplication. The canonic signed digit
representation is the signed data representation containing the fewest number of
nonzero bits. Thus, for the constant multipliers, the number of additions and subtractions will be minimum. A common sub-expression elimination technique has also been used for further DCT optimization.
4. M. El Aakif, S. Belkouch, N. Chabini, M. M. Hassani (2011), “Low Power
and Fast DCT Architecture Using Multiplier-Less Method”, IEEE International
Conference on Faible Tension Faible Consommation (FTFC). pp.63-66.
In this paper a new modified flow graph algorithm (FGA) of the DCT, based on the Loeffler algorithm with a hardware implementation of multiplier-less operation, has been proposed. The proposed FGA uses unsigned constant coefficient multiplication. The multiplier-less method is widely used for VLSI realization because of improvements in speed, area overhead and power consumption. The Loeffler fast algorithm is used to implement the 1-D DCT, and two 1-D DCT steps are used to generate the 2-D DCT coefficients.
5. Yongli Zhu, Zhengya Xu (2006) “Adaptive Context Based Coding for
Lossless Color Image Compression”, IMACS Multiconference on Computational
Engineering in Systems Applications (CESA), Beijing, China, pp.1310-1314.
An adaptive low complexity lossless segment (context) based color image
compression method is proposed by the authors. Adaptive context based division of an image is conducted, using a modified greedy algorithm, to find optimal fixed length codes that are used for storing the relative values of every pixel in each segment of the image and the length of each segment. A modified Huffman coding is then applied to the result of the first compression. The compression ratios obtained
were high.
6. Sunil Bhooshan, Shipra Sharma (2009), “An Efficient and Selective Image
Compression Scheme using Huffman and Adaptive Interpolation”, 24th
International Conference on Image and Vision Computing New Zealand (IVCNZ),
pp.1-3.
In this paper the authors have made use of lossy and lossless compression techniques. Different blocks are compressed in one of the two ways depending on the information content in that block. The process of compression begins with passing the image through a high pass filter, and the image matrix is divided into a number of non-overlapping sub-blocks. Each of the sub-blocks is checked for the number of zeros by setting a threshold. If the number of zeros in a particular block is more than the threshold, it implies that the block contains less information, and that particular block from the original image matrix is taken for lossy compression using adaptive interpolation. On the other hand, if the number of zeros is less than the threshold, it implies that the information content is more, and thus the corresponding block of the original image matrix is subjected to lossless compression using Huffman coding. The authors conclude that the computational complexity of this approach is low and the compression ratios obtained by this method are also high.
7. Piyush Kumar Shukla, Pradeep Rusiya, Deepak Agrawal, Lata Chhablani,
Balwant Singh (2009), “Multiple Subgroup Data Compression Technique Based On Huffman Coding”, First International Conference on Computational
Intelligence, Communication Systems and Networks (CICSYN), pp.397-402.
In this paper the authors have proposed data compression for mathematical text files based on Adaptive Huffman coding; the compression ratio obtained is higher than that of standard Adaptive Huffman coding. In the method used by the authors, the encoding process encodes the frequently occurring characters with shorter bit codes and the infrequently occurring characters with longer bit codes.
The algorithm proceeds as follows: subgroups of 256 characters are made. These
subgroups are again grouped into three groups of alphabets, numbers and some
operators and remaining symbols. The symbols of each group are arranged in the
decreasing order of probability of occurrence. Codeword for each character is
provided and the data file is thus encoded.
8. Dr. Muhammad Younus Javed and Abid Nadeem (2000), “Data
Compression Through Adaptive Huffman Coding Scheme”, IEEE Proceedings on TENCON, Vol.2, pp.187-190.
In this paper the authors describe the development of a data compression
system that employs the Adaptive Huffman method for generating variable length
codes. Adaptive Huffman coding has been used for dynamic data compression in
order to decrease the average code length used to represent the symbols of
character sets. This encoding system is used on text files where the characters in
the text are encoded with variable length codes. Shorter codes are used to encode
the character that appears frequently and longer length codes are used to encode
characters that appear infrequently. The authors conclude that this scheme is well
suited for online encoding/decoding in data networks and compression for larger
files in order to reduce storage and transmission.
9. A.P. Vinod, D. Rajan and A. Singla (2007), “ Differential pixel-based low-
power and high-speed implementation of DCT for on-board satellite image
processing”, IET international Journals on Circuits, Devices and Systems, pp. 444-
450.
In this paper the authors have presented the techniques for minimizing the
complexity of multiplication by employing the Differential Pixel Image (DPI). The DPI is the matrix obtained by taking the difference of intensities of adjacent pixels in the
input image matrix. The use of DPI instead of the original image matrix results in
significant reduction in the number of operations and hence the power consumed.
The intensity of the pixel in the DPI is obtained as fd (x,y)= [f(x,y) – f(x,y-1)],
where f(x,y) is the intensity at (x,y) in the original image. Also the intensity of the
first pixel of every sub-block of the DPI will be the same as the original matrix. In
this work the DCT coefficient matrix is represented using canonic signed digits.
The authors have also used a common sub-expression elimination method, in which multiple occurrences of identical bit patterns are identified in the DCT matrix, thereby reducing the necessary resources, which can be shared.
10. Muhammed Yusuf Khan, Ekram Khan, M.Salim Beg (2008),”
Performance Evaluation of 4x4 DCT Algorithms For Low Power Wireless
Applications”. First International Conference on Emerging Trends in Engineering
and Technology, pp.1284-1286.
In this paper the authors have compared the performance of 4x4 DCT with
8x8 DCT, since small size DCT is suitable for mobile applications using low
power devices as fast computation speed is required for real time applications. The
authors have compared the performance of 4x4 transforms with the conventional
8x8 DCT in floating point. Firstly, the authors have compared the conventional
4x4 DCT in floating point with conventional 8x8 DCT in floating point. Next, the
4x4 integer transform is compared with the conventional 8x8 DCT in floating
point. The comparison was done on computation time of the transform and inverse
transform and objective quality, based on the calculation of PSNR between input
and reconstructed image. The authors have concluded that the integer transform
approximation of the DCT will reduce the computational time considerably.
11. A.Pradini, T.M.Roffi, R.Dirza, T.Adiono (2011), “VLSI Design of a High-
Throughput Discrete Cosine Transform for Image Compression System”,
International Conference on Electrical Engineering and Informatics, Indonesia
(ICEEI), pp.1-6.
In this paper the authors have proposed a unique 2D DCT architecture
based on regular butterfly structure. The architecture employs eight 1D DCT
processors and four post-addition stages to obtain the 2D DCT coefficients. Each
1D DCT processor is designed using Algebraic Integer Encoding architecture
which requires no multipliers, therefore the entire 2D DCT architecture is
multiplier less. The multiplier less design approach has given the high throughput
to the architecture.
12. S.V.V.Sateesh, R.Sakthivel, K.Nirosha, Harisha M.Kittur (2011), “An
Optimized Architecture to Perform Image Compression and Encryption
Simultaneously Using Modified DCT Algorithm”, IEEE International Conference
on Signal Processing, Communication, Computing and Network Technologies,
pp.442-447.
In this paper the authors have designed an architecture for the DCT based on the Loeffler scheme. The 1-D DCT is calculated for only the first two terms and the remaining six terms are taken as zero. The architecture takes in 8 pixels as input every clock cycle and generates only 2 outputs against the 8 outputs of the traditional Loeffler DCT. Thus the architecture needs only 4 multipliers and 14 adders. The
adders used are carry select adders and the multipliers used are high performance
multipliers.
13. Hatim Anas, Said Belkouch, M. El Aakif, Noureddine Chabini (2011), “FPGA Implementation of a Pipelined 2D DCT and Simplified Quantization for Real Time Applications”, IEEE International Conference on Multimedia Computing and Systems, pp.1-6.
This paper explores a new approach to obtain the DCT using flow graph
algorithm. All of the multiplications are merged in the quantization block. To
avoid the reduction in the operating frequency during the division at the
quantization process, all of the elements in the quantization matrix are represented
in the nearest powers of 2. Authors have shown that this method outperforms any
of the approaches in terms of operating frequency and resource requirements.
14. Chi-Chia Sun, Benjamin Heyne, Juergen Goetze (2006), “A Low Power and High Quality CORDIC Based Loeffler DCT”, IEEE conference.
In this paper a low power, quality preserving DCT architecture is presented. It is obtained by optimizing the Loeffler DCT based on the CORDIC algorithm. The architecture design starts with the basic Loeffler DCT, which requires 11 multiplications. This paper mainly concentrates on area and power
reduction. At the same time it maintains the same transformation quality as
original Loeffler DCT. It is very suitable for low-power and high quality CODECs,
especially for battery based systems.
15. Hai Huang, Tze-Yun Sung, Yaw-shih Shieh (2010), “A Novel VLSI
Linear Array for 2-D DCT/IDCT”, IEEE 3rd International Congress on Image and Signal Processing, pp.3680-3690.
This paper proposes an efficient 1-D DCT and IDCT architectures using
sub-band decomposition algorithm. The orthonormal property of DCT/IDCT
transformation matrices is fully used to simplify the hardware complexities. The proposed architectures have low computation and hardware complexity for both the DCT and the IDCT, and are fully pipelined and scalable for variable length 2-D DCT/IDCT computation.
The proposed architecture requires 3 multipliers and 21 adders. In addition the
proposed architecture is highly regular, scalable, and flexible.
16. Byoung-Il Kim, Sotirios G. Ziavras (2009), “Low Power Multiplierless
DCT for Image/Video Coders”, IEEE 13th International Symposium on Consumer
Electronics, pp.133-136.
This paper mainly focuses on the power efficiency of coders. Power reduction is achieved by minimising the number of arithmetic operations and their bit-width. To minimise arithmetic operation redundancy, the DCT design focuses on Chen's factorization approach and the constant matrix multiplication (CMM) problem. The 8×1 DCT is decomposed using six two-input butterfly networks. Each butterfly performs a 2×2 matrix multiplication and requires a maximum of eight adders/subtractors with 13-bit cosine coefficients. An extended canonic signed digit (CSD) format is used for expressing the binary constant coefficients. An adaptive companding scheme is used for bit-width reduction. It consists of a bit-width compressor and a bit-width expander, and provides the butterflies with reduced bit-width while minimising image/video quality degradation.
17. Tze-Yun Sung, Yaw-Shih Shieh, Chun-Wang Yu and Hsi-Chin Hsin (2006), “High Efficiency and Low Power Architectures for 2-D DCT and IDCT Based on CORDIC Rotation”, Seventh International Conference on Parallel, Distributed Computing, Applications and Technologies (PDCAT), pp.191-196.
Multiplication is the key operation for both DCT and IDCT. In the
CORDIC based processor, multipliers can be replaced by simple shifters and
adders. The double rotation CORDIC algorithm has even better latency compared to the conventional CORDIC based algorithm. Hardware implementation of the 8-point 2-D
DCT requires two SRAM banks (128 words), two 8 point DCT\IDCT processors,
two multiplexers and a control unit. By taking into account the symmetry
properties of the fast DCT/IDCT algorithm, high efficiency architecture with a
parallel – pipelined structure have been proposed to implement DCT and IDCT
processors. In the constituent 1-D DCT/IDCT processors, the double rotation
CORDIC algorithm with rotation mode in the circular co-ordinate system has been
utilised for the arithmetic unit for both DCT/IDCT i.e. multiplication computation.
Thus they are very much suited to VLSI implementation with design tradeoffs.
18. Gopal Lakhani (2004), “Optimal Huffman Coding of DCT Blocks”, IEEE
transactions on circuits and systems for video technology, Vol.14, issue.4. pp 522-
527.
A minor modification to the Huffman coding for JPEG image compression is made. During the run length coding, instead of pairing the non-zero coefficient with the preceding number of zeros, the authors have designed an encoder that pairs it with the subsequent number of zeros. This change in the run length encoder is utilized in forming a Huffman table that is optimized for the position of the non-zero coefficient denoted by the pair. The advantage of this coding method is that no EOB (end-of-block) marker is needed to represent the end of a block.
19. Raymond K.W.Chan, Moon-Chuen Lee (2006), “Multiplierless
approximation of fast DCT algorithms”, IEEE International conference on
multimedia and Expo, pp.1925-1928.
This paper presents an effective method to convert any floating-point 1-D DCT into an approximate multiplier-less version with shift and add operations, and converts AAN's fast DCT algorithm to its multiplier-less version. Experimental results show that AAN's fast DCT algorithm, approximated by the proposed method using an optimized configuration, can be used to reconstruct images with high visual quality in terms of peak signal to noise ratio (PSNR). The constant coefficients are represented in MSD form. All the butterflies present in AAN's algorithm are converted to lifting structures before the proposed method is applied.
20. Kamrul Hasan Talukder and Koichi Harada (2007), “Discrete Wavelet
Transform for Image Compression and A Model of Parallel Image Compression
Scheme for Formal Verification”, Proceedings of the World Congress on
Engineering.
The use of discrete wavelet for image compression and a model of the
scheme of verification of parallelizing the compression have been presented in this
paper. The wavelet transform exploits both the spatial and frequency
correlation of data by dilations (or contractions) and translations of mother wavelet
on the input data. It supports the multi resolution analysis of data i.e. it can be
applied to different scales according to the details required, which allows
progressive transmission and zooming of the image without the need of extra
storage. The DWT characteristics are therefore well suited for image compression and include the ability to take account of the Human Visual System's (HVS) characteristics, very good energy compaction capability, robustness under transmission, high compression ratio, etc. The implementation of the wavelet
compression scheme is very similar to that of subband coding scheme: the signal is
decomposed using filter banks. The output of the filter banks is down-sampled,
quantized, and encoded. The decoder decodes the coded representation, up-
samples and recomposes the signal. A model for parallelizing the compression
technique has also been proposed here.
21. Abdullah Al Muhit, Md. Shabiul Islam and Masuri Othman (2004), “VLSI
Implementation of Discrete Wavelet Transform (DWT) for Image Compression”,
2nd International conference on Autonomous Robots and Agents, New Zealand.
December pp.391-395.
This paper presents an approach towards VLSI implementation of the
Discrete Wavelet Transform (DWT) for image compression. The design follows
the JPEG2000 standard and can be used for both lossy and lossless compression.
In order to reduce complexities of the design, linear algebra view of DWT and
IDWT has been used in this paper. The DWT algorithm consists of Forward DWT
(FDWT) and Inverse DWT (IDWT). The FDWT can be performed on a signal
using different types of filters such as db7, db4 or Haar. The Forward transform
can be done in two ways, such as matrix multiply method and linear equations.
After the FDWT stage, the resulting average and detail wavelet values can be
compressed using thresholding method. In the IDWT process, to get the
reconstructed image, the wavelet details and averages can be used in the matrix
multiply method and linear equations. This design can be used for image
compression in a robotic system.
22. En-Hui Yang, Longji Wang (2009), “Joint Optimization of Run-Length Coding, Huffman Coding, and Quantization Table with Complete Baseline JPEG Decoder Compatibility”, IEEE Transactions on Image Processing, pp.63-74.
The authors have presented a graph-based R-D optimal algorithm for JPEG
run-length coding. It finds the optimal run size pairs in the R-D sense among all
possible candidates. Based on this algorithm, they have proposed an iterative
algorithm to optimize run-length coding, Huffman coding and quantization table
jointly. The proposed iterative joint optimization algorithm results in up to 30% bit
rate compression improvement for the test images, compared to baseline JPEG.
The algorithms are not only computationally efficient but completely compatible
with existing JPEG and MPEG decoders. They can be applied to the application
areas such as web image acceleration, digital camera image compression, MPEG
frame optimization and transcoding, etc.
23. D.A. Karras, S.A. Karkanis and B.G. Mertzios (1998), “Image Compression Using the Wavelet Transform on Textural Regions of Interest”, 24th IEEE International Euromicro Conference, pp.633-639.
This paper suggests a new image compression scheme, using the discrete wavelet transform (DWT), which attempts to preserve the texturally important image characteristics. The main point of the proposed methodology is that the image is divided into regions of textural significance, employing textural descriptors as criteria and fuzzy clustering methodologies. These textural descriptors include co-occurrence matrix based measures and coherence analysis derived features. More specifically, the DWT is applied separately to each region into which the original image is partitioned and, depending on how a region has been texturally clustered, the relative number of wavelet coefficients to keep is then determined. Therefore, different compression ratios are
applied to the image regions. The reconstruction process of the original image
involves the linear combination of its corresponding reconstructed regions.
24. Muhammad Bilal Akhtar, Adil Masoud Qureshi, Qamar-Ul-Islam (2011), “Optimized Run Length Coding for JPEG Image Compression Used in Space Research Program of IST”, IEEE International Conference on Computer Networks and Information Technology (ICCNIT), pp.81-85.
In this paper the authors have proposed a new scheme for run length coding
to minimize the error during the transmission. The optimized run length coding
uses a pair of (RUN, LEVELS) only when a pattern of consecutive zeros occur at
the input of the encoder. The non zero digits are encoded as their respective values
in LEVELS parameters. The RUN parameter is eliminated from the final encoded
message for non zero digits.
25. Gregory K. Wallace (1991), “The JPEG Still Picture Compression Standard”, IEEE Transactions on Consumer Electronics, Vol.38, Issue.1, pp.18-38.
In this paper the author has given the brief description for JPEG image
compression standard, which uses both lossy and lossless compression methods.
For JPEG images the lossy compression is based on DCT followed by
quantization. The lossless method is based on entropy coding which is a
completely reversible process. The procedures of run length coding are also given
by the author.
26. Mustafa Safa Al-Wahaiba, KokSheik Wong, “A Lossless Image
Compression Algorithm Using Duplication Free Run-Length Coding”, Second
International Conference on Network Applications, Protocols and Services
(NETAPPS) , pp.245-250.
In this paper authors propose a novel lossless image compression algorithm
using duplication free run-length coding. An entropy rule-based generative coding
method was proposed to produce code words that are capable of encoding both
intensity level and different flag values into a single codeword. The proposed
method gains compression by reducing a run of two pixels to only one codeword.
The algorithm has no duplication problem, and the number of pixels that can be
encoded by a single run is infinite.
27. Jason McNeely and Magdi Bayoumi (2007), “Low Power Look-Up Tables
for Huffman Decoding”, IEEE International Conference on Image Processing,
pp.465-468.
This work includes a study of different lookup tables for Huffman decoding. It was found that the PLA type architecture is common among Huffman lookup tables because of its speed and simplicity advantages. In certain
situations, it was determined that the tree structure for lookup tables can be a low
power alternative to the PLA structure. These situations accounted for 56% of the
total simulation runs, and of these runs, the average power savings of the tree in
those situations was found to be 78%. The work also shows the effect of varying
the table size and varying the probability distributions of a table on power, area
and delay.
28. Bao Ergude, Li Weisheng, Fan Dongrui, Ma Xiaoyu (2008), “A Study and
Implementation of the Huffman Algorithm Based on Condensed Huffman Table”,
IEEE International Conference on Computer Science and Software Engineering,
pp.42-45.
They have used the property of canonical Huffman tree to study and
implement a new Huffman algorithm based on condensed Huffman table, which
greatly reduces the expense of Huffman table and increases the compression ratio.
The binary sequence in the paper requires only a small space and under some
special circumstances, the level without leaf is marked to be 1, which can further
reduce the required size.
29. Sung-Wen Wang, Shang-Chih Chuang, Chih-Chieh Hsiao, Yi-Shin Tung
and Ja-ling Wu (2008), “An efficient Memory Construction Scheme for an
Arbitrary Side Growing Huffman table”, IEEE International conference on
multimedia and Expo.
To speed up Huffman decoding, a memory efficient Huffman table is constructed on the basis of an arbitrary-side growing Huffman tree (AGH-tree), by grouping the common prefixes of a Huffman tree, instead of the commonly used single-side growing Huffman tree (SGH-tree). Simulation results show that, in Huffman decoding, an AGH-tree based Huffman table is 2.35 times faster than Hashemian's method (an SGH-tree based one) and needs only one-fifth of the corresponding memory size.
30. Reza Hashemian (2003), “Direct Huffman Coding and Decoding using the
Table of Code-Lengths”, IEEE International Conference on Information Technology: Coding and Computing, pp.237-241.
This work includes developing of a memory efficient and high-speed
search technique for encoding and decoding symbols using Huffman coding. This
technique is based on a Condensed Huffman Table (CHT) for decoding purposes,
and it is shown that a CHT is significantly smaller than the ordinary Huffman
Table. In addition, the procedure is shown to be faster in searching for a code-word
and its corresponding symbol in the symbol-list. An efficient technique is also
proposed for encoding symbols by using the code-word properties in a Single-Side
Growing Huffman Table (SGHT), where code-word values are ordered in the
ascending order exactly in contrast with the probabilities that are in descending
order.
31. Jia-Yu Lin, Ying Liu, and Ke-Chu Yi (2004), “Balance of 0, 1 Bits for Huffman and Reversible Variable-Length Coding”, IEEE Journal on Communications, pp.359-361.
The authors have proposed an effective algorithm to make the bit
probabilities balanced. This algorithm can be embedded in the construction of the
Huffman codebook, with little complexity. In the analysis of RVLCs based on the
Huffman codes, it was shown that the bidirectionally decodable stream had good
performance under the bit-balance criterion, and it could be combined with the
proposed algorithm to further decrease the bit-probability difference. For
symmetrical and asymmetrical RVLCs, probability differences could be decreased
by reassigning code words to source symbols after the creation of codebooks.
32. Da An, Xin Tong, Bingqiang Zhu and Yun He (2009), “A Novel Fast DCT
Coefficient Scan Architecture”, IEEE Picture Coding Symposium I, Beijing
100084, China, pp.1-4.
A novel, fast and configurable architecture for zigzag scan and optional
scans in multiple video coding standards, including H.261, MPEG-1,2,4,
H.264/AVC, and AVS is proposed. Arbitrary scan patterns could be supported by
configuring the ROM data, and the architecture can largely reduce the processing
cycles. The experimental results show the proposed architecture is able to reduce
up to 80% of total scanning cycles on average.
33. Pablo Montero, Javier Taibo Gulias, Samuel Rivas (2010), “Parallel
Zigzag Scanning and Huffman Coding for a GPU-Based MPEG-2 Encoder”, IEEE
International Symposium on multimedia pp.97-104.
This work describes three approaches to compute the zigzag scan, run-
level, and Huffman codes in a GPU based MPEG-2 encoder. The most efficient
method exploits the parallel configuration used for DCT computation and
quantization in the GPU using the same threads to perform the last encoding steps:
zigzag scan and Huffman coding. In the experimental results, the optimized
version averaged a 10% reduction of the compression time, including the
transference to the CPU.
34. Pei-Yin Chen, Yi-Ming Lin, and Min-Yi Cho (2008), “An Efficient Design of Variable Length Decoder for MPEG-1/2/4”, IEEE Transactions on Multimedia, Vol.16, Issue.9, pp.1307-1315.
The authors propose an area-efficient variable length decoder (VLD) for
MPEG-1/2/4. They employed an efficient clustering-merging technique to reduce
both the size of a single LUT and the total number of LUTs required for MPEG-
1/2/4, rather than using one dedicated lookup table for carrying out variable length coding. Synthesis results show that the VLD occupies 10666 gate counts and operates at 125 MHz using standard cells from Artisan TSMC's 0.18µm process. The proposed design outperforms other VLDs with less hardware cost.
35. Basant K. Mohanty and Pramod K. Meher (2010), “Parallel and Pipelined
Architectures for High Throughput Computation of Multilevel 3-D DWT”.
In this paper, the authors present a throughput-scalable parallel and pipelined architecture for high-throughput computation of the multilevel 3-D DWT. The computation of the 3-D DWT for each level of decomposition is split into three distinct stages, and all three stages are implemented in parallel by a processing unit consisting of an array of processing modules. The throughput rate of the proposed structure can easily be scaled, without increasing the on-chip storage and frame-memory, by using more processing modules, and it provides a greater advantage over the existing designs for higher frame-rates and higher input block-sizes. The full-parallel implementation of the proposed scalable structure provides the best performance.
36. Anirban Das, Anindya Hazra and Swapna Banerjee (2010), “An Efficient Architecture for 3-D Discrete Wavelet Transform (DWT)”.
An architecture for the 3-D DWT, a powerful image compression algorithm, is implemented using a lifting based approach. This architecture enjoys reduced memory referencing, correspondingly low power consumption, low latency and high throughput. The lifting based technique is carried out in two stages, namely a spatial transform stage and a temporal transform stage.
37. Michael Weeks and Magdy A Bayoumi (2002), “Three-Dimensional
Discrete Wavelet Transform Architectures”.
Here two different architectures for the Three-Dimensional Discrete Wavelet Transform are implemented: the 3DW-I and the 3DW-II. The first
architecture (3DW-I) is based on folding, whereas the 3DW-II architecture is
block-based. The 3DW-I architecture is an implementation of the 3-D DWT
similar to folded 1-D and 2-D designs. It allows even distribution of the processing
load onto 3 sets of filters, with each set performing the calculations for one
dimension. The control for this design is very simple, since the data are operated
on in a row-column-slice fashion. The 3DW-II architecture uses block inputs to
reduce the requirement of on-chip memory. It has a central control unit to select
which coefficients to pass on to the lowpass and highpass filters. Finally, the
3DW-I and 3DW-II architectures are compared according to memory
requirements, number of clock cycles, and processing of frames per second.
38. B. Das and Swapna Banerjee (2002), “Low power architecture of running
3-D wavelet transform for medical imaging application”.
In this paper a real-time 3-D DWT algorithm and its architecture realization
is proposed. A reduced buffer and low wait-time are the salient features which make it fit for bidirectional videoconferencing, mostly in real-time biomedical applications. Reduced hardware complexity and 100% hardware utilization are ensured in this design. The architecture is implemented in 0.25µ BiCMOS technology.
39. B.Das and Swapna Banerjee (2003), “A Memory Efficient 3-D DWT
Architecture”.
This paper proposes a memory efficient real-time 3-D DWT algorithm and its architectural implementation. Parallelism, an added advantage for fast processing, has been used with three pipelined stages in this architecture. The architecture proposed here is memory efficient and has a high throughput rate of one result per clock, with a low latency period. Daubechies wavelet filters are used for coefficient mapping, exploiting the correlation between the low pass and high pass filters. The 3-D DWT has been implemented for the 8-tap Daubechies filter; however, this algorithm can be extended to any number of frames at the cost of wait time. The architecture requires a simple, regular data-flow pattern; thus, the control circuitry overhead reduces, making the circuit efficient for high speed, low power applications. An optimization between parallelism and the pipelined structure has been used for making the circuit applicable to low power domains.
40. Basant K. Mohanty and Pramod K. Meher (2008), “Concurrent Systolic
Architecture for High-Throughput Implementation of 3- Dimensional Discrete
Wavelet Transform”.
In this paper, the authors present a novel systolic architecture for high-throughput computation of the 3-dimensional (3-D) discrete wavelet transform (DWT). The entire 3-D DWT computation is decomposed into three distinct stages and
implemented concurrently in a linear array of fully pipelined processing elements
(PE). The proposed structure for the 3-D DWT provides higher throughput than the existing architecture, involves nearly half or fewer of the multipliers and adders, and uses less on-chip memory (when normalized for unit throughput rate) than the other. The systolic design involves on-chip and off-chip storage of size O(MKN) and O(N²), respectively, and computes the 3-D DWT of input data of size (M × N × N) in approximately (MN²)/7 cycles, where K is the row and column size of the subband filter coefficient matrix, (N×N) is the size of each frame and M is the frame rate of the video input.
Higher computational throughput rate is achieved in this architecture which
does not involve any off-chip memory and involves either the same or less on-chip
memory than the existing design.
41. Basant K. Mohanty and Pramod K. Meher (2011), “Memory-Efficient
Architecture for 3-D DWT Using Overlapped Grouping of Frames”.
In this paper the authors present a memory efficient architecture for the 3-D DWT using overlapped grouping of frames (GOFs). It involves only a frame-buffer of size O(MN) to compute the multilevel 3-D DWT, unlike the existing folded structures, which involve a frame-buffer of size O(MNR).
two types of hardware components: (i) combinational component and (ii)
memory/storage component. The combinational component consists mainly of
arithmetic circuits and multiplexors; and the memory component consists of a
frame-memory, temporal-memory, registers and transposition-memory. Frame-
memory is usually external to the chip, while temporal-memory may either be on-
chip or external.
Due to less memory complexity, this architecture dissipates significantly
less dynamic power than the existing structures. It can compute multilevel running 3-D DWT on an infinite number of GOFs and involves much less memory and fewer resources than the existing designs. It could, therefore, be used for high-performance video processing applications.
42. Erdal Oruklu, Sonali Maharishi and Jafar Saniie (2007), “Analysis of
Ultrasonic 3-D Image Compression Using Non-Uniform, Separable Wavelet
Transforms”.
Here Discrete Wavelet Transform (DWT) is used for compression of 3-D
ultrasound data. Different wavelet kernels are analyzed and benchmarked for
compression of experimental signals. In order to reduce computational complexity,
non-uniform DWT method is utilized where different wavelet filters are applied to
ultrasonic axial resolution and spatial resolutions. The axial resolution contains more information than the spatial resolutions; therefore simple wavelet filters such as Haar can be used for the spatial resolutions, reducing the computations significantly. A thresholding concept is also used, in which a threshold is applied to the transform coefficients of the original ultrasonic signal for data compression.
CHAPTER 3
VLSI DESIGN FLOW AND LOW POWER VLSI DESIGN
3.1 INTRODUCTION
The previous chapter presented the literature review relevant to the present research, in which the present researcher went through the different approaches presented by different authors to implement image compression with a low power VLSI design approach. In this chapter, the author discusses the VLSI design flow with the concepts of verification, i.e. how to adopt linting and code coverage, synthesis of the design with low power concepts, and finally the physical design. The author also discusses the need for low power in VLSI design and the different low power techniques applied in VLSI chip design.
The complexity of Very Large Scale Integrated circuits (VLSI) being
designed and used today makes the manual approach to design impractical. Design
automation is the order of the day. With the rapid technological developments in
the last two decades, the status of VLSI technology is characterized by the
following:
A steady increase in the size and hence the functionality of the ICs.
A steady reduction in feature size, and hence an increase in the speed of operation as well as in gate or transistor density.
A steady improvement in the predictability of circuit behavior.
A steady increase in the variety and size of software tools for VLSI design.
The above developments have resulted in a proliferation of approaches to
VLSI design. An abstraction based model is the basis of the automated design.
3.2 ASIC DESIGN FLOW
As with any other technical activity, development of an Application Specific Integrated Circuit (ASIC) starts with an idea and takes tangible shape through the stages of development shown in Figure 3.1; Figure 3.2 shows the details of the design flow. The first step in the process is to expand the idea in terms of the behavior of the target circuit. Through stages of programming, this is fully developed into a design description in terms of well defined standard constructs and conventions.
Figure 3.1 Major activities in ASIC design
The design is tested through a simulation process to check, verify, and ensure that what is described is what is wanted. Simulation is carried out through dedicated simulation tools. With every simulation run, the simulation results are studied to identify errors in the design description. The errors are corrected and another simulation run is carried out. Simulation and changes to the design description together form a cyclic iterative process, repeated until an error-free design is evolved.
Design description is an activity independent of the target technology or manufacturer. It results in a description of the digital circuit. To translate it into a tangible circuit, one goes through the physical design process, which constitutes a set of activities closely linked to the manufacturer and the target technology.
3.3 DESIGN DESCRIPTION
Figure 3.2 ASIC design and development flow
Figure 3.2 describes the details of the design flow, which is carried out in different stages. The process of transforming the idea into a detailed circuit description in terms of elementary circuit components constitutes design description. The final circuit of such an IC can have up to a billion such components; it is arrived at in a step-by-step manner.
The first step in evolving the design description is to describe the circuit in terms of its behavior. The description looks like a program in a high level language like C. Once the behavioral level design description is ready, it is tested extensively with the help of a simulation tool, which checks and confirms that all the expected functions are carried out satisfactorily. If necessary, this behavioral level routine is edited and modified. Finally, one has a design for the expected system described at the behavioral level. The behavioral design forms the input to the synthesis tools for circuit synthesis. The behavioral constructs not supported by the synthesis tools are replaced by data flow and gate level constructs. To summarize, the designer has to develop synthesizable code for the design.
The design at the behavioral level is to be elaborated in terms of known and
acknowledged functional blocks. It forms the next detailed level of design
description. Once again the design is to be tested through simulation and iteratively
corrected for errors. The elaboration can be continued one or two steps further. It
leads to a detailed design description in terms of logic gates and transistor
switches.
3.4 DESIGN OPTIMIZATION
The circuit at the gate level, in terms of gates and flip-flops, can be redundant in nature. It can be minimized with the help of minimization tools. The minimized logical design is converted to a circuit in terms of switch level cells from the standard libraries provided by the foundries; in the proposed work the designer has used both 90nm and 65nm standard cells. The cell based design generated by the tool is the last step in the logical design process, and it forms the input to the first level of physical design.
3.5 BEHAVIORAL SIMULATION
The design descriptions are tested for their functionality at every level: behavioral, data flow, and gate. One has to check here whether all the functions are carried out as expected, and rectify them if not. All such activities are carried out by the verification tool. The tool also has an editor to carry out any corrections to the source code. Simulation involves testing the design for all its functions, functional sequences, timing constraints, and specifications. Normally, testing and simulation at all levels, from behavioral to switch level, are carried out by a single tool. Figure 3.3 shows the RTL verification flow with linting.
3.5.1 Specification for the Design
During this stage, all the sub systems required to meet the specifications are captured in a block diagram. These block diagrams are at the system level, realized at a higher level of abstraction to obtain sustainable performance, and are observed at different transaction levels.
3.5.2 Behavioral or RTL Design
All the subsystems and sub blocks of the design are coded in an HDL, either Verilog or VHDL. These sub-systems may also be obtained in the form of Intellectual Property (IP), since many optimized and efficient RTL codes are available from previous designs within the company or from IP vendors. All of these are put together to create a verification environment to verify the total functionality. Verification is done with many techniques and is based on the design stage.
3.6 VERIFICATION OF THE DESIGN
Verification is carried out with simulators and also with emulation as well as FPGA prototyping. Initial verification is carried out using industry standard simulators, predominantly event based simulators. The verification is done in different stages, as shown in Figure 3.3. Simulation is software based, is less expensive and can be done with a quick setup time. Emulation is hardware based verification; it can be carried out when the full RTL code/design is available, but it is very expensive and also time consuming for the initial setup. The FPGA based approach is efficient, but again it can be carried out only when the full RTL code/design is available, and additional resources and skill sets are required.
In general, Register Transfer Level (RTL) simulation and verification is one of the initial steps to be done. This step ensures that the design is logically correct and without major timing errors. It is advantageous to perform this step, especially in the early stages of the design, because long synthesis and place-and-route times can be avoided when an error is discovered at this stage. This step also eliminates all the syntactical errors from the HDL code. Simulation tools are used to perform RTL verification. More specifically, all simulators first check the correctness of the code with compilation (an HDL parser); secondly, they elaborate the design, the stage in which the actual implementation of the hardware structure is done, specific to the platform (OS). The elaborated code is then loaded into the simulator along with the test stimuli, and all input and output waveforms can be observed.
Figure 3.3 RTL Verification flow with Linting and Code coverage
As the verification continues, the quality of the tests needs to be checked, so the adoption of newer techniques is required. This improvement in test quality is provided by code coverage tools, which supply statistics about the test coverage achieved. Code coverage is a process of validating or finding the quality of the test bench for the RTL code of a particular design. It is a measurement which tells how well the design has been exercised by the test bench / test cases. Code coverage is applied to the DUT (Design Under Test) to check how thoroughly the HDL is exercised by the test suite. Code coverage points out the portions of code which were not exercised, and also points out corner/edge cases and unused/dead code.
Before simulation, certain checks are done to clean up the RTL. In general,
lint tools flag suspicious and non-portable/non-synthesizable usage of the
Verilog/VHDL language and point out code that is likely to contain bugs. In the
industry, lint tools are also referred to as design rule checkers. They check the
cleanliness and portability of the HDL code against the IEEE standards. A compiler
usually does not show the errors and warnings that are detected by lint tools, which
gives linting many advantages.
Many standards and lint check rules are available. Some example rule categories
are listed below, followed by a short sketch of code that such rules typically flag:
1. Coding style
2. DFT rule checks
3. Design style
4. Language constructs for synthesizability
5. Race condition detection
6. Combinational logic loops
7. Custom rule checking
8. Documentation to select/deselect the rules/standards during lint checking
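As an illustration (a minimal sketch, not taken from the thesis RTL), the following
Verilog fragment contains two constructs that typical lint tools flag: an incomplete
sensitivity list and an unintended inferred latch.

module lint_example (input a, b, sel,
                     output reg y, q);
  // Lint finding 1: 'b' is missing from the sensitivity list, so
  // simulation and synthesis can disagree; always @(*) is the usual fix.
  always @(a or sel)
    y = sel ? a : b;
  // Lint finding 2: a combinational 'if' without an 'else' branch, so a
  // level-sensitive latch is inferred for 'q'.
  always @(*)
    if (sel)
      q = a;
endmodule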
3.7 SYNTHESIS OF THE DESIGN
With the availability of design at the gate (switch) level, the logical design
is complete. The corresponding circuit hardware realization is carried out by a
synthesis tool. The synthesis flow with Low power approach is shown in Figure
3.4.
Figure 3.4 Synthesis flow with Low power and UPF
There are two common approaches used in the synthesis of VLSI systems.
In the first, the circuit is realized through an FPGA. The gate level design
description is the starting point for the synthesis. The FPGA vendors provide an
interface to the synthesis tool, through which the gate level design is realized as a
final circuit. With many synthesis tools, one can directly use the design description
at the data flow level itself to realize the final circuit through an FPGA. The FPGA
route is attractive for limited volume production or a fast development cycle.
In the second approach, the circuit is realized as an ASIC. A typical ASIC
vendor will have a standard library for a particular technology, containing basic
components like elementary gates and flip-flops. Eventually the circuit is realized
by selecting such components and interconnecting them conforming to the required
design; this constitutes the physical design. Being an elaborate and costly process,
a physical design may call for an intermediate functional verification through the
FPGA route: the circuit realized through the FPGA is tested as a prototype,
providing another opportunity to test the design closer to the final circuit. The
present researcher's proposed work was synthesized using the ASIC approach, by
selecting the different standard cells.
The automated process of converting HDL code into a technology-specific
structural/gate-level netlist using EDA tools is called synthesis. Synthesis requires
the RTL code, a constraints file (design constraints and timing constraints), a
delay/parasitic file, and a technology-specific timing library in the industry de facto
standard Liberty format, commonly called the .lib file. The synthesis engine has a
parser to understand the HDL code, specifically the 1995 and 2001 standards. It
also has an inbuilt STA (static timing analysis) engine to calculate timing and
synthesize for the required frequency/speed.
The outputs of the synthesis engine are the following.
1. Netlist
2. Delay file
3. SDC file (Synopsys design constraints)
Reports:
1. Timing
2. Area
3. Gate
4. Gated clock
5. Power
6. Generated clock
Output files like the technology-specific netlist and the SDC file will be the
inputs for the physical design.
3.8 VLSI PHYSICAL DESIGN
A fully tested and error-free design at the switch level can be the starting
point for a physical design. The hierarchy of the physical design is shown in
Figure 3.5, and the complete flow is shown in Figure 3.6. The physical design is to
be realized as the final circuit using (typically) millions of components from the
foundry's library. The step-by-step activities in the process are described briefly
as follows.
Figure 3.5 VLSI Physical Design Hierarchy
3.8.1 Floor Planning
Floor-planning means placing the IPs (standard cells, blocks) based on their
connectivity, placing the memories, creating the pad-ring, and placing the pads
(signal/power/transfer cells to switch between voltage domains, and corner pads)
with proper accessibility for routing, while meeting the simultaneous switching
noise requirements so that switching of a high-speed bus does not create any
noise-related activity. An optimized floor plan achieves effective utilization of the
target chip.
3.8.2 Placement
The selected components from the ASIC library are placed in position on
the “Silicon floor”. It is done with each of the blocks used in the design. During
the placement, rows are cut, blockages are created where the tool is prevented from
placing the cells, and then the physical placement of the cells is performed based
on the timing/area requirements. The power-grid is built to meet the power targets
of the Chip.
3.8.3 Routing
The components placed as described above are to be interconnected to the
rest of the block. It is done with each of the blocks by suitably routing the
interconnects. Once the routing is complete, the physical design is taken as
complete. The final mask for the design can be made at this stage and the ASIC
manufactured in the foundry. At first, global routing is performed to check
placement legalization and routability with the present placement, without routing
congestion. Detailed routing is performed after clock tree synthesis; it is the actual
routing between all the cells/modules and blocks, with delay calculation and static
timing analysis. In a timing-driven methodology, the router is run together with
static timing analysis and all possible delay calculations. Finally, STA (Static
Timing Analysis) is performed with the SPEF file and the routed netlist file, to
check whether the design meets the timing requirements.
Figure 3.6 VLSI Physical Design Flow
3.8.4 Physical Verification
This stage involves checking the design for all manufacturability and
fabrication requirements.
1. DRC
2. ERC
3. LVS
To perform DRC (Design Rule Check), verification is done with a standard
rule file called a runset, which has all the physical rules pertaining to the specific
technology library. It is provided by the foundry to confirm that the design meets
the fabrication requirements.
ERC (Electrical Rule Checking) is performed to confirm that the design meets the
ERC requirements, i.e., to check for any open circuits, short circuits, or floating
nets in the layout.
One of the important stages in physical design is the LVS (Layout vs.
Schematic) check; this part of the verification compares the routed netlist against
the synthesized netlist to confirm that the two match.
For effective timing closure, a separate Static Timing Analysis needs to be run at
every stage to verify the signal integrity of the chip. STA is important because
signal-integrity effects can cause cross-talk delay and cross-talk noise, affecting
both the functionality and the timing of the design.
3.9 POST LAYOUT SIMULATION
Once the placement and routing are completed, the performance
specifications like silicon area, power consumed, path delays, etc., can be
computed. An equivalent circuit can be extracted at the component level and
performance analysis carried out. This constitutes the final stage, called
“verification.” One may have to go through the placement and routing activity
once again to improve performance.
3.10 LOW POWER VLSI DESIGN
Historically, VLSI designers have used circuit speed as the "performance"
metric. As design sizes grew, the demands on performance and silicon area
increased, and with them the power consumption; as the number of integrated
devices increased, the demand for power also increased exponentially. In fact,
power considerations have long been the ultimate design criterion in portable
applications such as mobile phones, music systems like MP3 and MP4 players,
wristwatches, and pacemakers. The objective in these applications is minimum
power for maximum battery life time. In almost all recent applications, power
dissipation is becoming an important constraint in large-scale integration design.
These low power requirements continue with other battery-powered systems such
as laptops, notebooks, digital readers, digital cameras, electronic organizers, etc.
In general, power reduction can be implemented at different levels of design
abstraction:
1. System,
2. Architectural,
3. Gate,
4. Circuit and
5. Technology level.
At the system level, inactive modules may be turned off to save power. At
the architectural level, parallel hardware may be used to reduce global interconnect
and allow a reduction in supply voltage without degrading system throughput.
Clock gating is commonly used at the gate level. Many design techniques can be
used at the circuit level to reduce both dynamic and static power. For a design
specification, designers have many choices to make at different levels of
abstraction. Based on particular design requirement and constraints (such as
power, performance, cost), the designer can select a particular algorithm,
architecture and determine various parameters such as supply voltage and clock
frequency. This multi-dimensional design space offers a wide range of possible
trade-offs. The most effective design decisions derive from choosing and
optimizing architectures and algorithms at those levels.
Since CMOS power consumption is proportional to the clock frequency,
dynamically turning off the clock to unused logic or peripherals is an obvious way
to reduce power consumption. Control can be done at the hardware level, or it can
be managed by the operating system or application software. For example, some
systems and hardware devices have sleep or idle modes. Typically, in these modes,
the clocks to most of the sections are turned off to reduce power consumption. In
sleep mode the device is not working, and a wake-up event rouses the device from
sleep mode. Devices may require different amounts of time to wake up from
different sleep modes.
3.11 SOURCES OF POWER DISSIPATION
Reduction of power consumption makes a device more reliable. The need
for devices that consume a minimum amount of power was a major driving force
behind the development of CMOS technologies. As a result, CMOS devices are
best known for low power consumption.
Two components determine the power consumption in a CMOS circuit:
Static power consumption
Dynamic power consumption
CMOS devices have very low static power consumption, which is the result
of leakage current. This power consumption occurs when all inputs are held at
some valid logic level and the circuit is not in charging states. But, when switching
at a high frequency, dynamic power consumption can contribute significantly to
overall power consumption. Charging and discharging a capacitive output load
further increases this dynamic power consumption.
Figure 3.7 CMOS Circuit in sub threshold region
Power dissipation in CMOS devices or circuits involves both static and
dynamic power dissipations. In the deep submicron technologies as shown in
figure 3.7, the static power dissipation, caused by leakage currents and sub
threshold currents contribute a considerably small percentage to the total power
consumption, while the dynamic power dissipation, resulting from charging and
discharging of parasitic capacitive loads of interconnects and devices dominates
the overall power consumption. But as technologies scale down into the deep
submicron regime, static power dissipation becomes more dominant relative to the
dynamic power consumption. The aggressive scaling of device dimensions and the
reduction of supply voltages reduce the power consumption of the individual
transistors, but the exponential increase of operating frequencies results in a steady
increase of the total power consumption.
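In standard first-order notation (a textbook summary consistent with equation 3.1
given later, not a formula from this thesis), the total power can be written as

\[
P_{total} \;=\; \underbrace{\alpha\, C_L V_{DD}^{2} f \;+\; V_{DD} I_{sc}}_{\text{dynamic}} \;+\; \underbrace{V_{DD} I_{leak}}_{\text{static}}
\]

where \(\alpha\) is the switching activity factor, \(C_L\) the switched capacitance,
\(f\) the clock frequency, \(I_{sc}\) the short-circuit current during transitions,
and \(I_{leak}\) the leakage current.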
3.12 STATIC POWER CONSUMPTION
Typically, all low-voltage devices have a CMOS inverter in the input and
output stage. Therefore, for a clear understanding of static power consumption,
refer to the CMOS inverter modes shown in Figure 3.8.
Figure 3.8 CMOS Inverter mode for Static power consumption
As shown in Figure 3.8, if the input is at logic 0, the n-MOS device is OFF,
and the p-MOS device is ON. The output voltage is VCC, or logic 1. Similarly,
when the input is at logic 1, the associated n-MOS device is biased ON and the p-
MOS device is OFF. The output voltage is GND, or logic 0. Note that one of the
transistors is always OFF when the gate is in either of these logic states. Since no
current flows into the gate terminal, and there is no DC current path from VCC to
GND, the resultant quiescent (steady-state) current is zero; hence, static power
consumption is zero. However, there is a small amount of static power
consumption due to reverse-bias leakage between diffused regions and the
substrate. This leakage current leads to the static power dissipation.
3.13 DYNAMIC POWER DISSIPATION
Dynamic power consumption is basically the result of charging and
discharging capacitances. It can be broken down into three fundamental
components, which are:
1. Load capacitance transient dissipation
2. Internal capacitance transient dissipation
3. Current spiking during switching
3.14 LOAD CAPACITANCE TRANSIENT DISSIPATION
The first contributor to power consumption is the charging and discharging
of external load capacitances. Figure 3.9 is a schematic diagram of a simple CMOS
inverter driving a capacitive load. A simple expression for power dissipation as a
function of load capacitance is given by the equation 3.1.
Figure 3.9 Simple CMOS Inverter Driving a Capacitive External Load
\( P_D = C_L V_{CC}^{2} f \)  ….. (3.1)
Dynamic switching current is used in charging and discharging the circuit load
capacitance, which is composed of gate and interconnect capacitance. The greater
this dynamic switching current is, the faster the capacitive loads can be charged
and discharged, and the better the circuit will perform.
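As a quick worked example of equation 3.1 (illustrative numbers, not measurements
from this work), with \(C_L = 10\,\mathrm{pF}\), \(V_{CC} = 1.2\,\mathrm{V}\) and
\(f = 100\,\mathrm{MHz}\),

\[
P_D = C_L V_{CC}^{2} f = (10\times 10^{-12})(1.2)^{2}(100\times 10^{6}) \approx 1.44\,\mathrm{mW}.
\]

The quadratic dependence on \(V_{CC}\) is why supply-voltage reduction is the
single most effective lever on this term.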
3.14.1 Internal Capacitance Transient Dissipation
Internal capacitance transient dissipation is similar to load capacitance
dissipation, except that the internal parasitic “on-chip” capacitance is being
charged and discharged. Figure 3.10 is a circuit diagram of the parasitic nodal
capacitances associated with two CMOS inverters.
Figure 3.10 Parasitic Internal Capacitors Associated with Two Inverters
C1 and C2 are capacitances associated with the overlap of the gate area and
the source and channel regions of the P and N-channel transistors, respectively. C3
is due to the overlap of the gate and the drain (output), and is known as the Miller
capacitance. C4 and C5 are capacitances of the parasitic diodes from the output to
VCC and ground, respectively. Thus the total internal capacitance seen by inverter
1 driving inverter 2 is given by the equation 3.2.
\( C_L = C_1 + C_2 + C_3 + C_4 + C_5 \)  ….. (3.2)
Since an internal capacitance may be treated identically to an external load
capacitor for power consumption calculations, the same equation 3.1 can be used.
3.14.2 Current Spiking During Switching
The final contributor to power consumption is current spiking during
switching. While the input to a gate is making a transition between logic levels,
both the P-and N-channel transistors are turned partially on. This creates a low
impedance path for supply current to flow from VCC to ground, as illustrated in
Figure 3.11.
Figure 3.11 Equivalent schematic of a CMOS inverter whose
input is between logic levels
For fast input rise and fall times (shorter than 50 ns), the resulting power
consumption is frequency dependent. This is due to the fact that the more often a
device is switched, the more often the input is situated between logic levels,
causing both transistors to be partially turned on. Since this power consumption is
proportional to input frequency and specific to a given device in any application,
as is CL, it can be combined with CL. The resulting term is called “CPD”, the no-
load power dissipation capacitance.
Switching power is the dominant source of power drain on a CMOS chip.
Each time a transistor switches state from 1 to 0 and 0 to 1, it either pumps charge
into a capacitor or drains the capacitor to ground. In a full cycle, charge is taken
from the rail and pumped to ground. This produces a current, which, multiplied by
the rail voltage, equals power. If the CMOS transistor does not switch as often, less
charge moves and the power is lower. If the switching is controlled, the activity of
the transistors invariably reduces; this is an excellent way to limit power.
From all the above discussion, it is quite clear that significant work lies
ahead for reducing and managing power dissipation in CMOS devices. With the
market desire to shrink size and put systems on chips, advances in circuit design
and materials are clearly required for managing this situation, since neither speed
nor size is helping power reduction.
3.15 POWER REDUCTION TECHNIQUES IN VLSI DESIGN
There are many methods of achieving low power consumption. Systems
designers have developed several techniques to save power at the logic,
architecture, and systems levels.
3.15.1 Clock Gating
A significant amount of power in a chip is in the distribution network of the
clock. Clock buffers consume more than 50% of the dynamic power. This is
because these buffers have the highest toggle rate in the system. Also the flops
receiving the clock dissipate dynamic power even when the input and output
remain the same.
In Clock Gating method of power reduction, clocks are turned off when
they are not required. Modern design tools support automatic clock gating. These
tools identify circuits where clock gating can be inserted without changing the
functionality. Figure 3.12 shows the different clock gating scheme to reduce the
power consumption in VLSI chip design. This clock gating method also results in
an area saving, because a single clock gating cell takes the place of multiple
multiplexers. The two schemes shown are multiplexer based clock gating and the
latch based clocking approach.
Figure 3.12 Different Clock gating schemes
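A minimal Verilog sketch of the latch based clock gating approach of Figure 3.12
is given below. The signal names are illustrative; production flows normally infer
or instantiate the library's integrated clock gating (ICG) cell rather than
hand-coded logic.

module clock_gate_latch (
  input  wire clk,      // free-running clock
  input  wire enable,   // functional enable from the clock-gating logic
  input  wire test_en,  // scan/test override keeps the clock on during DFT
  output wire gclk      // gated clock driven to the idle block
);
  reg en_latched;
  // Latch the enable while the clock is low so the AND gate below cannot
  // produce glitches on the gated clock.
  always @(clk or enable or test_en)
    if (!clk)
      en_latched = enable | test_en;
  assign gclk = clk & en_latched;
endmodule

When enable is low, gclk stays low and the flops downstream stop toggling,
removing their share of the dynamic power.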
3.15.2 Asynchronous Logic
Proponents of asynchronous logic have pointed out that, because their
systems do not have a clock, they save the considerable power that a clock tree
requires. However, asynchronous logic design suffers from the drawback of
generating the completion signals. This requirement means that additional logic
must be used at each register transfer and, in some cases, a double-rail
implementation, which can increase the amount of logic and wiring. Other
drawbacks include testing difficulty and an
absence of design tools. Further, the asynchronous designer works at a
disadvantage because today’s design tools are geared for synchronous design.
Ultimately, asynchronous design does not offer sufficient advantages to merit a
wholesale switch from synchronous designs. However, asynchronous techniques
can play an important role in globally asynchronous, locally synchronous systems.
Such systems reduce clock power and help with the growing problem of clock
skew across a large chip, while allowing the use of conventional design techniques
for most of the chip.
3.15.3 Multi-Vdd Techniques
The different multi-voltage strategies for low power used in VLSI chip design are:
Static Voltage Scaling (SVS): different supply voltages for different blocks of the
system.
Multi-level Voltage Scaling (MVS): an extension of SVS; each block or subsystem
can be supplied with different voltages (discrete, fixed in number, dependent on
the mode).
Dynamic Voltage and Frequency Scaling (DVFS): an extension of MVS; a large
number of voltage levels are switched dynamically based on the workload.
Adaptive Voltage Scaling (AVS): an extension of DVFS; the voltage is adjusted
with the help of a control loop.
3.15.4 Architectural Level
Power reduction at the architectural level includes exploiting parallelism
and using speculation. Parallelism can reduce power, whereas speculation can
permit computations to proceed beyond dependent instructions that may not have
completed. If the speculation is wrong, executing useless instructions can waste
energy. But this is not necessarily so. Branch prediction is perhaps the best-known
example of speculation. New architectural ideas can contribute most profitably to
reducing the dynamic power consumption term, specifically the activity factor.
The activity factor accounts for the number of bit transitions per unit time.
The architect or systems designer can do little to limit leakage except shut
down the memory. This is only practical if the memory will remain unused for a
long time.
SUMMARY
In this chapter, the VLSI design flow for the chip design is discussed in
detail. This describes how the digital front end design with the RTL code is
performed. The simulation of the design with verification using linting and code
coverage is also discussed. The VLSI design flow describes how to generate the
gate level netlist from the HDL code by performing the synthesis of the design.
The physical design describes the floor planning, placement and routing of the
design on the silicon core. This chapter also describes why low power is required
in any chip design, the different sources of power consumption, and the different
methodologies to reduce power consumption in VLSI chip design.
CHAPTER 4
LOW POWER VLSI ARCHITECTURE FOR DISCRETE COSINE TRANSFORM
4.1 INTRODUCTION
In the previous chapter, the basic concepts of the VLSI design flow, the
different low power techniques in VLSI design, and the low power techniques
adopted in the present work were described at length.
In this chapter, the present researcher describes the proposed new
architectural design of the low power two dimensional Discrete Cosine Transform
core, and the implementation of the proposed architecture is presented in detail.
Compression reduces the volume of data to be transmitted (text, fax, and
images) and also reduces the bandwidth required for transmission and the storage
requirements of speech, audio, and video, which is much needed. In a digital
image, neighboring samples on a scanning line are normally similar; this is spatial
redundancy. Spatial frequency is the rate of change of magnitude as one traverses
the image matrix. Useful image content changes relatively slowly across the
image, so much of the information in an image is repeated; hence the spatial
redundancy. The human eye is less sensitive to the higher spatial frequency
components than to the lower frequency components.
4.2 DCT MODULE
In signal processing applications, the DCT is the most widely used
transform after the discrete Fourier transform. The DCT and IDCT are important
components in many picture compression and decompression standards, including
H.263, MPEG, HDTV and JPEG. The applications for these standards range from
still pictures on the Internet to low quality videophones to high definition
television. The DCT and IDCT also have applications in such wide ranging areas
as filtering, speech coding and pattern recognition [56].
The Discrete Cosine Transform is the most complex operation that needs to
be performed in the baseline JPEG process. This subsection starts with an
introduction to our chosen DCT architecture, followed by a detailed mathematical
explanation of the principles involved. Our implementation of the Discrete Cosine
Transform stage is based on a vector processing architecture. Our choice of this
particular architecture was due to a multiple reasons. The design uses a concurrent
architecture that incorporates distributed arithmetic [94] and a memory oriented
structure to achieve high speed, low power, high accuracy, and efficient hardware
realization of the 2-D DCT.
4.2.1 Mathematical Description of the DCT
The Discrete Cosine Transform is an orthogonal transform consisting of a
set of basis vectors that are sampled cosine functions. The 2-D DCT of a data
matrix is defined as
\( Y = C \cdot X \cdot C^{T} \)  ….. (4.1)
where X is the data matrix, C is the matrix of DCT coefficients, and \(C^{T}\) is
the transpose of C. The 2-D DCT (8 × 8 DCT) is implemented by the row-column
decomposition technique.
We first compute the 1-D DCT (8 × 1 DCT) of each column of the input
data matrix X to yield \(C^{T}X\). After appropriate rounding or truncation, the
transpose of the resulting matrix \(C^{T}X\) is stored in an intermediate memory.
We then compute another 1-D DCT (8 × 1 DCT) of each row of \(C^{T}X\) to yield
the desired 2-D DCT as defined in equation 4.1. The block diagram of the proposed
design is shown in Figure 4.1.
4.3 BLOCK DIAGRAM OF DCT CORE
The block diagram presented below describes the top level of the design.
The DCT core architecture is based on two 1-D DCT units connected through a
transpose matrix RAM. The transposition RAM is double buffered; that is, while
the 2nd DCT stage reads data out of transposition memory one, the 1st DCT stage
can populate the 2nd transposition memory with new data. This enables the
creation of a dual-stage global pipeline where every stage consists of a 1-D DCT
and a transposition memory. The 1-D DCT units are not internally pipelined; they
use parallel distributed arithmetic with butterfly computation to compute the DCT
values. Because of the parallel DA they need a considerable amount of ROM
memory to compute one DCT value in a single clock cycle. A design based on
distributed arithmetic does not use any multipliers for computing the MAC
(multiply and accumulate) operations; instead, it stores precomputed MAC results
in a ROM memory and fetches them as needed.
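The idea can be sketched in Verilog as follows (a minimal illustration with
placeholder coefficient sums, not the ROM contents of the present design): for a
four-term inner product, the 16 possible sums of the fixed coefficients are
precomputed, and one bit-plane of the four inputs addresses the ROM each clock.

module da_rom (
  input  wire [3:0]        bits,  // bit i of each of the four input samples
  output reg signed [15:0] psum   // precomputed b3*C3 + b2*C2 + b1*C1 + b0*C0
);
  // Placeholder coefficients: C0 = 23, C1 = 44, C2 = 60, C3 = 70.
  always @(*)
    case (bits)
      4'b0000: psum = 16'sd0;    4'b0001: psum = 16'sd23;
      4'b0010: psum = 16'sd44;   4'b0011: psum = 16'sd67;
      4'b0100: psum = 16'sd60;   4'b0101: psum = 16'sd83;
      4'b0110: psum = 16'sd104;  4'b0111: psum = 16'sd127;
      4'b1000: psum = 16'sd70;   4'b1001: psum = 16'sd93;
      4'b1010: psum = 16'sd114;  4'b1011: psum = 16'sd137;
      4'b1100: psum = 16'sd130;  4'b1101: psum = 16'sd153;
      4'b1110: psum = 16'sd174;  4'b1111: psum = 16'sd197;
    endcase
endmodule

The returned partial sum is shifted and accumulated over the bit-planes of the
inputs, so the whole multiply-accumulate needs only a ROM, a shifter, and an
adder; in a fully parallel form, one such ROM per bit-plane lets a complete value
be assembled in a single clock cycle.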
Figure 4.1 Block diagram of 2-D DCT Architecture
4.4 TWO DIMENSIONAL DCT CORE
This section describes the proposed architecture of the two dimensional
DCT. Figure 4.2, shown below, is the input/output schematic of the DCT core.
This core performs the forward 2-Dimensional Discrete Cosine Transform.
Figure 4.2 Top level schematic for DCT core
The DCT core is a two dimensional Discrete Cosine Transform
implementation designed for use in compression systems like JPEG. The
architecture is based on parallel distributed arithmetic with butterfly computation;
it is organized as a row/column decomposition where two 1-D DCT units are
connected through a transposition matrix memory. The core has a simple input
interface. The DCT core takes 8 bit input data and produces a 12 bit output using
12 bit DCT matrix coefficients. The 8 bit signed input pixels provide a 12 bit
signed output coefficient for the DCT: the data bits multiplied by 3 bits can result
in 11 bits, which with the sign bit gives a total of 12 bits.
Vector processing using parallel multipliers is one method of implementing
the DCT. The advantages of the vector processing method are a regular structure
[13, 93], simple control and interconnect, and a good balance between performance
and implementation complexity.
4.4.1 Behavioral Model for Vector Processing
The DCT is widely used in current international image and video coding
standards such as JPEG, the Motion Picture Experts Group standards, and the
H.261 and H.263 video-telephony coding schemes. As shown in Figure 4.1, the
8×8 2-D DCT can be implemented with two 1-D DCTs by the row-column
decomposition method. In the 8×1 row DCT, each column of the input data X is
computed and the outputs are stored in the transposition memory. Another 8×1
column DCT is performed to yield the desired 64 DCT coefficients [20]. The
output bit-width is one of the parameters that determine the quality of the image.
The outputs of the row DCT are truncated to N bits before they are stored in the
transposition memory and sent to the input of the column DCT.
The 8-point 1-D DCT transform is expressed as

\[
z(k) \;=\; \frac{c(k)}{2}\sum_{n=0}^{7} x(n)\cos\!\left(\frac{(2n+1)k\pi}{16}\right),
\qquad k = 0, 1, 2, \ldots, 7
\]
….. (4.2)

where \(c(k) = 1/\sqrt{2}\) for \(k = 0\) and \(c(k) = 1\) otherwise.
This equation can be represented in vector-matrix form as

\( z = T \cdot x^{T} \)  ….. (4.3)

where T is an 8 × 8 matrix whose elements are cosine functions, and x and z are
the row and column vectors, respectively. Since the 8 × 8 coefficient matrix T in
equation 4.3 is symmetrical, the 1-D DCT matrix can be rearranged and expressed
as
\[
\begin{bmatrix} z_0 \\ z_1 \\ z_2 \\ z_3 \\ z_4 \\ z_5 \\ z_6 \\ z_7 \end{bmatrix}
=
\begin{bmatrix}
d & d & d & d & d & d & d & d \\
a & c & e & g & -g & -e & -c & -a \\
b & f & -f & -b & -b & -f & f & b \\
c & -g & -a & -e & e & a & g & -c \\
d & -d & -d & d & d & -d & -d & d \\
e & -a & g & c & -c & -g & a & -e \\
f & -b & b & -f & -f & b & -b & f \\
g & -e & c & -a & a & -c & e & -g
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix}
\]
….. (4.4)

where \(c_k = \cos(k\pi/16)\) and \(a = c_1,\; b = c_2,\; c = c_3,\; d = c_4,\;
e = c_5,\; f = c_6,\; g = c_7\). Grouping the even- and odd-indexed outputs and
exploiting the (anti)symmetry of the rows gives

\[
\begin{bmatrix} z_0 \\ z_2 \\ z_4 \\ z_6 \end{bmatrix}
=
\begin{bmatrix}
d & d & d & d \\
b & f & -f & -b \\
d & -d & -d & d \\
f & -b & b & -f
\end{bmatrix}
\begin{bmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{bmatrix},
\qquad
\begin{bmatrix} z_1 \\ z_3 \\ z_5 \\ z_7 \end{bmatrix}
=
\begin{bmatrix}
a & c & e & g \\
c & -g & -a & -e \\
e & -a & g & c \\
g & -e & c & -a
\end{bmatrix}
\begin{bmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix}
\]
….. (4.5)
As equation 4.5 shows, the odd and even components of the DCT separate
cleanly, and the 8 × 8 DCT matrix multiplication can be expressed as additions of
vector scaling operations, which allows us to apply our proposed low complexity
design technique to the DCT implementation.
Figure 4.3 One Dimensional DCT Architecture
(The figure shows the input Xin[7:0] feeding a shift register; the registered pixels
xk0–xk7 are paired into four add/sub blocks selected by a toggle flip-flop,
multiplied by ROM-stored coefficients, and summed in a final adder to produce
Zk0–Zk7.)
The One Dimensional DCT implementation for the input pixels is shown in
Figure 4.3. A 1-D DCT is performed on the input pixels first. Its output is an
intermediate value that is stored in a RAM. The 2nd 1-D DCT operation is done
on this stored value to give the final 2-D DCT output dct_2d. The inputs are 8 bits
wide and the 2d-dct outputs are 9 bits wide. In the 1-Dimensional section the input
signals are taken one pixel at a time in the order x00 to x07, x10 to x17 and so on
up to x77. These inputs are fed into an 8 bit shift register. The outputs of the 8 bit
shift registers are registered by div8clk, which is the clock signal divided by 8.
This enables us to register in 8 pixels (one row) at a time. The pixels are paired up
in an adder/subtractor block in the order xk0:xk7, xk1:xk6, xk2:xk5, xk3:xk4. The
adder/subtractor is tied to the clock [20, 8, 49]; for every clock, the adder/subtractor
module alternately chooses addition and subtraction, the selection being made with
a toggle flip-flop. The output of the add/sub is fed into a multiplier whose other
input is connected to stored values in registers which act as memory. The outputs
of the 4 multipliers are added at every clock in the final adder. The outputs Zk(0-7)
of the adder are the 1-D DCT values, given out in the order in which the inputs
were read in.
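The add/sub stage with its toggle flip-flop can be sketched as follows (a minimal
illustration; the widths and names are assumptions, not the thesis RTL):

module addsub_stage (
  input  wire              clk, rst,
  input  wire signed [7:0] xa, xb,  // a pixel pair, e.g. xk0 and xk7
  output reg signed [8:0]  y        // alternates between xa+xb and xa-xb
);
  reg sel;                          // toggle flip-flop
  always @(posedge clk or posedge rst)
    if (rst) begin
      sel <= 1'b0;
      y   <= 9'sd0;
    end else begin
      sel <= ~sel;                  // alternate add / subtract each clock
      y   <= sel ? (xa - xb) : (xa + xb);
    end
endmodule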
4.4.2 Transpose Buffer
The outputs of the adder are stored in RAMs. When the WR signal is high,
a write operation takes place at the corresponding RAM address; otherwise, the
contents of the RAM address are read. The period of the address signals is 64
times that of the input clock. Two RAMs are used so that data writes can be
continuous. The 1st valid input for RAM1 is available at the 15th clock, so the
RAM1 enable is active after 15 clocks. After this the write operation continues for
64 clocks. At the 65th clock, since z_out is continuous, we get the next valid
z_out_00. This 2nd set of valid 1D-DCT coefficients is written into RAM2, which
is enabled at 15+64 clocks. So at the 65th clock, RAM1 goes into read mode for
the next 64 clocks while RAM2 is in write mode. After this, the read and write
roles switch between the 2 RAMs every 64 clocks.
After a RAM is enabled, data is written into it for 64 clock cycles, into
consecutive locations. After 64 locations have been written, RAM1 goes into read
mode and RAM2 goes into write mode, and the cycle then repeats. For either
RAM, data is written into consecutive locations but read out in a different order: if
data is assumed to be written one row at a time into an 8x8 matrix, it is read out
one column at a time. When RAM1 is full, the 2nd 1-D calculations can start.
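A minimal sketch of this ping-pong (double-buffered) transpose RAM is shown
below. The 12-bit width and the row/column address swap follow the 8x8 case
described in the text; the names and the single shared address counter are
assumptions.

module pingpong_transpose (
  input  wire        clk,
  input  wire        bank_sel,  // toggles every 64 clocks
  input  wire [5:0]  wr_addr,   // row-major write address {row, col}
  input  wire [11:0] wr_data,
  output wire [11:0] rd_data
);
  reg [11:0] ram0 [0:63];
  reg [11:0] ram1 [0:63];
  // Column-major read: swap the row and column fields of the counter.
  wire [5:0] rd_addr = {wr_addr[2:0], wr_addr[5:3]};
  always @(posedge clk)
    if (bank_sel) ram1[wr_addr] <= wr_data;  // write one bank...
    else          ram0[wr_addr] <= wr_data;
  assign rd_data = bank_sel ? ram0[rd_addr]  // ...while reading the other
                            : ram1[rd_addr];
endmodule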
4.4.3 Two Dimensional DCT architecture
The implementation of the 2-D DCT is shown in Figure 4.4. The second
1-Dimensional implementation is the same as the 1st 1-D implementation, with
the inputs now coming from either RAM1 or RAM2. Also, the inputs are read in
one column at a time, in the order z00 to z70, z01 to z71, up to z77. These inputs
are fed into an 8 bit shift register. The outputs of the 8 bit shift registers are
registered by div8clk, which is the clock signal divided by 8. This enables us to
register in 8 values (one column) at a time. The values are paired up in an
adder/subtractor block in the order Zk0:Zk7, Zk1:Zk6, Zk2:Zk5, Zk3:Zk4 [17].
The adder/subtractor is tied to the clock; for every clock, the adder/subtractor
module alternately chooses addition and subtraction, the selection being made
with the toggle flip-flop. The output of the add/sub is fed into a multiplier whose
other input is connected to stored values in registers which act as memory. The
outputs of the 4 multipliers are added at every clock in the final adder [96]. The
outputs from the adder in the 2nd section are the 2D-DCT coefficient values, given
out in the order in which the inputs were read in. When the 2-D DCT computation
is over, the rdy_out signal goes high, indicating valid output.
Figure 4.4 2-D DCT Architecture
4.5 INVERSE DISCRETE COSINE TRANSFORM
A 1-D IDCT, shown in Figure 4.5, is implemented on the input DCT values.
Its output, called the intermediate value, is stored in a RAM. The 2nd 1-D IDCT
operation is done on this stored value to give the final 2-D IDCT output idct_2d.
The inputs are 12 bits wide and the 2d-idct outputs are 8 bits wide. In the 1st 1-D
section, the input signals are taken one value at a time in the
order x00 to x07, x10 to x17 and so on up to x77. These inputs are fed into an 8 bit
shift register. The outputs of the 8 bit shift registers are registered at every 8th
clock [14, 93]. This enables us to register in 8 values (one row) at a time. The
values are fed into a multiplier whose other input is connected to stored values in
registers which act as memory. The outputs of the 8 multipliers are added at every
clock in the final adder. The output z_out of the adder is the 1-D IDCT values,
given out in the order in which the inputs were read in.
The 1-D IDCT values are first calculated and stored in a RAM. The second
1-D implementation is the same as the 1st, with the inputs now coming from the
RAM. Also, the inputs are read in one column at a time, in the order z00 to z70,
z01 to z71, up to z77. The outputs from the adder in the 2nd section are the
2D-IDCT coefficients. The 2-Dimensional IDCT block diagram is shown in
figure 4.6.
Figure 4.5 1-D IDCT Architecture
Figure 4.6 2-D IDCT Block diagram
4.5.1 Storage / RAM Section
The outputs z_out of the adder are stored in RAMs. Two RAMs are used so
that data writes can be continuous. The 1st valid input for RAM1 is available at
the 12th clock, so the RAM1 enable is active after 11 clocks. After this the write
operation continues for 64 clocks. At the 65th clock, since z_out is continuous, we
get the next valid z_out_00. This 2nd set of valid 1D-IDCT coefficients is written
into RAM2, which is enabled at 12+64 clocks. So at the 65th clock, RAM1 goes
into read mode for the next 64 clocks while RAM2 is in write mode. After this, the
read and write roles switch between the 2 RAMs every 64 clocks. The 2nd 1-D
IDCT section starts when RAM1 is full. The second 1-D implementation is the
same as the 1st, with the inputs now coming from either RAM1 or RAM2. Also,
the inputs are read in one column at a time, in the order z00 to z70, z01 to z71, up
to z77. The outputs from the adder in the 2nd section are the 2-D IDCT
coefficients.
4.5.2 IDCT Core Block Diagram
Figure 4.7 Top level schematic of IDCT Architecture
The schematics for the IDCT implementation are given in Figure 4.5, and
the block diagram in Figure 4.7. The 1D-IDCT values are first calculated and
stored in a RAM. The 2nd 1D-IDCT is done on the values stored in the RAM. For
each 1-D implementation, the input data are loaded into the first input of a
multiplier. The constant coefficient multiplication values are stored in a ROM and
fed into the 2nd input of the multiplier. At each clock, 8 input values are multiplied
with 8 constant coefficients. The outputs of the eight multipliers are then added
together to give the 1-D coefficient values, which are stored in the RAM. The
values stored in the intermediate RAM are read out one column at a time (i.e.,
every 8th value is read out every clock), and this is the input for the 2nd IDCT.
The rdy_in signal is used as a handshake signal between the DCT and the IDCT
functions [14]. When the signal is high, it indicates to the IDCT that there is valid
input to the core.
SUMMARY
This chapter describes the concepts involved in the ASIC implementation
of the 2-D DCT and IDCT cores for low power consumption. Because this
architecture uses row-column decomposition to find the 2-D DCT, the number of
computations for processing an 8x8 block of pixels is reduced. The 1-D DCT
operation is expressed as additions of vector-scalar products, and basic common
computations are identified and shared to reduce computational complexity
compared to other architectures. For low power, this architecture employs parallel
processing units that enable power savings through a reduction in clock speed. It
also saves power by disabling units that are not in use; when the units are in
standby mode, they consume a minimum amount of power. A common low power
design technique is to reorder the input data so that a minimum number of
transitions occur on the input data lines. The DCT architecture based on the
proposed low complexity vector-scalar design technique effectively removes
redundant computations, thus reducing the computational complexity of the DCT
operations. This DCT architecture shows lower power consumption, mainly due
to the efficient computation sharing.
CHAPTER 5
LOW POWER ARCHITECTURE FOR VARIABLE LENGTH ENCODING AND DECODING
5.1 INTRODUCTION
The previous chapter described the concepts involved in the design of the
low power DCT and IDCT architectures as implemented in the present work. In
this chapter, a novel VLSI algorithm is presented for low power Variable Length
Encoding and Decoding for image compression applications.
Several designs have been proposed for the variable length coding method
in the past decades, and almost all of them concentrated on a high performance
variable length coding process, i.e., their main concern was to make variable
length coding faster. These designs were less concerned about power consumption,
the only intention being high performance. But with the increasing demand for
portable multimedia systems like mobiles, iPods, etc., the demand for low power
multimedia systems is growing, making designers concentrate on low power
approaches along with high performance.
In this thesis, some low power approaches have been adopted to efficiently
reduce the power consumption of the variable length coding process. The proposed
architecture consists of three major blocks, namely, a zigzag scanning block, a run
length coding block, and a Huffman coding block [42]. Low power approaches
have been employed in each of these blocks individually to make the whole
variable length coding process consume less power.
The zigzag scanning requires that all 64 DCT coefficients are available
before starting, so we have to wait for all 64 DCT coefficients to arrive before
scanning begins. To overcome this problem, two separate RAMs are used in the
present design to make the process faster: one RAM stores the incoming DCT
coefficients while the other is used for zigzag scanning of the 64 DCT coefficients
stored earlier. The two RAMs alternate between these roles. When we use this
parallel approach there are two possibilities: we can either make the process faster
at the same operating frequency, or decrease the operating frequency while
maintaining the same speed as before. Reducing the operating frequency is one of
several approaches to achieving low power, since power dissipation is directly
proportional to operating frequency.
The zigzag scanning block scans the quantized DCT coefficients in such a
way that all the low frequency components are scanned first, followed by the high
frequency components. The high frequency components, being quantized to zero,
all accumulate at the end, so that effective compression can be done in the run
length coding block.
The DCT coefficients are quantized to approximate the high frequency
components to zero, which leaves many intermediate zeros between the non-zero
DCT coefficients. The proposed run-length encoder architecture exploits this
property of the DCT coefficients. Unlike the traditional run-length encoder, which
counts repeated character strings in the form {run_value, symbol}, we count the
number of zeros between the non-zero DCT coefficients. In this way we can
achieve more compression than the traditional method when the DCT coefficients
are approximated to zero by using a proper quantization matrix in the DCT
process. More compression means fewer bits to process in the following stages,
and fewer bits to process means less switching activity; minimizing switching
activity is one of the well known approaches to achieving low power.
The Huffman coding is done by making use of a lookup table. The lookup
table is formed by arranging the different run-length combinations in the order of
their probabilities of occurrence, with the corresponding variable length Huffman
codes. This approach to designing the Huffman coder not only simplifies the
design but also results in less power consumption: since we are using the lookup
table approach, only the part of the encoder corresponding to the current
run-length combination will be active, and the inactive parts of the encoder will
not consume any power. Turning off the inactive components of a circuit is thus
the low power approach adopted while designing the Huffman coder.
5.2 VARIABLE LENGTH ENCODING
Variable Length Encoding (VLE) is the final lossless stage of the image
compression unit. VLE is done to further compress the quantized image. VLE
consists of the following three stages:
1. Zigzag Scanning
2. Run- length Encoding
3. Huffman Encoding
Figure 5.1 Block diagram of Variable Length Encoder
The figure 5.1 shows the block diagram of the Variable Length Encoder
(VLE). In the following sections we will discuss the detailed working of each sub
blocks.
5.2.1 Zig Zag Scanning
Figure 5.2 Block diagram of Zigzag Scanner
Figure 5.2 shows the block diagram of the zigzag scanner. The
quantized DCT coefficients obtained after applying the Discrete Cosine
Transformation to 8×8 block of pixels are fed as input to the Variable Length
Encoder (VLE). These quantized DCT coefficients will have non-zero low
frequency components in the top left corner of the 8×8 block and higher frequency
components in the remaining places [71, 62]. The higher frequency components
are approximated to zero after quantization. The low frequency DCT coefficients
are more important than higher frequency DCT coefficients. Even if we ignore
some of the higher frequency coefficients, we can successfully reconstruct the
image from the low frequency coefficients [73]. The Zigzag Scanner block
exploits this property.
Figure 5.3 Zigzag Scanning order
(The figure shows the 8×8 block of coefficient positions 0–63, with horizontal
frequency increasing to the right and vertical frequency increasing downwards;
the scan traverses the positions diagonally in zigzag order.)
In zigzag scanning, the quantized DCT coefficients are read out in a zigzag
order, as shown in the figure 5.3. By arranging the coefficients in this manner,
RLE and Huffman coding can be done to further compress the data. The scan puts
the high-frequency components together. These components are usually zeroes.
The following figure 5.4 explains the zigzag scanning process with a typical
quantization matrix.
Input to the zigzag scanner (a typical quantized 8×8 DCT matrix, horizontal
frequency increasing to the right, vertical frequency increasing downwards):

31  0  1  0  0  0  0  0
 1  2  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  2  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0

Row-major order (positions 0, 1, 2, 3, …, 63):
31 0 1 0 0 0 0 0 1 2 0 0 … 0 0

Output from the zigzag scanner (scan positions 0, 1, 8, 16, 9, 2, 3, 10, 17, 24, 32,
25, …, 62, 63):
31 0 1 0 2 1 0 0 0 0 0 2 … 0 0

Figure. 5.4 Zigzag Scanning example
Since zigzag scanning requires that all 64 DCT coefficients are available
before scanning, we need to store the serially incoming DCT coefficients in a
temporary memory, and this procedure has to be repeated for every 8×8 block of
pixels. So at any given time either scanning is performed or the incoming DCT
coefficients are stored [71, 62], which slows down the scanning process. In order
to overcome this problem and to
make scanning faster, the present researcher proposes in this work a new
architecture for the zigzag scanner, shown in the following Figure 5.5.
Figure 5.5 Internal Architecture of the Zigzag Scanner
In the proposed architecture, two RAM memories are used in the zigzag
scanner: zigzag register 1 and zigzag register 2. One of the two RAM memories is
busy storing the serially incoming DCT coefficients while scanning is performed
from the other RAM memory. Two 2:1 multiplexers are used in this architecture:
one multiplexer (on the left side) switches the input alternately between the two
register sets, and the other connects the output alternately from one of the two
register sets. Either zigzag scanning or storing of incoming DCT coefficients is
done on a given zigzag register set, never both simultaneously. So while scanning
is performed on one register set, the other register set is used to store the incoming
DCT coefficients. Thus, except for the first 64 clock cycles, i.e., until the 64 DCT
coefficients of the first 8×8 block become available [28], zigzag scanning and
storing of the serially incoming DCT coefficients are performed simultaneously.
By using two RAM memories we are able to scan one DCT coefficient in each
clock cycle, except for the first 64 clock cycles. A counter is used to count the
clocks, counting up to 64. This counter generates a switch-memory signal to the
multiplexers every 64 clock cycles. The counter also generates the ready out signal
after the first 64 clock cycles, to indicate to the next block that the zigzag block is
ready to output values. This signal acts as the ready in signal to the next,
run-length, block, which starts operating upon receiving it.
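A minimal sketch of this control counter, assuming the signal names used above:

module zigzag_ctrl (
  input  wire clk, rst,
  output reg  switch_mem,  // selects which zigzag register set is written
  output reg  rdy_out      // high once the first 64 coefficients are stored
);
  reg [5:0] cnt;
  always @(posedge clk or posedge rst)
    if (rst) begin
      cnt <= 6'd0; switch_mem <= 1'b0; rdy_out <= 1'b0;
    end else begin
      cnt <= cnt + 6'd1;            // wraps 63 -> 0, i.e. every 64 clocks
      if (cnt == 6'd63) begin
        switch_mem <= ~switch_mem;  // ping-pong the two RAMs
        rdy_out    <= 1'b1;         // first full block is now available
      end
    end
endmodule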
5.2.2 Run Length Encoder
Figure 5.6 Block diagram of Run-Length Encoder
The quantized coefficients are read out in zigzag order, from the DC
component to the highest frequency component. RLE is used to code the string of
data from the zigzag scanner. A conventional run-length encoder codes the
coefficients in the quantized block into a run length (or number of occurrences)
and a level or amplitude. For example, consider transmitting four coefficients of
value 10: {10,10,10,10}. With RLE, the level is 10 and the run of the value 10 is
four, so {4,10} is transmitted, reducing the amount of data transmitted. Typically,
RLE encodes a run of symbols into two bytes, a count and a symbol. Without
RLE, 64 coefficients are needed to define an 8 x 8 block [69, 26]. Many of the
quantized coefficients in the 8 x 8 block are zero, and coding can be terminated
when there are no more non-zero coefficients in the zigzag sequence; the
"end-of-block" code terminates the coding.
But in a typical quantized DCT matrix, zeros are far more common than
repeated non-zero coefficients. So the proposed run-length encoder architecture
exploits this abundance of zeros in the DCT matrix. In the proposed architecture,
the number of intermediate zeros between non-zero DCT coefficients is counted,
unlike in the conventional run-length encoder, where the number of occurrences
of repeated symbols is counted. The following Figure 5.7 shows the internal
architecture of the proposed run-length encoder.
Figure. 5.7 The internal architecture of run-length encoder.
The following Table 5.1 illustrates the difference between the conventional
run-length encoding and the proposed method of run-length encoding.
Output from the zigzag scanner (scan positions 0, 1, 8, 16, 9, 2, 3, 10, 17, 24, 32,
25, …, 62, 63):
31 0 1 0 2 1 0 0 0 0 0 2 … 0 0

Table 5.1: Comparison between conventional and proposed RLE

Conventional RLE    Proposed RLE
31                  31
1,0                 1,1
1,1                 1,2
1,0                 0,1
1,2                 5,2
1,1                 EOB
5,0
1,1
EOB
The above example clearly shows that the proposed RLE design in the
present work yields better compression than the conventional RLE.
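The behaviour of the proposed zero-counting encoder can be sketched as follows
(a minimal illustration; the widths and names are assumptions, and end-of-block
generation is omitted for brevity):

module rle_zero_count (
  input  wire               clk, rst,
  input  wire               in_valid,
  input  wire signed [11:0] coeff_in,  // zigzag-ordered DCT coefficient
  output reg  [5:0]         run,       // zeros preceding the level
  output reg  signed [11:0] level,     // the non-zero coefficient
  output reg                out_valid
);
  reg [5:0] zero_cnt;
  always @(posedge clk or posedge rst)
    if (rst) begin
      zero_cnt <= 6'd0; run <= 6'd0; level <= 12'sd0; out_valid <= 1'b0;
    end else begin
      out_valid <= 1'b0;
      if (in_valid) begin
        if (coeff_in == 0)
          zero_cnt <= zero_cnt + 6'd1;  // lengthen the current run of zeros
        else begin
          run       <= zero_cnt;        // emit the {run, level} pair
          level     <= coeff_in;
          zero_cnt  <= 6'd0;
          out_valid <= 1'b1;
        end
      end
    end
endmodule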
5.2.3 Huffman Encoding
Figure 5.8 Block diagram of Huffman Encoding
Conventional Huffman coding is used to code values statistically according
to their probability of occurrence. Short code words are assigned to highly
probable values and long code words to less probable values.
The procedure for Huffman coding involves the pairing of run/value
combinations. The input run/value combinations are written out in order of
decreasing probability: the run/value combination with the highest probability is
written at the top [22, 89, 63], and the least probable is written last. The two least
probable entries are then paired and added, and a new probability list is formed
with the previously added pair as a single entry. The two least probable run/value
combinations in the new list are then paired [78, 42]. This process continues until
the list consists of only two probability values. The values "0" and "1" are
arbitrarily assigned to the two elements in each of the lists.
Figure 5.9 Internal architecture of Huffman encoder
In the architecture proposed in the present work, the Huffman encoding is
done using a lookup table approach. The lookup table is formed by arranging the
different run/value combinations in the order of their probabilities of occurrence,
together with the corresponding variable length Huffman codes. When the output
of the run-length encoder, in the form of a run/value combination, is fed to the
Huffman encoder [6], the received run/value combination is searched in the
lookup table; when it is found, its corresponding variable length Huffman code is
sent to the output.
This approach to designing the Huffman encoder not only simplifies the
design but also results in less power consumption. Since we are using the lookup
table approach, only the part of the encoder corresponding to the current run/value
combination is active, and the other parts of the encoder do not consume any
power. Turning off the inactive components of the circuit in this way is what gives
the Huffman encoder its low power consumption.
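A minimal Verilog sketch of such a lookup table is shown below; the field widths
and the code words are placeholders, not the VLC table of the present work.

module huffman_lut (
  input  wire [5:0]  run,
  input  wire [3:0]  level,    // level (or its size category), assumed 4 bits
  output reg  [15:0] code,     // variable length code word, left-aligned
  output reg  [4:0]  code_len  // number of valid bits in 'code'
);
  always @(*)
    case ({run, level})
      {6'd0, 4'd1}: begin code = 16'b0000000000000000; code_len = 5'd2;  end
      {6'd0, 4'd2}: begin code = 16'b0100000000000000; code_len = 5'd2;  end
      {6'd1, 4'd1}: begin code = 16'b1100000000000000; code_len = 5'd4;  end
      {6'd2, 4'd1}: begin code = 16'b1101100000000000; code_len = 5'd5;  end
      default:      begin code = 16'hFFFF;             code_len = 5'd16; end
    endcase
endmodule

Only the selected case branch toggles for a given input, which is the sense in
which the unused entries of the table consume no dynamic power.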
5.2.4 Interconnection of VLE blocks
Figure 5.10 Interconnection of zigzag scanning, run-length & Huffman Encoding blocks
The above Figure 5.10 shows the interconnection of the zigzag scanning,
run-length encoding, and Huffman encoding blocks in the variable length encoder
[85, 9]. The clock and reset signals are shared among all three blocks in the VLE.
The output of the zigzag scanner is connected to the run-length encoder, and a
ready out signal from the zigzag scanner is connected to the run-length encoder as
its ready in signal, to indicate to the RLE when to start the encoding process [72].
The output of the run-length encoder is given as input to the Huffman encoder
block, along with the ready out signal from the RLE to initiate the encoding
process. Together, the interconnected blocks act as a variable length encoder,
taking quantized DCT coefficients as input and processing them to give variable
length codes as output.
5.3 VARIABLE LENGTH DECODER
The variable length decoder is the first block on the decoder side. It
decodes the variable length encoder output to yield the quantized DCT
coefficients. The basic block diagram of the Variable Length Decoder is shown in
Figure 5.11. The variable length decoder consists of three major blocks, namely,
1. Huffman Decoding.
2. Run- length Decoding.
3. Zigzag Inverse Scanning.
Figure 5.11 Block diagram of variable length decoder
5.3.1 Huffman Decoder
Figure 5.12 Block diagram of Huffman decoder
The Huffman decoder forms the front end of the variable length decoder.
The block diagram of the Huffman decoder is shown in Figure 5.12. The internal
architecture of the Huffman decoder is the same as that of the Huffman encoder
[78]. The same VLC Huffman coding table used in the Huffman encoder is also
used in the Huffman decoder. The input encoded data is taken and a search is done
for the corresponding run/value combination in the VLC table. Once the
corresponding run/value combination is found, it is sent as output, and the
Huffman decoder starts decoding the next incoming input.
Using the same VLC Huffman coding table in both the Huffman encoder
and the Huffman decoder reduces the complexity of the Huffman decoder [39, 14].
It not only reduces the complexity but also reduces the dynamic power in the
Huffman decoder, since only part of the circuit is active at a time.
5.3.2 Block Diagram of FIFO
Figure 5.13 Block diagram of FIFO
The First-In First-Out (FIFO) buffer also forms part of the decoder and is
shown in Figure 5.13. The FIFO is used between the Huffman decoder and the
run-length decoder, to match the operating speeds of the two. The Huffman
decoder sends a decoded output to the run-length decoder in the form of a
run/value combination. The run-length decoder takes this as input and starts
decoding. Since the run in the run/value combination represents the number of
zeros between consecutive non-zero coefficients, a zero '0' is sent as output for the
next 'run' number of clock cycles. Until then, the run-length decoder cannot
accept another run/value combination, while the Huffman decoder decodes one
input into one run/value combination in every clock cycle. So the Huffman
decoder cannot be connected directly to the run-length decoder; otherwise the
run-length decoder cannot decode correctly. To match the speed between the
Huffman decoder and the run-length decoder, the FIFO is used. The output of the
Huffman decoder is stored in the FIFO, and the run-length decoder takes one
decoded Huffman output from the FIFO when it has finished decoding its present
input. After the run-length decoder finishes decoding the present input, it sends a
signal to the FIFO to feed it a new input; this signal is sent to the FOutN pin,
which is the read-out pin of the FIFO. The FInN pin is used to write to the FIFO;
the Huffman decoder generates this signal when it has to write a new input to the
FIFO. Thus, the FIFO acts as a synchronizing device between the Huffman
decoder and the run-length decoder.
5.3.3 Run Length Decoder
Figure 5.14 Block diagram of Run-Length Decoder
The run-length decoder forms the middle part of the variable length
decoder. Its block diagram is shown in Figure 5.14. It takes the decoded output
from the Huffman decoder through the FIFO. When the Huffman decoder decodes
one input and stores the decoded output in the FIFO, the FIFO becomes non-empty
(the condition when at least one element is stored in the FIFO) and generates the
signal F_EmptyN. This signal is used as the rdy_in signal to the run-length
decoder. So when the Huffman decoder decodes one input and stores it in the
FIFO, a ready signal is generated to the run-length decoder to initiate the decoding
process. The run-length decoder takes its input in the form of a run/value
combination [74] and separates the run and value parts. The run represents the
number of zeros to output before sending out the non-zero level 'value' of the
run/value combination. For example, if {5,2} is input to the run-length decoder,
it sends five zeros (i.e., 5 '0') before transmitting the non-zero level '2' to the
output. Once the run-length decoder sends out a non-zero level, it has finished
decoding the present run/value combination and is ready for the next. It therefore
generates the rdy_out signal to the FIFO, to indicate that it has finished decoding
the present input and is ready to decode the next run/value combination. This
rdy_out is connected to the FOutN pin of the FIFO, the read-out pin. Upon
receiving this signal, the FIFO sends a new run/value combination to the
run-length decoder, initiating the run-length decoding process for the new pair.
An example of the run-length decoding process is shown below.
104
Input to the run-length decoder:   31  1,1  1,2  0,1  5,2  EOB
Output sequence:                   31  0  1  0  2  1  0  0  0  0  0  2  ...  0  0
Block position (zigzag order):      0  1  8 16  9  2  3 10 17 24 32 25  ... 62 63
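The behaviour just described can be sketched as a small state machine: latch a {run, value} pair, emit 'run' zeroes, emit the value, then pulse rdy_out to fetch the next pair. The bit widths and control details below are assumptions, not the exact thesis RTL.

// A behavioral sketch of the run-length decoder just described: latch a
// {run, value} pair, emit 'run' zeroes, emit the value, then pulse rdy_out
// to fetch the next pair. Widths and control details are assumptions.
module rl_decoder (
    input  wire       clk,
    input  wire       rst,
    input  wire       rdy_in,        // FIFO not empty (F_EmptyN)
    input  wire [3:0] run_in,        // run part of the pair from the FIFO
    input  wire [7:0] value_in,      // value part of the pair from the FIFO
    output reg        rdy_out,       // pulse requesting the next pair (FOutN)
    output reg  [7:0] rle_out
);
    reg [3:0] zeros_left;
    reg       busy;

    always @(posedge clk) begin
        if (rst) begin
            busy    <= 1'b0;
            rdy_out <= 1'b0;
            rle_out <= 8'd0;
        end else if (!busy && rdy_in) begin   // accept a new {run, value} pair
            zeros_left <= run_in;
            busy       <= 1'b1;
            rdy_out    <= 1'b0;
        end else if (busy) begin
            if (zeros_left != 0) begin
                rle_out    <= 8'd0;           // one zero per clock cycle
                zeros_left <= zeros_left - 4'd1;
            end else begin
                rle_out <= value_in;          // then the non-zero level
                busy    <= 1'b0;
                rdy_out <= 1'b1;              // request the next pair
            end
        end else
            rdy_out <= 1'b0;
    end
endmodule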
5.3.4 ZigZag Inverse Scanner
Figure 5.15 Block diagram of Zigzag Inverse Scanner
Input to the zigzag inverse scanner (zigzag order):
31  0  1  0  2  1  0  0  0  0  0  2  ...  0  0   (block positions: 0 1 8 16 9 2 3 10 17 24 32 25 ... 62 63)
Output from the zigzag inverse scanner (raster order):
31  0  1  0  0  0  0  0  1  2  0  0  ...  0  0   (positions: 0 1 2 3 4 5 6 7 8 9 10 11 ... 62 63)
The zigzag inverse scanner, shown in figure 5.15, forms the last stage of
the variable length decoder. Its working and architecture are similar to those
of the zigzag scanner, except that the scanning order is different. The zigzag
inverse scanner gets its input from the run-length decoder and starts storing the coefficients in
one of the two RAMs, until it receives all 64 coefficients. Once it has received
all 64 coefficients, it starts inverse scanning to decode back the original DCT
coefficient order. Meanwhile the incoming DCT coefficients are stored in the
other RAM. Once scanning from one RAM is finished, scanning starts from the
other RAM, while the incoming DCT coefficients are stored in the first RAM. This
process is repeated until all the DCT coefficients are scanned. There is a delay
of 64 clock cycles before the output appears; after that, one element is scanned
on every clock cycle. The example above illustrates the working of the zigzag
inverse scanner.
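The ping-pong control behind this double-buffering reduces to a free-running 7-bit counter: the low six bits address the 64 coefficient locations and the seventh bit selects which RAM is being written, while the other bank is read out through a 64-entry zigzag-order ROM. The sketch below illustrates this idea; it is an illustration under those assumptions, not the thesis RTL, and the read side and ROM are omitted.

// Ping-pong write control sketch: a free-running counter whose low six bits
// address 64 coefficient locations and whose seventh bit selects the RAM
// being written. The read side (the other bank, addressed through a 64-entry
// zigzag-order ROM) is omitted.
module pingpong_ctrl (
    input  wire       clk,
    input  wire       rst,
    input  wire       in_valid,   // one coefficient accepted per clock
    output wire       bank_wr,    // which of the two RAMs is being written
    output wire [5:0] wr_addr     // 0..63 within the active bank
);
    reg [6:0] count;
    always @(posedge clk) begin
        if (rst)
            count <= 7'd0;
        else if (in_valid)
            count <= count + 7'd1;  // wraps every 128 samples, swapping banks
    end
    assign bank_wr = count[6];
    assign wr_addr = count[5:0];
endmodule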
SUMMARY
This chapter described the algorithm and architecture details of variable
length coding. Variable length coding, which maps input source data onto code
words of variable length, is an efficient method to minimize the average code
length. Compression is achieved by assigning short code words to input symbols
of high probability and long code words to those of low probability. Variable
length coding can be successfully used to relax the bit-rate requirements and
storage spaces of many multimedia compression systems. For example, a variable
length coder (VLC) employed in MPEG-2, along with the discrete cosine transform
(DCT), results in very good compression efficiency. To reconstruct the image
before dequantization, the Variable Length Decoder is designed with a low power
approach.
Since early studies focused only on high throughput variable length
coders, low-power variable length coders have not received much attention. This
trend is rapidly changing as the target of multimedia systems moves towards
portable applications like laptops, mobiles and iPods. These systems highly
demand low-power operation and thus require low power functional units.
CHAPTER 6
SIMULATION AND SYNTHESIS RESULTS OF DCT AND IDCT MODULES
6.1 INTRODUCTION
In the previous chapters, the development of algorithms and design of
architectures for various functional modules of Low power DCT and Variable
Length Coder for image compression were presented in detail. These functional
modules were transformation, DCT, IDCT, Zigzag Scanning, Run Length
Encoding, Huffman coding and different blocks of Variable Length Decoding.
This chapter describes the verification of the DCT and IDCT modules. These
modules are implemented using Matlab, simulated using Modelsim, and finally
synthesized using RTL Compiler.
The image data is divided into 8x8 blocks of pixels. The DCT is applied to
each 8x8 block of the image, converting the spatial image representation into a
frequency map: the low-order or "DC" term represents the average value in the
block, while successive higher-order ("AC") terms represent the strength of more
and more rapid changes across the width or height of the block. The highest AC
term represents the strength of a cosine wave alternating from maximum to
minimum at adjacent pixels.
The DCT calculation is fairly complex; in fact, it is the most costly step in
JPEG compression. The point of doing it is that we have now separated out the
high- and low-frequency information present in the image. We can discard high-
frequency data easily without losing low-frequency information. The DCT step
itself is lossless except for round-off errors. To discard an appropriate amount of
information, the compressor divides each DCT output value by a "quantization
coefficient" and rounds the result to an integer. The larger the quantization
coefficient, the more data is lost, because the actual DCT value is represented
less and less accurately. Each of the 64 positions of the DCT output block has its own
quantization coefficient, with the higher-order terms being quantized more heavily
than the low-order terms (that is, the higher-order terms have larger quantization
coefficients). Furthermore, separate quantization tables are employed for
luminance and chrominance data, with the chrominance data being quantized more
heavily than the luminance data. This allows JPEG to exploit further the eye's
differing sensitivity to luminance and chrominance.
The complete quantization tables actually used are recorded in the
compressed file so that the decompressor will know how to (approximately)
reconstruct the DCT coefficients.
Several previous proposals for DCT/IDCT architectures attempt to reduce
power dissipation. The architectures use some common power reduction
techniques and some power reduction techniques specific to DCT/IDCT
architectures. Because of the wide-ranging applications of the DCT and IDCT, the
input data pattern is widely varied for DCT/IDCT architectures in different
applications. Various data-dependent power reduction techniques have therefore
been proposed. Some architectures reduce calculations for visually irrelevant
DCT coefficients. Other architectures ignore redundant sign bits to save power in
the arithmetic units. To save power, some designs disable arithmetic units for the
many zero-valued operands in the IDCT. Many other implementations focus on
power reduction in the multiplier units.
The low power design in this thesis employs parallel processing units that
enable power savings from a reduction in clock speed. The architecture saves
power by disabling units that are not in use; when the units are in standby
mode, they consume a minimum amount of power. A common low power design
technique is to reorder the input data so that a minimum number of transitions
occur on the input data lines. The DCT architecture based on the proposed low
complexity vector scalar design technique effectively removes redundant
computations, thus reducing the computational complexity of DCT operations. This
DCT architecture shows lower power consumption, mainly due to the efficient
computation sharing and the advantage of using carry save adders as final
adders.
6.2 MATLAB IMPLEMENTATION OF DCT AND IDCT MODULES
The representation of the colors in the image is converted from RGB to
YCbCr, consisting of one luma component (Y), representing brightness, and two
chroma components, (Cb and Cr), representing color. The resolution of the chroma
data is reduced, usually by a factor of 2. This reflects the fact that the eye is less
sensitive to fine color details than to fine brightness details. The image is split into
blocks of 8×8 pixels, and for each block, each of the Y, Cb, and Cr data undergoes
a discrete cosine transform (DCT). A DCT is similar to a Fourier transform in the
sense that it produces a kind of spatial frequency spectrum. The amplitudes of the
frequency components are quantized. Human vision is much more sensitive to
small variations in color or brightness over large areas than to the strength of high-
frequency brightness variations. Therefore, the magnitudes of the high-frequency
components are stored with a lower accuracy than the low-frequency components.
If an excessively low quality setting is used, the high-frequency components are
discarded altogether.
6.2.1 DCT Methodology
• Each pixel value in the 2-D matrix is quantized using 8 bits, which produces a
value in the range of 0 to 255 for the intensity/luminance values and the range
of -128 to +127 for the chrominance values. All values are shifted to the range
of -128 to +127 before computing the DCT.
• All 64 values in the input matrix contribute to each entry in the transformed
matrix.
• The value in the location F [0,0] of the transformed matrix is called the DC
coefficient and is the average of all 64 values in the matrix.
• The other 63 values are called the AC coefficients and have a frequency
coefficient associated with them.
• The spatial frequency increases as we move from left to right (horizontally)
or from top to bottom (vertically). Low spatial frequencies are clustered in the
top left corner.
• The Discrete Cosine Transform (DCT) separates the frequencies contained in
an image.
• The original data can be reconstructed by the Inverse DCT.
Before we begin, it should be noted that the pixel values of a black-and-white
image range from 0 to 255 in steps of 1, where pure black is represented by 0, and
pure white by 255. Thus it can be seen how a photo, illustration, etc. can be
accurately represented by these 256 shades of gray. Since an image comprises
hundreds or even thousands of 8x8 blocks of pixels, the following description
shows what happens to one 8x8 block in the JPEG process; what is done to one
block of image pixels is done to all of them, in the order specified earlier.
Now, let's start with a block of image pixel values. This particular block was
chosen from the very upper-left-hand corner of an image.
Original image input matrix =
168 161 161 150 154 168 164 154
171 154 161 150 157 171 150 164
171 168 147 164 164 161 143 154
164 171 154 161 157 157 141 132
161 161 157 154 143 161 154 132
164 161 161 154 150 157 154 140
161 174 157 154 161 140 140 132
154 161 157 150 140 132 136 128
Because the DCT is designed to work on pixel values ranging from -128 to
127, the original block is “leveled off” by subtracting 128 from each entry. This
results in the following matrix M, which is now ready for the Discrete Cosine
Transform, accomplished by matrix multiplication.
D = T M T'                                                         ... (6.1)

M =
40 33 33 22 26 40 36 26
43 26 33 22 29 43 22 36
43 40 19 36 36 33 15 26
36 43 26 33 29 29 13  4
33 33 29 26 15 33 26  4
36 33 33 26 22 29 26 12
33 46 29 26 33 12 12  4
26 33 29 22 12  4  8  0
In equation 6.1, the image matrix M is first multiplied on the left by the
DCT matrix T, where T is the 8 x 8 matrix from equation 4.3 whose elements are
cosine functions; this transforms the rows. The columns are then transformed by
multiplying on the right by the transpose of the DCT matrix. Applying the
proposed algorithm and architecture to the image matrix in Matlab yields the
DCT matrix D.
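For reference, the cosine-valued matrix T described above takes the standard orthonormal form (assuming equation 4.3 follows the usual definition):

$$T_{ij} = \begin{cases} \dfrac{1}{2\sqrt{2}}, & i = 0 \\[4pt] \dfrac{1}{2}\cos\dfrac{(2j+1)\,i\,\pi}{16}, & 1 \le i \le 7 \end{cases} \qquad (0 \le j \le 7)$$

so that equation 6.1 transforms the rows via T and the columns via its transpose T'.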
This block matrix now consists of 64 DCT coefficients, c (i, j), where i and
j range from 0 to 7. The top-left coefficient, c (0, 0) correlates to the low
frequencies of the original image block. As we move away from c(0,0) in all
directions, the DCT coefficients correlate to higher and higher frequencies of
the image block, where c(7,7) corresponds to the highest frequency. Higher
frequencies are mainly represented by coefficients of smaller magnitude, and
lower frequencies by coefficients of larger magnitude [56, 58, 59]. It is
important to note that the human eye is most sensitive to the lower frequencies.
6.2.2 Quantization
The 8x8 block of DCT coefficients is now ready for compression by
quantization. A remarkable and highly useful feature of the JPEG process is that in
this step, varying levels of image compression and quality are obtainable through
selection of specific quantization matrices. This enables the user to decide on
quality levels ranging from 1 to 100, where 1 gives the poorest image quality and
highest compression, while 100 gives the best quality and lowest compression. As
a result, the quality/compression ratio can be tailored to suit different needs.
Subjective experiments involving the human visual system have resulted in the
JPEG standard quantization matrix. With a quality level of 50, this matrix
renders both high compression and excellent decompressed image quality.
Q50 =
16  11  10  16  24  40  51  61
12  12  14  19  26  58  60  55
14  13  16  24  40  57  69  56
14  17  22  29  51  87  80  62
18  22  37  56  68 109 103  77
24  35  55  64  81 104 113  92
49  64  78  87 103 121 120 101
72  92  95  98 112 100 103  99
If however, another level of quality and compression is desired, scalar
multiples of the JPEG standard quantization matrix may be used. For a quality
level greater than 50 (less compression, higher image quality), the standard
quantization matrix is multiplied by (100-quality level)/50. For a quality level less
than 50 (more compression, lower image quality), the standard quantization matrix
is multiplied by 50/quality level. The scaled quantization matrix is then rounded
and clipped to have positive integer values ranging from 1 to 255. For example,
scaling Q50 in this way yields the quantization matrices for quality levels 10
and 90.
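Written out from the scaling rules above (including the rounding and clipping just described), the matrix for a quality level q is

$$Q_{q} = \operatorname{clip}\!\big(\operatorname{round}(s_{q}\,Q_{50}),\,1,\,255\big), \qquad s_{q} = \begin{cases} 50/q, & q < 50 \\ (100-q)/50, & q > 50 \end{cases}$$

For instance, the (0,0) entry of Q50 is 16, which scales to 16 x 5 = 80 at quality 10 and to round(16 x 0.2) = 3 at quality 90.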
Quantization is achieved by dividing each element in the transformed image
matrix D by the corresponding element in the quantization matrix, and then
rounding to the nearest integer value as indicated by the equation 6.2. For the
quantization process the quantization matrix Q50 is used.
C(i,j) = round( D(i,j) / Q(i,j) )                                  ... (6.2)
Recall that the coefficients situated near the upper-left corner correspond to
the lower frequencies of the image block, to which the human eye is most
sensitive. In addition, the zeros represent the less important, higher frequencies that
have been discarded, giving rise to the lossy part of compression. As mentioned
earlier, only the remaining nonzero coefficients will be used to reconstruct the
image.
The IDCT is next applied to the dequantized matrix, and the result is
rounded to the nearest integer. Finally, 128 is added to each element of that
result, giving us the decompressed JPEG version N of our original 8x8 image
block M. Comparing the output of the IDCT with the input image matrix gives a
remarkable result, considering that nearly 70% of the DCT coefficients were
discarded prior to image block decompression/reconstruction. Given that similar
results occur with the rest of the blocks that constitute the entire image, it
is no surprise that the JPEG image is scarcely distinguishable from the
original. Remember, there are 256 possible shades of gray in a black-and-white
picture, and a difference of, say, 10 is barely noticeable to the human eye
[15]. The DCT takes advantage of redundancies in the data by grouping pixels
with similar frequencies together. Moreover, when the resolution of the image is
very high, even after substantial compression and decompression there is very
little change between the original and decompressed images. Thus, we can also
conclude that, at the same compression ratio, the difference between the
original and decompressed images decreases as the image resolution increases.
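In the notation of equations 6.1 and 6.2, the reconstruction just described amounts to dequantizing the coefficient block C and inverting the transform (T is orthogonal, so its transpose undoes it):

$$R_{i,j} = Q_{i,j}\,C_{i,j}, \qquad N = \operatorname{round}\!\big(T'\,R\,T\big) + 128$$

where N is the decompressed 8x8 block approximating the original M.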
6.3 MATLAB RESULTS
The procedure followed for image compression using Matlab is given in the
form of a flow chart, shown in Figure 6.1.
Figure 6.1 Matlab Design flow
To find the DCT of the given image, the image is read in the form of frames;
each frame consists of 8x8 pixels, represented as an 8x8 image matrix. The
coefficients of the image matrix vary from 0 to 255 and are shifted into the
range -128 to +127 by subtracting 128 from each coefficient. The proposed DCT
algorithm is applied to individual rows and columns to find the DCT.
Quantization is performed using the Q50 quantization matrix to achieve the
compression, and the compressed image is then displayed. The Matlab simulation
results are compared with the Verilog HDL simulation results. Results obtained after
performing the DCT on the original images are shown in Figure 6.2. Figure 6.3
shows the original image, the image obtained after applying the DCT, and the
reconstructed image obtained by applying the IDCT.
Figure 6.2 Matlab Simulation Results for 8x8 image
The above results show that, with the proposed low power DCT and IDCT
algorithm and architecture of the present work, it is possible to reconstruct
the original image without losing the original image quality.
Figure 6.3 Matlab Simulation Results for full image
Figure 6.3 shows how the complete image is reconstructed with the proposed
DCT and IDCT algorithms using Matlab. Figures 6.4 and 6.5 show one more typical
image before and after applying the proposed DCT and IDCT algorithms.
Figure 6.4 Input image in color and Gray Scale
Figure 6.5 Reconstructed Image after proposed DCT and IDCT
6.4 VLSI DESIGN OF THE PROPOSED ARCHITECTURE
The main objective of the researcher is to achieve low power with the DCT
and IDCT core designs. The proposed DCT and IDCT architectural blocks of the
design were written as RTL code using Verilog HDL. For functional verification
of the design, test benches were written and simulation was done using the
Cadence IUS simulator and the Modelsim simulator. Behavioral simulation was also
done using Matlab from MathWorks. Matlab was used to generate the input image
matrix for the input of the DCT core; it reads both input and output, calculates
the DCT/IDCT coefficients of the input file, and compares them to the output of
the core.
After behavioral verification, the RTL Compiler EDA tool was used to
synthesize the Verilog code. IC Compiler was used to generate the layout and to
perform the place and route of the design for both the DCT and IDCT chips. After
synthesis, parameters like area and power were analysed by mapping the design
onto the 65nm standard cell technology library. Power optimization was done by
applying low power techniques while mapping onto the 65nm standard cells. The
results show that with the proposed design it is possible to achieve low power
with good performance. The timing results showed that the design can operate up
to a maximum speed of 100 MHz.
6.5 SIMULATION RESULTS USING VERILOG
Simulation of the DCT core was done using the Cadence Incisive Unified
Simulator (IUS) and the Modelsim simulator. The DCT core was simulated by
applying the 8x8 image matrices taken from Matlab. A Verilog test bench was
written to simulate the DCT core. When the DCT computation is over, the rdy_out
signal goes high. After 92 clock cycles, the 2-D DCT output for all 64 values is
continuous. Table 6.1 shows the signal description of the DCT core.
Table 6.1 Signal Description of DCT core
Signal Name     I/O      Function
clk             Input    Clock
rst             Input    Reset
Xin[7:0]        Input    8-bit input
rdy_out         Output   Ready signal to indicate DCT output
dct_2d[11:0]    Output   12-bit 2-D DCT output
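A test bench along the lines described might look like the following sketch. The module name dct_core and the data file block.hex are placeholder assumptions; only the port list follows Table 6.1.

// Illustrative Verilog test bench for the DCT core interface of Table 6.1.
// The module name dct_core and the file block.hex are placeholder
// assumptions; pixel data would come from the Matlab-generated image matrix.
`timescale 1ns / 1ps
module tb_dct;
    reg         clk = 1'b0;
    reg         rst = 1'b1;
    reg  [7:0]  Xin;
    wire        rdy_out;
    wire [11:0] dct_2d;

    dct_core dut (.clk(clk), .rst(rst), .Xin(Xin),
                  .rdy_out(rdy_out), .dct_2d(dct_2d));

    always #5 clk = ~clk;              // 100 MHz clock (10 ns period)

    integer i;
    reg [7:0] block [0:63];            // one 8x8 block, row by row
    initial begin
        $readmemh("block.hex", block); // pixel data exported from Matlab
        repeat (2) @(posedge clk);
        rst = 1'b0;
        for (i = 0; i < 64; i = i + 1)
            @(posedge clk) Xin <= block[i];
        wait (rdy_out);                // goes high when the 2-D DCT is ready
        repeat (64) @(posedge clk)
            $display("dct_2d = %0d", $signed(dct_2d));
        $finish;
    end
endmodule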
6.6 COMPARISON OF MATLAB AND HDL SIMULATION RESULTS
An 8x8 image matrix is given to Matlab, and the DCT block performs the 2-D
DCT. The 2-D DCT output obtained from Matlab is given in figure 6.6. The same
image input is given to the test bench of the Verilog code to obtain the 2-D
DCT; the output from the Verilog HDL simulation of the DCT core is shown in
figure 6.7. The resultant DCT matrices obtained in the two cases are compared.
Only a few coefficients differ, which does not affect the output image much. The
IDCT core outputs are also compared with the Matlab results and the HDL
simulation results, and they are comparable; hence, with the proposed IDCT
architecture it is possible to reconstruct the image without much degradation.
6.6.1 2-D DCT Simulation Results Using Matlab and Verilog
Figure 6.6 2-D DCT Matlab results
Figure 6.6 shows the Matlab results for the 2-D DCT of an 8x8 image matrix
computed using the proposed low power architecture. Figure 6.7 shows the
Modelsim simulation result for the 2-D DCT with the same architecture.
Figure 6.7 DCT HDL simulation results
The Matlab output for the 8x8 image matrix is compared with the Verilog HDL
simulation results. The 64 values of the 8x8 image matrix are forced as inputs
to the DCT core using the test bench, and the 64 DCT coefficients are observed
at the output. The results obtained from the Matlab and HDL simulations are the
same; hence the proposed low power hardware implementation of the DCT core works
correctly.
6.6.2 IDCT Simulation Results
The IDCT core was simulated with its input taken from the dct_2d signal of
the DCT core. The signal description of the IDCT core is shown in Table 6.2. The
output obtained from the IDCT is given to Matlab to reconstruct the image. A
Verilog test bench was written to simulate the IDCT core. When the IDCT
computation starts, the rdy_in signal is made high. After an initial latency of
84 clock cycles, the 2-D IDCT output for all 64 values is continuous. The
simulation steps were followed using the Cadence IUS simulator and the Modelsim
simulator.
Table 6.2 Signal Description of IDCT core
Signal Name     I/O      Function
clk             Input    Clock
rst             Input    Reset
dct_2d[11:0]    Input    12-bit input
rdy_in          Input    Ready signal to indicate dct_2d as input
idct_2d[7:0]    Output   8-bit 2-D IDCT output
Figure 6.8 2-D IDCT HDL simulation results
Figure 6.9 2-D IDCT MATLAB results
The IDCT simulation steps were carried out exactly like those for the DCT
core. Matlab was used for generating the test vectors and analysing the output.
The core was used to obtain the IDCT coefficients, which were compared with the
Matlab results. The IDCT was processed in 8x8 blocks, and the whole image was
handled using block processing in Matlab. Figures 6.8 and 6.9 show the HDL
simulation and Matlab simulation of the IDCT, respectively.
6.7 SYNTHESIS RESULTS OF DCT AND IDCT
The analysis of the features of both the DCT and IDCT cores was done using
RTL Compiler from Cadence. The design was synthesized and the core was mapped
onto tcbtcbn65lphvtwcl_ccs standard cells. The features like area, power and
number of cells required for the design are tabulated in table 6.3. The layout
and the place and route of each design were done using IC Compiler.
Table 6.3 Power and Area Characteristics of DCT and IDCT using 65nm standard cells

Features           DCT                 IDCT
Internal Power     0.2708 mW           0.3376 mW
Switching Power    0.1641 mW           0.2142 mW
Leakage Power      103.5371 nW         101.7760 nW
Total Power        0.4350 mW           0.5519 mW
Area               34983.359724 µm2    34903.799800 µm2
Block size         8x8                 8x8
No. of Cells       6494                7171

Figure 6.10 Layout of 2D-DCT
Figure 6.11 Zoomed Version of 2D-DCT Layout
Figure 6.12 Layout of 2D-IDCT
Figure 6.13 Zoomed Version of 2D-IDCT Layout
The layouts of both the DCT and IDCT cores show that the design can be
comfortably placed within the physical area using the place and route tool
without any congestion. The zoomed versions show how the design is placed using
the standard cell approach.
6.8 COMPRESSION USING DISCRETE COSINE TRANSFORM AND
QUANTIZATION
The starting point for the proposed research was an FPGA implementation
of image compression using the DCT. This section gives the details of that
implementation. The FPGA implementation of image compression using Discrete
Cosine Transformation and Quantization was done with a different architecture;
this design was implemented on the FPGA device 2vp30ff896-7 and simulated using
the Modelsim simulator. The different stages of the image compression process
using DCTQ are shown in the following figures. Figure 6.14 shows the original
image, which is to be compressed using DCTQ and reconstructed using the IDCT.
Figure 6.14 Original image for compression
The image generated in Matlab, along with the pixel values and the image
matrix, is given in figure 6.15.
Figure 6.15 Original image with the pixel values
In the matrix representation of the gray image of figure 6.15, the maximum
value shows the pure white color and the minimum value shows the pure black
color. The matrix C is the DCT matrix obtained from the design using Matlab. As
can be noticed from the DCT matrix, the values lie between -1 and +1. But to
implement the design on the FPGA device, each of the DCT coefficients is
multiplied by 128 and level shifted by 128.

Figure 6.16 DCT coefficients before quantization

The obtained DCT coefficients are divided by the JPEG standard quantization
matrix for image compression, shown as quant_mat. Figure 6.17 shows the image
after compression.

Figure 6.17 Image after compression

6.9 RECONSTRUCTION OF IMAGE USING IDCT

On the receiver side the image is then denormalized by multiplying with the
quantization matrix, and the de-transformed image is obtained, given in figure
6.18.
Figure 6.18 Reconstructed Image using IDCT
Figure 6.19 shows the comparison of the original and final images using the
DCT and IDCT respectively.
Figure 6.19 Comparison of original and reconstructed image
6.10 SIMULATION RESULT OF IMAGE AFTER COMPRESSION
Figure 6.20 illustrates the process of compression by quantization. It also
shows the general agreement between the results generated in Matlab and the
simulated results in Modelsim. The compressed image with its pixel values is
also shown.
Figure 6.20 Simulation results after compression
6.11 FPGA IMPLEMENTATION OF THE DCTQ
The various modules of DCTQ described in the previous sections were coded
in Verilog, simulated using Modelsim, and synthesized using Xilinx ISE 9.2i. The
target device chosen was the Xilinx Virtex-II Pro XC2VP30-7 FPGA with package
FF896 [2, 3]. The Verilog code developed for this design is fully RTL compliant
and technology independent. As a result it can work on any FPGA or
ASIC without needing to change any code.
The synthesis results and the device utilization summary of DCTQ for the
targeted device are presented below.
6.11.1 Device Utilization Summary
Number of slices            : 390 out of 13696     2%
Number of slice Flip Flops  : 560 out of 27392     2%
Number of 4 input LUTs      : 225 out of 27392     0%
Number of IOs               : 387
Number of bonded IOBs       : 387 out of 556      69%
Number of MULT18X18s        : 16 out of 136       11%
Number of GCLKs             : 1 out of 16          6%
6.11.2 HDL Synthesis Results
The synthesis results of the main module are as follows:
=======================================
Macro Statistics
# Multipliers : 16
16x9-bit multiplier : 8
9x9-bit multiplier : 8
# Adders/Subtractors : 30
16-bit adder : 7
26-bit adder : 7
9-bit subtractor : 16
# Registers : 48
16-bit register : 16
26-bit register : 8
9-bit register : 24
SUMMARY
The simulation and synthesis of the Discrete Cosine Transform and Inverse
Discrete Cosine Transforms were described in the present chapter. The proposed
architecture of the DCT and IDCT were first implemented in Matlab in order to
estimate the quality of the reconstructed image and the compression that can be
achieved. In addition Matlab output serves as a reference for verifying the Verilog
output. The core Modules of the DCT and IDCT were realized using Verilog for
ASIC implementation. The quality of the reconstructed pictures using Matlab and
Verilog were compared. The synthesis of the proposed architecture was done using
the ASIC synthesis tool. The synthesis results show that the design achieves low
power in both the DCT and IDCT modules. The design was implemented using 65nm
standard cells with a low power approach. The physical design and place and
route results for the DCT and IDCT were also presented.
CHAPTER 7
SIMULATION AND SYNTHESIS RESULTS OF VARIABLE LENGTH ENCODING AND DECODING MODULE
7.1 INTRODUCTION
In the previous chapter, the simulation and synthesis of the DCT and IDCT
modules for image compression and decompression were presented in detail. This
chapter describes the verification of the different modules used in Low Power
Variable Length Encoding and Decoding. Each of the blocks involved in Variable
Length Encoding and Decoding is written in synthesizable Verilog code. The
simulation of the design was done using the Modelsim simulator tool. The
synthesis of each block of the design was done using Cadence RTL Compiler, and
place and route using the Synopsys IC Compiler EDA tool. Finally the design was
mapped onto 90nm standard cells. The synthesis report shows the reduction in the
power consumption of the design. The following sections discuss the simulation
and synthesis of each individual block used in the Encoder and Decoder designs.
7.2 ZIGZAG SCANNING
The methodology involved in the working of the zigzag scanning is explained
in chapter 5. The simulation of zigzag scanning is done using the following test
sequence:

31  0  1  0  0  0  0  0  1  2  0  0  ...  0  0   (positions 0 1 2 3 4 5 6 7 8 9 10 11 ... 62 63)
Figure 7.1 shows the simulation waveform of the zigzag scanning block for
the above test sequence. The main purpose of the zigzag scanning is to scan the
low frequency components before the high frequency components [28]. The
simulation results show that the low frequency components are scanned before the
high frequency components, so that even if the high frequency components are
later neglected it is still possible to reconstruct the image.
Figure 7.1 Waveform obtained after simulating the Zigzag Scanning block
7.3 RUN LENGTH ENCODING
The simulation of the run-length encoding block is done using the output
sequence obtained in the zigzag scanning process, which appears as the input of
the run-length encoder; the pattern is as shown below.
Figure 7.2 shows the simulation waveform of the proposed Run-Length
Encoding block for the above scanned input sequence. In a typical quantized DCT
matrix, the number of zeros is large compared to the number of repeated non-zero
coefficients. So in the present research the run length encoder [68, 25]
exploits this property and counts the zeros between non-zero DCT coefficients,
unlike the conventional run length encoder, where the number of occurrences of
repeated symbols is counted. The results show that 31 appears first, then one
zero before 1, then one more zero before 2, and the process repeats. The
proposed RLE architecture requires fewer digits to represent the same data,
which yields better compression than the conventional RLE.
Figure 7.2 Waveform of Run Length Encoding block
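The zero-run counting idea can be sketched in a few lines of Verilog: a counter lengthens the current run on every zero coefficient, and a {run, value} pair is emitted at every non-zero coefficient. The bit widths and the omitted EOB handling are simplifying assumptions, not the exact thesis RTL.

// A sketch of the zero-run counting idea described above: a counter grows on
// every zero coefficient, and a {run, value} pair is emitted at each non-zero
// coefficient. Bit widths are assumptions and EOB handling is omitted.
module rl_encoder (
    input  wire               clk,
    input  wire               rst,
    input  wire               in_valid,
    input  wire signed [11:0] coeff_in,   // zigzag-ordered quantized coefficient
    output reg                pair_valid,
    output reg         [5:0]  run_out,
    output reg  signed [11:0] value_out
);
    reg [5:0] zero_count;
    always @(posedge clk) begin
        if (rst) begin
            zero_count <= 6'd0;
            pair_valid <= 1'b0;
        end else if (in_valid) begin
            if (coeff_in == 0) begin
                zero_count <= zero_count + 6'd1;  // lengthen the current run
                pair_valid <= 1'b0;
            end else begin
                run_out    <= zero_count;         // zeros since the last level
                value_out  <= coeff_in;
                zero_count <= 6'd0;
                pair_valid <= 1'b1;               // emit the {run, value} pair
            end
        end else
            pair_valid <= 1'b0;
    end
endmodule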
7.4 HUFFMAN ENCODING
The output of the run-length encoder is used as the test sequence for the
Huffman encoder. The output of the run-length encoder appears as shown below.

31  1,1  1,2  0,1  5,2  EOB

Figure 7.3 shows the simulation waveform of the Huffman encoding block for
the above test input sequence. In the proposed look-up table approach for the
Huffman encoding block, the received run/value combination is searched in the
look-up table; when the run/value combination is found, its corresponding
Huffman code is generated [22, 29, 63]. The waveform shows that 1,1 is coded as
0003, 1,2 is coded as 0004, and so on.

Figure 7.3 Waveform of Huffman Encoder block
The simulation results of the Variable Length Encoder block show the
further compression of the image after the lossy compression using the DCT;
this additional compression achieved by the Variable Length Encoder is lossless.
After synthesis of the VLC design, the physical implementation of the Variable
Length Encoder was done using the Synopsys IC Compiler EDA tool. Figures 7.4 and
7.5 show the layout diagrams of the VLC with good placement and routing. The
layout of the Variable Length Encoder shows that the proposed design is easily
implementable with the available standard cells.
Figure 7.4 Layout of Variable Length Coding
Figure 7.5 Zoomed Version Layout of VLC
7.5 HUFFMAN DECODING
The compressed data from the variable length encoder is fed to the Huffman
decoding block as input. The output obtained from the Huffman encoding block is
used as the test input to the Huffman decoding block [84, 88]. Figure 7.6 shows
the simulation waveform for the Huffman decoding block. The waveform shows that
0003 is decoded as 1,1, 0004 is decoded as 1,2, and so on.
Figure 7.6 Waveform of Huffman Decoder block
7.6 RUN LENGTH DECODING
Compressed data sent by the encoder has to go through decompression at the
receiver. The first stage of decompression takes place in the Huffman decoder,
and the run length decoder performs the second stage. The run length decoder is
verified using the output of the Huffman decoder, which it decodes [74]. Figure
7.7 shows the simulation waveform of the run-length decoder. The decoded output
is the same as the output of the zigzag scanner.
Figure 7.7 Waveform of run-length decoder block
7.7 ZIGZAG INVERSE SCANNING
The output of the run-length decoder is given as input to the zigzag inverse
scanner, which outputs the quantized DCT coefficients. The original quantized
coefficients and the inverse scanner outputs match. Hence it is possible to
obtain the original pixel coefficients by performing the IDCT, and
reconstruction of the image can then be achieved easily.
Figure 7.8 Waveform of zigzag inverse scanner block
Figure 7.9 Layout of Variable Length Decoding
With the proposed architecture it is possible to produce a good physical
design, with place and route completing without any congestion. The physical
design of the Variable Length Decoder is shown in figure 7.9. This result shows
that the proposed architecture can be easily implemented with the available 90nm
standard cell libraries.
7.8 POWER AND AREA REPORTS
The VLC and VLD blocks have been designed using 90nm standard cells with a
0.7 V global operating voltage. The power and area requirements of all the
different blocks used in the VLC and VLD designs have been obtained from the
tool. They are tabulated in table 7.1 and also shown in the form of a bar chart
in figure 7.10.
Table 7.1 Power and Area Parameters of VLC and VLD blocks

Design                   Power (mW)   Area (µm2)   No. of cells
Zigzag Scanner           0.7319       33496.95     3063
Run Length Encoder       0.0208        1039.35      127
Huffman Encoder          0.0451        1718.14      347
Zigzag Inverse Scanner   0.7285       33597.85     3094
FIFO                     0.2744       12877.20     1184
Run Length Decoder       0.1110         715.48      106
Huffman Decoder          0.0451        1718.14      347
Figure 7.10 Bar Chart of Power and Area parameters
7.9 PERFORMANCE COMPARISON
In the synthesis process, the RTL code of the proposed architecture is
targeted to a specific standard cell library, and parameters like the area and
power consumption of the design are analysed. The power of the proposed
architecture has been compared with some existing systems, and the comparison
results are tabulated in the following sections.
7.9.1 Power Comparison of Huffman Decoder
Table 7.2 compares the results of the Huffman decoder of [40] with the
proposed architecture; the comparison is also represented as a graph in figure
7.11.
Table 7.2 Power Comparison for Huffman decoders
Table size: 100

Huffman decoder type                                                  Power (µW)
Power Analysis of the Huffman Decoding Tree, by Jason McNeely [40]    95
Proposed                                                              45
Figure 7.11 Bar chart showing the power comparison of Huffman decoders
7.9.2 Power Comparison of Run Length and Huffman Encoder
The power consumed by the proposed design is compared with [59]. The
compared results for Run Length Encoding and Huffman coding are shown in Table
7.3 and also in the form of a bar chart in figure 7.12.
Table 7.3 Power Comparison for Run Length and Huffman Encoders
RL-Huffman Encoding Type                                              Power (µW)
RL-Huffman Encoding for Test Compression and Power Reduction
in Scan Applications [59]                                             90
Proposed                                                              65.9
Figure 7.12 Bar chart showing the power comparison of RL-Huffman Encoder combination
7.9.3 Percentage of Power Saving
The percentage power savings of the proposed design are calculated and
tabulated in table 7.4.
Table 7.4 Percentage of Power Savings
Features             Comparison                                              Percentage of Power Saving
Huffman decoder      Power Analysis of the Huffman Decoding Tree [40]        52.63%
RL-Huffman Encoder   RL-Huffman Encoding for Test Compression and Power
                     Reduction in Scan Applications [59]                     26.77%
Each of the comparison results shows that, with the proposed design mapped
to 90nm standard cell libraries and low power techniques adopted, this research
work is able to achieve low power consumption without compromising the
performance of the design.
SUMMARY
The simulation and synthesis of the Variable Length Encoder and Variable
Length Decoder were described in the present chapter. The proposed architectures
of the VLC and VLD modules were designed using Verilog code and simulated using
the Modelsim simulator. The synthesis of the design was done using the RTL
Compiler EDA tool and implemented using 90nm standard cells. The physical design
of both the VLC and VLD was done using the IC Compiler back end VLSI tool. The
power analysis of each design was done and the results are tabulated. The
tabulated results show that the present research work is able to achieve low
power with the proposed architecture. Finally the compression ratio was
computed; it shows good compression without compromising performance.
CHAPTER 8
CONCLUSIONS AND SCOPE FOR FUTURE WORK
8.1 CONCLUSIONS
In the present work, the researcher has implemented a low-power
architecture for an image compression system based on the DCT and Variable
Length Coding. All of the modules in the design were modeled using synthesizable
Verilog code.
An ASIC implementation of the 2-D DCT core was designed to meet low power
constraints. The DCT and IDCT algorithms were coded using Verilog HDL, and the
design was then simulated and synthesized successfully. Simulation of the design
was done using Modelsim and the Cadence IUS simulator. The image coefficients
are obtained using MATLAB and applied as inputs to the proposed Low Power DCT
architecture. The compressed image is finally reconstructed using the IDCT. The
compressed image coefficients thus obtained are compared with the simulation
results from MATLAB and the Verilog simulator.
A novel low power architecture has been developed for the variable length
encoder and decoder for image processing applications. The design and modeling
of all the blocks in the VLC are done using synthesizable Verilog HDL. The
proposed architecture was synthesized using RTL Compiler and mapped onto
standard cells. The simulation of all blocks in the design was done using
Modelsim. The power analysis of the design was done using RTL Compiler, and low
power was achieved in all the different novel architectural blocks used in this
design.
The power consumption of the Variable Length Encoder and Decoder with 90nm
standard cells is limited to 0.798 mW and 0.884 mW respectively, with minimum
area. A 53% power saving is achieved in the dynamic power of Huffman decoding
[40] by including the look-up table approach, and a 27% power saving is achieved
in the RL-Huffman encoder [59].
8.2 ORIGINAL CONTRIBUTIONS
In this research work, algorithms and architecture have been developed for
Discrete Cosine Transform, Quantization, Variable Length Encoding and
Decoding for image compression with an emphasis on low power consumption.
These algorithms have been subsequently verified and the corresponding hardware
architectures are proposed so that they are suitable for ASIC implementation. Six
novel ideas have been presented in this thesis in order to speed up the Low
Power Image Compression implementation on ASIC; they are as follows:
• Development of algorithm and architecture for Low Power DCT and
Quantization to achieve the required compression.
• Algorithm for processing Variable Length Coding (VLC).
• Architecture of Zigzag Scanning which is used in VLC.
• Architectural design of Low Power Run Length Encoding (RLE).
• Design of Architecture for Huffman Encoding.
• Architecture of Variable Length Decoding (VLD).
8.3 SCOPE FOR FUTURE WORK
In this thesis, novel algorithms, their architectural design and the ASIC
implementation of image compression using the DCT have been presented. The
present work opens up a number of directions that may be pursued by researchers
in the near future.
The design of the proposed work is modular and scalable; therefore, it can
be upgraded to accommodate more compatible standards without appreciable
increase in hardware. Future work may also include making the processing of
images faster by including parallel architectures. Some applications, such as
medical image processing, require maintaining high image quality; where high
quality is preferred over compression, lossless variable length coding can play
a vital role. As part of future enhancement, an ultra low power architecture,
suitable for portable systems, can be implemented for both the DCT and IDCT by
redesigning the multiplier and adder circuits, which consume the major share of
power in the design presented. The power requirement can also be scaled down
drastically by reducing the picture size as well as the operating frequency and
voltage.
APPENDIX
CADENCE NCVERILOG & SIM VISION TOOLS TUTORIAL
Getting started
Incisive is a suite of tools from Cadence Design Systems for the design and
verification of ASICs, SoCs, and FPGAs. Incisive is commonly referred to by the
name NCsim, in reference to its core simulation engine. In the late 1990s, the
tool suite was known as ldv (logic design and verification).
Depending on the design requirements, Incisive has many different bundling
options of the following tools:
Tool            Command     Description
NC Verilog      ncvlog      Compiler for Verilog 95, Verilog 2001 and SystemVerilog
NC VHDL         ncvhdl      Compiler for VHDL 87, VHDL 93
NC SystemC      ncsc        Compiler for SystemC
NC Elaborator   ncelab      Unified linker/elaborator for Verilog, VHDL, and SystemC
                            libraries. Generates a simulation object file referred
                            to as a snapshot image.
NC Sim          ncsim       Unified simulation engine for Verilog, VHDL, and
                            SystemC. Loads snapshot images generated by NC
                            Elaborator. Can be run in GUI mode or batch command-line
                            mode. In GUI mode, ncsim is similar to the debug
                            features of ModelSim's vsim.
SimVision       simvision   A standalone graphical waveform viewer and netlist
                            tracer, very similar to Novas Software's Debussy.
The methodology for debugging a project design involves the following steps:
• Compiling the Verilog source code
• Running the simulation
• Viewing the generated waveforms
In this thesis, compiling and simulation are done by invoking the ncverilog tool, and the waveforms are viewed with the SimVision tool.
Pre Setup
If you are using Mac OS X or Windows, please refer to the links shown at the end for the software required to connect to the UNIX server.
For Windows:
• PuTTY
To connect with PuTTY, simply type in the domain name under the Session section, make sure that the connection type is SSH with port 22, and click Open.
You will then be prompted to input your user name and password, which should
be provided by your system admin.
• SSH Tectia Client
With SSH Tectia Client, click on Quick Connect and you will be prompted for
the domain name and user name. Make sure the port number is 22.
Click Connect and you will be prompted for your password.
• Cygwin
Run Cygwin and type in the following:
ssh -l username hafez.sfsu.edu
Here username refers to your user name. Hit enter and you will be prompted
for your password.
Setting up the Verilog environment
Once you have connected to the server, check whether you have the file
ius55.csh; if you do not, report to your system admin.
Next type in the following:
csh
source ius55.csh
Press enter. The Verilog environment is now set up successfully.
Creating/Editing Verilog Source Code:
There are many editors to choose from; here the vi editor, i.e. the gvim editor, is used.
Creating a Verilog file with gvim:
To write a verilog source file using gvim type in the following command:
gvim test1.v
The file extension must be .v, otherwise the compiler will not recognize the file.
After editing, the Verilog code looks as follows in the editor.
Compiling and Simulating:
This is the easy part. To just compile your code, do the following:
ncverilog -C test1.v
This is what you should see if everything is done correctly.
To run both the compiler and the simulator, type the following:
ncverilog test1.v
If done correctly, you should see the following output.
After running the compiler and simulator, you should notice that a folder
called INCA_libs has been created in your current directory, which holds
snapshots of the simulation. To invoke the snapshot, simply type the following
for the current program:
ncsim worklib.test1_tb:v
In this argument, test1_tb is your test bench module. If all goes well you
should see the following:
Furthermore, you will notice that in your current directory ncsim and
ncverilog have written logs of the past activities. Use any editor to view the
files.
WAVEFORM VIEWING:
Type the command as shown below:
bsub simvision
Here bsub indicates the submission of the job to the server. The waveform
terminal then opens as shown below.
You can also access the other tools in the SimVision analysis environment
through menu choices and toolbar buttons, as follows.
Next, the design browser lets you move through the design hierarchy to view
objects. You can use the design browser to select the signals that you want to
display in the waveform window.
To display the selected signals in the waveform window, select the required
signals (e.g. load, clk, reset), then click on the waveform button.
Synthesis of Variable Length Coding:
Power Analysis from Synthesis Tool:
Layout from IC Compiler:
REFERENCES
1. A. Mukherjee, J.W. Flieder, N. Ranganathan (1992), “A VLSI Chip for Variable Length Encoding and Decoding”, IEEE International Conference on Industrial Electronics and Applications (ICIEA), pp.1-4.
2. A.H. Taherinia, M. Jamzad (2009), “A Robust Image Watermarking using Two Level DCT and Wavelet Packets Denoising”, International Conference on Availability, Reliability and Security, pp.150-157.
3. A.P. Vinod, D. Rajan and A. Singla (2007), “Differential pixel-based low- power and high-speed implementation of DCT for on-board satellite image processing”, IET international Journals on Circuits, Devices and Systems, pp.444-450.
4. A.Pradini, T.M.Roffi, R.Dirza, T.Adiono (2011), “VLSI Design of a High-Throughput Discrete Cosine Transform for Image Compression System”, International Conference on Electrical Engineering and Informatics, Indonesia (ICEEI), pp.1-6.
5. Abdullah Al Muhit, Md. Shabiul Islam and Masuri Othman [2004], “VLSI Implementation of Discrete Wavelet Transform (DWT) for Image Compression”, 2nd International conference on Autonomous Robots and Agents New Zealand December,pp.391-395.
6. Ahmed Desoky and Mark Gregory (1988), “Compression of Text and Binary Files Using Adaptive Huffman Coding Technique”, Southeastcon IEEE Conference Proceedings pp.660-663.
7. Amir Z. Averbuch, F. Meyer, J.-O. Stromberg, R. Coifman and A. Vassiliou (2001), “Low Bit-Rate Efficient Compression for Seismic Data”, IEEE Transactions on Image Processing, Vol.10, Issue.12, pp.1801-1804.
8. An-Yeu Wu and K.J. Ray Liu (1994), “A Low-Power and Low-Complexity DCT/IDCT VLSI Architecture Based on Backward Chebyshev Recursion”, IEEE International Symposium on Circuits and Systems, Vol.4, pp.155-158.
9. Aslan Tchamkerten and I.Emre Telatar (2006), “Variable Length Coding Over an Unknown Channel” IEEE Transactions on Information Theory, Vol. 52, No. 5, pp. 2126-2145.
10. B. Heyne and J. Goetz (2007), “A low-power and high-quality implementation of the discrete cosine transformation”, Adv. Radio Sci., 5, 305–311, 2007.
11. Bao Ergude, Li Weisheng, Fan Dongrui, Ma Xiaoyu, (2008), “A Study and Implementation of the Huffman Algorithm Based on Condensed Huffman Table”, IEEE International Conference on Computer Science and Software Engineering, pp.42-45.
12. Basant K. Mohanty, Pramod K. Meher (2011), “Memory-Efficient Architecture for 3-D DWT Using Overlapped Grouping of Frames”, IEEE Transactions on Signal Processing, pp.5605-5616.
13. Byoung-2 Kim, Sotirios. G. Ziavras (2009), “Low Power Multiplier less DCT for Image/Video Coders”, IEEE 13th International Symposium on Consumer Electronics, pp.133-136.
14. Chi-Chia Sun, Benjamin Heyne, Juergen Goetze (2006), “A Low Power and High Quality CORDIC based Loeffler DCT”, IEEE conference.
15. Chi-chia Sun, Ce Zhang and Juergen Goetze (2010), “A Configurable IP Core for Inverse Quantized Discrete Cosine and Integer Transforms with arbitrary Accuracy”, Proceedings of IEEE Asia Pacific International Conference on Circuits and Systems (APCCAS), pp.915-918.
16. Chi-Chia Sun, Philipp Donner and Jurgen Gotze (2009), “Low-Complexity Multi-Purpose IP Core for Quantized Discrete Cosine and Integer Transform”, IEEE International Symposium on Circuits and Systems, pp.3014-3017.
17. Chin-Teng Lin, Yuan-Chu Yu, Lan-Da Van (2008), “Cost-Effective Triple-Mode Reconfigurable Pipeline FFT/IFFT/2-D DCT Processor”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 16, No.8, pp.1058-1071.
18. Chunrong Zhang, Shibao Zheng (2005),“A Novel Low-complexity and High-performance Frame-skipping Transcoder in DCT Domain”, IEEE Transactions on Consumer Electronics, Vol.51, NO.4, pp.1306-1312.
19. D.A. Karras, S.A. Karkanis and B.G. Mertzios (1998), “Image Compression Using the Wavelet Transform on Textural Regions of Interest”, 24th IEEE International Euromicro Conference, pp.633-639.
20. Da An, Xin Tong, Bingqiang Zhu and Yun He(2009), “A Novel Fast DCT Coefficient Scan Architecture”, IEEE Picture Coding Symposium I Beijing 100084, China, pp.1-4.
21. David A. Maluf, Peter B. Tran, David Tran (2008), “Effective Data Representation and Compression in Ground Data Systems”, International Conference on Aerospace, pp 1-7.
22. Ding Xing hao, Qian Kun, Xiao Quan, Liao Ying hao, Guo Dong hui, Wang Shou jue (2009),“Low Bit Rate compression of Facial Images Based on Adaptive Over-complete Sparse Representation” , IEEE 2nd International congress on Image and Signal Processing, pp.1-3.
23. Dr. Muhammad Younus Javed and Abid Nadeem (2000), “Data Compression Through Adaptive Huffman Coding Scheme”, IEEE Proceedings on TENCON ,Vol.2. pp.187-190.
24. Emy Ramola, J. Samuel Manoharan (2011), “An Area Efficient VLSI Realization of Discrete Wavelet Transform for Multi resolution Analysis”, IEEE International Conference on Electronics Computer Technology (ICECT) pp.377-381.
25. En-hui Yang, and Longji Wang (2009), “Joint Optimization of Run-Length Coding, Huffman Coding, and Quantization Table With Complete Baseline JPEG Decoder Compatibility”, IEEE Transactions on Image processing, Vol.8, Issue 1, pp.63-74.
26. En-Hui Yang, Longji Wang (2009), “Joint Optimization of Run-Length Coding, Huffman Coding, and Quantization Table with Complete Baseline JPEG Decoder Compatibility”, IEEE Transactions on Image Processing, pp.63-74.
27. F.M.Bayer and R.J.Cintra (2010),“Image Compression via a Fast DCT Approximation” IEEE Latin America Transactions, VOL. 8, NO. 6, pp.708-713.
28. Giridhar Mandyam, Nasir Ahmed and Samuel D. Stearns (1995), “A Two-Stage Scheme for Lossless Compression of Images”, IEEE International Symposium on Circuits and Systems.Vol.2 , pp.1102-1105.
29. Gopal Lakhani (2004), “Optimal Huffman Coding of DCT Blocks”, IEEE transactions on circuits and systems for video technology, Vol.14,issue.4. pp 522-527.
30. Gregory K.Wallace (1991), “The JPEG Still Picture Compression Standard”, IEEE transactions on consumer electronics, Vol.38, Issue.1, pp.18-38.
31. Hai Huang, Tze-Yun Sung, Yaw-shih Shieh (2010), “A Novel VLSI Linear array for 2-D DCT/IDCT”, IEEE 3rd International Congress on Image and Signal Processing, pp.3680-3690.
32. Hassan Shojania and Subramania Sudarsanan (2005), “A High Performance CABAC Encoder”, Proceedings of the 3rd International Conference, pp.315-318.
33. Hatim Anas, Said Belkouch, M.El Aakif, Noureddine Chabini(2010), “FPGA Implementation of a Pipelined 2D DCT and Simplified Quantization for Real Time Applications”, IEEE International Conference on Multimedia Computing and Systems , pp.1-6.
34. He Cuiqun, Liu Guodong, Xie Zhihua (2010),“Infrared Face Recognition Based on Blood Perfusion and weighted block-DCT in Wavelet Domain” International Conference on Computational Intelligence and Security,pp.283-287.
35. Hitomi Murakami, Shuichi Matsumoto, Hideo Yamamoto(1984), “Algorithm for Construction of Variable Length Code with Limited Maximum Word Length”, IEEE Transactions on Communications, Vol. Com-32, No.10,pp.1157-1159.
36. Hyoung Joong Kim (2009), “A New Lossless Data Compression Method”, IEEE International Conference on Multimedia and Expo (ICME), pp.1740-1743.
37. Jack Venbrux, Pen-Shu and Muye (1992), “A VLSI Chip Set for High-Speed Lossless Data Compression”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 2, Issue.4, pp. 381-391.
38. Jaehwan Jeon, Jinhee Lee, and Joonki Paik (2011), “Robust Focus Measure for Unsupervised Auto-Focusing Based on Optimum Discrete Cosine Transform Coefficients”, IEEE Transactions on Consumer Electronics, Vol. 57, No. 1, pp.1-5.
39. Jason McNeely and Magdi Bayoumi (2007), “Low Power Look-Up Tables for Huffman Decoding”, IEEE International Conference on Image Processing , pp.465-468.
40. Jason McNeely, Yasser Ismail, Magdy A. Bayoumi and Peiyi Zaho (2008).” Power Analysis of the Huffman Decoding Tree”, 15th IEEE International Conference on Image Processing, pp.1416-1419.
41. Jer Min Jou and Pei-Yin Chen (1999), “A Fast and Efficient Lossless Data-Compression Method”, IEEE Transactions on communications, Vol.47.Issue.9, pp.1278-1283.
42. Jia-Yu Lin, Ying Liu, and Ke-Chu Yi(2004), “ Balance of 0,1 Bits for Huffman and Reversible Variable-Length Coding”, IEEE Journal on Communications, pp. 359-361.
43. Jin Li Weiwei Chen Moncef Gabbouj Jarmo Takala Hexin Chen (2011) “Prediction of Discrete Cosine Transformed Coefficients in Resized Pixel Blocks”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp.1045-1048.
44. Jin Li, Moncef Gabbouj, Jarmo Takala and Hexin Chen (2009), “Direct 3-D DCT-to-DCT Resizing Algorithm for Video Coding”, Proceedings of the 6th International Symposium on Image and Signal Processing and Analysis, pp.105-110.
45. Jing-Ming Guo and Chia-Hao Chang (2009), “Prediction-Based Watermarking Schemes for DCT-Based Image Coding”, 5th IEEE International Conference on Information Assurance and Security, pp.619-621.
46. Jongsun Park, Kaushik Roy (2008), “A Low Complexity Reconfigurable DCT Architecture to Trade off Image Quality for Power Consumption”, IEEE Transactions on Speech and Image Processing, Vol.5, pp.17-20.
47. Kamrul Hasan Talukder and Koichi Harada (2007), “Discrete Wavelet Transform for Image Compression and A Model of Parallel Image Compression Scheme for Formal Verification”, Proceedings of the World Congress on Engineering.
48. Koen Denecker, Jeroen Van Overloop and Ignace Lemahieu (1997), “An Experimental Comparison of Several Lossless Image Coders for Medical Images”, IEEE International Conference on Data Compression.
49. Kyeounsoo Kim and Peter A. Beerel (1999), “A High-Performance Low-Power Asynchronous Matrix-Vector Multiplier for Discrete Cosine Transform”, IEEE Asia Pacific International Conference on ASICs, pp.135-138.
50. L.Y. Liu, J.F. Wang, R.J. Wang, J.Y. Lee (1995), “Design and Hardware Architectures for Dynamic Huffman Coding”, IEE Proceedings on Computers and Digital Techniques, Vol.142, Issue 6, pp.411-418.
51. Laurentiu Acasandrei, Marius Neag (2008), “A Fast Parallel Huffman Decoder for FPGA Implementation”, ACTA TECHNICA NAPOCENSIS, Volume 49, Number 1, pp.8-15.
52. Li Wenna, Gao Yang, Yi Yufeng, Gao Liqun (2011), “Medical Image Coding Based on Wavelet Transform and Distributed Arithmetic Coding”, Chinese Control and Decision Conference (CCDC), pp.4159-4162.
53. Liang-Wie, Liang-Ying Liu, Jhing-Fa Wang and Jau-Yien Lee (1993), “Dynamic Mapping Technique for Adaptive Huffman Code”, IEEE International Conference on Computer, Communication, Control and Power Engineering, Vol.3, pp.653-656.
54. Lili Liu, Hexin Chen, Aijun Sang, Haojing Bao (2011), “Four-Dimensional Vector Matrix DCT Integer Transform Codec Based on Multi-Dimensional Vector Matrix Theory”, IEEE Fourth International Conference on Intelligent Computation Technology and Automation (ICICTA), pp.552-555.
55. Lin Ma, Songnan Li, Fan Zhang and King Ngi Ngan (2011), “Reduced Reference Image Quality Assessment Using Reorganized DCT Based Image Representation”, IEEE Transactions on Multimedia, Vol.13, No.4, pp.824-829.
56. M. El Aakif, S. Belkouch, N. Chabini, M.M. Hassani (2011), “Low Power and Fast DCT Architecture Using Multiplier-Less Method”, IEEE International Conference on Faible Tension Faible Consommation (FTFC), pp.63-66.
57. M. El Aakif, S. Belkouch, N. Chabini, M.M. Hassani (2011), “Low Power and Fast DCT Architecture Using Multiplier-Less Method”, International Journal on Faible Tension Faible Consommation, pp.63-66.
58. M. Jridi and A. Alfalou (2010), “A Low-Power, High-Speed DCT Architecture for Image Compression: Principle and Implementation”, 18th IEEE/IFIP International Conference on VLSI and System-on-Chip (VLSI-SoC 2010), pp.304-309.
59. M.H. Tehranipour, M. Nourani, K. Arabi, A. Afzali-Kusha (2004), “Mixed RL-Huffman Encoding for Power Reduction and Data Compression in Scan Test”, Proceedings of the International Symposium on Circuits and Systems, Vol.2, pp.681-684.
60. M.R.M. Rizk (2007), “Low Power Small Area High Performance 2D-DCT Architecture”, 2nd International Design and Test Workshop (IDT), pp.120-125.
61. Majdi Elhaji, Abdelkrim Zitouni, Samy Meftali, Jean-Luc Dekeyser and Rached Tourki (2011), “A Low Power and Highly Parallel Implementation of the H.264 8x8 Transform and Quantization”, Proceedings of 10th IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), pp.528-531.
62. Marcelo J. Weinberger, Gadiel Seroussi, Guillermo Sapiro (1996), “LOCO-I: A Low Complexity, Context-Based, Lossless Image Compression Algorithm”, IEEE International Conference on Data Compression, pp.140-149.
63. Md. Al Mamun, Xiuping Jia and Michael Ryan (2009), “Adaptive Data Compression for Efficient Sequential Transmission and Change Updating of Remote Sensing Images”, IEEE International Symposium on Geoscience and Remote Sensing (IGARSS), pp.498-501.
64. Med Lassaad Kaddachi Nouira, Adel Soudani, Vincent Lecuire, Kholdoun Torki (2010), “Efficient Hardware Solution for Low Power and Adaptive Image Compression in WSN”, Proceedings of IEEE 17th International Conference on Electronics, Circuits and Systems (ICECS), pp.583-586.
65. Muhammad Bilal Akhtar, Adil Masoud Qureshi, Qamar-Ul-Islam (2011), “Optimized Run Length Coding for JPEG Image Compression Used in Space Research Program of IST”, IEEE International Conference on Computer Networks and Information Technology (ICCNIT), pp.81-85.
66. Muhammed Yusuf Khan, Ekram Khan and M. Salim Beg (2008), “Performance Evaluation of 4x4 DCT Algorithms for Low Power Wireless Applications”, International Conference on Emerging Trends in Engineering and Technology, pp.1284-1286.
67. Muhammed Yusuf Khan, Ekram Khan, M. Salim Beg (2008), “Performance Evaluation of 4x4 DCT Algorithms for Low Power Wireless Applications”, First International Conference on Emerging Trends in Engineering and Technology, pp.1284-1286.
68. Munish Jindal, RSV Prasad and K. Ramkishor (2003), “Fast Video Coding at Low Bit-Rates for Mobile Devices”, International Conference on Information, Communication and Signal Processing, Vol.1, pp.483-487.
69. Mustafa Safa Al-Wahaiba, Kosshiek Wong (2010), “A Lossless Image Compression Algorithm Using Duplication Free Run-Length Coding”, Second International Conference on Network Applications, Protocols and Services (NETAPPS), pp.245-250.
70. N. Venugopal and S. Ramachandran (2009), “Design and FPGA Implementation of Fast Variable Length Coder for a Video Encoder”, International Journal of Computer Science and Network Security (IJCSNS), Vol.9, No.7, pp.178-184.
71. Pablo Montero, Javier Taibo (2010), “Parallel Zigzag Scanning and Huffman Coding for a GPU-Based MPEG-2 Encoder”, IEEE International Symposium on Multimedia, pp.97-104.
72. Paul G. Howard and Jeffrey Scott Vitter (1991), “Analysis of Arithmetic Coding for Data Compression”, IEEE International Conference on Data Compression, pp.3-12.
73. Paulo Roberto Rosa Lopes Nunes (2006), “Segmented Optimal Linear Prediction applied to Lossless Image Coding”, IEEE International Symposium on Telecommunications, pp.524-528.
74. Pei-Yin Chen, Yi-Ming Lin and Min-Yi Cho (2008), “An Efficient Design of Variable Length Decoder for MPEG-1/2/4”, IEEE Transactions on Multimedia, Vol.16, Issue 9, pp.1307-1315.
75. Peng Wu, Chuangbai Xiao, Shoudao Wang, Mu Ling (2009), “An Efficient Method for Early Detecting All-Zero Quantized DCT Coefficients for H.264/AVC”, IEEE International Conference on Systems, Man and Cybernetics, San Antonio, USA, pp.3797-3800.
76. Piyush Kumar Shukla, Pradeep Rusiya, Deepak Agrawal, Lata Chhablani, Balwant Singh (2009), “Multiple Subgroup Data Compression Technique Based on Huffman Coding”, First International Conference on Computational Intelligence, Communication Systems and Networks (CICSYN), pp.397-402.
77. Raymond K.W. Chan, Moon-Chuen Lee (2006), “Multiplierless Approximation of Fast DCT Algorithms”, IEEE International Conference on Multimedia and Expo, pp.1925-1928.
78. Reza Hashemian (1994), “Design and Hardware Implementation of a Memory Efficient Huffman Decoding”, IEEE Transactions on Consumer Electronics, Vol.40, No.3, pp.345-350.
79. Reza Hashemian (2003), “Direct Huffman Coding and Decoding Using the Table of Code-Lengths”, IEEE International Conference on Information Technology: Coding, Computers and Communication, pp.237-241.
80. Ricardo Castellanos, Hari Kalva and Ravi Shankar (2009), “Low Power DCT using Highly Scalable Multipliers”, 16th IEEE International Conference on Image Processing, pp.1925-1928.
81. S. Ramachandran and S. Srinivasan (2002), “A Novel, Automatic Quality Control Scheme for Real Time Image Transmission”, VLSI Design, Vol.14, No.4, pp.329-335.
82. S.V.V. Sateesh, R. Sakthivel, K. Nirosha, Harish M. Kittur (2011), “An Optimized Architecture to Perform Image Compression and Encryption Simultaneously Using Modified DCT Algorithm”, IEEE International Conference on Signal Processing, Communication, Computing and Network Technologies, pp.442-447.
83. S.Vijay, D. Anchit (2009), “Low Power Implementation of DCT for On-Board Satellite Image Processing Systems”, 52nd IEEE International Symposium on Circuits and Systems, pp.774-777.
84. Shang Xue and Bengt Oelmann (2003), “Efficient VLSI Implementation of a VLC Decoder for Universal Variable Length Code”, Proceedings of the IEEE Computer Society Annual Symposium on VLSI.
85. Stephen Molloy and Rajeev Jain (1997), “Low Power VLSI Architectures for Variable-Length Encoding and Decoding”, Proceedings of the 40th International Midwest Symposium on Circuits and Systems, pp.997-1000.
86. Sungwook Yu and Earl E. Swartzlander Jr (2001), “DCT Implementation with Distributed Arithmetic”, IEEE Transactions on Computers, Vol.50, Issue 9, pp.985-991.
87. Sung-Wen Wang, Shang-Chih Chuang, Chih-Chieh Hsiao, Yi-Shin Tung and Ja-Ling Wu (2008), “An Efficient Memory Construction Scheme for an Arbitrary Side Growing Huffman Table”, IEEE International Conference on Multimedia and Expo, pp.141-144.
88. Sung-Won Lee (2003), “A Low-Power Variable Length Decoder for MPEG-2 Based on Successive Decoding of Short Codewords”, IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, Vol.50, No.2, pp.73-82.
89. Sunil Bhooshan, Shipra Sharma (2009), “An Effective and Selective Image Compression Scheme Using Huffman and Adaptive Interpolation”, 24th IEEE International Conference on Image and Vision Computing New Zealand, pp.197-202.
90. Sunil Bhooshan, Shipra Sharma (2009), “An Efficient and Selective Image Compression Scheme Using Huffman and Adaptive Interpolation”, 24th International Conference on Image and Vision Computing New Zealand (IVCNZ), pp.1-3.
91. Taizo Suzuki and Masaaki Ikehara (2011), “Integer Fast Lapped Orthogonal Transform Based on Direct-Lifting of DCTs for Lossless-to-Lossy Image Coding”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.1525-1528.
92. Thucydides Xanthopoulos, A.P. Chandrakasan (2000), “A Low-Power DCT Core Using Adaptive Bitwidth and Arithmetic Activity Exploiting Signal Correlations and Quantization”, IEEE Journal of Solid-State Circuits, Vol.35, No.5, pp.740-750.
93. Tze-Yun Sung, Yaw-Shih Shieh, Chun-Wang Yu and Hsi-Chin Hsin (2006), “High Efficiency and Low Power Architectures for 2-D DCT and IDCT Based on CORDIC Rotation”, Seventh International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), pp.191-196.
94. Vijay Kumar Sharma, K. K. Mahapatra and Umesh C. Pati (2011) “An Efficient Distributed Arithmetic based VLSI Architecture for DCT”, Proceedings of IEEE International Conference on Devices and Communications, pp.1-5.
95. Vijay Kumar Sharma, U.C. Pati and K.K. Mahapatra (2010), “A Study of Removal of Subjective Redundancy in JPEG for Low Cost, Low Power, Computation Efficient Circuit Design and High Compression Image”, Proceedings of IEEE International Conference on Power, Control and Embedded Systems (ICPCES), pp.1-6.
96. Vimal P. Singh Thoudam, B. Bhaumik, S. Chatterjee (2010), “Ultra Low Power Implementation of 2-D DCT for Image/Video Compression”, International Conference on Computer Applications and Industrial Electronics (ICCAIE), pp.532-536.
97. Wei-Yeo Chiu, Yu-Ming Lee and Yinyi Lin (2010), “Advanced Zero-Block Mode Decision Algorithm for H.264/AVC Video Coding”, Proceedings of IEEE International Conference (TENCON), pp.687-690.
98. Wenna Li, Zhaohua Cui (2010), “Low Bit Rate Image Coding Based on Wavelet Transform and Color Correlative Coding”, International Conference on Computer Design and Applications (ICCDA), pp.479-482.
99. Y.M. Lin and P.Y. Chen (2006), “A Low-Cost VLSI Implementation for VLC”, IEEE International Conference on Industrial Electronics and Applications (ICIEA), pp.1-4.
100. Y.P. Lee, Chen (1997), “A Cost Effective Architecture for 8x8 Two-Dimensional DCT/IDCT Using Direct Method”, IEEE Transactions on Circuits and Systems for Video Technology, Vol.7, Issue 9, pp.459-467.
101. Y. Wongsawat, H. Ochoa, K.R. Rao (2004), “A Modified Hybrid DCT-SVD Image-Coding System”, International Symposium on Communications and Information Technologies (ISCIT), pp.766-769.
102. Yan Lu, Wen Gao and Feng Wu (2003), “Efficient Video Coding with Fractional Resolution Sprite Prediction Technique”, Electronics Letters, pp.279-280.
103. Yongli Zhu, Zhengya Xu (2006), “Adaptive Context Based Coding for Lossless Color Image Compression”, IMACS Multiconference on Computational Engineering in Systems Applications (CESA), Beijing, China, pp.1310-1314.
104. Yushi Chen, Yuhang Zhang, Ye Zhang, Zhixin Zhou (2011), “Fast Vector Quantization Algorithm for Hyperspectral Image Compression”, IEEE International Conference on Data Compression, p.450.
LIST OF PUBLICATIONS
NATIONAL CONFERENCES
1. VijayaPrakash A M and K.S. Gurumurthy (2010), “Design and Implementation of AES Algorithm (AES-128)”, National Conference CMNE-2010 at SIT, Tumkur, Karnataka.
2. VijayaPrakash A M, K.S. Gurumurthy and Sindura Prakash (2010), “Design and Implementation of Low Power SDRAM Controller”, National Conference 2010 at VCET, Bellary, Karnataka.
INTERNATIONAL CONFERENCES
1. VijayaPrakash A M and K.S. Gurumurthy (2008), “Design and Implementation of a High Speed and Low Power Constant Multiplier”, International Conference on Emerging Microelectronics and Interconnection Technologies (EMIT-2008) at National Institute of Advanced Studies (NIAS), Bangalore, India.
2. VijayaPrakash A M, Anoop R. Katti and Shakeeb Ahabed Pasha B K (2011), “Novel VLSI Architecture for Real Time Blind Source Separation”, IEEE International Conference ARTCOM-2011 at Reva College of Engineering, Bangalore, India.
INTERNATIONAL JOURNALS
1. VijayaPrakash A M and K.S. Gurumurthy (September 2010), “A Novel VLSI Architecture for Digital Image Compression Using Discrete Cosine Transform and Quantization”, International Journal of Computer Science and Network Security (IJCSNS) [ISSN: 1738-7906], Vol.10, No.9. [Impact Factor: 1.047]
2. VijayaPrakash A M and K.S. Gurumurthy (December 2010), “A Novel VLSI Architecture for Image Compression Model Using Low Power Discrete Cosine Transform”, International Journal of World Academy of Science, Engineering and Technology (WASET) [ISSN: 1307-6892], Year 6, Issue 72. [Impact Factor: 1.0]
3. VijayaPrakash A M and K.S. Gurumurthy (December 2011), “A Novel VLSI Architecture for Low Power FIR Filter”, International Journal of Advanced Engineering and Applications (IJAEA) [ISSN: 0975-7791].
4. VijayaPrakash A M and K.S. Gurumurthy (January 2012), “VLSI Architecture for Low Power Variable Length Encoding and Decoding for Image Processing Applications”, International Journal of Advances in Engineering and Technology (IJAET) [ISSN: 2231-1963], Vol.2, Issue 1. [Impact Factor: 1.96]
5. VijayaPrakash A M and D. Preethi (June 2012), “A Low Power VLSI Architecture for Image Compression System Using DCT and IDCT”, International Journal of Engineering and Advanced Technology (IJEAT) [ISSN: 2249-8958], Vol.1, Issue 5.
VITAE
VIJAYAPRAKASH A M, No. 233, Third Stage, Fourth Block, Basaveswaranagar, Bangalore – 560 079. Ph: 9844658446, Email: [email protected]
Educational Qualification:
B.E. in Electronics Engineering from University Visvesvaraya College of Engineering, Bangalore, affiliated to Bangalore University, with First Class, in the year 1992.
M.E. in Digital Electronics from S.D.M College of Engineering and Technology, Dharwad, affiliated to Karnataka University, with First Class with Distinction, in the year 1997.
Pursuing Ph.D. (“Low Power VLSI Architecture for Image Compression Using Discrete Cosine Transform”) from Dr. M.G.R University, Chennai.
Software skills:
VHDL, VERILOG, SYSTEM VERILOG, C Language and MATLAB.
EDA tools: Cadence, Synopsys, Xilinx Synthesis Tool, ModelSim.
Working Experience:
Presently working as an Associate Professor and P.G. Coordinator in the Department of Electronics and Communication Engineering at Bangalore Institute of Technology, Bangalore, since December 1997.
Worked as Lecturer in the Department of Electronics and Communication
Engineering at SJC Institute of Technology, Chikkaballapur from November 1994
to August 1995 and February 1997 to December 1997.
Worked as Apprentice Trainee at Bharat Electronics Ltd, Bangalore from August 1992 to July 1993.
Worked as Lecturer in the Department of Electronics and Communication
Engineering at Vidyavikas Polytechnic, Bangalore from August 1993 to October
1994.
PERSONAL DETAILS:
NAME: VIJAYAPRAKASH A.M
DATE OF BIRTH: 18-05-1967
PERMANENT ADDRESS: No. 233, Third Stage, Fourth Block, Basaveswaranagar, Bangalore – 560 079
DECLARATION
I declare that the above particulars are true to the best of my knowledge and belief.
(VIJAYAPRAKASH A M)