MEMORY-EFFICIENT CONCURRENT VLSI ARCHITECTURES FOR TWO-DIMENSIONAL DISCRETE
WAVELET TRANSFORM
Synopsis
of
Ph.D. thesis
By
Anurag Mahajan (Enrollment Number: 07P01005G)
Under the Guidance of
Prof. B.K. Mohanty
Department of Electronics and Communication Engineering
JAYPEE UNIVERSITY OF ENGINEERING AND TECHNOLOGY, GUNA (M.P.) - INDIA
August 2013
Preface
Discrete wavelet transform (DWT) is a mathematical technique that provides a new method
for signal processing. It decomposes a signal in the time domain using dilated/contracted
and translated versions of a single basis function, known as the prototype wavelet (Mallat, 1989;
Daubechies, 1992; Meyer, 1993; Vetterli and Kovacevic, 1995). DWT offers a wide variety
of useful features over other unitary transforms such as the discrete Fourier transform (DFT),
discrete cosine transform (DCT) and discrete sine transform (DST). Two-dimensional (2-D)
DWT has been applied in image compression, image analysis, image watermarking, etc.
(Lewis and Knowles, 1992). Currently, 2-D DWT is used in the JPEG 2000 image compression
standard (Skodras et al., 2001). The 2-D DWT is highly computation-intensive, and many of
its applications need real-time processing to deliver better performance. The 2-D DWT is
therefore implemented in very large scale integration (VLSI) systems to meet the space-time
requirements of various real-time applications. Several design schemes have been suggested
for efficient implementation of 2-D DWT in VLSI systems.
The hardware complexity of a multilevel 2-D DWT structure is broadly divided into
two parts: (i) arithmetic and (ii) memory. The arithmetic component comprises
multipliers and adders, and its complexity depends on the wavelet filter size (k). The memory
component comprises a line buffer and a frame buffer. The memory complexity depends on
the image size (MN), where M and N represent the height and width of the input image. Small
filters (k < 10) are used in DWT, while the standard image size is (512 × 512). Therefore, the
complexity of a multilevel 2-D DWT structure is dominated by the complexity of the memory
component. Most of the existing design strategies focus on arithmetic complexity, cycle
period and throughput rate; no specific memory-centric design method has been proposed
for multilevel 2-D DWT. The objective of the proposed thesis work is to explore memory-centric
design approaches and to propose area-delay-power-efficient hardware designs for
implementation of multilevel 2-D DWT.
Objective
The thesis entitled “Memory-Efficient Concurrent VLSI Architectures for Two-Dimensional
Discrete Wavelet Transform” has the following aims and objectives:
To improve the memory utilization efficiency of the 2-D DWT structure.
To reduce the transposition memory size.
To eliminate the frame buffer.
To reduce arithmetic complexity using a low-complexity design scheme.
The summary of the thesis is given below:
Chapter 1: Introduction
In this chapter, the computation schemes of one-dimensional (1-D) and 2-D DWT are
discussed. 1-D DWT can be performed using the convolution scheme or the lifting scheme proposed
by Sweldens (1996). The convolution scheme involves more arithmetic resources and memory
space than the lifting scheme. However, the lifting scheme is suitable only for bi-orthogonal
wavelet filters. The 2-D DWT computation is performed by two approaches: (i) separable and
(ii) non-separable. In the non-separable approach, the row and column transforms of 2-D DWT are
performed simultaneously using 2-D wavelet filters. In the separable approach, they are
performed separately using 1-D DWT. The separable approach is more popular as it demands
less computation, but it requires a transposition memory between the row and column transforms.
Multilevel 2-D DWT computation can be performed using the
pyramid algorithm (PA), the recursive pyramid algorithm (RPA) of Vishwanath (1994) or the
folded scheme of Wu and Chen (2001). Due to its design simplicity, 100% hardware utilization
efficiency (HUE) and lower arithmetic resource requirement, the folded scheme is more popular
than the PA and RPA for hardware realization. Keeping this in view, several architectures
based on the folded scheme have been proposed for efficient implementation of 2-D DWT.
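The two computation choices discussed above, lifting for the 1-D transform and the separable approach for the 2-D transform, can be illustrated with a short behavioural sketch. It assumes the integer 5/3 (Le Gall) wavelet of JPEG 2000; the function names, boundary handling and the specific filter are illustrative choices, not the structures of the thesis.

```python
def lift_53(x):
    """One level of 1-D lifting DWT with the integer 5/3 (Le Gall)
    filter; x must have even length. Returns (low-pass, high-pass)."""
    n = len(x)
    # Predict step: high-pass coefficients from the odd samples
    # (symmetric extension at the right boundary via min()).
    d = [x[2*i + 1] - (x[2*i] + x[min(2*i + 2, n - 2)]) // 2
         for i in range(n // 2)]
    # Update step: low-pass coefficients from the even samples
    # (symmetric extension at the left boundary via max()).
    s = [x[2*i] + (d[max(i - 1, 0)] + d[i] + 2) // 4
         for i in range(n // 2)]
    return s, d

def dwt2_separable(img):
    """One separable 2-D DWT level: row transform first, producing the
    intermediate matrices Ul and Uh, then column transform of each."""
    ul, uh = zip(*(lift_53(row) for row in img))          # row-processor
    def col_transform(u):
        lo, hi = zip(*(lift_53(col) for col in zip(*u)))  # column-processor
        return ([list(r) for r in zip(*lo)],
                [list(r) for r in zip(*hi)])
    ll, lh = col_transform(ul)   # subbands from Ul
    hl, hh = col_transform(uh)   # subbands from Uh
    return ll, lh, hl, hh
```

A constant image, for example, yields a constant LL subband and all-zero detail subbands, which is a quick sanity check of the predict/update steps.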
Chapter 2: Hardware Complexity Analysis
Folded 2-D DWT computation is performed level by level using one separable 2-D
DWT unit, referred to as the processing unit (PU), and one frame buffer. The low-low subband of the
current DWT level is stored in the frame buffer to compute the higher DWT levels. The PU
comprises one row-processor (to perform 1-D DWT computation row-wise), one column-processor
(to perform 1-D DWT computation column-wise), one transposition memory and
one temporal memory. The transposition memory stores the intermediate low-pass (Ul)
and high-pass (Uh) matrices, while the temporal memory is used by the column-processor to store the
partial results of the column DWT. The frame memory may be either on-chip or off-chip, while the
other two are usually on-chip memories.
The arithmetic complexity of the folded structure depends on the DWT computation
scheme and the filter length. The size of the frame buffer is MN/4 words, which is independent
of the data access scheme, the type of DWT computation scheme (convolution or lifting) and the length
of the wavelet filter. The temporal memory size depends on the DWT computation scheme and the wavelet
filter length. For convolution-based 2-D DWT, the temporal memory size is zero when a direct-form
FIR structure is used for the computation of 1-D DWT. In the case of lifting-based 2-D DWT,
the size of the temporal memory depends on the number of lifting steps of the bi-orthogonal wavelet
filter. The transposition memory size mainly depends on the data access scheme adopted to feed
the 2-D input samples and on the DWT computation scheme (convolution or lifting). In general, the
sizes of the transposition memory and temporal memory are some multiple of the image width,
while the size of the frame memory is some multiple of the image size. On the other hand, the
complexity of the arithmetic component depends on the size of the wavelet filter. The standard
image size is (512 × 512), whereas the length of the most commonly used wavelet filters is less than
10. The hardware complexity of the folded 2-D DWT structure is therefore dominated by the complexity of the
memory component.
Several VLSI architectures have been suggested for the folded 2-D DWT in the last
decade to meet the space and time requirements of real-time applications. These designs differ
in arithmetic complexity, cycle period and throughput rate, but they use almost the same number of
on-chip memory words and an equal number of frame buffer words. The arithmetic complexity
(in terms of multipliers and adders) and memory complexity (in terms of memory words) of the
best available designs are estimated for the 9/7 wavelet filter and image size (512 × 512) (Wu and
Lin, 2005; Xiong et al., 2006; Xiong et al., 2007; Cheng et al., 2007). It is found that the
memory complexity is almost 10^3 times higher than the arithmetic complexity. Consequently, the
memory words per output (MPO) of the existing designs are significantly higher than the
arithmetic complexity per output. Since the logic complexities of arithmetic components and
memory components are widely different, the transistor count is considered to estimate the
arithmetic and memory complexity of the existing structures. We find that the transistor
count of the memory component is, on average, almost 97% of the total transistor count of the folded
designs. Therefore, the memory component of a folded design consumes most of the chip area and
power. However, the existing design approaches focus on optimizing the arithmetic
complexity and cycle period. No specific design has been suggested to address the memory
complexity, which is the major component of the folded 2-D DWT structure.
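The order-of-magnitude claim above can be reproduced with a back-of-the-envelope calculation, assuming the representative figures quoted in this chapter (about 5.5N on-chip words and MN/4 frame buffer words for a 512 × 512 image). The arithmetic-unit count below is an assumed illustrative figure, not that of any particular cited design.

```python
# Memory-versus-arithmetic ratio for a folded 2-D DWT structure.
M = N = 512
on_chip = int(5.5 * N)        # transposition (~2.5N) + temporal (~3N) words
frame = (M * N) // 4          # frame buffer words for the LL subband
memory_words = on_chip + frame   # 68352 words in total

arithmetic_units = 36         # assumed multipliers + adders for a 9/7 design

ratio = memory_words / arithmetic_units   # on the order of 10^3
```

Even with a generous arithmetic-unit count, the memory words outnumber the arithmetic units by roughly three orders of magnitude, which is consistent with the transistor-count observation above.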
Chapter 3: Block-Based Architecture for Folded 2-D DWT Using Line Scanning
The folded 2-D DWT structure is memory-intensive. A few two-input two-output and
four-input four-output designs have been suggested in Xiong et al. (2006), Xiong et al.
(2007), Li et al. (2009) and Lai et al. (2009) for high-throughput implementation of folded 2-D
DWT. The arithmetic complexity of these structures varies proportionally with the
throughput rate, but the memory complexity is almost independent of the throughput rate. For
example, the structure of Xiong et al. (2007) processes four samples per cycle and involves
on-chip memory of nearly 5.5N words, whereas the structure of Xiong et al. (2006)
processes two samples per cycle and also involves on-chip memory of 5.5N words; both
designs involve a frame buffer of MN/4 words. In general, the on-chip and off-chip
memory of a folded design is almost independent of the input block size. Therefore, block
processing has the potential to improve the memory utilization efficiency of a 2-D DWT
structure. Keeping this in view, in this chapter, we present a block processing scheme to
improve the on-chip and off-chip memory utilization efficiency of the line-based folded 2-D
DWT structure. In the proposed block processing scheme, a block of P samples is collected
from each row and the input blocks are processed in row-by-row order, where 2 ≤ P ≤ N and
P is a power of 2. The lifting 2-D DWT computation is partitioned and mapped to a folded
structure to process a block of P samples in each cycle, and a line-based parallel and pipelined
structure is derived using the proposed block processing scheme. The proposed structure has
an interesting feature: its on-chip memory and frame buffer sizes are independent of the input block size.
The structure is easily configured for any block size P, provided P is a power of 2 and 2 ≤ P ≤
N. Since the on-chip memory and frame buffer sizes do not increase with the block size (P),
the proposed structure offers better memory utilization for higher block sizes, and it has less
area-delay product (ADP) and less energy per image (EPI).
Compared with the block-based structure of Xiong et al. (2007), which is the best amongst the
existing line-based folded structures, the proposed structure involves P/4 times more
multipliers and adders and nearly the same on-chip and frame buffer words, and offers P/4 times less
computation time. Note that the structure of Xiong et al. (2007) processes only 4 samples per
cycle, whereas the proposed structure can be configured easily to process P samples in one
cycle, where 2 ≤ P ≤ N. Application-specific integrated circuit (ASIC) synthesis results show
that the core of the proposed structure, using the 9/7 wavelet filter and image size (512 × 512),
involves 49% less ADP and 47% less EPI than those of Xiong et al. (2007) for block size 8.
Due to its throughput scalability, the proposed structure has the flexibility to meet the area-delay-power
constraints of the target application.
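The line-scanning block order described in this chapter can be sketched as an index generator. The function below is an illustrative model of the scan order only, not of the pipeline itself, and its name is hypothetical.

```python
def line_scan_blocks(M, N, P):
    """Yield (row, column-slice) pairs in line-scan block order: a block
    of P samples from each row, rows consumed one after another.
    Assumes P divides N and P is a power of 2 with 2 <= P <= N."""
    assert N % P == 0 and P & (P - 1) == 0 and 2 <= P <= N
    for r in range(M):                 # row-by-row order
        for c in range(0, N, P):       # P samples per block within the row
            yield r, slice(c, c + P)

# Example: a 4 x 8 image with block size P = 4 gives two blocks per row.
order = [(r, s.start) for r, s in line_scan_blocks(4, 8, 4)]
```

Changing P reshapes the blocks but not the amount of image data buffered per row, which mirrors the observation that the on-chip and frame buffer sizes stay independent of the block size.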
Chapter 4: Area-Delay Efficient Structure for Folded 2-D DWT
The line-based lifting 2-D DWT structure involves nearly 5.5N on-chip memory words
irrespective of the input block size: 2.5N words are used by the transposition memory and
3N words by the temporal memory. The transposition memory size could be reduced by using
the parallel data access scheme of Cheng et al. (2007), but the size of the temporal memory is
almost independent of the data access scheme and input block size. Therefore, block processing
has the potential to utilize the temporal memory more efficiently. Keeping this in view, a
block-based 2-D DWT structure using a parallel data access scheme has been suggested by
Tian et al. (2011). The structure of Tian et al. (2011) involves a transposition memory of
N(P + 2)/2 words and a temporal memory of 3N words; its transposition memory size depends
on the block size as well as on the image size. Besides, the structure of Tian et al. (2011) is not
efficient for P ≥ 4. The on-chip memory in Xiong et al. (2007) is almost independent of the
block size, but this structure is not scalable for block size P > 4. With the aforementioned facts
in view, in this chapter, we propose a new data access scheme for the computation of lifting
2-D DWT. Based on the proposed data access scheme, the input data blocks of size P are
prepared from P/2 rows and two adjacent columns, where 4 ≤ P ≤ N and P is a power of 2.
A 2-D DWT structure based on the proposed data access scheme has the following advantages:
The column-processor processes data blocks of the low-pass (Ul) and high-pass (Uh)
intermediate matrices separately.
The row-processor generates the data blocks of Ul and Uh in transposed form; the
column-processor, therefore, consumes the data blocks immediately in the following
cycle.
From the dependency graph (DG) of the N-point 1-D DWT computation, 1-D and 2-D
systolic arrays are derived to calculate the DWT coefficients of two input samples and of a block of
input samples in one cycle. We propose a block processing structure for 2-D DWT with
regular data flow and reduced on-chip memory. The proposed structure involves a transposition
memory of N words and a temporal memory of 3N words; its on-chip memory is independent
of the input block size. Unlike the existing DWT structures, it does not require control signals.
Compared with the block-based structure of Tian et al. (2011), the
proposed structure of Mohanty et al. (2012) involves 3P/4 times fewer multipliers, the same number
of adders, NP/2 fewer on-chip memory words and the same computation time (CT).
ASIC synthesis results show that the core of the proposed structure, using the 9/7 wavelet filter
and image size (512 × 512), involves 50% and 63% less ADP and 35% and 51% less EPI than
the block-based structure of Tian et al. (2011) for block sizes 4 and 8, respectively. The
proposed structure is fully scalable for higher input block sizes. It offers better ADP and EPI
for higher block sizes, which is very significant for high-throughput implementation.
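The proposed data access scheme, in which each input block of P samples is gathered from P/2 rows and two adjacent columns, can be modelled with a short sketch. The traversal order used here is an assumption for illustration; the exact schedule is fixed by the architecture.

```python
def parallel_access_blocks(img, P):
    """Yield input blocks of P samples, each taken from P/2 consecutive
    rows and two adjacent columns; assumes P is a power of 2, P >= 4,
    and the image height is a multiple of P/2."""
    M, N = len(img), len(img[0])
    assert P >= 4 and P & (P - 1) == 0 and M % (P // 2) == 0
    for r0 in range(0, M, P // 2):       # band of P/2 rows
        for c0 in range(0, N, 2):        # pair of adjacent columns
            yield [img[r0 + r][c0 + c]
                   for r in range(P // 2) for c in range(2)]

# Example: a 4 x 4 image with entries 0..15 and block size P = 4.
blocks = list(parallel_access_blocks(
    [[4 * r + c for c in range(4)] for r in range(4)], P=4))
```

Because each block already spans several rows of a column pair, the column-processor can consume the row-processor output without a separate transposition step, which is the property the reduced transposition memory relies on.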
Chapter 5: Parallel Architecture for Multilevel 2-D DWT
The folded structure for multilevel 2-D DWT computation requires a transposition memory,
a temporal memory and a frame memory. Block-based folded structures utilize the on-chip
memory and frame buffer more efficiently than the two-input two-output folded structures.
Consequently, the area-time complexity of the block-based folded structure reduces significantly
for higher block sizes. However, the size of the frame buffer of the block-based folded
structure still remains a design problem for single-chip implementation of folded 2-D DWT
structures. To overcome this problem, parallel structures have recently been suggested by
Mohanty and Meher (2011) and Mohanty and Meher (2013) for multilevel 2-D DWT. A
parallel structure computes the DWT levels concurrently and, unlike the folded structure, it does
not require a frame buffer. The existing parallel structures have some merits and demerits. The input
buffer design of the line-based parallel structure of Mohanty and Meher (2011) is simple, but
it requires relatively more on-chip memory words. The convolution-based structure of
Mohanty and Meher (2013) requires fewer on-chip memory words than the lifting-based
structure of Mohanty and Meher (2011), but it involves some overhead cost (in terms of
arithmetic resources and input buffer words) due to overlapped input blocks. Neither of the existing
design schemes offers an efficient parallel structure. To overcome the problems of the
existing parallel designs, we propose a novel block processing scheme for generating
continuous input blocks for the succeeding processing units of the parallel structure to achieve
100% HUE without block folding.
In the proposed block processing scheme, a block of P samples is collected by two
methods. In method-1, a block of P samples is taken from S consecutive rows and Q columns
of the input matrix, whereas in method-2 a block of P samples is taken from Q consecutive rows
and S columns of the input matrix. The input block size (P) is calculated using the formula
P = QS, where Q = 2^(J+r), S = 2^(J-1), J is the DWT level and r is a non-negative integer.
The minimum input block size of the processing unit for a given DWT level is obtained by
substituting r = 0 in this formula, whereas values r > 0 give higher input block sizes for the same
processing unit. A parallel structure based on the proposed block processing scheme has the following
advantages:
Input blocks are processed by the row-processor in a specified order such that the low-pass
and high-pass intermediate matrices are processed by the column-processor without
data transposition.
The low-pass and high-pass intermediate matrices are processed by the column-processor
using separate computing blocks, such that the data blocks of the four subbands are obtained
in parallel.
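The block-size relation P = QS, with Q = 2^(J+r) and S = 2^(J-1) for DWT level J, can be tabulated with a short sketch; the function name and the tabulation are purely illustrative.

```python
def block_size(J, r=0):
    """Input block size P = QS for DWT level J, with Q = 2^(J+r) and
    S = 2^(J-1); r = 0 gives the minimum block size for that level."""
    Q = 2 ** (J + r)
    S = 2 ** (J - 1)
    return Q * S

# Minimum block sizes for the processing units of the first three levels.
minimum = {J: block_size(J) for J in (1, 2, 3)}   # {1: 2, 2: 8, 3: 32}
```

Larger values of r scale up the block consumed per cycle by the same processing unit, which is how higher throughput rates are configured.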
Based on the proposed block processing scheme, we propose two parallel structures,
PAR-1 (using method-1) and PAR-2 (using method-2), for the computation of multilevel lifting
2-D DWT without using a frame buffer. Neither of the proposed structures requires an
interfacing unit between DWT levels or an overlapped memory, as required by the existing
parallel structures of Mohanty and Meher (2011) and Mohanty and Meher (2013). However,
the proposed parallel structures require marginally more on-chip memory words than the
parallel structure of Mohanty and Meher (2013) for the same DWT levels.
Compared with the parallel structure of Mohanty and Meher (2011), the proposed
parallel structures involve the same number of multipliers and adders for the same input
block size and the same CT. But the proposed PAR-1 and PAR-2 involve
nearly (3N + 3N/2^J) and 3N fewer on-chip memory words, respectively, than the parallel structure of
Mohanty and Meher (2011). ASIC synthesis results show that the core of the proposed PAR-1
of Mohanty and Mahajan (2013a), for 2-level DWT and image size (512 × 512), involves 35%
less ADP and 29% less EPI than those of the similar existing parallel structure of Mohanty and
Meher (2011). Compared with the parallel structure of Mohanty and Meher (2011), the proposed
PAR-2 of Mohanty and Mahajan (2013b), for 2-level DWT and image size (512 × 512),
involves 41% less ADP and 36% less EPI. The proposed structures are fully scalable for
higher input block sizes as well as higher DWT levels. Moreover, the proposed structures are
highly regular, modular and systolic.
Chapter 6: Low-Complexity Design for Folded 2-D DWT
Portable and hand-held devices are enabled with multimedia applications. They
use data compression algorithms to store and transmit multidimensional signals. These
devices are resource-constrained and most often need real-time processing to deliver better
performance. Therefore, it is essential to implement DWT in low-complexity hardware to
meet the space-time requirements of portable devices. The multiplier is the most complex hardware
component in an arithmetic unit, and it consumes a major part of the chip area and power.
Therefore, various low-complexity design schemes have been proposed for multiplier-less
implementation of convolution-based or lifting-based DWT. Distributed arithmetic (DA)
based techniques are popular for their high-throughput processing capability and regular
structure (White, 1989). We propose a DA-based 1-D DWT structure using a carry-save full-adder
(CSFA) and a carry-save accumulator (CSAC). The proposed structure involves
significantly fewer logic resources, and it has a shorter bit-cycle period than the similar existing
multiplier-less structure. ASIC simulation shows that the proposed DA-based 1-D DWT
structure involves 11% less area and 35% less cycle period (CP) compared with the DA-based structure of
Mohanty and Meher (2009).
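The DA principle can be illustrated behaviourally: the inner product of a filter with its input window is computed by bit-serial look-up and shift-accumulation instead of multipliers (White, 1989). The plain integer accumulator below is an illustrative stand-in for the carry-save hardware of the proposed structure, assuming integer coefficients and unsigned B-bit inputs to keep the sketch short.

```python
def da_inner_product(coeffs, xs, bits):
    """Compute sum(c*x) without multipliers: one LUT access and one
    shift-accumulate per input bit plane."""
    K = len(coeffs)
    # LUT: partial sum of coefficients for every K-bit input pattern.
    lut = [sum(c for c, sel in
               zip(coeffs, ((pattern >> k) & 1 for k in range(K))) if sel)
           for pattern in range(1 << K)]
    acc = 0
    for b in range(bits):                 # bit-serial, LSB first
        pattern = sum(((x >> b) & 1) << k for k, x in enumerate(xs))
        acc += lut[pattern] << b          # shift-accumulate
    return acc

# Matches the direct inner product:
assert da_inner_product([3, -1, 2, 5], [4, 7, 1, 6], bits=3) \
       == 3*4 - 1*7 + 2*1 + 5*6
```

The hardware replaces the `+=` accumulator with carry-save addition so that no carry propagates inside the bit-cycle loop, which is what shortens the bit-cycle period.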
We propose DA-based folded 2-D DWT structures using line-based data access
(architecture-1) and parallel data access (architecture-2). The proposed DA-based 1-D DWT
structure is used as the row- and column-processor of the 2-D DWT folded structures. Both the
proposed DA-based structures for 2-D DWT involve the same logic components, but they differ
in on-chip memory size and frame buffer size. ASIC simulation shows that the proposed
architecture-2 offers 43% area saving over the similar multiplier-based structure of Meher et al. (2008).
The proposed DA-based structures have significantly less area complexity than the existing designs.
The proposed designs are, therefore, very useful for low-complexity realization of 1-D and 2-D DWT
for resource-constrained applications.
Chapter 7: Conclusion and Future Scope
Area-delay-efficient hardware realization of the multilevel 2-D DWT is of great
practical interest for low-power, resource-constrained mobile and portable devices. Keeping
this in view, several designs have been suggested by various researchers during the last
decade. In Chapter 2, we made a complexity analysis of the existing folded structures and found
that memory complexity constitutes 97% of the overall hardware complexity of the folded structure.
However, the existing design strategies focus on optimization of arithmetic complexity
and reduction of cycle period; no specific memory-centric design has been proposed to
address the memory complexity of the multilevel 2-D DWT structure. Therefore, in this thesis
work, we have proposed memory-centric design approaches and efficient hardware designs
for multilevel 2-D DWT.
In Chapter 3, a block processing scheme and a parallel-pipelined architecture based on
line-based data access are proposed with improved memory utilization efficiency. The architecture
has better memory utilization efficiency than the existing structures and offers less ADP and EPI for
higher block sizes. In Chapter 4, a novel data access scheme and a systolic architecture are
proposed for area-delay-efficient implementation of folded 2-D DWT. In Chapter 5, a block
processing scheme for generating continuous input blocks for the succeeding PUs of a parallel
multilevel 2-D DWT structure is proposed. Using this block processing scheme, two parallel
architectures are proposed for area-delay-efficient realization of multilevel 2-D DWT. Neither of
the proposed parallel structures requires an interfacing unit between DWT levels or an
overlapping memory, unlike the existing parallel structures. In Chapter 6, we proposed a DA-based
structure for 1-D DWT using a carry-save full-adder and a carry-save accumulator. The
proposed DA-based 1-D DWT structure involves significantly fewer logic resources and has a
shorter bit-cycle period than the existing similar DA-based structure. The proposed DA-based 1-D
DWT structure is used as a building block to derive 2-D DWT structures based on the
line-based (architecture-1) and parallel (architecture-2) data access schemes. Both the
proposed DA-based structures for 2-D DWT involve the same logic components, but
architecture-2 involves significantly less on-chip memory and is,
therefore, suitable for low-complexity realization of 2-D DWT.
The proposed folded and parallel structures are regular, modular and easily scalable
for higher throughput rates. Interestingly, the ADP and EPI of the proposed structures are lower for
higher block sizes, which is significant for high-throughput implementation. Besides, the
proposed structures are individually the best amongst the existing structures, and they have
specific features which could be very useful for hardware implementation of 2-D DWT for
different types of image processing applications. However, there are a few shortcomings in the
proposed designs, such as critical path delay, fixed-point error and interconnect delay. These issues
will be addressed in our future work.
List of Publications
1. Mahajan, A. and Mohanty, B.K. (2010a), “Bit serial design for VLSI implementation of
1-D discrete wavelet transform using 9/7 filters based on distributed arithmetic,” in Proc.
CSI National Conference on Education and Research (ConfER 2010), JUET, Guna, pp.
439-449.
2. Mahajan, A. and Mohanty, B.K. (2010b), “Efficient VLSI architecture for
implementation of 1-D discrete wavelet transform based on distributed arithmetic,” in
Proc. 10th IEEE Asia Pacific Conference on Circuits and Systems (APCCAS 2010), Kuala
Lumpur, Malaysia, pp. 1195-1198.
3. Mohanty, B.K., Mahajan, A. and Meher, P.K. (2012), “Area and power-efficient
architecture for high-throughput implementation of lifting 2-D DWT,” IEEE Transactions
on Circuits and Systems–II, Express Briefs, vol. 59, no. 7, pp. 434-438.
4. Mohanty, B.K. and Mahajan, A. (2013a), “Efficient-block-processing parallel architecture
for multilevel lifting 2-D DWT,” ASP Journal of Low Power Electronics, vol. 9, no. 1,
pp. 37-44.
5. Mohanty, B.K., and Mahajan, A. (2013b), “Scheduling-scheme and parallel structure for
multilevel lifting 2-D DWT without using frame-buffer,” IET Circuits, Devices and
Systems, doi: 10.1049/iet-cds.2012.0398.
References
1. Chrysafis, C. and Ortega, A. (2000), “Line-based, reduced memory, wavelet image compression,”
IEEE Transactions on Image Processing, vol. 9, no. 3, pp. 378–389.
2. Cheng, C.-C., Huang, C.-T., Cheng, C.-Y., Lian, C.-J. and Chen, L.-G. (2007), “On-chip
memory optimization scheme for VLSI implementation of line-based two dimensional
discrete wavelet transform,” IEEE Transactions on Circuits and Systems for Video Technology,
vol. 17, no. 7, pp. 814–822.
3. Daubechies, I. (1992), Ten Lectures on Wavelets, Philadelphia: Society for Industrial and Applied
Mathematics (SIAM).
4. Huang, C.-T., Tseng, P.-C. and Chen, L.-G. (2005), “Generic RAM-based architectures for two-
dimensional discrete wavelet transform with line-based method,” IEEE Transactions on Circuits
and Systems for Video Technology, vol. 15, no. 7, pp. 910-919.
5. Lewis, A.S. and Knowles, G. (1992), “Image compression using the 2-D wavelet
transform,” IEEE Transactions on Image Processing, vol. 1, no. 2, pp. 244–250.
6. Lai, Y.-K., Chen, L.-F. and Shih, Y.-C. (2009), “A high-performance and memory efficient VLSI
architecture with parallel scanning method for 2-D lifting-based discrete wavelet transform,”
IEEE Transactions on Consumer Electronics, vol. 55, no. 2, pp. 400-407.
7. Li, W.-M., Hsia, C.-H. and Chiang, J.-S. (2009), “Memory-efficient architecture of 2-D dual-
mode discrete wavelet transform using lifting scheme for motion-JPEG 2000,” in Proc. IEEE
International Symposium on Circuit and Systems (ISCAS), pp. 750–753.
8. Mallat, S. G. (1989), “A theory for multiresolution signal decomposition: The wavelet
representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 7,
pp. 674–693.
9. Meyer, Y. (1993), Wavelets: Algorithms and Applications, Philadelphia: Society for Industrial
and Applied Mathematics (SIAM).
10. Meher, P.K., Mohanty, B.K. and Patra, J.C. (2008), “Hardware-efficient systolic like
modular design for two-dimensional discrete wavelet transform,” IEEE Transactions on
Circuits and Systems–II, Express Briefs, vol. 55, no. 2, pp. 151-154.
11. Mohanty, B.K. and Meher, P.K. (2009), “Efficient multiplierless designs for 1-D DWT using
9/7 filters based on distributed arithmetic,” in Proc. IEEE International Symposium on Integrated
Circuits (ISIC 2009), pp. 364-367.
12. Mohanty, B.K. and Meher, P.K. (2011), “Memory-efficient modular VLSI architecture for high-throughput
and low-latency implementation of multilevel lifting 2-D DWT,” IEEE Transactions
on Signal Processing, vol. 59, no. 5, pp. 2072-2084.
13. Mohanty, B.K. and Meher, P. K. (2013), “Memory-efficient high-speed convolution-based generic
structure for multilevel 2-D DWT,” IEEE Transactions on Circuits and Systems for Video
Technology, vol. 23, no. 2, pp. 353-363.
14. Sweldens, W. (1996), “The lifting scheme: A custom-design construction of biorthogonal
wavelets,” Applied and Computational Harmonic Analysis, vol. 3, no. 2, pp. 186–200.
15. Skodras, A., Christopoulos, C. and Ebrahimi, T. (2001), “The JPEG 2000 still image
compression standard,” IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 36-58.
16. Tian, X., Wu, L., Tan, Y.H., and Tian, J.W. (2011), “Efficient multi-input/multi-output VLSI
architecture for two-dimensional lifting-based discrete wavelet transform,” IEEE Transactions on
Computers, vol. 60, no. 8, pp. 1207-1211.
17. Vetterli, M. and Kovacevic, J. (1995), Wavelets and Subband Coding, Prentice Hall PTR,
Englewood Cliffs, New Jersey.
18. Vishwanath, M. (1994), “The recursive pyramid algorithm for the discrete wavelet transform,”
IEEE Transactions on Signal Processing, vol. 42, no. 3, pp. 673-676.
19. White, S. A. (1989), “Applications of the distributed arithmetic to digital signal processing: A
tutorial review,” IEEE ASSP Magazine, vol. 6, no. 3, pp. 5–11.
20. Wu, P.-C. and Chen, L.-G. (2001), “An efficient architecture for two-dimensional discrete
wavelet transform,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 11,
no. 4, pp. 536-545.
21. Wu, B.-F. and Lin, C.-F. (2005), “A high-performance and memory-efficient pipeline architecture
for the 5/3 and 9/7 discrete wavelet transform of JPEG2000 codec,” IEEE Transactions on
Circuits and Systems for Video Technology, vol. 15, no. 12, pp. 1615-1628.
22. Xiong, C.-Y., Tian, J.-W. and Liu, J. (2006), “A note on ‘Flipping structure: An efficient VLSI
architecture for lifting-based discrete wavelet transform’,” IEEE Transactions on Signal
Processing, vol. 54, no. 5, pp. 1910–1916.
23. Xiong, C.-Y., Tian, J.-W. and Liu, J. (2007), “Efficient architecture for two dimensional discrete
wavelet transform using lifting scheme,” IEEE Transactions on Image Processing, vol. 16, no.
3, pp. 607–614.