MEMORY-EFFICIENT CONCURRENT VLSI ARCHITECTURES FOR TWO-DIMENSIONAL DISCRETE
WAVELET TRANSFORM
Synopsis
of
Ph.D. thesis
By
Anurag Mahajan (Enrollment Number: 07P01005G)
Under the Guidance of
Prof. B.K. Mohanty
Department of Electronics and Communication Engineering
JAYPEE UNIVERSITY OF ENGINEERING AND TECHNOLOGY, GUNA (M.P.) - INDIA
August 2013
Preface
Discrete wavelet transform (DWT) is a mathematical technique that provides a new method
for signal processing. It decomposes a signal in the time domain using dilated/contracted
and translated versions of a single basis function, known as the prototype wavelet (Mallat, 1989;
Daubechies, 1992; Meyer, 1993; Vetterli and Kovacevic, 1995). DWT offers a wide variety
of useful features over other unitary transforms such as the discrete Fourier transform (DFT),
discrete cosine transform (DCT) and discrete sine transform (DST). Two-dimensional (2-D)
DWT has been applied in image compression, image analysis, image watermarking, etc.
(Lewis and Knowles, 1992). Currently, 2-D DWT is used in the JPEG 2000 image compression
standard (Skodras et al., 2001). The 2-D DWT is highly computation-intensive, and many of
its applications need real-time processing to deliver better performance. The 2-D DWT is
therefore implemented in very large scale integration (VLSI) systems to meet the space-time
requirements of various real-time applications. Several design schemes have been suggested
for efficient implementation of 2-D DWT in VLSI systems.
The hardware complexity of a multilevel 2-D DWT structure is broadly divided into
two parts: (i) arithmetic and (ii) memory. The arithmetic component comprises
multipliers and adders, and its complexity depends on the wavelet filter size (k). The memory
component comprises a line buffer and a frame buffer. The memory complexity depends on
the image size (MN), where M and N represent the height and width of the input image. Small
filters (k < 10) are used in DWT, while the standard image size is (512 × 512). Therefore, the
complexity of a multilevel 2-D DWT structure is dominated by the complexity of the memory
component. Most of the existing design strategies focus on arithmetic complexity, cycle
period and throughput rate; no specific memory-centric design method has been proposed
for multilevel 2-D DWT. The objective of the proposed thesis work is to explore memory-centric
design approaches and to propose area-delay-power-efficient hardware designs for
implementation of multilevel 2-D DWT.
Objective
The thesis entitled “Memory-Efficient Concurrent VLSI Architectures for Two-Dimensional
Discrete Wavelet Transform” has the following aims and objectives:
To improve the memory utilization efficiency of the 2-D DWT structure.
To reduce the transposition memory size.
To eliminate the frame buffer.
To reduce arithmetic complexity using a low-complexity design scheme.
The summary of the thesis is given below:
Chapter 1: Introduction
In this chapter, the computation schemes of one-dimensional (1-D) and 2-D DWT are
discussed. 1-D DWT can be performed using the convolution scheme or the lifting scheme proposed
by Sweldens (1996). The convolution scheme involves more arithmetic resources and memory
space than the lifting scheme. However, the lifting scheme is suitable only for bi-orthogonal
wavelet filters. The 2-D DWT computation is performed by two approaches: (i) separable and
(ii) non-separable. In the non-separable approach, the row and column transforms of 2-D DWT are
performed simultaneously using 2-D wavelet filters. In the separable approach, they are
performed separately using 1-D DWT. The separable approach is more popular as it demands
less computation, but it requires a transposition memory between the row and column transforms.
Multilevel 2-D DWT computation can be performed using the
pyramid algorithm (PA), the recursive pyramid algorithm (RPA) of Vishwanath (1994) or the
folded scheme of Wu and Chen (2001). Due to its design simplicity, 100% hardware utilization
efficiency (HUE) and lower arithmetic resource requirement, the folded scheme is more popular
than the PA and RPA for hardware realization. Keeping this in view, several architectures
based on the folded scheme have been proposed for efficient implementation of 2-D DWT.
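The two computation choices discussed above, lifting for the 1-D transform and the separable approach for the 2-D transform, can be illustrated with a short behavioural sketch. It assumes the integer 5/3 (Le Gall) wavelet of JPEG 2000; the function names, boundary handling and the specific filter are illustrative choices, not the structures of the thesis.

```python
def lift_53(x):
    """One level of 1-D lifting DWT with the integer 5/3 (Le Gall)
    filter; x must have even length. Returns (low-pass, high-pass)."""
    n = len(x)
    # Predict step: high-pass coefficients from the odd samples
    # (symmetric extension at the right boundary via min()).
    d = [x[2*i + 1] - (x[2*i] + x[min(2*i + 2, n - 2)]) // 2
         for i in range(n // 2)]
    # Update step: low-pass coefficients from the even samples
    # (symmetric extension at the left boundary via max()).
    s = [x[2*i] + (d[max(i - 1, 0)] + d[i] + 2) // 4
         for i in range(n // 2)]
    return s, d

def dwt2_separable(img):
    """One separable 2-D DWT level: row transform first, producing the
    intermediate matrices Ul and Uh, then column transform of each."""
    ul, uh = zip(*(lift_53(row) for row in img))          # row-processor
    def col_transform(u):
        lo, hi = zip(*(lift_53(col) for col in zip(*u)))  # column-processor
        return ([list(r) for r in zip(*lo)],
                [list(r) for r in zip(*hi)])
    ll, lh = col_transform(ul)   # subbands from Ul
    hl, hh = col_transform(uh)   # subbands from Uh
    return ll, lh, hl, hh
```

A constant image, for example, yields a constant LL subband and all-zero detail subbands, which is a quick sanity check of the predict/update steps.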
Chapter 2: Hardware Complexity Analysis
Folded 2-D DWT computation is performed level by level using one separable 2-D
DWT unit, referred to as the processing unit (PU), and one frame buffer. The low-low subband of the
current DWT level is stored in the frame buffer to compute the higher DWT levels. The PU
comprises one row-processor (to perform 1-D DWT computation row-wise), one column-processor
(to perform 1-D DWT computation column-wise), one transposition memory and
one temporal memory. The transposition memory stores the intermediate low-pass (Ul)
and high-pass (Uh) matrices, while the temporal memory is used by the column-processor to store the
partial results of the column DWT. The frame memory may be either on-chip or off-chip, while the
other two are usually on-chip memories.
The arithmetic complexity of the folded structure depends on the DWT computation
scheme and the filter length. The size of the frame buffer is MN/4 words, which is independent
of the data access scheme, the type of DWT computation scheme (convolution or lifting) and the length
of the wavelet filter. The temporal memory size depends on the DWT computation scheme and the wavelet
filter length. For convolution-based 2-D DWT, the temporal memory size is zero when a direct-form
FIR structure is used for the computation of 1-D DWT. In the case of lifting-based 2-D DWT,
the size of the temporal memory depends on the number of lifting steps of the bi-orthogonal wavelet
filter. The transposition memory size mainly depends on the data access scheme adopted to feed
the 2-D input samples and on the DWT computation scheme (convolution or lifting). In general, the
sizes of the transposition memory and temporal memory are some multiple of the image width,
while the size of the frame memory is some multiple of the image size. On the other hand, the
complexity of the arithmetic component depends on the size of the wavelet filter. The standard
image size is (512 × 512), whereas the length of the most commonly used wavelet filters is less than
10. The hardware complexity of the folded 2-D DWT structure is therefore dominated by the complexity of the
memory component.
Several VLSI architectures have been suggested for the folded 2-D DWT in the last
decade to meet the space and time requirements of real-time applications. These designs differ
in arithmetic complexity, cycle period and throughput rate, but they use almost the same number of
on-chip memory words and an equal number of frame buffer words. The arithmetic complexity
(in terms of multipliers and adders) and memory complexity (in terms of memory words) of the
best available designs are estimated for the 9/7 wavelet filter and image size (512 × 512) (Wu and
Lin, 2005; Xiong et al., 2006; Xiong et al., 2007; Cheng et al., 2007). It is found that the
memory complexity is almost 10^3 times higher than the arithmetic complexity. Consequently, the
memory words per output (MPO) of the existing designs are significantly higher than the
arithmetic complexity per output. Since the logic complexities of arithmetic components and
memory components are widely different, the transistor count is considered to estimate the
arithmetic and memory complexity of the existing structures. We find that the transistor
count of the memory component is, on average, almost 97% of the total transistor count of the folded
designs. Therefore, the memory component of a folded design consumes most of the chip area and
power. However, the existing design approaches focus on optimizing the arithmetic
complexity and cycle period. No specific design has been suggested to address the memory
complexity, which is the major component of the folded 2-D DWT structure.
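The order-of-magnitude claim above can be reproduced with a back-of-the-envelope calculation, assuming the representative figures quoted in this chapter (about 5.5N on-chip words and MN/4 frame buffer words for a 512 × 512 image). The arithmetic-unit count below is an assumed illustrative figure, not that of any particular cited design.

```python
# Memory-versus-arithmetic ratio for a folded 2-D DWT structure.
M = N = 512
on_chip = int(5.5 * N)        # transposition (~2.5N) + temporal (~3N) words
frame = (M * N) // 4          # frame buffer words for the LL subband
memory_words = on_chip + frame   # 68352 words in total

arithmetic_units = 36         # assumed multipliers + adders for a 9/7 design

ratio = memory_words / arithmetic_units   # on the order of 10^3
```

Even with a generous arithmetic-unit count, the memory words outnumber the arithmetic units by roughly three orders of magnitude, which is consistent with the transistor-count observation above.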
Chapter 3: Block-Based Architecture for Folded 2-D DWT Using Line Scanning
The folded 2-D DWT structure is memory-intensive. A few two-input two-output and
four-input four-output designs have been suggested in Xiong et al. (2006), Xiong et al.
(2007), Li et al. (2009) and Lai et al. (2009) for high-throughput implementation of folded 2-D
DWT. The arithmetic complexity of these structures varies proportionally with the
throughput rate, but the memory complexity is almost independent of the throughput rate. For
example, the structure of Xiong et al. (2007) processes four samples per cycle and involves
on-chip memory of nearly 5.5N words, whereas the structure of Xiong et al. (2006)
processes two samples per cycle and also involves on-chip memory of 5.5N words; both
designs involve a frame buffer of MN/4 words. In general, the on-chip and off-chip
memory of a folded design is almost independent of the input block size. Therefore, block
processing has the potential to improve the memory utilization efficiency of a 2-D DWT
structure. Keeping this in view, in this chapter, we present a block processing scheme to
improve the on-chip and off-chip memory utilization efficiency of the line-based folded 2-D
DWT structure. In the proposed block processing scheme, a block of P samples is collected
from each row and the input blocks are processed in row-by-row order, where 2 ≤ P ≤ N and
P is a power of 2. The lifting 2-D DWT computation is partitioned and mapped to a folded
structure to process a block of P samples in each cycle, and a line-based parallel and pipelined
structure is derived using the proposed block processing scheme. The proposed structure has
an interesting feature: its on-chip memory and frame buffer sizes are independent of the input block size.
The structure is easily configured for any block size P, provided P is a power of 2 and 2 ≤ P ≤
N. Since the on-chip memory and frame buffer sizes do not increase with the block size (P),
the proposed structure offers better memory utilization for higher block sizes, and it has less
area-delay product (ADP) and less energy per image (EPI).
Compared with the block-based structure of Xiong et al. (2007), which is the best amongst the
existing line-based folded structures, the proposed structure involves P/4 times more
multipliers and adders and nearly the same on-chip and frame buffer words, and offers P/4 times less
computation time. Note that the structure of Xiong et al. (2007) processes only 4 samples per
cycle, whereas the proposed structure can be configured easily to process P samples in one
cycle, where 2 ≤ P ≤ N. Application-specific integrated circuit (ASIC) synthesis results show
that the core of the proposed structure, using the 9/7 wavelet filter and image size (512 × 512),
involves 49% less ADP and 47% less EPI than those of Xiong et al. (2007) for block size 8.
Due to its throughput scalability, the proposed structure has the flexibility to meet the area-delay-power
constraints of the target application.
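The line-scanning block order described in this chapter can be sketched as an index generator. The function below is an illustrative model of the scan order only, not of the pipeline itself, and its name is hypothetical.

```python
def line_scan_blocks(M, N, P):
    """Yield (row, column-slice) pairs in line-scan block order: a block
    of P samples from each row, rows consumed one after another.
    Assumes P divides N and P is a power of 2 with 2 <= P <= N."""
    assert N % P == 0 and P & (P - 1) == 0 and 2 <= P <= N
    for r in range(M):                 # row-by-row order
        for c in range(0, N, P):       # P samples per block within the row
            yield r, slice(c, c + P)

# Example: a 4 x 8 image with block size P = 4 gives two blocks per row.
order = [(r, s.start) for r, s in line_scan_blocks(4, 8, 4)]
```

Changing P reshapes the blocks but not the amount of image data buffered per row, which mirrors the observation that the on-chip and frame buffer sizes stay independent of the block size.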
Chapter 4: Area-Delay Efficient Structure for Folded 2-D DWT
The line-based lifting 2-D DWT structure involves nearly 5.5N on-chip memory words
irrespective of the input block size: 2.5N words are used by the transposition memory and
3N words by the temporal memory. The transposition memory size could be reduced by using
the parallel data access scheme of Cheng et al. (2007), but the size of the temporal memory is
almost independent of the data access scheme and input block size. Therefore, block processing
has the potential to utilize the temporal memory more efficiently. Keeping this in view, a
block-based 2-D DWT structure using a parallel data access scheme has been suggested by
Tian et al. (2011). The structure of Tian et al. (2011) involves a transposition memory of
N(P + 2)/2 words and a temporal memory of 3N words; its transposition memory size depends
on the block size as well as on the image size. Besides, the structure of Tian et al. (2011) is not
efficient for P ≥ 4. The on-chip memory in Xiong et al. (2007) is almost independent of the
block size, but this structure is not scalable for block size P > 4. With the aforementioned facts
in view, in this chapter, we propose a new data access scheme for the computation of lifting
2-D DWT. Based on the proposed data access scheme, the input data blocks of size P are
prepared from P/2 rows and two adjacent columns, where 4 ≤ P ≤ N and P is a power of 2.
A 2-D DWT structure based on the proposed data access scheme has the following advantages:
The column-processor processes data blocks of the low-pass (Ul) and high-pass (Uh)
intermediate matrices separately.
The row-processor generates the data blocks of Ul and Uh in transposed form; the
column-processor, therefore, consumes the data blocks immediately in the following
cycle.
From the dependency graph (DG) of the N-point 1-D DWT computation, 1-D and 2-D
systolic arrays are derived to calculate the DWT coefficients of two input samples and of a block of
input samples in one cycle. We propose a block processing structure for 2-D DWT with
regular data flow and reduced on-chip memory. The proposed structure involves a transposition
memory of N words and a temporal memory of 3N words; its on-chip memory is independent
of the input block size. Unlike the existing DWT structures, it does not require control signals.
Compared with the block-based structure of Tian et al. (2011), the
proposed structure of Mohanty et al. (2012) involves 3P/4 times fewer multipliers, the same number
of adders, NP/2 fewer on-chip memory words and the same computation time (CT).
ASIC synthesis results show that the core of the proposed structure, using the 9/7 wavelet filter
and image size (512 × 512), involves 50% and 63% less ADP and 35% and 51% less EPI than
the block-based structure of Tian et al. (2011) for block sizes 4 and 8, respectively. The
proposed structure is fully scalable for higher input block sizes. It offers better ADP and EPI
for higher block sizes, which is very significant for high-throughput implementation.
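The proposed data access scheme, in which each input block of P samples is gathered from P/2 rows and two adjacent columns, can be modelled with a short sketch. The traversal order used here is an assumption for illustration; the exact schedule is fixed by the architecture.

```python
def parallel_access_blocks(img, P):
    """Yield input blocks of P samples, each taken from P/2 consecutive
    rows and two adjacent columns; assumes P is a power of 2, P >= 4,
    and the image height is a multiple of P/2."""
    M, N = len(img), len(img[0])
    assert P >= 4 and P & (P - 1) == 0 and M % (P // 2) == 0
    for r0 in range(0, M, P // 2):       # band of P/2 rows
        for c0 in range(0, N, 2):        # pair of adjacent columns
            yield [img[r0 + r][c0 + c]
                   for r in range(P // 2) for c in range(2)]

# Example: a 4 x 4 image with entries 0..15 and block size P = 4.
blocks = list(parallel_access_blocks(
    [[4 * r + c for c in range(4)] for r in range(4)], P=4))
```

Because each block already spans several rows of a column pair, the column-processor can consume the row-processor output without a separate transposition step, which is the property the reduced transposition memory relies on.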
Chapter 5: Parallel Architecture for Multilevel 2-D DWT
The folded structure for multilevel 2-D DWT computation requires a transposition memory,
a temporal memory and a frame memory. Block-based folded structures utilize the on-chip
memory and frame buffer more efficiently than the two-input two-output folded structures.
Consequently, the area-time complexity of the block-based folded structure reduces significantly
for higher block sizes. However, the size of the frame buffer of the block-based folded
structure still remains a design problem for single-chip implementation of folded 2-D DWT
structures. To overcome this problem, parallel structures have recently been suggested by
Mohanty and Meher (2011) and Mohanty and Meher (2013) for multilevel 2-D DWT. A
parallel structure computes the DWT levels concurrently and, unlike the folded structure, it does
not require a frame buffer. The existing parallel structures have some merits and demerits. The input
buffer design of the line-based parallel structure of Mohanty and Meher (2011) is simple, but
it requires relatively more on-chip memory words. The convolution-based structure of
Mohanty and Meher (2013) requires fewer on-chip memory words than the lifting-based
structure of Mohanty and Meher (2011), but it involves some overhead cost (in terms of
arithmetic resources and input buffer words) due to overlapped input blocks. Neither of the existing
design schemes offers an efficient parallel structure. To overcome the problems of the
existing parallel designs, we propose a novel block processing scheme for generating
continuous input blocks for the succeeding processing units of the parallel structure to achieve
100% HUE without block folding.
In the proposed block processing scheme, a block of P samples is collected by two
methods. In method-1, a block of P samples is taken from S consecutive rows and Q columns
of the input matrix, whereas in method-2 a block of P samples is taken from Q consecutive rows
and S columns of the input matrix. The input block size (P) is calculated using the formula
P = QS, where Q = 2^(J+r), S = 2^(J-1), J is the DWT level and r is a non-negative integer.
The minimum input block size of the processing unit for a given DWT level is obtained by
substituting r = 0 in this formula, whereas values r > 0 give higher input block sizes for the same
processing unit. A parallel structure based on the proposed block processing scheme has the following
advantages:
Input blocks are processed by the row-processor in a specified order such that the low-pass
and high-pass intermediate matrices are processed by the column-processor without
data transposition.
The low-pass and high-pass intermediate matrices are processed by the column-processor
using separate computing blocks, such that the data blocks of the four subbands are obtained
in parallel.
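The block-size relation P = QS, with Q = 2^(J+r) and S = 2^(J-1) for DWT level J, can be tabulated with a short sketch; the function name and the tabulation are purely illustrative.

```python
def block_size(J, r=0):
    """Input block size P = QS for DWT level J, with Q = 2^(J+r) and
    S = 2^(J-1); r = 0 gives the minimum block size for that level."""
    Q = 2 ** (J + r)
    S = 2 ** (J - 1)
    return Q * S

# Minimum block sizes for the processing units of the first three levels.
minimum = {J: block_size(J) for J in (1, 2, 3)}   # {1: 2, 2: 8, 3: 32}
```

Larger values of r scale up the block consumed per cycle by the same processing unit, which is how higher throughput rates are configured.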
Based on the proposed block processing scheme, we propose two parallel structures,
PAR-1 (using method-1) and PAR-2 (using method-2), for the computation of multilevel lifting
2-D DWT without using a frame buffer. Neither of the proposed structures requires an
interfacing unit between DWT levels or an overlapped memory, as required by the existing
parallel structures of Mohanty and Meher (2011) and Mohanty and Meher (2013). However,
the proposed parallel structures require marginally more on-chip memory words than the
parallel structure of Mohanty and Meher (2013) for the same DWT levels.
Compared with the parallel structure of Mohanty and Meher (2011), the proposed
parallel structures involve the same number of multipliers and adders for the same input
block size and the same CT. But the proposed PAR-1 and PAR-2 involve
nearly (3N + 3N/2^J) and 3N fewer on-chip memory words, respectively, than the parallel structure of
Mohanty and Meher (2011). ASIC synthesis results show that the core of the proposed PAR-1
of Mohanty and Mahajan (2013a), for 2-level DWT and image size (512 × 512), involves 35%
less ADP and 29% less EPI than those of the similar existing parallel structure of Mohanty and
Meher (2011). Compared with the parallel structure of Mohanty and Meher (2011), the proposed
PAR-2 of Mohanty and Mahajan (2013b), for 2-level DWT and image size (512 × 512),
involves 41% less ADP and 36% less EPI. The proposed structures are fully scalable for
higher input block sizes as well as higher DWT levels. Moreover, the proposed structures are
highly regular, modular and systolic.
Chapter 6: Low-Complexity Design for Folded 2-D DWT
Portable and hand-held devices are enabled with multimedia applications. They
use data compression algorithms to store and transmit multidimensional signals. These
devices are resource-constrained and most often need real-time processing to deliver better
performance. Therefore, it is essential to implement DWT in low-complexity hardware to
meet the space-time requirements of portable devices. The multiplier is the most complex hardware
component in an arithmetic unit, and it consumes a major part of the chip area and power.
Therefore, various low-complexity design schemes have been proposed for multiplier-less
implementation of convolution-based or lifting-based DWT. Distributed arithmetic (DA)
based techniques are popular for their high-throughput processing capability and regular
structure (White, 1989). We propose a DA-based 1-D DWT structure using a carry-save full-adder
(CSFA) and a carry-save accumulator (CSAC). The proposed structure involves
significantly fewer logic resources, and it has a shorter bit-cycle period than the similar existing
multiplier-less structure. ASIC simulation shows that the proposed DA-based 1-D DWT
structure involves 11% less area and 35% less cycle period (CP) compared with the DA-based structure of
Mohanty and Meher (2009).
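The DA principle can be illustrated behaviourally: the inner product of a filter with its input window is computed by bit-serial look-up and shift-accumulation instead of multipliers (White, 1989). The plain integer accumulator below is an illustrative stand-in for the carry-save hardware of the proposed structure, assuming integer coefficients and unsigned B-bit inputs to keep the sketch short.

```python
def da_inner_product(coeffs, xs, bits):
    """Compute sum(c*x) without multipliers: one LUT access and one
    shift-accumulate per input bit plane."""
    K = len(coeffs)
    # LUT: partial sum of coefficients for every K-bit input pattern.
    lut = [sum(c for c, sel in
               zip(coeffs, ((pattern >> k) & 1 for k in range(K))) if sel)
           for pattern in range(1 << K)]
    acc = 0
    for b in range(bits):                 # bit-serial, LSB first
        pattern = sum(((x >> b) & 1) << k for k, x in enumerate(xs))
        acc += lut[pattern] << b          # shift-accumulate
    return acc

# Matches the direct inner product:
assert da_inner_product([3, -1, 2, 5], [4, 7, 1, 6], bits=3) \
       == 3*4 - 1*7 + 2*1 + 5*6
```

The hardware replaces the `+=` accumulator with carry-save addition so that no carry propagates inside the bit-cycle loop, which is what shortens the bit-cycle period.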
We propose DA-based folded 2-D DWT structures using line-based data access
(architecture-1) and parallel data access (architecture-2). The proposed DA-based 1-D DWT
structure is used as the row- and column-processor of the 2-D DWT folded structures. Both the
proposed DA-based structures for 2-D DWT involve the same logic components, but they differ
in on-chip memory size and frame buffer size. ASIC simulation shows that the proposed
architecture-2 offers 43% area saving over the similar multiplier-based structure of Meher et al. (2008).
The proposed DA-based structures have significantly less area complexity than the existing designs.
The proposed designs are, therefore, very useful for low-complexity realization of 1-D and 2-D DWT
for resource-constrained applications.
Chapter 7: Conclusion and Future Scope
Area-delay-efficient hardware realization of the multilevel 2-D DWT is of great
practical interest for low-power, resource-constrained mobile and portable devices. Keeping
this in view, several designs have been suggested by various researchers during the last
decade. In Chapter 2, we made a complexity analysis of the existing folded structures and found
that memory complexity constitutes 97% of the overall hardware complexity of the folded structure.
However, the existing design strategies focus on optimization of arithmetic complexity
and reduction of cycle period; no specific memory-centric design has been proposed to
address the memory complexity of the multilevel 2-D DWT structure. Therefore, in this thesis
work, we have proposed memory-centric design approaches and efficient hardware designs
for multilevel 2-D DWT.
In Chapter 3, a block processing scheme and a parallel-pipelined architecture based on
line-based data access are proposed with improved memory utilization efficiency. The architecture
has better memory utilization efficiency than the existing structures and offers less ADP and EPI for
higher block sizes. In Chapter 4, a novel data access scheme and a systolic architecture are
proposed for area-delay-efficient implementation of folded 2-D DWT. In Chapter 5, a block
processing scheme for generating continuous input blocks for the succeeding PUs of a parallel
multilevel 2-D DWT structure is proposed. Using this block processing scheme, two parallel
architectures are proposed for area-delay-efficient realization of multilevel 2-D DWT. Neither of
the proposed parallel structures requires an interfacing unit between DWT levels or an
overlapping memory, unlike the existing parallel structures. In Chapter 6, we proposed a DA-based
structure for 1-D DWT using a carry-save full-adder and a carry-save accumulator. The
proposed DA-based 1-D DWT structure involves significantly fewer logic resources and has a
shorter bit-cycle period than the existing similar DA-based structure. The proposed DA-based 1-D
DWT structure is used as a building block to derive 2-D DWT structures based on the
line-based (architecture-1) and parallel (architecture-2) data access schemes. Both the
proposed DA-based structures for 2-D DWT involve the same logic components, but
architecture-2 involves significantly less on-chip memory and is,
therefore, suitable for low-complexity realization of 2-D DWT.
The proposed folded and parallel structures are regular, modular and easily scalable
for higher throughput rates. Interestingly, the ADP and EPI of the proposed structures are lower for
higher block sizes, which is significant for high-throughput implementation. Besides, the
proposed structures are individually the best amongst the existing structures, and they have
specific features which could be very useful for hardware implementation of 2-D DWT for
different types of image processing applications. However, there are a few shortcomings in the
proposed designs, such as critical path delay, fixed-point error and interconnect delay. These issues
will be addressed in our future work.
List of Publications
1. Mahajan, A. and Mohanty, B.K. (2010a), “Bit serial design for VLSI implementation of
1-D discrete wavelet transform using 9/7 filters based on distributed arithmetic,” in Proc.
CSI National Conference on Education and Research (ConfER 2010), JUET, Guna, pp.
439-449.
2. Mahajan, A. and Mohanty, B.K. (2010b), “Efficient VLSI architecture for
implementation of 1-D discrete wavelet transform based on distributed arithmetic,” in
Proc. 10th IEEE Asia Pacific Conference on Circuits and Systems (APCCAS 2010), Kuala
Lumpur, Malaysia, pp. 1195-1198.
3. Mohanty, B.K., Mahajan, A. and Meher, P.K. (2012), “Area and power-efficient
architecture for high-throughput implementation of lifting 2-D DWT,” IEEE Transactions
on Circuits and Systems–II, Express Briefs, vol. 59, no. 7, pp. 434-438.
4. Mohanty, B.K. and Mahajan, A. (2013a), “Efficient-block-processing parallel architecture
for multilevel lifting 2-D DWT,” ASP Journal of Low Power Electronics, vol. 9, no. 1,
pp. 37-44.
5. Mohanty, B.K., and Mahajan, A. (2013b), “Scheduling-scheme and parallel structure for
multilevel lifting 2-D DWT without using frame-buffer,” IET Circuits, Devices and
Systems, doi: 10.1049/iet-cds.2012.0398.
References
1. Chrysafis, C. and Ortega, A. (2000), “Line-based, reduced memory, wavelet image compression,”
IEEE Transactions on Image Processing, vol. 9, no. 3, pp. 378–389.
2. Cheng, C.-C., Huang, C.-T., Cheng, C.-Y., Lian, C.-J. and Chen, L.-G. (2007), “On-chip
memory optimization scheme for VLSI implementation of line-based two dimensional
discrete wavelet transform,” IEEE Transactions on Circuits and Systems for Video Technology,
vol. 17, no. 7, pp. 814–822.
3. Daubechies, I. (1992), Ten Lectures on Wavelets, Philadelphia: Society for Industrial and Applied
Mathematics (SIAM).
4. Huang, C.-T., Tseng, P.-C. and Chen, L.-G. (2005), “Generic RAM-based architectures for two-
dimensional discrete wavelet transform with line-based method,” IEEE Transactions on Circuits
and Systems for Video Technology, vol. 15, no. 7, pp. 910-919.
5. Lewis, A.S. and Knowles, G. (1992), “Image compression using the 2-D wavelet
transform,” IEEE Transactions on Image Processing, vol. 1, no. 2, pp. 244–250.
6. Lai, Y.-K., Chen, L.-F. and Shih, Y.-C. (2009), “A high-performance and memory efficient VLSI
architecture with parallel scanning method for 2-D lifting-based discrete wavelet transform,”
IEEE Transactions on Consumer Electronics, vol. 55, no. 2, pp. 400-407.
7. Li, W.-M., Hsia, C.-H. and Chiang, J.-S. (2009), “Memory-efficient architecture of 2-D dual-
mode discrete wavelet transform using lifting scheme for motion-JPEG 2000,” in Proc. IEEE
International Symposium on Circuit and Systems (ISCAS), pp. 750–753.
8. Mallat, S. G. (1989), “A theory for multiresolution signal decomposition: The wavelet
representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 7,
pp. 674–693.
9. Meyer, Y. (1993), Wavelets: Algorithms and Applications, Philadelphia: Society for Industrial
and Applied Mathematics (SIAM).
10. Meher, P.K., Mohanty, B.K. and Patra, J.C. (2008), “Hardware-efficient systolic like
modular design for two-dimensional discrete wavelet transform,” IEEE Transactions on
Circuits and Systems–II, Express Briefs, vol. 55, no. 2, pp. 151-154.
11. Mohanty, B.K. and Meher, P.K. (2009), “Efficient multiplierless designs for 1-D DWT using
9/7 filters based on distributed arithmetic,” in Proc. IEEE International Symposium on Integrated
Circuits (ISIC 2009), pp. 364-367.
12. Mohanty, B.K. and Meher, P.K. (2011), “Memory-efficient modular VLSI architecture for high-throughput
and low-latency implementation of multilevel lifting 2-D DWT,” IEEE Transactions
on Signal Processing, vol. 59, no. 5, pp. 2072-2084.
13. Mohanty, B.K. and Meher, P. K. (2013), “Memory-efficient high-speed convolution-based generic
structure for multilevel 2-D DWT,” IEEE Transactions on Circuits and Systems for Video
Technology, vol. 23, no. 2, pp. 353-363.
14. Sweldens, W. (1996), “The lifting scheme: A custom-design construction of biorthogonal
wavelets,” Applied and Computational Harmonic Analysis, vol. 3, no. 2, pp. 186–200.
15. Skodras, A., Christopoulos, C. and Ebrahimi, T. (2001), “The JPEG 2000 still image
compression standard,” IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 36-58.
16. Tian, X., Wu, L., Tan, Y.H., and Tian, J.W. (2011), “Efficient multi-input/multi-output VLSI
architecture for two-dimensional lifting-based discrete wavelet transform,” IEEE Transactions on
Computers, vol. 60, no. 8, pp. 1207-1211.
17. Vetterli, M. and Kovacevic, J. (1995), Wavelets and Subband Coding, Prentice Hall PTR,
Englewood Cliffs, New Jersey.
18. Vishwanath, M. (1994), “The recursive pyramid algorithm for the discrete wavelet transform,”
IEEE Transactions on Signal Processing, vol. 42, no. 3, pp. 673-676.
19. White, S. A. (1989), “Applications of the distributed arithmetic to digital signal processing: A
tutorial review,” IEEE ASSP Magazine, vol. 6, no. 3, pp. 5–11.
20. Wu, P.-C. and Chen, L.-G. (2001), “An efficient architecture for two-dimensional discrete
wavelet transform,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 11,
no. 4, pp. 536-545.
21. Wu, B.-F. and Lin, C.-F. (2005), “A high-performance and memory-efficient pipeline architecture
for the 5/3 and 9/7 discrete wavelet transform of JPEG2000 codec,” IEEE Transactions on
Circuits and Systems for Video Technology, vol. 15, no. 12, pp. 1615-1628.
22. Xiong, C.-Y., Tian, J.-W. and Liu, J. (2006), “A note on ‘Flipping structure: An efficient VLSI
architecture for lifting-based discrete wavelet transform’,” IEEE Transactions on Signal
Processing, vol. 54, no. 5, pp. 1910–1916.
23. Xiong, C.-Y., Tian, J.-W. and Liu, J. (2007), “Efficient architecture for two dimensional discrete
wavelet transform using lifting scheme,” IEEE Transactions on Image Processing, vol. 16, no.
3, pp. 607–614.