
290 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 56, NO. 4, APRIL 2009

An Efficient Folded Architecture for Lifting-Based Discrete Wavelet Transform

Guangming Shi, Member, IEEE, Weifeng Liu, Li Zhang, and Fu Li

Abstract—In this brief an efficient folded architecture (EFA) for lifting-based discrete wavelet transform (DWT) is presented. The proposed EFA is based on a novel form of the lifting scheme that is given in this brief. Due to this form, the conventional serial operations of the lifting data flow can be optimized into parallel ones by employing parallel and pipeline techniques. The corresponding optimized architecture (OA) has short critical path latency and is repeatable. Further, utilizing this repeatability, the EFA is derived from the OA by employing the fold technique. For the proposed EFA, hardware utilization achieves 100%, and the number of required registers is reduced. Additionally, the shift–add operation is adopted to optimize the multiplication; thus, the proposed architecture is more suitable for hardware implementation. Performance comparisons and field-programmable gate array (FPGA) implementation results indicate that the proposed EFA possesses better performances in critical path latency, hardware cost, and control complexity.

Index Terms—Discrete wavelet transform (DWT), folded architecture, lifting scheme, parallel, pipeline.

I. INTRODUCTION

The discrete wavelet transform (DWT) has been used extensively in many applications [1]–[3]. The existing architectures for implementing the DWT fall mainly into two categories: 1) convolution based [4]–[6] and 2) lifting based [7]–[13]. Since the lifting-based architectures have advantages over the convolution-based ones in computation complexity and memory requirement, more attention has been paid to them. In [10], Jou et al. proposed an architecture for directly implementing the lifting scheme. Based on this direct architecture, Lian et al. [11] proposed a folded architecture to increase the hardware utilization. Unfortunately, these architectures are limited in critical path latency and memory requirement. The flipping structure [12] can reduce the critical path latency by eliminating the multipliers on the path from the input node to the computation node without hardware overhead. Reference [13] modified the conventional lifting scheme by merging the predictor and updater stages into a single lifting step; thus, the critical path latency is shortened, and the memory requirement is reduced. However, the architectures in [12] and [13] involve a complex control procedure, and the roundoff noise has to be considered.

Manuscript received July 13, 2008; revised October 18, 2008 and December 17, 2008. First published March 16, 2009; current version published April 17, 2009. This work was supported in part by the National High Technology Research and Development Program of China under Grant 2007AA01Z307, by the National Natural Science Foundation of China under Grant 60736043, Grant 60776795, and Grant 60672125, and by the Program for Changjiang Scholars and Innovative Research Teams in University under Grant IRT0645. This paper was recommended by Associate Editor M. Anis.

The authors are with the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, Xidian University, Xi’an 710071, China (e-mail: gmshi@xidian.edu.cn; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSII.2009.2015393

To solve these problems, we propose an efficient folded architecture (EFA) for the lifting-based DWT in this brief. The EFA is obtained as follows. First, we give a new formula for the lifting algorithm, leading to a novel form of the lifting scheme. In this form, the intermediate data used to compute the output data are distributed on different paths. Thus, we can process these intermediate data in parallel by employing the parallel and pipeline techniques. With these operations, the conventional serial data flow of the lifting-based DWT is optimized into a parallel one, so the corresponding optimized architecture (OA) has a short critical path latency. More importantly, the resulting OA is repeatable. Based on this property, the EFA is derived from the OA by employing the fold technique. With the proposed EFA, the required hardware resources are reduced, and the hardware utilization is greatly increased. Furthermore, the critical path latency and the number of registers are reduced. In addition, the shift–add operation is adopted to optimize the multiplication; thus, the hardware resources are further reduced, and the implementation complexity is lowered.

In our work, we take the 9/7 wavelet filters as an example to explain the proposed EFA. The performance comparisons and FPGA implementation results indicate the efficiency of the proposed architecture.

This brief is organized as follows. Section II briefly reviews the lifting scheme and the lifting algorithm for the 9/7 wavelet filters. Section III describes the proposed folded architecture. Performance comparisons and FPGA implementation are shown in Section IV. Finally, a conclusion is given in Section V.

II. LIFTING SCHEME

The lifting scheme is an efficient way to construct the DWT [14], [15]. Generally, the lifting scheme consists of three steps: 1) split; 2) predict; and 3) update. Fig. 1 shows the block diagram of the lifting-based structure. The basic principle is to break up the polyphase matrix of the wavelet filters into a sequence of alternating upper and lower triangular matrices and a diagonal normalization matrix [15].

Fig. 1. Block diagram of the lifting scheme.

According to the basic principle, the polyphase matrix of the 9/7 wavelet can be expressed as [15]

$$
\tilde{P}(z) =
\begin{bmatrix} 1 & \alpha(1+z^{-1}) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ \beta(1+z) & 1 \end{bmatrix}
\begin{bmatrix} 1 & \gamma(1+z^{-1}) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ \delta(1+z) & 1 \end{bmatrix}
\begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}
\tag{1}
$$

where $\alpha(1+z^{-1})$ and $\gamma(1+z^{-1})$ are the predict polynomials, $\beta(1+z)$ and $\delta(1+z)$ are the update polynomials, and $K$ is the scale normalization. Here, the lifting coefficients $\alpha$, $\beta$, $\gamma$, and $\delta$, and constant $K$ are $\alpha \approx -1.586134342$, $\beta \approx -0.052980118$, $\gamma \approx 0.8829110762$, $\delta \approx 0.4435068522$, and $K \approx 1.149604398$, respectively.

Given the input sequence $x_n$, $n = 0, 1, \ldots, N-1$, where $N$ is the length of the input sequence, the detailed lifting procedure is given in four steps.

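1) Splitting step:

$$\text{Odd part:}\quad d_i^{(0)} = x_{2i+1} \tag{2}$$

$$\text{Even part:}\quad s_i^{(0)} = x_{2i}. \tag{3}$$

2) First lifting step:

$$\text{Predictor:}\quad d_i^{(1)} = d_i^{(0)} + \alpha \times \left(s_i^{(0)} + s_{i+1}^{(0)}\right) \tag{4}$$

$$\text{Updater:}\quad s_i^{(1)} = s_i^{(0)} + \beta \times \left(d_{i-1}^{(1)} + d_i^{(1)}\right). \tag{5}$$

3) Second lifting step:

$$\text{Predictor:}\quad d_i^{(2)} = d_i^{(1)} + \gamma \times \left(s_i^{(1)} + s_{i+1}^{(1)}\right) \tag{6}$$

$$\text{Updater:}\quad s_i^{(2)} = s_i^{(1)} + \delta \times \left(d_{i-1}^{(2)} + d_i^{(2)}\right). \tag{7}$$

4) Scaling step:

$$d_i = d_i^{(2)} / K \tag{8}$$

$$s_i = K \times s_i^{(2)}. \tag{9}$$

$d_i^{(l)}$ and $s_i^{(l)}$ are intermediate data, where $l$ denotes the stage of the lifting step. Outputs $d_i$ and $s_i$, $i = 0, \ldots, (N-1)/2$, are the high-pass and low-pass wavelet coefficients.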
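The four steps can be sketched directly in software. The following is a minimal illustration rather than the authors' implementation: the function name `lift_97` is hypothetical, and the boundary rule (an out-of-range neighbour is replaced by the nearest valid sample, which is what whole-sample symmetric extension gives for an even-length input) is an assumption that the brief does not specify.

```python
# Lifting coefficients and scale constant as quoted in Section II.
ALPHA = -1.586134342
BETA = -0.052980118
GAMMA = 0.8829110762
DELTA = 0.4435068522
K = 1.149604398


def lift_97(x):
    """Forward 9/7 lifting of an even-length sequence, following (2)-(9).

    Assumption: neighbours outside the signal are clamped to the nearest
    valid index (symmetric extension); the brief does not state this.
    """
    if len(x) < 2 or len(x) % 2 != 0:
        raise ValueError("expected an even-length input with at least 2 samples")
    s = [float(v) for v in x[0::2]]  # (3) even part
    d = [float(v) for v in x[1::2]]  # (2) odd part
    n = len(d)

    def nxt(i):
        return min(i + 1, n - 1)

    def prv(i):
        return max(i - 1, 0)

    d = [d[i] + ALPHA * (s[i] + s[nxt(i)]) for i in range(n)]  # (4)
    s = [s[i] + BETA * (d[prv(i)] + d[i]) for i in range(n)]   # (5)
    d = [d[i] + GAMMA * (s[i] + s[nxt(i)]) for i in range(n)]  # (6)
    s = [s[i] + DELTA * (d[prv(i)] + d[i]) for i in range(n)]  # (7)
    high = [v / K for v in d]  # (8)
    low = [K * v for v in s]   # (9)
    return low, high
```

For a constant input the high-pass outputs vanish (up to the rounding of the quoted coefficients) and the low-pass outputs equal the input scaled by about the square root of 2, which is a quick sanity check on the coefficient values.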

From (4)–(7), it is clear that the first and second lifting steps can be implemented using the same architecture by alternating the lifting coefficients. Thus, the architecture for the first lifting step can be multiplexed using the folded method to reduce the hardware resources and area. Based on this idea, we will propose a novel folded architecture for the lifting-based DWT.
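The observation that (4)–(5) and (6)–(7) share one structure can be made concrete with a single routine that is called twice with different coefficients. This is only a software analogy of the folding idea; the names `lifting_step`, `COEFFS`, and `two_lifting_steps` are hypothetical, and clamping at the boundaries is an assumption.

```python
def lifting_step(d, s, p, u):
    """One predictor/updater pair with predict coefficient p and update coefficient u.

    Out-of-range neighbours are clamped to the nearest sample (assumed here).
    """
    n = len(d)
    d_new = [d[i] + p * (s[i] + s[min(i + 1, n - 1)]) for i in range(n)]
    s_new = [s[i] + u * (d_new[max(i - 1, 0)] + d_new[i]) for i in range(n)]
    return d_new, s_new


# (alpha, beta) for the first lifting step, (gamma, delta) for the second.
COEFFS = [(-1.586134342, -0.052980118), (0.8829110762, 0.4435068522)]


def two_lifting_steps(d0, s0):
    """Run the same routine twice, swapping only the coefficient pair."""
    d, s = list(d0), list(s0)
    for p, u in COEFFS:
        d, s = lifting_step(d, s, p, u)
    return d, s
```

Only the coefficient pair changes between the two calls, which is what allows one hardware unit to serve both lifting steps.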

Fig. 2. Data flow of 9/7 lifting-based DWT.

Fig. 3. (a) Processing of the intermediate data. (b) Data flow optimization with a four-stage pipeline.

III. PROPOSED ARCHITECTURE

It is well known that, in the lifting scheme, the way the intermediate data are processed determines the hardware scale and critical path latency of the implementing architecture. In the following, we use the parallel and pipeline techniques to process the intermediate data. The corresponding architecture is repeatable. Thus, it can be improved further, leading to the EFA.

From Section II, the conventional 9/7 lifting involves two lifting steps and one scaling step. The corresponding data flow is shown in Fig. 2. In this figure, $d_{i+2}^{(0)}$ and $s_{i+2}^{(0)}$ are the input data of the current cycle; $d_{i+1}^{(0)}$, $s_{i+1}^{(0)}$, $d_i^{(1)}$, $s_i^{(1)}$, and $d_{i-1}^{(2)}$ are the intermediate data obtained from the internal memory; and $d_i^{(2)}$ and $s_i^{(2)}$ are the outputs of the current cycle. These intermediate data are used for computing $d_i^{(2)}$ and $s_i^{(2)}$. However, the conventional lifting scheme processes them serially; thus, the critical path latency is very long. In fact, from (4)–(7), we can choose $d_{i+1}^{(0)} + \alpha \times s_{i+1}^{(0)}$, $s_{i+1}^{(0)} + \beta \times d_i^{(1)}$, $d_i^{(1)} + \gamma \times s_i^{(1)}$, and $s_i^{(1)} + \delta \times d_{i-1}^{(2)}$ as the intermediate data for computing $d_i^{(2)}$ and $s_i^{(2)}$. Since these intermediate data are on different paths, we can calculate them in parallel. That is, we use four delay registers $D_1$, $D_2$, $D_3$, and $D_4$ to store these intermediate data in the same cycle. The delay registers are expressed as

$$
\begin{aligned}
D_1 &= d_{i+1}^{(0)} + \alpha \times s_{i+1}^{(0)}, & D_2 &= s_{i+1}^{(0)} + \beta \times d_i^{(1)} \\
D_3 &= d_i^{(1)} + \gamma \times s_i^{(1)}, & D_4 &= s_i^{(1)} + \delta \times d_{i-1}^{(2)}.
\end{aligned}
\tag{10}
$$

Fig. 3(a) shows this operation. With this parallel operation, the critical path latency is reduced, and the number of registers is decreased.

Fig. 4. Corresponding OA.¹

Fig. 5. Proposed folded architecture.

According to the aforementioned processing, the data flow of 9/7 lifting can be optimized into the four-stage pipeline flow. This is shown in Fig. 3(b). In this figure, the data read from the four delay registers shown in gray circles are used for current computation. Data D1, D2, D3, and D4 along the arrows are the candidates of the delay registers. They are computed in the current cycle and will be used in the next cycle.
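The register recurrence implied by (10) can be simulated cycle by cycle. The sketch below models only the contents of D1–D4; it does not model the pipeline registers that shorten the critical path in hardware. The function name `stream_lifting` is hypothetical, and starting all registers at zero (so the signal is treated as zero before its first sample) is an assumption made for illustration.

```python
ALPHA, BETA = -1.586134342, -0.052980118
GAMMA, DELTA = 0.8829110762, 0.4435068522


def stream_lifting(d0, s0):
    """Consume one (d, s) input pair per cycle and emit one (d2, s2) pair.

    The pair emitted in cycle t belongs to index i = t - 2, matching the
    convention that the current inputs carry index i + 2.
    """
    D1 = D2 = D3 = D4 = 0.0  # assumed initial state
    out = []
    for d_in, s_in in zip(d0, s0):
        # Use the values stored in the previous cycle.
        d1 = D1 + ALPHA * s_in   # d^(1) at index i+1
        s1 = D2 + BETA * d1      # s^(1) at index i+1
        d2 = D3 + GAMMA * s1     # d^(2) at index i
        s2 = D4 + DELTA * d2     # s^(2) at index i
        out.append((d2, s2))
        # Refresh the registers per (10), shifted by one index for the next cycle.
        D1 = d_in + ALPHA * s_in
        D2 = s_in + BETA * d1
        D3 = d1 + GAMMA * s1
        D4 = s1 + DELTA * d2
    return out
```

Each register value is a partial sum that needs only one more multiply-add to complete the corresponding lifting equation, which is why the four updates do not depend on one another's stored values.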

Based on the optimized data flow shown in Fig. 3(b), the corresponding OA can be obtained, as shown in Fig. 4. In this figure, the dashed line divides the architecture into two similar parts. Therefore, we can multiplex the left-side architecture, replacing the right-side one. In this way, we can obtain our proposed EFA. It is shown in the dashed area of Fig. 5.

¹Here, we do not consider the splitting and scaling step.

In the following, we will show the EFA for processing the two lifting steps of the 9/7 filter. Intermediate data $d_i^{(1)}$ and $s_i^{(1)}$, which were obtained from the first lifting step, are fed back to pipeline registers P1 and P2. They are used for the second lifting step. As a result, the first and second lifting steps are interleaved by selecting their own coefficients. In this procedure, two delay registers D3 and D4 are needed in each lifting step for the proper schedule. Table I shows a detailed processing of the 9/7 wavelet, where, for a given input sequence $x_n$, we can get output $d_i^{(2)}$ and $s_i^{(2)}$, respectively. In the proposed architecture, the speed of the internal processing unit is two times that of the even (odd) input/output data. This means that the input/output data rate to the DWT processor is one sample per clock cycle. The proposed architecture needs only four adders and two multipliers, which are half those of the architecture shown in Fig. 4.
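The folding can be illustrated with one shared unit that is visited twice per input pair, first with (α, β) and then with (γ, δ). This is a behavioural sketch under stated assumptions, not the circuit of Fig. 5: the names `pe` and `folded_lifting` are hypothetical, registers start at zero, and the timing of pipeline registers P1 and P2 is not modelled.

```python
ALPHA, BETA = -1.586134342, -0.052980118
GAMMA, DELTA = 0.8829110762, 0.4435068522
COEFFS = [(ALPHA, BETA), (GAMMA, DELTA)]


def pe(x_d, x_s, r1, r2, p, u):
    """One pass through the shared unit: two multiplications, four additions."""
    m1 = p * x_s        # multiplier 1
    y_d = r1 + m1       # adder 1: completes the predictor
    m2 = u * y_d        # multiplier 2
    y_s = r2 + m2       # adder 2: completes the updater
    new_r1 = x_d + m1   # adder 3: next value of the first delay register
    new_r2 = x_s + m2   # adder 4: next value of the second delay register
    return y_d, y_s, new_r1, new_r2


def folded_lifting(d0, s0):
    """Two passes through pe() per input pair; the first pass feeds the second."""
    regs = [[0.0, 0.0], [0.0, 0.0]]  # [D1, D2] for pass 0, [D3, D4] for pass 1
    out = []
    for x_d, x_s in zip(d0, s0):
        for phase, (p, u) in enumerate(COEFFS):
            r1, r2 = regs[phase]
            x_d, x_s, new_r1, new_r2 = pe(x_d, x_s, r1, r2, p, u)
            regs[phase] = [new_r1, new_r2]
        out.append((x_d, x_s))
    return out
```

Because the unit runs twice for every (odd, even) input pair, it has to be clocked at twice the rate of each input stream, which corresponds to the one-sample-per-clock figure stated above.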

For the splitting, we use one delay register and two switches to split the input into odd/even sequences. The scaling unit consists of one multiplier and one multiplexer. By properly selecting coefficients 1/K and K, the high-pass and low-pass coefficients are normalized. Fig. 5 shows the complete structure of the proposed EFA, including the splitting, lifting, and scaling steps.
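A behavioural sketch of the splitting and scaling stages follows. The function names `split_stream` and `scale_stream` are hypothetical, and the interleaved output order (high-pass value first, then low-pass) is an assumption; the brief only states that one multiplier is shared by selecting 1/K or K.

```python
K = 1.149604398


def split_stream(x):
    """Pair up a sample stream using one delay register.

    Even-indexed samples are parked in the register; when the following
    odd-indexed sample arrives, the pair (odd, even) = (d, s) is emitted.
    """
    delay = None
    pairs = []
    for n, sample in enumerate(x):
        if n % 2 == 0:
            delay = sample                 # switch routes even samples to the register
        else:
            pairs.append((sample, delay))  # odd sample plus delayed even sample
    return pairs


def scale_stream(pairs):
    """Time-share one multiplier: coefficient 1/K for d values, K for s values."""
    out = []
    for d2, s2 in pairs:
        for value, coeff in ((d2, 1.0 / K), (s2, K)):  # multiplexer selects the coefficient
            out.append(value * coeff)
    return out
```

For example, `split_stream([0, 1, 2, 3])` gives `[(1, 0), (3, 2)]`.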

The proposed EFA can also be used for 5/3 lifting-based DWT by bypassing delay registers D3 and D4 and selecting


TABLE III. EXPERIMENT RESULTS FOR THE FOLDED ARCHITECTURE [11] AND THE EFA (FOR 9/7 LIFTING-BASED DWT)

  A. Performance Comparison

The proposed EFA is compared with other existing architectures in Table II in terms of critical path latency, control complexity, throughput rate, and hardware complexity (measured by the number of multipliers, adders, and registers).

From Table II, it can be found that, compared with other architectures, the proposed EFA requires the least arithmetic resources (two multipliers and four adders), although the throughput rate is one input/output sample per cycle. The critical path latency is $T_m + T_a$, which is slightly longer than that of the Direct+full pipeline, the Flipping+five-stage pipeline, and the pipeline proposed by Wu and Lin [13]. However, the number of registers is ten, which is fewer than in the three aforementioned architectures. Additionally, the control complexity is medium, whereas that of the Flipping+five-stage pipeline and of Wu and Lin [13] is complex. The results indicate that the proposed EFA provides a suitable tradeoff among critical path latency, hardware cost, and control complexity.

  B. FPGA Implementation

We have developed a synthesizable Verilog HDL model to implement the proposed EFA on the Altera Stratix II FPGA EP2S15F484C5 using the Quartus II 7.2 platform. In the implementation, a 12-bit data bus width is chosen, and the coefficients are quantized to fixed point with 12-bit precision for the fractional part. To verify the efficiency of the proposed architecture, we redesigned the architecture presented in [11, Fig. 5] under the same conditions. The implementation results in terms of dedicated logic registers, combinational adaptive look-up tables (ALUTs), and critical path latency are shown in Table III. Our architecture has a shorter critical path latency and uses fewer hardware resources than the folded architecture in [11]. The reason is that the proposed EFA is based on the optimized data flow of a conventional lifting-based DWT, whereas the architecture in [11] is based on conventional lifting. Additionally, we also multiplex the multiplier for scaling to reduce the hardware.
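The fixed-point coefficient format and the shift–add multiplication mentioned in the brief can be sketched as follows. The 12 fractional bits come from the text above; everything else is an assumption for illustration. In particular, the surviving text does not say which shift–add decomposition the authors use, so a plain binary expansion of the quantized coefficient is shown here, and the names `quantize`, `shift_add_multiply`, and `fixed_mul` are hypothetical.

```python
FRAC_BITS = 12  # fractional precision stated in the brief


def quantize(c, frac_bits=FRAC_BITS):
    """Round a real coefficient to an integer with frac_bits fractional bits."""
    return int(round(c * (1 << frac_bits)))


def shift_add_multiply(x, q):
    """Compute x * q for integers using only shifts and additions.

    One shifted copy of x is added for every set bit of |q|; the sign of q
    is applied at the end. (Plain binary expansion, assumed for this sketch.)
    """
    acc, mag, k = 0, abs(q), 0
    while mag:
        if mag & 1:
            acc += x << k
        mag >>= 1
        k += 1
    return -acc if q < 0 else acc


def fixed_mul(x, c, frac_bits=FRAC_BITS):
    """Multiply integer sample x by real coefficient c in fixed point."""
    return shift_add_multiply(x, quantize(c, frac_bits)) >> frac_bits
```

With 12 fractional bits, α ≈ −1.586134342 quantizes to the integer −6497, so each product is accurate to within the coefficient rounding plus the final truncation.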

V. CONCLUSION

In this brief, we have proposed a novel EFA for the lifting-based DWT. We have given a new formula for the conventional lifting algorithm. Then, by employing the parallel and pipeline techniques, the conventional data flow of the lifting-based DWT is converted to a parallel one, resulting in a repeatable OA. Based on this property, the proposed EFA is derived from the OA by further employing the fold technique. The FPGA implementation results show that the proposed EFA has a short critical path latency and achieves high hardware utilization. Performance comparisons indicate that our EFA provides an efficient tradeoff among critical path latency, hardware cost, and control complexity.

ACKNOWLEDGMENT

The authors would like to thank the Associate Editor and anonymous reviewers for their valuable comments.

REFERENCES

[1] A. N. Akansu and R. A. Haddad, Multiresolution Signal Decomposition: Transforms, Subbands and Wavelets. New York: Academic, 1992.
[2] J. M. Shapiro, “Embedded image coding using zerotrees of wavelet coefficients,” IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3445–3462, Dec. 1993.
[3] S. C. B. Lo, H. Li, and M. T. Freedman, “Optimization of wavelet decomposition for image compression and feature preservation,” IEEE Trans. Med. Imag., vol. 22, no. 9, pp. 1141–1151, Sep. 2003.
[4] K. K. Parhi and T. Nishitani, “VLSI architectures for discrete wavelet transforms,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 1, no. 3, pp. 191–202, Jun. 1993.
[5] C. Chakrabarti and M. Vishwanath, “Efficient realizations of discrete and continuous wavelet transforms: From single chip implementations to mapping on SIMD array computers,” IEEE Trans. Signal Process., vol. 43, no. 3, pp. 759–771, Mar. 1995.
[6] M. Vishwanath, R. M. Owens, and M. J. Irwin, “VLSI architectures for the discrete wavelet transform,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 42, no. 5, pp. 305–316, May 1995.
[7] W. Jiang and A. Ortega, “Lifting factorization-based discrete wavelet transform architecture design,” IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 5, pp. 651–657, May 2001.
[8] K. Andra, C. Chakrabarti, and T. Acharya, “A VLSI architecture for lifting-based forward and inverse wavelet transform,” IEEE Trans. Signal Process., vol. 50, no. 4, pp. 966–977, Apr. 2002.
[9] H. Liao, M. K. Mandal, and B. F. Cockburn, “Efficient architectures for 1-D and 2-D lifting-based wavelet transforms,” IEEE Trans. Signal Process., vol. 52, no. 5, pp. 1315–1326, May 2004.
[10] J. M. Jou, Y. H. Shiau, and C. C. Liu, “Efficient VLSI architectures for the biorthogonal wavelet transform by filter bank and lifting scheme,” in Proc. IEEE Int. Symp. Circuits Syst., 2001, pp. 529–532.
[11] C. J. Lian, K. F. Chen, H. H. Chen, and L. G. Chen, “Lifting based discrete wavelet transform architecture for JPEG2000,” in Proc. IEEE Int. Symp. Circuits Syst., Sydney, Australia, 2001, pp. 445–448.
[12] C. T. Huang, P. C. Tseng, and L. G. Chen, “Flipping structure: An efficient VLSI architecture for lifting based discrete wavelet transform,” IEEE Trans. Signal Process., vol. 52, no. 4, pp. 1080–1089, Apr. 2004.
[13] B. F. Wu and C. F. Lin, “A high-performance and memory-efficient pipeline architectures for the 5/3 and 9/7 discrete wavelet transform of JPEG2000 codec,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 12, pp. 1615–1628, Dec. 2005.
[14] W. Sweldens, “The lifting scheme: A custom-design construction of biorthogonal wavelets,” Appl. Comput. Harmon. Anal., vol. 3, no. 15, pp. 186–200, Apr. 1996.
[15] I. Daubechies and W. Sweldens, “Factoring wavelet transforms into lifting schemes,” J. Fourier Anal. Appl., vol. 4, no. 3, pp. 247–269, 1998.
[16] X. Lan, N. Zheng, and Y. Liu, “Low-power and high-speed VLSI architecture for lifting-based forward and inverse wavelet transform,” IEEE Trans. Consum. Electron., vol. 51, no. 2, pp. 379–385, May 2005.
[17] T. Acharya and C. Chakrabarti, “A survey on lifting-based discrete wavelet transform architectures,” J. VLSI Signal Process., vol. 42, no. 3, pp. 321–339, Mar. 2006.