290 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 56, NO. 4, APRIL 2009
An Efficient Folded Architecture for Lifting-Based Discrete Wavelet Transform
Guangming Shi, Member, IEEE, Weifeng Liu, Li Zhang, and Fu Li
Abstract—In this brief, an efficient folded architecture (EFA) for lifting-based discrete wavelet transform (DWT) is presented. The proposed EFA is based on a novel form of the lifting scheme that is given in this brief. Due to this form, the conventional serial operations of the lifting data flow can be optimized into parallel ones by employing parallel and pipeline techniques. The corresponding optimized architecture (OA) has short critical path latency and is repeatable. Further, utilizing this repeatability, the EFA is derived from the OA by employing the fold technique. For the proposed EFA, hardware utilization achieves 100%, and the number of required registers is reduced. Additionally, the shift–add operation is adopted to optimize the multiplication; thus, the proposed architecture is more suitable for hardware implementation. Performance comparisons and field-programmable gate array (FPGA) implementation results indicate that the proposed EFA possesses better performance in critical path latency, hardware cost, and control complexity.
Index Terms—Discrete wavelet transform (DWT), folded architecture, lifting scheme, parallel, pipeline.
I. INTRODUCTION
The discrete wavelet transform (DWT) has been used extensively in many applications [1]–[3]. The existing architectures for implementing the DWT are mainly classified into two categories: 1) convolution based [4]–[6] and 2) lifting based [7]–[13]. Since the lifting-based architectures have advantages over the convolution-based ones in computation complexity and memory requirement, more attention is paid to the lifting-based ones. In [10], Jou et al. proposed an architecture for directly implementing the lifting scheme. Based on this direct architecture, Lian et al. [11] proposed a folded architecture to increase the hardware utilization. Unfortunately, these architectures have limitations on the critical path latency and memory requirement. The flipping structure [12] can reduce the critical path latency by eliminating the multipliers on the path from the input node to the computation node without hardware overhead. Reference [13] modified the conventional lifting scheme by merging the predictor and updater stages into a single lifting step; thus, the critical path latency is shortened, and the memory requirement is reduced. However, the architectures in [12] and [13] involve a complex control procedure, and the roundoff noise has to be considered.
Manuscript received July 13, 2008; revised October 18, 2008 and December 17, 2008. First published March 16, 2009; current version published April 17, 2009. This work was supported in part by the National High Technology Research and Development Program of China under Grant 2007AA01Z307, by the National Natural Science Foundation of China under Grant 60736043, Grant 60776795, and Grant 60672125, and by the Program for Changjiang Scholars and Innovative Research Teams in University under Grant IRT0645. This paper was recommended by Associate Editor M. Anis.
The authors are with the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, Xidian University, Xi’an 710071, China (e-mail: [email protected]; [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSII.2009.2015393
In order to solve those problems, we propose an efficient folded architecture (EFA) for the lifting-based DWT in this brief. The EFA is obtained as follows. First, we give a new formula for the lifting algorithm, leading to a novel form of the lifting scheme. Due to this form, the intermediate data that are used to compute the output data are distributed on different paths. Thus, we can process these intermediate data in parallel by employing the parallel and pipeline techniques. With these operations, the conventional serial data flow of the lifting-based DWT is optimized into a parallel one, and the corresponding optimized architecture (OA) has short critical path latency. More importantly, the resulting OA is repeatable. Based on this property, the EFA is derived from the OA by employing the fold technique. With the proposed EFA, the required hardware resources are reduced, and the hardware utilization is greatly increased. Furthermore, the critical path latency and the number of registers are reduced. In addition, the shift–add operation is adopted to optimize the multiplication; thus, the hardware resources are further reduced, and the implementation complexity is cut down.
In our work, we take the 9/7 wavelet filters as an example to explain the proposed EFA. The performance comparisons and FPGA implementation results indicate the efficiency of the proposed architecture.
This brief is organized as follows. Section II briefly reviews the lifting scheme and the lifting algorithm for the 9/7 wavelet filters. Section III describes the proposed folded architecture. Performance comparisons and FPGA implementation are shown in Section IV. Finally, a conclusion is given in Section V.
II. LIFTING SCHEME
The lifting scheme is an efficient way to construct the DWT [14], [15]. Generally, the lifting scheme consists of three steps: 1) split; 2) predict; and 3) update. Fig. 1 shows the block diagram of the lifting-based structure. The basic principle is to break up the polyphase matrix of the wavelet filters into a sequence of alternating upper and lower triangular matrices and a diagonal normalization matrix [15].
1549-7747/$25.00 © 2009 IEEE
Fig. 1. Block diagram of the lifting scheme.
According to the basic principle, the polyphase matrix of the 9/7 wavelet can be expressed as [15]
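$$\tilde{P}(z) = \begin{bmatrix} 1 & \alpha(1 + z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \beta(1 + z) & 1 \end{bmatrix} \begin{bmatrix} 1 & \gamma(1 + z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \delta(1 + z) & 1 \end{bmatrix} \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix} \tag{1}$$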
where $\alpha(1 + z^{-1})$ and $\gamma(1 + z^{-1})$ are the predict polynomials, $\beta(1 + z)$ and $\delta(1 + z)$ are the update polynomials, and $K$ is the scale normalization factor. Here, the lifting coefficients and the constant $K$ are $\alpha \approx -1.586134342$, $\beta \approx -0.052980118$, $\gamma \approx 0.8829110762$, $\delta \approx 0.4435068522$, and $K \approx 1.149604398$.
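As a quick consistency check on (1), which is a remark added here and not part of the original derivation, each triangular factor has unit determinant and the diagonal factor contributes $K \cdot (1/K)$, so

$$\det \tilde{P}(z) = 1 \cdot 1 \cdot 1 \cdot 1 \cdot \left(K \cdot \frac{1}{K}\right) = 1.$$

Each factor can therefore be inverted on its own: a triangular factor by negating its lifting polynomial, and the diagonal factor by exchanging $K$ and $1/K$.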
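Given the input sequence $x_n$, $n = 0, 1, \ldots, N - 1$, where $N$ is the length of the input sequence, the detailed lifting procedure is given in four steps.
1) Splitting step:
$$\text{Odd part:}\quad d_i^{(0)} = x_{2i+1} \tag{2}$$
$$\text{Even part:}\quad s_i^{(0)} = x_{2i}. \tag{3}$$
2) First lifting step:
$$\text{Predictor:}\quad d_i^{(1)} = d_i^{(0)} + \alpha \times \left(s_i^{(0)} + s_{i+1}^{(0)}\right) \tag{4}$$
$$\text{Updater:}\quad s_i^{(1)} = s_i^{(0)} + \beta \times \left(d_{i-1}^{(1)} + d_i^{(1)}\right). \tag{5}$$
3) Second lifting step:
$$\text{Predictor:}\quad d_i^{(2)} = d_i^{(1)} + \gamma \times \left(s_i^{(1)} + s_{i+1}^{(1)}\right) \tag{6}$$
$$\text{Updater:}\quad s_i^{(2)} = s_i^{(1)} + \delta \times \left(d_{i-1}^{(2)} + d_i^{(2)}\right). \tag{7}$$
4) Scaling step:
$$d_i = d_i^{(2)} / K \tag{8}$$
$$s_i = K \times s_i^{(2)}. \tag{9}$$
$d_i^{(l)}$ and $s_i^{(l)}$ are intermediate data, where $l$ denotes the stage of the lifting step. Outputs $d_i$ and $s_i$, $i = 0, \ldots, (N - 1)/2$, are the high-pass and low-pass wavelet coefficients.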
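The four steps above can be sketched in software as follows. This is a minimal illustration and not the authors' implementation: the function name `lift_97` is hypothetical, the input length is assumed to be even, and out-of-range neighbours are clamped to the nearest valid index, which is a boundary rule the text does not specify.

```python
# 9/7 lifting coefficients as quoted in the text.
ALPHA = -1.586134342
BETA = -0.052980118
GAMMA = 0.8829110762
DELTA = 0.4435068522
K = 1.149604398


def lift_97(x):
    """Forward 9/7 lifting of an even-length sequence, following (2)-(9).

    Out-of-range neighbours are clamped to the nearest valid index; this
    boundary rule is an assumption made only for this sketch.
    """
    if len(x) % 2:
        raise ValueError("this sketch expects an even-length input")
    m = len(x) // 2
    last = m - 1

    # 1) Splitting step, (2)-(3).
    d0 = [x[2 * i + 1] for i in range(m)]
    s0 = [x[2 * i] for i in range(m)]

    # 2) First lifting step, (4)-(5).
    d1 = [d0[i] + ALPHA * (s0[i] + s0[min(i + 1, last)]) for i in range(m)]
    s1 = [s0[i] + BETA * (d1[max(i - 1, 0)] + d1[i]) for i in range(m)]

    # 3) Second lifting step, (6)-(7).
    d2 = [d1[i] + GAMMA * (s1[i] + s1[min(i + 1, last)]) for i in range(m)]
    s2 = [s1[i] + DELTA * (d2[max(i - 1, 0)] + d2[i]) for i in range(m)]

    # 4) Scaling step, (8)-(9).
    high = [v / K for v in d2]
    low = [K * v for v in s2]
    return high, low
```

With the coefficients above, a constant input such as `lift_97([5.0] * 16)` yields high-pass values very close to zero, which is a convenient sanity check for the signs and indices in (4)–(7).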
From (4)–(7), it is clear that the first and second lifting steps have the same structure and differ only in their lifting coefficients, so they can be implemented using the same architecture. Thus, the architecture for the first lifting step can be multiplexed using the folded method to reduce the hardware resources and area. Based on this idea, we propose a novel folded architecture for the lifting-based DWT.
Fig. 2. Data flow of 9/7 lifting-based DWT.
Fig. 3. (a) Processing of the intermediate data. (b) Data flow optimization with a four-stage pipeline.
III. PROPOSED ARCHITECTURE
It is well known that, in the lifting scheme, the way the intermediate data are processed determines the hardware scale and critical path latency of the implementing architecture. In the following, we use the parallel and pipeline techniques to process the intermediate data. The corresponding architecture is repeatable. Thus, it can be improved further, leading to the EFA.
Fig. 4. Corresponding OA (see footnote 1).
Fig. 5. Proposed folded architecture.
From Section II, it can be seen that the conventional 9/7 lifting involves two lifting steps and one scaling step. The corresponding data flow is shown in Fig. 2. In this figure, $d_{i+2}^{(0)}$ and $s_{i+2}^{(0)}$ are the input data of the current cycle. $d_{i+1}^{(0)}$, $s_{i+1}^{(0)}$, $d_i^{(1)}$, $s_i^{(1)}$, and $d_{i-1}^{(2)}$ are the intermediate data obtained from the internal memory. $d_i^{(2)}$ and $s_i^{(2)}$ are the outputs of the current cycle. These intermediate data are used for computing $d_i^{(2)}$ and $s_i^{(2)}$. However, the conventional lifting scheme processes them serially; thus, the critical path latency is very long. In fact, from (4)–(7), we can choose $d_{i+1}^{(0)} + \alpha \times s_{i+1}^{(0)}$, $s_{i+1}^{(0)} + \beta \times d_i^{(1)}$, $d_i^{(1)} + \gamma \times s_i^{(1)}$, and $s_i^{(1)} + \delta \times d_{i-1}^{(2)}$ as the intermediate data for computing $d_i^{(2)}$ and $s_i^{(2)}$. Since these intermediate data are on different paths, we can calculate them in parallel. That is, we use four delay registers D1, D2, D3, and D4 to store these intermediate data in the same cycle. The delay registers are expressed as
$$\begin{aligned} D1 &= d_{i+1}^{(0)} + \alpha \times s_{i+1}^{(0)}, & D2 &= s_{i+1}^{(0)} + \beta \times d_i^{(1)}, \\ D3 &= d_i^{(1)} + \gamma \times s_i^{(1)}, & D4 &= s_i^{(1)} + \delta \times d_{i-1}^{(2)}. \end{aligned} \tag{10}$$
Fig. 3(a) shows this operation. With this parallel operation, the critical path latency is reduced, and the number of registers is decreased.
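The role of the four delay registers in (10) can be checked with a small software model. The sketch below is an illustration under stated assumptions, not the authors' design: the function names are hypothetical, the input is zero-extended at both ends (the text does not specify boundary handling), and pipeline registers are not modelled, so each loop iteration performs the whole per-cycle update in sequence. It compares the register-based recurrence with a direct evaluation of (2)–(7).

```python
ALPHA, BETA = -1.586134342, -0.052980118
GAMMA, DELTA = 0.8829110762, 0.4435068522


def lifting_direct(x):
    """Reference: (2)-(7) evaluated on a zero-extended, even-length input."""
    pad = 4  # two zero (even, odd) pairs on each side; an assumption
    xp = [0.0] * pad + list(x) + [0.0] * pad
    s0, d0 = xp[0::2], xp[1::2]
    m = len(s0)

    def at(seq, k):
        return seq[k] if 0 <= k < m else 0.0

    d1 = [d0[i] + ALPHA * (s0[i] + at(s0, i + 1)) for i in range(m)]
    s1 = [s0[i] + BETA * (at(d1, i - 1) + d1[i]) for i in range(m)]
    d2 = [d1[i] + GAMMA * (s1[i] + at(s1, i + 1)) for i in range(m)]
    s2 = [s1[i] + DELTA * (at(d2, i - 1) + d2[i]) for i in range(m)]
    half = len(x) // 2
    return d2[2:2 + half], s2[2:2 + half]


def lifting_registers(x):
    """Same outputs computed cycle by cycle with registers D1..D4 of (10)."""
    half = len(x) // 2
    D1 = D2 = D3 = D4 = 0.0
    d_out, s_out = [], []
    # One (odd, even) input pair per cycle, plus two zero pairs to flush.
    pairs = [(x[2 * k + 1], x[2 * k]) for k in range(half)] + [(0.0, 0.0)] * 2
    for k, (d_in, s_in) in enumerate(pairs):
        # d_in, s_in play the role of d^(0)_{i+2}, s^(0)_{i+2} with i = k - 2.
        d1 = D1 + ALPHA * s_in   # d^(1)_{i+1}
        s1 = D2 + BETA * d1      # s^(1)_{i+1}
        d2 = D3 + GAMMA * s1     # d^(2)_i
        s2 = D4 + DELTA * d2     # s^(2)_i
        if k >= 2:               # the first two cycles only fill the registers
            d_out.append(d2)
            s_out.append(s2)
        # Register candidates for the next cycle, as in (10) with i -> i + 1.
        D1 = d_in + ALPHA * s_in
        D2 = s_in + BETA * d1
        D3 = d1 + GAMMA * s1
        D4 = s1 + DELTA * d2
    return d_out, s_out
```

Each output needs only one multiply-add on top of a stored register value, which is the software counterpart of the shortened critical path described above.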
According to the aforementioned processing, the data flow of 9/7 lifting can be optimized into a four-stage pipeline flow. This is shown in Fig. 3(b). In this figure, the data read from the four delay registers, shown in gray circles, are used for the current computation. Data D1, D2, D3, and D4 along the arrows are the candidates for the delay registers. They are computed in the current cycle and will be used in the next cycle.
Based on the optimized data flow shown in Fig. 3(b), the corresponding OA can be obtained, as shown in Fig. 4. In this figure, the dashed line divides the architecture into two similar parts. Therefore, we can multiplex the left-side architecture in place of the right-side one. In this way, we obtain our proposed EFA, which is shown in the dashed area of Fig. 5.
Footnote 1: Here, we do not consider the splitting and scaling steps.
In the following, we show how the EFA processes the two lifting steps of the 9/7 filter. Intermediate data $d_i^{(1)}$ and $s_i^{(1)}$, which are obtained from the first lifting step, are fed back to pipeline registers P1 and P2. They are used for the second lifting step. As a result, the first and second lifting steps are interleaved by selecting their own coefficients. In this procedure, two delay registers D3 and D4 are needed in each lifting step for proper scheduling. Table I shows the detailed processing of the 9/7 wavelet, where, for a given input sequence $x_n$, we obtain outputs $d_i^{(2)}$ and $s_i^{(2)}$. In the proposed architecture, the internal processing unit runs at twice the rate of the even (odd) input/output data. This means that the input/output data rate of the DWT processor is one sample per clock cycle. The proposed architecture needs only four adders and two multipliers, which is half the number required by the architecture shown in Fig. 4.
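The folding idea can be illustrated with a behavioural model in which a single processing element is called twice per input pair, once with $(\alpha, \beta)$ and once with $(\gamma, \delta)$. This is a sketch under assumptions and does not reproduce Fig. 5: the names `processing_element` and `folded_lifting` are hypothetical, the assignment of operations to adders and multipliers is only one plausible reading of the operation count given in the text, the pipeline registers P1 and P2 are not modelled, and the input is zero-extended at both ends.

```python
ALPHA, BETA = -1.586134342, -0.052980118
GAMMA, DELTA = 0.8829110762, 0.4435068522


def processing_element(d_in, s_in, a, b, regs):
    """One pass through the shared datapath: two multiplies and four adds.

    regs holds the two delay registers of the active lifting step and is
    updated in place. Returns the lifted (d, s) pair of that step.
    """
    prod_a = a * s_in            # multiplier 1
    d_out = regs[0] + prod_a     # adder 1: finish the predictor
    regs[0] = d_in + prod_a      # adder 2: partial predictor for next time
    prod_b = b * d_out           # multiplier 2
    s_out = regs[1] + prod_b     # adder 3: finish the updater
    regs[1] = s_in + prod_b      # adder 4: partial updater for next time
    return d_out, s_out


def folded_lifting(x):
    """Run both lifting steps of (4)-(7) on one shared processing element."""
    half = len(x) // 2
    regs_step1 = [0.0, 0.0]  # delay registers used by the first lifting step
    regs_step2 = [0.0, 0.0]  # delay registers used by the second lifting step
    d2, s2 = [], []
    pe_calls = 0
    # One (odd, even) input pair per input period, plus two zero pairs to flush.
    pairs = [(x[2 * k + 1], x[2 * k]) for k in range(half)] + [(0.0, 0.0)] * 2
    for k, (d0, s0) in enumerate(pairs):
        # Internal cycle A: first lifting step, coefficients (alpha, beta).
        d1, s1 = processing_element(d0, s0, ALPHA, BETA, regs_step1)
        # Internal cycle B: results fed back, coefficients (gamma, delta).
        d_hi, s_lo = processing_element(d1, s1, GAMMA, DELTA, regs_step2)
        pe_calls += 2
        if k >= 2:  # the first two periods only fill the registers
            d2.append(d_hi)
            s2.append(s_lo)
    return d2, s2, pe_calls
```

The element is invoked twice for every input pair, which mirrors the statement that the internal processing unit runs at twice the rate of the even (odd) data stream while the arithmetic hardware is shared between the two lifting steps.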
For the splitting, we use one delay register and two switches to split the input into odd/even sequences. The scaling unit consists of one multiplier and one multiplexer. By properly selecting coefficients $1/K$ and $K$, the high-pass and low-pass coefficients are normalized. Fig. 5 shows the complete structure of the proposed EFA, including the splitting, lifting, and scaling steps.
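A behavioural sketch of these two peripheral units is given below. It is only an illustration: the function names are hypothetical, the held sample stands in for the single delay register, and the inner loop over two coefficients stands in for the multiplexer that feeds one shared multiplier.

```python
K = 1.149604398  # scale normalization constant quoted in the text


def split_stream(samples):
    """Split a serial stream into (even, odd) pairs using one held sample."""
    held = None
    pairs = []
    for n, sample in enumerate(samples):
        if n % 2 == 0:
            held = sample                 # even sample waits in the register
        else:
            pairs.append((held, sample))  # odd sample arrives: emit the pair
    return pairs


def scale_stream(d2, s2):
    """Normalize both outputs with one multiplier, alternating 1/K and K."""
    out = []
    for d, s in zip(d2, s2):
        # The multiplexer selects the coefficient; one multiplier does the work.
        for value, coeff in ((d, 1.0 / K), (s, K)):
            out.append(value * coeff)
    return out[0::2], out[1::2]  # (high-pass, low-pass)
```

For example, `split_stream([0, 1, 2, 3, 4, 5])` pairs each even-indexed sample with the odd-indexed sample that follows it, matching (2) and (3).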
The proposed EFA can also be used for 5/3 lifting-based DWT by bypassing delay registers D3 and D4 and selecting
TABLE III
EXPERIMENT RESULTS FOR THE FOLDED ARCHITECTURE [11] AND THE EFA (FOR 9/7 LIFTING-BASED DWT)
A. Performance Comparison
The performance of the proposed EFA is compared with that of other existing architectures in Table II, in terms of critical path latency, control complexity, throughput rate, and hardware complexity (measured by the number of multipliers, adders, and registers).
From Table II, it can be seen that, compared with the other architectures, the proposed EFA requires the fewest arithmetic resources (two multipliers and four adders) while still providing a throughput rate of one input/output sample per cycle. The critical path latency is $T_m + T_a$, which is slightly longer than that of the Direct+full pipeline, the Flipping+five-stage pipeline, and the pipeline proposed by Wu and Lin [13]. However, the number of registers is ten, which is fewer than in the three aforementioned architectures. Additionally, the control complexity is medium, whereas that of the Flipping+five-stage pipeline and of Wu and Lin [13] is complex. The results indicate that the proposed EFA provides a good tradeoff among critical path latency, hardware cost, and control complexity.
B. FPGA Implementation
We have developed a synthesizable Verilog HDL model to implement the proposed EFA on the Altera Stratix II FPGA EP2S15F484C5 using the Quartus II 7.2 platform. In the implementation, a 12-bit data bus width is chosen, and the coefficients are quantized to fixed point with 12-bit precision for the fractional part. In order to verify the efficiency of the proposed architecture, we redesigned the architecture presented in [11, Fig. 5] under the same conditions. The implementation results in terms of dedicated logic registers, combinational adaptive look-up tables (ALUTs), and critical path latency are shown in Table III. Our architecture has a shorter critical path latency and uses fewer hardware resources than the folded architecture in [11]. The reason is that the proposed EFA is based on the optimized data flow of a conventional lifting-based DWT, whereas the architecture in [11] is based on conventional lifting. Additionally, we multiplex the multiplier for scaling in order to reduce the hardware.
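To make the fixed-point and shift–add ideas concrete, the sketch below quantizes a coefficient to 12 fractional bits and multiplies by it using only shifts and additions. It is a generic binary-expansion illustration and an assumption on our part: the actual shift–add decomposition used in the Verilog model is not given in this text, and the function names are hypothetical.

```python
FRAC_BITS = 12  # fractional precision quoted in the text


def quantize(coeff, frac_bits=FRAC_BITS):
    """Round a real coefficient to an integer with frac_bits fractional bits."""
    return round(coeff * (1 << frac_bits))


def shift_add_multiply(x, q):
    """Multiply integer x by integer q using only shifts and additions."""
    magnitude = abs(q)
    acc = 0
    shift = 0
    while magnitude:
        if magnitude & 1:        # this bit of the coefficient is set
            acc += x << shift    # add the correspondingly shifted operand
        magnitude >>= 1
        shift += 1
    return -acc if q < 0 else acc


def fixed_point_multiply(x, coeff, frac_bits=FRAC_BITS):
    """Approximate x * coeff for an integer sample x in fixed point."""
    q = quantize(coeff, frac_bits)
    # Arithmetic right shift drops the fractional bits (rounds toward -inf).
    return shift_add_multiply(x, q) >> frac_bits
```

With 12 fractional bits, each quantized coefficient differs from its real value by at most $2^{-13}$, which bounds the coefficient error introduced before any rounding of the products.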
V. CONCLUSION
In this brief, we have proposed a novel EFA for the lifting-based DWT. We have given a new formula for the conventional lifting algorithm. Then, by employing the parallel and pipeline techniques, the conventional data flow of the lifting-based DWT is converted to a parallel one, resulting in an OA that is repeatable. Based on this property, the proposed EFA is derived from the OA by further employing the fold technique. The FPGA implementation results show that the proposed EFA has short critical path latency and achieves high hardware utilization. Performance comparisons indicate that our EFA provides an efficient tradeoff among critical path latency, hardware cost, and control complexity.
ACKNOWLEDGMENT
The authors would like to thank the Associate Editor and
anonymous reviewers for their valuable comments.
REFERENCES
[1] A. N. Akansu and R. A. Haddad, Multiresolution Signal Decomposition: Transforms, Subbands and Wavelets. New York: Academic, 1992.
[2] J. M. Shapiro, “Embedded image coding using zerotrees of wavelet coefficients,” IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3445–3462, Dec. 1993.
[3] S. C. B. Lo, H. Li, and M. T. Freedman, “Optimization of wavelet decomposition for image compression and feature preservation,” IEEE Trans. Med. Imag., vol. 22, no. 9, pp. 1141–1151, Sep. 2003.
[4] K. K. Parhi and T. Nishitani, “VLSI architectures for discrete wavelet transforms,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 1, no. 3, pp. 191–202, Jun. 1993.
[5] C. Chakrabarti and M. Vishwanath, “Efficient realizations of discrete and continuous wavelet transforms: From single chip implementations to mapping on SIMD array computers,” IEEE Trans. Signal Process., vol. 43, no. 3, pp. 759–771, Mar. 1995.
[6] M. Vishwanath, R. M. Owens, and M. J. Irwin, “VLSI architectures for the discrete wavelet transform,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 42, no. 5, pp. 305–316, May 1995.
[7] W. Jiang and A. Ortega, “Lifting factorization-based discrete wavelet transform architecture design,” IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 5, pp. 651–657, May 2001.
[8] K. Andra, C. Chakrabarti, and T. Acharya, “A VLSI architecture for lifting-based forward and inverse wavelet transform,” IEEE Trans. Signal Process., vol. 50, no. 4, pp. 966–977, Apr. 2002.
[9] H. Liao, M. K. Mandal, and B. F. Cockburn, “Efficient architectures for 1-D and 2-D lifting-based wavelet transforms,” IEEE Trans. Signal Process., vol. 52, no. 5, pp. 1315–1326, May 2004.
[10] J. M. Jou, Y. H. Shiau, and C. C. Liu, “Efficient VLSI architectures for the biorthogonal wavelet transform by filter bank and lifting scheme,” in Proc. IEEE Int. Symp. Circuits Syst., 2001, pp. 529–532.
[11] C. J. Lian, K. F. Chen, H. H. Chen, and L. G. Chen, “Lifting based discrete wavelet transform architecture for JPEG2000,” in Proc. IEEE Int. Symp. Circuits Syst., Sydney, Australia, 2001, pp. 445–448.
[12] C. T. Huang, P. C. Tseng, and L. G. Chen, “Flipping structure: An efficient VLSI architecture for lifting based discrete wavelet transform,” IEEE Trans. Signal Process., vol. 52, no. 4, pp. 1080–1089, Apr. 2004.
[13] B. F. Wu and C. F. Lin, “A high-performance and memory-efficient pipeline architecture for the 5/3 and 9/7 discrete wavelet transform of JPEG2000 codec,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 12, pp. 1615–1628, Dec. 2005.
[14] W. Sweldens, “The lifting scheme: A custom-design construction of biorthogonal wavelets,” Appl. Comput. Harmon. Anal., vol. 3, no. 15, pp. 186–200, Apr. 1996.
[15] I. Daubechies and W. Sweldens, “Factoring wavelet transforms into lifting schemes,” J. Fourier Anal. Appl., vol. 4, no. 3, pp. 247–269, 1998.
[16] X. Lan, N. Zheng, and Y. Liu, “Low-power and high-speed VLSI architecture for lifting-based forward and inverse wavelet transform,” IEEE Trans. Consum. Electron., vol. 51, no. 2, pp. 379–385, May 2005.
[17] T. Acharya and C. Chakrabarti, “A survey on lifting-based discrete wavelet transform architectures,” J. VLSI Signal Process., vol. 42, no. 3, pp. 321–339, Mar. 2006.