2.asip for h
Post on 29-May-2018
228 Views
Preview:
TRANSCRIPT
-
8/9/2019 2.ASIP FOR H
1/15
Journal of Signal Processing Systems 50, 5367, 2008
* 2007 Springer Science + Business Media, LLC. Manufactured in The United States.
DOI: 10.1007/s11265-007-0109-y
ASIP Approach for Implementation of H.264/AVC
SUNG DAE KIM AND MYUNG H. SUNWOO
School of Electrical and Computer Engineering, Ajou University, San 5, Wonchun-Dong,
Yeongtong-Ku, Suwon, 442-749, South Korea
Received: 3 April 2007; Accepted: 13 June 2007
Abstract. This paper presents an Application Specific Instruction Set Processor (ASIP) for implementation ofH.264/AVC, called Video Specific Instruction-set Processor (VSIP). The proposed VSIP has novel instructions
and optimized hardware architectures for specific applications, such as intra prediction, in-loop deblocking filter,
integer transform, etc. Moreover, VSIP has coprocessors for computation intensive parts in video signal
processing, such as inter prediction and entropy coding. The proposed VSIP has much smaller area and can
dramatically reduce the number of memory access compared with commercial DSP chips, which result in low
power consumption. Moreover, the proposed hardware accelerators have small size, consume low power
consumption, and thus, they can support real-time video processing. VSIP has been thoroughly verified using an
FPGA board having the Xilinxi Virtex II. VSIP can implement a real-time H.264/AVC decoder. The proposed
VSIP is one of promising solutions for video signal processing.
Keywords: application specific instruction-set processor, hardware software codesign, H.264/AVC,low power design, data reuse, hardware accelerator
1. Introduction
With the rapid progress of semiconductor technolo-
gy, Application Specific Instruction-set Processor
(ASIP), which adopts high performance and low
power of ASIC and flexibility of DSP, has become
increasingly important. The market of ASIP is
growing fast since the sales of portable devices,
such as cellular phones, digital cameras, MP3
(MPEG Layer-3) players, PMP (Portable MultimediaPlayer), etc. are dramatically increasing. These
applications need high performance, low power
consumption and low cost. Application-Specific
Integrated Circuit (ASIC) designs can reduce the
cost, size, and power consumption of systems.
However, ASIC designs have been found inadequate
to upgrade standards since they should be rede-
signed. On the other hand, programmable DSPs can
greatly reduce time-to-market and allow faster
changes and upgrades. However, programmable
DSPs may have the disadvantages related with cost,
size, and power consumption. ASIP can compromise
advantages of ASIC designs and general DSP chips
[15]. In other words, ASIP chips adopt high
performance and low power of ASIC chips and
flexibility of DSP chips. ASIP can give low power at
the algorithm/architecture level, which can provide
the most efficient way to achieve low powerconsumption [6].
Multimedia technology has been developed with
the progress of semiconductor technology. Technol-
ogy related to multimedia codec has been standard-
ized as MPEG-2, MPEG-4, H.261, H.263, etc. The
Joint Video Team (JVT) announced H.264/AVC in
Dec. 2003 [7]. The new video coding standard
H.264/AVC can provide twice as much as higher
-
8/9/2019 2.ASIP FOR H
2/15
compression efficiency than MPEG-4. However, it
has about 2 times more hardware complexity for a
decoder, and about 10 times more hardware com-
plexity for an encoder than the MPEG-4 visual
simple profile codec [8].
In mobile communications, the implementation of
multimedia codec needs high performance, low
power consumption and low cost. The implementa-
tion also requires the flexible system which can
upgrade without replacing the system. The ASIP
approach can be quite suitable for these require-
ments. Hence, we propose an ASIP for implementa-
tion of mobile multimedia codec, called Video
Specific Instruction-set Processor (VSIP) [5]. We
implement the VSIP based on the design flow of
ASIP as shown in Fig. 1 [9].
First, the target application is chosen. H.264/AVCis widely used in mobile communication standards,
such as DMB, DVB-S2, DVB-T, etc. Hence, the
target application of VSIP is video signal processing
including H.264/AVC. Second, we profile the H.264/
AVC tasks. Through profiling, we can find the
complexity of H.264/AVC tasks. According to the
complexity of each task, we can divide the applica-
tion into hardware implementation for high perfor-
mance and software implementation for flexibility.
H.264/AVC has computation intensive parts such as
inter prediction and entropy coding. To achieve low
power consumption and real-time processing, hard-
ware accelerators for inter prediction and entropy
coding are required. Next, we design the optimized
instruction set and their architecture based on the
analysis. H.264/AVC has new features such as intra
prediction, in-loop deblocking filter, integer trans-
form, etc. It is inefficient to implement these blocks
using existing DSP instructions. Hence, we propose
new instructions and their architecture to implement
H.264/AVC efficiently. The optimized instruction set
can reduce computation complexity, redundancy and
overhead. In general, computation cycles to perform
target applications in an ASIP are much less thanthose of general DSPs. Finally, the functions of the
proposed VSIP have been thoroughly verified using
the Xilinx XC2v6000 FPGA.
The proposed VSIP can efficiently perform new
features of H.264/AVC, such as intra prediction, in-
loop deblocking filter, and integer transform. More-
over, VSIP has hardware accelerators for inter
Target applicationselection
Application profiling
H/W, S/W partitioning
Design special instructionsand architecture
Verification and
performance comparison
Design hardwareaccelerators
Chip fabrication
Figure 1. Design flow of ASIP.
54 Kim and Sunwoo
-
8/9/2019 2.ASIP FOR H
3/15
prediction and entropy coding that occupy the largest
portion of power consumption and critical timing
parts of video processing. Hence, VSIP can imple-
ment a real-time and low power H.264/AVC baseline
profile decoder in QCIF format.
This paper is organized as follows. Section 2
introduces H.264/AVC and describes existing DSP
instructions to implement multimedia standards.
Section 3 proposes novel instructions and hardware
accelerators, and Section 4 explains performance
comparisons. Finally, Section 5 contains concluding
remarks.
2. Implementation Analysis for H.264/AVC
This section introduces briefly H.264/AVC and
shows the results of profiling. Then, various imple-
mentations of H.264/AVC are analyzed. This section
also presents existing DSP instructions for video
signal processing.
2.1. Implementation of H.264/AVC
H.264/AVC has adopted new features to improve
code efficiency, which are described as follows.
Figure 2. Complexity analysis of the H.264/AVC baseline profile.
Figure 3. Computation times of various implementations. a DSP. b ASIC. c VSIP + accelerators.
ASIP Approach for Implementation of H.264/AVC 55
-
8/9/2019 2.ASIP FOR H
4/15
H.264/AVC uses several reference frames, variable
block size, and quarter pixel accuracy in Motion
Estimation (ME)/Motion Compensation (MC). These
features enable the encoder to search for the best
match for the current frame. However, the memoryaccess and hardware complexity increase significant-
ly. The past standards, such as MPEG-2, MPEG-4,
H.263, etc., transmit the first frame without com-
pression. On the other hand, the H.264/AVC encoder
adopts intra prediction, which eliminates the redun-
dancy of intra frame.
The block based structure causes blocking arti-
facts. Thus, H.264/AVC adopts the in-loop deblock-
ing filter to eliminate blocking artifacts. The
Exponential Golomb Coding (EGC) and Context
Adaptive Variable Length Coding (CAVLC) are also
the newly adopted features of the H.264/AVCbaseline profile. EGC uses variable length codes
with a regular construction [10]. CAVLC is the
method used to encode the residual data of 44blocks [1113].
Figure 2 shows the operation complexity of the
H.264/AVC baseline profile [14]. As shown in Fig. 2,
ME/MC takes 53% and VLC takes 18.20% of theoperation complexity. Especially, these tasks access
memory frequently. To achieve low power consump-
tion we need the dedicated hardware for these tasks.
In practice, the computation complexity of VLC is not a
dominant part in H.264/AVC. Intra prediction and in-
loop deblocking filter have more computation complex-
ity compared with VLC. However, VLC requires bit
manipulation operations which are inefficient to be
implemented on a general processor. Hence, we employ
the dedicated hardware for VLC. Moreover, inter
prediction can be executed in parallel with intra
prediction and entropy coding can also be executed inparallel with in-loop deblocking filter. Thus parallelism
can be exploited for these tasks. The proposed instruc-
AUPCU
Program Counter
Instruction Register
FSM
Stack
Interrupt Controller
Program Memory
Data Memory 1
Data Memory 2
DPU
Program Bus (16 Bit)
Address Buses (16 Bit)
Data Buses (32 Bit)
AGU
MAC MAC ALU ALU Shifter
Register File
AGU
16 Bit Address Registers
Prefetch Logic
IPA
ME Hardware
Accelerator
MC Hardware
Accelerator
ECA
CAVLC
Accelerator
EGC
Accelerator
Figure 5. Proposed VSIP architecture.
DOTPU4
src1
src2
dst
a0 a1 a2 a3
b0 b1 b2 b3
(a0*b0) + (a1*b1) + (a2*b2) + (a3*a4)
Figure 4. DOTPU4 instruction in TMS320c64.
56 Kim and Sunwoo
-
8/9/2019 2.ASIP FOR H
5/15
tions of VSIP can efficiently support intra prediction and
in-loop deblocking filter [5]. To maximize the usage of
hardware resource, the hardware accelerators of inter
prediction and entropy coding are essential. Thus, the
proposed VSIP employs the hardware accelerators forthese tasks.
Figure 3 shows computation times of DSP, ASIC,
and VSIP implementations according to the profiling
results. Figure 3(a) shows the computation times of
the DSP implementation. If a single DSP is used to
implement the H.264/AVC algorithm, the DSP
serially executes all of the algorithm blocks.
Figure 3(b) shows the computation times of the
ASIC implementation. Each block is executed using
the dedicated hardware. However, all of the blocks
cannot be executed in parallel, since some blocks use
the output of other blocks. For example, the
transform block uses ME results and the entropy
coding block needs transform/quantization results.
Figure 3(c) shows the proposed VSIP having accel-
erators. The VSIP implementation having acceler-
ators requires more computation times than the ASIC
implementation. However, it requires much less
computation times than DSP and can support various
profiles and standards.
2.2. Existing DSP Instructions for Video Signal
Processing
Existing DSPs support various instructions to exe-cute packed operations between two registers. These
operations are used for various video signal process-
ing, such as DCT, IDCT, ME/MC, etc. TMS320c6of Texas Instruments supports special instructions
for multimedia signal processing, such as SUB-
ABS4, AVGx, etc. [15]. The SUBABS4 instruction
calculates absolute differences of four pairs of the
packed data. The AVG4 instruction calculates
averages of the packed data in two registers. After
additions of four packed data, four results are shifted
a bit to the left for division, and 0.5 is added to each
result for rounding. The TMS320c6 series also
support the DOTPU4 instruction which calculatesthe dot product between four sets of packed 8 bit
values. Figure 4 shows the operation flow of the
DOTPU4 instruction. The values in both src1 and
src2 are treated as the unsigned 8 bit packed data.
The 32 bit unsigned result is written into dst. Four
clock cycles are required to execute this instruction.
DCT has a regular computation flow, while ME/MC
and entropy coding have control based computations.
TMS320c55 has a coprocessor for DCT computa-tions, and it requires 2.8 MIPS for DCT computations
to achieve the processing speed of 30 fps for the QCIF
format. TMS320c6 having eight function unitsrequires 1.1 MIPS to implement DCT of 30 QCIF fps
video data using DSP instructions [16].
In entropy coding, the code word table is referred
according to the number of successive zeros in the
input bit stream. Moreover, packed compare oper-
ations are required. To execute these operations,
TMS320c64 supports the LMBD and CMPEQ/GT/LT instructions, and the Blackfin DSP of Analog
Device supports the ONES instruction [16, 17]. The
q3 q2 q1 q0 p0 p1 p2 p3
boundary
Figure 6. Block boundary.
a b c d
e f g h
i j k l
m n o p
A B C D E F G H
I
J
K
L
Q
M
N
O
P
Figure 7. Identification of samples for 44 intra prediction.
ASIP Approach for Implementation of H.264/AVC 57
-
8/9/2019 2.ASIP FOR H
6/15
dst = HADD(src)
dst = HADD(src:mask)
dst = HADD(src:mask1.mask2)
a0 a1 a2 a3
dst
src
a
a0 a1 a2 a3
dst
a0
-
8/9/2019 2.ASIP FOR H
7/15
LMBD instruction counts the number of zeros in a
register. The CMPEQ/GT/LT instructions compare
pairs of 8 bits or 16 bit packed data.
3. Proposed Instructions and Accelerators
This section presents an overall architecture, new
instructions and hardware accelerators for the H.264/
AVC codec.
3.1. Overall Architecture of the Proposed VSIP
Figure 5 shows the overall architecture of the
proposed VSIP. The proposed VSIP consists of two
parts, a programmable DSP part and a hardware
accelerator part. The DSP part has a program control
unit (PCU), a data processing unit (DPU), and anaddress unit (AU). The hardware accelerator part has
an Inter Prediction Accelerator (IPA) and an Entropy
Coding Accelerator (ECA). IPA consists of an ME
accelerator and an MC accelerator. ECA has a
CAVLC accelerator and an EGC accelerator. The
hardware accelerators can operate in parallel with the
DSP units.
PCU consists of a prefetch logic, a program
counter, an instruction register, an FSM (Finite State
Machine), a stack, and an interrupt controller. DPU
consists of two Multiply and Accumulate (MAC)
units for two 16-bit by 16-bit multiplications and
accumulations, two Arithmetic Logic units (ALU), a
barrel shifter and a register file. AU has two address
generation units (AGU) for load and store. Each of
the internal word lengths is 32 bit. The instruction
pipeline consists of six stages, that is, pre-fetch,
fetch, decode, execute1, execute2, and execute3. The
proposed ASIP has 35 arithmetic instructions, 11
logical and shift instructions, 6 program control
instructions, 4 move instructions and 16 special
instructions including instructions for H.264/AVC,
which are described next.
3.2. Proposed Instructions for In-loop Deblocking
Filter and Intra Prediction
The in-loop deblocking filter is used to eliminate
blocking artifacts as mentioned in Section 2. Figure 6
shows 8 pixels of neighboring 44 blocks. The8 pixel values are decided according to the boundary
Strength (bS), which represents the difference of two
neighboring blocks, using p0 p3 and q0 q3.
The equations calculating pixel values are defined
in [7]. The equations can be classified into five
categories as follows.
p2 p1 p0 1
p2 2 p1 2 p0 2
2 p3 3 p2 p1 p0 3
2 p1 p0 4
p0 q0 1 ) 1 5
p0
p3 are the packed data in a register, and q0
q3 are also the packed data in another register. Then,
Eq. (1) shows the additions of three packed data in
one register. Eq. (2) represents one bit shift left
operations of two data followed by additions of three
packed data in the same register. Eq. (3) shows one
bit shift left operation of data and a multiplication
operation of data followed by the additions of four
packed data. Eq. (4) shows one bit shift left
operation of the packed data followed by an addition
Figure 9. Assembly program of core block for in-loop deblocking filter.
ASIP Approach for Implementation of H.264/AVC 59
-
8/9/2019 2.ASIP FOR H
8/15
of two packed data. Eq. (5) shows an addition of the
most significant byte (MSB) of one register and the
least significant byte (LSB) of the other registerfollowed by one bit shift operation. Even though
these computations are packed operations, these
operations do not occur between two registers as
shown in Fig. 6, but they occur between the packed
data within the same register.
As mentioned in Section 2, the intra predictioneliminates the redundancy of intra frame and inter
frame, which has few redundancies between two
frames. Figure 7 shows an identification of samples
-2
2
x(0)
x(1)
x(2)
x(3)
X(0)
X(2)
X(1)
X(3)
-
-
-
a
1/2
1/2
-
-
-
x(0)
x(1)
x(2)
x(3)
X(0)
X(2)
X(1)
X(3)
b
Figure 11. Operation flow of 44 integer transform. a 1-D forward transform. b 1-D inverse transform.
Figure 10. Assembly program of intra prediction.
60 Kim and Sunwoo
-
8/9/2019 2.ASIP FOR H
9/15
for 44 intra prediction. a p in Fig. 7 arepredicted using A Q according to the equations
defined in [7] and some of equations are represented
in Eq. (6), where A, B, and C represent pixel values,
and a pixel value is represented using 8 bits. For a 32
bit architecture, A, B, C and D are stored in one
register since a p and A Q in Fig. 7 are 8 bit
values.
A 2 B C 2 ) 2
A B 1 ) 1
A 3 B 2 ) 2
6
As described in Section 2, existing DSPs support
only packed operations between two registers. A
large number of instruction cycles is required to
implement the in-loop deblocking filter and intra
prediction with the existing packed instructions that
execute packed operations between two registers.
Hence, H.264/AVC may require a new instruction to
execute packed operations within a register.
Figure 8 shows the proposed three horizontal addi-
tion (HADD) instructions. Three HADD instructions
are as follows. The proposed instruction in Fig. 8(a)
packs a 32 bit register into four 8 bit data, adds fourpacked data, and then saturates the result to 8 bit data.
Figure 8(b) is similar with Fig. 8(a). However, the
packed data, which is selected by a mask, is one bit
shifted to left. In Fig. 8(c), mask1 selects the data to be
added, and mask2 selects the data to be shifted. Eqs.
(1), (2), (4), (5) of the in-loop deblocking filter and Eq.
(6) of the intra prediction can be implemented using
the proposed HADD instructions.
Intra prediction and in-loop deblocking filterrequire dot product calculation. TI_s TMS320C6supports the DOTPU4 instruction for dot product
calculation which performs packed multiplications of
two registers and adds four results in four cycles.
ADI_s ADSP-BF53 not supporting these specialinstructions requires more clock cycles to perform
dot product calculation. Hence, we only compare the
proposed HADD instruction with the TMS320C6instruction.
Figure 9 shows the assembly program of the core
block for the in-loop deblocking filter. R0 and R1 are
general registers. The packed pixel data is stored inR0 and R1. Each result of the instruction can be
obtained after one clock cycle. Hence, the proposed
VSIP can execute these equations for in-loop
deblocking filter in one clock cycle.
Figure 10 shows the assembly program of the intra
prediction. Acc represents an accumulator and the
packed pixel data is stored in R0. R1 and R2 have
Loop label1 #2
Acc = fTRAN(R0)
R(4) = MOVR(Acc)Acc = fTRAN(R1)
R(5) = MOVR(Acc)Acc = fTRAN(R2)
R(6) = MOVR(Acc)
Acc = fTRAN(R3)
R(7) = MOVR(Acc)G0 = TRAN(G1)label1
Figure 13. Assembly program of 44 forward 2-D integertransform.
A B C Dsrc
A+B+C+Ddst (B-C)+2(A-D) A+D-(B+C) (A-D)-2(B-C)
fTRAN
A B C DR0 A E I MR4
B F J NR5
C G K OR6
D H L PR7
TRANE F G HR1
I J K LR2
M N O PR3
a
b
Figure 12. Operation flow of fTRAN and TRAN instruction. a Operation flow of fTRAN instruction. b Operation flow of TRAN
instruction.
ASIP Approach for Implementation of H.264/AVC 61
-
8/9/2019 2.ASIP FOR H
10/15
offset values for rounding. The ADDAR instruction
in VSIP calculates an addition of two source data.
After the addition, the result is shifted to right by the
immediate value in the instruction. Each result of the
instruction can be obtained after one clock cycle.
Hence, the proposed VSIP can execute these equa-
tions for intra prediction in two clock cycles.
3.3. Proposed Instructions for Integer Transform
The 44 integer transform can be operated using theforward transform as shown in Fig. 11(a). The
forward transform is executed with four rows of four
packed data. Then, the forward transform is per-
formed again with four columns of four packed data
to get the results of the 44 integer transform.Figure 11(b) represents an inverse transform. Simi-
larly, the 44 inverse integer transform can beexecuted using the operations in Fig. 11(b).
This paper proposes novel instructions to efficient-
ly execute the forward/inverse 44 integer transformas follows.
dst fTRAN src dst iTRAN src :
Each instruction performs the operations of Fig. 11(a)
and (b).
Figure 15. ME computation flow. a Existing computation flow.
b Proposed computation flow.
4 x 4 Current Block
Reference Picture
SAD operation
a
4 x 4 Current Block
Reference Picture
SAD operation
b
Figure 14. Operation flow of the proposed motion estimation. a ME operation in the first cycle. b ME operation in the second cycle.
62 Kim and Sunwoo
-
8/9/2019 2.ASIP FOR H
11/15
Figure 12(a) shows the operation flow of the fTRAN
instruction. The fTRAN instruction reads a 32 bit
general register in one register file, which consists of
four 32 bit registers, and executes the operation flow in
Fig. 12(a). Then, the results are written in anotherregister file consisting of four 32 bit registers. The
iTRAN instruction performs a similar operation.
These instructions can be implemented using the
adders and eight additional 21 multiplexers. Figure12(b) shows the operation flow of the TRAN
instruction. The general register file has a 44 matrixwhose elements are 8 bit pixel data. The TRAN
instruction in VSIP executes the transpose of a 44matrix as shown in Fig. 12(b).
44 integer transform can be easily programmedwith the proposed instructions. The assembly pro-
gram of 44 forward 2-D integer transform is shownin Fig. 13. G0 and G1 represent two general register
files each of which contains R0, R1, R2, and R3 or
R4, R5, R6, and R7, respectively. The Loop instruc-
tion repeats the program until the label. The number
of repeats is determined by the immediate value in
the instruction. The MOVR instruction moves the
source data into the general register that has four
8 bit pixel data. The general register file has four
general registers. Each VSIP instruction requires one
clock cycle. As you can see, the program in Fig. 13
has nine instructions for 1-D forward integer trans-
form. To execute 2-D transform, 18 instructions for
computation and 1 instruction for program control
are needed. Hence, 19 cycles are required to execute
this program. If the data load cycle is added, the total
cycles of 44 forward 2-D integer transform are 23cycles. The implementation using the instructions of
TMS320c55 (SW) for integer transform requiresmore than 1,078 cycles [18]. Hence, the proposed
fTRAN and iTRAN instructions can be more
efficient than the existing DSP instructions for
integer transform.
3.4. Proposed Accelerator for Inter Prediction
The proposed MC accelerator supports the motion
vector with quarter pixel accuracy which is one of
the key features of H.264/AVC. However, it does not
yet support the multiple reference frames which is
another key feature of H.264/AVC. As described in
Section 2, ME/MC should frequently access memo-
First 1 detect Level Decode Table Update
First 1 detect Level Decode Table Update
Pipeline Stage
Stage 1 Stage 2 Stage 3 Stage 4 Stage 5
a
Level Decode
Pre TableUpdate
First 1 detect
Level Decode
Pre TableUpdate
First 1 detect
Pipeline Stage
Stage 1 Stage 2 Stage 3
b
Figure 17. Comparison of flows for the level of the nonzero coefficient decoding. a Generic level of the nonzero coefficient decoding flow.
b Proposed level of the nonzero coefficient decoding flow.
00
vlc=1
vlc=2
vlc=3
vlc=4vlc=5
vlc=6
001X
0011X
00111X
001111X0011111X
00111111X
11 11 Xvlc=n
n
Figure 16. Threshold value of each level decoding table.
ASIP Approach for Implementation of H.264/AVC 63
-
8/9/2019 2.ASIP FOR H
12/15
ry. From a performance point of view and a low
power point of view, it can be a serious problem.
Thus, the sliding window method is used to alleviate
this problem [19]. Figure 14 illustrates the proposed
ME operation flow.
The proposed ME architecture supports the [+8,
j7] search window. In the [+8, j7] search window,
16 44 blocks exist in a row. In the first cycle, fourSADs are simultaneously calculated as shown in Fig.
14(a). Next, the search window shifts right and each
operation unit repeats the SAD calculation as shown
in Fig. 14(b). The SADs of upper four pixels of every
block in a row can be obtained after four cycles and
16 SADs are stored in buffers. The SADs of the
second upper are calculated in the same way, and the
16 SADs are accumulated with the 16 SADs in
buffers, respectively. Then, after 16 cycles, the 16
SADs of 44 blocks can be obtained.Figure 15 shows the ME computation flows of
general architectures [20] and the proposed architec-ture. In Fig. 15(a), the pixel values in the dotted
block should be fetched again to calculate the SAD
of the dotted block after the SADs of two adjacent
blocks (block 1 and block 2) are obtained. However,
if a 44 block is shifted pixel by pixel as shown inFig. 15(b), the data in the dotted block in Fig. 15(a)
can be reused. Hence, we can achieve the low power
consumption.
3.5. Proposed Accelerator for Entropy Coding
As described in Section 2, H.264/AVC uses EGC and
CAVLC for entropy coding. Since EGC has a regularcoding structure, the EGC accelerator consists of a
barrel shifter, the first one detector, an adder, etc. The
encoder of CAVLC finds the value of the codeword
and the length of the codeword in memory according
to the data. Therefore, the efficient memory address
generator is needed. The decoder of CAVLC is
usually implemented with a lookup table. In the
decoding process, the level of the nonzero coefficient
decoding is an iterative method, which can be
implemented without a lookup table.
A generic decoding process for the level of the
nonzero coefficient is as follows. First, the decoder
obtains the number of successive zeros in the input bit
stream. Next, the decoder calculates the current symbol
length and decodes the current symbol. Finally, the
decoder updates the table information used for next
symbol decoding. The decoder cannot decode the next
symbol until the table information is decided.To increase the decoding speed, we propose the
pre-table update method. Table updating is decided
whether the current symbol value is greater than the
threshold value. Fig. 16 shows the threshold value of
each level decoding table. As shown here, all
threshold values have regular forms. Runs of zeros
are two and runs of ones are equal to the level
decoding table index. Hence, we can update the table
before current symbol decoding.
The generic level of the nonzero coefficient decod-
ing process is shown in Fig. 17(a). The level decoding
process cannot be performed until finishing the table
update process. Therefore, five pipeline stages are
required to decode two symbols. Figure 17(b) shows
the level of the nonzero coefficient decoding process
using the proposed pre-table update method. Since
three pipeline stages are required to decode two
symbols, we can reduce two computation cycles for
level decoding.
4. Implementation and PerformanceComparisons
H.264/AVC can be implemented using VSIP havinghardware accelerators. The proposed VSIP has been
Table 1. Performance comparisons of 44 integer transform.
Parameter TMS320c55 (SW) [18] TMS320c55 (HW) [18] TMS320c64 [16] Proposed VSIP
MIPS 12.8 2.8 1.1 1.1
Figure 18. FPGA emulation.
64 Kim and Sunwoo
-
8/9/2019 2.ASIP FOR H
13/15
modeled by Verilog HDL and thoroughly verified
using the iPROVEi FPGA board having the
Xilinxi
Virtex II shown in Fig. 18.Several core blocks for generating an intra
predictor and an in-loop deblocking filter are
coded using the proposed special instructions and
the same blocks are also coded using the existing
instructions of TMS320c64. The proposed archi-tecture can reduce the number of clock cycles for
generating an intra predictor about 40% compared
with TMS320c6. Moreover, the total number ofclock cycles to execute the in-loop deblocking
fi lt er c an b e r ed uc ed a bo ut 2 02 5% t ha n
TMS320c6. TMS320C64 supports the DOTPU4instruction that executes packed multiplications of
two registers and adds four results in four cycles.Other DSPs require more instructions, since they
do not support the special instructions.
The fTRAN and iTRAN instructions can be
executed in one cycle. Hence, 23 clock cycles are
required to execute 44 integer transform using theproposed instructions and about 1,092,960 clock
cycles for 30 frames ((23 cycles16 blocks)99macro blocks30 frame) are required for QCIFimages, since a QCIF image has 99 1616 macro
blocks. Table 1 shows the number of the required
instructions for 30 frames on existing DSPs [16, 18]
and the proposed VSIP. VSIP can be more efficientthan the implementation using instructions of
TMS320c55 (SW) and using the coprocessor ofTMS320c55 ( H W) f o r i n te g er t r an s fo r m.TMS320c64 is a large VLIW architecture havingeight function units while VSIP requires only two 32
bit adders.
The optimized VSIP instructions can reduce
computation complexity, redundancy and overhead.
Hence, the computation cycles of VSIP to perform
tasks in H.264/AVC are much less than those of
general DSPs. Hence, the proposed VSIP can
efficiently reduce the number of memory accesses
and achieve low power.The hardware accelerators have been implemented
using the MagnaChip HSI 0.25 mm standard cell
library by the Synopsysi Astro tool as shown in
Fig. 19. The chip specifications are listed in Table 2.
The chip is being fabricated by the Information
Technology System on Chip (ITSOC) MPW service
Table 2. Chip summary of proposed hardware accelerator forME and MC.
Parameter
ME hardware
accelerator
MC hardware
accelerator
Pr ocess technolog y 0.25 mm 1p4m 0.25 mm 1p4m
Logic gate count 40,000 10,000
Maximum frequency 100 MHz 150 MHz
On chip memory size 32 kb
a b
Figure 19. Chips of proposed hardware accelerator for ME and MC. a ME hardware accelerator. b MC hardware accelerator.
Table 3. Performance comparisons of the hardware ME archi-
tectures.
Parameter
Clockcycles/
frame
Search
range
Supported
block size
Gatecounts
(K)
References
[20]
405,6 03 [j16,
+15]
Variable block
support
154
References
[21]
406,0 77 [j8,
+7]
Variable block
support
61
Proposed
architecture
431,2 44 [j8,
+7]
Variable block
support
40
ASIP Approach for Implementation of H.264/AVC 65
-
8/9/2019 2.ASIP FOR H
14/15
in KOREA. The ME chip achieves the gate count
without memory of 40 K and the operating frequency
of 100 MHz. The MC chip achieves the gate count
without memory of 10 K and the operating frequency
of 150 MHz. ME requires more memory than MC.
Moreover the size of ME is almost four times larger
than the size of MC. However, we could not insert
the memory of ME in a chip because of the size
limitation of the ITSOC MPW service in Korea.
Moreover, we separated ME and MC.
The proposed ME accelerator can significantly
reduce the gate counts compared with [21] and [22].
Table 3 shows the comparisons among [21, 22] and
our architecture. Kim et al. [21] can support larger
search ranges than the other architectures. However,
it has much larger gate counts than the other
architectures. The required computation cycles of[21] a n d [22] are comparable to our architecture.
However, the total gate counts of [21] and [22] are
much larger than our architecture.
The proposed hardware accelerator for CAVLC
takes average 368 clock cycles for a macro block. To
achieve the real-time processing requirement for
H.264/AVC decoding with HD1080i format, the
proposed design should run over 90 MHz. The
proposed design can support real-time processing
since the maximum operating frequency of the
proposed design is about 130 MHz.
5. Conclusions
This paper presents the ASIP for video signal
processing, called VSIP. VSIP has the special
instructions and the optimized hardware architec-
tures for H.264/AVC. Moreover, VSIP has the
hardware accelerators for ME/MC and entropy
coding. As shown in performance comparisons,
computation cycles to perform target applications
on our VSIP are much less than those of general
DSPs. Moreover, VSIP can dramatically reduce
memory access by using the proposed specialinstructions and the hardware accelerators. Hence,
VSIP can achieve low power at the algorithm/
architecture level. Since the hardware accelerators
can concurrently operate, the VSIP can efficiently
perform in real-time video processing and it can
support various profiles and standards. The proposed
VSIP is one of promising solutions for video signal
processing.
Acknowledgements
This work was supported in part by the Ubiquitous
Computing and Network (UCN) Project, the Minis-
try of Information and Communication (MIC) 21st
Century Frontier R&D Program in Korea, in part by
IT R&D Project funded by Korean Ministry of
Information and Communications, in part by the
second stage of Brain Korea 21 Project in 2006, and
in part by IDEC.
References
1. J. S. Lee, Y. S. Jeon and M. H. Sunwoo, BDesign of new DSPinstructions and their hardware architecture for high-speed
FFT^, in Proc. IEEE Workshop on Signal Processing Syst.,
Sept. 2001, pp. 8090.
2. J. Glossner, J. Moreno, M. Moudgill, J. Derby, E. Hokenek, D.
Meltzer, U. Shavadron and M. Ware, BTrends in compilable
DSP architecture,^ in Proc. IEEE Workshop on Signal
Processing Syst., 2000, pp. 181199.
3. J. H. Lee, J. H. Moon, K. L. Heo, M. H. Sunwoo, S. K. Oh and
I. H. Kim, BImplementation of Application Specific DSP for
OFDM Systems,^ in Proc, IEEE IEEE Int. Symp. Circuit
Syst., May 2004.
4. S. H. Yoon, J. H. Moon and M. H. Sunwoo, BEfficient DSP
Architecture for High-Quality Audio Algorithms, in Proc.
IEEE Int. Symp. Circuits Syst., May 2005.
5. S. D. Kim, J. H. Lee, C. J. Hyun and M. H. Sunwoo, BASIP
approach for implementation of H.264/AVC,^ in Proc. Asia
South Pacific Design Automation Conf., Jan 2006.
6 . J . C he n a n d K . J . R . L iu , BCost-effective low-power
architectures of video coding systems,^ in Proc. IEEE Int.
Symp. On Circuits and Syst., May 1999, pp. 153156.
7. Draft ITU-T Recommendation and Final Draft International
Standard of Joint Video Specification (ITU-T Rec. H.264/ISO/
IEC 14496-10 (E) AVC). July, 2004.
8. J. Ostermann, T. Wedi, et al., BVideo coding with H.264/
AVC: tools, performance, and complexity, IEEE Circuits
and Systems Magazine, vol. 4, 2004, pp. 728.
9. M. K. Jain, M. Balakrishnam and A. Kumar, BASIP design
methodologies: survey and issues,^ in Fourteenth Internation-
al Conference on VLSI Design, Jan. 2001, pp. 7681.
10. W. Di, G. Wen, H. Mingzeng and J. Zhenzhou, BAn Exp-
Golomb encoder and decoder architecture for JVT/AVS,^ in
Proc. 5th International Conference on ASIC, vol. 2, 2124
Oct., 2003, pp. 910913.
11. G. Bjontcgaard and K. Lillcvold, BContext-adaptive VLC
(CAVLC) coding of coefficients,^ Doc. JVT-028, JVT of IS0/
IEC MPEG & ITU-T VCEG 3rd Meeting, Virginia, USA,
May. 2002.
66 Kim and Sunwoo
-
8/9/2019 2.ASIP FOR H
15/15
12. H.-C. Chang, C.-C. Lin and J.-I. Guo, BA Novel Low-Cost
High-Performance VLSI Architecture for MPEG-4 AVC/
H.264 CAVLC Decoding,^ in Proc. IEEE Int. Symp. Circuits
Syst., May 2005.
13. Y.-K. Lai, C.-C. Chou and Y.-C. Chung, BA simple and cost
effective video encoder with memory-reducing CAVLC,^ in
Proc. IEEE Int. Symp. Circuits Syst., May 2005.
14. W. I. L. Choi, B. Jeon and J. Jeong, BFast motion estimation
with modified diamond search for variable motion block
sizes,^ in Proc. International Conference on Image Process-
ing, vol. 3, Sept. 2003, pp. 1417.
15. TMS320C6000 CPU and Instruction Set Reference Guide,
Texas Instruments Inc., Dallas, TX, 2000.
16. TMS320C64 Image/Video Processing Library, Texas Instru-ments Inc., Dallas, TX, 2003.
17. Blackfini DSP Instruction Set Reference, Analog Device
Inc., Norwood, Mass. 2002.
18. TMS320C55 Hardware Extensions for Image/Video Appli-cations Programmer_s Reference, Texas Instruments Inc.,
Dallas, TX, 2002.19. T. Wiegand, X. Zhang and B. Girod, BLong-Term Memory
Motion-Compensated Prediction,^ Trans. Circuit Syst. Video
Technol., vol. 9, no. 1, Feb. 1999, pp. 7084.
20. E. Iain, G. Richardson, Video Codec Design: Developing
Image and Video Compression Systems, Wiley, 2002.
21. M. H. Kim, I. G. Hwang and S. I. Chae, BA Fast VLSI
Architecture for Full-Search Variable Block Size Motion
Estimation in MPEG-4 AVC/H.264,^ in Proc. of Asia and
South Pacific Design Automation Conference (ASP-DAC
2005), Shanghai, China, Jan 2005.
22. S. Y. Yap and J. V. McCanny, BA VLSI Architecture for
Variable Block Size Video Motion Estimation,^ Trans. Circuit
Syst. Video Technol., vol. 51, no. 7, July 2004.
Sung Dae Kim received the B.S. degree in ElectronicsEngineering from the Ajou University, Suwon, Korea in
2002. He is currently working toward the Ph.D. degree. Hiscurrent research interests are in the areas of digital image/
video processing, DSP, and ASIP, specifically low power and
high performance architectures.
Myung H. Sunwoo received the B.S. degree in ElectronicEngineering from the Sogang University in 1980, the M.S.
degree in Electrical and Electronics from the Korea Advanced
Institute of Science and Technology in 1982, and the Ph.D.
degree in Electrical and Computer Engineering from the
University of Texas at Austin in 1990. He worked for
Electronics and Telecommunications Research Institute
(ETRI) in Daejeon, Korea from 1982 to 1985, and Digital
Signal Processor Operations, Motorola, Austin, TX from 1990
to 1992. Since 1992, he has been a Professor with the Schoolof Electrical and Computer Engineering, Ajou University in
Suwon, Korea. In 2000, he was a Visiting Professor in the
Department of Electrical and Computer Engineering, the
University of California, Davis, CA. He has over 300 papers
and also holds 37 patents. He received more than 20 research
awards including the Best Student Paper Award from the
IEEE Workshop on Signal Processing Systems (SIPS) 2005,
Athens, Greece, the Ministry of Commerce, Industry and
Energy, Samsung Electronics, the Institute of Electronics
Engineers of Korea (IEEK), and professional foundations.
His research interests include SOC architectures and design
for multimedia and communications, application-specific DSP
architectures, and application-specific design. He served on
the Technical Program Chairs of the IEEE Workshop on SIPS
in 2003 and International Conference on SOC Design in 2003and has served on program committee, organizing committee,
steering committee, and executive committee for major
international conferences and workshops including IEEE
Workshop on SIPS, Cool Chips, Design, Automation and Test
in Europe (DATE), IEEE International ASIC/SOC Confer-
ence, Asian-Pacific Conference on CAS (APC-CAS), Asian-
Solid State Circuits Conference (A-SSCC), International SOC
Design Conference (ISOCC), International Symposium on
VLSI Design, Automation and Test (VLSI-DAT), etc. He
served as an Associate Editor for the IEEE Transactions on
Very Large Scale Integration (VLSI) Systems (20022003)
and as a Guest Editor for the Journal of VLSI Signal
Processing (Kluwer, 2004). He is a Director of the National
Research Laboratory sponsored by the Ministry of Science and
Technology, a Director of the New Growth Engine Semicon-ductor Center, and an Executive Director of IEEK. Currently,
He is a Senior Member of IEEE and a Chair of the IEEE CAS
Society of the Seoul Chapter.
ASIP Approach for Implementation of H.264/AVC 67
top related