IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 20, NO. 11, NOVEMBER 2010 1475
Activity-Based Motion Estimation Scheme for H.264 Scalable Video Coding
Sangkwon Na, Member, IEEE, and Chong-Min Kyung, Fellow, IEEE
Abstract: This paper proposes a motion estimation scheme to reduce the computational complexity of multilayer motion estimation for scalable video coding. Based on the result of the motion estimation of the lower resolution layer, referred to as the base layer, we developed a new approach for exploring the search range of the enhancement layer with high coding efficiency. This approach is based on the activity, defined as the absolute difference between the motion vector predictor and the final motion vector. Based on the correlation of the activities between neighboring layers, an inter-layer activity model was developed using a curve-fitted linear equation to exploit the activity in the base layer for deciding the search center and the search range of the enhancement layer. Each activity pair in the neighboring layers is used to associate the relevant macroblock with one of two groups: boundary region and interior region. The base-layer motion vector predictor is basically selected over all the activity regions; for each activity region, the proposed motion estimation algorithm decides whether to include the median motion vector predictor or not. The minimal sufficient search range is also decided from the inter-layer activity prediction factor, which is adjusted to the given sequence. The proposed scheme reduced the execution time of motion estimation by 99.26% at the cost of a 1.56% bit-rate increase and a 0.048 dB peak signal-to-noise ratio (PSNR) decrease on average compared with the conventional full-search algorithm. The fast full-search block matching algorithm (FFSBMA) can also be incorporated to obtain extra CPU time reduction in the motion estimation process. By adopting the FFSBMA in the JSVM reference software, the CPU time was reduced by up to 91.84% and the memory bandwidth was reduced by 90% at the sacrifice of a 1.27% bit-rate increase and a 0.041 dB PSNR decrease on average compared with the FFSBMA only.
Index Terms: Activity, H.264/advanced video coding (AVC), motion estimation, scalable video coding (SVC).
Manuscript received July 30, 2009; revised December 28, 2009 and March 26, 2010; accepted May 23, 2010. Date of publication September 20, 2010; date of current version November 5, 2010. This work was supported by the National Research Foundation of Korea (NRF), under Grant 2010-0000823, funded by the Korean Government (MEST). This paper was recommended by Associate Editor M. Comer.
The authors are with the Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, Yuseong-Gu, Daejeon 305-701, Korea (e-mail: [email protected]; [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2010.2077493
I. Introduction
H.264/Advanced Video Coding (AVC) [1] supports scalable video coding (SVC) with improved rate-distortion performance through Amendment 3 [2], announced in July 2007. SVC supports the compression of multiple video sequences with the same content but with different frame rate, resolution, and quality. One SVC-coded bit stream is used for various devices such as TVs, PDAs, and cell phones with different display and computing capabilities. The final draft of the scalable extension of H.264, i.e., H.264/SVC, supports temporal, spatial, and quality scalability [2]. Temporal scalability is related to the frame rate and is supported by hierarchical B-pictures [3]. Spatial scalability allows various resolutions to be encoded in a single coded stream, and achieves a lower bit-rate than simulcast [4], which contains all individually coded streams. To remove redundancy between neighboring layers, spatial scalability exploits three inter-layer predictions: inter-layer motion prediction, inter-layer residual prediction, and inter-layer intra-prediction [5]. By means of quality scalability, video sequences with the same resolution and frame rate can be coded with multiple quality levels with different signal-to-noise ratios.
Motion estimation (ME), which has the largest computational complexity among all encoding processes, is quickly becoming the computational bottleneck as the image resolution of video increases. In SVC with multiple layers having different resolutions, reducing redundancy among ME processes in different layers is critical to reducing the overall time complexity. Chen et al. proposed an ME architecture for H.264/SVC with a full search and ±4 refinement in order to reduce the external memory bandwidth and lower the operating frequency. Compared to the full-search block matching algorithm (FSBMA), the bandwidth overhead is reduced by up to 55% with a quality loss of 0.1 dB [6], [7]. However, the computational complexity of ME remains a critical problem because the search range used in [6] and [7] is the same as that of the FSBMA, despite the reduced external memory bandwidth.
Various fast ME approaches based on dynamic search range adjustment have been proposed to reduce the computational complexity [8]–[13]. In [8] and [9], the search range is determined according to the magnitude of prediction errors. Oh et al. [10] suggested adjusting the search range according to the prediction errors and the block classification information in the previous frames of the block. This approach is appropriate for low bit-rate video such as video phone and video conferencing. Yamada et al. [11] also proposed an adaptive search range selection algorithm based on the sum of the absolute values of motion vectors and prediction errors in the previous frame. Song et al. [12] utilized the average motion vectors in the five previous reference frames and the prediction error of the current block simultaneously. In [13], the motion vector difference is utilized to predict the search
1051-8215/$26.00 © 2010 IEEE
NA AND KYUNG: ACTIVITY-BASED MOTION ESTIMATION SCHEME FOR H.264 SCALABLE VIDEO CODING 1477
Fig. 2. Length of the longer edge of the MBB, $L_{MBB}$, where $(x_s, y_s)$ denotes the BL MVP, and $(x, y)$ denotes the final MV.

TABLE I
Comparison of Five Predictors Including BL MVP in Terms of Entropy of Bits Representing Difference Between Predictors and the Motion Vectors Generated Using Full Search and the Resultant PSNR

Sequence    Median    Zero (0, 0)    Collocated Block [17]    Accelerator [17]    BL MVP
PSNR (dB)
CITY        44.320    44.321         44.319                   44.319              44.350
CREW        44.862    44.860         44.856                   44.833              44.866
SOCCER      44.808    44.804         44.801                   44.789              44.804
Entropy (bits)
CITY        3.527     3.521          3.526                    3.553               3.357
CREW        5.848     5.876          5.963                    6.268               5.458
SOCCER      3.897     3.990          4.011                    4.300               3.848

Three spatial layers (QCIF, CIF, and 4CIF at 30 frames/s) are assumed with QP = 20 and GOP = 8 (hierarchical B 3).
denotes the length of the longer edge of the MBB. As shown in Fig. 3, about 90% of the MVs of the enhancement layer can be found within [−8, +8] of the search center at $MV_s$. The basis search range, $SR_{basis}$, is set to 8 and is independent of the resolution of the sequence. The distribution in Fig. 3(a) is quite different from the others because the resolution ratio of CIF (352 × 288) to QCIF (176 × 144) and of 4CIF (704 × 576) to CIF is an integer, i.e., four, while that of 1080p (1920 × 1080) to 4CIF is 5.114, i.e., a non-integer greater than four. In addition, because the search range for 1080p is the same as that for 4CIF, the speed of saturation differs between 1080p in Fig. 3(a) and 4CIF in Fig. 3(b). We discuss how to detect a block whose $L_{MBB}$ exceeds $SR_{basis}$ in the next section.
Fig. 3. Cumulative distribution of $L_{MBB}$, the maximum length of the MBB, for the CITY, CREW, HARBOUR, ICE, SOCCER,¹ Aspen, RushFieldCuts, and TouchdownPass sequences with QP = 20. (a) 4CIF and 1080p, (b) CIF and 4CIF, and (c) QCIF and CIF as base layer and enhancement layer, respectively.

Besides the BL MVP, there are some other efficient predictors [17]. The conventional median predictor is usually employed in recent video compression. To minimize the memory bandwidth and retain the processing regularity in hardware implementation, many very large scale integration video coders adopt the zero motion vector (0, 0) as the predictor. Because motion vectors are observed to be highly correlated with the motion vectors of temporally and spatially adjacent blocks [17], the motion vectors of the collocated block in the previous frame or the adjacent blocks in the current frame are also considered as predictors. In addition, the differentially increased/decreased motion vector, named the accelerator motion vector, is also used in [17]. In Table I, we compare the BL MVP-based method with four other predictors in terms of the entropy of bits representing the difference between the predictors and the motion vectors generated using the full search, and the resultant PSNR. The BL MVP is shown to outperform the other predictors in terms of video quality and entropy.
B. Activity
MVs of MBs at the boundary of moving objects are less correlated than those in the interior. In Fig. 4, blocks C0–C3 correspond to B0 in the base layer, located at the boundary of the moving object. MVs of blocks at the moving object boundary, such as C0–C3, can be less correlated with each other despite the use of the scaled MV of B0 because the motion properties of C0 and C2 differ from those of C1 and C3. It was reported that FSBMA generally obtains less correlated MVs at the boundary of moving objects [19], [20]. Therefore, it is necessary to extend the search range for blocks at the boundary of moving objects.

Fig. 4. Grid on a sample object in (a) base layer and (b) enhancement layer; the rectangles with a bold line, B0 and C0–C3, denote 4 × 4 blocks in the corresponding positions in the base layer and the enhancement layer, respectively.

¹ We used 1080p sequences upsampled from 4CIF sequences for CITY, CREW, HARBOUR, ICE, and SOCCER because they are not available in 1080p format. We included three additional sequences, Aspen, RushFieldCuts, and TouchdownPass, which are available in 1080p format.
Conventional moving object boundary detection in video compression has relied on the sum of absolute AC coefficients [19] and the DC coefficient [20], which were used to evaluate the level of activity and can be obtained from the signals of the coded bit-stream. The gradient magnitude has also been employed to detect the object boundary in image segmentation [21]–[24]. The main purpose of the moving object boundary prediction in this paper is to judge, before the motion search, whether a search range wider than $SR_{basis}$ is necessary to achieve improved video quality, rather than to extract the moving objects exactly.
We define $A^l$, the activity in layer l, as

$$A^l = \max_i \left( \max\left( |mvd^l_x[i]|,\ |mvd^l_y[i]| \right) \right), \quad 0 \le i \le N - 1 \tag{1}$$

where l denotes the layer index, $mvd^l_x[i]$ and $mvd^l_y[i]$ denote the x- and y-components of the ith MVD of the corresponding MB in layer l, and N denotes the number of MVDs given for the corresponding MB. Because the MVD shows how much the motion of the current MB deviates from the MVP, which is either the BL MVP or the median MVP, we used the MVD to predict the boundary of moving objects in our previous work [25]. Regardless of the source of the MVP, low activity usually means that the final MV is close to the MVP; this case is defined as regular motion. In other words, a small search range is enough to find the best-matched block if the block has regular motion. High activity occurs due to less correlated MVs at the boundary of moving objects. As a result, blocks can be partitioned into two groups, a low-activity group and a high-activity group. Activity regions are defined as follows:
1) interior region (IR), where MVs of the corresponding blocks in neighboring layers are strongly correlated ($L_{MBB} \le SR_{basis}$);
2) boundary region (BR), where the corresponding blocks in neighboring layers are located near the boundary of moving objects, and MVs are weakly correlated ($L_{MBB} > SR_{basis}$).
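The activity in (1) and the IR/BR criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, motion vectors are integer pairs, and $L_{MBB}$ is taken as the larger absolute component difference between the predictor and the final MV.

```python
from typing import List, Tuple

MV = Tuple[int, int]  # (x, y) motion vector or motion vector difference


def activity(mvds: List[MV]) -> int:
    """Eq. (1): largest absolute MVD component over all MVDs of one MB."""
    return max(max(abs(dx), abs(dy)) for dx, dy in mvds)


def l_mbb(predictor: MV, final_mv: MV) -> int:
    """Longer edge of the bounding box spanning predictor and final MV.

    Assumption: the edge length is the absolute component difference.
    """
    return max(abs(final_mv[0] - predictor[0]),
               abs(final_mv[1] - predictor[1]))


def region(predictor: MV, final_mv: MV, sr_basis: int = 8) -> str:
    """Label a block IR when L_MBB <= SR_basis, otherwise BR."""
    return "IR" if l_mbb(predictor, final_mv) <= sr_basis else "BR"
```

For example, an MB whose MVDs are (1, -3) and (0, 2) has activity 3, and a final MV of (9, 0) against a predictor of (0, 0) falls in BR when $SR_{basis}$ = 8.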
The moving object boundary prediction accuracy of the proposed MVD-based moving object boundary prediction was compared with that of Roberts [21], Sobel [22], Prewitt [23], Rosenfeld [24], the sum of absolute AC coefficients [19], and the DC coefficient [20]. The moving object boundary prediction accuracy, $p_{acc}$, consists of two contributing terms: the probability of an MB containing the moving object boundary when $M \ge \theta_M$, and the probability of the MB being in the interior of the moving object when $M < \theta_M$. That is

$$p_{acc}(\theta_M) = p(E_B \mid M \ge \theta_M) + p(E_I \mid M < \theta_M) \tag{2}$$

where $E_B$ denotes the case when $L_{MBB}$ of the given block is larger than $SR_{basis}$ (the given block contains the boundary of an object), $E_I$ denotes the case when $L_{MBB}$ of the given block is smaller than or equal to $SR_{basis}$ (the given block is in the interior of an object), and M denotes a boundary prediction measure such as the DC coefficient. The threshold of the given measure M used to determine whether the current MB belongs to IR or BR, denoted by $\theta_M$, is obtained by the minimum error Bayesian classifier [26]

$$\theta_M = \arg\min_{\theta_M} p_{err}(\theta_M) \tag{3}$$

where

$$p_{err}(\theta_M) = \int_{-\infty}^{\theta_M} p(M \mid E_B)\, dM + \int_{\theta_M}^{\infty} p(M \mid E_I)\, dM. \tag{4}$$
As shown in Fig. 5, $\theta_M$ is determined to minimize the error probability given in (4), which corresponds to the shaded region. Prediction measure M of the boundary operators [21]–[24] is derived from the convolution of the operator mask (2 × 2 or 3 × 3) with the pixels. The sum of absolute AC coefficients and the DC coefficient, denoted as $y_{AC}$ and $y_{DC}$, respectively, can also be employed as prediction measure M; they are derived from the DCT given as

$$Y = A X A^{T} \tag{5}$$

$$= \begin{bmatrix} y_{DC} & y_{AC,1} & y_{AC,2} & y_{AC,3} \\ y_{AC,4} & y_{AC,5} & y_{AC,6} & y_{AC,7} \\ y_{AC,8} & y_{AC,9} & y_{AC,10} & y_{AC,11} \\ y_{AC,12} & y_{AC,13} & y_{AC,14} & y_{AC,15} \end{bmatrix} \tag{6}$$

where

$$A = \begin{bmatrix} a & a & a & a \\ b & c & -c & -b \\ a & -a & -a & a \\ c & -b & b & -c \end{bmatrix}. \tag{7}$$

$y_{AC}$ is defined by

$$y_{AC} = \sum_{k=1}^{15} |y_{AC,k}|. \tag{8}$$

In (5)–(7), X denotes the prediction errors obtained after ME, $a = \frac{1}{2}$, $b = \sqrt{\frac{1}{2}} \cos\left(\frac{\pi}{8}\right)$, and $c = \sqrt{\frac{1}{2}} \cos\left(\frac{3\pi}{8}\right)$. The proposed MVD-based moving object boundary prediction employs the activity of the base layer as prediction measure M, where the threshold of the proposed prediction, $\theta_{act}$, is given by (3).
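To make (5)–(8) concrete, the following Python sketch computes $y_{DC}$ and $y_{AC}$ for one 4 × 4 block of prediction errors. It is a plain floating-point illustration, not the integer transform of an actual encoder; the helper names are hypothetical.

```python
import math

# Coefficients of the 4 x 4 DCT matrix A in (7).
A_COEF = 0.5
B_COEF = math.sqrt(0.5) * math.cos(math.pi / 8)
C_COEF = math.sqrt(0.5) * math.cos(3 * math.pi / 8)

A = [
    [A_COEF, A_COEF, A_COEF, A_COEF],
    [B_COEF, C_COEF, -C_COEF, -B_COEF],
    [A_COEF, -A_COEF, -A_COEF, A_COEF],
    [C_COEF, -B_COEF, B_COEF, -C_COEF],
]


def matmul(p, q):
    """Multiply two 4 x 4 matrices given as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def dct4x4(x):
    """Eq. (5): Y = A X A^T."""
    return matmul(matmul(A, x), transpose(A))


def dc_and_sac(x):
    """Return (y_DC, y_AC): the DC coefficient and the sum of the
    absolute values of the 15 AC coefficients, as in (8)."""
    y = dct4x4(x)
    y_dc = y[0][0]
    y_ac = sum(abs(v) for row in y for v in row) - abs(y_dc)
    return y_dc, y_ac
```

A flat residual block has all its energy in $y_{DC}$ and a $y_{AC}$ of essentially zero, whereas a block containing an edge yields a clearly positive $y_{AC}$, which is why the measure can hint at object boundaries.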
Table II shows that the proposed MVD-based moving object boundary prediction has at least 5% better boundary prediction accuracy than the sum of absolute AC coefficients and the DC coefficient for the CITY, CREW, HARBOUR, ICE, and SOCCER sequences, and lists the number of operations used in each prediction method. The boundary operators used in [21]–[24] show relatively low accuracy in the moving object boundary prediction. Because boundary operators are based on the gradient magnitude, they often mistake a complicated texture in the scene for a moving object boundary, or completely miss a moving object boundary when the gradient magnitude between the background and the boundary of the moving object is small. The moving object boundary prediction scheme proposed in this paper outperforms the others in terms of prediction accuracy and computational complexity.

Fig. 5. Two conditional probability curves: $p(M \mid E_I)$ denotes the probability curve of M given $E_I$ ($E_I = 1$ when $L_{MBB}$ of the given block is smaller than or equal to $SR_{basis}$), and $p(M \mid E_B)$ denotes the probability curve of M given $E_B$ ($E_B = 1$ when $L_{MBB}$ of the given block is larger than $SR_{basis}$). M denotes the moving object prediction measure. The shaded region denotes the error probability when the threshold of M is given as $\theta_M$. $SR_{basis}$ is defined in Section III-B as a criterion for deciding whether an MB belongs to IR or BR.

TABLE II
Comparison of the Moving Object Boundary Prediction Accuracy, $p_{acc}$, and Operations per 4 × 4 Block Among Roberts [21], Sobel [22], Prewitt [23], Rosenfeld [24], the Sum of Absolute AC Coefficients [19], DC Coefficient [20], and the Proposed MVD-Based Moving Object Boundary Prediction Using Full Search for SOCCER Sequence with QP = 20

                                                    Operations per 4 × 4 block
Prediction Measure M                    p_acc       Compare    Add/Sub    Multiply
Roberts [21]                            56.33%      33         86         -
Sobel [22]                              56.94%      33         208        -
Prewitt [23]                            56.95%      33         208        4
Rosenfeld [24]                          54.11%      113        912        16
Sum of absolute AC coefficients [19]    78.88%      17         127        128
DC coefficient [20]                     75.81%      1          96         128
Ours (activity)                         83.81%      4          2          -
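The threshold selection in (3) and (4) can be illustrated with an empirical version that replaces the two integrals by sample fractions. The sketch below is an assumption-laden illustration rather than the authors' procedure: the function names are hypothetical, the class-conditional densities are estimated directly from labeled samples, and the candidate thresholds are simply the observed measure values.

```python
from typing import List, Sequence


def error_probability(theta: float,
                      interior: Sequence[float],
                      boundary: Sequence[float]) -> float:
    """Empirical counterpart of (4) for a threshold theta.

    Boundary samples below theta are missed boundaries; interior samples
    at or above theta are false boundaries.
    """
    missed = sum(1 for m in boundary if m < theta) / len(boundary)
    false_alarm = sum(1 for m in interior if m >= theta) / len(interior)
    return missed + false_alarm


def best_threshold(interior: Sequence[float],
                   boundary: Sequence[float]) -> float:
    """Eq. (3): pick the candidate threshold with the smallest error."""
    candidates: List[float] = sorted(set(interior) | set(boundary))
    # One extra candidate above every sample allows "all interior".
    candidates.append(candidates[-1] + 1)
    return min(candidates,
               key=lambda t: error_probability(t, interior, boundary))
```

With interior measures [0, 1, 1, 2, 3] and boundary measures [3, 4, 5, 6, 8], the smallest error (0.2) is first reached at a threshold of 3.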
C. Inter-Layer Activity Model
By exploiting the correlation of the mean activities between two neighboring layers in Fig. 6, an inter-layer activity model (ILAM) is developed to predict the activity of the enhancement layer from that of the base layer with a linear equation

$$\hat{A}^l = \alpha A^{l-1} + \beta \tag{9}$$

where $\hat{A}^l$ is the predicted activity of the given MB in the enhancement layer (layer l); the inter-layer activity prediction factor, $\alpha$, is the slope of ILAM, denoted by the dashed line in Fig. 6; the inter-layer activity prediction offset, $\beta$, is the intercept of ILAM; and $A^{l-1}$ denotes the activity of the corresponding MB in the base layer (layer l − 1). $\bar{A}^l$ in Fig. 6 denotes the mean of activities over all MBs in a frame at layer l. The values of $\alpha$ and $\beta$ in (9) are obtained through measurement with five video sequences: CITY, CREW, HARBOUR, ICE, and SOCCER (240 frames with an SVC structure comprising three layers).

Fig. 6. Activity plane representing pairs of the mean of activities between two neighboring layers [base layer (BL) = CIF, enhancement layer (EL) = 4CIF] for the ICE sequence with QP = 20; the dashed line denotes the inter-layer activity model with the given slope, $\alpha$ ($= \Delta \bar{A}^l / \Delta \bar{A}^{l-1}$), and the given intercept, $\beta$, where $\bar{A}^{l-1}$ and $\bar{A}^l$ denote the mean of activities over all MBs in a frame at the base and enhancement layer, respectively.

The error between $\bar{A}^l$ and $\alpha \bar{A}^{l-1} + \beta$ is measured with $R^2$ (the coefficient of determination [27]), defined as
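$$SS_{err} = \sum_{l} \sum_{f} \left( \bar{A}^l_f - \left( \alpha \bar{A}^{l-1}_f + \beta \right) \right)^2 \tag{10}$$

$$SS_{tot} = \sum_{l} \sum_{f} \left( \bar{A}^l_f - \bar{\bar{A}}^l \right)^2 \tag{11}$$

$$R^2 = 1 - \frac{SS_{err}}{SS_{tot}} \tag{12}$$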
where l and f denote the index of layer and frame, respectively, $\bar{A}^l_f$ denotes $\bar{A}^l$ in the fth frame at layer l, $\bar{\bar{A}}^l$ denotes the mean of $\bar{A}^l_f$ over all frames at layer l, $SS_{err}$ is the sum of squared errors between $\bar{A}^l_f$ and $\alpha \bar{A}^{l-1}_f + \beta$, and $SS_{tot}$ is the total sum of squares of $\bar{A}^l_f$ about its mean, which is proportional to its variance. If $R^2$ is close to 1.0, the error between $\bar{A}^l$ and $\alpha \bar{A}^{l-1} + \beta$ is small. Table III shows $\alpha$, $\beta$, and $R^2$ measured with five generic sequences. The second column shows the inter-layer activity prediction factor, $\alpha$, for the given sequence; the third column shows the inter-layer activity prediction offset, $\beta$; the fourth column shows the coefficient of determination, $R^2$, for the given $\alpha$ and $\beta$. The value of $\alpha$ represents the coefficient of the assumed linear relationship between the activities in the neighboring layers. Equation (9) is used before the motion search in the current layer to estimate the minimal search range needed to find the best motion vector without too much quality loss compared to the full search. Because the estimated search range is given as the product of $\alpha$ and the activity of the base layer, $\alpha$ affects both the computation time of ME and the video quality. $\alpha$ varies according to the given sequence, while $\beta$ is relatively steady ($\beta$ is set to 0.5). Therefore, $\alpha$ needs to be adjusted to follow the variation of the motion nature and the activity relationship between the neighboring layers, which is discussed in Section IV-C.

TABLE III
Inter-Layer Activity Prediction Factor, $\alpha$, and Inter-Layer Activity Prediction Offset, $\beta$, for the Given Sequence, and the Coefficient of Determination [27], $R^2$, for Given $\alpha$ and $\beta$, Measured with Five Sequences (240 Frames on an SVC Structure Comprising Three Layers)

Sequence    α      β      R²
CITY        1.9    0.6    0.91
CREW        2.3    0.4    0.95
HARBOUR     1.7    0.4    0.91
ICE         3.9    0.6    0.92
SOCCER      3.7    0.6    0.91
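As an illustration of (9)–(12), the sketch below fits $\alpha$ and $\beta$ to pairs of per-frame mean activities and evaluates $R^2$. It assumes an ordinary least-squares fit (the text only says the line is curve-fitted) and handles a single pair of neighboring layers, whereas (10) and (11) also sum over layers; all function names are hypothetical.

```python
from typing import Sequence, Tuple


def fit_ilam(base: Sequence[float], enh: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope (alpha) and intercept (beta) of enh vs. base."""
    n = len(base)
    mean_b = sum(base) / n
    mean_e = sum(enh) / n
    sxx = sum((b - mean_b) ** 2 for b in base)
    sxy = sum((b - mean_b) * (e - mean_e) for b, e in zip(base, enh))
    alpha = sxy / sxx
    beta = mean_e - alpha * mean_b
    return alpha, beta


def r_squared(base: Sequence[float], enh: Sequence[float],
              alpha: float, beta: float) -> float:
    """Eqs. (10)-(12) for one layer pair: 1 - SS_err / SS_tot."""
    mean_e = sum(enh) / len(enh)
    ss_err = sum((e - (alpha * b + beta)) ** 2 for b, e in zip(base, enh))
    ss_tot = sum((e - mean_e) ** 2 for e in enh)
    return 1.0 - ss_err / ss_tot


def predict_activity(a_base: float, alpha: float, beta: float = 0.5) -> float:
    """Eq. (9): predicted enhancement-layer activity, used as search range."""
    return alpha * a_base + beta
```

For instance, with $\alpha$ = 1.9 and $\beta$ = 0.5, a base-layer activity of 4 predicts an enhancement-layer activity, and hence a search range, of 8.1.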
IV. Proposed Activity-Based Motion Estimation Algorithm
A. Overall Procedure of the Proposed Scheme
The proposed activity-based ME (ABME) scheme takes one of two paths, i.e., ME for IR or ME for BR, according to the activity of the base layer, $A^{l-1}$. At the beginning, the search range is given by the inter-layer activity model (ILAM) using (9). If $A^{l-1}$ is smaller than $\theta_{act}$, the activity threshold, ABME takes ME for IR. Otherwise, ABME takes ME for BR. The final MV is chosen among the search results in terms of the rate-distortion cost. During the motion search, parameter $\alpha$, the inter-layer activity prediction factor in (9), and $\theta_{act}$ are adjusted. The detailed procedure is introduced in Section IV-C.
B. Search Center Set
The search center set in the enhancement layer consists of three elements: $MV^l_{med}$, $MV^l_s$, and $MV_z$, as described in Fig. 7. $MV^l_{med}$, the median MVP, is defined as

$$MV^l_{med} = \mathrm{median}\left( MV^l_{left},\ MV^l_{upper},\ MV^l_{upper\text{-}right} \right) \tag{13}$$

where $MV^l_{left}$, $MV^l_{upper}$, and $MV^l_{upper\text{-}right}$ denote the MVs of the left, upper, and upper-right blocks in the enhancement layer (layer l), respectively. $MV^l_s$ denotes the BL MVP, i.e., the MV obtained by up-scaling the MV of the base layer as mentioned in the previous section. $MV_z$ denotes the zero motion vector, (0, 0).
As shown in Table IV, the search center set is formed according to the given activity region and the availability of $MV^l_s$. In general, blocks in IR show better rate-distortion performance with $MV^l_s$ since they are placed in the interior of moving objects and have regular motion. Simulation results have shown that no significant benefit is obtained by additional consideration of $MV_z$ in IR. In H.264, pictures are divided into I, P (backward prediction), and B (forward and backward prediction) types. $MV^l_s$ may not be available for forward or backward prediction in a B picture. In this case, the search center set consists of $MV^l_{med}$ and $MV_z$ instead of $MV^l_s$. ME for BR employs both MVPs (i.e., $MV^l_s$ and $MV^l_{med}$). The search begins with each element in the search center set.

Fig. 7. Three elements of the search center set in the enhancement layer: $MV^l_{med}$, the median motion vector predictor at layer l; $MV^l_s$, the base-layer motion vector predictor at layer l; and $MV_z$, the zero motion vector. $MV^l_{final}$ denotes the final MV at layer l, and $\hat{A}^l$ denotes the predicted activity based on the inter-layer activity model.

TABLE IV
Search Center Set According to Each Activity Region and the Availability of $MV^l_s$

                           IR (Interior Region)       BR (Boundary Region)
$MV^l_s$ is available      {$MV^l_s$}                 {$MV^l_s$, $MV^l_{med}$}
$MV^l_s$ is unavailable    {$MV^l_{med}$, $MV_z$}     {$MV^l_{med}$, $MV_z$}
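The path selection of Section IV-A and the rules of Table IV can be sketched as follows. The names are hypothetical, an unavailable BL MVP is modeled as `None`, and the median in (13) is assumed to be taken component-wise, as is usual in H.264.

```python
from typing import List, Optional, Tuple

MV = Tuple[int, int]
MV_ZERO: MV = (0, 0)


def median_mvp(left: MV, upper: MV, upper_right: MV) -> MV:
    """Eq. (13), assuming a component-wise median of the three neighbors."""
    xs = sorted(v[0] for v in (left, upper, upper_right))
    ys = sorted(v[1] for v in (left, upper, upper_right))
    return (xs[1], ys[1])


def search_center_set(a_base: float, theta_act: float,
                      mv_med: MV, mv_s: Optional[MV]) -> List[MV]:
    """Table IV: search centers for the current enhancement-layer block.

    a_base is the base-layer activity; mv_s is None when the BL MVP
    is unavailable (e.g., for one prediction direction of a B picture).
    """
    if mv_s is None:
        return [mv_med, MV_ZERO]   # same set for IR and BR
    if a_base < theta_act:
        return [mv_s]              # IR: BL MVP only
    return [mv_s, mv_med]          # BR: both MVPs
```

A block with base-layer activity 2 and threshold 4 is searched around the BL MVP only, while an activity of 5 adds the median MVP as a second search center.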
C. Parameter Adjustment
1) Inter-Layer Activity Prediction Factor, $\alpha$: It is observed that $\alpha$ depends on the nature of motion in the scene and therefore needs to be adjusted to the given sequence. We propose a two-level adjustment scheme for $\alpha$ consisting of the MB level and the frame level. The search range is not fixed but adjusted by (9) with a given $\alpha$. After the motion search, we check whether the search range thus obtained is sufficient or not as follows. If the best point with the minimum rate-distortion cost is close enough to the boundary of the search range, we suspect that there may exist some point beyond the search range with a lower rate-distortion cost than that point. On the other hand, if the best point is close enough to the predictor, the prediction is assumed to be quite accurate, obviating the need for further checking of points far from the predictor. The decision is made based on the distance between the predictor and the best point in terms of the rate-distortion cost obtained by the motion search within the given search range.

Fig. 8. Optional check of diamond-shaped points (OCDSP), where SR is derived from (9). Point I denotes the point with the minimum R-D cost within the given search range, d denotes the distance between Predictor and point I, and point J denotes the center point of the diamond-shaped search pattern whose distance from Predictor is twice as long as d. Five gray-colored circles denote optional check points in the diamond-shaped search pattern, and point K denotes the point with the minimum R-D cost among six candidates, i.e., five optional check points and point I. $SR_{new}$ denotes the required search range to cover point K (a) when point K is different from point I, and (b) when point K is identical to point I.
In Fig. 8, we introduce a procedure called optional check of diamond-shaped points (OCDSP). We define the best point obtained by the motion search as point I and the center point of the diamond pattern as point J, which is located twice as far as point I from the point denoted as Predictor along the direction of the Predictor-to-point-I vector. The radius of the diamond pattern is given as $L_{MBB}$, the length of the longer edge of the minimum bounding box (MBB) which covers both Predictor and point I. We can get a new inter-layer activity prediction factor, $\alpha$, after the following steps, defined as OCDSP.
1) Set the rate-distortion cost of point I to $RDCost_I$.
2) Define the best point among the five optional check points in the diamond pattern as point K.
3) Set the rate-distortion cost of point K to $RDCost_K$.
4) If $RDCost_I < RDCost_K$, then point I is renamed as point K, as described in Fig. 8(b).
5) Get $SR_{new}$, which minimally covers point K from Predictor.
6) Calculate $\alpha$ deductively from (9) by using the updated search range ($SR_{new}$) as $\hat{A}^l$: $\alpha = SR_{new} / A^{l-1}$.
$\alpha$ is defined at two levels: the MB level ($\alpha_{MB}$) and the frame level ($\alpha_{frame}$). First, $\alpha$ of the 16 × 16 mode is obtained by OCDSP after the completion of the motion search using the search range given by (9) with the previous value of $\alpha_{frame}$, and is defined as $\alpha_{MB}$. The remaining modes, such as the 16 × 8, 8 × 16, and 8 × 8 modes, are tested using the search range given by (9) with the updated $\alpha_{MB}$, as OCDSP is continuously performed for each mode. After the mode decision, $\alpha$ of the best mode is defined as $\alpha_{best}$. The mean of $\alpha_{best}$ over all MBs in a frame is used to update $\alpha_{frame}$. Initially, $\alpha_{frame}$ is set to the maximum among the $\alpha$ values in Table III to support generic sequences. All these parameters are controlled individually layer by layer. Fig. 9 shows the variation of $\alpha$ along with $\Delta$PSNR (PSNR relative to that with fixed $\alpha$) and $\Delta T$ (percentage decrement of computation time of ME relative to the case of fixed $\alpha$) for the CITY sequence. It is observed that not only is the video quality improved, but also about 90% computation time reduction in ME is achieved through the $\alpha$ adjustment.

Fig. 9. Result of the $\alpha$ adjustment in terms of (a) relative peak signal-to-noise ratio (PSNR), $\Delta$PSNR, and (b) relative computation time of ME, $\Delta T$, between adjusted $\alpha$ and fixed $\alpha$, given with the value of $\alpha$ for the CITY sequence with QP = 20.
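The OCDSP steps can be sketched in Python as below. Several details are assumptions made only for illustration: the five optional check points are taken to be the center J and the four vertices of the diamond, the rate-distortion cost is passed in as a caller-supplied function, and the previous $\alpha$ is kept when the base-layer activity is zero. The function name is hypothetical.

```python
from typing import Callable, Tuple

Point = Tuple[int, int]


def ocdsp(predictor: Point, point_i: Point,
          rd_cost: Callable[[Point], float],
          a_base: float, alpha_prev: float) -> Tuple[Point, int, float]:
    """Return (point K, SR_new, updated alpha) after one OCDSP pass."""
    dx = point_i[0] - predictor[0]
    dy = point_i[1] - predictor[1]
    radius = max(abs(dx), abs(dy))  # L_MBB of Predictor and point I
    jx, jy = predictor[0] + 2 * dx, predictor[1] + 2 * dy  # point J

    # Assumed layout: center J plus the four diamond vertices.
    candidates = [(jx, jy), (jx + radius, jy), (jx - radius, jy),
                  (jx, jy + radius), (jx, jy - radius)]
    point_k = min(candidates, key=rd_cost)             # steps 2-3
    if rd_cost(point_i) < rd_cost(point_k):            # step 4
        point_k = point_i

    # Step 5: smallest range around Predictor that covers point K.
    sr_new = max(abs(point_k[0] - predictor[0]),
                 abs(point_k[1] - predictor[1]))
    # Step 6: alpha = SR_new / A^{l-1}; keep the old value if A^{l-1} is 0.
    alpha = sr_new / a_base if a_base > 0 else alpha_prev
    return point_k, sr_new, alpha
```

If a cheaper point is found beyond point I, $SR_{new}$ grows and $\alpha$ increases; if point I remains the best, $SR_{new}$ shrinks to just cover point I and $\alpha$ decreases accordingly.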
2) Activity Threshold, $\theta_{act}$: Because it is too time-consuming to update $\theta_{act}$ with the full search, we employ the sum of absolute AC coefficients (SAC) as the reference measure, since SAC is the most strongly correlated with activity among all boundary prediction measures (see Table II). Fig. 10 shows a scatter plot of the activity and SAC, where diamond-shaped points denote points in the interior of objects and star-shaped points denote points on the object boundary. The activity region is classified by $\theta_{act}$ into IR (hatched region) and BR (shaded region). $\theta_{act}$ is determined as follows.

Fig. 10. Scatter plot of activity, $A_i$, and the sum of absolute AC coefficients (SAC), $y_i$, where i denotes the MB index, $\theta_{act}$ denotes the activity threshold dividing object regions into IR (hatched region) and BR (shaded region), while $\theta^{+}_{act}$ and $\theta^{-}_{act}$ denote the increment and the decrement of $\theta_{act}$, respectively. Diamond-shaped points denote points in the interior of objects, and star-shaped points denote points on the object boundary.

The Euclidean distances among SACs in IR and BR, $d_{I_{\theta_{act}}}$ and $d_{B_{\theta_{act}}}$, respectively, are calculated as follows:

$$d_{I_{\theta_{act}}} = \sum_{i \in I_{\theta_{act}}} \left( y_i - y_{I_{\theta_{act}},mean} \right)^2 \tag{14}$$

$$d_{B_{\theta_{act}}} = \sum_{i \in B_{\theta_{act}}} \left( y_i - y_{B_{\theta_{act}},mean} \right)^2 \tag{15}$$

where

$$y_{I_{\theta_{act}},mean} = \frac{1}{|I_{\theta_{act}}|} \sum_{i \in I_{\theta_{act}}} y_i, \quad I_{\theta_{act}} = \{ i : A_i < \theta_{act} \ \text{for} \ \forall i \} \tag{16}$$

$$y_{B_{\theta_{act}},mean} = \frac{1}{|B_{\theta_{act}}|} \sum_{i \in B_{\theta_{act}}} y_i, \quad B_{\theta_{act}} = \{ i : A_i \ge \theta_{act} \ \text{for} \ \forall i \}. \tag{17}$$

In (16) and (17), $y_{I_{\theta_{act}},mean}$ and $y_{B_{\theta_{act}},mean}$ denote the mean SAC over all MBs in IR and BR, respectively, and $y_i$ denotes the SAC of the ith MB. Then, the Euclidean distances obtained with the increment and the decrement of $\theta_{act}$, $\theta^{+}_{act}$ and $\theta^{-}_{act}$, are also calculated. The changes of the Euclidean distances obtained with $\theta^{+}_{act}$ and $\theta^{-}_{act}$, for each activity region, are defined as follows:

$$\Delta d_{I}^{+} = d_{I_{\theta^{+}_{act}}} - d_{I_{\theta_{act}}} \tag{18}$$

$$\Delta d_{B}^{+} = d_{B_{\theta^{+}_{act}}} - d_{B_{\theta_{act}}} \tag{19}$$

$$\Delta d_{I}^{-} = d_{I_{\theta^{-}_{act}}} - d_{I_{\theta_{act}}} \tag{20}$$

$$\Delta d_{B}^{-} = d_{B_{\theta^{-}_{act}}} - d_{B_{\theta_{act}}}. \tag{21}$$
According to the following conditions, $\theta_{act}$ is updated every frame.
1) If $\Delta d_{I}^{+} + \Delta d_{B}^{+}$ is negative, $\theta_{act}$ is incremented by 1.
2) If $\Delta d_{I}^{-} + \Delta d_{B}^{-}$ is negative, $\theta_{act}$ is decremented by 1.
3) Otherwise, $\theta_{act}$ retains its value.
The initial value of $\theta_{act}$ is given by statistical analysis using (3) for five sequences: CITY, CREW, HARBOUR, ICE, and SOCCER. $\theta_{act}$ is controlled individually for each spatial layer. Fig. 11 shows the variation of $\theta_{act}$ along with $\Delta$PSNR (PSNR relative to that with fixed $\theta_{act}$) and $\Delta T$ (percentage decrement of computation time of ME relative to the case of fixed $\theta_{act}$) for the CITY sequence. The computation time of ME was reduced by about 20% compared with the case of fixed $\theta_{act}$ at the cost of slight quality degradation.

Fig. 11. Result of the $\theta_{act}$ adjustment in terms of (a) relative peak signal-to-noise ratio (PSNR), $\Delta$PSNR, and (b) relative computation time of ME, $\Delta T$, between adjusted $\theta_{act}$ and fixed $\theta_{act}$, given with the value of $\theta_{act}$ for the CITY sequence with QP = 20.
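The per-frame update of $\theta_{act}$ described by (14)–(21) can be sketched as follows. The sketch follows the equations as written (sums of squared deviations of SAC within each region), treats an empty region as contributing zero, and checks the increment condition before the decrement condition; these last two choices and all names are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def spread(values: List[float]) -> float:
    """Sum of squared deviations from the mean, as in (14) and (15)."""
    if not values:
        return 0.0  # assumption: an empty region contributes nothing
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def region_spreads(activities: Sequence[float], sacs: Sequence[float],
                   theta: float) -> Tuple[float, float]:
    """Return (d_I, d_B) for the split A_i < theta (IR) / A_i >= theta (BR)."""
    interior = [y for a, y in zip(activities, sacs) if a < theta]
    boundary = [y for a, y in zip(activities, sacs) if a >= theta]
    return spread(interior), spread(boundary)


def update_theta(activities: Sequence[float], sacs: Sequence[float],
                 theta: int) -> int:
    """One per-frame update of the activity threshold, eqs. (18)-(21)."""
    d_i, d_b = region_spreads(activities, sacs, theta)
    d_i_plus, d_b_plus = region_spreads(activities, sacs, theta + 1)
    d_i_minus, d_b_minus = region_spreads(activities, sacs, theta - 1)
    if (d_i_plus - d_i) + (d_b_plus - d_b) < 0:
        return theta + 1
    if (d_i_minus - d_i) + (d_b_minus - d_b) < 0:
        return theta - 1
    return theta
```

Intuitively, the threshold moves by one step per frame toward the split that makes the SAC values inside each region more homogeneous.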
V. Experimental Results
A. Configuration of Experiments
The experiment platform consists of four dual-core AMD Opteron 2.6 GHz CPUs and 16 GB RAM, running CentOS 4.5. The experiment conditions were set as follows.
1) An SVC structure comprising three layers, with resolutions of QCIF, CIF, and 4CIF, is used.
2) The search range for each resolution is set as follows: [−16, +16], [−32, +32], and [−64, +64] for QCIF, CIF, and 4CIF, respectively.
3) The number of frames in a GOP² is set to 8, and hierarchical B-pictures [3] are employed as depicted in Fig. 12.
4) 240 frames are tested for each sequence at 30 frames/s.
5) Intra prediction is restricted in the encoder to mainly observe the effect of motion estimation.
6) The quantization parameter is set to 20, 24, 28, and 32.
7) The rate-distortion optimization is enabled.
8) The context-adaptive binary arithmetic coding is used.
9) The adaptive inter-layer prediction is enabled.

Fig. 12. Hierarchical prediction structures for motion-compensated prediction with GOP = 8 (IBBBBBBBP).

² A group of pictures (GOP) consists of a key picture, which is generally coded as a P picture, and several hierarchically coded B pictures that are located between the key pictures. The coding order of hierarchical prediction is depicted in Fig. 12.
To evaluate the rate-distortion performance and CPU time in SVC, we first implemented the proposed activity-based ME scheme in JSVM [28]. The N3SS [14], the 4SS [15], the DS [16], and the EPZS [17] were simulated. We also implemented Chen's algorithm [13], which is here referred to as the AdaptiveSR method. For performance comparison, DirectMaxMv, which uses the maximum absolute value of the MVs of the corresponding MB in the base layer as the search range, was also implemented.
B. Comparison of Bit-Rate, PSNR, and CPU Time
Table V reports the experimental results of N3SS, 4SS, DS,
EPZS, DirectMaxMv, AdaptiveSR method and the proposed
scheme compared with the reference encoder (JSVM [28]) in
terms of bit-rate, peak signal-to-noise ratio (PSNR), and CPU
time. The relative bit-rate and PSNR are calculated by the
method of Bjontegaard delta bit-rate (BDBR) and Bjontegaard
delta PSNR (BDPSNR) [29], respectively. ΔT denotes the
CPU time reduction in ME for all the spatial layers (QCIF,
CIF, and 4CIF) compared with JSVM.
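For readers unfamiliar with the Bjontegaard measure, the following is a minimal sketch of a BD-rate calculation in the spirit of [29], assuming exactly four rate/PSNR points per curve (as with the four quantization parameters used here), so that the third-order polynomial passes through every point. The helper names and the sample numbers are hypothetical and are not measurements from this paper.

```python
import math


def _lagrange(xs, ys, x):
    """Evaluate the unique polynomial through the points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def _simpson(f, a, b):
    """Simpson's rule, which is exact for polynomials up to degree three."""
    return (b - a) / 6.0 * (f(a) + 4.0 * f((a + b) / 2.0) + f(b))


def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    """Average bit-rate difference (%) of the test curve at equal PSNR."""
    log_ref = [math.log10(r) for r in rates_ref]
    log_test = [math.log10(r) for r in rates_test]
    # Integrate only over the PSNR interval covered by both curves.
    lo = max(min(psnr_ref), min(psnr_test))
    hi = min(max(psnr_ref), max(psnr_test))
    int_ref = _simpson(lambda p: _lagrange(psnr_ref, log_ref, p), lo, hi)
    int_test = _simpson(lambda p: _lagrange(psnr_test, log_test, p), lo, hi)
    avg_log_diff = (int_test - int_ref) / (hi - lo)
    return (10.0 ** avg_log_diff - 1.0) * 100.0


if __name__ == "__main__":
    psnr = [40.0, 38.0, 36.0, 34.0]          # hypothetical PSNR values (dB)
    ref = [1000.0, 600.0, 350.0, 200.0]      # hypothetical bit-rates (kb/s)
    test = [1.1 * r for r in ref]            # same quality, 10% more bits
    print(round(bd_rate(ref, psnr, test, psnr), 3))
```

A test curve that needs 10% more bits at every quality level yields a BD-rate of about +10%, which matches the sign convention of the BDBR columns: positive values mean a bit-rate increase.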
Table V shows that the proposed scheme reduced the
ME execution time by 99.26% on average compared with
JSVM, while the rate-distortion performance loss is almost
negligible: +1.56% in bit-rate and -0.048 dB in PSNR, on average.
In the case of full search, the search point ratios of the base
layer (QCIF) to the enhancement layers (CIF and 4CIF) for CITY are 0.0221
and 0.0023, respectively, because of the increased search range
and the additional motion vector predictor in the enhancement
layers. Therefore, even though there is no computational reduction
in the base layer, we achieved about 99% time saving owing
to the 99.6% reduction in the number of search points in the
enhancement layers.
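The arithmetic behind this statement can be checked with a short sketch. It assumes that ME time is proportional to the number of search points and that the quoted ratios are base-layer points divided by the points of each enhancement layer; both are simplifying assumptions, and the function name total_me_saving is hypothetical.

```python
def total_me_saving(base_to_el_ratios, el_reduction):
    """Overall search-point saving when only enhancement layers are reduced.

    base_to_el_ratios: base-layer points / enhancement-layer points, per layer.
    el_reduction:      fraction of enhancement-layer search points removed.
    """
    # Normalise the base layer to 1 search-point unit.
    el_points = sum(1.0 / ratio for ratio in base_to_el_ratios)
    before = 1.0 + el_points
    after = 1.0 + (1.0 - el_reduction) * el_points
    return 1.0 - after / before


if __name__ == "__main__":
    saving = total_me_saving([0.0221, 0.0023], 0.996)
    print(f"{100.0 * saving:.2f}%")
```

Under these assumptions the base layer accounts for only about 0.2% of all full-search points for CITY, so a 99.6% reduction in the enhancement layers gives an overall saving of about 99.4%, in line with the measured value of about 99%.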
The performance of the proposed method is affected by
the characteristics of the motion rather than the texture. Thus, we
chose five test sequences with different motion characteristics,
i.e., CITY, CREW, HARBOUR, ICE, and SOCCER.
The rate-distortion performance of the proposed method for
CITY is better than for the other sequences except
HARBOUR, because the motion in CITY is quite
regular owing to the camera movement. The rate-distortion
performance is intermediate for CREW, because the background
is covered by objects and their motion is relatively low.
Most schemes show the best rate-distortion performance for
HARBOUR, where motion is very low. On the other hand, the
rate-distortion performance is relatively poor for ICE
and SOCCER, which contain many fast-moving objects.
Fast search algorithms such as N3SS, 4SS, DS, and EPZS
tend to produce sub-optimal results, although they are
definitely faster than the proposed activity-based ME. With
slightly improved video quality, EPZS achieved faster motion
estimation than N3SS, 4SS, and DS owing to its early-stopping
criteria based on the sum of absolute differences. However, these
early-stopping criteria are not appropriate for the hierarchical
prediction structure in H.264/SVC in terms of
rate-distortion performance. The relatively high bit-rates of N3SS,
4SS, DS, and EPZS require high bandwidth in H.264/SVC.
AdaptiveSR takes about 50 times more CPU time than the proposed
scheme and has relatively worse rate-distortion performance.
DirectMaxMv shows improved rate-distortion performance but
also takes about 50 times more CPU time than the proposed scheme (see
Table V).
C. Results About Incorporating Fast Full-Search Block
Matching Algorithm with the Proposed Scheme
TZ search (TZS) introduced as a new block matching
algorithm in JSVM [28] provides drastically reduced encoding
time, with comparable rate-distortion performance to the full-
search algorithm. As TZS utilizes different search strategies
depending on the location of the best match found so far, the
search begins with a comparison of the rate-distortion cost
of some motion vector candidates (i.e., MVs of surrounding
blocks). The best match among motion vector candidates is
chosen as a starting position for a diamond-shaped search
which is stopped when the best match is located near the
starting position. If a better match is found farther away from
the starting position, the full search is triggered.
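The control flow just described can be sketched as below. This is a simplified illustration of the three steps (candidate selection, diamond search, and full-search fallback) and not the JSVM implementation: the function name tz_like_search, the parameter near, the four-point diamond with doubling distance, and the abstract cost callable are all assumptions made for brevity.

```python
def tz_like_search(cost, candidates, search_range, near=1):
    """Return (best_mv, used_full_search).

    cost:         callable mapping an (mvx, mvy) pair to a matching cost.
    candidates:   predictor MVs, e.g. MVs of surrounding blocks.
    search_range: maximum displacement examined around the start position.
    near:         largest diamond distance still counted as "near the start".
    """
    # Step 1: start from the cheapest motion vector candidate.
    start = min(candidates, key=cost)
    best, best_cost, best_dist = start, cost(start), 0

    # Step 2: diamond-shaped search with doubling distance around the start.
    dist = 1
    while dist <= search_range:
        for ox, oy in ((dist, 0), (-dist, 0), (0, dist), (0, -dist)):
            mv = (start[0] + ox, start[1] + oy)
            c = cost(mv)
            if c < best_cost:
                best, best_cost, best_dist = mv, c, dist
        dist *= 2

    # Step 3: stop if the best match is near the start, else search fully.
    if best_dist <= near:
        return best, False
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            mv = (start[0] + dx, start[1] + dy)
            c = cost(mv)
            if c < best_cost:
                best, best_cost = mv, c
    return best, True


if __name__ == "__main__":
    # Toy cost with its minimum at the "true" motion vector (5, -3).
    toy_cost = lambda mv: (mv[0] - 5) ** 2 + (mv[1] + 3) ** 2
    print(tz_like_search(toy_cost, [(0, 0), (4, -3)], 8))
    print(tz_like_search(toy_cost, [(0, 0)], 8))
```

With a good predictor among the candidates, the search ends after the diamond step. With a poor predictor, the best diamond point lies far from the start and the full search is triggered, which is why a better starting point and a tighter search range reduce the work so much.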
Because the proposed scheme plays a crucial role in deciding
the MVP and the search range, incorporating TZS into the
proposed scheme can remarkably reduce the execution time
of ME in SVC without significant quality loss compared with
the full-search algorithm. Table VI shows that the CPU time
reduction (ΔT) obtained by the proposed scheme with TZS
is about 91.84%, at the cost of a mere 1.265% bit-rate increase
and 0.041 dB PSNR decrease on average, compared with the
TZS-only scheme. We used the same configuration as described
in Section V-A, i.e., three spatial layers [QCIF (BL), CIF
(EL), and 4CIF (EL)] with 240 frames. Because the proposed
activity-based ME improves the starting point of TZS by using
the adjusted search range, the bit-rate and quality obtained by
the proposed scheme with TZS are better than those obtained
by the proposed scheme without TZS. The memory bandwidth
(BW) for reference data loading is also compared among the three
schemes. The proposed scheme achieves about a 90%
reduction in memory bandwidth compared with TZS utilizing
Level C data reuse [30].
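The bandwidth figure can be reproduced from the average BW values reported in Table VI (1449, 118, and 144 MB/s). The short sketch below only repeats that arithmetic; the function name reduction_percent is hypothetical.

```python
def reduction_percent(baseline, value):
    """Relative reduction of value with respect to baseline, in percent."""
    return 100.0 * (baseline - value) / baseline


if __name__ == "__main__":
    # Average memory bandwidth in MB/s, taken from Table VI.
    average_bw = {"TZS": 1449, "Proposed w/o TZS": 118, "Proposed w/ TZS": 144}
    for name in ("Proposed w/o TZS", "Proposed w/ TZS"):
        saved = reduction_percent(average_bw["TZS"], average_bw[name])
        print(f"{name}: {saved:.1f}% less bandwidth than TZS")
```

Both variants of the proposed scheme come out at roughly 90% or more below the TZS-only bandwidth.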
TABLE V
Comparison of Bit-Rate (BDBR), PSNR (BDPSNR), and CPU Time Reduction (ΔT) Among Seven Schemes Including the Proposed
Scheme for Five Sequences Compared with JSVM [28]

Metric        Scheme        CITY      CREW      HARBOUR   ICE       SOCCER    Average
BDBR (%)      N3SS          +46.202   +23.709   +2.056    +38.622   +49.262   +32.134
              4SS           +32.454   +23.239   +2.056    +35.144   +54.668   +29.512
              DS            +27.829   +20.604   +1.568    +23.415   +40.730   +22.829
              EPZS          +10.453   +29.530   +8.901    +25.968   +31.324   +21.235
              AdaptiveSR    +6.439    +8.580    +5.638    +9.531    +3.998    +6.837
              DirectMaxMv   -0.052    -0.286    -0.115    +1.470    -0.747    +0.054
              Proposed      +0.046    +0.828    -0.063    +4.704    +2.270    +1.557
BDPSNR (dB)   N3SS          -1.487    -0.802    -0.135    -1.107    -1.854    -1.077
              4SS           -1.036    -0.786    -0.097    -1.011    -2.071    -1.000
              DS            -0.887    -0.699    -0.074    -0.674    -1.528    -0.772
              EPZS          -0.335    -1.049    -0.427    -0.742    -1.19     -0.749
              AdaptiveSR    -0.207    -0.292    -0.267    -0.274    -0.147    -0.237
              DirectMaxMv   +0.002    +0.012    +0.005    -0.042    +0.029    +0.001
              Proposed      0.000     -0.026    +0.003    -0.135    -0.083    -0.048
ΔT (%)        N3SS          99.81     99.78     99.83     99.82     99.75     99.80
              4SS           99.81     99.81     99.84     99.83     99.79     99.82
              DS            99.82     99.81     99.87     99.84     99.78     99.82
              EPZS          99.97     99.97     99.96     99.98     99.96     99.97
              AdaptiveSR    81.76     76.35     61.14     40.85     60.30     64.08
              DirectMaxMv   81.23     52.21     83.04     75.01     29.33     64.16
              Proposed      99.57     98.49     99.62     99.40     99.21     99.26
TABLE VI
Comparison of Two Schemes, 1) Proposed Scheme Without TZ Search (TZS), and 2) Proposed Scheme with TZS in Terms of Bit-Rate
(BDBR), PSNR (BDPSNR), CPU Time Reduction (ΔT), and Memory Bandwidth (BW) with TZ Search Scheme Where Base Layer = QCIF
and Enhancement Layer = (CIF, 4CIF) with 240 Frames

Metric        Scheme             CITY     CREW     HARBOUR  ICE      SOCCER   Average
BDBR (%)      Proposed w/o TZS   +0.352   +1.119   +0.241   +3.508   +1.767   +1.397
              Proposed w/ TZS    +0.349   +0.885   +0.176   +3.302   +1.613   +1.265
BDPSNR (dB)   Proposed w/o TZS   -0.011   -0.038   -0.012   -0.102   -0.065   -0.046
              Proposed w/ TZS    -0.011   -0.030   -0.008   -0.095   -0.060   -0.041
ΔT (%)        Proposed w/o TZS   64.46    48.61    77.61    51.66    60.72    60.61
              Proposed w/ TZS    92.05    91.58    94.41    90.14    90.99    91.84
BW (MB/s)     TZS                1035     2092     747      1015     2356     1449
              Proposed w/o TZS   92       181      85       92       143      118
              Proposed w/ TZS    94       232      88       110      198      144
VI. Conclusion
In this paper, we demonstrated a fast multilayer motion
estimation scheme that utilizes the activity, defined as the
absolute value of the motion vector difference. It was possible to
reduce the execution time of ME by utilizing the motion
properties of the base layer, i.e., the MVs and MVDs of the
corresponding blocks in the base layer. The MVP was selected
adaptively according to the activity of the base layer. The
inter-layer activity model, developed based on the linear relationship
between the activities in neighboring layers, was used for
deciding the search range so as to achieve a similar rate-distortion
performance in spite of the reduced execution time of ME.
Two significant parameters related to the activity were
adjusted to the sequence. Finally, the proposed scheme achieved
a 99.26% CPU time reduction in ME at the cost
of a 1.56% bit-rate increase and a 0.048 dB PSNR decrease for
sequences with different activity properties compared with the
conventional full-search algorithm. By adopting the fast
full-search block matching algorithm in JSVM, the CPU time
reduction increased to 99.85% without significant loss of
rate-distortion performance compared with the full-search
algorithm.
References
[1] Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, document JVT-G050.doc, ITU-T Rec. H.264/ISO/IEC 14496-10 AVC, Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, 2003.
[2] I. JTC1, Joint Draft 8 of SVC Amendment, document JVT-X201.doc, ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6, Jul. 2007.
[3] H. Schwarz, D. Marpe, and T. Wiegand, "Analysis of hierarchical B pictures and MCTF," in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2006, pp. 1929-1932.
[4] H. Schwarz, D. Marpe, and T. Wiegand, "Overview of the scalable video coding extension of the H.264/AVC standard," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 9, pp. 1103-1120, Sep. 2007.
[5] C. Segall and G. Sullivan, "Spatial scalability within the H.264/AVC scalable video coding extension," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 9, pp. 1121-1135, Sep. 2007.
[6] Y.-H. Chen, T.-D. Chuang, Y.-J. Chen, and L.-G. Chen, "Bandwidth-efficient encoder framework for H.264/AVC scalable extension," in Proc. 9th ISMW, Dec. 2007, pp. 401-406.
[7] Y.-H. Chen, T.-D. Chuang, Y.-J. Chen, C.-T. Li, C.-J. Hsu, S.-Y. Chien, and L.-G. Chen, "An H.264/AVC scalable extension and high profile HDTV 1080p encoder chip," in Proc. IEEE Symp. VLSI Circuits, Jun. 2008, pp. 104-105.
[8] L.-W. Lee, J.-F. Wang, J.-Y. Lee, and J.-D. Shie, "Dynamic search-window adjustment and interlaced search for block-matching algorithm," IEEE Trans. Circuits Syst. Video Technol., vol. 3, no. 1, pp. 85-87, Feb. 1993.
[9] J. Feng, K.-T. Lo, H. Mehrpour, and A. Karbowiak, "Adaptive block matching motion estimation algorithm for video coding," Electron. Lett., vol. 31, no. 18, pp. 1542-1543, Aug. 1995.
[10] H.-S. Oh and H.-K. Lee, "Block-matching algorithm based on an adaptive reduction of the search area for motion estimation," Real-Time Imaging, vol. 6, no. 5, pp. 407-414, 2000.
[11] T. Yamada, M. Ikekawa, and I. Kuroda, "Fast and accurate motion estimation algorithm by adaptive search range and shape selection," in Proc. IEEE ICASSP, vol. 2. Mar. 2005, pp. 897-900.
[12] T. Song, K. Ogata, K. Saito, and T. Shimamoto, "Adaptive search range motion estimation algorithm for H.264/AVC," in Proc. IEEE ISCAS, May 2007, pp. 3956-3959.
[13] Z. Chen, Y. Song, T. Ikenaga, and S. Goto, "Adaptive search range algorithms for variable block size motion estimation in H.264/AVC," IEICE Trans. Fundam., vol. E91-A, no. 4, pp. 1015-1022, 2008.
[14] R. Li, B. Zeng, and M. Liou, "A new three-step search algorithm for block motion estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 4, no. 4, pp. 438-442, Aug. 1994.
[15] L.-M. Po and W.-C. Ma, "A novel four-step search algorithm for fast block motion estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 3, pp. 313-317, Jun. 1996.
[16] S. Zhu and K.-K. Ma, "A new diamond search algorithm for fast block-matching motion estimation," IEEE Trans. Image Process., vol. 9, no. 2, pp. 287-290, Feb. 2000.
[17] A. Tourapis, "Enhanced predictive zonal search for single and multiple frame motion estimation," in Proc. Visual Commun. Image Process., 2002, pp. 1069-1079.
[18] K. De Wolf, D. De Schrijver, S. De Zutter, and R. Van de Walle, "Scalable video coding: Analysis and coding performance of inter-layer prediction," in Proc. 9th ISSPA, Feb. 2007, pp. 1-4.
[19] B. Shen, I. Sethi, and B. Vasudev, "Adaptive motion-vector resampling for compressed video downscaling," IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 6, pp. 929-936, Sep. 1999.
[20] M.-J. Chen, M.-C. Chu, and S.-Y. Lo, "Motion vector composition algorithm for spatial scalability in compressed video," IEEE Trans. Consumer Electron., vol. 47, no. 3, pp. 319-325, Aug. 2001.
[21] L. Roberts et al., Machine Perception of Three-Dimensional Solids. New York: Garland, 1980.
[22] R. Duda and P. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.
[23] J. Prewitt, "Object enhancement and extraction," in Picture Processing and Psychopictorics. New York: Academic Press, 1970, pp. 75-149.
[24] A. Rosenfeld and A. Kak, Digital Picture Processing. Orlando, FL: Academic Press, 1982.
[25] S. Na and C.-M. Kyung, "A multilayer motion estimation scheme for spatial scalability in H.264/AVC scalable extension," in Proc. Int. Conf. Multimedia Expo, Jun. 2009, pp. 69-72.
[26] F. van der Heijden, R. Duin, D. De Ridder, and D. Tax, Classification, Parameter Estimation, and State Estimation: An Engineering Approach Using MATLAB. New York: Wiley, 2004.
[27] N. Draper and H. Smith, Applied Regression Analysis. New York: Wiley, 1998.
[28] I. JTC1, Joint Scalable Video Model JSVM-12, document JVT-Y202.doc, ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6, Oct. 2007.
[29] G. Bjontegaard, Calculation of Average PSNR Differences Between RD-Curves, document VCEG-M33.doc, VCEG 13th Meeting, Apr. 2001.
[30] J.-C. Tuan, T.-S. Chang, and C.-W. Jen, "On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 1, pp. 61-72, Jan. 2002.
Sangkwon Na received the B.S. degree in electrical engineering from Ajou University, Suwon, Korea, in 2003. Since 2003, he has been pursuing the unified course of the M.S. and Ph.D. degrees from the Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Korea.
His current research interests include low-power video codec design, wireless surveillance system, and platform-based architecture exploration.
Chong-Min Kyung (F'09) received the B.S. degree in electronics engineering from Seoul National University, Seoul, Korea, in 1975, and the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 1977 and 1981, respectively.
From April 1981 to January 1983, he was with Bell Telephone Laboratories, Murray Hill, NJ, in a post-doctoral position. Since he joined KAIST in 1983, he has been working on system-on-a-chip design and verification methodology, processor, and graphics architectures for high-speed and/or low-power applications, including mobile video codec. He was a Visiting Professor with the University of Karlsruhe, Karlsruhe, Germany, in 1989, as an Alexander von Humboldt Fellow, a Visiting Professor with the University of Tokyo, Tokyo, Japan, from January 1985 to February 1985, with the Technical University of Munich, Munich, Germany, from July 1994 to August 1994, with Waseda University, Kyushu, Japan, from 2002 to 2005, with the University of Auckland, Auckland, New Zealand, from February 2004 to February 2005, and with Chuo University, Tokyo, from July 2005 to August 2005. He is the Director of the Integrated Circuit (IC) Design Education Center, Daejeon, established in 1995 to promote the IC design education in Korean universities through computer-aided design environment setup, and chip fabrication services. He is the Director of the SoC Initiative for Ubiquity and Mobility Research Center, Daejeon, established to promote academia/industry collaboration in the SoC design-related area. From 1993 to 1994, he served as an Asian Representative in the International Conference on the Computer-Aided Design Executive Committee.
Dr. Kyung received the Most Excellent Design Award and the Special Feature Award from the University Design Contest in the ASP-DAC 1997 and 1998, respectively. He received the Best Paper Awards at the 36th DAC, New Orleans, LA, the 10th International Conference on Signal Processing Application and Technology, Orlando, FL, in September 1999, and the 1999 International Conference on Computer Design, Austin, TX. He was the General Chair of the Asian Solid-State Circuits Conference 2007, and ASP-DAC 2008. In 2000, he received the National Medal from the Korean Government for his contribution to research and education in IC design. He is a member of the National Academy of Engineering Korea and the Korean Academy of Science and Technology. He is a Hynix Chair Professor with KAIST.