
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 20, NO. 11, NOVEMBER 2010 1475

Activity-Based Motion Estimation Scheme for H.264 Scalable Video Coding

Sangkwon Na, Member, IEEE, and Chong-Min Kyung, Fellow, IEEE

Abstract: This paper proposes a motion estimation scheme to reduce the computational complexity of multilayer motion estimation for scalable video coding. Based on the result of the motion estimation of the lower resolution layer, referred to as the base layer, we developed a new approach for exploring the search range of the enhancement layer with high coding efficiency. This approach is based on the activity, defined as the absolute difference between the motion vector predictor and the final motion vector. Based on the correlation of the activities between neighboring layers, an inter-layer activity model was developed using a curve-fitted linear equation to exploit the activity in the base layer for deciding the search center and the search range of the enhancement layer. Each activity pair in the neighboring layers is used to associate the relevant macroblock to one of two groups: boundary region and interior region. The base-layer motion vector predictor is basically selected over all the activity regions; for each activity region, the proposed motion estimation algorithm decides whether to include the median motion vector predictor or not. Minimal sufficient search range is also decided from the inter-layer activity prediction factor that is adjusted to the given sequence. The proposed scheme reduced the execution time of motion estimation by 99.26% at the cost of 1.56% bit-rate increase and 0.048 dB peak signal-to-noise ratio (PSNR) decrease on average compared with the conventional full-search algorithm. The fast full-search block matching algorithm can also be incorporated to obtain the extra CPU time reduction in the motion estimation process. By adopting the fast full-search block matching algorithm (FFSBMA) in JSVM reference software, the CPU time was reduced by up to 91.84% and the memory bandwidth was reduced by 90% at the sacrifice of 1.27% bit-rate increase and 0.041 dB PSNR decrease on average compared with the FFSBMA only.

Index Terms: Activity, H.264/advanced video coding (AVC), motion estimation, scalable video coding (SVC).

I. Introduction

H.264/Advanced Video Coding (AVC) [1] supports scalable video coding (SVC) with improved rate-distortion performance through Amendment 3 [2], announced in July 2007. SVC supports the compression of multiple video sequences with the same content but with different frame rate, resolution, and quality. One SVC-coded bit stream is used for various devices such as TVs, PDAs and cell phones with different display and computing capabilities. The final draft of the scalable extension of H.264, i.e., H.264/SVC, supports temporal, spatial, and quality scalability [2]. Temporal scalability is related to the frame rate and is supported by a hierarchical B-picture [3]. Spatial scalability allows various resolutions to be encoded in a single coded stream, and achieves lower bit-rate than simulcast [4], which contains all individually coded streams. To remove redundancy between neighboring layers, spatial scalability exploits three inter-layer predictions: inter-layer motion prediction, inter-layer residual prediction, and inter-layer intra-prediction [5]. By means of quality scalability, video sequences with the same resolution and frame rate can be coded with multiple quality levels with different signal-to-noise ratios.

Manuscript received July 30, 2009; revised December 28, 2009 and March 26, 2010; accepted May 23, 2010. Date of publication September 20, 2010; date of current version November 5, 2010. This work was supported by the National Research Foundation of Korea (NRF), under Grant 2010-0000823, funded by the Korean Government (MEST). This paper was recommended by Associate Editor M. Comer.

The authors are with the Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, Yuseong-Gu, Daejeon 305-701, Korea (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2010.2077493

Motion estimation (ME), which has the largest computational complexity among all encoding processes, is quickly becoming the computational bottleneck as the image resolution of video increases. In SVC with multiple layers having different resolutions, reducing redundancy among ME processes in different layers is critical to reduce the overall time complexity. Chen et al. proposed an ME architecture for H.264/SVC with a full search and 4 refinement in order to reduce the external memory bandwidth and lower the operating frequency. Compared to the full-search block matching algorithm (FSBMA), the bandwidth overhead is reduced by up to 55% with a quality loss of 0.1 dB [6], [7]. However, the computational complexity of ME is still a critical problem, since the search range used in [6] and [7] is the same as that of the FSBMA in spite of the reduced external memory bandwidth.

Various fast ME approaches based on the dynamic search range adjustment have been proposed to reduce the computational complexity [8]-[13]. In [8] and [9], the search range is determined according to the magnitude of prediction errors. Oh et al. [10] suggested the search range adjustment according to the prediction errors and the block classification information in the previous frames of the block. This approach is appropriate for low bit-rate video such as video phone and video conferencing. Yamada et al. [11] also proposed an adaptive search range selection algorithm based on the sum of the absolute values of motion vectors and prediction errors in the previous frame. Song et al. [12] utilized the average motion vectors in the five previous reference frames and the prediction error of the current block simultaneously. In [13], the motion vector difference is utilized to predict the search

1051-8215/$26.00 © 2010 IEEE




Fig. 2. Length of the longer edge of the MBB, $L_{MBB}$, where $(x_s, y_s)$ denotes the BL MVP, and $(x, y)$ denotes the final MV.

TABLE I

Comparison of Five Predictors Including BL MVP in Terms of Entropy of Bits Representing Difference Between Predictors and the Motion Vectors Generated Using Full Search and the Resultant PSNR

| Sequence | Median | Zero (0, 0) | Collocated Block [17] | Accelerator [17] | BL MVP |
|---|---|---|---|---|---|
| PSNR (dB) | | | | | |
| CITY | 44.320 | 44.321 | 44.319 | 44.319 | 44.350 |
| CREW | 44.862 | 44.860 | 44.856 | 44.833 | 44.866 |
| SOCCER | 44.808 | 44.804 | 44.801 | 44.789 | 44.804 |
| Entropy (bits) | | | | | |
| CITY | 3.527 | 3.521 | 3.526 | 3.553 | 3.357 |
| CREW | 5.848 | 5.876 | 5.963 | 6.268 | 5.458 |
| SOCCER | 3.897 | 3.990 | 4.011 | 4.300 | 3.848 |

Three spatial layers (QCIF, CIF, and 4CIF at 30 frames/s) are assumed with QP = 20 and GOP = 8 (hierarchical B 3).

denotes the length of the longer edge of the MBB. As shown in Fig. 3, about 90% of MVs of the enhancement layer can be found within [-8, +8] of the search center at $MV_s$. The basis search range, $SR_{basis}$, is set at 8, and is independent of the resolution of the sequence. The distribution in Fig. 3(a) is quite different from the others, because the resolution ratio of CIF (352 × 288) to QCIF (176 × 144) and of 4CIF (704 × 576) to CIF is an integer, i.e., four, while that of 1080p (1920 × 1080) to 4CIF is 5.114, i.e., non-integer and > 4. In addition, because the search range for 1080p is the same as that for 4CIF, the speed of saturation of sequences is different between 1080p in Fig. 3(a) and 4CIF in Fig. 3(b). We discuss how to detect the block which has $L_{MBB}$ that exceeds $SR_{basis}$ in the next section.

Besides the BL MVP, there are some other efficient predictors [17]. The conventional median predictor is usually employed in recent video compression. To minimize the memory bandwidth and retain the processing regularity in hardware implementation, many very large scale integration video coders adopt the zero motion vector (0, 0) as the predictor. As it is observed that motion vectors are highly correlated with the motion vectors of temporally and spatially adjacent blocks [17], the motion vectors of the collocated block in the previous frame or the adjacent blocks in the current frame are also considered as the predictor. In addition, the differentially increased/decreased motion vector, named the accelerator motion vector, is also used in [17]. We compared, in Table I, the BL MVP-based method with four other predictors in terms of the entropy of bits representing the difference between predictors and the motion vectors generated using the full search and the resultant PSNR. The BL MVP is shown to outperform other predictors in terms of the video quality and the entropy.

Fig. 3. Cumulative distribution of $L_{MBB}$, the maximum length of MBB, for CITY, CREW, HARBOUR, ICE, SOCCER,¹ Aspen, RushFieldCuts, and TouchdownPass sequences with QP = 20. (a) 4CIF and 1080p, (b) CIF and 4CIF, and (c) QCIF and CIF as base layer and enhancement layer, respectively.

B. Activity

MVs of MBs at the boundary of moving objects are less correlated than those in the interior. In Fig. 4, blocks $C_0$-$C_3$ correspond to $B_0$ in the base layer, located at the boundary of the moving object. MVs of blocks at the moving object boundary such as $C_0$-$C_3$ can be less correlated to each other despite the use of the scaled MV of $B_0$, because the motion properties of $C_0$ and $C_2$ differ from those of $C_1$ and $C_3$. It was reported that FSBMA generally obtains less correlated MVs at the boundary of the moving objects [19], [20]. Therefore, it is necessary to extend the search range for blocks at the boundary of moving objects.

Fig. 4. Grid on a sample object in (a) base layer and (b) enhancement layer; the rectangles with a bold line, $B_0$ and $C_0$-$C_3$, denote 4 × 4 blocks in the corresponding positions in the base layer and the enhancement layer, respectively.

¹We used 1080p sequences upsampled from 4CIF sequences for CITY, CREW, HARBOUR, ICE, and SOCCER because they are not available in 1080p format. We included three additional sequences, Aspen, RushFieldCuts, and TouchdownPass, which are available in 1080p format.

Conventional moving object boundary detection in video compression has relied on the sum of absolute AC coefficients [19] and the DC coefficient [20], which were used to evaluate the level of activity and can be obtained from signals of the coded bit-stream. The gradient magnitude has also been employed to detect the object boundary in image segmentation [21]-[24]. The main purpose of the moving object boundary prediction in this paper is to judge, before the motion search, whether a search range wider than $SR_{basis}$ is necessary to achieve improved video quality, rather than to extract the moving objects exactly.

We define $A^l$, the activity in layer $l$, as

$$A^l = \max_i \left( \max\left( |mvd^l_x[i]|,\ |mvd^l_y[i]| \right) \right), \quad 0 \le i \le N - 1 \tag{1}$$

where $l$ denotes the layer index, $mvd^l_x[i]$ and $mvd^l_y[i]$ denote the x- and y-component of the $i$th MVD of the corresponding MB in layer $l$, and $N$ denotes the number of the MVDs given for the corresponding MB. Because MVD shows how much the motion of the current MB deviates from the MVP, which is either the BL MVP or the median MVP, we used MVD to predict the boundary of moving objects in our previous work [25]. Regardless of the source of MVP, low activity usually means that the final MV is close to the MVP; this case is defined as regular motion. In other words, a small search range is enough to search for the best-matched block if the block has a regular motion. High activity occurs due to less correlated MVs at the boundary of moving objects. As a result, blocks can be partitioned into two groups, a low-activity group and a high-activity group. Activity regions are defined as follows:

1) interior region (IR) where MVs of the corresponding blocks in neighboring layers are strongly correlated ($L_{MBB} \le SR_{basis}$);

2) boundary region (BR) where the corresponding blocks in neighboring layers are located near the boundary of moving objects, and MVs are weakly correlated ($L_{MBB} > SR_{basis}$).
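The activity in (1) and the IR/BR definitions can be illustrated with a minimal Python sketch. The function names are hypothetical, and the sketch assumes that MVDs and MVs are given as integer (x, y) pairs in the same units as $SR_{basis}$, and that $L_{MBB}$ is the longer edge of the box spanned by the BL MVP and the final MV, as described for Fig. 2.

```python
from typing import Sequence, Tuple

Vector = Tuple[int, int]

def activity(mvds: Sequence[Vector]) -> int:
    """Activity of one MB as in (1): the largest absolute MVD component."""
    return max(max(abs(dx), abs(dy)) for dx, dy in mvds)

def longer_mbb_edge(bl_mvp: Vector, final_mv: Vector) -> int:
    """L_MBB: longer edge of the minimum bounding box covering the
    BL MVP (x_s, y_s) and the final MV (x, y)."""
    return max(abs(final_mv[0] - bl_mvp[0]), abs(final_mv[1] - bl_mvp[1]))

def activity_region(l_mbb: int, sr_basis: int = 8) -> str:
    """IR if L_MBB <= SR_basis, otherwise BR (SR_basis = 8 in the text)."""
    return "IR" if l_mbb <= sr_basis else "BR"
```

For example, an MB with MVDs (1, -3) and (0, 2) has activity 3, and a block whose final MV lies 9 samples away from the BL MVP along one axis falls into BR.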

The moving object boundary prediction accuracy of the proposed MVD-based prediction was compared with that of Roberts [21], Sobel [22], Prewitt [23], Rosenfeld [24], the sum of absolute AC coefficients [19], and the DC coefficient [20]. The moving object boundary prediction accuracy, $p_{acc}$, consists of two contributing terms: the probability of an MB containing the moving object boundary when $M \ge \theta_M$, and the probability of the MB being in the interior of the moving object when $M < \theta_M$. That is

$$p_{acc}(M) = p(E_B \mid M \ge \theta_M) + p(E_I \mid M < \theta_M) \tag{2}$$

where $E_B$ denotes the case when $L_{MBB}$ of the given block is larger than $SR_{basis}$ (the given block contains the boundary of an object), $E_I$ denotes the case when $L_{MBB}$ of the given block is smaller than or equal to $SR_{basis}$ (the given block is in the interior of an object), and $M$ denotes a boundary prediction measure such as the DC coefficient. The threshold of the given measure $M$ that determines whether the current MB belongs to IR or BR, denoted by $\theta_M$, is obtained by the minimum error Bayesian classifier [26]

$$\theta_M = \arg\min_{\theta} p_{err}(\theta) \tag{3}$$

where

$$p_{err}(\theta) = \int_{-\infty}^{\theta} p(M \mid E_B)\, dM + \int_{\theta}^{\infty} p(M \mid E_I)\, dM. \tag{4}$$
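The threshold selection in (3) and (4) can be approximated from labeled samples. The following is a minimal sketch, not the authors' implementation: it assumes the conditional densities are replaced by empirical frequencies, that boundary blocks are those with the measure at or above the threshold, and that only observed sample values are tried as candidate thresholds. The names `error_rate` and `best_threshold` are hypothetical.

```python
from typing import Sequence

def error_rate(theta: float, boundary: Sequence[float],
               interior: Sequence[float]) -> float:
    """Empirical counterpart of (4): boundary samples falling below the
    threshold plus interior samples falling at or above it."""
    missed = sum(m < theta for m in boundary) / len(boundary)
    false_alarm = sum(m >= theta for m in interior) / len(interior)
    return missed + false_alarm

def best_threshold(boundary: Sequence[float],
                   interior: Sequence[float]) -> float:
    """Empirical counterpart of (3): the candidate with the lowest error."""
    candidates = sorted(set(boundary) | set(interior))
    return min(candidates, key=lambda t: error_rate(t, boundary, interior))
```

With boundary samples [6, 7, 9, 12] and interior samples [0, 1, 2, 3, 6], the sketch selects 6, where only one interior sample is misclassified.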

As shown in Fig. 5, $\theta_M$ is determined to minimize the error probability given in (4), which corresponds to the shaded region. The prediction measure $M$ of the boundary operators [21]-[24] is derived from the convolution of the operator mask (2 × 2 or 3 × 3) with the pixels. The sum of absolute AC coefficients and the DC coefficient, denoted as $y_{AC}$ and $y_{DC}$, respectively, can also be employed as the prediction measure $M$; both are derived from the DCT given as

$$Y = A X A^T \tag{5}$$

$$= \begin{bmatrix} y_{DC} & y_{AC,1} & y_{AC,2} & y_{AC,3} \\ y_{AC,4} & y_{AC,5} & y_{AC,6} & y_{AC,7} \\ y_{AC,8} & y_{AC,9} & y_{AC,10} & y_{AC,11} \\ y_{AC,12} & y_{AC,13} & y_{AC,14} & y_{AC,15} \end{bmatrix} \tag{6}$$

where

$$A = \begin{bmatrix} a & a & a & a \\ b & c & -c & -b \\ a & -a & -a & a \\ c & -b & b & -c \end{bmatrix}. \tag{7}$$

$y_{AC}$ is defined by

$$y_{AC} = \sum_{k=1}^{15} |y_{AC,k}|. \tag{8}$$

In (5)-(7), $X$ denotes the prediction errors obtained after ME, $a = \frac{1}{2}$, $b = \sqrt{\frac{1}{2}} \cos\left(\frac{\pi}{8}\right)$, and $c = \sqrt{\frac{1}{2}} \cos\left(\frac{3\pi}{8}\right)$. The proposed MVD-based moving object boundary prediction employs the activity of the base layer as prediction measure $M$, where the threshold of the proposed prediction, $\theta_{act}$, is given by (3).
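The DC coefficient and the sum of absolute AC coefficients in (5)-(8) can be computed as in the following sketch. It assumes the standard 4 × 4 DCT matrix with the sign pattern of (7) and uses plain Python lists; `dct4x4` and `dc_and_sac` are hypothetical helper names.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]

a = 0.5
b = math.sqrt(0.5) * math.cos(math.pi / 8)
c = math.sqrt(0.5) * math.cos(3 * math.pi / 8)

# Transform matrix A of (7).
A: Matrix = [[a, a, a, a],
             [b, c, -c, -b],
             [a, -a, -a, a],
             [c, -b, b, -c]]

def matmul(p: Matrix, q: Matrix) -> Matrix:
    """Product of two 4 x 4 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def dct4x4(x: Matrix) -> Matrix:
    """Y = A X A^T as in (5)."""
    a_t = [list(row) for row in zip(*A)]
    return matmul(matmul(A, x), a_t)

def dc_and_sac(x: Matrix) -> Tuple[float, float]:
    """Return (y_DC, y_AC), where y_AC sums |y_AC,k| for k = 1..15 as in (8)."""
    y = dct4x4(x)
    y_dc = y[0][0]
    y_ac = sum(abs(v) for row in y for v in row) - abs(y_dc)
    return y_dc, y_ac
```

A flat residual block (all entries equal to 3) yields a DC coefficient of 12 and an AC sum of zero up to rounding, which is the expected behavior for a block without texture.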

Table II shows that the proposed MVD-based moving object boundary prediction has at least 5% better boundary prediction accuracy than the sum of absolute AC coefficients and the DC coefficient for CITY, CREW, HARBOUR, ICE, and SOCCER sequences, and lists the number of operations used in the corresponding prediction method. Boundary operators used in [21]-[24] show relatively low accuracy in the moving object boundary prediction. Because boundary operators are based on the gradient magnitude, they often mistake a complicated texture in the scene for a moving object boundary or completely miss a moving object boundary when the gradient magnitude between the background and the boundary of the moving object is small. The moving object boundary prediction scheme proposed in this paper outperforms the others in terms of prediction accuracy and computational complexity.

Fig. 5. Two conditional probability curves: $p(M|E_I)$ denotes the probability curve of $M$ given $E_I$ ($E_I = 1$ when $L_{MBB}$ of the given block is smaller than or equal to $SR_{basis}$), and $p(M|E_B)$ denotes the probability curve of $M$ given $E_B$ ($E_B = 1$ when $L_{MBB}$ of the given block is larger than $SR_{basis}$). $M$ denotes the moving object prediction measure. The shaded region denotes the error probability when the threshold of $M$ is given as $\theta_M$. $SR_{basis}$ is defined in Section III-B as a criterion for deciding whether an MB belongs to IR or BR.

TABLE II

Comparison of the Moving Object Boundary Prediction Accuracy, $p_{acc}$, and Operations per 4 × 4 Block Among Roberts [21], Sobel [22], Prewitt [23], Rosenfeld [24], the Sum of Absolute AC Coefficients [19], DC Coefficient [20], and the Proposed MVD-Based Moving Object Boundary Prediction Using Full Search for SOCCER Sequence with QP = 20

| Prediction Measure M | $p_{acc}$ | Compare (per 4 × 4 block) | Add/Sub (per 4 × 4 block) | Multiply (per 4 × 4 block) |
|---|---|---|---|---|
| Roberts [21] | 56.33% | 33 | 86 | |
| Sobel [22] | 56.94% | 33 | 208 | |
| Prewitt [23] | 56.95% | 33 | 208 | 4 |
| Rosenfeld [24] | 54.11% | 113 | 912 | 16 |
| Sum of absolute AC coefficient [19] | 78.88% | 17 | 127 | 128 |
| DC coefficient [20] | 75.81% | 1 | 96 | 128 |
| Ours (activity) | 83.81% | 4 | 2 | |

C. Inter-Layer Activity Model

By exploiting the correlation of the mean activities between two neighboring layers in Fig. 6, the inter-layer activity model (ILAM) is developed to predict the activity of the enhancement layer from that of the base layer with a linear equation

$$\hat{A}^l = \alpha A^{l-1} + \beta \tag{9}$$

where $\hat{A}^l$ is the predicted activity of the given MB in the enhancement layer (layer $l$), the inter-layer activity prediction factor, $\alpha$, is the slope of ILAM denoted by a dashed line in Fig. 6, the inter-layer activity prediction offset, $\beta$, is the intercept of ILAM, and $A^{l-1}$ denotes the activity of the corresponding MB in the base layer (layer $l-1$). $\bar{A}^l$ in Fig. 6 denotes the mean of activities over all MBs in a frame at layer $l$. Values of $\alpha$ and $\beta$ in (9) are obtained through measurement with five video sequences: CITY, CREW, HARBOUR, ICE, and SOCCER (240 frames with a SVC structure comprising three layers). The error between $\bar{A}^l$ and $\alpha \bar{A}^{l-1} + \beta$ is measured with $R^2$ (the coefficient of determination [27]) defined as

$$SS_{err} = \sum_{l} \sum_{f} \left( \bar{A}^l_f - \left( \alpha \bar{A}^{l-1}_f + \beta \right) \right)^2 \tag{10}$$

$$SS_{tot} = \sum_{l} \sum_{f} \left( \bar{A}^l_f - \bar{\bar{A}}^l \right)^2 \tag{11}$$

$$R^2 = 1 - \frac{SS_{err}}{SS_{tot}} \tag{12}$$

where $l$ and $f$ denote the index of layer and frame, respectively, $\bar{A}^l_f$ denotes $\bar{A}^l$ in the $f$th frame at layer $l$, $\bar{\bar{A}}^l$ denotes the mean of $\bar{A}^l_f$ over all frames at layer $l$, $SS_{err}$ is the sum of squared errors between $\bar{A}^l_f$ and $\alpha \bar{A}^{l-1}_f + \beta$, and $SS_{tot}$ is the variance of $\bar{A}^l_f$. If $R^2$ is close to 1.0, the error between $\bar{A}^l$ and $\alpha \bar{A}^{l-1} + \beta$ is small. Table III shows $\alpha$, $\beta$, and $R^2$ measured with five generic sequences. The second column shows the inter-layer activity prediction factor, $\alpha$, for the given sequence; the third column shows the inter-layer activity prediction offset, $\beta$; the fourth column shows the coefficient of determination, $R^2$, for the given $\alpha$ and $\beta$. The value of $\alpha$ represents the coefficient of the assumed linear relationship between the activities in the neighboring layers. Equation (9) is used before the motion search in the current layer to estimate the minimal search range to find the best motion vector without too much quality loss compared to the full search. Because the estimated search range is given as the product of $\alpha$ and the activity of the base layer, $\alpha$ affects both the computation time in ME and the video quality. $\alpha$ varies according to the given sequence while $\beta$ is relatively steady ($\beta$ is set to 0.5). Therefore, $\alpha$ needs to be adjusted to satisfy the variation of the motion nature and the activity relationship between the neighboring layers, which is discussed in Section IV-C.

Fig. 6. Activity plane representing pairs of the mean of activities between two neighboring layers [base layer (BL) = CIF, enhancement layer (EL) = 4CIF] for ICE sequence with QP = 20; the dashed line denotes the inter-layer activity model with the given slope, $\alpha$ ($= \Delta\bar{A}^l / \Delta\bar{A}^{l-1}$), and the given intercept, $\beta$, where $\bar{A}^{l-1}$ and $\bar{A}^l$ denote the mean of activities over all MBs in a frame at the base and enhancement layer, respectively.

TABLE III

Inter-Layer Activity Prediction Factor, $\alpha$, and Inter-Layer Activity Prediction Offset, $\beta$, for the Given Sequence, and the Coefficient of Determination [27], $R^2$, for Given $\alpha$ and $\beta$, Measured with Five Sequences (240 Frames on a SVC Structure Comprising Three Layers)

| Sequence | $\alpha$ | $\beta$ | $R^2$ |
|---|---|---|---|
| CITY | 1.9 | 0.6 | 0.91 |
| CREW | 2.3 | 0.4 | 0.95 |
| HARBOUR | 1.7 | 0.4 | 0.91 |
| ICE | 3.9 | 0.6 | 0.92 |
| SOCCER | 3.7 | 0.6 | 0.91 |
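The linear model in (9) and the goodness-of-fit measure in (10)-(12) can be illustrated as follows. This is a sketch under stated assumptions: it fits the slope and intercept by ordinary least squares over per-frame mean activities of one layer pair, which is one plausible reading of "curve-fitted linear equation"; the function names are hypothetical.

```python
from typing import Sequence, Tuple

def fit_ilam(base_means: Sequence[float],
             enh_means: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit of enh = alpha * base + beta; returns
    (alpha, beta, r2) with r2 = 1 - SS_err / SS_tot as in (10)-(12)."""
    n = len(base_means)
    mean_x = sum(base_means) / n
    mean_y = sum(enh_means) / n
    s_xx = sum((x - mean_x) ** 2 for x in base_means)
    s_xy = sum((x - mean_x) * (y - mean_y)
               for x, y in zip(base_means, enh_means))
    alpha = s_xy / s_xx
    beta = mean_y - alpha * mean_x
    ss_err = sum((y - (alpha * x + beta)) ** 2
                 for x, y in zip(base_means, enh_means))
    ss_tot = sum((y - mean_y) ** 2 for y in enh_means)
    return alpha, beta, 1.0 - ss_err / ss_tot

def predicted_activity(alpha: float, beta: float, a_base: float) -> float:
    """Predicted enhancement-layer activity per (9), used as the search range."""
    return alpha * a_base + beta
```

For a perfectly linear toy data set such as base means [1, 2, 3, 4] and enhancement means [2.5, 4.5, 6.5, 8.5], the fit returns a slope of 2, an intercept of 0.5, and a coefficient of determination of 1.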

IV. Proposed Activity-Based Motion Estimation Algorithm

A. Overall Procedure of the Proposed Scheme

The proposed activity-based ME (ABME) scheme takes one of the two paths, i.e., ME for IR and ME for BR, according to the activity of the base layer, $A^{l-1}$. At the beginning, the search range is given by the inter-layer activity model (ILAM) using (9). If $A^{l-1}$ is smaller than $\theta_{act}$, the activity threshold, ABME takes ME for IR. Otherwise, ABME takes ME for BR. The final MV is chosen among the search results in terms of the rate-distortion cost. During the motion search, parameter $\alpha$, the inter-layer activity prediction factor in (9), and $\theta_{act}$ are adjusted. The detailed procedure is introduced in Section IV-C.

B. Search Center Set

The search center set in the enhancement layer consists of three elements: $MV^l_{med}$, $MV^l_s$, and $MV_z$, as described in Fig. 7. $MV^l_{med}$, the median MVP, is defined as

$$MV^l_{med} = \mathrm{median}\left( MV^l_{left},\ MV^l_{upper},\ MV^l_{upper\text{-}right} \right) \tag{13}$$

where $MV^l_{left}$, $MV^l_{upper}$, and $MV^l_{upper\text{-}right}$ denote the MV of the left, upper, and upper-right block in the enhancement layer (layer $l$), respectively. $MV^l_s$ denotes the BL MVP, i.e., the MV obtained by up-scaling the MV of the base layer as mentioned in the previous section. $MV_z$ denotes a zero motion vector, (0, 0).

As shown in Table IV, the search center set is formed according to the given activity region and the availability of $MV^l_s$. In general, blocks in IR show better rate-distortion performance with $MV^l_s$ since they are placed in the interior of moving objects and have regular motion. Simulation results have shown that no significant benefit is obtained by additional consideration of $MV_z$ in IR. In H.264, pictures are divided into I, P (backward prediction) and B (forward and backward prediction) type. $MV^l_s$ may not be available for forward or backward prediction in a B picture. In this case, the search center set consists of $MV^l_{med}$ and $MV_z$ instead of $MV^l_s$. ME for BR employs both MVPs (i.e., $MV^l_s$ and $MV^l_{med}$). The search begins with each element in the search center set.

Fig. 7. Three elements which the search center set consists of in the enhancement layer: $MV^l_{med}$, the median motion vector predictor at layer $l$, $MV^l_s$, the base-layer motion vector predictor at layer $l$, and $MV_z$, the zero motion vector, where $MV^l_{final}$ denotes the final MV at layer $l$, and $\hat{A}^l$ denotes the predicted activity based on the inter-layer activity model.

TABLE IV

Search Center Set According to Each Activity Region and the Availability of $MV^l_s$

| | IR (Interior Region) | BR (Boundary Region) |
|---|---|---|
| $MV^l_s$ is available | $\{MV^l_s\}$ | $\{MV^l_s, MV^l_{med}\}$ |
| $MV^l_s$ is unavailable | $\{MV^l_{med}, MV_z\}$ | $\{MV^l_{med}, MV_z\}$ |
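The path selection of Section IV-A and the search center set of Table IV can be expressed compactly. The sketch below is illustrative only; the function names are hypothetical and an unavailable base-layer predictor is represented by `None`.

```python
from typing import List, Optional, Tuple

Vector = Tuple[int, int]
MV_ZERO: Vector = (0, 0)

def select_region(a_base: float, theta_act: float) -> str:
    """ME for IR when the base-layer activity is below the threshold,
    ME for BR otherwise."""
    return "IR" if a_base < theta_act else "BR"

def search_center_set(region: str, mv_s: Optional[Vector],
                      mv_med: Vector) -> List[Vector]:
    """Search centers per Table IV; mv_s is None when the BL MVP is unavailable."""
    if mv_s is None:
        return [mv_med, MV_ZERO]
    if region == "IR":
        return [mv_s]
    return [mv_s, mv_med]
```

For instance, `search_center_set("BR", (4, 2), (3, 1))` yields both predictors, while an IR block with an available BL MVP searches around that predictor only.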

C. Parameter Adjustment

1) Inter-Layer Activity Prediction Factor, $\alpha$: It is observed that $\alpha$ depends on the nature of motion in the scene, and, therefore, needs to be adjusted to the given sequence. We propose a two-level adjustment scheme for $\alpha$ consisting of MB level and frame level. The search range is not fixed but adjusted by (9) with a given $\alpha$. After the motion search, we check whether the search range thus obtained is sufficient or not as follows. If the best point with the minimum rate-distortion cost is close enough to the boundary of the search range, we suspect that there may exist some point with lower rate-distortion cost than that point beyond the search range. On the other hand, if the best point is close enough to the predictor, the prediction is assumed to be quite accurate, obviating the need for further checking of points far from the predictor. The decision is made based on the distance between the predictor and the best point in terms of the rate-distortion cost obtained by the motion search within the given search range.

Fig. 8. Optional check of diamond-shaped points (OCDSP), where SR is derived from (9). Point I denotes the point with the minimum R-D cost within the given search range, d denotes the distance between Predictor and point I, and point J denotes the center point of the diamond-shaped search pattern whose distance from Predictor is twice as long as d. Five gray-colored circles denote optional check points in the diamond-shaped search pattern, and point K denotes the point with the minimum R-D cost among six candidates, i.e., five optional check points and point I. $SR_{new}$ denotes the required search range to cover point K (a) when point K is different from point I, and (b) when point K is identical to point I.

In Fig. 8, we introduce a procedure called optional check of diamond-shaped points (OCDSP). We define the best point obtained by the motion search as point I and the center point of the diamond pattern as point J, which is located twice as far as point I from the point denoted as Predictor along the direction of the Predictor-point I vector. The radius of the diamond pattern is given as $L_{MBB}$, the length of the longer edge of the minimum bounding box (MBB) which covers both Predictor and point I. We obtain a new inter-layer activity prediction factor, $\alpha$, after the following steps defined as OCDSP.

1) Set the rate-distortion cost of point I to $RDCost_I$.

2) Define the best point among the five optional check points in the diamond pattern as point K.

3) Set the rate-distortion cost of point K to $RDCost_K$.

4) If $RDCost_I < RDCost_K$, then point I is renamed as point K as described in Fig. 8(b).

5) Get $SR_{new}$, which minimally covers point K from Predictor.

6) Calculate $\alpha$ deductively using (9) by using the updated search range ($SR_{new}$) as $\hat{A}^l$: $\alpha = SR_{new} / A^{l-1}$.
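The OCDSP steps can be sketched in Python as follows. Several details are assumptions made for illustration and are not stated in the text: the five optional check points are taken to be the diamond center J and its four vertices at distance $L_{MBB}$, the search range covering point K is measured as the larger absolute coordinate offset from Predictor, and the update is skipped (returning `None`) when the base-layer activity is zero. The function name `ocdsp` and the `rd_cost` callback are hypothetical.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[int, int]

def ocdsp(predictor: Point, point_i: Point,
          rd_cost: Callable[[Point], float],
          a_base: float) -> Tuple[Point, int, Optional[float]]:
    """One optional diamond check; returns (point_k, sr_new, alpha)."""
    px, py = predictor
    dx, dy = point_i[0] - px, point_i[1] - py
    radius = max(abs(dx), abs(dy))        # L_MBB of Predictor and point I
    jx, jy = px + 2 * dx, py + 2 * dy     # point J, twice as far as point I
    # Assumed layout of the five optional check points: center and vertices.
    candidates = [(jx, jy), (jx + radius, jy), (jx - radius, jy),
                  (jx, jy + radius), (jx, jy - radius)]
    point_k = min(candidates, key=rd_cost)            # steps 2 and 3
    if rd_cost(point_i) < rd_cost(point_k):           # step 4
        point_k = point_i
    sr_new = max(abs(point_k[0] - px), abs(point_k[1] - py))  # step 5
    alpha = sr_new / a_base if a_base > 0 else None           # step 6
    return point_k, sr_new, alpha
```

When a cheaper point is found in the diamond pattern, the returned factor grows so that the next search range covers it; when point I remains the best, the factor shrinks toward the range that was actually needed.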

$\alpha$ is defined in two levels: in MB level ($\alpha_{MB}$) and in frame level ($\alpha_{frame}$). First, $\alpha$ of the 16 × 16 mode is obtained by OCDSP after the completion of the motion search using the search range given by (9) with the previous value of $\alpha_{frame}$, and is defined as $\alpha_{MB}$. The remaining modes, such as the 16 × 8, 8 × 16, and 8 × 8 modes, are tested using the search range given by (9) with the updated $\alpha_{MB}$, as OCDSP is continuously performed for each mode. After the mode decision, $\alpha$ of the best mode is defined as $\alpha_{best}$. The mean of $\alpha_{best}$ over all MBs in a frame is used to update $\alpha_{frame}$. Initially, $\alpha_{frame}$ is set to the maximum among the $\alpha$ values in Table III to support generic sequences. All these parameters are controlled individually layer by layer. Fig. 9 shows the variation of $\alpha$ along with $\Delta PSNR$ (PSNR relative to that with fixed $\alpha$) and $\Delta T$ (percentage decrement of computation time of ME relative to the case of fixed $\alpha$) for CITY sequence. It is observed that not only is the video quality improved, but also about 90% computation time reduction in ME is achieved through the $\alpha$ adjustment.

Fig. 9. Result of the $\alpha$ adjustment in terms of (a) relative peak signal-to-noise ratio (PSNR), $\Delta PSNR$, and (b) relative computation time of ME, $\Delta T$, between adjusted $\alpha$ and fixed $\alpha$ given with the value of $\alpha$ for CITY sequence with QP = 20.

2) Activity Threshold, $\theta_{act}$: Because updating $\theta_{act}$ with the full search is too time-consuming, we employ the sum of absolute AC coefficients (SAC) as the reference measure, since SAC is most strongly correlated with activity among all boundary prediction measures (see Table II). Fig. 10 shows a scatter plot of the activity and SAC, where diamond-shaped points denote points in the interior of objects and star-shaped points denote points in the object boundary. The activity region is classified by $\theta_{act}$ into IR (hatched region) and BR (shaded region). $\theta_{act}$ is determined as follows.

The Euclidean distances among SACs in IR and BR, $d_{I_{\theta_{act}}}$ and $d_{B_{\theta_{act}}}$, respectively, are calculated as follows:

$$d_{I_{\theta_{act}}} = \sum_{i \in I_{\theta_{act}}} \left( y_i - y_{I_{\theta_{act}},mean} \right)^2 \tag{14}$$

$$d_{B_{\theta_{act}}} = \sum_{i \in B_{\theta_{act}}} \left( y_i - y_{B_{\theta_{act}},mean} \right)^2 \tag{15}$$

where

$$y_{I_{\theta_{act}},mean} = \frac{1}{|I_{\theta_{act}}|} \sum_{i \in I_{\theta_{act}}} y_i, \quad I_{\theta_{act}} = \{ i : A_i < \theta_{act} \ \text{for} \ \forall i \} \tag{16}$$

$$y_{B_{\theta_{act}},mean} = \frac{1}{|B_{\theta_{act}}|} \sum_{i \in B_{\theta_{act}}} y_i, \quad B_{\theta_{act}} = \{ i : A_i \ge \theta_{act} \ \text{for} \ \forall i \}. \tag{17}$$

Fig. 10. Scatter plot of activity, $A_i$, and the sum of absolute AC coefficients (SAC), $y_i$, where $i$ denotes the MB index, $\theta_{act}$ denotes the activity threshold dividing object regions into IR (hatched region) and BR (shaded region), while $\theta^+_{act}$ and $\theta^-_{act}$ denote the increment and the decrement of $\theta_{act}$, respectively. Diamond-shaped points denote points in the interior of objects, and star-shaped points denote points in the object boundary.

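In (16) and (17), $y_{I_{\theta_{act}},mean}$ and $y_{B_{\theta_{act}},mean}$ denote the mean SAC over all MBs in IR and BR, respectively, and $y_i$ denotes the SAC of the $i$th MB. Then, the Euclidean distances obtained by the increment and decrement of $\theta_{act}$, $\theta^+_{act}$ and $\theta^-_{act}$, are also calculated. The changes of the Euclidean distances obtained by $\theta^+_{act}$ and $\theta^-_{act}$, for each activity region, are defined as follows:

$$\Delta d_{I+} = d_{I_{\theta^+_{act}}} - d_{I_{\theta_{act}}} \tag{18}$$

$$\Delta d_{B+} = d_{B_{\theta^+_{act}}} - d_{B_{\theta_{act}}} \tag{19}$$

$$\Delta d_{I-} = d_{I_{\theta^-_{act}}} - d_{I_{\theta_{act}}} \tag{20}$$

$$\Delta d_{B-} = d_{B_{\theta^-_{act}}} - d_{B_{\theta_{act}}}. \tag{21}$$

$\theta_{act}$ is updated every frame according to the following conditions.

1) If $\Delta d_{I+} + \Delta d_{B+}$ is negative, $\theta_{act}$ is incremented by 1.

2) If $\Delta d_{I-} + \Delta d_{B-}$ is negative, $\theta_{act}$ is decremented by 1.

3) Otherwise, $\theta_{act}$ retains its value.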
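The per-frame threshold update in (14)-(21) can be sketched as below. The sketch makes assumptions that the text leaves open: the distances are computed as sums of squared deviations from the region mean, the incremented and decremented thresholds differ from the current one by exactly 1, each condition is tested on the sum of the two changes, an empty region contributes zero, and the increment test takes priority when both conditions hold. All function names are hypothetical.

```python
from typing import Sequence, Tuple

def scatter(values: Sequence[float]) -> float:
    """Sum of squared deviations from the mean, as in (14) and (15)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)

def region_distances(activities: Sequence[float], sacs: Sequence[float],
                     theta: float) -> Tuple[float, float]:
    """Return (d_I, d_B) for the IR/BR split induced by theta, per (16), (17)."""
    interior = [y for a, y in zip(activities, sacs) if a < theta]
    boundary = [y for a, y in zip(activities, sacs) if a >= theta]
    return scatter(interior), scatter(boundary)

def update_threshold(activities: Sequence[float], sacs: Sequence[float],
                     theta: int) -> int:
    """Apply update rules 1)-3) once and return the new threshold."""
    d_i, d_b = region_distances(activities, sacs, theta)
    d_i_up, d_b_up = region_distances(activities, sacs, theta + 1)
    d_i_dn, d_b_dn = region_distances(activities, sacs, theta - 1)
    if (d_i_up - d_i) + (d_b_up - d_b) < 0:   # rule 1, via (18) and (19)
        return theta + 1
    if (d_i_dn - d_i) + (d_b_dn - d_b) < 0:   # rule 2, via (20) and (21)
        return theta - 1
    return theta                               # rule 3
```

With activities [1, 2, 3, 8, 9] and SACs [10, 12, 50, 52, 54], a threshold of 3 already separates the low-SAC and high-SAC blocks and is retained, while thresholds of 2 or 4 are moved back to 3.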

The initial value of $\theta_{act}$ is given by statistical analysis using (3) for five sequences: CITY, CREW, HARBOUR, ICE, and SOCCER. $\theta_{act}$ is controlled individually for each spatial layer. Fig. 11 shows the variation of $\theta_{act}$ along with $\Delta PSNR$ (PSNR relative to that with fixed $\theta_{act}$) and $\Delta T$ (percentage decrement of computation time of ME relative to the case of fixed $\theta_{act}$) for CITY sequence. The computation time of ME was reduced by about 20% compared with the case of fixed $\theta_{act}$ at the cost of slight quality degradation.

Fig. 11. Result of the $\theta_{act}$ adjustment in terms of (a) relative peak signal-to-noise ratio (PSNR), $\Delta PSNR$, and (b) relative computation time of ME, $\Delta T$, between adjusted $\theta_{act}$ and fixed $\theta_{act}$ given with the value of $\theta_{act}$ for CITY sequence with QP = 20.

V. Experimental Results

A. Configuration of Experiments

The experiment platform is four Dual-Core AMD Opteron 2.6 GHz CPUs and 16 GB RAM with CentOS 4.5. The experiment conditions were set as follows.

1) A SVC structure comprising three layers, with resolutions given as QCIF, CIF, and 4CIF, is taken.

2) The search range for each resolution is set as follows: [-16, +16], [-32, +32], and [-64, +64] for QCIF, CIF, and 4CIF, respectively.

3) The number of frames in a GOP² is set to 8, and the hierarchical B-picture structure [3] is employed as depicted in Fig. 12.

4) 240 frames are tested for each sequence at 30 frames/s.

5) Intra prediction is restricted in the encoder to mainly observe the effect of motion estimation.

6) The quantization parameter is set to 20, 24, 28, and 32.

7) The rate-distortion optimization is enabled.

8) The context-adaptive binary arithmetic coding is used.

9) The adaptive inter-layer prediction is enabled.

²A group of pictures (GOP) consists of a key picture, which is generally coded as P picture, and several hierarchically coded B pictures that are located between the key pictures. The coding order of hierarchical prediction is depicted in Fig. 12.

Fig. 12. Hierarchical prediction structures for motion-compensated prediction with GOP = 8 (IBBBBBBBP).

To evaluate the rate-distortion performance and CPU time in SVC, we first implemented the proposed activity-based ME scheme into JSVM [28]. The N3SS [14], the 4SS [15], the DS [16], and the EPZS [17] were simulated. We also implemented Chen's algorithm [13], which is here referred to as the AdaptiveSR method. For performance comparison, DirectMaxMv, which uses the maximum absolute value of MVs of the corresponding MB in the base layer as the search range, was also implemented.

B. Comparison of Bit-Rate, PSNR, and CPU Time

Table V reports the experimental results of N3SS, 4SS, DS, EPZS, DirectMaxMv, the AdaptiveSR method, and the proposed scheme compared with the reference encoder (JSVM [28]) in terms of bit-rate, peak signal-to-noise ratio (PSNR), and CPU time. The relative bit-rate and PSNR are calculated by the method of Bjontegaard delta bit-rate (BDBR) and Bjontegaard delta PSNR (BDPSNR) [29], respectively. $\Delta T$ denotes the CPU time reduction in ME for all the spatial layers (QCIF, CIF, and 4CIF) compared with JSVM.

Table V shows that the proposed scheme reduced the ME execution time by 99.26% on average compared with JSVM, while the rate-distortion performance loss is almost negligible: +1.56% in bit-rate and -0.048 dB in PSNR on average. In the case of full search, the search point ratios of the base layer (QCIF) to the enhancement layers (CIF and 4CIF) for CITY are 0.0221 and 0.0023, respectively, because of the increased search range and the additional motion vector predictor in the enhancement layers. Therefore, even though there is no computational reduction in the base layer, we achieved about 99% time saving owing to the 99.6% reduction in the number of search points in the enhancement layers.
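The relation between the per-layer search point ratios and the overall saving can be checked with a short calculation. The sketch below assumes that ME time is proportional to the number of search points, normalizes the base-layer cost to 1, and leaves the base layer unchanged; the function name is hypothetical, and only the ratios 0.0221 and 0.0023 and the 99.6% reduction come from the text.

```python
def overall_me_saving(bl_to_el_ratios, el_reduction, bl_reduction=0.0):
    """Fraction of total search points saved across all layers.

    bl_to_el_ratios: base-layer to enhancement-layer search point ratios.
    el_reduction:    fraction of enhancement-layer search points removed.
    bl_reduction:    fraction of base-layer search points removed.
    """
    base = 1.0                                   # normalized base-layer cost
    els = [1.0 / r for r in bl_to_el_ratios]     # enhancement-layer costs
    before = base + sum(els)
    after = base * (1.0 - bl_reduction) + sum(
        e * (1.0 - el_reduction) for e in els)
    return 1.0 - after / before


saving = overall_me_saving([0.0221, 0.0023], el_reduction=0.996)
print(f"{100 * saving:.2f}%")  # roughly 99.4%
```

Under these assumptions the base layer accounts for only about 0.2% of all full-search points, so the total saving stays close to the enhancement-layer reduction.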

The performance of the proposed method can be affected by the characteristics of the motion rather than the texture. Thus, we chose five test sequences with different motion characteristics: CITY, CREW, HARBOUR, ICE, and SOCCER. The rate-distortion performance of the proposed method for CITY is better than that for the other sequences except HARBOUR, because the motion in CITY is quite regular owing to the camera movement. The rate-distortion performance is medium for CREW, because the background is covered by objects and their motion is relatively low. Most schemes show the best rate-distortion performance for HARBOUR, where motion is very low. On the other hand, the rate-distortion performance is relatively poor for ICE and SOCCER, where there are many fast-moving objects.

Fast search algorithms such as N3SS, 4SS, DS, and EPZS tend to produce sub-optimal results, although they are definitely faster than the proposed activity-based ME. With slightly improved video quality, EPZS achieved faster motion estimation than N3SS, 4SS, and DS owing to its early-stopping criteria based on the sum of absolute differences. However, these early-stopping criteria are not appropriate for the hierarchical prediction structure in H.264/SVC in terms of rate-distortion performance. The relatively high bit-rates of N3SS, 4SS, DS, and EPZS require high bandwidth in H.264/SVC. AdaptiveSR takes about 50 times more CPU time than ours and has relatively worse rate-distortion performance. DirectMaxMv shows improved rate-distortion performance but also takes about 50 times more CPU time than ours (see Table V).
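The "about 50 times" figure follows from the average ΔT values in Table V: a scheme that removes 64% of the ME time still spends 36% of it, whereas the proposed scheme spends only 0.74%. The small helper below (a hypothetical name, written only to make the arithmetic explicit) reproduces this ratio.

```python
def relative_me_time(reduction_other, reduction_ours):
    """Ratio of remaining ME time, given two time reductions in percent."""
    return (100.0 - reduction_other) / (100.0 - reduction_ours)


# Average time reductions from Table V.
print(round(relative_me_time(64.08, 99.26), 1))  # AdaptiveSR:  48.5
print(round(relative_me_time(64.16, 99.26), 1))  # DirectMaxMv: 48.4
```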

C. Results About Incorporating Fast Full-Search Block Matching Algorithm with the Proposed Scheme

TZ search (TZS), introduced as a new block matching algorithm in JSVM [28], provides drastically reduced encoding time with rate-distortion performance comparable to that of the full-search algorithm. TZS utilizes different search strategies depending on the location of the best match found so far. The search begins with a comparison of the rate-distortion costs of several motion vector candidates (i.e., the MVs of surrounding blocks). The best match among the motion vector candidates is chosen as the starting position for a diamond-shaped search, which is stopped when the best match is located near the starting position. If a better match is found farther away from the starting position, the full search is triggered.
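The three stages described above (candidate comparison, diamond-shaped search, and conditional full search) can be sketched as follows. This is a simplified illustration and not the JSVM implementation: it uses the SAD alone instead of a rate-distortion cost, always adds the zero vector to the candidates, uses a four-point diamond with a doubling stride, and treats the `near` threshold as a free parameter. All function and parameter names are hypothetical.

```python
import numpy as np


def sad(cur, ref, bx, by, mv, n=16):
    """Sum of absolute differences for the n x n block at (bx, by) moved by mv."""
    x, y = bx + mv[0], by + mv[1]
    if x < 0 or y < 0 or x + n > ref.shape[1] or y + n > ref.shape[0]:
        return float("inf")  # displaced block falls outside the reference
    a = cur[by:by + n, bx:bx + n].astype(np.int64)
    b = ref[y:y + n, x:x + n].astype(np.int64)
    return float(np.abs(a - b).sum())


def tz_like_search(cur, ref, bx, by, candidates, search_range=16, near=2):
    def cost(mv):
        return sad(cur, ref, bx, by, mv)

    # Stage 1: the cheapest candidate becomes the starting position.
    start = min([(0, 0)] + list(candidates), key=cost)
    best, best_cost = start, cost(start)

    # Stage 2: diamond-shaped search around the start, doubling the stride.
    dist = 1
    while dist <= search_range:
        for dx, dy in ((dist, 0), (-dist, 0), (0, dist), (0, -dist)):
            mv = (start[0] + dx, start[1] + dy)
            c = cost(mv)
            if c < best_cost:
                best, best_cost = mv, c
        dist *= 2

    # Stage 3: stop if the best match is near the start; otherwise
    # fall back to a full search of the window around the start.
    if max(abs(best[0] - start[0]), abs(best[1] - start[1])) > near:
        for dy in range(-search_range, search_range + 1):
            for dx in range(-search_range, search_range + 1):
                mv = (start[0] + dx, start[1] + dy)
                c = cost(mv)
                if c < best_cost:
                    best, best_cost = mv, c
    return best, best_cost
```

The sketch makes it visible why a good starting position and a tight search range matter: when the predictor is accurate, stage 3 is skipped, and when it is not, the cost of stage 3 grows with the square of `search_range`.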

Because the proposed scheme plays a crucial role in deciding the MVP and the search range, incorporating TZS with the proposed scheme can remarkably reduce the execution time of ME in SVC without significant quality loss compared with the full-search algorithm. Table VI shows that the CPU time reduction (ΔT) obtained by the proposed scheme with TZS is about 91.84% at the cost of a mere 1.265% bit-rate increase and 0.041 dB PSNR decrease on average compared with the TZS-only scheme. We used the same configuration as described in Section V-A, i.e., three spatial layers [QCIF (BL), CIF (EL), and 4CIF (EL)] with 240 frames. Because the proposed activity-based ME improves the starting point of TZS by using the adjusted search range, the bit-rate and quality obtained by the proposed scheme with TZS are better than those obtained by the proposed scheme without TZS. The memory bandwidth (BW) for reference data loading is also compared among the three schemes. The proposed scheme resulted in about 90% reduction of memory bandwidth compared with TZS utilizing Level C data reuse [30].
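The 90% figure can be reproduced from the average memory bandwidth values reported in Table VI (1449 MB/s for TZS, 118 MB/s for the proposed scheme without TZS, and 144 MB/s with TZS). The helper name below is hypothetical and only makes the arithmetic explicit.

```python
def reduction_percent(baseline, value):
    """Percentage by which value is lower than baseline."""
    return 100.0 * (1.0 - value / baseline)


tzs_bw = 1449  # MB/s, average for TZS in Table VI
print(round(reduction_percent(tzs_bw, 144), 1))  # proposed w/ TZS:  90.1
print(round(reduction_percent(tzs_bw, 118), 1))  # proposed w/o TZS: 91.9
```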


TABLE V

Comparison of Bit-Rate (BDBR), PSNR (BDPSNR), and CPU Time Reduction (ΔT) Among Seven Schemes Including the Proposed Scheme for Five Sequences Compared with JSVM [28]

Metric        Scheme        CITY      CREW      HARBOUR   ICE       SOCCER    Average
BDBR (%)      N3SS          +46.202   +23.709   +2.056    +38.622   +49.262   +32.134
              4SS           +32.454   +23.239   +2.056    +35.144   +54.668   +29.512
              DS            +27.829   +20.604   +1.568    +23.415   +40.730   +22.829
              EPZS          +10.453   +29.530   +8.901    +25.968   +31.324   +21.235
              AdaptiveSR    +6.439    +8.580    +5.638    +9.531    +3.998    +6.837
              DirectMaxMv   -0.052    -0.286    -0.115    +1.470    -0.747    +0.054
              Proposed      +0.046    +0.828    -0.063    +4.704    +2.270    +1.557
BDPSNR (dB)   N3SS          -1.487    -0.802    -0.135    -1.107    -1.854    -1.077
              4SS           -1.036    -0.786    -0.097    -1.011    -2.071    -1.000
              DS            -0.887    -0.699    -0.074    -0.674    -1.528    -0.772
              EPZS          -0.335    -1.049    -0.427    -0.742    -1.19     -0.749
              AdaptiveSR    -0.207    -0.292    -0.267    -0.274    -0.147    -0.237
              DirectMaxMv   +0.002    +0.012    +0.005    -0.042    +0.029    +0.001
              Proposed      0.000     -0.026    +0.003    -0.135    -0.083    -0.048
ΔT (%)        N3SS          99.81     99.78     99.83     99.82     99.75     99.80
              4SS           99.81     99.81     99.84     99.83     99.79     99.82
              DS            99.82     99.81     99.87     99.84     99.78     99.82
              EPZS          99.97     99.97     99.96     99.98     99.96     99.97
              AdaptiveSR    81.76     76.35     61.14     40.85     60.30     64.08
              DirectMaxMv   81.23     52.21     83.04     75.01     29.33     64.16
              Proposed      99.57     98.49     99.62     99.40     99.21     99.26

TABLE VI

Comparison of Two Schemes, 1) Proposed Scheme Without TZ Search (TZS), and 2) Proposed Scheme with TZS in Terms of Bit-Rate (BDBR), PSNR (BDPSNR), CPU Time Reduction (ΔT), and Memory Bandwidth (BW) with TZ Search Scheme Where Base Layer = QCIF and Enhancement Layer = (CIF, 4CIF) with 240 Frames

Metric        Scheme             CITY     CREW     HARBOUR  ICE      SOCCER   Average
BDBR (%)      Proposed w/o TZS   +0.352   +1.119   +0.241   +3.508   +1.767   +1.397
              Proposed w/ TZS    +0.349   +0.885   +0.176   +3.302   +1.613   +1.265
BDPSNR (dB)   Proposed w/o TZS   -0.011   -0.038   -0.012   -0.102   -0.065   -0.046
              Proposed w/ TZS    -0.011   -0.030   -0.008   -0.095   -0.060   -0.041
ΔT (%)        Proposed w/o TZS   64.46    48.61    77.61    51.66    60.72    60.61
              Proposed w/ TZS    92.05    91.58    94.41    90.14    90.99    91.84
BW (MB/s)     TZS                1035     2092     747      1015     2356     1449
              Proposed w/o TZS   92       181      85       92       143      118
              Proposed w/ TZS    94       232      88       110      198      144

VI. Conclusion

In this paper, we demonstrated a fast multilayer motion estimation scheme that utilizes the activity, defined as the absolute value of the motion vector difference. It was possible to reduce the execution time of ME by utilizing the motion properties of the base layer, i.e., the MVs and MVDs of the corresponding blocks in the base layer. The MVP was adaptively selected according to the activity of the base layer. The inter-layer activity model, developed from the linear relationship between the activities in the neighboring layers, was used to decide the search range so as to achieve similar rate-distortion performance despite the reduced execution time of ME. Two significant parameters related to the activity were adjusted to the sequence. Finally, the proposed scheme achieved a 99.26% CPU time reduction in ME at the cost of a 1.56% bit-rate increase and a 0.048 dB PSNR decrease for sequences with different activity properties, compared with the conventional full-search algorithm. By adopting the fast full-search block matching algorithm in JSVM, the CPU time reduction increased to 99.85% without significant loss of rate-distortion performance compared with the full-search algorithm.

References

[1] Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, document JVT-G050.doc, ITU-T Rec. H.264/ISO/IEC 14496-10 AVC, Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, 2003.

[2] I. JTC1, Joint Draft 8 of SVC Amendment, document JVT-X201.doc, ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6, Jul. 2007.

[3] H. Schwarz, D. Marpe, and T. Wiegand, "Analysis of hierarchical B pictures and MCTF," in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2006, pp. 1929–1932.

[4] H. Schwarz, D. Marpe, and T. Wiegand, "Overview of the scalable video coding extension of the H.264/AVC standard," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 9, pp. 1103–1120, Sep. 2007.

[5] C. Segall and G. Sullivan, "Spatial scalability within the H.264/AVC scalable video coding extension," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 9, pp. 1121–1135, Sep. 2007.


[6] Y.-H. Chen, T.-D. Chuang, Y.-J. Chen, and L.-G. Chen, "Bandwidth-efficient encoder framework for H.264/AVC scalable extension," in Proc. 9th ISMW, Dec. 2007, pp. 401–406.

[7] Y.-H. Chen, T.-D. Chuang, Y.-J. Chen, C.-T. Li, C.-J. Hsu, S.-Y. Chien, and L.-G. Chen, "An H.264/AVC scalable extension and high profile HDTV 1080p encoder chip," in Proc. IEEE Symp. VLSI Circuits, Jun. 2008, pp. 104–105.

[8] L.-W. Lee, J.-F. Wang, J.-Y. Lee, and J.-D. Shie, "Dynamic search-window adjustment and interlaced search for block-matching algorithm," IEEE Trans. Circuits Syst. Video Technol., vol. 3, no. 1, pp. 85–87, Feb. 1993.

[9] J. Feng, K.-T. Lo, H. Mehrpour, and A. Karbowiak, "Adaptive block matching motion estimation algorithm for video coding," Electron. Lett., vol. 31, no. 18, pp. 1542–1543, Aug. 1995.

[10] H.-S. Oh and H.-K. Lee, "Block-matching algorithm based on an adaptive reduction of the search area for motion estimation," Real-Time Imaging, vol. 6, no. 5, pp. 407–414, 2000.

[11] T. Yamada, M. Ikekawa, and I. Kuroda, "Fast and accurate motion estimation algorithm by adaptive search range and shape selection," in Proc. IEEE ICASSP, vol. 2. Mar. 2005, pp. 897–900.

[12] T. Song, K. Ogata, K. Saito, and T. Shimamoto, "Adaptive search range motion estimation algorithm for H.264/AVC," in Proc. IEEE ISCAS, May 2007, pp. 3956–3959.

[13] Z. Chen, Y. Song, T. Ikenaga, and S. Goto, "Adaptive search range algorithms for variable block size motion estimation in H.264/AVC," IEICE Trans. Fundam., vol. E91-A, no. 4, pp. 1015–1022, 2008.

[14] R. Li, B. Zeng, and M. Liou, "A new three-step search algorithm for block motion estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 4, no. 4, pp. 438–442, Aug. 1994.

[15] L.-M. Po and W.-C. Ma, "A novel four-step search algorithm for fast block motion estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 3, pp. 313–317, Jun. 1996.

[16] S. Zhu and K.-K. Ma, "A new diamond search algorithm for fast block-matching motion estimation," IEEE Trans. Image Process., vol. 9, no. 2, pp. 287–290, Feb. 2000.

[17] A. Tourapis, "Enhanced predictive zonal search for single and multiple frame motion estimation," in Proc. Visual Commun. Image Process., 2002, pp. 1069–1079.

[18] K. De Wolf, D. De Schrijver, S. De Zutter, and R. Van de Walle, "Scalable video coding: Analysis and coding performance of inter-layer prediction," in Proc. 9th ISSPA, Feb. 2007, pp. 1–4.

[19] B. Shen, I. Sethi, and B. Vasudev, "Adaptive motion-vector resampling for compressed video downscaling," IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 6, pp. 929–936, Sep. 1999.

[20] M.-J. Chen, M.-C. Chu, and S.-Y. Lo, "Motion vector composition algorithm for spatial scalability in compressed video," IEEE Trans. Consumer Electron., vol. 47, no. 3, pp. 319–325, Aug. 2001.

[21] L. Roberts et al., Machine Perception of Three-Dimensional Solids. New York: Garland, 1980.

[22] R. Duda and P. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.

[23] J. Prewitt, "Object enhancement and extraction," in Picture Processing and Psychopictorics. New York: Academic Press, 1970, pp. 75–149.

[24] A. Rosenfeld and A. Kak, Digital Picture Processing. Orlando, FL: Academic Press, 1982.

[25] S. Na and C.-M. Kyung, "A multilayer motion estimation scheme for spatial scalability in H.264/AVC scalable extension," in Proc. Int. Conf. Multimedia Expo, Jun. 2009, pp. 69–72.

[26] F. van der Heijden, R. Duin, D. De Ridder, and D. Tax, Classification, Parameter Estimation, and State Estimation: An Engineering Approach Using MATLAB. New York: Wiley, 2004.

[27] N. Draper and H. Smith, Applied Regression Analysis. New York: Wiley, 1998.

[28] I. JTC1, Joint Scalable Video Model JSVM-12, document JVT-Y202.doc, ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6, Oct. 2007.

[29] G. Bjontegaard, Calculation of Average PSNR Differences Between RD-Curves, document VCEG-M33.doc, VCEG 13th Meeting, Apr. 2001.

[30] J.-C. Tuan, T.-S. Chang, and C.-W. Jen, "On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 1, pp. 61–72, Jan. 2002.

Sangkwon Na received the B.S. degree in electrical engineering from Ajou University, Suwon, Korea, in 2003. Since 2003, he has been pursuing the unified course of the M.S. and Ph.D. degrees from the Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Korea.

His current research interests include low-power video codec design, wireless surveillance system, and platform-based architecture exploration.

Chong-Min Kyung (F'09) received the B.S. degree in electronics engineering from Seoul National University, Seoul, Korea, in 1975, and the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 1977 and 1981, respectively.

From April 1981 to January 1983, he was with Bell Telephone Laboratories, Murray Hill, NJ, in a postdoctoral position. Since he joined KAIST in 1983, he has been working on system-on-a-chip design and verification methodology, processor, and graphics architectures for high-speed and/or low-power applications, including mobile video codec. He was a Visiting Professor with the University of Karlsruhe, Karlsruhe, Germany, in 1989, as an Alexander von Humboldt Fellow, a Visiting Professor with the University of Tokyo, Tokyo, Japan, from January 1985 to February 1985, with the Technical University of Munich, Munich, Germany, from July 1994 to August 1994, with Waseda University, Kyushu, Japan, from 2002 to 2005, with the University of Auckland, Auckland, New Zealand, from February 2004 to February 2005, and with Chuo University, Tokyo, from July 2005 to August 2005. He is the Director of the Integrated Circuit (IC) Design Education Center, Daejeon, established in 1995 to promote the IC design education in Korean universities through computer-aided design environment setup, and chip fabrication services. He is the Director of the SoC Initiative for Ubiquity and Mobility Research Center, Daejeon, established to promote academia/industry collaboration in the SoC design-related area. From 1993 to 1994, he served as an Asian Representative in the International Conference on the Computer-Aided Design Executive Committee.

Dr. Kyung received the Most Excellent Design Award and the Special Feature Award from the University Design Contest in the ASP-DAC 1997 and 1998, respectively. He received the Best Paper Awards at the 36th DAC, New Orleans, LA, the 10th International Conference on Signal Processing Application and Technology, Orlando, FL, in September 1999, and the 1999 International Conference on Computer Design, Austin, TX. He was the General Chair of the Asian Solid-State Circuits Conference 2007, and ASP-DAC 2008. In 2000, he received the National Medal from the Korean Government for his contribution to research and education in IC design. He is a member of the National Academy of Engineering Korea and the Korean Academy of Science and Technology. He is a Hynix Chair Professor with KAIST.
