Fast Block Motion Estimation Using Gray‐Code Kernels
Yair Moshe
THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE MASTER DEGREE
University of Haifa
Faculty of Social Sciences
Department of Computer Science
November, 2007
Fast Block Motion Estimation Using Gray‐Code Kernels
By: Yair Moshe
Supervised By: Dr. Hagit Hel‐Or
THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE MASTER DEGREE
University of Haifa Faculty of Social Sciences
Department of Computer Science
November, 2007
Approved by: ____________________________ Date: ___________________
(Supervisor)
Approved by: ____________________________ Date: ___________________
(Chairperson of M.A. Committee)
Acknowledgment
To be written.
Table of Contents
ABSTRACT ............................................................................................................................................... IV
LIST OF FIGURES AND TABLES ................................................................................................................. V
1. INTRODUCTION .................................................................................................................... 1
1.1. FUNDAMENTALS OF VIDEO COMPRESSION ................................................................. 1
1.2. FAST MOTION ESTIMATION TECHNIQUES .................................................................... 8
1.3. ORGANIZATION OF THESIS .......................................................................................... 14
2. FAST PATTERN MATCHING USING WALSH‐HADAMARD PROJECTION KERNELS ................ 15
3. THE GRAY‐CODE KERNELS .................................................................................................. 20
4. THE FME‐GCK ALGORITHM ................................................................................................ 28
5. COMPLEXITY ANALYSIS ...................................................................................................... 33
6. FME‐GCK RESULTS ............................................................................................................. 36
6.1. SIMULATION RESULTS ................................................................................................ 36
6.2. VIDEO ENCODING RESULTS ........................................................................................ 46
7. AN ADAPTIVE FME‐GCK ..................................................................................................... 55
8. ADAPTIVE FME‐GCK RESULTS ............................................................................................ 61
9. CONCLUSION ..................................................................................................................... 63
BIBLIOGRAPHY ....................................................................................................................................... 64
Fast Block Motion Estimation Using Gray‐Code Kernels
Yair Moshe
ABSTRACT
Motion estimation plays an important role in modern video coders. In such coders, motion is
estimated using a block matching algorithm that estimates the amount of motion on a block‐
by‐block basis. A full search technique for finding the best matching blocks delivers good
accuracy but is usually not practical because of its high computational complexity. In this
dissertation, a novel fast block‐based motion estimation algorithm is proposed. This
algorithm uses an efficient projection framework which bounds the distance between a
template block and candidate blocks. Fast projection is performed with a family of highly
efficient filter kernels – the Gray‐Code Kernels – using only 2 operations per pixel for each
filter kernel. The projection framework is combined with a rejection scheme which allows
rapid rejection of candidate blocks that are distant from the template block. The tradeoff
between computational complexity and quality of results can be easily controlled in the
proposed algorithm, enabling adaptivity to image content to further improve the
results. Experiments show that the proposed adaptive algorithm significantly outperforms
popular fast motion estimation algorithms, such as three‐step search and diamond search.
List of Figures and Tables
Figure 1: Classical hybrid DPCM‐transform video encoding scheme. ........................................ 4
Figure 2: Block matching algorithm for motion estimation. ...................................................... 5
Figure 3: Frame prediction using block‐based motion estimation. ........................................... 5
Figure 4: Three step search. ....................................................................................................... 9
Figure 5: Diamond search. ....................................................................................................... 10
Figure 6: A pyramid image structure. ...................................................................................... 12
Figure 7: Projection of $d$ onto vector $u$ produces lower bound on distance $d_E$. ......... 16
Figure 8: The projection vectors of the WHT of order. ............................................................ 19
Figure 9: The set of Gray‐Code Kernels and their recursive definition visualized as a binary tree. ...... 21
Figure 10: Efficient filtering using GCK. .................................................................................... 23
Figure 11: GCK with initial vector creates the WH basis set. ................................................... 24
Figure 12: Extension of GCK to two dimensions. ..................................................................... 25
Figure 13: ‘Snake’ ordering of WH kernels. .............................................................................. 26
Figure 14: Increasing frequency ordering of WH kernels. ....................................................... 27
Figure 15: The FME‐GCK algorithm. ......................................................................................... 30
Figure 16: Image padding for rapid boundary calculation. ...................................................... 31
Figure 17: Motion information as overlaid arrows. ................................................................. 37
Figure 18: Effect of different values of the parameter on motion estimation accuracy. ........ 39
Figure 19: FME‐GCK motion estimation accuracy vs. three‐step motion estimation accuracy. ...... 40
Figure 20: FME‐GCK motion estimation accuracy relative to the optimal results. .................. 41
Figure 21: Effect of different values of the parameter on motion estimation accuracy. ........ 44
Figure 22: Effect of size of the search area on motion estimation accuracy. .......................... 45
Figure 23: FME‐GCK rate‐distortion video encoding results for Container QCIF. .................... 47
Figure 24: FME‐GCK rate‐distortion video encoding results for Silent Voice QCIF. ................. 48
Figure 25: FME‐GCK rate‐distortion video encoding results for Foreman QCIF. ..................... 49
Figure 26: FME‐GCK rate‐distortion video encoding results for Paris CIF. ............................... 50
Figure 27: FME‐GCK rate‐distortion video encoding results for Foreman CIF. ........................ 51
Figure 28: FME‐GCK rate‐distortion video encoding results for Tempete CIF. ........................ 52
Figure 29: FME‐GCK rate‐distortion video encoding results for Mobile CIF. ........................... 53
Figure 30: Size of residual signal using FME‐GCK with constant and different values of. ....... 57
Figure 31: Size of residual signal using FME‐GCK with constant and different values of. ....... 58
Figure 32: Adaptive FME‐GCK results (QCIF resolution). ......................................................... 62
Figure 33: Adaptive FME‐GCK results (CIF resolution). ............................................................ 62
Table 1: Video sequences used for simulation experiments. .................................................. 36
Table 2: Video sequences used for video coding experiments. .............................................. 46
1. Introduction
1.1. Fundamentals of Video Compression
This chapter gives a short overview of video compression fundamentals, not intended to be
a complete overview of this topic. Details that are irrelevant to the rest of the discussion are
intentionally ignored or only briefly introduced. For more details regarding video
compression the reader is referred to [1‐6].
Digital video is a representation of a natural visual scene sampled spatially and temporally. It
is a sequence of images, called frames, displayed at a certain frame rate to create the illusion
of animation. This rate, as well as the image size and pixel depth, depends heavily on the
application [1]. Even a very economical application, such as video streaming to a cellular phone,
might generate 15 fps (frames per second) with QCIF (176 × 144) image size and with bit
depth of 12 bits per pixel. This results in 15 x 176 x 144 x 12 = 4,561,920 bps (bits per
second). However, available bandwidth for this application is smaller by two orders of
magnitude. This situation is similar for most video applications, so significant bit rate
reduction is a necessary requirement.
Digital video compression has become an essential part of modern multimedia systems since
it enables significant bit rate reduction of the video signal for transmission or storage. Video
compression is normally lossy, namely the decompressed video sequence differs from the
original, but is 'close enough' to be useful in many applications. The goal of a video
compression algorithm is to achieve efficient compression while minimizing the distortion
introduced by the compression process.
A video coder compacts a digital video sequence by decreasing redundancies, namely
components that are not necessary for faithful reproduction of the data:
• Spatial redundancy – Neighboring pixels of a frame are statistically correlated. Most of
the intensity values within an image change continuously from pixel to pixel.
• Temporal redundancy ‐ In a video sequence, the difference between consecutive frames
is small. This is true since natural video scenes typically involve smooth camera or object
motion and since the time interval between two consecutive frames is relatively short.
• Psychovisual redundancy – The visual perception of the human visual system is not
uniformly sensitive to the information contained in a video sequence. For example, it is
more sensitive to low spatial frequencies than to high spatial frequencies. Another
example is that it is more sensitive to changes in luma (intensity) than to changes in
chroma (color).
• Statistical redundancy – In a video stream, some data symbols may appear more
frequently than others.
Various techniques might be used for reducing these redundancies. Spatial redundancy is
reduced by predictive coding, predicting a pixel from its neighbors, and by transform coding.
Psychovisual redundancies are mainly reduced by careful quantization. Statistical
redundancy is reduced by entropy coding. One form of entropy coding is variable length
coding in which symbols with a higher occurrence probability are encoded with shorter
lengths.
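To make the idea of variable length coding concrete, the following Python sketch (an illustration only; names and the toy input are assumptions, not part of this thesis) builds a Huffman-style prefix code in which symbols with higher occurrence probability receive shorter codewords.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code where frequent symbols get shorter codewords."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    # Each heap entry: (frequency, unique tie-breaker, {symbol: codeword-so-far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)   # two least frequent subtrees
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (f0 + f1, tie, merged))
        tie += 1
    return heap[0][2]

code = huffman_code("aaaaaaabbbccd")
# 'a' occurs most often, so its codeword is no longer than any other
assert all(len(code["a"]) <= len(code[s]) for s in code)
```

Decoding relies on the prefix property: no codeword is a prefix of another, so a bit stream can be parsed unambiguously.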
Reduction of temporal redundancy of a video sequence is the most significant of these
redundancies. Changes between consecutive frames are attributed to the translation of
moving objects in the image. It is thus imperative to apply a process of estimating motion
vectors (displacement vectors) from one frame to the previous. This is referred to as motion
estimation. Translational motion estimation is a very simple model. It cannot accommodate
motions other than translation, such as rotation or camera zooming. Occlusion and
disocclusion of objects, together with lighting changes and various noise artifacts existing in
the frames, complicate the situation even further. Therefore, in order to attain good‐quality
frames in the receiver, coding of the residual (prediction error) is necessary. Differential
signals between the intensity values in the current frame and those of their counterparts in
the previous frame, translated by the estimated motion vectors, are encoded. By adding the
transmitted residual frame to the predicted frame, the decoder can reconstruct the current
frame with satisfactory quality. This reconstruction process is referred to as
motion compensation. Through appropriate manipulations, the total amount of data for
both the motion vectors and residual is expected to be much less than the raw data existing
in the image frames, thus resulting in significant data compression [2].
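The prediction/residual round trip described above can be sketched in a few lines of NumPy. All names, sizes, and the motion vector are illustrative assumptions; since no quantization is applied here, the reconstruction is exact.

```python
import numpy as np

# Toy single-block example of motion-compensated prediction and reconstruction.
rng = np.random.default_rng(0)
prev = rng.integers(0, 256, (32, 32)).astype(np.int16)   # reference frame
mv = (3, 5)                                              # estimated motion vector (dy, dx)

# "Current" block: the reference content displaced by mv, plus a small change.
cur_block = prev[8 + mv[0]:24 + mv[0], 8 + mv[1]:24 + mv[1]] + 2

pred = prev[8 + mv[0]:24 + mv[0], 8 + mv[1]:24 + mv[1]]  # motion-compensated predictor
residual = cur_block - pred                              # encoded and transmitted
recon = pred + residual                                  # decoder-side reconstruction
assert np.array_equal(recon, cur_block)
```

In a real coder the residual is transformed and quantized before transmission, so the reconstruction would only approximate the current block.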
In order to encourage interworking and competition, it has been necessary to define
standard methods of video encoding and decoding to allow products from different
manufacturers to communicate effectively. This has led to the development of several key
international standards for video compression, including the MPEG and H.26x series of
standards. Compression involves a complementary pair of systems, an encoder and a
decoder. The encoder converts the source data into a compressed form, occupying a
reduced number of bits, prior to transmission or storage, and the decoder converts the
compressed form back into a representation of the original video. The standards do not
define an encoder; rather, they define the output that an encoder should produce. A
decoding method is defined in each standard but manufacturers are free to develop
alternative decoders as long as they achieve results in accord with the standard [4]. Most of
the key international standards for video coding, including MPEG‐1 [7], MPEG‐2 [8], MPEG‐4
[9], H.261 [10], H.263 [11], and H.264/MPEG‐4 AVC [12], share the same hybrid DPCM‐
transform model, that will now be briefly described.
Figure 1 shows a classical hybrid DPCM‐transform video encoding scheme. The video
sequence is divided into GOPs (groups of pictures). The first frame of every GOP is an intra
frame and every other frame in the GOP is an inter frame. Intra frames are self‐sufficient and
are coded independently of previous frames. They are used as anchors for temporal
prediction. Inter frames are coded using motion‐compensated prediction from the previous
frame, which could be either an intra or an inter frame. The algorithm processing the frames
of a video sequence is block‐based. A video frame is divided into nonoverlapping rectangular
blocks, called macroblocks, each of the same size, usually 16 × 16
pixels. Each macroblock is divided into smaller equal‐size regions, called blocks.
Blocks of an intra macroblock are transformed, quantized, and entropy coded. The purpose
of the transform is to decorrelate the picture content. Quantization reduces the number of
bits to encode by adaptively weighting the transform coefficients according to the human
visual system sensitivity. Entropy coding assigns longer code words for symbols with lower
probability.
Figure 1: Classical hybrid DPCM‐transform video encoding scheme.
For inter macroblocks, a more complicated process that involves motion estimation and
motion compensation is used. The most frequently used technique in motion estimation for
video coding is the block matching algorithm (BMA). Each macroblock is assumed to move as
one, that is, all pixels in a macroblock share the same motion vector. As illustrated in Figure
2, a template macroblock in the current frame is compared to candidate blocks in a search
area, usually centered on the current macroblock position. The candidate block that
minimizes a matching criterion is chosen as ‘best match’ and used as a predictor. The relative
position of each template macroblock in the current frame and its best match in the
previous frame produce a motion vector. The selected best matching region in the reference
frame is subtracted from the current macroblock to produce a residual (difference)
macroblock that is transformed, quantized, entropy coded, and transmitted together with
the motion vector. Figure 3 shows an example of frame prediction using motion
compensation.
Figure 2: Block matching algorithm for motion estimation.
(a) Frame N−1. (b) Frame N. (c) Frame N with superimposed motion vectors. (d) Residual image – subtraction of the Nth motion compensated residual frame from frame N. Mean gray represents zero while brighter and darker intensities represent higher residual values (after contrast enhancement).
Figure 3: Frame prediction using block‐based motion estimation.
Various distortion measures could be used for finding the best match for a macroblock in the
motion estimation process. Mean Squared Error (MSE) provides a measure of the energy
remaining in the difference macroblock. MSE for a $k \times k$-sample macroblock can be
calculated as follows:

$$MSE = \frac{1}{k^2} \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} \left( C_{ij} - R_{ij} \right)^2 \qquad (1.1)$$

where $C_{ij}$ is a sample of the template macroblock and $R_{ij}$ is a sample of the candidate block.
Mean Absolute Error (MAE) provides a reasonably good approximation of residual energy
and is easier to calculate than MSE, since it requires a magnitude calculation instead of a
square calculation for each pair of samples:

$$MAE = \frac{1}{k^2} \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} \left| C_{ij} - R_{ij} \right| \qquad (1.2)$$

The comparison may be simplified further by neglecting the $\frac{1}{k^2}$ term and simply calculating
the Sum of Absolute Differences (SAD):

$$SAD = \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} \left| C_{ij} - R_{ij} \right| \qquad (1.3)$$
SAD gives a reasonable approximation to block energy and so Equation (1.3) is a commonly
used matching criterion for block‐based motion estimation [4].
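A minimal NumPy sketch of the SAD and MAE criteria of Equations (1.2) and (1.3); the function names and toy blocks are illustrative assumptions.

```python
import numpy as np

def sad(c, r):
    """Sum of Absolute Differences (Eq. 1.3) between template c and candidate r."""
    return int(np.abs(c.astype(np.int32) - r.astype(np.int32)).sum())

def mae(c, r):
    """Mean Absolute Error (Eq. 1.2): SAD normalized by the number of samples."""
    return sad(c, r) / c.size

c = np.array([[10, 20], [30, 40]])
r = np.array([[12, 18], [30, 45]])
assert sad(c, r) == 9        # |-2| + |2| + |0| + |-5|
assert mae(c, r) == 9 / 4
```

Casting to a wider integer type before subtracting avoids overflow when the blocks are stored as 8-bit samples.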
Lately, a new distortion measure for motion estimation has been proposed – the Sum of
Absolute Transformed Differences (SATD). This measure takes a frequency transform, usually
a Hadamard transform, of the differences between the pixels in the template macroblock
and the corresponding pixels in a candidate block:
$$D_{ij} = C_{ij} - R_{ij}, \qquad SATD = \frac{1}{2} \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} \left| \left( H D H \right)_{ij} \right| \qquad (1.4)$$

where $H$ is the kernel matrix of the Hadamard transform and $D$ is the matrix of differences. The constant $\frac{1}{2}$ can, of course, be
neglected. SATD is considerably slower than SAD but it more accurately predicts quality
from the viewpoints of both objective and subjective metrics. Therefore it is used in the
H.264 reference model software [13, 14], as well as in other new video encoders.
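The SATD computation of Equation (1.4) can be sketched as follows, assuming a 4 × 4 unnormalized Hadamard kernel built by the Sylvester construction (an assumption for illustration; block and kernel sizes vary between encoders).

```python
import numpy as np

H2 = np.array([[1, 1], [1, -1]])
H = np.kron(H2, H2)  # 4x4 unnormalized Hadamard kernel

def satd(c, r):
    """Sum of Absolute Transformed Differences, Eq. (1.4), constant kept."""
    d = c.astype(np.int64) - r.astype(np.int64)   # difference matrix D
    return int(np.abs(H @ d @ H).sum()) // 2

c = np.arange(16).reshape(4, 4)
assert satd(c, c) == 0        # identical blocks -> zero distortion
assert satd(c, c + 3) == 24   # a constant offset maps entirely to the DC coefficient
```

Because a constant difference concentrates in a single transform coefficient, SATD penalizes structured differences less than scattered ones, which is part of why it tracks perceived quality better than SAD.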
Frame reconstruction is the reverse process that starts with entropy decoding. For intra
macroblocks, each quantized block is then rescaled and inverse transformed to produce a
reconstructed macroblock. Note that the nonreversible quantization process implies that the
reconstructed macroblock is not identical to the original. For inter macroblocks, each
quantized block is rescaled and inverse transformed to produce a decoded residual. The
motion compensated prediction is added to the residual to produce a reconstructed
macroblock. The reconstructed frame is stored so it may be used as a reference frame for
the next encoded frame. This is necessary to ensure the encoder and decoder use identical
reference frames for motion compensated prediction.
There are many variations on the basic motion estimation and compensation process. The
reference frame may be a previous frame, a future frame or a combination of predictions
from two or more previously encoded frames. If a future frame is chosen as the reference, it
is necessary to encode this frame before the current frame (that is, frames must be encoded
out of order). Moving objects in a video scene rarely follow 16×16‐pixel boundaries and so it
may be more efficient to use a variable block size for motion estimation and compensation.
Objects may move by a fractional number of pixels between frames so a better prediction
may be obtained by interpolating the reference frame to sub‐pixel positions before
searching these positions for the best match [4]. In this dissertation, we ignore these
variations for the sake of simplicity. This does not reduce the generality of the proposed
solution since it could be readily extended to support these variations.
1.2. Fast Motion Estimation Techniques
Motion estimation, although efficient in reducing temporal redundancy, incurs high
computational complexity. A brute force technique for finding the best matching region
within the search area in the reference frame is called full search. It is performed by
comparing all candidate blocks in the search area with the template macroblock. Full search
is usually impractical for real‐time applications due to the large number of comparisons
required. Measurements of video encoders' complexity using full search motion estimation
show that motion estimation comprises about 50%–90% of the overall encoder's complexity
[15]. So, many alternative ‘fast search’ motion estimation algorithms have been proposed in
the literature. A fast and accurate block matching algorithm is a critical part of every
practical video coder with significant impact on coding efficiency. According to [15], the main
concepts of these fast algorithms can be classified into six categories: reduction in search
positions, predictive search, simplification of matching criterion, bitwidth reduction,
hierarchical search, and fast full search.
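For reference, a minimal full search implementation with the SAD criterion might look as follows; the function name, block size, and search range are illustrative assumptions.

```python
import numpy as np

def full_search(cur, ref, top, left, bs=16, sr=7):
    """Exhaustive block matching: test every candidate in a (2*sr+1)^2 search area."""
    block = cur[top:top + bs, left:left + bs].astype(np.int32)
    best, best_mv = None, (0, 0)
    for dy in range(-sr, sr + 1):
        for dx in range(-sr, sr + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + bs > ref.shape[0] or x + bs > ref.shape[1]:
                continue  # candidate falls outside the reference frame
            cand = ref[y:y + bs, x:x + bs].astype(np.int32)
            cost = int(np.abs(block - cand).sum())  # SAD criterion
            if best is None or cost < best:
                best, best_mv = cost, (dy, dx)
    return best_mv, best

# A scene shifted by (2, 3) should be recovered exactly with zero residual energy.
rng = np.random.default_rng(1)
ref = rng.integers(0, 256, (64, 64))
cur = np.roll(np.roll(ref, -2, axis=0), -3, axis=1)
mv, cost = full_search(cur, ref, 24, 24)
assert mv == (2, 3) and cost == 0
```

The nested loops make the cost per macroblock proportional to the search-area size, which is exactly the complexity the fast algorithms below try to avoid.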
The most popular category is the reduction in search positions. These algorithms reduce
search complexity by limiting the number of candidate blocks. In doing so, they use the
assumption that the matching error monotonically increases with the distance from the
search position with the minimum distortion (the optimal motion vector). This assumption is
not always valid and the process may converge to a local minimum on the error surface
rather than to the global minimum as in the full search algorithm. Well‐known algorithms in
this category are the two‐dimensional logarithmic search [16], three‐step search [17], four‐
step search [18], cross search [19], diamond search [20], and center‐biased diamond search
[21]. Diamond search based algorithms achieve significantly better speed and quality than
earlier fast algorithms; they are about 30‐100 times faster than full search with only a 0.3‐
3 dB drop in PSNR [15]. However, due to its simplicity, three‐step search is
still commonly used.
In three‐step search [17], the first step computes the matching criteria for nine points in the
search area (see Figure 4). Of these nine points, the one corresponding to the minimum
matching error is selected. In the next step, another set of nine points are chosen
surrounding the selected point in a similar fashion to the first step, with the distances
between the nine points reduced by half. The third and final step is similar with a set of
candidate points located in an even smaller grid. Figure 4 demonstrates this procedure.
Figure 4: Three step search. Points (i+4,j+4), (i+6,j+4), and (i+7,j+5) give the minimum distortion in steps 1, 2, and 3, respectively.
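The three steps above can be sketched as follows, using SAD as the matching criterion; the names and the toy test scene are assumptions for illustration.

```python
import numpy as np

def sad(a, b):
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def three_step_search(cur, ref, top, left, bs=16):
    """Three-step search: 9 candidates per step, step size halved each time (4, 2, 1)."""
    block = cur[top:top + bs, left:left + bs]
    cy, cx = 0, 0  # current best displacement
    for step in (4, 2, 1):
        best = None
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                y, x = top + cy + dy, left + cx + dx
                if y < 0 or x < 0 or y + bs > ref.shape[0] or x + bs > ref.shape[1]:
                    continue
                cost = sad(block, ref[y:y + bs, x:x + bs])
                if best is None or cost < best:
                    best, move = cost, (dy, dx)
        cy, cx = cy + move[0], cx + move[1]  # recenter on the step's winner
    return cy, cx

# A bright square moved up-left between frames; its content in the current
# frame is found at displacement (+3, +5) in the reference.
ref = np.zeros((64, 64), dtype=np.uint8)
ref[30:40, 30:40] = 200
cur = np.zeros_like(ref)
cur[27:37, 25:35] = 200
mv = three_step_search(cur, ref, 24, 24)
assert mv == (3, 5)
```

Only 25 candidates are evaluated (9 + 8 + 8), against 225 for a full search over the same ±7 range.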
Diamond search [20] employs two search patterns ‐ a large diamond search pattern and a
small diamond search pattern. The large pattern comprises of nine sampled points forming a
diamond shape. The small pattern comprises of five sampled points forming a smaller
diamond shape. In the first stage, the large search pattern is used repeatedly until the
minimum matching error occurs at the center point of the diamond pattern. The search
pattern is then replaced with the small search pattern for the second search stage. Of the
five sampled points in this stage, the position yielding the minimum matching error is
selected as the best matching block. Figure 5 demonstrates this procedure.
Figure 5: Diamond search. Points (i+1,j+1), (i+1,j+3), and again (i+1,j+3) give the minimum distortion in the first step, and point (i+3,j+3) gives the minimum distortion in the second and final step.
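The two-stage procedure can be sketched as follows; the pattern offsets and names are illustrative, and a real implementation would avoid re-evaluating already-tested points.

```python
import numpy as np

LARGE = [(0, 0), (-2, 0), (2, 0), (0, -2), (0, 2), (-1, -1), (-1, 1), (1, -1), (1, 1)]
SMALL = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]

def sad(a, b):
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def diamond_search(cur, ref, top, left, bs=16):
    """Large diamond pattern until the center wins, then one small-diamond refinement."""
    block = cur[top:top + bs, left:left + bs]

    def cost(dy, dx):
        y, x = top + dy, left + dx
        if y < 0 or x < 0 or y + bs > ref.shape[0] or x + bs > ref.shape[1]:
            return None
        return sad(block, ref[y:y + bs, x:x + bs])

    cy, cx = 0, 0
    while True:  # first stage: repeat the large diamond until the center is best
        cand = [(cost(cy + dy, cx + dx), cy + dy, cx + dx) for dy, dx in LARGE]
        best = min(c for c in cand if c[0] is not None)
        if (best[1], best[2]) == (cy, cx):
            break
        cy, cx = best[1], best[2]
    # second stage: one pass of the small diamond
    cand = [(cost(cy + dy, cx + dx), cy + dy, cx + dx) for dy, dx in SMALL]
    best = min(c for c in cand if c[0] is not None)
    return best[1], best[2]

ref = np.zeros((64, 64), dtype=np.uint8)
ref[30:40, 30:40] = 200
cur = np.zeros_like(ref)
cur[27:37, 25:35] = 200
mv = diamond_search(cur, ref, 24, 24)
assert mv == (3, 5)
```

Unlike three-step search, the number of large-diamond iterations is unbounded, which lets diamond search follow large motions without committing to a fixed number of refinement steps.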
For video sequences with fast movement, fast search algorithms such as three‐step search
and diamond search perform poorly due to the frequent failure of the monotonically
increasing distortion model assumption. Predictive motion estimation [22, 23] utilizes the
motion information in the spatial and/or temporal neighboring macroblocks to form an
initial estimate of the current motion vector, thus it can effectively reduce the search area as
well as the computation. The motion vector predictors can be taken as the median or the
actual values of neighboring macroblocks on the left, top, and top right. Zero motion
vectors, or the motion vectors of the collocated macroblocks in the previous frame, may also
be used.
Another approach for fast motion estimation is to speed up the calculation of matching error
for each candidate block. This is usually achieved by subsampling the pixels in the template
and candidate blocks. Aliasing effects can be avoided by using low‐pass filtering or by
periodic alternation of different subsampling patterns [24, 25]. Finding the optimal match
with minimum matching error using this technique is not guaranteed. It may be combined
with the former two techniques to limit the number of search positions and to predict the
current motion vector.
Bitwidth reduction is a fast motion estimation technique that is rarely used and has some
relative advantages only for specific hardware configurations. Details of this approach could
be found in [15].
Another approach uses a multiresolution structure, also known as a pyramid structure,
which is a powerful computational configuration for image processing tasks. An example of
this structure is shown in Figure 6. Pyramids of the image frames are constructed by
successive two‐dimensional filtering and subsampling of the current and past image frames.
In this hierarchical search, conventional block matching, either full search or any fast
method, is first applied to the highest level of the pyramid. This motion vector is further
refined in the following levels [26]. The search area at the finer levels can be much smaller
than the original search range. Similar to the previously described approach, this technique
also has the disadvantage of possibly being trapped in a local minimum. In spite of this fact,
it has been regarded as one of the most efficient methods for motion estimation with very
large frames and search areas. In [15] it is reported to be about 10‐30 times faster with
about 0.2‐0.5 dB drop in PSNR compared to full search.
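A pyramid of the kind shown in Figure 6 can be built by repeated 2 × 2 mean filtering and subsampling; the following sketch is a minimal illustration, and the choice of a simple mean filter is an assumption.

```python
import numpy as np

def build_pyramid(frame, levels=3):
    """Successive 2x2-mean filtering and subsampling (a simple image pyramid)."""
    pyr = [frame.astype(np.float64)]
    for _ in range(levels - 1):
        f = pyr[-1]
        h, w = (f.shape[0] // 2) * 2, (f.shape[1] // 2) * 2  # trim odd edges
        coarse = (f[0:h:2, 0:w:2] + f[1:h:2, 0:w:2]
                  + f[0:h:2, 1:w:2] + f[1:h:2, 1:w:2]) / 4.0
        pyr.append(coarse)
    return pyr

pyr = build_pyramid(np.ones((64, 64)), levels=3)
assert [p.shape for p in pyr] == [(64, 64), (32, 32), (16, 16)]
```

A motion vector found at a coarse level is scaled by two at each finer level and then refined within a small local search area.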
A different approach for fast motion estimation is to use some matching criteria to rule out
search positions while ensuring the global minimal matching error is still attained. First, a
simple test is performed to determine the candidate blocks that are possibly the optimal
one. Then, only these blocks are further processed with more precise distortion calculations.
Using an appropriate test, many search positions are determined as suboptimal and can be
excluded from being further considered in the motion vector search. Thus, search
complexity is reduced.
Figure 6: A pyramid image structure.
One well‐known example of this approach is the successive elimination algorithm [27] that
eliminates impossible candidate blocks by testing whether the absolute difference between
template macroblock pixel sum and candidate block pixel sum is larger than the up‐to‐date
minimum SAD. The sum of all pixels in the template macroblock only has to be computed once,
and the sum of all pixels in a candidate block can be computed efficiently by exploiting
common sums. Another example of this approach is the block sum pyramid [28]. This
algorithm constructs, for each macroblock, the same pyramid structure described earlier.
Successive elimination is then performed hierarchically from the top level to the bottom
level of the pyramid. An improvement of this algorithm based on a winner‐update strategy is
presented in [29]. An ascending list of lower bounds on the matching error for each search
position is maintained. Computation of the matching error can be avoided if one of its lower
bounds is larger than the global minimum matching error. The algorithm computes the
lower bounds only when the previous lower bounds in the same list are smaller than the
global minimum matching error. In [15] the winner‐update strategy is reported to be about
10 times faster than full search (thus 3‐10 times slower than diamond search).
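The core rejection test of the successive elimination algorithm can be sketched as follows. This is a simplified, single-level illustration with assumed names; a real implementation computes candidate sums incrementally by exploiting common sums between overlapping windows.

```python
import numpy as np

def sea_prune(block, candidates, exact_sad):
    """Skip a candidate when the cheap bound |sum(block) - sum(cand)|
    already exceeds the best SAD found so far (a valid lower bound on SAD)."""
    block_sum = int(block.sum())
    best_sad, best_idx, tested = None, None, 0
    for idx, cand in enumerate(candidates):
        bound = abs(block_sum - int(cand.sum()))  # lower bound on SAD
        if best_sad is not None and bound >= best_sad:
            continue  # rejected without computing the full SAD
        tested += 1
        s = exact_sad(block, cand)
        if best_sad is None or s < best_sad:
            best_sad, best_idx = s, idx
    return best_idx, best_sad, tested

def sad(a, b):
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

block = np.zeros((8, 8), dtype=np.int32)
cands = [block + 1, np.full((8, 8), 255), block.copy()]
idx, best, tested = sea_prune(block, cands, sad)
# The all-255 candidate is rejected by the bound alone; only 2 SADs are computed.
assert (idx, best, tested) == (2, 0, 2)
```

The bound is valid because the triangle inequality gives |Σb − Σc| ≤ Σ|b − c| = SAD, so no candidate that could still be optimal is ever discarded.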
Orthogonal transforms have also been shown to be useful for block motion estimation.
However, only very few algorithms using the Walsh‐Hadamard Transform (WHT) for block
motion estimation have been proposed in the literature. In [30] a hierarchical motion
estimation algorithm is proposed in which the SAD of the WHT coefficients is used as a
distortion measure in four search levels. This algorithm is reported to be 17 times faster than
full search with < 0.1 dB drop in PSNR on average.
In [31] a fast full search algorithm using the MSE is proposed. The MSE is calculated using the
WHT coefficients with the lowest coefficients first. An early termination criterion based on
the successive elimination algorithm [27] is used for early exclusion of impossible
candidates. Efficient calculation of the transform coefficients is performed based on the
overlapping nature of search regions, using approximately 2 operations per pixel per
transform coefficient. This result is similar to ours but is less general and suffers from
constants that degrade the overall algorithm performance significantly. Our algorithm also
has some important advantages compared to [31], as will be described in detail later. This
algorithm is reported to be 16‐26 times faster than full search for 16x16 macroblocks.
Very recently, another fast motion estimation algorithm using the WHT has been proposed
[32]. This algorithm uses a winner‐update strategy [29] in the Walsh Hadamard (WH)
domain together with a simple scheme for predictive motion estimation. The algorithm
requires about 65 times fewer operations than full search with < 0.25 dB drop in PSNR. It is
difficult to compare the performance of this algorithm to other competing algorithms since it
considerably benefits from its predictive motion estimation scheme, which can be combined
with other algorithms.
1.3. Organization of Thesis
This dissertation is organized as follows: Fast pattern matching algorithms using WH
projection kernels and Gray‐Code Kernels (GCK) are first described in Chapter 2 and Chapter
3, respectively. The proposed fast block motion estimation algorithm, based on these fast
pattern matching algorithms, is presented in Chapter 4. Complexity analysis and results are
given in Chapter 5 and Chapter 6, respectively. The proposed algorithm is further refined for
adaptivity in Chapter 7. Adaptive algorithm results are given in Chapter 8. Finally,
conclusions are drawn in Chapter 9.
2. Fast Pattern Matching Using Walsh‐Hadamard Projection Kernels
The block motion estimation problem is a variant of the pattern matching problem. In this
chapter a novel pattern matching technique, suggested in [33, 34], is presented. The
suggested approach uses an efficient WH projection scheme which bounds the distance
between a pattern and an image window using very few operations on average. The
projection framework is combined with a rejection scheme which allows rapid rejection of
image windows that are distant from the pattern.
The pattern matching problem involves finding a particular pattern in an image where the
pattern is usually much smaller than the image. This can be performed naively by scanning
the entire image and evaluating the similarity between the pattern and a local 2D window
about each pixel. Assume a 2D pattern p(i, j) of size k x k is to be matched within an image
I(x, y) of size n x n. For each pixel location (x, y) in the image, the Euclidean distance may
be calculated:
d_E(p, I_{x,y}) = \left\{ \sum_{i,j=0}^{k-1} \left( I(x+i, y+j) - p(i,j) \right)^2 \right\}^{1/2}    (2.1)
where I_{x,y} denotes a local k x k window of I at coordinates (x, y). In the context of motion
estimation, this procedure is equivalent to full search block matching of a k x k template
block to a set of candidate blocks in a search area with the MSE criterion.
Referring to the pattern and window as vectors p, w in R^{k^2}, d = p - w is the difference
vector between p and w. The Euclidean distance can then be rewritten in vectorial form:
d_E(p, w) = \sqrt{d^T d} = \|d\|    (2.2)
Now assume, as illustrated in Figure 7, that p and w are not given but only the values of
their projection onto a vector u. Let
b = u^T d = u^T p - u^T w    (2.3)
be the projected distance value. Since the Euclidean distance is a norm, it follows from the
Cauchy-Schwarz inequality that a lower bound on the actual Euclidean distance can be
inferred from the projection values. Using the Cauchy-Schwarz inequality for norms, it follows
that:
\|u\| \, \|d\| \geq |u^T d|    (2.4)
This implies:
d_E(p, w) = \|p - w\| \geq \frac{|u^T (p - w)|}{\|u\|} = \frac{|u^T p - u^T w|}{\|u\|} = \frac{|b|}{\|u\|}    (2.5)
and:
d_E^2(p, w) \geq b^2 / \|u\|^2    (2.6)
Figure 7: Projection of p and w onto vector u produces a lower bound on the distance d.
If a collection of projection vectors u_1 ... u_m is given along with the corresponding
projected distance values b_1 ... b_m, the lower bound on the distance can then be tightened
(see [34] for details):
d_E^2(p, w) \geq b^T (U^T U)^{-1} b = LB_m^2(p, w)    (2.7)
where U = [u_1 ... u_m] and b = (b_1 ... b_m)^T so that b = U^T d. As the number of projection
vectors increases, the lower bound LB_m(p, w) becomes tighter. In the
extreme case, when the rank of U equals k^2, the lower bound reaches the Euclidean
distance.
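As a sanity check on equations (2.3)-(2.7), the following Python sketch (vector values chosen arbitrarily for illustration; the projection vectors are the orthonormal WH basis of order 4) accumulates projections one at a time and verifies that the lower bound only tightens and reaches the Euclidean distance at full rank:

```python
import math

# Pattern p and window w as flat vectors (arbitrary illustrative values).
p = [4.0, 1.0, 3.0, 2.0]
w = [1.0, 0.0, 2.0, 5.0]
d = [pi - wi for pi, wi in zip(p, w)]
d_E = math.sqrt(sum(x * x for x in d))  # true Euclidean distance

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Orthonormal projection vectors (normalized WH vectors of order 4).
h = 0.5
U = [[h, h, h, h], [h, h, -h, -h], [h, -h, -h, h], [h, -h, h, -h]]

# With orthonormal vectors, equation (2.10) gives LB_m^2 = sum of b_i^2,
# so the bound tightens monotonically as projections accumulate.
lb2 = 0.0
for u in U:
    b = dot(u, p) - dot(u, w)   # projected distance value, equation (2.3)
    lb2 += b * b                # iterative update (2.9) with gamma = 1
    assert math.sqrt(lb2) <= d_E + 1e-9   # always a valid lower bound

# At full rank the lower bound reaches the Euclidean distance.
assert abs(math.sqrt(lb2) - d_E) < 1e-9
```

Each new projection only adds a nonnegative term, which is exactly why an early-termination rejection test is safe: a window rejected after few projections would also be rejected by the exact distance.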
An iterative scheme for calculating the lower bound is also possible. Given an additional
projection vector u_{m+1} and projection value b_{m+1}, the previously computed lower bound
can be updated without recalculating the inverse of the entire system U^T U. The
component of u_{m+1} in the kernel of U_m^T is calculated as:
\hat{u}_{m+1} = u_{m+1} - U_m U_m^T u_{m+1}    (2.8)
so that U_m^T \hat{u}_{m+1} = 0. If the projection vectors are orthogonal, an updated lower bound is:
LB_{m+1}^2(p, w) = LB_m^2(p, w) + \gamma \, b_{m+1}^2    (2.9)
where \gamma is a normalizing factor (see [34] for details).
If the projection vectors are also orthonormal, the distance lower bound after m
projections can be reduced to:
LB_m^2(p, w) = b^T (U^T U)^{-1} b = b^T b    (2.10)
and the normalizing factor in equation (2.9), \gamma, can be discarded.
Returning to pattern matching, a window can be determined as being far from the
pattern if the lower bound is above a certain threshold. Windows can thus be rejected as
non-pattern without actually computing the true distance. In this context, since
lower bounds are compared, the true lower bound is also not required. Thus, even if the
projection vectors are orthogonal but not orthonormal, the normalizing factor in equation
(2.9), \gamma, can be discarded.
In order for this approach to be efficient, the projection vectors should be chosen according
to the following two necessary requirements:
• The projection vectors should be likely to be nearly parallel to the difference vector
d = p - w.
• Projections of image windows onto the projection vectors should be fast to compute.
The first requirement implies that, on average, the first few projection vectors produce a
tight lower bound on the pattern-window distance. This, in turn, allows rapid rejection of
image windows that are distant from the pattern. The second requirement arises from the
fact that the projection calculations are performed many times for each window of the
image. Thus, the complexity of calculating the projection plays an important role when
choosing the appropriate projection vectors.
A set of projection vectors shown in [33, 34] to satisfy the above two requirements are the
WH basis vectors. For natural images, these vectors capture a large portion of the pattern‐
window distance with few projections on average. In addition, an efficient method for
calculating the projection values for these vectors was introduced (but not used here).
The Walsh-Hadamard transform has long been used for image representation in
numerous applications [35]. The (nonnormalized) WH basis vectors are
orthogonal and contain only binary values (±1). Thus, computation of the transform requires
only integer additions and subtractions. The WHT of an image window of size k x k (with k a
power of 2) is obtained by projecting the window onto k^2 WH basis vectors. In the case of
pattern matching within an image, it is required to project each k x k window of the
image onto the k^2 vectors. This results in a highly overcomplete image representation.
The projection vectors associated with the 2D WHT of order 8 are shown in Figure 8.
Each basis vector is of size 8x8, where white represents the value +1 and black represents
the value ‐1. In Figure 8, the basis vectors are displayed in order of increasing sequency (the
number of sign changes along rows and columns of the basis vector). The algorithm
suggested in [33, 34] induces an ordering of basis vectors that is not exactly according to
sequency, and it is shown by experiments that this ordering still captures the increase in
spatial frequency.
Figure 8: The projection vectors of the 2D WHT of order 8.
Projection vectors are ordered with increasing spatial frequency. White represents the value +1 and black represents the value -1.
As discussed above, the second critical requirement of the projection vectors is the
efficiency of computation. A method for calculating the projections of all image windows
onto a sequence of WH vectors is discussed in the next chapter. This method, in addition to
being very efficient, does not bind the algorithm to any fixed ordering of basis vectors.
Finally, we note that the projection approach described above deals with the Euclidean
distance; however, it is applicable to any distance measure that forms a norm. The
correctness of the iterative scheme is proved in [33, 34] only for norm-2. In our
case, however, the iterative projection scheme will be used with the well-known SAD (norm-1)
distance measure. This is applicable since, as more projections are performed, a lower bound
on the SATD (see previous chapter) is tightened. In [36], the correctness of the iterative
projection scheme with the SAD as the distance measure is proven.
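The same idea carries over to norm-1: with the (unnormalized) WH vectors, the SATD of a difference vector is the sum of |b_i| over all projections, so partial sums of |b_i| form monotonically tightening lower bounds. A toy Python illustration (arbitrary values, order-4 basis):

```python
# Unnormalized WH basis of order 4 (entries are +1/-1 only).
U = [[1, 1, 1, 1], [1, 1, -1, -1], [1, -1, -1, 1], [1, -1, 1, -1]]

p = [4, 1, 3, 2]                       # pattern (arbitrary values)
w = [1, 0, 2, 5]                       # candidate window
d = [a - b for a, b in zip(p, w)]      # difference vector

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

b = [dot(u, d) for u in U]             # all projected distance values
satd = sum(abs(x) for x in b)          # SATD of the difference vector

# Partial norm-1 sums are monotonically tightening lower bounds on SATD.
lb = 0
for x in b:
    lb += abs(x)
    assert lb <= satd
assert lb == satd                      # tight once all projections are used
```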
3. The Gray‐Code Kernels
In [37, 38] a family of filter kernels – the Gray‐Code Kernels (GCK) – is introduced. Filtering an
image with a sequence of Gray‐Code Kernels is highly efficient and requires only 2
operations per pixel for each filter kernel, independent of the size or dimension of the
kernel. This family of kernels includes the WH kernels among others, thus it enables very
efficient projection onto the WH basis vectors.
Consider first the 1D case where signal and kernels are one-dimensional vectors. Denote by
V_s^{(k)} a set of 1D filter kernels expanded recursively from an initial seed vector s as follows:
V_s^{(0)} = \{ s \}
V_s^{(k)} = \left\{ \left[ v_s^{(k-1)} \;\; \alpha v_s^{(k-1)} \right] \; \middle| \; v_s^{(k-1)} \in V_s^{(k-1)}, \; \alpha \in \{+1, -1\} \right\}    (3.1)
where \alpha v indicates the multiplication of kernel v by the value \alpha and [\cdot \; \cdot] denotes
concatenation.
The set of kernels and the recursive definition can be visualized as a binary tree of depth k.
An example is shown in Figure 9 for k = 3. The nodes of the binary tree at level j represent
the kernels of V_s^{(j)}. The leaves of the tree represent the eight kernels of V_s^{(3)}. The branches
are marked with the values of \alpha used to create the kernels (where +/- indicates +1/-1).
Denote t = |s| the length of s. It is easily shown that V_s^{(k)} is an orthogonal set of 2^k kernels
of length 2^k t. Furthermore, given an orthogonal set of seed vectors s_1 ... s_r, it can be shown
that the union set V_{s_1}^{(k)} ∪ ... ∪ V_{s_r}^{(k)} is orthogonal with r 2^k vectors of length 2^k t. If r = t the
set forms a basis. Figure 9 also demonstrates the fact that the values \alpha_1 ... \alpha_k along the tree
branches uniquely define a kernel in V_s^{(k)}.
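Definition (3.1) translates directly into code. The following sketch builds V_s^(k) for the seed s = [1] and depth 3, and checks that the resulting eight kernels of length 8 are mutually orthogonal:

```python
def expand(kernels):
    # One step of definition (3.1): concatenate each kernel with
    # +1 or -1 times itself.
    return [v + [a * x for x in v] for v in kernels for a in (1, -1)]

def gck_set(seed, k):
    # V_s^(k): 2^k kernels of length 2^k * len(seed), built recursively.
    kernels = [list(seed)]
    for _ in range(k):
        kernels = expand(kernels)
    return kernels

V3 = gck_set([1], 3)   # seed s = [1], depth k = 3
assert len(V3) == 8 and all(len(v) == 8 for v in V3)

# The set is orthogonal: distinct kernels have zero inner product,
# and each kernel has squared norm 8.
for a in V3:
    for b in V3:
        ip = sum(x * y for x, y in zip(a, b))
        assert ip == (8 if a == b else 0)
```

With seed [1] the leaves are exactly the WH basis vectors of order 8, in the order induced by the tree.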
The sequence \alpha_1 ... \alpha_k, \alpha_i \in \{+1, -1\}, that uniquely defines a kernel v \in V_s^{(k)} is called
the \alpha-index of v. Two kernels v, v' are defined to be \alpha-related if and only if the
Hamming distance between their \alpha-indices (the number of positions for which their \alpha-indices
are different) is one. Without loss of generality, let the \alpha-indices of two \alpha-related kernels be
(\alpha_1 ... \alpha_{i-1}, +1, \alpha_{i+1} ... \alpha_k) and (\alpha_1 ... \alpha_{i-1}, -1, \alpha_{i+1} ... \alpha_k). We denote the corresponding kernels as
v^+ and v^- respectively. Since \alpha_1 ... \alpha_{i-1} uniquely define a kernel in V_s^{(i-1)}, two \alpha-related
kernels always share the same prefix vector of length \Delta = 2^{i-1} t. The arrows of Figure 9
indicate examples of \alpha-related kernels in the binary tree of depth 3.
Of special interest are sequences of kernels that are consecutively \alpha-related. An ordered set
of kernels v_1 ... v_m that are consecutively \alpha-related forms a sequence of Gray-Code
Kernels (GCK). The sequence is called a Gray-Code Sequence (GCS). The term Gray Code
relates to the fact that the series of \alpha-indices associated with a GCS forms a Gray code [39].
The kernels at the leaves of the tree in Figure 9, in a left-to-right scan, are consecutively
\alpha-related, thus forming a GCS. Note, however, that this sequence is not unique and that
there are many possible ways of reordering the kernels to form a GCS.
Figure 9: The set of Gray-Code Kernels and their recursive definition visualized as a binary tree.
In this example, the tree is of depth 3 and creates kernels of length 8t. Arrows indicate examples of pairs of kernels that are \alpha-related.
The main idea presented in [37, 38] relies on the fact that two \alpha-related kernels share a
special relationship. Given two \alpha-related kernels v^+, v^-, their sum v_p and their
difference v_m are defined as follows:
v_p = v^+ + v^-
v_m = v^+ - v^-    (3.2)
In [38] it is proven that the following relation holds:
[0_\Delta \; v_p] = [v_m \; 0_\Delta]    (3.3)
where \Delta is the length of the common prefix and 0_\Delta denotes a vector with \Delta zeros.
For example, consider the two \alpha-related kernels from Figure 9 whose \alpha-indices are
(+1, +1, +1) and (+1, -1, +1):
v^+ = [s  s  s  s  s  s  s  s]
v^- = [s  s  -s  -s  s  s  -s  -s]    (3.4)
They share a common prefix of length \Delta = 2t. Then:
v_p = [2s  2s  0_t  0_t  2s  2s  0_t  0_t]
v_m = [0_t  0_t  2s  2s  0_t  0_t  2s  2s]    (3.5)
and equation (3.3) holds with:
[0_{2t} \; v_p] = [0_t  0_t  2s  2s  0_t  0_t  2s  2s  0_t  0_t] = [v_m \; 0_{2t}]    (3.6)
For simplicity of explanation, we now expand v to an infinite sequence such that v(i) = 0 for
i < 0 and for i \geq 2^k t. Using this convention, equation (3.3) can be rewritten in
a new notation:
v_p(i - \Delta) = v_m(i)    (3.7)
and this gives rise to the following corollary:
v^+(i) = v^+(i - \Delta) + v^-(i) + v^-(i - \Delta)
v^-(i) = v^+(i) - v^+(i - \Delta) - v^-(i - \Delta)    (3.8)
Equation (3.8) is the core principle behind an efficient filtering scheme.
Let b^+ and b^- be the signals resulting from convolving a signal x with filter kernels v^+ and v^-
respectively:
b^+(i) = \sum_j x(j) \, v^+(i - j)
b^-(i) = \sum_j x(j) \, v^-(i - j)    (3.9)
Then, by linearity of the convolution operation and corollary (3.8), we have the following:
b^+(i) = b^+(i - \Delta) + b^-(i) + b^-(i - \Delta)
b^-(i) = b^+(i) - b^+(i - \Delta) - b^-(i - \Delta)    (3.10)
This forms the basis of an efficient scheme for convolving a signal with a set of GCK. Given
the result of convolving the signal with the filter kernel v^+ (v^-), convolving with the filter
kernel v^- (v^+) requires only two operations per pixel independent of the kernel size. This
scheme is illustrated in Figure 10.
Figure 10: Efficient filtering using GCK.
Given b^+ (the convolution of a signal with the filter kernel v^+), the convolution result b^- can be computed using 2 operations per pixel regardless of kernel size.
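The update (3.10) can be verified numerically. The sketch below uses the α-related WH pair v+ = [1, 1, 1, 1] and v- = [1, 1, -1, -1], whose common prefix [1, 1] gives Δ = 2, and an arbitrary test signal; given b+ computed directly, b- is obtained with two additions/subtractions per output sample and matches direct convolution:

```python
def conv(x, v):
    # Full (zero-padded) linear convolution.
    n = len(x) + len(v) - 1
    return [sum(x[j] * v[i - j] for j in range(len(x)) if 0 <= i - j < len(v))
            for i in range(n)]

x = [3, 1, 4, 1, 5, 9, 2, 6]   # arbitrary test signal
vp = [1, 1, 1, 1]              # v+
vm = [1, 1, -1, -1]            # v-, alpha-related to v+
delta = 2                      # length of the common prefix [1, 1]

b_plus = conv(x, vp)           # assume this result is already known

# GCK update (3.10): two additions/subtractions per output sample,
# independent of the kernel length (out-of-range samples are zero).
b_minus = [0] * len(b_plus)
for i in range(len(b_plus)):
    prev_p = b_plus[i - delta] if i >= delta else 0
    prev_m = b_minus[i - delta] if i >= delta else 0
    b_minus[i] = b_plus[i] - prev_p - prev_m

assert b_minus == conv(x, vm)  # matches the direct convolution
```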
Considering definition (3.1) and setting the seed vector to s = [1], we obtain that V_{[1]}^{(k)} is
the WH basis set of order 2^k. A binary tree can be designed such that its leaves are the WH
kernels ordered in dyadic (or Paley) order [35] of increasing sequency and they form a GCS
(i.e., are consecutively \alpha-related). An example for k = 2 is shown in Figure 11, where every
two consecutive kernels are \alpha-related. Thus, given the result of filtering an image with the
first WH kernel, filtering with the second kernel requires only two operations
(additions/subtractions) per pixel. Subsequently, by ordering the WH kernels to form a GCS,
filtering with the other kernels can be performed using only 2 operations per pixel per kernel
regardless of signal and kernel size.
Figure 11: GCK with initial vector [1] creates the WH basis set.
Using initial vector s = [1] and depth k = 2, a binary tree creates the WH basis set of order 4. Consecutive kernels are \alpha-related, as shown by the arrows.
For separable kernels, such as the WHT, the previous definitions and results can be
generalized to two (and more) dimensions. The computation cost remains at two operations
per pixel per kernel regardless of the dimension. For example, Figure 12 shows the two‐
dimensional WH kernels of size 4x4. In this figure, every pair of horizontally or vertically
neighboring kernels is α-related. For more details the reader is referred to [38].
It was shown that successive filtering with α‐related kernels can be applied efficiently.
However, the efficiency of using the GCK in a particular application is determined not only by
the computational complexity of applying each kernel, but also by the total number of
kernels taking part in the process. This, in turn, depends upon the order in which the kernels
are applied. It is desired to order kernels into an optimal GCS. To do so, a priority value
should be assigned to each kernel, representing its contribution in achieving the goal of the
process ‐ in our case, the ability of matching macroblocks based on the projection values of
the specific kernel. If the order of the kernels within the sequence is insignificant, this
problem is shown in [38] to be NP-hard.
Figure 12: Extension of GCK to two dimensions.
The outer product of two sets of one-dimensional Gray-Code Kernels forms the set of two-dimensional kernels. With initial vector [1], the Walsh-Hadamard kernels of size 4x4 are obtained.
One possible sequence of 2D WH kernels is that in which kernels are ordered with increasing
sequency (the number of sign changes along each dimension of the kernel ‐ analogous to
frequency). The sequency order is known to perform well on natural images due to energy
compaction in the low order sequencies. However, consecutive kernels in the 2D WH
sequency order are not necessarily α‐related, thus they do not form a GCS. Luckily,
horizontally or vertically neighboring kernels in the 2D WH array are α‐related, so a ‘snake’
ordering is possible, as depicted by overlaid arrows in Figure 13. The ‘snake’ ordering,
originally suggested here, forms a GCS and, although not exactly according to sequency,
captures the increase in spatial frequency.
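A 'snake' (boustrophedon) traversal is straightforward to generate. The sketch below uses a simple row-wise variant; what matters, and what the assertion checks, is that every step moves to a horizontal or vertical neighbor in the kernel grid, which is the property that keeps consecutive kernels α-related:

```python
def snake(n):
    # Boustrophedon traversal of an n x n grid of kernel positions:
    # left-to-right on even rows, right-to-left on odd rows.
    order = []
    for r in range(n):
        cols = range(n) if r % 2 == 0 else range(n - 1, -1, -1)
        order.extend((r, c) for c in cols)
    return order

path = snake(4)
assert len(path) == 16

# Every consecutive pair of positions differs by exactly one step in
# one coordinate, i.e., the kernels are grid neighbours.
for (r1, c1), (r2, c2) in zip(path, path[1:]):
    assert abs(r1 - r2) + abs(c1 - c2) == 1
```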
Figure 13: 'Snake' ordering of WH kernels.
The projection vectors of the 2D WHT. 'Snake' ordering is depicted by overlaid arrows and numbers.
'Snake' ordering approximates the increase in spatial frequency while forming a GCS. Thus,
filtering with any kernel in a snake-ordered sequence (except the first kernel) requires
maintaining only the projection onto the previous kernel in the sequence and the signal
itself. However, it is possible to select an ordering of kernels such that consecutive kernels
are not necessarily α-related, but rather every kernel is α-related to at least one kernel that
precedes it anywhere in the sequence. This still allows an efficient projection, using a
projection onto a preceding α-related kernel and the signal, but incurs higher memory
complexity since preceding projections must be maintained in memory. One such ordering is
the 'increasing frequency' ordering, originally suggested here, and depicted in Figure 14. In
this ordering the kernels are arranged in increasing spatial frequency, thus it has better energy
compaction in the first kernels compared to 'snake' ordering. In the algorithm presented in
the next chapter, memory complexity will not be an issue, so the increasing frequency order
is used.
Figure 14: Increasing frequency ordering of WH kernels.
The projection vectors of the 2D WHT. Increasing frequency ordering is depicted by overlaid arrows and numbers.
4. The FME‐GCK Algorithm
In this chapter, a novel fast block motion estimation algorithm is presented. The algorithm is
based on the fast pattern matching technique described in previous chapters, hence it is
denoted FME‐GCK. The motivation for the FME‐GCK algorithm comes from the fact that the
block motion estimation problem is a variant of the pattern matching problem. Therefore,
the fast pattern matching technique described in previous chapters can be tailored for the
fast block motion estimation, with proper adjustments.
The FME‐GCK algorithm has all the advantages of the fast pattern matching techniques
described in earlier chapters and it also exploits additional redundancies of the block motion
estimation process. It is fast and efficient (see Chapter 5 and Chapter 6), involves integer
computations only, and incurs sequential memory access. In contrast to most classical
motion estimation algorithms, the FME-GCK enables adaptivity to image content (see
Chapter 7 and Chapter 8).
There are a few differences between the pattern matching problem and the block motion
estimation problem. The pattern matching problem involves finding one pattern in an image
while the block motion estimation problem involves finding many different macroblocks in a
frame. Every candidate region of the reference frame in the block motion estimation
problem is a candidate for a best match for several neighboring macroblocks from the
current frame. Furthermore, the current frame forms the reference frame for block motion
estimation of the consecutive image in the video sequence. The FME‐GCK exploits these
additional redundancies in the block motion estimation problem.
In block motion estimation, it is assumed that differences between consecutive frames are
due to translation of complete macroblocks. This is a very simple model that might result in
non‐negligible residual or noise. Therefore, instead of searching for an exact pattern match,
a 'noisy' version of the template macroblock is sought. The candidate region that produces the
lowest lower bound is considered the best match. The fast pattern matching
techniques described in previous chapters are shown in [33, 34] to be effective even under
very noisy conditions, hence their appropriateness for the block motion estimation problem.
Assume a video sequence is composed of images I_1, I_2, I_3, ... of size n x n, macroblocks
of size k x k, and search areas of size l x l. Also assume a set of m WH basis vectors
u_0 ... u_{m-1} is given such that every basis vector is α-related to at least one basis vector that
precedes it in the sequence. Denote by b_i^{(j)} the projection values of macroblocks of image I_j
onto WH basis vector u_i. Denote by p_{x,y}^{(j)} or w_{x,y}^{(j)} a square region of size k x k of image I_j at
coordinates (x, y), and denote by sa(p) the search area around macroblock p.
The FME-GCK algorithm
For each image I_j:
1) Project I_j onto u_0 ... u_{m-1} to obtain b_0^{(j)} ... b_{m-1}^{(j)} and store the resulting projections in
memory.
2) For each Inter macroblock p_{x1,y1}^{(j)}:
2.1) For each candidate region w_{x2,y2}^{(j-1)} in sa(p_{x1,y1}^{(j)}):
2.1.1) Calculate the norm-1 lower bound on the distance between p_{x1,y1}^{(j)} and w_{x2,y2}^{(j-1)}
using b^{(j)} and b^{(j-1)}.
2.2) Calculate the actual SAD between p_{x1,y1}^{(j)} and the q candidate regions from
sa(p_{x1,y1}^{(j)}) with the smallest distance lower bounds.
2.3) Of these q candidate regions, select the one with the smallest SAD from
sa(p_{x1,y1}^{(j)}) as the best matching macroblock.
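Steps 2.1-2.3 can be illustrated on toy data (hypothetical values; projections are computed directly here for brevity, whereas FME-GCK obtains them incrementally via the GCK scheme):

```python
# Unnormalized WH basis of order 4 used as projection vectors.
U = [[1, 1, 1, 1], [1, 1, -1, -1], [1, -1, -1, 1], [1, -1, 1, -1]]

def project(block, m):
    # First m projection values of a flattened block.
    return [sum(u[i] * block[i] for i in range(4)) for u in U[:m]]

def sad(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

p = [10, 12, 11, 13]   # template macroblock (flattened, toy values)
candidates = [[9, 12, 11, 14], [30, 2, 7, 40], [10, 12, 12, 13], [0, 0, 0, 0]]

m, q = 2, 2            # projections used / candidates kept for exact SAD
bp = project(p, m)

# Step 2.1.1: norm-1 lower bound from the first m projection values.
lbs = [sum(abs(x - y) for x, y in zip(bp, project(w, m))) for w in candidates]
# Step 2.2: keep the q candidates with the smallest lower bounds ...
best_q = sorted(range(len(candidates)), key=lambda i: lbs[i])[:q]
# Step 2.3: ... and pick the one with the smallest actual SAD.
best = min(best_q, key=lambda i: sad(p, candidates[i]))
```

Only the q surviving candidates ever pay the full SAD cost; the obviously distant candidates are rejected from their lower bounds alone.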
A block diagram of the FME‐GCK algorithm is shown in Figure 15.
Figure 15: The FME-GCK algorithm.
Block diagram: each image of the input video sequence is projected onto the WH basis vectors, and the current image projections {b_i^{(j)}} and previous image projections {b_i^{(j-1)}} are stored in memory. Each image is split into macroblocks; for each macroblock, lower bounds are computed over its search area, the q best candidates are selected, their actual SAD is calculated, and the candidate with the smallest SAD is selected as the 'best' matching macroblock.
Step 1 of the algorithm, image projections, is performed for all frames, both Inter and Intra,
while the following steps are performed only for Inter frames, where motion information is
required. Image projections are stored in memory since they are required for motion
estimation of the following image in the video sequence.
In order to perform efficient GCK calculations, each basis vector should be α‐related to at
least one basis vector that precedes it in the sequence (from within the projection values
stored in memory). The order of kernels used within the FME‐GCK is depicted in Figure 14 as
overlaid arrows and numbers. This ‘increasing frequency’ ordering has been chosen due to
its good energy compaction property.
Step 1 of the algorithm is performed using GCK with only 2 operations per pixel for each WH
kernel. An exception to this efficient calculation is the first kernel (the DC component), which
can be calculated using 4 operations per pixel as described in [40].
Notice that the GCK approach cannot be used efficiently for projecting macroblocks on the
top and left image boundaries. This limitation, although seemingly minor, might increase
algorithm complexity substantially. In an experiment with the Foreman video sequence at
CIF (352x288) resolution, boundary macroblock projections were performed by direct
filtering with WH basis vectors and nonboundary macroblock projections were performed
using GCK. In CIF resolution, only about 0.7% of the candidate regions are top or left
boundary regions. However, boundary projections were found to require about 55% of the
calculation time spent on non‐DC projections.
A solution to this problem is to zero-pad the upper and left boundaries of the image by
k - 1 rows and k - 1 columns respectively. This, naturally, also increases the size
of the projection images. The upper k - 1 rows and left k - 1 columns of these projection
images are filled with zeros. This is correct since projecting a zero macroblock
onto any kernel results in zero. For all other image pixels, starting from the k-th row and the
k-th column, projections are performed using the efficient GCK method. The proposed
technique for fast boundary calculation is depicted in Figure 16.
Figure 16: Image padding for rapid boundary calculation.
Zero-padding each image with k - 1 rows and k - 1 columns enables rapid boundary calculation. The upper k - 1 rows and left k - 1 columns of each corresponding projection image are filled with zeros. GCK-based computations start at the k-th row and column.
Step 2.1.1 of the algorithm is based on the projection framework described in Chapter 2.
Although the WH basis vectors are not orthonormal, they are orthogonal. Therefore, the
normalizing term (U^T U)^{-1} in equation (2.7) can be ignored. The projection scheme is used
with norm-1 since this forms a partial calculation of the SATD distance measure. As additional
projections are applied, a better approximation of the SATD is obtained. In [36], the
correctness of the iterative projection scheme with the SAD distance measure is proven.
The FME-GCK algorithm gives a good time-quality tradeoff compared to classical fast block
motion estimation techniques. This is described in detail in Chapter 6. Usually, only a few
projections are required for highly accurate motion estimation. However, if m = k^2 the
algorithm results are guaranteed to be identical to those of full search, though this is
not a common configuration. Thus, convergence to the optimal solution is guaranteed. In
this sense, FME-GCK can be considered a fast full search motion estimation algorithm.
In Chapter 7 a variant of the FME-GCK that adaptively changes algorithm parameters is
described.
5. Complexity Analysis
It has already been mentioned in the previous chapter that the FME-GCK algorithm uses two
parameters that affect the tradeoff between complexity and accuracy of motion estimation.
These parameters are m, the number of projections to perform for each image, and q, the
number of candidate macroblocks for which the SAD value is calculated. Larger m produces
more accurate results at the cost of higher time and memory complexity. Memory
complexity is affected since the m projections of image I_j and the m projections of image I_{j-1}
must be stored in memory; thus, memory complexity is approximately 2(m+1)N, where N is
the size of the video frames in pixels. Larger q also produces more accurate results at the
cost of higher time complexity; it does not, however, affect memory.
Let us assume 1 time unit for each operation of addition, subtraction, multiplication,
absolute value, and minimum of two numbers. We obtain that performing a single SAD
computation between two k x k macroblocks requires 3k^2 - 1 time units.
Performing the FME-GCK algorithm involves m projections of each image. Time complexity
of this step is 2 time units per pixel for every projection except for the first projection, which
requires 4 time units per pixel to calculate, for a total of 2(m+1)k^2 time units per template
macroblock. Calculating the lower bound for candidate macroblocks within the search area
requires another (3m-1)l^2 time units per template macroblock, where the search area
is of size l x l. Finding the q candidate regions with the smallest lower bounds, if
performed naively, requires no more than ql^2 time units. Calculating the SAD for these q
candidate regions and selecting the one with the minimal SAD requires q(3k^2 - 1) +
(q - 1) = 3qk^2 - 1 time units. Thus, a total of 2(m+1)k^2 + (3m-1)l^2 + ql^2 +
3qk^2 - 1 time units are required per template macroblock. Additional calculations are
required due to the aforementioned boundary padding. However, this extra overhead can
be compensated by a non-naive algorithm for finding the q candidate regions with the
lowest lower bounds. Efficient algorithms for selecting the q smallest values in a list are
described in [41].
Two possible configurations of m and q are ones that result in FME-GCK complexity that is
approximately equal to that of three-step search [17] or of diamond search [20, 21]. These
configurations allow comparison of the accuracy of motion information produced by
FME-GCK to the accuracy of motion information produced by three-step search or by
diamond search under the same computational constraints. Three-step search incurs 25
block matching operations per macroblock. Thus, performing three-step search requires
25(3k^2 - 1) time units plus 24 time units for calculating the minimum over all SAD values
per macroblock. With k = 16, this sums up to 19,199 time units. Diamond search is shown
in [21] to reduce block matching operations from 25 to an average of 15.5 per macroblock
with k = 16, l = 15. Thus, performing a diamond search requires 15.5(3k^2 - 1) time units
plus 14.5 time units for calculating the minimum over all SAD values per macroblock. In the
given configuration this sums to 11,903 time units per template macroblock. Note that when
comparing the number of time units needed to perform FME-GCK with the time units needed
to perform three-step search or diamond search, a correction factor should multiply FME-GCK's
complexity. This factor is added since in FME-GCK both Inter and Intra macroblocks must be
projected, in contrast to zero calculations for Intra macroblocks incurred by three-step search
and diamond search. The value of this factor depends on the Intra periodicity in the video
sequence; a typical value is 1.10.
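The time-unit arithmetic above is easy to reproduce; the sketch below also evaluates the per-macroblock total derived earlier in this chapter for one configuration, with the typical 1.10 Intra-projection factor applied:

```python
k, l = 16, 15                    # macroblock side, search-area side

sad_cost = 3 * k**2 - 1          # one SAD between two k x k blocks: 767
tss = 25 * sad_cost + 24         # three-step search per macroblock
ds = 15.5 * sad_cost + 14.5      # diamond search per macroblock (average)
assert tss == 19199
assert ds == 11903.0

def fme_gck_cost(m, q):
    # Per-macroblock time units, times the 1.10 Intra-projection factor.
    total = (2 * (m + 1) * k**2 + (3 * m - 1) * l**2
             + q * l**2 + 3 * q * k**2 - 1)
    return 1.10 * total

# (m = 5, q = 4) lands just under diamond search's budget.
assert ds * 0.9 < fme_gck_cost(5, 4) < ds
```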
Considering the calculation above, we obtain that (m = 11, q = 4) and (m = 5, q = 13) are
FME-GCK configurations similar in their computational complexity to three-step search, and
that (m = 5, q = 4) is an FME-GCK configuration similar in its computational complexity to
diamond search. Note that it has been verified by real-time code profiling that these
configurations are indeed similar to their counterparts (see next chapter for details). These
configurations will be used in the next chapter for comparison purposes.
It is important to note that the theoretical complexity comparison of FME-GCK to three-step
search and to diamond search does not take into account the fact that FME-GCK incurs
sequential memory access while three-step search and diamond search incur many
unpredictable branches. This difference might have a significant effect on running times in
favor of the FME-GCK algorithm, depending on the specific hardware configuration. One
hardware configuration where sequential memory access is highly beneficial is DSP (Digital
Signal Processor) chips. DSP chips are widely used for many signal processing applications.
Work is currently being performed in the Signal and Image Processing Laboratory at the
Technion - IIT, comparing FME-GCK performance to three-step search performance and to
diamond search performance using the DM642 and DM6437 DSP chips from Texas
Instruments.
6. FME‐GCK Results
FME-GCK was implemented in highly efficient ANSI-C code, together with its full search,
three-step search, and diamond search counterparts, in order to enable a fair time and quality
comparison. Implementation was performed and measured on a Pentium 4 PC at 3 GHz
running Windows XP. In general, computational complexity was found to coincide with the
theoretical complexity calculation described in Chapter 5. Both diamond search and
FME-GCK with (m = 5, q = 4) execute on this hardware configuration at a speed of about
110 CIF frames per second.
First, an extensive set of simulations was performed. Then, FME‐GCK and its counterparts
were integrated with a video encoder in order to measure the effect on the real video
encoding. In both simulation and video encoding tests, motion estimation was performed
with GOP size of 15 (for every 15th frame no motion estimation was performed) for the
luminance (Y) component with macroblocks of size 16x16 and search area of size 15x15,
except when noted otherwise.
6.1. Simulation Results
All simulation results were obtained using the video sequences that appear in Table 1.
Table 1: Video sequences used for simulation experiments. For each resolution, video sequences are sorted in ascending order of estimated coding difficulty.

QCIF (176x144)             CIF (352x288)
Sequence       Frames      Sequence    Frames
Akiyo          300         Akiyo       300
Miss_america   150         Silent      300
Trevor         150         Foreman     300
Carphone       300         Tempete     260
Coastguard     300         Mobile      300
Foreman        300         Stefan      300
(sequences lower in each column have higher coding difficulty)
Figure 17 shows frame number 169 of the Foreman CIF video sequence. In this frame the
background is approximately static with head motion to the right. Thus, most motion vectors
in the background region are zero and most motion vectors in the head region point to the
left. The computed motion vectors for full search, diamond search and FME-GCK with
(m = 4, q = 4) are displayed as overlaid arrows. While all three resulting motion fields look
similar, it can be observed that the motion information produced by FME‐GCK is closer to
the optimal (full search) results compared to motion information produced by diamond
search.
Figure 17: Motion information as overlaid arrows.
Frame number 169 of the Foreman CIF video sequence. The computed motion vectors for (a) full search, (b) diamond search, and (c) FME-GCK with (m = 4, q = 4) are displayed as overlaid arrows. While all three resulting motion fields look similar, the motion information produced by FME-GCK is closer to the optimal (full search) results than that produced by diamond search.
Figure 18 depicts the effect of different values of the parameter m on FME‐GCK motion
estimation accuracy with a constant q = 4. Motion estimation accuracy is measured as the mean
SAD per macroblock between macroblocks and their ‘best’ matching counterparts. Full‐search,
three‐step search, and diamond search results are displayed as a reference. As
expected, the FME‐GCK results converge to the optimal ones: increasing the
number of projections produces lower SAD values, approaching the full‐search SAD values.
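The accuracy measure used in these simulations (mean SAD per macroblock against the best full‐search match) can be sketched as follows. This is a minimal Python illustration; the function names and the small search window are chosen for the example and are not taken from the thesis code:

```python
import numpy as np

def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized blocks."""
    return int(np.abs(block_a.astype(np.int64) - block_b.astype(np.int64)).sum())

def mean_sad_per_macroblock(frame, ref_frame, mb_size=16, search=7):
    """Mean SAD between each macroblock of `frame` and its best (full-search)
    match in `ref_frame`, within a +/-`search` pixel window.  A 15x15 search
    area, as used in the experiments, corresponds to search=7."""
    height, width = frame.shape
    total = 0
    count = 0
    for y in range(0, height - mb_size + 1, mb_size):
        for x in range(0, width - mb_size + 1, mb_size):
            macroblock = frame[y:y + mb_size, x:x + mb_size]
            best = None
            # Exhaustively test every candidate displacement in the window.
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    ry, rx = y + dy, x + dx
                    if 0 <= ry <= height - mb_size and 0 <= rx <= width - mb_size:
                        candidate = ref_frame[ry:ry + mb_size, rx:rx + mb_size]
                        cost = sad(macroblock, candidate)
                        if best is None or cost < best:
                            best = cost
            total += best
            count += 1
    return total / count
```

A fast method such as FME‐GCK evaluates only a subset of the candidate displacements, so its mean SAD can only be greater than or equal to this full‐search baseline.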
[Figure 18: twelve panels (akiyo.cif, akiyo.qcif, miss america.qcif, silent.cif, trevor.qcif, foreman.cif, foreman.qcif, carphone.qcif, tempete.cif, coastguard.qcif, mobile.cif, stefan.cif), each plotting mean SAD per macroblock against m (the number of projections) for full search, three‐step search, diamond search, and FME‐GCK.]
Figure 18: Effect of different values of the parameter m on motion estimation accuracy. Results are for a constant q = 4. Full search, three‐step search, and diamond search results are displayed as a reference. FME‐GCK algorithm results converge to the optimal ones. FME‐GCK significantly outperforms three‐step search (indicated by a light arrow) and is comparable to diamond search (indicated by a dark arrow).
For all video sequences except one, FME‐GCK outperforms three‐step search for m ≥ 5. For
9 out of the 12 video sequences, m = 4 is sufficient to outperform three‐step search. Note
that an FME‐GCK configuration equal in computational complexity to three‐step search,
indicated by a light arrow, is m = 11, q = 4, so for the same motion accuracy the gain in
computation time from using FME‐GCK instead of three‐step search is significant. For a
configuration comparable in computation time to diamond search, indicated by a dark
arrow, m = 5, q = 4, FME‐GCK outperforms diamond search only for a few video sequences.
This will change in favor of FME‐GCK with the introduction of an adaptive FME‐GCK in
Chapter 7.
Figure 19 depicts FME‐GCK motion estimation accuracy vs. three‐step search motion estimation
accuracy for similar computational complexity – FME‐GCK with m = 11, q = 4. It is again
shown that for this configuration, FME‐GCK significantly outperforms three‐step search for
all video sequences except one (Tempete CIF).
Figure 19: FME‐GCK motion estimation accuracy vs. three‐step search motion estimation accuracy. The comparison is performed for similar computational complexity – FME‐GCK with m = 11, q = 4. For this configuration, FME‐GCK significantly outperforms three‐step search for all video sequences except one.
[Figure 19: bar chart of mean SAD per macroblock for FME‐GCK (m=11, q=4) and three‐step search over all twelve video sequences.]
Looking at Figure 18, it is not obvious that the motion estimation accuracy of FME‐GCK can be
approximately predicted from image content. Figure 20 depicts the same FME‐GCK
convergence lines that appear in Figure 18, but now the y‐axis represents
(SAD_FME‐GCK − SAD_Full Search) / SAD_Full Search. Thus, Figure 20 represents FME‐GCK
motion estimation accuracy compared to the optimal (full‐search) motion estimation
accuracy. All 12 video sequences are sorted in ascending order of their coding‐difficulty.
Easier‐to‐code video sequences are plotted as lighter lines while more difficult‐to‐code
video sequences are plotted as darker lines. It can be observed that, in general, more
difficult‐to‐code video sequences produce larger values; they thus require more projections
in order to reach, relative to the optimal results, the same motion estimation accuracy as
easy‐to‐code video sequences. This fact is exploited in the adaptive FME‐GCK algorithm,
as described in Chapter 7.
Figure 20: FME‐GCK motion estimation accuracy relative to the optimal results. Results are for a constant q = 4. Video sequences are sorted in ascending order of their coding‐difficulty. Easier‐to‐code video sequences are plotted as lighter lines while more difficult‐to‐code video sequences are plotted as darker lines. In general, more difficult‐to‐code video sequences produce larger values, thus requiring more projections to reach the same motion estimation accuracy as easy‐to‐code video sequences.
[Figure 20: (SAD_FME‐GCK − SAD_Full Search) / SAD_Full Search plotted against m (the number of projections) for all twelve video sequences, ordered by coding difficulty.]
It is also possible to keep the parameter m constant, m = 5 in our case, and select
different values of the parameter q, as depicted in Figure 21. In this case too, the FME‐GCK
results converge to the optimal ones: increasing the number of SAD calculations
per macroblock produces lower SAD values, approaching the full‐search SAD values. For 9
out of the 12 video sequences, q = 5 is sufficient to outperform three‐step search. An
FME‐GCK configuration equal in computational complexity to three‐step search, indicated
by a light arrow, is m = 5, q = 13, so for the same motion accuracy the gain in computation
time from using FME‐GCK instead of three‐step search is significant. A comparison of diamond
search results and FME‐GCK results with m = 5, q = 4 was performed in the context of
Figure 18 and is indicated in Figure 21 again by a dark arrow.
Figure 22 depicts the effect of the size of the search area on FME‐GCK motion estimation
accuracy. Experiments were performed for the Carphone QCIF and Foreman CIF video
sequences with search areas of sizes 7x7, 15x15, and 31x31. The y‐axis represents
(SAD_FME‐GCK − SAD_Diamond) / SAD_Diamond or
(SAD_FME‐GCK − SAD_TSS) / SAD_TSS; it thus represents FME‐GCK
motion estimation accuracy compared to diamond search or to three‐step search motion
estimation accuracy, respectively. It can be observed that for larger search areas, as more
projections are performed, FME‐GCK results improve both compared to three‐step search
and compared to diamond search. Today, with the advent of high‐resolution video
sequences, large search areas are commonly used. Thus, FME‐GCK is expected to show even
better results compared to three‐step search and to diamond search in the near future.
[Figure 21: panels (foreman.cif, silent.cif, akiyo.qcif, trevor.qcif, miss america.qcif, stefan.cif, mobile.cif, tempete.cif, foreman.qcif, coastguard.qcif, carphone.qcif), each plotting mean SAD per macroblock against q (the number of SADs per macroblock) for full search, three‐step search, diamond search, and FME‐GCK.]
Figure 21: Effect of different values of the parameter q on motion estimation accuracy. Results are for a constant m = 5. Full search, three‐step search, and diamond search results are displayed as a reference. FME‐GCK algorithm results converge to the optimal ones. FME‐GCK significantly outperforms three‐step search (indicated by a light arrow) and is comparable to diamond search (indicated by a dark arrow).
We summarize the simulation results section by stating that the FME‐GCK algorithm
significantly outperforms three‐step search and produces motion information that is almost
as accurate as that of diamond search. This will further improve in favor of FME‐GCK with the
introduction of an adaptive FME‐GCK in Chapter 7. In addition, when larger search areas are
used, FME‐GCK results improve compared to both three‐step search and diamond search.
[Figure 22: four panels (Carphone QCIF and Foreman CIF, each relative to three‐step search and to diamond search), plotting (SAD_FME‐GCK − SAD_TSS) / SAD_TSS or (SAD_FME‐GCK − SAD_Diamond) / SAD_Diamond against m (the number of projections) for search areas 7x7, 15x15, and 31x31.]
Figure 22: Effect of the size of the search area on motion estimation accuracy. Results are given for the Carphone QCIF and Foreman CIF video sequences with a constant q = 4, relative to three‐step search and diamond search results, with search areas of size 7x7, 15x15, and 31x31. Relative to both three‐step search and diamond search, FME‐GCK results improve for larger search areas.
6.2. Video Encoding Results
In order to evaluate FME‐GCK as part of real video encoding, FME‐GCK and its
counterparts were integrated into a video encoder. The standard JVT H.264/AVC reference
software [13] was used for all tests, with some features disabled. The disabled features
are those not currently supported by the FME‐GCK implementation: B pictures, motion
estimation for sub‐macroblock partitions smaller than 16x16, subpixel motion estimation,
and multiple reference frames for motion estimation. It is important to note that FME‐GCK can
readily support all these features in a future version of its implementation.
All experiments were performed according to the common testing conditions recommended
in [42]. Therefore, the video sequences that appear in Table 2 were coded with QP
(quantization parameters) values of 28, 32, 36, 40.
Table 2: Video sequences used for video coding experiments. For each resolution, video sequences are sorted by ascending order of estimated coding‐difficulty.
QCIF (176x144)            CIF (352x288)
Sequence        Frames    Sequence    Frames
Container       300       Paris       300
Silent Voice    300       Foreman     300
Foreman         300       Tempete     260
                          Mobile      300
(Within each column, sequences lower in the table have higher coding difficulty.)
Rate‐distortion results for all seven video sequences can be found in Figures 23–29. Rate‐
distortion results for three‐step search and diamond search are displayed as a reference. For
every QP value in these figures, distortion (PSNR) was kept roughly constant and mean
Δbitrate results were computed relative to full search according to [43]. A smaller Δbitrate
indicates more accurate motion estimation.
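The idea behind the Δbitrate comparison can be illustrated by averaging the per‐QP relative bitrate differences directly. This is only a sketch of the concept: [43] prescribes a specific averaging procedure that this simplified function does not reproduce, and the function name is illustrative.

```python
def mean_delta_bitrate(test_bitrates, ref_bitrates):
    """Mean relative bitrate difference (in percent) of a tested motion
    estimation method versus the full-search reference, averaged over the
    QP operating points.  Assumes distortion (Y-PSNR) is roughly equal at
    each QP, as in the tables of Figures 23-29."""
    pairs = list(zip(test_bitrates, ref_bitrates))
    # Relative difference at each QP point, then a plain average.
    return 100.0 * sum((t - r) / r for t, r in pairs) / len(pairs)
```

A positive result means the tested method spends more bits than full search for roughly the same quality, i.e. its motion estimation is less accurate.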
[Figure 23: rate‐distortion curves, Y‐PSNR (dB) vs. bitrate (Kbits/sec), for the five methods tabulated below.]

            Full Search     Three‐step       Diamond         FME‐GCK         FME‐GCK
                            Search           Search          (m=5,q=4)       (m=5,q=13)
QP          bitrate Y‐PSNR  bitrate Y‐PSNR   bitrate Y‐PSNR  bitrate Y‐PSNR  bitrate Y‐PSNR
28          127.54  36.13   127.57  36.12    127.61  36.13   128.60  36.12   127.77  36.12
32           82.81  33.37    82.83  33.37     82.84  33.37    83.70  33.37    82.98  33.37
36           54.32  30.68    54.31  30.68     54.33  30.68    55.05  30.67    54.44  30.68
40           38.21  28.26    38.18  28.27     38.17  28.26    38.89  28.26    38.33  28.26
Δbitrate / ΔY‐PSNR vs. full search:  0.00% / 0.00    0.02% / 0.00    1.29% / -0.08   0.24% / -0.02
Figure 23: FME‐GCK rate‐distortion video encoding results for Container QCIF. Bitrate is given in Kbits/sec; Y‐PSNR is given in dB. Full search, three‐step search, and diamond search results are displayed as a reference. FME‐GCK is outperformed by both three‐step search and diamond search.
[Figure 24: rate‐distortion curves, Y‐PSNR (dB) vs. bitrate (Kbits/sec), for the five methods tabulated below.]

            Full Search     Three‐step       Diamond         FME‐GCK         FME‐GCK
                            Search           Search          (m=5,q=4)       (m=5,q=13)
QP          bitrate Y‐PSNR  bitrate Y‐PSNR   bitrate Y‐PSNR  bitrate Y‐PSNR  bitrate Y‐PSNR
28          175.26  35.51   180.39  35.51    175.29  35.51   187.15  35.44   179.68  35.48
32          108.66  32.63   112.64  32.64    108.86  32.64   115.38  32.55   111.14  32.60
36           68.32  30.16    70.79  30.17     68.45  30.17    71.41  30.07    69.49  30.13
40           45.33  27.78    46.79  27.78     45.51  27.79    46.68  27.73    45.75  27.76
Δbitrate / ΔY‐PSNR vs. full search:  3.35% / -0.19   0.03% / 0.00    6.84% / -0.37   2.49% / -0.14
Figure 24: FME‐GCK rate‐distortion video encoding results for Silent Voice QCIF. Bitrate is given in Kbits/sec; Y‐PSNR is given in dB. Full search, three‐step search, and diamond search results are displayed as a reference. FME‐GCK outperforms three‐step search and is outperformed by diamond search.
[Figure 25: rate‐distortion curves, Y‐PSNR (dB) vs. bitrate (Kbits/sec), for the five methods tabulated below.]

            Full Search     Three‐step       Diamond         FME‐GCK         FME‐GCK
                            Search           Search          (m=5,q=4)       (m=5,q=13)
QP          bitrate Y‐PSNR  bitrate Y‐PSNR   bitrate Y‐PSNR  bitrate Y‐PSNR  bitrate Y‐PSNR
28          356.01  35.05   409.16  34.93    374.29  35.02   361.12  35.03   357.64  35.05
32          205.46  32.11   241.40  31.96    217.58  32.07   209.14  32.10   206.57  32.10
36          115.57  29.42   136.95  29.24    122.69  29.37   118.21  29.40   116.23  29.41
40           68.85  27.04    79.94  26.81     72.81  26.98    71.00  27.03    69.24  27.04
Δbitrate / ΔY‐PSNR vs. full search:  21.38% / -0.95  6.84% / -0.32   2.36% / -0.11   0.70% / -0.03
Figure 25: FME‐GCK rate‐distortion video encoding results for Foreman QCIF. Bitrate is given in Kbits/sec; Y‐PSNR is given in dB. Full search, three‐step search, and diamond search results are displayed as a reference. FME‐GCK outperforms both three‐step search and diamond search.
[Figure 26: rate‐distortion curves, Y‐PSNR (dB) vs. bitrate (Kbits/sec), for the five methods tabulated below.]

            Full Search     Three‐step       Diamond         FME‐GCK         FME‐GCK
                            Search           Search          (m=5,q=4)       (m=5,q=13)
QP          bitrate Y‐PSNR  bitrate Y‐PSNR   bitrate Y‐PSNR  bitrate Y‐PSNR  bitrate Y‐PSNR
28          992.99  35.52   1028.56 35.51    996.18  35.52   1027.17 35.50   1003.11 35.52
32          655.76  32.41   682.75  32.40    658.00  32.41   680.22  32.38   663.46  32.40
36          415.56  29.46   435.22  29.45    417.32  29.46   432.19  29.42   420.89  29.44
40          261.27  26.76   274.30  26.76    262.42  26.76   272.10  26.71   264.85  26.74
Δbitrate / ΔY‐PSNR vs. full search:  4.49% / -0.29   0.38% / -0.02   4.39% / -0.28   1.43% / -0.09
Figure 26: FME‐GCK rate‐distortion video encoding results for Paris CIF. Bitrate is given in Kbits/sec; Y‐PSNR is given in dB. Full search, three‐step search, and diamond search results are displayed as a reference. FME‐GCK outperforms three‐step search and is outperformed by diamond search.
[Figure 27: rate‐distortion curves, Y‐PSNR (dB) vs. bitrate (Kbits/sec), for the five methods tabulated below.]

            Full Search     Three‐step       Diamond         FME‐GCK         FME‐GCK
                            Search           Search          (m=5,q=4)       (m=5,q=13)
QP          bitrate Y‐PSNR  bitrate Y‐PSNR   bitrate Y‐PSNR  bitrate Y‐PSNR  bitrate Y‐PSNR
28          1286.80 35.54   1517.26 35.39    1335.48 35.51   1318.74 35.50   1295.45 35.53
32          727.59  32.84   878.20  32.66    761.02  32.80   748.11  32.80   733.34  32.82
36          403.07  30.42   492.51  30.21    425.67  30.39   416.81  30.38   406.69  30.41
40          241.83  28.30   289.39  28.04    257.27  28.26   252.10  28.28   244.27  28.29
Δbitrate / ΔY‐PSNR vs. full search:  26.22% / -1.02  5.87% / -0.25   4.02% / -0.17   1.17% / -0.05
Figure 27: FME‐GCK rate‐distortion video encoding results for Foreman CIF. Bitrate is given in Kbits/sec; Y‐PSNR is given in dB. Full search, three‐step search, and diamond search results are displayed as a reference. FME‐GCK outperforms both three‐step search and diamond search.
[Figure 28: rate‐distortion curves, Y‐PSNR (dB) vs. bitrate (Kbits/sec), for the five methods tabulated below.]

            Full Search     Three‐step       Diamond         FME‐GCK         FME‐GCK
                            Search           Search          (m=5,q=4)       (m=5,q=13)
QP          bitrate Y‐PSNR  bitrate Y‐PSNR   bitrate Y‐PSNR  bitrate Y‐PSNR  bitrate Y‐PSNR
28          2757.57 34.06   2782.19 34.04    2759.60 34.06   2862.51 33.99   2802.58 34.03
32          1707.69 30.75   1726.73 30.74    1709.22 30.75   1779.18 30.67   1739.32 30.71
36          966.05  27.70   979.39  27.70    967.17  27.70   1009.48 27.62   986.46  27.66
40          508.25  25.03   514.24  25.01    507.85  25.03   531.92  24.96   520.31  25.00
Δbitrate / ΔY‐PSNR vs. full search:  1.35% / -0.07   0.08% / 0.00    5.81% / -0.30   2.68% / -0.14
Figure 28: FME‐GCK rate‐distortion video encoding results for Tempete CIF. Bitrate is given in Kbits/sec; Y‐PSNR is given in dB. Full search, three‐step search, and diamond search results are displayed as a reference. FME‐GCK outperforms three‐step search and is outperformed by diamond search.
[Figure 29: rate‐distortion curves, Y‐PSNR (dB) vs. bitrate (Kbits/sec), for the five methods tabulated below.]

            Full Search     Three‐step       Diamond         FME‐GCK         FME‐GCK
                            Search           Search          (m=5,q=4)       (m=5,q=13)
QP          bitrate Y‐PSNR  bitrate Y‐PSNR   bitrate Y‐PSNR  bitrate Y‐PSNR  bitrate Y‐PSNR
28          1286.80 35.54   1517.26 35.39    1335.48 35.51   1318.74 35.50   1295.45 35.53
32          727.59  32.84   878.20  32.66    761.02  32.80   748.11  32.80   733.34  32.82
36          403.07  30.42   492.51  30.21    425.67  30.39   416.81  30.38   406.69  30.41
40          241.83  28.30   289.39  28.04    257.27  28.26   252.10  28.28   244.27  28.29
Δbitrate / ΔY‐PSNR vs. full search:  26.22% / -1.02  5.87% / -0.25   4.02% / -0.17   1.17% / -0.05
Figure 29: FME‐GCK rate‐distortion video encoding results for Mobile CIF. Bitrate is given in Kbits/sec; Y‐PSNR is given in dB. Full search, three‐step search, and diamond search results are displayed as a reference. FME‐GCK outperforms both three‐step search and diamond search.
For six out of the seven video sequences that appear in Table 2, FME‐GCK outperforms
three‐step search. For three out of these seven video sequences, FME‐GCK also outperforms
diamond search.
The video coding results corroborate the simulation results of Section 6.1. The FME‐GCK
algorithm significantly outperforms three‐step search and produces motion information that
is almost as accurate as that of diamond search. This will further improve in favor of FME‐GCK
with the introduction of an adaptive FME‐GCK in Chapter 7.
7. An Adaptive FME‐GCK
An important advantage of FME‐GCK compared to classical fast block motion estimation
techniques is that it enables adaptivity to image content. In this chapter, the adaptive
capabilities of FME‐GCK are exploited to produce a varying complexity block motion
estimation algorithm. For some video coding applications, a varying complexity block motion
estimation algorithm is a necessity. Furthermore, even when not strictly necessary, adaptively
varying the complexity may significantly improve motion estimation accuracy compared with a
non‐adaptive method.
Following are some example scenarios in which flexibility in controlling the tradeoff between
complexity and quality is required [3]:
1. Software video codec – Encoding is carried out in software. The upper bound on
computational complexity depends on the available processing resources. These
resources are likely to vary from platform to platform (for example, depending on the
specification of a PC) and may also vary depending on the number of other applications
contending for resources.
2. Power‐limited video codec ‐ In a mobile or handheld computing platform, power
consumption is at a premium. It is now common for a processor in a portable PC or
personal digital assistant to be power‐aware, e.g. a laptop PC may change the processor
clock speed depending on whether it is running from a battery or from an AC supply.
Power consumption increases depending on the activity of peripherals, e.g. hard disk
accesses, display activity, etc. There is therefore a need to manage and limit
computation in order to maximize battery life.
3. Multichannel video coding – One of the tasks of a video server might be to encode
several video sequences simultaneously. Available computational resources are limited
and should be divided between different coding processes. It might be beneficial to
allocate more computational resources to the difficult‐to‐code video sequences in an
effort to equate the quality of the coded sequences.
In all scenarios, desired algorithm complexity may depend on external parameters, on the
characteristics of the input video sequences, or on both. Since external parameters are
application specific, the rest of this chapter will deal with adaptively changing FME‐GCK
parameters based only on the characteristics of the input video sequence.
It is well‐known that some video scenes are more difficult‐to‐code (or less code‐able) than
others. Material containing an abundance of spatial detail and/or rapid, possibly non‐
translational, movement generally requires more encoded bits than material containing little
detail and/or simple motion. The less code‐able material is not modeled well by the
translational block‐based motion model used in modern video coders, thus resulting in
relatively large values in the residual signal. These relatively large values require many bits to
code, thus the coding efficiency of these video scenes is low. Increasing the computational
resources for motion estimation of difficult‐to‐code scenes, if performed wisely, should
improve their coding efficiency.
FME‐GCK uses two parameters that affect the tradeoff between complexity and accuracy of
the resulting motion vectors. These parameters are m, the number of projections to perform for
each image, and q, the number of candidate macroblocks for which the SAD value is
calculated (Step 2.2 in the algorithm). Larger m and larger q produce more accurate results at
the cost of higher (time and memory) complexity.
Figure 30 shows the mean SAD between macroblocks and their ‘best’ matching regions in
the previous frame for the sequences Akiyo and Stefan, of length 300 frames each, in CIF
(352x288) resolution. A large SAD indicates large values in the residual signal; a small SAD
indicates small values in the residual signal. Akiyo is a ‘talking head’ sequence with a small
amount of simple motion, while the Stefan sequence comprises complex local and global
motions. The reconstructed macroblocks were produced by FME‐GCK with a constant q = 4
and with different values of m. Since Akiyo is easy‐to‐code, its residual signal is small, and
since Stefan is difficult‐to‐code, its residual signal is substantially larger. The difference is
more than an order of magnitude. As expected, for both sequences, larger values of m
(more projections) produce a smaller residual signal. More important, however, is the fact
that increasing the number of projections produces a substantially greater reduction in SAD
for the Stefan sequence than for the Akiyo sequence. For example, increasing the number of
projections from 2 to 3 reduces the mean SAD per macroblock in the Stefan sequence by
328.29, whereas in the Akiyo sequence it reduces the mean SAD per macroblock only by
23.66. Thus, using more projections for Stefan is much more effective in raising mean coding
efficiency than using more projections for Akiyo. In addition, since mean SAD is a measure
of subjective image quality (though not a very good one), using more projections for the
Stefan sequence raises its subjective quality towards Akiyo’s. Thus, using more projections
for the Stefan sequence helps the encoder achieve constant image quality across different
video scenes.
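These reductions can be read directly off the per‐m mean SAD values annotated in Figure 30 (a small Python check; the helper name is illustrative):

```python
# Mean SAD per macroblock for m = 2..7, as annotated in Figure 30.
akiyo_cif = {2: 180.72, 3: 157.06, 4: 153.28, 5: 151.70, 6: 151.08, 7: 150.67}
stefan_cif = {2: 3492.93, 3: 3164.64, 4: 3075.58, 5: 2967.82, 6: 2932.12, 7: 2911.59}

def sad_reduction(series, m_from, m_to):
    """Reduction in mean SAD per macroblock when the number of projections
    is increased from m_from to m_to."""
    return round(series[m_from] - series[m_to], 2)

# Increasing m from 2 to 3 helps Stefan far more than Akiyo:
# sad_reduction(stefan_cif, 2, 3) -> 328.29
# sad_reduction(akiyo_cif, 2, 3)  -> 23.66
```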
Figure 30: Size of the residual signal using FME‐GCK with a constant q = 4 and different values of m.
m (number of projections)            2         3         4         5         6         7
Akiyo CIF  (mean SAD/macroblock)  180.72    157.06    153.28    151.70    151.08    150.67
Stefan CIF (mean SAD/macroblock) 3492.93   3164.64   3075.58   2967.82   2932.12   2911.59
Results are for the Akiyo and Stefan video sequences, 300 frames in length each, at CIF (352x288) resolution. Akiyo is easy‐to‐code while Stefan is difficult‐to‐code. For both sequences, larger values of m (more projections) result in a smaller residual signal, but the expected improvement in coding efficiency from adding more projections is substantially larger for Stefan than for Akiyo.
Figure 31 shows that similar conclusions can be drawn when m is kept constant, m = 5 in
this case, and q varies. As expected, for both sequences, larger values of q (more SAD
calculations) result in a smaller residual signal. However, the reduction in SAD produced by
using larger q values is substantially greater for the Stefan sequence than for the Akiyo
sequence. For example, increasing the number of SAD calculations per macroblock from 2 to 3
reduces the mean SAD per macroblock by 86.35 for the Stefan sequence but only by 3.38 for
the Akiyo sequence. Thus, performing more SAD calculations for the Stefan sequence is much
more effective in raising mean coding efficiency than performing more SAD calculations for the
Akiyo sequence. As before, performing more computations for motion estimation of the
Stefan sequence helps the encoder achieve a constant image quality.
Figure 31: Size of the residual signal using FME‐GCK with a constant m = 5 and different values of q.
q (number of SADs per macroblock)    2         3         4         5         6         7
Akiyo CIF  (mean SAD/macroblock)   156.29    152.91    151.70    151.07    150.80    150.59
Stefan CIF (mean SAD/macroblock)  3101.97   3015.62   2967.82   2936.29   2912.70   2894.51
Results are for the Akiyo and Stefan video sequences, 300 frames in length each, at CIF (352x288) resolution. Akiyo is easy‐to‐code while Stefan is difficult‐to‐code. For both sequences, larger values of q (more SAD calculations) result in a smaller residual signal, but the expected improvement in coding efficiency from adding more SAD calculations is substantially larger for Stefan than for Akiyo.
Let us assume that we are required to encode a large set of video sequences containing a
variety of video scenes of varying coding‐difficulty. Classical block motion estimation
algorithms typically allot the same amount of time to computing motion vectors for all video
scenes, resulting in varying coding efficiency. Scenes with complex motion result in
substantially larger residuals than scenes with simple motion or with a small amount of
motion. If a constant bitrate is required at the output of the encoding process, the residual
signals of difficult‐to‐code scenes will be more coarsely quantized, leading to reduced
subjective image quality. Figure 30 and Figure 31 show that this undesirable effect might be
mitigated by selecting larger values of m and q for more difficult‐to‐code video scenes.
Setting nonconstant values of m and q is practical only if it is possible to change the
computational resource allocation dynamically in the encoding system. This leads to a
higher mean subjective quality, or to a closer‐to‐constant subjective quality, of the encoded
video sequences.
In order to change m and q dynamically, an estimate of the resulting encoding bitrate is
required. This estimate is produced by a video bitrate control algorithm. Any practical video
encoder contains a bitrate control algorithm that attempts to maximize the visual quality
while achieving a desired target of encoded bits. The bitrate control estimates the number of
bits required for coding each picture. This estimate might be used to control m and q. Some
examples of well‐known video bitrate control algorithms are MPEG‐2 Test Model 5 [44],
H.263 Test Model 8 [45], and MPEG‐4 Annex‐L [9]. MPEG‐2 Test Model 5 and MPEG‐4 Annex‐L
are frame‐level rate‐control algorithms, estimating the output bitrate at the frame level, while
H.263 Test Model 8 also estimates the output bitrate at the macroblock level.
If a bitrate estimate is not available, a simple estimate of image code‐ability can be used.
One example of such a simple estimate is the size of the residual. Easy‐to‐code material is
expected to result in a small residual signal while difficult‐to‐code material is expected to
result in a larger one. The size of the residual of the previous frame can be used to
estimate the code‐ability of the current frame in the video sequence. Due to temporal
redundancy this is a good estimate, except for the first frame of every scene (following a
scene change).
The change in m, namely the number of projections to perform for each image, is associated
with a complete frame. On the other hand, adaptivity of q, namely the number of candidate
macroblocks for which the SAD value is calculated, is applied at the macroblock level, with
q varying as a function of the code‐ability of each macroblock. Changing the value of m can
use a simple frame‐based estimate of code‐ability. Changing the value of q, on the other hand,
should use a more accurate, macroblock‐dependent estimate due to its spatial locality.
Changing the value of m has two disadvantages. Since computation of the lower bound
requires both current and previous image projections, changing m takes effect with a delay of
one frame. In addition, for the same reason, raising m raises not only time but also
memory complexity. Adaptivity of q does not have these two disadvantages.
In the adaptive results given in the next chapter, the size of the residual of the previous frame
is used to estimate the coding‐difficulty of the current frame. This estimate is used to
adaptively control m.
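This control strategy can be sketched as follows. The thresholds and candidate m values below are illustrative placeholders only; the experiments use several threshold configurations, and none of them is fixed here:

```python
def select_m(prev_mean_sad, thresholds=(500.0, 1500.0, 3000.0), m_values=(3, 5, 7, 9)):
    """Choose the number of projections m for the current frame from the
    previous frame's mean SAD per macroblock (a simple code-ability
    estimate): easy-to-code frames get few projections, difficult-to-code
    frames get more.  Thresholds and m values are illustrative only."""
    for threshold, m in zip(thresholds, m_values):
        if prev_mean_sad < threshold:
            return m
    # Residual larger than all thresholds: hardest content, most projections.
    return m_values[-1]
```

Moving the thresholds trades time for accuracy, which is exactly the tradeoff explored in the next chapter.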
8. Adaptive FME‐GCK Results
Following are adaptive FME‐GCK simulation results with a constant q = 4 and with variable
values of the parameter m. The size of the residual of every frame was used as a simple
code‐ability estimate for its consecutive frame. This code‐ability estimate is used to control
the parameter m. As before, macroblocks are of size 16x16 and the search area is of size 15x15.
Figure 32 shows time, measured in operations per macroblock, vs. motion accuracy,
measured in mean SAD per macroblock, for a video sequence that is a concatenation of all
six QCIF video sequences that appear in Table 1. Results for different configurations of
thresholds for transitioning between values of m are plotted. It is shown that different time‐
accuracy tradeoffs can be obtained according to the threshold selection. In all configurations,
more projections are performed for more difficult‐to‐code scenes and fewer projections are
performed for easier‐to‐code video scenes. Through this adaptivity, the mean SAD for the
concatenated video sequence is reduced. One adaptive FME‐GCK configuration shown in
Figure 32 has computational complexity similar to that of diamond search, yet outperforms it.
Figure 33 shows the same time vs. motion-accuracy plot for a video sequence that is a
concatenation of all six CIF video sequences that appear in Table 1. Results are again
plotted for different configurations of the thresholds used for transitioning between numbers
of projections. As with QCIF, an adaptive FME-GCK configuration with a computational
complexity similar to that of diamond search outperforms it. We conclude that if the
thresholds of the adaptive FME-GCK are appropriately selected, it significantly outperforms
diamond search on average.
It should be noted that a simple residual-based code-ability estimate was used to produce
Figure 32 and Figure 33. A more sophisticated estimate is expected to improve adaptive
FME-GCK performance. Such an estimate could also be used to adaptively control the number of
candidate macroblocks, further improving FME-GCK performance.
[Figure: mean SAD per macroblock vs. operations per macroblock; curves for diamond search and adaptive FME-GCK]
Figure 32: Adaptive FME-GCK results (QCIF resolution). Results are given for a concatenation of six QCIF video sequences. Different time-accuracy tradeoffs can be obtained by threshold selection. For the same computational complexity, adaptive FME-GCK outperforms diamond search.
[Figure: mean SAD per macroblock vs. operations per macroblock; curves for diamond search and adaptive FME-GCK]
Figure 33: Adaptive FME-GCK results (CIF resolution). Results are given for a concatenation of six CIF video sequences. Different time-accuracy tradeoffs can be obtained by threshold selection. For the same computational complexity, adaptive FME-GCK outperforms diamond search.
9. Conclusion
In this thesis, a novel fast block motion estimation algorithm called FME-GCK has been
presented. FME‐GCK uses an efficient projection framework which bounds the distance
between a template block and candidate blocks using highly efficient filter kernels.
Candidate regions that are distant from the template macroblock are quickly rejected using
a rapid computation of lower bounds. For the few remaining candidate blocks, the SAD
distortion measure is used.
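The rejection scheme summarized above can be sketched as follows. This is a minimal skeleton, not the thesis's implementation: the function name and the callback-based interface are assumptions, and in FME-GCK itself the lower bounds come from Gray-Code Kernel projections rather than from an arbitrary function:

```python
def fme_gck_skeleton(candidates, lower_bound_fn, sad_fn, num_retained):
    """Rank candidate blocks by a cheap projection-based lower bound on
    their SAD distance from the template, keep only the `num_retained`
    most promising candidates, and compute the exact (expensive) SAD
    only for those survivors."""
    # Candidates whose lower bound is large are distant from the
    # template and are rejected without ever computing their SAD.
    ranked = sorted(candidates, key=lower_bound_fn)
    survivors = ranked[:num_retained]
    # Exact SAD is evaluated only for the few remaining candidates.
    return min(survivors, key=sad_fn)
```

Because the bound is a true lower bound on the SAD, increasing `num_retained` can only move the result toward the full-search optimum, which is the convergence property noted below.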
The FME‐GCK algorithm enables flexibility in the tradeoff between coding efficiency and
computational complexity by allowing adaptivity of the motion estimation process based on
image content and complexity limitations. The algorithm's results are guaranteed to converge
to the optimal (full-search) results as the allowed computation increases.
When tuned to a computational complexity equal to that of three-step search or of diamond
search, and when its adaptivity parameters are appropriately selected, the FME-GCK algorithm
significantly outperforms both. In addition, FME-GCK requires only integer arithmetic and
sequential memory access, so it is appropriate for embedded systems or for any other
application, provided the constraints on memory complexity are not very tight.
Bibliography
[1] D. Salomon, Data Compression: The Complete Reference, 4th ed. London: Springer, 2007.
[2] Y. Q. Shi and H. Sun, Image and Video Compression for Multimedia Engineering: Fundamentals, Algorithms, and Standards. Boca Raton, Fla: CRC Press, 1999.
[3] I. E. G. Richardson, Video Codec Design: Developing Image and Video Compression Systems. Chichester: Wiley, 2002.
[4] I. E. G. Richardson, H.264 and MPEG‐4 Video Compression: Video Coding for Next Generation Multimedia. Chichester ; Hoboken, NJ: Wiley, 2003.
[5] M. Ghanbari, Standard Codecs: Image Compression to Advanced Video Coding. London: Institution of Electrical Engineers, 2003.
[6] R. Schafer and T. Sikora, "Digital Video Coding Standards and Their Role in Video Communications," Proceedings of the IEEE, vol. 83 (6), pp. 907‐924, 1995.
[7] "Information Technology – Coding of Moving Pictures and Associated Audio for Digital Storage Media at Up to About 1.5 Mbit/s ‐ Part 2: Video," ISO/IEC 11172‐2 (MPEG‐1 Video), 1993.
[8] "Information Technology ‐ Generic Coding of Moving Pictures and Associated Audio Information: Video ": ISO/IEC 13818‐2 and ITU‐T Rec. H.262 (MPEG‐2 Video) 1995.
[9] "Information Technology ‐ Coding of Audio Visual Objects ‐ Part 2: Visual," ISO/IEC 14496‐2 (MPEG‐4 Video), 1999.
[10] "Video Codec for Audiovisual Services at p x 64 Kbit/s," ITU‐T Recommendation H.261, 1993.
[11] "Video Coding for Low Bit Rate Communication," ITU‐T Recommendation H.263, 1998.
[12] "Advanced Video Coding for Generic Audiovisual Services," ITU‐T Recommendation H.264 and ISO/IEC 14496‐10 AVC, 2003.
[13] "H.264/AVC Reference Software ver. 11.1," Joint Video Team (JVT) of ISO/IEC MPEG & ITU‐T VCEG, http://iphome.hhi.de/suehring/tml/, August 2006.
[14] K.‐P. Lim, G. Sullivan, and T. Wiegand, "Text Description of Joint Model Reference Encoding Methods and Decoding Concealment Methods," Joint Video Team (JVT) of ISO/IEC MPEG & ITU‐T VCEG Doc. JVT‐X101, July 2007.
[15] Y.‐W. Huang, C.‐Y. Chen, C.‐H. Tsai, C.‐F. Shen, and L.‐G. Chen, "Survey on Block Matching Motion Estimation Algorithms and Architectures with New Results," Journal of VLSI Signal Processing, vol. 42 (3), pp. 297–320, 2006.
[16] J. R. Jain and A. K. Jain, "Displacement Measurement and Its Application in Interframe Image Coding," IEEE Transactions on Communications, vol. 29 (12), pp. 1799–1808, 1981.
[17] T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishiguro, "Motion‐Compensated Interframe Coding for Video Conferencing," Proceedings of the National Telecommunications Conference (NTC'81), pp. G5.3.1‐5, 1981.
[18] L. M. Po and W. C. Ma, "A Novel Four‐step Search Algorithm for Fast Block Motion Estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 6 (3), pp. 313–7, 1996.
[19] M. Ghanbari, "The Cross‐Search Algorithm for Motion Estimation," IEEE Transactions on Communications, vol. 38 (7), pp. 950–3, 1990.
[20] S. Zhu and K.‐K. Ma, "A New Diamond Search Algorithm for Fast Block Matching Motion Estimation," Proceedings of IEEE International Conference on Information, Communications, and Signal Processing (ICICS’97), pp. 292–296, 1997.
[21] J. Y. Tham, S. Ranganath, M. Ranganath, and A. A. Kassim, "A Novel Unrestricted Center‐Biased Diamond Search Algorithm for Block Motion Estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 8 (4), pp. 369‐377, 1998.
[22] C. H. Hsieh, P. C. Lu, J. S. Shyn, and E. H. Lu, "Motion Estimation Algorithm Using Interblock Correlation," IEEE Electronic Letters, vol. 26 (5), pp. 276–277, 1990.
[23] J. Chalidabhongse and C. C. J. Kuo, "Fast Motion Vector Estimation Using Multiresolution‐Spatio‐Temporal Correlations," IEEE Transactions on Circuits and Systems for Video Technology, vol. 7 (3), pp. 477–488, 1997.
[24] A. Zaccarin and B. Liu, "Fast Algorithms for Block Motion Estimation," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'92), pp. 449–452, 1992.
[25] B. Liu and A. Zaccarin, "New Fast Algorithms for the Estimation of Block Motion Vectors," IEEE Transactions on Circuits and Systems for Video Technology, vol. 3 (2), pp. 148–157, 1993.
[26] D. Tzovaras, M. G. Strintzis, and H. Sahinoglou, "Evaluation of Multiresolution Block Matching Techniques for Motion and Disparity Estimation," Signal Processing: Image Communication, vol. 6, pp. 56–67, 1994.
[27] W. Li and E. Salari, "Successive Elimination Algorithm for Motion Estimation," IEEE Transactions on Image Processing, vol. 3 (1), pp. 105–107, 1995.
[28] C.‐H. Lee and L.‐H. Chen, "A Fast Motion Estimation Algorithm Based on the Block Sum Pyramid," IEEE Transactions on Image Processing, vol. 6 (11), pp. 1587‐91, 1997.
[29] Y.‐S. Chen, Y.‐P. Hung, and C.‐S. Fuh, "A Fast Block Matching Algorithm Based on the Winner‐Update Strategy," IEEE Transactions on Image Processing, vol. 10 (8), pp. 1212‐22, 2001.
[30] S.‐Y. Choi and S.‐I. Chae, "Hierarchical Motion Estimation in Hadamard Transform Domain," Electronics Letters, vol. 35 (25), pp. 2187‐8, 1999.
[31] M. Brunig and B. Menser, "A Fast Exhaustive Search Algorithm Using Orthogonal Transforms," Proceedings of the 7th International Workshop on Systems, Signals, and Image Processing (IWSSIP'2000), pp. 111‐4, 2000.
[32] S.‐W. Liu, S.‐D. Wei, and S.‐H. Lai, "Winner Update on Walsh‐Hadamard Domain for Fast Motion Estimation," Proceedings of the 18th International Conference on Pattern Recognition (ICPR'06), vol. 3, pp. 794‐797, 2006.
[33] Y. Hel‐Or and H. Hel‐Or, "Real‐Time Pattern Matching Using Projection Kernels," Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV'03), pp. 1486‐93, 2003.
[34] Y. Hel‐Or and H. Hel‐Or, "Real‐Time Pattern Matching Using Projection Kernels," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27 (9), pp. 1430‐ 1445, 2005.
[35] K. G. Beauchamp, Applications of Walsh and Related Functions. London: Academic Press, 1984.
[36] N. Li, C.‐M. Mak, and W.‐K. Cham, "Fast Block Matching Algorithm in Walsh Hadamard Domain," Proceedings of the 7th Asian Conference on Computer Vision (ACCV'06), pp. 712‐721, 2006.
[37] G. Ben‐Artzi, H. Hel‐Or, and Y. Hel‐Or, "Filtering with Gray‐Code Kernels," Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04), vol. 1, pp. 556‐9, 2004.
[38] G. Ben‐Artzi, H. Hel‐Or, and Y. Hel‐Or, "The Gray‐Code Filter Kernels," IEEE Transactions on Pattern Analysis and Machine Intelligence., vol. 29 (3), pp. 382‐393, 2007.
[39] M. Gardner, "The Binary Gray Code," in Knotted Doughnuts and Other Mathematical Entertainments, W. H. Freeman, Ed., 1986, pp. 11‐27.
[40] P. Simard, L. Bottou, P. Haffner, and Y. LeCun, "Boxlets: A Fast Convolution Algorithm for Neural Networks and Signal Processing," Advances in Neural Information Processing Systems, 1999.
[41] D. Knuth, The Art of Computer Programming, 3rd ed. vol. 3: Sorting and Searching. Redwood City, CA: Addison‐Wesley, 1997.
[42] T. Tan, G. Sullivan, and T. Wedi, "Recommended Simulation Common Conditions for Coding Efficiency Experiments," ITU‐T Q.6/SG16, Document VCEG‐AA10d1, October 2005.
[43] G. Bjontegaard, "Calculation of average PSNR differences between RD‐curves," ITU‐T Q.6/SG16, Document VCEG‐M33, April 2001.
[44] "MPEG‐2 Video Test Model 5," ISO/IEC JTC1/SC29/WG11 Document 93/457, 1993.
[45] "Rate Control for Low‐delay Video Communications [H.263 TM8 rate control]," ITU‐T Q6/SG16 Document Q15‐A‐20, 1997.