
NUMBER THEORETIC TRANSFORM-BASED BLOCK MOTION ESTIMATION

Tuukka Toivonen


Toivonen T. (2002) Number Theoretic Transform-Based Block Motion Estimation. Department of Electrical Engineering, University of Oulu, Oulu, Finland. Diploma Thesis, 85 p.

ABSTRACT

A new fast full search algorithm for block motion estimation is presented, which is based on the convolution theorem and number theoretic transforms. The algorithm applies the sum of squared differences (SSD) criterion, and the encoded video quality is equivalent to or even better than what is achieved with conventional methods, but the algorithm has low theoretical complexity. The algorithm is implemented in an H.263 software video encoder. However, efficient implementation on general purpose microprocessors is difficult, and the best advantage seems to be achieved with application specific integrated circuits (ASIC) due to the congruent arithmetic and the regularity of the data flow. Furthermore, a review of the currently known fast full search block motion estimation algorithms, Partial Distortion Elimination (PDE), the Successive Elimination Algorithm (SEA), and others, is given. These algorithms are suitable for use with existing video coding standards, such as MPEG or H.263.

Keywords: Video coding, Partial Distortion Elimination, Successive Elimination Algorithm, cross correlation, block matching, full search, Winograd Fourier Transform Algorithm.


Toivonen T. (2002) Lukuteoreettiseen muunnokseen perustuva lohkopohjainen liikkeenestimointi. University of Oulu, Department of Electrical Engineering. Diploma Thesis, 85 p.

TIIVISTELMÄ

A new fast full search algorithm is presented, based on the convolution theorem and number theoretic transforms. The algorithm applies the sum of squared differences (SSD) criterion, and the quality of the coded video is equivalent to or even better than what is achieved with conventional methods, yet the algorithm has low theoretical complexity. The algorithm has been implemented in an H.263 software video encoder. However, an efficient implementation on general purpose microprocessors is difficult, and the greatest benefit is likely to be obtained with application specific integrated circuits (ASIC), owing to the congruent arithmetic and the regularity of the data flow. In addition, an overview is given of the currently known fast full search block motion estimation methods: Partial Distortion Elimination (PDE), the Successive Elimination Algorithm (SEA), and others. These algorithms are suitable for use with existing video coding standards such as MPEG or H.263.

Keywords: Video coding, partial distortion elimination, successive elimination algorithm, cross correlation, block matching, full search, Winograd Fourier transform algorithm.


CONTENTS

ABSTRACT
TIIVISTELMÄ
CONTENTS
PREFACE
LIST OF SYMBOLS AND ABBREVIATIONS
1. INTRODUCTION
   1.1. Standards
   1.2. Video Compression
2. MOTION ESTIMATION
   2.1. Models
   2.2. Criteria
   2.3. Algorithms
        2.3.1. Exhaustive Search Algorithm (ESA)
        2.3.2. Three Step Search (TSS)
        2.3.3. Partial Distortion Elimination (PDE)
        2.3.4. Successive Elimination Algorithm (SEA)
        2.3.5. Multilevel Successive Elimination Algorithm (MSEA)
        2.3.6. Winner-Update Strategy
        2.3.7. Category-Based Block Motion Estimation Algorithm (CBME)
        2.3.8. Fast Convolution Algorithms
3. FAST COMPUTATION OF NORMS
   3.1. Differential Calculation
   3.2. Norm Pyramid Calculation
4. NUMBER THEORETIC TRANSFORMS
   4.1. Computing Correlation via 48-point WNTTA
        4.1.1. Winograd Short Length Algorithms
        4.1.2. Longer Length Transforms
        4.1.3. Practical Implementation
   4.2. Computing Correlation via 32-point Transforms
        4.2.1. The Procedure
        4.2.2. Radix-2 Algorithms
        4.2.3. Other Algorithms
   4.3. Reducing Congruent Reductions
        4.3.1. Fast Computation
        4.3.2. Multiplying by ±2^n (mod 2^24 + 1)
        4.3.3. Reduction Elimination
        4.3.4. Lookup Tables
5. RESULTS
6. DISCUSSION
7. CONCLUSIONS
8. REFERENCES
APPENDICES
A. SEA INEQUALITIES
B. INVERSIBILITY OF A NTT
C. THE EUCLIDEAN ALGORITHM
D. SOME SHORT LENGTH WINOGRAD FOURIER TRANSFORM ALGORITHMS
   D.1. N = 3
   D.2. N = 8
   D.3. N = 16
E. WFTA INDEX PERMUTATION


PREFACE

This diploma thesis was completed in the Information Processing Laboratory of the Department of Electrical Engineering, University of Oulu.

The purpose is to present a new full search motion estimation algorithm, based on number theoretic transforms and suitable for use with many existing video coding standards. The development started in fall 2000, and it was based on ideas invented by Professor Janne Heikkilä.

The work was funded as a part of the Image Sequence Analysis Techniques for Emerging Applications (ISAAC) project by several enterprises: Elektrobit, Hantro Products, Instrumentointi, Jutel, and Nokia Mobile Phones. The major part of the funds was contributed by the National Technology Agency.

I am grateful to Professors Janne Heikkilä and Olli Silvén, who were the supervisors of this thesis. They encouraged me to eventually finish the thesis and my undergraduate studies.

Oulu, 19th March 2002

Tuukka Toivonen


LIST OF SYMBOLS AND ABBREVIATIONS

⊗ Kronecker (or tensor or direct) product

⊕ Bitwise exclusive or (XOR) operation

∧  Bitwise logical AND operation

⌊x⌋  Greatest integer not larger than x, that is, the integer part of x

→v  Vector v, that is, a column matrix

(y,x)  Vector, that is, the column matrix [y x]^T

Mh, Mw  Height (number of rows) and width (number of columns) of the matrix M, respectively

Mhw  The height and width of the square matrix M

M(y,x)  An element of the matrix M; y and x are zero-based row and column indices, respectively

M^T  Transpose of the matrix M

|M|  Determinant of the matrix M

‖M‖1  L1-norm: the sum of the absolute values of the elements of the matrix M

‖M‖2  L2-norm: the square root of the sum of the squared absolute values of the elements of the matrix M

Bt  The current block, whose motion vector is estimated

Cτ  A candidate block, which is compared against the current block Bt

Ft  The current frame in an image sequence

Fτ  The reference frame in an image sequence, used for motion compensation

Sτ  The search area (window), which contains all the candidate blocks Cτ

→b  The best motion vector estimate so far, at some step, before the motion estimation algorithm has completed

→c  The motion vector which is currently being tested in a motion estimation algorithm

→m  The overall best motion vector estimate, based on some criterion

rh, rw  Maximum possible range of a motion vector, −rh ≤ my ≤ rh

sh, sw  Range length of a motion vector, sh = 2rh + 1


ASIC Application Specific Integrated Circuit

BSPA Block Sum Pyramid Algorithm (same as MSEA)

CBME Category-Based Block Motion Estimation Algorithm

CIF Common Intermediate Format

Codec A system consisting of both an encoder and a decoder

ESA Exhaustive Search Algorithm

FFT Fast Fourier Transform

IEC International Electrotechnical Commission

ISO International Organization for Standardization

ITU International Telecommunication Union (formerly CCITT)

MAD Mean Absolute Difference

MAE Mean Absolute Error (same as MAD)

MSE Mean Squared Error

MSEA Multilevel SEA (same as BSPA)

NTT Number Theoretic Transform

PDE Partial Distortion Elimination

Pixel Picture element

SEA Successive Elimination Algorithm

SSD Sum of Squared Differences

SAD Sum of Absolute Differences

TSS Three Step Search

VLSI Very Large Scale Integration

WFTA Winograd Fourier Transform Algorithm

WNTTA Winograd Number Theoretic Transform Algorithm


1. INTRODUCTION

The demand for communication with moving video is rapidly increasing. Video is required in many remote video conferencing systems, and it is expected that in the near future cellular telephone systems will send and receive real-time video.

A typical system, which relays video over a low bandwidth transmission channel, is shown in Figure 1. The multimedia terminals could be, for example, cellular phones or handheld computers. Both terminals contain compatible codecs: a video encoder and decoder pair, whose purpose is to compress the video stream to be transmitted over a slow link, such as a radio channel or the Internet. Often a bidirectional connection is desired, where both terminals transmit and receive video, and thus they both need an encoder and a decoder running in real-time.

Figure 1. Wireless video conferencing application: two multimedia terminals, each containing a video encoder, a video decoder, and a channel coder, connected over a low bandwidth channel.

A major problem with video is the high bandwidth requirement. A typical system needs to send dozens of individual frames (pictures) per second to create the illusion of a moving picture. For this reason, several methods and standards for compressing the video have been developed. Each individual frame is coded so that redundancy is removed. Furthermore, between consecutive frames, a great deal of redundancy is removed with a motion compensation system. A simplified example diagram of a video encoder is presented in Figure 2a, and the corresponding decoder in Figure 2b.

1.1. Standards

Both terminals in Figure 1 need to use a video decoder that is capable of decoding the video stream produced by the other terminal. Since there are endless ways to compress and encode data, and many terminal vendors may each have a unique idea of data compression, common standards are needed that rigidly define how the video is coded in the transmission channel.

Figure 2. Typical video codec: (a) encoder, (b) decoder.

There are mainly two standard series in common use, both having several versions. The International Telecommunication Union (ITU) started developing Recommendation H.261 in 1984, and the effort was finished in 1990 when it was approved. The standard is aimed at video conferencing and video phone services over the integrated services digital network (ISDN) with a bit rate that is a multiple of 64 kilobits per second.

In 1996 a revised version of the standard, Recommendation H.263, was finalized. It adopts some new techniques for compression, such as half-pixel accuracy and optionally a smaller block size for motion compensation. As a result it gives better video quality than H.261. Recommendation H.261 divides each frame into 16×16 picture element (pixel) blocks for backward motion compensation, and H.263 can also take advantage of 8×8 pixel blocks. A new ITU standard in development is called H.26L, and it allows motion compensation with a greater variety of block sizes. [1, 2]

MPEG-1 is a video compression standard developed jointly by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). The development was started in 1988 and finished in 1990, and it was accepted as a standard in 1992. MPEG-1 can be used at higher bit rates than H.261, at about 1.5 megabits per second, which is suitable for storing the compressed video stream on compact disks or for use with interactive multimedia systems. The standard also covers audio associated with the video.

For motion estimation, MPEG-1 uses the same block size as H.261, 16×16 pixels, but in addition to backward compensation, MPEG can also apply bidirectional motion compensation. A revised standard, MPEG-2, was approved in 1994. Its target is higher bit rates than MPEG-1, from 2 to 30 megabits per second, where applications may be digital television or video services through a fast computer network. The latest ISO/IEC video coding standard is MPEG-4, which was approved in the beginning of 1999. It is targeted at very low bit rates (8–32 kilobits per second) suitable for e.g. mobile video phones. MPEG-4 can also be used with higher bit rates, up to 4 megabits per second. [1, 2]

1.2. Video Compression

In a typical standard, such as MPEG-1 or H.261, the first video encoding stage performs motion compensation for each frame (Figure 2a). It subtracts the motion-compensated contents of the previous frame from the current frame, and the resulting residue is easier to compress.

Then the frame is subdivided into blocks, each typically 8 by 8 pixels, and the discrete cosine transform is applied to the blocks. This shifts most of the block energy to low frequencies, while the high frequencies have almost no energy at all. The high frequencies can therefore be left out or represented coarsely without affecting the image quality much: this happens in the quantization stage.

In the last stage, the frame is coded with a variable length code, which is a lossless compression method. Frequently appearing values are replaced with short code words, while rare values are replaced with longer codes. Since the compression in the last stage is lossless, the decoder will receive an exact copy of the quantized frame (in the absence of transmission errors).

Most of the currently used motion compensation systems employ block-based motion estimation. Typically the blocks used for motion compensation are 16 by 16 pixels and are called macroblocks. Since the macroblocks are usually a different size from the transformed blocks, they have different names for clarity. A macroblock is assumed to be similar to another macroblock in the previously coded frame, but possibly in a translated position.

The current macroblock (in the current frame) is compared to multiple candidate macroblocks in the reference frame (which is the previously transmitted frame if backward motion estimation is used). The best matched candidate macroblock is selected, as in Figure 3. Then the spatial distance, i.e. the motion vector, between the selected candidate macroblock and the current macroblock is computed, encoded with the variable length code, and transmitted to the decoder. The decoder (as well as the encoder in a closed loop) uses the motion vector for motion compensation.

From here on, a macroblock is called simply a block, and discrete cosine transformed blocks are not considered further.

Figure 3. Motion estimation: the current block Bt, the search area Sτ, the best matched candidate block Cτ, and the resulting motion vector.

Usually the most demanding part of a video encoder is the motion estimation [9]. Especially at low bit rates, estimation algorithms giving the best possible video quality are desired. For example, the exhaustive search algorithm (ESA) simply compares the current block to all candidate blocks (within a limited distance), one by one, to find the best match. The problem with this type of motion estimation is the high computational requirement. Each comparison of two blocks may require nearly a thousand arithmetic operations, and to obtain high quality coded video within a fixed bandwidth, the current block has to be compared to more than a thousand blocks in the previous frame.

A device performing these operations in real-time will be very expensive and consume a large amount of power. This cannot be allowed in low-cost mobile consumer devices. Therefore, many faster alternatives have been developed, usually by sacrificing estimation accuracy and thus degrading the video quality.

In this diploma thesis many previously presented motion estimation algorithms are reviewed, with the emphasis on full search algorithms, which do not degrade the video quality as compared to ESA. Then a new fast full search algorithm is presented, which is based on fast correlation via number theoretic transforms. Good overviews of other motion estimation techniques are given in [2, 3, 4, 5, 6, 8].


2. MOTION ESTIMATION

2.1. Models

A video sequence can be considered to be a discretized three-dimensional projection of the real four-dimensional continuous space-time. The objects in the real world may move, rotate, or deform. The movements cannot be observed directly; instead, only the light reflected from the object surfaces and projected onto an image is observed. The light source can be moving, and the reflected light varies depending on the angle between a surface and a light source. There may be objects occluding the light rays and casting shadows. The objects may be transparent (so that several independent motions could be observed at the same location of an image), or there might be fog, rain, or snow blurring the observed image. The discretization introduces noise into the video sequence, from which the video encoder makes its motion estimates. There may also be noise in the image capture device (such as a video camera) or in the electrical transmission lines. A perfect motion model would take all these factors into account and find the motion that has the maximum likelihood given the observed video sequence.

However, no such model is in use, and a number of simplifications must be made instead. In this thesis, only the projected two-dimensional motion in time is considered. The mapping of the motion back into three-dimensional space is ignored. It is also assumed that the light emitted from a surface point to the camera is constant (only ambient lighting is taken into account). This yields the optical flow model, in which all motion is based only on the variation of intensities. For example, a moving uniform, flat surface would not be considered motion in this model, because the motion would not be detectable inside the surface (only at the surface edges, if those were visible). On the other hand, varying lighting could generate motion in this model, even if real objects did not move in the image.

In a strict sense, optical flow is the instantaneous velocity of the image intensity pattern. It corresponds to a continuous observation model, and assumes that image intensity is constant along a projected point trajectory. For processing a video sequence in a digital device, only the displacement of intensities between two consecutive frames is considered. This discretized model is called a correspondence field. [2]

The optical flow model can be expressed with the following differential equation:

$$\frac{\partial F_t(y,x)}{\partial y}\, m_y + \frac{\partial F_t(y,x)}{\partial x}\, m_x + \frac{\partial F_t(y,x)}{\partial t} = 0 \qquad (1)$$

where (my, mx) is the instantaneous velocity of a point and the image Ft(y,x) is continuous both spatially and in time to allow differentiation. However, there are two unknowns, my and mx, and only one equation, hence the equation is underconstrained. Many different additional constraints have been proposed, such as using the image color information. The equation can be approximated in the discrete case, which allows estimating the correspondence field. [3, 5]

There are two basic types of motion estimation algorithms: they can either find the motion of each individual pixel, or they can use a parametric model that describes the motion of a group of pixels in a region of support with only a few parameters. In the former case a dense motion field is generated, where a two-component motion vector is assigned to each pixel. This type of motion estimation can be used in object tracking, video surveillance, and computer vision. However, it is usually not appropriate for a motion compensation system in a video codec, because the transmission of all the motion vectors requires too much space in a communications line or storage device. [2]

In the latter case, only a few parameters are computed, from which the motion vectors for a region of support are derived. There are many different motion models, which define exactly how the parameters are used for computing the pixel motion vectors, and they have a varying number of parameters. Some of these are shown in Table 1. For each pixel Ft(yt, xt) in a frame in the region of support, the corresponding pixel in the previous frame is computed from

$$\begin{bmatrix} y_\tau \\ x_\tau \end{bmatrix} = \begin{bmatrix} y_t \\ x_t \end{bmatrix} - \vec{m}(y_t, x_t) \qquad (2)$$

where →m = (my, mx) is the motion vector for the particular pixel. Sometimes fractional values are used for the motion vector components, in which case interpolation is used to calculate a pixel intensity between sample points.

Table 1. Motion models (reference coordinates (yτ, xτ) as a function of (yt, xt))

  Translational (2 parameters):
  $(y_\tau, x_\tau) = (y_t + a_1,\; x_t + a_2)$

  Rotation/translation (3 parameters):
  $(y_\tau, x_\tau) = (y_t \cos a_3 - x_t \sin a_3 + a_1,\; y_t \sin a_3 + x_t \cos a_3 + a_2)$

  Affine (6 parameters):
  $(y_\tau, x_\tau) = (a_3 y_t + a_4 x_t + a_1,\; a_5 y_t + a_6 x_t + a_2)$

  Projective linear (8 parameters):
  $(y_\tau, x_\tau) = \left( \dfrac{a_1 + a_2 y_t + a_3 x_t}{1 + a_7 y_t + a_8 x_t},\; \dfrac{a_4 + a_5 y_t + a_6 x_t}{1 + a_7 y_t + a_8 x_t} \right)$

  Quadratic (12 parameters):
  $(y_\tau, x_\tau) = (a_1 + a_3 y_t + a_5 x_t + a_7 y_t^2 + a_9 y_t x_t + a_{11} x_t^2,\; a_2 + a_4 y_t + a_6 x_t + a_8 y_t^2 + a_{10} y_t x_t + a_{12} x_t^2)$

The translational model is the simplest: it assumes that the objects are two-dimensional and may only move vertically and horizontally. When the option to rotate is added, the obtained model can describe any two-dimensional rigid object motion. However, in reality a two-dimensional picture is only a projection of a three-dimensional space onto two dimensions.

The affine model can describe any motion of a rigid planar object in three dimensions under an orthographic projection. To describe the motion of a planar object under a perspective projection, the projective linear model can be used, which contains the affine model as a special case. When the object in the three-dimensional space is not planar but a parabolic surface, the quadratic model can describe its motion under an orthographic projection [3]. It can also describe the exact projected motion of a planar surface under a perspective projection [2].
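To make the mappings of Table 1 concrete, the following sketch (illustrative only; the function names and parameter layout are not from the thesis) computes the reference coordinate (yτ, xτ) of a pixel under the translational, rotation/translation, and affine models.

```python
import math

def translational(yt, xt, a1, a2):
    # Table 1, translational model: (y_tau, x_tau) = (yt + a1, xt + a2)
    return yt + a1, xt + a2

def rotation_translation(yt, xt, a1, a2, a3):
    # Table 1, rotation/translation model: rotate (yt, xt) by a3, then translate
    return (math.cos(a3) * yt - math.sin(a3) * xt + a1,
            math.sin(a3) * yt + math.cos(a3) * xt + a2)

def affine(yt, xt, a1, a2, a3, a4, a5, a6):
    # Table 1, affine model: [a3 a4; a5 a6] [yt; xt] + [a1; a2]
    return a3 * yt + a4 * xt + a1, a5 * yt + a6 * xt + a2
```

The affine mapping contains the translational one as the special case a3 = a6 = 1, a4 = a5 = 0.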

The different motion models are shown in Figure 4 (applied with a block-shaped region of support). Only the motion vectors at each corner of the region of support are shown for clarity: in reality a different motion vector is calculated for each pixel.

Figure 4. Motion models in backward motion estimation: (a) translational, (b) translational/rotational, (c) affine, (d) projective, (e) quadratic.

With all common video coding standards (see section 1.1) each frame in a video sequence is divided into multiple regions of support, for each of which the motion model parameters are computed. When the region of support may be arbitrarily shaped, the codec uses a region-based motion model. Arbitrary regions of support are not usually available, but their use is expected to increase in the future. For example, the MPEG-4 standard supports arbitrary regions. The other extreme is the global region of support, where the frame is not divided into smaller regions at all, but only one set of motion model parameters is estimated for the whole image. This could be suitable for canceling out the tremble of a portable video camera.

Most often the region of support is rectangular, and a block-based motion model is used: each frame is divided into equally sized square blocks, usually 16×16 pixels. As a slight generalization, some blocks may be subdivided to obtain a smaller region of support. Furthermore, usually a simple translational motion model is used, as in Figure 4a. This is the case with, for example, the H.261, H.263, H.26L, MPEG-1, and MPEG-2 standards, and usually also with MPEG-4. The advantage is that the parameters for a simple motion model along with a regularly shaped region of support are simple to estimate. Nevertheless, the smaller the block size is, the better this model can approximate more complex motion models.

Sometimes the motion parameters are constrained or shared between different regions of support. For example, motion vectors could be computed by interpolating them from vectors given at the four corners of a square block. This would require 4 vectors, each containing 2 components, and thus 8 parameters would be used. However, adjacent block corners can share the motion vectors, reducing the average number of parameters to 2 per block. This forces the motion field to be smooth, which is usually a valid assumption inside object boundaries. However, it causes problems when an object moves in front of another object: there is a motion discontinuity at the edge, which cannot be modeled. This can be avoided by subdividing the regions at the edges. In [3] a model is mentioned where the regions of support are formed from an adaptively subdivided triangular mesh. The motion vectors are interpolated from the three corners and shared between adjacent triangles.

A motion vector can describe the motion of a pixel in the region of support since the previous frame in the video sequence, as in Figure 4. In this case backward motion estimation is employed. In backward estimation causality is maintained, because no pictures newer than the current picture are needed. Alternatively, forward motion estimation can be performed, but this requires buffers for storing future frames, and the frames cannot be decoded in order. For example, H.261 uses only backward motion estimation, while MPEG-1 can do both forward and backward estimation simultaneously, using interpolation for computing an estimated pixel intensity from two motion vectors and two other frames.

Since the translational model with a block-based region of support is the most widely used model in video codecs, in the rest of this thesis the other models are not considered further, unless a method is generally applicable to all models. Backward motion estimation is assumed, although the presented algorithms can be used directly with forward or bidirectional motion estimation as well.


2.2. Criteria

In this section several criteria for block-based translational motion estimation are presented. The problem is to find the motion vector →m = (my, mx) for the square current block Bt(y,x) at time instance t, so that the error between the block Bt and the matching block Cτ at time instance τ is minimized. If backward motion estimation is employed, τ < t. In practice the range of the motion vector →m is limited, usually to much smaller than the whole frame size. For example, in an H.263 codec in the standard operation mode the vector components may be between −16 and 15.5. Since the current block is 16×16 pixels, this restricts the search area to a 47.5×47.5 pixel square block Sτ(y,x), or 48×48 pixels with integer dimensions. These blocks are represented in Figure 5: the problem reduces to finding the best match for the current block inside the search area.

Figure 5. Goal of a typical motion estimation algorithm: finding the best match for the 16×16 current block inside the 48×48 search area.

In the notation used in this thesis, the element at the upper-left corner of a matrix M is denoted as M(0,0), and the element at the lower-left corner as M(Mh−1, 0).

Usually each pixel in a video sequence has three components that represent the image color and intensity. Usually only the luminance (intensity) information is used for the motion estimation. Furthermore, the motion vector is initially estimated only to integer accuracy; the current block and the candidate blocks may later be interpolated to a higher resolution and the best motion vector estimate refined to fractional pixel accuracy, as in the H.263 standard.

Before a motion estimation algorithm can be formulated, we must define a criterion which measures the goodness of a motion estimate, or defines how the estimation error is calculated. In computer vision and visual surveillance systems the goal is to extract the motion of the three-dimensional objects that appear in the image. This is difficult since there are many disturbing factors, as described in section 2.1.

However, in video compression applications the goal is somewhat simpler: the redundancy in the image has to be removed to reduce the number of bits required to represent the video sequence. In this case it is sufficient to find a motion according to the optical flow model, and to remove the redundancy efficiently.

Several different criteria have been proposed; some try to give more robust motion estimates and a high visual image quality, while others try to reduce the computational load. A simple error criterion is the correlation between the two blocks:

$$\mathrm{COR}(c_y, c_x) = \sum_{y=0}^{B_h-1} \sum_{x=0}^{B_w-1} B_t(y,x)\, C_\tau^{(c_y,c_x)}(y,x) \qquad (3)$$


where (cy, cx) = →c is the current candidate motion vector and Cτ^(cy,cx) is a candidate block in position (cy, cx) relative to the current block. To estimate a motion vector, this criterion may be evaluated for selected candidate motion vectors, and the vector which gives the maximum correlation value is chosen.

Another basic criterion, SSD, usually yields very good, if not the best, results. It can be computed from

$$\mathrm{SSD}(c_y, c_x) = \sum_{y=0}^{B_h-1} \sum_{x=0}^{B_w-1} \left[ B_t(y,x) - C_\tau^{(c_y,c_x)}(y,x) \right]^2 \qquad (4)$$

$$= \left\| B_t - C_\tau^{(c_y,c_x)} \right\|_2^2 \qquad (5)$$

The equation is then minimized for a good motion estimate. SSD is also sometimes called (inaccurately) the mean squared error, MSE. It is actually the squared L2-norm of the difference of the current block and a candidate block in the search area.

The square in (4) requires a multiplication, which is often cumbersome to compute. Therefore in most practical systems the SAD criterion (also called mean absolute difference, MAD, or mean absolute error, MAE) is used:

$$\mathrm{SAD}(c_y, c_x) = \sum_{y=0}^{B_h-1} \sum_{x=0}^{B_w-1} \left| B_t(y,x) - C_\tau^{(c_y,c_x)}(y,x) \right| \qquad (6)$$

$$= \left\| B_t - C_\tau^{(c_y,c_x)} \right\|_1 \qquad (7)$$

This is the L1-norm of the difference of the blocks. There is very little difference in image quality between the SSD and SAD criteria: experiments have been published in [7, 11].
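Both criteria translate directly into code. The sketch below is a minimal illustration (not the thesis implementation), assuming the current block and the candidate block are given as NumPy integer arrays of equal size.

```python
import numpy as np

def ssd(block, cand):
    # Equation (4)/(5): sum of squared differences, the squared L2-norm
    d = block.astype(np.int64) - cand.astype(np.int64)
    return int((d * d).sum())

def sad(block, cand):
    # Equation (6)/(7): sum of absolute differences, the L1-norm
    d = block.astype(np.int64) - cand.astype(np.int64)
    return int(np.abs(d).sum())
```

A candidate block at offset (cy, cx) would be a 16×16 slice of the search area, e.g. search_area[cy:cy+16, cx:cx+16].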

Sometimes even the subtraction is considered too expensive to compute. Some researchers [11, 12] have proposed bit-plane matching (BPM):

$$\mathrm{BPM}(c_y, c_x) = \sum_{y=0}^{B_h-1} \sum_{x=0}^{B_w-1} T\!\left[ B_t(y,x) \right] \oplus T\!\left[ C_\tau^{(c_y,c_x)}(y,x) \right] \qquad (8)$$

where ⊕ is the exclusive or (XOR) operation and T is a function that transforms the luminance value of a pixel into a single bit, either zero or one. This is very fast on motion estimators implemented as application specific integrated circuits (ASIC), because the XOR operation can be implemented efficiently. In the Feng method [11] a simple threshold function is used:

$$T(a) = \begin{cases} 0, & a \ge M \\ 1, & a < M \end{cases} \qquad (9)$$

where M is the average image intensity. An alternative way to convert the image into two levels is discussed in [12], where a copy of the frame is band-pass filtered and then a one-bit conversion is performed according to a pixelwise comparison with the original image.
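As a sketch of the bit-plane matching idea in (8) and (9) (illustrative only; the threshold M is passed in explicitly, e.g. the average intensity of the frame as in the Feng method):

```python
import numpy as np

def bit_plane(block, m):
    # Equation (9): one bit per pixel, 1 where the pixel is below the threshold m
    return (block < m).astype(np.uint8)

def bpm(plane_b, plane_c):
    # Equation (8): count the positions where the two bit planes differ (XOR)
    return int((plane_b ^ plane_c).sum())
```

The FBPM criterion of (10) below would simply add two such sums computed with two different thresholds.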

Criterion (8) yields substantially lower quality than SSD or SAD. To overcome this, the criterion can be computed as a sum of two different BPM criteria with thresholds M1 and M2, where the average is taken over two different regions of a frame (for example, the whole frame Ft and a single block Bt) [11]:

$$\mathrm{FBPM}(c_y, c_x) = \mathrm{BPM}_1(c_y, c_x) + \mathrm{BPM}_2(c_y, c_x) \qquad (10)$$

This is called feature bit-plane matching (FBPM), and it is much better than simple BPM, but it still yields lower image quality in a motion compensation system than SAD or SSD.

According to [3], the SSD criterion (4) is sensitive to outliers, and the median method

$$\mathrm{MED}(c_y, c_x) = \mathrm{med}\left\{ \left[ B_t(y,x) - C_\tau^{(c_y,c_x)}(y,x) \right]^2 \;\middle|\; 0 \le y < B_h,\; 0 \le x < B_w \right\} \qquad (11)$$

is suggested, in which the median of the squared pixel intensity differences in the region of support is selected as the criterion. An even more robust criterion [5] is the Lorentzian function

$$\mathrm{LOR}(c_y, c_x) = \sum_{y=0}^{B_h-1} \sum_{x=0}^{B_w-1} \ln\!\left( 1 + \frac{\left( B_t(y,x) - C_\tau^{(c_y,c_x)}(y,x) \right)^2}{2\omega^2} \right) \qquad (12)$$

where ω adjusts the cost of increasing differences. This cancels out the effect of outliers and thus produces more reliable estimates.

Another criterion, the matching pixel count (or PDC, pixel difference classification [11]), is presented e.g. in [8]:

$$\mathrm{MPC}(c_y, c_x) = \sum_{y=0}^{B_h-1} \sum_{x=0}^{B_w-1} T\!\left[ B_t(y,x),\, C_\tau^{(c_y,c_x)}(y,x) \right] \qquad (13)$$

where

$$T(a,b) = \begin{cases} 1, & |a-b| \le \alpha \\ 0, & |a-b| > \alpha \end{cases} \qquad (14)$$

where α is a predetermined threshold. The criterion (13) counts the number of pixels whose values are close to each other. It is then maximized to obtain the motion estimate.

Many of the error criteria, such as (4), (6), and (12), can be generalized to

$$\Phi(c_y, c_x) = \sum_{y=0}^{B_h-1} \sum_{x=0}^{B_w-1} \phi\!\left( \left| B_t(y,x) - C_\tau^{(c_y,c_x)}(y,x) \right| \right) \qquad (15)$$

where φ(ε) is the criterion evaluation function for the pixel absolute value difference ε. The evaluation function is plotted for several criteria in Figure 6.

Figure 6. Different error criteria: the evaluation function φ(ε) of the SSD, SAD, and Lorentzian (LOR) criteria.

There is also a relatively new class of motion estimation criteria, developed in [13]. Most criteria try to minimize the displaced pixel difference in some way or another; however, to achieve the best encoded video quality within a given bit rate, the bits used for transmitting the motion vectors should also be minimized. The following criterion accounts not only for the mean squared error of the motion compensation, but also for the bits used for transmission:

$$\mathrm{CRD}(c_y, c_x) = B(c_y, c_x) + \lambda D(c_y, c_x) \qquad (16)$$

where B denotes the number of bits required for coding the block, including the motion vector, D denotes the MSE of the residue of the motion compensated image, and λ is a constant based on a theoretical rate-distortion curve.

The SSD criterion can be used directly for computing the MSE for D(cy, cx); other criteria can be used for approximating it. Practical, efficient, rate-minimizing methods have recently been found that are not only fast, but may sometimes also yield even better image quality than the best search algorithms that try only to maximize the optical flow match. Examples are Zonal Search and Diamond Search [16, 17, 18, 19, 20, 21, 22], which tend to minimize the length of the motion vector, so that fewer bits are required for coding it.
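In code, minimizing the rate-constrained criterion (16) over a candidate set is a one-liner; the helpers bits_for() and distortion() below are hypothetical placeholders for the actual bit counting and the MSE (or SSD-based) distortion measurement.

```python
def best_rate_constrained_vector(candidates, lam, bits_for, distortion):
    # Equation (16): CRD(c) = B(c) + lambda * D(c); pick the minimizing vector.
    return min(candidates, key=lambda c: bits_for(c) + lam * distortion(c))
```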

All the criteria presented so far are computed in the spatial domain. Some of them, such as the SSD criterion, can also be computed in the transform domain. The advantage is that with common video coding standards the images will be transformed in any case (usually with the discrete cosine transform), so extra effort may not be needed for the transform. The criterion can also be modified so that lower frequencies are weighted more, which may yield better results. On the other hand, with most transforms a translation in the spatial domain does not correspond to a translation of coefficients in the transform domain, but has other properties, such as phase rotation with the Fourier transform. Therefore, the criteria need to be adapted to the particular transform.

Each of the presented spatial domain criteria requires Bh × Bw iterations, or 256 iterations for a common block size of 16×16 pixels. For criteria (3)–(6) each iteration requires one addition, and in addition to that, correlation (3) requires a multiplication, SSD (4) requires an addition and a multiplication, and SAD (6) requires an addition and an absolute value operation. However, with many algorithms the criterion is not actually evaluated, since the actual value is of no interest. Only the motion vector which gives the criterion minimum or maximum is of interest, and this can be used to advantage.

In this thesis the SSD and SAD criteria are emphasized, since they are the most commonly used in practice and simple enough to allow easier analysis and development of fast motion estimation algorithms.

2.3. Algorithms

The motion estimation algorithms can be grouped into two major classes: the algorithms in the first class examine all possible motion vectors, and the algorithms in the second class reject the less probable vectors without examining them. The algorithms in the former class are called full search¹ algorithms. The latter class saves computation by sacrificing estimation accuracy, and the Three Step Search is presented as a typical example.

A more complete list of algorithms in the second class can be found in [8]. Block motion estimation can be considered a form of vector quantization, so some algorithms have been adapted from that field [30]. In this thesis the full search algorithms are emphasized.

A typical spatial-domain block motion estimation algorithm consists of two parts: the outer loop, which goes over the candidate motion vectors →c and evaluates the chosen criterion for each of the vectors, and the inner loop, which computes one of the criteria presented in section 2.2.

Computation reduction is often achieved by arranging the candidate motion vectors in some specific order, and then testing only the first few of them, ignoring the rest. Less often, the criterion summation indices (that is, y and x in equation (15), the matching scan) are rearranged and the summation possibly truncated. It is important to understand that both parts of a motion estimation algorithm can be optimized, often even independently. For example, [24] uses a spiral-shaped ordering for the candidate motion vectors and a dithered subblock order for the matching scan.
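The outer-loop reordering can be sketched as below: candidate vectors are enumerated in order of increasing distance from a predicted starting vector, which approximates the spiral ordering mentioned above (a sort is used here for simplicity instead of a true spiral walk).

```python
def ordered_candidates(r, start=(0, 0)):
    # All candidate vectors within [-r, r] x [-r, r], those closest to the
    # predicted start vector first, so a small criterion value is found early.
    sy, sx = start
    cands = [(y, x) for y in range(-r, r + 1) for x in range(-r, r + 1)]
    cands.sort(key=lambda c: (c[0] - sy) ** 2 + (c[1] - sx) ** 2)
    return cands
```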

An algorithm may have a deterministic execution time, in which case it always performs equally many arithmetic operations, independent of the video sequence. This is preferable, since it eases the implementation of a real-time system, which requires a predetermined limit on time. Video conferencing applications are examples of systems bearing real-time constraints. In contrast, video encoders whose purpose is to compress a video sequence for storage on a computer disk or other medium may use more or less time depending on the complexity of the video sequence.

¹ The special convention in this thesis is that "full search" and ESA have different meanings: the former refers to all algorithms that test all possible motion vectors, while the latter refers to a single specific algorithm, described in section 2.3.1.


Another algorithm property is the regularity of data flow. With many algorithms, the approximate motion vector is first determined by accessing an image block coarsely. Then it is refined by accessing the block finely, close around the first coarse estimate. As a result, the access pattern depends on the data content of the block. This makes caching and prefetching the pixels on demand more difficult. Therefore, a regular data flow is preferable, especially for ASIC implementations [4]. Unfortunately, very few of the fast algorithms possess a regular data flow.

2.3.1. Exhaustive Search Algorithm (ESA)

The ESA algorithm simply applies the matching criterion to all possible motion vectors and finds the vector that gives the smallest error. It is a very simple algorithm which nevertheless produces good results. However, its computational complexity is overwhelming.

Typically the SAD criterion is used with ESA. Let us denote the smallest SAD value found so far while the algorithm is running as SAD(→b). At each step, the SAD function for the current candidate vector being tested, SAD(→c), is computed. If SAD(→c) < SAD(→b), the stored values for →b and SAD(→b) are replaced with →c and SAD(→c), respectively. Any of the other spatial domain criteria may be utilized similarly.

Most often the search range length is sh × sw = 32×32 pixels (s = 2r + 1, where r is the maximum length of a motion vector), the block size is Bh × Bw = 16×16 pixels, and thus the search area is Sh × Sw = 47×47 pixels. The matching criterion is evaluated once for each candidate motion vector, or sh × sw times. In this example 1024 combinations are tested. A subtraction and an absolute value operation are each considered equivalent to an addition from the complexity viewpoint, so computing the criterion (6) for one candidate vector requires 3×256 = 768 additions, or 786432 additions in total.
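A direct sketch of the ESA described above (an illustration only, not the thesis implementation): the current block is compared against every candidate position within the search range, and the position with the smallest SAD is kept.

```python
import numpy as np

def esa(block, ref, by, bx, r=16):
    # (by, bx) is the top-left corner of the current block in the frame;
    # candidate vectors (my, mx) cover the range -r .. r-1 in both directions.
    bh, bw = block.shape
    best_sad, best_vec = None, (0, 0)
    for my in range(-r, r):
        for mx in range(-r, r):
            cy, cx = by + my, bx + mx
            if cy < 0 or cx < 0 or cy + bh > ref.shape[0] or cx + bw > ref.shape[1]:
                continue  # candidate block would fall outside the reference frame
            cand = ref[cy:cy + bh, cx:cx + bw]
            sad = int(np.abs(block.astype(np.int64) - cand.astype(np.int64)).sum())
            if best_sad is None or sad < best_sad:
                best_sad, best_vec = sad, (my, mx)
    return best_vec, best_sad
```

With r = 16 and a 16×16 block this evaluates the 32×32 = 1024 candidates of the example above.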

2.3.2. Three Step Search (TSS)

The name of TSS is misleading: it has three steps only when the maximum motion vector length rh = rw = 7. In the following discussion, it is assumed that the search range is symmetric, i.e. rhw = rh = rw, and that the number of steps N is an integer. In general, there are N = log2(rhw + 1) steps, and at each step a point is selected as a center point. In the initialization stage, the center of the search range is selected, and the criterion is computed for the center point. The initial step size is set to d0 = (rhw + 1)/2.

In the first step the criterion is computed for eight surrounding points at locations (±dn, ±dn), (0, ±dn), and (±dn, 0), relative to the center point of the step. The center at the next step will be the evaluated point which gives the smallest error, and the step size dn is halved: dn = dn−1/2. This is continued until the step size dn = 1 and the eight points evaluated are the nearest neighbors of the step center.

The point which gives the smallest criterion value among all tested points is selected as the final motion vector →m. Figure 7 depicts the Three Step Search for rhw = 7. Each square corresponds to a candidate motion vector, and the bold square contains the minimum criterion value that is found.

Figure 7. Three Step Search with rhw = 7: the candidate vectors evaluated at each of the three steps.

TSS radically reduces the number of candidate vectors →c to test, but the amount of computation required for evaluating the matching criterion value for each vector stays the same. TSS may not find the global minimum (or maximum) of the matching criterion; instead it may find only a local minimum, and this reduces the quality of the motion compensation system. On the other hand, most criteria can easily be used with TSS.

The algorithm evaluates the criterion N×8 + 1, or 8 log2(rhw + 1) + 1 = 8 log2(shw + 1) − 7, times. For example, if the search range is shw = 31, the criterion needs to be evaluated at 33 points. With SAD (6) and with block size Bh = Bw = 16, only 25344 additions are required. Compared to the ESA, the complexity is reduced drastically, by almost 97 %.

TSS finishes in a predetermined number of steps, so the computation time is always the same (ignoring possible memory cache effects). However, the pixels that are examined at each step vary depending on the input sequence, and therefore the data flow is not regular.

There are many modifications of the Three Step Search that work similarly but employ a different search pattern. The Cross Search Algorithm [14], for example, requires less computation but has slightly worse estimation quality. Many newer modifications prefer smaller motion vectors, since in many video conferencing applications motions are small. Examples are Four Step Search [15], Diamond Search [16, 17, 18, 19], and Zonal Search [20, 21, 22].
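The step-halving refinement described above can be sketched as follows; criterion(my, mx) is an assumed callback that evaluates e.g. the SAD for the candidate vector (my, mx).

```python
def tss(criterion, r=7):
    # Three Step Search: start at the center, evaluate the eight neighbours at
    # distance d, move the center to the best point found, and halve d.
    best_vec = (0, 0)
    best = criterion(0, 0)
    d = (r + 1) // 2
    while d >= 1:
        cy, cx = best_vec
        for dy in (-d, 0, d):
            for dx in (-d, 0, d):
                if dy == 0 and dx == 0:
                    continue
                value = criterion(cy + dy, cx + dx)
                if value < best:
                    best, best_vec = value, (cy + dy, cx + dx)
        d //= 2
    return best_vec, best
```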


2.3.3. Partial Distortion Elimination (PDE)

The PDE is a full search algorithm: it considers the matching criterion for all candidate motion vectors. However, a criterion such as SAD (6) is not computed fully, since only the vector coordinates where the criterion attains its minimum need to be known, not necessarily the criterion value at any point.

The matching criterion, such as SAD (6) or SSD (4), consists of a long summation of Bh × Bw non-negative numbers which are added together term by term. Let us call the partial sum of the first n+1 terms PSADn (for partial SAD):

$$\mathrm{PSAD}_n(c_y, c_x) = \sum_{i=0}^{n} \left| B_t(y_i, x_i) - C_\tau^{(c_y,c_x)}(y_i, x_i) \right| \qquad (17)$$

where 0 ≤ n < BhBw, and yi and xi determine the order in which the sum terms are accumulated and the matching scan is performed. For the usual top-down scan,

$$y_i = \left\lfloor \frac{i}{B_w} \right\rfloor, \qquad x_i = i \bmod B_w. \qquad (18)$$

The cumulative sum increases monotonically,

$$\mathrm{PSAD}_n(\vec{c}\,) \le \mathrm{PSAD}_{n+1}(\vec{c}\,) \qquad (19)$$

for all 0 ≤ n < BhBw − 1, and if it is at any point larger than the smallest criterion value SAD(→b) found so far, we know that the current motion vector will be a worse motion estimate than an earlier one. Therefore, the remaining terms need not be computed, and they can be ignored without any degradation in the estimation quality.

A comparison between the partially computed criterion value PSADn(→c) and the best value so far, SAD(→b), would be needed after adding each term, which may itself be costly. The comparison may therefore be done only after several added terms. Usually it is done after each scan line, i.e. after Bw added terms (for example [62], where Bw = 16).
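A sketch of the PDE inner loop for one candidate block, with the comparison made once per scan line as described above (NumPy arrays; best_sad is the smallest complete SAD found among the earlier candidates):

```python
import numpy as np

def pde_sad(block, cand, best_sad):
    # Accumulate the SAD row by row (equation (17), top-down scan (18)).
    # Once the partial sum reaches best_sad, the remaining rows cannot make
    # this candidate better than the best one found so far.
    psad = 0
    for y in range(block.shape[0]):
        row = block[y].astype(np.int64) - cand[y].astype(np.int64)
        psad += int(np.abs(row).sum())
        if psad >= best_sad:
            return None  # candidate eliminated, exact SAD never computed
    return psad
```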

The algorithm works faster when a small minimum value for the criterion is found early in the set of candidate vectors that will be checked. Therefore a good guess of the best motion vector may be made in the beginning, and a spiral search starting from that point can be performed [32]. PDE can be efficiently combined with TSS to provide an even faster algorithm.

The order in which the matching scan is performed is significant. In [25, 26] an adaptive matching scan direction based on the gradient magnitude is proposed instead of the fixed top-down scan (18), so that greater terms tend to be added first. This is done by heuristically selecting one of four possible scanning orders: top-down, bottom-up, left-right, or right-left. The result is that the rest of the sum terms can be discarded sooner.

This is further improved in [27, 32], where another scanning order is proposed: the two blocks to scan (the current block and a candidate block) can be divided into smaller subblock pairs, which are sorted with the greatest matching error first. The matching scan is done subblock by subblock, as in Figure 8.


Another improvement is the dithering scanning order [27]: the matching error for each block pair can be computed in a predetermined fixed dithering pattern. The sorted subblock matching is reported to perform better with large blocks, but the dithering order can be used with smaller blocks.

Figure 8. Subblock matching scan: the 16×16 block is divided into subblocks, which are scanned in order from the largest differences to the smallest.

PDE is widely used in industry according to [7], probably because it is used in several demonstration implementations of common encoders, e.g. [62]. However, for real-time implementations it is difficult to predetermine the execution time of the PDE algorithm, since it varies depending on the input data: PDE is not deterministic. PDE and modified PDE algorithms are compared in [7], which shows that the average amount of saved computation, as compared to the ESA, is 84–95 %.

2.3.4. Successive Elimination Algorithm (SEA)

The SEA [28] is also a full search algorithm. It reduces computation by taking into consideration the following inequality:

$$\mathrm{SAD}(c_y, c_x) \ge \left| \,\|B_t\|_1 - \left\| C_\tau^{(c_y,c_x)} \right\|_1 \right| \qquad (20)$$

where

$$\|B_t\|_1 = \sum_{y=0}^{B_h-1} \sum_{x=0}^{B_w-1} |B_t(y,x)| \qquad (21)$$

is the L1-norm of the current block, and similarly for a candidate block Cτ. Inequality (20) can be used with the SAD criterion. For the SSD criterion,

$$\mathrm{SSD}(c_y, c_x) \ge \frac{1}{B_h \times B_w} \left( \|B_t\|_1 - \left\| C_\tau^{(c_y,c_x)} \right\|_1 \right)^2 \qquad (22)$$

where Bh × Bw is the block size. A proof of the inequalities (20) and (22) is presented in appendix A.

For finding a single motion vector, ‖Bt‖1 needs to be computed only once in the beginning, and ‖Cτ‖1 can be computed e.g. differentially with few operations, as described in section 3. The reduction in operations is achieved by eliminating the SAD or SSD computation for a portion of the candidate motion vectors. While testing a motion vector candidate →c, it is first tested whether |‖Bt‖1 − ‖Cτ‖1| ≥ SAD(→b). If true, we know that also SAD(→c) ≥ SAD(→b), and the actual value of SAD(→c) need not be computed: the candidate vector →c is eliminated. Otherwise, the algorithm proceeds as the ESA, computing SAD(→c), comparing it to SAD(→b), and possibly updating it.

When a small value for SAD(→b) is found early in the set of motion vectors being tested, the inequality (20) eliminates a larger portion of the SAD criterion computations. Therefore a good guess of the motion vector should be made in the beginning, similarly to the PDE.
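A sketch of the SEA test wrapped around the SAD computation (the L1-norms of the current block and of the candidate block are assumed to be precomputed, the latter e.g. differentially as in section 3):

```python
import numpy as np

def sea_candidate(block, cand, norm_b, norm_c, best_sad):
    # Inequality (20): if |  ||Bt||1 - ||C||1  | >= SAD(b), the candidate
    # cannot beat the best match found so far and is eliminated outright.
    if abs(norm_b - norm_c) >= best_sad:
        return None
    sad = int(np.abs(block.astype(np.int64) - cand.astype(np.int64)).sum())
    return sad if sad < best_sad else None
```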

We have shown how SEA can be applied with the SAD and SSD criteria, and [29] presents a method for using SEA with a bit-plane criterion, with 23–49 % of time saved. SEA is reported to work significantly worse with the SSD criterion [30]; on the other hand, [7] reports that when combined with PDE and with the SSD criterion, SEA is 8 % faster than with SAD. Using SEA with a bit-rate minimization criterion is considered in [33], so that the function (16) is minimized efficiently.

SEA can eliminate 88 % of the computation time as compared to ESA with the SAD criterion, according to [7], or 35–57 % according to [31]. SEA can also be combined with PDE, which gives a further reduction in the computation time, 97 % in total [7]. As with PDE, the SEA computation time also varies depending on the input data.

Further improvements to SEA are given in [31, 32], in which the norms are computed from subblocks of the candidate blocks and the current block. The sum of these norms is then used as a lower bound. With Bhw = 16, and when using a fixed subblock size, it is reported that a 4×4-pixel subblock performs best and can eliminate 70–79 % of the computation time, as compared to ESA [7]; on the other hand, [32] reports that an 8×8-pixel subblock is a good choice.

In [30] overlapping norms of the block rows and columns are used in addition to the basic SEA scheme to eliminate more candidate blocks. All these modifications give a tighter lower bound for the SAD value, which eliminates more of the candidate blocks before the actual SAD function needs to be computed.

Practical implementation of SEA in MPEG-1, 2, and 4 is considered in [34]. In [35] the SEA algorithm is revised so that it better suits very large scale integration (VLSI) technology and saves power that way: the SEA is applied to the row norms of blocks. Furthermore, the last row is ignored, which decreases the lower bound but enables evaluating the elimination inequality from data accessible one row before the result is used. The SEA is applied to a conventional systolic full search architecture, where it is used to reduce the computation and thus the power consumption by 44 %.

2.3.5. Multilevel Successive Elimination Algorithm (MSEA)

In [36, 37, 38] the SEA is extended by subdividing the highest block level into four subblocks, for each of which the norm is computed; these may in turn be subdivided hierarchically until one-pixel blocks are reached. The resulting method is called multilevel SEA (MSEA, or the Block Sum Pyramid Algorithm, BSPA).

The algorithm begins with the norm pyramid calculation, as described in section 3.2. Let us assume that the block width and height are Bhw = Bh = Bw = 2^(L−1), where L = log2 Bhw + 1 is the number of levels in the pyramid, an integer. Let us denote the norm of the subblock (i, j) of the block B at level l, 0 ≤ l < L (l is a zero-based index), as ‖B, l, (i, j)‖, so that

$$\|B, l, (i,j)\|_1 = \sum_{y=i}^{\,i + B_h/2^l - 1} \;\; \sum_{x=j}^{\,j + B_w/2^l - 1} |B(y,x)| \qquad (23)$$

for the L1-norm, and similarly for the L2-norm. With this notation, ‖B‖ = ‖B, 0, (0,0)‖.

After the norm pyramid is computed and stored, each candidate motion vector is processed one by one. First the absolute difference of the norms of the current block and a candidate block, denoted MSAD0(→c) (for multilevel SAD), is computed and compared to the best SAD value so far. If the best SAD is smaller, the candidate block is eliminated as in SEA, and the algorithm proceeds to test the next candidate vector.

Otherwise, the next level of the norm pyramid is entered. This level contains the norms of the four quadrant subblocks of the current and the candidate block, and the sum of the absolute differences, i.e. the MSAD for the second level in the pyramid, is computed:

$$\mathrm{MSAD}_1(\vec{c}\,) = \sum_{y=0}^{1} \sum_{x=0}^{1} \left| \left\| B_t, 1, \left( y\tfrac{B_h}{2},\, x\tfrac{B_w}{2} \right) \right\|_1 - \left\| C_\tau^{\vec{c}}, 1, \left( y\tfrac{B_h}{2},\, x\tfrac{B_w}{2} \right) \right\|_1 \right| . \qquad (24)$$

This provides a lower bound for the SAD criterion, just like SEA (20), but a tighter one. It is then compared to SAD(→b), and if the MSAD is greater, the current candidate vector can be eliminated. Otherwise, the next level of the norm pyramid is entered. This continues until either the vector is eliminated or the last level L−1 is reached. For a general level l,

For general levell ,

MSADl(−→c )=

2l−1

∑y=0

2l−1

∑x=0

∣∣∣∣∥∥∥∥Bt , l ,

(yBh

2l ,xBw

2l

)∥∥∥∥1−∥∥∥∥C

−→cτ , l ,

(yBh

2l ,xBw

2l

)∥∥∥∥1

∣∣∣∣ (25)

and it can be proved [36] that

$$\mathrm{MSAD}_l(\vec{c}\,) \le \mathrm{MSAD}_{l+1}(\vec{c}\,) \qquad (26)$$

for 0 ≤ l < L−1. At each level the condition is tighter than at the previous level, which eliminates more of the candidate vectors. Evaluating each level also requires more computation than the previous one. The first level, l = 0, has only one absolute difference calculation and corresponds to the SEA. The last level, l = L−1, corresponds to the ESA with the SAD criterion: SAD (6) is equivalent to MSAD_{L−1}(cy, cx) and involves Bh × Bw absolute difference calculations.
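The level-by-level elimination can be sketched as follows, assuming the norm pyramids of the current block and of the candidate block have been precomputed as lists of 2^l × 2^l arrays of subblock sums (level 0 first, single pixels last):

```python
import numpy as np

def msea_test(pyramid_b, pyramid_c, best_sad):
    # MSAD_l (25) is a non-decreasing lower bound on the SAD (26); the
    # candidate is rejected at the first level where the bound reaches best_sad.
    msad = 0
    for level_b, level_c in zip(pyramid_b, pyramid_c):
        diff = level_b.astype(np.int64) - level_c.astype(np.int64)
        msad = int(np.abs(diff).sum())
        if msad >= best_sad:
            return None          # eliminated at this level
    return msad                  # at the last (pixel) level the bound equals the exact SAD
```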

ESA can be considered a special case of MSEA where only the last level L−1 is used. SEA is another special case of MSEA, where only the first level 0 and the last level L−1 are used. The algorithm with different numbers of levels is compared in [37] (results shown in Table 2). For a video sequence containing much motion, such as sports, the reduction may be significantly less than for other sequences.

Table 2. MSEA performance, as compared to ESA

  Number of levels exploited    Time reduction relative to ESA
  1 level (ESA)                 0 %
  2 levels (SEA)                67–85 %
  3 levels                      79–93 %
  4 levels                      87–96 %
  5 levels                      95–98 %

Several modifications of MSEA are given in [39]. In the first, PDE is used for computing each of the MSAD lower bound levels. After each row of norms is added as in (25), the partial sum is compared to the best SAD(→b) value so far. If the partial sum is larger, the candidate vector can be skipped. The candidate vectors are tested in a spiral search order. About 6 % is saved as compared to the MSEA where top-down PDE is used only for the last level.

Then the authors of [39] apply adaptive PDE to the MSEA, in which the SAD criterion is computed from sorted subblock pairs of the candidate and the current block, with the greatest differences first (see Figure 8). And ultimately, the level l_e at which a candidate block will be eliminated is estimated. If the estimate is correct, the computation of all MSAD_l values, l < l_e, is saved. Otherwise, extra work needs to be done, but the result is still the same. The elimination level is estimated for half of the candidate vectors, and the other half is used for the estimation. About 89 % of the estimates are correct, and 2.1 % more computation is saved.

MSEA can also be applied to other algorithms that reduce the number of examined candidate vectors. For example, MSEA is applied to Three Step Search in [40].

Numerical Example. In this example a particular candidate block is compared to the current block; the blocks are given as

B_t = \begin{pmatrix} 34 & 108 & 154 & 44 \\ 11 & 115 & 38 & 201 \\ 163 & 175 & 224 & 95 \\ 187 & 191 & 219 & 83 \end{pmatrix} \quad\text{and}\quad C_\tau = \begin{pmatrix} 104 & 241 & 149 & 89 \\ 18 & 177 & 231 & 238 \\ 212 & 253 & 91 & 169 \\ 21 & 104 & 173 & 114 \end{pmatrix} .    (27)

The block size is B_hw = 4. The sum of the absolute differences (6) of the block elements is 1216. By inequality (20), this is larger than the absolute difference of the norms. The current block norm is ‖B_t‖_1 = 2042, the candidate block norm is ‖C_τ‖_1 = 2384, and their absolute difference is MSAD_0 = 342. Suppose now that the SAD of the best previous match is SAD(\vec{b}) = 500.

In MSEA (and SEA), the absolute difference of the norms, 342, is first computed and compared to the best match so far. Because the norm difference is less than the best SAD, the SAD for this candidate block might also be less than 500. We might compute the SAD between B_t and C_τ, get 1216, and see that it is greater than 500. The block would then be skipped: this is how SEA operates. However, in MSEA the next level of norms is entered.

The norms of the four pairs of 2×2-element submatrices of B_t and C_τ are acquired.



The results are

B_t: \begin{pmatrix} 268 & 437 \\ 716 & 621 \end{pmatrix} \quad\text{and}\quad C_\tau: \begin{pmatrix} 540 & 707 \\ 590 & 547 \end{pmatrix} .    (28)

Now the sum of the absolute differences of these partitioned matrices is computed. We get MSAD_1 = 742, which is greater than 500. From this we know that MSAD_2 = SAD would also be greater, and the exact value of 1216 never needs to be computed. The candidate block can be safely skipped without degrading the block matching result.
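The elimination test of this numerical example can be written out directly. The following C fragment is only an illustration: it hard-codes the 4×4 blocks of (27) and recomputes the subblock norms on the fly instead of using a precomputed pyramid, and the function names are assumptions made for the example. Running it prints MSAD0 = 342, MSAD1 = 742 and SAD = 1216, matching the values above.

#include <stdio.h>
#include <stdlib.h>

/* Blocks B_t and C_tau from equation (27). */
static const int B[4][4] = {{ 34,108,154, 44},{ 11,115, 38,201},
                            {163,175,224, 95},{187,191,219, 83}};
static const int C[4][4] = {{104,241,149, 89},{ 18,177,231,238},
                            {212,253, 91,169},{ 21,104,173,114}};

/* L1-norm of the sub x sub subblock whose top-left corner is (y,x). */
static int subnorm(const int m[4][4], int y, int x, int sub)
{
    int s = 0;
    for (int i = 0; i < sub; i++)
        for (int j = 0; j < sub; j++)
            s += m[y + i][x + j];
    return s;
}

/* MSAD_l for a 4x4 block: levels 0, 1, 2 use subblock sizes 4, 2, 1. */
static int msad(int level)
{
    int sub = 4 >> level, s = 0;
    for (int y = 0; y < 4; y += sub)
        for (int x = 0; x < 4; x += sub)
            s += abs(subnorm(B, y, x, sub) - subnorm(C, y, x, sub));
    return s;
}

int main(void)
{
    /* Prints MSAD0 = 342, MSAD1 = 742, MSAD2 = SAD = 1216. */
    printf("MSAD0 = %d, MSAD1 = %d, SAD = %d\n", msad(0), msad(1), msad(2));
    return 0;
}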

2.3.6. Winner-Update Strategy

There is an interesting connection between the PDE and MSEA algorithms: both provide a monotonically ascending list of lower bounds for the SAD criterion value of any particular candidate motion vector. The difference is that with MSEA the list converges much faster to the actual SAD value, having only log2 B_hw + 1 list elements, while the list with PDE converges linearly with B_h × B_w elements, each element being a value of equation (17) with an increasing n. The computation amount for each new list element stays constant with PDE (assuming n increases linearly), but increases quickly with MSEA.

The Winner-Update Strategy exploits the list of lower bounds for finding the best motion vector. Both PDE and MSEA decrease computation by visiting all s_h × s_w candidate vectors \vec{c} in some preprogrammed order, usually from top down, sometimes in a spiral or some other order. Those vectors which give a good match should be visited first, because the rest can then be eliminated sooner. Which would be the best order to visit the candidate vectors? Certainly there is no single best order, and research has been done to adaptively select the best order among a few preprogrammed ones.

In the Winner-Update Strategy [41, 42] another approach is used for the adaptation. Either PDE or MSEA may be used for the lower bound list. At the first step, the initial lower bound for all candidate motion vectors is computed and stored. Then, at each subsequent step, the candidate vector with the smallest lower bound is chosen and its bound updated: the next stricter bound in the list is computed. The result will be closer to the final criterion value of the best match.

Then, again the candidate with the smallest lower bound is selected and the bound updated. The process is repeated until the lower bound list of one candidate block is exhausted, and the computed final bound value is equal to the actual criterion value and better than any other partially computed lower bound.

Numerical Example. Assume that the PDE algorithm is used, and there are four candidate vectors. The list of lower bounds contains 4 elements, i.e. the block size is only 4 pixels for simplicity. After the initial step the following bounds are obtained: 132, 10, 45, and 246. The second candidate vector, with the bound 10, is selected and the next bound is revealed by computing it. Let us suppose that the next absolute difference term for the vector is 30, which is added to 10 to give the next bound value 40. Now the lower bounds are 132, 40, 45, and 246. Again the same vector is updated. Let us suppose the next difference is 50, so that the list of lower bounds becomes 132, 90, 45, and 246.



At this point the third vector is updated, for which we get a difference of 60, and form a new list with 132, 90, 105, and 246. Now the second candidate again has the lowest lower bound. It has already been updated three times, so there is only one update to do. The last absolute difference is 20 and the updated list is 132, 110, 105, and 246.

The final SAD for the second candidate has now been computed, but the third candidate might still have a lower SAD, since its lower bound 105 is less than 110. So it is updated once more, which gives the next sum term 15 and a lower bound value of 120. The lower bounds of all other candidate vectors are now greater than 110, which is the final SAD value of the second candidate vector. Thus no more work is needed and the second candidate can be selected as the best match among all possibilities.

The example is illustrated in Table 3, with the computed differences (Δ) and accumulating PSAD values displayed. The numbers in parentheses show the order in which the differences are computed. The underlined lower bound values are selected at some step in the example. The cells with question marks denote the absolute difference computations which are saved as compared to ESA.

Table 3. The Winner-Update Strategy
Candidate | Δ0     | PSAD0 | Δ1    | PSAD1 | Δ2    | PSAD2 | Δ3    | PSAD3
#1        | 132(0) | 132   | ?     |       | ?     |       | ?     |
#2        | 10(0)  | 10    | 30(1) | 40    | 50(2) | 90    | 20(4) | 110
#3        | 45(0)  | 45    | 60(3) | 105   | 15(5) | 120   | ?     |
#4        | 246(0) | 246   | ?     |       | ?     |       | ?     |

The first and fourth candidate vectors are never selected, and only the initial lower bound is computed for them. This is the behavior of a real motion estimation algorithm: there are only a few candidate vectors that are updated often. For most candidates, the initial lower bound is so large that they never need to be updated.

The most demanding part of the algorithm is selecting the candidate vector with the smallest criterion value at each step. A priority queue, such as the pairing heap [44], can be used. However, in [42] the authors use a hashing-like scheme instead. Additionally, they completely eliminate many of the candidates in the first step by guessing the best motion vector and computing its criterion value completely. After this, many of the other candidate vectors are rejected in the initial step by comparing their first lower bound value to the completely calculated criterion.
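A minimal sketch of the winner-update loop is given below in C, using the numbers of the Table 3 example. It selects the smallest bound with a plain linear scan instead of a pairing heap or the hashing-like scheme of [42], and the array and variable names are assumptions made for the illustration; it prints candidate 2 as the winner with SAD 110, as in the example.

#include <stdio.h>

#define NCAND 4   /* candidate vectors */
#define NSTEP 4   /* lower-bound list length (PDE with a 4-pixel block) */

/* Per-step absolute difference terms of the Table 3 example. */
static const int delta[NCAND][NSTEP] = {
    {132,  0,  0,  0},        /* only the first term is ever needed */
    { 10, 30, 50, 20},
    { 45, 60, 15,  0},
    {246,  0,  0,  0}
};

int main(void)
{
    int bound[NCAND], steps[NCAND];

    /* Initial step: first lower bound of every candidate. */
    for (int c = 0; c < NCAND; c++) { bound[c] = delta[c][0]; steps[c] = 1; }

    for (;;) {
        /* Select the candidate with the smallest lower bound. */
        int best = 0;
        for (int c = 1; c < NCAND; c++)
            if (bound[c] < bound[best]) best = c;

        if (steps[best] == NSTEP) {           /* bound list exhausted:       */
            printf("winner %d, SAD %d\n",     /* this bound is the true SAD  */
                   best + 1, bound[best]);    /* and no other candidate can  */
            return 0;                         /* be smaller                  */
        }
        bound[best] += delta[best][steps[best]++];   /* tighten the bound */
    }
}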

The authors of [42] report that the running time of the motion estimation algorithm is reduced by 88–96 % as compared to the ESA, when the algorithm is implemented in the C language on a general purpose microprocessor. With a video sequence which contains much motion, the algorithm performs worse. The authors also present a method for using the Winner-Update Strategy with Three Step Search, which further reduces computation but may not yield the optimal match.

2.3.7. Category-Based Block Motion Estimation Algorithm (CBME)

Yet another algorithm, which computes not only a lower bound but also an upper bound for the SAD function and eliminates full SAD computations based on the lower bound, is presented in [43]. The algorithm is especially useful on VLSI devices, since it exploits the property that arithmetic is significantly faster with a short word length.

The BPM criterion (8) approximates the SAD function by using only one-bit arithmetic. CBME also approximates SAD by using fewer bits for each pixel, but nevertheless it is a full search algorithm and finds the global SAD minimum. Only the lower and upper bounds of the SAD criterion are computed, using a short two-bit word length (when d = 64, as suggested in [43]).

First each pixel in the current block and the search area is categorized by dividing the values by d and truncating the result:

B'_t(y,x) = \lfloor B_t(y,x)/d \rfloor    (29)

S'_\tau(y,x) = \lfloor S_\tau(y,x)/d \rfloor .    (30)

Since the pixel values are between 0...255, there will be 256/d (= 4 in [43]) categories of pixels. The division and truncation are simply a bit shift for a properly selected d (it must be a power of two). Now, let us define a categorized criterion function that operates on the divided values:

CSAD_\phi(c_y,c_x) = \sum_{y=0}^{B_h-1} \sum_{x=0}^{B_w-1} \phi\!\left( \left| B'_t(y,x) - S'_\tau(c_y+y+r_y,\, c_x+x+r_x) \right| \right)    (31)

where the function φ is defined below. Equation (31) requires as many iterations, B_h × B_w, as SAD, but they are much simpler due to the 2-bit, instead of 8-bit, subtraction and absolute value operations.

Based on each of the 2-bit absolute differences ε' = |B'_t(y,x) − C'_τ(y,x)|, the corresponding lower limit φ_min(ε') and upper limit φ_max(ε') for the 8-bit absolute difference ε = |B_t(y,x) − C_τ(y,x)| can be deduced:

\phi_{\min}(\varepsilon') = \begin{cases} 0, & \varepsilon' = 0 \\ \varepsilon' d - d + 1, & \varepsilon' > 0 \end{cases}    (32)

\phi_{\max}(\varepsilon') = \begin{cases} d - 1, & \varepsilon' = 0 \\ \varepsilon' d + d - 1, & \varepsilon' > 0 . \end{cases}    (33)

By using either (32) or (33) with (31), a lower or an upper bound can be obtained forthe SAD function, respectively.

In practice, if (31) is evaluated directly, two additions with a long word length are required, for accumulating both φ_min and φ_max. Instead, [43] proposes using 256/d counters, one for each category. Each counter counts the occurrences of each possible absolute difference ε'. Let the counter final values be N_{ε'}. The lower and upper bounds are obtained by summing the bounds for each ε' together:

CSAD_\phi = \sum_{\varepsilon'=0}^{256/d} N_{\varepsilon'}\, \phi(\varepsilon')    (34)



for both φ_min and φ_max. This can be simplified because d is fixed. For example, if d = 64, there are 4 categories and

CSAD_{\phi_{\min}} = N_1 + (d+1)N_2 + (2d+1)N_3    (35)
                   = N_1 + 65 N_2 + 129 N_3    (36)

CSAD_{\phi_{\max}} = (d-1)N_0 + (2d-1)N_1 + (3d-1)N_2 + (4d-1)N_3    (37)
                   = d(N_0 + 2N_1 + 3N_2 + 4N_3) - B_h B_w    (38)
                   = 64 N_0 + 128 N_1 + 192 N_2 + 256 N_3 - B_h B_w    (39)

where B_h B_w = \sum_{\varepsilon'} N_{\varepsilon'}, the total number of pixels².

The algorithm begins similarly to PDE or SEA: an initial guess of the best candidate vector is made, and its SAD value is computed and assigned to SAD(\vec{b}). However, unlike the other algorithms, CBME has two phases. In the first phase the candidate vectors are only eliminated: their actual SAD value is not computed even if a candidate vector is not eliminated. In the second phase, the SAD value of all remaining candidate vectors is computed, and the optimum is selected. The latter part corresponds to ESA, but with most candidate vectors eliminated in advance, in the first phase. The elimination phase is described next.

In the elimination phase, each candidate motion vector is processed in turn. First both the lower bound CSAD_{φmin} and the upper bound CSAD_{φmax} are computed. They are compared to SAD(\vec{b}), and one of three actions follows:

1. CSAD_{φmax}(\vec{c}) < SAD(\vec{b}): the criterion upper bound for the current candidate vector is better (smaller) than the best candidate vector SAD value found so far, so the best SAD is updated. The candidate vector \vec{c} is not eliminated, and the exact SAD will be computed later in the second phase.

2. CSAD_{φmin}(\vec{c}) ≥ SAD(\vec{b}): the criterion lower bound is worse than the best candidate vector found so far, so the candidate vector \vec{c} is eliminated, similarly as in SEA or PDE.

3. CSAD_{φmin}(\vec{c}) < SAD(\vec{b}) ≤ CSAD_{φmax}(\vec{c}): the criterion for the current candidate vector may or may not be better than the best found so far. SAD(\vec{b}) is not updated, and the exact SAD value needs to be computed later in the second phase.

A VLSI architecture for the algorithm, including an efficient VLSI counter structure, is presented in [43]. The article concludes that CBME reduces computation time by 84 % and considerably reduces power consumption, as compared to the ESA.

² Equation (12) in the paper [43] is incorrect and produces an unnecessarily small lower bound for the minimum criterion value.
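A sketch of the bound computation of the elimination phase is shown below in C for d = 64, following (29)–(34). It is illustrative only and is not the VLSI-oriented counter architecture of [43]; the function and constant names are assumptions made for the example. The small test in main checks that the true SAD of two constant blocks falls between the computed bounds.

#include <stdio.h>
#include <stdlib.h>

#define D   64            /* category width, as suggested in [43] */
#define BH  16
#define BW  16

/* Count how many pixel pairs fall into each categorized-difference bin
   and derive the CSAD lower and upper bounds of (34). */
static void csad_bounds(const unsigned char *blk, const unsigned char *cand,
                        int stride, long *lower, long *upper)
{
    long n[256 / D] = {0};                      /* counters N_eps'           */
    for (int y = 0; y < BH; y++)
        for (int x = 0; x < BW; x++) {
            int e = abs(blk[y * stride + x] / D - cand[y * stride + x] / D);
            n[e]++;                             /* 2-bit difference category */
        }
    *lower = *upper = 0;
    for (int e = 0; e < 256 / D; e++) {         /* apply (32) and (33)       */
        *lower += n[e] * (e == 0 ? 0     : e * D - D + 1);
        *upper += n[e] * (e == 0 ? D - 1 : e * D + D - 1);
    }
}

int main(void)
{
    unsigned char a[BH * BW], b[BH * BW];
    long lo, hi;
    for (int i = 0; i < BH * BW; i++) { a[i] = 200; b[i] = 10; }
    csad_bounds(a, b, BW, &lo, &hi);
    /* the true SAD is 256 * 190 = 48640; the bounds must bracket it */
    printf("lower %ld, SAD 48640, upper %ld\n", lo, hi);
    return 0;
}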



2.3.8. Fast Convolution Algorithms

The SSD equation (4) can be expanded into three terms (41)–(43):

SSD(c_y,c_x) = \sum_{y=0}^{B_h-1} \sum_{x=0}^{B_w-1} \left[ B_t(y,x) - S_\tau(c_y+y+r_h,\, c_x+x+r_w) \right]^2    (40)

= \underbrace{\sum_{y=0}^{B_h-1} \sum_{x=0}^{B_w-1} B_t(y,x)^2}_{\|B_t\|_2^2}    (41)

+ \underbrace{\sum_{y=0}^{B_h-1} \sum_{x=0}^{B_w-1} S_\tau(c_y+y+r_h,\, c_x+x+r_w)^2}_{\left\| C_\tau^{(c_y,c_x)} \right\|_2^2}    (42)

- 2 \underbrace{\sum_{y=0}^{B_h-1} \sum_{x=0}^{B_w-1} B_t(y,x)\, S_\tau(c_y+y+r_h,\, c_x+x+r_w)}_{\text{correlation } r(c_y+r_h,\, c_x+r_w)} .    (43)

Of these, term (41) is the L2²-norm of the current block. It is a constant, independent of the candidate motion vector, and need not be computed: we are interested only in finding the c_y and c_x where the SSD function has its minimum, not the actual minimum value. Term (42), the L2²-norm of the current candidate block, can be computed differentially with few operations, as described in section 3, and finally the correlation (43) can be computed with some fast convolution algorithm, such as a fast transform possessing the convolution property. The search area and the current block can be transformed, the transformed blocks multiplied elementwise together, and finally the correlation can be obtained with an inverse transform.
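The search for the SSD minimum then reduces to the two remaining terms. The following C fragment is a minimal sketch, assuming the candidate-block norms of term (42) and the correlation surface of term (43) have already been computed and stored in arrays; the identifiers and the 33×33 candidate range (motion vectors −16...16) are illustrative assumptions.

#include <stdio.h>
#include <limits.h>

#define RANGE 33   /* 2*16 + 1 candidate positions per dimension */

/* Given the candidate-block squared norms N2 (term (42)) and the correlation
 * surface r (term (43)), locate the SSD minimum.  Term (41) is constant and
 * can be dropped from the comparison.                                       */
static void find_min(const long N2[RANGE][RANGE], const long r[RANGE][RANGE],
                     int *best_cy, int *best_cx)
{
    long best = LONG_MAX;
    for (int cy = 0; cy < RANGE; cy++)
        for (int cx = 0; cx < RANGE; cx++) {
            long cost = N2[cy][cx] - 2 * r[cy][cx];   /* (42) - 2*(43) */
            if (cost < best) { best = cost; *best_cy = cy; *best_cx = cx; }
        }
}

int main(void)
{
    static long N2[RANGE][RANGE], r[RANGE][RANGE];   /* zeroed test data */
    int cy = 0, cx = 0;
    r[20][5] = 1000;                 /* fake a strong correlation peak   */
    find_min(N2, r, &cy, &cx);
    printf("best candidate (%d, %d)\n", cy, cx);     /* prints (20, 5)   */
    return 0;
}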

We need linear correlation in (43), but if a fast transform is used, the convolution theorem gives us cyclic convolution. One of the input blocks must be flipped cyclically around its origin (at the upper-left corner) to convert the convolution into correlation. To obtain linear correlation, at least one of the blocks must be zero-padded. Furthermore, to perform element-by-element multiplication in the transform domain, the blocks must have equally many elements.

These requirements can be satisfied by padding the current block B_t with zeros up to the search area S_τ size, and flipping the zero-padded current block cyclically around the origin into B'_t:

B'_t(y,x) = \begin{cases} B_t(-y \bmod S_h,\; -x \bmod S_w), & -y \bmod S_h < B_h \text{ and } -x \bmod S_w < B_w \\ 0, & \text{otherwise.} \end{cases}    (44)

In the flip the origin of the block should not move, i.e. B'_t(0,0) = B_t(0,0), or else the obtained correlation will be incorrectly positioned after the inverse transform. For example, if the current block size is 16×16 pixels and the search area size is 48×48, the current block is zero-padded up to 48×48 pixels and then flipped cyclically around the origin. This is illustrated in Figure 9.

Figure 9. Cyclic flip around origin for a 48×48-pixel block.
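A small C sketch of the zero-pad and cyclic flip of (44) is shown below for the 16×16 block and 48×48 search area case; the array and function names are assumptions made for the illustration.

#include <stdio.h>
#include <string.h>

#define BH 16
#define BW 16
#define SH 48
#define SW 48

/* Zero-pad the current block to the search-area size and flip it cyclically
 * around the origin, as in (44): the pixel at (y,x) moves to
 * ((-y) mod SH, (-x) mod SW), so B'(0,0) stays equal to B(0,0).            */
static void flip_pad(const unsigned char B[BH][BW], int Bp[SH][SW])
{
    memset(Bp, 0, sizeof(int) * SH * SW);
    for (int y = 0; y < BH; y++)
        for (int x = 0; x < BW; x++)
            Bp[(SH - y) % SH][(SW - x) % SW] = B[y][x];
}

int main(void)
{
    static unsigned char B[BH][BW];
    static int Bp[SH][SW];
    B[0][0] = 7; B[1][1] = 9;
    flip_pad(B, Bp);
    printf("%d %d\n", Bp[0][0], Bp[SH - 1][SW - 1]);   /* prints 7 9 */
    return 0;
}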

There are obviously many fast convolution algorithms [47, 48, 49, 50, 51] that are suitable for estimating motion using the ideas presented in this section. The advantage of most fast convolution-based search algorithms is the very regular flow of data, unlike with many other fast motion estimation algorithms, and the deterministic time requirements, which make them very practical in real-time video codecs. However, they have been little examined, since many of the algorithms, especially the well-known fast Fourier transform (FFT), require complex, floating point multiplications.

Some results are presented in [45], which reduces computation time by 77 % on a signal processor by decomposing a two-dimensional finite impulse response (FIR) filter, and in [46], which applies the fast Fourier transform and Winograd short length algorithms. These still have very high computation requirements. However, starting from section 4, in the rest of this thesis the computational burden is significantly eased by using number theoretic transforms [47, 48, 49, 51, 56, 57].



3. FAST COMPUTATION OF NORMS

3.1. Differential Calculation

In the SEA and fast convolution algorithms the norms of the candidate blocks need to be known, either the L1- or the L2²-norm. If the norm is computed with the straightforward formula (21) or (42), the resulting algorithm will be no faster than ESA. Therefore fast algorithms need to be developed. It turns out that the norm can be computed differentially with few operations.

Let us first denote a norm of a candidate block as

N_p(y,x) = \left\| C_\tau^{(y-r_h,\, x-r_w)} \right\|_p^p .    (45)

The top of the search area is first subdivided into S_w vertical stripes, each B_h × 1 pixels (Figure 10 illustrates this for B_h = B_w = 16 and S_w = S_h = 48), and the sum of the pixel values from each stripe is stored into an S_w-element array, squared if the L2²-norm is desired:

stripe(x) = \sum_{y=0}^{B_h-1} S_\tau(y,x)^p, \qquad x \in [0, S_w-1]    (46)

where p = 1 for the L1-norm (for SEA) or p = 2 for the L2²-norm (for a fast convolution algorithm). No absolute value operations are required, since pixels are assumed to be nonnegative.

Figure 10. Vertical stripes.

Now the initial value of the candidate block norm (45) can be computed with just B_w − 1 additions as follows:

N_p(0,0) = \sum_{x=0}^{B_w-1} stripe(x) .    (47)

We search for the best motion vector by scanning from left to right and then from top to bottom, so the next needed value is N_p(0,1). This can be computed differentially with one subtraction and one addition:

N_p(0,1) = N_p(0,0) - stripe(0) + stripe(B_w)    (48)



and similarly for all remaining N_p(0,x) values until the right edge is reached:

N_p(0,x) = N_p(0,x-1) - stripe(x-1) + stripe(x+B_w-1)    (49)

for x = 1, ..., s_w − 1.

When the right edge is reached (x = s_w − 1), we move one pixel row downward.

Accordingly, the computed stripe sums are adjusted, again differentially. For example,

stripe(0) := stripe(0) - S_\tau(0,0)^p + S_\tau(B_h,0)^p    (50)

and generally for all stripes

stripe(x) := stripe(x) - S_\tau(0,x)^p + S_\tau(B_h,x)^p    (51)

where x ∈ [0, S_w−1]. Now the algorithm loops back to the beginning and we can apply (49) again to find the N_p(1,x) values. Similarly we update the stripes each time after reaching the right edge, and (51) is generalized to

stripe(x) := stripe(x) - S_\tau(y-1,x)^p + S_\tau(y+B_h-1,x)^p    (52)

for y = 1, ..., s_h − 1 and x ∈ [0, S_w−1] (the formula can be computed for different values of x in parallel).
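The complete differential computation of the candidate block norms can be sketched in C as below, for the L1-norm and the 16×16 block, 48×48 search area case. The code follows (46)–(52) directly; the identifiers are assumptions made for the illustration, and the first norm of each row is recomputed from the stripes with (47), as in the operation counts below.

#include <stdio.h>
#include <string.h>

#define BH 16
#define BW 16
#define SH 48
#define SW 48
#define CH (SH - BH + 1)   /* candidate rows    (s_h) */
#define CW (SW - BW + 1)   /* candidate columns (s_w) */

/* Compute N_1(y,x) for every candidate block position inside the search
 * area S, using the vertical stripe sums of (46) and the differential
 * updates (48)-(52).  Here p = 1; for the squared L2-norm each S[][]
 * access would be squared.                                              */
static void candidate_norms(const unsigned char S[SH][SW], long N[CH][CW])
{
    long stripe[SW];

    for (int x = 0; x < SW; x++) {                /* (46): initial stripes   */
        stripe[x] = 0;
        for (int y = 0; y < BH; y++) stripe[x] += S[y][x];
    }
    for (int y = 0; y < CH; y++) {
        if (y > 0)                                /* (52): move stripes down */
            for (int x = 0; x < SW; x++)
                stripe[x] += S[y + BH - 1][x] - S[y - 1][x];
        N[y][0] = 0;                              /* (47): first norm of row */
        for (int x = 0; x < BW; x++) N[y][0] += stripe[x];
        for (int x = 1; x < CW; x++)              /* (49): slide to the right */
            N[y][x] = N[y][x - 1] - stripe[x - 1] + stripe[x + BW - 1];
    }
}

int main(void)
{
    static unsigned char S[SH][SW];
    static long N[CH][CW];
    memset(S, 1, sizeof S);                       /* constant test frame     */
    candidate_norms(S, N);
    printf("%ld %ld\n", N[0][0], N[CH - 1][CW - 1]);   /* both print 256     */
    return 0;
}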

The algorithm can be optimized by noticing that if horizontally consecutive blocks are processed in successive calls to the motion estimation algorithm, the search areas overlap horizontally by S_w − B_w pixels, and actually only the last B_w values of (46) need to be computed. The first S_w − B_w values can be saved from the computation of the previous motion vector. In this case, B_w stripes need to be initialized for each current block, each requiring B_h − 1 additions. Thus, (B_h−1)B_w additions (and B_h B_w multiplications, if p = 2) are needed for initializing the stripes.

When each row is processed, the stripes are updated. This requires S_w additions and S_w subtractions (counted as additions), and possibly multiplications. The stripes need to be updated S_h − B_h times. Thus, 2(S_h−B_h)S_w additions (and multiplications) are used for this.

In the beginning of each row, the initial norm value is computed. This requires B_w − 1 additions for each of the S_h − B_h + 1 scanned rows, or (S_h−B_h+1)(B_w−1) additions. Then, when proceeding to the right, S_w − B_w additions and the same number of subtractions are required for updating the norm, for each of the S_h − B_h + 1 rows, or in total 2(S_h−B_h+1)(S_w−B_w) additions. Adding up the computation required for all the norms, we get

(B_h-1)B_w + 2(S_h-B_h)S_w + (S_h-B_h+1)(B_w-1) + 2(S_h-B_h+1)(S_w-B_w)
  = (4S_h - 4B_h + 2)S_w - S_h(B_w+1) + (2B_h-2)B_w + B_h - 1    (53)
  \approx 4(S_h-B_h)S_w - S_h B_w + 2 B_h B_w    (54)

additions and B_h B_w + 2(S_h−B_h)S_w multiplications, if the L2²-norm is used. By allocating more memory, it is possible to square each pixel only once and store the result into an array. This reduces the number of multiplications to B_h × B_w. If we assume that the current block and the search area have equal height and width, i.e. B_hw = B_h = B_w and S_hw = S_h = S_w, approximation (54) reduces to 4S_hw² + 2B_hw² − 5B_hw S_hw additions.

3.2. Norm Pyramid Calculation

In some algorithms, particularly MSEA and its variants, a pyramid consisting of hierarchical norms is required. For example, if we have a 16×16-pixel block, the norm of the complete block can be computed with (21) or (42). When the resulting lower bound is smaller than the SAD of the earlier best match, the norms of the four 8×8 subblocks are needed. If this second level test also fails, the norms of the 16 subblocks of size 4×4 are needed, and so on until the subblock size is 1 pixel. The L1-norm of a pixel is the pixel value itself, and the L2²-norm is the pixel value squared.

The norm sum pyramid can be built bottom-up, starting from the last 1-pixel block level [36, 37, 38, 42]. For the L1-norm, nothing needs to be done. For the L2²-norm, the square of each pixel is computed and stored into an array whose size is equal to the original block. The rest of the creation of the norm pyramid does not differ between L1- and L2²-norms. The maximum block size whose norm is needed is B_hw (square blocks are assumed). There are L = log2 B_hw + 1 levels in the norm pyramid (L is assumed to be an integer), numbered from 0 (topmost) to L−1 (bottommost).

Let us denote the norm of a pixel in the image frame F at position (y,x) as ‖F, L−1, (y,x)‖ at the last (bottommost) level, as in the definition (23). Therefore ‖F, L−1, (y,x)‖_1 = ‖F(y,x)‖_1 = |F(y,x)| = F(y,x) for the L1-norm, because pixel values are nonnegative, and ‖F, L−1, (y,x)‖_2^2 = ‖F(y,x)‖_2^2 = F(y,x)^2 for the L2²-norm. The second-to-last level L−2 contains the norms of the 2×2-pixel subblocks, and it can be computed from the last level norms:

\|F, L-2, (y,x)\| = \sum_{i=0}^{1} \sum_{j=0}^{1} \|F, L-1, (y+i,\, x+j)\|    (55)

for y ∈ [0, F_h−2] and x ∈ [0, F_w−2]. That is, the resulting matrix at level L−2 has only one row and column less than the last level matrix at level L−1. This is because the norms are computed for all 2×2-pixel subblocks, also for overlapping and not only for adjacent subblocks.

The next level is constructed similarly:

\|F, L-3, (y,x)\| = \sum_{i=0}^{1} \sum_{j=0}^{1} \|F, L-2, (y+2i,\, x+2j)\|    (56)

for y ∈ [0, F_h−3] and x ∈ [0, F_w−3]. The result contains the norms for all 4×4-pixel subblocks of the original matrix. These equations can be generalized so that level l is computed from level l+1, as shown in the following equation:

\|F, l, (y,x)\| = \sum_{i=0}^{1} \sum_{j=0}^{1} \left\| F, l+1, \left( y + 2^{L-l-2} i,\; x + 2^{L-l-2} j \right) \right\|    (57)

for y ∈ [0, F_h − 2^{L-l-2} − 1], x ∈ [0, F_w − 2^{L-l-2} − 1] and levels l = L−2, L−3, ..., 0.



After the norm pyramid has been constructed, the norm of the candidate block C_τ^{(y,x)} (relative to the frame origin) can be accessed as ‖F_τ, 0, (y,x)‖, as well as any subblock norm between the pyramid levels 0 and L−1. The resulting algorithm is shown in Figure 11 for the one-dimensional case. In the figure the frame size is F_w = 9 pixels and the block size for which the norms are computed is B_w = 8 pixels. Therefore there are L = log2 B_w + 1 = 4 levels in the norm pyramid. The topmost level contains the norms of two different, overlapping 8-pixel blocks. The figure also illustrates the 11 unused elements in the top-right corner, if a rectangular array of L × F_w elements is allocated.

Figure 11. Norm pyramid computation.
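The one-dimensional example of Figure 11 can be reproduced with the short C sketch below, which builds the L1-norm pyramid bottom-up as in (57). Only positions where the subblock fits inside the 9-pixel frame are filled, which gives the 9, 8, 6 and 2 entries per level shown in the figure; the pixel values 1...9 are an arbitrary example.

#include <stdio.h>

#define FW 9                  /* frame width of the Figure 11 example */
#define LVLS 4                /* L = log2(8) + 1 levels               */

int main(void)
{
    int frame[FW] = {1, 2, 3, 4, 5, 6, 7, 8, 9};   /* example pixel values */
    int pyr[LVLS][FW] = {{0}};

    /* Bottom level L-1: the pixel values themselves (L1-norm of 1 pixel). */
    for (int x = 0; x < FW; x++) pyr[LVLS - 1][x] = frame[x];

    /* Level l from level l+1: add two entries 2^(L-l-2) apart, as in (57). */
    for (int l = LVLS - 2; l >= 0; l--) {
        int off = 1 << (LVLS - l - 2);        /* offsets 1, 2, 4        */
        int blk = 2 << (LVLS - l - 2);        /* subblock sizes 2, 4, 8 */
        for (int x = 0; x + blk <= FW; x++)
            pyr[l][x] = pyr[l + 1][x] + pyr[l + 1][x + off];
    }
    /* Top level: the two overlapping 8-pixel block norms. */
    printf("%d %d\n", pyr[0][0], pyr[0][1]);   /* prints 36 44 */
    return 0;
}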

For each level, computing one norm value requires 3 additions. For level l, there are (F_h − 2^{L-l-2})(F_w − 2^{L-l-2}) norms to compute, except for l = L−1. Therefore, using the algorithm presented above,

\sum_{l=0}^{L-2} 3 \times \left( F_h - 2^{L-l-2} \right) \times \left( F_w - 2^{L-l-2} \right)    (58)

additions are required for computing the whole block sum pyramid. For the L2²-norm, additionally F_h × F_w squarings are required for computing the first level norms. However, the formula (57) can be broken into separate vertical and horizontal additions

\|F, l, (y,x)\|' = \sum_{i=0}^{1} \left\| F, l+1, \left( y,\; x + 2^{L-l-2} i \right) \right\|    (59)

\|F, l, (y,x)\| = \sum_{j=0}^{1} \left\| F, l, \left( y + 2^{L-l-2} j,\; x \right) \right\|'    (60)

in which case there are only two additions per norm. Using this method only about

\sum_{l=0}^{L-2} 2 \times \left( F_h - 2^{L-l-2} \right) \times \left( F_w - 2^{L-l-2} \right)    (61)



additions are required.

To simplify the result, let us assume that the block size B_h × B_w is much smaller than the frame size F_h × F_w. This is true if the frame size is the size of one complete video frame and the block size is 16×16, as is the case with most video coding standards. In this case it is often convenient to allocate an array for F_h × F_w × L norm values, even if the bottommost rows and rightmost columns will be unused at the higher levels of the norm pyramid. The number of additions can then be approximated with 2 F_w F_h log2 B_hw for the whole frame. There are (F_h/B_h) × (F_w/B_w) blocks in a frame. Therefore, for a single current block, there are 2 B_h B_w log2 B_hw additions, and also B_h B_w multiplications if L2²-norms are computed instead of L1-norms. This is less than in the differential norm calculation (section 3.1), but more memory is required for storing the norm pyramid.



4. NUMBER THEORETIC TRANSFORMS

A number theoretic transform (NTT) is defined as follows³:

X_k \equiv \sum_{n=0}^{N-1} x_n \omega^{kn} \pmod{q}, \qquad k = 0, 1, \ldots, N-1    (62)

where x_n are the N integer values between 0 and q−1, inclusive, to be transformed, ω is the transform kernel (a well-chosen integer between 0 and q−1), and X_k are the transformed integer values, also between 0 and q−1. All operations are performed modulo q.

The number theoretic inverse transform is defined as

x_n \equiv N^{-1} \sum_{k=0}^{N-1} X_k \omega^{-kn} \pmod{q}, \qquad n = 0, 1, \ldots, N-1    (63)

where N^{-1} is the number theoretic inverse of N such that

N \cdot N^{-1} \equiv 1 \pmod{q}    (64)

and similarly ω^{-1} is the number theoretic inverse of ω. It is desirable but not necessary that the modulus q is a prime number.

The number theoretic transform exists and has the cyclic convolution property if the following conditions are fulfilled⁴:

1. ω^N ≡ 1 (mod q).

2. gcd(ω^k − 1, q) = 1 for all k ∈ [1, N−1].

Some justification for these conditions is presented in appendix B, and a more complete treatment can be found in e.g. [51]. The function gcd(a,b) is the greatest common divisor of a and b (see appendix C). It is also desirable that the binary representation of q is simple, which allows congruent reduction with only bit shifts, additions and subtractions.

Since the number theoretic transform (62) is very similar to the Fourier transform, we can use most fast Fourier transform algorithms for computing number theoretic transforms. In this thesis four transform algorithms are investigated: the Winograd number theoretic transform algorithm (WNTTA, or WFTA for the Fourier transform), radix-2, bit-shifting radix-2, and a mixed radix transform with radices 2 and 16.
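Before turning to the fast algorithms, a direct O(N²) evaluation of (62) and (63) is useful as a reference against which they can be checked. The C sketch below assumes the 48-point parameters quoted later in section 4.1.3 (ω = 4575581 and q = 2^24 − 2^6 + 1, with q prime so that inverses can be taken by Fermat's little theorem); the identifiers are illustrative. It verifies that a forward transform followed by the inverse returns the input.

#include <stdio.h>

typedef unsigned long long u64;

static const u64 Q = 16777153ULL;       /* q = 2^24 - 2^6 + 1 (section 4.1.3) */

static u64 mulmod(u64 a, u64 b) { return a * b % Q; }

static u64 powmod(u64 a, u64 e)         /* a^e mod q by repeated squaring */
{
    u64 r = 1;
    for (; e; e >>= 1, a = mulmod(a, a))
        if (e & 1) r = mulmod(r, a);
    return r;
}

/* Direct evaluation of (62); the inverse (63) is the same sum with w^-1
 * followed by a multiplication of every output by N^-1.                 */
static void ntt(const u64 *x, u64 *X, int N, u64 w)
{
    for (int k = 0; k < N; k++) {
        u64 s = 0;
        for (int n = 0; n < N; n++)
            s = (s + mulmod(x[n], powmod(w, (u64)k * n))) % Q;
        X[k] = s;
    }
}

int main(void)
{
    enum { N = 48 };
    u64 w = 4575581ULL;                 /* 48-point kernel from section 4.1.3 */
    u64 x[N], X[N], y[N];

    for (int n = 0; n < N; n++) x[n] = n;       /* arbitrary test data */
    ntt(x, X, N, w);                            /* forward transform   */
    ntt(X, y, N, powmod(w, Q - 2));             /* inverse, without 1/N */
    u64 Ninv = powmod(N, Q - 2);
    for (int n = 0; n < N; n++) y[n] = mulmod(y[n], Ninv);
    printf("round trip ok: %d\n", y[0] == x[0] && y[47] == x[47]);
    return 0;
}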

The number theoretic (as well as the Fourier) transform can be written as a matrix multiplication

\vec{X} \equiv T_N \cdot \vec{x} \pmod{q}    (65)

³ The notation "a ≡ b (mod q)" means that the remainders of a and b after dividing by q are equal, or that a is congruent to b. See [47, 55].
⁴ Many papers give different conditions: that the prime factors p_k of q minus one have to be divisible by N and that ord_q ω = N (which follow from conditions 1 and 2). However, these alone are not sufficient.



where the N×N symmetric matrix T_N transforms the length-N column vector \vec{x}, giving the transformed vector \vec{X}. The element T_N(i,j) is

T_N(i,j) \equiv \omega^{ij} \pmod{q}    (66)

and for the inverse transform,

\vec{x} \equiv T_N^{-1} \cdot \vec{X} \pmod{q}    (67)

where the elements of T_N^{-1} are

T_N^{-1}(i,j) \equiv N^{-1} \omega^{-ij} \pmod{q}.    (68)

For a two-dimensional transform of an image x, which is an N×N matrix,

X \equiv T_N \cdot x \cdot T_N^T \pmod{q}    (69)

where the N×N matrix T_N first transforms each column of x (multiplication from the left by T_N) and then all rows of the result are transformed by the transpose of T_N (multiplication from the right by T_N^T). Actual transposing is not necessary due to the symmetry of T_N. For the inverse two-dimensional transform, simply substitute T_N^{-1} for T_N in (69).

A useful although easily seen property is that

\omega^{-m} \equiv \omega^N \omega^{-m} \equiv \omega^{N-m} \pmod{q}    (70)

since ω^N ≡ 1, as required earlier for ω to be the kernel of an NTT.

The number theoretic transform is very sensitive to the modulus q and the transform length N. If either of these parameters is changed, the transform kernel ω and the matrix T coefficients need to be recalculated with a complex procedure, if the transform exists at all. Therefore, for simplicity we restrict the derivation of a motion estimation algorithm to the special case where the current block size is B_w = B_h = 16 and the search area is S_w = S_h = 48, giving the motion vector range −r_y ... r_y = −r_x ... r_x = −16 ... 16. However, it is certainly possible to derive suitable NTTs for many other cases of interest.

Since the quotient number system can hold only q distinct values, all unsigned input and output data must be less than or equal to q−1. The pixel luminance values will be between 0...255, and the correlation result values may be at most \sum_{i=0}^{B_h-1} \sum_{j=0}^{B_w-1} 255 \cdot 255 = 16646400, which is slightly less than 2^24. Therefore, we need q to be of magnitude 2^24; that is, 24 bits are necessary to represent the integer values⁵.

4.1. Computing Correlation via 48-point WNTTA

Since 48 is not a power of 2, a radix-2 or similar algorithm cannot be used for computing the 48-point transform. Instead we can use the Winograd number theoretic transform algorithm.

⁵ Obviously we could scale the input pixel values to, say, 0...127. Fewer bits would be needed, at the cost of lower fidelity.



4.1.1. Winograd Short Length Algorithms

Winograd developed a constructive minimum multiplicity theory [52], which allowed him to develop efficient algorithms for computing fast Fourier transforms of several short lengths (2, 3, 4, 5, 7, 8, 9, and 16). The Winograd short length algorithms can be represented as

\vec{X} = T_N \cdot \vec{x} = B_N D_N A_N \cdot \vec{x}    (71)

where B_N and A_N are incidence matrices containing only the numbers 1, 0, and −1: multiplications by these matrices can be computed with only additions and subtractions. In addition, efficient algorithms have been developed for multiplication by B_N and A_N, further reducing the number of additions⁶. D_N is a diagonal matrix, and thus the product of D_N and A_N \vec{x} can be computed simply by multiplying each element of A_N \vec{x} by the corresponding diagonal element of D_N.

The D_N matrix is square, but generally the B_N and A_N matrices are not. Usually A_N has a row or two more than columns, and similarly B_N may have more columns than rows. The square diagonal matrix D_N has the same number of rows and columns as A_N has rows and B_N has columns. For example, in the 16-point transform the size of matrix A_16 is 18×16, the size of B_16 is 16×18, and the size of D_16 is 18×18. In practice this means that the input data is expanded slightly when multiplied by an A_N matrix and contracted back to the original size when multiplied by the corresponding B_N matrix. An exception is the 3-point transform, where A_3 and B_3 are 3×3 square matrices.

Winograd's short length Fourier transform algorithms cannot be used directly for number theoretic transforms, but only the diagonal matrix D_N needs to be modified. This allows us to use linear algebra⁷ modulo q and to insert unknowns D_N(k,k) in the diagonal positions of D_N, premultiply this by B_N, and postmultiply it by A_N. Since

T_N = B_N D_N A_N    (72)

and the matrix T_N is known from (66), there are more than enough equations for solving all the unknowns D_N(k,k). The matrices B_N and A_N need not be modified for the inverse transform either. When substituting T_N^{-1} in place of T_N in equation (72) and solving the D_N diagonal elements, a Winograd short length algorithm for the inverse transform is obtained.

Some useful short length algorithms along with their transposes, necessary for two-dimensional transforms, are printed in appendix D.

4.1.2. Longer Length Transforms

Short length Winograd algorithms can be combined into longer length transforms. It is possible to permute the rows and columns of a matrix T_N so that it can be written as a Kronecker product of shorter length number theoretic (or Fourier) transform matrices. The lengths of the short length algorithms have to be relatively prime. The permutation is described in several references, for example [47, 50], and a simple MATLAB program is presented in appendix E, which can be used for computing any permutation with two factors.

⁶ The computation flow charts for all of Winograd's incidence matrices are shown in [56].
⁷ A computer program is recommended; suitable programs are for example Macsyma/Maxima and Maple. Another way to compute the constants is given in [58].

The product of the short length transform lengths is the total transform length. Due to the permutations of the T_N columns and rows, the input vectors (before the transform) and output vectors (after the transform) have to be permuted correspondingly.

In our case we have a 48-point transform (N = 48), which is the product of 16 and 3. Therefore we have

T'_{48} = T_3 \otimes T_{16}    (73)

where T'_{48} is a permutation of a 48-point transform matrix T_{48} and ⊗ is the Kronecker product. Another permutation of T_{48} gives

T''_{48} = T_{16} \otimes T_3.    (74)

In this work the form (73) was used, since it offers slight advantages in the number of additions.

Substituting (73) into (65) we get

\vec{X'} = (T_3 \otimes T_{16}) \cdot \vec{x'}    (75)

where \vec{x'} and \vec{X'} are corresponding permutations of the input and output vectors. Now consider the fact that Winograd algorithms exist for both lengths N_1 = 3 and N_2 = 16. Let us write the equation as

\vec{X'} = (B_3 D_3 A_3 \otimes B_{16} D_{16} A_{16}) \cdot \vec{x'} .    (76)

Using Kronecker product properties, we can rearrange this as

\vec{X'} = (B_3 \otimes B_{16})(D_3 \otimes D_{16})(A_3 \otimes A_{16})\, \vec{x'}    (77)
        = B_{48} D_{48} A_{48} \cdot \vec{x'}    (78)

where B_{48}, D_{48} and A_{48} are similar matrices to the corresponding shorter length Winograd algorithm matrices, i.e. A_{48} and B_{48} are incidence matrices and D_{48} is a diagonal matrix. The multiplication count will be low, although it may not be optimal (as it is in the short length Winograd algorithms).

Another way to comprehend (76) and (77) is to think of the vector \vec{x'} as being rearranged into a 3×16 matrix, on which separate transforms are performed: on the three rows by matrix T_{16} and on the sixteen columns by matrix T_3. Thus we have mapped the 48-point one-dimensional transform into a 3×16-point two-dimensional transform by permuting the input and output data. Figure 12 illustrates the mapping of a 6-point transform with N_1 = 2 and N_2 = 3. The numbers denote the indices of the input and output vectors. In the figure there are two stages, the 3 and 2-point transforms. If a WFTA or WNTTA were used, both of these stages would be decomposed into three stages, or six stages in total. The stages would then be rearranged as in (77).

Figure 12. Data flow in a 6-point rearranged transform.

The algorithm for multiplying by the diagonal matrix D_{48} is obvious, and algorithms for multiplying by the B_{48} and A_{48} matrices can be easily derived from the short length Winograd algorithms for multiplying by the B_3, B_{16}, A_3, and A_{16} matrices by imagining \vec{x'} as the two-dimensional array: B_3 and A_3 transform the 16 columns and B_{16} and A_{16} transform the 3 rows of the array.

Another way to represent this is to derive matrices B'_3, B'_{16}, A'_3, and A'_{16} such that

B_{48} = B'_3 B'_{16}    (79)
A_{48} = A'_{16} A'_3    (80)

where A'_{16} (of size 54×48) and A'_3 (of size 48×48) contain the elements from matrices A_{16} and A_3, respectively, and zeros; and B'_3 (of size 48×48) and B'_{16} (of size 48×54) contain the elements from matrices B_3 and B_{16}, respectively, and zeros. This makes it possible to convert the Kronecker products in (77) into ordinary matrix products

\vec{X'} = B'_3 B'_{16} D_{48} A'_{16} A'_3 \cdot \vec{x'}    (81)

which is useful for analyzing the algorithm.

For the two-dimensional transform we have two choices. We can either simply transform separately first the rows and then the columns of the input block (which is represented as a matrix x), or we can apply (69), which leads to the following equation:

X' = B'_3 B'_{16} D_{48} A'_{16} A'_3 \, x' \, B'_3 B'_{16} D_{48} A'_{16} A'_3    (82)
   = B'_3 B'_{16} D_{48} \left( A'_{16} A'_3 \, x' \, B'_3 B'_{16} \right) D_{48} A'_{16} A'_3 .    (83)

The parentheses in the latter line show the order in which the matrix multiplications are the most efficient to perform: we can first compute the formula inside the parentheses, and then multiply the intermediate result from both sides by D_{48} simultaneously. Since both multiplications by the diagonal matrix result in element-by-element multiplications by constants, these constants can be premultiplied together. It is also advantageous to multiply x' first by both A'_3 and B'_3, because this does not expand the data. If the multiplication of x' from the left by A'_{16} A'_3 were done before the multiplication from the right by B'_3 B'_{16}, six extra matrix rows would need to be multiplied, compared to multiplying first from the left by A'_3 followed from the right by B'_3.

sponds to multiplying the rows of the matrix by the transpose of the incidence matrix,the algorithms given in e.g. [47, 52, 53, 56] can not be used directly for computing themultiplications. Instead algorithms for multiplying by the transposes of the incidencematrices are required. These algorithms are given in [54] (although the article containssome errors), and for lengths 3, 8 and 16 in appendix D.

For the zero-padded transforms it is tempting to use separate horizontal and vertical transforms, since only 16 of the vertical (or horizontal, whichever is first) transforms need to be done. For transforming the 48×48 block, which is not zero-padded, multiplications can be saved by premultiplying the two D_{48} matrices together. The latter also holds for computing the inverse transform (from which a 32×32-point block is needed), and this yields the algorithm shown in Figure 13.

4.1.3. Practical Implementation

The inputs in Figure 13 are S_τ, which is a 48×48-point block from the previous frame, and B_t, which is a 16×16-point block in the most recent frame. S_τ is first multiplied in the RIGHT B3 code block in the figure using only additions. The required data permutation is integrated in the RIGHT B3 block and is therefore not shown separately in the figure: the loop is unrolled and the index translations are preprogrammed in the code.

The block B_t is zero-padded to 48×48 points, flipped cyclically around the origin (see Figure 9) and then permuted before its rows are multiplied by the A'_3 matrix. All of these operations are integrated in the A3 code block in the horizontal transform routine in the figure: both the cyclic flip and the data permutation are easy to combine into generic index translations, and the zero-pad values are not actually stored anywhere; instead the accesses to the zeros are simply removed. This reduces both the number of additions and memory accesses at the cost of increased code size due to loop unrolling.

An in-place transform is not done for several reasons: first, the original blocks will be needed later in the program and must not be overwritten. Second, since B_t is zero-padded before the transform, the transformed data would not fit in the original array. Third, the input block elements are single-byte pixels, which are too small to store the 25-bit values from the NTT. Fourth, since WNTTA is used, the 48×48-point data is expanded into 54×54 points. For these reasons two local 54×54-element arrays are allocated for storing the transforms of both input blocks.

Figure 13. Flow chart of the 48-point algorithm.

After the initial multiplication in the code blocks RIGHT B3 or A3 the rest of the NTT is done in place. This is made possible by leaving two empty rows and columns after each 16 rows or columns when storing the result of the RIGHT B3 and A3 blocks in the array. This is shown in Figure 14 for S_τ, where the white stripes are two-element-wide empty spaces. The blocks RIGHT B16 and A16 take 16 consecutive input values from the gray areas of the figure, multiply the taken vector by matrix B_16 from the right or by A_16 from the left, and store the 18 resulting values, which overwrite the 16 input values and the next two consecutive empty elements in the array.

Figure 14. Array interleaving in the 48-point transform.

When transforming S_τ, the code block LEFT A16 operates vertically and fills the two vertically consecutive empty elements, leaving no unused elements in the 54×54 array. When the central multiplications are next performed, multiplying the array elements by the D_{48} matrix from both the left and the right simultaneously, 54×54 = 2916 multiplications are needed (of which 25 are multiplications by ones). When performing the horizontal transforms for the B_t block, only 16 horizontal stripes in the array contain meaningful data at this point and thus only 16×54 = 864 multiplications are needed (of which 80 are multiplications by ones). The stripe positions are permuted, so they are not vertically consecutive.

When transforming S_τ, the code blocks LEFT B16 and RIGHT A16 contract the data again, leaving unused elements in the array, as in Figure 14. This is the final transform result, except that it is not permuted back into the natural order. This is never done, since the elementwise multiplications can be performed in the permuted order as well.

Similarly, in the horizontal transforms of B_t the data is contracted, leaving 16×48 elements. After the first B3 code block the array contains 16 horizontally transformed rows. These are then transformed vertically, finally leaving the block B_t fully transformed and permuted in the same order as S_τ in the array. The two arrays are then multiplied elementwise, which corresponds to correlation in the spatial domain. The products are stored in one of the arrays (leaving the other array empty).

The inverse transform is performed very similarly to the transform of S_τ, but with a few differences: the input data is already in one of the local arrays and in permuted order, not in the natural order. However, for the inverse transform we need to permute the data into another, different order. This can be thought of as two consecutive index mappings, one from the WNTTA output order to the natural order and another from the natural order to the WNTTA input order.

These two permutations are again combined and preprogrammed into the code block RIGHT B3 of the inverse transform. The code block reads its input data from the one local array and stores it to the other (which was empty). Of the inverse transform result, only the 32×32 uppermost and leftmost points in the array are needed. This is taken into account by not performing those additions whose result will not be used.

The last code block in the inverse transform, RIGHT A3, not only multiplies the matrix formed from the array elements by the matrix A'_3, but also permutes the data into the natural order and stores it into an external 32×32-element array where the final correlation result is left.



Since the algorithm has many consecutive index permutations, they were combined into a single permutation and the program loops were unrolled so that no tables are needed for storing the permutations. The AUTOGEN programming technique [51] was used: a C language program was written which automatically generates another C program, the actual transform routine. The AUTOGEN program also computes all data dependencies and removes those additions and subtractions that need not be done due to the zero-padding of the input or the ignored values in the output.

The 48-point algorithm requires 2304 general multiplications. The rest of the multiplications are by precomputed constants and may be easier to carry out. The NTT parameters that were used are ω = 4575581, N = 48, and q = 2^24 − 2^6 + 1. This gives the corresponding 3 and 16-point transform parameters ω_3 = 4210052 and ω_16 = 289677. Also the transform with parameters ω = 7419476, N = 48, and q = 2^24 − 2^12 + 1 was implemented, but the first transform is generally better because its modulus is closer to an integer power of 2 and is a prime.

As a proof of concept, the 48-point algorithm was implemented in Telenor's H.263 software video encoder [62] with ease. The first and last block row and column in frames were ignored: no motion vectors were computed for them. Those were considered a special case of little significance.

4.2. Computing Correlation via 32-point Transforms

The correlation can also be computed with 32-point transforms. There are several advantages to this:

• It is possible to reduce the number of transforms by combining blocks in the transform domain, if open loop motion estimation is performed.

• It is possible to use a simple transform kernel, especially with a Fermat number transform, so that multiplications can be accomplished with bit shifting.

• A simpler radix-2 algorithm can be used.

• A simpler modulus can be used, for which the congruent reduction is faster.

There are also some disadvantages:

• Block combining requires quite a lot of memory.

• The Fermat number transform requires 33-bit arithmetic, which is very difficult on general purpose microprocessors.

• The multiplication count is higher.

• Efficient integration into existing video encoders is not so straightforward.

Instead of generating the 32×32-point correlation with a single 48×48-point transform, as in the WNTTA, we generate four 16×16-point correlations for finding each motion vector. In addition, to remove redundant transforms, all motion vectors for each frame are estimated in a single procedure at once. This makes the integration into existing encoders a little more complicated, since the encoder's motion estimation algorithm cannot simply be replaced.

All of the 32-point algorithms require only 4096 general multiplications for the correlation computation in the transform domain per computed motion vector. The rest of the multiplications are by constants, and in particular, with some transforms the constants are integer powers of 2 and may be carried out by bit shifting.

The NTT parameters that were used in the radix-2 and mixed radix algorithms are ω = 2585, N = 32, and q = 2^24 + 1; for the bit-shifting radix-2 algorithm ω = 524160 = 2^19 − 2^7. For storing transformed blocks (see the next subsection), a total of two 4 × F_h × F_w element buffers are required (where F_h is the frame height and F_w the width in pixels). Each element is 4 bytes (32 bits) long. For example, when the frame size is a typical F_h = 288, F_w = 352, the buffers use 3.1 megabytes of memory. Using only one circular buffer, the required memory space is somewhat smaller (about half), but the addressing would be more complicated.

Most current video encoders perform closed loop motion estimation, in which the reference frame F_τ is not taken directly from the video sequence, but is instead quantized in the same way as the decoder will receive it for performing motion compensation, as shown in Figure 2.

However, to combine blocks efficiently, the motion estimation has to be performed in open loop, using the unquantized frames from the original video sequence. Closed and open loop motion estimation are compared in [59], and little difference is found between the two, especially at high bit rates. Therefore, open loop estimation should be acceptable in many situations. If desired, the 32-point NTT can also be used in closed loop with a slight performance penalty.

4.2.1. The Procedure

Two buffers are used for holding the transformed blocks: each can hold (F_h/B_h) × (F_w/B_w) number theoretically transformed 32×32-element blocks. When the motion estimation procedure begins, the first buffer is unused (empty) and the other buffer contains transformed, overlapping (by 16 pixels) 32×32-pixel blocks from the reference frame F_τ.

While the procedure estimates the motion vectors, the first buffer is filled with transformed blocks from the current frame F_t and the contents of the second buffer are used for performing the motion estimation. When the procedure terminates, the buffers are exchanged.

The procedure works as follows (shown in Figure 15): loop over all blocks in a frame (i.e. the loop step size is 16 pixels, one block, horizontally and vertically). For each block at block row i and column j,

1. Zero-pad the 16×16-pixel blockBt in the current frameFt into 32×32 pixelsand transform it. Save the transformed block in the first (originally empty) bufferat position(i, j).

2. Flip the transformed block cyclically around the upper-left corner (origin) to convert convolution into correlation. The saved block is not modified: it is only read, and the result is stored in a temporary 32×32-element array. The flipping operation is the same in both the transform and the spatial domain, and for the latter it is depicted in Figure 16. In the transform domain (where the block is actually flipped) there generally are no zeros (unlike in the spatial domain, due to zero-padding, as shown in the figure), so every block element has to be read and stored into the temporary array.

Figure 15. Chart of the 32-point procedure.

3. Get the four transformed full (not zero-padded) 32×32-element blocks, whichare the quarters of the current 48× 48-pixel search areaSτ, from the secondbuffer containing the transformed blocks from the reference frameFτ. The quar-ters are shown in Figure 17: they all spatially overlap with the current 16×16-pixel block, which is at the position(i, j). The first quarter is at the position(i−1, j−1), the second is at(i−1, j), the third at(i, j−1), and the fourth at(i, j). The 16×16-pixel blockBt is in the center position of the 48×48-pixelsearch areaSτ.

4. Multiply the zero-padded, transformed, and flipped block elementwise with all four of the transformed full 32×32-pixel quarter blocks.

Figure 16. Flipping a 32×32-pixel block cyclically.

Figure 17. The four 32×32-pixel overlapping quarters.

5. Inverse transform the four results from the previous step. The upper-left 16×16elements of the outcomes are the partial correlation values of the required32×32-element correlation between the 16×16 and 48×48-pixel blocks, cor-responding to the four quarters of the full correlation.

6. Stitch together the partial correlations obtained in the previous step and performthe best motion vector search using the correlation as described in sections 2.3.8and 3.

The four transformed full blocks used at step 3 are constructed in the transform domain from the saved zero-padded blocks computed at step 1. Since the number theoretic transform is a linear operation, elementwise addition of two transformed blocks equals elementwise addition in the spatial domain. The operation equivalent to cyclically shifting untransformed blocks can also be done simply in the transform domain.

With two 16-point shift operations and additions per block, on average, a full 32×32-element transformed block is constructed and stored in the first buffer. This is shown in Figures 18 and 19, which represent the operations in the spatial domain for clarity: the actual operations are performed in the transform domain, where the blocks do not have the visible zero-padding of the figures.

Figure 18. Buffer filling.

In practice the shift and addition operations are combined: shifting a block by 16 elements down or right in the spatial domain corresponds to multiplying the elements of every other row or column, respectively, in the transform domain by −1 (≡ q−1). This can be easily seen with the convolution theorem: a 16-pixel cyclic shift down of a 32×32-pixel block is equivalent to a convolution of the block with a 32×32-pixel unit impulse block, in which all elements are zero except the element at the 16th row and the first column. Since the convolution corresponds to elementwise multiplication in the transform domain, we see that the shift corresponds to multiplication of each element of the transformed block by some constant, specifically by 1 or −1 when the shift is 16 pixels.

Instead of multiplying half of the elements of the block by−1 and then adding theblock to a second block, half of the elements are added and the other half are subtractedfrom the second block.
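The combining step can be sketched in C as below. It assumes the four transformed zero-padded blocks covering one 32×32-pixel area are available, and builds the transformed full block with the add/subtract rule described above; the function and parameter names, and the orientation convention (which quarter gets which shift), are assumptions made for the illustration. The fragment is meant to slot into the procedure of Figure 15 rather than stand alone.

#define N 32   /* transformed block size */

/* Combine four transformed zero-padded 16x16 blocks (each with its data
 * originally in the upper-left quadrant) into one transformed full 32x32
 * block: tl stays in place, bl is shifted down by 16, tr right by 16, and
 * br by both.  In the transform domain a 16-element cyclic shift multiplies
 * every other row (or column) by -1, i.e. becomes an add or a subtract
 * modulo q.                                                                 */
static void combine(const unsigned long tl[N][N], const unsigned long tr[N][N],
                    const unsigned long bl[N][N], const unsigned long br[N][N],
                    unsigned long full[N][N], unsigned long q)
{
    for (int k = 0; k < N; k++)
        for (int l = 0; l < N; l++) {
            long long rs = (k & 1) ? -1 : 1;    /* sign of the 16-row shift    */
            long long cs = (l & 1) ? -1 : 1;    /* sign of the 16-column shift */
            long long v  = (long long)tl[k][l]
                         + rs * (long long)bl[k][l]       /* block below  */
                         + cs * (long long)tr[k][l]       /* block right  */
                         + rs * cs * (long long)br[k][l]; /* block below-right */
            long long m  = v % (long long)q;
            full[k][l]   = (unsigned long)(m < 0 ? m + (long long)q : m);
        }
}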

If the motion estimation is carried out in a closed loop, the transformed and zero-padded blocks created at step 1 cannot be saved to the buffer and reused for the reference frame when the next frame is encoded. Instead, the full 32×32-pixel blocks in the search area need to be fetched and transformed directly from the reference frame at step 3.

It is also possible to combine the blocks in a closed loop similarly as in Figure 19. In this case the zero-padding is performed directly on blocks obtained from the reference frame, instead of from the second buffer as in an open loop. Two blockwise additions are needed per estimated motion vector, but the transforms can be accomplished with zero-padded blocks, thus saving the 16 rightmost vertical transforms.

4.2.2. Radix-2 Algorithms

To transform the 32×32-pixel zero-padded block (shown on the left in Figure 16) we can use the radix-2 algorithm to first transform the 16 leftmost columns and then the 32 rows of the block. This was implemented using decimation in frequency, which is shown in Figures 20 and 21a: the input pixel values are in natural order but the output values need to be permuted (bit reversed), as shown in the figure.

All multiplications in the first stage of the column transforms are by small values (because the input pixel values are between 0...255) and we can easily use direct table lookups instead of multiplications modulo q.

The inverse transform has the same structure, but the difference is that there are no zeros in the input, and only the 16×16-element upper-left corner of the result is useful (the rest is aliased due to cyclic correlation). No table lookups can easily be used. First all 32 rows and then the 16 leftmost columns are inverse transformed.



Figure 19. Constructing a full 32×32-element block from four 16×16-element blocks, which are zero-padded to 32×32 elements.



Figure 20. Radix-2 decimation in frequency flow chart.

Figure 21. Butterfly operations: (a) decimation in frequency, (b) decimation in time.
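For concreteness, one decimation-in-frequency butterfly of Figure 21a over the ring could be sketched in C as below. This is only an illustration: q is the modulus, w is the appropriate power of ω already reduced modulo q, and a plain % reduction is used for brevity instead of the optimized reductions of section 4.3.

#include <stdint.h>

/* One radix-2 decimation-in-frequency butterfly (Figure 21a):
 * a' = a + b, b' = (a - b) * w, all modulo q.                     */
static void dif_butterfly(uint32_t *a, uint32_t *b, uint32_t w, uint32_t q)
{
    uint64_t sum  = (uint64_t)*a + *b;           /* a + b                   */
    uint64_t diff = (uint64_t)*a + q - *b;       /* a - b, kept >= 0        */
    *a = (uint32_t)(sum % q);
    *b = (uint32_t)((diff % q) * w % q);         /* (a - b) * omega^n mod q */
}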


Additionally, after each inverse transform the resulting values need to be multiplied by N^−1, as can be seen in formula (63). It is also possible not to multiply by N^−1 but instead by N^−2 only after the 16 vertical inverse transforms.

When ω = 524160 or 65520, all of its even powers are of the form ±2^n and the odd powers are linear combinations of two terms, where both terms are of the form ±2^n (shown in Table 4). The transform can be performed with only bit shifts with these kernels. As can be seen from Figure 20, there are a total of 49 multiplications in one 32-point one-dimensional transform. Of these, only 8 are by odd powers of ω and the remaining 41 are by even powers.

When n < 8, it is sometimes possible to perform the multiplication by 2^n with only one bit shift, without immediate congruent reduction. This is possible if the multiplicand is known to be small, not much larger than 2^24, so that a 32-bit register in the microprocessor does not overflow. In most cases this is not possible, and generally a congruent reduction has to be performed. This can be done together with the multiplication by the constant ±2^n, as described in section 4.3.2.

Table 4. Powers of ω = 524160 (mod 2^24 + 1)

m    ω^m               ω^−m
1    2^19 − 2^7        2^16 − 2^4
2    2^3               −2^21
3    2^22 − 2^10       2^13 − 2^1
4    2^6               −2^18
5    −(2^13 + 2^1)     2^22 + 2^10
6    2^9               −2^15
7    −(2^16 + 2^4)     2^19 + 2^7
8    2^12              −2^12
9    −(2^19 + 2^7)     2^16 + 2^4
10   2^15              −2^9
11   −(2^22 + 2^10)    2^13 + 2^1
12   2^18              −2^6
13   2^1 − 2^13        2^10 − 2^22
14   2^21              −2^3
15   2^4 − 2^16        2^7 − 2^19
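The entries of Table 4 are easy to verify numerically. A small sketch (not part of the thesis implementation) that prints the powers of ω modulo q:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t q = (1ull << 24) + 1;    /* modulus 2^24 + 1          */
    uint64_t w = 524160, p = 1;             /* kernel and running power  */
    for (int m = 1; m <= 15; m++) {
        p = p * w % q;
        /* e.g. omega^2 prints 8 = 2^3 and omega^4 prints 64 = 2^6 */
        printf("omega^%-2d = %llu\n", m, (unsigned long long)p);
    }
    return 0;
}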

The implementation of the bit-shifting algorithm also saves some additions compared to the other algorithms, because it uses signed arithmetic. A computer program was written which generated a C-language program for the one-dimensional transform. The program also analyzed the places where the congruent reduction was required.

4.2.3. Other Algorithms

Well known improvements over the radix-2 algorithm are the radix-4 and split-radix algorithms. Both of these algorithms require the existence of a number i such that i^2 ≡ −1 (mod q). Such a number does not exist in all rings, but when the modulus q = 2^n + 1 where n is even, as in our case, it is easy to find^8:

i = √−1 ≡ √(q − 1) = √(2^n) = 2^(n/2)  (mod q).    (84)

These algorithms are more efficient for the Fourier transform than the usual radix-2 algorithm because they replace some of the multiplications by arbitrary constant complex numbers with multiplications by ±i. In the complex domain, where the numbers are represented by their real and imaginary components, multiplication by ±i requires only exchanging the real and imaginary components and negating one of them.

In the number theoretic domain, it is no easier to multiply by i than it is to multiply by some power of ω, when ω is simple such as an integer power of 2. Therefore these algorithms do not seem to be any more efficient than the ordinary radix-2 algorithm, and they were not implemented.

A mixed radix algorithm with one radix-2 stage and one radix-16 stage was implemented, where the radix-16 stage was computed using Winograd's 16-point short length transform algorithm. The whole algorithm contains 41 multiplications, and since the multiplications in the Winograd algorithm are grouped together, fewer congruent reductions are needed. The algorithm is shown in Figure 22.

Winograd's 16-point algorithm has 13 multiplications, and for two 16-point transforms that makes 26 multiplications. Since the radix-2 stage has only 15 multiplications, it is more beneficial to use table lookup in the Winograd 16-point transform stage. This can be done in the transform routine with the decimation in time algorithm. It is similar to decimation in frequency as in Figure 22 (or 20) but mirrored horizontally: the input values are permuted and the output values are in natural order. The corresponding butterfly is shown in Figure 21b.

In the inverse transform there is no significant difference between the decimation in frequency and decimation in time algorithms, and decimation in frequency was implemented^9 (Figure 22).

4.3. Reducing Congruent Reductions

There are two ways to speed up congruent reductions: compute them using only bit shifts, additions and subtractions instead of a complete division (remainder) procedure, or avoid the reductions in the first place.

4.3.1. Fast Computation

Let's suppose that we want to obtain x′ that is congruent to x modulo q but smaller, using bit shifts and no general division. Assume first that our modulus is q = 2^24 − 2^6 + 1. We can write x as a sum of two numbers, one consisting of the less significant bits and the other of the most significant bits of x raised to a proper power of 2. We can represent x in the following way:

x = x_{0:23} + 2^24 x_{24:31}    (85)

8 Several numbers with the property i^2 ≡ −1 may exist.
9 The author considers the decimation in frequency algorithm easier to implement, since program errors in index permutations are easier to find.



Figure 22. Combined Radix-2 and Winograd-16 decimation in frequency flow chart.

where x_{m:n} means the bits from bit positions m to n of x, or mathematically

x_{m:n} = ⌊x / 2^m⌋ mod 2^{n−m+1}.    (86)

Equality (85) can be written as

x = x_{0:23} + (2^24 − 2^6 + 1 + 2^6 − 1) x_{24:31}    (87)

and since q = 2^24 − 2^6 + 1 ≡ 0 (mod q), we get

x ≡ x_{0:23} + (2^6 − 1) x_{24:31}  (mod q)    (88)
  = x_{0:23} + 2^6 x_{24:31} − x_{24:31}.    (89)


Similarly, when q = 2^24 + 1, we can write

x = x_{0:23} + 2^24 x_{24:31}    (90)
  = x_{0:23} + (2^24 + 1 − 1) x_{24:31}    (91)
  ≡ x_{0:23} − x_{24:31}  (mod q).    (92)

Unlike (89), this may be negative. If we use unsigned numbers and want the result to be positive, we can add a suitable multiple of q to ensure that the result is positive:

x ≡ nq + x_{0:23} − x_{24:31}  (mod q).    (93)

The positive-making term nq would not be necessary if signed integer numbers were used instead of unsigned. This alternative was implemented only in the bit-shifting radix-2 algorithm.

The congruent reduction algorithm above does not directly compute the remainder of x divided by q (which is always less than q); instead, the result may be greater than q. To obtain the congruent result reduced into a number between 0...q−1, a comparison to q and a conditional subtraction of q if greater is necessary. This complete reduction does not need to be done before the final correlation result is returned from an inverse transform algorithm.
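A sketch of the partial reductions (85)–(93) in C, assuming 32-bit unsigned inputs; the function names are illustrative and not from the thesis code:

#include <stdint.h>

/* Reduction for q = 2^24 + 1, following (90)-(93): x ~ x_{0:23} - x_{24:31}.
 * The subtraction is kept non-negative by adding q once; for 32-bit inputs
 * a single conditional subtraction then yields a value in 0..q-1.          */
static uint32_t reduce_2p24p1(uint32_t x)
{
    const uint32_t q = (1u << 24) + 1u;
    uint32_t lo = x & 0xFFFFFFu;    /* bits 0..23  */
    uint32_t hi = x >> 24;          /* bits 24..31 */
    uint32_t r  = lo + q - hi;
    return (r >= q) ? r - q : r;
}

/* Reduction for q = 2^24 - 2^6 + 1, following (89):
 * x ~ x_{0:23} + 2^6 x_{24:31} - x_{24:31}; congruent but not fully reduced. */
static uint32_t reduce_2p24m2p6p1(uint32_t x)
{
    uint32_t lo = x & 0xFFFFFFu;
    uint32_t hi = x >> 24;
    return lo + (hi << 6) - hi;
}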

4.3.2. Multiplying by ±2^n (mod 2^24 + 1)

Usually multiplications modulo some q are more difficult to perform than normal multiplications, but in some cases the congruence may actually make the task simpler. When the multiplier is ±2^n, where n is some integer, the multiplication can be done with only a bit shift. In addition, the two bit shifts from the congruent reduction can be combined with the multiplication so that the total count of shifts is only 2 instead of 3, and more importantly, no numbers wider than 32 bits need to be handled.

In this section it is assumed that n < 24 and the modulus q = 2^24 + 1. It is also assumed that unsigned numbers are used; however, the algorithm is directly applicable also to signed numbers. The only difference is that bit shifts to the right have to be sign extending (arithmetic shifts).

Assume now that the underlying microprocessor has at least two 32-bit temporary registers r1 and r2 (which may contain a value of at most 2^32 − 1). A number x, which we want to multiply, is less than or equal to 2^32 − 1 and placed in register r1. We don't want to multiply it directly by 2^n, since the result y would require two registers, which is cumbersome to handle. Instead we implicitly just assume that the lowest n bits are zero (i.e. y_{0:n−1} = 0) and r1 holds the value of y_{n:n+31}.

We want to subtract y_{24:n+31} from y_{0:23} to perform the congruent reduction. Since y_{n:n+31} = x_{0:31} it follows that y_{24:n+31} = x_{24−n:31} and y_{0:23} = 2^n x_{0:23−n}. Now we can just assign x to another register, r2 := x, and divide it by 2^{24−n} (with a bit shift to the right), aligning y_{24:n+31} at the right edge of r2 at bit positions 0:n+7. The register r1 holding x is multiplied by 2^n (with a bit shift to the left) and masked to zero, preserving only the x_{0:23−n} bits in positions n:23 with a bitwise AND operation (see Figure 23).



Figure 23. Combining congruent reduction with multiplication by ±2^n.

With a positive multiplier +2^n, register r2 = y_{24:n+31} should now be subtracted from r1 = y_{0:23}. y_{24:n+31} may be at most 2^{n+8} − 1 and y_{0:23} may be at most 2^24 − 1; both are greater than or equal to zero. When the first is subtracted from the latter, we know that the result z must be in [−2^{n+8} + 1, 2^24 − 1]. If −2^{n+8} + 1 ≥ −q ⇒ n ≤ 16, then −q ≤ z < q and a single comparison to zero and an addition of q if less is enough to reduce the result completely between 0 and q − 1, inclusive. If n > 16, an extra congruent reduction may be necessary (unless the input value x is known to be much less than 2^32 − 1).

If the multiplier is −2^n, the result should be negated. The negation is, of course, equivalent to subtracting r1 from r2, i.e. reversing the subtraction operation. When the first is subtracted from the latter, we know that the result z must be in [−2^24 + 1, 2^{n+8} − 1]. If 2^{n+8} − 1 < q ⇒ n ≤ 16, then −q ≤ z < q and a single comparison is enough, similarly to the positive multiplier case, to reduce the result completely between 0 and q − 1, inclusive.

Numerical Example. To perform an inverse radix-2 number theoretic transform (with q = 2^24 + 1), the result has to be multiplied by N^−1. In our case, we have two inverse transforms, vertical and horizontal, and it is enough to multiply only once by N^−2 after both inverse transforms. In this case, N = 32. A straightforward way is to first solve

32^−2 = (32^2)^−1 ≡ 16760833  (mod q).    (94)

Then it is possible to simply perform a wide multiplication by 16760833, a division by q, and return the remainder as the result.

However, once we notice that

16760833 = 2^24 − 2^14 + 1    (95)

we can do the multiplication using two bit shifts and a couple of additions and subtractions. The congruent reduction can likewise be computed with only bit shifts, additions, and subtractions as shown in section 4.3.1. This requires quite many operations, even if they are simple. It is even better to combine the multiplication and the modulo reduction into a single operation. First note that

2^24 − 2^14 + 1 ≡ −2^14  (mod q).    (96)


According to the preceding section, the multiplication algorithm is carried out as follows:

r1 := input
r2 := r1
r1 := r1 · 2^14
r1 := r1 ∧ (2^24 − 1)
r2 := ⌊r2 / 2^10⌋
r2 := r2 − r1
r2 := r2 + q if r2 < 0
result := r2

where ∧ is the bitwise logical AND operation, ⌊x / 2^n⌋ is a bit shift to the right and x · 2^n a bit shift to the left by n bits.
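Assuming the value to be scaled fits in a 32-bit unsigned register, the same register sequence could be written in C roughly as follows (the function name is illustrative, not from the thesis code):

#include <stdint.h>

/* Multiply x by N^{-2} = -2^14 (mod q), q = 2^24 + 1, combining the
 * multiplication with the congruent reduction as in the sequence above.
 * Returns a value in 0..q-1.                                            */
static uint32_t mul_by_inv_n2(uint32_t x)
{
    const uint32_t q = (1u << 24) + 1u;
    uint32_t r1 = (x << 14) & 0xFFFFFFu;      /* low 24 bits of x * 2^14       */
    uint32_t r2 = x >> 10;                    /* bits of x * 2^14 above bit 23 */
    int32_t  z  = (int32_t)r2 - (int32_t)r1;  /* subtraction reversed: -2^14   */
    return (z < 0) ? (uint32_t)(z + (int32_t)q) : (uint32_t)z;
}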

4.3.3. Reduction Elimination

Since for all algorithms the modulus is only about 2^24, many additions and subtractions can be done before the 32-bit word of the microprocessor overflows. Congruent reduction does not need to be done except just before that.

Simulation of the 48-point algorithm was implemented with MATLAB and interval arithmetic. Using this simulation, the places where the congruent reduction is necessary, i.e. where any of the elements in the arrays might overflow, were found. Figure 13 shows where the congruent reductions are necessary due to additions and subtractions for the 48-point algorithm. Note that the figure does not show the reductions after multiplications. Generally the reduction has to be performed twice after each multiplication when the modulus is q = 2^24 − 2^6 + 1. The interval arithmetic computations were not included in the AUTOGEN program, since a large amount of experimenting by hand was necessary.

For the mixed radix 32-point algorithm a similar simulation program was written. For the radix-2 algorithm, this optimization was not performed, since it would have been more difficult: in the radix-2 algorithm the numbers are processed recursively, which was not the case for the Winograd algorithm. For the bit-shifting radix-2 algorithm, the recursion was removed by unrolling the transform algorithm inner loops, and the AUTOGEN program acquired automatically the required places for congruent reductions.
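The idea of the interval-arithmetic bookkeeping can be illustrated with a small sketch (in C rather than MATLAB, with invented names; the actual AUTOGEN program is not reproduced here): every intermediate value carries a worst-case bound, and a reduction is emitted only where a 32-bit word could overflow.

#include <stdint.h>
#include <stdbool.h>

typedef struct { uint64_t hi; } bound_t;   /* worst-case magnitude of a value */

/* Returns true if a congruent reduction must be inserted before this
 * addition; otherwise the reduction is deferred and only the bound is
 * updated. Assumes a full reduction leaves a value below q ~ 2^24.     */
static bool add_needs_reduction(bound_t *acc, const bound_t *operand)
{
    const uint64_t q = (1ull << 24) + 1;
    if (acc->hi + operand->hi > UINT32_MAX) {
        acc->hi = (q - 1) + operand->hi;   /* reduce the accumulator, then add */
        return true;
    }
    acc->hi += operand->hi;                /* safe: still fits in 32 bits      */
    return false;
}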

4.3.4. Lookup Tables

In every transform several additions and subtractions are done with pixel values before the first multiplication. This is especially true for the WNTTA-type transforms. After these additions, the numbers are still moderately small, since originally the pixel values are all between 0...255. With interval arithmetic, the exact lower and upper boundaries for the numbers after the additions and subtractions are computed. The values are used as indices to look-up tables which return the fully reduced product of the value multiplied by some constant. This saves not only the multiplications, but more importantly the congruent reductions.

It is interesting to notice that all multiplications could have been eliminated in the WNTTA transform algorithm (83), but this would require about 8 megabytes for look-up tables.
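A sketch of how such a table could be generated (a hypothetical helper, assuming the input range [lo, hi] comes from the interval-arithmetic analysis):

#include <stdint.h>
#include <stdlib.h>

/* Build a lookup table for "multiply by the constant c modulo q" for
 * inputs in the range lo..hi. The table is indexed with (value - lo)
 * and returns the fully reduced product, saving both the multiplication
 * and the congruent reduction at transform time.                        */
static uint32_t *make_mul_table(int64_t c, int32_t lo, int32_t hi, int64_t q)
{
    uint32_t *tab = malloc((size_t)(hi - lo + 1) * sizeof *tab);
    if (tab == NULL)
        return NULL;
    for (int32_t v = lo; v <= hi; v++) {
        int64_t r = (int64_t)v * c % q;
        if (r < 0)
            r += q;                      /* keep the result in 0..q-1 */
        tab[v - lo] = (uint32_t)r;
    }
    return tab;
}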


5. RESULTS

The algorithms were tuned for x86 processors using a small amount of hand-written assembly code (an AMD Athlon processor at 800 MHz was used). The assembly code was added primarily for exploiting some x86 instructions for which there is no direct support in the C language (such as obtaining a 64-bit multiplication result). Although the theoretical multiplication count is very low, especially with the 48-point algorithm, one general multiplication modulo q on a general purpose microprocessor is a very slow operation. If the SSD equation is applied directly, only a relatively fast multiplication instruction is required and no congruent reductions at any stage.

Although the congruent reduction after an actual multiplication instruction requires only bit shifts and additions, there are lots of instructions, some of them very slow. For direct SSD or SAD computation, the MMX instruction set helps a lot: for example, a single MMX instruction can take the absolute difference of 8 pixel value pairs and compute the sum of those. For computations modulo q it helps little, since MMX doesn't have a suitable wide multiplication instruction, and one 64-bit MMX register can hold only two 25-bit number theoretic values: parallelism is not very high.

Tables 5–11 contain some statistics from the implemented algorithms for computing one motion vector. The statistics measured from the separate parts of the algorithms are shown, as well as the total statistics measured from running the whole motion estimation algorithm (since the latter is also measured, the sum of the pieces will not add up exactly to the total measured value). In the different 32-point algorithms, some parts are exactly the same. Those are enclosed in parentheses. In the 48-point algorithm, the indented captions indicate the subparts of the outer routines.

Table 5. Motion estimation using the 32-point radix-2 algorithm
                              Adds   Muls  Mods   ModMuls  Access  Insns   Ops     Time [µs]
Initialize vertical stripes   240    256   -      -        865     1505    1762    2.1
Transform                     9216   -     6112   2352     84561   140802  155636  144.0
Flip block                    -      -     -      -        2085    3424    3473    2.3
4×Elementwise multiply        -      -     1024   1024     3501    15891   23109   17.7
4×Inverse transform           9984   -     6912   2352     80107   139842  156350  135.0
Find best match               7552   2976  -      -        15487   30275   33502   27.7
Move down and add             1536   -     -      -        2084    3203    3247    4.0
Move right and add            1536   -     -      -        2087    3267    3320    3.2
Total sum                     60016  3232  37856  15856    441863  805157  918037  807.0

Table 6. Motion estimation using the 32-point bit-shifting radix-2 algorithm and signed arithmetic
                              Adds   Muls  Mods   ModMuls           Access  Insns   Ops     Time [µs]
(Initialize vertical stripes) 240    256   -      -                 865     1505    1762    2.1
Transform                     6544   -     3392   880 (shifts)      20069   44062   44075   31.3
(Flip block)                  -      -     -      -                 2085    3424    3473    2.3
4×Elementwise multiply        -      -     -      1024              3096    7642    14816   11.7
4×Inverse transform           7296   -     3248   288 (shifts)      17595   42377   42392   25.4
(Find best match)             7552   2976  -      -                 15487   30275   33502   27.7
Move down and add             1024   -     -      -                 2062    2666    2678    4.0
Move right and add            1024   -     -      -                 2065    2741    2744    3.2
Total sum                     45568  3232  16384  4096 (no shifts)  125739  286003  318229  234.5


Table 7. Motion estimation using the 32-point mixed radix algorithm
                              Adds   Muls  Mods  ModMuls  Access  Insns   Ops     Time [µs]
(Initialize vertical stripes) 240    256   -     -        865     1505    1762    2.1
Transform                     10144  -     1504  1968     21510   51751   62654   51.6
(Flip block)                  -      -     -     -        2086    3429    3473    2.3
(4×Elementwise multiply)      -      -     1024  1024     3500    15891   23086   17.2
4×Inverse transform           10560  -     768   1968     25835   55505   69327   49.4
(Find best match)             7552   2976  -     -        15594   30275   33504   28.2
(Move down and add)           1536   -     -     -        2082    3203    3244    4.0
(Move right and add)          1536   -     -     -        2087    3267    3320    3.2
Total sum                     63248  3232  8672  13936    161406  378752  476958  374.0

Table 8. Motion estimation using the 48-point WNTTA algorithm
                                                          Adds   Muls  Mods   ModMuls  Access  Insns   Ops     Time [µs]
Fast correlation                                          79384  -     22968  10463    339640  725324  838365  688.9
. 2D transform                                            32724  -     2304   2891     116059  224208  245428  197.4
. 1D transforms and elementwise multiply                  15744  -     4032   4656     90704   217482  273510  227.3
. . 16 horizontal 1D transforms                           3936   -     -      784      14693   17146   17303   17.5
. . 48 vertical 1D transforms and elementwise multiplication  11808  -  4032  4656     75895   200455  256371  209.4
. Inverse 2D transform                                    30916  -     5632   2916     128972  282037  317898  253.1
Find best match                                           7836   3588  -      -        18382   36788   40573   33.7
Total sum                                                 87220  3588  11968  10463    360209  764312  881107  723.8

Table 9. Exhaustive search with the SSD criterion
            Adds    Muls    Access  Insns    Ops      Time [µs]
Total sum   491071  246016  547464  1351775  1606582  1407.5

Table 10. Partial distortion elimination with the SSD criterion
                 Access  Insns    Ops      Time [µs]
Minimum          68872   143352   167794   136.8
Maximum          960813  2680624  4453971  1434.7
Typical average  268155  637858   754287   659.4

Table 11. Operation costs
                                          Data moves  Bit shifts  Bitwise logical operations  Additions  Time [clock cycles]
Adds/bit shifts                           –           0/1         –                           1/0        0.50
Muls                                      –           –           –                           –          1.95
Absolute value                            2           –           –                           2          1.45
Mods (q = 2^24+1)                         1           1           1                           2          2.96
ModMuls (q = 2^24+1)                      1           1           1                           1          6.49
Mods (q = 2^24−2^6+1)                     2           2           1                           2          3.07
ModMuls (q = 2^24−2^6+1), large inputs    4           5           2                           6          20.72
ModMuls (q = 2^24−2^6+1), small inputs    5           4           2                           3          17.97


The table legends are described here:

Adds Number of normal and congruent additions. Derived theoretically; does not include overhead such as index calculations. One addition is as fast as a simple bit shift, as shown in Table 11.

Muls Number of normal multiplications, where the input numbers are 8 bits long and the 16-bit result is needed. Derived theoretically; does not include overhead such as index calculations.

Mods Number of congruent reductions. Derived theoretically; does not count the reductions necessary after a multiplication to make the result fit in a single 32-bit register, except in the transform and inverse transform routines of Table 6, where the multiplication is a single bit shift which is actually combined with the congruent reduction.

ModMuls Number of multiplications modulo q. Derived theoretically. One operation includes not only the actual multiplication but also the necessary congruent reductions after the multiplication. For the 48-point algorithm one multiplication modulo q is more costly than for a 32-point algorithm.

In Table 6 there are no actual multiplications at all in the transform or inverse transform routines. Instead there are two kinds of "multiplication" operations: simple multiplications by powers of two (bit shifts), and congruent reductions, which are combined with the multiplications by a power of two. In the latter case the bit shifts are not counted in the table column.

Access Number of memory (or cache) accesses. Counted with the Athlon processor performance counters [60]. This depends quite much on the underlying hardware: an architecture with a lot of internal registers requires fewer memory accesses than an x86-compatible processor, which has few registers.

Insns Number of instructions executed. Counted with the performance counters; depends on the hardware architecture.

Ops Number of micro-operations executed. Counted with the performance counters; depends on the hardware architecture. This number should correspond better to RISC architectures with simple instructions.

Table 11 compares the complexity of the different operations. The operations mentioned in the table were executed consecutively 256 times and the average time of one operation was measured. The measured time is only a rough indication of the actual cost of the operation, because the context affects the instruction execution time quite much.

For example, even if the table shows that taking an absolute value is faster than a multiplication, the SSD criterion is nevertheless slightly faster with Telenor's encoder than the SAD criterion (which uses an absolute value operation instead of a multiplication). As can be seen, in the 48-point algorithm a multiplication modulo q is very slow due to the fact that the congruent reduction must be performed twice.

All of the algorithms, except the bit-shifting one, could be made about 9 % faster by premultiplying all constant multipliers by 256. This would remove one very costly two-register shift instruction (shld) [61]. This optimization is not shown in the tables. Nevertheless, it is obvious that the bit-shifting algorithm is superior to all the other algorithms that were tested, and saves over 83 % of the computation time as compared to the ESA with the SSD criterion.

The benchmarked 32-point algorithms use open loop motion estimation for efficiency (see section 4.2.1). If closed loop estimation were needed instead, it would be slightly faster to combine the transformed blocks in the search area (as in Figure 19) than to transform the full 32×32-pixel blocks directly without zero-padding. This can be estimated from Table 6: the direct transform without zero-padding would require 64 one-dimensional transforms, while with zero-padded blocks only 48 transforms would be required per estimated motion vector. It can be reckoned that transforming the full 32×32-pixel block would take 41.7 µs. If instead the blocks were combined in the transform domain, the transforms would need only 31.3 µs but the additions would consume an extra 7.2 µs, so only 3.2 µs would be saved, or 1.4 % of the total motion estimation time. However, fewer multiplications (or bit shifts with the bit-shifting algorithm) would be needed, because combining the blocks in the transform domain requires only additions. This might be more advantageous on certain architectures.

The 48-point algorithm minimizes the count of general multiplications, where neither of the multiplicands is a constant, but nonetheless it does not perform well, since the congruent reduction after a multiplication is in that case such a slow operation.

An approximate comparison between the different fast full search algorithms is given in Table 12, where a higher percentage means higher computation savings compared to ESA, and thus a faster algorithm. However, it should be noted that there are many more considerations than just the plain execution speed given in the table: many papers benchmark motion estimation methods with good-quality video sequences with little noise and often with little motion. In applications where these assumptions do not hold, the savings will be smaller for methods which are based on elimination by a lower bound. Furthermore, an ASIC will perform differently from a software implementation.

Table 12. Fast full search motion estimation algorithms
Algorithm                          Savings    NTT Algorithm   Savings
ESA                                0 %        Radix-2         43 %
PDE                                84–95 %    Bit-shifting    83 %
SEA                                88 %       Mixed radix     73 %
SEA with PDE                       97 %       WNTTA           49 %
MSEA                               95–98 %
Winner-Update Strategy             88–96 %
CBME                               84 %
Correlation via 2-D FIR filtering  77 %


6. DISCUSSION

Number theoretic transforms have been known since the seventies, and they have been used successfully in signal processing [56]. However, a major difficulty impeding the use of NTTs is the very inflexible relation between the transform and word lengths. It is particularly difficult to find a Fermat number transform for a long transform length.

The progress at the end of the twentieth century led to applications in digital image and video processing. The useful transform lengths became much shorter, especially due to block-based algorithms, than in e.g. audio processing. Also, the dynamic range of pixel intensities is smaller than the dynamic range of many other signals, such as high quality audio samples. Moreover, in this thesis non-prime moduli are applied, which relaxes the transform requirements.

Correlation computation via number theoretic transforms does not seem to be very beneficial on general purpose computers, since the arithmetic modulo q is difficult to perform efficiently. For computing the correlation, an efficient FFT algorithm or a fast polynomial transform might be more attractive. Although the bit-shifting radix-2 algorithm performs motion estimation generally much faster than ESA with the SSD or SAD criterion, or even the basic PDE with the SSD criterion, the bit-shifting algorithm probably benefits less from special instruction sets such as MMX, which were not used in any of the benchmarked programs. However, the bit-shifting algorithm might be better on some processors where no such SIMD-style (single instruction, multiple data) instruction sets are readily available.

On custom hardware, such as a field programmable gate array (FPGA) or an application specific integrated circuit (ASIC), number theoretic transforms could be very fast, and the radix-2 Fermat number transform could easily be used with modulus q = 2^32 + 1 (33 bits), transform kernel ω = 4 and transform length N = 32. Multiplying by powers of 4 is extremely simple (a bit shift) and the congruent reduction is easier to do efficiently than on a general purpose microprocessor [57]. Also the transform with q = 2^24 + 1, ω = 2^19 − 2^7, N = 32 could be efficient on custom hardware: although there are some more additions and bit shifts than in the Fermat number transform, 25 bits would suffice for storing the numbers. All addition and multiplication units would be smaller and conceivably less memory would be needed for the buffers.

The fast convolution algorithms are attractive for motion estimation in a real-time system with custom hardware, such as in a video phone. They are always executed in the same, deterministic amount of time. They possess a very regular flow of data, which makes implementation on an ASIC device simple and fast. A major problem has been the demand for floating point and complex numbers, which require a large silicon area for implementation. Using the number theoretic domain, the arithmetic is simplified considerably.

Simple arithmetic is also possible with other fast full search algorithms, which save computation by eliminating candidate motion vectors based on a criterion lower bound. However, the running time of these algorithms cannot be known in advance, so they are difficult to use in real-time systems. The execution time increases with video containing much motion or noise.

Improved versions of the algorithms appear to be somewhat faster than motion estimation via NTT, but the test sequences that appear in the literature are often non-typical for consumer mobile video phones, and more typical for high-end video conferencing systems. The fast full search algorithms based on a lower bound are also irregular and therefore not straightforward to implement in ASIC devices.

In this thesis, only software implementations of the different motion estimation algorithms were benchmarked and compared. In upcoming research, a hardware implementation of an NTT-based motion estimation algorithm will be investigated. A VHDL model of a motion estimation engine will be written, and its complexity and power consumption evaluated in simulations.

The motion estimation algorithms presented in the thesis consider only motion vectors with integer pixel displacements. In practice most standards, such as H.263 or MPEG, require motion vectors with half or quarter pixel precision. The integer vector coordinates may be refined into higher precision in two ways.

The most straightforward and high quality method is to interpolate the image blocks into a higher resolution by computing missing pixel values between the existing pixels at integer coordinates. Then any conventional block-matching algorithm can be used for computing a criterion around the integer-pixel coordinate at sub-pixel locations. Alternatively, it is possible to avoid interpolating the blocks by interpolating the criterion function directly, in which case much computation is saved at the cost of increased matching error [9].


7. CONCLUSIONS

Although many of the previously presented fast full search algorithms, Partial Distortion Elimination (PDE), Successive Elimination Algorithm (SEA), Multilevel Successive Elimination Algorithm (MSEA), Winner-Update Strategy, and the Category-based Block Motion Estimation algorithm (CBME), seem to perform better than NTT-based algorithms, their execution speed varies greatly depending on the input data. If the video sequence is very noisy or contains much motion, they perform significantly worse than otherwise. This makes the implementation of the methods for real-time encoders difficult.

On the other hand, NTT-based algorithms are extremely regular. Unlike other fast full search algorithms, or even conventional search strategies such as Three Step Search, correlation-based algorithms (such as NTT) have an absolutely regular data flow, and they are therefore most suitable for ASIC implementation in this respect. A rough comparison between the different fast full search algorithms is given in Table 12, but in many applications the savings will be smaller for the methods which are not based on correlation.

Additionally, from many papers it is not clear how the algorithms were implemented. Usually the benchmarks are given for general purpose microprocessor software implementations: this is the case for the NTT-based algorithms. For many algorithms, there are significant differences between implementations on different architectures, such as an ASIC or software.

The common factor in all NTT-based motion estimation algorithms is the requirement for congruent arithmetic. Beyond that, many different transforms can be used for block motion estimation. In the 32-point radix-2 transform algorithm, the fast Fourier transform is directly used in the number theoretic domain. The 32-point bit-shifting algorithm optimizes the algorithm specifically for the number theoretic transform, particularly by replacing multiplications with bit shifts.

The mixed radix algorithm applies first one radix-2 stage, which decomposes the 32-point transform into two 16-point transforms, which are then transformed with Winograd's short length algorithm. Finally, the WNTTA algorithm applies the 48-point WFTA in the number theoretic domain. The advantages of Winograd's algorithms are that they minimize the number of multiplications and group them in the center of the transform algorithm. This decreases the number of slow congruent reductions.

Number theoretic transforms are not ideal for software implementations, but they have been used successfully in the literature with custom hardware [57]. NTT-based motion estimation algorithms are promising for low-cost, low-bandwidth, and low power consumption video phone applications.


8. REFERENCES

[1] Gibson J. D., Berger T., Lookabaugh T., & Baker R. L. (1998) Digital Compression for Multimedia: Principles & Standards. Morgan Kaufmann, San Francisco, 478 pp.

[2] Tekalp A. (1995) Digital Video Processing. Prentice Hall, New Jersey, 526 pp.

[3] Stiller C. & Konrad J. (1999) Estimating Motion in Image Sequences—A tutorial on modeling and computation of 2D motion. IEEE Signal Processing Magazine 16 (4), pp. 70–91.

[4] Kuhn P. (1999) Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation. Kluwer Academic Publishers, Boston, 239 pp.

[5] Konrad J. (1999) Motion Detection and Estimation. In: Bovik A. (ed.) Image and Video Processing Handbook. Academic Press.

[6] Dufaux F. & Moscheni F. (1995) Motion Estimation Techniques for Digital TV: A Review and a New Contribution. Proceedings of the IEEE 83 (6), pp. 858–876.

[7] Wang H. & Mersereau R. (1999) Fast Algorithms for the Estimation of Motion Vectors. IEEE Transactions on Image Processing 8 (3), pp. 435–438.

[8] Cheung C. (1998) Fast Motion Estimation Techniques for Video Compression. Ph.D. thesis. City University of Hong Kong.

[9] Erol B., Kossentini F., & Alnuweiri H. (2000) Efficient Coding and Mapping Algorithms for Software-Only Real-Time Video Coding at Low Bit Rates. IEEE Transactions on Circuits and Systems for Video Technology 10 (6), pp. 843–854.

[10] Chahine M. & Konrad J. (1995) Estimation and Compensation of Accelerated Motion for Temporal Sequence Interpolation. Signal Processing: Image Communication 7, pp. 503–527.

[11] Pourreza H., Rahmati M., & Behazin F. (2000) Simple And Efficient Bit-Plane Matching Algorithms For Video Compression. In: Workshop on Real-Time Image Analysis, August 31–September 1, Oulu, Finland, pp. 33–42.

[12] Wong P. & Au O. (1999) Modified One-Bit Transform for Motion Estimation. IEEE Transactions on Circuits and Systems for Video Technology 9 (7), pp. 1020–1024.

[13] Hoang D., Long P., & Vitter J. (1998) Efficient Cost Measures for Motion Estimation at Low Bit Rates. IEEE Transactions on Circuits and Systems for Video Technology 8, pp. 488–500.

[14] Ghanbari M. (1990) The Cross-Search Algorithm for Motion Estimation. IEEE Transactions on Communications 38 (7), pp. 950–953.


[15] Po L. & Ma W. (1996) A Novel Four-Step Search Algorithm for Fast Block Motion Estimation. IEEE Transactions on Circuits and Systems for Video Technology 6 (3), pp. 313–317.

[16] Zhu S. & Ma K. (1997) A New Diamond Search Algorithm for Fast Block Matching Motion Estimation. In: International Conference on Information, Communications and Signal Processing, September 9–12, Singapore.

[17] Tham J., Ranganath S., Ranganath M., & Kassim A. (1998) A Novel Unrestricted Center-Biased Diamond Search Algorithm for Block Motion Estimation. IEEE Transactions on Circuits and Systems for Video Technology 8 (4), pp. 369–377.

[18] Zhu S. & Ma K. (2000) A New Diamond Search Algorithm for Fast Block-Matching Motion Estimation. IEEE Transactions on Image Processing 9 (2), pp. 287–290.

[19] Tourapis A., Shen G., Liou M., Au O., & Ahmad I. (2000) A New Predictive Diamond Search Algorithm for Block Based Motion Estimation. In: Proceedings of Visual Communications and Image Processing, June 20–23, Perth, Australia.

[20] Tourapis A., Au O., & Liou M. (1999) Fast Motion Estimation using Circular Zonal Search. In: Proceedings of Visual Communications and Image Processing, January 23–29, San Jose, California, USA.

[21] Tourapis A. & Au O. (1999) Fast Motion Estimation Using Modified Circular Zonal Search. In: Proceedings of IEEE International Symposium on Circuits and Systems, May 30–June 2, Orlando, Florida, USA, Vol. 4, pp. 231–234.

[22] Tourapis A., Au O., Liou M., Shen G., & Ahmad I. (2000) Optimizing the MPEG-4 Encoder—Advanced Diamond Zonal Search. IEEE International Symposium on Circuits and Systems, May, Geneva, Switzerland, Vol. 3, pp. 674–677.

[23] Liu B. & Zaccarin A. (1993) New fast algorithms for the estimation of block motion vectors. IEEE Transactions on Circuits and Systems for Video Technology 3 (2), pp. 148–157.

[24] Cheung C. & Po L. (2000) Normalized Partial Distortion Search Algorithm for Block Motion Estimation. IEEE Transactions on Circuits and Systems for Video Technology 10 (3), pp. 417–422.

[25] Kim J. & Choi T. (1999) Adaptive Matching Scan Algorithm Based on Gradient Magnitude for Fast Full Search in Motion Estimation. IEEE Transactions on Consumer Electronics 45 (3), pp. 762–772.

[26] Kim J. & Choi T. (2000) A Fast Full-Search Motion-Estimation Algorithm Using Representative Pixels and Adaptive Matching Scan. IEEE Transactions on Circuits and Systems for Video Technology 10 (7), pp. 1040–1048.


[27] Kim J. & Ahn B. (2001) Lossless Computational Reduction of Full Search Algorithm in Motion Estimation Using Appropriate Matching Unit from Image Localization. In: IEEE International Conference on Information Technology: Coding and Computing, April 2–4, Las Vegas, USA.

[28] Li W. & Salari E. (1995) Successive Elimination Algorithm for Motion Estimation. IEEE Transactions on Image Processing 4 (1), pp. 105–107.

[29] Wang Y. & Tu G. (2000) Successive Elimination Algorithm for Binary Block Matching Motion Estimation. Electronics Letters 36 (23), pp. 2007–2008.

[30] Lin Y. & Tai S. (1997) Fast Full-Search Block-Matching Algorithm for Motion-Compensated Video Compression. IEEE Transactions on Communications 45 (5), pp. 527–531.

[31] Oh T., Kim Y., Hong W., & Ko S. (2000) A Fast Full Search Motion Estimation Algorithm Using the Sum of Partial Norms. In: IEEE International Conference on Consumer Electronics, June 13–15, Los Angeles, pp. 236–237.

[32] Jung S., Shin S., Baik H., & Park M. (2000) Nobel Successive Elimination Algorithms for the Estimation of Motion Vectors. In: IEEE International Symposium on Multimedia Software Engineering, December 11–13, Tamkang University, Taipei, Taiwan, pp. 332–335.

[33] Coban M. Z. & Mersereau R. M. (1997) Computationally Efficient Exhaustive Search Algorithm for Rate-Constrained Motion Estimation. In: Proceedings of International Conference on Image Processing, October 26–29, Washington, DC, Vol. 1, pp. 101–104.

[34] Noguchi Y., Furukawa J., & Kiya H. (1999) A Fast Full Search Block Matching Algorithm for MPEG-4 Video. In: IEEE International Conference on Image Processing, October 24–28, Kobe, Japan, Vol. 1, pp. 61–65.

[35] Do V. L. & Yun K. Y. (1998) A Low-Power VLSI Architecture for Full-Search Block-Matching Motion Estimation. IEEE Transactions on Circuits and Systems for Video Technology 8 (4), pp. 393–398.

[36] Lee C. & Chen L. (1997) A Fast Motion Estimation Algorithm Based on the Block Sum Pyramid. IEEE Transactions on Image Processing 6 (11), pp. 1587–1591.

[37] Gao X. Q., Duanmu C. J., Zou C. R., & He Z. Y. (1999) Multi-Level Successive Elimination Algorithm for Motion Estimation in Video Coding. In: IEEE International Symposium on Circuits and Systems, Orlando, Florida, May 30–June 2, Vol. 4, pp. 227–230.

[38] Gao X. Q., Duanmu C. J., & Zou C. R. (2000) Multilevel Successive Elimination Algorithm for Block Matching Motion Estimation. IEEE Transactions on Image Processing 9 (3), pp. 501–504.


[39] Jung S., Shin S., Baik H., & Park M. (2000) Advanced Multilevel Successive Elimination Algorithms for Motion Estimation in Video Coding. Lecture Notes in Computer Science 1, Springer-Verlag, pp. 431–442.

[40] Lin C., Chang Y., & Chen Y. (1998) Hierarchical Motion Estimation Algorithm Based on Pyramidal Successive Elimination. In: Proceedings of International Computer Symposium, December 17–19, Tainan, Taiwan, pp. 41–44.

[41] Chen Y., Hung Y., & Fuh C. (2000) Fast Block Matching Algorithm Based on the Winner-Update Strategy. In: Proceedings of the Fourth Asian Conference on Computer Vision, January 8–11, Taipei, Taiwan, Vol. 2, pp. 977–982.

[42] Chen Y., Hung Y., & Fuh C. (2001) Fast Block Matching Algorithm Based on the Winner-Update Strategy. IEEE Transactions on Image Processing 10 (8), pp. 1212–1222.

[43] Mahmoud H. A. & Bayoumi M. (2000) A Low Power Architecture for a New Efficient Block-Matching Motion Estimation Algorithm. In: International Conference on Communication Technology Proceedings, August 21–23, Beijing, China, Vol. 2, pp. 1173–1179.

[44] Weiss M. A. (1999) Data Structures and Algorithm Analysis in Java. Addison-Wesley, 576 pp.

[45] Naito Y., Miyazaki T., & Kuroda I. (1996) A Fast Full-Search Motion Estimation Method for Programmable Processors with a Multiply-Accumulator. IEEE International Conference on Acoustics, Speech, and Signal Processing, May 7–10, Atlanta, Georgia, pp. 3221–3224.

[46] Chen C. & Duluk J. F. (1996) System and Method for Cross Correlation with Application to Video Motion Vector Estimator. United States Patent no. 5,535,288, July 9.

[47] Elliott D. F. & Rao K. R. (1982) Fast Transforms: Algorithms, Analyses, Applications. Academic Press, Orlando, 488 pp.

[48] Nussbaumer H. J. (1982) Fast Fourier Transform and Convolution Algorithms. Springer-Verlag, Berlin, 276 pp.

[49] Blahut R. E. (1985) Fast Algorithms for Digital Signal Processing. Addison-Wesley, 441 pp.

[50] Burrus C. S. & Parks T. W. (1985) DFT/FFT and Convolution Algorithms. John Wiley & Sons, 256 pp.

[51] Proakis J. G., Rader C. M., Ling F., & Nikias C. L. (1992) Advanced Digital Signal Processing. Macmillan.

[52] Winograd S. (1978) On Computing the Discrete Fourier Transform. Mathematics of Computation 32, pp. 175–199.


[53] Silverman H. F. (1977) An introduction to programming the Winograd Fourier transform algorithm. IEEE Transactions on Acoustics, Speech, and Signal Processing 25, pp. 152–165.

[54] Silverman H. F. (1989) Programming the WFTA for Two-Dimensional Data. IEEE Transactions on Acoustics, Speech, and Signal Processing 37, pp. 1425–1431.

[55] Rosen K. H. (1993) Elementary Number Theory And Its Applications. Addison-Wesley, 544 pp.

[56] McClellan J. H. & Rader C. M. (1979) Number Theory in Digital Signal Processing. Prentice-Hall, Englewood Cliffs, New Jersey.

[57] Alfredsson L. (1996) VLSI Architectures and Arithmetic Operations with Application to the Fermat Number Transform. Linköping Studies in Science and Technology 425 (dissertation), Linköping.

[58] Tommila M. (2000-1-22) Apfloat: A C++ High Performance Arbitrary Precision Arithmetic Package, version 2.31. URL: http://www.iki.fi/~mtommila/apfloat/.

[59] McVeigh J. S. & Wu S. (1994) Comparative Study of Partial Closed-loop Versus Open-loop Motion Estimation for Coding of HDTV. International Journal of Imaging Systems and Technology 5 (4), pp. 268–275.

[60] AMD Athlon Processor x86 Code Optimization Guide (2000). Advanced Micro Devices, 320 pp.

[61] IA-32 Intel Architecture Software Developer's Manual Volume 2: Instruction Set Reference (2000). Intel, 946 pp.

[62] Lillevold K. O. (2000-12-10) Telenor's H.263 software simulation model 2.0. URL: http://www.lillevold.com/tmn.htm.


A. SEA INEQUALITIES

Proof for (20) follows from the triangle inequality generalization

|∑_{k=1}^{N} a_k| ≤ ∑_{k=1}^{N} |a_k|.    (97)

For our purposes, a_k is one difference term in the SAD function (6), i.e. a_k = B_t(y,x) − C_τ(y,x) for some (y,x). With this substitution, we get

|∑_{y=0}^{B_h−1} ∑_{x=0}^{B_w−1} [B_t(y,x) − C_τ(y,x)]| ≤ ∑_{y=0}^{B_h−1} ∑_{x=0}^{B_w−1} |B_t(y,x) − C_τ(y,x)|.    (98)

The right side of (98) corresponds to the SAD value, and the left side (which is the lower bound for the SAD) can be rearranged to

|∑_{y=0}^{B_h−1} ∑_{x=0}^{B_w−1} B_t(y,x) − ∑_{y=0}^{B_h−1} ∑_{x=0}^{B_w−1} C_τ(y,x)| ≤ SAD(c_y, c_x)    (99)

which completes the proof for (20), presented also in [28]. Proof for (22) follows similarly from the proof of

(1/N) (∑_{k=1}^{N} a_k)^2 ≤ ∑_{k=1}^{N} a_k^2.    (100)

Let us assume that all a_k are real. In this case, for all pairs of i and j,

(a_i − a_j)^2 ≥ 0    (101)
a_i^2 + a_j^2 − 2 a_i a_j ≥ 0    (102)
(1/2) (a_i^2 + a_j^2) ≥ a_i a_j    (103)

by expanding the square. Let us then sum together the left and right sides of all N×N inequalities, one for each pair of i and j, 1 ≤ i, j ≤ N:

(1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} (a_i^2 + a_j^2) ≥ ∑_{i=1}^{N} ∑_{j=1}^{N} a_i a_j    (104)

On the left side, there are 2N^2 sum terms (two terms for each i, j pair). Each particular a_k^2, for some fixed k, appears 2N times in the sum. This can be best illustrated by imagining an N×N-element matrix, in which element i, j contains the sum a_i^2 + a_j^2, as in Figure 24. In this matrix there is a single complete row and column of elements which contain the term a_k^2. Thus the left side can be presented as in inequality (105).

Similarly, the right side contains a sum with N^2 terms, each being a product of two a_k. In all elements of a row in a matrix containing the products there is a common factor a_i that multiplies the sum of all the other elements a_j in the row, j = 1...N.


a_1^2+a_1^2  a_1^2+a_2^2  a_1^2+a_3^2  a_1^2+a_4^2
a_2^2+a_1^2  a_2^2+a_2^2  a_2^2+a_3^2  a_2^2+a_4^2
a_3^2+a_1^2  a_3^2+a_2^2  a_3^2+a_3^2  a_3^2+a_4^2
a_4^2+a_1^2  a_4^2+a_2^2  a_4^2+a_3^2  a_4^2+a_4^2

Figure 24. The 2N sum terms for each a_k, N = 4; for a fixed k, the term a_k^2 appears 2N = 8 times.

By factoring, it can be seen that the right side corresponds to (∑_{i=1}^{N} a_i)(∑_{j=1}^{N} a_j), and this gives the right side of inequality

(1/2) ∑_{k=1}^{N} 2N a_k^2 ≥ (∑_{k=1}^{N} a_k)^2.    (105)

This completes our proof for (100). Both [7] and [30] show this result for using the SSD criterion with the SEA.
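Inequality (100) is also easy to check numerically; the following small C program (purely illustrative, with values mimicking pixel differences) evaluates both sides for random data:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    enum { N = 256 };
    double sum = 0.0, sumsq = 0.0;
    srand(1);
    for (int k = 0; k < N; k++) {
        double a = (double)(rand() % 511) - 255.0;  /* differences in -255..255 */
        sum   += a;
        sumsq += a * a;
    }
    /* (1/N) * (sum a_k)^2 should never exceed sum a_k^2 */
    printf("lower bound %.1f <= SSD %.1f\n", sum * sum / N, sumsq);
    return 0;
}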


B. INVERSIBILITY OF AN NTT

Since the correlation is computed in the transform domain, an inverse transform needs to be done to obtain the spatial domain correlation. This raises the question: when is an NTT inversible? The answer can be easily given: when the inverse of the determinant of the transform matrix T_N in (65) exists. However, this needs to be simplified to give a result that is more easily evaluated.

The matrix T_N is a special kind of Vandermonde matrix, and by applying the Vandermonde determinant it can easily be seen that

|T_N| = ∏_{i=0}^{N−1} ∏_{j=i+1}^{N−1} (ω^j − ω^i)    (106)

which gives |T_N| in a factored form. For |T_N| to be inversible, each factor must also be inversible. From (106) the factors are

ω^i (ω^{j−i} − 1) = ω^{j−k} (ω^k − 1)    (107)

where k = j − i, i ∈ [0, N−1], j ∈ [i+1, N−1] and k ∈ [1, N−1]. Since ω must be relatively prime^10 to q, all powers ω^i are also, and therefore ω^i is inversible modulo q. This yields that T_N is inversible exactly when gcd(ω^k − 1, q) = 1 for every k ∈ [1, N−1].

For the Fourier transform the only element in the complex plane which is not inversible is 0, and since (107) is never zero for the Fourier transform, it is always inversible.

The convolution property does not directly follow from the inversibility. That is considered in [51].

10 If it were not, ω^N ≢ 1 for any N ≠ 0 and the inverse ω^{−1} would not exist. From (70), ω^{−1} ≡ ω^{N−1} (mod q).
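The condition derived above is simple to test for a candidate transform. A sketch (using the parameters q = 2^24 + 1, ω = 524160 and N = 32 from the text; the helper names are illustrative):

#include <stdio.h>
#include <stdint.h>

static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
    while (b != 0) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

int main(void)
{
    const uint64_t q = (1ull << 24) + 1;
    const uint64_t w = 524160;
    const int      N = 32;
    uint64_t p = 1;
    for (int k = 1; k < N; k++) {
        p = p * w % q;                           /* omega^k mod q       */
        if (gcd_u64((p + q - 1) % q, q) != 1) {  /* gcd(omega^k - 1, q) */
            printf("not inversible at k = %d\n", k);
            return 1;
        }
    }
    printf("T_N is inversible for this q, omega and N\n");
    return 0;
}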


C. THE EUCLIDEAN ALGORITHM

The Euclidean algorithm finds the greatest common divisor (gcd) of two integers. It can also be used to find the inverse of a number modulo some q. The gcd function has the following property:

gcd(a, b) = gcd(b, a − nb)    (108)

for any arbitrary integers a, b and n. When n = 0, gcd(a, b) = gcd(b, a) and we may always order a and b so that a ≥ b. Let's do this and let d = ⌊a/b⌋, the integer part of a divided by b, and r the remainder of the division. Now a = bd + r and

gcd(a, b) = gcd(bd + r, b) = gcd(b, r)    (109)

using (108) to subtract d times b from a so that only the remainder is left. Since r is less than b, this process can be repeated until the remainder is zero. Then, for the last pair of a and b, b divides a and the greatest common divisor is b.

If gcd(a, b) = 1, the inverse b^{−1} (mod a) exists. For finding it, let t_0 = 0 and t_1 = 1. Each time after applying (109), except the first time, compute recursively

t_n = d_{n−1} t_{n−1} + t_{n−2}    (110)

where d_n is the quotient at the nth step, taken with a negative sign (as in the numerical example below). At the last step, when the remainder is zero, the value of t_n is the inverse b^{−1} mod a.

Numerical Example. Find the greatest common divisor of 39 and 25 and the inverse of 25 mod 39.

Step 1: gcd(39, 25) = gcd(25, 39 − 1·25)
Step 2: gcd(25, 14) = gcd(14, 25 − 1·14)
Step 3: gcd(14, 11) = gcd(11, 14 − 1·11)
Step 4: gcd(11, 3) = gcd(3, 11 − 3·3)
Step 5: gcd(3, 2) = gcd(2, 3 − 1·2)
Step 6: gcd(2, 1) = 1

The process stops at gcd(2, 1) because 1 divides 2 and the remainder is zero. So we have found that gcd(39, 25) = 1. Let's now find the inverse 25^{−1} (mod 39).

t_0 = 0
t_1 = 1
t_2 = d_1 t_1 + t_0 = −1·1 + 0 = −1
t_3 = d_2 t_2 + t_1 = −1·(−1) + 1 = 2
t_4 = d_3 t_3 + t_2 = −1·2 − 1 = −3
t_5 = d_4 t_4 + t_3 = −3·(−3) + 2 = 11
t_6 = d_5 t_5 + t_4 = −1·11 − 3 = −14

To obtain a positive result, the modulus 39 may be added once, yielding −14 + 39 = 25. This is the inverse, which can be easily verified: 25·25^{−1} = 25·25 ≡ 1 (mod 39).
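The same computation can be written compactly as a routine. A sketch of the extended Euclidean algorithm in C (the function name mod_inverse is not from the thesis):

#include <stdio.h>
#include <stdint.h>

/* Return b^{-1} (mod a), assuming gcd(a, b) = 1. */
static int64_t mod_inverse(int64_t a, int64_t b)
{
    int64_t t_prev = 0, t = 1;      /* t_0 = 0, t_1 = 1      */
    int64_t r_prev = a, r = b;
    while (r != 0) {
        int64_t d = r_prev / r;     /* quotient of this step */
        int64_t tmp;
        tmp = t_prev - d * t;  t_prev = t;  t = tmp;
        tmp = r_prev - d * r;  r_prev = r;  r = tmp;
    }
    return (t_prev % a + a) % a;    /* make the result positive */
}

int main(void)
{
    /* The numerical example above: prints 25, since 25*25 = 625 = 16*39 + 1. */
    printf("%lld\n", (long long)mod_inverse(39, 25));
    return 0;
}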


D. SOME SHORT LENGTH WINOGRAD FOURIER TRANSFORM ALGORITHMS

u = 2π/N. diag(x) denotes a diagonal square matrix whose diagonal elements are taken from the vector x. In the algorithms shown below, indices start from 1 (unlike in the text).

D.1. N = 3

B3 =
1  0  0
1  1  1
1  1 −1

D3 =
1  0          0
0  cos u − 1  0
0  0          −i sin u

A3 =
1  1  1
0  1  1
0  1 −1

t1 = x(2) + x(3);
m0 = x(1) + t1;
m1 = (cos(u)-1) * t1;
m2 = -i*sin(u) * (x(2) - x(3));
s1 = m0 + m1;
X(1) = m0;
X(2) = s1 + m2;
X(3) = s1 - m2;

Transpose: B3^T

s1 = x(2) + x(3);
y(1) = x(1) + s1;
y(2) = s1;
y(3) = x(2) - x(3);

Transpose: A3^T

t1 = x(1) + x(2);
y(1) = x(1);
y(2) = t1 + x(3);
y(3) = t1 - x(3);


D.2. N = 8

B8 =
1  0  0  0  0  0  0  0
0  0  0  1  1  0  1  1
0  0  1  0  0  1  0  0
0  0  0  1 −1  0 −1  1
0  1  0  0  0  0  0  0
0  0  0  1 −1  0  1 −1
0  0  1  0  0 −1  0  0
0  0  0  1  1  0 −1 −1

D8 = diag(1, 1, 1, 1, cos u, −i, −i, −i sin u)

A8 =
1  1  1  1  1  1  1  1
1 −1  1 −1  1 −1  1 −1
1  0 −1  0  1  0 −1  0
1  0  0  0 −1  0  0  0
0  1  0 −1  0 −1  0  1
0  1  0 −1  0  1  0 −1
0  0  1  0  0  0 −1  0
0  1  0  1  0 −1  0 −1

t1 = x(1) + x(5);
t2 = x(3) + x(7);
t3 = x(2) + x(6);
t4 = x(2) - x(6);
t5 = x(4) + x(8);
t6 = x(4) - x(8);
t7 = t1 + t2;
t8 = t3 + t5;
m0 = t7 + t8;
m1 = t7 - t8;
m2 = t1 - t2;
m3 = x(1) - x(5);
m4 = cos(u) * (t4-t6);
m5 = -i * (t3-t5);
m6 = -i * (x(3)-x(7));
m7 = -i*sin(u) * (t4+t6);
s1 = m3 + m4;
s2 = m3 - m4;
s3 = m6 + m7;
s4 = m6 - m7;
X(1) = m0;
X(2) = s1 + s3;
X(3) = m2 + m5;
X(4) = s2 - s4;
X(5) = m1;
X(6) = s2 + s4;
X(7) = m2 - m5;
X(8) = s1 - s3;

Transpose: B8^T

s1 = x(2) + x(4);
s2 = x(2) - x(4);
s3 = x(6) + x(8);
s4 = x(8) - x(6);
y(1) = x(1);
y(2) = x(5);
y(3) = x(3) + x(7);
y(4) = s1 + s3;
y(5) = s2 + s4;
y(6) = x(3) - x(7);
y(7) = s2 - s4;
y(8) = s1 - s3;

Transpose: A8^T

t1 = x(1) + x(2);
t2 = x(1) - x(2);
t3 = x(5) + x(6);
t4 = x(6) - x(5);
t5 = t1 + x(3);
t6 = t1 - x(3);
t7 = t2 + x(8);
t8 = t2 - x(8);
y(1) = t5 + x(4);
y(2) = t7 + t3;
y(3) = t6 + x(7);
y(4) = t7 - t3;
y(5) = t5 - x(4);
y(6) = t8 + t4;
y(7) = t6 - x(7);
y(8) = t8 - t4;


D.3. N = 16

B16 =

1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 1 0 1 −1 1 0 0 0 1 0 1 1 1 00 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0 00 0 0 0 1 0 −1 1 0 −1 0 0 −1 0 1 1 0 −10 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 00 0 0 0 1 0 −1 −1 0 1 0 0 1 0 −1 1 0 −10 0 0 1 0 −1 0 0 0 0 0 −1 0 1 0 0 0 00 0 0 0 1 0 1 1 −1 0 0 0 −1 0 −1 1 1 00 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 1 0 1 1 −1 0 0 0 1 0 1 −1 −1 00 0 0 1 0 −1 0 0 0 0 0 1 0 −1 0 0 0 00 0 0 0 1 0 −1 −1 0 1 0 0 −1 0 1 −1 0 10 0 1 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 00 0 0 0 1 0 −1 1 0 −1 0 0 1 0 −1 −1 0 10 0 0 1 0 1 0 0 0 0 0 −1 0 −1 0 0 0 00 0 0 0 1 0 1 −1 1 0 0 0 −1 0 −1 −1 −1 0

A16 =

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11 −1 1 −1 1 −1 1 −1 1 −1 1 −1 1 −1 1 −11 0 −1 0 1 0 −1 0 1 0 −1 0 1 0 −1 01 0 0 0 −1 0 0 0 1 0 0 0 −1 0 0 01 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 00 1 0 −1 0 −1 0 1 0 1 0 −1 0 −1 0 10 0 1 0 0 0 −1 0 0 0 −1 0 0 0 1 00 1 0 −1 0 1 0 −1 0 −1 0 1 0 −1 0 10 1 0 0 0 0 0 −1 0 −1 0 0 0 0 0 10 0 0 −1 0 1 0 0 0 0 0 1 0 −1 0 00 1 0 −1 0 1 0 −1 0 1 0 −1 0 1 0 −10 0 1 0 0 0 −1 0 0 0 1 0 0 0 −1 00 0 0 0 1 0 0 0 0 0 0 0 −1 0 0 00 1 0 1 0 −1 0 −1 0 1 0 1 0 −1 0 −10 0 1 0 0 0 1 0 0 0 −1 0 0 0 −1 00 1 0 1 0 1 0 1 0 −1 0 −1 0 −1 0 −10 1 0 0 0 0 0 1 0 −1 0 0 0 0 0 −10 0 0 1 0 1 0 0 0 0 0 −1 0 −1 0 0


D16 = diag(1, 1, 1, 1, 1, cos 2u, cos 2u, cos 3u, cos u + cos 3u, cos 3u − cos u, −i, −i, −i, −i sin 2u, −i sin 2u, −i sin 3u, −i (sin u − sin 3u), −i (sin u + sin 3u))

t1 = x(1) + x(9);t2 = x(5) + x(13);t3 = x(3) + x(11);t4 = x(3) - x(11);t5 = x(7) + x(15);t6 = x(7) - x(15);t7 = x(2) + x(10);t8 = x(2) - x(10);t9 = x(4) + x(12);t10 = x(4) - x(12);t11 = x(6) + x(14);t12 = x(6) - x(14);t13 = x(8) + x(16);t14 = x(8) - x(16);t15 = t1 + t2;t16 = t3 + t5;t17 = t15 + t16;t18 = t7 + t11;t19 = t7 - t11;t20 = t9 + t13;t21 = t9 - t13;t22 = t18 + t20;t23 = t8 + t14;t24 = t8 - t14;t25 = t10 + t12;t26 = t12 - t10;m0 = t17 + t22;m1 = t17 - t22;m2 = t15 - t16;m3 = t1 - t2;m4 = x(1) - x(9);

m5 = cos(2*u) * (t19-t21);m6 = cos(2*u) * (t4-t6);m7 = cos(3*u) * (t24+t26);m8 = (cos(u)+cos(3*u)) * t24;m9 = (cos(3*u)-cos(u)) * t26;m10 = -i * (t18-t20);m11 = -i * (t3-t5);m12 = -i * (x(5)-x(13));m13 = -i*sin(2*u) * (t19+t21);m14 = -i*sin(2*u) * (t4+t6);m15 = -i*sin(3*u) * (t23+t25);m16 = -i*(sin(u)-sin(3*u)) * t23;m17 = -i*(sin(u)+sin(3*u)) * t25;s1 = m3 + m5;s2 = m3 - m5;s3 = m11 + m13;s4 = m13 - m11;s5 = m4 + m6;s6 = m4 - m6;s7 = m8 - m7;s8 = m9 - m7;s9 = s5 + s7;s10 = s5 - s7;s11 = s6 + s8;s12 = s6 - s8;s13 = m12 + m14;s14 = m12 - m14;s15 = m15 + m16;s16 = m15 - m17;s17 = s13 + s15;s18 = s13 - s15;s19 = s14 + s16;s20 = s14 - s16;X(1) = m0;X(2) = s9 + s17;X(3) = s1 + s3;X(4) = s12 - s20;X(5) = m2 + m10;X(6) = s11 + s19;X(7) = s2 + s4;X(8) = s10 - s18;X(9) = m1;X(10) = s10 + s18;X(11) = s2 - s4;X(12) = s11 - s19;X(13) = m2 - m10;X(14) = s12 + s20;X(15) = s1 - s3;X(16) = s9 - s17;


Transpose: B16^T

s1 = x(2) + x(16);s2 = x(2) - x(16);s3 = x(3) + x(15);s4 = x(3) - x(15);s5 = x(4) + x(14);s6 = x(4) - x(14);s7 = x(6) + x(12);s8 = x(6) - x(12);s9 = x(7) + x(11);s10 = x(11) - x(7);s11 = x(10) + x(8);s12 = x(10) - x(8);s13 = s1 + s11;s14 = s1 - s11;s15 = s2 + s12;s16 = s2 - s12;s17 = s5 + s7;s18 = s5 - s7;s19 = s8 - s6;s20 = s8 + s6;y(1) = x(1);y(2) = x(9);y(3) = x(5) + x(13);y(4) = s3 + s9;y(5) = s13 + s17;y(6) = s3 - s9;y(7) = s13 - s17;y(8) = s18 - s14;y(9) = s14;y(10) = -s18;y(11) = x(5) - x(13);y(12) = s4 + s10;y(13) = s19 + s15;y(14) = s4 - s10;y(15) = s15 - s19;y(16) = s16 + s20;y(17) = s16;y(18) = -s20;

Transpose: A16^T

t1 = x(1) + x(2);t2 = x(1) - x(2);t3 = x(3) + x(4);t4 = x(3) - x(4);t5 = x(7) + x(3);t6 = x(7) - x(3);t7 = x(6) + x(8);t8 = x(8) - x(6);t9 = t1 + t3;t10 = t2 + t7 + x(9);t11 = t1 + t6;t12 = t2 - t7 - x(10);t13 = t1 + t4;t14 = t2 + t8 + x(10);t15 = t1 - t5;t16 = t2 - t8 - x(9);t17 = x(11) + x(14);t18 = x(14) - x(11);t19 = x(15) + x(12);t20 = x(15) - x(12);t21 = x(17) + x(16);t22 = x(16) + x(18);t23 = t21 + t17;t24 = t22 + t18;t25 = t22 - t18;t26 = t21 - t17;y(1) = t9 + x(5);y(2) = t10 + t23;y(3) = t11 + t19;y(4) = t12 + t24;y(5) = t13 + x(13);y(6) = t14 + t25;y(7) = t15 + t20;y(8) = t16 + t26;y(9) = t9 - x(5);y(10) = t16 - t26;y(11) = t15 - t20;y(12) = t14 - t25;y(13) = t13 - x(13);y(14) = t12 - t24;y(15) = t11 - t19;y(16) = t10 - t23;


E. WFTA INDEX PERMUTATION

function [Ri, Ro] = wfta_reorder(fac)
% [Ri, Ro] = wfta_reorder(fac): generate reordering vectors for WFTA
% Input fac is a vector of Winograd Fourier transform decomposition lengths,
% for example fac = [ 16 3 ] means using an inner factorization of 16
% and an outer factorization of 3, i.e. a length 3*16=48 transform.
% The order of factorization matters: fac=[16 3] is different from fac=[3 16].
% Returns Ri, the input reordering vector
%         Ro, the output reordering vector
% Note: currently no more than two factors are supported, so length(fac)<=2

if length(fac)==1
  % No mapping necessary for this case
  Ri = (1:fac)';
  Ro = (1:fac)';
  return
end

% From C.S. Burrus, T.W. Parks: DFT/FFT and Convolution Algorithms, 1985,
% around page 62, eq. 2.111

if length(fac) > 2
  error('Only lengths less than or equal to 2 currently supported');
end

N1 = fac(1);
N2 = fac(2);
N = N1*N2;
Ri = zeros(N,1);
Ro = zeros(N,1);
Rf = zeros(N,1);

K1 = N2;
K2 = N1;
i = 1;
for n2=0:N2-1
  for n1=0:N1-1
    n = mod(K1*n1+K2*n2,N);
    Rf(i) = n;
    i = i + 1;
  end
end

%---

K3 = 1;
while mod(K3*N2,N1) ~= 1
  K3 = K3+1;
end
K3 = K3*N2;

K4 = 1;
while mod(K4*N1,N2) ~= 1
  K4 = K4+1;
end
K4 = K4*N1;

i = 1;
for k2=0:N2-1
  for k1=0:N1-1
    k = mod(K3*k1+K4*k2,N);
    Ri(i) = k;
    i = i + 1;
  end
end

% Convert to Matlab 1-based indices
Rf = Rf+1;
Ri = Ri+1;

% Rf is a map from Fourier coefficients to permuted Fourier coefficients,
% but we want here a map from permuted coefficients to actual coefficients.
% Therefore, we invert the mapping here (which is possible since it's 1->1).
Ro = inv_map(Rf);

function im = inv_map(m)
% im = inv_map(m): invert mapping vector m
% m is assumed to be a 1-based mapping vector,
% for example m = [ 1 2 3 4 ] is the length-4 identity
% mapping, or m = [ 4 3 2 1 ] would reverse the elements
% of a length-4 vector v. Each element in the mapping vector m
% should be an integer between 1..length(m), and there should
% be only one of each number.
% To map a vector using a mapping vector in Matlab, use "v(m)".
%
% Example: let's assume we have a transform T as follows:
%   X = T * x;
% and suppose we have the same transform, but permuted, so that
%   xp = x(Ri);
%   Xp = Tp * xp;
%   X = Xp(Ro);
% is the same X as before. Now to convert T to Tp, use
%   T = Tp(Ro,inv_map(Ri));
% and to convert Tp to T, use
%   Tp = T(inv_map(Ro),Ri);

[h,w] = size(m);
im = zeros(h,w);
len = length(m);
for i=1:len
  im(m(i)) = i;
end

if max(im==0)==1 % Is any of the elements in im zero?
  % yes, some element is zero
  error('Given map vector is not a valid mapping');
end