Linkoping Studies in Science and Technology
Dissertation No. 683

Hierarchical Structures and Extended Motion Information for Video Coding

Astrid Lundmark

Department of Electrical Engineering
Linkoping University, SE-581 83 Linkoping, Sweden

Linkoping, March 2001

ISBN 91-7219-997-0
ISSN 0345-7524

Printed in Sweden by UniTryck, Linkoping 2001

To My Mother,

Britta Lundmark,

and My Boys,

Jorgen, Erik and Johan Skold

Abstract

The thesis describes usage and generation of motion information in the field of video coding. Hierarchical structures for computation of this motion information are suggested. These hierarchical structures can also be used for other applications.

One part of the motion information consists of vectors indicating the displacement between successive frames in an image sequence. These vectors are frequently used in existing video coding schemes. Algorithms for calculating the vectors are described and analyzed, and it is shown how the computational complexity can be lowered for some of the algorithms.

The other part of the motion information is local certainty, which gives an indication of the probability that the vector corresponds to the true displacement. The certainty part of the motion estimation has not often been utilized in the field of video coding. In the thesis, different certainty measures are evaluated. Two applications of the certainty measures are examined: pre-filtering for noise reduction, and reduction of the number of bits required to represent the video sequence in a scalable wavelet coding scheme.

Preface

This thesis is based on research work carried out at the Image Coding Group, Dept. of Electrical Engineering, Linkoping University during January 1990 – February 2001. From the starting point, which was motion estimation with certainty measures, interest developed towards hierarchical structures and their interrelationship with hierarchical algorithms. The application has been visual communication; therefore the properties of the human visual system have been considered along the way.

Material from the following publications has been included in the thesis:

Chapter 2: Hierarchical Image Structures

• Astrid Lundmark “2D-Frequency Sensitivity Adapted Subband Decomposition”, Picture Coding Symposium, March 1993.

• Astrid Lundmark, Niclas Wadstromer and Haibo Li “Recursive Subdivisions of the Plane Yielding Nearly Hexagonal Regions”, RadioVetenskap och Kommunikation, June 1999.

• Astrid Lundmark, Niclas Wadstromer and Haibo Li “Hierarchical Subsampling Giving Fractal Regions”, IEEE Transactions on Image Processing, January 2001.

Chapter 3: Motion Estimation

• Astrid Lundmark “Non-Overlapping Search Pattern for Logarithmic Search Motion Estimation”, In the review process of Internat. Symposium on Video Processing and Multimedia Communications, June 2001.

Chapter 4: Certainty Measures

• Astrid Lundmark and Torbjorn Kronander “Using Certainty of Motion Estimation in Image Coding”, Picture Coding Symposium, August 1991.

• Astrid Lundmark, Haibo Li and Robert Forchheimer “A Wavelet Image Coding Scheme using Frequency Domain Backward Motion Estimation and Certainty of Motion Vectors”, Scandinavian Conference on Image Analysis, June 1999.

Chapter 5: Scalable Video Coding

• Astrid Lundmark, Haibo Li and Robert Forchheimer “Motion Vector Certainty Reduces Bit Rate in Backward Motion Estimation Video Coding”, SPIE Visual Communications and Image Processing 2000, June 2000.

The following publications contain material that is related to but not included in the thesis:

• Haibo Li, Astrid Lundmark and Robert Forchheimer “Image Sequence Coding at Very Low Bitrates: A Review”, IEEE Transactions on Image Processing, September 1994.

• Haibo Li, Astrid Lundmark and Robert Forchheimer “Low-Entropy Motion Estimation For Very Low Bitrate Video”, Internat. Workshop on Coding Techniques for Very Low Bit-Rate Video, November 1995.

• Haibo Li, Lena Klasen, Jorgen Ahlberg, Jacob Strom, Astrid Lundmark, Franck Davoine and Robert Forchheimer “Understanding of Human Images”, Swedish Society for Automated Image Analysis Symposium on Image Processing, March 1997.

• Haibo Li, Gregory Godart, Astrid Lundmark and Robert Forchheimer, “Segmentation for Transmission”, ACTS Concertation Meeting Workshop, June 1997.

• Haibo Li, Gang Wang and Astrid Lundmark, “Transmission and Display of Real-Time IP Video”, RadioVetenskap och Kommunikation, Proceedings, June 1999.

• Haibo Li, Astrid Lundmark and Robert Forchheimer, “Video Based Emotional State Estimation”, Internat. Workshop on Synthetic-Natural Hybrid Coding and 3D-Imaging, September 1999.

Acknowledgements

First I would like to thank my supervisor, Robert Forchheimer, for giving me the opportunity to do research in image coding, and for all our interesting discussions. Robert has the gift of being able to understand also things that are not yet well formulated, which makes idea exchange a true pleasure! I would also like to thank Haibo Li, who has helped and encouraged me enormously and who has come up with many of the ideas that are described in this work.

There are many more people that deserve a warm thank you, like Ingemar Ingemarsson, who was initially my supervisor, Torbjorn Kronander, who introduced me to the subject of motion estimation, and Ragnar Nohre for sharing his knowledge and insights. Many thanks also to all my friends and colleagues at the Image Coding Group and other divisions of the Department of Electrical Engineering. An extra thanks to Niclas Wadstromer for his help with proof reading.

Immediately before starting PhD studies at the Image Coding Group, I had the privilege of staying at the image processing laboratory headed by Olaf Kubler at ETH, Zurich. I am grateful to everybody at the lab for making this a both educating and pleasant experience.

I would also like to thank the group involved in the “HD-DIVINE” project, especially Olle Franceschi, Pia Marklund, and Goran Magnusson, for giving me the opportunity to participate in a “real-life” video coding project.

Last, but not least, I thank my husband Jorgen and my sons Erik and Johan for their love and support.

Linkoping, March 2001

Contents

1 Introduction
  1.1 Disposition of the Thesis
    1.1.1 Contributions
    1.1.2 The Interconnectedness of All Things
  1.2 Image Sequence Coding
    1.2.1 Image Sensing
    1.2.2 Pre-filtering
    1.2.3 Coding
    1.2.4 Decoding, Post-processing and Display
  1.3 Properties of The Human Visual System
    1.3.1 Physical Mechanisms of Vision
    1.3.2 Stationary Properties
    1.3.3 Dynamic Properties
    1.3.4 Hierarchical Properties

2 Hierarchical Image Structures
  2.1 An HVS-Based Hierarchical Representation
    2.1.1 Motivations
    2.1.2 Implementation of the HVS-based Hierarchical Representation
  2.2 Hierarchical Subsampling Giving Fractal Regions
    2.2.1 Motivations
    2.2.2 Subdivisions on a Cartesian Grid
    2.2.3 Subdivisions on a Hexagonal Grid
    2.2.4 Spatially Varying Resolution
  2.3 Conclusions on Hierarchical Image Structures

3 Motion Estimation
  3.1 Objectives and Notation
  3.2 Required Accuracy of Motion Vectors
  3.3 Algorithms Based on Spatial Matching
    3.3.1 Full Search
    3.3.2 Hierarchical Search
    3.3.3 Reduced Search Strategies
    3.3.4 Comparison of search strategy efficiencies
  3.4 Image Displacement Subband Evaluation Matching
  3.5 Frequency Domain Matching Algorithms
  3.6 Optical Flow Algorithms
    3.6.1 Direct Methods
    3.6.2 Iterative Methods
  3.7 Bayesian Techniques
  3.8 Analysis of the Algorithms
    3.8.1 Introduction
    3.8.2 Problematic scenes
    3.8.3 Noise effects
    3.8.4 Two dimensions
    3.8.5 Maximum Likelihood versus Maximum A Posteriori Estimation
    3.8.6 Choice of Match Criterion
    3.8.7 Adaptation of Matching Criterion to Noise Distribution
    3.8.8 Matching and Algorithms based on the Optical Flow Equation - A Short Summary

4 Certainty Measures
  4.1 Certainty Measures for Block Matching
  4.2 Certainty Measures for Matching in the Frequency Domain
  4.3 Certainty Measures for Motion Estimation based on Gradients
  4.4 Evaluations of Certainty Measures
  4.5 Applications of Certainty Measures
    4.5.1 Motion Information in Prefiltering
    4.5.2 Motion Information in Coding/Decoding
    4.5.3 Motion Information in Postfiltering

5 Scalable Video Coding
  5.1 A Wavelet Coder using FDBME
    5.1.1 Background
    5.1.2 The Frequency Domain Backward Motion Estimation (FDBME) scheme
    5.1.3 Motion Vector Certainty (MVC)
    5.1.4 Implementation
    5.1.5 Results
    5.1.6 System Design
    5.1.7 Summary of the FDBME Scheme
  5.2 Time Scalability and B-BME

6 Conclusions

A Discrete pdf’s from Continuous Ones

Bibliography

1 Introduction

1.1 Disposition of the Thesis

To familiarize the reader with the topics of video coding and the properties of the human visual system, sections 1.2 and 1.3 will provide overviews of the fields. The presentations are focused towards the effects of motion, since much of the thesis work has been devoted to motion estimation and the errors that occur in motion vector estimates.

Hierarchical image representations will be the topic of chapter 2. Representation of video material is the key to efficient transmission, storage, retrieval, and processing of the visual information. Hierarchical structures have advantages in all of these fields. An example of a structure for building an image pyramid where the human visual system’s sensitivity to orientation of lines and edges is respected is given in section 2.1. A scheme for two-dimensional subsampling which yields fractal regions is introduced and analysed in section 2.2. The scheme is interesting because it provides more degrees of freedom compared to the decomposition that is done independently in the horizontal and vertical directions. It is also well suited for hierarchical representations on the hexagonal grid, where the hexagonal shape of the picture elements in the original representation can be well approximated also at higher levels of representation.

Motion estimation algorithms will be described and analyzed in Chapter 3. We will here see the strong interconnectedness between image representation and algorithm performance.

Certainty measures for motion estimation and their usage will be the topic of Chapter 4.

Scalable video coding is treated in chapter 5, where an image coding system using wavelet coding, frequency domain backward motion estimation, and certainty of motion vector estimates is described.

1.1.1 Contributions

The contributions of this work are:

• Derivation of the required maximum length of motion vectors on page 19.

• A hierarchical representation adapted to the human visual system in Section 2.1.

• Hierarchical subsampling giving fractal regions in Section 2.2.

• Non-overlapping search patterns for logarithmic motion estimation in Section 3.3.3.

• Analysis of the computational complexity of different strategies for motion estimation by spatial matching on page 67.

• Image Displacement Subband Evaluation (IDSE) matching in Section 3.4.

• Adaptation of matching criterion to noise distribution, with analysis of the distance between the motion compensated prediction errors’ empirical distribution and a generalized Gaussian, results given in Table 3.3 and Fig. 3.18.

• Certainty measures presented in Chapter 4.

• Results on motion certainty controlled prefiltering presented in Section 4.5.1.

• A wavelet coder using frequency domain backward motion estimation, Section 5.1.

• The B-BME scheme suggested in Section 5.2.

1.1.2 The Interconnectedness of All Things

What is the relation between subsampling structures in image pyramids, motion estimation algorithms, certainty measures, and scalable video coding? Chapters 2 to 5, which deal with these issues, are in fact strongly connected. Subsampling structures constitute the base for hierarchical motion estimation, hierarchical certainty measure generation, and scalable coding. Motion estimation algorithms use hierarchical approaches to reduce computational complexity and avoid false matches while keeping the search field large. An improved subsampling structure for motion estimation, which also lends itself to hierarchical certainty measure generation, is suggested in Chapter 2. The subsampling structures described in Chapter 2 could also be used for scalable coding. Motion estimation and certainty measures are utilized in the scalable coding system given in Chapter 5.

1.2 Image Sequence Coding

Image sequences contain huge amounts of raw data, but these are highly correlated, both spatially and temporally (redundancy). Together with the fact that not all the data in the raw material is relevant for the viewer (irrelevancy), this allows substantial reductions in the bit rate needed to represent the image sequence.

What is irrelevant is determined by the properties of the Human Visual System, see section 1.3. For low bitrate coding, perceivable distortion also has to be accepted, see [69, 76].

In what follows in this section we will take a closer look at each part of the image communication chain.

1.2.1 Image Sensing

The moving imagery to be communicated/stored is produced by photons being reflected by the objects of the scene into the camera. The traditional camera integrates photons over a period shorter than or equal to the inter-frame interval; a voltage corresponding to the number of photons that have hit a certain part (pixel) of the image registration device is then read out, gamma transformed according to equation 1.3 and digitized. The pixel rows of an image may be sampled at the same time instant or be sequentially sampled [130]. This makes a difference when determining velocities, but is of less importance in image coding where motion is generally handled as displacement.

Figure 1.1 A/D-encoding. (Block diagram: amplitude-continuous data enters the A/D-encoder, which outputs digital data.)

1.2.2 Pre-filtering

After recording, the image sequence is pre-filtered to reduce noise that would otherwise make the performance of the coding-decoding process deteriorate. It could be argued that this noise-removal could be seen as a part of the coding process; however the view of having a separate pre-filtering process improves modularity.

1.2.3 Coding

The coding process can be seen as entering amplitude continuous data into an A/D-encoder that outputs digital data, see Fig. 1.1.

Distortion is inevitably introduced here since quantization has to be performed. Rate-Distortion Theory [11] is the topic of computing, for each bit rate, the lowest amount of distortion that can be achieved. For comprehensive information about quantization and entropy coding, see [41]. The simplest way of doing A/D-encoding is to quantize each sample individually, i.e. Scalar Quantization. This causes a distortion which is higher than DS(R), the limit given by Rate-Distortion Theory. DS(R) depends on the probability density function (pdf) of the incoming signal and the rate of the digital output.
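As an illustration of scalar quantization, the following minimal sketch quantizes each sample individually with a uniform quantizer and measures the resulting mean squared error. The uniform step sizes, the Gaussian test source and the function names are assumptions made for this example only; they are not taken from the thesis, and the sketch does not compute the bound DS(R).

```python
import random

def quantize(x, step):
    """Uniform scalar quantizer: map x to the nearest multiple of step."""
    return step * round(x / step)

def mean_squared_error(samples, step):
    """Average squared quantization error over all samples."""
    return sum((x - quantize(x, step)) ** 2 for x in samples) / len(samples)

random.seed(0)  # fixed seed so that the run is repeatable
# Assumed test source: zero-mean, unit-variance Gaussian samples.
samples = [random.gauss(0.0, 1.0) for _ in range(10000)]

# A finer step needs more output levels (a higher rate) but lowers the distortion.
for step in (1.0, 0.5, 0.25):
    print(f"step={step:<5} distortion={mean_squared_error(samples, step):.5f}")
```

The printed distortion shrinks as the step is refined, which is the rate versus distortion trade-off that Rate-Distortion Theory bounds from below.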

A better approach is Vector Quantization which acts on a block of samples simultaneously. Theoretically, it would be possible to do vector quantization over very large blocks, and thereby achieve optimal Rate-Distortion performance. Due to practical problems this is not possible. The performance gain of vector quantization over scalar quantization is due to three factors [82]:

• The dependence between samples (memory).

• The fact that n-dimensional space can be divided into regions more and more resembling spheres as n increases.

• The shape of the probability density function.

Figure 1.2 A conventionally designed A/D-coder. (Block diagram: Pre-processing, followed by Scalar quantization, followed by Entropy coding.)

The first factor can be exploited prior to quantization using prediction or transform techniques as an invertible pre-processing to scalar quantization. It could also be exploited using an entropy coder which can take advantage of memory, but this is a less effective approach because the quantization noise decreases the correlation between the samples. The second factor is not possible to achieve without using vector quantization. The third factor can be exploited by the entropy coder.

The types of entropy coding that are most often used either map a block of source symbols to a variable length codeword or map a variable number of input symbols to a fixed length codeword. The introduction of entropy coding therefore makes it necessary to have a buffer which is filled by the data produced by the entropy coder at a variable rate and emptied by the channel interface at a fixed (or variable) rate. A control system makes sure that the buffer never overflows or is totally emptied.

Prediction means that values already sent are used to predict the subsequent value. Only the prediction error has to be quantized. Since the energy of the prediction error is substantially lower than the energy of the signal itself, the distortion resulting from quantization will be smaller when prediction is used.
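The idea of quantizing only the prediction error can be sketched with a simple closed-loop predictive coder. The previous-sample predictor, the uniform quantizer, the step size and the sinusoidal test signal below are all assumptions chosen for illustration; the thesis does not prescribe them.

```python
import math

def dpcm_encode(signal, step):
    """Closed-loop predictive coding with a previous-sample predictor."""
    indices = []          # quantizer indices that would be transmitted
    reconstructed = 0.0   # what the decoder knows about the previous sample
    for x in signal:
        error = x - reconstructed        # prediction error
        index = round(error / step)      # scalar quantization of the error
        indices.append(index)
        reconstructed += index * step    # mirror the decoder's reconstruction
    return indices

def dpcm_decode(indices, step):
    """Rebuild the signal by accumulating the dequantized prediction errors."""
    out, value = [], 0.0
    for index in indices:
        value += index * step
        out.append(value)
    return out

def energy(values):
    """Mean energy per sample."""
    return sum(v * v for v in values) / len(values)

# Slowly varying toy signal (assumed): neighbouring samples are highly correlated.
signal = [100.0 + 10.0 * math.sin(0.1 * n) for n in range(200)]
# Errors of an ideal previous-sample predictor, for the energy comparison.
errors = [b - a for a, b in zip(signal, signal[1:])]

indices = dpcm_encode(signal, step=0.25)
decoded = dpcm_decode(indices, step=0.25)
print("signal energy:", energy(signal))
print("prediction error energy:", energy(errors))
```

Because the encoder predicts from its own reconstruction rather than from the original samples, the quantization errors do not accumulate: every decoded sample stays within half a quantizer step of the original.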

Transform coding and the related subband coding mean applying a linear transform from input vectors to output vectors, thereby concentrating the energy to some of the output components, which makes the quantization more effective.
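Energy concentration by a linear transform can be shown with a very small example. The sketch below applies an orthonormal two-point sum/difference (Haar-type) transform to pairs of neighbouring samples; this particular transform and the toy data are assumptions for illustration, not the transforms used later in the thesis.

```python
import math

def haar_pairs(samples):
    """Orthonormal 2-point transform applied to consecutive sample pairs."""
    s = 1.0 / math.sqrt(2.0)
    pairs = list(zip(samples[0::2], samples[1::2]))
    low = [s * (a + b) for a, b in pairs]    # scaled sums (low band)
    high = [s * (a - b) for a, b in pairs]   # scaled differences (high band)
    return low, high

def energy(values):
    """Total energy of a list of values."""
    return sum(v * v for v in values)

# Toy data (assumed): neighbouring samples are strongly correlated.
samples = [10, 11, 12, 12, 13, 15, 14, 13]
low, high = haar_pairs(samples)

print("total energy:", energy(samples))
print("low band energy:", energy(low))
print("high band energy:", energy(high))
```

The transform is orthonormal, so the total energy is preserved, but almost all of it ends up in the low band. The high band components are small and can be quantized coarsely or with few bits.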

After this general introduction we will continue with some practical methods used for image sequence coding.


Figure 1.3 Hybrid coding using transform domain prediction. (Block diagram with the blocks Transform, Σ, Q and Prediction; the input frame is Ii.)

Hybrid Coding

In Hybrid Coding the previous decoded frame is used to obtain a prediction of the frame to be encoded. The prediction error is thereafter coded using some spatial compression technique such as Discrete Cosine Transform (DCT) or Wavelet Transform (WT). The combination of temporal prediction and spatial transform gives the technique the name Hybrid Coding. In the simplest case the previous frame is used as a prediction. The next step upwards in complexity is to check whether the image is similar or not and for each region send an indicator whether to use the previous frame as prediction or not to use any prediction at all. This is called change detection or motion detection, since changes mostly occur from motion. However, the efficiency of the coder can be increased if motion compensation is performed. Motion compensation means creating a predicted frame from the previous frame using motion vectors to move areas in the previous frame to correspond to their positions in the actual frame. This scheme, which is used nowadays in practical image codecs such as MPEG-2 [60] and H.263 [61], was suggested in [34] and [63]. An earlier hybrid coding scheme [53] used prediction in the transform domain, as shown in Fig. 1.3. The modern hybrid coding scheme, which uses prediction in the spatial domain, thus facilitating motion compensation, is shown in Fig. 1.4. A comparison and analysis of different hybrid coding systems can be found in [30].
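The motion compensation step described above can be sketched as a block-wise copy from the previous frame. The vector convention, the dictionary layout of the motion field, the border clamping and the toy frames are assumptions for this example; real codecs define these details in their own way.

```python
def motion_compensate(previous, vectors, block):
    """Build a predicted frame by copying displaced blocks of the previous frame.

    Assumed convention: vectors[(by, bx)] = (dy, dx) means that the block whose
    top-left corner is (by, bx) is fetched from (by + dy, bx + dx) in `previous`.
    """
    height, width = len(previous), len(previous[0])
    predicted = [[0] * width for _ in range(height)]
    for (by, bx), (dy, dx) in vectors.items():
        for y in range(by, min(by + block, height)):
            for x in range(bx, min(bx + block, width)):
                # Clamp the source position so that it stays inside the frame.
                sy = min(max(y + dy, 0), height - 1)
                sx = min(max(x + dx, 0), width - 1)
                predicted[y][x] = previous[sy][sx]
    return predicted

# Toy frames: a bright 2x2 object moves one pixel to the right.
previous = [[0, 0, 0, 0, 0, 0],
            [0, 9, 9, 0, 0, 0],
            [0, 9, 9, 0, 0, 0],
            [0, 0, 0, 0, 0, 0]]
current = [[0, 0, 0, 0, 0, 0],
           [0, 0, 9, 9, 0, 0],
           [0, 0, 9, 9, 0, 0],
           [0, 0, 0, 0, 0, 0]]

# One vector per 2x2 block; every block looks one pixel to the left.
vectors = {(by, bx): (0, -1) for by in range(0, 4, 2) for bx in range(0, 6, 2)}
predicted = motion_compensate(previous, vectors, block=2)

# Prediction error with and without motion compensation.
residual_mc = [[c - p for c, p in zip(cr, pr)] for cr, pr in zip(current, predicted)]
residual_plain = [[c - p for c, p in zip(cr, pr)] for cr, pr in zip(current, previous)]
```

In this toy case the motion compensated residual is zero everywhere, whereas using the previous frame directly as the prediction leaves nonzero errors along the object edges.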

Figure 1.4 Hybrid coding using spatial domain prediction. This is the coding scheme used by today’s standardized codecs. Ii is the input frame to encode, and Îi is the corresponding reconstructed frame, which will also appear in the decoder. The transform that is used in the standardized video codecs is DCT, but e.g. wavelet transform could be used instead. Motion compensation is normally used in the prediction step, and the motion vectors are then sent as side information. (Block diagram with the blocks Σ, Transform, Q, inverse Transform, Σ and Prediction.)

Forward Prediction At the coder side the motion vectors can be estimated from the previously decoded frame (or even the previous original frame) and the frame to be coded, and then transmitted over the channel. This technique is used by today’s image coding standards, which all use one motion vector per block, where a block consists of e.g. 16 × 16 pixels. The algorithm for finding the motion field is not specified in the standards, but since the objective is to reduce the prediction error, full search block matching with MSE has been the method seen as the ideal, and everything else regarded more or less as compromises to reduce computational complexity.
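Full search block matching with MSE, mentioned above as the reference method, can be sketched as follows. The function names, the vector convention (pointing from the block in the current frame to its match in the reference frame), the rule of skipping candidates that leave the frame, and the ramp test image are assumptions made for this example.

```python
def block_mse(current, reference, by, bx, dy, dx, block):
    """MSE between a block of `current` and a displaced block of `reference`."""
    total = 0
    for y in range(by, by + block):
        for x in range(bx, bx + block):
            diff = current[y][x] - reference[y + dy][x + dx]
            total += diff * diff
    return total / (block * block)

def full_search(current, reference, by, bx, block, search_range):
    """Test every displacement in the window; return (best_vector, best_mse)."""
    height, width = len(reference), len(reference[0])
    best_vector, best_error = None, float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates whose displaced block would leave the frame.
            if not (0 <= by + dy and by + dy + block <= height):
                continue
            if not (0 <= bx + dx and bx + dx + block <= width):
                continue
            error = block_mse(current, reference, by, bx, dy, dx, block)
            if error < best_error:
                best_vector, best_error = (dy, dx), error
    return best_vector, best_error

def clamp(value, low, high):
    return min(max(value, low), high)

# Toy reference frame (assumed): an intensity ramp in which every pixel is unique.
reference = [[16 * y + x for x in range(8)] for y in range(8)]
# The current frame shows the reference content displaced by (dy, dx) = (1, -2).
current = [[reference[clamp(y + 1, 0, 7)][clamp(x - 2, 0, 7)] for x in range(8)]
           for y in range(8)]

vector, error = full_search(current, reference, by=2, bx=3, block=3, search_range=2)
print(vector, error)
```

With a search range of ±2 pixels the search evaluates up to 25 candidate displacements for this block. The number of candidates grows quadratically with the search range, which is why the reduced and hierarchical strategies are of interest.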

Backward Prediction The motion vectors could alternatively be estimated from the previously decoded images and the part of the image that has already been sent over the channel. This could be done in an identical manner at both the coding and decoding side, which makes transmission of the motion vectors unnecessary. An assumption however has to be made that the motion field is predictable either in time, space, or spatial frequencies.

The assumption of the motion field changing slowly over time leads to using motion fields estimated from earlier decoded frames. However, this assumption is not very good, which can be realized by considering e.g. how the motion field varies in the simple case of an object performing a translational motion over a stationary background.

The spatial consistency assumption leads to algorithms that are pel- or block-recursive, where block-recursive algorithms give the possibility to use DCT or vector quantization. Pel-recursive techniques require scalar quantization of the prediction error. These methods are described in [126], where it is also concluded that detection of motion discontinuities has to be performed at the coding side and transmitted over the channel for the algorithms to perform well.

Algorithms which use the assumption that motion is the same over all spatial frequencies at a spatial position have been proposed independently by several authors [5, 75, 99]. One of the algorithms was investigated in [19]. The assumption of consistent motion over all spatial frequencies seems physically sound. There are special cases, e.g. transparency, where the assumption is not met, but these cases exhibit a principal problem: that there exists more than one true motion vector in that area.

Algorithms based on the assumption of motion consistency over spatial frequencies work in a hierarchical system where low frequencies are transmitted before high frequencies. Motion fields are extracted from low frequency bands and used to do the prediction in the next higher frequency band.

Forward versus Backward Prediction The reasons why backward motion estimation has not been used so far in image codecs are threefold:

• In a broadcasting environment it is desirable to keep the decoders as simple as possible.

• If decoders are going to perform motion estimation algorithms, the motion estimation in the coder and in the decoder have to give identical results. This requires either standardization, which could prohibit gradual improvements, or downloadability of motion estimation algorithms.

• Neither the assumption of the motion vectors being consistent over time nor the assumption of them being consistent over space has been very successful.

These have been valid reasons, but in the future we will probably see more backward prediction schemes based on the assumption of consistency over spatial frequencies.

Further extensions are possible, such as computing a predicted motion vector field using backward prediction and using forward prediction to obtain a more exact motion field. Only the difference between the predicted and the more exact motion estimate would have to be sent over the channel.

3D-Transforms

Viewing the image sequence as a volume and making transforms of subregions of this volume is perhaps the most natural approach to extend 2D transform techniques to image sequence coding. The difficulty lies in incorporating motion compensation into the 3D transform coding scheme, which is necessary in order to make it more efficient than 2D motion compensated hybrid coding. For the images to be aligned before coding and properly de-aligned after decoding, the motion fields have to be invertible, see [73], Paper 8. Invertibility is however not a property of physical motion fields. The continuity properties of the motion field introduced by the invertibility demand can lead to very visible distortion around edges of moving objects. The reason is that when the spatial resolution of the motion vector field is low, around the edges of moving objects the aligned image content is created using a linear combination of the object’s motion and the background’s motion. In the case of uncovered background, texture from the object is imported. This smearing of object texture to the background is sometimes referred to as bleeding.

A way of avoiding the demand of invertibility is to introduce reference frames that are coded as still images (intra- or I frames in hybrid coding terminology). To these I frames a block of frames can be motion compensated. The motion compensated prediction error images are thereafter 3D-transformed, see Fig. 1.5. This type of 3D-coding was examined in [70]. For fractal 3D coding there exists a scheme that doesn’t require invertible motion fields [74].

Figure 1.5 A scheme for 3D-transform coding that does not require invertible motion fields. (Diagram along a time axis: an I frame followed by a block of frames, with motion compensated prediction.)

Motion Compensated Alignment (MCA) is an invertible motion field technique that was introduced in [73], Paper 8. It has advantages over the scheme used in [70] that could lead to better coding efficiency. The first advantage is that no costly intra frames are needed. The second advantage is due to the fact that prediction errors that have to be encoded after motion compensation will occur around moving object boundaries. These errors will have the same spatial location over the whole block of frames for the MCA-technique, which will enable high performance of the 3D-transform. In the scheme used in [70] where motion compensated difference images are 3D-transformed, the prediction errors will have different spatial locations in the different frames.

In real-time communication systems the transform blocks have to have a very short extent in the temporal domain, otherwise too much delay is introduced. 3D-transforms are therefore less suitable for real-time communication.

Scalability (see Chapter 5) is possible in both of the schemes mentioned above. For the MCA-technique, an interesting possibility is to use 3D-SPIHT coding [103], which results in SNR scalability. For 3D-coding of motion compensated prediction errors, a natural time scalability comes from letting the intra coded frames constitute the base layer.

Model Based Coding

Model based image coding was presented in 1983 [35] and further described in [33]; it uses 3D-modeling of the scene and transmits variations of the model to the receiver which builds an image using computer graphics. This kind of image coding needs more than the 2D-motion estimation discussed in this work. The 2D-motion vector fields can be used as a first step in obtaining the 3D motion parameters, see [78]. A certainty measure as described in Chapter 4 could make the 3D-motion parameter estimation more stable by decreasing the influence of uncertain 2D motion vectors. However, this has not been evaluated in the current work.

1.2.4 Decoding, Post-processing and Display

Decoding is the inverse operation of the encoding, but because of the quantization it is not possible to reconstruct the original image sequence exactly. If the quantization is not too coarse, no difference will be seen, which is due to the properties of the human visual system (treated in section 1.3). The term irrelevance is used to denote that changes can be made to the image without being perceptible to the viewer.

Post-filtering has been tried for alleviating the visual impressions of the coding artefacts, but not with convincing results. Motion compensation within the post-filtering is a big improvement since the coding artefacts are less visible if they follow the image content than if they are stationary [68]. If the motion vectors are estimated on the encoder side, there is no problem with coding artefacts influencing the motion vectors. When motion vectors are estimated at the decoder side, coding artefacts can lead to erroneous motion vector fields. However, since low frequency image content is often sent using fine quantization, there is a possibility that also backward motion estimation might work for post-filtering. Motion compensation also allows for lower temporal cut-off frequencies since moving objects are saved from blurring.

The most important post-processing step is frame interpolation to achieve a suitable display frame rate. At the encoding side it is usual, due to bit rate restrictions, to send fewer images per second than is needed to make a flicker-free, non-jerky display. The simplest frame interpolation technique is frame repetition, i.e. zeroth order interpolation, which makes the reconstructed image sequence jerky if the time interval between each original frame is too long. In motion film frame repetition is used, but the time interval between two original frames is as short as 1/24 ≈ 0.042 seconds. The motion artifacts at that frame rate are visible to a trained eye, but are not very annoying.

Figure 1.6 Schematic picture of the human eye. (Labels: fovea, retina, lens, optical axis, visual angle.)

A better technique is to use motion vectors for creating interpolated images. However, problems arise in the areas of uncovered background. In e.g. the MPEG-2 standard [60] this has been solved in a systematic manner by the so-called B-frames, which stands for bidirectionally predicted frames. For these frames very little image information is coded, just motion vectors that can point either backwards or forwards in time, making it possible to reconstruct both covered and uncovered background. Image content can also be computed as a mean reconstructed from both the previous and the next encoded frame using one vector pointing back in time and one pointing forward.
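The idea of forming an intermediate frame as the mean of a backward and a forward motion compensated prediction can be sketched on one-dimensional scanlines. The per-sample displacement lists, the border clamping and the toy data are hypothetical choices for this illustration; the sketch ignores covered and uncovered background and is not the procedure of any particular standard.

```python
def fetch(line, x):
    """Read a sample, clamping the position to the scanline borders."""
    return line[min(max(x, 0), len(line) - 1)]

def interpolate_bidirectional(previous, following, back, forward):
    """Mean of a backward and a forward motion compensated prediction.

    back[x] and forward[x] are assumed per-sample displacements into the
    previous and the following coded frame, respectively.
    """
    return [
        (fetch(previous, x + back[x]) + fetch(following, x + forward[x])) / 2
        for x in range(len(previous))
    ]

# A 2-sample object moves two samples to the right between the coded frames,
# so halfway between them it should have moved one sample.
previous = [0, 8, 8, 0, 0, 0]
following = [0, 0, 0, 8, 8, 0]
back = [-1] * 6     # look one sample to the left in the previous frame
forward = [1] * 6   # look one sample to the right in the following frame

middle = interpolate_bidirectional(previous, following, back, forward)
repeated = list(previous)   # zeroth order interpolation (frame repetition)
print(middle)
print(repeated)
```

Frame repetition leaves the object at its old position, while the motion compensated mean places it halfway along its trajectory, which is what reduces the jerkiness described above.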

1.3 Properties of The Human Visual System

This section gives a brief introduction to the HVS and some details relevant for the rest of the text. For more information, excellent starting points are [9, 22, 59, 129].

1.3.1 Physical Mechanisms of Vision

Our eyes consist of an optic system which focuses the incoming lighton the retina. The retina contains light–sensitive receptors of fourdifferent kinds, namely rods, which play their greatest role duringlow luminance conditions, and three types of cones, each with theirmaximal sensitivity centered around different light wavelengths, toenable colour vision. The cones need higher luminance levels thanthe rods to operate.

The central part of the retina is called the fovea and has a high density of cones, which are arranged in a hexagonal pattern. The fovea takes input from the central visual area, extending approximately 2 degrees of visual angle around the optical axis, which is the axis reaching from the center of the fovea through the center of the pupil, see Fig. 1.6. Each cone integrates incoming light over a small spatial area, but over a longer time than the rods do. Therefore the visual acuity in the fovea is high and colour vision is good, but the temporal bandwidth is lower than in the rest of the retina.

In the central part of the fovea the receptor density is so high that it matches the eye’s optical resolution. In the outer part of the fovea, the optical resolution is still high, but the cones are further apart. From an engineering point of view, one might expect that this would lead to aliasing. It was however shown in [141] that because the receptors are not regularly spaced, aliasing is reduced. The price paid for the aliasing reduction is higher spatial frequency ambiguity, since a single-frequency input pattern gives rise to a response in a frequency range around the input frequency. For extrafoveal vision, i.e. the part of the visual field that is not handled by the fovea, the situation is even worse. The cones are much further apart, and although they integrate light over a larger area, this is not enough to explain why aliasing is not a big problem outside the center of the visual field. It is assumed that in the extrafoveal area, the model of the visual system dividing the input into frequency channels (see Section 1.3.2) is not as relevant as for foveal vision; instead, the purpose of peripheral vision might be just to extract moving objects in order to attract attention.

The neural signals from the photoreceptors are processed in the retina by other cell types, and also passed directly over the optic nerve, together with the outputs from the other retinal cell types, to the brain.

1.3.2 Stationary Properties

Stationary properties of the HVS are the ones that are measured with fixated eye and non-moving stimulus. The stimulus is however allowed to be switched on and off.

Luminance Sensitivity

By Just Noticeable Difference (JND) is meant the magnitude of the smallest stimulus that is perceivable. For luminance $L$ (measured in e.g. Watts/m²), the just noticeable difference $\Delta L$ when making the experiment of having a uniform background with intensity $L$ and a spot with intensity $L + \Delta L$ can be written

$$\Delta L = K L \tag{1.1}$$

where $K$ is a constant. Equation 1.1 is known as Weber’s law. It holds over a large range of intensity magnitudes.

As can be seen, the visual system is not linear but logarithmic in amplitude sensitivity. This is rather natural when considering how intensity variations are caused, and knowing that the same physical scene should give approximately the same visual impression independently of the incoming light intensity (illuminance). The amount of light coming from a surface (luminance) is the product of the illuminance and the surface reflectance. The scene information lies in the surface reflectances, the incoming light energy being rather uninteresting. It is therefore not surprising that the ratio between luminance values at adjacent points is used for determining if there is some kind of event in the scene.
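The step from Weber’s law to a logarithmic response can be written out as a short derivation. This is only a sketch: it assumes, following Fechner’s classical argument and not anything stated in the text, that every just noticeable difference corresponds to the same increment $\Delta s$ of an internal response $s$. The symbols $s$, $k$ and $L_0$ are introduced here for illustration only.

```latex
\Delta s = k \, \frac{\Delta L}{L}
\quad\Longrightarrow\quad
\frac{ds}{dL} = \frac{k}{L}
\quad\Longrightarrow\quad
s(L) = k \ln \frac{L}{L_0}
```

Under this assumption, scaling the illuminance by a factor $c$ changes every response by the same additive constant $k \ln c$, so response differences between adjacent points depend only on the ratio of their luminances.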

For super-threshold¹ vision, Stevens’s law

$$I = a L^{n} \tag{1.2}$$

models how different types of stimuli are perceived by humans. $L$ is the physical stimulus and $I$ the perceived sensation. For a point with intensity $L$ against a black background the value of $n$ is 0.5.

Stevens’s law is taken into consideration when digitizing image samples by introducing a non-linearity prior to quantization, e.g.

$$I = c L^{1/\gamma} \tag{1.3}$$

where $\gamma$ is typically slightly over 2. Equation 1.3 describes how luminance, $L$, is transformed into intensity, $I$, which will be available for image processing. The process of using Equation 1.3 at the input side of the image processing system and the inverse at the output side is known as gamma correction.
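As an illustration of Equation 1.3 and its inverse, a minimal Python sketch follows. The value `gamma=2.2`, the normalization of luminance to the range [0, 1], the 8-bit quantizer and all function names are assumptions made for this example; the text only states that γ is slightly over 2.

```python
def gamma_encode(luminance, gamma=2.2, c=1.0):
    """Equation 1.3: intensity I = c * L**(1/gamma), for luminance L in [0, 1]."""
    return c * luminance ** (1.0 / gamma)


def gamma_decode(intensity, gamma=2.2, c=1.0):
    """Inverse of Equation 1.3: luminance L = (I / c)**gamma."""
    return (intensity / c) ** gamma


def luminance_step(code, levels=256, gamma=2.2):
    """Luminance difference between two adjacent codes of a uniform quantizer on I."""
    lower = gamma_decode(code / (levels - 1), gamma)
    upper = gamma_decode((code + 1) / (levels - 1), gamma)
    return upper - lower


# Uniform steps in I give small luminance steps in dark regions and
# larger luminance steps in bright regions.
print(luminance_step(10), luminance_step(200))
```

With the non-linearity applied before quantization, the quantization steps expressed in luminance are finest where the luminance is low, which is roughly where Weber’s law says the eye notices the smallest absolute differences.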

Returning to threshold investigations, we note that the constant $K$ in Weber’s law, Equation 1.1, is constant only when the spectrum, retinal position, size, and duration of the stimulus are unchanged.

The energy of the spot is important for detection. This is stated by Bloch’s law (Equation 1.5) and Ricco’s law (Equation 1.6). If the area of the spot is $A$ and the duration time of the stimulus is $T$, we have that the energy deviation caused by the spot is

$$\Delta E = A \times T \times \Delta L. \tag{1.4}$$

¹ By super-threshold is meant that the levels of the stimuli are above the just noticeable difference (JND). For determination of super-threshold sensation properties, individuals are asked to assign a number proportional to the sensation (direct scaling) or to sort the stimuli according to the sensation (indirect scaling). Another example of an entity which is measured by direct or indirect scaling is subjective image quality.


Bloch’s law:

$$T \times \Delta L = C, \tag{1.5}$$

where $C$ is a constant, holds for duration times shorter than about 0.1 seconds.

Ricco’s law:

$$A \times \Delta L = C', \tag{1.6}$$

where $C'$ is a constant, holds for stimulus sizes with a diameter of less than about 10 minutes of arc in visual angle.

For larger spots, Equation 1.4 does not hold any longer. Piper’s law:

$$\sqrt{A} \times \Delta L = C'', \tag{1.7}$$

$C''$ being constant, is in better accordance with experiments for stimulus sizes between 10 minutes of arc and 24 degrees of arc.
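The spatial summation laws can be combined into one piecewise threshold function. The sketch below is only an illustration: it assumes a circular spot, joins Ricco’s and Piper’s laws continuously at the 10 arcmin limit, and holds the threshold constant above 24 degrees. The constant `c_ricco` and all function names are hypothetical; the text gives no numerical values for the constants.

```python
import math

RICCO_LIMIT_ARCMIN = 10.0          # upper diameter limit for Ricco's law
PIPER_LIMIT_ARCMIN = 24.0 * 60.0   # 24 degrees, upper limit for Piper's law


def disc_area(diameter_arcmin):
    """Area of a circular spot (assumed shape), in square arcmin."""
    return math.pi * (diameter_arcmin / 2.0) ** 2


def threshold_delta_l(diameter_arcmin, c_ricco=1.0):
    """Relative JND for a spot of the given diameter (arbitrary units)."""
    area = disc_area(diameter_arcmin)
    if diameter_arcmin <= RICCO_LIMIT_ARCMIN:
        return c_ricco / area                       # A * dL = C'
    # Choose C'' so that both laws agree at the Ricco limit (an assumption).
    edge_area = disc_area(RICCO_LIMIT_ARCMIN)
    c_piper = math.sqrt(edge_area) * (c_ricco / edge_area)
    if diameter_arcmin <= PIPER_LIMIT_ARCMIN:
        return c_piper / math.sqrt(area)            # sqrt(A) * dL = C''
    # Beyond the Piper range the threshold is held constant.
    return c_piper / math.sqrt(disc_area(PIPER_LIMIT_ARCMIN))


for d in (2, 4, 100, 200, 2000, 3000):
    print(d, threshold_delta_l(d))
```

Doubling the diameter quarters the threshold in the Ricco range, halves it in the Piper range, and leaves it unchanged for very large spots.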

Further increasing the spot size does not influence the JND.

Wavelength content of the stimulus, placement on the retina and light adaptation interact with each other and affect the threshold.

As we shall see next, the spatial frequency of the stimulus also has a high influence on the detectability.

Spatial Frequency Sensitivity

Even though the human visual system is highly non-linear and lacks the property of shift invariance, it can still be of interest to define a spatial modulation transfer function, which gives the relative sensitivity for different spatial frequencies. Because of the lack of shift invariance, there will be a continuum of spatial modulation transfer functions for different angles from the optical axis. A spatial modulation transfer function can be obtained either by contrast matching (asking observers to adjust the contrast of stimuli having sinusoidal intensity with different frequencies so that they appear to have the same contrast) or by measuring the threshold contrast for different frequencies. Doing this for the fovea, we find that the highest sensitivity occurs at a frequency of 6 cycles/degree of visual angle, see [22], page 112.

It has also been found that patterns that are oriented horizontally or vertically are more visible than oblique patterns [12, 22, 29]. This effect is more pronounced for high spatial frequencies than for low frequencies.

The HVS can be thought of as having parallel channels where each channel responds to a portion of the spatial frequencies. These channels have a bandwidth of approximately one octave.

1.3.3 Dynamic Properties

We will now allow the stimulus to be time-varying, and later on also allow the eyes to move.

Spatio-Temporal Frequency Sensitivity and Motion Perception

Consider a sinusoidal stimulus with temporal frequency $f$, spatial frequency $\nu$, and contrast $m$, that is

$$I = I_0 (1 + m \sin 2\pi\nu x \, \sin 2\pi f t). \tag{1.8}$$

It has been shown [107] that, for a fixed contrast $m$, if one investigates which combinations of spatio-temporal frequencies $(\nu, f)$ are visible, the pairs which form the detection threshold approximately obey

$$\nu f = \text{constant}. \tag{1.9}$$

We will now look at the case where

$$I = I_0 (1 + m \sin 2\pi\nu (x + v t)). \tag{1.10}$$

Such moving sine gratings were used as stimuli in [14], where it was concluded that motion has the effect of shifting the spatial modulation transfer function so that maximum sensitivity occurs at lower frequencies than the 6 cycles/degree mentioned earlier. Motion can be perceived at very high speeds, such as 800 degrees/second, but the spatial frequency then has to be low. This can be explained by the temporal band-limitation of the HVS, since a moving sine grating gives rise to an intensity at each retinal position that varies sinusoidally in time.
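The band-limitation argument can be made explicit by evaluating Equation 1.10 at a fixed retinal position $x_0$. In the sketch below, $x_0$ and the temporal cut-off frequency $f_{\max}$ are symbols introduced here for illustration; no numerical value for $f_{\max}$ is taken from the text.

```latex
I(x_0, t) = I_0 \bigl( 1 + m \sin ( 2\pi\nu x_0 + 2\pi \nu v \, t ) \bigr)
\quad\Longrightarrow\quad
f = \nu v ,
\qquad
f \le f_{\max} \;\Longleftrightarrow\; \nu \le \frac{f_{\max}}{v}
```

The temporal frequency seen at each retinal position is thus the product of spatial frequency and speed, so the highest spatial frequency that stays within a given temporal bandwidth falls off inversely with the speed.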

For small motion, however, it has been shown [137] that the visual resolution is not decreased, but equals the resolution when no motion is present. Small motion here means up to 2.5 degrees/second for horizontal and vertical motions, and up to 1 degree/second for oblique motion directions. The results were obtained for foveal vision.

For comparison, one can note that the smallest motion that can be detected is about 0.2 degrees/second for a luminous point moving in the dark, and about 0.03 degrees/second if there is a visible, stationary background.

An important phenomenon is the so-called apparent motion. It is introduced by first showing a stimulus at position A, removing the stimulus and then letting it reappear at spatial position B. If the time between disappearance and reappearance, the interstimulus interval, is appropriate with respect to the distance between A and B, the observer will perceive the stimulus moving from A to B. If the interstimulus interval is too short, the observer will instead see two stimuli appearing simultaneously, and if the interstimulus interval is too long, the observer will see the “true” scenario, that is the stimulus first at position A, then disappearing and reappearing at position B.

The reason why apparent motion is so important is that all display technologies rely on this phenomenon for giving the impression of smooth motion in the scenes, while what is really being presented is a series of still images.

Two different processes are used to model the phenomenon of apparent motion, the short-range process and the long-range process².

The short-range system handles small displacements, 15–20 minutes of visual angle or less. It responds to short interstimulus intervals and probably only to simple shifts of a contour or pattern.

The long-range process is capable of matching over distances of several degrees, and of matching objects differing in color, brightness, shape, size and orientation. The objects are then perceived to move and transform simultaneously. However, an interstimulus interval of at least 0.1 seconds is needed, which is longer than the time between two images in all modern display systems.

In [58] it is demonstrated that the long-range process seems to require attention (processing becomes serial), while the short-range process does not (processing is parallel over the image).

Motion perception is induced both by illumination changes in the retina, by the image–retina system, and from interpreting the position and orientation changes of the eyes, by the eye–head system.

² The division of motion perception into two separate processes is questioned; an alternative view is that motion perception is one mechanism, and the “short-range” and “long-range” processes would then correspond to different areas in the spatio-temporal domain where the sensitivity is high.


Watson and Ahumada [132] have suggested a model for how the image–retina system might work. The model is based on spatio-temporal filtering. Burr et al. [13] report findings, obtained through psychophysical experiments, of detectors of image motion based on spatio-temporal filtering. The detectors are sensitive to limited spatial frequencies and certain speeds, and have a preferred direction of movement. These findings can explain many of the phenomena described above, such as small motion giving no smear, and the short-range process of apparent motion.

The eye–head system will be discussed in the next section.

Eye Movements

When we move the fixation point of our eyes, the light pattern on our retinas changes, but we still perceive the world as being stationary. On the other hand, when we follow a moving object, so that the image of the object is stationary on our retinas, we still perceive the object as moving and the background as stationary. There is thus no one-to-one relation between velocity on the retina and perceived motion. Instead, the eye–head system compensates for the motion of the eyes, possibly by the parts of the brain controlling the motion of the head and eyes sending copies of their motor signals to the visual cortex, which makes the actual compensation.

There are different types of eye motion, namely:

• Saccades, which move the fixation point of the eyes to a new angle. Since there is no feedback during the saccade, it is not guaranteed to move the eyes exactly to the desired position. Several saccades might be necessary to move the eye to the point of interest. During the saccades, the visual system is inert.

• Microsaccades, small vibrations that prevent the image from being stationary on the retina.

• Drift, a slow change around the fixation point that occurs under fixation.

• Reflex pursuit movements, which hold the fixation point stationary although the head is moving.

• Voluntary pursuit movements, which try to keep the image of the moving object of interest at the central part of the fovea.

• Nystagmus, a repetitive pattern of eye drift in one direction followed by a quick saccade back to the starting point. It can be induced by a repetitive moving pattern being shown in front of the observer.

• Convergence and divergence, which change the angle between the left and right eye in order to adapt to viewing distance.

Smooth-pursuit movement is a term used for reflex and voluntary pursuit movements. It is not possible to perform smooth-pursuit movements without an object to fixate the eyes on.

When working with image coding it is important to know the limits of the voluntary pursuit movements, since successful pursuit allows full spatial resolution and thus only small spatial domain artefacts can be tolerated. Maximum velocity of voluntary pursuit movement determines how large the search area for motion vectors needs to be.

According to [140], the most common fixation time when looking at stationary images is 0.20 – 0.25 seconds. It is therefore believed that this time is necessary for satisfactory perception. For continuous, horizontal motion it is possible to track an object during this amount of time if the speed does not exceed 200 degrees/second. For moving objects that appear suddenly and unexpectedly, the maximum velocity is reduced to about 150 degrees/second. Caution should, for two reasons, be taken before stating that it suffices to describe moving objects at speeds less than 200 degrees/second. One reason is that even if smooth pursuit is impossible, saccades enable examination also of objects moving very fast, up to 500 degrees/second. The other reason is that head movements can also help when tracking objects.

The above analysis indicates that very fast moving objects need to be represented with full quality. However, the fact that the screen area usually covers only a limited area of the visual field diminishes the maximal velocity where full resolution is needed to

$$\text{max velocity} = \frac{\text{screensize}}{0.2\ \text{seconds}} \tag{1.11}$$

where screensize can be expressed as visual angle (in degrees) or in terms of pixels. The above formula (1.11) holds for screen sizes which cover less than 40 degrees of visual angle in the horizontal direction.


As an example we can take an HDTV screen having 2000 pixels horizontally. If the screen is watched from a distance of 4 times the screen height and has a 16:9 aspect ratio, the screen covers 25 degrees of visual angle horizontally, so Equation 1.11 limits the velocity for which full spatial resolution is needed. The maximal velocity possible to track is then 10000 pixels/second, and assuming that the frame rate of the incoming sequence is 25 frames/second this means that the search area for motion vectors ought to be ±400 pixels horizontally.
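The numbers in the HDTV example can be reproduced with a few lines of Python. This is a sketch under the assumptions of the example itself (flat screen viewed on-axis, fixation time of 0.2 seconds); the function names are chosen here for illustration only.

```python
import math


def horizontal_visual_angle_deg(distance_in_heights, aspect_ratio):
    """Horizontal visual angle of a flat screen viewed on-axis.

    The screen height is the unit of length, so the width equals aspect_ratio.
    """
    half_width = aspect_ratio / 2.0
    return 2.0 * math.degrees(math.atan(half_width / distance_in_heights))


def max_tracking_velocity(screen_size, fixation_time_s=0.2):
    """Equation 1.11: screen size (degrees or pixels) per fixation time."""
    return screen_size / fixation_time_s


def search_range_pixels(width_pixels, frame_rate_hz, fixation_time_s=0.2):
    """Largest displacement per frame that still needs full resolution."""
    return max_tracking_velocity(width_pixels, fixation_time_s) / frame_rate_hz


angle = horizontal_visual_angle_deg(4.0, 16.0 / 9.0)
velocity = max_tracking_velocity(2000)
search = search_range_pixels(2000, 25)
print(f"{angle:.1f} degrees, {velocity:.0f} pixels/s, +/- {search:.0f} pixels")
```

Since the computed angle is below the 40 degree limit of Equation 1.11, the screen size rather than the 200 degrees/second pursuit limit determines the search range in this example.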

In [56] it was verified that velocities up to 16 degrees/second do not diminish visual acuity at all, but higher velocities were not tried.

Acceleration of the moving object is not well handled by the tracking system [28], so that for an acceleration of $a$ degrees/s² we get a retinal velocity of $v_r = \frac{a}{8}$ degrees/s (for predictable motion). Since a retinal velocity of 2.5 degrees/second in horizontal or vertical direction does not lower spatial resolution, this means that an acceleration of less than 20 degrees/s² does not lower the resolution; corresponding figures for oblique directions are 1 degree/s and 8 degrees/s².

1.3.4 Hierarchical Properties

Hierarchical structures can be visualized as pyramids, where information is assimilated in the bottom layer in its simplest, most detailed and least processed form. In the visual system the incoming information enters when photons activate receptors in the retina. There is strong anatomical evidence of the visual system being organized in a hierarchical series [59]. Also psychophysical experiments support that hierarchical structures are being used in the visual system. As an example, in random-dot kinematograms³, the upper limit on the spatial displacement at which motion is perceived, $D_{max}$, is proportional to the dot size for dot sizes larger than 15 minutes of visual angle. In Chapter 3 we will describe hierarchical motion estimation which has the same property. Also for reasons of computational efficiency it is reasonable to assume that the human visual system is hierarchically organized.

³ Random-dot kinematograms are used in motion-perception experiments. A pattern is generated by randomly assigning each dot a value. This pattern is then used as stimulus in an apparent motion setting, i.e. first shown at spatial position A, then removed and after a time period called the interstimulus interval shown in spatial position B. Positions A and B are separated by the displacement D.

2 Hierarchical Image Structures

Hierarchical image representations were introduced around 1970 [65, 102, 120], and received much attention in the early 1980s [17, 49, 109]. They provide an alternative to pixel-oriented representations that is able to handle image structures at different scales. The approach has proven to be viable, and is today extensively used in image coding [111, 124] as well as in computer vision and image processing [48].

In image coding, the layers normally contain image content at a specific frequency band. The use of this type of representation for the purpose of image compression was introduced in [17]. The number of samples at each level decreases with a constant subsampling factor, SSF, between each layer representing lower and lower frequency, i.e. the relative bandwidth is constant. This structure is called a pyramid, even if it does not look like the pyramids of Egypt, since the slope is not constant, see Fig. 2.1. The pyramid structure is strongly connected with the tree structure shown in Fig. 2.2.

The motivations for dividing the image into frequency bands include:

• energy compaction; concentrating the energy to a few components (in this case the low frequencies) gives better Rate-Distortion performance [11].

• adaptation of coding accuracy to satisfy the requirements of the human visual system.

• scalability and progressive transmission; decomposition into frequency bands naturally leads to frequency scalability, and is also a base for progressive transmission [111].

• masking; when energy content is high in a frequency band, that frequency, and also neighbouring frequencies, will have higher thresholds for detection of quantization noise (this property is seldom used in image coding, but is taken advantage of in audio coding).

In the pyramid shown in Fig. 2.1, where the bottom layer is identical to the original resolution image, it is obvious that the total number of samples, $N_P$, is larger than the number of samples in the original image, $N_I$. The numbers of pixels at the levels of the pyramid form a geometric series whose sum is

$$N_P \le N_I \, \frac{1}{1 - \frac{1}{SSF}} \tag{2.1}$$

with near equality when the top level contains only one pixel; the finite sum then falls short of the bound by only $\frac{1}{SSF - 1}$ samples.

Also some types of band-pass pyramids, e.g. the Laplacian pyramid [17], exhibit the same type of data expansion.

In image processing, the use of some extra computer memory is affordable, but for image coding, data expansion should be avoided. Without sacrificing the possibility of perfect reconstruction of the image, it is possible, by choosing appropriate filter banks for generation of the pyramid, to store only a fraction $\left(1 - \frac{1}{SSF}\right) \frac{1}{SSF^{l}}$ of the $N_I$ coefficients at each level $l$ of the pyramid, thus making the number of samples in the pyramid equal to the number of samples in the original image [25, 90, 91].
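The sample counts discussed above can be checked numerically. The following Python sketch assumes an image whose number of samples is an exact power of SSF and a pyramid that is continued until a single sample remains; the function names are hypothetical. The critically sampled variant keeps, at each level, only the difference in sample count between that level and the next coarser one, plus the single top sample.

```python
def lowpass_pyramid_sizes(n_image, ssf):
    """Samples per level of a lowpass pyramid, from full resolution to one sample."""
    sizes = [n_image]
    while sizes[-1] > 1:
        if sizes[-1] % ssf:
            raise ValueError("n_image must be a power of ssf in this sketch")
        sizes.append(sizes[-1] // ssf)
    return sizes


def critically_sampled_sizes(n_image, ssf):
    """Coefficients per level when level l keeps n_image * (1 - 1/ssf) / ssf**l."""
    lowpass = lowpass_pyramid_sizes(n_image, ssf)
    details = [lowpass[l] - lowpass[l + 1] for l in range(len(lowpass) - 1)]
    return details + [lowpass[-1]]   # the top lowpass sample is kept as well


n_image, ssf = 256 * 256, 4
expanded = sum(lowpass_pyramid_sizes(n_image, ssf))
bound = n_image / (1.0 - 1.0 / ssf)   # right-hand side of Equation 2.1
critical = sum(critically_sampled_sizes(n_image, ssf))
print(expanded, bound, critical)
```

For SSF = 4 the lowpass pyramid holds about a third more samples than the image, while the critically sampled representation holds exactly as many samples as the image.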

In the field of computer vision there exists another definition of hierarchy. The hierarchical layers can have different semantic meanings, e.g. the lowest layer can be the image as it is, the next layer (computed from the first layer) can typically be the gradient, and the following layer can be the curvature [50].

Studies of the Human Visual System [24, 95] indicate that pyramid structures are used also in biological vision, as mentioned in Section 1.3.4.

The rest of this chapter will be devoted to two special types of pyramids. In Section 2.1, pyramid generation where the spatial frequency content of the subbands is made to comply with the sensitivity of the human visual system is described. This work was also presented in [83]. Section 2.2 describes 2-dimensional non-separable subsampling and analyzes the fractal properties of the transformations. The work has been presented in [86, 87].

Figure 2.1 A lowpass pyramid with a 2-dimensional subsampling factor (SSF) of 4, i.e. the number of samples at each level is a fourth of the number of samples in the layer below.

2.1 An HVS-Based Hierarchical Representation

A subsampling scheme suitable for subband coding of still and moving images is presented. The spatial frequency sensitivity of the human visual system (HVS) is dependent on the orientation of the spatial frequency. The shape of the 2D-frequency sensitivity function leads to the proposed partitioning of subbands.

2.1.1 Motivations

The purpose of subband decomposition in image coding is twofold; the first aim is energy compaction, the second is to make the divisions into subbands in a way such that all data represented in the same subband is visually equally important. The latter allows for a weighting of the quantization errors according to the sensitivity of the HVS, and supports progressive transmission. Some subband decomposition schemes previously used for image coding are shown in Fig. 2.3 a) to c). All the schemes in Fig. 2.3, except Fig. 2.3 a), are hierarchical. The spatial arrangements of the samples that are usually applied with the schemes in Fig. 2.3 b) and Fig. 2.3 c) are shown in Fig. 2.4 and Fig. 2.5, respectively.

Figure 2.2 Tree structure corresponding to a lowpass pyramid with a 2-dimensional subsampling factor (SSF) of 4. This subsampling structure is sometimes referred to as a quadtree. It occurs frequently in image coding and image processing because it is common to use filters which divide the frequency content into two halves (dyadic filters). The dyadic filters are applied both horizontally and vertically. Keeping the frequencies which are low both horizontally and vertically gives SSF = 4.

It is well known that humans perceive horizontal and vertical frequencies better than oblique ones. This is probably due to the dominance of horizontal and vertical structures in the environment, see [22], pp 486 – 487.

The partitioning into approximately octave bandwidths of the channels of the HVS motivates a hierarchical, dyadic, subsampling scheme.

Based on the two observations about the HVS stated above, we propose the subband partitioning shown in Fig. 2.3 d). Good energy compaction properties come from the 2D-frequency content of typical image material. The spatial arrangement of the samples is shown in Fig. 2.6.

Figure 2.3 Different subband decompositions in the 2D-frequency domain (axes wx and wy): a) Uniform b) Hierarchical, traditional implementation c) Quincunx pyramid d) HVS-based.

2.1.2 Implementation of the HVS-based Hierarchical Representation

If we approximate the iso-sensitivity curves in the 2D-frequency sensitivity function of the HVS with diamond shapes, it becomes natural to do a diamond-shaped band splitting. Quincunx sampling in the spatial domain is compatible with diamond-shaped frequency areas. Quincunx subsampling is treated in [1].

This first partitioning requires a two-dimensional filtering, which is the computationally most costly part of the scheme. The first filtering cuts off the corners of the 2-D frequency domain, as shown in Fig. 2.7. All subsequent steps can be performed using 1-D filtering, as shown in Fig. 2.8. This decomposition preserves the diamond shape of the lowest subband, and can be recursively performed until the desired number of subband decompositions has been achieved.

Figure 2.4 The effect of traditional subsampling, which subsamples the plane by a factor of 4 in each iteration, is seen when going from left to middle and from middle to right.

Figure 2.5 The effect of quincunx subsampling, which subsamples the plane by a factor of 2 in each iteration, is seen when going from left to middle and from middle to right.

Figure 2.6 Sampling patterns resulting from the HVS-based hierarchical representation. From the original sampling structure shown to the left, the image is converted to a quincunx pattern, shown in the middle. All subsequent downsampling steps follow the pattern shown by going from middle to right, i.e. downsampling by a factor of 4, keeping the quincunx arrangement.

Figure 2.7 The first step of the suggested HVS-based representation requires a two-dimensional filtering (axes wx and wy).
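The sampling patterns of the HVS-based representation (Fig. 2.6) can be sketched in Python as sets of retained grid positions. This is only an illustration of the lattice geometry, not of the filtering: which phase of the quincunx lattice is kept (here the one containing the origin) and the function name `hvs_level_points` are assumptions made for the example.

```python
def hvs_level_points(size, level):
    """Grid positions kept at a given level of the HVS-based pyramid.

    Level 0 is the full Cartesian grid, level 1 a quincunx (checkerboard)
    pattern, and every further level keeps a quarter of the previous one
    while preserving the quincunx arrangement.
    """
    if level == 0:
        return [(x, y) for y in range(size) for x in range(size)]
    step = 2 ** (level - 1)
    return [(x, y)
            for y in range(0, size, step)
            for x in range(0, size, step)
            if ((x + y) // step) % 2 == 0]


def render(size, level):
    """Draw the retained positions as a small text picture."""
    kept = set(hvs_level_points(size, level))
    return "\n".join("".join("p" if (x, y) in kept else "." for x in range(size))
                     for y in range(size))


for level in range(3):
    print(render(8, level), end="\n\n")
```

The number of retained samples drops by a factor of 2 in the first step and by a factor of 4 in each later step, and each level is a subset of the one before it.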

The HVS-adapted subbands are then quantized, each band with a quantization step adapted to the spatial modulation transfer function of the human visual system. The output of the resulting HVS-adapted wavelet coder is shown in Fig. 2.9, with comparison made to the original, an image coded with a wavelet coder using the traditional horizontal-vertical decomposition, and an image coded using the JPEG DCT coder.

2.2 Hierarchical Subsampling Giving Fractal Regions

A structure for image subsampling by means of base tilings is introduced in this section. When repeatedly applying the subsampling scheme, the resulting support areas approach a fractal, which is described and analyzed using iterated function systems. The subsampling scheme can be easily implemented and is suitable in, e.g., hierarchical image processing and image coding schemes such as wavelet coding. For hexagonally sampled images a hierarchical subsampling structure is given which yields hexagon-like regions with fractal borders.

Figure 2.8 Spatial support areas for first and second 1D-filter.

Figure 2.9 Original (top left), HVS-based wavelet coder (top right), horizontal-vertical wavelet coder (bottom left), DCT-coder (JPEG) (bottom right). The coded images use the same bit rate.

2.2.1 Motivations

The most often utilized way of subsampling an image is to independently subsample one dimension at a time, which yields a downsampling factor that is the product of two integers. However, there exist more degrees of freedom that can be exploited when subsampling the plane. One example is alternating quincunx subsampling, see Fig. 2.5, which has a subsampling factor of 2. We will here introduce new ways of subsampling the plane.

One motivation for seeking new subsampling structures comes from hierarchical image processing. Hierarchical algorithms are very efficient compared to non-hierarchical ones, as can be seen in Table 3.2 on page 68, but the efficiency of a hierarchical algorithm depends on the subsampling factor used. Because a high subsampling factor means that most of the work has to be carried out at the lowest level of the image pyramid, loosely speaking, a high subsampling factor makes the algorithm less hierarchical than a low subsampling factor. Intuitively this can also be seen as low subsampling factors giving high and steep pyramids while high subsampling factors give low and less steep pyramids. We will take hierarchical estimation of dense motion fields as an example. The algorithm is described in Section 3.3. Let us denote the 2-dimensional subsampling factor $SSF$. To be able to find all motion vectors, $SSF$ positions have to be searched at each level in the pyramid. A search over a larger area makes some motion vectors possible to find through several different combinations of decisions, see the one-dimensional example in Fig. 2.10. It can be argued that this increases error robustness, but since many vectors, e.g. the important zero vector, can only be achieved through a unique combination of decisions, error recovery is better handled by other methods.

We are now going to analyse how the computational complexity depends on the subsampling factor. For each of the

$$N \left( 1 + \frac{1}{SSF} + \frac{1}{SSF^{2}} + \dots \right) \le \frac{N}{1 - \frac{1}{SSF}}$$

samples in a low-pass pyramid, a search is made. $N$ denotes the number of samples in the original image. For each of the samples we are going to search $SSF$ positions. The computational complexity is proportional to the total number of searches, which is

$$\le \frac{N \, SSF}{1 - \frac{1}{SSF}}.$$

We see that the complexity is minimized by $SSF = 2$, and that it rises monotonically with $SSF$ for $SSF > 2$. In motion estimation, it is desired to search the zero position, and displacements to the right, left, up, and down. This is achieved by the subsampling pattern described in Section 2.2.2 with $SSF = 5$, while if separable subsampling is used, $SSF$ will have to be raised to 9, which would yield an increase in computational complexity of 62%.

Figure 2.10 Decision tree for a motion vector estimated by hierarchical block matching in one dimension using a search area of 3 and an SSF of 2; the resulting vectors range from -3 to 3. The scheme is redundant since some motion vectors are possible to obtain using different vector combinations. In the example above the vectors 1 and -1 can be obtained by different choices.
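The dependence of the search cost on the subsampling factor can be tabulated with a short Python sketch. It evaluates the upper bound from the text per original sample, i.e. SSF / (1 − 1/SSF); the function name `relative_search_cost` is chosen here for illustration.

```python
def relative_search_cost(ssf):
    """Upper bound on the number of searches per original image sample."""
    return ssf / (1.0 - 1.0 / ssf)


costs = {ssf: relative_search_cost(ssf) for ssf in range(2, 10)}
for ssf, cost in costs.items():
    print(f"SSF = {ssf}: {cost:.3f} searches per sample")

# Going from the non-separable pattern (SSF = 5) to separable
# subsampling (SSF = 9) increases the bound by this fraction:
increase = relative_search_cost(9) / relative_search_cost(5) - 1.0
print(f"increase: {increase:.0%}")
```

The bound is 4 searches per sample for SSF = 2, 6.25 for SSF = 5 and 10.125 for SSF = 9, which reproduces the 62% figure quoted above.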

Another reason for investigating new subsampling structures is the possibility this raises for hierarchical processing on hexagonal grids. Studies of hierarchical decompositions of hexagonally sampled images are biologically motivated since the retina uses a hexagonal sampling pattern and the human visual system utilizes pyramidal representations, see Section 1.3. Hexagons are also ideal for tiling the plane so that the intra-tile correlation is maximized.

Fractal tilings of the plane have earlier been treated mathematically in [6, 7, 26, 27, 37–39, 42–44, 51, 92, 115], from a computer graphics perspective in [117], and from an image representation / human vision approach in [1, 15, 55, 88, 114, 131, 133, 134]. We will here make a clear connection to image subsampling, an important concept being toggling subsampling, which will be described in Section 2.2.2. Toggling subsampling results in a coordinate system that remains aligned with the coordinate system of the original image. As analysis tools we will use concepts from the theory of fractals.

In the following section we will start by showing in detail how the subsampling is performed for a subsampling factor of 5, introducing spiraling and toggling subsampling. Thereafter we move on to discussing fractal subsampling on a Cartesian grid in general. The next section treats subsampling on a hexagonal grid. It is then shown how these subsampling structures can be used to create spatially varying resolution.

Figure 2.11 A base tiling for subsampling the plane by a factor of five. The center point of each region has been indicated.

2.2.2 Subdivisions on a Cartesian Grid

Koch Regions

Consider the tiling of the plane shown in Fig. 2.11. It is clear that keeping the central point of each tile gives a subsampling of the plane with a factor of 5. The new sample points form an orthogonal grid, with its axes tilted by an angle of $\pm\arctan\frac{1}{2}$ relative to the original axes. Since the new sample points lie on an orthogonal grid it is possible to do the same subsampling operation again. Choosing again the same orientation of the shift gives an orthogonal coordinate system tilted by $\pm 2\arctan\frac{1}{2}$, which we will in the sequel call spiraling subsampling. Choosing the alternate orientation of the shift tilts the coordinate system back to the same orientation as the original, and this will be called toggling subsampling, see Fig. 2.12.
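The difference between spiraling and toggling can be illustrated with integer lattice matrices. This is a sketch under an assumption: the basis vectors (2, 1) and (-1, 2) of the subsampled grid are chosen here because they give a factor of 5 and a tilt of arctan(1/2); they are not read off Fig. 2.11, and the names `LEFT` and `RIGHT` are hypothetical.

```python
import math


def matmul(p, q):
    """Product of two 2x2 integer matrices given as nested tuples."""
    return tuple(
        tuple(sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def det(m):
    """Determinant, i.e. the number of original samples per new sample."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def tilt(m):
    """Angle in radians of the first basis vector (the first column)."""
    return math.atan2(m[1][0], m[0][0])


# Columns are the basis vectors of the subsampled grid in original pixel units.
LEFT = ((2, -1), (1, 2))    # tilted by +arctan(1/2)
RIGHT = ((2, 1), (-1, 2))   # mirror image, tilted by -arctan(1/2)

spiraling = matmul(LEFT, LEFT)    # same orientation twice
toggling = matmul(LEFT, RIGHT)    # alternate orientation

print("spiraling:", spiraling, "tilt:", tilt(spiraling))
print("toggling: ", toggling, "tilt:", tilt(toggling))
```

Under this assumption, two spiraling steps give a grid tilted by 2 arctan(1/2), while two toggling steps give an axis-aligned grid with spacing 5, in agreement with the statement that toggling restores the original orientation.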

Fractal Properties

After a number of iterations the regions belonging to each sample obtain fractal qualities, see Fig. 2.13, which shows the regions for spiraling and toggling subsampling.

Figure 2.12 The regions generated by the second iteration of spiraling (left) and toggling (right) subsampling.

Figure 2.13 After a few iterations the subsampling regions obtain a fractal appearance, converging towards an IFS. The images show the support regions of the spiraling (to the left) and the toggling (to the right) subsampling schemes when subsampling by a factor of five with the base tiling shown in Fig. 2.11.

Theory found in [8] can explain the fractal qualities of the downsampling system. Note that the support regions of the spiraling downsampling system form the same image as the attractor of the Iterated Function System (IFS):

$$
\begin{aligned}
w_1\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} &= \begin{pmatrix} 0.4 & 0.2 \\ -0.2 & 0.4 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \end{pmatrix} \\
w_2\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} &= \begin{pmatrix} 0.4 & 0.2 \\ -0.2 & 0.4 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + \begin{pmatrix} 0.4 \\ 0.2 \end{pmatrix} \\
w_3\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} &= \begin{pmatrix} 0.4 & 0.2 \\ -0.2 & 0.4 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + \begin{pmatrix} -0.2 \\ 0.4 \end{pmatrix} \\
w_4\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} &= \begin{pmatrix} 0.4 & 0.2 \\ -0.2 & 0.4 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + \begin{pmatrix} -0.4 \\ -0.2 \end{pmatrix} \\
w_5\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} &= \begin{pmatrix} 0.4 & 0.2 \\ -0.2 & 0.4 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + \begin{pmatrix} 0.2 \\ -0.4 \end{pmatrix},
\end{aligned}
\tag{2.2}
$$

where the matrix $\begin{pmatrix} 0.4 & 0.2 \\ -0.2 & 0.4 \end{pmatrix}$ is composed of a scaling with $\frac{1}{\sqrt{5}}$ and a rotation with $\arctan\frac{1}{2}$. The IFS is visualized in Fig. 2.14.

Also the support regions generated by the toggling subsampling system can be generated by an IFS, but since the functions alternate between even and odd iterations, we have to use an IFS containing 25 functions, each iteration of the IFS corresponding to two subsamplings by a factor of 5. The IFS of the toggling subsampling scheme is shown in Fig. 2.15.
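Points on the attractor of the IFS in Equation 2.2 can be generated by random iteration. The sketch below is illustrative only: the random-iteration method, the sample count and the seed are choices made here, not taken from the thesis; only the matrix and the five translations come from Equation 2.2.

```python
import math
import random

A = ((0.4, 0.2), (-0.2, 0.4))  # shared linear part of w1, ..., w5
T = ((0.0, 0.0), (0.4, 0.2), (-0.2, 0.4), (-0.4, -0.2), (0.2, -0.4))


def apply_map(index, point):
    """Apply the contractive function with the given index to a point."""
    x1, x2 = point
    t1, t2 = T[index]
    return (A[0][0] * x1 + A[0][1] * x2 + t1,
            A[1][0] * x1 + A[1][1] * x2 + t2)


def sample_attractor(count, seed=0):
    """Random iteration: repeatedly apply a randomly chosen function."""
    rng = random.Random(seed)
    point = (0.0, 0.0)  # fixed point of w1, hence already on the attractor
    points = []
    for _ in range(count):
        point = apply_map(rng.randrange(len(T)), point)
        points.append(point)
    return points


# The linear part is a scaled rotation, so its contractivity is sqrt(det A).
scale = math.sqrt(A[0][0] * A[1][1] - A[0][1] * A[1][0])
points = sample_attractor(5000)
radius = max(math.hypot(x1, x2) for x1, x2 in points)
print(f"contractivity = {scale:.4f}, largest distance from origin = {radius:.4f}")
```

Plotting `points` with any plotting tool should give a picture resembling the spiraling support region; the generated points stay within the geometric-series bound |t| / (1 - s), which is about 0.81 here.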

The fractal dimension of the subsampling areas is 2 for both the spiraling and the toggling subsampling scheme, since they tile the plane. The borderline between the subsampling regions¹ has a fractal dimension which can be investigated by generating an IFS that generates a quarter of the four-fold symmetric boundary.

For the spiraling scheme an IFS that accomplishes this is shown in Fig. 2.16.

Since the IFS is non-overlapping [8], the fractal dimension of its attractor can be calculated using the formula

$$\sum_{n=1}^{N} |s_n|^D = 1 \tag{2.3}$$

where D denotes the fractal dimension and $s_n$ the contractivity factor of function n, in this case $s_n = \frac{1}{\sqrt{5}}$, n = 1, 2, 3, yielding $D = \frac{\log 3}{\log\sqrt{5}} \approx 1.37$.

¹ The boundary of the spiraling scheme is a type of Koch curve [127, 128].
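Equation 2.3 can also be solved numerically when no closed form is at hand. The following sketch uses bisection, which is a choice made here; it assumes at least two functions with contractivity factors strictly between 0 and 1. The two test cases are the three boundary maps of Fig. 2.16 and an IFS with nine maps of contractivity 1/5.

```python
import math


def similarity_dimension(factors, tol=1e-12):
    """Solve sum(|s_n| ** D) = 1 for D by bisection; the sum decreases with D."""
    lo, hi = 0.0, 1.0
    # Grow the upper end until the sum has dropped to 1 or below.
    while sum(abs(s) ** hi for s in factors) > 1.0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(abs(s) ** mid for s in factors) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


d_three_maps = similarity_dimension([1 / math.sqrt(5)] * 3)
d_nine_maps = similarity_dimension([1 / 5] * 9)
print(f"three maps of 1/sqrt(5): D = {d_three_maps:.4f}")
print(f"nine maps of 1/5:        D = {d_nine_maps:.4f}")
```

Both cases give the same dimension, since log 9 / log 5 equals log 3 / log sqrt(5).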


Figure 2.14 The IFS whose attractor is the same as the spiraling subsampling Koch regions is composed of 5 contractive functions w1, ..., w5. Each function wi is composed of a scaling, a rotation, and a translation.

For the toggling subsampling scheme we can build an IFS that generates a fourth of the border using 9 functions, each with a contractivity factor of $\frac{1}{5}$, yielding the fractal dimension $D = \frac{\log 3}{\log\sqrt{5}}$, the same as for the spiraling scheme.

There are of course many more permutations of left-right choices than the extreme cases spiraling and toggling subsampling. We believe that all of them have equal qualities, since the fractal dimensions of the edges of spiraling and toggling subsampling are the same.

Fig. 2.17 shows the “Lenna” test image subsampled by a factor of 625, both by toggling subsampling with a factor of five in each iteration and by independent (separable) subsampling, which gives square blocks. In both images the mean of each block is used as reconstruction value.

Smoother reconstruction is obtained by bi-linear reconstruction, which is also seen in Fig. 2.17.

The type of distortion is quite different in the two cases, with the separable subsampling giving rise to horizontal and vertical structures which are clearly visible to the eye.


Figure 2.15 The IFS whose attractor is the same as the toggling subsampling regions is composed of 25 contractive functions wij, i, j = 1, ..., 5. Each function wij is composed of a scaling and a translation.

Figure 2.16 An IFS, composed of the functions wb1, wb2 and wb3, whose attractor is a fourth of the borderline of the spiraling subsampling Koch curve. The horizontal line is mapped onto the three shorter lines.

Rhomb-like Fractal Regions

Figure 2.17 Top row: The “Lenna” test image in resolution 500 × 500 pixels downsampled by a factor of 625 using toggling subsampling (left) and independent vertical and horizontal subsampling (right). Lower row: Same test image, here downsampled by a factor of 25 (5²) and bi-linearly upsampled, using toggling subsampling (left) and independent horizontal and vertical subsampling (right). A cut-out of the shoulder contour is shown.

By rhomb-like regions we mean areas generated by base tiles according to Fig. 2.18. These tiles fill the plane, forming a sponge-weave pattern. Each tile is made up of four pyramids with base k (each pyramid shaded in Fig. 2.18) and a central point, thus containing $\mathit{SSF}_{rk}$ points where

$$\mathit{SSF}_{rk} = 2(k^2 + k) + 1. \tag{2.4}$$

If we also generate an IFS that describes part of the boundary (Fig. 2.16 is an example where k = 1), the fractal dimension of the boundary, $D_{rk}$, can be derived using Equation 2.3, which yields a valid calculation of the fractal dimension

$$D_{rk} = 2\,\frac{\log(2k + 1)}{\log(2(k^2 + k) + 1)} \tag{2.5}$$

since the IFS is non-overlapping.

The Koch region resulting from hierarchical subsampling with a factor of 5 is the first in the series of rhomb-like fractal regions. With increasing k, the fractal dimension diminishes so that the regions become closer to an ordinary, non-fractal rhomb.
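The behaviour of Equations 2.4 and 2.5 for the first few values of k can be tabulated directly. This is a small illustrative sketch; the function names are introduced here.

```python
import math


def rhomb_ssf(k: int) -> int:
    """Subsampling factor of the rhomb-like base tile, Equation 2.4."""
    return 2 * (k * k + k) + 1


def rhomb_boundary_dimension(k: int) -> float:
    """Fractal dimension of the region boundary, Equation 2.5."""
    return 2.0 * math.log(2 * k + 1) / math.log(rhomb_ssf(k))


rows = [(k, rhomb_ssf(k), rhomb_boundary_dimension(k)) for k in range(1, 6)]
for k, ssf, dim in rows:
    print(f"k = {k}: SSF = {ssf:3d}, boundary dimension = {dim:.4f}")
```

For k = 1 the expressions reduce to SSF = 5 and D = log 3 / log sqrt(5), i.e. the Koch regions, and the dimension decreases as k grows.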


Figure 2.18 Base tiling pattern for rhomb-like fractal regions, in this case k = 2. Shading has been done to illustrate Equation 2.4.

A General Condition for the Existence of Hierarchical Subsampling on a Cartesian Grid

There exist more subsampling schemes than the Koch regions and rhomb-like fractal regions described above in Section 2.2.2. In [131] an SSF of 10 was used in an image codec modeling the human visual system. The condition for existence of a hierarchical subsampling from a Cartesian grid to a Cartesian grid in the plane is that the subsampling factor, SSF, is the square of the distance, c, between the coordinate points at level i, expressed in the coordinates of level i − 1, see Fig. 2.19. This can be generalized to any dimension, d, giving the formula

$$\mathit{SSF} = c^d. \tag{2.6}$$

In the two-dimensional case,

$$\mathit{SSF} = c^2 = a^2 + b^2, \qquad a, b \in \mathbb{N}. \tag{2.7}$$

The possible subsampling factors are tabulated in Table 2.1 for a and b up to 5. It is easy to show that an infinite number of base tiles yielding SSF as subsampling factor exist, if the base tiles are allowed to have parts spread out far from their origin.

2.2.3 Subdivisions on a Hexagonal Grid

The hexagonal grid, shown in Fig. 2.20, has optimal performance for dividing the plane into regions that tile the plane and have the highest intra-region correlation.


Figure 2.19 Subsampling from a Cartesian grid to a Cartesian grid requires preserving right angles and equal distance to all four neighboring pixels. The solution is shown above, where a, b ∈ ℕ, and c is the resulting distance between the new sample points.

        a = 1   a = 2   a = 3   a = 4   a = 5
b = 1     2       5      10      17      26
b = 2     -       8      13      20      29
b = 3     -       -      18      25      34
b = 4     -       -       -      32      41
b = 5     -       -       -       -      50

Table 2.1 Possible subsampling factors for hierarchical subsampling giving fractal regions on a Cartesian grid. The boldface subsampling factors can be obtained by rhomb-like fractal subsampling. It is also possible to perform separable subsampling, which yields SSF = N², N ∈ ℕ.
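The entries of Table 2.1 follow from Equation 2.7 and can be regenerated mechanically. In the sketch below, the observation that the rhomb-like factors of Equation 2.4 correspond to a = k + 1, b = k is derived here from (k + 1)² + k² = 2(k² + k) + 1; the function names are hypothetical.

```python
def cartesian_factors(limit: int) -> dict:
    """SSF = a^2 + b^2 for 1 <= b <= a <= limit (Equation 2.7), keyed by (a, b)."""
    return {(a, b): a * a + b * b
            for b in range(1, limit + 1)
            for a in range(b, limit + 1)}


def rhomb_like_entries(factors: dict) -> list:
    """Entries with a = b + 1, which equal 2(k^2 + k) + 1 for k = b."""
    return sorted(ssf for (a, b), ssf in factors.items() if a == b + 1)


table = cartesian_factors(5)
for b in range(1, 6):
    row = [str(table[(a, b)]) if a >= b else "-" for a in range(1, 6)]
    print(f"b = {b}: " + "  ".join(f"{cell:>3}" for cell in row))
print("rhomb-like factors:", rhomb_like_entries(table))
```

The diagonal a = b gives factors of the form 2a², and the entries with a = b + 1 are the factors attainable by rhomb-like fractal subsampling.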


Figure 2.20 Hexagonal subdivision of the plane.

Hexagonal sampling also has biological implications since it is the sampling scheme used in the retina, especially in its central part. The visual information from the central part of the retina is in the human visual system divided into different spatial frequency channels. For reasons of efficiency, we believe that the low frequency channels are created from medium frequency channel data rather than from data coming directly from the receptors, which indicates that hierarchical processing is taking place in the human visual system. A model for the formation of the four types of neurons observed in the static image pathway, namely circularly symmetric, simple, complex, and hypercomplex cells, has been given by Crettez and Simon [24]. Their model uses a Hadamard transform on a hexagonal grid, and a subsampling factor of 7, corresponding to the Gosper regions that will be introduced in the next section.

Gosper Regions

The “flowsnake”, a modification of the Koch snowflake which makes it tile the plane, was given by William Gosper and presented by Gardner in [37]. It has also been described by Mandelbrot [92] and Stevens [117].

Figure 2.21 A base tiling for subsampling by a factor of seven, given an image of hexagonal representation.

Gosper regions are generated by hierarchical spiraling subsampling using the base tiling shown in Fig. 2.21, which subsamples the plane by a factor of seven. The IFS for the spiraling subsampling scheme is shown in Fig. 2.22. The support region of the toggling subsampling scheme is shown in Fig. 2.23.

Subsampling of this type was discussed in [1]. In [133] a filter bank for image decomposition using a lowpass filter, 3 highpass filters having odd symmetry, and 3 highpass filters having even symmetry was derived. In [114] a QMF strategy for filter design was adopted, yielding circular symmetry also for the highpass filters.

Hexagon-like Fractal Regions

Merging hexagons so that they form a larger hexagon is not possible; it is however possible to approach the hexagonal shape by using a larger subsampling factor. We will here show how to construct these hexagon-like fractal regions.

Figure 2.22 The IFS whose attractor is the same as the spiraling subsampling Gosper regions is composed of 7 contractive functions w1, ..., w7. Each function wi is composed of a scaling, a rotation, and a translation.

Figure 2.23 Support region for toggling Gosper subsampling.

Consider a base tile made up of 6 pyramids with base k, and a central point, see Fig. 2.24. It is easily verified that these base tiles fill the plane.

The number of pixels in the basic tile area, and thus the subsampling factor $\mathit{SSF}_{hk}$, is

$$\mathit{SSF}_{hk} = 3(k^2 + k) + 1. \tag{2.8}$$

Figure 2.24 Base tiling pattern for hexagon-like fractal regions, in this case k = 2. The shading has been done to illustrate Equation 2.8.

By generating an IFS that describes a piece of the boundary, it can be seen that the fractal dimension $D_{hk}$ can be written

$$D_{hk} = 2\,\frac{\log(2k + 1)}{\log(3(k^2 + k) + 1)}. \tag{2.9}$$

We notice that the dimension of the edge goes towards 1 with increasing k.

The Gosper regions are a special case of hexagon-like fractal tilings, generated by k = 1 and having $\mathit{SSF}_{h1} = 7$ and $D_{h1} = 1.12915$; the next hexagon-like fractal tiling, generated by k = 2, has $\mathit{SSF}_{h2} = 19$ and $D_{h2} = 1.09321$.
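The quoted values for k = 1 and k = 2 follow directly from Equations 2.8 and 2.9. A minimal check, with function names introduced here for illustration:

```python
import math


def hexagon_ssf(k: int) -> int:
    """Subsampling factor of the hexagon-like base tile, Equation 2.8."""
    return 3 * (k * k + k) + 1


def hexagon_boundary_dimension(k: int) -> float:
    """Fractal dimension of the region boundary, Equation 2.9."""
    return 2.0 * math.log(2 * k + 1) / math.log(hexagon_ssf(k))


for k in (1, 2, 3, 10, 100):
    print(f"k = {k:3d}: SSF = {hexagon_ssf(k):6d}, "
          f"D = {hexagon_boundary_dimension(k):.5f}")
```

The larger values of k in the loop are included only to illustrate the slow approach of the dimension towards 1.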

Fudgeflakes

For hierarchical image processing it is advantageous to have a low subsampling factor in order to keep the computational load down. An SSF of three is obtained using the base tiling shown in Fig. 2.25, which generates a fudgeflake when iterated [92]. Its low subsampling factor and modest fractal dimension make it an interesting subsampling pattern.

Figure 2.25 A base tiling for subsampling by a factor of three, given an image of hexagonal representation.

A General Condition for the Existence of Hierarchical Subsampling on a Hexagonal Grid

The condition for existence of a hierarchical subsampling from a hexagonal grid to a hexagonal grid in the plane is that the subsampling factor, SSF, is the square of the distance, c, between the coordinate points at level i, expressed in the coordinates of level i − 1,

$$\mathit{SSF} = c^2 = a^2 + b^2 + ab, \qquad a, b \in \mathbb{N}, \tag{2.10}$$

see also Fig. 2.26. The possible subsampling factors are tabulated in Table 2.2 for a and b up to 5. It can be shown that an infinite number of base tiles yielding SSF as subsampling factor exist.
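Equation 2.10 can be cross-checked against plain geometry. The sketch below assumes hexagonal basis vectors of unit length at an angle of 60 degrees to each other, which is a common convention chosen here; it compares the squared length of a·e1 + b·e2 with a² + b² + ab.

```python
import math

# Assumed unit basis vectors of the hexagonal grid, 60 degrees apart.
E1 = (1.0, 0.0)
E2 = (0.5, math.sqrt(3) / 2)


def hexagonal_ssf(a: int, b: int) -> int:
    """Subsampling factor according to Equation 2.10."""
    return a * a + b * b + a * b


def squared_distance(a: int, b: int) -> float:
    """Squared length of the lattice vector a*E1 + b*E2."""
    x = a * E1[0] + b * E2[0]
    y = a * E1[1] + b * E2[1]
    return x * x + y * y


hex_table = {(a, b): hexagonal_ssf(a, b)
             for b in range(1, 6) for a in range(b, 6)}
largest_mismatch = max(abs(squared_distance(a, b) - ssf)
                       for (a, b), ssf in hex_table.items())
print("factors for b = 1:", [hex_table[(a, 1)] for a in range(1, 6)])
print("largest mismatch against geometry:", largest_mismatch)
```

The entries with a = b + 1 coincide with the hexagon-like factors 3(k² + k) + 1 of Equation 2.8 for k = b.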

2.2.4 Spatially Varying Resolution

In situations where the viewer’s eye movements can be tracked, e.g. in a tele-operator situation, large savings of data rate can be made by adapting the resolution of the transmitted image to the resolution of the corresponding retinal area, thus allowing both a large viewing angle and a high resolution in the central part of the visual field. The retina has a central disc-shaped part with high and constant resolution, while the density of image receptors and subsequent cells in the signal processing pathway diminishes with distance from the retina’s central point.

Figure 2.26 Distance, c, between sample points can be calculated from a and b, which meet at an angle of 2π/3, see also Equation 2.10.

        a = 1   a = 2   a = 3   a = 4   a = 5
b = 1     3       7      13      21      31
b = 2     -      12      19      28      39
b = 3     -       -      27      37      49
b = 4     -       -       -      48      61
b = 5     -       -       -       -      75

Table 2.2 Possible subsampling factors for hierarchical subsampling which give fractal regions on a hexagonal grid. The boldface subsampling factors can be obtained by hexagon-like fractal subsampling.

Hexagon-like fractal subsampling was used to create an image whose resolution resembles that of the retina, shown in Fig. 2.27 a). Here three levels of resolution were used: full, medium and low. The reconstruction of the low and medium resolution areas was made using constant bandwidth within each area. Since the hexagon-like regions are close to circles, the areas of different resolutions are well matched to the resolution of the retina. Fig. 2.27 c) shows the result of using medium resolution in all areas. To compute the number of samples in Fig. 2.27 a) and 2.27 c), using the notation of Fig. 2.27 d), and ignoring edge effects coming from the fact that the image is not congruent with the resolution regions, we note that the relations between the sampling densities are $d_{i+1} = \frac{d_i}{\mathit{SSF}}$, and the relations between the areas are $A_{i+1} = (\mathit{SSF} - 1)\sum_{j=0}^{i} A_j$. For the varying resolution case we get the total number of samples

$$D_V = d_0 A_0 + d_1 A_1 + d_2 A_2 = d_1 A_1\left(\left(1 + \frac{1}{\mathit{SSF} - 1}\right) + 1 + 1\right),$$

while for the mono-resolution case

$$D_M = d_1 A_0 + d_1 A_1 + d_1 A_2 = d_1 A_1\left(\frac{1}{\mathit{SSF} - 1} + 1 + \mathit{SSF}\right).$$

For an SSF higher than 2, the varying density scheme contains fewer samples than the mono-resolution one. Even though the varying resolution scheme yields fewer samples, its reconstructed image is better adapted to the eye’s resolution, which gives better subjective quality.
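The two sample counts can be evaluated from the recursions for densities and areas. The sketch below normalizes d1 = A1 = 1 by default, which is a convenience chosen here; the function name is hypothetical.

```python
def sample_counts(ssf: float, d1: float = 1.0, a1: float = 1.0):
    """Return (D_V, D_M): samples for varying and for medium-only resolution."""
    a0 = a1 / (ssf - 1.0)          # from A1 = (SSF - 1) * A0
    a2 = (ssf - 1.0) * (a0 + a1)   # from A2 = (SSF - 1) * (A0 + A1)
    d0 = d1 * ssf                  # from d1 = d0 / SSF
    d2 = d1 / ssf                  # from d2 = d1 / SSF
    varying = d0 * a0 + d1 * a1 + d2 * a2
    mono = d1 * (a0 + a1 + a2)
    return varying, mono


dv, dm = sample_counts(19)
print(f"SSF = 19: D_V = {dv:.4f}, D_M = {dm:.4f}, ratio = {dm / dv:.2f}")
print("SSF = 2: ", sample_counts(2))
```

For SSF = 19, the factor used in Fig. 2.27, the medium-only image needs between six and seven times as many samples as the varying-resolution image; for SSF = 2 the two counts coincide, which is the break-even point stated in the text.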

2.3 Conclusions on Hierarchical Image Structures

The HVS-adapted scheme presented in Section 2.1 is well suited for image coding. It resembles the traditional dyadic, separable scheme, but because of the initial 2D-filtering and quincunx subsampling, the subbands conform better to the orientational sensitivity of the human visual system.

Other ways to subsample the image plane than the well-known separable and quincunx structures have been shown in Section 2.2.

The structures can be used for hierarchical image processing, in which case the base tilings giving low subsampling factors are the most interesting for keeping computational load down.

For wavelet coding many of the structures described are interesting, but substantial work will have to be invested in filter design and evaluation.

The hexagon-like structure is also interesting since there has been a lack of ways to subsample the hexagonal pattern so that the hexagonal structure is preserved. It is well-suited in schemes where spatial resolution varies with position, e.g., as a function of eccentricity from the optical axis.


Figure 2.27 Spatially varying resolution using hexagon-like fractal regions with SSF = 19. a) Around the focus of attention, there is an area having full resolution. Around this area there is a medium resolution area with 19 times fewer samples per area unit. The rest of the area has 19² times lower resolution than the central area. b) Original image with an overlay showing the spatial locations of the resolution regions. c) Medium resolution used in all regions. d) Schematic figure showing resolution regions having areas A0, A1, A2 and sample densities d0, d1, d2.

3 Motion Estimation

Motion estimation is an essential and resource-demanding task in image coding, image processing and computer vision [118].

What is really estimated when finding the same point on an object in two consecutive frames is displacement rather than motion. The established term for finding these correspondences is however “motion estimation” rather than “displacement estimation”. Since there is a strong coupling between motion and the displacements in the image sequence, we will use the term “motion” as a synonym for “displacement”.

There are three desirable features of the motion vectors belonging to a frame of the sequence: true displacements, low prediction error, and low entropy. Which of these are important depends on the application. For model-based coding or motion compensated frame interpolation it is important to have true displacements. Hybrid coding, as in e.g. MPEG-2 or H.263, does not rely on physically true displacements, but instead on low prediction error and low entropy of the motion vector field. How the accuracy of motion vectors influences prediction quality will be the topic of Section 3.2. How to find motion vectors having low entropy has been studied in [77]. In the rest of the chapter we will mainly be concerned with finding physically true motion vectors.

Algorithms for computing motion vectors are the topic of this chapter.

The motion vectors that are the output from these computations are however only estimates, and the quality of the estimation varies with spatial position and time. How to compute local certainty measures for motion vectors is described in Chapter 4.

3.1 Objectives and Notation

The three-dimensional scene with its luminance variations and movements, together with camera movements and zoom, creates variations in the intensity in the image plane. If we could directly project the three-dimensional movements onto the image plane we would get the 2D-Motion Field. In [57] an example is given where the scene consists of a smooth sphere rotating under constant illumination. No motion detection algorithm, and no human being, can detect the motion that is taking place in this scene; in this extreme example it is neither possible nor necessary for image coding purposes to get a measure of true motion.

From an image coding point of view, the desired output from a motion estimation algorithm is the two-dimensional Apparent Motion Field, in the sequel called AMF(x), where x denotes spatial coordinates. The AMF can be caused by motion of objects in the scene, camera motion, e.g. panning, and also by camera zooming, which is not physical motion but causes a similar effect. In most cases the motion field and the apparent motion field coincide since we are very skilled in making correct interpretations of the light reflected into our eyes.

If the image content is locally 1-dimensional, it is impossible to determine from a small piece of the image where it belongs in another image of the same scene (e.g. the previously captured image), see Fig. 3.1. This is called the aperture problem. Motion estimation algorithms generally run into this problem, while it does not bother humans, since we use higher order constraints to understand the scene.

The aperture problem does not prevent the component of the motion vector which is perpendicular to the image content from being determined, which in the case of 1-dimensional image content should be sufficient for prediction purposes.

Humans can detect more than one motion taking place in the same spatial location, as long as the spatial frequency content is different for the different motions [132]. This situation can occur if we have transparent objects, and when reflection occurs, e.g. when a person is sitting behind a polished wooden desk. Another illustrative example is when a light source reflects in a person’s face: the highlight does not follow the movements of the face.

Figure 3.1 The aperture problem. When the image content is 1-dimensional it is not possible to determine the correspondence between image pairs based on local information. It is however possible to determine the motion vector component which is perpendicular to the image content. In the example above, the local image content is vertical, therefore it is possible to determine the horizontal component of the motion vector.

In an image coding application it is essential that the estimated apparent motion field is not smeared around borders of moving objects, since this leads to very visible distortion [69].

It is further important that the algorithms give accurate estimates also for large motions. How large displacements must be detected depends on what type of scene is being coded; for HDTV displacements in the magnitude of 100 pixels between frames should be detectable, while for video telephony, which has less movement and lower resolution, displacements in the magnitude of 10 pixels may be sufficient, see page 19.

For many applications a Dense Motion Field, i.e. a Motion Field having one motion vector per pixel, is needed; in other applications such as the standardized hybrid-DCT coders, e.g. H.263 and MPEG-2, only one motion vector per block is needed.

In the next subsections, it will be described how apparent motion can be estimated. Only image Intensity, I(x), will be used, although it could be a good idea to use also the colour information, e.g. represented as Hue, H(x), and Saturation, S(x), since colour information is less sensitive to varying light conditions in different parts of the scene. How varying light conditions influence the motion estimation is analysed in [71], and using the values of Red, R(x), Green, G(x), and Blue, B(x), instead of only I(x) to obtain more reliable estimates is investigated in [72]. The HVS has been believed not to use colour for motion perception, perhaps because it is desirable that the same way of performing motion analysis should work in both bright and dim light conditions, while colour cannot be detected in dim light. More recent research has shown that motion can be perceived also for moving colour patterns; the perception is however weaker than for luminance patterns.

In the following, displacements will be denoted $\Delta x = \begin{pmatrix} u & v \end{pmatrix}^T$, and a displacement field is defined as a vector field of displacements over the entire image, $\Delta x(x) = \begin{pmatrix} u(x) \\ v(x) \end{pmatrix}$.

Logarithm with respect to base two will be denoted lb and logarithm with respect to base three will be denoted lt.

3.2 Required Accuracy of Motion Vectors

How accurate do the motion vectors have to be for the coding to work well? In the case of Hybrid Coding the prediction gain is a function of motion accuracy and image frequency content. Suppose that pure translation has taken place, so that the signal can be perfectly predicted by accurate motion compensation.

An error in the displacement has a different impact depending on the frequency content of the image, see Fig. 3.2. The low frequency signal and the high frequency signal both suffer from the same error resulting from erroneous motion estimation. For the low frequency signal, the prediction is however almost in phase with the signal, so the prediction error will get relatively low amplitude. For the high frequency component, however, the result of the motion vector error is that the prediction and the signal have opposite phase, so that the prediction error will have twice the amplitude of the signal itself.

The frequency components of the signal can be represented by 2-dimensional vectors, see Fig. 3.3. Perfect motion compensation gives, after subtraction of the predicted signal, zero resulting error, while an error in the motion vector causing a phase shift of 60 degrees gives a resulting error as big as the original signal. More generally, denoting the energy of the prediction error $B^2$, the energy of the signal vector $A^2$ and the phase error $\Theta$, we have

$$B^2 = 2A^2(1 - \cos\Theta). \tag{3.1}$$
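Equation 3.1 can be verified by treating a frequency component and its prediction as phasors of equal amplitude. The following sketch compares the direct phasor difference with the closed form; the function names are introduced here for illustration.

```python
import cmath
import math


def prediction_error_energy(amplitude: float, phase_error: float) -> float:
    """Energy |signal - prediction|^2 for two phasors of equal amplitude."""
    signal = complex(amplitude, 0.0)
    prediction = amplitude * cmath.exp(1j * phase_error)
    return abs(signal - prediction) ** 2


def closed_form(amplitude: float, phase_error: float) -> float:
    """Right-hand side of Equation 3.1."""
    return 2.0 * amplitude ** 2 * (1.0 - math.cos(phase_error))


for degrees in (0, 30, 60, 90, 180):
    theta = math.radians(degrees)
    print(f"phase error {degrees:3d} deg: "
          f"B^2 / A^2 = {prediction_error_energy(1.0, theta):.4f} "
          f"(closed form {closed_form(1.0, theta):.4f})")
```

A phase error of 60 degrees gives an error energy equal to the signal energy, and a phase error of 180 degrees gives four times the energy, i.e. twice the amplitude, matching the two cases discussed in the text.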


Since a certain motion vector error deteriorates the prediction of high frequencies more than low frequencies, a filter has sometimes been used to damp the high frequencies in the predicted image (a so-called loop filter) when motion vectors have been known to be inaccurate. The ideal filter is a Wiener filter, see [45], but this requires knowledge about the image content’s spectral density function, the image noise’s spectral density function and the probability density function of the displacement error.

In [126] numerical examples can be found regarding how various motion vector errors decrease the performance of different transforms.

Ericsson [30] made simulations on video sequences comparing the performance of integer-pixel accuracy and 1/8-pixel accuracy motion compensation of 16x16 pixel blocks. The simulations showed that it was worth spending the extra bits on the increased accuracy of the motion vectors. The total performance of the coder, in a Rate-Distortion sense, increased with fractional-pel accuracy. It was not, however, clarified how much of the performance gain was due to the loop filtering introduced by the interpolation and how much was due to the improved precision of the motion vectors.

A further complication to motion compensation is that the input images are aliased by the imaging process due to non-ideal lowpass filtering. This implies that even with perfectly accurate motion vectors, the prediction error will not be zero, except for integer-pixel displacements. Remedies to this problem have been suggested [10, 135].

As a concluding remark on required accuracy we would again like to stress that in the coding process it is not required that the motion vectors are physically true, only that they give a good prediction. Accuracy is therefore in this context related to small deviations from a vector giving the best prediction. The small deviations can be caused both by estimation errors and quantization effects when the motion vectors are coded and transmitted over the channel.

3.3 Algorithms Based on Spatial Matching

Suppose that we are going to find a motion vector having its base at a certain position in image one and pointing at a certain position in image two in such a way that the displacement vector connects the same points in the underlying 3D-scene. In the block matching technique we do this by comparing image content in an area in image one around the base of the vector with the image content around possible displacement positions in image two.

Figure 3.2 The error in the displacement vector has relatively low impact on the prediction of a low frequency signal (top), but high impact on the prediction of a high frequency signal (bottom). The solid line represents the signal to encode, and the dotted line the motion compensated prediction. If the motion compensation was perfect, the two would coincide, and subtracting the motion compensated prediction would yield zero error.

Figure 3.3 Result of motion compensation. Left: Perfect motion compensation gives zero resulting error; Right: Motion vector error corresponding to 60 deg phase shift gives a resulting error with the same energy as the original signal.

The amount of similarity is evaluated by the Match Criterion (MC); we choose the displacement that gives the best value of the Match Criterion.

The set of displacements that are tested constitutes the Search Window. An example of a Search Window is a 5 × 5 square pixel area, allowing displacements of up to two pixels in each direction to be detected.

The area from which image content for the comparison is selected is a rectangle called the Match Block. It can e.g. be of size 7 × 7 pixels, and is centered around the base of the motion vector to be found.

The Match Block constitutes the support of the Match Window. A reason for selecting a Match Window that is non-uniform is that points lying closer to the base position of the vector are more likely to have the same motion as this point and should therefore be given higher weights. An example of a Match Window is a truncated Gaussian window.

We will now continue by giving examples of some Match Criteria that have been suggested and used. A new Match Criterion is suggested in Section 3.8.7. Then follow descriptions of different search strategies leading to different behaviour of the block matching algorithm, and also to different computational complexities, which are summarized in Table 3.2.

• Sum of Absolute Differences, SAD: The weighted absolute value of the difference in image intensity is summed over the match window. See also Mean Absolute Error, MAE.

• Mean Absolute Error, MAE: The weighted absolute value of the difference in image intensity is averaged over the match window. See also Sum of Absolute Differences, SAD.

• Mean Squared Error, MSE: The same as MAE, but with the differences squared.

• Uniform Error Criterion: Counts the number of appearances of an absolute error greater than a threshold over the match block.

• Correlation: Here the weighted pixelwise product is computed. Maximum correlation gives the displacement estimate. The correlation function can be written

$$c(x, a) = \sum_{x} w(x)\, I_1(x)\, w(x - a)\, I_2(x - a). \tag{3.2}$$

Normalization with respect to local energy content in the two images can be applied.

• Rate: Computable in a hybrid-DCT coding situation (theoretically also in other coders). Given a certain acceptable distortion, what will the number of bits produced by the coder be using this displacement? The minimum is sought.
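The first five criteria in the list above can be sketched as small functions. The sketch makes several assumptions that are not in the text: blocks are flat lists of intensities, the second block has already been taken at the displaced position, and the averages divide by the number of samples rather than by the sum of the weights. The Rate criterion is omitted since it needs a complete coder.

```python
def sad(block1, block2, weights):
    """Weighted Sum of Absolute Differences."""
    return sum(w * abs(p - q) for p, q, w in zip(block1, block2, weights))


def mae(block1, block2, weights):
    """Weighted Mean Absolute Error (assumed normalization: sample count)."""
    return sad(block1, block2, weights) / len(block1)


def mse(block1, block2, weights):
    """Weighted Mean Squared Error (assumed normalization: sample count)."""
    total = sum(w * (p - q) ** 2 for p, q, w in zip(block1, block2, weights))
    return total / len(block1)


def uniform_error_count(block1, block2, threshold):
    """Number of samples whose absolute error exceeds the threshold."""
    return sum(1 for p, q in zip(block1, block2) if abs(p - q) > threshold)


def correlation(block1, block2, weights1, weights2):
    """Sum of products of the two windowed blocks, in the spirit of Eq. 3.2."""
    return sum(w1 * p * w2 * q
               for p, q, w1, w2 in zip(block1, block2, weights1, weights2))


b1, b2, uniform = [1, 2, 3], [1, 4, 0], [1, 1, 1]
print(sad(b1, b2, uniform), mae(b1, b2, uniform), mse(b1, b2, uniform))
print(uniform_error_count(b1, b2, threshold=2))
print(correlation(b1, b2, uniform, uniform))
```

SAD, MAE, MSE and the uniform error count are minimized to find the best match, whereas the correlation is maximized.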

Block-matching has been extensively used in hybrid-DCT coderswith one displacement estimate for each block. There is howevernothing that prevents the generation of dense displacement fieldsusing block matching. A major disadvantage of block matching isthat it does not give sub-pixel accuracy. This disadvantage can beovercome e.g. by generating oversampled images or by a final inter-polation step, see [108].

We will summarize below some search strategies, of which one, full search, is guaranteed to find the extremal value of the match criterion, while the reduced search strategies are only guaranteed to find local extremal values.

3.3.1 Full Search

An exhaustive search in the image plane is also called full search, or flat search, in contrast to hierarchical search strategies. If displacements up to $N-1$ pixels are to be detected, $(2N-1)^2$ positions must be searched. The number of searches can be reduced by a constant factor of e.g. 8 or 16 with very little loss in performance [143] using pixel subsampling (equivalent to having a sparse match window) and block subsampling (estimating the motion field for e.g. half or a fourth of the blocks and interpolating it for the others). Another speed-up can be achieved by initially computing the match criterion corresponding to zero displacement, and then, for the other positions, first computing partial sums, which are used to exclude many of the vectors in the search window from computation of the match criterion. Lin and Tai [80] report that 86% of the vectors can be excluded from Match Criterion calculations. Because the number of positions in the search window rises quadratically with the length of the vectors to be found, the number of Match Criterion evaluations is still high for large displacements, see Table 3.2, where the figures for the unmodified algorithm are given.
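As an illustration of the unmodified full search, the following Python sketch evaluates every candidate displacement for one block and keeps the best one. It assumes SAD with a uniform match window as the match criterion and frames stored as lists of rows; the names `full_search` and `candidate_count` are hypothetical.

```python
def block_sad(prev, cur, top, left, dy, dx, size):
    """SAD between a block in cur and the block displaced by (dy, dx) in prev."""
    total = 0
    for r in range(size):
        for c in range(size):
            total += abs(cur[top + r][left + c] - prev[top + r + dy][left + c + dx])
    return total


def full_search(prev, cur, top, left, size, max_disp):
    """Return (dy, dx, cost) of the best match among all candidates.

    Candidates whose displaced block would fall outside the frame are skipped.
    """
    height, width = len(prev), len(prev[0])
    best = None
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            inside = (0 <= top + dy and top + dy + size <= height and
                      0 <= left + dx and left + dx + size <= width)
            if not inside:
                continue
            cost = block_sad(prev, cur, top, left, dy, dx, size)
            if best is None or cost < best[2]:
                best = (dy, dx, cost)
    return best


def candidate_count(n):
    """Number of positions searched for displacements up to n - 1 pixels."""
    return (2 * n - 1) ** 2
```

With `max_disp = N - 1` the two nested loops visit exactly the $(2N-1)^2$ positions mentioned in the text, which is what makes the method expensive for long vectors.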


[Figure: lowpass image pyramid with levels I0, I1 and I2 containing N^2, N^2/4 and N^2/16 pixels.]

Figure 3.4 Lowpass image pyramid. The number of pixels at each level is given, assuming a subsampling factor of 2 in both horizontal and vertical direction.

An interesting variant of a full search algorithm is used by Gennery [40], who, instead of choosing the position where the MC is maximal (assuming here that a high MC indicates a good match), computes the mathematical expectation value of the position x over the search window. This of course requires that the MC is proportional to the probability of correct match. The motivation for the approach is lower noise sensitivity; an interesting by-product is sub-pixel resolution.¹

3.3.2 Hierarchical Search

Hierarchical matching was first described in [46]; some extensions can be found in [3].

Description

In Hierarchical Search, a low-pass or band-pass pyramid representation of the image is first created. In Fig. 3.4 a lowpass image pyramid $I^i(x)$, $i = 0, \ldots, n$, is shown. The original image is $I^0(x)$. Assuming that $I^0(x)$ has $N^2$ pixels and that the subsampling factor is 2 both horizontally and vertically, the number of pixels at level $i$ will be $N^2/4^i$, as shown in Fig. 3.4. The total number of pixels in the image pyramid is then upper bounded by the sum $N^2 + N^2/4 + \ldots$, which amounts to $\frac{4}{3}N^2$.

The generation of the pyramid is done one layer at a time by lowpass filtering and subsampling of the previous layer. The first step is the generation of $I^1(x)$ from $I^0(x)$, which requires $c_f N^2/4$ operations, where $c_f$ is a constant depending on the filter being used. The next step is generating $I^2(x)$ from $I^1(x)$, which requires $c_f N^2/16$ operations. The total number of operations for generation of the lowpass image pyramid is upper bounded by $\frac{c_f}{3} N^2$.

¹ The author however reports that for sharp correlation peaks, the estimates tend to be biased towards the nearest integer-pixel.

Block matching starts at a high level in the pyramid, giving a coarse displacement field. This displacement estimate is propagated to the layer below in the pyramid, and used as starting point for the search at that level. The search area at each level can be very small.
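The pyramid construction described above can be sketched as follows. The 2 by 2 mean used as lowpass filter is an assumption chosen for brevity (the text leaves the filter open), and `downsample2` and `build_pyramid` are hypothetical names.

```python
def downsample2(img):
    """Lowpass filter with a 2x2 mean and subsample by 2 in both directions."""
    rows, cols = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * r][2 * c] + img[2 * r][2 * c + 1] +
              img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(cols)] for r in range(rows)]


def build_pyramid(img, levels):
    """Return [level 0, level 1, ..., level `levels`], one layer at a time."""
    pyramid = [img]
    for _ in range(levels):
        pyramid.append(downsample2(pyramid[-1]))
    return pyramid


def pixel_counts(pyramid):
    """Number of pixels at each level of the pyramid."""
    return [len(level) * len(level[0]) for level in pyramid]
```

For an 8 by 8 image and three levels the counts are 64, 16, 4 and 1, whose sum stays below the bound of four thirds of the original pixel count.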

Subsampling structures and computational complexity

Often, the subsampling factor in the resolution pyramid is 2 in both horizontal and vertical directions. Assuming a $3 \times 3$ pixel search area at each level, displacements up to $N-1$ pixels ($N$ being a power of 2) can be detected with $12(1 - 4^{-\mathrm{lb}(N)})$ matching operations, see Table 3.2. It is interesting to note that the number of match operations per motion vector in the resolution of the original image never exceeds 12, independent of the value of the maximum detectable motion vector.

A further step to improve speed of computation would be to set the subsampling factor when generating a lowpass pyramid to 3; all motion vectors in the search area would still be possible to find using the algorithm with a search area of $3 \times 3$ pixels, see Fig. 3.5. The number of matching operations needed would then drop to $\frac{81}{8}(1 - 9^{-\mathrm{lb}(N)})$.

Does there exist any possibility of further reducing the computational complexity? We note that if the width of the search window (assumed to be square) equals the subsampling factor in one dimension, hereafter denoted $SF$, the motion vector search will be complete but non-overlapping. The number of search operations per motion vector in $I^0(x)$ will then become

$$\frac{(SF)^2}{1 - \frac{1}{(SF)^2}}, \quad SF = 2, 3, \ldots,$$

which is minimized when $SF = 2$.

Doing the search one dimension at a time with one-dimensional subsampling between the searches would further reduce the number of searches per motion vector to $\frac{2}{1 - \frac{1}{2}} = 4$, $SF = 2$; this minimizes the number of search operations per motion vector for hierarchical subsampling.
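The counts in this subsection all follow from a geometric series. The short derivation below is a sketch under the assumption of an infinitely tall pyramid (hence upper bounds); the level index $k$ and the symbol $W$ for the number of matches per vector and level are introduced here for illustration only.

```latex
% Level k holds N^2 / SF^{2k} vectors, each requiring W matches.
\begin{gather*}
\frac{1}{N^2} \sum_{k=0}^{\infty} \frac{N^2}{(SF)^{2k}}\, W
  = \frac{W}{1 - \frac{1}{(SF)^2}} \\
W = 9,\ SF = 2:\ \frac{9}{1 - \frac{1}{4}} = 12, \qquad
W = 9,\ SF = 3:\ \frac{9}{1 - \frac{1}{9}} = \frac{81}{8} \approx 10.1, \qquad
W = 4,\ SF = 2:\ \frac{4}{1 - \frac{1}{4}} = \frac{16}{3} \approx 5.3 \\
\text{one dimension at a time, } W = 2:\quad
\sum_{k=0}^{\infty} \frac{2}{2^{k}} = \frac{2}{1 - \frac{1}{2}} = 4
\end{gather*}
```

The first two cases reproduce the figures quoted for overlapping $3 \times 3$ search areas, the third is the non-overlapping case $W = (SF)^2$ with $SF = 2$, and the last is the one-dimensional scheme.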

Figure 3.5 Hierarchical matching using a subsampling factor of 3. All possible outcomes of the algorithm are indicated to show that it is possible to find all motion vectors.

To be able to find motion vectors in all directions using only two matches per level, one of the original images would have to be resampled so that the image content is shifted 0.5 pixels horizontally and vertically. The same is true also for higher layers, but here the shift can be put into the filter coefficients so that the complexity is not increased. The output AMF will have vectors where the components take values from the set $\{\ldots, -1.5, -0.5, 0.5, 1.5, \ldots\}$, which is a disadvantage compared to the traditional block matching methods where the motion vector components take integer values. The drawback is due to the fact that $(0, 0)^T$ is a common motion vector. If the AMF estimation process has a final interpolation step yielding a continuous representation of the vectors, the drawback can be disregarded.

An example of hierarchical search one dimension at a time is shown in Fig. 3.6. The number of search operations per motion vector using this scheme, but excluding the final step, is less than $2\sum_{i=0}^{\infty} \frac{1}{2^i} = 4$. The final step makes the resolution of the motion vector field equal in both the horizontal and vertical direction, but adds 50% to the complexity, so a tradeoff that could be made is not to implement the final step if the resulting motion vector field is going to be fine-tuned by another algorithm. How the motion vector field converges with iterations is shown in Fig. 3.7. It can be seen that the initial steps here do not constitute valid approximations for motion of the lower frequency bands, since the quantization of the motion at each level is so coarse.

Another approach that was introduced in [87] is to use subsampling regions which are not square. This can reduce the computational complexity while keeping the $(0, 0)^T$ vector in the search set. This approach was described in Section 2.2.2.

Properties of the resulting motion vector field

Hierarchical matching also generates the apparent motion vector fields of lower resolution, $AMF^i(x)$, $i = 0, \ldots$, but since the finer resolution AMFs depend on the coarser ones, an implicit assumption has been made that all frequency components share the same motion.

It is a disadvantage if the motion estimation algorithm smooths the motion vector field over borders between differently moving objects. Hierarchical matching generates smoother vector fields than full search, for two reasons. The first reason is that a group of pixels (the size of which depends on the subsampling factor) share the same starting point for their search, so if the search area at each level is small they cannot differ very much. The second reason is that in the process of subsampling, the filtering that is required is often done with filter kernels having a larger support than the subsampling factor. This means that a pixel in a layer above the lowest in the pyramid will be generated from areas in the image that may have different motion. This is a reason why the same decomposition that is used when subband/wavelet-coding the image content may not be suitable for the motion field acquisition. Desirable properties of filters to be used when generating an apparent motion field (to be used for e.g. pre-filtering) with hierarchical motion estimation are spatial compactness and symmetry. Filters to be used for coding are often designed to give good compaction in the frequency domain, which makes them large in the spatial domain. The symmetry property is incompatible with orthogonal wavelet filters [89]. The motion field that minimizes prediction energy, on the other hand, is probably best generated using the same filters for subband decomposition and motion estimation.


[Figure: sampling pattern and candidate vectors for each search step, followed by the final step.]

Figure 3.6 Sampling patterns and candidate vectors for hierarchical search one dimension at a time.


Figure 3.7 Vectors that are output from hierarchical estimation one dimension at a time with a down-sampling factor of 2. Left: Trying to estimate zero motion. Right: Estimating the largest motion vector.

3.3.3 Reduced Search Strategies

The reduced search methods, Three-Step Search, 2D-Logarithmic Search, and Conjugate Direction Search, are alternatives to exhaustive search; the purpose of the algorithms is to reduce computational load while achieving almost the same results as full search. To do this they rely on the MC being essentially unimodal. All searches are made using the original image resolution (with the possible exception of final steps for subpixel accuracy). Three-Step Search, 2D-Logarithmic Search and Conjugate Direction Search will be described here, although more methods have been proposed, see e.g. [20] and [144].

Three-Step Search was published by Koga et al. [67]. The search order of Three-Step Search can be seen in Fig. 3.8 and is the same as the search order for hierarchical search. If the match window is made larger for the first steps of the algorithm, logarithmic search is similar to hierarchical search. But since large match windows give longer computation time, constant size match windows are normally used, making the method resemble flat search.

Generalizations of Three-Step Search to give larger search windows can be made. If the step sizes are chosen² as $\frac{N}{2}, \frac{N}{4}, \ldots, 2, 1$, the number of match operations to detect displacements up to $N-1$ pixels adds up to $1 + 8\,\mathrm{lb}(N)$, see Table 3.2.
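A minimal Python sketch of the generalized Three-Step Search may make the operation count concrete. It assumes that the match criterion is supplied as a function `cost(dy, dx)` for which lower values are better, and that `n` is a power of two; both the function name and this interface are illustrative choices, not taken from the original reference.

```python
def three_step_search(cost, n):
    """Generalized three-step search with step sizes n/2, n/4, ..., 2, 1.

    Returns the best displacement found and the number of match operations.
    """
    best = (0, 0)
    best_cost = cost(0, 0)
    evaluations = 1
    step = n // 2
    while step >= 1:
        centre_y, centre_x = best  # the centre stays fixed during one step
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                if dy == 0 and dx == 0:
                    continue  # the centre was evaluated in the previous step
                evaluations += 1
                candidate = (centre_y + dy, centre_x + dx)
                value = cost(*candidate)
                if value < best_cost:
                    best_cost, best = value, candidate
        step //= 2
    return best, evaluations
```

Each of the lb(N) steps adds eight new positions to the initial zero-displacement match, which gives the $1 + 8\,\mathrm{lb}(N)$ operations stated above. The sketch finds the true optimum only when the match criterion is essentially unimodal, as noted at the start of this section.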

2D-Logarithmic Search, published by Jain and Jain [63], resembles the generalized Three-Step Search described above. The difference is that the corner points are not examined, except in the last step where all nine positions are evaluated, see Fig. 3.9. The advantage of this is that horizontal and vertical motion can be assumed to be more common than oblique motion and also easier for the human eye to track; thus it is more important to find large horizontal and vertical motion. The number of match operations amounts to $5 + 4\,\mathrm{lb}(N)$ if a horizontal or vertical displacement of $N-1$ pixels should be detected, which is compared with other methods in Table 3.2. It should however be noted that the search area for 2D-Logarithmic Search is smaller than for Full Search, Hierarchical Search and Three-Step Search, although all methods detect the same horizontal and vertical displacements. While the search areas for the previously described methods are square and oriented so that the largest motion vectors that can be found are oblique ones³, the search area for 2D-Logarithmic Search becomes approximately diamond-shaped, with the largest motion vectors possible to find being horizontal and vertical, see Fig. 3.10. The comparison in Table 3.2 is nevertheless made for horizontal and vertical displacements of $N-1$.

² Actually, in the original reference, the step sizes were 3, 2, 1.

Srinivasan and Rao [116] presented Conjugate Direction Search together with the simplified technique One at a Time Search. The latter has got its name because it searches the horizontal and vertical directions one at a time. If searching the horizontal direction first, the best match is found among the three motion vectors zero, one pixel left and one pixel right. If one of the edges is found optimal, the search continues in that direction until a local optimum is found. From this optimum, the vertical direction is explored in the same way. When the best match is found, an iteration of the algorithm is completed. Srinivasan and Rao found one iteration to be satisfactory, but this is of course a system design parameter that could change with increasing computational power and the amount of motion present in the sequence.
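One iteration of One at a Time Search, as described above, can be sketched like this. The interface is an assumption made for illustration: `cost(dx, dy)` is a match criterion where lower is better, and `limit` is a hypothetical bound on the vector components that keeps the search inside the search window.

```python
def one_at_a_time_search(cost, limit):
    """One iteration: search horizontally first, then vertically."""

    def move(pos, axis, sign):
        moved = list(pos)
        moved[axis] += sign
        return tuple(moved)

    def line_search(pos, axis):
        best, best_cost = pos, cost(*pos)
        # Compare the two neighbours along this axis and pick the better one.
        next_cost, sign = min((cost(*move(pos, axis, s)), s) for s in (-1, 1))
        # Keep walking in that direction while the match keeps improving.
        while next_cost < best_cost and abs(best[axis] + sign) <= limit:
            best, best_cost = move(best, axis, sign), next_cost
            next_cost = cost(*move(best, axis, sign))
        return best

    after_horizontal = line_search((0, 0), axis=0)
    return line_search(after_horizontal, axis=1)
```

The number of calls to `cost` grows with the length of the vector that is found, which is why the number of matches for this method depends on the image material.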

The performance of one iteration can be somewhat improved by adding a search in the direction given by the vector from the One at a Time Search. E.g. if the vector found was (3,3), the vectors (2,2) and (4,4) are also examined and the search may continue in the diagonal direction. This is the Conjugate Direction Search.

The number of matches required for these two methods depends on the image material; small motion vectors are found with less effort than big ones.

³ The shape of the search area could easily be changed for full search.


[Figure: match positions labelled with the search step numbers 2, 1 and 0.]

Figure 3.8 Three-Step Search. Numbering of the search steps has been made in analogy with hierarchical search pyramid levels. The search thus ends at level 0. To find one motion vector the above matches are tried. Boldface indicates the best match found at each level.

[Figure: match locations labelled with the search step numbers 2, 1 and 0.]

Figure 3.9 2D-Logarithmic Search. To find one motion vector the above match locations are tried. Boldface indicates the best match found at each level.

Non-Overlapping Log-Search

Logarithmic search and the closely related 3-step search, which will both hereafter be referred to as logarithmic search, are similar to hierarchical search in the sense that first a coarse estimate is produced, which is then stepwise refined. Unlike hierarchical methods, logarithmic search methods operate in image resolution. The search vectors in logarithmic search correspond to the downsampling pattern in hierarchical search. Like hierarchical search, logarithmic search can be made more efficient if the search areas are non-overlapping. Fig. 3.11 (a) shows how the search pattern looks if a 3 by 3 search window is used in each step, and the size of the refinement vector is reduced by a factor of 2 between the iterations. From Fig. 3.11 (a) it is clear that many of the arrows meet, i.e. the search windows overlap. This can at first glance seem to be an advantage, since it allows some error recovery. However, the overlapping search windows imply a strange a priori probability distribution assumption for the motion vectors, leading to sub-optimal decisions. Overlapping search windows also make the computational load unnecessarily high. Error recovery is handled well by using multiple search paths [21, 81].

Figure 3.10 Search area when using 2D-Logarithmic Search. Three levels were used in this example.

Multiple search paths [21, 81] means keeping not only the best candidate motion vector at each iteration, but a pre-defined number of motion vectors. This improves performance at the expense of increased computational complexity.

We propose to use non-overlapping search fields for logarithmic motion estimation. This can be achieved by using the search pattern shown in Fig. 3.11 (b). The non-overlapping search fields have the advantages of being able to find longer motion vectors in fewer iterations, see Tables 3.1 and 3.2, and of allowing easy and effective handling of multiple search paths.
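The difference between the two search patterns can be checked by enumerating every search path. The sketch below is one-dimensional, which suffices because the 3 by 3 window is separable; the function name `reachable` and the representation of a pattern as a list of step sizes are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def reachable(steps):
    """Count the search paths ending at each 1-D displacement.

    In every iteration the refinement is -step, 0 or +step.
    """
    counts = Counter()
    for choice in product((-1, 0, 1), repeat=len(steps)):
        counts[sum(c * s for c, s in zip(choice, steps))] += 1
    return counts


traditional = reachable([8, 4, 2, 1])   # step halved in each iteration
proposed = reachable([27, 9, 3, 1])     # step divided by 3 in each iteration
```

With four iterations the halving pattern spends its 81 paths on the 31 displacements from -15 to 15, so many paths coincide, while the pattern with steps divided by three reaches each displacement from -40 to 40 by exactly one path.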

Experimental Results We used the complex-motion scene “flower garden” to compare the performance of motion-compensated prediction using full search, traditional logarithmic search, and the proposed logarithmic search. Four iterations were made in both the logarithmic estimators. For the traditional search scheme, this gives a natural limitation in the length of motion vectors to 15. For the proposed scheme, we limited the search area to the same size, in order to be able to make fair comparisons. For the logarithmic search methods, the number of match value calculations was $36s$, where $s$ stands for the number of search paths. This number could be lowered by using the fact that some of the matches have already been calculated. That procedure, however, requires some overhead, and it is doubtful if it is meaningful when the match criterion is relatively simple. For full search, the search area was also 31 by 31 pixels, which gives 961 match value calculations. The outcome in terms of PSNR for the motion compensated prediction is shown in Fig. 3.12.

Figure 3.11 (a): Two iterations of logarithmic matching with half as large refinement vectors in each step and a search window of 3 by 3. In the first iteration (top image), 9 positions are searched. Around the best match(es) found, a refined search is made (bottom image). All possible outcomes of the algorithm are shown. (b): The proposed logarithmic matching using refinement vectors with lengths $\frac{1}{3}$ of the sizes in the previous step, and a search window of 3 by 3. All possible outcomes of the algorithm are indicated to show that it is possible to find all motion vectors.

[Figure: PSNR in dB (15 to 26) versus number of search paths s (1 to 9).]

Figure 3.12 Results of motion compensated prediction using no motion estimation (dotted line), full search motion estimation (solid line), logarithmic estimation with traditional search pattern (◦), and logarithmic motion estimation with the proposed search pattern ($\star$). Test sequence was “flower garden” in 352 by 240 pixels resolution. The size of the blocks used for motion estimation and compensation was 8 by 8 pixels, and integer-pixel resolution of the motion vectors was used.

We see that motion compensation is really paying off, since using the previous picture without motion compensation gives a PSNR of only 15.7 dB, while using full search motion estimation gives a PSNR of 25.5 dB. The basic versions of logarithmic search using only one search path give a PSNR around 20 dB (20.0 for the traditional search pattern and 19.9 for the proposed). Using more search paths improves PSNR substantially for the logarithmic motion estimation algorithms. The computational complexity increases linearly with the number of search paths. For 9 search paths, the computational complexity for logarithmic search is a third of the computational complexity for full search. The non-overlapping scheme has capacity for finding motion vectors up to a length of 40 with this number of iterations. If this capacity were used, full search would need twenty times more match value calculations than logarithmic search with 9 search paths. For larger motion vectors, the relative difference between the required computational complexity for full search and logarithmic search increases, since the number of match value calculations for full search rises with the square of the motion vector length, while the number of matches for logarithmic motion estimation rises with the logarithm of the motion vector length.

| Largest vector, $\frac{N-1}{2}$ | Non-overlapping log-search, $1 + 8\,\mathrm{lt}(N)$ |
|---|---|
| 1 | 9 |
| 4 | 17 |
| 13 | 25 |
| 40 | 33 |
| 121 | 41 |

Table 3.1 Number of match operations per estimated motion vector using integer pixel resolution for non-overlapping log-search. By lt is meant triunary logarithm, i.e. logarithm with respect to the base 3. The figures in this table should be compared to the column “Generalized 3-step search” in Table 3.2.
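The ratios quoted in the experimental results can be reproduced with a few lines of arithmetic. The helper names below are hypothetical; the only inputs are the figures given in the text (a 3 by 3 window, four iterations, nine search paths).

```python
def full_search_matches(max_length):
    """Match value calculations for a square search area of +/- max_length."""
    return (2 * max_length + 1) ** 2


def log_search_matches(iterations, paths, window=9):
    """Match value calculations for logarithmic search with several paths."""
    return window * iterations * paths


same_area_ratio = full_search_matches(15) / log_search_matches(4, 9)
full_capacity_ratio = full_search_matches(40) / log_search_matches(4, 9)
```

Here `same_area_ratio` comes out just below 3 and `full_capacity_ratio` just above 20, matching the statements that logarithmic search with nine paths costs about a third of full search for vectors up to 15, and about a twentieth for vectors up to 40.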

Remarks on Non-Overlapping Log-Search The proposed search scheme for logarithmic motion estimation shows improvements in computational complexity and offers a simpler and more straightforward alternative to the partly-overlapping schemes that have previously been presented. The number of match operations as a function of the largest vector possible to find is given in Table 3.1. The values should be compared with the values for “Generalized 3-step search” in Table 3.2.

Non-overlapping log-search is ideally suited for use with multiple search paths. The non-overlapping property also facilitates the computation of certainty measures for the motion vectors.

To lower the complexity further, the number of positions evaluated in each iteration could be lowered to 5, by using the Koch-tiling described in Section 2.2.2. Three iterations of the search are shown in Fig. 3.13. This would result in search areas of the type shown in Fig. 2.13. The Koch search pattern could be an interesting alternative when matching is done over a very large area.

For hierarchical matching on a hexagonal sampling grid, it is possible to use the tilings shown in Section 2.2.3.


Figure 3.13 Five positions, one of which represents zero motion, are searched in each iteration.

3.3.4 Comparison of search strategy efficiencies

The complexities of the algorithms are (in descending order):

Full Search $O(N^2)$;
Conjugate Direction Search $O(N^1)$;
Generalized 3-Step Search $O(\log(N))$;
2D-Logarithmic Search $O(\log(N))$;
Hierarchical Search $O(N^0)$;

see also Table 3.2.
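The closed-form operation counts for the four methods with fixed cost can be evaluated directly. The sketch below assumes $N$ is a power of two, as in the text; the function names and dictionary keys are illustrative.

```python
def lb(n):
    """Binary logarithm of a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    return n.bit_length() - 1


def match_operations(n):
    """Match operations per motion vector for displacements up to n - 1."""
    k = lb(n)
    return {
        "largest_vector": n - 1,
        "full_search": (2 * n - 1) ** 2,
        "hierarchical_search": 12 * (1 - 4.0 ** (-k)),
        "generalized_3_step_search": 1 + 8 * k,
        "2d_logarithmic_search": 5 + 4 * k,
    }
```

Evaluating `match_operations(n)` for n = 2, 4, 8, ... shows the ordering above: the full search count grows quadratically, the two logarithmic methods grow with lb(N), and the hierarchical count saturates below 12.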

3.4 Image Displacement Subband Evaluation Matching

This scheme has been developed to fit into the Frequency Domain Backward Motion Estimation (FDBME) scheme that will be treated in Section 5.1. Backward motion estimation means that motion vectors are estimated from already transmitted image data, and thus do not have to be sent over the channel. Frequency backward motion estimation indicates that the motion vectors are estimated from already transmitted frequencies in a subband/wavelet coding scheme where the frames are transmitted one subband at a time, starting with the lowpass image, which is transmitted without motion information. The data stream continues by sending the second lowest frequency information and proceeds up to full image resolution.

| Largest vector, $N-1$ | Full search, $(2N-1)^2$ | Hierarchical search, $12(1-4^{-\mathrm{lb}(N)})$ | Generalized 3-step search, $1+8\,\mathrm{lb}(N)$ | 2D-logarithmic search, $5+4\,\mathrm{lb}(N)$ |
|---|---|---|---|---|
| 1 | 9 | 9 | 9 | 9 |
| 3 | 49 | 11.25 | 17 | 13 |
| 7 | 225 | 11.81 | 25 | 17 |
| 15 | 961 | 11.95 | 33 | 21 |
| 31 | 3969 | 11.99 | 41 | 25 |
| 63 | 16129 | 12.00 | 49 | 29 |
| 127 | 65025 | 12.00 | 57 | 33 |

Table 3.2 Number of match operations per estimated motion vector using integer pixel resolution. “Conventional” subsampling and search patterns were used for the hierarchical search and the generalized 3-step search. By lb is meant binary logarithm, i.e. logarithm with respect to the base two. For Conjugate Direction Search the number of matches is dependent on image statistics and desired estimation accuracy, but is reported by Srinivasan and Rao [116] to be smaller than for 2D-Logarithmic Search, using the test sequence “Waterskiers”.

The data available for the motion estimation process is the reconstructed previous frame and a reconstructed lowpass version of the image to be coded. The desired output is a motion vector field that can motion compensate the previous reconstructed image so that a good prediction can be made of the frequency subband to be transmitted.

A motion estimation algorithm to be used in this scheme is the image displacement subband evaluation (IDSE) scheme, which will now be presented. It reverses the order of the loops in full search matching so that a certain displacement is evaluated for all spatial positions before proceeding with trying the next motion vector. This is done because of the projection to basis vectors that needs to be performed before comparison. The image data is shifted in finite steps, with the possibility of subpixel accuracy. The shifted data is projected to the basis vector set used by the wavelet coder by doing the lowpass filtering steps until the same resolution is reached as the lowpass reconstruction. Here the lowpass data is compared by doing an MC evaluation, treating the lowpass versions as images. For each position in the lowpass images the vector and the value of its MC are stored. A new shift of the image is now applied, and for each position in the lowpass images a new MC is calculated and compared to the previous one. The best match vector and MC are stored for future comparisons. When all displacement vectors have been tested, each point in the lowpass resolution has got its motion vector estimate corresponding to the best match of the displaced previous reconstructed frame to the available low resolution reconstructed subband.
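The reversed loop order can be sketched as follows. This is a simplified illustration under several assumptions that are not from the text: integer shifts with border repetition, a 2 by 2 mean standing in for the lowpass filter of the wavelet coder, and the absolute difference as match criterion (lower is better). All function names are hypothetical.

```python
def shift(img, dy, dx):
    """Move the image content by (dy, dx) pixels, repeating border pixels."""
    height, width = len(img), len(img[0])
    return [[img[min(max(r - dy, 0), height - 1)][min(max(c - dx, 0), width - 1)]
             for c in range(width)] for r in range(height)]


def lowpass2(img):
    """2x2 mean followed by subsampling by 2 (stand-in for the coder's filter)."""
    return [[(img[2 * r][2 * c] + img[2 * r][2 * c + 1] +
              img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(len(img[0]) // 2)] for r in range(len(img) // 2)]


def idse(prev, target_low, levels, displacements):
    """Return the best displacement for every position of the lowpass image."""
    rows, cols = len(target_low), len(target_low[0])
    best_mc = [[None] * cols for _ in range(rows)]
    best_vec = [[None] * cols for _ in range(rows)]
    for d in displacements:              # outer loop: one displacement at a time
        low = shift(prev, *d)
        for _ in range(levels):          # project to the lowpass resolution
            low = lowpass2(low)
        for r in range(rows):            # inner loops: all spatial positions
            for c in range(cols):
                mc = abs(low[r][c] - target_low[r][c])
                if best_mc[r][c] is None or mc < best_mc[r][c]:
                    best_mc[r][c], best_vec[r][c] = mc, d
    return best_vec
```

Because the filtering is applied once per displacement to the whole image, its cost is shared by all positions, which is the point of evaluating one displacement everywhere before moving on to the next.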

The low resolution motion vector field is then interpolated to image resolution. In this process normalized convolution [136] can be utilized with certainty estimates given from the motion estimation process.

The IDSE motion estimation procedure corresponds to full search in the low resolution image, and therefore has high computational complexity. The higher the resolution of the subband, the higher the complexity, since the number of MC evaluations is proportional to the number of pixels in the lowpass images. For lower frequency bands, more filtering steps have to be done, but since the number of pixels in each lowpass version decreases with the two-dimensional subsampling factor, thus forming a geometrical series, it can be seen that the main part of the computations done in the filtering step is performed in the first downsampling step.

3.5 Frequency Domain Matching Algorithms

The techniques studied in Section 3.3 essentially try to find correct motion vectors by evaluating local correlation through spatial convolution, thus obtaining a correlation surface or something more or less equivalent like a mean squared error surface. Another way of obtaining correlation is through multiplication in the Fourier domain.

Displacements can be analysed by performing a two-dimensional Fourier transform of the two images (or blocks from the two images), multiplying the Fourier transform coefficients, and performing an inverse Fourier transform to get back to the spatial domain. The result is called a Correlation Surface and will have peaks at the coordinates corresponding to translations between the two images.

The information content of images is often assumed to lie more in the phase than in the magnitude. If, before making the inverse transform, the magnitude is set to one everywhere, the only information that has been used is the phase in the two images, motivating the name Phase correlation.
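Phase correlation as described above can be sketched without external libraries by using a naive discrete Fourier transform, which is adequate for small blocks. The function names are hypothetical, the product is formed with the complex conjugate of the first transform so that the peak lands at the shift of the second block relative to the first, and the shift is circular, as implied by the DFT.

```python
import cmath


def dft2(block, inverse=False):
    """Naive 2-D DFT (or inverse DFT) of a block given as a list of rows."""
    rows, cols = len(block), len(block[0])
    sign = 1 if inverse else -1
    out = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0j
            for r in range(rows):
                for c in range(cols):
                    angle = 2 * cmath.pi * (u * r / rows + v * c / cols)
                    total += block[r][c] * cmath.exp(sign * 1j * angle)
            out[u][v] = total / (rows * cols) if inverse else total
    return out


def phase_correlation(block_a, block_b):
    """Return the (row, column) position of the highest correlation peak."""
    fa, fb = dft2(block_a), dft2(block_b)
    rows, cols = len(fa), len(fa[0])
    unit = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            cross = fa[u][v].conjugate() * fb[u][v]
            magnitude = abs(cross)
            # Set the magnitude to one everywhere, keeping only the phase.
            unit[u][v] = cross / magnitude if magnitude > 1e-12 else 0j
    surface = dft2(unit, inverse=True)
    peak = max((surface[r][c].real, r, c) for r in range(rows) for c in range(cols))
    return peak[1], peak[2]
```

A practical implementation would use a fast Fourier transform and extract several peaks per block, as in the BBC method discussed below; the naive transform here only serves to keep the sketch self-contained.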

The blocks have to be at least twice as big as (preferably more than twice as big as) the largest detectable displacement, since the same part of the object has to be visible in both blocks. For block matching in the spatial domain this is not the case since the block is moved around; here it is instead the search area that has to be twice as big as the largest vectors to be found.

Very good results (precision of displacement estimates 0.2 pixels and better, with up to 45 pixels displacements detectable) have been reported in [123] with what will in the following be called the BBC method. It uses phase correlation on 64 x 32 pixel blocks to extract the positions of the three highest peaks of the correlation surface. For all pixels the motion vector estimate is made by choosing among the displacements suggested by the phase correlation of the block the pixel belongs to and the adjacent blocks. The choice is made through block matching in the image plane, where the block in the old image has to be interpolated according to the non-integer displacement vectors.

A variant of the BBC method that uses a half-cosine in both horizontal and vertical directions to window the image data before performing the Fourier transform has been investigated in [142].

The complexity of the BBC method can be decomposed into three terms: $K_{\mathrm{DFT}}$ for DFTs and IDFTs, $K_{\mathrm{search}}$ for finding the best displacements, and $K_{\mathrm{match}}$ for trying the found displacements for each pixel. The block size is assumed to be $2N \times 2N$, which makes the largest detectable motion vectors about $N$ horizontally and vertically. For large blocks, the probability of having many different motion vectors within the block increases. Assume therefore that the number of vectors for which matching is tried is made proportional to the number of pixels in the block. A search is performed also when selecting which of the selected motion vectors fitted best; the cost of this search rises with $N^2$. We get the complexity per block

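$$K_b \approx \underbrace{N^2 \log(N)\, K_{\mathrm{add\&mult}}}_{K_{\mathrm{DFT}}} + \underbrace{C N^4 K_{\mathrm{comp.}}}_{K_{\mathrm{search}}} + \underbrace{C N^4 (K_{\mathrm{interp.\&MC}} + K_{\mathrm{comp.}})}_{K_{\mathrm{match}}} \qquad (3.3)$$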

and the complexity per motion vector

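$$K_v \approx \underbrace{\log(N)\, K_{\mathrm{add\&mult}}}_{K_{\mathrm{DFT}}} + \underbrace{C N^2 K_{\mathrm{comp.}}}_{K_{\mathrm{search}}} + \underbrace{C N^2 (K_{\mathrm{interp.\&MC}} + K_{\mathrm{comp.}})}_{K_{\mathrm{match}}} \qquad (3.4)$$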

where $C$ is a small constant; BBC uses $\frac{3}{64 \cdot 32} \approx 1.5 \cdot 10^{-3}$. The size of $C$ depends on the assumed number of distinct motion vectors in each block. For sequences where the motion is well described by translation, a low number of motion vectors per block is enough to describe the motion well, but for scenes containing e.g. zoom or rotation, there exists a continuum of motion vectors, and the method cannot be assumed to work well. Increasing $C$ is not the solution to better performance; the appropriate solution would be to adapt the method to find different types of motion. When $C$ is small, and since circuit technology is rapidly developing, it will be possible to find motion vectors in the region of interest, see the analysis in Section 1.3.3.

3.6 Optical Flow Algorithms

Assuming that the intensity is smooth, and also assuming that the intensity of object points does not change with time, then for small deviations $u$, $v$, $\Delta t$

$$I(x, y, t) = I(x + u, y + v, t + \Delta t) \qquad (3.5)$$

which gives rise to the optical flow constraint equation

$$I_x u + I_y v + I_t \Delta t = 0, \qquad (3.6)$$

where $I_x$ denotes the partial derivative of image intensity in the direction of the $x$ axis, and so on.

The partial derivatives $(I_x \;\; I_y)^T$ can be recognized as the image gradient. The gradient and $\frac{\partial I}{\partial t}$ are obtained by filtering the images. This information is however not enough to uniquely determine $\Delta x$, only its component in the image gradient direction, see Fig. 3.14. More information than first-order derivatives in one single point is needed to obtain a unique solution. Two different approaches have been taken: to use higher order derivatives, and to apply spatial consistency constraints on the motion field. Another division of differential based methods is between direct, i.e. non-iterative, algorithms and iterative algorithms.

[Figure: the constraint line given by the optical flow equation in the (u, v) plane, together with some motion vectors that fulfill the optical flow equation.]

Figure 3.14 Constraint line given by the optical flow equation (3.6).

As can be seen, the optical flow equation alone is not sufficient to determine the motion.
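The component that can be recovered from the constraint equation alone, the one along the image gradient, is easily computed. The sketch below assumes the derivatives are already available as numbers at one point; the function name `normal_flow` is hypothetical.

```python
def normal_flow(ix, iy, it, dt=1.0):
    """Return the motion vector on the constraint line closest to the origin.

    This is the component of the motion in the image gradient direction;
    None is returned where the gradient vanishes and nothing can be said.
    """
    gradient_energy = ix * ix + iy * iy
    if gradient_energy == 0:
        return None
    scale = -it * dt / gradient_energy
    return (scale * ix, scale * iy)
```

Any vector obtained by adding a multiple of $(-I_y \;\; I_x)^T$ to this result satisfies equation (3.6) equally well, which is the ambiguity illustrated by the constraint line.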

3.6.1 Direct Methods

First we will present the approach based on second derivatives, then a method for block-based motion estimation.

Second Derivatives Method

A more thorough description of the Second Derivatives Method can be found in [139]; here a short description will be given.

To get more equations we take the derivatives with respect to $x$ and $y$ of the optical flow equation. This yields:

$$I_{xx} u + I_x u_x + I_{yx} v + I_y v_x + I_{tx} \Delta t = 0 \qquad (3.7)$$

and

$$I_{xy} u + I_x u_y + I_{yy} v + I_y v_y + I_{ty} \Delta t = 0, \qquad (3.8)$$

respectively.

Assuming⁴ that $\Delta x$ varies slowly with $x$, let $u_x = u_y = v_x = v_y = 0$, yielding:

$$I_{xx} u + I_{xy} v + I_{xt} \Delta t = 0 \qquad (3.9)$$

and

$$I_{xy} u + I_{yy} v + I_{ty} \Delta t = 0. \qquad (3.10)$$

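Normalizing so that $\Delta t = 1$ gives:

$$\begin{pmatrix} I_{xx} & I_{yx} \\ I_{xy} & I_{yy} \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} -I_{tx} \\ -I_{ty} \end{pmatrix}. \qquad (3.11)$$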
If the above matrix is non-singular, u and v can be computed as

$$\begin{pmatrix} u \\ v \end{pmatrix} = \frac{1}{I_{xx} I_{yy} - (I_{xy})^2} \begin{pmatrix} I_{xy} I_{yt} - I_{yy} I_{xt} \\ I_{xy} I_{xt} - I_{xx} I_{yt} \end{pmatrix} \tag{3.12}$$

because $I_{yx} = I_{xy}$. Since taking second derivatives is a noise-sensitive operation, initial smoothing has to be performed.
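As a check on the closed-form solution, the following minimal sketch evaluates the second-derivatives estimate at a single point. The function name and the synthetic derivative values are assumptions made for this illustration; in practice the derivatives would come from smoothed image data.

```python
def second_derivative_flow(ixx, ixy, iyy, ixt, iyt, eps=1e-12):
    """Solve the 2x2 system of the second derivatives method with dt = 1.

    Returns (u, v), or None when the matrix is (numerically) singular.
    """
    det = ixx * iyy - ixy * ixy
    if abs(det) < eps:
        return None
    u = (ixy * iyt - iyy * ixt) / det
    v = (ixy * ixt - ixx * iyt) / det
    return u, v


# Synthetic example: a quadratic pattern translating with (u, v) = (1.5, -0.5).
# For pure translation, Ixt = -(u*Ixx + v*Ixy) and Iyt = -(u*Ixy + v*Iyy).
ixx, ixy, iyy = 2.0, 1.0, 4.0
true_u, true_v = 1.5, -0.5
ixt = -(true_u * ixx + true_v * ixy)
iyt = -(true_u * ixy + true_v * iyy)
print(second_derivative_flow(ixx, ixy, iyy, ixt, iyt))
```

Returning None for a near-singular matrix is a design choice of this sketch; a certainty measure, as discussed in Chapter 4, is another way to flag such points.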

An investigation of the direct differential method in a hierarchical framework is given in [112].

Generalized Differential Methods

When using the second derivatives constraint equations 3.9 and 3.10 we get a unique solution, since we have two parameters and two equations. If we also use the original optical flow equation 3.6 we get an overdetermined system that can be given a Least Mean Square (LMS) solution. This was the approach of Haralick and Lee [54] and Tretiak and Pastor [125]. Gokstorp [47] suggested that a set of feature image sequences can be obtained from the original image sequence using different operators. These feature images can then be differentiated a number of times, giving rise to a larger number of equations that are used to obtain an LMS solution. In [47] a large experimental comparison is made to test the accuracy of different variants of the general scheme.

⁴ We see that the assumption of spatial consistency is also made in the second derivatives method.


Block Gradient Method

This approach was presented by Limb and Murphy [79], and further developed by e.g. Ohta [100, 101] as a method to reduce computational complexity compared to block matching.

Assuming uniform motion $\Delta\mathbf{x} = (u \;\; v)^T$ in a block S we have

$$I_t(x, y)\,\Delta t = -I_x(x, y)\,u - I_y(x, y)\,v + E_r(x, y), \quad (x, y) \in S \tag{3.13}$$

where $E_r(x, y)$ are changes unexplained by the motion model. Now let us denote a function $f(x, y)$, $(x, y) \in S$, by $\mathbf{f}$, which can be regarded as the column vector created by taking the function values in all the columns of the block and writing them under each other. Equation 3.13 can then be rewritten as

$$\Delta t\,\mathbf{I}_t = -u\,\mathbf{I}_x - v\,\mathbf{I}_y + \mathbf{E}_r \tag{3.14}$$

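The norm of $\mathbf{E}_r$ is minimized when $\mathbf{E}_r$ is orthogonal to both $\mathbf{I}_x$ and $\mathbf{I}_y$. Therefore the motion vector $\Delta\mathbf{x} = (u \;\; v)^T$ should be chosen such that

$$(\mathbf{E}_r, \mathbf{I}_x) = 0 \tag{3.15}$$

and

$$(\mathbf{E}_r, \mathbf{I}_y) = 0 \tag{3.16}$$

where the notation $(\mathbf{f}_1, \mathbf{f}_2)$ stands for the inner product

$$(\mathbf{f}_1, \mathbf{f}_2) = \mathbf{f}_1' \mathbf{f}_2 = \sum_{S} f_1(x, y) f_2(x, y). \tag{3.17}$$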

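This leads to

$$\begin{pmatrix} u \\ v \end{pmatrix} = \frac{1}{ab - h^2} \begin{pmatrix} b & -h \\ -h & a \end{pmatrix} \begin{pmatrix} -f \\ -g \end{pmatrix} \tag{3.18}$$

where

$$a = (\mathbf{I}_x, \mathbf{I}_x), \quad b = (\mathbf{I}_y, \mathbf{I}_y), \quad f = (\mathbf{I}_t, \mathbf{I}_x)\,\Delta t, \quad g = (\mathbf{I}_t, \mathbf{I}_y)\,\Delta t, \quad h = (\mathbf{I}_x, \mathbf{I}_y).$$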
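The block gradient estimate only needs five inner products over the block. The sketch below is an illustration under assumed inputs: the function name is hypothetical, and the derivative samples are synthetic lists rather than filtered image data.

```python
def block_gradient_flow(ix, iy, it, dt=1.0, eps=1e-12):
    """Least-squares (u, v) for one block from per-pixel derivative samples.

    ix, iy, it are equally long sequences holding Ix, Iy and It for the
    pixels of the block S. Returns None if the 2x2 system is singular.
    """
    a = sum(p * p for p in ix)                      # (Ix, Ix)
    b = sum(q * q for q in iy)                      # (Iy, Iy)
    h = sum(p * q for p, q in zip(ix, iy))          # (Ix, Iy)
    f = dt * sum(r * p for r, p in zip(it, ix))     # (It, Ix) * dt
    g = dt * sum(r * q for r, q in zip(it, iy))     # (It, Iy) * dt
    det = a * b - h * h
    if abs(det) < eps:
        return None
    u = (-b * f + h * g) / det
    v = (h * f - a * g) / det
    return u, v


# Synthetic block with exact uniform motion, so the residual Er is zero.
ix = [1.0, 0.0, 2.0, 1.0]
iy = [0.0, 1.0, 1.0, -1.0]
true_u, true_v = 0.5, -1.25
it = [-(true_u * p + true_v * q) for p, q in zip(ix, iy)]
print(block_gradient_flow(ix, iy, it))
```

A singular system corresponds to a block whose gradients all point in one direction, which is the aperture problem in block form.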


3.6.2 Iterative Methods

Instead of crudely setting the derivatives of $\Delta\mathbf{x}$ to zero, as in the direct approaches, a measure of the departure from smoothness can be computed as:

$$e_s = \iint \left( u_x^2 + u_y^2 + v_x^2 + v_y^2 \right) dx\,dy. \tag{3.19}$$

Simultaneous minimization of $e_s$ and a constant $\lambda$ times the error in the optical flow equation, $e_c$,

$$e_c = \iint \left( I_x u + I_y v + I_t \Delta t \right)^2 dx\,dy, \tag{3.20}$$

leads to the Horn & Schunck method, a detailed description of which can be found in [57].

3.7 Bayesian Techniques

A recent development in motion estimation is the Bayesian approach; for tutorials see [106, 121]. To design a Bayesian estimation algorithm the following steps have to be performed:

• Definition of all motion fields.

• Assigning an a priori probability to each motion field.

• Definition of the best solution.

• Deciding on a search strategy.

Bayesian techniques offer the possibility of adaptively introducing local discontinuities of the motion field. However, the computational complexity is generally quite high.

3.8 Analysis of the Algorithms

3.8.1 Introduction

The situation of estimating the displacement of an object that has translated is very similar to the communication engineering problem of Pulse Position Modulation (PPM) decoding. PPM is extensively treated in [138].


In PPM decoding, the maximum likelihood estimator in the case of additive Gaussian noise and no anomaly is a correlator, and the maximum likelihood position is the position for which the correlator’s output has its maximum, see Fig. 3.15.

The signal $S(x, t_0)$ shown in the top picture of Fig. 3.15 was windowed out from a picture line in the “Trevor White” sequence. The periodic structure is due to the stripe pattern of Trevor’s shirt. In the picture below, the same pattern has been shifted 5 pixels to the right. The correlation between the two signals is shown in the middle picture. It has its maximum value for x = 5. It also has other peaks, due to the periodic structure of the input signal. When weak noise, here denoted $n_w$, disturbs the signal, the maximum value of the correlation still occurs near the true displacement. In the last picture, however, an anomaly has occurred. Strong noise, $n_s$, has caused the correlation to become higher for another peak than the one corresponding to the true displacement.

An anomaly is said to occur when the correlation function is multimodal and the noise is strong enough to make the estimated displacement lie closer to another peak of the correlation function than the one coming from the true displacement. An anomaly in the PPM decoding case can be said to be equivalent to a “false match” in the motion estimation case.
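The correlator view of displacement estimation can be sketched in a few lines. This is an illustration only: the signal is a pseudo-random sequence rather than the “Trevor White” line, circular indexing replaces the window, and all names are assumptions of this sketch.

```python
import random


def circular_correlation(s0, s1, shift):
    """Correlation between s0 and s1 read at a candidate shift (circular)."""
    n = len(s0)
    return sum(s0[x] * s1[(x + shift) % n] for x in range(n))


def estimate_shift(s0, s1, max_shift):
    """Return the candidate shift with the highest correlation value."""
    candidates = range(-max_shift, max_shift + 1)
    return max(candidates, key=lambda d: circular_correlation(s0, s1, d))


rng = random.Random(0)
n, true_shift = 64, 5
s0 = [rng.uniform(0.0, 30.0) for _ in range(n)]
s1 = [s0[(x - true_shift) % n] for x in range(n)]   # pattern moved 5 samples right
noisy = [v + rng.gauss(0.0, 1.0) for v in s1]       # weak additive noise

print(estimate_shift(s0, s1, 20))      # noise-free estimate
print(estimate_shift(s0, noisy, 20))   # estimate under weak noise
```

In the noise-free case the maximum is guaranteed to sit at the true shift because an autocorrelation peaks at zero lag. With a periodic signal and stronger noise, a neighbouring peak can win, which is the anomaly described above.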

When searching for the displacement between two images, the main differences from the communication engineering case are:

• The waveforms of the pulses cannot be freely chosen, but are given by the scene. The situation can be complicated by the match window containing more than one object, by uncovering of background, by rotation, geometric distortions, deformations and illumination changes.

• The waveforms are not exactly known, since there is noise in both of the two images that are involved in the motion estimation.

• In motion estimation we search for a two-dimensional displacement instead of a one-dimensional position.

3.8.2 Problematic scenes

The scene content might be such that the error probability is large. One example of this is the aperture problem, which was described in Fig. 3.1. This can be detected by a certainty measure, which will be described in Chapter 4. Also when the motion model and the data do not fit, this will be captured by the certainty measure.

Figure 3.15 Displacement estimation by finding the maximum of the correlation. In the bottom row is shown a case where an anomaly has occurred due to strong noise, $n_s$. [Panels, top to bottom, each plotted against x from −20 to 20: $S(x, t_0)$; $S(x, t_1)$; $c(S(x, t_0), S(x, t_1))$; $c(S(x, t_0), S(x, t_1) + n_w(x, t_1))$; $c(S(x, t_0), S(x, t_1) + n_s(x, t_1))$.]

3.8.3 Noise effects

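If noise corrupts both image signals by addition of a stationary, zero-mean noise process $N(\mathbf{x}, t)$, so that the observed signal is $J(\mathbf{x}, t) = I(\mathbf{x}, t) + N(\mathbf{x}, t)$, then the total correlation can be written

$$
\begin{aligned}
c(\mathbf{x}, \mathbf{u}', t) &= \iint w(\mathbf{x} - \mathbf{x}_1)\, J(\mathbf{x}_1, t)\, J(\mathbf{x}_1 - \mathbf{u}', t + \Delta t)\, d\mathbf{x}_1 \\
&= \underbrace{\iint w(\mathbf{x} - \mathbf{x}_1)\, I(\mathbf{x}_1, t)\, I(\mathbf{x}_1 - \mathbf{u}', t + \Delta t)\, d\mathbf{x}_1}_{c_1(\mathbf{x})} \\
&\quad + \underbrace{\iint w(\mathbf{x} - \mathbf{x}_1)\, I(\mathbf{x}_1, t)\, N(\mathbf{x}_1 - \mathbf{u}', t + \Delta t)\, d\mathbf{x}_1}_{c_2(\mathbf{x})} \\
&\quad + \underbrace{\iint w(\mathbf{x} - \mathbf{x}_1)\, I(\mathbf{x}_1 - \mathbf{u}', t + \Delta t)\, N(\mathbf{x}_1, t)\, d\mathbf{x}_1}_{c_3(\mathbf{x})} \\
&\quad + \underbrace{\iint w(\mathbf{x} - \mathbf{x}_1)\, N(\mathbf{x}_1, t)\, N(\mathbf{x}_1 - \mathbf{u}', t + \Delta t)\, d\mathbf{x}_1}_{c_4(\mathbf{x})},
\end{aligned}
\tag{3.21}
$$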

where $w(\mathbf{x})$ is a window function. The window function is necessary in all practical systems, and also desirable in the motion estimation case since different parts of the image can move in different directions.

The first term in 3.21, $c_1$, is the desired correlation. The expectation value of each of the noise-dependent terms $c_2$, $c_3$ and $c_4$ is zero, since the noise is zero-mean (for $c_4$ this also assumes that the noise in the two frames is uncorrelated). It can be noticed that the variances of $c_2$ and $c_3$ are both signal- and noise-dependent, while the variance of $c_4$ depends only on the noise.


To simplify the analysis, we consider the one-dimensional case and assume that pure translation has taken place, that is

$$S(x, t + \Delta t) = S(x - u, t) \tag{3.22}$$

Suppressing the signal’s dependence on t, the correlation can now be written

$$
\begin{aligned}
c(u') &= \underbrace{\int w(x - x_1)\, S(x_1)\, S(x_1 - u + u')\, dx_1}_{c_1}
+ \underbrace{\int w(x - x_1)\, S(x_1)\, N(x_1 + u', t + \Delta t)\, dx_1}_{c_2} \\
&\quad + \underbrace{\int w(x - x_1)\, S(x_1 - u + u')\, N(x_1, t)\, dx_1}_{c_3}
+ \underbrace{\int w(x - x_1)\, N(x_1, t)\, N(x_1 + u', t + \Delta t)\, dx_1}_{c_4}
\end{aligned}
\tag{3.23}
$$

For matched filters, $SNR_{match}$ is defined as the square of the response at the accurate position caused by the signal, divided by the mean square error caused by the noise, see [31].

In this case we get

$$SNR_{match} = \frac{(c_1(u))^2}{E\left[(c_2(x) + c_3(x) + c_4(x))^2\right]} \tag{3.24}$$

A natural question is: how high does $SNR_{match}$ have to be to give reliable displacement estimates? This depends on how peaked $c_1(u')$ is. Uniform areas have a broad maximum for $c_1(u')$, and areas with a periodic pattern have several maxima for $c_1(u')$, giving big problems even for high $SNR_{match}$. These problems can be avoided by looking at the image at a coarser scale.

The easiest case to cope with is when the image content consists of spatially white noise. Let us examine the case where we choose between two different displacements, $u$ (the correct one) and $x_e$ (another predetermined displacement; which one is not important, since the image has only one correlation peak and the rest of the positions are equivalent). Then the probability $P_c$ of correctly choosing $u$ becomes

$$
\begin{aligned}
P_c &= P(c(u) > c(x_e)) \\
&= P(c_1(u) + c_2(u) + c_3(u) + c_4(u) > c_1(x_e) + c_2(x_e) + c_3(x_e) + c_4(x_e)) \\
&= P(c_1(u) - c_1(x_e) > c_2(x_e) - c_2(u) + c_3(x_e) - c_3(u) + c_4(x_e) - c_4(u)).
\end{aligned}
\tag{3.25}
$$

If we assume that the $c_i$, $i = 2, 3, 4$, terms are Gaussian (reasonable since they come from a summation over the match window), and if we also assume that $c_1(u) \gg c_1(x_e)$, we can write

$$P_c \approx P\big(c_1(u) > \underbrace{c_2(x_e) - c_2(u) + c_3(x_e) - c_3(u) + c_4(x_e) - c_4(u)}_{c_N}\big). \tag{3.26}$$

Assuming that $c_i(u)$ and $c_i(x_e)$ are independent, $c_N$ has a Gaussian distribution with zero mean and variance $2E\left[(c_2(x) + c_3(x) + c_4(x))^2\right]$, which gives

$$P_c \approx 1 - Q\left(\sqrt{\frac{SNR_{match}}{2}}\right) \tag{3.27}$$

where

$$Q(x) = \int_{x}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{y^2}{2}}\, dy. \tag{3.28}$$
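To get a feeling for the numbers, the two-candidate probability can be evaluated directly. This sketch takes the argument of Q to be the square root of SNRmatch/2, which follows from the stated variance of the noise term; the function names and the dB conversion helper are assumptions of the sketch.

```python
import math


def q_function(x):
    """Tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def prob_correct_two_candidates(snr_match):
    """P_c for a choice between two candidates; snr_match is a linear ratio."""
    return 1.0 - q_function(math.sqrt(snr_match / 2.0))


def db_to_linear(db):
    """Convert a power ratio from decibels to a linear ratio."""
    return 10.0 ** (db / 10.0)


for db in (-20, -10, 0, 10, 20):
    print(db, round(prob_correct_two_candidates(db_to_linear(db)), 4))
```

At very low SNRmatch the value approaches 0.5, i.e. a coin toss between the two candidates, and at 10 dB it is a little under 0.99 for this idealized two-candidate case.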

From Fig. 3.16, and bearing in mind that there is not only one but many erroneous candidate vectors, and also that the image content is not spatially white, we see that $SNR_{match}$ has to be on the order of at least 10 dB.

What can then be done in the case of very noisy image sources? The size of the match window can be increased, but if taken too far this will lead to blurring of the image motion field. A step to alleviate this problem is to use the median instead of the mean when calculating the match criterion; the match criterion would then not be affected if a small part of the region around the point for which the motion vector is being estimated were moving in a divergent direction. Another solution is to use more than two frames for displacement estimation. Three frames are used in [66] to obtain motion vectors from a synthetic image sequence with noise added to make $SNR_{in}$ lie between 0 dB and 10 dB. Another interesting approach would be to use Kalman filtering along the motion trajectories.

Figure 3.16 Probability of choosing the right motion vector (out of two candidate vectors) as a function of $SNR_{match}$ when the image content is spatially white. [Plot: horizontal axis $SNR_{match}$, logarithmic from $10^{-2}$ to $10^{2}$; vertical axis $P_c$, from 0.5 to 1.]

3.8.4 Two dimensions

The extension of PPM decoding to two dimensions is straightforward in the case of pure translation. The only difference is that the one-dimensional search of PPM is replaced by a two-dimensional one. In practice, this means a much higher computational complexity, since the dimensionality of the search space rises.

Rotations can also be found, although this is not often implemented in image coding applications, since the time between two frames is so short that normally only a small amount of motion has occurred, and this can then locally be treated as translation. The dimensionality of the search space would rise from two to three when going from translation to translation and rotation. The wrap-around effect, however, makes it sufficient to explore rotations in the range (−π, π).


3.8.5 Maximum Likelihood versus Maximum A Posteriori Estimation

Maximum Likelihood (ML) estimation gives minimum error probability only when the a priori probability distribution is uniform.

Using Maximum A Posteriori (MAP) estimation requires that the probability distribution of the motion vectors is known.

In many practical motion estimation systems, small motion vectors, and especially zero motion, are given preference over larger ones. This is probably done both out of consideration for the a priori probability distributions of motion vectors, and because the distortion caused by erroneously assuming zero motion resembles that caused by using a CRT with slow phosphor. This is an artefact that viewers are used to and do not react against as much as they would to an error of the other type (motion where there should not be any motion). For very high quality moving images, however, using display devices with long time constants is not acceptable.

It is hard to determine the a priori probabilities of the motion vectors. Distributions centered around the vectors given by the assumption of either zero motion, constant motion, or constant acceleration seem reasonable. In this work, however, only ML estimation is studied.

3.8.6 Choice of Match Criterion

A menu of match criteria to choose from was given in Section 3.3. As has been mentioned earlier, the MSE match criterion minimizes the prediction error energy, but when a faithful apparent motion field is the goal the situation changes. In [93] Maragos reports significantly better performance for morphological correlation than for linear correlation.

Morphological correlation, defined as

$$C_m(\mathbf{x}, \mathbf{a}) = \sum w(\mathbf{x}) \min(I_1(\mathbf{x}), I_2(\mathbf{x} - \mathbf{a})) \tag{3.29}$$

is related to the MAE error criterion. Since

$$|A - B| = A + B - 2 \min(A, B) \tag{3.30}$$

minimizing the MAE and maximizing the morphological correlation are equivalent, provided that the sum of A + B over the match window does not vary with the displacement.
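The relation between the two criteria can be checked on a small example. The following sketch is an illustration with assumed names and data: it uses a uniform window and circular shifts, so that the sum of both signals over the window is the same for every candidate displacement.

```python
def sad(a, b):
    """Sum of absolute differences (MAE up to a constant factor)."""
    return sum(abs(p - q) for p, q in zip(a, b))


def morph_corr(a, b):
    """Morphological correlation with a uniform window w = 1."""
    return sum(min(p, q) for p, q in zip(a, b))


def shifted(signal, d):
    """Circularly shifted copy: result[x] = signal[x - d]."""
    n = len(signal)
    return [signal[(x - d) % n] for x in range(n)]


i1 = [3, 7, 1, 9, 4, 6, 2, 8]
i2 = shifted(i1, 2)                      # i2(x) = i1(x - 2)
candidates = list(range(-3, 4))

# Compare i1(x) with i2(x - a) for each candidate a.
sad_by_a = {a: sad(i1, shifted(i2, a)) for a in candidates}
mc_by_a = {a: morph_corr(i1, shifted(i2, a)) for a in candidates}

best_sad = min(candidates, key=sad_by_a.get)
best_mc = max(candidates, key=mc_by_a.get)
print(best_sad, best_mc)
```

With a finite window on real images the sum of A + B changes slightly between candidates, so the two criteria can then pick different vectors in marginal cases.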

In the noise-free case, the decrease in morphological correlation is steeper than for linear correlation when there is a slight deviation from the correct displacement value.

Significant improvements are also reported for both MAE and MSE if the mean value is subtracted from each block and the block’s energy is normalized. This is due to variations in angular reflectance and illumination.

3.8.7 Adaptation of Matching Criterion to Noise Distribution

Morphological correlation, or MAE, can also be shown to be optimal in the ML sense if the sum of two noise terms is Laplacian.

Assume for simplicity, but without loss of generality, that no motion has taken place between the images $I_1$ and $I_2$. We thus have

$$I_2(\mathbf{x}) = I_1(\mathbf{x}) + L(\mathbf{x}) \tag{3.31}$$

where the noise term $L(\mathbf{x})$ is now seen as a Laplacian distributed memoryless noise term with probability density function

$$f_L(l) = \frac{1}{2a}\, e^{-\frac{|l|}{a}} \tag{3.32}$$

Given the pixel values in a block in one of the images, say $I_1$, what is the probability that the N pixels in a block in $I_2$ will have just those values that are actually there?

For zero displacement we get:

$$p_{Laplacian}(0) = \prod_{block} p(L(\mathbf{x})) = \frac{1}{(2a)^N}\, e^{-\frac{1}{a} \sum_{block} |L(\mathbf{x})|} \tag{3.33}$$

We see from equation 3.33 that the probability for a correct match is a monotonically decreasing function of the MAE.

The above analysis performed for Gaussian memoryless noise $G(\mathbf{x})$, with distribution

$$f_G(g) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{g^2}{2\sigma^2}} \tag{3.34}$$

yields

$$p_{Gauss}(0) = \prod_{block} p(G(\mathbf{x})) = \frac{1}{(\sqrt{2\pi}\,\sigma)^N}\, e^{-\frac{1}{2\sigma^2} \sum_{block} G^2(\mathbf{x})} \tag{3.35}$$

which leads to MSE as a maximum likelihood matching criterion.

In image coding practice it is often assumed that the difference signal that remains to be coded after prediction is approximately Laplacian.

There are thus reasons to prefer MAE to MSE, both theoretical (sharpness of peak and best noise suppression) and practical (ease of computation).

To test whether there is a potential for further improvement of the match results, we assume that the difference signal has a distribution that can be written

$$f_D(d) = c\, e^{-\left|\frac{d}{c_1}\right|^{\alpha}}. \tag{3.36}$$

This distribution is known under the two names Generalized Gaussian distribution and Generalized Laplacian distribution, indicating that Gaussian and Laplacian distributions are special cases of Equation 3.36. It will here be called the Generalized Gaussian distribution, since Generalized Laplacian Distribution has also been used for a different generalization of the Laplace distribution [94].

The distribution has been used to model DCT (see Section 1.2.3) components [64, 96] and wavelet transform (see Section 1.2.3) components [91].

An experimental series was made to measure which α best matches the distribution of prediction errors after motion compensation. Hierarchical block matching with a rectangular match window of 5 × 5 pixels was used, and motion estimation with integer pixel accuracy was made between the first and second frames of the test sequences (“Salesman” and “Claire”).

The matching criterion was

$$\sum_{block} |\text{difference}|^{\alpha}. \tag{3.37}$$
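A criterion of this form is straightforward to express in code. The sketch below is a minimal illustration with a hypothetical function name and made-up pixel values; it shows how α = 1 and α = 2 reproduce the sum of absolute differences and the sum of squared differences.

```python
def match_cost(block_a, block_b, alpha):
    """Sum over the block of |difference| ** alpha (low value = good match)."""
    return sum(abs(p - q) ** alpha for p, q in zip(block_a, block_b))


current = [1, 2, 3]      # illustrative pixel values, not from a test sequence
candidate = [1, 4, 0]

print(match_cost(current, candidate, 1))     # absolute differences
print(match_cost(current, candidate, 2))     # squared differences
print(match_cost(current, candidate, 0.5))   # emphasizes exact agreement
```

Values of α below 1 penalize a few large differences less than many small ones, which favours candidates where most pixels agree closely.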

The Kullback Leibler distance [23] between the model distribution given in equation 3.36 and the actual distribution of the prediction errors was calculated. Quantization effects have to be taken into account when calculating the probability density functions of the prediction errors, at least when the test images are quantized to 8 bits. Details about these computations are given in Appendix A.

The result is presented in Table 3.3 and Fig. 3.18. As can be seen, the lowest obtainable value of the Kullback Leibler distance was 0.017 for “Salesman” and 0.013 for “Claire”. For comparison it can be mentioned that for the “Claire” sequence the entropy [23] of the prediction error was 1.4 bits/pixel, almost independent of which α was used in the motion estimation match criterion. The Kullback Leibler distance measures how far the model distribution is from the actual distribution, counted in how many extra bits it costs to use the model distribution instead of the actual distribution when source coding with a codebook ideally suited for the respective source. That the Kullback Leibler distance can get as low as 1% of the entropy, which is the lowest obtainable cost for coding the source, indicates that equation 3.36 is a good model for the error remaining after motion compensation for the two tested image sequences. The experiment also shows that individual α values are necessary for good modeling of different image sequences.
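For discrete distributions the Kullback Leibler distance in bits can be computed as below. This is a generic sketch with an assumed function name and toy probabilities; it does not reproduce the quantization handling of Appendix A, and which argument holds the model and which the measured distribution follows the convention stated in the caption of Table 3.3.

```python
import math


def kl_distance_bits(p, q):
    """D(p || q) in bits for two discrete distributions over the same bins."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi <= 0.0:
                return math.inf      # p puts mass where q has none
            total += pi * math.log2(pi / qi)
    return total


p = [0.5, 0.5]
q = [0.25, 0.75]
print(kl_distance_bits(p, p))
print(kl_distance_bits(p, q))
```

The measure is not symmetric, so swapping the arguments generally gives a different value.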

A comparison of the sharpness of the matching results using different α-values can be seen in Fig. 3.17.

3.8.8 Matching and Algorithms based on the Optical Flow Equation: A Short Summary

We have shown that, from a theoretical point of view, matching techniques are optimal in the case of pure translation. They can also be extended to handle rotation and other geometric distortions, but their computational complexity will then increase.

Algorithms based on the optical flow equation assume and produce smooth motion fields. This might sound like a nice property, but is not necessarily a desired feature. At borders between moving objects the AMF is discontinuous, and smoothing it gives “bleeding” over object boundaries. On the other hand, there are applications which require a smooth, invertible motion field, e.g. motion compensated 3-D coding. Another advantage of differential-based methods is the sub-pixel accuracy.

Which motion compensation technique to use in practical image coding equipment depends on the application; video coding is such a wide field that a single method does not rule out the others.


Salesman

α      KLD      RMS    MAE
0.8    0.0216   3.60   2.18
1.0    0.0173   3.60   2.18
1.1    0.0183   3.60   2.18
1.2    0.0202   3.60   2.19
1.4    0.0255   3.59   2.19
1.8    0.0389   3.57   2.21
1.9    0.0423   3.58   2.22
2.0    0.0454   3.58   2.22

Claire

α      KLD      RMS    MAE
0.2    0.0456   3.34   1.22
0.25   0.0225   3.30   1.21
0.3    0.0137   3.27   1.21
0.35   0.0126   3.21   1.20
0.4    0.0163   3.18   1.20
0.6    0.0511   3.13   1.19
0.8    0.0917   3.13   1.20
1.0    0.1288   3.09   1.20
1.1    0.1478   3.09   1.20
1.35   0.1884   3.09   1.21
1.4    0.1950   3.09   1.21
1.6    0.2225   3.09   1.22
1.7    0.2335   3.08   1.22
1.8    0.2440   3.09   1.22
1.9    0.2552   3.09   1.23
2.0    0.2660   3.09   1.23

Table 3.3 Kullback Leibler Distance D(modeled error distribution || actual error distribution), Root Mean Square error and Mean Absolute Error between the motion-compensated first frame and the second frame of the test sequences “Salesman” (top) and “Claire” (bottom), where the motion estimation was made according to the Match Criterion $\sum |x_1 - x_2|^{\alpha}$.


Figure 3.17 Results of matching using different values of α to achieve an ML estimator. Zero displacement is assumed. The noise level is zero in this example. For binary signals the matching results are independent of α. [Panels, each plotted against x: signal; match results using α = 2 (MSE); match results using α = 1 (MAE); match results using α = 1/2.]


Figure 3.18 Kullback Leibler Distance. “Salesman” sequence represented by filled dots, “Claire” sequence represented by unfilled dots. Data comes from Table 3.3. [Plot: horizontal axis α, from 0 to 2.0; vertical axis KLD, from 0 to 0.13.]

4 Certainty Measures

The chapter starts with descriptions of methods for computing reliability indicators of the motion vectors. Some of the methods are related to matching in the image plane [2, 3, 40, 52, 69, 84], some to matching in the frequency domain [104], and some to differential-based methods [16, 97, 100, 101].

Certainty measures have mainly been used for improving the motion vector fields. In [69] and [84], however, the certainty measures were used for controlling prefiltering.

4.1 Certainty Measures for Block Matching

We will start by describing two simple, intuitive certainty measures, namely Best Match Value and Distance to Second Best. Thereafter we introduce the appealing but computationally awkward Probability of Correct Match. Then five scalar certainty measures described earlier in the literature will be presented; in computational cost and in performance they lie between Probability of Correct Match and the simple measures. Last, a method to estimate the covariance of the estimated vector will be described. In the following text it will be assumed that the Match Criterion used is of the type where a low MC-value corresponds to a good match, and therefore the motion vector giving the lowest MC-value will be chosen.


Best Match Value

The simplest certainty measure, Best Match Value, consists of the match value given by the estimated motion vector. Since a low match value indicates a good match, we have to apply a decreasing function to the match value to get a certainty measure that is high for good matches.

This certainty measure gets a low value if the noise is high and if the image changes in a way that is not consistent with the motion model. There are theoretically sound reasons for using Best Match Value as a certainty measure if the Match Criterion is properly adapted to the noise statistics of the image sequence. Then the MC-value reflects the probability of getting the actual differences between the two images in the match block. The problem with using Best Match Value as a certainty estimate is that it can get a high value even if there is not enough spatial information in the image to uniquely determine the motion vector.

Distance to Second Best

The difference between the best and the second best match value can be used as a certainty measure and will in the following be called Distance to Second Best. This certainty measure gets a high value if there is structure in the image, which is a desired feature of the certainty measure. A less desired feature of Distance to Second Best is that when the variance of the noise increases, the certainty measure also increases.
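Both simple measures can be read off directly from the match values of the search window. The sketch below is illustrative: the function names are hypothetical, the match values are made up, and the mapping 1/(1 + MC) is just one possible decreasing function, chosen here as an assumption.

```python
def best_and_second(match_values):
    """Return (best vector, best MC value, Distance to Second Best).

    match_values maps candidate motion vectors to MC values (low = good).
    """
    ranked = sorted(match_values.items(), key=lambda item: item[1])
    (best_vec, best_mc), (_, second_mc) = ranked[0], ranked[1]
    return best_vec, best_mc, second_mc - best_mc


def best_match_certainty(best_mc):
    """Decreasing function of the best match value (illustrative choice)."""
    return 1.0 / (1.0 + best_mc)


mc = {(0, 0): 4.0, (1, 0): 1.5, (0, 1): 2.0}
vec, best_mc, distance = best_and_second(mc)
print(vec, best_match_certainty(best_mc), distance)
```

The two numbers respond to different failure modes: the first drops when the best match is poor, the second when several candidates match almost equally well.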

Probability of Correct Match

When doing matching with a discrete number of pre-specified candidate positions, one of them being the physically correct one, we can talk about the probability of a correct assignment.

Assuming that position i has been chosen, the Probability of Correct Match is

$$P_c = P(i \text{ was correct}). \tag{4.1}$$

When the match value $MC_i$ corresponds to the chosen vector and we have two positions i and j to choose between, not taking a priori probabilities of the motion vectors into account, the expression for $P_c$ can be written

$$P_c = 1 - P(e_{MC_i} - e_{MC_j} > MC_j - MC_i). \tag{4.2}$$

In the above equation $MC_i$ and $MC_j$ denote registered match values and $e_{MC_i}$, $e_{MC_j}$ denote the noise terms in the respective match values.

Assuming that the errors $e_{MC_i}$ have Gaussian distribution¹ with mean value $\mu_e$ and standard deviation $\sigma_e$, arising from spatially and temporally uncorrelated noise in the image sequence, the probability of correct match for a search window of two positions becomes:

$$P_c = 1 - Q\left(\frac{MC_j - MC_i}{(1 - \rho_{ij})\,\sigma_e}\right) \tag{4.3}$$

where

$$Q(x) = \int_{x}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{y^2}{2}}\, dy \tag{4.4}$$

and $\rho_{ij}$ is the correlation between the noise terms at positions i and j, which arises when the match window is larger than the distance between i and j. The noise standard deviation $\sigma_e$ has to be estimated, see the analysis in Section 3.8.3. We see that the Probability of Correct Match in the case of a search window of two positions becomes a function of the Distance to Second Best.

For larger search windows the probability of correct match is not easy to compute. It requires integrating a probability density function over an n-dimensional volume, n being the number of search positions, which is delimited by n − 1 hyperplanes not aligned to the symmetry axes of the density function.

Making the somewhat dubious assumption that the $e_{MC_i}$ values are uncorrelated, keeping the assumption that they are Gaussian, and using the Union Bound [138], we can achieve an estimate

$$P_c \approx 1 - \sum_{j \neq i} Q\left(\frac{MC_j - MC_i}{\sigma_e}\right) \tag{4.5}$$
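The union bound estimate is cheap to evaluate once the match values are available. The following sketch is an illustration with assumed names and made-up match values; clamping the result at zero is a choice of this sketch, since the sum of Q terms can exceed one when many candidates are close to the best.

```python
import math


def q_function(x):
    """Tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def prob_correct_union_bound(match_values, sigma_e):
    """Estimate P_c for the lowest match value via the union bound.

    match_values is a list of MC values (low = good); sigma_e is the assumed
    standard deviation of the noise in the match values.
    Returns (index of chosen candidate, estimated P_c clamped to [0, 1]).
    """
    i = min(range(len(match_values)), key=match_values.__getitem__)
    mc_i = match_values[i]
    miss = sum(
        q_function((mc_j - mc_i) / sigma_e)
        for j, mc_j in enumerate(match_values)
        if j != i
    )
    return i, max(0.0, 1.0 - miss)


print(prob_correct_union_bound([5.0, 1.0, 6.0], sigma_e=1.0))
print(prob_correct_union_bound([1.0, 1.0, 1.0], sigma_e=1.0))
```

The second call shows the flat case: with three indistinguishable candidates the estimate collapses to zero, which is pessimistic but signals that the vector should not be trusted.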

¹ Actually, the $e_{MC_i}$ are not Gaussian. If the noise in the images is Gaussian and MSE is used, the $e_{MC_i}$ have Chi-square distribution. For large match windows a Gaussian distribution is a reasonable approximation, independent of the underlying noise distribution and of which Match Criterion is used, since every term in the summation is independent.


In the case of hierarchical matching in the image plane, the search area and subsampling factor can be chosen so that each motion vector is obtainable by a unique sequence of displacements when doing a coarse-to-fine estimate. The probability of a correct motion vector will in this case be the product of the probabilities of correct match at each level,

$$P_c = \prod_{l} P_{c_l}. \tag{4.6}$$

Often the choice of search area and subsampling factor makes some motion vectors reachable by several paths, see Fig. 4.1.

Figure 4.1 Two-dimensional hierarchical matching with search area 3 and subsampling factor 2. Some motion vectors are reachable by several paths, making it possible, in some cases, to recover from a match error.

From the example given in Fig. 4.1 it can be envisioned that the probability of reaching the correct motion vector is dependent on the value of the motion vector. The formula for $P_c$ is therefore also dependent on the value of the motion vector. To avoid this dependency, and also for reasons of computational efficiency, search areas and subsampling structures should be chosen so that there is one unique path which leads to each motion vector. See also Section 2.2.

The probability of correct match has, as far as we know, never before been used as a certainty measure, whether for lack of analysis or because of intractability of implementation. Instead some other, more or less ad hoc, certainty measures have been used; we shall discuss some of them below.


Anandan 1

Anandan presents a certainty measure for use in hierarchical motion estimation [2]. He computes the normalized curvature in four directions around the best match (here indexed (0, 0) for convenience).

$$C_0 = \frac{MC(0, -1) - 2MC(0, 0) + MC(0, 1)}{MC(0, -1) + 2MC(0, 0) + MC(0, 1)} \tag{4.7}$$

$$C_{45} = \frac{MC(1, -1) - 2MC(0, 0) + MC(-1, 1)}{MC(1, -1) + 2MC(0, 0) + MC(-1, 1)} \tag{4.8}$$

$$C_{90} = \frac{MC(-1, 0) - 2MC(0, 0) + MC(1, 0)}{MC(-1, 0) + 2MC(0, 0) + MC(1, 0)} \tag{4.9}$$

$$C_{135} = \frac{MC(-1, -1) - 2MC(0, 0) + MC(1, 1)}{MC(-1, -1) + 2MC(0, 0) + MC(1, 1)} \tag{4.10}$$

The lowest of these,

$$C = \min(C_0, C_{45}, C_{90}, C_{135}) \tag{4.11}$$

is then used as a certainty measure.
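The four normalized curvatures only need the 3 × 3 neighbourhood of match values around the best match. The sketch below is illustrative: the dictionary layout, the function name and the quadratic test surface are assumptions, and the zero-denominator guard is a choice of the sketch.

```python
def anandan_certainty(mc):
    """Minimum normalized curvature around the best match at (0, 0).

    mc maps index pairs in {-1, 0, 1} x {-1, 0, 1} to match values,
    using the same index pairs as in the equations above.
    """
    center = mc[(0, 0)]

    def curvature(p, q):
        num = mc[p] - 2.0 * center + mc[q]
        den = mc[p] + 2.0 * center + mc[q]
        return num / den if den != 0.0 else 0.0

    c0 = curvature((0, -1), (0, 1))
    c45 = curvature((1, -1), (-1, 1))
    c90 = curvature((-1, 0), (1, 0))
    c135 = curvature((-1, -1), (1, 1))
    return min(c0, c45, c90, c135)


# Synthetic match surface: a bowl that is flatter along the first index.
surface = {(a, b): 1.0 + a * a + 2.0 * b * b
           for a in (-1, 0, 1) for b in (-1, 0, 1)}
print(anandan_certainty(surface))
```

Taking the minimum means that a surface which is flat in any one direction gets a low certainty, which is how the measure reacts to the aperture problem.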

Anandan 2

In [3] Anandan presents a vector certainty measure. Let $C_{max}$ and $C_{min}$ be the maximum and minimum curvature of the match value surface around the best match. The unit vectors $\vec{e}_1$ and $\vec{e}_2$ indicate the directions where the extreme values are found. The certainty values used are

$$c_1 = \frac{C_{max}}{k_1 + k_2 MC_{min} + k_3 C_{max}} \tag{4.12}$$

and

$$c_2 = \frac{C_{min}}{k_1 + k_2 MC_{min} + k_3 C_{min}}. \tag{4.13}$$


1D-Curvature over Match Value

Kronander, in [69], uses a model where the match values are supposed to rise quadratically around the best match

$$MC(x) = k_x x^2 + l_x \tag{4.14}$$

and similarly in the y direction. The certainty measures are defined as

$$C_x = \frac{k_x}{l_x}, \tag{4.15}$$

$$C_y = \frac{k_y}{l_y}, \tag{4.16}$$

and

$$C = \sqrt{C_x^2 + C_y^2} \tag{4.17}$$

Kronander was the first to use a certainty measure in an image coding application, namely pre-filtering; previous certainty measures had been used for refinement of the motion vector field.

2D-Curvature over Match Value

A similar certainty measure was introduced in [84]:

$$C = \frac{k}{l} \tag{4.18}$$

where k and l were estimated directly in two dimensions

$$MC(x, y) = k(x^2 + y^2) + l. \tag{4.19}$$

The certainty measure was used to improve pre-filtering, see Section 4.5.3.


Accumulated Differences

Guse [52] uses a quality measure (assuming that the lowest MC indicates the best match)

$$QM = \frac{1}{N(255^2 - \sigma^2)} \sum_{j=1}^{N} (MC_j - AbsMin) \tag{4.20}$$

where AbsMin is the lowest match value or, if the lowest match value is lower than the estimated noise variance $\sigma^2$, $AbsMin = \sigma^2$. The $MC_j$ values are sorted so that $MC_1$ is the second best match value and the others follow in ascending order. We note that we can obtain Distance to Second Best as a special case by setting N = 1.

Covariance of Motion Vector

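Gennery [40] introduced the covariance matrix of the disparity estimates for the purpose of stereo matching. The covariance matrix could also be useful in e.g. model based image coding.

Assume that from each computed MC(u, v) in the search window we can extract a value ρ(u, v) that is proportional to the probability of (u, v) being the correct match. For a generalized Gaussian probability distribution function we get

$$\rho(u, v) = e^{-MC(u, v)} \tag{4.21}$$

under the assumption of stationary white noise.

The variances and covariance of the displacement estimates are then computed as

$$\sigma_u^2 = \frac{\sum_{(u,v) \in SW} \rho(u, v)\, u^2}{\sum_{(u,v) \in SW} \rho(u, v)} - \bar{u}^2 \tag{4.22}$$

$$\sigma_v^2 = \frac{\sum_{(u,v) \in SW} \rho(u, v)\, v^2}{\sum_{(u,v) \in SW} \rho(u, v)} - \bar{v}^2 \tag{4.23}$$

$$\sigma_{uv} = \frac{\sum_{(u,v) \in SW} \rho(u, v)\, uv}{\sum_{(u,v) \in SW} \rho(u, v)} - \bar{u}\bar{v}. \tag{4.24}$$

where SW denotes the Search Window and $\bar{u}$, $\bar{v}$ are the ρ-weighted means of u and v over the search window.

The covariance matrix is then

$$\begin{pmatrix} \sigma_u^2 & \sigma_{uv} \\ \sigma_{uv} & \sigma_v^2 \end{pmatrix} \tag{4.25}$$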
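The covariance estimate amounts to weighted moments over the search window. The sketch below is an illustration only: the function name and the synthetic match surface are assumptions, and the mean values are taken as the ρ-weighted means of u and v, which is the usual meaning of such a variance formula.

```python
import math


def displacement_covariance(mc):
    """Covariance matrix of the displacement from match values.

    mc maps candidate vectors (u, v) in the search window to MC values.
    Weights are rho = exp(-MC), as in equation (4.21).
    """
    rho = {uv: math.exp(-value) for uv, value in mc.items()}
    total = sum(rho.values())
    mean_u = sum(r * u for (u, v), r in rho.items()) / total
    mean_v = sum(r * v for (u, v), r in rho.items()) / total
    var_u = sum(r * u * u for (u, v), r in rho.items()) / total - mean_u ** 2
    var_v = sum(r * v * v for (u, v), r in rho.items()) / total - mean_v ** 2
    cov_uv = sum(r * u * v for (u, v), r in rho.items()) / total - mean_u * mean_v
    return ((var_u, cov_uv), (cov_uv, var_v))


# Synthetic match surface: sharp minimum in v, broad minimum in u.
window = range(-2, 3)
mc = {(u, v): u * u + 4.0 * v * v for u in window for v in window}
cov = displacement_covariance(mc)
print(cov)
```

For this surface the variance in u comes out larger than in v, reflecting that the match constrains the vertical component more tightly than the horizontal one.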

4.2 Certainty Measures for Matching in the Frequency Domain

Probability of Correct Match

When matching in the frequency domain we are in the fortunate position that the probability of correct match can be estimated, since the values in all positions are considered independent. The Union Bound

$$P_c \geq 1 - \sum_{j \neq i} P(j \text{ was better than } i) \tag{4.26}$$

can be utilized to estimate $P_c$ in the same way as when computing the probability of correct match for matching in the spatial domain, but here the assumptions (Gaussian, independent errors in the correlation values) are much more reasonable.

Normalized Peak Height

Pearson [104] models the phase angle as either correct or uniformly distributed over the range (−π, π) radians. The total power in the Phase Correlation Surface (which has been normalized to 1) is divided between a coherent peak of power $A^2$ (A is the peak height) and noise peaks with energy summing to $1 - A^2$. It is concluded that the peak height, A, provides a sensitive measure of the degree of commonality between two images and consequently of the correlation performance to be expected.

The assumption of one peak for the correct match and the rest of the energy coming from noise can of course be questioned. Assume that two motions are present in a block, yielding two peaks $A_1$ and $A_2$ of the same height. If we assume that there is no noise then $A_1 = A_2 = \frac{1}{\sqrt{2}}$. The certainty measure $A = \frac{1}{\sqrt{2}}$ is the same as for the case with only one peak of height $\frac{1}{\sqrt{2}}$ and N other points having peaks with Gaussian distribution $(0, \frac{1}{\sqrt{2N}})$. The error probability in the first case is $\frac{1}{2}$; in the second case it is lower, since the error probability $N Q\left(1 - \frac{1}{\sqrt{2N}}\right)$ is strictly decreasing for positive values of N.

4.3 Certainty Measures for Motion Estimation based on Gradients

Burt

In [16] the motion vectors at a certain resolution level l are obtained by four spatio-temporal filters in the horizontal, vertical, and two diagonal directions, corresponding to HVS motion channels². Since correlation is used, a high value indicates a good match. Two motion vector pairs are estimated, one from the horizontal-vertical filter pair and one from the pair of diagonal filters.

For each direction a certainty measure is computed. In the horizontal direction at a certain resolution level l the certainty is given as

$$C_{rt,l}(i, j) = \frac{2\,MC_{ze,l}(i, j) - MC_{rt,l}(i, j) - MC_{lt,l}(i, j)}{MC_{ze,l}(i, j) + MC_{lt,l}(i, j) + MC_{rt,l}(i, j)} \tag{4.27}$$

where $MC_{rt,l}$ stands for Motion Channel right displacement at level l, but could also be read as Match Criterion right displacement at level l. The motion vectors can thereafter be classified into five classes depending on the four certainty measures and the consistency of the motion vectors obtained by the four filters.
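Eq. (4.27) is a ratio of two linear combinations of three channel responses. A minimal sketch for one position and one direction follows; the function and argument names are hypothetical, and returning zero for a vanishing denominator is an assumption of this example.

```python
def directional_certainty(mc_zero, mc_right, mc_left):
    """Certainty for one direction according to Eq. (4.27).

    mc_zero, mc_right and mc_left are the channel responses for zero,
    right and left displacement at the same position and level.
    """
    numerator = 2.0 * mc_zero - mc_right - mc_left
    denominator = mc_zero + mc_left + mc_right
    if denominator == 0.0:
        return 0.0  # assumption: no response at all gives zero certainty
    return numerator / denominator
```

A response that peaks at zero displacement gives a positive value, while three equal responses (no curvature) give zero.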

Case 1: All four certainty estimates are low.
Either the velocity is too great to be handled at this scale or there is not enough image detail to resolve the motion³.

Case 2: All certainty measures are high and the velocity estimates from the two filter pairs are approximately the same.
There is enough texture to resolve the motion.

Case 3: All certainty measures are high, but the estimated motion vectors differ.
This is interpreted as a point near an occluding edge.

Case 4: Three certainty measures are high and one is low.
The point lies in an area where the structure of the image is one-dimensional. The direction of the certainty measure with low certainty is the same as the direction of the image structure.

Case 5: Two certainty measures are high and two are low.
The image content is locally one-dimensional. This case is essentially the same as Case 4, but the direction of the image structure lies between two filter directions. The component of the motion vector perpendicular to the image structure is reliable, the one parallel to the structure is not.

² This method lies on the borderline between block matching and gradient methods. Since it does not choose the best match but computes motion vectors as ratios, it has been placed under gradient based methods.

³ This is also a problem of scale.

A criticism of the certainty measure above is that it computes the curvature around zero displacement while the motion vector may differ from zero.
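The five-way classification described above can be sketched as a simple decision on the number of high certainty values. The threshold that separates "high" from "low" and the boolean agreement test are hypothetical parameters introduced for this example; the text does not specify how they were chosen.

```python
def classify_motion_estimate(certainties, vectors_agree, threshold):
    """Return the case number (1 to 5) for four directional certainties.

    certainties   -- the four certainty values at one position
    vectors_agree -- True if the two filter pairs give similar vectors
    threshold     -- hypothetical limit between low and high certainty
    """
    n_high = sum(1 for c in certainties if c >= threshold)
    if n_high == 0:
        return 1                      # too fast or too little detail
    if n_high == 4:
        return 2 if vectors_agree else 3
    if n_high == 3:
        return 4                      # one-dimensional structure
    if n_high == 2:
        return 5                      # structure between two filter directions
    return None                       # one high value: not covered by the five cases
```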

Ohta

For the Block Gradient Method described in Section 3.6.1, Ohta derived reliability indices in [100, 101].

The first reliability index is the square root of the energy of the error remaining after the motion has been applied:

$$r_e = \min |E_r|. \tag{4.28}$$

But also when there is a good fit, the motion vectors can be unreliable. The motion vectors were obtained through a matrix inversion, and thus the eigenvalues of the matrix being inverted are important for the reliability of the motion vector components. Two further reliability indices are therefore introduced:

$$\vec{r}_1 = \sqrt{\lambda_1}\,\vec{e}_1 \tag{4.29}$$

where $\lambda_1$ is the highest eigenvalue and $\vec{e}_1$ the corresponding eigenvector, and

$$\vec{r}_2 = \sqrt{\lambda_2}\,\vec{e}_2 \tag{4.30}$$

where $\lambda_2$ is the lowest eigenvalue and $\vec{e}_2$ the corresponding eigenvector.

The most reliable direction is thus $\vec{e}_1$ and the least reliable $\vec{e}_2$.
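The two directional indices can be sketched with a symmetric eigendecomposition. The example below assumes that the matrix being inverted is the 2 × 2 matrix of summed gradient products over the block, which is the usual form for gradient methods; since Section 3.6.1 is not reproduced here, this form and all names are assumptions.

```python
import numpy as np

def reliability_vectors(ix, iy):
    """Return (r1, r2) as in Eqs. (4.29) and (4.30).

    ix, iy -- arrays with the horizontal and vertical image gradients
              inside one block (assumed form of the inverted matrix).
    """
    g = np.array([[np.sum(ix * ix), np.sum(ix * iy)],
                  [np.sum(ix * iy), np.sum(iy * iy)]])
    lam, vec = np.linalg.eigh(g)        # eigenvalues in ascending order
    lam = np.clip(lam, 0.0, None)       # guard against tiny negative round-off
    r1 = np.sqrt(lam[1]) * vec[:, 1]    # highest eigenvalue: most reliable direction
    r2 = np.sqrt(lam[0]) * vec[:, 0]    # lowest eigenvalue: least reliable direction
    return r1, r2
```

For a block with purely horizontal gradients, r1 points along the horizontal axis and r2 has zero length, reflecting that motion along the structure cannot be resolved.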


Hessian

In second-derivative based methods a matrix inversion is also performed. The Hessian

$$H = I_{xx} I_{yy} - (I_{xy})^2 \tag{4.31}$$

has been used as a certainty estimate, e.g. by Mulroy in [97]. He concludes, however, that the Hessian can be high also in areas which are almost locally one-dimensional.
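A minimal sketch of Eq. (4.31) using finite differences is shown below. The use of `numpy.gradient` (central differences) for the derivatives is a choice made for this example; the cited work may use other derivative filters.

```python
import numpy as np

def hessian_certainty(image):
    """Per-pixel value of H = Ixx * Iyy - Ixy**2 from finite differences."""
    img = np.asarray(image, dtype=float)
    iy, ix = np.gradient(img)       # derivatives along rows (y) and columns (x)
    ixy, ixx = np.gradient(ix)      # d(Ix)/dy and d(Ix)/dx
    iyy, _ = np.gradient(iy)        # d(Iy)/dy
    return ixx * iyy - ixy**2
```

For an ideally one-dimensional pattern, one of the second derivatives vanishes together with the mixed term, so H is zero.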

4.4 Evaluations of Certainty Measures

In this section we will experimentally evaluate different certainty measures for matching in the spatial domain. The property that will be evaluated is the ability to indicate true motion in the presence of noise. Three different test images will be used:

• a circularly symmetric test image containing different spatial frequencies,

• an image which, apart from the circularly symmetric pattern, also has areas consisting of white Gaussian noise, making it possible to compare the effects of a locally one-dimensional pattern to the effects of a pattern containing two-dimensional information,

• a natural test image.

Circular Test Image

The following certainty measures were evaluated:

• Best Match Value
MSE was used as MC to produce a certainty image, which was inverted because high MSE represents bad match. The computational cost of obtaining the certainty measure is zero if block matching is used to obtain the motion vectors, since the motion estimation algorithm requires finding the best match value. If the motion vectors are obtained in a way other than block matching, the computational cost is still low.

• Distance to Second Best
The difference between best and second best MSE value is calculated. The neighbourhood in which the second best MSE value was searched was 5 × 5 pixels. A search in the neighbourhood is done to find the second best value, but computational complexity is still not very high.

• Probability of Correct Match, Union Bound Approximation
The approximation formula 4.5 was used to compute the probability of correct match over a 5 × 5 neighbourhood. If the Q-function (integral of the Gaussian pdf) is tabulated, the computational complexity is not very high.

• 2D-Curvature over Match Value
Also here a 5 × 5 neighbourhood was used for estimating the certainty measure. Even if the curvature and match values are obtained as a result of a least squares approximation, the computations are not cumbersome since the certainty is obtained as a ratio of two linear combinations of the MC values in the neighbourhood.

• Accumulated Differences
In the first simulation the differences to the 6 best MC values were summed. The computational complexity is a little bit higher for this method since it requires finding the lowest 6 out of 24 values. The computational complexity would become lower if all values in the neighbourhood were used. This was tried in the second simulation.
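Three of the measures listed above need only a sort of the match values in the neighbourhood. The sketch below assumes a 5 × 5 array of MSE values around the chosen motion vector, with the best match being the smallest value in the array; the function name, the dictionary keys and the use of negation as the "inversion" of the best match value are assumptions of this example.

```python
import numpy as np

def neighbourhood_certainties(mse, n_best=6):
    """Simple certainty measures from the MSE values in a neighbourhood."""
    values = np.sort(np.asarray(mse, dtype=float).ravel())
    best, others = values[0], values[1:]      # 1 best value and 24 others for 5 x 5
    return {
        # inverted so that a good match (low MSE) gives a high value
        "best_match": -best,
        "distance_to_second_best": others[0] - best,
        # sum of differences to the n_best lowest of the remaining values
        "accumulated_differences": float(np.sum(others[:n_best] - best)),
        # sum of differences over the whole neighbourhood
        "accumulated_all": float(np.sum(others - best)),
    }
```

The full sort is used here for brevity; finding only the 6 lowest of the 24 values, or summing over all of them, avoids the sort, which corresponds to the complexity remark in the last list item.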

The simulations were done on a 512 × 512 test image containing a circular pattern of sinusoidal gratings with frequencies decreasing as the distance from the center increases, see Fig. 4.2. The middle and edge areas are uniformly gray, making correct local velocity estimates impossible to obtain in these areas. The motion vector was zero in the simulations. Gaussian, spatially and temporally white noise was added to both the “previous” and the “actual” frame, see Fig. 4.3. The sine-wave signal has an amplitude of 128 and the noise a standard deviation of 10, making the SNR_in 27 dB in the active area of the picture.

Since the test image is locally almost one-dimensional, the motion vectors cannot be correctly estimated. It is obvious from Fig. 4.4 that the component perpendicular to the image structure gets correctly approximated while the component parallel to the image structure is erroneous. How this is captured by the certainty measures is seen in Fig. 4.5. The motion vectors were obtained with a hierarchical block matching algorithm using 4 layers (allowing displacements up to ±15 to be detected), MSE as Match Criterion, and a Gaussian Match Window.

Figure 4.2 Circular test image used for certainty estimates simulations.

Figure 4.3 Circular test image with added noise.


Figure 4.4 Motion vectors obtained using a hierarchical block matching algorithm. In the two top pictures gray indicates correct match (zero displacement). Top left: horizontal estimated displacement; Top right: Vertical estimated displacement; Bottom: Length of error vector (zero represented by black).

Evaluation using Circular Test Image

Looking at the estimated motion vectors in Fig. 4.4, we see that for horizontal displacement the estimate is wrong when the pattern is also horizontal, which was expected. What was not expected is that the problems are worse in some of the directions, namely n · 90°, n · 90° + 27°, n · 90° + 45°, and n · 90° + 63°.

Comparing the results of the motion estimation algorithm, which are shown in Fig. 4.4, and the outputs of the certainty measures, which are shown in Fig. 4.8, we note the following:

• Best Match Value
This certainty measure gave a totally unwanted result in this case (pure translation). Where there was no image structure the motion vectors adapted to the noise, which made the residual error become lowest in areas without image texture, which is also where the errors are largest.

• Distance to Second Best
Considering the simplicity of this measure, it performs relatively well. Where there are low spatial frequencies, the algorithm tends to produce too low certainty estimates.

• Probability of Correct Match, Union Bound Approx.
This certainty measure is qualitatively right, but rather noisy. The behaviour is somewhat similar to the behaviour of the Distance to Second Best, which indicates that often the largest term in the Union Bound summation constitutes the largest part of the sum.

• 2D-Curvature over Match Value
This certainty measure is better than the previous ones at detecting that the estimates are unreliable for high frequencies.

• Accumulated Differences
This certainty measure detects well where there exists image information. For N = 6 the angular variations are not well captured: n · 90° has higher certainty, while the results of the estimation algorithm were worse here. The certainty estimates are too low for low frequencies. The accumulated differences over the 24 spatially closest pixels show almost no variation over spatial frequencies and phase, which is not surprising since the area the summation is done over now contains more than a small part of a period.

For all the certainty measures it can be said that they tend to be over-optimistic about the certainty for the highest frequencies. This may be explained by the fact that the certainty measures use the resolution of the original image, while the motion estimation algorithm starts on a coarser scale where only low frequencies exist due to image filtering and downsampling. The high frequencies are not preserved at the scale where the algorithm starts. The noise, however, has energy in all frequency components. This situation is not common in natural images since, as a coarse approximation, images can be said to be formed by edges between different areas. An edge contains all frequencies; due to defocusing the higher frequencies can be eliminated, but missing low frequencies should be a rare exception. Nevertheless, this is an unwanted behaviour of the certainty measures. The certainty should instead be estimated on every level when doing motion estimation and combined, roughly by multiplication as discussed in Section 4.1.

1D-Pattern versus 2D-Pattern

In the previous experiment the whole test image was nearly locally one-dimensional; it is therefore interesting to compare the result to an image containing true 2D-information. Since complex information can be emulated by a noise pattern, a new test image was constructed using half of the image from the earlier experiment and the other two quarters consisting of white Gaussian noise having different standard deviations, see Fig. 4.6. The experimental conditions were otherwise identical to the previous experiment (spatiotemporally white, Gaussian noise with standard deviation 10 added to both images, displacement 0). The upper left quarter of the image consists of white Gaussian noise with a standard deviation of 20, resulting in an SNR_in of 3 dB, while the lower left quarter of the image is composed of white Gaussian noise having a standard deviation of 40, giving an SNR_in value of 9 dB.
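The two SNR values quoted for the noise quadrants are numerically consistent with counting the noise added to both frames, i.e. with a ratio of signal variance to twice the noise variance. This reading is an inference from the numbers and not a definition given in the text; the function name and its parameters are hypothetical.

```python
import math

def snr_in_db(signal_std, noise_std, n_noisy_frames=2):
    """Input SNR in dB, assuming the noise of both frames is counted."""
    signal_power = signal_std ** 2
    noise_power = n_noisy_frames * noise_std ** 2
    return 10.0 * math.log10(signal_power / noise_power)
```

Under this assumption, a signal standard deviation of 20 with noise standard deviation 10 gives about 3 dB, and a signal standard deviation of 40 gives about 9 dB.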

Also in this experiment we started by estimating the motion vectors using the hierarchical motion estimation algorithm described in Section 4.4. The resulting motion vectors are shown in Fig. 4.7.

The local certainty images given by the different certainty measures are shown in Fig. 4.8. To make comparisons between certainty measures easier, the certainties from Fig. 4.8 were histogram equalized. This is a sound operation, since any monotonically increasing function of a certainty measure can also be considered to be a certainty measure. The results are shown in Fig. 4.9. As can be seen, some of the certainty measures are more noisy after histogram equalization than before. This is due to the fact that many samples shared exactly the same value (e.g. maximum or minimum certainty) and randomness was introduced to distribute the samples evenly over the histogram.
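Histogram equalization with random tie-breaking, as described above, can be sketched as a rank transform. The number of output levels (256), the seed and the function name are assumptions of this example; the text does not give the details of the implementation that was used.

```python
import numpy as np

def equalize_with_random_ties(certainty, levels=256, seed=0):
    """Map certainty values to equally populated integer levels.

    Samples with identical values are ordered randomly, so that they are
    spread evenly over the levels they share.
    """
    rng = np.random.default_rng(seed)
    flat = np.asarray(certainty, dtype=float).ravel()
    jitter = rng.random(flat.size)
    # lexsort uses the last key as primary key: sort by value, then by jitter
    order = np.lexsort((jitter, flat))
    ranks = np.empty(flat.size, dtype=np.int64)
    ranks[order] = np.arange(flat.size)
    return (ranks * levels // flat.size).reshape(np.shape(certainty))
```

Because the mapping is monotonically non-decreasing in the original value, the ordering between distinct certainty values is preserved; only ties are broken at random, which is the source of the added noise mentioned above.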


Evaluation of 1D-Pattern versus 2D-pattern

From Fig. 4.7 we see that no errors at all were made in estimating the motion vectors where the image content was white Gaussian noise with a standard deviation of 40. When the standard deviation in the input images was only 20, however, some errors occurred. The performance for the circular pattern was identical with the previous experiment.

• Best Match Value
Again, this gave a totally unwanted result. The two input quadrants having white noise as image signal were assigned approximately the same certainty (the one with less image energy even got a little higher certainty, since the motion estimation could adapt more to the noise). The highest certainty levels were assigned to the regions having a uniform gray-value as input.

• Distance to Second Best
This certainty measure performs qualitatively right, but is a bit noisy since only two match values are used.

• Probability of Correct Match, Union Bound Approx.
The result of this certainty measure is also qualitatively right but noisy.

• 2D-Curvature over Match Value
This certainty measure does not work at all when the input images consist of spatially white noise. This is because the correlation function of the image content does not follow a parabola.

• Accumulated Differences
The certainty measure is qualitatively right both for N = 6 and when the differences are accumulated over the whole neighbourhood.

As a complement to the visual comparisons between certainty images and motion vector error images, curves were plotted showing how many of the motion vectors were correct for each certainty level. For a perfect certainty measure, such a curve should form a step, making it easy to separate correct from incorrect motion vectors. The actual behaviour of the certainty measures is shown in Fig. 4.10. The number of vectors at each certainty level was 1024. Since the search window in the simulations consists of 961 pixels, the chance of hitting the right motion vector by coincidence is negligible.

All the curves in Fig. 4.10 integrate to the same value, determined by the percentage of the vectors that were correct over the whole image. As an ideal certainty measure we can imagine an indicator that tells us whether the motion vector was correct or not. The certainty measure would then only have to be bi-level. In a diagram of the type in Fig. 4.10 the curve would rise from 0% to 100% in one step. This is too much to demand of a real certainty measure, but a reasonable demand is that the fraction of the motion vectors that are correct is an increasing function of their certainty value. That this is not always the case is evident from Fig. 4.10. Desirable properties are that the curve originates at lowest certainty and ends at maximum certainty and that the transition is rather step-like.
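Curves of this type can be sketched by counting, for each certainty level, how many vectors fall on that level and how many of them are correct. The sketch assumes integer certainty levels (for instance after histogram equalization to 256 levels) and a boolean map of correct vectors; the names are hypothetical.

```python
import numpy as np

def correct_percentage_per_level(certainty_level, is_correct, levels=256):
    """Percentage of correct motion vectors for each certainty level.

    Levels that contain no vectors yield NaN.
    """
    level = np.asarray(certainty_level).ravel()
    correct = np.asarray(is_correct).ravel().astype(float)
    counts = np.bincount(level, minlength=levels)
    hits = np.bincount(level, weights=correct, minlength=levels)
    with np.errstate(invalid="ignore", divide="ignore"):
        return 100.0 * hits / counts
```

With a 512 × 512 field of vectors equalized to 256 levels, each level would hold 512 · 512 / 256 = 1024 vectors, which matches the count quoted above, although the text does not state that the levels were formed in exactly this way.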

That Best Match Value gives an almost inverted result from what is desired is evident already from visual inspection of the certainty images (Fig. 4.9). Distance to Second Best, on the other hand, behaves rather well, whereas accumulating differences over more matches gives very high certainties to many erroneous motion vectors. Probability of Correct Match computed using the Union Bound gives maximum certainty to all the vectors having certainty above the level 100; however, the probability of a correct motion vector is not 100% as desired, but rather around 90%. The certainty measure Curvature over Value does not model the image content as being white, and is therefore not well suited for this particular test image.

Natural Test Image

The natural image baboon, see Fig. 4.11, was used as test image. Strong Gaussian noise (standard deviation of 20, corresponding to an SNR of 16.7 dB) was added to both frames.

The resulting absolute values of the error vectors can be seen in Fig. 4.12, the histogram equalized certainty measure images in Fig. 4.13, and the corresponding statistical evaluation is shown in Fig. 4.14.

Evaluation using Natural Test Image

• Best Match Value
Also for a natural test image it is true that the better the match, the more unreliable the motion vector. In fact, when noise is the source of errors in the motion estimation process, the badness of the match would be a better certainty measure.

• Distance to Second Best
Again, this certainty measure gives good results.

• Probability of Correct Match, Union Bound Approx.
The certainty measure performs rather well, but unfortunately the probability of correctness for the motion vectors given maximum certainty does not reach 100%, only about 83%.

• 2D-Curvature over Match Value
This certainty measure performs better for natural images than for test images consisting of spatially white noise. However, the 2D-Curvature over Match Value is not very selective.

• Accumulated Differences
The number of matches over which differences are accumulated affects the performance differently at the low and at the high end of the certainty scale. Near zero certainty, accumulation over many matches gives more reliable results. For high certainty values, on the other hand, accumulating over only a few matches gives better performance. Adaptivity could be introduced to handle this tradeoff.

Conclusions based on the Simulation Results

A conclusion from the simulations is that the fit of the Match Criterion for the chosen motion vector (Best Match Value) is a bad indicator of the certainty of the motion vector estimate. This is true at least as long as the motion model is valid, which it is in these simulations where the only motion is pure translation. However, over large areas of the image, the motion model is supposed to be valid, otherwise it should be replaced, and in these large areas the certainty measure ought to be trustworthy. The Best Match Value is implicitly involved in all the other evaluated certainty measures. Improvements could perhaps be made if the influence of the value of the Match Criterion at the chosen motion vector was diminished, e.g. by replacing the certainty measure Curvature over Value with only Curvature.

Another approach that could be fruitful is to test accumulated differences over more than one but fewer than 6 second best matches.

[Six panels: Best match value; Distance to second best; Probability of correct match; 2D-curvature over match value; Acc. diff’s, 6 best matches; Acc. diff’s, 24 positions.]

Figure 4.5 Local certainty of the motion vectors. All certainty images were normalized so that white corresponds to maximum certainty.

Figure 4.6 Test image used for certainty estimates simulations. Top left square contains white Gaussian noise with a standard deviation of 20; bottom left square contains white Gaussian noise with a standard deviation of 40.

Figure 4.7 Top left: horizontal estimated displacement; Top right: Vertical estimated displacement. Gray indicates no error.


[Surviving panel titles: Probability of Correct Match; 2D-Curvature over Match Value.]

Figure 4.8 Local certainty of the motion vectors. All certainty images were normalized so that white corresponds to maximum certainty.




Figure 4.9 Local certainty of the motion vectors. All certainty images were histogram equalized.

[Six plots. Horizontal axis: certainty level (0 to 250); vertical axis: percentage of vectors that were correct (0 to 100).]

Figure 4.10 Percentage of correct vectors as a function of certainty. Input image was the image from Fig. 4.6, a collage of the circular test image and noise with different standard deviations.

Figure 4.11 The “baboon” test image without (left) and with (right) added noise.

Figure 4.12 Motion vectors obtained using a hierarchical block matching algorithm. In the two top pictures gray indicates correct match (zero displacement). Top left: horizontal estimated displacement; Top right: Vertical estimated displacement; Bottom: Length of error vector (zero represented by black).




Figure 4.13 Local certainty of the motion vectors. All certainty images were histogram equalized.

[Six plots. Horizontal axis: certainty level (0 to 250); vertical axis: percentage of the vectors that were correct (0 to 100).]

Figure 4.14 Percentage of correct vectors as a function of certainty. The input image was the natural image shown in Fig. 4.11.


4.5 Applications of Certainty Measures

Local certainty measures of motion vectors have been used for:

• Computation of the motion vector field in a hierarchical estimator [3], and in the Highest Confidence First algorithm [121]. See also the scheme suggested in Fig. 4.15 on p. 119.

• Image processing using the motion vector field.

As an example, the confidence of stereo matching (which is strongly related to motion estimation) is used when building a 3D-model of the scene [40].

• Image coding: Prefiltering.

Using the certainty to be able to perform more efficient pre-filtering was first suggested in [68], and presented in [84]. It significantly increases the efficiency of the pre-filtering, since heavy temporal filtering can be applied in the areas where the certainty is high, while more careful filtering is applied in areas where the certainty is low. Certainty of motion vectors in prefiltering is described in more detail in Section 4.5.1.

• Image coding: Coding/Decoding.

More about certainty measures in image coding/decoding in general can be found in Section 4.5.2. In Section 5.1, the scheme described in [85] is presented in detail.

• Image coding: Postfiltering.

In [122], motion vector certainty is used to segment the image sequence into moving objects, static background, covered and uncovered background. This information is then used to perform frame interpolation yielding improved motion rendition. In [68] certainty aided motion compensated postfiltering was suggested to remove artefacts introduced by the image coder. Further details are given in Section 4.5.3.

4.5.1 Motion Information in Prefiltering

Pre-filtering is beneficial for reduction of camera noise. Such noise can substantially lower the performance of the image coder without being very visible in the input sequence.

Pre-filtering without motion compensation cannot be made with low temporal cut-off frequencies without introducing blurring. Motion detection has been used in practical systems to allow effective filtering of stationary areas while blurring of moving areas is prevented.

Motion-compensated pre-filtering was introduced in [68], and allows for much lower cut-off frequencies than straight (non-motion compensated) pre-filtering without introducing blurring.

For straight temporal filtering, motion detection has been utilized to locally reduce the amount of filtering. In the same way, a certainty measure can be used when doing motion-compensated prefiltering, to indicate that the motion vector is not reliable at this spatial position, and therefore the bandwidth of the temporal filter has to be increased.

Motion Estimation and Prefiltering

In order to do motion compensated pre-filtering a motion vector field is obviously needed. It would however be beneficial to have the noise-reduced sequence available for estimating the motion vectors, since noise degrades the performance of the motion estimation algorithm. A system that alleviates the noise problem by using frames from the pre-filtered sequence as one of the two input frames to the motion estimation is introduced in Fig. 4.15.

Simulations on Motion Certainty Controlled Prefiltering

Prefiltering followed by Entropy Estimation

We have performed a set of simulations which are described below. This work was also presented in [84]. The simulations compare using certainty controlled motion compensated temporal prefiltering to using conventional, non-certainty controlled prefiltering. As test sequence, we used a 15-Hz version of the CCITT test image sequence “Salesman” (see Fig. 4.16). FIR-filters with equal tap weights were used as temporal filters. For the certainty controlled filter the length of the temporal filter was adjusted so that the certainty of the motion compensation between the pixel in the actual frame and the last frame in the filter was always above a certain threshold. One could expect that this would lead to different areas of the picture being delayed by different amounts of time, but such an effect could not be noticed. In these simulations we did not use non-causal filters since we considered the use in a communication system as essential.
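The certainty controlled equal-weight FIR filter can be sketched for a single pixel as follows. The sketch assumes that the motion compensated samples from previous frames and the certainty of the compensation back to each of those frames are already available; the function name, the argument layout and the `max_taps` limit are hypothetical.

```python
def certainty_controlled_average(samples, certainties, threshold, max_taps):
    """Causal equal-weight temporal filter with certainty controlled length.

    samples[0]     -- pixel value in the actual frame
    samples[k]     -- motion compensated pixel value k frames back
    certainties[k] -- certainty of the compensation between the actual
                      frame and the frame k frames back (index 0 unused)
    """
    taps = 1
    limit = min(max_taps, len(samples))
    # Extend the filter backwards only while the compensation stays reliable.
    while taps < limit and certainties[taps] >= threshold:
        taps += 1
    return sum(samples[:taps]) / taps
```

With a high threshold the filter degenerates to a single tap (no filtering), so unreliable motion vectors increase the temporal bandwidth instead of causing smearing.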


[Two block diagrams, each with blocks labelled CCF, D, MC and ME and signals labelled mv and cmv.]

Figure 4.15 Top: Motion information is usually extracted from the original sequence and thereafter used in the subsequent processing, in this case certainty controlled filtering (CCF). Bottom: The proposed scheme uses the filtered image sequence as one of the inputs in the motion estimation process. This improves the signal-to-noise ratio in the motion estimation step.

The results of the simulations are exemplified in Fig. 4.17, where smearing appears in the left image due to erroneous motion estimates. When certainty is applied the smearing is substantially alleviated, as can be seen in the right part of Fig. 4.17.

Prefiltering followed by Coding

To test the efficiency of certainty controlled motion compensated prefiltering in a real-world situation we coded some standard test sequences using the H.263 coding scheme [61], both with and without prefiltering. The H.263 coding scheme is a mature standard, and it has proven difficult to obtain further significant coding efficiency improvements. As we will see from the experiments, by performing certainty controlled motion compensated prefiltering, we can obtain clear improvements. The same type of improvement should be obtained also for other coding standards, such as MPEG-2 and MPEG-4, and also other coding schemes such as wavelet video coding.


Figure 4.16 Cut-out from the original frame from the test image sequence “Salesman”.

Some questions that arise are:

• Is the certainty control really necessary for pre-filtering, or is it enough with just motion compensation?

• Maybe the motion compensation is not necessary?

– Using zero motion, the motion vector certainty can be easily calculated and used to control the filtering.

– What is the difference in performance compared to using neither motion compensation nor motion vector certainty in the pre-filtering?

• Is it better to let the coder itself act as a prefilter?

We will start by addressing the last question. A DCT coder can perform spatial filtering. The DCT coder can in fact be viewed as a sub-band coder having bands of equal absolute bandwidth. The filters are however not well adapted to the HVS, since the block structure makes the filter kernels have very abrupt borders. Moreover, the prefiltering that is studied here operates along the temporal axis, and can therefore not be replaced by spatial prefiltering in the coder. We will present some experimental results in the paragraph starting on page 122, where using no explicit pre-filtering (i.e. letting the coder itself act as pre-filter) is compared against using certainty controlled motion compensated prefiltering. We will also in Fig. 4.23 on page 127 compare the visual results of the four different types of prefiltering that are given by using or not using motion compensation in combination with using or not using certainty control.

Figure 4.17 Cut-out of two images that have been temporally low-pass filtered. Their motion compensated prediction errors have the same entropy. Left: conventional motion compensation and fixed temporal filter; Right: The bandwidth of the filter is controlled by the certainty of the motion estimation.

Coder and Test Data

The H.263 reference implementation in predictive mode and without frame skipping was used. The quantization step height, q, was fixed, and the resulting bitrate was measured.

Input data consisted of standard test sequences in CIF resolution. These contain natural camera noise, but at a level so low that it cannot be perceived. In one of the experiments we also added white Gaussian noise with a standard deviation of 5 (giving a PSNR of 34 dB), which gives clearly visible noise.

Filtering

The scheme described in Fig. 4.15 was utilized. The filter, explained by Fig. 4.18, consists of a temporal motion compensated first order auto-regressive filter where the µ parameter is varied depending on the value of the certainty measure of the motion vector. A similar filter was used in [110], but as input to the filter parameter control the difference in pixel values was used, instead of, as in our scheme, a certainty measure calculated over a local neighbourhood. The certainty measure used in the experiments was 1/MatchValue, and the Match Value was the motion compensated absolute difference summed over a Gaussian window centered around the pixel. The µ value was varied with certainty according to Fig. 4.19. There is one free parameter, T, in the prefiltering scheme. To avoid manual fine-tuning for finding the optimal T value, we decided to use the following automatic but sub-optimal threshold setting algorithm: For each frame, a subset of certainty values evenly sampled from the image area was analysed, and the T value was set as a function of these. We used the same function to set T for all the sequences, which gives rather conservative filtering.

[Block diagram with signals $I_i(x)$, $I_{i-1}(x + \Delta x)$ and $C_i(x)$, weights 1 − µ and µ, and a summation node.]

Figure 4.18 The temporal filter used for pre-filtering.
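One step of a motion compensated first order recursive filter of this kind can be sketched as below. The mapping from certainty to µ is an assumption: a linear ramp from 0 at T/4 to µ_max at T is used here as a hypothetical reading of Fig. 4.19, and the default value of `mu_max` is likewise only illustrative.

```python
import numpy as np

def mu_from_certainty(certainty, t, mu_max):
    """Hypothetical ramp: 0 below t/4, rising linearly to mu_max at t."""
    ramp = (np.asarray(certainty, dtype=float) - t / 4.0) / (t - t / 4.0)
    return mu_max * np.clip(ramp, 0.0, 1.0)

def prefilter_frame(current, previous_filtered_mc, certainty, t, mu_max=0.8):
    """One step of the certainty controlled recursive temporal filter.

    current              -- actual input frame
    previous_filtered_mc -- previous filtered frame, motion compensated
                            towards the actual frame
    certainty            -- certainty of the motion vector per pixel
    """
    mu = mu_from_certainty(certainty, t, mu_max)
    return (1.0 - mu) * current + mu * previous_filtered_mc
```

Where the certainty is low, µ is zero and the input pixel passes through unfiltered; where it is high, the output leans on the motion compensated history, which gives the strong noise reduction.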

Experiments and Results

The amount of filtering was set so that no artefacts were introduced in this step. Where only natural camera noise is present, the filtered and the original sequences are indistinguishable. For the sequence with added noise, the filtered sequence looks less noisy, but has no smearing or other artefacts.

Figure 4.19 A high level of certainty allows for a high value of the µ parameter. (Plot of µ as a function of the certainty $C_i(x)$, with µmax marked on the vertical axis and the threshold T on the horizontal axis.)

The improvement when coding the original test sequence “hall” was substantial. The bitrate decreased from 9191 kbit/s to 4339 kbit/s (i.e. by 53%) for the first 50 intra-coded frames. The bitrate was lowered by 44% for the test sequence “mother and daughter” at the finest quantization level. For larger quantization steps, the improvement was not as high, see Fig. 4.20. One reason for the higher gain of pre-filtering for fine quantization is that the camera noise causes many transform components to be quantized to non-zero values when the quantization step height is low, whereas for higher quantization step heights the camera noise does not make a significant number of transform components non-zero. Another reason is that part of the bit stream consists of signaling information, while the rest represents actual transform component values. The part of the bit stream that represents transform values can be diminished by lowering the noise level in the sequence, while the part of the bit stream that represents signaling information is relatively unaffected. Since a large part of the bit stream relates to signaling information for a coarsely quantized sequence, a decrease in bits representing transform coefficients has less impact on the size of the total bit stream than for a finely quantized sequence.
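The second argument can be made concrete with a simple illustrative model. The model is an assumption introduced here for clarity, not a result of the measurements: let $R_s$ be the rate spent on signaling information and $R_c$ the rate spent on transform coefficients, and suppose that prefiltering removes a fraction $g$ of $R_c$ while leaving $R_s$ unchanged. The relative saving in total rate is then

```latex
\frac{\Delta R}{R} = \frac{g\,R_c}{R_s + R_c}
```

For a fixed $g$ the saving shrinks as the share of signaling bits $R_s$ grows, which is the situation for coarse quantization.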

The standard test sequences that we used were recorded under satisfactory lighting conditions, so the amount of camera noise is controlled. To test the performance of the proposed prefiltering under heavy noise, we added synthetic white Gaussian noise to the “hall” test sequence, and applied filtering using the same parameter setting that had been tried out for the scenes containing only camera noise. The standard deviation of the noise was set to 5, which gave clearly visible noise, both in the video sequence and in a frozen image, see Fig. 4.21. There were no visible artefacts, and the improvement in bitrate when adding certainty controlled pre-filtering is shown in Fig. 4.22.

Figure 4.20 Bitrates (R, in Mbit/s) as a function of quantization step height (q) resulting from H.263-coding the “mother and daughter” test sequence with and without the proposed certainty controlled motion compensated prefiltering scheme. The gain of using prefiltering is 44%, 27%, and 19% for q = 1, 2, and 3, respectively.

To answer the questions regarding the necessity of motion compensation and motion vector certainty in prefiltering, we used the “hall” sequence with four different prefiltering schemes and used the prefiltered sequences as input to the H.263 coder. The results are compared visually, see Fig. 4.23.

Straightforward temporal filtering with a µ value that was adjusted to yield approximately the same bitrate as the scheme using motion compensation and motion certainty gives heavy smearing of moving objects. Motion compensation without motion vector certainty gives artefacts. The biggest problems arise in the vicinity of borders between objects with different motion. Using no motion compensation, but motion vector certainty, gives very good results for this test sequence, which has large non-moving areas. For this type of sequence, filtering without motion compensation but with motion vector certainty is the best alternative. For a sequence with camera motion, adding motion compensation should improve the performance of prefiltering.

Figure 4.21 A cut-out from frame 51 of the sequence “hall” with synthetically added white Gaussian noise with a standard deviation of 5.

4.5.2 Motion Information in Coding/Decoding

Motion information used for motion compensation has proved to be very efficient in image coding, in the full bit-rate and quality spectrum from videophone coding to HDTV coding. How the motion information is used practically was discussed in Section 1.2.3.


Figure 4.22 Rate (R, in Mbit/s) as a function of quantization step height (q) for the “hall” sequence with artificially added noise, frames 1-50 inter-coded (P frames). The upper curve is obtained with no prefiltering and the lower curve with certainty controlled motion compensated temporal prefiltering. The bitrate is decreased by between 41% (for q = 1) and 16% (for q = 16).


Figure 4.23 A comparison of prefiltering schemes. The four images are cut-outs from frame 37 of the “hall” sequence, all coded using identical coder settings. Upper left: straightforward temporal prefiltering (4341 kbit/s). Upper right: motion compensated temporal prefiltering (4333 kbit/s). Lower left: prefiltering without motion compensation but with certainty control (3571 kbit/s). Lower right: motion compensated, certainty controlled prefiltering (4339 kbit/s).


We now want to go one step further and examine the usability of the certainty of the motion vectors.

Motion Certainty Aided Image Coding

When the motion certainty is low, less consideration should be given to old image information. One way of achieving this would be to lower the prediction coefficient in the prediction step of a hybrid coder. This was mentioned in Section 3.2 under the notion of loop-filtering. Letting the motion certainty influence the entropy coding is another interesting possibility. When motion certainty is used to control the prediction coefficient and/or entropy coding, the scheme described in Section 5.1, which uses backward prediction assuming constant motion over frequencies, is very suitable, since a dense motion vector field with certainty estimates can be acquired at both the coder and the decoder side without adding bits to the data stream.

In model based coding the local certainty of the 2D-motion vector field can be useful in the process of acquiring robust 3D-motion estimates. This approach has been used in [119].

Another approach that can lead to a substantial increase in coding efficiency is to let the motion certainty control the time necessary between the updates of the image information. This can be compared to a scheme which classifies the image into stationary background and moving foreground, where the background is coded once and for all and only the foreground is continuously updated. The motion certainty controlled system would consider all areas that can be satisfactorily described by motion compensation as a sort of background that is coded with image information and motion parameters once and for all (or rather until they disappear out of the picture or are covered by new image information). This is a more complex way of doing image coding than the conventional hybrid scheme. Conventional hybrid schemes achieve almost the same effect by the use of “not coded” blocks, i.e. blocks that are reconstructed with only the motion compensated prediction and no update image information.

Simulations on Motion Certainty Aided Image Coding/Decoding

Simulations using scalable video coding with frequency backward motion estimation and certainty measures are presented in Chapter 5. They show that adding certainty can reduce the bitrate substantially with no difference in image quality. This is achieved by letting the motion vector certainty act as background information to the entropy coding/decoding.

4.5.3 Motion Information in Postfiltering

In [68], Kronander suggested using motion vector certainty for removal of coding noise. If there are motion artefacts in the decoded sequence it is, however, very difficult to decrease them by filtering using motion vectors derived from the decoded sequence itself. If a more accurate motion description is sent as side information, this is of course useful for post-processing.

Frame interpolation is another type of postfiltering that is generally needed whenever the frame rate has to be converted. Two typical situations are conversion between different television standards, such as PAL and NTSC, and display of video coded at a low frame rate. In both these situations, motion compensated frame interpolation produces much better results than the conventional frame skip/frame repetition scheme.

In [122], the frame rate of TV-sequences was enhanced by insertion of motion compensated interpolated frames. To avoid problems around the edges of moving objects, the frames were segmented into moving objects, static background, covered, and uncovered background, using “Best match value” as certainty measure. A significant improvement in visual quality was reported.

5 Scalable Video Coding

When sending a video stream over a “best-effort” network, such as the Internet, there is no guarantee that all of the bitstream will reach the receiver in time for decoding. It is further likely that different receivers have different network bandwidth, display resolution, and computational power. Different attention levels of different users may also make it unnecessary to transmit high resolution images to all. It may e.g. be possible to detect that the user is not actively looking at the part of the screen where the decoded sequence is displayed, in which case a low resolution image is enough, and the bandwidth and computational power can be used for other purposes.

These are some reasons behind the development of scalable coding, i.e. splitting the bit stream into different layers having different relevance. In scalable coding there exists a base layer, without which decoding of the sequence becomes impossible. In addition to the base layer, receivers can receive and decode one or several enhancement layers. The base layer can be given high priority in a quality of service (QoS) network, or protected against packet losses through redundancy introduction, spreading the information over several packets. The enhancement layers should be given decreasing priority, since decoding of a specific enhancement layer typically requires the decoding of all earlier transmitted layers.

There are different types of scalability: time scalability, SNR scalability, resolution scalability, computational scalability and object scalability.


Figure 5.1 An example of frame organization for time scalability. (Diagram: the frames of the base layer, enhancement layer 1 and enhancement layer 2 along a time axis, together with the frames decoded for each number of received layers.)

Time scalability means that the temporal resolution varies with the number of layers being received and decoded. It can be obtained in any hybrid coding system by introducing enhancement layer(s) of frames that are not used for prediction of lower layers, see Fig. 5.1. An example of time scalability is hybrid coding with I, P, and B frames, such as in MPEG-2 [60] or H.263 [61]. The I frames can be decoded without any information from the P or B frames, decoding the P frames requires decoding of the I frames, and decoding the B frames requires decoding of both I and P frames.
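The dependency between frame types and layers can be illustrated with a small sketch. The assignment of I, P and B frames to the base layer and two enhancement layers is an assumption chosen to match the dependencies just described; the function name and the frame pattern are hypothetical.

```python
# Assumed layer assignment: I frames in the base layer (0), P frames in
# enhancement layer 1, B frames in enhancement layer 2.
LAYER_OF_FRAME_TYPE = {"I": 0, "P": 1, "B": 2}

def decodable_frames(frame_types, layers_received):
    """Indices of the frames a receiver can decode when it has received the
    first `layers_received` layers (1 means the base layer only)."""
    return [index for index, frame_type in enumerate(frame_types)
            if LAYER_OF_FRAME_TYPE[frame_type] < layers_received]

sequence = "IBBPBBPBBI"  # hypothetical frame pattern in display order
for layers in (1, 2, 3):
    print(layers, decodable_frames(sequence, layers))
```

Each additional layer adds frames between those already decodable, so the temporal resolution increases with the number of layers received.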

In Section 5.2 we will suggest a scheme using time scalability together with backward motion estimation.

SNR scalability (quality scalability) means that each layer should minimize the distortion, measured at full image resolution. It can therefore be seen as the answer to the question of coding for a fixed screen size but varying channel bandwidth. SNR scalability in DCT video coding is implemented in MPEG-2 [60] and H.263 [61].

An embedded bitstream is a scalable bitstream with fine layer granularity. A coding method produces an embedded bitstream if encoding to B bits and then discarding the last b bits when decoding always gives the same result as encoding to B − b bits and decoding the whole bitstream [98]. For still images, the bitstream can be embedded, as it is in zerotree wavelet coders [113]. In this case, the embeddedness comes without any performance penalty. An embedded video bit stream is produced by the 3D SPIHT coder [103]. The concept of embeddedness is however difficult to combine with real-time transmission, since many frames (for total embeddedness the whole video sequence) have to be coded simultaneously. When coding video sequences using the hybrid scheme, embeddedness is harder to achieve, since the hybrid scheme is highly dependent on the fact that the encoder and the decoder use the same image as prediction. If the decoder receives a coded prediction error represented by B bits, the codec could use the convention that the prediction image to be used should have B bits of transmitted prediction error, or perhaps αB bits, 0 < α < 1, to enable a continuous increase in decoded quality. However, this would mean that the encoder would have to use different predictions for every bitrate. Even if this problem could be solved in the encoder, embeddedness in the hybrid coding scheme would probably give lower coding performance, since the first part of the bit sequence would have to be based on a coarse prediction. In a non-scalable scheme, by comparison, the whole bit stream for the prediction error can be generated using all the bits of the prediction error in the previous frame.
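The defining property of an embedded bitstream can be checked on a toy coder. The sketch below uses plain bit-plane coding of non-negative integers, most significant plane first; this is a deliberately simple stand-in chosen for illustration and is not the zerotree or SPIHT coder cited above. All names are hypothetical.

```python
def encode_bitplanes(values, n_planes, budget_bits):
    """Emit bits plane by plane, most significant plane first, and stop as
    soon as `budget_bits` bits have been written."""
    bits = []
    for plane in range(n_planes - 1, -1, -1):
        for value in values:
            if len(bits) == budget_bits:
                return bits
            bits.append((value >> plane) & 1)
    return bits

def decode_bitplanes(bits, n_values, n_planes):
    """Rebuild the values from however many bits were received; bits that
    never arrived are treated as zero."""
    values = [0] * n_values
    for position, bit in enumerate(bits):
        plane = n_planes - 1 - position // n_values
        values[position % n_values] |= bit << plane
    return values

values = [5, 3, 6, 0]      # toy "coefficients", three bit planes each
full = encode_bitplanes(values, 3, 12)
for discarded in range(13):
    kept = 12 - discarded
    truncated = decode_bitplanes(full[:kept], 4, 3)
    reencoded = decode_bitplanes(encode_bitplanes(values, 3, kept), 4, 3)
    assert truncated == reencoded   # the embedded property
print(decode_bitplanes(full[:4], 4, 3), decode_bitplanes(full, 4, 3))
```

The property holds here because a shorter encoding is always a prefix of a longer one. In a hybrid coder the prediction image depends on how many bits were used for the previous frame, which is what breaks this prefix structure.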

Resolution scalability (bandwidth scalability) means that each layer represents a frequency band. It solves the problem of different users having different screen resolution. The DCT hybrid coding standards MPEG-2 [60] and H.263 [61] both support resolution scalability.

Wavelets have a strong connection with resolution scalability. In Section 5.1, a scalable wavelet video coding scheme will be presented in detail.

Computational scalability (decoder complexity scalability) offers a solution to the problem of different terminals having different decoding capabilities. The available computational capacity may vary over time due to e.g. other processes competing for computation resources, and the encoding has to be done so that the bit stream is decodable for every available computational power. Computational scalability often, but not necessarily, comes as a benefit of other scalability approaches. An example of computational scalability is object-oriented decoding, where the reconstruction of e.g. a talking face can be made by texture-mapping for normal reception, by simple shading for lower available decoding power, and as a wire-frame as fall-back solution.

Object scalability means that objects can be assigned different priorities. This enables discarding of objects having low priority if congestion occurs in the network, and discarding in the decoder for easier decoding (computational scalability). Object scalability is supported in some MPEG-4 tools [62].


5.1 A Wavelet Coder using FDBME

An image coding scheme using Frequency Domain Backward Motion Estimation (FDBME) and Motion Vector Certainty is suggested. Backward motion estimation means derivation of motion vectors from already transmitted image data at the transmitter and receiver sides, in contrast to forward motion estimation, where motion vectors are calculated at the transmitter side and sent to the receiver over the channel. In this wavelet image coding scheme, a low frequency image is first transmitted over the channel without the use of FDBME. Then the frequency bands are transmitted in successive order using motion data estimated from the already transmitted lower frequency bands. The coding of the prediction error is controlled by the certainty of the motion vector estimate. Simulation results are presented and the performance of the image codec is evaluated and compared against the H.263 DCT coding scheme. Introducing scalability leads to a negligible increase in the total bit rate compared to the non-scalable version. A low complexity scheme which does not use motion compensation is suggested. It has lower performance than the full scheme. However, the low complexity scheme can be useful for terminals having limited computational power. A scalable system is suggested where the low complexity algorithm is used for the lower layers, with its main application being portable terminals, and motion estimation/compensation is added to the scheme for higher layers, which are used by high-performance terminals such as stationary computers and TV/HDTV sets.

5.1.1 Background

Wavelet coding has been recognized as an efficient technique for coding of still images; state of the art wavelet coding for both still images and video is described in [124]. For low-delay video coding, wavelet coding has up to now not shown the same advantage in the rate-distortion sense compared to the standardized algorithms (H.263 [61], MPEG [60]). These use blockwise coding, based on the discrete cosine transform, of the residual information after motion compensated prediction. The lack of advantage is in our opinion due to the difficulties associated with the representation of motion information in the wavelet case.

The true motion field varies from pixel to pixel (i.e. it is dense), and the two components take real values. The motion field is continuous, except along the edges of objects.

In the motion compensated DCT coders, the motion field is approximated with a constant value within a (usually 8 × 8 pel) block, and is quantized (to e.g. half pel resolution). This approximation has been shown to yield rather efficient coding. Buschmann [18] has shown that the performance of the motion compensation can be further improved if an affine or bilinear motion model is used.

For the wavelet case, there is no natural “block” to attach motion vectors to. The most natural approach would be to estimate a dense motion vector field and then code the motion vector field allowing some distortion. The bits representing the motion vector field would then be sent to the decoder.

Another approach is Backward Motion Estimation (BME), which means estimation of the motion vector field at the sending and the receiving side, using already transmitted image data. This is the strategy that is adopted in the image codec described here.

One way to apply BME is to use the two last decoded frames $I_{t-2}$, $I_{t-1}$ to estimate the motion field for the transmission of $I_t$. This is called Time Domain Backward Motion Estimation. Another approach is Spatial Domain Backward Motion Estimation, where the already transmitted spatial parts of $I_t$ are used together with $I_{t-1}$, which is available in reconstructed form at all spatial locations, for estimation of the motion vectors used for coding of $I_t$.

Frequency Domain Backward Motion Estimation was suggested by Li et al. [19, 75], Armitano, Florencio and Schafer [5, 32], and Nosratinia and Orchard [99]. The assumption of motion vectors being the same for the different spatial frequencies is in our opinion more realistic than the assumptions that the motion is uniform in time or space. If an object lacks low frequency structures, it is of course impossible to estimate the motion from low frequency bands. This is however very uncommon for natural images, since their frequency content is roughly proportional to $1/f$, where $f$ denotes spatial frequency.

The rest of this chapter is organized as follows. Section 5.1.2 describes how to apply FDBME in an image coding scheme. Certainty of motion vectors can be used in the motion estimation process, and in pre- and post-filtering. Section 5.1.3 describes how the certainty of the motion vector estimates can be utilized in the coding/decoding process to improve the coding efficiency. Section 5.1.4 describes how the wavelet image codec using frequency backward motion estimation and motion vector certainty is constructed. In Section 5.1.5, simulation results from the proposed scheme are presented and compared with a wavelet scheme using intra-frame coding, a wavelet scheme using inter-frame coding, the proposed scheme but without using the motion vector certainty, and the existing H.263 [61] standard using DCT coding in predictive mode. Finally, we give some conclusions and suggestions for further improvements.

5.1.2 The Frequency Domain Backward Motion Estimation (FDBME) scheme

In the proposed scheme the frames $\ldots, I_{i-1}, I_i, I_{i+1}, \ldots$ are sent in consecutive order and reconstructed on the decoder side, yielding the reconstructed frames $\ldots, \hat{I}_{i-1}, \hat{I}_i, \hat{I}_{i+1}, \ldots$

Each frame is subdivided according to Fig. 5.2 using the analysis highpass and lowpass filters from a wavelet filter bank. The top level lowpass image and the detail images are quantized, and on the receiver side the image is reconstructed using the synthesis filters of the filter bank. Fig. 5.2 shows two frames: $\hat{I}_{i-1}$, which is the previous reconstructed frame, and $I_i$, which is the frame to be encoded. The level of decomposition is denoted by a superscript number, where the superscript 0 denotes the resolution of the original frames. A frame $I^l$ comprises the frame of the next lower resolution, $I^{l+1}$, and the detail signal at level $l$, $D^l$. The detail signal $D^l$ comprises the signals $D^{lHL}$, $D^{lLH}$, and $D^{lHH}$, which result from the analysis highpass and lowpass filters in the horizontal and vertical directions.

Transmission of a frame starts by sending the frame at the lowest resolution, $I_i^{l_{max}}$. The next higher resolution, $I_i^{l_{max}-1}$, is reconstructed using $I_i^{l_{max}}$ (already available at the decoder side) and $D_i^{l_{max}-1}$, which has to be coded and sent over the channel.

In the process of encoding $D_i^l$, all information already available to the decoder can be used for prediction of $D_i^l$. The information available at the decoder side consists of all reconstructed frames with time index less than $i$, i.e. $\hat{I}_j^k$, $k < l_{max}$, $j < i$, the reconstructed lower resolution frames of $I_i$, i.e. $\hat{I}_i^k$, $k > l$, and the spatial parts of $D_i^l$ that are already transmitted.

In the FDBME scheme we use $\hat{I}_{i-1}^k$, $k < l_{max}$, and $\hat{I}_i^k$, $k > l$, for extracting motion vectors in order to make the prediction of $D_i^l$ accurate.
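The level structure, in which a frame at level l splits into the frame at level l + 1 plus three detail bands, can be sketched with the simplest possible filter pair. The code below uses unnormalized Haar filters (2 × 2 averages and differences) purely for illustration; it is not the filter bank used in the codec. The function names are hypothetical, and which of the letters in HL and LH refers to the horizontal direction is an assumption.

```python
import numpy as np

def haar_analysis(frame):
    """Split I^l into I^(l+1) and the detail signal D^l = (HL, LH, HH)."""
    a, b = frame[0::2, 0::2], frame[0::2, 1::2]
    c, d = frame[1::2, 0::2], frame[1::2, 1::2]
    low = (a + b + c + d) / 4.0
    hl = (a - b + c - d) / 4.0   # assumed: highpass horizontally
    lh = (a + b - c - d) / 4.0   # assumed: highpass vertically
    hh = (a - b - c + d) / 4.0   # highpass in both directions
    return low, (hl, lh, hh)

def haar_synthesis(low, detail):
    """Rebuild I^l from I^(l+1) and D^l (exact inverse of haar_analysis)."""
    hl, lh, hh = detail
    out = np.empty((2 * low.shape[0], 2 * low.shape[1]))
    out[0::2, 0::2] = low + hl + lh + hh
    out[0::2, 1::2] = low - hl + lh - hh
    out[1::2, 0::2] = low + hl - lh - hh
    out[1::2, 1::2] = low - hl - lh + hh
    return out

def decompose(frame, l_max):
    """Return the top level lowpass image and the list [D^0, ..., D^(l_max-1)]."""
    low, details = frame.astype(float), []
    for _ in range(l_max):
        low, detail = haar_analysis(low)
        details.append(detail)
    return low, details

def reconstruct(low, details):
    """Add the detail signals back in transmission order, coarsest first."""
    for detail in reversed(details):
        low = haar_synthesis(low, detail)
    return low

frame = np.arange(64, dtype=float).reshape(8, 8)
top, details = decompose(frame, 2)
print(top.shape, np.allclose(reconstruct(top, details), frame))
```

The loop in `reconstruct` mirrors the transmission order described above: the lowest resolution image first, then one detail signal per level until full resolution is reached.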


Figure 5.2 Relationship between frames at different levels of decomposition (superscripted, level 0 denotes original frame resolution), and detail images D. Subscripts denote time index. Superscripts relate to frequency content. (Diagram: for each of the time indices $i-1$ and $i$, the analysis filters split $I^0$ into $I^1$, $D^{0HL}$, $D^{0LH}$ and $D^{0HH}$, and $I^1$ into $I^2$, $D^{1HL}$, $D^{1LH}$ and $D^{1HH}$; the synthesis filters invert each step.)

Image Displacement Subband Evaluation

When coding the detail information at level $l$, motion compensation should be performed in a higher resolution for best performance. In the proposed scheme, we perform all motion compensation in the resolution of the original frames, but tradeoffs can be made to decrease the computational load. We compute a motion compensated prediction of $I_i^0$ as $f(\hat{I}_{i-1}^0, \mathbf{V}_i^{0,l+1})$, where $\mathbf{V}_i^{0,l+1}$ denotes the motion vector field. A motion vector comprised of a horizontal and a vertical displacement component is needed for every pixel in the original image resolution. Available for the motion estimation process are the previous reconstructed frame in full resolution, $\hat{I}_{i-1}^0$, and the reconstructed frame in lower resolution, $\hat{I}_i^{l+1}$. Since two values per pixel have to be estimated and since the full frame resolution is higher than that of $\hat{I}_i^{l+1}$, the motion estimation problem is under-constrained. We have developed the image displacement subband evaluation (IDSE) motion estimation approach for this application.

The principle of IDSE is to make displacements in the image domain and evaluate the effects of the displacements in the transform domain. Ideally, a search should be made among all physically sound motion vector fields to find the most probable motion vector field, although the search has to be simplified to be computationally feasible.

Below, a scheme is described that, although it is not perfectly adapted to the physical constraints of the scene, provides a means of obtaining a motion vector field with good prediction properties. This method takes one motion vector at a time from the search window; the image content from frame $\hat{I}_{i-1}^0$ is displaced according to this motion vector, the motion compensated frame is wavelet transformed, and the result is compared to $\hat{I}_i^{l+1}$. The comparison is done by evaluating a match criterion around every pixel in $\hat{I}_i^{l+1}$. The best value of the match criterion with its associated motion vector is stored for every pixel in $\hat{I}_i^{l+1}$. In this way a motion vector field in the resolution of level $l+1$ is created; to get the full resolution motion vector field needed for motion compensation, interpolation has to be used.
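The search just described can be sketched as follows. Several simplifications are assumptions made to keep the example short and are not taken from the implementation: repeated 2 × 2 averaging stands in for the lowpass branch of the wavelet transform, a square window replaces the Gaussian window, displacements wrap around at the frame border (`np.roll`), and nearest-neighbour repetition is used as the interpolation. All function names are hypothetical.

```python
import numpy as np

def haar_lowpass(img, levels):
    """Stand-in for the lowpass band at the given level: repeated 2x2 averaging."""
    out = img.astype(float)
    for _ in range(levels):
        out = 0.25 * (out[0::2, 0::2] + out[1::2, 0::2]
                      + out[0::2, 1::2] + out[1::2, 1::2])
    return out

def window_sum(img, radius):
    """Sum over a (2*radius+1)^2 neighbourhood of every pixel (edge padded)."""
    padded = np.pad(img, radius, mode="edge")
    out = np.zeros_like(img)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def idse_search(prev_full, cur_low, levels, search=2, radius=1):
    """For every pixel of the low resolution frame, keep the displacement
    (dy, dx) of the full resolution previous frame giving the best match."""
    best = np.full(cur_low.shape, np.inf)
    vectors = np.zeros(cur_low.shape + (2,), dtype=int)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            displaced = np.roll(prev_full, (dy, dx), axis=(0, 1))  # image domain
            candidate = haar_lowpass(displaced, levels)            # transform domain
            match = window_sum(np.abs(candidate - cur_low), radius)
            better = match < best
            best[better] = match[better]
            vectors[better] = (dy, dx)
    return vectors, best

def upsample_vectors(vectors, levels):
    """Nearest-neighbour interpolation of the vector field to full resolution."""
    factor = 2 ** levels
    return np.repeat(np.repeat(vectors, factor, axis=0), factor, axis=1)

rng = np.random.default_rng(0)
previous = rng.random((32, 32))
current_low = haar_lowpass(np.roll(previous, (2, -2), axis=(0, 1)), 1)
field, match_value = idse_search(previous, current_low, levels=1)
print(field[0, 0], match_value.max())
```

The stored best match value is kept alongside the vectors because it is the quantity later used as the certainty measure.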

5.1.3 Motion Vector Certainty (MVC)

In forward motion estimation, where the motion vector is sent over the channel, and the block pointed out by the motion vector in the previous frame is used for prediction, it is not essential that the motion vector corresponds to true motion. In backward motion estimation, the situation is different. The motion vector to be used is predicted from motion vectors estimated from old data, and the prediction will in many cases fail if the motion is not physically true. From the motion vector estimation process it is possible to obtain not only a motion vector, but also a certainty measure for that vector. The certainty measure can be either a scalar value related to the probability of the motion vector being correct, or the certainty measure can indicate the direction of highest certainty and give a certainty value for this direction and a certainty value for the perpendicular direction. Due to the aperture problem, these two values can differ substantially. It is also possible to let the motion estimation process output a set of vectors and values that relate to the certainty that the vectors are the true motion vectors. The perpendicular directions case can be seen as a special case of this situation where the error of the motion vector is modeled by a 2-D Gaussian distribution.

The certainty of the motion estimation can be used in the image codec for pre- and post-filtering, and also in the coding/decoding process. There are at least two ways of using the certainty in the coding/decoding process: in the prediction process and in the entropy coding stage.

5.1.4 Implementation

A wavelet video coder using frequency domain backward motion estimation as described in Section 5.1.2 was implemented.

The filter bank consisted of the well known FIR filters of lengths 7 and 9 described in [4].

The simplified image displacement subband evaluation (IDSE) scheme was used. The resolution of the motion estimation was integer pixel, in full frame resolution. Coding performance could probably be improved by enhancing the resolution of the motion vectors, especially for the higher resolution subbands.

The wavelet transform coefficients were uniformly quantized using a step size adapted to the frequency sensitivity of the human visual system. A “dead-zone”, i.e. a bigger quantization interval around zero, was not used.

Information about the certainty of motion vectors was used when performing entropy coding of the quantized wavelet transform coefficients. The certainty measure used was the value of the match criterion, in this case the sum of absolute differences, where the summation was done over a Gaussian window. A low value of the match criterion indicates high certainty. This match criterion also gives low values in areas of low image contrast, but in this application this is no disadvantage, since areas with low contrast tend to produce wavelet coefficients with low values.

Arithmetic coding was used as entropy coding. The wavelet coefficients were modeled as having a generalized Gaussian distribution [91]

$$h(u) = K e^{-(|u|/\alpha)^{\beta}}. \tag{5.1}$$

For each wavelet band, α and β were estimated using the estimation method suggested by Mallat [91]. These parameters were quantized and their values sent over the channel using two bits each. The magnitude of the wavelet coefficient was modeled as being linearly correlated to the match criterion, i.e. α was adjusted with a linear factor of the match criterion, but restricted to lie between a maximum and a minimum value, see Fig. 5.3. In that way the probabilities of different magnitudes in the arithmetic codec could be adjusted to yield higher efficiency. A way of further improving the coding performance would be to let the shape parameter β in the generalized Gaussian distribution also vary with motion vector certainty.

Figure 5.3 Schematic figure of how the α value in the generalized Gaussian distribution was adjusted by the match criterion (MC), under the constraints that α should be above a lower threshold and below an upper limit.
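How a certainty-dependent α changes the cost of coding a coefficient can be illustrated numerically. The sketch below is an assumption-laden illustration rather than the codec's model: the slope and the limits of α are hypothetical, the density of Eq. (5.1) is simply sampled at the integer quantizer levels and normalized, and the ideal code length −log2 p is used in place of an actual arithmetic coder.

```python
import numpy as np

def adapted_alpha(match_value, slope, alpha_min, alpha_max):
    """alpha grows linearly with the match criterion, clipped to [alpha_min, alpha_max]."""
    return float(np.clip(slope * match_value, alpha_min, alpha_max))

def coefficient_pmf(alpha, beta, max_abs=64):
    """Probabilities of the quantized values -max_abs..max_abs, obtained by
    sampling exp(-(|u|/alpha)^beta) at the integers and normalizing."""
    levels = np.arange(-max_abs, max_abs + 1)
    weights = np.exp(-(np.abs(levels) / alpha) ** beta)
    return levels, weights / weights.sum()

def ideal_bits(value, alpha, beta, max_abs=64):
    """Ideal code length in bits for one quantized coefficient."""
    _, pmf = coefficient_pmf(alpha, beta, max_abs)
    return float(-np.log2(pmf[value + max_abs]))

# Low match value (high certainty) versus high match value (low certainty).
for match_value in (1.0, 20.0):
    alpha = adapted_alpha(match_value, slope=0.5, alpha_min=1.0, alpha_max=8.0)
    print(match_value, alpha,
          round(ideal_bits(0, alpha, 1.0), 2), round(ideal_bits(10, alpha, 1.0), 2))
```

With a small α the model concentrates probability near zero, so the zero coefficients that dominate well predicted areas become cheap, while a large α keeps large coefficients affordable where the match is poor.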

In the first frame of the sequence, the entropy coding is done by a general source coder capable of exploiting correlations between transform components.

5.1.5 Results

The bit rates of the proposed algorithm are given in Fig. 5.4, curve D. The same figure also shows the bit rates for the same algorithm without using certainty of motion vectors (curve C), for inter-frame coding without motion compensation (curve B), and for intra-frame coding with a general source coder (curve A). The input sequence consisted of the “Claire” test sequence with frame rate 12 Hz. An example of the typical quality of an inter-coded frame using the proposed algorithm is shown in Fig. 5.5.

Figure 5.4 Simulation results showing the bit rate (bytes/frame plotted against frame number) produced while keeping the quantization constant for: A: the wavelet coder operating in intra-frame mode; B: the wavelet coder in inter-frame mode, without MC; C: the wavelet coder in FDBME mode, without MVC; D: the wavelet coder in FDBME mode, with MVC. The input consisted of the 256 × 256, 12 Hz “Claire” test sequence, and the mean PSNR over frames 1-5 produced by the algorithms was 34.4 dB for A, 32.6 dB for B, and 32.9 dB for C and D.

The image quality and bit rate were low in these simulations; the purpose of this was to test the stability of the algorithm against quantization noise. The scheme was demonstrated to perform well under these conditions, but the performance is anticipated to be better in high quality/high rate applications where quantization noise does not disturb the motion estimation process.
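The rate reported for the inter-coded frames of this sequence (approximately 200 bytes/frame, see Fig. 5.5) can be converted to bits per pixel and to a bit rate with a few lines of arithmetic. The helper name is hypothetical; the inputs are the 256 × 256 frame size and 12 Hz frame rate stated for the “Claire” sequence.

```python
def rate_figures(bytes_per_frame, width, height, frame_rate_hz):
    """Return (bits per pixel, kbit/s) for a given number of bytes per frame."""
    bits_per_frame = 8 * bytes_per_frame
    bits_per_pixel = bits_per_frame / (width * height)
    kbit_per_second = bits_per_frame * frame_rate_hz / 1000.0
    return bits_per_pixel, kbit_per_second

bpp, kbps = rate_figures(200, 256, 256, 12)
print(round(bpp, 3), round(kbps, 1))
```

This gives about 0.024 bits/pixel and about 19 kbit/s, in agreement with the figures quoted for Fig. 5.5.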

Figure 5.5 The reconstructed frame nr. 5 using Frequency Domain Backward Motion Estimation (FDBME). The PSNR value for this frame is 32.3 dB. A frequency weighting adapted to the visual system has been applied to enhance subjective quality; this lowers the PSNR value. The bit rate for inter-coded frames was approximately 200 bytes/frame, which corresponds to 0.024 bits/pixel or 19 kbit/s at 12 Hz frame rate.

Comparison with H.263 In order to make a fair comparison with H.263, which does not use different quantization steps for different transform coefficients, the HVS-adapted frequency weighting was abandoned also for the wavelet scheme. The test sequence used was “Miss America” in QCIF format (176 × 144 luminance image and 88 × 72 chrominance images). Every third frame from the test sequence was sent to the coders. The coding mode was predictive, except for the first frame. The H.263 coder did not use any of the H.263++ features (annexes). Both coders operated in fixed quantization, i.e. variable bitrate, mode; the resulting data rate is given in Fig. 5.6.

PSNR figures are slightly in favour of the H.263 scheme, while subjective image quality is slightly better for the wavelet scheme. It is, for example, difficult to see if eyes are open or closed in a DCT coded frame, while this is easy to determine from the same frame wavelet coded, see Fig. 5.7. Effective intra coding of the first frame has not been a goal in this work; therefore the cost of sending the first frame is fairly high. The existing intra coding could be replaced by an efficient wavelet coding method such as the SPIHT coder [111]. The number of bytes produced by the proposed coder is for most intra-coded frames rather close to the number produced by the H.263 coder. There are however parameters to tune and refinements to be made in the suggested scheme.


Figure 5.6 Comparison of bitrates produced by the proposed scheme and by the H.263 coding scheme. (Plot of bytes/frame versus frame, with one curve for H.263 and one for the proposed scheme.)

Introduction of Scalability Scalability, both in bit rate and computational load, can be gained in the frequency domain by not using the full resolution previous frame in the motion estimation process. A block diagram showing the coding and decoding of one step of the scalable algorithm is shown in Fig. 5.8. Scalability in the time domain is not as attractive, since increasing the time distance between coded frames increases the lengths of the displacements, and thus a larger search area would have to be used, which would increase the computational load. Our simulations show that the cost of adding spatial scalability is an increase in bit rate of about 2%, which can be compared to adding scalability to the H.263 scheme, which increases its bit rate by approximately 50%.

Low Complexity Algorithm

Figure 5.7 The sixth frame in the sequence coded with the H.263 coding scheme (top) and the proposed coding scheme (bottom).

Figure 5.8 Schematic diagram of the scalable algorithm. (Blocks in the diagram: Motion Information Extraction, Motion Compensation, Wavelet Transform, quantizers Q and Entropy Coding; the internal signals are labelled Motion Vectors and Motion Vector Certainties.)

The motion information used in this work consists of two parts, namely the motion vectors and their certainty values. In Fig. 5.9 we show how different use of the motion information results in four different algorithms. The motion vectors are costly to compute; therefore we propose a low complexity algorithm in which we assume that the motion vector field consists of all zero vectors. Computing the motion vector certainty in the case of zero motion is very fast: the motion vector certainty measure consisting of the “Best match value” is obtained by taking the absolute value of each component of $D_i^{l+1}$, followed by low-pass filtering. This certainty measure is not good at indicating whether the motion vector is the true one, but, as we will see, it is useful for predicting how large the error to encode is. The prediction step will consist of pure intra-coding. The motion vector certainty is used as background information in the coding/decoding process in the same way as when using motion compensation. Fig. 5.10 shows the bit rates produced by using the different combinations of motion information from Fig. 5.9. Adding the certainty information yields a higher saving in bitrate for motion compensated prediction (17.4% in the experiment in Fig. 5.10) than for non-motion-compensated prediction (11.1% in the same experiment). We believe that this is because the probability density function of the motion compensated prediction is more peaked.
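The “Best match value” certainty of the low complexity algorithm, an absolute value followed by low-pass filtering, can be sketched as follows. This is a minimal illustration only: the moving-average filter, its 3 × 3 default size, the border handling and the function name `best_match_certainty` are assumptions, since the text does not specify which low-pass filter is used.

```python
def best_match_certainty(detail, radius=1):
    """Absolute value of every coefficient, then a moving-average low-pass filter.

    `detail` is a 2-D list holding one component of the difference signal.
    The (2*radius+1) x (2*radius+1) averaging window is an assumed filter;
    near the borders only the samples inside the image are averaged.
    """
    rows, cols = len(detail), len(detail[0])
    magnitude = [[abs(v) for v in row] for row in detail]
    certainty = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            total, count = 0.0, 0
            for yy in range(max(0, y - radius), min(rows, y + radius + 1)):
                for xx in range(max(0, x - radius), min(cols, x + radius + 1)):
                    total += magnitude[yy][xx]
                    count += 1
            certainty[y][x] = total / count
    return certainty


# A single large coefficient spreads into a smooth local activity map.
impulse = [[0, 0, 0], [0, -9, 0], [0, 0, 0]]
activity = best_match_certainty(impulse)
```

Because the measure needs no search over candidate vectors, its cost is a fixed number of operations per coefficient, which is in line with the claim that it is very fast to compute.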

5.1.6 System Design

Especially for mobile terminals, there is a desire to keep the computational complexity at a minimum, because computation requires power, which is a limited resource. This leads to the suggestion of moving motion estimation and other computationally demanding image analysis from the terminals to e.g. the base stations [36, 105]. The advantage is increased operating time for the mobile terminals; a disadvantage could be that more information has to be transmitted over the channel. The added information (i.e. motion information) would, however, be transmitted from the base stations to the terminals, so the terminals would not have to spend more energy on transmission. Bandwidth is also a limited resource, but with the introduction of wide-band cellular networks such as UMTS, the tradeoff point may shift from preserving bandwidth to preserving power. Backward motion estimation schemes are easy to combine with motion estimation in the network, since they use only already transmitted information.

Figure 5.9 Motion vector certainty adds a new dimension to the use of motion information in video coding. The table shows the different coding strategies that result from using or not using motion vectors and motion vector certainties in inter-frame coding.

Motion vectors not used, certainties not used: straightforward temporal prediction.
Motion vectors used, certainties not used: motion compensated temporal prediction.
Motion vectors not used, certainties used: straightforward temporal prediction with certainty aided coding (low complexity algorithm).
Motion vectors used, certainties used: motion compensated temporal prediction with certainty aided coding (proposed algorithm).

Another approach that would be suitable for scalable video coding over a wide range of applications, from mobile phones to HDTV, is to use the low complexity scheme described in Section 5.1.5 for the low end terminals. The full scheme with motion estimation and motion compensation for the higher enhancement layers would only be used by high end terminals. These terminals would then have to tolerate a certain decrease in coding performance that is not motivated by shortcomings in their own computational capacities. The benefit would be easy exchange of video between low performance terminals and high performance terminals.

5.1.7 Summary of the FDBME Scheme

The frequency domain backward motion estimation scheme shows good performance; especially the scalable variant is very competitive. The computational load is high, but the flow of computations is straightforward and easy to parallelize or implement in hardware. Enhancing the scheme with motion vector certainty controlled arithmetic coding reduces the bit rate without adding any substantial complexity. Many improvements can be made for enhancing performance, e.g. improved motion vector fields, and improved statistical modeling of quantization values when information about motion vector certainty is available.

Figure 5.10 Data rates for non-motion compensated inter coding (solid), non-motion compensated with certainty (dash-dotted), motion compensated (dashed), and motion compensated with certainty (dotted). Test image was “Claire”, 12 Hz, 256 × 256 pixels. Mean SNR value for the 10 frames was 33.8 dB for non-motion-compensated, and 33.5 dB for motion-compensated coding.

A Backward Motion Estimation Scheme operating in time, space and frequency can be designed using all the information available at the decoder, either by doing different motion estimations in the three domains and combining them using e.g. Kalman filtering, or by doing a motion estimation taking input from all the available data at the same time.

It is also possible to combine backward and forward motion estimation, so that the backward motion estimation serves as a prediction for the motion vectors and the forward motion estimation produces a refinement of the motion vector field that is coded and sent over the channel.

5.2 Time Scalability and B-BME

In video coding schemes such as H.263 [61] and MPEG-2 [60], so-called bilinearly interpolated frames (B-frames) make use of motion-compensated prediction from both the previous frame and the following frame. The B-frames can be sent with very little update of the prediction, since the prediction is usually good when the possibility to look either backward or forward in time is given. In these systems, information has to be transmitted about which motion vector to use, and whether to use the previous frame, the following frame, or an average of these. Also, the frames have to be reordered before transmission, so that the frames that the B-frames are predicted from are available to the coder and the decoder. This introduces some delay, typically 2 or 3 frames, which corresponds to inserting 2 or 3 B-frames between each frame that is encoded using either intra coding (I frame) or predictive coding (P frame).
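The reordering of frames before transmission can be illustrated with a short Python sketch. It assumes a simple pattern in which every B-frame is predicted from the nearest preceding and following I- or P-frame; the function name and the encoding of frame types as characters are hypothetical and serve only as an illustration.

```python
def transmission_order(frame_types):
    """Return frame indices in transmission order for a display-order sequence.

    `frame_types` is a string such as "IBBPBBP" in display order. Each I- or
    P-frame is sent before the B-frames that precede it in display order, so
    that both reference frames are available when a B-frame is decoded.
    """
    order = []
    pending_b = []
    for index, frame_type in enumerate(frame_types):
        if frame_type == "B":
            pending_b.append(index)   # wait for the following reference frame
        else:
            order.append(index)       # reference frame goes first
            order.extend(pending_b)   # then the B-frames that depend on it
            pending_b = []
    order.extend(pending_b)           # trailing B-frames, if any
    return order


print(transmission_order("IBBPBBP"))  # [0, 3, 1, 2, 6, 4, 5]
```

For the display order I B B P B B P, the frames are sent as I P B B P B B. The two B-frames between each pair of reference frames are held back until the later reference frame has been sent, which is the source of the 2 to 3 frame delay mentioned in the text.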

Looking back at Fig. 5.1 on page 132, we make the observation that in a time scalable system, for all the enhancement layers it is possible to use the underlying layers for prediction looking both backward and forward in time if we introduce a time delay. When the frame rate is doubled for each temporal enhancement layer, the size of the time delay is one interframe interval in the time resolution of the enhancement layer to encode, see Fig. 5.11. These time delays at different temporal resolution layers add together, resulting in a geometric sum, which means that the total delay will not exceed one base layer interframe interval when the frame rate is doubled in each layer.
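Under the stated assumption that the frame rate doubles for each enhancement layer, the bound on the total delay can be written out explicitly. Writing $T$ for the base layer interframe interval and numbering the enhancement layers $k = 1, \dots, n$ (notation introduced here for illustration), layer $k$ has interframe interval $T/2^k$ and contributes a delay of that size:

$$\sum_{k=1}^{n} \frac{T}{2^{k}} = T\left(1 - 2^{-n}\right) < T.$$

With three enhancement layers, for example, the total delay is $T/2 + T/4 + T/8 = 7T/8$.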

Backward motion estimation is favourable to use in this type of system, because no motion information has to be transmitted. We will, in analogy with the terminology B-frames, denote by B-BME bilinear prediction from the previous and subsequent frames using backward motion estimation. For the decision of how to combine the two predictions in B-BME, a motion vector certainty measure will come in handy.
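One conceivable way of letting the certainty steer the combination is a per-pixel weighted average of the two predictions. The sketch below is an assumption made for illustration and is not the decision rule of the thesis: it presumes certainty values that are non-negative and larger for more reliable vectors, and all names are hypothetical.

```python
def combine_predictions(pred_prev, pred_next, cert_prev, cert_next):
    """Blend two motion compensated predictions, pixel by pixel.

    pred_prev / pred_next: predicted pixel values from the previous and the
    subsequent frame (flat lists of equal length).
    cert_prev / cert_next: assumed non-negative certainty weights, where a
    larger value means a more reliable motion vector.
    """
    combined = []
    for p, n, cp, cn in zip(pred_prev, pred_next, cert_prev, cert_next):
        weight_sum = cp + cn
        if weight_sum == 0:
            combined.append(0.5 * (p + n))   # no information: plain average
        else:
            combined.append((cp * p + cn * n) / weight_sum)
    return combined


blended = combine_predictions([10.0, 10.0], [20.0, 20.0], [1.0, 1.0], [1.0, 0.0])
```

With equal certainties the result is the plain average of the two predictions, and when one certainty is zero the other prediction is used alone. Since both predictions and both certainties are computed from already transmitted data, the decoder can repeat the same computation without any side information.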


Figure 5.11 In a time scalable system, prediction can be made from frames in the same and lower layers that have already been transmitted. By delaying the transmission of the enhancement layer until the subsequent frame in lower temporal resolution has been transmitted, prediction can be made both forward and backward in time, which is a significant advantage when dealing with e.g. uncovered background. All the frames represented by black can be used for coding the frame represented by gray. (The diagram shows temporal enhancement layer n and temporal layer n − 1 along a time axis.)

Chapter 6 Conclusions

Hierarchical algorithms are needed in computer vision and image coding for reasons of computational efficiency, and are also able to “take in the whole picture” and make globally sound decisions. Hierarchical algorithms obviously need hierarchical image representations. In Chapter 2, new two-dimensional subsampling structures which extend the traditional separable structure were presented.

Two-dimensional hexagonal subsampling shows better performance than separable subsampling, but there has been a lack of suitable subsampling structures for hierarchical processing. A class of hierarchical representations of hexagonally sampled data is further presented in Chapter 2. Since the regions formed by subsampling have a shape that is close to the circular shape, these structures are suitable for position-dependent resolution, which is made to match the resolution of the retina.

Another subsampling structure described in Chapter 2 takes advantage of the fact that the human visual system is more sensitive to horizontal and vertical lines than to lines with other orientations.

Motion estimation was the topic of Chapter 3, where it was shown that the tiling/subsampling structures introduced in Chapter 2 can make motion estimation more efficient.

Certainty measures that indicate if the estimated motion vectors are reliable or not were examined in Chapter 4. The certainty measures have several possible application areas; results showing the usability of certainty measures for pre-filtering of image sequences prior to coding were given in that chapter.

The demand for scalable video coding is increasing with the increased use of video transmission over heterogeneous networks. A scalable video coding scheme using wavelets and Frequency Domain Backward Motion Estimation was described in Chapter 5. The performance of the scheme was significantly improved by incorporating certainty measures. A low complexity algorithm that uses only motion vector certainty, but no motion estimation or motion compensation, was also presented.

Finally, we presented the idea of using prediction backward and forward in time for backward motion estimation with certainty measure. This would result in an equivalent of the B-frames that have proven to be efficient in hybrid DCT coding, but without transmission of any motion information.

Appendix A Discrete pdf’s from Continuous Ones

When dealing with stochastic variables that have been quantized, we need to go from their continuous probability density function (pdf), $p(x)$, to their discrete pdf, $p(i)$.

If the quantization is very fine, so that the continuous pdf can be approximated by a constant value in every quantization interval,

$$p(a) = k_i, \qquad d_i \le a \le d_{i+1} \tag{A.1}$$

then the discrete pdf can be approximated using this constant, that is

$$p(i) = k_i \, l(i) \tag{A.2}$$

where $l(i)$ denotes the length of the interval.

Prediction errors, for example, are relatively small (zero being the most common value when images quantized to 8 bits are used), so the continuous pdf associated with the non-quantized prediction error is not well approximated by a constant value in each quantization interval. Therefore, we have to make a more precise derivation.

For the continuous case we have the model (assuming the motion is properly corrected for, and suppressing the dependence on $x$ for readability)

$$I(t) = I(t-1) + E(t). \tag{A.3}$$

When we measure the prediction error in the quantized image, letting $A$ denote $I(t)$, $B$ denote $I(t-1)$ and $E$ denote $E(t)$, we get

$$\mathrm{DFD} = Q(A) - Q(B), \qquad A = B + E \tag{A.4}$$

where $Q(\cdot)$ denotes quantization and DFD stands for Displaced Frame Difference.

Assuming uniform quantization with quantization step 1, the situation is depicted in Fig. A.1.

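Figure A.1 Probability density function for $A$, given that $B = b$. Hatched area: $p(\mathrm{DFD} = 2)$. (Horizontal axis: $a$, with quantization levels $j-2, \dots, j+2$ and interval boundaries $j-\frac{3}{2}, \dots, j+\frac{5}{2}$; the curve is $p(a \mid B = b)$, with $B = b = j + b_f$ marked on the axis.)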

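It can be verified from Fig. A.1 that

$$p(\mathrm{DFD} = i \mid B = j + b_f) = \int_{i-\frac{1}{2}}^{i+\frac{1}{2}} p_E(a - b_f)\, da \tag{A.5}$$

independent of $j$, since DFD is the difference between $A$ and $B$. Since $B_f$, the fractional part of $B$, and $E$ are independent, and $B_f \in R[-\frac{1}{2}, \frac{1}{2}]$, that is, $B_f$ is uniformly distributed over this interval, we can write

$$p(\mathrm{DFD} = i) = \int_{-\frac{1}{2}}^{\frac{1}{2}} \int_{i-\frac{1}{2}}^{i+\frac{1}{2}} p_E(a - b_f)\, da\, db_f \tag{A.6}$$

and use this formula to numerically obtain the probabilities of the different integer values of the DFD.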
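A numerical evaluation of (A.6) could look like the following Python sketch. It uses a midpoint rule for both integrals and, purely as an assumed example, a zero-mean Laplacian density for $p_E$; the scale parameter, the number of integration steps and the function names are illustrative choices and are not taken from the thesis.

```python
import math


def laplace_pdf(e, scale=0.5):
    """Zero-mean Laplacian density, used here only as an assumed model for p_E."""
    return math.exp(-abs(e) / scale) / (2.0 * scale)


def dfd_probability(i, pdf_e, steps=100):
    """Midpoint-rule approximation of the double integral in (A.6).

    Outer variable: b_f in [-1/2, 1/2]; inner variable: a in [i-1/2, i+1/2].
    Quantization step 1 is assumed, as in the derivation.
    """
    h = 1.0 / steps
    total = 0.0
    for m in range(steps):
        b_f = -0.5 + (m + 0.5) * h
        for n in range(steps):
            a = i - 0.5 + (n + 0.5) * h
            total += pdf_e(a - b_f)
    return total * h * h


# Probabilities of the integer DFD values -6 ... 6 under the assumed model.
probabilities = {i: dfd_probability(i, laplace_pdf) for i in range(-6, 7)}
```

For a symmetric error density the resulting probabilities are symmetric around zero and sum to approximately one, which gives a simple sanity check of the integration. Any other density can be passed in place of `laplace_pdf`.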

Bibliography

[1] Edward H. Adelson, Eero Simoncelli, and Rajesh Hingorani. Orthogonal pyramid transforms for image coding. In SPIE Visual Communications and Image Processing, volume 845 of Proceedings of the SPIE, pages 50–58, 1987.

[2] P. Anandan. Computing dense displacement fields with confidence measures in scenes containing occlusion. In Intelligent Robots and Computer Vision, pages 184–194. SPIE, 1984.

[3] P. Anandan. A computational framework and an algorithm for the measurement of visual motion. International Journal of Computer Vision, pages 283–310, January 1989.

[4] Marc Antonini, Michael Barlaud, Pierre Mathieu, and Ingrid Daubechies. Image coding using wavelet transform. IEEE Transactions on Image Processing, 1(2):205–220, April 1992.

[5] Robert M. Armitano, Dinei A. F. Florencio, and Ronald W. Schafer. The motion transform: A new motion compensation technique. In IEEE International Conference on Acoustics, Speech and Signal Processing, May 1996.

[6] Christoph Bandt. Self–similar sets 5. Integer matrices and fractal tilings of $R^n$. Proceedings of the American Mathematical Society, 112(2):549–562, June 1991.


[7] Christoph Bandt and Gotz Gelbrich. Classification of self–affine lattice tilings. Journal of the London Mathematical Society, 50(2):581–593, 1994.

[8] Michael Barnsley. Fractals Everywhere. Academic Press, 1988.

[9] Mark F. Bear, Barry W. Connors, and Michael A. Paradiso. Neuroscience: Exploring the Brain. Williams & Wilkins, 1996.

[10] Ulrich Benzler and Oliver Werner. Improving multiresolution motion compensating hybrid coding by drift reduction. In Proceedings of the Picture Coding Symposium, March 1996.

[11] Toby Berger. Rate Distortion Theory. A Mathematical Basis for Data Compression. Prentice-Hall, 1971.

[12] Duane O. Bowker and Marc B. Mandler. Apparent contrast of suprathreshold gratings varies with stimulus orientation. Perception & Psychophysics, 29(6):585–588, 1981.

[13] D. C. Burr, J. Ross, and M. C. Morrone. Seeing objects in motion. Proceedings of the Royal Society of London B, 227:249–265, 1986.

[14] David Burr and John Ross. Visual analysis during motion. In Michael A. Arbib and Allen R. Hanson, editors, Vision, Brain, and Cooperative Computation, chapter 5, pages 187–207. MIT Press, 1987.

[15] Peter J. Burt. Tree and pyramid structures for coding hexagonally sampled binary images. Computer Graphics and Image Processing, 14(3):271–280, November 1980.

[16] Peter J. Burt, Chinsung Yen, and Xinping Xu. Multi–resolution flow–through motion analysis. In Proceedings of the IEEE Computer Vision and Pattern Recognition Conference, Washington DC, June 1983.

[17] P. J. Burt and E. H. Adelson. The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, COM-31(4):532–540, 1983.

[18] Ralf Buschmann. Analytische Bestimmung der Schatzfehlervarianz von Displacementschatzverfahren der Bewegtbildcodierung. PhD thesis, Universitat Hannover, 1997.


[19] Diego Casanueva Escudero. Motion compensated multiresolution coding of image sequences. Master’s thesis, Linkoping University, May 1995.

[20] Yilin Chang, Feipeng Li, Binbin Li, Weixiong Cheng, and Guoqiang Zhang. A new displacement estimation algorithm in image sequence coding. In Communication Systems: Towards Global Integration, pages 10.9.1–10.9.3, 1990.

[21] K. W. Cheng and S. C. Chan. Fast block matching algorithms for motion estimation. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 2311–2314, 1996.

[22] Stanley Coren and Lawrence M. Ward. Sensation & Perception. Harcourt Brace Jovanovich, Publishers, third edition, 1989.

[23] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications. Wiley, 1991.

[24] J. P. Crettez and J. C. Simon. A model for cell receptive fields in the visual striate cortex. Computer Graphics and Image Processing, 20(4):299–318, December 1982.

[25] Ingrid Daubechies. The wavelet transform, time-frequency localization and signal analysis. IEEE Transactions on Information Theory, 36(5):961–1005, September 1990.

[26] F. M. Dekking. Recurrent sets. Advances in Mathematics, 44:78–104, 1982.

[27] F. M. Dekking. Replicating superfigures and endomorphisms of free groups. Journal of Combinatorial Theory, Series A, 32:315–320, 1982.

[28] Michael P. Eckert. The significance of eye movements and image acceleration for coding television image sequences. In Digital Images and Human Vision, chapter 8, pages 89–98. MIT Press, 1993.

[29] H. H. Emsley. Irregular astigmatism of the eye: Effect of correcting lenses. Transactions of the Optical Society, London, 27:28–42, 1925.

[30] S. Ericsson. Fixed and adaptive predictors for hybrid predictive/transform coding. IEEE Transactions on Communications, COM-33(12):1291–1302, 1985.


[31] Michael W. Farn and Joseph W. Goodman. Bounds on the performance of continuous and quantized phase–only matched filters. Journal of the Optical Society of America: A, 7(1):66–72, January 1990.

[32] Dinei A. F. Florencio, Robert M. Armitano, and Ronald W. Schafer. Motion transforms for video coding. In Proceedings ICIP, September 1996.

[33] R. Forchheimer and T. Kronander. Image coding – from waveforms to animation. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(12), December 1989.

[34] Robert Forchheimer. Differential transform coding – a new hybrid coding scheme. In Proceedings of the Picture Coding Symposium, pages 15–16, June 1981.

[35] Robert Forchheimer and Olof Fahlander. Low bit-rate coding through animation. In Proceedings of the Picture Coding Symposium, pages 113–114, May 1983.

[36] Robert Forchheimer and Haibo Li. Real-time mobile video communication with low power terminals. World Intellectual Property Organization, International Application Published under the Patent Cooperation Treaty (PCT), January 1999. International Publication Number: WO 99/02003.

[37] Martin Gardner. Mathematical games. Scientific American, pages 124–128, 133, December 1976.

[38] Gotz Gelbrich. Self-affine lattice reptiles with two pieces in $R^n$. Mathematische Nachrichten, 178:129–134, 1996.

[39] Gotz Gelbrich and Katja Giesche. Fractal Escher salamanders and other animals. Mathematical Intelligencer, 20(2):31–35, 1998.

[40] Donald B. Gennery. Modelling the Environment of an Exploring Vehicle by means of Stereo Vision. PhD thesis, Computer Science Department, Stanford University, 1980.

[41] Allen Gersho and Robert M. Gray. Vector Quantization and Signal Compression. Kluwer Academic Publishers, 1992.


[42] William J. Gilbert. Geometry of radix representations. In The Geometric Vein: The Coxeter Festschrift, pages 129–139. Springer-Verlag, 1981.

[43] William J. Gilbert. Fractal geometry derived from complex bases. The Mathematical Intelligencer, pages 78–86, 1982.

[44] William J. Gilbert. The division algorithm in complex bases. Canadian Mathematical Bulletin, 39(1):47–54, 1996.

[45] B. Girod. Motion compensation: Visual aspects, accuracy, and fundamental limits. In M. Ibrahim Sezan and Reginald L. Lagendijk, editors, Motion Analysis and Image Sequence Processing, chapter 5, pages 125–152. Kluwer Academic Publishers, 1993.

[46] F. Glazer, G. Reynolds, and P. Anandan. Scene matching by hierarchical correlation. In Proceedings of the IEEE Computer Vision and Pattern Recognition Conference, Washington DC, June 1983.

[47] Mats Gokstorp. Depth Computation in Robot Vision. PhD thesis, Linkoping University, Sweden, 1995.

[48] Christine Graffigne, Fabrice Heitz, Patrick Perez, Francoise Preteux, Marc Sigelle, and Josiane Zerubia. Hierarchical Markov random field models applied to image analysis: A review. In Proceedings of the SPIE, volume 2568, pages 2–17, 1995.

[49] G. Granlund. Hierarchical image processing. In Proceedings of the SPIE, volume 397, pages 362–371, 1983.

[50] Gosta H. Granlund and Hans Knutsson. Signal Processing for Computer Vision. Kluwer Academic Publishers, 1995.

[51] K. Grochenig and W. R. Madych. Multiresolution analysis, Haar bases, and self-similar tilings of $R^n$. IEEE Transactions on Information Theory, 38(2):556–568, March 1992.

[52] Wolfgang Guse, Michael Gilge, and Bernd Hurtgen. Effective exploitation of background memory for coding of moving video using object mask generation. In Murat Kunt, editor, SPIE Visual Communications and Image Processing, pages 512–521. SPIE, 1990.


[53] Ali Habibi. Hybrid coding of pictorial data. IEEE Transactions on Communications, 22(5):614–624, May 1974.

[54] R. M. Haralick and J. S. Lee. The facet approach to optic flow. In L. S. Baumann, editor, Proceedings Image Understanding Workshop, pages 84–93, 1983.

[55] N. Peri Hartman and Steven L. Tanimoto. A hexagonal pyramid data structure for image processing. IEEE Transactions on Systems, Man and Cybernetics, SMC-14(2):247–256, April 1984.

[56] Paul J. Hearty. Achieving and confirming optimum image quality. In A. B. Watson, editor, Digital Images and Human Vision, chapter 12, pages 149–162. MIT Press, 1993.

[57] Berthold Klaus Paul Horn. Robot Vision. MIT Press, McGraw–Hill Book Company, 1986.

[58] Todd Horowitz and Anne Treisman. Attention and apparent motion. Spatial Vision, 8(2):193–219, 1994.

[59] David H. Hubel. Eye, Brain and Vision. Scientific American Library, 1988.

[60] ITU-T. Information technology – generic coding of moving pictures and associated audio; Recommendation H.262; ISO/IEC 13818-2; committee draft, March 1994.

[61] ITU-T. Draft text of Recommendation H.263 Version 2, 1998.

[62] ITU-T. Overview of the MPEG-4 standard, March 1999.

[63] J. R. Jain and A. K. Jain. Displacement measurement and its application in interframe image coding. IEEE Transactions on Communications, COM-29:1799–1808, December 1981.

[64] Rajan L. Joshi and Thomas R. Fischer. Comparison of generalized Gaussian and Laplacian modeling in DCT image coding. IEEE Signal Processing Letters, 2(5):81–82, May 1995.

[65] M. D. Kelly. Edge detection in pictures by computer using planning. In Machine Intelligence, volume 6, pages 397–409, Edinburgh, 1971. Edinburgh University Press.


[66] Richard P. Kleihorst, Reginald L. Lagendijk, and Jan Biemond. Motion estimation from noisy image sequences. In Proceedings of the Picture Coding Symposium, page 4.6, Lausanne, Switzerland, March 1993.

[67] T. Koga, A. Iinuma, Y. Iijima, and T. Ishiguro. Motion-compensated interframe coding for video conferencing. In Proceedings of the National Telecommunications Conference, pages G5.3.1–G5.3.5, 1981.

[68] Torbjorn Kronander. Pre– and post–processing of images using filters with motion compensated history. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1104–1108, New York, April 1988.

[69] Torbjorn Kronander. Some Aspects of Perception Based Image Coding. PhD thesis, Linkoping University, Sweden, 1989.

[70] Bi Lei. Motion compensated three dimensional transform coding. Master’s thesis, Linkoping University, Sweden, May 1995.

[71] Haibo Li. Computation of optical flow considering changes in image intensity. Technical report, Linkoping University, Sweden, April 1991.

[72] Haibo Li. Optical flow from a color image sequence. Technical report, Linkoping University, Sweden, October 1991.

[73] Haibo Li. Low Bitrate Image Sequence Coding. PhD thesis, Linkoping University, 1993.

[74] Haibo Li. Three-dimensional fractal coding with motion compensation. Technical Report LiTH-ISY-R-1695, Linkoping University, Sweden, October 1994.

[75] Haibo Li, Diego Casanueva Escudero, and Robert Forchheimer. Motion compensated multiresolution transmission of digital video signals. In Hiroshi Harashima, editor, Proceedings of the 1995 International Workshop on Very Low Bit-rate Video, November 1995.

[76] Haibo Li, Astrid Lundmark, and Robert Forchheimer. Image sequence coding at very low bitrates: A review. IEEE Transactions on Image Processing, 3(5):589–609, September 1994.


[77] Haibo Li, Astrid Lundmark, and Robert Forchheimer. Low-entropy motion estimation for very low bitrate video. In Hiroshi Harashima, editor, Proceedings of the 1995 International Workshop on Very Low Bit-rate Video, Tokyo, Japan, November 1995.

[78] Haibo Li, Pertti Roivainen, and Robert Forchheimer. 3-D motion estimation in model-based facial image coding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(7):545–555, July 1993.

[79] J. O. Limb and J. A. Murphy. Estimating the velocity of moving images in television signals. Computer Graphics and Image Processing, pages 311–327, 1975.

[80] Yih-Chuan Lin and Shen-Chuan Tai. Fast full-search block-matching algorithm for motion-compensated video compression. IEEE Transactions on Communications, 45(5):527–531, May 1997.

[81] Anders Lindman. Implementering och utvardering av rorelseestimeringsalgoritmer for kodning av TV/HDTV. Master’s thesis, Linkoping University, Sweden, February 1992.

[82] Tom D. Lookabaugh and Robert M. Gray. High–resolution quantization theory and the vector quantizer advantage. IEEE Transactions on Information Theory, 35(5):1020–1033, September 1989.

[83] Astrid Lundmark. 2D-frequency sensitivity adapted subband decomposition. In Proceedings of the Picture Coding Symposium, March 1993.

[84] Astrid Lundmark and Torbjorn Kronander. Using certainty of motion estimation in image coding. In Proceedings of the Picture Coding Symposium, 1991.

[85] Astrid Lundmark, Haibo Li, and Robert Forchheimer. Motion vector certainty reduces bit rate in backward motion estimation video coding. In SPIE Visual Communications and Image Processing, volume 4067, pages 95–104. The International Society for Optical Engineering, June 2000.

[86] Astrid Lundmark, Niclas Wadstromer, and Haibo Li. Recursive subdivisions of the plane yielding nearly hexagonal regions. In RadioVetenskap och Kommunikation, June 1999.


[87] Astrid Lundmark, Niclas Wadstromer, and Haibo Li. Hierarchical subsampling giving fractal regions. IEEE Transactions on Image Processing, 10(1):167–173, January 2001.

[88] B. Mahesh and W. A. Pearlman. Image coding on a hexagonal pyramid with noise spectrum shaping. In Proceedings of the SPIE, volume 1199, pages 764–774, 1989.

[89] Stephane Mallat. A Wavelet Tour of Signal Processing. Academic Press, 1998.

[90] Stephane G. Mallat. Multifrequency channel decomposition of images. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-37(12):2094–2110, December 1989.

[91] Stephane G. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674–693, July 1989.

[92] Benoit B. Mandelbrot. The Fractal Geometry of Nature. W. H. Freeman and Company, 1982.

[93] Petros Maragos. Morphological correlation and mean absolute error criteria. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1568–1571, Glasgow, Scotland, May 1989.

[94] A. M. Mathai. Generalized Laplace distributions with applications. Journal of Applied Statistical Science, 1(2):169–178, 1993.

[95] Peter Meer, Song-Nian Jiang, Ernest S. Baugher, and Azriel Rosenfeld. Robustness of image pyramids under structural perturbations. Computer Vision, Graphics and Image Processing, 44:307–331, 1988.

[96] F. Muller. Distribution shape of two-dimensional DCT coefficients of natural images. Electronics Letters, 29:1935–1936, October 1993.

[97] J. P. Mulroy and M. W. Whybray. Improved motion vector fields for object oriented coding using local confidence measures. In Proceedings of the International Workshop on Coding Techniques for Very Low Bit-rate Video, 1994.


[98] David Nister. Embedded Image Coding. Licentiate of Engineering thesis, Chalmers University of Technology, November 1998.

[99] Aria Nosratinia and Michael T. Orchard. Multi-resolution backward video coding. In Proceedings of the International Conference on Image Processing, volume 2, pages 563–566, October 1995.

[100] Naoya Ohta. Movement detection with reliability indices. In Proceedings of the IAPR Workshop on Machine Vision Applications, pages 177–180, 1990.

[101] Naoya Ohta. Image movement detection with reliability indices. IEICE Transactions, E74(10):3379–3388, October 1991.

[102] John R. Parks. A Multi-Level System of Analysis for Mixedfont and Hand-Blocked Printed Characters Recognition, chapter 15, pages 295–322. Academic Press, 1969.

[103] William A. Pearlman, Beong-Jo Kim, and Zixiang Xiong. Embedded video subband coding with 3D SPIHT. In Pankaj N. Topiwala, editor, Wavelet Image and Video Compression, chapter 24, pages 397–432. Kluwer Academic Publishers, 1998.

[104] J. J. Pearson, D. C. Hines, Jr., S. Golosman, and C. D. Kuglin. Video–rate image correlation processor. In Andrew G. Tescher, editor, SPIE Vol. 119 Application of Digital Image Processing (International Optical Computing Conference), pages 197–205, 1977.

[105] Wendi B. Rabiner and Anantha P. Chandrakasan. Network-driven motion estimation for wireless video terminals. IEEE Transactions on Circuits and Systems for Video Technology, 7(4):644–653, August 1997.

[106] Andre Redert, Emile Hendriks, and Jan Biemond. Correspondence estimation in image pairs. IEEE Signal Processing Magazine, pages 29–46, May 1999.

[107] J. G. Robson. Spatial and temporal contrast–sensitivity functions of the visual system. Journal of the Optical Society of America, 56:1141–1142, 1966.

[108] Pertti Roivainen. Motion estimation in model based coding of human faces. Licentiate Thesis, Dept. of Electrical Engineering, SE-581 83 Linkoping, Sweden, 1990.


[109] A. Rosenfeld, editor. Multiresolution Image Processing and Anal-ysis. Springer verlag, 1984.

[110] Shaker Sabri and Birenda Prasada. Video conferencing sys-tems. Proceedings of the IEEE, 73(4):671–688, April 1985.

[111] Amir Said and William A. Pearlman. A new fast and efficientimage codec based on set partitioning in hierarchical trees.IEEE Transactions on Circuits and Systems for Video Technology,6, June 1996.

[112] Makoto Sato and Hiroshi Sasaki. Hierarchical estimation ofdisplacement vectors in image sequences. Systems and Com-puters in Japan, 19(10):1–7, 1988. Translated from Denshi JohoGakkai Ronbunshi, Vol. 69-D, No. 5, May 1986, pp. 771–776.

[113] Jerome M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Transactions on Signal Processing, 41(12):3445–3462, December 1993.

[114] Eero P. Simoncelli and Edward H. Adelson. Non–separable extensions of quadrature mirror filters to multiple dimensions. Proceedings of the IEEE, 78(4):652–664, April 1990.

[115] H. J. Song and B. S. Kang. Fractiles derived from generalized digit systems. Singapore World Scientific, 4(4):495–500, 1996.

[116] R. Srinivasan and K. R. Rao. Predictive coding based on efficient motion estimation. In P. Dewilde and C. A. May, editors, Proceedings of the International Conference on Communications, pages 521–526, 1984.

[117] Roger T. Stevens. Understanding Self-Similar Fractals. R&D Technical Books, 1995.

[118] Christoph Stiller and Janusz Konrad. Estimating motion in image sequences. IEEE Signal Processing Magazine, pages 70–91, July 1999.

[119] Jacob Strom, Tony Jebara, Sumit Basu, and Alexander Pentland. Real time tracking and modeling of faces: An EKF-based analysis by synthesis approach. In Proceedings of the Modelling People Workshop at the International Conference of Computer Vision, September 1999.

[120] S. Tanimoto and T. Pavlidis. A hierarchical data structure for picture processing. Computer Vision, Graphics and Image Processing, 4:652–660, 1975.

[121] A. Murat Tekalp. Digital Video Processing. Signal Processing Series. Prentice Hall, first edition, 1995.

[122] Robert Thoma and Matthias Bierling. Motion compensating interpolation considering covered and uncovered background. Signal Processing: Image Communication, 1989.

[123] D. A. Thomas. Television motion measurement for DATV and other applications. Technical Report 87–11, BBC Research Department, Engineering Division, Kingswood Warren, Brighton Road, Tadworth, Surrey, KT20 6NP, Great Britain, September 1987.

[124] Pankaj N. Topiwala. Wavelet Image and Video Compression. Kluwer, 1998.

[125] O. Tretiak and L. Pastor. Velocity estimation from image sequences with second order differential operators. In Proceedings International Conference on Pattern Recognition, pages 16–19, 1984.

[126] G. Tziritas and C. Labit. Motion Analysis for Image Sequence Coding. Advances in Image Communication. Elsevier, 1994.

[127] Helge von Koch. Sur une courbe continue sans tangente obtenue par une construction geometrique elementaire. In Arkiv for Matematik, Astronomi och Fysik, volume 1, pages 681–702. Almqvist & Wiksell, 1904.

[128] Helge von Koch. Une methode geometrique elementaire pour l’etude de certaines questions de la theorie des courbes planes. Acta Mathematica, 30:145–174, 1906.

[129] Brian A. Wandell. Foundations of Vision. Sinauer Associates, 1995.

[130] John Watkinson. The Art of Digital Video. Focal Press, second edition, 1994.

[131] A.B. Watson. Efficiency of a model human image code. Journal of the Optical Society of America: A, 4(12):2401–2417, December 1987.

[132] A.B. Watson and A.J. Ahumada. Model of human visual-motion sensing. Journal of the Optical Society of America: A, 2(2):322–342, 1985.

[133] Andrew B. Watson and Albert J. Ahumada, Jr. A hexagonal orthogonal-oriented pyramid as a model of image representation in visual cortex. IEEE Transactions on Biomedical Engineering, 36(1):97–106, January 1989.

[134] Andrew B. Watson. Perceptual-components architecture for digital video. Journal of the Optical Society of America: A, 7(10):1943–1954, October 1990.

[135] Thomas Wedi. A time-recursive interpolation filter for motion compensated prediction considering aliasing. In Proceedings. International Conference on Image Processing, April 1999.

[136] Carl-Johan Westelius. Focus of Attention and Gaze Control for Robot Vision. PhD thesis, Linkoping University, Sweden, 1995.

[137] Gerald Westheimer and Suzanne P. McKee. Visual acuity in the presence of retinal–image motion. Journal of the Optical Society of America, 65(7):847–850, July 1975.

[138] John M. Wozencraft and Irwin Mark Jacobs. Principles of Communication Engineering. John Wiley & Sons, Inc., 1965.

[139] Sanbao Xu. Motion and Optical Flow in Robot Vision. Liu-tek-lic-1994:28, Linkoping University, Sweden, May 1994.

[140] Alfred Lukyanovich Yarbus. Eye Movements and Vision. Plenum Press, 1967.

[141] J.I. Yellot. Spectral analysis of spatial sampling by photoreceptors: Topological disorder prevents aliasing. Vision Research, 22:1205–1210, 1982.

[142] Robert W. Young and Nick G. Kingsbury. Frequency-domain motion estimation using a complex lapped transform. IEEE Transactions on Image Processing, 2(1):2–17, January 1993.

[143] Andre Zaccarin and Bede Liu. Fast algorithms for motion estimation. In IEEE International Conference on Acoustics, Speech and Signal Processing, volume III, pages 449–452, March 1992.

[144] Hui Zhang and Zhenya He. A new motion estimation algorithm. In Proceedings of the Pacific Rim conference on Communications, Computers and Signal Processing, pages 619–622. IEEE, 1991.
