
Efficient Video Coding with Motion-Compensated Orthogonal Transforms

DU LIU

Master’s Degree Project
Stockholm, Sweden 2011

XR-EE-SIP 2011:011


Efficient Video Coding with Motion-Compensated Orthogonal Transforms

Du Liu

July, 2011


Abstract

Well-known standard hybrid coding techniques utilize the concept of motion-compensated predictive coding in a closed loop. The resulting coding dependencies are a major challenge for packet-based networks like the Internet. On the other hand, subband coding techniques avoid the dependencies of predictive coding and are able to generate video streams that better match packet-based networks. An interesting class for subband coding is the so-called motion-compensated orthogonal transform. It generates orthogonal subband coefficients for arbitrary underlying motion fields. In this project, a theoretical lossless signal model based on the Gaussian distribution is proposed. It is possible to obtain the optimal rate allocation from this model. Additionally, a rate-distortion efficient video coding scheme is developed that takes advantage of motion-compensated orthogonal transforms. The scheme combines multiple types of motion-compensated orthogonal transforms, variable block size, and half-pel accurate motion compensation. The experimental results show that this scheme outperforms individual motion-compensated orthogonal transforms.


Acknowledgements

This thesis was carried out at the Sound and Image Processing Lab, School of Electrical Engineering, KTH.

I would like to express my appreciation to my supervisor Markus Flierl for the opportunity of doing this thesis. I am grateful for his patience and valuable suggestions and discussions.

Many thanks to Haopeng Li, Mingyue Li, and Zhanyu Ma, who helped me a lot during my research.

I would also like to thank my parents and my friends Alicia Wang and Peng Wu for their constant support.


Contents

Abstract
Acknowledgements
1 Introduction
2 Background
2.1 Motion-Compensated Orthogonal Transforms
2.1.1 Motion Compensation
2.1.2 The Orthogonal Transform
2.2 Adaptive Spatial Wavelet Transforms
2.2.1 Type-1 Spatial Wavelet Transform
2.2.2 Type-2 Spatial Wavelet Transform
2.3 Quantization
2.4 Entropy Coding
I Theoretical Model
3 Theoretical Signal Model
3.1 General Transform Model
3.2 Memoryless Gaussian Model
4 Numerical Results
II Practical System
5 Efficient Video Coding Scheme
5.1 Construction of Various MCOTs
5.1.1 Multiple Types of MCOT
5.1.2 Multi-hypothesis MCOT
5.2 Obtaining Motion Vectors
5.3 Variable Block Size
5.4 Mode Decision
6 Experimental Results
7 Conclusions
Bibliography


List of Figures

2.1 Block-based motion compensation with two matching blocks $x_{1,i}$ and $x_{2,j}$ and a motion vector mv.

2.2 Half-pel accurate motion compensation for integer position A. Position 1 to Position 8 are the possible half-pel positions for Position A.

2.3 The distribution of a 2-dimensional noisy image for (a) Haar wavelet transform with a rotation of $45^\circ$ and (b) MCOT with an optimal decorrelation angle $\alpha^*$.

2.4 Type-2 spatial wavelet transform of Lena with three decomposition levels.

2.5 Structure of a bit-stream for one code-block. Sign: Signs of the coefficients. SP: Significance Propagation Pass. MR: Magnitude Refinement Pass. CP: Cleanup Pass.

3.1 Theoretical signal model.

3.2 The theoretical curve $g(R_p)$ of the variance of the clean high band over $R_p$.

3.3 The theoretical curve of the total rate $h_t$ over $R_p$.

3.4 Different $g$ for different $f$.

3.5 Different $h_c$ for different $f$.

4.1 The total rate $h_t$ over $R_p$ with $\gamma = 9$ for different noise levels. $g_0 = \sigma_v^2 = 1$.

4.2 The rate of the coefficients $h_c$ over $R_p$ with $\gamma = 9$ for different noise levels. $g_0 = \sigma_v^2 = 1$.

5.1 Efficient video coding system.

5.2 Multi-hypothesis for bidirectional half-pel motion estimation.

5.3 An example of 6-hypothesis motion estimation.

5.4 Partitions of a macroblock of 16x16 for motion estimation.

5.5 Structure of the minimization of the cost function with the three levels.

6.1 Luminance PSNR vs. rate for the QCIF sequence Foreman at 30 fps with 64 frames and a GOP size of 8 frames. The compared transforms include the proposed MCOT, the bidirectional MCOT with variable block size (VBS) and half-pel motion compensation (HP), the bidirectional MCOT without VBS or HP, the Haar wavelet transform without VBS or HP, and the intra coding.

6.2 Luminance PSNR vs. rate for the QCIF sequence Mother & Daughter at 30 fps with 64 frames and a GOP size of 8 frames. The compared transforms include the proposed MCOT, the bidirectional MCOT with variable block size (VBS) and half-pel motion compensation (HP), the bidirectional MCOT without VBS or HP, the Haar wavelet transform without VBS or HP, and the intra coding.


Chapter 1

Introduction

Video communication is broadly used in today’s communication and visual services such as terrestrial broadcast, cable TV, satellite TV, real-time conversation, and Internet video. For all these applications, video coding techniques play an important part in the storage, transmission, and representation of video data. Since the storage space or the transmission bandwidth is usually limited, most video coding schemes are lossy. There is a trade-off between the video quality and the hardware and software requirements. A video coding technique is therefore expected to code the video sequences efficiently, such that the decoded video provides the highest possible quality for a given storage space or a given data rate.

The standard video compression techniques, such as H.261 [1], H.263 [2], MPEG-1 [3], MPEG-4 Part 2 [4], and more recently H.264/AVC [5], utilize the concept of motion-compensated predictive coding. Predicted frames (P-frames) and bi-predicted frames (B-frames) are used to exploit the temporal redundancy of the sequences, with one key frame (I-frame) for each group of pictures (GOP). Because predictive coding operates in a closed-loop fashion, the coded video depends heavily on the relationship between successive pictures. These dependencies introduce the risk of error propagation to the subsequently decoded pictures, which can make predictive coding suboptimal for channels with packet loss [6]. On the other hand, the motion-compensated orthogonal transform (MCOT) is a subband coding technique that operates in an open-loop fashion. It does not depend on predictive coding and therefore avoids the error propagation. Thus it is more suitable for packet-based networks like the Internet. MCOTs form a class of subband coding techniques that generate orthogonal subband coefficients for arbitrary underlying motion fields.

The goal of this project is to develop a rate-distortion efficient video coding scheme that takes advantage of motion-compensated orthogonal transforms. A theoretical transform coding model is proposed to analyze the optimal rate allocation. The performance of the practical system is evaluated by the peak signal-to-noise ratio (PSNR).

The report is organized as follows: Chapter 2 introduces the background of the motion-compensated orthogonal transforms, the adaptive spatial wavelet transforms, the quantization, and the entropy coding. Chapter 3 proposes a theoretical signal model for the transform coding. Numerical results for the theoretical model are presented in Chapter 4. Chapter 5 describes the implemented video coding system. Chapter 6 presents the experimental results for the coding system.


Chapter 2

Background

2.1 Motion-Compensated Orthogonal Transforms

The class of MCOTs includes the unidirectional motion-compensated orthogonal transform [7], the bidirectional motion-compensated orthogonal transform [8], a half-pel motion accurate transform [9], and a multi-hypothesis transform [10]. In this thesis, various types of MCOTs are combined with various motion models to adapt efficiently to the actual motion of the coded image sequence.

2.1.1 Motion Compensation

Motion compensation exploits the similarity of consecutive pictures and is very commonly used in today’s video coding techniques. Successive frames of a sequence are usually similar, and motion compensation exploits this temporal redundancy. Applied to Internet video services, it can reduce the data rate from several megabytes per second to 10 kbps [11].

In block-based motion compensation, each frame is divided into blocks of, for example, 8 × 8 or 16 × 16 pixels. A reference frame is defined, and the motion-compensation algorithm searches it for the block that best matches the block currently being processed. A motion vector indicates the shift between the current block and the reference block.

Fig. 2.1 depicts the two matching blocks $x_{1,i}$ in $x_1$ and $x_{2,j}$ in $x_2$. $x_{2,j}$ is the block currently being processed, and $x_1$ is the reference frame in which the best match for $x_{2,j}$ is searched; the displacement of this match is the motion vector mv. The matching criterion is usually the sum of squared differences (SSD) or the sum of absolute differences (SAD).
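The block search described above can be sketched as an exhaustive integer-pel search with the SAD criterion. This is only a minimal illustration: the function names, the 8 × 8 block size, and the ±4 pixel search range are assumptions made for the example and are not taken from the thesis.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return np.abs(a.astype(np.int64) - b.astype(np.int64)).sum()

def full_search(cur, ref, top, left, block=8, search=4):
    """Find the integer motion vector (dy, dx) that minimizes the SAD.

    cur, ref  : 2-D arrays (current frame and reference frame)
    top, left : position of the current block inside `cur`
    search    : hypothetical search range in pixels (+/- search)
    """
    target = cur[top:top + block, left:left + block]
    best_mv, best_cost = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                continue
            cost = sad(target, ref[y:y + block, x:x + block])
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost

# Toy data: the current frame is the reference frame shifted by (2, -1).
rng = np.random.default_rng(0)
ref_frame = rng.integers(0, 256, size=(32, 32))
cur_frame = np.roll(ref_frame, shift=(2, -1), axis=(0, 1))
mv, cost = full_search(cur_frame, ref_frame, top=8, left=8)
```

Replacing `sad` with a squared-difference sum gives the SSD variant; the search loop itself stays unchanged.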

The components of a motion vector need not be integers. They can point to sub-sample positions such as half-pixel or quarter-pixel positions. This provides more accurate motion compensation for the block matching scheme and therefore reduces the information in the residual signals.


Figure 2.1: Block-based motion compensation with two matching blocks $x_{1,i}$ and $x_{2,j}$ and a motion vector mv.

Figure 2.2: Half-pel accurate motion compensation for integer position A. Position 1 to Position 8 are the possible half-pel positions for Position A.


In this project, half-pel motion accuracy is considered. Fig. 2.2 depicts the integer pixel positions A to I and the half-pel positions 1 to 8. For each integer pixel position, e.g., A, we consider the eight surrounding half-pel positions. The half-pel values are interpolated as the average of the neighbouring integer pixels:

$$
\begin{aligned}
p_1 &= \tfrac{1}{2}(p_A + p_D), & p_5 &= \tfrac{1}{2}(p_A + p_H), \\
p_2 &= \tfrac{1}{4}(p_A + p_B + p_C + p_D), & p_6 &= \tfrac{1}{4}(p_A + p_H + p_F + p_G), \\
p_3 &= \tfrac{1}{2}(p_A + p_B), & p_7 &= \tfrac{1}{2}(p_A + p_F), \\
p_4 &= \tfrac{1}{4}(p_A + p_B + p_I + p_H), & p_8 &= \tfrac{1}{4}(p_A + p_F + p_E + p_D).
\end{aligned}
\tag{2.1}
$$
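The averaging pattern of Eq. 2.1 can be sketched with a small helper that addresses the picture in half-pel units. The coordinate convention and the function name are assumptions for this example; the labelling of the neighbours A to I in Fig. 2.2 is not reproduced here.

```python
import numpy as np

def half_pel_sample(img, y2, x2):
    """Sample `img` at position (y2 / 2, x2 / 2), given in half-pel units.

    Integer positions return the pixel itself, positions between two
    pixels return their mean, and positions between four pixels return
    the mean of all four, which matches the averaging pattern of Eq. 2.1.
    """
    y0, x0 = y2 // 2, x2 // 2                 # upper-left integer neighbour
    y1, x1 = y0 + (y2 % 2), x0 + (x2 % 2)     # same as (y0, x0) on integer axes
    return (img[y0, x0] + img[y0, x1] + img[y1, x0] + img[y1, x1]) / 4.0

img = np.array([[10.0, 20.0],
                [30.0, 40.0]])
horizontal = half_pel_sample(img, 0, 1)   # between 10 and 20
vertical = half_pel_sample(img, 1, 0)     # between 10 and 30
diagonal = half_pel_sample(img, 1, 1)     # centre of all four pixels
```

Summing four terms in every case keeps the code branch-free: on an integer axis the duplicated neighbours make the four-term mean equal to the two-term mean.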

2.1.2 The Orthogonal Transform

Since the MCOT is an orthogonal linear transform, the differential entropy of the source signal is preserved. We have $h(x_1, x_2) = h(L, H)$, where $h(x_1, x_2)$ is the joint entropy of the two input pictures $x_1$ and $x_2$ and $h(L, H)$ is the joint entropy of the low band and the high band. Fig. 2.3 depicts the 2-dimensional signal with (a) the Haar wavelet transform and (b) the MCOT, where $\alpha$ is the rotation angle decided by the transform. The Haar wavelet transform always rotates the signal by $\alpha = 45^\circ$, which means it may be suboptimal if the source signal distribution is oriented at an angle other than $45^\circ$. The Haar transform matrix is

$$H_{\text{Haar}} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}. \tag{2.2}$$

The MCOT, on the other hand, specifies the decorrelation angle $\alpha = \alpha^*$ by the constraint of energy concentration. It aims at rotating the signal to the $x_1$-axis, which means the MCOT adapts to the distribution of the signal. The orthogonal matrix for the MCOT is

$$H_{\text{MCOT}} = \begin{pmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{pmatrix}. \tag{2.3}$$

For Gaussian signals, uncorrelated coefficients are also independent, so after the MCOT with $\alpha = \alpha^*$ the differential entropy of the source signal becomes $h(L) + h(H)$. We can write the differential entropy as

$$h(x_1, x_2) = h(L, H) \le h(L) + h(H), \quad \forall \alpha \tag{2.4}$$

and

$$h(x_1, x_2) = h(L) + h(H), \quad \text{only for } \alpha = \alpha^*. \tag{2.5}$$
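Eqs. 2.3 to 2.5 can be checked numerically with a small sketch. It assumes zero-mean sample pairs and takes $\alpha^*$ as the angle that zeroes the sample covariance between the two outputs. This is one way to realize energy concentration for a 2-dimensional cloud, not necessarily the rule used in the thesis, and all names and signal parameters are illustrative.

```python
import numpy as np

def rotation(alpha):
    """2x2 rotation matrix in the form of Eq. 2.3."""
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[c, s], [-s, c]])

def decorrelation_angle(x):
    """Angle that zeroes the covariance between the two outputs.

    x is a 2 x N array of zero-mean sample pairs (an assumption of this sketch).
    """
    cov = x @ x.T / x.shape[1]
    return 0.5 * np.arctan2(2.0 * cov[0, 1], cov[0, 0] - cov[1, 1])

def gaussian_entropy_sum(y):
    """h(L) + h(H) in bits, treating each band as Gaussian."""
    var = np.mean(y ** 2, axis=1)
    return float(np.sum(0.5 * np.log2(2 * np.pi * np.e * var)))

rng = np.random.default_rng(1)
v = rng.normal(size=5000)                                # common clean signal
x = np.vstack([2.0 * v + 0.1 * rng.normal(size=5000),    # picture 1
               1.0 * v + 0.1 * rng.normal(size=5000)])   # picture 2

alpha_star = decorrelation_angle(x)
y_mcot = rotation(alpha_star) @ x      # adaptive angle
y_haar = rotation(np.pi / 4) @ x       # fixed 45 degree rotation

energy_in = np.sum(x ** 2)
energy_out = np.sum(y_mcot ** 2)
```

Because the two pictures have different gains in this toy signal, the point cloud is not aligned with $45^\circ$, so the fixed Haar rotation leaves the two bands correlated and its entropy sum is larger than that of the adaptive rotation.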


Figure 2.3: The distribution of a 2-dimensional noisy image for (a) Haar wavelet transform with a rotation of $45^\circ$ and (b) MCOT with an optimal decorrelation angle $\alpha^*$.

Considering the two matching blocks in Fig. 2.1, the orthogonal transform becomes

$$\begin{pmatrix} x''_{1,i} \\ x''_{2,j} \end{pmatrix} = H \begin{pmatrix} x'_{1,i} \\ x'_{2,j} \end{pmatrix}. \tag{2.6}$$

After the transform, $x'_{1,i}$ becomes the low band block $x''_{1,i}$ and $x'_{2,j}$ becomes the high band block $x''_{2,j}$. In the ideal case that $x'_{1,i} = x'_{2,j}$, the high band block $x''_{2,j}$ is zero.

Since the transform is orthonormal, Parseval’s theorem always holds. The energy is preserved by the transform:

$$\left\|x'_{1,i}\right\|_2^2 + \left\|x'_{2,j}\right\|_2^2 = \left\|x''_{1,i}\right\|_2^2 + \left\|x''_{2,j}\right\|_2^2. \tag{2.7}$$

Thus it is possible to evaluate the distortion of the coefficients even before the inverse transform. The MCOT achieves high energy concentration, with up to 99% of the energy in the temporal low band for the QCIF sequence Foreman [12]. If the input images are identical, all the energy is compacted into the temporal low band and the temporal high band becomes zero.

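Considering the bidirectional MCOT, we can construct the orthogonal transform matrix as

$$
\begin{aligned}
H &= H_3 H_2 H_1 \\
&= \begin{pmatrix} \cos\psi & 0 & \sin\psi \\ 0 & 1 & 0 \\ -\sin\psi & 0 & \cos\psi \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} \cos\phi & 0 & \sin\phi \\ 0 & 1 & 0 \\ -\sin\phi & 0 & \cos\phi \end{pmatrix}.
\end{aligned}
\tag{2.8}
$$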


The Euler angles $\psi$, $\theta$, and $\phi$ are calculated by energy concentration. The bidirectional MCOT can be expressed as

$$\begin{pmatrix} x''_{1,i} \\ x''_{2,j} \\ x''_{3,k} \end{pmatrix} = H_3 H_2 H_1 \begin{pmatrix} x'_{1,i} \\ x'_{2,j} \\ x'_{3,k} \end{pmatrix}, \tag{2.9}$$

where $x_{1,i}$, $x_{2,j}$, and $x_{3,k}$ are three coefficients from three corresponding frames.

After the transform, the original coefficients $x'_{1,i}$ and $x'_{3,k}$ become the low band coefficients $x''_{1,i}$ and $x''_{3,k}$, respectively, and $x'_{2,j}$ becomes the high band coefficient $x''_{2,j}$. In the ideal case that $x'_{1,i} = x'_{2,j} = x'_{3,k}$, the high band coefficient $x''_{2,j}$ is zero.
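The structure of Eqs. 2.8 and 2.9 can be illustrated with a short sketch that builds the three rotations and verifies orthogonality and energy preservation. The rule used for $\theta$ below (choose it so that the middle output vanishes, with $\phi$ and $\psi$ supplied by the caller) is a hypothetical stand-in for the energy concentration constraint, whose exact form is given in [8] and not in this section.

```python
import numpy as np

def h1(phi):
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def h2(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def h3(psi):
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def bidirectional_transform(x, phi, psi):
    """Apply H3 H2 H1 of Eq. 2.8 to x = (x1, x2, x3).

    Hypothetical angle rule for this sketch: phi and psi are given, and
    theta is chosen so that the middle (high band) output becomes zero.
    """
    t = h1(phi) @ x
    theta = np.arctan2(t[1], t[2])      # zeroes cos(theta)*t[1] - sin(theta)*t[2]
    H = h3(psi) @ h2(theta) @ h1(phi)
    return H @ x, H

x = np.array([5.0, 5.0, 5.0])           # ideal case: three equal inputs
y, H = bidirectional_transform(x, phi=0.3, psi=-0.2)
```

Any product of such rotations is orthogonal, so the energy of `x` is redistributed between the two low band outputs while the high band output is driven to zero.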

In the case of half-pel accuracy, the pixel-based motion-compensated orthogonal transform is given in [9]. We have a 2-hypothesis MCOT for $p_1$, $p_3$, $p_5$, and $p_7$, and a 4-hypothesis transform for $p_2$, $p_4$, $p_6$, and $p_8$; see also Fig. 2.2. The 2-hypothesis transform is similar to the bidirectional MCOT, while the 4-hypothesis transform extends the orthogonal transform to an operation on five pixels at a time.

2.2 Adaptive Spatial Wavelet Transforms

Spatial transforms can exploit the spatial redundancy between coefficients in a picture and map the pixels into spatial low and high bands. The adaptive spatial wavelet transform is designed to modify the spatial relationship within each of the temporal bands produced by the MCOT [13]. It preserves the orthogonality of the temporal decomposition. It also takes the scale factors from the MCOT into consideration to achieve efficient energy compaction. The adaptive spatial wavelet transform comes in two types: the type-1 spatial wavelet transform and the type-2 spatial wavelet transform.

2.2.1 Type-1 Spatial Wavelet Transform

The type-1 transform is an adaptive Haar-like wavelet transform. It processes two pixels at a time according to

$$\begin{pmatrix} x''_1 \\ x''_2 \end{pmatrix} = H \begin{pmatrix} x'_1 \\ x'_2 \end{pmatrix}, \tag{2.10}$$

where $x'_1$ and $x'_2$ are two successive temporal low band samples and $x''_1$ and $x''_2$ are the corresponding spatial low and high band coefficients, respectively. The orthogonal transform matrix $H$ is

$$H = \frac{1}{\sqrt{1 + a^2}} \begin{pmatrix} 1 & a \\ -a & 1 \end{pmatrix}, \tag{2.11}$$

where $a$ is the decorrelation factor determined by the energy concentration constraint. If $a = 1$, the spatial transform reduces to the standard Haar wavelet transform.

The output of the temporal low band after spatial decomposition consists of a spatial low band and horizontal, vertical, and diagonal high bands. The adaptive spatial wavelet transform achieves high energy compaction. It is shown in [12] that the type-1 adaptive spatial transform achieves 98.3% energy concentration from the temporal low band to the spatial low band for Foreman and 98.77% for Mother & Daughter.
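A minimal sketch of the type-1 matrix of Eq. 2.11 shows that it is orthonormal for any $a$ and reduces to the Haar matrix of Eq. 2.2 for $a = 1$. The choice $a = x'_2 / x'_1$ used below is a hypothetical rule that zeroes the high band for one specific pair; the actual rule in [13] also accounts for the MCOT scale factors.

```python
import numpy as np

def type1_matrix(a):
    """Orthonormal 2x2 matrix of Eq. 2.11 for decorrelation factor a."""
    return np.array([[1.0, a], [-a, 1.0]]) / np.sqrt(1.0 + a * a)

# a = 1 reproduces the Haar matrix of Eq. 2.2.
haar = np.array([[1.0, 1.0], [-1.0, 1.0]]) / np.sqrt(2.0)

# Hypothetical rule for this sketch: pick a = x2 / x1 so that the
# high band output vanishes for this particular pair.
x = np.array([4.0, 2.0])
a = x[1] / x[0]
low, high = type1_matrix(a) @ x
```

With this choice all the energy of the pair ($4^2 + 2^2 = 20$) ends up in `low`, which is the two-sample analogue of the energy compaction figures quoted above.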

2.2.2 Type-2 Spatial Wavelet Transform

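Instead of processing two pixels at a time as the type-1 spatial transform does, the type-2 spatial transform considers three pixels at a time and processes along a whole row or column within each of the subbands. Let $x'_1$, $x'_2$, and $x'_3$ be three samples processed by the type-2 transform at a time. After the transform, $x'_2$ becomes the spatial high band pixel $x''_2$. Assuming $x'_1 = x'_2 = x'_3$, the high band energy can be completely removed ($x''_2 = 0$). The transform is

$$\begin{pmatrix} x''_1 \\ x''_2 \\ x''_3 \end{pmatrix} = S_3 S_2 S_1 \begin{pmatrix} x'_1 \\ x'_2 \\ x'_3 \end{pmatrix}, \tag{2.12}$$

where

$$
\begin{aligned}
S &= S_3 S_2 S_1 \\
&= \begin{pmatrix} \cos\psi & 0 & \sin\psi \\ 0 & 1 & 0 \\ -\sin\psi & 0 & \cos\psi \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} \cos\phi & 0 & \sin\phi \\ 0 & 1 & 0 \\ -\sin\phi & 0 & \cos\phi \end{pmatrix}.
\end{aligned}
\tag{2.13}
$$

The Euler angles $\phi$, $\theta$, and $\psi$ are determined via the energy concentration constraint.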

Fig. 2.4 depicts the three spatial decomposition levels of Lena (512 × 512). Most of the energy is in the upper left corner, which is the spatial low band, while the spatial high bands contain information about edges and curves, shown as the gray parts. Since the type-2 transform outperforms the type-1 transform in energy compaction [13], the type-2 adaptive spatial wavelet transform is used in this thesis.

Figure 2.4: Type-2 spatial wavelet transform of Lena with three decomposition levels.

2.3 Quantization

Quantization is a mapping from a large set of values to a small set of representative values. The mean squared error (MSE) of the quantization can be expressed as

$$D = E\left[(x - q(x))^2\right], \tag{2.14}$$

where $x$ is a random vector. The literature studies the relationship between the MSE and the probability density function (pdf) of the source signals [14][15]. It is shown that for high rate and smooth pdf, the distortion can be written as

$$D = \frac{1}{12 N^2} \left( \int_x f_x^{1/3}(x)\, dx \right)^3, \tag{2.15}$$

where $N$ is the number of representative levels and $f_x(x)$ is the pdf of $x$. In most cases, the quantizer is uniform and scalar. Assume $x$ is uniformly distributed over an interval of unit length. The distortion is

$$D = \sum_{i=1}^{N} \int_{x_i}^{x_{i+1}} (x - \hat{x}_i)^2\, dx = \frac{\Delta^2}{12}, \tag{2.16}$$

where $\Delta$ is the quantization step size and $\Delta = 1/N$.

Quantization is always associated with the rate, which is the subject of rate-distortion theory. If the quantization step size grows larger, fewer bits are needed for the entropy coding, and vice versa. In the case of high rate, the uniform quantizer is often optimal [16]. For uniform quantization, the rate can be expressed in terms of the number of quantization levels as $R = \log_2 N$. It is possible to control the total rate by setting different quantization step sizes.
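The relation $D = \Delta^2/12$ of Eq. 2.16 and the rate $R = \log_2 N$ can be checked numerically. The sketch below assumes a source that is uniform on $[0, 1)$ and a midpoint reconstruction rule; the function name and the chosen values of $N$ are illustrative.

```python
import numpy as np

def uniform_quantize(x, n_levels):
    """Midpoint uniform scalar quantizer on [0, 1) with n_levels cells."""
    step = 1.0 / n_levels
    index = np.minimum(np.floor(x / step), n_levels - 1)
    return (index + 0.5) * step, step

# Dense, evenly spaced samples stand in for a uniform source on [0, 1).
x = (np.arange(1_000_000) + 0.5) / 1_000_000
for n_levels in (4, 16, 64):
    x_hat, step = uniform_quantize(x, n_levels)
    mse = np.mean((x - x_hat) ** 2)
    rate = np.log2(n_levels)
    print(f"N={n_levels:3d}  R={rate:.0f} bit  MSE={mse:.3e}  step^2/12={step**2 / 12:.3e}")
```

Each additional bit of rate halves the step size and therefore reduces the distortion by a factor of four, which is the trade-off that the step size controls.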


In this thesis, Embedded Block Coding with Optimized Truncation (EBCOT) is used for entropy coding as well as for rate-distortion evaluation. Quantization effects on EBCOT-based compression are studied in [17]. However, as suggested in [18], a uniform scalar deadzone quantizer with the standard step size $\Delta = 1$ is used for simplicity, and the rate is controlled by the post-compression rate-distortion (PCRD) algorithm.

2.4 Entropy Coding

The main purpose of entropy coding is to reduce the redundancy of the source message and represent it in binary format.

Assume $X$ to be a discrete random variable with probability $p(x_i)$ for its possible values $x_i$. The Shannon entropy of $X$ is defined as

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i). \tag{2.17}$$

It measures the uncertainty of the random variable $X$. It also indicates the theoretical lower limit of lossless data compression: the optimal average code length for one symbol is $H(X)$ bits.

The above definition of the Shannon entropy applies to a discrete $X$. If we consider a continuous $X$, e.g., practical analog signals, the entropy can be extended to the differential entropy. Let $f_x(x)$ be the pdf of $X$. The differential entropy is defined as

$$h(X) = -\int f_x(x) \log_2 f_x(x)\, dx. \tag{2.18}$$

Note that the differential entropy can be negative, and it is not limited by the Shannon entropy.
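The two entropy definitions can be illustrated with a short sketch. The Gaussian closed form $\frac{1}{2}\log_2(2\pi e \sigma^2)$ is the standard result for Eq. 2.18; the function names and the example distributions are chosen only for illustration.

```python
import numpy as np

def shannon_entropy(p):
    """Entropy in bits of a discrete distribution p (Eq. 2.17)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return float(-np.sum(p * np.log2(p)))

def gaussian_differential_entropy(variance):
    """Differential entropy in bits of a Gaussian with the given variance."""
    return 0.5 * np.log2(2 * np.pi * np.e * variance)

h_fair = shannon_entropy([0.5, 0.5])             # fair coin
h_skewed = shannon_entropy([0.9, 0.1])           # more predictable source
h_narrow = gaussian_differential_entropy(0.01)   # narrow Gaussian
```

The skewed source needs fewer bits per symbol than the fair coin, and the narrow Gaussian yields a negative value, which shows that the differential entropy is not a code length by itself.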

A number of entropy coding techniques have been designed for lossless coding, such as Huffman coding, arithmetic coding, and Lempel-Ziv-Welch (LZW) [19]. Most image coding techniques are lossy, such as Embedded Zerotrees of Wavelet Transforms (EZW) [20], Set Partitioning in Hierarchical Trees (SPIHT) [21], and Discrete Cosine Transform (DCT) based coding [22].

EBCOT is another coding method used in image and video compression [23]. It serves as the entropy coder in JPEG 2000. It utilizes the idea of bit-plane coding to encode bits from the most significant bit-plane to the least significant bit-plane. Generally, the low-order bit-planes are more difficult to encode than the high-order bit-planes because they contain more details and randomness.

Consider an 8-bit greyscale image, which has a maximum value of 255. Each of the coefficients can be represented as $x = a_{n-1} 2^{n-1} + a_{n-2} 2^{n-2} + \cdots + a_0 2^0$, where $x$ is the coefficient and $n$ is the number of binary bit-planes. The coefficients form the bit-planes from the most significant bit $a_{n-1}$ to the least significant bit $a_0$. The image is divided into small code-blocks, such as 16 × 16 or 32 × 32. The bit-plane encoder encodes the bit-planes of each code-block independently in three passes:

• Significance Propagation Pass;

• Magnitude Refinement Pass;

• Cleanup Pass.

All the information obtained from these passes is coded by an adaptive arithmetic encoder. The bit-stream for one code-block is constructed in an embedded way, as shown in Fig. 2.5. The most important information is put at the head and the least important information follows at the end. If the bit-stream is truncated, the most important information is preserved and the bits at the end are discarded. In this way, EBCOT can perform rate control without using different quantization steps. It creates embedded bit-streams from bit-planes and adapts to a given rate by truncating the bit-streams.
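The effect of truncating an embedded bit-plane representation can be sketched as follows. This toy example only splits non-negative integers into bit-planes and drops the least significant ones; it deliberately ignores signs, the three coding passes, and the arithmetic coder of the real EBCOT, and the function names are illustrative.

```python
import numpy as np

def to_bit_planes(x, n_bits=8):
    """Split non-negative integers into bit-planes, most significant first."""
    x = np.asarray(x, dtype=np.int64)
    return [(x >> k) & 1 for k in range(n_bits - 1, -1, -1)]

def from_bit_planes(planes, n_bits=8):
    """Rebuild values from the leading bit-planes; missing planes count as 0."""
    x = np.zeros_like(planes[0])
    for i, plane in enumerate(planes):
        x += plane << (n_bits - 1 - i)
    return x

coeffs = np.array([255, 130, 37, 4])
planes = to_bit_planes(coeffs)
full = from_bit_planes(planes)          # all 8 planes: lossless
coarse = from_bit_planes(planes[:3])    # keep only the 3 most significant planes
```

Keeping the first three planes bounds the error of every coefficient by $2^5$, so cutting the stream earlier or later trades rate against distortion without changing the quantizer.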

Figure 2.5: Structure of a bit-stream for one code-block. Sign: Signs of the coefficients. SP: Significance Propagation Pass. MR: Magnitude Refinement Pass. CP: Cleanup Pass.

Because the code-blocks are coded independently, the bit-streams of different blocks are also independent. If one bit-stream is lost, it does not affect the other bit-streams and the decoder can still decode the picture. The PCRD algorithm can be performed over each single code-block, which reduces the complexity of the rate control. Moreover, the distortion is additive when the truncated bit-streams are independent: $D = \sum_i D_i$, where $D$ is the distortion of the whole picture and $D_i$ is the distortion of one code-block $B_i$. Detailed distortion estimation based on bit-planes can be found in [24], but here we only look at the block-based distortion.

In this project, the JasPer implementation of the JPEG 2000 codec (ISO/IEC 15444-1) is used [25].


Part I

Theoretical Model


Chapter 3

Theoretical Signal Model

3.1 General Transform Model

Within each single MCOT, assume there are two input pictures $x_0$ and $x_1$. They can be viewed as a clean picture $v$ plus additive white Gaussian noise $n_0$ and $n_1$, respectively, as shown in Fig. 3.1 [26]. The noise signals $n_0$ and $n_1$ are statistically independent.

Figure 3.1: Theoretical signal model.

After the transform, the output signals consist of one temporal low band $L$ and one temporal high band $H$ from which energy has been removed. However, the energy of the noise signals $n_0$ and $n_1$ cannot be shifted by the transform; each remains in its own subband. Thus the temporal subbands are composed of the clean subband signals plus the noise: $L = L_{\text{clean}} + n_0$ and $H = H_{\text{clean}} + n_1$.

Since we would like to describe the performance of the transform, we use the parameter rate $R_p$ to determine how much energy is moved from the high band to the low band. The parameter rate includes the information that is related to the transform, such as the rate of the motion vectors and the rate of the block sizes. If $R_p = 0$, no bits are spent on the parameters and the energy is not shifted. If $R_p$ gets larger, more energy is concentrated in the low band and less is left in the high band. If $R_p \to +\infty$, the clean high band energy can be completely removed and all the signal energy is in the low band, as shown in Fig. 3.2. Let $g(R_p)$ be the transform function that gives the variance of the clean high band signal as a function of $R_p$. $g(R_p)$ is a decreasing function: if $R_p$ gets larger, more energy is removed from the high band. However, it is not possible to remove the noise from the high band even if $R_p \to +\infty$.

Figure 3.2: The theoretical curve $g(R_p)$ of the variance of the clean high band over $R_p$.

Because the noise cannot be shifted, let $f$ denote the variance of the noisy high band:

$$f = \sigma_H^2 = \sigma_n^2 + g(R_p). \tag{3.1}$$

Since $f$ describes the high band, it is bounded by $0 \le f \le \frac{E}{2}$, where $E = 2\sigma_n^2 + 2\sigma_v^2$ is the total energy. From Fig. 3.2, we know both $f$ and $g$ are convex. The variance of the noisy low band is

$$\sigma_L^2 = \sigma_n^2 + 2\sigma_v^2 - g(R_p) = 2\sigma_n^2 + 2\sigma_v^2 - f. \tag{3.2}$$

The total energy $E$ is always conserved because the transform is orthogonal. Let $h_c$ be the differential entropy of the coefficients,

$$h_c = \frac{1}{2}\left(h(L) + h(H)\right) \ge \frac{1}{2} h(x_1, x_2). \tag{3.3}$$

The differential entropy of the total signal is

$$h_t = R_p + h_c = R_p + \frac{1}{2}\left(h(L) + h(H)\right) \quad \text{[bpp]}. \tag{3.4}$$

Page 23: EfficientVideoCodingwith Motion-CompensatedOrthogonal …kth.diva-portal.org/smash/get/diva2:511426/FULLTEXT01.pdfEfficientVideoCodingwith Motion-CompensatedOrthogonal Transforms DU

CHAPTER 3. THEORETICAL SIGNAL MODEL 15

Because $h_c$ is related to the subband coefficients, there is a trade-off between $R_p$ and $h_c$. Either more bits are spent on $R_p$ to improve the transform performance and fewer bits are needed for $h_c$, or fewer bits are spent on $R_p$ and more bits are required for $h_c$. For efficient transform coding, we expect the decrease in $h_c$ to be larger than the increase in $R_p$, such that the total rate is reduced. Theoretically, there exists an optimal $R_p^*$ that minimizes the total rate. Fig. 3.3 depicts the theoretical curve of $h_t$ with the optimal $R_p^*$ and $\min\{h_t\} = h_t^*$. If $R_p = 0$, $h_t$ is $\frac{1}{2}(h(L) + h(H))$. If $R_p \to +\infty$, $h_c$ approaches the constant entropy $\frac{1}{2} h(x_1, x_2)$ of the source signal.

Figure 3.3: The theoretical curve of the total rate ht over Rp.

To minimize the total rate $h_t$, we set the partial derivative of $h_t$ with respect to $R_p$ to zero:

$$\frac{\partial h_t}{\partial R_p} = 0. \tag{3.5}$$

The result gives $\min\{h_t\} = h_t(R_p^*) = h_t^*$. If we consider a cost function $L$ with $\mu = 1$,

$$L = R_p + \mu h_c, \tag{3.6}$$

minimizing $h_t$ means finding

$$dL = dh_c + dR_p = 0, \tag{3.7}$$

which becomes

$$\frac{dh_c}{dR_p} = -1. \tag{3.8}$$


To evaluate Eq. 3.8, we need to know the absolute value of $h_c$ such that we can find the optimal rate allocation when combined with $R_p$. Although $h_c$ can be obtained from the bit-streams after entropy coding, it is difficult to calculate the value of $h_c$ within the transform. Because we would like to evaluate the transform before entropy coding, we assume the source signal is Gaussian distributed and calculate its differential entropy.

3.2 Memoryless Gaussian Model

It is known that, of all real-valued probability distributions with the same mean and variance, the Gaussian source is the most difficult to encode [27]; it requires the largest number of bits. We are therefore considering the worst case in modeling the signal.

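We obtain the differential entropy of the temporal low band

$$h(L) = \frac{1}{2}\log_2(2\pi e) + \frac{1}{2}\log_2(\sigma_L^2) \tag{3.9}$$

and the temporal high band

$$h(H) = \frac{1}{2}\log_2(2\pi e) + \frac{1}{2}\log_2(\sigma_H^2). \tag{3.10}$$

The rate of the coefficients is

$$h_c = \frac{1}{2}\bigl(h(L) + h(H)\bigr) \tag{3.11}$$

$$\phantom{h_c} = \frac{1}{2}\log_2(2\pi e) + \frac{1}{4}\log_2\left[\bigl(\sigma_n^2 + 2\sigma_v^2 - g(R_p)\bigr)\bigl(\sigma_n^2 + g(R_p)\bigr)\right] \tag{3.12}$$

$$\phantom{h_c} = \frac{1}{2}\log_2(2\pi e) + \frac{1}{4}\log_2\left[(E - f)f\right]. \tag{3.13}$$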
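For readers who want the intermediate step between Eqs. 3.11 and 3.13, a short derivation follows. It assumes, as read off from Eqs. 3.12, 3.13, and 3.17, that $\sigma_H^2 = \sigma_n^2 + g(R_p) = f$ and $\sigma_L^2 = \sigma_n^2 + 2\sigma_v^2 - g(R_p) = E - f$, so that $E = 2\sigma_n^2 + 2\sigma_v^2$:

```latex
\begin{aligned}
h_c &= \tfrac{1}{2}\bigl(h(L) + h(H)\bigr) \\
    &= \tfrac{1}{2}\Bigl[\log_2(2\pi e) + \tfrac{1}{2}\log_2\sigma_L^2 + \tfrac{1}{2}\log_2\sigma_H^2\Bigr] \\
    &= \tfrac{1}{2}\log_2(2\pi e) + \tfrac{1}{4}\log_2\bigl(\sigma_L^2\,\sigma_H^2\bigr) \\
    &= \tfrac{1}{2}\log_2(2\pi e) + \tfrac{1}{4}\log_2\bigl[(E - f)\,f\bigr].
\end{aligned}
```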

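If we minimize $h_c$ for a given $R_{p0}$, we have

$$\min h_c \quad \text{s.t.} \quad R_p = R_{p0} \tag{3.14}$$

$$\Rightarrow \min\,(E - f)f \quad \text{s.t.} \quad R_p = R_{p0}. \tag{3.15}$$

Because $0 \le f \le \frac{E}{2}$, the product $(E - f)f$ grows monotonically with $f$ on this interval, so Eq. 3.15 can be rewritten as

$$\min f \quad \text{s.t.} \quad R_p = R_{p0}. \tag{3.16}$$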

In Eq. 3.14, $h_c$ relies on the Gaussian distribution assumption. Eq. 3.16, in contrast, depends only on the variance of the high band and not on the Gaussian distribution. That means that, if we consider Eq. 3.16, there is no need to know the signal distribution in advance. For this, we can construct another cost function

$$J = f + \lambda R_p = \sigma_H^2 + \lambda R_p, \tag{3.17}$$

where $\lambda$ is the Lagrangian multiplier. To minimize this cost function, we set

$$\frac{dJ}{dR_p} = 0, \tag{3.18}$$

and obtain

$$\lambda = -\frac{df}{dR_p}. \tag{3.19}$$

This equation indicates the transform performance in two respects. The first compares different transform functions $g(R_p)$ at the same $f$. The second compares different values of $R_p$ for the same $g(R_p)$:

• Suppose there are two transform functions $g_1$ and $g_2$. Assume $g_1$ is steeper than $g_2$, which means that at a given $f_1$, $R_{p11} < R_{p21}$. So $g_1$ is more efficient in compacting energy, as fewer bits are spent. Then at $f_1$ we have $\lambda_{11} < \lambda_{21}$. At $f_2$ we have the opposite case, $\lambda_{12} > \lambda_{22}$, as shown in Fig. 3.4.

• Consider $f_1$ and $f_2$ ($f_1 < f_2$) for a given $g_1$. The rates at these two points satisfy $R_{p11} > R_{p12}$, which means that a higher $R_p$ removes more energy from the high band. Then the two multipliers at these points satisfy $\lambda_{11} < \lambda_{12}$. The same holds for $g_2$, i.e., $\lambda_{21} < \lambda_{22}$, also shown in Fig. 3.4.

Table 3.1 shows the parameters for each of the conditions in Figs. 3.4 and 3.5.

f     | g1          | g2          | hc1    | hc2
f1    | Rp11, λ11   | Rp21, λ21   | λh11   | λh21
f2    | Rp12, λ12   | Rp22, λ22   | λh12   | λh22

Table 3.1: Parameters shown in Figs. 3.4 and 3.5.

Eq. 3.8 can be rewritten as

$$\frac{dh_c}{dR_p} = \frac{dh_c}{df}\,\frac{df}{dR_p} = -1. \tag{3.20}$$

We obtain

$$\frac{dh_c}{df} = \frac{1}{4\ln 2}\,\frac{E - 2f}{(E - f)f} = \frac{1}{\lambda}. \tag{3.21}$$
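The two equalities in Eq. 3.21 can be traced back to Eqs. 3.13, 3.19, and 3.20. The following worked step is a sketch that treats $E$ as a constant with respect to $f$, which holds if $E = 2\sigma_n^2 + 2\sigma_v^2$ as suggested by Eqs. 3.12 and 3.13:

```latex
\begin{aligned}
\frac{dh_c}{df}
  &= \frac{d}{df}\Bigl[\tfrac{1}{2}\log_2(2\pi e) + \tfrac{1}{4}\log_2\bigl((E - f)f\bigr)\Bigr]
   = \frac{1}{4\ln 2}\,\frac{d}{df}\ln\bigl(Ef - f^2\bigr)
   = \frac{1}{4\ln 2}\,\frac{E - 2f}{(E - f)f}, \\
\frac{dh_c}{df}\,\frac{df}{dR_p} &= -1
  \quad\text{with}\quad \frac{df}{dR_p} = -\lambda
  \quad\Longrightarrow\quad \frac{dh_c}{df} = \frac{1}{\lambda}.
\end{aligned}
```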

Figure 3.4: Different $g$ for different $f$.

The term $\frac{dh_c}{df}$ is the slope of the differential entropy of the subbands over $f$ under the Gaussian assumption. This term can also be interpreted in two respects:

• Consider one $f$ for two curves $h_{c1}$ and $h_{c2}$, where $h_{c1}$ is higher than $h_{c2}$, i.e., $h_{c1}(f) > h_{c2}(f)$. The larger value means that $h_{c1}$ is less efficient in coding the coefficients than $h_{c2}$. Then for $f_1$, $\frac{1}{\lambda_{h11}} > \frac{1}{\lambda_{h21}} \Rightarrow \lambda_{h11} < \lambda_{h21}$, and for $f_2$, $\frac{1}{\lambda_{h12}} < \frac{1}{\lambda_{h22}} \Rightarrow \lambda_{h12} > \lambda_{h22}$, as shown in Fig. 3.5. For a relatively small $f_1$, $g_{h_{c1}}$ should be lower than $g_{h_{c2}}$ such that $\lambda_{h11} < \lambda_{h21}$ and thus $R_{p,h11} < R_{p,h21}$. So $g_{h_{c1}}$ is more effective than $g_{h_{c2}}$. When a more effective $g_{h_{c1}}$ is combined with a less efficient $h_{c1}$, it can still achieve optimal performance.

• If we look at one curve $h_{c1}$ and two points $f_1$ and $f_2$, we have $\frac{1}{\lambda_{h11}} > \frac{1}{\lambda_{h12}} \Rightarrow \lambda_{h11} < \lambda_{h12}$. Besides, for one transform function $g$, we obtain $f_1 < f_2 \Rightarrow R_{p1} > R_{p2} \Rightarrow \lambda_{h11} < \lambda_{h12}$. The two results are consistent, which means that the transform and the entropy coding can be analyzed together.

As discussed above, Eq. 3.20 demonstrates the relationship between the transform and the entropy coder. In real cases, the term $\frac{dh_c}{df}$ reflects the performance of the entropy coder, e.g., the EBCOT. If the entropy coder encodes the coefficients efficiently, it gives a smaller $h_c$ than an inefficient encoder with the same $f$. In the case of bit-streams, we can use $R_c$ instead of $h_c$ to represent the actual rate of the coefficients.

Figure 3.5: Different $h_c$ for different $f$.

To choose one optimal $\lambda$, we can proceed as follows:

• For a given transform $g(R_p)$, calculate the $N$ possible values $\lambda_1, \ldots, \lambda_N$ for the $N$ different values of $R_p$.

• With each $R_{p,n}$ ($n \in [1 \ldots N]$), obtain $\sigma_{H,n}^2$ and $R_{c,n}$.

• Choose the minimum $R_n = R_{p,n} + R_{c,n}$ among the $N$ possible rates; the corresponding $\lambda$ is the optimal one.
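The three steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: the exponential form of $g(R_p)$ is borrowed from Eq. 4.1, the Gaussian model of Eq. 3.12 stands in for the measured coefficient rate $R_{c,n}$, and the names `g`, `coefficient_rate`, and `choose_lambda` as well as the grid of candidate rates are hypothetical.

```python
import math

GAMMA = 9.0   # assumed shape parameter of g(Rp)
G0 = 1.0      # assumed value of g at Rp = 0
VAR_V = 1.0   # assumed signal variance sigma_v^2

def g(rp):
    """Assumed transform function g(Rp) = g0 * 2^(-gamma * Rp), as in Eq. 4.1."""
    return G0 * 2.0 ** (-GAMMA * rp)

def coefficient_rate(rp, var_n):
    """Gaussian stand-in for Rc (Eq. 3.12); a real system would measure the bit-stream."""
    var_h = var_n + g(rp)                  # high-band variance f
    var_l = var_n + 2.0 * VAR_V - g(rp)    # low-band variance E - f
    return 0.5 * math.log2(2.0 * math.pi * math.e) + 0.25 * math.log2(var_l * var_h)

def choose_lambda(rates, var_n):
    """Scan candidate parameter rates; return (Rp, lambda, total rate) with minimum total rate."""
    best = None
    for rp in rates:
        lam = GAMMA * math.log(2.0) * g(rp)          # lambda_n = -df/dRp for this g
        total = rp + coefficient_rate(rp, var_n)     # R_n = R_{p,n} + R_{c,n}
        if best is None or total < best[2]:
            best = (rp, lam, total)
    return best

# N = 401 candidate rates between 0 and 4 bits; noise variance of -30 dB
candidates = [0.01 * n for n in range(401)]
best_rp, best_lam, best_total = choose_lambda(candidates, var_n=10.0 ** (-30 / 10))
print(f"Rp = {best_rp:.2f}, lambda = {best_lam:.4f}, total rate = {best_total:.2f}")
```

In a practical system, `coefficient_rate` would be replaced by the rate reported by the entropy coder for each candidate, while the scan itself stays the same.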

The next chapter gives numerical results for a given $g(R_p)$ and presents an optimal rate combination.


Chapter 4

Numerical Results

Because $g(R_p)$ is unknown beforehand, it can take any form, such as an exponential or a rational function. For simplicity, we assume here that

$$g(R_p) = g_0\,2^{-\gamma R_p}, \tag{4.1}$$

where $g_0$ is the value of $g$ at $R_p = 0$ and $\gamma > 0$ is a parameter that determines the shape of $g$.

To find a minimum point of $h_t$, we require

$$\frac{\partial h_t}{\partial R_p} = 0, \tag{4.2}$$

$$\frac{\partial^2 h_t}{\partial R_p^2} > 0. \tag{4.3}$$

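The solutions are

$$R_p^* = -\frac{1}{\gamma}\log_2\left(\frac{4 - \gamma + \sqrt{(4 - \gamma)^2 + 4(2 - \gamma)C}}{2(2 - \gamma)}\right), \tag{4.4}$$

where

$$C = 4\sigma_v^2\sigma_n^2 + 2\sigma_n^4, \tag{4.5}$$

$$h_t^* = R_p^* + \frac{1}{2}\log_2(2\pi e) + \frac{1}{4}\log_2\left[\bigl(\sigma_n^2 + g(R_p^*)\bigr)\bigl(2\sigma_v^2 + \sigma_n^2 - g(R_p^*)\bigr)\right], \tag{4.6}$$

and

$$\lambda^* = -\left.\frac{df}{dR_p}\right|_{R_p = R_p^*} = \gamma g_0 \ln 2 \cdot 2^{-\gamma R_p^*} = \gamma \ln 2 \cdot g(R_p^*). \tag{4.7}$$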

This pair of $R_p^*$ and $h_t^*$ is then our optimal rate allocation. Figure 4.1 presents the total rate $h_t$ over $R_p$ for noise levels of −10 dB, −30 dB, and −50 dB. The top curve, for $\sigma_n^2 = -10$ dB, shows only a small reduction in $h_t$ of about 0.1 bits. At this level, the noise is large enough to destroy the original signal, and the transform does not gain much in reducing the total rate. The middle curve is for $\sigma_n^2 = -30$ dB, where the reduction at the minimum point is about 1 bit. This level of noise is acceptable for the signal, as it is neither too high nor too low. For the last curve, with noise of −50 dB, the noise is so small that the signal is almost clean, and the transform decreases $h_t$ by about 1.8 bits. The transform is thus efficient in reducing the total rate at a low noise level, but yields little gain at a high noise level.

Figure 4.1: The total rate $h_t$ (bits) over $R_p$ (bits) with $\gamma = 9$ for noise levels $\sigma_n^2$ of −50 dB, −30 dB, and −10 dB. $g_0 = \sigma_v^2 = 1$.

As we are also interested in the relationship between $h_c$ and $R_p$, Fig. 4.2 depicts $h_c$ vs. $R_p$. As shown, all three curves first decrease and then approach a constant. The decreasing part indicates that $R_p$ is compensating $h_c$. Negative values of $h_c$ occur because it is a differential entropy, which can be negative. The noise level affects the rate distribution. For the top curve with $\sigma_n^2 = -10$ dB, there is only a short range of trade-off between $h_c$ and $R_p$. Because the noise destroys the signal, we do not see much reduction in $h_c$. The optimal rates $R_p^*$, $h_c^*$, and $h_t^*$ are presented in Table 4.1.

Figure 4.2: The rate of the coefficients $h_c$ (bits) over $R_p$ (bits) with $\gamma = 9$ for noise levels $\sigma_n^2$ of −50 dB, −30 dB, and −10 dB. $g_0 = \sigma_v^2 = 1$.

Noise   | h_t*   | R_p*   | h_c*    | σ_H²*     | λ*        | dh_c/df
-50dB   | 0.24   | 1.88   | -1.64   | 1.81e-5   | 4.99e-5   | 2.00e4
-30dB   | 1.16   | 1.14   | 0.02    | 0.18e-2   | 0.50e-2   | 2.00e2
-10dB   | 2.09   | 0.37   | 1.72    | 0.20      | 0.61      | 1.64

Table 4.1: Optimal rate combinations and corresponding $\sigma_H^{2*}$ and $\lambda^*$ for different noise levels.
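The entries of Table 4.1 can be cross-checked by evaluating Eqs. 4.4 to 4.7 directly. The sketch below assumes $g_0 = \sigma_v^2 = 1$ and $\gamma = 9$ as in Figs. 4.1 and 4.2, and reads a noise level of $x$ dB as $\sigma_n^2 = 10^{x/10}$; the function name `optimal_allocation` is hypothetical.

```python
import math

GAMMA = 9.0   # shape parameter of g(Rp), as in Figs. 4.1 and 4.2
VAR_V = 1.0   # sigma_v^2; Eq. 4.4 is evaluated here with sigma_v^2 = g0 = 1

def optimal_allocation(noise_db):
    """Evaluate Eqs. 4.4 to 4.7 for one noise level given in dB."""
    var_n = 10.0 ** (noise_db / 10.0)
    c = 4.0 * VAR_V * var_n + 2.0 * var_n ** 2                       # Eq. 4.5
    root = math.sqrt((4.0 - GAMMA) ** 2 + 4.0 * (2.0 - GAMMA) * c)
    g_opt = (4.0 - GAMMA + root) / (2.0 * (2.0 - GAMMA))            # argument of log2 in Eq. 4.4
    rp = -math.log2(g_opt) / GAMMA                                   # Eq. 4.4
    var_h = var_n + g_opt                                            # high-band variance
    var_l = 2.0 * VAR_V + var_n - g_opt                              # low-band variance
    hc = 0.5 * math.log2(2.0 * math.pi * math.e) + 0.25 * math.log2(var_h * var_l)
    lam = GAMMA * math.log(2.0) * g_opt                              # Eq. 4.7
    return {"Rp": rp, "hc": hc, "ht": rp + hc, "var_H": var_h, "lambda": lam}

results = {level: optimal_allocation(level) for level in (-50, -30, -10)}
for level, values in results.items():
    print(level, {name: round(v, 5) for name, v in values.items()})
```

Under these assumptions the computed rates should land close to the tabulated values; small differences in the last printed digit are possible because the table is rounded.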


Remarks

Notice that the function $g(R_p)$ is assumed to be given when evaluating Eqs. 4.2 and 4.3. If we solve these two conditions without assuming any $g(R_p)$, the solution is

$$g^* = \sigma_v^2 + \sigma_n^2 - \sqrt{(\sigma_v^2 + \sigma_n^2)^2 - \left[2(\sigma_v^2 + \sigma_n^2)g_0 - g_0^2\right]2^{-4R_p}}. \tag{4.8}$$

This special $g^*$ makes $\frac{\partial h_t(g^*)}{\partial R_p} = 0, \ \forall R_p$, which means that the resulting $h_t$ is flat over $R_p$. However, a flat curve of $h_t$ means that the transform is not efficient at all, and the original question is to find one minimum point of $h_t$. We do not need a $g$ that satisfies the condition for all $R_p$. Therefore, $g$ should be assumed in advance.


Part II

Practical System


Chapter 5

Efficient Video Coding Scheme

The practical video coding system is depicted in Fig. 5.1. It utilizes various types of motion-compensated orthogonal transforms. The input is a group of $n$ pictures (GOP = $n$). The MCOT is a combination of the unidirectional MCOT, the bidirectional MCOT, half-pel accurate motion compensation, and variable block sizes. The type used for the transform is decided by the Lagrangian cost function. After the MCOT, the temporal subbands consist of one temporal low band and $n - 1$ temporal high bands. Then the adaptive spatial wavelet transform is applied to the temporal subbands. Applying the spatial transform to the temporal high bands is less efficient than applying it to the temporal low band, as there is not much spatial redundancy in the high bands. We nevertheless apply the spatial transform to all subbands, because the EBCOT codec requires the same spatial decomposition level for all subbands. In this work, the spatial decomposition level is set to three. After the transforms, we use EBCOT as the entropy coder to encode the obtained coefficients. According to [28], uniform deadzone quantization with step size one is used, and the rate is controlled by the PRCD.

5.1 Construction of Various MCOTs

5.1.1 Multiple Types of MCOT

As introduced in Chapter 2, there are two types of MCOT: the unidirectional MCOT and the bidirectional MCOT. Using these two types to construct our system, we consider the following available transforms:

• Intra-frame coding

• Left unidirectional MCOT

• Right unidirectional MCOT

• Bidirectional MCOT.

Figure 5.1: Efficient video coding system.

Intra-frame coding means that no temporal transform is applied to the video sequence. The original pictures are kept directly as temporal subbands without any further processing. This is highly inefficient for video compression. We keep this mode because it covers the worst case in which motion compensation fails, for example when a completely different frame appears in the video sequence and any motion compensation would introduce high distortion. In the general case, intra-frame coding is not used.

The left unidirectional MCOT and the right unidirectional MCOT are similar. The only difference is that the left unidirectional MCOT uses the previous frame as the reference frame, while the right unidirectional MCOT uses the subsequent frame. This strategy addresses cases in which there is a sudden break in the sequence and the following frames are completely different. The system would then choose the previous (left) frame as the reference for the current frame and the subsequent (right) frame as the reference for the following pictures.

The last type, the bidirectional MCOT, takes both the previous frame and the subsequent frame into consideration. The system compares the performance of the four possibilities and chooses the optimal one. The decision is made by the Lagrangian cost function (see Section 5.4) [29].

The purpose of engaging various types of MCOT is to let the implemented system adapt to video sequences with different contents and patterns and, thus, improve the overall performance.

5.1.2 Multi-hypothesis MCOT

A unidirectional half-pel MCOT has been introduced in [9]. To enrich the available combinations of motion models for MCOT, it is necessary to consider the additional combination of the bidirectional MCOT and half-pel motion estimation. This combination requires an extension of multi-hypothesis motion compensation beyond the 1-hypothesis, 2-hypothesis, and 4-hypothesis cases.

The unidirectional MCOT can be considered as 1-hypothesis with integer motion estimation. With half-pel motion estimation, we have 2-hypothesis motion estimation for positions $p_1$, $p_3$, $p_5$, and $p_7$ and 4-hypothesis motion estimation for positions $p_2$, $p_4$, $p_6$, and $p_8$ (see Fig. 2.2). In the case of the bidirectional half-pel MCOT, suppose we have two reference frames A (the previous frame) and B (the subsequent frame). The unidirectional half-pel MCOT has three hypothesis types:

• 1-hypothesis for frame A or B (using 1A or 1B for short)

• 2-hypothesis for frame A or B (2A or 2B)

• 4-hypothesis for frame A or B (4A or 4B).

When this is extended to the bidirectional case, there are nine possible combinations in total (three types for frame A times three types for frame B), as shown in Fig. 5.2.

Figure 5.2: Multi-hypothesis for bidirectional half-pel motion estimation.

As we can see, there are four new kinds of multi-hypothesis motion estimation: 3-hypothesis, 5-hypothesis, 6-hypothesis, and 8-hypothesis. To construct the transform matrices for these new kinds of hypothesis, we need to consider the idea of energy compaction and distribution.

From the bidirectional transform matrix shown in Eq. 2.8, we can observe that the orthogonal matrix $H_1$ performs energy concentration on the first and third pixels: their energy is compacted into the third pixel. $H_2$ then concentrates the energy from the second pixel into the third pixel. At this point, the energy of all three pixels is compacted into the third one. Finally, $H_3$ splits the compacted energy back to the first and the third pixel. So the energy in the second pixel is shifted to the other two pixels, and the second pixel becomes the high-band pixel. With the same idea, we can construct the additional transform matrices.

Figure 5.3: An example of 6-hypothesis motion estimation.

An example of 6-hypothesis MCOT

Fig. 5.3 is an example of 6-hypothesis (4A+2B) motion estimation. The reference frame A provides a 4-hypothesis motion estimation, which means that the half-pel position $p_2$ is the average of the four neighbouring integer pixels. The reference frame B provides a 2-hypothesis motion estimation, in which the half-pel position $p_5$ is the average of the two neighbouring integer pixels. The MCOT compacts the energy of the seven coefficients into one coefficient and then distributes the whole energy back to the six low-band pixels ($p_A \sim p_D$ in A and $p_A$ and $p_H$ in B), leaving an energy-removed high-band pixel (grey $p_A$). Eqs. 5.2 to 5.4 present the sub-transform matrices $H_a$ to $H_f$ that construct $H$. Each sub-matrix deals with two pixels at a time. The energy is gradually compacted into one pixel by the Euler angles $\phi_1$ to $\phi_6$. The distribution of the energy is determined by $\phi_7$ to $\phi_{11}$.

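The transform matrix for this 6-hypothesis motion estimation is

$$H = H_a(\phi_{11})H_b(\phi_{10})H_c(\phi_9)H_d(\phi_8)H_e(\phi_7)H_f(\phi_6) \cdot H_e(\phi_5)H_d(\phi_4)H_c(\phi_3)H_b(\phi_2)H_a(\phi_1), \tag{5.1}$$

where $H_a$ to $H_f$ are $7 \times 7$ matrices with

$$H_a = \begin{pmatrix}
\cos\phi & \sin\phi & 0 & 0 & 0 & 0 & 0 \\
-\sin\phi & \cos\phi & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}, \quad
H_b = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & \cos\phi & \sin\phi & 0 & 0 & 0 \\
0 & 0 & -\sin\phi & \cos\phi & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix} \tag{5.2}$$

$$H_c = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & \cos\phi & \sin\phi & 0 \\
0 & 0 & 0 & 0 & -\sin\phi & \cos\phi & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}, \quad
H_d = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & \cos\phi & 0 & \sin\phi & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & -\sin\phi & 0 & \cos\phi & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix} \tag{5.3}$$

$$H_e = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & \cos\phi & 0 & \sin\phi & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & -\sin\phi & 0 & \cos\phi & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}, \quad
H_f = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & \cos\phi & \sin\phi \\
0 & 0 & 0 & 0 & 0 & -\sin\phi & \cos\phi
\end{pmatrix} \tag{5.4}$$

and $\phi_1 \sim \phi_{11}$ are determined by energy concentration constraints.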
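To make the structure of Eq. 5.1 concrete, the following sketch composes $H$ from the six plane rotations of Eqs. 5.2 to 5.4 and checks numerically that the product is orthogonal and preserves energy. The angles used here are arbitrary placeholders, not the energy-concentrating angles of this work, and the helper names (`rotation`, `build_h`, and so on) are hypothetical.

```python
import math

N = 7  # seven pixels take part in the 6-hypothesis example

# 0-based index pairs rotated by Ha ... Hf, read off Eqs. 5.2 to 5.4
PAIRS = {"a": (0, 1), "b": (2, 3), "c": (4, 5), "d": (1, 3), "e": (3, 5), "f": (5, 6)}
# Factors of Eq. 5.1 from left to right; they carry phi_11 down to phi_1
ORDER = "abcdefedcba"

def identity():
    return [[1.0 if r == c else 0.0 for c in range(N)] for r in range(N)]

def rotation(i, j, phi):
    """Identity with the 2 x 2 rotation [[cos, sin], [-sin, cos]] embedded at rows/columns i, j."""
    h = identity()
    h[i][i], h[i][j] = math.cos(phi), math.sin(phi)
    h[j][i], h[j][j] = -math.sin(phi), math.cos(phi)
    return h

def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(N)) for c in range(N)] for r in range(N)]

def build_h(phis):
    """Compose H as in Eq. 5.1; phis[k - 1] holds phi_k for k = 1 ... 11."""
    h = identity()
    for name, k in zip(ORDER, range(11, 0, -1)):
        i, j = PAIRS[name]
        h = matmul(h, rotation(i, j, phis[k - 1]))
    return h

# Arbitrary illustrative angles, NOT the energy-concentrating ones of the thesis
phis = [0.1 * k for k in range(1, 12)]
H = build_h(phis)

x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0]                       # toy pixel values
y = [sum(H[r][c] * x[c] for c in range(N)) for r in range(N)]  # transformed coefficients
energy_in = sum(v * v for v in x)
energy_out = sum(v * v for v in y)
gram = matmul(H, [list(col) for col in zip(*H)])               # H times its transpose
print(energy_in, round(energy_out, 9))
```

Because every factor is a plane rotation, the product is orthogonal for any choice of angles; the energy concentration constraints only decide where the energy ends up.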

5.2 Obtaining Motion Vectors

In this coding system, we use both integer motion estimation and half-pel motion estimation. To obtain motion vectors, we use $J_{MV} = \min\{\sum |x_i - x_j| + \lambda_{MV} R_{MV}\}$, where $x_i$ is the reference block and $x_j$ is the current block. $\sum |x_i - x_j|$ is the sum of the absolute differences of the coefficients of $x_i$ and $x_j$, $\lambda_{MV}$ is the Lagrangian multiplier for the motion vectors, and $R_{MV}$ is the rate of the motion vectors. Normally, a higher motion vector rate can provide a better match between $x_i$ and $x_j$. If $\lambda_{MV}$ is set to zero, only the similarity of the two blocks is considered; $J_{MV}$ then yields the reference block that is most similar to the current block $x_j$.
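The cost $J_{MV}$ can be illustrated with a small full-search block matcher. This is a sketch under assumptions: it handles integer motion vectors only, and `mv_rate` is a hypothetical rate proxy, whereas this work codes the vectors with a trained Huffman dictionary. The toy frames are constructed so that the true displacement is known.

```python
def sad(ref, cur, rx, ry, cx, cy, size):
    """Sum of absolute differences between a reference block and the current block."""
    return sum(
        abs(ref[ry + dy][rx + dx] - cur[cy + dy][cx + dx])
        for dy in range(size)
        for dx in range(size)
    )

def mv_rate(mx, my):
    """Hypothetical rate proxy in bits; longer vectors are assumed to cost more."""
    return 1 + 2 * (abs(mx) + abs(my))

def find_motion_vector(ref, cur, cx, cy, size, search_range, lam):
    """Full search minimizing SAD + lam * rate; returns (cost, mx, my)."""
    height, width = len(ref), len(ref[0])
    best = None
    for my in range(-search_range, search_range + 1):
        for mx in range(-search_range, search_range + 1):
            rx, ry = cx + mx, cy + my
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate block would leave the reference frame
            cost = sad(ref, cur, rx, ry, cx, cy, size) + lam * mv_rate(mx, my)
            if best is None or cost < best[0]:
                best = (cost, mx, my)
    return best

# Toy frames: the current frame equals the reference frame displaced by (1, 2)
ref = [[x + 10 * y for x in range(12)] for y in range(12)]
cur = [[(x + 1) + 10 * (y + 2) for x in range(12)] for y in range(12)]

best_match = find_motion_vector(ref, cur, cx=4, cy=4, size=4, search_range=3, lam=0)
cheap_match = find_motion_vector(ref, cur, cx=4, cy=4, size=4, search_range=3, lam=1000)
print(best_match, cheap_match)
```

With `lam=0` the search returns the exact displacement; with a very large multiplier the rate term dominates and the zero vector is preferred, which mirrors the trade-off described above.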

Our motion compensation provides one integer motion vector $(m_{x_0}, m_{y_0})$ and the corresponding eight half-pel positions around the integer position. However, to limit implementation complexity and computation time, only the integer vector $(m_{x_0}, m_{y_0})$ and the best two half-pel motion vectors $(m_{x_1}, m_{y_1})$ and $(m_{x_2}, m_{y_2})$ are considered for practical evaluation.

Because the motion vectors are crucial to the reconstruction of the video sequences, they require lossless coding. Huffman coding is used to code them.

5.3 Variable Block Size

In addition to the multiple types of MCOTs, various block sizes are used to provide more accurate block-based motion estimation. A macroblock of size $m \times n$ is partitioned into smaller blocks of size $m \times \frac{n}{2}$, $\frac{m}{2} \times n$, and $\frac{m}{2} \times \frac{n}{2}$. Fig. 5.4 depicts a macroblock of 16 × 16 segmented into subblocks of size 16 × 8, 8 × 16, and 8 × 8. The motion estimation provides one motion vector for each subblock. A maximum of four motion vectors is transmitted for a macroblock if the subblock size 8 × 8 is chosen. In our case, 9 (= 1 + 2 + 2 + 4) motion vectors are stored for each macroblock before the MCOT. Our system evaluates all four subblock types to determine which block size is optimal (see Section 5.4).

Figure 5.4: Partitions of a macroblock of 16x16 for motion estimation.
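The partition modes and the count of nine stored motion vectors can be reproduced with a few lines of Python. The sketch assumes that $m$ denotes the width and $n$ the height of the macroblock, which the text does not state explicitly; the function name `partition_modes` is hypothetical.

```python
def partition_modes(m=16, n=16):
    """Subblocks (x, y, width, height) for the four partition modes of an m x n macroblock."""
    hm, hn = m // 2, n // 2
    return {
        f"{m}x{n}": [(0, 0, m, n)],
        f"{m}x{hn}": [(0, 0, m, hn), (0, hn, m, hn)],
        f"{hm}x{n}": [(0, 0, hm, n), (hm, 0, hm, n)],
        f"{hm}x{hn}": [(x, y, hm, hn) for y in (0, hn) for x in (0, hm)],
    }

modes = partition_modes()
# One motion vector per subblock, so the counts per mode are 1, 2, 2, and 4
vectors_per_mode = {name: len(blocks) for name, blocks in modes.items()}
total_vectors = sum(vectors_per_mode.values())
print(vectors_per_mode, total_vectors)
```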

Summarizing the description above, there are three levels of combinations inside our system:

• Motion compensation for each type of MCOT

• Different types of MCOT for each subblock

• Variable block sizes for each macro block.

Fig. 5.5 illustrates the structure of these three levels of combinations. The first and most detailed level evaluates the nine possible motion vectors for a particular subblock with a particular type of MCOT. The second level evaluates the performance of the different transform types for each subblock given the different motion vectors. That means that after the second level, we have a number of combinations of motion vectors and transform types for the subblocks. Finally, the last level determines which block segmentation is best for a macroblock. In the end, the system gives the optimal combination of motion vectors, transform types, and subblock type for each macroblock.

Figure 5.5: Structure of the minimization of the cost function with the three levels.

5.4 Mode Decision

The purpose of our Lagrangian cost function is to determine the optimal combination of our various parameters and to achieve an efficient trade-off between the rate of the parameters and the rate of the coefficients. Let $R_p$ be the parameter rate, i.e., the sum of the rate of the motion vectors $R_p(mv)$, the rate of the MCOT types $R_p(t)$, and the rate of the subblock sizes $R_p(s)$. They are obtained from the motion estimation, the transform types, and the various block sizes, respectively. Let $\sigma_H^2$ denote the variance of the high band. The relationship between $R_p$ and $\sigma_H^2$ has been studied in Chap. 3. Our Lagrangian cost function is

$$J = \sigma_H^2 + \lambda R_p \tag{5.5}$$

$$\phantom{J} = \sigma_H^2 + \lambda\,\bigl(R_p(mv) + R_p(t) + R_p(s)\bigr). \tag{5.6}$$

For practical implementation, the multiplier $\lambda$ is set to 1.

This cost function is defined per macroblock. Because variable block sizes split a macroblock into subblocks, the cost function can be expanded to

$$J = \sigma_H^2 + \lambda\left(\sum_{i=1}^{N}\bigl(R_p(mv, i) + R_p(t, i)\bigr) + R_p(s)\right), \tag{5.7}$$

where $N$ is the number of subblocks. Here $N$ can be 1, 2, or 4, as shown in Fig. 5.4. Note that even when there are subblocks, we still sum the parameter rates up to the level of a macroblock:

$$R_p(mv) = \sum_i R_p(mv, i), \tag{5.8}$$

$$R_p(t) = \sum_i R_p(t, i). \tag{5.9}$$

In this system, different macroblocks contain different combinations of the parameters. Different subblocks within one macroblock can also have different transform types. Take a macroblock partitioned into blocks of size 16 × 8 as an example. Assume that the optimal transform type for the first subblock, Subblock0, is Type1 (left unidirectional MCOT). The transform type for the second subblock, Subblock1, can be any type from Type0 to Type3. The constraint, however, is that the total cost of these two subblocks should be the minimum among all combinations of transform types.

Finally, we obtain the optimal subblocks, transform types, and motion vectors for each macroblock. After the MCOT, the subband coefficients are processed by the spatial transform, uniform deadzone quantization, and EBCOT.
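The mode decision of Eq. 5.7 can be sketched as a nested minimization. The sketch below is a simplification under assumptions: each candidate merges a motion vector and a transform type into one labeled option with a combined rate, the high-band energy of a macroblock is taken as the sum of per-subblock contributions so that subblocks can be optimized independently, and all numbers are made up for illustration. The function names are hypothetical.

```python
def best_option(options, lam):
    """Pick the (label, high_band_energy, rate_bits) tuple with the lowest energy + lam * rate."""
    return min(options, key=lambda opt: opt[1] + lam * opt[2])

def choose_partition(candidates, partition_rate, lam=1.0):
    """Return (cost, partition, chosen labels) minimizing J over all partitions."""
    best = None
    for partition, subblocks in candidates.items():
        picks = [best_option(options, lam) for options in subblocks]
        energy = sum(p[1] for p in picks)                              # stand-in for sigma_H^2
        rate = sum(p[2] for p in picks) + partition_rate[partition]   # Rp(mv) + Rp(t) + Rp(s)
        cost = energy + lam * rate
        if best is None or cost < best[0]:
            best = (cost, partition, [p[0] for p in picks])
    return best

# Made-up numbers for one macroblock: two partitions, a few options per subblock
candidates = {
    "16x16": [[("bi", 40.0, 6), ("left", 55.0, 4)]],
    "16x8": [
        [("left", 10.0, 4), ("bi", 9.0, 7)],
        [("right", 12.0, 4), ("intra", 60.0, 1)],
    ],
}
partition_rate = {"16x16": 1, "16x8": 2}
decision = choose_partition(candidates, partition_rate, lam=1.0)
print(decision)
```

In this toy case the 16 × 8 partition wins because the energy saved by the two better-matched subblocks outweighs the extra parameter rate.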


Chapter 6

Experimental Results

For the experiments, we use the test videos Foreman and Mother & Daughter. The motion compensation uses a macroblock size of 16 × 16 and a search range of ±20. The dictionary for Huffman coding of the motion vectors is established from the five training videos Foreman, Carphone, Salesman, Claire, and Mother & Daughter, each with 288 frames. The performance is evaluated by the PSNR

$$\mathrm{PSNR} = 10\log_{10}\left(\frac{255^2}{\mathrm{MSE}}\right). \tag{6.1}$$
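Eq. 6.1 can be evaluated with a few lines of Python. The sketch assumes 8-bit luminance frames stored as lists of rows; the function name `psnr` and the toy frames are illustrative only.

```python
import math

def psnr(original, reconstructed, peak=255):
    """PSNR in dB (Eq. 6.1) between two equally sized luminance frames given as lists of rows."""
    squared_error = 0
    count = 0
    for row_o, row_r in zip(original, reconstructed):
        for a, b in zip(row_o, row_r):
            squared_error += (a - b) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(peak ** 2 / mse)

frame = [[16, 128], [200, 255]]
noisy = [[17, 127], [201, 254]]   # every pixel is off by one, so MSE = 1
value = psnr(frame, noisy)
print(round(value, 2))
```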

The JasPer software is used for entropy coding here. It is an implementation, written in the C programming language, of the codec specified in the JPEG2000 Part 1 standard (ISO/IEC 15444-1). It has been verified that JasPer and JJ2000 (JPEG2000 Part 1 in Java) give almost the same coding performance.

Figs. 6.1 and 6.2 present the PSNR of the luminance signal over the rate for Foreman and Mother & Daughter with different transform types. The first curve is the proposed transform, which is an efficient combination of variable block size, different transform types, and half-pel accurate motion compensation. The second curve is the bidirectional MCOT without the left/right unidirectional MCOT. The third one is also the bidirectional MCOT, but without variable block size or half-pel motion compensation. The fourth curve is the Haar wavelet transform without variable block size or half-pel motion compensation. The last one is intra coding without a temporal transform.

In Fig. 6.1, there is a large gap between intra coding and the Haar wavelet transform. The bidirectional MCOT shows a 2 to 4 dB improvement over the Haar wavelet transform. If variable block size and half-pel motion accuracy are used, we gain an additional 1 dB. Finally, when the transform is constructed from unidirectional and bidirectional MCOTs, our proposed system gains another 0.5 dB compared to a single bidirectional MCOT. As shown in the figures, the proposed MCOT outperforms the other transforms. Fig. 6.2 shows the same result: the proposed system performs best among the compared schemes.

Figure 6.1: Luminance PSNR [dB] vs. rate [bpp] for the QCIF sequence Foreman at 30 fps with 64 frames and a GOP size of 8 frames. The compared transforms include the proposed MCOT, the bidirectional MCOT with variable block size (VBS) and half-pel motion compensation (HP), the bidirectional MCOT without VBS or HP, the Haar wavelet transform without VBS or HP, and the intra coding.

Figure 6.2: Luminance PSNR [dB] vs. rate [bpp] for the QCIF sequence Mother & Daughter at 30 fps with 64 frames and a GOP size of 8 frames. The compared transforms include the proposed MCOT, the bidirectional MCOT with variable block size (VBS) and half-pel motion compensation (HP), the bidirectional MCOT without VBS or HP, the Haar wavelet transform without VBS or HP, and the intra coding.


Chapter 7

Conclusions

The goal of this project is to implement an efficient video coding scheme that combines various kinds of motion-compensated orthogonal transforms.

The first part of the report proposes a theoretical lossless signal model for the orthogonal transform. The signal model is based on the Gaussian distribution assumption. From this model, we find an optimal rate combination for the rate of the coefficients and the rate of the parameters. The relationship between the transform and entropy coding is studied, and a cost function for the orthogonal transform is constructed. This cost function is also used in the practical implementation to make mode decisions. Numerical results for the memoryless Gaussian model are presented to show the optimal rate allocation for a given transform function.

The second part of the report describes an efficient combination of the MCOTs. The combination includes multiple types of MCOTs, variable block sizes, and half-pel motion estimation. The experimental results show that using variable block sizes and half-pel motion estimation improves the PSNR performance significantly. Combined with multiple types of MCOTs, the PSNR can be increased by another 0.3 to 0.6 dB. From these results, we see that our proposed system outperforms the individual motion-compensated orthogonal transforms.


Bibliography

[1] Video Codec for Audiovisual Services at p × 64 kbit/s. ITU-T Recommendation H.261, 1990.

[2] K. Rijkse. H.263: video coding for low-bit-rate communication. Communications Magazine, IEEE, 34(12):42–45, Dec. 1996.

[3] Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s - Part 2: Video. Int. Standards Org./Int. Electrotech. Comm. (ISO/IEC) JTC 1, 1993.

[4] Coding of audio-visual objects - Part 2: Visual. Int. Standards Org./Int. Electrotech. Comm. (ISO/IEC) JTC 1, 1999-2003.

[5] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, July 2003.

[6] G. Sullivan and T. Wiegand. Video compression - from concepts to the H.264/AVC standard. Proceedings of the IEEE, 93(1), 2005.

[7] M. Flierl and B. Girod. A motion-compensated orthogonal transform with energy-concentration constraint. In Proc. of the IEEE International Workshop on Multimedia Signal Processing, pages 391–394, Oct. 2006.

[8] M. Flierl and B. Girod. A new bidirectionally motion-compensated orthogonal transform for video coding. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 1, pages I–665–I–668, Apr. 2007.

[9] M. Flierl and B. Girod. Half-pel accurate motion-compensated orthogonal video transforms. In Data Compression Conference, 2007. DCC ’07, pages 13–22, Mar. 2007.

[10] M. Flierl and B. Girod. A double motion-compensated orthogonal transform with energy concentration constraint. In Proceedings of the SPIE Conference on Visual Communications and Image Processing, page 6508, 2007.


[11] B. Girod. Efficiency analysis of multihypothesis motion-compensatedprediction for video coding. Image Processing, IEEE Transactions on,9(2):173–183, Feb. 2000.

[12] O. Barry, Du Liu, S. Richter, and M. Flierl. Robust motion-compensated orthogonal video coding using EBCOT. In Image andVideo Technology (PSIVT), 2010 Fourth Pacific-Rim Symposium on,pages 264–269, Nov. 2010.

[13] M. Flierl. Adaptive spatial wavelets for motion-compensated orthogonalvideo transforms. In Proc. of the IEEE International Conference onImage Processing (ICIP), pages 1045–1048, Nov. 2009.

[14] P.F. Panter and W. Dite. Quantization distortion in pulse-count mod-ulation with nonuniform spacing of levels. Proceedings of the IRE,39(1):44–48, Jan. 1951.

[15] Sangsin Na and David L. Neuhoff. On the support of mse-optimal,fixed-rate, scalar quantizers. IEEE Trans. Inform. Theory, 47:2972–2982, 2001.

[16] H. Gish and J. Pierce. Asymptotically efficient quantizing. InformationTheory, IEEE Transactions on, 14(5):676–683, Sep. 1968.

[17] C. Gunter and A. Rothermel. Quantizer and entropy effects on EBCOT based compression. Consumer Electronics, IEEE Transactions on, 53(2):661–666, May 2007.

[18] D. S. Taubman and M. W. Marcellin. JPEG2000 Image Compression Fundamentals, Standards and Practice. Kluwer Academic Publishers, Boston/Dordrecht/London, first edition, 2002.

[19] T.A. Welch. A technique for high-performance data compression. Computer, 17(6):8–19, June 1984.

[20] J.M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. Signal Processing, IEEE Transactions on, 41(12):3445–3462, Dec. 1993.

[21] A. Said and W.A. Pearlman. A new, fast, and efficient image codec based on set partitioning in hierarchical trees. Circuits and Systems for Video Technology, IEEE Transactions on, 6(3):243–250, June 1996.

[22] N. Ahmed, T. Natarajan, and K.R. Rao. Discrete cosine transform. Computers, IEEE Transactions on, C-23(1):90–93, Jan. 1974.

[23] D. Taubman. High performance scalable image compression with EBCOT. IEEE Transactions on Image Processing, 9(7):1158–1170, July 2000.

[24] F. Auli-Llinas and M.W. Marcellin. Distortion estimators for bitplane image coding. Image Processing, IEEE Transactions on, 18(8):1772–1781, Aug. 2009.

[25] Michael D. Adams and Faouzi Kossentini. JasPer: A software-based JPEG-2000 codec implementation, 2000.

[26] M. Flierl and B. Girod. Investigation of motion-compensated lifted wavelet transforms. In Proceedings of the Picture Coding Symposium, pages 59–62, 2003.

[27] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, 2006.

[28] M.W. Marcellin, M.J. Gormish, A. Bilgin, and M.P. Boliek. An overview of JPEG-2000. In Proc. of the IEEE Data Compression Conference, pages 523–541, Mar. 2000.

[29] H. Everett. Generalized Lagrange multiplier method for solving problems of optimum allocation of resources. Oper. Res., 11:399–417, 1963.