

Spatially Scalable Video Coding Based On Hybrid Epitomic Resizing

Qijun Wang, Ruimin Hu, Zhongyuan Wang
National Engineering Research Center on Multimedia Software (NERCMS), Wuhan University, China
{wangqijun308,hrm1964,wzy_hope}@163.com

Abstract

Scalable video coding (SVC) is considered a promising solution for adapting video to heterogeneous networks and diverse devices. In a spatially scalable video encoder, how the captured high-resolution video is resized to obtain the low-resolution video strongly affects the quality of experience (QoE) of clients receiving the low-resolution stream. In this paper, we propose a new resizing algorithm called hybrid epitomic resizing (HER), which exploits texture similarity within the image so that the resized image preserves the same 'physical' resolution as the original, and which highlights regions of interest while avoiding potential artifacts. For hybrid epitomic resizing, we also design two new inter-layer prediction methods that replace conventional inter-layer prediction in eliminating the redundancy between adjacent spatial layers. Experimental results show that HER produces resized images of perceptually much better quality, and that the performance of the new inter-layer prediction is comparable to that of conventional inter-layer prediction in H.264 SVC.

1. Introduction

Through the Internet, multimedia content is widely used for sharing information among users. Transparent access to it from almost anywhere, at any time, and through all kinds of devices is often desired and required. At the user's end, hand-held devices such as cellular phones, smartphones, PDAs, and Pocket PCs are in widespread use for their mobility and portability. Different devices connect to networks with different bandwidths, and they also differ in display resolution and power capacity. To enable such universal multimedia access, it is essential for the video bitstream to adapt to these varying circumstances.

Most recently, scalable video coding [1] has been considered a promising solution for enabling universal access to video over heterogeneous networks, in place of computationally complex transcoding. Scalability is accomplished by providing multiple versions of a video stream, so that the same content is obtainable at different qualities by different clients. Scalability comes in three types: temporal, spatial, and quality; among these, spatial scalability is used to adapt to the various display resolutions of clients. For spatially scalable video coding, spatial resizing must be carried out between video capture and encoding to obtain video at the different resolutions corresponding to the different spatial layers, and the redundancy between adjacent spatial layers is removed through inter-layer prediction, which should match the resizing algorithm to maximize coding efficiency. In the current literature, there are two main kinds of resizing algorithms: the scaling method and the cropping method [2]. The scaling method subsamples each frame of the video to preserve the video content intact, but once the visual content is scaled down beyond its


minimal perceptible size, the quality of service (QoS) or quality of experience (QoE) is usually far from acceptable. Moreover, when an object in the video is not large enough, the scaling method renders it too small to be of any value. Matching the scaling method, the inter-layer prediction defined in the scalable extension of H.264 (H.264 SVC) [1] for spatial scalability is used to eliminate the inter-layer information redundancy. To maintain sufficient resolution for the image regions of interest (ROI) of users, the alternative cropping method is used to generate the small video: it discards part of the surroundings to highlight specific user interests. In this case, the method for reducing the redundancy between adjacent layers is similar to that of coarse-grain SNR scalability in H.264 SVC, with the same physical image resolution in the different layers. In practice, owing to the opposite biases of the two methods, an adaptation engine has to make a visual trade-off between subject readability and content completeness, a difficult task requiring interaction with users. In most situations, sacrificing either aspect is intolerable, because both are important to the viewing experience; the ideal solution is to combine the advantages of both methods.

Figure 1. Scalable video coding and its adaptability to various devices²

In the field of computer vision, epitomic analysis [3] of images and video has gained great attention. By fusing image content with similar texture features, the image epitome captures an essential representation of the original image, preserving its global texture and shape features at a size of only a fraction of the original, or even less. But the image epitome loses the location information of the pixels of the original image, so it is not directly suitable for the resizing task. In follow-up work, Simakov et al. [4] added a coherence term to the energy function of [3] to form a bidirectional energy function describing the bidirectional similarity between the original image and the resized one (the image epitome); the original image can then be resized by minimizing this bidirectional energy function. Bidirectional similarity ensures that each patch of the input image is represented in the output in a satisfactory manner. With this method, however, the resized image starts to show artifacts when the image is shrunk too aggressively. In this paper, we propose a new resizing algorithm called hybrid epitomic resizing (HER). Compared with previous work, its novelty lies in two aspects. First, unlike Simakov's method (which we call epitomic resizing hereafter), we separate the background from the regions of interest, feed only the background image into the epitomic resizing process, and finally re-integrate the foreground objects with the resized background to generate the final resized image. Second, template matching and motion estimation/compensation are designed as inter-layer intra prediction for spatial scalability, to keep in line with hybrid epitomic resizing.

¹ In practice, high-resolution video is usually captured for the encoder and subsampled to generate the low-resolution one.
² Drawn by Thomas Wiegand, available at http://ip.hhi.de/imagecom_G1/savce/index.htm


This paper is organized as follows. The hybrid epitomic resizing algorithm is described in detail in the second section; the third section focuses on the new inter-layer prediction methods for hybrid epitomic resizing; experimental results and analysis are reported in the fourth section; and the fifth section concludes.

2. Hybrid Epitomic Resizing

The epitomic analysis of an image was first reported by N. Jojic [3] and later extended and used for image resizing by Simakov et al. [4]. The methods of [3, 4] divide the original image into a set of image patches of different sizes, and the patches may overlap with one another. Given a specific original image, the image epitome can be computed by Maximum Likelihood Estimation (MLE) using the EM (Expectation Maximization) algorithm.

Figure 2. Image epitome (right) and the mapping between the original image (left) and the image epitome

In the computation process, an energy function $\phi(x, z)$ is defined to describe the similarity between the original image and the image epitome:

$$\phi(x, z) \;=\; \underbrace{\frac{1}{|X|}\sum_{p \in X} \left\| x_p - z_{p^*} \right\|^2}_{\text{completeness}} \;+\; \alpha\,\underbrace{\frac{1}{|Z|}\sum_{q \in Z} \left\| z_q - x_{q^*} \right\|^2}_{\text{coherence}} \qquad (1)$$

In the above equation, $x$ represents the original image and $z$ the image epitome. The energy function consists of two terms. The first term measures the 'completeness' with which the image epitome describes the original image; minimizing it means finding the best match in the image epitome for every patch of the original image. Here $X$ denotes the set of image patches, a patch $p \in X$ is the coordinate set of an image patch sampled from the original image, and $p^*$ is the corresponding patch in the image epitome; thus $x_p$ is the matrix of pixel values on the coordinate set $p$, and $z_{p^*}$ is the matrix of pixel values on the coordinate set $p^*$, where $z_{p^*}$ is the best match in $z$ for $x_p$. The second term measures the 'coherence' of the image epitome as an image; minimizing it means finding the best match in the original image for every patch of the image epitome. Here $Z$ denotes the set of image patches, a patch $q \in Z$ is the coordinate set of an image patch sampled from the image epitome, and $q^*$ is the corresponding patch in the original image; thus $z_q$ is the matrix of pixel values on the coordinate set $q$, and $x_{q^*}$ is the matrix of pixel values on the coordinate set $q^*$, where $x_{q^*}$ is the best match in $x$ for $z_q$.
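To make the two terms concrete, the following sketch computes the energy of equation (1) by brute-force nearest-neighbor matching over the patch sets. This is a minimal NumPy illustration with hypothetical helper names; our actual implementation uses approximate nearest-neighbor search via the ANN library [9].

```python
import numpy as np

def extract_patches(img, size=8, stride=2):
    """All size x size patches of `img`, flattened, sampled with `stride`."""
    img = img.astype(np.float64)
    h, w = img.shape
    return np.array([img[i:i+size, j:j+size].ravel()
                     for i in range(0, h - size + 1, stride)
                     for j in range(0, w - size + 1, stride)])

def min_sq_dists(queries, refs):
    """Squared distance from each query patch to its nearest reference patch."""
    # ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2, evaluated for all pairs at once
    d = (np.sum(queries ** 2, axis=1)[:, None]
         - 2.0 * queries @ refs.T
         + np.sum(refs ** 2, axis=1)[None, :])
    return d.min(axis=1)

def bidirectional_energy(x, z, alpha=1.0):
    """Equation (1): completeness term plus alpha-weighted coherence term."""
    X = extract_patches(x)   # patches of the original image
    Z = extract_patches(z)   # patches of the epitome
    completeness = min_sq_dists(X, Z).sum() / len(X)
    coherence = min_sq_dists(Z, X).sum() / len(Z)
    return completeness + alpha * coherence
```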

Before the EM algorithm starts, $z$ must be initialized randomly. In each iteration of the EM algorithm, the E-step and the M-step are performed alternately.

E-step: for every $x_p$, $p \in X$, search for the best-matching block $z_{p^*}$ in $z$; then, for every $z_q$, $q \in Z$, search for the best-matching block $x_{q^*}$ in $x$.

M-step: update $z$ to minimize the energy function $\phi(x, z)$ according to the matched sets $\{x_{q^*}\}$ and $\{x_p\}$; the detailed update rule is given in equation (2):

$$z(i,j) \;=\; \frac{\frac{1}{|X|}\sum_{p \in X}\delta_{p^*}(i,j)\,x_p(i,j) \;+\; \frac{\alpha}{|Z|}\sum_{q \in Z}\delta_q(i,j)\,x_{q^*}(i,j)}{\frac{1}{|X|}\sum_{p \in X}\delta_{p^*}(i,j) \;+\; \frac{\alpha}{|Z|}\sum_{q \in Z}\delta_q(i,j)} \qquad (2)$$

In equation (2), $(p, p^*)$ and $(q, q^*)$ are the pairs of corresponding coordinate sets found in the E-step. $x_p(i,j)$ denotes the value of the pixel in $x_p$ corresponding to position $(i,j)$ in $p^*$, and $x_{q^*}(i,j)$ denotes the value of the pixel in $x_{q^*}$ corresponding to position $(i,j)$ in $q$. The functions $\delta_{p^*}(i,j)$ and $\delta_q(i,j)$ are delta functions defined by equation (3):

$$\delta_b(i,j) \;=\; \begin{cases} 1 & \text{if } (i,j) \in b \\ 0 & \text{if } (i,j) \notin b \end{cases} \qquad (3)$$

After this update, the current iteration is finished, and the E-step of a new iteration starts with the modified $z$.
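As an illustration, one such iteration can be sketched as follows. This is a minimal sketch under our assumptions (grayscale images, brute-force matching, the same patch geometry for both terms); the per-pixel vote accumulation realizes the $\delta$ bookkeeping of equation (2).

```python
import numpy as np

def patch_grid(img, size, stride):
    """Top-left corners of all size x size patches sampled with `stride`."""
    h, w = img.shape
    return [(i, j) for i in range(0, h - size + 1, stride)
                   for j in range(0, w - size + 1, stride)]

def best_match(patch, ref, size):
    """Top-left corner of the reference patch with minimal squared distance."""
    h, w = ref.shape
    best, best_d = (0, 0), np.inf
    for i in range(h - size + 1):
        for j in range(w - size + 1):
            d = np.sum((ref[i:i+size, j:j+size] - patch) ** 2)
            if d < best_d:
                best, best_d = (i, j), d
    return best

def em_iteration(x, z, alpha=1.0, size=8, stride=2):
    """One E-step/M-step pass realizing the update rule of equation (2)."""
    num = np.zeros_like(z, dtype=np.float64)  # weighted pixel votes
    den = np.zeros_like(z, dtype=np.float64)  # accumulated weights
    Xp = patch_grid(x, size, stride)
    Zq = patch_grid(z, size, stride)
    # Completeness: each original patch x_p votes at its match z_{p*}.
    for (i, j) in Xp:
        xp = x[i:i+size, j:j+size]
        zi, zj = best_match(xp, z, size)              # E-step search in z
        num[zi:zi+size, zj:zj+size] += xp / len(Xp)
        den[zi:zi+size, zj:zj+size] += 1.0 / len(Xp)
    # Coherence: each epitome patch z_q receives votes from its match x_{q*}.
    for (i, j) in Zq:
        zq = z[i:i+size, j:j+size]
        xi, xj = best_match(zq, x, size)              # E-step search in x
        num[i:i+size, j:j+size] += alpha * x[xi:xi+size, xj:xj+size] / len(Zq)
        den[i:i+size, j:j+size] += alpha / len(Zq)
    # M-step: every epitome pixel becomes the weighted mean of its votes.
    return np.where(den > 0, num / np.maximum(den, 1e-12), z)
```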

The EM algorithm only reaches locally optimal solutions, so the initialization at its start has a great effect on the final result. If, on the other hand, the gap in size between the original image $x$ and the image epitome $z$ is only minor (in our experiments, $|z| = 0.9\,|x|$, where $|\cdot|$ denotes the image size), then a subtle scaling-down of the original image to the size of the image epitome serves as a good initialization, since all source patches are present with only minor changes in appearance. In such an 'ideal' case, iterative refinement with the update rule of equation (2) converges to a good solution. Even though such gradual resizing provides a good approximation to the best result, artifacts may appear when the total resizing ratio is high, as in Figures 4(e) and 5(e). These artifacts have a strongly negative effect on image quality, especially in the regions of interest of users.

To avoid artifacts in regions of interest, we first extract the regions of interest and exclude them from epitomic resizing. Regions of interest can be determined by visual attention; previous research has shown that this physiological process can be modeled by a saliency-based attention model that computes an attractiveness value for each pixel or image block. Details of the saliency-based attention model are beyond the scope of this paper; the reader is referred to [8].

The background part of the original image is used to compute its epitome at the target size; the removed image regions are initially filled with random values, and over the iterations the removed part is inpainted progressively while the image is being resized. Detailed results are shown in Figure 3.
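Putting the pieces together, the overall HER procedure can be sketched as below. This is a sketch under our stated assumptions: `em_iteration` is the routine sketched above, and `reintegrate_roi` is a hypothetical helper standing in for the saliency-driven re-integration step.

```python
import numpy as np

def simple_resize(img, shape):
    """Nearest-neighbor stand-in for the subtle per-scale downscaling."""
    ys = np.arange(shape[0]) * img.shape[0] // shape[0]
    xs = np.arange(shape[1]) * img.shape[1] // shape[1]
    return img[np.ix_(ys, xs)]

def hybrid_epitomic_resize(frame, roi_mask, target_shape,
                           shrink=0.9, iters_per_scale=10):
    """Sketch of HER: epitomically resize the background only.

    Pixels under `roi_mask` are removed before resizing; they are filled
    with random values, which the coherence votes of equation (2) then
    inpaint progressively while the image shrinks.
    """
    background = frame.astype(np.float64).copy()
    background[roi_mask] = np.random.uniform(0, 255, int(roi_mask.sum()))
    z = background
    # Gradual resizing: shrink by ~10% per scale so that each scale starts
    # from a good initialization, then refine with a few EM iterations.
    while z.shape != tuple(target_shape):
        next_shape = (max(int(z.shape[0] * shrink), target_shape[0]),
                      max(int(z.shape[1] * shrink), target_shape[1]))
        z = simple_resize(z, next_shape)       # initialization for this scale
        for _ in range(iters_per_scale):
            z = em_iteration(background, z)    # update rule of equation (2)
    return reintegrate_roi(z, frame, roi_mask)  # hypothetical re-integration
```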


Figure 3. Hybrid epitomic resizing: (a) the original image in CIF format; (b) and (c) the separated regions of interest and background, respectively; (d) the background after the first gradual resizing iteration, at the same size as the original image; (e) the background after six gradual resizing iterations; (f) the final result of background resizing; (g) the re-integration of (b) and (f).

3. Inter-layer Prediction for Hybrid Epitomic Resizing

After hybrid epitomic resizing, the base layer image and the enhancement layer image have the same 'physical' resolution, which differs from conventional H.264 SVC. Consequently, pixel values, motion information, and residuals cannot simply be upscaled to generate the corresponding prediction for the enhancement layer. Under hybrid epitomic resizing, motion


information and residuals in adjacent layers have low correlation, so we can only eliminate the correlation in the pixel domain; in other words, only methods like inter-layer intra prediction can be applied. Inspired by new intra-coding techniques such as intra displacement compensation (IDC) [5] and template matching prediction (TMP) [6], we use similar techniques for inter-layer prediction. Instead of the reconstructed part of the current frame, the base layer image is taken as the reference, and two additional modes, named EPITOME_IDC and EPITOME_TMP respectively, are added to the current H.264 syntax.

For EPITOME_IDC, the prediction is generated in a way similar to that of an inter-coded block: after motion estimation over the base layer image, the most similar block is taken as the prediction of the current block in the enhancement layer, and the motion vector representing the mapping between the enhancement layer and base layer images is coded as well. In H.264/AVC, the intra-coded block partitions are 4x4, 8x8, and 16x16. The smallest 4x4 blocks would give the most accurate prediction, but for the new intra mode their side information would cost the most; the largest 16x16 blocks would cost the least side information, but their prediction accuracy would be the worst. Trading off prediction accuracy against side-information cost, we choose 8x8 blocks for the new intra mode.
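For illustration, a brute-force version of the EPITOME_IDC search for one 8x8 block might look as follows. This is a minimal sketch; in the real encoder the search runs over reconstructed base-layer samples inside the rate-distortion loop.

```python
import numpy as np

def epitome_idc_predict(base, cur_block, bsize=8):
    """Full search over the base layer image for the best 8x8 predictor.

    Returns the motion vector (top, left) mapping the enhancement-layer
    block onto the base layer, plus the predictor block itself.
    """
    h, w = base.shape
    best_mv, best_sad = (0, 0), np.inf
    for i in range(h - bsize + 1):
        for j in range(w - bsize + 1):
            cand = base[i:i+bsize, j:j+bsize].astype(np.int32)
            sad = np.abs(cand - cur_block.astype(np.int32)).sum()
            if sad < best_sad:
                best_mv, best_sad = (i, j), sad
    pred = base[best_mv[0]:best_mv[0]+bsize, best_mv[1]:best_mv[1]+bsize]
    return best_mv, pred
```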

For EPITOME_TMP, a scheme similar to conventional template matching prediction is used to generate the prediction of the current block. The reconstructed L-shaped region adjacent to the current block is taken as the template, and through template matching, the block adjacent to the best-matching L-shaped region in the base layer image is taken as the prediction. In our approach, the template is adjusted according to the availability of the adjacent blocks.
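Because the decoder can repeat the identical search, EPITOME_TMP transmits no motion vector. Below is a sketch of the L-shaped template match, assuming a fixed template band of width 4 and fully available causal neighbors (the real implementation adapts the template when neighbors are missing).

```python
import numpy as np

def epitome_tmp_predict(base, recon, by, bx, bsize=8, tw=4):
    """Predict the block at (by, bx) of the enhancement layer.

    `recon` holds already-reconstructed enhancement-layer samples; the
    L-shaped template (top and left bands of width `tw`) is matched
    against the base layer image, and the block adjacent to the best
    L-match is returned as the prediction. Assumes by >= tw and bx >= tw.
    """
    def l_template(img, y, x):
        top = img[y-tw:y, x-tw:x+bsize]    # band above the block plus corner
        left = img[y:y+bsize, x-tw:x]      # band to the left of the block
        return np.concatenate([top.ravel(), left.ravel()])

    target = l_template(recon, by, bx).astype(np.int64)
    h, w = base.shape
    best, best_cost = (tw, tw), np.inf
    for i in range(tw, h - bsize + 1):
        for j in range(tw, w - bsize + 1):
            cost = np.abs(l_template(base, i, j).astype(np.int64) - target).sum()
            if cost < best_cost:
                best, best_cost = (i, j), cost
    return base[best[0]:best[0]+bsize, best[1]:best[1]+bsize]
```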

We extend the syntax element 'intra_luma_pred_mode' from 3 bits to 4 bits to support the new intra modes. The motion vector is coded with a fixed number of bits determined by the size of the base layer image; in our experiments the base layer image is QCIF, so the x and y components of the motion vector take 8 bits each. When computing the rate-distortion cost, the bits for the motion vector are taken into account.
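The corresponding bit accounting is straightforward; one plausible reading of the fixed-length rule is sketched below (the Lagrange multiplier `lam` is a placeholder, not a value from the paper).

```python
import math

def mv_bits(base_w, base_h):
    """Fixed-length code sizes for the EPITOME_IDC motion vector components."""
    # QCIF is 176x144, so 8 bits cover each coordinate range, as in the paper.
    return math.ceil(math.log2(base_w)), math.ceil(math.log2(base_h))

def rd_cost(distortion, residual_bits, base_w=176, base_h=144, lam=85.0):
    """Lagrangian cost J = D + lambda * R, with the MV bits included in R."""
    bx, by = mv_bits(base_w, base_h)       # 8 + 8 bits for QCIF
    return distortion + lam * (residual_bits + bx + by)
```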

4. Experimental Results

4.1 Hybrid epitomic resizing

We implemented our resizing algorithm on top of the ANN library [9], rebuilding the kd-tree in each iteration as the resized image changes. The original image for the enhancement layer is CIF-sized, and the resized image for the base layer is QCIF-sized. For hybrid epitomic resizing, the image patch size is set to 8x8. To build the patch set, the patch sampling period in the original image is set to 0.25, meaning that patches are sampled at offsets of (2, 0), (0, 2), and (2, 2), i.e. on a grid with stride 2 in each direction. The number of iterations is set to 10. The regions of interest are excluded from the epitomic resizing process.
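Our code uses the ANN C++ library; an equivalent Python sketch of the per-iteration rebuild-and-query step, substituting SciPy's cKDTree for ANN, would be:

```python
import numpy as np
from scipy.spatial import cKDTree

def nearest_patches(x_patches, z_patches):
    """Rebuild the kd-tree on the current epitome patches and query it.

    Called every iteration because the resized image (and hence
    `z_patches`) changes; returns, for each original-image patch, the
    index of its nearest epitome patch.
    """
    tree = cKDTree(z_patches)        # kd-tree over 64-dim (8x8) patch vectors
    _, idx = tree.query(x_patches)   # ANN searches approximately; this is exact
    return idx
```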

The test sequences 'coastguard' and 'tempete' are used to evaluate our hybrid epitomic resizing algorithm, and the scaling, cropping, and epitomic resizing algorithms are implemented for comparison. From the experimental results shown in Figures 4 and 5, we find that hybrid epitomic resizing not only preserves the global information of the original image, striking a good trade-off between the completeness and coherence of the resized image, but also raises the image quality of the regions of interest through their separation from the background and their re-integration with the resized background.


Figure 4. The results of different resizing methods on the first frame of 'coastguard': (a) scaling method; (b) cropping method; (e) Simakov's method [4]; (f) hybrid epitomic resizing.

4.2 The performance of the new inter-layer prediction

For base layer images derived from epitomic resizing in spatially scalable video coding, we implemented the inter-layer prediction methods described in Section 3 on the JM12.4 platform [7]; we call the resulting codec SSVC-HER in the following. To evaluate the coding efficiency of the proposed inter-layer prediction, we take JSVM9.14 [10] as the anchor spatially scalable video codec. In the experiments, the scalable video bitstream consists of two spatial layers: CIF for the enhancement layer and QCIF for the base layer. The input to JSVM is a CIF sequence, with the QCIF sequence derived through scaling (low-pass filtering and downsampling).


Figure 5. The results of different resizing methods on the first frame of 'tempete': (a) scaling method; (b) cropping method; (e) Simakov's method [4]; (f) hybrid epitomic resizing.

The input to our SSVC-HER is the same CIF sequence, with the QCIF sequence derived from hybrid epitomic resizing. To simplify the experiments, we encode only the first frame, in intra mode; CABAC is used in all codecs. The PSNR and size for the enhancement layer are shown in Table 1. For the base layer, SSVC-HER achieves 35.893 dB at 14464 bits, while JSVM achieves 34.018 dB at 21704 bits.

From the above experimental results, we find that when the QP difference between the base layer and the enhancement layer is small, the total coding efficiency of SSVC-HER is comparable to that of JSVM, but when the QP difference is large, JSVM outperforms SSVC-HER owing to the blurriness of the hybrid-epitomic-resized base layer image.


Table 1. PSNR and size for the enhancement layer in the experiments

QP               JSVM     SSVC-HER   Single layer
34  PSNR (dB)    31.401   31.731     31.667
    Size (bits)  33672    42128      45224
36  PSNR (dB)    30.225   30.609     30.444
    Size (bits)  20472    31192      33592
38  PSNR (dB)    29.360   29.377     29.182
    Size (bits)  12232    22080      23728
40  PSNR (dB)    28.701   28.418     28.156
    Size (bits)  7080     16456      17536

4.3 Computational complexity

Our hybrid epitomic resizing program runs for several hours, depending on image size and content, which is comparable to [4]. More than 95% of the time is spent on the nearest-neighbor search. However, this computationally heavy search can be sped up significantly by using the match locations from the previous iteration to constrain the search.
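A sketch of that acceleration, assuming we cache each patch's previous match and re-search only a small window around it:

```python
import numpy as np

def constrained_best_match(patch, ref, prev_ij, size=8, radius=4):
    """Search only a (2*radius+1)^2 window around last iteration's match.

    `prev_ij` is the (top, left) match found in the previous EM iteration;
    because the epitome changes slowly between iterations, the new optimum
    is usually nearby, which cuts the search from O(HW) to O(radius^2)
    candidate positions per patch.
    """
    h, w = ref.shape
    i0, j0 = prev_ij
    best, best_d = prev_ij, np.inf
    for i in range(max(0, i0 - radius), min(h - size, i0 + radius) + 1):
        for j in range(max(0, j0 - radius), min(w - size, j0 + radius) + 1):
            d = np.sum((ref[i:i+size, j:j+size] - patch) ** 2)
            if d < best_d:
                best, best_d = (i, j), d
    return best
```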

Computational complexity on the encoder side also increases, owing to the added motion estimation: given the image epitome, observed encoding time increases by about 50%. On the decoder side there is no obvious increase in computational complexity.

5. Conclusion

In this paper, we proposed a new resizing algorithm, HER, which makes the resized image preserve the same 'physical' resolution as the original image by exploiting texture similarity within the image, and which highlights regions of interest while avoiding artifacts. For hybrid epitomic resizing, we also designed two new inter-layer prediction methods that replace conventional inter-layer prediction in reducing the redundancy between adjacent spatial layers. Experimental results show that HER produces resized images of perceptually much better quality, and that the performance of the new inter-layer prediction is comparable to that of conventional inter-layer prediction in H.264 SVC. In future work, we will focus on accelerating the nearest-neighbor search and on striking a good trade-off between computing speed and resizing quality.

Acknowledgements

The authors are grateful to Dr. Feng Wu and Dr. Xiaoyan Sun for their valuable suggestions. This research is supported by the National Natural Science Foundation of China under Grants No. 60772106 and No. 60970160; the National Grand Fundamental Research 973 Program of China under Grant No. 2009CB320906; and the Ph.D. Candidates Self-research (including 1+4) Program of Wuhan University in 2008.

References

[1] T. Wiegand, G. J. Sullivan, J. Reichel, H. Schwarz, and M. Wien, "Joint Draft 11 of SVC Amendment," Joint Video Team, Doc. JVT-X201, Jul. 2007.
[2] W.-H. Chen, C.-W. Wang, and J.-L. Wu, "Video Adaptation for Small Display Based on Content Recomposition," IEEE Trans. Circuits and Systems for Video Technology, vol. 17, no. 1, pp. 43-58, 2007.
[3] N. Jojic, B. J. Frey, and A. Kannan, "Epitomic analysis of appearance and shape," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV'03), pp. 34-41, 2003.
[4] D. Simakov, Y. Caspi, E. Shechtman, and M. Irani, "Summarizing visual data using bidirectional similarity," in Proc. IEEE CVPR, pp. 1-8, 2008.
[5] S.-L. Yu and C. Chrysafis, "New intra prediction using intra macroblock motion compensation," JVT-C151, 3rd meeting of Joint Video Team (JVT), May 2002.
[6] T. K. Tan, C. S. Boon, and Y. Suzuki, "Intra prediction by template matching," in Proc. IEEE International Conference on Image Processing (ICIP'06), pp. 1693-1696, Oct. 2006.
[7] Joint Model Reference Software JM12.4 [Online]. Available: http://iphome.hhi.de/suehring/tml/download/old_jm/jm12.4.zip
[8] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254-1259, Nov. 1998.
[9] ANN Library [Online]. Available: http://www.cs.umd.edu/~mount/ANN/
[10] JSVM Reference Software [Online]. Available: http://ip.hhi.de/imagecom_G1/savce/downloads/SVC-Reference-Software.htm
