
Signal Processing: Image Communication 9 (1997) 249-266

Tube-based video coding

Robert Hsu*, T. Sumitomo, H. Harashima
Department of Electrical Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113, Japan
* Corresponding author.

Abstract

This paper presents a novel video compression strategy, based on structured video representation, to generate a content-addressable bit-stream supporting retrieval and composition. The structured video representation, extracted by the spatiotemporal segmentation algorithm, comprises a hierarchy of sequences, episodes, shots and motion events that constitute the building blocks of a scripted video stream. The main characteristics of this approach are that (i) it relies on the assumption that temporal redundancy can be efficiently exploited through temporal coherence on a tubewise basis, (ii) it controls the bit assignment according to motion complexity and temporal relations among video elements, (iii) it parameterizes the motion of the entire tube with an affine motion model, and (iv) it synthesizes images from representative texture patterns or anatomical models.

The coding-decoding system is based on the construction of a specific class of video transformations, a tube code, which, when interpolated and composed according to the temporal relations, produces a sequence of images that approximates the original. Coding efficiency is enhanced because the structured video representation allows optimal reduction of temporal redundancy. We show how to design such a system for the coding of the CIF-format color digital video 'Miss America' (30 frames/s) at rates of 10 and 37 kb/s, using a 3-D face wireframe model and MC-DCT to encode textural changes, respectively. The structured video representation, once marked semantically, can facilitate interactive and content-based operations on image sequences, such as editing, browsing and content-based access and filtering. © 1997 Published by Elsevier Science B.V.

1. Introduction

With the convergence of common applications of the telecommunications, computer and entertainment industries, the resulting new expectations and requirements necessitate a visual coding scheme that provides interactivity and high compression. A few key functionalities which are not well supported by existing coding schemes include very low bit-rate video coding, content-based access, hybrid natural and synthetic data coding, and content-based manipulation and bit-stream editing. In the context of video compression, the coding scheme shall provide efficient transmission of visual data on low-bandwidth channels and efficient storage of visual data on limited-capacity media. From the interactivity perspective, the coding scheme shall provide data access based on the visual content by using various accessing tools such as indexing, hyperlinking, querying, browsing and deleting. Furthermore, the scheme shall support content-based manipulation and provide the ability to combine synthetic scenes with natural scenes.

In this paper, we present a novel video compression strategy, based on structured video representation with video composition, to generate a content-addressable bit-stream providing a high compression ratio while preserving high-resolution fidelity. The purpose of a tube-based coder is threefold: (1) associating compressed video streams with semantic indices to facilitate content-based access and editing; (2) increasing coding efficiency by temporally grouping regions of similar texture and motion and assigning a representative vector to each of the video elements; (3) improving the motion compensation used in current compression standards by predicting areas of non-translational motion, object occlusion, or new objects.

The mathematical foundation of the coding scheme is the structured video representation, comprised of a hierarchy of sequences, episodes, shots and tubes that constitute the building blocks of a structured video stream. The representation and its corresponding tube codes are thoroughly presented in [7] and more succinctly in Section 2. An encoder based on the structured video data, as described in Section 3, would only need to determine the warping applied to each object from frame to frame using the texture of the video elements and, for a given image sequence and an object, transmit the video transformations of the corresponding tube. Section 4 describes the procedure for the encoding of any digital video, given a specific set of video transformations. Section 5 addresses the synthesis-reconstruction of a video from a tube code, shows how to compute bit-rates, and evaluates analysis-synthesis simulation results.

2. Structured video representation

Structured video representation is composed of different story units such as sequences, shots, episodes [4,17] and tubes (see Fig. 1). The most basic video element is a tube, which describes a single scene object experiencing a distinctive motion event. Physically, the tube consists of a spatiotemporal (ST) volume swept out over time by the contour of the dynamic object. The tube in a single frame is the smallest meaningful entity.

Once the type of a tube is determined, shots, episodes and sequences can be modeled. A shot, composed of one or more frames generated and recorded continuously, contains a collection of overlapping tubes and does not contain a scene change. One or several related shots are combined in an episode, and a series of related episodes forms a sequence. For example, a broadcast of an NBA game is a sequence that has four episodes, and each episode represents the action from one quarter. Within the episodes are a collection of shots that describe the offensive and defensive series.

Fig. 1. Structured video representation.


Each shot is composed of several tubes, where each tube represents the play action of a player, such as dribbling, a dunk or a no-look pass.
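To make the hierarchy concrete, the sketch below models sequences, episodes, shots and tubes as nested containers. This is only an illustrative data-structure sketch; the class and field names (Tube, Shot, object_id, support, and so on) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Tube:
    """A single scene object undergoing one distinctive motion event."""
    object_id: int
    start_frame: int
    end_frame: int
    # Per-frame support regions (bounding boxes here, purely for brevity).
    support: List[Tuple[int, int, int, int]] = field(default_factory=list)

@dataclass
class Shot:
    """Frames recorded continuously: overlapping tubes, no scene change."""
    tubes: List[Tube] = field(default_factory=list)

@dataclass
class Episode:
    """One or several related shots."""
    shots: List[Shot] = field(default_factory=list)

@dataclass
class Sequence:
    """A series of related episodes, e.g. a full broadcast."""
    episodes: List[Episode] = field(default_factory=list)

# Example: an NBA broadcast with four quarters (episodes), each starting with one shot.
game = Sequence(episodes=[Episode(shots=[Shot()]) for _ in range(4)])
print(len(game.episodes))  # 4
```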

2.1. Temporal coherence

Let $(v_{\mathrm{orig}}, \mu_{\mathrm{orig}})$ be the original video which we want to encode, where $\mu_{\mathrm{orig}}$ is the initial frame, and let $d$ be a given distortion measure. Tube-based video coding is the construction of a video transformation $\tau$ for which the sequence of transformed $\mu_{\mathrm{orig}}$ is a close approximation to the original $v_{\mathrm{orig}}$:

$$v = \tau(\mu_{\mathrm{orig}}), \qquad d(v, v_{\mathrm{orig}}) < \varepsilon, \tag{1}$$

where $\varepsilon$ is a small, positive real number. Provided that $\tau$ has a lower complexity than the original image, $\tau$ can be viewed as a code, lossy in general, for $v_{\mathrm{orig}}$.

From (1), we see that the code has a low complexity when the original video can be interpolated strictly from its initial frame. Namely, if the texture and motion are piecewise continuous, temporal redundancy can be efficiently exploited through temporal coherence on a tubewise basis. The requirements for the transformation $\tau$ are formulated as follows:

$$v_{\mathrm{orig}} = \{F_1, F_2, \ldots, F_N\}, \qquad \mathcal{T}_i(t) \approx \tau(\mathcal{T}_i(t-1)) \approx \tau^{-1}(\mathcal{T}_i(t+1)), \tag{2}$$

where the original video $v_{\mathrm{orig}}$ is partitioned into a hierarchy of tubes such that the forward transformation $\tau(\mathcal{T}_i(t-1))$ is similar to the backward transformation $\tau^{-1}(\mathcal{T}_i(t+1))$.

We call a transformation $\tau$ which satisfies (1) and (2) a tube code for $v_{\mathrm{orig}}$. Note that the transformation $\tau$ is a function of the motion complexity. In the case of simple motions such as translation, scaling or rotation, $\tau$ can be formulated using conventional techniques such as pure translation or an affine motion model. In the case of complex motions such as gesture or facial expression, however, the transformation $\tau$ must be designed to account for global rigid motion as well as local nonrigid motion. Since the analysis of local nonrigid motion is computationally expensive, we design a class of video transformations which combines global motion compensation and local texture updating (see Fig. 2).

2.2. Structure of tube codes

Our work focuses on a class of video transformations defined tubewise, of the general form

$$\tau(v) = \sum_{i=0}^{N-1} \tau(\mathcal{T}_{\mathcal{R}_i}) = \sum_{i=0}^{N-1} \sum_{t=0}^{M_i-1} \tau_i(t)\bigl(\mathcal{T}_{\mathcal{R}_i}(t)\bigr), \tag{3}$$

Fig. 2. Motion-complexity-based bit allocation strategy.


where $\{\mathcal{R}_i,\ 0 \le i < N\}$ denotes a non-overlapping partition of the image support into $N$ regions, $\mathcal{T}_{\mathcal{R}_i}$ denotes a tube defined over the region $\mathcal{R}_i$, $M_i$ denotes the number of frames in each tube, and $\tau_i(t)$ denotes an elemental tube transformation of a tube from time $t$ to time $t+1$. For clarity, $\tau_i$ is expressed as the composition of three transformations $\Lambda_i$, $\Upsilon_i$ and $\Gamma_i$:

$$\tau_i = \Lambda_i \circ \Upsilon_i \circ \Gamma_i, \tag{4}$$

where $\Lambda_i$, $\Upsilon_i$ and $\Gamma_i$ are designed to refresh texture, update texture and compensate motion, respectively. The construction of a tube code $\tau$ for $v_{\mathrm{orig}}$ will be accomplished separately for each tube, and independently from one another. Hence, the encoding of $v_{\mathrm{orig}}$ amounts to finding, for every tube $\mathcal{T}_i$, a transformation $\tau_i(t)$ from time $t$ to $t+1$ such that the composition of the transformed tubes is a close approximation to the original video $v_{\mathrm{orig}}$.
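Read operationally, eqs. (3) and (4) say that each tube is reconstructed by refreshing its initial texture once and then, frame by frame, updating texture and compensating motion. The sketch below mirrors that composition with placeholder functions standing in for Λ, Υ and Γ; all names are hypothetical and the three operators are identity stubs, not the paper's implementation.

```python
import numpy as np

def refresh_texture(initial_frame: np.ndarray) -> np.ndarray:
    """Stand-in for Lambda: decode the compressed initial frame of the tube (identity stub)."""
    return initial_frame

def update_texture(frame: np.ndarray) -> np.ndarray:
    """Stand-in for Upsilon: patch regions whose texture changed or was (un)occluded (identity stub)."""
    return frame

def compensate_motion(frame: np.ndarray) -> np.ndarray:
    """Stand-in for Gamma: warp the previous reconstruction with the tube's affine motion (identity stub)."""
    return frame

def reconstruct_tube(initial_frame: np.ndarray, n_frames: int) -> list:
    """One tube of eq. (3): refresh texture once, then update and compensate frame by frame."""
    frame = refresh_texture(initial_frame)
    frames = [frame]
    for _ in range(1, n_frames):
        frame = compensate_motion(update_texture(frame))
        frames.append(frame)
    return frames

# A shot is recomposed by combining the tubes over their (non-overlapping) image supports.
tube = reconstruct_tube(np.zeros((288, 352)), n_frames=10)
print(len(tube))  # 10
```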

3. Design of a tube-based coding system

The main issues involved in the design and implementation of a tube-based coding system are: (i) the image-dependent partition of an image sequence, (ii) the choice of a strategy for assigning bits to the structured video representation, and (iii) the specification of a class of video transformations defined consistently with the texture and motion of a tube, and of a scheme for the quantization of their parameters. Fig. 3 shows the flowchart of a tube-based encoding-decoding system.

3.1. Video segmentation

The cubic volumetric ST data, formed by stacking up the original image sequence, is partitioned into overlapping tubes of different sizes, thus forming the structured video representation. The larger tubes, depicting the static or slowly moving background, are referred to as background tubes, and the smaller tubes, depicting static or moving scene objects, are referred to as object tubes. A background tube, in the event of a scene change, can be either split into smaller background tubes or replaced entirely by a new background tube. An object tube, in the event of an activity change, can be replaced by a new object tube that describes the same scene object. An activity change is defined as an event in which the scene object experiences different types of motion (e.g. rotation-translation, acceleration). Decisions about the splitting or partitioning of the tubes are made during the encoding process to ensure optimal spatial and temporal coherence within the tubes. Thus, this video segmentation strategy is content dependent; it allows the coder (i) to use content-based tubes to exploit the spatiotemporal redundancy of smoothly varying motion and texture, and (ii) to use scene changes, motion changes and occlusion to assign semantic labeling.

3.2. Bit assignment strategy

Let $\mathcal{H}$ denote the shot header, which represents the initial frame of a shot and its temporal relations with respect to other shots in the sequence. Let $\mathcal{E}$ and $\mathcal{S}$ denote its episode and shot indices, respectively. Let $\mathcal{L}$ be an image, and let $\{\mathcal{L}_{\mathcal{R}_i}\}$ be the layered representation of $\mathcal{L}$, ordered in depth in terms of opacity, shape and color:

$$\mathcal{L}_{\mathcal{R}_i} \in \mathcal{L}, \qquad \mathcal{L} = \sum_{i=0}^{N-1} \mathcal{L}_{\mathcal{R}_i}, \tag{5}$$

where $\{\mathcal{R}_i,\ 0 \le i < N\}$ denotes a non-overlapping partition of the image support into $N$ regions, usually masks of arbitrary shape covering the scene objects or background, and $\mathcal{L}_{\mathcal{R}_i}$ denotes the restriction of the image $\mathcal{L}$ to the region $\mathcal{R}_i$.

The encoder controls the bit assignment according to the temporal structure (sequences, episodes, shots, tubes) of the structured video streams. If a scene change occurs, the encoder transmits a new shot header which includes an episode index to the $j$th episode, a shot index to the $k$th shot, and $N$ layers of compressed image texture $\Lambda(\mathcal{L}_{\mathcal{R}_i})$:

$$\mathcal{H}_{\mathcal{E},\mathcal{S}} = \{\mathcal{E}^{(j)}, \mathcal{S}^{(k)}, \Lambda(\mathcal{L}_{\mathcal{R}_i}) \in \mathcal{L} : 0 \le i < N\}. \tag{6}$$

We define the shot encoding strategy as a multiplexor which switches video transformations according to three possible scenarios. If a new tube emerges in the shot, the image texture of its initial frame is compressed using the $\Lambda$ transformation and inserted into the shot header. If an existing tube experiences partial textural changes or is occluded by another tube, its image texture is updated by the $\Upsilon$ transformation. Finally, if the tube has smooth motion, the $\Gamma$ transformation is applied to $\mathcal{L}_{\mathcal{R}_i}$ to predict regions in the succeeding frames.


Fig. 3. Tube-based encoding-decoding system.

The shot encoding strategy is summarized as follows:

$$\tau(\mathcal{L}_{\mathcal{R}_i}) = \begin{cases} \Lambda_i(\mathcal{L}_{\mathcal{R}_i}) & \text{if new } \mathcal{T}_{\mathcal{R}_i} \text{ due to } \mathcal{SC} \text{ or } \mathcal{AC}_i,\\ \Upsilon_i(\mathcal{L}_{\mathcal{R}_i}) & \text{if } \mathcal{TC}_i \text{ or } \mathcal{OC}_i,\\ \Gamma_i(\mathcal{L}_{\mathcal{R}_i}) & \text{otherwise}, \end{cases} \tag{7}$$

where $\mathcal{SC}$ denotes a scene change, and $\mathcal{AC}_i$, $\mathcal{TC}_i$ and $\mathcal{OC}_i$ denote the activity change, partial textural changes and occlusion which act upon $\mathcal{L}_{\mathcal{R}_i}$, respectively.
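Viewed as control flow, eq. (7) is a three-way switch driven by the detected events. A minimal sketch of that multiplexor follows; the boolean flags and the function name are assumptions, and the return value only tags which transformation would be applied to the layer.

```python
def encode_layer(scene_change: bool, activity_change: bool,
                 textural_change: bool, occlusion: bool) -> str:
    """Choose the transformation applied to one layer, following the three cases of eq. (7)."""
    if scene_change or activity_change:
        return "Lambda"   # new tube: compress its initial texture into the shot header
    if textural_change or occlusion:
        return "Upsilon"  # partial textural change or occlusion: update the texture
    return "Gamma"        # smooth motion only: affine motion compensation

print(encode_layer(scene_change=False, activity_change=False,
                   textural_change=True, occlusion=False))  # Upsilon
```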

3.3. A class of video transformations

In this section we describe a class of video transformations defined tubewise to perform motion compensation and texture updating.


3.3.1. Motion compensation $\Gamma$

We describe the motion compensating operator $\Gamma$, which predicts the layers in the shot header $\mathcal{H}$ from time $t_0$, $\mathcal{L}_{\mathcal{R}_i} = \mathcal{T}_i(t_0)$, to time $t_n$, $\mathcal{T}_i(t_n)$. The operator $\Gamma$ is based on the affine motion model, which is suitable for describing a wide range of image motions such as translation, rotation, scaling and shear [14]. Let $(V_x, V_y)$ denote the image velocity and $(p_x, p_y)$ the image coordinates of a pixel. The affine motion transformation $V : \mathbb{R}^2 \to \mathbb{R}^2$ is written as follows:

$$\begin{pmatrix} V_x \\ V_y \end{pmatrix} = \begin{pmatrix} a_1 & a_2 \\ a_4 & a_5 \end{pmatrix} \begin{pmatrix} p_x \\ p_y \end{pmatrix} + \begin{pmatrix} a_3 \\ a_6 \end{pmatrix}, \tag{8}$$

where $a_1, \ldots, a_6$ are real constants.

Since the video stream is partitioned in such a way that each tube has smooth motion, a set of motion trajectories is sufficient for representing the motion of the entire tube. In essence, each motion trajectory is a spline that approximates the temporal history of an affine motion parameter. Six cubic splines, for instance, may be used to parameterize the global affine motion of a moving region, two for translation and four for rotation and scaling. The motion compensation operator $\Gamma$ is thus expressed as

$$\mathcal{L}(p_x, p_y, t) = \Gamma\bigl(a_1(t), a_2(t), \ldots, a_6(t)\bigr)\,\mathcal{L}(p_x, p_y, t-1), \tag{9}$$

$$a_i(t) = c_{1i} + c_{2i}t + c_{3i}t^2 + c_{4i}t^3, \tag{10}$$

where $c_{1i}$, $c_{2i}$, $c_{3i}$, $c_{4i}$ are real constants. In the case where the tubular motion is not affine, the motion-compensating operator $\Gamma$ is best applied recursively in a hierarchical manner.
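The following sketch evaluates the trajectory model of eqs. (8)-(10): each affine parameter a_i(t) is a cubic polynomial in t, and the resulting parameters define the velocity of a pixel. The coefficient layout and the example values are illustrative assumptions, not the paper's data.

```python
import numpy as np

def affine_params(coeffs: np.ndarray, t: float) -> np.ndarray:
    """Evaluate a_1..a_6 at time t; coeffs has shape (6, 4), one row of c_1i..c_4i per parameter (eq. (10))."""
    powers = np.array([1.0, t, t**2, t**3])
    return coeffs @ powers

def affine_velocity(a: np.ndarray, px: np.ndarray, py: np.ndarray):
    """Affine motion field of eq. (8): V_x = a1*px + a2*py + a3, V_y = a4*px + a5*py + a6."""
    vx = a[0] * px + a[1] * py + a[2]
    vy = a[3] * px + a[4] * py + a[5]
    return vx, vy

# Example: a constant translation of 1 pixel/frame in x over the whole tube.
coeffs = np.zeros((6, 4))
coeffs[2, 0] = 1.0  # a_3(t) = 1 for all t
a = affine_params(coeffs, t=5.0)
vx, vy = affine_velocity(a, np.array([10.0]), np.array([20.0]))
print(vx, vy)  # [1.] [0.]
```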

3.3.2. Texture updating $\Upsilon$

Texture-updating transformations are those which process the frame $\mathcal{L}$ of a tube in the event of an activity change or occlusion. The terminology updating is used in order to emphasize that $\Upsilon$ transformations alter only parts of $\mathcal{L}$. We list in the following the four transformations supported on an arbitrarily shaped region $\mathcal{R}$ of the initial frame $\mathcal{L}$.

Accretion: $\mathcal{R}$ reappears from occlusion,
$$\Upsilon_{\mathrm{acc}}(\mathcal{L}) = \mathcal{L} \oplus \mathcal{R}. \tag{11}$$

Deletion: $\mathcal{R}$ is occluded by another tube,
$$\Upsilon_{\mathrm{del}}(\mathcal{L}) = \mathcal{L} \ominus \mathcal{R}. \tag{12}$$

Creation: $\mathcal{R}$ is replaced by new texture after a partial textural change,
$$\Upsilon_{\mathrm{cre}}(\mathcal{L}) = \mathcal{L} \oplus \mathcal{R}_{\mathrm{new}}. \tag{13}$$

Destruction: $\mathcal{R}$ disappears from $\mathcal{L}$ after a partial textural change,
$$\Upsilon_{\mathrm{des}}(\mathcal{L}) = \mathcal{L} \ominus \mathcal{R}. \tag{14}$$

In effect, $\Upsilon$ transformations allow us to generate, from the initial frame $\mathcal{L}$ of the tube, a whole family of texture-related transformed regions to be motion compensated during the encoding. In the case of accretion and deletion, $\mathcal{L}$ itself is generally not affected; instead, the occlusion temporarily covers or uncovers parts of $\mathcal{L}$. In the case of creation and destruction, however, $\mathcal{L}$ is altered. We therefore encode the partial textural changes by applying MC-DCT [16] to $\mathcal{L}$ and $\Gamma(\mathcal{L})$.
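Under the reading that ⊕ adds a region to the visible support of a tube and ⊖ removes it, the four Υ transformations reduce to simple mask and patch operations. The sketch below illustrates accretion, deletion and creation on boolean masks; the function names and the interpretation of the operators are assumptions.

```python
import numpy as np

def accrete(support: np.ndarray, region: np.ndarray) -> np.ndarray:
    """Accretion (reading of eq. (11)): the region reappears from occlusion, so add it to the support."""
    return support | region

def delete(support: np.ndarray, region: np.ndarray) -> np.ndarray:
    """Deletion (reading of eq. (12)): the region is occluded by another tube, so remove it."""
    return support & ~region

def create(frame: np.ndarray, region: np.ndarray, new_texture: np.ndarray) -> np.ndarray:
    """Creation (reading of eq. (13)): replace the texture inside the region after a partial change."""
    out = frame.copy()
    out[region] = new_texture[region]
    return out

support = np.ones((4, 4), dtype=bool)
region = np.zeros((4, 4), dtype=bool)
region[:2, :2] = True
print(int(delete(support, region).sum()))  # 12 pixels remain visible
```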

4. Construction of tube code for digital video

This section addresses the implementation of the tube-based video coding system described in Sections 2 and 3. We present a way to partition raw digital video into hierarchical tubes and to solve the problem of constructing a tube code for the structured video representation.

An original digital video $v_{\mathrm{orig}}$ is given as input to the coder. Let $\{\mathcal{E}, \mathcal{S}, \mathcal{T}\}$ be the video partition consisting of hierarchical tubes of possibly different spatial sizes and temporal lengths. Since the coding procedure is done separately for each shot, we will focus on the encoding of a shot made of $N$ tubes $\{\mathcal{T}_i\}_{0 \le i < N}$. Recall from (4) and (5) that a video transformation $\tau$ has the form

$$\tau = \sum_{i=0}^{N-1} \tau_i, \qquad \tau_i = \Lambda_i \circ \Upsilon_i \circ \Gamma_i. \tag{15}$$


Given a tube $\mathcal{T}_i$, the construction of the tube code is broken into three steps corresponding to the transformations $\Lambda_i$, $\Upsilon_i$ and $\Gamma_i$, respectively.
- Texture refresh: The construction of the $\Lambda$ transformation consists of compressing the initial frame of a tube, $\mathcal{L}$, by any standard image coding scheme such as DCT [16], wavelets [13] or fractals [11]. However, it is important to note that the coding scheme should account for arbitrarily shaped regions to maximize coding efficiency.
- Motion compensation: The construction of the $\Gamma$ transformation amounts to computing a set of motion trajectories representing the motion of the entire tube. The computation involves the estimation of affine motion on a two-frame basis and the temporal integration of the affine motion parameters into global affine motion splines.
- Texture updating: The final part of the construction consists of detecting occlusion and partial textural changes in the tube. We select a pool of textural changes, made up of all partial textural changes which maximize the distortion between $\mathcal{L}$ and $\Gamma(\mathcal{L})$. This selection criterion is chosen to keep the pool size small and thus obtain lower bit-rates.

In summary, the encoding of the tube $\mathcal{T}_i$ consists in finding a best triplet $(\Lambda_i, \Upsilon_i, \Gamma_i)$ such that the distortion

$$d\bigl(\mathcal{T}_i,\ \Upsilon_i \circ \Gamma_i(\mathcal{L}_i)\bigr) \ \text{is minimum} \tag{16}$$

and the amount of information

$$\Lambda_i \ \text{is minimum.} \tag{17}$$
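The textural-change pool described above can be sketched as a ranking problem: keep the few candidate regions where motion compensation alone leaves the largest residual, so that Υ is spent only where Γ fails. The MSE distortion, the region encoding and all names below are assumptions rather than the paper's exact criterion.

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error used here as the distortion measure d (an assumption)."""
    return float(np.mean((a.astype(float) - b.astype(float)) ** 2))

def select_texture_pool(original: np.ndarray, compensated: np.ndarray,
                        candidate_regions: list, pool_size: int = 3) -> list:
    """Keep the candidate regions where Gamma(L) deviates most from L."""
    scored = [(mse(original[r], compensated[r]), r) for r in candidate_regions]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [r for _, r in scored[:pool_size]]

# Candidate regions given as (row_slice, col_slice) pairs, e.g. boxes around the eyes and mouth.
regions = [(slice(0, 8), slice(0, 8)), (slice(8, 16), slice(8, 16))]
orig = np.random.rand(16, 16)
comp = orig.copy()
comp[0:8, 0:8] += 0.5  # motion compensation fails in the first region
print(select_texture_pool(orig, comp, regions, pool_size=1))
```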

4.1. Video analysis

The implementation of a tube-based encoder requires accurate video analysis that consists of spatial segmentation to identify the objects in the scene, temporal segmentation to mark scene changes, occlusion and activity changes, warping estimation to interpolate the warping prediction, and pattern matching to select representative texture patterns. Fig. 4 illustrates the interaction among these video analysis tasks.

The spatial segmentation module partitions an image into coherently moving regions using simultaneously the smoothness and contrast of such visual cues as color, motion and texture [7]. The core of the spatial segmentation module is a k-means clustering algorithm, which by hypothesis testing assigns each point in the image to the integrated visual model (e.g. color, motion) that best describes the local image data. The contrasts of visual cues (e.g. motion boundaries, intensity changes) are then used to modify region boundaries. Spatial segmentation is performed only after scene changes or occlusions.

During warping estimation, an affine motion model is estimated for each coherent region that best describes its local motion data.

Fig. 4. Image sequence analysis (image clusters, warping estimation, activities, video elements, texture patterns, affine image database).


The estimation of affine motion can be implemented in two ways: (1) an optical flow algorithm followed by least-squares estimation over each coherent region [18]; and (2) a full search algorithm with incremental translation, rotation and scaling. Warping estimation is performed on moving regions that change significantly from the previous frame.

The temporal segmentation module detects scene changes between frames, occlusion among overlapping regions, and motion events of each coherent region [8]. The scene changes and activity boundaries are treated as a collection of motion discontinuities. Formulated as the sign of the Gaussian and mean curvature of the spatiotemporal surfaces, the motion discontinuities are applied to segment video streams into semantic video clips such as leaning backward, sitting, flipping and stretching for the test sequence 'Flipping Dog' [9]. The detection of scene changes and occlusions is performed once for the entire image sequence, while the detection of activity boundaries is performed after the image segmentation. During pattern recognition, tubes are matched to stored texture patterns to compute a set of representative texture patterns.

4.1.1. Spatial segmentation

This section addresses the implementation of the spatial segmentation module for computing the initial support maps of the tubes. We present a method to partition an image using motion, color and pixel position, shown in Fig. 5.

If we restrict our consideration to a stationary background, it has been shown that there exists a simple, subtraction-based motion detection [12]. Our method adopts one derivation of this type of motion detection and requires three frames to compute the motion mask. The images need not be consecutive frames in the image sequence as long as the objects are entirely contained in each frame. Let $I^1$, $I^2$, $I^3$ be three color frames extracted at times $t_1$, $t_2$, $t_3$ from the image sequence. Assume that the object boundary is a ramp edge that moves successively from a position in $I^1$ to another position in $I^3$. The only area where both frame differences $d_{12}$ and $d_{23}$ are meaningful is at the location of the ramp edge in the middle frame, where both frame differences intersect [5]. Hence, the moving object can be extracted by finding the intersection of the difference between frames 1 and 2, $d_{12}$, and the difference between frames 2 and 3, $d_{23}$.

Fig. 5. Spatial segmentation based on motion detection and color segmentation.

The method for computing the motion mask $\mathcal{M}$ of multiple moving objects is: (1) smooth $d$ with a low-pass filter, (2) threshold $d$ with the value $th_d$ to obtain a binary image, (3) eliminate small, spurious regions using a 5 x 5 median filter, and (4) fill holes by successively applying operations of dilation and closing. Let $K$ denote a 5 x 5 square structuring element. The computation of the motion mask is summarized as

$$\mathcal{M} = \bigl(\mathrm{MF}[\mathrm{TH}(\mathrm{LF}(d), th_d)] \oplus K\bigr) \bullet K, \tag{18}$$

where MF is the median filter, TH the thresholding operation and LF the low-pass filter.
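Eq. (18) is a conventional smoothing-thresholding-morphology pipeline, so it can be sketched directly with SciPy. The 5 x 5 structuring element and the threshold follow the text; the Gaussian low-pass width and the use of a minimum of absolute differences as the "intersection" of d12 and d23 are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter, binary_dilation, binary_closing

def motion_mask(d12: np.ndarray, d23: np.ndarray, th_d: float = 7.0) -> np.ndarray:
    """Sketch of eq. (18): intersect the two frame differences, then LF, TH, MF, dilation and closing."""
    d = np.minimum(np.abs(d12), np.abs(d23))                  # intersection of the two frame differences
    smoothed = gaussian_filter(d, sigma=1.0)                   # LF: low-pass filter (width is an assumption)
    binary = smoothed > th_d                                   # TH: threshold with th_d
    cleaned = median_filter(binary.astype(np.uint8), size=5).astype(bool)  # MF: 5 x 5 median filter
    K = np.ones((5, 5), dtype=bool)                            # 5 x 5 square structuring element
    return binary_closing(binary_dilation(cleaned, structure=K), structure=K)

mask = motion_mask(np.random.rand(64, 64) * 10, np.random.rand(64, 64) * 10)
print(mask.shape, mask.dtype)
```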

Color segmentation is the process that divides the image into homogeneous regions using the color information at each pixel. It is possible to determine image segments of arbitrary shape by simultaneously combining clustering in color-spatial space with spatial region growing. Izumi et al. [10] pioneered this hybrid technique by suggesting that segmentation can be done by initially partitioning an image into small regions of homogeneous color and pixel position. The additional spatial constraint rejects pixels that are too far away from the center of the small region. The hybrid algorithm is based on k-means clustering and generally converges rapidly, usually achieving its limiting accuracy within two or three iterations.
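A minimal sketch of the hybrid clustering idea, assuming k-means over joint (color, position) features so that pixels far from a cluster centre are penalized by the spatial term. The feature scaling, cluster count and iteration budget are illustrative choices, not the values used by the authors.

```python
import numpy as np

def color_spatial_segment(image: np.ndarray, k: int = 20, spatial_weight: float = 0.5,
                          iters: int = 3, seed: int = 0) -> np.ndarray:
    """k-means over (R, G, B, x, y) features; returns a label map of small coherent regions."""
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.concatenate(
        [image.reshape(-1, 3).astype(float),
         spatial_weight * np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)],
        axis=1)
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):  # a few iterations are typically enough (limiting accuracy)
        labels = np.argmin(((feats[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = feats[labels == j].mean(axis=0)
    return labels.reshape(h, w)

labels = color_spatial_segment(np.random.randint(0, 256, (32, 32, 3)), k=4)
print(np.unique(labels))
```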

We now consider the integration of the motion detection and color segmentation described in the previous sections. Motion detection produces a mask $\mathcal{M}$ which generally contains small holes and differs slightly from the true size of the moving object. Color segmentation partitions the image into a large number of coherent regions, but the object itself is split into several regions. Thus, both the motion detection and the color segmentation are imperfect. The key observation for the integration strategy is that if one of the regions belonging to the moving object is identified, it is possible to merge that region with other partially masked regions using the criteria of uniform color and masked area.

In the cases we have tried, the segmentation accuracy of this process is good. With image sequences of non-occluding moving objects, the correct partitions are recovered to within roughly 3% of the true size, regardless of the number of objects or the complexity of the background scene. When the process is applied to a region of occluded objects, it will pick up the correct boundary which surrounds the objects, but will not be able to partition them into separate regions. We have attempted to include motion in the clustering process to eliminate this ambiguity.

4.1.2. Region tracking

Once we have a set of support maps, we need to track their image positions over time to generate the spatiotemporal structure of each tube. Our approach to this problem is to find a motion model which simultaneously describes the image motion of a region between two successive frames and represents the cylindrical structure of a tube. We use the affine motion model for this purpose, since it has been shown to provide a good approximation of 3-D moving objects [15,2]. Our implementation of affine motion estimation is similar to the robust techniques presented in [3,18]. Instead of a direct estimation of global affine motion, we first compute a map of global motion fields and then perform least-squares estimation of affine motion over each support map. This strategy minimizes the problems of multiple objects within the analysis region.
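The two-stage estimate described above can be sketched as an ordinary least-squares fit: given a dense motion field and a support map, solve for the six affine parameters that best explain the flow inside the map. Robust weighting as in [3,18] is omitted, and all names are illustrative.

```python
import numpy as np

def fit_affine(flow_x: np.ndarray, flow_y: np.ndarray, support: np.ndarray) -> np.ndarray:
    """Least-squares fit of (a1..a6) so that V = A p + b matches the flow inside the support map."""
    ys, xs = np.nonzero(support)
    A = np.stack([xs, ys, np.ones_like(xs)], axis=1).astype(float)  # rows [p_x, p_y, 1]
    ax, *_ = np.linalg.lstsq(A, flow_x[ys, xs], rcond=None)  # a1, a2, a3
    ay, *_ = np.linalg.lstsq(A, flow_y[ys, xs], rcond=None)  # a4, a5, a6
    return np.concatenate([ax, ay])

support = np.zeros((20, 20), dtype=bool)
support[5:15, 5:15] = True
fx = np.full((20, 20), 2.0)  # uniform translation of 2 pixels in x
fy = np.zeros((20, 20))
print(np.round(fit_affine(fx, fy, support), 3))  # ~[0, 0, 2, 0, 0, 0]
```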

4.1.3. Temporal segmentation

After the regions are warped and tracked, we wish to create a hierarchical structure of episodes, shots and tubes by temporally merging the regions tracked from successive frames. To achieve this, we need to find a set of breakpoints which enable such a temporal partition of the video streams. These breakpoints $\xi = \{\mathcal{SC}, \mathcal{OC}, \mathcal{AC}\}$ include scene changes between frames, occlusion among overlapping tubes, and activity changes between tubes of identical scene objects, respectively. Our rule for merging the tracked regions is

$$\mathrm{Label}(\mathcal{T}^{(l)}) = \begin{cases} (\mathcal{E}^{(j)}, \mathcal{S}^{(k+1)}, \mathcal{T}^{(l)}) & \text{if } \xi = \mathcal{SC},\\ (\mathcal{E}^{(j)}, \mathcal{S}^{(k)}, \mathcal{T}^{(l)}) & \text{otherwise}. \end{cases} \tag{19}$$

To extract these breakpoints, we could examine variations of the texture and motion between two successive frames. Although widely employed in the current literature, this approach is inherently noise-sensitive, since a point of large variation can easily be mislabeled as a discontinuity point. Instead, we adopt a spatiotemporal strategy in which redundant information from additional frames provides the necessary constraints on the otherwise ill-posed analysis problem. We assume that the contours of moving objects form spatiotemporal surfaces in space-time, and treat the breakpoints as a set of curvature discontinuities on the spatiotemporal surfaces [8,9]. This formulation minimizes the problems of labeling ambiguity, as well as providing a method for simultaneously detecting the breakpoints due to texture variations and motion discontinuities.
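As a concrete reading of the merging rule (19), the sketch below walks through per-frame breakpoint events and opens a new shot whenever a scene change is flagged, otherwise merging into the current shot. The event encoding and the fact that the episode index stays fixed are simplifying assumptions.

```python
def label_regions(breakpoints: list) -> list:
    """Assign (episode j, shot k) labels, incrementing the shot index at scene changes (eq. (19))."""
    j, k = 0, 0
    labels = []
    for event in breakpoints:      # one entry per tracked region/frame: 'SC', 'OC', 'AC' or None
        if event == 'SC':          # scene change: start a new shot
            k += 1
        labels.append((j, k))
    return labels

print(label_regions([None, None, 'SC', None, 'AC', None]))
# [(0, 0), (0, 0), (0, 1), (0, 1), (0, 1), (0, 1)]
```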

4.2. Transformed tubes

Recall that the tubes can be classified based on their motion complexity. Using the tube classification method described in detail in [7], we create two classes of tubes, $\mathcal{T}_b$ and $\mathcal{T}_o$, for background tubes and object tubes, respectively. We further divide $\mathcal{T}_o$ into two subclasses of simple object tubes $\mathcal{T}_{so}$ and complex object tubes $\mathcal{T}_{co}$. Simple object tubes $\mathcal{T}_{so}$ have smooth motion but no significant textural changes. The motion is assumed to be approximately affine, i.e. not to have any nonrigid motion. Complex object tubes $\mathcal{T}_{co}$ present a strong change of texture, often localized in time and space, as a result of local nonrigid motion, 3-D rigid motion or illumination changes.

We now propose an encoding procedure for the construction of a tube code from an $M$-frame tube, according to the motion complexity of the tube.

- $\mathcal{T}_i$ is a background tube $\mathcal{T}_b$: We simply encode $\mathcal{L}$, the initial frame of $\mathcal{T}_b$, using any of the conventional still image coding schemes (e.g. DCT, wavelets, fractals). The transformation $\tau_i$ which compresses $\mathcal{T}_b$ is simply $\tau_i = \Lambda_i$.
- $\mathcal{T}_i$ is a simple object tube $\mathcal{T}_{so}$: We restrict our attention to $\Gamma$ transformations which represent the motion of the entire tube with a single affine motion model. The transformation $\tau_i$ is a composition of still image coding and affine motion compensation, of the form

$$\tau_i(\mathcal{T}_i) = \Gamma_i \circ \Lambda_i \tag{20}$$
$$= \{\Lambda_i(\mathcal{L}), \Gamma^1(\mathcal{L}), \Gamma^2(\mathcal{L}), \ldots, \Gamma^{M-1}(\mathcal{L})\}. \tag{21}$$

- $\mathcal{T}_i$ is a complex object tube $\mathcal{T}_{co}$: We use transformations $\tau_i$ which are a composition of still image coding, affine motion compensation and texture updating, of the form

$$\tau_i(\mathcal{T}_i) = \Upsilon_i \circ \Gamma_i \circ \Lambda_i \tag{22}$$
$$= \{\Lambda_i(\mathcal{L}), \Gamma^1(\Upsilon^1(\mathcal{L})), \Gamma^2(\Upsilon^2(\mathcal{L})), \ldots, \Gamma^{M-1}(\Upsilon^{M-1}(\mathcal{L}))\}. \tag{23}$$

When an occlusion occurs, $\mathcal{L}$ is initially motion compensated, followed by accretion and deletion to synthesize the proper occluded regions. When a textural change occurs, textures are created or destructed on $\mathcal{L}$, and the motion compensation $\Gamma$ is applied to $\mathcal{L} \oplus \mathcal{R}$ or $\mathcal{L} \ominus \mathcal{R}$. Let $\mathcal{T}_i(t)$ denote the $t$th frame of the tube. Among the pool of registered textures, the one which minimizes the distortion measure $d(\Gamma^t(\Upsilon^t(\mathcal{L})), \mathcal{T}_i(t))$ is selected. If neither occlusion nor partial textural changes occur, the composition of transformations $\Gamma^t(\Upsilon^t(\mathcal{L}))$ is simply reduced to $\Gamma^t(\mathcal{L})$.

5. Compression by video synthesis

5.1. Video analysis

We have tested the video segmentation algorithm with several real images of complex natural scenes and multiple moving objects. In all cases, the analysis region was taken to be the entire image, and the images were 352 x 288 pixels in size. The threshold values for motion detection ($th_d$) and for integration were set to 7 and 1000, respectively. All computations were performed on a Sun SparcStation 5. The entire process of spatial segmentation required roughly 100 s for each image.

During spatial segmentation, the test sequence 'Miss America', showing a head talking and moving simultaneously, was used. A single frame from this sequence is shown in Fig. 6(a). Three frames, $I_{t-4}$, $I_t$, $I_{t+4}$, were selected from the sequence and used to detect the motion mask shown in Fig. 6(b). The middle frame $I_t$ was then partitioned into a large number of small regions, shown in Fig. 6(c). In Fig. 6(d), the integration of motion detection and color segmentation is illustrated, showing that partially masked regions were correctly merged to the face and background.

The threshold value for motion detection was varied in order to partition the image into head, shoulder and background. In our experiments, we set the threshold values at 4 and 7. At the threshold value of 4, the image was partitioned into 2 regions, corresponding to the head-shoulder and the background, respectively. At the threshold value of 7, the image was also partitioned into 2 regions, corresponding to the head and the shoulder-background, respectively. By applying a simple logical operator, AND, to the two images, we obtained the head, shoulder and background.

Since only the head and shoulder move in the sequence, they are tracked in the subsequent frames. The temporal segmentation module treats the entire sequence as a shot and partitions it into 3 activities or motion events, corresponding to talking, talking-rotating left and talking-rotating right, at the intervals of frames 1-60, 61-87 and 88-150, respectively.

Fig. 6. Motion detection based on image subtraction and color segmentation: (a) 'Miss America' image sequence; (b) motion masks computed from the differences of three color images; (c) image segmentation using clustering of color and pixel position; (d) integration of color clustering and image subtraction.

Hence, the entire sequence consists of 5 tubes: 3 head, 1 shoulder and 1 background. Three partial textural changes were selected from the head tubes to reflect the variations of facial features such as the eyes and mouth. Fig. 7 shows the initial frame for each tube and the textural changes.

5.2. Video reconstruction from a tube code

When the tubes were warped to synthesize an image sequence, the problems of holes and overlaps existed at the tube boundaries. Small holes were filled by copying the image intensity from neighboring pixels. Recurring large holes were merged temporally to form an imaginary tube and encoded similarly to the real tubes. Overlaps were resolved by comparing the original video to the overlapping tubes; the tube whose overlapped areas were most similar to the original video was selected to represent the overlapped regions.
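A hedged sketch of the two compositing rules just described: small holes borrow the value of a neighbouring pixel, and where warped tubes overlap, the candidate closest to the original video is kept. The neighbour rule and helper names are illustrative.

```python
import numpy as np

def fill_small_holes(frame: np.ndarray, hole_mask: np.ndarray) -> np.ndarray:
    """Copy each hole pixel from its nearest valid neighbour to the left (toy neighbour rule)."""
    out = frame.copy()
    for y, x in zip(*np.nonzero(hole_mask)):
        if x > 0:
            out[y, x] = out[y, x - 1]
    return out

def resolve_overlaps(original: np.ndarray, candidates: list, overlap_mask: np.ndarray) -> np.ndarray:
    """Among overlapping tube renderings, keep the one most similar to the original inside the overlap."""
    errors = [np.mean((c[overlap_mask].astype(float) - original[overlap_mask]) ** 2)
              for c in candidates]
    return candidates[int(np.argmin(errors))]

frame = np.array([[1., 0.], [3., 4.]])
holes = np.array([[False, True], [False, False]])
print(fill_small_holes(frame, holes))  # hole at (0, 1) copies its left neighbour -> 1.0

orig = np.zeros((8, 8)); tube_a = np.zeros((8, 8)); tube_b = np.ones((8, 8))
overlap = np.zeros((8, 8), dtype=bool); overlap[2:4, 2:4] = True
print(np.array_equal(resolve_overlaps(orig, [tube_a, tube_b], overlap), tube_a))  # True
```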

Fig. 7. Registered texture images for representing 'Miss America'.

Information about the specific tube-based coding system used for the encoding of 'Miss America' (design specification, encoding procedure, and system performance) is given in Table 1. Since most of the bits were spent to encode the texture information, we tried to reduce this information using both affine MC-DCT and a 3-D face wireframe model. Figs. 8 and 9 show the results of these two methods. Since the region of interest is mainly the head, we used a quality factor of 80 in the DCT to encode the head tube and its textural changes. For the shoulder and background, we used a quality factor of 20. Table 2 summarizes the information in bits, excluding the texture information, for the structured video representation of 'Miss America'.

5.3. Comparison to related work

The object-oriented analysis-synthesis coder [6] encodes arbitrarily shaped objects using three sets of parameters: motion, shape and color. The image analysis partitions an image into planar patches, approximates their shape by polygon and spline representation, and estimates displacement vector fields or 3-D rigid motion. The object-oriented coder, however, encodes images on a one-frame basis and does not fully utilize the temporal redundancy. Furthermore, the object-oriented coder is designed strictly for video compression; hence it is difficult to extend the approach to video composition.

In an effort to facilitate efficient coding and composition, more recent approaches [1,18] have adopted a layered representation, in which image sequences are decomposed into a set of layers ordered in depth, with associated maps defining their motions, opacities and intensities. In effect, a digital compositing system can synthesize from the layered representation a new image sequence using techniques of warping and compositing. However, such an object-oriented representation presents its own problems. Temporal relationships cannot be assigned to overlapping or nested video sequences as is accomplished in the stratification model. Thus, the layered representation cannot specify the elaborate logical structure of video data and does not address content-based access.

Traditional motion-compensated (MC) prediction [16] assumes that the image motion comprises solely block-based translational motion. When this assumption fails, the coding efficiency deteriorates sharply, as more bits are assigned to encode the DCT coefficients of the prediction errors. A hybrid predictor, as shown in Fig. 10, can improve coding efficiency by combining a conventional MC predictor (translation) with a tube-based predictor (non-translational motion, object occlusion, new objects).


Fig. 8. Tube-based coding results (combined with MC-DCT).


Fig. 9. Tube-based coding results (combined with 3-D wireframe model).


Table 2
Information in bits for the representation of video structure

Tube type        Parameters                         Information in bits
Background       Layer index (150 frames)           2 x 150 = 300
                 Centroid (image position: x, y)    10 x 2 = 20
                 Total                              I_b = 320 b
Simple object    Layer index (150 frames)           2 x 150 = 300
                 Centroid (image position: x, y)    10 x 2 = 20
                 Affine motion parameters           824
                 Total                              I_so = 1144 b
Complex object   Layer index (150 frames)           2 x 150 = 300
                 Centroids (image position: x, y)   10 x 2 x 3 = 60
                 Affine motion parameters           824 x 3 = 2472
                 Texture change index               2
                 Total                              I_co = 2834 b

Table 1
Tube-based coding system used for the encoding of 'Miss America'

Original
  Name:           'Miss America'
  Resolution:     352 x 288 (CIF)
  Color levels:   256
  Frames:         150

Encoding specifications
  Video segmentation
    Classification:    background (T_b), simple object (T_so), complex object (T_co)
    Tubes:             3 head (T_co), 1 shoulder (T_so), 1 background (T_b)
    Textural changes:  3 head

Tube transformations
  Texture refresh Λ:                still image compression
  Affine motion compensation Γ:     warping with {c_1i, c_2i, c_3i, c_4i}, 1 <= i <= 6
  Texture updating Υ:               conventional moving image compression

Tube encoding
  Background tube T_b:        Λ: DCT (JPEG, quality = 20)
  Simple object tube T_so:    Λ: DCT (JPEG, quality = 20); Γ: {c_1i, c_2i, c_3i, c_4i}, 1 <= i <= 6
  Complex object tube T_co:   Λ: DCT (JPEG, quality = 80); Γ: {c_1i, c_2i, c_3i, c_4i}, 1 <= i <= 6; Υ: MC-DCT or model-based coding

System performance
  Frame rate:    30 frames/s
  Bit-rate:      38.9 kb/s (MC-DCT), 10 kb/s (model-based coding)
  Average SNR:   31.50 dB (MC-DCT), 28.54 dB (model-based coding)
  General remarks: Excellent reproduction of texture. Very good fidelity of uniform areas in the background, but artifacts along the tube boundaries are visible. More visually pleasing than the results of conventional very low bit-rate coding systems.


Fig. 10. Hybrid motion compensation for MPEG applications.


6. Discussion and conclusion

We have described the implementation of a digital video coding system based on the tube-based representation and video transformations. The main characteristics of this approach are that (i) it relies on the assumption that temporal redundancy can be efficiently exploited through temporal coherence on a tubewise basis, (ii) it controls the bit assignment according to motion complexity and temporal relations among video elements, (iii) it parameterizes the motion of the entire tube with an affine motion model, and (iv) it synthesizes images from representative texture patterns or anatomical models.

Some problems related to our conception of tube-based video coding remain unsolved, which we list in the following:
- Video segmentation is not perfect, especially during the detection of occluded areas.
- During the synthesis, the motion at the tube boundaries should be synchronized in order to ensure a smooth transition between tubes.
- Textural changes currently require many bits to encode. We are working on a fractal-based technique to solve this problem.

The tube-based coding system shows great promise for encoding digital video at very low bit-rates using the structured video representation and video synthesis. The encoded tubes, once marked semantically, can facilitate interactive and content-based operations on image sequences, such as editing, browsing and content-based access and filtering.

References

[1] E.H. Adelson, Layered representation for image coding, Technical Report No. 181, Vision and Modeling Group, The MIT Media Lab, December 1991.
[2] J.R. Bergen, P.J. Burt, R. Hingorani and S. Peleg, "A three-frame algorithm for estimating two-component image motion", IEEE Trans. Pattern Anal. Machine Intell., Vol. 14, No. 9, 1992, pp. 886-896.
[3] M.J. Black and P. Anandan, "Robust dynamic motion estimation over time", Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 1991, pp. 296-302.
[4] G. Davenport, T.A. Smith and N. Pincever, "Cinematic primitives for multimedia", IEEE Comput. Graphics Appl., July 1991.
[5] M.P. Dubuisson and A.K. Jain, "Object contour extraction using color and motion", Proc. CVPR-93, 1993, pp. 471-476.
[6] H.G. Musmann, M. Hotter and J. Ostermann, "Object-oriented analysis-synthesis coding of moving images", Signal Processing: Image Communication, Vol. 1, No. 2, 1989, pp. 117-138.
[7] P.R. Hsu, Structured representation of moving images for realizing interactive video environment, Ph.D. Thesis, The Univ. of Tokyo, 1995.
[8] P.R. Hsu and H. Harashima, "Spatiotemporal representation of dynamic objects", Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 1993, pp. 14-19.
[9] P.R. Hsu and H. Harashima, "Detecting scene changes and activities in video databases", Proc. IEEE Internat. Conf. Acoust. Speech Signal Processing, 1994.
[10] N. Izumi, H. Morikawa and H. Harashima, "Dynamic scene analysis using mosaic-based segmentation", PCSJ'91, 1991, pp. 145-148.
[11] A. Jacquin, "Image coding based on a fractal theory of iterated contractive image transformations", IEEE Trans. Image Process., Vol. 1, No. 1, 1992, pp. 18-30.
[12] R. Jain and H.H. Nagel, "On the analysis of accumulative difference pictures from image sequences of real world scenes", IEEE Trans. Pattern Anal. Machine Intell., Vol. 1, No. 2, April 1979, pp. 206-214.
[13] S.G. Mallat, "A theory for multiresolution signal decomposition: the wavelet representation", IEEE Trans. Pattern Anal. Machine Intell., Vol. 11, 1989, pp. 674-693.
[14] Y. Nakaya and H. Harashima, "Motion compensation based on spatial transformations", IEEE Trans. Circuits Systems Video Technol., Vol. 4, No. 3, 1994, pp. 339-356.
[15] S. Negahdaripour and S. Lee, "Motion recovery from image sequences using only first-order optical flow information", Internat. J. Comput. Vision, Vol. 9, No. 3, 1992, pp. 163-184.
[16] A.N. Netravali and B.G. Haskell, Digital Pictures - Representation and Compression, Plenum Press, New York, 1988.
[17] D. Swanberg, C. Shu and R. Jain, "Knowledge guided parsing in video databases", Proc. SPIE, San Jose, January 1993.
[18] J.Y.A. Wang and E.H. Adelson, "Layered representation for motion analysis", Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 1993, pp. 361-366.