IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 8, NO. 6, JUNE 1999

Three-Dimensional Video Compression Using Subband/Wavelet Transform with Lower Buffering Requirements

Hosam Khalil, Student Member, IEEE, Amir F. Atiya, Senior Member, IEEE, and Samir Shaheen, Member, IEEE

Abstract— Three-dimensional (3-D) video compression using wavelet decomposition along the temporal axis dictates that a number of video frames must be buffered to allow for the temporal decomposition. Buffering of frames allows the temporal correlation to be exploited, and the larger the buffer, the more effective the decomposition. One problem inherent in such a setup in interactive applications such as video conferencing is that buffering translates into a corresponding time delay. In this paper, we show that 3-D coding of such image sequences can be achieved in the true sense of temporal-direction decomposition, but with much lower buffering requirements. For a practical coder, this can be achieved by introducing an approximation to the way the transform coefficients are encoded. Applying wavelet decomposition using some types of filters may introduce edge errors, which become more prominent in short signal segments. We also present a solution to this problem for the Daubechies family of filters.

Index Terms— Image sequence compression, 3-D wavelets, video conferencing.

I. INTRODUCTION

IMAGE sequence coding has been an active research area for a long time. Video conferencing image sequence coding is becoming even more important now with the widespread use of networks and the affordability of video capturing equipment. In addition to the spatial redundancies present in still images, video images possess temporal redundancies that have to be exploited in any efficient codec. Most video compression techniques utilize standard two-dimensional (2-D) compression methods, but with an added prediction step in the temporal dimension. The most widely used 2-D method utilizes the DCT, but subband and wavelet transforms have also provided very efficient techniques.

Still image compression using wavelets is accomplished by applying the wavelet transform to decorrelate the image data, quantizing the resulting transform coefficients, and encoding the quantized values. This is usually carried out in a row-by-row and then a column-by-column fashion when using a separable one-dimensional (1-D) wavelet transform. A natural extension is to apply wavelet compression to a “3-D” image, with the temporal axis being the third dimension. Thus, in 3-D wavelet coding schemes, the frequency decomposition is carried out additionally along the temporal axis of the video signal. Slow movement in the video content will produce mostly low-frequency components, and the high-frequency components will be almost negligible. This is the energy-compacting property of wavelet decomposition, which allows us to ignore some frequency components and still have a good representation of the signal.

Manuscript received August 11, 1996; revised July 23, 1998. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Thrasyvoulos N. Pappas.

H. Khalil is with the University of California, Santa Barbara, CA 93106 USA (e-mail: [email protected]).

A. Atiya is with the Department of Electrical Engineering, California Institute of Technology, Pasadena, CA 91125 USA (e-mail: [email protected]).

S. Shaheen is with the Department of Computer Engineering, Cairo University, Giza, Egypt.

Publisher Item Identifier S 1057-7149(99)04064-6.

Recently, several 3-D wavelet/subband coding schemes have appeared. Wavelet/subband coding lends itself to scalable techniques, and one such technique was presented in [1]. That multirate scheme uses global camera-pan compensation and supports multiple-resolution decoder displays. In [2], an adaptive scheme is presented in which the decomposition along the temporal direction utilizes more or fewer frames depending on video activity. Typically, fewer frames are used when activity is high (correlation is low). In [3], a scheme is presented that uses wavelet packets. Yet another scheme utilizing 3-D subband coding is described in [4], demonstrating 3-D zero-tree coding. Other schemes, such as [5]–[9], are basically hybrid-like schemes in which only the correlation between two adjacent frames is utilized.

The obvious advantage of utilizing a temporal decomposition is that the representation of temporal changes can be very efficient in the transform domain when movement is slow. The forward transform is implemented using analysis filters, and, generally, better analysis is achieved when filters are longer, so that correlation among more frames can be discovered. One problem inherent in such a scheme is that if video is to be analyzed over long sequences, a large number of frames must first be buffered. Such a requirement is not a problem for general video such as in [1]–[4]. But in interactive applications such as video phone, buffering of frames results in a corresponding time delay. If there is a maximum allowable delay in such applications, it puts an upper limit on the number of frames that can be used for the decomposition, resulting in lower compression efficiency. Previously, this problem was solved by using only Haar wavelets (a filter with a length of two taps) in the temporal direction. An even simpler solution was to treat only a maximum of two consecutive frames (i.e., a maximum of one decomposition level in the temporal direction), which is very similar in nature to normal hybrid coding techniques. Examples of such codecs can be found in [5]–[9]. In this paper, we propose a solution to this problem by allowing correlation over a larger number of frames to be utilized while operating below the maximum allowable delay. Such a scheme is made possible by applying the temporal decomposition to a window of frames consisting of some old frames and a number of new frames, as dictated by the maximum allowable delay. Selective transmission of the coefficients concerning only the new frames is then performed. The next section explains the use of wavelets, and Sections III and IV present the details of the video coding algorithm. Results and conclusions are presented in Sections V and VI, respectively.

II. THE SUBBAND/WAVELET TRANSFORM

Still images are considered 2-D signals. Applying the subband/wavelet transform to such signals is most commonly done by using the 1-D version of the transform and applying it to the still image in both row order and column order. This is because implementation of the single-dimension transform is more efficient than an equivalent 2-D transform, and it was shown [10] to be an effective solution. Video images can be considered a 3-D signal, the three dimensions being the horizontal, the vertical, and the temporal dimension. In this section, we summarize the implementation of the 1-D wavelet transform, which is also representative of the subband transform in this context.

The wavelet transform, as a data-decorrelating tool, has won acceptance because of its multiresolution analysis capabilities, in which the signal being transformed is analyzed at many different scales to give a transformation whose coefficients can efficiently describe fine details as well as global details in a systematic way. This is in addition to the locality of the wavelet basis functions, as opposed to, for example, the Fourier transform, which uses infinite-width basis functions. Wavelets also unify many other techniques of a local type, such as the Gabor transform and the short-time Fourier transform.

The wavelet/subband transform is implemented using a pair of filters: a highpass filter $g$ and a lowpass filter $h$, which split a signal's bandwidth in two halves. The frequency responses of $h$ and $g$ are mirror images. To reconstruct the original signal, an inverse transform is implemented using the inverse transform filters $\tilde{h}$ and $\tilde{g}$, whose frequency responses are also mirror images.

The 1-D forward wavelet transform of a discrete-time signal $x(n)$, $n = 0, \ldots, N-1$, is performed by convolving that signal with both $h$ and $g$ and downsampling by two. Specifically, the analysis cycle is

$$L(n) = \sum_{i=0}^{T-1} h(i)\,x(2n+i) \tag{1}$$

$$H(n) = \sum_{i=0}^{T-1} g(i)\,x(2n+i) \tag{2}$$

where $L(n)$ represents the lowpass downsampled result, $H(n)$ the highpass downsampled result, and $h(i)$ and $g(i)$, respectively, are the coefficients of the discrete-time filters $h$ and $g$ with $T$ filter taps. The synthesis cycle (backward transform) is performed as follows:

$$\hat{x}(n) = \sum_{i=0}^{T-1} \left[\tilde{h}(i)\,\tilde{L}(n-i) + \tilde{g}(i)\,\tilde{H}(n-i)\right] \tag{3}$$

where the $\hat{x}(n)$ represent the synthesized values after an analysis/synthesis cycle, $\tilde{L}$ and $\tilde{H}$ are the upsampled versions of $L$ and $H$ (equal to $L(n/2)$ and $H(n/2)$ for even $n$, and zero whenever $n$ is odd), and $\tilde{h}(i)$ and $\tilde{g}(i)$, respectively, are the coefficients of the discrete-time synthesis filters $\tilde{h}$ and $\tilde{g}$. Perfect reconstruction requires that $\hat{x}(n) = x(n)$ for all $n$. By downsampling, as implied by (1) and (2), we are in effect producing a critically subsampled set. This is possible because the analysis filters each result in a half-bandwidth output.
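To make the analysis/synthesis cycle concrete, the following is a minimal sketch of (1)–(3) in Python. The orthonormal Haar pair and a periodic extension are assumptions made purely for illustration; the paper itself uses a symmetric 9-tap QMF pair spatially and the 6-tap Daubechies pair temporally, with the edge handling of the Appendix.

```python
import numpy as np

# Haar pair, used purely for illustration.
h = np.array([1.0, 1.0]) / np.sqrt(2.0)   # lowpass analysis filter
g = np.array([1.0, -1.0]) / np.sqrt(2.0)  # highpass analysis filter

def analyze(x):
    """Analysis cycle per (1)-(2): convolve with h and g, keep every other output.
    A periodic extension supplies the samples referenced beyond the right edge."""
    T, N = len(h), len(x)
    xe = np.concatenate([x, x[:T - 1]])
    L = np.array([h @ xe[2*n:2*n + T] for n in range(N // 2)])
    H = np.array([g @ xe[2*n:2*n + T] for n in range(N // 2)])
    return L, H

def synthesize(L, H):
    """Synthesis cycle per (3): upsample by two (zeros at odd indices), filter, sum.
    For an orthonormal pair under this indexing, the synthesis coefficients equal
    the analysis coefficients; indices wrap to match the periodic extension."""
    N = 2 * len(L)
    Lu, Hu = np.zeros(N), np.zeros(N)
    Lu[::2], Hu[::2] = L, H
    xhat = np.zeros(N)
    for n in range(N):
        for i in range(len(h)):
            xhat[n] += h[i] * Lu[(n - i) % N] + g[i] * Hu[(n - i) % N]
    return xhat

x = np.arange(8.0)
L, H = analyze(x)
assert np.allclose(synthesize(L, H), x)   # perfect reconstruction: x_hat(n) = x(n)
```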

A common problem arises with respect to edges and is well known as the "edge problem" in wavelet/subband coding. Basically, the problem does not arise for infinite-length signals, but in practice we only transform a limited-length signal, such as a row of pixel values in an image. When applying (1) and (2), we immediately see that we need to extend the signal beyond the edges to provide all the coefficients required in those equations. The problem occurs again in (3). We can assume some suitable values for the extension, but if exact reconstruction is required, then the values of the referenced elements outside our initial set that are used in the analysis cycle need to be retained so that they can be used again in the synthesis cycle. The disadvantage of this is that the number of elements (samples) that have to be retained will be larger than the original set; thus, if the transform is applied recursively, the data set size will continue to rise and will not constitute a critically sampled data set.

On the other hand, the requirement that the number of elements necessary for representation remain constant is stated as follows: the highpass and lowpass filtering of a 1-D signal of length $N$ should result in outputs $L$ and $H$ with a total number of samples that is also $N$. Some methods [11] have been used to solve this problem, and generally all of them extend the 1-D signal to counter the effect of finite signal length. One extension method, which wraps the signal around the edges to form a periodic signal, provides perfect reconstruction but still produces degradation at the edges when subjected to coarse quantization. Another method, which makes a mirror extension at the edges, may produce less degradation but does not give perfect reconstruction for some filters.
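For reference, the two extensions just described can be sketched as follows (illustrative code, not from the paper):

```python
import numpy as np

def periodic_extend(x, pad):
    """Wrap-around extension: treats x as one period of an infinite signal."""
    return np.concatenate([x[-pad:], x, x[:pad]])

def mirror_extend(x, pad):
    """Symmetric (mirror) extension about each edge sample."""
    return np.concatenate([x[pad:0:-1], x, x[-2:-pad - 2:-1]])

x = np.array([1.0, 2.0, 3.0, 4.0])
print(periodic_extend(x, 2))  # [3. 4. 1. 2. 3. 4. 1. 2.]
print(mirror_extend(x, 2))    # [3. 2. 1. 2. 3. 4. 3. 2.]
```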

In this work, we use two types of wavelets: the symmetric 9-tap QMF filters of [12] for the spatial directions, and the 6-tap Daubechies filter [13], [14] for the temporal direction. The symmetric extension at the edges is suitable for the former filter but inappropriate for the latter. Because we are compelled to use this combination of filters, as will be explained in Section III, we use a new extension technique that provides perfect reconstruction for Daubechies filters (which we use for the temporal dimension) even at the edges. This new technique is explained in the Appendix so as not to distract from the main topic of this work.


III. THREE-DIMENSIONAL WAVELET CODING

In this section, we propose a coding system utilizing a 3-D wavelet transform that is able to handle interactive applications. We use separable wavelets for our decomposition, whereby 1-D wavelets are used in each of the three dimensions. We essentially perform the three steps of horizontal decomposition, then vertical decomposition, then temporal decomposition. When we apply the wavelet transform, we must therefore first define the signal segment that needs to be transformed. When we are presented with a still image, all pixels are already available to us, and thus we are free to immediately perform wavelet decomposition on the image. But when considering video sequences, to be able to perform temporal decomposition, we must consider a signal segment that contains all temporal versions of the same spatial pixel location. There will therefore be a delay, since we must wait for the required number of frames to accumulate.
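As an illustration of the separability just described, the following sketch applies a 1-D analysis along each axis of a frame buffer in turn. The Haar pair and periodic extension are again assumptions made for brevity; the paper uses the 9-tap QMF pair spatially and the 6-tap Daubechies pair temporally.

```python
import numpy as np

h = np.array([1.0, 1.0]) / np.sqrt(2.0)   # lowpass analysis (Haar, illustration only)
g = np.array([1.0, -1.0]) / np.sqrt(2.0)  # highpass analysis

def dwt1d(x):
    """One-level 1-D analysis with periodic extension; output is [L | H]."""
    T, N = len(h), len(x)
    xe = np.concatenate([x, x[:T - 1]])
    L = np.array([h @ xe[2*n:2*n + T] for n in range(N // 2)])
    H = np.array([g @ xe[2*n:2*n + T] for n in range(N // 2)])
    return np.concatenate([L, H])

def dwt3d(block):
    """Separable 3-D decomposition: horizontal, then vertical, then temporal."""
    out = np.apply_along_axis(dwt1d, 2, block)   # horizontal (along each row)
    out = np.apply_along_axis(dwt1d, 1, out)     # vertical (along each column)
    out = np.apply_along_axis(dwt1d, 0, out)     # temporal (along each pixel's history)
    return out

frames = np.random.rand(16, 64, 64)   # 16 buffered frames of one 64 x 64 block
coeffs = dwt3d(frames)                # same total size: critically sampled
```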

When we specifically consider the special case of video conferencing applications, we note that such an application is typically an interactive communication utility. In such a case, the encoding time at the transmitter plus the transmission time plus the decoding time at the receiver must sum to an acceptable delay period. A widely acceptable delay is about 1/4 s; at a frame rate of 30 frames/s, this translates into a maximum delay of seven frames. In other words, if we perform 3-D coding, we can only buffer a maximum of seven frames before applying the wavelet decomposition (not including computational delays). Let us give a conservative estimate that we can only buffer a maximum of four frames. Of course, such a low figure rules out wavelet filters that usually span six or eight elements at the lowest decomposition level, and multiples more at higher levels. Thus, we might be led to conclude that 3-D coding is only suitable for noninteractive general video in which delay is unimportant. But we will demonstrate that, by imposing a certain setup, we can use 3-D coding in interactive applications and still make the most of buffering a large number of frames, which enhances wavelet performance.

The basic idea of this 3-D coding system is as follows. Assume that wavelets can only be applied effectively when we have buffered a large number of frames, say 16, while we can only buffer a few new frames (say four) because of the delay consideration, as previously explained. Then, if it is possible to put together 12 past frames with the four new frames, we have the 16 frames that we need. Now, if we can extract only the wavelet coefficients associated with the last four frames, then we have taken advantage of applying the 3-D wavelet transform on the entire 16 frames, as is preferable, while not spending extra bit rate to code the coefficients concerning the other 12 frames. This setup was found to be necessary because an effective 3-D wavelet transform may not be able to operate on only four frames, since they are too few, unless we use the Haar wavelet, which, as will be shown in the simulation section, gives inferior results.

Fig. 1. Two-dimensional example of selectively transmitting coefficients belonging to the region of interest (the shaded regions approximate the important coefficients).

Considering decomposition along the temporal direction, the application of the wavelet transform to a signal segment results in two lower-resolution component segments (the lowpass and the highpass). Now, it is in our interest that the pixels of the new frames that we wish to code, when wavelet transformed, end up in a confined region and not have their effect dispersed among too many coefficients in the transform domain. The reason for this is that we are going to extract the wavelet coefficients concerning the last four frames, and hence we do not want these to be too many; otherwise we would have low coding efficiency. Not all wavelets have such a characteristic, as will be explained later in this section.

Fig. 1 shows a schematic diagram of an example of what is meant by having a signal's effect confined when wavelet transformed. It is easier to comprehend this 2-D version, and the 3-D version follows easily. In the figure, we assume that we have a 2-D signal with new columns (as opposed to frames). We wish to transform the 2-D signal concerning the new columns, but by applying a transform to the whole signal so as to make use of any correlation. After the second iteration, the coefficients concerning the new columns will be dispersed as shown in the shaded regions of the figure. We want to transmit only the coefficients in the shaded parts. At the other end, the receiver has in its buffer only the old columns. Before we can inverse transform the coefficients concerning the new columns that the receiver has just received, we would like to insert them in their correct respective places. Before we can do that, the receiver must wavelet transform the old columns, and since the dimensions of the 2-D image must equal those at the transmitter, we extend the 2-D old columns by assumed columns (chosen, for example, as a constant extrapolation of the last frame). We then transform this extended signal and reinsert the coefficients that the transmitter sent in their corresponding places. The process of inserting the coefficients into their respective places replaces the coefficients that result from the assumed columns. In reality, the transformation causes wider dispersion than shown in the shaded regions. We only transmit the more important coefficients, since it is not cost-effective from the compression point of view to transmit the remaining coefficients. Because coefficients are neglected, this approximation becomes more and more accurate as the new frames become more correlated with the old frames. Practical experiments (Section V) show that it is an acceptable approximation in general, especially if movement in the video is slow, as is the case in typical video-conferencing-type sequences.

After such a procedure, we end up with a 2-D signal that resembles the original 2-D signal that we are compressing. In other words, even though we coded only a few columns, we were able to take into account the correlation over a wider range of columns while not sacrificing extra bit rate. A 3-D interpretation follows easily, and Fig. 2 offers an example concerning four new frames that are to be coded. Two decomposition iterations are performed.

Fig. 2. Cubic portion of frames undergoing 3-D wavelet transformation. Notice the position of the coefficients that depend on the last four frames.

It remains to be seen whether there exist filters that, when applied to a signal segment, do not disperse the effect of a localized region. If, for example, we consider the 9-tap QMF filter, a single pixel in a signal segment will affect a total of 17 coefficients. If we want perfect reconstruction, then all 17 coefficients concerning this one pixel must be transmitted to the receiver. This figure is in fact lower than 17 because, first, we are not treating one pixel but four adjacent pixels, so there is overlap in the affected coefficients, and second, we are treating pixels at the edge of a signal segment, so we ignore affected regions that are outside our scope. But even then the figure is still too high, and it would be advantageous to find a filter for which approximately four coefficients are affected by four pixels. Such a setup can be achieved if we use the Daubechies filters [13], because they have a more compact representation. When we specifically use the 6-tap filter, which was built to allow perfect reconstruction, we find that the influence of a group of pixels in a signal segment is still a bit too wide (dispersion still occurs). Therefore, we make an approximation by neglecting the least affected of these dispersed coefficients. Retaining all affected coefficients is the only way to ensure perfect reconstruction, but this is inefficient. A cost-effective solution, which we found to give acceptable distortion, is to retain as many coefficients as there were pixels and neglect the rest. A heuristic choice of retained coefficients is shown in Figs. 1 and 2; these coefficients are the ones most dependent on the pixels that we want to compress. It is important to note that, in general, filter length does not directly affect the time locality. We simply stress that in the special circumstances of this algorithm, where coefficients are neglected in this special way, filters may in fact have different characteristics, especially when handling signals at regions near the edges. We have performed experiments, and the results agree with this interpretation. In the spatial domain, we use the symmetric QMF filters because they have superior performance and the circumstances allow us to use them.
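To make the dispersion count concrete, the following sketch lists which first-level analysis outputs, indexed as in (1), depend on the last four samples of a 16-sample temporal segment, for a 2-tap and a 6-tap filter. The indexing convention is ours and is used only for illustration:

```python
def affected_coefficients(N, T, new):
    """Indices n of first-level outputs L(n) = sum_i h(i) x(2n+i), n = 0..N/2-1,
    whose support {2n, ..., 2n+T-1} touches the last `new` samples."""
    first_new = N - new
    return [n for n in range(N // 2) if 2 * n + T - 1 >= first_new]

# 16-sample temporal segment; the last four samples are the "new" frames.
print(affected_coefficients(16, 2, 4))  # 2-tap (Haar): [6, 7] -> confined footprint
print(affected_coefficients(16, 6, 4))  # 6-tap:        [4, 5, 6, 7] -> wider dispersion
```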

A. Detailed Steps of the Algorithm

Step 1: When the transmitter and receiver start to make contact, the transmitter sends 12 frames using any kind of coding. We suggest a sort of intraframe coding for simplicity. The lowpass coarsest subband is coded using adaptive Huffman coding with past-frame prediction. These 12 frames will be the basis of our prediction in the 3-D transform.


Step 2: The transmitter buffers four new frames.

Step 3: The transmitter, now having 12 old frames and four new frames, i.e., a total of 16 frames, can perform a 3-D decomposition of the signal. To make practical implementation possible, we divide the frames into spatial blocks of size 64 × 64, such that the 3-D signal handled at any one time is of dimensions 64 × 64 × 16.

Step 4: The encoder at the transmitter must now locate the coefficients in the transform domain that depend (or depend most) on the four new frames. Their positions are known beforehand and are fixed for any set of filters (see Fig. 2). These coefficients are then coded using any efficient coding method; we describe one such method in the next section.

Step 5: The receiver already has the 12 old frames and needs to reproduce the four new frames. But these 12 old frames are not in the transform domain, while the data received from the transmitter is. We therefore extend the 12 frames to 16 by repeating the 12th frame four times, and then 3-D transform the resulting 64 × 64 × 16 signal. The received coded coefficients corresponding to the four new frames are then distributed among their correct locations, where they approximately replace all the coefficients concerning the extended four frames.

Step 6: Now that the receiver has a 64 × 64 × 16 block that resembles the original one except for quantization errors, we can perform an inverse 3-D wavelet transform on the block and end up with our 12 old frames plus the four new frames.

Step 7: This process is repeated for all spatial 64 × 64 blocks. When the whole frame has been completed, the receiver and transmitter shift the frames back in time by four temporal locations, to get ready for another four new frames, and we return to Step 2. A sketch of this transmitter/receiver loop follows.
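The following is a minimal structural sketch of Steps 3–6 for one 64 × 64 block. It is not the paper's implementation: forward3d and inverse3d stand for any perfect-reconstruction 3-D wavelet pair (such as the dwt3d sketch above and its inverse), new_frame_mask for the fixed coefficient positions of Fig. 2, and quantize for the coder of Section IV; all four names are placeholders.

```python
import numpy as np

def encode_block(old12, new4, forward3d, new_frame_mask, quantize):
    """Transmitter, Steps 3-4: transform 12 old + 4 new frames and keep only
    the quantized coefficients that depend (most) on the four new frames."""
    block16 = np.concatenate([old12, new4], axis=0)
    coeffs = forward3d(block16)
    return quantize(coeffs[new_frame_mask])

def decode_block(old12, payload, forward3d, inverse3d, new_frame_mask):
    """Receiver, Steps 5-6: extend by repeating the 12th frame, transform,
    reinsert the received coefficients, and inverse transform."""
    extended = np.concatenate([old12, np.repeat(old12[-1:], 4, axis=0)], axis=0)
    coeffs = forward3d(extended)
    coeffs[new_frame_mask] = payload      # replaces coefficients of the assumed frames
    return inverse3d(coeffs)[-4:]         # the four reconstructed new frames

# To stay in full synchronism (see below), the transmitter also runs decode_block
# on its own output and uses the reconstructed frames as the next "old" frames.
```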

As we see in Step 5 above, the receiver extends the number of frames to 16 by repetition before applying the temporal decomposition. As mentioned earlier, the coefficients representing these frames will be slightly dispersed beyond the confinement region. We are only able to replace some of them when we distribute the coefficients received from the transmitter. Because it was found cost-effective to send only the more important coefficients, there will be some mismatch between the transmitter and the receiver. To solve this potential problem, the transmitter repeats all steps undertaken by the receiver. In other words, the transmitter, after it has selected the coefficients to be sent (and quantized them), performs a four-frame extension by repetition, forward transforms the 3-D block, reinserts these coefficients, and then performs a backward transform. Now, both receiver and transmitter are ensured to be in full synchronism.

Fig. 3. Parent-child relationship for a 3-D version of zero-trees for a two-level decomposition of a 3-D signal.

IV. THREE-DIMENSIONAL ZERO-TREE CODING

The performance of any codec depends not only on the choice of wavelet filter or the number of decompositions, but also on the way the coefficients are quantized and encoded. Many successful 2-D wavelet-based image compression schemes have been reported in the literature, ranging from simple entropy coding to more complex techniques such as vector quantization [12], tree encodings [15], [16], and edge-based coding [17]. Shapiro's zero-tree encoding algorithm [15] defines a data structure called a zero-tree. An extension of this algorithm to suit 3-D coding was derived independently in [25], prior to the appearance of [4]. This section briefly outlines the major concepts of a 3-D zero-tree coding algorithm and how it is used to fulfill our objective.

The image sequence $s$ is first filtered along the horizontal ($x$) dimension, resulting in a lowpass image sequence $L$ and a highpass image sequence $H$. Since the bandwidth of $L$ and $H$ along the $x$ dimension is now half that of $s$, we can safely downsample each of the filtered image sequences in the $x$ dimension by two without loss of information. The downsampling is accomplished by dropping every other filtered value. Both $L$ and $H$ are then filtered along the vertical ($y$) dimension, resulting in four subimage sequences: $LL$, $LH$, $HL$, and $HH$. Once again, we can downsample the subimages by two, this time along the $y$ dimension. Each of these four subimage sequences is then filtered along the temporal ($t$) dimension, resulting in eight subimages: $LLL$, $LLH$, $LHL$, $LHH$, $HLL$, $HLH$, $HHL$, and $HHH$ (see Fig. 3). The 3-D filtering decomposes an image sequence into the slowly moving average signal $LLL$ and seven detail signals which are directionally sensitive: for example, $LLH$ emphasizes the quickly changing average signal, $HLL$ the slowly changing horizontal image features, $LHL$ the slowly changing vertical features, and $HHH$ the quickly changing diagonal features.


The directional sensitivity of the detail signals is caused by the frequency ranges they contain.

It is common in wavelet compression to recursively transform the average signal $LLL$. The number of transformations performed depends on several factors, including the amount of compression desired, the size of the original image, and the length of the filters. In general, the higher the desired compression ratio, the more times the transform is performed.

The encoder/decoder pair, or codec, has the task of compressing and decompressing the sparse matrix of quantized coefficients. A wavelet coefficient $c$ is said to be insignificant with respect to a given threshold $\tau$ if $|c| < \tau$. The zero-tree is based on the assumption that when a wavelet coefficient at a coarse scale is insignificant with respect to some threshold $\tau$, then all wavelet coefficients having the same orientation (as defined by the type of filters applied to any one subband) at the same spatial and temporal location at all finer scales (lower decomposition levels) are likely to also be insignificant with respect to $\tau$.

Consider a hierarchical recursive wavelet system. We define parent and child relationships to relate coefficients of one given scale to the next. The first modification to make is the zero-tree shape. In 2-D, a zero-tree starts as a single coefficient; it then has four child coefficients, then each of those another four, and so on. In three dimensions, a zero-tree starts as a single coefficient. Then, for the next level, it has eight child coefficients, each of which has another eight, and so on. If we assume a two-level decomposition, then the parent/child relationships are as follows (see Fig. 3): a coefficient in LLH2 is the parent of eight coefficients in LLH1 (where 1 and 2 denote the first and second decomposition levels, respectively). A coefficient in LLH1 has no children. A coefficient in HLL2 is the parent of eight coefficients in HLL1, and so on.
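A minimal sketch of this dyadic parent/child indexing (the coordinate convention is ours, not the paper's):

```python
def children_3d(t, y, x):
    """Children of coefficient (t, y, x) one scale finer in a dyadic 3-D
    decomposition: the 2 x 2 x 2 block at doubled coordinates."""
    return [(2*t + dt, 2*y + dy, 2*x + dx)
            for dt in (0, 1) for dy in (0, 1) for dx in (0, 1)]

print(len(children_3d(3, 5, 2)))  # 8 children, versus 4 in the 2-D zero-tree
```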

When encoding the zero-tree information, the coefficients are scanned in such a way that no child node is scanned before its parent. For an $n$-scale transform, the scan is done separately for the slowly moving and the quickly moving subbands. The slowly moving subband scan, for example, begins at the lowest frequency subband, denoted by $LLL_n$, and scans subbands $HLL_n$, $LHL_n$, and $HHL_n$, at which point it moves on to scale $n-1$, etc. Note that each coefficient within a given subband is scanned before any coefficient in the next subband.

A coefficient is said to be an element of a zero-tree for a threshold $\tau$ if it and all of its descendants (children, children of children, and so on) are insignificant with respect to $\tau$. A symbol is used to indicate a zero-tree root and the insignificance of the coefficients at all finer scales. This information can be efficiently represented by a string of symbols from a three-symbol alphabet (similar to [15]), which can be entropy coded. These three symbols are: 1) zero-tree root (ZT); 2) isolated zero (IZ), which means that the coefficient is insignificant but has some significant descendant; and 3) significant (SIG).

Fig. 4. State diagram showing possible output sequences.

It has also been found more efficient to encode runs of the zero-tree root symbol (ZT) using variable codeword-length run-length encoding. A state diagram is shown in Fig. 4. Thus, we are always in one of two states: we are either counting runs of ZT, or we encode either IZ or SIG. We chose to alternate between the two states so as to save on the signaling that would be needed to also switch between IZ and SIG (it is more efficient to inhibit the transition between IZ and SIG, and provide a zero-length run of ZT in exchange). ZT run-length values are encoded using fixed Huffman tables, as in [18]. Formerly marked descendants of a previously found zero-tree root are not counted in the run-length, as they are known at the decoder. To differentiate between IZ and SIG, we use one bit. In case the coefficient is SIG, we quantize the coefficient using a nonlinear optimal Lloyd quantizer and then entropy-encode the quantized values.
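The two-state alternation of Fig. 4 can be sketched as follows; the classification of coefficients into ZT/IZ/SIG is assumed already done, and the Huffman stage of [18] is omitted:

```python
def serialize(symbols):
    """Alternate between (ZT run-length) and (one IZ/SIG symbol), per Fig. 4.
    A zero-length ZT run is emitted when two IZ/SIG symbols are adjacent."""
    out, run = [], 0
    for s in symbols:
        if s == "ZT":
            run += 1
        else:                      # IZ or SIG: flush the current ZT run first
            out.append(run)        # run-length (possibly 0); Huffman-coded in practice
            out.append(s)          # one bit suffices to tell IZ from SIG
            run = 0
    if run:
        out.append(run)
    return out

print(serialize(["ZT", "ZT", "SIG", "IZ", "ZT", "SIG"]))
# [2, 'SIG', 0, 'IZ', 1, 'SIG']
```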

As for the lowest subband (the one subjected only to lowpass filters), we code it using variable-length differential pulse code modulation (DPCM) with past-frame prediction. Fixed Huffman tables are used to encode the differential values, as the distribution of differential values is known a priori.

V. CODING RESULTS

In this section, we demonstrate the performance of the proposed system. We use the first 36 frames of the image sequences Claire and salesman. We use only a 320 × 256 portion of the frame so as to simplify division into 64 × 64 blocks. The frame rate is 30 frames/s, and we handle the gray-scale frames, which have a resolution of 256 levels.

As indicated before, we code the first 12 frames using 2-D coding; we then switch to 3-D coding mode. The PSNR of the different frames for the Claire and salesman sequences is shown in Figs. 5 and 6, respectively. Shown in these figures are comparisons of the ITU-T H.261 standard [19] and the 3-D wavelet systems that use the Haar wavelet and the six-tap Daubechies filter. Coding of the first 12 frames for the 3-D methods requires an average bit rate of around 900 kb/s. The steady-state bit rate for all designs is fixed at around 300 kb/s. Results are demonstrated only for the 320 × 256 portion of the sequences, and the following peak signal-to-noise ratios (PSNR's) are calculated as averages over frame 13 to frame 36, because that is when all sequences are running at the same bit rate. The average PSNR over the test sequence Claire for H.261 is 40.58 dB, for Haar is 40.05 dB, and for 6-tap Daubechies is 40.63 dB.


Fig. 5. PSNR comparison of the Claire sequence. Two-dimensional coding is used until frame 12, then 3-D afterwards.

Fig. 6. PSNR comparison of the salesman sequence. Two-dimensional coding is used until frame 12, then 3-D afterwards.

The average PSNR over the test sequence salesman for H.261 is 35.58 dB, for Haar is 35.48 dB, and for 6-tap Daubechies is 35.97 dB. This shows that incorporating as many frames as possible in the decomposition (6-tap Daubechies versus Haar) produces better results, and that the performance of this new system is comparable to H.261, if not better. Moreover, this design does not require the time- and computation-consuming motion compensation (MC) step and thus can be viewed as a better solution in situations where such resources are scarce. However, standards like H.261 and H.263 [20] have quite sophisticated MC procedures that are the main source of their efficiency. H.263 outperforms H.261 by about 2 dB for these sequences, and this additional improvement is mainly attributed to further improvements in MC. H.263 is a highly optimized method and is therefore very hard to beat. In our proposed wavelet approach, we have not used any MC, so in that sense our results are favorable. Recent work in [21]–[23] combined MC with 3-D subband coding, and very promising results were demonstrated. We believe that if our proposed system outperforms H.261 in its present state, then adding H.261's MC will most probably improve performance, and such complicated MC as used in H.263 even more so. This reduced-buffering 3-D codec can approximate a real 3-D codec only under the condition that the neglected coefficients are in fact negligible. Under conditions of high motion, we expect these coefficients to be more significant, and the codec performance is bound to deteriorate. It then becomes necessary to have motion compensation to counteract any motion that occurs. Any discrete cosine transform (DCT)-based codec suffers a similar fate when no motion compensation is used. Many of the referenced papers on 3-D coders show compression results that are favorable when motion is low; significant drops in PSNR are then exhibited when motion increases. Research is under way on combining the MC step of [21] with our proposed method, but this will be studied in future work. Also, though zero-tree encoding is a very effective coding technique, it can easily be exchanged for any more effective coding technique discovered in the future. The actual encoding of the wavelet coefficients and the actual way that the system works are two separate issues.

Fig. 7. PSNR comparison of the Claire sequence at 64 kb/s.

We also conducted a test at the lower rate of 64 kb/s. Fig. 7 shows the results for the Claire sequence. The average PSNR for the 3-D codec is 36.82 dB, and that of the DCT-based system is 36.78 dB. To achieve the shown results, we code CIF frames at only 10 frames/s.

Subjectively, the 3-D wavelet codec produced reconstructed frames that did not contain the blocking degradations that are always produced by H.261. The blurring that occurred in high-movement areas was not as annoying as these blocking effects. See Fig. 8 for a comparison of the H.261 output with that of our 3-D coder for the Claire sequence at the higher bit rate (frame 33), and Fig. 9 for the lower bit rate (frame 90). The ninetieth frame, for the lower rate, was chosen on purpose to show that even when the PSNR of the 3-D coder is lower than that of H.261 for that particular frame, the subjective results are favorable. The performance of our coder is more pronounced when movement is low, since we do not use MC, as previously explained. This can be noticed in the second part of the salesman sequence, where movement is low. Fig. 10 shows a comparison of the thirty-sixth frame of salesman. One final comment about the results is that the periodic change in PSNR seen in Figs. 5–7 is a problem characteristic of most 3-D subband/wavelet coders reported in the literature. It is mainly due to the fact that the joint encoding of a group of frames makes it difficult to know which coefficients affect exactly which frames, since the group is now considered a 3-D signal, i.e., a single unit.


Fig. 8. Comparison of the thirty-third frame of the Claire sequence. (a) Original. (b) H.261. (c) Three-dimensional coder (6-tap Daubechies).

It is interesting to study how the codec would perform at higher bit rates. As mentioned before, the 3-D decomposition causes dispersion of the coefficients that concern the frames we wish to encode. Because we neglect some coefficients in the way we code them, an upper limit is placed on the possible reconstruction performance. We have conducted tests on both the Claire sequence and the salesman sequence under conditions of no quantization. Fig. 11 shows the output results. It can be seen that the neglected coefficients do have an impact on the highest achievable PSNR. So even if bit rate is available, performance is bounded by the curves shown in the graph. To overcome this limitation, we must use a different coefficient-encoding algorithm (different from the zero-tree used here), because the zero-tree codes all coefficients in the same manner irrespective of the subband. In a modified algorithm, we would need to distribute the available bit budget in a more intelligent way.

Fig. 9. Comparison of the ninetieth frame of the Claire sequence at 64 kb/s. (a) Original. (b) H.261. (c) Three-dimensional coder (6-tap Daubechies).

Fig. 10. Comparison of the thirty-sixth frame of the salesman sequence. (a) Original. (b) H.261. (c) Three-dimensional coder (6-tap Daubechies).

Fig. 11. Performance of the 3-D codec under conditions of "no quantization" of the 3-D coefficients. Frames 1–12 have exact reconstruction, as they are unaffected by neglected coefficients.

Enhancements to the performance of the coding algorithm are also possible. As indicated in the procedure, a 3-D forward and backward wavelet transform is performed for all 16 frames, while we are indeed only concerned with the four new frames. The following three steps can considerably decrease the required computations:

1) The first step addresses the issue of the forward transform at the transmitter. Since the transmitter will only send selected coefficients (those concerned only with the new frames), we can perform the forward wavelet transform only for the needed coefficients and save considerable time.

2) The second step concerns the forward transform at the receiver. If we consider the decomposition along the temporal axis, it should be noted that most of the coefficients calculated using convolution with the analysis filters correspond exactly to a translated counterpart from the transformed 3-D block that was handled when considering the previous four frames. Thus, for those coefficients, we need only perform a translation of the coefficients from the previously forward-transformed block. The amount of such translation depends on the specific coefficient. If the coefficient is in the first decomposition level, then we translate it by $4/2 = 2$ positions back in the temporal direction. If it is in the second decomposition level, then we shift it by $4/4 = 1$ position, and so on.

3) Since, at the receiver, we are only concerned with calculating the pixels inside the four new frames, we need only inverse transform those pixels.

The overall effect of these optimizations is a decrease in the required computations, at the cost of a slightly more sophisticated implementation, but the purpose is to show that treating four frames as part of a larger 16-frame group does not necessarily constitute a proportional increase in computations.

VI. CONCLUSION

A 3-D coding system for video conferencing applications has been introduced. 3-D coding for such applications was usually avoided because of the delay that such a system would introduce. A method whereby the advantages of 3-D coding can be effectively exploited was demonstrated, and a solution to the delay problem was highlighted. Although the proposed algorithm is comparable to the H.261 standard and worse than the H.263 standard, it has the advantage of lower computational complexity. In addition, by adding the sophisticated MC of H.261 or H.263, the performance of the developed algorithm can be further enhanced. We believe there is room for improvement in this direction, but this will be studied in future work.

APPENDIX

Because the forward and backward transformations are accomplished by a convolution operation, we run into the problem of edge effects. For the Daubechies family of filters (which are nonsymmetric filters), there exists only one simple edge-extension solution, which involves cyclically repeating the signal to form an infinite-like signal. When quantizing coefficients resulting from transforming such a signal, high reconstruction errors may appear, especially if the signal has highly differing amplitudes at the opposing edges. Here, we present a new edge extension that has better performance under general circumstances.

Before we describe the new edge extension, we will explain why exactly the finite-length input signal poses a problem. Application of the forward and backward transforms can be implemented in several different ways. We present here two schemes.

Scheme A

This is the same as described in Section II; it is repeated here for convenience. The forward transform is performed as follows:

$$L(n) = \sum_{i=0}^{T-1} h(i)\,x(2n+i) \tag{4}$$

$$H(n) = \sum_{i=0}^{T-1} g(i)\,x(2n+i) \tag{5}$$

where $L(n)$ represents the lowpass downsampled result, $H(n)$ is the highpass downsampled result, and $h(i)$ and $g(i)$, respectively, are the coefficients of the discrete-time forward filters $h$ and $g$. The backward transform is performed as follows:

$$\hat{x}(n) = \sum_{i=0}^{T-1} \left[\tilde{h}(i)\,\tilde{L}(n-i) + \tilde{g}(i)\,\tilde{H}(n-i)\right]. \tag{6}$$

Scheme B

Another possible implementation of the forward and backward transform is as follows:

$$L(n) = \sum_{i=0}^{T-1} h(i)\,x(2n-i) \tag{7}$$

$$H(n) = \sum_{i=0}^{T-1} g(i)\,x(2n-i) \tag{8}$$

$$\hat{x}(n) = \sum_{i=0}^{T-1} \left[\tilde{h}(i)\,\tilde{L}(n+i) + \tilde{g}(i)\,\tilde{H}(n+i)\right]. \tag{9}$$

Both Scheme A and Scheme B suffer from edge problems. Scheme A gives perfect reconstruction at the right edge but distorts the left edge. As can be seen from (4) and (5), the analysis accesses elements $x(N), \ldots, x(N+T-3)$ (beyond the right edge), and the synthesis in (6) accesses elements $\tilde{L}(-1), \ldots, \tilde{L}(-T+1)$ and $\tilde{H}(-1), \ldots, \tilde{H}(-T+1)$ (beyond the left edge). In the analysis cycle, we can make any arbitrary extension and still obtain perfect reconstruction at the right edge. In the synthesis cycle, however, not just any extension at the left edge leads to perfect reconstruction. Scheme B has the same problems except that the edges are switched, and now only the left edge gets perfect reconstruction.

Since Scheme A and Scheme B are very similar, only Scheme A will be discussed in detail. Now, the problem can be solved if we can extend the left edges of $L$ and $H$ so that the synthesis cycle can be performed. We can force perfect reconstruction at the left edge by making some suitable extension to the signals $L$ and $H$ beyond their left edge such that, over an analysis and then synthesis cycle at that edge, we get exact reconstruction for all edge pixels. This, however, will not work using (4), (5), and (6) as they are. To see this, note that (4) and (5) do not even access elements below $x(0)$; extending the signal $x$ at the left edge therefore seems worthless. Even if we use the equations of Scheme B, the exact same situation occurs. We propose, therefore, to solve this problem by implementing the analysis/synthesis as in the following scheme.

Scheme C

$$L(n) = \sum_{i=0}^{T-1} h(i)\,x(2n+i-T/2) \tag{10}$$

$$H(n) = \sum_{i=0}^{T-1} g(i)\,x(2n+i-T/2) \tag{11}$$

$$\hat{x}(n) = \sum_{i=0}^{T-1} \left[\tilde{h}(i)\,\tilde{L}(n-i+T/2) + \tilde{g}(i)\,\tilde{H}(n-i+T/2)\right]. \tag{12}$$

Here, it must be noted that the forward transform, denoted by (10) and (11), now requires knowledge of signal elements beyond both the left and the right edge (i.e., it requires extension at both edges). The backward transform likewise needs $\tilde{L}$ and $\tilde{H}$ extended both to the left and to the right. This is now very appropriate, because we can achieve perfect reconstruction at both edges by choosing those edge extensions wisely.

We will only demonstrate the solution for the left edge; the analysis carries over to the case of the right edge. As mentioned before, we need to extend the signal $x$ at the left edge (and right edge), and the signals $L$ and $H$ to the left. Denote the extended elements of the signal $x$, namely $x(-1), \ldots, x(-T/2)$, by $a_1, \ldots, a_{T/2}$, respectively. Also denote the extended elements of the signals $L$ and $H$, namely $L(-1), \ldots, L(-T/2+1)$ and $H(-1), \ldots, H(-T/2+1)$ ($T$ is always an even number), by $b_1, \ldots, b_{T-2}$, respectively. From the analysis and synthesis equations (10)–(12), we can obtain linear equations that give the reconstruction values at the edge elements of the signal. In other words, we can write $L$ and $H$ as linear combinations of terms of $x$, which now include terms depending on the $a_i$, and we can write the reconstruction $\hat{x}$ as a linear combination of terms of $L$ and $H$, which now include terms depending on the $b_i$. Since $\hat{x}$ is in terms of $L$ and $H$ (and the $b_i$), it is also in terms of $x$ (and the $a_i$). Now, if we impose that $\hat{x}(n)$ exactly match $x(n)$, we get a linear equation with only the $a_i$ and $b_i$ unknown. Collecting such equations for different $n$, we get a set of simultaneous linear equations that can be solved for the unknowns $a_i$ and $b_i$. Because only the equations near the edge actually reference the $a_i$ and $b_i$, and because we have a total of $3T/2 - 2$ unknowns ($T/2$ of the $a_i$ and $T-2$ of the $b_i$), we have the freedom to assume the values of some of the variables and still be able to find perfect solutions to the equations. One such set of assumed values is to mirror-reflect the signal, such that $x(-i) = x(i)$, $i = 1, \ldots, T/2$. This makes up for the $T/2$ unknowns $a_i$ and was found to perform well under quantization. Thus, we only have to solve for the unknowns $b_i$. With some simple manipulation of these equations, we can write

$$\mathbf{b} = \mathbf{M}\,\mathbf{c} \tag{13}$$

where

$$\mathbf{b} = \left[L(-1), \ldots, L(-T/2+1), H(-1), \ldots, H(-T/2+1)\right]^{\mathrm{T}} \tag{14}$$

and

$$\mathbf{c} = \left[x(0), x(1), \ldots, x(K-1)\right]^{\mathrm{T}} \tag{15}$$

where $\mathbf{M}$ is a constant matrix of dimensions $(T-2) \times K$. The number $K$ of elements of $x$ that are needed follows from careful examination of (10)–(12).

After we have found such a matrix, it can be fixed for any Daubechies filter. Thus, solving the simultaneous equations is done entirely off-line to calculate the matrix $\mathbf{M}$. During online operation of the transformation step, all we do is mirror-reflect to obtain the extension of $x$, calculate (13), and use $\mathbf{b}$ as the extension for $L$ and $H$. Thus, this edge-extension scheme is not complex to implement and involves a simple matrix multiplication at the edge of the signal segment. We repeat this similarly for the right edge.
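The online part of the scheme therefore reduces to a mirror reflection plus one small matrix multiplication per edge. The sketch below shows only that mechanics: the matrix M is assumed to have been precomputed offline by solving the simultaneous equations, and both its entries and the count K of referenced samples are placeholders here, not values from the paper.

```python
import numpy as np

T = 6                      # 6-tap Daubechies filter, as used temporally in the paper
K = T                      # number of near-edge samples M needs; placeholder value
M = np.zeros((T - 2, K))   # placeholder entries; the real M is solved for offline

def extend_left(x):
    """Online handling of the left edge of a segment, per (13): mirror-reflect x,
    then one small matrix multiplication yields the extensions of L and H."""
    a = x[1:T // 2 + 1][::-1]     # x(-i) = x(i) for i = 1..T/2, far-to-near order
    b = M @ x[:K]                 # extended elements of L and H
    return a, b[:T // 2 - 1], b[T // 2 - 1:]   # x-, L-, and H-extensions
```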

We wish to note another interesting solution to the edge-effect problem that has recently been developed [24]. The solution is based on the idea that since the Daubechies $T$-tap filter has $T/2$ vanishing moments, it can represent a polynomial of degree $T/2 - 1$ very efficiently. If the extension of the finite-length signal segment is then assumed to follow a polynomial of such a degree, a polynomial expansion of the scaling function around the edge region produces a number of linear equations that can be solved to deduce the polynomial coefficients.

REFERENCES

[1] D. Taubman and A. Zakhor, "Multirate 3-D subband coding of video," IEEE Trans. Image Processing, vol. 3, pp. 572–588, Sept. 1994.

[2] Y. K. Kim, R. C. Kim, and S. U. Lee, "On the adaptive 3D subband video coding," in Proc. SPIE, vol. 2727, pp. 1302–1312, Mar. 1993.

[3] A. Mainguy and L. Wang, "Performance analysis of 3-D subband coding for low bit rate video," in Proc. SPIE, vol. 2952, pp. 372–379, 1996.

[4] Y. Chen and W. Pearlman, "Three-dimensional subband coding of video using the zero-tree method," in Proc. SPIE, vol. 2727, pp. 1302–1312, Mar. 1997.

[5] E. Chang and A. Zakhor, "Scalable video coding using 3-D subband velocity coding and multirate quantization," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, Minneapolis, MN, Apr. 1993, vol. 5, pp. 574–577.

[6] G. Karlsson and M. Vetterli, "Three dimensional subband coding of video," in Proc. ICASSP, 1988, pp. 1100–1103.

[7] C. I. Podilchuk, N. S. Jayant, and N. Farvardin, "Three-dimensional subband coding of video," IEEE Trans. Image Processing, vol. 4, pp. 125–139, Feb. 1995.

[8] M. Domanski and R. Swierczynski, "Efficient 3-D subband coding of color video," in Proc. IWISP'96, Manchester, U.K., pp. 277–280.

[9] T. Alonso, M. Luo, H. Shahri, and D. Youtkus, "3D subband coding of video conference sequences," in Proc. Int. Conf. Signal Processing Applications and Technology, Dallas, TX, Oct. 1994, vol. 2, pp. 933–938.

[10] S. Mallat, "A theory for multiresolution signal decomposition: The wavelet representation," IEEE Trans. Pattern Anal. Machine Intell., vol. 11, pp. 674–693, July 1989.

[11] G. Karlsson and M. Vetterli, "Extension of finite length signals for subband coding," Signal Process., vol. 17, pp. 161–166, June 1989.

[12] M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies, "Image coding using wavelet transform," IEEE Trans. Image Processing, vol. 1, pp. 205–220, Apr. 1992.

[13] Y. Chan, Wavelet Basics. Boston, MA: Kluwer, 1995.

[14] I. Daubechies, "Orthonormal bases of compactly supported wavelets," Commun. Pure Appl. Math., vol. 41, pp. 909–996, Nov. 1988.

[15] J. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients," IEEE Trans. Signal Processing, vol. 41, pp. 3445–3462, Dec. 1993.

[16] A. Lewis and G. Knowles, "Image compression using the 2-D wavelet transform," IEEE Trans. Image Processing, vol. 1, pp. 244–250, 1992.

[17] J. Froment and S. Mallat, "Second generation compact image coding with wavelets," in Wavelets: A Tutorial in Theory and Applications. San Diego, CA: Academic, 1992, vol. 2, pp. 655–678.

[18] G. Einarsson, "An improved implementation of predictive coding compression," IEEE Trans. Commun., vol. 39, pp. 169–171, Feb. 1991.

[19] ITU-T Recommendation H.261, "Video codec for audiovisual services at p × 64 kbit/s," Dec. 1990, Mar. 1993.

[20] ITU-T Recommendation H.263, "Video coding for low bit rate communication," Dec. 1995.

[21] J.-R. Ohm, "Three-dimensional subband coding with motion compensation," IEEE Trans. Image Processing, vol. 3, pp. 559–571, Sept. 1994.

[22] P. Waldemar, M. Rauth, and T. A. Ramstad, "Video compression by three-dimensional motion-compensated subband coding," in Proc. Norwegian Signal Processing Symp., Sept. 1995, pp. 227–232.

[23] K. Ngan and W. Chooi, "Very low bit rate video coding using 3D subband approach," IEEE Trans. Circuits Syst. Video Technol., vol. 4, pp. 309–316, June 1994.

[24] J. Williams and K. Amaratunga, "A discrete wavelet transform without edge effects using wavelet extrapolation," IESL MIT Tech. Rep. 95-02, Jan. 1995.

[25] H. Khalil, "New coding techniques for video images in conferencing applications," M.Sc. thesis, Cairo Univ., Cairo, Egypt, 1996.


Hosam Khalil (S'98) was born in Washington, DC, in 1972. He received the B.S. and M.S. degrees in electrical engineering with honors from Cairo University, Cairo, Egypt, in 1993 and 1996, respectively. He is currently pursuing the Ph.D. degree at the University of California, Santa Barbara.

From October 1993 to July 1996, he was an Assistant Teacher at Cairo University. He was also employed part-time during that period with IBM, Egypt. In the summer of 1998, he was an intern in the Multimedia Communications Research Laboratory, Bell Laboratories, Murray Hill, NJ. His research interests are in image and speech compression and recognition.

Mr. Khalil was a recipient of the UC Regents Fellowship.

Amir F. Atiya (S'86–M'90–SM'97) was born in Cairo, Egypt, in 1960. He received the B.S. degree in 1982 from Cairo University, and the M.S. and Ph.D. degrees in 1986 and 1991 from the California Institute of Technology (Caltech), Pasadena, CA, all in electrical engineering.

From 1985 to 1990, he was a Teaching and Research Assistant at Caltech. From September 1990 to July 1991, he held a Research Associate position at Texas A&M University, College Station. From July 1991 to February 1993, he was a Senior Research Scientist at QANTXX, Houston, TX, a financial modeling firm. In March 1993, he joined the Computer Engineering Department, Cairo University, as an Assistant Professor. He spent the summer of 1993 with Tradelink Inc., Chicago, IL, and the summers of 1994, 1995, and 1996 with Caltech. In March 1997, he joined Caltech as a Visiting Associate in the Department of Electrical Engineering. In January 1998, he joined Simplex Risk Management Inc., Hong Kong, as a Research Scientist, and he currently holds this position jointly with his position at Caltech. His research interests are in the areas of neural networks, learning theory, pattern recognition, data compression, and optimization theory. He has published about 60 papers in these fields. His most recent interests are the application of learning theory and computational methods to finance.

Dr. Atiya received the Egyptian State Prize for Best Research in Science and Engineering in 1994. He also received the Young Investigator Award from the International Neural Network Society in 1996. Currently, he is an Associate Editor for the IEEE TRANSACTIONS ON NEURAL NETWORKS. He has served on the organizing and program committees of several conferences, most recently Computational Finance, New York, 1999, and IDEAL'98, Hong Kong.

Samir I. Shaheen (S'74–M'81) received the B.Sc. and M.Sc. degrees from the Department of Electronic and Communications Engineering, Cairo University, Cairo, Egypt, in 1971 and 1974, respectively, and the Ph.D. degree from McGill University, Montreal, P.Q., Canada, in 1979.

From 1979 to 1981, he worked in the Computer Vision Laboratory, McGill University. In 1981, he joined Cairo University as an Assistant Professor. He joined the Computer Science Department, King Abdulaziz University, Saudi Arabia, as a Visiting Professor. He is currently a Professor in the Computer Engineering Department, Cairo University, and Director of the Computer Laboratories. He is the Manager of the Cairo University Computer Network and Computer Research Center. His current research interests include neural networks, medical imaging, document analysis, and multimedia in high-speed digital networks.