

Signal Processing: Image Communication 15 (1999) 35-56

Packet loss resilient MPEG-4 compliant video coding for the Internet

F. Le Léannec*, F. Toutain, C. Guillemot

INRIA/IRISA, Campus de Beaulieu, 35042 Rennes Cedex, France

Abstract

Targeting multimedia communications over the Internet, this paper describes a set of complementary techniques aimed at both improved packet loss resiliency of compressed video streams and efficient usage of available network resources. Aiming first at the best trade-off between compression efficiency and packet loss resiliency, a procedure for adapting the video coding modes to varying network characteristics is introduced. The coding mode selection is based on a rate-distortion procedure with global distortion metrics incorporating channel characteristics in the form of a two-state Markov model. This procedure has been incorporated in an MPEG-4 video encoder. It has been observed that, in error-free environments, the channel-adaptive mode selection technique does not bring any penalty in terms of compression with respect to the initial MPEG-4 encoder, while allowing a significant gain with respect to simple conditional replenishment. On the other hand, under the same loss conditions, it is shown that this procedure significantly improves the encoder's performance with respect to the original MPEG-4 encoder, approaching the robustness of conditional replenishment mechanisms. This intrinsic robustification of the encoder minimizes the effects of packet losses on the visual quality of the received video; however, it does not avoid losses. A rate-based flow control mechanism is then developed and introduced into the encoder, in order to match the bandwidth requirements of the source to the bandwidth available over the path of the connection, for both 'social' and 'individual' benefits. The control mechanism developed combines an RTT-based control loop allowing early reaction to congestion and a TCP-friendly rate prediction model that comes into play under lossy conditions. This hybrid control mechanism allows full rate control (even in loss-free conditions) and smooth rate variations together with high responsiveness. The introduction of the rate control in the MPEG-4 compliant encoder makes it possible to maintain a stable PSNR and visual quality while significantly decreasing the source throughput, hence reducing the congestion and loss provoked by the same video source at a constant bit-rate. © 1999 Elsevier Science B.V. All rights reserved.

1. Introduction

Multimedia communication within the current best-effort Internet faces well-known challenges with respect to quality of service, congestion management, and network friendliness. Due to the real-time nature of envisioned data streams, multimedia delivery usually makes use of the so-called unresponsive transport protocols, i.e. the User Datagram Protocol (UDP) and/or Real-time Transport Protocol (RTP). Both UDP and RTP offer no quality of service control mechanisms and can therefore not guarantee any level of QoS, despite the companion protocol Real-time Transport Control Protocol (RTCP). RTP is indeed somewhat of an empty shell for multimedia data bits, with respect to traditional transport features, e.g. flow control or reliability. Hence, multimedia communication relying on a best-effort network service, as provided by today's Internet, has to face varying network QoS characteristics, in terms of delay and packet losses.

* Corresponding author. Tel.: +33-299-842-543; fax: +33-299-847-171.

E-mail address: fabrice.le}leannec@irisa.fr (F. Le Léannec)

0923-5965/99/$ - see front matter © 1999 Elsevier Science B.V. All rights reserved. PII: S0923-5965(99)00023-5

Traditional video compression algorithms, relying widely on differential, run-length and variable length coding, are very sensitive to packet losses. Losses can spread within a single picture up to a given resynchronization point, or even across several pictures when using temporal prediction, and hence have different impacts on the quality of service, from decoder no-start to a whole range of quality impairments. Various approaches aiming at improved resiliency of video streams against packet losses have emerged recently. Targeting intrinsically error resilient streams, robust variable length codes such as reversible VLC [33], or mechanisms for limiting loss propagation, are introduced. Error propagation in the decoded stream is limited by incorporating in the stream syntactic descriptors like synchronization or data partitioning markers. Restrictions in terms of prediction window size are introduced, in order to confine all spatially predictively encoded information within a single video packet, delimited by resynchronization markers [10]. Similarly, in order to avoid temporal loss propagation, videoconferencing tools like nv and vic, currently used on the MBone, do not support temporal prediction but rely only on conditional replenishment [18]. Experiments reported here, based on a simplified MPEG-4 encoder where temporal prediction is replaced by conditional replenishment, show the compression loss traded for the increased packet loss robustness. These experiments also show that error resilient mechanisms such as those supported by the MPEG-4 video verification model may not be sufficient under high loss conditions. This paper proposes several solutions while remaining fully compliant with the MPEG-4 video syntax.

The first issue addressed here is therefore a better trade-off between error robustness and compression efficiency, while limiting both temporal and spatial propagation. First attempts at maintaining temporal prediction in an error-prone environment are considered in [9]. The channel is modelled by a Bernoulli process and the intra/inter coding mode selection relies on a Viterbi algorithm. However, the best-effort Internet is better modelled by finite state erasure channels exemplified by the Elliott-Gilbert channel [1]. A new coding mode selection strategy aiming at tailoring intra/inter modes to channel characteristics is introduced. The procedure relies on global distortion metrics incorporating an Elliott-Gilbert process for channel modelling.

At the transport level, error control mechanisms, such as forward error correction (FEC), automatic repeat request (ARQ), or hybrid ARQ/FEC, can also be considered. Error control mechanisms increase stream resiliency to packet loss, at the expense of increased bandwidth, but do not avoid packet loss. They are considered here for protecting high-priority information such as, for example, visual sequence or video object plane headers transporting decoder configuration parameters.

Complementary approaches, such as congestion and rate control, aim at minimizing the amount of packet loss by matching the video bandwidth requirement to the available network capacity. The basic principle behind unicast (or point-to-point) congestion control is an adaptive process involving a source and a receiver for controlling the source's throughput. By monitoring the network state, the source-receiver pair can detect incipient congestion and react by lowering the output rate. Conversely, an unloaded state triggers a rate increase so as to better use the available network resources. Quantities that are usually monitored include packet loss and round-trip time (RTT) delay. Schemes making use of additive increase/multiplicative decrease rate control are commonplace in the literature, the best known instance of this being the TCP protocol. The resulting aggregate behaviour of such schemes is ideally one in which the network utilization is kept high and the loss rate low. In addition, a new session coming into play may expect to get a more or less fair share of the network bandwidth.


However, additive increase/multiplicative decrease schemes typically give rise to so-called sawtooth rate patterns. The QoS requirements of a multimedia stream may be in sharp contrast, as smooth rate variations are often a prerequisite for maintaining acceptable video quality. Furthermore, end-to-end delay variations resulting from network queues building up in time of congestion have a greater impact on continuous data streams than they have on traditional computer communications. It is therefore a core issue to adopt congestion control strategies dedicated to continuous streams, i.e. schemes targeting functional goals which reflect the QoS requirements of these communications.
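The sawtooth pattern can be reproduced with a few lines of code. The following is a minimal sketch only, not the control scheme proposed in this paper: the function name, the step sizes and the rule that congestion occurs exactly when the rate exceeds a fixed capacity are all illustrative assumptions.

```python
def aimd_rates(capacity, steps, start=1.0, increase=1.0, decrease=0.5):
    """Return the sending rate at each step of a toy AIMD loop.

    Assumption (hypothetical): congestion is signalled whenever the
    rate exceeds `capacity`; real networks are far less clean.
    """
    rate = start
    history = []
    for _ in range(steps):
        history.append(rate)
        if rate > capacity:
            rate *= decrease   # multiplicative decrease on congestion
        else:
            rate += increase   # additive increase while unloaded
    return history


rates = aimd_rates(capacity=10.0, steps=30)
```

Plotting `rates` shows the rate climbing linearly, overshooting the capacity, and being halved, which is the abrupt variation that a video encoder would rather avoid.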

A hybrid RTT/Bandwidth control algorithm, built upon the TCP-friendly approach, is introduced in the source encoder regulation procedure. Apart from the TCP-friendliness property, the design goals of the hybrid control approach include full rate control (even when no packet loss occurs), smooth rate variations (no sawtooth pattern) together with high responsiveness, as well as the ability to make use of current RTP/RTCP features, and easy extension to multicast scenarios. This rate control mechanism makes it possible to maintain a stable SNR and visual quality while significantly decreasing the source throughput. Encoding at a bit rate adapted to the link, leading to few losses, is more efficient than using a higher bit rate which yields higher loss rates. By adjusting the source throughput to the available bandwidth, this mechanism reduces channel congestion and also leads to a better sharing of the network resources.

The remainder of this article is organized as follows. Section 2 reviews the issues associated with video communications in the Internet and the mechanisms that have been proposed to deal with them. Sections 3-5 tackle the error control issue, with Section 3 briefly addressing error resilience within the MPEG-4 framework, Section 4 investigating conditional replenishment schemes, and Section 5 being devoted to the Coding Mode Selection scheme that we have designed, and introducing some experimental results. Next we turn to the rate and congestion control issue, and introduce in Section 6 a hybrid RTT- and TCP-friendly-based rate control prediction scheme, followed by experimental results. Section 7 is devoted to the multicast scenario and reviews the various issues associated with video multicasting along with their possible answers within the framework of our proposals. Finally, concluding remarks and directions for future work are given in Section 8.

2. Video communication over the Internet

The success of the best-effort Internet is largely bound to the widespread use of TCP, which maintains good network conditions through its congestion control mechanism. However, the use of this protocol is incompatible with the strict delay requirements of real-time multimedia. Multimedia delivery relies on unresponsive protocols such as UDP and/or RTP. RTP offers no reliability mechanisms, has no notion of connection and is usually implemented as part of the application. Unresponsive use of these protocols gives rise to severe threats for the media QoS, as well as for the network QoS, as more and more multimedia streams are being deployed. As the number of unresponsive data flows increases, congestion inside the network, with its devastating effects on multimedia delivery and interaction (large packet losses and end-to-end delays), becomes a major concern.

Therefore, it is of prime importance to design loss resilient coding strategies as well as rate control strategies dedicated to multimedia communication inside best-effort networks, i.e. to implement mechanisms that are able to cope with the QoS-oriented needs of multimedia applications, yet at the same time ensure proper congestion avoidance or recovery. Widely retained approaches go from simplified and robust temporal redundancy exploitation techniques, often in the form of conditional replenishment, to error control and congestion control strategies.

2.1. Conditional replenishment

The compression efficiency of motion-compensated temporal prediction is often sacrificed for the simplicity and error resilience of conditional replenishment [18]. Temporal redundancy is only exploited through a change or motion detection between blocks of adjacent frames. Fig. 1 depicts a block diagram for the conditional replenishment algorithm. For each block in a new frame, a distance between the reference block and the new block is computed. If the distance is above a threshold, the block is encoded and transmitted.

Fig. 1. Conditional replenishment scheme.

2.2. Error control mechanisms

The objective of error control is to provide loss recovery facilities. To recover from loss, two well-known techniques exist: ARQ, and FEC in the form of so-called redundant data or of parity codes. ARQ consists in re-transmitting the original packets that are lost. Therefore, the sender needs to know the sequence number of the lost packet. This information may, for instance, be provided by the receiver by using the RTCP report packets. The principle of redundant data consists in re-transmitting, in each packet, information bits that have already been transmitted in the previous packet, in the same form or encoded at a lower bit rate. This mechanism is widely used for vital information such as picture headers in [8], and also for audio data [4]. FEC strategies for the Internet are often based on parity codes and block codes, such as the Reed-Solomon codes. One or more parity blocks over a group of k packets are generated by linearly independent combinations of data blocks, often by bit-wise exclusive-ORing of the k packets. The particular combination is called a parity code. After the parity operation, there is a total of n data plus parity blocks (i.e., n − k parity blocks). This mechanism can recover from up to n − k losses in an n-packet message. It increases the rate by a factor of n/k and adds latency. Reed-Solomon codes offer better protection than parity codes but at the expense of increased processing. An RS code takes a group of k data blocks and generates n − k FEC blocks. Compared with ARQ, when using parity data the sender needs only to know the packet loss probability (or maximum number of packets lost) but does not need to know their sequence numbers. This presents some advantages for evolutions towards multicast, the feedback being reduced from per-packet feedback to per-group-of-packets feedback.
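The simplest instance of the parity scheme described above is a single XOR parity block over k equal-length packets (so n = k + 1), which can repair one lost packet per group. The sketch below illustrates that case only; the function names and the convention of marking a lost packet with `None` are assumptions made for the example, and it is not the protection scheme used in the experiments.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bit-wise XOR of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def add_parity(data_blocks):
    """Append one XOR parity block to k data blocks (n = k + 1)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(received):
    """Rebuild the k data blocks from n received blocks.

    `received` holds the n blocks in order, with at most one entry
    replaced by None to mark a lost packet (hypothetical convention).
    """
    missing = [i for i, block in enumerate(received) if block is None]
    if len(missing) > 1:
        raise ValueError("a single parity block can repair only one loss")
    repaired = list(received)
    if missing:
        survivors = [block for block in received if block is not None]
        repaired[missing[0]] = xor_blocks(survivors)
    return repaired[:-1]  # drop the parity block


group = add_parity([b"abcd", b"efgh", b"ijkl"])
group[1] = None           # simulate the loss of the second packet
data = recover(group)
```

Note that the receiver only needs to know which position is empty; no per-packet retransmission request is sent, which is the property that makes FEC attractive for multicast.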

Besides the choice of error control strategy, controlling the amount of redundant information added at the source is a major concern. It can be based on feedback information about the loss process measured at the destination, i.e. using the QoS reporting mechanisms of RTCP.

2.3. Rate and congestion control mechanisms

Unicast congestion control schemes are typically designed as a feedback loop between a source and a receiver. In such a scheme the receiver is responsible for monitoring the network state, and sending periodic feedback information to the source. The latter uses this indication to compute a proper sending rate or to trigger some parameter tuning actions within its data encoding software (e.g. compression of raw video frames). The network state observation is to be smoothed, using either a sliding window or exponentially weighted moving averaging (EWMA), then coarsely quantized into two or three "network states", e.g. increase/decrease, or unloaded/loaded/overloaded. Coarse quantization requires one or two thresholds to be defined. Three-state schemes actually implement a "deadzone" feature which allows a more static steady-state behaviour. The network state characterization is then sent to the source via the feedback loop.
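The smoothing and quantization steps can be sketched as follows. This is an illustrative sketch under stated assumptions: the smoothing weight `alpha`, the two thresholds and the function names are hypothetical and are not values taken from any of the cited schemes.

```python
def ewma(samples, alpha=0.125):
    """Exponentially weighted moving average of a sequence of observations."""
    smoothed = []
    average = None
    for sample in samples:
        if average is None:
            average = sample                      # initialise on first sample
        else:
            average = (1 - alpha) * average + alpha * sample
        smoothed.append(average)
    return smoothed


def network_state(value, low, high):
    """Quantize a smoothed observation into three states.

    Values between `low` and `high` fall in the dead zone ("loaded"),
    where the source would keep its rate unchanged.
    """
    if value < low:
        return "unloaded"
    if value > high:
        return "overloaded"
    return "loaded"


loss_reports = [0.00, 0.00, 0.10, 0.20, 0.20, 0.00]
states = [network_state(v, low=0.02, high=0.05) for v in ewma(loss_reports, alpha=0.5)]
```

With a two-state scheme the source would oscillate between increase and decrease at every report; the dead zone between the two thresholds is what allows the rate to settle.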

Some schemes have been proposed which require the receiver to monitor the round-trip time (RTT) on the path between sender and receiver (e.g. [25]). The goal of focusing on a delay quantity is to trigger early reaction from the source, in order to avoid any packet loss. Since network queues building up in time of rising congestion increase the RTT, such schemes are supposed to react sufficiently early to relieve the congestion before it actually induces packet losses. There are however three drawbacks associated with RTT monitoring. First, asymmetry within the network may make the RTT measure somewhat inefficient. Next, delay thresholds have to be chosen to coarsely quantize the monitored RTT, and it is a difficult task to accurately select "mean delay" values that would result from standard network load. Finally, it has been proven that RTT-based schemes cannot easily compete with loss-based ones (e.g. TCP sessions) on a fair basis, because the former have a much more conservative behaviour than the latter [1].

Fig. 2. MPEG-4 video packet structure.

An alternative is to monitor packet losses, as TCP does. Several design choices may be made: the receiver may be asked to acknowledge incoming packets, either positively or negatively, or to compute a loss rate over some time interval. The latter may be calculated at the packet level or at a higher level (e.g. frames for video streams). Using RTP and its companion protocol RTCP makes it easy to feed the source with packet loss event reports.

A new trend has emerged, which emphasizes the "network citizen" behaviour of the congestion control scheme. A property named TCP-friendliness captures the characteristics of a "good" session with respect to TCP connections, that is, a behaviour which allows conforming sessions to fairly share the network. However, the underlying principle of additive increase/multiplicative decrease typically gives rise to sawtooth rate patterns, which may be in contrast with the QoS requirements of a video stream. Hence, broadly speaking, rate control schemes should avoid sawtooth rate patterns and rather aim at feeding the source with smooth rate indications, yet at the same time be almost as reactive as a typical TCP implementation, in order to maintain some fairness between traditional data exchanges and multimedia sessions.

3. MPEG-4 error resilient modes

The MPEG-4 video syntax provides support for a set of specific error resilient modes. When the error resilience mode is "on" ("error-resilience-disabled" flag set to "0"), a resync-marker is inserted by the encoder before the first macroblock after the number of bits output since the last resync-marker field exceeds a predetermined value. The marker spacing value is dependent on the anticipated error conditions of the transmission channel and on the compressed data rate. The compressed data included between two resync-markers is called a video packet. In order to make each video packet independently decodable, all predictively encoded information must be confined within a video packet so as to prevent the propagation of errors. However, depending on the initial setting of the resync-markers and on the variations of the transmission channel, the video packet may be larger than the size of one RTP packet, and may then be fragmented. This may adversely impact the loss resilience efficiency along with the overall network usage, since losing any fragment of a video packet renders the whole packet useless, yet the fragments that are not lost are still carried and processed by the network. Depending on the loss pattern, very high video packet loss rates may result from a moderate fragment loss rate.
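The amplification of the loss rate by fragmentation can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that fragment losses are independent with a fixed probability; as noted above, the actual figure depends on the loss pattern, and bursty losses behave differently. The function name is hypothetical.

```python
def video_packet_loss_rate(fragment_loss_rate, fragments):
    """Probability that a video packet is unusable, i.e. that at least
    one of its fragments is lost, assuming independent fragment losses."""
    return 1.0 - (1.0 - fragment_loss_rate) ** fragments


# A 5% fragment loss rate with video packets split into 4 RTP packets.
rate = video_packet_loss_rate(0.05, 4)
```

Under this assumption, a video packet carried in four fragments is lost about 18.5% of the time at a 5% fragment loss rate, which is why the marker spacing should be matched to the RTP packet size.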

As shown in Fig. 2, header information is also provided at the start of a video packet. This header contains information needed to restart the decoding process and includes the macroblock address (number) of the first macroblock contained in this packet and the quantization parameter (quant-scale) necessary to decode that first macroblock. The macroblock number provides the necessary spatial resynchronization while the quantization parameter allows the differential decoding process to be resynchronized. Following the quant-scale is the Header Extension Code (HEC). HEC is a single bit used to indicate whether additional information will be available in this header. If the HEC bit is set then the following additional information is available in this packet header: modulo time base, VOP-time-increment, VOP-coding-type, intra-dc-vlc-thr, VOP-fcode-forward, VOP-fcode-backward. When the Header Extension Code is set to "1", each video packet (VP) can be decoded independently. The information needed for decoding the VP is then included in the header extension code field. If the VOP header information is corrupted by a transmission error, it can be corrected by the HEC information. However, the above mechanisms turn out not to be sufficient under high loss conditions.

4. Conditional replenishment in an MPEG-4 compliant encoder

As a first step, the temporal prediction modes of an MPEG-4 compliant encoder are abandoned and replaced by a very simple conditional replenishment (CR) mechanism (Fig. 1). Note that the bitstream delivered is fully compliant with the VP structure (Fig. 2) and the MPEG-4 video syntax. The decoder is therefore strictly the same.

4.1. Motion detection

The motion detection aims at selecting the 16×16 pixel macroblocks to be refreshed in intra-mode. Similarly to [18], macroblocks are divided into 4×4 blocks. Let $B_{t-1} = (r_1, \ldots, r_n)$ be a reference block of pixels in the reference frame buffer, and $T$ the motion detection threshold. The macroblock containing the block of pixels $(x_1, \ldots, x_n)$ in the frame $t$ will be refreshed in intra-mode if and only if

$$\left| \sum_{i=1}^{n} (r_i - x_i) \right| > T. \qquad (1)$$

In order to reduce blocking artifacts, replenishment is also applied to the neighbouring macroblocks adjacent to the selected block. The threshold $T$ can be adjusted according to the motion in the scene. In our experiments, a threshold $T = 100$ turned out to be well adapted to high motion sequences.
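The test of Eq. (1) can be sketched in a few lines. This is a minimal illustration rather than the encoder's implementation: it assumes that frames are plain 2-D lists of luminance values whose dimensions are multiples of the block size, it reads "4×4 blocks" as blocks of 4×4 pixels, and it omits the replenishment of neighbouring macroblocks. All function names are hypothetical.

```python
def block_changed(reference, current, threshold):
    """Test of Eq. (1): |sum_i (r_i - x_i)| > T for one block of pixels."""
    return abs(sum(r - x for r, x in zip(reference, current))) > threshold


def block_pixels(frame, top, left, size):
    """Flatten the size x size block whose top-left corner is (top, left)."""
    return [frame[y][x]
            for y in range(top, top + size)
            for x in range(left, left + size)]


def macroblocks_to_refresh(ref_frame, cur_frame, threshold=100,
                           mb_size=16, block_size=4):
    """Return the (row, col) indices of macroblocks containing a changed block."""
    selected = set()
    for top in range(0, len(cur_frame), block_size):
        for left in range(0, len(cur_frame[0]), block_size):
            ref = block_pixels(ref_frame, top, left, block_size)
            cur = block_pixels(cur_frame, top, left, block_size)
            if block_changed(ref, cur, threshold):
                selected.add((top // mb_size, left // mb_size))
    return selected
```

Because the differences are summed before the absolute value is taken, brightening and darkening within the same block can cancel out; the detector is cheap but deliberately coarse.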

4.2. Results

The MPEG-4 Verification Model (without B frames) and the CR-based MPEG-4 compliant encoders have been used for encoding the CIF "coastguard" sequence at a constant bit-rate of 384 kbit/s. The frame rate of the source sequence is 10 frames/s. The MPEG-4 encoder is used in the rectangular mode, with INTRA refreshment periods of respectively 15 and 30 frames. The error resilient modes (described in Section 3) are enabled. To keep the bit-rate at a constant average value of 384 kbit/s, the VM5.1 SRC (scalable rate control) rate control algorithm [11] is used.

4.2.1. Experiment on error-free channels

Both encoders are first tested in an error-free environment, in order to compare their respective compression efficiency. Fig. 3 shows the PSNR of the decoded sequences for both codecs, as a function of the frame number. We observe that the PSNR curve of the CR system lies below that of MPEG-4 by an average of 2.65 dB, at the same bit-rate. This emphasizes the poor compression efficiency of CR, in comparison with a temporal prediction algorithm.

Fig. 3. Performances (PSNR) on error-free channels.
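For reference, the PSNR of a decoded frame is conventionally computed from the mean squared error against the original frame. The helper below is a generic sketch of that standard definition for 8-bit samples; it is not the measurement code used for the figures, and the function name and flat-list input format are assumptions.

```python
import math


def psnr(original, decoded, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-length
    sequences of pixel values (flattened frames)."""
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf               # identical frames
    return 10 * math.log10(peak ** 2 / mse)
```

A 2.65 dB gap corresponds to a mean squared error larger by a factor of about 10^(0.265), i.e. roughly 1.8.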


4.2.2. Experiments on finite state erasure channels

The current Internet is often modelled by a two-state Markov process, characterized by a "received" and a "lost" state. The transition probabilities from the "received" to the "lost" state and from the "lost" to the "received" state are denoted respectively as p and q. Measurements collected between INRIA in France and University College London in the UK, reported in [2], led to Elliott-Gilbert parameters p = 0.08 and q = 0.76. In our experiments, the p and q parameters have been set to 0.08 and 0.60 in order to simulate a channel with high packet loss rates. Fig. 4 depicts comparative PSNR results of the Conditional Replenishment and the MPEG-4 encoders. An Intra refreshment period of 15 frames is used in the MPEG-4 encoder. The major result of this experiment is that CR performs much better than differential coding in the presence of packet losses. These results are also noticeable on the decoded images. In conclusion, despite its low compression efficiency, CR appears to be better adapted to Internet video coding than a temporal predictive scheme. In the next section, a new coding method is presented, which tries to combine the advantages of both CR and MPEG-like coding systems.
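A loss trace for such a two-state channel can be generated as follows. This is a sketch for illustration only, not the simulator used in the experiments; the function name, the fixed seed and the choice of starting in the "received" state are assumptions.

```python
import random


def simulate_gilbert(p, q, packets, seed=0):
    """Return a list of booleans, True meaning the packet is lost.

    p: probability of moving from "received" to "lost";
    q: probability of moving from "lost" back to "received".
    """
    rng = random.Random(seed)
    lost = False                       # assumed initial state: received
    trace = []
    for _ in range(packets):
        if lost:
            lost = rng.random() >= q   # remain lost with probability 1 - q
        else:
            lost = rng.random() < p    # become lost with probability p
        trace.append(lost)
    return trace


trace = simulate_gilbert(0.08, 0.60, 200_000)
loss_rate = sum(trace) / len(trace)
expected = 0.08 / (0.08 + 0.60)        # long-run loss rate p / (p + q)
```

For the parameters of these experiments the long-run loss rate p/(p + q) is about 11.8%, and losses arrive in bursts whose mean length is 1/q, i.e. about 1.7 packets.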

5. Coding mode selection

As shown above, CR-based encoders provide a higher packet loss resilience, but at the expense of poor compression efficiency. The purpose of this section is to find a coding strategy that optimizes the trade-off between error robustness and compression efficiency. A solution, proposed in [35], consists in creating a dependence graph between macroblocks of consecutive frames. An algorithm assigns an importance measure to each macroblock of each frame, and selects a set of macroblocks to be encoded in intra mode, according to a given bit budget. However, this method is not adapted to real-time video coding, because it needs multiple-pass compression.

On the other hand, a coding mode selection is proposed in [34], for real-time video encoding on wireless channels. The method jointly optimizes macroblock coding modes and associated parameters (quantization parameters for instance), under a bit budget constraint. The purpose of this method is to exploit the numerous modes provided by the H.263 standard, in order to improve the rate-distortion performance of the source coding process. However, no channel characteristic is taken into account.

Fig. 4. Performances (PSNR) on finite state erasure channels (p = 0.08, q = 0.60).

This section describes a mode selection algorithm, based on a distortion measure which exploits the knowledge of channel characteristics. In addition, the coding mode optimization is preceded by a motion detection process (as used in CR), in order to reduce the number of macroblocks to be encoded, and hence the encoder complexity.

5.1. Principle

The strategy retained consists in combining Conditional Replenishment with an intra/inter-coding mode decision mechanism, as shown in Fig. 5. The decision process is based on a rate-constrained optimization procedure that takes into account the channel characteristics.

The syntax of the MPEG-4 video encoder used supports, for the P-frames, the following modes:

• INTRA: intra-coded,
• INTER16: inter-coded with one motion vector per macroblock,
• INTER4V: inter-coded with four motion vectors per macroblock,
• U-mode: uncoded.

The INTRA/INTER mode decision is based on a comparison between the variance of the luminance of the original MB and that of the prediction error. The coding mode in {INTER16, INTER4V} yielding the smallest sum of absolute differences (SAD) between the original and the motion-compensated macroblocks is chosen. U-mode is chosen when the motion vector found and the associated quantized prediction error are equal to zero.

Fig. 5. Coding mode selection mechanism.

Building upon the conditional replenishment scheme, the macroblocks to be encoded are clustered into groups of $N$ macroblocks (MB). For each macroblock $X_i$ to be refreshed, the mode selection algorithm chooses between the INTER mode (i.e. INTER16 or INTER4V, following the strategy of the MPEG-4 encoder as described above) and the INTRA mode. Let $I = \{\mathrm{INTRA}, \mathrm{INTER}\}$ be the set of possible coding modes for each MB of a group of MBs $X = \{X_1, \ldots, X_N\}$. A combination of coding modes for the group is an element $M = \{M_1, \ldots, M_N\} \in I^N$. The mode selection process aims at providing the best combination of coding modes, in the rate-distortion sense.
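To make the search space $I^N$ concrete, the sketch below enumerates every combination of modes for a small group and keeps the one with the lowest cost. It is a deliberately naive illustration and not the selection procedure of this paper: the Lagrangian cost D + λR is an assumption standing in for "best in the rate-distortion sense", the distortion and rate functions are supplied by the caller, and the toy per-mode values are invented for the example.

```python
from itertools import product

MODES = ("INTRA", "INTER")


def best_mode_combination(n_macroblocks, distortion, rate, lam):
    """Exhaustively search I^N for the combination minimizing D + lam * R.

    `distortion` and `rate` are hypothetical caller-supplied functions
    that take a whole combination (a tuple of N modes).
    """
    return min(product(MODES, repeat=n_macroblocks),
               key=lambda combo: distortion(combo) + lam * rate(combo))


# Toy, made-up per-macroblock costs: INTRA is robust but expensive.
toy_distortion = {"INTRA": 2.0, "INTER": 5.0}
toy_rate = {"INTRA": 10.0, "INTER": 3.0}


def total_distortion(combo):
    return sum(toy_distortion[m] for m in combo)


def total_rate(combo):
    return sum(toy_rate[m] for m in combo)


cheap_bits = best_mode_combination(3, total_distortion, total_rate, lam=0.1)
costly_bits = best_mode_combination(3, total_distortion, total_rate, lam=1.0)
```

The enumeration visits 2^N combinations, which is only tractable for small groups; it is shown here to fix ideas about the search space, not as a practical algorithm.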

5.2. Distortion metrics

The distortion metric often used in mode selection, as in [34], measures the distance between a macroblock and its reconstructed version after inverse quantization. This problem is re-formulated here by defining a global distortion measure, taking also into account the channel distortion.

5.2.1. Channel models

In [9], the channel is modelled by a Bernoulli process. Considering the loss or receiving states of consecutive packets as independent events, the packet loss rate is expressed by an average probability $P_e$. However, the best-effort Internet is often better modelled by finite state erasure channels and especially the Elliott-Gilbert channel [2]. The Elliott-Gilbert model is a two-state Markov process, as depicted in Fig. 6. The process is in state $R$ if packet $n$ at step $n$ has been received and in state $L$ otherwise. $p$ and $q$ are the transition probabilities between the two states. The average loss probability is $\bar{P}_e = p/(p+q)$. The state propagation across packets is modelled by the transition matrix

$$P = \begin{pmatrix} 1-p & p \\ q & 1-q \end{pmatrix}.$$

Note that the Elliott-Gilbert process is equivalent to a Bernoulli process if and only if $p + q = 1$. In addition, it can be shown that

$$\forall k \geq 1,\ \exists\, p_k, q_k \in\, ]0, 1[ \ \text{such that}\ P^k = \begin{pmatrix} 1-p_k & p_k \\ q_k & 1-q_k \end{pmatrix}.$$


Fig. 6. Elliott-Gilbert model (R: received, L: lost).

So, given a number k of successive transitions, the packet loss process will converge towards a Bernoulli process if $p_k + q_k \approx 1$. Therefore, according to the parameters k, p and q, the mode selection algorithm developed here will alternately adopt a Bernoulli process or an Elliott-Gilbert process for modelling the channel. Indeed, the value $\det(P^k) = (1-(p+q))^k$ is compared to a threshold $\epsilon > 0$. If $(1-(p+q))^k > \epsilon$, then the packet loss model chosen is a Gilbert process. If $(1-(p+q))^k < \epsilon$, it is a Bernoulli process.
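As an illustration, the model test above can be sketched in a few lines of Python. The function names, the handling of the equality case (treated as Bernoulli) and the example threshold are assumptions made for this sketch; it also assumes $0 < p + q < 1$, so that the determinant is positive. The closed form used for $p_k$ and $q_k$ is the standard k-step transition probability of a two-state Markov chain.

```python
from typing import Tuple


def k_step_transitions(p: float, q: float, k: int) -> Tuple[float, float]:
    """Return (p_k, q_k), the off-diagonal entries of P**k for the two-state chain."""
    r = 1.0 - (p + q)                 # second eigenvalue of P, so det(P**k) == r**k
    scale = (1.0 - r ** k) / (p + q)
    return p * scale, q * scale


def choose_channel_model(p: float, q: float, k: int, eps: float) -> str:
    """Compare det(P**k) with the threshold eps (assumes 0 < p + q < 1)."""
    det_pk = (1.0 - (p + q)) ** k
    # Assumption: equality with eps is treated as the Bernoulli case.
    return "gilbert" if det_pk > eps else "bernoulli"
```

With p = 0.08 and q = 0.60, the determinant is 0.32 for k = 1 and about 0.0034 for k = 5, so a hypothetical threshold of 0.01 would select the Gilbert model for short packet runs and the Bernoulli model for longer ones.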

5.2.2. Channel as a Bernoulli process

Let $P_e$ be the packet loss probability. Let $X_i^t$ be the original macroblock at spatial location i and frame number t. We call $\hat{X}_i^t$ the result of inverse quantization of $X_i^t$, and $\tilde{X}_i^t$ the concealed version of $\hat{X}_i^t$ at the decoder side when a packet loss occurs. The concealment method considered consists in replacing the current MB by the MB at the same spatial location in the previous frame. By adopting a Bernoulli process for modelling the channel, the distortion for an intra-coded macroblock can be expressed as

$$D(X_i^t, \mathrm{Intra}) = (1-P_e)\,|X_i^t - \hat{X}_i^t| + P_e\,|X_i^t - \hat{X}_i^{t-1}|, \tag{2}$$

where $|\cdot|$ denotes the mean-square error of a macroblock. Let us now consider the inter-coding mode. The distortion metric introduced covers the general case where a frame can be fragmented across several packets and, conversely, the case where several frames can be grouped into one packet. This depends on both the source rate and the maximum transfer unit (MTU) of the network. Let $k_i$ be the number of packets sent on the network and containing at least one macroblock at spatial location i since the last intra-macroblock (see Fig. 7). Let $\phi_i(l)$ be the number of occurrences of an MB at spatial location i in packet $l \in [0, k_i - 1]$. For example, in Fig. 7, we have $k_i = 2$, $\phi_i(0) = 1$ and $\phi_i(1) = 3$. Let $\Phi_i$ be the function defined by

$$\forall l \in [0, k_i - 1], \quad \Phi_i(l) = \sum_{j=0}^{l} \phi_i(j).$$

The function $\Phi_i$ captures the number of macroblocks at spatial location i contained in the last l packets transmitted on the network, among the $k_i$ packets considered here. To simplify the expression of the distortion metric, we only consider prediction modes with motion vectors equal to zero. Let $m_{t-1}^{t}$ be the prediction error between the macroblocks $\hat{X}_i^{t-1}$ and $X_i^t$, at spatial location i in frames t-1 and t.

We suppose that the last intra-coded macroblock at spatial location i, denoted MBI in Fig. 7, has been received. For the situation of Fig. 7, the distortion metric for an inter-coded macroblock is given by

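$$\begin{aligned}
D(X_i^t, \mathrm{Inter}) ={}& (1-P_e)^2\,|X_i^t - \hat{X}_i^t| + (1-P_e)\,P_e\,|X_i^t - \hat{X}_i^{t-\Phi_i(0)}| \\
&+ P_e \Big[ P_e\,|X_i^t - \hat{X}_i^{t-\Phi_i(1)}| + (1-P_e)\,\Big| X_i^t - \hat{X}_i^{t-\Phi_i(1)} - \sum_{j=t-\Phi_i(0)}^{t-1} m_j^{j+1} \Big| \Big].
\end{aligned}$$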

Implementing an accurate distortion measure which would take into account all the possible loss cases, for each macroblock i in the image, would require storing all the macroblocks containing the prediction error signals $m_j^{j+1}$ transmitted since the last intra-coded macroblock at the same spatial location i. In order to reduce the implementation complexity, an approximate distortion measure is used, based on the assumption that the difference between two macroblocks at the same spatial location i increases with their temporal distance. More precisely, the distortion terms associated with loss events are approximated by the upper bound $|X_i^t - \hat{X}_i^{t_I}|$, where $\hat{X}_i^{t_I}$ is the inverse quantized version of the last intra-MB at location i. This assumption leads to

$$D(X_i^t, \mathrm{Inter}) \leq (1-P_e)^{k_i}\,|X_i^t - \hat{X}_i^t| + \left(1 - (1-P_e)^{k_i}\right) |X_i^t - \hat{X}_i^{t_I}|. \tag{3}$$


Fig. 7. Correspondence between video compressed stream and packet numbers.

The implementation of this distortion metric is quite simple: each time an MB is intra-coded, we store its value $\hat{X}_i^{t_I}$ in a reference frame buffer. The value $k_i$ of relation (3) for each macroblock is updated each time a video packet is formed.
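The two Bernoulli-channel metrics, Eqs. (2) and (3), can be sketched as follows. This is a minimal illustration rather than the encoder's code: macroblocks are represented as flat sequences of pixel values, and all function and argument names are hypothetical.

```python
def mse(a, b):
    """Mean-square error between two equally sized sequences of pixel values."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def intra_distortion(x, x_hat, x_hat_prev, p_e):
    """Eq. (2): expected distortion of an intra-coded macroblock.

    x          -- original macroblock X_i^t
    x_hat      -- its inverse-quantized version
    x_hat_prev -- co-located macroblock of the previous frame (concealment)
    """
    return (1.0 - p_e) * mse(x, x_hat) + p_e * mse(x, x_hat_prev)


def inter_distortion_bound(x, x_hat, x_hat_intra, p_e, k_i):
    """Eq. (3): upper bound on the expected distortion of an inter-coded macroblock.

    x_hat_intra -- stored inverse-quantized version of the last intra MB at location i
    k_i         -- number of packets carrying location i since that intra MB
    """
    w = (1.0 - p_e) ** k_i   # probability that all k_i packets were received
    return w * mse(x, x_hat) + (1.0 - w) * mse(x, x_hat_intra)
```

The bound grows with $k_i$: the longer a location has gone without an intra refresh, the more weight is given to the distance to the stored intra reference, which pushes the mode decision towards INTRA.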

5.2.3. Channel as an Elliott-Gilbert process

If we take into account the memory of the Elliott-Gilbert model, the distortion expression is the same as in Eq. (2) for the intra-coding mode, with $P_e$ replaced by the average loss probability $\bar{P}_e = p/(p+q)$. In inter mode, the distortion measure can be developed by adopting an approach similar to that used for the Bernoulli model.

The difference is that $k_i$ is redefined as the total number of packets sent on the network since the last intra-macroblock at spatial location i, and we use the transition probabilities of the Elliott-Gilbert process. In the same way as with the Bernoulli model and with the same approximation, we have

$$D(X_i^t, \mathrm{Inter}) \leq (1-p)^{k_i}\,|X_i^t - \hat{X}_i^t| + \left(1 - (1-p)^{k_i}\right) |X_i^t - \hat{X}_i^{t_I}|. \tag{4}$$

The value $k_i$ for each macroblock is maintained almost in the same way as before (Section 5.2.2), except that it accounts for the total number of packets since the last intra-coded macroblock.
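The two definitions of $k_i$ differ only in which packets are counted, which the following sketch makes explicit. The class and method names are hypothetical, and the sketch assumes that the packet carrying the intra macroblock itself is not counted (the caller resets the counters after that packet has been formed).

```python
class PacketCounters:
    """Per-location packet counts since the last intra-coded macroblock."""

    def __init__(self, num_locations):
        # Bernoulli model: packets containing at least one MB at location i.
        self.k_bernoulli = [0] * num_locations
        # Elliott-Gilbert model: all packets sent, whatever they contain.
        self.k_gilbert = [0] * num_locations

    def on_packet(self, locations):
        """Call each time a video packet is formed; `locations` lists the MB positions it carries."""
        present = set(locations)
        for i in range(len(self.k_gilbert)):
            self.k_gilbert[i] += 1
            if i in present:
                self.k_bernoulli[i] += 1

    def on_intra(self, i):
        """Call when the MB at location i has been intra-coded (assumed reset point)."""
        self.k_bernoulli[i] = 0
        self.k_gilbert[i] = 0
```

Since the Elliott-Gilbert counter is never smaller than the Bernoulli one, the weight given to the intra reference in Eq. (4) tends to build up faster for locations that are rarely transmitted.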

5.3. Mode selection

Given the definitions in Section 5.1, if the rate allocated to the current frame is $R_{frame}$, then the rate allocated to each GOB is proportional to $R_{frame}$ and to the length of the GOB:

$$R_c = a \times R_{frame} \times \frac{Nb_{GOB}}{Nb_{frame}}, \tag{5}$$

where $Nb_{GOB}$ and $Nb_{frame}$ represent respectively the number of macroblocks in the considered GOB and the total number of macroblocks to refresh in the current frame. The choice of parameter a is discussed in Section 5.4. The problem of finding the best combination of coding modes for the subset of macroblocks X consists in finding

$$\min_M D(X, M) \quad \text{s.t.} \quad R(X, M) \leq R_c,$$

where D and R are the distortion and rate measures and $R_c$ is the rate constraint for the GOB X. The distortion measure is given by expression (3) or (4), according to the test made on the channel model as described in Section 5.2.1. The above rate-constrained optimization problem is rewritten as an unconstrained Lagrangian formulation. The distortion measure being additive, the Lagrangian cost function becomes

$$J(X, M) = \sum_{i=1}^{N} J(X_i, M) = \sum_{i=1}^{N} \left[ D(X_i, M) + \lambda R(X_i, M) \right].$$

The algorithm then searches for the set M of coding modes minimizing the above cost function,

$$\min_M J(X, M). \tag{6}$$

The minimization of this functional can be carried out with the following steps [21,28]:

Initialization: Two values of $\lambda$, $\lambda_l$ and $\lambda_u$, are chosen such that $\lambda_l \leq \lambda_u$. The minimization of $J_{\lambda_l}$ gives the rate $R_l$ and distortion $D_l$ parameters. Similarly, the minimization of $J_{\lambda_u}$ gives the parameters $R_u$ and $D_u$. $R_u$ and $R_l$ must verify

$$R_u \leq R_c \leq R_l. \tag{7}$$

This condition requires a careful choice of $\lambda_l$ and $\lambda_u$. In practice, we take $\lambda_l = 0$ (minimizing the distortion without rate constraint) and $\lambda_u = \infty$ (minimizing the bit-rate). If the constraint (7) is met with equality for one of the two values, we have an exact solution. Otherwise the algorithm proceeds with the following steps:
1. Minimize $J_\lambda$ where $\lambda = (D_l - D_u)/(R_u - R_l) + \epsilon$, $\epsilon$ being a given real number. This provides a rate R and a distortion D.
2. If $R = R_u$, the algorithm is over. If $R > R_u$, do $\lambda_l := \lambda$ and go to step (1). Otherwise do $\lambda_u := \lambda$ and go to step (1).
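A minimal sketch of such a Lagrangian search is given below. It is not the encoder's implementation: the data layout (one dictionary of mode to (distortion, rate) per macroblock), the names, the iteration cap and the integer rates are assumptions. In particular, when updating the bracket the sketch compares the rate R with the budget $R_c$, as in the usual bisection form of this search; that choice is an assumption of the sketch.

```python
import math


def minimize_lagrangian(costs, lam):
    """Minimize sum(D + lam * R) over modes, independently for each macroblock.

    costs[i] maps a mode name to a (distortion, rate) pair for macroblock i.
    Returns (modes, total_distortion, total_rate).
    """
    modes, d_tot, r_tot = [], 0.0, 0
    for options in costs:
        if math.isinf(lam):
            # lambda = infinity: minimize the rate, break ties on distortion
            key = lambda m: (options[m][1], options[m][0])
        else:
            # finite lambda: minimize D + lam * R, break ties on rate
            key = lambda m: (options[m][0] + lam * options[m][1], options[m][1])
        best = min(options, key=key)
        modes.append(best)
        d_tot += options[best][0]
        r_tot += options[best][1]
    return modes, d_tot, r_tot


def select_modes(costs, r_c, eps=1e-6, max_iter=50):
    """Search for the mode combination of lowest distortion with total rate <= r_c."""
    m_l, d_l, r_l = minimize_lagrangian(costs, 0.0)
    m_u, d_u, r_u = minimize_lagrangian(costs, math.inf)
    if r_l <= r_c:            # the unconstrained optimum already fits the budget
        return m_l
    if r_u >= r_c:            # the cheapest combination is all that can be sent
        return m_u
    for _ in range(max_iter):
        lam = (d_l - d_u) / (r_u - r_l) + eps
        m, d, r = minimize_lagrangian(costs, lam)
        if r == r_u:          # no new operating point: stop
            break
        if r > r_c:           # over budget: tighten the lower bracket
            m_l, d_l, r_l = m, d, r
        else:                 # within budget: tighten the upper bracket
            m_u, d_u, r_u = m, d, r
    return m_u
```

Because the cost is additive, each minimization of $J_\lambda$ reduces to an independent choice per macroblock, so one iteration costs only N comparisons per candidate mode.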

5.4. Rate control mechanisms

5.4.1. Rate control in the MPEG-4 verification model

The rate control algorithm used in our video encoder is the VM5.1 SRC presented in [10]. This rate control process uses an output buffer and assumes that the encoder rate-distortion function can be modelled by

$$R = X_1 S Q^{-1} + X_2 S Q^{-2}, \tag{8}$$

where R is the encoding bit count and S is the sum of absolute differences between the original current macroblock and the previous reconstructed macroblock for a P-macroblock. The variable Q represents the quantization parameter, and $X_1$ and $X_2$ are the rate control modelling parameters. The rate control mechanism consists of four main steps:
1. Initialization of parameters $X_1$ and $X_2$.
2. Computation of the target bit-rate before encoding, based on the available bits, the last encoded frame bits and the buffer status.
3. Computation of the quantization parameter Q before encoding, based on the rate-distortion function (8) of the source encoder.
4. Updating of the rate-distortion function given by Eq. (8).
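Step 3 amounts to solving Eq. (8) for Q once the target bit count is known. The sketch below shows this inversion only; the function name, the rounding and the clamping range [1, 31] are assumptions of the sketch, and the update of $X_1$ and $X_2$ (step 4) is not shown.

```python
import math


def quantizer_for_target(r_target, sad, x1, x2, q_min=1, q_max=31):
    """Solve r_target = x1*sad/Q + x2*sad/Q**2 for Q > 0, then round and clamp.

    The clamping range [q_min, q_max] is an assumption of this sketch.
    """
    if r_target <= 0:
        return q_max                      # no bits available: coarsest quantizer
    b = x1 * sad
    c = x2 * sad
    disc = b * b + 4.0 * r_target * c     # discriminant of r*Q^2 - b*Q - c = 0
    if disc < 0:
        q = b / r_target                  # fall back to the first-order model
    else:
        q = (b + math.sqrt(disc)) / (2.0 * r_target)
    return min(q_max, max(q_min, round(q)))
```

For instance, with the hypothetical values $X_1 = 100$, $X_2 = 400$ and S = 50, a target of 937.5 bits gives Q = 8, since 5000/8 + 20000/64 = 937.5.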

5.4.2. Choice of parameter a

The rate control presented in the previous section attempts to achieve optimal quality for a given target bit-rate. Hence, it chooses quantization parameters as fine as possible, according to the available bit-rate.

As a result, if a = 1 in Eq. (5), then the number of macroblocks for which the intra mode is selected is very low, because of the bit-rate cost of intra-coding when the quantizer parameters are fine.

A choice of a > 1 in Eq. (5) makes it possible to counterbalance the VM5.1 SRC rate control process. The parameter a turns out to have a strong influence on the number of intra-coded macroblocks. Hence, it provides a trade-off between the number of intra-MBs selected by the mode selection process and the fineness of the quantization parameters, adjusted by the VM5.1 SRC rate control. This trade-off can be found by tuning a according to the channel characteristics, for instance the average loss probability. We therefore define a as a function of $P_e$ through the relation

$$a = \frac{1}{1 - P_e}. \tag{9}$$

As a result, the whole rate control process is made of two separable steps that counterbalance each other.
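Eqs. (5) and (9) combine into a one-line rate allocation per GOB, sketched here with hypothetical function and argument names.

```python
def a_from_loss(p_e):
    """Eq. (9): scaling parameter a as a function of the average loss probability."""
    return 1.0 / (1.0 - p_e)


def gob_rate(r_frame, nb_gob, nb_frame, p_e):
    """Eq. (5): rate constraint R_c for a GOB of nb_gob macroblocks.

    r_frame  -- rate allocated to the current frame
    nb_frame -- total number of macroblocks to refresh in the frame
    """
    return a_from_loss(p_e) * r_frame * nb_gob / nb_frame
```

With the transition probabilities used in Section 5.5 (p = 0.08, q = 0.60), the average loss probability is 0.08/0.68, which gives a = 0.68/0.60, i.e. a rate constraint about 13% above the purely proportional share.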

Work is under way to jointly optimize both quantization parameters and coding modes in the rate-distortion sense, as in [12,13,34].

5.5. Results

The MPEG-4 compliant encoder described above has been used for encoding the "Coastguard" and "News" sequences, with the same coding configuration as in Section 4.2. In the presence of packet losses, the mode selection algorithm uses the same Elliott-Gilbert transition probabilities as in Section 4.2, i.e. p = 0.08 and q = 0.60. Note that no I frame is used in the mode selection encoder, except for the first frame of the sequence. Only P frames with intra-macroblocks are encoded.

5.5.1. Experiments on error-free channels

The three encoders, namely the MPEG-4 Verification Model, the CR-based encoder, and the MPEG-4 compliant encoder incorporating the Channel Adaptive Mode Selection mechanism, are first tested on an error-free channel. From now on, the three encoders will be referred to as the MPEG-4, CR and CAMS encoders, respectively.

Fig. 8 depicts the PSNR as a function of the frame number for each encoder. We observe that, on average, the CAMS encoder performs better than CR when no packet loss occurs. Its rate-distortion performance remains slightly lower than that of the MPEG-4 VM. This is due to the use of the threshold T of relation (1), which is currently not adapted to the channel characteristics. The lower PSNR values are due to macroblocks that are not selected by the motion detection process.

Fig. 8. Performances (PSNR) on error-free channels.

5.5.2. Experiments on finite state erasure channels

Considering the same channel characteristics as in Section 4.2, Fig. 9 depicts the PSNR values obtained. When losses occur, the MPEG-4 curve falls below the two others for both sequences. The CR encoder's curve is the most stable one for the "Coastguard" sequence in the presence of packet losses and high motion (first part of the "Coastguard" sequence). However, when the amount of motion is reasonable, the CAMS encoder provides better PSNR values than CR when no channel error occurs, and remains comparable to it in the presence of packet losses.

Fig. 9. Performances (PSNR) on finite state erasure channels.

Overall, the mode selection scheme appears to provide a good trade-off between compression efficiency and packet loss resilience. Taking into account the channel statistical model and the scene activity through the optimization process therefore improves the efficiency of video transmission across a finite state erasure channel.

Fig. 10 depicts the PSNR values obtained as a function of the frame number, when considering distortion metrics based exclusively on the Bernoulli or the Elliott-Gilbert model, for both an error-free and a finite state erasure channel. The use of the Elliott-Gilbert model leads to a higher number of intra-coded macroblocks. This explains the higher stability, in the presence of packet losses, of the PSNR obtained with the Elliott-Gilbert model. The curve obtained with the Gilbert model recovers from a packet loss faster than the other one. Hence, using the Elliott-Gilbert model in the distortion measure of the coding mode selection process provides a higher resilience to packet losses.

6. Rate and congestion control

The adaptive mode selection mechanism described above improves the trade-off between packet loss robustness and compression efficiency but does not avoid losses. A rate control mechanism based on congestion control and rate-predictive models is developed here in order to minimize the amount of losses by matching the video source bandwidth requirement to the available network capacity.

Fig. 10. Performances (PSNR) obtained when using exclusively the Bernoulli or Elliott-Gilbert model.

Given our loss resilient coding scheme, we now turn to the congestion control issue, and propose a rate control scheme dedicated to video communication, the behaviour of which is compatible with standard (i.e. TCP) communications.

6.1. TCP throughput models

Early work in this area has shown that the stationary throughput of a saturated TCP sender (i.e. one with an infinite amount of bytes to send) is of the order of the inverse of the square root of the loss rate observed by the connection [16]. In the following we consider the so-called MF model, due to Mahdavi and Floyd [15]. Another model of interest is the PFTK model, which was proposed by Padhye et al. [19]. Mahdavi and Floyd propose in [15] the equation

$$BW = \frac{1.22 \times MTU}{RTT \times \sqrt{Loss}} \tag{10}$$

to compute the bandwidth that a TCP connection receives, provided it has MTU bytes per packet and incurs a round-trip time of RTT seconds and a loss rate of Loss. This equation comes from a steady-state analysis of the TCP congestion avoidance mechanism. Such a model is known to be valid for loss rates up to about 15%. In the following, we consider using this model in a rate prediction process designed to feed the source's rate control mechanism.
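Eq. (10) translates directly into code. In the sketch below the function name is hypothetical, and returning infinity for a null loss rate is a choice made here to make the unbounded case explicit.

```python
import math


def mf_throughput(mtu_bytes, rtt_s, loss):
    """Eq. (10): TCP-friendly bandwidth estimate, in bytes per second.

    Returns infinity when no loss is observed, since the model is then unbounded.
    """
    if loss <= 0.0:
        return math.inf
    return 1.22 * mtu_bytes / (rtt_s * math.sqrt(loss))
```

For example, with 576-byte packets, a 100 ms round-trip time and a 1% loss rate, the model predicts about 70 272 bytes/s, i.e. roughly 562 kbit/s.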

6.1.1. Parameter estimation

Implementing a throughput model requires knowledge of the connection's MTU, RTT and loss rate. The MTU can be either set to a fixed value, for instance the standard minimum value of 576 bytes defined for TCP, or determined using an MTU discovery algorithm [32]. In our experiments, we used fixed length packets which allow for a constant MTU, set to 576 bytes.

The RTT estimation is performed by keeping a recent average value. Following TCP's RTT smoothing strategy, we implemented an exponential filter with a constant 0.9 smoothing factor [29]. This filter is fed with RTT measures conducted by a very simple and robust (stateless) request/reply protocol, which runs as a background process. This protocol periodically sends RTT requests carrying the host's local time, and delivers a new RTT measure upon reception of the RTT reply, which carries the same, unchanged time value. These requests are sent once per RTT. The smoothed RTT estimation is hereafter denoted by $S_{RTT}$.
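The RTT measurement and smoothing can be sketched as follows. The names are hypothetical, and the sketch assumes that the 0.9 factor weights the previous estimate (as in TCP's classical smoothing) and that the first sample initializes the filter.

```python
def rtt_from_reply(echoed_time, now):
    """RTT sample: local time at reply reception minus the echoed request time."""
    return now - echoed_time


class RttEstimator:
    """Exponential filter producing the smoothed RTT, S_RTT."""

    def __init__(self, gain=0.9):
        self.gain = gain      # weight of the previous estimate (assumed)
        self.srtt = None

    def update(self, sample):
        if self.srtt is None:
            self.srtt = sample                       # first measure (assumed)
        else:
            self.srtt = self.gain * self.srtt + (1.0 - self.gain) * sample
        return self.srtt
```

Because the request carries the sender's own clock value, the receiver of the request keeps no state and no clock synchronization between the two hosts is needed.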

The loss rate estimation is performed by observing packet losses, detected using sequence numbers, and again keeping a recent average value. The averaging operation is somewhat problematic in this case, because network conditions may be such that the loss rate drops to zero at a given time. Using an exponential filter, whatever the smoothing factor is set to, implies that past losses keep being taken into account for quite some time. This generates a "residual" loss rate which may trigger extremely high rate predictions from the throughput model as it tends to zero. Instead we use a time-based sliding window, the width of which is set to a 30 RTT time interval, as determined through simulations in [31].
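A time-based sliding window of this kind might look as follows. This is a simplified sketch with hypothetical names: losses are inferred from gaps in the sequence numbers, late (reordered) packets are simply counted as received, and the caller supplies the window width (e.g. 30 times the current smoothed RTT).

```python
from collections import deque


class LossRateWindow:
    """Loss rate over a sliding time window, from received sequence numbers."""

    def __init__(self):
        self.events = deque()     # (arrival_time, packets_lost_before_this_one)
        self.last_seq = None

    def on_packet(self, seq, now, window):
        """Record a received packet and drop events older than `window` seconds."""
        if self.last_seq is None:
            lost = 0
            self.last_seq = seq
        else:
            lost = max(0, seq - self.last_seq - 1)   # gap in sequence numbers
            self.last_seq = max(seq, self.last_seq)
        self.events.append((now, lost))
        while self.events and self.events[0][0] < now - window:
            self.events.popleft()

    def loss_rate(self):
        received = len(self.events)
        lost = sum(n for _, n in self.events)
        total = received + lost
        return lost / total if total else 0.0
```

Unlike an exponential filter, the estimate returns exactly to zero once the last loss has left the window, so no residual loss rate is fed to the throughput model.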

6.2. Architecture

Our purpose is to make use of the TCP throughput model to regulate the output rate of our video codec. The envisioned architecture is as follows. In addition to the forward data stream, a backward control stream is set up between the source and the receiver. The control stream is made of periodic feedback packets, which carry explicit rate information used by the source application to regulate its data output. In this architecture the throughput prediction model is located at the receiver side, and the results of its computations are periodically sent back to the source. An alternative architecture could be designed, where the parameters required by the throughput model would be carried inside feedback packets to the source, the latter implementing the throughput model computations. An important question relates to the feedback frequency, as too frequent estimations may well overload the network with control packets, whereas too few feedback indications may render the regulation scheme inefficient [22]. According to [15], a time interval of at least one RTT is required between two consecutive actions, so as to take into account the impact of the former one. In order not to overload the network when the RTT is small, we choose in the rest of this section to set the feedback period to the maximum of one second and one RTT.

6.3. New rate prediction model

An issue when investigating an MF-based regulation is its so-called non-self-limiting behaviour. As opposed to TCP's congestion control mechanisms, which obey an upper rate bound as a consequence of the maximum congestion window size, the MF model may potentially compute infinite rate predictions, typically when the loss rate is null. The non-self-limiting behaviour of most rate-based congestion control schemes is classically answered by implementing timers, which expire in the absence of feedback indication, thus forcing the source to lower its rate [22]. In the present case however, an unlimited rate may occur as a result of a valid feedback indication, hence another solution has to be designed.

Hereafter we use a hybrid rate control principle designed to retain the desirable properties of TCP-friendly regulation and to address these non-self-limiting problems. Its basic idea is to perform lossless rate adaptation whenever possible, using an RTT-based control loop, yet to embed a TCP-friendly rate prediction model which comes into play under lossy conditions. The usual purpose of RTT-based rate control is to allow early reaction to congestion, thereby avoiding packet losses. A classical scheme is to compare RTT observations against a given RTT threshold value, and to vary the source rate by performing additive increase when the last RTT measure is under the threshold, and multiplicative decrease when the observed RTT exceeds the threshold. Our approach retains the threshold mechanism, but differs from the previous one in the way the source rate is generated. The RTT-based control mechanism uses the smoothed RTT measure already computed for the TCP throughput model as a basis to compute the RTT threshold. More precisely, the RTT threshold $T_{RTT}$ is set to $S_{RTT} + k \times$ standard deviation($S_{RTT}$).

The receiver maintains an averaged measure R of its received rate. If the last RTT observation is above the threshold, then it generates an RTT-predicted rate $P_{RTT}$ equal to R - K, where K is a constant rate increment. If the observed RTT is below the threshold, then the RTT-predicted rate $P_{RTT}$ is set to R + K. In the subsequent experiments we take k = 0.9 and K = 1 kb/s. The received rate R is smoothed by a time-based sliding window, the width of which is tuned to 30 RTT, according to the window size computation method proposed in [31]. We denote by $P_{TCP}$ the rate predicted by the TCP throughput model. The actual mixed prediction $P_{MIX}$ is computed as follows:
• if $P_{TCP}$ is valid (that is, not infinite), let $P_{MIX} = (P_{TCP} + P_{RTT})/2$;
• otherwise, let $P_{MIX} = P_{RTT}$.

The computation is performed once per regulation round, the period of which is set to the last smoothed RTT measure.
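One regulation round of the hybrid scheme can be sketched as follows. The function and argument names are hypothetical, rates are expressed in kb/s, and the case where the observed RTT equals the threshold (not specified in the text) is treated here as "below".

```python
import math


def rtt_threshold(srtt, rtt_std, k=0.9):
    """T_RTT = S_RTT + k * standard deviation of the RTT."""
    return srtt + k * rtt_std


def mixed_prediction(last_rtt, srtt, rtt_std, received_rate, p_tcp, k=0.9, step=1.0):
    """Return P_MIX for one regulation round (rates in kb/s).

    received_rate -- averaged received rate R
    p_tcp         -- rate predicted by the TCP throughput model (may be infinite)
    step          -- constant rate increment K
    """
    if last_rtt > rtt_threshold(srtt, rtt_std, k):
        p_rtt = received_rate - step      # RTT above threshold: back off
    else:
        p_rtt = received_rate + step      # RTT at or below threshold (assumed): probe upwards
    if math.isfinite(p_tcp):
        return (p_tcp + p_rtt) / 2.0      # lossy conditions: average both predictions
    return p_rtt                          # loss-free conditions: RTT loop only
```

In loss-free conditions the rate thus moves by at most K per round, which keeps the control self-limiting even though the throughput model itself is unbounded.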

In this scheme, the TCP throughput-based prediction acts as a "master controller" with respect to the RTT-based prediction. This behaviour comes from the use of the actually received rate as a basis for RTT-based prediction. Since the rate actually received relates to the previous feedback indications, the RTT-based prediction is not supposed to impact it by a large amount, but rather to closely follow it. Hence, under lossy conditions, the TCP throughput-based prediction takes a major role in the overall rate prediction.

If we indeed consider that $P_{RTT}$ at a given regulation round is equal to $P_{MIX}$ as computed during the previous round (not taking into account packet losses and the constant K), then, assuming a constant $P_{TCP}$, we have a geometric series $\{P_{MIX}\}$ which converges to $P_{TCP}$.
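The convergence can be checked in a few lines. Writing $P_{MIX}^{(n)}$ for the mixed prediction at round n (a notation introduced here for illustration only) and using the simplification stated above, $P_{RTT}^{(n+1)} = P_{MIX}^{(n)}$:

```latex
\begin{align*}
P_{MIX}^{(n+1)} &= \tfrac{1}{2}\left(P_{TCP} + P_{MIX}^{(n)}\right), \\
P_{MIX}^{(n+1)} - P_{TCP} &= \tfrac{1}{2}\left(P_{MIX}^{(n)} - P_{TCP}\right), \\
P_{MIX}^{(n)} - P_{TCP} &= 2^{-n}\left(P_{MIX}^{(0)} - P_{TCP}\right) \xrightarrow[n \to \infty]{} 0 .
\end{align*}
```

Under these assumptions the gap to $P_{TCP}$ is halved at every regulation round, whatever the starting rate.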

6.4. Experimentation results

In [30], a TCP-friendly rate control, based on a pure MF model, is used to regulate a video source. Although the authors avoid the typical sawtooth rate variations encountered with a pure MF regulation [31], they do so by estimating the model parameters on a long-term basis, and thus cannot expect early reaction to network variations. Our purpose is to improve the bandwidth usage and video quality stability at the receiver, through a decreased amount of packet losses, and yet have a highly reactive scheme in order to be actually "TCP-friendly" with respect to classical data transmissions.

6.4.1. Rate control within the video encoder

The rate control algorithm in the video encoder has been modified so as to adapt the source to the feedback information received. This information corresponds to the newly available network bandwidth. Some parameters of the rate control process [11] are recalculated:
• the bit budget available to encode the current GOP,
• the size of the buffer,
• the amount of bits to remove from the buffer at each frame.
Quantization parameters of the encoder are then adjusted using these new values, in the same way as in [11].

6.4.2. Network topology

Due to the non-real-time nature of our experimental video codec, we performed rate control experiments over the Internet using a simulated codec in order to get trace files, and then we used these files off-line as an input to the codec so as to assess the impact of packet losses and rate control on the quality of the video sequence.

The topology we experimented over is of the wide-area network (WAN) kind, with the video source located at INRIA Sophia and the receiver located at INRIA Rennes (Fig. 11). The path between source and receiver thus crosses two regional networks (Ouest-Recherche and R3T2) and the French nation-wide R.N.I. interconnection network. It consists of a dozen routers. In addition, a packet source was used at ENST Bretagne in order to generate cross-traffic within the regional network, thus adding to congestion.


Fig. 11. WAN topology.

Fig. 12. Consecutive packet losses without TCP-friendly rate control.

6.4.3. Rate control results

Figs. 12 and 13 depict the packet loss patterns we obtained from a typical experimentation round on the aforementioned topology. The initial rate was set to 384 kbit/s, and we traced the packet losses with the TCP-friendly rate control process disabled (Fig. 12) or enabled (Fig. 13). In the former case, CBR rate control is performed. As can easily be seen in the figures, using the TCP-friendly rate control feature drastically reduces the occurrence of packet losses. This is however at the expense of the transmission rate, as the TCP-friendly rate control mechanism triggers a large decrease in the allowed rate through the feedback indications, depicted in Fig. 14.

The feedback indications were then used to drive our video source, while the packet loss traces were used to simulate the real channel. Fig. 15 depicts the PSNR of the resulting decoded sequence as a function of the frame number. The major result we can observe is that the mean PSNR values obtained are comparable for the two versions (with and without TCP-friendly rate control), despite the large difference in allowed rate. TCP-friendly rate control also yields a more stable PSNR curve (starting from frame number 75). This is explained by the sensitivity of temporal prediction coding to packet losses. Encoding at a bit rate adapted to the link, thus leading to few losses, turns out to be more efficient than using a higher bit rate which yields a higher loss rate.

Fig. 13. Consecutive packet losses with TCP-friendly rate control.

Fig. 14. Rate constraint provided by the receiver to the source.

Fig. 15. PSNR values with constant bit rate and source rate adapted to TCP-friendly prediction.

7. Perspectives

7.1. Multicast communications

Since 1992, when the Mbone (Multicast Backbone) was first introduced, multicast communications across the Internet have become the focus of a great number of studies. Video multicasting across a best-effort Internet is an especially interesting topic as it faces heterogeneity issues. In a multicast topology (multicast delivery tree in the 1→N case, acyclic graph in the M→N case), network conditions such as loss rate and queuing delays are not homogeneous in the general case. Rather, there may be local congestion affecting downstream delivery of the video stream in some branches of the topology. Furthermore, receiver heterogeneity must also be considered, as real-time video decoding and display are tied to the hardware and software performance of each particular receiver. Recent work in this area has shed light on the benefits of subband coding and multichannel multicasting in the framework of heterogeneous bandwidth availability: by delivering subbands through distinct network channels (e.g. RTP sessions) and relying on standard group management protocols to join/leave these channels, it is possible for a given receiver to dynamically adapt the amount of video data it receives to the available bandwidth (see for instance the Receiver-driven Layered Multicast proposal [17]).

The issue of packet loss is also the subject of intense study. On the one hand, reliable multicasting has been proposed with a variety of mechanisms making use of selective and hierarchical acknowledgement and retransmission schemes [14,20]. On the other hand, packet losses within a real-time flow (e.g. video) can be dealt with using FEC strategies tailored to the multicast framework. It is, for instance, possible for a video source to send separate FEC channels that can be joined by receivers experiencing a high loss rate [23].

7.2. Feedback in multicast

The use of feedback schemes in a multicast scenario faces two major issues. The first one deals with the so-called feedback implosion which results from straightforward re-use of a unicast feedback scheme in a multicast framework. As the number of participants in the multicast session increases, so does the number of receiver reports that must be carried by the network and processed by the source. Moreover, the arrivals of receiver reports may well be synchronized, raising the source's burden to an unbearable level [3]. A number of schemes have been devised to alleviate this problem. Probabilistic approaches require each receiver to observe some random delay before sending a receiver report [36]. This makes it possible to assign unequal "importance" to receivers by changing some weights in the random process, but the delays before a congestion indication is taken into account by the source may be prohibitive. Another approach is for the source to perform progressive, topological polling of the receivers using, for instance, time to live (TTL) mechanisms to control the number of responses. The main drawback of this method lies in its inaccuracy, as a modest TTL increase may yield a drastic increase in the number of answers, thus causing a feedback implosion if the source is overwhelmed. A third alternative was proposed in [3]. This scheme is also based on progressive polling but relies on random keys computed by each receiver. Progressive polling is done by varying the number of significant bits, thus allowing the source to poll quite accurately the number of receivers it wants. This number is further reduced by filtering the answers: a new poll embeds the maximum congestion state already known by the source, so that only receivers incurring worse conditions have to reply.

The RTP/RTCP standard also tackles the feedback implosion problem. RFC 1889 suggests that the fraction of traffic dedicated to control packets (i.e. reports) should not exceed 5% of the bandwidth assigned to the multicast session [26]. Moreover, the inter-departure time of RTCP packets is recommended to be larger than a minimum of 5 s (a lower minimum may be allowed, particularly for unicast sessions [27]), and is subject to a random variation factor taken over the range [0.5; 1.5]. The inter-departure time interval is calculated with the help of an estimate of the session size (i.e. number of participants), so that the participants collectively stay within the 5% bandwidth share (the interval scales linearly with the session size). Further studies have shed light on scalability troubles occurring when the number of participants varies rapidly, as the session size estimate may not reflect the increase or decrease in a timely manner, and appropriate mechanisms have been proposed [24,27].
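The interval rule summarized above can be illustrated by a deliberately simplified sketch. It only keeps the elements mentioned in the text (5% control share, 5 s minimum, random factor in [0.5, 1.5], linear scaling with the session size); the function and parameter names are hypothetical, and the additional rules of the RTP specification (for instance the separate shares for senders and receivers) are left out.

```python
import random


def rtcp_interval(members, avg_rtcp_size_bytes, session_bw_bytes_per_s,
                  control_fraction=0.05, min_interval_s=5.0, rng=random):
    """Simplified RTCP report interval, in seconds.

    The nominal interval lets `members` participants share the control
    bandwidth; it is then randomized to avoid synchronized reports.
    """
    control_bw = control_fraction * session_bw_bytes_per_s
    nominal = max(min_interval_s, members * avg_rtcp_size_bytes / control_bw)
    return nominal * rng.uniform(0.5, 1.5)
```

With 1000 participants, 100-byte reports and a 16 000 bytes/s session, for example, the nominal interval is 125 s, so each receiver reports only about once every two minutes.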


The other issue associated with multicast feedback is that of aggregating heterogeneous reports into a consistent view of the communication state. In some cases this considerably restricts the usefulness of multicast feedback schemes. Consider for instance an adaptation process where receivers send back to the source the bandwidth availability they infer. The source then faces a problematic trade-off, as it has to determine a single transmission rate that is bearable to all (or at least to a majority of) receivers. In most cases this translates into selecting the lowest common rate among all indicated values, or a rate low enough that a vast majority of receivers will not experience congestion. The corollary is reduced quality for all receivers, whether they actually experience congestion or not [3,6,36]. When the feedback scheme is devoted to error protection, as in our mode selection approach, then the source has to take into account the worst error conditions encountered by the different receivers. The corollary in this case is a suboptimal rate/distortion ratio, since most forward error correcting data is useless to all but a few receivers.

Layered coding and transmission, as discussed above, alleviate this problem by making it possible to adapt the rate or the amount of error control data on a subband basis. A variety of multicast schemes making use of layered coding for audio and video communications have been proposed, some of which rely on a multicast feedback scheme. The Destination Set Grouping scheme is presented in [7], where a source produces multiple versions of a video sequence and lets the receivers individually choose which version to join. In addition, a feedback mechanism makes it possible for the receivers to alter the rate of the subband they chose. The CafeMocha approach is proposed in [5], where receivers are supposed to leave an enhancement layer as soon as the loss rate passes a given threshold. An indirect feedback mechanism is built by having the source monitor the session size of each layer, and adapt the rate of a layer that is too often abandoned by congested receivers.

7.3. Video encoder adaptation

The mechanisms we have proposed in the previous sections, namely coding mode selection and rate control using a TCP-friendly rate prediction model, may be adapted to the multicast framework and exploit the scalable coding features which have been designed in MPEG-4. Object scalability consists in coding each region of interest (ROI) in a different video object layer (VOL). The motion detection process can be easily adapted to objects of arbitrary shape, taking into account their position in a reference window (see [11]). Temporal scalability consists in increasing the temporal resolution of the base VOL, or of a partial region of it. TCP-friendly rate control and coding mode selection can be performed in a straightforward way on the different layers. Spatial scalability aims at increasing the spatial resolution of the base layer VOPs. The mode selection process can be applied to P-VOPs of the enhancement VOL by taking into account additional coding modes based on prediction from the corresponding VOP in the base layer. Similarly, for B-VOPs, prediction and interpolation from the base layer have to be taken into account. Since coding mode selection relies on the Elliott-Gilbert model of the channel, in a multicast scenario we have to consider a distinct channel model for each receiver, as loss conditions in different locations of the multicast distribution tree are unlikely to be identical. It is therefore the responsibility of each receiver to calculate the Elliott-Gilbert model for the path leading to it, and to communicate the transition probabilities p and q to the source. The source may then apply the "worst-case" loss model to the mode selection process, thus providing disproportionate error protection to those receivers which experience few losses, but with moderate impact on the resulting quality.
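To compare the models reported by different receivers, the source needs a scalar criterion, which the text does not specify. One plausible choice is sketched below, under the assumption that p denotes the probability of moving from the no-loss state to the loss state and q the probability of moving back (this convention should be checked against the definition given with the channel model). The two-state Markov chain then has the following stationary loss probability and mean loss burst length:

```latex
\pi_{\mathrm{loss}} = \frac{p}{p + q}, \qquad
\bar{L}_{\mathrm{burst}} = \frac{1}{q}
```

Under this reading, the "worst-case" model would be the one with the largest stationary loss probability, with the mean burst length available as a secondary criterion.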

Alternatively, it is possible to further refine the loss model on a layer basis, by requiring the receivers to provide the set of layers they have joined along with the transition probabilities. This allows the source to select a worst-case model for a given layer among only the loss models provided by receivers actually receiving this layer, and may potentially improve the signal-to-noise ratio of higher layers, as it is unlikely that heavily congested receivers have joined them.
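The per-layer refinement can be sketched as follows. The ranking criterion (stationary loss probability p/(p+q), with p assumed to be the no-loss to loss transition probability and q the reverse) is an assumption made for illustration, as are the `ReceiverReport` structure and the function names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReceiverReport:
    receiver_id: str
    p: float            # assumed: P(no-loss state -> loss state)
    q: float            # assumed: P(loss state -> no-loss state)
    layers: frozenset   # indices of the layers this receiver has joined


def stationary_loss(report):
    """Long-run loss probability of the two-state model (assumed criterion)."""
    total = report.p + report.q
    return report.p / total if total > 0 else 0.0


def worst_case_per_layer(reports):
    """Map each layer to the (p, q) pair of its most lossy subscribed receiver."""
    worst = {}
    for report in reports:
        for layer in report.layers:
            current = worst.get(layer)
            if current is None or stationary_loss(report) > stationary_loss(current):
                worst[layer] = report
    return {layer: (r.p, r.q) for layer, r in worst.items()}
```

In this sketch a heavily congested receiver that joined only the base layer influences the protection of the base layer alone, so the higher layers are tuned to the milder conditions of the receivers that actually subscribe to them.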

As for the rate adaptation process, our rate prediction model can be implemented inside each receiver. This provides for individual congestion monitoring and allows each receiver to determine its bottleneck rate. With the source advertising the layers it generates along with their current rates on a multicast control channel, each receiver is free to join the layers that collectively match its bandwidth capacity, similar to the approach taken in [32]. Rate adaptation may further be implemented in several ways.
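The receiver-side choice described here can be sketched as a cumulative selection over the advertised layer rates. This assumes layers must be joined in order (base layer first) and that the bottleneck rate comes from the receiver's own rate prediction; the function name and arguments are illustrative.

```python
def select_layers(advertised_rates_bps, bottleneck_rate_bps):
    """Return how many layers (base layer first) fit within the bottleneck rate."""
    total = 0.0
    joined = 0
    for rate in advertised_rates_bps:
        # Stop at the first layer that would push the sum above the bottleneck.
        if total + rate > bottleneck_rate_bps:
            break
        total += rate
        joined += 1
    return joined
```

A return value of 0 means that even the base layer exceeds the predicted bottleneck rate, a case the source could detect through feedback and handle by lowering the base layer rate.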

For those applications which involve a limited number of participants (for instance up to ten receivers), we envision that the number of layers equals the number of participants, so that each one gets a dedicated enhancement layer. Considering that receivers R_1, R_2, ..., R_n are ordered by increasing rate demand, the least demanding receiver R_1 only joins the base layer, R_2 joins the base layer and the first enhancement layer, R_3 joins the base, first and second enhancement layers, and so on. This makes it possible to implement fine-grained rate control, as each receiver triggers the rate variation of an enhancement layer. Moreover, as the session size is kept small, the feedback frequency can be set quite high without the overall control traffic being too large. In addition, this scenario makes it possible to tune the mode selection coding process of a given layer on the basis of the Elliott-Gilbert models from those receivers which actually joined the layer.
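One way to read this small-session scenario is that the rate of each layer is the increment between two successive receiver demands, so that the layers joined by R_k add up exactly to its demand. The sketch below makes that assumption explicit; it is an illustration, not the allocation rule of the paper, and the function names are hypothetical.

```python
def layer_rates_from_demands(receiver_rates_bps):
    """Base layer rate followed by one enhancement layer rate per extra receiver."""
    ordered = sorted(receiver_rates_bps)
    if not ordered:
        return []
    rates = [ordered[0]]                      # base layer serves R_1
    for previous, current in zip(ordered, ordered[1:]):
        rates.append(current - previous)      # increment needed by the next receiver
    return rates


def layers_joined(k):
    """Layer indices joined by R_k (1-based rank): base plus k-1 enhancement layers."""
    return list(range(k))
```

With this allocation, a change in the demand of R_k only modifies the rate of the layer it tops and of the next one, which matches the idea that each receiver drives the rate of one enhancement layer.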

In the case of multicast applications involving a high number of participants, it is no longer possible to apply the above scheme. We then envision classifying the receivers, according to the rate limitation they calculate using the rate prediction model, into a small number of classes which correspond to different enhancement layers. The rate control of each layer would then be performed by aggregating the various rate indications inside the corresponding class, for instance using the mean rate, and so would be the level of error protection provided by the mode selection process.
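The classification step can be sketched as follows, assuming fixed class boundaries chosen by the source (the text does not say how classes are delimited) and the mean as the aggregation rule mentioned above. Function and parameter names are illustrative.

```python
from statistics import mean


def classify_receivers(rate_limits_bps, class_bounds_bps):
    """Group rate limits into len(bounds) + 1 classes; bounds must be ascending."""
    classes = [[] for _ in range(len(class_bounds_bps) + 1)]
    for rate in rate_limits_bps:
        # The class index is the number of boundaries at or below the rate.
        index = sum(1 for bound in class_bounds_bps if rate >= bound)
        classes[index].append(rate)
    return classes


def class_target_rates(classes):
    """Aggregate each class by its mean rate; empty classes yield None."""
    return [mean(members) if members else None for members in classes]
```

Note that a mean exceeds the limit of the slowest members of a class, so a more conservative aggregate (such as the class minimum) may be preferable if those receivers must not be congested.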

8. Conclusion

Internet communications traditionally face congestion artifacts. With respect to real-time video transmission, congestion must be addressed in two ways: packet loss resilience and packet loss avoidance (which translates into congestion and rate control). Moreover, the current Internet (with best-effort network service) requires data sources to behave fairly, that is, to share network resources harmoniously, as TCP does. Future improvements of the Internet service model (i.e. differentiated service and/or integrated service) may well not remove this requirement, for instance when individual communications are aggregated into service classes which do not control competition for resource usage. The proposed coding mode selection mechanism, based on a channel characterization using an Elliott-Gilbert model, contributes to increasing the intrinsic robustness of the compressed video stream. Jointly exploiting the knowledge of scene activity and channel characteristics leads to a better trade-off between coding efficiency and resilience to packet loss. This paper also describes a rate control scheme which embeds a TCP-friendly rate prediction model. Experimental results show that the coding mode selection algorithm improves on conditional replenishment in terms of compression efficiency, yet is more robust to packet losses than the original MPEG-4 video encoder. As for the congestion control scheme, our experimental results shed light on the benefits of being TCP-friendly, as better bandwidth usage is achieved than when using a CBR video source that is not congestion-responsive. Coupling the two strategies developed in this work makes it possible to globally refine a video transmission scenario in terms of provided service and social behaviour of the video source.

Some complementary work may allow an extension of the developed techniques to a multicast scheme, with time-varying channel characteristics.

Acknowledgements

The authors would like to thank Jean-Chrysostome Bolot from INRIA Sophia in France, for helpful discussions, and Franck Galpin for providing an efficient software MPEG-4 decoder.


References

[1] J.C. Bolot, T. Turletti, A rate control mechanism for packet video in the internet, in: IEEE Infocom'94, Vol. 3, Toronto, Canada, June 1994, pp. 1216-1223.

[2] J.C. Bolot, T. Turletti, Adaptive error control for packet video in the internet, in: Proceedings IEEE ICIP'96, Vol. 1, Lausanne, September 1996, pp. 25-28.

[3] J.C. Bolot, T. Turletti, I. Wakeman, Scalable feedback control for multicast video distribution in the internet, in: ACM SIGCOMM'94, London, UK, September 1994, pp. 58-67.

[4] J.C. Bolot, A. Vega-Garcia, Control mechanisms for packet audio in the Internet, in: Proceedings IEEE Infocom'96, Vol. 1, San Francisco, CA, April 1996, pp. 232-239.

[5] T.B. Brown, P.E. Cantrell, J.D. Gibson, Multicast layered video teleconferencing: Overcoming bandwidth heterogeneity, in: First Annual Telecommunication Conference, Austin, TX, 1996, pp. 145-152.

[6] I. Busse, B. Deffner, H. Schulzrinne, Dynamic QoS control of multimedia applications based on RTP, Research Report, GMD-Fokus, Hardenbergplatz 2, D-10623 Berlin, May 1995.

[7] S.Y. Cheung, M.H. Ammar, X. Li, On the use of destination set grouping to improve fairness in multicast video distribution, in: IEEE Infocom'96, Vol. 2, San Francisco, CA, March 1996, pp. 553-560.

[8] G. Côté, B. Erol, M. Gallant, F. Kossentini, H.263+: video coding at low bit rates, IEEE Trans. Circuits Systems Video Technol. 8 (6) (November 1998) 849-866.

[9] R.O. Hinds, T.N. Pappas, J.S. Lim, Joint block-based video source/channel coding for packet-switched networks, in: Proceedings SPIE Visual Communication and Image Processing, Vol. 3309, 1997, pp. 124-133.

[10] ISO/IEC 14496-2, Coding of audio visual objects: Visual, ISO/IEC JTC1/SC29/WG11, Tokyo, March 1998.

[11] ISO/IEC JTC1/SC29/WG11, MPEG-4 video verification model 10.0, Technical Report, ISO/IEC JTC1/SC29/WG11, February 1998.

[12] W. Kwok, H. Sun, L. Ju, Obtaining an upper bound in MPEG coding performance from jointly optimizing coding mode decisions and rate control, in: Proceedings SPIE VCIP, Vol. 2501, 1995.

[13] J. Lee, B.W. Dickinson, Joint optimization of frame type selection and bit allocation for MPEG video encoders, in: Proceedings ICIP, Vol. 2, 1994, pp. 962-966.

[14] C.G. Liu, D. Estrin, S. Shenker, L. Zhang, Local error recovery in scalable reliable multicast: comparison of two approaches, Technical Report 97-648, USC, January 1997.

[15] J. Mahdavi, S. Floyd, TCP-friendly unicast rate-based flow control, Technical note sent to the end2end-interest mailing list, January 8, 1997.

[16] M. Mathis, J. Semke, J. Mahdavi, T. Ott, The macroscopic behaviour of the TCP congestion avoidance algorithm, Comput. Comm. Rev. 27 (3) (July 1997) 67-82.

[17] S. McCanne, V. Jacobson, M. Vetterli, Receiver-driven layered multicast, in: ACM SIGCOMM'96, Stanford, CA, August 1996, pp. 117-130.

[18] S. McCanne, M. Vetterli, V. Jacobson, Low-complexity video coding for receiver-driven layered multicast, IEEE J. Selected Areas Comm. 15 (6) (August 1997) 983-1001.

[19] J. Padhye, V. Firoiu, D. Towsley, J. Kurose, Modeling TCP throughput: a simple model and its empirical validation, Technical Report TR98-008, UMASS CMPSCI, February 1998.

[20] S. Pingali, D. Towsley, J. Kurose, A comparison of sender-initiated and receiver-initiated reliable multicast protocols, Performance Evaluation Rev. 22 (May 1994) 221-230.

[21] K. Ramchandran, Joint optimization techniques in image and video coding with applications to multiresolution digital broadcast, Ph.D. Thesis, Columbia University, 1993.

[22] R. Rejaie, M. Handley, D. Estrin, RAP: an end-to-end rate-based congestion control mechanism for realtime streams in the Internet, Technical Report 98-681, Computer Science Department, USC, August 1998.

[23] J. Rosenberg, H. Schulzrinne, An RTP payload format for generic forward error correction, IETF Internet Draft draft-ietf-avt-fec-03, July 1998, work in progress.

[24] J. Rosenberg, H. Schulzrinne, Timer reconsideration for enhanced RTP scalability, in: IEEE Infocom'98, Vol. 1, San Francisco, CA, 1998, pp. 233-241.

[25] J. Sandvoss, J. Winckler, H. Wittig, Network Layer Scaling: congestion control in multimedia communication with heterogeneous networks, Technical Report 43.9401, IBM European Networking Center, Heidelberg, 1994.

[26] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, RTP: a transport protocol for real-time applications, Request for Comments 1889, IETF Network Working Group, January 1996.

[27] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, RTP: a transport protocol for real-time applications, IETF Internet Draft draft-avt-rtp-new-01, August 1998, work in progress.

[28] Y. Shoham, A. Gersho, Efficient bit allocation for an arbitrary set of quantizers, IEEE Trans. Acoust. Speech Signal Process. 38 (9) (September 1989) 1445-1453.

[29] W.R. Stevens, TCP/IP Illustrated, Vol. 1 - The Protocols, Addison-Wesley Professional Computing Series, Addison-Wesley, Reading, MA, 1994.

[30] W.T. Tan, A. Zakhor, Internet video using error resilient scalable compression and cooperative transport protocol, in: Proceedings ICIP, Vol. 3, October 1998, pp. 458-462.

[31] F. Toutain, TCP-friendly point-to-point video-like source rate control, in: Packet Video'99, March 1999, pp. 1-10.

[32] T. Turletti, S. Fosse-Parisis, J.C. Bolot, Experiments with a layered transmission scheme over the internet, Technical Report RR-3296, INRIA Sophia-Antipolis, 1998.


[33] J. Wen, J. Villasenor, Reversible variable length codes for efficient and robust image and video coding, in: Proc. IEEE Data Compression Conf., Snowbird, UT, March 1998, pp. 471-480.

[34] T. Wiegand, M. Lightstone, D. Mukerjee, T.G. Campbell, S.K. Mitra, Rate-distortion optimized mode selection for very low bit rate video coding and the emerging H.263 standard, IEEE Trans. Circuits Systems Video Technol. 6 (2) (April 1996) 182-190.

[35] M.H. Willebeek-LeMair, Z.Y. Shae, Robust H.263 video coding for transmission over the internet, in: Proceedings IEEE Infocom'98, Vol. 1, April 1998, pp. 225-232.

[36] R. Yavatkar, L. Manoj, Optimistic strategies for large-scale dissemination of multimedia information, in: ACM Multimedia'93, Anaheim, CA, August 1993.
