
International Journal of Neural Systems, Vol. 17, No. 4 (2007) 289–304
© World Scientific Publishing Company

AN EMBEDDED SALIENCY MAP ESTIMATOR SCHEME: APPLICATION TO VIDEO ENCODING

NICOLAS TSAPATSOULIS∗,‡, KONSTANTINOS RAPANTZIKOS†,§ and CONSTANTINOS PATTICHIS∗,¶

∗Department of Computer Science, University of Cyprus, CY 1678, Cyprus

†School of Electrical and Computer Engineering, National Technical University of Athens,
9 Iroon Polytechniou Str., 15780, Zografou, Greece

‡[email protected]  ¶[email protected]  §[email protected]

In this paper we propose a novel saliency-based computational model for visual attention. This model processes both top-down (goal directed) and bottom-up information. Processing in the top-down channel creates the so-called skin conspicuity map and emulates the visual search for human faces performed by humans. This is clearly a goal-directed task but is generic enough to be context independent. Processing in the bottom-up information channel follows the principles set by Itti et al. but deviates from them by computing the orientation, intensity and color conspicuity maps within a unified multi-resolution framework based on wavelet subband analysis. In particular, we apply a wavelet-based approach for efficient computation of the topographic feature maps. Given that wavelets and multiresolution theory are naturally connected, the use of wavelet decomposition for mimicking the center-surround process in humans is an obvious choice. However, our implementation goes further: we utilize the wavelet decomposition for inline computation of the features (such as orientation angles) that are used to create the topographic feature maps. The bottom-up topographic feature maps and the top-down skin conspicuity map are then combined through a sigmoid function to produce the final saliency map. A prototype of the proposed model was realized on the TMDSDMK642-0E DSP platform as an embedded system allowing real-time operation. For evaluation purposes, in terms of perceived visual quality and video compression improvement, an ROI-based video compression setup was followed. Extended experiments concerning both MPEG-1 and low bit-rate MPEG-4 video encoding were conducted, showing significant improvement in video compression efficiency without perceived deterioration in visual quality.

Keywords: Visual attention model; embedded implementation; ROI-based video encoding.

1. Introduction

A popular approach to reduce the size of compressed video streams is to select a small number of interesting regions in each frame and to encode them in priority. This is often referred to as Region-Of-Interest (ROI) coding.24 The rationale behind ROI-based video coding relies on the highly non-uniform distribution of photoreceptors on the human retina, by which only a small region of visual angle (the fovea) around the center of gaze is captured at high resolution, with logarithmic resolution falloff with eccentricity.31 Thus, it may not be necessary or useful to encode each video frame with uniform quality, since human observers will crisply perceive only a very small fraction of each frame, dependent upon their current point of fixation.

A variety of approaches have been proposed in the literature for ROI estimation.24 In most of them


the definition of ROI is highly subjective; that is, they lack scientific evidence supporting the claim that the areas defined as ROIs are indeed regions of interest for most human beings. In this paper we attempt to model ROIs as the visually attended areas indicated by a saliency map,15 in order to lower as much as possible the subjectivity in selecting ROIs. Estimation of the saliency map is performed through a novel model in which both top-down (goal oriented) and bottom-up information is utilized. Furthermore, emphasis was put on the minimization of computational complexity through inline computation of the features used for the computation of the various topographic maps.

In saliency-based visual attention algorithms efficient computation of the saliency map is critical for several reasons. First, the algorithm itself should model appropriately the visual attention process in humans. Visual attention theory has been constructed mainly by neuroscientists without taking into account computational modelling difficulties. On the other hand, computational models have been developed mainly by engineers and computer scientists, which in several cases compromise theory in favor of implementation efficiency. Second, the algorithm's implementation should conform to real-life situations and settings. Perceptual-based video coding is one of the areas that visual attention fits well. However, in applications like video-telephony, real-time video encoding is required. Therefore, if a computational model of visual attention is to be used, its implementation should be both fast and effective. Finally, integration of the topographic feature maps into the overall saliency map should be performed in a reasonable way and not ad hoc, as happens in most existing models, where normalization and addition is the preferred combination method.

In the proposed model for saliency map estimation, all the above mentioned issues were taken into account. A computationally efficient way of identifying salient regions in images, based on bottom-up information, is employed by utilizing wavelets and multiresolution theory. Furthermore, a top-down channel, emulating the visual search for human faces performed by humans, has also been added. This goal-oriented information is justified by the fact that in several applications like visual-telephony and tele-conferencing the existence of, at least, one human

face in every video frame is almost guaranteed. Therefore, it is anticipated that the first area to receive human attention is the face area. However, bottom-up channels remain in the process, modelling sub-conscious visual attention attraction. Finally, the overall saliency map estimator is implemented as an embedded system to allow real-time video encoding.

2. Saliency-Based Visual Attention

2.1. Existing computational models

The idea of attention deployment dates back to the pioneering work of James,14 the father of American psychology. Several theoretical models have been proposed in the past using the two-component attention framework, consisting of a top-down and a bottom-up component, as proposed by James. Computational modelling of this theoretical framework (improved also by Treisman, see Ref. 28) has been an important challenge for researchers with both a neuroscience and an engineering background. It is widely accepted that one of the first neurally plausible computational architectures for controlling visual attention was proposed by Koch and Ullman.15 The core idea is the existence of a saliency map that combines information from several feature maps into a global measure, where points corresponding to one location in each feature map project to single units in the saliency map. Attention bias is then reduced to drawing attention towards high-activity locations of this map.

Influenced by the work of Koch and Ullman, several successful computational models have been built around the notion of a saliency map. The Guided Search model of Wolfe33 hypothesizes that attentional selection is the net effect of feature maps weighted by the task at hand (requires statistical knowledge) and bottom-up activation based on local feature differences. Hence, local activity of the saliency map is based both on feature contrast and top-down feature weight. The FeatureGate model of Cave1 provides a full neural network implementation of a similar system that combines top-down with bottom-up mechanisms. The more recent work of Torralba,27 who relates attention to the visual context of a scene, is also in a similar vein. One of the most successful saliency-based models was proposed by Itti et al.13,12 It is based on the


principles governing the bottom-up component of Koch and Ullman's scheme. Visual input is first decomposed into a set of topographic feature maps. Different locations then compete for saliency independently within each map, so that only locations that locally stand out from their surround persist. This competition is based on center-surround differences akin to human visual receptive fields. These center-surround operations are implemented in the model as differences between a fine and a coarse scale for a given feature. Finally, all feature maps are linearly combined to generate the overall saliency map. Itti's model has been widely used in the computer vision community, since it provides a complete front-end for the analysis of natural color images. Applications include experimental proof of the model's relation to human eye fixations, target detection12 and video compression.11

Tsotsos et al.30 proposed a different view of attentional selection. Their selective tuning model applies attentional selection recursively during a top-down process. This means that attentional selection does not occur either at the top of the processing hierarchy or at several levels during the bottom-up process. First, the strongest item is selected at the top level of the processing hierarchy using a Winner-Take-All (WTA) process (the equivalent of a saliency map). Second, the hierarchy of WTA processes is activated to detect and localize the strongest item in each layer of representation, pruning parts of the pyramid that do not contribute to the most salient item and continuously propagating changes upwards. Overall, saliency is calculated on the basis of feed-forward activation and possibly additional top-down biasing for certain locations or features.

It is important to note that the saliency map is not the only computational alternative for the bottom-up guidance of attention. Desimone and Duncan,2 Hamker7 and several other researchers argue that salience is not explicitly represented by specific neurons (a saliency map), but instead is implicitly coded in a distributed manner across the various feature maps. Nevertheless, these models are not directly related to the proposed method and are not further examined.

In the majority of the studies mentioned above the term top-down refers to the way computation of the saliency maps is performed. That is, in contrast to bottom-up approaches, in top-down ones attention is considered to be directed to a rather large scene area, and then the attentional selections are identified within this area. There is, however, another view of top-down attention selection. In this alternative view, frequently referred to as 'goal directed attention', attentional selection is based on the principle that humans direct their attention to known specific targets (e.g., searching for green cars). Unfortunately, it is impossible to model every possible target for every human and under any context. Therefore, the majority of goal-directed approaches simply outline a general framework4,5,20 on how to use goal knowledge for identifying visually salient areas.

2.2. The proposed visual attention model

The proposed Visual Attention model is illustrated by the architectural diagram of Fig. 1. Saliency map computation is based both on bottom-up and top-down (goal directed) information. The input sequence is supposed to contain regions of interest and non-important distractors or background areas. The role of the top-down component, depicted on the left, is to bias the attention system towards those regions of interest that can be statistically modelled using prior knowledge. Any region of interest may be modelled using statistical methods and inserted into this component. Nevertheless, since statistical modelling is not the focus of the proposed method, we only use a previously developed29 statistical model for human skin representation, used for face detection, to test the proposed scheme. Faces are probably the only kind of objects to which attention is naturally drawn independently of context. In the proposed model, another conspicuity map, the skin map, is computed based on the color similarity of objects with human skin. The skin map is modulated through multiplication with a texture map so as to emphasize structured skin areas, which have a high probability of corresponding to human faces. The modulating texture map is created through range filtering6 of the intensity channel (see also Fig. 3).

Interaction between top-down and bottom-up sub-systems is performed through modulation and takes place both within feature integration (feature level) and across feature integration (result level). The feature level is related to the bias applied to specific features in order to enhance regions similar to the prior model.
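As a rough illustration of this top-down channel, the sketch below builds a texture-modulated skin map in Python. The statistical skin model of Ref. 29 is not reproduced here; the Gaussian likelihood in the Cb-Cr plane (and its mean and spread values) is only an assumed stand-in, and range filtering is implemented as a local max-minus-min operation on the intensity channel.

```python
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

def skin_conspicuity_map(y, cb, cr, win=5):
    """Sketch of the top-down channel: a skin-likelihood map modulated
    by a texture (range-filtered intensity) map.

    The skin model of the paper (Ref. 29) is a trained statistical model;
    the Gaussian in the Cb-Cr plane below is only an illustrative stand-in."""
    y = np.asarray(y, dtype=float)
    # Placeholder skin likelihood: Gaussian around an assumed skin chromaticity.
    mu = np.array([120.0, 150.0])            # assumed nominal (Cb, Cr) skin tone
    sigma = np.array([20.0, 15.0])           # assumed spread
    d = ((cb - mu[0]) / sigma[0]) ** 2 + ((cr - mu[1]) / sigma[1]) ** 2
    skin = np.exp(-0.5 * d)

    # Range filtering of the intensity channel: local max minus local min,
    # emphasising structured (textured) areas such as faces.
    texture = maximum_filter(y, size=win) - minimum_filter(y, size=win)
    texture = texture / (texture.max() + 1e-6)

    # Modulation: structured skin-coloured areas survive, flat ones are damped.
    return skin * texture
```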


Fig. 1. The overall architecture of the Wavelet-based VA model for saliency map estimation.

For the sake of clarity there is an intentional redundancy in Fig. 1: skin areas may be enhanced using a combination of the Cb and Cr channels, as indicated by the corresponding arrow; nevertheless, we add an independent skin detection module to show the creation of the skin map as well as the combination across the conspicuity maps at the result level.

Let us consider a video-telephony scenario where the video stream contains one or more faces. The proposed scheme is then activated as follows:

(a) The skin model is selected as the one to bias further analysis and the input sequence is decomposed into different feature dimensions;

(b) each of the feature maps is transformed into the wavelet domain and center-surround differences are independently applied. The center-surround operator is applied between a coarse and a finer scale and aims at enhancing areas that pop out from their surroundings;

(c) the intermediate results (conspicuity maps) are modulated by the top-down gains computed using the selected prior model (this is not necessary for the intensity channel, since only a single conspicuity map exists);

(d) all conspicuity maps are again weighted using top-down gains and finally fused to generate the saliency map.

2.3. Bottom-up saliency-map computation

Multiscale analysis based on wavelets16 was adopted for computing the center-surround structure in the bottom-up conspicuity maps. The center-surround scheme of Itti et al.12 includes differences of Gaussian-filtered versions of the original image and is in a way similar to a Laplacian pyramid, which is constructed by computing the difference between a Gaussian pyramid image at a level L and the coarser level L+1 after it has been expanded (interpolated) to the size of level L. The Laplacian pyramid, therefore, represents differences between consecutive resolution levels of a Gaussian pyramid, similar to how the wavelet coefficients represent differences. Two main characteristics differentiate Laplacian from wavelet pyramids: (a) a Laplacian pyramid is overcomplete and has 4/3 as many coefficients as pixels, while a complete wavelet transform has the same number of coefficients, and (b) the Laplacian pyramid localizes signals in space, but not in frequency as the wavelet decomposition does.
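The 4/3 overcompleteness figure follows directly from summing the sizes of the dyadically down-sampled pyramid levels; for an $N \times N$ image,

$$N^2\left(1 + \tfrac{1}{4} + \tfrac{1}{16} + \cdots\right) = N^2\sum_{k=0}^{\infty} 4^{-k} = \tfrac{4}{3}N^2 .$$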

The YCbCr color space (instead of the RGB used by Itti13) was selected, first to keep conformance with the face detection scheme,29 and second to use the decorrelated illumination channel Y for the derivation of the intensity and orientation conspicuity maps and the Cb, Cr channels for the color conspicuity map. In this


way parallel construction of the conspicuity maps is allowed and the required transformations from one color model to another are minimized.

Let us consider a color image f, transformed into the YCbCr color space. Channel Y corresponds to the illumination and can be used for identifying outstanding regions according to illumination and orientation, while Cr (chrominance red) and Cb (chrominance blue) correspond to the chrominance components and can be used to identify outstanding color regions. In the proposed implementation salient areas based on intensity, orientation, and color are computed at several scales. In this way, outstanding areas of different sizes are localized. Combining the results of the intensity, orientation, and color feature maps at various scales provides the intensity (CI), orientation (CO) and color (CC) conspicuity maps. The motivation for the creation of separate conspicuity maps is the hypothesis that similar features compete strongly for saliency, while different modalities contribute independently to the saliency map. Hence, intra-feature competition is followed by competition among the three conspicuity maps to provide the bottom-up part of the saliency map.

In order for multiscale analysis to be performed, a pair of low-pass hφ(·) and high-pass hψ(·) filters is applied to each of the image's color channels Y, Cb, Cr, in both the horizontal and vertical directions. The filter outputs are then sub-sampled by a factor of two, generating the high-pass bands H (horizontal detail coefficients), V (vertical detail coefficients), D (diagonal detail coefficients) and a low-pass sub-band A (approximation coefficients). The process is then repeated on the A band to generate the next level of the decomposition. The following equations describe the above process mathematically for the illumination channel Y; the same process also applies to the Cr and Cb chromaticity channels:

$$Y_A^{-(j+1)}(m,n) = \{h_\phi(-m) * \{Y_A^{-j}(m,n) * h_\phi(-n)\}\downarrow_{2n}\}\downarrow_{2m} \qquad (1)$$

$$Y_H^{-(j+1)}(m,n) = \{h_\psi(-m) * \{Y_A^{-j}(m,n) * h_\phi(-n)\}\downarrow_{2n}\}\downarrow_{2m} \qquad (2)$$

$$Y_V^{-(j+1)}(m,n) = \{h_\phi(-m) * \{Y_A^{-j}(m,n) * h_\psi(-n)\}\downarrow_{2n}\}\downarrow_{2m} \qquad (3)$$

$$Y_D^{-(j+1)}(m,n) = \{h_\psi(-m) * \{Y_A^{-j}(m,n) * h_\psi(-n)\}\downarrow_{2n}\}\downarrow_{2m} \qquad (4)$$

where $*$ denotes convolution, $Y_A^{-j}(m,n)$ is the approximation of the Y channel at the $j$-th level (note that $Y_A^{-0}(m,n) = Y$), and $\downarrow_{2m}$ and $\downarrow_{2n}$ denote down-sampling by a factor of two along rows and columns respectively.
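A minimal sketch of this recursive decomposition is given below, using the PyWavelets package as a stand-in for the paper's own filter bank; the choice of the 'db2' wavelet is an assumption, not taken from the paper.

```python
import numpy as np
import pywt

def wavelet_pyramid(channel, levels, wavelet="db2"):
    """Recursive 2-D analysis of one colour channel (Y, Cb or Cr).

    At each level the current approximation is split, as in Eqs. (1)-(4),
    into a coarser approximation A and horizontal/vertical/diagonal detail
    bands (H, V, D) by separable low/high-pass filtering followed by
    dyadic down-sampling."""
    pyramid = []
    approx = np.asarray(channel, dtype=float)
    for _ in range(levels):
        approx, (horiz, vert, diag) = pywt.dwt2(approx, wavelet)
        pyramid.append({"A": approx, "H": horiz, "V": vert, "D": diag})
    return pyramid
```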

Following the decomposition of each color channel to a specific depth, we use center-surround differences to enhance regions that locally stand out from their surround. Center-surround operations resemble the preferred stimuli of cells found in some parts of the visual pathway (lateral geniculate nucleus, LGN).28 Center-surround differences are computed at a particular scale (level j) as the point-by-point subtraction of the interpolated approximation at the next coarser scale (level j + 1) from the approximation at this scale (level j). The following equations describe the center-surround operations, at level j, for the various bottom-up feature maps:

$$I^{-j} = \left| Y_A^{-j}(m,n) - \left(\left(Y_A^{-(j+1)}(m,n)\uparrow_{2m}\right) * h_\phi(m)\right)\uparrow_{2n} * h_\phi(n) \right| \qquad (5)$$

$$C^{-j} = w_{C_r} C_r^{-j} + w_{C_b} C_b^{-j} \qquad (6)$$

$$C_r^{-j} = \left| C_{rA}^{-j}(m,n) - \left(\left(C_{rA}^{-(j+1)}(m,n)\uparrow_{2m}\right) * h_\phi(m)\right)\uparrow_{2n} * h_\phi(n) \right| \qquad (7)$$

$$C_b^{-j} = \left| C_{bA}^{-j}(m,n) - \left(\left(C_{bA}^{-(j+1)}(m,n)\uparrow_{2m}\right) * h_\phi(m)\right)\uparrow_{2n} * h_\phi(n) \right| \qquad (8)$$

$$O^{-j} = w_{Y_D}\left| Y_D^{-j} - \bar{Y}_D^{-j} \right| + w_{Y_V}\left| Y_V^{-j} - \bar{Y}_V^{-j} \right| + w_{Y_H}\left| Y_H^{-j} - \bar{Y}_H^{-j} \right| \qquad (9)$$

$$\bar{Y}_D^{-j} = \left(\left(Y_D^{-(j+1)}(m,n)\uparrow_{2m}\right) * h_\phi(m)\right)\uparrow_{2n} * h_\phi(n) \qquad (10)$$

$$\bar{Y}_V^{-j} = \left(\left(Y_V^{-(j+1)}(m,n)\uparrow_{2m}\right) * h_\phi(m)\right)\uparrow_{2n} * h_\phi(n) \qquad (11)$$

$$\bar{Y}_H^{-j} = \left(\left(Y_H^{-(j+1)}(m,n)\uparrow_{2m}\right) * h_\phi(m)\right)\uparrow_{2n} * h_\phi(n) \qquad (12)$$


Fig. 2. The importance of the orientation channel: (a) original image, (b) orientation map, (c) intensity map. Compare the results shown in (b) and (c).

In the above equations $I^{-j}$, $O^{-j}$ and $C^{-j}$ denote the intensity, orientation and colour feature maps computed at scale (level) $j$; $C_{rA}^{-j}$, $C_{bA}^{-j}$ are the approximations of the chromaticity channels Cr and Cb at the $j$-th scale; $\uparrow_{2m}$ and $\uparrow_{2n}$ denote up-sampling along rows and columns respectively, while $\bar{Y}_A^{-j}$, $\bar{Y}_D^{-j}$, $\bar{Y}_V^{-j}$ and $\bar{Y}_H^{-j}$ are the upsampled (interpolated) approximations of $Y_A^{-(j+1)}$, $Y_D^{-(j+1)}$, $Y_V^{-(j+1)}$ and $Y_H^{-(j+1)}$ respectively.

The weights $\{w_{C_r}, w_{C_b}, w_{Y_A}, w_{Y_D}, w_{Y_V}, w_{Y_H}\}$ correspond to the within-feature modulating gains obtained from the top-down subsystem, as illustrated in Fig. 1. In the absence of any top-down information, the modulating gains can be set to the same value. However, they must be non-negative and their sum must equal one.
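As a rough illustration of the centre-surround operation of Eq. (5), the sketch below computes an intensity feature map with PyWavelets; the spline interpolation used to bring the coarser approximation back to the finer grid, and the wavelet family, are assumptions standing in for the separable $h_\phi$ interpolation filter of the paper.

```python
import numpy as np
import pywt
from scipy.ndimage import zoom

def approximations(channel, depth, wavelet="db2"):
    """Approximation images A_0 ... A_depth obtained by repeated 2-D DWT."""
    approx = [np.asarray(channel, dtype=float)]
    for _ in range(depth):
        a, _details = pywt.dwt2(approx[-1], wavelet)
        approx.append(a)
    return approx

def center_surround(channel, j, wavelet="db2"):
    """Feature map I^{-j} of Eq. (5): |A_j - interp(A_{j+1})|."""
    approx = approximations(channel, j + 1, wavelet)
    a_fine, a_coarse = approx[j], approx[j + 1]
    factors = (a_fine.shape[0] / a_coarse.shape[0],
               a_fine.shape[1] / a_coarse.shape[1])
    a_up = zoom(a_coarse, factors, order=1)      # interpolate to the level-j grid
    h = min(a_fine.shape[0], a_up.shape[0])
    w = min(a_fine.shape[1], a_up.shape[1])
    return np.abs(a_fine[:h, :w] - a_up[:h, :w])  # centre-surround difference
```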

The conspicuity maps for intensity (CI), orientation (CO) and color (CC) are computed by combining the feature maps at various scales in order to identify both small and large pop-out regions. Combination of the feature maps at different scales is achieved by interpolation to the finer scale, point-by-point addition and application of a saturation (e.g., sigmoid) function to the final result. The following equations describe mathematically the creation of the intensity conspicuity map; the same process also applies to the orientation and color conspicuity maps:

$$C_I = \frac{2}{1 + e^{-\sum_{j=-J_{max}}^{-1} C_I^{j}}} - 1 \qquad (13)$$

$$C_I^{-j} = I^{-j}(m,n) - \left(\left(C_I^{-(j+1)}(m,n)\uparrow_{2m}\right) * h_\phi(m)\right)\uparrow_{2n} * h_\phi(n) \qquad (14)$$

$$C_I^{-J_{max}} = I^{-J_{max}} \qquad (15)$$

The maximum analysis depth $J_{max}$ is computed as follows:

$$J_{max} = \left\lfloor \frac{\log_2 N}{2} \right\rfloor, \qquad N = \min(R, C) \qquad (16)$$

where, in $y = \lfloor x \rfloor$, $y$ is the largest integer satisfying $y \leq x$, and $R$, $C$ are the number of rows and columns of the input image respectively.
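As a quick worked example, reading the floor expression in Eq. (16) as $\lfloor (\log_2 N)/2 \rfloor$ and taking the MPEG-1 frame size used later in the paper (288 × 352 pixels):

$$N = \min(288, 352) = 288, \qquad J_{max} = \left\lfloor \frac{\log_2 288}{2} \right\rfloor = \lfloor 4.08 \rfloor = 4 .$$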

In Fig. 2(a) an image with an object differing from the surround due to orientation is shown. As expected, the orientation conspicuity map, illustrated in Fig. 2(b), captures this difference accurately. In contrast, the intensity map, shown in Fig. 2(c), is rather noisy because there are no areas that clearly stand out from their surround due to intensity.

2.4. Combination of conspicuity maps

The overall saliency map is computed once the conspicuity maps per channel have been computed. Apart from the bottom-up conspicuity maps, the top-down map (in this particular case the skin map) is also used at this stage. As in the previous case, a saturation function is used to combine the conspicuity maps, modulated by the across-feature gains,


into a single saliency map. Normalization and summation, which is the simplest way of combining the conspicuity maps (as followed by Itti13), may create inaccurate results in cases where regions that stand out from their surround in a single modality exist. For example, in the case of Fig. 2(a) it is expected that only the orientation channel would produce a salient region. Averaging the results of the orientation, intensity and color maps (not to mention the skin map) would weaken the importance of the orientation map in the total (saliency) map. Therefore, the saturation function is applied so as to preserve the independence and added value of the particular conspicuity maps, as shown in Eq. (17):

$$C_S = \frac{2}{1 + e^{-(C_I + C_O + C_C + C_F)}} - 1 \qquad (17)$$

where $C_I$, $C_O$, $C_C$ and $C_F$ are the intensity, orientation, colour and skin conspicuity maps respectively, while $C_S$ is the combined saliency map.
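In code, the fusion of Eq. (17) is a one-line saturation of the summed maps; assuming the four conspicuity maps are non-negative arrays of equal size, a minimal NumPy sketch is:

```python
import numpy as np

def fuse_conspicuity_maps(c_i, c_o, c_c, c_f):
    """Across-feature fusion of Eq. (17): a sigmoid-type saturation of the
    summed intensity, orientation, colour and skin conspicuity maps.
    A strong response in any single channel survives, whereas plain
    averaging would dilute it."""
    total = c_i + c_o + c_c + c_f
    return 2.0 / (1.0 + np.exp(-total)) - 1.0
```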

Fig. 3. The saliency map estimator as an embedded system.

3. The Saliency Map Estimator as an Embedded System

The saliency map estimator described in the previous section has been implemented as an embedded system with the use of the TMDSDMK642-0E DSP platform.26 The main architectural components are shown in Fig. 3. The corresponding model was developed using the SIMULINK model libraries19 and the Embedded Target for TI C6000TM blockset.18

The application code was optimized for speed and deployed to the TMDSDMK642-0E DSP platform using the Texas Instruments' Code Composer StudioTM.10 A simplified version of the SIMULINK model can be found at Ref. 22. In this emulated version all hardware requirements have been removed, so that anyone who wishes can test it.

In Fig. 3 the blocks named 'TopDown Saliency Map' and 'Bottom-up Saliency Map' indicate the embedded subsystems for the computation of the skin map and the bottom-up conspicuity maps


Fig. 4. Conspicuity maps of an example image: (a) intensity, (b) orientation, (c) color and (d) skin.

respectively. In the 'Map Combination' block the across-feature combination of the maps is implemented. Finally, in the 'Video Encoding' block the original video frames are encoded as ROI-based MPEG video streams using the process described in the next section.

An example of the application of the proposed visual attention model to a still image is shown in Figs. 4 and 5. In Figs. 4(a)–4(d) the intensity, orientation, color, and skin conspicuity maps are shown respectively. It can be seen that the regions that stand out from their surround in each one of the feature channels (intensity, orientation, color, skin) differ significantly, and a linear combination of the corresponding conspicuity maps would result in a noisy saliency map. The use of the saturation function for map combination smooths this effect and maintains the individual value of each feature channel, as shown in Fig. 5(a). In this figure the combined saliency map of all feature maps is depicted. In Fig. 5(b) the actual ROI area created by thresholding the saliency map (using Otsu's method23) is illustrated. Finally, in Figs. 5(c) and 5(d) the streamline and ROI-based JPEG encoded images are shown. Non-ROI areas in Fig. 5(d) are smoothed, by using a low-pass filter, before being passed to the JPEG encoder. The compression ratio achieved in this particular case, compared to standard (streamline) JPEG, is about 1.2:1.

4. ROI-Based Image and Video Encoding Using Saliency Maps

The visually attended areas indicated by a saliency map computed through the proposed model can be considered as ROIs in video entertainment movies as well as in a variety of applications such as teleconferencing and video surveillance. In the first case bottom-up channels are mainly engaged to model sub-conscious visual attention attraction, while in visual-telephony applications the existence of human faces in every video frame is almost guaranteed, and therefore it is anticipated that the first area to receive human attention is the face area. However, we cannot identify as ROIs only the face-like areas,17,32 because there is always the possibility,


Fig. 5. An example of the visually salient areas identified using the proposed algorithm: (a) combined saliency map, (b) ROI area obtained by thresholding, (c) streamline JPEG encoding and (d) ROI-based JPEG encoding.

even in a video-telephony setting, that other objects in the scene attract the human interest in a sub-conscious manner.

For ROI-based video encoding we consider as ROI an area created by thresholding the saliency map. The latter is obtained by applying the proposed model to every video frame. Once ROI areas are identified, the non-ROI areas in the video frames or images are blurred via a smoothing filter. It is well known that in smooth areas a higher compression ratio than in textured ones can be achieved, due to the spatial decorrelation obtained by applying either the DCT transform (JPEG, MPEG-1) or wavelet decomposition (JPEG 2000, MPEG-4). The assumption made for smoothing non-ROI areas is that, in the limited time in which a frame is presented to an observer, the latter will concentrate on visually salient areas and will not perceive deterioration in non-visually important areas. Smoothing non-ROI areas is not optimal in terms of expected encoding gain. However, it has the advantage of producing compressed streams that are compatible with existing decoders.
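A minimal sketch of this pre-processing step is given below, with Otsu's threshold from scikit-image and a Gaussian low-pass filter standing in for the (unspecified) smoothing filter; the blur strength is an assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.filters import threshold_otsu

def blur_non_roi(frame, saliency, sigma=3.0):
    """Threshold the saliency map (Otsu) into a binary ROI mask and
    low-pass filter only the non-ROI pixels, so that a standard
    MPEG/JPEG encoder spends fewer bits outside the ROI.

    `frame` is an H x W x 3 image, `saliency` an H x W map."""
    mask = saliency > threshold_otsu(saliency)           # True inside the ROI
    blurred = gaussian_filter(frame.astype(float), sigma=(sigma, sigma, 0))
    out = np.where(mask[..., None], frame.astype(float), blurred)
    return out.astype(frame.dtype)
```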

The quality of the VA-ROI based encoded videos and images is evaluated through a set of visual trial tests. These tests were conducted on ten short video clips, namely: eye witness, fashion, grandma, justice, lecturer, news cast1, news cast2, night interview, old man and soldier (see Ref. 21). All video clips were chosen to have reasonably varied content, whilst still containing humans and other objects that could be considered to be more important (visually interesting) than the background. They contain both indoor and outdoor scenes and can be considered as typical cases of news reports based on 3G video telephony. However, it should be noted that the selected video clips were chosen solely to judge the efficacy of VA-ROI coding in MPEG-1 and MPEG-4 and are not actual video-telephony clips. In the MPEG-1 case variable bit rate (VBR) encoding was performed with a frame resolution of 288 × 352 pixels, a frame rate of 25 fps, and GOP structure IBBPBBPBBPBB. In the MPEG-4 case, VBR encoding was also adopted but the basic aim was a low bit rate. Therefore, a frame resolution of 144 × 176 pixels and a frame rate of


15 fps were chosen so as to conform to the constraints imposed by 3G video telephony. The ImTOO MPEG Encoder plugin9 was applied to uncompressed AVI files, generated using the avifile function of Matlab(R), to create three MPEG-1 and three MPEG-4 video clips for each case. The first one corresponds to the proposed VA-based encoding (named VA-ROI), the second corresponds to the VA-based coding proposed by Itti11 (named IttiROI), and the third corresponds to standard MPEG (MPEG-1 and MPEG-4) video coding. In both VA methods (the proposed one and Itti's) non-ROI areas in each frame are smoothed before being passed to the encoder.

5. Visual Trial Tests and Experimental Results

The purpose of the visual trial test was to directly compare the subjective visual quality of VA-ROI based, IttiROI based, and streamline MPEG-1 and MPEG-4 video encoding. ROIs were determined using the proposed embedded saliency map estimator for the VA-ROI method and the Neuromorphic Vision Toolkit8 for Itti's method. In both cases saliency maps were thresholded using Otsu's method23 to create the binary masks that correspond to ROI areas.

A three-alternative forced choice (3AFC) methodology was selected because of its simplicity, i.e., the observer watches the three differently encoded video clips and then selects the one preferred, and so there are no issues with scaling opinion scores between different observers.3 There were ten observers (five male and five female) with good or corrected vision, all being non-experts in image compression (university students). The viewing distance was approximately 20 cm (i.e., a normal PDA/mobile phone viewing distance) for the MPEG-4 videos and 50 cm (a typical PC screen viewing distance) for the MPEG-1 videos. The video clip triples were viewed one at a time in a random order. The observer was free to view the video clip triples multiple times before making a decision, within a time framework of 60 seconds. Each video clip triple was viewed twice, giving 10 × 10 × 2 = 200 comparisons. Video clips were viewed on a Smartphone (NokiaTM N90) display in the case of the MPEG-4 videos and on a typical PC monitor in a darkened room (i.e., daylight with drawn curtains) in the case of the MPEG-1 videos.

Prior to the start of the visual trial all observers were given a short period of training on the experiment and were told to select the video clip they preferred. Both the MPEG-1 and the MPEG-4 encoded videos were tested through a visual trial.

Table 1 shows the overall preferences, i.e., independent of video clip, for the standard MPEG-1, the Itti-ROI and the proposed VA-ROI based methods. It can be seen that there is a slight preference for standard MPEG-1, which is selected 46.5% of the time as being of better quality. The difference in selections between VA-ROI based encoding (selected 45.5% of the time as being of better quality) and standard MPEG-1 encoding is actually too small to indicate that VA-ROI based encoding significantly deteriorates the quality of the produced video stream. At the same time the bit rate gain, which is about 36% on average (see also Table 3), clearly shows the efficiency of VA-ROI based encoding. IttiROI encoded videos were selected as few as 8% of the time as being of better quality. This fact indicates a clear visual deterioration. The slightly increased encoding gain (41%, see also Table 3), compared to the VA-ROI method, does not trade off this lowering in perceived visual quality.

In Fig. 6, the selections made per video clip are shown. In one of them (justice) there is a clear preference for standard MPEG-1, while in news cast2 there is a clear preference for VA-ROI. The latter is somewhat surprising, because the encoded quality of individual frames in VA-ROI based encoding is, at best, the same as standard MPEG-1 (in the ROI areas). Therefore, the preference for VA-ROI based encoding may be attributed to the denoising performed on non-ROI areas by the smoothing filter. In the remaining eight video clips the difference in preferences between VA-ROI and standard MPEG-1 may be attributed to statistical error. On the other hand, it is important to note that Itti's ROI-based encoding

Table 1. Overall preferences (independent of video clip) — MPEG-1 case.

Encoding method     Preferences    Average bit rate (Kbps)
VA-ROI                   91                 1125
Itti-ROI                 16                 1081
Standard MPEG-1          93                 1527


Fig. 6. VA-ROI based encoding (centre column), Itti-ROI (right column) and standard MPEG-1 encoding (left column) preferences on the eye witness (1), fashion (2), grandma (3), justice (4), lecturer (5), news cast1 (6), news cast2 (7), night interview (8), old man (9) and soldier (10) video clips.

method is in all cases less preferred than the proposed VA-ROI method. This may be attributed to the fact that Itti's saliency map estimation is optimized for identifying rather large objects that stand out from their surround. In this way small areas, such as channel logos, are not recognized as ROIs although they attract human attention. In order to be fair we should mention, however, that in typical 3G video telephony circumstances the existence of TV channel logos in a scene is rather unusual. On the other hand, the existence of other small but visually salient objects is possible.

Table 2 shows the overall preferences of the visual trial test for the MPEG-4 case. Standard MPEG-4 and VA-ROI were both selected 39.5% of the time as being of better quality.

Table 2. Overall preferences (independent of video clip) — MPEG-4 case.

Encoding method     Preferences    Average bit rate (Kbps)
VA-ROI                   79                 197.5
Itti-ROI                 42                 194.0
Standard MPEG-4          79                 224.6

As in the MPEG-1 case, there is no indication that VA-ROI based encoding deteriorates the visual quality of the MPEG-4 video stream. On the other hand, the bit rate gain achieved by the VA-ROI method is about 14% on average (see also Table 4), which is rather high if we take into account that encoding gain is difficult to obtain at very low bit rates. MPEG-4 encoded videos with ROIs identified based on Itti's visual attention scheme were selected 21% of the time, indicating a clear deterioration in subjective visual quality compared to both standard MPEG-4 and VA-ROI. It should be noted, however, that IttiROI MPEG-4 videos were selected significantly more times than their IttiROI MPEG-1 counterparts in the MPEG-1 visual trial test. This fact, as well as the lower encoding gain achieved in the MPEG-4 case for both VA-ROI and IttiROI videos (13.7% and 15.7% respectively), is due to the reduced frame resolution selected for the MPEG-4 frames (176 × 144 pixels compared to 288 × 352 pixels for the MPEG-1 frames). Downsampling video frames to 176 × 144 pixels leads to an overall frame smoothing which results in a lower difference in quality between ROI


Fig. 7. VA-ROI based encoding (centre column), Itti-ROI (right column) and standard MPEG-4 encoding (left column) preferences on the eye witness (1), fashion (2), grandma (3), justice (4), lecturer (5), news cast1 (6), news cast2 (7), night interview (8), old man (9) and soldier (10) video clips.

and non-ROI areas in both VA-ROI and IttiROI MPEG-4 videos. As a consequence a lower encoding gain is achieved (remember that, practically, in standard MPEG-4 encoding the whole frame is considered as ROI). The main conclusion of the previous discussion is that smoothing of non-ROI areas is not the appropriate way to apply ROI-based encoding. Modifications of the corresponding encoders and their encoding parameters, such as the quality factor for ROI and non-ROI macroblocks, may be necessary in order to take advantage of the existence of ROI areas. Another option is to encode the disconnected ROI areas as video objects, keeping their shapes with the aid of the Shape-Adaptive DCT transform.25

In Fig. 7, the selections made per video clip for the MPEG-4 visual trial are shown. In several cases (fashion, justice, news cast1 and soldier) the perceived visual quality for the three encoding modes is the same. In other cases, such as the old man video clip, the quality of the IttiROI video is much lower than that of the corresponding standard MPEG-4 and VA-ROI videos. In addition to missing some small visually salient objects, such as TV channel logos, Itti's method fails in some cases to identify foreground human faces as ROI areas. This is, for example, the case for the old man video clip. On the other hand, in all video clips there is no distinguishable difference in visual quality between standard MPEG-4 and VA-ROI. Therefore, the lower encoding gain achieved by VA-ROI compared to IttiROI is compensated by perceived visual quality.

5.1. Bit rate gain

Table 3 presents the bit rates achieved by VA-ROI based, IttiROI based and standard MPEG-1 encoding in each of the ten video clips. It is clear that the bit rate gain obtained by the VA-ROI method is significant, ranging from 15% (grandma video clip) to 64% (soldier video clip). Furthermore, it can be seen from the results obtained in the soldier and news cast2 video sequences that increased bit-rate gain does not necessarily mean worse quality of the VA-ROI encoded video. The encoding gain in the IttiROI encoded videos is, in general, similar to that of VA-ROI. In the four video sequences (grandma, lecturer, soldier and news cast2) where IttiROI clearly outperforms VA-ROI in terms of
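The tabulated gains are consistent with expressing the bit-rate saving relative to the ROI-based stream; for instance, for the eye witness clip in Table 3:

$$\text{gain} = \frac{2101 - 1610}{1610} \approx 30.5\% .$$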


Table 3. Comparison of VA-ROI based, IttiROI based and standard MPEG-1 encoding in ten video sequences.

Video clip        Encoding method     Bit rate (Kbps)   Bit rate gain
Eye witness       VA-ROI                   1610             30.5%
                  Itti-ROI                 1585             32.6%
                  Standard MPEG-1          2101
Fashion           VA-ROI                   1200             31.1%
                  Itti-ROI                 1188             32.4%
                  Standard MPEG-1          1573
Grandma           VA-ROI                   1507             15.2%
                  Itti-ROI                 1300             33.6%
                  Standard MPEG-1          1737
Justice           VA-ROI                   1468             30.5%
                  Itti-ROI                 1606             19.3%
                  Standard MPEG-1          1916
Lecturer          VA-ROI                    950             57.3%
                  Itti-ROI                  848             76.3%
                  Standard MPEG-1          1495
News cast1        VA-ROI                    991             39.7%
                  Itti-ROI                  999             38.5%
                  Standard MPEG-1          1384
News cast2        VA-ROI                    930             41.9%
                  Itti-ROI                  836             57.8%
                  Standard MPEG-1          1319
Night interview   VA-ROI                    790             50.8%
                  Itti-ROI                  750             59.0%
                  Standard MPEG-1          1192
Old man           VA-ROI                   1307             32.9%
                  Itti-ROI                 1085             60.0%
                  Standard MPEG-1          1737
Soldier           VA-ROI                    500             63.9%
                  Itti-ROI                  614             33.3%
                  Standard MPEG-1           819
Average           VA-ROI                   1125             35.7%
                  Itti-ROI                 1081             41.2%
                  Standard MPEG-1          1527

encoding gain, this gain comes at a clear cost in visual quality, as can be seen in Fig. 6 (0, 0, 3, and 0 selections respectively in the visual trial test). In contrast, in the two cases (soldier, justice) where VA-ROI clearly outperforms IttiROI in terms of encoding gain, the VA-ROI encoded videos also outperform the IttiROI videos in terms of visual quality.

In Table 4 the bit rates achieved by VA-ROI based, IttiROI based and standard MPEG-4 encoding are also presented. The bit rate gain obtained by the VA-ROI method ranges from 10.4% (fashion video clip) to 28.3% (lecturer video clip) and is important if we take into account that improvement in video compression at low bit rates is more difficult than improvement at intermediate (MPEG-1) and high (MPEG-2) bit rates. The encoding gain in the IttiROI encoded videos presents a higher variance across the various video clips, since it ranges from 6.9% (justice) to 30.3% (old man). In general, however, similar encoding gains are obtained by VA-ROI and IttiROI. In the two video sequences (grandma and old man) where IttiROI clearly outperforms VA-ROI in terms of encoding gain, the visual quality of the VA-ROI encoded videos is significantly higher


Table 4. Comparison of VA-ROI based, IttiROI based and standard MPEG-4 encoding in ten video sequences.

Video clip        Encoding method     Bit rate (Kbps)   Bit rate gain
Eye witness       VA-ROI                    392             12.2%
                  Itti-ROI                  381             15.3%
                  Standard MPEG-4           439
Fashion           VA-ROI                    288             10.4%
                  Itti-ROI                  285             11.7%
                  Standard MPEG-4           318
Grandma           VA-ROI                    264             11.9%
                  Itti-ROI                  247             19.5%
                  Standard MPEG-4           296
Justice           VA-ROI                    227             11.5%
                  Itti-ROI                  236              6.9%
                  Standard MPEG-4           253
Lecturer          VA-ROI                    107             28.3%
                  Itti-ROI                  110             25.3%
                  Standard MPEG-4           138
News cast1        VA-ROI                    164             13.5%
                  Itti-ROI                  170              9.6%
                  Standard MPEG-4           186
News cast2        VA-ROI                    118             14.9%
                  Itti-ROI                  115             17.8%
                  Standard MPEG-4           136
Night interview   VA-ROI                    127             15.7%
                  Itti-ROI                  128             14.8%
                  Standard MPEG-4           147
Old man           VA-ROI                    217             13.3%
                  Itti-ROI                  189             30.3%
                  Standard MPEG-4           246
Soldier           VA-ROI                     71             22.5%
                  Itti-ROI                   79             10.4%
                  Standard MPEG-4            87
Average           VA-ROI                  197.5             13.7%
                  Itti-ROI                194.0             15.7%
                  Standard MPEG-4         224.6

(see also Fig. 7). In contrast, in the soldier video clip, where VA-ROI clearly outperforms IttiROI in terms of encoding gain, it has similar visual quality to the IttiROI encoded video.

6. Conclusions and Future Work

In this paper we have presented a detailed implementation of a visual attention model that can be used to identify regions of interest for ROI-based video coding. The proposed model operates along two separate information channels: a high-level one which models conscious search for human faces and a low-level one which models sub-conscious attention attraction. Processing within the low-level information channel is basically implemented in a bottom-up manner with the aid of wavelet decomposition for multiscale analysis. The high-level information channel models the skin color in a probabilistic manner and, in combination with a bottom-up created texture map, generates a skin conspicuity map. The latter corresponds to the likelihood that image pixels belong to a human face. Multiresolution analysis was also adopted in the creation of the skin conspicuity map in order to allow both small and large faces to be detected. The skin conspicuity map and the bottom-up conspicuity


maps corresponding to intensity, orientation and color are combined with the aid of a saturation function to create the final saliency map. Through the saturation function, strong stimuli in a particular conspicuity map (intensity, color, orientation or skin) are not smoothed out (as could happen in a typical linear combination) and they are propagated to the final saliency map. A prototype of the proposed model has been implemented as an embedded system with the aid of the TMDSDMK642-0E DSP platform, while a software-only version is available online.22

In order to apply the proposed model for ROI-based video encoding, ROI areas are generated by thresholding the computed saliency maps using an image-adaptive threshold. ROI-based encoding is then applied by smoothing the non-ROI areas with a low-pass filter. Coding efficiency was examined based on both visual trial tests and encoding gain. The results presented indicate that: (a) significant bit-rate gain, compared to streamline MPEG-1 and MPEG-4, can be achieved using VA-ROI based video encoding, (b) the areas identified as visually important by the proposed VA algorithm are in conformance with the ones identified by the human subjects, as can be deduced from the visual trial tests, and (c) VA-ROI outperforms the corresponding method proposed by Itti11 in terms of visual quality but achieves slightly lower encoding gains.

Further work includes conducting experiments in an object-based framework in which the disjoint ROI areas will be treated as objects. Furthermore, the effect of incorporating priority encoding, by varying the quality factor of the DCT quantization table across VA-ROI and non-ROI frame blocks, will be examined.

Acknowledgment

The study presented in this paper was supported (in part) by the research project "OPTOPOIHSH: Development of knowledge-based Visual Attention models for Perceptual Video Coding", PLHRO 1104/01 (http://www.optopiisi.signalgenerix.com), funded by the Cyprus Research Promotion Foundation (http://www.research.org.cy/).

References

1. K. R. Cave, The feature gate model of visual selection, Psychol. Res. 62(2).

2. R. Desimone and J. Duncan, Neural mechanisms of selective visual attention, Annu. Rev. Neurosci. 18 (1995) 193–222.

3. M. Eckert and A. Bradley, Perceptual models applied to still image compression, Signal Processing 70(3) (1998) 177–200.

4. S. Frintrop, Vocus: A visual attention system for object detection and goal-directed search (2006).

5. R. S. Gaborski, V. S. Vaingankar and R. L. Canosa, Goal directed visual search based on color cues: Cooperative effects of top-down & bottom-up visual attention, in Proceedings of the Artificial Neural Networks in Engineering — ANNIE 2003, November (2003).

6. R. Gonzalez and R. Woods, Digital Image Processing, 2nd edn (Prentice Hall, 2002).

7. F. H. Hamker, A dynamic model of how feature cues guide spatial attention, Vision Research 44 (2004) 501–521.

8. iLab, Neuromorphic Vision C++ Toolkit (iNVT), Univ. of Southern California, online at: http://ilab.usc.edu/toolkit/, last visited: January 2007.

9. ImTOO, ImTOO MPEG Encoder, online at: http://www.imtoo.com/mpeg-encoder.html, last visited: January 2007.

10. Texas Instruments, Code Composer Studio User's Guide, Texas Instruments Literature Number SPRU328B, 2000.

11. L. Itti, Automatic foveation for video compression using a neurobiological model of visual attention, IEEE Trans. on Image Processing 13 (2004) 1304–1318.

12. L. Itti and C. Koch, A saliency-based search mechanism for overt and covert shifts of visual attention, Vision Research 40 (2000) 1489–1506.

13. L. Itti, C. Koch and E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 20(11) (1998) 1254–1259.

14. W. James, The Principles of Psychology (Cambridge, MA: Harvard University Press, 1890/1981).

15. C. Koch and S. Ullman, Shifts in selective visual attention: towards the underlying neural circuitry, Human Neurobiology 4(4) (1985) 219–227.

16. S. Mallat, A theory of multiresolution signal decomposition: The wavelet model, IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (1989) 674–693.

17. E. Mandel and P. Penev, Facial feature tracking and pose estimation in video sequences by factorial coding of the low-dimensional entropy manifolds due to the partial symmetries of faces, in Proceedings of the 25th IEEE Int. Conf. Acoustics, Speech, and Signal Processing, volume IV, pages 2345–2348, June (2000).


18. The MathWorks, Embedded Target for TI C6000TM DSP 3.1, online at: http://www.mathworks.com/products/tic6000/, last visited: January 2007.

19. The MathWorks, Simulink(R) — simulation and model-based design, online at: http://www.mathworks.com/products/simulink/, last visited: January 2007.

20. V. Navalpakkam and L. Itti, A goal oriented attention guidance model, in BMCV '02: Proceedings of the Second International Workshop on Biologically Motivated Computer Vision, pages 453–461, London, UK (2002), Springer-Verlag.

21. [Online]. Original video frames, http://www.cs.ucy.ac.cy/∼nicolast/research/frames.rar, last visited: January 2007.

22. [Online]. Visual Attention Model Code, http://www.cs.ucy.ac.cy/∼nicolast/research/VAmodel.zip, last visited: January 2007.

23. N. Otsu, A threshold selection method from gray level histograms, IEEE Transactions on Systems, Man and Cybernetics 9 (1979) 62–66.

24. C. M. Privitera and L. W. Stark, Algorithms for defining visual regions-of-interest: Comparison with eye fixations, IEEE Transactions on Pattern Analysis and Machine Intelligence 22(9) (2000) 970–982.

25. T. Sikora and B. Makai, Shape-Adaptive DCT for Generic Coding of Video, IEEE Transactions on Circuits and Systems for Video Technology 5(1) (1995) 59–62.

26. DSP Development Systems, TMS320DM642 evaluation module technical reference, Technical report, Spectrum Digital Inc. (2003).

27. A. Torralba, Modeling global scene factors in attention, J. Opt. Soc. Am. A 20(7) (2003) 1407–1418.

28. A. M. Treisman and G. Gelade, A feature integration theory of attention, Cognitive Psychology 12(1) (1980) 97–136.

29. N. Tsapatsoulis, Y. Avrithis and S. Kollias, Facial image indexing in multimedia databases, Pattern Analysis and Applications 4(2).

30. J. K. Tsotsos, S. M. Culhane, W. Y. K. Wai, Y. Lai, N. Davis and F. Nuflo, Modeling visual attention via selective tuning, Artificial Intelligence 78(1–2) (1995) 507–545.

31. B. A. Wandell, Foundations of Vision (Sinauer Associates, Sunderland, MA, 1995).

32. Z. Wang, L. Lu and A. Bovik, Foveation scalable video coding with automatic fixation selection, IEEE Transactions on Image Processing 12(2) (2003) 243–254.

33. J. M. Wolfe, Visual search in continuous, naturalistic stimuli, Vision Research 34(9) (1994) 1187–1195.