
Università di Pisa

Dipartimento di Informatica
Scuola di Dottorato “Galileo Galilei”

Dottorato di Ricerca in Informatica

Ph.D. Thesis

Attention-based Object Detection

Franco Alberto Cardillo

Supervisor

Prof. Antonina Starita

October 15th, 2007


To my family.


Contents

1 Introduction

2 The Human Visual System
  2.1 The eye and the retina
  2.2 From the retina to the brain
  2.3 Visual pathways
  2.4 Colour vision
  2.5 Conclusions

3 Visual Attention
  3.1 Introduction
  3.2 Spatial-based Visual Attention
    3.2.1 Metaphors of spatial-based attention
    3.2.2 Feature processing in covert attention
    3.2.3 Parallel versus Serial Search
    3.2.4 Feature Integration Theory
    3.2.5 The Guided Search model
  3.3 Computational Models of Visual Attention
    3.3.1 Conclusions

4 The proposed model
  4.1 Introduction
  4.2 Overall architecture
  4.3 Bottom-up processing
    4.3.1 Image pyramids
    4.3.2 Feature pyramids
    4.3.3 Non-oriented features
    4.3.4 Feature (contrast) maps
    4.3.5 Feature conspicuity maps
    4.3.6 Saliency Map
    4.3.7 Scene exploration through the saliency map
  4.4 Top-down attention and visual search
    4.4.1 Object learning
    4.4.2 Feature matching
    4.4.3 Different exploration strategies
  4.5 Conclusions

5 Swarm Intelligence for Image Analysis
  5.1 Introduction
  5.2 Swarm Intelligence
    5.2.1 Swarm Intelligence Properties
    5.2.2 Designing swarm intelligence algorithms
    5.2.3 Collective Prey Retrieval
  5.3 A New Coordination Mechanism Based on Collective Prey Retrieval
    5.3.1 Model description
    5.3.2 Model Validation
  5.4 An Algorithm for image alignment and matching
    5.4.1 Image Alignment
    5.4.2 Description of the Algorithm
  5.5 Experimentation
  5.6 Conclusions

6 Conclusions

Bibliography


List of Figures

2.1 Anatomy of the eye. Left: anatomical representation of the human eye. Right: drawing of the layered retinal structure.

2.2 Left: drawing of the foveal region. Right: rods and cones in the human retina.

2.3 Actual recordings performed by Hubel on ganglion receptive fields [1]. From left to right: light stimuli and corresponding recordings from, respectively, an on-centre, off-surround retinal ganglion cell and an off-centre, on-surround retinal ganglion cell. For both images: first row, no stimulus; second row, response to a small spot of optimum size, i.e. filling the centre of the receptive field; third row, a large spot covering both the centre and the surround; fourth row, no spot in the centre, but a ring covering the surround. The on-centre, off-surround cell is most active in response to its optimum stimulus in the second row. The off-centre, on-surround cell is most active in response to the fourth stimulus.

2.4 Left image: dorsal and ventral flows. Right: parallel processing flows starting in the retina.

3.1 Two visual stimuli to experience the space-variant organization of our photoreceptor layer. If we fixate on the ‘*’ at the centre of the two images, we realize that we are still able to recognize the characters in the left image, but that we lose this ability in the right image, where the font size has been reduced.

3.2 The Stroop effect. Our internal processing of the words and their colour suffers from major interference.

3.3 Three visual stimuli used in visual pop-out studies. Left: pop-out caused by a difference in the colour feature. Middle: pop-out caused by a difference in the orientation feature. Right: the target, a vertical yellow bar, is defined by two properties, colour and orientation; there is no pop-out, and a serial search is required to locate the target.

3.4 Schematic representation of the Feature Integration model. The retinal image is coded into separable features (part of the colour and orientation flows is shown) that are bound into the object description by the attentional system. The picture does not include the arrows for the top-down processes deploying the focus of attention.

3.5 A schematic representation of the model by Koch and Ullman [2]. The results of the preattentive computations are stored in feature maps, later globally combined in a saliency map where a Winner-Take-All strategy detects the most salient image region. Once a salient region has been detected, it is analyzed by higher-level processes. The image appears in the original article [2].

3.6 The architecture of the Guided Search 2.0 model (from [3]). Input stimuli are first processed by parallel processes extracting basic visual features. The features are encoded in feature maps, each one encoding the saliency of an image part with respect to a particular feature dimension, modulated by top-down, task-dependent commands. The feature maps are then combined in a global activation map specifying where the most informative regions are in the current visual stimulus.

3.7 Architecture of the computational model proposed by Milanese et al. in [4].

3.8 Overall architecture of the Selective Tuning model. Left image: active connections when a stimulus composed of a red object and a blue object is presented to the model. Right image: inhibited connections (in black) when top-down processes deploy attention on the red stimulus.

3.9 The architecture of the computational model proposed by Itti et al. Left picture: original architecture, as proposed in [5]. Right picture: expanded architecture, as proposed in [6]. The later model includes components to compute the gist of the scene and use it as a hint to enhance the spatial locations where the target has previously been seen.

3.10 Architecture of the computational model proposed by Hamker in [7].

3.11 Schematic description of the object-based attentional model proposed in [8].

3.12 Schematic architecture of the neurodynamical model of visual attention.

4.1 Schematic representation of the structure of the bottom-up component of our system. The top-down connections from the high-level processes are not shown.

4.2 Intensity and colour values in the classic image of the mandrill. From the upper left corner, clockwise: original image and, in order, the red, green, blue, and yellow channels (after half-wave rectification).

4.3 Gaussian pyramid for intensity values. The leftmost image is the input image.

4.4 The DoG filters generated for computing centre-surround differences. From left to right: mask for the centre part, mask for the surround, DoG for computing on-centre/off-surround receptive fields, DoG for computing off-centre/on-surround receptive fields. The DoGs have been obtained by subtracting the first two masks. The centre and surround masks have been generated with parameters γ = 0.5 and centre radius r = 2.

4.5 Example of centre-surround and colour-opponency organization. Left: input image. Middle: coding in the on-centre/off-surround channel. Right: coding in the off-centre/on-surround channel.

4.6 Results of the algorithm on visual pop-out. Left: input image; middle: saliency map at the finest scale; right: saliency map at the coarsest scale.

4.7 Results of the algorithm on visual pop-out. Left: input image; middle: saliency map at the finest scale; right: saliency map at the coarsest scale.

4.8 Results of the algorithm on visual pop-out. Left: input image; middle: saliency map at level 1 (coarser than the input image); right: saliency map at the coarsest level.

4.9 Results of the algorithm on a real image. Left: input image; middle: saliency map at the finest scale; right: saliency map at the second scale. The saliency of the street lamps is clearly visible on the finer saliency map but is lost on the coarser one.

4.10 Results of the algorithm on a natural image. From left to right: input image; global saliency map; local saliency maps at the four levels, from fine to coarse scales.

4.11 Results of the algorithm on two natural images. From left to right: input image; global saliency map; local saliency map at level 2, where the windows appear more salient; input image with green and red tomatoes; saliency map.

4.12 ‘Starry Night’, by van Gogh; global saliency map; saliency map at level 3.

4.13 Saliency maps computed by weighting each region according to its size (with respect to the others). From left to right, first and second images: black and white image containing a region larger than the others and the corresponding saliency map. The text ‘max’ has been added at the location with maximum saliency, since the saliency values are not displayed as different. Third and fourth images: black and white image with a single smaller region and the corresponding saliency map. In each case the region differing in size from the others has been given a greater saliency value.

4.14 Saliency maps computed by weighting each region according to its size (with respect to the others). From left to right, first and second images: stimulus containing a smaller and a larger region, and the corresponding saliency map. Third and fourth images: a stimulus containing three white circles and two green ones; the saliency map is shown in 3-D in order to better appreciate the results of the weighting strategy.

4.15 Exploration of the Starry Night painting. First row: fixation points on a reduced-resolution input image and on the saliency map. Second row: object segmented in the first fixation; feature map used to segment the second fixation point; result of the segmentation starting from the second point. Numbers indicate the order of the exploration.

4.16 Six steps of the top-down deployment of attention on a simple visual stimulus. First row: input image; match on the first region (green circle). Second row: centre of the area segmented starting from the point with maximum value; attentional beam onto the second image. Third row: centroid of the second segmented region; attentional beam toward the third region.

4.17 Another example of the top-down deployment of attention on a simple visual stimulus. First row: input image; image to be searched; match on the first region (green circle). Second row, first and second images: two attentional beams, looking at the detected rotation angle α + 180°; beam toward the third location. Since the algorithm has already obtained the second match, it does not look for α + 180°.

5.1 A simulation of the democratic collective transportation model. Top left box: the initial position of the agents. Top right box: the direction chosen by the majority of the agents; during the first 50 iterations, 34% of the agents chose to go toward south-east, while in the last 35 iterations 24% of the agents chose the direction north-east. Bottom left box: the final position of the agents (the agents have effectively moved in the direction chosen by the majority of them). Bottom right box: the sum of the agents’ errors in estimating V_g at each iteration (Eq. 5.7).

5.2 Example of the execution of the algorithm. For every row, from left to right: I_input, I_target, the differences between I_input and I_target, the output of the algorithm (the aligned image), and the difference between I_target and the output of the algorithm.

5.3 Example of the application of our algorithm to the image matching problem. In this case the goal is to find the location of the patch I_input in I_target. From left to right: I_input, I_target, the output of the algorithm, and the differences between I_target and the estimated location of I_input in I_target. The black box means that the algorithm was able to correctly locate the patch over I_target.


Chapter 1

Introduction

We see without any explicit effort. We simply open our eyes and let the light strike our retinas. Nevertheless, the two-dimensional pattern of light activating our retinal photoreceptors makes us see a colourful, three-dimensional world full of meaningful objects. However, endowing a computer with the ability to replicate, even if only partially, our visual abilities on very specific tasks is not easy at all.

Research on digital image processing and computer vision started as soon as sufficiently powerful computers were built. At its earliest stage, in the late 1960s and early 1970s, researchers used computers only for simple tasks, like image coding and communication. However, already in the early 1970s, computers started playing an important role in medicine, with the introduction of computerized axial tomography. As time passed, their presence in image-related tasks became crucial in every field.

Computers were no longer required merely to organize and facilitate the exchange of digital data: they were required to interpret the data and the images, and this latter task turned out not to be easy. Computer vision then came to be characterized by an increasing number of different approaches to digitally processing and interpreting images. Algorithms for image processing and analysis were, and still are, task-dependent. One of the major problems in computer vision is, in fact, the lack of a general way to solve simple tasks. For example, there are many algorithms for edge extraction, but their performance changes drastically when they are applied to different images. Furthermore, even in cases where they perform well, their performance is not comparable with human results. If one analyzes the methods used by most image processing and analysis algorithms, one finds only a few details inspired by the way the human visual system performs visual tasks.

By contrast, biologically-inspired computer vision algorithms try to solve an image-related problem by emulating some processes of our visual system. Our visual perception is the result of simple neural processing taking place in the early stages of the visual system and of complex processes in higher-level neural areas. Even if it is not yet fully understood how visual perception arises, algorithms can be inspired by the basic processing steps performed by the early stages of our visual system.

The amount of information characterizing the visual stimulus would be too much to process even for our brain. Evolution has endowed humans with attentional mechanisms: they filter out as much information as possible, trying to pass on only the relevant information to our brain. Psychological studies on visual attention started many decades ago, but it is during the past 15 years that the computer vision community has tried to implement attentional mechanisms in computer vision algorithms.

This thesis presents a new computational model of visual attention. Starting from the influential models in [2, 5], themselves based on the earlier Feature Integration Theory, we propose a filter-based model of visual attention able to perform a multiresolution analysis of the input image and detect salient regions that are worth detailed inspection. The model is composed of a bottom-up part and a top-down one.

The bottom-up part models stimulus-driven attention. It produces a representation of the input image by constructing a pyramid of saliency levels, each one encoding the saliency of the image at a given resolution level. Such a structure allows the algorithm to deploy attention like a zoom-lens and observe either fine details or coarse regions.
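As a rough illustration of this structure, the following sketch builds a Gaussian pyramid and attaches a simple local-contrast map to each level. It is a minimal sketch assuming NumPy and SciPy, not the thesis implementation; `local_contrast` is a hypothetical stand-in for the full feature and conspicuity computations of Chapter 4.

```python
# Minimal sketch (not the thesis code): a Gaussian pyramid whose levels
# each carry their own saliency estimate. Assumes NumPy and SciPy.
import numpy as np
from scipy.ndimage import gaussian_filter

def build_gaussian_pyramid(image, levels=4, sigma=1.0):
    """Return progressively coarser versions of `image`: each level is
    Gaussian-blurred and subsampled by two, mirroring the resolution
    levels used by the bottom-up component."""
    pyramid = [image.astype(np.float64)]
    for _ in range(levels - 1):
        blurred = gaussian_filter(pyramid[-1], sigma)
        pyramid.append(blurred[::2, ::2])   # dyadic subsampling
    return pyramid

def local_contrast(level, sigma=2.0):
    """Hypothetical per-level saliency: absolute deviation from a local mean."""
    return np.abs(level - gaussian_filter(level, sigma))

image = np.random.rand(256, 256)            # stand-in for an intensity image
saliency_levels = [local_contrast(l) for l in build_gaussian_pyramid(image)]
```

Attending at a coarse level then corresponds to reading a low-resolution saliency map, while zooming in corresponds to consulting a finer level.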

The top-down part models volitional or focused attention, enhancing the features that are relevant for the current task. When top-down modulations are sent down to the lower level, where the bottom-up module operates, the saliency of each image region comes to depend on the current task.

The model can be applied in three different modes. In scene exploration mode only the bottom-up module is used. Attention is deployed onto regions characterized by a high spatial saliency. Our model introduces a new normalization strategy for computing the saliency of a region in the different feature maps before they are combined into a saliency map.

In learning mode the model constructs a template for the object under inspection. In this mode both the bottom-up and the top-down modules are used. Salient regions detected by the bottom-up processes are analyzed by the top-level mechanisms, which produce a template of the object. The template contains information about the features characterizing the object and about the spatial relations among its parts.

In search mode the model uses an object template to modulate the bottom-up components in order to detect the presence of the object. Top-down processes influence both the spatial selection of the regions to inspect and the computation of a saliency map that takes into account only the features describing object parts. Objects can be composed of an arbitrary number of parts, visible at different resolution levels.
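The following fragment sketches one way such task-dependent modulation could look: feature maps are combined with weights that encode the current task, so that free viewing and object search differ only in the weight vector. The map names and weights are illustrative assumptions, not the model's actual parameters.

```python
# Hypothetical sketch of top-down modulation: the same feature maps are
# combined with task-dependent weights. Assumes NumPy; names are illustrative.
import numpy as np

def combine(feature_maps, weights):
    """Weighted sum of feature maps; the weights encode the current task."""
    saliency = np.zeros_like(next(iter(feature_maps.values())))
    for name, fmap in feature_maps.items():
        saliency += weights.get(name, 0.0) * fmap
    return saliency

maps = {"intensity": np.random.rand(64, 64),
        "red-green": np.random.rand(64, 64),
        "blue-yellow": np.random.rand(64, 64)}

# Scene exploration mode: all features contribute equally.
bottom_up = combine(maps, {name: 1.0 for name in maps})
# Search mode: a template of a red object suppresses irrelevant features.
top_down = combine(maps, {"red-green": 1.0})
```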

Furthermore, our research has considered alternative computational paradigms for modeling attention. In particular, we considered the Swarm Intelligence paradigm, in which complex computations are an emergent property of the interactions among simple agents in a complex environment. We tried to model attention through Swarm Intelligence algorithms since some insects, like bees, are able to communicate visually the presence of salient regions, like flowers, and, furthermore, their memory allows them to return to the locations characterized by many food sources. However, the initial investigations led us to implement only an algorithm for image registration. Such an algorithm is not related to visual attention, but it allowed us to gain some insights into the correct way to implement a computational model of visual attention in the new paradigm. The proposed algorithm is able to perform a rigid registration of noisy images and to locate an image patch in a larger image, without any modification of its internal parameters or procedures. We tested our algorithm on medical images and geographical maps.

This thesis is organized as follows. Chapter 2 describes the human visual system. The description does not pretend to be a complete introduction to our visual system; it covers only those parts needed to understand psychological and computational models of visual attention. Chapter 3 presents a review of the most influential psychological theories and computational models of visual attention. Chapter 4 describes in detail the proposed computational model and the initial results of its implementation. Chapter 5 describes, after a short introduction to swarm intelligence, the new image-registration algorithm. Chapter 6 summarizes the work done and the work that remains to be done.


Chapter 2

The Human Visual System

Our visual system is just a small part of our larger central nervous system and, yet, it is not fully understood. Nevertheless, the visual system is the part that has been studied earliest, the most, and the longest. This chapter is not intended to be a complete and in-depth discussion of the human visual system (HVS) or, more generally, of the visual system in vertebrate species. It should be considered a brief review of the biological and physiological evidence that is relevant for a computational model of visual attention.

When we open our eyes we see a colourful and meaningful three-dimensional world surrounding us. Such visual experience results from a sequence of transformations, starting in our eyes, performed on the light stimuli. The light is focused on the retinal surface, then processed and transferred to our thalamus, and finally routed to the cerebral cortex. At each step the visual stimuli are analyzed, divided, and merged by cells whose behaviour becomes more complex as we move up the hierarchy. The rest of the chapter discusses the aspects of each stage that are relevant for the study of computational models of visual attention.

Sections 2.1 and 2.2 discuss the structure of the human retina and describe the processing performed on the light stimuli while they are being transferred to the brain. Section 2.3 discusses the various flows of information that proceed in parallel toward our higher cortical areas. Section 2.4 presents the basics for understanding colour vision. Section 2.5 briefly presents one of the most popular hypotheses about how the various flows are merged in order to yield a unitary visual experience.

2.1 The eye and the retina

The light, which enters our eyes projecting an image of the external world, must be transformed into a form suitable for our nervous cells to process. This initial transformation is accomplished by the retina, a thin and layered structure located at the back of the eyeball (Fig. 2.1, left image). The functional role of the non-retinal parts of the eye is to keep a sharp and focused projection on the retina, centred on the object of interest we are looking at. This task is mainly accomplished by the cornea and the lens: the initial deflection of the incoming light performed by the cornea is corrected by the lens, allowing us to focus on objects at various distances from us.

Figure 2.1: Anatomy of the eye. Left: anatomical representation of the human eye. Right: drawing of the layered retinal structure.

The retina is composed of three layers of cells, separated by two layers containing the synapses made by the axons and dendrites of those cells [1]. A drawing of the retina is shown on the right side of Fig. 2.1: the incoming light passes through the outer layers and hits the inner layer, where the photoreceptors are located. All the neural computations start here, in the innermost and deepest part of the retina. Quite surprisingly, in fact, the photoreceptors are not located in the foremost layer, but behind the other retinal cells [9]. The photoreceptors perform a kind of image sampling; as we will see, they actually implement a space-variant sampling that drastically reduces the amount of information the upper layers need to process.

There are two types of photoreceptors: cones and rods (Fig. 2.2, left picture). The rods are very sensitive to light and are located everywhere in the retina except in the central area. They are not responsive at high illumination levels, but are very active in low light, providing us with the ability to see in dark conditions. The cones respond only at high illumination levels and are mainly located in the central area. The cones are responsible for our ability to see details and colours, but are inactive when the light level is low.

The central part of the retina, called the fovea, is a specialized region: when we look at an object in order to catch its finest details, we position our eyes so that the optical axis aligns the region we want to inspect with the fovea. In the fovea the retina presents only the photoreceptor layer: if the other two layers were in front of the cones, our ability to distinguish objects according to their details would be drastically reduced [1]. A representation of the fovea is shown in the right picture of Fig. 2.2.

Figure 2.2: Left: drawing of the foveal region. Right: rods and cones in the human retina.

The synaptic terminals of the rods and cones are connected to the dendrites of the bipolar cells, which populate the second, middle layer. Between the first and second layers we find horizontal cells, which pool over several photoreceptors, in particular rods, in order to increase the sensitivity to light. The bipolar cells are in turn connected to the ganglion cells in the outer, last layer. Between the middle and the outer layers, the amacrine cells play a role similar to that of the horizontal cells: they group several bipolar cells located at different distances from the target ganglion cell. The aim of such pooling is not only to increase sensitivity to light: it is a sort of image compression, where only details in the central area are kept. However, such a “compressed” channel is very sensitive to movement: if something moves around us while we are inspecting an object, our visual system will detect the movement and signal to our brain that we should move our attention to another area.

2.2 From the retina to the brain

The computations taking place in the three layers when light strikes the photoreceptors do not produce a mere neural representation of the external image. The visual experience is being built: features are being extracted from the photoreceptor responses, colours are being formed, and features are being routed to the correct higher-level neural circuits to be processed. Bipolar and ganglion cells present a structured receptive field, i.e., they are connected to well-defined areas of the retina and do not react to the simple presence of a light stimulus.

In particular, ganglion cells have a receptive field with a centre-surround organization [10]. The receptive field consists of the retinal area to which the ganglion cell is connected through the bipolar cells. It is divided into two parts with opposite influences on the ganglion output: the centre is a circle, the “surround” is an annulus centred on the same point as the circle.

An on-centre, off-surround ganglion cell reaches its maximum activity level when light hits and fills the central part of its receptive field and no light stimuli are present in the surround area. An off-centre, on-surround cell has the opposite preferred stimulus: it is most active when light hits only the surround. In particular, on a uniform background the ganglion cell has no output, since the centre and surround contributions balance each other. Fig. 2.3 shows actual recordings performed by Hubel. It has been found that ganglion cells have different sizes

Figure 2.3: Actual recordings performed by Hubel on ganglion receptive fields [1]. From left to right: light stimuli and corresponding recordings from, respectively, an on-centre, off-surround retinal ganglion cell and an off-centre, on-surround retinal ganglion cell. For both images: first row, no stimulus; second row, response to a small spot of optimum size, i.e. filling the centre of the receptive field; third row, a large spot covering both the centre and the surround; fourth row, no spot in the centre, but a ring covering the surround. The on-centre, off-surround cell is most active in response to its optimum stimulus in the second row. The off-centre, on-surround cell is most active in response to the fourth stimulus.

depending on their distance from the fovea. Ganglion cells connected to the fovea have smaller receptive fields and smaller dendritic fields: there is a one-to-one mapping between photoreceptors (cones) and ganglion cells. Moving out toward the retinal periphery, ganglion receptive fields and dendritic trees increase in size: more photoreceptors, through horizontal cells, and more bipolar cells, through amacrine cells, are connected to a single ganglion cell. As early as in the retina, two different flows of information are built. The first conveys the encoding of image details and colours and passes through the small ganglion cells. The second conveys information extracted from the peripheral retinal regions, pooled over a larger number of photoreceptors, and passes through the large ganglion cells.
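The centre-surround behaviour described above is commonly modelled with a difference of Gaussians (DoG), the same construction behind the filters of Fig. 4.4 in Chapter 4. The following is a minimal sketch of that idea, assuming NumPy and SciPy; the sigma values are illustrative, not the thesis's parameters.

```python
# Minimal sketch of a centre-surround (difference-of-Gaussians) response.
# Assumes NumPy/SciPy; parameter values are illustrative only.
import numpy as np
from scipy.ndimage import gaussian_filter

def center_surround(image, sigma_center=1.0, sigma_surround=3.0):
    """On-centre/off-surround response: a narrow excitatory Gaussian minus
    a wide inhibitory one. On a uniform input the two terms cancel, just
    as a ganglion cell stays silent on a uniform background."""
    img = image.astype(np.float64)
    return gaussian_filter(img, sigma_center) - gaussian_filter(img, sigma_surround)

image = np.zeros((64, 64))
image[28:36, 28:36] = 1.0                  # a small bright spot on a dark field
on_response = center_surround(image)       # peaks on the spot
off_response = -on_response                # off-centre cells: the sign-flipped map
```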

The axons of the ganglion cells are joined in the two optic nerves connecting the two retinas to the Lateral Geniculate Nucleus (LGN) of the thalamus. The LGN is a laminar neural structure composed of six layers of cells; each cell has a receptive field with a centre-surround organization like that of the retinal cells. The upper four layers are called parvocellular layers because they contain cells with small bodies and small receptive fields. The lower two layers are called magnocellular layers because they contain cells with large bodies and large receptive fields. LGN cells show the same functional behaviour as the ganglion cells.

LGN cells are then connected to the striate cortex, or area V1, which is the largest and most studied part of our visual system. In this layer cells start computing complex features: while ganglion and LGN cells are selectively responsive to discontinuities in the retinal stimuli, regardless of their orientation, V1 cells are more complex and show preferences for specific orientations. They are subdivided into two main groups:

• simple cells, whose behaviour on complex patterns can be predicted from their behaviour on simpler patterns. For example, simple V1 cells can be inhibited or stimulated by a single spot of light, and their response to several spots of light can be determined using linear models. Furthermore, the majority of them have an elongated receptive field and can act as line or edge detectors with a preferred orientation (see the sketch after this list).

• complex cells, which no longer present a linear behaviour. They are highly non-linear, sensitive to motion, and insensitive to the position of the stimulus within their receptive field, which is larger than that of the simple cells. Some complex cells, earlier classified as a separate type, are called hypercomplex and are able to detect segments of lines fully contained in their receptive fields. Some complex cells receive their input directly from the LGN, but the vast majority of them integrate the output values of several simple cells.
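Orientation-selective simple cells are classically modelled with Gabor filters, a sinusoid under a Gaussian envelope; the sketch below illustrates this standard model, which is an assumption for illustration, not necessarily the model used in this thesis.

```python
# Illustrative sketch: an orientation-selective "simple cell" modelled
# as a Gabor filter. Assumes NumPy/SciPy; parameters are illustrative.
import numpy as np
from scipy.ndimage import convolve

def gabor_kernel(size=15, theta=0.0, wavelength=6.0, sigma=3.0):
    """Oriented Gabor kernel: a cosine grating under a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)     # rotate the coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2))
    return envelope * np.cos(2 * np.pi * xr / wavelength)

image = np.random.rand(64, 64)
# Responses of four "simple cells" preferring different orientations.
responses = [convolve(image, gabor_kernel(theta=t))
             for t in (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)]
```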

After being processed in the V1 area, stimuli are routed to different brain areas. The V1 output connections are briefly discussed in the next section.

2.3 Visual pathways

The segregation of the visual pathways begins as early as in the retina. Each ganglion cell receives input from a small number of photoreceptors, each of which responds to light stimuli over a small portion of the retina. However, as previously said, there is a basic distinction among ganglion cells: there are big and small cells. Large cells have a large body and a large dendritic field, while small cells have small bodies and small dendritic fields.

Ganglion cells close to the centre receive their input from a single photoreceptor or just a few. Peripheral large ganglion cells are connected to many photoreceptors and have very large receptive fields. The existence of such a diverse connectivity pattern becomes clear if we consider the number of cells in each layer. There are more than 125 million photoreceptors, but only about 1 million ganglion cells: since the output of the eyes is carried by the ganglion axons, several photoreceptors must converge onto a single ganglion cell. However, since we have the ability to distinguish fine details, the foveal connectivity pattern must be one-to-one: if several cones were connected to the same ganglion cell, we would lose the details in the image. Peripheral retinal areas are characterized by a large number of rods, and indeed our ability to distinguish details in our peripheral field of view decreases. In order to keep the details, the retinal wiring has created two distinct flows: a cone-based flow and a rod-based one.

These two flows are not only anatomically distinct: they convey information used to accomplish different tasks. Small ganglion cells, wired so as to avoid missing important details from the fovea, process colours and shape. Large ganglion cells, connected mainly to rods, process contrast differences, space, figure-ground segregation, and movement. The organization of the visual pathways presented here is simplified, since it is likely that there are 15 to 20 retinal-to-ganglion pathways.

Small and large ganglion cells are then connected to different layers in the LGN, which are connected, in turn, to different layers of the V1 area. The flow that begins in the cones and passes through the small ganglion cells and the parvocellular layers is part of our What system, specialized for object recognition and colour processing. The flow originating in the rods and passing through the large ganglion cells and the magnocellular layers of the LGN is part of our Where system, which allows us to experience 3-D space. We share the Where system with other animal species: it is faster than the What system and allows our brain to react to external stimuli such as the presence of prey or a predator. The What system characterizes humans: it evolved later but gives us the ability to distinguish details and recognize objects.

As we move along the two streams we find increasingly complex cells. Ganglion and LGN cells present the same complexity: they react to light stimuli falling within their centre-surround receptive fields. They show a selective preference for discontinuities, but no preference for orientation. In the visual cortex, cells have a preferred orientation and are able to detect end-stopped segments or more complex patterns. At this stage shape analysis starts, and basic object features, like edges, are computed.

Once the V1 area has been reached, the two streams are subdivided into finer information flows that proceed toward different brain areas. The various paths proceed toward ventral and dorsal areas, as shown in Fig. 2.4, and can be summarized as follows:

• colour pathway: retinal small ganglion cells → LGN parvocellular layers → V1 (blob) → V2 (thin stripes) → V4 → . . .

• form pathway: retinal small ganglion cells → LGN parvocellular layers → V1 (interblobs) → V2 (pale stripes) → V4 → IT → . . .

• depth pathway: retinal large ganglion cells → LGN magnocellular layers → V1 (subregion) → V2 (thick stripes) → MT → . . .

• motion pathway: retinal large ganglion cells → LGN magnocellular layers → V1 (subregion) → MT → MST → . . .

Figure 2.4: Left image: dorsal and ventral flows. Right: parallel processing flows starting in the retina.

The schematic view of the visual information flow given above is nowadays challenged by new anatomical findings: many interconnections among the areas listed above have been found. The various information flows cannot be considered completely independent, and they influence each other.

However, people with severe brain injuries can lose part of their visual abilities (for example, colour perception) while fully maintaining all the others. This finding indicates that, even if the various flows interact at some points, the remaining flows are able to fully accomplish their tasks in the absence of one of them.

2.4 Colour vision

It is commonly believed that objects look coloured because they are coloured, but neither the objects nor the light is coloured. Colour is a psychological property of our visual experience when we look at an object or at a light source. From an abstract computational point of view, the light may be considered the input stimulus and the experience of colour the output of our internal computations.

According to the first scientific theory, the trichromatic theory¹, the human eye is provided with three types of receptors. These receptors respond to particular wavelengths: the first type has its peak activation at short wavelengths, the second type at medium wavelengths, and the third type at long wavelengths. The three types have peak activations on, respectively, blue, green, and red. How, then, do we experience millions of colours? Each colour, composed of light at different wavelengths, produces an activation pattern across the three photoreceptor types: each pattern is then interpreted as a colour.

¹ This theory was developed in 1777 by G. Palmer, autonomously rediscovered in 1802 by Thomas Young, and refined by Maxwell and Helmholtz in the nineteenth century [11].

This theory is partially correct: we have three cone types, each activated by different wavelengths: L-cones by long wavelengths (red), M-cones by medium wavelengths (green), and S-cones by short wavelengths (blue). They are not evenly distributed on our retina and they are not present in the same proportion (the ratio of L to M to S cones is about 10:5:1 [12]). Furthermore, our fovea contains almost only red and green cones, while blue cones are distributed in the periphery. The trichromatic theory, however, could not explain why certain colours, in cases of colour blindness, are lost in pairs: either red and green or blue and yellow. If each cone type were responsible for the perception of a given colour, we should lose single colours, not pairs.

Based on such evidence, a new colour theory was proposed by Hering: the opponent process theory. According to Hering, there are four chromatic primaries: red, green, blue, and yellow. Furthermore, they are structured into pairs of opposite colours, red versus green and blue versus yellow: there are cells that respond to those pairs of colours with a centre-surround organization of their receptive fields. The reason for this antagonism in colour processing is, again, to make the visual system responsive to discontinuities: discontinuities, such as edges, are what best describes the shape of an object.

The two theories are both correct: the trichromatic theory explains the initial processing of the light taking place in the photoreceptor layer, while the opponent process theory explains the computations in higher layers. The theories were merged into the dual process theory in 1957 by Hurvich and Jameson. Cells respecting the opponent process theory can be found as early as the last retinal layers: there are bipolar, ganglion, and LGN cells that have a preferred wavelength with a centre-surround organization. In particular, there are (R+, G-) cells, excited by a red centre and inhibited by a green surround, as well as (G+, R-), (B+, Y-), and (Y+, B-) cells, where ‘Y’ stands for yellow and ‘B’ for blue. These cells, together with the achromatic channel composed of (Wh+, Bl-) and (Wh-, Bl+) cells (where ‘Wh’ stands for white and ‘Bl’ for black), allow our visual system to represent millions of colours by combining the activation patterns of the photoreceptors.
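As a rough computational analogue, opponent channels can be approximated from an RGB image by differencing the colour components. The sketch below uses a common simplification (yellow as the mean of red and green, half-wave rectification of the differences), which is an illustrative assumption rather than the exact formulation used later in the thesis.

```python
# Minimal sketch of opponent colour coding from RGB, assuming NumPy.
# The channel definitions are a common simplification, not necessarily
# the exact formulation adopted in Chapter 4.
import numpy as np

def opponent_channels(rgb):
    """Map an RGB image (H, W, 3), floats in [0, 1], to opponent channels."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    yellow = (r + g) / 2.0
    return {
        "R+G-": np.clip(r - g, 0.0, None),   # red centre, green surround
        "G+R-": np.clip(g - r, 0.0, None),
        "B+Y-": np.clip(b - yellow, 0.0, None),
        "Y+B-": np.clip(yellow - b, 0.0, None),
        "Wh/Bl": (r + g + b) / 3.0,          # achromatic channel
    }

channels = opponent_channels(np.random.rand(32, 32, 3))
```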

2.5 Conclusions

This chapter has discussed basic evidence about the organization of the human visual system. Light is first processed in the retina and the results are sent to higher areas, where more complex and specialized transformations are performed. The outputs of the retinal cells are separated into two distinct flows, which are both anatomically and functionally different. The Where pathway carries information about the structure of the space surrounding us, while the What flow carries information about the details and the colours of the region we are attending to.

The two flows are kept anatomically separate, but there are cerebral areas where interconnections between the two systems allow the information to be merged, yielding a full visual experience. From a perceptual point of view, however, the process underlying this information fusion and selection is considered to be visual attention, which is discussed in the next chapter.


Chapter 3

Visual Attention

Whenever we are awake, we are flooded by a large number of visual stimuli that our nervous system needs to process in order to react accordingly. However, we are not able to process the whole distal stimulus in a short time. Instead, we need to select and process only the image subregions that are likely to contain relevant information.

The basic mechanism enabling us to selectively attend to different regions of the current scene is visual attention, which covers a broad range of perceptual processes. This chapter discusses the most important points and, in particular, the psychological evidence that is used in computational models of attention in machine vision.

3.1 Introduction

Visual stimuli contain much more information than our brain can process. During our evolution, the brain did not simply increase the number of neural areas or visual neurons in order to fully inspect the distal stimulus. Instead, evolution has endowed humans with a series of filters able to reduce the large amount of incoming information.

The first filter is the sensitivity of our photoreceptors: they respond only to light in a narrow range of the electromagnetic spectrum. A second filter is the spatial organization of the light-sensitive retinal layer. As we saw in the previous chapter, the cones, which represent the starting point for the high-resolution, detailed retinal image, are concentrated in the fovea and almost absent in the peripheral regions. Such a space-variant organization restricts high-resolution processing to the centre of our field of view (accomplished by the What system), while the computed features become coarser as we consider more peripheral regions. To consciously experience these strong limitations, the two visual stimuli shown in Fig. 3.1 can be used. If we attend to the ‘*’ sign at the centre of the image, we are still able to recognize the large characters in the first image, but we cannot recognize the smaller ones in the image on the right.

Figure 3.1: Two visual stimuli to experience the space-variant organization of our photoreceptor layer. If we fixate on the ‘*’ at the centre of the two images, we realize that we are still able to recognize the characters in the left image, but that we lose this ability in the right image, where the font size has been reduced.
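To make the idea concrete, the sketch below imitates space-variant sampling by keeping full resolution at a fixation point and blurring progressively with eccentricity. It is a hypothetical illustration assuming NumPy and SciPy; the ring-based blur schedule is an arbitrary choice, not a model of the actual cone distribution.

```python
# Hypothetical sketch of space-variant (foveated) sampling: detail is kept
# near the fixation point and degrades with eccentricity. Assumes NumPy/SciPy.
import numpy as np
from scipy.ndimage import gaussian_filter

def foveate(image, fix_row, fix_col, n_rings=4):
    """Compose progressively blurred copies of `image` so that the region
    around (fix_row, fix_col) keeps full detail, like the fovea, while
    peripheral rings are represented at increasingly coarse resolution."""
    rows, cols = np.indices(image.shape)
    dist = np.hypot(rows - fix_row, cols - fix_col)
    dist /= dist.max()                                   # eccentricity in [0, 1]
    ring = np.minimum((dist * n_rings).astype(int), n_rings - 1)
    out = np.zeros(image.shape)
    for i in range(n_rings):                             # blur grows with the ring
        blurred = gaussian_filter(image.astype(np.float64), sigma=2.0 * i)
        out[ring == i] = blurred[ring == i]
    return out

foveated = foveate(np.random.rand(128, 128), 64, 64)     # fixation at the centre
```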

It is clear that, due to such a space-variant retinal organization, which causes visual acuity to vary strongly across the field of view, we are forced to move our eyes, performing voluntary eye movements called saccades. Saccades align the foveas with the region we wish to inspect in full detail, at the highest resolution available. Such eye movements obviously cannot be random, since an image sampling performed according to a random strategy would not guarantee that important regions are attended and scrutinized. The choice of the fixation points is performed by our attentional system, which represents the evolutionary answer to our limited visual processing capabilities.

One of the oldest treatises containing studies of attention was published in 1890 [13]. There, attention is defined as the act of taking possession by the mind, withdrawing from some things in order to concentrate and dedicate all our resources to dealing with others. This definition applies to the different attentional systems we have, not only to visual attention.

A more recent definition, referring to visual attention but general in scope, can be found in [11]. Visual attention is defined as those processes that enable an observer to recruit resources for processing selected aspects of the retinal image more fully than non-selected aspects. This definition contains two important points:

• In order to attend to an image subregion, we need resources. Resources are related to our mental and physiological state: our abilities depend on our fatigue, habituation, and motivation. How we gather the available resources has received less attention than the next point and is not relevant for a computational model.

• Once enough resources have been gathered, we can selectively attend to regions while ignoring, or only partially processing, the remaining image parts. Although this will be discussed in more detail in the next section, it should be noted that attention is not deployed to spatial regions only: we can also modulate our attentional mechanisms to process a limited set of visual features rather than the complete one.

Visual attention is the result of very complex interactions among different neural areas, including the retina and high-level cortical areas. Furthermore, we can have spatial-based and object-based attention. The two approaches differ in the visual unit of attentional fixation: either uniform or homogeneous spatial regions, or objects. So far, every experimental work that has tried to characterize attention as exclusively one of these two types has failed [14]. There is, in fact, a profound difference in the perceptual level at which the two forms of attention are supposed to be active. While spatial-based attention can be considered as characterizing the low-level processing steps, object-based attention, where the units of fixation are not mere connected regions but meaningful objects, is supposed to operate in higher-level processes.

Current research considers the two forms of attention as the results of attentional mechanisms operating at different levels in our visual system. Early stages of the visual computation are necessarily characterized by the simpler spatial-based attention: at those stages our visual system has not yet performed any perceptual grouping of homogeneous or similar regions. Once higher-level processes have segregated figures from the background and performed coherent groupings, objects can become the unit of the attentional processes.

3.2 Spatial-based Visual Attention

Before introducing spatial-based attention, its metaphors, and detailed psychological models, we need to explain the distinction between overt and covert attention, as well as to clarify the role of saccadic eye movements in scene exploration.

Overt attention can be considered the conscious act of focusing our attentional mechanisms on specific objects or image regions. It is strictly related to scene exploration through saccadic eye movements. Such movements depend mainly on two factors:

• The current task. Visual exploration of the surrounding environment is guided by our high-level processes, which modulate the stimuli in order to maximize the likelihood that an attended spatial region contains what we are interested in.

• The peculiarity of the attended region. Both in task-dependent visual exploration and in a free-viewing behavioural state, our eyes can be attracted by regions standing out from their surroundings.

In the first case, our brain guides the scene exploration based on the results of the early visual computations. In the second case, eye movements are involuntarily directed toward a region that might be relevant for our very survival.

Covert attention can be considered the act of mentally focusing on specific properties of our field of view; it is only weakly related to saccades. Looking at the ‘*’ in the left picture of Fig. 3.1, for instance, we can process the peripheral stimuli without moving our eyes, simply by concentrating our attention on the periphery.

Spatial-based attention is thus characterized by two aspects: the spatial selection of the regions to inspect and the choice of the image features to process in the selected region.

3.2.1 Metaphors of spatial-based attention

The attentional mechanisms underlying the spatial selection of relevant regions can be described with the help of two metaphors: the spotlight and the zoom-lens.

In the spotlight metaphor [15], attention is considered as a spotlight that moves over the image. As the spotlight moves, it illuminates the corresponding regions, leaving the rest in darkness. The illuminated regions can be processed, while regions not falling within the spotlight are ignored or processed only partially. This metaphor became very popular since it was able to describe and justify many experimental results. However, it has its drawbacks. The spotlight is considered to be fixed in size, but many experiments show that people can vary its size: the spotlight is enlarged when looking at large objects and narrowed when looking at very small regions of the visual field.

In order to account for such variation in the size of the attentional spotlight, the zoom-lens metaphor considers attention as acting with a variable size and scope [16]. According to this second metaphor, we are able to modify both the size and the detail level of the region under attentional inspection. Size and detail, however, are related: wide regions can be processed only at a coarse resolution, while small regions can be processed at finer ones. In other words, wide regions allow us to study low spatial frequencies, narrow regions high spatial frequencies.

It should be noted that the two metaphors are not mutually exclusive. We can consider an attentional spotlight varying in size, but with a fixed power: if we use it to illuminate large regions, the illumination is dim, but if we use it on smaller regions, the illumination is strong enough to detect details.

The main limitation of both metaphors consists in their strict restriction to a single focus of attention. Recent experiments have made clear that, under specific conditions, we are able to split our attentional focus over multiple subregions [17].

The two metaphors, or their combination, do not explain how the next region of fixation is selected. Neither do they explain how we are able to selectively attend to one single feature while ignoring all the others. In fact, the covert processing within the attentional beam is independent of the particular method used to guide the beam.

3.2.2 Feature processing in covert attention

Our attentional system is able to focus on restricted regions of the visual field and selectively process single features. The experiments described in what follows prove this, but it is also part of everyone's experience: we are aware of our ability to select particular visual features, like the colour or the texture of an object, without being influenced by its shape, its location, or its movements. Once the attentional beam has illuminated and isolated a region, we can choose which features to attend and which features to ignore.

Initial studies

One of the first experimental studies of human abilities in feature processing dates back to 1935 [18]. The study proved that our speed in processing a pair of features varies with the particular pairing considered. Fig. 3.2 shows some words displayed in different colours. If we try to read out loud the name of the colour of each word, we realize that our internal computations are slow: the colour of the word and the letters composing it interfere strongly, forcing our visual system to slow down in order to avoid mistakes.

Figure 3.2: The Stroop effect. Our internal processing of the words and their colours suffers from major interferences.

The Stroop effect is not limited to colours and colour names: it affects several other pairs of features. In the first studies, however, results were reported only for reading tasks, which involve high-level processes acquired through long training. The Stroop effect was later investigated for pairings of simpler features, like colour, orientation, and basic shapes. The new studies reported that image regions can be characterized by feature dimensions of two types: separable and integral [19].

When the retina is struck by light, it transforms the stimuli into features belonging to either integral or separable dimensions. Examples of feature dimensions are “Colour”, with features red, green, blue, and so on, or “Orientation”, with features horizontal, vertical, and so on. However, our ability to selectively attend to a single feature in the presence of another one varies: with particular pairs we can, with other pairs we simply cannot.

Pairs of dimensions are said to be separable if we can selectively attend to one of them without the other causing any interference. For example, we can attend to the colour of a region without being influenced by its orientation or its shape. Such features appear to be processed by independent visual flows, leading to uncorrelated internal representations that we can consult in order to study the features.

On the contrary, integral feature dimensions are those that influence each other: we cannot selectively attend to one single feature, since one feature cannot be perceived without perceiving the other. Examples of integral feature dimensions are the height and width of a figure, or the saturation and lightness of a colour.

Separable and integral pairs of feature dimensions suggest that our attentional system processes spatial features in independent flows and represents each dimension in different maps. Furthermore, the different reaction times measured for different pairs of feature dimensions suggest that some flows are parallel in nature, while others operate sequentially.

3.2.3 Parallel versus Serial Search

The previous studies concentrated on focused attention and showed that there are fast and slow visual computations, corresponding to parallel and sequential internal processes.

This evidence, combined with the obvious existence of a massively parallel attentional system enabling living creatures to react to dangers and predators in a very short time, led several researchers to investigate which features are processed in parallel, without focused attention, and which features require a sequential scan of the input image. The effect studied in these experiments is called visual pop-out.

The goal of the experiments was to measure the time needed by a subject to detect a target object surrounded by different, but quite similar, objects, called distractors. The target could differ from the distractors in one, two, or more features. The hypothesis was that a target differing in a single feature was easier to locate than a target described by a conjunction of two or more features.

In Fig. 3.3 three typical stimuli used in visual pop-out studies are presented. In the left image, the green circle (the target) is immediately detected. The same effect is caused by the stimulus in the middle, where the tilted bar (the target) stands out from the horizontal ones. In both cases the targets stand out immediately: they simply pop out of the image. Furthermore, the time needed to detect them (or to establish that they are not present) does not depend on the number of distractors in the stimulus. In the third stimulus of Fig. 3.3, the target is the yellow and vertical bar, i.e., a bar described by the conjunction of two features. In this latter case there is no visual pop-out, and a sequential scanning strategy is needed to locate the target. When a sequential search is necessary, the time needed to detect the target is proportional to the number of distractors.

Figure 3.3: Three visual stimuli used in visual pop-out studies. Left image: pop-out caused by a difference in the colour feature. Middle: pop-out caused by a difference in the orientation feature. Right: the target, a vertical yellow bar, is defined by two properties, colour and orientation. There is no pop-out and a serial search is required to locate the target.

When the target stimulus differs from the distractors in a single distinguishing property (colour or orientation in Fig. 3.3), it is detected in a very short time. However, if the target is described by a combination of properties (right image in Fig. 3.3), we need to perform a sequential scanning and, for every fixational point, use our covert attention to study a single feature [20]. In this case, fixational points are selected using one of the two features, making the reaction time proportional to the number of distractors.
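
A common way to summarize these findings (not spelled out in the text, but standard in the visual-search literature) is through the slope of the reaction time RT as a function of the number of items N:

    RT_pop-out(N) ≈ a,        RT_serial(N) ≈ a + bN,

where a is a base time and b is the per-item inspection cost. In a self-terminating serial search the target is found, on average, after scanning half of the items, so target-present slopes are roughly half the target-absent ones.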

Pre-attentive and attentive vision

The evidence from visual pop-out experiments is considered strong enough to state that our attentional system can be subdivided into two main components that operate very differently and at different stages.

The first system, called preattentive, starts operating as soon as light strikes the retinal photoreceptors. It processes basic visual features, like colour, orientation, size, or movement, in parallel and over the entire field of view. Indeed, when the target object of a visual pop-out experiment differs in a single basic feature, it is detected without saccades and independently of the number of distractors surrounding it.

The second system, called attentive, corresponds to focused attention. When the target is not recognized by the preattentive system, attentive processing starts and uses information computed by the preattentive system in order to select spatial regions that could contain the target. It necessarily operates sequentially, since it needs to focus on several spatial regions and activate the mechanisms of covert spatial attention for each one of them.

All these experimental results and the new hypotheses were collected in a psychological model of visual attention that represents a milestone for more complex models and many computational models. This model, named the Feature Integration Theory, is described in the following section.

3.2.4 Feature Integration Theory

Parallel, preattentive processes build an image representation with respect to a single feature. Their results are encoded in feature maps, one for each processed feature. Such maps, spatially organized so as to respect the retinal space-variant sampling, are encoded by cells in area V1. As discussed in the previous chapter, area V1 contains retinotopic maps encoding different features like colour channels, orientation, and so forth. The preattentive system, using such maps, is able to detect the presence of a single salient region or, otherwise, must activate the attentional processes that will cause the eyes to move onto positions coded in the same retinotopic map.

However, there is an important point that needs to be explained. Our visual perception is unitary: we do not perceive a green circle as a ‘green’ region plus a ‘circular’ region, but as a single object. The retinotopic maps, each one encoding a single feature, would result in a fragmented spatial perceptual experience: there must be a process able to unify the various feature maps.

The model named Feature Integration Theory (FIT) [20] tries, as the name implies, to explain how our brain merges the different retinotopic maps into a single perception. According to FIT, the feature binding is performed by our focal attention mechanisms.

In FIT, the visual scene is initially encoded along a number of separable dimensions, like colour, orientation, spatial frequency, brightness, and others. In order to merge the different representations into a single visual experience, focal attention must serially inspect each single location. Only after conscious and focused inspection of a spatial region can we have a unitary, synthetic view of the input stimulus.

However, the previous description does not explain how we can perceive shapes and objects even in regions we are not fixating on. If a red box falls in our peripheral field of view, we still perceive it and can easily locate it in the image. According to FIT, features extracted from regions in the peripheral field of view are still merged, but are not consciously perceived. When our attentional system focuses on those regions, our brain is able to use past experience and, in particular, the joint perception of their features.

The architecture of the FIT model is shown in Fig. 3.4. The retinal image, containing red and blue bars with different colours and orientations, is explored by an attentional beam reading a master location map. The separable feature dimensions shown in the image are colour and orientation: for each dimension the object is coded into different retinotopic feature maps. For example, with respect to the colour feature dimension, three feature maps are shown: red, green, and blue.

When the attentional beam selects a location in the master map and illuminates the corresponding area in the retinal image, the region is processed in the different feature maps and then coded into an object model. The diagram makes clear that unattended stimuli, even if not used in the current detailed analysis, are still present in each feature map.

Figure 3.4: Schematic representation of the Feature Integration model. The retinal image is coded into separable features (here, part of the colour and orientation flows is shown) that are bound into the object description by the attentional system. The picture does not include any arrow concerning the top-down processes deploying the focus of attention.

The model implies that there is an important perceptual difference between locating a feature and identifying it. In fact, we can locate unattended regions characterized by a single separable feature, even if we are not able to identify that feature before our focal attention processes it.

In experiments where the target regions are described by a conjunction of separable features, the initial parallel processing is not enough to detect, even unconsciously, the region. A serial search is needed: the attentive system consults the various feature maps in order to restrict the number of candidate areas, but cannot explore different feature maps in parallel.

This model contains two very important results:

• we can narrow our attentional search not only to spatial regions, but also to particular features;

• attention cannot be directed onto multiple locations, even if they are characterized by a single separable feature (e.g., all red areas), but needs to take into account spatial information in a serial way.

The previous points indicate that this model formalizes two types of attentional mechanisms:

• a parallel and fast one, known as bottom-up attention, able to discriminate visual areas characterized by a single separable feature;

• a serial and slow one, known as top-down attention, able to modulate and control the attentional beam in order to take into account the simultaneous presence of more than one feature and the spatial locations of potential target regions.

In conclusion, FIT assumes that primitive visual features are processed by parallel processes producing retinotopic feature maps. Each feature map encodes a single separable feature dimension, like colour, orientation, size, movement, and so forth. When one of those maps has a peak of activity at a single location, attention can be directed onto the target point without being distracted by the results in other features. If several feature maps present peaks of activity, then our attentive system must take into account the spatial information and direct its beam onto the available candidate locations until a match is found. Spatial information is computed from the master location map, which remembers previously attended locations. When fixating on a subregion, our covert attention processes all the features present and builds a coherent perceptual experience.

However, the Feature Integration Theory is not able to explain data from different experimental settings:

• parallel preattentive computation can detect targets described by a conjunction of features [21];

• some experiments have shown the presence of parallel computations over higher-level features; the assumption that parallel search concerns only basic visual features cannot be considered a general property, as the parallel-serial search dichotomy is not valid in general [3];

• it is not clear how a master location map is coded by our visual system. If the deployment of attention is considered to be spatial-based only, without any initial and approximate interpretation of the external world, it is not clear how a master location map could encode previously attended regions. In fact, the retinal stimulus changes with every saccadic movement, and a master location map would not have a frame of reference. In order to partially solve this controversial problem, object-based attention should be investigated.

The FIT model had a strong influence on subsequent research. Some groups tried to better explain the distinction between preattentive and attentive vision; in particular, they tried to modify the model in order to better predict experimental data on visual search tasks and to correct the model's limitations in conjunctive search. Other works proposed more detailed FIT-inspired models with the aim of implementing computer simulations of attentional processes.

One of the most influential detailed models was proposed by Koch and Ullman in 1985 [2]. This model is similar to FIT in its description of the preattentive and attentive stages, but it proposes some intermediate structures able to give a plausible account of attentional shifts, both in visual pop-out and in conjunctive search.

In the model proposed by Koch and Ullman, attention is assumed to operate at the early processing stages, that is, on an early visual representation of the surrounding world built by our retina and LGN. Each basic visual feature, like colour or orientation, is coded into a feature map.

A salient, conspicuous region in the input image should determine the level of activity in the various feature maps. For example, if we see a red object on a dark background, we expect to find a strong peak of activity in the red feature map and no significant activity in, for example, the green map. Furthermore, if there are several objects sharing a similar conspicuity in a feature, for example redness, it is plausible to think that the feature map for red would encode them with values proportional to their degree of redness.

More precisely, the authors suggested that the various feature maps are characterized by a sort of competition within them. These maps are then combined into a single map, called the saliency map, which merges the information coming from the feature maps into a global map encoding the conspicuity of each spatial region. The saliency map gives a new view of the external world, favouring locations that differ, globally or locally, from the objects in their neighbourhood. The authors suggested the use of a competitive Winner-Take-All network [22] for the computation of the most salient region.

The saliency map is able to explain both visual pop-out and the sequential scanning required by an object described by a conjunction of features. In fact:

• if a region has a salient property P able to differentiate it from the rest of the image, then this region appears conspicuous both in the feature map for the property P and in the global saliency map;

• if a region is characterized by two or more properties, then it is not conspicuous in the global saliency map. In fact, if the region has two properties P1 and P2, the feature maps for P1 and P2 might contain several peaks of activity that would also characterize the global saliency map.

The previous explanation is also able to justify some visual pop-out effects in conjunctive visual search: if one of the properties is salient in its feature map, then it will appear more conspicuous in the global saliency map. If the activity is strong enough with respect to the rest of the map, the region can cause a visual pop-out effect.

Figure 3.5: A schematic representation of the model by Koch and Ullman [2]. The results of the preattentive computations are stored in feature maps, later globally combined in a saliency map where a Winner-Take-All strategy detects the most salient image region. Once a salient region has been detected, it is analyzed by higher-level processes. The image appears in the original article [2].

The main steps of the model, whose components and information flows are shown in Fig. 3.5, are the following (a minimal sketch in code is given after the list):

1. preattentive computation of early visual features and their encoding in feature maps;

2. construction of a global saliency map according to a Winner-Take-All strategy;

3. selection of the most conspicuous region in the global saliency map for high-level attentive analysis. Once the most salient region has been inspected, the global saliency map is consulted again in search of a new salient region to process.
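
The loop formed by steps 2 and 3 can be made concrete with a minimal sketch (Python with NumPy; the map layout, the number of fixations, and the inhibition radius are illustrative assumptions, and the argmax below only stands in for the Winner-Take-All network of [22]):

    import numpy as np

    def explore_scene(feature_maps, n_fixations=5, ior_radius=8):
        # Step 2: combine the preattentive feature maps into a global saliency map.
        saliency = np.sum(feature_maps, axis=0).astype(float)
        h, w = saliency.shape
        yy, xx = np.ogrid[:h, :w]
        fixations = []
        for _ in range(n_fixations):
            # Step 3 (Winner-Take-All): the most active location wins.
            y, x = np.unravel_index(np.argmax(saliency), saliency.shape)
            fixations.append((y, x))
            # Inhibition of return: suppress the attended neighbourhood so that
            # the next consultation of the map selects a new salient region.
            saliency[(yy - y) ** 2 + (xx - x) ** 2 <= ior_radius ** 2] = 0.0
        return fixations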

The model proposed by Koch and Ullman, which shares many ideas with the earlier FIT, led to several computational models of spatial-based attention. However, before presenting the main computational and algorithmic results, a third model needs to be introduced.

3.2.5 The Guided Search model

The Guided Search (GS) model, first proposed in [23] and later refined as GS versions 2.0, 3.0, and 4.0 in [3, 24, 25], represents an alternative to the Feature Integration Theory. Even if it shares many conceptual properties with the FIT model, it is considered an independent and more detailed model of visual attention. Before describing the model, it is worth specifying the different goals of FIT and GS. FIT tried to answer the important question arising from visual pop-out experiments: how are basic features merged into a unitary visual experience if they are processed by independent information flows? FIT stressed the importance of focal attention for reaching a final conscious visual experience. The Guided Search model proposes a more detailed, quasi-computational model to explain how high-level processes can influence the visual exploration of a stimulus. The interactions among stimulus-dependent and task-dependent processes are central both in the initial proposal and in the refinements of the model.

Figure 3.6: The architecture of the Guided Search 2.0 model (from [3]). Input stimuli are first processed by parallel processes extracting basic visual features. The features are encoded in feature maps, each one encoding the saliency of an image part with respect to a particular feature dimension and modulated by top-down, task-dependent commands. The feature maps are then combined in a global activation map, specifying where the most informative regions are in the current visual stimulus.

The overall architecture of the GS model (version 2.0) is shown in Fig. 3.6.

Page 40: Attention-based Object Detection - Semantic Scholar · 2017-05-14 · 2 The Human Visual System 5 ... strategy detects the most salient image region. Once a salient re-gion has been

28 CHAPTER 3. VISUAL ATTENTION

The initial processing of the visual stimuli is carried out by basic visual processes able to extract, in parallel, basic properties of the image. Simple features like colour, orientation, brightness, and so forth, are computed over the entire image and encoded in feature maps, one for each feature dimension. The bottom-up component indicates, as in the model by Koch and Ullman, how unique a spatial region is in the current image. However, the feature maps are modulated by commands coming from higher levels, specifying which features should be considered in the current task. Finally, the feature maps are combined in a global activation map that specifies the coordinates of the next fixational point. All the major computational steps of the GS model are explained in the following.

Bottom-up processes correspond to the preattentive computations of the visual stimulus. They elaborate basic image features in parallel and over the entire visual field, and encode them in the feature maps. The role of these processes is to identify locations that are worthy of further attention. For example, the feature map corresponding to the redness property indicates whether or not the input stimulus contains parts with a high saliency in that property. The greater the activation in a feature map, the more “unusual” (in the input stimulus) the particular region is. The uniqueness of a given region is computed by comparing the activation of that region with the surrounding units, a mechanism quite similar to the competitive approach in [2]. However, in the GS model, feature maps do not encode subsymbolic results from earlier processing, but categorical representations computed by broadly-tuned spatial filters. For example, feature maps for orientation do not contain a measure of the angle of a tilted bar, but how much it is “horizontal”, “tilted”, and so on. Furthermore, the model includes a threshold called the preattentive Just Noticeable Difference: activations in feature maps below that threshold are considered irrelevant.
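
As a rough illustration of these two ingredients, local uniqueness and the preattentive Just Noticeable Difference, consider the following sketch (Python with SciPy; the categorical input map, the surround size, and the threshold value are hypothetical choices, not taken from the GS papers):

    import numpy as np
    from scipy.ndimage import uniform_filter

    def bottom_up_activation(categorical_map, surround_size=9, jnd=0.1):
        # Uniqueness: compare each unit with the mean activation of its surround.
        surround = uniform_filter(categorical_map, size=surround_size)
        activation = np.abs(categorical_map - surround)
        # Preattentive Just Noticeable Difference: differences below the
        # threshold are considered irrelevant and cannot attract attention.
        activation[activation < jnd] = 0.0
        return activation

Here categorical_map would be, for instance, a map of how “steep” the local orientation is, with values in [0, 1].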

Top-down activation commands, which modify the preattentive feature maps, are able to guide attention towards specific, desired stimuli in the visual field. Bottom-up feature maps alone can only guide our attention onto salient regions: they fail to guide us onto our target, in a task-dependent search, if the feature maps do not contain anything unique or relevant. For example, consider the left picture in Fig. 3.3: without a specific target, the green circle results in a visual pop-out; however, if we are looking for a red circle, the green circle, very salient but useless for the current task, represents only noise. Top-down commands act by selecting one feature map for each feature dimension according to the target description. If the target were a red horizontal bar, only one feature map for colour (red) and one feature map for orientation (horizontal) would be selected.

Bottom-up and top-down activations are then merged in the global activation map. This last map contains information about the regions that could be relevant for the current task. In a free-viewing setting, without any specific target, the activation map contains only the activation coming from the preattentive processes. In a visual search, the activation map contains contributions from both bottom-up and top-down processes. However, the peaks of activity in the global map do not carry any information about the features that contributed most to the activation level. Attentional processes, our covert mechanisms of attention, need to focus on the region and inspect it in order to catch the details [26].
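
A hedged sketch of this combination step (Python; the dictionary layout and the gain values are assumptions made purely for illustration):

    def activation_map(bottom_up_maps, top_down_gains=None):
        # bottom_up_maps: dict mapping (dimension, category) -> 2-D array,
        # e.g. ('colour', 'red') or ('orientation', 'horizontal').
        # top_down_gains: extra weight for the one selected category per
        # dimension; an empty dict (free viewing) leaves only bottom-up terms.
        top_down_gains = top_down_gains or {}
        total = None
        for key, fmap in bottom_up_maps.items():
            weighted = (1.0 + top_down_gains.get(key, 0.0)) * fmap
            total = weighted if total is None else total + weighted
        # Peaks mark candidate fixation points; which features produced
        # them is no longer recoverable from the map itself.
        return total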

The distinction between bottom-up and top-down processes explains many experimental results. Visual pop-out is possible since a single peak of activity in a single feature map corresponds to a single peak of activity in the activation map. Visual search tasks, characterized by a sequential scanning of many regions corresponding to distractors, are explained as many peaks of activity in the sum of the bottom-up and top-down contributions. Furthermore, the GS model is able to explain how attention can be guided toward regions characterized by more than one feature: in this case the feature maps corresponding to the target features are selected by the top-down commands and result in a few peaks of activity in the global activation map. Obviously, the time needed to scan all the potential regions sequentially depends not only on the number of distractor items, but also on the number of features shared by the target and the distractors themselves.

The discussion about how attention is deployed does not consider the change in the retinal stimulus when the eyes move. The proposed method, used in almost every computational model, requires only that the attended regions be inhibited. How the return to previously attended locations can be inhibited is a central question for many psychological theories of attention; it is not yet clear how our visual system keeps track of attended locations. It is worth saying that, since our visual system seems to keep track of attended locations even after several saccades, the inhibition of return is likely to be computed by the high-level processes responsible for our very consciousness and our interpretation of the external world.

The previous models are spatial-based models: they consider spatial regions as the elementary units of attention. However, there is experimental evidence that attention is influenced not only by salient regions in the visual field, but also by objects [14] or, more precisely, by perceptual groupings of spatial regions. A very influential theory, validated by several experimental results, has been proposed in [27], further validated in [28], and extended in [29]. According to this theory, named the Biased Competition Model and, later, the Integrated Competition Hypothesis, our visual attention is able to focus not only on salient spatial regions, but also on objects (resulting from an early perceptual grouping) in our current field of view. Furthermore, even if spatial attention is restricted to a single focus, our attentional system can be influenced by stimuli possessing behaviourally relevant features. The results of bottom-up stimulus-driven activation and top-down attentional modulation are then combined in an exploration strategy that selects single spatial locations, but proceeds in parallel along the feature enhancement/inhibition streams. Attention is thus considered the result of the competition among the spatial regions and objects present in the current scene.

Since all these models are mainly focused on the early stages of visual processing, before any computation by higher-level processes, attention is, more precisely, deployed to proto-objects, i.e., elementary perceptual groupings computed in the early stages of our processing.

Even though there is increasing evidence that attention is deployed not only to spatial regions, but also to objects or proto-objects, many models, reviewed in the following section, are still spatial-based.

3.3 Computational Models of Visual Attention

With the introduction of the Feature Integration Theory, research on visual attention started proposing detailed models able to simulate results obtained in particular experimental settings, like, for example, visual search. As computers became more powerful, new computational models of visual attention were proposed. FIT and the models that followed induced many computer scientists to investigate the potential of biologically-inspired attentional mechanisms in computer vision tasks. Research on visual attention thus branched into two main paths: a first one devoted to the implementation of biologically-inspired computer algorithms including attentional mechanisms, and a second one devoted to further investigating the biological and physiological mechanisms underlying visual attention: Where exactly is the attentional subsystem located? How are feature maps encoded in our brain? Is there a single saliency map, or are there multiple ones?

The first computational model of visual attention was proposed by Koch and Ullman [2]. This model has been described in the previous section because it falls at the intersection between psychological models and computational ones: it was not implemented, but it contained so many algorithmic specifications and details that it could easily be translated into a computer program. Indeed, this model served as a foundation for many implementations proposed later and inspires the model proposed in this thesis, described in the next chapter. It should be said, however, that this model loses part of its biological plausibility by using a centralized saliency map that must encode the saliency of the various objects in different feature dimensions and at different scales. However, the complexity of the neural processing needs to be reduced if a computational model has to be implemented and used in practice.

The model by Koch and Ullman was implemented years later by Milanese et al. [4]. The model uses several early features, like colours organized in colour-opponent channels and orientations, and includes an alerting channel sensitive to movement. The conspicuity of each location is computed by center-surround differences and stored in a single saliency map. The architecture of the model is shown in Fig. 3.7. The model includes an object recognition system implemented as an associative memory mapping salient regions to objects. The top-down modulation competes with the bottom-up activation, allowing the system to attend previously learnt objects.

Figure 3.7: Architecture of the computational model proposed by Milanese et al. in [4].
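
The center-surround differences used by Milanese et al. can be approximated with a difference-of-Gaussians filter; the following sketch (Python with SciPy, with illustrative standard deviations) shows the idea for a single channel:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def conspicuity(channel, sigma_center=2.0, sigma_surround=8.0):
        # Difference-of-Gaussians: a narrow "center" minus a wide "surround".
        center = gaussian_filter(channel, sigma_center)
        surround = gaussian_filter(channel, sigma_surround)
        # Half-wave rectification keeps locations standing out from their
        # neighbourhood; opposite-sign responses belong to the opponent map.
        return np.maximum(center - surround, 0.0)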

In 2001, Tsotsos proposed a theoretical analysis of the complexity of visual search tasks and computational models of attention [30]. He proved that visual search, in particular settings, can be a very hard problem, as hard as NP-complete problems. He was inspired by the discussion about how ‘parallel’ a computer would have to be in order to emulate human abilities, in particular visual abilities. Even if some of his assumptions and conclusions were not accepted by many influential researchers (see the commentaries associated with the cited article), everyone agreed upon the need for visual attention, and other forms of biological inspiration, in computer vision tasks.

Building on the same considerations proposed in [30], Tsotsos proposed a computational model of visual attention, named the Selective Tuning Model, in [31]. The model has a pyramidal, hierarchically layered organization of processing units; basically, it can be considered a neural-network-based implementation of visual attention. The architecture of the model is shown in Fig. 3.8. When a stimulus is presented to the network, it activates all of the units in the pyramid to which it is connected in a feed-forward way. Once the output units have been activated (left image of Fig. 3.8), they perform both a spatial and a feature selection:

• spatial inhibition is performed by inhibiting all of the connections which are not relevant for the current task;

• feature inhibition is performed by inhibiting all of the cells that are computing features not relevant for the current task.

Figure 3.8: Overall architecture of the Selective Tuning Model. Left image: active connections when a stimulus composed of a red object and a blue object is presented to the model. Right image: inhibited connections (in black) when top-down processes deploy attention on the red stimulus.

At each layer in the pyramid, units become active according to a Winner-Take-All strategy. In the presence of a top-down bias, units can be inhibited by information related to the current task. The right image of Fig. 3.8 shows the latest stage of the model, where many connections, coloured in black, are inhibited and do not contribute to the network output. The model has been extended in [31], but it is still limited to a single feature dimension: it is not able to perform the feature binding that is considered, as seen in the previous section, one of the distinctive features of attentional processes.
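
The hierarchical WTA idea can be caricatured with a two-pass sketch (Python; a toy max-pooling pyramid stands in for the model's processing units, so this is only the skeleton of the mechanism, not the Selective Tuning Model itself):

    import numpy as np

    def attend(input_layer, n_levels=3):
        # Feed-forward pass: each unit pools (here with a max) a 2x2 region below.
        pyramid = [np.asarray(input_layer, dtype=float)]
        for _ in range(n_levels - 1):
            p = pyramid[-1]
            h, w = p.shape[0] // 2 * 2, p.shape[1] // 2 * 2
            pyramid.append(p[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3)))
        # Top-down pass: starting from the global winner, keep only the winning
        # sub-branch at each layer; all other connections would be inhibited.
        y, x = np.unravel_index(np.argmax(pyramid[-1]), pyramid[-1].shape)
        for level in range(n_levels - 1, 0, -1):
            block = pyramid[level - 1][2 * y:2 * y + 2, 2 * x:2 * x + 2]
            dy, dx = np.unravel_index(np.argmax(block), block.shape)
            y, x = 2 * y + dy, 2 * x + dx
        return y, x  # attended location traced back to the input layer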

One of the most, if not the most, influential models of visual attention has been proposed by Itti et al. in [5]. It is a complete implementation of the model proposed by Koch and Ullman and has been evaluated on several real-world datasets. Its architecture is shown in Fig. 3.9. In the initial version (left image in Fig. 3.9) the model was used to compute salient locations in the visual stimulus: it was a purely bottom-up architecture guiding saccades onto relevant regions of the input image. Since the model proposed in this thesis is inspired by the initial model of Koch and Ullman and has adopted many solutions proposed in [5], many details of the model will be discussed in the next chapter. The initial model has been extended in several works. For example, it has been extended to compute the gist of a scene [6]; the resulting architecture is shown in the right picture of Fig. 3.9. Furthermore, the model has been extended in order to account for top-down guidance [32].

Figure 3.9: The architecture of the computational model proposed by Itti et al. Left picture: original architecture, as proposed in [5]. Right picture: expanded architecture, as proposed in [6]. The latest model includes components to compute the gist of the scene and use it as a hint to enhance the spatial locations where the target has previously been seen.

Another important model has been proposed by Hamker in [7, 33]. Its main goal is to give a computational description of the human visual processes; it is not meant to be, at least in its current implementation, a computational model that could be used in real-world applications. Its architecture, shown in Fig. 3.10, is inspired by Koch and Ullman's model, and some implementation details are similar to Itti's [5]. The feature maps and the feature conspicuity maps, computed using center-surround differences, are combined both in the saliency map and in two layers performing a population coding [34] (Levels I and II in the figure). The two populations receive input connections from the target template: the activation of cells encoding features expected from the template is enhanced, whereas cells encoding task-irrelevant features are inhibited. The saliency map and the population-based codings are combined in a decision map where candidate locations for the target template object (or its parts) are coded.

The main features of the model, which have inspired parts of the work in this thesis, are the following:

• the target object is described using the same set of features processed by the attentional system. Basically, the feature-based attentional computations act on the object representation.

• the template, even if not built by the model itself, guides the deployment of attention by influencing the construction of the feature conspicuity maps. The response of units encoding expected features is strengthened and enhanced. Match units are then responsible for filtering out cells that are not spatially relevant for the detection of the target.

Figure 3.10: Architecture of the computational model proposed by Hamker in [7].

A major limitation of the model is represented by the object template. The template is not built by the model itself: it is simply presented to it. Furthermore, a template can be detected only if it appears in the same orientation as when it was presented to the model.

The models discussed so far are all spatial-based. Even if some models are able to use an object template to bias the attentional search, such templates are used only to modulate the response of units computing features belonging or not belonging to the object template. No model performs an explicit manipulation of objects or perceptual groupings. A computational model operating on objects, rather than spatial regions, has been proposed in [8]. It is based on the Integrated Competition Hypothesis described in the previous section. The main innovation of this model is the implementation of a competition for saliency, and thus for the attraction of attention, not only within an object (feature- and spatial-based), but also among the objects in the visual field. Its architecture is shown in Fig. 3.11.

Figure 3.11: The schematic description of the object-based attentional model proposed in [8].

The grouping saliency mapping is characterized by a competition among feature saliency, spatial saliency, and object saliency. The resulting saliency is used to determine the next location of an eye fixation. Furthermore, the model is implemented so as to inspect the image at different resolution levels. However, in the experimental settings described by the authors, the grouping of the input object was supposed to be already computed.

We have so far presented, with the exception of the Selective Tuning Model, computational models in the filter-based class. These models are characterized by an initial stage where primitive features are computed and organized with classical filtering operations. With the exception of the model by Hamker, all the other models were functionally organized so as to resemble our visual system, but none used artificial neural cells, connections, and weights.

There is another group of computational models of visual attention that are more biologically inspired. They adopt a connectionist framework and model each processing step in the attentional system with neural networks. Examples of connectionist approaches are the Selective Tuning Model, described above, the model by Hamker (even if only partially, since its feature extraction follows a filtering approach), the FeatureGate model [35], and the Neurodynamical model [36, 37].

The FeatureGate model is implemented as a multilayer neural network organized like the first cortical areas. The first layer computes basic features and encodes them using a center-surround organization: the activation of each unit depends on its difference from the surrounding units. This encoding is passed to the following layers, where it is compared with the target stimulus. The result of this comparison is used to inhibit the cells in the lower layers computing task-irrelevant features. A major feature of the model is the local nature of its computations, which may result in a strong response on isolated distractors, which must then be examined through a serial search. The main weakness of the model is the pixel-wise comparison between input and target, which makes it very sensitive to noise.

A much more complex model is the Neurodynamical one. It consists in a connectionist approach to visual attention performing a multiresolution analysis of the input image. It is composed of two main subsystems, modeling the reciprocal connections between the What system, performing object recognition, and the Where system, dedicated to locating candidate locations for the target object. A schematic view of its architecture is shown in Fig. 3.12.

Figure 3.12: Schematic architecture of the Neurodynamical model of visual attention.

When the model is in visual search mode, the image is analyzed at its coarsest resolution by the Where subsystem, across the entire visual field. Since the image has been processed at the lowest resolution, there is no information other than that relative to candidate locations. Based on such locations, the object recognition modules in the What subsystem inspect the image at increasingly higher resolutions, until they can accept or reject each candidate location as part of the object. Objects are learnt in a previous stage and encoded using information about local orientations. In this model there is no explicit saliency map: the saliency of each location in the visual field corresponds to the modulations across the feature maps.

The models described above represent the most influential research works proposed in recent years. There are, however, several models not covered in this section: for example, the Cue-Guided Search [38], an attentional model specialized in detecting faces in images, and the work in [39], where scene exploration is learnt using Q-Learning. Another interesting approach is presented in [40], characterized by What and Where subsystems and an initial fovea-like sampling of the input image.

3.3.1 Conclusions

In this chapter the main theories of visual attention have been reviewed. Each theory tries to explain the reaction time of subjects to a visual stimulus. In particular, the theories explain the presence of a parallel, preattentive system characterizing the early visual stages and of a serial, more accurate search characterizing focal attention. Moreover, the models argue that attention acts as a feature-binding process, able to merge the information coming from different visual channels into a unitary visual perception.

The importance of biologically-inspired computer algorithms has also led many researchers to propose computational models of visual attention. The review discussed the most influential works published in recent years. Most of the described models are spatial-based, i.e., they process spatial regions that are considered salient. There is, however, some evidence that our attention is not only spatial-based, but also object-based. According to this latter view, our visual system can deploy attention on salient, meaningful areas.

Chapter 4

The proposed model

4.1 Introduction

The visual attention model proposed in this thesis is composed of two different modules:

• a bottom-up module, modeling the stimulus-driven (image-based) deployment of visual attention;

• a top-down module, responsible for the modulation of visual attention in task-based scene exploration.

The bottom-up component constructs a pyramidal representation of the input image in order to perform a multiresolution image analysis. For each layer in the pyramid it builds feature maps and saliency maps indicating the presence of relevant image structures. The local saliency maps are combined into a unique, global saliency map by performing a max-pooling operation over the layers. The scene is explored starting from the point of maximum saliency. We introduce a form of object-based bottom-up attention: attention is deployed to meaningful structures and not to simple spatial regions.

When in learning mode, the bottom-up system forces focused attention on the relevant object. An object template is built and is later used when the top-down system looks for the object in visual search. The top-down system classifies the structures using features extracted by the bottom-up system.

The rest of the chapter is organized as follows. In section 4.2 the global architecture of the system and the interconnections among the various modules are discussed. In section 4.3 the construction of the saliency map and the dynamics of the stimulus-driven scene exploration are explained in detail. Section 4.4 describes how attention is deployed over the different image regions in a top-down fashion, using an object template as the target object to be found. In section 4.5 the model is discussed, presenting its innovative proposals and its strengths.

Figure 4.1: Schematic representation of the structure of the bottom-up component of our system. The top-down connections from the high-level processes are not shown.

4.2 Overall architecture

In Fig. 4.1, the overall structure of the introduced model is presented. The input image is separated into five distinct channels, which are coded in an image pyramid. There are four colour channels, corresponding to red, green, yellow, and blue, and a single channel for intensity. These channels are then encoded into feature maps with a center-surround organization. They correspond to opponent channels and encode the feature saliency at the various scales in the pyramid:

• two maps encode the intensity feature: one with an on-center/off-surround organization, responding maximally to bright spots on dark backgrounds, and one with an off-center/on-surround organization, responding maximally to dark spots on bright backgrounds;

• four maps encode, respectively, the red-green, green-red, blue-yellow, and yellow-blue opponent channels (a small illustrative sketch of these channels follows the list).
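
The sketch below uses the broadly-tuned definitions of Itti et al. [5], since the thesis adopts many of that model's solutions; the exact formulas employed by the model are given later in this chapter, so the ones here should be read as assumptions:

    import numpy as np

    def channels(rgb):
        # rgb: float array of shape (h, w, 3) with values in [0, 1].
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        intensity = (r + g + b) / 3.0
        # Broadly-tuned colour channels in the style of Itti et al. [5].
        R = r - (g + b) / 2.0
        G = g - (r + b) / 2.0
        B = b - (r + g) / 2.0
        Y = (r + g) / 2.0 - np.abs(r - g) / 2.0 - b
        # Four rectified opponent maps plus the intensity channel.
        return {'I': intensity,
                'RG': np.maximum(R - G, 0.0), 'GR': np.maximum(G - R, 0.0),
                'BY': np.maximum(B - Y, 0.0), 'YB': np.maximum(Y - B, 0.0)}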

The information extracted from the two feature dimensions, intensity and colour, is combined with local orientation maps, computed by convolving Gabor filters with the gray-level image.

Feature conspicuity maps encode the saliency of a given feature dimension: the feature maps belonging to the same feature dimension are normalized and summed into the conspicuity maps. The normalization enhances maps with few peaks of activity and inhibits maps presenting a diffuse activity, which corresponds to the absence of relevant features. Conspicuity maps are then summed into local saliency maps, which encode the saliency of image subregions at the different pyramid levels.

A global saliency map, obtained by a max-pooling operation over the local saliency maps in the pyramid, encodes the saliency of the image as a whole. Maximal activity in the global saliency map corresponds to a region that is salient at one or more levels of the pyramid. Once a location in the global saliency map has been selected for further analysis, a top-down process, not shown in the scheme, selects the local saliency map with maximal activity and segments the salient region.
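
A compact sketch of these two operations, map normalization and max-pooling across pyramid levels (Python with SciPy; the normalization below is a crude stand-in, in the spirit of the operator of Itti et al. [5], for the one actually used by the model):

    import numpy as np
    from scipy.ndimage import zoom

    def normalize(fmap):
        # Enhance maps with a few strong peaks, inhibit maps with diffuse
        # activity: scale by the squared gap between the global maximum and
        # the mean activity (a rough proxy for the mean of the local maxima).
        return fmap * (fmap.max() - fmap.mean()) ** 2

    def global_saliency(local_saliency_maps, out_shape):
        # Max-pooling over pyramid levels: rescale every local saliency map
        # to a common size and keep, per pixel, the maximum across levels.
        rescaled = [zoom(m, (out_shape[0] / m.shape[0],
                             out_shape[1] / m.shape[1]), order=1)
                    for m in local_saliency_maps]
        return np.maximum.reduce(rescaled)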

An inhibition-of-return mechanism suppresses the segmented most salient region in order to prevent the focus of attention from shifting onto the same region again [41]. The inspection of the saliency map continues as long as there are significant saliency values.

When the system is in learning mode, the top-down processes encode the salient regions as parts of the same object. Stored prototypes of previously seen objects can be used to modulate the feature maps when the system is in visual search mode. The relative locations of object parts are used to direct a beam of attention toward the location where the next part is expected to be.

4.3 Bottom-up processing

As light must strike our retina before we can see, basic visual features need to be extracted from the input image before any visual computation can be performed. The set of features used by the model can be subdivided into two subsets:

• a first one corresponding to the initial processing of light performed by the retinal layers. This subset contains features related to intensity and colour, computed according to the center-surround receptive-field organization characterizing ganglion and LGN cells;

• the second subset contains oriented features computed in area V1, like (the presence of) edges and lines.

An important component of human bottom-up attention is also motion: our attentional system forces us to focus on moving objects, in particular on objects moving fast toward our body. However, the current version of the model does not include any module dedicated to moving stimuli, since it has been built for the analysis of static 2-D images.

The response of the visual system to non-oriented and oriented features depends on the scale at which we are analyzing an image. Any multiscale analysis is composed of two basic components:

• the generation of a multiscale representation;

• the extraction of the information from such representation.

Several algorithms have been developed to reproduce the innate human abilities in multiscale pattern analysis. None of them is comparable to human performance, but they are able to provide useful results in a narrow range of scales. The human visual system, in fact, is likely to use a dense representation in the cortical areas, encoding scale in a continuous space [42].

The two main approaches to multiresolution analysis can be summarized as follows. The first approach uses resizable filters (of varying dimension) in order to study the input image at different resolution levels. The second approach consists in modifying the dimensions of the input image while keeping the size of the filters constant. The latter approach basically constructs several copies of the input image in which both sample density and resolution are decreased, and is commonly referred to as an image pyramid. The difference between the two approaches can be illustrated using an object recognition task [43]. If a target object may appear at different scales inside an image, we can construct several copies of the object at increasing scales and convolve them with the input image. Alternatively, we can reduce the input image and convolve it with an object pattern of fixed size. Even if both approaches allow an exact match (when possible), there is a significant difference in computational cost: the convolution of the image with the target pattern expanded by a factor s requires s^4 more multiplications than the convolution of the target pattern with the image reduced in scale by a factor s.
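
The s^4 factor can be checked by counting multiplications, assuming direct (non-FFT) convolution of an N x N image with an n x n pattern: convolving the image with the pattern expanded by s costs N^2 (sn)^2 = s^2 N^2 n^2 multiplications, while convolving the image reduced by s with the original pattern costs (N/s)^2 n^2 = N^2 n^2 / s^2; the ratio of the two costs is exactly s^4.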

4.3.1 Image pyramids

Image pyramids can be considered a support data structure for efficient scaled convolutions through reduced image representations. Moreover, they are considered a ‘good approximation’ to the representation used by the human visual system. They can be thought of as an encoding strategy achieving the space-variant encoding performed by the spatial organization of the photoreceptors in the retina [12, 44]: by using image pyramids, an algorithm is able to access the visual stimulus at full or reduced resolution by simply using the image encoded at the different pyramidal layers. Pyramids can also be used for image compression tasks [43].

Construction of image pyramids relies on two basic operations:

• filtering and blurring the image;

• interpolation and sampling for image reconstruction.

Since we are not interested in the pyramid structure as an efficient way to encode (and decode) the image, we will not discuss the second basic operation.

The first basic operation consists in convolving the image with a smoothing kernel in order to obtain a blurred version. The smoothed version is then sampled and a reduced-resolution copy of the image is built. The goal of the blurring step is to remove high-frequency components: once the higher frequencies have been removed, the image can (according to the sampling theorem) be represented using fewer samples than the original one. In the 2-D case, at each step the sampling procedure is applied to both the columns and the rows, resulting in a sampled image with one-quarter as many samples as the original.

We used a Gaussian image pyramid [45, 46], where the smoothing kernel corresponds to a Gaussian low-pass filter. Given the input image I, which corresponds to the pyramidal layer denoted by G0, each layer Gi in the Gaussian pyramid, for i = 1, . . . , nP (where nP is the total number of layers in the pyramid), is built by low-pass filtering the previous layer Gi−1 with a Gaussian spatial mask and subsampling by decimation with a factor of two.

Algorithm 4.3.1: BuildGaussianPyramid(I)

comment: main procedure in the construction of the pyramid
G0 ← I
for each i ∈ {1, . . . , nP}
  do
    G′i−1 ← LPF(Gi−1)
    Gi ← REDUCE(G′i−1)

The function LPF, which stands for Low-Pass Filtering, is implemented by spatially convolving the image at the previous layer Gi−1 with a separable Gaussian filter. As previously said, the smoothing operation is needed because the reduction of the image size is implemented by decimation: if we discarded pixels without first removing the high spatial frequencies of the image, we would introduce aliasing artifacts.

The image is then subsampled by decimation by a factor of two: one pixel out of every two is kept in the subsequent layer of the pyramid, meaning that the dimensions of an image at layer l are half the dimensions of the image at the previous layer l − 1. A simplified version of the function REDUCE, used in the procedure above, can be described as follows.

Algorithm 4.3.2: REDUCE(I)

nr ← number of rows in I divided by two
nc ← number of columns in I divided by two
for i ← 1 to nr
  do for j ← 1 to nc
       do I′(i, j) ← I(2i, 2j)
return (I′)
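As an illustration, the following Python fragment is a minimal sketch (not the thesis code) of Algorithms 4.3.1 and 4.3.2, using NumPy and SciPy; the function names and boundary handling are our assumptions, and the kernel is the 5-tap one given in Eq. (4.5) below.

```python
# A minimal sketch of BuildGaussianPyramid/REDUCE using NumPy and SciPy.
import numpy as np
from scipy.ndimage import convolve1d

W = np.array([1/16, 1/4, 3/8, 1/4, 1/16])  # separable generating kernel w, Eq. (4.5)

def lpf(image):
    """Low-pass filter: convolve rows and columns with the separable kernel."""
    return convolve1d(convolve1d(image, W, axis=0, mode='nearest'),
                      W, axis=1, mode='nearest')

def reduce_(image):
    """Decimation by a factor of two: keep one pixel out of every two."""
    return image[::2, ::2]

def build_gaussian_pyramid(image, n_layers):
    """G0 is the input image; each Gi is the low-pass-filtered, decimated Gi-1."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(1, n_layers):
        pyramid.append(reduce_(lpf(pyramid[-1])))
    return pyramid
```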


More formally, if the input image is I(i, j), with i ranging over image rows and j over columns, a pyramidal representation of I can be constructed with a number of layers up to $\log_2$ of the size of I. The layers are numbered from 0, corresponding to the original input image I, to L − 1. The implicit assumption in the procedure REDUCE is that I is $2^L \times 2^L$ in size; the images processed by the actual implementation of the model, however, are not required to be square or to have dimensions corresponding to powers of two.

Let I0(i, j) be the input image. The pyramid is built by iteratively convolving I0(i, j) with a family of Gaussian functions that are local and symmetric. The application of the Gaussian functions produces a sequence of images I0, I1, . . . , IL−1 known as the Gaussian pyramid. However, we do not compute the entire pyramid, since very small images are not useful for saliency analysis: the number of computed pyramid layers is nP, chosen according to the input image size.

When computing layer Il+1, the smoothing filter W is applied as follows:

$$ I_{l+1}(i,j) = \sum_{m=-k}^{k} \sum_{n=-k}^{k} W(m,n)\, I_l\!\left(i - m2^l,\; j - n2^l\right), \tag{4.1} $$

where i, j represent the pixel coordinates. The matrix W corresponds to a separable Gaussian filter (the generating kernel of the pyramid):

$$ W = w \cdot w^t, \tag{4.2} $$

where w is normalized to 1 and symmetric:

$$ \sum_i w(i) = 1, \qquad w(i) = w(-i), \quad i = 0, \ldots, k. \tag{4.3} $$

Furthermore, the filter w needs to maintain an equal contribution from the odd and even pixels it is applied to:

$$ w(0) + \sum_{n} w(2n) = \sum_{n} w(2n+1) = \frac{1}{2}, \tag{4.4} $$

where n ranges over the permitted values. If the filter size is chosen to be five, as in our case, the values are:

$$ w = \left( \frac{1}{16},\; \frac{1}{4},\; \frac{3}{8},\; \frac{1}{4},\; \frac{1}{16} \right)^t. \tag{4.5} $$
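One can check directly that this kernel satisfies the constraints (4.3) and (4.4):

$$ \sum_i w(i) = \tfrac{1}{16} + \tfrac{1}{4} + \tfrac{3}{8} + \tfrac{1}{4} + \tfrac{1}{16} = 1, \qquad \underbrace{\tfrac{3}{8} + 2\cdot\tfrac{1}{16}}_{\text{even taps}} = \underbrace{2\cdot\tfrac{1}{4}}_{\text{odd taps}} = \tfrac{1}{2}. $$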

For each layer l in the pyramid, the image Gl is analyzed using the same set of image processing algorithms. Each layer in the pyramid will thus be composed of several images, one for each feature used by the algorithms that compute salience. In the next sections we omit the specification of the scale (pyramid layer) when it is not needed to remove ambiguity about the symbols in the described operation. Therefore, the feature map RG (standing for the red-green channel) must be interpreted as a feature map belonging to a given layer of the pyramid: the precise way to indicate such a map would be RGl, where l is the index of the layer in the pyramid.


4.3.2 Feature pyramids

Each feature used by the model is encoded in a Gaussian pyramid. Using the reduced version of the input image stored at every layer, we compute a set of feature maps and organize them in pyramids. We compute two different types of features: non-oriented features (intensity and colours) and oriented features (bars, edges).

4.3.3 Non-oriented features

The input image is assumed to be encoded in the RGB colour model. It is an additive model in which the three primary colours red, green, and blue are added in order to reproduce other colours. RGB does not specify what is meant by red, green, or blue (spectroscopically): it is said to be a relative colour space, where the results are only approximately correct to the human eye. In other cases RGB images are encoded according to the sRGB colour space, where the three primaries are exactly defined. The actual implementation of the model is able to manage images encoded according to the sRGB colour space, which is by far the most common one.

In the RGB colour model, the colour of each pixel is encoded by a vector with three components (r, g, b)t, standing for red, green, and blue. We assume that each colour component ranges from 0 to 1 (they are usually encoded using 8 bits per channel, with a maximum value equal to 255). The RGB colour space is not perceptually uniform: small differences in the RGB values can produce different colour experiences, while colour stimuli perceived as similar may have very different RGB encodings. Although this is not a problem in computer graphics applications [47], perceptually uniform spaces, like the Luv colour space or its refinements [48], have been designed to overcome it. However, we decided to avoid non-linear transformations into more complex colour spaces, since RGB allows an easier interpretation of colour data. Furthermore, other models, like the original model proposed by Itti [5], obtain good results with it.

Using RGB values

In the algorithm proposed in [5], non-oriented features were extracted as follows. The intensity is computed as:

$$ I_I = \frac{r + g + b}{3}. \tag{4.6} $$

The previous formula should include the pair of spatial coordinates (x, y) referring to the pixel whose intensity is being computed: 3 · II(x, y) = r(x, y) + g(x, y) + b(x, y). Hereafter the pixel coordinates will not be indicated unless they are needed. The maximum value for intensity is reached on triplets with r = g = b = 1, representing white. However, Eq. (4.6) gives the same result on green = (0, 1, 0)t and blue = (0, 0, 1)t, even though blue is perceptually darker than green. Even if this does not represent a major


problem in graphics applications, it can result in poor performance in algorithms attempting to model the perception of colours.

The intensity values are then used to separate hue from intensity in the red, green, and blue channels:

$$ r' = \frac{r}{I_I}. \tag{4.7} $$

The above expression must be interpreted as a point-by-point division of the matrix containing only the red values by the intensity image, computed where the expression makes sense (II > 0); the pixels with II = 0 are set to zero in the red channel. The rationale of this computation is to decouple the hue of the colour from its intensity.

The normalization is also performed on the green and the blue channels:

$$ g' = \frac{g}{I_I}, \qquad b' = \frac{b}{I_I}. \tag{4.8} $$

The raw r′, g′, b′ values are then used to extract the colour channels IR, IG, IB, and IY, corresponding, respectively, to red, green, blue, and yellow:

$$ \begin{aligned} I_I &= \frac{r' + g' + b'}{3}; \\ I_R &= \left[ r' - \frac{g' + b'}{2} \right]_+; \\ I_G &= \left[ g' - \frac{r' + b'}{2} \right]_+; \\ I_B &= \left[ b' - \frac{r' + g'}{2} \right]_+; \\ I_Y &= \left[ r' + g' - 2\left( |r' - g'| + b' \right) \right]_+. \end{aligned} \tag{4.9} $$

The function $[\cdot]_+$ stands for half-wave rectification:

$$ [x]_+ = \begin{cases} x & \text{if } x \ge 0 \\ 0 & \text{otherwise.} \end{cases} \tag{4.10} $$
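A compact sketch of Eqs. (4.6)-(4.10) in Python follows; it assumes an RGB image with float values in [0, 1], and the small epsilon implementing the "where the expression makes sense" condition, as well as all names, are our assumptions.

```python
# A minimal sketch of the non-oriented feature extraction of Eqs. (4.6)-(4.10).
import numpy as np

def rectify(x):
    """Half-wave rectification [.]+ of Eq. (4.10)."""
    return np.maximum(x, 0.0)

def extract_colour_channels(rgb, eps=1e-6):
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    intensity = (r + g + b) / 3.0                      # Eq. (4.6)
    safe = np.where(intensity > eps, intensity, 1.0)   # guard against I_I = 0
    rp = np.where(intensity > eps, r / safe, 0.0)      # Eq. (4.7)
    gp = np.where(intensity > eps, g / safe, 0.0)      # Eq. (4.8)
    bp = np.where(intensity > eps, b / safe, 0.0)
    i_r = rectify(rp - (gp + bp) / 2.0)                # Eq. (4.9)
    i_g = rectify(gp - (rp + bp) / 2.0)
    i_b = rectify(bp - (rp + gp) / 2.0)
    i_y = rectify(rp + gp - 2.0 * (np.abs(rp - gp) + bp))
    return intensity, i_r, i_g, i_b, i_y
```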

The results of the colour extraction procedure, i.e. the components computed in (4.9), are shown in Fig. 4.2 on the classical mandrill image. For each component, the grayscale image represents how strong the presence of the colour is in the image: the brighter it is, the stronger the presence. For example, in the image corresponding to the red component, the part of the nose, which is red in the input image, receives the highest value, corresponding to white in the colour maps.

The four colour channels and the two intensity channels will be combined into opponent channels. The Gaussian pyramid for the intensity values is shown in Fig. 4.3.


Figure 4.2: Intensity and colour values in the classic image of the mandrill. From the upper left corner, clockwise: original image and, in order, the red, green, blue, and yellow channels (after half-wave rectification).


Figure 4.3: Gaussian pyramid for intensity values. The leftmost image is the input image.

The RGB model and the initial encoding described above do not result in a perceptually uniform colour space. However, the encoding is fast and the results it produces allow an easy interpretation. Other models use more complex colour spaces [49] or different transformations of the RGB values [50]. A more biologically inspired, or perceptually uniform, colour space will be necessary when the object detection module is extended to category learning [51, 12, 52, 53, 54].

4.3.4 Feature (contrast) maps

Once the basic visual features have been extracted, they must be encoded and sent to higher processing levels. The model uses the same centre-surround organization found at different stages of the human visual system, for example in the ganglion cells. The computation of centre-surround differences represents a data-compression step: only regions presenting a contrast strong enough to cause a change in the activity of the cells are sent on for further processing. The intensity channel, for example, is encoded in two different contrast maps, the first relative to on-centre/off-surround receptive fields, the second to the off-centre/on-surround opponency. Both types of cells present a null response on homogeneous areas, where the stimuli coming from the centre and the surround of the receptive field compensate each other.

In the original model [5], the centre-surround organization is computed by a cross-scale subtraction. Given a layer l in the Gaussian pyramid for a specific feature f, the centre-surround encoding for f at layer l is constructed by interpolating the images at coarser scales l + δ, with δ > 1, back to scale l and subtracting them. Given a pixel p at


scale l with coordinates (n, m), the value at the same coordinates in the oversampled image can be considered the mean value of the surround of p at scale l, thanks to the low-pass filtering with decimation performed during the construction of the pyramid. However, this amounts to considering square receptive fields when, in nature, they are circular and the influence of a cell in the surround depends on its distance from the centre. By using a square background, the model assigns the same weight to the value of every pixel in the surround.

In the current model a different approach has been used in order to exploit the symmetric structure of receptive fields. The basic idea is to use radially symmetric masks. There are basically two alternatives, each one with pros and cons (more complex windowing procedures are discussed in [55]). They consist in:

• using a circular averaging filter resembling the properties of a ganglion receptive field;

• modelling the centre and the surround influences on a cell input by using Gaussian masks and, in particular, differences of Gaussians (DoG).

Circular averaging filter

By using a circular averaging filter, given a feature f and its Gaussian pyramid, we build its centre-surround feature map at level l by averaging the map, at the same level l, corresponding to the opponent feature f⋆. The feature (contrast) map at level l is thus obtained as:

$$ FC_{f,f^{\star},l,r} = C_f - (AV \ast C_{f^{\star}}), \tag{4.11} $$

where AV represents the circular averaging filter and r is the radius of that filter. This simple approach has a major drawback: circular averaging filters, in general, are not separable, i.e., they cannot be expressed as the (outer) product of two vectors. Therefore, the complexity of their spatial convolution with an input image is O(K²) per pixel, where K is the number of rows (and columns) of the square averaging mask.

Difference of Gaussian (DoG)

Cells with a centre-surround receptive field can be modelled by means of a difference of Gaussian functions [56]:

$$ g_{\sigma,\gamma}(x,y) = \frac{1}{2\pi\sigma^2} \left( \frac{1}{\gamma^2}\, e^{-\frac{x^2+y^2}{2\gamma^2\sigma^2}} - e^{-\frac{x^2+y^2}{2\sigma^2}} \right), \tag{4.12} $$

where x and y are coordinates in the image, and the parameters σ and γ specify the standard deviations of the centre (first term) and the surround (second term). In particular, the Gaussian function corresponding to the centre has standard deviation σc = γ · σ, with 0 < γ < 1. We used γ = 0.5, as experimentally found in [57] (another


method to set biologically plausible parameters can be found in [56]). Furthermore, the two Gaussian functions are normalized to have an integral value of 1.

The parameter σ can be computed according to the desired centre radius r. The centre, in fact, is delimited by the zero-crossing of the DoG:

$$ \frac{1}{2\pi\sigma^2} \left( \frac{1}{\gamma^2}\, e^{-\frac{x^2+y^2}{2\gamma^2\sigma^2}} - e^{-\frac{x^2+y^2}{2\sigma^2}} \right) = 0 $$

$$ e^{-\frac{x^2+y^2}{2\gamma^2\sigma^2}} - \gamma^2\, e^{-\frac{x^2+y^2}{2\sigma^2}} = 0 $$

$$ -\frac{x^2+y^2}{2\gamma^2\sigma^2} - \log\gamma^2 + \frac{x^2+y^2}{2\sigma^2} = 0 $$

Setting $x^2 + y^2 = r^2$:

$$ r^2 = \frac{2\gamma^2 \log\gamma^2}{\gamma^2 - 1}\,\sigma^2, \qquad r = \gamma\sigma \sqrt{\frac{2\log\gamma^2}{\gamma^2 - 1}}. $$

The DoG form discussed so far, however, does not provide a good approximation of receptive fields for every value of γ between zero and one. A more general formulation can be found in [58, 59].

Using the previous expression for r, with γ = 0.5 we obtain:

$$ \sigma = \frac{r}{0.96}. \tag{4.13} $$

The DoG, however, does not make sense if we require the centre to be composed of a single pixel. The DoG masks used in the model are shown in Fig. 4.4. They have been generated using γ = 0.5 and imposing a centre region with radius r = 2 pixels.
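The following sketch builds such a DoG mask from Eqs. (4.12) and (4.13); the support size (half_size) is an assumption of ours, since the thesis only fixes γ = 0.5 and r = 2.

```python
# A minimal sketch of the DoG mask of Eq. (4.12); sigma is derived from the
# desired centre radius r through Eq. (4.13), which holds for gamma = 0.5.
import numpy as np

def dog_mask(r=2.0, gamma=0.5, half_size=8):
    sigma = r / 0.96                                  # Eq. (4.13)
    y, x = np.mgrid[-half_size:half_size + 1, -half_size:half_size + 1]
    rho2 = x**2 + y**2
    centre = np.exp(-rho2 / (2 * gamma**2 * sigma**2)) / gamma**2
    surround = np.exp(-rho2 / (2 * sigma**2))
    # both Gaussians integrate (approximately) to 1 after this scaling
    return (centre - surround) / (2 * np.pi * sigma**2)
```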

Computation of feature contrast maps

As previously said, the circular averaging filter is not separable: the computation of its convolution with large images can be quite expensive, in particular when we choose a large value for the dimension of the surround region. DoG filters are composed of the difference of two Gaussian masks. DoG filters are not separable either, but Gaussian masks are: they can be expressed as the (outer) product of two vectors wcol · wrow. The convolution with a separable filter can then be computed by convolving the image with the two vectors, resulting in an improvement in performance: in this latter case, the complexity is O(2K) per pixel, where K is the dimension of the two vectors generating the Gaussian filter.

For the previous reasons, we adopted a hybrid approach:


Figure 4.4: The DoG filters generated for computing centre-surround differences. From left to right: mask for the centre part, mask for the surround, DoG for computing on-centre/off-surround receptive fields, DoG for computing off-centre/on-surround receptive fields. The DoGs have been obtained by subtracting the first two masks. The centre and surround masks have been generated with parameters γ = 0.5 and centre radius r = 2.

• DoG filters are used on the finer pyramid scales, where a centre region composed of more than one pixel helps reduce the number of strong contrasts due to isolated pixels (which would be lost in the subsequent smoothing and decimation steps of the pyramid construction).


• Circular averaging filters are used on the coarser scales, where a single pixel corresponds to a larger region in the original input image and where the use of single-pixel centre regions makes sense.

Through this hybrid approach we reduce the complexity of the convolutions on the high-resolution, fine pyramid layers, while avoiding the loss of information that large centre regions would cause on the coarse layers, where applying a non-separable filter does not result in a significant loss in performance.

The algorithm used to compute the feature maps can be summarized as follows.

Algorithm 4.3.3: ComputeFeatureContrastMaps(P)

for each layer l ∈ {1, . . . , nP} in P
  do for each couple (f, f⋆) of opponent features
       do
         if l is a high-resolution, fine layer
           then M1exc ← DoG1exc ∗ Cf
                M1inh ← DoG1inh ∗ Cf⋆
                M1 ← M1exc − M1inh
                M2exc ← DoG2exc ∗ Cf
                M2inh ← DoG2inh ∗ Cf⋆
                M2 ← M2exc − M2inh
           else MAv1 ← AV1 ∗ Cf⋆
                MAv2 ← AV2 ∗ Cf⋆
                M1 ← Cf − MAv1
                M2 ← Cf − MAv2
         for each pixel coordinates (i, j) ∈ FMf,f⋆
           do FMf,f⋆(i, j) ← [max(M1(i, j), M2(i, j))]+

For each layer in the pyramid and for each couple of opponent features, we compute the centre-surround differences using either the DoG filters or the averaging filters, depending on the resolution of the current layer. If the layer is a high-resolution one, i.e. it is among the first layers of the pyramid, we use DoG filters: for the couple of opponent features (f, f⋆), we convolve the channel of the first feature Cf with the excitatory component of the DoG and the channel Cf⋆ with the inhibitory part. The results of the two convolutions are then subtracted in order to compute the centre-surround response. Two DoG filters, DoG1 and DoG2, are used to study the centre-surround organization with different sizes of the surround region, while the centre region is kept fixed at 2 pixels in radius.

If the layer is a coarse one, we no longer use DoG filters but switch to the simpler averaging filter. In this case, the opponent channel Cf⋆ is convolved with the filter and then subtracted from the unprocessed Cf. Two averaging filters are used, with AV1 being 5 × 5 and AV2 being 7 × 7 in size.
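A sketch of this hybrid step for a single opponent couple at one layer is given below; the filter sizes follow the text, while the function names and the use of scipy.signal.fftconvolve are our assumptions (the DoG masks can be built, for instance, as in the previous sketch).

```python
# A minimal sketch of the hybrid centre-surround step of Algorithm 4.3.3.
import numpy as np
from scipy.signal import fftconvolve

def disk_mean_filter(size):
    """Normalized circular averaging mask of the given (odd) size."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    mask = (x**2 + y**2 <= half**2).astype(float)
    return mask / mask.sum()

def feature_contrast_map(c_f, c_f_star, fine, dog_excs=(), dog_inhs=()):
    """Centre-surround contrast [max(M1, M2)]+ for one opponent couple."""
    maps = []
    if fine:   # fine layers: two DoGs with different surround sizes
        for exc, inh in zip(dog_excs, dog_inhs):
            maps.append(fftconvolve(c_f, exc, mode='same')
                        - fftconvolve(c_f_star, inh, mode='same'))
    else:      # coarse layers: subtract the averaged opponent channel
        for size in (5, 7):
            maps.append(c_f - fftconvolve(c_f_star, disk_mean_filter(size),
                                          mode='same'))
    return np.maximum(np.maximum(maps[0], maps[1]), 0.0)
```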

Each opponent couple is thus studied at two different "resolutions", corresponding to the different radii of the surround regions, resulting in a feature contrast map


FMf,f⋆ constructed using a max operation over the two maps M1 and M2. The feature maps are computed for the following couples of ordered opponent features:

• (R, G) and (G, R), encoding, respectively, red-on/green-off and green-on/red-off opponencies;

• (B, Y) and (Y, B), encoding, respectively, blue-on/yellow-off and yellow-on/blue-off opponencies.

Furthermore, we encode the centre-surround differences for intensity in two separate feature maps, Ion,off and Ioff,on, which encode, respectively, on-centre/off-surround and off-centre/on-surround cells. The feature maps are hereafter denoted by RG, GR, BY, YB, Ion,off, and Ioff,on. All the maps are computed as shown in the previous algorithm. For example, the RG channel is computed:

• at a fine scale, by convolving the excitatory component of the filter with the red channel and the inhibitory part with the green channel;

• at a coarse scale, by averaging the values of the green channel and subtracting them from the original red channel.

The current implementation uses only two different filters (DoG or averaging) for computing the centre-surround differences. The inclusion of other filters with different centres or surrounds might result in a better and richer coding, but the computational cost of this step would increase. However, our encoding is already much richer than that of the original algorithm, which uses only RG, BY, and intensity double-opponency:

• Using only double-opponency channels, the algorithm is not able to distinguish a red object on a green background from a green object on a red background. Double-opponency cells activate on stimuli presenting different features in the centre and in the surround parts of their receptive field; their response, however, does not depend on the spatial disposition of the opponent stimuli.

• More channels increase the computational cost of the algorithm, but they are needed if object description and detection are among the goals of the algorithm. Our model uses a feature coding that allows it to better distinguish the visual stimuli. Furthermore, visual pop-out on classical experimental stimuli is possible only when using separate channels: if the target is a red circle among many green circles (used as distractors), double-opponency cells produce a similar coding for the target and the distractors.

• In particular, concerning intensity, a single channel encoding both types of cells would not allow the model to distinguish a black circle from a white one.


Figure 4.5: Example of centre-surround and colour-opponency organization. Left: input image; middle: coding in the on-centre/off-surround channel; right: coding in the off-centre/on-surround channel.

Fig. 4.5 shows the advantages of using separate channels for the two different opponencies. The input image, composed of many white dots and a single black spot on a gray background, is coded in two different ways in the on-centre/off-surround channel and in the off-centre/on-surround channel. Had we used a single double-opponency channel, the black spot could not have been distinguished from the white dots.

Orientation feature maps

Since Gabor filters are differential operators, we do not compute any centre-surround organization on the orientation maps. In the implementation we used Gabor filters with zero mean and unit variance: they produce a zero response on homogeneous backgrounds.

4.3.5 Feature conspicuity maps

The feature maps encode the strength of a given feature in the visual image. In the two images of Fig. 4.5, encoding the on-centre/off-surround and off-centre/on-surround channels for intensity, the responses of the cells with the two receptive-field organizations have the same strength on their 'preferred' stimulus. However, there is only one black spot, and it produces a visual pop-out effect. The two feature maps, belonging to the same feature dimension (in this case intensity), need to be merged into a single map before they can contribute to the saliency map at a given layer of the pyramid.

Different maps in the same feature dimension cannot simply be merged by computing the 'average' image: in general, they might contain different amounts of relevant data. Considering the two maps in Fig. 4.5, an image resulting from equal contributions of the two maps would not make the black spot stand out:


all of the cell responses, in fact, have the same value. A weighting strategy needs to be implemented in order to assign a greater weight to the most informative map.

Weighting single feature maps

There are several strategies that could be used for modifying a map according to its relevance [60]:

• The simplest strategy would be plain summation. As previously said, this strategy cannot be applied if the goal is the computation of a saliency map.

• We could compute the global maximum M of a map and the average value m of all its other local maxima. The feature values would then be modulated by multiplying them by (M − m)². This strategy only works when there are significant differences among the various local maxima: if m is very close to the global maximum M, the factor (M − m)² is very small and the result of the modulation is a global suppression at every location in the map.

• A biologically inspired and effective strategy consists in using a large 2-D difference of Gaussians (DoG) mask in order to provide self-excitation and surround-inhibition to large, isolated peaks of activity. The effectiveness of this strategy depends on the number of iterations, which cannot be determined a priori. Furthermore, in the presence of several feature maps within the same feature dimension, the DoG operator alone cannot be used because it does not take into account the number of salient regions in each single feature map.

All the strategies previously described try to enhance maps presenting few peaks of activity. We have adopted an opposite approach:

1. Small salient areas are filtered out, and large salient areas are enhanced, by using the DoG operator:

$$ \mathrm{DoG}(x,y) = \frac{c_{ex}^2}{2\pi\sigma_{ex}^2}\, e^{-\frac{x^2+y^2}{2\sigma_{ex}^2}} - \frac{c_{inh}^2}{2\pi\sigma_{inh}^2}\, e^{-\frac{x^2+y^2}{2\sigma_{inh}^2}}. $$

We perform three convolutions of the DoG with the FM map and, at each iteration, we modify FM as follows:

$$ FM \leftarrow \left[ FM + FM \ast \mathrm{DoG} - C_{inh} \right]_+ , \tag{4.14} $$

where all the parameters are set according to [60]: $c_{ex} = 0.5$, $c_{inh} = 1.5$, $\sigma_{ex} = 2\%$, and $\sigma_{inh} = 25\%$ of the input image width.

2. After the convolutions with the DoG, we reduce the activity of the maps that still present many salient regions by assigning them a low weight; a sketch of the first step follows.
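The following fragment sketches the iterative normalization of Eq. (4.14); approximating the convolution with the DoG by the difference of two Gaussian blurs, and taking the inhibitory constant C_inh as a small fraction of the map maximum, are our assumptions.

```python
# A minimal sketch of the iterative normalization of Eq. (4.14).
import numpy as np
from scipy.ndimage import gaussian_filter

def normalize_feature_map(fm, n_iter=3, c_ex=0.5, c_inh=1.5, k_inh=0.02):
    sigma_ex = 0.02 * fm.shape[1]     # 2% of the map width
    sigma_inh = 0.25 * fm.shape[1]    # 25% of the map width
    for _ in range(n_iter):
        # FM * DoG as the difference of two normalized Gaussian blurs
        excitation = c_ex**2 * gaussian_filter(fm, sigma_ex)
        inhibition = c_inh**2 * gaussian_filter(fm, sigma_inh)
        fm = np.maximum(fm + excitation - inhibition - k_inh * fm.max(), 0.0)
    return fm
```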


In more detail, if the two feature maps FMf,f⋆ and FMf⋆,f are to be merged after the convolution with the DoG, the following procedure computes the weight of each of the two:

Algorithm 4.3.4: ComputeFMWeight(FMf,f⋆, FMf⋆,f)

comment: compute the maximum value in the entire feature dimension
jMax ← max(max(FMf,f⋆), max(FMf⋆,f))

comment: map normalization
for each FM ∈ {FMf,f⋆, FMf⋆,f}
  do
    comment: binarization
    FMb ← threshold(FM, θ1)
    detect the connected components c1, c2, . . . , cn ∈ FMb
    for each ci ∈ {c1, c2, . . . , cn}
      do mi ← max(ci)
    count the number n of maxima mi above the threshold θ2
    FM ← FM / g(n), where g(x) is an increasing function of x

In our implementation, the binarization threshold θ1 is set to 0.5 times the global maximum jMax and the threshold θ2 is set to 0.8 times jMax.

The connected components are segmented using a simple region-growing approach: starting from the pixels presenting an activity greater than the threshold θ2, all adjacent pixels are explored and merged into a single region. If no pixel presents an activity greater than θ2, then the map is weighted with zero, i.e., it will not contribute to the saliency map at its level. When two feature maps are merged, in fact, the global maximum over the two maps is used for the normalization: if one of them presents an activity significantly lower than the feature map encoding the opponent property, the normalization step may produce very low values in it, while the other map will contain at least one region of maximal activity. In some cases there can even be empty feature maps: on monochromatic stimuli, for example, the red, green, yellow, and blue channels do not present any significant activity. The segmentation method used in the above procedure slows down the overall computation, but there is no faster alternative besides counting the number of pixels with activity in a given range; with such a method, however, a single homogeneous region with many active pixels would be penalized.

A similar procedure is applied to the orientation channels. In that case the global maximum jMax is computed over the entire set of four orientation maps.
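As a sketch of Algorithm 4.3.4, the fragment below uses scipy.ndimage to label the connected components; taking g(n) = n, and the 1/10 floor on the weight (mirroring the lower limit used by the actual implementation, mentioned later in this section), are explicit choices of ours.

```python
# A minimal sketch of Algorithm 4.3.4 for one pair of opponent feature maps.
import numpy as np
from scipy import ndimage

def weight_feature_maps(fm_a, fm_b, theta1=0.5, theta2=0.8, floor=0.1):
    j_max = max(fm_a.max(), fm_b.max())   # maximum over the feature dimension
    weighted = []
    for fm in (fm_a, fm_b):
        labels, n_comp = ndimage.label(fm > theta1 * j_max)  # binarize + CCs
        n = 0
        if n_comp > 0:
            maxima = ndimage.maximum(fm, labels, index=np.arange(1, n_comp + 1))
            n = int(np.sum(np.atleast_1d(maxima) > theta2 * j_max))
        weight = 0.0 if n == 0 else max(1.0 / n, floor)      # g(n) = n assumed
        weighted.append(fm * weight)
    return weighted
```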

4.3.6 Saliency Map

The saliency map encodes the bottom-up saliency of the input image. Values in the saliency map indicate the conspicuity of the various spatial regions in the input


image. A first analysis of the saliency of each region has already been performed when constructing the conspicuity maps; however, it was limited to single features and to combinations of features within the same dimension. The conspicuity maps need to be merged into a saliency map where the saliency deriving from combinations of features across dimensions can be highlighted. Two major points need to be discussed:

• how can we combine information coming from different modalities?

• how can we build a global saliency map when performing a multiresolution analysis of the input image?

It is not clear how qualitatively different features should be combined [60]. Even if intensity and colours correspond to features that are different for our visual system, they are computed using the same algorithms and their results lie in a range of values that makes them comparable. The orientation features, computed by Gabor filters, are instead coded in a range that is completely different from that of intensity and colour.

Our merging approach is similar to that of the original algorithm proposed in [5]. As described in the previous section, we weight maps in the same dimension by first convolving them with a large DoG and then applying a weight which is a function of the number of maxima found in the map. Before computing the convolution and weighting the map, the values are normalized to a fixed range. This method, however, can lead to falsely salient regions: when the maps are normalized by the global maximum over the maps in the same dimension, regions with low saliency, but with the maximal value in the dimension, become very salient. The normalization strategy makes all the feature maps with an activity greater than a given threshold reach their maximum value before they are merged into the saliency map; after the normalization, regions with low activity are enhanced by the DoG, with their final value depending only on the number of salient regions in the same feature map. This method cannot be considered particularly good, but we did not find a better one and, as far as we know, other models use the same approach.

The original model uses a single saliency map, obtained by a cross-scale addition at the coarsest scale: the feature conspicuity maps in the same feature dimension are summed up at a fixed scale of the pyramid, normalized again, and finally summed together. This approach allows a fast computation of the saliency map and produces a map providing global, rough information about the image saliency, but the details coming from the fine scales are lost during the merging.

Our model does not construct a single saliency map encoding the saliency of the image at every level. On the contrary, we build several saliency maps, one for each layer. The saliency map at layer l is built by summing the conspicuity maps at the same level l. Once each saliency map has been built, a global saliency map, whose size is equal to $1/2^{(n_P+1)}$ times the dimension of the input image, is built with a max-pooling operation over the corresponding regions at the various levels of the saliency


Figure 4.6: Results of the algorithm on visual pop-out. Left: input image; middle: saliency map at the finest scale; right: saliency map at the coarsest scale.

pyramid. The algorithm that builds the feature conspicuity maps, the local saliency maps, and the global saliency map follows:

Algorithm 4.3.5: BuildSaliencyMap(P)

for each l = 1, . . . , nP
  do
    build the FCM for intensity FCI(l)
    build the FCM for red-green FCRG(l)
    build the FCM for blue-yellow FCBY(l)
    build the FCM for orientation FCO(l)
    FCcolour(l) ← min(FCRG(l) + FCBY(l), 1)
    local saliency map LSM(l) ← FCI(l) + FCcolour(l) + FCO(l)

initialize the global saliency map SM to be 1/2^(nP+1) times the input image in each dimension
for all pixel coordinates (i, j) in SM
  do for all l = 0, . . . , nP
       do
         m ← max(LSM(l)((i−1)·2^(nP+1−l)+1, . . . , i·2^(nP+1−l);
                         (j−1)·2^(nP+1−l)+1, . . . , j·2^(nP+1−l)))
         SM(i, j) ← max(SM(i, j), m)

The expression in the algorithm above,

$$ \max\!\Big( LSM(l)\big( (i-1)\cdot 2^{n_P+1-l}+1, \ldots, i\cdot 2^{n_P+1-l};\; (j-1)\cdot 2^{n_P+1-l}+1, \ldots, j\cdot 2^{n_P+1-l} \big) \Big), $$

corresponds to the subregion of the local saliency map at level l that is mapped onto the cell with coordinates (i, j) of the global saliency map. The value SM(i, j) thus contains the maximum value of the subregion described above, for l ranging over the level indices of the pyramid.
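The pooling step can be sketched as follows; it assumes that the dimensions of every local saliency map are exact multiples of those of the global map.

```python
# A minimal sketch of the cross-level max pooling that fills the global
# saliency map SM from the local saliency maps (fine to coarse).
import numpy as np

def build_global_saliency(local_maps):
    """local_maps[l] has 1/2**l the input size, l = 0 .. nP; the global map
    is half the size of the coarsest local map, i.e. 1/2**(nP+1) the input."""
    h, w = local_maps[-1].shape
    sm = np.zeros((h // 2, w // 2))
    for level, lsm in enumerate(local_maps):
        f = 2 ** (len(local_maps) - level)   # side of the block pooled at this level
        for i in range(sm.shape[0]):
            for j in range(sm.shape[1]):
                block = lsm[i * f:(i + 1) * f, j * f:(j + 1) * f]
                sm[i, j] = max(sm[i, j], block.max())
    return sm
```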

Fig. 4.6 shows the saliency map computed on a stimulus with many green circles and a single red one. The red circle causes a visual pop-out effect.


Figure 4.7: Results of the algorithm on visual pop-out. Left: input image; middle: saliency map at the finest scale; right: saliency map at the coarsest scale.

Its saliency is in fact the highest both at the finest scale, i.e. at the original image size, and at the coarsest ones. This visual pop-out occurs because the outputs of cells with red-green and green-red opponencies are coded in different channels.

In Fig. 4.7 the visual pop-out is caused by the single black spot. Since off-centre/on-surround features are used by the model, the black circle has a different encoding from the white circles and receives a high saliency both at the finest and at the coarsest scale.

Figure 4.8: Results of the algorithm on visual pop-out. Left: input image; middle: saliency map at level 1 (coarser than the input image); right: saliency map at the coarsest level.

Fig. 4.8 shows a visual pop-out effect determined by a difference in orientation. Saliency in the orientation channel is more problematic: the Gabor filters have a strong response only when a bar falls in the excitatory part of their receptive fields. On the finer scale, in fact, the algorithm detects two regions with maximal value, corresponding to the edges of the tilted bar. These two regions are not merged by the segmentation algorithm, because they are disconnected, and the response of the filter on this image is therefore divided by two. Furthermore, it also detects


two regions with maximal activity in the map computed with the filter rotated at 135°, corresponding to the short edges of the bar. On the coarser scale (right image) only the filter with a 45° orientation has a strong response and, since only one region with maximal activity is detected, the saliency of the region is not decreased.

In Fig. 4.9, the saliency of a real-world picture is shown. The cars on the road receive a high saliency in both maps but, in the finest saliency map, there are many salient details, like the street lamps, that are lost during the down-sampling procedure. Such details disappear when we consider the saliency at the coarsest scale, providing a good example of the usefulness of multiple saliency maps.

Figure 4.9: Results of the algorithm on a real image. Left: input image; middle: saliency map at the finest scale; right: saliency map at the second scale. The saliency of the street lamps is clearly visible in the finer saliency map but is lost in the coarser one.

Fig. 4.10 shows the result of the algorithm on a nature image. The four level saliency maps show that the saliency of a region depends on the analysis scale: the level saliency map at level four contains a salient region corresponding to the wooden fence, while the same region is not salient at finer scales. Another example of multiresolution saliency analysis is shown in Fig. 4.11.

Figure 4.10: Results of the algorithm on a natural image. From left to right: input image; global saliency map; local saliency maps at the four levels, from fine to coarse scales.

It must be said that the strategy of normalizing by the number of regions with maximum activity does not always give 'correct' results on natural images. In some cases, the presence of a high number of regions in the same channel can


Figure 4.11: Results of the algorithm on two natural images. From left to right: input image; global saliency map; local saliency map at level 2, where the windows appear more salient; input image with green and red tomatoes; saliency map.

Figure 4.12: 'Starry Night', by van Gogh; global saliency map; saliency map at level 3.

cancel the influence of that channel on the conspicuity map. In the last image of Fig. 4.11, the green tomatoes have zero saliency, since the number of regions in the green-red channel is much higher (leaves, green tomatoes, background) than in the red-green one, which contains a single region. In cases where there are too many regions with maximum activity, the weighting procedure should not be applied. In the actual implementation there is only a lower limit on the weight factor (1/10), but we plan to study a method to correct the behaviour of the algorithm on this type of image in future work.

Relating saliency to the dimension of the regions

If we look at an image containing many objects that are similar in size, with just a few smaller or bigger ones, our eye is attracted to the latter. The processing of simple stimuli, like the ones used for generating a visual pop-out effect, can be enriched by modifying the processing of the feature contrast maps in order to relate the saliency of a region to its size.

When the algorithm computes the number of regions presenting maximal activity in order to build the feature conspicuity maps, it can, at a small additional cost, also compute the size of those regions. We propose a very simple method to assign a


different weight to image regions in simple visual pop-out stimuli. If R1, R2, . . . , Rn are the regions presenting maximal activity, with sizes, respectively, S1, S2, . . . , Sn, we can compute the mean size $\bar{S} = \frac{1}{n}\sum_{i=1}^{n} S_i$ and weight each region Ri by:

$$ 1 + \frac{|S_i - \bar{S}|}{\sum_{j=1}^{n} |S_j - \bar{S}|}. \tag{4.15} $$

The results of this weighting strategy are shown in Fig. 4.13 and Fig. 4.14.
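As a small worked example (with made-up sizes), three regions of sizes $S_1 = S_2 = 10$ and $S_3 = 40$ give $\bar{S} = 20$ and deviations $(10, 10, 20)$ summing to 40, so Eq. (4.15) yields

$$ w_1 = w_2 = 1 + \frac{10}{40} = 1.25, \qquad w_3 = 1 + \frac{20}{40} = 1.5, $$

and the odd-sized region is enhanced the most.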

Figure 4.13: Saliency maps computed by weighting each region according to its size (with respect to the others). From left to right, first and second images: black-and-white image containing a region larger than the others and the corresponding saliency map; the text 'max' has been added at the location with maximum saliency, since the saliency differences are not visible in the displayed values. Third and fourth images: black-and-white image with a single smaller region and the corresponding saliency map. In each case the region differing in size from the others has been given a greater saliency value.

Figure 4.14: Saliency maps computed by weighting each region according to its size (with respect to the others). From left to right, first and second images: stimulus containing a smaller and a larger region, with the corresponding saliency map. Third and fourth images: a stimulus containing three white circles and two green ones, with the saliency map viewed in 3-D in order to better appreciate the results of the weighting strategy.


In particular, the last two pictures in the second figure show that the weighting strategy exploiting differences in size is compatible with the strategy that counts the number of maxima in the feature contrast map. The two green circles receive a higher salience since the green-red feature contrast map contains fewer regions than the intensity channel. Among the three white circles, the smallest one receives a higher weight because its size differs from that of the other two. The small green circle receives a higher weight because, being small, it fills, at some level of the pyramid, the full centre region of a green-red cell.

We did not attempt to develop a better strategy for relating saliency to region size, since it cannot be used on natural images: it might assign a large weight to regions that should not be considered relevant. If the image contains, for example, the sky, such a blue region might receive a high weight whenever another blue region is present in the rest of the image. This weighting strategy is therefore applied only to visual pop-out stimuli. In order to extend it to natural images, more complex approaches that compute the gist of the scene need to be considered [61, 62].

4.3.7 Scene exploration through the saliency map

When no top-down influences are present, the exploration of the scene is guided by saliency values only. Each point in the saliency map encodes the maximum saliency of a region in the saliency pyramid, whose dimensions increase as we move down toward the first layer. The first step needed to select a region with a high saliency is to determine the element in the saliency map with the highest value.

The algorithm used to analyze the most salient region is presented in the next section. However, once the focus of attention has attended the most salient region, we need to inhibit that region; if we did not, the focus of attention would keep staring at the same region. The human visual system is equipped with an inhibition-of-return strategy: it does not select the same region to focus on in successive fixations, unless a given period of time has elapsed. In our case, however, we can simply inhibit the region, since in our setting there is no reason to focus on it again.

Starting from the point with maximum saliency, we need to detect the level in the saliency pyramid where the saliency is maximum. Since each cell in the global saliency map performs a max-pooling operation over scales, we are guaranteed that the same value is present in at least one level saliency map. By moving downward, we reach the level saliency map that contributed most to the saliency of the starting point in the global saliency map. If two or more maps have the same saliency value, the coarsest one is chosen.

There are two alternative ways to segment the most salient region:

• using the local saliency map: the most active saliency map is selected, and the region with maximum saliency is segmented starting from the pixel with maximum


value.

• using the feature contrast map that contributed most to the saliency value: once a local saliency map has been selected, the feature contrast map that contributed most to its saliency is selected, and the region is segmented starting from the pixel with the maximum saliency value.

The choice of the segmentation method depends on the goal of the analysis and, more precisely, on the scale at which we desire to perform the analysis. The first method corresponds to an analysis at a coarse scale: entire objects, composed of various salient heterogeneous parts, are treated as a single region. The second method tries to separate the parts composing salient objects: by choosing a feature dimension as the domain of the segmentation algorithm, it is able to perform a more precise segmentation. The classical example used to explain the difference between the two methods is about trees: we can be interested in the tree or in the leaves. With the first approach we would segment the 'entire tree'; with the second we may be able to study the components of the tree. Our model uses the second approach.

In order to segment a region, the following steps are performed:

• the most salient pixel is selected in the global saliency map.

• the level saliency map which contributed that value to the global saliency map is selected. The algorithm is biased in favour of selecting coarse scales. This might appear contradictory (coarse scales contain fewer details), but segmentation in the feature space using high-resolution contrast maps usually yields poor results.

• the feature contrast map that contributed most to the saliency at that level is chosen, and the segmentation finally takes place. In case of equal or very similar values, the algorithm prefers colour, then intensity, then orientation [63].

The segmentation algorithm follows a region-growing approach: starting from the seed, all the pixels whose value is above a given threshold are connected to the seed.
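A minimal sketch of this region-growing step follows; 4-connectivity and the explicit stack are our choices, since the thesis only fixes the seed and the threshold.

```python
# A minimal sketch of the region-growing segmentation.
import numpy as np

def grow_region(feature_map, seed, threshold):
    """Return a boolean mask of the region grown from the seed pixel."""
    h, w = feature_map.shape
    region = np.zeros((h, w), dtype=bool)
    stack = [seed]
    while stack:
        i, j = stack.pop()
        if region[i, j] or feature_map[i, j] < threshold:
            continue
        region[i, j] = True
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):   # 4-neighbours
            ni, nj = i + di, j + dj
            if 0 <= ni < h and 0 <= nj < w and not region[ni, nj]:
                stack.append((ni, nj))
    return region
```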

An example of scene exploration is shown in Fig. 4.15. The algorithm selected seven points to explore the 'Starry Night' painting. The first point corresponds to the moon, which is selected and segmented. The segmentation produces irregular borders, but it has been performed at a coarse scale. The second point corresponds to a star which is very close to another salient one. The feature map with the highest contribution to the saliency allows a more precise segmentation, as shown in the fourth image from the left. The sixth and seventh points correspond to edge effects.


Figure 4.15: Exploration of the 'Starry Night' painting. First row: fixation points on a reduced-resolution input image and on the saliency map. Second row: object segmented in the first fixation; feature map used to segment the second fixation point; result of the segmentation starting from the second point. Numbers indicate the order of the exploration.

4.4 Top-down attention and visual search

The saliency map encodes what is called bottom-up attention. Each location in the saliency map encodes how relevant the corresponding spatial region in the image is. The saliency of each region does not depend on our current task: it simply indicates the spatial contrast between the region and its surroundings. As we have seen in the previous sections, an image does not contain many salient regions and can be inspected with few eye movements. In computer vision, several approaches try to integrate the information extracted from local descriptors into a more global analysis resulting in object learning and recognition [64, 65].

However, we rarely explore a scene in a pure bottom-up fashion. Our current task heavily influences which regions we select for further analysis. Even when we do not have a task, our emotions or physical condition still influence the regions we fixate on. Bottom-up and top-down, or goal-directed, attention are not separated in our visual system, where they influence each other in deciding what to inspect next.


From a computational point of view, we can distinguish between the two types of attention. In particular:

• bottom-up attention is used to compute feature contrast and spatial contrast: by using it, we are able to measure how a region differs from its surroundings and assign a saliency weight;

• focused attention and top-down guidance are used for object learning and detection [66, 67].

4.4.1 Object learning

Bottom-up visual attention can be used to learn an object template by exploiting its saliency computations to detect the salient regions of the object: we can apply the bottom-up component of the model to an object, explore its locations in order of decreasing saliency, and build a template for that object [68, 69, 70, 71].

The template should encode different properties of the object:

• the values of the different features characterizing its most salient parts;

• the relative locations of the various features, in order to guide the focus of attention while searching for the object [72].

An object can thus be considered as a set of salient locations detected by the bottom-up component of the model. For each location, after we fixate on it, we can determine by focused attention the conjunction of feature values distinguishing that location. Several pieces of information describe a single region:

• its saliency in the saliency map;

• the conspicuity of each feature, stored in the feature conspicuity map;

• the values of the single features, e.g., the on-centre/off-surround intensity values or the red-on/green-off cell responses.

From our point of view, the first two properties cannot be used as reliable object (region) descriptors, since they are strongly related to the relationship between the object properties and the surroundings. The saliency and the conspicuity, even if computed on object parts, are strongly influenced by the global properties of the input image: for example, the conspicuity (and, through it, the saliency) depends on the number of regions with a similar feature in the input image.

The values of the single features, instead, are extracted from the object and depend only partially (along the object borders) on the environment, i.e., on the rest of the input image.

The saliency and conspicuity values can be used to suggest probable locations of the object in similar tasks, but not to represent the object itself. Bottom-up


models of visual attention are in fact used to compute the gist of a scene, i.e., they try to understand what class of input image we are analyzing, for example, an image taken indoors or outdoors [6].

The object template used in the current model separates two different types of information about the object locations:

• spatial information: the relative positions of the different parts are coded in the template;

• feature information: for each object part we code the values of its features, used to compute subsequent matches.

These two types of information roughly correspond to the subdivision of our visual system into a What and a Where system: the features computed by inspecting the region flow in the What pathway, the locations of the object and its parts in the Where pathway [73].

Object learning is usually performed using one of two approaches to the algorithm training:

• single-shot learning: a single object is presented to the algorithm. The algorithm is able to encode the properties of a single item, but has no mechanism for inductive learning: if a second object is presented, even a very similar one, the algorithm builds a new object template;

• multiple-instance learning: the algorithm observes several objects belonging to the same class. For each object, it extracts the visual features and updates the internal representation of the object class.

Nevertheless, depending on the particular class of objects, single-shot learning can result in a model that detects more than one similar object. Multiple-instance learning is more powerful and is often tackled with machine learning approaches; however, by using machine learning algorithms or approaches similar to SIFT [65], we would lose the biological inspiration underlying this work.

The algorithm used by our model for single-shot learning can be summarized as follows:

Algorithm 4.4.1: EncodeObject(SM)

for each loc ∈ SM such that SM(loc) > θ
  do
    segment the most salient region R in the saliency pyramid
    encode the max, min, and mean values of every feature on R
    encode the coordinates (x, y) of the point with maximum value,
      the bounding box of the segmented region, and the coordinates
      of its centroid
    then inhibit the region R
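The information encoded for each attended part can be sketched as a simple record; all field names below are hypothetical, illustrating the template described in the text rather than the thesis's actual data structures.

```python
# An illustrative sketch of the template produced by Algorithm 4.4.1.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class PartDescriptor:
    max_point: Tuple[int, int]                 # (x, y) of the maximum-value point
    bounding_box: Tuple[int, int, int, int]    # of the segmented region
    centroid: Tuple[float, float]              # centroid of the segmented region
    scale: int                                 # pyramid level where it was found
    features: Dict[str, Tuple[float, float, float]]  # (max, min, mean) per feature

@dataclass
class ObjectTemplate:
    parts: List[PartDescriptor] = field(default_factory=list)  # decreasing saliency
```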


The object is learnt by inspecting the parts it is composed of. We do not try to segregate the entire object from the background, but limit our analysis to its most salient regions and to the spatial relationships existing among those regions. The saliency of an object part depends on the scale at which we perform the analysis: it is reasonable to assume that the saliency of an object part can be high at a given scale and low at another.

In fact, very small object parts can be salient only at fine scales. The decimation algorithm used to construct the image pyramid may remove small regions in just a few steps, and regions characterized by details, for example a particular texture producing a high response of the Gabor filters, can be made homogeneous by the repeated low-pass filtering and decimation steps. In order to overcome this problem, each location has a field containing the index of the scale at which it was detected during the object learning procedure.

Let us suppose that the first location loc1 of an object has been detected at scale σ1 and the second location loc2 at scale σ2. Since the object can appear at a different size (and orientation) in the search image, we look for the first location at every scale. If loc1 is detected at scale σ during the search, the algorithm will look for loc2 at scale σ′ = σ + (σ2 − σ1). This means that only the first location is searched for in every map of the pyramids: subsequent locations are searched in maps at a scale determined by the first match and by the object template, which codes the matches found during learning.

This form of spatial modulation resembles the zoom-lens metaphor of visual attention (which inspired the model): we use the attentional focus to inspect the image at different levels of detail. Using the zoom-lens metaphor we are able to restrict the search space: the first match requires an inspection of the feature maps at every scale; from the second match on, we limit the search to a single scale. The procedure makes an implicit assumption: the algorithm is able to measure the size of the object part detected in the first match. In particular, when the algorithm detects the first match, it considers the appearance of the object that led to the template and the actual object in the search image to have the same size. By moving across different scales when searching for the remaining locations, the algorithm basically tries to inspect an image where the actual object and the object template match.

Objects processed during learning can undergo different geometric transformations and appear different in the search image. However, we need to assume that the relative locations of the object parts do not change much between the object template and the object as it appears in the search image. We can use the relative positions of the object parts, stored in the template, to guide the visual search toward image regions where the target is likely to appear.

When the most salient region is detected and segmented, the algorithm computes its central point c and inhibits all the maps everywhere except in a diverging beam originating at c and ‘illuminating’ the image section where we expect to find the second location. The beam is computed so as to include the bounding box of the


second region to be detected. Since the object could have been rotated, the algorithm rotates this illuminating beam if no match is found in the ‘normal’ orientation, that is, the orientation coded in the object template. Each time, the algorithm rotates the beam by 45°. When the second match has been found with a rotation α, the algorithm looks for the third object location at the normal orientation plus α and, if no match is detected, at α + 180°. This last step is needed since, when the first two parts are circular regions, the third region can be located in the ‘normal’ direction or in the opposite one.
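As a rough illustration, assuming the beam is a simple angular cone in the image plane, the inhibition mask could be built as follows (beam_mask and its parameters are assumptions made for the example, not the thesis implementation):

import numpy as np

def beam_mask(shape, c, direction_deg, half_width_deg=30.0):
    """Boolean mask that is True inside a diverging beam originating at
    the centroid c = (cy, cx) and pointing toward direction_deg.
    half_width_deg should be chosen wide enough to cover the expected
    bounding box of the next object part."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    angles = np.degrees(np.arctan2(ys - c[0], xs - c[1]))
    diff = (angles - direction_deg + 180.0) % 360.0 - 180.0
    return np.abs(diff) <= half_width_deg

# Rotating the beam in 45-degree steps when no match is found:
# for k in range(8):
#     mask = beam_mask(sal.shape, c, normal_dir + 45 * k)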

4.4.2 Feature matching

So far we have described the spatial deployment of attention. We need to explain how candidate object parts in the search image are inspected by the algorithm.

First match

When looking for the first match, we need to look at every scale, since we do not know what transformation has been applied to the target object. In fact, the object that the algorithm learnt during the learning procedure could have been rescaled.

Looking into the coding of the first location, we can extract the features characterizing the learning example. We do not use any information about the saliency of the example, since it might have been influenced by the context. The features used to modulate the deployment of attention are computed as follows:

• for each feature dimension and for each feature-opponency channel we select only a single feature:

– in the feature dimension intensity, we use either on-center/off-surround or off-center/on-surround, but not both together;

– in the feature dimension colour, we have two feature-opponency channels: red-green and blue-yellow. The algorithm selects either red-on/green-off or green-on/red-off, and either blue-yellow or yellow-blue;

– in the feature dimension orientation, we select a single orientation for the Gabor filters.

Each channel is chosen by selecting the one that had the strongest response on the object part during the learning procedure.

The maximum values coded in the object template for the given object part are then used to modulate the feature maps, obtaining a saliency map not influenced by the bottom-up processes. Each pixel x, in each feature map for feature f, is modulated by an exponential function of the form:

\[
e^{-\gamma\,(x-\mu_f)^2} \tag{4.16}
\]


with γ determining the spread of the exponential function. µ_f corresponds to the mean value of the feature f on the object part, and γ is set to 1/(2σ_f^2), where σ_f^2 is the variance of the f values observed on the object part. The value of the exponential is used to establish whether a point belongs to the object part we are searching for: the exponential value is thresholded at 60%, which corresponds to the value the exponential takes at µ_f ± σ_f.

We modulate each selected feature map using the corresponding values of µ_f and σ_f^2. The results of the modulations need to be combined: if we regard them as a sort of conditional probability that the modulated point belongs to the object part, we should multiply them.

However, since we are computing properties of regions using pixels, this could lead to a global suppression of activity. In fact, the various filters used to compute the properties may not have their strongest responses on the same pixel. The results of the exponential modulation are therefore summed. We consider only pixels belonging to the image part ‘illuminated’ by the beam previously described.

After the summation of the modulated feature maps, the pixel with the maximum value is selected and a region-growing procedure is started using that pixel as the seed. We then multiply the maximum values of each feature and classify the segmented region accordingly as belonging to the object or not.
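A compact sketch of this modulation-and-combination step, assuming a dictionary of 2-D feature maps and a boolean beam mask like the one sketched earlier (all names are hypothetical):

import numpy as np

def modulated_saliency(feature_maps, template_part, beam):
    """Sum of Gaussian modulations of the selected feature maps.
    template_part maps a feature name to (mu_f, var_f); beam is a
    boolean mask restricting the computation to the illuminated area."""
    total = np.zeros_like(next(iter(feature_maps.values())))
    for name, fmap in feature_maps.items():
        mu, var = template_part[name]
        gamma = 1.0 / (2.0 * var)
        resp = np.exp(-gamma * (fmap - mu) ** 2)
        resp[resp < 0.60] = 0.0   # threshold ~ exp(-1/2), i.e. mu_f +/- sigma_f
        total += resp             # summed, not multiplied (see text)
    total[~beam] = 0.0
    seed = np.unravel_index(np.argmax(total), total.shape)
    return total, seed            # seed starts the region-growing procedure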

If the object has been rotated, it is likely that the modulation of the orientation feature map does not produce strong responses. When looking for the first match, in fact, we need to rotate the object template in order to check for rotations of the object. The values of the Gabor filters are encoded for 0, 45, 90, and 135 degrees. Let us suppose that:

• the object part in the template has its strongest response at 0 degrees, corresponding to vertical lines;

• the object has been rotated anticlockwise by 45 degrees.

The response at 0 degrees in the object template can now be found in the second position of the array, corresponding to lines with a 45-degree orientation.

In order to check for a rotation, we can simply modulate the map by shifting the values in the orientation array of the object template.
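For instance, with responses stored for 0, 45, 90, and 135 degrees, checking a 45-degree rotation reduces to a circular shift of the template's orientation array (a sketch, not the model's code):

import numpy as np

orientations = np.array([0, 45, 90, 135])
template_resp = np.array([0.9, 0.1, 0.2, 0.1])   # strongest at 0 degrees

# Hypothesis: the object was rotated anticlockwise by 45 degrees,
# so the template responses are compared after a shift of one slot.
shifted = np.roll(template_resp, 1)               # now strongest at 45 degrees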

Subsequent matches

The second match is performed like the first one, but no rotations are checked. Let α be the rotation detected at the previous step. We need to check only one possible additional rotation, i.e. by 180 degrees. However, a 180-degree rotation does not change the ordering of the values in the orientation array: when constructing the illuminating beam we need to look in the direction of the second location, rotated by α, and, if no match is detected, perform a second search looking at α + 180°. Once the second match is detected, there can be no other rotations and the subsequent object parts are searched using the computed rotation.


4.4.3 Different exploration strategies

The exploration strategy of the input image can change according to our goals. For example, we might allow partial matches when object parts are missing. Furthermore, we could look for overlapping objects, i.e. objects sharing a common part that are both present in the image. In the model we have implemented a greedy search strategy. A description of the main procedure follows:

Algorithm 4.4.2: ExplorationStrategy(Map, ObjectTemplate)

for each α ∈ {0, 45, 90, 135}
do
    Look for the first location, assuming that the object has been rotated by α.
    for each candidate region with maximal activity
    do
        nextLocIdx ← 2
        Look for the object part with index equal to nextLocIdx,
        considering two rotations: β1 = α and β2 = α + 180.
        if the second location has been found with rotation βi
            do Look for the other object parts with rotation βi
        else if partial matches are allowed
            do nextLocIdx ← nextLocIdx + 1
        else Proceed with another candidate region for the first location

This strategy is greedy since it does not take into account all the available regions for matching. In fact, when looking for the second location, the algorithm considers only one active region, even if there are more active regions in the area illuminated by the beam. Furthermore, it does not consider partially overlapping objects: once a region is labelled as belonging to the object, it is not considered any further.
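A skeletal Python rendering of this greedy procedure may make the control flow explicit; find_first_location and find_part are hypothetical helpers standing in for the matching steps described above:

def exploration_strategy(maps, template, allow_partial=False):
    """Greedy top-down search: commit to one candidate per step."""
    for alpha in (0, 45, 90, 135):
        for region in find_first_location(maps, template, alpha):
            rotation = None
            matched = [region]
            next_idx = 1                  # 0-based index of the next part
            while next_idx < len(template):
                hit = None
                betas = (alpha, alpha + 180) if rotation is None else (rotation,)
                for beta in betas:
                    hit = find_part(maps, template, next_idx, matched, beta)
                    if hit is not None:
                        rotation = beta   # fixed after the second match
                        break
                if hit is not None:
                    matched.append(hit)
                    next_idx += 1
                elif allow_partial:
                    next_idx += 1         # skip the missing part
                else:
                    break                 # try another first-location candidate
            if next_idx == len(template):
                return matched            # all parts accounted for
    return None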

In Fig. 4.16 the execution of the algorithm on a very simple stimulus is presented. The first image, in the upper left corner, shows an enlarged version of the stimulus learnt during the exploration of the scene. The object has been coded with three locations, the horizontal one being the most salient. The first match is computed by modulating all the feature maps and selecting the largest value. The match falls on the corner of the region. The region is then segmented and its centroid is computed. An enhancing beam, which suppresses all external activity, is built toward the expected location of the second feature. After another segmentation step, the centroid is computed and an attentional beam is deployed toward the expected location of the third feature. Another example of the top-down deployment of attention is shown in Fig. 4.17. In this case the target image contains a rotated version of the object. The algorithm detects the first match at the correct orientation α, then scans the image considering rotations of 180 degrees. After the second match, the attentional beam immediately finds the third and last location.


Figure 4.16: Six steps of the top-down deployment of attention on a simple visual stimulus. First row: input image, match on the first region (green circle). Second row: centre of the area segmented starting from the point with maximum value, attentional beam toward the second region. Third row: centroid of the second segmented region, attentional beam toward the third region.

4.5 Conclusions

This chapter has presented our proposal for a computational model of visual attention. The model, which belongs to the filtering approach, is inspired by the model of Koch and Ullman [2] and by its implementation in [5].

The visual stimulus is initially processed to form a pyramid of image features. After the extraction of the basic channels, such as the four primary colours, intensity, and orientation, feature maps are built using centre-surround differences. Feature maps belonging to the same feature dimension are unified into feature conspicuity maps, which encode the saliency of the image in every feature dimension. The conspicuity maps are then used to compute the saliency of the image at every scale. A single saliency map is then built by performing a max-pooling operation across the saliency levels.

The main innovations of our model are:

• a richer representation of the input stimulus. In particular, our model does not use double-opponency channels for coding feature maps, but single-opponency channels. More in detail, we code both the on/off receptive-field organizations for every feature. This richer representation makes the bottom-up model of visual attention usable in object recognition tasks, where the ability to distinguish, for example, red on green rather than green on red can lead to different results.

Figure 4.17: Another example of the top-down deployment of attention on a simple visual stimulus. First row: input image, image to be searched, match on the first region (green circle). Second row, first and second images: two attentional beams, looking at the detected rotation angle and at that angle + 180°; beam toward the third location. Since the algorithm has found the second match, it does not look for α + 180°.

• Feature conspicuity maps are ‘normalized’ with a better strategy. First of all, small regions of activity are filtered out using 2-D difference-of-Gaussians filters, while larger regions with high activity increase in conspicuity. This normalization strategy is not enough in our case, since opponent channels may have different numbers of conspicuous regions. For example, after the previous step, the red channel could have two regions that are as conspicuous as the single conspicuous region in the green channel. If we simply summed them, the visual pop-out effect on the green object would not take place.

• There is not a single saliency map. Information about saliency is encoded in a pyramid, over which a single saliency map performs a max-pooling operation. In truth, this single saliency map is not needed from a computational point of view, since we could just compute the maximum over the entire pyramid of


saliency values. However, the availability of a saliency map at each level allows us to deploy attention as a zoom lens and to deploy object-based attention rather than spatial-based attention.

Furthermore, the current implementation of the model has different modes of operation. Besides the bottom-up computations, it is able to build a template of an object contained in the input image. The object template contains information about the feature values and the relative locations of its salient parts. The model learns in a single-shot modality, meaning that it is not able to generalize from the observation of several training examples belonging to the same class. However, this holds only if we want to build a template composed of the object parts. The algorithm could easily be extended to learn the representation of several object classes if it were not required to inspect object parts.

The object template is finally used in visual search tasks. Starting from the most salient region of the object (corresponding to the first location encoded), the top-down system detects salient regions by modulating the feature channels and inhibiting the computations everywhere except under an enhancing beam that covers the expected location of the next feature. Furthermore, once the initial location has been detected by inspecting all the scales, successive analyses operate only on a single scale computed from the object template, acting, again, according to the zoom-lens metaphor of attention.


Chapter 5

Swarm Intelligence for Image Analysis

5.1 Introduction

Our goal in investigating Swarm Intelligence for image analysis is to verify the possibility of implementing an attention-based system based on the properties that characterize this field.

The application of Swarm Intelligence to image processing and analysis tasks is still at an early stage, but several works presenting good results have been published [74, 75, 76, 77]. In particular, [75] proposes an algorithm for attention-based processing built on Particle Swarm Optimization.

Many insects, like bees, communicate with each other to signal the presence of flowers. When a bee detects a flower, which represents a salient object in its world, it performs a sequence of movements that other bees will detect and interpret. Furthermore, bees, like other insects, have a good memory and can locate previously visited regions that they know to be populated by food sources. The existence of such behavioural patterns has led us to investigate their reproducibility in an algorithm where agents, acting like bees, would be attracted by salient regions, like flowers, and would indicate the locations of such salient regions to other agents, in order to perform a detailed analysis of the salient region. Furthermore, the memory of salient regions could endow the swarm with object recognition capabilities.

This chapter contains the results of our initial study. We do not propose a different model of visual attention, but an algorithm for image registration. Even if it addresses a completely different problem, this work has given us insights that help us better understand the use of this paradigm for image processing and for the modelling of visual attention.

This chapter is organized as follows. In section 5.2 we describe the Swarm Intelligence paradigm, its properties, and the difficulties in implementing algorithms within this paradigm. In section 5.3 we describe the new coordination model, which is


biologically inspired by the strategy adopted by some species of ants to collectively transport a prey to the nest. In section 5.4 we describe the implementation of the model in an algorithm for image alignment and matching. In section 5.5 we describe the results of the initial experimentation with the algorithm on real data.

5.2 Swarm Intelligence

The term Swarm Intelligence (SI) was introduced for the first time in 1989 in [78] to denote specific cellular robotics systems. The first general and commonly accepted definition appeared in 1999 in [79], where Swarm Intelligence is defined as the property of a system whereby the collective behaviours of unsophisticated agents, interacting locally with their environment, cause coherent functional global patterns to emerge.

The previous definition contains the following important points:

• the agents are very simple and do not communicate with each other directly;

• communications go through the environment, which influences each agent;

• the functional behaviour represents an emergent property, i.e. a property that does not characterize any single element of the system.

The previous definition of intelligence is totally different from the definition of ‘intelligent agent’ given by classical Artificial Intelligence (AI). In both strong AI and weak AI, intelligence characterizes a single individual. In SI, on the contrary, the agents are simple and collaborate through a complex environment. None of the agents is intelligent, but the result of their collaboration is [80].

According to the previous observation, it is clear that an SI system cannot be designed as a classical intelligent system. In fact, the agents must be designed according to simpler guidelines, while the environment is complex: the complexity characterizing classical intelligent agents is transferred into the environment. None of the pioneering works previously cited imposes specific constraints on the implementation of agents and environments: agents must be as simple as possible and must interact through a complex environment.

5.2.1 Swarm Intelligence Properties

Many algorithms in the SI paradigm are inspired by the social behaviour of simple creatures [79], in particular ants and bees. Two typical global behaviours of ant colonies are generally used to describe swarm intelligence concepts.

The first behaviour concerns the search for food. Ants are blind but are able to communicate through the environment and collaborate to guide the entire group toward a food source. Every ant initially moves around without any specific destination. Once it has detected a food source, the ant releases pheromone into the air, a chemical substance able to attract the other ants. The quantity of


pheromone released into the air is proportional to the size of the food source: if the food source is small, the ant will try to attract only a few ants; otherwise, it will try to attract a large number. The ants attracted by the pheromone released by the first ant, while approaching the food source, release more pheromone, generating a snowball effect that terminates only when the source is exhausted. When the ants move away from the food source, the pheromone evaporates and no more ants are attracted to that location [79, 81].

The second behaviour concerns social life in the ant colony. Ants try to spatially organize their eggs according to their dimension. They tend to create homogeneous groups and to arrange the groups so as to leave the larger eggs (larvae) in the periphery. Such grouping is performed without any centralized control and depends only on the ability of a single ant to distinguish smaller from larger groups and to compare the carried egg with those already left in a group [79].

In this chapter we do not characterize the Swarm Intelligence paradigm in full detail, but list only the basic properties that have strongly inspired, and that permeate, the algorithm we propose. An important property is called stigmergy: it refers to a class of multiagent coordination mechanisms based on the exchange of information through a shared environment [82].

Since the single agents must be as simple as possible, they do not have global knowledge of the environment: at any given instant in time, they only know their location and their surroundings. Their abilities should be limited; in particular, they can modify the environment only locally. The local modifications to the environment are the only way they have to send signals to each other. The alterations to the environment, even if they affect small regions, are amplified through a feedback process. For example, the pheromone released into the environment when an ant finds a prey activates a sequence of events that will force other ants to release pheromone. The quantity of pheromone increases as long as the food source is not exhausted.

5.2.2 Designing swarm intelligence algorithms

The design of an algorithm inspired by the principles of Swarm Intelligence is quite different from the standard approach. In fact, the reductionist approach, where a problem is subdivided into many subproblems, each one solved by an agent/algorithm, cannot be applied to Swarm Intelligence. According to Bonabeau [79], “swarm-intelligent systems are hard to program, because the paths to problem solving are not predefined but emergent in these systems and result from the interactions among individuals and their environment as much as from the behaviours of the individuals themselves. Therefore, using a swarm-intelligent system to solve a problem requires a thorough knowledge not only of what individual behaviours must be implemented, but also of what interactions are needed to produce such or such global behaviours”.

Despite some works [83, 84, 85], there is a lack of general theories and programming methodologies in the Swarm Intelligence field. These difficulties have induced researchers to look to biological phenomena for inspiration. For example, Ant Colony


Optimization (ACO) and Particle Swarm Optimization (PSO) are optimization techniques inspired by the coordination mechanisms used by, respectively, ants gathering food [79] and birds flocking [80].

5.2.3 Collective Prey Retrieval

The biological inspiration of our algorithm is to be found in ants carrying a prey toward their nest. In order to overcome the limits imposed by their small size and limited capabilities, many species of ants have evolved collaborative strategies. The carriage of a large prey to the nest is an example of such a process.

Some species have specialized workers able to cut the prey into small pieces that a single ant can carry, while other species are able to collectively transport large preys. Experimental results show that the latter strategy, called collective transportation, is the more efficient one [79]. The species with the most interesting strategies are Pheidole crassinoda, Myrmica rubra, and Myrmica lugubris. They exhibit the same behavioural patterns in both solitary and group transport [86].

A high-level description of collective prey retrieval is summarized below:

1. When an ant finds a prey, it tries to carry it.

2. If the ant does not succeed in moving the prey, it tries to drag it in various directions (realignment behaviour).

3. If the prey does not move, the ant grasps the prey differently, then again tries to drag it in various directions.

4. If the prey still does not move, the ant starts recruiting nest mates. First, it releases a secretion into the air in order to attract nearby ants (short-range recruitment). If the number of recruited ants is not enough to move the prey, the ant goes back to the nest, leaving a pheromone trail on the ground. Such a trail will lead other ants to the prey (long-range recruitment). The recruitment phase stops as soon as the group is able to move the prey.

Resistance to traction represents a positive feedback mechanism. As specified at point 4, recruitment stops when the resistance to traction ends and the prey starts moving. While moving toward the nest, coordination among the ants occurs through the prey itself. A change in the force applied by a single ant modifies the stimuli perceived by the other ants (which react accordingly). Such a coordination strategy is an example of stigmergy.


5.3 A New Coordination Mechanism Based on Collective Prey Retrieval

The new coordination model is an extension of the last phase of the collective prey retrieval strategy: the coordination of forces during the transportation of the object to the nest.

It introduces two significant differences with respect to the model described in the previous section:

• In collective prey retrieval, each ant tries to carry the prey to the same destination (the nest). In our model each agent has its own destination for the prey. The group must move in the direction indicated by the majority of its agents.

• The biological prey is considered a rigid body. A force applied to a rigid body is perceived instantaneously by all the carrying ants. However, the inclusion of a similar propagation mechanism in a model would lead to an unacceptable level of complexity: either the agents or the prey would have to be equipped with a broadcasting mechanism. In the democratic transportation model, the application of a force by an agent is notified to its neighbouring agents only and is perceived by all the agents after some instants. Such delayed propagation corresponds to considering the prey a non-rigid body. In section 5.3.2 we show how this modification still allows the coordination of the agents, while keeping the prey and agent models simple.

5.3.1 Model description

In order to obtain the democratic collective transportation model, the biological model is modified as follows:

1. Each agent constantly applies a force Vp on the prey toward its preferred destination.

2. The intensity of the applied force is inversely proportional to the angle between Vp and the direction Vg chosen by the majority of agents. The idea is to favour the influence of those Vp contributing in the “right” direction and to penalize those Vp trying to move the object in “wrong” directions.

3. The direction Vg of the majority of the ants is estimated by each agent simply by looking at the movements of the object in the previous time steps.

In the following we outline the functions needed for a formal description of the democratic collective transportation model.


p denotes a generic agent of the system. The agents are grouped into a set P. p(t) is the position of agent p at time t. Each p has initial velocity 0 and moves according to Fn and Fp.

Fn is the force propagating the individual forces applied to the item being transported. Given two agents p and q, Fn is obtained by constraining each agent to keep its initial distance from its neighbours at every time step:

\[
F_n(p) = c \cdot \sum_{q \in P} \frac{q(t) - p(t)}{\|q(t) - p(t)\|_2} \cdot \left( \|q(t) - p(t)\|_2 - \|q(0) - p(0)\|_2 \right) \cdot \delta(p, q), \qquad 0 \le c \le 1 \tag{5.1}
\]

The first factor of eq. (5.1) is the versor from agent p to agent q. The second factor is the gap between the current and the initial distance between p and q. The function δ indicates whether p and q are to be considered neighbours:

\[
\delta(p, q) = \begin{cases} 1 & \text{if } q \in \mathrm{neigh}(p) \\ 0 & \text{if } q \notin \mathrm{neigh}(p) \end{cases} \tag{5.2}
\]

The function neigh(p) determines the initial disposition of the agents. For instance, if the disposition of the agents formed a square lattice, neigh could be

\[
\mathrm{neigh}(p) = \left\{\, q \in P : \|p(0) - q(0)\|_2 \le \sqrt{2} \,\right\} \tag{5.3}
\]
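As an illustrative NumPy sketch, with agent positions stored in an (N, 2) array, eqs. (5.1)-(5.3) could be transcribed as follows (the function names and the numerical tolerance are assumptions made for the example):

import numpy as np

def neighbour_mask(p0):
    """neigh(p) for an initial square-lattice disposition, eq. (5.3):
    q is a neighbour of p when their initial distance is <= sqrt(2)."""
    d0 = np.linalg.norm(p0[:, None, :] - p0[None, :, :], axis=-1)
    mask = (d0 <= np.sqrt(2) + 1e-9) & (d0 > 0)
    return mask, d0

def propagation_force(pt, d0, mask, c=0.49):
    """F_n of eq. (5.1): pull each agent back toward its initial
    distance from its neighbours (the non-rigid prey)."""
    diff = pt[None, :, :] - pt[:, None, :]        # q(t) - p(t)
    dist = np.linalg.norm(diff, axis=-1)
    versor = diff / np.maximum(dist, 1e-12)[..., None]
    gap = (dist - d0)[..., None]                  # current minus initial distance
    return c * np.sum(versor * gap * mask[..., None], axis=1)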

Fp controls the velocity of agent p moving in its preferred direction. Fp can be expressed as:

\[
F_p(p(t)) = \begin{cases} F_p(p(t-1)) + V_p \cdot \lambda \, (V_p \cdot V_g(p)) & \text{if } t > 0 \\ 0 & \text{if } t = 0 \end{cases} \tag{5.4}
\]

Fp and Vp have the same orientation and direction. The modulus of Fp is inversely proportional to the angle between the direction chosen by the agent, i.e. the versor Vp, and the direction chosen by the majority of agents, i.e. Vg. In order to obtain the exact value of Vg, we should know the state of each agent at every iteration, but such an assumption violates SI principles. An estimate V̂g of Vg can be obtained by using local information only, namely by comparing the current position of an agent with its position at time t − k:

\[
\hat{V}_g(p) = \frac{p(t) - p(t-k)}{\|p(t) - p(t-k)\|_2} \tag{5.5}
\]

For eq. (5.5) to be consistent for every t, it is assumed that p(−1) = p(−2) = · · · = p(−k) = 0. In the first k iterations each p therefore receives a positive feedback from the system.


The position of agent p at time t is expressed as follows:

\[
p(t) = p(t-1) + F_n(p(t)) + \sum_{j=0}^{t} F_p(p(j)) \tag{5.6}
\]
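Putting eqs. (5.4)-(5.6) together, one simulation step might look like the sketch below, which reuses propagation_force from the previous sketch; the recursive accumulation of Fp follows eq. (5.4), and the cumulative sum of eq. (5.6) is folded into the carried fp term, a simplification of the formulation above (all names and parameter values are illustrative):

import numpy as np

def step(pt, history, fp_prev, vp, d0, mask, lam=0.06, k=3, c=0.49):
    """One update of the democratic collective transportation model.
    history[-k] holds each agent's position k iterations ago; vp holds
    the preferred-direction versors, one row per agent."""
    # Local estimate of the majority direction, eq. (5.5).
    delta = pt - history[-k]
    norm = np.linalg.norm(delta, axis=1, keepdims=True)
    vg_hat = delta / np.maximum(norm, 1e-12)
    # Preferred-direction force, eq. (5.4): reward agreement with Vg.
    agreement = np.sum(vp * vg_hat, axis=1, keepdims=True)  # Vp . Vg
    fp = fp_prev + vp * lam * agreement
    # Position update in the spirit of eq. (5.6); the cumulative
    # preferred-direction contribution is carried in fp.
    new_pt = pt + propagation_force(pt, d0, mask, c) + fp
    return new_pt, fp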

In order to ease the description, we will identify agents with points of the item tobe transported.

5.3.2 Model Validation

The democratic collective transportation model has been validated through a series of simulations. The typical result of the simulations is reported in figure 5.1. In this case the simulation was run for 85 iterations with a population of 900 agents and the parameters set as follows: c = 0.49, λ = 0.06, k = 3. Moreover, the velocity of each agent was in the range [0, 0.24]. The Vp are versors selected at random in [0, . . . , 1]. After 50 iterations each p reselected its preferred destination.

The main issue for the model to be consistent is the confidence of the V̂g estimates. In order to verify such estimates, we used the following error measure:

\[
E = \sum_{p \in P} \left( 1 - \hat{V}_g(p) \cdot V_g(p) \right) \tag{5.7}
\]

As the bottom right box of figure 5.1 shows, the sum of the errors rapidly decreases to 0 (the peak at iteration 50 is due to the new selection of the preferred destinations). The slope of E depends on the percentage of agents willing to move in the direction of the majority. When the preferred directions are chosen according to a uniform distribution, the simulations show that a 20% majority is enough to move the whole set of agents in the same direction. The model is very scalable: we ran simulations with up to 10000 agents with no loss of performance.

5.4 An Algorithm for Image Alignment and Matching

In this section we propose an algorithm for image alignment based on the democratic collective transportation model.

5.4.1 Image Alignment

Image alignment is defined as the problem of finding an optimal spatial alignment of two images of the same scene/object taken under different conditions. For example, two images of the same object taken at different times, from different points of view,


Figure 5.1: A simulation of the democratic collective transportation model. Onthe top left box the initial position of the agents is represented. The top rightbox represents the direction chosen by the majority of agents: during the first 50iterations, 34% of the agents chose to go toward South-East, while in the last 35iterations the 24% of agents chose the direction North-East. In the bottom leftbox the final position of the agents is shown (agents have effectively moved in thedirection chosen by majority of them). The bottom right box represents the sum ofagent errors in estimating Vg at each iteration (eq.5.7).

with different modalities [87]. Formally, an image is considered a bi-dimensional function:

\[
I : [0, \ldots, n] \times [0, \ldots, m] \longrightarrow [0, \ldots, \mathrm{MaxGreyValue}]
\]

where MaxGreyValue is the maximum grey value of the pixels, and n and m are, respectively, the number of rows and the number of columns of I. Image alignment is the problem of finding an optimal transformation ω that minimizes the dissimilarities between an input image Iinput and a target image Itarget. The degree of dissimilarity is measured by a cost function f:

\[
\omega_{\min} = \operatorname*{arg\,min}_{\omega \in \Omega} f(\omega(I_{\mathrm{input}}), I_{\mathrm{target}}) \tag{5.8}
\]


In some cases the differences between the two images should not be corrected, since they might contain relevant information. For example, the diagnosis obtained from some image-based medical examinations relies on the differences between two images acquired at different times. Any registration algorithm should correct all the differences caused by misalignment and preserve all the others. A detailed description of the image alignment problem and an overview of classical and new approaches can be found in [88, 87].

As eq. (5.8) suggests, image alignment can be seen as an optimization problem, where Ω is a family of functions differing only in a set of parameters. Classical optimization techniques, as well as popular swarm intelligence methods such as Particle Swarm Optimization [89] and Ant Colony Optimization [90], have been applied to the image alignment problem. Such methods require a global cost function (or error function) to drive the system toward an optimal choice of the parameters of ω. The algorithm we propose does not use a global cost function: each agent has its own local cost function.
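For contrast, a conventional global-cost formulation of eq. (5.8), restricted here to pure integer translations with a sum-of-squared-differences cost, can be sketched as follows; this is a baseline illustration of the optimization view, not the algorithm proposed in this chapter:

import numpy as np

def ssd_cost(moved, target):
    """Global dissimilarity f: sum of squared grey-value differences."""
    return float(np.sum((moved.astype(float) - target.astype(float)) ** 2))

def best_translation(i_input, i_target, max_shift=10):
    """Exhaustive search over integer translations omega = (dy, dx)
    minimizing the global cost (border wrap-around is ignored here)."""
    best, best_cost = (0, 0), np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            moved = np.roll(np.roll(i_input, dy, axis=0), dx, axis=1)
            cost = ssd_cost(moved, i_target)
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best

The swarm-based algorithm described below replaces this single global evaluation with per-agent local decisions.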

5.4.2 Description of the Algorithm

Before describing the algorithm in detail, we sketch its relationship with the model of section 5.3.1:

1. Iinput, the image to be registered, is considered the object that has to be moved.

2. Pixels of Iinput are considered as points (and therefore agents) moving in a bi-dimensional space. Each agent has 8 neighbours, corresponding to the neighbourhood of the corresponding pixel in Iinput.

3. The application of a force on the object to be transported causes the pixels of Iinput to move.

4. Each agent p has a set Dest(p) of possible destinations, corresponding to the coordinates of the points of Itarget that are similar, according to eq. (5.10), to p.

5. Each agent selects a point in Dest(p) and tries to move itself toward it.

The functions described in section 5.3.1 are modified as follows:

p is a generic agent of the system. The pixels of Itarget are grouped into a set O. At t = 0 the agents in P form a grid of points. A function Color stores the grey value of the pixels.

Fp modifies Iinput in order to make it as similar as possible to Itarget. The idea is to let regions of Iinput with a high gradient be attracted by the corresponding regions of Itarget. The only differences with respect to the democratic transportation model are


in the computation of Vp. Each p ∈ P has an associated set of pixels Dest(p), composed of the pixels of Itarget that are similar (see eq. (5.10)) to p:

\[
\mathrm{Dest}(p) = \{\, q \in O \mid \mathrm{sim}(p, q) \ge d_{\mathrm{sim}} \,\} \tag{5.9}
\]

where d_sim is the similarity threshold. The similarity function used is:

\[
\mathrm{sim}(p, q) = |\mathrm{Color}(p) - \mathrm{Color}(q)| + |\nabla p - \nabla q| \tag{5.10}
\]

where ∇p is the gradient of the image I at the coordinates (px, py).

\[
\nabla p = \nabla I(x, y) = \left\| \left( I(x, y) - \sum_{j=y-1}^{y+1} I(x-1, j),\;\; I(x, y) - \sum_{i=x-1}^{x+1} I(i, y-1) \right) \right\|_2
\]
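A direct transcription of eqs. (5.9) and (5.10), together with the gradient measure above, might read as follows (boundary pixels are skipped for brevity; all names are illustrative):

import numpy as np

def grad(img, x, y):
    """|nabla I(x, y)| as defined above: the 2-norm of the differences
    between I(x, y) and sums over the previous row / previous column."""
    gx = img[x, y] - img[x - 1, y - 1:y + 2].sum()
    gy = img[x, y] - img[x - 1:x + 2, y - 1].sum()
    return np.hypot(gx, gy)

def sim(i_input, p, i_target, q):
    """Eq. (5.10): grey-value difference plus gradient difference."""
    return (abs(float(i_input[p]) - float(i_target[q]))
            + abs(grad(i_input, *p) - grad(i_target, *q)))

def dest(i_input, p, i_target, d_sim):
    """Eq. (5.9), as stated in the text: candidate destinations of p."""
    h, w = i_target.shape
    return [(x, y) for x in range(1, h - 1) for y in range(1, w - 1)
            if sim(i_input, p, i_target, (x, y)) >= d_sim]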

Each p tries to reach a position corresponding to an element of Dest(p), stored by the function CurrentDestination(p). CurrentDestination(p) is modified every g iterations according to the probability density ρ defined as:

\[
\rho(p, q) = \left[ \sum_{o \in \mathrm{Dest}(p)} \frac{1}{1 + \|p(t) - o\|_2} \right]^{-1} \cdot \frac{1}{1 + \|p(t) - q\|_2} \tag{5.11}
\]

By reselecting CurrentDestination(p) every g iterations, the system explores more solutions. Since the closest destinations are preferred in the selection process, once a good solution has been found each agent tends, with high probability, to keep returning to the same point. Vp is the normalized vector connecting p with its current destination:

\[
V_p = \frac{\mathrm{CurrentDestination}(p) - p(t)}{\|\mathrm{CurrentDestination}(p) - p(t)\|_2}
\]

The dynamics of the algorithm push the majority of the Iinput pixels in the direction of their current preferred destinations. With high probability, Iinput will move to a position ‘satisfying’ the majority of the agents. We hypothesize that this position is the one with the highest probability of correctly aligning the image.
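The destination reselection of eq. (5.11) amounts to sampling a destination with probability inversely related to its distance from the agent; a sketch (reselect_destination is a hypothetical name):

import numpy as np

def reselect_destination(pos, dest_list, rng=np.random.default_rng()):
    """Sample CurrentDestination(p) from Dest(p) with probability
    rho(p, q) proportional to 1 / (1 + ||p(t) - q||_2), eq. (5.11)."""
    dest_arr = np.asarray(dest_list, dtype=float)
    weights = 1.0 / (1.0 + np.linalg.norm(dest_arr - np.asarray(pos, float), axis=1))
    probs = weights / weights.sum()   # the normalizing factor in eq. (5.11)
    return dest_list[rng.choice(len(dest_list), p=probs)]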

5.5 Experimentation

The algorithm has been tested on Magnetic Resonance images of the human brain. The image Itarget was obtained from the original image by removing the background and keeping only the relevant regions. In order to test the effectiveness of the algorithm, we obtained Iinput by translating Itarget toward the South-East and adding noise. Each row in fig. 5.2 represents a complete execution of the algorithm on noisy test images.


In the first row, 45% salt & pepper noise was added to Iinput. In the second row, 16% speckle noise was added. In the last row, 16% speckle noise and 35% salt & pepper noise were added. The last image in each row represents the final result of the algorithm. The results clearly show that the algorithm corrects only the differences caused by the translation. The algorithm is able to correctly align the two images, even when the input image is a translated and highly corrupted version of the target one. In fig. 5.3 an experiment on the image matching problem is shown: the algorithm finds the correct location of an image patch inside a given image.

The algorithm described here differs from classical population-based optimization techniques such as genetic algorithms (GA), ACO, and PSO. In GA, ACO, and PSO, at each iteration every agent proposes a complete solution to the problem. The best solutions are then selected and influence the creation of the solutions in the subsequent iterations. Such approaches require a global cost function able to evaluate how good each proposed solution is. In our approach, only one solution is generated at every iteration. There is no need for a global cost function: each agent uses a local cost function, which is much simpler than common global cost functions. The system is able to discard the contribution of those agents whose cost function would lead to a poor solution and to promote those agents whose cost function would increase the quality of the solution.

5.6 Conclusions

In this chapter we have introduced a new model of multiagent coordination named the democratic collective transportation model. The model is inspired by some species of ants that are able to collectively carry a prey to the nest. Based on this model, we implemented a new image processing algorithm that has been applied to image registration and image matching. The algorithm is able to perform a correct rigid registration even on highly noisy or corrupted images.


Figure 5.2: Example of the execution of the algorithm. In every row, from left to right: Iinput, Itarget, the differences between Iinput and Itarget, the output of the algorithm (the aligned image), and the difference between Itarget and the output of the algorithm.

Figure 5.3: Example of the application of our algorithm to the image matching problem. In this case the goal is to find the location of the patch Iinput in Itarget. From left to right: Iinput, Itarget, the output of the algorithm, and the differences between Itarget and the estimated location of Iinput in Itarget. The black box means that the algorithm was able to correctly locate the patch in Itarget.


Chapter 6

Conclusions

The work presented in this thesis has been motivated by the attempt to build a computational model of vision able to perform tasks, even simple ones, in an image-independent way. Many image processing and analysis algorithms, even those relying on strong mathematical bases, are often dedicated to specific classes of images.

A strong biological inspiration from the human visual system can be useful in the design and implementation of more general algorithms, able to perform a possibly small set of operations, but without the fine tuning needed by classical algorithms.

In this thesis, two biologically-inspired approaches have been presented: a computational model of visual attention inspired by the human visual system, and a swarm-intelligent algorithm able to perform rigid image registration and image matching.

The computational model of visual attention has been designed to include not only the bottom-up component, but also the top-down, task-dependent influences. Its major points of strength are:

• a rich encoding of the input stimulus. By using single-opponent channels instead of the more common double-opponent channels, the model is able to produce different representations for different objects. As discussed in chapter 4, double-opponent channels do not selectively distinguish between the two opponent features.

• The use of single-opponent channels has forced us to propose a novel normalization of the feature maps for the construction of the conspicuity maps. In fact, stimuli that have equal representations in double-opponent channels now differ.

• A pyramid of saliency levels gives our model much more information than a single feature map. In particular, by using a pyramidal organization we can adopt the zoom-lens metaphor (combined with the spotlight one) and inspect the image at various levels of detail. Furthermore, attention can be


deployed to homogeneous, and possibly meaningful, regions by inspecting the saliency map at the correct level.

In our model object learning is possible, even if with some limitations. The model operates in a single-shot learning modality and represents an object by its most salient regions. The model is not able to inductively learn a class of objects. The object template is then used in visual search tasks by sequentially modulating the map for each of its features.

This last stage, however, has some flaws. When top-down attention is deployed, it illuminates a wide area, because the actual scale of the object is not known. In the non-inhibited areas, there can be many features on which the top-down attentional modulation could reach its strongest response. Furthermore, when we code object parts, we study and extract properties from regions, while the attentional modulation performs its computations on single pixels. The problem could be solved by simply segmenting regions and studying the maximal and minimal values belonging to them. Regions could then be analyzed based on their saliency. This last point represents another weakness of the approach: the bottom-up saliency and the top-down influences are in fact supposed to compete for attention, whereas in our model the top-down modulation inhibits the saliency map everywhere.

The biological inspiration of the second algorithm presented in this thesis comes from a different field. Our research has also considered the Swarm Intelligence paradigm for the design of a visual attention system. Some insects, like bees, are able to visually communicate the presence of flowers, which, in their world, represent salient regions. Furthermore, their memory allows them to return to, and thus recognize, previously visited locations. However, our initial efforts resulted in an algorithm for image registration based on the collective behaviour of ants. In particular, we proposed a new model and a new algorithm inspired by the strategy adopted by some species of ants for carrying their prey to the nest. In the algorithm, each ant carries a pixel and has its own preferred destination. It tries to carry the pixel to the chosen location, but it is pulled by the majority of the other ants. After a few iterations, all the ants move toward the correct destination. The algorithm has been tested on real images for image registration tasks and for image (patch) matching, which can be considered a sort of visual search. Rigid registration is a problem for which an exact solution can be found. However, the algorithm, inspired by the new agent coordination model we introduced, can register images highly corrupted by noise. Furthermore, without any additional effort, it is able to locate an image patch in a larger region.

There are several ways in which we plan to expand our work in the future. First of all, the model needs to be tested on large sets of images in order to evaluate how good the bottom-up part is. The top-down component needs to be extended to perform region-based operations, avoiding the problem described above, and, above all, a competition between the two types of attention, bottom-up and top-down, must be introduced into the model. The object recognition capabilities


could be extended by introducing additional layers that model cells in higher neural areas.


Bibliography

[1] D.H. Hubel. Eye, Brain, and Vision, volume 22 of Scientific American Library. W. H. Freeman & Co., 1988.

[2] C. Koch and S. Ullman. Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology, 4:219–227, 1985.

[3] J.M. Wolfe. Guided Search 2.0: A revised model of visual search. Psychonomic Bulletin & Review, 1(2):202–238, 1994.

[4] R. Milanese, H. Wechsler, S. Gil, J.-M. Bost, and T. Pun. Integration of bottom-up and top-down cues for visual attention using non-linear relaxation. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '94), pages 781–785, 1994.

[5] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, 1998.

[6] C. Siagian and L. Itti. Rapid biologically-inspired scene classification using features shared with visual attention. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(2):300–312, 2007.

[7] F.H. Hamker. The emergence of attention by population-based inference and its role in distributed processing and cognitive control of vision. Computer Vision and Image Understanding, 100:64–106, 2005.

[8] Y. Sun and R. Fisher. Object-based visual attention for computer vision. Artificial Intelligence, 146:77–123, 2003.

[9] A. Wirth, G. Cavallacci, and F. Genovesi-Ebert. The advantages of an inverted retina. A physiological approach to a teleological question. Developments in Ophthalmology, 9:20–28, 1984.

[10] W.S. Kuffler. Discharge patterns and functional organization of mammalian retina. Journal of Neurophysiology, 16:37–68, 1953.

[11] S.E. Palmer. Vision Science: Photons to Phenomenology. The MIT Press, 1999.


[12] B.A. Wandell. Foundations of Vision. Sinauer Associates, Inc., 1995.

[13] W. James. The Principles of Psychology. New York: Holt, 1890.

[14] B.J. Scholl. Objects and attention: the state of the art. Cognition, 80(1–2):1–46, 2001.

[15] M.I. Posner. Chronometric Explorations of Mind. Erlbaum, 1978.

[16] C.W. Eriksen and J.D. St. James. Visual attention within and around the field of focal attention: A zoom lens model. Perception & Psychophysics, 40(4):225–240, 1986.

[17] K.R. Cave and N.P. Bichot. Visuospatial attention: Beyond a spotlight model. Psychonomic Bulletin & Review, 6(2):204–223, 1999.

[18] J.R. Stroop. Studies on interference in serial verbal reactions. Journal of Experimental Psychology, 18:643–662, 1935.

[19] W.R. Garner. The Processing of Information and Structure. Erlbaum, 1974.

[20] A. Treisman and G. Gelade. A feature-integration theory of attention. Cognitive Psychology, 12(1):97–136, 1980.

[21] S. Grossberg, E. Mingolla, and W.D. Ross. A neural theory of attentive visual search: Interactions of boundary, surface, spatial and object representations. Psychological Review, 101(3):470–489, 1994.

[22] J.A. Feldman and D.H. Ballard. Connectionist models and their properties. Cognitive Science, 6:205–224, 1982.

[23] J.M. Wolfe, K.R. Cave, and S.L. Franzel. Guided Search: An alternative to the feature integration model for visual search. Journal of Experimental Psychology: Human Perception and Performance, 15(3):419–433, 1989.

[24] J.M. Wolfe and G. Gancarz. Basic and Clinical Applications of Vision Science, chapter Guided Search 3.0: A model of visual search catches up with Jay Enoch 40 years later, pages 189–192. Kluwer Academic Press, 1999.

[25] J.M. Wolfe. Guided Search 4.0: A guided search model that does not require memory for rejected distractors. Journal of Vision, 1(3):349–349, 2001.

[26] J.M. Wolfe and T.S. Horowitz. What attributes guide the deployment of visual attention and how do they do it? Nature Reviews Neuroscience, 5:1–7, 2004.

[27] R. Desimone and J. Duncan. Neural mechanisms of selective visual attention. Annual Review of Neuroscience, 18:193–222, 1995.


[28] R. Desimone. Visual attention mediated by biased competition in extrastriate cortex. Philosophical Transactions of the Royal Society of London, 353(1373):1245–1255, 1998.

[29] J. Duncan. Converging levels of analysis in the cognitive neuroscience of visual attention. Philosophical Transactions of the Royal Society of London, 353(1373):1307–1317, 1998.

[30] J.K. Tsotsos. Analyzing vision at the complexity level. Behavioral and Brain Sciences, 13:423–469, 1990.

[31] J.K. Tsotsos, S.M. Culhane, W.Y.K. Wai, Y. Lai, N. Davis, and F. Nuflo. Modeling visual attention via selective tuning. Artificial Intelligence, 78:507–545, 1995.

[32] V. Navalpakkam and L. Itti. Modeling the influence of task on attention. Vision Research, 45:205–231, 2005.

[33] F.H. Hamker. Modeling feature-based attention as an active top-down inference process. BioSystems, 86:91–99, 2006.

[34] T.D. Sanger. Probability density estimation for the interpretation of neural population codes. Journal of Neurophysiology, 76:2790–2793, 1996.

[35] K.R. Cave. The FeatureGate model of visual selection. Psychological Research, 62(2–3):182–194, 1999.

[36] G. Deco and J. Zihl. A neurodynamical model of visual attention: feedback enhancement of spatial resolution in a hierarchical system. Journal of Computational Neuroscience, 10:231–253, 2001.

[37] G. Deco and E.T. Rolls. A neurodynamical cortical model of visual attention and invariant object recognition. Vision Research, 44:621–642, 2004.

[38] K.W. Lee, H. Buxton, and J. Feng. Cue-guided search: a computational model of selective attention. IEEE Transactions on Neural Networks, 16(4):910–924, 2005.

[39] S. Minut and S. Mahadevan. A reinforcement learning model of selective visual attention. In Proc. of the AGENTS '01 Conference, pages 457–464, 2001.

[40] I.A. Rybak, V.I. Gusakova, A.V. Golovan, L.N. Podladchikova, and N.A. Shevtsova. A model of attention-guided visual perception and recognition. Vision Research, 38:2387–2400, 1998.


[41] J.-M. Hopf, C.N. Boehler, S.J. Luck, J.K. Tsotsos, H.-J. Heinze, and M.A. Schoenfeld. Direct neurophysiological evidence for spatial suppression surrounding the focus of attention in vision. Proceedings of the National Academy of Sciences, 103(4):1053–1058, 2006.

[42] R.L. De Valois and K.K. De Valois. Spatial Vision. Oxford University Press, 1988.

[43] E.H. Adelson, C.H. Anderson, J.R. Bergen, P.J. Burt, and J.M. Ogden. Pyramid methods in image processing. RCA Engineer, 29(6):33–41, 1984.

[44] T. Pavlidis and S.L. Tanimoto. A hierarchical data structure for picture processing. Computer Graphics and Image Processing, 4:104–119, 1975.

[45] P.J. Burt and E.H. Adelson. The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31(4):532–540, 1983.

[46] H. Greenspan, S. Belongie, P. Perona, R. Goodman, S. Rakshit, and C. Anderson. Overcomplete steerable pyramid filters and rotation invariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '94), pages 222–228, 1994.

[47] M.W. Schwarz, W.B. Cowan, and J.C. Beatty. An experimental comparison of RGB, YIQ, LAB, HSV, and opponent color models. ACM Transactions on Graphics, 6(2):123–158, 1987.

[48] M. Sarifuddin and R. Missaoui. A new perceptually uniform color space with associated color similarity measure for content-based image and video retrieval. In Proceedings of the 28th Annual International ACM Conference on Research and Development in Information Retrieval, 2005.

[49] O. Le Meur, P. Le Callet, D. Barba, and D. Thoreau. A coherent computational approach to model bottom-up visual attention. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(5):802–817, 2006.

[50] M. Bollmann and B. Mertsching. Opponent color processing based on neural models. In Advances in Structural and Syntactical Pattern Recognition, pages 198–207, 1996.

[51] S. Süsstrunk, J. Holm, and G.D. Finlayson. Chromatic adaptation performance of different RGB sensors. IS&T / SPIE Electronic Imaging, 4300, 2001.

[52] W.S. Stiles and J.M. Burch. N.P.L. colour-matching investigation: Final report. Optica Acta, 6(1):1–26, 1959.


[53] D.L. Ruderman, T.W. Cronin, and C.-C. Chiao. Statistics of cone responses to natural images: implications for visual coding. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 15(8):2036–2045, 1998.

[54] E.J. Chichilnisky and B.A. Wandell. Trichromatic opponent color classification. Vision Research, 39:3444–3458, 1999.

[55] Y. Boykov, O. Veksler, and R. Zabih. A variable window approach to early vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1283–1294, 1998.

[56] D. Marr. Vision. Freeman Publishers, 1982.

[57] F. Kingdom, M. McCourt, and B. Blakeslee. In defence of "lateral inhibition" as the underlying cause of induced brightness phenomena. Vision Research, 37:1039–1043, 1997.

[58] P. Kruizinga and N. Petkov. Computational model of dot-pattern selective cells. Biological Cybernetics, 83:313–325, 2000.

[59] N. Petkov and W.T. Visser. Modifications of center-surround, spot detection and dot-pattern selective operators. Technical report, Institute for Mathematics and Computing Science, University of Groningen, The Netherlands, 2005.

[60] L. Itti and C. Koch. A comparison of feature combination strategies for saliency-based visual attention systems. In Proceedings of the SPIE Human Vision and Electronic Imaging Conference (HVEI '99), pages 473–482, 1999.

[61] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.

[62] A. Torralba. Modeling global scene factors in attention. Journal of the Optical Society of America A, 20(7):1407–1418, 2003.

[63] T. Jost, N. Ouerhani, et al. Assessing the contribution of color in visual attention. Computer Vision and Image Understanding, 100:107–123, 2005.

[64] M. Weber, M. Welling, and P. Perona. Unsupervised learning of models for recognition. In Proceedings of the European Conference on Computer Vision, pages 18–32, 2000.

[65] D.G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91–110, 2004.

[66] P. Dayan, S. Kakade, and P. Montague. Learning and selective attention. Nature Neuroscience, 3:1218–1223, 2000.

Page 108: Attention-based Object Detection - Semantic Scholar · 2017-05-14 · 2 The Human Visual System 5 ... strategy detects the most salient image region. Once a salient re-gion has been

96 BIBLIOGRAPHY

[67] H. Deubel. Localization of targets across saccades: role of landmark objects.Visual Cognition, 11:173–202, 2004.

[68] M. Li and J. Clark. Learning of position and attention-shift invariant recogni-tion across attention. In Proceedings of the International Workshop on Atten-tion and Performance in Computational Vision, pages 41–48, 2004.

[69] A. Oliva, A. Torralba, M.S. Castelhano, and J.M. Henderson. Top-down controlof visual attention in object detection. In Proceedings of the IEEE InternationalConference on Image Processing, pages 253–256, 2003.

[70] S. Mannan, K.H. Ruddock, and D.S. Wooding. Automatic control of saccadiceye movements made in visual inspection of briefly presented 2-d images. SpatialVision, 9(3):363–386, 1998.

[71] T. Wu, J. Gao, and Q. Zhao. A computational model of object-based selectivevisual attention mechanism in visual information acquisition. In Proceedings ofthe 2004 International Conference on Information Acquisition, pages 405–408,2004.

[72] L. Paletta, G. Fritz, and C. Seifert. Q-learning of sequential attention for visualobject recognition from informative local descriptors. In Proceedings of the 22ndInternational Conference on Machine Learning, pages 649–656, 2005.

[73] K. Chokshi, C. Panchev, S. Wermter, and J.G. Taylor. Knowing what andwhere: a computational model for visual attention. In Proceedings of the 2004IEEE International Conference on Neural Networks, pages 524–529, 2004.

[74] V. Ramos and F. Almeida. Artificial ant colonies in digital image habitats - amass behaviour effect study on pattern recognition. In Proceedings of the 2ndInternational Workshop on Ant Algorithms (From Ant Colonies to ArtificialAnts), (ANTS00), pages 113–116, 2000.

[75] Y. Owechko and S. Medasani. A swarm-based volition/attention frameworkfor object recognition. In Proceedings of the 2005 IEEE Computer SocietyConference on Computer Vision and Pattern Recognition (CVPR05), 2005.

[76] Y. Owechko and S. Medasani. Cognitive swarms for rapid detection of objectsand associations in visual imagery. In Proceedings of the 2005 IEEE ComputerSociety Conference on Computer Vision and Pattern Recognition (CVPR05),2005.

[77] Peng-Yeng Yin. Particle swarm optimization for point pattern matching. Jour-nal of Visual Communication and Image representation, 2006.

[78] G. Beni and J. Wang. Swarm intelligence. In Proc. of the Seventh annualmeetings of the Robotics Society of Japan, pages 425–428, 1989.

Page 109: Attention-based Object Detection - Semantic Scholar · 2017-05-14 · 2 The Human Visual System 5 ... strategy detects the most salient image region. Once a salient re-gion has been

6.0. BIBLIOGRAPHY 97

[79] E. Bonebeau, M. Dorigo, and G. Theraulaz. Swarm Intelligence: from naturalto artificial systems. Oxford University Press, 1999.

[80] J. Kennedy and R.C. Eberhart. Swarm Intelligence. Morgan Kauffmann Pub-lishers, 2001.

[81] D.R. Chialvo and M.M. Millonas. How swarms build cognitive maps. NATOASI Series, pages 439–450, 1995.

[82] H. Van Dyke Paranuk. Making swarming happen. In Proceedings of the Confer-ence on Swarming and Network Enabled Command, Control, Communications,Computers, Intelligence, Surveillance and Reconnaissance (C4ISR), 2003.

[83] M. Birattari, G. Di Caro, and M. Dorigo. Ant Algorithms, 3rd InternationalWorkshop ANTS 2002, chapter Toward the formal foundation of Ant program-ming. Number 2463 in LNCS. Springer Verlag, 2002.

[84] M. Mamei and F. Zambonelli. Co-fields: a physically inspired approach todistributed motion coordination. IEEE Pervasive Computing, 3(2):52–61, 2004.

[85] V.D.H. Parunak and S.A. Brueckner. Methodologies and Software Engineeringfor Agent Systems, chapter Multiagent systems, artifical societies, and simu-lated organizations. Springer US, 2006.

[86] Ronald C. Kube and Eric Bonabeau. Cooperative transport by ants and robots.Robotics and Autonomous Systems, 30(1/2):85–101, 2000.

[87] B. Zitova and J. Flusser. Image registration methods: a survey. Image andVision Computing, 21(11):977–1000, 2003.

[88] J. Maintz and M. Viergever. A survey of medical image registration. MedicalImage Analysis, 2(1):1–36, 1998.

[89] Mark P. Wachowiak, Renata Smolıkova, Y. Zheng, Jacek M. Zurada, andAdel Said Elmaghraby. An approach to multimodal biomedical image registra-tion utilizing particle swarm optimization. IEEE Transactions on EvolutionaryComputation, 8(3):289–301, 2004.

[90] Souham Meshoul and Mohamed Batouche. Ant colony system with extremaldynamics for point matching and pose estimation. In Proceedings of the 16 thInternational Conference on Pattern Recognition (ICPR’02) Volume 3 - Volume3, pages 823–826, 2002.