08.10.12 Artificial Intelligence and Cognition
DESCRIPTION
Drs Nick Hawes (Computer Science) and Jackie Chappell (Biosciences) presented on the topic of intelligence and how studies of natural and artificial systems can help each other.
TRANSCRIPT
Nick Hawes
Natural Cognition and Artificial Intelligence
http://www.cs.bham.ac.uk/~nah
What can AI learn from Biology?
“It is the science and engineering of making intelligent machines, especially intelligent computer programs. It is related to the similar task of using computers to understand human intelligence, but AI does not have to confine itself to methods that are biologically observable.”
John McCarthy
http://www-formal.stanford.edu/jmc/whatisai/
http://en.wikipedia.org/wiki/John_McCarthy_(computer_scientist)
[Slide: the perception / cognition / action loop, situated in the world]
[Slide: AI → Biology; Biology → AI]
[Slide: what? how? build? result?]
[Slide: the perception / cognition / action loop, situated in the world]
From Itti, Koch and Niebur, “A Model of Saliency-Based Visual Attention for Rapid Scene Analysis”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, November 1998:

Abstract—A visual attention system, inspired by the behavior and the neuronal architecture of the early primate visual system, is presented. Multiscale image features are combined into a single topographical saliency map. A dynamical neural network then selects attended locations in order of decreasing saliency. The system breaks down the complex problem of scene understanding by rapidly selecting, in a computationally efficient manner, conspicuous locations to be analyzed in detail.
Index Terms—Visual attention, scene analysis, feature extraction, target detection, visual search.
1 INTRODUCTION
PRIMATES have a remarkable ability to interpret complex scenes in real time, despite the limited speed of the neuronal hardware available for such tasks. Intermediate and higher visual processes appear to select a subset of the available sensory information before further processing [1], most likely to reduce the complexity of scene analysis [2]. This selection appears to be implemented in the form of a spatially circumscribed region of the visual field, the so-called “focus of attention,” which scans the scene both in a rapid, bottom-up, saliency-driven, and task-independent manner as well as in a slower, top-down, volition-controlled, and task-dependent manner [2].
Models of attention include “dynamic routing” models, in which information from only a small region of the visual field can progress through the cortical visual hierarchy. The attended region is selected through dynamic modifications of cortical connectivity or through the establishment of specific temporal patterns of activity, under both top-down (task-dependent) and bottom-up (scene-dependent) control [3], [2], [1].
The model used here (Fig. 1) builds on a second biologically-plausible architecture, proposed by Koch and Ullman [4] and at the basis of several models [5], [6]. It is related to the so-called “feature integration theory,” explaining human visual search strategies [7]. Visual input is first decomposed into a set of topographic feature maps. Different spatial locations then compete for saliency within each map, such that only locations which locally stand out from their surround can persist. All feature maps feed, in a purely bottom-up manner, into a master “saliency map,” which topographically codes for local conspicuity over the entire visual scene. In primates, such a map is believed to be located in the posterior parietal cortex [8] as well as in the various visual maps in the pulvinar nuclei of the thalamus [9]. The model’s saliency map is endowed with internal dynamics which generate attentional shifts. This model consequently represents a complete account of bottom-up saliency and does not require any top-down guidance to shift attention. This framework provides a massively parallel method for the fast selection of a small number of interesting image locations to be analyzed by more complex and time-consuming object-recognition processes. Extending this approach in “guided-search,” feedback from higher cortical areas (e.g., knowledge about targets to be found) was used to weight the importance of different features [10], such that only those with high weights could reach higher processing levels.
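The “internal dynamics which generate attentional shifts” amount to repeatedly picking the most salient location and then suppressing it so the focus can move on. A minimal sketch of that behaviour, with a plain iterative argmax plus inhibition of return standing in for the paper's winner-take-all neural network (the function name and parameters are illustrative, not from the paper):

```python
import numpy as np

def attend(saliency, n_fixations=3, ior_radius=1):
    """Scan a saliency map in order of decreasing saliency: pick the
    current maximum, record it, then inhibit its neighbourhood
    (inhibition of return) so attention shifts elsewhere."""
    s = saliency.astype(float)
    fixations = []
    for _ in range(n_fixations):
        y, x = np.unravel_index(np.argmax(s), s.shape)
        fixations.append((int(y), int(x)))
        y0, x0 = max(0, y - ior_radius), max(0, x - ior_radius)
        s[y0:y + ior_radius + 1, x0:x + ior_radius + 1] = -np.inf
    return fixations

sal = np.zeros((8, 8))
sal[1, 1], sal[5, 6], sal[3, 3] = 9.0, 7.0, 5.0
print(attend(sal))  # [(1, 1), (5, 6), (3, 3)]
```

Note that the fixations come out in decreasing order of saliency, exactly the scan order the abstract describes, even though the third-strongest peak lies close to the first.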
2 MODEL
Input is provided in the form of static color images, usually digitized at 640 × 480 resolution. Nine spatial scales are created using dyadic Gaussian pyramids [11], which progressively low-pass filter and subsample the input image, yielding horizontal and vertical image-reduction factors ranging from 1:1 (scale zero) to 1:256 (scale eight) in eight octaves.
Each feature is computed by a set of linear “center-surround” operations akin to visual receptive fields (Fig. 1): Typical visual neurons are most sensitive in a small region of the visual space (the center), while stimuli presented in a broader, weaker antagonistic region concentric with the center (the surround) inhibit the neuronal response. Such an architecture, sensitive to local spatial discontinuities, is particularly well-suited to detecting locations which stand out from their surround and is a general computational principle in the retina, lateral geniculate nucleus, and primary visual cortex [12]. Center-surround is implemented in the model as the difference between fine and coarse scales: The center is a pixel at scale c ∈ {2, 3, 4}, and the surround is the corresponding pixel at scale s = c + δ, with δ ∈ {3, 4}. The across-scale difference between two maps, denoted “⊖” below, is obtained by interpolation to the finer scale and point-by-point subtraction. Using several scales not only for c but also for δ = s − c yields truly multiscale feature extraction, by including different size ratios between the center and surround regions (contrary to previously used fixed ratios [5]).
2.1 Extraction of Early Visual Features
With r, g, and b being the red, green, and blue channels of the input image, an intensity image I is obtained as I = (r + g + b)/3. I is […]
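The intensity channel, the dyadic pyramid, and one across-scale center-surround difference can be sketched roughly as follows. This is a simplification: a 2×2 box filter stands in for the paper's Gaussian low-pass filtering, and nearest-neighbour upsampling stands in for its interpolation; the function names are illustrative.

```python
import numpy as np

def downsample(img):
    """One dyadic pyramid step: 2x2 block average, then subsample by 2
    (a box filter standing in for a Gaussian low-pass filter)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return (img[0::2, 0::2] + img[1::2, 0::2] +
            img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def center_surround(r, g, b, c=2, d=3):
    """Intensity I = (r + g + b)/3, a dyadic pyramid up to scale c + d,
    and one across-scale difference I(c) (-) I(s) with s = c + d: the
    coarse surround map is upsampled to the centre scale (nearest
    neighbour here) and subtracted point by point."""
    I = (r + g + b) / 3.0
    pyramid = [I]
    for _ in range(c + d):          # build scales 0 .. c+d
        pyramid.append(downsample(pyramid[-1]))
    center, surround = pyramid[c], pyramid[c + d]
    factor = 2 ** d
    up = np.kron(surround, np.ones((factor, factor)))  # upsample to finer scale
    up = up[:center.shape[0], :center.shape[1]]
    return np.abs(center - up)

flat = np.full((64, 64), 0.5)
print(center_surround(flat, flat, flat).max())  # 0.0: nothing stands out in a uniform image
```

A uniform image yields an all-zero map, since no location stands out from its surround; a small bright patch on a dark background survives at the scales matching its size, which is the point of computing the difference at several (c, δ) combinations.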
Fig. 1. General architecture of the model.
[Slide: what? how? build? result?]
Ales Leonardis, 2012
[Slides: photographs of indoor and street scenes annotated with object labels such as painting, lamp, sofa, table, chair, window, people, globe, bus, street lamp, microwave, washing machine, pipe, cupboard, plate, teapot, stand, laptop, books, poster, reflectors, birds, building, banner, bookcase, cabinet]
[Slide: hierarchical feature representation, Layer 1, Layer 2, Layer 3]
[…] six cortical cell layers and other features) [13]. The GNOSYS software uses a simplified model of this hierarchical architecture, with far fewer neurons and associated trainable synapses. The number of model neurons in each of the modules of Fig. 4 is shown in Table 1.
We tested cells in the various modules for their sensitivity to various stimulus inputs; the result for a cell in V4 is shown in Fig. 5.
The effects of attention feedback in the GNOSYS system are shown in Fig. 6.
As in all such neural models of attention, input is created from visual input by suitable photo-detectors whose current is then passed into a spatially topographic array of units sensitive to such current input. These latter can represent either a simplified retina or, in our case, the lateral geniculate nucleus of the thalamus (one cell layer up from the retina), as indicated in Table 1. This input then activates the most sensitive cells to that input, which then send the activity sequentially up various routes in the hierarchy (dorsal, ventral, colour in the GNOSYS case) shown in Fig. 4. Attention feedback then occurs from the highest level activity (from FEF or IFG, in Fig. 4). There is a similar feedback process in neural models of attention [14–16], with similar amplification (of target activations) and inhibition (of distractor activations) effects. One of the novelties of our work is the employment of attention so crucially in object recognition and other processes involving the components of GNOSYS towards solving the reasoning tasks. Its control structure also allows the attention system to be extended to the more general CODAM model [17–19], thereby allowing the introduction of a modicum of what is arguably awareness into the GNOSYS system.
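The multiplicative (sigma–pi) control of attention that recurs in this excerpt can be illustrated with a toy gain field: a top-down goal signal scales feedforward inputs before they are summed, amplifying target activations and inhibiting distractor activations. This is only a schematic illustration, not the GNOSYS implementation, and all names and values are made up for the example:

```python
import numpy as np

def sigma_pi_layer(x, W, gain):
    """Sigma-pi attention as a multiplicative gain field: each input x_j
    is scaled by a top-down gain g_j before the usual weighted sum,
    i.e. response_i = sum_j W_ij * g_j * x_j."""
    return W @ (gain * x)

x = np.array([1.0, 1.0])            # equally active target and distractor
W = np.eye(2)                       # identity weights keep the effect visible
baseline = sigma_pi_layer(x, W, np.ones(2))
attended = sigma_pi_layer(x, W, np.array([1.5, 0.5]))  # hypothetical goal signal
print(baseline, attended)           # [1. 1.] [1.5 0.5]
```

Without the goal signal the two inputs are indistinguishable downstream; with it, the target dominates, which is the amplification-plus-inhibition effect the excerpt attributes to the TPJ and SPL sigma–pi weights.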
In an earlier visual model of the ventral and dorsal streams, which did not include the recognition of colour and was smaller in size, we investigated the abilities of such a model with attention to help solve the problem of occlusion [20,21]. This model was trained to recognise three simple shapes (square, triangle and circle), by overlapping two shapes (a square and a triangle) to differing degrees. We investigated how attention applied to either the ventral or dorsal stream could aid in recognising that a square and a triangle were present. In the case of the ventral stream, attention was directed to a specific object, which in our case was […]
[Figure 4 labels: ventral modules LGN, V1, V2, V4, TEO, TE, IFG (with a no-goal variant), TPJ; dorsal modules LGN, V1, V5, LIP, FEF, FEF_2 (with a no-goal variant), SPL; object and spatial goal signals for Objects 1–2 and Spaces 1–2]
Fig. 4. The architecture of the hierarchical neural network used in the visual perception/concept simulation in the GNOSYS brain. There is a hierarchy of modules simulating the known hierarchy of the ventral route of V1 → V2 → V4 → TEO → TE → PFC(IFG) in the human brain. The dorsal route is represented by V1 → V5 → LIP → FEF, with a lateral connectivity from LIP to V4 to allow for linking the spatial position of an object with its identity (as known in the human brain). There are two sets of sigma–pi weights, one from TPJ in the ventral stream which acts on the inputs from V2 to V4, the other from SPL which acts on the V5 to LIP inputs. This allows for the multiplicative control of attention.
Table 1. Numbers of neurons in each of the modules of Fig. 4. The letters e and i denote excitatory and inhibitory, respectively.

Module name               Module shape                                        Total neurons
LGN (R + G + B + edge)    4 layers of 160 × 120 (e)                           76,800

Ventral (shape + colour)
V1                        400 × 180 (e)                                       72,000
V2                        200 × 120 (e + i)                                   48,000
V4                        200 × 120 (e + i)                                   48,000
TEO                       150 × 90 (e + i)                                    27,000
TE                        75 × 45 (e + i)                                     6,750
IFG                       7 × 10 (e + i), recognises 3 shapes + 3 colours     140
TPJ                       7 × 10 (e + i), recognises 3 shapes + 3 colours     140

Dorsal (spatial)
V1                        80 × 60 (e + i)                                     9,600
V5                        80 × 60 (e + i)                                     9,600
LIP                       80 × 60 (e + i)                                     9,600
FEF1                      40 × 30 (e + i)                                     2,400
FEF2                      40 × 30 (e + i)                                     2,400
SPL                       40 × 30 (e + i)                                     2,400
Source: J.G. Taylor et al., Image and Vision Computing 27 (2009) 1641–1657.
[Slide: the perception / cognition / action loop, situated in the world]
[Slide: legged locomotion design space, comparing an animal's multi-jointed legs (actuation + compliance + dissipation) with BigDog (actuation + dissipation); axis label: more compliance, less actuation]
[Slide: what? how? build? result?]
Jindrich and Full / J. Exp. Biol. 205 (2002)
Andrew Spence & Dan Koditschek
[Slide: the perception / cognition / action loop, with the emphasis shifting to cognition]
[Slide: what? how? build? result?]
[Closing slide: the perception / cognition / action loop, situated in the world]