
A Biologically-Motivated Developmental System Towards Perceptual Awareness in Vehicle-Based Robots

Zhengping Ji∗, Matthew D. Luciw∗, Juyang Weng
Dept. of Computer Science and Engineering
Michigan State University
{jizhengp, luciwmat, weng}@cse.msu.edu

Shuqing Zeng, Varsha Sadekar
Electrical Controls and Integration Laboratory
R&D Center, General Motors Inc.
{shuqing.zeng, varsha.sadekar}@gm.com

Abstract

Existing learning networks and architectures are not suited to handle autonomous driving or driver assistance in complex, human-designed environments such as city driving. Developmental learning techniques for such "vehicle-based" robots will be necessary. Motivated by neuroscience, we propose a system with a design based on the criteria of autonomous, open-ended development. The eventual goal is perceptual awareness – a conceptual and symbolic understanding of the sensed environment that can be communicated, developed and refined using a teacher-defined language. In the system proposed here, radars and a camera are integrated to localize nearby objects for further analysis. The attended areas are each transformed into a sparse representation by a layer of developed natural filters analogous to V1. Taking that layer's response, MILN (Multilayer In-place Learning Network) integrates unsupervised and supervised learning to self-organize efficient representations for recognition of the types of the objects. We trained our system with data from 10 different city and highway road environments and compared it favorably with other learning algorithms. Results of the comparison show that this system is the only one tested that can fit all the specified criteria of development for a general-purpose learning architecture.

∗ Both authors contributed equally to this paper. This work is supported in part by General Motors Research and Development.

1. Introduction

Due to the DARPA Grand Challenge and Urban Challenge [DARPA, 2007], many systems for autonomous driving have been created and many more are under development. Yet, the constraints of the contests have not yet required local perceptual awareness beyond a classification of safe and non-safe areas. Skilled driving requires a rich understanding of the complex road environment, which contains many signals and cues that visually convey information, such as traffic lights and road signs, and many different types of objects, including other vehicles, pedestrians, and trash cans, to name a few. We argue that an autonomous driving system that is adept at interpreting and understanding human-designed road environments will require human-level perceptual awareness. Previously, it has been argued that such systems will require a developmental approach [Weng et al., 2001], where a suitable developmental architecture, coupled with a nurturing and challenging environment, as experienced through sensors and effectors, allows mental capabilities and skills to emerge.

This challenge therefore has implications beyond advancing the state of the art in autonomous driving. An autonomously developing system should be heavily motivated by studies in developmental psychology and neuroscience, so the challenges in building such a system may also lead to insights about biological mental development. This paper presents a biologically-motivated system for object detection, learning, and recognition, tested in highway and urban road environments. Its design is motivated by the constraints of large-scale, open-ended development.


[Figure 1 diagram: Camera and Radar(s) feed an Image Window Extraction stage (innate receptive fields, projection & window guess); the extracted windows enter an Image Queue, and teacher-provided labels enter a Label Queue via the Teaching Interface; the windows then pass through Layer One (derived filters), Layer Two (sparse representation), and Layer Three (recognition).]

Figure 1: Current system's architecture. The camera and radars work together to provide a set of image regions possibly containing nearby objects. The teacher can communicate with the system through an interface and label the objects. A three-layer learning network is used. The first layer encodes each image using localized (small receptive fields), sparse, orientation-selective filters comparable to those in V1. Localized receptive fields will allow spatial attention selection in later versions of this system. The second-layer neurons have a classical receptive field over the entire input image and learn prototypical object features in the sparse representation space. Layer-3 links layer-2's global features with output tokens defined by the teacher.

Natural image processing for general and complex settings is beyond the limits of traditional hand-programmed image processing methods. The high-dimensional, appearance-based, developmental learning method presented here is characteristic of a "non-task-specific" approach.

1.1 Problem definition

Our eventual goal is to enable a vehicle-based agent to develop the ability of perceptual awareness, for applications including intelligent driver assistance and autonomous driving. Perceptual awareness is a conceptual and symbolic understanding of the sensed environment, where the concepts are defined by a common language between the system and its teachers and users. A language can be as simple as a predefined set of tokens or as complex as human spoken languages. Teachers are required to "arrange the experience" of the system so that it learns the language – e.g., a teacher points out sensory examples of a particular conceptual class and the system learns to associate a symbolic token with the sensed class members, even those that have not been exactly sensed before but share some common characteristics (e.g., a van can be recognized as a vehicle by the presence of a license plate, wheels and taillights). More complicated perceptual awareness beyond recognition involves abilities like counting and prediction.

The general setup and learning framework is illustrated in Figure 1. Sensors used are video cameras and short- and long-range radars. There is one long-range radar, which scans a horizontal field of 15°, with a detection range of up to 150 meters. Four short-range radars cover a 180° scanning area and are able to detect objects within 30 meters. A single camera provides a 45° field of view.

The specific skills to be taught are to localize, identify and communicate the objects that the vehicle will potentially interact with – especially those that might lead to collisions. This is non-trivial to learn, since each image from the camera contains zero or more objects, each of which may vary from other objects of the same communicative type in position, scale, and 2D rotation (affine transformations), as well as in 3D rotation and lighting. Overall, most pixels are "background" pixels that do not correspond to any nearby object.

1.2 Requirements of a developmental architecture

A high-level formulation of a developmental system is as a function that maps the current¹ sensory input (including internal sensations) vector to the next effector (including internal effects) output vector: y(t + 1) = f(x(t)). Assume x(t) ∈ X and y(t) ∈ Y, where both spaces are typically very high-dimensional, being raw sensory input and raw motor output. The goal of learning is to approximate some underlying function f′ : X → Y using the agent's mental resources and architecture, where this function is shaped by, e.g., some set of core motivations.
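To make the formulation concrete, below is a minimal sketch of this sensorimotor loop, assuming a generic incremental learner with hypothetical predict()/update() methods (these names are illustrative, not from the paper):

```python
import numpy as np

class DevelopmentalAgent:
    """Approximates f': X -> Y online; acts via y(t+1) = f(x(t))."""

    def __init__(self, learner):
        self.learner = learner  # any incremental regressor (assumed API)

    def step(self, x_t, imposed_y=None):
        y_next = self.learner.predict(x_t)  # y(t+1) = f(x(t))
        # Learning proceeds whether or not a teacher is present: an
        # imposed output supervises; otherwise the internal model
        # self-organizes from the input alone.
        self.learner.update(x_t, imposed_y)
        return y_next
```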

A developmental architecture is needed when the learning problem is open-ended. It may contain tasks that are not even known a priori, and it may contain tasks that are not well-defined (there is no guarantee a dataset contains all situations).

¹ Assume discrete time steps, where the current time step is denoted by t.


If any tasks are not well-defined ("muddy") or unknown, the problem is considered open-ended, meaning currently unavailable experience, which will occur at some indeterminate set of future times, will be needed to learn the tasks. Open-ended problems require developmental learning architectures.

Some existing and well-known artificial neural networks are the feed-forward networks trained with backpropagation (FBP), those using radial basis functions (RBF), those constructed using the cascade correlation learning architecture (CCLA), the batch and incremental support vector machines (SVM and I-SVM), and the self-organizing maps (SOM). There are many other less well-known networks. None that we know of can meet all the requirements of an architecture for an open-ended developmental system within an autonomous agent (most of them were not designed to).

Non task-specific – A common idea in supervised learning, used by FBP, RBF, CCLA, SVM, and I-SVM, is to use the available data to attempt to optimize the system's performance (minimize error). Optimization is a "greedy" strategy in the following sense: it discards information that is not useful for accomplishing the current task (minimizing error for a particular set of data), and it only learns when there is an output. An alternative is to use any sensory information to develop the internal model, even when there is no task, in case it may be useful for learning future tasks.

The brain utilizes this strategy by selectively coding the environment, as experienced through the sensors. Earlier levels of sensory representation (e.g., LGN or V1) should not be affected as much by the biological agent's actions (from motor areas) as later areas, such as the object recognition area IT, which projects to and receives projections from the premotor cortex. Along sensorimotor pathways, we postulate there is a spectrum in learning mode, from mainly unsupervised (at early levels) to supervised (at later levels). Natural-environment-based low-level feature derivation is "non task-specific" in the sense that the low-level features used promote an efficient coding of the environment for any visual task.

Training time and training complexity – A key property of any developmental system is that it exists in a closed feedback loop with the environment. Therefore, the architecture must be able to learn from data online and in real-time. High-dimensional input, a large storage resource, and the requirement of online training are prohibitive for incremental algorithms with more than linear training time-complexity per storage element. Cell-centered learning, also called in-place learning [Weng and Zhang, 2006], has the lowest possible time complexity for training.

Model-free learning – A major requirement of developmental learning techniques is automatic generation of internal representation. This is important for complex sensing modalities, such as vision, and complex environments, such as street driving, where surroundings are too varied, complex and unpredictable to hand-design a static set of feature detectors.

Teachable – Another key component is the need for a teacher – someone who can "arrange experience" so that the learning system is able to develop skills. In online learning, the performance result can be confirmed or corrected by a teacher in a timely fashion. Online learning also allows a teacher to dynamically determine the training samples according to the system's current performance. This is critical, as it is impossible to fully predict the performance of a learning system ahead of time during a batch sample-collection stage, which makes batch collection of training data less attractive in practice. This is especially the case when the learning system only occasionally makes errors. This advantage enables a teacher to train the system with additional cases in problem areas so as to improve the system's performance there.

Local Analysis for Attention Selection – Receptive fields must exist at all possible scales on the image plane (corresponding to the fovea). For attention selection, certain spatial locations are suppressed or enhanced for local analysis. This ability is essential for true generalization (e.g., recognition by parts). It implies that input should be segmented and processed locally instead of globally, or "monolithically". Networks that process the input in a monolithic fashion cannot perform attention selection, and nearly all existing networks are monolithic. An internal attention-selection action (choosing which receptive fields to suppress) will be needed to fully realize this capability.

Theoretically, our system can handle the above criteria, and it is designed with them in mind. For a further discussion of these and the other criteria (e.g., long-term memory, which is also crucial), see [Weng et al., 2007].

2. Design of the Vehicle-Based Developmental System

2.1 Coarse Attention

The combination of the radar and camera sensors is an efficient way to perform coarse attention selection. In the uncluttered, relatively open road environments, they detect regions in the image that contain nearby objects. Radars were used to find salient image regions, corresponding to possible (there are false alarms) nearby objects within the larger image. A group of target points in 3D world coordinates is detected from the radar sensors. In some cases, several target points refer to the same object, so Kalman filtering is applied to the original radar returns to generate fused target points.


[Figure 2 images: four example road scenes, panels (a)–(d), with radar-returned points marked.]

Figure 2: Examples of images containing radar-returned points, which are used to generate attention windows. This figure shows some examples of the different road environments in our dataset.

We discarded radar returns more than 100 meters ahead or more than eight meters to the right or left. The radar-centered coordinates are projected into the image reference system using a perspective mapping transformation. Given a 3D radar-returned fused target point, an attention window is created within the image, using the parameters of expected maximum height (three meters) and expected maximum width (3.8 meters) of the vehicles. Figure 2 shows examples of the "innate" attention window generation. Through this first-stage attention provided by the radar, most of the non-object pixels are identified.
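As an illustration of this projection step, the following is a hedged sketch assuming a pinhole-camera model with a known 3 × 4 projection matrix P and a ground-plane coordinate convention (y up); the paper does not specify its calibration details, so all names and conventions here are assumptions:

```python
import numpy as np

MAX_HEIGHT_M, MAX_WIDTH_M = 3.0, 3.8  # expected vehicle extents (from text)

def attention_window(P, target_xyz):
    """Project a fused 3D radar target and its expected extents into
    the image; returns (u_min, v_min, u_max, v_max) in pixels."""
    x, y, z = target_xyz  # radar-centered coordinates, meters (assumed axes)
    corners = np.array([
        [x - MAX_WIDTH_M / 2, y,                z],  # bottom-left corner
        [x + MAX_WIDTH_M / 2, y + MAX_HEIGHT_M, z],  # top-right corner
    ])
    pts_h = np.hstack([corners, np.ones((2, 1))])    # homogeneous coords
    uv = (P @ pts_h.T).T
    uv = uv[:, :2] / uv[:, 2:3]                      # perspective divide
    (u0, v0), (u1, v1) = uv
    return (min(u0, u1), min(v0, v1), max(u0, u1), max(v0, v1))
```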

For each radar window, the attended pixels are extracted as a single image. Each image is normalized in size, in this case to 56 rows and 56 columns. To avoid stretching small images, if the radar window could fit, it was placed in the upper-left corner of the size-normalized image, and the other pixels were set to intensities of zero. These images are used as the input to the learning network. We assume that there is only one single object within each radar window.
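A minimal sketch of this size normalization, assuming grayscale patches as NumPy arrays (the interpolation method for oversized windows is not specified in the paper, so cv2 resizing is an assumption):

```python
import numpy as np
import cv2  # assumed available for resizing oversized windows

def normalize_window(patch, size=56):
    """Return a size x size image; small windows are pasted into the
    upper-left corner over zeros instead of being stretched."""
    h, w = patch.shape
    out = np.zeros((size, size), dtype=np.float32)
    if h <= size and w <= size:
        out[:h, :w] = patch  # window fits: avoid stretching
    else:
        out = cv2.resize(patch.astype(np.float32), (size, size))
    return out
```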

This innate attention leads to the following advantage: for each image from the camera, the desired output of the network is simplified from an indeterminate number of labels of objects in the large image to a single label for each radar window image.

2.2 Early coding

Information travels along the early part of the visual pathway (from the retina to V1), and the representation of natural signals that develops within V1 is both localized and sparse [Olshausen and Field, 2004]. By localized, it is meant that most V1 neurons (the so-called simple cells) will respond only when a small area of the fovea is stimulated; additionally, neurons are orientation-selective, so this area will most likely be a small, oriented slit on the retina. By sparse, it is meant that only a few neurons will fire for a given stimulus. Sparseness is thought to lead to better storage capacity for associative memories and to improve generalization. In addition to being sparse, V1 filters are also highly redundant and overcomplete – e.g., there is a 25:1 output-to-input ratio in V1 of the cat. As postulated in [Olshausen and Field, 2004], this overcompleteness may serve to make nonlinear problems more linear by mapping the input to a much higher-dimensional space.

[Figure 3 diagram: the image plane tiled with overlapping receptive fields of neural columns numbered #1, #2, ..., #c, #c + 1, ..., #(rc − c + 1), ..., #rc, with the pixel stagger distance indicated.]

Figure 3: Receptive field boundaries and numbering scheme of neural columns on layer-one.

How do V1 neurons develop this localized orientation selectivity and a sparse representation? Atick and coworkers [Atick, 1992] proposed that early sensory processing decorrelates inputs to V1 and showed how a whitening filter can accomplish decorrelation. Srinivasan et al. [Srinivasan et al., 1982] proposed that predictive coding on the retina causes this decorrelation – basically, since the signals on the retina are highly spatially correlated, the retinal ganglion cells with center-surround receptive fields can act as a prediction of a central intensity based on surrounding intensities. Given a network that whitens the input, Weng and Zhang [Weng and Zhang, 2006] showed how the two well-known biological mechanisms of Hebbian learning and lateral inhibition lead to the development of localized, orientation-selective filters that form a sparse representation of the input, from a set of natural images. That same procedure was used to generate the prototypes for each neural column within this system's layer-one (see below).

Layer-1: neural columns of natural filters – Within V1, neurons are organized in sets of densely populated vertical columns. Receptive fields of the neurons in each column are very similar – they are distributed closely around a central point on the retina [Dow et al., 1981] – and neighboring columns overlap significantly. The first layer of this system is the visual primitive layer, which is organized into a set of neural columns, each of which contains a set of neurons. Neurons on this layer are initially only partially connected to the pixels of the normalized image window, meaning each has a different initial receptive field. A neuron's initial receptive field depends on the neural column it belongs to. We used innate (hard-coded) square, overlapping, 16 × 16 initial receptive fields. Figure 3 shows the organization of the initial receptive fields. We used a stagger distance of 8 pixels; for the 56 × 56 images this gives (56 − 16)/8 + 1 = 6 columns per side, i.e., 36 total neural columns.

Developing the first layer neurons – We generated the layer-1 derived filters from real-world "natural" images, using the LCA algorithm, as was done in [Weng and Zhang, 2006]. The statistics of natural images are representative of the signals we interpret through vision. An overcomplete set of 512 lobe components was developed for a 16 × 16 pixel area. To do so, 16 × 16 patches were incrementally selected from random locations in 13 natural images². For decorrelation similar to what may be done in the retina, each image patch x is pre-whitened by x̂ = Wx, where x̂ is the whitened sample vector and W = VD is the whitening matrix. V is the matrix of principal components, generated from 50,000 16 × 16 natural image patches, containing each principal component v1, v2, ..., vn as a column vector, and D is a diagonal matrix whose i-th diagonal element is 1/√λi, where λi is the eigenvalue of vi. Pre-whitening is necessary for localized filters to develop, due to the statistics of the non-white natural image distribution [Olshausen and Field, 2004].
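For concreteness, here is a minimal sketch of estimating W = VD from patch statistics, assuming zero-mean patches stacked as rows of an array; the small epsilon is a numerical safeguard we add, not something stated in the paper:

```python
import numpy as np

def whitening_matrix(patches, eps=1e-8):
    """patches: (n, 256) flattened 16x16 samples, assumed zero-mean.
    Returns W = V D with D_ii = 1 / sqrt(lambda_i)."""
    cov = np.cov(patches, rowvar=False)         # 256 x 256 covariance
    eigvals, V = np.linalg.eigh(cov)            # columns of V are v_i
    D = np.diag(1.0 / np.sqrt(eigvals + eps))   # eps: numerical safeguard
    return V @ D

# With W = V D as written in the text, projecting a patch onto the
# components and rescaling gives the whitened vector: x_hat = W.T @ x.
```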

Figure 4 shows the result after 10,000,000 whitened input samples. The LCA algorithm uses only biologically-plausible cell-centered mechanisms: Hebbian learning and lateral inhibition. We discarded some neurons with low update (win) totals, and kept 431 neurons. Each layer-1 neural column shares these same 431 neurons, so layer-1 has 431 × 36 = 15,516 neurons total. Layer-1 serves to map the raw pixel representation to a higher-dimensional, sparse-encoded space for layer-2. It leads to a sparse representation of inputs, meaning that for a given input, very few neurons will be active. The localized orientation-selective filters are functionally similar to those found in V1.
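The resulting layer-1 mapping can be sketched as follows, assuming the whitening matrix and developed filters from above; variable names are illustrative, not from the paper's code:

```python
import numpy as np

def layer1_response(image56, filters, W, rf=16, stagger=8):
    """image56: 56x56 window; filters: (431, 256) developed lobe
    components; W: (256, 256) whitening matrix. Each of the 36 columns
    applies the same 431 filters to its 16x16 receptive field."""
    responses = []
    for r in range(0, image56.shape[0] - rf + 1, stagger):   # 6 rows
        for c in range(0, image56.shape[1] - rf + 1, stagger):  # 6 cols
            patch = image56[r:r + rf, c:c + rf].reshape(-1)
            x_hat = W.T @ patch                # pre-whiten the patch
            responses.append(filters @ x_hat)  # 431 filter responses
    return np.concatenate(responses)           # length 36*431 = 15,516
```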

2.3 Object representation and motor output

Layer-2: recognition area – From V1 to IT, the classical receptive field of neurons becomes larger in each area. Neurons in the second layer of this system have a classical receptive field over the entire 56 × 56 image plane, since each neuron is fully connected to all neurons in all neural columns in layer-one.

² From http://www.cis.hut.fi/projects/ica/data/images/ via Helsinki University of Technology.

Figure 4: The developed filters learned from natural images, used in each neural column of the proposed system. Each patch shows the receptive field of a model neuron within a 16 × 16 pixel image patch. The patches are placed in order of highest number of updates, from the upper left, row-wise, to the bottom right.

This layer, and layer-3, are meant to be developed in the driving environments. The bottom-up input to this layer is a radar-returned image mapped to the sparse-coded space, so the input dimensionality is 15,516.

A limited-resource square grid of c neurons (values tried were 100 and 225) in this layer is utilized to represent the inputs. However, one neuron is not intended to represent a single object. The neurons self-organize using the MILN algorithm presented in Section 3, and each input will be represented by the total population response.

Layer-3: motor area – The motor layer is an extendable layer that allows real-time supervised learning. It is made up of a single neural column. These neural weights only update when there is top-down input from a teacher. Whichever label the teacher provides sets the output and specifies the single winning neuron, which then updates its weights to the layer-2 neurons currently active (the layer-2 population response). When the teacher provides a new (previously unseen) label, a new neuron is added. In this way, classes can be taught without turning off the system. New classes may take advantage of the existing layer-two representations ("soft" invariance) to be learned quickly. When there is no label given by the teacher, the output of this layer is interpreted as the system's guess of the output token for the queried radar window.
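A hedged sketch of this extendable motor layer, with a simplified update standing in for the full MILN rule given in Section 3 (class and method names are illustrative):

```python
import numpy as np

class MotorLayer:
    """One neuron per taught label, added on first encounter and
    updated only under teacher supervision."""

    def __init__(self, input_dim):
        self.input_dim = input_dim
        self.labels, self.weights = [], []   # grows with new labels

    def train(self, layer2_response, label, lr=0.1):
        if label not in self.labels:         # new token: add a neuron
            self.labels.append(label)
            self.weights.append(np.zeros(self.input_dim))
        j = self.labels.index(label)         # teacher picks the winner
        # Simplified Hebbian-style update toward the active layer-2
        # population response (a stand-in, not the paper's exact rule):
        self.weights[j] += lr * (layer2_response - self.weights[j])

    def guess(self, layer2_response):
        scores = [w @ layer2_response for w in self.weights]
        return self.labels[int(np.argmax(scores))]
```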

3. Multilayer In-place Learning

Layer-one is a function that transforms an image vector into a higher-dimensional, sparse-coded space.


Taking this as input is a Multilayer In-place Learning Network (MILN), as formulated in [Weng et al., 2007]. We do not intend to fully formulate MILN in this paper, but some key elements will be discussed. The network is called "in-place" since the self-organization of different areas occurs in a cell-centered (local) way. This concept is motivated by the genomic equivalence principle, by which every cell of an organism shares the same genome, which can be considered the developmental program that directs development in a cell-centered way. In-place learning implies that there cannot be a "global", or multi-cell, goal to the learning, such as the minimization of mean-square error for a pre-collected (batch) set of inputs and outputs. In in-place learning, each neuron learns on its own, as a self-contained entity using its own internal mechanisms. The mechanisms contained in this program and the cell's experience (stimulation) over time affect the cell's fate (here: what feature it detects).

As a model of a cortical neuron, each MILN neuron is contained within a particular layer and has either excitatory or inhibitory feedforward (from a lower layer), horizontal, or feedback (from a higher layer) input. It develops its connections and synapses through the activity of other neurons. The external sensors are considered to be on the bottom (layer 0) and the external motors on the top (layer N). Its synaptic conductance is modeled by three weight vectors, one for each of the three connection types: bottom-up weights wb, lateral (horizontal) weights wh, and top-down weights we (e is for efferent). MILN integrates both bottom-up unsupervised and top-down supervised learning modes via the explicit weights. The top-down supervision is only active when the motor output is imposed. In this way, the supervision impacts the organization of earlier layers.

The total activity zi of neuron i in response to y (afferent activity), h (lateral activity) and e (activity from the next layer) is zi = g(wb · y − wh · h + we · e), where g is a nonlinear sigmoidal function (or a piecewise-linear approximation of one).

MILN parameters – In this application, for the i-th neuron, we utilize explicit bottom-up and top-down weights, but approximate the lateral activity. The response zi is given as

zi = gi( (1 − αl) · (wb,i(t) · y(t)) / (‖wb,i(t)‖ ‖y(t)‖) + αl · (we,i(t) · e(t)) / (‖we,i(t)‖ ‖e(t)‖) )

Instead of using explicit lateral weights, a winner-take-all approach was used as a computationally efficient approximation of lateral inhibition. We approximate lateral excitation via a 3 × 3 update: the neurons are placed (given positions) in a square grid, and a winning neuron will cause its neighboring neurons to also update and fire. All non-updating neurons have their responses set to zero.

The weight αl controls how much layer l is influenced by top-down supervision from the next layer versus bottom-up unsupervised learning from the previous layer. It lies within (0, 1) and is layer-specific. Here, α1 = 0, α2 = 0.3, and α3 = 1.

The values for the "top-down" weights are copied from the corresponding bottom-up weights. To clarify: the same weights were used as the afferent weights to layer l and as the top-down weights to layer l − 1, i.e., a neuron's "fan-out" weights serve as its top-down weights. The firing-rate transfer function g here is a low threshold: gi(zi) = 0 if zi ≤ θl, and gi(zi) = zi otherwise, where θl is a layer-specific low-threshold value from 0 to 1. We set θ2 = 0.4.
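Putting the response equation, winner-take-all approximation, 3 × 3 lateral excitation, and threshold together, a minimal sketch (using the 10 × 10 grid and θ = 0.4 from this application; grid indexing conventions are assumptions):

```python
import numpy as np

def layer2_response(Wb, y, We=None, e=None, alpha=0.3, theta=0.4, grid=10):
    """Wb: (n, d) bottom-up weights; y: (d,) sparse-coded input;
    We: (n, k) top-down weights; e: (k,) imposed motor output."""
    if We is None or e is None:
        alpha = 0.0  # no imposed output: purely bottom-up (assumption)
    z = (1 - alpha) * (Wb @ y) / (
        np.linalg.norm(Wb, axis=1) * np.linalg.norm(y) + 1e-12)
    if alpha > 0:
        z = z + alpha * (We @ e) / (
            np.linalg.norm(We, axis=1) * np.linalg.norm(e) + 1e-12)
    # Winner-take-all plus 3x3 lateral excitation on the square grid:
    winner = int(np.argmax(z))
    r, c = divmod(winner, grid)
    out = np.zeros_like(z)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < grid and 0 <= cc < grid:
                j = rr * grid + cc
                out[j] = z[j] if z[j] > theta else 0.0  # low-threshold g
    return out, winner
```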

[Figure 5 images: a sequence of radar windows (left) and the receptive field of the top-responding layer-2 neuron (right).]

Figure 5: (Left) The set of radar windows for a sequence, in both highway and city driving environments. (Right) The receptive field of the top-responding layer-2 neuron. Note that the size of the right window is less than the size of the left window containing the vehicle. This shows the further receptive field development for that layer-2 neuron.

The winning neuron (max-response) and its neighbors were allowed to fire and update: these are called the winners. For a winner cell j, the weights are updated using the lobe component updating principle, with the neuron's own internal, temporally scheduled plasticity:

wb,j(t) = β1 wb,j(t − 1) + β2 zj y(t)

where the scheduled plasticity is determined by the two age-dependent weights

β1 = (n(j) − 1 − µ(n(j))) / n(j),  β2 = (1 + µ(n(j))) / n(j),

with β1 + β2 ≡ 1. Finally, the cell age n(j) for the winner increments: n(j) ← n(j) + 1. All non-winners keep their ages and weights unchanged.

µ(n(j)) is a plasticity function defined in [Weng and Zhang, 2006]. It is called "CCI plasticity" and is formed so that the learning rate for new data, β2, will never converge to zero as t → ∞. This allows lifetime neuron plasticity.
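A sketch of the winner update; mu() below uses a common three-sectioned amnesic form as a stand-in, since the exact CCI plasticity function is defined in [Weng and Zhang, 2006] rather than reproduced here (its parameters are assumptions):

```python
import numpy as np

def mu(n, t1=20, t2=200, c=2.0, gamma=2000.0):
    """Illustrative three-phase amnesic function (an assumption)."""
    if n < t1:
        return 0.0
    if n < t2:
        return c * (n - t1) / (t2 - t1)
    return c + (n - t2) / gamma

def update_winner(w, age, z, y):
    """Lobe-component update for one winner: returns (new_w, new_age)."""
    n = age + 1
    beta1 = (n - 1 - mu(n)) / n
    beta2 = (1 + mu(n)) / n            # beta1 + beta2 == 1 by design
    return beta1 * w + beta2 * z * np.asarray(y), n
```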


Table 1: Average performance & comparison of learning methods over 10-fold cross validation for pixel inputs

Method | Final # storage elements | Overall accuracy | "Vehicle" accuracy | "Other objects" accuracy | Training time per sample | Test time per sample
NN   | 1198          | 80.32% | 72.28% | 98.57% | n/a    | 455 ms
ISVM | 85            | 71.97% | 72.49% | 70.79% | 130 ms | 2.2 ms
IHDR | 1198          | 79.51% | 71.54% | 97.61% | 2.7 ms | 4.7 ms
MILN | 100 (10 × 10) | 84.76% | 87.81% | 83.42% | 17 ms  | 8.8 ms

Table 2: Average performance & comparison of learning methods over 10-fold cross validation for sparse coded inputs

Method | Final # storage elements | Overall accuracy | "Vehicle" accuracy | "Other objects" accuracy | Training time per sample | Test time per sample
NN   | 1198          | 90.49% | 89.69% | 92.31% | n/a    | 2273 ms
ISVM | 86.6          | 77.1%  | 75.51% | 80.71% | 330 ms | 7.6 ms
IHDR | 1198          | 85.36% | 86.12% | 83.69% | 12 ms  | 22 ms
MILN | 225 (15 × 15) | 86.4%  | 89.0%  | 80.45% | 110 ms | 43 ms


4. Experiments and results

We used an equipped vehicle to capture many real-world image and radar sequences for training and testing purposes. A dataset³ is composed from 10 different "environments" – stretches of road at different places and times (see Figure 2 for a few examples of different environments). From each environment, several different interesting sequences were extracted. Each sequence contains some similar but not identical images (with variations in viewpoint, illumination and scale), captured at intervals of 0.2 seconds. The challenge for the learning algorithms is to classify each radar window's contents as one of two classes: vehicles and other objects. There were 928 samples in the vehicle class and 409 samples in the other-object class.

Four different algorithms were compared to evaluate their potential for open-ended autonomous development, where an efficient (memory-controlled), real-time (incremental and fast), autonomous (the system cannot be turned off for changes or adjustments), and extendable (the number of classes can increase) architecture is needed. We tested the following classification methods: nearest neighbor using an L1 distance metric for baseline performance, incremental SVM [Cauwenberghs and Poggio, 2001]⁴, Incremental Hierarchical Discriminant Regression (IHDR) [Weng and Hwang, 2007], and MILN [Weng et al., 2007]⁵.

³ http://www.cse.msu.edu/ei/datasets.htm
⁴ Software obtained from http://bach.ece.jhu.edu/pub/gert/svm/incremental
⁵ Both the MATLAB interface to IHDR and the monolithic MILN are available at http://www.cse.msu.edu/ei/software.htm

We used a linear kernel for I-SVM, as is suggested for high-dimensional problems [Hsu et al., 2003]. We did try several settings for an RBF kernel but did not observe performance as good as with the linear kernel.

We used a "true disjoint" test, where the samples were left in sequence and broken into ten sequential folds. In this case (as opposed to randomly arranging samples), the problem is difficult, since there will be sequences of types of vehicles or objects in the testing fold that were never trained. This truly tests generalization. For all tests, each large image from the camera was 240 rows and 320 columns. Each radar window was size-normalized to 56 × 56 and intensity-normalized to (0, 1). Inputs to all networks came either from the non-transformed space ("pixel" space, with input dimension 56 × 56 = 3136) or from the space after transformation by layer-one into the sparse-coded space (with dimension 36 × 431 = 15,516).
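The fold construction can be sketched as follows (illustrative, assuming samples indexed in temporal order):

```python
import numpy as np

def sequential_folds(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs with contiguous, in-order test
    folds, so test folds contain object sequences never trained on."""
    bounds = np.linspace(0, n_samples, k + 1, dtype=int)
    for i in range(k):
        test = np.arange(bounds[i], bounds[i + 1])
        train = np.concatenate([np.arange(0, bounds[i]),
                                np.arange(bounds[i + 1], n_samples)])
        yield train, test
```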

Our results are summarized in Tables 1 and 2. Nearest neighbor performs fairly well, but is prohibitively slow. IHDR combines the advantage of NN with an automatically developed overlaying tree structure that organizes and clusters the data, which is useful for extremely fast retrievals. So, IHDR is a little worse than NN, but is much faster and can be used in real-time. However, IHDR typically takes a lot of memory. It allows sample merging, but in this case it saved every training sample, so it did not use memory efficiently. I-SVM performed the worst with both types of input, but it uses the least memory (in terms of number of support vectors), and the number of support vectors is automatically determined by the data. A major problem with I-SVM is lack of extendability – by only saving samples to make the best two-class decision boundary, it throws out information that may be useful in distinguishing other classes that could be added later.


Of course, SVM is not formulated so that any more than two classes can be added autonomously, while IHDR and MILN, as general-purpose regressors, are able to do so. MILN performs better than all other methods for the "pixel" inputs using only a 10 × 10 grid with a top-down supervision parameter of 0.3, over three epochs. MILN is also fairly fast.

When comparing the results of the pixel inputs to the sparse-coded inputs, it is apparent that performance improves in the sparse-coded space, which follows from what was postulated in [Olshausen and Field, 2004]. We scaled MILN up in size (15 × 15 neurons) since the dimension of the data increased substantially. I-SVM's size does not increase, since it is only concerned with the boundary, whereas MILN tries to represent the entirety of the sensorimotor manifold well, to aid later extension.

Overall, MILN does not fail on any criterion, although it is rarely the "best" in any one category, as currently implemented. NN is too slow and I-SVM is not extendable for open-ended development. IHDR has problems using too much memory and does not represent information efficiently and selectively (supervised self-organization). The overall performance on this data showcases the need for attention selection, or local analysis. Our system in Fig. 1 extends MILN to have a localized layer-one. Attention selection capability will allow the system to focus analysis on subparts within the radar windows to improve generalization (e.g., recognize a vehicle as a combination of license plate, rear window and two taillights). None of the other methods allow the local analysis needed for this key upcoming extension.

An incremental teaching interface was developed to experiment with the system shown in Fig. 1. The teacher can move through the collected images in the order of their sequence, provide a label for each radar window, train the agent with the current labels, or test the agent's current knowledge. Some example results are shown in Fig. 5. Even in this non-parallelized MATLAB version, the speed is close to real-time use. The average rate for the entire system (not just the algorithm) was 5.95 samples/s in training and 6.32 samples/s in testing.

5. Conclusions

We now return to the stated goal of perceptual awareness. From raw data, this system incrementally learns to answer one "question" – the type of object within a radar window – with one of two answers: "a vehicle" or "something else". However, this is not the limit of the system. The design of the architecture presented here allows growth to learn more answers (conceptual categories). It can also continuously learn new examples without the learning rate ever converging to zero.

It remains an open question how to learn different types of "questions", and there seems to be a long way to go yet towards true perceptual awareness. However, fundamentally new types of learning architectures will be necessary, and this paper's MILN is an initial investigation into this domain.

References

Atick, J. (1992). Could information theory provide an ecological theory of sensory processing? Network, 3:213–251.

Cauwenberghs, G. and Poggio, T. (2001). Incremental and decremental support vector machine learning. In Advances in Neural Information Processing Systems, volume 13, pages 409–415, Cambridge, MA.

DARPA (2007). DARPA Urban Challenge 2007: Rules. Technical report, DARPA.

Dow, B., Snyder, A., Vautin, R., and Bauer, R. (1981). Magnification factor and receptive field size in foveal striate cortex of the monkey. Exp. Brain Res., 44:213.

Hsu, C., Chang, C., and Lin, C. (2003). A practical guide to support vector classification.

Olshausen, B. and Field, D. (2004). Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14:481–487.

Srinivasan, M., Laughlin, S., and Dubs, A. (1982). Predictive coding: a fresh view of inhibition in the retina. Proc. R. Soc. Lond. B Biol. Sci., 216:427–459.

Weng, J. and Hwang, W. (2007). Incremental hierarchical discriminant regression. IEEE Trans. on Neural Networks, 18(2):397–415.

Weng, J., Lu, H., Luwang, T., and Xue, X. (2007). A multilayer in-place learning network for development of general invariances. International Journal of Humanoid Robotics, 4(2).

Weng, J., McClelland, J., Pentland, A., Sporns, O., Stockman, I., Sur, M., and Thelen, E. (2001). Autonomous mental development by robots and animals. Science, 291(5504):599–600.

Weng, J. and Zhang, N. (2006). Optimal in-place learning and the lobe component analysis. In Proc. World Congress on Computational Intelligence, Vancouver, Canada.