

Psychological Review. Copyright © 1977 by the American Psychological Association, Inc.

Volume 84, Number 5, September 1977

Distinctive Features, Categorical Perception, and Probability Learning: Some Applications of a Neural Model

James A. Anderson, Jack W. Silverstein, Stephen A. Ritz, and Randall S. Jones
Brown University

A previously proposed model for memory based on neurophysiological considerations is reviewed. We assume that (a) nervous system activity is usefully represented as the set of simultaneous individual neuron activities in a group of neurons; (b) different memory traces make use of the same synapses; and (c) synapses associate two patterns of neural activity by incrementing synaptic connectivity proportionally to the product of pre- and postsynaptic activity, forming a matrix of synaptic connectivities. We extend this model by (a) introducing positive feedback of a set of neurons onto itself and (b) allowing the individual neurons to saturate. A hybrid model, partly analog and partly binary, arises. The system has certain characteristics reminiscent of analysis by distinctive features. Next, we apply the model to "categorical perception." Finally, we discuss probability learning. The model can predict overshooting, recency data, and probabilities occurring in systems with more than two events with reasonably good accuracy.

In the beginner's mind there are many possibilities, but in the expert's there are few.

—Shunryu Suzuki, 1970

I. Introduction

If we knew some of the organizational principles of brain tissue, it might be possible to make a few general statements about how the brain works in a psychological sense. There is a close relation between available building blocks and the performance that can be realized easily with those component parts in most systems, and we should expect the same to be true for the nervous system.

Let us consider an example of some importance to psychologists. In current theories of cognition and perception one often finds explanations of phenomena in terms of what are essentially little computer programs, complete with flow charts and block diagrams. Certainly, the desire to decompose complex mental events into simpler basic units follows the strategy that has been triumphantly successful in the physical sciences.

Even a poor understanding of brain organization might be of value in placing some kinds of limits on these elementary operations.

All of us are somewhat familiar with digital computers and, particularly, with computer programs. In some psychological models, the elementary instructions that we use to tell a computer what to do seem to serve as the model for brain function. Assumed are "comparisons," "scans," "lists," "decisions," and other computer-like operations.

414 ANDERSON, SILVERSTEIN, RITZ, AND JONES

However, computers are made with fast, reliable, binary electronic components that are designed to operate by executing very quickly a long series of simple operations. When an elementary operation takes a fraction of a microsecond and when internal noise is not a major problem, this is a very successful technique.

But most neuroscientists agree at present that the brain is a slow, intrinsically parallel, analog device that contains a great deal of internal noise from a variety of sources. It is very poorly suited to the accurate execution of a long series of simple operations. What is it good at?

The brain is best adapted to interacting, highly complex, spatially distributed parallel operations. Adjectives such as "distributed," "parallel," and "holographic" are sometimes used to describe operations of this type.

Since neurons are very slow compared to electronic devices—on the order of a few milliseconds at the very fastest—we should expect there to be time for only a few of the elementary operations that compose the instructions for a "brain computation." There would be neither the speed, equipment, nor accuracy for enormous strings of simple operations.

We may conclude that the elementary operations used by the brain are far more powerful than, and of a very different kind from, the simple instructions familiar to us from our experience with computers.

In this article we will present the outlines of a theory that is suggested by the anatomy and physiology of the brain and that is realized by arrays of simple, parallel, analog elements that are meant to be an oversimplification of the neurons in a mammalian neocortex. The model is associative and distributed, and we shall see that its elementary operations are of a very different type from simple logical operations. Systems that are intrinsically parallel have a number of pronounced and unfamiliar properties, as well as some impressive capabilities.

This work was supported in part by grants from the Ittleson Family Foundation to the Center for Neural Studies, Brown University, and by a Public Health Service General Research Support Grant from the Division of Biological and Medical Sciences, Brown University. Jack W. Silverstein was supported by National Science Foundation Grant MCS75-15153-A01.

Some material derived from this paper has been presented at the "Workshop on Formal Theories in Information Processing," Stanford University, Stanford, California, June 1974. Some material was also presented at the 7th Mathematical Psychology Meeting, Ann Arbor, Michigan, August 1974; at the annual meeting of the Society for Neuroscience, St. Louis, Missouri, October 1974; and at the 9th Mathematical Psychology Meeting, New York, August 1976.

Requests for reprints should be sent to James A. Anderson, Center for Neural Studies, Department of Psychology, Box 1853, Brown University, Providence, Rhode Island 02912.

When we propose a model that tries to join together information from several fields, we have difficulties when we try to verify it. At present, we simply do not know enough about the detailed connectivity and synaptic properties of the brain to do more than make our models in qualitative agreement with what is known of the neurobiology. Similarly, the psychological data are useful when it comes to testing general approaches, but they often do not allow unequivocal tests of the details of a theory. Thus, it seems to us best to try to make our neurally based models refined enough to fit, in detail, a few experiments—just to show it can be done. But we would also like to point out, in a more impressionistic way, areas of agreement between theory and observation elsewhere. We have deliberately made our theories extremely simple, perhaps unrealistically so, because if we can make adequate models with very simple theories, surely it will be possible to do better when more complex and/or more realistic assumptions are made. We feel that many different versions of distributed memories can be made to give results similar to those we describe here.

We will first discuss some necessary theoretical background. We will present a simple version of a distributed, associative memory. We will then modify the simple, linear model to incorporate positive feedback of a set of neurons on itself. We will show that feedback gives rise to behavior that is reminiscent of the analysis of an input in terms of what are called "distinctive features," a type of analysis that is commonly held to be of great importance in perception. We will then apply the model to two widely differing psychological phenomena. We shall discuss in detail the classic set of experiments generally described as "probability learning," and we shall discuss more generally the perceptual phenomenon called "categorical perception."

These diverse effects, which at first sight might seem to be very complicated and involve much information processing, may be explainable by a single, rather simple, set of assumptions.

II. Theoretical Development

In the past few years a number of related realizations of distributed memories applied to brain models have been put forward by several groups of investigators (Anderson, 1970, 1972; Cooper, 1974; Grossberg, 1971; Kohonen, 1972, 1977; Little & Shaw, 1975; Willshaw, Buneman, & Longuet-Higgins, 1969). One form of the central learning assumptions of these models was first proposed by Hebb (1949) but, as is often the case in an active field of science, many of the fundamental assumptions of the models have been arrived at independently by different groups.

In our development here, we will follow the notation and basic assumptions of Anderson (1968, 1970, 1972, 1977; Cooper, 1974). This version of a distributed, associative memory is formally exceptionally simple. It is easy to work with and may give a first approximation to some of the common properties of many distributed models.

We start by making two central assumptions. First, nervous system activity can be most usefully represented as the set of simultaneous individual neuron activities in a group of neurons. Neuron "activity" is considered to be related to a continuous variable, the average firing frequency. Patterns of individual activities are stressed, because properties of particular neuron activities need not be related to each other for the system to function. Indeed, some evidence (Noda & Adey, 1970) indicates that interneuronal spike activity correlations of nearby cells recorded with the same electrode in parietal ("association") cortex may be quite low when the brain is doing "interesting" things. They found that in an animal in REM (rapid eye movement) sleep or in the awake, alert state, correlations were very low, whereas the same pairs of cells had highly correlated discharges in deep sleep. The same appears to be true in hippocampus (Noda, Manohar, & Adey, 1969).

Recent work by Creutzfeldt, Innocenti, and Brooks (1974) seems to suggest that most cells in primary visual cortex, even those close to one another in the same cortical column, are not strongly coupled together, again implying a good deal of individuality of cell response. The individuality of cell responses in auditory cortex has been remarked upon (Goldstein, Hall, & Butterfield, 1968). Morrell, Hoeppner, and de Toledo (1976) studied single units in cat and rabbit parastriate cortex. They found that nearby units responding to the same stimulus (recorded with the same electrode) often differed in the type and direction of alteration of their discharges when a stimulus configuration and a cutaneous shock were paired in a Pavlovian conditioning paradigm. They comment that their data "provide little support for the notion of coherent changes in large neuronal populations" (p. 448).

Thus we are dealing, to a first approximation, with a system of individualistic cells, each with its own properties. Although cells near to one another may show similar response properties (for example, orientation or binocularity in visual cortex), each cell behaves differently from its neighbors when studied in detail. We will be concerned with the behavior of large groups of cells, but this does not mean that each cell is doing the same thing. Indeed, most distributed systems do not work well if cells all respond to the same inputs in the same ways, since diversity of cell properties allows for better operation. Simple redundancy is the most uninteresting way to provide reliability.

This assumption allows us to represent these large-scale activity patterns as vectors of high dimensionality with independent components. All the models we work with will use these vectors as the elementary units. We will show that it is possible to develop theories where these complex activity patterns, representing discharges of very many neurons, can act as basic units that combine and interact in relatively simple ways.

As our second major assumption, we hold that different memory traces (sometimes called "engrams"), corresponding to these large patterns of individual neuron activity, interact strongly at the synaptic level so that different traces are not separate in storage. Considerable physiological evidence supports this idea. As Sir John Eccles comments, "each neurone and even each synaptic junction are built into many engrams. The systematic study of the responses of individual neurones in the cerebrum, cerebellum, and in the deeper nuclei of the brain is providing many examples of this multiple operation" (Eccles, 1972, p. 59).

Figure 1. We consider the properties of sets of N neurons, α and β. Close inspection of this spaghetti-like drawing will reveal that every neuron in α projects to (i.e., has a synapse with) every neuron in β. Since this drawing, where N = 6, understates the size and connectivity of the nervous system by several orders of magnitude, it can be seen that single neurons and single synapses may have little effect on the discharge patterns of the group as a whole. Properties of such large, interconnected systems can sometimes be modeled simply with the techniques of linear algebra, the approach taken in the text. (f and g indicate vectors.)

Association Model

Let us assume we have two groups of N neurons, α and β (see Figure 1). We will assume that every neuron in α projects to every neuron in β. This clearly unrealistic assumption is not made in Anderson (1972). A detailed discussion of some of the mathematical aspects of this model is found in Cooper (1974) and Nass and Cooper (1975).

To proceed further, we must describe how the activity of a neuron reflects its synaptic inputs. For many cells, this can be extremely complex. However, for some systems rather simple relations are found. We shall assume at first that neurons are simple linear analog integrators of their inputs, and we shall see what kinds of models evolve from this assumption. There is evidence supporting this assumption in a few well-studied systems. In the lateral inhibitory system of Limulus, rather good linear integration is found (Knight, Toyoda, & Dodge, 1970; Ratliff, Knight, Dodge, & Hartline, 1974). Linear transmission, according to Mountcastle (1967), holds for many mammalian sensory systems, once beyond an initial nonlinear transduction.

We assume there is a synaptic strength, a_{ij}, which couples the jth neuron in α with the ith neuron in β. Thus, subject to our assumption of linear integration, we can write the following: If f(j) is the activity shown by neuron j in α and if g(i) is the activity shown by neuron i in β at a given time, then

g(i) = \sum_{j=1}^{N} a_{ij} f(j). \qquad (1)

Let us now consider the following situation: The set of neurons α shows an activity pattern, a vector, f_1, the set of all the f_1(j). The set of neurons β shows an activity pattern, a vector, g_1. We wish to associate the pattern f_1 with the pattern g_1 so that later presentation of f_1 alone will give rise to g_1 in the set β. Let us assume that initially our set of synaptic connectivities a_{ij} is zero.



We can ask what detailed local information could influence a synaptic junction to allow storage of memory. Locally available information includes the presynaptic activity and, we shall assume, the postsynaptic activity. Let us make about the simplest assumption for synaptic learning that allows for pre- and postsynaptic interaction at the synaptic level. Let us postulate, as our essential learning assumption, that to associate pattern f_1 in α with pattern g_1 in β we need to change the set of synaptic weights according to the product of presynaptic activity at a junction with the activity of the postsynaptic cell.

For convenience, let

\| \mathbf{f}_1 \| = 1, \quad \text{where} \quad \| \mathbf{f}_1 \| = \Bigl[ \sum_j f_1(j)^2 \Bigr]^{1/2}.

This quantity is usually called the length of the vector. The change in the ijth synapse is given by f_1(j) g_1(i). The set of connections forms a matrix A_1 given by

A_1 = [\mathbf{g}_1 f_1(1), \; \mathbf{g}_1 f_1(2), \; \ldots, \; \mathbf{g}_1 f_1(N)] = \mathbf{g}_1 \mathbf{f}_1^T, \qquad (2)

where T is the transpose operation (throughout this paper all vectors will be assumed to be N-dimensional column vectors).

Assume that after we have "printed" the set of connectivities A_1, pattern of activity f_1 arises in α. Then we see that activity in β is given by

A_1 \mathbf{f}_1 = \mathbf{g}_1 \mathbf{f}_1^T \mathbf{f}_1 = \| \mathbf{f}_1 \|^2 \, \mathbf{g}_1 = \mathbf{g}_1, \qquad (3)

so we have g_1 appearing as the pattern of activity in β, which corresponds to our definition of association.
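As a concrete sketch of Equations 2 and 3 (illustrative code, not from the paper; the vectors f1 and g1 below are arbitrary), the outer-product rule and linear recall can be written out directly:

```python
# Store one association with the outer-product rule, then recall it.
# f1 and g1 are arbitrary illustrative patterns; f1 has unit length.
import math

N = 4
f1 = [1 / math.sqrt(N)] * N   # unit-length input pattern, ||f1|| = 1
g1 = [2.0, -1.0, 0.0, 3.0]    # arbitrary output pattern

# Equation 2: a_ij = g1(i) * f1(j), that is, A1 = g1 f1^T
A1 = [[g1[i] * f1[j] for j in range(N)] for i in range(N)]

# Equations 1 and 3: recall by linear integration, out(i) = sum_j a_ij f1(j)
out = [sum(A1[i][j] * f1[j] for j in range(N)) for i in range(N)]
print(out)  # [2.0, -1.0, 0.0, 3.0]: g1 reappears, since ||f1||^2 = 1
```

Because f1 has unit length, the recall is exact; the same computation generalizes to many stored pairs, as the text goes on to show.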

It is very unlikely that these sets of neurons exist only to associate a single set of activity patterns. Let us assume that we have K sets of associations, (f_1, g_1), (f_2, g_2), \ldots, (f_K, g_K), each generating a matrix of synaptic increments, A_k. Then, since we have assumed that a single synapse participates in storing many traces, we form an overall connectivity matrix A given by

A = \sum_k A_k.

Let us assume that the f_i are mutually orthogonal, that is, that the inner product of f_i with f_j, defined as

\mathbf{f}_i^T \mathbf{f}_j = \sum_k f_i(k) f_j(k),

is zero for i \neq j. Then, if pattern f_i is impressed on the set of neurons α, we have

A \mathbf{f}_i = A_i \mathbf{f}_i + \sum_{k \neq i} A_k \mathbf{f}_i = \| \mathbf{f}_i \|^2 \, \mathbf{g}_i = \mathbf{g}_i.

Thus, the system associates perfectly. If the fs are not orthogonal the system will produce noise as well as the correct association, but the system is often quite usable. Actually, the "mistakes" made by this system are often as interesting as the "correct" responses (see Anderson, 1977).

Orthogonality is an effective way of dealing with the notion of independence of inputs. If two inputs are orthogonal, then there is no interaction between them; that is, the response of the system to one input is in no way influenced by the other input. It is as if the inputs are going through completely different mechanisms.

A Numerical Example

To show how this system works, let us construct a simple, eight-dimensional system. Assume we have the four orthogonal input vectors, f_1, f_2, f_3, and f_4, shown in Table 1. These vectors are normalized Walsh functions, which are digital versions of sine and cosine functions. We choose arbitrary output vectors, g_1, g_2, g_3, and g_4, and wish to make the associations between pairs of vectors (f_i, g_i). Note that we need place no restrictions on the gs. Thus g_1 and g_2 are orthogonal, and g_3 and g_4 are very close together. They also vary considerably in length: g_1 is 2.24 units long, g_2 is 3.61 units long, and g_3 and g_4 are 4.47 units long. The A_1 matrix, the association matrix between f_1 and g_1, is shown in Table 2. We have not shown the other three matrices, A_2, A_3, and A_4, but their construction should be clear from A_1. The resulting sum of all four matrices is shown in the second part of Table 2.



Table 1. Input and Output Vectors Used in the Numerical Example Given in the Text.

[The table lists the four input vectors f_1 through f_4, normalized Walsh functions with components ±1/√8, and the four output vectors. As far as the entries are recoverable, g_1 = (1, 0, -1, 0, 1, -1, -1, 0), g_2 = (1, 2, 0, -1, -1, -1, 1, -2), g_3 = (3, 0, -1, -1, -2, 0, -1, 2), and g_4 = (4, 0, -1, -1, -1, 0, 0, 1), consistent with the lengths quoted in the text.]

Inspection of this matrix shows little sign of the component parts that went to make it up, and asking about the value of any particular element is pointless in relation to the information stored in the matrix. However, calculation will show that

A \mathbf{f}_i = \mathbf{g}_i

for all four input vectors, which shows that indeed such a simple matrix can "learn" four essentially arbitrary associations.
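This claim can be checked with a short sketch. Since Table 1 is only partly legible here, the inputs below are four rows of an 8 × 8 Hadamard matrix serving as illustrative normalized Walsh functions, and the outputs are illustrative eight-component vectors (treat both as stand-ins, not the paper's exact Table 1):

```python
# Four orthonormal Walsh-type inputs store four outputs in one summed
# matrix A; recall A f_k then returns each g_k exactly.
import math

N = 8
walsh = [                      # mutually orthogonal +/-1 patterns
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
    [1, -1, -1, 1, 1, -1, -1, 1],
]
fs = [[x / math.sqrt(N) for x in w] for w in walsh]  # normalize to unit length
gs = [                         # illustrative output vectors
    [1, 0, -1, 0, 1, -1, -1, 0],
    [1, 2, 0, -1, -1, -1, 1, -2],
    [3, 0, -1, -1, -2, 0, -1, 2],
    [4, 0, -1, -1, -1, 0, 0, 1],
]

# A = sum_k g_k f_k^T: every synapse participates in all four traces
A = [[sum(g[i] * f[j] for f, g in zip(fs, gs)) for j in range(N)]
     for i in range(N)]

for f, g in zip(fs, gs):
    out = [sum(A[i][j] * f[j] for j in range(N)) for i in range(N)]
    assert all(abs(o - t) < 1e-9 for o, t in zip(out, g))
print("A f_k = g_k for all four stored associations")
```

The cross terms vanish because the inputs are orthogonal, which is exactly the argument made above for the general case.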

Table 2. Matrices Associating Pairs of Vectors in the Numerical Example.

[The table shows the matrix A_1 = g_1 f_1^T, each of whose columns is ±(1/√8) g_1, followed by the summed matrix A = A_1 + A_2 + A_3 + A_4.]

One of the most important aspects of these models is their similarity to a filter in the strict sense of a system which responds weakly to an input that has not been learned (i.e., to which the filter is not "tuned") but which responds strongly to an input that has been seen before. We can demonstrate this property with our matrix. Suppose we have a random unit vector as input, that is, a vector generated by the uniform distribution on the unit sphere in eight-dimensional space. We should expect, if the system is filter-like, that there will be a short vector appearing at the output, on the average. We used a computer to generate 1,000 random unit vectors and looked at the lengths of the resulting output vectors. The distribution of output lengths is shown in Figure 2. The lengths of the actual stored associations—the gs—are shown by arrows in the figure. Strictly on the basis of length, almost no outputs due to random vectors were as long as g_3 and g_4 (96% were shorter), and 85% of the outputs were shorter than g_2. Even g_1, which was half the length of g_3 and g_4, was longer than 45% of the random outputs. Thus, even with as crude a measure of filter characteristic as output length, in a very low dimensionality system, a fairly good job of discrimination can be made between old and new inputs. Introduction of even a very few elementary mechanisms to "sharpen up" the response from the system, such as cascades of filters or positive feedback (discussed in Section IV), can make these matrices good filters with an interesting "cognitive" structure to them.

Figure 2. One thousand random unit vectors were input to the matrix constructed as a numerical example in the bottom half of Table 2. The figure shows a histogram of the lengths of the 1,000 resulting output vectors. The arrows point to the lengths of the actual output vectors associated with the four inputs used in the matrix. (g_1, g_2, g_3, and g_4 indicate vectors.)

We should observe, however, that the system can make mistakes. Noise is inherent in the system. The histogram shows that there are a few inputs that can give a larger output than given by any of the inputs the system has learned. Thus, we can see that a useful type of analysis for such systems is statistical, and many aspects of their behavior can be studied well with the techniques of communication theory and decision theory. If such a distributed system is indeed present in our cortex, possibly we can see why such statistical methods are so successful in accounting for many of the phenomena found for the psychology of even our highest mental functions.
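The random-vector test described above can be sketched as follows (an illustrative reconstruction, not the paper's exact Table 2 matrix: the stored inputs here are Walsh-type patterns and the outputs are random):

```python
# Compare output lengths for 1,000 random unit vectors against the
# lengths recalled for the four learned inputs (the filter property).
import math
import random

random.seed(0)
N = 8

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def matvec(A, v):
    return [sum(A[i][j] * v[j] for j in range(N)) for i in range(N)]

def length(v):
    return math.sqrt(sum(x * x for x in v))

# Store four associations from orthonormal Walsh-type inputs.
walsh = [[1] * 8, [1, -1] * 4, [1, 1, -1, -1] * 2, [1, -1, -1, 1, 1, -1, -1, 1]]
fs = [unit(w) for w in walsh]
gs = [[random.uniform(-2, 2) for _ in range(N)] for _ in fs]
A = [[sum(g[i] * f[j] for f, g in zip(fs, gs)) for j in range(N)]
     for i in range(N)]

# Normalized Gaussian samples are uniformly distributed on the unit sphere.
rand_lengths = [length(matvec(A, unit([random.gauss(0, 1) for _ in range(N)])))
                for _ in range(1000)]
learned_lengths = [length(matvec(A, f)) for f in fs]  # equal to ||g_k||

print("mean random output length:", sum(rand_lengths) / len(rand_lengths))
print("learned output lengths:  ", learned_lengths)
```

On a typical run the mean random output length falls well below most of the learned lengths, which is the filter-like discrimination the text describes; the exact percentages depend on the stored vectors.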



III. Biological and Psychological "Features"

Introduction

The association model just presented is a good associator, and it has been applied elsewhere to several sets of experimental data (see Anderson, 1977; Nass & Cooper, 1975). A simple variant of the model, relying on the filter characteristics of a similar system, has been used to propose an explanation for some of the data arising in the Sternberg list-scanning experiment (Anderson, 1973).

We hope now to show that an extension of the model has interesting qualitative similarities to some important characteristics of human perception. We hope to show that we can represent noisy inputs to the system in terms of their "distinctive features" and that the features that the model generates are both cognitively significant, in that they are the most useful for discriminating among members of a stimulus set, and, at the same time, are most strongly represented in the output from the system. Thus, what appears to be a highly structured and analytical approach to perception—distinctive feature analysis—can be explained as the result of the operation of a highly parallel, analog system with feedback.

We shall interweave the theoretical discussion with a very brief discussion of the psychological theory and a little data from the neurosciences. This occasionally awkward means of presentation is intended to convey the close interdependence between theory and data from several fields.

Let us make clear all we hope to do. At this time, we do not have the data from either psychology or neuroscience to decisively test the theory we are to present. However, we hope to show that there are striking qualitative similarities between the structure of the theory and the structure that many feel typifies some kinds of perception.

What is a Distinctive Feature?

That there are entities called distinctive features and that these entities are somehow of importance in perception is a commonly accepted belief in psychology at present. A recent elementary textbook (Lindsay & Norman, 1972) builds a large part of the first few chapters around the idea of features, and this approach is central to Neisser's very influential book, Cognitive Psychology (1967). Since the early 1950s, there has been strong evidence from linguistics that phonemes could be characterized as being represented by the presence or absence of a small set (12 in the original analysis) of distinctive features (Jakobson, Fant, & Halle, 1961). Each phoneme was uniquely represented by its own particular set of features. The use of a good set of features is an excellent type of preprocessing and is a commonly used practical pattern recognition technique. The great reduction in dimensionality of the stimulus allows the system to discard irrelevant information and eliminate noise, while making later stages simpler and more reliable, since they have to cope with less complex inputs.

How Might the Brain Do Feature Analysis?

Feature analysis seems to be a strategy used by the brain. How is this analysis performed?

The simplest way might be to have, somewhere, neurons that respond when, and only when, a particular distinctive feature appears. These would then be true "feature detecting" neurons. We might point out that in this case, discharges corresponding to different features would be orthogonal in the sense discussed in the previous section, since different features would give rise to activity patterns with nonzero values in different sets of elements.

Barlow (1972) has argued strongly that this is truly the way the brain works. He stated a number of what he calls "dogmas" about the relation between brain and perception, proposing that sensory systems are organized so as to achieve as complete a representation of the stimulus as possible with the smallest number of discharging neurons. He estimated that as few as 1,000 cells in visual cortex may fire in response to a complex visual stimulus. He proposed that what we call "perception" may correspond to the activity of a small number of high level, very selective neurons, each of which "says something of the order of complexity of a word" (p. 385). Informal discussions indicate to us that many neurophysiologists are sympathetic to this point of view.



Biology and Psychology: Audition

Suppose we look at published feature lists for spoken language (see Lindgren, 1965) and for letter perception (Gibson, 1969; Laughery, 1969). The features proposed by these and other writers are psychological features. They are quite complex. Lindgren describes the acoustic characteristics of the distinctive features proposed for spoken language. Almost every feature is characterized by complex changes in wide, often ill-defined bands of frequencies. For example, the "vocalic versus non-vocalic" feature is characterized by "presence versus absence of a sharply defined formant structure" (Lindgren, 1965, p. 55). The "nasal versus oral" feature is described as "spreading the available energy over wider (versus narrower) frequency regions by a reduction in the intensity of certain (primarily the first) formants and introduction of additional (nasal) formants" (Lindgren, 1965, p. 55). The only feature that seems to correspond to a well-defined set of frequencies is the "voiced versus voiceless" feature, which describes the presence or absence of vocal cord vibration. Even here, though, men, women, and children have characteristic vocal cord frequencies differing over a two-octave range, all presumably capable of exciting the voicing feature. The complicated structure of the vocal tract would usually be expected to give rise to equally complicated variations in frequencies of resonances with changes in geometry.

The neurophysiology of the auditory system is very complex and not well understood at present. Although the lower levels of the auditory pathway seem to be primarily frequency analyzers, neurons in auditory cortex are highly individualistic and variable in their responses. Many cells in primary auditory cortex have very sharply tuned responses, but others have quite wide bandwidths.

If species-specific vocalizations are used to stimulate cells, the picture is no simpler. Wollberg and Newman (1972) recorded from the auditory cortex of squirrel monkeys, using recordings of species-specific calls as stimuli. They reported, "some cells responded with temporally complex patterns to many vocalizations. Other cells responded with simpler patterns to only one call. Most cells lay between these two extremes" (p. 212).

Funkenstein and Winter (1973; Winter & Funkenstein, 1973) did a similar experiment with a wider range of squirrel monkey vocalizations. They also found a variety of cell responses, from a small percentage of cells that responded only to particular calls, to cells that responded to particular frequencies in any context—noise, pure tones, or calls.

Evans (1974) reviews a number of experiments and describes the bewildering variety of response types observed. Even in research on an animal as unintelligent as a bullfrog, which has a behaviorally important mating call with two spectral peaks, Frishkopf, Capranica, and Goldstein (1968) did not uncover cells that responded only to both peaks presented simultaneously, a property which would be required of a "mating call detector." Although the frequency responses of cells in the frog auditory system were commonly tuned to one or the other spectral peak, the investigators found no cells in the midbrain and medulla that put the two peaks together.

Biology and Psychology: Vision

The feature lists that are proposed for recognition of capital letters are deceptively simple. In the lists, one typically finds proposed features such as vertical line segments at left, center, or right; horizontal line segments at top, middle, or bottom; or curves, slants, or parallel lines (Laughery, 1969).

With a list of this type, it is only a small conceptual leap to identify these psychological features with groups of particular cells of the type known to exist in primary visual cortex which show orientation and edge sensitivity (Hubel & Wiesel, 1962, 1968). It should be emphasized that single cells in primary visual cortex do not show the requisite selectivity in the sense that they respond to features and only features. Cells in primary visual cortex are quite selective, but they respond to many aspects of the stimulus. Schiller, Finlay, and Volman (1976), in perhaps the most careful quantitative study of single-cell response properties in monkey striate cortex, make the following comments in the abstract of the


422 ANDERSON, SILVERSTEIN, RITZ, AND JONES

summary paper:

1. Several statistical analyses were performed on 205 S-type and CX-type cells which had been completely analyzed on 12 response variables: orientation tuning, end stopping, spontaneous activity, response variability, direction selectivity, contrast selectivity for flashed or moving stimuli, selectivity for interaction of contrast and direction of stimulus movement, spatial-frequency selectivity, spatial separation of subfields responding to light increment or light decrement, sustained/transient response to flash, receptive-field size, and ocular dominance.

2. Correlation of these variables showed that within any cell group, these response variables vary independently. (p. 1362)

(S and CX cells correspond roughly to the familiar simple versus complex distinction, but with more precise definition.)

Besides finding support for our earlier assumption of the great individuality of cortical neurons, we can see that single cells can have their discharges modified by a wide range of aspects of the stimulus. Here also we do not seem to find cells that respond only to psychological features, although there are cells with very pronounced selectivities.

A Regrettable Misapprehension

It is apparent from reading the literature in this area that the word "feature" as used by psychologists and by neuroscientists has come to mean different things. When a psychologist discusses features, what seems to be meant is a complex kind of perceptual atom which is independent of other atoms and constitutes an elementary unit out of which perception is built. The feature lists that have been proposed for both letter perception and speech perception involve many different aspects of the input stimulus. Their simplicity is deceiving when considered in light of the properties of the single cells of the nervous system.

It is also apparent, regrettably, that when a neurobiologist refers to a "feature detector," he is typically referring to a single neuron which displays a certain amount of selectivity in its discharge, often for the biologically important and relevant aspects of the stimulus. This does not mean that this cell has the specificity of response to be a detector of the psychological feature. Something more is involved.

IV. A Theoretical Approach to Psychological Features

There is an implicit neural model—that formulated clearly by Barlow—in the identification of single-cell properties with features. There is an alternative point of view that uses the observed single-cell selectivities to provide selectivity to the psychological features, but now as part of well-defined activity patterns that make use of cell properties.

McIlwain (1976) makes the important point that the large receptive fields often seen in the higher levels of the visual system and the response of the cells to many aspects of a stimulus do not necessarily mean that the overall system lacks precision. Our distributed system using the output of the entire group of cells can be very precise, in that the discharge pattern due to one input can be reliably differentiated from that arising from a different stimulus, even though there are many cells in the group that may respond to both inputs. As a relevant example, the associative network presented in Section II of this article is completely interconnected. A cell in the second set of neurons, β, can respond to any cell in a, the first set. A single cell has a large receptive field and is very unselective. However, we showed that the output patterns of all the cells in β can be made to respond strongly only to particular inputs, and thus the system displays considerable selectivity.

Can we make a neural model analyze its inputs in terms of features? It is clear that distinctive feature analysis has an important learned component. In the perception of written letters, this is obvious. In spoken language, which is much more biologically determined than are written letters, Eimas, Siqueland, Jusczyk, and Vigorito (1971) have shown that some aspects of linguistic distinctive features are both built in and modifiable. For example, a category boundary for voice onset time, which appears to correspond to the voiced-voiceless feature, is present in the human infant. Yet this feature boundary can be modified in the adult, depending on the phonetic structure of the language spoken (Eimas & Corbit, 1973).

Our theoretical aim is to reduce the dimensionality of the stimulus so that a very complicated stimulus, exciting perhaps millions of selective cells, could act as if only a small number of independent elements were involved.

[Figure 3 here shows an input feeding a set of N neurons, a. Legend: 1. Set of N neurons, a. 2. Every neuron in a is connected to every other neuron in a through learning matrix of synaptic connectivities A.] Figure 3. A group of neurons feeds back on itself. Again, note that N = 6, as in Figure 1. Note that each cell feeds back to itself as well as to its neighbors.

A Model

Assume we have an associative system (Section II) which couples a set of neurons, a, to itself, instead of to a different set of neurons. We again make the approximation that every neuron projects to every other neuron. Let us assume that this feedback connection is through a matrix, A, of synaptic connectivities. Figure 3 shows this situation.

Let us consider the case where a pattern of activity on the set a is coupled to itself. The increment in synaptic strength, ΔA_ij, is proportional to the product of the activity shown by the ith neuron, f(i), and the jth neuron, f(j). Note that

ΔA_ij = ΔA_ji. (4)

This means that A is what is called a symmetric matrix.

This implies, in turn, the existence of N mutually orthogonal vectors e_1, . . . , e_N such that

Ae_i = λ_i e_i,  i = 1, . . . , N,

where each λ_i is a real number. The e_i are called eigenvectors of A, while each λ_i is called the eigenvalue associated with e_i. Since there are N mutually orthogonal eigenvectors, every vector is a linear combination of the eigenvectors. An important consequence of this is that A is completely determined by its sets of eigenvectors and corresponding eigenvalues.
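These two facts—symmetry of the Hebbian outer-product increment, and the resulting full set of orthogonal eigenvectors—can be checked in a few lines. A minimal numpy sketch (all variable names are ours, not the paper's):

```python
import numpy as np

# One Hebbian learning increment: Delta A_ij proportional to f(i) f(j),
# i.e., the outer product f f^T. Such a matrix is symmetric (Equation 4),
# so it has N real eigenvalues and N mutually orthogonal eigenvectors.
rng = np.random.default_rng(0)
N = 6
f = rng.standard_normal(N)
A = np.outer(f, f)                  # Delta A_ij = f(i) f(j)

assert np.allclose(A, A.T)          # symmetric, as in Equation 4

# eigh is numpy's solver for symmetric matrices: real eigenvalues,
# orthonormal eigenvectors.
eigenvalues, eigenvectors = np.linalg.eigh(A)
assert np.allclose(eigenvectors @ eigenvectors.T, np.eye(N))   # orthonormal set
assert np.allclose(A, eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T)
```

The last assertion illustrates the closing sentence above: the eigenvectors and eigenvalues completely determine A.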

In many systems, the eigenvalues and eigenvectors are of great importance. Our system is no exception. Let us consider how a matrix, starting from zero, would develop.

Assume we present K orthonormal inputs, that is, mutually orthogonal unit vectors, each input f_i appearing k_i times. Then by our associative model, each f_i is an eigenvector of A with corresponding eigenvalue k_i, since

A = Σ_{i=1}^{K} k_i f_i f_i^T

and

Af_j = [Σ_{i=1}^{K} k_i f_i f_i^T] f_j = k_j f_j,  j = 1, . . . , K.

If K < N, the remaining N − K eigenvectors of A have zero eigenvalues. We see then, in this case, that the eigenvectors of A with large eigenvalues will tend to correspond to commonly presented patterns, and the eigenvalue (in a more complex system) will be at least a rough estimate of the frequency of presentation.
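A small numerical sketch of this construction, under the stated assumption of orthonormal inputs (the counts and names below are our illustrative choices):

```python
import numpy as np

# If K orthonormal inputs f_i are each presented k_i times, the learned
# matrix A = sum_i k_i f_i f_i^T has each f_i as an eigenvector whose
# eigenvalue is its presentation count k_i.
N, K = 8, 3
ks = np.array([5.0, 2.0, 1.0])                     # presentation counts k_i
# An easy source of K mutually orthonormal vectors: QR of a random matrix.
rng = np.random.default_rng(1)
F, _ = np.linalg.qr(rng.standard_normal((N, K)))    # columns of F are the f_i
A = sum(k * np.outer(f, f) for k, f in zip(ks, F.T))

for k, f in zip(ks, F.T):
    assert np.allclose(A @ f, k * f)                # A f_j = k_j f_j
```

The most frequently presented pattern ends up with the largest eigenvalue, matching the "frequency estimate" remark above.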

Suppose now the inputs are arbitrary except that the average input

f̄ = Σ_{i=1}^{K} p_i f_i

is zero. This important assumption is acknowledgment of the fact that inhibition and excitation are equally important and equally prominent in the operation of the nervous system. We make the above assumption in the form it takes because it is very convenient for our mathematical interpretation, but the exact conditions that occur in the nervous system are beyond our present knowledge.

With this assumption, A is then a scalar multiple of the sample covariance matrix of the inputs, since the latter is given by

cov(f) = Σ_{i=1}^{K} p_i f_i f_i^T, (5)

where p_i = k_i / Σ_{j=1}^{K} k_j, i = 1, . . . , K. We should point out that this matrix is positive semidefinite, that is, that all the eigenvalues are greater than or equal to zero.
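This proportionality is easy to verify numerically. The sketch below builds mean-zero inputs and compares the learned matrix with Equation 5 (presentation counts and names are our illustrative choices):

```python
import numpy as np

# With the weighted mean of the inputs forced to zero, the learned matrix
# A = sum_i k_i f_i f_i^T is a scalar multiple (total count) of the sample
# covariance matrix cov(f) = sum_i p_i f_i f_i^T of Equation 5.
rng = np.random.default_rng(2)
N, K = 6, 4
ks = np.array([4.0, 3.0, 2.0, 1.0])
ps = ks / ks.sum()                             # p_i = k_i / sum_j k_j
raw = rng.standard_normal((K, N))
inputs = raw - ps @ raw                        # subtract weighted mean: sum_i p_i f_i = 0
assert np.allclose(ps @ inputs, 0.0)

A = sum(k * np.outer(f, f) for k, f in zip(ks, inputs))
cov = sum(p * np.outer(f, f) for p, f in zip(ps, inputs))
assert np.allclose(A, ks.sum() * cov)          # A is a scalar multiple of cov(f)
assert np.all(np.linalg.eigvalsh(cov) >= -1e-12)   # positive semidefinite
```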

From principal components analysis, we find that the eigenvectors of A are related very strongly to the inputs in the following way. We will use terms from probability theory. Let f be the random vector which takes the values f_i with probability p_i, i = 1, . . . , K. Let cov(f) denote the covariance matrix of f, given by Equation 5. The main result from principal components analysis states that any unit vector u that maximizes the variance of the random variable c = u^T f over all unit vectors must be an eigenvector u_1 of cov(f) with the largest eigenvalue λ_1. The variance of c_1 = u_1^T f turns out to be λ_1. The maximum variance of u^T f over all unit vectors orthogonal to u_1 is the second largest eigenvalue λ_2 of cov(f), and it must occur at the corresponding eigenvector. This maximal principle follows through for all the eigenvalues, where at the jth step we maximize over all of the unit vectors orthogonal to the j − 1 eigenvectors already established. The random vector f can then be expressed as

f = Σ_{i=1}^{N} c_i u_i,  c_i = u_i^T f.

The c_i values turn out to be mutually uncorrelated, and we see that

E‖f‖² = E[(Σ_{i=1}^{N} c_i u_i)^T (Σ_{i=1}^{N} c_i u_i)] = Σ_{i=1}^{N} var c_i = Σ_{i=1}^{N} λ_i.

Each input then is a linear combination of the eigenvectors of A, that is,

f_i = Σ_{j=1}^{N} c_{ij} e_j,

where, qualitatively, the eigenvectors with large eigenvalues account for most of the differences between inputs. Furthermore, no correlation exists between c_{i,j1} and c_{i,j2}, where j1 ≠ j2. The eigenvectors of A can therefore be considered as the basic components of the inputs.
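The decorrelation claim can be illustrated directly: projecting the inputs onto the eigenvectors of the covariance matrix yields coefficients whose weighted covariance is diagonal, with the eigenvalues on the diagonal. A sketch under our own illustrative setup:

```python
import numpy as np

# Coefficients of the inputs along the eigenvectors of cov(f) are mutually
# uncorrelated, and the variance of the jth coefficient is eigenvalue lambda_j.
rng = np.random.default_rng(3)
N, K = 5, 4
ks = np.array([3.0, 3.0, 2.0, 2.0])
ps = ks / ks.sum()
raw = rng.standard_normal((K, N))
inputs = raw - ps @ raw                        # weighted mean zero
cov = sum(p * np.outer(f, f) for p, f in zip(ps, inputs))
lams, U = np.linalg.eigh(cov)                  # eigenvalues ascending; columns of U

C = inputs @ U                                 # c_{i,j} = u_j^T f_i
# Weighted covariance of the coefficients: should be diag(lams).
coef_cov = (C * ps[:, None]).T @ C
assert np.allclose(coef_cov, np.diag(lams), atol=1e-10)   # off-diagonals vanish
assert np.allclose(np.diag(coef_cov), lams)               # var(c_j) = lambda_j
```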

As an additional comment, we know from linear algebra that the eigenvectors give the maximum responses from the system. Over all vectors of unit length (that is, all vectors x such that ‖x‖ = 1), the largest value of ‖Ax‖ occurs when x is the eigenvector e_i corresponding to the largest eigenvalue. The next largest value of ‖Ax‖ over unit vectors orthogonal to e_i occurs when x is the eigenvector e_j corresponding to the second largest eigenvalue, and so on.

Significance of Feedback

Let us consider what this system might do to an input to the set of neurons from the sensory receptors or from an earlier stage of processing.

Suppose the input is composed of one of the eigenvectors of the feedback matrix that has a large positive eigenvalue. This means that the activity pattern will pass through the feedback matrix unchanged in direction. It will add algebraically to what is already going on in the set of neurons. Since the eigenvalue is positive, it will add to ongoing activity, that is, to the eigenvector. The larger amplitude pattern will be fed back again and the output from the


feedback matrix will again add to the activity in the set of neurons. This is positive feedback, and the amplitude of the eigenvector will grow. Depending on the details of the system, it may grow without bound or merely show a longer and stronger response, but it will be increased in strength relative to other patterns. Consider an input which contains a contribution from an eigenvector with small or zero eigenvalue. Positive feedback will not significantly enhance this pattern, and the amplitude may increase very slowly or not at all.

Thus the input pattern, after a while, will tend to be composed of only the components of the original input that have large positive eigenvalues, and only these components will participate in further processing. We have just seen that the eigenvalue is in some sense a measure of the importance of the particular eigenvector in discriminating among different members of the items the system has learned, so the patterns with large eigenvalues are the most important patterns for the system. Since this is exactly the behavior we want from distinctive features, let us specifically identify the eigenvectors of the feedback matrix with large positive eigenvalues as the distinctive features of the system. We see that in all important aspects of their behavior they act as we would like distinctive features to act.
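The selective amplification described above can be sketched in a few lines; the eigenvalues and starting pattern below are arbitrary illustrative choices, not values from the paper:

```python
import numpy as np

# Iterating x <- (I + A)x multiplies each eigenvector component by
# (1 + lambda_i) per step, so components with large eigenvalues (the
# "distinctive features") quickly dominate the activity pattern.
N = 4
e = np.eye(N)                                   # coordinate axes as eigenvectors
A = 2.0 * np.outer(e[0], e[0]) + 0.1 * np.outer(e[1], e[1])  # eigenvalues 2, 0.1, 0, 0
x = np.array([0.1, 1.0, 1.0, 1.0])              # the big-eigenvalue feature starts smallest
for _ in range(10):
    x = x + A @ x                               # Equation 6: x(t+1) = (I + A)x(t)

weights = np.abs(x)                             # weight along each eigenvector
assert weights[0] > 100 * weights[1]            # large-eigenvalue feature dominates
assert np.allclose(x[2:], 1.0)                  # zero-eigenvalue components unchanged
```

After ten steps the component that began at 0.1 has grown by (1 + 2)^10 while the others barely moved, which is the "weighting by importance" the text describes.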

Let us note as well that a similar technique is used in pattern recognition and statistics. We have previously mentioned the similarity of this analysis to principal components analysis, and many pattern recognition tasks use very similar techniques because of their theoretical optimality (Young & Calvert, 1974, Chap. 6).

The operation of this system gives us good insight into the profound practical differences between the brain and a digital computer. The reason such pattern recognition techniques are not used more widely is economic: Excessive computation time is required because of the large matrix operations involved. However, we see that an adaptive parallel feedback system with highly interconnected analog elements can process an input so it most strongly weights its features, according to importance, in only one step. We shall discuss this process further in the next few sections.

Some Comments on Neurobiology

The idea of a distributed feedback network is consistent with much we know about the physiology and anatomy of cerebral cortex. We should emphasize as well that the actual pathways may sometimes be more complex than our model. For example, inhibition seems often to be accomplished in mammals by inhibitory interneurons. This extra neuron need not affect the mathematics of the model.

There are a number of feedback systems in cortex and thalamus. Perhaps the most attractive candidate to perform operations like those we discuss in this article is the very rich network of recurrent collaterals of cortical pyramidal cells. Recurrent collaterals are axonal fibers which branch off the axon of a pyramidal cell, loop back into the nearby gray matter, and synapse extensively with the dendrites of nearby pyramids over a range of several millimeters. Globus and Scheibel (1967) comment that the recurrent collaterals of pyramids are the most common class of fibers in neocortex.

Freeman and his collaborators (see Freeman, 1975, for a detailed review) have worked extensively, both experimentally and theoretically, on the electrical activity and connections of prepyriform cortex (olfactory cortex) and olfactory bulb in cats. This primitive cortex may show in simple form the connections that are more highly developed in neocortex. Freeman has had considerable success in applying linear systems analysis to these networks. He has incorporated in some of his models the type of excitatory feedback that we have suggested as a basis for feature analysis. An excitatory collateral system from prepyriform pyramids onto nearby pyramids has been described anatomically, as has a similar collateral system in hippocampus, another primitive cortex. The anatomy of these connections, and other recurrent systems in neocortex, as well as some of the physiology, is reviewed in Shepherd (1974).

Higher order loops, from one region of cortex to another and back, or from thalamus to cortex and back, are also common in the brain and may also be candidates for the psychologically significant feedback loops we would like to find. However, the physiological


data on these systems, other than those suggesting their existence and importance, are presently sparse. Further discussion of evidence for physiological mechanisms that may participate in feedback interactions is given elsewhere (Anderson, 1977, Note 1).

More Detailed Study of Feedback

In the study described in detail in Anderson (1977), we used linear systems analysis to obtain exact solutions of an interesting case. We showed that the response of the feedback system had the properties we claimed for it. Suppose we represent an input as a weighted sum of eigenvectors of the feedback matrix, A. Since the eigenvectors are orthogonal, they can serve as a basis set. Suppose the input is then presented to the system. We showed that after a period of time, the relative weights of the eigenvectors changed, and that the eigenvectors with large positive eigenvalues were much more heavily represented in the activity of the set of neurons than were the eigenvectors with smaller eigenvalues. We also showed that the response of the system to eigenvectors with large positive eigenvalues lasts longer. Some variants of the model have regions of stability as well; that is, the response of the system dies back to zero or remains bounded as long as the largest eigenvalue does not exceed a certain critical value. Above this value, the system "blows up"; that is, the amplitude of the activity pattern increases without bound. In this calculation, the time constants of the feedback system of the brain were quite long relative to the duration of the neural activity representing the sensory input.

In this article, we consider a slightly different model. We assume, essentially, that the time constant of the feedback is fast compared to the sensory input. Thus, feedback and current activity directly add in the same time period.

As a speculative comment, we observe that the first model bears a certain impressionistic relation to the way visual processing in reading has been conjectured to be performed. A visual input is received and coded with a great burst of activity from the sensory neurons, which then become relatively quiet; however, processing continues for 200 or 300 msec before the eyes jump to a new location in a saccade. Then the process is repeated. We might conjecture

that the auditory system—particularly in speech perception—is following a somewhat different strategy. Phonemes follow one another in quick succession, tens of milliseconds apart, and the nervous system must respond more quickly to a constantly changing input.

There are many ways we could set up the feedback model we shall discuss for the remainder of this article, but we initially chose one that was exceptionally convenient for computer simulations. Our general philosophy of modeling is always to try simple things first. We suspect that similar but more complex variants will not show very different qualitative behavior.

The dynamics of the system are assumed to occur in discrete time. Let x(t) denote the activity vector (the "state vector") at time t. Integer values are taken by t. The activity at time t + 1 is assumed to be the sum of the activity at time t and the action of the feedback matrix on the activity at time t. The summing of the output of the feedback system and the activity at this time is assumed to occur in the same time quantum, which implies that their time courses are comparable. Thus, we have

x(t + 1) = x(t) + Ax(t) = (I + A)x(t), (6)

where I is the identity matrix. Throughout this discussion, we let A be fixed. The system is, indeed, a positive feedback system, due to the fact that all eigenvalues of A are nonnegative. To see this, let e_1, . . . , e_N denote the orthonormal eigenvectors of A with corresponding nonnegative eigenvalues λ_i, i = 1, . . . , N. Then A can be written as

A = Σ_{i=1}^{N} λ_i e_i e_i^T.

Let

x(t) = Σ_{i=1}^{N} x_i e_i.

Then

x(t + 1) = (I + A)x(t) = [Σ_{i=1}^{N} (1 + λ_i) e_i e_i^T] Σ_{j=1}^{N} x_j e_j = Σ_{i=1}^{N} x_i (1 + λ_i) e_i, (7)


so that

‖x(t + 1)‖² = [Σ_{i=1}^{N} x_i (1 + λ_i) e_i]^T [Σ_{j=1}^{N} x_j (1 + λ_j) e_j] = Σ_{i=1}^{N} x_i² (1 + λ_i)² ≥ Σ_{i=1}^{N} x_i² = ‖x(t)‖². (8)

Thus, the length of the activity vector is nondecreasing at every step. If x(t) is made up only of eigenvectors of A with zero eigenvalues, then from Equation 6 we see that x(t + 1) = x(t), so if the system starts at one of these points it stays there for all time. All other vectors will respond to the system. In fact, for these vectors strict inequality holds in Equation 8.
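Equation 8 can be checked numerically for a generic positive semidefinite feedback matrix (the construction M Mᵀ below is our convenience for generating one, not the paper's):

```python
import numpy as np

# With A symmetric and positive semidefinite, the update
# x(t+1) = (I + A)x(t) never shortens the activity vector (Equation 8).
rng = np.random.default_rng(4)
N = 6
M = rng.standard_normal((N, N))
A = M @ M.T / N                      # symmetric, positive semidefinite
x = rng.standard_normal(N)
norms = []
for _ in range(5):
    norms.append(np.linalg.norm(x))
    x = (np.eye(N) + A) @ x          # Equation 6

assert all(b >= a for a, b in zip(norms, norms[1:]))   # length never decreases
```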

At this point we must introduce an important feature into the model, one that will break with the linearity that we have assumed up to now. We pointed out that in one version of the feedback model certain positive eigenvalues of A were stable, in that the system activity did not grow indefinitely large as time progressed. In the present version of the model, the same is not true; activity will grow without bound. Unfortunately, the desirable features of positive feedback are exactly the ones that cause catastrophes. This is inappropriate behavior for a system that requires unquestioned stability at all times. The cases of unstable neuronal discharge that we know of give rise to highly pathological seizure states. The normally functioning brain seems to be extremely stable and resistant to "runaway," which is a very important observation in view of the powerful excitatory mechanisms that the brain contains.

The simplest way of containing the activity of the system is to use the fact that neurons have limits on their activities: They cannot fire faster than some frequency (usually around several hundred spikes per sec, and in some auditory units, as fast as 1,000 spikes per sec), and they cannot fire slower than 0 spikes per sec. Thus, there are positive and negative limits on firing rate.

Suppose we incorporate this property into our model. A particular activity pattern is a point in a very high dimensionality space. The coordinate axes correspond to activities of

individual neurons. Thus, putting limits on firing frequency corresponds to putting the allowable activity patterns into a box. Since large, high-dimensionality vectors describing the system are often called "state vectors," we name this model the "brain-state-in-a-box" model, with the associated image of a state vector, a point in space, buzzing around like a bee under the influence of input and feedback.

We formalize this situation by assuming (possibly unrealistically) that the limits are symmetrical around the origin. That is, saturation in the system is achieved by keeping the system on or inside the cube in N-dimensional space defined by x_i = ±C, i = 1, . . . , N, where x_i is the activity of the ith neuron.

Applying this assumption to our model, at each time step, the activity vector first undergoes the change given by Equation 6, and then each coordinate which is either greater than C or less than −C is replaced by C or −C, respectively. Using the maximum (max) and minimum (min) functions, we can write the dynamics of the saturating system for the ith element of x, at time t + 1, as

x_i(t + 1) = max(−C, min(C, [(I + A)x(t)]_i)).
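A minimal simulation of this saturating update (the matrix, gain, and noise level are our illustrative choices) shows the state being driven into a corner of the box:

```python
import numpy as np

# Brain-state-in-a-box dynamics: feed back through (I + A), then clip every
# coordinate to [-C, C]. A weak, noisy version of a learned pattern is
# driven to the corresponding corner of the box.
rng = np.random.default_rng(5)
N, C = 6, 1.0
f = np.sign(rng.standard_normal(N))             # a corner direction, entries +/- 1
A = 0.5 * np.outer(f, f) / N                    # feedback learned for that pattern
x = 0.1 * f + 0.01 * rng.standard_normal(N)     # weak, noisy input
for _ in range(200):
    x = np.clip((np.eye(N) + A) @ x, -C, C)     # one time quantum of the dynamics

assert np.allclose(np.abs(x), C)                # final state is a corner of the box
assert np.array_equal(np.sign(x), f)            # the corner matching the learned pattern
```

The component along f grows by a factor (1 + 0.5) per step until the walls are hit; the clipping then pins every coordinate at ±C.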

Our primary interest in the cube is the corners, that is, points all of whose coordinates are ±C. There are 2^N different corners. Suppose x_0 is a corner with the property that each coordinate of Ax_0 is nonzero and carries the same sign as the corresponding coordinate of x_0. Then it follows that there exists a neighborhood N of x_0 (for our purposes, a neighborhood of x_0 can be thought of as a ball centered about x_0) such that if the activity vector ever lands in the intersection of N and the cube, it eventually reaches x_0 and stays there for all future time. Points of this type are called stable. If some of the coordinates of Ax_0 are zero, then x_0 can still be stable if there is a neighborhood N of x_0 such that each element x of N, where x ≠ x_0, satisfies the above condition. It is easy to see that if a corner is stable its antipodal corner is stable.

If an eigenvector of A lies along a diagonal


of the cube, then the two corresponding corners are stable. Moreover, all points (except 0) sufficiently close to the eigenvector will wind up in either corner. The collection of points in the cube that are attracted to a stable corner x_0 is called the region of stability of x_0. We will use these regions (where all points are considered equivalent, since they all end in the same corner) in our applications in the next two sections.

Our experience with computer simulations and our intuitions suggest that the qualitative behavior of the system is straightforward. Suppose we start off with an activity vector which is receiving powerful positive feedback; that is, without limits the vector would grow indefinitely. The vector lengthens until it reaches one of the walls of the box; that is, one of its component neurons reaches the firing limit. The vector will try to get longer, but it cannot escape from the box. Thus, it will head for a corner, where it will stay if the corner is stable. It can be shown that in many cases only some corners are stable.
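The corner-stability condition stated above translates directly into a sign test; the matrix and corners below (with C = 1) are our illustrative choices:

```python
import numpy as np

# A corner x0 is stable when every coordinate of A x0 is nonzero and carries
# the same sign as the corresponding coordinate of x0.
def corner_is_stable(A, x0):
    fb = A @ x0
    return bool(np.all(fb * x0 > 0))            # signs agree, no zero coordinates

f = np.array([1.0, 1.0, -1.0, -1.0])            # a learned corner (C = 1)
A = np.outer(f, f) / 4.0                        # feedback built from that corner

assert corner_is_stable(A, f)                   # the learned corner is stable
assert corner_is_stable(A, -f)                  # ...and so is its antipode
assert not corner_is_stable(A, np.array([1.0, 1.0, 1.0, 1.0]))  # an unlearned corner
```

The antipodal-corner symmetry noted in the text falls out immediately, since A(−x0) = −Ax0 leaves the product of signs unchanged.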

An Important Special Case

In some of the calculations we shall perform in the next two sections, we must specify how the eigenvalues change with time. In one important special case, this is extremely simple.

We observe that in a saturating model, the final state of the system is always a corner. Thus, if we increment matrix A by the final state given by that corner, the incremental set of synaptic changes will always be the same.

If the corners are eigenvectors, which could be the case after a long time learning only corners, then the eigenvalues will be increased by a constant amount every time that the corner corresponding to that eigenvector appears. This can be seen easily. If f is both an eigenvector and a corner, then Af = λf. The incremental change in A due to learning is given by ΔA = ηff^T, where η is a learning parameter. If f is normalized so ‖f‖ = 1, then

ΔAf = ηf(f^T f) = ηf

and

(A + ΔA)f = (λ + η)f. (9)
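Equation 9 can be confirmed in a couple of lines (the particular f, λ, and η are illustrative; f is normalized, with its direction along a cube diagonal):

```python
import numpy as np

# If a normalized pattern f is already an eigenvector of A with eigenvalue
# lam, the learning increment Delta A = eta f f^T raises that eigenvalue to
# lam + eta, exactly as in Equation 9.
N = 4
f = np.ones(N) / np.sqrt(N)                     # normalized, ||f|| = 1
lam, eta = 2.0, 0.25
A = lam * np.outer(f, f)                        # f is an eigenvector: A f = lam f
A_new = A + eta * np.outer(f, f)                # one learning increment

assert np.allclose(A @ f, lam * f)
assert np.allclose(A_new @ f, (lam + eta) * f)  # eigenvalue grows by exactly eta
```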

A Further Comment

Mathematically, the covariance matrix defined in Equation 5 is positive semidefinite; that is, all the eigenvalues are either zero or positive. One reason for this is the presence of large positive values on the main diagonal, the a_ii. This is so, since if we have K stored inputs with each input f_t appearing k_t times, then

a_ii = Σ_{t=1}^{K} k_t f_t(i)² ≥ 0.

This term corresponds to the feedback of a neuron on itself. Although so-called "autapses" are found occasionally in cortical pyramidal cells (van der Loos & Glaser, 1972), most of the physiologically studied cases where a cell feeds back on itself involve special neural circuits, often inhibitory. A well-known example is the Renshaw cell system, a special class of cells providing inhibitory feedback to spinal motor neurons. We need not be restricted to the values given above for the main diagonal of the matrix. We can let these values be zero, if we wish, as we show in Appendix C.

Zero Activity Level

The model presented previously predicts that almost all neurons will be firing at either maximum or minimum rate. Clearly, in a real nervous system, many, if not most, neurons probably will not respond to a given stimulus, although a sizable fraction may participate in the activity pattern. There are several ways we can have a number of zero elements in our activity vectors. Even in the model given here, where feedback is very powerful and connectivity is complete, input patterns will occasionally give rise to a zero activity level of a neuron in the set if all the eigenvectors with nonzero eigenvalues have zero in the coordinate corresponding to that neuron. If the space is even moderately filled with eigenvectors (i.e., if K is somewhat comparable to N and many eigenvalues are nonzero) or if noise is introduced into the system, then this is very uncommon. However, if we assume any one of several mechanisms—thresholds of feedback, adaptation, restricted connectivity—we can easily produce a system where many elements


in the N-dimensional space remain at zero, even in the final state.

In the previously mentioned paper by Barlow (1972), it was pointed out that cortical neurons seem to respond in a characteristic way. A linear model would make no restrictions on the activity pattern. That is, we could have activity patterns containing a great many small changes in activity or a few large ones. But the nervous system seems to have chosen the latter kind of response pattern. We see, if we make appropriate assumptions, that we may restrict saturation to a set of the most "important" neurons for the discrimination of the input stimulus.

Cells in cortex are rather selective in that they do not respond to most inputs but have a strong response to some. Our brain-state-in-a-box model has something of this aspect to it, in that in the final state cells may be fully on, fully off, or not participating in the activity pattern at all. It would be of interest to look at cortical neurons in the awake, behaving animal in light of our proposed model.

V. Categorical Perception

The best evidence for distinctive feature analysis comes from linguistics. Therefore, it seemed natural to us to try to apply this model to find if it agreed with the kind of perceptual analysis that occurs in speech perception.

Preprocessing

Distinctive features are usually viewed as a system for efficient preprocessing, whereby a noisy stimulus is reduced to its essential characteristics and decisions are made on these. We showed that the feature vectors arising in the model are akin to those found in principal components analysis and are indeed the most useful patterns for these kinds of discriminations. More, however, is suggested by the model. The brain-state-in-a-box model suggests that the final output of such a system is a stable state corresponding to a corner of the high-dimensional hypercube formed by saturating neural activity patterns. A good preprocessor should put a noisy input into a more or less noise-free standard form for use in later stages of perception. Analysis and noise-free resynthesis of the signal is not necessary if the whole system gives directly as output a noise-free standard form. Features may never occur by themselves, for example, but only serve the function of "steering" the noisy input into the appropriate corner.

Categorical Perception

One of the characteristics that seems to be found in speech perception is "categorical perception." When stimuli of an artificially constructed set vary continuously across a feature dimension, the listener is experimentally found to have difficulties making discriminations within categories. Thus, voice onset time, the time when the vocal cords start to vibrate, is the physical feature used in making the /p/ versus /b/ discrimination. It is found that the listener cannot discriminate well among stimuli classified as /p/ even though voice onset times may vary considerably. Conversely, discriminations across a category boundary are very good. Two stimuli differing only slightly acoustically, with small differences in voice onset time, are well discriminated if they happen to fall on different sides of the category boundary. It has been suggested (see the review by Studdert-Kennedy, 1975, for references to the literature) that incoming speech stimuli are analyzed by both a "categorizer" and a "precategorical acoustic store" (PAS) analogous to iconic memory in vision. It is found experimentally that consonants, particularly stop consonants, display much stronger categorical perception than do vowels. The reason for this, it is suggested, is that consonants are so short in duration that the sensory information contained in the PAS decays too quickly to be used to make discriminations, while the categorical information is much more stable. This hypothesis predicts that vowels, for example, will show more aspects of categorical perception if they are degraded in noise or shortened in duration, both of which would primarily distort the PAS. The predicted result is found. Thus categorical perception may be a very important, indeed characteristic, property of language perception.


The Brain-State-in-a-Box Model

The brain-state-in-a-box model presented here is a model of categorical perception in a rather pure sense. All points in a region are classified together (differences vanish), and all points in different regions, even if initially very near each other, are classified apart. Thus, no discriminations are possible within the regions, and perfect discrimination occurs between regions.

In previous work (Anderson, 1977; Anderson, Silverstein, & Ritz, in press) we showed that it was possible to use the saturating model to categorically perceive "vowels," that is, vectors corresponding to the outputs from a bank of frequency filters when spoken vowels were the input. For the simulation, we used an eight-dimensional space (i.e., an eight-dimensional vector representing each vowel) and showed that if we let our feedback matrix A learn according to the rules presented earlier, the system would eventually come to act as a good preprocessor. By this, we meant that nine initial vowel codings, derived from experimental data and often starting close to one another, would, after the operation of feedback, be associated with separate stable corners. The system "learned" to do this with about 20,000 total presentations of the nine vowels in the set of stimuli. Initially the system classified different vowels in the same corner and took many computer iterations (i.e., passages through the feedback system according to Equation 6), but after 20,000 learning trials, all vowels had their own final corner and the system performed correct classification after only seven iterations.

This simultaneous increase in both accuracy and speed of classification struck us as a good demonstration of what we feel are the practical virtues of such a system for perception.
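The dynamics just described can be sketched in modern code. This is our reading of the brain-state-in-a-box iteration, not the authors' program: we assume the state is updated by adding the matrix feedback to the current activity and clipping each component at the saturation limits (Equation 6 itself is not reproduced in this excerpt), and the eigenvalue of 0.5, the saturation limit of 2 units, and the noisy starting point are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def bsb_step(x, A, limit=2.0):
    """One pass of feedback: add A @ x to the state, then clip every
    neuron's activity at the saturation limits (the walls of the box)."""
    return np.clip(x + A @ x, -limit, limit)

def classify(x, A, limit=2.0, max_iters=200):
    """Iterate until every component saturates; return the final corner
    (a vector of +/-1) and the number of iterations used."""
    for n in range(1, max_iters + 1):
        x = bsb_step(x, A, limit)
        if np.all(np.abs(x) == limit):
            return np.sign(x), n
    return np.sign(x), max_iters

# Two orthogonal corners of an eight-dimensional hypercube.
a = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
b = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
# Feedback matrix with a and b as its only nonzero-eigenvalue
# eigenvectors (eigenvalue 0.5 on each; the value is an assumption).
A = 0.5 * (np.outer(a, a) + np.outer(b, b)) / 8.0

# A noisy input near corner a is driven into that corner.
corner, iters = classify(0.1 * a + 0.02 * rng.standard_normal(8), A)
```

Positive feedback amplifies the components of the input lying along the stored eigenvectors until every element hits a wall of the box, which is why accuracy and speed of classification can improve together as the eigenvalues grow with learning.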

A Computer Simulation

Testing categorical perception experimentally involves two parts: identification and discrimination. A set of artificial stimuli is first constructed which varies smoothly from one speech sound to the other. Sometimes it is easy to construct a continuously graded set of stimuli (for example, voice onset time, where the physical feature is quite clear-cut), but often it is more difficult, as for some of the consonant-vowel formant transitions. The essence of the effect is that subjects do not hear the continuous variation, but instead perceive an abrupt shift from one sound to the other.

For the identification experiment, subjects are presented with different stimuli and asked to say which phoneme they hear. Most of the published discrimination data have relied on the psychophysical technique called "ABX discrimination." Two stimuli, A and B, which differ by a few milliseconds in voice onset time, are presented. The subject is then presented with a third stimulus, either A or B, and is required to say whether the third stimulus matches the first or the second. Thus, if the subject cannot make a discrimination he must guess, which means he will be right 50% of the time, on the average.

We decided that the most straightforward computer simulation we could do would be to simply duplicate these experiments, using the outputs from the saturating neural model, and to show that it behaved in a way which looked like the human data. Since we know very little, to say the least, about the neurophysiology of speech perception, it seems to us premature to do more than point out general similarities of the model with the phenomenon of categorical perception.

We used an eight-dimensional system again, which seemed to us large enough to be indicative of the behavior of a real system, yet small enough to be manageable and of reasonable cost. We assumed we had two eigenvectors of the feedback matrix with nonzero eigenvalues pointing toward two corners. Corner A was (1, 1, 1, 1, −1, −1, −1, −1), and Corner B was (1, 1, −1, −1, 1, 1, −1, −1). This situation is shown in Figure 4. To perform our simulations, we had to have a set of stimuli which varied smoothly from one feature to the other. We picked 16 equally spaced points along the unit circle in the plane containing the two eigenvectors and the origin.

In a real system, of course, there would be noise as well as other eigenvectors. Our simulations would show perfect categorization if there were no noise. We added zero-mean Gaussian noise to each of the eight components of the initial position. In the figures, SD represents the value of the standard deviation of this noise. Figure 4 shows the average length of the noise vector added to the starting point. Note that an SD of .4 corresponds to an average added noise about 1 unit long, or a distance as long as that from the origin to the starting point. This is a great deal of added noise. An SD of .1 corresponds roughly to the distance between initial positions two steps apart. This seemed to us to be a reasonable range of lengths to cover, and one that might be found in a perception system under normal operating conditions. The noise vector is not constrained to lie on the plane containing the eigenvectors and the origin. The geometry of this model is very complicated.

Results of the Simulations

We did a Monte Carlo simulation of 100 presentations of each starting point with added random noise. When the final state of the system was Corner A, we called it Response A,


Figure 4. In the simulation of categorical perception presented in Section V, an eight-dimensional system is constructed with two eigenvectors which point toward two corners. Limits of saturation are two units from the origin. Starting positions for test inputs are equally spaced points along the unit circle, numbered 0 to 15. Zero-mean Gaussian noise of standard deviation (SD) is added to each of the eight components. The average resulting lengths of the noise vectors corresponding to different SDs are shown in the lower-left quadrant. This figure portrays a two-dimensional slice of an eight-dimensional space.


Figure 5. In the "identification" experiment, a "response" corresponds to the corner in which an input ends. The curves show, as a function of starting position, the results of adding various amounts of random noise to a given starting position. One hundred trials were used for each point.

and similarly for B. Figure 5 shows the results of this simulation for several values of SD. If there is no added noise (i.e., SD = 0), the system categorizes perfectly. If there is noise of SD equal to .4, there is a nearly linear decrease in the probability of Response A as the starting point moves from 0 to 15. Intermediate values of noise produced curves which look very much like the published data presented in Studdert-Kennedy (1975), Pisoni (1971), or Eimas and Corbit (1973).
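The identification simulation can be reconstructed in outline as follows. This is a sketch under the same assumptions as the text (saturating state-plus-feedback update, two stored eigenvectors, 16 starting points on the unit circle, Gaussian noise of standard deviation sd on each component); the eigenvalue of 0.5 is our arbitrary choice, since the paper's value is not given in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(1)
a = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
b = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
A = 0.5 * (np.outer(a, a) + np.outer(b, b)) / 8.0   # both eigenvalues 0.5

def final_corner(x, limit=2.0, max_iters=500):
    """Iterate saturating feedback until every component hits a limit."""
    for _ in range(max_iters):
        if np.all(np.abs(x) == limit):
            break
        x = np.clip(x + A @ x, -limit, limit)
    return np.sign(x)

# Sixteen starting positions, equally spaced on the unit circle in the
# plane of the eigenvectors (positions 0 and 15 are the "pure" stimuli).
angles = np.linspace(0, np.pi / 2, 16)
starts = [np.cos(t) * a / np.sqrt(8) + np.sin(t) * b / np.sqrt(8)
          for t in angles]

def identification_curve(sd, trials=100):
    """Fraction of trials ending in corner a ("Response A") at each
    starting position, with Gaussian noise of standard deviation sd
    added to every component."""
    curve = []
    for s in starts:
        hits = sum(np.array_equal(final_corner(s + sd * rng.standard_normal(8)),
                                  np.sign(a))
                   for _ in range(trials))
        curve.append(hits / trials)
    return curve

curve = identification_curve(sd=0.2)
```

With sd = 0 the curve is a step function at the category boundary; moderate noise smooths it into sigmoid identification functions of the kind shown in Figure 5.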

The real test of a categorical perceiver is the difficulty it has performing discriminations within categories. Since experiments often use an ABX paradigm, we simply did so in our simulation, although it was expensive in terms of computer time. We took as an initial input a starting point at, say, point n. We then added random noise and noted which corner appeared as the final state. We repeated the process for an input at point n + 4 or n + 6. We again noted the final state. Then we randomly chose the first or second starting point, added different noise, repeated the process, and attained a third final state. Finally we determined whether the final state agreed with what it was supposed to ("correct") or whether it was in error. For several combinations of corners, it was necessary to require the program to guess, corresponding to a forced choice. In this case, it was correct or incorrect with a probability of .50. We did Monte Carlo simulations using only 40 trials per point to save


[Figure 6: percent correct responses, 25 to 100, plotted against the center of the two starting positions, with separate curves for 4-step and 6-step separation; SD = 0.2.]

Maximum change in the number of steps required between conditions was around 20%, and the shape of the curve was quite similar in all cases.

The required number of steps was about twice as great when the stimulus started from a point near the category boundary. An effect like this has been observed. Data from Pisoni and Tash (1974) for an experiment using voice onset time showed a change in reaction time from about 475 msec in the centers of the categories to about 575 msec at the category boundary, as determined by the identification function. The distribution of reaction times was symmetrical around the category boundary, as is ours.


Figure 6. The "discrimination" experiment had the computer perform an ABX experiment. Two starting points four or six steps apart corresponded to A and B. The program noted which final state appeared for each input. A third input, X, was chosen from either A or B, and the computer decided whether it was classified correctly or incorrectly, or whether guessing was necessary. Added Gaussian noise had a standard deviation of .2 units. For each point, 40 responses were averaged.

computer money. The results are given in Figure 6.

We have shown only data for an added noise of SD equal to .2 units, since this value had identification functions that looked appropriate to us in light of the experimental data. The discrimination functions look similar to what is seen: discrimination shows a pronounced peak when the category boundary separates the starting points and shows a drop if both points start on the same side of the boundary.
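The ABX procedure just described can be sketched directly. This is again our reconstruction, not the original program: the eigenvalue of 0.5 is assumed, the noise uses the SD = .2 of Figure 6, and positions 6 and 10 in the usage line are an arbitrary cross-boundary pair four steps apart.

```python
import numpy as np

rng = np.random.default_rng(2)
a = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
b = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
A = 0.5 * (np.outer(a, a) + np.outer(b, b)) / 8.0

def corner(x, limit=2.0, max_iters=500):
    """Final corner of the saturating system, as a hashable tuple."""
    for _ in range(max_iters):
        if np.all(np.abs(x) == limit):
            break
        x = np.clip(x + A @ x, -limit, limit)
    return tuple(np.sign(x))

angles = np.linspace(0, np.pi / 2, 16)

def start(i):
    """Noiseless starting position number i (0-15) on the unit circle."""
    return np.cos(angles[i]) * a / np.sqrt(8) + np.sin(angles[i]) * b / np.sqrt(8)

def abx_trial(i, j, sd=0.2):
    """One ABX trial: classify noisy versions of positions i (A) and
    j (B), then a noisy X drawn at random from one of them.  When X's
    corner matches both or neither, the program is forced to guess."""
    ca = corner(start(i) + sd * rng.standard_normal(8))
    cb = corner(start(j) + sd * rng.standard_normal(8))
    x_is_a = rng.random() < 0.5
    cx = corner(start(i if x_is_a else j) + sd * rng.standard_normal(8))
    if (cx == ca) == (cx == cb):      # ambiguous: forced guess, p = .5
        return rng.random() < 0.5
    return cx == ca if x_is_a else cx == cb

# Forty trials for one cross-boundary pair four steps apart.
pct = 100 * np.mean([abx_trial(6, 10) for _ in range(40)])
```

Pairs straddling the boundary, as in the example, are mostly classified into different corners and discriminated well above chance; pairs within a category usually land in the same corner, forcing the guess branch and driving performance toward 50%.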

An interesting by-product of the identification simulation was an estimate of the "reaction time" required for the system to classify an input as one or the other corner. We simply counted the number of iterations required to saturate the system and attain the final state. This number was averaged over the 100 trials for each point and is plotted in Figure 7 for SD equal to .2. Decreasing noise seemed to slightly increase average categorization time.

Adaptation

Since we potentially have a learning system, and since we have already shown in our previous discussion how the eigenvalues change when the system is learning eigenvectors pointed toward corners (see Equation 9), it seemed an obvious extension of our simulations to look briefly at the effects of "adaptation" on categorical perception.


Figure 7. The number of computer iterations required for all components of the system to saturate is plotted here as a function of starting position. Added noise had a standard deviation of .2. As starting position varied, the number of steps required to saturate increased near the boundary. Different noise conditions showed very similar patterns. If each iteration or its equivalent takes about the same time to perform in a real nervous system, then this graph could be interpreted as a rough indicator of the pattern of reaction times that would be observed in categorical perception as an input stimulus is moved from one category to another across a boundary. Each point is the average of 100 responses.



Figure 8. When the eigenvalues associated with the eigenvectors are not equal, the identification function shifts. If we assume repeated presentation of an adapting stimulus causes synaptic "antilearning," then this simulation corresponds to how we would expect our system to behave if one feature vector is adapted. Each point is the average of 100 responses.

Cooper (1975) has reviewed some of the adaptation literature and points out some of the general features experimentally observed. Adaptation is typically produced experimentally by having the subject listen to a minute of the adapting stimulus at a presentation rate of 2 per sec. After adaptation, the identification function for the unadapted stimulus (A) indicates a shift toward the adapted stimulus (B). The curve for the identification function is displaced parallel to its initial position; as Cooper comments, "the slopes of the identification functions obtained after adaptation were as steep as the slopes of functions obtained in the unadapted state" (p. 26). There was evidence of crossed-series adaptation; that is, subjects who adapted to voicing in one pair of phonemes also adapted (to a slightly lesser extent) to another pair of phonemes differing in the same feature.

Our simple simulation could not test crossed-series adaptation because our system only had two eigenvectors. However, we can easily check the shift in identification function.

We assumed that "adaptation" was equivalent to synaptic "antilearning," or learning according to Equation 2 with negative sign. This is not simple fatigue but a more complex and subtle process involving precise synaptic change at many synapses. By Equation 9, we see that adaptation causes the eigenvalue of the adapted stimulus to decrease slightly. The eigenvectors did not shift direction. In our initial simulation (see Figure 5) we assumed the two eigenvalues were equal. For the simulation of adaptation we used exactly the same program but decreased the eigenvalue λB. The results for several different ratios of eigenvalue are shown in Figure 8. The noise had SD equal to .2 units. The ratio λA/λB = 1.0 was included to correspond to the initial simulation. It can be seen that the data look very much like parallel displacements of the unadapted curve toward Response B, exactly as seen in the experimental data.
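The adaptation simulation reduces to rerunning the identification experiment with unequal eigenvalues. A sketch (the eigenvalues 0.5 and 0.3, the noise level, and the single test position are our illustrative choices, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(3)
a = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
b = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)

def feedback_matrix(lam_a, lam_b):
    """Feedback matrix with eigenvalues lam_a, lam_b on eigenvectors a, b."""
    return (lam_a * np.outer(a, a) + lam_b * np.outer(b, b)) / 8.0

def p_response_a(A, pos, sd=0.2, trials=200, limit=2.0):
    """Probability that starting position pos (0-15 on the unit circle,
    plus Gaussian noise of standard deviation sd) ends in corner a."""
    t = (np.pi / 2) * pos / 15
    s = np.cos(t) * a / np.sqrt(8) + np.sin(t) * b / np.sqrt(8)
    hits = 0
    for _ in range(trials):
        x = s + sd * rng.standard_normal(8)
        for _ in range(500):
            if np.all(np.abs(x) == limit):
                break
            x = np.clip(x + A @ x, -limit, limit)
        hits += np.array_equal(np.sign(x), np.sign(a))
    return hits / trials

unadapted = p_response_a(feedback_matrix(0.5, 0.5), pos=8)
# "Antilearning" of stimulus B lowers its eigenvalue slightly:
adapted = p_response_a(feedback_matrix(0.5, 0.3), pos=8)
```

Reducing λB enlarges the region captured by eigenvector A, so the same starting position yields Response A more often, a displacement of the identification function of the kind seen in Figure 8.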

It might be mentioned that we did virtually no searching for parameters in most of these simulations, and the simulations used many of the same programs as those used for probability learning in the next section.

Conclusions

We suggest that the simulations presented here are quite good replicas of the major experimental findings of categorical perception. Our previous work with the simulation of vowel preprocessing, coupled with the work discussed here, suggests that models for speech perception might consider using some such ideas as positive feedback, saturation, and synaptic learning, which seem to be responsible for the interesting effects in our simulations.

This simulation shows clearly some of the features we feel may be typical of natural systems constructed with distributed, parallel arrays of interconnected analog elements. The system acts like an adaptive filter, where the filter characteristics are determined by the past history of the filter. The system does not "analyze" its inputs in the sense that a computer or logician might analyze them, by dissecting them into component parts, but it simply responds to them. However, the response of the system is determined by its past, so its analysis becomes meaningful in terms of this past.

VI. Probability Learning

We shall consider here the set of experiments usually called "probability learning." We shall make a direct application of the model previously described and show that it can provide a model for this seemingly remote application.

The ability to estimate the probability of occurrence of an event with a random component (whether or not it will rain, who will win an election, what the stock market will do) is obviously important in daily life. It has also served as the basis of a large body of work in experimental psychology. Probability learning has been studied in a number of ways. A classic experimental technique involves a prediction by the subject as to which of two lights will be turned on. Typically, a subject will sit facing the lights. He will be asked, usually immediately after a signal, which light will turn on after a brief interval. The two lights are usually turned on randomly.

The general results are quite consistent from experiment to experiment. The subjects will tend to "match" the probabilities of the events; that is, if the left light occurs with a probability of .8, the subject, after a number of trials, will predict that light about 80% of the time. This result is not what one would expect if one were to assume that the subject was trying to maximize his chances of successful prediction. If he were, then he would choose the most probable light all of the time. Of course, by providing appropriately large payoffs, it is possible to encourage the subjects to change strategies, but, if left to themselves, there is a strong tendency to match probabilities in most simple experimental situations.

Statistical Learning Theory

The prediction of probability matching is a simple consequence of statistical learning theory, which is one of the reasons for the great interest in the effect. Derivation of this result is given in a number of places (see Estes & Straughan, 1954; Estes, 1957; and the other papers collected in Neimark & Estes, 1967).

Let us briefly sketch some of the important aspects of the derivation. We assume there are a number of alternative responses, A1, A2, ..., Ar, which can be made by the subject predicting, for example, which of several lights will turn on. The response is followed by an event, E1, E2, ..., Er, which is the actual turning on of one of the lights. Learning theory assumes that the actual event will change the future probability of predicting that event. Suppose the probability of a response, Aj, on the nth trial is given by pj,n. If an event, Ej, occurs, then the probability of Response Aj is increased. In early formulations of the learning rule, the correctness or incorrectness of a prediction determined the subsequent learning. However, it seems that only the occurrence of a particular event is required to increase the associated response probability. As Estes (1972) says in a review article, "the mathematical operator applied on each trial is determined solely by the information received by the subject" (p. 82) as long as special rewards for success and failure are not present. This was shown directly by Reber and Millward (1968). They showed that mere observation of the event lights, in the absence of prediction, produced probability matching in the subjects.

The quantitative rule for the increase in the probability of Aj when Event Ej occurs can be derived from statistical sampling theory, or can arise from other assumptions, and follows the form

pj,n+1 = (1 − θ)pj,n + θ. (10)

The parameter θ is an important learning parameter in statistical learning theory, with 0 < θ < 1. If Event Ej does not occur, then the probability of making Response Aj in the future will be given by

pj,n+1 = (1 − θ)pj,n. (11)

Qualitatively, these expressions make the probability of a response tend to increase after the associated event, and tend to decrease after another event. If we consider the behavior of the probabilities, the expected value for pj,n+1 is given by

E(pj,n+1) = (1 − θ)pj,n + θπj,n, (12)

where πj,n is the current value of the actual probability of occurrence of the Event Ej. Thus, after the first trial,

E(pj,2) = (1 − θ)pj,1 + θπj,1;

after the second trial,

E(pj,3) = (1 − θ)²pj,1 + θ(1 − θ)πj,1 + θπj,2;

and so on.


For the important special case, often used in experiments, where the probability of an Event Ej is fixed at some value, πj, the expected value of the probability of making Response Aj after n trials is given by

E(pj,n) = πj − (πj − pj,1)(1 − θ)^(n−1). (13)

The second term of this expression will go to zero as n increases if θ is greater than zero, that is, if the system learns at all. Thus, the asymptotic value of E(pj,n) is given by the probability of the event, πj, which is essentially what is observed. This result is completely independent of the initial probability of the response and the learning parameter, θ.
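Equations 10 through 13 are simple enough to check directly. The sketch below implements the linear operators and the closed-form expectation (our code, not the original; the values of θ, π, and the initial probability are arbitrary examples):

```python
import random

def expected_response_prob(pi, p1, theta, n):
    """Closed-form expectation of Equation 13: E(p_n) for an event of
    fixed probability pi, initial probability p1, learning rate theta."""
    return pi - (pi - p1) * (1 - theta) ** (n - 1)

def simulate(pi, p1, theta, n, rng):
    """Monte Carlo version: apply Equation 10 when the event occurs
    and Equation 11 when it does not."""
    p = p1
    for _ in range(n - 1):
        if rng.random() < pi:
            p = (1 - theta) * p + theta   # Equation 10: event occurred
        else:
            p = (1 - theta) * p           # Equation 11: event did not
    return p

rng = random.Random(0)
# Average response probability after 300 trials with pi = .8:
mean_p = sum(simulate(0.8, 0.5, 0.1, 300, rng) for _ in range(2000)) / 2000
```

The average converges on the event probability (.8 here) regardless of the initial probability and of θ, which is the probability-matching prediction; individual runs fluctuate around that value.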

Problems

This elegant result agrees well with many aspects of the data. However, the actual data present some difficulties for the simple theory, which has led to a number of attempts to modify or extend the simple model.

In some of these extensions, subjects are assumed to memorize past sequences and to predict the event that occurred in the past. There are other approaches, involving "hypothesis testing" or other, generally "intelligent" behavior on the part of the subject.

We will assume here that the tendency of the subjects to look for regularities and sequences in the data interferes with, and is separate from, an underlying straightforward learning phenomenon with no cognitive component.

There are two problems with the simple model, which can be seen in most of the experimental data.

First, when there are only two alternative events, the subjects match probabilities closely but they do not do so exactly. Myers (1976) says in a recent review that the response probability "consistently overshoots" exact matching, and the overshooting has been seen "by almost anyone who has run subjects for more than 300 trials" (p. 173). Overshooting is not large, usually at most a few percent in the two-choice experiment, but it is significant and thus a problem for theorists. The overshooting becomes much larger when there are more than two alternatives (Neimark & Estes, 1967, p. 261). Estes (1964) suggests that overshooting may be a function, to a certain extent, of the instructions given the subject. Myers comments that "investigators who are determined to do so can produce overshooting in a variety of ways; undershooting is somewhat more difficult to achieve, but possible" (p. 175).

Exact probability matching is a good first approximation. However, it is clear that a model should have sufficient flexibility to explain the small but consistent deviations above and below exact matching that are a feature of the experimental data.

Second, a matter that could either be considered aesthetic or substantive, depending on perspective, is the value of the learning parameter, θ. Although θ does not appear in the expression for the asymptote, it determines the rate and trajectory with which the response probability will approach its asymptotic value, because the factor (1 − θ)^(n−1) appears in the second term of Equation 13. Given the time course of response probabilities, it is possible to estimate θ. When different probability conditions are run with the same subjects, θ does not appear to be constant (Neimark & Estes, 1967, pp. 257-258). The change in θ may be considerable, even in very similar experiments.

A Neural Model for Probability Learning

We have at hand, in our brain-state-in-a-box model, the means for automatically generating discrete responses. We also have, if we introduce noise into the system, a means of introducing probabilities naturally into the model. Although we can handle any number of events, we will restrict ourselves to two at first.

Let us observe here that there are two parts to this model. The first is the learning feedback matrix. We have discussed the properties of this matrix in some detail previously. The second part involves the dynamics of the system. The initial state moves, under the influence of feedback, into a corner. How fast it moves and into which corner it moves are given by the properties of the feedback matrix at that time. We assume here that the feedback matrix changes only after a stable corner has been reached. This means that the matrix does not learn while the state vector is changing; that


Figure 9. A two-dimensional, saturating neural system. Each coordinate axis corresponds to a single neuron, or the amplitude of a pattern of neuron activity. This activity saturates; thus activity is confined to the interior and edges of the square. There are two eigenvectors in this system, eigenvector A and eigenvector B, one pointing to each pair of diagonally opposite corners, with a corresponding Region A and Region B. When feedback is positive, all inputs to points in Region A cause the activity vector to end up in the corner to which eigenvector A points, or in the opposite corner. As the ratio of eigenvalue A to eigenvalue B increases, the size of Region A increases; the panels show λA/λB = 1.33, 1.67, 2.0, 3.0, and 5.0.

is, the time constant of learning is long compared to the dynamics of the system.

Suppose we have a two-dimensional system with the dimensions coupled by a feedback matrix. Thus the two-dimensional limits of saturation form a square. Suppose that the two eigenvectors of the feedback matrix point toward the corners of the square. If we start off at some initial point of the square, positive feedback will drive the system toward one of the corners (see Figure 9).

Let us identify one pair of diagonal corners, associated with one eigenvector, with one response. Let us identify the response associated with corners (−1, +1) and (+1, −1) as Response A and the other pair of corners (1, 1) and (−1, −1) as Response B. Identifying two diagonally opposite corners with one response is made primarily for convenience, and could be avoided if necessary.

This two-dimensional system is not so restrictive as it seems. Assume that one response is associated with one large pattern of neural activity, as it might be in a real system, and the other response is associated with another large pattern, orthogonal to the first. These patterns give rise to two orthogonal vectors in a high-dimensional space. Then by considering the plane through the vectors and the origin, we have a two-dimensional system, with corners associated with eigenvectors. The axes of our two-dimensional system might correspond to amplitudes of complex activity patterns interacting with each other. The tractable case we will discuss might be a reasonable approximation to a much more complicated, high-dimensionality system.

As we have mentioned, the plane will be divided into two regions. The origin is unstable and need not be considered further. One region will be associated with one response, and the other region with the other. This is shown in Figure 9.

Suppose the subject wishes to make a prediction (i.e., to make one response or the other). There is no lack of random noise in the nervous system. We will assume that the subject initiates the prediction process by starting at some random point on the plane. If the chances of starting at all points on the plane are equal, then by simply calculating the areas associated with each region and dividing by the total area, we know the probabilities of each response.
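The area argument can be checked by brute force: start the two-dimensional saturating system at uniformly random points of the square and count which response each run produces. This is our sketch, not the authors' program; the state-plus-feedback update, the iteration cap, and the example eigenvalues are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
ea = np.array([1.0, -1.0]) / np.sqrt(2)   # eigenvector for Response A corners
eb = np.array([1.0, 1.0]) / np.sqrt(2)    # eigenvector for Response B corners

def response(x, A, limit=1.0, max_iters=300):
    """Run saturating feedback from x; Response A means the final state
    lies in a corner whose components have opposite signs."""
    for _ in range(max_iters):
        if np.all(np.abs(x) == limit):
            break
        x = np.clip(x + A @ x, -limit, limit)
    return 'A' if x[0] * x[1] < 0 else 'B'

def prob_a(lam_a, lam_b, trials=2000):
    """Estimate P(Response A) from uniform random starting points."""
    A = lam_a * np.outer(ea, ea) + lam_b * np.outer(eb, eb)
    starts = rng.uniform(-1.0, 1.0, size=(trials, 2))
    return float(np.mean([response(s, A) == 'A' for s in starts]))

pa_unequal = prob_a(0.6, 0.3)   # lambda_A > lambda_B: Region A is larger
pa_equal = prob_a(0.4, 0.4)     # equal eigenvalues: equal-area regions
```

With equal eigenvalues the estimate sits near .5; raising λA relative to λB enlarges Region A and the estimated response probability rises accordingly.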

The calculation of these probabilities is straightforward, though involved, and is presented in Appendix A. The boundaries between regions are simple curves, and the areas of the regions can be calculated exactly, with a simple resulting expression. If λA and λB are the eigenvalues of the feedback matrix associated with Response A and Response B, respectively, and if

X0 = λA/λB,

then the probability of Response A is given by

pA = (X0³ + 3X0²)/(X0 + 1)³. (14)

As the ratio increases, Response A becomes more and more probable. The shapes of the regions for different values of the ratio X0 are shown in Figure 9. The probability of Response A versus the ratio of the eigenvalues is shown in Figure 10. Note that pA is monotonically increasing for λA/λB > 1.

Learning Rules

To make predictions of response probabilities, we must specify only how the eigenvalues change with time. We then have a rule for turning the ratio of eigenvalues into response probabilities.

We observe that in the saturating model, the final state of the system is always a corner.

Figure 10. In the model for probability learning presented in the text, it is possible to calculate the probability of each response if the eigenvalue associated with each response is known. Their ratio then gives the probability according to the graph.

Thus, as we showed at the end of Section IV, the eigenvalues follow a very simple learning rule: They increment by a constant amount every time a corner is reached, as shown in Equation 9.

Let us allow that memory decays with time. We also assume that there is a stable part of the eigenvalue which is not affected by either decay or learning. This might correspond to the knowledge, for example, that there are two possible responses. Thus, we get a formula for the eigenvalues that is similar to, but not identical with, that used for the probabilities in statistical learning theory.

Let g be a decay factor, 0 ≤ g ≤ 1, and let η be the amount of increment produced when the system learns. Then, if Event A has just occurred, and if the set of synaptic changes associated with that corner has been added to the feedback matrix, that is, A has been learned, the two eigenvalues for the (n + 1)st trial are given by

λA(n + 1) = 1 + g[λA(n) − 1] + η (15a)

and

λB(n + 1) = 1 + g[λB(n) − 1]. (15b)

The first term, "1," is the constant part; the second term is the decay term; and the third term is the increment. If Event B occurs, λB receives the increment. Note that the success or failure of the prediction made by the subject does not appear in this scheme.
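The learning rule of Equations 15a and 15b can be sketched in a few lines (an illustrative Python reconstruction; the function and variable names are ours). After each trial, the eigenvalue of the event that occurred decays toward the constant part and gains the increment η, while the other eigenvalue only decays:

```python
# Minimal sketch of Equations 15a/15b: one decay-plus-increment update per
# trial. Parameter values eta = .3, g = .95 are those used in the text.

def update(lam_a: float, lam_b: float, event_a: bool,
           g: float = 0.95, eta: float = 0.3) -> tuple[float, float]:
    lam_a = 1 + g * (lam_a - 1) + (eta if event_a else 0.0)
    lam_b = 1 + g * (lam_b - 1) + (0.0 if event_a else eta)
    return lam_a, lam_b

lam_a, lam_b = 1.0, 1.0
for _ in range(100):          # a long run of Event A
    lam_a, lam_b = update(lam_a, lam_b, event_a=True)
print(lam_a)                  # approaches 1 + eta/(1 - g) = 7.0
print(lam_b)                  # stays at the constant part, 1.0
```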

Since we are concerned only with response probabilities, we are concerned only with the ratio of λA and λB. In other applications—


prediction of reaction time, for example, in related tasks—we cannot make this assumption and must know the true values of the eigenvalues.

The general similarity of this model to many aspects of stimulus sampling theory should be emphasized. The learning scheme for eigenvalues is a variant of the "linear" learning model. Thus, many of the predictions will be somewhat similar to those of statistical learning theory. The essential difference is that here the eigenvalues are what are changing—the size of an eigenvalue becomes a sophisticated measure of trace strength—and that changes in probability of response occur as a result of the operation of a complex process instead of directly.

Asymptotic Behavior

It can be seen that the model is completely defined once we specify the constants and have the sequence of events. We can derive asymptotic behavior of the eigenvalues quite easily in some cases. If there is a constant value of probability of Events A and B, with probabilities πA and πB, and if the eigenvalues start at λA0 and λB0, then the average values of λA and λB after n trials are given by

E[λA(n)] = 1 + (ηπA/(1 − g))(1 − gⁿ) + gⁿ(λA0 − 1) (16a)

and

E[λB(n)] = 1 + (ηπB/(1 − g))(1 − gⁿ) + gⁿ(λB0 − 1); (16b)

and after a very large number of trials, if g < 1, the asymptotic values of the averages are given by

E(λA) = 1 + ηπA/(1 − g),

E(λB) = 1 + ηπB/(1 − g).

Since our formula for response probability (Equation 14) requires knowledge of the ratio of the two eigenvalues, it is easy to get an estimate of this value. We can approximate the expected value of the ratio well enough for our purposes by letting

E(λA/λB) ≈ E(λA)/E(λB). (17)

At asymptote, we see that

E(λA/λB) ≈ (1 − g + ηπA)/(1 − g + ηπB).

For the values of parameters used here, the approximation is good to better than 1%.

This relation does not predict simple probability matching but a complex relation involving η and g. Using simple calculus, we can see the largest expected value of overshoot will occur when there is no forgetting, that is, when g = 1. Then

E(λA/λB) = πA/πB.

We can then calculate the estimate of the maximum expected value of probability from Equation 14. The maximum expected overshoot in this case is about 10% when πA is around .75.
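The overshoot figure can be checked directly (an illustrative Python sketch; the closed form for pA is our reading of Equation 14). With g = 1 the asymptotic ratio is πA/πB, so at πA = .75 the ratio is 3 and the response probability exceeds probability matching by roughly 10%:

```python
# Numerical check of the maximum-overshoot claim: no forgetting (g = 1)
# gives an asymptotic eigenvalue ratio of pi_A / pi_B, which Equation 14
# (as we read it) converts to a response probability.

def p_response_a(x: float) -> float:
    # p_A = (X0^3 + 3 X0^2) / (X0 + 1)^3
    return (x**3 + 3 * x**2) / (x + 1) ** 3

pi_a = 0.75
x0 = pi_a / (1 - pi_a)             # ratio = 3 when pi_A = .75
overshoot = p_response_a(x0) - pi_a
print(round(overshoot, 3))         # 0.094, i.e., roughly a 10% overshoot
```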

The smallest average asymptotic value of λA/λB will occur when the second term is zero, that is, when there is no learning. Then

λA/λB = 1,

and the probability of both Response A and Response B will be .5. Thus, we can predict values of asymptotic probability both above and below probability matching.

It does not seem to be possible to derive exact expressions for the important characteristics of the random variable λA/λB, such as the expected value and the variance. However, characteristics of the response probability (Equation 14) can, somewhat surprisingly, be calculated, a result shown in Appendix B.

Application to Data

Let us see if we can fit some experimental data with the model as it stands. Our purpose in this section, it must be emphasized, is not primarily to provide a different or quantitatively more satisfactory model for probability learning, but to establish that our neurally based model is capable of handling a wide range of interesting phenomena. We are


not concerned with explaining every aspect of the available data.

A good set of data to use is the series of experiments reported by Friedman et al. (1964). They used an exceptionally large group of subjects—80 Indiana University undergraduates—in a 3-day series of probability learning experiments. In the first 2 days, subjects received sequences of 48-trial blocks. The probabilities of an event changed from block to block (the "variable-π series"). Odd-numbered blocks had both πA and πB equal to .5. Probabilities in the even-numbered blocks varied from .1 to .9 in steps of .1, excluding .5. Each subject received all probability conditions during the 2 days. Two different sequences of block probabilities were used, but the first, last, and alternate intermediate blocks were assigned .5 probabilities. On the 3rd day, eight 48-trial blocks were given. The first and last blocks had πA equal to .5; the middle six blocks had πA equal to .8.

Friedman et al. (1964) were able to fit the data reasonably well by assuming the basic equations of statistical learning theory, Equations 10 and 11, with a changing θ. The parameter, θ, varied over a considerable range. To fit the transition between a block with πA equal to .1 and the succeeding block with πA equal to .5, the best θ was found to be .62. To fit the transition between πA equal to .4 and πA equal to .5, θ was found to be .07. Other transitions fell in between, with θ increasing as the difference in probability between blocks increased.

We felt it would be of interest to see if our model would fit the data with a single set of parameters. Our model requires, as does statistical learning theory, detailed knowledge of the stimulus sequences and ordering of blocks to make the best predictions. These data were not available in most cases in the Friedman et al. (1964) paper, although a wealth of carefully gathered and computed averaged data was presented.

Strong learning effects were demonstrated to be present between the beginning and the end of the series, and it was stated that the ordering of blocks had a "highly significant effect" which "severely limits the analyses that can usefully be accomplished with the variable-π sequence" (Friedman et al., 1964, p. 260).

Friedman et al. (1964) were primarily concerned, in the analysis of the variable-π series, with the transitions between a variable probability block and the following .5 probability block, since this allowed them to estimate θ and to get some idea of the general behavior of the statistical learning theory model with respect to the data.

Our model was sufficiently complex and probabilistic to make it difficult to generate simple expressions for some of the averages. Since we had access to a large computer, the most direct way of initially checking the fit of our model to Friedman et al.'s (1964) data was to do a simulation, using 80 computer-generated pseudosubjects, and then to compare the results with the real data.

We at first assumed that our pseudosubjects would receive the sequence of stimuli in Friedman et al.'s (1964) Summary Table 7, which has the following sequence of block probabilities: .5, .1, .5, .2, .5, .3, .5, .4, .5, .6, .5, .7, .5, .8, .5, .9, .5. Every pseudosubject received an individual set of events generated with probabilities given by the probability sequences. Eigenvalues associated with responses of the pseudosubjects were calculated according to the learning scheme, and the probabilities of response of each subject were calculated and averaged across subjects. This meant the pseudosubjects did not actually make pseudoresponses that would then be processed as the real subjects' responses were, so the simulated data lack a significant source of variance. The statistics can also be calculated directly by use of the formulas in Appendix B. The resulting average probabilities were then compared with the real data in Friedman et al.'s (1964) Table 7. Figure 11 shows the results.
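The pseudosubject procedure just described can be sketched as follows (an illustrative Python reconstruction, not the authors' original program; all names are ours). Each pseudosubject receives its own random event sequence, eigenvalues follow Equations 15a and 15b, and our reading of Equation 14 converts the ratio to a response probability before averaging across subjects:

```python
# Sketch of the 80-pseudosubject simulation: per-subject event sequences,
# eigenvalue updates per Equations 15a/15b, probabilities per Equation 14.
import random

ETA, G = 0.3, 0.90
BLOCKS = [.5, .1, .5, .2, .5, .3, .5, .4, .5, .6, .5, .7, .5, .8, .5, .9, .5]

def p_a(lam_a, lam_b):
    x = lam_a / lam_b
    return (x**3 + 3 * x**2) / (x + 1) ** 3

def run_subject(rng):
    lam_a = lam_b = 1.0
    probs = []
    for pi_a in BLOCKS:
        for _ in range(48):                  # 48 trials per block
            event_a = rng.random() < pi_a
            lam_a = 1 + G * (lam_a - 1) + (ETA if event_a else 0.0)
            lam_b = 1 + G * (lam_b - 1) + (0.0 if event_a else ETA)
            probs.append(p_a(lam_a, lam_b))
    return probs

rng = random.Random(0)
all_probs = [run_subject(rng) for _ in range(80)]
mean = [sum(col) / len(col) for col in zip(*all_probs)]
# Average response probability late in the pi_A = .9 block sits well above .5
print(round(mean[15 * 48 + 40], 2))
```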

A crude search was made in order to see which set of parameters produced a simulation most resembling the actual data. The best fitting parameters were found to be η = .3 and g = .90. Parameters were not especially critical. The value of the decay factor, .90, was such as to make the contribution to the current eigenvalue from trials over 48 trials (i.e., a block) in the past negligible, so this simulation could be viewed as a sequence of fits of transitions to the .5 blocks. Friedman et al. (1964) plotted these transitions (and fitted curves to them) individually in their Figure 4.

Figure 11. Computer simulation and real data of the first 2 days of the experiment performed by Friedman et al. (1964). The probability of Stimulus A during successive blocks was .5, .1, .5, .2, .5, .3, .5, .4, .5, .6, .5, .7, .5, .8, .5, .9, .5; the abscissa gives block number (12 trials per block) and the ordinate response probability. The solid line gives the results of a computer simulation with 80 pseudosubjects receiving random sequences according to the probability schedule. The learning parameter, η, was .3, and the decay parameter, g, was .90. The pseudosubjects' calculated average response probabilities are plotted. The dashed line is the actual experimental data, with 80 subjects making real responses. Subjects did not receive blocks in the sequence given.

Figure 12. Two computer simulations and real data of the 3rd day of the experiment in Friedman et al. (1964); the abscissa gives block number (12 trials per block) and the ordinate response probability. The learning parameter, η, of the simulation was .3. The decay parameter, g, takes two values, .90 and .95. The experimental data are given by the solid line.

Data from the 3rd Day

The fit of the simulated to the actual data was encouraging, although the actual data from the first 2 days of the experiment concealed a demonstrated multitude of extraneous effects, such as event sequence and subject learning.

The probability sequence on the 3rd day was the same for all the subjects, now well practiced, and provided a good test of the model. Figure 12 shows the fits for data from the 3rd day. The same set of parameters provided a reasonable fit, although the fit was made better by increasing the decay parameter, g, from .90 to .95. Note that both the data and the simulation show a small, but definite, overshoot above .8. The response probability over the last two 48-trial blocks


(12-trial blocks 25-32) was .82, as opposed to probability matching. The same average for the simulation with g equal to .95 was .81.

In a previous paper (Anderson, 1973), a model similar in many respects to this one was proposed for the learning of short lists and for a choice reaction time experiment. A quantity, γ, was defined, equal to η/(1 − g) in the present section. This quantity was found, in fitting the data, to vary from about 2 to about 7. In the probability learning experiments, with η equal to .3 and g equal to .95, γ = 6; and with η equal to .3 and g equal to .90, γ = 3, which falls in the same range. Since list learning and probability learning seem at first glance to be very different tasks, this coincidence is interesting.

Recency Curves

Friedman et al.'s (1964) article, and other articles as well, pay special attention to the behavior of response probability when a continuous sequence of a particular event ("a run") occurs. Both statistical learning theory and our theory predict a so-called "positive recency" effect, where the response probability for an event increases after a run of that event. When subjects are well practiced, they show this effect very strongly, although they do not always do so during the first few blocks of trials.

Friedman et al. (1964) provide extensive recency data for their subjects during the 3rd day, when πA = .8 and πB = .2, and at the end of the 2nd day, when πA and πB = .5.

When they fit the data with the simple statistical learning theory model, they find fair fits, but θ again is not constant. In the data from the 3rd day, θ was estimated to be .058 during the transitions to and from the πA = .8 condition, while the fit during a run suggested θ = .17.

If we assume the eigenvalues start off, in our model, at asymptote, calculated from the formula and parameters given previously, we can calculate (exactly) the eigenvalues after a particular sequence by use of Equations 15a and 15b and can then calculate the associated probabilities using Equation 14.

When this was done with the data given in Table 11 of Friedman et al. (1964) for runs in the πA = .8, 3rd-day condition, and with the data read from Figure 11 for the last block (.5 probability) during Day 2, the fit was not very good for runs of 2, 3, or 4 events, although the match was quite good after a run of 15 or 20 A events.

Our model predicts that response probability can never equal 1.00, even after an infinite run of a particular event. As Reber and Millward (1971) point out, published recency data usually indicate asymptotes well below 1.00. In Reber and Millward's (1971) experiment, which involved a continuously variable event probability that was "tracked" by the subjects, the probability of Response A after a run of up to 10 A events was extraordinarily low and seemed stable at about .78.

In the data from runs of Event A in the πA = .8 condition of the Friedman et al. (1964) study, values of response probability also seemed well away from 1.0. Since there were, as might be expected, relatively few very long runs, even in a .8 condition, the individual data points are of dubious significance beyond runs of about 10. However, the average probability of Response A for runs of length 17 through 20 in the 48-trial blocks 3, 4, 5, and 6 is .91; that is, there were 211 A responses out of 231 total responses. We can predict what the model with parameters determined previously would give us after 20 trials of the same event simply by following the formula. We find that, with η equal to .3 and g equal to .95, the probability of Response A after a run of 20 is .91, indicating good fit for long runs. However, the failure of the fit for short runs is intriguing and seems also to be implied by the Friedman et al. finding of wide differences in θ.
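The run-of-20 calculation can be reproduced step by step (an illustrative Python sketch; the asymptotic starting values come from the formula above, and the probability formula is our reading of Equation 14). Starting both eigenvalues at their πA = .8 asymptote and applying Equations 15a and 15b twenty times gives a probability close to the .91 reported in the text:

```python
# Run-of-20 check: start at the asymptotic eigenvalues for pi_A = .8
# (eta = .3, g = .95), apply 20 A-event updates, convert via Equation 14.
ETA, G, PI_A = 0.3, 0.95, 0.8

lam_a = 1 + ETA * PI_A / (1 - G)          # asymptotic starting values
lam_b = 1 + ETA * (1 - PI_A) / (1 - G)
for _ in range(20):                       # a run of 20 A events
    lam_a = 1 + G * (lam_a - 1) + ETA     # A decays and increments
    lam_b = 1 + G * (lam_b - 1)           # B only decays

x = lam_a / lam_b
p = (x**3 + 3 * x**2) / (x + 1) ** 3
print(round(p, 3))                        # close to the .91 in the text
```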

Short-term Memory and Probability Learning

Perhaps the simplest explanation of this short-run divergence from a theory satisfactory for the long run is simply that there is a highly weighted contribution from the past few events. Possibly, we are seeing an effect that is due to the limited-capacity short-term store, which is nearly universally accepted to exist in human memory.

Millward and Reber (1968) asked subjects about past trials in a probability learning experiment and found, in most conditions, that "memory for past events goes back about five events and/or four event runs" (p. 988).

Figure 13. Experimental recency data and theoretical calculations for the experiment in Friedman et al. (1964). Response probabilities are calculated for "runs" of events (ordinate, probability of response; abscissa, number of consecutive preceding events; separate curves for Response A, πA = .8, and Response B, πB = .2; points give experimental data, lines theoretical calculations). A run of zero length means that the preceding event was the other event. These data were simulated by postulating a rapidly decaying short-term trace, in addition to the more slowly decaying trace assumed previously (see text for details).

Since limited capacity, in our model, can be modeled by assuming it is synonymous with "rapid decay," suppose we postulate that there is a rapidly decaying short-term representation of the preceding three or four events.

There seemed no point in making extensive parameter fits, since we would now have two more parameters to add to our basic model.

As is well known, with a large computer, enough parameters, and enough persistence, any given set of data can be fitted, so we simply picked some reasonable parameters. We let the short-term decay factor, gs, be .65, which corresponds to a memory span of three or four events. Since we know that a g of .95 fitted the data quite well when averaged over 12-trial blocks (which would wash out contributions from a process with a decay factor of .65), we assumed the long-term decay factor, gl, to be .95.

We assume the two processes are separate, decay separately, and add together to give the magnitude of the eigenvalue, each following an equation like Equation 15a or 15b. Thus, at asymptote, for the eigenvalue associated with Response A, we have

λA = 1 + ηsπA/(1 − gs) + ηlπA/(1 − gl).

As a starting point, let us assume that at asymptote there are equal contributions from the short- and long-term processes; that is, the second term equals the third term. We then find that the learning parameter of the short-term process, ηs, equals 1.067 and that of the long-term process, ηl, equals .1525. This is all the information we need to calculate the average probabilities of the subjects during runs of a particular event. We show the theoretical and experimental curves in Figure 13. The fit is quite good. All these points are fitted with the above parameters, and only the starting points vary. There is a value plotted for a run of zero length, which gives the probability of one response when a run of the other event is about to start. (A run of one event has to start with the other event.) We assume that the eigenvalues at the "−1" trial started at asymptote.
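The two-trace recency calculation can be sketched as follows (an illustrative Python reconstruction; the trace bookkeeping and names are ours, and the probability formula is our reading of Equation 14). Each eigenvalue is the constant part plus separate short- and long-term contributions, each decaying and incrementing independently during a run of A events:

```python
# Two-trace recency sketch: short- and long-term traces decay separately
# and sum into each eigenvalue. Parameter values follow the text.
GS, GL = 0.65, 0.95            # short- and long-term decay factors
ETAS, ETAL = 1.067, 0.1525     # short- and long-term learning increments
PI_A = 0.8

def p_a(lam_a: float, lam_b: float) -> float:
    x = lam_a / lam_b
    return (x**3 + 3 * x**2) / (x + 1) ** 3

# asymptotic per-process contributions (the second and third terms)
sa, la = ETAS * PI_A / (1 - GS), ETAL * PI_A / (1 - GL)
sb, lb = ETAS * (1 - PI_A) / (1 - GS), ETAL * (1 - PI_A) / (1 - GL)

curve = []
for _ in range(8):                           # a run of A events
    curve.append(p_a(1 + sa + la, 1 + sb + lb))
    sa, la = GS * sa + ETAS, GL * la + ETAL  # A's traces decay and increment
    sb, lb = GS * sb, GL * lb                # B's traces only decay
print([round(p, 3) for p in curve])
```

The probabilities rise monotonically with run length, as in Figure 13, starting near the πA = .8 asymptote.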

Comments About Variance

Analysis of the variance in these models is rather complicated, since several mechanisms can contribute to the observed experimental variance. First, the raw data are generated by a probabilistic process. Even if we knew the subject's response probabilities exactly, the resulting data from a relatively small number of responses would contribute a significant amount of variance. Second, in our model


Table 3
Comparison Between Published Three-Choice Experimental Data and a Computer Simulation

                                       Event probabilities     Simulation          Experiment
Source of data                         πA     πB     πC        A     B     C       A          B     C
Gardner,a Cotton & Rechtshaffenb       .60    .30    .10       .75   .21   .04     .68, .66   .24   .08
Gardner, Cotton & Rechtshaffen         .60    .20    .20       .79   .10   .11     .68, .66   .16   .16
Gardner, Cotton & Rechtshaffen         .70    .20    .10       .84   .11   .05     .80, .80   .13   .07
Gardner, Cotton & Rechtshaffen         .70    .15    .15       .86   .06   .08     .80, .81   .10   .10
Colec                                  .44    .33    .22       .52   .32   .16     .51        .31   .17
Cole                                   .67    .22    .11       .82   .13   .05     .84        .11   .05

a Data taken from Gardner (1957). Average of response probabilities for Trials 286-450.
b Data taken from Cotton and Rechtshaffen (1958). Average of response probabilities for Trials 286-450.
c Data taken from Cole (1965). Average of response probabilities on Trials 501-1,000. A correction procedure was used in this experiment.

there is a contribution due to the intrinsic variance in the probability of a subject's response. This rather involved calculation is given in Appendix B. Third, in our model, intersubject differences in parameters have a profound effect on asymptotic probability, causing a large increase in the variance of the pooled data. This effect is largest when deviations from equal probability are large.

Our guess is that intersubject variation in parameters will turn out to be a major cause of experimental variance, and since we have no idea what kind of distribution of parameters might exist for subjects, we felt it best to do no detailed calculations of variance at this time.

Extension to More Events

The derivation, sketched earlier, of the simplest form of statistical learning theory says that probability matching is independent of the number of stimuli. However, experimentally, overshooting of the most probable stimulus increases greatly as the number of alternatives increases (Estes, 1972). Some data are available for three-choice experiments (Cole, 1965; Cotton & Rechtshaffen, 1958; Gardner, 1957). All these experiments show pronounced overshooting above matching, generally by many percentage points.

Since our model becomes difficult to work with analytically in higher dimensional boxes,

we resorted to a Monte Carlo simulation to estimate the response probabilities in a three-choice system. We considered three orthogonal vectors pointing toward corners in a four-dimensional space. (There are orthogonal vectors pointing toward corners only when the dimensionality of the space is divisible by a power of 2.) The orthogonal vectors were e1 = (1, −1, −1, 1), e2 = (1, 1, −1, −1), and e3 = (1, −1, 1, −1). The eigenvalues could be calculated from our asymptotic formula. With eigenvectors and eigenvalues known, the feedback matrix was constructed. Then a random point in the box was chosen (with a uniform distribution) as a starting point. The feedback then forced the system into a corner (see Table 3).

Occasionally, stable corners not associated with an eigenvector appeared when the eigenvalues were nearly equal. These corners were ignored in the calculation of probabilities. It is not clear what these extraneous corners might correspond to psychologically—possibly paralyzed uncertainty.

In each Monte Carlo simulation, 1,000 initial random vectors were used, and the number of times a particular corner appeared was counted. This gave an estimate of the probabilities of a particular response. We used the same parameter values we used previously, η = .3 and g = .95. These two values produced only very slight overshooting in the two-choice


system but very pronounced overshooting in the three-choice system.

Table 3 shows the results of the simulations and compares them with the limited experimental data. We find agreement with the actual data is surprisingly good in several cases. In a couple of the cases where the fit is not so good (the 60-20-20 and 60-30-10 conditions of Gardner, 1957, and of Cotton & Rechtshaffen, 1958), the published figures clearly indicate that asymptote had not been reached and the probabilities of the most probable response were still increasing. Cole used the last 500 trials of a 1,000-trial experiment, while the other two studies used average probabilities of Trials 286-450 in a shorter experiment. Cole's experiment incorporated a correction procedure, in which subjects continued to predict until they were correct. This should not greatly affect our prediction for the first-response probabilities in this experiment.
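The Monte Carlo procedure can be sketched as follows (an illustrative Python reconstruction of the method described in the text, not the authors' original program; the saturating-feedback iteration and all names are ours, and the example uses the 60-30-10 condition of Table 3). The feedback matrix is built from the three eigenvectors and their asymptotic eigenvalues; each trial starts at a uniform random point in the box and iterates saturating feedback until a corner is reached:

```python
# Sketch of the three-choice Monte Carlo: build the feedback matrix from
# e1, e2, e3 and the asymptotic eigenvalues, then let saturating feedback
# drive random starting states into corners of the 4-dimensional box.
import random

E = [(1, -1, -1, 1), (1, 1, -1, -1), (1, -1, 1, -1)]   # e1, e2, e3
ETA, G = 0.3, 0.95
PI = (0.60, 0.30, 0.10)                                 # one Table 3 condition
LAM = [1 + ETA * p / (1 - G) for p in PI]               # asymptotic eigenvalues

# feedback matrix A = sum_k lambda_k * e_k e_k^T / |e_k|^2
A = [[sum(LAM[k] * E[k][i] * E[k][j] / 4 for k in range(3))
      for j in range(4)] for i in range(4)]

def settle(x, steps=60):
    """Iterate x <- clip(x + A x); feedback plus clipping drives x to a corner."""
    for _ in range(steps):
        x = [max(-1.0, min(1.0, x[i] + sum(A[i][j] * x[j] for j in range(4))))
             for i in range(4)]
    return tuple(x)

rng = random.Random(1)
counts = [0, 0, 0]
extraneous = 0
for _ in range(1000):
    corner = settle([rng.uniform(-1, 1) for _ in range(4)])
    for k, e in enumerate(E):
        # count a corner and its mirror image together
        if corner == e or corner == tuple(-c for c in e):
            counts[k] += 1
            break
    else:
        extraneous += 1        # stable corners not tied to an eigenvector
total = sum(counts)
print([round(c / total, 2) for c in counts])
```

With the largest eigenvalue attached to the most probable event, the corner for Response A dominates, reproducing the pronounced overshooting above matching described in the text.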

One might wonder why it would not be possible in the three-choice case to consider Responses B and C, say, to be a single "response" and then to apply the two-choice analysis to them. We showed in our simulations that the two-choice model does not give rise to as much overshooting (for the same parameters) as the multiple-choice model, so this must be an incorrect approach. One reason for this is that we are not using a linear system but one with a high degree of nonlinearity, both in geometry and dynamics. Inputs and outputs cannot be freely combined in a nonlinear system because the superposition principle does not hold. Many "obvious" approaches are incorrect when applied to nonlinear systems, a point to be careful of when analyzing the operation of a system with such spectacular nonlinearities as the brain.

VII. Conclusions

Our aim in this article has been to present a relatively detailed and precise model, which was suggested by the anatomy and physiology of the brain, and to consider some interesting psychological applications. Although it was necessary to oversimplify reality to get a model that we could work with easily, the resulting model was sufficiently rich to have

a pronounced structure, and, with a very modest amount of manipulation, gave rise to some testable predictions. We feel that the best way to work with such a very wide-ranging model is to try to fit with reasonable success many different phenomena, rather than to try to explain every detail of a limited body of experimental data.

We have argued that our model provided a theoretical framework for entities that acted and behaved very much like the distinctive features of psychology. A set of neurons with positive feedback tends to analyze its inputs by most heavily weighing the eigenvectors of the feedback matrix with large positive eigenvalues and by suppressing the rest. We also pointed out that these particular eigenvectors are often the most meaningful in terms of the discrimination to be performed, since they contain most of the information allowing discriminations to be made among the stimulus set.

When we introduced a saturating nonlinearity—the "brain-state-in-a-box" model—we were able to suggest directly a model for probability learning, which fitted some actual experimental data in reasonable detail. As an application of the model in a different area, we showed that the same model, with slight modifications, acted as a categorical perceiver with properties similar to those seen in recent data.

The model has some obvious shortcomings. Among these are some grievous oversimplifications of the physiology, the assumption of a high degree of linearity in some parts of the system, and, later, the assumption of "hard" saturation, with no transition from linearity to saturation. Other necessary details that are needed for an actual operating system are simply ignored. For example, a mechanism must be provided to get the brain state out of a corner, once it has gotten in. There are a multitude of ways this could be accomplished, all of which could be used together. There could be selective adaptation or habituation of rapidly firing cells and of their synapses; there could be large amounts of noise which could force the system from corner to corner; and, quite probably, there could be a special neural circuit, where rapidly firing cells generate recurrent inhibition to turn themselves off. We preferred to ignore this problem here, since we did not need to consider it for the problems we discussed. However, eventually we must make concrete assumptions about how brain states decay as well as how they grow.

Some of the quantitative predictions for probability learning may be fairly sensitive to the assumptions about the geometry of the system. It is quite unlikely that our hypothetical enclosing "box" is a hypercube, with all sides equal. We are not sure how a more general box would affect the behavior of the system. Many of the qualitative properties would remain the same, but the system is sufficiently complex that we cannot say for certain.

There are many immediate extensions of the model. As one example, the probability learning model serves also as a model for choice reaction time. If we assume that instead of starting randomly, we start at a point in the box that is determined to some extent by the stimulus—by initially passing the stimulus through a memory filter of the kind we have been discussing here, for example—we can make a model for choice reaction time. In fact, the model as stated has similarities, which can be made precise and explicit, with random walk models for reaction time, which are very successful in explaining many of the quantitative aspects of reaction time experiments (Link, 1975). If we consider the corner as an absorbing state and the noisy memory filter output evolving in time (with additive noise) as the directed random walk, then the resemblance becomes striking.

As another point, the model strongly suggests that the brain would rather be wrong than undecided; that is, a misperception (the wrong corner) is better than no perception at all. Some misperceptions are clearly more likely than others, reflecting the past learning of the system. Also, the model suggests that brain states in perception move from stable state to stable state, with abrupt transitions between stable states.

All these suggestions can be made precise relatively easily and can be compared with the rich body of data and concepts in experimental psychology. We have tried to show, with theoretical discussion and with several

detailed examples, that such an effort may be worthwhile.

Reference Note

1. Anderson, J. A. What is a distinctive feature? (Tech. Rep. 74-1). Providence, R.I.: Brown University, Center for Neural Studies, 1974.

References

Anderson, J. A. A memory storage model utilizing spatial correlation functions. Kybernetik, 1968, 5, 113-119.

Anderson, J. A. Two models for memory organization using interacting traces. Mathematical Biosciences, 1970, 8, 137-160.

Anderson, J. A. A simple neural network generating an interactive memory. Mathematical Biosciences, 1972, 14, 197-220.

Anderson, J. A. A theory for the recognition of items from short memorized lists. Psychological Review, 1973, 80, 417-438.

Anderson, J. A. Neural models with cognitive implications. In D. LaBerge & S. J. Samuels (Eds.), Basic processes in reading: Perception and comprehension. Hillsdale, N.J.: Erlbaum, 1977.

Anderson, J. A., Silverstein, J. W., & Ritz, S. A. Vowel pre-processing with a neurally based model. In Conference Record: 1977 I.E.E.E. International Conference on Acoustics, Speech, and Signal Processing. Hartford, Conn.: in press.

Barlow, H. B. Single units and sensation: A neuron doctrine for perceptual psychology. Perception, 1972, 1, 371-394.

Cole, M. Search behavior: A correction procedure for three-choice probability learning. Journal of Mathematical Psychology, 1965, 2, 145-170.

Cooper, L. N. A possible organization of animal memory and learning. In B. Lundquist & S. Lundquist (Eds.), Proceedings of the Nobel Symposium on Collective Properties of Physical Systems. New York: Academic Press, 1974.

Cooper, W. E. Selective adaptation to speech. In F. Restle, R. M. Shiffrin, N. J. Castellan, H. R. Lindman, & D. B. Pisoni (Eds.), Cognitive theory (Vol. 1). Hillsdale, N.J.: Erlbaum, 1975.

Cotton, J. W., & Rechtshaffen, A. Replication report: Two- and three-choice verbal conditioning phenomena. Journal of Experimental Psychology, 1958, 56, 96-97.

Creutzfeldt, O., Innocenti, G. M., & Brooks, D. Vertical organization in the visual cortex (area 17) in the cat. Experimental Brain Research, 1974, 21, 315-336.

Eccles, J. C. Possible synaptic mechanisms subservinglearning. In A. G. Karczmar & J. C. Eccles (Eds.),Brain and human behavior. New York: Springer, 1972.

Eimas, P. D., & Corbit, J. D. Selective adaptation oflinguistic feature detectors. Cognitive Psychology,1973, 4, 99-109.

Eimas, P. D., Siqueland, E. R., Jusczyk, P., & Vigorito, J. Speech perception in infants. Science, 1971, 171, 303-306.

Estes, W. K. Theory of learning with constant, variable, or contingent probabilities of reinforcement. Psychometrika, 1957, 22, 113-132.

Estes, W. K. Probability learning. In A. W. Melton (Ed.), Categories of human learning. New York: Academic Press, 1964.

Estes, W. K. Research and theory on the learning of probabilities. Journal of the American Statistical Association, 1972, 67, 81-102.

Estes, W. K., & Straughan, J. H. Analysis of a verbal conditioning situation in terms of statistical learning theory. Journal of Experimental Psychology, 1954, 47, 225-234.

Evans, E. F. Neural processes for the detection of acoustic patterns and for sound localization. In F. O. Schmitt & F. G. Worden (Eds.), The neurosciences: Third study program. Cambridge, Mass.: MIT Press, 1974.

Freeman, W. J. Mass action in the nervous system. New York: Academic Press, 1975.

Friedman, M. P., et al. Two-choice behavior under extended training with shifting probabilities of reinforcement. In R. C. Atkinson (Ed.), Studies in mathematical psychology. Stanford, Calif.: Stanford University Press, 1964.

Frishkopf, L. S., Capranica, R. R., & Goldstein, M. H., Jr. Neural coding in the bullfrog's auditory system: A teleological approach. Proceedings of the I.E.E.E., 1968, 56, 969-980.

Funkenstein, H. H., & Winter, P. Responses to acoustic stimuli of units in the auditory cortex of awake squirrel monkeys. Experimental Brain Research, 1973, 18, 464-488.

Gardner, R. A. Probability learning with two and three choices. American Journal of Psychology, 1957, 70, 174-185.

Gibson, E. J. Principles of perceptual learning and development. New York: Meredith, 1969.

Globus, A., & Scheibel, A. B. Pattern and field in cortical structure: The rabbit. Journal of Comparative Neurology, 1967, 131, 155-172.

Goldstein, M. H., Jr., Hall, J. L., II, & Butterfield, B. O. Single-unit activity in the primary auditory cortex of unanesthetized cats. Journal of the Acoustical Society of America, 1968, 43, 444-455.

Grossberg, S. Pavlovian pattern learning by nonlinear neural networks. Proceedings of the National Academy of Sciences, 1971, 68, 828-831.

Hebb, D. O. The organization of behavior. New York: Wiley, 1949.

Hubel, D. H., & Wiesel, T. N. Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex. Journal of Physiology, 1962, 160, 106-154.

Hubel, D. H., & Wiesel, T. N. Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology, 1968, 195, 215-243.

Jakobson, R., Fant, C. G. M., & Halle, M. Preliminaries to speech analysis: The distinctive features and their correlates. Cambridge, Mass.: MIT Press, 1961.

Knight, B. W., Toyoda, J. I., & Dodge, F. A., Jr. A quantitative description of the dynamics of excitation and inhibition in the eye of Limulus. Journal of General Physiology, 1970, 56, 421-437.

Kohonen, T. Correlation matrix memories. I.E.E.E. Transactions on Computers, 1972, C-21, 353-359.

Kohonen, T. Associative memory: A system theoretic approach. Berlin: Springer-Verlag, 1977.

Laughery, K. R. Computer simulation of short-term memory: A component decay model. In G. H. Bower & J. T. Spence (Eds.), The psychology of learning and motivation (Vol. 3). New York: Academic Press, 1969.

Lindgren, N. Machine recognition of human language: Part 2. Theoretical models of speech perception and language. I.E.E.E. Spectrum, 1965, 2, 45-59.

Lindsay, P. H., & Norman, D. A. Human information processing: An introduction to psychology. New York: Academic Press, 1972.

Link, S. W. The relative judgment theory of two choice reaction time. Journal of Mathematical Psychology, 1975, 12, 114-135.

Little, W. A., & Shaw, G. L. A statistical theory of short and long term memory. Behavioral Biology, 1975, 14, 115-135.

McIlwain, J. T. Large receptive fields and spatial transformations in the visual system. In R. Porter (Ed.), International review of physiology: Neurophysiology II (Vol. 10). Baltimore, Md.: University Park Press, 1976.

Millward, R. B., & Reber, A. S. Event-recall in probability learning. Journal of Verbal Learning and Verbal Behavior, 1968, 7, 980-989.

Morrell, F., Hoeppner, T. J., & de Toledo, L. Mass action re-examined: Selective modification of single elements within a small population. Neuroscience Abstracts, 1976, 2, 448.

Mountcastle, V. B. The problem of sensing and the neural coding of sensory events. In G. C. Quarton, T. Melnechuk, & F. O. Schmitt (Eds.), The neurosciences. New York: Rockefeller University Press, 1967.

Myers, J. L. Probability learning and sequence learning. In W. K. Estes (Ed.), Handbook of learning and cognitive processes (Vol. 3). Hillsdale, N.J.: Erlbaum, 1976.

Nass, M. M., & Cooper, L. N. A theory for the development of feature detecting cells in visual cortex. Biological Cybernetics, 1975, 19, 1-18.

Neimark, E. D., & Estes, W. K. Stimulus sampling theory. San Francisco: Holden-Day, 1967.

Neisser, U. Cognitive psychology. New York: Appleton-Century-Crofts, 1967.

Noda, H., & Adey, W. R. Firing of neuron pairs in cat association cortex during sleep and wakefulness. Journal of Neurophysiology, 1970, 33, 672-684.

Noda, H., Manohar, S., & Adey, W. R. Correlated firing of hippocampal neuron pairs in sleep and wakefulness. Experimental Neurology, 1969, 24, 232-247.

Pisoni, D. B. On the nature of categorical perception of speech sounds. Unpublished doctoral dissertation, University of Michigan, 1971.

Pisoni, D. B., & Tash, J. Reaction time to comparisons within and across phonetic categories. Perception & Psychophysics, 1974, 15, 285-290.


Ratliff, F., Knight, B. W., Dodge, F. A., Jr., & Hartline, H. K. Fourier analysis of dynamics of excitation and inhibition in the eye of Limulus: Amplitude, phase, and distance. Vision Research, 1974, 14, 1155-1168.

Reber, A. S., & Millward, R. B. Event observation in probability learning. Journal of Experimental Psychology, 1968, 77, 317-327.

Reber, A. S., & Millward, R. B. Event tracking in probability learning. American Journal of Psychology, 1971, 84, 85-99.

Schiller, P. H., Finlay, B. L., & Volman, S. F. Quantitative studies of single-cell properties in monkey striate cortex: V. Multivariate statistical analyses and models. Journal of Neurophysiology, 1976, 39, 1362-1374.

Shepherd, G. M. The synaptic organization of the brain. New York: Oxford University Press, 1974.

Studdert-Kennedy, M. The nature and function of phonetic categories. In F. Restle, R. M. Shiffrin, N. J. Castellan, H. R. Lindman, & D. B. Pisoni (Eds.), Cognitive theory (Vol. 1). Hillsdale, N.J.: Erlbaum, 1975.

Suzuki, S. Zen mind, beginner's mind. New York: Weatherhill, 1970.

van der Loos, H., & Glaser, E. M. Autapses in neocortex cerebri: Synapses between a pyramidal cell's axon and its own dendrites. Brain Research, 1972, 48, 355-360.

Willshaw, D. J., Buneman, O. P., & Longuet-Higgins, H. C. Non-holographic associative memory. Nature, 1969, 222, 960-962.

Winter, P., & Funkenstein, H. H. The effect of species specific vocalization on the discharge of auditory cortical cells in the awake squirrel monkey. Experimental Brain Research, 1973, 18, 489-504.

Wollberg, Z., & Newman, J. D. Auditory cortex of squirrel monkey: Response patterns of single cells to species-specific vocalizations. Science, 1972, 175, 212-214.

Young, T. Y., & Calvert, T. W. Classification, estimation, and pattern recognition. New York: American Elsevier, 1974.

Appendix A

The calculation of the size of the regions corresponding to each corner is reasonably straightforward. In our system we have two orthogonal eigenvectors which point in the (1, 1) and (−1, 1) directions. Rotate the system by −π/4 radians to obtain the normalized eigenvectors

e_x = (1, 0)ᵀ,  e_y = (0, 1)ᵀ.

The eigenvalues are unchanged by a rotation; that is, Ae_x = λ_x e_x and Ae_y = λ_y e_y.

A point with coordinates v = (x_0, y_0)ᵀ in this rotated frame may then be expressed as

v = x_0 e_x + y_0 e_y. (A1)

If we consider that a feedback cycle occurs once every unit time, t_i, i = 0, 1, 2, …, then

v(t_0) = x_0 e_x + y_0 e_y
v(t_1) = x_0 (1 + λ_x) e_x + y_0 (1 + λ_y) e_y
v(t_2) = x_0 (1 + λ_x)^2 e_x + y_0 (1 + λ_y)^2 e_y.

By induction we get

v(t_n) = x_0 (1 + λ_x)^n e_x + y_0 (1 + λ_y)^n e_y. (A2)

Since e_x = (1, 0)ᵀ and e_y = (0, 1)ᵀ, the components are simply x(t_n) = x_0 (1 + λ_x)^n and y(t_n) = y_0 (1 + λ_y)^n.

In order to get a reasonably simple, general solution for the sizes of the regions, we must make the approximation, which should be quite good if the step size is not too large, that time is continuous.

For continuous time we have

dx/dt = λ_x x,  dy/dt = λ_y y.

These two equations may be solved to determine the "motion" of a point (x, y) as it is fed back through the system. The solutions are


λ_x^{−1} ln x = t + k_1,  λ_y^{−1} ln y = t + k_2.

Eliminating t, we find

λ_x ln y − λ_y ln x = constant, (A3)

so that

y^{λ_x} = k x^{λ_y},  or  y = k′ x^{λ_y/λ_x}, (A4)

where k′ is a constant. This family of curves provides the trajectories that will be followed by an initial point under the influence of feedback. A point with initial coordinates (x_0, y_0) will follow the curve expressed by Equation A4 with

k′ = y_0 x_0^{−λ_y/λ_x}. (A5)
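The quality of the continuous-time approximation is easy to probe numerically. The following sketch (ours, not from the paper; all parameter values are arbitrary) iterates the discrete feedback rule x ← (1 + λ_x)x, y ← (1 + λ_y)y and checks that the point stays close to the trajectory of Equation A4 when the eigenvalues (the per-cycle step sizes) are small:

```python
# Numerical check (ours, not from the paper) that the discrete feedback
# iteration stays close to the continuous-time trajectory of Equation A4
# when the eigenvalues (per-cycle step sizes) are small.
lam_x, lam_y = 0.01, 0.005           # illustrative eigenvalues
x, y = 0.3, 0.2                      # illustrative starting point
k_prime = y / x ** (lam_y / lam_x)   # Equation A5

for _ in range(100):                 # 100 feedback cycles
    x *= 1.0 + lam_x
    y *= 1.0 + lam_y

# the point should still lie (nearly) on the curve y = k' * x**(lam_y/lam_x)
deviation = abs(y - k_prime * x ** (lam_y / lam_x))
assert deviation < 0.01
```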

We must now consider the behavior of these curves at the boundary of the square. A point will travel along the curve given by Equation A4 until it reaches the boundary. At this point all motion in the direction normal to the boundary will be prevented. Only the component of the predicted motion that is tangent to the boundary will contribute to the actual motion. If t is a normalized tangent vector along the boundary, then

(dv/dt) along boundary = (dv/dt · t) t. (A6)

We are interested in finding the curve represented by Equation A4 for which a point will reach the boundary and stop. This will occur when the scalar product (dot product) in Equation A6, evaluated at the point reached on the boundary, is zero. Thus we wish to find the point at which dv/dt · t = 0.

Figure A1. Regions of integration for the calculation of response probabilities for the two-dimensional neural system. The eigenvectors point along the coordinate axes.


Consider only the positive quadrant, with a normalized tangent vector

t = (−e_x + e_y)/√2.

We have then

0 = (x λ_x e_x + y λ_y e_y) · (−e_x + e_y)/√2,

so that

0 = −x λ_x + y λ_y.

Since x + y = √2 on the boundary, we have

0 = −x λ_x + (√2 − x) λ_y,

and therefore

x = √2 λ_y/(λ_x + λ_y).

The lines x = √2 λ_y/(λ_x + λ_y) and x = −√2 λ_y/(λ_x + λ_y) determine by their intersection with the boundary four (unstable) equilibrium points. The curves representing all interior points of the square which will move so as to intersect the boundary at one of these equilibrium points are given by Equation A4 with

k′ = ±[√2 λ_x/(λ_x + λ_y)][√2 λ_y/(λ_x + λ_y)]^{−λ_y/λ_x}.

The regions of the two-dimensional system are thus determined (see Figure A1).

The area of the sector labelled I and II in Figure A1 can be found by integrating y = k′ x^{λ_y/λ_x} from 0 to x = √2 λ_y/(λ_x + λ_y) and adding the area of the triangular segment, I. This area times 4 will give the amount of the total area for which a vector will end up in corners (1, 1) or (−1, −1) in the nonrotated system. Since the total area of the square is 4, the area I-II corresponds to the probability of Response A.

If P is the probability of ending up in corners (1, 1) or (−1, −1), then

P = ∫_0^{√2 λ_y/(λ_x + λ_y)} k′ x^{λ_y/λ_x} dx + λ_x^2/(λ_x + λ_y)^2

  = λ_x^2 (λ_x + 3λ_y)/(λ_x + λ_y)^3

  = (3λ^2 + λ^3)/(λ + 1)^3, (A7)

where λ = λ_x/λ_y. This is the formula used to calculate response probabilities in the section on probability learning.
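Equation A7 can be spot-checked by simulation. The sketch below (ours; the matrix construction and all parameter values are illustrative, not from the paper) builds a two-neuron system in the original, unrotated frame with eigenvectors (1, 1) and (−1, 1), iterates the saturating feedback rule, and compares the fraction of random starting points reaching corners (1, 1) or (−1, −1) with the prediction of Equation A7:

```python
# Monte Carlo spot-check (ours) of Equation A7 for a two-neuron system.
# Eigenvectors (1, 1) and (-1, 1) with eigenvalues lam_x, lam_y; activities
# saturate at -1 and +1. All parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
lam_x, lam_y = 0.08, 0.04                         # lambda = lam_x/lam_y = 2
P1 = 0.5 * np.array([[1.0, 1.0], [1.0, 1.0]])     # projector onto (1, 1)
P2 = 0.5 * np.array([[1.0, -1.0], [-1.0, 1.0]])   # projector onto (-1, 1)
A = lam_x * P1 + lam_y * P2

def settle(v, steps=400):
    """Run the saturating feedback v <- clip(v + A v) until a corner."""
    for _ in range(steps):
        v = np.clip(v + A @ v, -1.0, 1.0)
        if abs(v[0]) == 1.0 and abs(v[1]) == 1.0:
            break
    return v

trials, hits = 2000, 0
for _ in range(trials):
    v = settle(rng.uniform(-1.0, 1.0, size=2))
    if v[0] * v[1] > 0:              # ended in corner (1, 1) or (-1, -1)
        hits += 1

lam = lam_x / lam_y
p_formula = (3 * lam**2 + lam**3) / (lam + 1) ** 3   # Equation A7
print(p_formula, hits / trials)      # the two values should be close
```

Because the derivation assumes continuous time, agreement is only approximate, but it improves as the eigenvalues shrink.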

Appendix B

We can write the equation governing λ_A as

λ_A(n + 1) = 1 + P_{n+1}, (B1)

where

P_{n+1} = g P_n + η I_{n+1},  P_0 = 0, (B2)

where the I_n's are independent random variables taking the value 1 with probability π and 0 with probability 1 − π. The solution for Equation B2 is given by

P_n = η Σ_{j=1}^{n} g^{n−j} I_j (B3)

and can easily be verified. However, it is virtually impossible to express the distribution of P_n in a workable form. We can calculate moments of P_n using Equation B3. For example, the expected value of P_n is

E(P_n) = η π Σ_{j=1}^{n} g^{n−j} = η π (1 − g^n)/(1 − g),

while the second moment is


E(P_n^2) = E[(η Σ_{j=1}^{n} g^{n−j} I_j)(η Σ_{k=1}^{n} g^{n−k} I_k)] = η^2 Σ_j Σ_k g^{2n−j−k} E(I_j I_k)

= η^2 [π Σ_{j=1}^{n} g^{2(n−j)} + π^2 Σ_{j≠k} g^{2n−j−k}]

= η^2 π (1 − π)(1 − g^{2n})/(1 − g^2) + [η π (1 − g^n)/(1 − g)]^2. (B5)

The variance, then, of P_n is

var(P_n) = E(P_n^2) − E(P_n)^2 = η^2 π (1 − π)(1 − g^{2n})/(1 − g^2). (B6)

The higher moments can be found in the same way, although the calculations become more tedious.
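Both the closed form B3 and the moment formulas lend themselves to a quick numerical check. In this sketch (ours; parameter values arbitrary), one realization of the recurrence B2 is compared with Equation B3, and the sample mean and variance of many realizations are compared with the formulas above:

```python
# Numerical check (ours; parameter values arbitrary) of Equation B3 against
# the recurrence B2, and of the mean and variance formulas (B6) against
# simulation.
import random

random.seed(1)
g, eta, pi, n = 0.7, 0.2, 0.4, 25

def sample_P():
    """One realization of P_n generated by the recurrence of Equation B2."""
    P, I = 0.0, []
    for _ in range(n):
        I.append(1 if random.random() < pi else 0)
        P = g * P + eta * I[-1]
    return P, I

# Equation B3 reproduces the recurrence exactly
P, I = sample_P()
P_closed = eta * sum(g ** (n - j) * I[j - 1] for j in range(1, n + 1))
assert abs(P - P_closed) < 1e-12

# sample mean and variance of P_n versus the closed forms
samples = [sample_P()[0] for _ in range(20000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
mean_th = eta * pi * (1 - g**n) / (1 - g)
var_th = eta**2 * pi * (1 - pi) * (1 - g ** (2 * n)) / (1 - g**2)
print(mean, mean_th, var, var_th)    # each pair should nearly agree
```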

As n approaches infinity, P_n cannot converge to a real number or to another random variable, since at each step we are adding a nonattenuating random element. However, the distribution functions F_n(x) = P(P_n ≤ x) converge to the distribution function of the random variable

P = η Σ_{j=0}^{∞} g^j I_j, (B7)

where the I_j's are defined as before. Since the distributions of P_n are defined on the interval between 0 and η/(1 − g), the moments of P_n converge to the moments of P. The moments of P are easy to get by using an iterative scheme. We can write P as

P = g P′ + η I, (B8)

where P′ = η Σ_{j=0}^{∞} g^j I_{j+1} and I = I_0. Since P and P′ have the same distribution, and since P′ and I are independent, we have for all n ≥ 1

E(P^n) = E[(g P′ + η I)^n] = Σ_{r=0}^{n} (n, r) g^r η^{n−r} E(P^r) E(I^{n−r}), (B9)

where the (n, r) are the binomial coefficients [that is, (n, r) = n!/r!(n − r)!]. Therefore we have

E(P^n) = [π/(1 − g^n)] Σ_{r=0}^{n−1} (n, r) g^r η^{n−r} E(P^r). (B10)
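Equation B10 is convenient precisely because it can be run as a small loop, each moment built from the lower ones. A sketch (ours; values illustrative) computing the first three moments of P and checking the first two against the n → ∞ limits of the mean and second-moment formulas above:

```python
# Equation B10 run as a small iterative scheme (ours; values illustrative).
from math import comb

g, eta, pi = 0.8, 0.25, 0.6
E = [1.0]                            # E[P^0] = 1
for n in range(1, 4):                # first three moments of P
    s = sum(comb(n, r) * g**r * eta ** (n - r) * E[r] for r in range(n))
    E.append(pi * s / (1 - g**n))

# E[P] should equal eta*pi/(1-g); E[P^2] should match the n -> infinity
# limit of the second-moment formula (Equation B5)
m1 = eta * pi / (1 - g)
m2 = m1**2 + eta**2 * pi * (1 - pi) / (1 - g**2)
assert abs(E[1] - m1) < 1e-12
assert abs(E[2] - m2) < 1e-12
```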

It is easy to see that

λ_B(n) = 1 + η Σ_{j=1}^{n} g^{n−j} (1 − I_j) = 1 + η (1 − g^n)/(1 − g) − P_n, (B11)

so that

X(n) = λ_A(n)/λ_B(n) = (1 + P_n)/(α_n − P_n), (B12)

where α_n = 1 + η (1 − g^n)/(1 − g). However, the random variable 1/(α_n − P_n) is difficult to deal with even in the limiting case. All moments of X(n) must be derived from infinite series of moments of P_n. It is computationally feasible to get approximations, but we made no attempt to do so.

Fortunately, we have a remarkably different situation for the probability of selection. We write

Prob[X(n)] = X(n)^2 [3 + X(n)]/[1 + X(n)]^3, (B13)

and it can easily be verified that

X(n)^2 [3 + X(n)]/[1 + X(n)]^3 = (1 + 3α_n)/(1 + α_n)^3 − [1/(1 + α_n)^3][2P^3(n) + 3(1 − α_n)P^2(n) − 6α_n P(n)], (B14)

so that

Prob[X(n)] = (1 + 3α_n)/(1 + α_n)^3 − [1/(1 + α_n)^3][2P^3(n) + 3(1 − α_n)P^2(n) − 6α_n P(n)]. (B15)

Therefore, in order to determine the expected value of Prob[X(n)], we need only compute the first three moments of P(n), while the variance of Prob[X(n)] requires P(n)'s first six moments.
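This moment computation is easy to mechanize. The sketch below (ours; parameters illustrative) propagates the first three moments of P(n) through the recurrence B2, evaluates the expected value of Prob[X(n)] via Equation B15, and checks the result against a direct Monte Carlo average of Equation B13:

```python
# E{Prob[X(n)]} from the first three moments of P(n) via Equation B15,
# checked against a Monte Carlo average of Equation B13 (ours; parameter
# values illustrative).
import random
from math import comb

random.seed(2)
g, eta, pi, n = 0.8, 0.25, 0.7, 40
alpha = 1 + eta * (1 - g**n) / (1 - g)

# propagate E[P_k^r], r = 0..3, through the recurrence P_{k+1} = g P_k + eta I
M = [1.0, 0.0, 0.0, 0.0]             # moments of P_0 = 0
EI = [1.0, pi, pi, pi]               # E[I^r]
for _ in range(n):
    M = [sum(comb(r, s) * g**s * eta ** (r - s) * M[s] * EI[r - s]
             for s in range(r + 1)) for r in range(4)]

exp_prob = ((1 + 3 * alpha)
            - (2 * M[3] + 3 * (1 - alpha) * M[2] - 6 * alpha * M[1])
            ) / (1 + alpha) ** 3     # Equation B15, in expectation

# direct Monte Carlo over reinforcement sequences I_1 .. I_n
total, trials = 0.0, 20000
for _ in range(trials):
    P = 0.0
    for _ in range(n):
        P = g * P + eta * (1 if random.random() < pi else 0)
    X = (1 + P) / (alpha - P)
    total += X**2 * (3 + X) / (1 + X) ** 3   # Equation B13
print(exp_prob, total / trials)      # should agree up to sampling noise
```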

As n approaches infinity, the distributions of Prob[X(n)] approach the distribution of the random variable Prob(X), where X is given by

X = (1 + P)/(α − P),  α = 1 + η/(1 − g). (B16)

Equation B15 holds true for X with P(n) replaced by P, so that the expected value and variance of Prob(X) can be derived as in the finite case.

Appendix C

We would like to see if we must be restricted to the values given in Equation 5 for the feedback system to have its desirable properties. Suppose we let the main diagonal be zero. This means a cell influences only its neighbors, and not itself. We see that the value of the diagonal element given by Equation 5 is related to the mean square of the activities f(i) over all the traces. If we make the assumption, in harmony with our general approach but presently untestable, that, on the average, cells are equally active across the total stimulus set, that is, they are all equally "important" in some sense, then the diagonal elements will be almost the same.

Let us consider the matrix we shall call D, which contains only the diagonal elements. Then,

D ≅ cI,

where c is a positive constant. If the covariance matrix given by Equation 5 is denoted V and we require the feedback matrix, A, to have zeroes along the diagonal, then

A = V − D ≅ V − cI. (C1)

Suppose e_i is an eigenvector of the covariance matrix with eigenvalue λ_i and with all the important discriminative properties we have discussed. Let us assume that the equality in Equation C1 holds. Then if

A = V − cI,

Ae_i = Ve_i − ce_i = (λ_i − c)e_i. (C2)

Thus, e_i is an eigenvector of A and its eigenvalue is (λ_i − c). We see that all the eigenvalues are reduced by an amount c; thus some eigenvalues can be negative. If the equality does not hold, then for small deviations of D from a constant times the identity matrix, the eigenvectors of A will be close to the eigenvectors of the covariance matrix and will have about the same properties.

As a special case of interest, assume the eigenvectors point toward the corners of an N-dimensional hypercube. Suppose the system only learns these corners. Since the cube has equal sides, all the f^2(i) will be the same (i.e., the square of the saturation limit), and we see that the exact equality will hold. Thus, we can avoid the awkward necessity of having large positive elements along the main diagonal and lose little, if any, of the information processing power of the system.

Note that negative eigenvalues, as long as they are greater than −1, may cause the final state of the system to converge to zero, as can be seen from Equation 8. This will occur if the initial input is a linear combination of eigenvectors with eigenvalues in the interval (−1, 0).
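The eigenvalue shift of Equation C2 is easy to verify numerically. In this sketch (ours; the random symmetric matrix merely stands in for the covariance matrix V of Equation 5), subtracting cI shifts every eigenvalue of V down by c while leaving the eigenvectors unchanged:

```python
# Check (ours) of Equation C2: subtracting c*I shifts every eigenvalue
# down by c and leaves the eigenvectors unchanged. V here is a random
# symmetric matrix standing in for the covariance matrix of Equation 5.
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 20))
V = X @ X.T / 20                      # symmetric, like a covariance matrix
c = float(np.mean(np.diag(V)))        # D ~ c*I under equal-activity assumption
A = V - c * np.eye(5)

ev_V = np.linalg.eigvalsh(V)          # ascending eigenvalues of V
ev_A = np.linalg.eigvalsh(A)
assert np.allclose(ev_A, ev_V - c)    # every eigenvalue shifted by exactly c
```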

Received September 30, 1976