
Vision by Man and Machine

How does an animal see? How might a computer do it? A study of stereo vision guides research on both these questions. Brain science suggests computer programs;

the computer suggests what to look for in the brain.

Tomaso Poggio, April 1984

The development of computers of increasing power and sophistication often stimulates comparisons between them and the human brain, and these comparisons are becoming more earnest as computers are applied more and more to tasks formerly associated with essentially human activities and capabilities. Indeed, it is widely expected that a coming generation of computers and robots will have sensory, motor and even "intellectual" skills closely resembling our own. How might such machines be designed? Can our rapidly growing knowledge of the human brain be a guide? And at the same time can our advances in "artificial intelligence" help us to understand the brain?

At the level of their hardware (the brain's or a computer's) the differences are great. The neurons, or nerve cells, in a brain are small, delicate structures bound by a complex membrane and closely packed in a medium of supporting cells that control a complex and probably quite variable chemical environment. They are very unlike the wires and etched crystals of semiconducting materials on which computers are based. In the organization of the hardware the differences also are great. The connections between neurons are very numerous (any one neuron may receive many thousands of inputs) and are distributed in three dimensions. In a computer the wires linking circuit components are limited by present-day solid-state technology to a relatively small number arranged more or less two-dimensionally.

In the transmission of signals the differences again are great. The binary (on-off) electric pulses of the computer are mirrored to some extent in the all-or-nothing signal conducted along nerve fibers, but in addition the brain employs graded electrical signals, chemical messenger substances and the transport of ions. In temporal organization the differences are immense. Computers process information serially (one step at a time) but at a very fast rate. The time course of their operation is governed by a computer-wide clock. What is known of the brain suggests that it functions much more slowly but that it analyzes information along millions of channels concurrently without need of clock-driven operation.

How, then, are brains and computers alike? Clearly there must be a level at which any two mechanisms can be compared. One can compare the tasks they do. "To bring the good news from Ghent to Aix" is a description of a task that can be done by satellite, telegraph, horseback messenger or pigeon post equally well (unless other constraints such as time are specified).

If, therefore, we assert that brains and computers function as information-processing systems, we can develop descriptions of the tasks they perform that will be equally applicable to either. We shall have a common language in which to discuss them: the language of information processing. Note that in this language descriptions of tasks are decoupled from descriptions of the hardware that performs them. This separability is at the foundation of the science of artificial intelligence. Its goals are to make computers more useful by endowing them with "intelligent" capabilities, and beyond that to understand the principles that make intelligence possible.

In no field have the descriptions of information-processing tasks been more precisely formulated than in the study of vision. On the one hand it is the dominant sensory modality of human beings. If we want to create robots capable of performing complex manipulative tasks in a changing environment, we must surely endow them with adequate visual powers. Yet vision remains elusive. It is something we are good at; the brain does it rapidly and easily. It is nonetheless a mammoth information-processing task. If it required a conscious effort, like adding numbers in our head, we would not undervalue its difficulty. Instead we are easily lured into oversimple, noncomputational preconceptions of what vision really entails.

Ultimately, of course, one wants to know how vision is performed by the biological hardware of neurons and their synaptic interconnections. But vision is not exclusively a problem in anatomy and physiology: how nerve cells are interconnected and how they act. From the perspective of information processing (by the brain or by a computer) it is a problem at many levels: the level of computation (What computational tasks must a visual system perform?), the level of algorithm (What sequence of steps completes the task?) and then the level of hardware (How might neurons or electronic circuits execute the algorithm?). Thus an attack on the problem of vision requires a variety of aids, including psychophysical evidence (that is, knowledge of how well people can see) and neurophysiological data (knowledge of what neurons can do). Finding workable algorithms is the most critical part of the project, because algorithms are constrained both by the computation and by the available hardware.

Figure 5.1 STEREO VISION BY A COMPUTER is shown in aerial photographs (provided by Robert J. Woodham). They were made from different angles so that objects in each have slightly different positions. The images were made by a mosaic of microelectronic sensors, each of which measures the intensity of light along a particular line of sight, as do the photoreceptor cells of the eye. The map at the bottom was generated by a computer programmed to follow a procedure devised by David Marr and the author and further developed by W. Eric L. Grimson. The computer filtered the images to emphasize spatial changes in intensity. Then it performed stereopsis: it matched features from one image to the other, determined the disparity between their positions and calculated their relative depths in the three-dimensional world. Increasing elevations in the map are coded in colors from blue to red.

Here I shall outline an effort in which I am involved, one that explores a sequence of algorithms first to extract information, notably edges, or pronounced contours in the intensity of light, from visual images and then to calculate from those edges the depths of objects in the three-dimensional world. I shall concentrate on a particular aspect of the task, namely stereopsis, or stereo vision (see Figure 5.1). Not the least of my reasons is the central role stereopsis has played in the work on vision that my colleagues and I have done at the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. In particular, stereopsis has stimulated a close investigation of the very first steps in visual information processing. Then too, stereopsis is deceptively simple. As with so many other tasks that the brain performs without effort, the development of an automatic system with stereo vision has proved to be surprisingly difficult. Finally, the study of stereopsis benefits from the availability of a large body of psychophysical evidence that defines and constrains the problem.

The information available at the outset of the process of vision is a two-dimensional array of measurements of the amount of light reflected into the eye or into a camera from points on the surfaces of objects in the three-dimensional visual world. In the human eye the measurements are made by photoreceptors (rod cells and cone cells), of which there are more than 100 million. In a camera that my colleagues and I use at the Artificial Intelligence Laboratory the processes are different but the result is much the same. There the measurements are made by solid-state electronic sensors. They produce an array of 1,000 by 1,000 light-intensity values. Each value is a pixel, or picture element (see Figure 5.2).

Figure 5.2 BEGINNING OF VISION for an animal or a computer is a gray-level array: a point-by-point representation of the intensity of light produced by a grid of detectors in the eye or in a digital camera. The image at the top of this illustration is such an array. It was produced by a digital camera as a set of intensity values in a grid of 576 by 454 picture elements ("pixels"). Intensity values for the part of the image inside the rectangle are given digitally at the bottom. (Figures 5.2 and 5.6 were prepared by H. Keith Nishihara of the Artificial Intelligence Laboratory.)

In either case it is inconceivable that the gap between the raw image (the large array of numbers produced by the eye or the camera) and vision (knowing what is around, and where) can be spanned in a single step. One concludes that vision requires various processes (one thinks of them as modules) operating in parallel on raw images and producing intermediate representations of the images on which other processes can work. For example, several vision modules seem to be involved in reconstructing the three-dimensional geometry of the world. A short list of such modules would have to include modules that deduce shape from shading, from visual texture, from motion, from contours, from occlusions and from stereopsis. Some may work directly on the raw image (the intensity measurements). Often, however, a module may operate more effectively on an intermediate representation.

Stereopsis arises from the fact that our two eyes view the visual world from slightly different angles. To put it another way, the eyes converge slightly, so that their axes of vision meet at a point in the visual world. The point is said to be fixated by the eyes, that is, the image of the point falls on the center of vision of each retina. Any neighboring point in the visual field will then project to a point on each retina some distance from the center of vision. In general this distance will not be the same for both eyes. In fact, the disparity in distance will vary with the depth of the point in the visual field with respect to the fixated point (see Figure 5.3).

Stereopsis, then, is the decoding of three-dimensionality from binocular disparities. It might appear at first to be a straightforward problem in trigonometry. One might therefore be tempted to program a computer to solve it that way. The effort would fail; our own facility with stereopsis has led us to gloss over the central difficulty of the task, as we may see if we formally set out the steps involved in the task. They are four: A location in space must be selected from one retinal image. The same location must be identified in the other retinal image. Their positions must be measured. From the disparity between the two measurements the distance to the location must be calculated.

The last two steps are indeed an exercise in trigonometry (at least in the cases considered in this chapter). The first two steps are different. They require, in effect, that the projection of the same point in the physical world be found in each eye. A group of contiguous photoreceptors in one eye can be thought of as looking along a line of sight to a patch of the surface of some object. The photoreceptors looking at the same patch of surface from the opposite eye must then be identified. Because of binocular disparity they will not be at the same position with respect to the center of vision.
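The trigonometry of the last two steps is simple enough to sketch. Here is a minimal illustration in Python for the simplified case of two parallel cameras (the geometry of converging eyes differs in detail); the focal length, baseline and image positions are invented values, not measurements from any real system.

```python
# Steps 3 and 4 of the task: measure the two image positions and triangulate.
# Parallel-camera geometry is assumed; all numbers are illustrative.

def depth_from_disparity(x_left, x_right, focal_length=0.05, baseline=0.10):
    """Similar triangles give Z = f * b / d, where the disparity d is the
    difference between the two measured image positions (meters here)."""
    d = x_left - x_right
    if d == 0:
        return float("inf")  # zero disparity: the point is at infinity
    return focal_length * baseline / d

# A point imaged 2 mm from the image center on the left and 1 mm on the right:
print(depth_from_disparity(0.002, 0.001))  # 5.0 (meters)
```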

This, of course, is where the difficulty lies. For us the visual world contains surfaces that seem effectively labeled because they belong to distinct shapes in specific spatial relations to one another. One must remember, however, that vision begins with no more than arrays of raw light intensity measured from point to point. Could it be that the brain matches patterns of raw light intensity from one eye to the other? Probably not. Experiments with computers place limits on the effectiveness of the matching, and physiological and psychophysical evidence speaks against it for the human visual system. For one thing, a given patch of surface will not necessarily reflect the same intensity of light to both eyes. More important, patches of surface widely separated in the visual world may happen to have the same intensity. Matching such patches would be incorrect.

A discovery made at AT&T Bell Laboratories by Bela Julesz (now at Rutgers University) shows the full extent of the problem. Julesz devised pairs of what he called random-dot stereograms. They are visual stimuli that contain no perceptual clues except binocular disparities. To make each pair he generated a random texture of black and white dots and made two copies of it. In one of the copies he shifted an area of the pattern, say a square. In the other copy he shifted the square in the opposite direction. He filled the resulting hole in each pattern with more random texture. Viewed one at a time each pattern looked uniformly random (see Figure 5.4). Viewed through a stereoscope, so that each eye saw one of the patterns and the brain could fuse the two, the result was startling. The square gave a vivid impression of floating in front of its surroundings or behind them (see Figure 5.5). Evidently stereopsis does not require the prior perception of objects or the recognition of shapes.
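Julesz' construction is easy to reproduce. The following sketch in Python (NumPy assumed; the image size, square size and shift are arbitrary choices) builds such a pair:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_dot_stereogram(size=128, square=40, shift=2):
    """One random texture, copied twice; a central square is shifted left
    in one copy and right in the other, and the uncovered strips are
    refilled with fresh random dots."""
    base = rng.integers(0, 2, size=(size, size))   # 0 = black, 1 = white
    left, right = base.copy(), base.copy()
    r0 = (size - square) // 2                      # corner of the square
    rows = slice(r0, r0 + square)
    patch = base[rows, r0:r0 + square]
    # Shift the square in opposite directions in the two copies.
    left[rows, r0 - shift:r0 - shift + square] = patch
    right[rows, r0 + shift:r0 + shift + square] = patch
    # Refill the uncovered strips ("holes") with new random texture.
    left[rows, r0 + square - shift:r0 + square] = rng.integers(0, 2, (square, shift))
    right[rows, r0:r0 + shift] = rng.integers(0, 2, (square, shift))
    return left, right

left, right = random_dot_stereogram()
```

Viewed separately, each array is uniformly random; the shifted square exists only in the relation between the two.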

Julesz' discovery enables one to formulate the computational goal of stereopsis: it is the extraction of binocular disparities from a pair of images without the need for obvious monocular clues. In addition the discovery enables one to formulate the computational problem inherent in stereopsis. It is the correspondence problem: the matching of elements in the two images that correspond to the same location in space without the recognition of objects or their parts. In random-dot stereograms the black dots in each image are all the same: they have the same size, the same shape and the same brightness. Any one of them could in principle be matched with any one of a great number of dots in the other image. And yet the brain solves this false-target dilemma: it consistently chooses only the correct set of matches.

It must use more than the dots themselves. In particular, the fact that the brain can solve the correspondence problem shows it exploits a set of implicit assumptions about the visual world, assumptions that constrain the correspondence problem, making it determined and solvable. In 1976 David Marr and I, working at MIT, found that simple properties of physical surfaces could limit the problem sufficiently for the stereopsis algorithms (procedures to be followed by a computer) we were then investigating. These are, first, that a given point on a physical surface has only one three-dimensional location at any given time and, second, that physical objects are cohesive and usually are opaque, so that the variation in depth over a surface is generally smooth, with discontinuous changes occurring only at boundary lines. The first of these constraints (uniqueness of location) means that each item in either image (say each dot in a random-dot stereogram) has a unique disparity and can be matched with no more than one item in the other image. The second constraint (continuity and opacity) means that disparity varies smoothly except at object boundaries.

Figure 5.3 BINOCULAR DISPARITIES are the basis for stereopsis. They arise because the eyes converge slightly, so that their axes of vision meet at a point in the external world (a). The point is "fixated." A neighboring point in the world (b) will then project to a point on each retina some distance from the center of vision. The distance will not be the same for each eye.

Together the two constraints provide matching rules that are reasonable and powerful. I shall describe some simple ones below. Before that, however, it is necessary to specify the items to be matched. After all, the visual world is not a random-dot stereogram, consisting only of black and white dots. We have already seen that intensity values are too unreliable. Yet the information the brain requires is encrypted in the intensity array provided by photoreceptors. If an additional property of physical surfaces is invoked, the problem is simplified. It is based on the observation that at places where there are physical changes in a surface, the image of the surface usually shows sharp variations in intensity. These variations (caused by markings on a surface and by variations in its depth) would be more reliable tokens for matching than raw intensities would be.

Instead of raw numerical values of intensity, therefore, one seeks a more symbolic, compact and robust representation of the visual world: a description of the world in which the primitive symbols, the signs in which the visual world is coded, are intensity variations. Marr called it a "primal sketch." In essence it is the conversion of the gray-level arrays provided by the visual photoreceptors into a form that makes explicit the position, direction, scale and magnitude of significant light-intensity gradients, with which the brain's stereopsis module can solve the correspondence problem and reconstruct the three-dimensional geometry of the visual world. I shall describe a scheme we have been using at the Artificial Intelligence Laboratory for the past few years, based on old and new ideas developed by a number of investigators, primarily Marr, Ellen C. Hildreth and me. It has several attractive features: it is fairly simple, it works well and it shows interesting resemblances to biological vision, which, in fact, suggested it. It is not, however, the full solution. Perhaps it is best seen as a working hypothesis about vision.

Figure 5.4 RANDOM-DOT STEREOGRAMS devised by Bela Julesz working at AT&T Bell Laboratories are visual textures containing no clues for stereo vision except binocular disparities. The stereograms themselves are the same random texture of black and white dots (top). In one of them, however, a square of the texture is shifted toward the left; in the other it is shifted toward the right (bottom). The resulting hole in each image is filled with more random dots (gray areas).

Figure 5.5 VIVID PERCEPTION OF DEPTH results when the random-dot stereograms shown in Figure 5.4 are viewed through a stereoscope, so that each eye sees one of the pair and the brain can fuse the two. The sight of part of the image "floating" establishes that stereopsis does not require the recognition of objects in the visual world.

Basically the changes of intensity in an image can be detected by comparing neighboring intensity values in the image: if the difference between them is great, the intensity is changing rapidly. In mathematical terms the operation amounts to taking the first derivative. (The first derivative is simply the rate of change of a mathematical function. Here it is simply the rate at which intensity changes on a path across the gray-level array.) The position of an extremal value (a peak or a pit) in the first derivative turns out to localize the position of an intensity edge quite well (see Figure 5.6). In turn the intensity edge often corresponds to an edge on a physical surface. The second derivative also serves well. It is simply the rate of change of the rate of change and is obtained by taking differences between neighboring values of the first derivative. In the second derivative an intensity edge in the gray-level array corresponds to a zero-crossing: a place where the second derivative crosses zero as it falls from positive values to negative values or rises from negative values to positive.

Derivatives seem quite promising. Used alone, however, they seldom work on a real image, largely because the intensity changes in a real image are rarely clean and sharp changes from one intensity value to another. For one thing, many different changes, slow and fast, often overlap on a variety of different spatial scales. In addition changes in intensity are often corrupted by the visual analogue of noise. They are corrupted, in other words, by random disruptions that infiltrate at different stages as the image formed by the optics of the eye or of a camera is transduced into an array of intensity measurements. In order to cope both with noisy edges and with edges at different spatial scales the image must be "smoothed" by a local averaging of neighboring intensity values. The differencing operation that amounts to the taking of first and second derivatives can then be performed.
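A toy one-dimensional illustration of this smooth-then-difference sequence, in Python with NumPy; the signal (a noisy step edge) and every parameter are assumed for the example, not taken from the article:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(100)
# A noisy intensity profile with a step edge at index 50.
intensity = np.where(x < 50, 80.0, 160.0) + rng.normal(0.0, 5.0, 100)

# Smooth by local averaging with Gaussian weights.
sigma = 3.0
t = np.arange(-10, 11)
kernel = np.exp(-t**2 / (2 * sigma**2))
kernel /= kernel.sum()
smoothed = np.convolve(intensity, kernel, mode="same")

first = np.diff(smoothed)    # rate of change: peaks at the edge
second = np.diff(first)      # rate of change of the rate of change

# Zero-crossings of the second derivative localize the edge; crossings
# driven by leftover noise are screened out by the size of the slope.
crossings = np.where(np.sign(second[:-1]) != np.sign(second[1:]))[0]
edges = crossings[np.abs(first[crossings]) > 2.0]
print(edges)  # expect an index near the step at 50
```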

There are various ways the sequence can be managed, and much theoretical effort has gone into the search for optimal methods. In one of the simplest the two operations, smoothing and differentiation, are combined into one. In technical terms it sounds forbidding: the image is convolved with a filter that embodies a particular center-surround function, the Laplacian of a Gaussian. It is not as bad as it sounds. A two-dimensional Gaussian is the bell-shaped distribution familiar to statisticians. In this context it specifies the importance to be assigned to the neighborhood of each pixel when the image is being smoothed. As the distance increases, the importance decreases. A Laplacian is a second derivative that gives equal weight to all paths extending away from a point. The Laplacian of a Gaussian converts the bell-shaped distribution into something more like a Mexican hat. The bell is narrowed and at its sides a circular negative dip develops.
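For reference, the standard mathematical form of this filter (not spelled out in the text; the symbol sigma sets the spatial scale) is:

```latex
% Standard textbook form of the Laplacian of a Gaussian.
\[
G_\sigma(x,y) = \frac{1}{2\pi\sigma^2}\,
  \exp\!\Bigl(-\frac{x^2+y^2}{2\sigma^2}\Bigr),
\qquad
\nabla^2 G_\sigma(x,y) = \frac{x^2+y^2-2\sigma^2}{\sigma^4}\,G_\sigma(x,y).
\]
% At the origin \nabla^2 G_\sigma is negative and in the surrounding
% annulus positive; the "Mexican hat" described in the text is
% -\nabla^2 G_\sigma, a sign convention that leaves the zero-crossings
% unchanged.
```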

Now the procedure can be described nontechnically. Convolving an image with a filter that embodies the Laplacian of a Gaussian is equivalent to substituting for each pixel in the image a weighted average of neighboring pixels, where the weights are provided by the Laplacian of a Gaussian. Thus the filter is applied to each pixel. It assigns the greatest positive weight to that pixel and decreasing positive weights to the pixels nearby (see Figure 5.7). Then comes an annulus, a ring, in which the pixels are given negative weightings. Bright points there feed negative numbers into the averaging. The result of the overall filtering is an array of positive and negative numbers: a kind of second derivative of the image intensity at the scale of the filter. The zero-crossings in this filtered array correspond to places in the original image where its intensity changes most rapidly. Note that a binary (that is, a two-valued) map showing merely the positive and negative regions of the filtered array is essentially equivalent to a map of the zero-crossings, in that one can be constructed from the other.

Figure 5.6 SPATIAL DERIVATIVES of an image emphasize its spatial variations in intensity. Left: An edge is shown between two even shades of gray (a). The intensity along a path across the edge appears below it (b). The first derivative of the intensity is the rate at which intensity changes (c). Toward the left or toward the right there is no change; the first derivative therefore is zero. Along the edge itself, however, the rate of change rises and falls. The second derivative of the intensity is the rate of change of the rate of change (d). Both derivatives emphasize the edge. The first derivative marks it with a peak; the second derivative marks it by crossing zero. Right: The edge is more typical of the visual world (a'). The related intensity contour (b') and its first and second derivatives (c', d') are "noisy." The edge must be smoothed before derivatives are taken.

Figure 5.7 CENTER-SURROUND FILTERING of an image serves both to smooth it and to take its second spatial derivative. Left: The image is shown with filters of two sizes depicted schematically; the "filter" is actually computational. Each intensity measurement in the image is replaced by a weighted average of neighboring measurements. Nearby measurements contribute positive weights to the average; thus the filter's center is "excitatory" (red). Then comes an annulus, or ring, in which the measurements contribute negative weights; thus the filter's "surround" is "inhibitory" (blue). Right: Maps produced by the filters are no longer gray-level arrays. They have both positive values (red) and negative values (blue). They are maps of the second derivative. Transitions from one color to the other are zero-crossings. The maps emphasize the zero-crossings by showing only positive regions (red) and negative regions (blue).
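In a modern scripting language the whole filtering-and-marking step takes only a few lines. A sketch in Python, assuming NumPy and SciPy are available; the random array is merely a placeholder for a real gray-level image:

```python
import numpy as np
from scipy.ndimage import gaussian_laplace  # SciPy assumed available

rng = np.random.default_rng(0)
image = rng.random((454, 576))   # placeholder for a 576-by-454 image

# Convolve with the Laplacian of a Gaussian; sigma sets the filter's scale
# (2.0 is an arbitrary choice for the example).
filtered = gaussian_laplace(image, sigma=2.0)

# Binary map of positive and negative regions (red/blue in Figure 5.7).
sign_map = filtered > 0

# Zero-crossings: pixels whose sign differs from a right or lower neighbor.
zero_crossings = np.zeros_like(sign_map)
zero_crossings[:, :-1] |= sign_map[:, :-1] != sign_map[:, 1:]
zero_crossings[:-1, :] |= sign_map[:-1, :] != sign_map[1:, :]
```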

In the human brain most of the hardware required to perform such a filtering seems to be present. As early as 1865 Ernst Mach observed that visual perception seems to enhance spatial variations in light intensity. He postulated that the enhancement might be achieved by lateral inhibition, a brain mechanism in which the excitation of an axon, or nerve fiber, say by a spot of bright light in the visual world, blocks the excitation of neighboring axons. The operation plainly enhances the contrast between the bright spot and its surroundings. Hence it is similar to the taking of a spatial derivative.

Then in the 1950's and 1960's evidence accumulated suggesting that the retina does something much like center-surround filtering. The output from each retina is conveyed to the rest of the brain by about a million nerve fibers, each being the axon of a neuron called a retinal ganglion cell. The cell derives its input (by way of intermediate neurons) from a group of photoreceptors, which form a "receptive field." What the evidence suggests is that for certain ganglion cells the receptive field has a center-surround organization closely approximating the Laplacian of a Gaussian. Brightness in the center of the receptive field excites the ganglion cell; brightness in a surrounding annulus inhibits it. In short, the receptive field has an ON-center and an OFF-surround, just like the Mexican hat (see Figure 5.8).

Figure 5.8 BIOLOGICAL CENTER-SURROUND FILTER embodied by cells in the retina resembles the computer procedure shown in Figure 5.7. The filter begins with a layer of photoreceptors connected by way of intermediate nerve cells, not shown in the diagram, to a layer of retinal ganglion cells, which send visual data to higher visual centers. For the sake of simplicity only one set of connections is shown. A photoreceptor cell (red) excites an "ON-center" ganglion cell by promoting its tendency to generate neural signals; the surrounding photoreceptors (blue) inhibit the ganglion cell.

Other ganglion cells have the opposite properties: they are OFF-center, ON-surround. If axons could signal negative numbers, these cells would be redundant: they report simply the negation of what the ON-center cells report. Neurons, however, cannot readily transmit negative activity; the ones that transmit all-or-nothing activity are either active or quiescent. Nature, then, may need neuronal opposites. Positive values in an image subjected to center-surround filtering could be represented by the activity of ON-center cells; negative values could be represented by the activity of OFF-center cells. In this regard I cannot refrain from mentioning the recent finding that ON-center and OFF-center ganglion cells are segregated into two different layers, at least in the retina of the cat. The maps generated by our computer might thus depict neural activity rather literally. In the maps in Figure 5.7 red might correspond to ON-layer activity and blue to OFF-layer activity. Zero-crossings (that is, transitions from one color to the other) would be the locations where activity switches from one layer to the other. Here, then, is a conjecture linking a computational theory of vision to the brain hardware serving biological vision.
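Taken literally, the two layers are just the rectified halves of the filtered array. A trivial Python sketch, purely illustrative:

```python
import numpy as np

def on_off_channels(filtered):
    """Split a center-surround filtered array into two non-negative
    channels, as neurons unable to signal negative numbers might."""
    on = np.maximum(filtered, 0.0)    # ON-layer: positive values only
    off = np.maximum(-filtered, 0.0)  # OFF-layer: negative values, negated
    return on, off                    # note: filtered == on - off
```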

It should be said that the center-surround filtering of an image is computationally expensive for a computer because it involves great numbers of multiplications: about a billion for an image of 1,000 pixels by 1,000. At the Artificial Intelligence Laboratory, H. Keith Nishihara and Noble G. Larson, Jr., have designed a specialized device: a convolver that performs the operation in about a second. The speed is impressive but is plodding compared with that of the retinal ganglion cells.
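The count is easy to check: direct convolution costs one multiplication per pixel per filter weight. A filter of roughly 32 by 32 weights (an assumed size; the article does not give one) reproduces the quoted figure:

```python
# One multiplication per pixel per filter weight.
pixels = 1_000 * 1_000
filter_weights = 32 * 32
print(pixels * filter_weights)  # 1_024_000_000: of order a billion
```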

I should also mention the issue of spatial scale. In an image there are fine changes in intensity as well as coarse. All must be detected and represented. How can it be done? The natural solution (and the solution suggested by physiology and psychophysics) is to use center-surround filters of different sizes. The filters turn out to be band-pass: they respond optimally to a certain range of spatial frequencies. In other words, they "see" only changes in intensity from pixel to pixel that are neither too fast nor too slow. For any one spatial scale the process of finding intensity changes consists, therefore, of filtering the image with a center-surround filter (or receptive field) of a particular size and then finding the zero-crossings in the filtered image. For a combination of scales it is necessary to set up filters of different sizes, performing the same computation for each scale. Large filters would then detect soft or blurred edges as well as overall illumination changes; small filters would detect finer details. Sharp edges would be detected at all scales.
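A sketch of the multiple-scale idea in Python (SciPy assumed); the particular sigma values are illustrative choices, not the sizes used in the laboratory's system:

```python
from scipy.ndimage import gaussian_laplace  # SciPy assumed available

def multiscale_sign_maps(image, sigmas=(1.0, 2.0, 4.0, 8.0)):
    """The same center-surround operation at several filter sizes.
    Small sigmas respond to fine detail, large sigmas to soft or
    blurred edges; sharp edges show up in every map."""
    return {sigma: gaussian_laplace(image, sigma=sigma) > 0
            for sigma in sigmas}
```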

Recent theoretical results enhance the attractiveness of this idea by showing that features similar to zero-crossings in a filtered image can be rich in information. First, Ben Logan of Bell Laboratories has proved that a one-dimensional signal filtered through a certain class of filters can be reconstructed from its zero-crossings alone. The Laplacian of a Gaussian does not satisfy Logan's conditions exactly. Still, his work suggests that the primitive symbols provided by zero-crossings are potent visual symbols. More recently Alan Yuille and I have made a theoretical analysis of center-surround filtering. We have been able to show that zero-crossing maps obtained at different scales can represent the original image completely, that is, without any loss of information.

This is not to say that zero-crossings are the optimal coding scheme for a process such as stereopsis. Nor is it to insist that zero-crossings are the sole basis of biological vision. They are a candidate for an optimal coding scheme, and they (or something like them) may be important among the items to be matched between the two retinal images. We have, therefore, a possible answer to the question of what the stereopsis module matches. In addition we have the beginning of a computational theory that may eventually give mathematical precision to the vague concept of "edges" and connect it to known properties of biological vision, such as the prominence of "edge detector" cells discovered at the Harvard Medical School by David H. Hubel and Torsten N. Wiesel in the part of the cerebral cortex where visual data arrive.

To summarize, a combination of computational arguments and biological data suggests that an important first step for stereopsis and other visual processes is the detection and marking of changes in intensity in an image at different spatial scales. One way to do it is to filter the image by the Laplacian of a Gaussian; the zero-crossings in the filtered array will then correspond to intensity edges in the image. Similar information is implicit in the activity of ON-center and OFF-center ganglion cells in the retina. To explicitly represent the zero-crossings (if indeed the brain does it at all) a class of edge-detector neurons in the brain (no doubt in the cerebral cortex) would have to perform specific operations on the output of ON-center and OFF-center cells that are neighbors in the retina. Here, however, one comes up against the lack of information about precisely what elementary computations nerve cells can readily do.

We are now in a position to see how a representation of intensity changes might be useful for stereopsis. Consider first an algorithm devised by Marr and me that implements the constraints discussed above, namely uniqueness (a given point on a physical surface has only one location, so that only one binocular match is correct) and continuity (variations in depth are generally smooth, so that binocular disparities tend to vary smoothly). It is successful at solving random-dot stereograms and at least some natural images. It is done by a computer; thus its actual execution amounts to a sequence of calculations. It can be thought of, however, as setting up a three-dimensional network of nodes, where the nodes represent all possible intersections of lines of sight from the eyes in the three-dimensional world. The uniqueness constraint will then be implemented by requiring that the nodes along a given line of sight inhibit one another. Meanwhile the continuity constraint will be implemented by requiring that each node excite its neighbors. In the case of random-dot stereograms the procedure will be relatively simple. There the matches for pixels on each horizontal row in one stereogram need be sought only along the corresponding row of the other stereogram.

The algorithm starts by assigning a value of 1 to all nodes representing a binocular match between two white pixels or two black pixels in the pair of stereograms. The other nodes are given a value of 0. The 1's thus mark all matches, true and false (see Figure 5.9). Next the algorithm performs an algebraic sum for each node. In it the neighboring nodes with a value of 1 contribute positive weights; the nodes with a value of 1 along lines of sight contribute negative weights. If the result exceeds some threshold value, the node is given the value of 1; otherwise the node is set to 0. That constitutes one iteration of the procedure. After a few such iterations the network reaches stability. The stereopsis problem is solved (see Figure 5.10).
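The iteration just described translates directly into a program. Here is a compact Python sketch (NumPy and SciPy assumed); the excitatory and inhibitory weights, the threshold and the neighborhood size are arbitrary choices rather than the published parameters, and border wrap-around from np.roll is ignored:

```python
import numpy as np
from scipy.ndimage import uniform_filter  # SciPy assumed available

def cooperative_stereo(left, right, max_disp=5, iterations=8,
                       excite=1.0, inhibit=2.0, threshold=4.0):
    """A sketch of the cooperative algorithm described above. left and
    right are binary (0/1) arrays with matching rows, for example a
    random-dot stereogram pair."""
    disparities = list(range(-max_disp, max_disp + 1))
    # state[i, r, x] == 1 means: left pixel (r, x) matches right pixel
    # (r, x - d_i). Initialize with every same-color pair, true and false.
    state = np.stack([(left == np.roll(right, d, axis=1)).astype(float)
                      for d in disparities])
    initial = state.copy()

    for _ in range(iterations):
        # Continuity: neighboring nodes at the same disparity excite
        # (3-by-3 neighborhood sum, excluding the node itself).
        support = uniform_filter(state, size=(1, 3, 3)) * 9.0 - state
        # Uniqueness: nodes along the left eye's line of sight inhibit
        # (other disparities claiming the same left pixel) ...
        same_left = state.sum(axis=0, keepdims=True) - state
        # ... as do nodes claiming the same right pixel.
        right_claims = sum(np.roll(state[i], -d, axis=1)
                           for i, d in enumerate(disparities))
        same_right = np.stack([np.roll(right_claims, d, axis=1)
                               for d in disparities]) - state
        total = excite * support - inhibit * (same_left + same_right) + initial
        state = (total > threshold).astype(float)

    # Winning disparity at each pixel (pixels with no surviving match
    # default to the first disparity).
    return np.array(disparities)[state.argmax(axis=0)]
```

Run on the stereogram pair built earlier, the shifted square should, with suitably chosen parameters, emerge after a few iterations as a plateau of uniform disparity against a background of zero disparity.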

The algorithm has some great virtues. It is a cooperative algorithm: it consists of local calculations that a large number of simple processors could perform asynchronously and in parallel. One imagines that neurons could do them. In addition the algorithm can fill in gaps in the data. That is, it interpolates continuous surfaces. At the same time it allows for sharp discontinuities. On the other hand, the network it would require to process finely detailed natural images would have to be quite large, and most of the nodes in the network would be idle at any one time. Furthermore, intensity values are unsatisfactory for images more natural than random-dot stereograms.

The algorithm's effectiveness can be extended to at least some natural images by first filtering the images to obtain the sign of their convolution with the Laplacian of a Gaussian. The resulting binary maps then serve as inputs for the cooperative algorithm. The maps themselves are intriguing. In the ones generated by large filters at correspondingly low spatial resolution, zero-crossings of a given sign (for instance the crossings at which the sign of the convolution changes from positive to negative) turn out to be quite rare and are never close to each other. Thus false targets (matches between noncorresponding zero-crossings in a pair of stereograms) are essentially absent over a large range of disparities.

This suggests a different class of stereopsis algorithms. One such algorithm, developed recently for robots by Nishihara, matches positive or negative patches in filtered image pairs. Another algorithm, developed earlier by Marr and me, matches zero-crossings of the same sign in image pairs made by filters of three or more sizes. First the coarsely filtered images are matched and the binocular disparities are measured. The results are employed to approximately register the images. (Monocular features such as textures could also be used.) A similar matching process is then applied to the medium-filtered images. Finally the process is applied to the most finely filtered images. By that time the binocular disparities in the stereo pair are known in detail, and so the problem of stereopsis has been reduced to trigonometry.
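A heavily simplified sketch of the coarse-to-fine strategy in Python (SciPy assumed). It estimates a single global integer disparity from coarsely filtered sign maps, uses it to bring the pair into rough registration, then refines at finer scales; the real algorithm instead pairs individual zero-crossings of the same sign, and the sigma values and search range here are invented:

```python
import numpy as np
from scipy.ndimage import gaussian_laplace  # SciPy assumed available

def coarse_to_fine_disparity(left, right, sigmas=(8.0, 4.0, 2.0), search=3):
    total = 0
    for sigma in sigmas:
        left_sign = gaussian_laplace(left.astype(float), sigma=sigma) > 0
        right_sign = gaussian_laplace(right.astype(float), sigma=sigma) > 0
        right_sign = np.roll(right_sign, total, axis=1)  # disparity so far
        # Keep the residual shift whose sign map agrees best with the left.
        best = max(range(-search, search + 1),
                   key=lambda d: (left_sign ==
                                  np.roll(right_sign, d, axis=1)).mean())
        total += best
    return total
```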

A theoretical extension and computer implementation of our algorithm by W. Eric L. Grimson at the Artificial Intelligence Laboratory works quite well for a typical application of stereo systems: the analysis of aerial photographs (see Figure 5.1). In addition it mimics many of the properties of human depth perception. For example, it performs successfully when one of the stereo images is out of focus. Yet there may also be subtle differences. Recent work by John Mayhew and John P. Frisby at the University of Sheffield and by Julesz at Bell Laboratories should clarify the matter.

What can one say about biological stereopsis? The algorithms I have described are still far from solving the correspondence problem as effectively as our own brain can. Yet they do suggest how the problem is solved. Meanwhile investigations of the cerebral cortex of the cat and of the cerebral cortex of the macaque monkey have shown that certain cortical neurons signal binocular disparities. And quite recently Gian F. Poggio of the Johns Hopkins University School of Medicine has found cortical neurons that signal the correct binocular disparity in random-dot stereograms in which there are many false matches. His discovery, together with our computational analysis of stereopsis, promises to yield insight into the brain mechanisms underlying depth perception.


Figure 5.9 STEREOPSIS ALGORITHM reconstructs the three-dimensional visual world by seeking matches between dots on corresponding rows of a pair of random-dot stereograms. Two such rows are shown (black and white, top). The rows below are placed along the axes. Horizontal lines represent lines of sight for the right eye; vertical lines, lines of sight for the left. Color marks all intersections at which the eyes both see a black dot or a white dot. A given dot in one stereogram could in principle match any same-color dot in the other. Yet only some matches are correct (open colored circles), that is, only some reveal that a square of random-dot texture has a binocular disparity.

One message should emerge clearly: the extent to which the computer and the brain can be brought together for the study of problems such as vision. On the one hand the computer provides a powerful tool for testing computational theories and algorithms. In the process it guides the design of neurophysiological experiments: it suggests what one should look for in the brain. The impetus this will give brain research in the coming decades is likely to be great.

The benefit is not entirely in that direction; computer science also stands to gain. Some computer scientists have maintained that the brain provides only existence proofs, that is, a living demonstration that a given problem has a solution. They are mistaken. The brain can do more: it can show how to seek solutions. The brain is an information processor that has evolved over many millions of years to perform certain tasks superlatively well. If we regard it, with justified modesty, as an uncertain instrument, the reason is simply that we tend to be most conscious of the things it does least well, the recent things in evolutionary history, such as logic, mathematics and philosophy, and that we tend to be quite unconscious of its true powers, say in vision. It is in the latter domains that we have much to learn from the brain, and it is in these domains that we should judge our achievements in computer science and in robots. We may then begin to see what vast potential lies ahead.

Figure 5.10 ITERATIONS OF THE ALGORITHM (depicted schematically) solve the problem of stereopsis. The algorithm assigns a value of 1 to all intersections of lines of sight marked by a match, and of 0 to the others. Next it calculates a weighted sum for every intersection. Neighboring intersections with a value of 1 contribute positive weights to the sum. The eye sees only one surface along a given line of sight; hence intersections with a value of 1 along lines of sight contribute negative weights. If the result exceeds a threshold value, the intersection is set to 1; otherwise it is reset to 0. After a few iterations of the procedure the calculation is complete: the stereograms are decoded.