
Statistical correlations between two-dimensional images and three-dimensional structures in natural scenes

Brian Potetz and Tai Sing Lee

Department of Computer Science, Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213

Received October 3, 2002; revised manuscript received February 10, 2003; accepted February 10, 2003

In spite of the recent surge in the popularity of statistical approaches to vision, the joint statistics of coregistered range and light-intensity images have gone relatively unexplored. We investigate statistical correlations between images and the surface shapes that produced them. We determine which linear properties of range images can be best predicted from simple computations on intensity information, and we determine those properties of intensity images that best predict range information. We find that significant (up to r = 0.45) and potentially exploitable correlations exist between linear properties of range and intensity images, and we explore the structure of these correlations. © 2003 Optical Society of America

OCIS codes: 150.5670, 330.3790.

1. INTRODUCTION

The study of shape from shading has been a major branch of vision since the 1970s and has its roots even earlier, in the photometric investigations of the lunar surface performed in the 1920s [1]. Since then, the standard approach to the problem has involved modeling the behavior of light as it travels through space and interacts with surfaces and then attempting to invert the image formation process. Unfortunately, inverting this process is highly underconstrained, and various assumptions about image formation models and their parameters, such as Lambertian surface reflectance, uniform albedo, and shadow-free, single-point-source illumination, have to be made for this approach to work. However, these assumptions do not always hold in natural images, and this often leads to poor generalization for these algorithms.

There have been some notable exceptions to this trend. Shape-from-shading algorithms that make use of a direct association between luminance images and the three-dimensional (3D) models used to generate them were presented by Freeman et al. [2] and also by Lehky and Sejnowski [3]. In each of these cases the image statistics used were derived from computer-generated images, generated under the same assumptions made by traditional shape-from-shading algorithms. These approaches suggest that a deeper understanding of the joint statistics of natural images and their associated 3D structures might be important for the successful development of a statistical approach for 3D inference.

There has been considerable recent interest in the statistics of natural images, particularly in the context of image coding [4-7] and scale invariance [8,9], as well as in the joint statistics of neighboring pixels or wavelet responses [10,11]. However, there has been relatively little investigation into the joint statistics of range and luminance images. We know of only one study that explores these joint statistics. It has been observed that human perception of the length of a line segment on a blank background depends on the orientation of the line. In a recent study, Howe and Purves [12] examine range images and find that this bias closely matches the 3D length of these line segments when projected into range images. The authors go on to restrict their investigation to only those intervals that correspond to edges in the luminance image and show that the finding still holds. This result shows how investigation into the statistics of range and coregistered luminance images may help to explain how depth is computed in the brain.

There may be many factors that affect the statistics of natural images that cannot be inferred from simple physical models. For example, the statistical relationship between images and their surfaces will be affected by the natural statistics of illumination direction, a factor that is known to heavily influence human performance on shape-from-shading problems [13]. Other factors that influence this relationship may include the statistics of object size and shape in natural scenes as well as the natural statistics of the surface properties of those objects. This opens up the possibility that there may be simple, exploitable statistical relationships between real images and surface shapes that have gone overlooked. Discovering these relationships might further the development of vision algorithms that utilize shape-from-shading information. It may also provide insight into how the human visual system is so adept at solving these problems [14].

For this study we constructed a database of coregistered high-resolution two-dimensional (2D) color images and 3D range images using the most recent scanner technology. We then conducted a first attempt to explore the statistical correlations between images in these two domains. We searched for simple, local, linear correlations using linear regression, ridge regression, and canonical correlation. These techniques allowed us to extract some simple but interesting statistical trends between intensity and range images. We hope that this and future exploration of these data will yield a better understanding of the statistical distributions and correlations between 2D images and 3D structures and will provide a solid foundation for an image-based statistical approach to 3D inference.

2. METHODS

To investigate these statistical relationships, we first collected a database of coregistered intensity and range images (where by coregistered we mean that corresponding pixels of the two images correspond to the same point in space). We selected the Riegl LMS-Z360, the most sophisticated available long-range scanner, to construct our image database. The Z360 collects coregistered range and color data by using an integrated color photosensor and a time-of-flight laser scanner with a rotating mirror. The scanner has a maximum range of 200 m and a depth accuracy of 12 mm. However, for each scene in our database, multiple scans were averaged to obtain an accuracy of 6 mm. For the purposes of this study, red, green, and blue color values were combined into one gray-scale light-intensity value. The measurement of each pixel is known to be independent; no global image processing or automatic gain control is applied to the images. All range measurements are reported in meters. All scanning is performed in spherical coordinates.

Using this scanner, we collected scans of a wide variety of outdoor scenes. Example range and intensity image pairs appear in Fig. 1. All scenes in our database are outdoor shots, taken under sunny conditions during the week of June 17, 2002, in western Pennsylvania. The camera was kept level with the horizon and was positioned either on the ground or on a tripod, roughly 1 m off the ground. Over 100 scenes were taken at various resolutions. However, for this paper we selected a 50-scene subset of our database with spatial resolution of 22.5 ± 2.5 pixels per degree. This removes from our dataset images with very high or low spatial resolutions. Extremely dark images were also discarded. The contents of our images include scans of trees and wooded areas, rocky areas, building exteriors, and sculptures. Twenty-one images were of urban scenes and 29 were of rural scenes. Each image required minutes to scan; hence only stable and stationary scenes were taken. The average size of our images was 1000 × 604 pixels, for a total of 30,177,930 pixels.

The blue regions of the range image are areas where no range information is available. The surface could be out of range, such as sky, or too reflective to be measured by the scanner, such as water. All image patches that contain invalid pixels were excluded from our calculations.

As is common in the analysis of the natural statistics of images [6,15], we work with the logarithm of the light-intensity values rather than intensity itself. One advantage of this is that image contrast, rather than being a multiplicative factor, becomes an additive factor under log intensity. This means that linear filters respond to contrast rather than raw differences in amplitude, and therefore zero-sum linear filters are insensitive to the total contrast of each patch.

Fig. 1. Two sample image pairs from our database: one from the urban subset, one from the rural subset. The urban image is of Hammerschlag Hall, Carnegie Mellon University, Pittsburgh. The rural image is of trees and bushes in Schenley Park, Pittsburgh. Out-of-range regions are depicted using monochrome white noise.

We transform the range data in the same way, by applying a logarithmic transform, as was done in previous studies of pure range data. As described by Huang et al. [16], a large object and a small object of the same shape appear identical to the eye when the large object is positioned appropriately far away and the small object is close. However, the raw range measurements of the large, distant object will differ from those of the small object by a constant multiplicative factor. In the log range data, the two objects will differ by an additive constant. Therefore, a zero-sum linear filter will respond identically to the two objects.
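As a concrete illustration (not from the paper; the patch and filter values below are arbitrary stand-ins), the following sketch verifies numerically that a zero-sum linear filter applied to log range data responds identically to two objects that differ only by a multiplicative scale factor:

```python
import numpy as np

rng = np.random.default_rng(0)

patch = rng.uniform(1.0, 2.0, size=(5, 5))   # raw range patch, in meters
scaled = 3.0 * patch                          # same shape, three times farther

f = rng.normal(size=(5, 5))
f -= f.mean()                                 # zero-sum filter

# The log turns the scale factor into an additive constant log(3),
# which a zero-sum filter cannot see.
resp_near = np.sum(f * np.log(patch))
resp_far = np.sum(f * np.log(scaled))
assert np.isclose(resp_near, resp_far)
```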

In this study, we work with 25 × 25 patches of coregistered light intensity and range data. There are 15,577,472 patches in our database subset that contain no invalid (out-of-range) pixels; 8,872,641 of these patches were from the rural images, and 6,704,831 were from the urban images. We treat our data as 15.6 × 10⁶ observations of 1250 random variables: 625 luminance variables and 625 range variables. With a spatial resolution of 22.5 pixels per degree, each patch subtends a visual angle of roughly 1.1 deg.

3. RESULTS

A. Patch Averages

In the study of image statistics, it is often assumed that images are translationally invariant: the probability of seeing any image is equal to the probability of seeing the same image shifted by some distance and in some direction. However, we do not make this assumption, because some viewing angles might be more likely than others, and the range data might not be translationally invariant. Therefore we must compute the average values of each point of our 25 × 25 image patches independently. Let N be the number of patches in our database, and let L_P and R_P be the Pth log light-intensity patch (luminance) and log range patch, respectively. Then

$$\bar L(i, j) = \frac{1}{N} \sum_{P=1}^{N} L_P(i, j), \qquad (1)$$

$$\bar R(i, j) = \frac{1}{N} \sum_{P=1}^{N} R_P(i, j), \qquad (2)$$

where (i, j) indicates the position within the image patch and L̄(i, j) and R̄(i, j) denote the patch averages.
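A minimal sketch of Eqs. (1) and (2) might look as follows; the array name log_intensity_patches is a hypothetical stand-in for a stack of N coregistered 25 × 25 patches:

```python
import numpy as np

def patch_average(patches: np.ndarray) -> np.ndarray:
    """Eqs. (1)-(2): average over the patch index P, independently
    for every pixel position (i, j)."""
    return patches.mean(axis=0)

# Random stand-in data: N patches of 25 x 25 log intensities.
N = 1000
log_intensity_patches = np.random.rand(N, 25, 25)
L_bar = patch_average(log_intensity_patches)   # the 25 x 25 mean patch
```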

The mean log intensity and log range image patches are shown in Fig. 2. For explanatory purposes, we also show the mean non-log intensity and range patches (which were computed before the logarithm was applied). As can be seen in the figure, the average log intensity varies between 2.7863 and 2.7923 across the 25 × 25 patch, for a total range of 0.0060. This is a fairly narrow range for log intensity values, which have a standard deviation of 0.82. This suggests that the mean intensity of luminance images is roughly translationally invariant.

Fig. 2. Averages of all image patches.

The average range patch, on the other hand, appears to exhibit slant relative to the vertical plane. The average log range values vary between 2.8552 and 2.9244, for a total range of 0.0692. This is a much broader range for log range values, which have a standard deviation of approximately 0.91 log m. Therefore, the structure present in the average log range patch is considerably more significant than the structure present in the average log intensity patch.

To better understand the structure observed in the average range patch, it may be useful to temporarily return to nonlogarithmic range values. Again, the average range patch is very close to planar, with the angle of maximum ascent close to the vertical. This slant arises from the receding ground plane in most range images. Recalling that range values are in meters, this result shows that, on average, within our database, the distance from the observer to the closest object increases at a rate of 1.3 m per visual degree in the vertical direction. We found similar results when we analyzed the rural and urban subsets of the database separately, with little variation across the average luminance patch and an average incline of 1.5 m/deg in the rural images and 1.0 m/deg in the urban images.

Clearly, this result is highly sensitive to the choice of subject matter for each image and to bias in the orientation of the camera. We do not claim that our results generalize to the set of all real images. An image database of aerial photography, for example, is unlikely to exhibit a similar statistic. However, images in our database are typical of rural and urban outdoor scenes. We expect similar results to be observed in similar image databases.

B. Covariance

To investigate correlational relationships between luminance and range data, we must understand how the individual pixels in the image patches correlate with one another. We now find the covariance between each pair of pixels in our set of image patches. We begin by centering our data by subtracting the average patch from each patch:

$$\hat L_P = L_P - \bar L, \qquad (3)$$

$$\hat R_P = R_P - \bar R. \qquad (4)$$

Note that translational invariance is not assumed for any data; each of the 1250 variables is centered independently. Let L and R be the N × 25² matrices of all centered log intensity and log range patches, respectively. Each row of L and R is a vectorized image patch, such that

$$\mathbf{L}_{P,\,(25m+n)} = \hat L_P(n, m),$$

where L_P(0, 0) is the upper left of the image patch and L_P(24, 0) is the upper right of the patch. The 625 × 625 matrix L′L contains the covariance between each pair of pixels in the intensity patch. Likewise, R′R is the range covariance matrix, and L′R is the cross-covariance matrix. All of the results reported in the remainder of this paper can be derived from these matrices.
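In code, these matrices can be formed directly from the centered, vectorized patches. The following sketch uses random stand-in data (patches_L and patches_R are hypothetical names, not from the paper):

```python
import numpy as np

N, k = 1000, 25
patches_L = np.random.rand(N, k, k)   # stand-ins for the real patch stacks
patches_R = np.random.rand(N, k, k)

# Center each of the 625 variables independently (Eqs. (3)-(4)),
# then flatten each patch into one row.
L = (patches_L - patches_L.mean(axis=0)).reshape(N, k * k)
R = (patches_R - patches_R.mean(axis=0)).reshape(N, k * k)

LL = L.T @ L    # 625 x 625 intensity covariance (up to a factor of N)
RR = R.T @ R    # 625 x 625 range covariance
LR = L.T @ R    # 625 x 625 cross covariance

# One row of LR, normalized by the two variances, gives a correlation
# map like those plotted in Fig. 3:
i = 13 * k + 13
corr_row = LR[i] / np.sqrt(LL[i, i] * np.diag(RR))
```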

Fig. 3. (a) Correlation between intensity at pixel (13, 13) and all the pixels of the intensity patch. This represents one row of the covariance matrix L′L, normalized by variance. (b) Correlation between intensity at pixel (13, 13) and the pixels of the range patch. This is one row of the matrix L′R, normalized by variance. (c) Correlation between range at pixel (13, 13) and the pixels of the intensity patch. This is one column of the matrix L′R, normalized. (d) Correlation between range at pixel (13, 13) and the pixels of the range patch. This is one row of the matrix R′R, normalized. (e) Correlation between intensity and range at pixel (i, i). This represents the diagonal of the covariance matrix L′R, normalized by variance. (f) Same figure as (e), except measured over rural images only. (g) Same figure as (e), except measured over urban images only.

In the top row of Fig. 3 we show the correlation between a specific pixel and other pixels in the patch. The correlation is obtained by means of the equation

$$r = \mathrm{cor}[X, Y] = \frac{\mathrm{cov}[X, Y]}{\sqrt{\mathrm{var}[X]\,\mathrm{var}[Y]}}.$$

We can make several interesting observations from this correlation graph. First, we see that neighboring range pixels are much more highly correlated than neighboring luminance pixels. This suggests that the low-frequency components of range data contain much more power than those of luminance images and that the spatial Fourier spectra of range images drop off more quickly than those of luminance images, which are known to have roughly 1/f spatial Fourier amplitude spectra [8]. Because the measurement of each pixel is independent of the others, any regular distortion of the power spectrum of range images caused by the scanner must be caused by the divergence of the laser beam measuring the distance. For the Riegl LMS-Z360 scanner, the beam divergence is 0.11 deg, which is approximately 2 pixels in diameter on the images in our database (0.04 to 0.05 deg/pixel). Only the highest spatial frequencies could be influenced by this divergence.

This finding is reasonable because factors that cause high-frequency variation in range images, such as texture or occlusion contours, tend also to cause variation in the luminance image. However, much of the high-frequency variation found in luminance images, such as shadows and surface markings, is not observed in range images.

Second, we see that luminance and range values are negatively correlated, with correlations in the neighborhood of −0.18. Physics-based approaches to shape from shading consistently conclude that shape-from-shading inference offers only relative depth information. Our findings suggest that, in natural images, brighter pixels have a tendency to be closer to the observer. It has been observed as far back as Leonardo da Vinci that humans perceive brighter objects as closer [17]. Artists have made use of this fact to help create compelling illusions of depth [18,19]. Later, psychologists confirmed this fact in controlled experiments [20-22]. The result presented here is the first evidence that this relationship actually holds in nature.

In the psychology literature, this effect is known as relative brightness [23]. Several explanations have been offered as to why such a perceptual bias exists. One common explanation is that light coming from distant objects has a greater tendency to be absorbed by the atmosphere. Under certain weather conditions, such as smog, it is possible for the tendency of the atmosphere to absorb light to outweigh its tendency to scatter light toward the eyes. However, all of the images in our database were taken in clear and sunny conditions. Furthermore, the range of operation for the LMS-Z360 scanner is roughly 200 m, and most of our images were taken at much shorter range. Atmospheric effects are not considered significant until distances considerably greater than this [24].

Another explanation is that nearby objects reflect more light to our eyes [23]. For any given point on an object, the total amount of light that is reflected from that point and into our eye is inversely proportional to the square of the distance to that point. However, the area of the retina occupied by that object also decreases proportionally with the square of the distance to that object. Therefore the amount of light per unit area on the retina remains the same. Except under extreme conditions on a discrete sensor array, this causes perceived brightness to remain the same.

Other explanations are physiological or psychological in nature and thus cannot explain the trend observed in our database. For example, it has been suggested that the bias is a combination of the observations that bright objects appear larger (an effect known as irradiance) and that larger objects appear closer [20]. Another explanation has been that bright objects are seen more clearly and clear objects are seen as closer, owing to scattering of light in the atmosphere (aerial perspective) [22]. Other theories involve the presence of a third, background stimulus. Studies have shown that against gray or dark backgrounds, brighter stimuli appear closer; against a white background, darker stimuli appear closer [25]. Higher-order statistical analysis would be needed to search for a physical basis for trends involving the statistical relationship between three stimuli in our database.

As with the patch-average results, the correlation found in this database may be dependent on bias in the camera position or orientation. Owing to limitations of the camera, no object in our images is closer than 2 m away, and all of the scenes in our images were brightly lit. The results we obtain might be different if the camera were consistently placed in shaded, cluttered locations, looking out into brightly lit clearings. For example, scenes typical of predators may differ from those typical of prey and therefore may have different statistics.

On inspection of the images in our database, we suspected that the major cause of the correlation was shadow effects within the environment. For example, leafy foliage constitutes a significant portion of the database. Since the source of illumination comes from above, and outside any tree, the outermost leaves of a tree or bush are typically the most illuminated. Deeper into the tree, the foliage is more likely to be shadowed by neighboring leaves. On the other hand, urban environments contain more flat surfaces and fewer concavities or places of shadow. We expected the correlation to be much stronger for the rural scenes.

To test this hypothesis, we repeated the analysis on the rural and the urban image subsets independently. In the bottom row of Fig. 3 we plot the correlation between range and intensity pixels at each location in the image patch. As can be seen from the figure, the correlations for the rural dataset have increased to −0.32, while those for the urban dataset are considerably weaker, in the neighborhood of −0.06.

If the correlation found in the original dataset were due to atmospheric effects, we would expect the correlation to be equally strong in both subsets. The average depth in the urban database (32 m) was similar to that of the rural database (40 m), so atmospheric effects should be similar in both datasets. However, the correlation seems to come almost entirely from the rural scenes.


We believe that the true source of the correlations lies in the statistics of surfaces in natural images and the effects of shadows on these surfaces. In nature, surfaces often have concavities and interiors. Because the light source is typically positioned outside of these concavities, the interiors of these concavities tend to be in shadow and more dimly lit than the object's exterior. At the same time, these concavities will be farther away from the viewer than the object's exterior. Thus the correlation found in an image database should depend a great deal on the statistics of surface shape. In rural images, the leafy structure of foliage and the rocky texture of stone provide an abundance of concavities at any spatial scale. The smooth surfaces of building exteriors offer far fewer opportunities for self-shadowing. It is still possible that significant negative correlation may be observed in databases of indoor scenes. Crumpled fabrics and cluttered environments often contain shadowed concavities.

Langer and Zucker [26] observed that for continuous Lambertian surfaces of constant albedo, lit by a hemisphere of diffuse lighting and viewed from above, a tendency for brighter pixels to be closer to the observer can be predicted from the equations for rendering the scene. Intuitively, the reason for this is that under diffuse lighting conditions the brightest areas of a surface will be those that are the most exposed to the sky. When viewed from above, the peaks of the surface will be closer to the observer. Although these theoretical results have not been extended to more general environments, our results show that in natural scenes these tendencies remain, even when scenes are viewed from the side, under bright light from one direction. In spite of these differences, both phenomena seem related to the observation that concave areas are more likely to be in shadow.

Finally, we observe that there is structure in the covariance between luminance and range pixels. Not only is the luminance of a pixel correlated with distance at that same pixel, but the same luminance value is even more highly correlated with distance at pixels lower in the patch. This is the first indication of many from our database that bright surfaces tend to face upward. In the remainder of the paper, we present different ways of viewing and understanding these correlations and their structures.

C. Convex Range Filters

In the next three subsections, we will search for those properties of intensity images that correlate most strongly with convexity in range images. We begin by defining a linear filter that models convexity.

We model three types of convexity: vertical convexity, horizontal convexity, and isotropic (nondirectional) convexity. Vertical convexity is modeled as the second vertical derivative of a Gaussian:

$$G(x, y) = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\!\left(-\frac{x^2 + y^2}{\sigma^2}\right), \qquad (5)$$

$$F_V * R = \frac{\partial^2}{\partial y^2}\,[G(x, y) * R(x, y)] = \frac{\partial^2 G}{\partial y^2} * R, \qquad (6)$$

where R is the range image and σ is set to 6. The second derivative of a function detects the rate of change in the slope of a surface. Recalling that higher range values are more distant, we see that this linear filter will respond strongly for a surface whose tangent plane faces upward toward the top of the patch and downward toward the bottom. This defines vertical convexity. Note that, according to this definition of convexity, corners and points near occlusion boundaries on a foreground object are also considered convex.

Horizontal convexity is modeled similarly, as the second horizontal derivative of a Gaussian. Isotropic convexity is modeled as a Laplacian of Gaussian:

$$F_H * R = \frac{\partial^2 G}{\partial x^2} * R(x, y), \qquad (7)$$

$$F_I * R = \nabla^2[G(x, y) * R(x, y)], \qquad (8)$$

$$F_I = F_V + F_H. \qquad (9)$$

All three filters were normalized to form zero-mean unit vectors:

$$\sum_{i,j} F(i, j) = 0, \qquad (10)$$

$$\sum_{i,j} F(i, j)^2 = 1. \qquad (11)$$

These three filters are shown in Fig. 4(a).
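A sketch of how such filters might be constructed, using finite differences of a sampled Gaussian rather than the analytic derivatives (an assumption on our part; the paper does not specify the discretization):

```python
import numpy as np

k, sigma = 25, 6.0
y, x = np.mgrid[-(k // 2):k // 2 + 1, -(k // 2):k // 2 + 1]
G = np.exp(-(x**2 + y**2) / sigma**2)   # Eq. (5); the normalizing constant
                                        # is irrelevant after Eq. (11)

def normalize(F):
    """Zero mean and unit L2 norm, per Eqs. (10)-(11)."""
    F = F - F.mean()
    return F / np.sqrt(np.sum(F**2))

d2y = np.gradient(np.gradient(G, axis=0), axis=0)   # ~ d^2 G / dy^2
d2x = np.gradient(np.gradient(G, axis=1), axis=1)   # ~ d^2 G / dx^2

F_V = normalize(d2y)          # vertical convexity, Eq. (6)
F_H = normalize(d2x)          # horizontal convexity, Eq. (7)
F_I = normalize(d2y + d2x)    # isotropic: Laplacian of Gaussian, Eqs. (8)-(9)
```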

D. Weighted Averages

Convolving the range filter with the range data provides a measure of surface convexity at each point. We now ask, how do the pixels of a patch in the intensity image covary with this measure of surface convexity? To answer this question, we compute the covariance between the intensity image and the range filter's response. Note that this covariance is equal to the average of all intensity patches, weighted by filter response:

$$\mathrm{cov}[L_P,\, F\cdot R_P] = \mathbf{L}'\mathbf{R}F_R \qquad (12)$$

$$= \frac{1}{N}\sum_{P=1}^{N}\,(F\cdot R_P - E[F\cdot R_P])(L_P - E[L_P]) \qquad (13)$$

$$= \frac{1}{N}\sum_{P=1}^{N}\,(F\cdot R_P - F\cdot\bar R)(L_P - \bar L) = \frac{1}{N}\sum_{P=1}^{N}\,(F\cdot\hat R_P)\,\hat L_P = E[(F\cdot\hat R_P)\,\hat L_P], \qquad (14)$$

where F · R_P is the dot product of the linear filter F with the range patch R_P.
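Computationally, Eq. (14) means the weighted average never requires a second pass over the raw patches once the cross covariance L′R is available. A sketch, reusing hypothetical stand-in arrays like those in the earlier sketches:

```python
import numpy as np

N, k = 1000, 25
L = np.random.randn(N, k * k)     # centered log intensity patches (stand-in)
R = np.random.randn(N, k * k)     # centered log range patches (stand-in)
F_R = np.random.randn(k * k)      # a vectorized range filter, e.g. F_V

resp = R @ F_R                    # F . R_P for every patch
weighted_avg = (L.T @ resp) / N   # cov[L_P, F . R_P], a 625-vector

# Equivalent, using only the stored cross covariance: (L'R) F_R / N.
assert np.allclose(weighted_avg, (L.T @ R) @ F_R / N)
```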

These covariances (or weighted averages) are shown in Fig. 4(b). In Fig. 4(c) we show the correlation between the range filter response and each log intensity pixel. The vertical convexity response correlates positively with light intensity toward the top of the patch, and it correlates negatively with light intensity toward the bottom of the patch. This suggests that items that are vertically convex, such as spheres or horizontal cylinders, are more likely to be well lit toward the top of the convexity, whereas concave items are more likely to be bright toward the bottom and dark toward the top.

The weighted average for the horizontal convexity filter F_H shows positive correlation with light toward the upper central region of the patch, and this positive correlation extends downward and toward the sides of the patch. One possible explanation for the high correlation with intensity toward the top of the patch is that areas of positive Gaussian curvature, such as a sphere, are more common than areas of negative Gaussian curvature, such as the horn of a trumpet. Some studies have suggested that humans have a perceptual bias toward positive Gaussian curvature [27]. It is possible that this bias is matched in nature. If this were the case, horizontally convex regions would tend to be vertically convex as well and would therefore also tend to be brighter toward the top.

E. Linear Correlations

In the preceding subsection we observed that our linear models of convexity correlate with the intensity of light in that region. We now ask three questions: Can we use these correlations to predict the convexity response? What linear intensity filter yields a response that correlates most highly with the convexity response? How well do these filter responses correlate?

To answer these questions, we perform linear regression on the range filter response. The regression coefficient F_L is the linear filter that minimizes the sum squared error between the two filter responses:

$$F_L = \arg\min_{F_L} \sum_{P=1}^{N} (F_L\cdot\hat L_P - F_R\cdot\hat R_P)^2.$$

It can also be shown that the vector F_L maximizes cor[F_L · L_P, F_R · R_P], just as the vector L′RF_R (the weighted average) lies in the direction of the vector that maximizes cov[F_L · L_P, F_R · R_P].

Let F_R be the 25² × 1 column vector of range filter coefficients for a single linear range filter, and let F_L be the 25² × 1 column vector of regression coefficients. Then the regression coefficients are computed as follows [28]:

$$F_L = (\mathbf{L}'\mathbf{L})^{-1}\mathbf{L}'(\mathbf{R}F_R). \qquad (15)$$

Note that L′RF_R is simply the weighted average cov[L_P, F_R · R_P] in vector form and that L′L is simply the covariance matrix of the log intensity values.
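A sketch of Eq. (15), solving the normal equations directly rather than forming the matrix inverse (a standard numerical choice, not something the paper prescribes):

```python
import numpy as np

N, k = 1000, 25
L = np.random.randn(N, k * k)     # stand-ins, as before
R = np.random.randn(N, k * k)
F_R = np.random.randn(k * k)

# Eq. (15): solve (L'L) F_L = L'(R F_R) instead of inverting L'L.
F_L = np.linalg.solve(L.T @ L, L.T @ (R @ F_R))

# Correlation between the two filter responses (the r of Table 1).
a, b = L @ F_L, R @ F_R
r = np.corrcoef(a, b)[0, 1]
```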

The regression coefficients for our three convexity filters are shown in Fig. 5(b). We see that the results are highly noisy. To illustrate why, we perform singular-value decomposition on the symmetric matrix L′L = UDU′, so that U is an orthonormal matrix whose columns are eigenvectors of L′L and D is a diagonal matrix of eigenvalues. We then have

$$F_L = U D^{-1} U' \mathbf{L}'(\mathbf{R}F_R), \qquad (16)$$

$$U' F_L = D^{-1} U' \mathbf{L}'(\mathbf{R}F_R). \qquad (17)$$

This shows that the coefficients of the principal components of F_L are equal to the coefficients of the principal components of L′(RF_R) divided by the corresponding eigenvalues. This has the effect of amplifying the values of those components with the least variance in the image database, which are therefore the most prone to noise. In the case of natural images, the eigenvectors of lowest eigenvalue are the high-frequency components of the images, and the nth eigenvalue of L′L is roughly proportional to 1/n [8,15]. Thus the high-frequency components have considerably smaller eigenvalues, and so noise in the high-frequency bands of the weighted averages will be amplified considerably in the regression coefficients. In Fig. 6 we show the coefficients of the principal components of the weighted averages and the regression kernels for our three range filters.

Fig. 4. Convexity weighted averages. (a) Image and surface plots of the three convex range filters, (b) covariance between intensity patch pixels and range filter response, (c) correlation between intensity and range filter response.

We wish to compensate for the possibility that the large high-frequency coefficients that we observe in the regression kernels are due to noise. We use a standard technique known as ridge regression to simultaneously minimize both the sum squared error and the total length of the intensity filter vector F_L. Ridge regression reduces those coefficients that are the least useful in predicting the range filter response:

$$\tilde F_L = \arg\min_{F_L}\left[\sum_{P=1}^{N} (F_L\cdot\hat L_P - F_R\cdot\hat R_P)^2 + \lambda \sum_{i,j} F_L(i, j)^2\right],$$

where F̃_L is the ridge-regression optical filter and λ is a parameter of the regression. It can be shown [28] that solving this equation yields

$$\tilde F_L = (\mathbf{L}'\mathbf{L} + \lambda I)^{-1}\mathbf{L}'(\mathbf{R}F_R). \qquad (18)$$

For each range filter, we used cross-validation to choose the value of λ. For each of the 50 images in our database, we computed the covariance matrices L′L and L′(RF_R) from the subset consisting of the 49 remaining images. We then computed F̃_L for a range of values of λ and tested the sum squared error yielded by each F̃_L on the one remaining image. λ was chosen to be the value that yielded the smallest mean error over the 50 tests.
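The following sketch combines Eq. (18) with the leave-one-image-out procedure just described; the list `images` of per-scene centered patch-matrix pairs is a hypothetical stand-in:

```python
import numpy as np

def ridge_filter(LL, LRF, lam):
    """Eq. (18): (L'L + lam*I)^(-1) L'(R F_R), solved without inverting."""
    return np.linalg.solve(LL + lam * np.eye(LL.shape[0]), LRF)

def choose_lambda(images, F_R, lambdas):
    """Leave-one-image-out choice of lambda, as described in the text."""
    errors = {lam: [] for lam in lambdas}
    for h in range(len(images)):
        # Accumulate covariance statistics over the 49 retained images.
        LL = sum(L.T @ L for i, (L, R) in enumerate(images) if i != h)
        LRF = sum(L.T @ (R @ F_R) for i, (L, R) in enumerate(images) if i != h)
        L_h, R_h = images[h]                  # held-out test image
        for lam in lambdas:
            F_L = ridge_filter(LL, LRF, lam)
            errors[lam].append(np.sum((L_h @ F_L - R_h @ F_R) ** 2))
    return min(lambdas, key=lambda lam: np.mean(errors[lam]))
```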

The resulting ridge-regression kernels are shown in Fig. 5(c). We see that convexity in range images is best predicted by brighter luminance above the point of convexity, accompanied by darker luminance below. We also see that there is a remarkable resemblance between the regression kernel for isotropic convexity and a Lambertian sphere lit from above.

Fig. 5. Linear regression results. Correlation values can be found in Table 1. (a) Convex range filters, (b) ordinary least-squares regression kernels (intensity image filters), (c) ridge-regression kernels.

Fig. 6. Principal components coefficients. (a) Convex range filter, (b) principal components coefficients of weighted average, (c) principal components coefficients of ordinary least-squares regression kernels.

In addition to showing which properties of intensity images are most correlated with specific linear properties of range images, regression analysis also provides a measure of how strong these relationships are. The values of r for each filter are shown in Table 1. The correlational values for the convexity filters are significant, but they are not sufficient to accurately predict convexity from an intensity image in any practical application. However, it should be noted that the original log range image could be fully reconstructed, up to an additive constant, from the response to the isotropic convexity filter alone, by using deconvolution. This is because the convex range filters used in this experiment contain all spatial frequencies except the very lowest and highest. Therefore correlations near 1 would signify that nearly perfect reconstruction of the range image could be achieved from linear processing of the intensity image. Perfect reconstruction is a difficult goal, considering that the images in this database are complex natural images and contain surfaces that violate the assumptions described in Section 1. Even humans, who could not accurately estimate a complete range image from a natural image, must rely on several monocular and binocular cues to estimate depth information. Modern shape-from-shading techniques typically perform poorly on these images.

To estimate the total predictive power of these simple correlation-based techniques, we use regression to predict a single range pixel from the 25 × 25 intensity patch. The results are shown in Fig. 5 and Table 1. Predicting a single range pixel from each intensity patch provides a way to predict the original range image without introducing noise due to deconvolution. The correlation for this filter is 0.21. It can be shown from Eq. (15) that

$$\mathrm{cov}[F_L\cdot L_P,\, F_R\cdot R_P] = \mathrm{var}[F_L\cdot L_P], \qquad (19)$$

$$\mathrm{cor}[F_L\cdot L_P,\, F_R\cdot R_P]^2 = \frac{\mathrm{var}[F_L\cdot L_P]}{\mathrm{var}[F_R\cdot R_P]}. \qquad (20)$$

This means that 4% of the total variation in range images can be explained by the intensity image by using simple second-order statistical techniques.

The above calculation estimates how well the intensity image predicts the complete range image and all of its properties. However, we expect that some properties of range images are more easily predicted than others. In the next subsection, we examine those properties of range images that can be best predicted from the luminance image.

F. Canonical Correlation

We have now seen that the responses to convex linear range filters correlate with the responses of certain intensity filters. Our approach so far has been to choose specific range filters and investigate their relationship with coregistered light intensity. In this subsection we extract the filters, or features, in both domains that are most correlated with or predictive of one another. What we would now like to know is, what linear properties of range images are most easily predicted by the intensity image? Is F_V the range filter whose regression will yield the highest correlation between the filter responses, or are there other range filters that are more correlated with the intensity data?

To answer this question, we use a technique known as canonical correlation. Canonical correlation finds the pair of vectors F_L^1 and F_R^1 that maximizes the correlation cor[F_L^1 · L_P, F_R^1 · R_P]. It then finds vectors F_L^2 and F_R^2 that maximize correlation subject to the constraints that cor[F_L^2 · L_P, F_L^1 · L_P] = 0 and cor[F_R^2 · R_P, F_R^1 · R_P] = 0. The process repeats until the vectors span the space of image and range patches. It can be shown that F_L^n and F_R^n are proportional to the nth eigenvectors of the matrices (L′L)⁻¹(L′R)(R′R)⁻¹(R′L) and (R′R)⁻¹(R′L)(L′L)⁻¹(L′R), respectively [29]. Because the correlation we seek to maximize is independent of the magnitude of the vectors F_L^n and F_R^n, we can constrain each filter to have unit variance (var[F_L^n · L_P] = 1) without affecting the value of the correlations. Let F_L and F_R denote the 25² × 25² matrices of these eigenvectors. Then not only do F_L and F_R whiten the space of intensity and range patches, but cor[F_L^n · L_P, F_R^m · R_P] ≠ 0 implies that m = n.

Once again, the resulting linear filters contain a great deal of high-frequency noise. By combining ridge regression with canonical correlation, we obtain a much cleaner result. This technique is known as canonical ridge [30,31], and it is equivalent to finding eigenvectors of the matrices (L′L + λ_L I)⁻¹(L′R)(R′R + λ_R I)⁻¹(R′L) and (R′R + λ_R I)⁻¹(R′L)(L′L + λ_L I)⁻¹(L′R). The results are shown in Fig. 7.

Table 1. Linear Regression Results for the Three Linear Convexity Filters

  Range Filter    λ       r        r̃        r̃_R      r̃_U
  F_V             0.3     0.1358   0.1353   0.1760   0.1026
  F_H             0.7     0.0447   0.0436   0.0744   0.0542
  F_I             0.5     0.1053   0.1046   0.1409   0.0862
  F_P            10.0     0.2147   0.2142   0.3784   0.0769

Note: r is the correlation between the intensity filter response and the range filter response under ordinary least-squares regression (cor[F_L · L_P, F_R · R_P]); r̃ is the correlation after ridge regression; r̃_R and r̃_U are the corresponding correlations in the rural and urban datasets, respectively. The fourth filter, F_P, is the single-pixel response filter shown in row four of Fig. 5.

First, it is not surprising that the resulting filters are all highly localized in Fourier space. If the range filters in Fig. 7 had power in all spatial frequencies, then deconvolution would allow us to reconstruct the complete range image from prediction on the basis of a single intensity filter. We have already seen that the ability to reconstruct the full range image from correlational statistics is limited.

The next point to observe in these results is that vertical convexity is not the property of range images that is most easily predicted from optical images. The pair of vectors that exhibits the greatest correlation is the dc component of images, which responds strongly to bright image patches, paired with a planar range filter that responds strongly to surfaces with gradients pointing upward and away from the observer. This tells us that the correlation between surfaces facing upward and bright image patches is the highest correlation among all pairs of linear intensity and range filters. The correlation is 0.35. This means that 0.28², or 8%, of all variation in the overall brightness of an image patch can be explained by the vertical component of its orientation, and vice versa. In other words, just knowing the overall intensity of an image patch reduces the uncertainty of the vertical slope of that patch by 8%.

The next two canonical variates show vertical convexity. Both of them show that convex surfaces are correlated with bright intensity values that fall somewhere above the area of the surface that is closest to the observer. This trend continues in basis-vector pairs of lower correlation. In general, many of the first several intensity basis vectors appear similar to how the corresponding range filter would appear if it were rendered as a surface and lit from above.

Another important observation is that the basis vectors of lower frequency appear earlier within the bases and that vertical frequencies are favored. The canonical correlation bases are independent of the variance of each component, so this result is not related to the power spectrum of intensity or range images. It means that lower-frequency components of the range image are more easily predicted by use of linear correlations with intensity data.

In Figs. 7(b) and 7(c) we show the canonical filters for rural and urban scenes, respectively. Although the rural and urban datasets are disjoint, the first several canonical filters are qualitatively similar between the two datasets. However, the correlations are much stronger in the rural dataset. In fact, throughout this paper, correlations in the rural images have been stronger than those in the urban images. It is worth noting, though, that the relative strengths of the correlations are not the same for all properties of range images. The discrepancy between the correlations among individual points (bright pixels tend to be closer) is quite high: 0.32 versus 0.06. The discrepancy between the correlations for the first canonical filter is smaller, 0.46 versus 0.15, and the discrepancy between the correlations for the second canonical filter is smaller still, 0.23 versus 0.12. There is a similar discrepancy between the correlations for the convexity filters. This suggests that the trends measured by these correlations (that surfaces tilted upward are brighter and that convex objects are brightest directly above the convexity) are more robust in natural images and less dependent on locale.

Finally, it should be pointed out that the correlation coefficient for the strongest canonical filter measures 0.46 in rural scenes. This means that 21% of all variation in the overall brightness of an image patch in rural images can be explained by the tilt of its underlying surface. This is the strongest statistical trend found in this paper.

Fig. 7. Canonical correlation filters. (a) First eight canonical ridge filters for the entire (50-image) database. Below each filter is the dc component strength, normalized on a scale from −1 to +1. Below each filter pair is listed the correlation strength r. (b) First four canonical filters for the rural database. (c) First three canonical filters for the urban database.

4. DISCUSSION

In this paper we have shown that the relationship between the shape of objects and their images depends on statistical trends that cannot be inferred from physical models of light. We have found that this relationship is intimately related to the statistics of lighting directions, the statistics of camera or head orientation, and the statistics of surface shapes in natural scenes. These are sources of information that have not been seriously exploited by traditional shape-from-shading techniques, which instead typically rely on handcrafted environmental priors, such as smoothness and curvature constraints, that have not been verified in natural images. Some of the statistical trends shown here have been suspected by psychologists and artists in the past. In this paper we show the extent to which they hold true in the natural environment, and we explore the structure and causes of these trends. This and future investigations like it may help us to understand how these statistical trends might be taken advantage of by computers and by the human brain.

Psychophysical experiments show that humans are able to compute some shape information from shading in a parallel manner [32]. This suggests that the computation is both local and fast. Neurophysiological experiments on awake monkeys have also implicated the early visual cortex in the mediation of shape-from-shading pop-out perception [33]. Simple, local statistical relationships between shape and shading like those found in this paper may be exploited by the brain to achieve this level of performance. For example, the image in Fig. 8 gives us an instantaneous perception of convexity and concavity for each stimulus in the array. This may be related to the relatively high degree of correlation between shape convexity and the vertical intensity gradient.

Factoring spaces of high dimensionality is highly useful for performing computations in that space. In this study we used regression and correlation techniques to discover linear bases in the range domain and the luminance domain that are correlated with each other. This image basis provides a factorization of image space that is based on its utility in predicting 3D surface characteristics. Principal components analysis and independent components analysis are useful in image coding and compression. However, these techniques are limited in that they rely only on the statistics of images themselves and do not take into account the relevance of particular properties of images to the computational task at hand. For the solution of any given visual problem, the most important characteristics of an image are not always the ones that vary the greatest, such as the first principal components. In fact, rare events are often the most salient and meaningful. The characteristics of images that are most useful for solving a particular problem depend on the nature of the problem. The codes that we have obtained might be more relevant to the task of inferring 3D structure from 2D information.

The results in this paper are only the first steps in an investigation into the statistical relationship between shape and shading. We have explored only the simplest relationships between intensity and range images. The methods used in this paper make two strong linearity assumptions. First, the only properties of images considered were their responses to linear filters. All previous literature in shape from shading depends on highly nonlinear properties of images, and it is possible that stronger results could be achieved by considering such properties. The second assumption of linearity is in the measure of correlation between range and intensity image properties. Linear regression finds only those properties of range and optical images that are related in a linear fashion. Other measures of relation, such as mutual information, need to be explored. Also, Fig. 3(b) suggests that the correlations measured in this paper extend to spatial distances beyond 25 pixels at 20 pixels/deg. The exact extent of this correlation remains unknown.

ACKNOWLEDGMENTS

We thank Mark Albert, Frankie Li, Kui Shen, and Scott Marmer, who participated in the data collection. We thank Ann Lee and David Mumford for providing some range data for our initial exploration of the range domain. We also thank our reviewers for their insightful comments. Brian Potetz was supported by a National Science Foundation Integrative Graduate Education and Research Traineeships training grant. Tai Sing Lee was supported by National Science Foundation CAREER grant 9984706.

Correspondence may be sent to Brian Potetz at [email protected].

Fig. 8. Convex and concave stimuli defined by shading.

REFERENCES
1. B. K. P. Horn, "Obtaining shape from shading information," in The Psychology of Computer Vision, P. H. Winston, ed. (McGraw-Hill, New York, 1975), pp. 115–155.
2. W. T. Freeman, E. C. Pasztor, and O. T. Carmicheal, "Learning low-level vision," Int. J. Comput. Vision 40, 24–57 (2000).
3. S. R. Lehky and T. J. Sejnowski, "Network model for shape-from-shading: neural function arises from both receptive and projective fields," Nature 333, 452–454 (1988).
4. B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: a strategy employed by V1?" Vision Res. 37, 3311–3325 (1997).
5. M. S. Lewicki and B. A. Olshausen, "Probabilistic framework for the adaptation and comparison of image codes," J. Opt. Soc. Am. A 16, 1587–1601 (1999).
6. J. H. van Hateren and A. van der Schaaf, "Independent component filters of natural images compared with simple cells in primary visual cortex," Proc. R. Soc. London Ser. B 265, 359–366 (1998).
7. P. O. Hoyer and A. Hyvärinen, "Independent component analysis applied to feature extraction from colour and stereo images," Network Comput. Neural Syst. 11, 191–210 (2000).
8. D. L. Ruderman and W. Bialek, "Statistics of natural images: scaling in the woods," Phys. Rev. Lett. 73, 814–817 (1994).
9. D. J. Field, "Scale-invariance and self-similar 'wavelet' transforms: an analysis of natural scenes and mammalian visual systems," in Wavelets, Fractals, and Fourier Transforms, M. Farge, J. C. R. Hunt, and J. C. Vassilicos, eds. (Clarendon, Oxford, 1993), pp. 151–193.
10. J. Huang and D. Mumford, "Statistics of natural images and models," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Computer Society Press, Santa Ana, Calif., 2000), pp. 541–547.
11. E. P. Simoncelli, "Modeling the joint statistics of images in the wavelet domain," in Wavelet Applications in Signal and Image Processing VII, M. A. Unser, A. Aldroubi, and A. F. Laine, eds., Proc. SPIE 3813, 188–195 (1999).
12. C. Q. Howe and D. Purves, "Range image statistics can explain the anomalous perception of length," Proc. Natl. Acad. Sci. U.S.A. 99, 13184–13188 (2002).
13. V. S. Ramachandran, "Perception of shape from shading," Nature 331, 163–166 (1988).
14. J. Sun and P. Perona, "Preattentive perception of elementary three dimensional shapes," Vision Res. 36, 2515–2529 (1996).
15. D. J. Field, "What is the goal of sensory coding?" Neural Comput. 6, 559–601 (1994).
16. J. Huang, A. B. Lee, and D. Mumford, "Statistics of range images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Computer Society Press, Santa Ana, Calif., 2000), pp. 324–331.
17. E. MacCurdy, ed., The Notebooks of Leonardo da Vinci, Volume II (Reynal & Hitchcock, New York, 1938), p. 332.
18. C. Wallschlaeger and C. Busic-Snyder, Basic Visual Concepts and Principles for Artists, Architects, and Designers (McGraw Hill, Boston, Mass., 1992).
19. T. M. Sheffield, D. Meyer, J. Lees, B. Payne, E. L. Harvey, M. J. Zeitlin, and G. Kahle, "Geovolume visualization interpretation: a lexicon of basic techniques," Leading Edge 19, 518–525 (2000).
20. J. Coules, "Effect of photometric brightness on judgments of distance," J. Exp. Psychol. 50, 19–25 (1955).
21. H. Egusa, "Effects of brightness, hue, and saturation on perceived depth between adjacent regions in the visual field," Perception 12, 167–175 (1983).
22. I. L. Taylor and F. C. Sumner, "Actual brightness and distance of individual colors when their apparent distance is held constant," J. Psychol. 19, 79–85 (1945).
23. D. G. Myers, Psychology (Worth, New York, 1995).
24. J. E. Cutting and P. M. Vishton, "Perceiving layout and knowing distances: the integration, relative potency, and contextual use of different information about depth," in Handbook of Perception and Cognition, Vol. 5: Perception of Space and Motion, W. Epstein and S. Rogers, eds. (Academic, San Diego, Calif., 1995), pp. 69–117.
25. M. Farne, "Brightness as an indicator to distance: relative brightness per se or contrast with the background?" Perception 6, 287–293 (1977).
26. M. S. Langer and S. W. Zucker, "Shape-from-shading on a cloudy day," J. Opt. Soc. Am. A 11, 467–478 (1994).
27. P. Mamassian and M. S. Landy, "Observer biases in the 3D interpretation of line drawings," Vision Res. 38, 2817–2832 (1998).
28. T. P. Ryan, Modern Regression Methods (Wiley-Interscience, New York, 1997).
29. A. Basilevsky, Statistical Factor Analysis and Related Methods (Wiley-Interscience, New York, 1994).
30. H. D. Vinod, "Canonical ridge and econometrics of joint production," J. Econometrics 4, 147–166 (1976).
31. K. V. Mardia, J. T. Kent, and J. M. Bibby, Multivariate Analysis (Academic, London, 1979).
32. J. Braun, "Shape-from-shading is independent of visual attention and may be a 'texton'," Spatial Vision 7, 311–322 (1993).
33. T. S. Lee, C. Yang, R. D. Romero, and D. Mumford, "Neural activity in early visual cortex reflects behavioral experience and higher order perceptual saliency," Nat. Neurosci. 5, 589–597 (2002).
