

Designing Deep Networks for Surface Normal Estimation

Xiaolong Wang, David F. Fouhey, Abhinav Gupta
Robotics Institute, Carnegie Mellon University

Figure 1: Given a single image, our algorithm estimates the surface normal at each pixel. Notice how our algorithm not only estimates the coarse structure but also captures fine local details. For example, on the left, the normals of the couch arm and side table legs are estimated accurately (see zoomed version). On the right, the chair surface and legs and even the top of the shopping bags are captured correctly. Normal legend: blue → X; green → Y; red → Z.

Abstract

In the past few years, convolutional neural nets (CNNs) have shown incredible promise for learning visual representations. In this paper, we use CNNs for the task of predicting surface normals from a single image. But what is the right architecture? We propose to build upon the decades of hard work in 3D scene understanding to design a new CNN architecture for the task of surface normal estimation. We show that incorporating several constraints (man-made, Manhattan world) and meaningful intermediate representations (room layout, edge labels) in the architecture leads to state-of-the-art performance on surface normal estimation. We also show that our network is quite robust, achieving state-of-the-art results on other datasets as well without any fine-tuning.

1. Introduction

The last two years in computer vision have generated a lot of excitement: deep convolutional neural networks (CNNs) have broken the barriers of performance on tasks ranging from scene classification to object detection and fine-grained categorization. For instance, on object detection, performance on the standard dataset has gone up from a mAP of 33.7 to 58.5 in just two years. While CNNs have shown tremendous success on semantic tasks such as detection and categorization, their performance on other vision tasks such as 3D scene understanding and establishing correspondence has not been as extensively studied.

We want to explore the effectiveness of CNNs on the task of predicting surface orientation, or normals, from a single image. One could treat this as a per-pixel regression task and directly apply a CNN, for instance as was done for depth prediction in [8]. However, decades of research have shown that the output space of this task is governed by powerful physical constraints, and researchers have exploited these constraints from the very beginning of computer vision [27] through the line-labeling era [17, 2, 19] all the way to recent investigations [14, 24, 12, 30, 40, 10].

In this paper, we demonstrate how to incorporate insights about 3D representation and reasoning into a deep learning framework for surface normal prediction. While deep networks have been particularly successful for learning image representations, we believe their design can benefit from past research in 3D scene understanding. We achieve this by developing CNNs that operate locally in a window as well as globally on the whole image; these predict not just surface normals, but also edges and cuboid room layout. A final CNN considers these predictions as well as evidence from vanishing points to yield a final prediction. Our method obtains state-of-the-art performance in surface normal estimation and shows a substantial improvement over a standard feed-forward architecture. Additionally, we show that our physical constraints account for 4.6 percentage points of our performance on the most strict evaluation metric. More importantly, our networks provide a deeper understanding of the scene in terms of not just surface normals, but also room layout and edge labels (Figure 2).

2. Related Work

The topic of 3D understanding goes back to the beginning of computer vision, starting from the first thesis, Roberts' Blocks World [27]. At the heart of this problem are two related questions: (1) What are the right primitives for understanding? and (2) Given the local evidence, how can one obtain a global 3D scene understanding?

Figure 2: An overview of our approach to predicting surface normals of a scene from a single image. We separately learn global and local processes and use a fusion network to fuse the contradictory beliefs into a final interpretation. Global processes: our network predicts a coarse 20 × 20 structure and a vanishing-point-aligned box layout from a set of discrete classes. Local processes: our network predicts a structured local patch from a part of the image and line-labeling classes: convex (blue), concave (green), and occlusion (red). Fusion process: our network fuses the outputs of the two input networks, the coarse normals rectified with vanishing points (VP), and the image to produce substantially better results.

The problem of discovering primitives goes back to the early days of computer vision. The first primitives proposed took the form of lines [27, 36] and volumetric primitives such as geons [1], but these turned out to be too difficult to detect in natural images. Recent work has focused on using edges [25], superpixels [28] or segments [15] as primitives for reasoning. Most recently, [9] instead argued that data should determine the primitive instead of human intuition, and introduced a structured patch-based primitive; similarly, [22] formulated the problem as per-pixel, using segments only as the data dictated. Unfortunately, while all of these local primitives work well on things like blinds, cabinets, and tile floors, they tend to be stymied by local ambiguities at less-textured regions.

In order to resolve ambiguities, most work turns to some form of reasoning to do top-down prediction. Most recent work is based on higher-order volumetric representations [14, 24, 30] (e.g., the room should be an inside-out box), reasoning over volumes [24, 29] (e.g., two volumes should not intersect with each other), or edges [10] (e.g., via detected convex edges or occlusion boundaries). Typically, this representation is obtained via optimization over a domain-specific model and helps smooth predictions and resolve ambiguous areas, such as blank walls.

In this work, we address both threads. Instead of using primitives built on manually designed features such as HOG [4], we use the data to derive a representation right from the pixels: inspired by the recent success of CNNs [23, 21] on the tasks of object detection [11, 31], segmentation [38], depth estimation [8], pose estimation [34], etc., we propose to adapt CNNs to learn representations and primitives for 3D scene understanding. Similarly, instead of hand-designing an optimizable model to reason about ambiguities, we learn a CNN to arbitrate between conflicting evidence. However, rather than abandon the insights learned in past work, we incorporate them into our design. In particular, we take into account the importance of:

Fusing global and local. We build local and global networks that handle these two forms of evidence. We combine their predictions with a fusion network that greatly outperforms either alone. Our fusion network can be viewed as a form of learned reasoning that replaces previous optimization-based attempts to reconcile evidence from conflicting sources [12, 24, 29, 10].

Human-centric constraints. Past work has shown that the man-made nature of indoor scenes provides powerful constraints. For instance, it is common to assume three orthogonal directions in the scene (the Manhattan-world assumption [3]), as in [14, 25, 24, 30, 29, 10, 39]. Similarly, it is common to assume that the scene is an inside-out box [14, 24, 30, 29]. Inspired by these approaches, our global network predicts a box layout in addition to coarse geometry, and we provide the vanishing points to our fusion network. This enables the fusion network to softly apply the Manhattan or box constraint as the data dictates (e.g., on walls and floors but not chairs), and our results show that this leads to improved predictions.

Local structure. Another theme that has emerged both in the past [33, 17, 2, 19] and recently [16, 20, 10] is the reasoning between surface normals and the edges in the image. Inspired by these local constraints, we incorporate them in the learning of the local network and as an input to the fusion network.

Figure 3: The architecture of our global network. Given a 55 × 55 image as input, it is passed through 4 convolutional layers. On top of the last convolutional layer, the neurons are fully connected to two separate outputs: (i) global scene surface normals and (ii) room layouts.

We demonstrate that the inclusion of predicted convex, concave and occlusion edges improves performance over a simple feed-forward network.

A preliminary version of this work appeared on arXiv [37]. At the same time, [7] introduced a stacked CNN model for surface normal estimation. Our contributions are complementary to [7], and combining both should provide further improvement.

3. Overview

This paper aims to combine the knowledge gleaned over the past decade in single-image 3D prediction with the representation-learning power of convolutional neural networks. Our overall objective is to frame the single-image 3D problem so that the structure we know is captured and convolutional networks can do what they do best: learn strong mappings from visual data to labels.

Following the lessons described above, we build a network with the following architecture (illustrated in Figure 2). We start with two networks: a global network that takes the whole image as input and predicts a coarse global interpretation (Section 4.2), and a local network that acts on local patches in a sliding-window fashion and maps them to local orientation (Section 4.3). Because the global and local processes have complementary errors, we combine their output with a fusion network that consolidates their predictions (Section 4.5). Each input network obtains strong performance by itself, but by combining them we obtain substantially better results, both quantitatively and qualitatively.

In addition to performing global/local fusion, we inject global human-centric constraints (including room layout and vanishing points) and local surface/edge constraints into the framework by introducing additional tasks. Our global network also predicts the room layout, and our local network also predicts an edge label. Integrating these extra tasks leads to a more robust final network. We evaluate our approach in Section 5 and analyze which aspects of our design give which types of performance increases.

4. Method

We now describe each of the components of our method. For each, we describe its inputs, outputs, intermediate layers, and the loss function it minimizes.

4.1. Output: Regression as Classification

The outputs of the global and local networks are: a surface normal for each pixel, a room layout, and edge labels. The edge label (convex, concave, occluding, no-edge) is a discrete output space and can be formulated as a classification problem. However, both surface normals and room layout are structured continuous output spaces (note that a surface normal lies on the unit sphere). Following past work such as [22], we reduce these problems to classification problems.

Surface Normal: We use the surface normal triangular coding technique from Ladicky et al. [22] to turn normal regression into a classification problem. Specifically, we first learn a codebook with k-means and construct a Delaunay triangulation cover over the codewords. Given this codebook and triangulation, a normal can be rewritten as a weighted combination of the codewords of the triangle in which it lies. At training time, we learn a softmax classifier over the codewords. At test time, we predict a distribution over codewords; this is turned into a normal by finding the triangle in the triangulation with maximum total probability, and using the relative probabilities within that triangle as weights for reconstructing the normal.

Room Layout: Room layout is a continuous structured output space. We reformulate the problem as classification by learning a codebook over box layouts. The codewords are learned with k-medoids clustering over 6000 room layouts; each codeword is a category for classification.
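To make the decoding step concrete, the following is a minimal NumPy sketch of turning a predicted codeword distribution back into a unit normal. The codebook and triangulation are assumed to be given as arrays, and all variable and function names are ours, not the paper's.

```python
import numpy as np

def decode_normals(probs, codewords, triangles):
    """Decode per-pixel softmax outputs over codewords back to unit normals.

    probs:     (H, W, K) softmax output over the K codeword classes.
    codewords: (K, 3) unit normals learned with k-means (assumed given).
    triangles: (T, 3) integer codeword indices forming the triangulation
               cover described in Section 4.1 (assumed given).
    """
    H, W, K = probs.shape
    p = probs.reshape(-1, K)                        # (N, K), N = H*W
    tri_probs = p[:, triangles].sum(axis=2)         # (N, T): total probability per triangle
    best = tri_probs.argmax(axis=1)                 # triangle with maximum total probability
    idx = triangles[best]                           # (N, 3) codeword indices of that triangle
    w = np.take_along_axis(p, idx, axis=1)          # (N, 3) relative probabilities as weights
    w = w / np.maximum(w.sum(axis=1, keepdims=True), 1e-8)
    normals = (w[:, :, None] * codewords[idx]).sum(axis=1)   # weighted combination
    normals /= np.maximum(np.linalg.norm(normals, axis=1, keepdims=True), 1e-8)
    return normals.reshape(H, W, 3)
```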

4.2. Global Network

The goal of this network is to capture the coarse structure of the scene, enabling the interpretation of ambiguous portions of the image which cannot be decoded by local evidence alone.

Input: The whole image rescaled to 55 × 55 × 3.

Output: Given the whole image as input, we produce two complementary global interpretations: (i) a structural estimation of surface normals for the image, and (ii) a cuboidal approximation of the image as introduced by [14] and used in [24, 30], among others.

Figure 4: Our local network. (a) Given a full image, we perform sliding window on it and extract 55 × 55 image content as input. By forward propagation, the network produces the local surface normals and convex/concave/occlusion edge labels for the middle 13 × 13 image patch. (b) After sliding window, we obtain the surface normals and edge labels for the whole image. We plot the edge labels on the output of Structured Edges [6]. The colors blue, green and red represent the convex, concave and occlusion edge labels, respectively.

For surface normal estimation, the output layer is Mt × Mt × Kt, where Mt × Mt is the size of the output map for surface normals and Kt is the number of classes in the codebook. For room layout, we use simple classification over 300 categories. We use Mt = 20, Kt = 20.

Architecture: The global network includes four convolutional layers; these layers are shared by the two tasks (surface normal and room layout estimation). The outputs of the neurons in the fourth convolutional layer are then fully connected to the two labels. To simplify the description, we denote a convolutional layer as C(k, s), which indicates that there are k kernels, each of size s × s. During convolution, we set all strides to 1. We also denote the local response normalization layer as LRN and the max-pooling layer as MP. The stride for pooling is 2 and the pooling operator size is 3 × 3. The network architecture for the convolutional layers can then be described as: C(64, 5) → MP → LRN → C(192, 3) → MP → LRN → C(384, 3) → C(256, 3). For surface normal estimation, neurons in the fourth convolutional layer are fully connected to the output space of Mt × Mt × Kt, which is 20 × 20 × 20 = 8000. For room layout estimation, we connect the same set of neurons to the Kl = 300 labels. The architecture of the network is shown in Figure 3.

Loss function: We treat both tasks as classification problems. For room layout classification, we simply employ softmax regression to define the loss.
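As a concrete reference, the following PyTorch sketch mirrors this convolutional stack with the two fully connected heads. The padding choices (none on the first convolution, 1 on the later ones), the ReLU activations, and the exact LRN parameters are our assumptions, chosen so the feature map sizes match the 25 × 25 and 12 × 12 maps in Figure 3; this is a sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class GlobalNetwork(nn.Module):
    """Sketch of the global network of Figure 3 (assumptions noted above)."""
    def __init__(self, Mt=20, Kt=20, Kl=300):
        super().__init__()
        # C(64,5) -> MP -> LRN -> C(192,3) -> MP -> LRN -> C(384,3) -> C(256,3)
        # convolutions with stride 1; pooling 3x3 with stride 2.
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 5), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2), nn.LocalResponseNorm(5),
            nn.Conv2d(64, 192, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2), nn.LocalResponseNorm(5),
            nn.Conv2d(192, 384, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        feat_dim = 256 * 12 * 12            # conv-stack output size for a 55x55 input
        self.normal_head = nn.Linear(feat_dim, Mt * Mt * Kt)   # 20*20*20 = 8000 logits
        self.layout_head = nn.Linear(feat_dim, Kl)             # 300 layout classes

    def forward(self, x):                    # x: (B, 3, 55, 55)
        f = self.features(x).flatten(1)
        return self.normal_head(f), self.layout_head(f)
```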

For surface normal estimation, we denote by Fi(I) the Kt-class classification output for the ith pixel of the surface normal output map. We also apply softmax regression to optimize Fi(I). The loss for the structured surface normal output can then be written as

L(I, Y) = -\sum_{i=1}^{M \times M} \sum_{k=1}^{K} \mathbf{1}(y_i = k) \, \log F_{i,k}(I),   (1)

where F_{i,k}(I) represents the probability that the ith pixel should have the normal defined by the kth codeword, \mathbf{1}(y_i = k) is the indicator function, Y = {y_i} are the ground-truth labels for the surface normals, M = Mt and K = Kt.
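Eq. 1 is an ordinary per-pixel negative log-likelihood over codeword classes. A minimal PyTorch sketch, with tensor shapes and names that are our own:

```python
import torch
import torch.nn.functional as F

def normal_classification_loss(logits, labels):
    """Eq. 1 as a per-pixel softmax classification loss (a sketch).

    logits: (B, M*M, K) raw scores F_i(I) for each of the M x M output pixels.
    labels: (B, M*M) ground-truth codeword index y_i per pixel.
    """
    log_probs = F.log_softmax(logits, dim=2)                     # log F_{i,k}(I)
    nll = -log_probs.gather(2, labels.unsqueeze(2)).squeeze(2)   # -log F_{i,y_i}(I)
    return nll.sum(dim=1).mean()    # sum over pixels as in Eq. 1, averaged over the batch
```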

During training, we learn the network with these two losses simultaneously. As we have structured outputs for surface normals and only one prediction for the room layout, we need to balance the learning rates of the two losses. If λ denotes the learning rate for surface normal estimation, we set the learning rate for layout estimation to 50λ.
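One way to realize this balancing is through per-parameter-group learning rates; the sketch below is our interpretation, reusing the model and head names assumed in the earlier sketches and the library's cross-entropy for both heads.

```python
import torch
import torch.nn.functional as F

def make_optimizer(model, lr=1e-6):
    """Per-branch learning rates: the layout head trains at 50x the base rate
    (our reading of the 50*lambda setting; model is assumed to expose
    features / normal_head / layout_head as in the sketch above)."""
    return torch.optim.SGD([
        {"params": model.features.parameters(), "lr": lr},
        {"params": model.normal_head.parameters(), "lr": lr},
        {"params": model.layout_head.parameters(), "lr": 50.0 * lr},
    ], momentum=0.9)

def train_step(model, optimizer, images, normal_labels, layout_labels, Mt=20, Kt=20):
    """Jointly optimize the two classification losses for one mini-batch."""
    normal_logits, layout_logits = model(images)
    normal_logits = normal_logits.view(images.size(0), Mt * Mt, Kt)
    loss = F.cross_entropy(normal_logits.permute(0, 2, 1), normal_labels) \
         + F.cross_entropy(layout_logits, layout_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```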

4.3. Local Network

The goal of this network is to capture local evidence at a higher resolution that might be missed by the global network. We take a sliding-window approach where we extract features in a window and predict the image properties at the center of the window. This type of model has been applied successfully for generating local image interpretations in the form of normals [9, 22] and semantic edges [6].

Input: Given an image of size 195 × 260, we perform sliding window on it with a window of size 55 × 55 and a stride of 13 (defined to match our output size).

Output: The local network produces two types of outputs: (i) surface normals and (ii) an edge label. Each local sliding window predicts the surface normals for the Mb × Mb pixels at the center of the window. We use Mb = 13. As Figure 4 illustrates, our network takes a smaller part of the image as input and predicts the surface normals in the middle of the patch, thus predicting the local normals from local texture and its context. We use Kb = 40 codewords to define the output space; we use a larger number of codewords since we expect the local network to capture finer details. For the edges, we use the classic categories of convex, concave, occlusion or not an edge. Note that we predict just one edge label for each 13 × 13 patch. For visualization purposes, we project these edge labels onto the output of Structured Edges [6].

Architecture: The architecture of the local network includes 4 convolutional layers and 2 sets of fully connected layers. The convolutional layers are shared by the two tasks, and we use the same parameter settings mentioned for the global network. At the end of the convolutional layers, we stack two separate fully connected layers with 4096 neurons, each of which corresponds to a task.

Figure 5: Top regions for the 4th convolutional layer units in the global and local networks. The receptive field of the neurons in the 4th layer is 31 × 31. We use red bounding boxes to represent the regions with top responses for different units. (a) The neurons from the global network tend to capture structural information about the global scene; (b) the neurons from the local network respond to local texture and edges.

For local surface normal estimation, the output size is Mb × Mb × Kb = 13 × 13 × 40 = 6760; for edge labels, there are 4 outputs.

Loss Function: Both tasks are defined as classification. For edge label estimation, we apply softmax regression to define the loss. For local surface normal estimation we apply the loss defined in Eq. 1, setting M = Mb and K = Kb. Similar to the coarse network, we optimize these two tasks jointly during training, and the learning rates for local surface normals and edge label estimation are λ and 50λ, respectively.
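A minimal sketch of the sliding-window assembly described above, where a full-image normal map is stitched together from the 13 × 13 center-patch predictions. The reflection padding used to give border windows context is our assumption, as are the function and variable names.

```python
import numpy as np

def sliding_window_normals(predict_patch, image, win=55, stride=13, Mb=13):
    """Assemble a full-resolution normal map from local predictions (a sketch).

    predict_patch: callable taking a (win, win, 3) crop and returning a
                   (Mb, Mb, 3) normal estimate for the centre of the window
                   (e.g. the decoded output of the local or fusion network).
    image:         (195, 260, 3) input; with stride = Mb the centre patches
                   tile the image exactly.
    """
    H, W, _ = image.shape
    out = np.zeros((H, W, 3), dtype=np.float32)
    pad = (win - Mb) // 2                                   # 21 pixels of surrounding context
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    for y in range(0, H, stride):
        for x in range(0, W, stride):
            crop = padded[y:y + win, x:x + win]             # 55x55 window centred on the patch
            out[y:y + Mb, x:x + Mb] = predict_patch(crop)[: H - y, : W - x]
    return out
```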

4.4. Visualization

We now attempt to analyze what the global and local networks learn. Figure 5 shows the top 5 activations for units in the fourth convolutional layer of (a) the global network and (b) the local network. Note that these two networks share the same structure in the convolutional layers, and the receptive field of these units is 31 × 31. We select representative samples for illustration. For the global network, the units capture high-level structures such as the sides of beds, hallways and paintings on walls. For the local network, the units respond to local texture and edges.

4.5. Fusion Network

The goal of this network is to fuse the results of the two earlier networks and refine their outputs. Each approach has complementary failure modes, and by fusing the two networks we show that better results can be obtained than by either alone. Additionally, both the coarse and local networks treat every pixel independently; our fusion network can thus also be thought of as applying a form of learned reasoning akin to [26, 35] on our outputs.

Input: As input to this fusion network, we concatenate the outputs of the global and local networks with the input image.

The concatenation process is as follows:

• Global Coarse Output: The output of the global network is 20 × 20 with 20 classes. We decode the output to a 3-dimensional continuous surface normal map and upscale it to 195 × 260 × 3.

• Layout: We select the room layout corresponding to the label with the highest probability. The layout is a 3-channel feature map representing the surface normals of the layout. We resize it to 195 × 260 × 3.

• Local Surface Normals: The output of the local network in the sliding-window format is 195 × 260 × 3.

• Edge Labels: We obtain the 4 probabilities of edge labels per window. As the probabilities sum to 1, we do not pass the no-edge output to the fusion network. We upsample this three-dimensional vector to size 13 × 13 × 3 for each window and obtain a 195 × 260 × 3 input.

• Vanishing Point-Aligned Coarse Output: We adjust our global output's interpretation to match vanishing points estimated by [14], yielding another feature representation of the same size.

In addition to the 15 channels described above, we also concatenate the original image, and therefore our final input to the deep network is 195 × 260 × 18.
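A minimal sketch of assembling this 18-channel fusion input from the maps listed above; decoding and upsampling of the individual maps are assumed to have been done already, and the argument names are ours.

```python
import numpy as np

def build_fusion_input(image, coarse_normals, layout_normals,
                       local_normals, edge_probs, vp_aligned_normals):
    """Stack the five 3-channel maps plus the RGB image into the
    195 x 260 x 18 fusion-network input (a sketch).

    image:               (195, 260, 3) RGB input.
    coarse_normals:      decoded 20x20 global output upscaled to (195, 260, 3).
    layout_normals:      normals of the highest-probability box layout, (195, 260, 3).
    local_normals:       decoded sliding-window local output, (195, 260, 3).
    edge_probs:          per-window convex/concave/occlusion probabilities
                         upsampled to (195, 260, 3); the no-edge channel is dropped.
    vp_aligned_normals:  coarse output rectified to the vanishing points, (195, 260, 3).
    """
    maps = [coarse_normals, layout_normals, local_normals,
            edge_probs, vp_aligned_normals, image]
    assert all(m.shape == (195, 260, 3) for m in maps)
    return np.concatenate(maps, axis=2)          # (195, 260, 18)
```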

Output: The fusion network is also applied in a sliding-window scheme on the 195 × 260 image. Taking inputs of size 55 × 55, we estimate the surface normals of the Mb × Mb center patch via the fusion network. Note that we use the same output window size Mb = 13 as in the local network, and the output space is defined by Kb = 40 codewords.

Architecture: The architecture of this network is 4 convolutional layers and 2 fully connected layers. The convolutional layers share the same parameter settings as the global and local networks. The last convolutional layer is fully connected to 4096 neurons, which in turn lead to the 13 × 13 × 40 outputs representing the surface normals.

Figure 6: Qualitative results of surface normal estimation using our complete architecture. Input images are shown on the left, ground truth surface normals from the Kinect are shown in the middle, and the predicted surface normals are shown on the right. Our network not only captures the coarse layout of the room but also preserves the fine details. Notice that fine details like the tops of couches and the legs of tables are captured by our algorithm.

At testing time, we apply the fusion network on the feature maps with a stride of Mb.

Loss Function: At training time, we fix the parameters of the global and local networks and obtain the feature maps of the training data through them. The loss function is defined as in Eq. 1, setting M = Mb and K = Kb. To train the network, we apply stochastic gradient descent with learning rate λ.

5. Experiments

We now describe our experiments. We adopt the protocols introduced in [9] and used by state-of-the-art methods on this task [9, 10, 22].

Dataset and Settings: We evaluate our method on the NYU Depth v2 dataset [32]. However, to train our models we use the corresponding raw video data for the training images. We process the video data using the provided development kit, but improve the normals with TV-denoising similar to [22]. We use the official split with 249 scenes for training and 215 scenes for testing. We extract 200K frames from the 249 training scenes and test on the 654 images from the standard test set. We extract the room layout by fitting an inside-out box from [14] to the estimated surface normals. The edge labels are estimated using the ground-truth depth data in a procedure like [13].

During training, we fine-tune the network with stochastic gradient descent with learning rate λ = 1.0 × 10^-6. Note that during joint tuning with the layouts and edges we set the learning rate to 50λ for these losses. For training our coarse network, we augment our data with flipping, color changes and random crops. For training the local and fusion networks, we rescale the training images to 195 × 260 and randomly sample 400K patches of size 55 × 55 from them.

Evaluation Criteria: Following [9], we evaluate a per-pixel error over the whole dataset, ignoring values that are unknown due to missing depth data. We summarize this population of per-pixel errors with statistics: the mean and median, as well as percent-good-pixels (PGP) metrics, i.e., what fraction of the pixels are correct within a threshold t (for t = 11.25°, 22.5°, 30°). In the interest of easy comparison, we report results on the ground truth provided by [22]; relative performance is similar on the ground truth of [9] and our denoised normals.

Baselines: Our primary baselines are the published state of the art in surface normal prediction [9, 10, 22]. Each is state-of-the-art in at least one metric that we evaluate on. Additionally, since there have been no published CNN results, we adapt the coarse network of Eigen et al. [8] to surface normals by using the negative dot-product as the loss. This coarse network nearly matches the full system's performance on depth, and as a single feed-forward CNN with no intermediate representations or designed structures, it is a good baseline.
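For reference, the summary statistics described above can be computed as in the following sketch (angular errors in degrees over valid pixels; the function and variable names are ours).

```python
import numpy as np

def normal_error_stats(pred, gt, valid, thresholds=(11.25, 22.5, 30.0)):
    """Per-pixel angular-error statistics used in Section 5 (a sketch).

    pred, gt: (N, 3) unit normals, predicted and ground-truth, per pixel.
    valid:    (N,) boolean mask of pixels with known depth.
    Returns the mean and median angular error (degrees) and the
    percent-good-pixels within each threshold.
    """
    p, g = pred[valid], gt[valid]
    cos = np.clip((p * g).sum(axis=1), -1.0, 1.0)   # cosine of the angular error
    err = np.degrees(np.arccos(cos))
    stats = {"mean": err.mean(), "median": np.median(err)}
    for t in thresholds:
        stats[f"<{t}"] = 100.0 * (err < t).mean()   # percent-good-pixels at threshold t
    return stats
```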

5.1. Experimental Results

Qualitative: First, we demonstrate our qualitative results. Figures 6 and 7 show the results of our complete architecture. Notice how our results capture the fine details of the input image. Unlike many past approaches, our algorithm is able to estimate even the legs of tables. Our algorithm is even able to estimate how the surface normal changes across the couch (last column, Figure 6).

Quantitative: Table 1 compares the performance of our algorithm against several baselines. As the results indicate, our approach is significantly better than all the baselines in all metrics. In many cases, our results show as much as a 15% improvement over previous state-of-the-art results.

For the sake of completeness, we also report results from a contemporaneous arXiv paper [7] which uses stacked CNNs.

Figure 7: More qualitative results showing the performance of our algorithm. Again notice that details such as the tops of night-stands, the counters, and even the legs of the chair are captured by our algorithm.

Figure 8: Qualitative ablative analysis (columns: input, ground truth, coarse network, local network, fusion with only normals, full fusion). The global and local network estimates have complementary failure modes. By combining both normal outputs we obtain better results via the fusion network. With more information fed in, the full fusion network reasons among these inputs and further improves performance.

The performance on all metrics is comparable, although we do not use any ImageNet [5] data for pretraining. At ambiguous pixels, their approach tends to produce an averaged and smooth output whereas our approach picks one of the interpretations. Therefore, their mean error is lower and our median error is lower.

Ablative Analysis: Next, we perform a comprehensive ablative analysis to explore which components of the network help improve performance. This ablative analysis helps evaluate the hypothesis that designing networks based on meaningful intermediate representations and constraints can improve performance.

First, we discuss some qualitative results shown in Figure 8. As seen in the figure, the global network captures only the coarse structure of the room. For example, in the top row, it misses the vertical surface on the inner side of the couch and how the vertical orientations change due to the bookshelves between the couches. On the other hand, a local network indeed captures those details.

Table 1: Results on NYU v2 for per-pixel surface normal estimation, evaluated over valid pixels.

                     (Lower Better)     (Higher Better)
                     Mean    Median     11.25°   22.5°   30°
Our Network          26.9    14.8       42.0     61.2    68.2
Stacked CNN [7]      23.7    15.5       39.2     62.0    71.1
UNFOLD [10]          35.2    17.9       40.5     54.1    58.9
Discr. [22]          33.5    23.1       27.7     49.0    58.7
3DP (MW) [9]         36.3    19.2       39.2     52.9    57.8
3DP [9]              35.3    31.2       16.4     36.6    48.2

However, since it only observes local patches, it completely misclassifies the wall patches below the picture frame. Fusing the two networks preserves the finer details (the inner side of the couch and the changing vertical orientations of the wall), but still misclassifies a big patch on the wall near the picture frame.

Figure 9: Results on the B3DO dataset [18]. We obtain state-of-the-art performance by applying our model trained on the NYU dataset without further fine-tuning on B3DO.

Table 2: Ablative Analysis

                     Mean    Median     11.25°   22.5°   30°
Full                 26.9    14.8       42.0     61.2    68.2
Full w/o Global      28.8    17.7       34.6     57.8    66.0
Fusion (+VP)         27.3    15.6       40.2     60.1    67.5
Fusion (+Edge)       27.8    16.4       37.5     59.4    67.4
Fusion (+Layout)     27.7    16.0       38.8     59.9    67.4
Fusion               27.9    16.6       37.4     59.2    67.1
Local                34.0    25.1       25.6     46.4    56.2
Global               30.9    20.8       31.4     52.3    60.5
Coarse CNN [8]       30.1    24.7       24.1     46.4    57.9

However, once the network uses the edge labels (e.g., the convex edge of the shelf and the missing edge on the wall), it is able to improve these boundaries as well.

Quantitatively, we compare all the components one by one in Table 2. The fusion network, which combines the raw images and the surface normal predictions from the local and global networks, provides a significant boost in performance. Furthermore, adding layout (+Layout), edges (+Edge) and vanishing points (+VP) independently improves the performance of the network. By combining all of them together in the full fusion network, we obtain better results in all metrics, and a 4.6% gain in the most strict metric.

We note that adding components accounts only for the marginal gain of each component over the base system. While it is easy to improve bad systems, it is difficult to improve on a strong system like the fusion network: by itself, it would be state-of-the-art in most metrics. We therefore also report fusing our constraints with just the local network (Full w/o Global), which leads to a 9% gain across all PGP metrics. This underscores the effectiveness and value of our constraints. Finally, we note that our performance is significantly better than our implementation of the coarse network of Eigen et al. [8], a single feed-forward CNN.

Table 3: B3DO

                     Mean    Median     11.25°   22.5°   30°
Full                 34.5    20.1       36.7     52.4    59.2
3DP (MW) [9]         38.0    24.5       33.6     48.5    54.5
Hedau et al. [14]    43.5    30.0       32.8     45.0    50.0
Lee et al. [25]      41.9    28.4       32.7     45.7    50.8

5.2. Berkeley B3DO Dataset

To show that our model generalizes well, we apply it directly to the B3DO dataset [18]. There is a significant mismatch in dataset bias between the two: NYU contains almost exclusively full scenes while B3DO contains many close-ups. Also, since B3DO, unlike NYU, contains many scenes with down-facing views, we rectify our results to detected vanishing points to compensate. We report the results of our full fusion network in Table 3 and some qualitative results in Figure 9. Our method outperforms the baselines from [9] by a substantial margin in all metrics.

6. Conclusion

We have presented a novel CNN-based approach for surface normal estimation. By injecting insights from 3D representation and reasoning into the design, our model achieves state-of-the-art performance. Qualitatively, our model works well: it not only captures the coarse scene structure but also captures fine details such as table legs and the curved surfaces of couches.

Acknowledgments: This work was partially supported by NSF IIS-1320083, ONR MURI N000141010934, a Bosch Young Faculty Fellowship to AG and an NDSEG fellowship to DF. This material is also based on research partially sponsored by DARPA under agreement number FA8750-14-2-0244. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government. The authors thank NVIDIA for GPU donations.

References

[1] I. Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94:115–147, 1987.
[2] M. Clowes. On seeing things. Artificial Intelligence, 2:79–116, 1971.
[3] J. Coughlan and A. Yuille. The Manhattan world assumption: Regularities in scene statistics which enable Bayesian inference. In NIPS, 2000.
[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[6] P. Dollar and C. L. Zitnick. Structured forests for fast edge detection. In ICCV, 2013.
[7] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. CoRR, abs/1411.4734, 2014.
[8] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. CoRR, abs/1406.2283, 2014.
[9] D. F. Fouhey, A. Gupta, and M. Hebert. Data-driven 3D primitives for single image understanding. In ICCV, 2013.
[10] D. F. Fouhey, A. Gupta, and M. Hebert. Unfolding an indoor origami world. In ECCV, 2014.
[11] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[12] A. Gupta, A. Efros, and M. Hebert. Blocks world revisited: Image understanding using qualitative geometry and mechanics. In ECCV, 2010.
[13] S. Gupta, P. Arbelaez, and J. Malik. Perceptual organization and recognition of indoor scenes from RGB-D images. In CVPR, 2013.
[14] V. Hedau, D. Hoiem, and D. Forsyth. Recovering the spatial layout of cluttered rooms. In ICCV, 2009.
[15] D. Hoiem, A. Efros, and M. Hebert. Geometric context from a single image. In ICCV, 2005.
[16] D. Hoiem, A. Stein, A. Efros, and M. Hebert. Recovering occlusion boundaries from a single image. In ICCV, 2007.
[17] D. Huffman. Impossible objects as nonsense sentences. Machine Intelligence, 8:475–492, 1971.
[18] A. Janoch, S. Karayev, Y. Jia, J. Barron, M. Fritz, K. Saenko, and T. Darrell. A category-level 3-D object dataset: Putting the Kinect to work. In Workshop on Consumer Depth Cameras in Computer Vision (with ICCV), 2011.
[19] T. Kanade. A theory of origami world. Artificial Intelligence, 13(3), 1980.
[20] K. Karsch, Z. Liao, J. Rock, J. T. Barron, and D. Hoiem. Boundary cues for 3D object shape recovery. In CVPR, 2013.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[22] L. Ladicky, B. Zeisl, and M. Pollefeys. Discriminatively trained dense surface normal estimation. In ECCV, 2014.
[23] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In NIPS, 1990.
[24] D. C. Lee, A. Gupta, M. Hebert, and T. Kanade. Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. In NIPS, 2010.
[25] D. C. Lee, M. Hebert, and T. Kanade. Geometric reasoning for single image structure recovery. In CVPR, 2009.
[26] D. Munoz, J. A. Bagnell, and M. Hebert. Stacked hierarchical labeling. In ECCV, 2010.
[27] L. Roberts. Machine perception of 3D solids. PhD thesis, 1965.
[28] A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single monocular images. In NIPS, 2005.
[29] A. G. Schwing, S. Fidler, M. Pollefeys, and R. Urtasun. Box in the box: Joint 3D layout and object reasoning from single images. In ICCV, 2013.
[30] A. G. Schwing and R. Urtasun. Efficient exact inference for 3D indoor scene understanding. In ECCV, 2012.
[31] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
[32] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[33] K. Sugihara. Machine Interpretation of Line Drawings. MIT Press, 1986.
[34] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, 2014.
[35] Z. Tu and X. Bai. Auto-context and its application to high-level vision tasks and 3D brain image segmentation. TPAMI, 32(10):1744–1757, 2010.
[36] D. Waltz. Understanding line drawings of scenes with shadows. In The Psychology of Computer Vision. McGraw-Hill, 1975.
[37] X. Wang, D. F. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. CoRR, abs/1411.4958, 2014.
[38] X. Wang, L. Zhang, L. Lin, Z. Liang, and W. Zuo. Joint task learning via deep neural networks with application to generic object extraction. In NIPS, 2014.
[39] Y. Zhang, S. Song, P. Tan, and J. Xiao. PanoContext: A whole-room 3D context model for panoramic scene understanding. In ECCV, 2014.
[40] Y. Zhao and S. Zhu. Scene parsing by integrating function, geometry and appearance models. In CVPR, 2013.