NeRS: Neural Reflectance Surfaces for Sparse-view 3D Reconstruction in the Wild

Jason Y. Zhang    Gengshan Yang    Shubham Tulsiani∗    Deva Ramanan∗
Robotics Institute, Carnegie Mellon University

Abstract

Recent history has seen a tremendous growth of work exploring implicit representations of geometry and radiance, popularized through Neural Radiance Fields (NeRF). Such works are fundamentally based on an (implicit) volumetric representation of occupancy, allowing them to model diverse scene structure including translucent objects and atmospheric obscurants. But because the vast majority of real-world scenes are composed of well-defined surfaces, we introduce a surface analog of such implicit models called Neural Reflectance Surfaces (NeRS). NeRS learns a neural shape representation of a closed surface that is diffeomorphic to a sphere, guaranteeing water-tight reconstructions. Even more importantly, surface parameterizations allow NeRS to learn (neural) bidirectional surface reflectance functions (BRDFs) that factorize view-dependent appearance into environmental illumination, diffuse color (albedo), and specular “shininess.” Finally, rather than illustrating our results on synthetic scenes or controlled in-the-lab capture, we assemble a novel dataset of multi-view images from online marketplaces for selling goods. Such “in-the-wild” multi-view image sets pose a number of challenges, including a small number of views with unknown/rough camera estimates. We demonstrate that surface-based neural reconstructions enable learning from such data, outperforming volumetric neural rendering-based reconstructions. We hope that NeRS serves as a first step toward building scalable, high-quality libraries of real-world shape, materials, and illumination. The project page with code and video visualizations can be found at jasonyzhang.com/ners.

[Figure 1 panels: Input Images and Initial Mesh; Recovered Shape and Cameras; Output Texture; Illumination; Novel Views; Diverse 3D Reconstructions.]

Figure 1: 3D view synthesis in the wild. From several multi-view internet images of a truck and a coarse initial mesh (top left), we recover the camera poses, 3D shape, texture, and illumination (top right). We demonstrate the scalability of our approach on a wide variety of indoor and outdoor object categories (second row). [Video]

∗ denotes equal coding. Corresponding email: [email protected].

35th Conference on Neural Information Processing Systems (NeurIPS 2021).


1 Introduction

Although we observe the surrounding world only via 2D percepts, it is undeniably 3D. The goal of recovering this underlying 3D from 2D observations has been a longstanding one in the vision community, and any computational approach aimed at this task must answer a central question about representation—how should we model the geometry and appearance of the underlying 3D structure?

An increasingly popular answer to this question is to leverage neural volumetric representations of density and radiance fields (Mildenhall et al., 2020). This allows modeling structures from rigid objects to translucent fluids, while further enabling arbitrary view-dependent lighting effects. However, it is precisely this unconstrained expressivity that makes it less robust and unsuitable for modeling 3D objects from sparse views in the wild. While these neural volumetric representations have been incredibly successful, they require hundreds of images, typically with precise camera poses, to model the full 3D structure and appearance of real-world objects. In contrast, when applied to ‘in-the-wild’ settings, e.g. a sparse set of images with imprecise camera estimates from off-the-shelf systems (see Fig. 1), they are unable to infer a coherent 3D representation. We argue this is because these neural volumetric representations, by allowing arbitrary densities and lighting, are too flexible.

Is there a robust alternative that captures real-world 3D structure? The vast majority of real-world objects and scenes comprise well-defined surfaces. This implies that the geometry, rather than being an unconstrained volumetric function, can be modeled as a 2D manifold embedded in Euclidean 3D space—and thus encoded via a (neural) mapping from a 2D manifold to 3D. Indeed, such meshed surface manifolds form the heart of virtually all rendering engines (Foley et al., 1996). Moreover, instead of allowing arbitrary view-dependent radiance, the appearance of such surfaces can be described using (neural) bidirectional surface reflectance functions (BRDFs), themselves developed by the computer graphics community over decades. We operationalize these insights into Neural Reflectance Surfaces (NeRS), a surface-based neural representation for geometry and appearance.

NeRS represents shape using a neural displacement field over a canonical sphere, thus constraining the geometry to be a watertight surface. This representation crucially associates a surface normal to each point, which enables modeling view-dependent lighting effects in a physically grounded manner. Unlike volumetric representations which allow unconstrained radiance, NeRS factorizes surface appearance using a combination of diffuse color (albedo) and specularity. It does so by learning neural texture fields over the sphere to capture the albedo at each surface point, while additionally inferring an environment map and surface material properties. This combination of a surface constraint and a factored appearance allows NeRS to learn efficiently and robustly from a sparse set of images in the wild, while being able to capture varying geometry and complex view-dependent appearance.

Using only a coarse category-level template and approximate camera poses, NeRS can reconstruct instances from a diverse set of classes. Instead of evaluating in a synthetic setup, we introduce a dataset sourced from marketplace settings where multiple images of a varied set of real-world objects under challenging illumination are easily available. We show NeRS significantly outperforms neural volumetric or classic mesh-based approaches in this challenging setup, and as illustrated in Fig. 1, is able to accurately model the view-dependent appearance via its disentangled representation. Finally, as cameras recovered in the wild are only approximate, we propose a new evaluation protocol for in-the-wild novel view synthesis in which cameras can be refined during both training and evaluation. We hope that our approach and results highlight the several advantages that neural surface representations offer, and that our work serves as a stepping stone for future investigations.

2 Related Work

Surface-based 3D Representations. As they enable efficient representation and rendering, polygonal meshes are widely used in vision and graphics. In particular, morphable models (Blanz and Vetter, 1999) allow parametrizing shapes as deformations of a canonical template and can even be learned from category-level image collections (Cashman and Fitzgibbon, 2012; Kar et al., 2015). With the advances in differentiable rendering (Kato et al., 2018; Laine et al., 2020; Ravi et al., 2020), these have also been leveraged in learning-based frameworks for shape prediction (Kanazawa et al., 2018; Gkioxari et al., 2019; Goel et al., 2020) and view synthesis (Riegler and Koltun, 2020). Whereas these approaches use an explicit discrete mesh, some recent methods have proposed using continuous neural surface parametrizations like ours to represent shape (Groueix et al., 2018) and texture (Tulsiani et al., 2020; Bhattad et al., 2021).

However, all of these works leverage such surface representations for (coarse) single-view 3D prediction given a category-level training dataset. In contrast, our aim is to infer such a representation given multiple images of a single instance, and without prior training. Closer to this goal of representing a single instance in detail, contemporary approaches have shown the benefits of using videos (Li et al., 2020; Yang et al., 2021) to recover detailed shapes, but our work tackles a more challenging setup where correspondence/flow across images is not easily available. In addition, while these prior approaches infer the surface texture, they do not enable the view-dependent appearance effects that our representation can model.

Volumetric 3D and Radiance Fields. Volumetric representations for 3D serve as a common, and arguably more flexible, alternative to surface-based representations, and have been very popular for classical multi-view reconstruction approaches (Furukawa and Hernández, 2015). These have since been incorporated in deep-learning frameworks for shape prediction (Girdhar et al., 2016; Choy et al., 2016) and differentiable rendering (Yan et al., 2016; Tulsiani et al., 2017). Although these initial approaches used discrete volumetric grids, their continuous neural function analogues have since been proposed to allow finer shape (Mescheder et al., 2019; Park et al., 2019) and texture modeling (Oechsle et al., 2019).

Whereas the above methods typically aimed for category-level shape representation, subsequent approaches have shown particularly impressive results when using these representations to model a single instance from images (Sitzmann et al., 2019a; Thies et al., 2019; Sitzmann et al., 2019b), which is the goal of our work. More recently, by leveraging an implicit representation in the form of a Neural Radiance Field, Mildenhall et al. (2020) showed the ability to model complex geometries and illumination from images. There has since been a flurry of impressive work to further push the boundaries of these representations and allow modeling deformation (Pumarola et al., 2020; Park et al., 2021), lighting variation (Martin-Brualla et al., 2021), and, similar to ours, leveraging insights from surface rendering to model radiance (Yariv et al., 2020; Boss et al., 2021; Oechsle et al., 2021; Srinivasan et al., 2021; Zhang et al., 2021b; Wang et al., 2021a; Yariv et al., 2021). However, unlike our approach which can efficiently learn from a sparse set of images with coarse cameras, these approaches rely on a dense set of multi-view images with precise camera localization to recover a coherent 3D structure of the scene. While recent concurrent work (Lin et al., 2021) does relax the constraint of precise cameras, it does so by foregoing view-dependent appearance, while still requiring a dense set of images. Other approaches (Bi et al., 2020; Zhang et al., 2021a) that learn material properties from sparse views require specialized illumination rigs.

Multi-view Datasets. Many datasets study the longstanding problem of multi-view reconstruction and view synthesis. However, they are often captured in controlled setups, small in scale, and not diverse enough to capture the span of real-world objects. Middlebury (Seitz et al., 2006) benchmarks multi-view reconstruction, containing two objects with nearly Lambertian surfaces. DTU (Aanæs et al., 2016) contains eighty objects with various materials but is still captured in a lab with controlled lighting. Freiburg cars (Sedaghat and Brox, 2015) captures 360-degree videos of fifty-two outdoor cars for multi-view reconstruction. ETH3D (Schöps et al., 2019) and Tanks and Temples (Knapitsch et al., 2017) contain both indoor and outdoor scenes but are small in scale. Perhaps most relevant are large-scale datasets of real-world objects such as Redwood (Choi et al., 2016) and Stanford Products (Oh Song et al., 2016), but the data is dominated by single views or small-baseline videos. In contrast, our Multi-view Marketplace Cars (MVMC) dataset contains thousands of multi-view captures of in-the-wild objects under various illumination conditions, making it suitable for studying and benchmarking algorithms for multi-view reconstruction, view synthesis, and inverse rendering.

3 Method

Given a sparse set of input images of an object under natural lighting conditions, we aim to model its shape and appearance. While recent neural volumetric approaches share a similar goal, they require a dense set of views with precise camera information. Instead, our approach relies only on approximate camera pose estimates and a coarse category-level shape template. Our key insight is that instead of allowing the unconstrained densities popularly used for volumetric representations, we can enforce a surface-based 3D representation.


[Figure 2 panels: Canonical Sphere → Shape Deformation Network → Predicted Shape; Canonical Sphere → Texture Prediction Network → Predicted Texture; Predicted Shape + Predicted Texture = Textured Mesh.]

Figure 2: Neural Surface Representation. We propose an implicit, continuous representation of shape and texture. We model shape as a deformation of a unit sphere via a neural network fshape, and texture as a learned per-uv color value via a neural network ftex. We can discretize fshape and ftex to produce the textured mesh above.

Importantly, this allows view-dependent appearance variation by leveraging constrained reflection models that decompose appearance into diffuse and specular components. In this section, we first introduce our (neural) surface representation that captures the object's shape and texture, and then explain how illumination and specular effects can be modeled for rendering. Finally, we describe how our approach can learn using challenging in-the-wild images.

3.1 Neural Surface Representation

We represent object shape via a deformation of a unit sphere. Previous works (Kanazawa et al., 2018; Goel et al., 2020) have generally modeled such deformations explicitly: the unit sphere is discretized at some resolution as a 3D mesh with V vertices. Predicting the shape deformation thus amounts to predicting vertex offsets δ ∈ R^{V×3}. Such explicit discrete representations have several drawbacks. First, they can be computationally expensive for dense meshes with fine details. Second, they lack useful spatial inductive biases as the vertex locations are predicted independently. Finally, the learned deformation model is fixed to a specific level of discretization, making it non-trivial, for instance, to allow for more resolution as needed in regions with richer detail. These limitations also extend to the texture parametrization commonly used for such discrete mesh representations—using either per-vertex or per-face texture samples (Kato et al., 2018), or a fixed-resolution texture map, limits the ability to capture finer details.

Figure 3: Notation and convention for viewpoint and illumination parameterization. The camera at c is looking at point x on the surface S. v denotes the direction of the camera w.r.t. x, and n is the normal of S at x. Ω denotes the unit hemisphere centered about n. We compute the light arriving in the direction of every ω ∈ Ω, and r is the reflection of ω about n.

Inspired by Groueix et al. (2018); Tulsiani et al. (2020), we address these challenges by adopting a continuous surface representation via a neural network. We illustrate this representation in Fig. 2. For any point u on the surface of a unit sphere S², we represent its 3D deformation x ∈ R³ using the mapping fshape(u) = x, where fshape is parameterized as a multi-layer perceptron. This network therefore induces a deformation field over the surface of the unit sphere, and this deformed surface serves as our shape representation. We represent the surface texture in a similar manner, as a neural vector field over the surface of the sphere: ftex(u) = t ∈ R³. This surface texture can be interpreted as an implicit UV texture map.
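To make this concrete, both fields can be realized as small coordinate MLPs that take a point on the unit sphere as input. The sketch below is a minimal, hypothetical PyTorch version; the widths, depth, and activations are illustrative assumptions rather than the exact architecture (see Fig. 14 in the appendix for the latter).

```python
import torch
import torch.nn as nn


class SphereFieldMLP(nn.Module):
    """Maps a point u on the unit sphere (a 3D unit vector) to an output vector.

    Instantiated twice: as fshape (output = deformed 3D position x) and as
    ftex (output = RGB albedo t). Layer sizes here are placeholders.
    """

    def __init__(self, out_dim=3, hidden=256, num_layers=4):
        super().__init__()
        layers = [nn.Linear(3, hidden), nn.ReLU()]
        for _ in range(num_layers - 1):
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        layers += [nn.Linear(hidden, out_dim)]
        self.net = nn.Sequential(*layers)

    def forward(self, u):
        # u: (..., 3) points on the unit sphere S^2
        return self.net(u)


f_shape = SphereFieldMLP()  # u -> x in R^3 (deformed surface point)
f_tex = SphereFieldMLP()    # u -> RGB albedo (after a sigmoid)

# Discretizing the fields: sample sphere points, query both networks.
u = torch.nn.functional.normalize(torch.randn(1000, 3), dim=-1)
vertices = f_shape(u)                 # vertices of the deformed surface
albedo = torch.sigmoid(f_tex(u))      # per-point texture values
```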

3.2 Modeling Illumination and Specular Rendering

Surface Rendering. The surface geometry and texture are not sufficient to infer the appearance of the object, e.g. a uniformly red car may appear darker on one side and lighter on the other depending on the direction of incident light. In addition, depending on viewing direction and material properties, one may observe different appearance for the same 3D point, e.g. a shiny highlight from certain viewpoints. More formally, assuming that a surface does not emit light, the outgoing radiance Lo in direction v from a surface point x can be described by the rendering equation (Kajiya, 1986; Immel et al., 1986):

Lo(x, v) = ∫_Ω fr(x, v, ω) Li(x, ω) (ω · n) dω        (1)

where Ω is the unit hemisphere centered at the surface normal n, and ω denotes the negative direction of incoming light. fr(x, v, ω) is the bidirectional reflectance function (BRDF), which captures material properties (e.g. color and shininess) of surface S at x, and Li(x, ω) is the radiance coming toward x from ω (refer to Fig. 3). Intuitively, this integral computes the total effect of every possible light ray ω hitting x and bouncing in the direction v.

Figure 4: Components of learned illumination model. Given a query camera viewpoint (illustrated via the reference image I), we recover the radiance output Lo, computed using Phong shading (Phong, 1975). Here, we show the full decomposition of learned components. From the environment map fenv and normals n, we compute diffuse lighting (Idiffuse) and specular lighting (Ispecular). The texture and diffuse lighting form the view-independent component (“View Indep.”), and the specular lighting (weighted by the specular coefficient ks) forms the view-dependent component of the radiance. Altogether, the output radiance is Lo = T · Idiffuse + ks · Ispecular, as in (4). We also visualize the radiance using the mean texture, which is used to help learn plausible illumination. In the yellow box, we visualize the effects of the two specularity parameters. The shininess α controls the mirror-ness/roughness of the surface. The specular coefficient ks controls the intensity of the specular highlights.

We thus need to infer the environment lighting and surface material properties to allow realistic renderings. However, learning arbitrary lighting Li or reflection models fr is infeasible given sparse views, and we need to further constrain these to allow learning. Inspired by concurrent work (Wu et al., 2021) that demonstrated its efficacy when rendering rotationally symmetric objects, we leverage the Phong reflection model (Phong, 1975) with the lighting represented as a neural environment map.

Neural Environment Map. An environment map intuitively corresponds to the assumption that all the light sources are infinitely far away. This allows a simplified model of illumination, where the incoming radiance only depends on the direction ω and is independent of the position x, i.e. Li(x, ω) ≡ Iω. We implement this as a neural spherical environment map fenv which learns to predict the incoming radiance for any query direction: Li(x, ω) ≡ Iω = fenv(ω). Note that there is a fundamental ambiguity between material properties and illumination, e.g. a car that appears red could be a white car under red illumination, or a red car under white illumination. To avoid this, we follow Wu et al. (2021) and further constrain the environment illumination to be grayscale, i.e. fenv(ω) ∈ R.
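A grayscale environment map can likewise be a tiny MLP from a direction ω to a single non-negative intensity. The sketch below is an assumed implementation; the layer sizes and softplus output are our choices, not necessarily the paper's.

```python
import torch
import torch.nn as nn


class GrayscaleEnvMap(nn.Module):
    """fenv: unit direction omega (3,) -> scalar incoming radiance I_omega >= 0."""

    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # grayscale, non-negative
        )

    def forward(self, omega):
        return self.net(omega).squeeze(-1)


f_env = GrayscaleEnvMap()
# Sampled light directions used to approximate the sums in Eqs. (2)-(3).
omega = torch.nn.functional.normalize(torch.randn(512, 3), dim=-1)
intensities = f_env(omega)  # (512,) grayscale radiance per direction
```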

Appearance under Phong Reflection. Instead of allowing an arbitrary BRDF fr, the Phong reflection model decomposes the outgoing radiance from point x in direction v into diffuse and specular components. The view-independent portion of the illumination is modeled by the diffuse component:

Idiffuse(x) = Σ_{ω∈Ω} (ω · n) Iω,        (2)

while the view-dependent portion of the illumination is modeled by the specular component:

Ispecular(x, v) = Σ_{ω∈Ω} (rω,n · v)^α Iω,        (3)

where rω,n = 2(ω · n)n − ω is the reflection of ω about the normal n. The shininess coefficient α ∈ (0, ∞) is a property of the surface material and controls the “mirror-ness” of the surface. If α is high, the specular highlight will only be visible if v aligns closely with rω,n. Altogether, we compute the radiance of x in direction v as:

Lo(x, v) = T(x) · Idiffuse(x) + ks · Ispecular(x, v)        (4)


[Figure 5 columns: Images, Init. Shape, Pred. Shape, Reconstructions from Novel Views.]

Figure 5: Qualitative results on various household objects. We demonstrate the versatility of our approach on an espresso machine, a bottle of ketchup, a game controller, and a fire hydrant. Each instance has 7-10 input views. We find that a coarse, cuboid mesh is sufficient as an initialization to learn detailed shape and texture. We initialize the camera poses by hand, roughly binning in increments of 45 degrees azimuth. [Video]

where the specularity coefficient ks is another surface material property that controls the intensity of the specular highlight. T(x) is the texture value at x computed by ftex. For the sake of simplicity, α and ks are shared across the entire instance. See Fig. 4 for a full decomposition of these components.
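A minimal sketch of the Phong shading in equations (2)-(4) is shown below, assuming the environment map has already been evaluated at a discrete set of sampled directions ω. The function and tensor names are ours, and clamping the specular cosine before exponentiation is an implementation choice.

```python
import torch


def phong_radiance(albedo, normals, view_dirs, omega, env_intensity, k_s, alpha):
    """Outgoing radiance Lo for P surface points under the Phong model, Eqs. (2)-(4).

    albedo:        (P, 3) texture T(x) at each point
    normals:       (P, 3) unit surface normals n
    view_dirs:     (P, 3) unit directions v toward the camera
    omega:         (D, 3) unit directions of sampled incoming light
    env_intensity: (D,)   grayscale environment radiance fenv(omega)
    k_s, alpha:    scalars (specular coefficient and shininess)
    """
    cos = normals @ omega.t()          # (P, D): omega . n
    hemi = (cos > 0).float()           # restrict to the hemisphere around n
    diffuse = (cos * hemi) @ env_intensity                                    # (P,) Eq. (2)

    # Reflection of omega about n: r = 2 (omega . n) n - omega
    r = 2.0 * cos.unsqueeze(-1) * normals.unsqueeze(1) - omega.unsqueeze(0)   # (P, D, 3)
    spec_cos = torch.clamp((r * view_dirs.unsqueeze(1)).sum(-1), min=0.0)     # (P, D): r . v
    specular = ((spec_cos ** alpha) * hemi) @ env_intensity                   # (P,) Eq. (3)

    # Lo = T(x) * Idiffuse + k_s * Ispecular                                         Eq. (4)
    return albedo * diffuse.unsqueeze(-1) + k_s * specular.unsqueeze(-1)
```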

3.3 Learning NeRS in the Wild

Given a sparse set of images in the wild, our approach aims to infer a NeRS representation which, when rendered, matches the available input. Concretely, our method takes as input N (typically 8) images of the same instance {Ii}_{i=1}^N, noisy camera rotations {Ri}_{i=1}^N, and a category-specific mesh initialization M. Using these, we aim to optimize full perspective cameras {Πi}_{i=1}^N as well as the neural surface shape fshape, surface texture ftex, and environment map fenv. In addition, we also recover the material properties of the object, parametrized by a specularity coefficient ks and shininess coefficient α.

Initialization. Note that both the camera poses and mesh initialization are only required to be coarsely accurate. We use an off-the-shelf approach (Xiao et al., 2019) to predict camera rotations, and we find that a cuboid is sufficient as an initialization for several instances (see Fig. 5). We use off-the-shelf approaches (Rother et al., 2004; Kirillov et al., 2020) to compute masks {Mi}_{i=1}^N. We assume that all images were taken with the same camera intrinsics. We initialize the shared global focal length f to correspond to a field of view of 60 degrees, and set the principal point at the center of each image. We initialize the camera pose with the noisy initial rotations Ri and a translation ti such that the object is fully in view. We pre-train fshape to output the template mesh M.
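As a small worked example of the focal length initialization, a 60-degree field of view maps to a focal length via the pinhole relation f = (W/2) / tan(FOV/2). The sketch below assumes pixel units and a square image; the paper's exact convention may differ.

```python
import math


def focal_from_fov(fov_deg=60.0, image_width_px=256):
    # tan(fov / 2) = (image_width / 2) / f  =>  f = (image_width / 2) / tan(fov / 2)
    return (image_width_px / 2.0) / math.tan(math.radians(fov_deg) / 2.0)


focal = focal_from_fov(60.0, 256)    # ~221.7 px, shared across all images
principal_point = (128.0, 128.0)     # image center
# Rotations R_i come from the off-the-shelf pose estimator; translations t_i are
# chosen so that the template mesh projects fully inside each image.
```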

Rendering. To render an image, NeRS first discretizes the neural shape model fshape(u) over spherical coordinates u to construct an explicit triangulated surface mesh. This triangulated mesh and camera Πi are fed into PyTorch3D's differentiable renderer (Ravi et al., 2020) to obtain per-pixel (continuous) spherical coordinates and associated surface properties:

[UV, N, M̂i] = Rasterize(Πi, fshape)        (5)

where UV[p], N[p], and M̂[p] are the (spherical) uv-coordinates, normals, and binary foreground-background labels corresponding to each image pixel p. Together with the environment map fenv and specular material parameters (α, ks), these quantities are sufficient to compute the outgoing radiance at each pixel p under camera viewpoint Πi using (4). In particular, denoting by v(Π, p) the viewing direction for pixel p under camera Π, and using u ≡ UV[p], n ≡ N[p] for notational brevity, the intensity at pixel p can be computed as:

I[p] = ftex(u) · ( Σ_{ω∈Ω} (ω · n) fenv(ω) ) + ks ( Σ_{ω∈Ω} (rω,n · v(Π, p))^α fenv(ω) )        (6)

Image loss. We compute a perceptual loss (Zhang et al., 2018) Lperceptual(Îi, Ii) that compares the distance between the rendered and true images using off-the-shelf VGG deep features. Note that being able to compute a perceptual loss is a significant benefit of surface-based representations over volumetric approaches such as NeRF (Mildenhall et al., 2020), which operate on batches of rays rather than images due to the computational cost of volumetric rendering. Similar to Wu et al. (2021), we find that an additional rendering loss using the mean texture (see Fig. 4 and Fig. 8 for examples, with details in the supplement) helps learn visually plausible lighting.

Mask Loss. To measure disagreement between the rendered and measured silhouettes, we compute a mask loss Lmask = (1/N) Σ_{i=1}^N ‖Mi − M̂i‖₂², a distance transform loss Ldt = (1/N) Σ_{i=1}^N Di ⊙ M̂i, and a 2D chamfer loss Lchamfer = (1/N) Σ_{i=1}^N Σ_{p∈E(Mi)} min_{p̂∈M̂i} ‖p − p̂‖₂². Here, Di refers to the Euclidean distance transform of mask Mi, E(·) computes the 2D pixel coordinates of the edge of a mask, p̂ ranges over every pixel coordinate in the predicted silhouette M̂i, and ⊙ denotes an element-wise product (summed over pixels).
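The three silhouette terms can be sketched as follows for a single image, assuming the rendered mask, the measured mask, its distance transform, and its edge coordinates are already available as tensors (the distance transform could, for instance, come from scipy.ndimage.distance_transform_edt). Whether the terms are summed or averaged over pixels is a normalization choice here.

```python
import torch


def silhouette_losses(pred_mask, gt_mask, gt_dist_transform, gt_edge_xy):
    """Mask, distance-transform, and 2D chamfer losses for one image.

    pred_mask:         (H, W) rendered silhouette in [0, 1]
    gt_mask:           (H, W) binary measured mask M_i
    gt_dist_transform: (H, W) Euclidean distance transform D_i of the measured mask
    gt_edge_xy:        (E, 2) pixel coordinates of the measured mask boundary E(M_i)
    """
    l_mask = ((gt_mask - pred_mask) ** 2).mean()

    # Penalize rendered foreground that lies far from the measured mask.
    l_dt = (gt_dist_transform * pred_mask).mean()

    # Every boundary pixel of the measured mask should be near some rendered pixel.
    pred_xy = torch.nonzero(pred_mask > 0.5).float()            # (P, 2)
    if pred_xy.numel() == 0:
        l_chamfer = torch.zeros((), device=pred_mask.device)    # empty-silhouette guard
    else:
        d = torch.cdist(gt_edge_xy.float(), pred_xy)            # (E, P) pairwise distances
        l_chamfer = (d.min(dim=1).values ** 2).mean()

    return l_mask, l_dt, l_chamfer
```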

Regularization. Finally, to encourage a smoother shape whenever possible, we incorporate a mesh regularization loss Lregularize = Lnormals + Llaplacian consisting of normal consistency and Laplacian smoothing losses (Nealen et al., 2006; Desbrun et al., 1999). Note that such geometry regularization is another benefit of surface representations over volumetric ones. Altogether, we minimize:

L = λ1Lmask + λ2Ldt + λ3Lchamfer + λ4Lperceptual + λ5Lregularize (7)

w.r.t. Πi = [Ri, ti, f], α, ks, and the weights of fshape, ftex, and fenv.

Optimization. We optimize (7) in a coarse-to-fine fashion, starting with a few parameters and slowly increasing the number of free parameters. We initially optimize (7) w.r.t. only the camera parameters Πi. After convergence, we sequentially optimize fshape, ftex, and fenv/α/ks. We find it helpful to sample a new set of spherical coordinates u each iteration when rasterizing. This helps propagate gradients over a larger surface and prevents aliasing. With 4 NVIDIA 1080 Ti GPUs, training NeRS requires approximately 30 minutes. Please see Sec. 6 for hyperparameters and additional details.
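One plausible reading of this coarse-to-fine schedule is sketched below: each stage adds a new group of parameters to the set being optimized, and the loop re-renders with freshly sampled spherical coordinates at every step. The stage grouping, optimizer, step count, and learning rate are assumptions, not the paper's exact hyperparameters.

```python
import torch


def optimize_ners(camera_params, f_shape, f_tex, f_env, k_s, alpha,
                  total_loss, steps_per_stage=500, lr=1e-3):
    """Coarse-to-fine optimization of Eq. (7).

    total_loss: a closure that renders the current model (with a fresh uv sampling)
                and returns the weighted sum of losses in Eq. (7).
    """
    stages = [
        list(camera_params),                       # Stage 1: cameras only
        list(f_shape.parameters()),                # Stage 2: + shape
        list(f_tex.parameters()),                  # Stage 3: + texture
        list(f_env.parameters()) + [k_s, alpha],   # Stage 4: + illumination / material
    ]
    active = []
    for new_params in stages:
        active += new_params
        opt = torch.optim.Adam(active, lr=lr)
        for _ in range(steps_per_stage):
            loss = total_loss()
            opt.zero_grad()
            loss.backward()
            opt.step()
```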

4 Evaluation

In this section, we demonstrate the versatility of Neural Reflectance Surfaces to recover meaningful shape, texture, and illumination from in-the-wild indoor and outdoor images.

Multi-view Marketplace Dataset. To address the shortage of in-the-wild multi-view datasets, we introduce a new dataset, Multi-view Marketplace Cars (MVMC), collected from an online marketplace with thousands of car listings. Each user-submitted listing contains seller images of the same car instance. In total, we curate a subset of 600 listings with at least 8 exterior views (averaging 10 exterior images per listing), along with 20 instances for an evaluation set (averaging 9.1 images per listing). We use Xiao et al. (2019) to compute rough camera poses. MVMC contains a large variety of cars under various illumination conditions (e.g. indoors, overcast, sunny, snowy, etc.). The filtered dataset with anonymized personally identifiable information (e.g. license plates and phone numbers), masks, initial camera poses, and optimized NeRS cameras will be made available on the project page.

Novel View Synthesis. Traditionally, novel view synthesis requires accurate target cameras to use as queries. Existing approaches use COLMAP (Schönberger and Frahm, 2016) to recover ground truth cameras, but this consistently fails on MVMC due to specularities and limited views. On the other hand, we can use learning-based methods (Xiao et al., 2019) to recover camera poses for both training and test views. However, as these are inherently approximate, this complicates training and evaluation. To account for this, we explore two evaluation protocols. First, to mimic the traditional evaluation setup, we obtain pseudo-ground truth cameras (with manual correction) and freeze them during training and evaluation. While this evaluates the quality of the 3D reconstruction, it does not evaluate the method's ability to jointly recover cameras. As a more realistic setup for evaluating view synthesis in the wild, we evaluate each method with approximate (off-the-shelf) cameras, while allowing them to be optimized.



Figure 6: Qualitative comparison with fixed cameras. We evaluate all baselines on the task of novel view synthesis on Multi-view Marketplace Cars, trained and tested with fixed, pseudo-ground truth cameras. One image is held out during training. Since we do not have ground truth cameras, we treat the cameras obtained by jointly optimizing over all images as the ground truth cameras. We train a modified version (see Sec. 4) of NeRF (Mildenhall et al., 2020) that is more competitive with sparse views (NeRF∗). We also evaluate against a meta-learned initialization of NeRF with and without finetuning until convergence (Tancik et al., 2021), but found poor results, perhaps due to the domain shift from ShapeNet cars. Finally, IDR (Yariv et al., 2020) extracts a surface from an SDF representation, but struggles to produce a view-consistent output given limited input views. We find that NeRS synthesizes novel views that are qualitatively closer to the target. The red truck has 16 total views while the blue SUV has 8 total views. [Video]


Figure 7: Qualitative results for in-the-wild novel view synthesis. Since off-the-shelf camera poses are only approximate for both training and test images, we allow cameras to be optimized during both training and evaluation (see Tab. 2 and Sec. 4). We find that NeRS generalizes better than the baselines in this unconstrained but more realistic setup. [Video]

Novel View Synthesis with Fixed Cameras. In the absence of ground truth cameras, we create pseudo-ground truth by manually correcting cameras recovered by jointly optimizing over all images for each object instance. For each evaluation, we treat one image-camera pair as the target and use the remaining pairs for training. We repeat this process for each image in the evaluation set (totaling 182). Unless otherwise noted, qualitative results use approximate cameras and not the pseudo-ground truth.

Novel View Synthesis in the Wild. While the above evaluates the quality of the 3D reconstructions, it is not representative of in-the-wild settings where the initial cameras are unknown/approximate and should be optimized during training. Because even the test camera is approximate, each method is similarly allowed to refine the test camera to better match the test image while keeping the model fixed. Intuitively, this measures the ability of a model to synthesize a target view under some camera. See implementation details in Sec. 6.5.
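Concretely, the test-camera refinement used by this protocol can be sketched as follows: the reconstruction is frozen and only the held-out camera's parameters are updated against the target image. The optimizer, loss, and step count below are illustrative assumptions.

```python
import torch


def refine_test_camera(render_fn, target_image, cam_params, steps=100, lr=1e-2):
    """Refine only the test camera while keeping the learned model fixed.

    render_fn:  callable(cam_params) -> rendered image; model weights are frozen
    cam_params: list of tensors (e.g. rotation, translation, focal) with requires_grad=True
    """
    opt = torch.optim.Adam(cam_params, lr=lr)
    for _ in range(steps):
        rendered = render_fn(cam_params)
        loss = torch.nn.functional.mse_loss(rendered, target_image)  # or a perceptual loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return cam_params
```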

Baselines. We evaluate our approach against Neural Radiance Fields (NeRF) (Mildenhall et al., 2020), which learns a radiance field conditioned on viewpoint and position and renders images using raymarching. We find that the vanilla NeRF struggles in our in-the-wild, low-data regime. As such, we make a number of changes to make the NeRF baseline (denoted NeRF∗) as competitive as possible, including a mask loss and a canonical volume. Please see the appendix for full details. We also evaluate a simplified NeRF with a meta-learned initialization for cars from multi-view images (Tancik et al., 2021), denoted as MetaNeRF. MetaNeRF meta-learns an initialization such that with just a few gradient steps, it can learn a NeRF model. This allows the model to learn a data-driven prior over the shape of cars. Note that MetaNeRF is trained on ShapeNet (Chang et al., 2015) and thus has seen more data than the other test-time-optimization approaches. We find that the default number of gradient steps was insufficient for MetaNeRF to converge on images from MVMC, so we also evaluate MetaNeRF-ft, which is finetuned until convergence. Finally, we evaluate IDR (Yariv et al., 2020), which represents geometry by extracting a surface from a signed distance field. IDR learns a neural renderer conditioned on the camera direction, position, and normal of the surface.


[Figure 8 columns: Input Images; Novel Views 1-3, each showing the Output and the Illum. of Mean Tex.]

Figure 8: Qualitative results on our in-the-wild Multi-view Marketplace Cars dataset. Here we visualize the NeRS outputs as well as the illumination of the mean texture on 3 listings from the MVMC dataset. We find that NeRS recovers detailed textures and plausible illumination. Each instance has 8 input views. [Video]

Method                               MSE ↓    PSNR ↑   SSIM ↑   LPIPS ↓   FID ↓
NeRF∗ (Mildenhall et al., 2020)      0.0393   16.0     0.698    0.287     231.7
MetaNeRF (Tancik et al., 2021)       0.0755   11.4     0.345    0.666     394.5
MetaNeRF-ft (Tancik et al., 2021)    0.0791   11.3     0.500    0.542     326.8
IDR (Yariv et al., 2020)             0.0698   13.8     0.658    0.328     190.1
NeRS (Ours)                          0.0254   16.5     0.720    0.172     60.9

Table 1: Quantitative evaluation of novel-view synthesis on MVMC using fixed pseudo-ground truth cameras. To evaluate novel view synthesis in a manner consistent with previous works that assume known cameras, we obtain pseudo-ground truth cameras by manually correcting off-the-shelf recovered cameras. We evaluate against a modified NeRF (NeRF∗), a meta-learned initialization to NeRF with and without finetuning (MetaNeRF), and the volumetric surface-based IDR. NeRS significantly outperforms the baselines on all metrics on the task of novel-view synthesis with fixed cameras. See Fig. 6 for qualitative results.

Method                               MSE ↓    PSNR ↑   SSIM ↑   LPIPS ↓   FID ↓
NeRF∗ (Mildenhall et al., 2020)      0.0464   14.7     0.660    0.335     277.9
IDR (Yariv et al., 2020)             0.0454   14.4     0.685    0.297     242.3
NeRS (Ours)                          0.0338   15.4     0.675    0.221     92.5

Table 2: Quantitative evaluation of in-the-wild novel-view synthesis on MVMC. Off-the-shelf cameras estimated for in-the-wild data are inherently erroneous. This means that both training and test cameras are approximate, complicating training and evaluation. To compensate for approximate test cameras, we allow methods to refine the test camera given the test image with the model fixed (see Sec. 6 for details). Intuitively, this measures the ability of a method to synthesize a test image under some camera. We evaluate against NeRF and IDR, and find that NeRS outperforms the baselines across all metrics. See Fig. 7 for qualitative results.

Metrics. We evaluate all approaches using the traditional image similarity metrics Mean-Square Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index Measure (SSIM). We also compute the Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018), which correlates more strongly with human perceptual distance. Finally, we compute the Fréchet Inception Distance (Heusel et al., 2017) between the novel view renderings and original images as a measure of visual realism. In Tab. 1 and Tab. 2, we find that NeRS significantly outperforms the baselines in all metrics across both the fixed-camera and in-the-wild novel-view synthesis evaluations. See Fig. 6 and Fig. 7 for a visual comparison of the methods.
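For reference, the pixel-level metrics are related as below for images in [0, 1]; LPIPS, SSIM, and FID would come from standard packages (e.g. lpips, scikit-image, and a FID implementation), so only the MSE/PSNR relation is sketched here.

```python
import torch


def mse_and_psnr(pred, target):
    """pred, target: (3, H, W) float tensors with values in [0, 1]."""
    mse = torch.mean((pred - target) ** 2)
    psnr = -10.0 * torch.log10(mse)   # PSNR with a peak value of 1.0
    return mse.item(), psnr.item()
```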

Qualitative Results. In Fig. 8, we show qualitative results on our Multi-view Marketplace Cars dataset. Each car instance has between 8 and 16 views. We visualize the outputs of our reconstruction from 3 novel views. We show the rendering for both the full radiance model and the mean texture. Both of these renderings are used to compute the perceptual loss (see Sec. 3.2). We find that NeRS recovers detailed texture information and plausible illumination parameters. To demonstrate the scalability of our approach, we also evaluate on various household objects in Fig. 5. We find that a coarse, cuboid mesh is sufficient as an initialization to recover detailed shape, texture, and lighting conditions. Please refer to the project webpage for 360-degree visualizations.

5 Discussion

We present NeRS, an approach for learning neural surface models that capture geometry and surface reflectance. In contrast to volumetric neural rendering, NeRS enforces watertight and closed manifolds. This allows NeRS to model surface-based appearance effects, including view-dependent specularities and normal-dependent diffuse appearance. We demonstrate that such regularized reconstructions allow for learning from sparse in-the-wild multi-view data, enabling reconstruction of objects with diverse material properties across a variety of indoor/outdoor illumination conditions. Further, the recovery of accurate camera poses in the wild (where classic structure-from-motion fails) remains unsolved and serves as a significant bottleneck for all approaches, including ours. We tackle this problem by using realistic but approximate off-the-shelf camera poses and by introducing a new evaluation protocol that accounts for this. We hope NeRS inspires future work that evaluates in the wild and enables the construction of high-quality libraries of real-world geometry, materials, and environments through better neural approximations of shape, reflectance, and illuminants.

Limitations. Though NeRS makes use of factorized models of illumination and material reflectance, there exist some fundamental ambiguities that are difficult to resolve. For example, it is difficult to distinguish between an image of a gray car under bright illumination and an image of a white car under dark illumination. We visualize such limitations in the supplement. In addition, because the neural shape representation of NeRS is diffeomorphic to a sphere, it cannot model objects of non-genus-zero topologies.

Broader Impacts. NeRS reconstructions could be used to reveal identifiable or proprietary information (e.g., license plates). We made our best effort to blur out all license plates and other personally-identifiable information in the paper and data.

Acknowledgements. We would like to thank Angjoo Kanazawa and Leonid Keselman for useful discussion and Sudeep Dasari and Helen Jiang for helpful feedback on drafts of the manuscript. This work was supported in part by the NSF GRFP (Grant No. DGE1745016), Singapore DSTA, and the CMU Argo AI Center for Autonomous Vehicle Research.

References

Henrik Aanæs, Rasmus Ramsbøl Jensen, George Vogiatzis, Engin Tola, and Anders Bjorholm Dahl. Large-Scale Data for Multiple-View Stereopsis. IJCV, 2016.

Anand Bhattad, Aysegul Dundar, Guilin Liu, Andrew Tao, and Bryan Catanzaro. View Generalization for Single Image Textured 3D Models. In CVPR, 2021.

Sai Bi, Zexiang Xu, Kalyan Sunkavalli, David Kriegman, and Ravi Ramamoorthi. Deep 3D Capture: Geometry and Reflectance from Sparse Multi-view Images. In CVPR, 2020.

Volker Blanz and Thomas Vetter. A Morphable Model for the Synthesis of 3D Faces. In SIGGRAPH, 1999.

Mark Boss, Raphael Braun, Varun Jampani, Jonathan T. Barron, Ce Liu, and Hendrik P.A. Lensch. NeRD: Neural Reflectance Decomposition from Image Collections. In ICCV, 2021.

Thomas J Cashman and Andrew W Fitzgibbon. What Shape are Dolphins? Building 3D Morphable Models from 2D Images. TPAMI, 35(1):232–244, 2012.

Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An Information-rich 3D Model Repository. arXiv:1512.03012, 2015.

Sungjoon Choi, Qian-Yi Zhou, Stephen Miller, and Vladlen Koltun. A Large Dataset of Object Scans. arXiv:1602.02481, 2016.

Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction. In ECCV, 2016.


Mathieu Desbrun, Mark Meyer, Peter Schröder, and Alan H Barr. Implicit Fairing of Irregular Meshes using Diffusion and Curvature Flow. In SIGGRAPH, 1999.

James D Foley, Foley Dan Van, Andries Van Dam, Steven K Feiner, John F Hughes, Edward Angel, and J Hughes. Computer Graphics: Principles and Practice, volume 12110. Addison-Wesley Professional, 1996.

Yasutaka Furukawa and Carlos Hernández. Multi-view Stereo: A Tutorial. Foundations and Trends in Computer Graphics and Vision, 9(1-2):1–148, 2015.

Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a Predictable and Generative Vector Representation for Objects. In ECCV, 2016.

Georgia Gkioxari, Jitendra Malik, and Justin Johnson. Mesh R-CNN. In ICCV, 2019.

Shubham Goel, Angjoo Kanazawa, and Jitendra Malik. Shape and Viewpoints without Keypoints. In ECCV, 2020.

Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu Aubry. A Papier-mâché Approach to Learning 3D Surface Generation. In CVPR, 2018.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In NeurIPS, 2017.

David S Immel, Michael F Cohen, and Donald P Greenberg. A Radiosity Method for Non-diffuse Environments. In SIGGRAPH, 1986.

James T Kajiya. The Rendering Equation. In SIGGRAPH, 1986.

Angjoo Kanazawa, Shubham Tulsiani, Alexei A. Efros, and Jitendra Malik. Learning Category-Specific Mesh Reconstruction from Image Collections. In ECCV, 2018.

Abhishek Kar, Shubham Tulsiani, Joao Carreira, and Jitendra Malik. Category-specific Object Reconstruction from a Single Image. In CVPR, 2015.

Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3D Mesh Renderer. In CVPR, 2018.

Diederik P Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.

Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. PointRend: Image Segmentation as Rendering. In CVPR, 2020.

Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and Temples: Benchmarking Large-scale Scene Reconstruction. ACM Transactions on Graphics, 36(4):1–13, 2017.

Kiriakos N Kutulakos and Steven M Seitz. A Theory of Shape by Space Carving. IJCV, 38(3):199–218, 2000.

Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. Modular Primitives for High-Performance Differentiable Rendering. ACM Transactions on Graphics, 39(6), 2020.

Xueting Li, Sifei Liu, Shalini De Mello, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, and Jan Kautz. Online Adaptation for Consistent Mesh Reconstruction in the Wild. In NeurIPS, 2020.

Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey. BARF: Bundle-Adjusting Neural Radiance Fields. In ICCV, 2021.

Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth. NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections. In CVPR, 2021.

Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy Networks: Learning 3D Reconstruction in Function Space. In CVPR, 2019.


Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV, 2020.

Andrew Nealen, Takeo Igarashi, Olga Sorkine, and Marc Alexa. Laplacian Mesh Optimization. In GRAPHITE, 2006.

Michael Oechsle, Lars Mescheder, Michael Niemeyer, Thilo Strauss, and Andreas Geiger. Texture Fields: Learning Texture Representations in Function Space. In ICCV, 2019.

Michael Oechsle, Songyou Peng, and Andreas Geiger. UNISURF: Unifying Neural Implicit Surfaces and Radiance Fields for Multi-View Reconstruction. In ICCV, 2021.

Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep Metric Learning via Lifted Structured Feature Embedding. In CVPR, 2016.

Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation. In CVPR, 2019.

Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable Neural Radiance Fields. In ICCV, 2021.

Bui Tuong Phong. Illumination for Computer Generated Pictures. Communications of the ACM, 18(6), 1975.

Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-NeRF: Neural Radiance Fields for Dynamic Scenes. In CVPR, 2020.

Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3D Deep Learning with PyTorch3D. arXiv:2007.08501, 2020.

Gernot Riegler and Vladlen Koltun. Free View Synthesis. In ECCV, 2020.

Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. "GrabCut": Interactive Foreground Extraction using Iterated Graph Cuts. ACM Transactions on Graphics, 23(3):309–314, 2004.

Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-Motion Revisited. In CVPR, 2016.

Thomas Schöps, Torsten Sattler, and Marc Pollefeys. BAD SLAM: Bundle Adjusted Direct RGB-D SLAM. In CVPR, 2019.

Nima Sedaghat and Thomas Brox. Unsupervised Generation of a Viewpoint Annotated Car Dataset from Videos. In ICCV, 2015.

Steven M Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. A Comparison and Evaluation of Multi-view Stereo Reconstruction Algorithms. In CVPR, 2006.

Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhofer. DeepVoxels: Learning Persistent 3D Feature Embeddings. In CVPR, 2019a.

Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations. In NeurIPS, 2019b.

Pratul P Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben Mildenhall, and Jonathan T Barron. NeRV: Neural Reflectance and Visibility Fields for Relighting and View Synthesis. In CVPR, 2021.

Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. In NeurIPS, 2020.

Matthew Tancik, Ben Mildenhall, Terrance Wang, Divi Schmidt, Pratul P. Srinivasan, Jonathan T. Barron, and Ren Ng. Learned Initializations for Optimizing Coordinate-Based Neural Representations. In CVPR, 2021.


Justus Thies, Michael Zollhöfer, and Matthias Nießner. Deferred Neural Rendering: Image Synthesis using Neural Textures. ACM Transactions on Graphics, 38(4):1–12, 2019.

Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view Supervision for Single-view Reconstruction via Differentiable Ray Consistency. In CVPR, 2017.

Shubham Tulsiani, Nilesh Kulkarni, and Abhinav Gupta. Implicit Mesh Reconstruction from Unannotated Image Collections. arXiv:2007.08504, 2020.

Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv:1607.08022, 2016.

Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction. In NeurIPS, 2021a.

Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. NeRF--: Neural Radiance Fields Without Known Camera Parameters. arXiv:2102.07064, 2021b.

Shangzhe Wu, Ameesh Makadia, Jiajun Wu, Noah Snavely, Richard Tucker, and Angjoo Kanazawa. De-rendering the World's Revolutionary Artefacts. In CVPR, 2021.

Yang Xiao, Xuchong Qiu, Pierre-Alain Langlois, Mathieu Aubry, and Renaud Marlet. Pose from Shape: Deep Pose Estimation for Arbitrary 3D Objects. In BMVC, 2019.

Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective Transformer Nets: Learning Single-view 3D Object Reconstruction without 3D Supervision. In NeurIPS, 2016.

Gengshan Yang, Deqing Sun, Varun Jampani, Daniel Vlasic, Forrester Cole, Huiwen Chang, Deva Ramanan, William T Freeman, and Ce Liu. LASR: Learning Articulated Shape Reconstruction from a Monocular Video. In CVPR, 2021.

Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron Lipman. Multiview Neural Surface Reconstruction by Disentangling Geometry and Appearance. In NeurIPS, 2020.

Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume Rendering of Neural Implicit Surfaces. arXiv:2106.12052, 2021.

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR, 2018.

Xiuming Zhang, Sean Fanello, Yun-Ta Tsai, Tiancheng Sun, Tianfan Xue, Rohit Pandey, Sergio Orts-Escolano, Philip Davidson, Christoph Rhemann, Paul Debevec, et al. Neural Light Transport for Relighting and View Synthesis. ACM Transactions on Graphics, 40(1):1–17, 2021a.

Xiuming Zhang, Pratul P Srinivasan, Boyang Deng, Paul Debevec, William T Freeman, and Jonathan T Barron. NeRFactor: Neural Factorization of Shape and Reflectance Under an Unknown Illumination. In SIGGRAPH Asia, 2021b.

6 Appendix

We describe an additional ablation on view-dependence in Sec. 6.1, some limitations of our illumination model in Sec. 6.2, implementation details for NeRS and the baselines in Sec. 6.3 and Sec. 6.4 respectively, and details for in-the-wild novel view synthesis in Sec. 6.5. We also evaluate the effect of increasing the number of training images on novel view synthesis in Fig. 9, ablate removing any view-dependence prediction altogether in Fig. 10, compare the learned shape models with volume carving in Fig. 11, and describe the architectures of NeRS networks in Fig. 14.


[Figure 9 plot: LPIPS (↓) on the y-axis vs. number of training images (2-14) on the x-axis, showing the mean and standard error; title: “More Training Images Improves Novel View Synthesis.”]

Figure 9: Relationship between number of training images and reconstruction quality. We quantify the number of images needed for a meaningful reconstruction using NeRS on a specific instance from MVMC with 16 total images. Given one of the images as a target image, we randomly select one of the remaining images as the initial training image. Then, we iteratively increase the number of training images, adding the image corresponding to the pseudo-ground truth pose furthest from the current set. This setup is intended to emulate how a user would actually take pictures of objects in the wild (i.e. taking wide-baseline multi-view images from far away viewpoints). Note that once the sets of training images are selected, we use the in-the-wild novel view synthesis training and evaluation protocol. In this plot, we visualize the mean and standard error over 16 runs. We find that increasing the number of images improves the quality in terms of perceptual similarity, with performance beginning to plateau after 8 images.

6.1 View-Dependence Ablation

To evaluate the importance of learning a BRDF for illumination estimation, we train a NeRS that directly conditions the radiance prediction on the position and view direction, similar to NeRF. More specifically, we concatenate the uv with the view direction ω as the input to ftex, which still predicts an RGB color value, and do not use fenv. We include the quantitative evaluation using our Multi-view Marketplace Cars (MVMC) dataset for fixed-camera novel-view synthesis in Tab. 3 and in-the-wild novel-view synthesis in Tab. 4.
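The ablated, NeRF-style texture network can be sketched as a single MLP that takes the concatenation of the surface coordinate and the viewing direction and outputs RGB directly. This is a hypothetical implementation; the widths and activations are placeholders.

```python
import torch
import torch.nn as nn


class ViewConditionedTexture(nn.Module):
    """Ablation: ftex([u, view_dir]) -> RGB, with no fenv and no Phong decomposition."""

    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),   # 3 for the sphere point u, 3 for the view direction
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, u, view_dir):
        return self.net(torch.cat([u, view_dir], dim=-1))
```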

Video qualitative results are available on the figures page¹ of the project webpage. We observe that the NeRS with NeRF-style view-direction conditioning has qualitatively similar visual artifacts to the NeRF baseline, particularly large changes in appearance despite small changes in viewing direction. This suggests that learning a BRDF is an effective way to regularize and improve generalization of view-dependent effects.

6.2 Limitations of the Illumination Model

In order to regularize lighting effects, we assume all lighting predicted by fenv to be grayscale. For some images with non-white lighting (see Fig. 12), this can cause the lighting color to be baked into the predicted texture in an unrealistic manner. In addition, if objects are gray in color, there is a fundamental ambiguity as to whether the luminance is due to the object texture or the lighting, e.g. a dark car with bright illumination or a light car with dark illumination (see Fig. 13).

6.3 Implementation Details and Hyper-Parameters

To encourage plausible lighting, we found it helpful to also optimize a perceptual loss on the illumination of the mean texture, similar to Wu et al. (2021).

¹ https://jasonyzhang.com/ners/paper_figures


[Figure 10 columns: Reference Image; No fenv; NeRS (Ours); Illum. of Mean Tex.; Env. Map.]

Figure 10: Comparison with NeRS trained without view dependence. Here we compare the full NeRS (column 3) with a NeRS trained without view dependence by only rendering using ftex without fenv (column 2). We find that NeRS trained without view-dependence cannot capture lighting effects when they are inconsistent across images. The difference between NeRS trained with and without view-dependence is best viewed in video form. We also visualize the environment map and the illumination of the car with the mean texture. The environment maps show that the light is coming primarily from one side for the first car, uniformly from all directions for the second car, and strongly from the front left for the third car. [Video]

Method                               MSE ↓    PSNR ↑   SSIM ↑   LPIPS ↓   FID ↓
NeRF∗ (Mildenhall et al., 2020)      0.0393   16.0     0.698    0.287     231.7
MetaNeRF (Tancik et al., 2021)       0.0755   11.4     0.345    0.666     394.5
MetaNeRF-ft (Tancik et al., 2021)    0.0791   11.3     0.500    0.542     326.8
IDR (Yariv et al., 2020)             0.0698   13.8     0.658    0.328     190.1
NeRS (Ours)                          0.0254   16.5     0.720    0.172     60.9
NeRS + NeRF-style View-dep.          0.0315   15.6     0.68     0.271     285.3

Table 3: Quantitative evaluation of novel-view synthesis on MVMC using fixed pseudo-ground truth cameras. Here, we evaluate novel view synthesis with fixed pseudo-ground truth cameras, constructed by manually correcting off-the-shelf cameras that are jointly optimized by our method. In addition to the baselines from the main paper, we compare against an ablation of our approach that directly conditions the radiance on the uv and viewing direction (“NeRS + NeRF View-dep”) in a manner similar to NeRF.

Target Camera      Method                               MSE ↓    PSNR ↑   SSIM ↑   LPIPS ↓   FID ↓
Approx.            NeRF∗ (Mildenhall et al., 2020)      0.0657   12.7     0.604    0.386     290.7
Camera             IDR (Yariv et al., 2020)             0.0836   11.4     0.591    0.413     251.6
                   NeRS (Ours)                          0.0527   13.3     0.620    0.284     100.9
                   NeRS + NeRF-style View-dep.          0.0573   13.0     0.624    0.356     296.4
Refined            NeRF∗ (Mildenhall et al., 2020)      0.0464   14.7     0.660    0.335     277.9
Approx.            IDR (Yariv et al., 2020)             0.0454   14.4     0.685    0.297     242.3
Camera             NeRS (Ours)                          0.0338   15.4     0.675    0.221     92.5
                   NeRS + NeRF-style View-dep.          0.0482   13.9     0.659    0.316     293.5

Table 4: Quantitative evaluation of in-the-wild novel-view synthesis on MVMC. To evaluate in-the-wild novel-view synthesis, each method is trained with (and can refine) approximate cameras estimated using Xiao et al. (2019). To compensate for the approximate test camera, we allow each method to refine the test camera given the target image. In this table, we show the performance before (“Approx. Camera”) and after (“Refined Approx. Camera”) this refinement. All methods improve with the refinement. In addition, we compare against an ablation of our approach that directly conditions the radiance on the uv and viewing direction (“NeRS + NeRF View-dep”) in a manner similar to NeRF.


[Figure 11 panels: Reference Image | Initial Cameras | Pretrained Cameras | Optimized Cameras | Predicted Shape | Predicted Textured Mesh]

Figure 11: Shape from Silhouettes using Volume Carving. We compare shapes carved from the silhouettes of the training views with the shape model learned by our approach. We construct a voxel grid of size 128³ and keep only the voxels that are visible when projected to the masks using the off-the-shelf cameras (“Initial Cameras”), the pre-trained cameras from Stage 1 (“Pretrained Cameras”), and the final cameras after Stage 4 (“Optimized Cameras”). We compare this with the shape model output by fshape. We show the nearest-neighbor training view and the final NeRS rendering for reference. While volume carving can appear reasonable given sufficiently accurate cameras, we find that the shape model learned by NeRS is qualitatively a better reconstruction. In particular, the model learns to correctly output the right side of the pickup truck and reconstructs the side-view mirrors from the texture cues, suggesting that a joint optimization of the shape and appearance is useful. Also, we note that the more “accurate” optimized cameras are themselves outputs of NeRS. [Video]

Figure 12: Non-white lighting. Our illumination model assumes environment lighting to be grayscale. For this car, the indoor lighting is yellow, which causes the yellow lighting hue to be baked into the predicted texture (Predicted Texture) rather than the illumination (Illum. of Mean Texture).

During optimization, we render an image I with the full texture using ftex as well as a rendering Ī using the mean output of ftex. We compute a perceptual loss on both I and Ī. We weight these two perceptual losses differently, as shown in Tab. 5. To compute illumination, we sample environment rays uniformly across the unit sphere and compute the normals corresponding to each vertex, ignoring rays pointing in the opposite direction of the normal.
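A minimal sketch of these two ingredients is given below. The helper names, the exact weighting, and the simple averaging over environment rays are assumptions for illustration; the actual renderer and perceptual network are not shown.

```python
import torch

def combined_perceptual_loss(render_full, render_mean, target,
                             perceptual_fn, w_full=0.5, w_mean=0.15):
    """Weighted sum of perceptual losses on the full-texture rendering I and
    the mean-texture rendering Ī (weights as in Table 5, Stage 4)."""
    return (w_full * perceptual_fn(render_full, target)
            + w_mean * perceptual_fn(render_mean, target))

def vertex_illumination(env_dirs, env_radiance, vertex_normals):
    """Accumulate grayscale environment radiance at each vertex, ignoring rays
    that point away from (i.e., behind) the surface normal.
    env_dirs:        (R, 3) unit directions sampled uniformly on the sphere
    env_radiance:    (R,)   grayscale radiance predicted for each direction
    vertex_normals:  (V, 3) unit normals
    """
    cosines = vertex_normals @ env_dirs.T           # (V, R)
    visible = torch.clamp(cosines, min=0.0)         # zero out back-facing rays
    return (visible * env_radiance[None, :]).mean(dim=1)  # (V,)
```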

6.4 Implementation Details of Baselines

NeRF² (Mildenhall et al., 2020): We find that a vanilla NeRF struggles when given sparse views. As such, we make the following changes to make the NeRF baseline as competitive as possible: 1. we add a mask loss that forces rays to either pass through or be absorbed entirely by the neural volume, analogous to space carving (Kutulakos and Seitz, 2000) (see Fig. 11); 2. we use a canonical volume that zeros out the density outside of a tight box where the car is likely to be, which helps avoid spurious “cloudy” artifacts in novel views; 3. we use a smaller architecture to reduce overfitting; and 4. we use a single rendering pass rather than dual coarse/fine rendering passes. In the main paper, we denote this modified NeRF as NeRF∗. For in-the-wild novel-view synthesis, we refine the training and test cameras directly, similar to Wang et al. (2021b).
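The first two modifications could be implemented along the lines of the sketch below. The function names and the binary-cross-entropy form of the silhouette loss are illustrative assumptions, not the released baseline code.

```python
import torch
import torch.nn.functional as F

def mask_loss(ray_opacity, mask_values):
    """Force rays inside the silhouette to be fully absorbed (opacity -> 1)
    and rays outside to pass through (opacity -> 0)."""
    return F.binary_cross_entropy(ray_opacity.clamp(1e-5, 1 - 1e-5), mask_values)

def apply_canonical_box(sigma, points, box_min, box_max):
    """Zero out the predicted density for 3-D samples falling outside a tight
    axis-aligned box where the object is expected to lie."""
    inside = ((points >= box_min) & (points <= box_max)).all(dim=-1)
    return sigma * inside.float()
```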

MetaNeRF³ (Tancik et al., 2021): We fine-tuned the pre-trained 25-view ShapeNet model on our dataset. We used the default hyper-parameters for the MetaNeRF baseline. For MetaNeRF-ft, we increased the number of samples to 1024, trained with a learning rate of 0.5 for 20,000 iterations, then lowered the learning rate to 0.05 for 50,000 iterations.

² https://github.com/facebookresearch/pytorch3d/tree/main/projects/nerf
³ https://github.com/tancik/learnit


Figure 13: Gray texture-illumination ambiguity. For gray cars, there is a brightness ambiguity between the texture and the lighting. Although the car shown here is silver, another possible interpretation is that the car is dark gray with very bright illumination and specularity.

[Figure 14 diagram: ftex maps the positional encoding PE(6) of (u, v) through 7 blocks of FCN(256), IN(256), and LeakyReLU, followed by FCN(3) and a sigmoid to produce RGB; fenv maps PE(4) of ω through 7 blocks of FCN(64), IN(64), and LeakyReLU, followed by FCN(1) and a ReLU to produce Li; fshape maps PE(6) of (u, v) through 2 blocks of FCN(128), IN(128), and LeakyReLU, followed by FCN(3) and a residual addition to produce XYZ.]

Figure 14: Network Architectures. Here, we show the architecture diagrams for ftex, fenv, and fshape. We parameterize (u, v) coordinates as 3-D coordinates on the unit sphere. Following Mildenhall et al. (2020); Tancik et al. (2020), we use positional encoding to facilitate learning high-frequency functions, with 6 sine and cosine bases (thus mapping 3-D to 36-D). We use blocks of fully connected layers, instance norm (Ulyanov et al., 2016), and Leaky ReLU. To ensure the texture is between 0 and 1 and the illumination is non-negative, we use a final sigmoid and ReLU activation for ftex and fenv respectively. We pre-train fshape to output the category-specific mesh template. Given a category-specific mesh, we use an off-the-shelf approach to recover the mesh-to-sphere mapping.⁴
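A PyTorch sketch of the repeated FCN/instance-norm/LeakyReLU block and how ftex could be assembled from it is given below. This is an approximation based on the diagram and caption; layer details in the released code may differ, and the way instance norm is applied here (over the batch of uv samples) is an assumption.

```python
import torch
import torch.nn as nn

class PEBlock(nn.Module):
    """Fully connected layer + InstanceNorm1d + LeakyReLU, as sketched in Figure 14."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.norm = nn.InstanceNorm1d(out_dim)
        self.act = nn.LeakyReLU()

    def forward(self, x):  # x: (N, in_dim), N > 1 uv samples
        h = self.fc(x)
        # InstanceNorm1d expects (batch, channels, length); treat the sample
        # dimension as the length and normalize each channel across samples.
        h = self.norm(h.T.unsqueeze(0)).squeeze(0).T
        return self.act(h)

class TextureMLP(nn.Module):
    """ftex sketch: PE(6) of (u, v) -> 7 blocks of width 256 -> FCN(3) -> sigmoid."""
    def __init__(self, pe_dim=36, width=256, n_blocks=7):
        super().__init__()
        blocks = [PEBlock(pe_dim, width)]
        blocks += [PEBlock(width, width) for _ in range(n_blocks - 1)]
        self.blocks = nn.Sequential(*blocks)
        self.head = nn.Sequential(nn.Linear(width, 3), nn.Sigmoid())

    def forward(self, pe_uv):  # pe_uv: (N, 36) positionally encoded sphere coords
        return self.head(self.blocks(pe_uv))
```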

IDR⁵ (Yariv et al., 2020): Because each instance in our dataset has fewer images than DTU, we increased the number of epochs by a factor of 5 and adjusted the learning rate scheduler accordingly. IDR supports both fixed cameras (which we use for fixed-camera novel-view synthesis) and trained cameras (which we use for in-the-wild novel-view synthesis).

6.5 In-the-wild Novel-view Synthesis Details

For in-the-wild novel-view synthesis without ground-truth cameras, we aim to evaluate the capability of each approach to recover a meaningful 3D representation while only using approximate off-the-shelf cameras. Given training images with associated approximate cameras, each method is required to output a 3D model. We then evaluate whether this 3D model can generate a held-out target image under some camera pose.

In practice, we recover approximate cameras using Xiao et al. (2019). However, as these comprise only a rotation, we further refine them using the initial template and foreground mask.

⁴ https://github.com/icemiliang/spherical_harmonic_maps
⁵ https://github.com/lioryariv/idr


Stage | Optimize              | Num. Iters | Mask | DT | Chamfer | Perc. | Perc. mean | Reg.
1     | Cameras {Πi}, i = 1…N | 500        | 2    | 20 | 0.02    | 0     | 0          | 0
2     | Above + fshape        | 500        | 2    | 20 | 0.02    | 0     | 0          | 0.1
3     | Above + ftex          | 1000       | 2    | 20 | 0.02    | 0.5   | 0          | 0.1
4     | Above + fenv, α, ks   | 500        | 2    | 20 | 0.02    | 0.5   | 0.15       | 0.1
(The Mask, DT, Chamfer, Perc., Perc. mean, and Reg. columns are loss weights.)

Table 5: Multi-stage optimization loss weights and parameters. We employ a 4-stage training process: first optimizing just the cameras before sequentially optimizing the shape parameters, texture parameters, and illumination parameters. In this table, we list the parameters optimized during each stage as well as the number of training iterations and loss weights. All stages use the Adam optimizer (Kingma and Ba, 2015). “Perc.” refers to the perceptual loss on the rendered image with the full texture, while “Perc. mean” refers to the perceptual loss on the rendered image with the mean texture.
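For reference, the schedule in Table 5 can be written out as a simple configuration. This is a hypothetical representation of the table, not the released training code.

```python
# Each entry: (parameters newly added at this stage, iterations, loss weights).
STAGES = [
    ("cameras",            500,  dict(mask=2, dt=20, chamfer=0.02, perc=0.0, perc_mean=0.0,  reg=0.0)),
    ("+ f_shape",          500,  dict(mask=2, dt=20, chamfer=0.02, perc=0.0, perc_mean=0.0,  reg=0.1)),
    ("+ f_tex",            1000, dict(mask=2, dt=20, chamfer=0.02, perc=0.5, perc_mean=0.0,  reg=0.1)),
    ("+ f_env, alpha, k_s", 500, dict(mask=2, dt=20, chamfer=0.02, perc=0.5, perc_mean=0.15, reg=0.1)),
]
```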

The initial approximate cameras we use across all methods are in fact equivalent to those obtained by our method after Stage 1 of training, and depend upon: i) the prediction from Xiao et al. (2019), and ii) the input mask and template shape. During training, each method learns a 3D model while jointly optimizing the camera parameters using gradient descent. Similarly, at test time, we refine the approximate test camera with respect to the training loss function for each method (i.e., (7) for NeRS, MSE for NeRF, and RGB+Mask loss for IDR) using the Adam optimizer for 400 steps. We show all metrics before and after this camera refinement in Tab. 4.
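A minimal sketch of this test-time camera refinement step is shown below, assuming a differentiable renderer and the per-method loss are passed in as callables. The function and argument names are hypothetical, and the learning rate is a placeholder rather than the paper's value.

```python
import torch

def refine_test_camera(camera_params, target_image, render_fn, loss_fn,
                       lr=1e-3, steps=400):
    """Refine an approximate test camera against the target image by
    minimizing the method's own training loss with Adam for 400 steps."""
    params = camera_params.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        rendered = render_fn(params)       # e.g., the NeRS / NeRF / IDR renderer
        loss = loss_fn(rendered, target_image)
        loss.backward()
        optimizer.step()
    return params.detach()
```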
