

Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views

Supplementary Material

Hao Su*, Charles R. Qi*, Yangyan Li, Leonidas J. Guibas
Stanford University

A. Organization of the Document

This document provides additional quantitative results, technical details, and test examples that supplement the main paper. We first present additional quantitative results, namely a further evaluation of azimuth estimation (Sec B). We then give more technical details on the synthesis pipeline (Sec C) and the network architecture (Sec D). In Sec E, we provide more details of our 3D model dataset. Lastly, we show more visualized test examples (Sec F).

Figure 1. Viewpoint estimation performance using positive VDPM detection windows on PASCAL VOC 2012 val. Left: mean viewpoint estimation accuracy (mVP) as a function of the azimuth angle error threshold δθ; a viewpoint is counted as correct if the distance in degrees between prediction and groundtruth is less than δθ. Compared methods: Ours-Joint-Top2, Ours-Joint, Ours-Render, Ours-Real, Real-cls-dep, VDPM, and Chance. Right: medians of azimuth estimation errors (in degrees); the lower, the better. Refer to Table 1 for the settings of the methods in comparison.
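For concreteness, the two quantities plotted in Figure 1 can be computed as in the following sketch, which assumes predictions and groundtruths are given as azimuth angles in degrees; the actual evaluation script is not part of this supplement.

```python
import numpy as np

def azimuth_error(pred_deg, gt_deg):
    """Smallest absolute azimuth difference, accounting for the 360-degree wrap-around."""
    d = np.abs(np.asarray(pred_deg, float) - np.asarray(gt_deg, float)) % 360.0
    return np.minimum(d, 360.0 - d)

def mean_viewpoint_accuracy(pred_deg, gt_deg, delta):
    """Fraction of predictions within delta degrees of groundtruth (left plot of Figure 1)."""
    return float(np.mean(azimuth_error(pred_deg, gt_deg) < delta))

def median_azimuth_error(pred_deg, gt_deg):
    """Median azimuth error in degrees (right plot of Figure 1)."""
    return float(np.median(azimuth_error(pred_deg, gt_deg)))

# Example: accuracy as a function of the threshold delta_theta.
pred, gt = [10.0, 350.0, 90.0], [15.0, 5.0, 270.0]
curve = [mean_viewpoint_accuracy(pred, gt, d) for d in range(0, 50, 5)]
```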

B. Comparison with VDPM for Viewpoint Estimation on VDPM Bounding Boxes (Sec 5.2)

In the main paper, we compare the viewpoint estimation performance of our methods against VDPM using detection windows from R-CNN. Here, we compare these methods again using detection windows from VDPM, i.e., we estimate the viewpoint of objects inside VDPM detection windows. We also include the Average Precision (AP) of both the DPM and R-CNN (with bounding box regression) detectors in Table 2. To make this material more self-contained, we summarize the settings of all methods in Table 1.

Results are shown in Figure 1. The trend is unchanged, except that in Figure 1 VDPM (16V) is slightly better than Real-cls-dep (the model trained with real images, without our new loss function). Note that R-CNN detects many more difficult cases than VDPM in terms of occlusion and truncation; in other words, Figure 1 focuses the comparison of our methods and VDPM on simpler cases.

C. Synthesis Pipeline Details (Sec 4.1)

The synthesis pipeline is illustrated in Figure 2.

3D Model Normalization Following the convention of the PASCAL 3D+ annotation, all 3D models are normalized so that their bounding cube is centered at the origin and has diagonal length 1.
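A minimal sketch of this normalization, assuming the model is available as an N×3 vertex array and taking the tight axis-aligned bounding box as the bounding cube (the actual pipeline operates on the mesh files directly):

```python
import numpy as np

def normalize_model(vertices):
    """Center the axis-aligned bounding box at the origin and scale its diagonal to length 1."""
    v = np.asarray(vertices, dtype=np.float64)
    vmin, vmax = v.min(axis=0), v.max(axis=0)
    center = (vmin + vmax) / 2.0
    diagonal = np.linalg.norm(vmax - vmin)   # length of the bounding-box diagonal
    return (v - center) / diagonal           # diagonal length becomes 1
```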

Rendering The parameters that affect the rendering include the lighting condition and the camera extrinsics and intrinsics. To render each image, we sample a set of these parameters from the following distributions:

• Lighting condition. We add N point lights and enable the environmental light with energy E ∼ N(0, 1). N is uniformly sampled from 0 to 6, and all lighting parameters are sampled i.i.d. The position p_light of each point light is uniformly sampled on a sphere whose radius is uniformly sampled from 8 to 20, with longitude between 0° and 360° and latitude between −90° and 90°; its energy is E ∼ N(2, 2); its color is fixed to be white. (A sampling sketch is given after this list.)

• Camera extrinsics. We first describe the camera position parameters (i.e., the translation parameters). In the polar coordinate system, let ρ be the distance of the optical center from the origin, and let θ and φ be the longitude and latitude, respectively. We use kernel density estimation (KDE) to estimate the non-parametric distributions of ρ, θ, and φ for each category from the VOC 12 train set of the PASCAL3D+ dataset. We then draw i.i.d. samples from the estimated distributions for rendering. See Figure 4 for the estimation results for chairs; a small KDE sketch accompanies that figure below.

We then describe the camera pose parameters (i.e., the rotation parameters). Similar to the position parameters, we use KDE to estimate the distribution of in-plane rotation for each category and generate i.i.d. samples for rendering. We set the camera to look at the origin, with the image plane perpendicular to the ray from the optical center to the origin.


name            | features learned from data | trained with real data | trained with synthetic data | geometric structure aware loss function
VDPM            | no (HoG)                   | yes                    | no                          | no (16 DPMs)
Real-cls-dep    | yes                        | yes                    | no                          | no
Ours-Real       | yes                        | yes                    | no                          | yes
Ours-Render     | yes                        | no                     | yes                         | yes
Ours-Joint      | yes                        | yes                    | yes                         | yes
Ours-Joint-Top2 | yes                        | yes                    | yes                         | yes

Table 1. Summary of the settings of the methods in the main paper and in Figure 1 of this appendix. Note that Ours-Joint-Top2 takes the top-2 viewpoint proposals with the highest confidences after non-maximum suppression.
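The Ours-Joint-Top2 setting can be made concrete with a small sketch: find local peaks of the 360-bin azimuth confidence with a simple circular non-maximum suppression and keep the two most confident ones. The window size and the exact peak-picking rule below are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def top2_viewpoint_proposals(confidence, window=15):
    """Circular NMS over a 360-bin azimuth confidence vector; return the two best peak angles."""
    conf = np.asarray(confidence, dtype=np.float64)   # shape (360,), one confidence per degree bin
    peaks = []
    for a in range(360):
        neighborhood = max(conf[(a + o) % 360] for o in range(-window, window + 1))
        if conf[a] >= neighborhood:                    # local maximum within the circular window
            peaks.append((conf[a], a))
    peaks.sort(reverse=True)                           # most confident peaks first
    return [angle for _, angle in peaks[:2]]
```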

Figure 2. Scalable and overfit-resistant synthesis pipeline. (Diagram: hyper-parameters are estimated from real images; lighting, camera, and cropping parameters are sampled; the 3D model is rendered; a sampled background image is added by alpha-blending composition; the rendered image is cropped.)


• Camera intrinsics. The focal length and aspect ratio are fixed to 35 and 1.0, respectively.
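The lighting sampling described in the list above is simple enough to write down directly. In this minimal sketch the second parameter of N(·, ·) is treated as the standard deviation (the text does not specify), and light positions are kept in spherical coordinates; conversion to Cartesian coordinates is left to the renderer.

```python
import numpy as np

def sample_lighting(rng=np.random):
    """Sample the environment light energy, the number of point lights, and per-light parameters (i.i.d.)."""
    env_energy = rng.normal(0.0, 1.0)            # E ~ N(0, 1) for the environmental light
    num_lights = rng.randint(0, 7)               # N uniform over {0, ..., 6}
    lights = []
    for _ in range(num_lights):
        lights.append({
            "distance":  rng.uniform(8.0, 20.0),     # radius of the sphere the light sits on
            "longitude": rng.uniform(0.0, 360.0),    # degrees
            "latitude":  rng.uniform(-90.0, 90.0),   # degrees
            "energy":    rng.normal(2.0, 2.0),       # E ~ N(2, 2)
            "color":     (1.0, 1.0, 1.0),            # fixed white
        })
    return env_energy, lights
```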

We use Blender, open-source 3D graphics and animation software, to efficiently render each 3D model.
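For reference, a stripped-down sketch of what a per-view render script could look like when run as `blender --background --python render_one.py`. It assumes Blender's default startup scene (which contains a camera named "Camera"), roughly follows the Blender 2.7x Python API of that era (operator names differ in newer versions), and omits lighting and in-plane rotation for brevity; it is not the pipeline's actual script.

```python
import math
import bpy  # Blender's Python API; only available when run inside Blender

def spherical_to_xyz(distance, longitude_deg, latitude_deg):
    """Convert the sampled (rho, theta, phi) camera position to Cartesian coordinates."""
    lon, lat = math.radians(longitude_deg), math.radians(latitude_deg)
    return (distance * math.cos(lat) * math.cos(lon),
            distance * math.cos(lat) * math.sin(lon),
            distance * math.sin(lat))

def render_view(model_path, cam_pose, out_path):
    """Import the model, place the camera looking at the origin, and write one still image."""
    bpy.ops.import_scene.obj(filepath=model_path)

    cam = bpy.data.objects['Camera']
    cam.location = spherical_to_xyz(cam_pose['rho'], cam_pose['theta'], cam_pose['phi'])
    cam.data.lens = 35                           # fixed focal length from the text

    # Aim the camera at the origin via a track-to constraint on an empty placed at (0, 0, 0).
    bpy.ops.object.empty_add(location=(0.0, 0.0, 0.0))
    target = bpy.context.object
    constraint = cam.constraints.new(type='TRACK_TO')
    constraint.target = target
    constraint.track_axis, constraint.up_axis = 'TRACK_NEGATIVE_Z', 'UP_Y'

    bpy.context.scene.render.filepath = out_path
    bpy.ops.render.render(write_still=True)
```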

Background synthesis All details are described in the main paper.

Cropping As shown in Figure 3, we use the annotations provided by PASCAL3D+ and PASCAL VOC to obtain truncation patterns of objects in real images (VOC12 train set). For each real training image, we project the 3D model corresponding to the object back into the image and compute a bounding box for the projected model (the full object bounding box). Then, by comparing it with the provided groundtruth bounding box of the (visible) object, we know how the object is truncated. Specifically, we know the relative shift of the four edges from the groundtruth box to the full box. We use kernel density estimation to learn a non-parametric distribution of these relative shifts for each category.

Figure 3. Cropping parameter estimation. We estimate the cropping parameters by comparing the groundtruth bounding box and the full object bounding box. The groundtruth bounding box (green) is from PASCAL 3D+. The full object bounding box (red) is estimated by us as follows: because the real training image dataset (PASCAL 3D+) provides landmark registrations between each object instance and a similar 3D model, we can project the 3D model into the image space and estimate the full bounding box.

We then use the learned distributions to generate cropping samples for the rendered images. See Figure 5 for an example of the estimated truncation patterns of chairs.
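A minimal sketch of this truncation-pattern estimation, assuming the groundtruth (visible) box and the projected full-object box are given as (x1, y1, x2, y2) tuples; scipy's Gaussian KDE is used here as a stand-in for whatever estimator the pipeline actually uses.

```python
import numpy as np
from scipy.stats import gaussian_kde

def edge_shift_ratios(gt_box, full_box):
    """Relative shift of the left/right/top/bottom edges from the full box to the visible box,
    normalized by the full-box width (horizontal edges) or height (vertical edges)."""
    gx1, gy1, gx2, gy2 = gt_box
    fx1, fy1, fx2, fy2 = full_box
    w, h = fx2 - fx1, fy2 - fy1
    return ((gx1 - fx1) / w, (gx2 - fx2) / w,    # left, right shift ratios
            (gy1 - fy1) / h, (gy2 - fy2) / h)    # top, bottom shift ratios

def fit_cropping_kde(gt_boxes, full_boxes):
    """Fit a joint non-parametric distribution over the four shift ratios for one category."""
    ratios = np.array([edge_shift_ratios(g, f) for g, f in zip(gt_boxes, full_boxes)])
    return gaussian_kde(ratios.T)

def sample_crop(kde, rendered_full_box):
    """Apply one sampled truncation pattern to the full box of a rendered object."""
    dl, dr, dt, db = kde.resample(1)[:, 0]
    x1, y1, x2, y2 = rendered_full_box
    w, h = x2 - x1, y2 - y1
    return (x1 + dl * w, y1 + dt * h, x2 + dr * w, y2 + db * h)
```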


AP    | aero | bike | boat | bottle | bus  | car  | chair | table | mbike | sofa | train | tv   | mean
DPM   | 42.2 | 49.6 | 6.0  | 20.0   | 54.1 | 38.3 | 15.0  | 9.0   | 33.1  | 18.9 | 36.4  | 33.2 | 29.6
R-CNN | 74.0 | 66.7 | 32.9 | 31.5   | 68.0 | 58.4 | 26.9  | 39.3  | 71.5  | 44.2 | 63.1  | 63.7 | 54.5

Table 2. Average Precision (AP) on VOC 12 val from DPM and R-CNN (with bounding box regression).

Figure 4. Histograms of viewpoint parameter samples drawn from the estimated distributions (chairs as an example). The four panels show chair azimuth, chair elevation, chair tilt (in-plane rotation), and chair camera distance. The horizontal axis of each sub-figure is the angle in degrees (or the camera distance); the vertical axis is the probability of the bin. In the top-left panel, we see that the azimuth viewpoints of chairs are biased toward chair fronts facing the camera.
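The distributions whose samples are histogrammed in Figure 4 can be estimated and resampled as in the following sketch. It assumes the per-category annotations (azimuth θ, elevation φ, distance ρ, and in-plane rotation) have already been extracted from PASCAL3D+ as arrays, and it uses scipy's Gaussian KDE as a stand-in estimator; whether the pipeline fits the parameters jointly or individually is not specified, so a joint fit is shown.

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_viewpoint_kde(azimuth, elevation, distance):
    """Fit a joint non-parametric distribution over (theta, phi, rho) for one category."""
    samples = np.vstack([azimuth, elevation, distance])   # shape (3, n_annotations)
    return gaussian_kde(samples)

def sample_viewpoints(kde, n):
    """Draw n i.i.d. camera-position samples (theta, phi, rho) for rendering."""
    theta, phi, rho = kde.resample(n)
    return theta % 360.0, np.clip(phi, -90.0, 90.0), np.maximum(rho, 1e-3)

# In-plane rotation (camera roll) is fitted the same way, with a 1-D KDE:
# tilt_kde = gaussian_kde(np.asarray(tilt_annotations))
```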

D. Network Details (Sec 4.2)

We adapt the AlexNet architecture to object viewpoint estimation. For notation, conv denotes a convolutional layer including pooling and ReLU, and fc denotes a fully connected layer; the number following conv or fc, starting from 1 at the bottom, indicates the order of the layer. We keep the structure of the convolutional layers and of the fc6 and fc7 fully connected layers consistent with the R-CNN network. Since we fine-tune only the layers above conv3, the lower layers can be shared between detection and viewpoint estimation, which reduces the computation cost. The last fully connected layer is object-category specific. All categories share the layers up to and including fc7 (trained with viewpoint annotations, so the fc7 features now preserve geometric information about the image). On top of fc7, each category has a fully connected layer with output size 360 × 3 = 1080 (the factor 3 is for azimuth, elevation, and in-plane rotation). We use the geometric structure aware loss described in the main paper as the loss layer. During back-propagation, only the viewpoint losses from the object category of the instance are counted.
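A minimal sketch of the class-dependent viewpoint head in PyTorch notation; the original model was not implemented this way, and the geometric structure aware loss defined in the main paper is replaced here by plain cross-entropy for brevity.

```python
import torch
import torch.nn as nn

class ViewpointHead(nn.Module):
    """One 1080-way classifier per object category on top of shared fc7 features:
    360 azimuth bins + 360 elevation bins + 360 in-plane-rotation bins."""
    def __init__(self, num_categories, fc7_dim=4096):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(fc7_dim, 360 * 3) for _ in range(num_categories)])

    def forward(self, fc7, category_ids):
        # Only the head belonging to each instance's category is evaluated,
        # so only that category's viewpoint loss is counted during back-propagation.
        out = torch.stack([self.heads[int(c)](f) for f, c in zip(fc7, category_ids)])
        return out.view(-1, 3, 360)   # (batch, {azimuth, elevation, tilt}, 360 bins)

# Training sketch with cross-entropy standing in for the paper's loss:
# logits = head(fc7_features, category_ids)                                    # (B, 3, 360)
# loss = sum(nn.functional.cross_entropy(logits[:, i, :], targets[:, i]) for i in range(3))
```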

E. More Details on the 3D Model Set (Sec 5.1)

Table 3 lists the statistics of the models used in the main paper. All models are downloaded from ShapeNet.

Figure 5. Histograms of cropping parameter samples drawn from the estimated distributions (chairs as an example). The four panels show the chair left, right, top, and bottom shift ratios. The horizontal axis is the ratio of the difference between the groundtruth box and the full box over the width or height of the full box; the vertical axis is the probability of the bin. From the bottom-right panel we can see that the bottom parts of chairs tend to be truncated.

Name         | synset offset | num
aeroplane    | n02691156     | 4045
bicycle      | n02834778     | 59
boat         | n04530566     | 1939
bottle       | n02876657     | 498
bus          | n02924116     | 939
car          | n02958343     | 7497
chair        | n03001627     | 6928
dining table | n04379243     | 3650
motorbike    | n03790512     | 337
sofa         | n04256520     | 3173
train        | n04468005     | 389
tv monitor   | n03211117     | 1095

Table 3. Statistics of the models used in the paper.

ShapeNet is organized by the taxonomy of WordNet, and the models are pre-aligned by ShapeNet to have a consistent orientation. In WordNet (and ShapeNet), each category is indexed by a unique id called the "synset offset".

F. More Examples

See the following pages for examples of positive and negative results. The negative results are grouped by error patterns.

Positive Examples

(The gray bar indicates the viewpoint estimation confidence; the darker, the higher. The red vertical line in the gray bar indicates the groundtruth viewpoint.)

Negative examples, grouped by error pattern:

Occlusion: sofa occluded by people; car occluded by motorbike; table occluded by chair; tvmonitor occluded by arm.

Ambiguity.

Multiple Objects.

Truncation.