

A Unified Framework for Multi-View Multi-Class Object Pose Estimation

— Supplementary Material —

Chi Li, Jin Bai, Gregory D. Hager

Department of Computer Science, Johns Hopkins University

1 Introduction

In this supplementary document, we first detail our pose rectification algorithm, which is briefly discussed at the beginning of Sec. 3. Next, we provide additional numerical results for YCB-Video [1] and JHUScene-50 [2], including mPCK accuracy on groundtruth bounding boxes, PCK curves, instance segmentation accuracy, and mPCK accuracy of MVn-MCN with more than 5 views. Finally, we visualize additional pose estimation results of MCN and MV5-MCN on YCB-Video and JHUScene-50.

2 Pose Rectification via Viewpoint Centralization

A 6-DoF pose can be represented as (R, T), where R and T stand for the rotation and translation components, respectively. In the common practice of pose annotation [3, 4, 2, 1], R and T are labeled with respect to the current camera viewpoint by fitting the object model onto the observed 3D scene with the help of AR markers. Although the combined SE(3) representation (R, T) is consistent with the projected image appearance, the rotation component R can be ambiguous: the same rotation R may correspond to different object observations in the image domain. We show a simple example on the left of Fig. 1. Here, a power drill with a fixed rotation R moves from left to right along the X axis of the image plane. If we capture snapshots along this trajectory at different T, the corresponding image observations can be distinct, as the drill undergoes an out-of-plane rotation with respect to the camera viewpoint. Such inconsistency becomes a problem when we try to learn a generalizable mapping from image space to the rotation space (i.e., SO(3)). This issue has been revealed in the case of 1-D yaw angle estimation in [5]. Unfortunately, most existing learning-based 6-DoF pose estimation approaches [1, 6] ignore this problem and directly regress R from the cropped image or feature map.

Our solution is to rectify the pose as if the object were observed from the centerline of the camera. Consider the bounding box of an observed object image I_o as (x_1, y_1, x_2, y_2), where (x_1, y_1) and (x_2, y_2) are the image coordinates of the top-left and bottom-right corners. Let (c_x, c_y) be the 2D camera center on the image plane and f_x, f_y be the focal lengths along the X and Y axes, respectively. We can compute the 3D orientation v towards the center of the observed image I_o:

\[ v = \Big[ \big(\tfrac{x_1 + x_2}{2} - c_x\big)/f_x,\ \big(\tfrac{y_1 + y_2}{2} - c_y\big)/f_y,\ 1 \Big] \tag{1} \]
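For concreteness, Eq. (1) can be implemented in a few lines. The sketch below is our own NumPy illustration, assuming standard pinhole intrinsics; the function name view_direction is ours and not part of the released code.

```python
import numpy as np

def view_direction(box, cx, cy, fx, fy):
    """Eq. (1): 3D ray through the center of the observed bounding box.

    box = (x1, y1, x2, y2) gives the pixel coordinates of the top-left and
    bottom-right corners; (cx, cy) is the camera center and fx, fy are the
    focal lengths.
    """
    x1, y1, x2, y2 = box
    u = 0.5 * (x1 + x2)   # horizontal box center (pixels)
    w = 0.5 * (y1 + y2)   # vertical box center (pixels)
    return np.array([(u - cx) / fx, (w - cy) / fy, 1.0])
```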


Fig. 1: Illustration of the ambiguity in the standard pose representation (left) and our solution of viewpoint centralization (right). The blue and red colors in the right figure indicate the camera and image planes before and after centralization, respectively.

Subsequently, we compute the rectified XYZ axes of the camera coordinate system, [X_v, Y_v, Z_v], by aligning the current Z axis [0, 0, 1] to v:

\[ X_v = [0, 1, 0] \times Z_v, \quad Y_v = Z_v \times X_v, \quad Z_v = \frac{v}{\|v\|_2} \tag{2} \]

Note that the symbol × denotes the cross product. Finally, the original pose label (R, T) is projected onto these rectified XYZ axes to obtain the rectified pose (R̄, T̄):

\[ \bar{R} = R_v \cdot R, \quad \bar{T} = R_v \cdot T, \tag{3} \]

where R_v = [X_v; Y_v; Z_v] stacks the rectified axes X_v, Y_v and Z_v. The right of Fig. 1 illustrates the process of viewpoint centralization. If the depth image and camera intrinsics are available, we also rectify the XYZ value of each image pixel by transforming it by R_v. Subsequently, we construct a normalized XYZ map by centering the point cloud at its median.
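The full rectification step (Eqs. (2)-(3) plus the optional normalization of the XYZ map) is sketched below in NumPy. This is our reconstruction from the equations above rather than the authors' released code; in particular, we assume R_v stacks the rectified axes as rows, and we add an explicit normalization of X_v to keep R_v orthonormal.

```python
import numpy as np

def rectify_pose(R, T, v, xyz_map=None):
    """Viewpoint centralization of a pose label (R, T), following Eqs. (2)-(3).

    v is the viewing direction from Eq. (1); R is 3x3, T has length 3.
    If a depth-derived HxWx3 XYZ map is given, it is rotated into the
    rectified frame and centered at its median.
    """
    Zv = v / np.linalg.norm(v)                 # Eq. (2)
    Xv = np.cross([0.0, 1.0, 0.0], Zv)
    Xv /= np.linalg.norm(Xv)                   # normalization added so Rv stays orthonormal
    Yv = np.cross(Zv, Xv)
    Rv = np.stack([Xv, Yv, Zv])                # rectified axes stacked as rows (our assumption)

    R_bar = Rv @ R                             # Eq. (3)
    T_bar = Rv @ T

    if xyz_map is None:
        return R_bar, T_bar

    # Rotate every pixel's XYZ into the rectified frame and center the
    # resulting point cloud at its median to build the normalized XYZ map.
    pts = xyz_map.reshape(-1, 3) @ Rv.T
    pts = pts - np.median(pts, axis=0)
    return R_bar, T_bar, pts.reshape(xyz_map.shape)
```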

In Fig. 1, we can see that the RGB image plane is also rotated after the centralization. In principle, the original object image should therefore be warped to the new image plane. However, this warping operation only changes the image scale but not the content (there is no out-of-plane rotation), which has little effect on CNN training because the CNN pipeline requires some form of scale normalization anyway¹. As a consequence, we leave the original image static while rectifying its pose label.

¹ For example, isotropic and non-isotropic warping.

3 Quantitative Analysis

In this section, we first present additional numerical results of MCN on the YCB-Video benchmark, including mPCK accuracy on groundtruth bounding boxes, segmentation accuracy and PCK curves on individual object classes. Subsequently, we show the segmentation accuracy of MCN on the JHUScene-50 dataset. Last, we show the curves of mPCK accuracy of multi-view MCN (MVn-MCN) versus the number of views on both YCB-Video and JHUScene-50.

Evaluation Metric. For 6-DoF pose estimation, we follow the recently proposed "ADD-S" metric of [1]. The pose distance D(h, h*) between an estimate h = (R, T) and the groundtruth h* = (R*, T*) is defined as:

\[ D(h, h^*) = \frac{1}{m} \sum_{x_1 \in M} \min_{x_2 \in M} \big\| (R x_1 + T) - (R^* x_2 + T^*) \big\|_2 \tag{4} \]

where M is the set of 3D points of the object model and m = |M|. The traditional metric [3] considers a pose estimate h correct if D(h, h*) is below a threshold. [1] improves this threshold-based metric by computing the area under the accuracy-versus-threshold curve while varying the threshold within a range (i.e., [0, 0.1] m). We denote this metric as "mPCK" because the accuracy-threshold curve is essentially a PCK curve [7], and the area under the PCK curve is the mean of the PCK accuracies over the different thresholds. The segmentation accuracy is the ratio of pixels with correctly predicted mask labels to the total number of pixels.
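To make the metric concrete, a minimal sketch of ADD-S and of the mPCK score (area under the PCK curve) is given below. It is written directly from the definitions above, not taken from the official YCB-Video evaluation toolkit; it assumes SciPy is available for the nearest-neighbor query, and the exact threshold discretization used in [1] may differ.

```python
import numpy as np
from scipy.spatial import cKDTree  # any nearest-neighbor search would do

def add_s(R, T, R_gt, T_gt, model_pts):
    """Eq. (4): mean closest-point distance (ADD-S) between the model
    transformed by the estimate (R, T) and by the groundtruth (R_gt, T_gt).
    model_pts is an (m, 3) array of 3D model points."""
    est = model_pts @ R.T + T
    gt = model_pts @ R_gt.T + T_gt
    dists, _ = cKDTree(gt).query(est, k=1)
    return dists.mean()

def mpck(errors, max_thresh=0.1, num_steps=100):
    """Area under the PCK curve: mean accuracy over error thresholds
    sampled uniformly in [0, max_thresh]."""
    errors = np.asarray(errors)
    thresholds = np.linspace(0.0, max_thresh, num_steps)
    return float(np.mean([(errors < t).mean() for t in thresholds]))
```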

3.1 YCB-Video

Table 1 reports the mPCK accuracies of all object classes on groundtruth (GT) object bounding boxes. We can see that mPCK on GT is better than mPCK on detected bounding boxes (shown in the main paper). Notably, the gap in mPCK between GT and detection is larger on RGB than on RGB-D. This is because we rely on the actual image scale of the bounding box to recover the 3D translation for RGB input; consequently, the translation estimate is sensitive to jittering of the object bounding boxes. Additionally, Table 1 also reports the segmentation accuracies of MCN on both RGB and RGB-D. It is somewhat surprising that the segmentation performance on RGB alone is quite competitive with that on RGB-D; the segmentation accuracy on RGB is even higher than on RGB-D for some classes such as "008 pudding box". This implies that RGB images offer the critical details for instance segmentation. Last, we show the PCK curves of MCN in Fig. 3. In each sub-figure of Fig. 3, the blue and red curves correspond to the PCK curves of MCN on RGB and RGB-D, respectively.

3.2 JHUScene-50

Table 2 shows the segmentation accuracies of MCN on RGB and RGB-D data. We observe a similar phenomenon: the segmentation accuracy on RGB is only slightly worse than the result on RGB-D. Additionally, the overall segmentation accuracy on JHUScene-50 is roughly 10% lower than that on YCB-Video, which is consistent with the fact that MCN achieves lower mPCK accuracy on JHUScene-50 than on YCB-Video. This is mainly because MCN is trained on synthetic images only for JHUScene-50, while a mixture of synthetic and real training images is used for YCB-Video. Further, the real test data of JHUScene-50 contain heavier occlusion and more diverse cluttered backgrounds.


Object                   mPCK on GT (RGB / RGB-D)   Segmentation Accuracy (RGB / RGB-D)
002 master chef can      91.2 / 94.4                92.5 / 94.5
003 cracker box          78.5 / 86.3                88.8 / 89.2
004 sugar box            85.1 / 93.8                91.8 / 94.5
005 tomato can           93.3 / 94.2                90.0 / 94.6
006 mustard bottle       91.9 / 96.8                97.3 / 97.5
007 tuna fish can        95.2 / 95.8                90.2 / 93.5
008 pudding box          84.9 / 90.9                69.3 / 61.3
009 gelatin can          92.1 / 95.6                90.7 / 92.9
010 potted meat can      90.8 / 87.0                87.7 / 88.8
011 banana               70.0 / 94.9                95.7 / 96.3
019 pitcher base         91.1 / 94.6                94.2 / 96.0
021 bleach cleanser      86.8 / 94.4                94.8 / 96.8
024 bowl                 85.0 / 83.1                93.5 / 85.8
025 mug                  91.9 / 95.5                89.5 / 87.9
035 power drill          87.2 / 91.3                85.4 / 89.9
036 wood block           87.2 / 83.7                83.5 / 89.1
037 scissors             80.2 / 75.0                92.8 / 92.5
040 large marker         66.4 / 89.2                89.0 / 93.4
051 large clamp          86.5 / 92.7                88.0 / 90.4
052 larger clamp         79.5 / 87.5                92.1 / 92.9
061 foam brick           79.2 / 93.9                90.6 / 91.5
All                      86.9 / 91.0                89.9 / 90.9

Table 1: mPCK (on groundtruth bounding boxes) and instance segmentation accuracies of MCN on the YCB-Video dataset.

3.3 Multi-View Performance

We plot the mPCK accuracies of multi-view MCN (MVn-MCN) versus the number of views in Fig. 2. For each test frame, we randomly sample 5, 10, 20, or 30 additional views from the same sequence and run MVn-MCN to estimate the final pose. If the total number of views in a sequence is smaller than the number of sampled views, we simply use all frames. We can see that the mPCK accuracy consistently increases with the number of views, which indicates that our proposed multi-view framework is effective in selecting a better estimate than the default top-1 result from the top-k hypothesis pool.
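The evaluation protocol above can be summarized by the following skeleton. The hypothesis generation and the cross-view scoring are defined in the main paper and are only represented here by placeholder callables (hypotheses_fn, score_fn); this is an illustration of the sampling and selection loop, not the actual MVn-MCN implementation.

```python
import random

def mvn_estimate(frame, sequence, n_views, hypotheses_fn, score_fn):
    """Skeleton of the multi-view evaluation protocol described above.

    hypotheses_fn(frame) returns the top-k pose hypotheses for the target
    frame (the default single-view answer is its first element), and
    score_fn(h, views) rates a hypothesis against a set of views; both are
    placeholders for components defined in the main paper.
    """
    others = [f for f in sequence if f is not frame]
    # Use all frames if the sequence is shorter than the requested number of views.
    extra = others if len(others) <= n_views else random.sample(others, n_views)
    pool = hypotheses_fn(frame)
    return max(pool, key=lambda h: score_fn(h, [frame] + extra))
```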

4 Qualitative Analysis

We visualize more pose estimation results on YCB-Video in Fig. 4 and on JHUScene-50 in Fig. 5. We observe that pose estimates from RGB images alone are in general less accurate, while results on RGB-D are more stable and precise. The multi-view algorithm is capable of boosting the single-view result on RGB-D data, especially for objects with symmetric geometry such as cups, bottles and bowls.


Object      RGB     RGB-D
drill 1     73.7    74.7
drill 2     69.6    73.3
drill 3     75.0    79.4
drill 4     69.2    76.0
hammer 1    88.2    89.5
hammer 2    87.1    88.9
hammer 3    81.7    83.3
hammer 4    87.7    88.5
hammer 5    88.4    88.6
sander      79.4    79.1
All         80.0    82.1

Table 2: Instance segmentation accuracies of MCN on the JHUScene-50 dataset [2].

[Fig. 2 plots omitted: sub-figures (a) YCB-Video and (b) JHUScene-50; the X axis is the number of views (1, 5, 10, 20, 30) and the Y axis is mPCK (%).]

Fig. 2: Plots of mPCK accuracies of multi-view MCN (MVn-MCN) versus the number of views on YCB-Video (left) and JHUScene-50 (right). In each plot, the blue curve is for RGB and the red curve is for RGB-D.

References

1. Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199 (2017)

2. Li, C., Bohren, J., Carlson, E., Hager, G.D.: Hierarchical semantic parsing for object pose estimation in densely cluttered scenes. In: ICRA (2016)

3. Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski, G., Konolige, K., Navab, N.: Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In: Computer Vision – ACCV 2012. Springer (2013)

4. Brachmann, E., Krull, A., Michel, F., Gumhold, S., Shotton, J., Rother, C.: Learning 6D object pose estimation using 3D object coordinates. In: ECCV. Springer (2014)

5. Mousavian, A., Anguelov, D., Flynn, J., Kosecka, J.: 3D bounding box estimation using deep learning and geometry. In: CVPR. IEEE (2017)

6. Kehl, W., Manhardt, F., Tombari, F., Ilic, S., Navab, N.: SSD-6D: Making RGB-based 3D detection and 6D pose estimation great again. In: ICCV (2017)

7. Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-parts. In: CVPR (2011)


[Fig. 3 plots omitted: one PCK curve per object class, sub-figures (a) 002 master chef can through (u) 061 foam brick, i.e. the 21 classes listed in Table 1.]

Fig. 3: PCK curves of MCN on RGB (blue) and RGB-D (red) from YCB-Video [1]. In each sub-figure, the X axis is the error threshold and the Y axis is the accuracy.


Fig. 4: Illustration of pose estimation results by MCN on YCB-Video. The projected object mesh points, transformed by the pose estimates, are highlighted in orange. For each example, from left to right we show the original image, MCN estimates on RGB, MCN estimates on RGB-D, and MV5-MCN estimates on RGB-D.

Fig. 5: Illustration of pose estimation results by MCN on JHUScene-50. The projected object mesh points, transformed by the pose estimates, are highlighted in pink. For each example, from left to right we show the original image, MCN estimates on RGB, MCN estimates on RGB-D, and MV5-MCN estimates on RGB-D.