
Communicated by Dr. Nianyin Zeng

Accepted Manuscript

Recovering 6D Object Pose from RGB Indoor Image based on Two-stage Detection Network with Multi-task Loss

Fuchang Liu, Pengfei Fang, Zhengwei Yao, Huansong Yang

PII: S0925-2312(18)31523-6
DOI: https://doi.org/10.1016/j.neucom.2018.12.061
Reference: NEUCOM 20283

To appear in: Neurocomputing

Received date: 24 July 2018
Revised date: 16 December 2018
Accepted date: 26 December 2018

Please cite this article as: Fuchang Liu, Pengfei Fang, Zhengwei Yao, Huansong Yang, Recovering 6D Object Pose from RGB Indoor Image based on Two-stage Detection Network with Multi-task Loss, Neurocomputing (2018), doi: https://doi.org/10.1016/j.neucom.2018.12.061

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Recovering 6D Object Pose from RGB Indoor Image based on Two-stage Detection Network with Multi-task Loss

Fuchang Liu, Pengfei Fang, Zhengwei Yao, Huansong Yang∗

Hangzhou Normal University, China

Abstract

Object pose estimation from an RGB image has recently become a common problem owing to its widespread applications. The advent of convolutional neural networks has impelled significant progress in object detection. However, most available methods do not involve a category-level pose estimation approach. This study presents an end-to-end 6D category-level pose estimation based on a two-stage bounding-box recognition backbone architecture. Our network directly outputs the 6D pose without requiring multiple stages or additional post-processing such as Perspective-n-Point (PnP). The two-stage CNN architecture and our loss function render multi-task joint training effective and efficient. We improve the pose estimation accuracy by replacing fully connected layers with fully convolutional layers. Fully convolutional networks require fewer parameters and are less susceptible to overfitting. Moreover, we transform the pose estimation problem into classification and regression tasks using our network; these variants are called Pose-cls and Pose-reg, respectively. We also present qualitative and quantitative results on real data from the SUN RGB-D dataset. The experiments demonstrate the effectiveness of our algorithms compared to other state-of-the-art methods.

Keywords: Pose estimation, Two-stage detection, Convolutional neural networks, Multi-task loss

1. Introduction

Six-dimensional (6D) object pose (3D rotation and 3D translation) estimation is an important task in computer vision and has wide-ranging applications in different areas including robotics, autonomous driving, medical imaging and virtual/augmented reality. Whereas recently developed methods relying on depth images are largely robust, depth sensors are short-ranged, suffer from scanning noise and are sensitive to illumination conditions. Object pose estimation from RGB images is therefore more practical, particularly for mobile cameras and Internet image galleries.

Traditionally, a 6D object pose can be recovered using a PnP method [1] based on the matching of local features between 3D models and images. However, these local feature-matching approaches are unsuitable for inadequately textured objects. Template-based matching [2, 3] or dense feature learning approaches [4, 5, 6] have been proposed to address texture-less objects. However, template-based methods are sensitive to illumination and occlusion, and feature learning approaches are time-consuming owing to dense feature extraction and pose refinement.
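To make the classical pipeline above concrete, the following is a minimal, hedged sketch (not part of the original paper) of how a 6D pose is typically recovered with a PnP solver once 2D-3D correspondences have been matched; it uses OpenCV's solvePnP, and the correspondence arrays and intrinsic matrix K are assumed to be given by an upstream feature-matching step.

```python
# Illustrative sketch of the traditional PnP-based recovery described above.
# points_3d / points_2d are assumed to come from local-feature matching.
import cv2
import numpy as np

def pose_from_correspondences(points_3d, points_2d, K):
    """points_3d: (N, 3) model points; points_2d: (N, 2) image points; K: 3x3 intrinsics."""
    dist_coeffs = np.zeros(4)                     # assume an undistorted image
    ok, rvec, tvec = cv2.solvePnP(points_3d.astype(np.float64),
                                  points_2d.astype(np.float64),
                                  K.astype(np.float64), dist_coeffs)
    if not ok:
        raise RuntimeError("PnP failed: too few or degenerate correspondences")
    R, _ = cv2.Rodrigues(rvec)                    # axis-angle -> 3x3 rotation matrix
    return R, tvec.ravel()                        # 6D pose: rotation R and translation t
```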

∗Corresponding author: [email protected]

Preprint submitted to Neurocomputing December 31, 2018


Inspired by the success of deep learning, tasks including object classification [7], object detection [8, 9] and, recently, object instance segmentation [10] have achieved remarkable improvements. Numerous studies [11, 12, 13] have considered estimating an object pose using deep neural network architectures, particularly convolutional neural networks (CNNs). However, most available studies have applied instance-level rather than category-level pose estimation. For category-level pose estimation, the network must learn to handle the intra-category appearance variation among instances of the same category. Only the approaches in [14, 15, 16] are designed for category-level pose estimation; however, these approaches either handle one object per image or operate on a discretised pose space. An earlier method of category-level pose estimation extended the deformable part model (DPM) to explicitly model partial appearances and their poses; this is referred to as DPM-VOC+VP [17]. CNN-based methods enable such relations to be captured implicitly through hierarchical convolutional structures. Recently, CNN-based methods have been demonstrated to outperform DPM-based methods for pose estimation [16]. Moreover, DPMs can also be considered a specific instantiation of CNNs. Our approach operates on multiple objects per image and numerous instances per category.

Our study builds on a number of recent advances in object detection based on CNNs. However, our objective is to simultaneously detect and recover the 6D poses of object instances from one indoor image. In this paper, we present a CNN framework based on a two-stage detection network composed of a residual network (ResNet) [18], a region proposal network (RPN) [8] and a fully convolutional network (FCN) [19]. Because there are several ways to design a joint detection and pose estimation framework (such as training continuous pose estimates and formulating the task as a regression problem, or sampling the pose space into discrete bins and formulating the task as a classification problem), we provide our solution by transforming our network into different forms, namely classification and regression. We first perform pose estimation as a classification problem by discretising the rotation and translation into bins; this variant is called Pose-cls. We then compare our framework with Scenemap [20], which provides the 2D positions of objects present in a scene based on a semantic segmentation architecture. Furthermore, we also change our framework into a regression form and solve pose estimation as a regression problem, called Pose-reg. With the regression framework, we can regress the rotation and translation directly and simultaneously. Finally, we present an end-to-end pipeline that can jointly optimise object detection and 6D object pose estimation using either a classification or a regression form. In recent years, pipelines that solve several tasks simultaneously have become more popular [21]. CNN architectures permit the joint training of more than two branches simultaneously. The benefit of this pipeline is that multiple training branches can share the same feature maps from the convolutional backbone.

Because our CNN pipeline addresses three tasks (namely, the classification, bounding-box estimation and pose estimation of objects) concurrently, three associated loss terms have to be balanced by tuning hyper-parameters during training. This requires an effective representation of a 6D pose as well as an effective design of the loss function. In this study, we represent a pose for Pose-cls using a 3D vector, in which the first element represents the in-plane rotation angle θ of the x-y plane (because indoor objects are generally aligned with the direction of gravity) and the last two elements represent the x and y components of the translation vector of the pose. For Pose-reg, we represent a pose using a 2D vector; here, the first element represents θ, as for Pose-cls, and the last element represents the z component of the translation vector of the pose. Using the z component of the translation, we can recover the complete translation using a 3D-2D projection formulation with the specified intrinsic camera calibration matrix. With simplified rotational and translational components, we tuned the hyper-parameters of all the loss terms effectively and balanced them efficiently.


To demonstrate the effectiveness of our approach, we compared our method with other state-of-the-art methods on a real dataset. Our evaluation demonstrates that the proposed method outperforms its counterparts significantly and performs effectively on multiple objects per image and for numerous instances per category. In addition, we provide the following contributions:

• We present and discuss two alternative methods to achieve category-level pose estimation by changing a multi-task loss into classification and regression forms. Both of them simultaneously detect an object from an RGB image and predict its 6D pose without requiring multiple stages or refinement. We assess these two methods of pose estimation on an indoor-image dataset.

• We extend the use of bounding-box recognition networks to multiple tasks including detection and pose estimation by adding ResNet and an FCN to our end-to-end pipeline. The application of ResNet with an FCN can simultaneously reduce the number of parameters and increase accuracy.

• We validated our methods on the SUN RGB-D dataset, which contains categories with numerous instances and images with multiple objects. Our methods were demonstrated to outperform previous state-of-the-art approaches.

The remainder of this paper is organised as follows: We review related studies in Section 2. We then describe our end-to-end pipeline for pose estimation in Section 3. In Section 4, we validate our method on public datasets. Finally, we provide a few concluding remarks and discuss areas of future study in Section 5.

2. Related Works

In this section, we briefly review 6D object pose estimation methods, which can be largely classified into keypoint-based, template-based, feature learning and CNN-based methods.

Keypoint-based methods. Sparse keypoints are extracted from interesting regions in an image and matched to the corresponding points on the 3D models, enabling the recovery of 6D poses. Such methods are generally rapid and robust to occlusion and scene clutter [22, 23]. A limitation of keypoint-based approaches lies in their inherently sparse representation of the target object, i.e., the quality of the keypoints directly affects the accuracy of the pose estimation. Therefore, these methods require rich textures on the objects to compute the sparse keypoints.

Template-based methods. In recent years, researchers have displayed significant interest in inadequately textured or texture-less objects. The most traditional approaches for inadequately textured objects utilise object templates [2, 3, 24]. Although template-based methods effectively detect texture-less objects, they exhibit certain drawbacks with respect to pose variations: to handle pose variations, a large number of templates is required, which is time-consuming. In addition, template-based methods cannot handle illumination changes or occlusion effectively because inferior lighting and occlusion decrease the similarity scores of the templates.

Feature learning methods. A number of studies have advanced beyond the use of sparse keypoints by learning to regress dense object coordinates for each pixel, establishing 2D-3D correspondences and thereby recovering the pose of a target object from the dense correspondences [4, 5, 6, 25]. However, such studies generally follow a time-consuming multi-stage pipeline: after recovering the 6D pose from the 2D-3D correspondences, a further refinement is generally required to obtain the final pose.


Moreover, object coordinate regression does not operate effectively with symmetric objects.

CNN-based methods. Recently, CNNs have been applied to the 6D pose problem. Viewpoints and Keypoints [14] and Render for CNN [15] transformed 3D pose estimation into classification tasks, specifically by discretising the pose space. Inspired by this philosophy, SSD-6D [11] extends the SSD detection framework [26] to 3D rotation estimation: it decomposes the 3D rotation space into discrete viewpoints and in-plane rotations. However, these approaches do not directly output the translation. PoseCNN [13] jointly segments objects and estimates the rotation and distance of the segmented objects from the camera. However, to localise objects, PoseCNN requires an FCN [19] for semantic segmentation; in addition, PoseCNN cannot address input images that contain multiple instances of an object. Another category of CNN-based methods describes the pose of a target object by specifying the locations of a fixed set of keypoints from a more local perspective. BB8 [12] proposed a time-consuming multi-stage pipeline for object pose estimation: it first uses a CNN to localise objects and then another CNN to predict the 2D projections of keypoints (i.e., the corners of the 3D bounding boxes) around the objects; the 6D pose is then recovered using an additional PnP algorithm. BB8 is not end-to-end because it combines multiple separate CNNs. A recent study [27] adopted a similar approach: the authors extended a YOLO object detection network [28] to predict the 2D projections of the corners of the 3D bounding boxes around different objects. In another study, the researchers predicted the pose based on an SSD network [16] using additional outputs for the poses; however, their approach handles only the 3D rotation component with a discretised view space.

In parallel to these methods, Deep-6DPose [29] jointly detects, segments and recovers the 6D poses of object instances from one RGB image. Deep-6DPose is an end-to-end deep learning pipeline; it decouples the pose parameters into translation and rotation such that the rotation can be regressed through a Lie algebra representation. This method predicts the z component and requires post-processing to recover the full translation. However, the CNN architecture of this method contains six fully connected layers, which incur an excessive memory footprint. Moreover, its multi-task loss function has four terms, which have to be balanced cautiously. Scenemap [20] predicts a coarse scene map from a top-down view with an image as input. Scenemap uses an FCN backbone for estimating the localisation and class of objects. However, Scenemap does not handle rotation estimation and has difficulty addressing input images that contain multiple instances of an object. DeepContext [30] provides a 3D extension of Scenemap; however, it focuses on holistic scene understanding within a scene template and requires a 3D point cloud computed from depth images.

We summarise the differences between our method and the available CNN-based methods in Table 1. Our approach performs category-level detection and pose estimation for multiple objects and outputs poses with six degrees of freedom. Our CNN architecture is two-stage rather than single-shot. Two-stage CNNs are more accurate than single-shot CNNs, particularly on small objects and multiple objects, because determining the size of the cells and the number of objects that lie in the same cell is challenging in single-shot object detectors. Moreover, multiple objects cause occlusion; this impacts the accuracy of single-shot methods [27] that are based on correspondences between an object's 3D bounding-box corners and their 2D projections, because PnP can provide worse results when the correspondences are degenerated by occlusion. In addition, our approach predicts the object's 6D pose without requiring multiple stages [12, 13] or the examination of multiple hypotheses [4, 6]. Certain single-shot 6D pose methods must be further refined [11] or must use PnP to recover the object's pose [27].


Table 1: Comparison of our method with other available CNN-based methods.

Method                          6 DoF   Multiple objects   Category-level
Viewpoints and Keypoints [14]     ✗            ✗                 ✓
Render for CNN [15]               ✗            ✗                 ✓
SSD-6D [11]                       ✓            ✓                 ✗
PoseCNN [13]                      ✓            ✓                 ✗
BB8 [12]                          ✓            ✓                 ✗
Single Shot 6D [27]               ✓            ✓                 ✗
Deep-6D Pose [29]                 ✓            ✓                 ✗
Fast Single Shot [16]             ✗            ✓                 ✓
Ours                              ✓            ✓                 ✓

3. Method

3.1. Overview

Object pose estimation is particularly challenging because, to estimate an object's pose, the object must first be accurately detected. Our contribution is therefore an architecture that yields a rapid and accurate one-shot pose prediction without requiring post-processing such as edge- or contour-based refinement [11]. To accomplish this, we extended a two-stage detection backbone into a joint detection and pose estimation network.

3.2. Learning in a pose estimation pipeline

(1) End-to-End Architecture: Our approach, outlined in Fig. 1, uses several CNN structures to solve our three sub-tasks, namely, the class, pose and bounding-box estimation of the objects, concurrently. Given an RGB image, our pipeline first extracts the proposed regions using an RPN; then, it regresses the z component of the translation and the in-plane rotation θ on the x-y plane. To improve the accuracy and reduce the number of parameters, we replace VGG-16 [31] with ResNet-101 [18] and adopt an FCN rather than fully connected layers. ResNet-101 increases the accuracy owing to its considerably increased depth and achieves easier convergence. In theory, fully connected layers can also be considered convolutions with kernels that cover their entire input regions. A fully convolutional representation requires fewer parameters and is less susceptible to overfitting, as employed and validated by Mask R-CNN [10].

(2) Classification vs. Regression: Our pipeline can be tailored to classification and regression tasks by changing the loss function. We compared our Pose-cls against available methods such as Scenemap [20], as illustrated in Table 3. We also tested our Pose-reg and conducted an ablation study on ResNet and the FCN. Moreover, we compared Pose-cls and Pose-reg; further details are provided in the subsequent section. A key observation is that regression-based pose estimation is more accurate than its classification-based counterpart.

(3) Multi-task Loss: To jointly train a multi-task network, we define our training objective as a weighted sum of three losses: a class loss L_cls, a bounding-box loss L_box and a pose loss L_pose.


Similar to previous methods [8, 9], during training, we define our multi-task loss on each sampled region of interest (RoI) as follows:

L(p, u, b^u, b^v, θ, x, y, z) = L_cls(p, u) + [u ≥ 1] L_box(b^u, b^v) + α1 [u ≥ 1] L_pose(θ, x, y, z),    (1)

where L_cls(p, u) = −log(p_u) is the log loss over two classes (object of class u vs. background). For each RoI, we use L_cls to output a discrete probability distribution, p = (p_0, ..., p_C), over C + 1 categories. For bounding-box regression, we use the loss

L_box(b^u, b^v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(b^u_i − b^v_i),    (2)

in which

smooth_L1(x) = { 0.5 x²       if |x| < 1
               { |x| − 0.5    otherwise,

is a robust L1 loss, which is less sensitive to outliers than the L2 loss. Each training RoI is labelled with a ground-truth class u and a ground-truth bounding box v; L_box is defined over the true bounding box for the class, b^v = (b^v_x, b^v_y, b^v_w, b^v_h), and the predicted bounding box b^u = (b^u_x, b^u_y, b^u_w, b^u_h). The indicator function [u ≥ 1] outputs 1 when u ≥ 1 and 0 otherwise; the background class is labelled u = 0. The third loss L_pose has two options: L_pose-reg (a smooth L1 regression loss) or L_pose-cls (a softmax loss). The hyper-parameter α1 regulates the balance between the pose loss and the other losses. We normalise the ground-truth bounding-box targets to have zero mean and unit variance, so that the classification loss L_cls and the bounding-box loss L_box are comparably scaled (as in [8]). Note that L_box and L_pose are activated only for positive RoIs and are disabled for background RoIs. For Pose-reg, the pose branch outputs two numbers for each RoI, representing the z component of the translation and the rotation angle θ about the z-axis (as mentioned above, indoor objects are generally aligned with the direction of gravity, i.e., the z-axis). The pose regression loss is defined as follows:

L_pose-reg(θ, z) = smooth_L1(θ̂ − θ) + α2 smooth_L1(ẑ − z),    (3)

where ẑ and z are two scalars representing the regressed and ground-truth z components of the translation, respectively; θ̂ and θ are two scalars representing the regressed and ground-truth rotation angles on the x-y plane, respectively; and α2 is a scale factor used to balance the rotation and translation regression errors.

For Pose-cls, the pose branch outputs three numbers for each RoI, representing the rotation angle θ on the x-y plane and the x and y components of the translation. The pose classification loss is computed from the rotation and translation losses as follows:

L_pose-cls(θ, x, y) = −log(p_θ) − α3 (log(p_x) + log(p_y)),    (4)

where p_θ is the softmax output for the true rotation class θ, and p_x and p_y are the softmax outputs for the true translation classes of the x and y components, respectively.
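As a concrete illustration of Eqs. (1), (3) and (4), the following is a hedged PyTorch sketch of the multi-task loss. The tensor shapes, the reduction and the helper names are assumptions for illustration; they do not reproduce the authors' exact implementation.

```python
# Hedged sketch of the multi-task loss of Eq. (1), with the Pose-reg term of
# Eq. (3) or the Pose-cls term of Eq. (4). Background RoIs (label 0) are
# masked out of the box and pose terms, matching the indicator [u >= 1].
import torch.nn.functional as F

def multitask_loss(cls_logits, labels, box_pred, box_gt, pose_pred, pose_gt,
                   alpha1=1.0, alpha2=2.0, alpha3=1.0, use_regression=True):
    l_cls = F.cross_entropy(cls_logits, labels)           # log loss over C+1 classes
    pos = labels >= 1                                      # indicator [u >= 1]
    if not pos.any():
        return l_cls                                       # mini-batch of background RoIs only
    l_box = F.smooth_l1_loss(box_pred[pos], box_gt[pos])   # Eq. (2), positive RoIs only
    if use_regression:
        # Pose-reg, Eq. (3): pose_pred / pose_gt hold (theta, z) per RoI.
        l_pose = (F.smooth_l1_loss(pose_pred[pos, 0], pose_gt[pos, 0]) +
                  alpha2 * F.smooth_l1_loss(pose_pred[pos, 1], pose_gt[pos, 1]))
    else:
        # Pose-cls, Eq. (4): pose_pred is (theta_logits, x_logits, y_logits),
        # pose_gt holds integer bin labels for (theta, x, y).
        theta_logits, x_logits, y_logits = pose_pred
        l_pose = (F.cross_entropy(theta_logits[pos], pose_gt[pos, 0]) +
                  alpha3 * (F.cross_entropy(x_logits[pos], pose_gt[pos, 1]) +
                            F.cross_entropy(y_logits[pos], pose_gt[pos, 2])))
    return l_cls + l_box + alpha1 * l_pose
```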


Figure 1: Overview of our pose estimation pipeline composed of ResNet and an FCN. Given an input image, our pipeline first predicts the bounding boxes of RoIs using an RPN and quantises the RoIs into feature bins (e.g., 7 × 7) using RoIAlign; finally, the FCN outputs convolutional features to the last FC layer for pose classification or regression.
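To make the head in Fig. 1 concrete, the following is a hedged PyTorch sketch of a per-RoI branch: a small fully convolutional block over the 7 × 7 RoIAlign features, average pooling and one linear layer per output. The channel widths, the number of conv layers and the Pose-cls bin counts are illustrative assumptions rather than the authors' exact configuration.

```python
# Hedged sketch of the per-RoI head of Fig. 1 (FCN + average pooling + FC outputs).
import torch.nn as nn

class RoIPoseHead(nn.Module):
    def __init__(self, in_channels=1024, num_classes=6,     # 5 categories + background (assumed)
                 num_rot_bins=36, num_xy_bins=10, regression=True):
        super().__init__()
        self.fcn = nn.Sequential(                            # fully convolutional block
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True))
        self.pool = nn.AdaptiveAvgPool2d(1)                  # average pooling over the 7x7 map
        self.cls = nn.Linear(256, num_classes)               # class scores
        self.box = nn.Linear(256, 4 * num_classes)           # per-class box deltas
        if regression:                                       # Pose-reg: (theta, z) per class
            self.pose = nn.Linear(256, 2 * num_classes)
        else:                                                # Pose-cls: rotation and x/y bins
            self.pose = nn.Linear(256, num_rot_bins + 2 * num_xy_bins)

    def forward(self, roi_feats):                            # roi_feats: (num_rois, C, 7, 7)
        x = self.pool(self.fcn(roi_feats)).flatten(1)
        return self.cls(x), self.box(x), self.pose(x)
```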

3.3. Training and inference

(1) Training: We implemented our end-to-end architecture in PyTorch on a GPU. The input to our network is an RGB image. The RPN outputs RoIs of different sizes and shapes; we use three scales and three aspect ratios, resulting in nine anchors in the RPN. This design permits the network to detect small objects.

Here, α1, α2 and α3 in Eqs. (1)∼(4) are empirically set to 1, 2 and 1, respectively. We observed that values of less than 1 or greater than 2 lead to poor performance, as illustrated in Tables 8∼11. We trained our network using stochastic gradient descent with a momentum of 0.9 and a weight decay of 0.0001. The network was trained on a GTX 1080Ti GPU for 150k iterations. Each mini-batch has 256 images. The learning rate was set to 0.001 for the first 50k iterations and then decreased by a factor of 10 for the remaining iterations. The top 2,000 RoIs from the RPN (with a positive-to-negative ratio of 1:3) were subsequently used for computing the multi-task loss. An RoI is considered positive if its intersection over union (IoU) with a ground-truth box is at least 0.5; otherwise, it is negative. The loss L_pose is defined only for positive RoIs.

During the test phase, we ran a forward pass on the input image. We adopted non-maximum suppression (NMS) [9] on the RoIs produced by the RPN based on their cls score (i.e., 0.7) and the IoU threshold (i.e., 0.5).

(2) From a 2D regressed pose to a complete 6D pose: Given the predicted θ, we can recover the corresponding rotation matrix because the rotation axis is aligned with the z-axis. To recover the complete translation, we rely on the predicted z component (i.e., ∆z) and recover the x and y components (i.e., ∆x and ∆y). Assuming a pinhole camera, we can recover ∆x and ∆y according to the following projection equations:

∆x = (u_0 − c_x) ∆z / f_x,    (5)

∆y = (v_0 − c_y) ∆z / f_y,    (6)

where f_x and f_y denote the focal lengths of the camera, (c_x, c_y)^T is the principal point, and u_0 and v_0 are the coordinates of the bounding-box centre of the RoI.
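For completeness, a short sketch of Eqs. (5) and (6) together with the rotation about the gravity axis follows; the variable names mirror the symbols above and are not taken from any released code.

```python
# Sketch of recovering the complete 6D pose from the regressed theta and delta_z
# using the pinhole projection of Eqs. (5) and (6).
import numpy as np

def recover_6d_pose(theta, delta_z, u0, v0, fx, fy, cx, cy):
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0],                      # rotation about the z (gravity) axis
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    delta_x = (u0 - cx) * delta_z / fx               # Eq. (5)
    delta_y = (v0 - cy) * delta_z / fy               # Eq. (6)
    return R, np.array([delta_x, delta_y, delta_z])  # rotation matrix and translation
```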


4. Experiments and Analysis

4.1. Metrics

For rotation, we evaluated our results using the average viewpoint precision (AVP) metric proposed by Xiang et al. [32]. For translation, we report the median of all translational errors (MedErr) between the estimated translations and the ground truth.
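The translation metric is straightforward to compute; a small hedged sketch follows (the full AVP protocol of [32] also requires a correct detection and is not reproduced here). The helper names and array shapes are illustrative assumptions.

```python
# Sketch of the translation metric (MedErr) and of a simple rotation accuracy
# within a tolerance, as used in the ablation tables later in the paper.
import numpy as np

def med_err(t_pred, t_gt):
    """Median Euclidean error between (N, 3) predicted and ground-truth translations."""
    return float(np.median(np.linalg.norm(t_pred - t_gt, axis=1)))

def rot_acc(theta_pred, theta_gt, tol=np.pi / 6):
    """Fraction of instances whose in-plane rotation error is within tol (wrap-aware)."""
    diff = np.abs(theta_pred - theta_gt) % (2 * np.pi)
    diff = np.minimum(diff, 2 * np.pi - diff)
    return float(np.mean(diff <= tol))
```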

4.2. Datasets

To prevent a domain shift from synthetic to real images, we directly trained our network on the SUN RGB-D [33] dataset, which consists of 10,335 RGB-D images with 3D object bounding-box annotations. There were 14.2 objects per image on average. In total, there were 47 scene categories and approximately 800 object categories, as shown in Fig. 2. These RGB-D images were mostly captured in household environments with strong context. We manually selected images from sleeping, office and lounging areas because they represent commonly observed indoor environments with relatively large numbers of images in SUN RGB-D. We focused on five common object categories, namely, chairs, tables, sofas, beds and shelves, and selected 8,363 RGB-D images containing such objects from SUN RGB-D. We divided these images into a training set (90%) and a testing set (10%); moreover, we applied mean preprocessing and data augmentation with horizontal flips on the training set. In the following experiments, we evaluated our end-to-end architecture for pose estimation on the testing set. Note that we used only the RGB channels as input; the depth channel was used only for 3D annotation.

Figure 2: SUN RGB-D dataset. Six example images (living room, classroom, kitchen, restaurant, office and bedroom) from the 47 scene categories of SUN RGB-D.

4.3. Results

In this section, we provide the experimental results for two main tasks: 2D object detection and 3D pose estimation. We validated that our network performs effectively on multiple objects and for numerous instances per category.

Performance of object detection. We observed that our pipeline achieves the best results on SUN RGB-D with an mAP@0.5 of 70.4%, as illustrated in Table 2. Our approach is based on a two-stage detection pipeline and outperforms one-stage methods such as YOLO and SSD.


We ran a number of ablations to analyse the effects of the core factors. We observed that ResNet improves the detection accuracy and that an FCN requires fewer parameters than FC layers. We visualise the object detection results of our pipeline on SUN RGB-D in Fig. 3.

Figure 3: Object detection results on sample images. Each object category is displayed with a different colour.

Table 2: Evaluation of 2D object detection on the SUN RGB-D dataset (bounding-box AP).

Method                   chair   table   sofa   bed    shelf   mAP    parameters
RCNN [33]                41.2    16.6    42.2   76.0   35.0    42.2   N/A
YOLO v3 416 [34]         68.5    46.0    73.4   83.2   57.5    65.7   N/A
SSD 300 [26]             67.0    51.7    73.4   84.5   48.2    65.0   N/A
Ours, VGG16 & FC         71.7    51.1    74.0   89.9   56.3    68.6   549.4 MB
Ours, ResNet101 & FC     74.9    55.1    78.6   86.6   56.9    70.4   1.1 GB
Ours, ResNet101 & FCN    73.4    52.9    80.0   86.5   58.5    70.3   190.9 MB

Performance of the Pose-cls pipeline. We evaluated our Pose-cls end-to-end pose estimation architecture on the SUN RGB-D dataset. We quantitatively compared our method against a baseline method based on single-frame depth estimation and against Scenemap. Our Pose-cls pipeline outputs the 2D location and orientation of objects in a similar manner as Scenemap, which describes the locations of objects. To evaluate our Pose-cls pipeline, we discretised the 2D translation space into a 2D grid of 0.5 m × 0.5 m cells over [−2.5 m, 2.5 m] × [−2.5 m, 2.5 m] and formulated this task as a 100-way classification problem. In addition, we divided the 360° range of rotation into 36 bins and transformed this task into a classification problem (Fig. 4).
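A hedged sketch of the discretisation just described follows: x and y are quantised separately on the 0.5 m grid (as they are scored in Eq. (4)) and can be combined into the flat 100-way label mentioned above, while the rotation falls into one of 36 bins of 10°. The helper names are ours, not the authors'.

```python
# Sketch of the Pose-cls target quantisation: 0.5 m translation cells over
# [-2.5 m, 2.5 m] on each axis and 36 in-plane rotation bins.
import numpy as np

def translation_to_bins(x, y, cell=0.5, lo=-2.5, hi=2.5):
    n = int(round((hi - lo) / cell))                   # 10 cells per axis
    ix = int(np.clip((x - lo) // cell, 0, n - 1))      # x bin in [0, 9]
    iy = int(np.clip((y - lo) // cell, 0, n - 1))      # y bin in [0, 9]
    return ix, iy                                      # iy * n + ix gives the flat 100-way label

def rotation_to_bin(theta, num_bins=36):
    theta = theta % (2 * np.pi)                        # wrap to [0, 2*pi)
    return int(theta // (2 * np.pi / num_bins))        # bin in [0, 35]
```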

We employed a category-agnostic design for Pose-cls, which shares the pose outputs across all object categories: with N_c categories of objects, our pipeline outputs N_θ pose scores shared by all categories. The category-agnostic design is faster and less susceptible to overfitting than a category-specific approach. According to Table 3, our Pose-cls network provides more accurate results for translation. Further details on the baseline and Scenemap are available in [20].


To demonstrate the rotational accuracy, we provide our results in Table 4. Because the baseline method and Scenemap in [20] are incapable of handling the rotation of objects, we did not compare our results with these approaches. Rather, we compared our approach with the pose estimation based on an SSD [16], which is a single-shot pose estimation method. The comparison demonstrates that our model performs better. In Fig. 5, we provide example results of our method on SUN RGB-D. Apart from the true positive rates, we also report the performance when 'one-off' errors are counted as correct (i.e., an object detection one cell away from the ground truth is still counted). Note that, although our method does not always determine the perfect location, it is still effective for inferring the overall scene structure.

We conducted an ablation study on the impact of ResNet and the FCN; the results are presented in Tables 3 and 4. We observed that ResNet and the FCN improve the accuracy. We hypothesise this to be because ResNet is deep and resolves the vanishing gradient problem and because an FCN has fewer parameters and requires less data for training.

Figure 4: Translation and rotation estimation. We discretised the translation on the x-y plane into 100 bins and the in-plane rotation into 36 bins. Both the rotation and translation were estimated as classification problems.

Table 3: Evaluation of translation estimation (the fraction of instances whose predicted translation error is within 0.5 m).

Pose-cls                 chair   table   sofa   bed    shelf   Avg.
Baseline [20]            N/A     N/A     N/A    N/A    N/A     0.03
Scenemap [20]            N/A     N/A     N/A    N/A    N/A     0.15
Ours, VGG16 & FC         0.29    0.21    0.25   0.25   0.11    0.22
Ours, ResNet101 & FC     0.34    0.21    0.33   0.37   0.17    0.28
Ours, ResNet101 & FCN    0.37    0.26    0.32   0.39   0.21    0.31

Performance of the Pose-reg pipeline. In Pose-reg, we use a separate output layer for each category, i.e., N_c × N_θ outputs, because a category-specific approach is more effective for regression; this is also validated by Mask R-CNN. The MedErr of the recovered translation is presented in Table 5. Our Pose-reg network outperforms DeepContext [30] by a large margin of 10% for translation estimation. We also conducted a comparison with DeepContext for rotation, although such a comparison is biased: DeepContext employs a 3D point cloud and predefined 3D scene templates for rotation estimation, whereas our method uses only 2D RGB information. Table 6 illustrates that our network outperforms the state-of-the-art methods.
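As an illustration of the category-specific Pose-reg outputs, the following hedged sketch selects, for each RoI at inference time, the (θ, z) pair belonging to its predicted class, analogous to how Mask R-CNN picks the class-specific mask; the shapes and names are assumptions.

```python
# Sketch: pick the class-specific (theta, z) prediction for each RoI at inference.
import torch

def select_pose_for_class(pose_out, cls_ids, num_classes):
    """pose_out: (num_rois, num_classes * 2); cls_ids: (num_rois,) predicted class indices."""
    pose_out = pose_out.view(-1, num_classes, 2)
    idx = cls_ids.view(-1, 1, 1).expand(-1, 1, 2)        # (num_rois, 1, 2)
    return pose_out.gather(1, idx).squeeze(1)            # (num_rois, 2) -> (theta, z) per RoI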


Figure 5: A few qualitative results (input image, ground truth and estimation). Cell colour: white grid cells indicate the ground truth, green grid cells indicate true positives, and yellow grid cells indicate true positives with 'one-off' errors.

Table 4: Evaluation of rotation estimation (36 view AVP).

Pose-cls                 chair   table   sofa   bed    shelf   Avg.
SSD Pose [16]            0.12    0.09    0.15   0.32   0.18    0.17
Ours, VGG16 & FC         0.25    0.22    0.32   0.33   0.16    0.26
Ours, ResNet101 & FC     0.31    0.22    0.32   0.46   0.17    0.29
Ours, ResNet101 & FCN    0.31    0.25    0.33   0.46   0.20    0.31

We studied the effects of the core components in Pose-reg individually. It is noteworthy that pose estimation using VGG performs better than that using ResNet with FC layers, as illustrated in Table 6. We hypothesise this to be because ResNet with FC layers has more parameters and requires more data for training. From Tables 5 and 6, it is evident that the FCN is important for obtaining the highest accuracy. We consider this to be because an FCN reduces the number of parameters and is less susceptible to overfitting than fully connected layers.

Comparison. As illustrated in Table 7, pose estimation based on the regression pipeline (i.e., Pose-reg) significantly outperforms that based on the classification pipeline (i.e., Pose-cls), with a margin of approximately 6%. Table 7 also illustrates that both Pose-cls and Pose-reg exhibit lower accuracies for 'tables' than for the other categories; this is mainly because tables are likely to have ambiguous poses owing to their symmetries. Thus, it is challenging for pose estimation methods to handle objects with rotational symmetry.

Evaluation of hyper-parameters. In Tables 8∼11, we compare different values of α1, α2 and α3 in Eqs. (1), (3) and (4) on our five object categories.


Table 5: Evaluation of translation estimation using MedErr metric (in meters).

Pose-reg                 chair   table   sofa   bed    shelf   Avg.
DeepContext [30]         0.35    0.35    0.34   0.28   0.25    0.30
Ours, VGG16 & FC         0.16    0.23    0.18   0.24   0.36    0.23
Ours, ResNet101 & FC     0.14    0.22    0.16   0.22   0.34    0.22
Ours, ResNet101 & FCN    0.14    0.19    0.15   0.22   0.28    0.19

Table 6: Evaluation of rotation estimation (36 view AVP).

Pose-reg                 chair   table   sofa   bed    shelf   Avg.
DeepContext [30]         0.44    0.44    N/A    N/A    N/A     N/A
Ours, VGG16 & FC         0.35    0.17    0.33   0.33   0.27    0.29
Ours, ResNet101 & FC     0.37    0.18    0.31   0.22   0.31    0.28
Ours, ResNet101 & FCN    0.47    0.25    0.41   0.37   0.25    0.35

Table 7: Comparison between Pose-cls and Pose-reg (18 view AVP).

Method                     chair   table   sofa   bed    shelf   Avg.
Pose-cls
  SSD Pose [16]            0.23    0.17    0.25   0.61   0.29    0.31
  Ours, VGG16 & FC         0.60    0.43    0.57   0.63   0.38    0.52
  Ours, ResNet101 & FC     0.61    0.41    0.55   0.69   0.47    0.55
  Ours, ResNet101 & FCN    0.63    0.43    0.60   0.64   0.48    0.56
Pose-reg
  Ours, VGG16 & FC         0.64    0.33    0.53   0.50   0.73    0.55
  Ours, ResNet101 & FC     0.64    0.25    0.54   0.35   0.67    0.49
  Ours, ResNet101 & FCN    0.71    0.43    0.66   0.60   0.70    0.62

Tables 8∼11 illustrate that our results degraded when α1, α2 and α3 were less than 1 or greater than 2. When the value of the hyper-parameters reached 10, we observed that the gradient of Eq. (1) exploded.

5. Conclusion and Discussion

In this study, we proposed an end-to-end trainable approach for jointly detecting and, most importantly, recovering the 6D poses of multiple object instances from one RGB image. Our backbone is a two-stage detection pipeline in which VGG is replaced with ResNet and the FC layers with an FCN. ResNet increases the accuracy by providing an implicit ensemble of an exponential number of networks. Moreover, an FCN reduces the number of parameters and is robust to overfitting.


Table 8: Evaluation of hyper-parameters on Pose-cls (mean AVP of 36 views).

α1 \ α3    0.1     1.0     2.0
0.1        0.10    0.28    0.27
1.0        0.10    0.31    0.30
2.0        0.08    0.25    0.30

Table 9: Evaluation of hyper-parameters on Pose-cls (mean fraction of instances whose translation error is within 0.5 m).

α1 \ α3    0.1     1.0     2.0
0.1        0.15    0.16    0.13
1.0        0.27    0.31    0.29
2.0        0.28    0.30    0.28

Table 10: Evaluation of hyper-parameters on Pose-reg (mean fraction of instances whose viewpoint error is within π/6).

α1 \ α2    0.1     1.0     2.0
0.1        0.40    0.73    0.77
1.0        0.25    0.74    0.77
2.0        0.15    0.56    0.71

Table 11: Evaluation of hyper-parameters on Pose-reg (mean translation error using the MedErr metric, in meters).

α1 \ α2    0.1     1.0     2.0
0.1        0.21    0.25    0.25
1.0        0.19    0.20    0.19
2.0        0.18    0.22    0.18


Our end-to-end pipeline (i.e., Pose-cls or Pose-reg) provides category-level pose estimation and compares favourably with other state-of-the-art methods. We also compared Pose-cls with Pose-reg and observed that the regression pipeline provides more accurate results than the classification pipeline. This is likely because effective classification results require an appropriate sampling of the discretised pose space, whereas the regression network (i.e., Pose-reg) regresses the scalar values of the rotation and translation individually without any sampling. A possible future study would be to improve our pipeline for handling symmetric and highly occluded objects. The problem of handling objects with an axis of symmetry is still open and has been addressed only partially [12]. A few rotation-invariant object detection methods based on CNNs exhibit potential for addressing this problem [35, 36]. Moreover, highly occluded object detection can potentially be addressed using autoencoders [37] and adversarial learning [38].

References

[1] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2000.
[2] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. R. Bradski, K. Konolige, N. Navab, Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes, in: ACCV, 2012.
[3] R. Rios-Cabrera, T. Tuytelaars, Discriminatively trained templates for 3d object detection: A real time scalable approach, in: ICCV, 2013.
[4] E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, C. Rother, Learning 6d object pose estimation using 3d object coordinates, in: ECCV, 2014.
[5] A. Krull, E. Brachmann, F. Michel, M. Y. Yang, S. Gumhold, C. Rother, Learning analysis-by-synthesis for 6d pose estimation in rgb-d images, in: ICCV, 2015.
[6] F. Michel, A. Kirillov, E. Brachmann, A. Krull, S. Gumhold, B. Savchynskyy, C. Rother, Global hypothesis generation for 6d object pose estimation, in: CVPR, 2017.
[7] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: NIPS, 2012.
[8] R. B. Girshick, Fast r-cnn, in: ICCV, 2015.
[9] S. Ren, K. He, R. B. Girshick, J. Sun, Faster r-cnn: towards real-time object detection with region proposal networks, TPAMI 39 (6) (2017) 1137–1149.
[10] K. He, G. Gkioxari, P. Dollár, R. B. Girshick, Mask r-cnn, in: ICCV, 2017.
[11] W. Kehl, F. Manhardt, F. Tombari, S. Ilic, N. Navab, Ssd-6d: making rgb-based 3d detection and 6d pose estimation great again, in: ICCV, 2017.
[12] M. Rad, V. Lepetit, Bb8: A scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth, in: ICCV, 2017.
[13] Y. Xiang, T. Schmidt, V. Narayanan, D. Fox, Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes, in: RSS, 2018.
[14] S. Tulsiani, J. Malik, Viewpoints and keypoints, in: CVPR, 2015.
[15] H. Su, C. R. Qi, Y. Li, L. J. Guibas, Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views, in: ICCV, 2015.
[16] P. Poirson, P. Ammirato, C. Fu, W. Liu, J. Kosecka, A. C. Berg, Fast single shot detection and pose estimation, in: Fourth International Conference on 3D Vision (3DV), 2016, pp. 676–684.
[17] B. Schiele, M. Stark, P. Gehler, B. Pepik, Teaching 3d geometry to deformable part models, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[18] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: CVPR, 2016.
[19] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, TPAMI 39 (4) (2017) 640–651.
[20] M. Hueting, V. Patraucean, M. Ovsjanikov, N. J. Mitra, Scene structure inference through scene map estimation, in: VMV.
[21] B. Hariharan, P. Arbeláez, R. B. Girshick, J. Malik, Simultaneous detection and segmentation, in: ECCV, 2014.
[22] D. G. Lowe, Distinctive image features from scale-invariant keypoints, IJCV (2004) 91–110.
[23] I. Gordon, D. G. Lowe, What and where: 3d object recognition with accurate pose, in: Toward Category-Level Object Recognition, 2006.


[24] A. Tejani, D. Tang, R. Kouskouridas, T.-K. Kim, Latent-class hough forests for 3d object detection and pose estimation, in: ECCV, 2014.
[25] E. Brachmann, F. Michel, A. Krull, M. Y. Yang, S. Gumhold, C. Rother, Uncertainty-driven 6d pose estimation of objects and scenes from a single rgb image, in: CVPR, 2016.
[26] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C.-Y. Fu, A. C. Berg, Ssd: single shot multibox detector, in: ECCV, 2016.
[27] B. Tekin, S. N. Sinha, P. Fua, Real-time seamless single shot 6d object pose prediction, in: CVPR, 2018.
[28] J. Redmon, A. Farhadi, Yolo9000: better, faster, stronger, in: CVPR, 2017.
[29] T. Do, M. Cai, T. Pham, I. D. Reid, Deep-6dpose: Recovering 6d object pose from a single RGB image, arXiv:1802.10367, 2018.
[30] Y. Zhang, M. Bai, P. Kohli, S. Izadi, J. Xiao, Deepcontext: Context-encoding neural pathways for 3d holistic scene understanding, in: ICCV, 2017.
[31] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556, 2014.
[32] Y. Xiang, R. Mottaghi, S. Savarese, Beyond pascal: A benchmark for 3d object detection in the wild, in: IEEE Winter Conference on Applications of Computer Vision (WACV), 2014.
[33] S. Song, S. Lichtenberg, J. Xiao, Sun rgb-d: A rgb-d scene understanding benchmark suite, in: CVPR, 2015.
[34] J. Redmon, A. Farhadi, Yolov3: An incremental improvement, arXiv:1804.02767, 2018.
[35] G. Cheng, P. Zhou, J. Han, Learning rotation-invariant convolutional neural networks for object detection in vhr optical remote sensing images, IEEE Transactions on Geoscience and Remote Sensing 54 (12) (2016) 7405–7415.
[36] K. Li, G. Cheng, S. Bu, X. You, Rotation-insensitive and context augmented object detection in remote sensing images, IEEE Transactions on Geoscience and Remote Sensing 56 (4) (2018) 2337–2348.
[37] N. Zeng, H. Zhang, B. Song, W. Liu, Y. Li, A. M. Dobaie, Facial expression recognition via learning deep sparse autoencoders, Neurocomputing 273 (17) (2018) 643–649.
[38] X. Wang, A. Shrivastava, A. Gupta, A-fast-rcnn: Hard positive generation via adversary for object detection, in: CVPR, 2017.


Biography

Fuchang Liu Fuchang Liu received the BS and PhD degrees in computer science from the Nanjing University of Science and Technology in 2004 and 2009, respectively. He was a postdoctoral research fellow in computer science and engineering at Ewha Womans University and NTU. He is an assistant professor at Hangzhou Normal University. His research interests include scene understanding, computer graphics, GPU rendering, and collision detection. He has published 5 papers in Siggraph Asia, IEEE TVCG, CAD and BMVC.

Pengfei Fang Pengfei Fang is currently a graduate student at Hangzhou Normal University. His research interests include object detection and pose estimation.

Zhengwei Yao Zhengwei Yao received the M.S. degree in computer science and education from Hangzhou Normal University, Hangzhou, China, in 2006 and the Ph.D. degree in computer application technology from Shanghai University, Shanghai, China, in 2010. He was a visiting scholar at Indiana University-Purdue University Fort Wayne in 2012. He is currently an associate professor at the Digital Media and HCI Research Center, Hangzhou Normal University. His research interests include human-computer interaction, virtual reality, augmented reality, interactive video games, and scientific visualization.

Huansong Yang Huansong Yang is a professor at Hangzhou Normal University. His research interests include big data in education and multimedia. He has published more than 80 papers in journals in these fields. He was selected for the first level of the Zhejiang Province "151 Talents Project" and the Hangzhou "131 Talents Project".
