
[Source PDF: qszhang.com/publications/icra2014.pdf]

Start from Minimum Labeling: Learning of 3D Object Models and Point Labeling from a Large and Complex Environment

Quanshi Zhang†, Xuan Song†, Xiaowei Shao†, Huijing Zhao‡ and Ryosuke Shibasaki†

Abstract— A large category model base can provide object-level knowledge for various perception tasks of the intelligent vehicle system. The automatic and efficient construction of such a model base is highly desirable but challenging. This paper presents a novel semi-supervised approach to discover possible prototype models of 3D object structures from the point cloud of a large and complex environment, given a limited number of seeds in an object category. Our method incrementally trains the models while simultaneously collecting object samples. Considering the bias problem of model learning caused by bias accumulation in the sample collection, we propose to gradually differentiate the standard category model into several sub-category models that represent different intra-category structural styles. Thus, new sub-categories are discovered and modeled, old models are improved, and redundant models for similar structures are deleted iteratively during the learning process. This multiple-model strategy provides several interactive options for the category boundary to deal with the bias problem. Experimental results demonstrate the effectiveness and high efficiency of our approach to model mining from "big point cloud data".

I. INTRODUCTION

An all-embracing category model base can directly provide high-level knowledge for AI tasks, such as object detection, segmentation, and tracking. Mining such bases from "big data" with minimum human labeling has become an important AI area that poses a continuous challenge to state-of-the-art algorithms. Recent methods [38], [39] use a single labeled object to mine category model bases from cluttered RGB or RGB-D images.

Therefore, in this paper, we propose a semi-supervised method to learn category models from unlabeled "big point cloud data". The algorithm requires labeling only a small number of object seeds in each object category to start the model learning, as shown in Fig. 1. This design saves both manual labeling and computation cost to satisfy the efficiency requirement of model mining.

We propose an iterative framework for collecting object samples and learning models, as shown in Fig. 2. Considering the large intra-category variation, we differentiate an entire object category into different structural styles as sub-categories, and use multiple models, each representing the structural distribution of the object samples within one sub-category. These sub-category models are incrementally learned along with sample collection. During the model

† Quanshi Zhang, Xuan Song, Xiaowei Shao, and Ryosuke Shibasaki are with the Center for Spatial Information Science, University of Tokyo. {zqs1022, songxuan, shaoxw, shiba} at csis.u-tokyo.ac.jp

‡ Huijing Zhao is with the Key Laboratory of Machine Perception (MoE), Peking University. zhaohj at cis.pku.edu.cn

Fig. 1. Given an unlabeled point cloud of a large urban environment and a small number of object seeds in a category, we aim to mine a set of models to provide object-level knowledge for point labeling. The point cloud of a large environment can be considered a kind of "big data", and thus we propose to limit both the manual labeling and the computation cost to ensure a high efficiency for model mining.

learning, new sub-categories are discovered, similar sub-categories are merged, and models for the old sub-categories are gradually improved.

Unlike semi-supervised model learning based on image search engine results [16], the bias¹ problem remains intractable in large and complex environments without the sifting performed by a search engine. Owing to large intra-category shape variations and the prevalence of occlusions, object collection bias during the initial learning steps affects further model learning, and errors accumulate into a significant model bias. The multiple-model strategy is a plausible way of dealing with this bias problem. Newly collected samples only affect the models of their own categories, and the growth of the sample collection bias is therefore limited to certain biased sub-categories. We manually eliminate the biased models after the learning process is complete, to provide the correct category boundary.

The main contributions of this research can be summarized as follows. 1) To the best of our knowledge, this is the first proposal for the efficient mining of category models from "big point cloud data". With limited computation and human labeling, the method is oriented toward the efficient construction of a category model base. 2) A multiple-model strategy is proposed as a solution to the bias problem, providing several discrete and selectable category boundaries.

II. RELATED WORK

Point cloud processing has developed rapidly in recent years. In this section, we discuss a wide range of related

¹ Without sufficient manual labeling, the model may either overfit to some special sub-categories within the target category, or shift to other categories in the incremental learning process.


Fig. 2. What is a feasible and efficient way to mine category models? Given a limited number of 3D point clouds of the object seeds in an object category and a real 3D environment, our algorithm is able to automatically collect object samples with a large intra-category variation and learn the category model. In this iterative framework, blue and red rectangles indicate the inputs and outputs, respectively.

work to provide a better understanding of our method of category model mining.

Knowledge mining: The segmentation and classification (point labeling) of 3D point clouds are two kinds of 3D environment understanding [25], [26], [27], [28], [29], [30], [31], [32], [33], [36], [37]. Some studies [34] focus on the unsupervised segmentation of 3D point clouds based on local features. Other studies have contributed to the learning of common structures across different categories from well-segmented 3D objects [13]. Munoz et al. [25], [33], Triebel et al. [26], and Anguelov et al. [32] made a breakthrough when they employed associative Markov networks (AMNs) with a max-margin strategy for supervised point cloud classification and segmentation.

However, knowledge mining mainly needs to be applied to unlabeled data in an unsupervised or semi-supervised manner. In addition, knowledge mining should focus on learning a general model for objects within an entire category, rather than just collecting a number of individual object samples based on local segmentation criteria.

Even so, we use the trained models to achieve point labeling (although the models can also be applied to other tasks, such as object retrieval and recognition), and perform experiments to compare their performance with classical supervised AMN-based point classification.

Object-level global structure: A number of pioneering studies have contributed to the extraction of high- and middle-level structural knowledge. Hebert et al. [10] used some high-level shape assumptions to discover various structures in the environment, while Ruhnke et al. [19] learned a compact representation of a 3D environment based on Bayesian information criteria. Endres et al. [17] used latent Dirichlet allocation to discover 3D objects. These methods focus on patterns at the part level, whereas we aim to extract global structures with the correct object-level semantics.

Category modeling: Closer to our field of category model mining, some approaches for the collection of 3D object samples have been proposed. Herbst et al. [11], [12] detected which objects had been moved across multiple depth images of the same scene, and Somanath et al. [22] detected the same objects appearing in different 3D scenes. Detry et al. [18] learned a general hierarchical object model from stereo data with clear edges.

Category model mining is not limited to the segmentation of objects with a specific shape or to recurrent objects, but also includes category discovery with large intra-category shape variations. From this viewpoint, the most closely related work involves unsupervised repetitive shape extraction [20], [21] and unsupervised 3D category discovery [1], where object samples are extracted automatically and classified into different categories.

By contrast, we aim to learn category models for point labeling, rather than simply collecting object samples or labeling points in a point cloud. The proposed model encodes the structural knowledge of a whole category with considerable intra-category variations. Meanwhile, our approach satisfies the efficiency requirement, i.e. a relatively small amount of computation and manual labeling, which is important for model mining from "big point cloud data". Nevertheless, we compare our approach with [20], [21] and [1] in the experiments from the perspective of point labeling.

III. ALGORITHM

Our algorithm operates in a bootstrapping framework. We use the current models to collect new samples and estimate their reliability; strangely shaped or largely occluded samples are considered unreliable. Samples are thus weighted by their reliability, and this weighting is used as feedback to refine the current sub-category models and discover new sub-categories.

A. Preliminaries: category modeling

We extend the cell-based object representation proposed in our previous work [1] to represent the structure distribution of samples within a sub-category. We divide the object into cells and extract a local shape feature from each cell. The model encodes the point-occupying probability of each cell and its local feature distribution among the object samples.

We use a cylinder template to describe the cell division, as shown in Fig. 3. Objects are placed in the 3D space of a vertical cylinder, whose size is determined by the actual size of the object seeds. The cylinder is divided into F floors and L layers (F = 16, L = 8), and then further divided into N = 50 parts using radial planes. We therefore obtain a total of F·L·N = 6400 cells.
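As a rough illustration (not the authors' code), the mapping from a point to a cell index in such a template might look like the following Python sketch. It assumes uniform floor, layer, and angular divisions; Fig. 3 suggests the actual radial division is denser near the center, which this sketch ignores.

```python
import math

# Template dimensions from the paper: F floors, L layers, N radial divisions.
F, L, N = 16, 8, 50  # F*L*N = 6400 cells in total

def cell_index(x, y, z, radius, height):
    """Map a point (relative to the cylinder's bottom-center axis) to a cell index.

    Assumes uniform divisions; returns None for points outside the cylinder.
    """
    r = math.hypot(x, y)
    if r > radius or not (0.0 <= z < height):
        return None
    floor = min(int(z / height * F), F - 1)              # vertical slice
    layer = min(int(r / radius * L), L - 1)              # concentric ring
    angle = math.atan2(y, x) % (2 * math.pi)
    sector = min(int(angle / (2 * math.pi) * N), N - 1)  # angular wedge
    return (floor * L + layer) * N + sector
```

Under this indexing, the cell index ranges over 0 … 6399, matching the paper's cell count.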

Sample: The sample is placed at the center of the cylinder. We calculate a local feature for each cell to represent the shape of the corresponding local point cloud. Inspired by the spectral analysis of point clouds [23], we use a cuboid to fit the point cloud in each cell. The edge lengths of the cuboid are the square roots of the eigenvalues of the point covariance matrix. We estimate the cuboid volume


Algorithm 1 Learn object models and collect object samples simultaneously

Input: The point cloud of a large and complex environment and k pre-labeled object seeds in a category.
Output: A set of models and collected object samples.
Initialization: Generate an initial model from each object seed to form the initial model set. Initialize the sample set with the object seeds and their top-ranked matched samples in the environment.
repeat
  1. Use the current models to estimate the reliability of all samples in the sample set (see Section III-B).
  2. Produce new candidate models from the pure breeding of new samples and the hybridization between new samples and current models (see Section III-C.1).
  3. Perform MDL-based model selection using reliability-weighted samples (see Section III-C.2).
  4. Search for the top-ranked samples corresponding to the current models in the environment, and add them to the sample set (see Section III-B).
until no new samples can be well matched to the models.

using its edge lengths, and take the volume as the local feature. Line-shaped and surface-shaped cells thus have small feature values, whereas cloud-shaped cells have large values. The local feature of a cell that does not contain any 3D points is defined as none. Thus, an object sample can be represented as a vector of local features S = <s_i>, where i is the cell index.
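A minimal NumPy sketch of this local feature, under the assumption that the eigenvalue-derived edge lengths are simply multiplied to obtain the cuboid volume (the function name and the minimum-point guard are our choices, not the paper's):

```python
import numpy as np

def local_feature(points, min_points=3):
    """Cuboid-volume feature of the point cloud in one cell.

    Edge lengths are the square roots of the eigenvalues of the point
    covariance matrix; the feature is the cuboid volume, so line- and
    surface-like cells score near zero and volumetric ("cloud") cells
    score high.  Returns None for (near-)empty cells, matching the
    paper's convention.
    """
    pts = np.asarray(points, dtype=float)
    if pts.shape[0] < min_points:
        return None
    cov = np.cov(pts.T)                       # 3x3 covariance of the cell's points
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    return float(np.prod(np.sqrt(eigvals)))   # product of edge lengths = volume
```

A straight line of points yields a (near-)zero volume, while a genuinely 3D cluster yields a clearly positive value, which is exactly the distinction the paper relies on.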

Model: The model encodes the point-occupying probability and local feature distribution in each cell, as m = {P, µ, Var}, where P = <p_i>, µ = <µ_i>, and Var = <var_i>. For each cell i, p_i indicates its probability of containing points, and µ_i and var_i denote the mean and variance of its local feature when it contains points, respectively.
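One way to hold such a model in code is a simple record over the cylinder's cells; the class name and the None-for-never-occupied convention below are our choices, not the paper's:

```python
from dataclasses import dataclass, field
from typing import List, Optional

NUM_CELLS = 6400  # F*L*N cells from the cylinder template

@dataclass
class CategoryModel:
    """Sub-category model m = {P, mu, Var} over the cylinder's cells.

    For each cell i: p[i] is its probability of containing points, while
    mu[i] and var[i] are the mean and variance of its local feature when
    occupied (None for cells that are never occupied).
    """
    p:   List[float]           = field(default_factory=lambda: [0.0] * NUM_CELLS)
    mu:  List[Optional[float]] = field(default_factory=lambda: [None] * NUM_CELLS)
    var: List[Optional[float]] = field(default_factory=lambda: [None] * NUM_CELLS)
```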

Matching: We use object matching to compute similarities between models and samples. As objects are brute-force searched in the environment (details follow in Section III-B), they can simply be sampled at the center of the cylinder. Thus, we only consider horizontal rotations in matching.

For cell i in the sample, θ(i) denotes the corresponding cell in the model under horizontal rotation θ. The matching value between model m = {P, µ, Var} and sample S is calculated as follows:

match(S, m) = max_θ [ Σ_{i: s_i ≠ none, p_θ(i) ≠ 0} Cell_i ] / sqrt( Σ_i δ(s_i) · Σ_i p_θ(i) )   (1)

match(S, m) is normalized based on both the sample and model sizes. δ(s_i) is 1 if s_i ≠ none, and 0 otherwise. Cell_i is the local matching value of cell i in the sample under rotation angle θ, and it is assumed to follow a Gaussian distribution:

Cell_i = p_θ(i) · G(s_i | µ_θ(i), var_θ(i))   (2)

where G(·) denotes a Gaussian distribution. Cell_i is weighted using the point-occupying probability. Similar to [1], the maximization problem in (1) can be solved via gradient-descent methods with an initial pose estimate, or simply through an exhaustive search.

[Fig. 3 graphic: a cylinder divided into F floors, L layers, and N radial divisions, illustrated with F = 3, L = 2, N = 12 and a top view; the actual cell division uses F = 16, L = 8, N = 50, and the illustrated division is sparser than the actual setting for clarity. An example model based on the actual cell division is also shown.]

Fig. 3. Illustration of cell division in the cylinder. The central cells are denser, as we assume the object is located at the center, and thus the central cells are more reliable for object representation. Please see the text and Fig. 10 for detailed explanations of the model.
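The exhaustive-search reading of Eqs. (1) and (2) can be sketched as follows. The flat cell indexing, the `_rotate` helper (which cyclically shifts whole angular sectors), and the function names are our assumptions for illustration, not the authors' implementation:

```python
import math

def _gauss(x, mean, variance):
    """Gaussian density G(x | mean, variance)."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def _rotate(i, shift, n_sectors):
    """Cell theta(i): rotate the angular index of cell i by `shift` sectors."""
    ring, sector = divmod(i, n_sectors)
    return ring * n_sectors + (sector + shift) % n_sectors

def match(sample, p, mu, var, n_sectors):
    """Matching score of Eq. (1) by exhaustive search over sector shifts.

    `sample` is the per-cell feature list (None = empty cell); p, mu, var
    are the model's parallel per-cell lists.  len(sample) must be a
    multiple of n_sectors.
    """
    occupied = sum(1 for s in sample if s is not None)   # sum_i delta(s_i)
    mass = sum(p)                                        # sum_i p_theta(i)
    if occupied == 0 or mass == 0:
        return 0.0
    norm = math.sqrt(occupied * mass)
    best = 0.0
    for shift in range(n_sectors):                       # max over rotations theta
        total = 0.0
        for i, s in enumerate(sample):
            j = _rotate(i, shift, n_sectors)
            if s is None or p[j] == 0:
                continue
            total += p[j] * _gauss(s, mu[j], var[j])     # Cell_i of Eq. (2)
        best = max(best, total / norm)
    return best
```

Because the maximum is taken over all sector shifts, a horizontally rotated copy of a sample receives the same score, which is the invariance Eq. (1) is after.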

Model-based point labeling: Given the object samples collected by the trained models, we can further apply model-based object segmentation to the collected samples, thereby achieving point labeling.

Let M denote the model set. We collect object samples using these models via a brute-force search in the environment. Given a threshold τ, each sample S satisfying max_{m∈M} match(S, m) > τ is collected. For the segmentation, let C_i be a cell in sample S. If the local matching value Cell_i of C_i is less than 1/3, C_i is removed from S.
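The collect-then-segment step can be sketched generically; the injected `match_fn` and `cell_fn` stand in for Eqs. (1) and (2), and the signatures are our placeholders rather than the paper's API:

```python
def label_points(candidates, models, match_fn, cell_fn, tau, cell_min=1/3):
    """Sketch of model-based point labeling.

    Collect each candidate sample whose best model match exceeds tau, then
    drop its cells whose local matching value falls below cell_min (the
    paper's 1/3 segmentation threshold).
    """
    labeled = []
    for sample in candidates:
        best = max(models, key=lambda m: match_fn(sample, m))  # argmax_m match(S, m)
        if match_fn(sample, best) <= tau:
            continue                                           # sample not collected
        labeled.append([c for c in sample if cell_fn(c, best) >= cell_min])
    return labeled
```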

B. Incremental sample collection and evaluation

According to Algorithm 1, the initial models are generated from the object seeds. We then iteratively use the current models to collect new samples, and use the new samples to incrementally train the models.
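A minimal Python skeleton of this loop might look as follows; the injected step functions (`initial_model`, `estimate_reliability`, `generate_candidates`, `mdl_select`, `search_top_samples`) are placeholders for the numbered steps of Algorithm 1, not the authors' implementation:

```python
def learn_models(environment, seeds, *, initial_model, estimate_reliability,
                 generate_candidates, mdl_select, search_top_samples, max_iters=100):
    """Sketch of Algorithm 1: alternate sample collection and model learning."""
    models = [initial_model(s) for s in seeds]           # one initial model per seed
    samples = list(seeds) + list(search_top_samples(environment, models))

    for _ in range(max_iters):
        # 1. Reliability of every sample under the current models (Sec. III-B).
        weights = [estimate_reliability(s, models) for s in samples]
        # 2. Candidates: current models, pure breeding, hybridization (Sec. III-C.1).
        candidates = generate_candidates(models, samples)
        # 3. MDL-based selection with reliability-weighted samples (Sec. III-C.2).
        models = mdl_select(candidates, samples, weights)
        # 4. Add the top-ranked matches of the current models (Sec. III-B).
        new_samples = list(search_top_samples(environment, models))
        if not new_samples:          # stop: nothing matches the models well any more
            break
        samples.extend(new_samples)
    return models, samples
```

Dependency injection keeps the skeleton runnable with stubbed steps, which is how the structure (rather than the individual steps) can be tested.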

Sample collection: New samples are brute-force searched and collected from the 3D environment using the current models m ∈ M. Each model inserts three of its best-matched samples into the sample set, and the threshold max_{m∈M} match(S, m) < ε is used as the stopping condition for the sample collection of model m. In practice, we can use the initial models to prune the search range to not-too-badly matched samples during preprocessing, greatly reducing the computation of sample collection in later iterations.

Sample evaluation: Not all of the collected samples have equally reliable shapes. Unreliable samples either do not contain the target object at their center, or have strange, largely occluded, or mixed shapes. We use the current models to estimate the reliability of each sample. As different models represent different sub-categories, each sample only needs to be well matched to its target sub-category. The reliability of sample S can therefore be formulated as

R(S) = max_{m∈M} match(S, m).   (3)

C. Model learning

Sample collection and model learning are performed alternately, as shown in Algorithm 1. Given the new samples collected by the current models, model learning has two steps, i.e. candidate model generation and model selection. In the first step, new candidate models are incrementally generated by combining the current models and new samples.


[Fig. 4 graphic: matching-uncertainty curves for panels (a) wall, (b) street, and (c) tree in the final iteration, and panel (d) the evolution of the wall models from the initial models through iterations 1, 4, 7, 10, and 13.]

Fig. 4. Model selection results in the final iteration (a–c) and evolution of the wall models (d). The horizontal axis shows different samples, and the vertical axis shows the matching uncertainty defined in (8a). Different curves show the matching uncertainty produced by different candidate models; the colored curves are the selected models. The thick lines indicate the best-matched models of the samples after model selection. The samples are sorted to ensure a monotonic increase in the thick lines for clarity. The selected models achieve an approximate minimization of the matching uncertainty.

This incremental model generation ensures that the number of new models is independent of the growth of the sample number. In the second step, the minimum description length (MDL) principle is used to select a subset of these candidate models that best describe the sub-categories of the samples. In this way, models of new sub-categories may be discovered, and current sub-categories may be assigned better models. The model selection results in the final learning iteration are illustrated in Fig. 4(a–c), and the evolution of the wall models across iterations is shown in Fig. 4(d).

1) Candidate model generation: The candidate model set consists of 1) the current models, 2) models generated through the pure breeding of new samples, and 3) models generated by the hybridization between new samples and current models.

Pure breeding of a new sample: Pure breeding produces a new model directly from a sample S. For cell i in this model, let its (spatially) nearest not-none cell in sample S be cell j, with feature s_j. The point-occupying probability is assumed to follow a Gaussian distribution w.r.t. the distance dist_ij, and the local feature distribution is initialized with a constant variance:

µ_i = s_j,   var_i = ν,   p_i = G(dist_ij | µ′ = 0, δ′ = ηR_j)   (4)

The local feature variance ν can be either pre-defined or estimated from a seed. R_j denotes the distance between the sample center and cell j.
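Eq. (4) can be sketched as follows. The Euclidean-distance details, the cell-center representation, and the function names are our assumptions; note the sketch assumes the sample is non-empty and that no occupied cell coincides with the sample center (which would make ηR_j zero):

```python
import math

def _dist(a, b):
    """Euclidean distance between two 3D cell centers."""
    return math.dist(a, b)

def _gauss(x, mean, std):
    """Gaussian density with standard deviation std."""
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def pure_breed(sample, centers, origin, nu, eta):
    """Sketch of Eq. (4): build a model directly from one sample.

    `sample` is the per-cell feature list (None = empty), `centers` the 3D
    center of each cell, `origin` the sample center; nu is the constant
    initial feature variance and eta scales the spatial fall-off.
    """
    occupied = [j for j, s in enumerate(sample) if s is not None]
    p, mu, var = [], [], []
    for ci in centers:
        # nearest occupied cell j for this cell
        j = min(occupied, key=lambda k: _dist(ci, centers[k]))
        d = _dist(ci, centers[j])                 # dist_ij
        r_j = _dist(centers[j], origin)           # R_j: distance of cell j from center
        p.append(_gauss(d, 0.0, eta * r_j))       # p_i = G(dist_ij | 0, eta*R_j)
        mu.append(sample[j])                      # mu_i = s_j
        var.append(nu)                            # var_i = nu
    return p, mu, var
```

As expected from Eq. (4), cells closer to an occupied cell receive a larger point-occupying probability.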

Hybridization between a new sample and model: This operation first generates a pure-breeding model m_s from sample S, and then merges m_s and the current model m ∈ M into a new model m_new. Let the current model m be generated on the basis of Γ samples; then m_new, merged from m and m_s, is a (Γ+1)-sample model. Let m and S be matched with horizontal rotation angle θ, and let cell i in m_s correspond to cell θ(i) in m. µ_i^new, var_i^new, and p_i^new in m_new are computed as

p_i^new = (p_i^s + p_θ(i) · Γ) / (Γ + 1)   (5a)

µ_i^new =
  (µ_θ(i) p_θ(i) Γ + µ_i^s p_i^s) / (Γ p_θ(i) + p_i^s),   if p_θ(i) ≠ 0 and p_i^s ≠ 0
  µ_θ(i),   if p_θ(i) ≠ 0 and p_i^s = 0
  µ_i^s,    if p_θ(i) = 0 and p_i^s ≠ 0
  none,     if p_θ(i) = 0 and p_i^s = 0
  (5b)

var_i^new =
  max{ [ ((µ_θ(i) − µ_i^new)² + var_θ(i)) Γ p_θ(i) + (µ_i^s − µ_i^new)² p_i^s ] / (Γ p_θ(i) + p_i^s), 0.1ν },   if p_θ(i) ≠ 0 and p_i^s ≠ 0
  var_θ(i),   if p_θ(i) ≠ 0 and p_i^s = 0
  var_i^s,    if p_θ(i) = 0 and p_i^s ≠ 0
  none,       if p_θ(i) = 0 and p_i^s = 0
  (5c)

where the minimum var_i^new is set to 0.1ν to avoid overfitting between the model and samples.

2) MDL-based model selection: We use the MDL principle [35] to select a model subset M from the candidate model set M̄ in order to describe the different sub-categories:

argmin_{M ⊆ M̄} L(S, M),   L(S, M) = L(S | M) + L(M)   (6)

L(M) = −Σ_{m∈M} p_m log p_m,   L(S | M) = −Σ_{m∈M} p_m U_m   (7)

where M̄ denotes the candidate model set, and the total description length L(S, M) consists of the inter-model description length L(M) and the intra-model sample variation given the models, L(S | M). p_m is the probability of a sample being best matched to model m among all models in M. U_m is the average matching uncertainty between model m and its best-matched samples. Let u(S, m) denote the matching uncertainty between sample S and model m. The calculation of U_m is weighted by the sample reliability R(S):

u(S, m) = −log match(S, m)   (8a)

U_m = Σ_{Φ(S)=m} R(S) · u(S, m) / Σ_{Φ(S)=m} R(S)   (8b)

where Φ(S) = argmax_{m∈M} match(S, m).

We use a greedy strategy to find an approximate solution to this minimization problem, as illustrated in Fig. 4. In each step, we remove from the current model set M the model that minimizes the total description length:

argmin_{m∈M} L(S, M ∖ {m})   (9)
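This greedy backward elimination can be sketched with the description length abstracted into a callable `total_length` (a placeholder into which the paper's reliability-weighted objective of Eqs. (6)–(8) would be plugged):

```python
def greedy_select(candidates, total_length):
    """Greedy sketch of Eq. (9): starting from the full candidate set,
    repeatedly drop the model whose removal yields the smallest total
    description length, as long as that removal reduces it.
    """
    selected = set(candidates)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        best_len = total_length(selected)
        for m in list(selected):
            trial_len = total_length(selected - {m})
            if trial_len < best_len:          # removal would shorten the description
                best_len, worst = trial_len, m
                improved = True
        if improved:
            selected.remove(worst)
    return selected
```

The loop stops as soon as no single removal improves the objective, which makes the result an approximate (not guaranteed global) minimizer, consistent with the paper's description.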

Encouragement of sub-category diversity: In the early iterations, the models are seed-like, and thus prone to collecting seed-like samples. Therefore, seed-like samples make up a larger proportion than they should in early iterations. If we


Fig. 5. Object seeds

estimate p_m on the basis of the collected samples, the large p_m of seed-like models may prevent the selection of models with other shape styles. As a result, the learning process would be biased toward seed-like sub-categories. Hence, we set p_m to a constant c for all models to ensure sub-category diversity. Thus, based on (6) and (7), we get L(M) = −Σ_{m∈M} c log c and

argmin_{M ⊆ M̄} L(S, M) = argmin_{M ⊆ M̄} { −λ‖M‖ − Σ_{m∈M} U_m },   (10)

where λ = log c is a constant and M̄ denotes the candidate model set.

IV. EXPERIMENTS

A. Intelligent vehicle and 3D point cloud

We developed an intelligent vehicle system to collect 3D data. The vehicle is equipped with five single-row laser scanners to profile its surroundings in different directions, as shown in Fig. 1. A global positioning system (GPS) and an inertial measurement unit (IMU) are mounted on the vehicle. A localization module is developed by fusing the GPS/IMU navigation unit with a horizontal laser scanner. In this module, the localization problem is formulated as simultaneous localization and mapping with moving object detection and tracking [24], thereby ensuring both the global and local accuracy of the vehicle's pose estimation. A 3D representation of the environment is obtained by geo-referencing the local range data from the four slanted laser scanners in a global coordinate system, given the vehicle's pose and the geometric parameters of the slanted scanners.

Our intelligent vehicle collects 3D point cloud data in an urban environment. Unlike RGB-D image data, large-scale 3D point cloud data do not contain color information. The proposed category model mining requires that each category contain enough objects to form a shape pattern, and thus the environment must be quite large and contain various objects. The size of the entire point cloud in our experiments is 300 m × 400 m. We choose the wall, street, and tree as the three target categories in our experiments. Within the environment, a large number of cars are parked at the street sides. The smooth street has a small variation in shape, whereas the wall has a larger shape variation. The wall has a high noise level owing to its long distance from the intelligent vehicle. The wall also has various shapes, as it may be occluded by tree branches. Trees have the largest shape variation, owing to their varied structures.

B. Model learning

Object samples for the wall, street, and tree are randomly selected from the environment as seeds (as shown in Fig. 5). We set the minimum matching value for sample collection to ε = 0.1 and the model probability for model selection to λ = 0.03 for the street and tree categories; we set lower values, ε = 0.001 and λ = 0.01, for the wall category, as the wall is far from the vehicle, and occlusion and measurement noise thus cause large structural variations in the wall category.

C. Results and evaluation

Fig. 10 shows the trained models and collected samples. All collected samples and the division of their sub-categories are shown in Fig. 4. The three wall models represent three sub-categories: a normal shape (Wall 1), a noisy shape (Wall 2), and an incomplete shape (Wall 3). The two tree models represent isolated trees (Tree 1) and ∆-shaped trees (Tree 2). The street with cars at its sides is represented by Street 1, and the flat street is represented by Street 3. The model for Street 2 appears somewhat ambiguous, lying between the two other street sub-categories. Only three samples are described by Street 2 in the sample set (see the thick red line in Fig. 4), and there is a strange shape pattern, i.e. a fence in the middle of the street, in this environment (Fig. 10). Therefore, we consider Street 2 a biased model, which is typically an incorrect model in semi-supervised learning.

We use point labeling to evaluate the learned models. We compare the proposed method with pure seed-based point labeling, conventional unsupervised methods, and the widely used supervised AMN-based point cloud classification. Each competing method is typical of 3D point labeling in a large urban environment, although none is as close to the task of efficient model base construction as our method. We therefore only compare them from the perspective of point labeling. In addition, as point cloud data do not have color information, many RGBD-image-based methods [2], [4], [5], [7], [8], [9] are not comparable.

To construct the ground truth, we manually label 3D points as wall, flat street, tree, and car. Points labeled car and flat street are considered positives in the street detection. Both the tree seeds and tree models contain a small area of flat street, so the flat street points are considered neither positives nor negatives in the tree detection. Different segmentation thresholds (τ) are used to draw the performance curves for point labeling.

1) Comparison with seed-based point labeling: Seed-based point labeling directly generates the initial models from seeds, and uses them to retrieve objects in the environment. Fig. 6 compares model- and seed-based point labeling. The results demonstrate the accuracy of the learned models. Point labeling uses the global structure of objects, so small object fragments caused by large occlusions are not prone to being detected, although they are labeled as different categories in the ground truth. As a result, the curve does not continuously increase toward a recall of 100%.

2) Comparison with conventional unsupervised methods: The first unsupervised method for comparison is the unsupervised 3D category discovery and point labeling proposed in [1]. Considering the requirements for the inputs and outputs in Section II, this method is the most comparable to our own. Along with [1], we also compare our method


TABLE I
COMPARISON OF THE TIME COST

Unsupervised 3D category discovery [1]: 6280 s
Unsupervised repetitive shape extraction [20], [21], [1]: 3366 s
Our semi-supervised approach: Wall 200 s, Street 47 s, Tree 52 s

with unsupervised repetitive shape extraction. This method was originally proposed by [20], [21], and was applied to indoor environments. Zhang et al. extended the core idea of this method to a large urban environment, as one of the competing methods in the experiments of [1]. Thus, we choose this extended version of repetitive shape extraction for comparison. Note that repetitive shape extraction is mainly based on hierarchical clustering of object samples (see [1] for details), and, as is widely known, the cluster number (or cluster size) greatly affects the cluster purity. Fortunately, one of the final cluster-merging steps produces two clusters each in the wall, street, and tree categories, as well as one chaotic cluster (shown in Fig. 7), which is similar to the final number of sub-categories in our method. Thus, we use these clusters for comparison.
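The sensitivity of cluster purity to the cluster number noted above can be illustrated with a minimal sketch. The single-linkage agglomeration, the 1-D shape descriptors, and the labels below are all invented for illustration; they stand in for the actual clustering in [1], [20], [21].

```python
# Toy illustration: how the chosen cluster number affects cluster purity
# in hierarchical clustering. Features and labels are hypothetical.

def agglomerate(points, k):
    """Single-linkage agglomeration: merge the two closest clusters
    until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    return clusters

def purity(clusters, labels):
    """Fraction of samples carrying their cluster's majority label."""
    n = sum(len(c) for c in clusters)
    hit = sum(max(sum(1 for i in c if labels[i] == l) for l in set(labels))
              for c in clusters)
    return hit / n

feats = [0.1, 0.2, 0.3, 5.0, 5.1, 9.0, 9.2]          # hypothetical descriptors
labels = ["wall", "wall", "wall", "tree", "tree", "street", "street"]
p3 = purity(agglomerate(feats, 3), labels)  # one cluster per category: pure
p2 = purity(agglomerate(feats, 2), labels)  # too few clusters: mixed, impure
```

With three clusters each category is recovered cleanly, while forcing two clusters merges distinct categories and lowers purity, mirroring why the cluster-merging level in [1] must be chosen carefully for a fair comparison.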

Result: The results of the two unsupervised competing methods are shown in Fig. 6. The unsupervised 3D category discovery [1] shows relatively low error rates, while unsupervised repetitive shape extraction [20], [21], [1] exhibits relatively high detection rates.

Without shape guidance from object seeds or any mechanism to limit the bias problem, unsupervised repetitive shape extraction produces biased results, i.e. object samples that are not correctly localized (see Fig. 7). One category consists of wall fragments, and another category takes a combination of two trees as a single object. Actually, the chaotic cluster is a wall category, but it is greatly biased: its proportion of wall, flat street, tree, and car points is 1 : 0.4 : 0.6 : 0.2.

Time cost: Table I shows the time costs of the two unsupervised methods and our approach. Compared to the unsupervised methods, the biggest advantage of our semi-supervised approach is its low computational cost, as it only deals with semantically meaningful structure patterns belonging to certain target categories, rather than all possible (probably meaningless) repetitive patterns in a large environment. This high efficiency is important for knowledge mining from “big point cloud data”.

3) Comparison with AMN-based classification: AMNs [14] have demonstrated superior performance in the multiple-class classification of point clouds in recent years. Although they are not designed for category model mining, and despite their requirement for manually labeling a large amount of training data, we compare AMNs with our system from the perspective of point labeling. AMNs are trained to classify the wall, flat street, tree, car, and unlabeled categories. The unlabeled category mainly consists of small fragments resulting from data sparsity, and of other objects such as buses. These are unclear at the object level, and thus were not used in the previous evaluations. The evaluation follows the same criteria as above: car and flat street are considered positives in the street detection, and flat-street points are not used in


Fig. 7. Central samples of the clusters learned by unsupervised repetitive shape extraction [20], [21], [1]. The object samples describe, from left to right, the shape patterns of two wall categories, two street categories, two tree categories, and one chaotic category. The red rectangles indicate biased categories.


Fig. 8. Comparison of tree models (a) after different numbers of iterations, and (b) with different seed numbers. The curves with more than one seed converge, showing that a limited number of seeds is sufficient for model learning, as seed bias is overcome through sample collection.

the tree detection.

The max-margin strategy allows the AMN to operate as a powerful multiple-class classifier, but the algorithm does not use the global shapes of objects efficiently. In some cases, local features are not discriminative enough for classification, thus leading to relatively low recalls and high error rates. Fig. 6 shows the classification results with different numbers of training samples.

4) Other evaluations: We evaluate the models trained after different numbers of iterations, and with different seed numbers. The tree category is selected for these tests. We do not manually remove biased models from the results, to avoid bringing subjective judgments into the evaluation; in fact, there are no extremely biased models in the tree results.

Performance after different numbers of iterations: Fig. 8(a) shows a performance comparison. The initial models are directly generated from seeds. In early iterations, the system tends to collect seed-like samples, which enlarges the bias problem in the initial models. Thus, both the initial models and those after three iterations are highly accurate for seed-like objects (the recall is relatively high whereas the error rate is low), but they have low accuracy for non-seed-like objects (their recalls converge at a low value). However, our MDL-based model selection (Section III-C.2) ensures model diversity and enables the evolution from seed-like to well-developed models. As a result, the curves after six iterations converge, showing similarly good performance.
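The role of MDL in selecting the number of sub-category models can be sketched numerically. This is a hedged illustration only: the cost function, the per-sample residual figures, and the model cost are invented stand-ins, not the paper's actual formulation from Section III-C.2 or [35].

```python
# Toy MDL-style model selection: total description length = cost of
# encoding k sub-category models + cost of encoding the samples given
# those models. All numbers below are hypothetical.
import math

def description_length(k, fit_error_per_sample, n_samples, model_cost=50.0):
    """Bits to encode k models, plus per-sample residual coding cost,
    which shrinks as more sub-category models fit the data better."""
    data_cost = n_samples * math.log2(1.0 + fit_error_per_sample[k])
    return k * model_cost + data_cost

# Hypothetical residuals: a big drop from 1 to 2 models (a second
# structural style is discovered), then diminishing returns.
fit_err = {1: 8.0, 2: 2.0, 3: 1.5, 4: 1.4}
dl = {k: description_length(k, fit_err, n_samples=200) for k in fit_err}
best_k = min(dl, key=dl.get)  # MDL balances fit against model count
```

Under these toy numbers the minimum falls at an intermediate model count: too few models encode the data poorly, while redundant models add cost without reducing the residual, which is the mechanism by which redundant sub-category models get deleted.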

Performance with different seed numbers: Our algorithm does not require a large number of seeds for high accuracy, as the seeds are just the starting points of the learning process. The models are mainly trained using automatically collected samples. As shown in Fig. 8(b), except for the 1-seed curve, the other four curves converge, showing that a limited number of seeds is sufficient for model learning.

V. CONCLUSION AND DISCUSSION

This paper proposed a semi-supervised approach for training object models in a large and complex 3D environment,


Fig. 6. Comparison of competing methods, including seed-based point labeling, AMN-based point cloud classification [14], unsupervised 3D category discovery [1], and unsupervised repetitive shape extraction [20], [21], [1]. Performance curves based on both our learned models and seeds are shown. The biased model, Street 2, is removed and thus not used in our street detection; in addition, the black curve illustrates the performance without subtracting the biased models. Recalls and error rates of the unsupervised 3D category discovery [1] are indicated by the blue dots. As [1] regards the “flat street” and “the cars on the street sides” as two individual categories, we use two blue dots to show their performance for the street category. Besides, the evaluation criterion for the tree category in [1] differs from ours, so its tree-category performance is not shown. The purple and green dots indicate the unsupervised repetitive shape extraction [20], [21], [1] and the AMN-based multiple-class classification [14], respectively. In the learning of the AMNs, we label the seeds, and then further randomly label 300, 900, and 2700 point cloud cliques as different settings of the training samples (see the text next to the green dots).

Fig. 9. Model-based point labeling results. Different colors indicate different categories, i.e. wall (green), tree (red), and street (blue).

and described various experiments that demonstrated its effectiveness and high efficiency.

The proposed approach is a plausible way to mine models from “big point cloud data”, as it requires much less human labeling than supervised methods and incurs a lower computational cost than unsupervised methods.

Color information is not used, showing that the mining of category models can be performed using 3D point cloud data alone. In our experiment, the biased street model did not greatly reduce the point-labeling accuracy of the entire street category (Fig. 6), as it only detected a limited number of strange shape patterns. We can consider the biased model as an anomaly detected from the environment.

ACKNOWLEDGMENT

This work was supported by Microsoft Research, a Grant-in-Aid for Young Scientists (23700192) of Japan's Ministry of Education, Culture, Sports, Science, and Technology (MEXT), and a grant of Japan's Ministry of Land, Infrastructure, Transport and Tourism (MLIT).

REFERENCES

[1] Q. Zhang, X. Song, X. Shao, H. Zhao, and R. Shibasaki, “Unsupervised 3D Category Discovery and Point Labeling from a Large Urban Environment”, In ICRA, 2013.

[2] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena, “Contextually guided semantic labeling and search for three-dimensional point clouds”, In IJRR, vol. 32, no. 1, pp. 19–34, 2013.

[3] A. Aldoma, F. Tombari, L. D. Stefano, and M. Vincze, “A global hypotheses verification method for 3D object recognition”, In ECCV, 2012.

[4] K. Lai, L. Bo, X. Ren, and D. Fox, “Detection-based object labeling in 3D scenes”, In ICRA, 2012.

[5] K. Lai, L. Bo, X. Ren, and D. Fox, “Sparse distance learning for object recognition combining RGB and depth information”, In ICRA, 2011.

[6] W. Susanto, M. Rohrbach, and B. Schiele, “3D object detection with multiple Kinects”, In ECCV, 2012.

[7] X. Ren, L. Bo, and D. Fox, “RGB-(D) scene labeling: Features and algorithms”, In CVPR, 2012.

[8] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from RGBD images”, In ECCV, 2012.

[9] H. S. Koppula, A. Anand, T. Joachims, and A. Saxena, “Semantic labeling of 3D point clouds for indoor scenes”, In NIPS, 2011.

[10] A. Collet, S. S. Srinivasay, and M. Hebert, “Structure discovery in multi-modal data: a region-based approach”, In ICRA, 2011.

[11] E. Herbst, P. Henry, X. Ren, and D. Fox, “Toward object discovery and modeling via 3D scene comparison”, In ICRA, 2011.

[12] E. Herbst, X. Ren, and D. Fox, “RGB-D object discovery via multi-scene analysis”, In IROS, 2011.

[13] K. Lai, L. Bo, X. Ren, and D. Fox, “A Large-Scale Hierarchical Multi-View RGB-D Object Dataset”, In ICRA, 2011.

[14] D. Munoz, J. A. Bagnell, N. Vandapel, and M. Hebert, “Contextual classification with functional max-margin Markov networks”, In CVPR, 2009.


Fig. 10. Models and their corresponding samples. Each circle in the model represents a not-none cell, showing its point-occupying probability (intensity), mean value ([low] blue→green→red [high]), and variance (the reciprocal of the circle area) of the local feature.

[15] D. Anguelov, B. Taskary, V. Chatalbashev, D. Koller, D. Gupta, G. Heitz, and A. Ng, “Discriminative learning of Markov random fields for segmentation of 3D scan data”, In CVPR, 2005.

[16] L.-J. Li, G. Wang, and F.-F. Li, “OPTIMOL: automatic Online Picture collecTion via Incremental MOdel Learning”, In IJCV, vol. 88, no. 2, pp. 147–154, 2010.

[17] F. Endres, C. Plagemann, C. Stachniss, and W. Burgard, “Scene analysis using latent Dirichlet allocation”, In RSS, 2009.

[18] R. Detry, N. Pugeault, and J. H. Piater, “A probabilistic framework for 3D visual object representation”, In PAMI, vol. 31, no. 10, pp. 1790–1803, 2009.

[19] M. Ruhnke, B. Steder, G. Grisetti, and W. Burgard, “Unsupervised learning of compact 3D models based on the detection of recurrent structures”, In IROS, 2010.

[20] J. Shin, R. Triebel, and R. Siegwart, “Unsupervised discovery of repetitive objects”, In ICRA, 2010.

[21] M. Ruhnke, B. Steder, G. Grisetti, and W. Burgard, “Unsupervised learning of 3D object models from partial views”, In ICRA, 2009.

[22] G. Somanath, R. MV, D. Metaxas, and C. Kambhamettu, “D-clutter: Building object model library from unsupervised segmentation of cluttered scenes”, In CVPR, 2009.

[23] J.-F. Lalonde, N. Vandapel, D. F. Huber, and M. Hebert, “Natural terrain classification using three-dimensional ladar data for ground robot mobility”, In JFR, vol. 23, no. 1, pp. 839–861, 2006.

[24] H. Zhao, M. Chiba, R. Shibasaki, X. Shao, J. Cui, and H. Zha, “SLAM in a dynamic large outdoor environment using a laser scanner”, In ICRA, 2008.

[25] D. Munoz, N. Vandapel, and M. Hebert, “Onboard Contextual Classification of 3-D Point Clouds with Learned High-order Markov Random Fields”, In ICRA, 2009.

[26] R. Triebel, K. Kersting, and W. Burgard, “Robust 3D Scan Point Classification using Associative Markov Networks”, In ICRA, 2006.

[27] H. Zhao, Y. Liu, X. Zhu, Y. Zhao, and H. Zha, “Scene Understanding in a Large Dynamic Environment through a Laser-based Sensing”, In ICRA, 2010.

[28] A. Golovinskiy, V. G. Kim, and T. Funkhouser, “Shape-based Recognition of 3D Point Clouds in Urban Environments”, In ICCV, 2009.

[29] C. Wojek, S. Roth, K. Schindler, and B. Schiele, “Monocular 3D Scene Modeling and Inference: Understanding Multi-Object Traffic Scenes”, In ECCV, 2010.

[30] I. Posner, M. Cummins, and P. Newman, “A Generative Framework for Fast Urban Labeling Using Spatial And Temporal Context”, In Autonomous Robots, vol. 26, no. 2, pp. 153–170, 2009.

[31] K. Klasing, D. Wollherr, and M. Buss, “Realtime segmentation of range data using continuous nearest neighbors”, In ICRA, 2009.

[32] D. Anguelov, B. Taskary, V. Chatalbashev, D. Koller, D. Gupta, G. Heitz, and A. Ng, “Discriminative learning of Markov random fields for segmentation of 3D scan data”, In CVPR, 2005.

[33] D. Munoz, J. A. Bagnell, N. Vandapel, and M. Hebert, “Contextual Classification with Functional Max-Margin Markov Networks”, In CVPR, 2009.

[34] K. Klasing, D. Wollherr, and M. Buss, “A Clustering Method for Efficient Segmentation of 3D Laser Data”, In ICRA, 2008.

[35] M. H. Hansen and B. Yu, “Model selection and the principle of minimum description length”, In JASA, vol. 96, no. 454, pp. 746–774, 2001.

[36] R. Triebel, R. Paul, D. Rus, and P. M. Newman, “Parsing outdoor scenes from streamed 3D laser data using online clustering and incremental belief updates”, In AAAI, 2012.

[37] R. Paul, R. Triebel, D. Rus, and P. M. Newman, “Semantic categorization of outdoor scenes with uncertainty estimates using multi-class Gaussian process classification”, In IROS, 2012.

[38] Q. Zhang, X. Song, X. Shao, H. Zhao, and R. Shibasaki, “Category Modeling from just a Single Labeling: Use Depth Information to Guide the Learning of 2D Models”, In CVPR, 2013.

[39] Q. Zhang, X. Song, X. Shao, H. Zhao, and R. Shibasaki, “Learning Graph Matching for Category Modeling from Cluttered Scenes”, In ICCV, 2013.