Segmentation and Modelling of Visually Symmetric Objects by Robot Actions

Wai Ho Li and Lindsay Kleeman

Intelligent Robotics Research Centre
Department of Electrical and Computer Systems Engineering
Monash University, Clayton, Victoria 3800, Australia
{Wai.Ho.Li, Lindsay.Kleeman}@eng.monash.edu.au

Abstract—Robots usually carry out object segmentation and modelling passively. Sensors such as cameras are actuated by a robot without disturbing objects in the scene. In this paper, we present an intelligent robotic system that physically moves objects in an active manner to perform segmentation and modelling using vision. By visually detecting bilateral symmetry, our robot is able to segment and model objects through controlled physical interactions. Extensive experiments show that our robot is able to accurately segment new objects autonomously. We also show that our robot is able to leverage segmentation results to autonomously learn visual models of new objects by physically grasping and rotating them. Object recognition experiments confirm that the robot-learned models allow robust recognition. Videos of robotic experiments are available from Multimedia Extensions 1, 2 and 3.

Index Terms—fast symmetry, real time, computer vision, autonomous, segmentation, robotics, object recognition, SIFT, interactive learning, object manipulation, grasping

I. INTRODUCTION

The ability to perform object segmentation and modelling used to be the exclusive domain of higher primates. With passing time, computer vision research has produced ever improving systems that can segment and model objects. Modern techniques such as Interactive Graph Cuts [Boykov and Jolly, 2001] and Geodesic Active Contours [Markus et al., 2008] can produce accurate segmentations given some human guidance. Similarly, visual features such as SIFT [Lowe, 2004], Gabor Filter banks [Mutch and Lowe, 2006] and Haar wavelets [Viola and Jones, 2001] enable reliable object detection and recognition, especially when combined with machine learning methods such as Boosting using AdaBoost [Freund and Schapire, 1997]. However, these computer vision techniques rely heavily on a priori knowledge of objects and their surroundings, such as initial guesses of foreground-background pixels, which is difficult to obtain autonomously in real world situations.

This paper presents a robotic system that applies physical actions to segment and model new objects using vision. The system is composed of a robot arm that moves objects within its workspace inside the field of view of a stereo camera pair. The arm-camera geometry is configured to mimic a humanoid platform operating on objects supported by a flat table. A photo of our robotic system is shown in Figure 1. The checkerboard pattern is used to perform a once-off arm-camera calibration prior to robotic experiments.

Fig. 1. Robot System Components

Physical actions can reduce the need for prior knowledge by providing foreground-background segmentation. However, a robot will require significant training and background information to perform object manipulations autonomously. By limiting our scope to objects that exhibit bilateral symmetry perpendicular to a known plane, such as cups and bottles resting on a table, we propose a partial but robust solution to this problem. Given that many objects in domestic and office environments exhibit sufficient bilateral symmetry for our autonomous system, our symmetry-based approach can be employed in a wide variety of situations. Experiments show that our robot is able to autonomously segment and model new symmetric objects through the use of controlled physical actions. Object recognition experiments confirm that the robot-collected models allow robust recognition of learned objects.

 A. Object Segmentation

We define object segmentation as the task of finding all pixels in an image that belong to an object in the physical world. An object is defined as something that can be manipulated by our robot, such as a cup or bottle. Whereas image segmentation methods generally rely on consistency in adjacent pixels [Pal and Pal, 1993], [Skarbek and Koschan, 1994], object segmentation requires external knowledge so that the resulting segments are physically meaningful. For example, prior knowledge of background pixel statistics is used to perform object segmentation via background subtraction [Elgammal et al., 2000]. Similarly, interactive segmentation approaches rely on a priori information provided by a human user to produce useful segmentations. Such prior information allows object segmentation despite variations in pixel value within an object and similarities in pixel value between an object and the background.

Our robotic system uses physical interaction to provide prior information for object segmentation. Not to be confused with human-robot interaction or human-computer interaction, our robot physically interacts with objects in order to perform segmentation. Instead of actuating a camera in an eye-in-hand configuration, our robot actuates the object while keeping its cameras fixed. By having a robotic agent interactively move objects in a scene, our approach breaks the traditional computer vision paradigm of passive observation.

The notion of using robotic action to aid object segmentation was first proposed in [Tsikos and Bajcsy, 1988]. More recently, [Fitzpatrick, 2003], [Fitzpatrick and Metta, 2003] and [Kenney et al., 2009] showed that object segmentation can be performed using a robotic agent that employs a blind poking action to move objects. By limiting our scope to objects with visual symmetry, our approach improves on these recent works by decoupling object detection and object segmentation.

Recent interactive object segmentation approaches essentially sweep the end effector across a scene to try to poke objects. Object detection and segmentation are performed simultaneously when the robot detects a jump in visual motion beyond that of its moving end effector. In essence, the chance collision of the end effector with a new object provides a new segmentation. It is important to note that the robot does not hold any expectations as to if and when the effector-object collision occurs, as the robot is blind to the objects in the scene.

In contrast, our robot detects a new symmetric object and then moves the object using a planned manipulation. This plan-then-act approach allows segmentation to be delayed until after object motion has ceased. This differs from recent approaches that perform detection and segmentation at the time of effector-object contact, which can produce poor segments as the initial object motion can occur between video frames. Such poor segmentations are highlighted in Figure 11 of [Fitzpatrick and Metta, 2003], which shows that the robot's end effector can be included in the segmentation results. Near-empty segmentations were also returned by their approach. Similarly, Figures 6 and 9 of [Kenney et al., 2009] show small chunks missing from the object segmentations.

By incorporating real time object tracking into our segmentation approach, we are able to perform motion segmentation using video frames taken before and after object motion. This prevents poor segmentations due to insufficient or unexpected object motion. In addition, object symmetry is used to improve object segmentations. Object segmentations produced autonomously by our robot, including results for transparent objects, are shown in Section VIII-A.

Fitzpatrick et al. and Kenney et al. apply a relatively high-speed poke to actuate objects for motion segmentation. This is because segmentation is performed using motion or motion templates that arise during effector-object contact. As such, the end effector must move quickly to generate enough motion during object contact to initiate segmentation. In contrast, our plan-then-act approach allows the use of a gentle and purposeful robotic nudge to actuate objects since segmentation is performed using the video frames before and after object motion. The limiting factor for the robotic nudge is merely overcoming static friction between an object and the supporting surface. In experiments, we show that our method does not tip over tall objects such as empty bottles and does not damage fragile objects such as ceramic mugs. Fitzpatrick and Kenney did not test their segmentation approaches on any fragile or top-heavy objects.

  B. Object Modelling

Robust object recognition requires large quantities of training data, such as hand-labelled images of a target object under different illumination conditions. The fact that each new object in the recognition database requires new training data further compounds the problem in real world environments. While one-shot learning [Fei-Fei et al., 2006] provides some hope for the future, robust object recognition still depends heavily on large quantities of training data. Our robot is able to break this traditional dependence on hand-labelled data by collecting its own training images when it encounters a new object.

The object modelling process begins with the robot moving a new object over a short distance using a robotic nudge. However, moving an object on a table only provides a single view of the moved object. The object's height is estimated using the left and right camera segmentations by assuming a convex hull with a flat upper surface. This allows our robot to physically pick up the nudged object. The robot's ability to move autonomously from a simple nudge to the more advanced grasp is novel and useful in situations where the robot has to deal with new objects.

After segmentation, the robot picks up the nudged object and rotates it to collect training images over the entire 360 degrees of the grasped object. Object models are constructed using these robot-collected training images. The proposed approach differs from the traditional approach of offline image collection and feature detection using a turntable-camera rig as surveyed in [Moreels and Perona, 2005]. Our approach also differs from semi-autonomous systems, such as [Kim et al., 2006], that require a human user to provide the robot with different views of a test object. Instead, our robot autonomously learns new objects by modelling them online. Object recognition experiments suggest that the robot is able to learn useful visual models of textured objects.

The robot's gripper has two wide foam fingers, which can be seen in Figure 2. The foam-padded gripper ensures a stable grasp but does not allow an accurate pose estimate of the grasped object. This means that foreground-background segmentation is not available during training image collection. As such, multi-view object recognition methods such as [Chen and Chen, 2004] are unsuitable because they rely on well segmented training images. Instead, our approach extracts SIFT descriptors [Lowe, 2004] from the robot-collected training images to model objects. As the inclusion of background SIFT descriptors in the object model can produce false positives during recognition, we have developed an automatic descriptor pruning method. The pruning method compares descriptors between the images within a robot-collected training set to reject background descriptors. The pruning method is applicable to any SIFT detection scenario where an object is rotated in front of a static background.

II. SYSTEM OVERVIEW

 A. Robot Hardware

The components of our robot are shown in Figure 1. Two Videre Design IEEE1394 colour CMOS cameras form a stereo pair. The cameras are verged together at an angle of 15 degrees from parallel to provide a greater overlap in the stereo view of the table. A PUMA 260 six degree-of-freedom robot arm is used to perform simple object manipulations. The calibration grid is used to obtain the geometric relationship between the stereo cameras and the robot arm, as well as to determine the geometry of the table plane. Details of the arm-camera calibration are provided in Section II-B.

Two views of the robot's Ottobock gripper are available in Figure 2. Pieces of soft and hard foam have been added to the gripper to enable robust and safe object manipulations. An L-shaped soft foam protrusion is used to direct a damped pushing force to the bottom portion of objects during the robotic nudge. This foam nudger is shown at the top right corner of Figure 2(a). Wide rigid foam pads have been affixed to both fingers of the gripper in order to improve stability and reliability when the robot picks up and rotates nudged objects. These blue foam pads can be seen in Figures 2(a) and 2(b).

(a) Front view of gripper (b) Side view of gripper

Fig. 2. Photos of foam-padded robot gripper. The L-shaped protrusion is used to perform the robotic nudge (inside red rectangle in left photo)

 B. System Calibration

Our robot assumes two pieces of prior knowledge. Firstly, the geometric relationship between the stereo cameras and the robot arm is required. This piece of a priori information allows the translation of object locations obtained via visual sensing to the robot arm's coordinate frame. Secondly, our robotic system requires the geometry of the table plane. This prevents the robot arm from colliding with the table and also allows the end effector to move in parallel with the table plane during the robotic nudge. Both pieces of prior knowledge are obtained during system calibration, which only needs to be repeated if the pose of the stereo cameras changes relative to the table.

System calibration begins with stereo camera calibration using the MATLAB camera calibration toolbox [Bouguet, 2006]. The intrinsic parameters of each camera are obtained, followed by a stereo camera calibration process to find the extrinsic parameters. The stereo camera calibration allows the triangulation of locations in 3D relative to the cameras. The corners of the checkerboard calibration pattern are triangulated in three dimensions. Subsequently, the geometry of the table top is found by fitting a plane to the corners of the checkerboard pattern.
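
For illustration, the plane-fitting step can be sketched as a least-squares fit to the triangulated corner points. This is not the toolbox code used by our system; the function name and the example corner coordinates below are hypothetical.

    import numpy as np

    def fit_table_plane(points):
        # Fit a plane to N x 3 triangulated corner points by least squares.
        # Returns (normal, d) with the plane written as normal . x + d = 0.
        centroid = points.mean(axis=0)
        # The right singular vector with the smallest singular value of the
        # centred points is the direction of least variance, i.e. the normal.
        _, _, vt = np.linalg.svd(points - centroid)
        normal = vt[-1]
        d = -normal.dot(centroid)
        return normal, d

    # Hypothetical triangulated checkerboard corners (metres).
    corners = np.array([[0.10, 0.02, 0.50],
                        [0.15, 0.02, 0.55],
                        [0.20, 0.03, 0.60],
                        [0.12, 0.02, 0.62]])
    table_normal, table_d = fit_table_plane(corners)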

The arm-camera calibration is also performed with the help of the calibration pattern. Note that this calibration is not required for robotic platforms such as humanoid robots, as the coordinate transformation between the arm and camera frames is fixed. However, as our stereo camera system is movable to allow for greater flexibility during testing and experiments, explicit arm-camera calibration is needed. A possible alternative to prior system calibration is to use an online approach such as visual servoing with end effector markers [Taylor and Kleeman, 2002].

Arm-camera calibration begins by having the robot arm draw a grid of black dots on the table using a custom-made pen attachment. The grid of dots is placed at known locations relative to the arm's coordinate frame. Next, these dots are triangulated using the stereo camera to produce a corresponding set of locations relative to the camera's coordinate frame. The coordinate frame transformation is found by solving the Absolute Orientation problem, which returns the transformation that maps the dots between the arm and camera frames. We used a PCA approach [Arun et al., 1987] to solve the absolute orientation problem.
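
The closed-form solution of [Arun et al., 1987] can be sketched as follows using the SVD of the cross-covariance matrix of the two point sets. This is a generic implementation of that method rather than our calibration code, and the variable names are our own.

    import numpy as np

    def absolute_orientation(arm_pts, cam_pts):
        # Rigid transform (R, t) mapping camera-frame points onto arm-frame
        # points in the least-squares sense of Arun et al. (1987).
        # arm_pts, cam_pts: (N, 3) arrays of corresponding dot locations.
        mu_a = arm_pts.mean(axis=0)
        mu_c = cam_pts.mean(axis=0)
        # Cross-covariance of the centred correspondences.
        H = (cam_pts - mu_c).T @ (arm_pts - mu_a)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        # Guard against a reflection solution.
        if np.linalg.det(R) < 0:
            Vt[-1, :] *= -1
            R = Vt.T @ U.T
        t = mu_a - R @ mu_c
        return R, t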

C. Object Segmentation and Modelling Process

The autonomous object segmentation and modelling process is summarized by the flowchart in Figure 3. Rectangular boxes in the flowchart represent the steps performed by the robot in order to segment and model an object. The rounded boxes represent the results generated from the adjoining step.

The process begins by finding an interesting location that can be further explored via physical interaction. This is done by surveying the table using the stereo cameras and exploiting the known geometry between the robot arm, cameras and table plane. The clustering algorithm used to obtain interesting locations is detailed in Section III.

Upon finding an interesting location to explore, a physical action that we call the robotic nudge is performed in order to generate predictable object motion. Note that the robot will only attempt a robotic nudge if the interesting location being explored is physically reachable with the robot arm's end effector. If the robotic nudge yields object motion, real time symmetry tracking is performed in stereo to keep track of the moving object. Section IV details the robotic nudge and our stereo tracking approach.


Fig. 3. Autonomous Object Segmentation and Modelling Flowchart

A robotic nudge is deemed successful if stereo object tracking converges after the nudge, as this implies that a symmetric object has been actuated a short distance on the table. After a successful nudge, object segmentation is performed using a compressed frame difference. This novel symmetry-based motion segmentation method is described in Section V.

Object segmentation is performed individually for the left and right cameras. The segmentations are used to estimate the height of the nudged object. Leveraging this knowledge, the robot then picks up and rotates the object, collecting a series of training images at regular orientation increments during object rotation. The training images are added to the object database as part of the object model. This robot-driven training data collection approach is covered by Section VI.

Finally, the robot places the object back onto the table to allow future segmentation and modelling attempts. The modelling process concludes with the robot performing offline object modelling using the training images. SIFT [Lowe, 2004] is used to extract robust affine invariant features from the training images. These features are also added to the object database as part of the object model. The SIFT detection and pruning methods are detailed in Section VII.

Each step presented in the flowchart is given its own section in the remainder of this paper, followed by extensive experimental results and conclusions.

III. FINDING INTERESTING LOCATIONS

The autonomous segmentation and modelling process begins with a search for locations that the robot should explore using simple manipulations. These interesting locations are defined as 2D locations on the table plane that are likely to contain a symmetric object. Interesting locations are found by collecting and clustering symmetry intersects, both of which are further detailed below. Symmetry detection is performed using our fast symmetry detector [Li et al., 2008].
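
The detector itself is described in [Li et al., 2008]. Purely as an illustration of the underlying idea, the sketch below finds candidate bilateral symmetry lines by letting every pair of edge pixels vote for its perpendicular bisector in a Hough-style (theta, r) accumulator. This brute-force O(n^2) version is a simplification and should not be read as the fast symmetry algorithm; the bin sizes and the lack of peak non-maximum suppression are arbitrary choices.

    import numpy as np

    def symmetry_lines(edge_xy, n_theta=180, r_bins=400, r_max=400.0, top_k=3):
        # edge_xy: (N, 2) numpy array of edge pixel coordinates.
        # A symmetry line is parameterized as x*cos(theta) + y*sin(theta) = r.
        votes = np.zeros((n_theta, r_bins))
        n = len(edge_xy)
        for i in range(n):
            for j in range(i + 1, n):
                mid = 0.5 * (edge_xy[i] + edge_xy[j])
                dx, dy = edge_xy[j] - edge_xy[i]
                # The perpendicular bisector of the pair has normal (dx, dy).
                theta = np.arctan2(dy, dx) % np.pi
                r = mid[0] * np.cos(theta) + mid[1] * np.sin(theta)
                ti = int(theta / np.pi * n_theta) % n_theta
                ri = int((r + r_max) / (2.0 * r_max) * r_bins)
                if 0 <= ri < r_bins:
                    votes[ti, ri] += 1
        # Return the top_k accumulator peaks as (theta, r) pairs.
        peaks = np.argsort(votes, axis=None)[::-1][:top_k]
        lines = []
        for p in peaks:
            ti, ri = divmod(p, r_bins)
            lines.append((ti * np.pi / n_theta,
                          ri / r_bins * 2.0 * r_max - r_max))
        return lines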

Symmetry lines are detected in the left and right video frames to provide data to a clustering algorithm. All possible pairings of symmetry lines between the left and right images are triangulated to form 3D axes of symmetry using the method described in our previous paper [Li and Kleeman, 2006a]. In our experiments, three symmetry lines are detected for each image, resulting in a maximum of nine triangulated axes of symmetry. Symmetry axes that lie outside the robot manipulator's workspace are ignored. Axes that have an angle of more than 10 degrees with respect to the table plane normal are also ignored.
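
A sketch of the perpendicularity test used to discard axes is given below; the 10 degree threshold is the value quoted above, while the function and argument names are ours.

    import numpy as np

    def keep_axis(axis_dir, table_normal, max_angle_deg=10.0):
        # True if a triangulated symmetry axis is within max_angle_deg of the
        # table plane normal, i.e. roughly perpendicular to the table.
        a = axis_dir / np.linalg.norm(axis_dir)
        n = table_normal / np.linalg.norm(table_normal)
        angle = np.degrees(np.arccos(np.clip(abs(a.dot(n)), -1.0, 1.0)))
        return angle <= max_angle_deg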

The intersections between the remaining symmetry axes and the table plane are collected over 25 pairs of video frames and recorded as 2D locations on the table plane. This collection of locations is grouped into clusters using a modified QT algorithm [Heyer et al., 1999]. The QT clustering algorithm does not require any prior knowledge of the number of actual clusters. This is important as we are not making any assumptions concerning the number of objects on the table. The QT algorithm also provides a way to limit the diameter of clusters, reducing the likelihood of clusters that include symmetry lines from multiple objects. The original QT algorithm was modified with the addition of a cluster quality threshold. The quality threshold is used to ignore clusters formed by symmetry axes that occur in less than half of all collected frames. The geometric centroid of the cluster nearest to the robot's cameras is chosen as the next interesting location to explore using the robotic nudge as detailed in the next section. This helps prevent poor segmentations due to occluding objects.
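
The sketch below is a minimal, greedy version of QT-style clustering with the diameter limit and quality threshold described above. It simplifies the candidate-growing rule of the original algorithm, and the numeric values in the usage comment (cluster diameter, 13-frame support) are assumptions rather than our tuned parameters.

    import numpy as np

    def qt_cluster(points, max_diameter, min_support):
        # Greedy QT-style clustering of 2D table-plane intersection points.
        # points: (N, 2) array; max_diameter: cluster diameter limit;
        # min_support: minimum cluster size (quality threshold).
        remaining = list(range(len(points)))
        clusters = []
        while remaining:
            best = None
            # Grow a candidate cluster around every remaining seed point.
            for seed in remaining:
                cand = [seed]
                for p in remaining:
                    if p == seed:
                        continue
                    trial = points[cand + [p]]
                    # Diameter = largest pairwise distance in the trial cluster.
                    diam = max(np.linalg.norm(a - b)
                               for a in trial for b in trial)
                    if diam <= max_diameter:
                        cand.append(p)
                if best is None or len(cand) > len(best):
                    best = cand
            if len(best) < min_support:
                break                      # remaining points are treated as noise
            clusters.append(np.array(best))
            remaining = [p for p in remaining if p not in best]
        return clusters

    # Hypothetical usage: 25 frame pairs, require support in at least half.
    # clusters = qt_cluster(intersections, max_diameter=0.08, min_support=13)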

IV. THE ROBOTIC NUDGE

Foreground-background segmentation is initiated by an action we call the robotic nudge. Unlike the poking action used by Fitzpatrick [Fitzpatrick, 2003], the robotic nudge is a gentle and low speed action. The nudge is designed to move a new object a short distance across the camera's field of view in a manner that limits scale change. By careful motion control and real time tracking, a variety of symmetric objects can be actuated using the robotic nudge. The experimental results in Section VIII-A show the successful segmentation of top heavy and fragile objects using the robotic nudge.

  A. Motion Control

An L-shaped foam protrusion, highlighted by a red rectangle in Figure 2, is used to actuate objects. The primary purpose of the protrusion is to allow the application of pushing force near the bottom of nudged objects, which prevents objects from being tipped over. The foam protrusion also provides some damping during effector-object contact, which is essential when actuating brittle objects.


Figure 4 shows the side view of the robotic nudge motion. The nudge is performed by moving the end effector in an L-shaped trajectory. The height of P0 is well above the height of the tallest expected object. Dmax is set to allow enough clearance so that the L-shaped foam protrusion does not collide with objects as it is lowered. In our robotic experiments, Dmax was set to provide a safety margin of 20mm. Dmin is chosen so that the smallest expected object will be sufficiently actuated by the robotic nudge for subsequent motion segmentation. Note that the L-shaped nudge motion is essential. In early experiments, the gripper retreated from P2 directly to P0, which knocked over objects that are wider on top such as the plastic cup in Figure 14.

Fig. 4. The robotic nudge: Side view

An overhead view of the robotic nudge is provided in Figure 5. Note that the nudge is performed along a vector that is perpendicular to the line formed between the right camera's focal point and the target object's symmetry axis. This reduces the scale change incurred by the object as it is actuated across the camera's field of view and also lowers the probability of object rotation due to indirect contact. As such, the robotic nudge is designed to improve the quality of motion segmentation.

Fig. 5. The robotic nudge: Overhead view

After finding a location of interest, the robot calculates the positions P0, P1 and P2 based on the right camera's location. Linearly interpolated encoder values are generated using inverse kinematics at run time to move the end effector smoothly between these three points. A robotic nudge captured by the right camera can be seen in Figure 6.
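
The overhead geometry can be sketched as follows: the push direction lies in the table plane and is perpendicular to the ray from the right camera's focal point to the object's symmetry axis. The function below is illustrative only; in particular, the sign of the returned vector (pushing to the left or right of the ray) is an additional choice that is not fixed here.

    import numpy as np

    def nudge_direction(cam_pos, object_pos, table_normal):
        # Unit vector along the table plane, perpendicular to the horizontal
        # ray from the right camera's focal point to the object's symmetry axis.
        n = table_normal / np.linalg.norm(table_normal)
        ray = object_pos - cam_pos
        ray -= ray.dot(n) * n            # project camera-object ray onto table
        ray /= np.linalg.norm(ray)
        push = np.cross(n, ray)          # in-plane direction perpendicular to ray
        return push / np.linalg.norm(push)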

Fig. 6. Right camera video frames captured during a nudge. The frames are taken from the P1-P2-P1 part of the robotic nudge

  B. Stereo Tracking using Symmetry

During the robotic nudge, the right camera image is monitored for object motion. Motion detection is performed on the part of the image coloured green in Figure 4. By ignoring motion between the object's symmetry line and the end effector, the robot's ego motion will not be misinterpreted as object motion. Motion detection is performed using the block motion algorithm in our fast symmetry tracker [Li and Kleeman, 2006b].

After detecting object motion, tracking is performed on the object's symmetry line using a Kalman filter. Tracking is performed independently on both cameras in the stereo pair and continues until the nudged object stops moving. Motion segmentation will only take place if both trackers converge to symmetry lines that triangulate to a symmetry axis roughly perpendicular to the table plane. This prevents poor segmentation caused by insufficient object motion. Note that if no object motion is detected, tracking is not performed and the object learning process begins anew at the next location of interest.
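
As an illustration of the tracking step, the sketch below is a generic constant-velocity Kalman filter over the detected symmetry line parameters (theta, r) of one camera. The state layout and the noise covariances are illustrative assumptions, not the tuned values used by our tracker.

    import numpy as np

    class SymmetryLineKF:
        # Constant-velocity Kalman filter over a symmetry line (theta, r).
        # State: [theta, r, d_theta, d_r].  Noise values are illustrative.

        def __init__(self, theta0, r0, dt=0.04):
            self.x = np.array([theta0, r0, 0.0, 0.0])
            self.P = np.eye(4)
            self.F = np.eye(4)
            self.F[0, 2] = self.F[1, 3] = dt            # state transition
            self.H = np.zeros((2, 4))
            self.H[0, 0] = self.H[1, 1] = 1.0            # measure theta and r
            self.Q = np.diag([1e-4, 1e-1, 1e-3, 1.0])    # process noise
            self.R = np.diag([1e-3, 4.0])                # measurement noise

        def predict(self):
            self.x = self.F @ self.x
            self.P = self.F @ self.P @ self.F.T + self.Q
            return self.x[:2]

        def update(self, theta_meas, r_meas):
            z = np.array([theta_meas, r_meas])
            y = z - self.H @ self.x                      # innovation
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
            self.x = self.x + K @ y
            self.P = (np.eye(4) - K @ self.H) @ self.P
            return self.x[:2]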

V. OBJECT SEGMENTATION BY COMPRESSED FRAME DIFFERENCE

Our fast symmetry detector was previously used to perform static object segmentation [Li et al., 2008]. This dynamic programming approach generates object contours by linking edges using a symmetry constraint. It is unable to recover asymmetric parts of symmetric objects such as cup handles. It is also prone to edge noise and background symmetry, which may result in segmentations that are inaccurate when compared against the actual physical object. This section presents a new motion segmentation approach that remedies these problems by making use of the quasi-predictable object motion generated by the robotic nudge.

The motion segmentation process is illustrated by Figure 7. The right camera images before and after the robotic nudge are shown in Figures 7(a) and 7(b). Motion segmentation begins by computing the absolute frame difference between the before and after images, which results in the image in Figure 7(c). The object's symmetry lines before and after the robotic nudge are overlaid onto the frame difference image as green lines. Note that thresholding the frame difference image at this stage will produce a segmentation mask that includes too many background pixels. Also, if the moving object is not highly textured, a large gap will be present at the centre of the motion mask. Both problems can be seen in Figure 7(c). We can overcome these problems by using the object's symmetry to our advantage.

(a) Before Nudge (b) After Nudge (c) Frame Difference

(d) Compressed Difference (e) Symmetry Filled (f) Segmentation Result

Fig. 7. Segmentation by Compressed Frame Difference. The Compressed Difference and Symmetry Filled images are rotated so that the object's symmetry line is vertical

The compressed frame difference is shown in Figure 7(d). This image is generated by removing the pixels between the symmetry lines in the frame difference image, compressing the two symmetry lines into one. This process also removes changes in the object's orientation caused by the robotic nudge. Notice that the compressed frame difference no longer includes many background pixels. The motion gap present in the raw frame difference image is also smaller in the compressed frame difference.

A small motion gap may remain in the compressed frame difference. This can be seen in Figure 7(d) as a dark V-shape bisected by the symmetry line. To remedy this, we again exploit object symmetry to our advantage. The result in Figure 7(e) is obtained by following the symmetry filling process illustrated in Figure 8.

Fig. 8. Symmetry filling process used to generate the result in Figure 7(e)

Recall that the compression step merges the symmetry lines of the object in the before and after frames. Using this newly merged symmetry line as a mirror, we search for motion on either side of it. A pixel is considered moving if its frame difference value is above a threshold. These pixels are coloured gray in Figure 8. The filling process marks all pixels from the symmetry line to the outermost symmetric pixel pair as moving. This allows the process to fill motion gaps in the interior of an object while retaining asymmetric parts of a symmetric object such as the handle of a mug. The object segmentation result in Figure 7(f) is obtained by using the symmetry filled image as a mask.
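
A minimal sketch of the filling step is given below, assuming the compressed difference image has already been rotated so that the merged symmetry line is a vertical image column (as in Figures 7(d) and 7(e)); the threshold value is an arbitrary assumption.

    import numpy as np

    def symmetry_fill(diff, sym_col, thresh=25):
        # diff: 2D compressed frame difference, rotated so the merged symmetry
        # line is the image column sym_col.  Returns a boolean motion mask.
        moving = diff > thresh
        mask = moving.copy()                  # keeps asymmetric parts (handles)
        height, width = diff.shape
        max_off = min(sym_col, width - 1 - sym_col)
        for row in range(height):
            # Offsets at which moving pixels appear on BOTH sides of the line.
            offs = [d for d in range(1, max_off + 1)
                    if moving[row, sym_col - d] and moving[row, sym_col + d]]
            if offs:
                d = max(offs)                 # outermost symmetric pixel pair
                mask[row, sym_col - d: sym_col + d + 1] = True
        return mask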

VI. PICKING UP AND ROTATING NUDGED OBJECTS

The object modelling process begins after the robotic nudge. The robot uses the object segmentation results from both cameras to estimate the height of the object. The top of the nudged object in the image is determined by following the object's symmetry line upwards. The top of the object is where its symmetry line intersects with the object-background boundary of its segmentation. Figure 9 visualizes an object's symmetry line and the top of the object as detected by our robotic system.

(a) Left camera image (b) Right camera image

Fig. 9. The top of a nudged object’s symmetry line as detected by the robot

Figure 10 illustrates how an object's height is estimated using its symmetry axis. The symmetry axis is produced by the same stereo triangulation process employed in Section III. The blue line joins the camera's focal point and the top of the object as detected in the camera view. The estimated height is marked as a black dot. Note that the estimated height has a systematic bias that makes it greater than the actual height of the physical object. Height estimates from the left and right camera views are cross-checked for consistency before attempting to grasp the object.

 

Fig. 10. Object height estimation using symmetry axis showing systematic bias in height estimate.

In Figure 10, r represents the object radius and d is the systematic bias of the estimated height. In cases where the object deviates from a surface of revolution, r represents the horizontal distance between the object's symmetry axis and the point on the top of the object that is furthest from the camera. The angle between the camera's viewing direction and the table plane is labelled as θ. Using similar triangles, the height error d is described by the following equation. Note that the equation assumes an object with a convex hull that has a flat upper surface and ignores the effects of an object appearing off centre in the camera image.

d = r tan θ    (1)

For our experimental rig, which simulates the arm-camera geometry of a humanoid platform, θ is roughly 30 degrees. As we are only interested in robot-graspable objects, we assume radii ranging from 30mm to 90mm. This produces a d error value between 18mm and 54mm. To compensate for this error, the gripper is vertically offset downwards by 36mm during object grasping. As the vertical tolerance of the robot's two-fingered end effector is well over ±18mm, object grasping is reliable as demonstrated by the experiments detailed in Sections VIII-B and IX.
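
A small worked sketch of Equation 1 and the fixed grasp offset follows. Taking θ as exactly 30 degrees gives a bias of roughly 17 mm to 52 mm over the assumed radius range, consistent with the 18 mm to 54 mm quoted above for the actual rig geometry.

    import math

    def height_bias(radius_mm, view_angle_deg=30.0):
        # Systematic over-estimate of object height, d = r * tan(theta).
        return radius_mm * math.tan(math.radians(view_angle_deg))

    # Bias over the assumed graspable radius range of 30 mm to 90 mm.
    bias_range = [height_bias(r) for r in (30.0, 90.0)]
    grasp_offset_mm = 36.0   # fixed downward offset applied during grasping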

  A. Training Image Collection

After estimating the nudged object's height, grasping is performed by lowering the opened gripper vertically along the object's symmetry axis. When the gripper arrives at the top of the object, offset downwards by the height triangulation error d, a power grasp is performed by closing the gripper. The object is raised until most of the gripper is outside the field of view of the stereo cameras. This helps prevent the inclusion of end effector features in the object's model.

Training images are collected by rotating the grasped object about a vertical axis. Right camera images are taken at 30-degree intervals over 360 degrees to produce 12 training images per object. The 30-degree angle increment is chosen according to the ±15 degrees view point tolerance reported for SIFT descriptors [Lowe, 2004]. The first two images of a training set collected by the robot are shown in Figure 11. Each training image is 640 × 480 pixels in size.

Fig. 11. Two of twelve images in the green bottle training set. The right image was captured after the robot has rotated the grasped object by 30 degrees

VII. OFFLINE OBJECT MODELLING USING SIFT

The scale invariant feature transform (SIFT) [Lowe, 2004] is a multi-scale feature detection method that extracts unique descriptors from affine regions in an image. It is attractive for robotic applications because SIFT descriptors are robust against translation, rotation, illumination changes and small changes in viewing angle.

  A. SIFT Detection

Recall that the robot rotates a grasped object to collect 12 training images at 30-degree increments. After object manipulation, SIFT detection is performed on each image in a training set using David Lowe's binary implementation. Our own C/C++ code is used to match and visualize descriptors. The locations of SIFT descriptors detected in a training image are shown as blue dots in Figure 12(a). Note the dense coverage of descriptors over the grasped object.

  B. Pruning Background Descriptors

Figure 12(a) highlights the need to prune non-object descriptors before building object models. The inclusion of non-object descriptors may lead to false positives in future object recognition attempts. This problem will be especially prominent when the robot is operating on objects set against similar backgrounds.

An automatic pruning method is used to remove non-object descriptors as well as repetitive object descriptors. The pruned result is shown in Figure 12(b). Notice that the majority of background descriptors, including the descriptor extracted from the object's shadow, have been successfully removed. Experiments suggest that the remaining non-object descriptors have a negligible effect on object recognition performance.

(a) All detected descriptors

(b) Background descriptors pruned

Fig. 12. Pruning background SIFT descriptors

Pruning is performed as follows. Firstly, a loose bounding box is placed around the grasped object to remove background descriptors. The bounding box is large enough to accommodate the object tilt and displacement that occurs during rotation. This step removes a large portion of background descriptors.

The second pruning step searches within an entire training set to remove background descriptors. As the grasped object is rotated in front of a static background, background descriptors will occur much more frequently within a training image set. We assume that an object descriptor in the current training image may also be detected in the images collected at the previous object rotation as well as the next object rotation. This means a descriptor belonging to the grasped object should only match with a maximum of two descriptors from other images in the same training set. The second pruning step makes use of this observation by removing descriptors that have three or more matches with descriptors from other images in the same training set.

Our automatic pruning method should generalize to any SIFT feature set collected from images of an object rotated in front of a static background. The descriptor threshold in the second pruning step can be adjusted based on the object rotation increment between training images. Apart from increasing object recognition robustness by reducing the probability of false positives, the reduction in the number of descriptors also reduces the computational cost of recognition. In the example shown in Figure 12, the number of descriptors is reduced from 268 to 163.
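
A brute-force sketch of the second pruning step is given below. It keeps a descriptor only if it matches descriptors from at most two other images of the same training set, using a Lowe-style distance-ratio test; the function names, and the reuse of the 0.6 ratio from the recognition stage, are our assumptions.

    import numpy as np

    def prune_background_descriptors(desc_sets, max_other_matches=2, ratio=0.6):
        # Second pruning step: drop descriptors that match descriptors in
        # three or more OTHER images of the same training set.
        # desc_sets: list of (N_i, 128) SIFT descriptor arrays, one per image.
        # Returns a list of boolean keep-masks, one per image.
        def matches(d, other):
            # Ratio test against one other image's descriptors.
            dists = np.linalg.norm(other - d, axis=1)
            if len(dists) < 2:
                return False
            best, second = np.partition(dists, 1)[:2]
            return best < ratio * second

        keep_masks = []
        for i, descs in enumerate(desc_sets):
            counts = np.zeros(len(descs), dtype=int)
            for j, other in enumerate(desc_sets):
                if i == j:
                    continue
                for k, d in enumerate(descs):
                    if matches(d, other):
                        counts[k] += 1
            # An object descriptor should match at most the neighbouring views.
            keep_masks.append(counts <= max_other_matches)
        return keep_masks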

C. Object Recognition

The robot's object recognition system is described in Figure 13. An object database is created from autonomously collected training images. Each object is modelled in the database by twelve images, the SIFT descriptors detected in these images and an object label. Note that SIFT detection is performed on grayscale images so no colour information is retained in the database. The object label is a text string specified by the user.

Fig. 13. Object recognition using SIFT

Object recognition is done by comparing an input image against the object database. SIFT detection is performed on the input image to obtain a set of input descriptors. The input descriptors are matched against all descriptor sets in the database, which are drawn as green squares with angles in Figure 13. Two descriptors are considered to match if the Euclidean distance between them is smaller than 0.6 times the distance to the second best match. The descriptor set with the most matches is considered the best. The object recognition system also returns the training image and object label associated with the best descriptor set. As pose estimation requires three correct matches [Lowe, 2004], the recognition system will only return a result when three or more matches are found.
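
The matching and voting scheme can be sketched as follows: each input descriptor is matched with the 0.6 distance-ratio test, the database descriptor set with the most matches wins, and no result is returned below three matches. The database layout shown (label, view angle, descriptor array) is a hypothetical structure for illustration.

    import numpy as np

    def count_matches(input_descs, db_descs, ratio=0.6):
        # Count input descriptors whose nearest database descriptor passes
        # the distance-ratio test (best < ratio * second best).
        count = 0
        for d in input_descs:
            dists = np.linalg.norm(db_descs - d, axis=1)
            if len(dists) < 2:
                continue
            best, second = np.partition(dists, 1)[:2]
            if best < ratio * second:
                count += 1
        return count

    def recognise(input_descs, database, min_matches=3):
        # database: list of (label, view_angle, descriptor_array) entries.
        # Returns the label, view and match count of the best descriptor set,
        # or None if fewer than min_matches matches are found.
        best = max(database, key=lambda e: count_matches(input_descs, e[2]))
        n = count_matches(input_descs, best[2])
        return (best[0], best[1], n) if n >= min_matches else None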

VIII. ROBOTIC EXPERIMENTS

Three sets of experiments were carried out using our robotic system. Firstly, experimental results where the robot only applies the nudge action in order to segment objects are presented. Limited results were presented in [Li and Kleeman, 2008]. Secondly, the results of the robot using both the nudge and grasp actions to model objects are presented. Some preliminary results were previously presented in [Li and Kleeman, 2009]. These experiments are presented as individual subsections within the current section.

Thirdly, we present an extensive set of experiments that test the limits of our autonomous system to examine its robustness and identify failure modes. Due to the length of the results, these new experiments are presented separately in the next section.

Results from the three sets of experiments are highlighted in Multimedia Extensions 1, 2 and 3 respectively.

 A. Object Segmentation

Twelve segmentation experiments were carried out on ten objects of different size, shape, texture and colour. Transparent, multi-coloured and partially symmetric objects are also included. Objects are set against different backgrounds, ranging from plain to cluttered. As shown in Extension 1, all segmentation results are obtained autonomously by our robot without any human aid. For safety reasons, a warning beacon flashes during robot motion, periodically casting red light on the table. The flashing beacon can be seen in Extension 1 starting at 00:26 during the object tracking video sequence.

The test objects were chosen to provide a variety of visual and physical challenges to the autonomous segmentation approach. The control is the textureless blue cup in Figure 14. The video frame taken after the nudge is shown in the left column. The autonomously obtained object segmentation is shown on the right. Note the accurate segmentations obtained regardless of whether the background is plain or cluttered. Due to the monocular nature of our segmentation approach, small parts of the object's shadow are also included in the results.

The white cup in Figure 15 poses a challenge to our segmentation process not because of its imperfect symmetry, but because of its shape. Due to its narrow stem-like bottom half, the nudge produces very small shifts in the object's location, creating a narrow and weak contour of pixels in the frame difference. As seen from the resulting segmentation, our algorithm is able to handle this kind of object. Figure 16 illustrates the robustness and accuracy of our segmentation process. The robot was able to autonomously obtain a very clean segmentation of a transparent cup against background clutter.

The mugs in Figures 17 and 18 test the robustness of segmentation for symmetric objects with asymmetric parts. The handles of both mugs are successfully included in the segmentation results. The multi-colour mug provides additional challenges as it has a similar grayscale intensity to the table cloth and is physically brittle. Our system was able to deal with both challenges.

The less accurate segmentation of the multi-colour mug is due to a mechanical error in the PUMA 260 wrist joint, which caused the end effector to dip towards the table during the nudge. This resulted in movement of the table cloth and a less accurate segmentation result. Considering the difficulty of segmenting a multi-colour object without a priori assumptions such as colour and shape models, coupled with unexpected movement in the background, our symmetry-based approach performed surprisingly well.

The remaining test objects are bottles of various sizes, appearance and mass. The bottle in Figure 19 is completely filled with water, which allows us to test the strength and accuracy of the robotic nudge. Due to its small size and weight, the nudge must be accurate and firm to produce enough object motion for segmentation. The segmentation result suggests that the nudge can actuate small and dense objects.

The test objects also include empty plastic drink bottles, which are lightweight and easy to tip over. During the robotic nudge, their symmetry lines tend to wobble, which provides noisy measurements to the symmetry trackers. As such, these objects test the robustness of stereo tracking and the robotic nudge. Figure 20 shows a successful segmentation of two textured bottles against plain backgrounds. Figure 21 is a similar experiment repeated against background clutter.

Figure 22 contains two segmentation results for a transparent bottle. Note the accurate segmentation obtained for the transparent bottle, which has a very weak motion signature when nudged. The result is especially impressive considering that it is fairly difficult to obtain an accurate segmentation by hand.

Fig. 14. Blue Cup

Fig. 15. Partially Symmetric White Cup

Fig. 16. Transparent Cup in Clutter


Fig. 17. White Mug

Fig. 18. Multi-coloured Mug

Fig. 19. Small Water-Filled Bottle

Fig. 20. Textured Bottles

Fig. 21. Textured Bottle in Clutter

Fig. 22. Transparent Bottle


  B. Object Modelling

Autonomous modelling experiments were performed by the robot on the seven test objects displayed in Figure 23. The test objects are beverage bottles, including two transparent bottles and a glass bottle. Apart from the cola bottle, all the other bottles are empty. This raises the object's center of gravity and makes object interactions more difficult. As beverage bottles are surfaces of revolution, they are visually symmetric from multiple views and therefore easily segmented using the robotic nudge. The object modelling process, along with selected results, is presented in Extension 2.

(a) White (b) Yellow (c) Green (d) Brown

(e) Glass (f) Cola (g) Transparent

Fig. 23. Bottles used in object modelling and recognition experiments.

The robot was successful in collecting training images autonomously for all seven test objects. The object database and recognition system are tested using 28 images, four for each test object. Each quartet of images shows the test object at different orientations and set against varying amounts of object clutter.

The recognition system returned the correct object label for all test images. Statistics of the descriptor matches are shown in Table I. Good descriptor matches are tabulated under the columns labelled with a √. Bad matches are shown under the × columns. We define a good match as one where the descriptor in the input image appears at a similar location on the object in the matching training image. Bad matches are those with descriptors that belong to different parts of the object. Good and bad matches are judged by manual observation. Note that all results meet the minimum requirement of three correct matches needed for pose estimation.

A selected set of recognition results is shown at the end of the paper. The two-digit number in the caption corresponds to the image numbers in Table I. Each recognition result is a vertical concatenation of the input image and the matching training image. The object label returned by the recognition system is printed as green text at the bottom of the recognition results. Descriptor matches are shown as red lines.

TABLE I
OBJECT RECOGNITION RESULTS – SIFT DESCRIPTOR MATCHES

Bottle        | Image 00  | Image 01  | Image 02  | Image 03
              | √     ×   | √     ×   | √     ×   | √     ×
White         | 16    0   | 6     0   | 17    0   | 7     0
Yellow        | 14    0   | 11    0   | 24    0   | 4     0
Green         | 23    1   | 21    1   | 11    0   | 9     1
Brown         | 15    0   | 16    0   | 16    0   | 8     0
Glass         | 5     0   | 6     1   | 4     1   | 4     1
Cola          | 7     0   | 4     0   | 9     0   | 11    0
Transparent   | 6     0   | 7     1   | 11    0   | 6     0

A stereotypical recognition result is shown in Figure 36(a). The large number of SIFT matches centered around textured areas is a general trend, as seen in Figure 36(c). Our system can also recognize objects that have undergone pose changes, which is apparent in Figure 36(b). Notice the drop in the number of SIFT matches when the white bottle is turned upside down. This may be caused by surface curvature, perspective effects and changes in illumination. The results in Figures 36(d) and 37(a) show that our recognition system is able to deal with partial occlusion and background visual clutter. Note that the brown bottle is also used to test the limits of recognition in Section IX-E.

Figure 37(b) shows a small number of SIFT descriptor matches for the glass bottle. This can be attributed to the bottle's reflective label and its low texture surface. Note that a bad descriptor match can be seen between the bottle cap in the input image and the label in the training image returned by the system. Figures 37(c) and 37(d) show that the system can recognize semi-transparent objects but also highlight the fact that the object's label is being recognized, not the object itself. This can be seen in Figure 37(c) where an empty cola bottle matches with the half-filled bottle in the database. The reliance on object texture is an inherent limitation of SIFT features and will be further investigated in Section IX-E.

IX. EXPERIMENTS TO TEST LIMITS OF SYSTEM

The experiments presented in Section VIII were repeated in another laboratory to test the system's robustness to illumination changes and other factors. The robot arm was mounted on a different table and the stereo cameras were placed at a new location relative to the arm. Despite these changes, following the calibration steps detailed in Section II-B allowed the reconfiguration of a robust system. The newly configured robotic system is shown in Figure 24.

Fig. 24. Reconfigured robotic system. Note the different relative locations of the robot arm and cameras when compared against the old system in Figure 1

The experiments presented test some of the limits of our autonomous system to examine its robustness and identify failure modes. In each experimental subset, we aimed for simple scenarios in order to control the number of parameters that can vary. The experiments performed include:

• Autonomous nudge and grasp of a mug with a handle to test the system on a symmetric object with an asymmetric part
• Investigating the effects of background edge pixel noise
• Partial occlusion of the target object
• Object collisions during the robotic nudge
• Investigation of object recognition failure modes

Videos of the robot in action for the experiments above are available from Extension 3. The first four sets of experiments as listed above are presented in chronological order within the video. Note that some object tracking videos have been slowed down from 25 FPS to 10 FPS for ease of viewing. As object recognition experiments are performed using passive vision, no videos are provided for them.

 A. Symmetric objects with asymmetric parts

The robotic system was asked to learn a set of bottles using a nudge then grasp approach in Section VIII-B. To see whether the grasping approach generalized to symmetric objects with asymmetric parts, a white mug with a handle was used to test the system. The robot was successful in nudging and subsequently grasping the white mug. Figure 25 shows the segmentation returned by the robot.

(a) Right camera image (b) Segmentation results

Fig. 25. Segmentation from autonomous nudge and grasp of white mug with handle. Non-object pixels are coloured green in the segmentation result

 B. Background edge pixel noise

Our robot's reliance on bilateral symmetry is also its Achilles heel, as multiple stages of visual processing make use of our fast symmetry detector. The experiments presented here attempt to disrupt the symmetry detection results by introducing noisy edge pixels using a textured table cloth and a newspaper. Recall from Figure 3 that our robotic system is also designed to err on the side of caution and abort learning attempts if anything goes wrong. As such, the experiments also implicitly test the robustness of the system's design.

We begin by saturating the camera image with edges using a highly textured table cloth as seen in Figure 26. Notice the large number of non-object edge pixels, which drowns out the object's symmetric edges. This results in no interesting locations being found by the robot as the triangulated symmetry axes do not intersect the table in a perpendicular manner. The robot correctly chose not to attempt a nudge. It may be possible to find the object by raising the number of symmetry lines detected. However, as all possible pairings of symmetry lines from the left and right cameras must be triangulated to find interesting locations, we chose to limit the number of symmetry lines detected in each camera image to three to avoid a combinatorial explosion in computational complexity.

(a) Right camera image (b) Fast symmetry results

Fig. 26. Fast symmetry detection failure due to large quantities of non-object edge pixels. The top three symmetry lines are shown as green lines with edge pixels overlaid in magenta

Next, the number of noisy edge pixels is reduced by turning

the table cloth over. The symmetry detection results are shown

in Figure 27.

(a) Fast symmetry results (b) Segmentation result

Fig. 27. Successful fast symmetry detection and segmentation via robotic nudge despite the presence of background edge pixel noise

Note that the robot is able to detect the object’s symmetry

line. This resulted in successful stereo triangulation followed

by successful nudge and grasp actions. The same experiment

was also successful on the white mug from Figure 25.

Finally, in order to have finer control over the location of 

background edge pixels, a folded newspaper was used as a

noise source. By moving the location of the newspaper, we

were able to produce an experiment where the robot was able

to nudge the object but correctly aborted the learning attempt

before segmentation due to failed fast symmetry tracking.

Symmetry detection results before the nudge are shown in

Figure 28 below. Note that the position of the newspaper

had to be manually fine-tuned via guess-and-check in order

to generate this failure mode. In the vast majority of cases,

the newspaper had no effect on the system. A video of the


robot successfully performing the nudge and grasp actions

autonomously is also provided as reference.

(a) Left camera (b) Right camera

Fig. 28. Successful fast symmetry detection before the robotic nudge. Note that symmetry tracking fails during the robotic nudge despite successful object triangulation. The top three fast symmetry lines are shown in green. Edge pixels are shown in red over the grayscale input image

C. Partial occlusion of target object 

These experiments focused on the effects of partial oc-

clusion on symmetry tracking, the success of which is a prerequisite for proceeding to segmentation and subsequent object

learning steps. Four experiments were performed using the

same white mug and pink cup from previous tests for the

sake of consistency. An asymmetric object was used to provide

the occlusion as it is invisible to our symmetry-based vision

system. A symmetric occluding object will be nudged by

the robot first as the system always attempts to actuate the

object nearest the camera. The pre-nudge and post-nudge right

camera images for all four experiments are shown in Figure 29.

By fine-tuning the location of the occluding object in

experiment 2, we achieved a failure mode where the target was

detected but the robotic nudge increased the level of occlusion

too much, causing symmetry tracking to diverge. All other

experiments produced segmentations via the robotic nudge

and the robot was able to grasp the objects autonomously.

Segmentation results are shown in Figure 30. As expected,

the occlusions introduced several artefacts in the segmentation

results. However, despite the degradation in segmentation

quality, the robot was able to autonomously grasp the object

in occlusion experiments 1, 3 and 4 in Figure 29.

In occlusion experiment 1, there are two artefacts present

in the object segmentation. Firstly, a collision between the L-

shaped foam nudger and the mug’s handle during the gripper’s

descent when nudging the object caused a large rotation in

the object pose. This resulted in the O-shaped segmentation artefact on the right of the mug. Note that symmetry tracking

(a) Occlusion 1 (b) Occlusion 3 (c) Occlusion 4

Fig. 30. Segmentation results for occlusion experiments 1, 3 and 4 from Figure 29. Background pixels are coloured green. Note that occlusion experiment 2 did not produce a segmentation as tracking failed during the

robotic nudge

converged despite the unintended collision. In addition, scene

illumination changes caused by reflections and shadows from

the robot arm also resulted in parts of the occluding object

being included in the segmentation results. In experiments

3 and 4, the segmentation results also included parts of the

occluding object and background due to lighting changes.

However, these artefacts did not affect the subsequent grasping

step.

Increasing the amount of occlusion before the nudge results

in no object being detected and no robotic nudge. Overall,

we found the fast symmetry detector to be robust to partial

occlusions especially when the occluding object is shorter in

height than the tracking target. The robot was also able to

abort learning attempts before completion when symmetry tracking failed.

This means that the robot’s object knowledge, in the form of 

segmentations and SIFT features, will not be corrupted by

failed symmetry tracking in occlusion experiment 2.

 D. Object collisions during nudge

In the experiments presented previously in Section VIII-A,

the robotic nudge was successful in segmenting the test

objects. However, what happens when something goes wrong

during the nudge? Here we present three experiments where

the robotic nudge causes various kinds of unexpected events.

In the first experiment, a nudged cup collides with a tennis

ball which rolls for a short period of time after the nudge. In

the second experiment, the cup collides with another cup. In

the third experiment, an upside-down bottle is tipped over by

the nudge. The robot-eye-view of each experiment before and

after the nudge is presented in Figure 31.

Fig. 31. Collision experiments designed to cause unexpected events during the robotic nudge (from right). The before and after nudge images (right camera) are shown in the top and bottom rows respectively. Outcomes: Collision 1 – success; Collision 2 – success; Collision 3 – tracking fails.

The segmentation results for experiments 1 and 2 are shown

in Figure 32. Note that, as expected, the movement of the object being hit by the pink cup resulted in segmentation artefacts. However, as the height of the object is determined along its symmetry line, autonomous grasping was not adversely affected. Tracking failed to converge in experiment 3, so the robot correctly aborted the learning attempt before the segmentation step.


Fig. 29. Partial occlusion experiments. The right camera images before and after the robotic nudge are shown in the top and bottom rows respectively. Each of the four experiments is given its own column. Outcomes: Occlusion 1 – success; Occlusion 2 – tracking fails; Occlusion 3 – success; Occlusion 4 – success. Note that the robot was successful at performing the entire autonomous learning process, from nudge to grasp, apart from experiment 2. In experiment 2, symmetry tracking failed during the robotic nudge, thereby correctly aborting the learning attempt before segmentation was performed

(a) Collision 1 (b) Collision 2

Fig. 32. Segmentation results from collision experiments 1 and 2. Note that despite the noisy segmentation results, autonomous grasping was performed successfully following the nudge

 E. Object recognition failure modes

In Section VIII-B, the robot autonomously learned SIFT

models for each of the seven bottles in Figure 23 in order

to build an object database. Here, we investigate the failure

modes of SIFT recognition using the same object recognition

database. Note that the recognition results presented here are

obtained using passive vision without any robotic action.

Firstly, to see if the new lighting conditions affected recog-

nition on the learned objects, we retested the recognition sys-

tem on several objects in Figure 23. The robot was successful

in recognizing the learned objects as expected given SIFT’s

inherent robustness to illumination changes.

Secondly, we showed images of unmodelled objects to

the recognition system to see if any false positives would

be returned. The system was tested against the unmodelled

objects in Figure 33. Object recognition did not return any

false positive object matches.

Thirdly, we attempted to cause a false positive by presenting

a new brown bottle that is nearly identical to the one already

modelled. Both bottles can be seen in Figure 34. Notice the

similarity in features as the new bottle is actually the same

Fig. 33. Previously unmodelled objects used to test recognition system. No false positives were returned by our system

drink with updated branding. This scenario is one that can be

encountered by a robot operating in domestic environments.

Fig. 34. New brown bottle (left) versus brown bottle already modelled in the robot's object recognition database (right)

The new imposter brown bottle is able to cause false positives when placed at certain orientations, especially when the Chinese text and the English lettering are visible. Surprisingly, the number of SIFT matches is significantly smaller with the new bottle, as can be seen in Figure 35. This suggests that a

higher threshold on the minimum required number of SIFT

matches can reject this false positive but will raise the risk 

of missed recognition for objects with few distinctive SIFT

features.
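As a rough illustration of this trade-off (a sketch only, not the recognition system described in this paper), the snippet below uses OpenCV's SIFT with a brute-force matcher and Lowe's ratio test, and accepts a recognition only when the number of good matches reaches a minimum; MIN_MATCHES and RATIO are hypothetical parameters.

```python
import cv2

MIN_MATCHES = 20   # hypothetical threshold; raising it rejects near-duplicates
                   # such as the imposter bottle, but risks missing low-texture objects
RATIO = 0.7        # Lowe's ratio-test threshold


def count_good_matches(query_img, model_img):
    """Count ratio-test SIFT matches between a query image and a model view."""
    sift = cv2.SIFT_create()
    _, des_q = sift.detectAndCompute(query_img, None)
    _, des_m = sift.detectAndCompute(model_img, None)
    if des_q is None or des_m is None:
        return 0
    pairs = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des_q, des_m, k=2)
    return sum(1 for p in pairs
               if len(p) == 2 and p[0].distance < RATIO * p[1].distance)


def recognise(query_img, database):
    """Return the best-matching model name, or None if below MIN_MATCHES."""
    best_name, best_count = None, 0
    for name, model_img in database.items():
        count = count_good_matches(query_img, model_img)
        if count > best_count:
            best_name, best_count = name, count
    return best_name if best_count >= MIN_MATCHES else None
```

With this kind of thresholding, a higher MIN_MATCHES would reject the imposter bottle's reduced match count at the cost of missing sparsely textured objects such as the white mug.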


(a) Old modelled bottle (b) New bottle

Fig. 35. Example of false positive caused by the new imposter brown bottle. Note the reduced number of SIFT matches as compared to the old bottle already modelled in the recognition database

F. Discussion of limitations

The experiments in this section highlight the strengths and

weaknesses of our system. As the experiments were conducted

in a different laboratory, they suggest that the proposed design

is robust to illumination changes as well as changes in arm-

camera geometry and camera-table viewpoint. Table II lists

the chance of different failure modes as experienced during

the robotic experiments. A horizontal dash indicates that

tracking is not performed as no symmetric objects are detected.

Note that the object recognition experimental results are not

included in the table as they do not make use of the whole

system.

TABLE II
CHANCE OF SYSTEM FAILURE ACCORDING TO EXPERIMENTAL RESULTS

                                        Failure mode
Experiment                     | No object detected | Tracking diverges
Asymmetric parts               | Rare               | Rare
High background texture        | Common             | -
Some background texture        | Rare               | Rare
Textured background distractor | Rare               | Sometimes
Minor occlusion                | Rare               | Rare
Major occlusion pre-nudge      | Common             | -
Major occlusion post-nudge     | Rare               | Common
Collision                      | Rare               | Rare
Object tipping over            | Rare               | Common

Here we define failure mode as the manner in which our robotic system aborts object segmentation and modelling,

which does not imply complete failure of the system. As can

be seen in Table II, experiments revealed that our system has

two main modes of failure in the learning process described in

Figure 3. Firstly, object detection can fail due to overwhelming

background edge noise or the lack of object edge pixels caused

by occlusion. This results in the system stopping the object

learning process before any robotic action. Secondly, given

that an object is detected by the robot, fast symmetry tracking

can diverge during the robotic nudge. Tracking failure can be

caused by occlusion of the target object, the nudged object

being tipped over, or the presence of background symmetry lines along the moving object's trajectory. Again, the robot

will err on the side of caution by stopping the learning process

and abandoning the motion segmentation attempt. Overall,

our action-based learning approach appears to be robust to

unexpected events.
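The control flow implied by these two abort points can be summarized by the following sketch; the helper functions and data types are hypothetical placeholders, not the actual pipeline of Figure 3.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of the two abort points described above; the detection
# and tracking helpers are placeholders, not the system's implementation.

@dataclass
class TrackResult:
    converged: bool
    segmentation: Optional[object] = None


def detect_symmetric_object(scene) -> Optional[object]:
    """Placeholder: returns None when edge noise or occlusion hides the object."""
    return scene.get("object")


def nudge_and_track(obj) -> TrackResult:
    """Placeholder: symmetry tracking during the nudge may diverge."""
    tipped = getattr(obj, "tips_over", False)
    return TrackResult(converged=not tipped, segmentation=None if tipped else "mask")


def learn_object(scene) -> Optional[object]:
    obj = detect_symmetric_object(scene)
    if obj is None:
        return None            # abort before any robotic action
    track = nudge_and_track(obj)
    if not track.converged:
        return None            # abort before segmentation, so the object
                               # database is never corrupted
    return track.segmentation  # proceed to modelling and grasping
```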

The first two collision experiments did not interrupt the

learning process but introduced artefacts in the segmentation

results. These artefacts did not affect subsequent grasping but

one can imagine scenarios with complicated object clutter

that may result in a failed grasp or further collisions between

the gripper and non-target objects. Unintentionally, the robot

gripper also collided with the white mug in the occlusion


experiment 1. This resulted in a large rotation of the object

and a segmentation artefact. The system was able to track the

nudged object and complete the learning process despite this

problem. However, due to the white mug’s lack of texture, the

SIFT features extracted during the learning process may not

be of great use for object recognition.

SIFT object recognition was robust and some effort was

needed to generate the failure modes. A near-identical bottle was needed to cause false positives. The use of robotic action,

such as grasping and rotating an object to access more views,

may help disambiguate highly similar objects and prevent

false positives. Missed detection required significant removal

of features. Further investigation of failure modes as well as

the synergistic use of other features such as colour and edge

contours for object modelling are interesting directions for

future work.

X. CONCLUSIONS AND FUTURE WORK

Our interactive segmentation approach performs robustly

and accurately on near-symmetric objects in visually cluttered environments. By using the robotic nudge, the entire segmen-

tation process is carried out autonomously. Multi-coloured and

transparent objects, as well as objects with asymmetric parts,

are handled in a robust manner. We have shown that our

approach can segment objects of various visual appearances,

which will help shift the burden of training image collection

from the user to the robot.

End effector obstacle avoidance and path planning, espe-

cially in situations where non-symmetric objects are present in

the robotic nudge path, are left to future work. Object detection

is a prerequisite for path planning and path planning is needed

to actuate objects in cluttered scenes. Careful application of the robotic nudge may help resolve this chicken-and-egg problem.

Improvements can also be made to the motion segmentation

approach. Section IX showed that the reliance on edges for

symmetry detection is a limitation of our system. The use

of orthogonal visual modalities such as colour and intensity

gradients may be synergistic with our segmentation approach. Stereo optical flow and graph cuts may improve the quality of segmentation but would sacrifice the system's ability to operate on

transparent and reflective objects as these objects lack reliable

surface information. As the geometry of our table plane is

known, dense stereo can provide further improvements by

removing the nudged object’s shadow from the segmentation

results.

After carrying out object segmentation autonomously, our

robot continues to learn about new objects through physical

interaction. Our robot is able to leverage a simple nudge action

to pick up and rotate new objects. Experiments show that our

robot is able to pick up beverage bottles autonomously, including

a transparent plastic bottle and a fragile glass bottle. This

raises the question of whether other objects can be modelled

using our nudge-then-grasp approach. Household objects such

as vases, mugs, tin cans, jugs, pots, baskets and buckets are

sufficiently symmetric for our vision algorithms. Whether the

power-grasp employed for bottles will generalize to these

objects is an interesting prospect for future work. For

example, cutlery can be sensed using visual symmetry [Ylä-Jääski

and Ade, 1996], but requires different manipulation strategies

to actuate. The use of other visual features that represent object

structure, such as radial symmetry or rectilinear box models,

may also lend themselves to a nudge-then-grasp approach.

The concept of moving from simple to more advanced

object manipulations allows a robot to autonomously escalate

object interactions. Our robot was able to construct a small database of objects that is sufficient for robust recognition without any human intervention. Experimental results suggest

that robot-collected training data is of sufficient quality to

build useful object models. The inclusion of other features,

such as colour histograms and object contours, may increase

the discriminatory power of the learned object models as

well as compensate for SIFT’s reliance on surface texture.

Online estimation of the fundamental matrix between views

of the grasped object and structure from motion may enable

the construction of 3D object models. Robot-learned object

models may also enable more intelligent regrasping of learned

objects.
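A minimal sketch of the fundamental-matrix idea mentioned above (future work, not part of the present system), assuming matched keypoint coordinates between two views of the grasped object are already available from SIFT matching; OpenCV's findFundamentalMat with RANSAC estimates F and flags inlier correspondences.

```python
import cv2
import numpy as np


def fundamental_from_matches(pts_view1, pts_view2):
    """Estimate the fundamental matrix between two views of a grasped object.

    pts_view1, pts_view2: (N, 2) arrays of matched pixel coordinates, N >= 8.
    Returns (F, inlier_mask); (None, None) if there are too few matches.
    """
    pts1 = np.asarray(pts_view1, dtype=np.float32)
    pts2 = np.asarray(pts_view2, dtype=np.float32)
    if len(pts1) < 8:  # eight-point minimum for F estimation
        return None, None
    # RANSAC rejects outlier correspondences, e.g. from repetitive label text
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
    return F, mask
```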

Our robot is an addition to a sparse field of systems [Fitz-

patrick, 2003], [Ude et al., 2008], [Kenney et al., 2009] that

actuate objects in front of a static camera instead of actuating

a camera around static objects. The proposed approach takes

a small but important step towards greater robot autonomy by

shifting the labour intensive tasks of training data collection

and object learning from the human user to the tireless robot.

Given the ever-increasing ratio of retirees to workers in

many industrial nations [Christensen, 2008] and positive public

opinion towards domestic robots [Ray et al., 2008], the case for

autonomous object learning via robotic interaction has never

been stronger.

APPENDIX: INDEX TO MULTIMEDIA EXTENSIONS

TABLE III
INDEX TO MULTIMEDIA EXTENSIONS

Extension | Media Type | Description
1         | Video      | Object segmentation by robotic nudge
2         | Video      | Object modelling using SIFT
3         | Video      | Experiments to test system limits

ACKNOWLEDGEMENTS

Thanks go to Steve Armstrong for his help with repair-

ing the PUMA 260 manipulator. We gratefully acknowledge

Monash IRRC for their financial support. The authors also

thank the anonymous reviewers for their insightful comments

and the suggestion to further experiment with the limits of our

robotic system. The work presented in this paper was funded

by the ARC Centre for Perceptive and Intelligent Machines in

Complex Environments (CE0561489).

REFERENCES

[Arun et al., 1987] Arun, K. S., Huang, T. S., and Blostein, S. D. (1987). Least-squares fitting of two 3-d point sets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9:698–700.


[Bouguet, 2006] Bouguet, J.-Y. (2006). Camera calibration toolbox for Matlab. Online. URL: http://www.vision.caltech.edu/bouguetj/calibdoc/.
[Boykov and Jolly, 2001] Boykov, Y. Y. and Jolly, M.-P. (2001). Interactive graph cuts for optimal boundary & region segmentation of objects in n-d images. In International Conference on Computer Vision (ICCV), volume 1, pages 105–112, Vancouver, Canada.
[Chen and Chen, 2004] Chen, J. and Chen, C. (2004). Object recognition based on image sequences by using inter-feature-line consistencies. Pattern Recognition, 37:1913–1923.
[Christensen, 2008] Christensen, H. I. (2008). Robotics as an enabler for aging in place. In Robot Services in Aging Society IROS 2008 Workshop, Nice, France.
[Elgammal et al., 2000] Elgammal, A., Harwood, D., and Davis, L. (2000). Non-parametric model for background subtraction. In European Conference on Computer Vision, Dublin, Ireland.
[Fei-Fei et al., 2006] Fei-Fei, L., Fergus, R., and Perona, P. (2006). One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611.
[Fitzpatrick, 2003] Fitzpatrick, P. (2003). First contact: an active vision approach to segmentation. In Proceedings of Intelligent Robots and Systems (IROS), volume 3, pages 2161–2166, Las Vegas, Nevada. IEEE.
[Fitzpatrick and Metta, 2003] Fitzpatrick, P. and Metta, G. (2003). Grounding vision through experimental manipulation. In Philosophical Transactions of the Royal Society: Mathematical, Physical, and Engineering Sciences, pages 2165–2185.
[Freund and Schapire, 1997] Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139.
[Heyer et al., 1999] Heyer, L. J., Kruglyak, S., and Yooseph, S. (1999). Exploring expression data: Identification and analysis of coexpressed genes. Genome Research, 9:1106–1115.
[Kenney et al., 2009] Kenney, J., Buckley, T., and Brock, O. (2009). Interactive segmentation for manipulation in unstructured environments. In Proceedings of the IEEE International Conference on Robotics and Automation, Kobe, Japan.
[Kim et al., 2006] Kim, H., Murphy-Chutorian, E., and Triesch, J. (2006). Semi-autonomous learning of objects. In Conference on Computer Vision and Pattern Recognition Workshop, 2006. CVPRW '06., pages 145–145.
[Li and Kleeman, 2006a] Li, W. H. and Kleeman, L. (2006a). Fast stereo triangulation using symmetry. In Australasian Conference on Robotics and Automation, Auckland, New Zealand. Online. URL: http://www.araa.asn.au/acra/acra2006/.
[Li and Kleeman, 2006b] Li, W. H. and Kleeman, L. (2006b). Real time object tracking using reflectional symmetry and motion. In IEEE/RSJ Conference on Intelligent Robots and Systems, pages 2798–2803, Beijing, China.
[Li and Kleeman, 2008] Li, W. H. and Kleeman, L. (2008). Autonomous segmentation of near-symmetric objects through vision and robotic nudging. In International Conference on Intelligent Robots and Systems, pages 3604–3609, Nice, France.
[Li and Kleeman, 2009] Li, W. H. and Kleeman, L. (2009). Interactive learning of visually symmetric objects. In International Conference on Intelligent Robots and Systems, St Louis, Missouri, USA.
[Li et al., 2008] Li, W. H., Zhang, A. M., and Kleeman, L. (2008). Bilateral symmetry detection for real-time robotics applications. International Journal of Robotics Research, 27(7):785–814.
[Lowe, 2004] Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110.
[Markus et al., 2008] Markus, U., Thomas, P., Werner, T., Cremers, D., and Horst, B. (2008). TVSeg - interactive total variation based image segmentation. In British Machine Vision Conference (BMVC), Leeds.
[Moreels and Perona, 2005] Moreels, P. and Perona, P. (2005). Evaluation of features detectors and descriptors based on 3d objects. In ICCV '05: Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, pages 800–807, Washington, DC, USA. IEEE Computer Society.
[Mutch and Lowe, 2006] Mutch, J. and Lowe, D. G. (2006). Multiclass object recognition with sparse, localized features. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 11–18. IEEE.
[Pal and Pal, 1993] Pal, N. R. and Pal, S. K. (1993). A review on image segmentation techniques. Pattern Recognition, 26(9):1277–1294.
[Ray et al., 2008] Ray, C., Mondada, F., and Siegwart, R. (2008). What do people expect from robots? In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3816–3821, Nice, France.
[Skarbek and Koschan, 1994] Skarbek, W. and Koschan, A. (1994). Colour image segmentation — a survey. Technical report, Institute for Technical Informatics, Technical University of Berlin.
[Taylor and Kleeman, 2002] Taylor, G. and Kleeman, L. (2002). Grasping unknown objects with a humanoid robot. In Proceedings of Australasian Conference on Robotics and Automation, Auckland.
[Tsikos and Bajcsy, 1988] Tsikos, C. J. and Bajcsy, R. K. (1988). Segmentation via manipulation. Technical Report MS-CIS-88-42, Department of Computer & Information Science, University of Pennsylvania.
[Ude et al., 2008] Ude, A., Omrcen, D., and Cheng, G. (2008). Making object learning and recognition an active process. International Journal of Humanoid Robotics, 5:267–286. Special Issue: Towards Cognitive Humanoid Robots.
[Viola and Jones, 2001] Viola, P. and Jones, M. J. (2001). Rapid object detection using a boosted cascade of simple features. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai Marriott, Hawaii, USA.
[Ylä-Jääski and Ade, 1996] Ylä-Jääski, A. and Ade, F. (1996). Grouping symmetrical structures for object segmentation and description. Computer Vision and Image Understanding, 63(3):399–417.


Fig. 36. Object recognition results: (a) White 00, (b) White 01, (c) Yellow 02, (d) Green 02


Fig. 37. Object recognition results: (a) Brown 03, (b) Glass 03, (c) Cola 03, (d) Semi-Transparent 03