The State of Service Robots: Current Bottlenecks in Object Perception and Manipulation

S. Hamidreza Kasaei, Jorik Melsen, Floris van Beers, Christiaan Steenkist, and Klemen Voncina
Department of Artificial Intelligence, University of Groningen, Groningen, The Netherlands

[email protected], {j.l.a.melsen, f.van.beers, C.N.Steenkist, k.voncina}@student.rug.nl

Abstract—Service robots are appearing more and more in our daily life. The development of service robots combines multiple fields of research, from object perception to object manipulation. The state-of-the-art continues to improve to make a proper coupling between object perception and manipulation. This coupling is necessary for service robots not only to perform various tasks in a reasonable amount of time but also to adapt to new environments over time and to interact with non-expert human users safely. Nowadays, robots are able to recognize various objects and quickly plan a collision-free trajectory to grasp a target object. Despite these successes, the robot must be painstakingly coded in advance to perform a set of predefined tasks. Besides, in most cases, there is a reliance on large amounts of training data. Therefore, the knowledge of such robots is fixed after the training phase, and any changes in the environment require complicated, time-consuming, and expensive robot re-programming by human experts. These approaches are consequently still too rigid for real-life applications in unstructured environments, where a significant portion of the environment is unknown and cannot be directly sensed or controlled. In this paper, we review advances in service robots from object perception to complex object manipulation and shed light on the current challenges and bottlenecks.

I. INTRODUCTION

The development of service robots, used for any domestic or service task, is ongoing and interest is growing. On the one hand, there is an increase in supply, with ever more efficient and widely applicable robots. On the other hand, there is an increase in demand, due to interest from service industries in automation as well as an aging population [1], which poses a significant challenge for many countries. According to a recent study [1], the share of Europeans aged 80 or over is set to rise from 4.9% in 2016 to 13% in 2070. The old-age dependency ratio of the European population (i.e., people aged 65 or above relative to those aged 15-64) was 29.6% in 2016 and is projected to reach 51.2% by 2070, meaning that for every person of retirement age there will be fewer than two people of working age. This significant demographic change poses several challenges since the population of caregivers is shrinking, while the number of people needing care is growing. This imbalance between demand and supply leads to a big gap in the workforce and, therefore, calls for developing service robots to help us overcome this ever-increasing problem.

There is a wide diversity of tasks that a service robot can be used for, such as setting a table for a meal, clearing a table after eating a meal, serving a drink, helping people carry groceries [2], etc. Most of these household tasks can be decomposed into detecting an object, driving the robot's arm to a desired pose, and manipulating an object. In other words, a robot needs to

[Fig. 1 diagram: a service robot decomposed into Object Perception, Perceptual Learning, Interaction, Grasping, Manipulation, and a Memory System, grouped as (i) object perception and perceptual learning, (ii) object grasping and manipulation, and (iii) memory management: salience and forgetting.]

Fig. 1. Categorization of the sub-tasks of a service robot into three core components: (i) object perception and perceptual learning, (ii) object grasping and manipulation, and (iii) memory management.

know which kinds of objects exist in a scene, where they are, and how to grasp and manipulate objects in different situations in order to operate in human-centric domains. These tasks are highly complex and consist of several sub-tasks that need to be performed sequentially or simultaneously to accomplish a certain goal. As shown in Fig. 1, we have categorized these sub-tasks into three core components:

• Object perception and perceptual learning: A service robot may sense the world through different modalities. The perception system provides important information that the robot has to use for interacting with users and environments. For instance, to interact with users and the environment, a robot needs to know which kinds of objects exist in a scene and where they are. Besides, learning mechanisms allow incremental and open-ended learning. A service robot must also update its models over time with limited computational resources.

• Object grasping and manipulation: A service robot must be able to grasp and manipulate objects in different situations to interact with the environment as well as human users. It is worth mentioning that object manipulation mainly happens in a completely reactive manner, with the agent selecting one or more primitive actions on each decision cycle, executing them, and repeating the process on the next cycle. This approach is associated with closed-loop strategies for execution, since the agent


can also sense the environment on each time step.

• Memory management: A service robot, working in an open-ended domain, should involve experience management mechanisms such as salience and forgetting to prevent the accumulation of examples in memory; otherwise, the memory consumption and the time required to both update the models and recognize new objects would increase exponentially. A minimal sketch of such a mechanism is given below.
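To make this concrete, the following minimal Python sketch illustrates one way salience and forgetting can bound memory consumption. All class names, the decay factor and the capacity are illustrative assumptions, not taken from any of the systems cited in this paper.

from dataclasses import dataclass

@dataclass
class Experience:
    features: list         # object representation, e.g., a feature histogram
    label: str             # category label taught by the user
    salience: float = 1.0  # usefulness estimate, refreshed whenever the experience is used

class BoundedMemory:
    """Keeps at most `capacity` experiences; the least salient one is forgotten first."""
    def __init__(self, capacity=500, decay=0.99):
        self.capacity, self.decay, self.items = capacity, decay, []

    def store(self, exp):
        for e in self.items:
            e.salience *= self.decay   # forgetting: salience decays over time
        if len(self.items) >= self.capacity:
            # evict the least salient experience to make room for the new one
            self.items.remove(min(self.items, key=lambda e: e.salience))
        self.items.append(exp)

    def reinforce(self, exp):
        exp.salience += 1.0            # salience: useful experiences are kept longer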

[Fig. 2 diagram: a Memory component (semantic and perceptual memory) connected to Learning (semantic and perceptual learning), Reasoning (interpretation and planning), and the reactive Perception and Action components.]

Fig. 2. Abstract architecture for hybrid reactive-deliberative robots with a single memory system [3].

These core components are tightly coupled together using a software architecture. This coupling is necessary for service robots, not only to perform object perception and manipulation tasks in a reasonable amount of time, but also to robustly adapt to new environments by handling new objects. A service robot should process very different types of information at varying time scales. Two different modes of processing, generally labelled as System 1 and System 2, are commonly accepted theories in cognitive psychology [4], and are mainly used as a software architecture of a service robot. The operations of System 1 (i.e., perception and action) are typically fast, automatic, reactive and intuitive. The operations of System 2 (i.e., semantic) are slow, deliberative and analytic. In this review paper, we mainly focus on object perception and manipulation tasks in service robots with the distinctive characteristics of System 1. The abstract system architecture is depicted in Fig. 2. In this architecture, a Perception component processes all momentary information coming from sensors, including sensors that capture the actions and utterances of the user. A Reasoning component updates the world model and determines plans to achieve goals. An Action component reactively dispatches and monitors the execution of actions, taking into account the current plans and goals. Finally, a Learning component, which typically runs in the background, analyzes the trace of foreground activities recorded in a Memory component and extracts and conceptualizes possibly interesting experiences. The resulting conceptualizations are stored back in memory. It is worthwhile to mention that each component in such an abstract architecture is usually decomposed into a set of software modules.
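The decision cycle of such an architecture can be sketched as follows. This is a schematic, runnable Python illustration with stubbed components and made-up method names; it is not the implementation of [3] or of any specific robot.

class Perception:
    def process(self, raw):
        return {"objects": raw}                  # momentary interpretation of sensor data

class Memory:
    def __init__(self):
        self.trace = []                          # trace of foreground activities
    def record(self, item):
        self.trace.append(item)
    def recent_trace(self, n=10):
        return self.trace[-n:]

class Reasoning:
    def plan(self, world_model, goal):
        return ["approach", "grasp", "deliver"]  # toy plan toward the goal

class Action:
    def dispatch(self, step):
        print("executing:", step)                # reactively execute and monitor one action

class Learning:
    def conceptualize(self, trace):
        pass                                     # background: extract and store useful concepts

def decision_cycle(raw_sensor_data, goal="serve a drink"):
    perception, memory = Perception(), Memory()
    reasoning, action, learning = Reasoning(), Action(), Learning()
    percept = perception.process(raw_sensor_data)   # System 1: fast perception on every cycle
    memory.record(percept)
    plan = reasoning.plan(percept, goal)             # System 2: slower deliberation and planning
    for step in plan:
        action.dispatch(step)                        # reactive dispatch of primitive actions
        memory.record(step)
    learning.conceptualize(memory.recent_trace())    # learning runs over the recorded trace

decision_cycle(["mug", "plate"])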

This review of the current state-of-the-art in service robots will take a closer look at each piece of the pipeline of a service robot performing higher-order tasks. Each core component will be reviewed based on recent works, describing the current state-of-the-art and identifying possible areas where improvements can still be made. The remainder of this paper is organized as follows. In Section II, we review a set of state-of-the-art assistive and service robots. Afterward, in Sections III, IV, V

and VI, the state-of-the-art research on each core component is described, with both positive developments and unsolved issues. Finally, in Section VII, conclusions are presented and future work is discussed.

II. ASSISTIVE AND SERVICE ROBOTS

While a lot of developments have been made recently in each of the mentioned core components and in the field of service robotics as a whole, robotic servants do not yet live among us, helping us in our daily tasks. We believe that the underlying reason is that robots are usually painstakingly coded and trained extensively in advance to perform object perception and manipulation tasks in the right way. Therefore, the knowledge of such robots is fixed after the training phase, and any changes in the environment require complicated, time-consuming, and expensive robot re-programming by expert users. Although an exhaustive survey of assistive robotics is beyond the scope of this paper, representative works will be reviewed in this section.

Over the past decade, several projects have been conducted to develop robots to assist people in daily tasks. Most state-of-the-art service robots use classical object category learning and recognition approaches (i.e., offline training and online testing are two separate phases), where open-ended object category learning is generally ignored. Therefore, they work well for specific tasks, where there are limited and predictable sets of objects, and fail at any other assignment. In other words, the perceptual knowledge of these robots is static and they are unable to adapt to dynamic environments. Examples of such service robots that have demonstrated perception and action coupling include ARMEN [5], El-E [6], Busboy [7], TUM Rosie [8], TORO [9], Walk-Man [10], ARMAR [11], and DLR's Rollin' Justin [12].

In the ARMEN project, Leroux et al. [5] proposed a mobile assistive robotics approach providing advanced functions to help care for elderly or disabled people at home. This project mainly involves object manipulation, knowledge representation and object recognition. The authors also developed an interface to facilitate the communication between the user and the robot. Jain et al. [6] presented an assistive mobile manipulator, named El-E, that can autonomously pick objects from a flat surface and deliver them to the user. The user should indicate the object to be grasped by pointing at it with a laser pointer.

In another work, a busboy assistive robot was developed by Srinivasa et al. [7]. In particular, they propose a multi-robot assistive system, consisting of a Segway mobile robot with a tray and a stationary Barrett WAM robotic arm. The Segway robot navigates through the environment and collects empty mugs from people. Then, it delivers the mugs to a predefined position near the Barrett arm. Afterwards, the arm detects and manipulates the mugs from the tray and loads them into a dishwasher rack. The vision system of Busboy is designed for detecting a single object type (mugs). Furthermore, because there is a single object type (i.e., the mug), they computed the set of grasp points off-line.

The work on two robots cooperating to solve a pancake-making task [8] shows the feasibility of having one or multiple


Fig. 3. An illustrative example of a "make a salad" task with a PR2 robot: In this scenario, the robot faces a new object while making a salad. A user then teaches the "Plate" category to the robot and demonstrates a feasible grasp for the object. This way, the robot conceptualizes new concepts and adapts its perceptual motor skills over time to different tasks.

robots perform a task they weren't specifically programmed for. In this experiment, it is shown that by giving a robot information on objects and low-level tasks, used in conjunction with a task planning module, these robots can make their own order of operations from a set of instructions downloaded from the internet. The instructions are vague and directed towards humans with some prior knowledge of the objects and operations involved. When an instruction mentions preheating the pancake maker, for example, this could mean turning it on, or it could also require plugging in the power cord. A robot making its own order of operations will need to take all these differences into account. This opens up routes for further dynamic task resolution, giving a robot the agency to solve a wide variety of problems, in contrast with most robotic projects, which work to solve a small range of tasks which are specifically defined for that purpose.

Another interesting robot which has been in development for a while is TORO [9], a continuation of DLR's Rollin' Justin [12]. This robot was developed to explore the possibilities of torque-controlled joints in humanoid robots. While several forms of torque control are specified, the electrical drive units with torque measurements are argued to be most effective for service robots in a household setting. The torque feedback from the limbs while moving gives the robot an additional layer of safety when interacting with humans, as any movement can be adjusted not just by input from cameras, but also from the limbs themselves.

WALK-MAN [10] was developed to combine actuators and torque control, balancing passive and active adaptation to the environment. This robot was designed in response to a call for disaster relief robots and, to perform well in these hectic environments, focuses on being able to move effectively in very rough terrain. In this work, it is argued that, while quadrupedal or wheeled robots are often more stable, they are incapable of properly traversing significantly imbalanced terrain. To further diminish the effect of uneven terrain, the combination of active and passive adaptation was developed. This resulted in a very robust bipedal robot.

While multiple of these robots have been developed for work in a human environment, ARMAR [11] is specifically developed for collaboration with humans. The focus of this robot is on applicability in a human workplace and working side by side with humans. To achieve this, the reach and carrying capacity have been developed significantly. ARMAR has a reach of 1.3 m and can carry 10 kg in one arm, even at maximum reach. Most importantly, ARMAR has been developed with the ability to detect human behaviour in an attempt to recognize when a human is in need of help. This is done


through pose-tracking of the human, which can be combined with speech commands, giving the robot information on when and how to help. It can also estimate the task a human is doing and determine whether it is a task normally done by multiple people. If so, the robot can step in as the second person.

In order for robots to be useful day to day, they need to understand and make sense of the spaces where we live and work, and adapt to new environments by gaining more and more experience. In such dynamic domains, robots are expected to increasingly interact and collaborate closely with humans. This requires new forms of machine intelligence. Towards this goal, open-ended learning must be properly understood and addressed to better integrate robots into our society. In human cognition, learning is closely related to memory. Wood et al. [13] presented a thorough review and discussion of memory systems in animals as well as artificial agents, having in mind further developments in artificial intelligence and cognitive science. Cognitive science has also revealed that humans learn to recognize object categories and grasp affordances ceaselessly over time [14], [15]. This ability allows adapting to new environments by enhancing knowledge from the accumulation and conceptualization of new experiences. Inspired by this theory, service robots should approach 3D object recognition and manipulation from a long-term perspective and with emphasis on domain open-endedness. Moreover, we have to incorporate intermittent robot-teacher interaction, which is an outstanding feature of human learning [16]. This way, non-expert users will be able to correct unexpected actions of the robot and quickly guide the robot toward target behaviors. For example, consider the robotic "make a salad" task as depicted in Fig. 3. In this example, the robot faces a new object while making a salad. The robot then asks a user to teach the category of the object and demonstrate how to grasp it. Such situations provide opportunities to collect training instances from online experiences, and the robot can incrementally update its knowledge rather than retraining from scratch when a new task is introduced or a new category is added.

To achieve this, several cognitive robotics groups have started to explore how to learn incrementally from past experiences and human interactions to achieve adaptability. Besides, since service robots receive a continuous stream of data, several methods have been introduced that use open-ended learning for object perception. In [17], a system with similar goals is described. Faulhammer et al. [18] presented a perception system that allows a mobile robot to autonomously detect, model, and re-recognize objects in everyday environments. They only considered isolated-object scenarios. Of course, actual human living environments can be very cluttered, which is why Srinivasa et al. [19] introduced HERB. HERB is an autonomous mobile manipulator that can navigate through a dynamic and highly cluttered environment. Furthermore, it is able to search for and recognize objects in this clutter and is able to manipulate a large range of objects, including constrained objects like doors. The main drawback of HERB is that its error recovery is completely hand-coded, which does not interact sufficiently with other modules (e.g., the planning modules). As a result, HERB can become stuck in cycles

in which, after recovering from an error, it keeps performing the behaviour that led to the same error over and over again.

In the RACE project (Robustness by Autonomous Competence Enhancement) [20], [21], a PR2 robot demonstrated effective capabilities in a restaurant scenario, including the ability to serve a coffee, set a table for a meal and clear a table. The aim of RACE was to develop a cognitive system, embodied by a service robot, which enabled the robot to build a high-level understanding of the world by storing and exploiting appropriate memories of its experiences. The X company also launched a similar project recently, called the Everyday Robot project1. The main goal of this project is to develop a general-purpose learning robot that can safely operate in human environments, where things change every day, people show up unexpectedly, and obstacles appear out of nowhere.

III. OBJECT PERCEPTION AND PERCEPTUAL LEARNING

Object perception is one of the most researched subjects in the field of robotics and computer vision. It comprises a wide range of different tasks which, over the last decades, have been more and more focused on (deep) neural network approaches. This has largely been facilitated by the vast increase in available computational power that has occurred concurrently. The main goal of object perception is to detect any objects the robot needs to interact with in order to help human users. Furthermore, the output of object perception is required for motion planning, not only to specify the pose of a target object but also to localize all objects that can be considered as obstacles for the current task. An obstacle can be (i) a "fixed object" (e.g., a wall, a table, etc.), (ii) an object that is in a fixed position most of the time (e.g., a vase, a table-sign, etc.), or (iii) a dynamic object, which usually corresponds to a human's or the robot's body. Additionally, it is important that a robot knows about its work-space and the accessible regions within the work-space. In the following subsections, we discuss recent advances in object detection, open-ended object category learning and recognition, and alternate forms of perception for service robots.

A. Object detection

Object detection and pose estimation are crucial for robotics applications and have recently attracted the attention of the research community [22]. Many researchers have participated in public challenges such as the Amazon Picking Challenge2 to solve the detection and pose estimation of multiple objects in realistic scenarios. There are two different mechanisms that are often employed for real-time object detection purposes.

In the first group of approaches, an object detection approach is paired with an approach for object classification. These approaches initially detect objects and place a bounding box around them. Each detected object is then classified [23], [24]. Four examples of isolated table-top objects are shown in Fig. 4. In general, separating object detection and object

1https://x.company/projects/everyday-robots
2https://www.amazonrobotics.com


Fig. 4. Examples of object segmentation in isolated-object scenarios: detected object candidates are shown by different bounding boxes and colors.

recognition is a suitable strategy for service robots since it allows robots to detect never-seen-before objects [24], [25], [26], [27], [28], [29]. A drawback, however, is that only detected objects will be classified. In a household environment, a robot may frequently encounter a pile of objects, such as a clutter of toys in the living room, a messy dining table that needs tidying up, or multiple unused objects stacked in a box in the garage. This complicates the object detection process, because some objects can be occluded by and overlapping with other objects. There are two sets of approaches that can handle pile segmentation. The first set of approaches mainly uses clustering algorithms to segment objects using the curvature of the surface normals and/or colour information [30], [31], [32] (see Fig. 5). The other set is based on active segmentation. In other words, such approaches initially segment the scene to specify a set of object candidates and then manipulate/push object candidates one by one to isolate them from the pile of objects [33]. For example, Van Hoof et al. [34] presented a part-based probabilistic approach for interactive object segmentation. They tried to minimize human intervention in the sense that the robot learns from the effects of its actions, rather than from human-given labels (see Fig. 6). In another work, Gupta and Sukhatme [35] explored manipulation-aided perception and grasping in the context of sorting small objects on a tabletop. They presented a pipeline that combines perception and manipulation to accurately sort the bricks by color and size.
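The first family of approaches can be summarized by the following rough Python sketch: remove the support plane, then cluster the remaining 3D points into object candidates. For brevity it uses plain Euclidean DBSCAN instead of the normal-curvature and colour-based region growing of [30], [31], [32], and all thresholds are illustrative assumptions.

import numpy as np
from sklearn.cluster import DBSCAN

def segment_tabletop(points, plane_height=None, eps=0.02, min_points=50):
    # points: (N, 3) array of XYZ coordinates in the table frame (meters).
    if plane_height is None:
        plane_height = np.percentile(points[:, 2], 5)    # crude estimate of the table height
    above = points[points[:, 2] > plane_height + 0.005]  # discard support-plane points

    # Euclidean clustering: nearby points are grouped into object candidates.
    labels = DBSCAN(eps=eps, min_samples=min_points).fit_predict(above)
    return [above[labels == k] for k in set(labels) if k != -1]   # -1 marks noise points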

The second group of approaches for real-time object detection

Fig. 5. Four examples of object segmentation in pile scenarios: detected objects are shown by different colors.

is based on semantic segmentation, which classifies each pixel/point of a scene into a particular class [36] (i.e., end-to-end object detection). This group does not have the mentioned problem of the first group [37], [38], [39], [36], [40], [41]. Each segmented region corresponds to a salient part of an object, an entire object or a group of objects belonging to the same category. However, this group is not entirely satisfactory for service robots. That is why several attempts have been made to segment not only by category, but also by different instances within a category [42]. The benefit of such segmentation is that all points/pixels are classified, which can provide valuable information for motion/task planning purposes. An example of a semantic segmentation approach on point clouds is given by SpiderCNN [43]. As the name suggests, it uses CNNs, however, with one important adaptation. Standard CNNs cannot straightforwardly be applied to point cloud data, since the data are not stored in a regular grid, which such CNNs require. SpiderCNN instead uses a set of irregular parametrized convolutions, which can be applied to point clouds directly. A similar approach is used by PointCNN [44], which instead uses X-transforms to extend convolutions to point clouds. Alternative approaches that do not use CNNs have also been explored. Take for instance PointNet [45]: it initially processes all points independently and identically with a set of input and feature transforms. Eventually, it combines point features by applying one layer of max pooling. While its performance is on par with other state-of-the-art approaches, its drawback is that, due to its layers of independent processing, it does not capture local features of the spatial distribution of points well. To this end, PointNet++ [46] was introduced. It borrows ideas from CNNs and applies the architecture of PointNet recursively on incrementally growing local regions of the input space. Unlike the original PointNet, this allows it to capture local features at a variety of scales.
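The core idea behind PointNet can be illustrated with a small numerical sketch: every point is processed by the same (here untrained, randomly initialized) MLP independently, and a symmetric max-pooling step aggregates the per-point features into a single, permutation-invariant global descriptor. This is a structural toy example, not the published architecture.

import numpy as np

def shared_mlp(points, w1, w2):
    # The same weights are applied to every point independently.
    h = np.maximum(points @ w1, 0.0)        # (N, 64) per-point features, ReLU
    return np.maximum(h @ w2, 0.0)          # (N, 256)

rng = np.random.default_rng(0)
w1, w2 = rng.standard_normal((3, 64)), rng.standard_normal((64, 256))
cloud = rng.standard_normal((1024, 3))      # a point cloud with 1024 points

global_feature = shared_mlp(cloud, w1, w2).max(axis=0)   # symmetric aggregation over points

# Reordering the points leaves the global feature unchanged (permutation invariance),
# but purely per-point processing ignores local neighborhood structure, which is what
# PointNet++ addresses by applying the same idea recursively on local regions.
assert np.allclose(global_feature, shared_mlp(cloud[::-1], w1, w2).max(axis=0))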

While end-to-end deep-learning-based object detection can achieve high accuracy, it generally requires a very large number of training examples, and training with limited data usually leads to poor performance [24], [25]. Catastrophic forgetting is another important limitation of deep learning approaches [47], [48]. Furthermore, these approaches are unable to learn new categories in an online fashion. It is worth mentioning that for path planning purposes, the mentioned limitations are not that important, since the robot mainly needs to avoid objects,

Fig. 6. Examples of active object segmentation in pile-of-objects scenarios: (left) the robot observes the scene, obtaining a point cloud from one perspective; (right) the robot decides to push the objects from the bottom to segment them. The robot repeats these steps to singulate all objects (adapted from [34]).


Fig. 7. Examples of object (scene) segmentation: (left) a 3D input point cloud; (right) a network prediction. The colors of the points represent the object labels (adapted from [37]).

furniture or humans, which can be divided into relatively few categories. However, to be of help, a service robot must be able to accurately manipulate all sorts of small household items, which come in a seemingly endless number of categories.

Another problem with many current 3D object detection approaches is the inability to detect transparent and highly reflective objects. Since glass and shiny metal objects are commonplace in every household, this is an issue that surely needs to be addressed. Sajjan et al. [49] took a step towards addressing such limitations for object grasping and manipulation. In particular, they proposed a deep learning approach, named ClearGrasp [49], for estimating accurate 3D geometry of transparent objects from a single RGB-D image for robotic manipulation.

B. Object category learning and recognition

A typical household environment contains objects belonging to a large number of categories. In such domains, it is not feasible to assume one can anticipate and preprogram everything for robots. Therefore, autonomous robots must have the ability to execute learning and recognition concurrently. Several methods have been introduced that allow for open-ended learning of new categories. In such approaches, the introduction of new categories can significantly interfere with the existing category descriptions. To cope with this issue, memory management mechanisms, including salience and forgetting, can be considered [47]. The main approaches can be divided into two different categories, according to what type of object representation they use. On the one hand, there are approaches that still use deep learning based techniques [50], [51], [28]. They use networks that have been pretrained on a large dataset and are, therefore, already able to extract a lot of useful features from images. Additional training and recognition on new objects and categories is approached in a number of different ways: some use few-shot learning [50], [52], [53], others use one-class support vector machines [54] and random forests [55], and simple instance-based learning is also combined with nearest-neighbor classification [50], [51], [28]. Other learning approaches have also been considered, such as the use of autoencoder-based representation learning [56], [57], Bag of Words [58], [26], [59], and topic modeling [60], [61], [62].

On the other hand, there are approaches that instead use hand-crafted features for object recognition [27], [63], [64], [65], [66]. They use either global or local features of the objects that can be extracted in a variety of ways. The representation of an object is usually given by a histogram

of features. Concerning category formation, an instance-based learning (IBL) approach is usually adopted, in which a category is represented by a set of views of instances of the category. When the robot is presented with a new object, it compares the representation of the new object to the representation of all instances of all categories using, for example, a simple nearest-neighbor classifier. Therefore, each instance-based object learning and recognition approach can be seen as a combination of a particular object representation, similarity measure [67] and classification rule [68]. One advantage that instance-based learning has over other methods of machine learning is its ability to adapt the model to previously unseen data. The disadvantage of this approach is that the computational complexity can grow with the number of training instances. The computational complexity of classifying a single new instance is O(n), where n is the number of instances stored in memory. Therefore, these systems must resort to experience management methodologies to discard some instances and prevent the accumulation of an impractically large set of experiences. Salience and forgetting mechanisms can be used to bound the memory usage. These mechanisms are also useful for reducing the risk of overfitting to noise in the training set. Another advantage of the instance-based approach is that it facilitates incremental learning in an open-ended fashion.

There are some works that adopt model-based learning (MBL), which is often contrasted with IBL approaches. In particular, IBL considers category learning as a process of learning about the instances of a category, while MBL is a process of learning a parametric/non-parametric (Bayesian) model from a set of instances of a category. In other words, each category is represented by a single model. IBL approaches propose that a new object is compared to all previously seen instances, while MBL approaches propose that a target object is compared to the models of the categories. Therefore, in terms of recognition response, MBL approaches are faster than IBL approaches. In contrast, IBL approaches can recognize objects using a small number of experiences, while MBL approaches need more experiences to achieve a good classification result. Therefore, training is very fast in IBL approaches, but they require more time in the recognition phase. Another disadvantage of IBL approaches is that they need a large amount of memory to store the instances. In MBL, new experiences are used to update category models and the experiences can then be forgotten immediately. The category model encodes the information collected so far. Therefore, this approach consumes a much smaller amount of memory when compared to any IBL approach.
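The contrast between the two paradigms can be summarized in a toy Python sketch, using plain feature vectors and Euclidean distance as stand-ins for a real object representation and similarity measure; the class names and the running-mean prototype model are illustrative assumptions.

import numpy as np

class InstanceBasedLearner:
    def __init__(self):
        self.instances, self.labels = [], []          # memory grows with every teaching episode
    def teach(self, feature, label):
        self.instances.append(np.asarray(feature, dtype=float)); self.labels.append(label)
    def classify(self, feature):
        f = np.asarray(feature, dtype=float)
        d = [np.linalg.norm(f - x) for x in self.instances]   # O(n) nearest-neighbor search
        return self.labels[int(np.argmin(d))]

class ModelBasedLearner:
    def __init__(self):
        self.prototypes, self.counts = {}, {}         # one running-mean model per category
    def teach(self, feature, label):
        f = np.asarray(feature, dtype=float)
        n = self.counts.get(label, 0)
        mean = self.prototypes.get(label, np.zeros_like(f))
        self.prototypes[label] = (mean * n + f) / (n + 1)     # update the model, then forget the instance
        self.counts[label] = n + 1
    def classify(self, feature):
        f = np.asarray(feature, dtype=float)
        return min(self.prototypes, key=lambda c: np.linalg.norm(f - self.prototypes[c]))

ibl, mbl = InstanceBasedLearner(), ModelBasedLearner()
for feat, cat in [([1.0, 0.0], "mug"), ([0.0, 1.0], "plate"), ([0.9, 0.1], "mug")]:
    ibl.teach(feat, cat); mbl.teach(feat, cat)
print(ibl.classify([0.8, 0.2]), mbl.classify([0.8, 0.2]))     # both -> "mug"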

C. Fine-grained object recognition

In human-centric environments, fine-grained (very similar) object categorization is as important as basic-level categorization. A problem of the above approaches is that categories that are very similar might be hard to distinguish. Such categories could, for example, be different items of cutlery, different dog, cat or other pet breeds, a variety of box-shaped objects (like food container boxes, tissues, a stack of paper,


Fig. 8. A set of very similar (fine-grained) cutlery objects: (left) an object view of a spoon, (center) an instance of a fork, and (right) an object view of a knife.

etc.) or different writing utensils. Attempts have been made to tackle this issue by introducing fine-grained object recognition [69], [70], [71]. Fine-grained object recognition takes into consideration small visual details of the categories that are important to distinguish them from similar categories. In the case of food boxes, think for example about the print on the boxes. Fine-grained and basic-level recognition are important in different domains and, when combined, a trade-off is made between the computational benefits of basic-level recognition and the accuracy of fine-grained recognition. Most existing approaches for fine-grained categorization heavily rely on accurate annotations of object parts/features during the training phase. Such requirements prevent the wide usage of these methods. However, some works only use class labels and do not need exhaustive annotations. Geo et al. [70] proposed a Bag of Words (BoW) approach for fine-grained image categorization by encoding objects using generic and specific representations. This approach is impractical for an open-ended domain (i.e., a large number of categories) since the size of the object representation is linearly dependent on the number of known categories. Zhang et al. [71] proposed a novel fine-grained image categorization system. They only used class labels during the training phase. This work completely discarded the co-occurrence (structural) information of objects, which may lead to non-discriminative object representations and, as a consequence, to poor object recognition performance. Several types of research have been performed to assess the added value of structural information. Kasaei et al. [60] proposed an open-ended object category learning approach by learning specific topics per category. In another work [69], an approach is proposed to learn a set of general topics for basic-level categorization, and a category-specific dictionary for fine-grained categorization.

Object detection and recognition performance can be improved by considering the context in which objects appear [72], [73], [74], [75]. In a house, certain categories of related items are often placed together. This information can be used to improve the processing of related items. For example, chairs and tables often appear together, so the presence of one could be used as a cue to expect the presence of the other as well. Additionally, this information can be used to distinguish objects from similar categories that appear in different environments. A pen and a screwdriver, for example, have a fairly similar shape; however, a pen is likely to appear on or near a desk in a home office or a table in the living room, while a screwdriver is expected to be around other tools such as hammers and wrenches that are more likely to be found in a garage or near a fuse box.
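One simple way to exploit such co-occurrence information is to re-weight the scores of an ambiguous classifier by a context prior, as in the toy Python sketch below; the co-occurrence table and default prior are made-up values for illustration only.

co_occurrence = {                    # illustrative P(category | nearby context object)
    "pen":         {"desk": 0.6, "workbench": 0.1},
    "screwdriver": {"desk": 0.1, "workbench": 0.7},
}

def rescore(classifier_scores, scene_objects):
    rescored = {}
    for category, score in classifier_scores.items():
        prior = 1.0
        for obj in scene_objects:
            prior *= co_occurrence.get(category, {}).get(obj, 0.2)   # small default prior
        rescored[category] = score * prior
    total = sum(rescored.values()) or 1.0
    return {c: s / total for c, s in rescored.items()}               # normalize to a distribution

# A pen-like shape seen on a desk: visually ambiguous, but the context favors "pen".
print(rescore({"pen": 0.5, "screwdriver": 0.5}, ["desk"]))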

In active perception scenarios, whenever the robot fails to

recognize an object from the current viewpoint, the robot will estimate the next view position and capture a new scene from that position to improve its knowledge of the environment. This will reduce the object detection and pose estimation uncertainty. Towards this end, Mauro et al. [76] proposed a unified framework for content-aware next-best-view selection based on several quality features such as density, uncertainty, and 2D and 3D saliency. Using these features, they computed a view importance factor for a given scene. Kasaei et al. [77] proposed a novel next-best-view prediction algorithm to improve object detection and manipulation performance. First, a given scene is segmented into object hypotheses and then the next best view is predicted based on the properties of those object hypotheses. In another work, Biasotti et al. [78] approached the problem of defining the representative views for a single 3D object based on visual complexity. They proposed a new method for measuring viewpoint complexity based on entropy. Their approach revealed that it is possible to retrieve and cluster similar viewpoints. Doumanoglou et al. [79] used the class entropy of samples stored in the leaf nodes of a Hough forest to estimate the next best view. Some researchers have recently adopted deep learning algorithms for next-best-view prediction in active object perception. For instance, Wu et al. [80] proposed a deep network, namely 3D ShapeNets, to represent a geometric shape as a probabilistic distribution of binary variables on a 3D voxel grid.
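A minimal sketch of entropy-driven next-best-view selection is given below: if recognition from the current viewpoint is ambiguous, the robot moves to the reachable viewpoint whose predicted class distribution has the lowest entropy. The per-view distributions would normally come from a learned model; here they are hard-coded for illustration.

import math

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def next_best_view(predicted_distributions):
    # predicted_distributions: {viewpoint: [class probabilities]} for each reachable viewpoint
    return min(predicted_distributions, key=lambda v: entropy(predicted_distributions[v]))

views = {
    "current":  [0.40, 0.35, 0.25],   # ambiguous from where the robot currently stands
    "left_45":  [0.70, 0.20, 0.10],
    "top_down": [0.55, 0.30, 0.15],
}
print(next_best_view(views))          # -> "left_45"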

D. Alternate forms of perception

Cognitive scientists have shown that human vision is not an independent process and that it is closely coupled to other forms of perception [81], [82], [83]. It could therefore be desirable to also explore different forms of perception in robotics. Furthermore, it could be interesting to investigate a way to couple these forms of perception, similar to what we see in humans. This could be especially useful in cases where fragmented information from the different modes of perception could be combined to form a clearer picture.

Take, for example, the case of robotic perception via touch. A robot can use not only vision but also touch to identify objects [84]. This could help with identifying transparent or reflective objects, which was previously identified as a problem area and is something many forms of visual object detection currently cannot handle. It could also be very useful for a robot to be able to detect typical household sounds, which has received surprisingly little attention so far [85]. Additionally, auditory and visual perception could be improved by the interchange of information. Research in this respect has largely focused on sound localization [86], [87], [88], [89], [90], [91]. This entails that, given a video input with corresponding audio, a system tries to identify which sounds in the audio channel correspond to which objects in the video scene. Take for example the MONO2BINAURAL [86] DNN. Given a single-channel audio signal together with an accompanying source video, it uses spatial information in the video to convert the audio signal into two channels, each of which represents the signal received in one ear. This improved double audio channel then also provides spatial information about the sounds, which


is clearly related to sound localization. It is therefore not surprising that the representation learned by this network was successfully extended to the task of sound localization. The CO-SEPARATION [87] architecture is very similar; however, it separates the audio into different sources instead of two spatial channels. By doing this, it also learns to match the sound of the same object type appearing in different videos.

While sound localization can already be useful in itself for robots to, for example, look at the person who is talking to them, to the best knowledge of the authors no attempts have been made to leverage this separated and localised audio to aid in the recognition of objects or events. For instance, imagine the case of a whistling kettle, or something more serious like the thump of a human falling and hitting the ground, possibly injuring themselves. It is important for the robot to be able to identify where the sound comes from. Furthermore, it is just as important that the robot is able to recognise the sounds, and the objects that produce them, and take action if necessary. Ideally, these alternate forms of perception, as well as the different forms of vision we discussed, should be coupled in a unified framework, so that each can benefit from information provided by the others.

IV. AFFORDANCE DETECTION AND OBJECT GRASPING

Object grasping and manipulation stand at the core of a successful service robot; without these abilities, the robot cannot provide many of the necessary services. There are a number of problems inherent to this task. For example, the robot needs to accurately detect the pose of objects, recognise the class of objects, and find the best location to grasp them in real-time. Besides, any planned movements need to avoid self-collision and collision with the environment. Furthermore, the more complex a robotic arm and gripper are, the more complicated these problems become, taking more computation time to find adequate solutions. Accuracy and computation time are the two major factors to balance when writing algorithms for these purposes. A great deal of work has been done to improve the quality of object grasping and to reduce the execution time of motions in object manipulation. There are, however, still areas and specific subjects that have not yet received a lot of attention. In the following subsections, we will point out some of those gaps in the state-of-the-art of object grasping and manipulation.

A. Grasp point detection

Detecting stable grasp points for previously unseen objects in real-time is one of the main challenges for object grasping. Towards finding proper grasping points of objects, many different techniques for both teaching and learning can be used. Trial and error is not often utilized, due to the large risk of damaging the robot, but it can be employed to learn multiple good grasps [92], [93]. Some approaches use object segmentation, but they differ significantly from the approaches mentioned in the previous section, which often have the disadvantage of being relatively inflexible and of not generalizing well to objects outside of the set they have been trained on. Matching known grasps in a manner of instance-based learning

Fig. 9. Results of grasp point detection on three different objects: (left) without considering object affordance, the robot detects a graspable point on the blade region of the knife; (center and right) by considering objects' affordances, the robot detects a suitable grasp point for both objects.

is possible as well [94]. Many recent techniques share the same pipeline to find a set of stable grasp points for a given object: first, grasping candidates are sampled from the point cloud (or image) of the object; then, the candidates are ranked through, for example, a Convolutional Neural Network (CNN) [95]. The best candidate then gets grasped in either an open-loop or a closed-loop fashion. Grasp point detection approaches can be broadly classified into two categories: analytical approaches and empirical approaches. While analytical approaches rely on kinematic and dynamic formulations to choose a proper end-effector configuration (i.e., position and orientation of hand and fingers), empirical approaches use data-driven learning algorithms to transfer grasps from 3D model databases to a target object. Empirical methods can be further classified based on whether the grasp configuration is being computed for known, familiar or unknown objects. The underlying reason for this classification is that prior knowledge about objects determines how grasp candidates are generated and ranked. For more details on data-driven grasp synthesis, we refer the reader to the surveys of Bohg et al. [96] and Sahbani [97].
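The shared sample-then-rank pipeline mentioned above can be summarized by the following Python skeleton. The sampler and the scoring function are toy stand-ins; in practice, the scoring step would be a trained network such as a GQ-CNN.

import random

def sample_grasp_candidates(object_points, n=100):
    # Toy sampler: pick surface points and attach a random approach angle (radians).
    return [(random.choice(object_points), random.uniform(0.0, 3.14159)) for _ in range(n)]

def grasp_quality(candidate):
    point, angle = candidate
    return random.random()            # stand-in for a learned grasp-quality score

def best_grasp(object_points):
    candidates = sample_grasp_candidates(object_points)
    ranked = sorted(candidates, key=grasp_quality, reverse=True)
    return ranked[0]                  # executed open-loop, or re-evaluated every cycle (closed-loop)

print(best_grasp([(0.10, 0.00, 0.02), (0.12, 0.01, 0.05), (0.09, -0.01, 0.03)]))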

Deep learning methods have recently achieved the largest advancements in grasping unknown objects. Although an exhaustive survey of deep learning methods for grasp point detection is beyond the scope of this paper, we review some representative works here. Most of these approaches, however, use an adaptation of the CNN architectures designed for object recognition [98]. Additionally, they often sample and rank potential grasps individually [99]. Together, this can cause exceedingly long computation times, which makes them unsuitable for real-time closed-loop grasping. Mahler et al. [100] proposed the first Dexterity Network (Dex-Net). Since then, they have proposed three further versions of the Dex-Net architecture, together with the growing Dex-Net dataset. Dex-Net 1.0 [100] uses a CNN with an AlexNet architecture [101] to first predict the label of the given object and then the probability of force closure under uncertainty in object pose, gripper pose and friction coefficient, in order to obtain the quality of a grasp. Dex-Net 2.0 [95] uses a Grasp Quality Convolutional Neural Network (GQ-CNN) to evaluate grasp candidates for a certain object. In this approach, first, a discrete set of antipodal candidate grasps is sampled from the image space. Then, these grasp candidates are forwarded to the GQ-CNN to find the best grasp point. The architecture of Dex-Net 3.0 [102] does not differ from that of Dex-Net 2.0; the main difference lies in the use of a suction cup in Dex-Net 3.0, whereas Dex-Net 2.0 uses a parallel-jaw gripper. Dex-Net 4.0 [103]


was created for a robot with a suction cup and a parallel-jaw gripper. The advantages of suction cups are that they can reach into narrow spaces and pick up an object with a single point of contact. In contrast to earlier versions, Dex-Net 4.0 is not only trained on grasping objects in isolation, but also on grasping objects in pile scenarios.

The Generative Grasping Convolutional Neural Network (GG-CNN) [104], [105] seeks to improve on the mentioned drawbacks. GG-CNN is an object-independent grasp synthesis method which can be used for closed-loop grasping. GG-CNN predicts the quality and pose of grasps at every pixel through a one-to-one mapping from the depth image. This is in contrast with most deep-learning grasping methods, which use discrete sampling of grasping candidates and have long computation times. Furthermore, such online analysis of objects allows for more precise and potentially faster grasps. These methods vastly reduce the number of possible grasps by highlighting the locations where the grasps are of the highest quality. In another work [106], online grasp generation has been addressed. Both [106] and the GG-CNN approach use neural networks to generate maps of grasp quality. For the former, grasp points for the two end-effectors at various angles are generated. For the latter, maps for grasp quality, angle and width are generated.
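Turning such pixel-wise predictions into a single grasp command amounts to picking the pixel with the highest predicted quality and reading off the angle and width predicted at that same pixel, as in the sketch below. The maps are random placeholders standing in for real network outputs, and the function and field names are assumptions.

import numpy as np

def select_grasp(quality_map, angle_map, width_map, depth_image):
    v, u = np.unravel_index(np.argmax(quality_map), quality_map.shape)   # best-quality pixel
    return {
        "pixel": (u, v),
        "depth": float(depth_image[v, u]),   # z of the grasp point (meters)
        "angle": float(angle_map[v, u]),     # gripper rotation at that pixel
        "width": float(width_map[v, u]),     # gripper opening at that pixel
        "quality": float(quality_map[v, u]),
    }

rng = np.random.default_rng(1)
h, w = 224, 224
print(select_grasp(rng.random((h, w)),
                   rng.uniform(-1.57, 1.57, (h, w)),
                   rng.uniform(0.0, 0.1, (h, w)),
                   rng.uniform(0.3, 1.0, (h, w))))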

A robot could benefit from affordance detection of parts of a single object to reduce the complexity of finding the locations of high-quality grasps [107], [108]. Certain objects have a clear place to be grabbed, for instance the handle of a knife, the ear of a mug or the handle of a pan. These can be identified in this manner. On the contrary, there are also objects that should not be grasped in a certain area. Considering again the case of a knife, its blade could damage the robot's end-effector when grasped. Softer skin-like end-effectors [109], which are important for delicate manipulation of objects, are especially at risk (see Fig. 9). Care should also be taken with plates, cups, bowls and similar objects, as they may contain food or liquids. Manipulating filled containers is already a difficult task [110]; a bad grasp can make the manipulation more difficult than it has to be. Naturally, touching food or liquids with the end-effector should be avoided as well.

An affordance for a certain grasp is generally synonymous with a high grasping quality, be it for a specific end-effector, angle or other purpose. In [111], objects' point clouds are semantically segmented by a rule-based system. Based on what the robot is tasked with, a specific segment of the object may be more suitable to be grasped than others. As a bonus, the sizes and shapes of segments are features, allowing the system to classify objects. This combination of grasp affordance and knowledge of semantics creates a system that is robust in handling objects for a variety of tasks. Besides, affordance prediction can provide some additional information about other objects [112], [113]. For example, it can tell the robot on which furniture other objects can be placed, or determine whether an object can be picked up or not, etc. It is worthwhile to mention that affordances do not make any claims about the category of objects in a scene, but rather try to predict the functionality of objects [113]. Instance-based learning with affordances [106] and applying affordances to

segments of objects [111] are other possible options.

B. Object grasping

Object grasping remains a challenging task and requires knowledge from different fields. This topic has received a lot of attention lately. Through the use of inverse kinematics (IK), many different grasps are possible on an object. Not all of them are feasible, as not all joint configurations are free of collisions and singularities. A good grasp has a high quality, a measure of how stable the grasp is, and has to be somewhere on the visible part of the object. A large body of recent efforts has focused on solving 4-DoF (x, y, z, yaw) grasping, where the gripper is forced to approach objects from above [106], [105], [95], [114].

A major drawback of these approaches is the inevitably restricted ways to interact with objects. For instance, they are not able to grasp a horizontally placed plate. Worse still, the knowledge of such robots is fixed after the training stage, in the sense that the representation of the known grasp templates does not change anymore. A household situation is prone to change: cabinets may be stocked differently, and implements may be lost and gained. To avoid impossible tasks, the degrees of freedom of an end-effector cannot be so restricted. These drawbacks call for more advanced and seamless approaches for object grasping.

Recently, some research groups have taken a step towards addressing these issues. Qin et al. [116] studied this problem in a challenging setting, assuming that a set of household objects from unknown categories is casually scattered on a table. They proposed a learning framework to directly regress 6-DoF grasps from the entire scene point cloud in one pass. In particular, they compute a per-point scoring and pose regression method for 6-DoF grasps. In another paper, a single view was used for grasp generation by Kopicki et al. [117]. This works similarly to the direct regression of point clouds. The main obstacle with such a method is that

Fig. 10. An example of a multi-functional gripper that can be used for grasping objects in different situations: (left) in this situation, the object is graspable vertically using the two-finger parallel-jaw gripper, since there is enough space between the walls and the object. This type of grasping is mainly used to pick up objects with small, irregular surfaces such as baskets, table-top objects and tools; (center) some objects can be robustly grasped using a suction cup gripper, as shown in this figure. This type of grasping is robust for objects with large and flat surfaces, e.g., books and boxes; (right) in some situations, it is necessary to grasp an object from a specific direction, mainly due to environmental constraints or application requirements. For instance, this type of grasp is used for objects resting against walls, which may not have suction-able areas from the top [115].


the back side of objects is generally unknown and represents missing information. Challenging situations, those where the back side needs to be grasped, are thus harder to correctly resolve. By combining generative models and using new ways to evaluate contact points, a higher rate of success is achieved in these challenging situations. In another work, Murali et al. [118] proposed an approach to plan 6-DoF grasps for any desired object in a cluttered scene from partial point cloud observations. They mainly used an instance segmentation method to detect the target object. To generate a set of grasp points for the object, they follow a cascaded approach by first reasoning about grasps at the object level and then checking the cluttered environment for collisions.

These methods are partial solutions to the general problem of moving from 4 to 6 degrees of freedom. A larger search space for grasps means that more grasps are in conflict with the environment; it also means that more computation time is required to find good grasps and eliminate bad ones. By focusing on regression over point clouds and single views, Qin et al. and Kopicki et al. avoid a large computational pitfall. Likewise, the method of Murali et al. avoids scene collisions through clever segmentation and analysis.

For the methods discussed, the end-effectors are still rather simple. While fast grasping with a simple gripper is relatively easy, doing so with more complicated and more dextrous grippers is harder and more computationally intensive. Another option is a vacuum end-effector, which functions like a suction cup. Grippers and vacuum end-effectors have similar requirements for a successful grasp: both require a high-quality grasp, but while grippers require a stable pinching area, suction cups require a stable suction area. This difference in suitable grasp locations allows vacuum end-effectors to manipulate objects in ways that grippers cannot.

As we saw before, Mahler et al. [102] created Dex-Net 3.0, a large dataset for 3D object grasping specifically for vacuum-based end-effectors. This dataset is mainly used to train a grasp quality classifier. As a result, simple and typical everyday objects are grasped with a high success rate, while the most complicated objects are successfully grasped more than half of the time. While a vacuum end-effector is highly suitable for work in a factory, a delicate gripper is still preferred for the manipulation of the softest objects, as well as objects that lack any proper suction points. The opposite is also true: objects that are more easily picked up with a suction cup, or that cannot be picked up with a gripper at all, may warrant the inclusion of a vacuum end-effector. To ensure the proper completion of any task, a service robot may have to be equipped with both. Instead of using two robotic arms with two different types of gripper, Zeng et al. [115] developed an interesting multi-functional gripper, consisting of a suction cup and a parallel-jaw gripper, that allows robots to robustly grasp objects from a cluttered scene without relying on their object identities or poses. Such a mechanism enables a robot to quickly switch between the suction cup and the parallel-jaw gripper to grasp different types of objects in various positions (see Fig. 10).
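
A minimal sketch of how such a hybrid gripper might select its mode is given below. The two surface statistics (flat patch area and lateral clearance) and the thresholds are illustrative assumptions, not parameters reported in [115].

    def choose_gripper(flat_area_cm2, clearance_cm,
                       min_suction_area=20.0, min_clearance=3.0):
        """Heuristic mode selection for a hybrid suction / parallel-jaw gripper:
        large flat patches favour suction, free space around the object favours
        a pinch grasp; otherwise a side grasp is attempted."""
        if flat_area_cm2 >= min_suction_area:
            return "suction"
        if clearance_cm >= min_clearance:
            return "parallel_jaw_top"
        return "parallel_jaw_side"   # e.g., objects leaning against a wall

    print(choose_gripper(flat_area_cm2=35.0, clearance_cm=1.0))   # -> "suction"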

There are also various anthropomorphic robotic hands. The Shadow Dexterous Hand [119] is an almost fully actuated robotic hand that is modeled very closely after human hands; an upgraded version even includes touch and vibration sensors on the fingertips. For a more integrated approach, iCub [120], [121] is a fully anthropomorphic robot that was designed for human-like interactions with its environment. Compared to the Shadow Dexterous Hand, the iCub robot is softer and less actuated; it is covered in a soft, skin-like material to aid in the delicate handling of objects. On the extreme end of the spectrum, there is the RBO Hand 2 [122]. While still anthropomorphic, its individual fingers are more akin to tentacles than to human fingers. It is an extremely soft and relatively simple hand, lacking any joints; nevertheless, it is still able to grasp a variety of objects.

The simplicity of such a gripper does not necessarily prevent it from succeeding at various tasks. This is also shown in a review of robotic hands [123], which notes a rise in simpler but effective grippers. While complex end-effectors can be used for complex tasks, a simpler gripper may be able to perform them just as effectively. This leads back to the vacuum end-effector, which is far more effective in certain circumstances than any anthropomorphic robotic hand.

Whether soft or under-actuated robotic hands are required for delicate operation generally depends on the task at hand. Industrial robots are often fully actuated and rigid because, while the actions themselves are usually not complicated and take place in a highly structured environment, high precision is still required. For service robotics, the tasks are more varied while requiring similar precision. In such an environment, the end-effector may determine which tasks are suitable rather than the other way around; having multiple end-effectors at its disposal then increases the number of tasks the robot is suited for. To this end, a combination of different grippers, as in Zeng et al. [115], would be a proper approach to creating a capable service robot.

C. Open-ended grasp learning

While progress has been made towards open-ended learning, it remains a big problem that needs to be addressed further. Current approaches are only able to successfully learn a limited fraction of all objects that can be found in a household. For service robots that need to help those in need, a significant chance of failing to recognise and grasp important objects, such as medicine containers, could be very problematic. Towards addressing this issue, some researchers used kinesthetic teaching to teach a new grasp configuration, including the position and orientation of the arm relative to the object and the finger positions [124], [125], [126]. As shown in Fig. 11, an instructor teaches an appropriate end-effector position and orientation using the robot's compliant mode. After the kinesthetic teaching, visual features of the taught grasp region (e.g., heightmaps [125]) are extracted and stored as a grasp template in the grasp memory. In another work, Kasaei et al. [124] formulated grasp pose detection as a supervised learning problem for grasping familiar objects. Their primary assumption was that new objects that are geometrically similar to known ones can be grasped in a similar way. They used kinesthetic teaching to teach feasible grasp configurations for a set of objects.



Fig. 11. An example of kinesthetic teaching: (left) a user demonstrates a feasible grasp to the PR2 robot; (right) the extracted template heightmap and gripper pose are used to train the proper grasp position for the given object (adapted from [125]).

The target grasp point is described by (i) a spin image [127], a local shape descriptor, and (ii) a global feature representing the distance between the grasp point and the centroid of the object. To detect a grasp configuration for a new object, they first estimate the similarity of the target object to all known objects and then try to find the best grasp configuration for the target object based on the grasp templates of the most similar object.
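
The retrieval step can be sketched as a nearest-neighbour lookup in a grasp memory, as below. The descriptors and templates are hypothetical placeholders; a real system would use spin images or heightmaps as discussed above.

    import numpy as np

    # Hypothetical grasp memory: object label -> shape descriptor and stored grasp templates.
    grasp_memory = {
        "mug":    {"descriptor": np.array([0.2, 0.8, 0.1]), "templates": ["handle_pinch"]},
        "bottle": {"descriptor": np.array([0.1, 0.3, 0.9]), "templates": ["side_wrap"]},
    }

    def grasps_for_familiar_object(target_descriptor, memory):
        """Return the grasp templates of the most similar known object, following
        the assumption that geometrically similar objects can be grasped in a
        similar way. Similarity here is plain Euclidean distance between the
        (hypothetical) shape descriptors."""
        best_label, best_entry = min(
            memory.items(),
            key=lambda kv: np.linalg.norm(kv[1]["descriptor"] - target_descriptor))
        return best_label, best_entry["templates"]

    label, templates = grasps_for_familiar_object(np.array([0.15, 0.75, 0.2]), grasp_memory)
    print(label, templates)   # -> mug ['handle_pinch']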

Most of these examples have dealt with grasping only. To be truly considered dexterous, robotic systems should have the capacity to perform complicated tasks. A small step in the right direction is the ability to prevent objects from slipping and falling [128]. The adjustment of already held objects, for example using the individual fingers [129], is a required stepping stone towards advanced object handling. Moving from one robotic arm to two, synchronized or not [130], truly opens up the way to manipulate equipment made for human use. A fair part of the equipment that service robots have to work with is made for humans. This means that advanced object manipulation, as well as the other advances, is required for the development of capable service robots.

V. OBJECT MANIPULATION

It is easier to explain the details of object manipulation using an example; consider the serve a coke task shown in Fig. 12. To accomplish this task, the robot first needs to detect and recognize all objects. Afterwards, it has to identify a proper grasp pose for the coke object and plan a trajectory to reach the target pose. The object is then grasped by the robot. The object manipulation module computes an obstacle-free trajectory to navigate the robot's end-effector to a desired pose, which is on top of the cup object. The object manipulation module should check whether any part of the manipulator is at risk of colliding with itself or with any obstacles. Finally, the manipulation module sends the action to the execution manager module. We will mainly discuss collision avoidance and dexterity in the following subsections.
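
Before turning to these subsections, the pipeline just described can be summarised in a short sketch. All functions below are hypothetical stand-ins for the perception, grasp-planning, motion-planning, and execution-manager modules, and the poses are toy values.

    import numpy as np

    def detect():
        # Hypothetical perception output: object label -> 3-D position on the table.
        return {"CokeCan": np.array([0.6, 0.1, 0.75]), "Cup": np.array([0.6, -0.2, 0.75])}

    def plan_path(target, label=""):
        # Stand-in for the obstacle-free trajectory planner; returns a waypoint list.
        print(f"planning collision-checked path to {label} at {np.round(target, 2)}")
        return [target]

    def serve_a_coke():
        """Sketch of the serve-a-coke pipeline: detect objects, grasp the can,
        move it above the cup, pour, and retreat to the initial pose."""
        objects = detect()                                         # perception
        grasp_pose = objects["CokeCan"] + np.array([0, 0, 0.10])   # pre-grasp above the can
        execute = print                                            # execution manager stand-in
        execute("grasp", plan_path(grasp_pose, "CokeCan"))
        pour_pose = objects["Cup"] + np.array([0, 0, 0.15])        # pose on top of the cup
        execute("pour", plan_path(pour_pose, "Cup"))
        execute("retreat", plan_path(np.array([0.3, 0.0, 1.0]), "home"))

    serve_a_coke()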

A. Self-collision avoidance

The standard way of dealing with self-collision avoidance (SCA) is either to plan feasible paths under constraints or to react when the robot is about to collide with itself. A planned route is susceptible to interruptions and thus only works in static environments [131], [132], [133]. The environments that service robots have to perform in may not always be static; whenever the environment changes, the plan has to be recalculated, including the SCA. Reactive systems, such as [134], [130], are more adaptable and can deal with changes to the environment. As a trade-off, they tend to get stuck more easily, may have trouble with tight spaces, and tend to oscillate around local maxima in their joint space.

Salehian et al. [130] proposed a new way of handling SCA. The collision boundary and its gradient in the combined joint space of the two arms are encoded into a model before the robot is even operated. This stored gradient leads the robotic arms to avoid situations where they might collide, by being repulsed by the boundary denoting imminent collision. The distance to collision is encoded in a kernel SVM and can be used as a constraint for inverse kinematics (IK) solving instead of the usual joint-to-joint distances. The model is effectively data-driven, requiring examples of valid and invalid joint configurations for the SVM to learn. The pay-off is that solving the IK with SCA is very fast (<10 ms). The collision boundary and gradient have to be determined only once for a given robot design and can be copied onto each production model.
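
The idea of encoding the collision boundary in a kernel SVM can be sketched as follows. The toy two-joint "robot" and its collision rule are assumptions made purely for illustration; only the overall recipe (label simulated configurations, fit an RBF SVM offline, query its decision value online) loosely follows [130].

    import numpy as np
    from sklearn.svm import SVC

    # Hypothetical training data: joint configurations labelled by a simulator as
    # self-colliding (1) or collision-free (0). Here a toy 2-joint arm "collides"
    # whenever its two joint angles are too close to each other.
    rng = np.random.default_rng(0)
    q = rng.uniform(-np.pi, np.pi, size=(2000, 2))
    labels = (np.abs(q[:, 0] - q[:, 1]) < 0.3).astype(int)

    # The RBF-kernel SVM encodes the collision boundary once, offline, per robot model.
    svm = SVC(kernel="rbf", gamma=2.0).fit(q, labels)

    def collision_margin(q_query):
        """Signed score from the learned boundary: negative values lie on the
        collision-free side, and the magnitude can serve as a repulsive
        constraint inside an IK solver."""
        return svm.decision_function(q_query.reshape(1, -1))[0]

    print(collision_margin(np.array([0.5, -1.0])))   # well away from self-collision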

While [130] provides a good method to avoid collisions between the two robot arms, it does not take into account the singularities that can occur during motion. The model from [134] manages to balance SCA with avoiding singularities, moving smoothly, and reaching the desired goal position and orientation. The SCA is done in a similar manner as before, but a neural network is trained instead of an SVM. The imminence of collisions, together with other objective functions, determines the constraints under which the IK solver operates.

Fig. 12. An example of object manipulation in the serve a coke scenario: (top-right) initially, the robot detects the table, shown by the green polygon; then all table-top objects are detected and recognized, and the pose and category label of each object are highlighted by a bounding box and a red label on top of the object. (top-left) The robot then locates the CokeCan object, moves to its pre-grasp area, and picks it up from the table. (bottom-left) The robot retrieves the position of the Cup, calculates an obstacle-free trajectory, and moves the CokeCan on top of the Cup to serve the drink. (bottom-right) Finally, the robot computes an obstacle-free trajectory to navigate its end-effector back to the initial pose.

These methods are able to quickly avoid self-collision but rely on data that shows the collision boundaries. The safety of joint configurations can easily be simulated, so it is possible to generate this data beforehand. This is somewhat analogous to how humans learn the limits of their bodies, but without the need for stress or touch sensors embedded in the robot arm structure. Nevertheless, looking into more human-like robotic arms, complete with additional sensors, can be useful for more advanced service robotics. For simpler service robots, however, the current implementations of collision proximity are more than enough.

B. Avoiding collisions with the environment

In contrast to self-collision avoidance, checking for collisions with the environment is a computationally intensive process (see Fig. 13). Reactive systems have to continuously check whether they are at risk of colliding, while planners have to check every configuration or position that they may attempt to use. In an online environment, speed is key; an algorithm that takes more than a second may already be too slow. One way to remedy this in a planning-based approach is to sample more aggressively towards the goal position and orientation [132]. This reduces the number of nodes that have to be visited during planning, and therefore also the number of collision checks that have to be made.
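
A minimal sketch of such goal-biased sampling is shown below. The bias probability is an assumed value; a full planner would combine this sampler with tree extension and collision checking.

    import numpy as np

    def sample_configuration(goal, bounds, goal_bias=0.3, rng=np.random.default_rng()):
        """Goal-biased sampling: with probability `goal_bias` the goal itself is
        returned, otherwise a uniform random configuration. Biasing the samples
        towards the goal means fewer nodes, and hence fewer collision checks,
        are needed before the planner reaches it."""
        if rng.random() < goal_bias:
            return goal.copy()
        low, high = bounds
        return rng.uniform(low, high, size=goal.shape)

    sample = sample_configuration(np.array([1.0, 2.0]),
                                  (np.array([-3.0, -3.0]), np.array([3.0, 3.0])))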

Looking at the consequences of actions is important, especially when it comes to hard-hitting robots and delicate surroundings. One interesting way of preventing damage is to predict the consequences of disturbing a scene [131]. When an object is picked up by the robot, it may cause other objects to move, tumble, or fall, and objects may get damaged in this way, which is of course not desired. To remedy this, robots usually act as little as possible to get the job done. This method instead focuses on learning the order in which to move objects so as to cause the least amount of damage. In this case, the paths that disturbed objects take, by rolling or falling, act as a cost or penalty. Based on the available knowledge of the scene, the model is able to generalize and choose the order of operations that causes the least damage.
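
The ordering idea can be sketched as a greedy loop over a learned disturbance predictor. The predictor below is replaced by a toy cost, so the sketch only illustrates the decision structure, not the learned model of [131].

    def removal_order(objects, disturbance_cost):
        """Greedily choose the order in which to pick objects from a pile so that
        the predicted disturbance to the remaining scene is minimal.
        `disturbance_cost(obj, remaining)` is a hypothetical stand-in for the
        learned predictor of how far other objects would roll or fall."""
        remaining, order = set(objects), []
        while remaining:
            nxt = min(remaining, key=lambda o: disturbance_cost(o, remaining - {o}))
            order.append(nxt)
            remaining.remove(nxt)
        return order

    # Toy cost: picking an object that "supports" many others is expensive.
    supports = {"plate": 2, "cup": 0, "spoon": 0}
    print(removal_order(supports.keys(), lambda o, rest: supports[o]))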

Avoiding damage and reducing planning load are both important steps towards fast and safe robotics. Nevertheless, not much work seems to have been put into reducing the complexity of collision checking itself. For SCA, neural networks were trained to learn the boundaries of self-collision, which resulted in a fast method of avoiding bad joint configurations. A similar system could be set up for visible objects, with, for example, a stream of point clouds as input. Similar to affordances for grasping, humans seem to be able to quickly ascertain affordances for arm and manipulator movement and orientation based on what they see and know. For service robots, this means they should be able to operate on a similar level, or at least fast enough to perform the various tasks we require of them. Towards this goal, Qureshi et al. proposed the MPNet algorithm, a learning-based neural planner for solving motion planning problems [135], [136].

Fig. 13. An example of defining a set of environmental constraints to prevent the robot from colliding with the environment in the serve a coke scenario.

In particular, they presented a recursive algorithm that receives environment information as point clouds, together with the robot's initial and desired goal configurations, as input arguments and generates a connectable path as output. Other research has also demonstrated that segmenting motion tasks can reduce the complexity that the lower modules have to deal with [137], whether by distinguishing between individual tasks based on relations of touch [138] or by embedding scores or attractors directly into a robotic joint feature space [139]. Simplifying the problem of motion with several degrees of freedom allows for a greater focus on the other parts.
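
The overall control flow of such a learned planner can be sketched as follows. The "neural sampler" is replaced by a trivial stand-in that merely steps towards the goal (the real network also consumes a point-cloud encoding of the scene and learns to detour around obstacles), so this is only a sketch of the rollout loop, not of MPNet itself.

    import numpy as np

    rng = np.random.default_rng(1)

    def collision_free(q):
        # Hypothetical collision checker: a single circular obstacle in a 2-D
        # configuration space, standing in for the point-cloud-based checker.
        return np.linalg.norm(q - np.array([0.0, 0.8])) > 0.5

    def neural_sampler(q, q_goal):
        # Stand-in for the learned sampler; it simply steps towards the goal.
        return q + 0.2 * (q_goal - q) + rng.normal(scale=0.05, size=q.shape)

    def plan(q_start, q_goal, max_steps=100, tol=0.1):
        """Greedy rollout: repeatedly query the sampler, keep only
        collision-free states, and stop once the goal is reached."""
        path, q = [q_start], q_start
        for _ in range(max_steps):
            q_next = neural_sampler(q, q_goal)
            if collision_free(q_next):
                path.append(q_next)
                q = q_next
            if np.linalg.norm(q - q_goal) < tol:
                return path + [q_goal]
        return None   # a classical planner would be used as a fallback here

    print(plan(np.array([-1.0, -1.0]), np.array([1.0, 1.0])) is not None)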

C. Coordination and dexterity

For humans, doing something with "one hand behind their back" is seen as a challenge. Similarly, for robots, doing a task with only one gripper is often possible but may be more difficult, and some tasks may even be impossible to perform. Coordinated and uncoordinated motion of two robotic arms is discussed in [130], but in a factory setting instead of a household one. As household situations usually deal with objects of a smaller size, the self-collision distance thresholds are set a bit higher than would be required. This highlights one of the potential problems with multi-arm coordination in household settings: robotic hands may have to work very closely together, to the point of touching. As of yet, very little research seems to include this as a point of interest. While it is possible to manipulate objects one at a time in a cluttered environment [106], and many actions are possible with one arm, there are cases where extra dexterity is required.

In the same vein, there is also room for improvement when it comes to manipulating objects that are already being grasped. Preventing an object from slipping from a gripper [128] and minor in-hand manipulation with the fingers while grasping [129] have already been explored to some extent. These lay the foundation for advanced manipulation, which remains mostly out of focus in favor of the act of grasping itself. Nevertheless, advanced manipulation is a requirement for the most delicate of tasks. As an example, correctly cracking open an egg without any tools is difficult even for humans, but a service robot may at some point be asked to do this or similar difficult tasks. For this reason, advanced manipulation still requires more research.

VI. MANIPULATION IN HUMAN-ROBOT SHARED ENVIRONMENTS

A lot of research has gone into enabling robots to deal with a wide variety of environments, both static and dynamic. While static environments have been mastered quite well, the navigation of dynamic environments, especially those with other human agents in the same space as the robot, is currently a hot topic in the field. Path and trajectory planning are essential components of object manipulation, especially if the robot is required to affect the environment beyond its immediate reach. This means that it must be able to relocate itself in, at worst, a highly dynamic environment with one or more other agents that may act unpredictably. We can further break this problem down into local and global navigation. Global navigation deals with planning towards a goal or objective that is not currently in the range of the robot's perception, while local navigation concerns navigation through the immediately perceivable space around the robot [140]. Global navigation and path planning have largely been tackled to a satisfactory degree and should generally converge to an optimal global solution [141].

One of the focuses of current developments is local path planning and obstacle avoidance in dynamic environments. Within this task, we can further distinguish varying levels of application. Lower-level concepts, such as combining color data with depth sensing to obtain additional information about the environment [142], [143], can be used for collision avoidance. A slightly higher-level task is creating a collision risk map in order to navigate the environment more effectively; these risk maps provide some insight into possible future states from the current one [140], [144]. Instead of explicitly modelling the collision risk, an artificial neural network can be trained to abstract the risk into higher- or lower-confidence path solutions [145].

When considering a robot's movement in human-robot shared environments, many of the lower-level functions and considerations have already been puzzled out, so the current state-of-the-art instead focuses on higher-level control philosophies. For truly robust trajectory planning in such a dynamic environment, the socially aware control model should be able to account for unexpected events, such as groups or individuals moving away from or avoiding unmodeled obstacles. Expanding on [146] and the concept of prioritizing human agents in an environment, we can see a shift towards increasingly social models, incorporating social force models (SFM) [147] to develop a system of socially reactive control (SRC) [148]. Such a system proposes to take into account not only single humans but groups of them and their collective motion for mobile manipulators, which can be used in stationary manipulation scenarios as well. This means modelling group dynamics such as group motion, and the centres and sizes of groups. Additionally, it is proposed to gain some basic understanding of what a human agent is doing at a given time and what they are interacting with, and to use that information to further understand what is likely to happen in the environment around the robot.
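
A minimal sketch of the social-force ingredient of such a controller is given below. The force constants, and the suggestion of approximating a group by its centre with an inflated repulsion range, are illustrative assumptions rather than the formulation used in [147], [148].

    import numpy as np

    def social_force(robot_pos, robot_goal, people, A=2.0, B=0.5, k_goal=1.0):
        """Resultant force = attraction towards the goal + exponential repulsion
        from nearby people. `people` is an (N, 2) array of 2-D positions; a group
        can be approximated by passing its centre together with a larger B."""
        goal_dir = robot_goal - robot_pos
        force = k_goal * goal_dir / (np.linalg.norm(goal_dir) + 1e-9)
        for p in people:
            diff = robot_pos - p
            dist = np.linalg.norm(diff) + 1e-9
            force += A * np.exp(-dist / B) * diff / dist   # pushes the robot away
        return force

    f = social_force(np.zeros(2), np.array([5.0, 0.0]),
                     np.array([[2.0, 0.3], [2.2, -0.2]]))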

An example given in [148] shows a human interacting with an object of interest and a second human facing the first as if to approach them. This would cause either the robot to cross the path of the human or the human to cross the robot's path if both continue in a straight line. Additionally, on the path to the goal are two humans who clearly form another distinct group. The socially reactive control model will attempt to avoid the group as a whole instead of attempting to pass through it, even if there is enough space to do so. While group dynamics can be useful to model in order to avoid collisions or interactions with groups of humans, the model of navigation that considers humans as single independent agents, rather than trying to infer groups, is still very relevant. In this case, the state-of-the-art proposes a change from a flat, pre-calculated confidence for the trajectory of a human to a Bayesian one [149], which is constantly updated. This allows for unobtrusive navigation around agents that act entirely unpredictably, meaning either intrinsically unpredictable or unpredictable in the sense that the agent is reacting to features of the environment that are not modelled by the robot, which makes its subsequent avoidance behaviour unpredictable as well. Such a motion plan is successful at avoiding human agents because it is far more conservative in the planning stage, considering a much wider area around a human to be inaccessible. In essence, it is similar to the earlier introduced concept of a virtual "force field" around a human agent [146], although it is not modelled explicitly as such. This provides a slightly more adaptable framework where any other agent in motion, human or not, can be avoided, by not making too strong an assumption about its intended direction of motion or goal.
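
The Bayesian treatment of a human's intended motion can be sketched as a simple belief update over a set of candidate goals, as below. The candidate goals and the von-Mises-like likelihood are modelling assumptions made for illustration and are not taken from [149].

    import numpy as np

    # Candidate goals a tracked human might be walking towards (hypothetical).
    goals = np.array([[5.0, 0.0], [0.0, 5.0], [5.0, 5.0]])
    belief = np.ones(len(goals)) / len(goals)        # uniform prior over goals

    def update_belief(belief, position, velocity, goals, kappa=2.0):
        """One Bayesian update: motion heading towards a goal makes that goal
        more likely; the posterior is renormalised after every observation."""
        direction = velocity / (np.linalg.norm(velocity) + 1e-9)
        to_goals = goals - position
        to_goals /= np.linalg.norm(to_goals, axis=1, keepdims=True) + 1e-9
        likelihood = np.exp(kappa * to_goals @ direction)    # von-Mises-like score
        posterior = belief * likelihood
        return posterior / posterior.sum()

    belief = update_belief(belief, np.array([1.0, 1.0]), np.array([0.8, 0.1]), goals)
    print(belief)   # probability mass shifts towards the goal the person walks at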

Another interesting aspect to consider in human-robot shared environments is that all paths to a certain goal may be blocked, even by very light or easily movable objects. Should a robot, in its navigation, consider whether an object can be pushed aside without damage to itself or the environment? For example, if a sheet of cardboard slips from a shelf and blocks the robot's path, can the robot simply push through it, as doing so would damage neither the robot nor the rest of the environment? Navigation has been considered as a tool to aid a robot's vision and perception, especially in crowded scenarios, and generally serves the purpose of moving the robot to a location where it can manipulate the environment using its gripper or another manipulation tool. However, should navigation itself, using the frame of the robot, be considered a valid method for manipulating the environment?

Another concept to consider is something more akin to mapping. Whereas so far we have discussed path and trajectory planning (i.e., navigation) as a means to bring a manipulator to a desired location in a dynamic environment, we can instead consider another intermediary task. The camera of a robot rarely has more than the three rotational degrees of freedom, so any planar motion that is desired needs to be provided by the robot itself through navigation. Yervilla-Herrera et al. [150] proposed a use-case for navigation which combines basic static obstacle avoidance with the principles of object reconstruction using methods such as shape from motion. In this case, the goal of the navigation is, in fact, to provide better or more complete sensory information to the robot, rather than to reach a specific point in space. This may be useful when an object in a pile is more easily detected or manipulated from a different angle of approach [77], [22], or simply to gain a better understanding of the overall shape of an object. As mentioned in Section III, the better the robot's ability to perceive an object, the easier it is to manipulate.

VII. CONCLUSION AND FUTURE WORK

In recent years, many great developments have been made in the field of service robotics. A lot of recent progress is due to the use of more complex machine learning techniques, such as (deep) neural networks, and is based on large amounts of data. While this leads to continuous improvements on the tasks themselves, large issues remain with real-world applicability due to time complexity. Service robots also struggle severely in unknown environments: they lack open-ended learning about object categories and scenes, a map for global navigation, and reliable object recognition for local navigation, and they run into collision issues while manipulating objects. After these robotic tasks are solved in experimental setups, the focus will need to shift from solutions to applicability, by reducing complexity and implementing more open-ended learning techniques.

As reviewed in this paper, several major issues and hurdles have been solved almost entirely. Path planning is close to completion in a reliable, known environment, as is grasp planning. Similarly, object recognition, which feeds into both of these types of planning, is approaching near-perfect scores when applied to objects in a predetermined setting.

Current issues in object manipulation largely boil down to avoiding collisions. This mostly means dealing with dynamic and shared human-robot environments by developing better methods for local planning. Some recently developed solutions concern themselves with humans, or even groups of humans: using group dynamics, a group of humans or other agents can temporarily be treated as a single unit for the purposes of trajectory prediction. Whether tracking a single human or a group, trajectory prediction has also moved towards Bayesian trajectory calculations, which give the robot a range of possibilities for where an entity might move next and let it use these probabilities to plan its own path. Besides, self-collision avoidance has also been addressed in recent work. New approaches have been suggested that are more adaptable and can adjust a grasp movement during execution; these adaptive approaches have a higher tendency to get stuck, however. Through the use of a support vector machine (SVM) [130], which resulted in adaptable grasps that sometimes ran into singularities, and later of a neural network for motion planning [136], these issues were resolved.

A. Trade-offs

While some areas see considerable steady improvements, other areas suffer from one of two situations: either an improvement in one criterion, such as accuracy, proves to be a setback in another, such as speed, or two approaches are developed side by side and continuously surpass each other, without a clear best approach to the task.

The first category mainly concerns planning modules. When a planning module becomes more sophisticated, the complexity of the constraints usually increases. This makes the calculations much more difficult, almost inevitably affecting the time it takes to find a proper solution. As such, in both navigation and grasp planning, continuous issues arise when trying to apply new methods in real environments. While some of these computational burdens eventually even out due to increases in hardware capacity, at other times dedicated research is needed to reduce time complexity.

The second category is seen across many different areas of service robotics. In object perception, it applies to object representation, differentiating between object descriptors that are either hand-crafted or trained by a neural network. It can also be seen in how the environment is viewed, where approaches based on bounding boxes compete with approaches using image segmentation. Finally, more recent work on object perception addresses the problem of cluttered areas, where multiple objects are in a pile. Here the distinction is made between using a grasping module to relocate the items before identifying them and using segmentation to label the partially occluded objects [151].

In grasping, trade-offs from this second category can also be observed. For the problem of reducing damage due to collisions, some approaches try to limit the movements made by a manipulator so that the chance of an accident is small, while other, more dynamic solutions try to predict possibly damaging outcomes and use this damage as a parameter or constraint in grasp planning. Finally, when constructing the grasp movements, there is no consensus on whether grasps should be based on point clouds, segmented by a rule-based system, or generated by neural networks. From this list of trade-offs, it can be seen that the field of robotics is far from delivering a unified solution to most issues. New approaches are continuously developed, old approaches are reinvigorated and improved, and even opposing but similarly effective approaches are found.

B. Future Work

While new developments are made frequently, some issues are either not solved or largely lacking research. For object manipulation, a suggested direction is to focus on different control philosophies for local planning. While some, such as Bayesian trajectory prediction, are already being developed, this is the area where navigation has the most to gain. Some suggest that navigation can also be used to make a robot capable of learning online, essentially using navigation to explore. This would mean that a robot needs to determine the most probable locations for a given object in a household environment or part of a map, and use navigation planning to go to those locations to observe and manipulate the object.

A different issue remaining for the grasping task is that of increasing complexity and dependency on data. Many models are currently trained on large sets of data from previous grasps, or on data concerning specific objects. This means that open-ended learning, as with the object perception task, is still lacking. To the best of the authors' knowledge, object detection for manipulation has largely been used for small tabletop items. However, the robot should also be able to autonomously manipulate larger objects like (wheel)chairs, as well as partly fixed objects like cabinet doors or windows. To this end, it is important to appropriately extend 3D object recognition and affordance prediction to furniture, doors, and windows.

For object perception, the list of unresolved issues is longer. It contains, among other things, dealing with objects with reflective surfaces and dealing with large objects. In order to solve the more difficult corner cases, several suggestions have been made. One such suggestion is to incorporate other forms of perception, such as tactile sensing, into object recognition. In order to be less data-dependent, the area of object perception will need to put a bigger emphasis on open-ended learning, allowing a robot to learn new objects while performing tasks. The final addition to object perception overlaps with grasping, as it has to do with affordance prediction. Affordances have been used in both perception and grasping for the identification of objects. As an additional use, grasping affordances can be extracted from images to quickly find suitable grasping locations and orientations. It may be possible to extract other types of affordances as well, such as ease of movement over terrain, or available space to move in 3D or joint space. As affordances are already used for grasping with single grippers, they may also be applied to the case of dual-armed robots. As little work has been done on the use of multiple grippers simultaneously in the household, this seems like a worthwhile direction.

REFERENCES

[1] S. Spasova, R. Baeten, S. Coster, D. Ghailani, R. Pena-Casas, andB. Vanhercke, “Challenges in long-term care in europe,” A study ofnational policies, European Social Policy Network (ESPN), Brussels:European Commission, 2018.

[2] R. Memmesheimer, V. Seib, and D. Paulus, “homer@ unikoblenz:winning team of the robocup@ home open platform league 2017,”in Robot World Cup. Springer, 2017, pp. 509–520.

[3] M. Oliveira, G. H. Lim, L. Seabra Lopes, S. H. Kasaei, A. M. Tome,and A. Chauhan, “A perceptual memory system for grounding semanticrepresentations in intelligent service robots,” in 2014 IEEE/RSJ Inter-national Conference on Intelligent Robots and Systems. IEEE, 2014,pp. 2216–2223.

[4] J. S. B. Evans, “Dual-processing accounts of reasoning, judgment, andsocial cognition,” Annu. Rev. Psychol., vol. 59, pp. 255–278, 2008.

[5] C. Leroux, O. Lebec, M. B. Ghezala, Y. Mezouar, L. Devillers,C. Chastagnol, J.-C. Martin, V. Leynaert, and C. Fattal, “Armen:Assistive robotics to maintain elderly people in natural environment,”IRBM, vol. 34, no. 2, pp. 101–107, 2013.

[6] A. Jain and C. C. Kemp, “EL-E: an assistive mobile manipulator thatautonomously fetches objects from flat surfaces,” Autonomous Robots,vol. 28, no. 1, p. 45, 2010.

[7] S. Srinivasa, D. Ferguson, J. M. Vandeweghe, R. Diankov, D. Berenson,C. Helfrich, and K. Strasdat, “The robotic busboy: Steps towards devel-oping a mobile robotic home assistant,” in Proceedings of InternationalConference on Intelligent Autonomous Systems, July 2008.

[8] M. Beetz, U. Klank, I. Kresse, A. Maldonado, L. Mosenlechner,D. Pangercic, T. Ruhr, and M. Tenorth, “Robotic roommates mak-ing pancakes,” in 2011 11th IEEE-RAS International Conference onHumanoid Robots. IEEE, 2011, pp. 529–536.

[9] J. Englsberger, A. Werner, C. Ott, B. Henze, M. A. Roa, G. Garofalo,R. Burger, A. Beyer, O. Eiberger, K. Schmid et al., “Overview ofthe torque-controlled humanoid robot TORO,” in 2014 IEEE-RASInternational Conference on Humanoid Robots. IEEE, 2014, pp. 916–923.

[10] N. G. Tsagarakis, D. G. Caldwell, F. Negrello, W. Choi, L. Baccelliere,V.-G. Loc, J. Noorden, L. Muratore, A. Margan, A. Cardellino et al.,“Walk-Man: A high-performance humanoid platform for realistic en-vironments,” Journal of Field Robotics, vol. 34, no. 7, pp. 1225–1259,2017.

[11] T. Asfour, M. Wachter, L. Kaul, S. Rader, P. Weiner, S. Ottenhaus,R. Grimm, Y. Zhou, M. Grotz, and F. Paus, “Armar-6: A high-performance humanoid for human-robot collaboration in real worldscenarios,” IEEE Robotics & Automation Magazine, vol. 26, no. 4, pp.108–121, 2019.

[12] M. Fuchs, C. Borst, P. R. Giordano, A. Baumann, E. Kraemer, J. Lang-wald, R. Gruber, N. Seitz, G. Plank, K. Kunze et al., “Rollin’justin-design considerations and realization of a mobile platform for ahumanoid upper body,” in 2009 IEEE International Conference onRobotics and Automation. IEEE, 2009, pp. 4131–4137.

[13] R. Wood, P. Baxter, and T. Belpaeme, “A review of long-term memoryin natural and synthetic systems,” Adaptive Behavior, vol. 20, no. 2,pp. 81–103, 2012.

[14] S. Harnad, “To cognize is to categorize: Cognition is categorization,”in Handbook of categorization in cognitive science. Elsevier, 2017,pp. 21–54.

[15] M. A. Lebedev and S. P. Wise, “Insights into seeing and grasping: dis-tinguishing the neural correlates of perception and action,” Behavioraland cognitive neuroscience reviews, vol. 1, no. 2, pp. 108–129, 2002.

[16] K. Illeris, “A comprehensive understanding of human learning,” inContemporary theories of learning. Routledge, 2018, pp. 1–14.

[17] D. Skocaj, A. Vrecko, M. Mahnic, M. Janıcek, G.-J. M. Kruijff,M. Hanheide, N. Hawes, J. L. Wyatt, T. Keller, K. Zhou et al., “Anintegrated system for interactive continuous learning of categoricalknowledge,” Journal of Experimental & Theoretical Artificial Intel-ligence, vol. 28, no. 5, pp. 823–848, 2016.

[18] T. Faulhammer, R. Ambrus, C. Burbridge, M. Zillich, J. Folkesson,N. Hawes, P. Jensfelt, and M. Vincze, “Autonomous learning of objectmodels on a mobile robot,” IEEE Robotics and Automation Letters,vol. 2, no. 1, pp. 26–33, 2016.

[19] S. S. Srinivasa, D. Ferguson, C. J. Helfrich, D. Berenson, A. Collet,R. Diankov, G. Gallagher, G. Hollinger, J. Kuffner, and M. V. Weghe,“HERB: a home exploring robotic butler,” Autonomous Robots, vol. 28,no. 1, p. 5, 2010.

[20] J. Hertzberg, J. Zhang, L. Zhang, S. Rockel, B. Neumann, J. Lehmann,K. S. Dubba, A. G. Cohn, A. Saffiotti, F. Pecora et al., “The raceproject,” KI-Kunstliche Intelligenz, vol. 28, no. 4, pp. 297–304, 2014.

[21] M. Oliveira, L. Seabra Lopes, G. H. Lim, S. H. Kasaei, A. M. Tome,and A. Chauhan, “3D object perception and perceptual learning in therace project,” Robotics and Autonomous Systems, vol. 75, pp. 614–626,2016.

[22] J. Sock, S. Hamidreza Kasaei, L. Seabra Lopes, and T.-K. Kim, “Multi-view 6D object pose estimation and camera motion planning using RGBD images,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 2228–2235.

[23] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for objectdetection,” in Advances in neural information processing systems, 2013,pp. 2553–2561.

[24] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only lookonce: Unified, real-time object detection,” in Proceedings of the IEEEconference on computer vision and pattern recognition, 2016, pp. 779–788.

[25] Z.-Q. Zhao, P. Zheng, S.-t. Xu, and X. Wu, “Object detection withdeep learning: A review,” IEEE transactions on neural networks andlearning systems, 2019.

[26] S. H. Kasaei, M. Oliveira, G. H. Lim, L. Seabra Lopes, and A. M.Tome, “Towards lifelong assistive robotics: A tight coupling betweenobject perception and manipulation,” Neurocomputing, vol. 291, pp.151–166, 2018.

[27] S. H. Kasaei, A. M. Tome, L. Seabra Lopes, and M. Oliveira, “GOOD:A global orthographic object descriptor for 3D object recognition andmanipulation,” Pattern Recognition Letters, vol. 83, pp. 312–320, 2016.

[28] H. Kasaei, “OrthographicNet: A deep learning approach for 3D objectrecognition in open-ended domains,” arXiv preprint arXiv:1902.03057,2019.

[29] S. H. Kasaei, M. Oliveira, G. H. Lim, L. Seabra Lopes, and A. M.Tome, “Interactive open-ended learning for 3D object recognition: An



approach and experiments,” Journal of Intelligent & Robotic Systems,vol. 80, no. 3-4, pp. 537–553, 2015.

[30] Q. Zhan, Y. Liang, and Y. Xiao, “Color-based segmentation of pointclouds,” Laser scanning, vol. 38, no. 3, pp. 155–161, 2009.

[31] V. Jumb, M. Sohani, and A. Shrivas, “Color image segmentation usingk-means clustering and otsus adaptive thresholding,” InternationalJournal of Innovative Technology and Exploring Engineering (IJITEE),vol. 3, no. 9, pp. 72–76, 2014.

[32] O. P. Verma, M. Hanmandlu, S. Susan, M. Kulkarni, and P. K.Jain, “A simple single seeded region growing algorithm for colorimage segmentation using adaptive thresholding,” in 2011 InternationalConference on Communication Systems and Network Technologies.IEEE, 2011, pp. 500–503.

[33] L. Chang, J. R. Smith, and D. Fox, “Interactive singulation of objectsfrom a pile,” in 2012 IEEE International Conference on Robotics andAutomation. IEEE, 2012, pp. 3875–3882.

[34] H. Van Hoof, O. Kroemer, and J. Peters, “Probabilistic segmentationand targeted exploration of objects in cluttered environments,” IEEETransactions on Robotics, vol. 30, no. 5, pp. 1198–1209, 2014.

[35] M. Gupta and G. S. Sukhatme, “Using manipulation primitives forbrick sorting in clutter,” in 2012 IEEE International Conference onRobotics and Automation. IEEE, 2012, pp. 3883–3889.

[36] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networksfor semantic segmentation,” in Proceedings of the IEEE conference oncomputer vision and pattern recognition, 2015, pp. 3431–3440.

[37] C. Choy, J. Gwak, and S. Savarese, “4D spatio-temporal convnets:Minkowski convolutional neural networks,” in Proceedings of the IEEEConference on Computer Vision and Pattern Recognition, 2019, pp.3075–3084.

[38] B. W. Kim, Y. Park, and I. H. Suh, “Integration of top-downand bottom-up visual processing using a recurrent convolutional–deconvolutional neural network for semantic segmentation,” IntelligentService Robotics, pp. 1–11, 2019.

[39] G. L. Oliveira, C. Bollen, W. Burgard, and T. Brox, “Efficient androbust deep networks for semantic segmentation,” The InternationalJournal of Robotics Research, vol. 37, no. 4-5, pp. 472–491, 2018.

[40] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille,“DeepLab: Semantic image segmentation with deep convolutional nets,atrous convolution, and fully connected CRFs,” IEEE transactions onpattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848,2017.

[41] K. Mo, S. Zhu, A. X. Chang, L. Yi, S. Tripathi, L. J. Guibas,and H. Su, “PartNet: A large-scale benchmark for fine-grained andhierarchical part-level 3D object understanding,” in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition, 2019,pp. 909–918.

[42] X. Liang, L. Lin, Y. Wei, X. Shen, J. Yang, and S. Yan, “Proposal-freenetwork for instance-level object segmentation,” IEEE transactions onpattern analysis and machine intelligence, vol. 40, no. 12, pp. 2978–2991, 2017.

[43] Y. Xu, T. Fan, M. Xu, L. Zeng, and Y. Qiao, “SpiderCNN: Deeplearning on point sets with parameterized convolutional filters,” inProceedings of the European Conference on Computer Vision (ECCV),2018, pp. 87–102.

[44] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, “PointCNN: Con-volution on x-transformed points,” in Advances in neural informationprocessing systems, 2018, pp. 820–830.

[45] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: deep learningon point sets for 3D classification and segmentation,” in Proceedingsof the IEEE conference on computer vision and pattern recognition,2017, pp. 652–660.

[46] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchicalfeature learning on point sets in a metric space,” in Advances in neuralinformation processing systems, 2017, pp. 5099–5108.

[47] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins,A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinskaet al., “Overcoming catastrophic forgetting in neural networks,” Pro-ceedings of the national academy of sciences, vol. 114, no. 13, pp.3521–3526, 2017.

[48] R. Kemker, M. McClure, A. Abitino, T. L. Hayes, and C. Kanan,“Measuring catastrophic forgetting in neural networks,” in Thirty-second AAAI conference on artificial intelligence, 2018.

[49] S. S. Sajjan, M. Moore, M. Pan, G. Nagaraja, J. Lee, A. Zeng, andS. Song, “Cleargrasp:3D shape estimation of transparent objects formanipulation,” in 2020 IEEE International Conference on Robotics andAutomation (ICRA), 2020.

[50] B. Hariharan and R. Girshick, “Low-shot visual recognition by shrink-ing and hallucinating features,” in Proceedings of the IEEE Interna-tional Conference on Computer Vision, 2017, pp. 3018–3027.

[51] M. Ullrich, H. Ali, M. Durner, Z.-C. Marton, and R. Triebel, “SelectingCNN features for online learning of 3D objects,” in 2017 IEEE/RSJInternational Conference on Intelligent Robots and Systems (IROS).IEEE, 2017, pp. 5086–5091.

[52] S. Gidaris and N. Komodakis, “Dynamic few-shot visual learning with-out forgetting,” in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition, 2018, pp. 4367–4375.

[53] B. Oreshkin, P. R. Lopez, and A. Lacoste, “TADAM: Task dependentadaptive metric for improved few-shot learning,” in Advances in NeuralInformation Processing Systems, 2018, pp. 721–731.

[54] B. Krawczyk and M. Wozniak, “One-class classifiers with incrementallearning and forgetting for data streams with concept drift,” SoftComputing, vol. 19, no. 12, pp. 3387–3400, 2015.

[55] M. Ristin, M. Guillaumin, J. Gall, and L. Van Gool, “Incremental learn-ing of ncm forests for large-scale image classification,” in Proceedingsof the IEEE conference on computer vision and pattern recognition,2014, pp. 3654–3661.

[56] M. Tschannen, O. Bachem, and M. Lucic, “Recent advancesin autoencoder-based representation learning,” arXiv preprintarXiv:1812.05069, 2018.

[57] Y. Zhao, T. Birdal, H. Deng, and F. Tombari, “3D point capsulenetworks,” in Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition, 2019, pp. 1009–1018.

[58] M. Oliveira, L. Seabra Lopes, G. H. Lim, S. H. Kasaei, A. D. Sappa,and A. M. Tome, “Concurrent learning of visual codebooks and objectcategories in open-ended domains,” in 2015 IEEE/RSJ InternationalConference on Intelligent Robots and Systems (IROS). IEEE, 2015,pp. 2488–2495.

[59] S. H. Kasaei, M. Oliveira, G. H. Lim, L. Seabra Lopes, and A. M.Tome, “An adaptive object perception system based on environmentexploration and bayesian learning,” in 2015 IEEE International Con-ference on Autonomous Robot Systems and Competitions. IEEE, 2015,pp. 221–226.

[60] S. H. M. Kasaei, L. Seabra Lopes, and A. M. Tome, “Local lda:Open-ended learning of latent topics for 3D object recognition,” IEEEtransactions on pattern analysis and machine intelligence (PAMI),2019.

[61] S. H. Kasaei, A. M. Tome, and L. Seabra Lopes, “Hierarchical objectrepresentation for open-ended object category learning and recogni-tion,” in Advances in Neural Information Processing Systems, 2016,pp. 1948–1956.

[62] S. H. Kasaei, L. Seabra Lopes, and A. M. Tome, “Concurrent 3Dobject category learning and recognition based on topic modelling andhuman feedback,” in 2016 International Conference on AutonomousRobot Systems and Competitions (ICARSC). IEEE, 2016, pp. 329–334.

[63] R. B. Rusu, G. Bradski, R. Thibaux, and J. Hsu, “Fast 3D recognitionand pose using the viewpoint feature histogram,” in 2010 IEEE/RSJInternational Conference on Intelligent Robots and Systems. IEEE,2010, pp. 2155–2162.

[64] R. B. Rusu, Z. C. Marton, N. Blodow, and M. Beetz, “Learninginformative point classes for the acquisition of object model maps,” in2008 10th International Conference on Control, Automation, Roboticsand Vision. IEEE, 2008, pp. 643–650.

[65] R. B. Rusu, N. Blodow, and M. Beetz, “Fast point feature histograms(FPFH) for 3D registration,” in 2009 IEEE International Conferenceon Robotics and Automation. IEEE, 2009, pp. 3212–3217.

[66] W. Wohlkinger and M. Vincze, “Ensemble of shape functions for3D object classification,” in 2011 IEEE international conference onrobotics and biomimetics. IEEE, 2011, pp. 2987–2992.

[67] S.-H. Cha, “Comprehensive survey on distance/similarity measuresbetween probability density functions,” City, vol. 1, no. 2, p. 1, 2007.

[68] S. H. Kasaei, M. Ghorbani, J. Schilperoort, and W. van der Rest,“Investigating the importance of shape features color constancy colorspaces and similarity measures in open-ended 3D object recognition,”arXiv preprint arXiv:2002.03779, 2020.

[69] S. H. Kasaei, “Look further to recognize better: Learning sharedtopics and category-specific dictionaries for open-ended 3D objectrecognition,” arXiv preprint arXiv:1907.12924, 2019.

[70] S. Gao, I. W.-H. Tsang, and Y. Ma, “Learning category-specificdictionary and shared dictionary for fine-grained image categorization,”IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 623–634,2013.



[71] Y. Zhang, X.-S. Wei, J. Wu, J. Cai, J. Lu, V.-A. Nguyen, and M. N. Do,“Weakly supervised fine-grained categorization with part-based imagerepresentation,” IEEE Transactions on Image Processing, vol. 25, no. 4,pp. 1713–1725, 2016.

[72] T. Luddecke, T. Kulvicius, and F. Worgotter, “Context-based affor-dance segmentation from 2D images for robot actions,” Robotics andAutonomous Systems, 2019.

[73] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,P. Dollar, and C. L. Zitnick, “Microsoft COCO: Common objects incontext,” in European conference on computer vision. Springer, 2014,pp. 740–755.

[74] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. J.Belongie, “Objects in context.” in ICCV, vol. 1, no. 2. Citeseer, 2007,p. 5.

[75] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler,R. Urtasun, and A. Yuille, “The role of context for object detectionand semantic segmentation in the wild,” in Proceedings of the IEEEConference on Computer Vision and Pattern Recognition, 2014, pp.891–898.

[76] M. Mauro, H. Riemenschneider, A. Signoroni, R. Leonardi, andL. Van Gool, “A unified framework for content-aware view selectionand planning through view importance,” Proceedings BMVC 2014, pp.1–11, 2014.

[77] S. H. Kasaei, J. Sock, L. Seabra Lopes, A. M. Tome, and T.-K. Kim, “Perceiving, learning, and recognizing 3D objects: An approach to cognitive service robots,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[78] B. Li, Y. Lu, and H. Johan, “Sketch-based 3D model retrieval by view-point entropy-based adaptive view clustering,” in Proceedings of theSixth Eurographics Workshop on 3D Object Retrieval. EurographicsAssociation, 2013, pp. 49–56.

[79] A. Doumanoglou, R. Kouskouridas, S. Malassiotis, and T.-K. Kim,“Recovering 6D object pose and predicting next-best-view in thecrowd,” in Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition, 2016, pp. 3583–3592.

[80] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao,“3D shapenets: A deep representation for volumetric shapes,” inProceedings of the IEEE conference on computer vision and patternrecognition, 2015, pp. 1912–1920.

[81] B. E. Stein and M. A. Meredith, The merging of the senses. The MITPress, 1993.

[82] M. O. Ernst and H. H. Bulthoff, “Merging the senses into a robustpercept,” Trends in cognitive sciences, vol. 8, no. 4, pp. 162–169, 2004.

[83] M. A. Eckert, N. V. Kamdar, C. E. Chang, C. F. Beckmann, M. D.Greicius, and V. Menon, “A cross-modal system linking primaryauditory and visual cortices: Evidence from intrinsic fMRI connectivityanalysis,” Human brain mapping, vol. 29, no. 7, pp. 848–857, 2008.

[84] S. Luo, J. Bimbo, R. Dahiya, and H. Liu, “Robotic tactile perception ofobject properties: A review,” Mechatronics, vol. 48, pp. 54–67, 2017.

[85] C. Kertesz and M. Turunen, “Common sounds in bedrooms (csibe)corpora for sound event recognition of domestic robots,” IntelligentService Robotics, vol. 11, no. 4, pp. 335–346, 2018.

[86] R. Gao and K. Grauman, “2.5D visual sound,” in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition, 2019,pp. 324–333.

[87] ——, “Co-separating sounds of visual objects,” in Proceedings of theIEEE International Conference on Computer Vision, 2019, pp. 3879–3888.

[88] R. Gao, R. Feris, and K. Grauman, “Learning to separate objectsounds by watching unlabeled video,” in Proceedings of the EuropeanConference on Computer Vision (ECCV), 2018, pp. 35–53.

[89] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T.Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: Aspeaker-independent audio-visual model for speech separation,” arXivpreprint arXiv:1804.03619, 2018.

[90] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, andA. Torralba, “The sound of pixels,” in Proceedings of the EuropeanConference on Computer Vision (ECCV), 2018, pp. 570–586.

[91] H. Zhao, C. Gan, W.-C. Ma, and A. Torralba, “The sound of motions,”in Proceedings of the IEEE International Conference on ComputerVision, 2019, pp. 1735–1744.

[92] F. Ficuciello, “Hand-arm autonomous grasping: Synergistic motions toenhance the learning process,” Intelligent Service Robotics, vol. 12,no. 1, pp. 17–25, 2019.

[93] A. T. Miller and P. K. Allen, “GraspIt! a versatile simulator for roboticgrasping,” IEEE Robotics Automation Magazine, vol. 11, no. 4, pp.110–122, 2004.

[94] M. Kopicki, R. Detry, M. Adjigble, R. Stolkin, A. Leonardis, and J. L.Wyatt, “One-shot learning and generation of dexterous grasps for novelobjects,” The International Journal of Robotics Research, vol. 35, no. 8,pp. 959–976, 2016.

[95] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” arXiv preprint arXiv:1703.09312, 2017.

[96] J. Bohg, A. Morales, T. Asfour, and D. Kragic, “Data-driven graspsynthesisa survey,” IEEE Transactions on Robotics, vol. 30, no. 2, pp.289–309, 2013.

[97] A. Sahbani, S. El-Khoury, and P. Bidaud, “An overview of 3Dobject grasp synthesis algorithms,” Robotics and Autonomous Systems,vol. 60, no. 3, pp. 326–336, 2012.

[98] E. Johns, S. Leutenegger, and A. J. Davison, “Deep learning agrasp function for grasping under gripper pose uncertainty,” in 2016IEEE/RSJ International Conference on Intelligent Robots and Systems(IROS). IEEE, 2016, pp. 4461–4468.

[99] I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting roboticgrasps,” The International Journal of Robotics Research, vol. 34, no.4-5, pp. 705–724, 2015.

[100] J. Mahler, F. T. Pokorny, B. Hou, M. Roderick, M. Laskey, M. Aubry,K. Kohlhoff, T. Kroger, J. Kuffner, and K. Goldberg, “Dex-Net 1.0:A cloud-based network of 3D objects for robust grasp planning usinga multi-armed bandit model with correlated rewards,” in 2016 IEEEinternational conference on robotics and automation (ICRA). IEEE,2016, pp. 1957–1964.

[101] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classificationwith deep convolutional neural networks,” in Advances in neuralinformation processing systems, 2012, pp. 1097–1105.

[102] J. Mahler, M. Matl, X. Liu, A. Li, D. Gealy, and K. Goldberg, “Dex-Net 3.0: Computing robust robot vacuum suction grasp targets in point clouds using a new analytic model and deep learning,” arXiv preprint arXiv:1709.06670, 2017.

[103] J. Mahler, M. Matl, V. Satish, M. Danielczuk, B. DeRose, S. McKinley, and K. Goldberg, “Learning ambidextrous robot grasping policies,” Science Robotics, vol. 4, no. 26, p. eaau4984, 2019.

[104] D. Morrison, P. Corke, and J. Leitner, “Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach,” in Proc. of Robotics: Science and Systems (RSS), 2018.

[105] ——, “Learning robust, real-time, reactive robotic grasping,” The International Journal of Robotics Research, p. 0278364919859066, 2019.

[106] A. Zeng, S. Song, K.-T. Yu, E. Donlon, F. R. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo et al., “Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1–8.

[107] A. Myers, C. L. Teo, C. Fermuller, and Y. Aloimonos, “Affordancedetection of tool parts from geometric features,” in 2015 IEEE Inter-national Conference on Robotics and Automation (ICRA). IEEE, 2015,pp. 1374–1381.

[108] T.-T. Do, A. Nguyen, and I. Reid, “AffordanceNet: An end-to-enddeep learning approach for object affordance detection,” in 2018 IEEEinternational conference on robotics and automation (ICRA). IEEE,2018, pp. 1–5.

[109] N. Elango and A. Faudzi, “A review article: investigations on softmaterials for soft robot manipulations,” The International Journal ofAdvanced Manufacturing Technology, vol. 80, no. 5-8, pp. 1027–1037,2015.

[110] L. Moriello, L. Biagiotti, C. Melchiorri, and A. Paoli, “Manipulatingliquids with robots: A sloshing-free solution,” Control EngineeringPractice, vol. 78, pp. 129–141, 2018.

[111] L. Antanas, P. Moreno, M. Neumann, R. P. de Figueiredo, K. Kersting,J. Santos-Victor, and L. De Raedt, “Semantic and geometric reasoningfor robotic grasping: a probabilistic logic approach,” AutonomousRobots, vol. 43, no. 6, pp. 1393–1418, 2019.

[112] J. Sun, J. L. Moore, A. Bobick, and J. M. Rehg, “Learning visual objectcategories for robot affordance prediction,” The International Journalof Robotics Research, vol. 29, no. 2-3, pp. 174–197, 2010.

[113] T. Hermans, J. M. Rehg, and A. Bobick, “Affordance prediction vialearned object attributes,” in ICRA: Workshop on Semantic Perception,Mapping, and Exploration, vol. 1, no. 2. Citeseer, 2011.

[114] D. Morrison, P. Corke, and J. Leitner, “Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach,” in International Conference on Robotics: Science and Systems (RSS), 2018.



[115] A. Zeng, S. Song, K.-T. Yu, E. Donlon, F. R. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo et al., “Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching,” The International Journal of Robotics Research, 2019.

[116] Y. Qin, R. Chen, H. Zhu, M. Song, J. Xu, and H. Su, “S4G: Amodalsingle-view single-shot se (3) grasp detection in cluttered scenes,” arXivpreprint arXiv:1910.14218, 2019.

[117] M. S. Kopicki, D. Belter, and J. L. Wyatt, “Learning better generativemodels for dexterous, single-view grasping of novel objects,” TheInternational Journal of Robotics Research, vol. 38, no. 10-11, pp.1246–1267, 2019.

[118] A. Murali, A. Mousavian, C. Eppner, C. Paxton, and D. Fox, “6-DOF grasping for target-driven object manipulation in clutter,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 1–8.

[119] “Dexterous hand.” [Online]. Available: https://www.shadowrobot.com/products/dexterous-hand/

[120] G. Metta, L. Natale, F. Nori, G. Sandini, D. Vernon, L. Fadiga, C. Von Hofsten, K. Rosander, M. Lopes, J. Santos-Victor et al., “The iCub humanoid robot: An open-systems platform for research in cognitive development,” Neural Networks, vol. 23, no. 8-9, pp. 1125–1134, 2010.

[121] “iCub Tech.” [Online]. Available: https://www.iit.it/research/facilities/icub-tech

[122] R. Deimel and O. Brock, “A novel type of compliant and underactuated robotic hand for dexterous grasping,” The International Journal of Robotics Research, vol. 35, no. 1-3, pp. 161–185, 2016.

[123] C. Piazza, G. Grioli, M. Catalano, and A. Bicchi, “A century of robotic hands,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 2, pp. 1–32, 2019.

[124] S. H. Kasaei, N. Shafii, L. Seabra Lopes, and A. M. Tome, “Interactive open-ended object, affordance and grasp learning for robotic manipulation,” in 2019 IEEE International Conference on Robotics and Automation (ICRA), 2019.

[125] A. Herzog, P. Pastor, M. Kalakrishnan, L. Righetti, J. Bohg, T. Asfour, and S. Schaal, “Learning of grasp selection based on shape-templates,” Autonomous Robots, vol. 36, no. 1-2, pp. 51–65, 2014.

[126] N. Shafii, S. H. Kasaei, and L. Seabra Lopes, “Learning to grasp familiar objects using object view recognition and template matching,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2016, pp. 2895–2900.

[127] A. E. Johnson and M. Hebert, “Using spin images for efficient object recognition in cluttered 3D scenes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 5, pp. 433–449, 1999.

[128] S.-J. Huang, W.-H. Chang, and J.-Y. Su, “Intelligent robotic gripper with adaptive grasping force,” International Journal of Control, Automation and Systems, vol. 15, no. 5, pp. 2272–2282, 2017.

[129] B. Sundaralingam and T. Hermans, “Relaxed-rigidity constraints: kinematic trajectory optimization and collision avoidance for in-grasp manipulation,” Autonomous Robots, vol. 43, no. 2, pp. 469–483, 2019.

[130] S. S. Mirrazavi Salehian, N. Figueroa, and A. Billard, “A unified framework for coordinated multi-arm motion planning,” The International Journal of Robotics Research, vol. 37, no. 10, pp. 1205–1232, 2018.

[131] T. Fromm, “Self-supervised damage-avoiding manipulation strategy optimization via mental simulation,” arXiv preprint arXiv:1712.07452, 2017.

[132] G. Kang, Y. B. Kim, Y. H. Lee, H. S. Oh, W. S. You, and H. R. Choi, “Sampling-based motion planning of manipulator with goal-oriented sampling,” Intelligent Service Robotics, pp. 1–9, 2019.

[133] J. Pan, L. Zhang, and D. Manocha, “Collision-free and smooth trajectory computation in cluttered environments,” The International Journal of Robotics Research, vol. 31, no. 10, pp. 1155–1175, 2012.

[134] D. Rakita, B. Mutlu, and M. Gleicher, “RelaxedIK: Real-time synthesis of accurate and feasible robot arm motion,” in Robotics: Science and Systems, 2018.

[135] A. H. Qureshi, A. Simeonov, M. J. Bency, and M. C. Yip, “Motion planning networks,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 2118–2124.

[136] A. H. Qureshi, Y. Miao, A. Simeonov, and M. C. Yip, “Motion planning networks: Bridging the gap between learning-based and classical motion planners,” arXiv preprint arXiv:1907.06013, 2019.

[137] S. Luo, H. Kasaei, and L. Schomaker, “Accelerating reinforcement learning for reaching using continuous curriculum learning,” arXiv preprint arXiv:2002.02697, 2020.

[138] M. J. Aein, E. E. Aksoy, and F. Worgotter, “Library of actions: Implementing a generic robot execution framework by using manipulation action semantics,” The International Journal of Robotics Research, p. 0278364919850295, 2018.

[139] N. Jetchev and M. Toussaint, “Discovering relevant task spaces using inverse feedback control,” Autonomous Robots, vol. 37, no. 2, pp. 169–189, Aug 2014. [Online]. Available: https://doi.org/10.1007/s10514-014-9384-1

[140] D. Chaves, J. Ruiz-Sarmiento, N. Petkov, and J. Gonzalez-Jimenez, “Integration of CNN into a robotic architecture to build semantic maps of indoor environments,” in International Work-Conference on Artificial Neural Networks. Springer, 2019, pp. 313–324.

[141] S. Karaman and E. Frazzoli, “Sampling-based algorithms for optimal motion planning,” The International Journal of Robotics Research, vol. 30, no. 7, pp. 846–894, 2011. [Online]. Available: https://doi.org/10.1177/0278364911406761

[142] A. Cherubini and F. Chaumette, “Visual navigation of a mobile robot with laser-based collision avoidance,” The International Journal of Robotics Research, vol. 32, no. 2, pp. 189–205, 2013.

[143] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, “OpenPose: Realtime multi-person 2D pose estimation using part affinity fields,” arXiv preprint arXiv:1812.08008, 2018.

[144] Y.-h. Liang and C. Cai, “Intelligent collision avoidance based on two-dimensional risk model,” Journal of Algorithms & Computational Technology, vol. 10, no. 3, pp. 131–141, 2016.

[145] N. H. Singh and K. Thongam, “Neural network-based approaches for mobile robot navigation in static and moving obstacles environments,” Intelligent Service Robotics, vol. 12, no. 1, pp. 55–67, 2019.

[146] L. Zeng and G. M. Bone, “Mobile robot collision avoidance in human environments,” International Journal of Advanced Robotic Systems, vol. 10, no. 1, p. 41, 2013.

[147] D. Helbing and P. Molnar, “Social force model for pedestrian dynamics,” Phys. Rev. E, vol. 51, pp. 4282–4286, May 1995. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevE.51.4282

[148] X.-T. Truong, V. N. Yoong, and T.-D. Ngo, “Socially aware robot navigation system in human interactive environments,” Intelligent Service Robotics, vol. 10, no. 4, pp. 287–295, 2017.

[149] D. Fridovich-Keil, A. Bajcsy, J. F. Fisac, S. L. Herbert, S. Wang, A. D. Dragan, and C. J. Tomlin, “Confidence-aware motion prediction for real-time collision avoidance,” The International Journal of Robotics Research, p. 0278364919859436, 2019.

[150] H. Yervilla-Herrera, J. I. Vasquez-Gomez, R. Murrieta-Cid, I. Becerra, and L. E. Sucar, “Optimal motion planning and stopping test for 3-D object reconstruction,” Intelligent Service Robotics, vol. 12, no. 1, pp. 103–123, 2019.

[151] A. Eitel, N. Hauff, and W. Burgard, “Learning to singulate objects using a push proposal network,” in Robotics Research. Springer, 2020, pp. 405–419.
