
Anticipating Noxious States from Visual Cues using Deep Predictive Models

Indranil Sur¹ and Heni Ben Amor¹

Abstract— To ensure system integrity, robots need to proactively avoid any unwanted physical perturbation that may cause damage to the underlying hardware. In this paper, we investigate a machine learning approach that allows robots to anticipate impending physical perturbations from perceptual cues. In contrast to other approaches that require knowledge about sources of perturbation to be encoded before deployment, our method is based on experiential learning. Robots learn to associate visual cues with subsequent physical perturbations and contacts. In turn, these extracted visual cues are then used to predict potential future perturbations acting on the robot. To this end, we introduce a novel deep network architecture which combines multiple sub-networks for dealing with robot dynamics and perceptual input from the environment. We present a self-supervised approach for training the system that does not require any labeling of training data. Inference is performed using approximate Bayesian inference and moment matching. Extensive experiments in a human-robot interaction task show that a robot can learn to predict physical contact by a human interaction partner without any prior information or labeling. Furthermore, the network is able to successfully predict physical contact from either depth stream input or traditional video input.

I. INTRODUCTION

According to Isaac Asimov's third law of robotics [1], "a robot must protect its own existence as long as such protection does not conflict with the First or Second Law", i.e., as long as it does not harm a human. Aspects of safety and self-preservation are tightly coupled to the autonomy and longevity of robotic systems. For robots to explore their environment and engage in physical contact with objects and humans, they need to ensure that any such interaction does not lead to wear, damage, or irreparable harm to the underlying hardware. Situations that jeopardize the integrity of the system need to be detected and actively avoided. This determination can be performed in either a reactive way, e.g., by using sensors to identify forces acting on the robot, or in a proactive way, e.g., by detecting impending collisions. In recent years, a plethora of safety methods have been proposed that are based on a reactive strategy. Approaches for compliant control and, in particular, impedance control techniques have been shown to enable safe human-robot interaction [2] in a variety of close-contact tasks. Such methods are typically referred to as post-contact approaches, since they react to forces exchanged between the system and its environment after they first occur.

In many application domains, however, robots need to proactively reason about impending damage before it occurs.

¹ Indranil Sur and Heni Ben Amor are with the School of Computing, Informatics, Decision Systems Engineering, Arizona State University, 699 S Mill Ave, Tempe, AZ 85281. {isur2,hbenamor}@asu.edu

Fig. 1: Baxter robot anticipates physical contact by a human.

Methods that tackle these scenarios have mostly focused on proximity detection and collision avoidance. Motion tracking or other sensing devices are used to identify nearby humans and obstacles in order to determine whether they intersect the robot's path. To this end, a human expert has to reason about the expected obstacles before deployment and, in turn, hand-code methods for object recognition, human tracking, or collision detection. Such methods are largely based on the status quo of the environment and do not take into account expected future states. For example, the behavior of a human interaction partner may already provide cues about whether or not physical contact is to be expected. In addition, such methods suffer from limited adaptivity: the robotic system is not able to incorporate new sources of physical perturbations that have not been considered by the human expert. Changes in the application domain typically require the expert to specify the set of obstacles/perturbants and how they can be identified from sensor data. However, for robots to autonomously explore new goals, tasks, and environments, they cannot constantly rely on human intervention and involvement. In order to increase the resilience of robotic systems, the processes responsible for ensuring safety should inherently be (a) adaptive to changes that occur during the cycle of operation and (b) anticipative in nature, so as to proactively avoid damage.

In biology, such processes are commonplace: humans and animals experience pain as the guiding signal which ensures self-preservation and safe, "open ended" exploration of the world.


Fig. 2: Architecture of the Deep Predictive Model for perturbations. (Diagram: the visual input of size 64x64x1 passes through a stack of 5x5 convLSTM layers, several with stride 2, producing feature maps of 64x64x32, 32x32x32, 16x16x64, 8x8x64, and 4x4x128, followed by flattening and batch normalization; the Perceptual Anticipation Model fuses these visual features with LSTM encodings of the robot state (dim. 28) and action (dim. 7); the Deep Dynamics Model maps state and action through dense layers back to the state dimension, providing the self-supervised training signal.)

Over time, biological creatures learn to anticipate impending pain sensations from environmental and proprioceptive cues, e.g., from the velocity and direction of an obstacle, or an imbalance in posture. The relationship between pain and learning is bi-directional in humans and vertebrate animals. Our perception of pain shapes the way we learn, but the perception itself is also shaped by the experiences we undergo during learning. In particular, repeated exposure to a sensory cue (e.g., a specific image or object) preceding a negative outcome (e.g., an electric shock) will turn the cue into a conditioned stimulus that helps anticipate the shock in future trials. This ability to associate certain stimuli with negative future sensations is considered essential to survival. Inspired by the relationship between pain and learning in biological creatures, we propose in this paper a new method for learning to anticipate physical perturbations in robotic systems. Robot safety is realized through an adaptive process in which prior experiences are used to predict impending harm to the system. These predictions are based on (1) environmental cues, e.g., camera input, (2) proprioceptive information, and (3) the intended actions of the robot. We introduce a deep predictive model (DPM) that leverages all three sources of information in order to anticipate contact and physical perturbations in a window of future time steps. An important characteristic of this deep predictive model is that it differentiates between forces and perturbations that are the result of an executed robot behavior, e.g., fast arm movements, and external perturbations that are the result of outside influence. After learning, the DPM can be used to anticipate future states that result in forces being applied to the robot. In turn, this information can be used to provide visual feedback to the user, to actively avoid the predicted states, or to change the impedance of the system.

II. RELATED WORK

Safety plays a critical role in the field of robotics. Recently, various methods have been put forward in order to protect a human interaction partner from harm. The work in [3], for example, uses proprioceptive sensing to identify collisions and, in turn, execute a reactive control strategy that enables the robot to retract from the location of impact. In a similar vein, the work in [4] describes methods for rapid collision detection and response through trajectory scaling. Many approaches to safe human-robot interaction are based on rapid sensing through force-torque sensors or tactile sensors [5]. Another approach is to generate estimates of external forces acting on a robot using disturbance observers [6]. These approaches, however, require a model of the underlying plant, which in the case of complex humanoid robots can become challenging to derive. In addition, nonlinearities underlying the current robot or task can often lead to instabilities in the system [7]. To circumvent such challenges, several approaches have been proposed for learning perturbation filters using data-driven machine learning methods [8], [9], [10]. An alternative, bio-inspired approach was proposed in [11]. In particular, an "Artificial Robot Nervous System" was introduced which emulates the human skin structure, as well as the spikes generated whenever an impact occurs.

All of the above techniques are reactive in nature. Only after contact with its environment can a robot detect a physical impact or perturbation and react to it. However, many critical situations require a proactive avoidance strategy. To this end, various methods for human motion anticipation have been proposed [12], [13]. Given a partially observed human motion, a robot can anticipate the goal location and intermediate path and accordingly generate a collision-free navigation strategy. However, such approaches are brittle in that they require some form of human tracking. If the source of the perturbation is non-human, e.g., a moving object, then no anticipation can occur.


In this paper, we are interested in dynamic approaches to the anticipation of perturbations. Robots learn to predict contacts or impacts by associating them with visual cues. This is similar in spirit to [14], where dashcam videos were used to predict car collisions. However, an important difference is that our predictions are based on visual cues, proprioceptive sensors, and the robot's next actions.

III. DEEP PREDICTIVE MODELS OF PHYSICAL PERTURBATIONS

Our goal is to enable an intelligent system to anticipate undesired external perturbations before they occur. Following the biological inspiration, repeated exposure to a sensory cue preceding a physical perturbation will turn the cue into a predictive variable that helps anticipate the perturbation. To this end, we propose the deep predictive model seen in Fig. 2. A key insight of our approach is that a robot can learn to correlate perceptual data, e.g., camera images, to its own future internal states. By establishing a mapping between perceptual information and internal states, the robot can anticipate harm before it occurs.

The DPM is based on a modular architecture, in which two modules jointly establish the mapping from perceptual data to future anticipated noxious signals. The first module of the DPM is a deep dynamics module that learns to discriminate between external perturbations caused by outside events and natural variations in sensor readings due to the currently executed behavior and sensor noise. The deep dynamics module is trained to generate a signal whenever the recorded sensor values cannot be explained by the actions of the robot.

The second module of the DPM is the perceptual anticipation module: a deep network that takes visual percepts, proprioceptive states, and the actions taken as inputs and generates predictions over future noxious signals. It learns to associate specific visual and proprioceptive cues with harmful states. The perceptual anticipation module contains convolutional recurrent layers [15], [16] that process the visual input in both space and time. Learning is performed using the paradigm of self-supervised learning. More specifically, the training signal produced by the deep dynamics module is used as the target signal. No human intervention is needed in order to either provide new training data or label existing data. This data flow is sketched below.
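The overall training loop can be summarized in a short Python sketch. This is purely illustrative: experience_stream, dynamics_model, anticipation_model, score_perturbation, and train_step are placeholder names of ours, not the authors' implementation.

    # Self-supervised training loop (illustrative; all names are placeholders).
    for images, states, actions, next_states in experience_stream:
        # 1. The dynamics module predicts the expected next state from the
        #    current state and the intended action (the efference copy).
        expected_next, var = dynamics_model.predict(states, actions)
        # 2. The discrepancy between expectation and measurement yields a
        #    perturbation score, which becomes the training target.
        target = score_perturbation(expected_next, var, next_states)
        # 3. The anticipation module learns to predict future perturbations
        #    from vision, states, and actions; no human labeling is needed.
        anticipation_model.train_step((images, states, actions), target)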

Subsequently, we will explain the elements of the introduced network architecture and describe the underlying training and inference process in more detail.

A. Deep Dynamics Model

The task of the deep dynamics model is to identify exogenous perturbations affecting the robot. Detecting such perturbations in the sensor stream can be challenging when the robot is performing dynamic movements that by themselves cause fluctuations in the sensor readings. To discriminate between exogenous and endogenous perturbations, we will use a strategy inspired by the human motor system.

Before sending an action $a_t \in \mathbb{R}^Q$ to the actuators, the robot creates a copy of $a_t$, the so-called efference copy, and predicts the expected sensory stimuli after execution. This prediction is performed by the deep dynamics model, a neural network that maps the current state $s_t$ and intended action $a_t$ onto the expected next state $s^*_{t+1}$.

After executing action $a_t$, we can measure the discrepancy between the expected sensations and the measured sensor values, also called the reafference. The degree of discrepancy is an indicator for external physical perturbations.

Data Collection: Without loss of generality, we define for the remainder of the paper the system state to be $s_t = [\theta_t, \dot{\theta}_t, \ddot{\theta}_t, p_t]^T \in \mathbb{R}^P$, which includes the joint angles $\theta_t$, joint velocities $\dot{\theta}_t$, joint accelerations $\ddot{\theta}_t$, and the end-effector pose $p_t$.

To collect training data for the deep dynamics model, we use motor babbling [17], [18]. To this end, small changes are applied as the control action $a_t \in \mathbb{R}^Q$, where $Q$ is the number of degrees of freedom of the robot, i.e., the number of joints. In a naive implementation of motor babbling, the action would be sampled from an isotropic Gaussian distribution, $a_t \sim \mathcal{N}(0, \sigma^2 I)$. However, we have empirically observed that such naive sampling does not effectively cover the state-action space, since it concentrates samples unnecessarily at the center of the distribution.

To alleviate this problem, we use a bi-modal distribution and incorporate a simple momentum term:

$$\pi \sim \mathcal{B}(0.5) \qquad (1)$$

$$u \sim \pi\,\mathcal{N}(\mu\mathbf{1}, \sigma^2 I) + (1-\pi)\,\mathcal{N}(-\mu\mathbf{1}, \sigma^2 I) \qquad (2)$$

$$a_t = (u + a_{t-1})/2 \qquad (3)$$
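As a concrete illustration, a minimal NumPy sketch of this sampling scheme follows; the function name is ours, and $\mu = 0.1$, $\sigma = 0.05$, and $Q = 7$ are the values reported in Sec. IV.

    import numpy as np

    def sample_babbling_action(a_prev, mu=0.1, sigma=0.05, Q=7, rng=np.random):
        """Bi-modal motor-babbling action with momentum, Eqs. (1)-(3)."""
        pi = rng.binomial(1, 0.5)            # Eq. (1): Bernoulli mode choice
        mean = mu if pi == 1 else -mu        # select the +mu*1 or -mu*1 mode
        u = rng.normal(mean, sigma, size=Q)  # Eq. (2): isotropic Gaussian draw
        return (u + a_prev) / 2.0            # Eq. (3): momentum with a_{t-1}

    # Example: generate a short babbling trajectory for a 7-DOF arm.
    a = np.zeros(7)
    trajectory = []
    for _ in range(100):
        a = sample_babbling_action(a)
        trajectory.append(a)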

Actions sampled using the above strategy effectively cover the state-action space and generate trajectories without causing wear and tear. The result of the motor babbling phase is a training dataset of $N$ triplets $(s_t, a_t, s_{t+1})$, each consisting of the current state, the current action, and the next state. The matrices storing all states and actions are denoted by $S_t, S_{t+1} \in \mathbb{R}^{N \times P}$ and $A_t \in \mathbb{R}^{N \times Q}$.

Model Learning: The deep dynamics model is an artificial neural network that maps a state $s_t$ and action $a_t$ onto the expected next state $s^*_{t+1}$. Training is performed using Dropout [19] on the data collected during motor babbling, with a Euclidean loss function. A sketch of such a network follows.
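A minimal Keras sketch of such a dynamics network is given below. The hidden-layer sizes (50 units) are read off the diagram in Fig. 2; the dropout rate and activation are our own assumptions.

    import tensorflow as tf
    from tensorflow.keras import layers

    P, Q = 28, 7  # state and action dimensions (see Sec. IV)

    # Deep dynamics model: (s_t, a_t) -> s*_{t+1}.
    state_in = layers.Input(shape=(P,))
    action_in = layers.Input(shape=(Q,))
    x = layers.Concatenate()([state_in, action_in])
    x = layers.Dense(50, activation='relu')(x)
    x = layers.Dropout(0.1)(x)  # Dropout also enables Bayesian inference [20]
    x = layers.Dense(50, activation='relu')(x)
    x = layers.Dropout(0.1)(x)
    next_state = layers.Dense(P)(x)  # linear output for state regression
    dynamics_model = tf.keras.Model([state_in, action_in], next_state)
    dynamics_model.compile(optimizer='adam', loss='mse')  # Euclidean loss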

The neural network generates a point estimate for any set of inputs. However, due to the non-determinism and noise underlying such tasks, it is important to reason about uncertainties when making predictions. To this end, we leverage recent theoretical insights in order to generate probabilistic predictions from a trained neural network. In particular, it was shown in [20] that neural network learning with the Dropout method [19] is equivalent to a Bayesian approximation of a Gaussian Process modeling the training data. Following this insight, we generate a set of predictions $\{s_1, \ldots, s_T\}$ from the trained network through $T$ stochastic forward passes.


Fig. 3: Analysis of predictive error functions.

The generated predictions form a possibly complex distribution represented as a population of solutions. We then extract an approximate parametric form of the underlying distribution through moment matching:

$$\mathbb{E}(s^*_{t+1}) \approx \frac{1}{T} \sum_{i=1}^{T} s_i \qquad (4)$$

$$\mathrm{Var}(s^*_{t+1}) \approx \frac{1}{T} \sum_{i=1}^{T} s_i^T s_i \;-\; \mathbb{E}(s^*_{t+1})^T \mathbb{E}(s^*_{t+1}) \qquad (5)$$
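In code, the moment matching of Eqs. (4)-(5) amounts to averaging $T$ stochastic forward passes with dropout left active. A sketch, assuming the Keras dynamics_model above (passing training=True keeps dropout stochastic at inference time):

    import numpy as np

    def predict_with_uncertainty(dynamics_model, s_t, a_t, T=50):
        """Monte-Carlo dropout: T stochastic passes, then moment matching."""
        samples = np.stack([
            dynamics_model([s_t, a_t], training=True).numpy()  # dropout on
            for _ in range(T)])
        mean = samples.mean(axis=0)                    # Eq. (4)
        var = (samples ** 2).mean(axis=0) - mean ** 2  # Eq. (5), element-wise
        return mean, var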

Given the above distribution moments, we can reason about the uncertainty underlying the prediction process. We will see in the next section that this information can be incorporated into any distance measure used for identifying incongruent sensations.

Perturbation as Predictive Error: As described above and depicted in Fig. 2, we generate in every time step a prediction of the next sensory state $s^*_{t+1}$ based on the current state and action. Consequently, after executing the action $a_t$, we measure the true sensor values of the robot and compare them to the prediction, i.e., $\delta = \mathbb{E}(s^*_{t+1}) - s_{t+1}$. Taking the norm of the vector $\delta$, we get an estimate of the discrepancy between robot expectation and reality. Assuming a reasonably accurate model, this discrepancy is an estimate of the exogenous perturbations that affected the system dynamics. To correct for the inherent model uncertainty and the probabilistic nature of the predictions, we take an exponential according to $\Delta = \|\tau\|$, with $\tau_i = \exp\big(\delta_i - 2\,\mathrm{Var}(s^*_{t+1})_i\big)$.

This approach effectively incorporates the confidence bounds of the prediction into the perturbation estimate, and the exponential scaling simplifies the process of discerning nominal behavior from abnormal behavior due to external physical influence.
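A sketch of the resulting perturbation score, under our reading of the formula above (the extracted notation is partially garbled, so treat the exact scaling as an assumption):

    import numpy as np

    def score_perturbation(mean, var, s_measured):
        """Delta = ||tau|| with tau_i = exp(delta_i - 2 Var(s*_{t+1})_i)."""
        delta = mean - s_measured        # discrepancy between expectation
                                         # and the measured reafference
        tau = np.exp(delta - 2.0 * var)  # damp errors inside the confidence
                                         # bound, amplify those outside it
        return np.linalg.norm(tau)       # scalar perturbation estimate Delta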

Fig. 3 depicts the effect of the exponential scaling at the confidence bound. The ground truth, highlighted in gray, was hand-labeled using a video stream as reference. Both predictive error functions produce elevated responses during moments of contact. Yet, by incorporating the exponential scaling performed in $\Delta$, we can reduce spurious activations and false positives.

B. Perceptual Anticipation Model

In this section, we describe how experienced perturbations can be correlated with predictive visual cues. In turn, these visual cues can later be used to anticipate the occurrence of physical contact. For example, perceiving an approaching wall may indicate an impending collision. Still, whether or not an external perturbation will occur also depends on the next actions of the robot, e.g., whether or not the robot will stop its course. Hence, states and actions need to be included in the prediction process.

In our approach, a dedicated model, the Perceptual Anticipation Model, learns to predict impending perturbations from a sequence of visual input data, robot actions, and sensory states. Visual input representing the environment is given by an $R \times C$ grid. Each cell of the grid stores $F$ measurements, i.e., depth or color channels that may vary over time. Thus, the visual input corresponds to a tensor $X \in \mathbb{R}^{R \times C \times F}$.

Training is based on repeated physical interaction with the environment, e.g., a human or static objects. Throughout this process, a stream of visuals is recorded using either a traditional RGB camera or an RGB-D depth camera. The result is a sequence of observations $X_t$. At the same time, the states $s_t$ and actions $a_t$ of the robot are recorded. In addition, at every time step, the previously introduced deep dynamics model is run in order to generate estimates $p_t$ of the perturbations currently acting on the robot.

Given the above data sets, the goal of training the Perceptual Anticipation Model is to approximate the distribution

$$P(p_{t+1}, \ldots, p_{t+k} \mid X_{t-j}, \ldots, X_t,\; s_{t-j}, \ldots, s_t,\; a_{t-j}, \ldots, a_t) \qquad (6)$$

where $j$ defines a window of past time steps. More specifically, we are interested in a predictive model that generates expected future perturbations $p_{t+1}, \ldots, p_{t+k}$ conditioned on past visual inputs $X_{t-j}, \ldots, X_t$, as well as current and previous robot states and actions. A sketch of how recordings are sliced into such training pairs follows.
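In practice, approximating Eq. (6) means slicing each recording into input windows and target windows. A minimal sketch (the function name and array layout are ours; $j = 9$ and $k = 10$ as in Sec. IV):

    import numpy as np

    def make_training_pairs(X, s, a, p, j=9, k=10):
        """Build (inputs, targets) pairs for Eq. (6): j+1 past frames,
        states, and actions map to the next k perturbation scores p,
        where p is the per-step signal from the deep dynamics model."""
        inputs, targets = [], []
        for t in range(j, len(p) - k):
            inputs.append((X[t - j:t + 1],      # X_{t-j}, ..., X_t
                           s[t - j:t + 1],      # s_{t-j}, ..., s_t
                           a[t - j:t + 1]))     # a_{t-j}, ..., a_t
            targets.append(p[t + 1:t + k + 1])  # p_{t+1}, ..., p_{t+k}
        return inputs, np.asarray(targets)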

The presented task requires attention both to spatial patterns within the visual input and to temporal patterns and behavioral dynamics. Hence, special care has to be taken to ensure that the model architecture used for learning can effectively identify and incorporate both sources of information when making a prediction.

Network Architecture: In order to base all predictions on both spatial and temporal information, we employ a deep convolutional recurrent network to learn the anticipation model. The input of the network is a sequence of visual data, robot states, and actions. The output of the network is a vector $\hat{p}_t = [\hat{p}_{t+1}, \ldots, \hat{p}_{t+k}]^T \in \mathbb{R}^k$ that describes the likelihood of a perturbation at each of the time steps $\{t+1, \ldots, t+k\}$. In order to incorporate temporal dynamics, recurrent units are used according to either the Long Short-Term Memory (LSTM) [21] model or the Gated Recurrent Unit (GRU) [22] model. These recurrent units keep track of a hidden state; the next state of the network is calculated as a function of the hidden state and the new inputs. Note that convGRUs have fewer parameters, which reduces both training and execution time. Faster forward passes can be crucial for real-time robotics applications such as the one described here.
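The following Keras sketch mirrors the layer sizes visible in Fig. 2 (5x5 convLSTM layers with stride 2 shrinking the 64x64x1 input down to 4x4x128, plus LSTM encoders for state and action). It approximates the described architecture rather than reproducing the authors' exact code; the fusion width (200) is read off the diagram, the rest is our assumption.

    import tensorflow as tf
    from tensorflow.keras import layers

    j, k, P, Q = 9, 10, 28, 7  # window length, horizon, dims (Sec. IV)

    # Visual branch: stacked 5x5 convLSTM layers (cf. Fig. 2).
    vis_in = layers.Input(shape=(j + 1, 64, 64, 1))
    v = layers.ConvLSTM2D(32, (5, 5), padding='same',
                          return_sequences=True)(vis_in)         # 64x64x32
    v = layers.ConvLSTM2D(32, (5, 5), strides=2, padding='same',
                          return_sequences=True)(v)              # 32x32x32
    v = layers.ConvLSTM2D(64, (5, 5), strides=2, padding='same',
                          return_sequences=True)(v)              # 16x16x64
    v = layers.ConvLSTM2D(64, (5, 5), strides=2, padding='same',
                          return_sequences=True)(v)              # 8x8x64
    v = layers.ConvLSTM2D(128, (5, 5), strides=2, padding='same',
                          return_sequences=False)(v)             # 4x4x128
    v = layers.Flatten()(v)
    v = layers.BatchNormalization()(v)

    # State and action branches: LSTM encoders over the same window.
    state_in = layers.Input(shape=(j + 1, P))
    action_in = layers.Input(shape=(j + 1, Q))
    s_enc = layers.LSTM(50)(state_in)
    a_enc = layers.LSTM(50)(action_in)

    # Fusion and k-step perturbation likelihoods.
    h = layers.Concatenate()([v, s_enc, a_enc])
    h = layers.Dense(200, activation='relu')(h)
    out = layers.Dense(k, activation='sigmoid')(h)
    anticipation_model = tf.keras.Model([vis_in, state_in, action_in], out)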


Loss Function: In order to train all network parameters, we use a weighted binary cross-entropy loss

$$L = -\frac{1}{T} \sum_{t=1}^{T} \Big( w\, p_t \log(\hat{p}_t) + (1 - p_t) \log(1 - \hat{p}_t) \Big) \qquad (7)$$

where $w$ is a weight penalty, $p_t$ is the ground truth, and $\hat{p}_t$ is the likelihood of perturbation generated by the network.
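A direct TensorFlow transcription of Eq. (7), under the assumption that it is the standard weighted cross-entropy (the extracted formula was partially garbled):

    import tensorflow as tf

    def weighted_bce(p_true, p_pred, w=3.0, eps=1e-7):
        """Eq. (7): weighted binary cross-entropy; w = 3 as in Sec. IV.
        Up-weighting positives counters the rarity of contact events."""
        p_pred = tf.clip_by_value(p_pred, eps, 1.0 - eps)  # numerical safety
        loss = -(w * p_true * tf.math.log(p_pred)
                 + (1.0 - p_true) * tf.math.log(1.0 - p_pred))
        return tf.reduce_mean(loss)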

IV. EXPERIMENTAL RESULTS

We conducted a human-robot interaction study to investigate the validity of the introduced approach. In all of the following experiments, the prediction rate was 5 Hz. The system state dimension is $P = 28$ and the number of degrees of freedom is $Q = 7$. The mean and standard deviation for action sampling are $\mu = 0.1$ and $\sigma = 0.05$. The dimensions of the environment data are $R, C = 64$ and $F = 1$ (grayscale frame channel or depth channel). The parameters of the DPM are set to $k = 10$, $w = 3$, and $j = 9$.
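For reference, the reported settings collected in one place (the dictionary keys are our own naming):

    # Experimental settings as reported above (Sec. IV).
    CONFIG = {
        'prediction_rate_hz': 5,
        'state_dim_P': 28,
        'dof_Q': 7,
        'babbling_mu': 0.1,
        'babbling_sigma': 0.05,
        'image_size_RC': (64, 64),
        'channels_F': 1,      # grayscale or depth
        'horizon_k': 10,
        'loss_weight_w': 3,
        'history_j': 9,
    }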

A. Interaction Dataset

For training the anticipation model, we created an interaction dataset involving 10 participants. Each participant interacted with the robot for about ten minutes. This resulted in a dataset of 3000 data points per person. The participants were instructed to move towards the robot and touch its arm at any location. Throughout this process, the arm of the robot was continuously performing a forward-backward movement.

In addition, in 40-50% of the interactions, the participants were instructed to pretend to approach the robot and then turn back without making any contact. Incorporating such feigning moves on the part of the human interaction partner, as well as the constant movement of the arm, ensures that the robot has to continually monitor its state and the environment in order to appropriately update its belief about impending physical contact. Data recorded from 9 of the participants was used for training, while 1 participant was used for validation. Data from a separate set of 3 participants was used as test data to evaluate the generalization capabilities of the model.

Two different versions of the training set were generated. In the first setup, depth images were used as input; in the second setup, we used grayscale video images as input.

B. Example Interaction

Fig. 4: Predictions of the likelihood of physical perturbation generated by a perceptual anticipation model trained on grayscale video input. The predictors fire at different moments ahead of the actual moment of contact.

Fig. 4 shows an example interaction between the robot and a human subject applying forces to its arm. After training the perceptual anticipation model, we provide current observations in the form of images (or depth images) as input. In turn, the network generates estimates for the next $k$ time steps which reflect the likelihood of a future perturbation at $t + k$. The moment of contact is depicted in Fig. 4 by a vertical line. We can see that the predictions for $\hat{p}_{t+1}$, $\hat{p}_{t+5}$, and $\hat{p}_{t+10}$ are aligned along a diagonal. The predictor with the largest horizon, i.e., looking ten time steps into the future, activates early to indicate a high likelihood of a contact. The predictors with shorter horizons activate at later time steps, depending on the horizon they have been trained on. Note the attenuation of the activations once the robot is outside the envelope for which the predictors have been trained to fire.

C. Visualizing Network Saliency

Fig. 5: Saliency map visualization of the $\hat{p}_{t+5}$ predictor, which anticipates interactions 1 s into the future. The images show, highlighted in red, the pixels the network is paying attention to when generating an output. In the beginning, almost no region is active; as the human starts approaching the robot, the network focuses on the human and also on the robot arm. Once the subject is in contact with the robot, the activations attenuate.

To better understand the decision process by which the perceptual anticipation model generates predictions, we visualized the underlying saliency using the method introduced in [23]. Fig. 5 shows an example of saliency maps at different moments in time during a human-robot interaction. Regions highlighted in red or yellow correspond to pixels with a strong influence on the output of the network. Initially, the network is not focusing on any particular region within the image. However, as the human subject starts walking in the direction of the robot, the network starts focusing on pixels around the body of the human subject, as well as the right arm of the robot. As the human comes closer, the saliency scores for each pixel become larger, as indicated by the yellow coloring. At the onset of a contact, the saliency scores attenuate.
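A sketch of this visualization, assuming the gradient-based saliency of [23] and the Keras anticipation_model above; the choice of the $\hat{p}_{t+5}$ output and the channel-wise maximum are our own reading:

    import tensorflow as tf

    def saliency_map(anticipation_model, images, states, actions, step=5):
        """Saliency as in [23]: gradient magnitude of the p_{t+step}
        output with respect to the input pixels."""
        images = tf.convert_to_tensor(images, dtype=tf.float32)
        with tf.GradientTape() as tape:
            tape.watch(images)
            preds = anticipation_model([images, states, actions])
            target = preds[:, step - 1]        # e.g. the p_{t+5} predictor
        grads = tape.gradient(target, images)  # d p_{t+step} / d pixels
        return tf.reduce_max(tf.abs(grads), axis=-1)  # per-pixel saliency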

V. CONCLUSIONS

In this paper, we showed how Deep Predictive Models and self-supervised learning can be used to anticipate impending physical impact on the robot body. By learning a mapping between observed visual cues and future perturbations, robots can anticipate upcoming hazardous, noxious, or unwanted states. To this end, we introduced a deep network architecture that combines spatial, temporal, and probabilistic reasoning in order to generate predictions over a horizon of future time steps. The network learns to focus on the visual features that are most indicative of future exogenous perturbations. For future work, we aim to extend the framework such that the predictions can be conditioned on entire control policies of the robot, not only on a single next action. Furthermore, we will incorporate the prediction framework into reinforcement learning, in order to also autonomously learn preventive motions for avoiding perturbations.

REFERENCES

[1] I. Asimov, I, Robot. Fawcett Publications, 1950.

[2] S. Haddadin, Towards Safe Robots: Approaching Asimov's 1st Law. Springer Publishing Company, Incorporated, 2013.

[3] A. De Luca, A. Albu-Schaffer, S. Haddadin, and G. Hirzinger, "Collision detection and safe reaction with the DLR-III lightweight manipulator arm," in 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2006, pp. 1623–1630.

[4] S. Haddadin, A. Albu-Schaffer, A. De Luca, and G. Hirzinger, "Collision detection and reaction: A contribution to safe physical human-robot interaction," in 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2008, pp. 3356–3363.

[5] H. Iwata, H. Hoshino, T. Morita, and S. Sugano, "Force detectable surface covers for humanoid robots," in 2001 IEEE/ASME International Conference on Advanced Intelligent Mechatronics, vol. 2, 2001, pp. 1205–1210.



[6] K. S. Eom, I. H. Suh, W. K. Chung, and S. R. Oh, "Disturbance observer based force control of robot manipulator without force sensor," in 1998 IEEE International Conference on Robotics and Automation, vol. 4, May 1998, pp. 3012–3017.

[7] T. Murakami, F. Yu, and K. Ohnishi, "Torque sensorless control in multidegree-of-freedom manipulator," IEEE Transactions on Industrial Electronics, vol. 40, no. 2, pp. 259–265, 1993.

[8] E. Berger, D. Muller, D. Vogt, B. Jung, and H. B. Amor, "Transfer entropy for feature extraction in physical human-robot interaction: Detecting perturbations from low-cost sensors," in 2014 IEEE-RAS International Conference on Humanoid Robots. IEEE, 2014, pp. 829–834.

[9] E. Berger, M. Sastuba, D. Vogt, B. Jung, and H. B. Amor, "Dynamic mode decomposition for perturbation estimation in human robot interaction," in The 23rd IEEE International Symposium on Robot and Human Interactive Communication. IEEE, 2014, pp. 593–600.

[10] E. Berger, M. Sastuba, D. Vogt, B. Jung, and H. Ben Amor, "Estimation of perturbations in robotic behavior using dynamic mode decomposition," Advanced Robotics, vol. 29, no. 5, pp. 331–343, 2015.

[11] J. Kuehn and S. Haddadin, "An artificial robot nervous system to teach robots how to feel pain and reflexively react to potentially damaging contacts," IEEE Robotics and Automation Letters, vol. 2, no. 1, pp. 72–79, 2017.

[12] J. Mainprice, R. Hayne, and D. Berenson, "Goal set inverse optimal control and iterative replanning for predicting human reaching motions in shared workspaces," IEEE Transactions on Robotics, vol. 32, no. 4, pp. 897–908, Aug 2016.

[13] C. Perez-D'Arpino and J. A. Shah, "Fast target prediction of human reaching motion for cooperative human-robot manipulation tasks using time series classification," in 2015 IEEE International Conference on Robotics and Automation (ICRA), 2015.

[14] F.-H. Chan, Y.-T. Chen, Y. Xiang, and M. Sun, "Anticipating accidents in dashcam videos," in Asian Conference on Computer Vision (ACCV).

[15] P. H. Pinheiro and R. Collobert, "Recurrent convolutional neural networks for scene labeling," in ICML, 2014, pp. 82–90.

[16] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[17] R. Saegusa, G. Metta, G. Sandini, and S. Sakka, "Active motor babbling for sensorimotor learning," in 2008 IEEE International Conference on Robotics and Biomimetics (ROBIO). IEEE, 2009, pp. 794–799.

[18] Y. Demiris and A. Dearden, "From motor babbling to hierarchical learning by imitation: A robot developmental pathway," 2005.

[19] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, 2012.

[20] Y. Gal and Z. Ghahramani, "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning," arXiv preprint arXiv:1506.02142, 2015.

[21] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[22] K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.

[23] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," arXiv preprint arXiv:1312.6034, 2013.