
Neural sensorimotor primitives for vision-controlled flying robots

Oswald Berthold and Verena V. Hafner

I. Abstract

Vision-based control of aerial robots adds the challenge of visual sensing on top of the basic control problems. Here we pursue the approach of using sensor-coupled motion primitives in order to tie together both the sensor and motor components of flying robots to leverage sensorimotor interaction in a fundamental manner. We also consider the temporal scale over which autonomy is to be achieved. Longer term autonomy requires significant adaptive capacity of the respective control structures. We investigate linear combinations of nonlinear network activity for the generation of motor output and, ultimately, behaviour. In particular we are interested in how such networks can be set up to autonomously learn a desired behaviour. For that, we are looking at bootstrapping of motor primitives. The contribution of this work is a neurally inspired architecture for the representation of sensorimotor primitives capable of online learning on a flying robot.

II. Introduction

Our aim is the design of control circuits for multirotor type helicopters to safely and robustly move and navigate in unmodified everyday environments. This requires a highly adaptive architecture, which can be realized by using Reinforcement Learning (RL) techniques in combination with autonomously acquired internal models of the robot to be controlled. The main sensory mode is vision, aided by several auxiliary channels; the overall dynamics of the underlying system, however, are largely unaffected by the choice of sensors. This course of action is motivated both by technical requirements and biological models. The general setting is to start the robot learning episode with a default parameterization of the primitive, or policy. The policy is stochastic and the system follows the gradient of a cost function with respect to the policy parameters. The major challenge is to do this without the robot damaging or destroying itself.

The control problems that we consider here are basic stabilization of positions in space in the presence of continuous perturbation and noise. We consider the question: can we bootstrap a controller based purely on robotic self-exploration, with the marginal condition of not destroying the robot during learning?

III. Related work

A. Biology

The biological idea of primitives has a long history. Motor primitives date back at least to Helmholtz and are still receiving current interest. In the biological context, motion primitives are mostly referred to as Central Pattern Generators (CPGs), emphasizing their autonomous nature. The decomposition of sensing into the activity of specialist primitives starts at the latest with the ecological approach of Gibson. In investigations of motor control, several authors describe a coherent picture of the features of biological control systems (Arbib 1981; Mussa-Ivaldi and Solla 2004; Grillner 2006; Ijspeert 2008).

In summary, these features are: their distributed character, which results in intelligent behaviour from the ground up through component autonomy and goal-awareness at the lowest levels due to motor-referenced sensory primitives; their subsumption of concepts such as regulators, feedback controllers and homeostasis, and their mediating function in output transformation; the presence of internal models coupled with identification mechanisms to realize adaptive control; and the ability to realize minute yet significant adjustments, which are in many cases necessary for successful actions. Overall, adaptability appears to be favored over precision.

B. Proposed architectures for representing primitives

Recently proposed computational models for primitive representation can be classified according to their location on the model-based to model-free spectrum. On the model-based end, Lupashin et al. (2010), Schoellig, Mueller, and D'Andrea (2012) and Mellinger, Michael, and Kumar (2012) use a detailed quadrotor flight dynamics model derived from first principles which is then refined iteratively. Behavioural targets are trajectory following with varying DoF, and aerobatics. Faust et al. (2013) learn swing-free trajectories on load-carrying quadrotors using Approximate Value Iteration (AVI) on top of similar models. The autonomous plane of Bry, Bachrach, and Roy (2012) also falls into this class.

Using attractors in dynamical systems has been proposed repeatedly for achieving computational function; an example are Attractor Neural Networks (ANNs) (Mussa-Ivaldi and Solla 2004). Ijspeert, Nakanishi, and Schaal (2002) introduced the concept of Dynamic Movement Primitives (DMPs), explicitly constructed attractor systems which drive a second system via a nonlinear coupling generated with a set of weighted basis functions. The weight parameters can be optimized with Policy Search (PS) methods (Schaal et al. 2004; Kober, Wilhelm, et al. 2012). Tedrake et al. (2009) have used a similar approach for controlling a model plane around stall conditions. Sensory coupling of DMPs is not as straightforward, however (Kober, Mohler, and Peters 2008).

Kolter and Ng (2009) and Lau (2011) integrate coarse models of the systems under consideration into their policy representations. The coarse model reflects the intuitive understanding of the system; in many cases the sign of the control-input-dependent derivatives suffices. Michels, Saxena, and Ng (2005) used a Markov Decision Process (MDP) model and the PEGASUS RL algorithm for learning vision-based obstacle avoidance on an RC car.

Rottmann et al. (2007) bootstrap an altitude controller using RL and Gaussian Process Regression (GPR) for approximating the state-action value function, but special provision must be made to partition learning into chunks suitable for GPR. In their comprehensive work on learning autonomous helicopter aerobatic maneuvers, Abbeel and Ng (2004) and Abbeel, Coates, and Ng (2010) use Inverse Reinforcement Learning (IRL) to optimize a coarse model through expert demonstrations and use that for learning aerobatic maneuvers on an autonomous helicopter. This is an incomplete sample of the literature; for a good survey of RL in robotics please refer to Kober and Peters (2012).

Finally, Waegeman, Wyffels, and Schrauwen (2012) propose a neural realization of motion primitives which are acquired through imitation using regression, a standard reservoir computing method (Jaeger 2001; Maass, Natschlaeger, and Markram 2002).

IV. Methods and model

A. Multirotor motion primitives

We choose to work on top of a standard attitude-stabilized controller which accepts $u_{robot} = (u_1, u_2, u_3, u_4)^T$ as the time-dependent control input, where the $u_i$ for $i = 1, \ldots, 4$ correspond to roll, pitch, yaw and thrust. The time index $t$ is omitted for clarity. Simply speaking, thrust effects a vertical acceleration, and setting roll and pitch angles effects lateral accelerations in the plane. By integrating these actions, we attain velocity and position in space. We are searching for an alphabet of excitation patterns of $u$ with which to synthesize acceleration-controlled behaviour in the 6 DoF of a rigid body in $\mathbb{R}^3$.
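To make the integration chain concrete, the following is a minimal sketch (in the spirit of the point-mass Python scripts mentioned in the supplementary material) of how a commanded acceleration integrates to velocity and position. The 36 Hz step matches the control loop of the simulator described below; the mass-free dynamics and the command value are illustrative assumptions, not the authors' code.

```python
import numpy as np

DT = 1.0 / 36.0  # control loop period; matches the simulator's 36 Hz loop

def point_mass_step(p, v, a_cmd):
    """One semi-implicit Euler step of an acceleration-controlled point mass.

    A commanded acceleration (e.g. the vertical acceleration effected by
    thrust) integrates to velocity, which integrates to position.
    """
    v_next = v + DT * a_cmd
    p_next = p + DT * v_next
    return p_next, v_next

# Example: hold a constant upward acceleration command for one second
p, v = 0.0, 0.0
for _ in range(36):
    p, v = point_mass_step(p, v, a_cmd=1.0)
print(p, v)  # ~0.5 and 1.0, as expected for a = 1 after t = 1
```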

B. Network model

We use neural reservoirs for representing the internal model from which controller output is derived. The reservoir is a recurrent neural network with random connectivity per instantiation. It can be used as a universal dynamical system approximator. The first important parameter of such a reservoir is its size, that is, the number of neurons $N$ in the network. Increasing the network size $N$ increases the general modelling capacity of the network.

We represent the Sensorimotor Primitive (SMP) with a leaky integrator reservoir and an output weight vector $W^{out}_i$ associated with output channel $i$. These components model the actual physical system and can generate control and prediction outputs. The network state evolves according to

$$\Delta x_t = \lambda W^{res} r_t + W^{in} u_t + W^{f} z_t + W^{bias} \mathbf{1}$$
$$x_{t+1} = (1 - \tau) x_t + \tau \Delta x_t$$
$$r_{t+1} = \tanh(x_{t+1}) + \nu_{state}$$

The matrices $W^{res}, W^{in}, W^{f}, W^{bias}$ are the $N \times N$ reservoir, $N \times U$ input, $N \times Y$ readout-feedback and $N \times 1$ bias matrices, respectively. $U$ and $Y$ designate the number of inputs and readout units. The scalar $\lambda$ is a scaling factor to effect a desired spectral radius for the reservoir matrix and in our experiments is chosen as $\lambda = g \frac{1}{\sqrt{pN}}$, where the gain $g = 1.1$ and the connection probability for the reservoir matrix $p = 0.1$. The state noise $\nu_{state}$ is uniformly distributed with amplitude $-0.1$ to $0.1$ and has a regularizing effect. Network output is computed as

$$y_{i,t} = W^{out}_{i,t} r_t$$


for each channel $i$ at time $t$. The final policy can be formed according to

$$u_{robot} = (y_1, y_2, y_3, y_4)^T$$

In the experiments we do not synthesize the full $u_{robot}$ but only look at one or two readout units. The other controls are either assigned constant values or are driven by e.g. a PID controller.
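A minimal NumPy sketch of these update equations follows. The reservoir size, weight scales and the time constant $\tau$ are assumed values for illustration (the text fixes only $g = 1.1$, $p = 0.1$ and the state-noise amplitude); the matrix shapes follow the definitions above.

```python
import numpy as np

rng = np.random.default_rng(0)

N, U, Y = 100, 3, 1        # reservoir size, inputs, readouts (example values)
g, p = 1.1, 0.1            # gain and connection probability from the text
tau = 0.1                  # leak time constant (assumed value)
lam = g / np.sqrt(p * N)   # scaling toward the desired spectral radius

# Random connectivity per instantiation; the weight scales are assumptions
W_res = rng.normal(0.0, 1.0, (N, N)) * (rng.random((N, N)) < p)
W_in = rng.normal(0.0, 1.0, (N, U))
W_f = rng.normal(0.0, 1.0, (N, Y))
W_bias = rng.normal(0.0, 1.0, N)
W_out = np.zeros((Y, N))   # readout weights start at zero before bootstrapping

def reservoir_step(x, r, u, z):
    """Leaky-integrator update: Delta-x, then x_{t+1} and r_{t+1}."""
    dx = lam * W_res @ r + W_in @ u + W_f @ z + W_bias
    x_next = (1.0 - tau) * x + tau * dx
    nu_state = rng.uniform(-0.1, 0.1, N)  # regularizing state noise
    return x_next, np.tanh(x_next) + nu_state

def readout(r):
    """Network output y_i = W_out_i r for each channel i."""
    return W_out @ r

# One step with zero input, feeding the current readout back as z
x, r = np.zeros(N), np.zeros(N)
x, r = reservoir_step(x, r, np.zeros(U), readout(r))
```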

C. Learning

There are several methods available for training reservoirs. We interrelate three different methods, learning rules, PS, and regression, in Figure 1. In the bootstrapping case, we want to use close to no prior information on how the control signal depends on the sensory input. The priors we do introduce in the process are limited to an offset for the thrust command to bridge the "dead band" in the lower half of the allowed range (where the system is unresponsive), a time constant for the network dynamics, the size of the reservoir, and an amplitude for the exploration noise $\nu$. We use an unsupervised learning method implemented via a Hebbian learning rule (Hoerzer, Legenstein, and Maass 2012). The learning rule exploits correlations between the exploration noise $\nu$ and the immediate reward. During learning, Gaussian noise is added to the readout signal by letting

$$\hat{y}_{i,t} = W^{out}_i r_t + \nu_t = y_{i,t} + \nu_t$$

with $\nu \sim \mathcal{N}(\mu, \sigma)$. The reward is computed by a performance measure $P_t$ at each time step. A low-pass filtered version $\bar{P}_t = \alpha \bar{P}_{t-1} + (1 - \alpha) P_t$ is also generated, with $\alpha = 0.8$. The modulator signal is derived from $P_t$ via

$$M_t = \begin{cases} 1 & \text{if } P_t > \bar{P}_t \\ 0 & \text{otherwise} \end{cases}$$

and the weight update then is given as

$$\Delta W^{out}_{i,t} = \eta_{i,t} r_t (\hat{y}_{i,t} - \bar{y}_{i,t}) M_t$$

with $\bar{y}_{i,t}$ being a low-pass filtered version of $\hat{y}_{i,t}$, analogously to $\bar{P}$. The signals $P_t$ and $M_t$ can be indexed with the motor channel $i$, but this is not strictly necessary. The entire situation is schematically depicted in Figure 1.
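The following sketch renders the exploratory Hebbian rule for a single readout channel. The value $\alpha = 0.8$ and the zero initialization of $W^{out}$ come from the text; the learning rate $\eta$, the noise scale $\sigma$ and the class structure are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

class HebbianReadout:
    """Exploratory Hebbian learning for one readout channel, after the rule
    of Hoerzer, Legenstein, and Maass (2012) as summarized above."""

    def __init__(self, n_reservoir, eta=1e-4, sigma=0.1, alpha=0.8):
        # alpha = 0.8 is from the text; eta and sigma are assumed values
        self.w = np.zeros(n_reservoir)  # W_out_i initialized to zero
        self.eta, self.sigma, self.alpha = eta, sigma, alpha
        self.y_bar = 0.0  # low-pass of the noisy output
        self.P_bar = 0.0  # low-pass of the performance measure

    def act(self, r):
        """Noisy exploration output y_hat_t = W_out r_t + nu_t."""
        return self.w @ r + rng.normal(0.0, self.sigma)

    def update(self, r, y_hat, P):
        """Delta W_out = eta r_t (y_hat_t - y_bar_t) M_t, then advance filters."""
        M = 1.0 if P > self.P_bar else 0.0  # binary modulator signal
        self.w += self.eta * r * (y_hat - self.y_bar) * M
        self.y_bar = self.alpha * self.y_bar + (1.0 - self.alpha) * y_hat
        self.P_bar = self.alpha * self.P_bar + (1.0 - self.alpha) * P
```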

[Figure 1: block schematic showing the reservoir with state $r$, readouts $y_1, y_2, y_3$ driving the robot's roll, pitch and thrust channels, readout feedback, the robot state $s = (p^T, v^T, a^T, \ldots)^T$ fed back as reservoir input $u^{res}$, a learning-rule / policy-search block emitting $\Delta W^{out}$, and a regression block with target forcing.]

Fig. 1. Schematic of the learning system. Those components directly involved with learning in the proposed setup are colored in bright red, related methods in a darker shade. Application of the Hebbian learning rule can be regarded as policy search. PS methods emit $\Delta W$ derived from online exploration; regression computes $W$ directly from a batch.

D. Simulator

Simulations were done with the flight model simulator crrcsim¹, which was modified to be used as a software-in-the-loop drop-in for the real helicopter. This allows using the same software infrastructure as on our real robots. The control loop is executed at 36 Hz.

¹http://sourceforge.net/apps/mediawiki/crrcsim/index.php?title=Main_Page

V. Experiments

In these experiments we consider basic stabilizing behaviour for a multirotor helicopter, in particular hovering at fixed altitudes and keeping a fixed lateral position. What we aim to bootstrap is a policy for general cases, even when we might not have a clear a priori idea of the general direction of motor-induced sensory consequences. These experiments were performed to investigate the suitability of the approach for application on a real robot.

A. Bootstrapping asymmetrical stabilization: hover

The most basic capability of a flying robot is, arguably, that of getting off the ground. In the simplest case, the goal state could be required to lie inside a given band, that is, between a lower and an upper limit, disregarding the specific altitude achieved by the controller. Nonetheless we set a singular setpoint as the target for learning.

For bootstrapping we apply exploratory Hebbian learning to the reservoir readout weights. This implements a policy gradient method via immediate reward correlation with action. A crucial point before learning can take place is determining the dead-time or temporal delay $T$ of the sensory consequences with regard to motor action. Here we bumped the system and determined the temporal delay by cross-correlation of the recorded responses. The reward used in this experiment is $P_t = a_t \operatorname{sgn} e_t$ with acceleration $a_t$, positional error $e_t = p_d - p_t$, the desired position $p_d$, and the sign of the error $\operatorname{sgn} e \in \{1, -1\}$. The performance measure rewards acceleration pointing in a direction that reduces the error, with a sharp inversion when crossing the setpoint. The delta rule needs to respect the temporal delay $T$, so we used $\Delta W^{out} = \eta (\hat{y}_{t-T} - \bar{y}_{t-T}) M_t r_{t-T}$.
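As a sketch of this bump-based delay identification, one can cross-correlate the recorded command with the measured response and take the lag of maximum correlation. The helper below and its synthetic data are illustrative assumptions, not the authors' actual code.

```python
import numpy as np

def estimate_delay(u, a, dt):
    """Estimate the motor-to-sensor dead-time T by cross-correlating a
    recorded bump command u with the measured response a.
    Returns the best lag in steps and in seconds."""
    u = (u - u.mean()) / (u.std() + 1e-12)   # normalize both signals
    a = (a - a.mean()) / (a.std() + 1e-12)
    xc = np.correlate(a, u, mode="full")     # lags run from -(n-1) to n-1
    lags = np.arange(-len(u) + 1, len(a))
    lag = int(lags[np.argmax(xc)])
    return lag, lag * dt

# Synthetic check: the response is the command delayed by 5 steps plus noise
rng = np.random.default_rng(2)
n, true_T = 500, 5
u = np.zeros(n); u[100] = 1.0                    # bump at step 100
a = np.roll(u, true_T) + 0.05 * rng.normal(size=n)
print(estimate_delay(u, a, dt=1.0 / 36.0))       # -> (5, ~0.14 s)
```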

Structurally, the reservoir receives $u^{res} = (p, v, a)^T$ as input, where $p, v, a$ are position, velocity and acceleration with regard to the vertical axis. The readout is connected to the thrust motor channel and is also fed back into the reservoir as an efference copy. Before learning, the readout weights are initialized to zero, $W^{out} = 0$, and after a short washout period of 300 time steps, learning is started. Figure 2 shows an exemplary learning run in that configuration. After about three (real-time) minutes the system achieves ground clearance, and after twice that amount of time it achieves satisfying performance. Learning is performed with a constant learning rate $\eta$ for about 550 seconds, at which point learning is clamped and the system is put into testing mode. The testing episode is magnified in Figure 3. The spikes are strong perturbations achieved by halting controller execution for several seconds, propagating the last motor command emitted, and then resuming execution. The system returns to the operating point without overshoot. Figure 4 shows the first two principal components of the reservoir activation during another testing episode of a different controller instance, with positive and negative perturbations around the setpoint.

What is special in this setup, apart from the obvious dynamical asymmetry, is that in order to take off the robot needs to move through Ground Effect (GE). GE alters the thrust-to-lift ratio, so that a thrust value which initially lifts the robot off the ground is insufficient for climbing once GE is left.

B. Results

We generated statistics over 100 learning runs to assess the robustness of the learning process itself. The learning duration was set to a fixed amount, during which a major proportion of runs reach a satisfying level of performance.

[Figure 2: two panels over $t = 0 \ldots 600$ s, showing the weight vector norm $|W|$ and the altitude $z$ [ft] with trace $z_t$ and setpoint $z_d$.]

Fig. 2. Exemplary learning run for bootstrapping the hover controller. Learning starts at $t = 8.3$ s and the robot makes its first jump into the air shortly after. The weight vector norm oscillates slightly in the beginning due to the inversion of the reward value when crossing the target setpoint, but this behaviour subsequently ceases. At about 3 minutes the robot finally clears the ground and remains airborne for the rest of the episode, except for touching the ground once at $t = 300$. Until termination of learning at $t = 555$ s, the variance steadily decreases.

[Figure 3: altitude $z$ [ft], traces $z_t$ and $z_d$, over $t = 540 \ldots 700$ s.]

Fig. 3. The testing episode of the controller whose acquisition is shown in Figure 2. The system is perturbed at various points during the episode, marked in red and visible as large spikes in the curve. It returns robustly to baseline activity around the designated setpoint. The seemingly chaotic resting activity is induced by internal system noise.

[Figure 4: scatter of pc1 versus pc2, both in the range $-0.8 \ldots 0.8$.]

Fig. 4. First two principal components of the reservoir activation during hover testing. The system went through positive and negative perturbations of varying magnitude and robustly returns to a stable fixpoint.

Those that do not fall into two categories. In some cases the learning process picks up the wrong direction right at the beginning, resulting in negative thrust values and making the system unresponsive. In the other cases, the reservoir locks into autonomous oscillations which the learning process fails to suppress and which dominate the output. See Figure 5 for exemplary altitude profiles and the RMSE of all 100 runs.

[Figure 5: upper panel, altitude $z$ [ft] over $t = 0 \ldots 140$ s; lower panel, RMSE over run index 0 to 100.]

Fig. 5. Statistics over 100 learning runs. The upper graph shows a sample of superimposed temporal altitude profiles during testing of the converged controller. The lower graph shows the RMSE for the final testing interval of all runs. Those runs displayed in the upper graph are colored red in the lower one.

We ran the experiments with reservoirs of sizes $N = \{20, 100\}$, although we did not, at this point, generate comparative statistics. Not entirely surprisingly, the larger reservoir overall tends towards better performance in terms of convergence speed, final error and robustness. Apart from disturbances, we also tested injecting additional measurement noise into the altitude reading, clamping of inputs, and setpoint modulation by adding an offset to the measurement. We applied the same method to learning lateral stabilization of the robot and also investigated episodic policy refinement using PS methods. Due to limited space these results cannot be further illustrated here. Other possible follow-up experiments from this general scenario are discussed in section VI.

VI. Discussion and future work

All the learning methods discussed here are only able to find local optima in the performance landscape. This is still sufficient for a variety of tasks. Of particular importance to us was the applicability of learning to real robots without the immediate danger of destroying the robot during exploration.

We think the learning process of the hover controller is remarkable in that bootstrapping occurs in a double sense. First, the controller itself has to be learnt; at the same time, the learning task is not stationary, as the system needs to learn implicitly to first take off before it can stabilize itself at a given hover altitude, requiring different magnitudes of changes to the output signal.

The acquired primitives are innately grounded in that they are only referenced to locally measurable quantities. Using only the sign of the positional error, besides acceleration, in the reward puts weak demands on the visual sensing of position on the real robot. Compared with other approaches our method is very general, in that almost no prior knowledge is needed for a successful learning run. It can readily be applied to a variety of vehicles in an online setting. At the same time, if prior knowledge can be given in the form of a demonstration, this could also be exploited.

Future work will involve verifying the approach on real robots (Berthold, Müller, and Hafner 2011) and autonomous determination of the inherent temporal delay. It has to be noted, however, that when using episodic policy search the temporal delay issue is of no concern.

VII. Summary

We presented a neural architecture capable of representing sensorimotor primitives suitable for model-free reinforcement learning. We demonstrated the application of that architecture to problems in MAV motion control, although we are not limited in scope to MAVs. We were able to bootstrap SMPs for stabilization tasks using online learning and exploiting sensorimotor interaction. The bootstrapping process is gentle enough not to destroy the system during exploration and can thus be applied to learning on real hardware. In addition, the same process can be applied to post-bootstrapping adaptation of the acquired primitives to changes in the environment. This enables the implementation of continuously learning systems. The resulting controllers display anticipatory behaviour and are extremely robust to noise, which makes them well suited for use in combination with vision-based state estimates. Supplementary material includes Python scripts for learning to stabilize an acceleration-controlled point mass in two dimensions, illustrating the principles.

VIII. Acknowledgements

This work is supported by the DFG Research Training Group METRIK. We are grateful to Christian Blum and Damien Drix for inspiring discussions.


References

Abbeel, Pieter, Adam Coates, and Andrew Y. Ng (2010). "Autonomous Helicopter Aerobatics through Apprenticeship Learning". In: I. J. Rob. Res. 29.13, pp. 1608–1639.

Abbeel, Pieter and Andrew Y. Ng (2004). "Apprenticeship learning via inverse reinforcement learning". In: Proc. ICML. ACM, pp. 1–.

Arbib, Michael A. (1981). "Supplement 2: Handbook of Physiology, The Nervous System, Motor Control". In: APS/Wiley. Chap. Perceptual Structures and Distributed Motor Control, pp. 1449–1480.

Berthold, Oswald, Mathias Müller, and Verena V. Hafner (2011). "A quadrotor platform for bio-inspired navigation experiments". In: International workshop on bio-inspired robots.

Bry, Adam, Abraham Bachrach, and Nicholas Roy (2012). "State estimation for aggressive flight in GPS-denied environments using onboard sensing". In: ICRA, pp. 1–8.

Faust, Aleksandra et al. (2013). Learning Swing-free Trajectories for UAVs with a Suspended Load in Obstacle-free Environments. ICRA WS AL.

Grillner, Sten (2006). "Biological Pattern Generation: The Cellular and Computational Logic of Networks in Motion". In: Neuron 52.5, pp. 751–766.

Hoerzer, Gregor M., Robert Legenstein, and Wolfgang Maass (2012). "Emergence of Complex Computational Structures From Chaotic Neural Networks Through Reward-Modulated Hebbian Learning". In: Cerebral Cortex.

Ijspeert, Auke Jan (2008). "Central pattern generators for locomotion control in animals and robots: A review". In: Neural Networks 21.4, pp. 642–653.

Ijspeert, Auke Jan, Jun Nakanishi, and Stefan Schaal (2002). "Learning Attractor Landscapes for Learning Motor Primitives". In: Proc. NIPS. MIT Press, pp. 1547–1554.

Jaeger, Herbert (2001). The "echo state" approach to analysing and training recurrent neural networks - with an Erratum note. Tech. rep. German National Research Center for Information Technology.

Kober, Jens, B. Mohler, and Jan Peters (2008). "Learning perceptual coupling for motor primitives". In: Proc. IROS, pp. 834–839.

Kober, Jens and Jan Peters (2012). "Reinforcement Learning in Robotics: A Survey". In: Reinforcement Learning. Springer Berlin Heidelberg, pp. 579–610.

Kober, Jens, Andreas Wilhelm, et al. (2012). "Reinforcement learning to adjust parametrized motor primitives to new situations". In: Autonomous Robots 33.4, pp. 361–379. ISSN: 0929-5593.

Kolter, J. Zico and Andrew Y. Ng (2009). "Policy search via the signed derivative". In: Proc. RSS.

Lau, Tak-Kit (2011). "PRISCA: A policy search method for extreme trajectory following". In: Proc. CDC-ECC, pp. 2856–2862.

Lupashin, S. et al. (2010). "A simple learning strategy for high-speed quadrocopter multi-flips". In: Proc. ICRA, pp. 1642–1648.

Maass, Wolfgang, Thomas Natschlaeger, and Henry Markram (2002). "Real-time Computing without stable states: A New Framework for Neural Computation Based on Perturbations". In: Neural Computation 14.11, pp. 2531–2560.

Mellinger, Daniel, Nathan Michael, and Vijay Kumar (2012). "Trajectory generation and control for precise aggressive maneuvers with quadrotors". In: I. J. Rob. Res. 31.5, pp. 664–674.

Michels, Jeff, Ashutosh Saxena, and Andrew Y. Ng (2005). "High speed obstacle avoidance using monocular vision and reinforcement learning". In: Proc. ICML, pp. 593–600.

Mussa-Ivaldi, F. A. and S. A. Solla (2004). "Neural Primitives for Motion Control". In: IEEE Journal of Oceanic Engineering 29, pp. 640+.

Rottmann, Axel et al. (2007). "Autonomous blimp control using model-free reinforcement learning in a continuous state and action space". In: Proc. IROS.

Schaal, Stefan et al. (2004). "Learning movement primitives". In: Proc. ISRR. Springer.

Schoellig, Angela P., Fabian L. Mueller, and Raffaello D'Andrea (2012). "Optimization-based iterative learning for precise quadrocopter trajectory tracking". In: Autonomous Robots 33.1-2, pp. 103–127.

Tedrake, Russ et al. (2009). Learning to Fly like a Bird.

Waegeman, Tim, Francis Wyffels, and Benjamin Schrauwen (2012). "A discrete/rhythmic pattern generating RNN". In: Proc. ESANN, pp. 567–572.
