Active Exploration Mechanisms in Autonomous Learning and Development: A Robotic Modeling Perspective
Pierre-Yves Oudeyer, Project-Team FLOWERS, INRIA / ENSTA ParisTech
http://www.pyoudeyer.com  https://flowers.inria.fr  Twitter: @pyoudeyer


Page 1: Active Exploration Mechanisms in Autonomous Learning and Development: A Robotic Modeling Perspective (title slide)

Page 2: Autonomous Learning and Development in Human Infants

• How do developmental structures form?
• What is the role of structured learning curriculums?
• How do they enable autonomous learning?

[Contrasting illustration captioned: "Not autonomous learning"]

Page 3: Ac#ve&Explora#on&Mechanisms&in& …

Cogni;ve$sciences!models!to!understand!be?er!

human!development!

Lifelong!autonomous!learning!in!!

robo;cs$and$AI!

ApplicaOons!in!!educa;onal$technologies!

Team:!~20&25!people!•  4!Inria!seniors:!!

PY.!Oudeyer,!M.!Lopes,!!D.!Roy,!A&L.!Vollmer!

•  3!Ensta!ParisTech!!•  seniors:!!

D.!Filliat,!F.!Stulp,!!A.!Gepperth!!

•  8&9!PhDs!•  3&5!engineers!•  1&3!postdocs!

Many!collaboraOons!with!researchers!in!!•  Developmental!psychology!•  Neuroscience!•  RoboOcs!and!AI!•  EducaOonal!sciences!

3!

Page 4: Ac#ve&Explora#on&Mechanisms&in& …

ExploraOon!and!guidance!

mechanisms!

Intrinsic$mo;va;on,$ac;ve$learning$curiosity$$!•  CogniOve!science!•  RoboOcs/AI!•  ApplicaOons!in!educaOon!!Body$morphology$and$growth!!!•  CogniOve!science!•  RoboOcs/AI!•  ApplicaOons!in!educaOon!!Interac;ve$learning,$imita;on$!•  CogniOve!science!•  RoboOcs/AI!•  ApplicaOons!in!educaOon!!

4!

Page 5: Ac#ve&Explora#on&Mechanisms&in& …

Intrinsic$mo;va;on,$ac;ve$learning$curiosity$$!•  Cogni;ve$science$•  RoboOcs/AI!•  ApplicaOons!in!educaOon!!Body$morphology$and$growth!!!•  CogniOve!science!•  RoboOcs/AI!•  ApplicaOons!in!educaOon!!Interac;ve$learning,$imita;on$!•  CogniOve!science!•  RoboOcs/AI!•  ApplicaOons!in!educaOon!!

5!

ExploraOon!and!guidance!

mechanisms!

Page 6: Ac#ve&Explora#on&Mechanisms&in& …

Spontaneous!acOve!exploraOon!

h?ps://www.youtube.com/watch?v=8vNxjwt2AqY!

Page 7: Ac#ve&Explora#on&Mechanisms&in& …

Intrinsic!moOvaOon,!curiosity!and!acOve!learning!

! Intrinsic!drive!to!reduce!uncertainty,!and!to!experiencing!novelty,!surprise,!!cogniOve!dissonance,!challenge,!incongruences,!…!! OpOmal!interest!=!opOmal!difficulty!=!neither!trivial!nor!too!difficult!challenges!Berlyne!(1960),!White!(1960),!Kagan!(1972),!Csikszentmihalyi!(1996),!(Kidd!et!al.,!2012),!…!See!review!in!(Oudeyer!et!al.,!2016)!!

Flow!theory!Csikszentmihalyi!!(1996)!

Page 8: Ac#ve&Explora#on&Mechanisms&in& …

Intrinsic!moOvaOon,!curiosity!and!acOve!learning!

8!

! Intrinsic!drive!to!reduce!uncertainty,!and!to!experiencing!novelty,!surprise,!cogniOve!dissonance,!challenge!

(Fron;ers$in$Neuroscience$2007;$IEEE$TEC$2007;$Trends$in$Cogni;ve$Science,$Nov.$2013;$Progress$in$Brain$Research,$2016;$$Fron;ers$in$Neuroscience,$2014;$Scien;fic$Reports,$2016;$PNAS,$in$press)$$

ContribuOon!of!Flowers!lab!and!colleagues!in!the!last!10!years:!!Development!of!a$unified$formal$and$theore;cal$framework$$

in!psychology!and!neuroscience!

Flow!theory!Csikszentmihalyi!!(1996)!

Page 9: Ac#ve&Explora#on&Mechanisms&in& …

PredicOve!brain!framework:!Exploring!to!learn!world!models!

AcOon!Sensori!

consequences!

Model!learning!

WORLD$

Predic;ons$

Page 10: Ac#ve&Explora#on&Mechanisms&in& …

PredicOve!brain!framework:!Exploring!to!learn!world!models!

AcOon!Sensori!

consequences!

Model!learning!

WORLD$

Predic;ons$

Goals!

Page 11: Ac#ve&Explora#on&Mechanisms&in& …

The!acOve!exploraOon!system!

AcOon!Sensori!

consequences!

Model!learning!

Monitoring!of!novelty/

uncertainty/learning/…!

Intrinsic!reward!

AcOon/Goal!selecOon!

(Oudeyer!et!al.,!2016;!Go?lieb!et!al.,!2013;!Friston,!2012;!Barto!et!al.,!2004;!!Oudeyer!et!al.,!2007;!Schmidhuber,!1991)!!Also!related!to!!Child!as!a!(Bayesian)!scienOst!hypothesis!(Shulz,!Gopnick,!Tenenbaum,,!…)!

WORLD$Predic;ons$

Goals!

(Baranes!and!Oudeyer,!2013)!
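The loop on this slide can be turned into a runnable sketch. Everything below (the three-region toy world, the tabular learner, the selection rule) is an illustrative assumption, not the slide's actual model; the point is only the wiring: model learning feeds a monitor whose learning-progress signal is the intrinsic reward driving action/goal selection.

```python
import random

random.seed(0)

def world(region, x):
    """Toy world with three regions: predictable, pure noise, learnable."""
    if region == 0:
        return x                    # trivially predictable
    if region == 1:
        return random.random()      # unlearnable noise
    return x * x                    # learnable nonlinear mapping

class RegionLearner:
    """Tabular predictor plus a record of its own prediction errors."""
    def __init__(self):
        self.table = {}
        self.errors = []

    def predict(self, x):
        return self.table.get(round(x, 1), 0.0)

    def update(self, x, y):
        self.errors.append(abs(self.predict(x) - y))
        self.table[round(x, 1)] = y  # model learning step

    def progress(self, window=10):
        """Intrinsic reward: recent decrease of the mean prediction error."""
        e = self.errors
        if len(e) < 2 * window:
            return 1.0               # optimistic start: everything looks interesting
        recent = sum(e[-window:]) / window
        older = sum(e[-2 * window:-window]) / window
        return max(older - recent, 0.0)

learners = [RegionLearner() for _ in range(3)]
counts = [0, 0, 0]
for t in range(3000):
    # Action/goal selection: mostly exploit the region with highest progress.
    if random.random() < 0.2:
        region = random.randrange(3)
    else:
        region = max(range(3), key=lambda r: learners[r].progress())
    x = random.random()
    y = world(region, x)             # act on the world, observe consequences
    learners[region].update(x, y)    # model learning + monitoring
    counts[region] += 1

print("visits per region:", counts)
```

The progress signal concentrates visits on regions whose errors are still shrinking, which is the mechanism the monitoring box stands for.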

Page 12: Ac#ve&Explora#on&Mechanisms&in& …

InteracOon!with!other!moOvaOons!and!learning!mechanism!

AcOon!Sensori!

consequences!

PredicOon!learning!

Monitoring!of!novelty/

uncertainty/learning/…!

Intrinsic!reward!

AcOon!selecOon!

.$

.$

.$

Food!moOvaOon!

Physical!integrity!

presevaOon!

WORLD$

Social!bonding!

moOvaOon!

WORLD$Predic;ons$

Page 13: Ac#ve&Explora#on&Mechanisms&in& …

Curiosity!in!experimental!studies!$

!•  Perceptual!curiosity!e.g.!Stahl!and!

Feigenson,!2015!!

[Embedded excerpts from Kang et al. (2009), Psychological Science 20(8): Fig. 1 shows the trivia-question protocol (question presentation, self-paced curiosity and confidence ratings, second presentation, answer display) and the distribution of curiosity ratings as a function of confidence, where curiosity follows an inverted U with estimated regression curiosity = -0.49 - 0.39P + 4.77P(1 - P) + residual curiosity. Table 1 and Fig. 2 list brain regions more active on high- than low-curiosity trials during the first question presentation (left caudate, bilateral inferior frontal gyrus / Brodmann's area 45, parahippocampal gyri, medial and middle frontal gyri, lingual gyrus, cerebellum). The Experiment 2 methods (16 Caltech students) add recorded guesses, pupil dilation around the answer display, and a surprise recall session 11 to 16 days later.]

• Epistemic curiosity, e.g. Kang et al., 2009; Gruber et al., 2014
• Studies of the impact on memorization/learning
• Neural correlates (also in monkeys, e.g. Waelti et al., 2001; Gottlieb et al., 2013)
• Behavioral correlates (e.g. eye movements, Baranes et al., 2015)

Attention driven by surprising external stimuli on short time scales
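The Kang et al. regression cited on this slide relates curiosity to confidence P through the inverted-U term P(1 - P). A quick sketch, using the coefficients reported in that paper's figure caption, shows the predicted peak at intermediate confidence rather than at certainty or total ignorance:

```python
# Curiosity as a function of confidence P, using the regression reported by
# Kang et al. (2009): curiosity = -0.49 - 0.39*P + 4.77*P*(1 - P).
# The P*(1 - P) term makes curiosity peak at intermediate confidence.

def curiosity(p):
    return -0.49 - 0.39 * p + 4.77 * p * (1.0 - p)

# Scan confidence levels and find where curiosity peaks.
grid = [i / 100.0 for i in range(101)]
peak = max(grid, key=curiosity)
print(f"peak confidence: {peak:.2f}")   # peaks near P = 0.46, not at 0 or 1
```

This is the "optimal difficulty" pattern from the earlier slides, recovered from behavioral data.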

Page 14: Ac#ve&Explora#on&Mechanisms&in& …

Spontaneous!exploraOon:!!

Beyond!visual!a?enOon!towards!external!sOmuli,!Beyond!surprise,!!

Beyond!short&Ome!scales:!!!

Open!quesOons!!in!psychology!and!neuroscience!

Page 15: Ac#ve&Explora#on&Mechanisms&in& …

(1)!Beyond!visual!a?enOon!towards!external!sOmuli!

!!!What!kind!of!choices!can!be!made!

during!ac&ve!explora&on?!(interes&ngness!of!what?)!

Page 16: Ac#ve&Explora#on&Mechanisms&in& …

(FronOers!in!Neuroscience,!2014;!ICDL&Epirob!2014;!!See!also!ScienOfic!Reports,!2016;!PNAS,!in!press)!

A new experimental setup to study the structure of curiosity-driven exploration in humans

Brice Miard, Pierre Rouanet, Jonathan Grizou, Manuel Lopes, Jacqueline Gottlieb, Adrien Baranes, Pierre-Yves Oudeyer
Flowers Team: Inria, Ensta ParisTech; Department of Neuroscience, Columbia University
[email protected]

I. INTRODUCTION

Curiosity is a key element of human development, driving us to spontaneously explore novel objects, activities and environments [1]. Curiosity-driven exploration strategies permit us to interact, learn and evolve quickly in an open-ended world. It is thus an important challenge to understand the fundamental mechanisms of spontaneous exploration and curiosity in humans.

One of the first experiments on this topic was done by Harlow, in which monkeys played with simple 2D puzzles to highlight the relation between complexity, motivation and learning [2]. Another early example is McReynolds et al., who created the "curiosity box", in which identical boxes with different toys inside are presented to young children [3]. Recently, an experiment was run with infants to study action selection guided by intrinsic motivations using a mechatronic board: subjects had to learn the relation between actions on pushbuttons and the opening of boxes [4]. It showed significant differences in exploration strategy (sensorimotor vs. more learning-directed exploration) between three- and four-year-old infants. Despite a number of other psychology and neuroscience experiments in humans and monkeys, we still know little about the precise mechanisms of curiosity [1].

Many computational models of curiosity have been elaborated [1], [5]. Some of these models specifically targeted the modeling of curiosity and its role in human sensorimotor and cognitive development, showing how it can automatically generate behavioral and cognitive developmental structures sharing interesting similarities with infant development [6], [7]. These lines of work made it possible to identify the wide diversity of mechanisms that could be at play in driving spontaneous exploration [8]. We focus here on the intrinsic motivation mechanisms driving exploration, i.e. the processes allowing an agent to choose its own goals when freely involved in a task.

Novel hypotheses have been formulated, such as the idea that curiosity-driven sensorimotor exploration could be organized so as to maximize learning progress [9], [10], which differs from more classical hypotheses conceptualizing curiosity as a drive to maximize uncertainty or novelty. Yet the experimental setups designed so far in the literature do not allow one to discriminate between these hypotheses. A major research challenge is thus to design experimental setups that could confirm or invalidate individual hypotheses. Furthermore, it is important to note that intrinsic motivation mechanisms could be influenced when subjects are asked to behave according to a specific protocol. This is discussed in detail when describing our experimental setup.

Here, we take a step in this direction by presenting an exploratory study with humans designed to analyze and measure properties of curiosity-driven exploration of a priori unknown sensorimotor spaces. More specifically, we are interested in the relation between exploration and learning progress.

II. EXPERIMENTAL SETUP

Fig. 1. a) Subjects explore how to control an ellipse displayed on the screen in front of them by moving their body joints, tracked by a Kinect device. b) Details of the interface shown to the user: in particular, the controlled ellipse (in red) and the target one (in brown).

This experimental setup is designed as a game that sets human subjects into an intrinsically motivated activity [11]. Participants can freely explore and shift between several games, each being about finding a mapping between their movements (body joints

[Embedded excerpts from Baranes, Oudeyer and Gottlieb (2014), "Intrinsically motivated exploration of sensorimotor activities", Frontiers in Neuroscience 8:317: Fig. 1 shows the task design, pressing a key to intercept a stream of moving dots, with 7-game, 64-game and 64N-game selection screens; Fig. 2 shows performance as a function of dot speed; Fig. 4 shows the range of speeds and success rates selected by each subject; Fig. 5 shows the local game-selection strategy (probability of repeating, increasing or decreasing difficulty as a function of prior performance) against a random-selection baseline. Subjects had to play at least 70 games and at least 20 minutes, with fixed payment independent of performance and choices, and a surprise post-session procedure probed whether they monitored their own progress.]

Page 17: Ac#ve&Explora#on&Mechanisms&in& …

3.1 Active Model Babbling

A particular exploration strategy available in Explauto is called Active Model Babbling [Forestier andOudeyer, 2016]. With this strategy, di↵erent motor and sensory spaces are given to the robot, e.g. thespace of motor parameters of a robotic arm, and sensory spaces representing the movement of each objectin the scene, and the robot will learn behavioral primitives to explore and reach new e↵ects in the di↵erentsensory spaces. The exploration of one sensory space can give information in the other spaces, and therobot can learn that some objects (e.g. a joystick) can be used as a tool to control other objects (e.g.another robot) in a hierarchical manner. This strategy was shown to be very e�cient to explore multipletask/goal spaces based on the monitoring of learning progress in the di↵erent spaces. A Jupyter notebookimplementing and studying this strategy is available online at this address: http://nbviewer.jupyter.org/github/sebastien-forestier/ExplorationAlgorithms/blob/master/main.ipynb.
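The space-selection step of this strategy can be sketched as follows. This is a toy illustration with invented names and dynamics, not Explauto's actual API: each sensory/goal space keeps a learning-progress estimate, and exploration is allocated by a soft-max over progress, so attention flows to spaces where the robot is currently improving and away from spaces that are mastered or uncontrollable.

```python
import math
import random

random.seed(1)

class GoalSpace:
    """One sensory/goal space with a toy error-reduction dynamic."""
    def __init__(self, name, learnability):
        self.name = name
        self.learnability = learnability  # how fast error can shrink here
        self.error = 1.0
        self.progress = 0.0

    def explore(self):
        new_error = max(self.error - self.learnability * random.random(), 0.0)
        self.progress = self.error - new_error  # progress = error decrease
        self.error = new_error

def pick(spaces, temperature=0.01):
    """Soft-max sampling over each space's current learning progress."""
    weights = [math.exp(s.progress / temperature) for s in spaces]
    r = random.uniform(0.0, sum(weights))
    for space, w in zip(spaces, weights):
        r -= w
        if r <= 0.0:
            return space
    return spaces[-1]

spaces = [GoalSpace("hand", 0.05), GoalSpace("joystick", 0.02),
          GoalSpace("ball", 0.0)]  # the ball is not yet controllable at all
counts = {s.name: 0 for s in spaces}
for step in range(500):
    space = pick(spaces)
    space.explore()
    counts[space.name] += 1

print(counts)
```

In the full strategy the "joystick" space would eventually unlock the "ball" space (tool use); here the sketch only shows how progress monitoring steers sampling away from the unlearnable space.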

3.2 Interaction

In the Active Model Babbling exploration strategy, guidance from human peers can be integrated at different levels. When the robot is set up in a compliant mode, human peers can demonstrate robotic movements that the robot stores and can reuse later in its autonomous exploration, by re-experimenting those movements or variations of them as in [Nguyen and Oudeyer, 2012]. Users can also demonstrate the function of objects or create situations (e.g. using the joystick controlling the second robot to grab a ball) that the robot will try to reproduce by exploring new motor primitives. Finally, users can assign a value to the interestingness of exploring some task/goal spaces (e.g. moving the blue ball), which is combined with the intrinsic motivation of the robot and pushes it to explore those spaces.

4 Demonstration Setup

The demonstration shows how the Explauto library allows a Poppy Torso to explore its environment and use social guidance to understand the interactions between the different objects in the environment. The robot interacts with several objects, including a joystick that controls another robot, which can in turn be used as a tool to move other objects. Naive humans can interact with the robot by driving its exploration towards interesting objects, or by demonstrating useful movements of the arm or the joystick.

Figure 2: Demonstration setup.

References

[Forestier and Oudeyer, 2016] Forestier, S. and Oudeyer, P.-Y. (2016). Modular active curiosity-driven discovery of tool use. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Korea.

[Lapeyre et al., 2014] Lapeyre, M., Rouanet, P., Grizou, J., Nguyen, S., Depraetre, F., Le Falher, A., and Oudeyer, P.-Y. (2014). Poppy Project: Open-Source Fabrication of 3D Printed Humanoid Robot for Science, Education and Art. In Digital Intelligence 2014, page 6, Nantes, France.

[Moulin-Frier et al., 2014] Moulin-Frier, C., Rouanet, P., Oudeyer, P.-Y., et al. (2014). Explauto: an open-source Python library to study autonomous exploration in developmental robotics. In ICDL-Epirob: International Conference on Development and Learning, Epirob.

[Nguyen and Oudeyer, 2012] Nguyen, S. and Oudeyer, P.-Y. (2012). Active choice of teachers, learning strategies and goals for a socially guided intrinsic motivation learner. Paladyn, 3(3):136-146.

The choice spaces in curiosity-driven exploration

[Diagram: hand movements; object 1-5 movements; choose action parameters; choose goal / RL problem parameters; choose model (models 1, 4, 5, 7); choose model path / strategy; choose information source: self-exploration or human teacher; choose what to look at]

Page 18:

[Repeat of the "3.1 Active Model Babbling" and "3.2 Interaction" text shown on Page 17]

Hierarchical strategic learning algorithms (Lopes and Oudeyer, 2012; Nguyen and Oudeyer, 2013)

[Diagram: the choice spaces of Page 17, annotated with algorithms: IAC (Oudeyer et al., 2007); R-IAC (Oudeyer and Baranes, 2009); SAGG-Random (Oudeyer and Baranes, 2010); SAGG-RIAC (Oudeyer and Baranes, 2010-13); McSAGG-RIAC (Oudeyer and Baranes, 2012); MACOB (Forestier and Oudeyer, 2016); HACOB (Forestier and Oudeyer, 2016); SGIM-ACTS (Nguyen and Oudeyer, 2013); (Moulin-Frier et al., 2014)]
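The "strategic" level of this hierarchy, choosing the information source itself, can be sketched as follows. This is a toy illustration in the spirit of the slide, not the cited algorithms: the learner chooses not only what to practice but also between cheap self-exploration and a costlier but more informative human teacher, crediting each strategy with the learning progress it yields per unit of cost.

```python
import random

random.seed(3)

COST = {"self-exploration": 1.0, "teacher": 3.0}    # a demonstration costs more
GAIN = {"self-exploration": 0.01, "teacher": 0.05}  # ...but reduces error faster

error = 1.0
credit = {s: 1.0 for s in COST}   # optimistic initial progress estimates
log = []
for step in range(200):
    if random.random() < 0.1:                       # occasional random strategy
        strategy = random.choice(list(COST))
    else:                                           # else best progress per cost
        strategy = max(credit, key=lambda s: credit[s] / COST[s])
    new_error = max(error - GAIN[strategy] * random.random(), 0.0)
    # Running estimate of the learning progress this strategy delivers.
    credit[strategy] = 0.9 * credit[strategy] + 0.1 * (error - new_error)
    error = new_error
    log.append(strategy)

print("final error:", round(error, 3))
print("strategy use:", {s: log.count(s) for s in COST})
```

The same bandit-style credit assignment can sit above choices of goal space, model, or teacher, which is what makes the architecture hierarchical.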

Page 19: Ac#ve&Explora#on&Mechanisms&in& …

(1)!Beyond!«!surprise!»:!what!are!the!features!of!interes&ngness?!

•  High!novelty/high!complexity?!•  (Bayesian!?)!Surprise?!(In!and!Baldi)!•  Knowledge!gap,!cogniOve!dissonance?!(Kagan,!FesOnger,!Lowenstein)!•  Intermediate!novelty,!intermediate!complexity?!(Berlyne,!Kidd)!•  Intermediate!challenge?!(White,!Csikszentmihalyi)!•  Free!energy?!(Friston)!•  Learning$progress,$improvement$of$predic;on$errors?$(Oudeyer$et$al.,$Schmidhuber)$!

All!these!measures!can!be!mathemaOcally!modelled!and!compared!in!computaOonal!and!roboOc!experiments!!(Oudeyer,!Go?lieb!and!Lopes,!2016)!
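Two of these candidate measures can be contrasted directly on the same data. The sketch below is a toy illustration, not any published model: it compares a novelty/surprise-like signal (recent prediction error) with learning progress (the recent decrease in prediction error) on an unlearnable noisy channel versus a channel where practice pays off.

```python
import random

random.seed(2)

noise_errors = [random.random() for _ in range(500)]               # unlearnable
learnable_errors = [max(1.0 - t / 600.0, 0.0) for t in range(500)] # improving

def mean(xs):
    return sum(xs) / len(xs)

def prediction_error(errs):
    """Novelty/surprise-style interestingness: how wrong am I right now?"""
    return mean(errs[-200:])

def learning_progress(errs):
    """Progress-style interestingness: how much did my error shrink recently?"""
    return mean(errs[-400:-200]) - mean(errs[-200:])

print("prediction error  (noise, learnable):",
      round(prediction_error(noise_errors), 3),
      round(prediction_error(learnable_errors), 3))
print("learning progress (noise, learnable):",
      round(learning_progress(noise_errors), 3),
      round(learning_progress(learnable_errors), 3))
```

Maximizing raw prediction error would keep an agent glued to the noisy channel; maximizing learning progress sends it where improvement is actually happening, which is the distinction the last bullet points to.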

Page 20: Ac#ve&Explora#on&Mechanisms&in& …

O u d e y e r a n d K a p l a n

Frontiers in Neurorobotics | November 2007 | Volume 1 | Article 612

Both could be viewed as rationale strategies in certain contexts. Stability permits to act in order to decrease the inherent instability of perception and could lead for instance to tracking behavior (Kaplan and Oudeyer, 2003). On the contrary, variance motivation could lead to explore unknown sensorimotor contingencies far from equilibrium.

EXAMPLES OF COMPUTATIONAL MODELS OF NON-INTRINSIC MOTIVATION SYSTEMSFor clarity sake, we will shortly present in this section some computa-tional models of non-intrinsic motivation systems which are nevertheless internal.

Let's imagine, for instance, that one wants to build a robot with a social-presence motivation and that this robot can recognize faces in its environment. If the robot does not see enough faces, it should act as if it were lonely and look for social interaction; if it sees too many, it should be overwhelmed and try to avoid new social interactions. If we define F_τ(t) as the average number of faces seen during the last τ timeframes and F_τ^σ as the optimal average number of faces, the reward for socially balanced interaction (SocM) can be defined as (C_1 and C_2 being constants to be defined):

    r_SocM(t) = C_1 · e^( -C_2 · (F_τ(t) - F_τ^σ)² )        (36)

In the same manner, we can program a reward for energy maintenance that pushes the robot to maintain its energy at an intermediate level (EnerM), between starvation and indigestion, by defining E(t) as the energy at time t and E^σ as the optimal energy level, with the following reward formula:

    r_EnerM(t) = C_1 · e^( -C_2 · (E(t) - E^σ)² )           (37)
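The two rewards above share one Gaussian-shaped homeostatic form: maximal when the monitored quantity (faces seen, energy level) sits at its optimal value and falling off on both sides. A minimal sketch, with C1 and C2 as illustrative constants:

```python
import math

def homeostatic_reward(value, optimum, c1=1.0, c2=8.0):
    """Gaussian-shaped reward peaking when value == optimum (eqs. 36-37)."""
    return c1 * math.exp(-c2 * (value - optimum) ** 2)

# Social presence (eq. 36): optimal average number of faces, say 2.
print(homeostatic_reward(2.0, 2.0))         # at the optimum: reward = C1 = 1.0
print(homeostatic_reward(0.0, 2.0) < 1e-6)  # lonely: reward collapses -> True
# Energy maintenance (eq. 37): peak between starvation and indigestion.
print(homeostatic_reward(0.45, 0.5) > homeostatic_reward(0.9, 0.5))  # True
```

Because the reward depends on specific sensory channels rather than on the flow of predictions, this is exactly the kind of internal but non-intrinsic motivation the section describes.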

Motivation systems of these kinds have been investigated by many researchers (e.g., see Breazeal, 2002 for a series of relevant examples). They are very good for simulating natural, complex, balanced behavior. However, they should not be considered intrinsic motivation systems, as they are defined based on measures related to specific sensory channels (energy level, number of faces seen).

DISCUSSION

In spite of the diversity of the computational approaches to intrinsic motivation that we presented, there is a point of convergence for all of them. Each of the described models defines a certain interpretation of intrinsic motivation in terms of properties of the flow of sensorimotor values and of its relation to the knowledge and know-how of the system, independently of the meaning of the sensory channels involved. This definition contrasts greatly with definitions based on behavioral observation (activities with no apparent goal except the activity itself) and may at first seem non-intuitive, as its behavioral consequences can only be explored through computational modeling and robotic experiments. Moreover, simple variants of these intrinsic motivation systems will not push a system towards exploration (e.g., FM, CM or StabM will push a robot to stand still), but we believe it is formally more coherent to conceptualize them also as intrinsic motivations, even if some psychologists would not do so. In fact, we believe that this kind of systematic computational approach to intrinsic motivation can play a crucial role in organizing the debate around their very definition, as well as their role in behavior, learning and development, in particular because it permits discussing hypotheses on a clearly defined common ground.

The table in Figure 7 presents all the models discussed in this paper and the families to which they belong (intrinsic vs. extrinsic, adaptive vs. fixed, knowledge-based, competence-based or morphological, information-theoretic or predictive, homeostatic vs. heterostatic). For each model we give a rough estimation of its exploration potential (how likely such a motivation is to lead to exploratory and investigation behaviors) and of its organization potential (how likely such a motivation is to lead to structured and organized behavior). We also estimate the computational cost and the number of computational models existing so far in each category. This table helps clarify the landscape of intrinsic motivation models, show the

Figure 7. This table presents all the models discussed in this paper and the families to which they belong. For each model we give a rough estimation of its exploration potential (how likely such a motivation is to lead to exploratory and investigation behaviors) and of its organization potential (how likely such a motivation is to lead to structured and organized behavior). We also estimate the computational cost and the number of existing computational models for each category.

Homeostatic (-) vs. heterostatic (+) | Motivation | Exploration potential | Organization potential | Computational cost | Existing models

Internal, intrinsic, adaptive, knowledge-based, information-theoretic:
  + UM     ***   *     ***   **
  + IGM    ***   ***   ***   **
  + DSM    **    ***   ***   *
  - DFM    *     ***   ***   *
Internal, intrinsic, adaptive, knowledge-based, predictive:
  + NM     ***   *     *     ***
  - ILNM   **    **    *     **
  + LPM    ***   ***   **    **
  + SM     **    **    **    *
  - FM     *     ***   **    **
Internal, intrinsic, adaptive, competence-based:
  + IM     ***   *     **    *
  + CPM    ***   ***   **    *
  - CM     *     ***   **    *
Internal, intrinsic, fixed, morphological:
  - SyncM  *     ***   **    **
  - StabM  *     ***   *     **
  + VarM   ***   *     *     *
Internal, extrinsic:
  - SocM   /     /     *     ***
  - EnerM  /     /     *     ***


What is intrinsic motivation? A typology of computational approaches

Pierre-Yves Oudeyer1,2,* and Frederic Kaplan3

1. Sony Computer Science Laboratory Paris, Paris, France2. INRIA Bordeaux-Sud-Ouest, France3. Ecole Polytechnique Federale de Lausanne, EPFL – CRAFT, Lausanne, Switzerland

Edited by: Max Lungarella, University of Zurich, Switzerland

Reviewed by: Jeffrey L. Krichmar, The Neurosciences Institute, USA; Cornelius Weber, Johann Wolfgang Goethe University, Germany

Intrinsic motivation, centrally involved in spontaneous exploration and curiosity, is a crucial concept in developmental psychology. It has been argued to be a crucial mechanism for open-ended cognitive development in humans, and as such has gathered growing interest from developmental roboticists in recent years. The goal of this paper is threefold. First, it provides a synthesis of the different approaches to intrinsic motivation in psychology. Second, by interpreting these approaches in a computational reinforcement learning framework, we argue that they are not operational and are even sometimes inconsistent. Third, we set the ground for a systematic operational study of intrinsic motivation by presenting a formal typology of possible computational approaches. This typology is partly based on existing computational models, but also presents new ways of conceptualizing intrinsic motivation. We argue that this kind of computational typology might be useful for opening new avenues for research both in psychology and developmental robotics.

Keywords: intrinsic motivation, cognitive development, reward, reinforcement learning, exploration, curiosity, computational modeling, artificial intelligence, developmental robotics

INTRODUCTION

There exists a wide diversity of motivation systems in living organisms, and in humans in particular. For example, there are systems that push the organism to maintain certain levels of chemical energy, involving the ingestion of food, or systems that push the organism to maintain its temperature or its physical integrity within a zone of viability. Inspired by these kinds of motivation and their understanding by (neuro-)ethologists, roboticists have built machines endowed with similar systems, with the aim of providing them with autonomy and properties of life-like intelligence (Arkin, 2005). For example, sowbug-inspired robots (Endo and Arkin, 2001), praying mantis robots (Arkin et al., 1998) and dog-like robots (Fujita et al., 2001) have been constructed.

Some animals, and this is most prominent in humans, also have more general motivations that push them to explore, manipulate or probe their environment, fostering curiosity and engagement in playful and new activities. This kind of motivation, which is called intrinsic motivation by psychologists (Ryan and Deci, 2000), is paramount for sensorimotor and cognitive development throughout the lifespan. There is a vast literature in psychology that explains why it is essential for cognitive growth and organization, and investigates the actual potential cognitive processes underlying intrinsic motivation (Berlyne, 1960; Csikszentmihalyi, 1991; Deci and Ryan, 1985; Ryan and Deci, 2000; White, 1959). This has gathered the interest of a growing number of researchers in developmental robotics in recent years, and several computational models have been developed (see Barto et al., 2004; Oudeyer et al., 2007 for reviews).

However, the very concept of intrinsic motivation has never really been consistently and critically discussed from a computational point of view. It has been used intuitively by many authors without asking what it really means. Thus, the first objective and contribution of this paper is to present an overview of this concept in psychology, followed by a critical reinterpretation in computational terms. We show that the definitions provided in psychology are actually unsatisfying. As a consequence, we will set the ground for a systematic operational study of intrinsic motivation by presenting a typology of possible computational approaches, and discuss whether it is possible or useful to give a single general computational definition of intrinsic motivation. The typology that we will present is partly based on existing computational models, but also presents new ways of conceptualizing intrinsic motivation. We will try to focus on how these models relate to each other and propose a classification into broad but distinct categories.

INTRINSIC MOTIVATION FROM THE PSYCHOLOGIST'S POINT OF VIEW

Intrinsic motivation and instrumentalization

According to Ryan and Deci (2000, p. 56),

Intrinsic motivation is defined as the doing of an activity for its inherent satisfaction rather than for some separable consequence. When intrinsically motivated, a person is moved to act for the fun or challenge entailed rather than because of external products, pressures, or rewards.

Intrinsic motivation is clearly visible in young infants, who consistently try to grasp, throw, bite, squash or shout at the new objects they encounter. Even if it becomes less prominent as they grow, human adults are still often intrinsically motivated while they play crosswords, make paintings, do gardening or

*Correspondence: Pierre-Yves Oudeyer, Sony Computer Science Laboratory Paris, 6 rue Amyot, 75005 Paris, France. e-mail: [email protected]
Received: 06 September 2007; paper pending published: 09 October 2007; accepted: 27 October 2007; published online: 02 November 2007.
Citation: Front. Neurorobot. (2007) 1: 6. doi: 10.3389/neuro.12.006.2007
Copyright © 2007 Oudeyer and Kaplan. This is an open-access article subject to an exclusive license agreement between the authors and the Frontiers Research Foundation, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are credited.


3.1 Active Model Babbling

A particular exploration strategy available in Explauto is called Active Model Babbling [Forestier and Oudeyer, 2016]. With this strategy, different motor and sensory spaces are given to the robot, e.g. the space of motor parameters of a robotic arm, and sensory spaces representing the movement of each object in the scene, and the robot will learn behavioral primitives to explore and reach new effects in the different sensory spaces. The exploration of one sensory space can give information in the other spaces, and the robot can learn that some objects (e.g. a joystick) can be used as a tool to control other objects (e.g. another robot) in a hierarchical manner. This strategy was shown to be very efficient for exploring multiple task/goal spaces based on the monitoring of learning progress in the different spaces. A Jupyter notebook implementing and studying this strategy is available online at this address: http://nbviewer.jupyter.org/github/sebastien-forestier/ExplorationAlgorithms/blob/master/main.ipynb.
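The core of this strategy can be pictured with a toy, self-contained sketch (this is not the Explauto API; the space names, the error-decay model and the epsilon-greedy choice rule are all illustrative assumptions): monitor empirical learning progress per sensory space, and babble goals in the space where progress is currently highest.

```python
import random

class SensorySpace:
    """Toy sensory/goal space: attempting goals in a learnable space reduces
    its prediction error; learning progress is the recent error decrease."""
    def __init__(self, name, learnable=True):
        self.name = name
        self.learnable = learnable
        self.errors = [1.0]

    def attempt_goal(self):
        error = self.errors[-1]
        if self.learnable:
            error *= 0.95  # toy stand-in for model improvement
        self.errors.append(error)

    def progress(self, window=10):
        recent = self.errors[-window:]
        return recent[0] - recent[-1]

def active_model_babbling(spaces, steps=300, explore=0.1):
    """Pick, at each step, the space with maximal learning progress
    (epsilon-greedy, so stale estimates get refreshed), then babble a goal."""
    random.seed(0)  # deterministic for illustration
    time_in = {s.name: 0 for s in spaces}
    for _ in range(steps):
        if random.random() < explore:
            space = random.choice(spaces)
        else:
            space = max(spaces, key=lambda s: s.progress())
        space.attempt_goal()
        time_in[space.name] += 1
    return time_in
```

Run on a learnable "hand" space, a learnable "joystick" space and an unlearnable "distractor" space, exploration time concentrates on the learnable spaces, which is the monitoring-of-learning-progress effect described above.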

3.2 Interaction

In the Active Model Babbling exploration strategy, guidance from human peers can be integrated at different levels. When the robot is set up in a compliant mode, human peers can demonstrate robotic movements that the robot stores and can reuse later in its autonomous exploration, by executing those movements again or variations of them, as in [Nguyen and Oudeyer, 2012]. Also, users can demonstrate the function of objects or create situations (e.g. use the joystick controlling the second robot to grab a ball) that the robot will try to reproduce by exploring new motor primitives. Finally, users can also indicate how interesting it is to explore some task/goal spaces (e.g. moving the blue ball); this value will be combined with the intrinsic motivation of the robot and will push it to explore those spaces.
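The last point can be pictured with a minimal sketch (the function name, the linear blend and the alpha weight are our assumptions, not part of Explauto): the interest of a task/goal space blends the robot's empirical learning progress with the value communicated by the user.

```python
def combined_interest(learning_progress, user_value=None, alpha=0.5):
    """Blend intrinsic interest (empirical learning progress in a task/goal
    space) with an optional user-communicated value for that space.
    alpha weights the social signal; without user input, interest is purely
    intrinsic. Names and blend rule are illustrative, not the Explauto API."""
    if user_value is None:
        return learning_progress
    return alpha * user_value + (1.0 - alpha) * learning_progress
```

With alpha = 0.5, a space the user flags as interesting (user_value = 1.0) keeps attracting exploration even while its measured learning progress is still low.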

4 Demonstration Setup

The demonstration shows how the Explauto library allows a Poppy Torso to explore its environment and use social guidance to understand the interaction between the different objects in the environment. The robot interacts with several objects, including a joystick that controls another robot that can be used as a tool to move other objects. Naive humans can interact with the robot by driving its exploration towards interesting objects, or by demonstrating useful movements of the arm or the joystick.

Figure 2: Demonstration setup.

References

[Forestier and Oudeyer, 2016] Forestier, S. and Oudeyer, P.-Y. (2016). Modular active curiosity-driven discovery of tool use. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Korea.

[Lapeyre et al., 2014] Lapeyre, M., Rouanet, P., Grizou, J., Nguyen, S., Depraetre, F., Le Falher, A., and Oudeyer, P.-Y. (2014). Poppy Project: Open-Source Fabrication of 3D Printed Humanoid Robot for Science, Education and Art. In Digital Intelligence 2014, page 6, Nantes, France.

[Moulin-Frier et al., 2014] Moulin-Frier, C., Rouanet, P., Oudeyer, P.-Y., and others (2014). Explauto: an open-source Python library to study autonomous exploration in developmental robotics. In ICDL-Epirob: International Conference on Development and Learning, Epirob.

[Nguyen and Oudeyer, 2012] Nguyen, S. and Oudeyer, P.-Y. (2012). Active choice of teachers, learning strategiesand goals for a socially guided intrinsic motivation learner. Paladyn, 3(3):136–146.


Active curiosity-driven exploration as a means to learn models of the world dynamics/affordances?

→ Searching for maximal novelty or uncertainty or entropy will not be efficient at all!


[Figure: toy example with 4 sensorimotor activities. Top: errors in prediction in the 4 activities over time. Bottom: % of time spent in each activity over time, based on the principle of maximizing learning progress (error reduction).]

The Learning Progress hypothesis

•  Explore tasks that currently maximize empirical learning progress (Oudeyer et al., 2004; 2007; Gottlieb et al., 2013; Oudeyer et al., 2016)

•  Optimal active learning of multiple tasks with concave learning curves in the Strategic Student Learning framework (Lopes and Oudeyer, 2012)

•  This achieves automated and intrinsically motivated curriculum learning of multiple tasks and/or models (Oudeyer et al., 2007; Oudeyer and Baranes, 2013; Forestier et al., 2016)
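This contrast can be made concrete with a toy simulation (the four activities, their learning rates and the window size are invented for illustration): a strategy maximizing raw prediction error locks onto the unlearnable activity, while a strategy maximizing empirical learning progress spends its time where errors are actually shrinking.

```python
import random

def simulate(strategy, steps=400):
    """Four toy activities: already mastered, easily learnable, slowly
    learnable, and unlearnable (error never shrinks). Each step, `strategy`
    picks an activity from the per-activity error histories."""
    random.seed(1)
    rates = [0.0, 0.05, 0.01, 0.0]   # relative error reduction per practice
    errors = [0.05, 0.9, 0.9, 1.0]   # current prediction error per activity
    history = [[e] for e in errors]
    time_spent = [0, 0, 0, 0]
    for _ in range(steps):
        i = strategy(history)
        errors[i] -= rates[i] * errors[i]
        history[i].append(errors[i])
        time_spent[i] += 1
    return time_spent

def max_error(history):
    """'Seek maximal error/novelty': always pick the highest current error."""
    return max(range(4), key=lambda i: history[i][-1])

def max_progress(history, window=10):
    """'Seek maximal learning progress': recent error decrease per activity,
    with a bit of random exploration to keep the progress estimates alive."""
    if random.random() < 0.2:
        return random.randrange(4)
    def lp(h):
        recent = h[-window:]
        return recent[0] - recent[-1]
    return max(range(4), key=lambda i: lp(history[i]))
```

Here simulate(max_error) spends all 400 steps on the unlearnable activity, while simulate(max_progress) allocates most steps to the two learnable activities, producing the staged curriculum sketched in the figure above.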


(A) Intrinsic motivation to search for (intermediate) novelty or complexity → state curiosity: the experience of (intermediate) novelty or complexity, acting as an intrinsic reward → leads to learning and memory retention → fosters further search for novelty.
(Walti et al., 2001; Kang et al., 2009; Gruber et al., 2014; Stahl and Feigenson, 2015)

(B) The Learning Progress hypothesis: curiosity = active learning. Intrinsic motivation to search for learning progress (improvement of prediction errors) → state curiosity: the experience of learning progress, acting as an intrinsic reward → leads to learning and memory retention → fosters further search for progress, closing a positive feedback loop.
(Oudeyer et al., 2007; Oudeyer, Gottlieb, Lopes, 2016)


Examples of model architectures


[Figure: Space of Controllers and Task Space (= Space of Effects). Policies π_θ1 … π_θ4 in the controller space are linked to effects in the task space by a forward model ((Context, Movement) → Effect) and an inverse model; goals and their regions lie in the reachable space of effects T.]

54 A. Baranes, P.-Y. Oudeyer / Robotics and Autonomous Systems 61 (2013) 49–73

We introduce a measure of competence for a given goal-reaching attempt as dependent on two metrics: the similarity between the point y_f attained in the task space when the reaching attempt has terminated and the actual goal y_g; and the respect of constraints ρ. These conditions are represented by a cost, or competence, function C defined in [−1, 0], such that the higher C(y_g, y_f, ρ) is, the more a reaching attempt will be considered as efficient. From this definition, we set a measure of competence γ_{y_g} directly linked with the value of C(y_g, y_f, ρ):

\gamma_{y_g} =
\begin{cases}
C(y_g, y_f, \rho) & \text{if } C(y_g, y_f, \rho) \le \varepsilon_{sim} < 0 \\
0 & \text{otherwise}
\end{cases}

where ε_sim is a tolerance factor such that C(y_g, y_f, ρ) > ε_sim corresponds to a goal reached. We note that a high value of γ_{y_g} (i.e. close to 0) represents a system that is competent to reach the goal y_g while respecting constraints ρ. A typical instantiation of C, without constraints ρ, is defined as C(y_g, y_f, ∅) = −‖y_g − y_f‖², and is the direct transposition of prediction error in RIAC [16,51] to the task space in SAGG-RIAC. Yet, this competence measure might take some other forms in the SAGG-RIAC architecture, such as the variants explored in the experiments below.
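The typical instantiation above can be transcribed directly (a sketch: no constraints ρ, Euclidean distance, and an arbitrary tolerance value):

```python
def competence(y_g, y_f, eps_sim=-0.01):
    """Competence gamma for a goal-reaching attempt, using the typical
    instantiation C(y_g, y_f) = -||y_g - y_f||^2 (task space assumed
    normalized so C stays in [-1, 0]); attempts whose cost lies above the
    tolerance eps_sim count as 'goal reached' and score 0."""
    C = -sum((g - f) ** 2 for g, f in zip(y_g, y_f))
    return C if C <= eps_sim else 0.0
```

For example, competence((0.0, 0.0), (1.0, 0.0)) gives -1.0 (a clear miss), while a near-perfect reach falls above eps_sim and is clamped to 0.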

2.4.2. Measure of competence during a reaching attempt or during goal-directed optimization

When the system exploits its previously learnt models to reach a goal y_g, using a computed π_θ through adequate local regression, or when it is using the low-level goal-directed optimization to optimize the best current π_θ to reach a self-generated goal y_g, it does not only collect data allowing to measure its competence to reach y_g: since the computed π_θ might lead to a different effect y_e ≠ y_g, it also allows the collection of new data for improving the inverse model and the measure of competence to reach other goals in the locality of y_e. This allows all the experiments of the robot to be used to update the model of competences over the space of parameterized goals.

2.4.3. Definition of local competence progress

The active goal self-generation and self-selection relies on a feedback linked with the notion of competence introduced above, and more precisely on the monitoring of the progress of local competences. We first need to define this notion of local competence. Let us consider a subspace called a region R ⊂ Y. Then, let us consider different measures of competence γ_{y_i} computed for different attempted goals y_i ∈ R, in a time window consisting of the ζ last attempted goals. For the region R, we can compute a measure of competence Γ that we call a local measure, such that:

\Gamma = \frac{\sum_{y_j \in R} \gamma_{y_j}}{|R|} \qquad (1)

with |R| the cardinal of R.

Let us now consider different regions R_i of Y such that R_i ⊂ Y and ∪_i R_i = Y (initially, there is only one region, which is then progressively and recursively split; see below and Fig. 2). Each R_i contains the attempted goals {y_{i_1,t_1}, y_{i_2,t_2}, ..., y_{i_k,t_k}}_{R_i} and the corresponding competences obtained {γ_{y_{i_1,t_1}}, γ_{y_{i_2,t_2}}, ..., γ_{y_{i_k,t_k}}}_{R_i}, indexed by their relative time order of experimentation t_1 < t_2 < ... < t_k | t_{n+1} = t_n + 1 inside this precise subspace R_i (the t_i are not absolute times, but integer indexes of relative order in the given region).

Fig. 2. Task space and example of regions and subregions split during the learning process according to the competence level. Each region displays its competence level over time, a measure which is used for the computation of the interest according to Eq. (2).

An estimation of interest is computed for each region R_i. The interest interest_i of a region R_i is described as the absolute value of the derivative of local competences inside R_i, hence the amplitude of local competence progress, over a sliding time window of the ζ most recent goals attempted inside R_i (Eq. (2)):

\mathrm{interest}_i = \frac{\left|\, \sum_{j=|R_i|-\zeta}^{|R_i|-\zeta/2} \gamma_{y_j} \;-\; \sum_{j=|R_i|-\zeta/2}^{|R_i|} \gamma_{y_j} \,\right|}{\zeta} \qquad (2)
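Eq. (2) compares the summed competences of the older half and the newer half of the ζ most recent attempts in a region. A direct transcription (naming is ours; ζ is assumed even, matching the paper's ζ/2 split):

```python
def region_interest(gammas, zeta):
    """Interest of a region per Eq. (2): |sum of the older half - sum of the
    newer half| of the zeta most recent competence measures, divided by zeta.
    `gammas` lists the competences of attempted goals in time order."""
    window = gammas[-zeta:]
    half = len(window) // 2
    return abs(sum(window[:half]) - sum(window[half:])) / zeta
```

A region whose competences jumped from -1 to 0 over a window of 4 gets interest 0.5, while a stagnating region with constant competences gets 0, which is why exploration moves away from both mastered and unlearnable regions.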

By using a derivative, the interest considers the variation of competences, and by using an absolute value, it considers cases of both increasing and decreasing competences. In SAGG-RIAC, we will use the term competence progress in its general meaning to denote both this increase and decrease of competences.

An increasing competence signifies that the expected competence gain in R_i is important. Therefore, selecting new goals in regions of high competence progress could potentially bring both a high information gain for the learned model, and also drive the reaching of goals not previously achieved.

Depending on the starting position and the potential evolution of the environment or of the body (e.g. the breaking of a body part), a decrease of competences inside already well-reached regions can arise. In this case, the system should be able to focus again on these regions in order to at least verify the possibility of re-establishing a high level of competence inside them. This explains why it is useful to consider the absolute value of the competence progress, as shown in Eq. (2).

Using a sliding window in order to compute the value of interest prevents the system from keeping each measure of competence in its memory, and thus limits the storage resources needed by the core of the SAGG-RIAC architecture.

2.4.4. Goal self-generation using the measure of interest

Using the previous description of interest, the goal self-generation and self-selection mechanism carries out two different processes:
1. Splitting the space Y where goals are chosen into subspaces, according to heuristics that maximally discriminate areas according to their levels of interest.
2. Selecting the next goal to perform.
Such a mechanism has been described in the RIAC algorithm introduced in [51], but was previously applied to the actuator space S rather than to the goal/task space Y as is done in SAGG-RIAC. Here, we use the same kind of method, such as a recursive split of the space, each split being triggered once a predefined maximum number of goals g_max has been attempted inside. Each split is performed such that it maximizes the difference of the interest measure described above in the two resulting subspaces. This allows the easy separation of areas of differing interest and therefore of differing reaching difficulty. More precisely, here the split of a region R_n into R_{n+1} and R_{n+2} is done by selecting, among m randomly generated splits, a split dimension j ∈ |Y| and then a position v_j such that:
• All the y_i of R_{n+1} have a jth component smaller than v_j;
• All the y_i of R_{n+2} have a jth component higher than v_j;
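The split selection can be sketched as follows (a self-contained toy: the interest computation from Eq. (2) is inlined, and sampling candidate splits uniformly is our reading of "m randomly generated splits"):

```python
import random

def interest(gammas, zeta):
    """Eq. (2): amplitude of competence progress over the zeta last attempts."""
    window = gammas[-zeta:]
    half = len(window) // 2
    return abs(sum(window[:half]) - sum(window[half:])) / zeta

def split_region(goals, gammas, zeta=4, m=20):
    """Among m random (dimension j, position v_j) candidates, keep the split
    of the region into {y : y[j] <= v_j} / {y : y[j] > v_j} that maximizes
    the difference of interest between the two subregions."""
    random.seed(0)  # deterministic for illustration
    dims = len(goals[0])
    best, best_gap = None, -1.0
    for _ in range(m):
        j = random.randrange(dims)
        lo = min(g[j] for g in goals)
        hi = max(g[j] for g in goals)
        v = random.uniform(lo, hi)
        left = [c for g, c in zip(goals, gammas) if g[j] <= v]
        right = [c for g, c in zip(goals, gammas) if g[j] > v]
        if not left or not right:
            continue  # degenerate split: keep both subregions non-empty
        gap = abs(interest(left, zeta) - interest(right, zeta))
        if gap > best_gap:
            best, best_gap = (j, v), gap
    return best
```

split_region returns the (dimension, position) pair whose two subregions differ most in interest, which is how areas of differing reaching difficulty get separated.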


Curiosity-driven active Motor Exploration (active selection of parameterized motor programs)

Cost function for driving motor exploration = dynamically generating a learning curriculum

Motor space

Page 26: Ac#ve&Explora#on&Mechanisms&in& …

Space of Controllers Task Space = Space of Effects

Reachable Space of Effects

Forward Model

Inverse Model

T

⇡✓1

⇡✓2 ⇡✓3

⇡✓4

��1 , R�1

��2 , R�2

��3 , R�3

(Context, Movement) "

Effect

54 A. Baranes, P.-Y. Oudeyer / Robotics and Autonomous Systems 61 (2013) 49–73

We introduce a measure of competence for a given goal reachingattempt as dependent on two metrics: the similarity between thepoint in the task space yf attained when the reaching attempt hasterminated, and the actual goal yg ; and the respect of constraints⇢. These conditions are represented by a cost, or competence,functionC defined in [�1; 0], such that higherC(yg , yf , ⇢)will be,the more a reaching attempt will be considered as efficient. Fromthis definition, we set a measure of competence �yg directly linkedwith the value of C(yg , yf , ⇢):

�yg =⇢C(yg , yf , ⇢) if C(yg , yf , ⇢) "sim < 00 otherwise

where "sim is a tolerance factor such that C(yg , yf , ⇢) > "simcorresponds to a goal reached. We note that a high value of �yg(i.e. close to 0) represents a system that is competent to reach thegoal yg while respecting constraints ⇢. A typical instantiation ofC , without constraints ⇢, is defined as C(yg , yf , ;) = �kyg � yf k2,and is the direct transposition of prediction error in RIAC [16,51] tothe task space in SAGG-RIAC. Yet, this competence measure mighttake some other forms in the SAGG-RIAC architecture, such as thevariants explored in the experiments below.

2.4.2. Measure of competence during a reaching attempt or duringgoal-directed optimization

When the system exploits its previously learnt models to reacha goal yg , using a computed ⇡✓ through adequate local regression,or when it is using the low-level goal-directed optimization tooptimize the best current ⇡✓ to reach a self-generated goal yg , itdoes not only collect data allowing to measure its competence toreach yg , but since the computed ⇡✓ might lead to a different effectye 6= yg , it also allows to collect new data for improving the inversemodel and the measure of competence to reach other goals in thelocality of ye. This allows to use all experiments of the robot toupdate the model of competences over the space of parameterizedgoals.

2.4.3. Definition of local competence progressThe active goal self-generation and self-selection relies on

a feedback linked with the notion of competence introducedabove, and more precisely on the monitoring of the progress oflocal competences. We first need to define this notion of localcompetence. Let us consider a subspace called a region R ⇢Y . Then, let us consider different measures of competence �yicomputed for different attempted goals yi 2 R, in a time windowconsisting of the ⇣ last attempted goals. For the region R, we cancompute a measure of competence � that we call a local measuresuch that:

� =

0

B@

Pyj2R

(�yj)

|R|

1

CA (1)

with |R|, cardinal of R.Let us now consider different regions Ri of Y such that Ri ⇢

Y ,S

i Ri = Y (initially, there is only one region which is thenprogressively and recursively split; see below and see Fig. 2).Each Ri contains attempted goals {yi1,t1 , yi2,t2 , . . . , yik,tk}Ri andcorresponding competences obtained {�yi1,t1

, �yi2,t2, . . . , �yik,tk

}Ri ,indexed by their relative time order of experimentation t1 < t2 <· · · < tk|tn+1 = tn + 1 inside this precise subspace Ri (ti are notthe absolute time, but integer indexes of relative order in the givenregion).

An estimation of interest is computed for each region Ri. Theinterest interesti of a region Ri is described as the absolute value ofthe derivative of local competences inside Ri, hence the amplitude of

Fig. 2. Task space and example of regions and subregions split during the learningprocess according to the competence level. Each region displays its competencelevel over time,measurewhich is used for the computation of the interest accordingto Eq. (2).

local competence progress, over a sliding time window of the ⇣ morerecent goals attempted inside Ri (Eq. (2)):

interesti =

������

0

@|Ri|� ⇣

2Pj=|Ri|�⇣

�yj

1

A �

0

@|Ri|P

j=|Ri|� ⇣2

�yj

1

A

������

⇣. (2)

By using a derivative, the interest considers the variation ofcompetences, and by using an absolute value, it considers cases ofincreasing and decreasing competences. In SAGG-RIAC, we will usethe term competence progress with its general meaning to denotethis increase and decrease of competences.

An increasing competence signifies that the expected compe-tence gain in Ri is important. Therefore, potentially, selecting newgoals in regions of high competence progress could bring both ahigh information gain for the learned model, and also drive thereaching of not previously achieved goals.

Depending on the starting position and potential evolution ofthe environment or of the body (e.g. breaking of a body part), adecrease of competences inside already well-reached regions canarise. In this case, the system should be able to focus again in theseregions in order to at least verify the possibility to re-establish ahigh level of competence inside. This explains the usefulness toconsider the absolute value of the competence progress as shownin Eq. (2).

Using a slidingwindow in order to compute the value of interestprevents the system from keeping each measure of competencein its memory, and thus limits the storage resource needed by thecore of the SAGG-RIAC architecture.

2.4.4. Goal self-generation using the measure of interestUsing the previous description of interest, the goal self-

generation and self-selection mechanism carries out two differentprocesses:1. Splitting of the space Y where goals are chosen, into subspaces,

according to heuristics that allows to maximally discriminateareas according to their levels of interest.

2. Selecting the next goal to perform.Such a mechanism has been described in the RIAC algorithmintroduced in [51], but was previously applied to the actuatorspace S rather than to the goal/task space Y as is done in SAGG-RIAC. Here, we use the same kind of methods such as a recursivesplit of the space, each split being triggered once a predefinedmaximum number of goals gmax has been attempted inside. Eachsplit is performed such that it maximizes the difference of theinterest measure described above in the two resulting subspaces.This allows the easy separation of areas of differing interest andtherefore of differing reaching difficulty. More precisely, here thesplit of a region Rn into Rn+1 and Rn+2 is done by selecting amongm randomly generated splits, a split dimension j 2 |Y | and then aposition vj such that:• All the yi of Rn+1 have a jth component smaller than vj;• All the yi of Rn+2 have a jth component higher than vj;

54 A. Baranes, P.-Y. Oudeyer / Robotics and Autonomous Systems 61 (2013) 49–73

We introduce a measure of competence for a given goal reachingattempt as dependent on two metrics: the similarity between thepoint in the task space yf attained when the reaching attempt hasterminated, and the actual goal yg ; and the respect of constraints⇢. These conditions are represented by a cost, or competence,functionC defined in [�1; 0], such that higherC(yg , yf , ⇢)will be,the more a reaching attempt will be considered as efficient. Fromthis definition, we set a measure of competence �yg directly linkedwith the value of C(yg , yf , ⇢):

�yg =⇢C(yg , yf , ⇢) if C(yg , yf , ⇢) "sim < 00 otherwise

where "sim is a tolerance factor such that C(yg , yf , ⇢) > "simcorresponds to a goal reached. We note that a high value of �yg(i.e. close to 0) represents a system that is competent to reach thegoal yg while respecting constraints ⇢. A typical instantiation ofC , without constraints ⇢, is defined as C(yg , yf , ;) = �kyg � yf k2,and is the direct transposition of prediction error in RIAC [16,51] tothe task space in SAGG-RIAC. Yet, this competence measure mighttake some other forms in the SAGG-RIAC architecture, such as thevariants explored in the experiments below.

2.4.2. Measure of competence during a reaching attempt or during goal-directed optimization

When the system exploits its previously learned models to reach a goal y_g, using a policy π_θ computed through adequate local regression, or when it uses the low-level goal-directed optimization to improve the best current π_θ for a self-generated goal y_g, it does not only collect data for measuring its competence to reach y_g: since the computed π_θ may lead to a different effect y_e ≠ y_g, it also collects new data for improving the inverse model and the measure of competence for other goals in the locality of y_e. This allows all experiments of the robot to be used to update the model of competences over the space of parameterized goals.

2.4.3. Definition of local competence progress

The active goal self-generation and self-selection rely on a feedback linked with the notion of competence introduced above, and more precisely on the monitoring of the progress of local competences. We first need to define this notion of local competence. Let us consider a subspace called a region R ⊂ Y, and different measures of competence Γ_{y_i} computed for different attempted goals y_i ∈ R, in a time window consisting of the ζ last attempted goals. For the region R, we can compute a measure of competence Γ that we call a local measure, such that:

$$\Gamma = \frac{\sum_{y_j \in R} \Gamma_{y_j}}{|R|} \qquad (1)$$

with |R| the cardinal of R.

Let us now consider different regions R_i of Y such that R_i ⊂ Y and ∪_i R_i = Y (initially, there is only one region, which is then progressively and recursively split; see below and see Fig. 2). Each R_i contains the attempted goals {y_{i_1,t_1}, y_{i_2,t_2}, …, y_{i_k,t_k}}_{R_i} and the corresponding obtained competences {Γ_{y_{i_1,t_1}}, Γ_{y_{i_2,t_2}}, …, Γ_{y_{i_k,t_k}}}_{R_i}, indexed by their relative time order of experimentation t_1 < t_2 < ⋯ < t_k, with t_{n+1} = t_n + 1 inside this particular subspace R_i (the t_i are not absolute times, but integer indexes of relative order in the given region).

An estimation of interest is computed for each region R_i. The interest interest_i of a region R_i is described as the absolute value of the derivative of the local competences inside R_i, hence the amplitude of local competence progress, over a sliding time window of the ζ most recent goals attempted inside R_i (Eq. (2)):

$$\mathit{interest}_i = \frac{\left|\; \sum_{j=|R_i|-\zeta}^{|R_i|-\zeta/2} \Gamma_{y_j} \;-\; \sum_{j=|R_i|-\zeta/2}^{|R_i|} \Gamma_{y_j} \;\right|}{\zeta} \qquad (2)$$

Fig. 2. Task space and example of regions and subregions split during the learning process according to the competence level. Each region displays its competence level over time, a measure which is used for the computation of the interest according to Eq. (2).

By using a derivative, the interest considers the variation of competences, and by using an absolute value, it considers both increasing and decreasing competences. In SAGG-RIAC, we will use the term competence progress with its general meaning, to denote both this increase and this decrease of competences.

An increasing competence signifies that the expected competence gain in R_i is important. Therefore, selecting new goals in regions of high competence progress could potentially bring both a high information gain for the learned model, and drive the reaching of goals not previously achieved.

Depending on the starting position and the potential evolution of the environment or of the body (e.g. the breaking of a body part), a decrease of competences inside already well-reached regions can arise. In this case, the system should be able to focus again on these regions, in order to at least verify whether a high level of competence can be re-established inside them. This explains the usefulness of considering the absolute value of the competence progress, as in Eq. (2).

Using a sliding window to compute the value of interest prevents the system from keeping every measure of competence in memory, and thus limits the storage resources needed by the core of the SAGG-RIAC architecture.
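Eq. (2) translates directly into code. The sketch below is illustrative only (not code from the paper); the window size ζ = 20 and the convention of returning 0 before ζ attempts exist are assumptions:

```python
def interest(competences, zeta=20):
    """Eq. (2): absolute difference between the summed competences of the older
    half and of the more recent half of the zeta last attempts, divided by zeta.
    `competences` lists the Gamma values of a region in order of experimentation."""
    if len(competences) < zeta:
        return 0.0  # not enough attempts yet to estimate competence progress
    window = competences[-zeta:]
    older, recent = window[: zeta // 2], window[zeta // 2:]
    return abs(sum(older) - sum(recent)) / zeta
```

A region whose competences climb from −1 to 0 within the window gets a high interest; a region stuck at a constant competence gets interest 0, whatever that constant is.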

2.4.4. Goal self-generation using the measure of interest

Using the previous description of interest, the goal self-generation and self-selection mechanism carries out two different processes:
1. Splitting the space Y where goals are chosen into subspaces, according to heuristics that maximally discriminate areas according to their levels of interest.

2. Selecting the next goal to perform.

Such a mechanism has been described in the RIAC algorithm introduced in [51], but was previously applied to the actuator space S rather than to the goal/task space Y, as is done in SAGG-RIAC. Here we use the same kind of method: a recursive split of the space, each split being triggered once a predefined maximum number of goals g_max has been attempted inside a region. Each split is performed so as to maximize the difference of the interest measure described above between the two resulting subspaces. This allows an easy separation of areas of differing interest, and therefore of differing reaching difficulty. More precisely, the split of a region R_n into R_{n+1} and R_{n+2} is done by selecting, among m randomly generated splits, a split dimension j ∈ {1, …, |Y|} and then a position v_j such that:
• all the y_i of R_{n+1} have a jth component smaller than v_j;
• all the y_i of R_{n+2} have a jth component higher than v_j;
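This split selection can be sketched as follows (an illustrative sketch, not the paper's implementation; the function name `best_split` and the simplified per-subset interest estimate are assumptions, and goals are assumed stored in order of experimentation):

```python
import random

def subset_interest(competences):
    # Simplified Eq. (2)-style estimate: amplitude of competence progress
    # between the older and the more recent half of the subset.
    if len(competences) < 2:
        return 0.0
    half = len(competences) // 2
    return abs(sum(competences[:half]) - sum(competences[half:])) / len(competences)

def best_split(goals, competences, m=50):
    """Among m random (dimension j, position v_j) candidates, return the candidate
    that maximizes the difference of interest between the two resulting subsets."""
    best, best_score = None, -1.0
    dims = len(goals[0])
    for _ in range(m):
        j = random.randrange(dims)
        lo, hi = min(g[j] for g in goals), max(g[j] for g in goals)
        v = random.uniform(lo, hi)
        left = [c for g, c in zip(goals, competences) if g[j] < v]
        right = [c for g, c in zip(goals, competences) if g[j] >= v]
        if not left or not right:
            continue  # a valid split must leave goals on both sides
        score = abs(subset_interest(left) - subset_interest(right))
        if score > best_score:
            best, best_score = (j, v), score
    return best  # (split dimension, split position), or None if no valid split
```

With goals whose competences improve along one dimension and stagnate along another, the returned split dimension separates the improving from the stagnating area.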

(active selection of parameterized RL problems)

Cost function for driving multi-task exploration = dynamically generating a learning curriculum

Optimization (policy learning) and transfer across tasks

Constraints over optimization and transfer (e.g. maturation)

26

Curiosity-driven active Goal Exploration

Goal space

Page 27

(5) Do short-term curiosity-driven exploration mechanisms have consequences on the long-term organization of learning? (= how can intrinsic motivation self-organize curriculum learning?)

Page 28

The Playground Experiments
(Oudeyer et al., IEEE Trans. EC 2007; Oudeyer and Smith, 2016)
https://www.youtube.com/watch?v=uAoNzHjzzys

Page 29

Functions:
• Autonomous learning of novel affordances and skills, e.g. object manipulation, before they are needed for external needs
• Self-organization of developmental trajectories, bootstrapping of communication

Developmental and evolutionary consequences

Topics in Cognitive Science (2016) 1–11. Copyright © 2016 Cognitive Science Society, Inc. All rights reserved. ISSN: 1756-8757 print / 1756-8765 online. DOI: 10.1111/tops.12196

How Evolution May Work Through Curiosity-Driven Developmental Process

Pierre-Yves Oudeyer (Inria and Ensta ParisTech, France) and Linda B. Smith (Department of Psychological and Brain Sciences, Indiana University)

Received 13 January 2014; received in revised form 4 February 2015; accepted 6 February 2015

Abstract

Infants’ own activities create and actively select their learning experiences. Here we review recent models of embodied information seeking and curiosity-driven learning and show that these mechanisms have deep implications for development and evolution. We discuss how these mechanisms yield self-organized epigenesis with emergent ordered behavioral and cognitive developmental stages. We describe a robotic experiment that explored the hypothesis that progress in learning, in and for itself, generates intrinsic rewards: The robot learners probabilistically selected experiences according to their potential for reducing uncertainty. In these experiments, curiosity-driven learning led the robot learner to successively discover object affordances and vocal interaction with its peers. We explain how a learning curriculum adapted to the current constraints of the learning system automatically formed, constraining learning and shaping the developmental trajectory. The observed trajectories in the robot experiment share many properties with those in infant development, including a mixture of regularities and diversities in the developmental patterns. Finally, we argue that such emergent developmental structures can guide and constrain evolution, in particular with regard to the origins of language.

Keywords: Development; Evolution; Curiosity; Infant active learning; Robotic modelling; Self-organization; Motor development; Speech development; Origins of language

1. Introduction

Learning experiences do not passively “happen” to infants. Rather, infants’ own activities create and select these experiences. Piaget (1952) described a pattern of infant activity that is highly illustrative of this point. He placed a rattle in a 4-month-old infant’s hands. As the infant moved the rattle, it would both come into sight and also make a

Correspondence should be sent to Pierre-Yves Oudeyer, Inria and Ensta ParisTech, 200, avenue de la Vieille Tour, 33405 Talence, France. E-mail: [email protected]

Autonomous acquisition of skill repertoires

Page 30

Regularities and diversity in developmental trajectories
(after Waddington's epigenetic landscape)

Page 31

Self-organization of vocal development
• Collaboration with D. K. Oller, Univ. Memphis, US
(Moulin-Frier, Nguyen and Oudeyer, Frontiers in Cognitive Science, 2013)

DIVA vocal tract model (Guenther)

Two-layer active learning:
1) Active choice: self-exploration vs. imitation
2) If self-exploration: active goal selection

Page 32

Emergent developmental stages
(Oller, 2000)

Page 33

Learning tool use (nested affordances) + using intrinsic motivation to learn a complex RL problem with rare rewards

Nested tool use: 2nd Best Demo Prize @ NIPS 2016 (in front of several GAFAs' demos)
https://www.youtube.com/watch?v=NOLAwD4ZTW0

Page 34

Strategy selection (Chen & Siegler, 2000)

Reinterpreting infant tool-use experiments? e.g. Chen and Siegler on strategy selection (see Forestier and Oudeyer, CogSci 2016; ICDL-Epirob 2016)

Page 35

Efficiency of LP-based curiosity-driven learning of skills

Page 36

Active learning of omnidirectional locomotion

[Plot: reaching error (distance) vs. number of actions (time steps, up to 10,000), comparing SAGG-RIAC, SAGG-Random, ACTUATOR-Random and ACTUATOR-RIAC.]

Figure 4: A robot can learn to walk just by exploring a sensorimotor space smartly. In the experiment, a progress-driven kernel controls the movement of the different motors of a four-legged robot (3 DOF/leg). For each motor, it chooses the period, the phase and the amplitude of a sinusoidal signal. The prediction system tries to predict the effect of the different parameter sets on the way the image captured by a camera placed on the robot's head is modified, which indirectly reflects the movement of its torso. At each iteration the kernel produces the values for the next parameter set in order to maximize the reduction of the prediction error.

Efficiency of active learning in high dimensions: control space [−1, 1]^24, task space [−1, 1]^3. Performance higher than more classical active learning algorithms in real sensorimotor spaces (non-stationary, non-homogeneous) (Baranes and Oudeyer, IEEE TAMD 2009; Robotics and Autonomous Systems, 2013).

https://www.youtube.com/watch?v=_HusNBLV7yM

Page 37

Learning omnidirectional control of soft/deformable objects (Autonomous Robots, 2014)

Perceptual/visual representations (IEEE TAMD, 2014; ICRA 2016; IROS 2016)

https://www.youtube.com/watch?v=m-desi46Uwk
https://www.youtube.com/watch?v=OWjLaGv33i0

Active scheduling of both tasks and learning strategies, through active selection of when to ask human teachers for help, which teacher to ask, and what to ask them (Paladyn Journal of Behavioral Robotics, 2013)

Page 38

Curiosity-driven active choice of teachers and modes of imitation
(Nguyen and Oudeyer, Paladyn Behav. Rob., 2013; Autonomous Robots, 2013)

Active choice: social vs. intrinsically motivated (IM) learning
Active choice: teacher
Active choice: imitation type
Active choice: goal

Page 39

Education tech application: automated generation of a personalized learning curriculum for human brains

KidLearn project: personalization of teaching sequences (curriculum) in Intelligent Tutoring Systems (Journal of Educational Data Mining, 2015) + contract with the ITWell company and transfer to the Skillogs company

• More students reached, and more succeeded at the more difficult types of exercises
• 500 children in more than 10 schools

Page 40

Interdisciplinary collaborations (ERC Grant + HFSP + associate team Neurocuriosity)

Jacqueline Gottlieb, Cognitive Neuroscience Lab, NY, US
Celeste Kidd, Developmental Psychology Lab, Univ. Rochester
Flowers Lab, Inria and Ensta ParisTech

3.1 Active Model Babbling

A particular exploration strategy available in Explauto is called Active Model Babbling [Forestier and Oudeyer, 2016]. With this strategy, different motor and sensory spaces are given to the robot, e.g. the space of motor parameters of a robotic arm and sensory spaces representing the movement of each object in the scene, and the robot learns behavioral primitives to explore and reach new effects in the different sensory spaces. The exploration of one sensory space can give information in the other spaces, and the robot can learn that some objects (e.g. a joystick) can be used as a tool to control other objects (e.g. another robot) in a hierarchical manner. This strategy was shown to be very efficient at exploring multiple task/goal spaces based on the monitoring of learning progress in the different spaces. A Jupyter notebook implementing and studying this strategy is available online at this address: http://nbviewer.jupyter.org/github/sebastien-forestier/ExplorationAlgorithms/blob/master/main.ipynb.
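The core space-selection step of this strategy can be sketched in a few lines (a minimal illustration, not the Explauto API; the function name `choose_goal_space`, the `progress` dictionary and the `eps` parameter are assumptions of this sketch):

```python
import random

def choose_goal_space(progress, eps=0.1):
    """Pick which sensory/goal space to babble next: with probability eps a random
    space (to keep progress estimates fresh), otherwise a space drawn proportionally
    to the absolute learning progress currently estimated in it.
    `progress` maps space names to learning-progress estimates."""
    spaces = list(progress)
    if random.random() < eps or not any(progress.values()):
        return random.choice(spaces)
    return random.choices(spaces, weights=[abs(progress[s]) for s in spaces])[0]
```

With progress = {"hand": 0.0, "joystick": 0.8, "ball": 0.0} and eps = 0, the joystick space is always chosen, mirroring how exploration concentrates on the space where competence is currently improving.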

3.2 Interaction

In the Active Model Babbling exploration strategy, guidance from human peers can be integrated at different levels. When the robot is set up in a compliant mode, human peers can demonstrate robotic movements that the robot stores and can reuse later in its autonomous exploration, by experimenting again with those movements or with variations of them, as in [Nguyen and Oudeyer, 2012]. Users can also demonstrate the function of objects or create situations (e.g. use the joystick controlling the second robot to grab a ball) that the robot will try to reproduce by exploring new motor primitives. Finally, users can give a value for how interesting it is to explore some task/goal spaces (e.g. moving the blue ball), which will be combined with the intrinsic motivation of the robot and will push it to explore those spaces.

4 Demonstration Setup

The demonstration shows how the Explauto library allows a Poppy Torso to explore its environment and use social guidance to understand the interaction between the different objects in the environment. The robot interacts with several objects, including a joystick that controls another robot that can be used as a tool to move other objects. Naive humans can interact with the robot by driving its exploration towards interesting objects, or by demonstrating useful movements of the arm or the joystick.

Figure 2: Demonstration setup.

References

[Forestier and Oudeyer, 2016] Forestier, S. and Oudeyer, P.-Y. (2016). Modular active curiosity-driven discovery of tool use. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Korea.

[Lapeyre et al., 2014] Lapeyre, M., Rouanet, P., Grizou, J., Nguyen, S., Depraetre, F., Le Falher, A., and Oudeyer, P.-Y. (2014). Poppy Project: Open-source fabrication of a 3D printed humanoid robot for science, education and art. In Digital Intelligence 2014, page 6, Nantes, France.

[Moulin-Frier et al., 2014] Moulin-Frier, C., Rouanet, P., Oudeyer, P.-Y., and others (2014). Explauto: an open-source Python library to study autonomous exploration in developmental robotics. In ICDL-Epirob International Conference on Development and Learning.

[Nguyen and Oudeyer, 2012] Nguyen, S. and Oudeyer, P.-Y. (2012). Active choice of teachers, learning strategies and goals for a socially guided intrinsic motivation learner. Paladyn, 3(3):136–146.

Monkeys, Robots, Human children

Linda Smith, Dev. Psych., Indiana Univ., US

Page 41

Web site with all videos of the talks

Organized by Inria Flowers (P-Y Oudeyer, M Lopes), Columbia Univ. (J. Gottlieb) and Birkbeck College (T. Gliga)

Neuroscience, computational modelling, developmental psychology, ethology/animal research

Speakers: J. Nelson, D. Markant, S. Kouider, M. Gruber, K. Murayama, J. O'Reilly, K. Friston, G. Baldassarre, P. Dayan, K. Doya, W. Schultz, A. Bell, L. Hunt, D. Bell, K. Begus, L. Goupil, L. Feigenson, D. Bavelier, A. Gopnik

https://openlab-flowers.inria.fr/t/second-interdisciplinary-symposium-on-information-seeking-curiosity-and-attention-neurocuriosity-2016/187

Page 42

Selected publications

• Baranes, A., Oudeyer, P.-Y., 2013. Active learning of inverse models with intrinsically motivated goal exploration in robots. Robot. Auton. Syst. 61 (1), 49–73.
• Baranes, A.F., Oudeyer, P.-Y., Gottlieb, J., 2014. The effects of task difficulty, novelty and the size of the search space on intrinsically motivated exploration. Front. Neurosci. 8, 1–9.
• Baranes, A., Oudeyer, P.-Y., Gottlieb, J., 2015. Eye movements reveal epistemic curiosity in human observers. Vis. Res. 117, 81–90.
• Benureau, F.C.Y., Oudeyer, P.-Y., 2016. Behavioral diversity generation in autonomous exploration through reuse of past experience. Front. Robot. AI 3, 8. http://dx.doi.org/10.3389/frobt.2016.00008.
• Clement, B., Roy, D., Oudeyer, P.-Y., Lopes, M., 2015. Multi-armed bandits for intelligent tutoring systems. J. Educ. Data Mining 7 (2), 20–48.
• Forestier, S., Oudeyer, P.-Y., 2016. Modular active curiosity-driven discovery of tool use. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
• Gottlieb, J., Oudeyer, P.-Y., Lopes, M., Baranes, A., 2013. Information seeking, curiosity and attention: computational and neural mechanisms. Trends Cogn. Sci. 17 (11), 585–596.
• Kaplan, F., Oudeyer, P.-Y., 2003. Motivational principles for visual know-how development. In: Prince, C.G., Berthouze, L., Kozima, H., Bullock, D., Stojanov, G., Balkenius, C. (Eds.), Proceedings of the 3rd International Workshop on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems, vol. 101. Lund University Cognitive Studies, Lund, pp. 73–80.
• Kaplan, F., Oudeyer, P.-Y., 2007. In search of the neural circuits of intrinsic motivation. Front. Neurosci. 1 (1), 225–236.

Page 43

• Lopes, M., Oudeyer, P.-Y., 2012. The strategic student approach for life-long exploration and learning. In: IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL). IEEE, pp. 1–8.
• Lopes, M., Lang, T., Toussaint, M., Oudeyer, P.-Y., 2012. Exploration in model-based reinforcement learning by empirically estimating learning progress. In: Proceedings of Neural Information Processing Systems (NIPS 2012). NIPS, Tahoe, USA.
• Moulin-Frier, C., Nguyen, M., Oudeyer, P.-Y., 2014. Self-organization of early vocal development in infants and machines: the role of intrinsic motivation. Front. Cogn. Sci. 4, 1–20. http://dx.doi.org/10.3389/fpsyg.2013.01006.
• Oudeyer, P.-Y., Kaplan, F., 2007. What is intrinsic motivation? A typology of computational approaches. Front. Neurorobot. 1, 6. http://dx.doi.org/10.3389/neuro.12.006.2007.
• Oudeyer, P.-Y., Smith, L., 2016. How evolution can work through curiosity-driven developmental process. Top. Cogn. Sci. 8 (2), 492–502.
• Oudeyer, P.-Y., Kaplan, F., Hafner, V., 2007. Intrinsic motivation systems for autonomous mental development. IEEE Trans. Evol. Comput. 11 (2), 265–286.
• Oudeyer, P.-Y., Gottlieb, J., Lopes, M., 2016. Intrinsic motivation, curiosity, and learning: theory and applications in educational technologies. Prog. Brain Res. 229, 257–284.