Department of Computer Science and Engineering University of Texas at Arlington
Arlington, TX 76019
Learn to Relate Observation to the Internal State for Robot Imitation
Heng Kou [email protected]
Technical Report CSE-2006-3
This report was also submitted as an M.S. thesis.
LEARN TO RELATE OBSERVATION TO THE INTERNAL STATE FOR
ROBOT IMITATION
by
HENG KOU
Presented to the Faculty of the Graduate School of
The University of Texas at Arlington in Partial Fulfillment
of the Requirements
for the Degree of
MASTER OF SCIENCE IN COMPUTER SCIENCE
THE UNIVERSITY OF TEXAS AT ARLINGTON
August 2006
ACKNOWLEDGEMENTS
I would like to express my deepest gratitude to my thesis supervisor, Dr. Manfred
Huber, for his invaluable inspiration and guidance from the very beginning of my graduate
study. Many thanks to him for introducing me to Artificial Intelligence and patiently
explaining every detail to me.
I also thank Dr. Diane Cook and Dr. Lynn Peterson for taking the time to serve
on my committee and for their valuable feedback.
I would like to thank my lab mates in the Robotics Lab for sharing their great
experience with me. Many thanks to Ashok, Eric and Srividhya for the delicious food,
nice conversations and encouragement.
Finally, I would like to give my special thanks to my parents and my lovely sister
for their endless love and encouragement. Although they are thousands of miles away,
they have never left me alone and are always in my heart.
May 5, 2006
ABSTRACT
LEARN TO RELATE OBSERVATION TO THE INTERNAL STATE FOR
ROBOT IMITATION
Publication No.
Heng Kou, MS
The University of Texas at Arlington, 2006
Supervising Professor: Manfred Huber
Imitation is a powerful form of learning used widely throughout our lives. When you
learn a new sport, e.g. basketball, the fastest way is to observe how an instructor or a
player plays the game. At the beginning you may simply follow the exact steps taken by
the instructor or player and practice with implicit or explicit feedback (reward and/or
penalty). Over time you may find a better way to play, which may differ from the
instructor's or player's.
In this thesis we present an approach that allows a robot to learn a new behavior
through demonstration. The demonstration is a sequence of observed states, representing
the state of the demonstrator and the environment. In order to imitate, the robot
develops an internal model, which contains all possible states that can be achieved by its
own capabilities and learns a mapping which allows it to interpret the demonstration by
relating observations to the internal states.
We introduce a distance function to represent the similarity between the observed
state and the internal state. The goal is that the robot learns the distance function,
i.e. the mapping between the observed states and internal states. Given a sequence of
observed states, the robot is then able to produce a policy by minimizing the distance
function as well as the actions' cost. This approach works even if the demonstrator
and imitator have different bodies and/or capabilities, in which case the robot may
only be able to approximately achieve the task.
In the learning approach presented here, the robot first learns the correspondence
between the observed state and the internal state. The robot then learns which aspects of
the demonstration are more important, so that it is able to finish the task even
if the environment or its capabilities are different.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Chapter
1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2. RELATED WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 Imitation with Similar Bodies and/or Actions . . . . . . . . . . . . . . . 7
2.2 Imitation with Dissimilar Bodies and/or Actions . . . . . . . . . . . . . . 8
2.3 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3. IMITATION APPROACH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 A Framework for Imitation . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4. LEARNING THE DISTANCE FUNCTION . . . . . . . . . . . . . . . . . . . 19
4.1 Derive a Policy Using the Distance Function . . . . . . . . . . . . . . . . 19
4.2 Learning the Policy as Well as the Distance Function . . . . . . . . . . . 22
4.3 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3.1 Real-Valued Reinforcement Learning . . . . . . . . . . . . . . . . 24
4.4 Exploration Over Policy Space . . . . . . . . . . . . . . . . . . . . . . . . 25
4.5 Derive Distance Function From Policy . . . . . . . . . . . . . . . . . . . 27
4.6 Representation Issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.6.1 Function Approximation . . . . . . . . . . . . . . . . . . . . . . . 29
4.6.2 Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.6.3 Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.6.4 Initialization Algorithm . . . . . . . . . . . . . . . . . . . . . . . 41
4.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.7.1 Our Approach vs. Purely Autonomous Learning . . . . . . . . . . 42
5. EXPERIMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.1.1 State Representation . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.1.2 Action Representation . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Learning Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2.1 Learning the Mapping Between Observed and Internal State . . . 46
5.2.2 Generalization in Dynamically Changing Environments . . . . . . 47
5.3 Example Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3.1 Trash Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3.2 Toy Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.3.3 Futon Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.4 Experiments Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.4.1 Evaluation Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6. CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
BIOGRAPHICAL STATEMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
LIST OF FIGURES
Figure Page
1.1 Relate Observation to Internal States . . . . . . . . . . . . . . . . . . . 2
1.2 Fail to Relate Observation to Internal States . . . . . . . . . . . . . . . 3
1.3 Re-mapping with the Distance Function . . . . . . . . . . . . . . . . . . 4
3.1 Trash Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Imitation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.1 Derive a Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Learning the Distance Function . . . . . . . . . . . . . . . . . . . . . . . 23
4.3 Explore on Policy Space . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.4 Multilayer Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.1 Imitation Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2 Trash Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.3 Toy Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.4 Futon Match . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.5 Task Performance for Case 1 . . . . . . . . . . . . . . . . . . . . . . . . 53
5.6 Task Performance for Case 2 . . . . . . . . . . . . . . . . . . . . . . . . 54
5.7 Task Performance for Case 3 . . . . . . . . . . . . . . . . . . . . . . . . 55
5.8 Task Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
LIST OF TABLES
Table Page
4.1 Resource Requirement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1 Experiment Configuration for Trash Cleaning . . . . . . . . . . . . . . . 52
5.2 Experiment Configuration for Toy Collection . . . . . . . . . . . . . . . 52
LIST OF ALGORITHMS
2.1 Learning to Imitate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.1 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Scaled Conjugate Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . 40
CHAPTER 1
INTRODUCTION
1.1 Motivation
As robotic systems move into more complex and realistic environments, the need
for learning capabilities increases. With learning capabilities, a robot can deal with a
dynamic environment that is tedious or even impossible for a programmer to capture
completely. Learning also reduces the cost of preprogramming a robot for each task.
Traditional learning approaches that use purely autonomous learning provide a general
solution for a variety of tasks, but learning from scratch is not practical in the real world.
Imitation, also called learning from observation, overcomes some of the disadvan-
tages of both pre-programming and purely autonomous learning:
• Easy learning: The imitator could learn a new task by simply observing how a
demonstrator (human or artificial agent) executes the task.
• Fast learning: Imitation speeds up the learning process by transferring knowledge
from demonstrator to imitator – the demonstration is a possible solution of the
task.
• Easy training: The demonstrator could be an expert in any task domain without
an extensive technical background.
• Low training overhead: The demonstrator could continue its own work without
paying attention to the imitator.
Although imitation, as a powerful learning technique, is widely used in biological
systems, it introduces numerous challenges when we apply it to robotic systems:
• First, the robot must be able to extract useful information from the demonstration.
Whenever there is a significant change in the environment, the robot should detect it
through its sensors and segment the continuous sensor stream into a discrete rep-
resentation. Aspects that cannot be directly observed or cannot be extracted
from the sensor stream will not be represented in the observed model.
• Second, since the demonstration does not naturally look like the imitator's own
execution, the robot has to learn to interpret the percepts from its sensors. The
robot's interpretation capability will affect the quality of the imitation.
• Third, the robot must be able to address the problem that the demonstrator and
imitator have different bodies and/or capabilities, which is not unusual in the real
world.
The demonstration is a sequence of observed states, representing the state of the
demonstrator and the environment. Transitions happen whenever a significant change
has been detected. The internal states represent the state of the imitator and the envi-
ronment, which are achieved by using its own capabilities.
The imitation problem can be considered as relating the given demonstration to a
sequence of internal states which achieve the same effects (see Figure 1.1).
Figure 1.1. Relate Observation to Internal States.
When the demonstrator and imitator have similar bodies and capabilities, each
internal state exactly matches the observed state. But when the imitator is substantially
different from the demonstrator, the imitator may not be able to establish this mapping
and the imitation may fail. Figure 1.2 shows an example:
Figure 1.2. Fail to Relate Observation to Internal States.
In this example, the demonstrator executes a PUSH action but the imitator cannot.
Although there is an alternative, namely executing the action sequence grab, move
and drop, the imitator is not able to establish a mapping between the observed state and
the internal state after it executes a GRAB action, and thus the imitation fails.
In this thesis, we develop a framework that allows the robot to learn new behav-
iors from demonstrations. We introduce a distance function to represent the similarity
between the observed state and the internal state. The smaller the distance, the more
similar they are. The imitator tries to learn the distance function such that it is able
to find a sequence of internal states which approximates the demonstration even if the
imitator has a different body and capabilities from the demonstrator.
Figure 1.3 shows that with our approach the imitator is able to establish a mapping
between the observed states and the internal states even if no exact match exists.
Figure 1.3. Re-mapping with the Distance Function.
To facilitate learning, the imitator first learns direct state correspondences by relat-
ing single-step demonstrations to its internal policies. Then more complex tasks are
used to learn the implications of environmental differences and behavioral capabilities
for the imitator.
1.2 Outline
Chapter 2 first gives a review of previous work in this area; then several approaches
taken by other researchers are discussed in detail; finally, our approach is
presented.
Chapter 3 gives an overview of our imitation approach.
Chapter 4 first introduces the distance function, which represents the similarity
between the observed state and internal state, and explains how the imitator learns the
optimal policy and derives the distance function from that policy.
Second we describe reinforcement learning and real-valued reinforcement learning,
which the imitator uses to learn the distance function and the optimal policy.
Third we discuss neural networks, which are used as our distance function repre-
sentation. We present two popular neural network architectures, multilayer perceptrons
and radial basis function networks, and show the learning algorithms.
Chapter 5 gives a detailed description of our experiments and then shows the
results.
Chapter 6 presents a summary of the work described in this thesis.
CHAPTER 2
RELATED WORK
Imitation, as a powerful learning technique, has been heavily studied for years in
robotic systems. Researchers have applied this mechanism to the control of robot arms [1],
to driving a robot through a maze [2], and to transporting objects [3].
The work can be divided into two main areas: learning to imitate where the imi-
tator learns how to duplicate the demonstration, and learning from imitation where the
imitator learns the task model directly (e.g. in [2, 4], the robot learned to associate the
wall configuration with the appropriate action in a maze navigation). Learning from
imitation assumes that the correspondence between the demonstrator and imitator has
been established.
Learning to imitate can be further divided into action-level imitation and function-
level imitation [5]. In action-level imitation, the imitator tries to copy every single action
taken by the demonstrator. This has several drawbacks that make it impractical in many
real environments:
• Actions taken by the demonstrator may be irrelevant to the task.
• Actions taken by the demonstrator cannot be directly observed. What can be
observed is a state representation, the result of an action. Therefore an action recog-
nition approach must be provided.
• Imitation may fail when the imitator has a different body or different actions from
the demonstrator. For example, a robot without a gripper cannot pick up an object
given a human demonstration that carries an object and drops it in the corner.
• Imitation may fail even if the mapping between the demonstrator and imitator has
been established, since two corresponding actions may have different effects. For
example, a child cannot reach the top level of a shelf even if he/she raises the hand
as an adult does.
Action-level imitation thus only fits a limited number of imitation problems, such as
learning to dance.
In function-level imitation, the imitator considers whether a certain sequence of
effects in the demonstration can be achieved through its own capabilities, rather than
focusing on the actions taken by the demonstrator. The imitator tries to learn the
correspondence between the observed state and the internal state such that it is able to
produce a policy even if its body and capabilities differ from the demonstrator's.
As in the action-level imitation discussion, consider the example where the demonstrator is
a human while the imitator is a robot without a gripper. The demonstration consists of
picking up an object, moving to a target and dropping the object. In function-level imitation,
the robot can perform the task by using its front bumper to push the object to the target.
In this chapter we first review the previous work done in this area where the
demonstrator and imitator have similar bodies and/or behaviors, then discuss in detail
the approaches that have been taken for heterogeneous agents.
2.1 Imitation with Similar Bodies and/or Actions
Hayes and Demiris's work [2, 4] focused on the imitation problem where the demon-
strator and imitator have similar behaviors. They tried to construct a robot controller
that allows a robot to learn to associate its percepts of the environment with the ap-
propriate actions. In the experiments, the robot learns to navigate through a maze by
following a demonstrator.
In [2], the demonstrator was a robot of a different size. The imitator recognized an
action by detecting a change in the direction of the demonstrator's movement, performed
a teacher-following behavior, and used its own observations to relate the changes in the
environment to its own forward, left-turn and right-turn actions. The association itself is a
set of if-then rules.
In [4], the demonstrator became a human. In order to detect changes in the
direction of the human's movement, they used a neural network to classify the face image:
as the human's head rotates, the face image changes, and so do the angular relationships
between the heads of the demonstrator and imitator. Although they proposed two alter-
natives for movement matching, (i) knowing each action's effect and choosing the one with
the desired effect, and (ii) trying all available actions and choosing the one with the
minimum error, they did not mention which one was used in the experiments.
Nicolescu and Mataric [3] proposed a hierarchical architecture in which tasks are
represented in a graph. A graph node can be either an abstract behavior or a Network
Abstract Behavior, which can be further decomposed into lower-level representations.
Abstract behaviors, which include execution conditions and behavior effects, can
activate primitive behaviors to achieve the specified effect. This architecture makes it
possible to address larger and more complex tasks, but it requires that the demonstrator
and imitator have similar representations and capabilities, which might not be common
in real-world systems.
2.2 Imitation with Dissimilar Bodies and/or Actions
Atkeson and Schaal [1] implemented learning from demonstration on an anthropo-
morphic robot arm. The goal of the task is to have the robot learn a task model and
a reward function from the demonstration, rather than simply copying the human arm
movement. Based on the learned reward function and task model, an appropriate policy
is computed.
The pendulum swing-up task is demonstrated by a human, who shows how to move the
hand so that the pendulum swings up and is then balanced in the inverted position.
During learning, feedback is given to balance the unstable inverted pendulum when it
is nearly upright with a small angular velocity. If the pendulum is within the capture
region of the feedback controller, the task is considered to have been completed successfully.
The experiments first showed that following the human arm trajectory failed to
swing the pendulum up: the pendulum did not get even halfway up to the vertical
position and started to oscillate. This is due to the different hand structures of the
human and the robot, which prevent the robot from exactly reproducing the human motion.
In order to find a swing-up trajectory that works for the robot, they applied a planner
to learn a task model and a reward function. Both a parametric model (with a priori
knowledge) and a nonparametric model (a general relationship learned from the training
data) were constructed and learned. The planner used the human demonstration as the initial
trajectory and penalized deviations from the demonstration trajectory with a reward
function. The demonstration provides an example solution and greatly reduces the need
for exploration.
Although the robot arm is different from the human’s, they did not focus on learning
the correspondence between the human hand trajectory and the robot arm’s. Instead
they tried to learn the task model directly, i.e. the relationship among pendulum angle,
pendulum angular velocity and the horizontal hand acceleration.
Nehaniv and Dautenhahn [5] said, “A fundamental problem for imitation is to
create an appropriate (partial) mapping between the body of the system being imitated
and the imitator.” They distinguished effect-level imitation from action-level and
program-level imitation. Effect-level imitation concerns the effects on the environment
rather than the particular actions, which is important when the demonstrator and imitator
have dissimilar bodies and/or actions.
To address the problem introduced by dissimilar bodies and/or actions, they tried
to find sets of actions that achieve the same effect on the state, thereby producing sets of
equivalent states. They also used a metric function to represent the similarity between
the observed states and internal states, which in turn is used to evaluate the badness of
the imitation. By minimizing the sum of this metric function over all intermediate states,
the imitator can come as close as possible to a successful imitation.
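Their 0/1 metric, summed over intermediate states, can be sketched as follows; the state representation here is arbitrary:

```python
def metric(observed, internal):
    """0/1 metric: 0 if the two states are equal, 1 otherwise."""
    return 0 if observed == internal else 1

def imitation_badness(observed_seq, internal_seq):
    """Sum of the metric over all intermediate state pairs; the imitator
    seeks the internal state sequence minimizing this total."""
    return sum(metric(o, i) for o, i in zip(observed_seq, internal_seq))
```

Because the metric takes no intermediate values, two imperfect internal sequences with very different degrees of similarity to the demonstration can receive the same badness score.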
It is not clear how they determine whether two states are equivalent. Since
not every aspect of the state information is relevant to the task, some aspects may be more
important than others. Exact matching will not allow the robot to perform well when
the environment changes slightly. In addition, their metric function is simple, taking only
the values 0 (equal) and 1 (non-equal) with no intermediate values. The imitator may therefore
not be able to produce a policy that approximately achieves the demonstration, even if one
sequence of states is more similar to the observed states than another.
Price and Boutilier [6] proposed an implicit imitation framework, which uses im-
itation to accelerate reinforcement learning. Their work removed the assumption that
the demonstrator and imitator must have homogeneous actions. To overcome
this difficulty, two techniques were introduced in their model: action feasibility testing,
which allows the imitator to determine whether the observed state can be duplicated, and
k-step repair, in which the imitator tries to find an alternative to approximate the observed
states.
Experiments are given in grid worlds, in which the imitator tries to approximately
match the path taken by the demonstrator. Different situations are considered, in which
the imitator has different action capabilities, or in which the state space for imitation is
slightly different from the one for demonstration. The agent equipped with feasibility
testing and k-step repair outperforms agents without these techniques.
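The two techniques can be sketched roughly as follows in a toy one-dimensional world; the state and action representations are illustrative only, not those of [6]:

```python
from itertools import product

def feasible(state, target, actions):
    """Action feasibility test: can some single imitator action
    reach the target state directly?"""
    return any(a(state) == target for a in actions)

def k_step_repair(state, target, actions, k):
    """If no single action suffices, search action sequences of
    length up to k that reach (repair to) the target state."""
    for length in range(1, k + 1):
        for seq in product(actions, repeat=length):
            s = state
            for a in seq:
                s = a(s)
            if s == target:
                return list(seq)
    return None  # no repair found within k steps
```

For example, with `actions = [lambda s: s + 1, lambda s: s - 1]`, `feasible(0, 2, actions)` is False, while `k_step_repair(0, 2, actions, 2)` finds a two-step sequence that reaches the target.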
Although they dealt with the case in which the demonstrator and imitator may have
heterogeneous actions, they ignored the fact that the demonstration may not be optimal
for the imitator: the imitator always prefers the same action as the demonstrator.^1
Johnson and Demiris [7] introduced inverse-forward models to solve the correspon-
dence problem, i.e. to “link the recognition of actions that are being demonstrated to
the execution of the same actions by the imitator”. The inverse model takes the current
state and the goal state as input and produces the commands that move the system
from the current state to the goal state. The forward model produces the next state
given the current state and the commands as input.
First, a set of inverse models is executed at the same time with the same cur-
rent/goal states; second, the commands produced are fed into a set of forward models;
third, the predicted states produced by the forward models are compared with the actual
next state, and a reward or penalty is given depending on whether they exactly match.
The reward or penalty is used to update the confidence of each inverse model. The
higher the confidence, the better the model's predicted state matches the observed state.
Finally, the imitator chooses the inverse model with the highest confidence. To
overcome dissimilar capabilities between the demonstrator and imitator, they allow
lower-level inverse models to signal their confidence to the upper level and let the upper-
level inverse models choose the closest component inverse model that performs well.
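A rough sketch of this selection scheme follows; the inverse and forward models below are trivial stand-ins, not the models used in [7]:

```python
class InverseForwardPair:
    """One inverse model paired with its forward model, plus a confidence."""
    def __init__(self, name, inverse, forward):
        self.name = name
        self.inverse = inverse      # (state, goal) -> command
        self.forward = forward      # (state, command) -> predicted next state
        self.confidence = 0.0

def update_confidences(pairs, state, goal, actual_next,
                       reward=1.0, penalty=1.0):
    """Run every inverse model, predict with its forward model, and adjust
    its confidence by a constant reward or penalty on an exact match."""
    for p in pairs:
        command = p.inverse(state, goal)
        predicted = p.forward(state, command)
        p.confidence += reward if predicted == actual_next else -penalty

def best_model(pairs):
    """The imitator selects the inverse model with the highest confidence."""
    return max(pairs, key=lambda p: p.confidence)
```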
The confidence is the level of certainty that the imitator has about each inverse
model. It is increased or decreased by a constant reward or penalty depending on
whether the predicted state matches the actual state. When dealing with non-trivial
problems, the representation of the confidence becomes an issue.
^1 The k-step repair happens only if the action feasibility test fails.
To evaluate the inverse models, the imitator performs an exact match between the
predicted state and the actual state. Since the imitator has no way to know which aspects
of the demonstration are more important than others, it may sometimes fail to recognize
a match. Also, the exact match prevents the imitator from finding an approximate
imitation strategy.
Gudla and Huber [8] proposed a different approach to handle the case where the
demonstrator and imitator have different bodies and/or actions. They introduced a cost
function to represent the similarity between the observed and internal state, which is
a weighted sum over the features in the environment, e.g. location, orientation, gold and
arrow. A stochastic reinforcement learning algorithm was used to learn the weight factor
for each feature such that the robot can determine which features are more important in
the task domain. Combined with the actions' cost, the imitator tries to find a policy
that achieves the observed states as closely as possible at the least cost.
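The weighted-sum cost can be sketched as follows; the feature names and weight values are illustrative, not those learned in [8]:

```python
def weighted_cost(observed, internal, weights):
    """Weighted sum of per-feature differences between the observed
    state and the internal state; the weights, learned through
    reinforcement, encode how important each feature is to the task."""
    return sum(w * abs(observed[f] - internal[f]) for f, w in weights.items())
```

Note that each feature contributes to the sum independently of the others, which is exactly the limitation discussed below.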
Two experiments were implemented in the Wumpus world. The Wumpus world
is a computer game that contains a number of locations, each of which may hold gold,
an obstacle, a bottomless pit or a hidden mysterious monster called the Wumpus. An agent
moves through the environment to collect as much gold as possible while avoiding being
killed by falling into a bottomless pit or by the Wumpus.
In the first example task, the demonstrator shoots the Wumpus and brings the
gold back. The imitator, which does not have a shooting capability, takes a path that
does not encounter the Wumpus and brings the gold back. In the second example, the
demonstrator brings the gold back and drops it at the start state. The imitator, which
lacks a drop capability, did not grab the gold and came back to the start state.
Similar to the metric function proposed by Nehaniv and Dautenhahn [5], the cost
function used here represents the similarity between the observed state and the internal state.
Instead of just 0 and 1, this cost function is a weighted sum of the differences in each
feature between the observed state and the internal state. Its drawback is that the cost
function considers each feature separately, an assumption that does not always hold.
As we have seen, some approaches taken in previous work assume either that the demon-
strator and imitator have similar bodies and/or capabilities or that the correspondence be-
tween the demonstrator and imitator has already been established. Although other approaches
use a function to represent the similarity between the observed state and the internal state,
the function either has a simple form or makes assumptions about how the features are
related.
2.3 Our Approach
In this thesis, we develop a framework that allows the robot to learn a new task
by observing how a demonstrator, either a human or a robot, executes the task. First,
the robot tries to learn the correspondence between the observed states and the internal
states such that it is able to produce a policy that approximates the demonstration as closely
as possible, even if it has a different body and/or actions from the demonstrator. Second,
the robot learns which aspects of the demonstration are more important, so that it is able to
finish the task even if the environment changes slightly.
We use a distance function to represent the similarity between the observed state
and the internal state. The smaller the value, the more similar they are. The dis-
tance function is represented by a neural network, which makes no assumptions about
the structure of the function, the correspondence of features, body characteristics or ac-
tion capabilities. Compared with the approaches taken in [8], [7] and [5], it is general and
task-independent.
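As a sketch, such a network might compute a distance from the concatenated observed and internal state features as follows; the layer sizes and weights are placeholders, not the trained network used in our experiments:

```python
import math

def mlp_distance(observed, internal, w1, b1, w2, b2):
    """Feed the concatenated observed and internal state features through
    one sigmoid hidden layer and a linear output unit. Because the network
    sees all features jointly, it imposes no independence assumption on
    how features are related."""
    x = list(observed) + list(internal)
    hidden = [
        1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
        for row, b in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2
```

The weights `w1, b1, w2, b2` would be learned through reinforcement, as described next.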
As the distance function is unknown, the robot learns it through
reinforcement from the environment or the demonstrator. Since measuring the quality of
imitation is often difficult and subjective (observer-dependent), we use task performance
to approximate a measure of imitation quality.
Algorithm 2.1 shows how the robot learns the distance function as well as an optimal
policy.
Algorithm 2.1: Learning to Imitate

    Given a demonstration, a sequence of observed states
    repeat
        Based on the current distance function, produce a policy
        Generate a new policy
        if the new policy has a higher reward then
            Derive a new distance function from the new policy
            Feed the new distance function into the neural network
    until no better policy has been found
Given a demonstration, the robot considers all possible internal states, calculates
the distance between the observed state and the internal state using the current distance
function, and derives the policy that it considers to be the closest^2 to the demonstration.
Then it generates a new policy by exploring the policy space, executes the
policy, and receives a reward from the environment. If the new policy outperforms the
current one, the distance function encoded in the new policy is more accurate
than the current one. The robot then derives a new distance function from the new policy
and feeds it into the neural network.
This procedure is repeated until an optimal distance function is found. When an
optimal policy is found, we can assume that the robot has learned the correspondence
between the observed state and the internal state. When an optimal policy is found for
a number of different scenarios, we can expect that the distance function has correctly
identified important aspects of the demonstrations.
^2 The action's cost is also considered.
CHAPTER 3
IMITATION APPROACH
3.1 Overview
Learning through observation is a learning technique widely adopted by humans
and animals. During the early phase of life, they try to establish the correspondence
between the demonstrator and themselves. As soon as this correspondence is established,
they can learn complex tasks more easily. They also learn through practice under different
scenarios such that they can apply the new behavior even if the environment is slightly
different.
Learning through observation in biological systems presents a possible approach
for building learning capabilities in an artificial agent, e.g. a robot. The robot can learn
the new behavior from a demonstration given by an agent, either a human or another
artificial agent.
The demonstration is a sequence of observed states, representing the state of the
demonstrator and the environment. Transitions happen whenever a significant change
has been detected. The internal states represent the state of the imitator and the
environment, which the imitator reaches using its own capabilities.
Figure 3.1 shows a demonstration for a house cleaning task, where a house cleaner
walks around, picks up the trash, moves towards a trash bin and drops the trash.
We expect that the robot, even without legs, can not only produce a policy to
accomplish the cleaning task, but also learn through practice to distinguish trash from other objects.
Although the demonstrator and imitator have different bodies and capabilities, it
is not necessary that the robot takes the same actions as the demonstrator does. As long
[Figure: the demonstrator moves to the trash, grabs it, moves to the trash bin and
drops the trash. Trash: GREEN/PAPER; Obstacle: RED/WOOD; Trash Bin: BLACK/MARBLE.]
Figure 3.1. Trash Cleaning.
as the robot is able to achieve the same effects - i.e. find the trash and throw it into the
trash bin - it is a successful imitation. In order to achieve the same effects, the robot
must be able to learn the correspondence between the observed state and internal state,
e.g. an observed state where the demonstrator is next to a green object corresponds to
an internal state where the robot is next to a green object.
After this correspondence has been established, the imitator can learn which aspects
of the demonstration are more important, e.g. the characteristics of trash. This can be
done by letting the robot practice with different kinds of trash such that it is able to
ignore the irrelevant aspects and derive an optimal policy.
3.2 A Framework for Imitation
We develop a framework (see imitation part in Figure 3.2) which allows the robot
to learn new behavior through imitation.
[Figure: block diagram of the model. A demonstration phase (Observation, Modeling)
turns sensor input and observed states into a perceptual state sequence; an imitation
phase (Mapping, Execution & Adaptation) turns the task policy into action output,
with adjustment fed back.]
Figure 3.2. Imitation Model.
Within the demonstration phase, the Observation and Modeling components allow
the imitator to observe the demonstrator and derive functional representations of the
demonstration.
Within the imitation phase, the Mapping component [8] tries to learn the correspondence
between the observed states and internal states, and to derive a policy that
as closely as possible achieves the task using its own actions. We introduce a distance
function to represent the similarity between the observed state and the internal state.
The smaller the distance, the more similar they are. By minimizing the total cost over
the entire sequence of internal states, including the distance and action cost, the robot
is able to find an optimal policy.
The imitator uses the A* algorithm [9] to derive a policy. Each node in the search
tree represents an observed/internal state pair and has a cost composed of an actual
cost and a heuristic cost. The actual cost includes the action costs from the start
internal state to the current state, plus the distance between each internal state and its
corresponding observed state. The heuristic cost is the estimate of the cost from the
current state to the internal state corresponding to the last observed state.
The initial internal state is associated with the first observed state. Starting from
the initial internal state, the imitator develops the set of its successors using the available
actions¹. Each action produces a new internal state, which either corresponds to the
same observed state as the current internal state or the next observed state. There is
one special successor in which the current internal state corresponds to the next observed
state without taking any action. After calculating the cost for each successor, they are
compared against other internal states. New states are added into the internal state set,
while the duplicate states with lower costs replace the existing ones. The internal state
with the smallest cost is chosen to be explored next. Each chosen state may affect the
final policy, since it may reveal a new sequence of internal states with smaller cost.
This procedure repeats until an internal state corresponding to the last
observed state is found, and a policy is finally determined.
The imitator learns the distance function as well as the optimal policy through
reinforcement learning. First the imitator uses the current distance function to derive
a policy. Second it generates a new policy by exploring over policy space. Then the
Execution and Adaptation component executes the new policy and receives feedback
from the environment. The feedback is given either by the demonstrator or a trainer in
the form of a scalar signal which indicates how well the imitator performs the task. If
the new policy gets higher reward, we assume its distance function is more accurate and
a new distance function is derived from that policy. Finally a neural network, which is
used to represent the distance function, is trained with this new distance function.
This procedure is repeated until no better policy has been found for a reasonable
period of time.
¹ By considering all possible actions, the robot is able to deal with cases where the demonstrator and imitator have similar or dissimilar bodies and capabilities.
CHAPTER 4
LEARNING THE DISTANCE FUNCTION
4.1 Derive a Policy Using the Distance Function
In order to derive a policy, the imitator must consider both action cost and the
distance function. Figure 4.1 shows how the distance function affects the policy derived.
[Figure: an A* search tree over observed/internal state pairs. Each node S_{I_i}|S_{D_j}
is labeled with the distance between its internal and observed states, and each link
with the cost of the action taken.]
Figure 4.1. Derive a Policy.
Each node here represents a state pair, an internal state and its corresponding
observed state, while a link represents a transition from an internal state to another one
by executing an action. The number in the node is the distance between the observed
state and the internal state. The number on each link is the action’s cost. The total cost
for each node is calculated as:
f = g + h    (4.1)

g = \sum_{i=1}^{n} \left( A_{S_{I_i}|S_{I_{i-1}}} + D_{S_{I_i}|S_{D_j}} \right)    (4.2)

where g is the actual cost from the start state to the current state; h is the estimated
cost from the current state to the internal state which corresponds to the final observed
state; S_{I_i} is internal state i and S_{D_j} is observed state j; S_{I_i}|S_{D_j} is the
state pair of internal state i and its corresponding observed state j; A_{S_{I_i}|S_{I_{i-1}}}
is the cost of the action moving from state S_{I_{i-1}} to state S_{I_i}; D_{S_{I_i}|S_{D_j}}
is the distance between internal state i and its corresponding observed state j.
Starting from the initial internal state, which corresponds to the first observed state,
the imitator always chooses the node with minimum cost (Equation 4.1) and generates its
successors using all available actions. Each action produces a new internal state, which
either corresponds to the same observed state as the current state, or corresponds to the
next observed state. In addition, a new successor is created from the current state which
corresponds to the next observed state without taking any action. New nodes are added
into the search tree while the duplicate nodes with lower costs replace the existing ones.
Then another node with the minimum cost is chosen to be explored next. This procedure
repeats until an internal state corresponding to the final observed state is found, and a
policy is derived.
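This search can be sketched in code. The interfaces below (actions, distance, heuristic) are hypothetical stand-ins for the imitator's own action model, the learned distance function and the heuristic estimate; states here are plain values for illustration.

```python
import heapq
import itertools

def derive_policy(observed, start, actions, distance, heuristic):
    """Sketch of the A* search over observed/internal state pairs.

    observed  : the demonstration, a list of observed states
    actions   : actions(s) -> iterable of (cost, next_internal_state)
    distance  : distance(internal_state, observed_state) -> learned D
    heuristic : heuristic(internal_state, observed_index) -> estimate h
    """
    n = len(observed)
    tie = itertools.count()                  # tie-breaker for the heap
    g = {(start, 0): distance(start, observed[0])}
    parent = {(start, 0): None}
    frontier = [(g[(start, 0)] + heuristic(start, 0), next(tie), (start, 0))]
    while frontier:
        _, _, node = heapq.heappop(frontier)
        s, j = node
        if j == n - 1:                       # reached the last observed state
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return list(reversed(path))      # the derived policy
        succ = [(s, j + 1, 0.0)]             # advance without taking an action
        for cost, s2 in actions(s):          # act: same or next observed state
            succ += [(s2, j, cost), (s2, j + 1, cost)]
        for s2, j2, cost in succ:
            g2 = g[(s, j)] + cost + distance(s2, observed[j2])
            if (s2, j2) not in g or g2 < g[(s2, j2)]:
                g[(s2, j2)] = g2             # keep the lower-cost duplicate
                parent[(s2, j2)] = (s, j)
                heapq.heappush(frontier, (g2 + heuristic(s2, j2),
                                          next(tie), (s2, j2)))
    return None
```

Each heap entry carries f = g + h, so the node with minimum total cost is always expanded next, exactly as described above.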
Suppose the observed state sequence is S_{D_1} - S_{D_2} - S_{D_3} - S_{D_4}. In order to find an
optimal policy, both the distance function and the action cost are considered; e.g. the total cost
for taking policy π_1: S_{I_1}|S_{D_1} - S_{I_2}|S_{D_2} - S_{I_3}|S_{D_3} - S_{I_3}|S_{D_4} is:

f_1 = A_{S_{I_1},S_{I_2}} + D_{S_{I_2}|S_{D_2}} + A_{S_{I_2},S_{I_3}} + D_{S_{I_3}|S_{D_3}} + A_{S_{I_3},S_{I_3}} + D_{S_{I_3}|S_{D_4}}
    = 7 + 0 + 5 + 2 + 0 + 5
    = 19

Similarly, for policy π_2 starting with S_{I_1}|S_{D_1} - S_{I_2}|S_{D_2} - S_{I_3}|S_{D_2}, the cost is:

f_2 = A_{S_{I_1},S_{I_2}} + D_{S_{I_2}|S_{D_2}} + A_{S_{I_2},S_{I_3}} + D_{S_{I_3}|S_{D_2}} + h_{S_{I_3}|S_{D_2}}    (4.3)
    = 7 + 0 + 5 + 78 + h_{S_{I_3}|S_{D_2}}    (4.4)
    = 90 + h_{S_{I_3}|S_{D_2}}    (4.5)

Since the distance between the internal state S_{I_3} and its corresponding observed state S_{D_2}
is 78, its total cost is already larger than the cost of policy π_1 even without considering
the heuristic cost.

In contrast, the distance between state S_{I_4} and its corresponding observed state
S_{D_4} is 3, which is smaller than the distance of state pair S_{I_3}|S_{D_4}, but the action cost
of moving from state S_{I_3}|S_{D_3} to state S_{I_4}|S_{D_4} is larger than that from S_{I_3}|S_{D_3}
to S_{I_3}|S_{D_4}.

In the end, the sequence of internal states S_{I_1}|S_{D_1} - S_{I_2}|S_{D_2} - S_{I_3}|S_{D_3} - S_{I_3}|S_{D_4}
has the minimum cost and is chosen as the final policy taken by the imitator.
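The arithmetic above can be checked directly; the numbers are the ones read off Figure 4.1.

```python
# Policy pi_1: alternating action costs and distances along the chosen path.
f1 = 7 + 0 + 5 + 2 + 0 + 5        # A and D terms for pi_1
print(f1)                          # 19

# Policy pi_2: the single large distance D = 78 already rules it out.
f2_without_h = 7 + 0 + 5 + 78      # pi_2 before adding the heuristic h
print(f2_without_h)                # 90
```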
4.2 Learning the Policy as Well as the Distance Function
Because of the relationship between the distance function and the policy, the robot
can either learn the distance function and use it to produce the policy, or learn the policy
and derive the distance function accordingly.
Learning the distance function seems straightforward, but the problem is that there
are an infinite number of solutions for one policy. The robot may waste time learning
distance functions which produce the same policy.
On the other hand, there are fewer optimal policies for a given task¹. But there
are many distance functions that can produce the desired policy. Learning one possible
distance function is much easier than learning a particular one. Thus the robot learns
the optimal policy and derives the distance from that policy. Figure 4.2 shows how the
imitator learns the distance function:
As both the distance function and the optimal policy are unknown, we need a
learning mechanism to guide the learning process. Reinforcement learning is a technique
that allows the agent to learn the optimal policy through feedback (reward/penalty)
from the environment. As measuring the quality of imitation directly is often
difficult (and subjective), the feedback measures task performance to approximate
imitation quality.
Next we give a brief introduction to reinforcement learning and the stochastic
real-valued unit.
4.3 Reinforcement Learning
There are three major learning paradigms, supervised learning, unsupervised learn-
ing and reinforcement learning, each corresponding to a particular learning task. Super-
¹ The internal state may correspond to a different observed state, but the outcome is the same.
[Figure: flowchart of the learning loop. Use the distance function and the A* algorithm
to generate the current policy; generate a new policy by randomly perturbing the
distance function in the A* tree; execute the new policy and receive the reward. If the
reward is higher than the expected reward of the current policy, calculate the minimum
distance change required at each node for the tree to produce the new policy, and train
the neural network with the updated distance function. Repeat until convergence or a
maximum number of tries is reached.]
Figure 4.2. Learning the Distance Function.
vised learning is usually used to learn a function from examples of its inputs and outputs
while unsupervised learning learns a pattern only through the examples of its input.
Unlike supervised learning, where the desired output is presented along with the
input, in reinforcement learning only the input is provided. The agent learns an optimal
policy through reinforcement, reward or penalty. This approach is suitable when the
optimal policy is unknown or cannot be identified directly.
In reinforcement learning, the agent is responsible for exploring the environment while
learning a policy. Utilizing current knowledge to produce a policy is known as exploitation,
while randomly picking an action to produce a new policy that may yield a higher
reward is known as exploration. It is reasonable for the agent to explore during the early
stages of learning, while it has little knowledge of the environment, and to exploit toward
the end, when it knows the environment well and the cost of exploring tends to be
higher than the benefit it gains.
Reinforcement learning often learns slowly, requiring many samples before being
able to make an accurate judgment as to where to explore next in the action space. It
is even possible that in a high dimensional space with a complex reward function and
a large number of actions to learn, reinforcement learning only discovers a suboptimal
policy.
To overcome its weakness, reinforcement learning is used in conjunction with other
learning techniques, e.g. learning from demonstration. On the one hand, the imitator
uses reinforcement from the environment to learn the optimal policy. On the other hand,
the demonstration provides knowledge of the task and its possible solution, which can
speed up the learning process.
4.3.1 Real-Valued Reinforcement Learning
Gullapalli’s stochastic real-valued (SRV) unit [10] is a reinforcement learning technique
which generates an optimal mapping from real-valued inputs to a real-valued
output, guided by a reward signal.
The idea is to compute the real-valued outputs according to a normal distribution,
which is defined by a mean and a standard deviation. The mean is an estimate of
the optimal value, which has the maximum reinforcement from the environment. The
standard deviation controls the search space around the current mean value.
During the exploration, a new value is generated from the normal distribution.
After receiving feedback from the environment, these two parameters are adjusted to
increase the probability of producing the optimal real value. If the actual reward is higher
than the expected reward, the mean is pushed towards the new point proportionally to
how much better it is, and the standard deviation decreases. Otherwise, the mean is
pushed away from the new point and the standard deviation increases.
Initially the mean is chosen randomly and the standard deviation is high which
allows the agent to explore the entire space of possible output values. As learning pro-
ceeds, the mean moves around the optimal value and the standard deviation decreases.
In the end, when the mean reaches the optimal solution, the variance becomes 0 and no
more exploration is performed.
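A minimal sketch of such a unit is given below. The particular constants and the exact form of the mean/deviation updates are illustrative assumptions, not Gullapalli's published rules; only the qualitative behavior described above is reproduced.

```python
import random

class SRVUnit:
    """Sketch of a stochastic real-valued unit: output is drawn from a
    normal distribution whose mean and standard deviation are adapted
    from reinforcement (update constants are illustrative)."""

    def __init__(self, mean=0.0, std=1.0, lr=0.5):
        self.mean, self.std, self.lr = mean, std, lr
        self.expected_reward = 0.0

    def act(self):
        # Explore around the current estimate of the optimal output.
        return random.gauss(self.mean, self.std)

    def update(self, output, reward):
        advantage = reward - self.expected_reward
        if advantage > 0:
            # Push the mean toward the better output, shrink the search.
            self.mean += self.lr * advantage * (output - self.mean)
            self.std *= 0.95
        else:
            # Push the mean away from the worse output, widen the search.
            self.mean -= self.lr * (output - self.mean)
            self.std *= 1.05
        self.expected_reward += self.lr * advantage
```

As the mean approaches the optimal output, the rewards stop improving and the deviation shrinks, so exploration dies out just as the text describes.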
4.4 Exploration Over Policy Space
In order to learn the distance function, the robot first generates the current policy
using exploitation, then it generates a new policy using exploration. Two policies are
compared to each other. If the new policy receives a reward higher than the expected
reward of the current policy, a new distance function is derived from the new policy.
To allow the robot to perform both exploration and exploitation reasonably, the
distance for each state pair is represented by a normal distribution, which is defined by
a mean and a standard deviation. The mean is the current distance, an estimate of the
optimal value which obtained the maximum reward from the environment. The standard
deviation controls the search space around the current mean.
Generating a new policy is similar to generating a regular policy. The only excep-
tion is that the imitator produces a new distance value based on its mean and standard
deviation rather than the current distance. Figure 4.3 shows how a new policy is gener-
ated through exploration over distance space.
On the left hand side, the numbers inside nodes B, C and D are the current distance
between the internal state and its corresponding observed state. The bold links represent
the current policy derived from the current distance function.
[Figure: two A* search trees rooted at node A with children B, C and D. Left: the
current policy (bold links) derived from the current distances inside B, C and D; right:
the new policy derived after exploring new distance values for B, C and D. The action
costs on the links are unchanged.]
Figure 4.3. Explore on Policy Space.
On the right hand side, the numbers inside nodes B, C and D are the new distance,
the results of an exploration on distance space. Now node D has the smallest distance.
As the actual cost, action cost and heuristic cost² do not change, the distance is the only
factor that may cause a change in the derived policy. The bold links show the new policy
derived from exploration.
As soon as the new policy is generated, the robot executes the policy and waits for
the reward from the environment. This reward is compared with the current one. If the
new one happens to be better, the robot would like to derive a distance function from the
new policy. This process repeats until no better policy has been found for a reasonable
period of time.
Initially the standard deviation is big such that the imitator is able to explore the
entire distance space. It gradually decreases (Equation 4.6), assuming the mean has
moved toward the optimal value.
σ_{t+1} = σ_t γ    (4.6)

where σ_t is the standard deviation at time t and γ is a discount factor with 0 < γ < 1.

² We assume that the nodes in each level correspond to the same observed state, thus the heuristic cost can be ignored at this moment.
4.5 Derive Distance Function From Policy
When the new policy receives higher reward, the imitator derives a new distance
function from that policy, rather than learning the distance function used to generate the
new policy. Since that distance function is generated by randomly exploring the distance
space, it might not represent previous tasks well. In addition, learning an arbitrary
distance function is easier than learning a particular one as there are an infinite number
of distance functions for a policy.
In order to derive a new distance function from a policy, the imitator increases the
distance for nodes that have been chosen in the current policy but not in the new one
and decreases the distance for nodes that have been chosen in the new policy but not
in the current one, such that this adjusted distance function is able to produce the new
policy:
• For those nodes in the current A* tree, which have the maximum cost in each
branch, increase the distance such that the cost is higher than the maximum cost
in the new policy.
• For the sibling nodes in the new A* tree, increase the distance such that the cost
is higher than the policy node in the same level and its successors.
• For each policy node in the new A* tree, decrease the distance such that its cost is
lower than the maximum cost of the current policy, and the cost of its siblings and
its predecessors’ siblings.
The change made to the distance function is proportional to how much better the
new policy is compared to the current one. The bigger the difference, the bigger the
change is:
∆_i = \frac{\sum_k sign(∆_{k,i}) (|∆_{k,i}| + ε) (R_k − \bar{R}_k)}{\max\left(\sum_k (R_k − \bar{R}_k)\right)}    (4.7)

where k is a task, ∆_{k,i} is the change required for node i in task k, which could be positive
or negative, ε is the margin, R_k is the reward of the new policy for task k and \bar{R}_k is the
expected reward of the current policy generated by the network.
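The adjustment rule of Equation 4.7 can be sketched as follows. Treating the normalizer as the summed improvement over all tasks is an assumption made for this illustration, as are the dictionary interfaces.

```python
import math

def distance_change(deltas, rewards, expected, eps=0.1):
    """Sketch of Equation 4.7 for one node i.

    deltas   : {task k: minimum change Delta_{k,i} required at this node}
    rewards  : {task k: reward R_k of the new policy}
    expected : {task k: expected reward of the current policy}
    """
    # sign(D) * (|D| + eps) weighted by the improvement on each task
    num = sum(math.copysign(abs(d) + eps, d) * (rewards[k] - expected[k])
              for k, d in deltas.items())
    return num / sum(rewards[k] - expected[k] for k in deltas)
```

Tasks on which the new policy improved most thus dominate the aggregated change, matching the "the bigger the difference, the bigger the change" rule above.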
As we want to derive a new distance function with minimum change to the current
distance function, the imitator trains the neural network with the new distance function
for a fixed amount of time, then verifies whether the new policy can be produced. If the
answer is yes, a more accurate distance function has been learned; otherwise the imitator
tries to derive another new distance function based on the updated distance function.
4.6 Representation Issue
A distance function can be represented explicitly, e.g. by a table or a simple function.
But this only works when the task domain is small and simple. As the problem becomes more
complicated, a more suitable representation of the distance function is needed.
The nature of the distance function - given a pair of the observed and internal
states, return a value representing the similarity between them - allows the learning task
to be cast as a function approximation problem. We need a function approximator which
can generate appropriate distance values for the given states and make a good prediction
for those unseen.
Neural networks are one popular form for function approximation, having the capa-
bilities to approximate any continuous function to arbitrary accuracy when given enough
hidden units and an appropriate set of weights [11].
We use a multilayer perceptron network to represent the distance function. The
scaled conjugate gradient algorithm is used to train the network. Compared with back-
propagation and other learning algorithms, the scaled conjugate gradient algorithm is
easy to configure and shows good convergence.
4.6.1 Function Approximation
Representation is a big issue in artificial intelligence, which may affect performance
and generalization. For simple problems, explicit representations, e.g. a table or simple
function, can perform pretty well. But as the problem becomes more complicated, the
need for a suitable representation increases. The same happens here; we need a general
representation for our distance function. The nature of the distance function, calcu-
lating the distance for a given pair of observed state and internal state, is a function
approximation problem.
Function approximation deals with the problem of building an approximator (black
box model) to represent the relation³ embodied in a given set of input/output pairs such
that the approximator can generate appropriate output when presented with an input and
also gives a good prediction when given a new input.
Function approximation can be regarded as a problem of hypersurface construction.
For the given training set, the inputs correspond to coordinates and the outputs
correspond to the height of the surface. Learning becomes the problem of hypersurface
reconstruction, covering all the data points given in the examples and approximating
the surface between the data points (generalization).
³ Supposing this relation exists between input and output.
Neural networks are a popular form for function approximation and have been
applied to many problems. Given a large enough layer of hidden units and an appropriate
set of weights, any continuous function can be approximated to arbitrary accuracy [11].
But it is still a challenge to actually find a suitable network configuration, i.e. a set of
parameters that provides the best possible approximation for the given set of examples.
4.6.2 Neural Network
Originally inspired by biological systems, neural networks are composed of units
which are connected through links. By choosing the number of hidden layers, the number
of units in each layer, and the activation function, you can construct different neural
networks for different problems. Too few units can lead to underfitting, while too many units
can cause overfitting, where all training points are fitted well but the network shows
poor generalization for new inputs.
There are two main structures for neural networks: feed-forward networks and
recurrent networks. A feed-forward network represents a function of its current input
whereas a recurrent network feeds its output back into its own inputs. The two popular
feed-forward networks are the multilayer perceptron and the radial basis function network,
which are discussed in the following.
4.6.2.1 Multilayer Perceptron Network
The multilayer perceptron network contains one input layer, one or more hidden
layers⁴ and one output layer. Each layer contains a set of units. Every unit in each layer
is connected to each unit in the succeeding layer through a link, which is associated with a
weight parameter. Figure 4.4 shows a three-layer feed-forward network, which contains
one input layer, one hidden layer and one output layer⁵.

⁴ The more hidden layers, the larger the space of hypotheses the network can represent.
[Figure: a multilayer perceptron with input units x_1 ... x_n, a hidden layer with
activation g and weights w_ij, and an output unit y with weights w_j.]
Figure 4.4. Multilayer Perceptron.
The units in the hidden layer and output layer have an activation function, e.g. a
sigmoid function:

g(x) = \frac{1}{1 + e^{-x}}    (4.8)
First each input unit is fed an element from an input vector x; then each hidden
unit calculates its net input H_{x,j} by applying a weighted sum over all the input units,

H_{x,j} = \sum_i (w_{ij} x_i) + H_{b,j}    (4.9)

⁵ Since there is no processing taking place in the input layer, this network is also called a two-layer network or a one-hidden-layer network.
where x_i is the input from input unit i, w_{ij} is the weight between input unit i and hidden
unit j, and H_{b,j} is the bias on hidden unit j. Then it applies the activation function g to
produce its output H_{y,j}:

H_{y,j} = g(H_{x,j})    (4.10)
Finally the outputs from the hidden layer arrive at the output layer, which calculates
the weighted sum and applies the activation function f⁶ to produce the final output y of the
network:

y = f\left( \sum_j (w_j H_{y,j}) + O_b \right)    (4.11)

where w_j is the weight on the link between hidden unit j and the output unit and O_b is
the bias on the output unit.
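Equations 4.8 through 4.11 amount to the short forward pass below; taking the output activation f as the identity is an assumption (a common choice for function approximation).

```python
import math

def sigmoid(x):
    """Sigmoid activation, Equation 4.8."""
    return 1.0 / (1.0 + math.exp(-x))

def mlp_forward(x, W_hidden, b_hidden, w_out, b_out):
    """One-hidden-layer perceptron following Equations 4.9-4.11.
    W_hidden[j][i] is w_ij, b_hidden[j] is H_{b,j}; the output
    activation f is taken to be the identity here."""
    hidden = []
    for w_j, b_j in zip(W_hidden, b_hidden):
        net = sum(w * xi for w, xi in zip(w_j, x)) + b_j   # Eq. 4.9
        hidden.append(sigmoid(net))                        # Eq. 4.10
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out  # Eq. 4.11
```

For example, with zero inputs and zero biases every hidden unit outputs g(0) = 0.5, and the network output is just the sum of the output weights times 0.5.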
4.6.2.2 Radial Basis Function Network
A Radial Basis Function Network can be regarded as a feed-forward neural network
with a single hidden layer and a Gaussian activation function,

g(x) = e^{-\frac{|x - \mu|^2}{2\sigma^2}}    (4.12)

where μ and σ are the center and deviation of the Gaussian function, respectively.
When given an input vector, each hidden unit produces its output based on the
distance between the input vector and its center. The smaller the distance, the bigger
its output is. At the end, only those hidden units that are close enough to the input
vector contribute to the final output. In this sense RBF can be trained to represent a
distribution of the input data.
⁶ The activation function in the output layer could be different from the one in the hidden layer.
Usually the learning process in a RBF network is divided into two steps: unsuper-
vised learning and supervised learning. During unsupervised learning, the centers and
deviations of the hidden units are determined to represent the distribution of the sample
data. During supervised learning, the weights between the hidden units and output units
are adjusted to approximate the desired output.
In the unsupervised learning phase, the centers of all hidden units are selected to
reflect the natural clustering of the input data. The two most common methods are:
• randomly choosing from sample data points.
• choosing an optimal set of points as centers, such that each hidden unit is placed
at the centroid of clusters of the sample data, while each sample data point belongs
to the hidden unit that is nearest to it. This is called the K-means algorithm.
The deviation for each hidden unit is chosen such that Gaussians overlap with a
few nearby hidden units. You can choose one of the following methods:
• explicitly assign a value for all the hidden units, e.g. d_max / \sqrt{2m}, where d_max is the
maximum distance between the chosen centers and m is the number of hidden
units.
• K-nearest neighbors: each unit’s deviation is individually set to the mean distance
to its K nearest neighbors.
Having determined the center and deviation for each hidden unit, the next step is
to learn the weights between the hidden units and output units. Any learning algorithm
applied to multilayer perceptrons can be used for RBF networks. An alternative is to
adjust not only the weights between the hidden units and output units during supervised
learning, but also the centers and deviation of the hidden units.
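The two-phase training just described can be sketched for scalar inputs as follows; the centers are assumed fixed in advance (the unsupervised phase), the deviation follows the d_max / sqrt(2m) rule above, and only the output weights are fit by gradient steps (the supervised phase).

```python
import math

def rbf_forward(x, centers, sigma, weights):
    """RBF output: Gaussian hidden units (Eq. 4.12), linear output layer."""
    phi = [math.exp(-((x - c) ** 2) / (2 * sigma ** 2)) for c in centers]
    return sum(w * p for w, p in zip(weights, phi))

def train_rbf(samples, centers, lr=0.5, epochs=200):
    """Two-phase sketch: fixed centers, then fit output weights only.
    The learning rate and epoch count are illustrative choices."""
    d_max = max(abs(a - b) for a in centers for b in centers)
    sigma = d_max / math.sqrt(2 * len(centers)) if d_max else 1.0
    weights = [0.0] * len(centers)
    for _ in range(epochs):
        for x, y in samples:
            phi = [math.exp(-((x - c) ** 2) / (2 * sigma ** 2))
                   for c in centers]
            err = y - sum(w * p for w, p in zip(weights, phi))
            # Gradient step on the output weights only.
            weights = [w + lr * err * p for w, p in zip(weights, phi)]
    return weights, sigma
```

Because only the linear output layer is trained, each step is cheap, which is the fast convergence the text attributes to the two-phase procedure.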
Although the unsupervised learning allows the RBF network to converge fast, pre-
determining the centers and deviation without reference to the prediction task is also a
weakness.
Another weakness is the high dimensionality of each hidden unit. Since the number
of dimensions of each hidden unit is determined by the number of input parameters, the
dimension of the hidden units increases as the number of input parameters increases. One solution
could be feature reduction: making a good guess at the dimension of the hidden units,
assuming that some input parameters are not as important as others.
4.6.2.3 Multilayer Perceptrons vs. Radial Basis Function Network
MLP and RBF networks are two popular nonlinear layered feed-forward network
types. They present all the characteristics of the feed-forward network, but differ from
each other in the following aspects:
• A MLP has one or more hidden layers while a RBF network only has a single hidden
layer.
• A MLP computes the inner product of the input vector and weights between the
input units and hidden units and then applies the activation function; a RBF
network uses the input vector directly by computing the Euclidean distance between
the input vector and the center of the hidden units.
• In a MLP, the hidden unit is one dimensional, while in a RBF network, the dimen-
sion of the hidden units is the same as the size of the input vector. As RBF networks
are more sensitive to the curse of dimensionality, they may have great difficulties if
the number of parameters in the input vector is large.
• RBF networks use Gaussians as their activation function while MLPs use a sigmoid
function. Sigmoid units can have outputs over a large region of the input space
while Gaussian units only respond to relatively small regions of the input space.
• A MLP builds a global approximation for the given sample data set, while the
RBF network constructs a local approximation to represent the distribution of the
sample data.
• The output layer in a RBF network is linear, while in MLP it could be linear
(function approximation) or non-linear (classification).
• MLPs are trained by supervised learning, where the set of weights are adjusted
to minimize a non-linear error function. The training of RBF networks can be
split into an unsupervised learning phase, where the centers and deviations of the
hidden units are set based on sample data, and a supervised learning phase, which
determines the output-layer weights.
RBF networks’ local approximation makes the learning process faster. But on the
other hand, it also implies that RBFs are not inclined to extrapolate beyond known data:
the response drops off rapidly towards zero if data points are far from the training data.
RBF networks, even when designed efficiently, tend to have many more hidden
units than a comparable MLP does. The larger the input space, the more hidden units
RBF networks require.
RBF networks’ local approximation and fast convergence are very attractive for
our imitation problem. The problem is that the sample data used for unsupervised
initialization does not reflect the distribution of the input space, which grows as the
imitation proceeds. Also, as the input vector is quite large, it dramatically drives up the
dimensionality of the hidden units, which could be a potential problem.
4.6.3 Learning Algorithms
In order to find a suitable set of weights and biases, a learning algorithm is needed
to train the network. Most learning algorithms employ some form of gradient descent,
which takes the derivative of the error function with respect to the weight factors and
adjusts them in a gradient-related direction.
We start with back-propagation, then give a detailed introduction of the scaled
conjugate gradient algorithm [12].
4.6.3.1 Back-propagation
Backpropagation is a supervised learning technique used for training artificial neu-
ral networks. It requires that the activation function used by each unit be differentiable.
Standard backpropagation is a gradient descent algorithm in which the network
weights are moved along the negative of the gradient of the performance function.
Algorithm 4.1: Gradient Descent
Initialize each w_i to some small random value
repeat
    Initialize each ∆w_i to 0
    foreach <x_j, y_j> in the sample data set do
        Input the instance x_j to the unit and compute the output o_j
        foreach linear unit weight w_i do
            ∆w_i = ∆w_i − η ∂E_j/∂w_i
    foreach linear unit weight w_i do
        w_i = w_i + ∆w_i
until the termination condition is met
It starts with an initial parameter vector and iteratively adjusts the weights to decrease
the mean square error:

∆w_t = −η ∂E/∂w    (4.13)

w_t = w_{t−1} + ∆w_t    (4.14)

where E is the performance function, usually the mean square error; ∂E/∂w is the
gradient; η is the learning rate, controlling the step size of the update; w_{t−1} and w_t are
the weight factors at time t − 1 and t, respectively; and ∆w_t is the change being made to
weight w at time t.
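The update rule of Equations 4.13 and 4.14 can be sketched for a single linear unit with a mean-square error (a minimal illustration; the data and parameter values here are hypothetical, not taken from our experiments):

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, epochs=500):
    """Batch gradient descent for a linear unit X@w minimizing
    the mean square error E = 0.5 * mean((y - X@w)^2)."""
    w = np.zeros(X.shape[1])           # initial parameter vector
    for _ in range(epochs):
        err = y - X @ w                # residuals over the sample set
        grad = -(X.T @ err) / len(y)   # dE/dw
        w = w - eta * grad             # Eqs. 4.13-4.14
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
w_true = np.array([2.0, -1.0])
w = gradient_descent(X, X @ w_true)
assert np.allclose(w, w_true, atol=1e-3)
```

Note how many epochs even this trivial two-weight problem needs with a fixed small learning rate, which is exactly the motivation for the adaptive methods discussed next.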
As the direction is fixed, picking the learning rate for a nonlinear network becomes
a challenge. A learning rate that is too large leads to unstable learning. A learning rate
that is too small results in an incredibly long training time. When gradient descent is
performed on the error surface it is possible for the network solution to become trapped
in a local minimum. To overcome this problem, several variations have been proposed.
In Delta-bar-delta [13], the learning rate η is adaptable:

    η_t = η_{t−1} + κ        if δ̄_{t−1} δ_t > 0
    η_t = (1 − γ) η_{t−1}    if δ̄_{t−1} δ_t < 0
    η_t = η_{t−1}            otherwise

where

δ_t = ∂E_t/∂w_t    (4.15)

δ̄_t = (1 − α) δ_t + α δ̄_{t−1}    (4.16)
α is a momentum parameter, κ is the linear increment factor and γ is the exponential
decay factor. At each time step, the current derivative δ_t is compared with the averaged
previous derivative δ̄_{t−1}. If they have the same sign, the learning rate η is increased
linearly; if they have opposite signs, the learning rate is decreased exponentially. The
idea of using momentum is motivated by the need to escape from local minima, and
may be effective in certain problems.
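A single Delta-bar-delta step for one weight can be sketched as follows (the parameter values are illustrative defaults, not the ones used in our implementation):

```python
def delta_bar_delta(eta, delta_bar_prev, delta, kappa=0.05, gamma=0.1, alpha=0.7):
    """One delta-bar-delta step for a single weight: adapt the learning
    rate eta from the sign agreement of the current derivative and the
    running average of past derivatives (Eqs. 4.15-4.16)."""
    if delta_bar_prev * delta > 0:      # same sign: increase linearly
        eta = eta + kappa
    elif delta_bar_prev * delta < 0:    # sign flip: decay exponentially
        eta = (1.0 - gamma) * eta
    delta_bar = (1.0 - alpha) * delta + alpha * delta_bar_prev
    return eta, delta_bar

eta, dbar = delta_bar_delta(0.1, 0.5, 0.4)    # agreement -> eta grows
assert abs(eta - 0.15) < 1e-12
eta, dbar = delta_bar_delta(0.1, 0.5, -0.4)   # disagreement -> eta shrinks
assert abs(eta - 0.09) < 1e-12
```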
Backpropagation is generally very slow because it requires small learning rates for
stable learning. The momentum variation is usually faster since it allows higher learning
rates while maintaining stability.
4.6.3.2 Conjugate Gradient Algorithm
Although the network being trained may be theoretically capable of performing
correctly, backpropagation and its variations may not always find a solution. This is par-
tially because of a constant step size and partially because it uses a linear approximation:
E(w + y) ≈ E(w) + E′(w)^T y    (4.17)

where w and y are two weight vectors, E(w) is an error function (e.g. the least squares
function), and E′(w) is the vector of partial derivatives of E with respect to each
component of w.
The conjugate gradient algorithm, as opposed to backpropagation, is based on the
second order approximation:
E(w + y) ≈ E(w) + E′(w)^T y + (1/2) y^T E″(w) y    (4.18)

where E″(w) is the Hessian matrix of the error function E.
Let p_1, ..., p_N be a conjugate system⁷. The step from a starting point y_1 to a critical
point y* can be expressed as a linear combination of p_1, ..., p_N:

y* − y_1 = Σ_i α_i p_i,  α_i ∈ ℝ    (4.19)

⁷A set of nonzero vectors p_1, ..., p_N in ℝ^N is said to be a conjugate system with respect to a
nonsingular symmetric N × N matrix A if p_i^T A p_j = 0 for i ≠ j.
where p_i is the search direction, determined recursively as a linear combination of
the current steepest descent vector −E′(y_i) and the previous direction p_{i−1}, and α_j is a
step size, which can be expressed as

α_j = − (p_j^T E′(y_1)) / (p_j^T E″(w) p_j)    (4.20)
As can be seen, in each iteration the conjugate gradient algorithm has to calculate the
Hessian matrix E″(w_i), which requires O(N²) memory and O(N³) computation [12].
One possible approach is to approximate the step size with a line search.
4.6.3.3 Scaled Conjugate Gradient
In the conjugate gradient algorithm, a line search is used to approximate the step
size in order to avoid the Hessian matrix calculation. The scaled conjugate gradient
algorithm (Algorithm 4.2) instead avoids the line search by approximating the term
s_k = E″(w_k) p_k directly:

s_k = E″(w_k) p_k    (4.21)
    ≈ (E′(w_k + σ_k p_k) − E′(w_k)) / σ_k    (4.22)
Scaled conjugate gradient stops when any of the following conditions occurs:
• The maximum number of epochs (repetitions) is reached
• The maximum amount of time has been exceeded
• Performance has been minimized to the goal
• The performance gradient falls below mingrad (e.g. 10⁻⁶)
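The finite-difference approximation of Eq. 4.22 can be checked numerically; on a quadratic error it reproduces the exact Hessian-vector product (the quadratic test function here is purely illustrative):

```python
import numpy as np

def hessian_vector_approx(grad, w, p, sigma=1e-4):
    """Approximate s = E''(w) p by a finite difference of gradients
    (Eq. 4.22), avoiding the explicit Hessian matrix."""
    sigma_k = sigma / np.linalg.norm(p)
    return (grad(w + sigma_k * p) - grad(w)) / sigma_k

# quadratic error E(w) = 0.5 w^T A w, so E'(w) = A w and E''(w) = A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda w: A @ w
w, p = np.array([1.0, -1.0]), np.array([0.5, 2.0])
s = hessian_vector_approx(grad, w, p)
assert np.allclose(s, A @ p)    # exact for a quadratic error
```

Only two gradient evaluations are needed per direction, instead of the O(N²) memory and O(N³) computation required by the explicit Hessian.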
Algorithm 4.2: Scaled Conjugate Gradient
  1. Choose a weight vector w_1 and scalars 0 < σ ≤ 10⁻⁴, 0 < λ_1 ≤ 10⁻⁶, λ̄_1 = 0.
     Set p_1 = r_1 = −E′(w_1), k = 1 and success = true.
  2. If success = true, then calculate the second order information:
     σ_k = σ/|p_k|
     s_k = (E′(w_k + σ_k p_k) − E′(w_k))/σ_k
     δ_k = p_k^T s_k
  3. Scale δ_k: δ_k = δ_k + (λ_k − λ̄_k)|p_k|²
  4. If δ_k ≤ 0, then make the Hessian matrix positive definite:
     λ̄_k = 2(λ_k − δ_k/|p_k|²)
     δ_k = −δ_k + λ_k|p_k|²
     λ_k = λ̄_k
  5. Calculate the step size:
     µ_k = p_k^T r_k
     α_k = µ_k/δ_k
  6. Calculate the comparison parameter:
     ∆_k = 2δ_k[E(w_k) − E(w_k + α_k p_k)]/µ_k²
  7. If ∆_k ≥ 0, then a successful reduction in error can be made:
     w_{k+1} = w_k + α_k p_k, r_{k+1} = −E′(w_{k+1}), λ̄_k = 0, success = true.
     If k mod N = 0, then restart the algorithm: p_{k+1} = r_{k+1}
     else β_k = (|r_{k+1}|² − r_{k+1}^T r_k)/µ_k and p_{k+1} = r_{k+1} + β_k p_k.
     If ∆_k ≥ 0.75, then reduce the scale parameter: λ_k = λ_k/4.
     else (∆_k < 0): λ̄_k = λ_k, success = false.
  8. If ∆_k < 0.25, then increase the scale parameter: λ_k = λ_k + δ_k(1 − ∆_k)/|p_k|²
  9. If the steepest descent direction r_k ≠ 0, then set k = k + 1 and go to 2;
     else terminate and return w_{k+1} as the desired minimum.
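Algorithm 4.2 can be sketched compactly in code as follows (a simplified transcription of Møller's pseudocode, not our actual implementation; the convex quadratic test problem at the end is illustrative):

```python
import numpy as np

def scg(E, grad, w, sigma=1e-4, lam=1e-6, max_iter=200, tol=1e-8):
    """Scaled conjugate gradient (after Moller / Algorithm 4.2), using the
    finite-difference approximation of E''(w) p instead of the Hessian."""
    N = len(w)
    lam_bar, success, k = 0.0, True, 1
    r = -grad(w)
    p = r.copy()
    while k <= max_iter and np.linalg.norm(r) > tol:
        pp = p @ p
        if success:                               # step 2: second-order information
            sigma_k = sigma / np.sqrt(pp)
            s = (grad(w + sigma_k * p) - grad(w)) / sigma_k
            delta = p @ s
        delta += (lam - lam_bar) * pp             # step 3: scale delta
        if delta <= 0:                            # step 4: force positive definiteness
            lam_bar = 2.0 * (lam - delta / pp)
            delta = -delta + lam * pp
            lam = lam_bar
        mu = p @ r                                # step 5: step size
        alpha = mu / delta
        Delta = 2.0 * delta * (E(w) - E(w + alpha * p)) / mu ** 2   # step 6
        if Delta >= 0:                            # step 7: successful reduction
            w = w + alpha * p
            r_new = -grad(w)
            lam_bar, success = 0.0, True
            if k % N == 0:
                p = r_new                         # restart the direction
            else:
                beta = (r_new @ r_new - r_new @ r) / mu
                p = r_new + beta * p
            r = r_new
            if Delta >= 0.75:
                lam /= 4.0                        # trust the quadratic model more
        else:
            lam_bar, success = lam, False
        if Delta < 0.25:                          # step 8: increase the scale
            lam += delta * (1.0 - Delta) / pp
        k += 1
    return w

# illustrative test: minimize a convex quadratic with known minimum
A, b = np.array([[3.0, 1.0], [1.0, 2.0]]), np.array([1.0, 1.0])
E = lambda w: 0.5 * w @ A @ w - b @ w
grad = lambda w: A @ w - b
w_min = scg(E, grad, np.zeros(2))
assert np.allclose(w_min, np.linalg.solve(A, b), atol=1e-6)
```

On a quadratic error the method behaves like exact conjugate gradient, converging in about N iterations; the λ scaling only matters when the local Hessian is indefinite.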
4.6.4 Initialization Algorithm
In RBF networks, unsupervised learning is used to set the centers and deviations so as
to represent the distribution of the sample data, which speeds up the learning process
and its convergence.
Nguyen and Widrow [14] also proposed an initialization algorithm for a two-layer
neural network to improve the learning speed. Their approach generates initial weight
and bias values between input layer and hidden layer such that the active regions of the
layer's neurons are distributed approximately evenly over the input space. The remaining
weights, those between the hidden layer and the output layer, are chosen randomly from
a uniform distribution over [−1, 1]. This algorithm is commonly used as the default
initialization in scaled conjugate gradient implementations.
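The input-to-hidden part of the Nguyen-Widrow scheme [14] can be sketched as follows (the 0.7 scale factor and the shapes follow the usual formulation; the concrete sizes are illustrative):

```python
import numpy as np

def nguyen_widrow_init(n_inputs, n_hidden, rng=None):
    """Nguyen-Widrow initialization of input->hidden weights: spread the
    active regions of the hidden units roughly evenly over the input space."""
    rng = np.random.default_rng(0) if rng is None else rng
    beta = 0.7 * n_hidden ** (1.0 / n_inputs)      # target weight magnitude
    W = rng.uniform(-0.5, 0.5, size=(n_hidden, n_inputs))
    W *= beta / np.linalg.norm(W, axis=1, keepdims=True)
    b = rng.uniform(-beta, beta, size=n_hidden)    # biases shift the active regions
    return W, b

W, b = nguyen_widrow_init(n_inputs=4, n_hidden=8)
# every hidden unit's weight vector ends up with magnitude beta
assert np.allclose(np.linalg.norm(W, axis=1), 0.7 * 8 ** 0.25)
```

The normalization fixes the width of each unit's active region, while the random biases distribute those regions across the input space.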
In our experiments, randomly chosen weights performed as well as this initialization
algorithm. This is because the inputs available during initialization do not represent the
distribution of the input space, which grows as the imitation proceeds.
4.7 Discussion
In our approach, the imitator learns the distance function as well as the optimal
policy: it first learns the optimal policy through practical experience, then derives a new
distance function from the policy, and finally trains the neural network with this new
distance function. In order to learn the distance function, the maximum learning cost is
f_max = p · d · u    (4.23)
where p is the number of practical experiences executed, d is the number of derivations
(of the new distance function) for each practical experience if it receives a higher reward,
u is the number of network updates for each new distance function derived.
Table 4.1 gives an estimate of cost for learning a distance function in our approach:
Table 4.1. Resource Requirement

    number of practical experiences, p    450 (average)
    number of derivations, d              max. 100
    number of network updates, u          max. 500
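Plugging the values from Table 4.1 into Eq. 4.23 gives the worst-case bound:

```python
# values from Table 4.1: practical experiences, derivations, network updates
p, d, u = 450, 100, 500
f_max = p * d * u           # Eq. 4.23
assert f_max == 22_500_000  # at most 22.5 million network updates
```

Note that only p counts as practical experience; the d · u factor is purely off-line computation.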
There is a tradeoff between imitation capabilities and the amount of time spent
on learning a distance function. But as the learning process proceeds, the learning cost
could dramatically decrease because the distance function becomes more general and the
number of derivations and network updates decrease.
In general, and in particular in robot applications, this should be a good approach
because: (i) the derivation and network updates can be done using off-line learning; (ii)
in robot systems, the amount of practical experience is usually considered by far the
most important measure, and off-line learning is not as critical as long as its cost
remains reasonable.
4.7.1 Our Approach vs. Purely Autonomous Learning
Compared with the traditional learning approaches that use purely autonomous
learning, the imitation approach presented in this thesis has a number of advantages:
• The imitator learns the distance function as well as the policy. Like standard rein-
forcement learning, the imitator learns the optimal policy by exploring over policy
space and uses the reward from the environment to guide the learning process.
The imitator then derives the distance function from the policy using off-line learning.
Thus no additional practical experience is required to learn the distance function beyond
what is required to learn the policy.
• The distance function represents the similarity between the observed state and
internal state, which can be used to establish the correspondence between the ob-
served state and internal state. With this correspondence, the imitator is able to
learn the new task easier and faster. As the number of tasks increases, the overall
number of practical experiences should be lower than for the traditional learning
approaches because the distance function allows for a better initial policy for a
new learning task. At the same time, the overhead caused by learning the distance
function is expected to decrease also as the distance function becomes more general
and fewer changes should be required.
• Due to time limitation or other factors, a pure autonomous learning approach
may sometimes not be able to learn the task and thus might perform badly. Our
imitation approach still allows for reasonable task performance even if the system
has only one attempt at the new task, because of the established correspondence
and the similarity among the tasks.
Thus our imitation approach should most of the time be competitive and, as the number
of tasks to learn increases, should outperform systems that learn from scratch even in
terms of off-line learning costs.
CHAPTER 5
EXPERIMENTS
5.1 Introduction
Our experiments are implemented in a simulation domain. Figure 5.1 shows the
environment.
Figure 5.1. Imitation Environment. (Objects and their color/texture features: Toy:
BLUE/PLASTIC; Toy Corner: BROWN/WOOD; Futon1: WHITE/COTTON; Futon2:
BLACK/LEATHER; Sofa1: WHITE/COTTON; Trash: GREEN/PAPER; Obstacle:
RED/WOOD; Trash Bin: BLACK/MARBLE.)
There are a number of objects, e.g. trash, obstacle and trash bin, where the
demonstrator and imitator can perform a set of tasks.
5.1.1 State Representation
The state, both observed state and internal state used in these experiments, is
expressed by a set of relationships between the demonstrator/imitator and objects1 in
the environment. Each object is either "NEXT" to or "AWAY" from the demonstrator
or imitator. The only exception is that whenever an object is dropped into a trash bin
it disappears and its representation is removed from the state.
There is one more relation representing the status of the gripper, which is either empty
or holding an object. When it is not empty, e.g. "ON Gripper Toy", it cannot pick up
another object.
Except for the demonstrator, the imitator and the gripper, each object is repre-
sented by a set of features: color and texture. In each task, we intentionally make some
features more important than others, and let the robot learn from practice.
The robot has to learn the mapping between the observed state and internal state,
including not only the correspondence between the demonstrator and imitator, but also
the correspondence of features in the environment.
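One possible encoding of such a relational state (hypothetical; the thesis does not prescribe a concrete data structure, and all names here are illustrative) is a set of relation tuples:

```python
# hypothetical encoding of an internal state as a set of relation tuples;
# object features (color, texture) are carried alongside each relation
state = {
    ("NEXT", "imitator", ("trash", "GREEN", "PAPER")),
    ("AWAY", "imitator", ("trash_bin", "BLACK", "MARBLE")),
    ("ON", "gripper", None),            # gripper currently empty
}

def gripper_empty(state):
    """The gripper relation determines whether a Grab action is possible."""
    return ("ON", "gripper", None) in state

assert gripper_empty(state)
```

Dropping an object into the trash bin would then simply remove that object's tuples from the set, matching the state-representation rule above.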
5.1.2 Action Representation
The robot used in these experiments has some simple capabilities, e.g. moving around,
identifying the color and texture of objects, and carrying an object. Specifically, it can
perform the following actions:
• Move target : Move from the current location to the target.
• Grab object : Pick up an object using its gripper.
• Drop: Drop the object from its gripper. One special case of this action is that when
the robot drops an object into a trash bin, the object disappears.
1The maximum number of objects in the environment is three.
There is one action that the demonstrator can execute but the imitator cannot: the
PUSH action2.
5.2 Learning Procedure
The learning procedure is divided into two phases. First the imitator learns the
correspondence between the observed state and internal state through a set of simple
tasks, e.g. moving to a particular object, or grabbing an object. Then a set of real world
tasks are given, e.g. trash cleaning, toy collection and futon match. During this step, the
robot learns the important aspects of the tasks by practicing under different scenarios.
The correspondence built in the first step can be used directly in the second step, which
speeds up the learning process.
5.2.1 Learning the Mapping Between Observed and Internal State
In this step, the robot tries to learn the correspondence between the observed state
and the internal state through a set of simple tasks. There are 27 instances, including
moving from one location to another, grabbing a particular object, pushing an object
towards a location, or dropping an object.
As the demonstrator and the robot have different capabilities, the robot may not
perform a particular task in the same way as the demonstrator does. One example is
that the robot cannot execute a PUSH action, instead it executes a sequence of actions,
GRAB, MOVE and DROP to accomplish the task.
As the goal is to learn the correspondence between the observed state and internal
state, the environment for the imitation is the same as for the demonstration.
2The action executed by the demonstrator cannot be observed by the imitator; the action has to be inferred from the change in the environment.
5.2.2 Generalization in Dynamically Changing Environments
After establishing the mapping between the observed state and the internal state,
the robot is sometimes able to produce an optimal policy when given a more complex
task. The problem is that the demonstration is given under one scenario, which does not
represent the real task domain. Even if the robot successfully imitates one task, we can
not expect it to perform well when the environment slightly changes.
Meanwhile, not every feature present in the task is equally important; some of
them may be totally irrelevant. A practical approach is to let the robot practice under
different scenarios and learn important aspects of the task. By ignoring the irrelevant
aspects, the robot is able to derive an optimal policy even if the environment slightly
changes.
There are three tasks, trash cleaning, toy collection and futon match, with a to-
tal of 14 instances. Different scenarios are chosen randomly during each training step.
Whenever the robot executes a policy, it receives feedback from the environment or the
demonstrator. The feedback could be a full reward when it successfully finishes the task,
or a partial reward depending on how many subgoals it achieves.
5.3 Example Tasks
During the second learning phase, a set of complex tasks is given. It contains three
tasks, Trash Cleaning, Toy Collection and Futon Match. In the following section, we will
introduce them separately.
5.3.1 Trash Cleaning
In the cleaning task (Figure 5.2), there are three objects, trash, obstacle and trash
bin. The trash is distinguished from other objects by a GREEN color, which is not known
by the imitator.
Figure 5.2. Trash Cleaning. (The demonstrator moves to the trash, grabs it, moves to
the trash bin, and drops the trash. Objects: Trash: GREEN/PAPER; Obstacle:
RED/WOOD; Trash Bin: BLACK/MARBLE.)
Initially the demonstrator is far away from all the objects. It moves towards the
trash, grabs it, then carries it towards the trash can, and finally drops it into the trash
bin.
The robot has to learn not only the cleaning task, but also the identification of the
trash. During imitation, different types of trash are presented in the environment, i.e.
GREEN objects with different textures. Whenever the robot drops trash into the
trash bin, it gets a reward. It even gets a partial reward when it achieves part of the
task, e.g. grabbing the trash.
5.3.2 Toy Collection
Figure 5.3 shows how a demonstrator performs the Toy Collection task: it moves
to a toy and pushes it towards a particular location, called the Toy Corner.
Figure 5.3. Toy Collection. (The demonstrator moves to the toy and pushes it towards
the toy corner. Objects: Toy: BLUE/PLASTIC; Trash: GREEN/PAPER; Toy Corner:
BROWN/WOOD.)
Initially the imitator is next to the trash; it has to learn to deliver the toy to the
toy corner, not the trash.
The toy is distinguished from other objects by a PLASTIC texture, which is also
not known to the robot. Different kinds of toys are provided through various colors with
PLASTIC texture.
Another challenge of this task is that the robot does not have the PUSH capability.
Although the robot learns an alternative in the first training step, it must learn to apply
the same strategy under different scenarios.
5.3.3 Futon Matching
Futon Matching is a sorting problem, where the robot learns to put the futon on
the sofa whose color and texture are matched with the futon’s. The demonstration is
shown in Figure 5.4.
Figure 5.4. Futon Match. (The demonstrator grabs Futon2, moves to Futon1, drops
Futon2, grabs Futon1, moves to Sofa1, and drops Futon1. Objects: Futon1:
WHITE/COTTON; Futon2: BLACK/LEATHER; Sofa1: WHITE/COTTON.)
There are one sofa and two futons. The sofa is white and cotton, one futon is white
and cotton, the other is black and leather. Initially the black futon is on the sofa and the
demonstrator is next to them. The demonstrator grabs the black futon, moves towards
the white futon, drops the black one and grabs the white one. Finally the demonstrator
brings the white futon back and puts it on the sofa.
In order to learn the relationship between futon and sofa - the color and texture
match - the robot is given various types of sofas and futons, e.g. LEATHER, DENIM,
and TROPIC.
5.4 Experiments Result
In order to evaluate the distance function represented by the neural network, we
develop two reference distance functions: a random version and a deliberately hand-
coded version.
These two distance functions reflect two extreme cases. On the one hand, the
random version knows nothing about the task and randomly generates a distance value
for each state pair. On the other hand, the hand-coded version knows every detail of the
training tasks and performs a complex calculation to produce the distance. Depending
on the environment which implicitly indicates the task, either an exact match or a partial
match is performed between the observed state and internal state:
• Whenever there is a trash bin around, the GREEN object’s texture is ignored.
• Whenever there is a toy corner nearby, the PLASTIC object’s color is ignored.
As the toy collection task is executed by the PUSH action in the demonstration,
to allow the robot to generate an alternative, we carefully assign a small distance
between the final observed state and the internal state sequence where the robot
tries to grab a toy, move to the toy corner and drop the toy.
• As long as the objects in the internal state have the same color and texture, ignore
the difference between the observed objects and internal objects.
• Otherwise perform an exact match.
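The first two rules and the exact-match fallback can be sketched as a matching predicate (a hypothetical simplification; the object representation, function name, and environment encoding are invented for illustration and omit the PUSH-alternative and same-feature rules):

```python
def hand_coded_match(observed, internal, environment):
    """Simplified sketch of the hand-coded matching rules: objects are
    (color, texture) pairs; which features are compared depends on the
    environment, which implicitly indicates the task."""
    obs_color, obs_texture = observed
    int_color, int_texture = internal
    if "trash_bin" in environment and obs_color == "GREEN":
        return int_color == "GREEN"        # texture of trash is ignored
    if "toy_corner" in environment and obs_texture == "PLASTIC":
        return int_texture == "PLASTIC"    # color of toys is ignored
    return observed == internal            # otherwise: exact match

# near a trash bin, any GREEN object counts as trash regardless of texture
assert hand_coded_match(("GREEN", "PAPER"), ("GREEN", "WOOD"), {"trash_bin"})
# without a trash bin, the same pair fails the exact match
assert not hand_coded_match(("GREEN", "PAPER"), ("GREEN", "WOOD"), set())
```

Even in this reduced form, the rule set shows how much task-specific knowledge the hand-coded version encodes compared to the learned network.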
5.4.1 Evaluation Tasks
We define a set of tasks for evaluation purpose. It contains three different types:
simple tasks, complex tasks and new evaluation tasks.
First we define three new instances for the simple task to verify whether the robot
learns the correspondence between the observed states and internal states. There are
three objects in the environment, including a yellow stone, a sofa and an obstacle. The
demonstrator performs a certain action in each demonstration, e.g. move, grab or drop.
Second we want to verify whether the training tasks have been learned or not.
Trash Cleaning and Toy Collection are selected for this purpose. Each task is trained
and tested under three different cases, as shown in Tables 5.1 and 5.2.
Table 5.1. Experiment Configuration for Trash Cleaning

           Training                                   Testing
    Case 1 green/metal, glass, stone, paper, wood     green/cotton
    Case 2 green/metal, glass, stone, paper           green/wood
    Case 3 green/metal, wood, stone, paper            green/glass
Table 5.2. Experiment Configuration for Toy Collection

           Training                                   Testing
    Case 1 red, blue, yellow, white/plastic           brown/plastic
    Case 2 brown, blue, yellow, white/plastic         red/plastic
    Case 3 red, blue, brown, white/plastic            yellow/plastic
Third, we create two new tasks. One is a variation of the futon match task: instead
of having two futons and one sofa, this task has two sofas with different colors and
textures, and one futon matching one of them. We want to see whether the robot can
successfully finish the task after futon match training. The other one is a variation of the
trash cleaning task, where the demonstrator cleans two pieces of trash in the environment.
Different pieces of trash are given during the imitation.
5.4.2 Results
We have run 60 experiments over three different training and evaluation sets.
Figure 5.5 shows the task performance for case 1. (a) shows the average reward and
its standard deviation over 20 experiments, while (b) shows the result over 9 experiments
that learned a stable distance function within the given amount of training time. The
stable experiments are selected based on the task performance in the second training
step, by selecting those that did not find better policies for at least 6000 iterations.
Figure 5.5. Average reward for case 1 over 20 experiments (a) and 9 experiments (b).
(Bar groups in each panel: Simple Tasks, Complex Tasks, New Tasks; vertical axis:
reward, −0.2 to 1.2.)
In each figure, from left to right, are the task performance for four different distance
functions: random version(Random), distance function after first(1st) and second(2nd)
training step and the hand-coded version(Hand-coded), respectively.
Each distance function is evaluated on three different tasks: simple tasks, complex
tasks and new tasks. The number of instances differs per task type: 27 for
simple tasks, 14 for complex tasks and 7 for new tasks.
Figure 5.6 shows the results for case 2, where eleven experiments are selected as
stable using the same criteria as for Figure 5.5 (b).
Figure 5.6. Average reward for case 2 over 20 experiments (a) and 11 experiments (b).
Figure 5.7 shows the results for case 3. We can see that although the training and
evaluation sets are different between case 1, 2 and 3, the results are similar. After the
second training step, although the performance for the simple task degrades slightly, the
overall performance increases.
Figure 5.8 shows the overall performance over 60 experiments. After correspon-
dence training, the distance function performs well on the simple tasks, but not on the
more complex tasks and evaluation tasks. After the second training step, although the
performance on simple tasks degrades, the average reward over complex tasks and evaluation
tasks increases. This is even clearer for the selected experiments, where not only do the
average rewards increase, but the standard deviations also decrease.

Figure 5.7. Average reward for case 3 over 20 experiments (a) and 9 experiments (b).
The results show that the learned distance function significantly outperforms a
random function. For the cases where the learning approach found a stable solution, the
performance on the simple and the complex tasks is within one standard deviation of the
optimal hand-coded strategy. In addition, the performance on the new tasks shows the
approach is able to generalize beyond the training tasks to related new tasks.
Figure 5.8. Average reward over 60 experiments (a) and 29 experiments (b).
CHAPTER 6
CONCLUSION
As robotic systems move into more complex and realistic environments, the need
for learning capability increases. But traditional approaches that use purely autonomous
learning often learn slowly. To overcome this weakness, they are used in conjunction
with other learning techniques. Imitation, also called learning from demonstration, is
one such choice, as the demonstration provides a possible solution to the task.
The imitation problem can be considered as relating the given demonstration to
a sequence of internal states, which achieve the same effects. When the demonstrator
and imitator have similar bodies and capabilities, each internal state exactly matches the
observed state. But when the imitator is substantially different from the demonstrator,
the imitator may not be able to establish the correspondence between the observed state
and the internal state, and the imitation may fail.
In previous work, some approaches assume either that the demonstrator and imitator
have similar bodies and/or capabilities, or that the correspondence between the demon-
strator and imitator has already been established. Although other approaches use a
function to represent the similarity between the observed state and the internal state,
the function either has a simple form or makes assumptions about how the features are
related.
In this thesis, we introduce a distance function to represent the similarity between
the observed state and internal state. This distance function makes no assumption about
the structure of the function or the correspondence of features. By minimizing the
distance as well as action cost over the entire sequence of internal states, the imitator is
able to approximate the demonstration even if it has a different body and capabilities.
The imitator learns the distance function as well as the optimal policy. Instead of
learning the distance function directly, it learns an optimal policy through reinforcement
from the environment and derives the distance function from that policy. As measuring
the quality of imitation is often difficult and subjective (observer-dependent), task
performance is used to approximate imitation quality.
To facilitate learning, the imitator first learns the correspondences between the
observed state and internal state through some simple tasks. Then the more complex
tasks are used to learn the implications of environmental differences and of the imitator's
behavioral capabilities.
The experiments show that the imitator can easily learn the correspondence be-
tween the observed state and internal state even if it has different capabilities from the
demonstrator. As the tasks become more complex, the chance to find the optimal policy
varies.
After the second training step, the overall performance increases. If the imitator
performs well on the training task, its performance does not degrade dramatically on the
new tasks.
REFERENCES
[1] C. Atkeson and S. Schaal, “Robot learning from demonstration,” 1997.
[2] G. Hayes and J. Demiris, “A robot controller using learning by imitation,” 1994.
[3] M. N. Nicolescu, “A framework for learning from demonstration, generalization and
practice in human-robot domains,” Ph.D. dissertation, University of Southern Cal-
ifornia, May 2003.
[4] J. Demiris and G. Hayes, “Imitative learning mechanisms in robots and humans,”
Proceedings of the 5th European workshop on learning robots, pp. 88–93, 1996.
[5] C. Nehaniv and K. Dautenhahn, “Mapping between dissimilar bodies: Affordances
and the algebraic foundations of imitation.” Edinburgh, Scotland: In Proc. Euro-
pean Workshop on Learning Robots, July 1998, pp. 64–72.
[6] B. Price and C. Boutillier, “Imitation and reinforcement learning with heterogeneous
action,” 2001.
[7] M. Johnson and Y. Demiris, “Abstraction in recognition to solve the correspondence
problem for robot imitation,” 2004.
[8] S. V. Gudla and M. Huber, “Cost-based policy mapping for imitation,” in In pro-
ceedings of the 16th International FLAIRS Conference, St. Augustine, FL, 2003, pp.
17–21.
[9] P. Hart, N. Nilsson, and B. Raphael, “A formal basis for the heuristic determination
of minimum cost paths,” IEEE Transactions on Systems Science and Cybernetics,
vol. 4, no. 2, pp. 100–107, 1968.
[10] V. Gullapalli, “A stochastic reinforcement learning algorithm for learning real-valued
functions,” Neural Networks, vol. 3, pp. 671–692, 1990.
[11] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural
Networks, vol. 4, pp. 251–257, 1991.
[12] M. F. Møller, “A scaled conjugate gradient algorithm for fast supervised learning,”
Neural networks, vol. 6, pp. 525–533, 1993.
[13] R. A. Jacobs, “Increased rates of convergence through learning rate adaptation,”
Neural Networks, vol. 1, pp. 295–307, 1988.
[14] D. Nguyen and B. Widrow, “Improving the learning speed of 2-layer neural networks
by choosing initial values of the adaptive weights,” in Proc. of the International Joint
Conference on Neural Networks, vol. 3, 1990, pp. 21–26.
[15] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, second
edition ed. Prentice Hall, 2002.
[16] T. M. Mitchell, Machine Learning, first edition ed. McGraw Hill, 1997.
[17] A. J. Smith, “Dynamic generalisation of continuous action spaces in reinforcement
learning: A neurally inspired approach,” Ph.D. dissertation, University of Edin-
burgh, October 2001.
[18] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press,
1996, ch. Radial-Basis Function Networks, pp. 256–317.
BIOGRAPHICAL STATEMENT
Heng Kou received her Bachelor of Science degree in Computer Science from Branch
of Tianjin University, China. She started her graduate studies in August 2003 and
received her Master of Science degree in Computer Science and Engineering from The
University of Texas at Arlington in August 2006.