Inverse Reinforcement Learning for Telepresence Robots
Shimon Whiteson
Dept. of Computer Science
University of Oxford
joint work with Kyriacos Shiarlis & João Messias
Acknowledgements
João Messias, Kyriacos Shiarlis
Outline
• TERESA
• Inverse Reinforcement Learning
• Inverse Reinforcement Learning from Failure
• Results
• Next Steps
What Is Telepresence?
“Skype on a stick”
[Figure: a human controller (pilot) interacting with a remote interaction target through the robot]
Telepresence Benefits
• Greater physical presence: “Your alter ego on wheels”
• Mobility enables spontaneous interaction
Application: Education Accessibility
Application: Remote Health Care
Application: Elderly Accessibility
“Cafe Philo” Discussion Group
Les Arcades Prevention Centre in Troyes, France
Telepresence Limitations
• Learning curve for human controller
• Otherwise automatic behaviours (e.g., body language) require manual execution
• Human controller must simultaneously make high-level and low-level decisions
• Leads to cognitive overload: mistakes at the low level; less attention for the high level [Tsui et al. 2011]
• Result is poor quality social interaction
TERESA Solution
• A new telepresence system with partial autonomy: automate low-level decision making
• Free human controller to focus on high-level decisions
• Requires social intelligence:
• Social navigation
• Social conversation
TERESA Robot
Experiments with Real Subjects
Les Arcades Prevention Centre in Troyes, France
Video
Interface
Cognitive Architecture
[Figure: block diagram of the TERESA cognitive architecture. Audio & video from the environment feed an audiovisual processing module, which extracts people's locations, behaviour, and reactions; a state estimator and cost estimator turn these into state and cost estimates for the navigator and body pose controller, which output motion and body poses; the human controller issues high-level control signals and receives people's reactions.]
Outline
• TERESA
• Inverse Reinforcement Learning
• Inverse Reinforcement Learning from Failure
• Results
• Next Steps
Inverse Reinforcement Learning
Reward & Value
$$R(s, a) = \sum_{k=1}^{K} w_k \phi_k(s, a)$$

$$V^\pi(s) = \mathbb{E}\left\{ \sum_{t=1}^{h} R(s_t, a_t) \,\middle|\, s_1 = s \right\}$$

$$V^\pi(s) = \sum_{k=1}^{K} w_k \left( \mathbb{E}\left\{ \sum_{t=1}^{h} \phi_k(s_t, a_t) \,\middle|\, s_1 = s \right\} \right) = \sum_{k=1}^{K} w_k \, \mu_k^{\pi,1:h}|_{s_1=s}$$

where $w_k$ is an unknown weight, $\phi_k$ is a feature function, and $\mu_k^{\pi,1:h}|_{s_1=s}$ is the feature expectation under $s_1 = s$.
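To make the linear reward model concrete, here is a minimal numpy sketch; the MDP sizes, feature values, and weights are all invented for illustration and carry over to the later sketches.

```python
import numpy as np

# Toy tabular MDP (sizes are illustrative, reused in later sketches)
n_states, n_actions, K = 4, 2, 3
rng = np.random.default_rng(0)
phi = rng.random((n_states, n_actions, K))  # feature functions phi_k(s, a)
w = np.array([1.0, -0.5, 0.2])              # weights (unknown in IRL; made up here)

# R(s, a) = sum_k w_k * phi_k(s, a), for all (s, a) pairs at once
R = phi @ w                                 # shape: (n_states, n_actions)
```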
Data & Feature Expectations
Dataset: $\mathcal{D} = \{\tau_1, \tau_2, \ldots, \tau_N\}$

Trajectory: $\tau_i = \langle (s_1^{\tau_i}, a_1^{\tau_i}), (s_2^{\tau_i}, a_2^{\tau_i}), \ldots, (s_h^{\tau_i}, a_h^{\tau_i}) \rangle$

Empirical feature expectation: $\tilde{\mu}_k^D = \frac{1}{N} \sum_{\tau \in \mathcal{D}} \sum_{t=1}^{h} \phi_k(s_t^\tau, a_t^\tau)$

Feature expectation of $\pi$: $\mu_k^\pi|_D = \sum_{s \in S} P_D(s_1 = s) \, \mu_k^\pi|_{s_1=s}$
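The empirical feature expectation is just an average of per-trajectory feature counts. A minimal sketch, continuing the toy setup above (the trajectory encoding is illustrative):

```python
def empirical_feature_expectations(trajectories, phi):
    # mu~_k^D = (1/N) * sum over trajectories of sum_t phi_k(s_t, a_t)
    mu = np.zeros(phi.shape[-1])
    for tau in trajectories:                # tau: list of (state, action) pairs
        for s, a in tau:
            mu += phi[s, a]
    return mu / len(trajectories)

# Two hand-made trajectories of length h = 2 in the toy MDP
D = [[(0, 1), (2, 0)], [(1, 0), (3, 1)]]
mu_D = empirical_feature_expectations(D, phi)
```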
Maximum Causal Entropy IRL

find: $\max_\pi H(A^h \| S^h)$
subject to: $\tilde{\mu}_k^D = \mu_k^\pi|_D \quad \forall k$
and: $\sum_{a \in A} \pi(s, a) = 1 \quad \forall s \in S$
and: $\pi(s, a) \ge 0 \quad \forall s \in S,\, a \in A$

where $H$ is the causal entropy (each action depends only on previous states & actions):

$$H(A^h \| S^h) = -\sum_{t=1}^{h} \sum_{\substack{s_{1:t} \in S^t \\ a_{1:t} \in A^t}} P(a_{1:t}, s_{1:t}) \log P(a_t \mid s_t)$$

B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, pages 1433–1438, 2008.
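For a tabular, time-indexed policy, the causal entropy can be evaluated exactly by propagating the state occupancy through the model. A minimal sketch under the toy-MDP assumptions above (`T[s, a, s']` is the transition model, `p0` the start-state distribution; both are placeholders, not TERESA's models):

```python
def causal_entropy(policies, T, p0):
    # H(A^h || S^h) = sum_t E[ -log pi_t(a_t | s_t) ]
    d, H = p0.copy(), 0.0
    for pi in policies:                       # one (n_states, n_actions) array per step
        sa = d[:, None] * pi                  # joint P(s_t = s, a_t = a)
        H -= np.sum(sa * np.log(pi + 1e-12))  # accumulate E[-log pi_t(a_t | s_t)]
        d = np.einsum('sa,sap->p', sa, T)     # next-step state occupancy
    return H
```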
Lagrangian

$$L(\pi, w) = H(A^h \| S^h) + \sum_{k=1}^{K} w_k (\mu_k^\pi|_D - \tilde{\mu}_k^D)$$

find: $\min_w \left\{ \max_\pi L(\pi, w) \right\}$
subject to: $\sum_{a \in A} \pi(s, a) = 1 \quad \forall s \in S$
and: $\pi(s, a) \ge 0 \quad \forall s \in S,\, a \in A$
Solving the Lagrangian

Setting $\nabla_{\pi(s_t, a_t)} L(\pi, w)$ to zero implies:

$$\pi(s_t, a_t) \propto \exp\left( H(A^{t:h} \| S^{t:h}) + \sum_{k=1}^{K} w_k \, \mu_k^{\pi,t:h}|_{s_t, a_t} \right)$$

Finding $\pi$ corresponds to performing soft value iteration:

$$Q_w^{\mathrm{soft}}(s, a) = \sum_{k=1}^{K} w_k \phi_k(s, a) + \sum_{s'} T(s, a, s') V_w(s')$$

$$V_w^{\mathrm{soft}}(s) = \log \sum_a \exp(Q_w(s, a))$$

$$\pi(s, a) = \exp(Q_w(s, a) - V_w(s))$$
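A minimal sketch of this soft value iteration for the finite-horizon case (so the policy is time-indexed); shapes follow the toy MDP above, and this is an illustration rather than the TERESA implementation:

```python
from scipy.special import logsumexp

def soft_value_iteration(T, R, horizon):
    # Backward pass: Q_w(s,a) = R(s,a) + sum_s' T(s,a,s') V_w(s'),
    # V_w(s) = log sum_a exp Q_w(s,a), pi(s,a) = exp(Q_w - V_w).
    V = np.zeros(R.shape[0])
    policies = []
    for _ in range(horizon):
        Q = R + T @ V                          # expectation over next states
        V = logsumexp(Q, axis=1)               # soft maximum over actions
        policies.append(np.exp(Q - V[:, None]))
    return policies[::-1]                      # policies[t] is pi_t
```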
Outer Loop

Finding $\pi$ for a fixed $w$ corresponds to the inner loop of:

find: $\min_w \left\{ \max_\pi L(\pi, w) \right\}$

Gradient descent w.r.t. $w$ in the outer loop:

$$\nabla_w L(\pi, w) = \mu^\pi|_D - \tilde{\mu}^D$$
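Putting the two loops together gives a simple gradient scheme. A sketch reusing the helpers above; `feature_expectations` rolls the soft policy through the model to obtain $\mu^\pi|_D$:

```python
def feature_expectations(policies, T, phi, p0):
    # mu_k^pi|D = E[ sum_t phi_k(s_t, a_t) ] under pi, start distribution p0
    d, mu = p0.copy(), np.zeros(phi.shape[-1])
    for pi in policies:
        sa = d[:, None] * pi                   # joint P(s_t, a_t)
        mu += np.einsum('sa,sak->k', sa, phi)  # accumulate expected features
        d = np.einsum('sa,sap->p', sa, T)      # propagate occupancy
    return mu

def maxent_irl(T, phi, mu_D, p0, horizon, lr=0.1, iters=100):
    w = np.zeros(phi.shape[-1])
    for _ in range(iters):
        pi = soft_value_iteration(T, phi @ w, horizon)           # inner loop
        w -= lr * (feature_expectations(pi, T, phi, p0) - mu_D)  # outer step
    return w
```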
Outline
• TERESA
• Inverse Reinforcement Learning
• Inverse Reinforcement Learning from Failure
• Results
• Next Steps
Failed Demonstrations
• Existing IRL methods learn only from success
• IRL motivation: describing a reward function is hard, but demonstrating success is easy
• Failed demonstrations are also often available, since humans learn via trial and error
• We extend IRL to include learning from failure
• Our extension finds a cost function that encourages behaviour that:
• Maximises similarity to successful demos
• Minimises similarity to failed demos
Inverse Reinforcement Learning from Failure
Kyriacos Shiarlis‚ João Messias and Shimon Whiteson. Inverse Reinforcement Learning from Failure. In AAMAS 2016.
Naive Approach

Add inequality constraints:

$$|\tilde{\mu}_k^F - \mu_k^\pi|_F| > \alpha_k \quad \forall k$$

$$\max_{\pi, \alpha} H(A^h \| S^h) + \sum_{k=1}^{K} \alpha_k$$

Non-convex!

Convex relaxation:

$$\max_{\pi, \theta} H(A^h \| S^h) + \sum_{k=1}^{K} \theta_k (\mu_k^\pi|_F - \tilde{\mu}_k^F)$$

Intractable!
IRLF

$$\max_{\pi, \theta, z} \; H(A^h \| S^h) + \sum_{k=1}^{K} \theta_k z_k - \frac{\lambda}{2} \|\theta\|^2$$

(causal entropy + dissimilarity term − regularisation)

subject to: $\mu_k^\pi|_D = \tilde{\mu}_k^D \quad \forall k$
and: $\mu_k^\pi|_F - \tilde{\mu}_k^F = z_k \quad \forall k$ (new constraint)
and: $\sum_{a \in A} \pi(s, a) = 1 \quad \forall s \in S$
and: $\pi(s, a) \ge 0 \quad \forall s \in S,\, a \in A$
Lagrangian

$$L(\pi, \theta, z, w^D, w^F) = H(A^h \| S^h) - \frac{\lambda}{2}\|\theta\|^2 + \sum_{k=1}^{K} \theta_k z_k + \sum_{k=1}^{K} w_k^D (\mu_k^\pi|_D - \tilde{\mu}_k^D) + \sum_{k=1}^{K} w_k^F (\mu_k^\pi|_F - \tilde{\mu}_k^F - z_k)$$

find: $\min_w \left\{ \max_{\pi, \theta, z} L(\pi, \theta, z, w^D, w^F) \right\}$
subject to: $\sum_{a \in A} \pi(s, a) = 1 \quad \forall s \in S$
and: $\pi(s, a) \ge 0 \quad \forall s \in S,\, a \in A$
Solving the Lagrangian

First find critical points w.r.t. $z$ and $\theta$, then w.r.t. $\pi$, yielding:

$$\pi(s_t, a_t) \propto \exp\left( H(A^{t:h} \| S^{t:h}) + \sum_{k=1}^{K} (w_k^D + w_k^F) \, \mu_k^{\pi,t:h}|_{s_t, a_t} \right)$$

Yielding a new soft value iteration:

$$Q_w^{\mathrm{soft}}(s, a) = \sum_{k=1}^{K} (w_k^D + w_k^F) \phi_k(s, a) + \sum_{s'} T(s, a, s') V_w(s')$$

$$V_w^{\mathrm{soft}}(s) = \log \sum_a \exp(Q_w(s, a))$$

$$\pi(s, a) = \exp(Q_w(s, a) - V_w(s))$$
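Note that this is the same soft value iteration as before, just with $w$ replaced by $w^D + w^F$; in the sketch from earlier it would be a one-line change:

```python
# Combined weights: the failure data only shifts the effective reward
policies = soft_value_iteration(T, phi @ (w_D + w_F), horizon)
```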
Outer Loop

Finding $\pi, \theta, z$ for fixed $w^D, w^F$ corresponds to the inner loop of:

find: $\min_w \left\{ \max_{\pi, \theta, z} L(\pi, \theta, z, w^D, w^F) \right\}$
subject to: $\sum_{a \in A} \pi(s, a) = 1 \quad \forall s \in S$
and: $\pi(s, a) \ge 0 \quad \forall s \in S,\, a \in A$

Gradient descent w.r.t. $w^D$ and $w^F$ in the outer loop:

$$\nabla_{w_k^D} L^*(w^D, w^F) = \mu_k^{\pi^*}|_D - \tilde{\mu}_k^D$$

$$\nabla_{w_k^F} L^*(w^D, w^F) = \mu_k^{\pi^*}|_F - \tilde{\mu}_k^F - \lambda w_k^F$$
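A sketch of the full IRLF loop, reusing the earlier helpers; for simplicity it assumes the same start-state distribution `p0` for both the successful and failed datasets, which need not hold in general:

```python
def irlf(T, phi, mu_D, mu_F, p0, horizon, lam=1.0, lr=0.1, iters=100):
    K = phi.shape[-1]
    w_D, w_F = np.zeros(K), np.zeros(K)
    for _ in range(iters):
        pi = soft_value_iteration(T, phi @ (w_D + w_F), horizon)  # inner loop
        mu_pi = feature_expectations(pi, T, phi, p0)
        w_D -= lr * (mu_pi - mu_D)                 # gradient w.r.t. w^D
        w_F -= lr * (mu_pi - mu_F - lam * w_F)     # gradient w.r.t. w^F
    return w_D, w_F
```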
Incremental IRLF
• Updates to each set of weights affect feature expectations and thus interact with each other
• Some failed demos may be better for fine tuning
• Solution: anneal λ
Outline
• TERESA
• Inverse Reinforcement Learning
• Inverse Reinforcement Learning from Failure
• Results
• Next Steps
Simulated Robot Domain
Simulated Robot Results

[Figure: performance w.r.t. Expert and w.r.t. Taboo in three scenarios:
• Overlapping (E: reach target, avoid obstacles; T: reach target, hit obstacle)
• Complementary (E: reach target; T: hit obstacle)
• Contrasting (E: reach target, avoid obstacles; T: hit obstacle)]
Varying # of Successful Demos (Contrasting Scenario)
Learned Reward Functions

[Figure: reward functions learned by IRL vs. IRLF from the Expert and Taboo data (contrasting scenario)]
Factory Domain Results
[Figure: performance w.r.t. Expert and w.r.t. Taboo]
R. Dearden & C. Boutilier. Abstraction and Approximate Decision-Theoretic Planning. Artificial Intelligence, 89(1):219–283, 1997.
Real Robot Data
Real-World IRL Evaluation
• No ground-truth reward functions available
• Current practice: use qualitative and/or ad-hoc metrics
• New approach (sketched in code below):
• Apply IRL to D and F separately
• Resulting reward functions := ground truth
• Generate new data from corresponding policies
• Apply IRLF to new datasets
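A sketch of this protocol using the toy helpers from the earlier slides (the data handling is illustrative, not the actual experimental pipeline):

```python
# 1. Apply standard IRL to D and F separately; treat results as ground truth
w_gt_D = maxent_irl(T, phi, mu_D, p0, horizon)
w_gt_F = maxent_irl(T, phi, mu_F, p0, horizon)

# 2. Generate new data from the corresponding soft-optimal policies
pi_D = soft_value_iteration(T, phi @ w_gt_D, horizon)
pi_F = soft_value_iteration(T, phi @ w_gt_F, horizon)
new_mu_D = feature_expectations(pi_D, T, phi, p0)
new_mu_F = feature_expectations(pi_F, T, phi, p0)

# 3. Apply IRLF to the regenerated datasets; compare against the ground truth
w_D, w_F = irlf(T, phi, new_mu_D, new_mu_F, p0, horizon)
```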
Real Robot Results
[Figure: performance w.r.t. Expert and w.r.t. Taboo]
Outline
• TERESA
• Inverse Reinforcement Learning
• Inverse Reinforcement Learning from Failure
• Results
• Next Steps
Rapidly Exploring Random Trees
Figure source: Wikipedia
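For context, a minimal 2-D RRT sketch of the standard algorithm (not TERESA's planner); obstacle checking is omitted and all parameters are illustrative:

```python
import numpy as np

def rrt(start, goal, n_iters=2000, step=0.05, goal_tol=0.1, seed=0):
    # Grow a tree from `start` by repeatedly steering the nearest node
    # toward a random sample in the unit square.
    rng = np.random.default_rng(seed)
    goal = np.asarray(goal, float)
    nodes, parents = [np.asarray(start, float)], [0]
    for _ in range(n_iters):
        sample = rng.random(2)                       # random point in [0, 1]^2
        i = min(range(len(nodes)),                   # nearest existing node
                key=lambda j: np.linalg.norm(nodes[j] - sample))
        d = np.linalg.norm(sample - nodes[i])
        new = nodes[i] + min(step, d) / max(d, 1e-9) * (sample - nodes[i])
        nodes.append(new)
        parents.append(i)
        if np.linalg.norm(new - goal) < goal_tol:    # close enough: extract path
            path, j = [new], len(nodes) - 1
            while j != 0:
                j = parents[j]
                path.append(nodes[j])
            return path[::-1]
    return None

path = rrt([0.1, 0.1], [0.9, 0.9])
```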
Next Version of TERESA
Extra Slides
Wizard of Oz Methodology

[Figure: experimental setup with the human controller, an interaction target, human obstacles, and a hidden Wizard-of-Oz operator]
Deep IRLF
Figure source: Wulfmeier et al.
Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Maximum Entropy Deep Inverse Reinforcement Learning. arXiv:1507.04888, 2015.