Inverse Reinforcement Learning for Telepresence Robots
Shimon Whiteson
Dept. of Computer Science
University of Oxford
joint work with Kyriacos Shiarlis & João Messias
Acknowledgements
João Messias, Kyriacos Shiarlis
Outline
• TERESA
• Inverse Reinforcement Learning
• Inverse Reinforcement Learning from Failure
• Results
• Next Steps
What Is Telepresence?
“Skype on a stick”
[Figure: a human controller (pilot) interacting with a remote interaction target through the robot]
Telepresence Benefits
• Greater physical presence: “Your alter ego on wheels”
• Mobility enables spontaneous interaction
Application: Education Accessibility
Application: Remote Health Care
Application: Elderly Accessibility
“Cafe Philo” Discussion Group
Les Arcades Prevention Centre in Troyes, France
Telepresence Limitations
• Learning curve for human controller
• Otherwise automatic behaviours (e.g., body language) require manual execution
• Human controller must simultaneously make high-level and low-level decisions
• Leads to cognitive overload: mistakes at the low level; less attention for the high level [Tsui et al. 2011]
• Result is poor quality social interaction
TERESA Solution
• A new telepresence system with partial autonomy: automate low-level decision making
• Free human controller to focus on high-level decisions
• Requires social intelligence:
• Social navigation
• Social conversation
TERESA Robot
Experiments with Real Subjects
Les Arcades Prevention Centre in Troyes, France
Video
Interface
Cognitive Architecture
[Figure: block diagram of the TERESA cognitive architecture. Audio & video from the environment feed an audiovisual processing module, which extracts people's locations, behaviour, and reactions; a state estimator and cost estimator turn these into state and cost estimates for the navigator and body pose controller, which output motion and body poses; the human controller issues high-level control signals and receives people's reactions.]
Outline
• TERESA
• Inverse Reinforcement Learning
• Inverse Reinforcement Learning from Failure
• Results
• Next Steps
Inverse Reinforcement Learning
Reward & Value
$$R(s, a) = \sum_{k=1}^{K} w_k \phi_k(s, a)$$

$$V^\pi(s) = \mathbb{E}\left\{ \sum_{t=1}^{h} R(s_t, a_t) \,\middle|\, s_1 = s \right\}$$

$$V^\pi(s) = \sum_{k=1}^{K} w_k \left( \mathbb{E}\left\{ \sum_{t=1}^{h} \phi_k(s_t, a_t) \,\middle|\, s_1 = s \right\} \right) = \sum_{k=1}^{K} w_k \, \mu_k^{\pi,1:h}|_{s_1=s}$$

where $w_k$ is an unknown weight, $\phi_k$ is a feature function, and $\mu_k^{\pi,1:h}|_{s_1=s}$ is the feature expectation under $s_1 = s$.
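To make the linear reward model concrete, here is a minimal numpy sketch; the MDP sizes, feature values, and weights are all invented for illustration and carry over to the later sketches.

```python
import numpy as np

# Toy tabular MDP (sizes are illustrative, reused in later sketches)
n_states, n_actions, K = 4, 2, 3
rng = np.random.default_rng(0)
phi = rng.random((n_states, n_actions, K))  # feature functions phi_k(s, a)
w = np.array([1.0, -0.5, 0.2])              # weights (unknown in IRL; made up here)

# R(s, a) = sum_k w_k * phi_k(s, a), for all (s, a) pairs at once
R = phi @ w                                 # shape: (n_states, n_actions)
```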
Data & Feature Expectations
Dataset: $\mathcal{D} = \{\tau_1, \tau_2, \ldots, \tau_N\}$

Trajectory: $\tau_i = \langle (s_1^{\tau_i}, a_1^{\tau_i}), (s_2^{\tau_i}, a_2^{\tau_i}), \ldots, (s_h^{\tau_i}, a_h^{\tau_i}) \rangle$

Empirical feature expectation: $\tilde{\mu}_k^D = \frac{1}{N} \sum_{\tau \in \mathcal{D}} \sum_{t=1}^{h} \phi_k(s_t^\tau, a_t^\tau)$

Feature expectation of $\pi$: $\mu_k^\pi|_D = \sum_{s \in S} P_D(s_1 = s) \, \mu_k^\pi|_{s_1=s}$
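The empirical feature expectation is just an average of per-trajectory feature counts. A minimal sketch, continuing the toy setup above (the trajectory encoding is illustrative):

```python
def empirical_feature_expectations(trajectories, phi):
    # mu~_k^D = (1/N) * sum over trajectories of sum_t phi_k(s_t, a_t)
    mu = np.zeros(phi.shape[-1])
    for tau in trajectories:                # tau: list of (state, action) pairs
        for s, a in tau:
            mu += phi[s, a]
    return mu / len(trajectories)

# Two hand-made trajectories of length h = 2 in the toy MDP
D = [[(0, 1), (2, 0)], [(1, 0), (3, 1)]]
mu_D = empirical_feature_expectations(D, phi)
```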
Maximum Causal Entropy IRL

find: $\max_\pi H(A^h \| S^h)$
subject to: $\tilde{\mu}_k^D = \mu_k^\pi|_D \quad \forall k$
and: $\sum_{a \in A} \pi(s, a) = 1 \quad \forall s \in S$
and: $\pi(s, a) \ge 0 \quad \forall s \in S,\, a \in A$

where $H$ is the causal entropy (each action depends only on previous states & actions):

$$H(A^h \| S^h) = -\sum_{t=1}^{h} \sum_{\substack{s_{1:t} \in S^t \\ a_{1:t} \in A^t}} P(a_{1:t}, s_{1:t}) \log P(a_t \mid s_t)$$

B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, pages 1433–1438, 2008.
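For a tabular, time-indexed policy, the causal entropy can be evaluated exactly by propagating the state occupancy through the model. A minimal sketch under the toy-MDP assumptions above (`T[s, a, s']` is the transition model, `p0` the start-state distribution; both are placeholders, not TERESA's models):

```python
def causal_entropy(policies, T, p0):
    # H(A^h || S^h) = sum_t E[ -log pi_t(a_t | s_t) ]
    d, H = p0.copy(), 0.0
    for pi in policies:                       # one (n_states, n_actions) array per step
        sa = d[:, None] * pi                  # joint P(s_t = s, a_t = a)
        H -= np.sum(sa * np.log(pi + 1e-12))  # accumulate E[-log pi_t(a_t | s_t)]
        d = np.einsum('sa,sap->p', sa, T)     # next-step state occupancy
    return H
```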
Lagrangian

$$L(\pi, w) = H(A^h \| S^h) + \sum_{k=1}^{K} w_k (\mu_k^\pi|_D - \tilde{\mu}_k^D)$$

find: $\min_w \left\{ \max_\pi L(\pi, w) \right\}$
subject to: $\sum_{a \in A} \pi(s, a) = 1 \quad \forall s \in S$
and: $\pi(s, a) \ge 0 \quad \forall s \in S,\, a \in A$
Solving the Lagrangian

Setting $\nabla_{\pi(s_t, a_t)} L(\pi, w)$ to zero implies:

$$\pi(s_t, a_t) \propto \exp\left( H(A^{t:h} \| S^{t:h}) + \sum_{k=1}^{K} w_k \, \mu_k^{\pi,t:h}|_{s_t, a_t} \right)$$

Finding $\pi$ corresponds to performing soft value iteration:

$$Q_w^{\mathrm{soft}}(s, a) = \sum_{k=1}^{K} w_k \phi_k(s, a) + \sum_{s'} T(s, a, s') V_w(s')$$

$$V_w^{\mathrm{soft}}(s) = \log \sum_a \exp(Q_w(s, a))$$

$$\pi(s, a) = \exp(Q_w(s, a) - V_w(s))$$
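A minimal sketch of this soft value iteration for the finite-horizon case (so the policy is time-indexed); shapes follow the toy MDP above, and this is an illustration rather than the TERESA implementation:

```python
from scipy.special import logsumexp

def soft_value_iteration(T, R, horizon):
    # Backward pass: Q_w(s,a) = R(s,a) + sum_s' T(s,a,s') V_w(s'),
    # V_w(s) = log sum_a exp Q_w(s,a), pi(s,a) = exp(Q_w - V_w).
    V = np.zeros(R.shape[0])
    policies = []
    for _ in range(horizon):
        Q = R + T @ V                          # expectation over next states
        V = logsumexp(Q, axis=1)               # soft maximum over actions
        policies.append(np.exp(Q - V[:, None]))
    return policies[::-1]                      # policies[t] is pi_t
```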
Outer Loop

Finding $\pi$ for a fixed $w$ corresponds to the inner loop of:

find: $\min_w \left\{ \max_\pi L(\pi, w) \right\}$

Gradient descent w.r.t. $w$ in the outer loop:

$$\nabla_w L(\pi, w) = \mu^\pi|_D - \tilde{\mu}^D$$
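Putting the two loops together gives a simple gradient scheme. A sketch reusing the helpers above; `feature_expectations` rolls the soft policy through the model to obtain $\mu^\pi|_D$:

```python
def feature_expectations(policies, T, phi, p0):
    # mu_k^pi|D = E[ sum_t phi_k(s_t, a_t) ] under pi, start distribution p0
    d, mu = p0.copy(), np.zeros(phi.shape[-1])
    for pi in policies:
        sa = d[:, None] * pi                   # joint P(s_t, a_t)
        mu += np.einsum('sa,sak->k', sa, phi)  # accumulate expected features
        d = np.einsum('sa,sap->p', sa, T)      # propagate occupancy
    return mu

def maxent_irl(T, phi, mu_D, p0, horizon, lr=0.1, iters=100):
    w = np.zeros(phi.shape[-1])
    for _ in range(iters):
        pi = soft_value_iteration(T, phi @ w, horizon)           # inner loop
        w -= lr * (feature_expectations(pi, T, phi, p0) - mu_D)  # outer step
    return w
```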
Outline
• TERESA
• Inverse Reinforcement Learning
• Inverse Reinforcement Learning from Failure
• Results
• Next Steps
Failed Demonstrations
• Existing IRL methods learn only from success
• IRL motivation: describing a reward function is hard, but demonstrating success is easy
• Failed demonstrations are also often available, since humans learn via trial and error
• We extend IRL to include learning from failure
• Our extension finds a cost function that encourages behaviour that:
• Maximises similarity to successful demos
• Minimises similarity to failed demos
Inverse Reinforcement Learning from Failure
Kyriacos Shiarlis‚ João Messias and Shimon Whiteson. Inverse Reinforcement Learning from Failure. In AAMAS 2016.
Naive Approach

Add inequality constraints:

$$|\tilde{\mu}_k^F - \mu_k^\pi|_F| > \alpha_k \quad \forall k$$

$$\max_{\pi, \alpha} H(A^h \| S^h) + \sum_{k=1}^{K} \alpha_k$$

Non-convex!

Convex relaxation:

$$\max_{\pi, \theta} H(A^h \| S^h) + \sum_{k=1}^{K} \theta_k (\mu_k^\pi|_F - \tilde{\mu}_k^F)$$

Intractable!
IRLF

$$\max_{\pi, \theta, z} \; H(A^h \| S^h) + \sum_{k=1}^{K} \theta_k z_k - \frac{\lambda}{2} \|\theta\|^2$$

(causal entropy + dissimilarity term − regularisation)

subject to: $\mu_k^\pi|_D = \tilde{\mu}_k^D \quad \forall k$
and: $\mu_k^\pi|_F - \tilde{\mu}_k^F = z_k \quad \forall k$ (new constraint)
and: $\sum_{a \in A} \pi(s, a) = 1 \quad \forall s \in S$
and: $\pi(s, a) \ge 0 \quad \forall s \in S,\, a \in A$
Lagrangian

$$L(\pi, \theta, z, w^D, w^F) = H(A^h \| S^h) - \frac{\lambda}{2}\|\theta\|^2 + \sum_{k=1}^{K} \theta_k z_k + \sum_{k=1}^{K} w_k^D (\mu_k^\pi|_D - \tilde{\mu}_k^D) + \sum_{k=1}^{K} w_k^F (\mu_k^\pi|_F - \tilde{\mu}_k^F - z_k)$$

find: $\min_w \left\{ \max_{\pi, \theta, z} L(\pi, \theta, z, w^D, w^F) \right\}$
subject to: $\sum_{a \in A} \pi(s, a) = 1 \quad \forall s \in S$
and: $\pi(s, a) \ge 0 \quad \forall s \in S,\, a \in A$
Solving the Lagrangian

First find critical points w.r.t. $z$ and $\theta$, then w.r.t. $\pi$, yielding:

$$\pi(s_t, a_t) \propto \exp\left( H(A^{t:h} \| S^{t:h}) + \sum_{k=1}^{K} (w_k^D + w_k^F) \, \mu_k^{\pi,t:h}|_{s_t, a_t} \right)$$

Yielding a new soft value iteration:

$$Q_w^{\mathrm{soft}}(s, a) = \sum_{k=1}^{K} (w_k^D + w_k^F) \phi_k(s, a) + \sum_{s'} T(s, a, s') V_w(s')$$

$$V_w^{\mathrm{soft}}(s) = \log \sum_a \exp(Q_w(s, a))$$

$$\pi(s, a) = \exp(Q_w(s, a) - V_w(s))$$
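Note that this is the same soft value iteration as before, just with $w$ replaced by $w^D + w^F$; in the sketch from earlier it would be a one-line change:

```python
# Combined weights: the failure data only shifts the effective reward
policies = soft_value_iteration(T, phi @ (w_D + w_F), horizon)
```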
Outer Loop

Finding $\pi, \theta, z$ for fixed $w^D, w^F$ corresponds to the inner loop of:

find: $\min_w \left\{ \max_{\pi, \theta, z} L(\pi, \theta, z, w^D, w^F) \right\}$
subject to: $\sum_{a \in A} \pi(s, a) = 1 \quad \forall s \in S$
and: $\pi(s, a) \ge 0 \quad \forall s \in S,\, a \in A$

Gradient descent w.r.t. $w^D$ and $w^F$ in the outer loop:

$$\nabla_{w_k^D} L^*(w^D, w^F) = \mu_k^{\pi^*}|_D - \tilde{\mu}_k^D$$

$$\nabla_{w_k^F} L^*(w^D, w^F) = \mu_k^{\pi^*}|_F - \tilde{\mu}_k^F - \lambda w_k^F$$
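A sketch of the full IRLF loop, reusing the earlier helpers; for simplicity it assumes the same start-state distribution `p0` for both the successful and failed datasets, which need not hold in general:

```python
def irlf(T, phi, mu_D, mu_F, p0, horizon, lam=1.0, lr=0.1, iters=100):
    K = phi.shape[-1]
    w_D, w_F = np.zeros(K), np.zeros(K)
    for _ in range(iters):
        pi = soft_value_iteration(T, phi @ (w_D + w_F), horizon)  # inner loop
        mu_pi = feature_expectations(pi, T, phi, p0)
        w_D -= lr * (mu_pi - mu_D)                 # gradient w.r.t. w^D
        w_F -= lr * (mu_pi - mu_F - lam * w_F)     # gradient w.r.t. w^F
    return w_D, w_F
```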
Incremental IRLF
• Updates to each set of weights affect feature expectations and thus interact with each other
• Some failed demos may be better for fine tuning
• Solution: anneal λ
Outline
• TERESA
• Inverse Reinforcement Learning
• Inverse Reinforcement Learning from Failure
• Results
• Next Steps
Simulated Robot Domain
Simulated Robot Results

[Figure: performance w.r.t. Expert and w.r.t. Taboo in three scenarios:
• Overlapping (E: reach target, avoid obstacles; T: reach target, hit obstacle)
• Complementary (E: reach target; T: hit obstacle)
• Contrasting (E: reach target, avoid obstacles; T: hit obstacle)]
Varying # of Successful Demos (Contrasting Scenario)
Learned Reward Functions

[Figure: reward functions learned by IRL vs. IRLF from the Expert and Taboo data (contrasting scenario)]
Factory Domain Results
[Figure: performance w.r.t. Expert and w.r.t. Taboo]
R. Dearden & C. Boutilier. Abstraction and Approximate Decision-Theoretic Planning. Artificial Intelligence, 89(1):219–283, 1997.
Real Robot Data
Real-World IRL Evaluation
• No ground-truth reward functions available
• Current practice: use qualitative and/or ad-hoc metrics
• New approach (sketched in code below):
• Apply IRL to D and F separately
• Resulting reward functions := ground truth
• Generate new data from corresponding policies
• Apply IRLF to new datasets
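A sketch of this protocol using the toy helpers from the earlier slides (the data handling is illustrative, not the actual experimental pipeline):

```python
# 1. Apply standard IRL to D and F separately; treat results as ground truth
w_gt_D = maxent_irl(T, phi, mu_D, p0, horizon)
w_gt_F = maxent_irl(T, phi, mu_F, p0, horizon)

# 2. Generate new data from the corresponding soft-optimal policies
pi_D = soft_value_iteration(T, phi @ w_gt_D, horizon)
pi_F = soft_value_iteration(T, phi @ w_gt_F, horizon)
new_mu_D = feature_expectations(pi_D, T, phi, p0)
new_mu_F = feature_expectations(pi_F, T, phi, p0)

# 3. Apply IRLF to the regenerated datasets; compare against the ground truth
w_D, w_F = irlf(T, phi, new_mu_D, new_mu_F, p0, horizon)
```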
Real Robot Results
[Figure: performance w.r.t. Expert and w.r.t. Taboo]
Outline
• TERESA
• Inverse Reinforcement Learning
• Inverse Reinforcement Learning from Failure
• Results
• Next Steps
Rapidly Exploring Random Trees
Figure source: Wikipedia
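For context, a minimal 2-D RRT sketch of the standard algorithm (not TERESA's planner); obstacle checking is omitted and all parameters are illustrative:

```python
import numpy as np

def rrt(start, goal, n_iters=2000, step=0.05, goal_tol=0.1, seed=0):
    # Grow a tree from `start` by repeatedly steering the nearest node
    # toward a random sample in the unit square.
    rng = np.random.default_rng(seed)
    goal = np.asarray(goal, float)
    nodes, parents = [np.asarray(start, float)], [0]
    for _ in range(n_iters):
        sample = rng.random(2)                       # random point in [0, 1]^2
        i = min(range(len(nodes)),                   # nearest existing node
                key=lambda j: np.linalg.norm(nodes[j] - sample))
        d = np.linalg.norm(sample - nodes[i])
        new = nodes[i] + min(step, d) / max(d, 1e-9) * (sample - nodes[i])
        nodes.append(new)
        parents.append(i)
        if np.linalg.norm(new - goal) < goal_tol:    # close enough: extract path
            path, j = [new], len(nodes) - 1
            while j != 0:
                j = parents[j]
                path.append(nodes[j])
            return path[::-1]
    return None

path = rrt([0.1, 0.1], [0.9, 0.9])
```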
Next Version of TERESA
Extra Slides
Wizard of Oz Methodology

[Figure: experimental setup with the human controller, an interaction target, human obstacles, and a hidden Wizard-of-Oz operator]
Deep IRLF
Figure source: Wulfmeier et al.
Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Maximum Entropy Deep Inverse Reinforcement Learning. arXiv:1507.04888, 2015.