Reinforcement Learning and the Reward Engineering Principle
Daniel Dewey
[email protected]; AAAI Spring Symposium Series 2014
A modest aim:
What role do goals play in AI research?
…through the lens of reinforcement learning.
Reinforcement learning and AI
Definitions: “control”, “dominance”
The reward engineering principle
Conclusions
RL and AI

Stuart Russell, “Rationality and Intelligence”:
“…one can define AI as the problem of designing systems that do the right thing. Now we just need a definition for ‘right.’”

Reinforcement learning provides a definition: maximize total rewards.
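The “maximize total rewards” definition can be made concrete with a toy sketch of the RL loop. Everything here (the environment, the policies, the function names) is an illustrative assumption, not from the talk:

```python
# Toy sketch of the RL frame: the agent emits actions, the environment
# answers with a new state and a reward, and success is measured purely
# by the total reward collected.

def run_episode(policy, step_env, initial_state, horizon=8):
    """Run one episode and return the total reward the agent receives."""
    state, total = initial_state, 0.0
    for _ in range(horizon):
        action = policy(state)
        state, reward = step_env(state, action)
        total += reward
    return total

def step_env(state, action):
    # Reward 1 when the action matches the parity of the state, else 0.
    return state + 1, 1.0 if action == state % 2 else 0.0

good_policy = lambda s: s % 2      # always matches: reward every step
bad_policy = lambda s: 1 - s % 2   # never matches: no reward

print(run_episode(good_policy, step_env, 0))  # 8.0
print(run_episode(bad_policy, step_env, 0))   # 0.0
```

Nothing in the loop mentions what the user actually wants; the agent only ever sees states and rewards, which is exactly the frame the talk examines.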
RL and AI
[Diagram: the RL frame. The agent sends actions to the environment; the environment returns states and rewards to the agent.]
RL and AI
Understand and exploit: inference, planning, learning, metareasoning, concept formation, etc.
RL and AI
Advantages:
• Simple and cheap
• Flexible and abstract
• Measurable
• “Worse is better”
• …and used in natural neural nets (brains!)
RL and AI
Outside the frame: some behaviours cannot be elicited (by any rewards!)

As RL AI becomes more general and autonomous, it becomes harder to get good results with RL.
Key concepts: Control and dominance
Reinforcement learning and AI
Definitions: “control”, “dominance”
The reward engineering principle
Conclusions
Definitions: “control”
A user has control when the agent’s received rewards equal the user’s chosen rewards.
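As a minimal sketch (the function name is a hypothetical choice, not from the talk), the control condition is just an equality check between the reward stream the user chose and the stream the agent actually received:

```python
def user_has_control(chosen_rewards, received_rewards):
    """Control holds when every reward the agent received is exactly
    the reward the user chose for that step."""
    return list(chosen_rewards) == list(received_rewards)

print(user_has_control([1, 0, 1], [1, 0, 1]))  # True: control held
print(user_has_control([1, 0, 1], [1, 1, 1]))  # False: the environment
                                               # overrode the second reward
```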
Definitions: “control”
[Diagram: the standard agent–environment loop of actions, states, and rewards.]
Definitions: “control”
[Diagram: the agent’s actions, states, and rewards pass through Environment 1, the user, and Environment 2; the user sits inside the reward channel.]
Definitions: “control”
[Diagram: the user chooses the reward that the agent receives.]
Definitions: “control”
[Diagram: the environment “chooses” the reward, bypassing the user.]
Definitions: “dominance”
Why does control matter?
Loss of control can create situations where no possible sequence of rewards can elicit the desired behaviour.
These behaviours are dominated by other behaviours.
Definitions: “dominance”
A “behaviour” (sequence of actions) is a policy.

      a1  a2  a3  a4  a5  a6  a7  a8
P1:    1   ?   0   ?   ?   ?   0   ?
Definitions: “dominance”
User-chosen rewards:

P1:    1   ?   0   ?   ?   ?   0   ?
Definitions: “dominance”
Environment-chosen rewards (loss of control):

P1:    1   ?   0   ?   ?   ?   0   ?
Definitions: “dominance”
P1:    1   ?   0   ?   ?   ?   0   ?
P2:    1   0   ?   1   ?   ?   1   1
Can rewards make either better?
Definitions: “dominance”
P1:    1   1   0   1   1   1   0   1
P2:    1   0   0   1   0   0   1   1

Choose P1’s unknown rewards to be 1: max. total for P1 = 6.
Choose P2’s unknown rewards to be 0: min. total for P2 = 4.
Definitions: “dominance”
P1:    1   0   0   0   0   0   0   0
P2:    1   0   1   1   1   1   1   1

Choose P1’s unknown rewards to be 0: min. total for P1 = 1.
Choose P2’s unknown rewards to be 1: max. total for P2 = 7.

So rewards can make either policy come out better: neither dominates the other.
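The P1/P2 example can be reproduced in a few lines. In this sketch (my encoding, not the talk’s), known rewards are numbers and the user-choosable “?” slots are None; filling a policy’s unknowns with 1s gives its best case, with 0s its worst case:

```python
def max_total(rewards):
    # Best case: every user-choosable slot (None) is set to 1.
    return sum(1 if r is None else r for r in rewards)

def min_total(rewards):
    # Worst case: every user-choosable slot (None) is set to 0.
    return sum(0 if r is None else r for r in rewards)

P1 = [1, None, 0, None, None, None, 0, None]
P2 = [1, 0, None, 1, None, None, 1, 1]

print(max_total(P1), min_total(P2))  # 6 4: rewards can favour P1...
print(min_total(P1), max_total(P2))  # 1 7: ...or favour P2 instead
```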
Definitions: “dominance”
P1:    1   ?   0   ?   ?   ?   0   ?
P3:    1   1   1   1   1   ?   1   1
Definitions: “dominance”
P1:    1   1   0   1   1   1   0   1   (max. total for P1 = 6)
P3:    1   1   1   1   1   0   1   1   (min. total for P3 = 7)
Definitions: “dominance”
P1:    1   ?   0   ?   ?   ?   0   ?   (dominated by P3)
P3:    1   1   1   1   1   ?   1   1   (dominates P1)
Definitions: “dominance”
A dominates B if no possible assignment of rewards causes R(B) > R(A).

No series of rewards can prompt a dominated policy; such policies are unelicitable. (A less obvious result: every unelicitable policy is dominated.)
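Under the simplifying assumption that each policy’s unknown rewards can be chosen independently (as in the toy examples above), the dominance check reduces to comparing best and worst cases. This sketch uses hypothetical names and the slides’ P1/P3 numbers:

```python
def max_total(rewards):
    # Best case: each unknown slot (None) set to 1.
    return sum(1 if r is None else r for r in rewards)

def min_total(rewards):
    # Worst case: each unknown slot (None) set to 0.
    return sum(0 if r is None else r for r in rewards)

def dominates(a, b):
    """A dominates B when even B's best case cannot beat A's worst
    case, so no assignment of rewards causes R(B) > R(A)."""
    return max_total(b) <= min_total(a)

P1 = [1, None, 0, None, None, None, 0, None]  # max. total 6
P3 = [1, 1, 1, 1, 1, None, 1, 1]              # min. total 7

print(dominates(P3, P1))  # True: P1 is unelicitable against P3
print(dominates(P1, P3))  # False
```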
Recap
Control is sometimes lost;
loss of control enables dominance;
dominance makes some policies unelicitable.

All of this is outside the “RL AI frame”
…but is clearly part of the AI problem (do the right thing!)
Additional factors

Generality: the range of policies an agent has reasonably efficient access to = better chance of finding dominant policies.

Autonomy: ability to function in environments with little interaction from users = more frequent loss of control.
Reinforcement learning and AI
Definitions: “control”, “dominance”
The reward engineering principle
Conclusions
Reward Engineering Principle
As RL AI becomes more general and autonomous, it becomes both more difficult and more important to constrain the environment to avoid loss of control…

…because general / autonomous RL AI has:
• better chance of dominant policies;
• more unelicitable policies;
• more significant effects.
Reinforcement learning and AI
Definitions: “control”, “dominance”
The reward engineering principle
Conclusions
RL AI users: heed the Reward Engineering Principle.
• Consider the existence of dominant policies
• Be as rigorous as possible in excluding them
• Remember what’s outside the frame!
AI researchers: expand the frame! Make goal design a first-class citizen.
• Consider alternatives: manually coded utility functions, preference learning, …?
• Watch out for dominance relations (e.g. in “dual” motivation systems, between intrinsic and extrinsic rewards)
Thank you!
Work supported by the Alexander Tamas Research Fellowship.
Thanks to Toby Ord, Seán Ó hÉigeartaigh, and two anonymous judges for comments.