Reinforcement Learning and the Reward Engineering Principle
Daniel Dewey
[email protected]; AAAI Spring Symposium Series 2014
A modest aim:
What role do goals play in AI research?
…through the lens of reinforcement learning.
Reinforcement learning and AI
Definitions: “control”, “dominance”
The reward engineering principle
Conclusions
RL and AI

Stuart Russell, “Rationality and Intelligence”:
“…one can define AI as the problem of designing systems that do the right thing. Now we just need a definition for ‘right.’”

Reinforcement learning provides a definition: maximize total rewards.
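The “maximize total rewards” definition can be made concrete with a toy sketch of the RL loop. Everything here (the environment, the policies, the function names) is an illustrative assumption, not from the talk:

```python
# Toy sketch of the RL frame: the agent emits actions, the environment
# answers with a new state and a reward, and success is measured purely
# by the total reward collected.

def run_episode(policy, step_env, initial_state, horizon=8):
    """Run one episode and return the total reward the agent receives."""
    state, total = initial_state, 0.0
    for _ in range(horizon):
        action = policy(state)
        state, reward = step_env(state, action)
        total += reward
    return total

def step_env(state, action):
    # Reward 1 when the action matches the parity of the state, else 0.
    return state + 1, 1.0 if action == state % 2 else 0.0

good_policy = lambda s: s % 2      # always matches: reward every step
bad_policy = lambda s: 1 - s % 2   # never matches: no reward

print(run_episode(good_policy, step_env, 0))  # 8.0
print(run_episode(bad_policy, step_env, 0))   # 0.0
```

Nothing in the loop mentions what the user actually wants; the agent only ever sees states and rewards, which is exactly the frame the talk examines.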
RL and AI
[Diagram: the RL frame. The agent sends actions to the environment; the environment returns states and rewards to the agent.]
RL and AI
Understand and exploit: inference, planning, learning, metareasoning, concept formation, etc.
RL and AI
Advantages:
• Simple and cheap
• Flexible and abstract
• Measurable
• “Worse is better”
• …and used in natural neural nets (brains!)
RL and AI
Outside the frame: some behaviours cannot be elicited (by any rewards!)

As RL AI becomes more general and autonomous, it becomes harder to get good results with RL.
Key concepts: Control and dominance
Reinforcement learning and AI
Definitions: “control”, “dominance”
The reward engineering principle
Conclusions
Definitions: “control”
A user has control when the agent’s received rewards equal the user’s chosen rewards.
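As a minimal sketch (the function name is a hypothetical choice, not from the talk), the control condition is just an equality check between the reward stream the user chose and the stream the agent actually received:

```python
def user_has_control(chosen_rewards, received_rewards):
    """Control holds when every reward the agent received is exactly
    the reward the user chose for that step."""
    return list(chosen_rewards) == list(received_rewards)

print(user_has_control([1, 0, 1], [1, 0, 1]))  # True: control held
print(user_has_control([1, 0, 1], [1, 1, 1]))  # False: the environment
                                               # overrode the second reward
```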
Definitions: “control”
[Diagram: the standard agent–environment loop of actions, states, and rewards.]
Definitions: “control”
[Diagram: the agent’s actions, states, and rewards pass through Environment 1, the user, and Environment 2; the user sits inside the reward channel.]
Definitions: “control”
[Diagram: the user chooses the reward that the agent receives.]
Definitions: “control”
[Diagram: the environment “chooses” the reward, bypassing the user.]
Definitions: “dominance”
Why does control matter?
Loss of control can create situations where no possible sequence of rewards can elicit the desired behaviour.
These behaviours are dominated by other behaviours.
Definitions: “dominance”
A “behaviour” (sequence of actions) is a policy.

      a1  a2  a3  a4  a5  a6  a7  a8
P1:    1   ?   0   ?   ?   ?   0   ?
Definitions: “dominance”
User-chosen rewards:

P1:    1   ?   0   ?   ?   ?   0   ?
Definitions: “dominance”
Environment-chosen rewards (loss of control):

P1:    1   ?   0   ?   ?   ?   0   ?
Definitions: “dominance”
P1:    1   ?   0   ?   ?   ?   0   ?
P2:    1   0   ?   1   ?   ?   1   1
Can rewards make either better?
Definitions: “dominance”
P1:    1   1   0   1   1   1   0   1
P2:    1   0   0   1   0   0   1   1

Choose P1’s unknown rewards to be 1: max. total for P1 = 6.
Choose P2’s unknown rewards to be 0: min. total for P2 = 4.
Definitions: “dominance”
P1:    1   0   0   0   0   0   0   0
P2:    1   0   1   1   1   1   1   1

Choose P1’s unknown rewards to be 0: min. total for P1 = 1.
Choose P2’s unknown rewards to be 1: max. total for P2 = 7.

So rewards can make either policy come out better: neither dominates the other.
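The P1/P2 example can be reproduced in a few lines. In this sketch (my encoding, not the talk’s), known rewards are numbers and the user-choosable “?” slots are None; filling a policy’s unknowns with 1s gives its best case, with 0s its worst case:

```python
def max_total(rewards):
    # Best case: every user-choosable slot (None) is set to 1.
    return sum(1 if r is None else r for r in rewards)

def min_total(rewards):
    # Worst case: every user-choosable slot (None) is set to 0.
    return sum(0 if r is None else r for r in rewards)

P1 = [1, None, 0, None, None, None, 0, None]
P2 = [1, 0, None, 1, None, None, 1, 1]

print(max_total(P1), min_total(P2))  # 6 4: rewards can favour P1...
print(min_total(P1), max_total(P2))  # 1 7: ...or favour P2 instead
```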
Definitions: “dominance”
P1:    1   ?   0   ?   ?   ?   0   ?
P3:    1   1   1   1   1   ?   1   1
Definitions: “dominance”
P1:    1   1   0   1   1   1   0   1   (max. total for P1 = 6)
P3:    1   1   1   1   1   0   1   1   (min. total for P3 = 7)
Definitions: “dominance”
P1:    1   ?   0   ?   ?   ?   0   ?   (dominated by P3)
P3:    1   1   1   1   1   ?   1   1   (dominates P1)
Definitions: “dominance”
A dominates B if no possible assignment of rewards causes R(B) > R(A).

No series of rewards can prompt a dominated policy; such policies are unelicitable. (A less obvious result: every unelicitable policy is dominated.)
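Under the simplifying assumption that each policy’s unknown rewards can be chosen independently (as in the toy examples above), the dominance check reduces to comparing best and worst cases. This sketch uses hypothetical names and the slides’ P1/P3 numbers:

```python
def max_total(rewards):
    # Best case: each unknown slot (None) set to 1.
    return sum(1 if r is None else r for r in rewards)

def min_total(rewards):
    # Worst case: each unknown slot (None) set to 0.
    return sum(0 if r is None else r for r in rewards)

def dominates(a, b):
    """A dominates B when even B's best case cannot beat A's worst
    case, so no assignment of rewards causes R(B) > R(A)."""
    return max_total(b) <= min_total(a)

P1 = [1, None, 0, None, None, None, 0, None]  # max. total 6
P3 = [1, 1, 1, 1, 1, None, 1, 1]              # min. total 7

print(dominates(P3, P1))  # True: P1 is unelicitable against P3
print(dominates(P1, P3))  # False
```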
Recap
Control is sometimes lost;
loss of control enables dominance;
dominance makes some policies unelicitable.

All of this is outside the “RL AI frame”
…but is clearly part of the AI problem (do the right thing!)
Additional factors

Generality: the range of policies an agent has reasonably efficient access to = better chance of finding dominant policies.

Autonomy: ability to function in environments with little interaction from users = more frequent loss of control.
Reinforcement learning and AI
Definitions: “control”, “dominance”
The reward engineering principle
Conclusions
Reward Engineering Principle
As RL AI becomes more general and autonomous, it becomes both more difficult and more important to constrain the environment to avoid loss of control…

…because general / autonomous RL AI has:
• better chance of dominant policies;
• more unelicitable policies;
• more significant effects.
Reinforcement learning and AI
Definitions: “control”, “dominance”
The reward engineering principle
Conclusions
RL AI users: heed the Reward Engineering Principle.
• Consider the existence of dominant policies
• Be as rigorous as possible in excluding them
• Remember what’s outside the frame!
AI researchers: expand the frame! Make goal design a first-class citizen.
• Consider alternatives: manually coded utility functions, preference learning, …?
• Watch out for dominance relations (e.g. in “dual” motivation systems, between intrinsic and extrinsic rewards)
Thank you!
Work supported by the Alexander Tamas Research Fellowship.
Thanks to Toby Ord, Seán Ó hÉigeartaigh, and two anonymous judges for comments.