Reinforcement Learning and Artificial Intelligence
hinton/csc321/notes/sutton.pdf · 2007-01-03

TRANSCRIPT

Page 1

Reinforcement Learning and Artificial Intelligence

PIs: Rich Sutton, Michael Bowling, Dale Schuurmans, Vadim Bulitko

plus: Lori Troop, Mark Lee

Page 2

Reinforcement learning is learning from interaction to achieve a goal

[Diagram: Agent-Environment loop; the Agent sends an action to the Environment, which returns the next state and a reward]

• complete agent
• temporally situated
• continual learning & planning
• object is to affect environment
• environment stochastic & uncertain

Page 3

[Diagram: Reinforcement Learning at the intersection of Artificial Intelligence, Psychology, Control Theory, and Neuroscience]

Page 4

Intro to RL, with a robot bias

• policy

• reward

• value

• TD error

• world model

• generalized policy iteration

Page 5

World, or world model

Page 6

Hajime Kimura’s RL Robots

[Videos: Before; After; Backward; New robot, same algorithm]

Page 7

Devilsticking

Finnegan Southey, University of Alberta

Stefan Schaal & Chris Atkeson, Univ. of Southern California

“Model-based Reinforcement Learning of Devilsticking”

Page 8

Page 9

Page 10

The RoboCup Soccer Competition

Page 11

Autonomous Learning of Efficient Gait
Kohl & Stone (UTexas), 2004

Page 12

Page 13

Page 14

Page 15

Policies

• A policy maps each state to an action to take

• Like a stimulus–response rule

• We seek a policy that maximizes cumulative reward

• The policy is a subgoal on the way to achieving reward
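Concretely, a policy can be as simple as a lookup table from states to actions, or a rule computed from learned action values. A minimal Python sketch, with toy states and actions that are purely hypothetical (not from the slides):

```python
import random

# Hypothetical toy example: four grid-world actions.
ACTIONS = ["up", "down", "left", "right"]

# A deterministic policy is just a mapping state -> action.
policy = {(0, 0): "right", (0, 1): "down"}

def epsilon_greedy(q, state, epsilon=0.1):
    """Follow the action values q greedily, but explore with probability epsilon.

    q is a dict mapping (state, action) pairs to estimated value.
    """
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))
```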

Page 16

The Reward Hypothesis

The goal of intelligence is to maximize the cumulative sum of a single received number:

“reward” = pleasure − pain

Artificial Intelligence = reward maximization

Page 17

Brain reward systems

[Figure: honeybee brain, showing the VUM neuron (Hammer & Menzel)]

What signal does this neuron carry?

Page 18

Value

Page 19

Value systems are hedonism with foresight

Value systems are a means to reward, yet we care more about values than rewards

All efficient methods for solving sequential decision problems determine (learn or compute) “value functions” as an intermediate step

We value situations according to how much reward we expect will follow them

Page 20

“Even enjoying yourself you call evil whenever it leads to the loss of a pleasure greater than its own, or lays up pains that outweigh its pleasures. ... Isn't it the same when we turn back to pain? To suffer pain you call good when it either rids us of greater pains than its own or leads to pleasures that outweigh them.”

–Plato, Protagoras

Pleasure ≠ good

Page 21

• Reward: $r : \text{States} \to \Pr(\Re)$, with $r_t \in \Re$

• Policies: $\pi : \text{States} \to \Pr(\text{Actions})$; $\pi^*$ is optimal

• Value functions: $V^\pi : \text{States} \to \Re$

$$V^\pi(s) = E_\pi\!\left[ \sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k} \,\middle|\, s_t = s \right]$$

where $\gamma$ is the discount factor, $\gamma \approx 1$ but $\gamma < 1$
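As a quick worked instance of the definition above, a sketch computing the discounted return $\sum_{k=1}^{n} \gamma^{k-1} r_{t+k}$ for an invented finite reward sequence:

```python
def discounted_return(rewards, gamma=0.99):
    """Finite-horizon version of sum_{k>=1} gamma^(k-1) * r_{t+k}."""
    return sum(gamma ** (k - 1) * r for k, r in enumerate(rewards, start=1))

# Invented example: three -1 rewards, then +10 on reaching a goal.
print(discounted_return([-1, -1, -1, 10]))  # about 6.73 with gamma = 0.99
```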

Page 22

Backgammon

STATES: configurations of the playing board ($\approx 10^{20}$)

ACTIONS: moves

REWARDS: win: +1

lose: –1

else: 0

a “big” game

Page 23

TD-Gammon
Tesauro, 1992-1995

[Diagram: neural network mapping a board position to a Value estimate; trained by TD error $V_{t+1} - V_t$; action selection by 2-3 ply search]

Start with a random network

Play millions of games against itself

Learn a value function from this simulated experience

Six weeks later it’s the best player of backgammon in the world
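TD-Gammon's network and features are not reproduced here, but the core recipe, learning state values by TD from self-generated play, fits in a few lines at toy scale. A hedged sketch on an invented 5-state random walk (win = +1 at the right end; everything here is a stand-in, not Tesauro's code):

```python
import random
from collections import defaultdict

# Toy stand-in for learning a value function from simulated experience.
# States 0..4; 0 and 4 are terminal; +1 reward for reaching state 4.
N, ALPHA, GAMMA = 5, 0.1, 1.0
V = defaultdict(float)  # terminal values stay at 0

for _ in range(20000):
    s = N // 2
    while 0 < s < N - 1:
        s2 = s + random.choice([-1, 1])             # "self-play": random moves
        r = 1.0 if s2 == N - 1 else 0.0
        V[s] += ALPHA * (r + GAMMA * V[s2] - V[s])  # TD(0) update
        s = s2

print([round(V[s], 2) for s in range(N)])  # roughly [0, 0.25, 0.5, 0.75, 0]
```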

Page 24

The Mountain Car Problem

Minimum-Time-to-Goal Problem

Moore, 1990

[Diagram: car in a valley with the Goal at the top of the right hill; label: “Gravity wins”]

SITUATIONS: car's position and velocity

ACTIONS: three thrusts: forward, reverse, none

REWARDS: always –1 until car reaches the goal

No Discounting
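For concreteness, here is one standard parameterization of these dynamics (the textbook version from Sutton & Barto; Moore's 1990 original may differ in details):

```python
import math

def mountain_car_step(x, v, action):
    """One step of the standard mountain-car dynamics.

    action: -1 = reverse thrust, 0 = none, +1 = forward thrust.
    Returns (x, v, reward, done); reward is -1 until the goal, per the slide.
    """
    v = min(max(v + 0.001 * action - 0.0025 * math.cos(3 * x), -0.07), 0.07)
    x += v
    if x < -1.2:           # left wall: stop the car
        x, v = -1.2, 0.0
    done = x >= 0.5        # goal at the top of the right hill
    return x, v, -1.0, done
```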

Page 25

Value functions learned while solving the Mountain Car problem

[Figure: learned value function over car position and velocity; Minimize Time-to-Goal; Value = estimated time to goal; Goal region marked]

Page 26

TD error

Page 27

• Reward: $r : \text{States} \to \Pr(\Re)$, with $r_t \in \Re$

• Policies: $\pi : \text{States} \to \Pr(\text{Actions})$; $\pi^*$ is optimal

• Value functions: $V^\pi : \text{States} \to \Re$

$$V^\pi(s) = E_\pi\!\left[ \sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k} \,\middle|\, s_t = s \right]$$

where $\gamma$ is the discount factor, $\gamma \approx 1$ but $\gamma < 1$

Page 28

$$V_t = E\!\left[ \sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k} \right] = E\!\left[ r_{t+1} + \gamma \sum_{k=1}^{\infty} \gamma^{k-1} r_{t+1+k} \right] \approx r_{t+1} + \gamma V_{t+1}$$

$$\text{TD error}_t = r_{t+1} + \gamma V_{t+1} - V_t$$
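The update this error drives, written out as a sketch (tabular values in a dict; the function name and example states are my own):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step: nudge V[s] toward the bootstrapped target r + gamma*V[s_next]."""
    delta = r + gamma * V[s_next] - V[s]  # the TD error
    V[s] += alpha * delta
    return delta

V = {"A": 0.0, "B": 1.0}
print(td0_update(V, "A", 0.5, "B"), V["A"])  # delta = 1.49, V["A"] becomes 0.149
```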

Page 29

Brain reward systems seem to signal TD error

[Figure: recordings of dopamine neurons (Wolfram Schultz et al.) compared against a TD error signal]

Page 30

World, or world model

Page 31

World models

Page 32

“Autonomous helicopter flight via Reinforcement Learning”

Ng (Stanford), Kim, Jordan, & Sastry (UC Berkeley) 2004

Page 33

Page 34

Page 35

Reason as RL over Imagined Experience

1. Learn a predictive model of the world’s dynamics

transition probabilities, expected immediate rewards

2. Use model to generate imaginary experiences

internal thought trials, mental simulation (Craik, 1943)

3. Apply RL as if experience had really happened

vicarious trial and error (Tolman, 1932)
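This three-step recipe is essentially the Dyna architecture; a minimal tabular Dyna-Q-style sketch, assuming a deterministic learned model and invented interfaces:

```python
import random
from collections import defaultdict

Q = defaultdict(float)   # action values, keyed by (state, action)
model = {}               # learned deterministic model: (s, a) -> (r, s')

def dyna_q_step(s, a, r, s2, actions, alpha=0.1, gamma=0.95, n_planning=10):
    """Learn from one real transition, then replay imagined ones from the model."""
    def q_update(ss, aa, rr, ss2):
        target = rr + gamma * max(Q[(ss2, b)] for b in actions)
        Q[(ss, aa)] += alpha * (target - Q[(ss, aa)])

    q_update(s, a, r, s2)          # 3. RL on real experience
    model[(s, a)] = (r, s2)        # 1. update the predictive model
    for _ in range(n_planning):    # 2. imagined experience from the model
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        q_update(ps, pa, pr, ps2)
```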

Page 36

GridWorld Example

Page 37

Intro to RL, with a robot bias

• policy

• reward

• value

• TD error

• world model

• generalized policy iteration

Page 38

Policy Iteration

$$\pi_0 \to V^{\pi_0} \to \pi_1 \to V^{\pi_1} \to \cdots \to \pi^* \to V^* \to \pi^*$$

(alternating policy evaluation and policy improvement, “greedification”)

Improvement is monotonic

Converges in a finite number of steps

Generalized Policy Iteration: Intermix the two steps more finely, state by state, action by action, sample by sample
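A compact sketch of the two alternating steps for a small, fully known MDP (the transition table P and reward table R are assumed inputs; all names hypothetical):

```python
def policy_iteration(states, actions, P, R, gamma=0.95, theta=1e-8):
    """P[s][a] is a list of (prob, next_state); R[s][a] is expected reward."""
    pi = {s: actions[0] for s in states}
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterate V toward V^pi.
        while True:
            delta = 0.0
            for s in states:
                v = R[s][pi[s]] + gamma * sum(p * V[s2] for p, s2 in P[s][pi[s]])
                delta, V[s] = max(delta, abs(v - V[s])), v
            if delta < theta:
                break
        # Policy improvement ("greedification"): act greedily w.r.t. V.
        stable = True
        for s in states:
            best = max(actions,
                       key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return pi, V  # monotone improvement; terminates for finite MDPs
```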

Page 39

Generalized Policy Iteration

[Diagram: policy $\pi$ and value function $V$ pulling on each other; policy evaluation drives $V$ toward $V^\pi$, greedification drives $\pi$ toward the greedy policy for $V$; the cycle converges to $\pi^*$ and $V^*$]

Page 40

Summary: RL’s Computational Theory of Mind

Reward

Value Function

Predictive Model

Policy

Page 41

Summary: RL’s Computational Theory of Mind

Reward

Value Function

Predictive Model

Policy

A learned, time-varying prediction of imminent reward

Page 42

Summary: RL’s Computational Theory of Mind

Reward

Value Function

Predictive Model

Policy

A learned, time-varying prediction of imminent reward
Key to all efficient methods for finding optimal policies

Page 43

Summary: RL’s Computational Theory of Mind

Reward

Value Function

Predictive Model

Policy

A learned, time-varying prediction of imminent reward
Key to all efficient methods for finding optimal policies

This has nothing to do with either biology or computers

Page 44

Summary: RL’s Computational Theory of Mind

Reward

Value Function

Predictive Model

Policy

It’s all created from the scalar reward signal

Page 45

It’s all created from the scalar reward signal

together with the causal structure of the world

Summary: RL’s Computational Theory of Mind

Reward

Value Function

Predictive Model

Policy


Page 47

Additions for next year

• Sarsa
• how it might actually be applied to something they might do

Page 48

World, or world model

Page 49

Current work

An individual’s knowledge consists of predictions about its future (low-level) sensations and actions

Knowledge is subjective

Everything else (objects, houses, people, love) must be constructed from sensation and action

Page 50

Subjective/Predictive Knowledge

• “John is in the coffee room”

• “My car is in the South parking lot”

• What we know about geography, navigation

• What we know about how an object looks, rotates

• What we know about how objects can be used

• Recognition strategies for objects and letters

• “The portrait of Washington on the dollar in the wallet in my other pants in the laundry has a mustache on it”

Page 51

RoboCup soccer keepaway
Stone, Sutton & Kuhlmann, 2005