Reinforcement Learning’s Computational Theory of Mind

Rich Sutton

with thanks to: Andy Barto, Satinder Singh, Doina Precup
Outline

• Computational Theory of Mind
• Reinforcement Learning
  – Some vivid examples
• Intuition of RL’s Computational Theory of Mind
  – Reward, policy, value (prediction of reward)
• Some of the Math
  – Policy iteration (values & policies build on each other)
  – Discounted reward
  – TD error
• Speculative extensions
  – Reason
  – Knowledge
Marr’s Three Levels
(at which any information-processing system can be understood)

• Computational Theory Level (What and why?)
  – What are the goals of the computation?
  – What is being computed?
  – Why are these the right things to compute?
  – What overall strategy is followed?
• Representation and Algorithm Level (How?)
  – How are these things computed?
  – What representation and algorithms are used?
• Hardware Implementation Level (Really how?)
  – How is this implemented physically?
Cash Register

• Computational Theory (What and why?)
  – Adding numbers
  – Making change
  – Computing tax
  – Controlling access to the cash drawer
• Representations and Algorithms
  – Are numbers stored in decimal, binary, or BCD?
  – Is multiplication done by repeated adding?
• Hardware Implementation
  – Silicon or gears?
  – Motors or springs?
Word Processor

• Computational Theory (What and why?)
  – The whole “application” level
  – To display the document as it will appear on paper
  – To make it easy to change, enhance
• Representations and Algorithms
  – How is the document stored?
  – What algorithms are used to maintain its display?
  – The underlying C code
• Hardware Implementation
  – How does the display work?
  – How does the silicon implement the computer?
Flight

• Computational Theory (What and why?)
  – Aerodynamics
  – Lift, propulsion, airfoil shape
• Representations and Algorithms
  – Fixed wings or flapping?
• Hardware Implementation
  – Steel, wood, or feathers?
Importance of Computational Theory

• The levels are loosely coupled
• Each has an internal logic, a coherence of its own
• But computational theory drives all lower levels
• We have so often gone wrong by mixing CT with the other levels
  – Imagined neuro-physiological constraints
  – The meaning of connectionism, neural networks
• Constraints within the CT level have more force
• We have little computational theory for AI
  – No clear problem definition
  – Many methods for knowledge representation, but no theory of knowledge
Outline

• Computational Theory of Mind
• Reinforcement Learning
  – Some vivid examples
• Intuition of RL’s Computational Theory of Mind
  – Reward, policy, value (prediction of reward)
• Some of the Math
  – Policy iteration (values & policies build on each other)
  – Discounted reward
  – TD error
• Speculative extensions
  – Reason
  – Knowledge
Reinforcement learning:
Learning from interaction to achieve a goal

[Diagram: the Agent sends actions to the Environment; the Environment returns state and reward]

• complete agent
• temporally situated
• continual learning & planning
• object is to affect the environment
• environment is stochastic & uncertain
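A minimal sketch of this interaction loop in Python; the `Agent`/`Environment` interface and the toy dynamics are invented for illustration, not anything from the talk:

```python
import random

# Illustrative agent-environment loop: the agent emits actions, the
# environment returns a new state and a scalar reward, continually.

class Environment:
    """A toy stochastic environment (invented for illustration)."""
    def reset(self):
        return 0

    def step(self, action):
        next_state = random.randint(0, 9)      # stochastic, uncertain
        reward = 1.0 if action == next_state % 2 else 0.0
        return next_state, reward

class Agent:
    def act(self, state):
        return state % 2                       # a fixed policy, for now

    def learn(self, state, action, reward, next_state):
        pass                                   # e.g., the TD update shown later

env, agent = Environment(), Agent()
state = env.reset()
for t in range(1000):                          # continual, temporally situated
    action = agent.act(state)
    next_state, reward = env.step(action)
    agent.learn(state, action, reward, next_state)
    state = next_state
```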
Strands of the History of RL

[Diagram: three converging strands]
• Trial-and-error learning: Thorndike (1911), Minsky, Klopf, Barto et al., Holland
• Temporal-difference learning: secondary reinforcement, Samuel, Witten, Sutton
• Optimal control, value functions: Hamilton (physics, 1800s), Shannon, Bellman/Howard (OR), Werbos, Watkins
Hiroshi Kimura’s RL Robots

[Videos: Before | After | Backward | New robot, same algorithm]
The Acrobot Problem
(e.g., DeJong & Spong, 1994; Sutton, 1995)

A minimum-time-to-goal problem:
• 4 state variables: 2 joint angles, 2 angular velocities
• Tile coding with 48 layers
• Goal: raise the tip above the line
• Reward = -1 per time step

[Figure: two-link acrobot with a fixed base, torque applied at the joint, and the tip to be raised above the line]
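Since the slide mentions tile coding with 48 layers, here is a minimal sketch of that representation. The 48 tilings follow the slide; the grid size, the uniform offsets, and the function name are assumptions for illustration:

```python
import numpy as np

def tile_indices(state, low, high, n_tilings=48, tiles_per_dim=6):
    """Return the one active tile index per tiling for a continuous state.

    Each tiling lays a coarse grid over the normalized state space; the
    tilings are offset from one another so their cells overlap. 48 tilings
    follows the slide's "48 layers"; 6 tiles per dimension and the uniform
    offsets are assumptions made for this sketch.
    """
    state = (np.asarray(state, dtype=float) - low) / (np.asarray(high, dtype=float) - low)
    n_dims = state.size
    tiles_per_tiling = tiles_per_dim ** n_dims
    indices = []
    for tiling in range(n_tilings):
        offset = tiling / (n_tilings * tiles_per_dim)   # displace this tiling
        coords = np.clip(((state + offset) * tiles_per_dim).astype(int),
                         0, tiles_per_dim - 1)
        flat = 0
        for c in coords:                                 # grid coords -> index
            flat = flat * tiles_per_dim + int(c)
        indices.append(tiling * tiles_per_tiling + flat)
    return indices

# The value estimate is then a sum of learned weights over the active tiles:
#   V(s) = w[tile_indices(s, low, high)].sum()
```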
Examples of Reinforcement Learning

• Robocup soccer teams (Stone & Veloso; Riedmiller et al.)
  – World’s best player of simulated soccer, 1999; runner-up 2000
• Inventory management (Van Roy, Bertsekas, Lee & Tsitsiklis)
  – 10-15% improvement over industry-standard methods
• Dynamic channel assignment (Singh & Bertsekas; Nie & Haykin)
  – World’s best assigner of radio channels to mobile telephone calls
• Elevator control (Crites & Barto)
  – (Probably) world’s best down-peak elevator controller
• Many robots
  – navigation, bipedal walking, grasping, switching between skills...
• TD-Gammon and Jellyfish (Tesauro; Dahl)
  – World’s best backgammon player
New Applications of RL

• CMUnited Robocup soccer team (Stone & Veloso)
  – World’s best player of Robocup simulated soccer, 1998
• KnightCap and TDleaf (Baxter, Tridgell & Weaver)
  – Improved chess play from intermediate to master in 300 games
• Inventory management (Van Roy, Bertsekas, Lee & Tsitsiklis)
  – 10-15% improvement over industry-standard methods
• Walking robot (Benbrahim & Franklin)
  – Learned critical parameters for bipedal walking
All are real-world applications using on-line learning.

[Video: backprop-trained network]
Backgammon

SITUATIONS: configurations of the playing board (about 10^20)
ACTIONS: moves
REWARDS: win: +1; lose: -1; else: 0

Pure delayed reward.
TD-Gammon (Tesauro, 1992-1995)

[Figure: multilayer network mapping the board position to a Value, trained by the TD error V_{t+1} − V_t; action selection by 2-3 ply search]

• Start with a random network
• Play millions of games against itself
• Learn a value function from this simulated experience
• This produces arguably the best player in the world
Outline

• Computational Theory of Mind
• Reinforcement Learning
  – Some vivid examples
• Intuition of RL’s Computational Theory of Mind
  – Reward, policy, value (prediction of reward)
• Some of the Math
  – Policy iteration (values & policies build on each other)
  – Discounted reward
  – TD error
• More speculative extensions
  – Reason
  – Knowledge
The Reward Hypothesis

The mind’s goal is to maximize the cumulative sum of a received scalar signal (reward).

Reading the hypothesis closely:
• “is”: that is, can be conceived of, understood as
• “received”: the reward must come from outside, not under the mind’s direct control
• “scalar”: a simple, single number (not a vector or symbol structure)
The Reward Hypothesis

The mind’s goal is to maximize the cumulative sum of a received scalar signal (reward).

• Obvious? Brilliant? Demeaning? Inevitable? Trivial?
• Simple, but not trivial
• May be adequate, may be completely satisfactory
• A good null hypothesis
Policies

• A policy maps each state to an action to take
  – Like a stimulus-response rule
• We seek a policy that maximizes cumulative reward
• The policy is a subgoal to achieving reward

[Diagram: Policy built on Reward]
Value Functions

• Value functions = predictions of expected reward following states
  Value: States → Expected future reward
• Moment-by-moment estimates of how well it’s going
• All efficient methods for finding optimal policies first estimate value functions
  – RL methods, state-space planning methods, dynamic programming
• Recognizing and reacting to the ups and downs of life is an important part of intelligence

[Diagram: Policy built on Value Function, built on Reward]
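The outline’s “values & policies build on each other” is exactly policy iteration. A minimal tabular sketch; the MDP encoding `P[s][a] = [(prob, next_state, reward), ...]` is an assumed format for illustration, not anything from the talk:

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.95, theta=1e-8):
    """Alternate policy evaluation and greedy improvement until stable."""
    V = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: make V consistent with the current policy.
        while True:
            delta = 0.0
            for s in range(n_states):
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to V.
        stable = True
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```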
The Mountain Car Problem (Moore, 1990)

A minimum-time-to-goal problem: gravity wins against the engine, so the car cannot simply drive up to the goal.

SITUATIONS: car’s position and velocity
ACTIONS: three thrusts: forward, reverse, none
REWARDS: always -1 until the car reaches the goal
No discounting.

[Figure: car in a valley, with the goal at the top of the right hill]
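A sketch of the mountain-car dynamics, using the standard constants from the literature (an assumption about this slide’s exact setup):

```python
import math

# Minimal mountain-car dynamics: the thrust (0.001) is weaker than
# gravity's pull along the slope (0.0025), so the car must rock back
# and forth to build momentum.

def step(position, velocity, action):
    """action in {-1, 0, +1}: reverse, none, forward thrust."""
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))
    position += velocity
    if position < -1.2:                # left wall: stop the car
        position, velocity = -1.2, 0.0
    reward = -1.0                      # -1 per step until the goal
    done = position >= 0.5             # goal at the top of the hill
    return position, velocity, reward, done
```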
Outline

• Computational Theory of Mind
• Reinforcement Learning
  – Some vivid examples
• Intuition of RL’s Computational Theory of Mind
  – Reward, policy, value (prediction of reward)
• Some of the Math
  – Policy iteration (values & policies build on each other)
  – Discounted reward
  – TD error
• Speculative extensions
  – Reason
  – Knowledge
Notation

• Reward: r : States → Pr(ℜ), with r_t ∈ ℜ
• Policies: π : States → Pr(Actions); π* is optimal
• Value functions:

    V^π : States → ℜ

    V^π(s) = E_π{ Σ_{k=1}^∞ γ^(k−1) r_{t+k} | s_t = s }

  where the discount factor γ is ≈ 1 but < 1
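To make the definition concrete, here is the discounted sum computed for a sample reward sequence (the sequence itself is invented):

```python
# G = r_{t+1} + g*r_{t+2} + g^2*r_{t+3} + ...  computed by folding from the end.

def discounted_return(rewards, gamma=0.99):
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G            # G_k = r_k + gamma * G_{k+1}
    return G

print(discounted_return([-1, -1, -1, +10]))   # costs now, payoff later
```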
What should the prediction error be at time t?

    V_t = E{ Σ_{k=1}^∞ γ^(k−1) r_{t+k} }
        = E{ r_{t+1} + γ Σ_{k=1}^∞ γ^(k−1) r_{t+1+k} }
        ≈ r_{t+1} + γ V_{t+1}

    TD-error_t = r_{t+1} + γ V_{t+1} − V_t
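A minimal tabular TD(0) sketch built directly on this error; the episode format is an assumption for illustration:

```python
# Tabular TD(0): after every step, nudge V(s_t) toward the target
# r_{t+1} + gamma * V(s_{t+1}) by a step-size alpha times the TD error.

def td0(episodes, alpha=0.1, gamma=0.99):
    """episodes: iterable of [(s, r_next, s_next), ...] transition lists."""
    V = {}
    for episode in episodes:
        for s, r_next, s_next in episode:
            td_error = r_next + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
            V[s] = V.get(s, 0.0) + alpha * td_error
    return V
```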
Computation-Theoretical TD Errors

[Figure: three panels showing reward, value, and TD-error traces when the reward is unexpected, expected (following a cue), and absent]
Representation & Algorithm Level:
the many methods for approximating value

[Diagram: dynamic programming, temporal-difference learning, Monte Carlo, and exhaustive search arranged along two dimensions: full vs. sample backups, and shallow (bootstrapping) vs. deep backups]
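For contrast with the TD(0) sketch above, here is a Monte Carlo version of the same estimation: deep, sample backups with no bootstrapping, since it waits for the full return. The episode format is again an assumption:

```python
# Monte Carlo value estimation: wait until the episode ends, then back
# each visited state's estimate toward the full observed return.

def monte_carlo(episodes, alpha=0.1, gamma=0.99):
    """episodes: iterable of [(s, r_next), ...] transitions to termination."""
    V = {}
    for episode in episodes:
        G = 0.0
        for s, r_next in reversed(episode):   # accumulate the return backward
            G = r_next + gamma * G
            V[s] = V.get(s, 0.0) + alpha * (G - V.get(s, 0.0))
    return V
```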
Outline

• Computational Theory of Mind
• Reinforcement Learning
  – Some vivid examples
• Intuition of RL’s Computational Theory of Mind
  – Reward, policy, value (prediction of reward)
• Some of the Math
  – Policy iteration (values & policies build on each other)
  – Discounted reward
  – TD error
• Speculative extensions
  – Reason
  – Knowledge
Reason as RL over Imagined Experience

1. Learn a model of the world’s transition dynamics
   (transition probabilities, expected immediate rewards: a “1-step model” of the world)
2. Use the model to generate imaginary experience
   (internal thought trials, mental simulation; Craik, 1943)
3. Apply RL as if the experience had really happened

[Diagram: Policy built on Value Function, built on 1-Step Model and Reward]
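These three steps are essentially a Dyna-style architecture. A minimal sketch, assuming a small deterministic environment with `reset`/`step`/`actions` methods (an invented interface, not from the talk):

```python
import random

def dyna_q(env, n_steps=10000, n_imagined=10, alpha=0.1, gamma=0.95, eps=0.1):
    """Dyna-style planning: the same update is applied to real experience
    and to model-generated (imagined) experience. Deterministic model
    for brevity."""
    Q = {}           # Q[(state, action)] -> value
    model = {}       # model[(state, action)] -> (reward, next_state)

    def q(s, a):
        return Q.get((s, a), 0.0)

    def update(s, a, r, s2):                  # one RL update, real or imagined
        best = max((q(s2, b) for b in env.actions(s2)), default=0.0)
        Q[(s, a)] = q(s, a) + alpha * (r + gamma * best - q(s, a))

    s = env.reset()
    for _ in range(n_steps):
        acts = env.actions(s)
        a = random.choice(acts) if random.random() < eps \
            else max(acts, key=lambda b: q(s, b))
        s2, r = env.step(a)
        update(s, a, r, s2)                   # direct RL from real experience
        model[(s, a)] = (r, s2)               # 1. learn the 1-step model
        for _ in range(n_imagined):           # 2. generate imagined experience...
            si, ai = random.choice(list(model))
            ri, s2i = model[(si, ai)]
            update(si, ai, ri, s2i)           # 3. ...and apply RL as if it were real
        s = s2
    return Q
```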
Mind is About Predictions

Hypothesis: Knowledge is predictive
• About what-leads-to-what, under what ways of behaving
• What will I see if I go around the corner?
• Objects: What will I see if I turn this over?
• Active vision: What will I see if I look at my hand?
• Value functions: What is the most reward I know how to get?
• Such knowledge is learnable, chainable

Hypothesis: Mental activity is working with predictions
• Learning them
• Combining them to produce new predictions (reasoning)
• Converting them to action (planning, reinforcement learning)
• Figuring out which are most useful
An old, simple, appealing idea

• Mind as prediction engine!
• Predictions are learnable, combinable
• They represent cause and effect, and can be pieced together to yield plans
• Perhaps this old idea is essentially correct
• Just needs
  – Development, revitalization in modern forms
  – Greater precision, formalization, mathematics
  – The computational perspective to make it respectable
  – Imagination, determination, patience
• Not rushing to performance