Reinforcement Learning: How Far Can It Go?

Rich Sutton
University of Massachusetts
AT&T Research

With thanks to Doina Precup, Satinder Singh, Amy McGovern, B. Ravindran, Ron Parr
Reinforcement Learning

An active, popular, successful approach to AI
15-50 years old
Emphasizes learning from interaction
Does not assume complete knowledge of the world
World-class applications
Strong theoretical foundations
Parallels in other fields: operations research, control theory, psychology, neuroscience
Seeks simple general principles

How far can it go?
World-Class Applications of RL
TD-Gammon and Jellyfish (Tesauro; Dahl): world's best backgammon player
Elevator Control (Crites & Barto): (probably) world's best down-peak elevator controller
Job-Shop Scheduling (Zhang & Dietterich): world's best scheduler of space-shuttle payload processing
Dynamic Channel Assignment (Singh & Bertsekas; Nie & Haykin): world's best assigner of radio channels to mobile telephone calls
Outline

RL Past (c. 1950-1985): Trial and Error Learning
RL Present (c. 1985-2000): Learning and Planning Values
RL Future (c. 2000 on): Constructivism
RL began with dissatisfaction with previous learning problems

Such as:
Learning from examples
Unsupervised learning
Function optimization

None seemed to be purposive. Where is the learning of how to get something? Where is the learning by trial and error?

Need rewards and penalties, interaction with the world!
Rooms Example
Early learning methods could not learn how to get reward
The Reward Hypothesis

That purposes can be adequately represented as maximization of the cumulative sum of a scalar reward signal received from the environment

Is this reasonable? Is it demeaning? Is there no other choice?
It seems to be adequate.
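This cumulative sum is standardly written as the (possibly discounted) return; the discount factor gamma is an assumption here, not stated on the slide:

$$G_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma \le 1.$$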
RL Past: Trial and Error Learning

Learned only a policy (a mapping from states to actions)
Maximized only:
Short-term reward (e.g., learning automata)
Or delayed reward via simple action traces
Assumed good/bad rewards were immediately distinguishable:
E.g., positive is good, negative is bad
An implicitly known reinforcement baseline
Next steps were to learn baselines and internal rewards
Taking these next steps quickly led to modern value functions and temporal-difference learning
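As a minimal sketch of this past style of learning (not from the talk; the toy three-armed environment and all names are illustrative): a policy-only learner that reinforces action preferences against a learned average-reward baseline, with no value function over states.

```python
import math
import random

# Policy-only trial-and-error learning with a learned reward baseline.
ACTIONS = [0, 1, 2]
prefs = {a: 0.0 for a in ACTIONS}   # action preferences (the whole "policy")
baseline = 0.0
alpha, beta = 0.1, 0.05             # step sizes

def softmax_choice():
    exps = {a: math.exp(prefs[a]) for a in ACTIONS}
    z = sum(exps.values())
    r, cum = random.random(), 0.0
    for a in ACTIONS:
        cum += exps[a] / z
        if r <= cum:
            return a
    return ACTIONS[-1]

def reward_for(action):             # hypothetical noisy environment
    return random.gauss([0.0, 1.0, 0.5][action], 1.0)

for _ in range(1000):
    a = softmax_choice()
    r = reward_for(a)
    prefs[a] += alpha * (r - baseline)   # reinforce relative to the baseline
    baseline += beta * (r - baseline)    # learn the baseline itself
```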
Problems with Value-less RL Methods

[Figure: a policy in the Rooms example; movement is in the wrong direction 1/3 of the time]
Outline

RL Past (c. 1950-1985): Trial and Error Learning
RL Present (c. 1985-2000): Learning and Planning Values
RL Future (c. 2000 on): Constructivism
The Value-Function Hypothesis

Value functions = measures of expected reward following states:
V: States -> Expected future reward
or following state-action pairs:
Q: States x Actions -> Expected future reward

The hypothesis:
All efficient methods for optimal sequential decision making estimate value functions
And the dominant purpose of intelligence is to approximate these value functions

[Figure: a state-value function]
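In standard notation, with the return G_t (the discounted sum of future rewards, as above) and policy pi:

$$V^{\pi}(s) = \mathrm{E}_{\pi}\{\, G_t \mid s_t = s \,\}, \qquad Q^{\pi}(s,a) = \mathrm{E}_{\pi}\{\, G_t \mid s_t = s,\, a_t = a \,\}.$$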
RL Present: Learning and Planning Values

Accepts reward and value hypotheses
Many real-world applications, some impressive
Theory strong and active, yet still with more questions than answers
Strong links to Operations Research
A part of modern AI's interest in uncertainty: MDPs, POMDPs, Bayes nets, connectionism
Includes deliberative planning
New Applications of RL

CMUnited Robocup Soccer Team (Stone & Veloso): world's best player of Robocup simulated soccer, 1998
KnightCap and TDleaf (Baxter, Tridgell & Weaver): improved chess play from intermediate to master in 300 games
Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis): 10-15% improvement over industry standard methods
Walking Robot (Benbrahim & Franklin): learned critical parameters for bipedal walking
Real-world applications using on-line learning and backprop
RL Present, Part II: The Space of Methods

[Figure: the space of methods, organized along two dimensions. Depth of backups: shallow (bootstrapping) backups, as in temporal-difference learning and dynamic programming, versus deep backups, as in Monte Carlo and exhaustive search. Width of backups: sample backups (temporal-difference learning, Monte Carlo) versus full backups (dynamic programming, exhaustive search).]

Also: function approximation, explore/exploit, planning/learning, action/state values, actor-critic, ...
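The two dimensions correspond to different update targets for the value estimates. In standard notation (with step size alpha and model probabilities; a sketch, not from the slides):

$$\begin{aligned}
\text{TD(0) (sample, shallow):}\quad & V(s_t) \leftarrow V(s_t) + \alpha\,[\,r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\,] \\
\text{Monte Carlo (sample, deep):}\quad & V(s_t) \leftarrow V(s_t) + \alpha\,[\,G_t - V(s_t)\,] \\
\text{DP (full, shallow):}\quad & V(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} p(s' \mid s,a)\,[\,r(s,a,s') + \gamma V(s')\,]
\end{aligned}$$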
The TD Hypothesis

That all value learning is driven by TD errors

Even "Monte Carlo" methods can benefit: TD methods enable them to be done incrementally
Even planning can benefit:
Trajectory following improves function approximation and state sampling
Sample backups reduce effect of branching factor
Psychological support: TD models of reinforcement, classical conditioning
Physiological support: reward neurons show TD behavior (Schultz et al.)
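A minimal tabular TD(0) learner, as a sketch (the environment interface here, env.reset() and env.step(a) returning (next_state, reward, done), is an assumption, not from the talk):

```python
from collections import defaultdict

def td0_evaluate(env, policy, episodes=1000, alpha=0.1, gamma=0.95):
    """Tabular TD(0) policy evaluation: all value learning driven by TD errors."""
    V = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])   # the TD error drives the update
            s = s_next
    return V
```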
Planning

Modern RL includes planning
As in planning for MDPs: a form of state-space planning (still controversial for some)
Planning and learning are near identical in RL:
The same algorithms on real or imagined experience
Same value functions, backups, function approximation

[Figure: Planning with Imagined Experience. Acting in the world produces real experience; model learning turns real experience into a model; the model generates imagined experience; both real experience (direct RL) and imagined experience (planning) feed the same RL algorithm to update the value function and policy.]
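As a concrete sketch of "the same algorithms on real or imagined experience", here is a minimal Dyna-Q loop (not from the talk; the environment interface and the deterministic one-step model are assumptions):

```python
from collections import defaultdict
import random

def dyna_q(env, actions, episodes=200, planning_steps=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Dyna-Q: learn from real experience, then plan with imagined experience."""
    Q = defaultdict(float)        # Q[(state, action)]
    model = {}                    # learned model: (s, a) -> (r, s_next, done)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Act: epsilon-greedy on current values.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)

            # Direct RL: Q-learning backup on real experience.
            best = 0.0 if done else max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

            # Model learning: remember the transition.
            model[(s, a)] = (r, s_next, done)

            # Planning: the same backup applied to imagined experience.
            for _ in range(planning_steps):
                (ps, pa), (pr, ps_next, pdone) = random.choice(list(model.items()))
                pbest = 0.0 if pdone else max(Q[(ps_next, a_)] for a_ in actions)
                Q[(ps, pa)] += alpha * (pr + gamma * pbest - Q[(ps, pa)])
            s = s_next
    return Q
```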
Outline

RL Past (c. 1950-1985): Trial and Error Learning
RL Present (c. 1985-2000): Learning and Planning Values
RL Future (c. 2000 on): Constructivism
Constructivism

The active construction of representations and models of the world to facilitate the learning and planning of values

[Figure: a hierarchy from policy, to value functions, to representations and models, with "great flexibility here" at the representation level]

(Piaget; Drescher)
Constructivist Prophecy

Whereas RL present is about solving an MDP, RL future will be about representing the states, actions, transitions, rewards, and features to construct an MDP.

Constructing the world to be the way we want it:
Markov, Linear, Small, Reliable, Independent, Shallow, Deterministic, Additive, Low branching

The RL agent as active world modeler
Representing State, Part I: Features and Function Approximation

Linear-in-the-features methods are state of the art (also memory-based methods)
Two-stage architecture:
Compute feature values
• Nonlinear, expansive, fixed or slowly changing mapping
Map the feature values linearly to the result
• Linear, convergent, fast-changing mapping
Works great if features are appropriate: fast, reliable, local learning; good generalization
Feature construction best done by hand ... or by methods yet to be found

[Figure: constructive induction: State -> Features -> Values. Good features correspond to regions of similar value; bad features are unrelated to values.]
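A sketch of the two-stage architecture (the particular feature map here, a coarse one-hot binning of a scalar state, is only an illustrative stand-in for a fixed nonlinear, expansive mapping):

```python
import numpy as np

# Stage 1: fixed, nonlinear, expansive feature mapping.
# Here: one-hot binning of a scalar state in [0, 1) (illustrative only).
NUM_FEATURES = 20

def features(state):
    x = np.zeros(NUM_FEATURES)
    x[min(int(state * NUM_FEATURES), NUM_FEATURES - 1)] = 1.0
    return x

# Stage 2: fast, linear mapping from features to value.
w = np.zeros(NUM_FEATURES)

def v(state):
    return w @ features(state)

def td_update(state, reward, next_state, done, alpha=0.1, gamma=0.95):
    """Semi-gradient TD(0): only the linear weights change quickly."""
    global w
    target = reward + (0.0 if done else gamma * v(next_state))
    w += alpha * (target - v(state)) * features(state)
```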
Representing State, Part II: Partial Observability

When immediate observations do not uniquely identify the current state; non-Markov problems

Not as big a deal as widely thought: a greater problem for theory than for practice
Need not use POMDP ideas
Can treat as a function approximation issue:
Making do with imperfect observations/features
Finding the right memories to add as new features
The key is to construct state representations that make the world more Markov (McCallum's thesis)
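A minimal sketch of "adding memories as features": augment the current observation with a short window of recent observations and actions, so the augmented representation is more nearly Markov (the window length and the interface are assumptions, not from the talk):

```python
from collections import deque

class MemoryAugmentedState:
    """State = current observation plus a short remembered history,
    making a non-Markov observation stream more nearly Markov."""
    def __init__(self, history_len=3):
        self.history = deque(maxlen=history_len)

    def update(self, observation, last_action):
        self.history.append((last_action, observation))

    def state(self, observation):
        # The augmented state used for learning and planning.
        return (observation, tuple(self.history))
```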
Representations of Action

Nominally, actions in RL are low-level: the lowest level at which behavior can vary
But people work mostly with courses of action:
We decide among these
We make predictions at this level
We plan at this level
Remarkably, all this can be incorporated in RL:
Course of action = policy + termination condition
Almost all RL ideas, algorithms and theory extend
Wherever actions are used, courses of action can be substituted

Parr, Bradtke & Duff, Precup, Singh, Dietterich, Kaelbling, Huber & Grupen, Szepesvari, Dayan, Ryan & Pendrith, Hauskrecht, Lin...
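A course of action in this sense can be sketched directly from the definition "policy + termination condition" (the environment interface is an assumption; the optional initiation set is from the broader options literature, not this slide):

```python
import random

class CourseOfAction:
    """A course of action: a policy plus a termination condition."""
    def __init__(self, policy, terminate, can_start=lambda s: True):
        self.policy = policy          # state -> primitive action
        self.terminate = terminate    # state -> probability of stopping
        self.can_start = can_start    # state -> bool (initiation set)

    def run(self, env, state):
        """Execute until termination; return (final_state, total_reward, steps)."""
        total, steps = 0.0, 0
        while True:
            a = self.policy(state)
            state, r, done = env.step(a)
            total += r
            steps += 1
            if done or random.random() < self.terminate(state):
                return state, total, steps
```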
Room-to-Room Courses of Action

[Figure: a course of action for each hallway from each room (2 of 8 shown)]
Representing Transitions

Models can also be learned for courses of action:
What state will we be in at termination?
How much reward will we receive along the way?
Mathematical form of models follows from the theory of semi-Markov decision processes
Permits planning at a higher level
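These two questions have a standard mathematical form in the options framework (Sutton, Precup & Singh). For a course of action o with (random) duration k started in state s, with discounting folded into the model:

$$r_s^{o} = \mathrm{E}\{\, r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} \,\}, \qquad p_{ss'}^{o} = \sum_{k=1}^{\infty} \gamma^{k} \Pr\{\, o \text{ terminates in } s' \text{ after } k \text{ steps} \,\}.$$

Planning at the higher level then uses the usual Bellman backup, with these in place of one-step rewards and transitions:

$$V(s) \leftarrow \max_{o} \Big[\, r_s^{o} + \sum_{s'} p_{ss'}^{o}\, V(s') \,\Big].$$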
Planning (Value Iteration) with Courses of Action

[Figure: iterations #0, #1, #2 of value iteration on the Rooms example, with V(goal) = 1; cell-to-cell primitive actions versus room-to-room courses of action]
Reconnaissance Example

Mission: fly over (observe) most valuable sites and return to base
Stochastic weather affects observability (cloudy or clear) of sites
Limited fuel
Intractable with classical optimal control methods
Actions:
Primitives: which direction to fly
Courses: which site to head for
Courses compress space and time:
Reduce steps from ~600 to ~6
Reduce states from ~10^11 to ~10^6
Enable finding of best solutions

[Figure: mission map showing the base, sites with observation values from 10 to 100, rewards, ~100 decision steps, and the mean time between weather changes]

B. Ravindran, UMass
Courses of action permit enormous flexibility
Subgoals

Courses of action are often goal-oriented, e.g., drive-to-work, open-the-door
A course can be learned to achieve its goal
Many can be learned at once, independently:
Solves classic problem of subgoal credit assignment
Solves psychological puzzle of goal-oriented action
Goal-oriented courses of action create a better MDP:
Fewer states, smaller branching factor
Compartmentalizes dependencies
Their models are also goal-oriented recognizers...
Perception

Real perception, like real action, is temporally extended
Features are ability oriented rather than sensor oriented:
What is a chair? Something that can be sat upon
Consider a goal-oriented course of action, like dock-with-charger
Its model gives the probability of successfully docking as a function of state
I.e., a feature (detector) for states that afford docking
Such features can be learned without supervision

[Figure: charger and the dockable region of states]
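A sketch of using a course-of-action model as such a detector: the learned success probability of a hypothetical dock_with_charger course becomes a binary "dockable" feature (all names here are illustrative, not from the talk):

```python
def affordance_feature(option_model, threshold=0.5):
    """Turn a learned course-of-action model into a perceptual feature.
    option_model(state) -> estimated probability the course achieves its goal."""
    def feature(state):
        return 1.0 if option_model(state) >= threshold else 0.0
    return feature

# Hypothetical usage: states that afford docking light up this feature.
# dockable = affordance_feature(dock_with_charger_model)
# x = [dockable(s), *other_features(s)]
```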
This is RL with a totally different feel

Still one primary policy and set of values
But many other policies, values, and models are learned not directly in service of reward
The dominant purpose is discovery, not reward:
What possibilities does this world afford?
How can I control and predict it in a variety of ways?
In other words, constructing representations to make the world:
Markov, Linear, Small, Reliable, Independent, Shallow, Deterministic, Additive, Low branching
Imagine

An agent driven primarily by biased curiosity
To discover how it can predict and control its interaction with the world:
What courses of action have predictable effects?
What salient observables can be controlled?
What models are most useful in planning?
A human coach presenting a series of:
Problems/Tasks
Courses of action
Highlighting key states, providing subpolicies, termination conditions...
What is New?

Constructivism itself is not new. But actually doing it would be!
Does RL really change it, make it easier? That is, do values and policies help?
Yes! Because so much constructed knowledge is well represented as values and policies in service of approximating values and policies
RL's goal-orientation is also critical to modeling goal-oriented action and perception
Take Home Messages

RL Past: let's revisit, but not repeat, past work
RL Present: do you accept that value functions are critical? And that TD methods are the way to find them?
RL Future: it's time to address representation construction
Explore/understand the world rather than control it
RL/values provide new structure for this
May explain goal-oriented action and perception
How far can RL go?
A simple and general formulation of AI
Yet there is enough structure to make progress
While this is true, we should complicate no further, but seek general principles of AI
They may take us all the way to human-level intelligence