Reinforcement Learning: A Survey
Xin Chen
Information and Computer Science Department, University of Hawaii
Fall 2006


Page 1

Reinforcement Learning: A survey

Xin Chen
Information and Computer Science Department
University of Hawaii
Fall, 2006

Page 2

I. Introduction

Introduction
Exploitation versus Exploration
Delayed Reward
Learning an Optimal Policy: Model-free Methods
Learning an Optimal Policy: Model-based Methods
Generalization
Partially Observable Environments
Reinforcement Learning Applications
Conclusions

Page 3

What is Reinforcement Learning

Reinforcement learning is a subcategory of machine learning.

It is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment.

Page 4

Machine learning categories

Supervised learning: input/output pairs; separate learning phase vs. acting phase.

Unsupervised learning: input only; separate learning phase vs. acting phase.

Reinforcement learning: input & critic; learning and acting simultaneously (online learning); explicitly explores the world and learns by trial and error.

Page 5

Research History

Samuel (1959). Checkers learning program.
Bellman (1958), Ford and Fulkerson (1962). Bellman-Ford single-destination shortest-path algorithms.
Bellman (1961), Blackwell (1965). Research on optimal control led to the solution of Markov decision processes.
Holland (1986). Bucket brigade method for learning classifier systems.
Barto et al. (1983). Approach to temporal credit assignment.
Sutton (1988). TD(λ) method and proof of its convergence for λ = 0.
Dayan (1992). Extended the above result to arbitrary values of λ.
Watkins (1989). Q-learning to acquire optimal policies when the reward and action transition functions are unknown.
McCallum (1995) and Littman (1996). Extension of reinforcement learning to settings with hidden state variables that violate the Markov assumption.
Scaling up of these methods: Maclin and Shavlik (1996); Lin (1992); Singh (1993); Lin (1993); Dietterich and Flann (1995); Mitchell and Thrun (1993); Ring (1994).
Recent reviews: Kaelbling et al. (1996); Barto et al. (1995); Sutton and Barto (1998).

Page 6

Two main strategies for solving Reinforcement Learning problems

Search the space of behaviors to find one that performs well, e.g., genetic algorithms and genetic programming (search and optimization).

Use statistical techniques and Dynamic Programming methods to estimate the utility of taking actions in states of the world.

Page 7

Standard Reinforcement Learning Model

A discrete set of environment states, S;
A discrete set of agent actions, A; and
A set of scalar reinforcement signals (rewards), typically {0, 1} or real numbers.
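The deck gives no code for this model; as a minimal illustrative sketch (the Environment and Agent interfaces below are assumptions, not part of the slides), the standard model amounts to the following interaction loop between agent and environment:

// Illustrative sketch of the standard reinforcement-learning interaction loop.
// The interfaces and names are hypothetical; they do not come from the slides.
interface Environment {
    int currentState();          // an element of the discrete state set S
    double step(int action);     // execute an action from A and return the scalar reward
}

interface Agent {
    int chooseAction(int state);                                        // the policy π: S -> A
    void observe(int state, int action, double reward, int nextState);  // learn from experience
}

class InteractionLoop {
    static void run(Environment env, Agent agent, int steps) {
        int s = env.currentState();
        for (int t = 0; t < steps; t++) {
            int a = agent.chooseAction(s);   // act
            double r = env.step(a);          // receive the scalar reinforcement signal
            int sNext = env.currentState();  // observe the resulting state
            agent.observe(s, a, r, sNext);   // update behavior from the tuple <s, a, r, s'>
            s = sNext;
        }
    }
}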

Page 8

Example of Reinforcement Learning

Agent’s job: Find a policy π, mapping states to actions, that maximizes some long-run measure of reinforcement.

We expect that, in general, the environment is non-deterministic, though stationary over the long run.

Page 9

Models of optimal behavior

Finite-horizon model

Infinite-horizon discounted model

Average-reward model
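The formulas for these three criteria were not captured in this transcript; the standard definitions (as in Kaelbling et al. [1], with rewards r_t and discount factor 0 ≤ γ < 1) are:

Finite-horizon model: maximize  E\left[ \sum_{t=0}^{h} r_t \right]

Infinite-horizon discounted model: maximize  E\left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]

Average-reward model: maximize  \lim_{h \to \infty} \frac{1}{h} E\left[ \sum_{t=0}^{h} r_t \right]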

Page 10

Measuring learning performance

Eventual convergence to optimal. Many algorithms have a provable guarantee of asymptotic convergence to optimal behavior. This is reassuring but useless in practice: an agent that quickly reaches a plateau at 99% of optimality may often be preferred to one that eventually converges but spends a long time performing poorly.

Speed of convergence to optimality. More practical is the speed of convergence to near-optimality, which requires defining how near to optimality is sufficient. A related measure is the level of performance after a given time, which similarly requires that the given time be defined. Measures related to the speed of learning have an additional weakness: an algorithm that merely tries to achieve optimality as fast as possible may incur unnecessarily large penalties during the learning period.

Regret, a more appropriate measure, is the expected decrease in reward incurred by executing the learning algorithm instead of behaving optimally from the very beginning. It penalizes mistakes wherever they occur during the run. Unfortunately, results concerning the regret of algorithms are quite hard to obtain.
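As a concrete illustration (not given on the slide), for a bandit whose best arm has expected per-step reward \mu^*, the regret after T steps can be written as

\text{Regret}_T = T\,\mu^* - E\left[ \sum_{t=1}^{T} r_t \right],

so a low-regret algorithm loses little reward, even early on, relative to behaving optimally from the start.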

Page 11

II. Exploration versus Exploitation


Page 12

The k-armed Bandit problem

The simplest reinforcement learning problem is the k-armed bandit problem.

It is an RL environment with a single state and only self-transitions.

It has been studied extensively in the statistics and applied mathematics literature.

It illustrates the fundamental tradeoff between exploitation and exploration.

Page 13

Formally justified solutions

Dynamic programming approach: The expense of filling in the table of V* values in this way for all attainable belief states is linear in the number of belief states times the number of actions, and thus exponential in the horizon.

Gittins allocation indices: Guarantee optimal exploration and are simple to apply (given the table of index values), which makes them promising for more complex applications. They have proved useful in an application to robotic manipulation with immediate reward. Unfortunately, no one has yet been able to find an analog of index values for delayed-reinforcement problems.

Learning automata: Always converge to a vector containing a single 1 and the rest 0's, but do not always converge to the correct action; however, the probability of converging to the wrong action can be made arbitrarily small.

Page 14

Ad-hoc solutions

Simple, popular, rarely the best choice, but viewed as reasonable, computationally tractable heuristics.

Greedy strategies: Take the action with the best estimated payoff. No exploration.

Randomized strategies: Epsilon-greedy takes the action with the best estimated expected reward by default, but with probability p chooses an action at random; Boltzmann exploration weights actions by their estimated values through a temperature parameter. (A sketch of both follows this list.)

Interval-based techniques: Kaelbling's interval estimation algorithm. An action is chosen by computing the upper bound of a 100(1−α)% confidence interval on the success probability of each action and choosing the action with the highest upper bound.
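The slides contain no code for these heuristics; the following is a minimal sketch of ε-greedy and Boltzmann action selection over a table of estimated action values (the class name, the qValues array, and the epsilon/temperature parameters are illustrative assumptions, not from the deck):

import java.util.Random;

// Illustrative sketch of two randomized exploration heuristics over
// estimated action values; not taken from the original slides.
class ExplorationPolicies {
    static final Random RNG = new Random();

    // Epsilon-greedy: with probability epsilon pick a random action,
    // otherwise pick the action with the best estimated value.
    static int epsilonGreedy(double[] qValues, double epsilon) {
        if (RNG.nextDouble() < epsilon) {
            return RNG.nextInt(qValues.length);
        }
        int best = 0;
        for (int a = 1; a < qValues.length; a++) {
            if (qValues[a] > qValues[best]) best = a;
        }
        return best;
    }

    // Boltzmann (softmax) exploration: sample an action with probability
    // proportional to exp(Q(a) / temperature).
    static int boltzmann(double[] qValues, double temperature) {
        double[] weights = new double[qValues.length];
        double total = 0.0;
        for (int a = 0; a < qValues.length; a++) {
            weights[a] = Math.exp(qValues[a] / temperature);
            total += weights[a];
        }
        double sample = RNG.nextDouble() * total;
        for (int a = 0; a < qValues.length; a++) {
            sample -= weights[a];
            if (sample <= 0.0) return a;
        }
        return qValues.length - 1; // numerical fallback
    }
}

Lowering epsilon or the temperature over time shifts the agent from exploration toward exploitation.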

Page 15

More general problems

When there are multiple states but reinforcement is still immediate, all of the methods above can be replicated once for each state.

When generalization is required, these solutions must be integrated with generalization methods. This is easy for the ad-hoc methods, but not for the theoretically justified ones.

Many of these methods converge to a regime in which exploratory actions are rarely or never taken. This is acceptable in a stationary environment, but not in a non-stationary world. Again, this is easy to address for the ad-hoc solutions, but not for the theoretically justified ones.

Page 16

III. Delayed Reward


Page 17

Modeling delayed-reward reinforcement learning as an MDP

Problems with delayed reinforcement are well modeled as Markov decision processes (MDPs). An MDP consists of:
a set of states S,
a set of actions A,
a reward function R: S × A → ℝ, and
a state transition function T: S × A → Π(S), where a member of Π(S) is a probability distribution over the set S. We write T(s, a, s') for the probability of making a transition from state s to state s' using action a.

This model is Markov if the state transitions are independent of any previous environment states or agent actions.
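Stated formally (the slide leaves this implicit), the Markov property requires that the next-state distribution depend only on the current state and action:

P(s_{t+1} = s' \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} = s' \mid s_t, a_t) = T(s_t, a_t, s').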

Page 18

Finding a policy given a model

Value iteration (the standard backup is reconstructed below)
Policy iteration

In practice, value iteration is much faster per iteration, but policy iteration takes fewer iterations.

Puterman's modified policy iteration algorithm provides a method for trading iteration time for iteration improvement in a smoother way.

Several standard numerical-analysis techniques that speed the convergence of dynamic programming can be used to accelerate value and policy iteration, including multigrid methods and state aggregation.
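The update rules themselves do not appear in this transcript. The standard value-iteration backup (as in [1], with discount factor γ) repeatedly applies, for every state s,

V(s) \leftarrow \max_{a} \left[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V(s') \right]

until the value function converges; the greedy policy \pi(s) = \arg\max_{a} \left[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V(s') \right] is then optimal.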

Page 19

IV. Model-free methods


Page 20

Temporal Difference TD(λ) Learning

Sutton, 1988.

Learns by reducing discrepancies between estimates made by the agent at different times.

A constant λ is used to combine the estimates obtained from n-step lookahead into a single estimate Qλ, which also has an equivalent recursive definition. (Neither equation was captured in this transcript; a standard reconstruction follows.)
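Following the Q-learning form of TD(λ) in Mitchell [3] (a reconstruction, so the exact notation is an assumption about what the missing slide showed), the combined estimate is

Q^{\lambda}(s_t, a_t) = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} Q^{(n)}(s_t, a_t),

where Q^{(n)} is the n-step lookahead estimate, and the equivalent recursive definition is

Q^{\lambda}(s_t, a_t) = r_t + \gamma \left[ (1 - \lambda) \max_{a} \hat{Q}(s_{t+1}, a) + \lambda\, Q^{\lambda}(s_{t+1}, a_{t+1}) \right].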

Page 21

Q-learning

Watkins, 1989.

A special case of temporal difference algorithms (TD(0)).

Q-learning handles discounted infinite-horizon MDPs.

Q-learning is easy to work with in practice.

Q-learning is exploration insensitive.

Q-learning is the most popular, and seems to be the most effective, model-free algorithm for learning from delayed reinforcement.

However, it does not address the scaling problem, and it may converge quite slowly.

Page 22

Q-learning rule

s – current state
s' – next state
a – action
a' – an action in the next state
r – immediate reward
γ – discount factor
Q(s, a) – expected discounted reinforcement of taking action a in state s
<s, a, r, s'> is an experience tuple.

Page 23

Q-learning algorithm

For each s, a, initialize the table entry Q(s, a) ← 0.
Observe the current state s.
Do forever:
  Select an action a and execute it.
  Receive the immediate reward r.
  Observe the new state s'.
  Update the table entry for Q(s, a) as follows:
    Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
  s ← s'

Page 24

Convergence criteria

The system is a deterministic MDP.

The immediate reward values are bounded, i.e., there exists some positive constant c such that for all states s and actions a, |r(s, a)| < c.

The agent selects actions such that it visits every possible state-action pair infinitely often.

Page 25

V. Model-based methods


Page 26

Why model-based methods instead of model-free methods?

Model-free methods need very little computation time per experience, but make extremely inefficient use of the gathered data and often need a large amount of experience to achieve good performance.

Model-based methods are important in situations where computation is considered cheap and real-world experience costly.

Page 27

Model-Based methods

Certainty equivalent methods
Dyna (a sketch of the Dyna idea follows this list)
Prioritized Sweeping / Queue-Dyna
RTDP (real-time dynamic programming)
The Plexus planning system
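No code accompanies these methods in the deck. As an illustrative sketch of the model-based idea behind Dyna (the class, field names, and planningSteps parameter are assumptions, not from the slides), each real experience drives a direct Q-learning update and also updates a learned transition model, which is then replayed for several extra simulated updates:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative Dyna-Q sketch: learn a model of observed transitions and
// reuse it for extra simulated ("planning") updates after each real step.
class DynaQ {
    static final class Outcome {
        final double reward; final int nextState;
        Outcome(double reward, int nextState) { this.reward = reward; this.nextState = nextState; }
    }

    final double[][] q;                                // Q(s, a) table
    final Map<Long, Outcome> model = new HashMap<>();  // (s, a) -> last observed outcome
    final List<Long> seen = new ArrayList<>();         // state-action pairs seen so far
    final double alpha, gamma;
    final int planningSteps;
    final Random rng = new Random();

    DynaQ(int numStates, int numActions, double alpha, double gamma, int planningSteps) {
        this.q = new double[numStates][numActions];
        this.alpha = alpha;
        this.gamma = gamma;
        this.planningSteps = planningSteps;
    }

    void observe(int s, int a, double r, int sNext) {
        update(s, a, r, sNext);                        // learn from the real experience
        long key = ((long) s << 32) | a;
        if (!model.containsKey(key)) seen.add(key);
        model.put(key, new Outcome(r, sNext));         // update the learned (deterministic) model
        for (int i = 0; i < planningSteps; i++) {      // k simulated updates replayed from the model
            long k = seen.get(rng.nextInt(seen.size()));
            Outcome o = model.get(k);
            update((int) (k >> 32), (int) (k & 0xFFFFFFFFL), o.reward, o.nextState);
        }
    }

    void update(int s, int a, double r, int sNext) {
        double best = q[sNext][0];
        for (double v : q[sNext]) best = Math.max(best, v);
        q[s][a] += alpha * (r + gamma * best - q[s][a]);  // one-step Q-learning backup
    }
}

Because each real step is amortized over many simulated backups, much less real-world experience is needed, at the cost of extra computation per step, which is exactly the trade-off described on the previous slide.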

Page 28

VI. Generalization


Page 29

Generalization over input

  Immediate Reward:
    CRBP
    ARC
    REINFORCE Algorithms
    Logic-Based Methods
  Delayed Reward:
    Adaptive Resolution Models

Generalization over action

Page 30

Hierarchical methods

Feudal Q-learning: In the simplest case, there is a high-level master and a low-level slave.

Compositional Q-learning: Consists of a hierarchy based on the temporal sequencing of sub-goals.

Hierarchical Distance to Goal.

Page 31

VII. Partially observable environments


Page 32

Why Partially Observable Environments?

Complete observability is necessary for learning methods based on MDPs. But in many real-world environments, it’s not possible for the agent to have perfect and complete perception of the state of the environment.

Page 33

Partially Observable Environments

State-Free Deterministic Policies
State-Free Stochastic Policies
Policies with Internal State:
  Recurrent Q-learning
  Classifier Systems
  Finite-history-window Approach
  POMDP Approach
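For the POMDP approach, the slides do not show how the internal state is maintained. The standard construction keeps a belief state b, a probability distribution over S, updated after taking action a and receiving observation o (with observation model O) by

b'(s') = \frac{ O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s) }{ P(o \mid a, b) },

where P(o | a, b) is a normalizing constant; the belief state is then treated as the state of a continuous-space MDP.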

Page 34

VIII. Reinforcement learning applications


Page 35

Reinforcement learning applications

Game Playing

TD-Gammon: TD learning applied to Backgammon, whose success relies in part on special features of Backgammon.

Although experiments with other games (Go, Chess) have in some cases produced interesting learning behavior, no success close to that of TD-Gammon has been repeated.

It is still an open question whether, and how, the success of TD-Gammon can be repeated in other domains.

Page 36

Reinforcement learning applications

Robotics and Control

Schaal and Atkeson: constructed a two-armed robot that learns to juggle a device known as a devil-stick.
Mahadevan and Connell: discussed a task in which a mobile robot pushes large boxes for extended periods of time.
Mataric: describes a robotics experiment with a very high-dimensional state space, containing many dozens of degrees of freedom.
Crites and Barto: Q-learning has been used in an elevator dispatching task.
Kaelbling: a packaging task from the food processing industry.

Interesting results

To make a real system work, it proved necessary to supplement the fundamental algorithm with extra pre-programmed knowledge.
None of the exploration strategies mirrors theoretically optimal (but computationally intractable) exploration, and yet all proved adequate.
The computational requirements of these experiments were all very different, indicating that reinforcement learning algorithms with differing computational demands each find their own range of applications.

Page 37

IX. Conclusions


Page 38

Conclusions

Many techniques are good for small problems, but don’t scale.

To scale, we must make use of some bias. E.g., shaping, local reinforcement signals, imitation, problem decomposition and reflexes.

Page 39

References

[1] Leslie Pack Kaelbling, Michael L. Littman, Andrew W. Moore. Reinforcement Learning: A Survey. (1996) http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/kaelbling96a-html/rl-survey.html

[2] Richard S. Sutton, Andrew G. Barto. Reinforcement Learning: An Introduction. (1998) http://www.cs.ualberta.ca/~sutton/book/ebook/the-book.html

[3] Tom M. Mitchell. Machine Learning. (1997)

[4] A. Barto, S. Bradtke, S. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, Special volume: Computational research on interaction and agency, 72(1), 81-138. (1995)

[5] Richard S. Sutton. 499/699 course on Reinforcement Learning. University of Alberta, Spring 2006. http://rlai.cs.ualberta.ca/RLAI/RLAIcourse/RLAIcourse2006.html

Page 40

A Java implementation of Q-learning
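The code from the original slide was not captured in this transcript, so it is not reproduced here. The sketch below is a reconstruction consistent with the algorithm on Page 23 (tabular Q-learning with ε-greedy action selection); the class and method names are illustrative, not the original slide's:

import java.util.Random;

// Minimal tabular Q-learning sketch, consistent with the update rule on Page 23.
// This is an illustrative reconstruction, not the implementation from the original slide.
class QLearner {
    final double[][] q;          // Q(s, a) table, initialized to 0
    final double alpha;          // learning rate
    final double gamma;          // discount factor
    final double epsilon;        // exploration probability
    final Random rng = new Random();

    QLearner(int numStates, int numActions, double alpha, double gamma, double epsilon) {
        this.q = new double[numStates][numActions];
        this.alpha = alpha;
        this.gamma = gamma;
        this.epsilon = epsilon;
    }

    // Epsilon-greedy action selection over the current Q estimates.
    int selectAction(int s) {
        if (rng.nextDouble() < epsilon) return rng.nextInt(q[s].length);
        int best = 0;
        for (int a = 1; a < q[s].length; a++) {
            if (q[s][a] > q[s][best]) best = a;
        }
        return best;
    }

    // Apply the update Q(s,a) <- Q(s,a) + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ].
    void update(int s, int a, double r, int sNext) {
        double best = q[sNext][0];
        for (double v : q[sNext]) best = Math.max(best, v);
        q[s][a] += alpha * (r + gamma * best - q[s][a]);
    }
}

A training loop would repeatedly call selectAction(s), execute the action in the environment, observe r and s', call update(s, a, r, s'), and set s = s'.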

Page 41

Questions