Reinforcement Learning: A Survey
Xin Chen
Information and Computer Science Department, University of Hawaii
Fall 2006


Page 1

Reinforcement Learning: A survey

Xin Chen
Information and Computer Science Department
University of Hawaii
Fall, 2006

Page 2

I. Introduction

Introduction
Exploitation versus Exploration
Delayed Reward
Learning an Optimal Policy: Model-free Methods
Learning an Optimal Policy: Model-based Methods
Generalization
Partially Observable Environments
Reinforcement Learning Applications
Conclusions

Page 3

What is Reinforcement Learning

Reinforcement learning is a subcategory of machine learning.

It is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment.

Page 4

Machine learning categories

Supervised learning: input/output pairs; separate learning phase vs. acting phase.

Unsupervised learning: input only; separate learning phase vs. acting phase.

Reinforcement learning: input & critic; learning and acting simultaneously (online learning); explicitly explores the world and learns by trial and error.

Page 5

Research History

Samuel (1959). Checkers learning program.
Bellman (1958), Ford and Fulkerson (1962). Bellman-Ford single-destination shortest-path algorithms.
Bellman (1961), Blackwell (1965). Research on optimal control led to the solution of Markov decision processes.
Holland (1986). Bucket brigade method for learning classifier systems.
Barto et al. (1983). Approach to temporal credit assignment.
Sutton (1988). TD(λ) method and proof of its convergence for λ = 0.
Dayan (1992). Extended the above result to arbitrary values of λ.
Watkins (1989). Q-learning to acquire optimal policies when the reward and action transition functions are unknown.
McCallum (1995) and Littman (1996). Extension of reinforcement learning to settings with hidden state variables that violate the Markov assumption.
Scaling up of these methods: Maclin and Shavlik (1996); Lin (1992); Singh (1993); Lin (1993); Dietterich and Flann (1995); Mitchell and Thrun (1993); Ring (1994).
Recent reviews: Kaelbling et al. (1996); Barto et al. (1995); Sutton and Barto (1998).

Page 6

Two main strategies for solving Reinforcement Learning problems

Search the space of behaviors to find one that performs well, e.g., genetic algorithms and genetic programming (search and optimization).

Use statistical techniques and Dynamic Programming methods to estimate the utility of taking actions in states of the world.

Page 7

Standard Reinforcement Learning Model

A discrete set of environment states, S;
A discrete set of agent actions, A; and
A set of scalar reinforcement signals (rewards), typically {0, 1} or real numbers.
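The deck gives no code for this model; as a minimal illustrative sketch (the Environment and Agent interfaces below are assumptions, not part of the slides), the standard model amounts to the following interaction loop between agent and environment:

// Illustrative sketch of the standard reinforcement-learning interaction loop.
// The interfaces and names are hypothetical; they do not come from the slides.
interface Environment {
    int currentState();          // an element of the discrete state set S
    double step(int action);     // execute an action from A and return the scalar reward
}

interface Agent {
    int chooseAction(int state);                                        // the policy π: S -> A
    void observe(int state, int action, double reward, int nextState);  // learn from experience
}

class InteractionLoop {
    static void run(Environment env, Agent agent, int steps) {
        int s = env.currentState();
        for (int t = 0; t < steps; t++) {
            int a = agent.chooseAction(s);   // act
            double r = env.step(a);          // receive the scalar reinforcement signal
            int sNext = env.currentState();  // observe the resulting state
            agent.observe(s, a, r, sNext);   // update behavior from the tuple <s, a, r, s'>
            s = sNext;
        }
    }
}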

Page 8

Example of Reinforcement Learning

Agent’s job: Find a policy π, mapping states to actions, that maximizes some long-run measure of reinforcement.

We expect that, in general, the environment is non-deterministic, though stationary over the long run.

Page 9

Models of optimal behavior

Finite-horizon model

Infinite-horizon discounted model

Average-reward model
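The formulas for these three criteria were not captured in this transcript; the standard definitions (as in Kaelbling et al. [1], with rewards r_t and discount factor 0 ≤ γ < 1) are:

Finite-horizon model: maximize  E\left[ \sum_{t=0}^{h} r_t \right]

Infinite-horizon discounted model: maximize  E\left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]

Average-reward model: maximize  \lim_{h \to \infty} \frac{1}{h} E\left[ \sum_{t=0}^{h} r_t \right]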

Page 10

Measuring learning performance

Eventual convergence to optimal. Many algorithms have a provable guarantee of asymptotic convergence to optimal behavior. This is reassuring but useless in practice: an agent that quickly reaches a plateau at 99% of optimality may often be preferred to one that eventually converges but spends a long time performing poorly.

Speed of convergence to optimality. More practical is the speed of convergence to near-optimality, which requires defining how near to optimality is sufficient. A related measure is the level of performance after a given time, which similarly requires that the given time be defined. Measures related to the speed of learning have an additional weakness: an algorithm that merely tries to achieve optimality as fast as possible may incur unnecessarily large penalties during the learning period.

Regret, a more appropriate measure, is the expected decrease in reward incurred by executing the learning algorithm instead of behaving optimally from the very beginning. It penalizes mistakes wherever they occur during the run. Unfortunately, results concerning the regret of algorithms are quite hard to obtain.
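As a concrete illustration (not given on the slide), for a bandit whose best arm has expected per-step reward \mu^*, the regret after T steps can be written as

\text{Regret}_T = T\,\mu^* - E\left[ \sum_{t=1}^{T} r_t \right],

so a low-regret algorithm loses little reward, even early on, relative to behaving optimally from the start.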

Page 11

II. Exploration versus Exploitation


Page 12

The k-armed Bandit problem

The simplest reinforcement learning problem is the k-armed bandit problem.

It is an RL environment with a single state and only self-transitions.

It has been studied extensively in the statistics and applied mathematics literature.

It illustrates the fundamental tradeoff between exploitation and exploration.

Page 13

Formally justified solutions

Dynamic programming approach: The expense of filling in the table of V* values in this way for all attainable belief states is linear in the number of belief states times the number of actions, and thus exponential in the horizon.

Gittins allocation indices: Guarantee optimal exploration and are simple to apply (given the table of index values), which makes them promising for more complex applications. They have proved useful in an application to robotic manipulation with immediate reward. Unfortunately, no one has yet been able to find an analog of index values for delayed-reinforcement problems.

Learning automata: Always converge to a vector containing a single 1 and the rest 0's, but do not always converge to the correct action; however, the probability of converging to the wrong action can be made arbitrarily small.

Page 14

Ad-hoc solutions

Simple, popular, rarely the best choice, but viewed as reasonable, computationally tractable heuristics.

Greedy strategies: Take the action with the best estimated payoff. No exploration.

Randomized strategies: Epsilon-greedy takes the action with the best estimated expected reward by default, but with probability p chooses an action at random; Boltzmann exploration weights actions by their estimated values through a temperature parameter. (A sketch of both follows this list.)

Interval-based techniques: Kaelbling's interval estimation algorithm. An action is chosen by computing the upper bound of a 100(1−α)% confidence interval on the success probability of each action and choosing the action with the highest upper bound.
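The slides contain no code for these heuristics; the following is a minimal sketch of ε-greedy and Boltzmann action selection over a table of estimated action values (the class name, the qValues array, and the epsilon/temperature parameters are illustrative assumptions, not from the deck):

import java.util.Random;

// Illustrative sketch of two randomized exploration heuristics over
// estimated action values; not taken from the original slides.
class ExplorationPolicies {
    static final Random RNG = new Random();

    // Epsilon-greedy: with probability epsilon pick a random action,
    // otherwise pick the action with the best estimated value.
    static int epsilonGreedy(double[] qValues, double epsilon) {
        if (RNG.nextDouble() < epsilon) {
            return RNG.nextInt(qValues.length);
        }
        int best = 0;
        for (int a = 1; a < qValues.length; a++) {
            if (qValues[a] > qValues[best]) best = a;
        }
        return best;
    }

    // Boltzmann (softmax) exploration: sample an action with probability
    // proportional to exp(Q(a) / temperature).
    static int boltzmann(double[] qValues, double temperature) {
        double[] weights = new double[qValues.length];
        double total = 0.0;
        for (int a = 0; a < qValues.length; a++) {
            weights[a] = Math.exp(qValues[a] / temperature);
            total += weights[a];
        }
        double sample = RNG.nextDouble() * total;
        for (int a = 0; a < qValues.length; a++) {
            sample -= weights[a];
            if (sample <= 0.0) return a;
        }
        return qValues.length - 1; // numerical fallback
    }
}

Lowering epsilon or the temperature over time shifts the agent from exploration toward exploitation.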

Page 15

More general problems

When there are multiple states but reinforcement is still immediate, all of the methods above can be replicated once for each state.

When generalization is required, these solutions must be integrated with generalization methods. This is easy for the ad-hoc methods, but not for the theoretically justified ones.

Many of these methods converge to a regime in which exploratory actions are rarely or never taken. This is acceptable in a stationary environment, but not in a non-stationary world. Again, this is easy to address for the ad-hoc solutions, but not for the theoretically justified ones.

Page 16

III. Delayed Reward


Page 17

Modeling delayed-reward reinforcement learning as an MDP

Problems with delayed reinforcement are well modeled as Markov decision processes (MDPs). An MDP consists of:
a set of states S,
a set of actions A,
a reward function R: S × A → ℝ, and
a state transition function T: S × A → Π(S), where a member of Π(S) is a probability distribution over the set S. We write T(s, a, s') for the probability of making a transition from state s to state s' using action a.

This model is Markov if the state transitions are independent of any previous environment states or agent actions.
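Stated formally (the slide leaves this implicit), the Markov property requires that the next-state distribution depend only on the current state and action:

P(s_{t+1} = s' \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} = s' \mid s_t, a_t) = T(s_t, a_t, s').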

Page 18

Finding a policy given a model

Value iteration (the standard backup is reconstructed below)
Policy iteration

In practice, value iteration is much faster per iteration, but policy iteration takes fewer iterations.

Puterman's modified policy iteration algorithm provides a method for trading iteration time for iteration improvement in a smoother way.

Several standard numerical-analysis techniques that speed the convergence of dynamic programming can be used to accelerate value and policy iteration, including multigrid methods and state aggregation.
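The update rules themselves do not appear in this transcript. The standard value-iteration backup (as in [1], with discount factor γ) repeatedly applies, for every state s,

V(s) \leftarrow \max_{a} \left[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V(s') \right]

until the value function converges; the greedy policy \pi(s) = \arg\max_{a} \left[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V(s') \right] is then optimal.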

Page 19

IV. Model-free methods


Page 20

Temporal Difference TD(λ) Learning

Sutton, 1988.

Learns by reducing discrepancies between estimates made by the agent at different times.

A constant λ is used to combine the estimates obtained from n-step lookahead into a single estimate Qλ, which also has an equivalent recursive definition. (Neither equation was captured in this transcript; a standard reconstruction follows.)
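Following the Q-learning form of TD(λ) in Mitchell [3] (a reconstruction, so the exact notation is an assumption about what the missing slide showed), the combined estimate is

Q^{\lambda}(s_t, a_t) = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} Q^{(n)}(s_t, a_t),

where Q^{(n)} is the n-step lookahead estimate, and the equivalent recursive definition is

Q^{\lambda}(s_t, a_t) = r_t + \gamma \left[ (1 - \lambda) \max_{a} \hat{Q}(s_{t+1}, a) + \lambda\, Q^{\lambda}(s_{t+1}, a_{t+1}) \right].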

Page 21

Q-learning

Watkins, 1989.

A special case of temporal difference algorithms (TD(0)).

Q-learning handles discounted infinite-horizon MDPs.

Q-learning is easy to work with in practice.

Q-learning is exploration insensitive.

Q-learning is the most popular, and seems to be the most effective, model-free algorithm for learning from delayed reinforcement.

However, it does not address the scaling problem, and it may converge quite slowly.

Page 22

Q-learning rule

s – current state
s' – next state
a – action
a' – an action in the next state
r – immediate reward
γ – discount factor
Q(s, a) – expected discounted reinforcement of taking action a in state s
<s, a, r, s'> is an experience tuple.

Page 23

Q-learning algorithm

For each s, a, initialize the table entry Q(s, a) ← 0.
Observe the current state s.
Do forever:
  Select an action a and execute it.
  Receive the immediate reward r.
  Observe the new state s'.
  Update the table entry for Q(s, a) as follows:
    Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
  s ← s'

Page 24

Convergence criteria

The system is a deterministic MDP.

The immediate reward values are bounded, i.e., there exists some positive constant c such that for all states s and actions a, |r(s, a)| < c.

The agent selects actions such that it visits every possible state-action pair infinitely often.

Page 25

V. Model-based methods


Page 26

Why model-based methods instead of model-free methods?

Model-free methods need very little computation time per experience, but make extremely inefficient use of the gathered data and often need a large amount of experience to achieve good performance.

Model-based methods are important in situations where computation is considered cheap and real-world experience costly.

Page 27

Model-Based methods

Certainty equivalent methods
Dyna (a sketch of the Dyna idea follows this list)
Prioritized Sweeping / Queue-Dyna
RTDP (real-time dynamic programming)
The Plexus planning system
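No code accompanies these methods in the deck. As an illustrative sketch of the model-based idea behind Dyna (the class, field names, and planningSteps parameter are assumptions, not from the slides), each real experience drives a direct Q-learning update and also updates a learned transition model, which is then replayed for several extra simulated updates:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative Dyna-Q sketch: learn a model of observed transitions and
// reuse it for extra simulated ("planning") updates after each real step.
class DynaQ {
    static final class Outcome {
        final double reward; final int nextState;
        Outcome(double reward, int nextState) { this.reward = reward; this.nextState = nextState; }
    }

    final double[][] q;                                // Q(s, a) table
    final Map<Long, Outcome> model = new HashMap<>();  // (s, a) -> last observed outcome
    final List<Long> seen = new ArrayList<>();         // state-action pairs seen so far
    final double alpha, gamma;
    final int planningSteps;
    final Random rng = new Random();

    DynaQ(int numStates, int numActions, double alpha, double gamma, int planningSteps) {
        this.q = new double[numStates][numActions];
        this.alpha = alpha;
        this.gamma = gamma;
        this.planningSteps = planningSteps;
    }

    void observe(int s, int a, double r, int sNext) {
        update(s, a, r, sNext);                        // learn from the real experience
        long key = ((long) s << 32) | a;
        if (!model.containsKey(key)) seen.add(key);
        model.put(key, new Outcome(r, sNext));         // update the learned (deterministic) model
        for (int i = 0; i < planningSteps; i++) {      // k simulated updates replayed from the model
            long k = seen.get(rng.nextInt(seen.size()));
            Outcome o = model.get(k);
            update((int) (k >> 32), (int) (k & 0xFFFFFFFFL), o.reward, o.nextState);
        }
    }

    void update(int s, int a, double r, int sNext) {
        double best = q[sNext][0];
        for (double v : q[sNext]) best = Math.max(best, v);
        q[s][a] += alpha * (r + gamma * best - q[s][a]);  // one-step Q-learning backup
    }
}

Because each real step is amortized over many simulated backups, much less real-world experience is needed, at the cost of extra computation per step, which is exactly the trade-off described on the previous slide.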

Page 28

VI. Generalization


Page 29

Generalization over input

  Immediate Reward:
    CRBP
    ARC
    REINFORCE Algorithms
    Logic-Based Methods
  Delayed Reward:
    Adaptive Resolution Models

Generalization over action

Page 30

Hierarchical methods

Feudal Q-learning: In the simplest case, there is a high-level master and a low-level slave.

Compositional Q-learning: Consists of a hierarchy based on the temporal sequencing of sub-goals.

Hierarchical Distance to Goal.

Page 31

VII. Partially observable environments


Page 32

Why Partially Observable Environments?

Complete observability is necessary for learning methods based on MDPs. But in many real-world environments, it’s not possible for the agent to have perfect and complete perception of the state of the environment.

Page 33

Partially Observable Environments

State-Free Deterministic Policies
State-Free Stochastic Policies
Policies with Internal State:
  Recurrent Q-learning
  Classifier Systems
  Finite-history-window Approach
  POMDP Approach
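For the POMDP approach, the slides do not show how the internal state is maintained. The standard construction keeps a belief state b, a probability distribution over S, updated after taking action a and receiving observation o (with observation model O) by

b'(s') = \frac{ O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s) }{ P(o \mid a, b) },

where P(o | a, b) is a normalizing constant; the belief state is then treated as the state of a continuous-space MDP.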

Page 34

VIII. Reinforcement learning applications


Page 35

Reinforcement learning applications

Game Playing

TD-Gammon: TD learning applied to Backgammon, whose success relies in part on special features of Backgammon.

Although experiments with other games (Go, Chess) have in some cases produced interesting learning behavior, no success close to that of TD-Gammon has been repeated.

It is still an open question whether, and how, the success of TD-Gammon can be repeated in other domains.

Page 36

Reinforcement learning applications

Robotics and Control

Schaal and Atkeson: constructed a two-armed robot that learns to juggle a device known as a devil-stick.
Mahadevan and Connell: discussed a task in which a mobile robot pushes large boxes for extended periods of time.
Mataric: describes a robotics experiment with a very high-dimensional state space, containing many dozens of degrees of freedom.
Crites and Barto: Q-learning has been used in an elevator dispatching task.
Kaelbling: a packaging task from the food processing industry.

Interesting results

To make a real system work, it proved necessary to supplement the fundamental algorithm with extra pre-programmed knowledge.
None of the exploration strategies mirrors theoretically optimal (but computationally intractable) exploration, and yet all proved adequate.
The computational requirements of these experiments were all very different, indicating that reinforcement learning algorithms with differing computational demands each find their own range of applications.

Page 37

IX. Conclusions


Page 38

Conclusions

Many techniques are good for small problems, but don’t scale.

To scale, we must make use of some bias. E.g., shaping, local reinforcement signals, imitation, problem decomposition and reflexes.

Page 39

References

[1] Leslie Pack Kaelbling, Michael L. Littman, Andrew W. Moore. Reinforcement Learning: A Survey. (1996) http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/kaelbling96a-html/rl-survey.html

[2] Richard S. Sutton, Andrew G. Barto. Reinforcement Learning: An Introduction. (1998) http://www.cs.ualberta.ca/~sutton/book/ebook/the-book.html

[3] Tom M. Mitchell. Machine Learning. (1997)

[4] A. Barto, S. Bradtke, S. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, Special volume: Computational research on interaction and agency, 72(1), 81-138. (1995)

[5] Richard S. Sutton. 499/699 course on Reinforcement Learning. University of Alberta, Spring 2006. http://rlai.cs.ualberta.ca/RLAI/RLAIcourse/RLAIcourse2006.html

Page 40

A Java implementation of Q-learning
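The code from the original slide was not captured in this transcript, so it is not reproduced here. The sketch below is a reconstruction consistent with the algorithm on Page 23 (tabular Q-learning with ε-greedy action selection); the class and method names are illustrative, not the original slide's:

import java.util.Random;

// Minimal tabular Q-learning sketch, consistent with the update rule on Page 23.
// This is an illustrative reconstruction, not the implementation from the original slide.
class QLearner {
    final double[][] q;          // Q(s, a) table, initialized to 0
    final double alpha;          // learning rate
    final double gamma;          // discount factor
    final double epsilon;        // exploration probability
    final Random rng = new Random();

    QLearner(int numStates, int numActions, double alpha, double gamma, double epsilon) {
        this.q = new double[numStates][numActions];
        this.alpha = alpha;
        this.gamma = gamma;
        this.epsilon = epsilon;
    }

    // Epsilon-greedy action selection over the current Q estimates.
    int selectAction(int s) {
        if (rng.nextDouble() < epsilon) return rng.nextInt(q[s].length);
        int best = 0;
        for (int a = 1; a < q[s].length; a++) {
            if (q[s][a] > q[s][best]) best = a;
        }
        return best;
    }

    // Apply the update Q(s,a) <- Q(s,a) + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ].
    void update(int s, int a, double r, int sNext) {
        double best = q[sNext][0];
        for (double v : q[sNext]) best = Math.max(best, v);
        q[s][a] += alpha * (r + gamma * best - q[s][a]);
    }
}

A training loop would repeatedly call selectAction(s), execute the action in the environment, observe r and s', call update(s, a, r, s'), and set s = s'.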

Page 41

Questions