
Page 1:

CS 621 Reinforcement Learning

Group 8

Neeraj Bisht

Ranjeet Vimal

Nishant Suren

Naineet C. Patel

Jimmie Tete

Page 2:

Outline

Introduction

Motivation

Passive Learning in a Known Environment

Passive Learning in an Unknown Environment

Active Learning in an Unknown Environment

Exploration

Learning an Action Value Function

Generalization in Reinforcement Learning

Conclusion

References

Page 3:

Introduction

Reinforcement Learning is a sub-area of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward.

RL algorithms attempt to find a policy that maps states of the world to the actions the agent ought to take in those states.

In economics and game theory, RL is regarded as a boundedly rational explanation of how equilibrium may arise.

Page 4:

Motivation

Traditional learning methods rely on a teacher supplying the correct answers.

In RL, no correct/incorrect input/output pairs are given.

Rather, feedback is given after each stage.

Feedback for the learning process is called 'reward' or 'reinforcement'.

In RL we examine how an agent can learn from success and failure, from reward and punishment.

Page 5:

The RL framework

The environment is modeled as a finite-state Markov decision process (MDP).

The utility of a state, U(i), gives the usefulness of that state.

The agent can begin with knowledge of the environment and the effects of its actions, or it may have to learn this model as well as the utility information.

Page 6:

The RL problem

Rewards can be received in intermediate states or in a terminal state.

Rewards can be a component of the actual utility (e.g., points in a table-tennis match) or they can be hints about the actual utility (e.g., verbal reinforcement).

The agent can be a passive or an active learner.

Page 7:

Passive Learning in a Known Environment

Passive Learner: A passive learner simply watches the world going by, and tries to learn the utility of being in various states. Another way to think of a passive learner is as an agent with a fixed policy trying to determine its benefits.

Page 8:

Passive Learning in a Known Environment

In passive learning, the environment generates state transitions and the agent perceives them. Consider an agent trying to learn the utilities of the states of a 4x3 grid world:

Page 9:

Passive Learning in a Known Environment

The agent can move {North, East, South, West}. A run terminates on reaching [4,2] or [4,3].

Page 10:

Passive Learning in a Known Environment

The agent is provided with M_ij, a model giving the probability of reaching state j from state i.

Page 11:

Passive Learning in a Known Environment

The objective is to use this information about rewards to learn the expected utility U(i) associated with each nonterminal state i.

Utilities can be learned using three approaches:
1) LMS (least mean squares)
2) ADP (adaptive dynamic programming)
3) TD (temporal difference learning)

Page 12:

Passive Learning in a Known Environment

LMS (Least Mean Squares)

The agent makes random runs (sequences of random moves) through the environment, for example:

[1,1]->[1,2]->[1,3]->[2,3]->[3,3]->[4,3] = +1

[1,1]->[2,1]->[3,1]->[3,2]->[4,2] = -1

Page 13:

Passive Learning in a Known Environment

LMS: collect statistics on the final payoff for each state (e.g., when passing through [2,3], how often is +1 reached vs. -1?). The learner then computes the average for each state.

In the limit, these averages converge to the true expected values (utilities).
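
To make the LMS idea concrete, here is a minimal Python sketch (not from the slides): it replays recorded training sequences, credits every state visited in a sequence with that run's final payoff, and averages. The two sequences in the demo are the example runs above; the function name estimate_utilities is an illustrative choice.

```python
# LMS (direct utility estimation): average the observed final payoff per state.
from collections import defaultdict

def estimate_utilities(training_sequences):
    """Each training sequence is (list_of_states, final_payoff)."""
    totals = defaultdict(float)   # sum of observed payoffs per state
    counts = defaultdict(int)     # number of observations per state
    for states, payoff in training_sequences:
        for s in states:
            totals[s] += payoff
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

if __name__ == "__main__":
    runs = [
        ([(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)], +1.0),
        ([(1, 1), (2, 1), (3, 1), (3, 2), (4, 2)], -1.0),
    ]
    for state, u in sorted(estimate_utilities(runs).items()):
        print(state, round(u, 3))
```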

Page 14:

Passive Learning in a Known Environment

LMS main drawback: slow convergence. It takes the agent well over 1000 training sequences to get close to the correct values.

Page 15:

Passive Learning in a Known Environment

ADP (Adaptive Dynamic Programming)

Uses the value iteration or policy iteration algorithm to calculate exact utilities of states given an estimated model.

Page 16:

Passive Learning in a Known Environment

ADP

In general:

U_{n+1}(i) = R(i) + Σ_j M_ij · U_n(j)

- U_n(i) is the utility of state i after the nth iteration
- U_0(i) is initially set to R(i)
- R(i) is the reward of being in state i (often non-zero for only a few terminal states)
- M_ij is the probability of a transition from state i to j
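
As a hedged illustration of this update, the sketch below iterates the equation to convergence for a given transition model M and reward function R. The tiny three-state chain in the demo is an assumption for illustration, not the grid world from the slides.

```python
# ADP value determination: iterate U(i) = R(i) + sum_j M[i][j] * U(j)
# until the utilities stop changing. M is the transition model under the fixed policy.

def adp_utilities(M, R, tol=1e-6, max_iters=1000):
    U = dict(R)  # initialise U(i) to R(i), as on the slide
    for _ in range(max_iters):
        U_new = {i: R[i] + sum(p * U[j] for j, p in M.get(i, {}).items())
                 for i in U}
        if max(abs(U_new[i] - U[i]) for i in U) < tol:
            return U_new
        U = U_new
    return U

if __name__ == "__main__":
    # Hypothetical 3-state chain: s0 -> s1 -> goal; the goal state is terminal.
    M = {"s0": {"s1": 1.0}, "s1": {"goal": 1.0}}
    R = {"s0": 0.0, "s1": 0.0, "goal": 1.0}
    print(adp_utilities(M, R))
```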

Page 17:

Passive Learning in a Known Environment

ADP example: consider U(3,3).

U(3,3) = 1/3 × U(4,3) + 1/3 × U(2,3) + 1/3 × U(3,2)
       = 1/3 × 1.0 + 1/3 × 0.0886 + 1/3 × (-0.4430)
       = 0.2152

Page 18:

Passive Learning in a Known Environment

ADP makes optimal use of the local constraints on state utilities imposed by the neighborhood structure of the environment, but it is somewhat intractable for large state spaces.

Page 19:

Passive Learning in a Known Environment

TD (Temporal Difference Learning)

The key is to use the observed transitions to adjust the values of the observed states so that they agree with the constraint equations

Page 20:

Passive Learning in a Known Environment

TD Learning: suppose we observe a transition from state i to state j, with U(i) = -0.5 and U(j) = +0.5.

This suggests that we should increase U(i) to make it agree better with its successor. This can be achieved using the following update rule:

U_{n+1}(i) = U_n(i) + α (R(i) + U_n(j) − U_n(i))

where α is the learning rate.
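
A minimal sketch of this TD update applied to a stream of observed (state, reward, successor) transitions; the learning rate, the toy episode, and the assumption that the terminal utility is already known are illustrative choices, not from the slides.

```python
# TD(0) utility learning: U(i) <- U(i) + alpha * (R(i) + U(j) - U(i))
from collections import defaultdict

def td_update(U, i, r_i, j, alpha=0.1):
    U[i] += alpha * (r_i + U[j] - U[i])

if __name__ == "__main__":
    U = defaultdict(float)          # utilities default to 0
    # Observed transitions: (state i, reward R(i), successor state j).
    transitions = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 0.0, "GOAL")]
    U["GOAL"] = 1.0                 # terminal utility, assumed known here
    for _ in range(200):            # replay the same episode many times
        for i, r, j in transitions:
            td_update(U, i, r, j)
    print({s: round(u, 3) for s, u in U.items()})
```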

Page 21:

Passive Learning in a Known Environment

TD learning performance: the estimates run "noisier" than LMS, but with smaller error. TD only deals with the states observed during sample runs (not all states, unlike ADP).

Page 22:

Passive Learning in an Unknown Environment

The LMS approach and the TD approach operate unchanged in an initially unknown environment.

The ADP approach adds a step that updates an estimated model of the environment.

Page 23:

Passive Learning in an Unknown Environment

ADP Approach

The environment model is learned by direct observation of transitions

The environment model M can be updated by keeping track of the percentage of times each state transitions to each of its neighbours
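
A small sketch of that bookkeeping: transition counts are accumulated per state, and M_ij is read off as the observed fraction. The class and method names are illustrative assumptions.

```python
# Maintain M[i][j] as the fraction of observed transitions out of i that went to j.
from collections import defaultdict

class TransitionModel:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, i, j):
        self.counts[i][j] += 1

    def probability(self, i, j):
        total = sum(self.counts[i].values())
        return self.counts[i][j] / total if total else 0.0

if __name__ == "__main__":
    m = TransitionModel()
    for j in ["B", "B", "C"]:       # three observed transitions out of state "A"
        m.observe("A", j)
    print(m.probability("A", "B"))  # 0.666...
    print(m.probability("A", "C"))  # 0.333...
```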

Page 24:

Passive Learning in an Unknown Environment

ADP & TD Approaches

The ADP approach and the TD approach are closely related

Both try to make local adjustments to the utility estimates in order to make each state “agree” with its successors

Page 25:

Passive Learning in an Unknown Environment

Minor differences:
TD adjusts a state to agree with its observed successor.
ADP adjusts the state to agree with all of the successors.

Important differences:
TD makes a single adjustment per observed transition.
ADP makes as many adjustments as it needs to restore consistency between the utility estimates U and the environment model M.

Page 26:

Passive Learning in an Unknown Environment

To make ADP more efficient:
Directly approximate the value iteration or policy iteration algorithm.
The prioritized-sweeping heuristic makes adjustments to states whose likely successors have just undergone a large adjustment in their own utility estimates (a sketch follows below).

Advantages of approximate ADP:
Efficient in terms of computation.
Eliminates the long value iterations that occur in the early stages of learning.
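
A rough sketch of the prioritized-sweeping idea, reusing the same undiscounted utility equation as above: after one state's utility changes, its predecessors are queued for re-evaluation, most-affected first. The priority rule and the tiny demo model are assumptions for illustration, not the exact heuristic from the literature.

```python
# Prioritized sweeping (sketch): after a utility changes a lot, re-evaluate the
# states that can lead to it, most "affected" states first.
import heapq
from collections import defaultdict

def prioritized_sweep(M, R, U, start, theta=1e-4, max_updates=1000):
    """M[i][j]: transition probabilities; R[i]: rewards; U: current utility estimates.
    'start' is the state whose estimate just changed."""
    preds = defaultdict(set)                     # predecessors of each state
    for i, succs in M.items():
        for j in succs:
            preds[j].add(i)
    pq = [(-float("inf"), start)]                # process 'start' first
    for _ in range(max_updates):
        if not pq:
            break
        _, i = heapq.heappop(pq)
        new_u = R[i] + sum(p * U[j] for j, p in M.get(i, {}).items())
        delta = abs(new_u - U[i])
        U[i] = new_u
        if delta > theta:                        # predecessors are likely affected too
            for p_state in preds[i]:
                heapq.heappush(pq, (-delta * M[p_state][i], p_state))
    return U

if __name__ == "__main__":
    M = {"A": {"B": 1.0}, "B": {"GOAL": 1.0}, "GOAL": {}}
    R = {"A": 0.0, "B": 0.0, "GOAL": 1.0}
    U = {"A": 0.0, "B": 0.0, "GOAL": 1.0}
    print(prioritized_sweep(M, R, U, start="B"))
```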

Page 27:

Active Learning in an Unknown Environment

An active agent must consider:

what actions to take
what their outcomes may be
how they will affect the rewards received

Page 28:

Active Learning in an Unknown Environment

Minor changes to the passive learning agent:

The environment model now incorporates the probabilities of transitions to other states given a particular action.
The agent must maximize its expected utility, and it needs a performance element to choose an action at each step.

Page 29:

Active Learning in an Unknown Environment

Active ADP Approach

The agent needs to learn the transition probabilities M^a_ij (the probability of reaching state j from state i when taking action a) instead of M_ij.

The input to the function will include the action taken.
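
A hedged sketch of the active ADP idea with an action-indexed model M[i][a][j]: the utility update now takes a max over actions. The two-action toy model in the demo is an assumption, not the grid world from the slides.

```python
# Active ADP sketch: U(i) = R(i) + max_a sum_j M[i][a][j] * U(j)

def active_adp(M, R, tol=1e-6, max_iters=1000):
    U = dict(R)
    for _ in range(max_iters):
        U_new = {}
        for i in U:
            actions = M.get(i, {})
            if actions:
                best = max(sum(p * U[j] for j, p in outcomes.items())
                           for outcomes in actions.values())
            else:
                best = 0.0                       # terminal state: no actions
            U_new[i] = R[i] + best
        if max(abs(U_new[i] - U[i]) for i in U) < tol:
            return U_new
        U = U_new
    return U

if __name__ == "__main__":
    # Hypothetical model: in "s", action "safe" reaches "win" with prob 0.5 (else stays),
    # action "risky" reaches "win" with prob 0.8 but "lose" with prob 0.2.
    M = {"s": {"safe": {"win": 0.5, "s": 0.5},
               "risky": {"win": 0.8, "lose": 0.2}}}
    R = {"s": 0.0, "win": 1.0, "lose": -1.0}
    print(active_adp(M, R))
```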

Page 30:

Active Learning in an Unknown Environment

Active TD Approach

The model acquisition problem for the TD agent is identical to that for the ADP agent.

The update rule remains unchanged.

The TD algorithm will converge to the same values as ADP as the number of training sequences tends to infinity.

Page 31:

Exploration

Learning also involves the exploration of unknown areas

It is an attempt to learn from self-play.

Page 32:

Exploration

An agent can benefit from its actions in two ways: the immediate rewards it receives, and the percepts it receives.

Page 33:

Exploration

Wacky Approach vs. Greedy Approach

The "wacky" approach acts randomly, in the hope that it will eventually explore the entire environment.

The "greedy" approach acts to maximize its utility using current estimates.

Page 34:

Exploration

The Exploration Function

A simple example:

u = expected utility (greed)

n = number of times actions have been tried (wacky)

R+ = best reward possible
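
The exploration function itself is not reproduced on this slide; a common simple choice (the optimistic one described in Russell and Norvig) returns R+ while an action has been tried fewer than N_e times, and the utility estimate u afterwards. The sketch below assumes that form; the threshold N_e and the numeric values are illustrative.

```python
# Optimistic exploration function: be "wacky" until an action has been tried
# N_e times, then be "greedy" and trust the utility estimate u.
R_PLUS = 2.0   # best reward possible (optimistic estimate)
N_E = 5        # assumed exploration threshold (not stated on the slide)

def exploration_fn(u, n, r_plus=R_PLUS, n_e=N_E):
    """u: current expected utility (greed); n: number of times tried (wacky)."""
    return r_plus if n < n_e else u

if __name__ == "__main__":
    print(exploration_fn(u=0.3, n=2))   # still under-explored -> optimistic 2.0
    print(exploration_fn(u=0.3, n=10))  # well explored -> trust the estimate 0.3
```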

Page 35:

Learning An Action Value-Function

Q-Values?

An action-value function assigns an expected utility to taking a given action in a given state; these values are called Q-values.

Page 36:

Learning An Action Value-Function

The Q-Values Formula

U(i) = max_a Q(a, i)

Page 37:

Learning An Action Value-Function

The Q-Values Formula Application

Q(a, i) = R(i) + Σ_j M^a_ij · max_a' Q(a', j)

This is just an adaptation of the active learning equation.

Page 38:

Learning An Action Value-Function

The TD Q-Learning Update Equation

Q(a, i) ← Q(a, i) + α (R(i) + max_a' Q(a', j) − Q(a, i))

- requires no model
- calculated after each transition from state i to j

Thus, Q-values can be learned directly from reward feedback.
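
A minimal model-free Q-learning sketch using this update rule; the action set, learning rate, toy episode, and the assumed terminal Q-values are illustrative, not from the slides.

```python
# TD Q-learning: Q(a,i) <- Q(a,i) + alpha * (R(i) + max_a' Q(a',j) - Q(a,i))
# Model-free: only observed (state, action, reward, next_state) tuples are used.
from collections import defaultdict

Q = defaultdict(float)          # Q[(action, state)] defaults to 0
ACTIONS = ["left", "right"]

def best_q(state):
    return max(Q[(a, state)] for a in ACTIONS)

def q_update(state, action, reward, next_state, alpha=0.1):
    target = reward + best_q(next_state)
    Q[(action, state)] += alpha * (target - Q[(action, state)])

if __name__ == "__main__":
    # Replay one hypothetical episode many times: A --right--> B --right--> GOAL.
    episode = [("A", "right", 0.0, "B"), ("B", "right", 0.0, "GOAL")]
    Q[("left", "GOAL")] = Q[("right", "GOAL")] = 1.0   # assumed terminal values
    for _ in range(300):
        for step in episode:
            q_update(*step)
    print(round(Q[("right", "A")], 3), round(Q[("right", "B")], 3))
```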

Page 39:

Generalization In Reinforcement Learning

Explicit Representation

So far we have assumed that all the functions learned by the agent (U, M, R, Q) are represented in tabular form.

An explicit representation involves one output value for each input tuple.

Page 40:

Generalization In Reinforcement Learning

Explicit Representation

Good for small state spaces, but the time to convergence and the time per iteration increase rapidly as the space gets larger.

It may be possible to handle 10,000 states or more.

This suffices for 2-dimensional, maze-like environments.

Page 41:

Generalization In Reinforcement Learning

Explicit Representation

Problem: more realistic worlds are out of the question.

E.g., chess and backgammon are tiny subsets of the real world, yet their state spaces contain on the order of 10^15 to 10^120 states. So it would be absurd to suppose that one must visit all these states in order to learn how to play the game.

Page 42:

Generalization In Reinforcement Learning

Implicit Representation

Overcomes the problem with explicit representation: a form that allows one to calculate the output for any input, but that is much more compact than the tabular form.

Page 43:

Generalization In Reinforcement Learning

Implicit Representation

For example, an estimated utility function for game playing can be represented as a weighted linear function of a set of board features f1, …, fn:

U(i) = w1·f1(i) + w2·f2(i) + … + wn·fn(i)
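
A brief sketch of this weighted linear form, together with a Widrow-Hoff-style (LMS gradient) weight update that nudges the weights toward an observed utility for a visited state. The feature values, target, and learning rate are illustrative assumptions.

```python
# Implicit representation: U(i) = w1*f1(i) + ... + wn*fn(i), with the weights
# nudged toward an observed/target utility for the visited state.

def predict_utility(weights, features):
    return sum(w * f for w, f in zip(weights, features))

def lms_weight_update(weights, features, target, alpha=0.05):
    """Widrow-Hoff style: move each weight along its feature toward the target."""
    error = target - predict_utility(weights, features)
    return [w + alpha * error * f for w, f in zip(weights, features)]

if __name__ == "__main__":
    weights = [0.0, 0.0, 0.0]
    # Hypothetical board features for one position, and its observed final payoff.
    features = [1.0, 0.5, -2.0]
    for _ in range(100):
        weights = lms_weight_update(weights, features, target=0.8)
    print([round(w, 3) for w in weights], round(predict_utility(weights, features), 3))
```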

Page 44:

Generalization In Reinforcement Learning

Implicit Representation

The enormous compression achieved by an implicit representation allows the learning agent to generalize from states it has visited to states it has not visited.

The most important aspect: it allows for inductive generalization over input states.

Therefore, such methods are said to perform input generalization.

Page 45:

Generalization In Reinforcement Learning

Input Generalisation

The cart-pole problem: balancing a long pole upright on top of a moving cart.

Page 46:

Generalization In Reinforcement Learning

Input Generalisation

The cart can be jerked left or right by a controller that observes x, x′, θ, and θ′ (the cart position and velocity, and the pole angle and angular velocity).

The earliest work on learning for this problem was carried out by Michie and Chambers (1968).

Their BOXES algorithm was able to balance the pole for over an hour after only about 30 trials.

Page 47:

Generalization In Reinforcement Learning

Input Generalisation

The algorithm first discretized the 4-dimensional state space into boxes, hence the name.

It then ran trials until the pole fell over or the cart hit the end of the track.

Negative reinforcement was associated with the final action in the final box and then propagated back through the sequence.
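
To illustrate the "boxes" idea, here is a small sketch that maps the four continuous cart-pole variables (x, x′, θ, θ′) to a discrete box index using fixed bin edges. The particular bin edges are assumptions, not the partition Michie and Chambers used.

```python
# BOXES-style input generalisation: map the continuous 4-dimensional cart-pole
# state (x, x_dot, theta, theta_dot) to a discrete "box" index.
import bisect

# Assumed bin edges for each variable (illustrative, not the original BOXES partition).
BIN_EDGES = {
    "x":         [-1.0, 0.0, 1.0],
    "x_dot":     [-0.5, 0.5],
    "theta":     [-0.1, 0.0, 0.1],
    "theta_dot": [-0.5, 0.5],
}

def box_index(x, x_dot, theta, theta_dot):
    """Return a tuple of bin numbers; each distinct tuple is one box."""
    values = {"x": x, "x_dot": x_dot, "theta": theta, "theta_dot": theta_dot}
    return tuple(bisect.bisect(BIN_EDGES[name], values[name]) for name in BIN_EDGES)

if __name__ == "__main__":
    print(box_index(0.2, -0.7, 0.05, 0.0))   # e.g. (2, 0, 2, 1)
```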

Page 48:

Generalization In Reinforcement Learning

Input Generalisation

The discretization caused some problems when the apparatus was initialized in a different position.

Improvement: use an algorithm that adaptively partitions the state space according to the observed variation in the reward.

Page 49:

Conclusion

Passive Learning in a Known Environment

Passive Learning in an Unknown Environment

Active Learning in an Unknown Environment

Exploration

Learning an Action Value Function

Generalization in Reinforcement Learning

Page 50:

References

http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/kaelbling96a-html/rl-survey.html

http://www-anw.cs.umass.edu/rlr/

Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Prentice Hall.