MOUNTAIN CAR PROBLEM USING TEMPORAL DIFFERENCE (TD) & VALUE ITERATION (VI) REINFORCEMENT LEARNING ALGORITHMS

By Muzammil Abdulrahman & Yusuf Garba Dambatta
Mevlana University, Konya, Turkey, 2013


Page 1

MOUNTAIN CAR PROBLEM

USING TEMPORAL DIFFERENCE (TD) & VALUE ITERATION (VI)

REINFORCEMENT LEARNING ALGORITHMS

By

Muzammil Abdulrahman

&

Yusuf Garba Dambatta

Mevlana University, Konya, Turkey
2013

Page 2

INTRODUCTION

The aim of the mountain car problem is for the car to learn from two continuous variables:

• position and
• velocity

so that it can reach the top of the mountain in a minimum number of steps.

Starting the car from rest, its engine power alone will not be enough to bring the car over the hill in front.


Page 3

INTRODUCTION CONT.

To climb up the hill, the car would need to swing back and forth inside the valley.

Page 4

INTRODUCTION CONT.

It does this by accelerating forward and backward to gather momentum.

The agent receives a negative reward at every time step when the goal is not reached.

The agent has no information about the goal until an initial success, so it uses reinforcement learning methods.

In this project, we employed the TD Q-learning and value iteration algorithms.

Page 5

REINFORCEMENT LEARNING

Reinforcement learning is a learning approach in the field of machine learning that is orthogonal to supervised and unsupervised learning.

In it, only an estimate of the correctness of the answer is provided to the system.

It deals with how an agent should take actions in an environment so as to maximize a cumulative reward.

It is learning from interaction, and it is goal-oriented learning.

Page 6

CHARACTERISTICS

No direct training examples – (delayed) rewards instead

Goal-oriented learning

Learning about, from, and while interacting with an external environment

Need for exploration of the environment & exploitation

The environment might be stochastic and/or unknown

The learning actions of the agent affect future rewards

Page 7

EXAMPLES

Robot moving in an environment

Page 8

EXAMPLES

Chess Master


Page 9

UNSUPERVISED LEARNING

Training Info = Evaluation (rewards/penalties)

Input → RL System → Output (actions)

Objective: Get as much reward as possible

Page 10

SUPERVISED LEARNING

Training Info = desired (target) outputs

Input → Supervised Learning System → Output

Training example = {input (state), target output}

Error = (target output – actual output)

Page 11

TEMPORAL DIFFERENCE (TD)

Temporal difference (TD) learning is a prediction method.

It has been mostly used for solving the reinforcement learning problem.

TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas.


Page 12

TD Q-LEARNING

The update of TD Q-learning looks as follows:

Q(a, s) ← Q(a, s) + α [ R(s) + γ max_a′ Q(a′, s′) − Q(a, s) ]
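
A minimal sketch of this update in Python, assuming a tabular Q stored as a (states × actions) array indexed by a discretized state; the learning rate α and discount γ used here are illustrative values, not numbers quoted in the slides.

```python
import numpy as np

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.99):
    """One TD Q-learning backup:
    Q(a, s) <- Q(a, s) + alpha * [R(s) + gamma * max_a' Q(a', s') - Q(a, s)]."""
    td_target = reward + gamma * np.max(Q[s_next])  # R(s) + gamma * best value in the next state
    td_error = td_target - Q[s, a]                  # how far the current estimate is off
    Q[s, a] += alpha * td_error                     # move the estimate toward the target
    return td_error
```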


Page 13

TD Q-LEARNING ALGORITHM

Initialize Q values for all states ‘s’ and actions ‘a’
Obtain the current state
Select an action according to the current state
Implement the selected action and obtain an immediate reward and the next state
Update the Q function according to the above equation
Update the system state
Stop the algorithm if the maximum number of iterations is reached
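
These steps can be strung together roughly as follows. The sketch reuses the q_update function above and assumes epsilon_greedy, mountain_car_step, and discretize helpers like the ones sketched on the following slides; the number of episodes, step limit, grid size, and starting position are illustrative assumptions rather than values from the slides.

```python
import numpy as np

def train_td_q(n_episodes=1000, max_steps=2000, epsilon=0.01, alpha=0.1, gamma=0.99):
    n_pos, n_vel, n_actions = 20, 20, 3              # assumed discretization of position and velocity
    Q = np.zeros((n_pos * n_vel, n_actions))         # initialize Q values for all states and actions
    actions = [-1, 0, +1]                            # backward, neutral, forward
    for episode in range(n_episodes):
        position, velocity = -0.5, 0.0               # start the car from rest (assumed start point)
        s = discretize(position, velocity, n_pos, n_vel)      # obtain the current state
        for _ in range(max_steps):
            a = epsilon_greedy(Q, s, epsilon)        # select an action for the current state
            position, velocity, reward, done = mountain_car_step(position, velocity, actions[a])
            s_next = discretize(position, velocity, n_pos, n_vel)
            q_update(Q, s, a, reward, s_next, alpha, gamma)   # update Q with the rule above
            s = s_next                               # update the system state
            if done:                                 # goal reached before the step limit
                break
    return Q
```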

Page 14

ε-GREEDY SELECTION (Q, S, EPSILON)

The agent randomly selects an action from the Q table based on the ε-greedy strategy.

Initially, epsilon = 0.01, which is the probability of selecting a random action.

It will be approximately zero once the car agent has fully learned how to climb the front hill (no randomness, because it has learned the best action).
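
A minimal sketch of this selection rule, assuming the same tabular Q as above; epsilon = 0.01 follows the slide.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.01):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    n_actions = Q.shape[1]
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)  # explore: random action
    return int(np.argmax(Q[s]))              # exploit: best known action for state s
```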


Page 15

STATE, ACTION & REWARD

State: The state variables are position and speed. Position lies in the range -1.5 to 0.55 and speed in the range -0.07 to 0.07.

Action: The agent takes one of three actions at every time step: forward, backward, or neutral (forward acceleration = +1 m/s², backward deceleration = -1 m/s², neutral = 0 m/s²).

Reward: The agent receives a reward of -1 for all actions, except when it reaches the goal state, where it receives a reward of 0.
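
For illustration, here is one way the state ranges, actions, and reward above could be wired into an environment step and a state discretizer. The ranges and the reward follow this slide; the hill profile and force constants are not given in the slides, so the dynamics below simply follow the standard Sutton & Barto mountain-car update, and the 20×20 grid is an assumption.

```python
import numpy as np

POS_MIN, POS_MAX = -1.5, 0.55    # position range from the slide
VEL_MIN, VEL_MAX = -0.07, 0.07   # speed range from the slide
GOAL_POS = 0.55                  # assumed: the goal sits at the right edge of the range

def mountain_car_step(position, velocity, action):
    """action is -1 (backward), 0 (neutral) or +1 (forward)."""
    velocity += 0.001 * action - 0.0025 * np.cos(3 * position)  # assumed standard dynamics
    velocity = float(np.clip(velocity, VEL_MIN, VEL_MAX))
    position += velocity
    position = float(np.clip(position, POS_MIN, POS_MAX))
    if position <= POS_MIN:          # the car hits the left wall and stops
        velocity = 0.0
    done = position >= GOAL_POS
    reward = 0.0 if done else -1.0   # -1 every step, 0 once the goal is reached
    return position, velocity, reward, done

def discretize(position, velocity, n_pos=20, n_vel=20):
    """Map the two continuous variables onto one index of an assumed n_pos x n_vel grid."""
    p = int((position - POS_MIN) / (POS_MAX - POS_MIN) * (n_pos - 1))
    v = int((velocity - VEL_MIN) / (VEL_MAX - VEL_MIN) * (n_vel - 1))
    return p * n_vel + v
```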

Page 16

VALUE ITERATION

The value iteration algorithm is also called backward induction.

It combines policy improvement and a truncated policy evaluation into a single update step:

V(s) ← R(s) + γ max_a ∑_s′ T(s, a, s′) V(s′)


Page 17

VALUE ITERATION ALGORITHM

Inputs: (S, A, T, R, γ), ε: threshold value

Initialize V₀ for every state ‘s’

For each state, compute the next approximation using the Bellman backup equation:

V(s) ← R(s) + γ max_a ∑_s′ T(s, a, s′) V(s′)
δ ← V(s′) − V(s)
Until δ < ε
Return V
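
A minimal sketch of this loop in Python, assuming the problem has already been discretized into a transition array T of shape (states, actions, states) and a reward vector R over states; the discount γ is illustrative, and ε = 0.0001 matches the threshold quoted later in the slides.

```python
import numpy as np

def value_iteration(T, R, gamma=0.99, eps=1e-4):
    """Apply Bellman backups until the change in V falls below the threshold eps."""
    V = np.zeros(R.shape[0])                  # initialize V0 for every state
    while True:
        # V(s) <- R(s) + gamma * max_a sum_s' T(s, a, s') * V(s')
        V_new = R + gamma * np.max(T @ V, axis=1)
        delta = np.max(np.abs(V_new - V))     # largest change in the value estimate
        V = V_new
        if delta < eps:                       # until delta < epsilon
            break
    policy = np.argmax(T @ V, axis=1)         # greedy (optimal) action in each state
    return V, policy
```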


Page 18

GRAPHICAL RESULTS

The graph shows the relation between the RMS value (also called policy loss) and the number of episodes.

The RMS value is the error between the current Q values and the previous Q values.

With some probability, the agent chooses an action randomly. If the chosen action happens to be bad, it causes an instant rise in the error.

At convergence, the error is approximately zero. In our case, convergence is reached when 3 or more successive RMS values equal 0.001 or less.
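
As an illustration, the per-episode RMS value and the stopping rule described here could be computed as follows; the 0.001 threshold and the window of 3 successive values follow the slides, while the function names themselves are placeholders.

```python
import numpy as np

def rms_error(Q_new, Q_old):
    """Root-mean-square difference between the current and previous Q tables."""
    return float(np.sqrt(np.mean((Q_new - Q_old) ** 2)))

def has_converged(rms_history, threshold=0.001, needed=3):
    """Converged once 3 or more successive RMS values are at or below the threshold."""
    if len(rms_history) < needed:
        return False
    return all(r <= threshold for r in rms_history[-needed:])
```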

Page 19

The car on the mountain will be displayed at the 11th iteration to visualize how the car agent learns.


Page 20

GRAPH


Page 21

CONT.

The figure shows the graph of Total Reward vs. Episode at the 1000th episode.

Page 22

RESULT CONT.

The car on the mountain will be displayed at the 11th iteration to visualize how the car agent learns.

After the 11th iteration, the display is stopped to reduce the time it takes to converge.

After 3 or more successive RMS values equal 0.001 or less, the car will be displayed again to show that it has fully learned how to reach the goal state in any episode while maintaining a constant number of steps.


Page 23

VI RESULTS

The graph below shows the convergence error over iterations.


Page 24

VI CONT.


Figure 6 shows the graph of optimal positions and velocities over time on top, while the bottom panel displays the car learning on the mountain.

Page 25

VI CONT.

The first episode records the highest error.

This is because the error is the difference between the current value function and the previous value function, i.e. Error = V(s′) − V(s).

But initially the previous value function is 0, hence Error = V(s′).


Page 26

VI CONT.

In subsequent episodes, the error keeps decreasing as the updated value functions increase.

At convergence, the error (with value 0) is less than the threshold value (ε = 0.0001), which is the termination criterion for this project.

Finally, the optimal policy is returned.


Page 27

VI CONT.

The graphs below show the optimal positions and velocities over time.

The first graph is that of the optimal positions over time.

It simply shows the optimal positions attained by the car as it attempts to reach the goal state at different times.


Page 28

CONT.

The second graph shows the optimal velocities attained by the car as it attempts to reach the goal state at different times.

The car initially accelerates from its rest position to reach a position of -0.2; it then swings back to gather enough momentum, reaching a position of -0.95, and finally accelerates forward again and reaches the goal state.


Page 29

CONCLUSION

In this project, the temporal difference and value iteration learning algorithms were implemented for the mountain car problem. Both algorithms were guaranteed to converge, determining the optimal policy for reaching the goal state.
