
  • AASS Örebro University- PhD Seminar 1

    PhD SEMINAR

    REINFORCEMENT LEARNING

    MACHINE LEARNING COURSE: AASS, ÖREBRO UNIV

    Presented by:

    Muhammad Rehan Ahmed, PhD Student

    Intelligent Control Lab

  • AASS Örebro University- PhD Seminar 2

    AGENDA

    - Simple Learning Taxonomy

    - Reinforcement Learning: basic terminology such as policy, goal, rewards, tasks, etc.

    - Value Functions, and how to estimate them

    - General RL Algorithm

    - Reinforcement Comparison

    - Actor-Critic Methods

    - Research Paper

  • AASS Örebro University- PhD Seminar 3

    Simple Learning Taxonomy

    - Supervised Learning:
      - The learner has access to a teacher who corrects it.
      - The teacher provides the required response to inputs.
      - Desired behavior is known; the correct target output is known for each input pattern.
      - Sometimes costly (expert examples are expensive and scarce).

    - Unsupervised Learning:
      - No access to a teacher.
      - The learner must search for order/patterns in the environment.
      - No target output, no "right answer".
      - Learning proceeds by repeatedly searching for patterns in the inputs.

  • AASS Örebro University- PhD Seminar 4

    Continued...... Reinforcement Learning: The Brief Concept

    - The learner is not told which actions to take, but receives reward/punishment from the environment and learns/adjusts which action to pick next time by trial-and-error search.

    - Learning from interaction (agent-environment interface):
      - with the environment,
      - to achieve some goal,
      - experience is cheap and plentiful.

    - Examples: a baby playing, learning to drive a car, holding a conversation, etc.

    1. The environment's response affects our subsequent actions.
    2. The agent finds out the effect of its actions later.

  • AASS Örebro University- PhD Seminar 5

    RL vs. Supervised Learning: Evaluation vs. Instruction

    - RL:
      - Training information evaluates the action taken.
      - It does not say whether that action was the best or correct relative to all other actions.
      - The agent must try all actions and compare them to see which is best.
      - Procedure:
        - Trial-and-error search for actions; all actions must be tried.
        - Reward is a scalar (other actions could have been better or worse).
        - Learning by selection: selectively choose those actions that prove to be better.

    - Supervised Learning:
      - Training instructs.
      - It gives the correct answer regardless of the action chosen.
      - No search in the action space.

  • AASS Örebro University- PhD Seminar 6

    Agent-Environment Interface

    Agent and environment interact at discrete time steps $t = 0, 1, 2, \ldots$

    - Agent observes state at step $t$: $s_t \in S$
    - produces action at step $t$: $a_t \in A(s_t)$
    - gets resulting immediate reward: $r_{t+1} \in \mathbb{R}$
    - and resulting next state: $s_{t+1}$

    $\ldots \; s_t \xrightarrow{a_t} r_{t+1}, s_{t+1} \xrightarrow{a_{t+1}} r_{t+2}, s_{t+2} \xrightarrow{a_{t+2}} r_{t+3}, s_{t+3} \xrightarrow{a_{t+3}} \ldots$

    (A short code sketch of this interaction loop follows below.)
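    To make the interface concrete, here is a minimal Python sketch of the agent-environment loop. The `SimpleEnv` and `RandomAgent` classes are hypothetical stand-ins chosen for illustration, not anything from the slides.

```python
# Minimal agent-environment interaction loop (illustrative sketch only).
import random

class SimpleEnv:
    """Toy 5-state corridor: move left/right, reward +1 at the right end."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):                       # a: 0 = left, 1 = right
        self.s = max(0, min(4, self.s + (1 if a == 1 else -1)))
        reward = 1.0 if self.s == 4 else 0.0
        done = self.s == 4
        return self.s, reward, done

class RandomAgent:
    def select_action(self, s):
        return random.choice([0, 1])
    def update(self, s, a, r, s_next):
        pass                                 # a learning agent would update its estimates here

env, agent = SimpleEnv(), RandomAgent()
s = env.reset()
for t in range(100):                         # discrete time steps t = 0, 1, 2, ...
    a = agent.select_action(s)               # agent produces action a_t
    s_next, r, done = env.step(a)            # environment returns r_{t+1}, s_{t+1}
    agent.update(s, a, r, s_next)
    s = s_next
    if done:
        s = env.reset()
```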

  • AASS Örebro University- PhD Seminar 7

    Reinforcement Learning

    - Objective of RL:
      - Learn a policy (a mapping from states to actions) that maximizes a scalar reward / reinforcement signal.

    - How to learn?
      1. Trial-and-error search: try out actions to learn which ones produce the highest rewards.
      2. Delayed effects, delayed rewards: actions affect the immediate reward, the next state, and all subsequent rewards.

    - Sequence: sense states → choose actions → achieve goals.

  • AASS Örebro University- PhD Seminar 8

    Key Features of RL

    - Trial-and-error search

    - The environment is stochastic (uncertain)

    - Reward may be delayed, so the agent may need to sacrifice short-term gains for greater long-term gains

    - The agent has to balance the need to explore its environment against the need to exploit its current knowledge:

    "Exploration-Exploitation Tradeoff"

  • AASS Örebro University- PhD Seminar 9

    Exploration VS Exploitation

    - The learner actively interacts with the environment:
      - At the beginning the learner does not know anything about the environment.
      - It gradually gains experience and learns how to react to the environment.

    - Dilemma: after some number of steps, should the learner select the best current choice (exploitation) or try to learn more about the environment (exploration)?

    - Exploitation may involve selecting a sub-optimal action and prevent the learning of the optimal choice.

    - Exploration may spend too much time trying actions that are currently bad or suboptimal.

    - Must do both.

  • AASS Örebro University- PhD Seminar 10

    RL Framework

  • AASS Örebro University- PhD Seminar 11

    Policy

    - Telling the agent what to do is its POLICY.

    - POLICY:
      - A mapping from states to action probabilities.
      - A way of behaving (a strategy).
      - The probability of taking action $a$ in state $s$.
      - Given the state $s$ at time $t$, the policy gives the probability that the agent's action will be $a$.

    Policy at step $t$, $\pi_t$:

    $\pi_t(s, a) = \Pr\{a_t = a \mid s_t = s\}$

    Reinforcement learning: to learn the POLICY.

  • AASS Örebro University- PhD Seminar 12

    Agent Learns a Policy

    - The agent detects a state, chooses an action, and gets a reward.

    - The agent's aim is to learn a POLICY that maximizes the reward.

    - Maximize the reward over the long term, not necessarily the immediate reward.

    - For example: watch TV now and panic over homework later, vs. do the homework now and watch TV while all your pals are panicking.

    Reinforcement learning methods specify how the agent changes its policy as a result of experience.

  • AASS Örebro University- PhD Seminar 13

    Goal, Reward & Return

    - Goal: maximize the total reward received.

    - Immediate reward $r$ at each step.

    - The agent must maximize the expected cumulative reward.

    - Suppose the sequence of rewards after step $t$ is $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$

    - Then the return is the total reward $R_t$.

    "Maximize the expected return, $E\{R_t\}$, for each step $t$."

  • AASS Örebro University- PhD Seminar 14

    Task Types

    - Episodic Tasks:
      - Interaction breaks naturally into episodes, e.g., playing a game or a trip through a maze.
      - $R_t = r_{t+1} + r_{t+2} + r_{t+3} + \ldots + r_T$,
        where $T$ is the final step at which the terminal state is reached, ending an episode/trial.

    - Continuing Tasks:
      - Interaction does not break naturally into episodes, i.e. $T = \infty$.
      - Discounted return:

        $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$,

        where $0 \le \gamma \le 1$ is the discount rate (the closer $\gamma$ is to 1, the more future rewards count).

    shortsighted  $0 \leftarrow \gamma \rightarrow 1$  farsighted

    (A small helper for computing this return is sketched below.)
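    As a quick illustration of the discounted return, here is a small Python helper (a sketch added for this transcript, not from the slides) that computes $R_t$ for a finite list of rewards:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence.

    `rewards` is the list [r_{t+1}, r_{t+2}, ...]; gamma = 1 corresponds to the
    undiscounted episodic return, gamma < 1 to the discounted continuing case.
    """
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: three steps of reward 1 with gamma = 0.5 -> 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```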

  • AASS Örebro University- PhD Seminar 15

    Example 1

    Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.

    - As an episodic task, where an episode ends upon failure:
      reward = +1 for each step before failure
      ⇒ return = number of steps before failure

    - As a continuing task with discounted return:
      reward = −1 upon failure; 0 otherwise
      ⇒ return = $-\gamma^k$, for $k$ steps before failure

    In either case, the return is maximized by avoiding failure for as long as possible.

  • AASS Örebro University- PhD Seminar 16

    Combined Notation

    - In episodic tasks, we number the time steps of each episode starting from zero.

    - We can cover all cases by writing

      $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$,

      where $\gamma$ can be 1 only if a zero-reward absorbing state is always reached.

  • AASS Örebro University- PhD Seminar 17

    Value Functions

    - Value of a state

    - Value of a state-action pair (value of an action)

  • AASS Örebro University- PhD Seminar 18

    1: Value of State

    - The expected return, which estimates how good it is for the agent to be in a given state.

    - "How good" is defined in terms of future rewards.

    - The value of a state is the expected return starting from that state; it depends on the agent's policy.

    State-value function for policy $\pi$:

    $V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \} = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s \Big\}$

    Written without the expectation operator (the Bellman equation for $V^{\pi}$):

    $V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma V^{\pi}(s') \big]$

    (A sketch of computing $V^{\pi}$ from this equation follows below.)
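    When the transition probabilities $P^{a}_{ss'}$ and expected rewards $R^{a}_{ss'}$ are known, $V^{\pi}$ can be computed by sweeping the Bellman equation until it stops changing. Below is a small Python sketch of iterative policy evaluation; the dictionaries `P`, `R`, and `policy` are assumed input structures chosen for this illustration, not anything defined in the slides.

```python
def evaluate_policy(states, actions, P, R, policy, gamma=0.9, tol=1e-6):
    """Iterative policy evaluation via repeated Bellman backups.

    P[(s, a)]      : dict mapping next state s2 -> probability P^a_{ss'}
    R[(s, a, s2)]  : expected reward R^a_{ss'}
    policy[(s, a)] : probability pi(s, a) of taking a in s
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                policy[(s, a)]
                * sum(p * (R[(s, a, s2)] + gamma * V[s2])
                      for s2, p in P[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```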

  • AASS Örebro University- PhD Seminar 19

    2: Value of State-Action Pair

    - The value of a state-action pair under a policy estimates how good it is to perform a given action in a given state.
    - "How good" is defined in terms of future rewards (the expected return).

    - The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$.

    Action-value function for policy $\pi$:

    $Q^{\pi}(s,a) = E_{\pi}\{ R_t \mid s_t = s, a_t = a \} = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a \Big\}$

    Optimal action-value function:

    $Q^{*}(s,a) = E\big\{ r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \mid s_t = s, a_t = a \big\} = \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s', a') \big]$

  • AASS Örebro University- PhD Seminar 20

    Value of Action: Q

    - Q: the value of an action.
      - The expected reward from that action.
      - We have only estimates of Q.
      - We build these estimates from experience of rewards.

    How to estimate Q?

    - Given value estimates for all the actions at a given state, how do we select the action? Two choices:

    1. Greedy actions (highest estimated Q): EXPLOITATION.
    2. Other actions (lower estimated Q's): EXPLORATION.

    - The agent can't exploit all the time; it must sometimes explore to see if an action that currently looks bad eventually turns out to be good (in the long run).

    - What if we know the Q values exactly? Then we simply choose the action with the highest Q value, and we have found the optimal Q.

  • AASS Örebro University- PhD Seminar 21

    1: How to Estimate Q

    - $Q^{*}(a)$: the optimal (true) value of action $a$.
    - $Q_t(a)$: the estimated value of action $a$ at time $t$.

    - How to estimate Q? By a running mean.

    - Suppose we choose action $a$ a total of $k_a$ times and observe a reward $r_i$ on play $i$.
    - Then we can estimate $Q^{*}$ from the running mean:

      $Q_t(a) = \dfrac{r_1 + r_2 + r_3 + \ldots + r_{k_a}}{k_a}$

    - If $k_a = 0$ (the action has not yet been tried), $Q_t(a)$ is given a default value, e.g. 0.

    - As $k_a \rightarrow \infty$, $Q_t(a) \rightarrow Q^{*}(a)$.

    This is called the sample-average method for estimating Q (sketched in code below).
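    A minimal Python sketch of the sample-average method. The reward function `pull` is a made-up stand-in for a bandit arm, used only for illustration.

```python
import random

def pull(a):
    """Hypothetical bandit arm: noisy reward whose mean depends on the action index."""
    return a * 0.1 + random.gauss(0.0, 1.0)

n_actions = 10
rewards = {a: [] for a in range(n_actions)}   # observed rewards per action

for play in range(1000):
    a = random.randrange(n_actions)           # here: purely random exploration
    rewards[a].append(pull(a))

# Sample-average estimate: Q_t(a) = (r_1 + ... + r_ka) / ka, default 0 if untried
Q = {a: (sum(r) / len(r) if r else 0.0) for a, r in rewards.items()}
print(Q)
```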

  • AASS Örebro University- PhD Seminar 22

    Incremental Update Equation

    - Estimate $Q^{*}$ by a running mean (if we have tried action $a$ $k$ times):

      $Q_k(a) = \dfrac{r_1 + r_2 + r_3 + \ldots + r_k}{k}$

    - Incremental form:

      $Q_{k+1} = \dfrac{1}{k+1} \sum_{i=1}^{k+1} r_i = Q_k + \dfrac{1}{k+1}\big[ r_{k+1} - Q_k \big]$

    - New Estimate = Old Estimate + Step size × [Target − Old Estimate]
    - Step size ($\alpha$): can depend on $k$.
      - Often kept constant, e.g. $\alpha = 0.1$.
      - A constant step size gives more weight to recent rewards.

    (A short code sketch of this update follows below.)
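    The incremental form translates directly into code. The sketch below (a simple illustration, not from the slides) shows both the sample-average step size 1/(k+1) and a constant step size alpha:

```python
def incremental_update(q_old, reward, step_size):
    """New estimate = old estimate + step_size * (target - old estimate)."""
    return q_old + step_size * (reward - q_old)

# Sample-average version: step size 1/(k+1) reproduces the running mean exactly.
q, k = 0.0, 0
for reward in [1.0, 0.0, 1.0, 1.0]:
    q = incremental_update(q, reward, 1.0 / (k + 1))
    k += 1
print(q)   # 0.75, the mean of the four rewards

# Constant step size (e.g. alpha = 0.1) weights recent rewards more heavily.
q = 0.0
for reward in [1.0, 0.0, 1.0, 1.0]:
    q = incremental_update(q, reward, 0.1)
print(q)
```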

  • AASS Örebro University- PhD Seminar 23

    2: Action Selection

    - Greedy:
      - Select the action $a^{*}$ for which Q is highest.
      - $Q_t(a^{*}) = \max_a Q_t(a)$  (the action that maximizes the value estimate)
      - So $a^{*} = \arg\max_a Q_t(a)$  ("*" denotes "best")

    - Example: 10-armed bandit. Suppose at time $t$ we have estimated values $Q_t(a)$ for actions 1 to 10; the largest is $Q_t(a^{*}) = 0.4$, so $a^{*}$ is the 4th action (the "argument" of the maximum).

    - $\varepsilon$-greedy:
      - With probability $\varepsilon$ select a random action; otherwise select the greedy action.
      - This samples all actions infinitely many times, so as $k_a \rightarrow \infty$, the Q's converge to $Q^{*}$.

    (A small $\varepsilon$-greedy sketch follows below.)
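    A small Python sketch of epsilon-greedy selection over a table of estimates; `Q` here is just any dictionary of toy action values chosen for illustration.

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon pick a random action, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(list(Q))              # explore
    return max(Q, key=Q.get)                       # exploit: argmax_a Q(a)

Q = {1: 0.1, 2: 0.0, 3: 0.05, 4: 0.4, 5: 0.2}      # toy estimates
print(epsilon_greedy(Q, epsilon=0.1))
```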

  • AASS Örebro University- PhD Seminar 24

    General RL Algorithm

    - Initialize the agent's internal state (e.g., Q values and other statistics).
    - Do for a long time (loop):
      - Observe the current world state $s$.
      - Choose action $a$ using the policy (greedy or $\varepsilon$-greedy).
      - Execute action $a$.
      - Let $r$ be the immediate reward and $s'$ the new world state.
      - Update the internal state based on $s, a, r, s'$ (and the previous internal state).

    After many trials, the agent will have learned the optimal Q values (for state-action pairs) and hence reached the best policy, i.e., the best mapping from states to actions. A sketch of this loop appears below.

    - Output a policy based on (e.g.) the learnt Q values and follow it.
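    The following Python sketch fills in the generic loop with one common choice of update rule, the off-policy Q-learning update mentioned later in the slides. Everything here, including the `env.reset()`/`env.step()` interface, is an assumed illustration rather than the presenter's code.

```python
import random
from collections import defaultdict

def run_general_rl(env, n_actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """Generic RL loop: observe s, choose a, execute, observe r and s', update."""
    Q = defaultdict(float)                              # internal state: Q[(s, a)]

    def choose(s):                                      # epsilon-greedy policy
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = choose(s)
            s2, r, done = env.step(a)
            # Q-learning update: one way to "update internal state based on s, a, r, s'"
            target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in range(n_actions)))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2

    # Output a policy based on the learnt Q values
    return lambda s: max(range(n_actions), key=lambda a: Q[(s, a)])
```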

  • AASS Örebro University- PhD Seminar 25

    RL Framework (Again)

    - Task: one instance of an RL problem.
    - Learning: how should the agent change its policy? (the RL algorithm)
    - Overall goal: maximize the amount of reward received over time.

  • AASS Örebro University- PhD Seminar 26

    What RL Algorithms Do

    - Continual, on-line learning.

    - Many RL methods can be understood as trying to solve the Bellman optimality equations in an approximate way.

  • AASS Örebro University- PhD Seminar 27

    Dynamics / Model of the Environment

    - The environment provides:

      - Transition function / probability:
        $P^{a}_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \}$  for all $s, s' \in S$, $a \in A(s)$.

      - Reward function:
        $R^{a}_{ss'} = E\{ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \}$  for all $s, s' \in S$, $a \in A(s)$.

    - If the agent knows $P$ and $R$, then it has complete information about the environment.

    - The agent has to DISCOVER these while exploring the world. How?

    - The agent has to evaluate its policy and improve it at every instance until it reaches the OPTIMAL POLICY.

  • AASS Örebro University- PhD Seminar 28

    Policy Evaluation & Policy Improvement

  • AASS Örebro University- PhD Seminar 29

    Model-Free Learning

    - No model is available (no $P$ and $R$).

    - Model-free methods:
      - Learn the optimal policy without learning a model.
      - Temporal Difference (TD) learning is a model-free, bootstrapping method based on sampling the state-action space.

  • AASS Örebro University- PhD Seminar 30

    Temporal Difference Prediction

    - Policy evaluation = the prediction problem:
      - Trying to predict how much return we will get from being in state $s$ and following a policy $\pi$, by learning the state-value function $V^{\pi}$.

    - TD and MC methods solve the prediction problem from experience:
      - Given some experience following a policy $\pi$, both methods update their estimate $V$ of $V^{\pi}$.

    - Monte Carlo update:

      $V(s_t) \leftarrow V(s_t) + \alpha \big[ R_t - V(s_t) \big]$

      - The return $R_t$ is the actual return from $s_t$ to the end of the episode.
      - We must wait until the end of the episode to determine the increment to $V(s_t)$.
      - The target for the Monte Carlo update is $R_t$.

  • AASS Örebro University- PhD Seminar 31

    Continued.......

    - Simplest temporal-difference update, TD(0):

      $V(s_t) \leftarrow V(s_t) + \alpha \big[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big]$

    - Target for the TD update: an estimate of the return, $r_{t+1} + \gamma V(s_{t+1})$.

    - The TD method only waits until the next time step.

    - Bootstrapping method: its update is based in part on existing estimates.

    - "Learn a guess from a guess." (A short code sketch follows below.)
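    A compact Python sketch of the TD(0) update inside an episode loop; the `env` and `policy` callables follow the interface assumed in the earlier sketches and are illustrative only.

```python
from collections import defaultdict

def td0_prediction(env, policy, episodes=500, alpha=0.1, gamma=0.9):
    """Estimate V^pi with TD(0): V(s) <- V(s) + alpha*[r + gamma*V(s') - V(s)]."""
    V = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s2, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s2])   # bootstrapped target
            V[s] += alpha * (target - V[s])                  # learn a guess from a guess
            s = s2
    return V
```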

  • AASS Örebro University- PhD Seminar 32

    Temporal Difference Learning

    - TD learning combines MC and DP ideas.

    - Like MC methods: TD learns directly from raw experience, without a model of the environment dynamics ($P^{a}_{ss'}$, $R^{a}_{ss'}$).

    - Like DP: TD updates estimates based in part on other learned estimates, without waiting for a final outcome (it bootstraps, i.e. is self-generating).

    - In brief, TD methods:
      - do not need a model,
      - learn directly from experience,
      - update the estimate of $V(s_t)$ based on what happens after visiting state $s_t$.

  • AASS Örebro University- PhD Seminar 33

    Advantages of TD Learning Methods

    - Do not need a model of the environment, its rewards, or its next-state probability distributions.

    - Online and incremental: FAST.

    - No need to wait until the end of the episode, so they need less memory and computation.

    - Updates are based on actual experience ($r_{t+1}$).

    - Converge to $V^{\pi}(s)$, but the step size must be decreased as learning continues.

  • AASS Örebro University- PhD Seminar 34

    TD for the Control Problem

    - Control methods aim to find the optimal policy/solution.
    - Temporal-difference learning, TD(0), is used to compute the values for a given policy.

    - We now come to TD for the control problem: learning Q values,
      - using Generalized Policy Iteration (GPI),
      - with TD methods used for the evaluation (prediction) part.
    - Again we face the need to trade off exploration and exploitation.
    - This results in two main classes:
      - On-policy: SARSA
      - Off-policy: Q-Learning

    - We consider transitions from state-action pair to state-action pair, and learn the value of the state-action pairs.

    - First step: learn the action-value function, Q.

  • AASS Örebro University- PhD Seminar 35

    Softmax Action Selection

  • AASS Örebro University- PhD Seminar 36

    Continued......

    - Effect of the temperature $\tau$:

      - As $\tau \rightarrow \infty$, the action probabilities become nearly equiprobable ($\approx 1/n$).
      - As $\tau \rightarrow 0$, softmax action selection becomes the same as greedy action selection.

    - Which one is better, softmax or $\varepsilon$-greedy? UNCLEAR.

    (A small softmax sketch follows below.)
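    For reference, a small Python sketch of softmax (Gibbs/Boltzmann) action selection with temperature tau, applied here to a toy table of action-value estimates. This is an illustrative implementation under the standard softmax form, not the presenter's code.

```python
import math
import random

def softmax_action(Q, tau=1.0):
    """Pick action a with probability exp(Q[a]/tau) / sum_b exp(Q[b]/tau)."""
    actions = list(Q)
    prefs = [math.exp(Q[a] / tau) for a in actions]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return random.choices(actions, weights=probs, k=1)[0]

Q = {1: 0.1, 2: 0.0, 3: 0.05, 4: 0.4, 5: 0.2}
print(softmax_action(Q, tau=0.1))   # low tau  -> close to greedy
print(softmax_action(Q, tau=10.0))  # high tau -> nearly equiprobable
```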

  • AASS Örebro University- PhD Seminar 37

    Reinforcement Comparison

    - Central theme:
      - Actions followed by large rewards should recur more often (high reoccurrence).
      - Actions followed by small rewards should recur less often (low reoccurrence).

    - How do we judge what is "large" and "small"? A judgment problem:
      - Compare the reward with a reference level (a reference reward).
      - Reference reward: an average of previously received rewards.

    - Basic idea of reinforcement comparison:
      - A large reward means higher than average.
      - A small reward means lower than average.

    - Learning methods based on this basic idea are called reinforcement comparison methods.

  • AASS Örebro University- PhD Seminar 38

    Continued......

    - Reinforcement comparison methods maintain an overall reference reward.

    - To pick among the actions, they maintain a separate measure of their preference for each action:
      - $p_t(a)$: preference for action $a$ on play $t$.

    - The preference is used to determine action-selection probabilities via the softmax relationship:

      $\pi_t(a) = \Pr\{ a_t = a \} = \dfrac{e^{p_t(a)}}{\sum_{b=1}^{n} e^{p_t(b)}}$

    - After each play, the preference for the action selected on that play is updated:

      $p_{t+1}(a_t) = p_t(a_t) + \beta \big[ r_t - \bar{r}_t \big]$

      where $\beta$ is a positive step-size parameter.

  • AASS Örebro University- PhD Seminar 39

    Continued......

    - Higher rewards increase the probability of reselecting that action, while lower rewards decrease it.

    - The reference reward is updated as:

      $\bar{r}_{t+1} = \bar{r}_t + \alpha \big[ r_t - \bar{r}_t \big]$

      where $\alpha$ is a step-size parameter.

    (Both updates are sketched together in code below.)
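    Putting the two updates together, here is a short Python sketch of a reinforcement comparison agent on a toy bandit. The reward function `pull` and the parameter values are made up for illustration.

```python
import math
import random

def pull(a):
    """Hypothetical bandit arm with action-dependent mean reward."""
    return a * 0.2 + random.gauss(0.0, 1.0)

n, alpha, beta = 5, 0.1, 0.1
p = [0.0] * n          # preferences p_t(a)
r_bar = 0.0            # reference reward (running average)

for t in range(2000):
    # Softmax over preferences: pi_t(a) = exp(p(a)) / sum_b exp(p(b))
    weights = [math.exp(x) for x in p]
    a = random.choices(range(n), weights=weights, k=1)[0]
    r = pull(a)
    p[a] += beta * (r - r_bar)            # strengthen/weaken preference vs. reference
    r_bar += alpha * (r - r_bar)          # update the reference reward

print(p)   # the highest-mean arm should end up with the largest preference
```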

  • AASS Örebro University- PhD Seminar 40

    Actor-Critic Methods

    - TD methods that have a separate memory structure to represent the policy, independent of the value function.

    1. Policy structure: the ACTOR, used to select actions.
    2. Estimated value function: the CRITIC, which criticizes the actions made by the actor; typically a state-value function.

  • AASS Örebro University- PhD Seminar 41

    Continued......

    - An extension of the idea of reinforcement comparison methods to TD learning and to the full reinforcement learning problem.

    - The critic critiques whatever policy is being followed by the actor.

    - The criticism (a scalar signal) takes the form of the TD error; it is the sole output of the critic and drives all learning.

    - Working concept:
      - After each action selection, the critic evaluates the new state to determine whether things have gone better or worse than expected. That evaluation is the TD error:

        $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$

      - $V$ is the current value function implemented by the critic.
      - The TD error is used to evaluate the action just selected:
        - If the TD error is positive, the tendency to select $a_t$ is strengthened for the future.
        - If the TD error is negative, the tendency to select $a_t$ is weakened for the future.

  • AASS Örebro University- PhD Seminar 42

    Continued......

    - Suppose actions are generated by the Gibbs softmax method:

      $\pi_t(s, a) = \Pr\{ a_t = a \mid s_t = s \} = \dfrac{e^{p(s,a)}}{\sum_{b=1}^{n} e^{p(s,b)}}$

    - $p(s, a)$ indicates the tendency to select (the preference for) each action $a$ when in each state $s$.

    - The $p(s, a)$ are the values, at time $t$, of the modifiable policy parameters of the actor.

    - Preference update rule (for strengthening and weakening):

      $p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \delta_t$

      where $\beta$ is a positive step-size parameter.

    (A compact code sketch of this scheme follows below.)
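    A compact Python sketch of this actor-critic scheme, combining the TD-error critic with the softmax-preference actor. It assumes the same illustrative `env.reset()`/`env.step()` interface as the earlier sketches and is not the presenter's code.

```python
import math
import random
from collections import defaultdict

def actor_critic(env, n_actions, episodes=500, alpha=0.1, beta=0.1, gamma=0.9):
    """Actor-critic: critic learns V via the TD error, actor adjusts softmax preferences."""
    V = defaultdict(float)                       # critic: state-value estimates
    p = defaultdict(float)                       # actor: preferences p(s, a)

    def act(s):                                  # Gibbs softmax over preferences
        weights = [math.exp(p[(s, a)]) for a in range(n_actions)]
        return random.choices(range(n_actions), weights=weights, k=1)[0]

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = act(s)
            s2, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[s2]) - V[s]   # TD error
            V[s] += alpha * delta                                  # critic update
            p[(s, a)] += beta * delta                              # actor update
            s = s2
    return p, V
```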

  • AASS Örebro University- PhD Seminar 43

    Significant Advantages

    - Minimal computation time to select actions:
      - the policy is explicitly stored, which avoids extensive computation at selection time.

    - They can learn a stochastic policy:
      - i.e., learn the optimal probabilities of selecting the various actions.

  • AASS Örebro University- PhD Seminar 44

    Research Paper

    Direct Reinforcement Adaptive Learning (DRAL) Fuzzy Logic Control for a Class of Nonlinear Systems

    (Proceedings of the 12th IEEE International Symposium on Intelligent Control)

  • AASS Örebro University- PhD Seminar 45

    Paper Abstract

    - Application of RL techniques to feedback control of nonlinear systems using an adaptive fuzzy logic system (FLS).

    - Nonlinear systems are difficult to control when a good model is required.

    - This paper shows that an adaptive FLS can handle the nonlinearities through RL, with no preliminary off-line learning.

    - Approach presented:
      - The FLS is indirectly told about the effects of its control action on the system performance.
      - On-line learning is based on a binary reinforcement signal from a critic, without knowing the nonlinearity appearing in the system.

    "Direct Reinforcement Learning"

  • AASS Örebro University- PhD Seminar 46

    Major Modules

    - Fuzzy Logic System (FLS)

    - Reinforcement Learning (actor-critic method)

    - FLS + RL training algorithm → Adaptive FLS

    - System to be controlled:
      - a nonlinear plant model,
      - with nonlinearity and disturbances.

  • AASS Örebro University- PhD Seminar 47

    FLS, RL & Adaptive FLS

    - Fuzzy control: a model-free approach, well known as a nonlinear control technique.

    - Adaptive FLS = FLS + RL training algorithm:
      - FLS: a collection of IF-THEN rules.
      - Training algorithm: adjusts the parameters of the FLS membership functions according to input/output data.

    - Reinforcement learning:
      - Actions followed by large rewards recur more often (high reoccurrence); actions followed by small rewards recur less often (low reoccurrence).

    - Extending this idea so that action selection depends on state information brings in aspects of feedback control, pattern recognition, and associative learning.

  • AASS Örebro University- PhD Seminar 48

    Continued......

    - Why is RL so important here?
      - The ability to generate the correct control action in situations where it is difficult to define a priori the desired output for each input.

    - The critic in the RL scheme provides a scalar evaluation signal:
      - much less information than the desired outputs required in supervised learning.

    - Supervised learning and direct inverse control assume the a priori desired output is known.

  • AASS Örebro University- PhD Seminar 49

    Adaptive Fuzzy Logic System

    Basic Configuration of Adaptive Fuzzy Logic System

  • AASS Örebro University- PhD Seminar 50

    Continued......

    - Parameters to be adjusted in the adaptive FLS:

      - the control representative value of each rule $L$ ($\theta$),
      - the parameters of the membership functions ($m$, $\sigma$),
      - the fuzzy basis function (FBF): the strength of each rule.

  • AASS Örebro University- PhD Seminar 51

    Continued......

    - The centroid $m_i^l$ and width $\sigma_i^l$ are real-valued parameters of the Gaussian membership function for the $i$-th input variable and $l$-th rule.

    - Adjustable parameters (adaptive FBF): $m_i^l$ and $\sigma_i^l$.

    - Constraints: $m_i^l \in U_i$ (the physical domain of the input) and $\sigma^l \neq 0$.

    - The FLS output is then given by:

  • AASS Örebro University- PhD Seminar 52

    Continued......

    - The nonlinear function to be approximated can be represented in this form, with the parameters provided by the adaptation learning algorithm.

    - The adaptation learning algorithm is presented later.

  • AASS Örebro University- PhD Seminar 53

    DRAL Controller Design

    Proposed DRAL Controller, illustrating overall Adaptive Learning Scheme

  • AASS Örebro University- PhD Seminar 54

    DRAL Controller Architecture

    - The DRAL controller is a combination of:
      - an action-generating FLS (using the reinforcement signal as an adaptive learning signal), and
      - a fixed-gain controller in the performance-measurement loop, which uses an error based on the given reference trajectory.

    - As the adaptive learning process proceeds:
      - the fixed-gain controller has less and less influence on the control action,
      - while the FLS gains more and more influence on the control action.

    - The DRAL controller is designed to learn how to generate the best control action at each time instant, in the absence of complete information about the plant and the disturbances.

  • AASS Örebro University- PhD Seminar 55

    Continued......

    - The critic is a binary sign function (output +1 or −1), based on a performance measure.

    - This evaluative signal (the binary reinforcement signal) is used for the generation of correct control actions by the FLS.

    - The critic works as a supervisor to the FLS and controls the learning of the FLS by checking the actual performance against the performance measures.

  • AASS Örebro University- PhD Seminar 56

    System Dynamics

    - The plant can be represented as:

  • AASS Örebro University- PhD Seminar 57

    Continued......

    - The full tracking error r(t) is not allowed to be used for tuning the overall FLS; only a reduced reinforcement signal R is allowed:

      R = sign(r)

    - Proposed control law (the input to the plant):

  • AASS Örebro University- PhD Seminar 58

    Adaptive Learning Algorithm

    - The derived adaptive learning algorithm for tuning the adjustable FLS parameters:

    - Design matrices determine the learning rate, and Ki > 0 is a design parameter governing the speed of convergence.

  • AASS Örebro University- PhD Seminar 59

    Simulation Results

    - The dynamic equation for the n-link manipulator is as follows:

    - The dynamic equation for the 2-link robot manipulator is:

    - M(q) is assumed known; the nonlinearities are unknown.

  • AASS Örebro University- PhD Seminar 60

    Continued......

    - Simulation parameters:
      - Lengths: l1 = l2 = 1 m; masses: m1 = m2 = 1 kg
      - Desired trajectory:
        - qd1(t) = sin(t) (rad)
        - qd2(t) = cos(t) (rad)
      - Linear gain: Kv = diag[20 20]

    - States: 4 input dimensions
      - Three Gaussian membership functions per input dimension
      - 81 fuzzy IF-THEN rules can be generated: rules = (M.F. per input)^(inputs) = 3^4 = 81

    - Trajectory-tracking performance was studied for 3 cases:
      - with only the fixed gain in the performance measure,
      - the action-generating FLS using the DRAL algorithm,
      - the DRAL algorithm in the presence of mass changes.

    - Programming: Turbo C

  • AASS Örebro University- PhD Seminar 61

    Continued......

    - Performance with only the fixed gain in the performance measure.

    Performance with Kv only

  • AASS Örebro University- PhD Seminar 62

    Continued......

    - Performance with the action-generating FLS using the DRAL algorithm:

      - Trajectory-following ability: fairly good
      - Kv = diag[20 20]
      - The DRAL controller cancels the nonlinearity in the robot system
      - The robot dynamics are unknown to the DRAL controller

  • AASS Örebro University- PhD Seminar 63

    Continued......

    - Performance with the DRAL algorithm in the presence of mass changes:
      - Mass m2 changes (to verify adaptation and robustness)
      - At time t1 = 5 s the mass changes from 1.0 kg to 2.5 kg (picked up)
      - At time t2 = 12 s the mass changes from 2.5 kg to 1.0 kg (released)
      - Same parameters as before
      - Fairly satisfactory performance
      - The system dynamics remain unknown to the DRAL controller

  • AASS Örebro University- PhD Seminar 64

    Key Features

    - No need for off-line training.

    - The binary reinforcement signal is used directly for adjusting the FLS parameters.

    - Controller reusability: the same controller works even if the dynamics change.

  • AASS Örebro University- PhD Seminar 65

    Reference Books & Research Papers

    - Reference books:
      - Sutton & Barto, Reinforcement Learning: An Introduction.
      - "Reinforcement Learning: A Survey", Journal of Artificial Intelligence Research 4 (1996) 237-285.

    - Lecture notes:
      - Machine Learning course lecture slides (by Denny & Achim).
      - Ivan's lecture notes on Soft Computing & Control.

    - Research papers:
      - Young H. Kim & Frank L. Lewis, "Direct Reinforcement Adaptive Learning Fuzzy Logic Control".
      - Ihsan Omur & Mohamed Ali Zohdy, "Reinforcement Learning Control of Nonlinear Multi-Link System".

  • AASS Örebro University- PhD Seminar 66

    THANKS