
  • AASS Örebro University- PhD Seminar 1

    PhD SEMINAR

    REINFORCEMENT LEARNING

    MACHINE LEARNING COURSE: AASS, ÖREBRO UNIV

    Presented by:

    Muhammad Rehan Ahmed, PhD Student

    Intelligent Control Lab

  • AASS Örebro University- PhD Seminar 2

    AGENDA

    - Simple Learning Taxonomy

    - Reinforcement Learning: basic terminology such as policy, goal, rewards, tasks, etc.

    - Value Functions, and how to estimate them

    - General RL Algorithm

    - Reinforcement Comparison

    - Actor-Critic Methods

    - Research Paper

  • AASS Örebro University- PhD Seminar 3

    Simple Learning Taxonomy

    - Supervised Learning:
      - The learner has access to a teacher who corrects it.
      - The teacher provides the required response to inputs.
      - Desired behavior is known; the correct target output is known for each input pattern.
      - Sometimes costly (expert examples are expensive and scarce).

    - Unsupervised Learning:
      - No access to a teacher.
      - The learner must search for order/patterns in the environment.
      - No target output, no "right answer".
      - Learning proceeds by repeatedly searching for patterns in the inputs.

  • AASS Örebro University- PhD Seminar 4

    Continued...... Reinforcement Learning: The Brief Concept

    - The learner is not told which actions to take, but receives reward/punishment from the environment and learns/adjusts which action to pick next time by trial-and-error search.

    - Learning from interaction (agent-environment interface):
      - with the environment,
      - to achieve some goal,
      - experience is cheap and plentiful.

    - Examples: a baby playing, learning to drive a car, holding a conversation, etc.

    1. The environment's response affects our subsequent actions.
    2. The agent finds out the effect of its actions later.

  • AASS Örebro University- PhD Seminar 5

    RL vs. Supervised Learning: Evaluation vs. Instruction

    - RL:
      - Training information evaluates the action taken.
      - It does not say whether that action was the best or correct relative to all other actions.
      - The agent must try all actions and compare them to see which is best.
      - Procedure:
        - Trial-and-error search for actions; all actions must be tried.
        - Reward is a scalar (other actions could have been better or worse).
        - Learning by selection: selectively choose those actions that prove to be better.

    - Supervised Learning:
      - Training instructs.
      - It gives the correct answer regardless of the action chosen.
      - No search in the action space.

  • AASS Örebro University- PhD Seminar 6

    Agent-Environment Interface

    Agent and environment interact at discrete time steps $t = 0, 1, 2, \ldots$

    - Agent observes state at step $t$: $s_t \in S$
    - produces action at step $t$: $a_t \in A(s_t)$
    - gets resulting immediate reward: $r_{t+1} \in \mathbb{R}$
    - and resulting next state: $s_{t+1}$

    $\ldots \; s_t \xrightarrow{a_t} r_{t+1}, s_{t+1} \xrightarrow{a_{t+1}} r_{t+2}, s_{t+2} \xrightarrow{a_{t+2}} r_{t+3}, s_{t+3} \xrightarrow{a_{t+3}} \ldots$

    (A short code sketch of this interaction loop follows below.)
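    To make the interface concrete, here is a minimal Python sketch of the agent-environment loop. The `SimpleEnv` and `RandomAgent` classes are hypothetical stand-ins chosen for illustration, not anything from the slides.

```python
# Minimal agent-environment interaction loop (illustrative sketch only).
import random

class SimpleEnv:
    """Toy 5-state corridor: move left/right, reward +1 at the right end."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):                       # a: 0 = left, 1 = right
        self.s = max(0, min(4, self.s + (1 if a == 1 else -1)))
        reward = 1.0 if self.s == 4 else 0.0
        done = self.s == 4
        return self.s, reward, done

class RandomAgent:
    def select_action(self, s):
        return random.choice([0, 1])
    def update(self, s, a, r, s_next):
        pass                                 # a learning agent would update its estimates here

env, agent = SimpleEnv(), RandomAgent()
s = env.reset()
for t in range(100):                         # discrete time steps t = 0, 1, 2, ...
    a = agent.select_action(s)               # agent produces action a_t
    s_next, r, done = env.step(a)            # environment returns r_{t+1}, s_{t+1}
    agent.update(s, a, r, s_next)
    s = s_next
    if done:
        s = env.reset()
```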

  • AASS Örebro University- PhD Seminar 7

    Reinforcement Learning

    - Objective of RL:
      - Learn a policy (a mapping from states to actions) that maximizes a scalar reward / reinforcement signal.

    - How to learn?
      1. Trial-and-error search: try out actions to learn which ones produce the highest rewards.
      2. Delayed effects, delayed rewards: actions affect the immediate reward, the next state, and all subsequent rewards.

    - Sequence: sense states → choose actions → achieve goals.

  • AASS Örebro University- PhD Seminar 8

    Key Features of RL

    - Trial-and-error search

    - The environment is stochastic (uncertain)

    - Reward may be delayed, so the agent may need to sacrifice short-term gains for greater long-term gains

    - The agent has to balance the need to explore its environment against the need to exploit its current knowledge:

    "Exploration-Exploitation Tradeoff"

  • AASS Örebro University- PhD Seminar 9

    Exploration VS Exploitation

    - The learner actively interacts with the environment:
      - At the beginning the learner does not know anything about the environment.
      - It gradually gains experience and learns how to react to the environment.

    - Dilemma: after some number of steps, should the learner select the best current choice (exploitation) or try to learn more about the environment (exploration)?

    - Exploitation may involve selecting a sub-optimal action and prevent the learning of the optimal choice.

    - Exploration may spend too much time trying actions that are currently bad or suboptimal.

    - Must do both.

  • AASS Örebro University- PhD Seminar 10

    RL Framework

  • AASS Örebro University- PhD Seminar 11

    Policy

    - Telling the agent what to do is its POLICY.

    - POLICY:
      - A mapping from states to action probabilities.
      - A way of behaving (a strategy).
      - The probability of taking action $a$ in state $s$.
      - Given the state $s$ at time $t$, the policy gives the probability that the agent's action will be $a$.

    Policy at step $t$, $\pi_t$:

    $\pi_t(s, a) = \Pr\{a_t = a \mid s_t = s\}$

    Reinforcement learning: to learn the POLICY.

  • AASS Örebro University- PhD Seminar 12

    Agent Learns a Policy

    - The agent detects a state, chooses an action, and gets a reward.

    - The agent's aim is to learn a POLICY that maximizes the reward.

    - Maximize the reward over the long term, not necessarily the immediate reward.

    - For example: watch TV now and panic over homework later, vs. do the homework now and watch TV while all your pals are panicking.

    Reinforcement learning methods specify how the agent changes its policy as a result of experience.

  • AASS Örebro University- PhD Seminar 13

    Goal, Reward & Return

    - Goal: maximize the total reward received.

    - Immediate reward $r$ at each step.

    - The agent must maximize the expected cumulative reward.

    - Suppose the sequence of rewards after step $t$ is $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$

    - Then the return is the total reward $R_t$.

    "Maximize the expected return, $E\{R_t\}$, for each step $t$."

  • AASS Örebro University- PhD Seminar 14

    Task Types

    - Episodic Tasks:
      - Interaction breaks naturally into episodes, e.g., playing a game or a trip through a maze.
      - $R_t = r_{t+1} + r_{t+2} + r_{t+3} + \ldots + r_T$,
        where $T$ is the final step at which the terminal state is reached, ending an episode/trial.

    - Continuing Tasks:
      - Interaction does not break naturally into episodes, i.e. $T = \infty$.
      - Discounted return:

        $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$,

        where $0 \le \gamma \le 1$ is the discount rate (the closer $\gamma$ is to 1, the more future rewards count).

    shortsighted  $0 \leftarrow \gamma \rightarrow 1$  farsighted

    (A small helper for computing this return is sketched below.)
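    As a quick illustration of the discounted return, here is a small Python helper (a sketch added for this transcript, not from the slides) that computes $R_t$ for a finite list of rewards:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence.

    `rewards` is the list [r_{t+1}, r_{t+2}, ...]; gamma = 1 corresponds to the
    undiscounted episodic return, gamma < 1 to the discounted continuing case.
    """
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: three steps of reward 1 with gamma = 0.5 -> 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```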

  • AASS Örebro University- PhD Seminar 15

    Example 1

    Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.

    - As an episodic task, where an episode ends upon failure:
      reward = +1 for each step before failure
      ⇒ return = number of steps before failure

    - As a continuing task with discounted return:
      reward = −1 upon failure; 0 otherwise
      ⇒ return = $-\gamma^k$, for $k$ steps before failure

    In either case, the return is maximized by avoiding failure for as long as possible.

  • AASS Örebro University- PhD Seminar 16

    Combined Notation

    - In episodic tasks, we number the time steps of each episode starting from zero.

    - We can cover all cases by writing

      $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$,

      where $\gamma$ can be 1 only if a zero-reward absorbing state is always reached.

  • AASS Örebro University- PhD Seminar 17

    Value Functions

    - Value of a state

    - Value of a state-action pair (value of an action)

  • AASS Örebro University- PhD Seminar 18

    1: Value of State

    - The expected return, which estimates how good it is for the agent to be in a given state.

    - "How good" is defined in terms of future rewards.

    - The value of a state is the expected return starting from that state; it depends on the agent's policy.

    State-value function for policy $\pi$:

    $V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \} = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s \Big\}$

    Written without the expectation operator (the Bellman equation for $V^{\pi}$):

    $V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma V^{\pi}(s') \big]$

    (A sketch of computing $V^{\pi}$ from this equation follows below.)
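    When the transition probabilities $P^{a}_{ss'}$ and expected rewards $R^{a}_{ss'}$ are known, $V^{\pi}$ can be computed by sweeping the Bellman equation until it stops changing. Below is a small Python sketch of iterative policy evaluation; the dictionaries `P`, `R`, and `policy` are assumed input structures chosen for this illustration, not anything defined in the slides.

```python
def evaluate_policy(states, actions, P, R, policy, gamma=0.9, tol=1e-6):
    """Iterative policy evaluation via repeated Bellman backups.

    P[(s, a)]      : dict mapping next state s2 -> probability P^a_{ss'}
    R[(s, a, s2)]  : expected reward R^a_{ss'}
    policy[(s, a)] : probability pi(s, a) of taking a in s
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                policy[(s, a)]
                * sum(p * (R[(s, a, s2)] + gamma * V[s2])
                      for s2, p in P[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```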

  • AASS Örebro University- PhD Seminar 19

    2: Value of State-Action Pair

    - The value of a state-action pair under a policy estimates how good it is to perform a given action in a given state.
    - "How good" is defined in terms of future rewards (the expected return).

    - The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$.

    Action-value function for policy $\pi$:

    $Q^{\pi}(s,a) = E_{\pi}\{ R_t \mid s_t = s, a_t = a \} = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a \Big\}$

    Optimal action-value function:

    $Q^{*}(s,a) = E\big\{ r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \mid s_t = s, a_t = a \big\} = \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s', a') \big]$

  • AASS Örebro University- PhD Seminar 20

    Value of Action: Q

    - Q: the value of an action.
      - The expected reward from that action.
      - We have only estimates of Q.
      - We build these estimates from experience of rewards.

    How to estimate Q?

    - Given value estimates for all the actions at a given state, how do we select the action? Two choices:

    1. Greedy actions (highest estimated Q): EXPLOITATION.
    2. Other actions (lower estimated Q's): EXPLORATION.

    - The agent can't exploit all the time; it must sometimes explore to see if an action that currently looks bad eventually turns out to be good (in the long run).

    - What if we know the Q values exactly? Then we simply choose the action with the highest Q value, and we have found the optimal Q.

  • AASS Örebro University- PhD Seminar 21

    1: How to Estimate Q

    - $Q^{*}(a)$: the optimal (true) value of action $a$.
    - $Q_t(a)$: the estimated value of action $a$ at time $t$.

    - How to estimate Q? By a running mean.

    - Suppose we choose action $a$ a total of $k_a$ times and observe a reward $r_i$ on play $i$.
    - Then we can estimate $Q^{*}$ from the running mean:

      $Q_t(a) = \dfrac{r_1 + r_2 + r_3 + \ldots + r_{k_a}}{k_a}$

    - If $k_a = 0$ (the action has not yet been tried), $Q_t(a)$ is given a default value, e.g. 0.

    - As $k_a \rightarrow \infty$, $Q_t(a) \rightarrow Q^{*}(a)$.

    This is called the sample-average method for estimating Q (sketched in code below).
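    A minimal Python sketch of the sample-average method. The reward function `pull` is a made-up stand-in for a bandit arm, used only for illustration.

```python
import random

def pull(a):
    """Hypothetical bandit arm: noisy reward whose mean depends on the action index."""
    return a * 0.1 + random.gauss(0.0, 1.0)

n_actions = 10
rewards = {a: [] for a in range(n_actions)}   # observed rewards per action

for play in range(1000):
    a = random.randrange(n_actions)           # here: purely random exploration
    rewards[a].append(pull(a))

# Sample-average estimate: Q_t(a) = (r_1 + ... + r_ka) / ka, default 0 if untried
Q = {a: (sum(r) / len(r) if r else 0.0) for a, r in rewards.items()}
print(Q)
```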

  • AASS Örebro University- PhD Seminar 22

    Incremental Update Equation

    - Estimate $Q^{*}$ by a running mean (if we have tried action $a$ $k$ times):

      $Q_k(a) = \dfrac{r_1 + r_2 + r_3 + \ldots + r_k}{k}$

    - Incremental form:

      $Q_{k+1} = \dfrac{1}{k+1} \sum_{i=1}^{k+1} r_i = Q_k + \dfrac{1}{k+1}\big[ r_{k+1} - Q_k \big]$

    - New Estimate = Old Estimate + Step size × [Target − Old Estimate]
    - Step size ($\alpha$): can depend on $k$.
      - Often kept constant, e.g. $\alpha = 0.1$.
      - A constant step size gives more weight to recent rewards.

    (A short code sketch of this update follows below.)
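    The incremental form translates directly into code. The sketch below (a simple illustration, not from the slides) shows both the sample-average step size 1/(k+1) and a constant step size alpha:

```python
def incremental_update(q_old, reward, step_size):
    """New estimate = old estimate + step_size * (target - old estimate)."""
    return q_old + step_size * (reward - q_old)

# Sample-average version: step size 1/(k+1) reproduces the running mean exactly.
q, k = 0.0, 0
for reward in [1.0, 0.0, 1.0, 1.0]:
    q = incremental_update(q, reward, 1.0 / (k + 1))
    k += 1
print(q)   # 0.75, the mean of the four rewards

# Constant step size (e.g. alpha = 0.1) weights recent rewards more heavily.
q = 0.0
for reward in [1.0, 0.0, 1.0, 1.0]:
    q = incremental_update(q, reward, 0.1)
print(q)
```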

  • AASS Örebro University- PhD Seminar 23

    2: Action Selection

    - Greedy:
      - Select the action $a^{*}$ for which Q is highest.
      - $Q_t(a^{*}) = \max_a Q_t(a)$  (the action that maximizes the value estimate)
      - So $a^{*} = \arg\max_a Q_t(a)$  ("*" denotes "best")

    - Example: 10-armed bandit. Suppose at time $t$ we have estimated values $Q_t(a)$ for actions 1 to 10; the largest is $Q_t(a^{*}) = 0.4$, so $a^{*}$ is the 4th action (the "argument" of the maximum).

    - $\varepsilon$-greedy:
      - With probability $\varepsilon$ select a random action; otherwise select the greedy action.
      - This samples all actions infinitely many times, so as $k_a \rightarrow \infty$, the Q's converge to $Q^{*}$.

    (A small $\varepsilon$-greedy sketch follows below.)
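    A small Python sketch of epsilon-greedy selection over a table of estimates; `Q` here is just any dictionary of toy action values chosen for illustration.

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon pick a random action, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(list(Q))              # explore
    return max(Q, key=Q.get)                       # exploit: argmax_a Q(a)

Q = {1: 0.1, 2: 0.0, 3: 0.05, 4: 0.4, 5: 0.2}      # toy estimates
print(epsilon_greedy(Q, epsilon=0.1))
```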

  • AASS Örebro University- PhD Seminar 24

    General RL Algorithm

    - Initialize the agent's internal state (e.g., Q values and other statistics).
    - Do for a long time (loop):
      - Observe the current world state $s$.
      - Choose action $a$ using the policy (greedy or $\varepsilon$-greedy).
      - Execute action $a$.
      - Let $r$ be the immediate reward and $s'$ the new world state.
      - Update the internal state based on $s, a, r, s'$ (and the previous internal state).

    After many trials, the agent will have learned the optimal Q values (for state-action pairs) and hence reached the best policy, i.e., the best mapping from states to actions. A sketch of this loop appears below.

    - Output a policy based on (e.g.) the learnt Q values and follow it.
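    The following Python sketch fills in the generic loop with one common choice of update rule, the off-policy Q-learning update mentioned later in the slides. Everything here, including the `env.reset()`/`env.step()` interface, is an assumed illustration rather than the presenter's code.

```python
import random
from collections import defaultdict

def run_general_rl(env, n_actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """Generic RL loop: observe s, choose a, execute, observe r and s', update."""
    Q = defaultdict(float)                              # internal state: Q[(s, a)]

    def choose(s):                                      # epsilon-greedy policy
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = choose(s)
            s2, r, done = env.step(a)
            # Q-learning update: one way to "update internal state based on s, a, r, s'"
            target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in range(n_actions)))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2

    # Output a policy based on the learnt Q values
    return lambda s: max(range(n_actions), key=lambda a: Q[(s, a)])
```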

  • AASS Örebro University- PhD Seminar 25

    RL Framework (Again)

    - Task: one instance of an RL problem.
    - Learning: how should the agent change its policy? (the RL algorithm)
    - Overall goal: maximize the amount of reward received over time.

  • AASS Örebro University- PhD Seminar 26

    What RL Algorithms Do

    - Continual, on-line learning.

    - Many RL methods can be understood as trying to solve the Bellman optimality equations in an approximate way.

  • AASS Örebro University- PhD Seminar 27

    Dynamics / Model of the Environment

    - The environment provides:

      - Transition function / probability:
        $P^{a}_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \}$  for all $s, s' \in S$, $a \in A(s)$.

      - Reward function:
        $R^{a}_{ss'} = E\{ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \}$  for all $s, s' \in S$, $a \in A(s)$.

    - If the agent knows $P$ and $R$, then it has complete information about the environment.

    - The agent has to DISCOVER these while exploring the world. How?

    - The agent has to evaluate its policy and improve it at every instance until it reaches the OPTIMAL POLICY.

  • AASS Örebro University- PhD Seminar 28

    Policy Evaluation & Policy Improvement

  • AASS Örebro University- PhD Seminar 29

    Model-Free Learning

    - No model is available (no $P$ and $R$).

    - Model-free methods:
      - Learn the optimal policy without learning a model.
      - Temporal Difference (TD) learning is a model-free, bootstrapping method based on sampling the state-action space.

  • AASS Örebro University- PhD Seminar 30

    Temporal Difference Prediction

    - Policy evaluation = the prediction problem:
      - Trying to predict how much return we will get from being in state $s$ and following a policy $\pi$, by learning the state-value function $V^{\pi}$.

    - TD and MC methods solve the prediction problem from experience:
      - Given some experience following a policy $\pi$, both methods update their estimate $V$ of $V^{\pi}$.

    - Monte Carlo update:

      $V(s_t) \leftarrow V(s_t) + \alpha \big[ R_t - V(s_t) \big]$

      - The return $R_t$ is the actual return from $s_t$ to the end of the episode.
      - We must wait until the end of the episode to determine the increment to $V(s_t)$.
      - The target for the Monte Carlo update is $R_t$.

  • AASS Örebro University- PhD Seminar 31

    Continued.......

    - Simplest temporal-difference update, TD(0):

      $V(s_t) \leftarrow V(s_t) + \alpha \big[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big]$

    - Target for the TD update: an estimate of the return, $r_{t+1} + \gamma V(s_{t+1})$.

    - The TD method only waits until the next time step.

    - Bootstrapping method: its update is based in part on existing estimates.

    - "Learn a guess from a guess." (A short code sketch follows below.)
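    A compact Python sketch of the TD(0) update inside an episode loop; the `env` and `policy` callables follow the interface assumed in the earlier sketches and are illustrative only.

```python
from collections import defaultdict

def td0_prediction(env, policy, episodes=500, alpha=0.1, gamma=0.9):
    """Estimate V^pi with TD(0): V(s) <- V(s) + alpha*[r + gamma*V(s') - V(s)]."""
    V = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s2, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s2])   # bootstrapped target
            V[s] += alpha * (target - V[s])                  # learn a guess from a guess
            s = s2
    return V
```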

  • AASS Örebro University- PhD Seminar 32

    Temporal Difference Learning

    - TD learning combines MC and DP ideas.

    - Like MC methods: TD learns directly from raw experience, without a model of the environment dynamics ($P^{a}_{ss'}$, $R^{a}_{ss'}$).

    - Like DP: TD updates estimates based in part on other learned estimates, without waiting for a final outcome (it bootstraps, i.e. is self-generating).

    - In brief, TD methods:
      - do not need a model,
      - learn directly from experience,
      - update the estimate of $V(s_t)$ based on what happens after visiting state $s_t$.

  • AASS Örebro University- PhD Seminar 33

    Advantages of TD Learning Methods

    - Do not need a model of the environment, its rewards, or its next-state probability distributions.

    - Online and incremental: FAST.

    - No need to wait until the end of the episode, so they need less memory and computation.

    - Updates are based on actual experience ($r_{t+1}$).

    - Converge to $V^{\pi}(s)$, but the step size must be decreased as learning continues.

  • AASS Örebro University- PhD Seminar 34

    TD for the Control Problem

    - Control methods aim to find the optimal policy/solution.
    - Temporal-difference learning, TD(0), is used to compute the values for a given policy.

    - We now come to TD for the control problem: learning Q values,
      - using Generalized Policy Iteration (GPI),
      - with TD methods used for the evaluation (prediction) part.
    - Again we face the need to trade off exploration and exploitation.
    - This results in two main classes:
      - On-policy: SARSA
      - Off-policy: Q-Learning

    - We consider transitions from state-action pair to state-action pair, and learn the value of the state-action pairs.

    - First step: learn the action-value function, Q.

  • AASS Örebro University- PhD Seminar 35

    Softmax Action Selection

  • AASS Örebro University- PhD Seminar 36

    Continued......

    - Effect of the temperature $\tau$:

      - As $\tau \rightarrow \infty$, the action probabilities become nearly equiprobable ($\approx 1/n$).
      - As $\tau \rightarrow 0$, softmax action selection becomes the same as greedy action selection.

    - Which one is better, softmax or $\varepsilon$-greedy? UNCLEAR.

    (A small softmax sketch follows below.)
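    For reference, a small Python sketch of softmax (Gibbs/Boltzmann) action selection with temperature tau, applied here to a toy table of action-value estimates. This is an illustrative implementation under the standard softmax form, not the presenter's code.

```python
import math
import random

def softmax_action(Q, tau=1.0):
    """Pick action a with probability exp(Q[a]/tau) / sum_b exp(Q[b]/tau)."""
    actions = list(Q)
    prefs = [math.exp(Q[a] / tau) for a in actions]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return random.choices(actions, weights=probs, k=1)[0]

Q = {1: 0.1, 2: 0.0, 3: 0.05, 4: 0.4, 5: 0.2}
print(softmax_action(Q, tau=0.1))   # low tau  -> close to greedy
print(softmax_action(Q, tau=10.0))  # high tau -> nearly equiprobable
```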

  • AASS Örebro University- PhD Seminar 37

    Reinforcement Comparison

    - Central theme:
      - Actions followed by large rewards should recur more often (high reoccurrence).
      - Actions followed by small rewards should recur less often (low reoccurrence).

    - How do we judge what is "large" and "small"? A judgment problem:
      - Compare the reward with a reference level (a reference reward).
      - Reference reward: an average of previously received rewards.

    - Basic idea of reinforcement comparison:
      - A large reward means higher than average.
      - A small reward means lower than average.

    - Learning methods based on this basic idea are called reinforcement comparison methods.

  • AASS Örebro University- PhD Seminar 38

    Continued......

    - Reinforcement comparison methods maintain an overall reference reward.

    - To pick among the actions, they maintain a separate measure of their preference for each action:
      - $p_t(a)$: preference for action $a$ on play $t$.

    - The preference is used to determine action-selection probabilities via the softmax relationship:

      $\pi_t(a) = \Pr\{ a_t = a \} = \dfrac{e^{p_t(a)}}{\sum_{b=1}^{n} e^{p_t(b)}}$

    - After each play, the preference for the action selected on that play is updated:

      $p_{t+1}(a_t) = p_t(a_t) + \beta \big[ r_t - \bar{r}_t \big]$

      where $\beta$ is a positive step-size parameter.

  • AASS Örebro University- PhD Seminar 39

    Continued......

    - Higher rewards increase the probability of reselecting that action, while lower rewards decrease it.

    - The reference reward is updated as:

      $\bar{r}_{t+1} = \bar{r}_t + \alpha \big[ r_t - \bar{r}_t \big]$

      where $\alpha$ is a step-size parameter.

    (Both updates are sketched together in code below.)
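    Putting the two updates together, here is a short Python sketch of a reinforcement comparison agent on a toy bandit. The reward function `pull` and the parameter values are made up for illustration.

```python
import math
import random

def pull(a):
    """Hypothetical bandit arm with action-dependent mean reward."""
    return a * 0.2 + random.gauss(0.0, 1.0)

n, alpha, beta = 5, 0.1, 0.1
p = [0.0] * n          # preferences p_t(a)
r_bar = 0.0            # reference reward (running average)

for t in range(2000):
    # Softmax over preferences: pi_t(a) = exp(p(a)) / sum_b exp(p(b))
    weights = [math.exp(x) for x in p]
    a = random.choices(range(n), weights=weights, k=1)[0]
    r = pull(a)
    p[a] += beta * (r - r_bar)            # strengthen/weaken preference vs. reference
    r_bar += alpha * (r - r_bar)          # update the reference reward

print(p)   # the highest-mean arm should end up with the largest preference
```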

  • AASS Örebro University- PhD Seminar 40

    Actor-Critic Methods

    - TD methods that have a separate memory structure to represent the policy, independent of the value function.

    1. Policy structure: the ACTOR, used to select actions.
    2. Estimated value function: the CRITIC, which criticizes the actions made by the actor; typically a state-value function.

  • AASS Örebro University- PhD Seminar 41

    Continued......

    - An extension of the idea of reinforcement comparison methods to TD learning and to the full reinforcement learning problem.

    - The critic critiques whatever policy is being followed by the actor.

    - The criticism (a scalar signal) takes the form of the TD error; it is the sole output of the critic and drives all learning.

    - Working concept:
      - After each action selection, the critic evaluates the new state to determine whether things have gone better or worse than expected. That evaluation is the TD error:

        $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$

      - $V$ is the current value function implemented by the critic.
      - The TD error is used to evaluate the action just selected:
        - If the TD error is positive, the tendency to select $a_t$ is strengthened for the future.
        - If the TD error is negative, the tendency to select $a_t$ is weakened for the future.

  • AASS Örebro University- PhD Seminar 42

    Continued......

    - Suppose actions are generated by the Gibbs softmax method:

      $\pi_t(s, a) = \Pr\{ a_t = a \mid s_t = s \} = \dfrac{e^{p(s,a)}}{\sum_{b=1}^{n} e^{p(s,b)}}$

    - $p(s, a)$ indicates the tendency to select (the preference for) each action $a$ when in each state $s$.

    - The $p(s, a)$ are the values, at time $t$, of the modifiable policy parameters of the actor.

    - Preference update rule (for strengthening and weakening):

      $p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \delta_t$

      where $\beta$ is a positive step-size parameter.

    (A compact code sketch of this scheme follows below.)
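    A compact Python sketch of this actor-critic scheme, combining the TD-error critic with the softmax-preference actor. It assumes the same illustrative `env.reset()`/`env.step()` interface as the earlier sketches and is not the presenter's code.

```python
import math
import random
from collections import defaultdict

def actor_critic(env, n_actions, episodes=500, alpha=0.1, beta=0.1, gamma=0.9):
    """Actor-critic: critic learns V via the TD error, actor adjusts softmax preferences."""
    V = defaultdict(float)                       # critic: state-value estimates
    p = defaultdict(float)                       # actor: preferences p(s, a)

    def act(s):                                  # Gibbs softmax over preferences
        weights = [math.exp(p[(s, a)]) for a in range(n_actions)]
        return random.choices(range(n_actions), weights=weights, k=1)[0]

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = act(s)
            s2, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[s2]) - V[s]   # TD error
            V[s] += alpha * delta                                  # critic update
            p[(s, a)] += beta * delta                              # actor update
            s = s2
    return p, V
```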

  • AASS Örebro University- PhD Seminar 43

    Significant Advantages

    - Minimal computation time to select actions:
      - the policy is explicitly stored, which avoids extensive computation at selection time.

    - They can learn a stochastic policy:
      - i.e., learn the optimal probabilities of selecting the various actions.

  • AASS Örebro University- PhD Seminar 44

    Research Paper

    Direct Reinforcement Adaptive Learning (DRAL) Fuzzy Logic Control for a Class of Nonlinear Systems

    (Proceedings of the 12th IEEE International Symposium on Intelligent Control)

  • AASS Örebro University- PhD Seminar 45

    Paper Abstract

    - Application of RL techniques to feedback control of nonlinear systems using an adaptive fuzzy logic system (FLS).

    - Nonlinear systems are difficult to control when a good model is required.

    - This paper shows that an adaptive FLS can handle the nonlinearities through RL, with no preliminary off-line learning.

    - Approach presented:
      - The FLS is indirectly told about the effects of its control action on the system performance.
      - On-line learning is based on a binary reinforcement signal from a critic, without knowing the nonlinearity appearing in the system.

    "Direct Reinforcement Learning"

  • AASS Örebro University- PhD Seminar 46

    Major Modules

    - Fuzzy Logic System (FLS)

    - Reinforcement Learning (actor-critic method)

    - FLS + RL training algorithm → Adaptive FLS

    - System to be controlled:
      - a nonlinear plant model,
      - with nonlinearity and disturbances.

  • AASS Örebro University- PhD Seminar 47

    FLS, RL & Adaptive FLS

    - Fuzzy control: a model-free approach, well known as a nonlinear control technique.

    - Adaptive FLS = FLS + RL training algorithm:
      - FLS: a collection of IF-THEN rules.
      - Training algorithm: adjusts the parameters of the FLS membership functions according to input/output data.

    - Reinforcement learning:
      - Actions followed by large rewards recur more often (high reoccurrence); actions followed by small rewards recur less often (low reoccurrence).

    - Extending this idea so that action selection depends on state information brings in aspects of feedback control, pattern recognition, and associative learning.

  • AASS Örebro University- PhD Seminar 48

    Continued......

    - Why is RL so important here?
      - The ability to generate the correct control action in situations where it is difficult to define a priori the desired output for each input.

    - The critic in the RL scheme provides a scalar evaluation signal:
      - much less information than the desired outputs required in supervised learning.

    - Supervised learning and direct inverse control assume the a priori desired output is known.

  • AASS Örebro University- PhD Seminar 49

    Adaptive Fuzzy Logic System

    Basic Configuration of Adaptive Fuzzy Logic System

  • AASS Örebro University- PhD Seminar 50

    Continued......

    - Parameters to be adjusted in the adaptive FLS:

      - the control representative value of each rule $L$ ($\theta$),
      - the parameters of the membership functions ($m$, $\sigma$),
      - the fuzzy basis function (FBF): the strength of each rule.

  • AASS Örebro University- PhD Seminar 51

    Continued......

    - The centroid $m_i^l$ and width $\sigma_i^l$ are real-valued parameters of the Gaussian membership function for the $i$-th input variable and $l$-th rule.

    - Adjustable parameters (adaptive FBF): $m_i^l$ and $\sigma_i^l$.

    - Constraints: $m_i^l \in U_i$ (the physical domain of the input) and $\sigma^l \neq 0$.

    - The FLS output is then given by:

  • AASS Örebro University- PhD Seminar 52

    Continued......

    - The nonlinear function to be approximated can be represented in this form, with the parameters provided by the adaptation learning algorithm.

    - The adaptation learning algorithm is presented later.

  • AASS Örebro University- PhD Seminar 53

    DRAL Controller Design

    Proposed DRAL Controller, illustrating overall Adaptive Learning Scheme

  • AASS Örebro University- PhD Seminar 54

    DRAL Controller Architecture

    - The DRAL controller is a combination of:
      - an action-generating FLS (using the reinforcement signal as an adaptive learning signal), and
      - a fixed-gain controller in the performance-measurement loop, which uses an error based on the given reference trajectory.

    - As the adaptive learning process proceeds:
      - the fixed-gain controller has less and less influence on the control action,
      - while the FLS gains more and more influence on the control action.

    - The DRAL controller is designed to learn how to generate the best control action at each time instant, in the absence of complete information about the plant and the disturbances.

  • AASS Örebro University- PhD Seminar 55

    Continued......

    - The critic is a binary sign function (output +1 or −1), based on a performance measure.

    - This evaluative signal (the binary reinforcement signal) is used for the generation of correct control actions by the FLS.

    - The critic works as a supervisor to the FLS and controls the learning of the FLS by checking the actual performance against the performance measures.

  • AASS Örebro University- PhD Seminar 56

    System Dynamics

    - The plant can be represented as:

  • AASS Örebro University- PhD Seminar 57

    Continued......

    - The full tracking error r(t) is not allowed to be used for tuning the overall FLS; only a reduced reinforcement signal R is allowed:

      R = sign(r)

    - Proposed control law (the input to the plant):

  • AASS Örebro University- PhD Seminar 58

    Adaptive Learning Algorithm

    - The derived adaptive learning algorithm for tuning the adjustable FLS parameters:

    - Design matrices determine the learning rate, and Ki > 0 is a design parameter governing the speed of convergence.

  • AASS Örebro University- PhD Seminar 59

    Simulation Results

    - The dynamic equation for the n-link manipulator is as follows:

    - The dynamic equation for the 2-link robot manipulator is:

    - M(q) is assumed known; the nonlinearities are unknown.

  • AASS Örebro University- PhD Seminar 60

    Continued......

    - Simulation parameters:
      - Lengths: l1 = l2 = 1 m; masses: m1 = m2 = 1 kg
      - Desired trajectory:
        - qd1(t) = sin(t) (rad)
        - qd2(t) = cos(t) (rad)
      - Linear gain: Kv = diag[20 20]

    - States: 4 input dimensions
      - Three Gaussian membership functions per input dimension
      - 81 fuzzy IF-THEN rules can be generated: rules = (M.F. per input)^(inputs) = 3^4 = 81

    - Trajectory-tracking performance was studied for 3 cases:
      - with only the fixed gain in the performance measure,
      - the action-generating FLS using the DRAL algorithm,
      - the DRAL algorithm in the presence of mass changes.

    - Programming: Turbo C

  • AASS Örebro University- PhD Seminar 61

    Continued......

    - Performance with only the fixed gain in the performance measure.

    Performance with Kv only

  • AASS Örebro University- PhD Seminar 62

    Continued......

    - Performance with the action-generating FLS using the DRAL algorithm:

      - Trajectory-following ability: fairly good
      - Kv = diag[20 20]
      - The DRAL controller cancels the nonlinearity in the robot system
      - The robot dynamics are unknown to the DRAL controller

  • AASS Örebro University- PhD Seminar 63

    Continued......

    - Performance with the DRAL algorithm in the presence of mass changes:
      - Mass m2 changes (to verify adaptation and robustness)
      - At time t1 = 5 s the mass changes from 1.0 kg to 2.5 kg (picked up)
      - At time t2 = 12 s the mass changes from 2.5 kg to 1.0 kg (released)
      - Same parameters as before
      - Fairly satisfactory performance
      - The system dynamics remain unknown to the DRAL controller

  • AASS Örebro University- PhD Seminar 64

    Key Features

    - No need for off-line training.

    - The binary reinforcement signal is used directly for adjusting the FLS parameters.

    - Controller reusability: the same controller works even if the dynamics change.

  • AASS Örebro University- PhD Seminar 65

    Reference Books & Research Papers

    - Reference books:
      - Sutton & Barto, Reinforcement Learning: An Introduction.
      - "Reinforcement Learning: A Survey", Journal of Artificial Intelligence Research 4 (1996) 237-285.

    - Lecture notes:
      - Machine Learning course lecture slides (by Denny & Achim).
      - Ivan's lecture notes on Soft Computing & Control.

    - Research papers:
      - Young H. Kim & Frank L. Lewis, "Direct Reinforcement Adaptive Learning Fuzzy Logic Control".
      - Ihsan Omur & Mohamed Ali Zohdy, "Reinforcement Learning Control of Nonlinear Multi-Link System".

  • AASS Örebro University- PhD Seminar 66

    THANKS