F.L. Lewis, NAI
Moncrief-O'Donnell Chair, UTA Research Institute (UTARI), The University of Texas at Arlington, USA
and
Qian Ren Consulting Professor, State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang, China

Reinforcement Learning and Adaptive Dynamic Programming (ADP) for Discrete-Time Systems

Talk available online at http://www.UTA.edu/UTARI/acs

Supported by: NSF, AFOSR Europe, ONR (Marc Steinberg), US TARDEC
Supported by: China NNSF, China Project 111
Invited by Tianyou Chai
Kung Tz (Confucius, 孔子), 500 BC
Archery, Chariot driving, Music, Rites and Rituals, Poetry, Mathematics
Man's relations to: Family, Friends, Society, Nation, Emperor, Ancestors
Tian xia da tong: Harmony under heaven

Meng Tz (Mencius), 500 BC
"He who exerts his mind to the utmost knows nature's pattern."
"The way of learning is none other than finding the lost mind."
Man's task is to understand patterns in nature and society.
SHORT COURSE OVERVIEW
Day 1- Markov Decision Processes and Discrete-time Approximate Dynamic Programming (ADP)
Day 2- ADP and Online Differential Games for Continuous-time Systems
Day 3- Data-driven Real-time Process Optimization, Supervisory Control, and Structured Controllers
Books
F.L. Lewis, D. Vrabie, and V. Syrmos, Optimal Control, third edition, John Wiley and Sons, New York, 2012. New chapters on: Reinforcement Learning, Differential Games.
D. Vrabie, K. Vamvoudakis, and F.L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, IET Press, 2012.
F.L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits & Systems Magazine, Invited Feature Article, pp. 32-50, Third Quarter 2009.
F. Lewis, D. Vrabie, and K. Vamvoudakis, "Reinforcement learning and feedback control," IEEE Control Systems Magazine, Dec. 2012.
Importance of Feedback Control
- Darwin: FB and natural selection
- Volterra: FB and fish population balance
- Adam Smith: FB and the international economy
- James Watt: FB and the steam engine
- FB and cell homeostasis
The resources available to most species for their survival are meager and limited
Nature uses Optimal control
Cell Homeostasis: the individual cell is a complex feedback control system. It pumps ions across the cell membrane to maintain homeostasis, and has only limited energy to do so.
Cellular Metabolism
Permeability control of the cell membrane
http://www.accessexcellence.org/RC/VL/GG/index.html
Optimality in Biological Systems
Optimality in Control Systems Design: R. Kalman, 1960
Rocket Orbit Injection
http://microsat.sm.bmstu.ru/e-library/Launch/Dnepr_GEO.pdf
Dynamics:
$\dot r = w$
$\dot v = -\dfrac{vw}{r} + \dfrac{F}{m}\cos\phi$
$\dot w = \dfrac{v^2}{r} - \dfrac{\mu}{r^2} + \dfrac{F}{m}\sin\phi$
Objectives: get to orbit in minimum time; use minimum fuel.
Adaptive Control is generally not Optimal
Optimal Control is off-line, and needs to know the system dynamics to solve the design equations.
We want ONLINE DIRECT ADAPTIVE OPTIMAL control, for any performance cost of our own choosing.
Reinforcement Learning turns out to be the key to this!
F.L. Lewis, D. Vrabie, and V. Syrmos, Optimal Control, 3rd edition, 2012, Chapter 11.
F.L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits & Systems Magazine, Invited Feature Article, pp. 32-50, Third Quarter 2009.
Different methods of learning
[Figure: reinforcement learning loop. An adaptive learning system applies control inputs to the system; the environment returns outputs, which a critic compares against the desired performance to generate a reinforcement signal that tunes the actor.]
Reinforcement learning: Ivan Pavlov, 1890s
Actor-Critic Learning
We want OPTIMAL performance: ADP, Approximate Dynamic Programming
Outline
- Data-based online optimal control for discrete-time systems
- Policy iteration
- Value iteration (HDP, Heuristic Dynamic Programming)
- Q learning
Discrete-Time Linear Systems, Quadratic Cost (LQR)

System: $x_{k+1} = Ax_k + Bu_k$
Cost: $V(x_k) = \sum_{i=k}^{\infty}\big(x_i^T Q x_i + u_i^T R u_i\big)$

Fact: the cost is quadratic, $V(x_k) = x_k^T P x_k$ for some symmetric matrix P.

The Hamiltonian gives the Bellman equation:
$V(x_k) = x_k^T Q x_k + u_k^T R u_k + V(x_{k+1})$
$x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P x_{k+1}$
$x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + (Ax_k + Bu_k)^T P (Ax_k + Bu_k)$

The optimal control comes from the stationarity condition on $\min_{u_k}\big(x_k^T Q x_k + u_k^T R u_k + V(x_{k+1})\big)$:
$0 = 2Ru_k + 2B^T P(Ax_k + Bu_k)$
Optimal control: $u_k = -(R + B^T P B)^{-1} B^T P A\, x_k$
DT Riccati equation: $0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A$
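As a quick numerical illustration, here is a minimal Python sketch (A, B, Q, R are placeholder matrices, not from the slides) that iterates the DT Riccati equation to a fixed point and forms the optimal gain:

```python
# Minimal sketch: solve the DT algebraic Riccati equation by fixed-point
# iteration, then form the optimal LQR gain u_k = -L x_k.
import numpy as np

A = np.array([[0.9, 0.2], [0.0, 0.8]])   # placeholder system matrices
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

P = np.zeros_like(Q)
for _ in range(500):
    # P <- A'PA + Q - A'PB (R + B'PB)^{-1} B'PA
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = A.T @ P @ A + Q - A.T @ P @ B @ K

L_gain = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # optimal feedback gain
print("P =\n", P, "\nL =\n", L_gain)
```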
Bellman Equation: Linear Systems, Quadratic Cost (LQR)

System: $x_{k+1} = Ax_k + Bu_k$
Cost: $V(x_k) = \sum_{i=k}^{\infty}\big(x_i^T Q x_i + u_i^T R u_i\big)$
The cost is quadratic: $V(x_k) = x_k^T P x_k$ for some symmetric matrix P.

Bellman equation:
$V(x_k) = x_k^T Q x_k + u_k^T R u_k + V(x_{k+1})$
$x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P x_{k+1}$
$x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + (Ax_k + Bu_k)^T P (Ax_k + Bu_k)$

Assume ANY stabilizing fixed feedback policy $u_k = -Kx_k$. Then the Bellman equation is a Lyapunov equation:
$0 = x_k^T\big[(A - BK)^T P (A - BK) - P + Q + K^T R K\big]x_k$
and $V(x_k) = x_k^T P x_k$ is the cost of using the SVFB $u_k = -Kx_k$.
[Figure: the LQR design consists of an off-line design loop that solves the ARE for the gain L, plus an on-line real-time control loop in which $u = -Lx$ acts on the system.]
Optimal Control: The Linear Quadratic Regulator (LQR)

An offline design procedure that requires knowledge of the system dynamics model (A, B). System modeling is expensive, time consuming, and inaccurate.

User-prescribed optimization criterion: $J = J(Q, R)$
System: $x_{k+1} = Ax_k + Bu_k$
ARE: $0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A$
Gain: $L = (R + B^T P B)^{-1} B^T P A$
1. System dynamics
2. Value/cost function
3. Bellman equation
4. HJ solution equation (Riccati eq.)
5. Policy iteration – gives the structure we need
We want to find optimal control solutions:
- by a data-based method
- online in real-time
- using adaptive control techniques
- without knowing the full dynamics
For nonlinear systems and general performance indices
Discrete-Time Systems: Optimal Adaptive Control

System: $x_{k+1} = f(x_k) + g(x_k)u_k$
Cost: $V_h(x_k) = \sum_{i=k}^{\infty} r(x_i, u_i) = r(x_k, u_k) + \sum_{i=k+1}^{\infty} r(x_i, u_i)$
Example: $r(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k$

Control policy: $u_k = h(x_k)$, the prescribed control input function. Example: $u_k = -Kx_k$, linear state variable feedback.

Bellman equation, the difference-equation equivalent of the cost:
$V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1}), \quad V_h(0) = 0$
e.g. $V_h(x_k) = x_k^T Q x_k + u_k^T R u_k + V_h(x_{k+1})$
Discrete-Time Optimal Adaptive Control

System: $x_{k+1} = f(x_k) + g(x_k)u_k$
Control policy: $u_k = h(x_k)$, the prescribed control policy
Cost: $V_h(x_k) = \sum_{i=k}^{\infty} r(x_i, u_i)$
Bellman equation: $V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1}), \quad V_h(0) = 0$
Hamiltonian: $H(x_k, V_h, h) = r(x_k, h(x_k)) + V_h(x_{k+1}) - V_h(x_k)$

Bellman's optimality principle gives the DT HJB equation for the optimal cost:
$V^*(x_k) = \min_{u_k}\big(r(x_k, u_k) + V^*(x_{k+1})\big)$
Optimal control: $h^*(x_k) = \arg\min_{u_k}\big(r(x_k, u_k) + V^*(x_{k+1})\big)$

Focus on these two equations. The offline HJB solution is difficult to solve and contains the dynamics.
Discrete-Time Optimal Control
Solutions by the Computational Intelligence community:

Bellman equation: $V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$
where $u_k = h(x_k)$ is the prescribed control policy and $V_h(x_k) = \sum_{i=k}^{\infty} r(x_i, h(x_i))$.

Theorem (Policy Evaluation for any given current policy): Let $V_h(x_k)$ solve the Bellman equation. Then $V_h(x_k)$ gives the value for any prescribed control policy. The policy must be stabilizing. The dynamics do not appear.
Optimal control (Bellman's result): $h^*(x_k) = \arg\min_{u_k}\big(r(x_k, u_k) + V^*(x_{k+1})\big)$

What about $h'(x_k) = \arg\min_{u_k}\big(r(x_k, u_k) + V_h(x_{k+1})\big)$ for a given policy $h(\cdot)$?

Theorem (Policy Improvement, Bertsekas): Let $V_h(x_k)$ be the value of any given policy $h(x_k)$. Then $V_{h'}(x_k) \le V_h(x_k)$. This is the one-step improvement property of rollout algorithms.
DT Policy Iteration

The cost for any given control policy $h(x_k)$ satisfies the recursion (Bellman equation; a recursive form, a consistency equation):
$V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$
e.g. control policy = SVFB: $h(x_k) = -Lx_k$

Recursive solution (Actor/Critic structure): pick a stabilizing initial control, then iterate.
Policy Evaluation: $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$
Policy Improvement: $h_{j+1}(x_k) = \arg\min_{u_k}\big(r(x_k, u_k) + V_{j+1}(x_{k+1})\big)$
$f(\cdot)$ and $g(\cdot)$ do not appear in the policy evaluation step. Howard (1960) proved convergence for MDP.

Temporal difference: $e_k = -V_{j+1}(x_k) + r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$

For the cost $V(x_k) = \sum_{i=k}^{\infty}\big(x_i^T Q x_i + u(x_i)^T R u(x_i)\big)$, the policy evaluation step is
$V_{j+1}(x_k) = x_k^T Q x_k + u_j(x_k)^T R u_j(x_k) + V_{j+1}(x_{k+1})$
and the policy improvement step is
$u_{j+1}(x_k) = -\tfrac{1}{2}R^{-1}g(x_k)^T\,\dfrac{dV_{j+1}(x_{k+1})}{dx_{k+1}}$
Hewer proved convergence in 1971; the underlying equation is a DT Lyapunov equation.
DT Policy Iteration: Linear Systems, Quadratic Cost (LQR)

Equivalent to an underlying problem, the DT LQR: $x_{k+1} = Ax_k + Bu_k = (A - BL)x_k$ with $u_k = -Lx_k$. For any stabilizing policy, the LQR value is quadratic: $V(x) = x^T P x$.

DT policy iterations:
$(A - BL_j)^T P_{j+1} (A - BL_j) - P_{j+1} + Q + L_j^T R L_j = 0$
$L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A$

Policy iteration solves the Lyapunov equation WITHOUT knowing the system dynamics when the evaluation step is performed from measured data, as shown below; the matrix form above uses A and B.
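A minimal model-based sketch of these iterations (Hewer's algorithm); the matrices and the initial gain are illustrative placeholders, with $L_0 = 0$ stabilizing only because this placeholder A is already stable:

```python
# Sketch of Hewer's algorithm: model-based DT policy iteration for the LQR.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.9, 0.2], [0.0, 0.8]])   # placeholder system matrices
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
L = np.zeros((1, 2))                      # stabilizing here since A is stable

for j in range(20):
    Ac = A - B @ L                        # closed-loop matrix under current policy
    # Policy evaluation: (A-BL)' P (A-BL) - P + Q + L'RL = 0
    P = solve_discrete_lyapunov(Ac.T, Q + L.T @ R @ L)
    # Policy improvement: L = (R + B'PB)^{-1} B'PA
    L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

print("Converged gain L:\n", L)
```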
DT Policy Iteration: How to Implement Online? (Linear Systems, Quadratic Cost: LQR)

DT policy iterations:
$V_{j+1}(x_k) = x_k^T Q x_k + u_j(x_k)^T R u_j(x_k) + V_{j+1}(x_{k+1})$
$u_{j+1}(x_k) = -\tfrac{1}{2}R^{-1}g(x_k)^T\,\dfrac{dV_{j+1}(x_{k+1})}{dx_{k+1}}$

For the LQR, the policy evaluation step reads
$x_k^T P_{j+1} x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P_{j+1} x_{k+1}$
DT Policy Iteration: How to Implement Online? (Linear Systems, Quadratic Cost: LQR)

System: $x_{k+1} = Ax_k + Bu_k$. The LQR cost is quadratic: $V(x_k) = \sum_{i=k}^{\infty}\big(x_i^T Q x_i + u(x_i)^T R u(x_i)\big) = x_k^T P x_k$ for some matrix P.

Write the quadratic cost in terms of the quadratic basis set. For two states,
$x_k^T P x_k = [\,p_{11}\ \ 2p_{12}\ \ p_{22}\,]\,[\,x_{1k}^2\ \ x_{1k}x_{2k}\ \ x_{2k}^2\,]^T = W^T \phi(x_k)$

Policy evaluation, which solves the Lyapunov equation without knowing A and B:
$W_{j+1}^T\big(\phi(x_k) - \phi(x_{k+1})\big) = x_k^T Q x_k + u_j(x_k)^T R u_j(x_k)$

Then update the control using
$h_{j+1}(x_k) = -L_{j+1}x_k = -(R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A\, x_k$
Need to know A AND B for the control update.
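To make the weight/basis correspondence concrete, a small self-contained Python check (the helper name quad_basis and the P values are ours, purely illustrative):

```python
# Sketch: quadratic basis phi(x) = [x1^2, x1*x2, x2^2] for a 2-state system,
# with W = [p11, 2*p12, p22] so that x'Px == W'phi(x).
import numpy as np

def quad_basis(x):
    return np.array([x[0]**2, x[0] * x[1], x[1]**2])

P = np.array([[2.0, 0.3], [0.3, 1.0]])          # placeholder symmetric kernel
W = np.array([P[0, 0], 2 * P[0, 1], P[1, 1]])   # weight vector
x = np.array([0.7, -1.2])
print(np.isclose(x @ P @ x, W @ quad_basis(x)))  # True
```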
Implementation: DT Policy Iteration, Nonlinear Case

Value Function Approximation (VFA): $V(x) = W^T \phi(x)$, with weights W and basis functions $\phi(x)$.
LQR case: V(x) is quadratic, $V(x) = x^T P x = W^T \phi(x)$ with quadratic basis functions and $W^T = [\,p_{11}\ \ 2p_{12}\ \ p_{22}\ \cdots\,]$.
Nonlinear system case: use a neural network.

Value function update for a given control (Bellman equation): assume measurements of $x_k$ and $x_{k+1}$ are available. Then
$W_{j+1}^T\big(\phi(x_k) - \phi(x_{k+1})\big) = r(x_k, h_j(x_k))$
with regression matrix $\phi(x_k) - \phi(x_{k+1})$. Solve for the weights in real-time using RLS, or by batch LS over many trajectories with different initial conditions over a compact set. Since $x_{k+1}$ is measured, no knowledge of f(x) or g(x) is needed for the value function update.

Then update the control; $g(x_k)$ is needed for the control update.
Implementation: DT Policy Iteration

VFA: $V_j(x) = W_j^T \phi(x)$. This is indirect adaptive control with identification of the optimal value.

Control update:
$u_{j+1}(x_k) = -\tfrac{1}{2}R^{-1}g(x_k)^T\,\dfrac{dV_{j+1}(x_{k+1})}{dx_{k+1}} = -\tfrac{1}{2}R^{-1}g(x_k)^T\,\nabla\phi(x_{k+1})^T W_{j+1}$

LQR case: $P = \begin{bmatrix} p_{11} & p_{12} \\ p_{12} & p_{22} \end{bmatrix}$, $W^T = [\,p_{11}\ \ 2p_{12}\ \ p_{22}\,]$.
Solving the IRL Bellman Equation: Reinforcement Learning, A DATA-BASED APPROACH

Data set at time [k, k+1): $\{x_k,\ r(x_k, h_j(x_k)),\ x_{k+1}\}$
At each step: observe $x_k$, apply the control, observe the cost $r(x_k, h_j(x_k))$ and the next state $x_{k+1}$.

Solve for the value function parameters from
$W_{j+1}^T\big(\phi(x_k) - \phi(x_{k+1})\big) = r(x_k, h_j(x_k))$
$W_{j+1}^T\big(\phi(x_{k+1}) - \phi(x_{k+2})\big) = r(x_{k+1}, h_j(x_{k+1}))$
$W_{j+1}^T\big(\phi(x_{k+2}) - \phi(x_{k+3})\big) = r(x_{k+2}, h_j(x_{k+2}))$
With 3 unknowns, data from 3 time intervals gives 3 equations. Now solve by batch least-squares, or do RLS until convergence to $W_{j+1}$; then update the control.

Solving the Bellman equation this way solves the Lyapunov equation without knowing the dynamics. It is a data-based approach that uses measurements of $x_k$, $u_k$ instead of the plant dynamical model; $f(x_k)$ is not needed anywhere.
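A sketch of one such data-based policy-evaluation step, under assumed placeholder matrices (the plant is used only to generate the measured data; the learner never reads A in the evaluation step, and B enters only the final improvement, as noted above):

```python
# Sketch: batch least-squares solution of W'(phi(x_k)-phi(x_{k+1})) = r_k
# along one trajectory under the current policy, then one improvement step.
import numpy as np

def quad_basis(x):                          # phi(x) = [x1^2, x1*x2, x2^2]
    return np.array([x[0]**2, x[0] * x[1], x[1]**2])

A = np.array([[0.9, 0.2], [0.0, 0.8]])      # placeholder plant (data generation only)
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
L = np.array([[0.1, 0.3]])                  # current stabilizing policy u = -L x

rows, targets = [], []
x = np.array([1.0, -1.0])
for k in range(30):                         # 3 unknowns -> need >= 3 equations
    u = -L @ x
    r = x @ Q @ x + u @ R @ u               # measured stage cost r(x_k, u_k)
    x_next = A @ x + B @ u                  # measured next state
    rows.append(quad_basis(x) - quad_basis(x_next))
    targets.append(float(r))
    x = x_next

W, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
P = np.array([[W[0], W[1] / 2], [W[1] / 2, W[2]]])       # value found from data only
L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)        # improvement: needs A and B
```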
Then update the control using
$u_{j+1}(x_k) = -\tfrac{1}{2}R^{-1}g(x_k)^T\,\nabla\phi(x_{k+1})^T W_{j+1}$
LQR case: $u_k(x) = -L_j x_k$. The control is applied continuously, every sample period T, while the gain (policy) is updated at discrete instants j = 0, 1, 2, ...
[Figure: timeline of gain updates $L_j$ versus sample index k. The update intervals j may not be the same length; they can be selected on-line in real time.]
Persistence of Excitation: the regression vector $\phi(x_k) - \phi(x_{k+1})$ in
$W_{j+1}^T\big(\phi(x_k) - \phi(x_{k+1})\big) = r(x_k, h_j(x_k))$
must be persistently exciting (PE).
The Adaptive Critic Architecture

[Figure: actor-critic structure. An action network implements the control policy $h_j(x_k)$ acting on the system; a critic network performs policy evaluation from the measured cost.]

Value update (use RLS until convergence): $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$
Control policy update: $h_{j+1}(x_k) = \arg\min_{u_k}\big(r(x_k, u_k) + V_{j+1}(x_{k+1})\big)$

Adaptive critics lead to an ONLINE FORWARD-IN-TIME implementation of optimal control: Optimal Adaptive Control.
Adaptive Control

[Figure: adaptive control loop, with the control input to the plant and the measured output.]

- Identify the controller: direct adaptive control
- Identify the system model: indirect adaptive control
- Identify the performance value $V(x) = W^T\phi(x)$: optimal adaptive control
Simulation Example: Linear System, Aircraft Longitudinal Dynamics

Unstable, two-input system:

A =
[ 1.0722   0.0954   0       -0.0541  -0.0153
  4.1534   1.1175   0       -0.8000  -0.1010
  0.1359   0.0071   1.0      0.0039   0.0097
  0        0        0        0.1353   0
  0        0        0        0        0.1353 ]

B =
[ -0.0453  -0.0175
  -1.0042  -0.1131
   0.0075   0.0134
   0.8647   0
   0        0.8647 ]

The HJB (i.e. ARE) solution, $0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A$:

P =
[ 55.8348   7.6670  16.0470  -4.6754  -0.7265
   7.6670   2.3168   1.4987  -0.8309  -0.1215
  16.0470   1.4987  25.3586  -0.6709   0.0464
  -4.6754  -0.8309  -0.6709   1.5394   0.0782
  -0.7265  -0.1215   0.0464   0.0782   1.0240 ]

L =
[ -4.1136  -0.7170  -0.3847   0.5277   0.0707
  -0.6315  -0.1003   0.1236   0.0653   0.0798 ]
• Simulation
• The cost function approximation uses a quadratic basis set:
$\hat V_{i+1}(x_k, W_{V,i+1}) = W_{V,i+1}^T\,\bar\phi(x_k)$
$\bar\phi(x) = [\,x_1^2\ x_1x_2\ x_1x_3\ x_1x_4\ x_1x_5\ x_2^2\ x_2x_3\ x_2x_4\ x_2x_5\ x_3^2\ x_3x_4\ x_3x_5\ x_4^2\ x_4x_5\ x_5^2\,]^T$
$W_V^T = [\,w_{V1}\ w_{V2}\ \cdots\ w_{V15}\,]$
• The policy approximation uses a linear basis set:
$\hat u_i(x_k) = W_{ui}^T\,\sigma(x_k)$, with $\sigma(x) = [\,x_1\ x_2\ x_3\ x_4\ x_5\,]^T$
$W_u^T = \begin{bmatrix} w^u_{11} & w^u_{12} & w^u_{13} & w^u_{14} & w^u_{15} \\ w^u_{21} & w^u_{22} & w^u_{23} & w^u_{24} & w^u_{25} \end{bmatrix}$
• Simulation: the convergence of the value.
$W_V^T$ = [55.5411 15.2789 31.3032 -9.3255 -1.4536 2.3142 2.9234 -1.6594 -0.2430 24.8262 -1.3076 0.0920 1.5388 0.1564 1.0240]

The value weights map to P entry-wise: the diagonal entries are the weights of the $x_i^2$ terms, and each off-diagonal entry is 0.5 times the weight of the corresponding cross term $x_i x_l$:
$P = \begin{bmatrix} w_{V1} & 0.5w_{V2} & 0.5w_{V3} & 0.5w_{V4} & 0.5w_{V5} \\ \cdot & w_{V6} & 0.5w_{V7} & 0.5w_{V8} & 0.5w_{V9} \\ \cdot & \cdot & w_{V10} & 0.5w_{V11} & 0.5w_{V12} \\ \cdot & \cdot & \cdot & w_{V13} & 0.5w_{V14} \\ \cdot & \cdot & \cdot & \cdot & w_{V15} \end{bmatrix}$ (symmetric)

Actual ARE solution, $0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A$:

P =
[ 55.8348   7.6670  16.0470  -4.6754  -0.7265
   7.6670   2.3168   1.4987  -0.8309  -0.1215
  16.0470   1.4987  25.3586  -0.6709   0.0464
  -4.6754  -0.8309  -0.6709   1.5394   0.0782
  -0.7265  -0.1215   0.0464   0.0782   1.0240 ]

The algorithm solves the ARE online using real-time data measurements $\{x_k,\ r(x_k, h_j(x_k)),\ x_{k+1}\}$ without knowing the A matrix.
Discrete-Time Nonlinear HJB Solution Using Approximate Dynamic Programming: Convergence Proof
• Simulation: the convergence of the control policy.
$W_u^T = \begin{bmatrix} 4.1068 & 0.7164 & 0.3756 & -0.5274 & -0.0707 \\ 0.6330 & 0.1005 & -0.1216 & -0.0653 & -0.0798 \end{bmatrix}$

The policy weights give the feedback gain entry-wise, $L_{il} = -w^u_{il}$. Actual optimal control:
$L = \begin{bmatrix} -4.1136 & -0.7170 & -0.3847 & 0.5277 & 0.0707 \\ -0.6315 & -0.1003 & 0.1236 & 0.0653 & 0.0798 \end{bmatrix}$

Note: in this example the drift dynamics matrix A is NOT needed; the Riccati equation is solved online without knowing the A matrix.
Greedy Value Function Update: Approximate Dynamic Programming. Value Iteration = Heuristic Dynamic Programming (HDP) (Paul Werbos)

Policy Iteration:
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$
$h_{j+1}(x_k) = \arg\min_{u_k}\big(r(x_k, u_k) + V_{j+1}(x_{k+1})\big)$
For the LQR, the underlying Riccati-equation recursion is a Lyapunov equation (Hewer 1971):
$(A - BL_j)^T P_{j+1} (A - BL_j) - P_{j+1} + Q + L_j^T R L_j = 0$
$L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A$
An initial stabilizing control is needed.

Value Iteration: the two occurrences of the cost in the Bellman equation allow definition of the greedy update
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_j(x_{k+1})$
$h_{j+1}(x_k) = \arg\min_{u_k}\big(r(x_k, u_k) + V_{j+1}(x_{k+1})\big)$
For the LQR, the underlying Riccati-equation recursion is a simple recursion (Lancaster & Rodman proved convergence):
$P_{j+1} = (A - BL_j)^T P_j (A - BL_j) + Q + L_j^T R L_j$
$L_{j+1} = (R + B^T P_j B)^{-1} B^T P_j A$
An initial stabilizing control is NOT needed.
Motivation for Value Iteration

PI policy evaluation step (LE = Lyapunov equation); needs a stabilizing gain:
$(A - BL_j)^T P_{j+1} (A - BL_j) - P_{j+1} + Q + L_j^T R L_j = 0$, i.e. $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$

VI policy evaluation step (MR = matrix recursion); does not need a stabilizing gain:
$P_{i+1} = (A - BL_j)^T P_i (A - BL_j) + Q + L_j^T R L_j$, i.e. $V_{i+1}(x_k) = r(x_k, h_j(x_k)) + V_i(x_{k+1})$

Theorem: Let the gain $L_j$ be fixed with $(A - BL_j)$ stable, and let $P_0 = 0$ in the MR. Then $P_i \to P_{j+1}$, the solution of the LE.

i.e. repeated application of the VI policy evaluation step is the same as one application of the PI policy evaluation step IF THE CONTROL POLICY IS NOT UPDATED. This is the idea of GPI (generalized policy iteration).
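A quick numerical check of this theorem, with placeholder matrices and a fixed gain:

```python
# Sketch: with L fixed and A-BL stable, iterating the VI matrix recursion
# from P0 = 0 converges to the solution of the PI Lyapunov equation.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.9, 0.2], [0.0, 0.8]])   # placeholder matrices
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
L = np.array([[0.1, 0.3]])               # fixed gain, A - BL stable
Ac, Qc = A - B @ L, Q + L.T @ R @ L

P_LE = solve_discrete_lyapunov(Ac.T, Qc)  # one PI policy-evaluation step

P = np.zeros((2, 2))                      # P0 = 0
for i in range(200):                      # repeated VI evaluation steps
    P = Ac.T @ P @ Ac + Qc
print(np.allclose(P, P_LE))               # True: VI iterates reach the LE solution
```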
Implementation: DT HDP (Value Iteration)

VFA: $V_j(x_k) = W_j^T \phi(x_k)$
Value function update for a given control: $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_j(x_{k+1})$
(Policy iteration was $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$.)

Assume measurements of $x_k$ and $x_{k+1}$ are available. Then
$W_{j+1}^T \phi(x_k) = r(x_k, h_j(x_k)) + W_j^T \phi(x_{k+1})$
with regression matrix $\phi(x_k)$ and the old weights $W_j$ on the right-hand side. Solve for the weights using RLS, or use many trajectories with different initial conditions over a compact set. Since $x_{k+1}$ is measured, no knowledge of f(x) or g(x) is needed for the value function update.

Then update the control using $h_{j+1}(x_k) = -L_{j+1}x_k = -(R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A\, x_k$; f(x_k) AND g(x_k), i.e. A and B, are needed for the control update.
Example 11.3-4: Policy Iteration and Value Iteration for the DT LQR

1. Policy Iteration: Hewer's Algorithm
Value update: $V_{j+1}(x_k) = x_k^T Q x_k + u_k^T R u_k + V_{j+1}(x_{k+1})$,
i.e. $x_k^T P^{j+1} x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P^{j+1} x_{k+1}$,
equivalent to the Lyapunov equation
$0 = (A + BK^j)^T P^{j+1} (A + BK^j) - P^{j+1} + Q + (K^j)^T R K^j$.
Policy improvement: $h_{j+1}(x_k) = K^{j+1} x_k = \arg\min_{u_k}\big(x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P^{j+1} x_{k+1}\big)$,
$K^{j+1} = -(B^T P^{j+1} B + R)^{-1} B^T P^{j+1} A$.

2. Value Iteration: Lyapunov recursions
$x_k^T P^{j+1} x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P^j x_{k+1}$,
$P^{j+1} = (A + BK^j)^T P^j (A + BK^j) + Q + (K^j)^T R K^j$.

3. Iterative Policy Evaluation
$P^{j+1} = (A + BK)^T P^j (A + BK) + Q + K^T R K$;
this recursion converges to the solution of the Lyapunov equation.
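A minimal sketch of the value-iteration recursion of item 2, with placeholder matrices; note that no initial stabilizing gain is used, and the placeholder A is deliberately unstable:

```python
# Sketch: DT LQR value iteration (Lyapunov recursion), starting from P = 0.
import numpy as np

A = np.array([[1.1, 0.2], [0.0, 0.8]])   # placeholder; A itself is unstable
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

P = np.zeros((2, 2))                      # no stabilizing initial gain needed
for j in range(300):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)      # policy improvement
    P = (A - B @ K).T @ P @ (A - B @ K) + Q + K.T @ R @ K  # value update
print("VI gain:\n", K)
```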
All algorithms have a model-free scalar form based on measuring states, and a model-based matrix form based on cancelling states
For MDP, the value iteration update reads
$V_{j+1}(x) = \sum_u \pi_j(x, u) \sum_{x'} P^u_{xx'}\big(R^u_{xx'} + V_j(x')\big)$
Motor control runs at about 200 Hz. Oscillation is a fundamental property of neural tissue; the brain has multiple adaptive clocks with different timescales:
- theta rhythm, 4-10 Hz: hippocampus, thalamus; sensory processing, memory, and voluntary control of movement
- gamma rhythms, 30-100 Hz: hippocampus and neocortex; high cognitive activity
- consolidation of memory; spatial mapping of the environment (place cells)
The high-frequency processing is due to the large amounts of sensory data to be processed.
D. Vrabie and F.L. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially-unknown nonlinear systems," Neural Networks, vol. 22, no. 3, pp. 237-246, Apr. 2009.
Work with Dan Levine, UTA Dept. of Psychology.
D. Vrabie, F.L. Lewis, and D. Levine, "Neural Network-Based Adaptive Optimal Controller: A Continuous-Time Formulation," Proc. Int. Conf. Intelligent Control, Shanghai, Sept. 2008.
Doya, Kimura, Kawato 2001
[Figure: Summary of motor control in the human nervous system. Loops run from the cerebral cortex (motor areas) through the thalamus, basal ganglia, limbic system, cerebellum, and brainstem to the spinal cord, with interoceptive and exteroceptive receptors feeding back from muscle contraction and movement.]

[Figure (picture by E. Stingu and D. Vrabie): a hierarchy of multiple parallel loops with different learning mechanisms and timescales: reflex (spinal cord), supervised learning (cerebellum, inferior olive; eye movement), reinforcement learning with dopamine (limbic system), unsupervised learning (hippocampus), and long- and short-term memory functions. Timescales: motor control ~200 Hz, gamma rhythms 30-100 Hz, theta rhythms 4-10 Hz.]

This is an adaptive actor-critic structure: reinforcement learning at theta waves (4-8 Hz), motor control at 200 Hz.
Q Learning

Four ADP methods proposed by Paul Werbos (Adaptive/Approximate Dynamic Programming); the critic NN approximates:
- Heuristic dynamic programming (HDP): the value $V(x_k)$ (Value Iteration)
- Dual heuristic programming (DHP): the gradient $\partial V / \partial x$
- Action-dependent heuristic dynamic programming (ADHDP): the Q function $Q(x_k, u_k)$ (Watkins Q Learning)
- Action-dependent dual heuristic programming (ADDHP): the gradients $\partial Q / \partial x$, $\partial Q / \partial u$
An action NN approximates the control.

Bertsekas: Neurodynamic Programming.
Barto & Bradtke: Q-learning proof (imposed a settling time).
Q Function Definition

Value function recursion for a given policy $h(x_k)$: $V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$

Define the Q function, with $u_k$ arbitrary and the policy $h(\cdot)$ used after time k:
$Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1})$
Note that $Q_h(x_k, h(x_k)) = V_h(x_k)$.
Recursion for Q: $Q_h(x_k, u_k) = r(x_k, u_k) + Q_h(x_{k+1}, h(x_{k+1}))$

Simple expression of Bellman's principle:
$V^*(x_k) = \min_{u_k} Q^*(x_k, u_k)$, $\quad h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k)$
This is Action Dependent ADP.
Optimal Adaptive Control for completely unknown DT systems
[Timeline: at time k observe $x_k$ and apply $u_k$; from time k+1 on, apply the policy $u = h(x)$ and observe $x_{k+1}$.]

Specify a control policy $u_j = h(x_j)$, $j = k+1, k+2, \ldots$: $u_k$ is arbitrary and the policy $h(\cdot)$ is used after time k.

Define the Q function: $Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1})$
Note that $Q_h(x_k, h(x_k)) = V_h(x_k)$.
Bellman equation for Q: $Q_h(x_k, u_k) = r(x_k, u_k) + Q_h(x_{k+1}, h(x_{k+1}))$

Optimal Q function: $Q^*(x_k, u_k) = r(x_k, u_k) + V^*(x_{k+1})$, so that
$Q^*(x_k, u_k) = r(x_k, u_k) + Q^*(x_{k+1}, h^*(x_{k+1}))$

Optimal control solution:
$V^*(x_k) = Q^*(x_k, h^*(x_k)) = \min_h Q^*(x_k, h(x_k))$, $\quad h^*(x_k) = \arg\min_h Q^*(x_k, h(x_k))$
Simple expression of Bellman's principle:
$V^*(x_k) = \min_{u_k} Q^*(x_k, u_k)$, $\quad h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k)$
Q Function HDP: Action Dependent HDP

The Q function for any given control policy $h(x_k)$ satisfies the Bellman equation
$Q_h(x_k, u_k) = r(x_k, u_k) + Q_h(x_{k+1}, h(x_{k+1}))$

Recursive solution: pick a stabilizing initial control policy, then iterate.
Find the Q function: $Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_{j+1}(x_{k+1}, h_j(x_{k+1}))$
Update the control: $h_{j+1}(x_k) = \arg\min_{u_k} Q_{j+1}(x_k, u_k)$

Now $f(x_k, u_k)$ is not needed: Q learning does not need to know $f(x_k)$ or $g(x_k)$.
Bradtke & Barto (1994) proved convergence for the LQR.
$Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1})$

For the LQR, $V(x) = W^T\phi(x) = x^T P x$ is quadratic in x, and
$Q(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k + (Ax_k + Bu_k)^T P (Ax_k + Bu_k)$

Q is quadratic in x and u:
$Q(x_k, u_k) = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T H \begin{bmatrix} x_k \\ u_k \end{bmatrix} = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} Q + A^T P A & A^T P B \\ B^T P A & R + B^T P B \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix}$

The control update is found from
$0 = \dfrac{\partial Q}{\partial u} = 2(R + B^T P B)u_k + 2B^T P A x_k = 2H_{uu}u_k + 2H_{ux}x_k$
so
$u_k = -(R + B^T P B)^{-1} B^T P A\, x_k = -H_{uu}^{-1} H_{ux} x_k = -L_j x_k$

The control is found only from the Q function: A and B are not needed.
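A small sketch forming H from a Riccati solution (computed here with scipy's solve_discrete_are; the matrices are placeholders) and checking that the gain read off the H blocks matches the model-based LQR gain:

```python
# Sketch: build the LQR Q-function kernel H and read the gain off its blocks.
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[0.9, 0.2], [0.0, 0.8]])   # placeholder matrices
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
P = solve_discrete_are(A, B, Q, R)

n = A.shape[0]
H = np.block([[Q + A.T @ P @ A, A.T @ P @ B],
              [B.T @ P @ A,     R + B.T @ P @ B]])
L_from_Q = np.linalg.solve(H[n:, n:], H[n:, :n])  # u = -H_uu^{-1} H_ux x
print(np.allclose(L_from_Q,
                  np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)))  # True
```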
Implementation: DT Q Function Policy Iteration

Q Policy Iteration (Bradtke & Barto):
$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_{j+1}(x_{k+1}, -L_j x_{k+1})$, with $u_k = -L_j x_k$
Control policy update: $h_{j+1}(x_k) = \arg\min_{u_k} Q_{j+1}(x_k, u_k)$

QFA (Q function approximation): $Q(x, u) = W^T \phi(x, u)$. Now u is an input to the NN (Werbos: action-dependent NN); for the LQR case, $\phi$ is the quadratic basis in (x, u).

The Q function update for the control is given as follows. Assume measurements of $u_k$, $x_k$, and $x_{k+1}$ are available to compute $u_{k+1}$. Then
$W_{j+1}^T\big(\phi(x_k, u_k) - \phi(x_{k+1}, -L_j x_{k+1})\big) = r(x_k, u_k)$
with regression matrix $\phi(x_k, u_k) - \phi(x_{k+1}, -L_j x_{k+1})$. Solve for the weights using RLS or backprop. Since $x_{k+1}$ is measured in the training phase, no knowledge of f(x) or g(x) is needed for the value function update.

Control update from the Q function alone: $u_k = -H_{uu}^{-1} H_{ux} x_k = -L_{j+1} x_k$.
This is model-free policy iteration (Bradtke, Ydstie, Barto).
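A sketch of this model-free Q-function policy iteration under stated assumptions: the placeholder plant below is used only to generate measured data, probing noise supplies persistence of excitation, and the helper names phi and weights_to_H are ours:

```python
# Sketch: model-free Q policy iteration for the LQR via batch least-squares.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.2], [0.0, 0.8]])   # placeholder plant, unknown to the learner
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

def phi(z):                               # quadratic basis in z = [x1, x2, u]
    return np.array([z[0]**2, z[0]*z[1], z[0]*z[2],
                     z[1]**2, z[1]*z[2], z[2]**2])

def weights_to_H(W):                      # W = [h11, 2h12, 2h13, h22, 2h23, h33]
    return np.array([[W[0],   W[1]/2, W[2]/2],
                     [W[1]/2, W[3],   W[4]/2],
                     [W[2]/2, W[4]/2, W[5]]])

L = np.zeros((1, 2))                      # initial stabilizing gain (A stable here)
for j in range(10):                       # Q policy iteration
    x = np.array([1.0, -1.0])
    rows, targets = [], []
    for k in range(40):
        u = -L @ x + 0.1 * rng.standard_normal(1)   # probing noise for PE
        r = x @ Q @ x + u @ R @ u                   # measured cost r(x_k, u_k)
        x_next = A @ x + B @ u                      # measured; model never used
        u_next = -L @ x_next                        # policy applied after time k
        rows.append(phi(np.concatenate([x, u])) -
                    phi(np.concatenate([x_next, u_next])))
        targets.append(float(r))
        x = x_next
    W, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    H = weights_to_H(W)
    L = np.linalg.solve(H[2:, 2:], H[2:, :2])       # L_{j+1} = H_uu^{-1} H_ux
print("Learned gain L:\n", L)
```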
Greedy Q Function Update: Approximate Dynamic Programming, ADP Method 3 (Q Learning)
Action-Dependent Heuristic Dynamic Programming (ADHDP), Paul Werbos: model-free HDP

Greedy Q update:
$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_j(x_{k+1}, h_j(x_{k+1}))$
$W_{j+1}^T \phi(x_k, u_k) = r(x_k, u_k) + W_j^T \phi(x_{k+1}, -L_j x_{k+1}) \equiv$ target
Update the weights by RLS or backprop.

For Q policy iteration a stable initial control is needed; for the greedy (value iteration) update a stable initial control is NOT needed.
Direct OPTIMAL ADAPTIVE CONTROL
Q learning actually solves the Riccati Equation WITHOUT knowing the plant dynamics
Model-free ADP
Works for Nonlinear Systems
Proofs? Robustness? Comparison with adaptive control methods?
ADHDP Application for Power system
• System Description
• The discrete-time model is obtained by applying ZOH sampling to the continuous-time model

$x(t) = [\,\Delta f(t)\ \ \Delta P_g(t)\ \ \Delta X_g(t)\ \ \Delta F(t)\,]^T$

$\dot x(t) = \begin{bmatrix} -1/T_p & K_p/T_p & 0 & 0 \\ 0 & -1/T_T & 1/T_T & 0 \\ -1/(R\,T_G) & 0 & -1/T_G & -1/T_G \\ K_E & 0 & 0 & 0 \end{bmatrix} x(t) + \begin{bmatrix} 0 \\ 0 \\ 1/T_G \\ 0 \end{bmatrix} u(t) + \begin{bmatrix} -K_p/T_p \\ 0 \\ 0 \\ 0 \end{bmatrix} \Delta P_d(t)$

with parameter ranges
$1/T_p \in [0.033, 0.1]$, $K_p/T_p \in [4, 12]$, $1/T_T \in [2.564, 4.762]$, $1/T_G \in [9.615, 17.857]$, $1/(R\,T_G) \in [3.081, 10.639]$
ADHDP Application for Power system
• The system state:
- ∆f: incremental frequency deviation (Hz)
- ∆Pg: incremental change in generator output (p.u. MW)
- ∆Xg: incremental change in governor position (p.u. MW)
- ∆F: incremental change in integral control
- ∆Pd: the load disturbance (p.u. MW)
• The system parameters:
- TG: the governor time constant
- TT: turbine time constant
- TP: plant model time constant
- Kp: plant model gain
- R: speed regulation due to governor action
- KE: integral control gain
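A sketch of the ZOH discretization step using scipy.signal.cont2discrete, assuming the model structure reconstructed above; the numerical parameter values (midpoints of the stated ranges), the integral gain K_E, and the sample time T are illustrative placeholders, not the values used in the slides:

```python
# Sketch: build the CT load-frequency model and discretize it by ZOH.
import numpy as np
from scipy.signal import cont2discrete

inv_Tp, Kp_over_Tp = 0.0665, 8.0           # placeholder midpoints of the ranges
inv_TT, inv_TG, inv_RTG = 3.663, 13.736, 6.86
K_E, T = 0.6, 0.1                          # placeholder integral gain, sample time (s)

Ac = np.array([[-inv_Tp,   Kp_over_Tp, 0.0,      0.0],
               [ 0.0,     -inv_TT,     inv_TT,   0.0],
               [-inv_RTG,  0.0,       -inv_TG,  -inv_TG],
               [ K_E,      0.0,        0.0,      0.0]])
Bc = np.array([[0.0], [0.0], [inv_TG], [0.0]])

Ad, Bd, _, _, _ = cont2discrete((Ac, Bc, np.eye(4), np.zeros((4, 1))),
                                T, method='zoh')
print("Ad =\n", Ad, "\nBd =\n", Bd)
```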
ADHDP Application for Power system
• ADHDP policy tuning
[Figure: ADHDP policy tuning. Left: convergence of the critic parameters P11, P12, P13, P22, P23, P33, P34, P44 versus time k (0 to 3000). Right: convergence of the control policy parameters L11, L12, L13, L14 versus time k (0 to 3000).]
Converges to the optimal GARE solution and the optimal game control.
• Comparison
The ADHDP controller design vs. the design from [1]: the maximum frequency deviation when using the ADHDP controller is improved by 19.3% over the controller designed in [1].
[1] Y. Wang, R. Zhou, and C. Wen, "Robust load-frequency controller design for power systems," IEE Proc.-C, vol. 140, no. 1, 1993.
[Figure: state responses x1-x4 (frequency deviation, incremental change of generator output, incremental change of governor position, incremental change of integral control) over 0-20 s. With the ADHDP controller the peak frequency deviation is -0.2024 at t = 0.5 s; with the design from [1] it is -0.2507 at t = 0.58 s.]