F.L. Lewis, NAI
Moncrief-O'Donnell Chair, UTA Research Institute (UTARI), The University of Texas at Arlington, USA
and
Qian Ren Consulting Professor, State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang, China

Reinforcement Learning and Adaptive Dynamic Programming (ADP) for Discrete-Time Systems

Talk available online at http://www.UTA.edu/UTARI/acs

Supported by: NSF, AFOSR Europe, ONR (Marc Steinberg), US TARDEC
Supported by: China NNSF, China Project 111
Invited by Tianyou Chai
Kung Tz (Confucius, 孔子), 500 BC
Archery, Chariot driving, Music, Rites and Rituals, Poetry, Mathematics
Man's relations to: Family, Friends, Society, Nation, Emperor, Ancestors
Tian xia da tong: Harmony under heaven

Meng Tz (Mencius), 500 BC
"He who exerts his mind to the utmost knows nature's pattern."
"The way of learning is none other than finding the lost mind."
Man's task is to understand patterns in nature and society.
SHORT COURSE OVERVIEW
Day 1- Markov Decision Processes and Discrete-time Approximate Dynamic Programming (ADP)
Day 2- ADP and Online Differential Games for Continuous-time Systems
Day 3- Data-driven Real-time Process Optimization, Supervisory Control, and Structured Controllers
Books
F.L. Lewis, D. Vrabie, and V. Syrmos, Optimal Control, third edition, John Wiley and Sons, New York, 2012. New chapters on: Reinforcement Learning, Differential Games.
D. Vrabie, K. Vamvoudakis, and F.L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, IET Press, 2012.
F.L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits & Systems Magazine, Invited Feature Article, pp. 32-50, Third Quarter 2009.
F. Lewis, D. Vrabie, and K. Vamvoudakis, "Reinforcement learning and feedback control," IEEE Control Systems Magazine, Dec. 2012.
Importance of Feedback Control
- Darwin: FB and natural selection
- Volterra: FB and fish population balance
- Adam Smith: FB and the international economy
- James Watt: FB and the steam engine
- FB and cell homeostasis
The resources available to most species for their survival are meager and limited
Nature uses Optimal control
Cell Homeostasis: the individual cell is a complex feedback control system. It pumps ions across the cell membrane to maintain homeostasis, and has only limited energy to do so.
Cellular Metabolism
Permeability control of the cell membrane
http://www.accessexcellence.org/RC/VL/GG/index.html
Optimality in Biological Systems
Optimality in Control Systems Design: R. Kalman, 1960
Rocket Orbit Injection
http://microsat.sm.bmstu.ru/e-library/Launch/Dnepr_GEO.pdf
Dynamics:
$\dot r = w$
$\dot v = -\dfrac{vw}{r} + \dfrac{F}{m}\cos\phi$
$\dot w = \dfrac{v^2}{r} - \dfrac{\mu}{r^2} + \dfrac{F}{m}\sin\phi$
Objectives: get to orbit in minimum time; use minimum fuel.
Adaptive Control is generally not Optimal
Optimal Control is off-line, and needs to know the system dynamics to solve the design equations.
We want ONLINE DIRECT ADAPTIVE OPTIMAL control, for any performance cost of our own choosing.
Reinforcement Learning turns out to be the key to this!
F.L. Lewis, D. Vrabie, and V. Syrmos, Optimal Control, 3rd edition, 2012, Chapter 11.
F.L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits & Systems Magazine, Invited Feature Article, pp. 32-50, Third Quarter 2009.
Different methods of learning
[Figure: reinforcement learning loop. An adaptive learning system applies control inputs to the system; the environment returns outputs, which a critic compares against the desired performance to generate a reinforcement signal that tunes the actor.]
Reinforcement learning: Ivan Pavlov, 1890s
Actor-Critic Learning
We want OPTIMAL performance: ADP, Approximate Dynamic Programming
Outline
- Data-based online optimal control for discrete-time systems
- Policy iteration
- Value iteration (HDP, Heuristic Dynamic Programming)
- Q learning
Discrete-Time Linear Systems, Quadratic Cost (LQR)

System: $x_{k+1} = Ax_k + Bu_k$
Cost: $V(x_k) = \sum_{i=k}^{\infty}\big(x_i^T Q x_i + u_i^T R u_i\big)$

Fact: the cost is quadratic, $V(x_k) = x_k^T P x_k$ for some symmetric matrix P.

The Hamiltonian gives the Bellman equation:
$V(x_k) = x_k^T Q x_k + u_k^T R u_k + V(x_{k+1})$
$x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P x_{k+1}$
$x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + (Ax_k + Bu_k)^T P (Ax_k + Bu_k)$

The optimal control comes from the stationarity condition on $\min_{u_k}\big(x_k^T Q x_k + u_k^T R u_k + V(x_{k+1})\big)$:
$0 = 2Ru_k + 2B^T P(Ax_k + Bu_k)$
Optimal control: $u_k = -(R + B^T P B)^{-1} B^T P A\, x_k$
DT Riccati equation: $0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A$
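As a quick numerical illustration, here is a minimal Python sketch (A, B, Q, R are placeholder matrices, not from the slides) that iterates the DT Riccati equation to a fixed point and forms the optimal gain:

```python
# Minimal sketch: solve the DT algebraic Riccati equation by fixed-point
# iteration, then form the optimal LQR gain u_k = -L x_k.
import numpy as np

A = np.array([[0.9, 0.2], [0.0, 0.8]])   # placeholder system matrices
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

P = np.zeros_like(Q)
for _ in range(500):
    # P <- A'PA + Q - A'PB (R + B'PB)^{-1} B'PA
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = A.T @ P @ A + Q - A.T @ P @ B @ K

L_gain = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # optimal feedback gain
print("P =\n", P, "\nL =\n", L_gain)
```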
Bellman Equation: Linear Systems, Quadratic Cost (LQR)

System: $x_{k+1} = Ax_k + Bu_k$
Cost: $V(x_k) = \sum_{i=k}^{\infty}\big(x_i^T Q x_i + u_i^T R u_i\big)$
The cost is quadratic: $V(x_k) = x_k^T P x_k$ for some symmetric matrix P.

Bellman equation:
$V(x_k) = x_k^T Q x_k + u_k^T R u_k + V(x_{k+1})$
$x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P x_{k+1}$
$x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + (Ax_k + Bu_k)^T P (Ax_k + Bu_k)$

Assume ANY stabilizing fixed feedback policy $u_k = -Kx_k$. Then the Bellman equation is a Lyapunov equation:
$0 = x_k^T\big[(A - BK)^T P (A - BK) - P + Q + K^T R K\big]x_k$
and $V(x_k) = x_k^T P x_k$ is the cost of using the SVFB $u_k = -Kx_k$.
[Figure: the LQR design consists of an off-line design loop that solves the ARE for the gain L, plus an on-line real-time control loop in which $u = -Lx$ acts on the system.]
Optimal Control: The Linear Quadratic Regulator (LQR)

An offline design procedure that requires knowledge of the system dynamics model (A, B). System modeling is expensive, time consuming, and inaccurate.

User-prescribed optimization criterion: $J = J(Q, R)$
System: $x_{k+1} = Ax_k + Bu_k$
ARE: $0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A$
Gain: $L = (R + B^T P B)^{-1} B^T P A$
1. System dynamics
2. Value/cost function
3. Bellman equation
4. HJ solution equation (Riccati eq.)
5. Policy iteration – gives the structure we need
We want to find optimal control solutions:
- by a data-based method
- online in real-time
- using adaptive control techniques
- without knowing the full dynamics
For nonlinear systems and general performance indices
Discrete-Time Systems: Optimal Adaptive Control

System: $x_{k+1} = f(x_k) + g(x_k)u_k$
Cost: $V_h(x_k) = \sum_{i=k}^{\infty} r(x_i, u_i) = r(x_k, u_k) + \sum_{i=k+1}^{\infty} r(x_i, u_i)$
Example: $r(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k$

Control policy: $u_k = h(x_k)$, the prescribed control input function. Example: $u_k = -Kx_k$, linear state variable feedback.

Bellman equation, the difference-equation equivalent of the cost:
$V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1}), \quad V_h(0) = 0$
e.g. $V_h(x_k) = x_k^T Q x_k + u_k^T R u_k + V_h(x_{k+1})$
Discrete-Time Optimal Adaptive Control

System: $x_{k+1} = f(x_k) + g(x_k)u_k$
Control policy: $u_k = h(x_k)$, the prescribed control policy
Cost: $V_h(x_k) = \sum_{i=k}^{\infty} r(x_i, u_i)$
Bellman equation: $V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1}), \quad V_h(0) = 0$
Hamiltonian: $H(x_k, V_h, h) = r(x_k, h(x_k)) + V_h(x_{k+1}) - V_h(x_k)$

Bellman's optimality principle gives the DT HJB equation for the optimal cost:
$V^*(x_k) = \min_{u_k}\big(r(x_k, u_k) + V^*(x_{k+1})\big)$
Optimal control: $h^*(x_k) = \arg\min_{u_k}\big(r(x_k, u_k) + V^*(x_{k+1})\big)$

Focus on these two equations. The offline HJB solution is difficult to solve and contains the dynamics.
Discrete-Time Optimal Control
Solutions by the Computational Intelligence community:

Bellman equation: $V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$
where $u_k = h(x_k)$ is the prescribed control policy and $V_h(x_k) = \sum_{i=k}^{\infty} r(x_i, h(x_i))$.

Theorem (Policy Evaluation for any given current policy): Let $V_h(x_k)$ solve the Bellman equation. Then $V_h(x_k)$ gives the value for any prescribed control policy. The policy must be stabilizing. The dynamics do not appear.
Optimal control (Bellman's result): $h^*(x_k) = \arg\min_{u_k}\big(r(x_k, u_k) + V^*(x_{k+1})\big)$

What about $h'(x_k) = \arg\min_{u_k}\big(r(x_k, u_k) + V_h(x_{k+1})\big)$ for a given policy $h(\cdot)$?

Theorem (Policy Improvement, Bertsekas): Let $V_h(x_k)$ be the value of any given policy $h(x_k)$. Then $V_{h'}(x_k) \le V_h(x_k)$. This is the one-step improvement property of rollout algorithms.
DT Policy Iteration

The cost for any given control policy $h(x_k)$ satisfies the recursion (Bellman equation; a recursive form, a consistency equation):
$V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$
e.g. control policy = SVFB: $h(x_k) = -Lx_k$

Recursive solution (Actor/Critic structure): pick a stabilizing initial control, then iterate.
Policy Evaluation: $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$
Policy Improvement: $h_{j+1}(x_k) = \arg\min_{u_k}\big(r(x_k, u_k) + V_{j+1}(x_{k+1})\big)$
$f(\cdot)$ and $g(\cdot)$ do not appear in the policy evaluation step. Howard (1960) proved convergence for MDP.

Temporal difference: $e_k = -V_{j+1}(x_k) + r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$

For the cost $V(x_k) = \sum_{i=k}^{\infty}\big(x_i^T Q x_i + u(x_i)^T R u(x_i)\big)$, the policy evaluation step is
$V_{j+1}(x_k) = x_k^T Q x_k + u_j(x_k)^T R u_j(x_k) + V_{j+1}(x_{k+1})$
and the policy improvement step is
$u_{j+1}(x_k) = -\tfrac{1}{2}R^{-1}g(x_k)^T\,\dfrac{dV_{j+1}(x_{k+1})}{dx_{k+1}}$
Hewer proved convergence in 1971; the underlying equation is a DT Lyapunov equation.
DT Policy Iteration: Linear Systems, Quadratic Cost (LQR)

Equivalent to an underlying problem, the DT LQR: $x_{k+1} = Ax_k + Bu_k = (A - BL)x_k$ with $u_k = -Lx_k$. For any stabilizing policy, the LQR value is quadratic: $V(x) = x^T P x$.

DT policy iterations:
$(A - BL_j)^T P_{j+1} (A - BL_j) - P_{j+1} + Q + L_j^T R L_j = 0$
$L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A$

Policy iteration solves the Lyapunov equation WITHOUT knowing the system dynamics when the evaluation step is performed from measured data, as shown below; the matrix form above uses A and B.
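A minimal model-based sketch of these iterations (Hewer's algorithm); the matrices and the initial gain are illustrative placeholders, with $L_0 = 0$ stabilizing only because this placeholder A is already stable:

```python
# Sketch of Hewer's algorithm: model-based DT policy iteration for the LQR.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.9, 0.2], [0.0, 0.8]])   # placeholder system matrices
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
L = np.zeros((1, 2))                      # stabilizing here since A is stable

for j in range(20):
    Ac = A - B @ L                        # closed-loop matrix under current policy
    # Policy evaluation: (A-BL)' P (A-BL) - P + Q + L'RL = 0
    P = solve_discrete_lyapunov(Ac.T, Q + L.T @ R @ L)
    # Policy improvement: L = (R + B'PB)^{-1} B'PA
    L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

print("Converged gain L:\n", L)
```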
DT Policy Iteration: How to Implement Online? (Linear Systems, Quadratic Cost: LQR)

DT policy iterations:
$V_{j+1}(x_k) = x_k^T Q x_k + u_j(x_k)^T R u_j(x_k) + V_{j+1}(x_{k+1})$
$u_{j+1}(x_k) = -\tfrac{1}{2}R^{-1}g(x_k)^T\,\dfrac{dV_{j+1}(x_{k+1})}{dx_{k+1}}$

For the LQR, the policy evaluation step reads
$x_k^T P_{j+1} x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P_{j+1} x_{k+1}$
DT Policy Iteration: How to Implement Online? (Linear Systems, Quadratic Cost: LQR)

System: $x_{k+1} = Ax_k + Bu_k$. The LQR cost is quadratic: $V(x_k) = \sum_{i=k}^{\infty}\big(x_i^T Q x_i + u(x_i)^T R u(x_i)\big) = x_k^T P x_k$ for some matrix P.

Write the quadratic cost in terms of the quadratic basis set. For two states,
$x_k^T P x_k = [\,p_{11}\ \ 2p_{12}\ \ p_{22}\,]\,[\,x_{1k}^2\ \ x_{1k}x_{2k}\ \ x_{2k}^2\,]^T = W^T \phi(x_k)$

Policy evaluation, which solves the Lyapunov equation without knowing A and B:
$W_{j+1}^T\big(\phi(x_k) - \phi(x_{k+1})\big) = x_k^T Q x_k + u_j(x_k)^T R u_j(x_k)$

Then update the control using
$h_{j+1}(x_k) = -L_{j+1}x_k = -(R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A\, x_k$
Need to know A AND B for the control update.
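To make the weight/basis correspondence concrete, a small self-contained Python check (the helper name quad_basis and the P values are ours, purely illustrative):

```python
# Sketch: quadratic basis phi(x) = [x1^2, x1*x2, x2^2] for a 2-state system,
# with W = [p11, 2*p12, p22] so that x'Px == W'phi(x).
import numpy as np

def quad_basis(x):
    return np.array([x[0]**2, x[0] * x[1], x[1]**2])

P = np.array([[2.0, 0.3], [0.3, 1.0]])          # placeholder symmetric kernel
W = np.array([P[0, 0], 2 * P[0, 1], P[1, 1]])   # weight vector
x = np.array([0.7, -1.2])
print(np.isclose(x @ P @ x, W @ quad_basis(x)))  # True
```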
Implementation: DT Policy Iteration, Nonlinear Case

Value Function Approximation (VFA): $V(x) = W^T \phi(x)$, with weights W and basis functions $\phi(x)$.
LQR case: V(x) is quadratic, $V(x) = x^T P x = W^T \phi(x)$ with quadratic basis functions and $W^T = [\,p_{11}\ \ 2p_{12}\ \ p_{22}\ \cdots\,]$.
Nonlinear system case: use a neural network.

Value function update for a given control (Bellman equation): assume measurements of $x_k$ and $x_{k+1}$ are available. Then
$W_{j+1}^T\big(\phi(x_k) - \phi(x_{k+1})\big) = r(x_k, h_j(x_k))$
with regression matrix $\phi(x_k) - \phi(x_{k+1})$. Solve for the weights in real-time using RLS, or by batch LS over many trajectories with different initial conditions over a compact set. Since $x_{k+1}$ is measured, no knowledge of f(x) or g(x) is needed for the value function update.

Then update the control; $g(x_k)$ is needed for the control update.
Implementation: DT Policy Iteration

VFA: $V_j(x) = W_j^T \phi(x)$. This is indirect adaptive control with identification of the optimal value.

Control update:
$u_{j+1}(x_k) = -\tfrac{1}{2}R^{-1}g(x_k)^T\,\dfrac{dV_{j+1}(x_{k+1})}{dx_{k+1}} = -\tfrac{1}{2}R^{-1}g(x_k)^T\,\nabla\phi(x_{k+1})^T W_{j+1}$

LQR case: $P = \begin{bmatrix} p_{11} & p_{12} \\ p_{12} & p_{22} \end{bmatrix}$, $W^T = [\,p_{11}\ \ 2p_{12}\ \ p_{22}\,]$.
Solving the IRL Bellman Equation: Reinforcement Learning, A DATA-BASED APPROACH

Data set at time [k, k+1): $\{x_k,\ r(x_k, h_j(x_k)),\ x_{k+1}\}$
At each step: observe $x_k$, apply the control, observe the cost $r(x_k, h_j(x_k))$ and the next state $x_{k+1}$.

Solve for the value function parameters from
$W_{j+1}^T\big(\phi(x_k) - \phi(x_{k+1})\big) = r(x_k, h_j(x_k))$
$W_{j+1}^T\big(\phi(x_{k+1}) - \phi(x_{k+2})\big) = r(x_{k+1}, h_j(x_{k+1}))$
$W_{j+1}^T\big(\phi(x_{k+2}) - \phi(x_{k+3})\big) = r(x_{k+2}, h_j(x_{k+2}))$
With 3 unknowns, data from 3 time intervals gives 3 equations. Now solve by batch least-squares, or do RLS until convergence to $W_{j+1}$; then update the control.

Solving the Bellman equation this way solves the Lyapunov equation without knowing the dynamics. It is a data-based approach that uses measurements of $x_k$, $u_k$ instead of the plant dynamical model; $f(x_k)$ is not needed anywhere.
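A sketch of one such data-based policy-evaluation step, under assumed placeholder matrices (the plant is used only to generate the measured data; the learner never reads A in the evaluation step, and B enters only the final improvement, as noted above):

```python
# Sketch: batch least-squares solution of W'(phi(x_k)-phi(x_{k+1})) = r_k
# along one trajectory under the current policy, then one improvement step.
import numpy as np

def quad_basis(x):                          # phi(x) = [x1^2, x1*x2, x2^2]
    return np.array([x[0]**2, x[0] * x[1], x[1]**2])

A = np.array([[0.9, 0.2], [0.0, 0.8]])      # placeholder plant (data generation only)
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
L = np.array([[0.1, 0.3]])                  # current stabilizing policy u = -L x

rows, targets = [], []
x = np.array([1.0, -1.0])
for k in range(30):                         # 3 unknowns -> need >= 3 equations
    u = -L @ x
    r = x @ Q @ x + u @ R @ u               # measured stage cost r(x_k, u_k)
    x_next = A @ x + B @ u                  # measured next state
    rows.append(quad_basis(x) - quad_basis(x_next))
    targets.append(float(r))
    x = x_next

W, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
P = np.array([[W[0], W[1] / 2], [W[1] / 2, W[2]]])       # value found from data only
L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)        # improvement: needs A and B
```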
Then update the control using
$u_{j+1}(x_k) = -\tfrac{1}{2}R^{-1}g(x_k)^T\,\nabla\phi(x_{k+1})^T W_{j+1}$
LQR case: $u_k(x) = -L_j x_k$. The control is applied continuously, every sample period T, while the gain (policy) is updated at discrete instants j = 0, 1, 2, ...
[Figure: timeline of gain updates $L_j$ versus sample index k. The update intervals j may not be the same length; they can be selected on-line in real time.]
Persistence of Excitation: the regression vector $\phi(x_k) - \phi(x_{k+1})$ in
$W_{j+1}^T\big(\phi(x_k) - \phi(x_{k+1})\big) = r(x_k, h_j(x_k))$
must be persistently exciting (PE).
The Adaptive Critic Architecture

[Figure: actor-critic structure. An action network implements the control policy $h_j(x_k)$ acting on the system; a critic network performs policy evaluation from the measured cost.]

Value update (use RLS until convergence): $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$
Control policy update: $h_{j+1}(x_k) = \arg\min_{u_k}\big(r(x_k, u_k) + V_{j+1}(x_{k+1})\big)$

Adaptive critics lead to an ONLINE FORWARD-IN-TIME implementation of optimal control: Optimal Adaptive Control.
Adaptive Control

[Figure: adaptive control loop, with the control input to the plant and the measured output.]

- Identify the controller: direct adaptive control
- Identify the system model: indirect adaptive control
- Identify the performance value $V(x) = W^T\phi(x)$: optimal adaptive control
Simulation Example: Linear System, Aircraft Longitudinal Dynamics

Unstable, two-input system:

A =
[ 1.0722   0.0954   0       -0.0541  -0.0153
  4.1534   1.1175   0       -0.8000  -0.1010
  0.1359   0.0071   1.0      0.0039   0.0097
  0        0        0        0.1353   0
  0        0        0        0        0.1353 ]

B =
[ -0.0453  -0.0175
  -1.0042  -0.1131
   0.0075   0.0134
   0.8647   0
   0        0.8647 ]

The HJB (i.e. ARE) solution, $0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A$:

P =
[ 55.8348   7.6670  16.0470  -4.6754  -0.7265
   7.6670   2.3168   1.4987  -0.8309  -0.1215
  16.0470   1.4987  25.3586  -0.6709   0.0464
  -4.6754  -0.8309  -0.6709   1.5394   0.0782
  -0.7265  -0.1215   0.0464   0.0782   1.0240 ]

L =
[ -4.1136  -0.7170  -0.3847   0.5277   0.0707
  -0.6315  -0.1003   0.1236   0.0653   0.0798 ]
• Simulation
• The cost function approximation uses a quadratic basis set:
$\hat V_{i+1}(x_k, W_{V,i+1}) = W_{V,i+1}^T\,\bar\phi(x_k)$
$\bar\phi(x) = [\,x_1^2\ x_1x_2\ x_1x_3\ x_1x_4\ x_1x_5\ x_2^2\ x_2x_3\ x_2x_4\ x_2x_5\ x_3^2\ x_3x_4\ x_3x_5\ x_4^2\ x_4x_5\ x_5^2\,]^T$
$W_V^T = [\,w_{V1}\ w_{V2}\ \cdots\ w_{V15}\,]$
• The policy approximation uses a linear basis set:
$\hat u_i(x_k) = W_{ui}^T\,\sigma(x_k)$, with $\sigma(x) = [\,x_1\ x_2\ x_3\ x_4\ x_5\,]^T$
$W_u^T = \begin{bmatrix} w^u_{11} & w^u_{12} & w^u_{13} & w^u_{14} & w^u_{15} \\ w^u_{21} & w^u_{22} & w^u_{23} & w^u_{24} & w^u_{25} \end{bmatrix}$
• Simulation: the convergence of the value.
$W_V^T$ = [55.5411 15.2789 31.3032 -9.3255 -1.4536 2.3142 2.9234 -1.6594 -0.2430 24.8262 -1.3076 0.0920 1.5388 0.1564 1.0240]

The value weights map to P entry-wise: the diagonal entries are the weights of the $x_i^2$ terms, and each off-diagonal entry is 0.5 times the weight of the corresponding cross term $x_i x_l$:
$P = \begin{bmatrix} w_{V1} & 0.5w_{V2} & 0.5w_{V3} & 0.5w_{V4} & 0.5w_{V5} \\ \cdot & w_{V6} & 0.5w_{V7} & 0.5w_{V8} & 0.5w_{V9} \\ \cdot & \cdot & w_{V10} & 0.5w_{V11} & 0.5w_{V12} \\ \cdot & \cdot & \cdot & w_{V13} & 0.5w_{V14} \\ \cdot & \cdot & \cdot & \cdot & w_{V15} \end{bmatrix}$ (symmetric)

Actual ARE solution, $0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A$:

P =
[ 55.8348   7.6670  16.0470  -4.6754  -0.7265
   7.6670   2.3168   1.4987  -0.8309  -0.1215
  16.0470   1.4987  25.3586  -0.6709   0.0464
  -4.6754  -0.8309  -0.6709   1.5394   0.0782
  -0.7265  -0.1215   0.0464   0.0782   1.0240 ]

The algorithm solves the ARE online using real-time data measurements $\{x_k,\ r(x_k, h_j(x_k)),\ x_{k+1}\}$ without knowing the A matrix.
Discrete-Time Nonlinear HJB Solution Using Approximate Dynamic Programming: Convergence Proof
• Simulation: the convergence of the control policy.
$W_u^T = \begin{bmatrix} 4.1068 & 0.7164 & 0.3756 & -0.5274 & -0.0707 \\ 0.6330 & 0.1005 & -0.1216 & -0.0653 & -0.0798 \end{bmatrix}$

The policy weights give the feedback gain entry-wise, $L_{il} = -w^u_{il}$. Actual optimal control:
$L = \begin{bmatrix} -4.1136 & -0.7170 & -0.3847 & 0.5277 & 0.0707 \\ -0.6315 & -0.1003 & 0.1236 & 0.0653 & 0.0798 \end{bmatrix}$

Note: in this example the drift dynamics matrix A is NOT needed; the Riccati equation is solved online without knowing the A matrix.
Greedy Value Function Update: Approximate Dynamic Programming. Value Iteration = Heuristic Dynamic Programming (HDP) (Paul Werbos)

Policy Iteration:
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$
$h_{j+1}(x_k) = \arg\min_{u_k}\big(r(x_k, u_k) + V_{j+1}(x_{k+1})\big)$
For the LQR, the underlying Riccati-equation recursion is a Lyapunov equation (Hewer 1971):
$(A - BL_j)^T P_{j+1} (A - BL_j) - P_{j+1} + Q + L_j^T R L_j = 0$
$L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A$
An initial stabilizing control is needed.

Value Iteration: the two occurrences of the cost in the Bellman equation allow definition of the greedy update
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_j(x_{k+1})$
$h_{j+1}(x_k) = \arg\min_{u_k}\big(r(x_k, u_k) + V_{j+1}(x_{k+1})\big)$
For the LQR, the underlying Riccati-equation recursion is a simple recursion (Lancaster & Rodman proved convergence):
$P_{j+1} = (A - BL_j)^T P_j (A - BL_j) + Q + L_j^T R L_j$
$L_{j+1} = (R + B^T P_j B)^{-1} B^T P_j A$
An initial stabilizing control is NOT needed.
Motivation for Value Iteration

PI policy evaluation step (LE = Lyapunov equation); needs a stabilizing gain:
$(A - BL_j)^T P_{j+1} (A - BL_j) - P_{j+1} + Q + L_j^T R L_j = 0$, i.e. $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$

VI policy evaluation step (MR = matrix recursion); does not need a stabilizing gain:
$P_{i+1} = (A - BL_j)^T P_i (A - BL_j) + Q + L_j^T R L_j$, i.e. $V_{i+1}(x_k) = r(x_k, h_j(x_k)) + V_i(x_{k+1})$

Theorem: Let the gain $L_j$ be fixed with $(A - BL_j)$ stable, and let $P_0 = 0$ in the MR. Then $P_i \to P_{j+1}$, the solution of the LE.

i.e. repeated application of the VI policy evaluation step is the same as one application of the PI policy evaluation step IF THE CONTROL POLICY IS NOT UPDATED. This is the idea of GPI (generalized policy iteration).
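A quick numerical check of this theorem, with placeholder matrices and a fixed gain:

```python
# Sketch: with L fixed and A-BL stable, iterating the VI matrix recursion
# from P0 = 0 converges to the solution of the PI Lyapunov equation.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.9, 0.2], [0.0, 0.8]])   # placeholder matrices
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
L = np.array([[0.1, 0.3]])               # fixed gain, A - BL stable
Ac, Qc = A - B @ L, Q + L.T @ R @ L

P_LE = solve_discrete_lyapunov(Ac.T, Qc)  # one PI policy-evaluation step

P = np.zeros((2, 2))                      # P0 = 0
for i in range(200):                      # repeated VI evaluation steps
    P = Ac.T @ P @ Ac + Qc
print(np.allclose(P, P_LE))               # True: VI iterates reach the LE solution
```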
Implementation: DT HDP (Value Iteration)

VFA: $V_j(x_k) = W_j^T \phi(x_k)$
Value function update for a given control: $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_j(x_{k+1})$
(Policy iteration was $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$.)

Assume measurements of $x_k$ and $x_{k+1}$ are available. Then
$W_{j+1}^T \phi(x_k) = r(x_k, h_j(x_k)) + W_j^T \phi(x_{k+1})$
with regression matrix $\phi(x_k)$ and the old weights $W_j$ on the right-hand side. Solve for the weights using RLS, or use many trajectories with different initial conditions over a compact set. Since $x_{k+1}$ is measured, no knowledge of f(x) or g(x) is needed for the value function update.

Then update the control using $h_{j+1}(x_k) = -L_{j+1}x_k = -(R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A\, x_k$; f(x_k) AND g(x_k), i.e. A and B, are needed for the control update.
Example 11.3-4: Policy Iteration and Value Iteration for the DT LQR

1. Policy Iteration: Hewer's Algorithm
Value update: $V_{j+1}(x_k) = x_k^T Q x_k + u_k^T R u_k + V_{j+1}(x_{k+1})$,
i.e. $x_k^T P^{j+1} x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P^{j+1} x_{k+1}$,
equivalent to the Lyapunov equation
$0 = (A + BK^j)^T P^{j+1} (A + BK^j) - P^{j+1} + Q + (K^j)^T R K^j$.
Policy improvement: $h_{j+1}(x_k) = K^{j+1} x_k = \arg\min_{u_k}\big(x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P^{j+1} x_{k+1}\big)$,
$K^{j+1} = -(B^T P^{j+1} B + R)^{-1} B^T P^{j+1} A$.

2. Value Iteration: Lyapunov recursions
$x_k^T P^{j+1} x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P^j x_{k+1}$,
$P^{j+1} = (A + BK^j)^T P^j (A + BK^j) + Q + (K^j)^T R K^j$.

3. Iterative Policy Evaluation
$P^{j+1} = (A + BK)^T P^j (A + BK) + Q + K^T R K$;
this recursion converges to the solution of the Lyapunov equation.
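A minimal sketch of the value-iteration recursion of item 2, with placeholder matrices; note that no initial stabilizing gain is used, and the placeholder A is deliberately unstable:

```python
# Sketch: DT LQR value iteration (Lyapunov recursion), starting from P = 0.
import numpy as np

A = np.array([[1.1, 0.2], [0.0, 0.8]])   # placeholder; A itself is unstable
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

P = np.zeros((2, 2))                      # no stabilizing initial gain needed
for j in range(300):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)      # policy improvement
    P = (A - B @ K).T @ P @ (A - B @ K) + Q + K.T @ R @ K  # value update
print("VI gain:\n", K)
```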
All algorithms have a model-free scalar form based on measuring states, and a model-based matrix form based on cancelling states
For MDP, the value iteration update reads
$V_{j+1}(x) = \sum_u \pi_j(x, u) \sum_{x'} P^u_{xx'}\big(R^u_{xx'} + V_j(x')\big)$
Motor control runs at about 200 Hz. Oscillation is a fundamental property of neural tissue; the brain has multiple adaptive clocks with different timescales:
- theta rhythm, 4-10 Hz: hippocampus, thalamus; sensory processing, memory, and voluntary control of movement
- gamma rhythms, 30-100 Hz: hippocampus and neocortex; high cognitive activity
- consolidation of memory; spatial mapping of the environment (place cells)
The high-frequency processing is due to the large amounts of sensory data to be processed.
D. Vrabie and F.L. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially-unknown nonlinear systems," Neural Networks, vol. 22, no. 3, pp. 237-246, Apr. 2009.
Work with Dan Levine, UTA Dept. of Psychology.
D. Vrabie, F.L. Lewis, and D. Levine, "Neural Network-Based Adaptive Optimal Controller: A Continuous-Time Formulation," Proc. Int. Conf. Intelligent Control, Shanghai, Sept. 2008.
Doya, Kimura, Kawato 2001
[Figure: Summary of motor control in the human nervous system. Loops run from the cerebral cortex (motor areas) through the thalamus, basal ganglia, limbic system, cerebellum, and brainstem to the spinal cord, with interoceptive and exteroceptive receptors feeding back from muscle contraction and movement.]

[Figure (picture by E. Stingu and D. Vrabie): a hierarchy of multiple parallel loops with different learning mechanisms and timescales: reflex (spinal cord), supervised learning (cerebellum, inferior olive; eye movement), reinforcement learning with dopamine (limbic system), unsupervised learning (hippocampus), and long- and short-term memory functions. Timescales: motor control ~200 Hz, gamma rhythms 30-100 Hz, theta rhythms 4-10 Hz.]

This is an adaptive actor-critic structure: reinforcement learning at theta waves (4-8 Hz), motor control at 200 Hz.
Q Learning

Four ADP methods proposed by Paul Werbos (Adaptive/Approximate Dynamic Programming); the critic NN approximates:
- Heuristic dynamic programming (HDP): the value $V(x_k)$ (Value Iteration)
- Dual heuristic programming (DHP): the gradient $\partial V / \partial x$
- Action-dependent heuristic dynamic programming (ADHDP): the Q function $Q(x_k, u_k)$ (Watkins Q Learning)
- Action-dependent dual heuristic programming (ADDHP): the gradients $\partial Q / \partial x$, $\partial Q / \partial u$
An action NN approximates the control.

Bertsekas: Neurodynamic Programming.
Barto & Bradtke: Q-learning proof (imposed a settling time).
Q Function Definition

Value function recursion for a given policy $h(x_k)$: $V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$

Define the Q function, with $u_k$ arbitrary and the policy $h(\cdot)$ used after time k:
$Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1})$
Note that $Q_h(x_k, h(x_k)) = V_h(x_k)$.
Recursion for Q: $Q_h(x_k, u_k) = r(x_k, u_k) + Q_h(x_{k+1}, h(x_{k+1}))$

Simple expression of Bellman's principle:
$V^*(x_k) = \min_{u_k} Q^*(x_k, u_k)$, $\quad h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k)$
This is Action Dependent ADP.
Optimal Adaptive Control for completely unknown DT systems
[Timeline: at time k observe $x_k$ and apply $u_k$; from time k+1 on, apply the policy $u = h(x)$ and observe $x_{k+1}$.]

Specify a control policy $u_j = h(x_j)$, $j = k+1, k+2, \ldots$: $u_k$ is arbitrary and the policy $h(\cdot)$ is used after time k.

Define the Q function: $Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1})$
Note that $Q_h(x_k, h(x_k)) = V_h(x_k)$.
Bellman equation for Q: $Q_h(x_k, u_k) = r(x_k, u_k) + Q_h(x_{k+1}, h(x_{k+1}))$

Optimal Q function: $Q^*(x_k, u_k) = r(x_k, u_k) + V^*(x_{k+1})$, so that
$Q^*(x_k, u_k) = r(x_k, u_k) + Q^*(x_{k+1}, h^*(x_{k+1}))$

Optimal control solution:
$V^*(x_k) = Q^*(x_k, h^*(x_k)) = \min_h Q^*(x_k, h(x_k))$, $\quad h^*(x_k) = \arg\min_h Q^*(x_k, h(x_k))$
Simple expression of Bellman's principle:
$V^*(x_k) = \min_{u_k} Q^*(x_k, u_k)$, $\quad h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k)$
Q Function HDP: Action Dependent HDP

The Q function for any given control policy $h(x_k)$ satisfies the Bellman equation
$Q_h(x_k, u_k) = r(x_k, u_k) + Q_h(x_{k+1}, h(x_{k+1}))$

Recursive solution: pick a stabilizing initial control policy, then iterate.
Find the Q function: $Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_{j+1}(x_{k+1}, h_j(x_{k+1}))$
Update the control: $h_{j+1}(x_k) = \arg\min_{u_k} Q_{j+1}(x_k, u_k)$

Now $f(x_k, u_k)$ is not needed: Q learning does not need to know $f(x_k)$ or $g(x_k)$.
Bradtke & Barto (1994) proved convergence for the LQR.
$Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1})$

For the LQR, $V(x) = W^T\phi(x) = x^T P x$ is quadratic in x, and
$Q(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k + (Ax_k + Bu_k)^T P (Ax_k + Bu_k)$

Q is quadratic in x and u:
$Q(x_k, u_k) = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T H \begin{bmatrix} x_k \\ u_k \end{bmatrix} = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} Q + A^T P A & A^T P B \\ B^T P A & R + B^T P B \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix}$

The control update is found from
$0 = \dfrac{\partial Q}{\partial u} = 2(R + B^T P B)u_k + 2B^T P A x_k = 2H_{uu}u_k + 2H_{ux}x_k$
so
$u_k = -(R + B^T P B)^{-1} B^T P A\, x_k = -H_{uu}^{-1} H_{ux} x_k = -L_j x_k$

The control is found only from the Q function: A and B are not needed.
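A small sketch forming H from a Riccati solution (computed here with scipy's solve_discrete_are; the matrices are placeholders) and checking that the gain read off the H blocks matches the model-based LQR gain:

```python
# Sketch: build the LQR Q-function kernel H and read the gain off its blocks.
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[0.9, 0.2], [0.0, 0.8]])   # placeholder matrices
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
P = solve_discrete_are(A, B, Q, R)

n = A.shape[0]
H = np.block([[Q + A.T @ P @ A, A.T @ P @ B],
              [B.T @ P @ A,     R + B.T @ P @ B]])
L_from_Q = np.linalg.solve(H[n:, n:], H[n:, :n])  # u = -H_uu^{-1} H_ux x
print(np.allclose(L_from_Q,
                  np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)))  # True
```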
Implementation: DT Q Function Policy Iteration

Q Policy Iteration (Bradtke & Barto):
$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_{j+1}(x_{k+1}, -L_j x_{k+1})$, with $u_k = -L_j x_k$
Control policy update: $h_{j+1}(x_k) = \arg\min_{u_k} Q_{j+1}(x_k, u_k)$

QFA (Q function approximation): $Q(x, u) = W^T \phi(x, u)$. Now u is an input to the NN (Werbos: action-dependent NN); for the LQR case, $\phi$ is the quadratic basis in (x, u).

The Q function update for the control is given as follows. Assume measurements of $u_k$, $x_k$, and $x_{k+1}$ are available to compute $u_{k+1}$. Then
$W_{j+1}^T\big(\phi(x_k, u_k) - \phi(x_{k+1}, -L_j x_{k+1})\big) = r(x_k, u_k)$
with regression matrix $\phi(x_k, u_k) - \phi(x_{k+1}, -L_j x_{k+1})$. Solve for the weights using RLS or backprop. Since $x_{k+1}$ is measured in the training phase, no knowledge of f(x) or g(x) is needed for the value function update.

Control update from the Q function alone: $u_k = -H_{uu}^{-1} H_{ux} x_k = -L_{j+1} x_k$.
This is model-free policy iteration (Bradtke, Ydstie, Barto).
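A sketch of this model-free Q-function policy iteration under stated assumptions: the placeholder plant below is used only to generate measured data, probing noise supplies persistence of excitation, and the helper names phi and weights_to_H are ours:

```python
# Sketch: model-free Q policy iteration for the LQR via batch least-squares.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.2], [0.0, 0.8]])   # placeholder plant, unknown to the learner
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

def phi(z):                               # quadratic basis in z = [x1, x2, u]
    return np.array([z[0]**2, z[0]*z[1], z[0]*z[2],
                     z[1]**2, z[1]*z[2], z[2]**2])

def weights_to_H(W):                      # W = [h11, 2h12, 2h13, h22, 2h23, h33]
    return np.array([[W[0],   W[1]/2, W[2]/2],
                     [W[1]/2, W[3],   W[4]/2],
                     [W[2]/2, W[4]/2, W[5]]])

L = np.zeros((1, 2))                      # initial stabilizing gain (A stable here)
for j in range(10):                       # Q policy iteration
    x = np.array([1.0, -1.0])
    rows, targets = [], []
    for k in range(40):
        u = -L @ x + 0.1 * rng.standard_normal(1)   # probing noise for PE
        r = x @ Q @ x + u @ R @ u                   # measured cost r(x_k, u_k)
        x_next = A @ x + B @ u                      # measured; model never used
        u_next = -L @ x_next                        # policy applied after time k
        rows.append(phi(np.concatenate([x, u])) -
                    phi(np.concatenate([x_next, u_next])))
        targets.append(float(r))
        x = x_next
    W, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    H = weights_to_H(W)
    L = np.linalg.solve(H[2:, 2:], H[2:, :2])       # L_{j+1} = H_uu^{-1} H_ux
print("Learned gain L:\n", L)
```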
Greedy Q Function Update: Approximate Dynamic Programming, ADP Method 3 (Q Learning)
Action-Dependent Heuristic Dynamic Programming (ADHDP), Paul Werbos: model-free HDP

Greedy Q update:
$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_j(x_{k+1}, h_j(x_{k+1}))$
$W_{j+1}^T \phi(x_k, u_k) = r(x_k, u_k) + W_j^T \phi(x_{k+1}, -L_j x_{k+1}) \equiv$ target
Update the weights by RLS or backprop.

For Q policy iteration a stable initial control is needed; for the greedy (value iteration) update a stable initial control is NOT needed.
Direct OPTIMAL ADAPTIVE CONTROL
Q learning actually solves the Riccati Equation WITHOUT knowing the plant dynamics
Model-free ADP
Works for Nonlinear Systems
Proofs? Robustness? Comparison with adaptive control methods?
ADHDP Application for Power system
• System Description
• The discrete-time model is obtained by applying ZOH sampling to the continuous-time model

$x(t) = [\,\Delta f(t)\ \ \Delta P_g(t)\ \ \Delta X_g(t)\ \ \Delta F(t)\,]^T$

$\dot x(t) = \begin{bmatrix} -1/T_p & K_p/T_p & 0 & 0 \\ 0 & -1/T_T & 1/T_T & 0 \\ -1/(R\,T_G) & 0 & -1/T_G & -1/T_G \\ K_E & 0 & 0 & 0 \end{bmatrix} x(t) + \begin{bmatrix} 0 \\ 0 \\ 1/T_G \\ 0 \end{bmatrix} u(t) + \begin{bmatrix} -K_p/T_p \\ 0 \\ 0 \\ 0 \end{bmatrix} \Delta P_d(t)$

with parameter ranges
$1/T_p \in [0.033, 0.1]$, $K_p/T_p \in [4, 12]$, $1/T_T \in [2.564, 4.762]$, $1/T_G \in [9.615, 17.857]$, $1/(R\,T_G) \in [3.081, 10.639]$
ADHDP Application for Power system
• The system state:
- ∆f: incremental frequency deviation (Hz)
- ∆Pg: incremental change in generator output (p.u. MW)
- ∆Xg: incremental change in governor position (p.u. MW)
- ∆F: incremental change in integral control
- ∆Pd: the load disturbance (p.u. MW)
• The system parameters:
- TG: the governor time constant
- TT: turbine time constant
- TP: plant model time constant
- Kp: plant model gain
- R: speed regulation due to governor action
- KE: integral control gain
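A sketch of the ZOH discretization step using scipy.signal.cont2discrete, assuming the model structure reconstructed above; the numerical parameter values (midpoints of the stated ranges), the integral gain K_E, and the sample time T are illustrative placeholders, not the values used in the slides:

```python
# Sketch: build the CT load-frequency model and discretize it by ZOH.
import numpy as np
from scipy.signal import cont2discrete

inv_Tp, Kp_over_Tp = 0.0665, 8.0           # placeholder midpoints of the ranges
inv_TT, inv_TG, inv_RTG = 3.663, 13.736, 6.86
K_E, T = 0.6, 0.1                          # placeholder integral gain, sample time (s)

Ac = np.array([[-inv_Tp,   Kp_over_Tp, 0.0,      0.0],
               [ 0.0,     -inv_TT,     inv_TT,   0.0],
               [-inv_RTG,  0.0,       -inv_TG,  -inv_TG],
               [ K_E,      0.0,        0.0,      0.0]])
Bc = np.array([[0.0], [0.0], [inv_TG], [0.0]])

Ad, Bd, _, _, _ = cont2discrete((Ac, Bc, np.eye(4), np.zeros((4, 1))),
                                T, method='zoh')
print("Ad =\n", Ad, "\nBd =\n", Bd)
```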
ADHDP Application for Power system
• ADHDP policy tuning
[Figure: ADHDP policy tuning. Left: convergence of the critic parameters P11, P12, P13, P22, P23, P33, P34, P44 versus time k (0 to 3000). Right: convergence of the control policy parameters L11, L12, L13, L14 versus time k (0 to 3000).]
Converges to the optimal GARE solution and the optimal game control.
• Comparison
The ADHDP controller design vs. the design from [1]: the maximum frequency deviation when using the ADHDP controller is improved by 19.3% over the controller designed in [1].
[1] Y. Wang, R. Zhou, and C. Wen, "Robust load-frequency controller design for power systems," IEE Proc.-C, vol. 140, no. 1, 1993.
[Figure: state responses x1-x4 (frequency deviation, incremental change of generator output, incremental change of governor position, incremental change of integral control) over 0-20 s. With the ADHDP controller the peak frequency deviation is -0.2024 at t = 0.5 s; with the design from [1] it is -0.2507 at t = 0.58 s.]