
Page 1

Automation & Robotics Research Institute (ARRI) The University of Texas at Arlington

F.L. Lewis & Draguna Vrabie
Moncrief-O’Donnell Endowed Chair

Head, Controls & Sensors Group

Talk available online at http://ARRI.uta.edu/acs

Adaptive Dynamic Programming (ADP) for Discrete-Time Systems

Supported by: NSF - Paul Werbos

Page 2

Bill Wolovich

"Linear Multivariable Systems" New York: Springer-Verlag, 1974."Robotics: Basic Analysis and Design" , 1987.“Automatic Control Systems: Basic Analysis and Design,” Wolovich, 1994.

Falb and Wolovich, “Decoupling in the design and synthesis of multivariable control systems,” IEEE Trans. Automatic Control, 1967.
Wolovich and Falb, “On the structure of multivariable systems,” SIAM J. Control, 1969.
Wolovich, “The use of state feedback for exact model matching,” SIAM J. Control, 1972.
Falb and Wolovich, “The role of the interactor in decoupling,” JACC, 1977.
Wolovich and Falb, “Invariants and canonical forms under dynamic compensation,” SIAM J. Control, vol. 14, 1976.

Interactor Matrix & Structure

The solution of the input-output cover problems: Wolovich [1972], Morse [1976], Hammer and Heymann [1981], Wonham [1974].

Pole Placement via Static Output Feedback is NP-Hard
Morse, A.S., W.A. Wolovich, and B.D.O. Anderson, “Generic pole assignment: preliminary results,” IEEE Transactions on Automatic Control, vol. 28, pp. 503-506, 1983.

Page 3

Discrete-Time Optimal Control

System: $x_{k+1} = f(x_k) + g(x_k) u_k$

Cost: $V_h(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} r(x_i, u_i) = r(x_k, u_k) + \gamma \sum_{i=k+1}^{\infty} \gamma^{i-(k+1)} r(x_i, u_i)$

Example: $r(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k$

Value function recursion: $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1}), \quad V_h(0) = 0$

Control policy: $u_k = h(x_k)$, the prescribed control input function

Example: $u_k = -K x_k$, linear state-variable feedback

Page 4

Discrete-Time Optimal Control

Cost: $V_h(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} r(x_i, u_i)$

Value function recursion: $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$, where $u_k = h(x_k)$ is the prescribed control policy.

Hamiltonian: $H(x_k, \nabla V_h(x_k), h) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) - V_h(x_k)$

Optimal cost (Bellman's Principle): $V^*(x_k) = \min_h \big( r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) \big) = \min_{u_k} \big( r(x_k, u_k) + \gamma V^*(x_{k+1}) \big)$

Optimal control: $h^*(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + \gamma V^*(x_{k+1}) \big)$

The system dynamics do not appear. Backwards-in-time solution.

Page 5

The Solution: Hamilton-Jacobi-Bellman Equation

System: $x_{k+1} = f(x_k) + g(x_k) u_k$

Cost: $V^*(x_k) = \min \sum_{i=k}^{\infty} \big( x_i^T Q x_i + u_i^T R u_i \big)$

DT HJB equation:
$V^*(x_k) = \min_{u_k} \big[ x_k^T Q x_k + u_k^T R u_k + V^*(x_{k+1}) \big] = \min_{u_k} \big[ x_k^T Q x_k + u_k^T R u_k + V^*\!\big( f(x_k) + g(x_k) u_k \big) \big]$

Minimize with respect to $u_k$: $2 R u_k + g^T(x_k) \dfrac{dV^*(x_{k+1})}{dx_{k+1}} = 0$, so
$u^*(x_k) = -\tfrac{1}{2} R^{-1} g^T(x_k) \dfrac{dV^*(x_{k+1})}{dx_{k+1}}.$

The DT HJB equation is difficult to solve and contains the dynamics.

Page 6

DT Optimal Control – Linear Systems, Quadratic Cost (LQR)

System: $x_{k+1} = A x_k + B u_k$

Cost: $V(x_k) = \sum_{i=k}^{\infty} \big( x_i^T Q x_i + u_i^T R u_i \big)$

Fact: the cost is quadratic, $V(x_k) = x_k^T P x_k$ for some symmetric matrix $P$.

HJB = DT Riccati equation: $0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A$

Optimal control: $u_k = -L x_k$, with $L = (R + B^T P B)^{-1} B^T P A$

Optimal cost: $V^*(x_k) = x_k^T P x_k$

Off-line solution. Dynamics must be known.
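To make this off-line, model-based route concrete, here is a minimal Python/NumPy sketch (the function name and the small test system are ours, purely for illustration) that iterates the DT Riccati recursion from $P = 0$ and then forms the feedback gain $L = (R + B^T P B)^{-1} B^T P A$:

```python
import numpy as np

def dlqr_offline(A, B, Q, R, iters=1000, tol=1e-12):
    """Solve the DT Riccati equation by fixed-point iteration; return (P, L)."""
    P = np.zeros_like(A)
    for _ in range(iters):
        L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # (R+B'PB)^{-1} B'PA
        P_next = A.T @ P @ A + Q - A.T @ P @ B @ L           # Riccati recursion
        if np.max(np.abs(P_next - P)) < tol:
            P = P_next
            break
        P = P_next
    L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, L

# Illustrative 2-state, 1-input system (not from the slides)
A = np.array([[1.0, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [0.1]])
P, L = dlqr_offline(A, B, np.eye(2), np.eye(1))
# Optimal feedback: u_k = -L @ x_k
```

Note that both A and B must be known here; the ADP methods on the following pages remove that requirement.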

Page 7

Discrete-Time Optimal Adaptive Control

Cost: $V_h(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} r(x_i, u_i)$

Value function recursion: $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$, where $u_k = h(x_k)$ is the prescribed control policy.

Hamiltonian: $H(x_k, \nabla V_h(x_k), h) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) - V_h(x_k)$

Optimal cost (Bellman's Principle): $V^*(x_k) = \min_h \big( r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) \big) = \min_{u_k} \big( r(x_k, u_k) + \gamma V^*(x_{k+1}) \big)$

Optimal control: $h^*(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + \gamma V^*(x_{k+1}) \big)$

Focus on these two equations: the value function recursion and the control update.

Page 8

Discrete-Time Optimal Control – Solutions by the Computational Intelligence Community

Value function recursion: $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1}), \quad V_h(0) = 0$, where $u_k = h(x_k)$ is the prescribed control policy.

This is a (DT) Lyapunov equation. It gives the value of any prescribed control policy: Policy Evaluation for any given current policy. The policy must be stabilizing.

Theorem. Let $V_h(x_k)$ solve the Lyapunov equation. Then
$V_h(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} r(x_i, h(x_i)).$
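For the LQR special case, this policy-evaluation step becomes a discrete Lyapunov equation that can be solved directly. A minimal sketch of our own (assuming $\gamma = 1$ and a stabilizing gain $L$, so $u_k = -L x_k$):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def evaluate_policy_lqr(A, B, Q, R, L):
    """Value of the fixed policy u_k = -L x_k is V_L(x) = x' P x, where P solves
    (A - B L)' P (A - B L) - P + Q + L' R L = 0   (A - B L must be stable)."""
    Acl = A - B @ L
    return solve_discrete_lyapunov(Acl.T, Q + L.T @ R @ L)
```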

Page 9

Policy Improvement

Bellman's result, the optimal control: $h^*(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + \gamma V^*(x_{k+1}) \big)$

What about $h'(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + \gamma V_h(x_{k+1}) \big)$ for a given policy $h(\cdot)$?

Theorem (Bertsekas). Let $V_h(x_k)$ be the value of any given policy $h(x_k)$. Then
$V_{h'}(x_k) \le V_h(x_k).$

This is the one-step improvement property of rollout algorithms.

Page 10

DT Policy Iteration

The cost of any given control policy $h(x_k)$ satisfies the recursion (a Lyapunov equation; a consistency equation):
$V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$

Recursive solution: pick a stabilizing initial control, e.g. a state-variable-feedback policy $h(x_k) = -L x_k$.

Policy Evaluation: $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})$

Policy Improvement: $h_{j+1}(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + \gamma V_{j+1}(x_{k+1}) \big)$

Howard (1960) proved convergence for MDPs. Note that $f(\cdot)$ and $g(\cdot)$ do not appear in the policy-evaluation recursion.

Page 11

The Adaptive Critic Architecture

[Diagram: a critic network performs Policy Evaluation on the observed cost, and an action network implements the control policy $h_j(x_k)$ applied to the System.]

Value update: $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})$

Control policy update: $h_{j+1}(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + \gamma V_{j+1}(x_{k+1}) \big)$

Adaptive critics lead to an ONLINE, FORWARD-IN-TIME implementation of optimal control.

Page 12: Automation & Robotics Research Institute (ARRI) …pantsakl/Archive/WolovichSymposium/...Adaptive Dynamic Programming (ADP) For Discrete-Time Systems Supported by : NSF - PAUL WERBOS

Different methods of learning

SystemAdaptiveLearning system

ControlInputs

outputs

environmentTuneactor

Reinforcementsignal

Actor

Critic

Desiredperformance

Reinforcement learningIvan Pavlov 1890s

Actor-Critic Learning

We want OPTIMAL performance- ADP- Approximate Dynamic Programming

Page 13

Adaptive (Approximate) Dynamic Programming

Four ADP methods proposed by Paul Werbos; the critic NN approximates:

• Heuristic dynamic programming (HDP): the value $V(x_k)$
• Dual heuristic programming (DHP): the gradient $\partial V / \partial x$
• AD heuristic dynamic programming (ADHDP) (Watkins Q-learning): the Q function $Q(x_k, u_k)$
• AD dual heuristic programming (ADDHP): the gradients $\partial Q / \partial x$, $\partial Q / \partial u$

An action NN is used to approximate the control.

Bertsekas – Neurodynamic Programming.
Barto & Bradtke – Q-learning proof (imposed a settling time).

Page 14

DT Policy Iteration – Linear Systems, Quadratic Cost (LQR)

Equivalent to an underlying problem, the DT LQR:
$x_{k+1} = A x_k + B u_k, \qquad V(x_k) = \sum_{i=k}^{\infty} \big( x_i^T Q x_i + u^T(x_i) R\, u(x_i) \big)$

The LQR value is quadratic: for any stabilizing policy, the cost is $V(x) = x^T P x$.

DT policy iteration:
$V_{j+1}(x_k) = x_k^T Q x_k + u_j^T(x_k) R\, u_j(x_k) + V_{j+1}(x_{k+1})$
$u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g^T(x_k) \dfrac{dV_{j+1}(x_{k+1})}{dx_{k+1}}$

For the LQR this is equivalent to the underlying DT Lyapunov equation and gain update
$(A - B L_j)^T P_{j+1} (A - B L_j) - P_{j+1} = -\big( Q + L_j^T R L_j \big), \qquad L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A.$

Hewer proved convergence in 1971.

Solves the Lyapunov equation without knowing A and B: ADP solves the Riccati equation WITHOUT knowing the system dynamics.
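A minimal NumPy/SciPy sketch of the underlying Hewer iteration (our own function name and iteration count; an initial stabilizing gain $L_0$ is assumed), shown here in its model-based form for reference:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def hewer_policy_iteration(A, B, Q, R, L0, iters=50):
    """Model-based DT policy iteration (Hewer 1971): alternate Lyapunov solves
    (policy evaluation) and gain updates (policy improvement)."""
    L = L0                                   # must satisfy rho(A - B L0) < 1
    for _ in range(iters):
        Acl = A - B @ L
        # Policy evaluation: (A-BL)' P (A-BL) - P = -(Q + L'RL)
        P = solve_discrete_lyapunov(Acl.T, Q + L.T @ R @ L)
        # Policy improvement: L <- (R + B'PB)^{-1} B'PA
        L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, L
```

The ADP implementations on the following pages replace the Lyapunov solve with on-line identification of the value from measured data, so A and B drop out of the evaluation step.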

Page 15

DT Policy Iteration – How to Implement Online? Linear Systems, Quadratic Cost (LQR)

DT LQR: $x_{k+1} = A x_k + B u_k$, and the LQR cost is quadratic, $V(x) = x^T P x$ for some matrix $P$, with
$V(x_k) = \sum_{i=k}^{\infty} \big( x_i^T Q x_i + u^T(x_i) R\, u(x_i) \big).$

DT policy iteration step: $V_{j+1}(x_k) = x_k^T Q x_k + u_j^T(x_k) R\, u_j(x_k) + V_{j+1}(x_{k+1})$, i.e.
$x_k^T P_{j+1} x_k - x_{k+1}^T P_{j+1} x_{k+1} = x_k^T Q x_k + u_j^T R\, u_j.$

Writing the quadratic value in terms of a quadratic basis set (the independent products $x_1^2, x_1 x_2, x_2^2, \ldots$), so that $x^T P_{j+1} x = W_{j+1}^T \varphi(x)$ with $W_{j+1}$ a weight vector built from the entries of $P_{j+1}$, this becomes
$W_{j+1}^T \big[ \varphi(x_k) - \varphi(x_{k+1}) \big] = x_k^T Q x_k + u_j^T R\, u_j,$
a linear equation in the unknown weights $W_{j+1}$.

This solves the Lyapunov equation without knowing A and B.

Page 16

Implementation – DT Policy Iteration

Value Function Approximation (VFA): $V(x) = W^T \varphi(x)$, with weights $W$ and basis functions $\varphi(x)$.

LQR case: $V(x)$ is quadratic, $V(x) = x^T P x = W^T \varphi(x)$, where $\varphi(x)$ is the set of quadratic basis functions and $W^T = [\, p_{11} \;\; p_{12} \;\; \cdots \,]$.

Nonlinear system case: use a neural network.

Page 17

Implementation – DT Policy Iteration: Model-Based Policy Iteration

Value function update for a given control, $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})$, with VFA $V_j(x_k) = W_j^T \varphi(x_k)$.

Assume measurements of $x_k$ and $x_{k+1}$ are available to compute $u_{k+1}$. Then
$W_{j+1}^T \big[ \varphi(x_k) - \gamma\, \varphi(x_{k+1}) \big] = r(x_k, h_j(x_k)),$
where the bracketed term is the regression matrix. Solve for the weights using RLS, or use many trajectories with different initial conditions over a compact set. Since $x_{k+1}$ is measured, knowledge of $f(x)$ or $g(x)$ is NOT needed for the value-function update.

Then update the control using
$h_j(x_k) = -L_j x_k = -(R + B^T P_j B)^{-1} B^T P_j A\, x_k.$
One needs to know $f(x_k)$ AND $g(x_k)$ (here A and B) for the control update. Robustness?

This is indirect adaptive control with identification of the optimal value.
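A minimal Python/NumPy sketch of this model-based policy-evaluation step (the names and the two-state quadratic basis are our own; $\gamma = 1$ is assumed): stack the regression rows $\varphi(x_k) - \gamma\varphi(x_{k+1})$ collected along trajectories and solve for the value weights by batch least squares.

```python
import numpy as np

def phi(x):
    """Quadratic basis for a 2-state system: [x1^2, x1*x2, x2^2]."""
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

def evaluate_policy_from_data(X, X_next, costs, gamma=1.0):
    """Least-squares fit of W in  W'[phi(x_k) - gamma*phi(x_{k+1})] = r_k.
    Only measured state pairs and stage costs are used (no A, B here)."""
    Phi = np.array([phi(xk) - gamma * phi(xk1) for xk, xk1 in zip(X, X_next)])
    W, *_ = np.linalg.lstsq(Phi, np.asarray(costs), rcond=None)
    return W            # weights of V_{j+1}(x) = W' phi(x)
```

The data must be rich enough (persistently exciting state pairs) for the least-squares problem to be well posed.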

Page 18

Implementation – DT Policy Iteration: Direct Optimal Adaptive Control

1. Select a control policy.
2. Find the associated cost: $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})$
3. Improve the control: $u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g^T(x_k) \dfrac{dV_{j+1}(x_{k+1})}{dx_{k+1}}$

Needs about 10 lines of MATLAB code. Solves the Lyapunov equation without knowing the dynamics.

On-line timeline over one step from $k$ to $k+1$: apply $u_k$, observe $x_k$, observe the cost $r_k$, observe $x_{k+1}$, update $V$; repeat until convergence to $V_{j+1}$, then update the control to $u_{j+1}$.

Page 19

Adaptive Control

[Diagram: controller, plant, control input, output.]

Identify the controller – Direct Adaptive Control.
Identify the system model – Indirect Adaptive Control.
Identify the performance value – Optimal Adaptive Control.

Page 20

Greedy Value Function Update – Approximate Dynamic Programming
ADP Method 1 – Heuristic Dynamic Programming (HDP) (Paul Werbos)

Policy Iteration (a Lyapunov equation; an initial stabilizing control IS needed):
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})$
$h_{j+1}(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + \gamma V_{j+1}(x_{k+1}) \big)$
For the LQR the underlying recursion is Hewer's (1971):
$(A - B L_j)^T P_{j+1} (A - B L_j) - P_{j+1} = -\big( Q + L_j^T R L_j \big), \qquad L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A.$

ADP Greedy Cost Update (a simple recursion; the two occurrences of the cost allow the definition of a greedy update; an initial stabilizing control is NOT needed):
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_j(x_{k+1})$
$h_{j+1}(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + \gamma V_{j+1}(x_{k+1}) \big)$
For the LQR the underlying recursion is
$P_{j+1} = (A - B L_j)^T P_j (A - B L_j) + Q + L_j^T R L_j, \qquad L_j = (R + B^T P_j B)^{-1} B^T P_j A.$
Lancaster & Rodman proved convergence.

Page 21

Implementation – DT HDP

Value function update for a given control, $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_j(x_{k+1})$, with VFA $V_j(x_k) = W_j^T \varphi(x_k)$.

Assume measurements of $x_k$ and $x_{k+1}$ are available to compute $u_{k+1}$. Then
$W_{j+1}^T \varphi(x_k) = r(x_k, h_j(x_k)) + \gamma\, W_j^T \varphi(x_{k+1}),$
with $\varphi(x_k)$ the regression matrix and $W_j$ the old weights. Solve for the weights using RLS, or use many trajectories with different initial conditions over a compact set. Since $x_{k+1}$ is measured, knowledge of $f(x)$ or $g(x)$ is NOT needed for the value-function update.

Then update the control using
$h_j(x_k) = -L_j x_k = -(R + B^T P_j B)^{-1} B^T P_j A\, x_k.$
One needs to know $f(x_k)$ AND $g(x_k)$ for the control update.

Page 22

DT HDP vs. Receding Horizon Optimal Control

Forward-in-time HDP:
$P_{i+1} = A^T P_i A + Q - A^T P_i B \big( R + B^T P_i B \big)^{-1} B^T P_i A, \qquad P_0 = 0.$

Backward-in-time optimization (RHC):
$P_k = A^T P_{k+1} A + Q - A^T P_{k+1} B \big( R + B^T P_{k+1} B \big)^{-1} B^T P_{k+1} A, \qquad P_N = $ a control Lyapunov function overbounding $P_\infty$.

Page 23

Adaptive Terminal Cost RHC (Hongwei Zhang, Dr. Jie Huang)

System: $x_{k+1} = A x_k + B u_k$

Standard RHC:
$V(x_k) = \sum_{i=k}^{k+N-1} \big( x_i^T Q x_i + u_i^T R u_i \big) + x_{k+N}^T P_0\, x_{k+N}$
$P_{i+1} = A^T P_i A + Q - A^T P_i B \big( R + B^T P_i B \big)^{-1} B^T P_i A, \quad \text{starting from } P_0,$
$u_k^{RH} = -\big( R + B^T P_{N-1} B \big)^{-1} B^T P_{N-1} A\, x_k = -L^{RH} x_k.$
This requires $P_0$ to be a CLF that overbounds the optimal infinite-horizon cost, or a large $N$; $P_0$ is the same for each stage.

Our ATC RHC:
$V(x_k) = \sum_{i=k}^{k+N-1} \big( x_i^T Q x_i + u_i^T R u_i \big) + x_{k+N}^T P_{kN}\, x_{k+N}$
$P_{i+1} = A^T P_i A + Q - A^T P_i B \big( R + B^T P_i B \big)^{-1} B^T P_i A, \quad \text{starting from } P_{kN},$
the final cost from the previous stage.

HWZ Theorem. Let $N \ge 1$. Under the usual observability and controllability assumptions, ATC RHC guarantees uniform ultimate exponential stability for ANY $P_0 > 0$. Moreover, the solution converges to the optimal infinite-horizon cost.

Page 24

Q Learning – Action-Dependent ADP: Optimal Adaptive Control for Unknown DT Systems

Value function recursion for a given policy $h(x_k)$: $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$

Define the Q function ($u_k$ arbitrary, the policy $h(\cdot)$ used after time $k$):
$Q_h(x_k, u_k) = r(x_k, u_k) + \gamma V_h(x_{k+1})$

Note: $Q_h(x_k, h(x_k)) = V_h(x_k)$

Recursion for Q: $Q_h(x_k, u_k) = r(x_k, u_k) + \gamma\, Q_h(x_{k+1}, h(x_{k+1}))$

Simple expression of Bellman's principle:
$V^*(x_k) = \min_{u_k} Q^*(x_k, u_k), \qquad h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k).$

Page 25

Continuous-Time Optimal Control (Bellman; Draguna Vrabie)

System: $\dot{x} = f(x, u)$

Cost: $V(x(t)) = \int_t^{\infty} r(x, u)\, dt = \int_t^{\infty} \big( Q(x) + u^T R u \big)\, dt$

Hamiltonian: $H\!\left( x, u, \frac{\partial V}{\partial x} \right) = r(x, u) + \left( \frac{\partial V}{\partial x} \right)^T \dot{x} = r(x, u) + \left( \frac{\partial V}{\partial x} \right)^T f(x, u)$

(c.f. the DT Hamiltonian $H(x_k, \nabla V_h(x_k), h) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) - V_h(x_k)$)

Optimal cost: $0 = \min_{u(t)} \left( r(x, u) + \left( \frac{\partial V^*}{\partial x} \right)^T \dot{x} \right) = \min_{u(t)} \left( r(x, u) + \left( \frac{\partial V^*}{\partial x} \right)^T f(x, u) \right)$

Optimal control: $h^*(x(t)) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V^*}{\partial x}$

HJB equation: $0 = \left( \frac{dV^*}{dx} \right)^T f + Q(x) - \tfrac{1}{4} \left( \frac{dV^*}{dx} \right)^T g R^{-1} g^T \frac{dV^*}{dx}, \qquad V^*(0) = 0.$

Off-line solution. Dynamics must be known.

Page 26

Bill Wolovich

Interactor Matrix & Structure Theorem

The solution of the input-output cover problems

Pole Placement via Static Output Feedback

Thank you for your inspiration and motivation in 1970

Page 27

Q Function Definition

Specify a control policy $u_j = h(x_j), \; j = k, k+1, \ldots$

Define the Q function ($u_k$ arbitrary, the policy $h(\cdot)$ used after time $k$):
$Q_h(x_k, u_k) = r(x_k, u_k) + \gamma V_h(x_{k+1})$

Note: $Q_h(x_k, h(x_k)) = V_h(x_k)$

Recursion for Q: $Q_h(x_k, u_k) = r(x_k, u_k) + \gamma\, Q_h(x_{k+1}, h(x_{k+1}))$

Optimal Q function: $Q^*(x_k, u_k) = r(x_k, u_k) + \gamma V^*(x_{k+1}) = r(x_k, u_k) + \gamma\, Q^*(x_{k+1}, h^*(x_{k+1}))$

Optimal control solution:
$V^*(x_k) = Q^*(x_k, h^*(x_k)) = \min_h Q^*(x_k, h(x_k)), \qquad h^*(x_k) = \arg\min_h Q^*(x_k, h(x_k)).$

Simple expression of Bellman's principle:
$V^*(x_k) = \min_{u_k} Q^*(x_k, u_k), \qquad h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k).$

Page 28

Q Function ADP – Action-Dependent ADP: Q Policy Iteration

The Q function for any given control policy $h(x_k)$ satisfies the recursion
$Q_h(x_k, u_k) = r(x_k, u_k) + \gamma\, Q_h(x_{k+1}, h(x_{k+1})).$

Recursive solution: pick a stabilizing initial control policy, find its Q function, then update the control.

Q function evaluation: $Q_{j+1}(x_k, u_k) = r(x_k, u_k) + \gamma\, Q_{j+1}(x_{k+1}, h_j(x_{k+1}))$

Control update: $h_{j+1}(x_k) = \arg\min_{u_k} Q_{j+1}(x_k, u_k)$

Now $f(x_k, u_k)$ is not needed. Bradtke & Barto (1994) proved convergence for the LQR.

Page 29

Q Learning does not need to know $f(x_k)$ or $g(x_k)$

$Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1})$

For the LQR, $V(x) = W^T \varphi(x) = x^T P x$, so V is quadratic in x and
$Q(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k),$
i.e. Q is quadratic in x and u:
$Q(x_k, u_k) = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} Q + A^T P A & A^T P B \\ B^T P A & R + B^T P B \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} \equiv \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T H \begin{bmatrix} x_k \\ u_k \end{bmatrix}.$

The control update is found by
$0 = \frac{\partial Q}{\partial u} = 2 \big( B^T P A\, x_k + (R + B^T P B) u_k \big) = 2 \big( H_{ux} x_k + H_{uu} u_k \big),$
so
$u_k = -(R + B^T P B)^{-1} B^T P A\, x_k = -H_{uu}^{-1} H_{ux}\, x_k = L_{j+1} x_k.$

The control is found only from the Q function: A and B are not needed.
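A minimal NumPy illustration of that last step (our own helper, purely illustrative): once a quadratic Q-kernel matrix $H$ has been identified from data, the feedback gain comes from its partition alone, with no reference to A or B.

```python
import numpy as np

def gain_from_Q_kernel(H, n):
    """Partition the Q kernel H (basis z = [x; u], x of dimension n) and return
    the gain L with u = -L x minimizing Q(x, u)."""
    H_ux = H[n:, :n]                        # equals B'PA in the LQR case
    H_uu = H[n:, n:]                        # equals R + B'PB in the LQR case
    return np.linalg.solve(H_uu, H_ux)      # L = H_uu^{-1} H_ux
```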

Page 30

Implementation – DT Q Function Policy Iteration

Q function update for the control $u_k = L_j x_k$:
$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + \gamma\, Q_{j+1}(x_{k+1}, L_j x_{k+1})$

QFA – Q function approximation: $Q(x, u) = W^T \varphi(x, u)$. For the LQR case, $\varphi(x, u)$ is the quadratic basis set in $(x, u)$. Now u is an input to the NN (Werbos: an action-dependent NN).

Assume measurements of $u_k$, $x_k$ and $x_{k+1}$ are available to compute $u_{k+1}$. Then
$W_{j+1}^T \big[ \varphi(x_k, u_k) - \gamma\, \varphi(x_{k+1}, L_j x_{k+1}) \big] = r(x_k, L_j x_k),$
with the bracketed term as the regression matrix. Solve for the weights using RLS or backpropagation. Since $x_{k+1}$ is measured, knowledge of $f(x)$ or $g(x)$ is NOT needed.

Page 31

Q Learning does not need to know $f(x_k)$ or $g(x_k)$ (the LQR Q function, as on Page 29): V is quadratic in x, Q is quadratic in x and u, and the control $u_k = -(R + B^T P B)^{-1} B^T P A\, x_k = -H_{uu}^{-1} H_{ux}\, x_k$ is found from the Q function alone; A and B are not needed.

Page 32

Greedy Q Function Update – Approximate Dynamic Programming
ADP Method 3: Q Learning – Action-Dependent Heuristic Dynamic Programming (ADHDP) (Paul Werbos; model-free ADP)

Q Policy Iteration (Bradtke, Ydstie, Barto; model-free policy iteration):
$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + \gamma\, Q_{j+1}(x_{k+1}, L_j x_{k+1})$
$W_{j+1}^T \big[ \varphi(x_k, u_k) - \gamma\, \varphi(x_{k+1}, L_j x_{k+1}) \big] = r(x_k, L_j x_k)$

Control policy update: $h_{j+1}(x_k) = \arg\min_{u_k} Q_{j+1}(x_k, u_k)$, i.e. $u_k = -H_{uu}^{-1} H_{ux}\, x_k = L_{j+1} x_k$.

Greedy Q Update:
$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + \gamma\, Q_j(x_{k+1}, h_j(x_{k+1}))$
$W_{j+1}^T \varphi(x_k, u_k) = r(x_k, L_j x_k) + \gamma\, W_j^T \varphi(x_{k+1}, L_j x_{k+1}) \equiv \text{target}_{j+1}$

Update the weights by RLS or backpropagation. A stable initial control is needed.
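A minimal sketch of one way to run this model-free Q-learning loop for the LQR (Python/NumPy; the code and its names are ours, with $\gamma = 1$, a batch least-squares fit per iteration, and a stabilizing initial gain assumed). A and B are used only to simulate data; the learning update never touches them. Probing noise is added for persistence of excitation.

```python
import numpy as np

def quad_basis(z):
    """Independent quadratic terms z_i z_j, i <= j, of z = [x; u]."""
    return np.array([z[i] * z[j] for i in range(len(z)) for j in range(i, len(z))])

def unpack_kernel(w, m):
    """Rebuild the symmetric kernel H from the weight vector (off-diagonals halved)."""
    H = np.zeros((m, m)); idx = 0
    for i in range(m):
        for j in range(i, m):
            H[i, j] = H[j, i] = w[idx] if i == j else w[idx] / 2.0
            idx += 1
    return H

def adhdp_lqr(A, B, Q, R, L, iters=20, steps=80, noise=0.1):
    """Greedy (ADHDP-style) Q-learning for the DT LQR with feedback u = -L x."""
    n, m = B.shape
    rng = np.random.default_rng(0)
    w = np.zeros((n + m) * (n + m + 1) // 2)
    for _ in range(iters):
        Phi, target, x = [], [], rng.standard_normal(n)
        for _ in range(steps):
            u = -L @ x + noise * rng.standard_normal(m)   # probing noise (PE)
            x1 = A @ x + B @ u                             # data generation only
            u1 = -L @ x1
            r = x @ Q @ x + u @ R @ u
            Phi.append(quad_basis(np.concatenate([x, u])))
            target.append(r + w @ quad_basis(np.concatenate([x1, u1])))
            x = x1
        w, *_ = np.linalg.lstsq(np.array(Phi), np.array(target), rcond=None)
        H = unpack_kernel(w, n + m)
        L = np.linalg.solve(H[n:, n:], H[n:, :n])          # L = H_uu^{-1} H_ux
    return L, H
```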

Page 33

Direct OPTIMAL ADAPTIVE CONTROL

Q learning actually solves the Riccati Equation WITHOUT knowing the plant dynamics

Model-free ADP

Works for Nonlinear Systems

Proofs? Robustness? Comparison with adaptive control methods?

Page 34

Discrete-Time Zero-Sum Games

• Consider the following discrete-time dynamical system with continuous state and action spaces:
$x_{k+1} = A x_k + B u_k + E w_k, \qquad y_k = x_k,$
with $x \in R^n$, $y \in R^p$, $u_k \in R^{m_1}$, $w_k \in R^{m_2}$, and quadratic cost
$V(x_k) = \min_{u} \max_{w} \sum_{i=k}^{\infty} \big[ x_i^T Q x_i + u_i^T u_i - \gamma^2 w_i^T w_i \big].$

• The goal is to find the optimal state-feedback strategies $u^*(x) = L x$ and $w^*(x) = K x$.

Page 35

DT Game Heuristic Dynamic Programming: Forward-in-Time Formulation (Asma Al-Tamimi)

• An Approximate Dynamic Programming (ADP) scheme in which one has the incremental optimization
$V_{i+1}(x_k) = \min_{u_k} \max_{w_k} \big\{ x_k^T Q x_k + u_k^T u_k - \gamma^2 w_k^T w_k + V_i(x_{k+1}) \big\},$
which is equivalently written as
$V_{i+1}(x_k) = x_k^T Q x_k + u_i^T(x_k)\, u_i(x_k) - \gamma^2 w_i^T(x_k)\, w_i(x_k) + V_i(x_{k+1}).$

Page 36

Game Algebraic Riccati Equation

• Using Bellman's optimality principle (dynamic programming), with $V^*(x_k) = x_k^T P x_k$:
$V^*(x_k) = \min_{u_k} \max_{w_k} \big( x_k^T Q x_k + u_k^T u_k - \gamma^2 w_k^T w_k + V^*(x_{k+1}) \big), \qquad x_k^T P x_k = \min_{u_k} \max_{w_k} \big( r(x_k, u_k, w_k) + x_{k+1}^T P x_{k+1} \big).$

• The Game Algebraic Riccati Equation (GARE):
$P = A^T P A + Q - \begin{bmatrix} A^T P B & A^T P E \end{bmatrix} \begin{bmatrix} I + B^T P B & B^T P E \\ E^T P B & E^T P E - \gamma^2 I \end{bmatrix}^{-1} \begin{bmatrix} B^T P A \\ E^T P A \end{bmatrix}.$

• The conditions for a saddle point are $I + B^T P B > 0$ and $\gamma^2 I - E^T P E > 0$.

Page 37

Game Algebraic Riccati Equation

The optimal policies for the control and the disturbance are
$L = \big( I + B^T P B - B^T P E (E^T P E - \gamma^2 I)^{-1} E^T P B \big)^{-1} \big( B^T P E (E^T P E - \gamma^2 I)^{-1} E^T P A - B^T P A \big),$
$K = \big( E^T P E - \gamma^2 I - E^T P B (I + B^T P B)^{-1} B^T P E \big)^{-1} \big( E^T P B (I + B^T P B)^{-1} B^T P A - E^T P A \big).$
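A minimal NumPy sketch (our own helper; $u = L x$, $w = K x$ as above) that evaluates these two gain formulas once a GARE solution $P$ is in hand:

```python
import numpy as np

def game_gains(P, A, B, E, gamma):
    """Saddle-point feedback gains: u = L x (control), w = K x (disturbance)."""
    BPB, BPE, BPA = B.T @ P @ B, B.T @ P @ E, B.T @ P @ A
    EPE, EPB, EPA = E.T @ P @ E, E.T @ P @ B, E.T @ P @ A
    Iu, S = np.eye(B.shape[1]), EPE - gamma**2 * np.eye(E.shape[1])
    L = np.linalg.solve(Iu + BPB - BPE @ np.linalg.solve(S, EPB),
                        BPE @ np.linalg.solve(S, EPA) - BPA)
    K = np.linalg.solve(S - EPB @ np.linalg.solve(Iu + BPB, BPE),
                        EPB @ np.linalg.solve(Iu + BPB, BPA) - EPA)
    return L, K
```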

Page 38

Q Learning for H-infinity Control (Asma Al-Tamimi)

$V^*(x_k) = x_k^T P x_k$

Q function:
$Q^*(x_k, u_k, w_k) = r(x_k, u_k, w_k) + V^*(x_{k+1}) = \begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix}^T H \begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix}, \qquad H = \begin{bmatrix} H_{xx} & H_{xu} & H_{xw} \\ H_{ux} & H_{uu} & H_{uw} \\ H_{wx} & H_{wu} & H_{ww} \end{bmatrix}.$

In the linear quadratic case, both V and Q are quadratic.

Q function update:
$Q_{i+1}\big( x_k, \hat{u}_i(x_k), \hat{w}_i(x_k) \big) = x_k^T R x_k + \hat{u}_i^T(x_k)\, \hat{u}_i(x_k) - \gamma^2 \hat{w}_i^T(x_k)\, \hat{w}_i(x_k) + Q_i\big( x_{k+1}, \hat{u}_i(x_{k+1}), \hat{w}_i(x_{k+1}) \big)$

Control action and disturbance updates, with $u_i(x_k) = L_i x_k$ and $w_i(x_k) = K_i x_k$:
$L_i = \big( H^i_{uu} - H^i_{uw} (H^i_{ww})^{-1} H^i_{wu} \big)^{-1} \big( H^i_{uw} (H^i_{ww})^{-1} H^i_{wx} - H^i_{ux} \big),$
$K_i = \big( H^i_{ww} - H^i_{wu} (H^i_{uu})^{-1} H^i_{uw} \big)^{-1} \big( H^i_{wu} (H^i_{uu})^{-1} H^i_{ux} - H^i_{wx} \big).$

A, B, E are NOT needed.

Page 39

H-infinity Game Q Function – Compare to the Q Function for the H2 Optimal Control Case

For the H2 (LQR) case, as on Page 29:
$Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1}) = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} Q + A^T P A & A^T P B \\ B^T P A & R + B^T P B \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} \equiv \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix}.$

Page 40

Q Function Update with a Quadratic Kronecker Basis (Asma Al-Tamimi)

A quadratic basis set is used to allow on-line solution:
$\hat{Q}_i(z) = z^T H_i z = h_i^T \bar{z}, \qquad z = \begin{bmatrix} x^T & u^T & w^T \end{bmatrix}^T,$
where $\bar{z} = \big( z_1^2,\; z_1 z_2,\; \ldots,\; z_1 z_q,\; z_2^2,\; z_2 z_3,\; \ldots,\; z_{q-1} z_q,\; z_q^2 \big)$ is the quadratic Kronecker basis, $h_i$ collects the elements of the kernel matrix $H_i$, and $\hat{u}_i(x) = L_i x$, $\hat{w}_i(x) = K_i x$.

Q function update:
$\hat{Q}_i\big( x_k, \hat{u}_i(x_k), \hat{w}_i(x_k) \big) = x_k^T R x_k + \hat{u}_i^T(x_k)\, \hat{u}_i(x_k) - \gamma^2 \hat{w}_i^T(x_k)\, \hat{w}_i(x_k) + \hat{Q}_i\big( x_{k+1}, \hat{u}_i(x_{k+1}), \hat{w}_i(x_{k+1}) \big),$
i.e. $h_i^T \bar{z}(x_k) = x_k^T R x_k + \hat{u}_i^T(x_k)\, \hat{u}_i(x_k) - \gamma^2 \hat{w}_i^T(x_k)\, \hat{w}_i(x_k) + h_i^T \bar{z}(x_{k+1}).$

Solve for the "NN weights" – the elements of the kernel matrix H – using batch LS or online RLS.

Control and disturbance updates with probing noise injected to get persistence of excitation: $\hat{u}_{ei}(x_k) = L_i x_k + n_{1k}$, $\hat{w}_{ei}(x_k) = K_i x_k + n_{2k}$.
Proof: the algorithm still converges to the exact result.
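A minimal sketch (our own function, Python/NumPy) of building this quadratic Kronecker basis $\bar{z}$ from $z = [x; u; w]$, the same construction used in the regression above:

```python
import numpy as np

def kron_quad_basis(x, u, w):
    """Quadratic Kronecker basis of z = [x; u; w]:
    (z1^2, z1*z2, ..., z1*zq, z2^2, ..., z_{q-1}*zq, zq^2)."""
    z = np.concatenate([x, u, w])
    q = len(z)
    return np.array([z[i] * z[j] for i in range(q) for j in range(i, q)])
```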

Page 41

Asma Al-Tamimi

Page 42

ADHDP Application for a Power System

• System description. The discrete-time model is obtained by applying a ZOH to the continuous-time load-frequency model with state
$x(t) = [\, \Delta f(t) \;\; \Delta P_g(t) \;\; \Delta X_g(t) \;\; \Delta F(t) \,]^T,$
$A = \begin{bmatrix} -1/T_p & K_p/T_p & 0 & 0 \\ 0 & -1/T_T & 1/T_T & 0 \\ -1/(R\, T_G) & 0 & -1/T_G & -1/T_G \\ K_E & 0 & 0 & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 \\ 0 \\ 1/T_G \\ 0 \end{bmatrix}, \qquad E = \begin{bmatrix} -K_p/T_p \\ 0 \\ 0 \\ 0 \end{bmatrix}.$

• Parameter ranges:
$1/T_p \in [0.033, 0.1], \quad K_p/T_p \in [4, 12], \quad 1/T_T \in [2.564, 4.762], \quad 1/T_G \in [9.615, 17.857], \quad 1/(R\, T_G) \in [3.081, 10.639].$
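As a purely illustrative picture of the "apply ZOH" step, the continuous-time model can be discretized with scipy.signal.cont2discrete. The parameter values and sample period below are our own choices inside the listed ranges, not values from the slides.

```python
import numpy as np
from scipy.signal import cont2discrete

# Illustrative values inside the ranges above (assumptions, not from the slides)
inv_Tp, Kp_over_Tp = 0.06, 8.0
inv_TT, inv_TG, inv_RTG, KE = 3.5, 13.0, 6.0, 1.0

A = np.array([[-inv_Tp,  Kp_over_Tp, 0.0,     0.0],
              [ 0.0,    -inv_TT,     inv_TT,  0.0],
              [-inv_RTG, 0.0,       -inv_TG, -inv_TG],
              [ KE,      0.0,        0.0,     0.0]])
B = np.array([[0.0], [0.0], [inv_TG], [0.0]])
C, D = np.eye(4), np.zeros((4, 1))

Ts = 0.1  # assumed sample period in seconds
Ad, Bd, Cd, Dd, _ = cont2discrete((A, B, C, D), Ts, method='zoh')
# (Ad, Bd) is the discrete-time model that the ADHDP tuning works against
```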

Page 43

ADHDP Application for a Power System

• The system states:
Δf – incremental frequency deviation (Hz)
ΔPg – incremental change in generator output (p.u. MW)
ΔXg – incremental change in governor position (p.u. MW)
ΔF – incremental change in integral control
ΔPd – the load disturbance (p.u. MW)

• The system parameters:
TG – governor time constant
TT – turbine time constant
TP – plant model time constant
Kp – plant model gain
R – speed regulation due to governor action
KE – integral control gain

Page 44

ADHDP Application for a Power System

• ADHDP policy tuning.

[Figures: convergence of the critic kernel entries P11, P12, P13, P22, P23, P33, P34, P44 and of the control gains L11, L12, L13, L14 over the time index k (0 to 3000).]

Page 45

ADHDP Application for a Power System

• Comparison: the ADHDP controller design vs. the design from [1].
• The maximum frequency deviation when using the ADHDP controller is improved by 19.3% over the controller designed in [1].
• [1] Wang, Y., R. Zhou, and C. Wen, "Robust load-frequency controller design for power systems," IEE Proc.-C, vol. 140, no. 1, 1993.

[Figures: state responses (frequency deviation, incremental change of the generator output, incremental change of the governor position, incremental change of the integral control) over 20 s; the peak frequency deviation is about -0.20 at t = 0.5 s for the ADHDP design versus about -0.25 at t = 0.58 s for the design from [1].]

Page 46

Discrete-Time Nonlinear HJB Solution Using Approximate Dynamic Programming: Convergence Proof

• Problem formulation: $x_{k+1} = f(x_k) + g(x_k) u_k$, with
$V^*(x_k) = \min_{u_k} \sum_{i=k}^{\infty} \big( x_i^T Q x_i + u_i^T R u_i \big).$

• This requires solving the DT HJB
$V^*(x_k) = \min_{u_k} \big[ x_k^T Q x_k + u_k^T R u_k + V^*(x_{k+1}) \big] = \min_{u_k} \big[ x_k^T Q x_k + u_k^T R u_k + V^*\!\big( f(x_k) + g(x_k) u_k \big) \big],$
with optimal control
$u^*(x_k) = -\tfrac{1}{2} R^{-1} g^T(x_k) \dfrac{dV^*(x_{k+1})}{dx_{k+1}}.$

Page 47

Discrete-Time Nonlinear Adaptive Dynamic Programming: HDP (Asma Al-Tamimi)

System dynamics: $x_{k+1} = f(x_k) + g(x_k)\, u(x_k)$

Value function recursion:
$V(x_k) = \sum_{i=k}^{\infty} \big( x_i^T Q x_i + u_i^T R u_i \big) = x_k^T Q x_k + u_k^T R u_k + \sum_{i=k+1}^{\infty} \big( x_i^T Q x_i + u_i^T R u_i \big) = x_k^T Q x_k + u_k^T R u_k + V(x_{k+1})$

HDP iteration:
$u_i(x_k) = \arg\min_{u} \big( x_k^T Q x_k + u^T R u + V_i(x_{k+1}) \big)$
$V_{i+1}(x_k) = \min_{u} \big( x_k^T Q x_k + u^T R u + V_i(x_{k+1}) \big) = x_k^T Q x_k + u_i^T(x_k) R\, u_i(x_k) + V_i\big( f(x_k) + g(x_k)\, u_i(x_k) \big)$

Page 48

Asma Al-Tamimi

Flavor of proofs

Proof of convergence of DT nonlinear HDP

Page 49

Standard Neural Network VFA for On-Line Implementation (HDP)

NN for the value (critic): $\hat{V}_i(x_k, W_{Vi}) = W_{Vi}^T \phi(x_k)$.
NN for the control action: $\hat{u}_i(x_k, W_{ui}) = W_{ui}^T \sigma(x_k)$ (can use a 2-layer NN).

HDP:
$u_i(x_k) = \arg\min_{u} \big( x_k^T Q x_k + u^T R u + V_i(x_{k+1}) \big)$
$V_{i+1}(x_k) = \min_{u} \big( x_k^T Q x_k + u^T R u + V_i(x_{k+1}) \big) = x_k^T Q x_k + u_i^T(x_k) R\, u_i(x_k) + V_i\big( f(x_k) + g(x_k)\, u_i(x_k) \big)$

Define the target cost function
$d\big( \phi(x_k), W_{Vi} \big) = x_k^T Q x_k + \hat{u}_i^T(x_k) R\, \hat{u}_i(x_k) + \hat{V}_i(x_{k+1}) = x_k^T Q x_k + \hat{u}_i^T(x_k) R\, \hat{u}_i(x_k) + W_{Vi}^T \phi(x_{k+1}).$

Explicit equation for the cost – use least squares for the critic NN update:
$W_{Vi+1} = \arg\min_{W_{Vi+1}} \int_{\Omega} \big| W_{Vi+1}^T \phi(x_k) - d\big( \phi(x_k), W_{Vi} \big) \big|^2 dx,$
$W_{Vi+1} = \left( \int_{\Omega} \phi(x_k)\, \phi^T(x_k)\, dx \right)^{-1} \int_{\Omega} \phi(x_k)\, d\big( \phi(x_k), W_{Vi}, W_{ui} \big)\, dx.$

Implicit equation for the DT control – use gradient descent for the action update (backpropagation, P. Werbos):
$W_{ui} = \arg\min_{W} \Big( x_k^T Q x_k + \hat{u}^T(x_k, W) R\, \hat{u}(x_k, W) + \hat{V}_i\big( f(x_k) + g(x_k)\, \hat{u}(x_k, W) \big) \Big),$
$W_{ui}^{(j+1)} = W_{ui}^{(j)} - \alpha\, \frac{\partial \big( x_k^T Q x_k + \hat{u}_i^{(j)T} R\, \hat{u}_i^{(j)} + \hat{V}_i(x_{k+1}) \big)}{\partial W_{ui}^{(j)}} = W_{ui}^{(j)} - \alpha\, \sigma(x_k) \Big( 2 R\, \hat{u}_i^{(j)}(x_k) + g^T(x_k) \Big( \frac{\partial \phi(x_{k+1})}{\partial x_{k+1}} \Big)^{T} W_{Vi} \Big)^{T}.$
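A schematic Python/NumPy sketch of these two updates under simplifying assumptions of ours (a finite set of sample states standing in for the integral over Ω, measured next states, known $g(x)$, and user-supplied basis functions phi/sigma with a Jacobian dphi_dx); it only illustrates the shape of the batch-LS critic update and the gradient-descent actor step.

```python
import numpy as np

def critic_ls_update(phi, states, next_states, u_vals, W_V, Q, R):
    """Batch LS: fit W so that W'phi(x) matches the HDP target
    d(x) = x'Qx + u'Ru + W_V'phi(x_next) over the sample set."""
    Phi = np.array([phi(x) for x in states])
    d = np.array([x @ Q @ x + u @ R @ u + W_V @ phi(xn)
                  for x, xn, u in zip(states, next_states, u_vals)])
    W_new, *_ = np.linalg.lstsq(Phi, d, rcond=None)
    return W_new

def actor_grad_step(sigma, dphi_dx, g, x, x_next, W_u, W_V, R, alpha=0.01):
    """One gradient-descent (backprop-style) step on the action weights W_u,
    where u(x) = W_u' sigma(x) and dphi_dx returns the Jacobian of phi."""
    u = W_u.T @ sigma(x)
    grad = 2.0 * (R @ u) + g(x).T @ (dphi_dx(x_next).T @ W_V)
    return W_u - alpha * np.outer(sigma(x), grad)
```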

Page 50

Issues with Nonlinear ADP – Selection of the NN Training Set

LS solution for the critic NN update:
$W_{Vi+1} = \left( \int_{\Omega} \phi(x_k)\, \phi^T(x_k)\, dx \right)^{-1} \int_{\Omega} \phi(x_k)\, d\big( \phi(x_k), W_{Vi}, W_{ui} \big)\, dx.$

Batch LS: the integral over a region of state space is approximated using a set of points.
Recursive least squares (RLS): take sample points along a single trajectory.

Set of points over a region vs. points along a trajectory:
• For linear systems, these are the same.
• Conjecture: for nonlinear systems, they are the same under a persistence-of-excitation condition (exploration).

Page 51

Interesting Fact for HDP for Nonlinear Systems

Implicit equation for the DT control – use gradient descent for the action update, with the action NN $\hat{u}_i(x_k, W_{ui}) = W_{ui}^T \sigma(x_k)$:
$W_{ui} = \arg\min_{W} \Big( x_k^T Q x_k + \hat{u}^T(x_k, W) R\, \hat{u}(x_k, W) + \hat{V}_i\big( f(x_k) + g(x_k)\, \hat{u}(x_k, W) \big) \Big),$
$W_{ui}^{(j+1)} = W_{ui}^{(j)} - \alpha\, \sigma(x_k) \Big( 2 R\, \hat{u}_i^{(j)}(x_k) + g^T(x_k) \Big( \frac{\partial \phi(x_{k+1})}{\partial x_{k+1}} \Big)^{T} W_{Vi} \Big)^{T}.$

Note that the state internal dynamics $f(x_k)$ is NOT needed in the nonlinear case, since:
1. an NN approximation for the action is used, and
2. $x_{k+1}$ is measured.

In the linear case, $h_j(x_k) = L_j x_k = -(I + B^T P_j B)^{-1} B^T P_j A\, x_k$, so one must know the system A and B matrices.

Page 52

Discrete-Time Nonlinear HJB Solution Using ADP: Convergence Proof

• Simulation Example 1: a linear system – aircraft longitudinal dynamics (an unstable, two-input system)
$A = \begin{bmatrix} 1.0722 & 0.0954 & 0 & -0.0541 & -0.0153 \\ 4.1534 & 1.1175 & 0 & -0.8000 & -0.1010 \\ 0.1359 & 0.0071 & 1.0 & 0.0039 & 0.0097 \\ 0 & 0 & 0 & 0.1353 & 0 \\ 0 & 0 & 0 & 0 & 0.1353 \end{bmatrix}, \qquad
B = \begin{bmatrix} -0.0453 & -0.0175 \\ -1.0042 & -0.1131 \\ 0.0075 & 0.0134 \\ 0.8647 & 0 \\ 0 & 0.8647 \end{bmatrix}$

• The HJB (i.e., ARE) solution:
$P = \begin{bmatrix} 55.8348 & 7.6670 & 16.0470 & -4.6754 & -0.7265 \\ 7.6670 & 2.3168 & 1.4987 & -0.8309 & -0.1215 \\ 16.0470 & 1.4987 & 25.3586 & -0.6709 & 0.0464 \\ -4.6754 & -0.8309 & -0.6709 & 1.5394 & 0.0782 \\ -0.7265 & -0.1215 & 0.0464 & 0.0782 & 1.0240 \end{bmatrix}$

$L = \begin{bmatrix} -4.1136 & -0.7170 & -0.3847 & 0.5277 & 0.0707 \\ -0.6315 & -0.1003 & 0.1236 & 0.0653 & 0.0798 \end{bmatrix}$

Page 53

Discrete-Time Nonlinear HJB Solution Using ADP: Convergence Proof

• Simulation: the cost function approximation
$\hat{V}_{i+1}(x, W_{Vi+1}) = W_{Vi+1}^T \phi(x),$
$\phi^T(x) = \big[\, x_1^2,\; x_1 x_2,\; x_1 x_3,\; x_1 x_4,\; x_1 x_5,\; x_2^2,\; x_2 x_3,\; x_2 x_4,\; x_2 x_5,\; x_3^2,\; x_3 x_4,\; x_3 x_5,\; x_4^2,\; x_4 x_5,\; x_5^2 \,\big],$
$W_V^T = [\, w_{V1}\; w_{V2}\; \cdots\; w_{V15} \,].$

• The policy approximation
$\hat{u}_i = W_{ui}^T \sigma(x), \qquad \sigma^T(x) = [\, x_1\; x_2\; x_3\; x_4\; x_5 \,], \qquad
W_u^T = \begin{bmatrix} w_{u11} & w_{u12} & w_{u13} & w_{u14} & w_{u15} \\ w_{u21} & w_{u22} & w_{u23} & w_{u24} & w_{u25} \end{bmatrix}.$

Page 54

Discrete-Time Nonlinear HJB Solution Using ADP: Convergence Proof

• Simulation: convergence of the cost.
$W_V^T = [\, 55.5411,\; 15.2789,\; 31.3032,\; -9.3255,\; -1.4536,\; 2.3142,\; 2.9234,\; -1.6594,\; -0.2430,\; 24.8262,\; -1.3076,\; 0.0920,\; 1.5388,\; 0.1564,\; 1.0240 \,]$

The weights map back to the kernel matrix through
$P = \begin{bmatrix} w_{V1} & 0.5 w_{V2} & 0.5 w_{V3} & 0.5 w_{V4} & 0.5 w_{V5} \\ 0.5 w_{V2} & w_{V6} & 0.5 w_{V7} & 0.5 w_{V8} & 0.5 w_{V9} \\ 0.5 w_{V3} & 0.5 w_{V7} & w_{V10} & 0.5 w_{V11} & 0.5 w_{V12} \\ 0.5 w_{V4} & 0.5 w_{V8} & 0.5 w_{V11} & w_{V13} & 0.5 w_{V14} \\ 0.5 w_{V5} & 0.5 w_{V9} & 0.5 w_{V12} & 0.5 w_{V14} & w_{V15} \end{bmatrix},$
which converges to the ARE solution
$P = \begin{bmatrix} 55.8348 & 7.6670 & 16.0470 & -4.6754 & -0.7265 \\ 7.6670 & 2.3168 & 1.4987 & -0.8309 & -0.1215 \\ 16.0470 & 1.4987 & 25.3586 & -0.6709 & 0.0464 \\ -4.6754 & -0.8309 & -0.6709 & 1.5394 & 0.0782 \\ -0.7265 & -0.1215 & 0.0464 & 0.0782 & 1.0240 \end{bmatrix}.$

Page 55

Discrete-Time Nonlinear HJB Solution Using ADP: Convergence Proof

• Simulation: convergence of the control policy.
$W_u = \begin{bmatrix} 4.1068 & 0.7164 & 0.3756 & -0.5274 & -0.0707 \\ 0.6330 & 0.1005 & -0.1216 & -0.0653 & -0.0798 \end{bmatrix}, \qquad L = -W_u,$
compared with the ARE gain
$L = \begin{bmatrix} -4.1136 & -0.7170 & -0.3847 & 0.5277 & 0.0707 \\ -0.6315 & -0.1003 & 0.1236 & 0.0653 & 0.0798 \end{bmatrix}.$

Note: in this example, the internal dynamics matrix A is NOT needed.

Page 56