
Reinforcement Learning and Approximate Dynamic Programming for Feedback Control,

Edited by F.L. Lewis and D. Liu.

ISBN 0-471-XXXXX-X Copyright © 2000 Wiley[Imprint], Inc.

Chapter 5

Single Network Adaptive Critics

Networks - Development, Analysis and

Applications

Jie Ding, Ali Heydari, and S.N. Balakrishnan

Department of Mechanical and Aerospace Engineering

Missouri University of Science and Technology

5.1 Abstract

Solving infinite-time optimal control problems in an approximate dynamic programming framework with a two-network structure has become popular in recent years. In this chapter, an alternative to the two-network structure is provided. We develop single


network adaptive critics (SNAC), which eliminate the need for a separate action network to output the control. Two versions of SNAC are presented. The first version, called SNAC, outputs the costates; the second version, called J-SNAC, outputs the cost function values. A special structure called Finite-SNAC, which efficiently solves finite-time problems, is also presented. Illustrative infinite-time and finite-time problems are considered; numerical results clearly demonstrate the potential of the single network structures to solve optimal control problems.

5.2 Introduction

There are very few open-loop control based systems in practice. Feedback or closed-loop control is desired as a hedge against the noise that systems encounter in their operation and against modeling uncertainties. Optimal control based formulations have been shown to yield desirable stability margins in linear systems. For linear or nonlinear systems, dynamic

programming formulations offer the most comprehensive solutions for obtaining

feedback control [1], [2]. For nonlinear systems, solving the associated Hamilton-Jacobi-

Bellman (HJB) equation is well-nigh impossible due to the associated number of

computations and storage requirements. Werbos introduced the concept of Adaptive

Critics (AC) [3], a dual neural network structure to solve an ‘approximate dynamic

programming’ (ADP) formulation which uses the discrete version of the HJB equation.

Many researchers have embraced and investigated the enormous potential of AC based formulations over the last two decades [4]-[10]. Most of the model based adaptive critic controllers [11]-[13] and others [4], [5] were targeted at systems described by


ordinary differential equations. Ref. [14] extended the realm of applicability of the AC

design to chemical systems characterized by distributed parameter systems. Authors of

[15] formulated a global adaptive critic controller for a business jet. Authors of [16]

applied an adaptive critic based approach in an atomic force microscope based force controller to push nanoparticles on substrates. Ref. [17] showed in simulations that ADP can prevent cars from skidding when driving over unexpected patches of ice.

There are many variants of AC designs [12] of which the most popular ones are

the Heuristic Dynamic Programming (HDP) and the Dual Heuristic Programming (DHP)

architectures. In the HDP formulation, one neural network, called the 'critic' network, maps the input states to the optimal cost, and another network, called the 'action' network, outputs the optimal control with the states of the system as its inputs [12]. In the DHP formulation, while the action network remains the same as in HDP, the critic network outputs the optimal costate vector with the current states as inputs [11], [15].

Note that the AC designs are formulated in a discrete framework. Ref. [18] considered

the use of continuous-time adaptive critics. More recently, some authors have pursued

continuous-time controls [19]-[22]. In [19] and [20] the suggested algorithms for the

online optimal control of continuous systems are based on sequential updates of the actor

and critic networks.

Many nice theoretical developments have taken place in the last few years, primarily due to the groups led by Lewis and Sarangapani, which have made reinforcement learning based AC controllers acceptable for mainstream evaluation. However, there has not been much work on alleviating the computational load of the AC paradigms from a practical standpoint. Balakrishnan's group has been working from


this perspective and formulated the Single Network Adaptive Critic designs. Authors of

[23]-[25] have proposed a Single Network Adaptive Critic (SNAC) related to DHP

designs. SNAC eliminates the usage of one neural network (namely the action network)

that is a part of a typical dual network AC setup. As a consequence, the SNAC

architecture offers three advantages: a simpler architecture for implementation, less

computational load for training and recall process, and elimination of the approximation

error associated with the eliminated network. In [25], it is shown through comparison

studies that the SNAC costs about half the computation time as that of a typical dual

network AC structure for the same problem.

SNAC is applicable to a wide class of nonlinear systems where the optimal

control equation can be explicitly expressed in terms of the state and costate vectors.

Most of the problems in aerospace, automobile, robotics and other engineering

disciplines can be characterized by nonlinear control-affine equations that yield such a

relation. SNAC based controllers have yielded excellent tracking results in micro-electromechanical systems and chemical reactor applications [25], [14]. Authors of [25] have proved that, for linear systems, SNAC converges to the discrete Riccati equation solution.

Motivated by the simplicity of the training and implementation phases resulting from SNAC, Ref. [26] developed an HDP based SNAC scheme. In their cost function based single network adaptive critic, or J-SNAC, the critic network outputs the cost instead of the costate vector. There are some advantages to the J-SNAC formulation: first, a smaller neural network is needed for J-SNAC because the output is only a scalar, while in SNAC the output is the costate vector, which has as many elements as the order


of the system. The smaller network size results in less training and recall effort in J-SNAC. Second, the J-SNAC output is the optimal cost, which has some physical meaning for the designer, as opposed to the costate vector in SNAC.

Note that the developments introduced above have mainly addressed only infinite-horizon problems. On the other hand, finite-horizon optimal control is an important branch of optimal control theory. Solutions to the resulting time-varying HJB equation are usually very difficult to obtain. In the finite-horizon case, the optimal cost-to-go is not only a function of the current states, but also a function of how much time is left (time-to-go) to accomplish the goal. There is hardly any work in the neural network literature on solving this class of problems [28], [29].

To deal with finite-horizon problems, a single neural network based solution, called Finite-SNAC, was developed in [27], which embeds solutions to the time-varying HJB equation. Finite-SNAC is a DHP based neural network controller for finite-horizon optimal control of discrete-time input-affine nonlinear systems and is developed based on the idea of using a single set of weights for the network. Finite-SNAC has the current time as well as the states as inputs to the network. This feature results in much less required storage memory compared to other published methods [28], [29]. Note that, once trained, Finite-SNAC provides optimal feedback solutions for any final time as long as it is less than the final time for which the network is synthesized.

The rest of this chapter is organized as follows: the approximate dynamic programming formulation is discussed in Section 5.3, followed by the presentation of the SNAC architecture for infinite-horizon problems in Section 5.4; the J-SNAC architecture and


simulation results are presented in Section 5.5; and the Finite-SNAC architecture for finite-time optimal control problems and its simulation results are presented in Section 5.6.

5.3 Approximate Dynamic Programming

The dynamic programming formulation offers a comprehensive optimal control solution

in a state feedback form; however, it is handicapped by computational and storage

requirements. The approximate dynamic programming (ADP) formulation implemented

with an Adaptive Critic (AC) neural network structure has evolved as a powerful

alternative technique that obviates the need for excessive computations and storage

requirements in solving optimal control problems.

In this section, the principles of ADP are described. An interested reader can find

more details about the derivations in [11] and [3]. Note that a prime requirement for applying ADP is to formulate the problem in discrete time. The control designer has the freedom to use any appropriate discretization scheme; for example, one can use the Euler approximation for the state equation and the trapezoidal approximation for the cost function [30]. A minimal sketch of such a discretization is given below.
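To make the discretization step concrete, the following is a minimal sketch, assuming an illustrative scalar control-affine model; the drift `f_c`, the matrices `B`, `Q`, `R`, and the sampling time `dt` are hypothetical placeholders, not values from the chapter.

```python
import numpy as np

# A minimal sketch of the discretization described above, assuming a
# continuous-time control-affine model xdot = f(x) + B u; all numbers are
# illustrative placeholders.

def f_c(x):
    # Example nonlinear drift term (hypothetical one-state system).
    return -x + 0.1 * x**3

B = np.array([[1.0]])
Q = np.array([[1.0]])
R = np.array([[0.1]])
dt = 0.01

def step_euler(x, u):
    # Euler approximation of the state equation: x_{k+1} = x_k + dt*(f(x_k) + B u_k)
    return x + dt * (f_c(x) + B @ u)

def stage_cost(x, u):
    # Discretized quadratic utility: Psi_k = 0.5*dt*(x'Qx + u'Ru)
    return 0.5 * dt * (x @ Q @ x + u @ R @ u)
```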

In a discrete-time formulation, the objective is to find an admissible control $u_k$ which causes the system given by

$$x_{k+1} = F(x_k, u_k) \qquad (5.1)$$

to follow an admissible trajectory from an initial point $x_0$ to a final desired point $x_N$ while minimizing a desired cost function $J$ given by


$$J = \sum_{k=0}^{N-1} \Psi_k(x_k, u_k) \qquad (5.2)$$

where $x_k \in \mathbb{R}^n$ and $u_k \in \mathbb{R}^m$ denote the state vector and the control at time step $k$, respectively, $n$ is the order of the system, and $m$ is the system's number of inputs. The functions $F$ and $\Psi_k$ are assumed to be differentiable with respect to both $x_k$ and $u_k$. Moreover, $\Psi_k$ is assumed to be convex (e.g., a quadratic function in $x_k$ and $u_k$). One can notice that when $N \to \infty$, this cost function leads to a regulator (infinite-time) problem.

The steps in obtaining the optimal control are now described. First, the cost function (5.2) is rewritten to start from time step $k$ as

$$J_k = \sum_{\bar{k}=k}^{N-1} \Psi_{\bar{k}}(x_{\bar{k}}, u_{\bar{k}}) \qquad (5.3)$$

The cost, $J_k$, can be split into

$$J_k = \Psi_k + J_{k+1} \qquad (5.4)$$

where $\Psi_k$ and $J_{k+1} = \sum_{\bar{k}=k+1}^{N-1} \Psi_{\bar{k}}$ represent the 'utility function' at time step $k$ and the cost-to-go from time step $k+1$ to $N$, respectively.

The costate vector at time step $k$ is defined as

$$\lambda_k = \frac{\partial J_k}{\partial x_k} \qquad (5.5)$$

The necessary condition for optimality is given by

$$\frac{\partial J_k}{\partial u_k} = 0 \qquad (5.6)$$


Equation (5.6) can be further expanded as

$$\frac{\partial J_k}{\partial u_k} = \frac{\partial \Psi_k}{\partial u_k} + \left(\frac{\partial x_{k+1}}{\partial u_k}\right)^T \frac{\partial J_{k+1}}{\partial x_{k+1}} = 0 \qquad (5.7)$$

The optimal control equation can, therefore, be written as

$$\frac{\partial \Psi_k}{\partial u_k} + \left(\frac{\partial x_{k+1}}{\partial u_k}\right)^T \lambda_{k+1} = 0 \qquad (5.8)$$

The costate equation is derived in the following way:

$$\lambda_k = \frac{\partial J_k}{\partial x_k} = \frac{\partial \Psi_k}{\partial x_k} + \left(\frac{\partial x_{k+1}}{\partial x_k}\right)^T \lambda_{k+1} \qquad (5.9)$$

In order to synthesize an optimal controller, (5.1), (5.8), and (5.9) have to be solved simultaneously, along with appropriate boundary conditions. For regulator problems, the boundary conditions usually take the form: $x_0$ is fixed and $x_k \to 0$ as $k \to \infty$. If the state equation and cost function are such that one can obtain an explicit relationship for the control variable in terms of the state and costate variables from equation (5.8), the ADP formulation can be solved through SNAC. Note that control-affine nonlinear systems (of the form $x_{k+1} = f(x_k) + g(x_k) u_k$) with a quadratic cost function (of the form $J = \frac{1}{2}\sum_{k=0}^{\infty}\left( x_k^T Q x_k + u_k^T R u_k \right)$) fall under this class.
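As a concrete illustration of such an explicit relation, the sketch below solves (5.8) for $u_k$ under a quadratic utility; the two-state model $g(x)$ and the weight $R$ are illustrative assumptions, not values from the chapter.

```python
import numpy as np

# Sketch of the explicit optimal control relation obtained from (5.8) for a
# control-affine system with quadratic cost; g(x) and R are hypothetical.

R = np.diag([0.1])

def g(x):
    # Hypothetical input-coupling matrix g(x_k), shape (n, m) = (2, 1).
    return np.array([[0.0], [1.0 + 0.1 * x[0] ** 2]])

def optimal_control(x, lam_next):
    # Equation (5.8) with Psi_k = 0.5*(x'Qx + u'Ru) gives
    #   R u_k + g(x_k)^T lambda_{k+1} = 0  =>  u_k = -R^{-1} g(x_k)^T lambda_{k+1}
    return -np.linalg.solve(R, g(x).T @ lam_next)

# Example usage with an arbitrary state and costate guess:
u = optimal_control(np.array([0.5, -0.2]), np.array([0.3, 0.1]))
```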

5.4 SNAC

In this section, an improvement and modification to the AC architecture, called the

“Single Network Adaptive Critic (SNAC)” related to the DHP designs is presented. In the


SNAC design, the critic network captures the functional relationship between the state $x_k$ and the optimal costate $\lambda_{k+1}$. Denoting the neural network functional mapping with $N_c(\cdot)$, one has

$$\lambda_{k+1} = N_c(x_k) \qquad (5.10)$$

For the input-affine system and the quadratic cost function described below, once $\lambda_{k+1}$ is calculated, one can generate the optimal control through equation (5.13).

$$x_{k+1} = f(x_k) + g(x_k)\,u_k \qquad (5.11)$$

$$J = \frac{1}{2}\sum_{k=0}^{\infty}\left( x_k^T Q x_k + u_k^T R u_k \right) \qquad (5.12)$$

$$u_k = -R^{-1} g(x_k)^T \lambda_{k+1} \qquad (5.13)$$

Note that, for this case, equation (5.9) reads

$$\lambda_k = Q x_k + \left(\frac{\partial \left( f(x_k) + g(x_k)\,u_k \right)}{\partial x_k}\right)^T \lambda_{k+1} \qquad (5.14)$$

5.4.1 State Generation for Neural Network Training

State generation is an important part of the training process for the SNAC outlined in the next subsection. For this purpose, define the training set $S \subseteq \Omega$, where $\Omega$ denotes the domain in which the system operates. This domain is chosen so that its elements cover a large number of points of the state space in which the state trajectories are expected to lie. For a systematic training scheme, a 'telescopic method' is arrived at as follows:


For $i = 1$, define the set $S_1 = \{ x : \|x\| \le c_1 \}$, where $c_1$ is a positive constant. At the beginning, a small $c_1$ is fixed and the network is trained with the states generated in $S_1$. After the network converges (the convergence condition will be discussed in the next subsection), $c_2 > c_1$ is chosen and the network is trained again for states within $S_2 = \{ x : \|x\| \le c_2 \}$, and so on. The network training is continued until $i = I$, where $S_I$ covers the domain of interest $\Omega$.
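A minimal sketch of this telescopic state generation, assuming the sets $S_i$ are balls of growing radius; the radii, sample counts, and state dimension below are illustrative choices, not values from the chapter.

```python
import numpy as np

# Telescopic state generation: sample training states from balls of
# increasing radius c_1 < c_2 < ... < c_I covering the domain Omega.

def sample_ball(radius, n_states, dim, rng):
    # Sample states uniformly in the ball {x : ||x|| <= radius}.
    pts = rng.normal(size=(n_states, dim))
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)             # directions
    r = radius * rng.uniform(size=(n_states, 1)) ** (1.0 / dim)   # radii
    return pts * r

rng = np.random.default_rng(0)
radii = [0.1, 0.5, 1.0]                 # illustrative c_1 < c_2 < c_3
stages = [sample_ball(c, n_states=300, dim=6, rng=rng) for c in radii]
```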

5.4.2 Neural Network Training

The steps for training the SNAC network (see Figure 5.1) are as follows, with a sketch given after the list:

1) Generate $S_i$. For each element $x_k$ of $S_i$, follow the steps below:
   a. Input $x_k$ to the critic network to obtain $\lambda_{k+1} = N_c(x_k)$.
   b. Calculate $u_k$ from the optimal control equation (5.13) with $x_k$ and $\lambda_{k+1}$.
   c. Get $x_{k+1}$ from the state equation (5.11) using $x_k$ and $u_k$.
   d. Input $x_{k+1}$ to the critic network to get $\lambda_{k+2}$.
   e. Using $x_{k+1}$ and $\lambda_{k+2}$, calculate the target $\lambda_{k+1}^t$ from the costate equation (5.14).

2) Train the critic network for all $x_k$ in $S_i$; the target being $\lambda_{k+1}^t$.

3) Check the convergence of the network. If it has converged, revert to step 1 with $S_{i+1}$ until $i = I$. Otherwise, repeat steps 1-2.
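The pass below sketches steps 1-2 for a linear-in-weights critic $\lambda_{k+1} = W^T \phi(x_k)$; the model $f$, $g$, the basis $\phi$, the weights $Q$, $R$, and the least-squares fit are all illustrative assumptions rather than the chapter's exact implementation.

```python
import numpy as np

# One SNAC training pass (steps 1-2 above) for an assumed linear-in-weights
# critic; the two-state model and basis are hypothetical.

Q, R = np.eye(2), np.diag([0.1])

def f(x):  # drift term of x_{k+1} = f(x) + g(x) u (hypothetical)
    return 0.9 * x + 0.05 * np.array([x[1] ** 2, -x[0] * x[1]])

def g(x):  # input coupling (hypothetical, constant here for simplicity)
    return np.array([[0.0], [1.0]])

def phi(x):  # critic basis functions (assumed polynomial)
    return np.array([x[0], x[1], x[0] * x[1], x[0] ** 2, x[1] ** 2])

def df_dx(x, eps=1e-6):  # finite-difference Jacobian of f (illustration)
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        dx = np.zeros(n); dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

def snac_pass(W, states):
    feats, targets = [], []
    for xk in states:
        lam1 = W.T @ phi(xk)                          # step a
        uk = -np.linalg.solve(R, g(xk).T @ lam1)      # step b, eq. (5.13)
        xk1 = f(xk) + (g(xk) @ uk).ravel()            # step c, eq. (5.11)
        lam2 = W.T @ phi(xk1)                         # step d
        lam1_t = Q @ xk1 + df_dx(xk1).T @ lam2        # step e, eq. (5.14), g constant
        feats.append(phi(xk)); targets.append(lam1_t)
    # Step 2: least-squares update of the critic weights.
    return np.linalg.lstsq(np.array(feats), np.array(targets), rcond=None)[0]
```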


Figure 5.1. SNAC Training Scheme

5.4.3 Convergence Condition

In order to check the convergence of the critic network, a set of new states and the corresponding target outputs are generated as described in the previous subsection. Let the target outputs be $\lambda^t$ and the outputs from the trained network (using the same inputs) be $\lambda$. A tolerance value, $tol_c$, is used as the convergence criterion for the critic network. By defining the relative errors $e_j = \|\lambda^t_j - \lambda_j\| / \|\lambda^t_j\|$ and $e = \{e_1, e_2, \ldots\}$, the training process is stopped when $\|e\| < tol_c$.

5.5 J-SNAC

In this section, the cost function based single network adaptive critic, called J-SNAC, is presented. In J-SNAC, the critic network outputs the cost instead of the costate as in SNAC. This approach is applicable to the class of nonlinear systems of the form $x_{k+1} = f(x_k) + B u_k$ with a constant input matrix $B$. As mentioned in the introduction, the J-SNAC technique retains all the powerful features of the AC methodology while eliminating the


action network completely. In the J-SNAC design, the critic network captures the functional relationship between the state $x_k$ and the optimal cost $J(x_k)$. Denoting the functional mapping of J-SNAC with $N_c(\cdot)$, one has

$$J(x_k) = N_c(x_k) \qquad (5.15)$$

One can calculate $\lambda_k$ through $\lambda_k = \partial J(x_k)/\partial x_k$ and rewrite the costate equation (5.14), for the quadratic cost function (5.12), in the following form

$$\frac{\partial J(x_k)}{\partial x_k} = Q x_k + \left(\frac{\partial f(x_k)}{\partial x_k}\right)^T \frac{\partial J(x_{k+1})}{\partial x_{k+1}} \qquad (5.16)$$

The optimal control can now be calculated as

$$u_k = -R^{-1} B^T \lambda_{k+1} = -R^{-1} B^T \frac{\partial J(x_{k+1})}{\partial x_{k+1}} \qquad (5.17)$$

5.5.1 Neural Network Training

Using a similar state generation and convergence check as discussed in the SNAC training procedure, the steps for training the J-SNAC network are as follows (Fig. 5.2); a sketch follows the list:

1) Generate $S_i$. For each element $x_k$ of $S_i$, follow the steps:
   a. Input $x_k$ to the critic network to obtain $J(x_k)$.
   b. Calculate $\lambda_k = \partial J(x_k)/\partial x_k$ and then $\lambda_{k+1}$ using equation (5.16).
   c. Calculate $u_k$ from equation (5.17).
   d. Use $x_k$ and $u_k$ to get $x_{k+1}$ from the state equation.
   e. Input $x_{k+1}$ to the critic network to get $J(x_{k+1})$.
   f. Use $\Psi_k$, computed from $x_k$ and $u_k$, and $J(x_{k+1})$ to calculate the target $J^t(x_k)$ using equation (5.4).

2) Train the critic network for all $x_k$ in $S_i$ with the target output being $J^t(x_k)$.

3) Check the convergence of the critic network. If it has converged, revert to step 1 with $S_{i+1}$ until $i = I$. Otherwise, repeat steps 1-2.
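The following sketches one J-SNAC pass for an assumed linear-in-weights critic $J(x) = W^T \phi(x)$, so that $\partial J/\partial x = (\partial \phi/\partial x)^T W$; the model $f$, constant $B$, weights $Q$, $R$, and basis $\phi$ are illustrative assumptions.

```python
import numpy as np

# One J-SNAC training pass (steps a-f above); the two-state model, basis,
# and weights are hypothetical.

Q, R = np.eye(2), np.diag([0.1])
B = np.array([[0.0], [1.0]])

def f(x):
    return 0.9 * x + 0.05 * np.array([x[1] ** 2, -x[0] * x[1]])

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, x[0] * x[1]])

def dphi_dx(x):
    # Analytic Jacobian of the basis, shape (n_basis, n_state).
    return np.array([[2 * x[0], 0.0],
                     [0.0, 2 * x[1]],
                     [x[1], x[0]]])

def df_dx(x):
    return np.array([[0.9, 0.1 * x[1]],
                     [-0.05 * x[1], 0.9 - 0.05 * x[0]]])

def jsnac_pass(W, states):
    feats, targets = [], []
    for xk in states:
        lam_k = dphi_dx(xk).T @ W                      # steps a-b: dJ(x_k)/dx_k
        # Step b: solve (5.16) for lambda_{k+1} given lambda_k.
        lam_k1 = np.linalg.solve(df_dx(xk).T, lam_k - Q @ xk)
        uk = -np.linalg.solve(R, B.T @ lam_k1)         # step c, eq. (5.17)
        xk1 = f(xk) + (B @ uk).ravel()                 # step d
        J_next = W @ phi(xk1)                          # step e
        psi = 0.5 * (xk @ Q @ xk + uk @ R @ uk)        # utility at step k
        targets.append(psi + J_next)                   # step f, eq. (5.4)
        feats.append(phi(xk))
    # Step 2: least-squares fit of the scalar cost targets.
    return np.linalg.lstsq(np.array(feats), np.array(targets), rcond=None)[0]
```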

Figure 5.2. J-SNAC Training Scheme.

5.5.2 Numerical Analysis

For illustrative purposes, a satellite attitude control problem is selected.

5.5.2.1 Modeling the Attitude Control Problem

Consider a rigid spacecraft controlled by reaction wheels. It is assumed that the control

torques are applied through a set of reaction wheels along three orthogonal axes. The

spacecraft rotational motion is described by [31]

$$I\dot{\omega} = p \times \omega + \tau \qquad (5.18)$$

where $I$ is the moment-of-inertia matrix, $\omega$ is the angular velocity of the body frame with respect to the inertial frame, $p$ is the total spacecraft angular momentum expressed


in the spacecraft body frame, and $\tau$ is the torque applied to the spacecraft by the reaction wheels.

Using the Euler angles $\phi$, $\theta$, and $\psi$, the kinematics equation describing the attitude of the spacecraft may be written as

$$[\dot{\phi}\ \ \dot{\theta}\ \ \dot{\psi}]^T = R(\Theta)\,\omega \qquad (5.19)$$

where

$$R(\Theta) = \begin{bmatrix} 1 & \sin\phi\tan\theta & \cos\phi\tan\theta \\ 0 & \cos\phi & -\sin\phi \\ 0 & \sin\phi\sec\theta & \cos\phi\sec\theta \end{bmatrix} \qquad (5.20)$$

The total spacecraft angular momentum $p$ is written as

$$p = C(\Theta)\,h \qquad (5.21)$$

where $h = [h_1\ h_2\ h_3]^T$ is the (constant) inertial angular momentum and $C(\Theta)$ is the inertial-to-body rotation matrix

$$C(\Theta) = \begin{bmatrix} \cos\theta\cos\psi & \cos\theta\sin\psi & -\sin\theta \\ \sin\phi\sin\theta\cos\psi - \cos\phi\sin\psi & \sin\phi\sin\theta\sin\psi + \cos\phi\cos\psi & \sin\phi\cos\theta \\ \cos\phi\sin\theta\cos\psi + \sin\phi\sin\psi & \cos\phi\sin\theta\sin\psi - \sin\phi\cos\psi & \cos\phi\cos\theta \end{bmatrix} \qquad (5.22)$$

Choosing $x = [\phi\ \theta\ \psi\ \omega_1\ \omega_2\ \omega_3]^T$ as the states and $u = [\tau_1\ \tau_2\ \tau_3]^T$ as the controls, the dynamics of the system, i.e., (5.18) and (5.19), can be rewritten as

$$\begin{bmatrix} \dot{\phi} \\ \dot{\theta} \\ \dot{\psi} \\ \dot{\omega}_1 \\ \dot{\omega}_2 \\ \dot{\omega}_3 \end{bmatrix} = \begin{bmatrix} R(\Theta)\,\omega \\ I^{-1}(p \times \omega) \end{bmatrix} + \begin{bmatrix} 0_{3\times 3} \\ I^{-1} \end{bmatrix} u \qquad (5.23)$$


The control objective is to drive all the states to zero as $t \to \infty$. A quadratic cost function is selected as

$$J = \frac{1}{2}\int_0^{\infty} \left( x^T Q x + u^T R u \right) dt \qquad (5.24)$$

where $Q$ is a positive semi-definite matrix and $R$ is a positive definite matrix for penalizing the states and controls, respectively.

Denoting the time step by $\Delta t$, the state equation is discretized as

$$x_{k+1} = x_k + \Delta t \left( f(x_k) + B u_k \right) \qquad (5.25)$$

where $f(x_k)$ and $B$ are given in the state equation (5.23). The quadratic cost function (5.24) is also discretized as

$$J = \frac{1}{2}\sum_{k=0}^{\infty} \Delta t \left( x_k^T Q x_k + u_k^T R u_k \right) \qquad (5.26)$$

The optimality condition leads to the control equation

$$u_k = -R^{-1} B^T \frac{\partial J(x_{k+1})}{\partial x_{k+1}} \qquad (5.27)$$
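To show how (5.25) and (5.27) combine in a closed-loop step, here is a minimal sketch; `critic_grad` stands in for the gradient $\partial J/\partial x$ of whatever trained cost network is available, the model follows (5.23) with an identity inertia matrix, and `R`, `dt`, and the momentum handling are illustrative simplifications.

```python
import numpy as np

# Closed-loop stepping of the discretized attitude system using (5.25) and
# (5.27); critic_grad(x) is a hypothetical trained-critic gradient.

R = np.eye(3)
dt = 0.01
B_att = np.vstack([np.zeros((3, 3)), np.eye(3)])  # [0; I^{-1}] with I = eye(3)

def f_att(x, h=np.zeros(3)):
    # Drift term of (5.23): Euler-angle kinematics and I^{-1}(p x omega).
    phi, th, psi = x[:3]
    w = x[3:]
    Rk = np.array([[1, np.sin(phi) * np.tan(th), np.cos(phi) * np.tan(th)],
                   [0, np.cos(phi), -np.sin(phi)],
                   [0, np.sin(phi) / np.cos(th), np.cos(phi) / np.cos(th)]])
    p = w + h          # with I = identity and (for simplicity) h taken in body frame
    return np.concatenate([Rk @ w, np.cross(p, w)])

def step(x, critic_grad):
    # Equation (5.27): u_k = -R^{-1} B^T dJ/dx, evaluated here at x_k as a
    # one-step approximation (a simplification for illustration).
    u = -np.linalg.solve(R, B_att.T @ critic_grad(x))
    return x + dt * (f_att(x) + B_att @ u)            # equation (5.25)
```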

5.5.2.2 Simulation

In the numerical simulations, the weighting matrices $Q$ and $R$ are selected such that $Q$ penalizes only the Euler angles. Note that regulating the Euler angles will automatically regulate the angular rates as well; hence, penalizing the first three elements of the state vector through the selected $Q$ will guarantee the regulation of the whole state vector. The inertia matrix $I$ is selected as an identity matrix.


A single-layer neural network of the form $J(x) = W^T \phi(x)$ is selected, where $W$ denotes the neural network weights and $\phi(\cdot)$ denotes the basis functions. The network weights are initialized to zero and the basis functions are selected as

$$\phi(x) = [\ \cdots\ ]^T \qquad (5.28)$$

where $x_i$, $i = 1, \ldots, 6$, denotes the $i$th element of the state vector $x$. A nonzero initial condition $x_0$ is selected for the simulation.

Histories of the Euler angles and rotation rates are shown in Fig. 5.3. It can be seen that all the states go to zero within 5 seconds. Moreover, as seen in Fig. 5.4, which shows the history of the applied controls, the goal is achieved through bounded controls about the three axes of the spacecraft.

Figure 5.3. Histories of angles and body rates.

Figure 5.4. Histories of applied controls.


5.6 Finite-SNAC

Finite-SNAC solves finite-horizon optimal control of nonlinear input-affine systems based on a DHP formulation. The discrete-time nonlinear input-affine system and the quadratic cost function to be minimized are given below.

$$x_{k+1} = f(x_k) + g(x_k)\,u_k \qquad (5.29)$$

$$J = \frac{1}{2} x_N^T Q_f x_N + \frac{1}{2}\sum_{k=0}^{N-1}\left( x_k^T Q x_k + u_k^T R u_k \right) \qquad (5.30)$$

where $x_k \in \mathbb{R}^n$ and $u_k \in \mathbb{R}^m$ denote the state vector and the control at time step $k$, respectively, $n$ is the order of the system, and $m$ is the number of inputs. $f(\cdot)$ and $g(\cdot)$ describe the system dynamics, and $Q_f$, $Q$, and $R$ are weighting matrices for the final states, states, and control vectors, respectively. $Q_f$ and $Q$ are positive semi-definite matrices and $R$ is a positive definite matrix. Finally, $N$ is the total (fixed) number of time steps and superscript $T$ denotes transposition.

Denoting the neural network mapping by $N(\cdot)$, a single neural network called Finite-SNAC is developed to output the desired costate vector based on the state vector and the time-to-go as inputs:

$$\lambda_{k+1} = N(x_k,\, N-k,\, W) \qquad (5.31)$$

where $\lambda_{k+1}$ is the system costate vector at time step $k+1$ and $W$ denotes the network weight matrix.

The neural network in this section is selected in a form that is linear in the weights:


$$N(x_k,\, N-k,\, W) = W^T \phi(x_k,\, N-k) \qquad (5.32)$$

where $\phi(\cdot)$ is composed of linearly independent scalar basis functions, whose number equals the number of neurons.
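A minimal sketch of this linear-in-weights network (5.31)-(5.32); the polynomial basis in the state and the normalized time-to-go is an illustrative assumption, not the chapter's exact basis.

```python
import numpy as np

# Linear-in-weights Finite-SNAC network: lambda_{k+1} = W^T phi(x_k, N-k).
# The basis below is a hypothetical choice for illustration.

def phi(x, t_go, N):
    t = t_go / N                      # time-to-go normalized by the horizon
    base = np.concatenate([x, x**2])  # polynomial terms in the state
    return np.concatenate([base, t * base, np.array([t, t**2])])

def costate(W, x, k, N):
    # Equation (5.31): the network maps (x_k, N-k) to lambda_{k+1}.
    return W.T @ phi(x, N - k, N)

n, n_basis = 6, 2 * 6 * 2 + 2         # 26 basis functions for a 6-state system
W = np.zeros((n_basis, n))            # weight matrix, one column per costate
```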

Following the ADP framework discussed in Section 5.3, the costate equation will be

$$\lambda_k = Q x_k + \left(\frac{\partial \left( f(x_k) + g(x_k)\,u_k \right)}{\partial x_k}\right)^T \lambda_{k+1}, \quad k < N \qquad (5.33)$$

Note that since

$$J_N = \frac{1}{2} x_N^T Q_f x_N \qquad (5.34)$$

the final condition on the costate vector is

$$\lambda_N = Q_f x_N \qquad (5.35)$$

The network training targets, denoted by superscript $t$, can be calculated using the following two equations:

$$\lambda_{k+1}^t = Q x_{k+1} + \left(\frac{\partial \left( f(x_{k+1}) + g(x_{k+1})\,u_{k+1} \right)}{\partial x_{k+1}}\right)^T \lambda_{k+2}, \quad k+1 < N \qquad (5.36)$$

$$\lambda_N^t = Q_f x_N \qquad (5.37)$$

In the training process, $\lambda_{k+2}$ on the right-hand side of (5.36) will be substituted by $N(x_{k+1},\, N-k-1,\, W)$, as described earlier in the SNAC training process.

Once the network is trained, it can be used for optimal feedback control in the sense that, in the online implementation, the states and the time-to-go will be fed into the network to generate the optimal costate vector, and the optimal control will be calculated through (5.38).

$$u_k = -R^{-1} g(x_k)^T \lambda_{k+1} \qquad (5.38)$$

5.6.1 Neural Network Training

The Finite-SNAC training should be done in such a way that, along with learning the targets given in (5.36) for every state $x_k$ and time $k$, the final condition (5.37) is also satisfied. The authors suggest augmenting the training input-target pairs in such a way that the final condition is forced to be met in each learning iteration. To do so, one can define the following augmented parameters:

$$\bar{x} = [x_k \ \ x_N] \qquad (5.39)$$

$$\bar{\phi} = [\phi(x_k,\, N-k) \ \ \phi(x_N,\, 0)] \qquad (5.40)$$

Now, the network output and the target to be learned are

$$\bar{Y} = W^T \bar{\phi} \qquad (5.41)$$

$$\bar{Y}^t = [\lambda_{k+1}^t \ \ \lambda_N^t] \qquad (5.42)$$

The training error is given by

$$e = \bar{Y}^t - \bar{Y} \qquad (5.43)$$

Now, in each iteration, along with selecting a random state $x_k$, a random time index $k$, $0 \le k \le N-1$, will also be selected. Feeding $x_k$ and $N-k$ into (5.31) results in a costate vector $\lambda_{k+1}$, which will be used for calculating $u_k$ through (5.38). Having $x_k$ and $u_k$, one can propagate to $x_{k+1}$ using (5.29). Having fed $x_{k+1}$ into (5.31) once more, $\lambda_{k+2}$ results, which is the desired unknown needed for the target calculation through (5.36). Then, to calculate $\lambda_N^t$ through (5.37), the randomly selected state will be considered as $x_{N-1}$ and propagated to $x_N$ using the process discussed above for propagating $x_k$ to $x_{k+1}$, and fed into (5.37). Finally, $\bar{Y}^t$ will be formed using (5.42). This process is depicted graphically in Fig. 5.5. In this diagram, the left column follows (5.36) and the right column follows (5.37) for the target calculation.

Having the input-target pair $\{\bar{\phi},\, \bar{Y}^t\}$ calculated, the network can be trained using any training method. The selected training law in this study is least squares; a sketch of one such update follows.
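A minimal sketch of one such least-squares iteration, reusing the illustrative basis `phi(x, t_go, N)` from the earlier Finite-SNAC sketch; the dynamics `f`, `g`, the Jacobian helper `dfg_dx`, and the weights `Q`, `Qf`, `R` are assumed inputs, not the chapter's values.

```python
import numpy as np

# One Finite-SNAC training iteration with a batch least-squares update;
# targets follow (5.36) (left column) and (5.37) (right column).

def finite_snac_iteration(W, N, n, phi, f, g, dfg_dx, Q, Qf, R, rng, batch=300):
    feats, targets = [], []
    for _ in range(batch):
        x = rng.uniform(-1.0, 1.0, size=n)       # random training state
        k = rng.integers(0, N)                   # random time index in [0, N-1]
        # Left column: target (5.36) for the input pair (x_k, N-k).
        lam1 = W.T @ phi(x, N - k, N)
        u = -np.linalg.solve(R, g(x).T @ lam1)   # control via (5.38)
        x1 = f(x) + g(x) @ u
        lam2 = W.T @ phi(x1, N - k - 1, N)       # network estimate of lambda_{k+2}
        t_mid = Q @ x1 + dfg_dx(x1, u).T @ lam2 if k + 1 < N else Qf @ x1
        feats.append(phi(x, N - k, N)); targets.append(t_mid)
        # Right column: enforce the final condition (5.37) by treating the
        # sample as x_{N-1} and propagating one step to x_N.
        lamN = W.T @ phi(x, 1, N)
        uN = -np.linalg.solve(R, g(x).T @ lamN)
        xN = f(x) + g(x) @ uN
        feats.append(phi(xN, 0, N)); targets.append(Qf @ xN)
    A, T = np.array(feats), np.array(targets)
    return np.linalg.lstsq(A, T, rcond=None)[0]  # least-squares weight update
```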

Figure 5.5. Finite-SNAC Training Scheme.



5.6.2 Convergence Theorems

The proposed algorithm for the Finite-SNAC training is based on the DHP scheme, in which, starting with an initial value for the costate vector, one iterates to converge to the optimal costate. Denoting the iteration index by a superscript and the time index by a subscript, the learning algorithm for finite-horizon optimal control starts with an initial value assignment to $\lambda_k^0$ for all $k$'s, e.g., $\lambda_k^0 = 0$, and repeats the three calculations below for $i$ from zero to infinity:

$$u_k^i = -R^{-1} g(x_k)^T \lambda_{k+1}^i \qquad (5.44)$$

$$\lambda_k^{i+1} = Q x_k + \left(\frac{\partial \left( f(x_k) + g(x_k)\,u_k^i \right)}{\partial x_k}\right)^T \lambda_{k+1}^i \qquad (5.45)$$

$$\lambda_N^{i+1} = Q_f x_N \qquad (5.46)$$

The last equation is actually the final condition of the optimal control problem. Note that

$$\lambda_{k+1}^i = \lambda_{k+1}^i\!\left(x_{k+1}^i\right) \qquad (5.47)$$

$$x_{k+1}^i = f(x_k) + g(x_k)\,u_k^i \qquad (5.48)$$

The problem is to prove that this iterative procedure results in the optimal values for the costate $\lambda$ and control $u$. The convergence proofs presented in [27] are based on the convergence of HDP, in which the parameter subject to evolution is the cost function $J$, whose behavior is much simpler to analyze than that of the costate vector $\lambda$. In HDP, the cost function needs to be initialized, e.g., $J^0(\cdot) = 0$, and is iteratively updated through the following steps.


$$u^i(x_k) = -R^{-1} g(x_k)^T \frac{\partial J^i(x_{k+1})}{\partial x_{k+1}} \qquad (5.49)$$

$$J^{i+1}(x_k) = \Psi\!\left(x_k,\, u^i(x_k)\right) + J^i\!\left(x_{k+1}^i\right) \qquad (5.50)$$

For the finite-horizon case, the final condition given below is satisfied at every iteration:

$$J^{i+1}(x_N) = \frac{1}{2} x_N^T Q_f x_N \qquad (5.51)$$

Note that $J^0(\cdot) = 0$ and

$$x_{k+1}^i = f(x_k) + g(x_k)\,u^i(x_k) \qquad (5.52)$$
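The recursion (5.49)-(5.51) can be made concrete with a small numerical sketch; the one-dimensional model, grid-based cost representation, horizon length, and weights below are all illustrative assumptions so the iteration is runnable, not the chapter's setup.

```python
import numpy as np

# Sketch of the finite-horizon HDP iteration (5.49)-(5.51) on a 1-D grid;
# the scalar model (f, g) and weights (Q, Qf, R) are hypothetical.

Q, Qf, R = 1.0, 5.0, 0.5
f = lambda x: 0.95 * x + 0.02 * x**3
g = lambda x: 1.0
xs = np.linspace(-1, 1, 201)                    # grid over the state domain

def hdp_step(J_next):
    # One HDP sweep: given J^i(.) on the grid, return the updated cost and
    # control via (5.49)-(5.50).
    dJ = np.gradient(J_next, xs)                # dJ^i/dx on the grid
    dJ_at = lambda x: np.interp(x, xs, dJ)
    u = -g(xs) / R * dJ_at(f(xs))               # eq. (5.49), evaluated approximately
    x1 = f(xs) + g(xs) * u                      # eq. (5.52)
    psi = 0.5 * (Q * xs**2 + R * u**2)
    return psi + np.interp(x1, xs, J_next), u   # eq. (5.50)

# Start from the final condition (5.51): J_N(x) = 0.5*Qf*x^2, then iterate.
J = 0.5 * Qf * xs**2
for _ in range(20):                             # illustrative horizon N = 20
    J, u = hdp_step(J)
```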

The convergence results for the above-mentioned reinforcement learning schemes are given below; their proofs, given in [27], are skipped because of the page constraint. Note that the proofs for the finite-time cases are inspired by the proof of HDP convergence for the infinite-horizon case in [6].

Theorem 1: HDP Convergence. The sequence of $J^i$ iterated through (5.49) to (5.51), in case of $J^0(\cdot) = 0$, converges to the finite-horizon optimal solution.

Theorem 2: DHP Convergence. The sequences $\lambda^i$ and $u^i$ defined by equations (5.44) to (5.46), in case of $\lambda^0(\cdot) = 0$, converge to the finite-horizon optimal solution for the given nonlinear control-affine system.


5.6.3 Numerical Analysis

An orbital maneuver problem is provided to motivate the readers on the use of Finite-SNAC. A spacecraft is supposed to move from a certain orbit to another orbit in a fixed final time. This problem is known as the rendezvous problem in the orbital mechanics literature.

5.6.3.1 Modeling the Orbital Maneuver Problem

Denoting the displacement vector of the center of mass of a rigid spacecraft from the center of the orbital frame in the destination orbit by $[x\ y\ z]^T$, where $x$, $y$, and $z$ are the components of the vector in the orbital frame, the nondimensionalized equations of motion of a spacecraft in the gravity field are given as [32]

$$\ddot{x} - 2\dot{y} - (1+x) = -\frac{1+x}{r^3} + u_x$$
$$\ddot{y} + 2\dot{x} - y = -\frac{y}{r^3} + u_y \qquad (5.53)$$
$$\ddot{z} = -\frac{z}{r^3} + u_z$$

where the dots denote time derivatives; $u_x$, $u_y$, and $u_z$ respectively denote the three components of the nondimensionalized total force exerted on the spacecraft in the $x$, $y$, and $z$ axes, and $r = \sqrt{(1+x)^2 + y^2 + z^2}$.

Selecting the state vector as $X = [x\ y\ z\ \dot{x}\ \dot{y}\ \dot{z}]^T$ and the control vector as $u = [u_x\ u_y\ u_z]^T$, the state equation of the orbital maneuver problem reads:


$$\dot{X} = \begin{bmatrix} \dot{x} \\ \dot{y} \\ \dot{z} \\ 2\dot{y} + (1+x) - \dfrac{1+x}{r^3} \\ -2\dot{x} + y - \dfrac{y}{r^3} \\ -\dfrac{z}{r^3} \end{bmatrix} + \begin{bmatrix} 0_{3\times 3} \\ I_{3\times 3} \end{bmatrix} u \qquad (5.54)$$

where the components of the state vector are denoted by $X_i$, $i = 1, \ldots, 6$, and those of the control vector by $u_j$, $j = 1, 2, 3$. Note that this state equation is in a nonlinear input-affine form, suitable for the Finite-SNAC application.

Now, the problem is to apply optimal control to force the states $X_1, \ldots, X_6$ to go to zero in a pre-determined and fixed final time. Convergence of the states to zero is equivalent to performing the orbital maneuver and arriving at the destination point at the dictated time.

5.6.3.2 Simulation

The nondimensionalized state equation can cover any circular orbit as the rendezvous position. The mean anomalies of both the maneuvering and the destination spacecraft are zero at the start of the maneuver. The specifications of the two orbits are given in Table 5.1.

Table 5.1. Specifications of the source and destination orbits

  Characteristic                            Source Orbit    Destination Orbit
  Semi-major axis                           10,000 km       10,000 km
  Right ascension of the ascending node     15 deg.         0 deg.
  Inclination                               85 deg.         90 deg.
  Eccentricity                              0               0

For the given characteristics, the initial condition will be


[ ] (5.55)

The fixed final time is selected as 3 time units, where the time unit and the reference length follow the nondimensionalization of [32]. Note that this implementation is based on having continuous thrust for actuation. The weight matrices $Q_f$, $Q$, and $R$ are selected through trial and error, with the elements of $Q_f$ selected much higher than those of $Q$ in order to force the error of the states at the final time to be small.

The basis functions selected for the neural network are combinations of polynomials in the state elements and in the time input. Note that the time input of the neural network is the time-to-go, normalized through dividing it by the total time, and its contribution in the basis functions is selected through some trial and error such that the network error is as small as possible.

The total number of basis functions is 96. The network is trained for 200 epochs, with 300 states selected at each iteration to create a mesh over the region of interest, as explained in [27]. For this simulation, the states are selected in such a way that each element belongs to a prescribed interval. Performing the weight update, the weights converged as shown in Fig. 5.6.

Histories of the position elements of the state vector and the control are given in Fig. 5.7. In order to evaluate the performance of the controller, a numerical solution to this problem was calculated through an iterative process for the given initial condition, and the results are depicted using dashed plots in Fig. 5.7, while solid plots denote those of the Finite-SNAC. As seen in the plots, the proposed controller has been able to force the states to converge to the origin in the fixed final time of 3 units, and the Finite-SNAC trajectories are close to those of the optimal open-loop numerical solution. The cost-to-go of the Finite-SNAC turned out to be only 3% more than that of the open-loop optimal solution. While the open-loop solution is only applicable to the preselected initial condition and the fixed time-to-go, the Finite-SNAC solutions are more versatile, as will be shown in the discussions below.

To demonstrate the versatility of the HJB based finite-horizon controller, the same network, without retraining, is used for another maneuver with the same initial condition but with a shorter time-to-go, i.e., 2 time units. The trajectories of the position elements of the state vector and the control histories for the shorter horizon are superimposed on the results of the previous simulation of the Finite-SNAC controller and shown in Fig. 5.8. In the two figures, the solid plots denote the results of the maneuver with a time-to-go of 3 units, and the dashed plots denote those of the maneuver with a time-to-go of 2 time units.

As can be seen in Fig. 5.8, the controller has applied a different control history to the spacecraft to accomplish the same maneuver in a shorter time. This shows that the network has learned the time-dependency of the optimal control in finite-horizon problems.

The developed controller is able to perform well under different initial conditions as long as the resulting state trajectory belongs to the domain on which the network is trained. To evaluate the performance of the controller with different initial conditions, another source orbit is selected with a semi-major axis of 11,000 km, a right ascension


of -15 degrees and an inclination of 95 degrees, which leads to a new set of initial conditions given below.

[ ] (5.56)

The simulation is performed using the same trained network for the new initial condition, and the resulting position trajectory, called maneuver 2, is plotted along with the results from the previous initial conditions, called maneuver 1, in a 3D plot in Fig. 5.9. As seen, the new maneuver is accomplished in the fixed set time as well, confirming the applicability of the controller for different initial conditions.

Figure 5.6. Histories of some of the elements of the weight matrix during the training iterations.

Figure 5.7. Histories of the position elements of the state vector and the applied control

for the time-to-go of 3 units.


Figure 5.8. Histories of the position elements of the state vector and the applied control;

Finite-SNAC with different final times.

Figure 5.9. Three-dimensional trajectories of the maneuvers with different source orbits,

i.e., different initial conditions.

5.7 Conclusions

Efficient neural network structures to solve approximate dynamic programming based control problems were presented in this chapter. Since they eliminate a network from the typical dual-network adaptive critic formulations, computing time is significantly reduced. These structures, called single network adaptive critics (SNAC, J-SNAC), are applicable to a wide variety of engineering problems. Numerical results from representative applications and the underlying simpler structures indicate that they hold great potential for implementation.

Acknowledgements

Support for this study from NASA under Grant No. ARMD NRA NNH07ZEA001N-IRAC1 and from the National Science Foundation is gratefully acknowledged. The views of the authors do not necessarily represent the views of the NSF or NASA.

References

1. Lewis, F., Applied Optimal Control and Estimation, Prentice-Hall, 1992.

2. Bryson, A. E. and Ho, Y. C., Applied Optimal Control, Taylor and Francis, 1975.

3. Werbos, P. J., “Approximate Dynamic Programming for Real-time Control and Neural

Modeling,” Handbook of Intell. Ctrl., Multiscience Press, 1992.

4. White, D. A. and Sofge, D. A., "Applied Learning-Optimal Control for

Manufacturing," in Handbook of Intelligent Control. New York: Van Nostrand Reinhold,

ch. 9, 1992.


5. Si, J., Barto, A. G., Powell, W. B., and Wunsch, D., Handbook of Learning and Approximate Dynamic Programming, IEEE Press Series on Computational Intelligence. New York: Wiley-IEEE Press, Aug. 2004.

6. Al-Tamimi, A., Lewis, F. L., and Abu-Khalaf, M., “Discrete-time Nonlinear HJB Solution Using Approximate Dynamic Programming: Convergence Proof,” IEEE Trans. Syst., Man, Cybern. B, vol. 38(4), pp. 943-949, 2008.

7. Balakrishnan, S. N., Ding, J., and Lewis, F. L., “Issues on Stability of ADP Feedback

Controllers for Dynamical Systems,” IEEE Trans. Syst., Man., Cybern. B, vol. 38(4), pp.

913–917, 2008.

8. Dierks T., Thumati B., and Jagannathan S., “Optimal Control of Unknown Affine

Nonlinear Discrete-time Systems Using Offline-trained Neural Networks with Proof of

Convergence,” Neural Networks, vol. 22, pp. 851-860, 2009.

9. Li, B. and Si, J., “Robust Dynamic Programming for Discounted Infinite-horizon

Markov Decision Processes with Uncertain Stationary Transition Matrices,” Proc. IEEE

Int. Symp. Appr. Dynamic Programming and Reinforcement Learning, Honolulu, HI, pp.

96-102, 2007.

10. Werbos, P.J., “Using ADP to Understand and Replicate Brain Intelligence: the Next

Level Design,” Proc. IEEE Symp. Appr. Dynamic Programming and Reinforcement

Learning, Honolulu, HI, pp. 209-216, 2007.


11. Balakrishnan, S. N. and Biega, V., “Adaptive-Critic Based Neural Networks for

Aircraft Optimal Control,” Journal of Guidance, Control and Dynamics, vol. 19, pp. 893-

898, 1996.

12. Prokhorov, D. and Wunsch, D., “Adaptive Critic Designs,” IEEE Trans. on Neural

Networks, vol. 8, pp.997-1007, 1997.

13. Venayagamoorthy, G., Harley, R., and Wunsch, D., “Dual Heuristic Programming

Excitation Neurocontrol for Generators in a Multimachine Power System,” IEEE Trans.

Ind. Appl., vol. 39, pp. 382-384, 2003.

14. Padhi, R. and Balakrishnan, S. N., “Proper Orthogonal Decomposition based Optimal Neurocontrol Synthesis of a Chemical Reactor Process Using Approximate Dynamic Programming,” Neural Networks, vol. 16(5-6), pp. 719-728, 2003.

15. Ferrari, S. and Stengel, R., “Online Adaptive Critic Flight Control," Journal of

Guidance, Ctrl. and Dynamics, vol. 27(5), pp. 777-786, 2004.

16. Yang, Q. and Jagannathan, S., “Adaptive Critic Neural Network Force Controller for

Atomic Force Microscope-based Nanomanipulation,” Proc. IEEE Int. Symp. Intell. Ctrl.,

pp. 464-469, 2006.

17. Lendaris, G., Schultz, L., Shannon, T., “Adaptive Critic Design for Intelligent

Steering and Speed Control of a 2-axle Vehicle,” Proc. International Joint Conf. on

Neural Networks, Como, Italy, 2000.


18. Hanselmann, T., Noakes, L., and Zaknich, A., “Continuous-Time Adaptive Critics,” IEEE Trans. on Neural Netw., vol. 18(3), pp. 631-647, 2007.

19. Vrabie, D., and Lewis, F., “Adaptive Optimal Control Algorithm for Continuous-

Time Nonlinear Systems Based on Policy Iteration,” Proc. IEEE Conf. on Decision and

Control, Cancun, pp. 73-79, 2008.

20. Vrabie, D., Pastravanu, O., Lewis, F., and Abu-Khalaf, M., “Adaptive Optimal

Control for Continuous-Time Linear Systems Based on Policy Iteration,” Automatica, vol

45 (2), pp. 477-484, 2009.

21. Vamvoudakis, K., and Lewis, F., “Online actor-critic algorithm to solve the

continuous-time infinite horizon optimal control problem,” Automatica, vol 46, pp. 878-

888, 2010.

22. Dierks, T., and Jagannathan, S., “Optimal Control of Affine Nonlinear Continuous-time Systems,” Proc. American Control Conf., Baltimore, MD, pp. 1568-1573, 2010.

23. Padhi, R., and Balakrishnan, S. N., “Optimal Beaver Population Management Using

Reduced Order Distributed Parameter Model and Single Network Adaptive Critics,”

Proc. Amer. Ctrl. Conf., Boston, MA, pp.1598-1603, 2004.

24. Yadav, V., Padhi, R., Balakrishnan, S. N., “Robust/Optimal Temperature Profile

Control Using Neural Networks,” Proc. IEEE International Conf. on Ctrl. Applications,

Munich, Germany, pp.3169-3174, 2006.


25. Padhi, R., Unnikrishnan, N., Wang, X., and Balakrishnan, S., “A Single Network

Adaptive Critic (SNAC) Architecture for Optimal Control Synthesis for a Class of

Nonlinear Systems,” Neural Networks, vol. 19, pp. 1648-1660, Dec. 2006.

26. Ding J., Balakrishnan S. N., Lewis, F. L., "A Cost Function Based Single Network

Adaptive Critic Architecture for Optimal Control Synthesis for a Class of Nonlinear

Systems," Proc. IJCNN, Barcelona, Spain, 2010.

27. Heydari A. and Balakrishnan S. N., "Finite-Horizon Input-Constrained Nonlinear

Optimal Control Using Single Network Adaptive Critics," Proc. American Control

Conference, 2011.

28. Han, D. and Balakrishnan, S. N., “State-constrained agile missile control with

adaptive-critic-based Neural Networks,” IEEE Trans. on Control Systems Technology,

vol. 10 (4), pp. 481-489, 2002.

29. Cheng, T., Lewis, F. L., and Abu-Khalaf, M., “A neural network solution for fixed-

final time optimal control of nonlinear systems,” Automatica, vol. 43, pp. 482-490, 2007.

30. Gupta, S. K., “Numerical Methods for Engineers,” Wiley Eastern Ltd. and New Age

International Ltd., 1995.

31. Slotine, J.-J.E. and Li,W., Applied Nonlinear Control, Prentice-Hall, 1991.

32. Park C., Guibout V., and Scheeres D., "Solving optimal continuous thrust rendezvous

problems with generating functions," Journal of Guidance, Control, and Dynamics, vol.

29, no. 2, pp.321-331, 2006.