Reinforcement Learning and Approximate Dynamic Programming for Feedback Control,
Edited by F.L. Lewis and D. Liu.
Chapter 5
Single Network Adaptive Critics
Networks - Development, Analysis and
Applications
Jie Ding, Ali Heydari, and S.N. Balakrishnan
Department of Mechanical and Aerospace Engineering
Missouri University of Science and Technology
5.1 Abstract
Solving infinite-time optimal control problems in an approximate dynamic programming framework with a two-network structure has become popular in recent years. In this chapter, an alternative to the two-network structure is provided. We develop single network adaptive critics (SNAC), which eliminate the need for a separate action network to output the control. Two versions of SNAC are presented: the first version, called SNAC, outputs the costates, and the second version, called J-SNAC, outputs the cost function values. A special structure called Finite-SNAC that efficiently solves finite-time problems is also presented. Illustrative infinite-time and finite-time problems are considered; numerical results clearly demonstrate the potential of the single network structures to solve optimal control problems.
5.2 Introduction
There are very few open-loop control based systems in practice. Feedback or closed-loop control is desired as a hedge against the noise that systems encounter in their operations and against modeling uncertainties. Optimal control based formulations have been shown to yield desirable stability margins in linear systems. For linear or nonlinear systems, dynamic programming formulations offer the most comprehensive solutions for obtaining feedback control [1], [2]. For nonlinear systems, however, solving the associated Hamilton-Jacobi-Bellman (HJB) equation is well-nigh impossible due to the associated number of computations and storage requirements. Werbos introduced the concept of Adaptive Critics (AC) [3], a dual neural network structure to solve an 'approximate dynamic programming' (ADP) formulation which uses the discrete version of the HJB equation.

Many researchers have embraced and researched the enormous potential of AC based formulations over the last two decades [4]-[10]. Most of the model based adaptive critic controllers [11]-[13] and others [4], [5] were targeted at systems described by ordinary differential equations. Ref. [14] extended the realm of applicability of the AC design to chemical systems characterized by distributed parameter models. The authors of [15] formulated a global adaptive critic controller for a business jet. The authors of [16] applied an adaptive critic based controller in an atomic force microscope based force controller to push nano-particles on substrates. Ref. [17] showed in simulations that ADP can prevent cars from skidding when driving over unexpected patches of ice.
There are many variants of AC designs [12], of which the most popular are the Heuristic Dynamic Programming (HDP) and the Dual Heuristic Programming (DHP) architectures. In the HDP formulation, one neural network, called the 'critic' network, maps the input states to the optimal cost, and another network, called the 'action' network, outputs the optimal control with the states of the system as its inputs [12]. In the DHP formulation, while the action network remains the same as in HDP, the critic network outputs the optimal costate vector with the current states as inputs [11], [15]. Note that these AC designs are formulated in a discrete-time framework. Ref. [18] considered the use of continuous-time adaptive critics. More recently, some authors have pursued continuous-time controls [19]-[22]. In [19] and [20], the suggested algorithms for the online optimal control of continuous systems are based on sequential updates of the actor and critic networks.
Many nice theoretical developments have taken place in the last few years, primarily due to the groups led by Lewis and Sarangapani, which have made reinforcement learning based AC controllers acceptable for mainstream evaluation. However, there has not been much work on alleviating the computational load of the AC paradigms from a practical standpoint. Balakrishnan's group has been working from this perspective and formulated the Single Network Adaptive Critic designs. The authors of [23]-[25] have proposed a Single Network Adaptive Critic (SNAC) related to DHP designs. SNAC eliminates the use of one neural network (namely the action network) that is part of a typical dual network AC setup. As a consequence, the SNAC architecture offers three advantages: a simpler architecture for implementation, a lower computational load for training and recall, and elimination of the approximation error associated with the eliminated network. In [25], it is shown through comparison studies that SNAC costs about half the computation time of a typical dual network AC structure for the same problem.
SNAC is applicable to a wide class of nonlinear systems where the optimal control equation can be explicitly expressed in terms of the state and costate vectors. Most of the problems in aerospace, automobile, robotics, and other engineering disciplines can be characterized by nonlinear control-affine equations that yield such a relation. SNAC based controllers have yielded excellent tracking results in micro-electromechanical systems and chemical reactor applications [25], [14]. The authors of [25] have proved that for linear systems the SNAC converges to the discrete Riccati equation solution.
Motivated by the simplicity of the training and implementation phases resulting from SNAC, Ref. [26] developed an HDP based SNAC scheme. In this cost function based single network adaptive critic, or J-SNAC, the critic network outputs the cost instead of the costate vector. There are some advantages to the J-SNAC formulation. First, a smaller neural network is needed for J-SNAC because the output is only a scalar, while in SNAC the output is the costate vector, which has as many elements as the order of the system. The smaller network size results in less training and recall effort in J-SNAC. Second, in J-SNAC the output is the optimal cost, which has some physical meaning for the designer, as opposed to the costate vector in SNAC.
Note that the developments introduced above have mainly addressed only infinite-horizon problems. Finite-horizon optimal control, on the other hand, is an important branch of optimal control theory. Solutions to the resulting time-varying HJB equation are usually very difficult to obtain. In the finite-horizon case, the optimal cost-to-go is not only a function of the current states but also a function of how much time is left (time-to-go) to accomplish the goal. There is hardly any work in the neural network literature on solving this class of problems [28], [29].
To deal with finite-horizon problems, a single neural network based solution, called Finite-SNAC, was developed in [27], which embeds solutions to the time-varying HJB equation. Finite-SNAC is a DHP based neural network controller for finite-horizon optimal control of discrete-time input-affine nonlinear systems and is developed based on the idea of using a single set of weights for the network. Finite-SNAC takes the current time as well as the states as inputs to the network. This feature results in much less required storage memory compared to other published methods [28], [29]. Note that, once trained, Finite-SNAC provides optimal feedback solutions for any final time, as long as it is less than the final time for which the network is synthesized.
The rest of this chapter is organized as follows: the approximate dynamic programming formulation is discussed in Section 5.3, followed by the presentation of the SNAC architecture for infinite-horizon problems in Section 5.4; the J-SNAC architecture and simulation results are presented in Section 5.5; and the Finite-SNAC architecture for finite-time optimal control problems and its simulation results are presented in Section 5.6.
5.3 Approximate Dynamic Programming
The dynamic programming formulation offers a comprehensive optimal control solution in a state feedback form; however, it is handicapped by computational and storage requirements. The approximate dynamic programming (ADP) formulation implemented with an Adaptive Critic (AC) neural network structure has evolved as a powerful alternative technique that obviates the need for excessive computations and storage in solving optimal control problems.
In this section, the principles of ADP are described. An interested reader can find more details about the derivations in [11] and [3]. Note that a prime requirement for applying ADP is to formulate the problem in discrete time. The control designer has the freedom to use any appropriate discretization scheme; for example, one can use the Euler approximation for the state equation and the trapezoidal approximation for the cost function [30].
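For concreteness, the sketch below shows what such a discretization might look like in Python; the function names and the scalar example system are illustrative assumptions, not taken from the chapter.

```python
def euler_discretize(f_ct, dt):
    """Euler approximation of dx/dt = f_ct(x, u): x_{k+1} = x_k + dt * f_ct(x_k, u_k)."""
    def F(x, u):
        return x + dt * f_ct(x, u)
    return F

def trapezoidal_step_cost(psi_ct, dt):
    """Trapezoidal approximation of the integral of the running cost over one step."""
    def Psi(x, u, x_next, u_next):
        return 0.5 * dt * (psi_ct(x, u) + psi_ct(x_next, u_next))
    return Psi

# Example: scalar linear system dx/dt = -x + u with a quadratic running cost.
F = euler_discretize(lambda x, u: -x + u, dt=0.01)
Psi = trapezoidal_step_cost(lambda x, u: 0.5 * (x**2 + u**2), dt=0.01)
```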
In a discrete-time formulation, the objective is to find an admissible control $u_k$ which causes the system given by

$x_{k+1} = F(x_k, u_k)$   (5.1)

to follow an admissible trajectory from an initial point $x_0$ to a final desired point $x_f$ while minimizing a desired cost function $J$ given by

$J = \sum_{k=0}^{N-1} \Psi_k(x_k, u_k)$   (5.2)

where $x_k \in \mathbb{R}^n$ and $u_k \in \mathbb{R}^m$ denote the state vector and the control at time step $k$, respectively, $n$ is the order of the system, and $m$ is the system's number of inputs. The functions $F$ and $\Psi_k$ are assumed to be differentiable with respect to both $x_k$ and $u_k$. Moreover, $\Psi_k$ is assumed to be convex (e.g., a quadratic function in $x_k$ and $u_k$). One can notice that as $N \to \infty$, this cost function leads to a regulator (infinite-time) problem.
The steps in obtaining optimal control are now described. First, the cost function (5.2) is rewritten to start from time step $k$ as

$J_k = \sum_{\bar{k}=k}^{N-1} \Psi_{\bar{k}}(x_{\bar{k}}, u_{\bar{k}})$   (5.3)

The cost $J_k$ can be split into

$J_k = \Psi_k + J_{k+1}$   (5.4)

where $\Psi_k$ and $J_{k+1} = \sum_{\bar{k}=k+1}^{N-1} \Psi_{\bar{k}}$ represent the 'utility function' at time step $k$ and the cost-to-go from time step $k+1$ to $N$, respectively.

The costate vector at time step $k$ is defined as

$\lambda_k = \dfrac{\partial J_k}{\partial x_k}$   (5.5)

The necessary condition for optimality is given by

$\dfrac{\partial J_k}{\partial u_k} = 0$   (5.6)
Equation (5.6) can be further expanded as

$\dfrac{\partial J_k}{\partial u_k} = \dfrac{\partial \Psi_k}{\partial u_k} + \left(\dfrac{\partial x_{k+1}}{\partial u_k}\right)^T \dfrac{\partial J_{k+1}}{\partial x_{k+1}} = 0$   (5.7)

The optimal control equation can, therefore, be written as

$\dfrac{\partial \Psi_k}{\partial u_k} + \left(\dfrac{\partial x_{k+1}}{\partial u_k}\right)^T \lambda_{k+1} = 0$   (5.8)

The costate equation is derived in the following way:

$\lambda_k = \dfrac{\partial J_k}{\partial x_k} = \dfrac{\partial \Psi_k}{\partial x_k} + \left(\dfrac{\partial x_{k+1}}{\partial x_k}\right)^T \lambda_{k+1}$   (5.9)

In order to synthesize an optimal controller, (5.1), (5.8), and (5.9) have to be solved simultaneously, along with appropriate boundary conditions. For regulator problems, the boundary conditions usually take the form: $x_0$ is fixed and $x_k \to 0$ as $k \to \infty$. If the state equation and cost function are such that one can obtain an explicit relationship for the control variable in terms of the state and costate variables from equation (5.8), the ADP formulation can be solved through SNAC. Note that control-affine nonlinear systems (of the form $x_{k+1} = f(x_k) + g(x_k)u_k$) with a quadratic cost function (of the form $J = \sum_{k=0}^{N-1} \frac{1}{2}\left(x_k^T Q x_k + u_k^T R u_k\right)$) fall under this class.
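To see why this class admits such an explicit relationship, note that for $\Psi_k = \frac{1}{2}(x_k^T Q x_k + u_k^T R u_k)$ and $x_{k+1} = f(x_k) + g(x_k)u_k$, one has $\partial \Psi_k/\partial u_k = R u_k$ and $\partial x_{k+1}/\partial u_k = g(x_k)$, so equation (5.8) becomes $R u_k + g(x_k)^T \lambda_{k+1} = 0$, which is solved explicitly by $u_k = -R^{-1} g(x_k)^T \lambda_{k+1}$; this is precisely the control expression used in the SNAC design below.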
5.4 SNAC
In this section, an improvement and modification to the AC architecture, called the "Single Network Adaptive Critic (SNAC)", related to the DHP designs, is presented. In the SNAC design, the critic network captures the functional relationship between the state $x_k$ and the optimal costate $\lambda_{k+1}$. Denoting the neural network functional mapping by $\Lambda(\cdot)$, one has

$\lambda_{k+1} = \Lambda(x_k)$   (5.10)

For the input-affine system and the quadratic cost function described below, once $\lambda_{k+1}$ is calculated, one can generate the optimal control through equation (5.13):

$x_{k+1} = f(x_k) + g(x_k)\,u_k$   (5.11)

$J = \sum_{k=0}^{\infty} \tfrac{1}{2}\left(x_k^T Q x_k + u_k^T R u_k\right)$   (5.12)

$u_k = -R^{-1} g(x_k)^T \lambda_{k+1}$   (5.13)

Note that, for this case, equation (5.9) reads

$\lambda_{k+1} = Q x_{k+1} + \left(\dfrac{\partial\left(f(x_{k+1}) + g(x_{k+1})u_{k+1}\right)}{\partial x_{k+1}}\right)^T \lambda_{k+2}$   (5.14)
5.4.1 State Generation for Neural Network Training
State generation is an important part of the training process for the SNAC outlined in the next subsection. For this purpose, define $S = \{x_k : x_k \in \Omega\}$, where $\Omega$ denotes the domain in which the system operates. This domain is so chosen that its elements cover a large number of points of the state space in which the state trajectories are expected to lie. For a systematic training scheme, a 'telescopic method' is arrived at as follows:

For $i = 1, 2, \dots, I$, define the set $S_i = \{x_k : \|x_k\| \le c_i\}$, where $c_i$ is a positive constant. At the beginning, a small $c_1$ is fixed and the network is trained with the states generated in $S_1$. After the network converges (the convergence condition will be discussed in the next subsection), $c_2$ is chosen such that $c_2 > c_1$. Then the network is trained again for states within $S_2$, and so on. The network training is continued until $i = I$, where $S_I$ covers the domain of interest $\Omega$.
5.4.2 Neural Network Training
The steps for training the SNAC network (see Figure 5.1) are as follows; a Python sketch of one training sweep is given after the list:

1) Generate $S_i$. For each element $x_k$ of $S_i$, follow the steps below:
   a. Input $x_k$ to the critic network to obtain $\lambda_{k+1}$.
   b. Calculate $u_k$ from the optimal control equation (5.13) with $x_k$ and $\lambda_{k+1}$.
   c. Get $x_{k+1}$ from the state equation (5.11) using $x_k$ and $u_k$.
   d. Input $x_{k+1}$ to the critic network to get $\lambda_{k+2}$.
   e. Using $x_{k+1}$ and $\lambda_{k+2}$, calculate the target $\lambda^t_{k+1}$ from the costate equation (5.14).
2) Train the critic network for all $x_k$ in $S_i$, the target being $\lambda^t_{k+1}$.
3) Check the convergence of the network. If it has converged, revert to step 1 with $i+1$ until $i = I$; otherwise, repeat steps 1-2.
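For concreteness, the following is a minimal sketch of one such training sweep, assuming a linear-in-the-weights critic $\lambda_{k+1} = W^T\phi(x_k)$ and a batch least-squares fit; the helper names (phi, f, g, dFdx) are illustrative assumptions, and the chapter itself does not prescribe a specific training law here.

```python
import numpy as np

def snac_sweep(W, phi, f, g, dFdx, Q, R, X_train):
    """One SNAC training sweep over a batch of states X_train.

    Critic model (assumed): lambda_{k+1} = W.T @ phi(x_k), with W of shape
    (num_basis, n).  dFdx(x, u) is the Jacobian of f(x) + g(x) u w.r.t. x.
    """
    R_inv = np.linalg.inv(R)
    basis_rows, target_rows = [], []
    for x_k in X_train:
        lam_kp1 = W.T @ phi(x_k)                    # step (a): critic output
        u_k = -R_inv @ g(x_k).T @ lam_kp1           # step (b): control eq. (5.13)
        x_kp1 = f(x_k) + g(x_k) @ u_k               # step (c): state eq. (5.11)
        lam_kp2 = W.T @ phi(x_kp1)                  # step (d): critic at x_{k+1}
        u_kp1 = -R_inv @ g(x_kp1).T @ lam_kp2
        # step (e): target from the costate equation (5.14)
        lam_target = Q @ x_kp1 + dFdx(x_kp1, u_kp1).T @ lam_kp2
        basis_rows.append(phi(x_k))
        target_rows.append(lam_target)
    # step 2: least-squares fit of the critic weights to the targets
    W_new, *_ = np.linalg.lstsq(np.array(basis_rows), np.array(target_rows),
                                rcond=None)
    return W_new
```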
Figure 5.1. SNAC Training Scheme
5.4.3 Convergence Condition
In order to check the convergence of the critic network, a set of new states and the corresponding target outputs are generated as described in the previous subsection. Let the target outputs be $\lambda^t_{k+1,i}$ and the outputs of the trained network (using the same inputs from the check set) be $\lambda^a_{k+1,i}$. A tolerance value, $tol_c$, is used as the convergence criterion for the critic network. Defining the relative errors $e_i = \left\|\lambda^t_{k+1,i} - \lambda^a_{k+1,i}\right\| / \left\|\lambda^t_{k+1,i}\right\|$ and the error vector $e = [e_1, e_2, \dots]^T$, the training process is stopped when $\|e\| < tol_c$.
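A sketch of this stopping test, under the same illustrative critic parameterization used above:

```python
import numpy as np

def critic_converged(W, phi, X_check, targets, tol_c):
    """Relative-error convergence test on a fresh set of check states."""
    errors = [np.linalg.norm(t - W.T @ phi(x)) / np.linalg.norm(t)
              for x, t in zip(X_check, targets)]
    return np.linalg.norm(errors) < tol_c
```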
5.5 J-SNAC

In this section, the cost function based single network adaptive critic, called J-SNAC, is presented. In J-SNAC, the critic network outputs the cost instead of the costate as in SNAC. This approach is applicable to the class of nonlinear systems of the form $x_{k+1} = f(x_k) + B u_k$ with a constant input matrix $B$. As mentioned in the introduction, the J-SNAC technique retains all the powerful features of the AC methodology while eliminating the action network completely. In the J-SNAC design, the critic network captures the functional relationship between the state $x_k$ and the optimal cost $J_k$. Denoting the functional mapping of J-SNAC by $J(\cdot)$, one has

$J_k = J(x_k)$   (5.15)

One can calculate $\lambda_k$ through $\lambda_k = \partial J(x_k)/\partial x_k$ and rewrite the costate equation (5.14), for the quadratic cost function (5.12), in the following form:

$\lambda_{k+1} = \left(\dfrac{\partial f(x_k)}{\partial x_k}\right)^{-T}\left(\lambda_k - Q x_k\right)$   (5.16)

The optimal control can now be calculated as

$u_k = -R^{-1} B^T \lambda_{k+1} = -R^{-1} B^T \left(\dfrac{\partial f(x_k)}{\partial x_k}\right)^{-T}\left(\lambda_k - Q x_k\right)$   (5.17)
5.5.1 Neural Network Training
Using a similar state generation and convergence check as discussed in the SNAC training procedure, the steps for training the J-SNAC network are as follows (Fig. 5.2); a Python sketch of one training sweep is given after the list:

1) Generate $S_i$. For each element $x_k$ of $S_i$, follow the steps:
   a. Input $x_k$ to the critic network to obtain $J_k$.
   b. Calculate $\lambda_k = \partial J_k/\partial x_k$ and then $\lambda_{k+1}$ using equation (5.16).
   c. Calculate $u_k$ from equation (5.17).
   d. Use $x_k$ and $u_k$ to get $x_{k+1}$ from the state equation.
   e. Input $x_{k+1}$ to the critic network to get $J_{k+1}$.
   f. Use $J_{k+1}$, $x_k$, and $u_k$ to calculate the target $J^t_k = \Psi_k + J_{k+1}$ using equation (5.4).
2) Train the critic network for all $x_k$ in $S_i$ with the target output being $J^t_k$.
3) Check the convergence of the critic network. If it has converged, revert to step 1 with $i+1$ until $i = I$; otherwise, repeat steps 1-2.
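A minimal sketch of one J-SNAC sweep follows, assuming the scalar critic $J(x) = W^T\phi(x)$ with a differentiable basis; grad_phi, f, dfdx, and the least-squares fit are illustrative assumptions.

```python
import numpy as np

def jsnac_sweep(W, phi, grad_phi, f, dfdx, B, Q, R, X_train):
    """One J-SNAC training sweep.  Critic (assumed): J(x) = W @ phi(x), a scalar.

    grad_phi(x) is the (num_basis x n) Jacobian of phi, so that
    lambda = dJ/dx = grad_phi(x).T @ W.  dfdx(x) is the Jacobian of f.
    """
    R_inv = np.linalg.inv(R)
    basis_rows, targets = [], []
    for x_k in X_train:
        lam_k = grad_phi(x_k).T @ W                    # step (b): lambda_k
        lam_kp1 = np.linalg.solve(dfdx(x_k).T,         # costate eq. (5.16)
                                  lam_k - Q @ x_k)
        u_k = -R_inv @ B.T @ lam_kp1                   # step (c): eq. (5.17)
        x_kp1 = f(x_k) + B @ u_k                       # step (d): state equation
        J_kp1 = W @ phi(x_kp1)                         # step (e): critic at x_{k+1}
        Psi_k = 0.5 * (x_k @ Q @ x_k + u_k @ R @ u_k)  # utility, eq. (5.12)
        targets.append(Psi_k + J_kp1)                  # step (f): Bellman target (5.4)
        basis_rows.append(phi(x_k))
    W_new, *_ = np.linalg.lstsq(np.array(basis_rows), np.array(targets),
                                rcond=None)
    return W_new
```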
Figure 5.2. J-SNAC Training Scheme.
5.5.2 Numerical Analysis
For illustrative purposes, a satellite attitude control problem is selected.
5.5.2.1 Modeling the Attitude Control Problem
Consider a rigid spacecraft controlled by reaction wheels. It is assumed that the control
torques are applied through a set of reaction wheels along three orthogonal axes. The
spacecraft rotational motion is described by [31]

$I\dot{\omega} = p \times \omega + \tau$   (5.18)

where $I$ is the moment-of-inertia matrix, $\omega$ is the angular velocity of the body frame with respect to the inertial frame, $p$ is the total spacecraft angular momentum expressed in the spacecraft body frame, and $\tau$ is the torque applied to the spacecraft by the reaction wheels.

Using the Euler angles $\phi$, $\theta$, and $\psi$, the kinematic equation describing the attitude of the spacecraft may be written as

$[\dot{\phi}\;\;\dot{\theta}\;\;\dot{\psi}]^T = R(\phi,\theta)\,\omega$   (5.19)

where

$R(\phi,\theta) = \begin{bmatrix} 1 & \sin\phi\tan\theta & \cos\phi\tan\theta \\ 0 & \cos\phi & -\sin\phi \\ 0 & \sin\phi\sec\theta & \cos\phi\sec\theta \end{bmatrix}$   (5.20)

The total spacecraft angular momentum $p$ is written as

$p = C(\phi,\theta,\psi)\,p_I$   (5.21)

where $p_I = [p_{I1}\;\;p_{I2}\;\;p_{I3}]^T$ is the (constant) inertial angular momentum and

$C(\phi,\theta,\psi) = \begin{bmatrix} \cos\theta\cos\psi & \cos\theta\sin\psi & -\sin\theta \\ \sin\phi\sin\theta\cos\psi - \cos\phi\sin\psi & \sin\phi\sin\theta\sin\psi + \cos\phi\cos\psi & \sin\phi\cos\theta \\ \cos\phi\sin\theta\cos\psi + \sin\phi\sin\psi & \cos\phi\sin\theta\sin\psi - \sin\phi\cos\psi & \cos\phi\cos\theta \end{bmatrix}$   (5.22)
Choosing $x = [\phi\;\;\theta\;\;\psi\;\;\omega_1\;\;\omega_2\;\;\omega_3]^T$ as the states and $u = [\tau_1\;\;\tau_2\;\;\tau_3]^T$ as the controls, the dynamics of the system, i.e., (5.18) and (5.19), can be rewritten as

$\dot{x} = \begin{bmatrix} R(\phi,\theta)\,\omega \\ I^{-1}(p \times \omega) \end{bmatrix} + \begin{bmatrix} 0_{3\times 3} \\ I^{-1} \end{bmatrix} u$   (5.23)

The control objective is to drive all the states to zero as $t \to \infty$. A quadratic cost function $J$ is selected as

$J = \tfrac{1}{2}\int_0^\infty \left(x^T Q x + u^T R u\right) dt$   (5.24)

where $Q$ is a positive semi-definite matrix and $R$ is a positive definite matrix for penalizing the states and controls, respectively.

Denoting the time step by $\Delta t$, the state equation is discretized as

$x_{k+1} = x_k + \Delta t\left(f(x_k) + B u_k\right)$   (5.25)

where $f(x)$ and $B$ are given by the two terms of the state equation (5.23). The quadratic cost function (5.24) is also discretized as

$J = \tfrac{1}{2}\sum_{k=0}^{\infty} \Delta t\left(x_k^T Q x_k + u_k^T R u_k\right)$   (5.26)

The optimality condition leads to the control equation

$u_k = -R^{-1} B^T \left(I_{6\times 6} + \Delta t\,\dfrac{\partial f(x_k)}{\partial x_k}\right)^{-T}\left(\lambda_k - \Delta t\,Q x_k\right)$   (5.27)
5.5.2.2 Simulation
In the numerical simulations, the weighting matrices $Q$ and $R$ are selected such that the Euler angle elements of the state are penalized through $Q$. Note that regulating the Euler angles will automatically regulate the angular rates as well; hence, penalizing the first three elements of the state vector through the selected $Q$ will guarantee the regulation of the whole state vector. The inertia matrix $I$ is selected as an identity matrix.

A single-layer neural network of the form $J(x) = W^T\phi(x)$ is selected, where $W$ denotes the neural network weights and $\phi(\cdot)$ denotes the basis functions. The network weights are initialized to zero, and the basis functions $\phi(x)$ are selected as the second-order polynomial terms

$\phi(x) = [x_1^2,\; x_1 x_2,\; \dots,\; x_5 x_6,\; x_6^2]^T$   (5.28)

where $x_i$ denotes the $i$th element of the state vector $x$. The simulations are run from a selected nonzero initial condition $x_0$.
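Assuming the second-order monomial form reconstructed in (5.28), such a basis can be generated programmatically; the helper below is illustrative.

```python
import numpy as np
from itertools import combinations_with_replacement

def quadratic_basis(x):
    """All second-order monomials x_i * x_j (i <= j) of the state vector."""
    return np.array([x[i] * x[j]
                     for i, j in combinations_with_replacement(range(len(x)), 2)])

# For the 6-state attitude problem this yields 21 basis functions.
```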
Histories of the Euler angles and rotation rates are shown in Fig. 5.3. It can be seen that all the states go to zero within 5 seconds. Moreover, as seen in Fig. 5.4, which shows the history of the applied controls, the goal is achieved by applying bounded controls about the three axes of the spacecraft.
Figure 5.3. Histories of angles and body rates.
Figure 5.4. Histories of applied controls.
5.6 Finite-SNAC
Finite-SNAC solves the finite-horizon optimal control of nonlinear input-affine systems based on a DHP formulation. The discrete-time nonlinear input-affine system and the quadratic cost function to be minimized are given below:

$x_{k+1} = f(x_k) + g(x_k)\,u_k$   (5.29)

$J = \tfrac{1}{2}\,x_N^T Q_f x_N + \tfrac{1}{2}\sum_{k=0}^{N-1}\left(x_k^T Q x_k + u_k^T R u_k\right)$   (5.30)

where $x_k \in \mathbb{R}^n$ and $u_k \in \mathbb{R}^m$ denote the state vector and the control at time step $k$, respectively, $n$ is the order of the system, and $m$ is the number of inputs. $f(\cdot)$ and $g(\cdot)$ describe the system dynamics, and $Q_f$, $Q$, and $R$ are the weighting matrices for the final states, the states, and the control vectors, respectively. $Q_f$ and $Q$ are positive semi-definite matrices and $R$ is a positive definite matrix. Finally, $N$ is the total (fixed) number of time steps, and the superscript $T$ denotes transposition.

Denoting the neural network mapping by $\Lambda(\cdot,\cdot)$, a single neural network called Finite-SNAC is developed to output the desired costate vector based on the state vector and the time-to-go as inputs:

$\lambda_{k+1} = \Lambda(x_k, N-k)$,   $0 \le k \le N-1$   (5.31)

where $\lambda_{k+1}$ is the system costate vector at time step $k+1$ and $W$ denotes the network weight matrix.

The neural network in this section is selected to be linear in the weights:

$\Lambda(x_k, N-k) = W^T \phi(x_k, N-k)$   (5.32)

where $\phi(\cdot,\cdot)$ is composed of linearly independent scalar basis functions, the number of which equals the number of neurons.
Following the ADP framework discussed in Section 5.3, the costate equation is

$\lambda_{k+1} = Q x_{k+1} + \left(\dfrac{\partial\left(f(x_{k+1}) + g(x_{k+1})u_{k+1}\right)}{\partial x_{k+1}}\right)^T \lambda_{k+2}$,   $0 \le k \le N-2$   (5.33)

Note that since

$J_N = \tfrac{1}{2}\,x_N^T Q_f x_N$   (5.34)

the final condition on the costate vector is

$\lambda_N = Q_f x_N$   (5.35)

The network training targets, denoted by $\lambda^t$, can be calculated using the following two equations:

$\lambda^t_{k+1} = Q x_{k+1} + \left(\dfrac{\partial\left(f(x_{k+1}) + g(x_{k+1})u_{k+1}\right)}{\partial x_{k+1}}\right)^T \lambda_{k+2}$,   $0 \le k \le N-2$   (5.36)

$\lambda^t_N = Q_f x_N$   (5.37)

In the training process, $\lambda_{k+2}$ on the right hand side of (5.36) is substituted by $W^T\phi(x_{k+1}, N-k-1)$, as described earlier for the SNAC training process.

Once the network is trained, it can be used for optimal feedback control in the sense that, in the online implementation, the states and the time-to-go are fed into the network to generate the optimal costate vector, and the optimal control is calculated through (5.38):

$u_k = -R^{-1} g(x_k)^T \lambda_{k+1}$,   $0 \le k \le N-1$   (5.38)
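In implementation terms, the online step amounts to one network evaluation followed by (5.38); the following is a minimal sketch under the same linear-in-the-weights assumption as (5.32), with illustrative names.

```python
def finite_snac_control(W, phi, g, R_inv, x_k, k, N):
    """Online Finite-SNAC feedback: state and time-to-go in, control out."""
    lam_kp1 = W.T @ phi(x_k, N - k)     # trained network, eqs. (5.31)-(5.32)
    return -R_inv @ g(x_k).T @ lam_kp1  # optimal control, eq. (5.38)
```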
5.6.1 Neural Network Training
The Finite-SNAC training should be done in such a way that, along with learning the targets given in (5.36) for every state $x_k$ and time $k$, the final condition (5.37) is also satisfied. The authors suggest augmenting the training input-target pairs in such a way that the final condition is forced to be met in each learning iteration. To do so, one can define the following augmented parameters:

$\bar{X} = [x_k \;\; x_{N-1}]$   (5.39)

$\bar{\phi} = [\phi(x_k, N-k) \;\; \phi(x_{N-1}, 1)]$   (5.40)

Now, the network output and the target to be learned are

$\bar{\lambda} = W^T \bar{\phi}$   (5.41)

$\bar{\lambda}^t = [\lambda^t_{k+1} \;\; \lambda^t_N]$   (5.42)

The training error is given by

$e = \bar{\lambda}^t - \bar{\lambda}$   (5.43)
Now, in each iteration, along with selecting a random state $x_k$, a random time $k$, $0 \le k \le N-2$, is also selected. Feeding $x_k$ and $N-k$ into (5.31) results in a costate vector $\lambda_{k+1}$, which is used for calculating $u_k$ through (5.38). Having $x_k$ and $u_k$, one can propagate to $x_{k+1}$ using (5.29). Having fed $x_{k+1}$ and $N-k-1$ into (5.31) once more, $\lambda_{k+2}$ results, which is the desired unknown needed for the target calculation through (5.36). Then, to calculate $\lambda^t_N$ through (5.37), the randomly selected state is considered as $x_{N-1}$ and propagated to $x_N$ using a process similar to that discussed above for propagating $x_k$ to $x_{k+1}$, and the result is fed into (5.37). Finally, $\bar{\lambda}^t$ is formed using (5.42). This process is depicted graphically in Fig. 5.5; in this diagram, the left column follows (5.36) and the right column follows (5.37) for the target calculation.

Having the input-target pair $\{\bar{\phi}, \bar{\lambda}^t\}$ calculated, the network can be trained using any training method. The selected training law in this study is least squares. A Python sketch of one such training iteration is given below.
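The sketch builds one augmented input-target pair following the left and right columns of Fig. 5.5; the function names (phi, f, g, dFdx) and array shapes are assumptions, and the actual least-squares weight update over a batch of such pairs is not shown.

```python
import numpy as np

def finite_snac_pair(W, phi, f, g, dFdx, Q, Qf, R_inv, x, k, N):
    """One augmented training pair: costate target (5.36) plus final-time
    target (5.37).  phi(x, t_go) is the time-dependent basis; dFdx(x, u) is
    the Jacobian of f(x) + g(x) u with respect to x."""
    # Left column of Fig. 5.5: interior target via eq. (5.36).
    lam_kp1 = W.T @ phi(x, N - k)
    u_k = -R_inv @ g(x).T @ lam_kp1
    x_kp1 = f(x) + g(x) @ u_k
    lam_kp2 = W.T @ phi(x_kp1, N - k - 1)         # substitutes lambda_{k+2}
    u_kp1 = -R_inv @ g(x_kp1).T @ lam_kp2
    lam_t = Q @ x_kp1 + dFdx(x_kp1, u_kp1).T @ lam_kp2
    # Right column of Fig. 5.5: treat x as x_{N-1}, propagate, apply (5.37).
    lam_N = W.T @ phi(x, 1)
    u_Nm1 = -R_inv @ g(x).T @ lam_N
    x_N = f(x) + g(x) @ u_Nm1
    lam_tN = Qf @ x_N
    # Augmented basis and target, eqs. (5.40) and (5.42).
    phi_aug = np.column_stack([phi(x, N - k), phi(x, 1)])
    target_aug = np.column_stack([lam_t, lam_tN])
    return phi_aug, target_aug
```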
Figure 5.5. Finite-SNAC Training Scheme.
5.6.2 Convergence Theorems
The proposed algorithm for the Finite-SNAC training is based on the DHP scheme, in which, starting with an initial value for the costate vector, one iterates to converge to the optimal costate. Denoting the iteration index by a superscript and the time index by a subscript, the learning algorithm for finite-horizon optimal control starts with an initial value assignment to $\lambda_k^0$ for all $k$'s, e.g. $\lambda_k^0 = 0$, and repeats the three calculations below for $i$ from zero to infinity:

$u_k^i = -R^{-1} g(x_k)^T \lambda_{k+1}^i$   (5.44)

$\lambda_{k+1}^{i+1} = Q x_{k+1} + \left(\dfrac{\partial x_{k+2}}{\partial x_{k+1}}\right)^T \lambda_{k+2}^i$   (5.45)

$\lambda_N^{i+1} = Q_f x_N$   (5.46)

The last equation is actually the final condition of the optimal control problem. Note that

$x_{k+1} = f(x_k) + g(x_k)\,u_k^i$   (5.47)

$x_{k+2} = f(x_{k+1}) + g(x_{k+1})\,u_{k+1}^i$   (5.48)

The problem is to prove that this iterative procedure results in the optimal values for the costate $\lambda$ and control $u$. The convergence proof presented in [27] is based on the convergence of HDP, in which the parameter subject to evolution is the cost function $J$, whose behavior is much simpler to analyze than that of the costate vector $\lambda$. In HDP, the cost function needs to be initialized, e.g. $J^0(x_k) = 0$, and is iteratively updated through the following steps:
$u_k^i = \arg\min_{u_k}\left(\Psi(x_k, u_k) + J^i(x_{k+1})\right)$   (5.49)

$J^{i+1}(x_k) = \Psi(x_k, u_k^i) + J^i(x_{k+1}^i)$   (5.50)

For the finite-horizon case, the final condition given below is satisfied at every iteration:

$J^i(x_N) = \tfrac{1}{2}\,x_N^T Q_f x_N$   (5.51)

Note that $\Psi(x_k, u_k) = \tfrac{1}{2}\left(x_k^T Q x_k + u_k^T R u_k\right)$ and

$x_{k+1}^i = f(x_k) + g(x_k)\,u_k^i$   (5.52)
The convergence results for the above-mentioned reinforcement learning schemes are given below; their proofs, given in [27], are skipped because of page constraints. Note that the proofs for the finite-time cases are inspired by the proof of HDP convergence for the infinite-horizon case in [6].

Theorem 1: HDP Convergence. The sequence of $J^i$ iterated through (5.49) to (5.51), in the case of $J^0(\cdot) = 0$, converges to the finite-horizon optimal solution.

Theorem 2: DHP Convergence. The sequences $\lambda^i$ and $u^i$ defined by equations (5.44) to (5.46), in the case of $\lambda^0 = 0$, converge to the finite-horizon optimal solution for the given nonlinear control-affine system.
5.6.3 Numerical Analysis
An orbital maneuver problem is provided to illustrate the use of Finite-SNAC. A spacecraft is supposed to move from one orbit to another in a fixed final time. This problem is known as the rendezvous problem in the orbital mechanics literature.
5.6.3.1 Modeling the Orbital Maneuver Problem
Denoting the displacement vector of the center of mass of a rigid spacecraft from the center of the orbital frame in the destination orbit by $[x\;\;y\;\;z]^T$, where $x$, $y$, and $z$ are the components of the vector in the orbital frame, the non-dimensionalized equations of motion of a spacecraft in the gravity field are given as [32]

$\ddot{x} = 2\dot{y} + (1+x) - \dfrac{1+x}{r^3} + f_x$

$\ddot{y} = -2\dot{x} + y - \dfrac{y}{r^3} + f_y$   (5.53)

$\ddot{z} = -\dfrac{z}{r^3} + f_z$

where the dots denote time-derivatives; $f_x$, $f_y$, and $f_z$ respectively denote the three components of the non-dimensionalized total force exerted on the spacecraft in the $x$, $y$, and $z$ axes; and $r = \sqrt{(1+x)^2 + y^2 + z^2}$.

Selecting the state vector as $X = [x\;\;y\;\;z\;\;\dot{x}\;\;\dot{y}\;\;\dot{z}]^T$ and the control vector as $u = [f_x\;\;f_y\;\;f_z]^T$, the state equation of the orbital maneuver problem reads

$\dot{X} = \begin{bmatrix} X_4 \\ X_5 \\ X_6 \\ 2X_5 + (1+X_1) - (1+X_1)/r^3 \\ -2X_4 + X_2 - X_2/r^3 \\ -X_3/r^3 \end{bmatrix} + \begin{bmatrix} 0_{3\times 3} \\ I_{3\times 3} \end{bmatrix} u$   (5.54)

where the components of the state vector are denoted by $X_i$, $i = 1, \dots, 6$, and those of the control vector by $u_i$, $i = 1, 2, 3$. Note that this state equation is in nonlinear input-affine form, suitable for the Finite-SNAC application.
Now, the problem is to apply optimal control to force the states $X_1, \dots, X_6$ to go to zero in a pre-determined and fixed final time. Convergence of the states to zero is equivalent to performing the orbital maneuver and arriving at the destination point at the dictated time.
5.6.3.2 Simulation
The non-dimensionalized state equation can cover any circular orbit as the rendezvous position. The mean anomalies of both the maneuvering and the destination spacecraft are zero at the start of the maneuver. The specifications of the two orbits are given in Table 5.1.

Table 5.1. Specifications of the source and destination orbits

Characteristic                              Source Orbit    Destination Orbit
Semi-major axis                             10,000 km       10,000 km
Right ascension of the ascending node      15 deg.         0 deg.
Inclination                                 85 deg.         90 deg.
Eccentricity                                0               0
For the given characteristics, the initial condition $X_0$ is computed accordingly (5.55). The fixed final time is selected as 3 time units; the time unit and the reference length follow from the non-dimensionalization in [32], which also fixes the control unit. Note that this implementation is based on having continuous thrust for actuation. The weight matrices $Q$, $R$, and $Q_f$ are selected through trial and error; the values of the elements of $Q_f$ are selected much higher than those of $Q$ in order to force the error of the states at the final time to be small.
The basis functions selected for the neural network are combinations of polynomials in the state elements and in the time-to-go, where the time-to-go input is normalized by dividing it by the total time; its contribution to the basis functions is selected through some trial and error such that the network error is as small as possible.

The total number of basis functions is equal to 96. The network is trained for 200 epochs, with 300 states selected at each iteration to create a mesh over the region of interest, as explained in [27]. For this simulation, the states are selected in such a way that each element belongs to a fixed interval. Performing the weight update, the weights converged as shown in Fig. 5.6.
Histories of the position elements of the state vector and the control are given in Fig. 5.7. In order to evaluate the performance of the controller, a numerical solution to this problem was calculated through an iterative process for the given initial condition; the results are depicted using dashed plots in Fig. 5.7, while solid plots denote those of the Finite-SNAC. As seen in the plots, the proposed controller has been able to force the states to converge to the origin in the fixed final time of 3 units, and the Finite-SNAC trajectories are close to those of the optimal open-loop numerical solution. The cost-to-go of the Finite-SNAC turned out to be only 3% more than that of the open-loop optimal solution. While the open-loop solution is only applicable to the preselected initial condition and the fixed time-to-go, the Finite-SNAC solutions are more versatile, as shown in the discussions below.
To demonstrate the versatility of the HJB based finite-horizon controller, the same network, without retraining, is used for another maneuver with the same initial condition but with a shorter time-to-go, i.e., 2 time units. The trajectories of the position elements of the state vector and the control histories for the shorter horizon are superimposed on the results of the previous Finite-SNAC simulation and shown in Fig. 5.8. In the two figures, the solid plots denote the results of the maneuver with a time-to-go of 3 units, and the dashed plots denote those of the maneuver with a time-to-go of 2 time units.

As can be seen in Fig. 5.8, the controller has applied a different control history to the spacecraft to accomplish the same maneuver in a shorter time. This shows that the network has learned the time-dependency of the optimal control in finite-horizon problems.
The developed controller is able to perform well under different initial conditions as long as the resulting state trajectory belongs to the domain on which the network is trained. To evaluate the performance of the controller with different initial conditions, another source orbit is selected, with a semi-major axis of 11,000 km, a right ascension of -15 degrees, and an inclination of 95 degrees, which leads to a new set of initial conditions (5.56).

The simulation is performed using the same trained network for the new initial condition, and the resulting position trajectory, called maneuver 2, is plotted along with the results from the previous initial conditions, called maneuver 1, in a 3D plot in Fig. 5.9. As seen, the new maneuver is accomplished in the fixed set time as well, confirming the applicability of the controller to different initial conditions.
Figure 5.6. Histories of some of the elements of the weight matrix during the training iterations.
Figure 5.7. Histories of the position elements of the state vector and the applied control
for the time-to-go of 3 units.
Figure 5.8. Histories of the position elements of the state vector and the applied control;
Finite-SNAC with different final times.
Figure 5.9. Three-dimensional trajectories of the maneuvers with different source orbits,
i.e., different initial conditions.
Conclusions
Efficient neural network structures for solving approximate dynamic programming based control problems were presented in this chapter. Since they eliminate a network from the typical dual-network adaptive critic formulations, computing time is significantly reduced. These structures, called single network adaptive critics (SNAC, J-SNAC), are applicable to a wide variety of engineering problems. Numerical results from the limited applications considered here, together with the underlying simpler structures, indicate that these designs hold great potential for implementation.
Acknowledgements
Support for this study from NASA under Grant No. ARMD NRA NNH07ZEA001N-IRAC1 and from the National Science Foundation is gratefully acknowledged. The views of the authors do not necessarily represent the views of the NSF or NASA.
References
1. Lewis, F., Applied Optimal Control and Estimation, Prentice-Hall, 1992.
2. Bryson, A. E. and Ho, Y. C., Applied Optimal Control, Taylor and Francis, 1975.
3. Werbos, P. J., “Approximate Dynamic Programming for Real-time Control and Neural
Modeling,” Handbook of Intell. Ctrl., Multiscience Press, 1992.
4. White, D. A. and Sofge, D. A., "Applied Learning-Optimal Control for
Manufacturing," in Handbook of Intelligent Control. New York: Van Nostrand Reinhold,
ch. 9, 1992.
5. Si, J., Barto, A. G., Powell, W. B., and Wunsch, D., Handbook of Learning and Approximate Dynamic Programming, IEEE Press Series on Computational Intelligence. New York: Wiley-IEEE Press, Aug. 2004.
6. Al-Tamimi, A., Lewis, F. L., and Abu-Khalaf, M., “Discrete-time Nonlinear HJB Solution Using Approximate Dynamic Programming: Convergence Proof,” IEEE Trans. Syst. Man, Cybern. B, vol. 38(4), pp. 943-949, 2008.
7. Balakrishnan, S. N., Ding, J., and Lewis, F. L., “Issues on Stability of ADP Feedback
Controllers for Dynamical Systems,” IEEE Trans. Syst., Man., Cybern. B, vol. 38(4), pp.
913–917, 2008.
8. Dierks T., Thumati B., and Jagannathan S., “Optimal Control of Unknown Affine
Nonlinear Discrete-time Systems Using Offline-trained Neural Networks with Proof of
Convergence,” Neural Networks, vol. 22, pp. 851-860, 2009.
9. Li, B. and Si, J., “Robust Dynamic Programming for Discounted Infinite-horizon
Markov Decision Processes with Uncertain Stationary Transition Matrices,” Proc. IEEE
Int. Symp. Appr. Dynamic Programming and Reinforcement Learning, Honolulu, HI, pp.
96-102, 2007.
10. Werbos, P.J., “Using ADP to Understand and Replicate Brain Intelligence: the Next
Level Design,” Proc. IEEE Symp. Appr. Dynamic Programming and Reinforcement
Learning, Honolulu, HI, pp. 209-216, 2007.
11. Balakrishnan, S. N. and Biega, V., “Adaptive-Critic Based Neural Networks for
Aircraft Optimal Control,” Journal of Guidance, Control and Dynamics, vol. 19, pp. 893-
898, 1996.
12. Prokhorov, D. and Wunsch, D., “Adaptive Critic Designs,” IEEE Trans. on Neural
Networks, vol. 8, pp.997-1007, 1997.
13. Venayagamoorthy, G., Harley, R., and Wunsch, D., “Dual Heuristic Programming
Excitation Neurocontrol for Generators in a Multimachine Power System,” IEEE Trans.
Ind. Appl., vol. 39, pp. 382-384, 2003.
14. Padhi, R. and Balakrishnan, S. N., “Proper Orthogonal Decomposition based Optimal Neurocontrol Synthesis of a Chemical Reactor Process Using Approximate Dynamic Programming,” Neural Networks, vol. 16(5-6), pp. 719-728, 2003.
15. Ferrari, S. and Stengel, R., “Online Adaptive Critic Flight Control," Journal of
Guidance, Ctrl. and Dynamics, vol. 27(5), pp. 777-786, 2004.
16. Yang, Q. and Jagannathan, S., “Adaptive Critic Neural Network Force Controller for
Atomic Force Microscope-based Nanomanipulation,” Proc. IEEE Int. Symp. Intell. Ctrl.,
pp. 464-469, 2006.
17. Lendaris, G., Schultz, L., Shannon, T., “Adaptive Critic Design for Intelligent
Steering and Speed Control of a 2-axle Vehicle,” Proc. International Joint Conf. on
Neural Networks, Como, Italy, 2000.
18. Hanselmann, T., Noakes, L., and Zaknich, A., “Continuous-Time Adaptive Critics,” IEEE Trans. on Neural Netw., vol. 18(3), pp. 631-647, 2007.
19. Vrabie, D., and Lewis, F., “Adaptive Optimal Control Algorithm for Continuous-
Time Nonlinear Systems Based on Policy Iteration,” Proc. IEEE Conf. on Decision and
Control, Cancun, pp. 73-79, 2008.
20. Vrabie, D., Pastravanu, O., Lewis, F., and Abu-Khalaf, M., “Adaptive Optimal
Control for Continuous-Time Linear Systems Based on Policy Iteration,” Automatica, vol
45 (2), pp. 477-484, 2009.
21. Vamvoudakis, K., and Lewis, F., “Online actor-critic algorithm to solve the
continuous-time infinite horizon optimal control problem,” Automatica, vol 46, pp. 878-
888, 2010.
22. Dierks, T., and Jagannathan, S., “Optimal Control of Affine Nonlinear Continuous-time Systems,” Proc. American Control Conf., Baltimore, MD, pp. 1568-1573, 2010.
23. Padhi, R., and Balakrishnan, S. N., “Optimal Beaver Population Management Using
Reduced Order Distributed Parameter Model and Single Network Adaptive Critics,”
Proc. Amer. Ctrl. Conf., Boston, MA, pp.1598-1603, 2004.
24. Yadav, V., Padhi, R., Balakrishnan, S. N., “Robust/Optimal Temperature Profile
Control Using Neural Networks,” Proc. IEEE International Conf. on Ctrl. Applications,
Munich, Germany, pp.3169-3174, 2006.
25. Padhi, R., Unnikrishnan, N., Wang, X., and Balakrishnan, S., “A Single Network
Adaptive Critic (SNAC) Architecture for Optimal Control Synthesis for a Class of
Nonlinear Systems,” Neural Net., vol. 19, pp. 1648-1660, Dec. 2006.
26. Ding J., Balakrishnan S. N., Lewis, F. L., "A Cost Function Based Single Network
Adaptive Critic Architecture for Optimal Control Synthesis for a Class of Nonlinear
Systems," Proc. IJCNN, Barcelona, Spain, 2010.
27. Heydari A. and Balakrishnan S. N., "Finite-Horizon Input-Constrained Nonlinear
Optimal Control Using Single Network Adaptive Critics," Proc. American Control
Conference, 2011.
28. Han, D. and Balakrishnan, S. N., “State-constrained agile missile control with
adaptive-critic-based Neural Networks,” IEEE Trans. on Control Systems Technology,
vol. 10 (4), pp. 481-489, 2002.
29. Cheng, T., Lewis, F. L., and Abu-Khalaf, M., “A neural network solution for fixed-
final time optimal control of nonlinear systems,” Automatica, vol. 43, pp. 482-490, 2007.
30. Gupta, S. K., “Numerical Methods for Engineers,” Wiley Eastern Ltd. and New Age
International Ltd., 1995.
31. Slotine, J.-J.E. and Li,W., Applied Nonlinear Control, Prentice-Hall, 1991.
32. Park C., Guibout V., and Scheeres D., "Solving optimal continuous thrust rendezvous
problems with generating functions," Journal of Guidance, Control, and Dynamics, vol.
29, no. 2, pp.321-331, 2006.