AAS 18-377
PHYSICALLY-CONSTRAINED INVERSE OPTIMAL CONTROL FOR SATELLITE MANEUVER DETECTION
Richard Linares∗ and Joseph B. Raquepas†
This paper develops an approach to determine the behavior of Space Objects (SOs) using Inverse Optimal Control (IOC) theory. The proposed method uses IOC theory to determine the control objective function that each SO is using, thereby determining the behavior of the SOs. The approach discussed in this work can be used to analyze maneuvering SOs from observational data. The contribution of this work is to formulate the behavior estimation problem using IOC theory. The IOC problem is solved using Pontryagin's minimum principle, which results in a convex optimization problem. Three simulation test cases are shown: the first two cases study the behavior estimation problem using the relative motion equations, while the third case studies the behavior estimation problem for a continuous thrust orbit raising maneuver. Additionally, a general approach for modeling the SO reward function using Extreme Learning Machines (ELMs) is discussed. The simulation examples demonstrate that the control objective function can be recovered using the proposed method.
INTRODUCTION
Optimal Control refers to the process of computing state and control trajectories that are optimal with respect to a cost function subject to constraints; in the context of this work we call this the Forward Optimal Control (FOC) problem. Inverse Optimal Control (IOC) refers to the process of determining the cost function implied by observed state and control trajectories. The study of IOC problems has been an active area for more than 50 years,1 and IOC has been successfully applied to many fields.2 Kalman1 first studied the problem of finding all performance indices with respect to which a given control law is optimal. This theory has seen many applications in the development of control laws2, 3 and, more recently, in reinforcement learning.4–8
IOC is also known as Inverse Reinforcement Learning (IRL),4 but this work makes a distinction between the two approaches: the term IRL is used here to refer to computing IOC solutions without knowledge of the underlying dynamics and in a stochastic setting. There are many emerging applications of both IRL and IOC, including transferring expert demonstrations to robotic systems,5 humanoid robot control, autonomous driving,9 human-robot cooperation,10 and stabilization of rigid spacecraft.11 IRL is often
∗Charles Stark Draper Assistant Professor, Department of Aeronautics and Astronautics, Massachusetts Institute of Technology, Cambridge, MA, 02139. E-mail: [email protected], Senior Member AIAA.
†Research Mathematician, Information Directorate, Air Force Research Laboratory, Rome, NY.
applied to the problem of learning control policies from demonstrations, a process referred to as imitation learning or apprenticeship learning. The demonstration data are usually considered to be state and control trajectories captured from expert demonstrations.
Apprenticeship learning can also be solved using regression and classification approaches that learn to map states to actions using expert demonstrations.12 However, these approaches tend to have issues with generalization of policies to unseen conditions.13 Recently, IRL has been shown to provide the basis for some of the most successful approaches to imitation learning.4–8 Instead of learning to model the observed behavior directly, IRL approaches approximate the cost function (or reward function) with respect to which the observed trajectories are (approximately) optimal. IRL approaches train a model over entire trajectories of observed behavior instead of individual state and action pairs, and are therefore more robust to unseen conditions.
One of the first IRL methods was developed for stationary Markov Decision Processes (MDPs), and the solution for the cost function was found using a linear programming method.4 This work was later extended to provide an efficient IRL approach by finding a cost function that is a linear combination of nonlinear features and is lower in cost, by a margin, for the expert trajectories.5 There have been many extensions of this work, including Bayesian approaches,14 maximum entropy approaches,6 and linearly solvable stochastic optimal control problems.15 However, most existing IRL techniques solve a FOC problem within the procedure in a nested iterative process: these methods compute predicted trajectories for the current cost function by solving the FOC problem in an inner loop, and the cost function (feature weights) is then updated in an outer loop. This structure was initially proposed by Ref. 5.
An alternative to IRL approaches are IOC approaches that make use of additional constraints on the expert demonstration. In general, both the IRL and IOC problems fall under the more general framework of imputing the objective function of an optimization problem from observed solutions. In this context, Ref. 16 developed an approach for imputing the objective function given known constraints and observed trajectories. The general concept for finding the cost function is to set up a set of optimality equations that should hold for the observed trajectories and then minimize the residual of these equations.16 Reference 16 used the Karush-Kuhn-Tucker (KKT) optimality conditions to determine the unknown cost function. Additionally, Ref. 16 showed that this approach could be applied to IOC problems where the known constraints are the system dynamics and the unknown objective function is the control cost function. These ideas were then extended to solving deterministic discrete-time IOC problems,17 IOC problems for hybrid systems,18 additive cost functions with linear constraints for aircraft intent determination,19 and the analysis of human locomotion.17 Finally, for continuous-time systems these concepts can be extended using Pontryagin's minimum principle, where the Euler-Lagrange equations for the continuous case are employed.20
Unfortunately, issues do arise when using IOC or IRL. Using IOC, a set of reward functions can be found for which a given policy is optimal. Unfortunately, this set may contain degenerate reward functions (e.g., the identically zero reward function). To mitigate this, additional constraints are imposed on the cost function.16 Another problem that arises is large or infinite state spaces, for which a tabular form of the reward function is unattainable. By assuming the reward function is a linear combination of fixed basis functions, the IOC problem stays in the class of linear programs that can be solved efficiently.21 Nevertheless, these problems should be considered while formulating the IOC algorithm.
The estimated reward function provided by IOC can be used to determine the type of behavior mode the unit is following and to classify the mode based on libraries of behavior modes. This concept was demonstrated for aircraft intent estimation using IOC solutions.19 This work will also investigate using feature weight vectors determined by the IOC approach for maneuver prediction and classification for satellites. This work follows the standard assumption made for both IOC and IRL approaches that the cost function is a linear combination of features.12 These weight vectors can be added to the state of SOs as a way to represent the policy that the SO is currently following, and to allow for the change of this policy over time as the behavior changes. In this setting, we are given observations of expert demonstrations for a given task, and the goal is to estimate the reward function that the expert used to derive the demonstration trajectories. It is common to assume that the expert's actions are optimal with respect to the reward function the expert is using, and this work makes this assumption. This work will investigate Pontryagin's minimum principle20 for solving for the expert's reward function.
This work utilizes IOC to develop a physically-constrained behavior estimation approach for Space Objects (SOs), using the dynamical equations of motion for SOs to determine constraints that can be imposed on the IOC solution. This work discusses the use of IOC to learn the behavior of SOs from observed orbital motion. The behavior of SOs is estimated using IOC to determine the reward function that each SO is using for determining its control. In general, an SO is controlled to achieve a particular goal that is determined by its mission, and therefore only a data-driven learning approach can reveal the true goal. It is also important to determine what type of behavior SOs are using and whether this behavior changes. IRL and IOC approaches use optimal control principles to learn what reward function or control objective function is being used by an agent given observations.5 The simplest IRL approach solves for the reward function by modeling it as a weighted sum of feature vectors.5 The weights determined from the IRL calculation are the representation of the reward function the SO is using.
The organization of this paper is as follows. First, the concept of inverse reinforcement learning is introduced. This is followed by a discussion of the inverse optimal control approach and an outline of ELMs and the relative motion dynamics. Finally, simulation results are provided for the proposed method.
THEORY: INVERSE REINFORCEMENT LEARNING
This section summarizes existing IOC and IRL methods. Max-margin IRL was developed by Abbeel and Ng5 and was applied directly to the driving application mentioned earlier. This method learns a cost function which is minimized to obtain performance (a feature vector) similar to the expert's. First, the method is initialized with a random cost function parameter c(0), and the following optimal control problem is solved:
$$\begin{aligned}
\underset{\mathbf{x}(t),\,\mathbf{u}(t)}{\text{minimize}}\quad & \int_{t_0}^{t_f} \mathbf{c}^T\boldsymbol{\phi}(t,\mathbf{x}(t),\mathbf{u}(t))\,dt\\
\text{subject to}\quad & \dot{\mathbf{x}}(t) = \mathbf{f}(t,\mathbf{x}(t),\mathbf{u}(t))\\
& \mathbf{x}(t_0) = \mathbf{x}_{\mathrm{start}},\quad \mathbf{x}(t_f) = \mathbf{x}_{\mathrm{final}}
\end{aligned}\tag{1}$$
where $\mathbf{x}(t)\in\mathbb{R}^{n_s}$ is the state, $\mathbf{u}(t)\in\mathbb{R}^{n_u}$ is the control input,
$$\boldsymbol{\phi}(t,\mathbf{x}(t),\mathbf{u}(t)) = \left[\phi_1(t,\mathbf{x}(t),\mathbf{u}(t)),\,\cdots,\,\phi_{n_c}(t,\mathbf{x}(t),\mathbf{u}(t))\right]^T$$
are the basis functions, and $\mathbf{c}\in\mathbb{R}^{n_c}$ is the cost function parameter to be learned. By solving this optimal control problem, the initial trajectory $(\mathbf{x}^{(0)}(t),\mathbf{u}^{(0)}(t))$ is found. The initial feature vector $\boldsymbol{\mu}^{(0)}$ can be found using
$$\boldsymbol{\mu} = \int_{t_0}^{t_f}\boldsymbol{\phi}(t,\mathbf{x}(t),\mathbf{u}(t))\,dt\tag{2}$$
Then the quadratic program (QP) is solved for the $i$th coefficient $\mathbf{c}^i$:
$$\begin{aligned}
\underset{\mathbf{c}^i,\,b^i}{\text{minimize}}\quad & \|\mathbf{c}^i\|_2\\
\text{subject to}\quad & (\mathbf{c}^i)^T\boldsymbol{\mu}^* \le (\mathbf{c}^i)^T\boldsymbol{\mu}^{(j)} - b^i,\quad j = 0,\cdots,i-1\\
& b^i > 0
\end{aligned}\tag{3}$$
where $b^i$ is the margin on the $i$th iteration, $\boldsymbol{\mu}^*$ is the expert (optimal) trajectory feature vector, and $\boldsymbol{\mu}^{(j)}$ is the $j$th trajectory feature vector. If $b^i < \epsilon$, terminate; otherwise iterate with $i = i + 1$. After the procedure terminates, there will be a policy and a corresponding feature vector that closely follow the expert's policy.
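A minimal numerical sketch of one iteration of this QP, assuming the trajectory feature vectors have already been computed and fixing the margin at $b = 1$ (an SVM-style normalization, an assumption of this sketch, which excludes the trivial solution $\mathbf{c} = 0$); the helper name and the toy feature vectors are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def max_margin_step(mu_expert, mu_candidates):
    """One max-margin QP step in the spirit of Eq. (3), with the margin
    normalized to 1: minimize ||c||^2 subject to
    c^T mu_expert <= c^T mu_j - 1 for every candidate feature vector."""
    n = mu_expert.size
    # One inequality constraint per previously generated trajectory.
    cons = [{"type": "ineq",
             "fun": lambda c, d=(mu_j - mu_expert): c @ d - 1.0}
            for mu_j in mu_candidates]
    res = minimize(lambda c: c @ c, x0=np.ones(n),
                   constraints=cons, method="SLSQP")
    return res.x

# Toy feature vectors: the expert accrues less of feature 0.
mu_star = np.array([1.0, 2.0])
mu_js = [np.array([3.0, 2.0]), np.array([2.5, 2.0])]
c = max_margin_step(mu_star, mu_js)
# The expert is now cheaper than every candidate by at least the unit margin.
```

The minimum-norm solution puts weight only on the feature that separates the expert from the candidates.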
Maximum margin planning, developed by Reference 22, learns a cost function for which the expert policy has lower expected cost than every other policy. Starting with the QP
$$\begin{aligned}
\underset{\mathbf{c},\,b}{\text{minimize}}\quad & \|\mathbf{c}\|^2\\
\text{subject to}\quad & \mathbf{c}^T\boldsymbol{\mu}^* \le \mathbf{c}^T\boldsymbol{\mu} - b\quad\text{for all }(\mathbf{x}(t),\mathbf{u}(t))\in S\\
& b > 0
\end{aligned}\tag{4}$$
where $S$ is a given set of trajectories, the constraints of the QP can be satisfied using
$$\begin{aligned}
\mathbf{c}^T\boldsymbol{\mu}^* \le \underset{\mathbf{x},\,\mathbf{u}}{\text{minimize}}\quad & \left(\mathbf{c}^T\boldsymbol{\mu} - b\right)\\
\text{subject to}\quad & \dot{\mathbf{x}}(t) = \mathbf{f}(t,\mathbf{x}(t),\mathbf{u}(t))
\end{aligned}\tag{5}$$
for all possible trajectories. Instead of solving for $b^i$ as in Reference 5, $b$ is set as $b = L$, where $L$ is the loss function. The loss function is defined by the closeness of a trajectory $(\mathbf{x}(t),\mathbf{u}(t))$ to the optimal trajectory $(\mathbf{x}^*(t),\mathbf{u}^*(t))$: if the trajectory is optimal, the loss function is zero, and as $(\mathbf{x}(t),\mathbf{u}(t))$ increasingly deviates from the optimal trajectory, the loss function increases toward one. Lastly, slack variables $\zeta$ are included to allow constraint violations. Therefore, the problem to be solved becomes
$$\begin{aligned}
\underset{\mathbf{c},\,\zeta}{\text{minimize}}\quad & \zeta + \frac{\lambda}{2}\|\mathbf{c}\|^2\\
\text{subject to}\quad & \mathbf{c}^T\boldsymbol{\mu}^* \le \underset{\mathbf{x},\,\mathbf{u}}{\min}\left(\mathbf{c}^T\boldsymbol{\mu}(\mathbf{x},\mathbf{u}) - L\right) + \zeta\\
& \dot{\mathbf{x}} = \mathbf{f}(\mathbf{x},\mathbf{u})
\end{aligned}\tag{6}$$
where $\lambda > 0$ is a parameter that penalizes constraint violations and large weight vectors. With tight slack variables, the objective can be written as
$$J(\mathbf{c}) = \lambda\|\mathbf{c}\|^2 + \mathbf{c}^T\boldsymbol{\mu}^* - \underset{\mathbf{x}(t),\,\mathbf{u}(t)}{\min}\left(\mathbf{c}^T\boldsymbol{\mu} - L\right),\qquad \dot{\mathbf{x}}(t) = \mathbf{f}(t,\mathbf{x}(t),\mathbf{u}(t))\tag{7}$$
and solved as a convex program using sub-gradient descent.
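As an illustration, a sub-gradient of Eq. (7) at a given $\mathbf{c}$ is $2\lambda\mathbf{c} + \boldsymbol{\mu}^* - \boldsymbol{\mu}_{\hat{j}}$, where $\hat{j}$ attains the inner minimum. The sketch below restricts the inner minimization to a finite candidate trajectory set (a simplification: in the method above it ranges over all dynamically feasible trajectories); all numbers are illustrative:

```python
import numpy as np

def mmp_subgradient_descent(mu_expert, mu_set, losses,
                            lam=0.1, step=0.1, iters=500):
    """Sub-gradient descent on J(c) = lam*||c||^2 + c^T mu_expert
    - min_j (c^T mu_j - L_j), with the inner minimum taken over a
    finite candidate set."""
    c = np.zeros_like(mu_expert)
    for k in range(iters):
        # Inner minimization: candidate with lowest loss-augmented cost.
        jhat = int(np.argmin([c @ mu - L for mu, L in zip(mu_set, losses)]))
        g = 2.0 * lam * c + mu_expert - mu_set[jhat]   # a sub-gradient of J
        c = c - (step / np.sqrt(k + 1)) * g            # diminishing step size
    return c

mu_star = np.array([1.0, 2.0])
mu_set = [mu_star, np.array([3.0, 2.0])]  # the expert itself has zero loss
losses = [0.0, 1.0]
c = mmp_subgradient_descent(mu_star, mu_set, losses)
```

For this toy problem the minimizer of $J$ is $\mathbf{c} = [0.5, 0]$: below $c_0 = 0.5$ the loss-augmented alternative attains the inner minimum and the sub-gradient pushes $c_0$ up, while above it only the regularizer acts.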
The bi-level inverse optimal control method proposed by Mombaur21 uses a derivative-free optimization technique to find the cost function parameter while an optimal control method computes predicted trajectories. The sum squared error between predicted and observed trajectories,
$$\underset{\mathbf{c}}{\text{minimize}}\quad\int_{t_0}^{t_f}\left\|\left[\mathbf{x}_c;\mathbf{u}_c\right] - \left[\mathbf{x}^*;\mathbf{u}^*\right]\right\|^2 dt\tag{8}$$
is minimized over the cost function parameter $\mathbf{c}$ using derivative-free optimization techniques. For each candidate $\mathbf{c}$, new trajectories $(\mathbf{x}_c,\mathbf{u}_c)$ are obtained by solving the optimal control problem
$$\begin{aligned}
\underset{\mathbf{x}(t),\,\mathbf{u}(t)}{\text{minimize}}\quad & \int_{t_0}^{t_f}\mathbf{c}^T\boldsymbol{\phi}(t,\mathbf{x}(t),\mathbf{u}(t))\,dt\\
\text{subject to}\quad & \dot{\mathbf{x}}(t) = \mathbf{f}(t,\mathbf{x}(t),\mathbf{u}(t))\\
& \mathbf{x}(t_0) = \mathbf{x}_{\mathrm{start}},\quad \mathbf{x}(t_f) = \mathbf{x}_{\mathrm{final}}
\end{aligned}\tag{9}$$
and the search for a new cost function parameter $\mathbf{c}$ continues until convergence. Though these methods all provide converging solutions, they vary in computational efficiency and performance. The inverse optimal control method of Reference 20 provides increased computational efficiency and lower parameter, feature, and trajectory error than the other three approaches.
THEORY: INVERSE OPTIMAL CONTROL
To formulate the IRL problem using optimal control theory, optimal or near-optimal trajectory data $(\mathbf{x},\mathbf{u})$ are considered. The minimum principle provides the necessary conditions for the trajectory $(\mathbf{x},\mathbf{u})$ to be a local minimum and can be used to learn the objective function.20 For example, a common problem statement for the IRL problem assumes that the cost function is a linear combination of nonlinear features, where $\boldsymbol{\phi}(t,\mathbf{x},\mathbf{u})$ is the feature vector, and the problem is stated as
$$\begin{aligned}
\underset{\mathbf{x}(t),\,\mathbf{u}(t)}{\text{minimize}}\quad & \int_{t_0}^{t_f}\mathbf{c}^T\boldsymbol{\phi}(t,\mathbf{x}(t),\mathbf{u}(t))\,dt\\
\text{subject to}\quad & \dot{\mathbf{x}}(t) = \mathbf{f}(t,\mathbf{x}(t),\mathbf{u}(t))
\end{aligned}\tag{10}$$
The Hamiltonian for this problem is
$$H(\mathbf{x},\mathbf{u},\mathbf{p}) = \mathbf{c}^T\boldsymbol{\phi}(t,\mathbf{x},\mathbf{u}) + \mathbf{p}^T\mathbf{f}(t,\mathbf{x},\mathbf{u})\tag{11}$$
For an unknown parameter vector $\mathbf{c}$, if $(\mathbf{x},\mathbf{u}) = (\mathbf{x}^*,\mathbf{u}^*)$ is assumed to be near-optimal, there exists a costate trajectory $\mathbf{p}^*$ satisfying
$$0 = \dot{\mathbf{p}}^{*T} + \nabla_{\mathbf{x}}H(\mathbf{x}^*,\mathbf{u}^*,\mathbf{p}^*)\tag{12}$$
$$0 = \nabla_{\mathbf{u}}H(\mathbf{x}^*,\mathbf{u}^*,\mathbf{p}^*)\tag{13}$$
Substituting the Hamiltonian, the optimality conditions become
$$0 = \dot{\mathbf{p}}^{*T} + \mathbf{c}^T\nabla_{\mathbf{x}}\boldsymbol{\phi}\big|_{(\mathbf{x}^*,\mathbf{u}^*)} + \mathbf{p}^{*T}\nabla_{\mathbf{x}}\mathbf{f}\big|_{(\mathbf{x}^*,\mathbf{u}^*)}\tag{14}$$
$$0 = \mathbf{c}^T\nabla_{\mathbf{u}}\boldsymbol{\phi}\big|_{(\mathbf{x}^*,\mathbf{u}^*)} + \mathbf{p}^{*T}\nabla_{\mathbf{u}}\mathbf{f}\big|_{(\mathbf{x}^*,\mathbf{u}^*)}\tag{15}$$
If $(\mathbf{x},\mathbf{u}) = (\mathbf{x}^*,\mathbf{u}^*)$, the necessary conditions for optimality are satisfied exactly. If the trajectory $(\mathbf{x},\mathbf{u})$ is only approximately optimal, the necessary conditions are approximately satisfied. This can be described by a residual function which measures how much the trajectory $(\mathbf{x},\mathbf{u})$ fails to satisfy the necessary conditions. Thus, an optimal solution is one for which the residual function is zero, and for approximately optimal trajectories the goal becomes minimizing this residual function. Define
$$\mathbf{z} = \begin{bmatrix}\mathbf{c}^T & \mathbf{p}^T\end{bmatrix}^T,\qquad \mathbf{v} = \dot{\mathbf{p}}\tag{16}$$
Note that $\mathbf{z}\in\mathbb{R}^{n_z}$ where $n_z = n_s + n_c$. Then the residual function is
$$\mathbf{r}(\mathbf{z},\mathbf{v}) = \begin{bmatrix}\nabla_{\mathbf{x}}\boldsymbol{\phi}^T\big|_{(\mathbf{x},\mathbf{u})} & \nabla_{\mathbf{x}}\mathbf{f}^T\big|_{(\mathbf{x},\mathbf{u})}\\[2pt] \nabla_{\mathbf{u}}\boldsymbol{\phi}^T\big|_{(\mathbf{x},\mathbf{u})} & \nabla_{\mathbf{u}}\mathbf{f}^T\big|_{(\mathbf{x},\mathbf{u})}\end{bmatrix}\mathbf{z} + \begin{bmatrix}I_{n_s\times n_s}\\ 0_{n_u\times n_s}\end{bmatrix}\mathbf{v} = F(t)\mathbf{z}(t) + G(t)\mathbf{v}(t)\tag{17}$$
The goal is to determine the minimum of the residual over the unknown variables $\mathbf{z}$ and $\mathbf{v}$:
$$\begin{aligned}
\underset{\mathbf{z},\,\mathbf{v}}{\text{minimize}}\quad & \int_{t_0}^{t_f}\|\mathbf{r}(\mathbf{z},\mathbf{v})\|^2\,dt\\
\text{subject to}\quad & \dot{\mathbf{z}} = \begin{bmatrix}0_{n_c\times n_s}\\ I_{n_s\times n_s}\end{bmatrix}\mathbf{v}\\
& \mathbf{z}(t_0) = \mathbf{z}_0\ (\text{unknown})
\end{aligned}\tag{18}$$
where $\mathbf{z}(t_0)$ is unknown and
$$\|\mathbf{r}(\mathbf{z},\mathbf{v})\|^2 = \mathbf{z}^T F^T F\mathbf{z} + \mathbf{v}^T G^T G\mathbf{v} + 2\mathbf{z}^T F^T G\mathbf{v}\tag{19}$$
This cost function is quadratic, and the constrained minimization problem becomes an LQR problem that can be solved very efficiently using convex optimization. When the initial condition $\mathbf{z}(t_0)$ is known, the constrained minimization problem is the LQR problem
$$\begin{aligned}
\underset{\mathbf{z},\,\mathbf{v}}{\text{minimize}}\quad & \int_{t_0}^{t_f}\mathbf{z}^T Q\mathbf{z} + \mathbf{v}^T R\mathbf{v} + 2\mathbf{z}^T S\mathbf{v}\,dt\\
\text{subject to}\quad & \dot{\mathbf{z}} = A\mathbf{z} + B\mathbf{v}\\
& \mathbf{z}(t_0) = \mathbf{z}_0\ (\text{assumed known})
\end{aligned}\tag{20}$$
where
$$\begin{aligned}
A(t) &= 0_{n_z\times n_z}, & B(t) &= \begin{bmatrix}0_{n_c\times n_s}\\ I_{n_s\times n_s}\end{bmatrix}\\
Q(t) &= F^T(t)F(t), & R(t) &= G^T(t)G(t)\\
S(t) &= F^T(t)G(t)
\end{aligned}\tag{21}$$
Using the standard LQR equations, the control policy and value function can be determined as
$$\mathbf{v}(t) = K(t)\mathbf{z}(t),\qquad V(\mathbf{z}) = \mathbf{z}^T P(t)\mathbf{z}\tag{22}$$
where
$$K(t) = -\left(G^T(t)G(t)\right)^{-1}\left(G^T(t)F(t) + B^T(t)P(t)\right)\tag{23}$$
Here $P(t)$ is the solution of the Riccati equation, which follows from minimizing the value function. The matrix Riccati equation that defines $P(t)$ is
$$\frac{dP}{dt} = -PA - A^T P - Q + (PB + S)R^{-1}\left(B^T P + S^T\right)\tag{24}$$
where the terminal condition is $P(t_f) = 0_{n_z\times n_z}$. Then the initial state $\mathbf{z}_0$ can be found by solving the following optimization problem:
$$\begin{aligned}
\underset{\mathbf{z}_0}{\text{minimize}}\quad & \mathbf{z}_0^T P(t_0)\mathbf{z}_0\\
\text{subject to}\quad & \mathbf{c}\ge 0\\
& \mathbf{c}^T\mathbf{c} = 1
\end{aligned}\tag{25}$$
The constraints above are needed to avoid the trivial solution $\mathbf{c} = 0$. The IOC process thus involves determining the matrices in Eq. (21) from the observed trajectories, solving the matrix differential Riccati equation in Eq. (24) backward in time for $P(t_0)$, and finally solving the quadratic programming problem in Eq. (25) with $P(t_0)$ to determine $\mathbf{z}_0$. It has been shown in Reference 20 that this inverse optimal control method is more computationally efficient, with better approximation of the unknown parameters, than IRL methods.
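A compact numerical sketch of this pipeline on a hypothetical toy problem (constant $F$ and $G$, with $n_c = 2$ features, $n_s = 1$ state, $n_u = 1$ control; the matrix entries are illustrative, not from the paper): integrate Eq. (24) backward from $P(t_f) = 0$, then solve Eq. (25):

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

# Hypothetical toy dimensions: nc features, ns states, nu controls.
nc, ns, nu = 2, 1, 1
nz = nc + ns
F = np.array([[1.0, 0.5, 0.2],     # stacked residual coefficients of Eq. (17),
              [0.3, 1.0, 0.4]])    # taken constant in time for illustration
G = np.vstack([np.eye(ns), np.zeros((nu, ns))])
A = np.zeros((nz, nz))                             # z = [c; p], c constant,
B = np.vstack([np.zeros((nc, ns)), np.eye(ns)])    # so dz/dt = [0; v]
Q, R, S = F.T @ F, G.T @ G, F.T @ G

def riccati_rhs(t, p_flat):
    """Right-hand side of the matrix Riccati equation, Eq. (24)."""
    P = p_flat.reshape(nz, nz)
    dP = -P @ A - A.T @ P - Q + (P @ B + S) @ np.linalg.solve(R, B.T @ P + S.T)
    return dP.ravel()

# Backward integration from P(tf) = 0 down to t0.
t0, tf = 0.0, 1.0
sol = solve_ivp(riccati_rhs, (tf, t0), np.zeros(nz * nz), rtol=1e-8, atol=1e-10)
P0 = sol.y[:, -1].reshape(nz, nz)

# Eq. (25): minimize z0^T P(t0) z0 subject to c >= 0 and c^T c = 1.
cons = [{"type": "eq", "fun": lambda z: z[:nc] @ z[:nc] - 1.0},
        {"type": "ineq", "fun": lambda z: z[:nc]}]
res = minimize(lambda z: z @ P0 @ z, x0=np.array([1.0, 0.0, 0.0]),
               constraints=cons, method="SLSQP")
c_hat = res.x[:nc]
```

The unit-norm and nonnegativity constraints on $\mathbf{c}$ keep the quadratic program away from the trivial zero solution, as discussed above.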
Figure 1. Typical architecture of a Single Layer Forward Network (SLFN), which is the most fundamental ELM.
COST FUNCTION FEATURE SELECTION: EXTREME LEARNING MACHINES
This section discusses Extreme Learning Machines (ELMs), a general linear-in-features learning method. This method is effective for the SO behavior problem because it has been shown to work well for high-dimensional problems. Machine learning techniques have been successfully used to learn functional relationships while requiring only a limited amount of data. Most such techniques (e.g., NNs and Support Vector Machines (SVMs)) face challenges including slow learning speed, poor computational scalability, and the requirement of ad-hoc human intervention. Extreme Learning Machines have recently been established as an emergent technology that may overcome some of the above-mentioned challenges, providing better generalization, faster learning speed, and minimal human intervention.23 ELMs work with "generalized" Single Layer Forward Networks (SLFNs, Figure 1). SLFNs are designed to have a single hidden layer (which can use Radial Basis Function (RBF) or other activation functions) coupled to a linear output layer. The key point is that the hidden neurons need not be tuned, and their weights (training parameters) can be sampled from a random distribution. Theoretical studies24 show that feed-forward networks with minimum-norm output weights tend to achieve better generalization. ELMs tend to reach a) the minimum training error and b) the smallest norm of output weights, with consequently improved generalization. Importantly, since the hidden nodes can be randomly selected and fixed, the output weights can be determined via least-squares methods. Consider an SLFN with $L$ hidden nodes (Figure 1). The output
function can be represented as
$$f_L(\mathbf{x}) = \sum_{i=1}^{L}\boldsymbol{\beta}_i g_i(\mathbf{x}) = \sum_{i=1}^{L}\boldsymbol{\beta}_i G(\mathbf{a}_i, b_i, \mathbf{x})\tag{26}$$
where $\mathbf{x}\in\mathbb{R}^d$ and $\boldsymbol{\beta}_i\in\mathbb{R}^m$. For additive nodes with activation function $g$ we have
$$G(\mathbf{a}_i, b_i, \mathbf{x}) = g(\mathbf{a}_i^T\mathbf{x} + b_i)\tag{27}$$
where $\mathbf{a}_i\in\mathbb{R}^d$ and $b_i\in\mathbb{R}$. For RBF nodes with activation function $g$,
$$G(\mathbf{a}_i, b_i, \mathbf{x}) = g\left(b_i\|\mathbf{x} - \mathbf{a}_i\|\right)\tag{28}$$
where $\mathbf{a}_i\in\mathbb{R}^d$ and $b_i\in\mathbb{R}^+$. Consider a training set comprising $N$ distinct samples $[\mathbf{x}_i, \mathbf{t}_i]\in\mathbb{R}^d\times\mathbb{R}^m$. The mathematical model describing SLFNs can be cast as
$$\sum_{i=1}^{L}\boldsymbol{\beta}_i G(\mathbf{a}_i, b_i, \mathbf{x}_j) = \mathbf{o}_j,\qquad j = 1,\cdots,N\tag{29}$$
Stating that an SLFN can approximate the $N$ samples with zero error is equivalent to stating that there exist pairs $(\mathbf{a}_i, b_i)$ and $\boldsymbol{\beta}_i$ such that
$$\sum_{i=1}^{L}\boldsymbol{\beta}_i G(\mathbf{a}_i, b_i, \mathbf{x}_j) = \mathbf{t}_j,\qquad j = 1,\cdots,N\tag{30}$$
or, compactly in matrix form,
$$H\boldsymbol{\beta} = T\tag{31}$$
where the hidden layer output matrix $H$ is
$$H = \begin{bmatrix} G(\mathbf{a}_1, b_1, \mathbf{x}_1) & \cdots & G(\mathbf{a}_L, b_L, \mathbf{x}_1)\\ \vdots & \ddots & \vdots\\ G(\mathbf{a}_1, b_1, \mathbf{x}_N) & \cdots & G(\mathbf{a}_L, b_L, \mathbf{x}_N)\end{bmatrix}_{N\times L}\tag{32}$$
and
$$\boldsymbol{\beta} = \begin{bmatrix}\boldsymbol{\beta}_1^T\\ \vdots\\ \boldsymbol{\beta}_L^T\end{bmatrix}_{L\times m}\qquad\text{and}\qquad T = \begin{bmatrix}\mathbf{t}_1^T\\ \vdots\\ \mathbf{t}_N^T\end{bmatrix}_{N\times m}\tag{33}$$
Huang et al.23 theoretically showed that SLFNs with randomly generated additive or RBF nodes can universally approximate any desired (target) function over a compact subset of $\mathbb{R}^d$. This result generalizes to any piecewise continuous activation function in the hidden nodes.24
The basic ELM can be constructed as follows. After selecting a sufficiently high number of hidden nodes (the ELM architecture), the parameters $(\mathbf{a}_i, b_i)$ are randomly generated and remain fixed. Training occurs by simply determining $\boldsymbol{\beta}$ for the system $H\boldsymbol{\beta} = T$, i.e., finding $\hat{\boldsymbol{\beta}}$ such that
$$\|H\hat{\boldsymbol{\beta}} - T\|^2 = \min_{\boldsymbol{\beta}}\|H\boldsymbol{\beta} - T\|^2\tag{34}$$
This work uses a regularized least-squares solution to this minimization problem, where the regularized problem is
$$\min_{\boldsymbol{\beta}}\|H\boldsymbol{\beta} - T\|^2 + \alpha\|\boldsymbol{\beta}\|^2\tag{35}$$
with $\alpha$ a regularization parameter. The ELM approach is used to generate features for the IOC method used in this work: the parameters $(\mathbf{a}_i, b_i)$ are randomly generated to determine $H$, and the feature functions are selected as $\boldsymbol{\phi}(\mathbf{x},\mathbf{u}) = H(\mathbf{x},\mathbf{u})$. This allows for very expressive features that can handle high-dimensional inputs while maintaining the structure of the cost function as a linear combination of features, i.e., $\boldsymbol{\phi}^T\mathbf{c}$.
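A minimal sketch of the basic ELM fit of Eqs. (26)-(35) with additive tanh nodes, solving Eq. (35) through the ridge-regularized normal equations; the target function and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, T, L=50, alpha=1e-6):
    """Fit a basic ELM: random, fixed hidden-node parameters (a_i, b_i)
    and a regularized least-squares solve for the output weights beta."""
    d = X.shape[1]
    a = rng.normal(size=(d, L))            # random input weights (fixed)
    b = rng.uniform(-1.0, 1.0, size=L)     # random biases (fixed)
    H = np.tanh(X @ a + b)                 # hidden layer output matrix, N x L
    # Ridge solution of Eq. (35): beta = (H^T H + alpha I)^(-1) H^T T
    beta = np.linalg.solve(H.T @ H + alpha * np.eye(L), H.T @ T)
    return a, b, beta

def elm_predict(a, b, beta, X):
    return np.tanh(X @ a + b) @ beta

# Illustrative 1-D regression: fit a smooth target with random features.
X = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
T = np.sin(3.0 * X)
a, b, beta = elm_fit(X, T)
max_err = np.max(np.abs(elm_predict(a, b, beta, X) - T))
```

Only the linear output weights are trained; the random hidden layer is generated once and reused, which is what makes the fit a single least-squares solve.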
RELATIVE ORBITAL MOTION EQUATIONS
This work makes use of relative motion dynamics for controlling a GEO satellite relative to a desired orbit, and this section provides an overview of the relative motion equations. The spacecraft about which all other spacecraft are orbiting is referred to as the chief, and the remaining spacecraft are referred to as the deputies. The relative orbit position vector $\mathbf{r}$ is expressed in components by $\boldsymbol{\rho} = [x\ y\ z]^T$. A complete derivation of the relative equations of motion for eccentric orbits can be found in Ref. 25. If the relative orbit coordinates are small compared to the chief orbit radius, the equations of motion are given by25
$$\ddot{x} - x\dot{\theta}^2\left(1 + 2\frac{r_c}{p}\right) - 2\dot{\theta}\left(\dot{y} - y\frac{\dot{r}_c}{r_c}\right) = u_x\tag{36a}$$
$$\ddot{y} + 2\dot{\theta}\left(\dot{x} - x\frac{\dot{r}_c}{r_c}\right) - y\dot{\theta}^2\left(1 - \frac{r_c}{p}\right) = u_y\tag{36b}$$
$$\ddot{z} + z\dot{\theta}^2\frac{r_c}{p} = u_z\tag{36c}$$
where $p$ is the semilatus rectum of the chief orbit, $r_c$ is the chief orbit radius, and $\dot{\theta}$ is the true anomaly rate of the chief. Also, $u_x$, $u_y$, and $u_z$ are the control accelerations. The true anomaly acceleration and chief orbit-radius acceleration are given by
$$\ddot{\theta} = -2\frac{\dot{r}_c}{r_c}\dot{\theta}\tag{37a}$$
$$\ddot{r}_c = r_c\dot{\theta}^2\left(1 - \frac{r_c}{p}\right)\tag{37b}$$
If the chief satellite orbit is assumed to be circular, so that $\dot{r}_c = 0$ and $p = r_c$, the relative equations of motion reduce to the simple form known as the CW equations (with control added here):
$$\ddot{x} - 3n^2 x - 2n\dot{y} = u_x\tag{38a}$$
$$\ddot{y} + 2n\dot{x} = u_y\tag{38b}$$
$$\ddot{z} + n^2 z = u_z\tag{38c}$$
where $n = \dot{\theta}$ is the mean motion. The state and control vectors are given by $\mathbf{x} = [x\ y\ z\ \dot{x}\ \dot{y}\ \dot{z}]^T$ and $\mathbf{u} = [u_x\ u_y\ u_z]^T$, respectively.
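The CW equations can be written in the state-space form $\dot{\mathbf{x}} = A\mathbf{x} + B\mathbf{u}$, which is convenient both for generating controlled trajectories and for forming the IOC gradients. A minimal sketch (the LQR weights mirror those used later in the simulation section, and the nondimensional mean motion $n = 1$ is an assumption of this sketch):

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def cw_matrices(n):
    """State-space form of Eq. (38) with state x = [x, y, z, vx, vy, vz]."""
    A = np.zeros((6, 6))
    A[0:3, 3:6] = np.eye(3)
    A[3, 0], A[3, 4] = 3.0 * n**2, 2.0 * n   # x-accel:  3n^2 x + 2n vy + ux
    A[4, 3] = -2.0 * n                        # y-accel: -2n vx + uy
    A[5, 2] = -n**2                           # z-accel: -n^2 z + uz
    B = np.vstack([np.zeros((3, 3)), np.eye(3)])
    return A, B

n = 1.0                              # mean motion in nondimensional units
A, B = cw_matrices(n)
Q, R = np.eye(6), 5.0 * np.eye(3)    # station-keeping weights
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)      # LQR gain, u = -K x
eigs = np.linalg.eigvals(A - B @ K)  # closed-loop poles
```

Simulating this closed loop from a perturbed initial state produces observed state and control trajectories of exactly the form the IOC method consumes.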
[Figure 2. Observed Satellite Trajectories for Relative Motion Example: (a) Satellite Control Example 1, relative orbit in radial, along-track, and out-of-plane coordinates (L), showing the desired, controlled, and uncontrolled trajectories; (b) Satellite Control Example 1, position (L), velocity (L/T), and control (L/T^2) time histories; (c) P Values Example 1.]
SIMULATION RESULTS
This section discusses the initial proof-of-concept results for the proposed satellite behavior modeling approach. To highlight the effectiveness of the proposed approach, two examples are considered: the first uses a GEO satellite that is station keeping within a GEO box, and the second looks at a continuous thrust orbit raising maneuver. Multiple observed trajectories are used to recover the control objective function, and the learned control objective function can then be used to estimate the behavior of SOs. The first simulated case is outlined here, while the second is outlined in a later section. The simulation scenario considered for learning the objective function uses a GEO stationary SO performing station keeping maneuvers to maintain a near-GEO orbit. The equations of motion for the SO are described using the linearized HCW model of Eq. (38). The true control objective function used for the station keeping example is assumed to be
$$\int_{t_0}^{t_f}\mathbf{x}^T Q_H\mathbf{x} + \mathbf{u}^T R_H\mathbf{u}\,dt\tag{39}$$
where $Q_H = I_{6\times 6}$ and $R_H = 5 I_{3\times 3}$. Given this true control objective function, LQR is used to generate simulated trajectories, which are then provided as observations to the IOC approach. For these initial proof-of-concept results, perfect measurements of the SO's state and control are assumed. The simulated trajectories are shown in Figure 2.
Relative Motion Example: Simple Features
The first case considers simple polynomial features given by
$$\boldsymbol{\phi}(\mathbf{x},\mathbf{u}) = \left[u_x^2\ u_y^2\ u_z^2\ x^2\ y^2\ z^2\ \dot{x}^2\ \dot{y}^2\ \dot{z}^2\right]\tag{40}$$
Therefore, using these features the IOC cost function is given by
$$\int_{t_0}^{t_f} c_1 u_x^2 + c_2 u_y^2 + c_3 u_z^2 + c_4 x^2 + c_5 y^2 + c_6 z^2 + c_7\dot{x}^2 + c_8\dot{y}^2 + c_9\dot{z}^2\,dt\tag{41}$$
Using these features and the trajectories shown in Figure 2, two cases were studied: the first used only one trajectory of data and the second used five. The true weight vector for this case is
$$\mathbf{c}_{\mathrm{true}} = \frac{1}{9}\left[5, 5, 5, 1, 1, 1, 1, 1, 1\right]^T\tag{42}$$
The estimated weights using one observed trajectory are
$$\mathbf{c}_{\mathrm{est1}} = \frac{1}{9}\left[5.4295, 4.9057, 4.5063, 1.3077, 0.9886, 0.8792, 1.2456, 1.1407, 0.9135\right]^T\tag{43}$$
The norm error for this case was $\|\mathbf{c}_{\mathrm{true}} - \mathbf{c}_{\mathrm{est1}}\|_2 = 0.0885$. The estimated weights using five trajectories are
$$\mathbf{c}_{\mathrm{est5}} = \frac{1}{9}\left[4.975, 5.012, 4.981, 1.005, 1.016, 0.992, 0.987, 1.071, 0.993\right]^T\tag{44}$$
The norm error for this case was $\|\mathbf{c}_{\mathrm{true}} - \mathbf{c}_{\mathrm{est5}}\|_2 = 0.0298$. Using one trajectory of data resulted in larger errors in the estimated weight vector, because a single trajectory does not excite all aspects of the cost function. With five trajectories, however, the weight vector was recovered with a relatively high degree of accuracy. Therefore, it can be seen that using more trajectories provides an improved estimate of the weights.
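The reported one-trajectory norm error can be reproduced directly from the weight vectors in Eqs. (42)-(44); a quick NumPy check:

```python
import numpy as np

c_true = np.array([5, 5, 5, 1, 1, 1, 1, 1, 1]) / 9.0
c_est1 = np.array([5.4295, 4.9057, 4.5063, 1.3077, 0.9886,
                   0.8792, 1.2456, 1.1407, 0.9135]) / 9.0
c_est5 = np.array([4.975, 5.012, 4.981, 1.005, 1.016,
                   0.992, 0.987, 1.071, 0.993]) / 9.0

err1 = np.linalg.norm(c_true - c_est1)   # matches the reported 0.0885
err5 = np.linalg.norm(c_true - c_est5)   # smaller: five trajectories help
```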
[Figure 3. ELM features for Relative Motion Example: (a) Costates; (b) Costate Velocities; (c) Weight Parameters; (d) Estimated vs. Control Objective Function.]
Relative Motion Example: ELM Features
In addition to the polynomial features, ELM-based feature functions were used. Figure 3 shows the results for the case that uses ELM features. From this figure it can be seen that the ELM features are flexible enough to capture the cost function, which is based on polynomial features. Additionally, this case had 400 ELM features, which was easily handled by the IOC approach. The estimated ELM weight parameters are shown in Figure 3(c), while the costates and costate velocities are shown in Figures 3(a) and 3(b), respectively. These results show promise for extending this work to high-dimensional systems.
Nonlinear Dynamics: Optimal Orbit Raising
This example considers determining the control cost function for an SO which is using
continuous thrust to raise its orbit. The trajectory used for this case is shown in Figure 4.
[Figure 4. Continuous Thrust Optimal Orbit Raising Case: (a) Transfer Orbit in the x-y plane (L), showing the desired, controlled, and uncontrolled trajectories; (b) Orbit Raising Control, position (L), velocity (L/T), and control (L/T^2) time histories; (c) P Values Orbit Raising Example.]
The objective function used for this case is
$$\begin{aligned}
\underset{\mathbf{x},\,\mathbf{u}}{\text{minimize}}\quad & \int_{t_0}^{t_f} c_1 u_x^2 + c_2 u_y^2 + c_3 x^2 + c_4 y^2 + c_5 v_x^2 + c_6 v_y^2\,dt\\
\text{subject to}\quad & \dot{\mathbf{x}} = \mathbf{f}(\mathbf{x},\mathbf{u})\\
& \mathbf{x}(t_0) = \mathbf{x}_{\mathrm{initial}},\quad \mathbf{x}(t_f) = \mathbf{x}_{\mathrm{final}}
\end{aligned}\tag{45}$$
where
$$\mathbf{f}(\mathbf{x},\mathbf{u}) = \begin{bmatrix} v_x\\ v_y\\ -\dfrac{\mu x}{r^3} + u_x\\[4pt] -\dfrac{\mu y}{r^3} + u_y\end{bmatrix}\tag{46}$$
The dynamics model considered is the two-dimensional two-body problem, where the state of the SO is given by $\mathbf{x} = [x, y, v_x, v_y]^T$, and it is assumed that $\mu = 1$ and $r = \sqrt{x^2 + y^2}$. The initial and final states are the circular-orbit conditions $\mathbf{x}_{\mathrm{initial}} = [r_0, 0, 0, \sqrt{\mu/r_0}]^T$ and $\mathbf{x}_{\mathrm{final}} = [r_f, 0, 0, \sqrt{\mu/r_f}]^T$. This objective function is used to solve the forward problem.
The forward optimal control problem for the orbit raising case is solved using the Hermite-Simpson transcription method on a grid of 100 points, and the solution is shown in Figures 4(a) and 4(b). In this case the forward problem is nonlinear since nonlinear two-body dynamics are used; however, the inverse problem is still convex and solvable using the LQR-based approach. The gradients for the IOC approach are given by
$$\nabla_{\mathbf{x}}\boldsymbol{\phi}\big|_{(\mathbf{x},\mathbf{u})} = \begin{bmatrix} 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 2x & 0 & 0 & 0\\ 0 & 2y & 0 & 0\\ 0 & 0 & 2v_x & 0\\ 0 & 0 & 0 & 2v_y\end{bmatrix}\tag{47a}$$
$$\nabla_{\mathbf{u}}\boldsymbol{\phi}\big|_{(\mathbf{x},\mathbf{u})} = \begin{bmatrix} 2u_x & 0\\ 0 & 2u_y\\ 0 & 0\\ 0 & 0\\ 0 & 0\\ 0 & 0\end{bmatrix}\tag{47b}$$
$$\nabla_{\mathbf{x}}\mathbf{f}\big|_{(\mathbf{x},\mathbf{u})} = \begin{bmatrix} 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\\ \dfrac{3x^2}{r^5} - \dfrac{1}{r^3} & \dfrac{3xy}{r^5} & 0 & 0\\[6pt] \dfrac{3xy}{r^5} & \dfrac{3y^2}{r^5} - \dfrac{1}{r^3} & 0 & 0\end{bmatrix}\tag{47c}$$
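Analytic gradients such as Eq. (47c) are easy to get wrong, so a central finite-difference check of the two-body Jacobian is a useful sanity test (with $\mu = 1$ as in the example; the sample state is arbitrary):

```python
import numpy as np

def f(x):
    """Planar two-body dynamics of Eq. (46) with zero control, mu = 1."""
    px, py, vx, vy = x
    r = np.hypot(px, py)
    return np.array([vx, vy, -px / r**3, -py / r**3])

def dfdx(x):
    """Analytic Jacobian of Eq. (47c), mu = 1."""
    px, py, _, _ = x
    r = np.hypot(px, py)
    J = np.zeros((4, 4))
    J[0, 2] = J[1, 3] = 1.0
    J[2, 0] = 3.0 * px**2 / r**5 - 1.0 / r**3
    J[2, 1] = J[3, 0] = 3.0 * px * py / r**5
    J[3, 1] = 3.0 * py**2 / r**5 - 1.0 / r**3
    return J

# Central finite differences, column by column.
x0 = np.array([1.2, -0.4, 0.1, 0.9])
eps = 1e-6
J_fd = np.column_stack([(f(x0 + eps * e) - f(x0 - eps * e)) / (2.0 * eps)
                        for e in np.eye(4)])
```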
Figure 4(c) shows the solution of the Riccati differential equation for the initial value $P(t_0)$; once this is obtained, the coefficients can be solved for using the quadratic programming problem shown in Eq. (25). The true weight vector for this case is
$$\mathbf{c}_{\mathrm{true}} = \left[1, 1, 10^{-8}, 10^{-8}, 10^{-8}, 10^{-8}\right]^T\tag{48}$$
The estimated weights using five trajectories are
$$\mathbf{c}_{\mathrm{est5}} = \left[1, 0.930, 0.875\times 10^{-8}, 0.720\times 10^{-8}, 1.209\times 10^{-8}, 1.058\times 10^{-8}\right]^T\tag{49}$$
The estimated weight vector using five observed trajectories recovered the control cost function with good accuracy; the norm error for this case was $\|\mathbf{c}_{\mathrm{true}} - \mathbf{c}_{\mathrm{est5}}\|_2 = 7\times 10^{-2}$.
CONCLUSION
This paper considers the problem of determining the behavior of an SO from observational data using Inverse Optimal Control (IOC). Given observed trajectories of states and controls, the IOC approach can be used to estimate the control objective function that a given SO is using. This work considers three simulated cases of maneuvering SOs. The first two cases use an SO in GEO which is maneuvering to maintain a given GEO stationary box; the third case studies the behavior estimation problem for optimal orbit raising. The control objective function was specified, and simulation data were generated for the hypothetical SOs. The first two cases differ in the features used to represent the control objective function, which was estimated as a linear combination of features: the first case uses simple second-order polynomial features, while the second case demonstrates complex features based on neural networks. The nonlinear basis functions used for this work included polynomial terms and extreme learning machines. Good performance was shown for the first two cases, and it was observed that the accuracy of the IOC solution improves with the number of observed trajectories used. For the third and final case, the estimated control objective function approximated the true control objective function well, and good performance was shown for the proposed approach.
REFERENCES
[1] Kalman, R. E., “When is a linear control system optimal?” Journal of Basic Engineering, Vol. 86,No. 1, 1964, pp. 51–60.
[2] Masak, M., “An inverse problem on decoupling optimal control systems,” IEEE Transactions on Auto-matic Control, Vol. 13, No. 1, 1968, pp. 109–110.
[3] Moylan, P. and Anderson, B., “Nonlinear regulator theory and an inverse optimal control problem,”IEEE Transactions on Automatic Control, Vol. 18, No. 5, October 1973, pp. 460–465.
[4] Ng, A. Y., Russell, S. J., et al., “Algorithms for inverse reinforcement learning.” Icml, 2000, pp. 663–670.
[5] Abbeel, P. and Ng, A. Y., “Apprenticeship learning via inverse reinforcement learning,” Proceedings ofthe twenty-first international conference on Machine learning, ACM, 2004, p. 1.
[6] Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K., “Maximum Entropy Inverse ReinforcementLearning.” AAAI, Vol. 8, Chicago, IL, USA, 2008, pp. 1433–1438.
[7] Finn, C., Levine, S., and Abbeel, P., “Guided cost learning: Deep inverse optimal control via policyoptimization,” International Conference on Machine Learning, 2016, pp. 49–58.
[8] Wulfmeier, M., Ondruska, P., and Posner, I., “Maximum entropy deep inverse reinforcement learning,”arXiv preprint arXiv:1507.04888, 2015.
[9] Kuderer, M., Gulati, S., and Burgard, W., “Learning driving styles for autonomous vehicles fromdemonstration,” Robotics and Automation (ICRA), 2015 IEEE International Conference on, IEEE,2015, pp. 2641–2646.
[10] Mainprice, J., Hayne, R., and Berenson, D., “Goal Set Inverse Optimal Control and Iterative Replanningfor Predicting Human Reaching Motions in Shared Workspaces.” IEEE Trans. Robotics, Vol. 32, No. 4,2016, pp. 897–908.
[11] Krstic, M. and Tsiotras, P., “Inverse optimal stabilization of a rigid spacecraft,” IEEE Transactions onAutomatic Control, Vol. 44, No. 5, 1999, pp. 1042–1049.
[12] Argall, B. D., Chernova, S., Veloso, M., and Browning, B., “A survey of robot learning from demon-stration,” Robotics and autonomous systems, Vol. 57, No. 5, 2009, pp. 469–483.
[13] Ho, J., Gupta, J., and Ermon, S., “Model-free imitation learning with policy optimization,” InternationalConference on Machine Learning, 2016, pp. 2760–2769.
[14] Ramachandran, D. and Amir, E., “Bayesian inverse reinforcement learning,” Urbana, Vol. 51, No.61801, 2007, pp. 1–4.
[15] Dvijotham, K. and Todorov, E., “Inverse optimal control with linearly-solvable MDPs,” Proceedings ofthe 27th International Conference on Machine Learning (ICML-10), 2010, pp. 335–342.
[16] Keshavarz, A., Wang, Y., and Boyd, S., “Imputing a convex objective function,” Intelligent Control(ISIC), 2011 IEEE International Symposium on, IEEE, 2011, pp. 613–619.
[17] Puydupin-Jamin, A.-S., Johnson, M., and Bretl, T., “A convex approach to inverse optimal control andits application to modeling human locomotion,” Robotics and Automation (ICRA), 2012 IEEE Interna-tional Conference on, IEEE, 2012, pp. 531–536.
[18] Aghasadeghi, N., Long, A., and Bretl, T., “Inverse optimal control for a hybrid dynamical system withimpacts,” Robotics and Automation (ICRA), 2012 IEEE International Conference on, IEEE, 2012, pp.4962–4967.
16
[19] Terekhov, A. V., Pesin, Y. B., Niu, X., Latash, M. L., and Zatsiorsky, V. M., “An analytical approach tothe problem of inverse optimization with additive objective functions: an application to human prehen-sion,” Journal of mathematical biology, Vol. 61, No. 3, 2010, pp. 423–453.
[20] Johnson, M., Aghasadeghi, N., and Bretl, T., “Inverse optimal control for deterministic continuous-timenonlinear systems,” Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, IEEE, 2013,pp. 2906–2913.
[21] Mombaur, K., Truong, A., and Laumond, J.-P., “From human to humanoid locomotion—an inverseoptimal control approach,” Autonomous robots, Vol. 28, No. 3, 2010, pp. 369–383.
[22] Ratliff, N. D., Bagnell, J. A., and Zinkevich, M. A., “Maximum margin planning,” Proceedings of the23rd international conference on Machine learning, ACM, 2006, pp. 729–736.
[23] Huang, G.-B., Zhu, Q.-Y., and Siew, C.-K., “Extreme learning machine: theory and applications,”Neurocomputing, Vol. 70, No. 1, 2006, pp. 489–501.
[24] Huang, G.-B., Chen, L., and Siew, C.-K., “Universal approximation using incremental constructivefeedforward networks with random hidden nodes,” Neural Networks, IEEE Transactions on, Vol. 17,No. 4, 2006, pp. 879–892.
[25] Junkins, J. L., Hughes, D. C., Wazni, K. P., Pariyapong, V., and Kehtarnavaz, N., “Vision-Based Navi-gation for Rendezvous, Docking and Proximity Operations,” AAS Paper 99-021, Vol. 52, Feb. 1999.