AAS 18-377
PHYSICALLY-CONSTRAINED INVERSE OPTIMAL CONTROL FOR SATELLITE MANEUVER DETECTION
Richard Linares∗ and Joseph B. Raquepas†
This paper develops an approach to determine the behavior of Space Objects (SOs) using Inverse Optimal Control (IOC) theory. The proposed method uses IOC theory to determine the control objective function that each SO is using, thereby determining the behavior of the SOs. The approach discussed in this work can be used to analyze maneuvering SOs from observational data. The contribution of this work is to formulate the behavior estimation problem using IOC theory. The IOC problem is solved using Pontryagin's minimum principle, which results in a convex optimization problem. Three simulation test cases are shown: the first two cases study the behavior estimation problem using the relative motion equations, while the third case studies the behavior estimation problem for a continuous thrust orbit raising maneuver. Additionally, a general approach for modeling the SO reward function using Extreme Learning Machines (ELMs) is discussed. The simulation examples demonstrate that the control objective function can be recovered using the proposed method.
INTRODUCTION
Optimal Control refers to the process of computing state and control trajectories that are optimal with respect to a cost function subject to constraints; in the context of this work we call this the Forward Optimal Control (FOC) problem. Inverse Optimal Control (IOC) refers to the process of determining the cost function implied by observed state and control trajectories. The study of IOC problems has been an active area for more than 50 years,1 and IOC has been successfully applied to many fields.2 Kalman1 first studied the problem of finding all performance indices with respect to which a given control law is optimal. This theory has seen many applications in the development of control laws2, 3 and, more recently, in reinforcement learning.4–8
IOC is also known as Inverse Reinforcement Learning (IRL),4 but this work makes a distinction between the two approaches: the term IRL is used here to refer to computing IOC solutions without knowledge of the underlying dynamics and in a stochastic setting. There are many emerging applications of both IRL and IOC, including transferring expert demonstrations to robotic systems,5 humanoid robot control, autonomous driving,9 human-robot cooperation,10 and stabilization of rigid spacecraft.11 IRL is often
∗Charles Stark Draper Assistant Professor, Department of Aeronautics and Astronautics, Massachusetts Institute of Technology, Cambridge, MA, 02139. E-mail: [email protected], Senior Member AIAA.
†Research Mathematician, Information Directorate, Air Force Research Laboratory, Rome, NY.
applied to the problem of learning control policies from demonstrations, a process referred to as imitation learning or apprenticeship learning. The demonstration data are usually considered to be state and control trajectories captured from expert demonstrations.
Apprenticeship learning can also be solved using regression and classification approaches that learn to map states to actions using expert demonstrations.12 However, these approaches tend to have issues with generalization of policies to unseen conditions.13 Recently, IRL has been shown to provide the basis for some of the most successful approaches to imitation learning.4–8 Instead of learning to model the observed behavior directly, IRL approaches approximate the cost function (or reward function) with respect to which the observed trajectories are (approximately) optimal. IRL approaches train a model over entire trajectories of observed behavior instead of individual state and action pairs, and are therefore more robust to unseen conditions.
One of the first IRL methods was developed for stationary Markov Decision Processes (MDPs), and the solution for the cost function was found using a linear programming method.4 This work was later extended to provide an efficient IRL approach by finding a cost function that is a linear combination of nonlinear features and is lower in cost, by a margin, for the expert trajectories.5 There have been many extensions of this work, including Bayesian approaches,14 maximum entropy approaches,6 and linearly solvable stochastic optimal control problems.15 However, most existing IRL techniques solve a FOC problem within the procedure in a nested iterative process: these methods compute predicted trajectories for the current cost function by solving the FOC problem in an inner loop, and the cost function (feature weights) is then updated in an outer loop. This structure was initially proposed by Ref. 5.
An alternative to IRL approaches are IOC approaches that make use of additional constraints on the expert demonstration. In general, both the IRL and IOC problems fall under the more general framework of imputing the objective function of an optimization problem from observed solutions. In this context, Ref. 16 developed an approach for imputing the objective function given known constraints and observed trajectories. The general concept for finding the cost function is to set up a set of optimality equations that should hold for the observed trajectories and then minimize the residual of these equations.16 Reference 16 used the Karush-Kuhn-Tucker (KKT) optimality conditions to determine the unknown cost function. Additionally, Ref. 16 showed that this approach could be applied to IOC problems where the known constraints are the system dynamics and the unknown objective function is the control cost function. These ideas were then extended to solving deterministic discrete-time IOC problems,17 IOC problems for hybrid systems,18 additive cost functions with linear constraints for aircraft intent determination,19 and the analysis of human locomotion.17 Finally, for continuous-time systems these concepts can be extended using Pontryagin's minimum principle, where the Euler-Lagrange equations for the continuous case are employed.20
Unfortunately, issues do arise when using IOC or IRL. Using IOC, a set of reward functions can be found for which a given policy is optimal. Unfortunately, this set may contain degenerate reward functions (e.g., the identically zero reward function). To mitigate this, additional constraints are imposed on the cost function.16 Another problem that arises is large or infinite state spaces, for which a tabular form of the reward function is unattainable. By assuming the reward function is a linear combination of fixed basis functions, the IOC problem stays in the class of linear programs that can be solved efficiently.21 Nevertheless, these problems should be considered while formulating the IOC algorithm.
The estimated reward function provided by IOC can be used to determine the type of behavior mode the unit is following and to classify the mode based on libraries of behavior modes. This concept was demonstrated for aircraft intent estimation using IOC solutions.19 This work will also investigate using feature weight vectors determined by the IOC approach for maneuver prediction and classification for satellites. This work follows the standard assumption made for both IOC and IRL approaches that the cost function is a linear combination of features.12 These weight vectors can be added to the state of SOs as a way to represent the policy that the SO is currently following, and to allow for the change of this policy over time as the behavior changes. In this setting, we are given observations of expert demonstrations for a given task, and the goal is to estimate the reward function that the expert used to derive the demonstration trajectories. It is common to assume that the expert's actions are optimal with respect to the reward function the expert is using, and this work makes this assumption. This work will investigate Pontryagin's minimum principle20 for solving for the expert's reward function.
This work utilizes IOC to develop a physically-constrained behavior estimation approach for Space Objects (SOs), using the dynamical equations of motion for SOs to determine constraints that can be imposed on the IOC solution. This work discusses the use of IOC to learn the behavior of SOs from observed orbital motion. The behavior of SOs is estimated using IOC to determine the reward function that each SO is using for determining its control. In general, an SO is controlled to achieve a particular goal that is determined by its mission, and therefore only a data-driven learning approach can reveal the true goal. It is also important to determine what type of behavior SOs are using and whether this behavior changes. IRL and IOC approaches use optimal control principles to learn what reward function or control objective function is being used by an agent given observations.5 The simplest IRL approach solves for the reward function by modeling it as a weighted sum of feature vectors.5 The weights determined from the IRL calculation are the representation of the reward function the SO is using.
The organization of this paper is as follows. First, the concept of inverse reinforcement learning is introduced. This is followed by a discussion of the inverse optimal control approach and an outline of ELMs and the relative motion dynamics. Finally, simulation results are provided for the proposed method.
THEORY: INVERSE REINFORCEMENT LEARNING
This section summarizes existing IOC and IRL methods. Max-margin IRL was developed by Abbeel and Ng5 and was applied directly to the driving application mentioned earlier. This method learns a cost function which is minimized to obtain performance (a feature vector) similar to the expert's. First, the method is initialized with a random cost function parameter c(0), and the following optimal control problem is solved:
$$\begin{aligned}
\underset{\mathbf{x}(t),\,\mathbf{u}(t)}{\text{minimize}}\quad & \int_{t_0}^{t_f} \mathbf{c}^T\boldsymbol{\phi}(t,\mathbf{x}(t),\mathbf{u}(t))\,dt\\
\text{subject to}\quad & \dot{\mathbf{x}}(t) = \mathbf{f}(t,\mathbf{x}(t),\mathbf{u}(t))\\
& \mathbf{x}(t_0) = \mathbf{x}_{\mathrm{start}},\quad \mathbf{x}(t_f) = \mathbf{x}_{\mathrm{final}}
\end{aligned}\tag{1}$$
where $\mathbf{x}(t)\in\mathbb{R}^{n_s}$ is the state, $\mathbf{u}(t)\in\mathbb{R}^{n_u}$ is the control input,
$$\boldsymbol{\phi}(t,\mathbf{x}(t),\mathbf{u}(t)) = \left[\phi_1(t,\mathbf{x}(t),\mathbf{u}(t)),\,\cdots,\,\phi_{n_c}(t,\mathbf{x}(t),\mathbf{u}(t))\right]^T$$
are the basis functions, and $\mathbf{c}\in\mathbb{R}^{n_c}$ is the cost function parameter to be learned. By solving this optimal control problem, the initial trajectory $(\mathbf{x}^{(0)}(t),\mathbf{u}^{(0)}(t))$ is found. The initial feature vector $\boldsymbol{\mu}^{(0)}$ can be found using
$$\boldsymbol{\mu} = \int_{t_0}^{t_f}\boldsymbol{\phi}(t,\mathbf{x}(t),\mathbf{u}(t))\,dt\tag{2}$$
Then the quadratic program (QP) is solved for the $i$th coefficient $\mathbf{c}^i$:
$$\begin{aligned}
\underset{\mathbf{c}^i,\,b^i}{\text{minimize}}\quad & \|\mathbf{c}^i\|_2\\
\text{subject to}\quad & (\mathbf{c}^i)^T\boldsymbol{\mu}^* \le (\mathbf{c}^i)^T\boldsymbol{\mu}^{(j)} - b^i,\quad j = 0,\cdots,i-1\\
& b^i > 0
\end{aligned}\tag{3}$$
where $b^i$ is the margin on the $i$th iteration, $\boldsymbol{\mu}^*$ is the expert (optimal) trajectory feature vector, and $\boldsymbol{\mu}^{(j)}$ is the $j$th trajectory feature vector. If $b^i < \epsilon$, terminate; otherwise iterate with $i = i + 1$. After the procedure terminates, there will be a policy and a corresponding feature vector that closely follow the expert's policy.
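A minimal numerical sketch of one iteration of this QP, assuming the trajectory feature vectors have already been computed and fixing the margin at $b = 1$ (an SVM-style normalization, an assumption of this sketch, which excludes the trivial solution $\mathbf{c} = 0$); the helper name and the toy feature vectors are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def max_margin_step(mu_expert, mu_candidates):
    """One max-margin QP step in the spirit of Eq. (3), with the margin
    normalized to 1: minimize ||c||^2 subject to
    c^T mu_expert <= c^T mu_j - 1 for every candidate feature vector."""
    n = mu_expert.size
    # One inequality constraint per previously generated trajectory.
    cons = [{"type": "ineq",
             "fun": lambda c, d=(mu_j - mu_expert): c @ d - 1.0}
            for mu_j in mu_candidates]
    res = minimize(lambda c: c @ c, x0=np.ones(n),
                   constraints=cons, method="SLSQP")
    return res.x

# Toy feature vectors: the expert accrues less of feature 0.
mu_star = np.array([1.0, 2.0])
mu_js = [np.array([3.0, 2.0]), np.array([2.5, 2.0])]
c = max_margin_step(mu_star, mu_js)
# The expert is now cheaper than every candidate by at least the unit margin.
```

The minimum-norm solution puts weight only on the feature that separates the expert from the candidates.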
Maximum margin planning, developed by Reference 22, learns a cost function for which the expert policy has lower expected cost than every other policy. Starting with the QP
$$\begin{aligned}
\underset{\mathbf{c},\,b}{\text{minimize}}\quad & \|\mathbf{c}\|^2\\
\text{subject to}\quad & \mathbf{c}^T\boldsymbol{\mu}^* \le \mathbf{c}^T\boldsymbol{\mu} - b\quad\text{for all }(\mathbf{x}(t),\mathbf{u}(t))\in S\\
& b > 0
\end{aligned}\tag{4}$$
where $S$ is a given set of trajectories, the constraints of the QP can be satisfied using
$$\begin{aligned}
\mathbf{c}^T\boldsymbol{\mu}^* \le \underset{\mathbf{x},\,\mathbf{u}}{\text{minimize}}\quad & \left(\mathbf{c}^T\boldsymbol{\mu} - b\right)\\
\text{subject to}\quad & \dot{\mathbf{x}}(t) = \mathbf{f}(t,\mathbf{x}(t),\mathbf{u}(t))
\end{aligned}\tag{5}$$
for all possible trajectories. Instead of solving for $b^i$ as in Reference 5, $b$ is set as $b = L$, where $L$ is the loss function. The loss function is defined by the closeness of a trajectory $(\mathbf{x}(t),\mathbf{u}(t))$ to the optimal trajectory $(\mathbf{x}^*(t),\mathbf{u}^*(t))$: if the trajectory is optimal, the loss function is zero, and as $(\mathbf{x}(t),\mathbf{u}(t))$ increasingly deviates from the optimal trajectory, the loss function increases toward one. Lastly, slack variables $\zeta$ are included to allow constraint violations. Therefore, the problem to be solved becomes
$$\begin{aligned}
\underset{\mathbf{c},\,\zeta}{\text{minimize}}\quad & \zeta + \frac{\lambda}{2}\|\mathbf{c}\|^2\\
\text{subject to}\quad & \mathbf{c}^T\boldsymbol{\mu}^* \le \underset{\mathbf{x},\,\mathbf{u}}{\min}\left(\mathbf{c}^T\boldsymbol{\mu}(\mathbf{x},\mathbf{u}) - L\right) + \zeta\\
& \dot{\mathbf{x}} = \mathbf{f}(\mathbf{x},\mathbf{u})
\end{aligned}\tag{6}$$
where $\lambda > 0$ is a parameter that penalizes constraint violations and large weight vectors. With tight slack variables, the objective can be written as
$$J(\mathbf{c}) = \lambda\|\mathbf{c}\|^2 + \mathbf{c}^T\boldsymbol{\mu}^* - \underset{\mathbf{x}(t),\,\mathbf{u}(t)}{\min}\left(\mathbf{c}^T\boldsymbol{\mu} - L\right),\qquad \dot{\mathbf{x}}(t) = \mathbf{f}(t,\mathbf{x}(t),\mathbf{u}(t))\tag{7}$$
and solved as a convex program using sub-gradient descent.
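As an illustration, a sub-gradient of Eq. (7) at a given $\mathbf{c}$ is $2\lambda\mathbf{c} + \boldsymbol{\mu}^* - \boldsymbol{\mu}_{\hat{j}}$, where $\hat{j}$ attains the inner minimum. The sketch below restricts the inner minimization to a finite candidate trajectory set (a simplification: in the method above it ranges over all dynamically feasible trajectories); all numbers are illustrative:

```python
import numpy as np

def mmp_subgradient_descent(mu_expert, mu_set, losses,
                            lam=0.1, step=0.1, iters=500):
    """Sub-gradient descent on J(c) = lam*||c||^2 + c^T mu_expert
    - min_j (c^T mu_j - L_j), with the inner minimum taken over a
    finite candidate set."""
    c = np.zeros_like(mu_expert)
    for k in range(iters):
        # Inner minimization: candidate with lowest loss-augmented cost.
        jhat = int(np.argmin([c @ mu - L for mu, L in zip(mu_set, losses)]))
        g = 2.0 * lam * c + mu_expert - mu_set[jhat]   # a sub-gradient of J
        c = c - (step / np.sqrt(k + 1)) * g            # diminishing step size
    return c

mu_star = np.array([1.0, 2.0])
mu_set = [mu_star, np.array([3.0, 2.0])]  # the expert itself has zero loss
losses = [0.0, 1.0]
c = mmp_subgradient_descent(mu_star, mu_set, losses)
```

For this toy problem the minimizer of $J$ is $\mathbf{c} = [0.5, 0]$: below $c_0 = 0.5$ the loss-augmented alternative attains the inner minimum and the sub-gradient pushes $c_0$ up, while above it only the regularizer acts.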
The bi-level inverse optimal control method proposed by Mombaur21 uses a derivative-free optimization technique to find the cost function parameter while an optimal control method computes predicted trajectories. The sum squared error between predicted and observed trajectories,
$$\underset{\mathbf{c}}{\text{minimize}}\quad\int_{t_0}^{t_f}\left\|\left[\mathbf{x}_c;\mathbf{u}_c\right] - \left[\mathbf{x}^*;\mathbf{u}^*\right]\right\|^2 dt\tag{8}$$
is minimized over the cost function parameter $\mathbf{c}$ using derivative-free optimization techniques. For each candidate $\mathbf{c}$, new trajectories $(\mathbf{x}_c,\mathbf{u}_c)$ are obtained by solving the optimal control problem
$$\begin{aligned}
\underset{\mathbf{x}(t),\,\mathbf{u}(t)}{\text{minimize}}\quad & \int_{t_0}^{t_f}\mathbf{c}^T\boldsymbol{\phi}(t,\mathbf{x}(t),\mathbf{u}(t))\,dt\\
\text{subject to}\quad & \dot{\mathbf{x}}(t) = \mathbf{f}(t,\mathbf{x}(t),\mathbf{u}(t))\\
& \mathbf{x}(t_0) = \mathbf{x}_{\mathrm{start}},\quad \mathbf{x}(t_f) = \mathbf{x}_{\mathrm{final}}
\end{aligned}\tag{9}$$
and the search for a new cost function parameter $\mathbf{c}$ continues until convergence. Though these methods all provide converging solutions, they vary in computational efficiency and performance. The inverse optimal control method of Reference 20 provides increased computational efficiency and lower parameter, feature, and trajectory error than the other three approaches.
THEORY: INVERSE OPTIMAL CONTROL
To formulate the IRL problem using optimal control theory, optimal or near-optimal trajectory data $(\mathbf{x},\mathbf{u})$ are considered. The minimum principle provides the necessary conditions for the trajectory $(\mathbf{x},\mathbf{u})$ to be a local minimum and can be used to learn the objective function.20 For example, a common problem statement for the IRL problem assumes that the cost function is a linear combination of nonlinear features, where $\boldsymbol{\phi}(t,\mathbf{x},\mathbf{u})$ is the feature vector, and the problem is stated as
$$\begin{aligned}
\underset{\mathbf{x}(t),\,\mathbf{u}(t)}{\text{minimize}}\quad & \int_{t_0}^{t_f}\mathbf{c}^T\boldsymbol{\phi}(t,\mathbf{x}(t),\mathbf{u}(t))\,dt\\
\text{subject to}\quad & \dot{\mathbf{x}}(t) = \mathbf{f}(t,\mathbf{x}(t),\mathbf{u}(t))
\end{aligned}\tag{10}$$
The Hamiltonian for this problem is
$$H(\mathbf{x},\mathbf{u},\mathbf{p}) = \mathbf{c}^T\boldsymbol{\phi}(t,\mathbf{x},\mathbf{u}) + \mathbf{p}^T\mathbf{f}(t,\mathbf{x},\mathbf{u})\tag{11}$$
For an unknown parameter vector $\mathbf{c}$, if $(\mathbf{x},\mathbf{u}) = (\mathbf{x}^*,\mathbf{u}^*)$ is assumed to be near-optimal, there exists a costate trajectory $\mathbf{p}^*$ satisfying
$$0 = \dot{\mathbf{p}}^{*T} + \nabla_{\mathbf{x}}H(\mathbf{x}^*,\mathbf{u}^*,\mathbf{p}^*)\tag{12}$$
$$0 = \nabla_{\mathbf{u}}H(\mathbf{x}^*,\mathbf{u}^*,\mathbf{p}^*)\tag{13}$$
Substituting the Hamiltonian, the optimality conditions become
$$0 = \dot{\mathbf{p}}^{*T} + \mathbf{c}^T\nabla_{\mathbf{x}}\boldsymbol{\phi}\big|_{(\mathbf{x}^*,\mathbf{u}^*)} + \mathbf{p}^{*T}\nabla_{\mathbf{x}}\mathbf{f}\big|_{(\mathbf{x}^*,\mathbf{u}^*)}\tag{14}$$
$$0 = \mathbf{c}^T\nabla_{\mathbf{u}}\boldsymbol{\phi}\big|_{(\mathbf{x}^*,\mathbf{u}^*)} + \mathbf{p}^{*T}\nabla_{\mathbf{u}}\mathbf{f}\big|_{(\mathbf{x}^*,\mathbf{u}^*)}\tag{15}$$
If $(\mathbf{x},\mathbf{u}) = (\mathbf{x}^*,\mathbf{u}^*)$, the necessary conditions for optimality are satisfied exactly. If the trajectory $(\mathbf{x},\mathbf{u})$ is only approximately optimal, the necessary conditions are approximately satisfied. This can be described by a residual function which measures how much the trajectory $(\mathbf{x},\mathbf{u})$ fails to satisfy the necessary conditions. Thus, an optimal solution is one for which the residual function is zero, and for approximately optimal trajectories the goal becomes minimizing this residual function. Define
$$\mathbf{z} = \begin{bmatrix}\mathbf{c}^T & \mathbf{p}^T\end{bmatrix}^T,\qquad \mathbf{v} = \dot{\mathbf{p}}\tag{16}$$
Note that $\mathbf{z}\in\mathbb{R}^{n_z}$ where $n_z = n_s + n_c$. Then the residual function is
$$\mathbf{r}(\mathbf{z},\mathbf{v}) = \begin{bmatrix}\nabla_{\mathbf{x}}\boldsymbol{\phi}^T\big|_{(\mathbf{x},\mathbf{u})} & \nabla_{\mathbf{x}}\mathbf{f}^T\big|_{(\mathbf{x},\mathbf{u})}\\[2pt] \nabla_{\mathbf{u}}\boldsymbol{\phi}^T\big|_{(\mathbf{x},\mathbf{u})} & \nabla_{\mathbf{u}}\mathbf{f}^T\big|_{(\mathbf{x},\mathbf{u})}\end{bmatrix}\mathbf{z} + \begin{bmatrix}I_{n_s\times n_s}\\ 0_{n_u\times n_s}\end{bmatrix}\mathbf{v} = F(t)\mathbf{z}(t) + G(t)\mathbf{v}(t)\tag{17}$$
The goal is to determine the minimum of the residual over the unknown variables $\mathbf{z}$ and $\mathbf{v}$:
$$\begin{aligned}
\underset{\mathbf{z},\,\mathbf{v}}{\text{minimize}}\quad & \int_{t_0}^{t_f}\|\mathbf{r}(\mathbf{z},\mathbf{v})\|^2\,dt\\
\text{subject to}\quad & \dot{\mathbf{z}} = \begin{bmatrix}0_{n_c\times n_s}\\ I_{n_s\times n_s}\end{bmatrix}\mathbf{v}\\
& \mathbf{z}(t_0) = \mathbf{z}_0\ (\text{unknown})
\end{aligned}\tag{18}$$
where $\mathbf{z}(t_0)$ is unknown and
$$\|\mathbf{r}(\mathbf{z},\mathbf{v})\|^2 = \mathbf{z}^T F^T F\mathbf{z} + \mathbf{v}^T G^T G\mathbf{v} + 2\mathbf{z}^T F^T G\mathbf{v}\tag{19}$$
This cost function is quadratic, and the constrained minimization problem becomes an LQR problem that can be solved very efficiently using convex optimization. When the initial condition $\mathbf{z}(t_0)$ is known, the constrained minimization problem is the LQR problem
$$\begin{aligned}
\underset{\mathbf{z},\,\mathbf{v}}{\text{minimize}}\quad & \int_{t_0}^{t_f}\mathbf{z}^T Q\mathbf{z} + \mathbf{v}^T R\mathbf{v} + 2\mathbf{z}^T S\mathbf{v}\,dt\\
\text{subject to}\quad & \dot{\mathbf{z}} = A\mathbf{z} + B\mathbf{v}\\
& \mathbf{z}(t_0) = \mathbf{z}_0\ (\text{assumed known})
\end{aligned}\tag{20}$$
where
$$\begin{aligned}
A(t) &= 0_{n_z\times n_z}, & B(t) &= \begin{bmatrix}0_{n_c\times n_s}\\ I_{n_s\times n_s}\end{bmatrix}\\
Q(t) &= F^T(t)F(t), & R(t) &= G^T(t)G(t)\\
S(t) &= F^T(t)G(t)
\end{aligned}\tag{21}$$
Using the standard LQR equations, the control policy and value function can be determined as
$$\mathbf{v}(t) = K(t)\mathbf{z}(t),\qquad V(\mathbf{z}) = \mathbf{z}^T P(t)\mathbf{z}\tag{22}$$
where
$$K(t) = -\left(G^T(t)G(t)\right)^{-1}\left(G^T(t)F(t) + B^T(t)P(t)\right)\tag{23}$$
Here $P(t)$ is the solution of the Riccati equation, which follows from minimizing the value function. The matrix Riccati equation that defines $P(t)$ is
$$\frac{dP}{dt} = -PA - A^T P - Q + (PB + S)R^{-1}\left(B^T P + S^T\right)\tag{24}$$
where the terminal condition is $P(t_f) = 0_{n_z\times n_z}$. Then the initial state $\mathbf{z}_0$ can be found by solving the following optimization problem:
$$\begin{aligned}
\underset{\mathbf{z}_0}{\text{minimize}}\quad & \mathbf{z}_0^T P(t_0)\mathbf{z}_0\\
\text{subject to}\quad & \mathbf{c}\ge 0\\
& \mathbf{c}^T\mathbf{c} = 1
\end{aligned}\tag{25}$$
The constraints above are needed to avoid the trivial solution $\mathbf{c} = 0$. The IOC process thus involves determining the matrices in Eq. (21) from the observed trajectories, solving the matrix differential Riccati equation in Eq. (24) backward in time for $P(t_0)$, and finally solving the quadratic programming problem in Eq. (25) with $P(t_0)$ to determine $\mathbf{z}_0$. It has been shown in Reference 20 that this inverse optimal control method is more computationally efficient, with better approximation of the unknown parameters, than IRL methods.
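A compact numerical sketch of this pipeline on a hypothetical toy problem (constant $F$ and $G$, with $n_c = 2$ features, $n_s = 1$ state, $n_u = 1$ control; the matrix entries are illustrative, not from the paper): integrate Eq. (24) backward from $P(t_f) = 0$, then solve Eq. (25):

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

# Hypothetical toy dimensions: nc features, ns states, nu controls.
nc, ns, nu = 2, 1, 1
nz = nc + ns
F = np.array([[1.0, 0.5, 0.2],     # stacked residual coefficients of Eq. (17),
              [0.3, 1.0, 0.4]])    # taken constant in time for illustration
G = np.vstack([np.eye(ns), np.zeros((nu, ns))])
A = np.zeros((nz, nz))                             # z = [c; p], c constant,
B = np.vstack([np.zeros((nc, ns)), np.eye(ns)])    # so dz/dt = [0; v]
Q, R, S = F.T @ F, G.T @ G, F.T @ G

def riccati_rhs(t, p_flat):
    """Right-hand side of the matrix Riccati equation, Eq. (24)."""
    P = p_flat.reshape(nz, nz)
    dP = -P @ A - A.T @ P - Q + (P @ B + S) @ np.linalg.solve(R, B.T @ P + S.T)
    return dP.ravel()

# Backward integration from P(tf) = 0 down to t0.
t0, tf = 0.0, 1.0
sol = solve_ivp(riccati_rhs, (tf, t0), np.zeros(nz * nz), rtol=1e-8, atol=1e-10)
P0 = sol.y[:, -1].reshape(nz, nz)

# Eq. (25): minimize z0^T P(t0) z0 subject to c >= 0 and c^T c = 1.
cons = [{"type": "eq", "fun": lambda z: z[:nc] @ z[:nc] - 1.0},
        {"type": "ineq", "fun": lambda z: z[:nc]}]
res = minimize(lambda z: z @ P0 @ z, x0=np.array([1.0, 0.0, 0.0]),
               constraints=cons, method="SLSQP")
c_hat = res.x[:nc]
```

The unit-norm and nonnegativity constraints on $\mathbf{c}$ keep the quadratic program away from the trivial zero solution, as discussed above.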
Figure 1. Typical architecture of a Single Layer Forward Network (SLFN), which is the most fundamental ELM.
COST FUNCTION FEATURE SELECTION: EXTREME LEARNING MACHINES
This section discusses Extreme Learning Machines (ELMs), a general linear-in-features learning method. This method is effective for the SO behavior problem because it has been shown to work well for high-dimensional problems. Machine learning techniques have been successfully used to learn functional relationships while requiring only a limited amount of data. Most such techniques (e.g., NNs and Support Vector Machines (SVMs)) face challenges including slow learning speed, poor computational scalability, and the requirement of ad-hoc human intervention. Extreme Learning Machines have recently been established as an emergent technology that may overcome some of the above-mentioned challenges, providing better generalization, faster learning speed, and minimal human intervention.23 ELMs work with "generalized" Single Layer Forward Networks (SLFNs, Figure 1). SLFNs are designed to have a single hidden layer (which can use Radial Basis Function (RBF) or other activation functions) coupled to a linear output layer. The key point is that the hidden neurons need not be tuned, and their weights (training parameters) can be sampled from a random distribution. Theoretical studies24 show that feed-forward networks with minimum-norm output weights tend to achieve better generalization. ELMs tend to reach a) the minimum training error and b) the smallest norm of output weights, with consequently improved generalization. Importantly, since the hidden nodes can be randomly selected and fixed, the output weights can be determined via least-squares methods. Consider an SLFN with $L$ hidden nodes (Figure 1). The output
function can be represented as
$$f_L(\mathbf{x}) = \sum_{i=1}^{L}\boldsymbol{\beta}_i g_i(\mathbf{x}) = \sum_{i=1}^{L}\boldsymbol{\beta}_i G(\mathbf{a}_i, b_i, \mathbf{x})\tag{26}$$
where $\mathbf{x}\in\mathbb{R}^d$ and $\boldsymbol{\beta}_i\in\mathbb{R}^m$. For additive nodes with activation function $g$ we have
$$G(\mathbf{a}_i, b_i, \mathbf{x}) = g(\mathbf{a}_i^T\mathbf{x} + b_i)\tag{27}$$
where $\mathbf{a}_i\in\mathbb{R}^d$ and $b_i\in\mathbb{R}$. For RBF nodes with activation function $g$,
$$G(\mathbf{a}_i, b_i, \mathbf{x}) = g\left(b_i\|\mathbf{x} - \mathbf{a}_i\|\right)\tag{28}$$
where $\mathbf{a}_i\in\mathbb{R}^d$ and $b_i\in\mathbb{R}^+$. Consider a training set comprising $N$ distinct samples $[\mathbf{x}_i, \mathbf{t}_i]\in\mathbb{R}^d\times\mathbb{R}^m$. The mathematical model describing SLFNs can be cast as
$$\sum_{i=1}^{L}\boldsymbol{\beta}_i G(\mathbf{a}_i, b_i, \mathbf{x}_j) = \mathbf{o}_j,\qquad j = 1,\cdots,N\tag{29}$$
Stating that an SLFN can approximate the $N$ samples with zero error is equivalent to stating that there exist pairs $(\mathbf{a}_i, b_i)$ and $\boldsymbol{\beta}_i$ such that
$$\sum_{i=1}^{L}\boldsymbol{\beta}_i G(\mathbf{a}_i, b_i, \mathbf{x}_j) = \mathbf{t}_j,\qquad j = 1,\cdots,N\tag{30}$$
or, compactly in matrix form,
$$H\boldsymbol{\beta} = T\tag{31}$$
where the hidden layer output matrix $H$ is
$$H = \begin{bmatrix} G(\mathbf{a}_1, b_1, \mathbf{x}_1) & \cdots & G(\mathbf{a}_L, b_L, \mathbf{x}_1)\\ \vdots & \ddots & \vdots\\ G(\mathbf{a}_1, b_1, \mathbf{x}_N) & \cdots & G(\mathbf{a}_L, b_L, \mathbf{x}_N)\end{bmatrix}_{N\times L}\tag{32}$$
and
$$\boldsymbol{\beta} = \begin{bmatrix}\boldsymbol{\beta}_1^T\\ \vdots\\ \boldsymbol{\beta}_L^T\end{bmatrix}_{L\times m}\qquad\text{and}\qquad T = \begin{bmatrix}\mathbf{t}_1^T\\ \vdots\\ \mathbf{t}_N^T\end{bmatrix}_{N\times m}\tag{33}$$
Huang et al.23 theoretically showed that SLFNs with randomly generated additive or RBF nodes can universally approximate any desired (target) function over a compact subset of $\mathbb{R}^d$. This result generalizes to any piecewise continuous activation function in the hidden nodes.24
The basic ELM can be constructed as follows. After selecting a sufficiently high number of hidden nodes (the ELM architecture), the parameters $(\mathbf{a}_i, b_i)$ are randomly generated and remain fixed. Training occurs by simply determining $\boldsymbol{\beta}$ for the system $H\boldsymbol{\beta} = T$, i.e., finding $\hat{\boldsymbol{\beta}}$ such that
$$\|H\hat{\boldsymbol{\beta}} - T\|^2 = \min_{\boldsymbol{\beta}}\|H\boldsymbol{\beta} - T\|^2\tag{34}$$
This work uses a regularized least-squares solution to this minimization problem, where the regularized problem is
$$\min_{\boldsymbol{\beta}}\|H\boldsymbol{\beta} - T\|^2 + \alpha\|\boldsymbol{\beta}\|^2\tag{35}$$
with $\alpha$ a regularization parameter. The ELM approach is used to generate features for the IOC method used in this work: the parameters $(\mathbf{a}_i, b_i)$ are randomly generated to determine $H$, and the feature functions are selected as $\boldsymbol{\phi}(\mathbf{x},\mathbf{u}) = H(\mathbf{x},\mathbf{u})$. This allows for very expressive features that can handle high-dimensional inputs while maintaining the structure of the cost function as a linear combination of features, i.e., $\boldsymbol{\phi}^T\mathbf{c}$.
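A minimal sketch of the basic ELM fit of Eqs. (26)-(35) with additive tanh nodes, solving Eq. (35) through the ridge-regularized normal equations; the target function and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, T, L=50, alpha=1e-6):
    """Fit a basic ELM: random, fixed hidden-node parameters (a_i, b_i)
    and a regularized least-squares solve for the output weights beta."""
    d = X.shape[1]
    a = rng.normal(size=(d, L))            # random input weights (fixed)
    b = rng.uniform(-1.0, 1.0, size=L)     # random biases (fixed)
    H = np.tanh(X @ a + b)                 # hidden layer output matrix, N x L
    # Ridge solution of Eq. (35): beta = (H^T H + alpha I)^(-1) H^T T
    beta = np.linalg.solve(H.T @ H + alpha * np.eye(L), H.T @ T)
    return a, b, beta

def elm_predict(a, b, beta, X):
    return np.tanh(X @ a + b) @ beta

# Illustrative 1-D regression: fit a smooth target with random features.
X = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
T = np.sin(3.0 * X)
a, b, beta = elm_fit(X, T)
max_err = np.max(np.abs(elm_predict(a, b, beta, X) - T))
```

Only the linear output weights are trained; the random hidden layer is generated once and reused, which is what makes the fit a single least-squares solve.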
RELATIVE ORBITAL MOTION EQUATIONS
This work makes use of relative motion dynamics for controlling a GEO satellite relative to a desired orbit, and this section provides an overview of the relative motion equations. The spacecraft about which all other spacecraft are orbiting is referred to as the chief, and the remaining spacecraft are referred to as the deputies. The relative orbit position vector $\mathbf{r}$ is expressed in components by $\boldsymbol{\rho} = [x\ y\ z]^T$. A complete derivation of the relative equations of motion for eccentric orbits can be found in Ref. 25. If the relative orbit coordinates are small compared to the chief orbit radius, the equations of motion are given by25
$$\ddot{x} - x\dot{\theta}^2\left(1 + 2\frac{r_c}{p}\right) - 2\dot{\theta}\left(\dot{y} - y\frac{\dot{r}_c}{r_c}\right) = u_x\tag{36a}$$
$$\ddot{y} + 2\dot{\theta}\left(\dot{x} - x\frac{\dot{r}_c}{r_c}\right) - y\dot{\theta}^2\left(1 - \frac{r_c}{p}\right) = u_y\tag{36b}$$
$$\ddot{z} + z\dot{\theta}^2\frac{r_c}{p} = u_z\tag{36c}$$
where $p$ is the semilatus rectum of the chief orbit, $r_c$ is the chief orbit radius, and $\dot{\theta}$ is the true anomaly rate of the chief. Also, $u_x$, $u_y$, and $u_z$ are the control accelerations. The true anomaly acceleration and chief orbit-radius acceleration are given by
$$\ddot{\theta} = -2\frac{\dot{r}_c}{r_c}\dot{\theta}\tag{37a}$$
$$\ddot{r}_c = r_c\dot{\theta}^2\left(1 - \frac{r_c}{p}\right)\tag{37b}$$
If the chief satellite orbit is assumed to be circular, so that $\dot{r}_c = 0$ and $p = r_c$, the relative equations of motion reduce to the simple form known as the CW equations (with control added here):
$$\ddot{x} - 3n^2 x - 2n\dot{y} = u_x\tag{38a}$$
$$\ddot{y} + 2n\dot{x} = u_y\tag{38b}$$
$$\ddot{z} + n^2 z = u_z\tag{38c}$$
where $n = \dot{\theta}$ is the mean motion. The state and control vectors are given by $\mathbf{x} = [x\ y\ z\ \dot{x}\ \dot{y}\ \dot{z}]^T$ and $\mathbf{u} = [u_x\ u_y\ u_z]^T$, respectively.
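The CW equations can be written in the state-space form $\dot{\mathbf{x}} = A\mathbf{x} + B\mathbf{u}$, which is convenient both for generating controlled trajectories and for forming the IOC gradients. A minimal sketch (the LQR weights mirror those used later in the simulation section, and the nondimensional mean motion $n = 1$ is an assumption of this sketch):

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def cw_matrices(n):
    """State-space form of Eq. (38) with state x = [x, y, z, vx, vy, vz]."""
    A = np.zeros((6, 6))
    A[0:3, 3:6] = np.eye(3)
    A[3, 0], A[3, 4] = 3.0 * n**2, 2.0 * n   # x-accel:  3n^2 x + 2n vy + ux
    A[4, 3] = -2.0 * n                        # y-accel: -2n vx + uy
    A[5, 2] = -n**2                           # z-accel: -n^2 z + uz
    B = np.vstack([np.zeros((3, 3)), np.eye(3)])
    return A, B

n = 1.0                              # mean motion in nondimensional units
A, B = cw_matrices(n)
Q, R = np.eye(6), 5.0 * np.eye(3)    # station-keeping weights
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)      # LQR gain, u = -K x
eigs = np.linalg.eigvals(A - B @ K)  # closed-loop poles
```

Simulating this closed loop from a perturbed initial state produces observed state and control trajectories of exactly the form the IOC method consumes.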
[Figure 2. Observed Satellite Trajectories for Relative Motion Example: (a) Satellite Control Example 1, relative orbit in radial, along-track, and out-of-plane coordinates (L), showing the desired, controlled, and uncontrolled trajectories; (b) Satellite Control Example 1, position (L), velocity (L/T), and control (L/T^2) time histories; (c) P Values Example 1.]
SIMULATION RESULTS
This section discusses the initial proof-of-concept results for the proposed satellite behavior modeling approach. To highlight the effectiveness of the proposed approach, two examples are considered: the first uses a GEO satellite that is station keeping within a GEO box, and the second looks at a continuous thrust orbit raising maneuver. Multiple observed trajectories are used to recover the control objective function, and the learned control objective function can then be used to estimate the behavior of SOs. The first simulated case is outlined here, while the second is outlined in a later section. The simulation scenario considered for learning the objective function uses a GEO stationary SO performing station keeping maneuvers to maintain a near-GEO orbit. The equations of motion for the SO are described using the linearized HCW model of Eq. (38). The true control objective function used for the station keeping example is assumed to be
$$\int_{t_0}^{t_f}\mathbf{x}^T Q_H\mathbf{x} + \mathbf{u}^T R_H\mathbf{u}\,dt\tag{39}$$
where $Q_H = I_{6\times 6}$ and $R_H = 5 I_{3\times 3}$. Given this true control objective function, LQR is used to generate simulated trajectories, which are then provided as observations to the IOC approach. For these initial proof-of-concept results, perfect measurements of the SO's state and control are assumed. The simulated trajectories are shown in Figure 2.
Relative Motion Example: Simple Features
The first case considers simple polynomial features given by
$$\boldsymbol{\phi}(\mathbf{x},\mathbf{u}) = \left[u_x^2\ u_y^2\ u_z^2\ x^2\ y^2\ z^2\ \dot{x}^2\ \dot{y}^2\ \dot{z}^2\right]\tag{40}$$
Therefore, using these features the IOC cost function is given by
$$\int_{t_0}^{t_f} c_1 u_x^2 + c_2 u_y^2 + c_3 u_z^2 + c_4 x^2 + c_5 y^2 + c_6 z^2 + c_7\dot{x}^2 + c_8\dot{y}^2 + c_9\dot{z}^2\,dt\tag{41}$$
Using these features and the trajectories shown in Figure 2, two cases were studied: the first used only one trajectory of data and the second used five. The true weight vector for this case is
$$\mathbf{c}_{\mathrm{true}} = \frac{1}{9}\left[5, 5, 5, 1, 1, 1, 1, 1, 1\right]^T\tag{42}$$
The estimated weights using one observed trajectory are
$$\mathbf{c}_{\mathrm{est1}} = \frac{1}{9}\left[5.4295, 4.9057, 4.5063, 1.3077, 0.9886, 0.8792, 1.2456, 1.1407, 0.9135\right]^T\tag{43}$$
The norm error for this case was $\|\mathbf{c}_{\mathrm{true}} - \mathbf{c}_{\mathrm{est1}}\|_2 = 0.0885$. The estimated weights using five trajectories are
$$\mathbf{c}_{\mathrm{est5}} = \frac{1}{9}\left[4.975, 5.012, 4.981, 1.005, 1.016, 0.992, 0.987, 1.071, 0.993\right]^T\tag{44}$$
The norm error for this case was $\|\mathbf{c}_{\mathrm{true}} - \mathbf{c}_{\mathrm{est5}}\|_2 = 0.0298$. Using one trajectory of data resulted in larger errors in the estimated weight vector, because a single trajectory does not excite all aspects of the cost function. With five trajectories, however, the weight vector was recovered with a relatively high degree of accuracy. Therefore, it can be seen that using more trajectories provides an improved estimate of the weights.
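The reported one-trajectory norm error can be reproduced directly from the weight vectors in Eqs. (42)-(44); a quick NumPy check:

```python
import numpy as np

c_true = np.array([5, 5, 5, 1, 1, 1, 1, 1, 1]) / 9.0
c_est1 = np.array([5.4295, 4.9057, 4.5063, 1.3077, 0.9886,
                   0.8792, 1.2456, 1.1407, 0.9135]) / 9.0
c_est5 = np.array([4.975, 5.012, 4.981, 1.005, 1.016,
                   0.992, 0.987, 1.071, 0.993]) / 9.0

err1 = np.linalg.norm(c_true - c_est1)   # matches the reported 0.0885
err5 = np.linalg.norm(c_true - c_est5)   # smaller: five trajectories help
```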
[Figure 3. ELM features for Relative Motion Example: (a) Costates; (b) Costate Velocities; (c) Weight Parameters; (d) Estimated vs. Control Objective Function.]
Relative Motion Example: ELM Features
In addition to the polynomial features, ELM-based feature functions were used. Figure 3 shows the results for the case that uses ELM features. From this figure it can be seen that the ELM features are flexible enough to capture the cost function, which is based on polynomial features. Additionally, this case had 400 ELM features, which was easily handled by the IOC approach. The estimated ELM weight parameters are shown in Figure 3(c), while the costates and costate velocities are shown in Figures 3(a) and 3(b), respectively. These results show promise for extending this work to high-dimensional systems.
Nonlinear Dynamics: Optimal Orbit Raising
This example considers determining the control cost function for an SO which is using
continuous thrust to raise its orbit. The trajectory used for this case is shown in Figure 4.
[Figure 4. Continuous Thrust Optimal Orbit Raising Case: (a) Transfer Orbit in the x-y plane (L), showing the desired, controlled, and uncontrolled trajectories; (b) Orbit Raising Control, position (L), velocity (L/T), and control (L/T^2) time histories; (c) P Values Orbit Raising Example.]
The objective function used for this case is
$$\begin{aligned}
\underset{\mathbf{x},\,\mathbf{u}}{\text{minimize}}\quad & \int_{t_0}^{t_f} c_1 u_x^2 + c_2 u_y^2 + c_3 x^2 + c_4 y^2 + c_5 v_x^2 + c_6 v_y^2\,dt\\
\text{subject to}\quad & \dot{\mathbf{x}} = \mathbf{f}(\mathbf{x},\mathbf{u})\\
& \mathbf{x}(t_0) = \mathbf{x}_{\mathrm{initial}},\quad \mathbf{x}(t_f) = \mathbf{x}_{\mathrm{final}}
\end{aligned}\tag{45}$$
where
$$\mathbf{f}(\mathbf{x},\mathbf{u}) = \begin{bmatrix} v_x\\ v_y\\ -\dfrac{\mu x}{r^3} + u_x\\[4pt] -\dfrac{\mu y}{r^3} + u_y\end{bmatrix}\tag{46}$$
The dynamics model considered is the two-dimensional two-body problem, where the state of the SO is given by $\mathbf{x} = [x, y, v_x, v_y]^T$, and it is assumed that $\mu = 1$ and $r = \sqrt{x^2 + y^2}$. The initial and final states are the circular-orbit conditions $\mathbf{x}_{\mathrm{initial}} = [r_0, 0, 0, \sqrt{\mu/r_0}]^T$ and $\mathbf{x}_{\mathrm{final}} = [r_f, 0, 0, \sqrt{\mu/r_f}]^T$. This objective function is used to solve the forward problem.
The forward optimal control problem for the orbit raising case is solved using the Hermite-Simpson transcription method on a grid of 100 points, and the solution is shown in Figures 4(a) and 4(b). In this case the forward problem is nonlinear since nonlinear two-body dynamics are used; however, the inverse problem is still convex and solvable using the LQR-based approach. The gradients for the IOC approach are given by
$$\nabla_{\mathbf{x}}\boldsymbol{\phi}\big|_{(\mathbf{x},\mathbf{u})} = \begin{bmatrix} 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 2x & 0 & 0 & 0\\ 0 & 2y & 0 & 0\\ 0 & 0 & 2v_x & 0\\ 0 & 0 & 0 & 2v_y\end{bmatrix}\tag{47a}$$
$$\nabla_{\mathbf{u}}\boldsymbol{\phi}\big|_{(\mathbf{x},\mathbf{u})} = \begin{bmatrix} 2u_x & 0\\ 0 & 2u_y\\ 0 & 0\\ 0 & 0\\ 0 & 0\\ 0 & 0\end{bmatrix}\tag{47b}$$
$$\nabla_{\mathbf{x}}\mathbf{f}\big|_{(\mathbf{x},\mathbf{u})} = \begin{bmatrix} 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\\ \dfrac{3x^2}{r^5} - \dfrac{1}{r^3} & \dfrac{3xy}{r^5} & 0 & 0\\[6pt] \dfrac{3xy}{r^5} & \dfrac{3y^2}{r^5} - \dfrac{1}{r^3} & 0 & 0\end{bmatrix}\tag{47c}$$
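Analytic gradients such as Eq. (47c) are easy to get wrong, so a central finite-difference check of the two-body Jacobian is a useful sanity test (with $\mu = 1$ as in the example; the sample state is arbitrary):

```python
import numpy as np

def f(x):
    """Planar two-body dynamics of Eq. (46) with zero control, mu = 1."""
    px, py, vx, vy = x
    r = np.hypot(px, py)
    return np.array([vx, vy, -px / r**3, -py / r**3])

def dfdx(x):
    """Analytic Jacobian of Eq. (47c), mu = 1."""
    px, py, _, _ = x
    r = np.hypot(px, py)
    J = np.zeros((4, 4))
    J[0, 2] = J[1, 3] = 1.0
    J[2, 0] = 3.0 * px**2 / r**5 - 1.0 / r**3
    J[2, 1] = J[3, 0] = 3.0 * px * py / r**5
    J[3, 1] = 3.0 * py**2 / r**5 - 1.0 / r**3
    return J

# Central finite differences, column by column.
x0 = np.array([1.2, -0.4, 0.1, 0.9])
eps = 1e-6
J_fd = np.column_stack([(f(x0 + eps * e) - f(x0 - eps * e)) / (2.0 * eps)
                        for e in np.eye(4)])
```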
Figure 4(c) shows the solution of the Riccati differential equation for the initial value $P(t_0)$; once this is obtained, the coefficients can be solved for using the quadratic programming problem shown in Eq. (25). The true weight vector for this case is
$$\mathbf{c}_{\mathrm{true}} = \left[1, 1, 10^{-8}, 10^{-8}, 10^{-8}, 10^{-8}\right]^T\tag{48}$$
The estimated weights using five trajectories are
$$\mathbf{c}_{\mathrm{est5}} = \left[1, 0.930, 0.875\times 10^{-8}, 0.720\times 10^{-8}, 1.209\times 10^{-8}, 1.058\times 10^{-8}\right]^T\tag{49}$$
The estimated weight vector using five observed trajectories recovered the control cost function with good accuracy; the norm error for this case was $\|\mathbf{c}_{\mathrm{true}} - \mathbf{c}_{\mathrm{est5}}\|_2 = 7\times 10^{-2}$.
CONCLUSION
This paper considers the problem of determining the behavior of an SO from observational data using Inverse Optimal Control (IOC). Given observed trajectories of states and controls, the IOC approach can be used to estimate the control objective function that a given SO is using. This work considers three simulated cases of maneuvering SOs. The first two cases use an SO in GEO which is maneuvering to maintain a given GEO stationary box; the third case studies the behavior estimation problem for optimal orbit raising. The control objective function was specified, and simulation data were generated for the hypothetical SOs. The first two cases differ in the features used to represent the control objective function, which was estimated as a linear combination of features: the first case uses simple second-order polynomial features, while the second case demonstrates complex features based on neural networks. The nonlinear basis functions used for this work included polynomial terms and extreme learning machines. Good performance was shown for the first two cases, and it was observed that the accuracy of the IOC solution improves with the number of observed trajectories used. For the third and final case, the estimated control objective function approximated the true control objective function well, and good performance was shown for the proposed approach.
REFERENCES
[1] Kalman, R. E., “When is a linear control system optimal?” Journal of Basic Engineering, Vol. 86,No. 1, 1964, pp. 51–60.
[2] Masak, M., “An inverse problem on decoupling optimal control systems,” IEEE Transactions on Auto-matic Control, Vol. 13, No. 1, 1968, pp. 109–110.
[3] Moylan, P. and Anderson, B., “Nonlinear regulator theory and an inverse optimal control problem,”IEEE Transactions on Automatic Control, Vol. 18, No. 5, October 1973, pp. 460–465.
[4] Ng, A. Y., Russell, S. J., et al., “Algorithms for inverse reinforcement learning.” Icml, 2000, pp. 663–670.
[5] Abbeel, P. and Ng, A. Y., “Apprenticeship learning via inverse reinforcement learning,” Proceedings ofthe twenty-first international conference on Machine learning, ACM, 2004, p. 1.
[6] Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K., “Maximum Entropy Inverse ReinforcementLearning.” AAAI, Vol. 8, Chicago, IL, USA, 2008, pp. 1433–1438.
[7] Finn, C., Levine, S., and Abbeel, P., “Guided cost learning: Deep inverse optimal control via policyoptimization,” International Conference on Machine Learning, 2016, pp. 49–58.
[8] Wulfmeier, M., Ondruska, P., and Posner, I., “Maximum entropy deep inverse reinforcement learning,”arXiv preprint arXiv:1507.04888, 2015.
[9] Kuderer, M., Gulati, S., and Burgard, W., “Learning driving styles for autonomous vehicles fromdemonstration,” Robotics and Automation (ICRA), 2015 IEEE International Conference on, IEEE,2015, pp. 2641–2646.
[10] Mainprice, J., Hayne, R., and Berenson, D., “Goal Set Inverse Optimal Control and Iterative Replanningfor Predicting Human Reaching Motions in Shared Workspaces.” IEEE Trans. Robotics, Vol. 32, No. 4,2016, pp. 897–908.
[11] Krstic, M. and Tsiotras, P., “Inverse optimal stabilization of a rigid spacecraft,” IEEE Transactions onAutomatic Control, Vol. 44, No. 5, 1999, pp. 1042–1049.
[12] Argall, B. D., Chernova, S., Veloso, M., and Browning, B., “A survey of robot learning from demon-stration,” Robotics and autonomous systems, Vol. 57, No. 5, 2009, pp. 469–483.
[13] Ho, J., Gupta, J., and Ermon, S., “Model-free imitation learning with policy optimization,” InternationalConference on Machine Learning, 2016, pp. 2760–2769.
[14] Ramachandran, D. and Amir, E., “Bayesian inverse reinforcement learning,” Urbana, Vol. 51, No.61801, 2007, pp. 1–4.
[15] Dvijotham, K. and Todorov, E., “Inverse optimal control with linearly-solvable MDPs,” Proceedings ofthe 27th International Conference on Machine Learning (ICML-10), 2010, pp. 335–342.
[16] Keshavarz, A., Wang, Y., and Boyd, S., “Imputing a convex objective function,” Intelligent Control(ISIC), 2011 IEEE International Symposium on, IEEE, 2011, pp. 613–619.
[17] Puydupin-Jamin, A.-S., Johnson, M., and Bretl, T., “A convex approach to inverse optimal control andits application to modeling human locomotion,” Robotics and Automation (ICRA), 2012 IEEE Interna-tional Conference on, IEEE, 2012, pp. 531–536.
[18] Aghasadeghi, N., Long, A., and Bretl, T., “Inverse optimal control for a hybrid dynamical system withimpacts,” Robotics and Automation (ICRA), 2012 IEEE International Conference on, IEEE, 2012, pp.4962–4967.
16
[19] Terekhov, A. V., Pesin, Y. B., Niu, X., Latash, M. L., and Zatsiorsky, V. M., “An analytical approach tothe problem of inverse optimization with additive objective functions: an application to human prehen-sion,” Journal of mathematical biology, Vol. 61, No. 3, 2010, pp. 423–453.
[20] Johnson, M., Aghasadeghi, N., and Bretl, T., “Inverse optimal control for deterministic continuous-timenonlinear systems,” Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, IEEE, 2013,pp. 2906–2913.
[21] Mombaur, K., Truong, A., and Laumond, J.-P., “From human to humanoid locomotion—an inverseoptimal control approach,” Autonomous robots, Vol. 28, No. 3, 2010, pp. 369–383.
[22] Ratliff, N. D., Bagnell, J. A., and Zinkevich, M. A., “Maximum margin planning,” Proceedings of the23rd international conference on Machine learning, ACM, 2006, pp. 729–736.
[23] Huang, G.-B., Zhu, Q.-Y., and Siew, C.-K., “Extreme learning machine: theory and applications,”Neurocomputing, Vol. 70, No. 1, 2006, pp. 489–501.
[24] Huang, G.-B., Chen, L., and Siew, C.-K., “Universal approximation using incremental constructivefeedforward networks with random hidden nodes,” Neural Networks, IEEE Transactions on, Vol. 17,No. 4, 2006, pp. 879–892.
[25] Junkins, J. L., Hughes, D. C., Wazni, K. P., Pariyapong, V., and Kehtarnavaz, N., “Vision-Based Navi-gation for Rendezvous, Docking and Proximity Operations,” AAS Paper 99-021, Vol. 52, Feb. 1999.