
Optimal Control of a Double Inverted Pendulum on a Cart

Alexander Bogdanov *

Department of Computer Science & Electrical Engineering, OGI School of Science & Engineering, OHSU

Technical Report CSE-04-006
December 2004

* Senior Research Associate, [email protected]

In this report, a number of algorithms for optimal control of a double inverted pendulum on a cart (DIPC) are investigated and compared. Modeling is based on the Euler-Lagrange equations derived by specifying a Lagrangian, the difference between the kinetic and potential energy of the DIPC system. This results in a system of nonlinear differential equations consisting of three second-order equations. This system of equations is then transformed into the usual form of six first-order ordinary differential equations (ODEs) for control design purposes. Control of a DIPC poses a certain challenge, since, unlike a robot, the system is underactuated: one controlling force per three degrees of freedom (DOF). In this report, the problem of optimal control minimizing a quadratic cost functional is addressed. Several approaches are tested: the linear quadratic regulator (LQR), the state-dependent Riccati equation (SDRE), optimal neural network (NN) control, and combinations of the NN with the LQR and the SDRE. Simulations reveal superior performance of the SDRE over the LQR, and improvements provided by the NN, which compensates for model inadequacies in the LQR. The limited capability of the NN to approximate functions over a wide range of arguments prevents it from significantly improving SDRE performance, providing only marginal benefits at larger pendulum deflections.

1 Nomenclature

m        mass
li       distance from a pivot joint to the center of mass of the i-th pendulum link
Li       length of the i-th pendulum link
θ0       wheeled cart position
θ1, θ2   pendulum angles
Ii       moment of inertia of the i-th pendulum link w.r.t. its center of mass
g        gravity constant
u        control force
T        kinetic energy
P        potential energy
L        Lagrangian
w        neural network (NN) weights
Nh       number of neurons in the hidden layer
∆t       digital control sampling time

Subscripts 0, 1, 2 used with the aforementioned parameters refer to the cart, the first (bottom) pendulum, and the second (top) pendulum, respectively.

2 Introduction

As a nonlinear underactuated plant, the double inverted pendulum on a cart (DIPC) poses a challenging control problem, and it has long been an attractive testbed for linear and nonlinear control laws [6, 17, 1]. Numerous papers use the DIPC; the ones cited here are merely examples. Nearly all work on pendulum control concentrates on two problems: swing-up control design and stabilization of the inverted pendulums. In this report, the optimal nonlinear stabilization problem is addressed: stabilize the DIPC while minimizing an accumulative cost functional quadratic in states and controls. For linear systems, this leads to linear feedback control, which is found by solving a Riccati equation [5] and is thus referred to as the linear quadratic regulator (LQR). The DIPC, however, is a highly nonlinear system, and its linearization is far from adequate for control design purposes. Nonetheless, the LQR will be considered as a baseline controller, and other control designs will be tested and compared against it. To solve the nonlinear optimal control problem, we will employ several different approaches: a nonlinear extension of the LQR called the SDRE; direct nonlinear optimization using a neural network (NN); and combinations of the direct NN optimization with the LQR and the SDRE.

For a baseline nonlinear control, we will utilize a technique that has shown considerable promise and involves manipulating the system dynamic equations into a pseudo-linear state-dependent coefficient (SDC) form, in which the system matrices are given explicitly as functions of the current state. Treating the system matrices as constant, an approximate solution of the nonlinear state-dependent Riccati equation is obtained for the reformulated pseudo-linear dynamical system in discrete time steps. The solution is then used to calculate a feedback control law that is optimized around the system state estimated at each time step. This technique, referred to as state-dependent Riccati equation (SDRE) control, is thus an extension of the LQR, as it solves an LQR problem at each time step.

As a next step, a direct optimization approach will be investigated. For this purpose, a NN will be used in the feedback loop, and a standard calculus of variations approach will be employed to adjust the NN parameters (weights) to optimize the cost functional over a wide range of initial DIPC states. As a universal function approximator, the NN is thus trained to implement a nonlinear optimal controller.

Figure 1: Double inverted pendulum on a cart (cart mass m0 driven by force u; pendulum links of masses m1, m2, lengths L1, L2, centers of mass at l1, l2, and inertias I1, I2)

As the last step, two combinations of feedback NN control with the LQR and the SDRE will be designed. The approximation capabilities of a NN are limited by its size and by the fact that optimization in the space of its weights is non-convex. Thus, usually only local minima are found, and the solution is at best suboptimal. The problem is more severe for wider ranges of NN inputs and outputs. To address this, a combination of the NN with a conventional suboptimal feedback control is designed to simplify the NN input-output mapping and therefore reduce training complexity. If the conventional feedback control were optimal, the optimal NN output would trivially be zero. Simple logic says that a suboptimal conventional control will still simplify the NN mapping to some extent: instead of generating optimal controls, the NN is trained to produce corrections to the controls generated by the conventional suboptimal controller. For example, the LQR provides near-optimal control in the vicinity of the equilibrium, since the nonlinear and linearized DIPC dynamics are close there. Thus the NN only has to correct the LQR output where the linearization accuracy diminishes. The next sections discuss the above concepts in detail and illustrate them with simulations.

3 Modeling

The DIPC system is depicted in Fig. 1. One way to derive its equations of motion is to use the Lagrange equations:

$$\frac{d}{dt}\left(\frac{\partial L}{\partial \dot{\theta}}\right) - \frac{\partial L}{\partial \theta} = Q \qquad (1)$$

where L = T − P is the Lagrangian, and Q is the vector of generalized forces (or moments) acting in the direction of the generalized coordinates θ and not accounted for in the formulation of the kinetic energy T and potential energy P.

The kinetic and potential energies of the system are given by the sums of the energies of its individual components (a wheeled cart and two pendulums):

$$T = T_0 + T_1 + T_2, \qquad P = P_0 + P_1 + P_2$$

where

$$T_0 = \frac{1}{2} m_0 \dot{\theta}_0^2$$

$$T_1 = \frac{1}{2} m_1 \left[ \left( \dot{\theta}_0 + l_1 \dot{\theta}_1 \cos\theta_1 \right)^2 + \left( l_1 \dot{\theta}_1 \sin\theta_1 \right)^2 \right] + \frac{1}{2} I_1 \dot{\theta}_1^2
= \frac{1}{2} m_1 \dot{\theta}_0^2 + \frac{1}{2}\left( m_1 l_1^2 + I_1 \right) \dot{\theta}_1^2 + m_1 l_1 \dot{\theta}_0 \dot{\theta}_1 \cos\theta_1$$

$$T_2 = \frac{1}{2} m_2 \left[ \left( \dot{\theta}_0 + L_1 \dot{\theta}_1 \cos\theta_1 + l_2 \dot{\theta}_2 \cos\theta_2 \right)^2 + \left( L_1 \dot{\theta}_1 \sin\theta_1 + l_2 \dot{\theta}_2 \sin\theta_2 \right)^2 \right] + \frac{1}{2} I_2 \dot{\theta}_2^2
= \frac{1}{2} m_2 \dot{\theta}_0^2 + \frac{1}{2} m_2 L_1^2 \dot{\theta}_1^2 + \frac{1}{2}\left( m_2 l_2^2 + I_2 \right) \dot{\theta}_2^2 + m_2 L_1 \dot{\theta}_0 \dot{\theta}_1 \cos\theta_1 + m_2 l_2 \dot{\theta}_0 \dot{\theta}_2 \cos\theta_2 + m_2 L_1 l_2 \dot{\theta}_1 \dot{\theta}_2 \cos(\theta_1 - \theta_2)$$

$$P_0 = 0, \qquad P_1 = m_1 g l_1 \cos\theta_1, \qquad P_2 = m_2 g \left( L_1 \cos\theta_1 + l_2 \cos\theta_2 \right)$$

Thus the Lagrangian of the system is given by

$$L = \frac{1}{2}\left( m_0 + m_1 + m_2 \right) \dot{\theta}_0^2 + \frac{1}{2}\left( m_1 l_1^2 + m_2 L_1^2 + I_1 \right) \dot{\theta}_1^2 + \frac{1}{2}\left( m_2 l_2^2 + I_2 \right) \dot{\theta}_2^2 + \left( m_1 l_1 + m_2 L_1 \right) \cos(\theta_1)\,\dot{\theta}_0 \dot{\theta}_1 + m_2 l_2 \cos(\theta_2)\,\dot{\theta}_0 \dot{\theta}_2 + m_2 L_1 l_2 \cos(\theta_1 - \theta_2)\,\dot{\theta}_1 \dot{\theta}_2 - \left( m_1 l_1 + m_2 L_1 \right) g \cos\theta_1 - m_2 l_2 g \cos\theta_2$$

Differentiating the Lagrangian with respect to θ and θ̇ yields the Lagrange equations (1) as

$$\frac{d}{dt}\left(\frac{\partial L}{\partial \dot{\theta}_0}\right) - \frac{\partial L}{\partial \theta_0} = u$$

$$\frac{d}{dt}\left(\frac{\partial L}{\partial \dot{\theta}_1}\right) - \frac{\partial L}{\partial \theta_1} = 0$$

$$\frac{d}{dt}\left(\frac{\partial L}{\partial \dot{\theta}_2}\right) - \frac{\partial L}{\partial \theta_2} = 0$$

Or explicitly,

$$u = \Bigl(\sum_i m_i\Bigr) \ddot{\theta}_0 + \left( m_1 l_1 + m_2 L_1 \right) \cos(\theta_1)\,\ddot{\theta}_1 + m_2 l_2 \cos(\theta_2)\,\ddot{\theta}_2 - \left( m_1 l_1 + m_2 L_1 \right) \sin(\theta_1)\,\dot{\theta}_1^2 - m_2 l_2 \sin(\theta_2)\,\dot{\theta}_2^2$$

$$0 = \left( m_1 l_1 + m_2 L_1 \right) \cos(\theta_1)\,\ddot{\theta}_0 + \left( m_1 l_1^2 + m_2 L_1^2 + I_1 \right) \ddot{\theta}_1 + m_2 L_1 l_2 \cos(\theta_1 - \theta_2)\,\ddot{\theta}_2 + m_2 L_1 l_2 \sin(\theta_1 - \theta_2)\,\dot{\theta}_2^2 - \left( m_1 l_1 + m_2 L_1 \right) g \sin\theta_1$$

$$0 = m_2 l_2 \cos(\theta_2)\,\ddot{\theta}_0 + m_2 L_1 l_2 \cos(\theta_1 - \theta_2)\,\ddot{\theta}_1 + \left( m_2 l_2^2 + I_2 \right) \ddot{\theta}_2 - m_2 L_1 l_2 \sin(\theta_1 - \theta_2)\,\dot{\theta}_1^2 - m_2 l_2 g \sin\theta_2$$

The Lagrange equations for the DIPC system can be written in the more compact matrix form

$$D(\theta)\ddot{\theta} + C(\theta, \dot{\theta})\dot{\theta} + G(\theta) = H u \qquad (2)$$

where

$$D(\theta) = \begin{pmatrix} d_1 & d_2 \cos\theta_1 & d_3 \cos\theta_2 \\ d_2 \cos\theta_1 & d_4 & d_5 \cos(\theta_1 - \theta_2) \\ d_3 \cos\theta_2 & d_5 \cos(\theta_1 - \theta_2) & d_6 \end{pmatrix} \qquad (3)$$

$$C(\theta, \dot{\theta}) = \begin{pmatrix} 0 & -d_2 \sin(\theta_1)\dot{\theta}_1 & -d_3 \sin(\theta_2)\dot{\theta}_2 \\ 0 & 0 & d_5 \sin(\theta_1 - \theta_2)\dot{\theta}_2 \\ 0 & -d_5 \sin(\theta_1 - \theta_2)\dot{\theta}_1 & 0 \end{pmatrix} \qquad (4)$$

$$G(\theta) = \begin{pmatrix} 0 \\ -f_1 \sin\theta_1 \\ -f_2 \sin\theta_2 \end{pmatrix} \qquad (5)$$

$$H = \begin{pmatrix} 1 & 0 & 0 \end{pmatrix}^T$$

Assuming that the centers of mass of the pendulums are in the geometric centers of the links, which are solid rods, we have l_i = L_i/2 and I_i = m_i L_i^2/12. Then for the elements of the matrices D(θ), C(θ, θ̇), and G(θ) we get:

$$d_1 = m_0 + m_1 + m_2$$

$$d_2 = m_1 l_1 + m_2 L_1 = \left(\tfrac{1}{2} m_1 + m_2\right) L_1$$

$$d_3 = m_2 l_2 = \tfrac{1}{2} m_2 L_2$$

$$d_4 = m_1 l_1^2 + m_2 L_1^2 + I_1 = \left(\tfrac{1}{3} m_1 + m_2\right) L_1^2$$

$$d_5 = m_2 L_1 l_2 = \tfrac{1}{2} m_2 L_1 L_2$$

$$d_6 = m_2 l_2^2 + I_2 = \tfrac{1}{3} m_2 L_2^2$$

$$f_1 = (m_1 l_1 + m_2 L_1) g = \left(\tfrac{1}{2} m_1 + m_2\right) L_1 g$$

$$f_2 = m_2 l_2 g = \tfrac{1}{2} m_2 L_2 g$$

Note that matrix D(θ) is symmetric and nonsingular.
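For concreteness, the following minimal NumPy sketch (not part of the original report; all names illustrative) builds D(θ), C(θ, θ̇), G(θ), and H from (3)–(5) under the solid-rod assumptions, using the parameter values of Table 1:

```python
import numpy as np

# Parameters from Table 1; solid rods give l_i = L_i/2 and I_i = m_i L_i^2/12.
m0, m1, m2 = 1.5, 0.5, 0.75
L1, L2 = 0.5, 0.75
g = 9.81

d1 = m0 + m1 + m2
d2 = (0.5 * m1 + m2) * L1
d3 = 0.5 * m2 * L2
d4 = (m1 / 3.0 + m2) * L1**2
d5 = 0.5 * m2 * L1 * L2
d6 = m2 * L2**2 / 3.0
f1 = (0.5 * m1 + m2) * L1 * g
f2 = 0.5 * m2 * L2 * g

def D(th):
    """Inertia matrix (3); th = (theta0, theta1, theta2)."""
    _, t1, t2 = th
    return np.array([[d1, d2 * np.cos(t1), d3 * np.cos(t2)],
                     [d2 * np.cos(t1), d4, d5 * np.cos(t1 - t2)],
                     [d3 * np.cos(t2), d5 * np.cos(t1 - t2), d6]])

def C(th, thd):
    """Coriolis/centrifugal matrix (4); thd = (dtheta0, dtheta1, dtheta2)."""
    _, t1, t2 = th
    _, w1, w2 = thd
    return np.array([[0.0, -d2 * np.sin(t1) * w1, -d3 * np.sin(t2) * w2],
                     [0.0, 0.0, d5 * np.sin(t1 - t2) * w2],
                     [0.0, -d5 * np.sin(t1 - t2) * w1, 0.0]])

def G(th):
    """Gravity vector (5)."""
    _, t1, t2 = th
    return np.array([0.0, -f1 * np.sin(t1), -f2 * np.sin(t2)])

H = np.array([1.0, 0.0, 0.0])
```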

4 Control

To design a control law, the Lagrange equations of motion (2) are reformulated as a sixth-order system of ordinary differential equations. To do this, a state vector x ∈ R^6 is introduced:

$$x = \begin{pmatrix} \theta \\ \dot{\theta} \end{pmatrix}$$

Then, dropping the dependencies of the system matrices on the generalized coordinates and their derivatives, the system dynamic equations take the form

$$\dot{x} = \begin{pmatrix} 0 & I \\ 0 & -D^{-1}C \end{pmatrix} x + \begin{pmatrix} 0 \\ -D^{-1}G \end{pmatrix} + \begin{pmatrix} 0 \\ D^{-1}H \end{pmatrix} u \qquad (6)$$
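Using the matrix sketches above, the right-hand side of (6) can be evaluated directly; a minimal sketch (function name illustrative):

```python
def fc(x, u):
    """Right-hand side of (6) for x = (theta, theta_dot) in R^6 and scalar u."""
    th, thd = x[:3], x[3:]
    Dinv = np.linalg.inv(D(th))
    thdd = Dinv @ (H * u - C(th, thd) @ thd - G(th))
    return np.concatenate([thd, thdd])
```

A fixed-step simulation is then as simple as repeating x = x + dt * fc(x, u) with a chosen step dt.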

In this report, the optimal nonlinear stabilization control design is addressed: stabilize the DIPC while minimizing an accumulative cost functional quadratic in states and controls. The general problem of designing an optimal control law involves minimizing a cost function

$$J_t = \sum_{k=t}^{t_{final}} L_k(x_k, u_k), \qquad (7)$$

which represents the accumulated cost of the sequence of states x_k and controls u_k from the current discrete time t to the final time t_final. For regulation problems, t_final = ∞. Optimization is done with respect to the control sequence, subject to the constraints of the system dynamics (6). In our case,

$$L_k(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k \qquad (8)$$

corresponds to the standard linear quadratic cost. For linear systems, this leads to linear state-feedback control, the LQR, designed in the next subsection. For nonlinear systems, the optimal control problem generally requires a numerical solution, which can be computationally prohibitive. An analytical approximation to the nonlinear optimal control solution is utilized in the subsection on SDRE control, which represents a nonlinear extension of the LQR and yields superior results. The function approximation capabilities of neural networks (NNs) are employed to approximate the nonlinear control solution in the subsection on NN control, and combinations of the NN with the LQR and SDRE are investigated in the subsection that follows.

4.1 Linear Quadratic Regulator

The linear quadratic regulator yields the optimal solution to the control problem (7)–(8) when the system dynamics are linear. Since the DIPC is nonlinear, as described by (6), it can be linearized to derive an approximate linear solution to the optimal control problem. Linearization of (6) around x = 0 yields

$$\dot{x} = A x + B u \qquad (9)$$

where

$$A = \begin{pmatrix} 0 & I \\ -D(0)^{-1} \dfrac{\partial G(0)}{\partial \theta} & 0 \end{pmatrix} \qquad (10)$$

$$B = \begin{pmatrix} 0 \\ D(0)^{-1} H \end{pmatrix} \qquad (11)$$

and the continuous LQR solution is then obtained as

$$u = -R^{-1} B^T P_c x \equiv -K_c x \qquad (12)$$

where P_c is the steady-state solution of the differential Riccati equation. To implement computerized digital control, the dynamic equations (9) are approximately discretized as Φ ≈ e^{A∆t}, Γ ≈ B∆t, and the digital LQR control is then given by

$$u_k = -R^{-1} \Gamma^T P x_k \equiv -K x_k \qquad (13)$$

where P is the steady-state solution of the difference Riccati equation, obtained by solving the discrete-time algebraic Riccati equation

$$\Phi^T \left[ P - P \Gamma (R + \Gamma^T P \Gamma)^{-1} \Gamma^T P \right] \Phi - P + Q = 0 \qquad (14)$$

where Q ∈ R^{6×6} and R ∈ R are positive definite state and control cost matrices. Since the linearization (9)–(11) accurately represents the DIPC system (6) at the equilibrium, the LQR control (12) or (13) is a locally near-optimal stabilizing control.
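As a sketch of (9)–(14), the discrete gain can be computed with SciPy's Riccati solver; this uses the standard discrete-time gain K = (R + Γ^T P Γ)^{-1} Γ^T P Φ, which reduces to the form in (13) in the small-∆t limit, and the Q, R, and ∆t values of Table 1 (all names illustrative):

```python
from scipy.linalg import expm, solve_discrete_are

dt = 0.02                                   # Delta t from Table 1
Q = np.diag([5.0, 50.0, 50.0, 20.0, 700.0, 700.0])
R = np.array([[1.0]])

# Linearization (10)-(11) at x = 0: dG/dtheta(0) = diag(0, -f1, -f2).
D0inv = np.linalg.inv(D(np.zeros(3)))
A = np.block([[np.zeros((3, 3)), np.eye(3)],
              [-D0inv @ np.diag([0.0, -f1, -f2]), np.zeros((3, 3))]])
B = np.concatenate([np.zeros(3), D0inv @ H]).reshape(6, 1)

Phi = expm(A * dt)                          # Phi ~ exp(A dt)
Gam = B * dt                                # Gamma ~ B dt
P = solve_discrete_are(Phi, Gam, Q, R)      # steady-state solution of (14)
K = np.linalg.solve(R + Gam.T @ P @ Gam, Gam.T @ P @ Phi)
# Digital LQR (13): u_k = -K x_k
```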


4.2 State-Dependent Riccati Equation Control

An approximate analytical solution of the nonlinear optimal control problem (7)–(8) subject to (6) is given by a technique referred to as state-dependent Riccati equation (SDRE) control. The SDRE approach [3] involves manipulating the dynamic equations

$$\dot{x} = f(x, u)$$

into a pseudo-linear state-dependent coefficient (SDC) form in which the system matrices are explicit functions of the current state:

$$\dot{x} = A(x) x + B(x) u \qquad (15)$$

A standard LQR problem (Riccati equation) can then be solved at each time step to design the state-feedback control law on-line. For digital implementation, (15) is approximately discretized at each time step into

$$x_{k+1} = \Phi(x_k) x_k + \Gamma(x_k) u_k \qquad (16)$$

and the SDRE regulator is then specified, similarly to the discrete LQR (compare with (13)), as

$$u_k = -R^{-1} \Gamma^T(x_k) P(x_k) x_k \equiv -K(x_k) x_k \qquad (17)$$

where P(x_k) is the steady-state solution of the difference Riccati equation, obtained by solving the discrete-time algebraic Riccati equation (14) using the state-dependent matrices Φ(x_k) and Γ(x_k), which are treated as constant at each time step. The approach is thus sometimes considered a nonlinear extension of the LQR.

Proposition 1. The dynamic equations of a double inverted pendulum on a cart are presentable in the SDC form (15).

Proof. From the derived dynamic equations for the DIPC system (6), it is clear that the required SDC form (15) can be obtained if the vector G(θ) is presentable in SDC form: G(θ) = G_sd(θ)θ. Let us construct G_sd(θ) as

$$G_{sd}(\theta) = \begin{pmatrix} 0 & 0 & 0 \\ 0 & -f_1 \dfrac{\sin\theta_1}{\theta_1} & 0 \\ 0 & 0 & -f_2 \dfrac{\sin\theta_2}{\theta_2} \end{pmatrix}$$

The elements of the constructed G_sd(θ) are bounded everywhere, and G(θ) = G_sd(θ)θ as required. Thus the system dynamic equations can be presented in SDC form as

$$\dot{x} = \begin{pmatrix} 0 & I \\ -D^{-1}G_{sd} & -D^{-1}C \end{pmatrix} x + \begin{pmatrix} 0 \\ D^{-1}H \end{pmatrix} u \qquad (18)$$

The derived system equations (18) (compare with the linearized system (9)–(11)) are discretized at each time step into (16), and the control is then computed as given by (17).
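A minimal sketch of one SDRE step per (16)–(18), reusing the matrix builders and LQR constants sketched above; sin θ_i/θ_i is evaluated as np.sinc(θ_i/π), which stays bounded at θ_i = 0, and the first-order discretization Φ ≈ I + A(x)∆t is one simple choice (names illustrative):

```python
def sdre_control(x):
    """One SDRE step: build the SDC form (18) at x, discretize per (16),
    solve the Riccati equation (14), and return u_k = -K(x_k) x_k per (17)."""
    th, thd = x[:3], x[3:]
    Dinv = np.linalg.inv(D(th))
    # G_sd(theta) with bounded sin(t)/t terms: np.sinc(t/pi) = sin(t)/t.
    Gsd = np.diag([0.0,
                   -f1 * np.sinc(th[1] / np.pi),
                   -f2 * np.sinc(th[2] / np.pi)])
    Ax = np.block([[np.zeros((3, 3)), np.eye(3)],
                   [-Dinv @ Gsd, -Dinv @ C(th, thd)]])
    Bx = np.concatenate([np.zeros(3), Dinv @ H]).reshape(6, 1)
    Phi = np.eye(6) + Ax * dt               # crude one-step discretization
    Gam = Bx * dt
    P = solve_discrete_are(Phi, Gam, Q, R)
    K = np.linalg.solve(R + Gam.T @ P @ Gam, Gam.T @ P @ Phi)
    return -(K @ x).item()
```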

Remark. In the neighborhood of the equilibrium x = 0, the system equations in SDC form (18) turn into the linearized equations (9) used in the LQR design. This can be checked either by direct computation or by noting that, on one hand, linearization yields

$$\dot{x} = A x + B u + O(x)^2$$

and on the other hand, the SDC form of the system dynamics is

$$\dot{x} = A(x) x + B(x) u$$

Since O(x)² → 0 as x → 0, we have A(x) → A and B(x) → B. Therefore, it is natural to expect that the performance of the SDRE regulator will be very close to the LQR in the vicinity of the equilibrium, and that the difference will show at larger pendulum deflections. This is exactly what the Simulation Results section illustrates.

Figure 2: Neural network control diagram

4.3 Neural Network Learning Control

Neural network (NN) control is popular in the control of nonlinear systems due to the universal function approximation capabilities of NNs. Neural networks with only one hidden layer and an arbitrarily large number of neurons represent nonlinear mappings that can be used to approximate any nonlinear function f ∈ C(R^n, R^m) over a compact subset of R^n [7, 4, 12]. In this section, we utilize the function approximation properties of NNs to approximate the solution of the nonlinear optimal control problem (7)–(8) subject to the system dynamics (6), and thus design an optimal NN regulator. The problem can be solved by directly implementing a feedback controller (see Figure 2) as

$$u_k = NN(x_k, w),$$

Next, an optimal set of weights w is computed to solve the optimization problem. To do this, a standard calculus of variations approach is taken: minimization of a functional subject to the equality constraints x_{k+1} = f(x_k, u_k). Let λ_k be a vector of Lagrange multipliers in the augmented cost function

$$H = \sum_{k=t}^{t_{final}} \left\{ L(x_k, u_k) + \lambda_k^T \left( x_{k+1} - f(x_k, u_k) \right) \right\} \qquad (19)$$

We can now derive the recurrent Euler-Lagrange equations by solving ∂H/∂x_k = 0 with respect to the Lagrange multipliers, and then find the optimal set of NN weights w* by solving the optimality condition ∂H/∂w = 0 (numerically, by means of gradient descent).

Dropping the dependence of L(x_k, u_k) and f(x_k, u_k) on x_k and u_k, and of NN(x_k, w) on x_k and w for brevity, the

Euler-Lagrange equations are derived as

$$\lambda_k = \left( \frac{\partial f}{\partial x_k} + \frac{\partial f}{\partial u_k} \frac{\partial NN}{\partial x_k} \right)^T \lambda_{k+1} + \left( \frac{\partial L}{\partial x_k} + \frac{\partial L}{\partial u_k} \frac{\partial NN}{\partial x_k} \right)^T \qquad (20)$$

with λ_{t_final} initialized as a zero vector. For L(x_k, u_k) given by (8), ∂L/∂x_k = 2 x_k^T Q and ∂L/∂u_k = 2 u_k^T R. These equations correspond to the adjoint system shown graphically in Figure 3, with the optimality condition

$$\frac{\partial H}{\partial w} = \sum_{k=t}^{t_{final}} \left\{ \left( \lambda_{k+1}^T \frac{\partial f}{\partial u_k} + \frac{\partial L}{\partial u_k} \right) \frac{\partial NN}{\partial w} \right\} = 0. \qquad (21)$$

Figure 3: Adjoint system diagram

The overall training procedure for the NN can now be summarized as follows:

1. Simulate the system forward in time for t_final time steps (Figure 2). Although, as mentioned, t_final = ∞ for regulation problems, in practice it is set to a sufficiently large number (in our case t_final = 500).

2. Run the adjoint system backward in time to accumulate the Lagrange multipliers (20) (see Figure 3). The system Jacobians are evaluated analytically or by perturbation.

3. Update the weights using gradient descent¹: ∆w = −γ (∂H/∂w)^T.

4. Repeat until convergence, or until an acceptable level of cost reduction is achieved.

¹ In practice, we use an adaptive learning rate for each weight in the network, following a procedure similar to delta-bar-delta [8].

Due to the nature of the training process (propagation of the Lagrange multipliers backwards through time), this training algorithm is referred to as backpropagation through time (BPTT) [12, 16].

As seen from the Euler-Lagrange equations (20), back-propagation of the Lagrange multipliers includes computation of the system and NN Jacobians. The DIPC Jacobians can be computed numerically by perturbation or derived analytically from the dynamic equations (6). Let us show the analytically derived Jacobians ∂f_c(x,u)/∂x and ∂f_c(x,u)/∂u, where f_c(x,u) is a brief notation for the right-hand side of the continuous system (6), i.e. ẋ = f_c(x,u):

$$\frac{\partial f_c(x,u)}{\partial x} = \begin{pmatrix} 0 & I \\ -D^{-1}M & -2D^{-1}C \end{pmatrix} \qquad (22)$$

$$\frac{\partial f_c(x,u)}{\partial u} = \begin{pmatrix} 0 \\ D^{-1}H \end{pmatrix} \qquad (23)$$

where the matrix M(θ, θ̇) = (M₀ M₁ M₂), and each of its columns (note that ∂D⁻¹/∂θ_i = −D⁻¹ (∂D/∂θ_i) D⁻¹) is given by

$$M_i = \frac{\partial C}{\partial \theta_i}\,\dot{\theta} + \frac{\partial G}{\partial \theta_i} - \frac{\partial D}{\partial \theta_i}\, D^{-1}\left( C\dot{\theta} + G - Hu \right) \qquad (24)$$

Remark 1. Clearly, the Jacobians (22)–(24) transform into the linear system matrices (10)–(11) at the equilibrium θ = 0, θ̇ = 0.

Remark 2. From (3)–(5) it follows that M₀ ≡ 0.

Remark 3. The Jacobians (22)–(24) are derived from the continuous system equations (6). BPTT requires the discrete system Jacobians; thus, to use the derived matrices in NN training, they should be discretized (e.g., as was done for the LQR and SDRE).
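Since Remark 3 calls for discrete Jacobians, one hypothetical shortcut (not the report's analytic route) is to differentiate a one-step Euler discretization of (6) by perturbation; a sketch:

```python
def discrete_jacobians(x, u, eps=1e-6):
    """Jacobians of f(x_k, u_k) = x_k + dt * fc(x_k, u_k) by finite differences."""
    f0 = x + dt * fc(x, u)
    dfdx = np.zeros((6, 6))
    for i in range(6):
        dx = np.zeros(6)
        dx[i] = eps
        dfdx[:, i] = ((x + dx) + dt * fc(x + dx, u) - f0) / eps
    dfdu = (x + dt * fc(x, u + eps) - f0) / eps
    return dfdx, dfdu.reshape(6, 1)
```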

Computation of the NN Jacobian is easy to perform given that the nonlinear functions of the individual neural elements are y = tanh(z). In this case, a NN with N₀ inputs, a single output u, and a single hidden layer with N_h elements is described by

$$u = \sum_{i=1}^{N_h} w_{2_i} \tanh\Biggl( \sum_{j=1}^{N_0} w_{1_{i,j}} x_j + w^b_{1_i} \Biggr) + w^b_2$$

or, in a more compact form,

$$u = w_2 \tanh\left( W_1 x + w^b_1 \right) + w^b_2$$

where tanh(z) = (tanh(z₁), ..., tanh(z_{N_h}))^T, w₂ ∈ R^{N_h} is a row vector of weights of the NN output, W₁ ∈ R^{N_h×N₀} is the matrix of weights of the hidden layer elements, w^b₁ ∈ R^{N_h} is a vector of bias weights in the hidden layer, and w^b₂ is the bias weight of the NN output. The NN is depicted graphically in Figure 4.

Remark 4. In the case of a NN with N_out outputs (MIMO control problems), w₂ becomes a matrix W₂ ∈ R^{N_out×N_h}, and w^b₂ becomes a vector w^b₂ ∈ R^{N_out}.

Noting that ∂tanh(z)/∂z = 1 − tanh²(z), the NN Jacobian with respect to the NN input (the state vector x) is given by

$$\frac{\partial NN}{\partial x} = \left( w_2^T \odot \operatorname{dtanh}\left( W_1 x + w^b_1 \right) \right)^T W_1 \qquad (25)$$

where the operator ⊙ denotes element-by-element multiplication of two vectors, i.e. x ⊙ y = (x₁y₁, ..., x_n y_n)^T, and the vector field dtanh(z) = (1 − tanh²(z₁), ..., 1 − tanh²(z_{N_h}))^T. Again, in the MIMO case, when the NN has N_out outputs, the individual rows of W₂ take the place of the row vector w₂ in (25).

Figure 4: Neural network structure (hidden layer W₁, w^b₁; output layer w₂, w^b₂)

The last essential part of the NN training algorithm is the NN Jacobian with respect to the network weights:

$$\frac{\partial NN}{\partial W_{1_i}} = \left( w_2^T \odot \operatorname{dtanh}\left( W_1 x + w^b_1 \right) \right)^T x_i$$

$$\frac{\partial NN}{\partial w_2} = \tanh\left( W_1 x + w^b_1 \right) \qquad (26)$$

$$\frac{\partial NN}{\partial w^b_1} = \left( w_2^T \odot \operatorname{dtanh}\left( W_1 x + w^b_1 \right) \right)^T$$

$$\frac{\partial NN}{\partial w^b_2} = 1$$

where W_{1_i} is the i-th column of W₁.
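The forward pass and the Jacobians (25)–(26) are a few lines of NumPy; a sketch with illustrative names, storing w2 and wb1 as 1-D arrays:

```python
def nn_forward(x, W1, wb1, w2, wb2):
    """u = w2 tanh(W1 x + wb1) + wb2; returns u and the hidden activations."""
    z = np.tanh(W1 @ x + wb1)
    return float(w2 @ z + wb2), z

def nn_jacobians(x, W1, w2, z):
    """NN Jacobians (25)-(26), given hidden activations z = tanh(W1 x + wb1)."""
    dz = 1.0 - z**2                        # dtanh(W1 x + wb1), elementwise
    dNN_dx = (w2 * dz) @ W1                # (25): shape (6,)
    dNN_dW1 = np.outer(w2 * dz, x)         # (26), stacked over the columns of W1
    dNN_dw2 = z                            # (26)
    dNN_dwb1 = w2 * dz                     # (26)
    dNN_dwb2 = 1.0                         # (26)
    return dNN_dx, dNN_dW1, dNN_dw2, dNN_dwb1, dNN_dwb2
```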

Now we have all the quantities needed to compute the weight updates and the control. Explicitly, from (21) and (26) the weight update recurrent law is given by

$$W_{1_{n+1}} = W_{1_n} - \gamma \sum_{k=t}^{t_{final}} \left\{ \left[ W_{2_n}^T \left( \lambda_{k+1}^T \frac{\partial f}{\partial u_k} + \frac{\partial L}{\partial u_k} \right)^T \right] \odot \operatorname{dtanh}\left( W_{1_n} x_k + w^b_{1_n} \right) x_k^T \right\}$$

$$W_{2_{n+1}} = W_{2_n} - \gamma \sum_{k=t}^{t_{final}} \left\{ \left( \lambda_{k+1}^T \frac{\partial f}{\partial u_k} + \frac{\partial L}{\partial u_k} \right)^T \tanh^T\left( W_{1_n} x_k + w^b_{1_n} \right) \right\} \qquad (27)$$

$$w^b_{1_{n+1}} = w^b_{1_n} - \gamma \sum_{k=t}^{t_{final}} \left\{ \left[ W_{2_n}^T \left( \lambda_{k+1}^T \frac{\partial f}{\partial u_k} + \frac{\partial L}{\partial u_k} \right)^T \right] \odot \operatorname{dtanh}\left( W_{1_n} x_k + w^b_{1_n} \right) \right\}$$

$$w^b_{2_{n+1}} = w^b_{2_n} - \gamma \sum_{k=t}^{t_{final}} \left( \lambda_{k+1}^T \frac{\partial f}{\partial u_k} + \frac{\partial L}{\partial u_k} \right)^T$$
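To make the training loop concrete, here is a minimal sketch of one BPTT epoch (steps 1–3 of the procedure above, with the updates (27) expressed as generic gradient accumulation). It reuses the fc, discrete_jacobians, nn_forward, and nn_jacobians sketches; the learning rate gamma is illustrative, and the per-weight adaptive rates of footnote 1 are omitted:

```python
def bptt_epoch(x0, W1, wb1, w2, wb2, gamma=1e-4, T=500):
    """One forward/backward BPTT pass with a gradient-descent weight update."""
    xs, us, zs = [x0], [], []
    for _ in range(T):                       # step 1: forward simulation
        u, z = nn_forward(xs[-1], W1, wb1, w2, wb2)
        us.append(u); zs.append(z)
        xs.append(xs[-1] + dt * fc(xs[-1], u))
    lam = np.zeros(6)                        # lambda_{t_final} = 0
    gW1 = np.zeros_like(W1); gwb1 = np.zeros_like(wb1)
    gw2 = np.zeros_like(w2); gwb2 = 0.0
    for k in reversed(range(T)):             # step 2: adjoint pass (20)-(21)
        dfdx, dfdu = discrete_jacobians(xs[k], us[k])
        dNN_dx, dW1, dw2, dwb1, dwb2 = nn_jacobians(xs[k], W1, w2, zs[k])
        dLdx = 2.0 * Q @ xs[k]               # (dL/dx)^T
        dLdu = 2.0 * R[0, 0] * us[k]         # dL/du for scalar control
        e = (lam @ dfdu).item() + dLdu       # lambda_{k+1}^T df/du + dL/du
        gW1 += e * dW1; gwb1 += e * dwb1; gw2 += e * dw2; gwb2 += e * dwb2
        lam = (dfdx + dfdu @ dNN_dx[None, :]).T @ lam + dLdx + dLdu * dNN_dx
    W1 -= gamma * gW1; wb1 -= gamma * gwb1   # step 3: gradient descent
    w2 -= gamma * gw2; wb2 -= gamma * gwb2
    return W1, wb1, w2, wb2
```

Repeating this epoch over many initial states x0 corresponds to step 4 of the procedure.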

In conclusion to this section, it should be mentioned that the number of elements in the NN hidden layer affects the optimality of the control design. On the one hand, the more elements, the better the theoretically achievable function approximation capabilities of the NN. On the other hand, too many elements make gradient descent in the space of NN weights more prone to getting stuck in local minima (recall that this is a non-convex problem), thus preventing good approximation levels from being reached. A reasonable compromise is therefore usually necessary.

Figure 5: Neural network + LQR control diagram

4.4 Neural Network Control + LQR/SDRE

In the previous section, a NN was optimized directly to stabilize the DIPC at minimum cost (7)–(8). To achieve this, the NN was trained to generate optimal controls over a wide range of pendulum motion. As illustrated in the Simulation Results section, the NN approximation capabilities, limited by the number of neurons and by the numerical challenges of non-convex minimization, achieved stabilization only within a range comparable to that of the LQR. One might ask whether it is possible to make the NN training more efficient and faster. Simple logic says that if the DIPC were already stabilized close to optimally, the NN would not have much to learn, since its optimal output would be close to zero. In the trivial case, if the DIPC were optimally stabilized by an internal controller, the optimal NN output would be zero. Now recall that the LQR provides near-optimal control in the vicinity of the equilibrium. Thus, if we include the LQR in the control scheme as shown in Figure 5, it is reasonable to expect better overall performance: the NN is trained to generate only "corrections" to the LQR controls to provide optimality over the wide range of pendulum motion.

Although an LQR is shown in Figure 5, the SDRE controller can be used in its place as well. As demonstrated in the Simulation Results section, the SDRE control provides superior performance over the LQR in terms of minimized cost and range of stability. Thus, it is natural to expect better performance of the control design shown in Figure 5 when the SDRE is used in place of the LQR. Since the LQR and the SDRE have a lot in common (in fact, the SDRE is often called a nonlinear extension of the LQR), both cases are discussed in this section, and clarification is provided where the differences between the two call for it.

The problem solved in this section is nearly the same as in the previous section; therefore, almost all the formulas stay the same or need only minor changes.

Since the control of the cart is the sum of the NN output and the LQR (SDRE) control,

$$u_k = NN(w, x_k) + K x_k$$


where, in the case of the SDRE, K ≡ K(x_k). Minimization of the augmented cost function (19) by taking derivatives of H with respect to x_k and w yields slightly modified Euler-Lagrange equations:

$$\lambda_k = \left[ \frac{\partial f}{\partial x_k} + \frac{\partial f}{\partial u_k} \left( \frac{\partial NN}{\partial x_k} + K \right) \right]^T \lambda_{k+1} + \left[ \frac{\partial L}{\partial x_k} + \frac{\partial L}{\partial u_k} \left( \frac{\partial NN}{\partial x_k} + K \right) \right]^T \qquad (28)$$

Note that when the SDRE is used instead of the LQR, ∂(K(x_k)x_k)/∂x_k should be used in place of K in the above equation. These equations correspond to the adjoint system shown graphically in Figure 6 (compare with Figure 3), with the optimality condition (21).

Figure 6: Adjoint NN + LQR system diagram

The training procedure for the NN is the same as in the previous section, and all the system Jacobians and the NN Jacobian are computed in the same way. The only new item in this section is the SDRE Jacobian ∂(K(x_k)x_k)/∂x_k, which appears in the Euler-Lagrange equations (28). This Jacobian is computed either numerically, which may be computationally expensive, or approximately as

$$\frac{\partial \left( K(x_k) x_k \right)}{\partial x_k} \approx K(x_k)$$

Experiments in the Simulation Results section illustrate the superior performance of this combination control over pure NN or LQR control. For the SDRE, a noticeable cost reduction is achieved only near critical pendulum deflections (close to the boundaries of the SDRE recovery region).
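As a sketch, the combined controller of Figure 5 is one line on top of the earlier pieces; here K_fb is an illustrative name for the full signed feedback row (e.g. −K from the LQR sketch, or the gain recomputed by sdre_control at each step), matching the sign convention of the equation above:

```python
def nn_plus_feedback_control(x, W1, wb1, w2, wb2, K_fb):
    """Combined control u_k = NN(w, x_k) + K x_k (Figure 5)."""
    u_nn, _ = nn_forward(x, W1, wb1, w2, wb2)
    return u_nn + (K_fb @ x).item()
```

During training, only the NN weights are adapted; the conventional gain is held fixed (LQR) or recomputed from the SDRE at each time step.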

Remark. The idea of using NNs to adjust the outputs of a conventional controller to account for differences between the actual system and the model used in the conventional control design has also been employed in a number of works [15, 14, 2, 11]. Calise, Rysdyk, and Johnson [2, 11] used a NN controller to supplement an approximate feedback linearization control of a fixed-wing aircraft and a helicopter. The NN provided additional controls to match the response of the vehicle to the reference model, compensating for the approximations in the autopilot design. Wan and Bogdanov [15, 14] designed a model predictive neural control (MPNC) for an autonomous helicopter, where a NN worked in a pair with the SDRE autopilot and provided minimization of a quadratic cost function over a receding horizon. The SDRE control was designed for a simplified nonlinear model, and the NN made it possible to compensate for wind disturbances and higher-order terms unaccounted for in the SDRE design.

4.5 Receding Horizon Neural Network Control

Is it possible to further improve the control scheme designed in the previous section? Recall that the NN was trained numerous times over a wide range of pendulum angles to minimize the cost functional (19). The complexity of the approach is a function of the final time t_final, which determines the length of a training epoch. The solution suggested in this section is to apply a receding horizon framework and train the NN over a limited range of pendulum angles, along a trajectory starting at the current point and having a relatively short duration (horizon) N. This is accomplished by rewriting the cost function (19) as

$$H = \sum_{k=t}^{t+N-1} \left\{ L(x_k, u_k) + \lambda_k^T \left( x_{k+1} - f(x_k, u_k) \right) \right\} + V(x_{t+N})$$

where the last term V(x_{t+N}) denotes the cost-to-go from time t + N to time t_final. Since the range of NN inputs in this case is limited and the horizon length N is short, less time is required to train the NN. However, only local minimization of the cost function (19) is achieved: started from a significantly different initial condition, the NN will not provide cost minimization, since it was not trained for it. This issue is addressed by periodically retraining the NN: after a period of time called the update interval, the NN is retrained, taking the current state vector as the initial one. The update interval is usually significantly shorter than the horizon; in classical model predictive control (MPC) it is often only one time step. This technique was applied to a helicopter control problem [15, 14] and is referred to as model predictive neural control (MPNC).

In practice, the true value of V(x_{t+N}) is unknown and must be approximated. Most common is to simply set V(x_{t+N}) = 0; however, this may lead to reduced stability and poor performance for short horizon lengths [9]. Alternatively, we may include a control Lyapunov function (CLF), which guarantees stability if the CLF is an upper bound on the cost-to-go, and results in a region of attraction for the MPC at least that of the CLF [9]. The cost-to-go V(x_{t+N}) can be approximated using the solution of the SDRE at time t + N:

$$V(x_{t+N}) \approx x_{t+N}^T P(x_{t+N}) x_{t+N}$$

This CLF provides the exact cost-to-go for regulation assuming a linear system at the horizon time. A similar formulation was used for nonlinear regulation by Cloutier et al. [13].

All the equations from the previous section apply here as well. The Euler-Lagrange equations (20) are initialized in this case as

$$\lambda_{t+N} = \left( \frac{\partial V(x_{t+N})}{\partial x_{t+N}} \right)^T \approx P(x_{t+N}) x_{t+N}$$

where the dependence of the SDRE solution P on the state vector x_{t+N} is neglected.
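A sketch of the CLF terminal term and the adjoint initialization above, reusing the SDC construction from the SDRE sketch; sdre_P is an illustrative helper, and the state dependence of P is neglected exactly as in the report's approximation:

```python
def sdre_P(x):
    """Riccati solution P(x) of (14) for the SDC matrices (18) at x."""
    th, thd = x[:3], x[3:]
    Dinv = np.linalg.inv(D(th))
    Gsd = np.diag([0.0,
                   -f1 * np.sinc(th[1] / np.pi),
                   -f2 * np.sinc(th[2] / np.pi)])
    Ax = np.block([[np.zeros((3, 3)), np.eye(3)],
                   [-Dinv @ Gsd, -Dinv @ C(th, thd)]])
    Bx = np.concatenate([np.zeros(3), Dinv @ H]).reshape(6, 1)
    return solve_discrete_are(np.eye(6) + Ax * dt, Bx * dt, Q, R)

def terminal_cost_and_lambda(xN):
    """CLF terminal cost V(x_{t+N}) ~ x' P(x) x and lambda_{t+N} ~ P(x) x."""
    P = sdre_P(xN)
    return xN @ P @ xN, P @ xN
```

The backward adjoint pass of the BPTT sketch then simply starts from this lambda instead of zero, and the sum runs over the horizon N instead of t_final.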


Stability of the MPNC is closely related to that of traditional MPC. Ideally, in the case of unconstrained optimization, stability is guaranteed provided V(x_{t+N}) is a CLF and an (incremental) upper bound on the cost-to-go [9]. In this case, the minimum region of attraction of the receding horizon optimal control is determined by the CLF used and by the horizon length. The guaranteed region of operation contains that of the CLF controller and may be made as large as desired by increasing the optimization horizon (restricted to the infinite-horizon domain) [10]. In our case, the minimum region of attraction of the receding horizon MPNC is determined by the SDRE solution used as the CLF to approximate the terminal cost. In addition, we also restrict the controls to the form given in subsection 4.4, and the optimization is performed with respect to the NN weights w. In theory, the universal mapping capability of NNs implies that the stability guarantees are equivalent to those of the traditional MPC framework. In practice, however, stability is affected by the chosen size of the NN (which determines the actual mapping capabilities), as well as by the horizon length N and the update interval length (how often the NN is re-optimized). When the horizon is short, performance is more affected by the chosen CLF; when the horizon is long, performance is limited by the NN properties. An additional factor affecting stability is the specific algorithm used for numerical optimization. Gradient descent, which we use to minimize the cost function of subsection 4.5, is guaranteed to converge only to a local minimum (the cost function is not guaranteed to be convex with respect to the NN weights) and thus depends on the initial conditions. In addition, convergence is assured only if the learning rate is kept sufficiently small. To summarize these points: stability of the MPNC is guaranteed under certain restricted, ideal conditions. In practice, the designer must select appropriate settings and perform sufficient experiments to assure stability and performance.

4.6 Neural Network Adaptive Control

What if the pendulum parameters are unknown? In this case the system Jacobians (22)–(23) cannot be computed directly. A solution is then to use an additional NN, called an "emulator", to emulate the pendulum dynamics (see Figures 7 and 8). This NN-emulator is trained to recreate the pendulum outputs given the control input and the current state vector. The regular error back-propagation (BP) algorithm is used if the NN-emulator is connected as in Figure 7, and BPTT is employed in the case depicted in Figure 8. The NN-regulator training procedure is the same as in the subsection on NN learning control, but instead of the system Jacobians (22)–(23), the NN-emulator Jacobians computed as in (25) are used (see Figure 9).

Figure 7: Neural network adaptive control diagram

Figure 8: Neural network adaptive control diagram (with BPTT through the emulator)

Figure 9: Adjoint NN adaptive system diagram

5 Simulation Results

To evaluate the control performance, two sets of simulations were conducted. In the first set, the LQR, SDRE, NN, and NN+LQR/SDRE control schemes were tested to stabilize the DIPC with both pendulums initially deflected from the vertical position. The system parameters used in the simulations are presented in Table 1. A slight deflection (10 degrees in opposite directions or 20 degrees in the same direction) results in very similar performance of all the control approaches, as predicted (see Fig. 10–11).

Larger deflections (15 degrees in opposite directions or 30 degrees in the same direction) reveal differences between the LQR and the SDRE: the latter provides a lower cost (7)–(8) and keeps the pendulum velocities lower (see Fig. 12–13 and Table 2). The NN-based controllers behave similarly to the SDRE.

Note that the LQR could not recover the pendulums starting from a 19 deg. initial deflection in opposite directions or a 36 deg. deflection in the same direction. The critical cases for the LQR (pendulums starting at 18 deg. deflections in opposite directions from the upward position and 35 deg. in the same direction) are presented in Fig. 14–15. The NN control provides consistently better cost than the LQR at these larger deflections, but it cannot stabilize the system beyond 15 (38) deg. pendulum deflections in the opposite (same) directions, due to the limited approximation capabilities of the NN.

Table 1: Simulation parameters

Parameter   Value
m0          1.5 kg
m1          0.5 kg
m2          0.75 kg
L1          0.5 m
L2          0.75 m
∆t          0.02 s
Q           diag(5, 50, 50, 20, 700, 700)
R           1
Nh          40
t_final     500

Table 2: Control performance (cost) comparison

            Deflection of pendulums, deg.
            Opposite direction              Same direction
Control     10      15      18      28      20     30      35      67
LQR         115.2   328.4   705.0   n/r     36.7   108.3   325.3   n/r
SDRE        112.9   277.3   437.5   3655.8  36.0   84.0    118.0   5250.5
NN          114.1   323.4   n/r     n/r     36.9   85.7    144.8   n/r
NN+LQR      113.3   275.7   448.3   n/r     37.3   86.2    136.4   n/r
NN+SDRE     112.5   276.3   436.6   2753.7  36.0   84.0    118.0   n/r

Table 3: Regions of pendulum recovery

Control     Recovery region, deg.
            Max. opposite dir.   Max. same dir.
LQR         18                   35
SDRE        28                   67
NN          15                   38
NN+LQR      21                   40
NN+SDRE     28                   62

In the second set of experiments, the regions of pendulum recovery (i.e., the maximum initial pendulum deflections from which a control law can bring the DIPC back to the equilibrium) were evaluated. The results are shown in Table 3. The critical initial pendulum positions for the SDRE (28 deg. deflections in opposite directions or 67 deg. in the same direction) are presented for reference in Fig. 16–17. Note that no limits were imposed on the control force magnitude in any of the simulations. The purpose was to compare the LQR and SDRE without introducing model changes that are not directly accounted for in the control design. It should be mentioned, however, that limited control authority and ways of taking it into account in the SDRE design were investigated earlier [3], but are left beyond our present scope.

6 Conclusions

This report demonstrated the potential advantage of the SDRE technique over the LQR design in nonlinear optimal control problems with an underactuated plant. The region of pendulum recovery for the SDRE appeared to be 55 to 91 percent larger than in the case of LQR control.

Direct optimization using neural networks yields results superior to the LQR, but the recovery region is about the same as in the LQR case. This happens due to the limited approximation capabilities of the NN and the challenges of non-convex numerical optimization.

Combining the NN control with the LQR (or with the SDRE) provides larger recovery regions and better overall performance. In this case the NN learns to generate corrections to the LQR (SDRE) control to compensate for the suboptimality of the LQR (SDRE).

To enhance this report, it would be valuable to investigate taking limited control authority into account in all the control designs.

References

[1] R. W. Brockett and H. Li. A light weight rotary double pendulum: maximizing the domain of attraction. In Proceedings of the 42nd IEEE Conference on Decision and Control, Maui, Hawaii, December 2003.

[2] A. Calise and R. Rysdyk. Nonlinear adaptive flight control using neural networks. IEEE Control Systems Magazine, 18(6), December 1998.

[3] J. R. Cloutier, C. N. D'Souza, and C. P. Mracek. Nonlinear regulation and nonlinear H-infinity control via the state-dependent Riccati equation technique: Part 1, theory. In Proceedings of the International Conference on Nonlinear Problems in Aviation and Aerospace, Daytona Beach, FL, May 1996.

[4] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4), 1989.

[5] G. F. Franklin, J. D. Powell, and A. Emami-Naeini. Feedback Control of Dynamic Systems. Addison-Wesley, 2nd edition, 1991.

[6] K. Furuta, T. Okutani, and H. Sone. Computer control of a double inverted pendulum. Computers and Electrical Engineering, 5:67–84, 1978.

[7] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward neural networks are universal approximators. Neural Networks, 2:359–366, 1989.

[8] R. A. Jacobs. Increasing rates of convergence through learning rate adaptation. Neural Networks, 1(4):295–307, 1988.

[9] A. Jadbabaie, J. Yu, and J. Hauser. Stabilizing receding horizon control of nonlinear systems: a control Lyapunov function approach. In Proceedings of the American Control Conference, 1999.

[10] A. Jadbabaie, J. Yu, and J. Hauser. Unconstrained receding horizon control of nonlinear systems. In Proceedings of the IEEE Conference on Decision and Control, 1999.

[11] E. Johnson, A. Calise, R. Rysdyk, and H. El-Shirbiny. Feedback linearization with neural network augmentation applied to X-33 attitude control. In Proceedings of the AIAA Guidance, Navigation, and Control Conference, August 2000.

[12] W. T. Miller, R. S. Sutton, and P. J. Werbos. Neural Networks for Control. MIT Press, Cambridge, MA, 1990.

[13] M. Sznaier, J. Cloutier, R. Hull, D. Jacques, and C. Mracek. Receding horizon control Lyapunov function approach to suboptimal regulation of nonlinear systems. Journal of Guidance, Control, and Dynamics, 23(3):399–405, May–June 2000.

[14] E. Wan, A. Bogdanov, R. Kieburtz, A. Baptista, M. Carlsson, Y. Zhang, and M. Zulauf. Model predictive neural control for aggressive helicopter maneuvers. In T. Samad and G. Balas, editors, Software Enabled Control: Information Technologies for Dynamical Systems, chapter 10, pages 175–200. IEEE Press, John Wiley & Sons, 2003.

[15] E. A. Wan and A. A. Bogdanov. Model predictive neural control with applications to a 6 DOF helicopter model. In Proceedings of the IEEE American Control Conference, Arlington, VA, June 2001.

[16] P. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, special issue on neural networks, 78(10):1550–1560, October 1990.

[17] W. Zhong and H. Rock. Energy and passivity based control of the double inverted pendulum on a cart. In Proceedings of the IEEE International Conference on Control Applications, Mexico City, Mexico, September 2001.

Figures 10–17 plot the bottom and top pendulum angles θ1, θ2 and velocities dθ1/dt, dθ2/dt, the cart position θ0 and velocity dθ0/dt, and the control force for the compared controllers:

Figure 10: 10 deg. opposite direction

Figure 11: 20 deg. same direction

Figure 12: 15 deg. opposite direction

Figure 13: 30 deg. same direction

Figure 14: 18 deg. opposite direction

Figure 15: 35 deg. same direction

Figure 16: 28 deg. opposite direction, SDRE vs. NN+SDRE

Figure 17: 67 deg. same direction, SDRE