
OPTIMAL CONTROL APPLICATIONS AND METHODS
Optim. Control Appl. Meth. (2011)
Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/oca.1031

An approximate dynamic programming method for multi-input multi-output nonlinear system

Zhijian Huang 1,2,3,*,†, Jie Ma 1 and He Huang 1

1 Ocean Engineering State Key Laboratory, Shanghai Jiao Tong University, Shanghai 200240, China

2 Merchant Marine College, Shanghai Maritime University, Shanghai 200135, China
3 IMS Automotive Electronic Systems Co. Ltd., Shanghai 200335, China

    SUMMARY

Approximate dynamic programming is a useful tool for solving multi-stage decision optimal control problems. In this work, we first extend the action-dependent heuristic dynamic programming method to multi-input multi-output control systems by extending its action network to a multi-output form. The detailed derivation is also given. We then apply this method to the fluctuation control of a spark engine idle speed. An engine idling model is set up to verify the control effect of this method. The results show that this method requires only several iterations to suppress unbalanced combustion by manipulating the spark ignition timing. This method provides an alternative, simpler multi-input multi-output approximate dynamic programming scheme. Moreover, it has a faster iteration convergence effect, and its derivation has a rigorous mathematical basis. Although illustrated for engines, this control system framework should also be applicable to general multi-input multi-output nonlinear systems. Copyright © 2011 John Wiley & Sons, Ltd.

    Received 16 January 2011; Revised 31 August 2011; Accepted 3 November 2011

KEY WORDS: approximate dynamic programming; adaptive critic designs; fluctuation control of an engine idling; neural network; multi-input multi-output system

    1. INTRODUCTION

Controller design has evolved from conventional feedback controllers through adaptive controllers to intelligent controllers. A conventional feedback controller, such as a PID controller, can only be applied to a linear or nonlinear single-input single-output system; it is still widely used for its simplicity. An adaptive controller, such as an adaptive inverse controller, can overcome uncertainties in model parameters and disturbances that a conventional feedback controller cannot. An intelligent controller, such as an approximate dynamic programming (ADP) controller, offers a way out for complex nonlinear system control problems. When an object model is difficult to obtain, either because of complexity or the numerous uncertainties inherent in the system, conventional techniques are less useful. The intelligent ADP controller may offer a useful alternative in these cases [1].

An intelligent control system learns from experience. As such, intelligent control systems are also adaptive. However, adaptive systems are not necessarily intelligent. An intelligent system improves its control strategy on the basis of past experience or performance. In other words, an adaptive system regards the current state as novel, whereas an intelligent system correlates experience gained at previous plant operating regions with the current state and then modifies its behavior accordingly for a more long-term effect [1].

*Correspondence to: Zhijian Huang, Ocean Engineering State Key Laboratory, Shanghai Jiao Tong University, Shanghai 200240, China.
†E-mail: [email protected]



In addition, the ADP controller utilizes the trial-and-error mechanism of human and animal learning. This learning system actively explores the environment and then adjusts the controller according to the exploration results, so it can implement unsupervised online training. The action-critic learning of the ADP controller is an important approach in reinforcement learning and attempts to find the optimal action and the utility function simultaneously [2].

Bellman first proposed the dynamic programming method [3]. Bellman and his colleagues researched the feedback decision concept called dynamic programming and applied it to economic, engineering, operations research, and mathematical problems [4]. However, the computation and storage requirements of dynamic programming increase exponentially as extra dimensions are added to the state space, because of the well-known curse of dimensionality. Thus, limited by the computational techniques of the time (around the 1960s), dynamic programming could not be used to solve many practical problems. In this sense, most of Bellman's work was ahead of its era [4].

In order to address the curse of dimensionality, Werbos proposed in 1977 an approach to ADP that was later called adaptive critic designs (ACDs) [5]. This approach uses an artificial neural network as a function approximator of the cost-to-go in dynamic programming. To implement the ACD algorithm, Werbos later introduced a means to get around this numerical complexity by using ACD formulas [6]. A particularly impressive success that greatly motivated subsequent research is the backgammon-playing program developed by Tesauro [7]. He clearly presented the concept of a critic neural network that approximates the optimal cost function in a control problem. Bertsekas and Tsitsiklis gave an overview of neuro-dynamic programming in their book [8]. In 1997, Prokhorov and Wunsch developed more algorithms based on ACDs [9]. In recent years, ADP has gained much attention from researchers. Si et al. discussed its relation to artificial intelligence, approximation theory, control theory, operations research, statistics, etc. [10]. Powell showed how ADP can address the curse of dimensionality for complex deterministic or stochastic optimization problems and pointed out future directions for ADP [11, 12].

ADP is now categorized into three major families: heuristic dynamic programming (HDP), dual HDP (DHP), and globalized DHP. In HDP, the critic network estimates the cost function using the Bellman equation. In DHP, the critic network estimates the derivative of the cost function with respect to the states of the system. Globalized DHP combines both HDP and DHP for optimization of the cost function. Each family has an action-dependent form if the action network is directly connected to the critic network [9]. For example, ADHDP and ADDHP denote action-dependent HDP and action-dependent DHP, respectively. ADHDP is one of the most widely used ADP methods because it does not need a model of the controlled object.

As for industrial applications, ADP has focused on flight control [13, 14], engine control [15-17], power systems [2, 18, 19], the auto-landing problem [20], missile systems [21, 22], artificial intelligence [23, 24], etc. Enns and Si demonstrated a model-free nonlinear multi-input multi-output (MIMO) helicopter flight controller [14]. Liu et al. promoted an ADP approach [25] and presented an engine torque and exhaust air-fuel ratio controller using an adaptive critic algorithm [15]. Shih extended stability proofs of nonlinear discrete-time systems in nonstrict feedback form and applied them to exhaust gas recirculation control of a spark engine on the basis of reinforcement learning dual control [16]. Lu et al. applied direct HDP to a damping oscillation control problem in a large power system [19]. Murray et al. utilized the ADP method for the control of an X-43 unmanned aircraft auto-landing problem [20]. Werbos applied the ADP approach to understanding and replicating brain intelligence, which may be the next level of design [23].

A closer examination of the current literature suggests that the ADP method has been successfully applied to many control areas [13-15, 18, 20, 21, 23]. Enns and Si probably, for the first time, systematically applied the ADP method to a complex continuous-state MIMO nonlinear system with uncertainty; they adopted a cascaded neural network scheme [14]. Lee and coworkers controlled a MIMO methyl methacrylate polymerization reactor with ADP, using the k-nearest neighbor averager algorithm to improve the control performance [26]. Padhi and coworkers solved a multi-critic-output control problem, such as a real-life micro-electro-mechanical system, and got around numerical problems in training with a sub-network structure [27]. Some other scholars have also adopted the ADP method in nonlinear MIMO control systems; for example, see the work


of Liu [15], Murray [20], and Lin [28]. However, most of them gave the control effect directly, without the derivation procedure and computational formulas. In truth, most actual control systems are in a nonlinear MIMO form, and we can directly utilize neither the Matlab toolbox nor its functions to approximate the critic function and the action output of the ADP method. Therefore, this work extends the ADHDP method to nonlinear MIMO systems by extending its action network to a multi-output form. The derivation procedure and computational formulas are also explored in detail. We then apply this method to the fluctuation control of an engine idling to demonstrate its effect. The magnitudes of the first and second harmonics of the discrete Fourier transform (DFT) results over the data samples in an engine cycle represent the nonuniformity revolving level. These magnitudes guide the manipulation of the spark ignition timing to suppress unbalanced combustion among cylinders. The fluctuation control of an engine idling is a complex time-varying MIMO nonlinear system. Thus, we study the correctness of these formulas and the feasibility of the ADHDP method for the nonlinear MIMO control problem with this project.

This paper is organized as follows. In Section 2, we demonstrate the principle of the ADHDP method. In Section 3, we design the ADHDP controller for a general MIMO nonlinear system. In Section 4, we set up the spark engine idling model and show the control effect with simulation results. In Section 5, we conclude this paper with a few remarks.

    2. PRINCIPLE OF APPROXIMATE DYNAMIC PROGRAMMING

2.1. Dynamic programming and cost-to-go function-based control

Dynamic programming is based on Bellman's principle of optimality: an optimal (control) strategy has the property that no matter what previous decisions have been made, the remaining decisions must constitute an optimal strategy with regard to the state resulting from those previous decisions [3].

Suppose that a discrete-time nonlinear (time-varying) dynamic system is given as [25]

x(t + 1) = F[x(t), u(t), t],   (1)

where x ∈ R^n represents the state vector of the system, u ∈ R^m denotes the control action, t is the system time step (or stage), and F_t is the state transition function of the nonlinear system. Suppose that the following performance index (or cost function) is associated with this system:

J[x(t), t] = Σ_{i=t}^{∞} γ^{i−t} r[x(i), u(i), i],   (2)

where r is called the utility function (i.e., a single-stage cost) and γ is the discount factor with 0 < γ ≤ 1. The objective is to choose the control sequence u(i), i = t, t + 1, …, so that the J function (i.e., the cost) in (2) is minimized.

Suppose that one has computed the optimal cost J*[x(t + 1), t + 1] from time t + 1 on, for all possible states x(t + 1), and that one has also found the optimal control sequences from time t + 1 on. The optimal cost results when the optimal control sequence u*(t + 1), u*(t + 2), …, is applied to the system with initial state x(t + 1). Note that the optimal control sequence depends on x(t + 1). If one applies an arbitrary control u(t) at time t and then uses the known optimal control sequence from t + 1 on, the resulting cost will be r[x(t), u(t), t] + γJ*[x(t + 1), t + 1], where x(t) is the state at time t and x(t + 1) is determined by (1). The minimum cost from time t on [25] is therefore

J*[x(t), t] = min_{u(t), u(t+1), …} ( Σ_{i=t}^{∞} γ^{i−t} r[x(i), u(i), i] )
            = min_{u(t)} { r[x(t), u(t), t] + γJ*[x(t + 1), t + 1] }.   (3)

The optimal control u*(t) at time t is the u(t) that achieves this minimum, that is,

u*(t) = arg min_{u(t)} { r[x(t), u(t), t] + γJ*[x(t + 1), t + 1] }.   (4)


Equation (4) is the principle of optimality for discrete-time systems. Its importance lies in the fact that any strategy of action that minimizes J in the short term will also minimize the sum of r over all future times [25].
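To make the recursion in (3) and (4) concrete, the following minimal sketch runs value iteration on a toy finite-state problem; the sizes, the random cost table, and names such as n_states are illustrative assumptions, not part of the paper.

import numpy as np

n_states, n_actions, gamma = 5, 3, 0.95
rng = np.random.default_rng(0)
r = rng.random((n_states, n_actions))                 # utility r[x, u]
F = rng.integers(0, n_states, (n_states, n_actions))  # next state x' = F[x, u]

J = np.zeros(n_states)                                # cost-to-go estimate J*[x]
for _ in range(200):                                  # sweep the recursion (3) to a fixed point
    J = np.min(r + gamma * J[F], axis=1)
u_star = np.argmin(r + gamma * J[F], axis=1)          # greedy policy, as in (4)

ADP replaces the explicit table J above with a trained function approximator, which is what avoids the curse of dimensionality discussed later.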

    2.2. Approximate dynamic programming

In order to solve dynamic programming problems more effectively, the ADP method appeared. Conventional ADP contains three basic modules: critic, model, and action. Each of the three modules can be implemented using a neural network. By combining the critic network and the model network to form a new critic network, we obtain a form of ADHDP in which the critic network implicitly includes a model network [25] (Figure 1).

    Define the future accumulated cost at time t [29] as

R(t) = r(t + 1) + γ r(t + 2) + ⋯   (5)

In the new structure, the critic network outputs an estimate of the function J(t + 1) in (2), that is, R(t) in (5). This is carried out by training the critic network to minimize the following error measure over time [25]:

‖E_c‖ = Σ_t ½ e_c²(t) = Σ_t ½ [γQ(t) − Q(t − 1) + r(t)]²,   (6)

where Q(t) = Q[x(t), u(t), t, w_c] and w_c is the parameter of the new critic network. When E_c(t) = 0 for all t, (6) implies that

Q(t) = r(t + 1) + γ Q(t + 1)
     = r(t + 1) + γ r(t + 2) + γ² Q(t + 2)
     = ⋯
     = Σ_{i=t+1}^{∞} γ^{i−t−1} r(i).   (7)

Clearly, comparing (7) with (2), we now have Q(t) = J[x(t + 1), t + 1]. Therefore, by minimizing the error function in (6), we obtain a neural network whose output is an estimate of the cost defined in (2) for i = t + 1, that is, the value of the cost function in the immediate future.
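The error measure in (6) is an ordinary temporal-difference residual. A minimal sketch of it, with q_t, q_prev, and r_t standing in for Q(t), Q(t − 1), and r(t):

def critic_td_error(q_t, q_prev, r_t, gamma=0.95):
    # e_c(t) = γQ(t) − [Q(t − 1) − r(t)]; driving this to zero enforces
    # Q(t − 1) = r(t) + γQ(t), which is exactly the recursion in (7).
    return gamma * q_t - (q_prev - r_t)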

The new critic network maps a state-action pair to the cost-to-go value. The optimal Q function satisfies [30]

Q*(x(t), u(t)) = r[x(t), u(t), t] + γ min_{u(t+1)} Q*(x(t + 1), u(t + 1)).   (8)

The action network is trained after the critic network; its training objective is to drive the critic output toward zero,

E_a(t) = Q(t) → 0.   (9)

Figure 1. Three modules in a typical approximate dynamic programming scheme and a new critic network for action-dependent heuristic dynamic programming.


Once the optimal Q function is known, the optimal control policy u*(t) can be easily obtained by [30]

u*(t) = arg min_{u(t)} Q*(x(t), u(t)).   (10)

This strategy indirectly enables the action network to produce the optimal control actions of (10) and zero utility, that is, r(t) = 0, so that the critic network's output is as close to 0 as possible [1]. Consequently, the sum of r(t) and γJ(t + 1) in (4) at this stage is also simultaneously minimized. This is how the ADHDP method achieves the optimum of (4) while avoiding the curse of dimensionality.

3. DESIGN OF MULTI-INPUT MULTI-OUTPUT ACTION-DEPENDENT HEURISTIC DYNAMIC PROGRAMMING CONTROLLER

Figure 2 shows the schematic diagram of the principle of the MIMO ADHDP method. In it, the critic network is used to approximate the cost function R(t) in (5), and the action network is used to output the optimal control vector u(t). These optimal outputs control the nonlinear MIMO object, and the controlled object produces the state vector x(t) = [x_1(t), x_2(t), …, x_n(t)] at this time. This state vector is fed to the critic and action networks simultaneously. The action network is extended to the multi-output form u(t) = [u_1(t), u_2(t), …, u_m(t)] and is connected to the critic network together with the state vector. w_c and w_a represent the weights of the critic and action networks, respectively. The dashed lines are the paths for tuning the critic and action network weights.

Thus, this paper extends Si's ADHDP controller [29] to a multi-output action network form and obtains the following critic and action network design.

    3.1. The critic network

The symbols are shown in Figures 2 and 3. The critic network has more than one action input, which differs from Si's design [29].

    We define the prediction error for the critic network as follows:

e_c(t) = γQ(t) − [Q(t − 1) − r(t)],   (11)

E_c(t) = ½ e_c²(t).   (12)

The weight update for the critic network follows the gradient descent rule that minimizes E_c(t):

w_c(t + 1) = w_c(t) + Δw_c(t),   (13)

Figure 2. The schematic diagram demonstrating the principle of multi-input multi-output action-dependent heuristic dynamic programming (ADHDP). The solid lines represent signal flow, whereas the dashed lines are the paths for weight tuning.


Figure 3. The schematic diagram for the implementation of the multi-input action-dependent heuristic dynamic programming critic network using a feed-forward network with one hidden layer.

Δw_c(t) = l_c(t) [−∂E_c(t)/∂w_c(t)] = l_c(t) [−(∂E_c(t)/∂Q(t)) (∂Q(t)/∂w_c(t))],   (14)

where l_c(t) > 0 is the learning rate of the critic network at time t, which decreases with time to a small value.
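A sketch of the update rule (13)-(14) as a single gradient step; the 1/t-style decay is an assumed schedule, since the paper only states that l_c(t) decreases with time:

def critic_weight_step(w_c, grad_Ec, t, lc0=0.3):
    lc_t = lc0 / (1.0 + 0.1 * t)       # assumed decay form for l_c(t) > 0
    return w_c + lc_t * (-grad_Ec)     # (13): w_c(t+1) = w_c(t) + Δw_c(t), Δw_c = −l_c ∂E_c/∂w_c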

    3.2. The action network

The symbols are shown in Figures 2 and 4. The action network has more than one action output, which differs from Si's design [29]. Let

e_a(t) = Q(t).   (15)

Figure 4. The schematic diagram for the implementation of the multi-output action-dependent heuristic dynamic programming action network using a feed-forward network with one hidden layer.


The weights in the action network are updated to minimize the following performance error measure:

E_a(t) = ½ e_a²(t).   (16)

The weight update method for the action network is then similar to the one for the critic network. According to the gradient descent rule,

w_a(t + 1) = w_a(t) + Δw_a(t),   (17)

Δw_a(t) = l_a(t) [−∂E_a(t)/∂w_a(t)] = l_a(t) [−(∂E_a(t)/∂Q(t)) (∂Q(t)/∂u(t)) (∂u(t)/∂w_a(t))],   (18)

where l_a(t) > 0 is the learning rate of the action network at time t, which decreases with time to a small value.
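The chain rule in (18) is the only place where the MIMO extension changes the computation: ∂Q(t)/∂u(t) is now a vector over the m action components. A sketch under that reading, with illustrative shapes and names:

import numpy as np

def action_grad(Q_t, dQ_du, du_dwa):
    # ∂E_a/∂w_a = (∂E_a/∂Q)(∂Q/∂u)(∂u/∂w_a); with E_a = ½Q² the first factor is Q(t).
    # dQ_du: (m,) critic sensitivity to each action; du_dwa: (m, n_wa) action Jacobian.
    return Q_t * (dQ_du @ du_dwa)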

    3.3. Online learning algorithms

The Δw(t) weight update algorithm of the critic and action networks is the key of the MIMO ADHDP design. In order to satisfy the chain rule, we adopt only derivatives of a scalar with respect to a scalar, which gives the method a rigorous mathematical basis.

This work lets s be the input vector of the critic network, as in (19). This differs from Si's design [29] because the input vector contains more than one action variable u(t) from the action network outputs.

As for the critic network, from input to output, we have

s = [x_1(t), x_2(t), …, x_n(t), u_1(t), u_2(t), …, u_m(t)],   (19)

q_i(t) = [w_{c,i1}^{(1)}(t), w_{c,i2}^{(1)}(t), …, w_{c,in}^{(1)}(t), w_{c,i,n+1}^{(1)}(t), w_{c,i,n+2}^{(1)}(t), …, w_{c,i,n+m}^{(1)}(t)] s^T = Σ_{j=1}^{n+m} w_{c,ij}^{(1)}(t) s_j(t),   (20)

p_i(t) = [1 − exp(−q_i(t))] / [1 + exp(−q_i(t))],   (21)

Q(t) = [w_{c,1}^{(2)}(t), w_{c,2}^{(2)}(t), …, w_{c,N_h1}^{(2)}(t)] [p_1(t), p_2(t), …, p_{N_h1}(t)]^T = Σ_{i=1}^{N_h1} w_{c,i}^{(2)}(t) p_i(t),   (22)

where i = 1, 2, …, N_h1 indexes the hidden nodes of the critic network, j = 1, 2, …, n + m indexes the input variables of the critic network input layer, q_i is the i-th hidden node input of the critic network, and p_i is the i-th hidden node output of the critic network.
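A sketch of the critic forward pass (19)-(22) in NumPy; the sizes n, m, and N_h1 echo Section 4's engine example but are otherwise illustrative:

import numpy as np

n, m, Nh1 = 2, 4, 12
rng = np.random.default_rng(1)
W1 = rng.uniform(-1, 1, (Nh1, n + m))   # w_c^(1): input -> hidden
W2 = rng.uniform(-1, 1, Nh1)            # w_c^(2): hidden -> output

def critic_forward(x, u):
    s = np.concatenate([x, u])                  # (19) joint state-action input
    q = W1 @ s                                  # (20) hidden node inputs
    p = (1 - np.exp(-q)) / (1 + np.exp(-q))     # (21) bipolar sigmoid, equal to tanh(q/2)
    Q = W2 @ p                                  # (22) scalar cost-to-go estimate
    return Q, p, s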

(1) w_{c,i}^{(2)}(t) (hidden to output layer). According to (14), the hidden-to-output layer update is

Δw_{c,i}^{(2)}(t) = l_c(t) [−∂E_c(t)/∂w_{c,i}^{(2)}(t)] = l_c(t) [−(∂E_c(t)/∂Q(t)) (∂Q(t)/∂w_{c,i}^{(2)}(t))] = −l_c(t) γ e_c(t) p_i(t).   (23)

(2) w_{c,ij}^{(1)}(t) (input to hidden layer).


According to (14), the input-to-hidden layer update is

Δw_{c,ij}^{(1)}(t) = l_c(t) [−∂E_c(t)/∂w_{c,ij}^{(1)}(t)]
                 = l_c(t) [−(∂E_c(t)/∂Q(t)) (∂Q(t)/∂p_i(t)) (∂p_i(t)/∂q_i(t)) (∂q_i(t)/∂w_{c,ij}^{(1)}(t))]
                 = −l_c(t) γ e_c(t) w_{c,i}^{(2)}(t) [½ (1 − p_i²(t))] s_j(t).   (24)
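Equations (23) and (24) translate directly into the following update sketch, continuing the forward-pass sketch above; the factor γ comes from differentiating e_c(t) in (11), and all names remain illustrative:

import numpy as np

def critic_update(W1, W2, s, p, e_c, lc=0.3, gamma=0.95):
    # (23) hidden-to-output: Δw2_i = −l_c γ e_c p_i
    dW2 = -lc * gamma * e_c * p
    # (24) input-to-hidden: Δw1_ij = −l_c γ e_c w2_i ½(1 − p_i²) s_j
    dW1 = -lc * gamma * e_c * np.outer(W2 * 0.5 * (1 - p ** 2), s)
    return W1 + dW1, W2 + dW2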

The layer-by-layer updates of the multi-output action network follow the same chain rule as in (18), with ∂Q(t)/∂u(t) propagated through the critic inputs u_1(t), …, u_m(t).

4. FLUCTUATION CONTROL OF AN ENGINE IDLING

4.1. The engine idling model

Manifold pressure dynamics:

ṁ_ai = (1 + 0.907θ + 0.0998θ²) G(P)

G(P) = { 1, if P ≤ 50.66;  0.0197 (101.325P − P²)^{1/2}, if P > 50.66

ṁ_ao = −0.0005968 N_m − 0.1336 P + 0.0005341 N_m P + 0.000001757 N_m P²

Ṗ = k_p (ṁ_ai − ṁ_ao), k_p = 42.40

Torque dynamics:

ω = N_m · 2π/60

T_m,i = −39.22 + 0.9061 G_i ṁ_ao − 0.0112 δ_i² + 0.000675 δ_i ω + 0.635 δ_i + 0.0216 ω − 0.000102 ω²

S_i = sin[πN_m(t/60 − (i − 1)/(2N_m))] + ½ sin[2πN_m(t/60 − (i − 1)/(2N_m))] + ⅙ sin[3πN_m(t/60 − (i − 1)/(2N_m))]

step(x) = { 1, if x > 0;  0, otherwise   (36)

T_w,i = 0.1 [step(S_i) + 1] S_i

T_I = Σ_{i=1}^{4} T_m,i (T_w,i + dT_w,i), i = 1, 2, 3, 4

T_L = (N_m/263.17)² + T_d

Engine speed:

Ṅ + 108 N = k_ac (T_I − T_L), k_ac = 2520

Ṅ_m + 0.3 N_m = k_m (T_I − T_L), k_m = 7,

where δ_1–δ_4 are the spark ignition timings for cylinders 1–4 and G_1–G_4 are used to simulate short-term fluctuations of engine speed for cylinders 1–4 caused by component differences, aging, disturbances, etc. The running effect of this simulated engine model is shown in Figure 5.

Figure 5. Examples of the engine idling model fluctuation simulation effect with different spark ignition timing and torque quasi-dc variations: (a) G_1 = G_2 = G_3 = G_4 = 1.0, δ_1 = δ_2 = δ_3 = δ_4 = 26; (b) G_1 = G_2 = G_3 = G_4 = 1.0, δ_1 = δ_3 = 26, δ_2 = 23, δ_4 = 24; (c) G_1 = 0.98, G_2 = 0.96, G_3 = 0.97, G_4 = 1.00, δ_1 = δ_2 = δ_3 = δ_4 = 26; (d) G_1 = 0.98, G_2 = 0.96, G_3 = 0.97, G_4 = 1.00, δ_1 = 26, δ_2 = 29, δ_3 = 23, δ_4 = 28.
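For reference, a forward-Euler sketch of the model above; the G(P) branch for P > 50.66 and the coefficient signs are reconstructed from the related idle-speed literature [31, 32] and should be treated as assumptions, as should the throttle angle, initial conditions, and load values:

import numpy as np

dt, theta, Td = 0.001, 8.0, 0.0                 # step size, throttle angle, load torque (assumed)
G = np.array([0.98, 0.96, 0.97, 1.00])          # per-cylinder gains, as in Figure 5(c)
delta = np.array([26.0, 26.0, 26.0, 26.0])      # spark ignition timings δ1..δ4
P, N, Nm = 35.0, 750.0, 750.0                   # manifold pressure, speed, mean speed (assumed)
idx = np.arange(1, 5)                           # cylinder indices i = 1..4

def g_of_P(P):                                  # G(P); P > 50.66 branch is an assumption
    return 1.0 if P <= 50.66 else 0.0197 * np.sqrt(max(101.325 * P - P * P, 0.0))

for k in range(5000):
    t = k * dt
    m_ai = (1 + 0.907 * theta + 0.0998 * theta ** 2) * g_of_P(P)
    m_ao = (-0.0005968 * Nm - 0.1336 * P + 0.0005341 * Nm * P
            + 0.000001757 * Nm * P ** 2)
    w = Nm * 2 * np.pi / 60
    phase = np.pi * Nm * (t / 60 - (idx - 1) / (2 * Nm))
    S = np.sin(phase) + np.sin(2 * phase) / 2 + np.sin(3 * phase) / 6
    Tw = 0.1 * ((S > 0).astype(float) + 1.0) * S           # T_w,i with step(S_i)
    Tm = (-39.22 + 0.9061 * G * m_ao - 0.0112 * delta ** 2
          + 0.000675 * delta * w + 0.635 * delta + 0.0216 * w - 0.000102 * w ** 2)
    TI = float(np.sum(Tm * Tw))                            # dT_w,i disturbance set to 0 here
    TL = (Nm / 263.17) ** 2 + Td
    P += dt * 42.40 * (m_ai - m_ao)                        # Ṗ = k_p(ṁ_ai − ṁ_ao)
    N += dt * (2520 * (TI - TL) - 108 * N)                 # Ṅ + 108N = k_ac(T_I − T_L)
    Nm += dt * (7 * (TI - TL) - 0.3 * Nm)                  # Ṅ_m + 0.3N_m = k_m(T_I − T_L)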

To evaluate the nonuniformity revolving level effectively, we apply the DFT to the data samples collected in one engine cycle. If the four periodic waves are completely uniform (Figure 5(a)), the magnitude of the first harmonic and that of the second harmonic of the DFT results over the data samples in an engine cycle will be zero [31].
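A sketch of this nonuniformity measure using the FFT; N_cycle is an illustrative name for the sampled speed values of one engine cycle:

import numpy as np

def harmonic_magnitudes(N_cycle):
    # DFT over one engine cycle of n speed samples; the magnitudes of bins 1 and 2
    # correspond to the full-period and half-period harmonics (the sign convention
    # of the kernel does not affect magnitudes).
    spectrum = np.fft.fft(np.asarray(N_cycle, dtype=float))
    return np.abs(spectrum[1]), np.abs(spectrum[2])   # Y_full, Y_half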

    4.2. Discrete-time dynamic system simulation and results

Discrete environments can guarantee that any operation that updates the cost function (according to the Bellman equation) can only reduce the error between the current cost and the optimal cost [14]. Thus, the engine idling model should be discretized before simulation to obtain a robust iteration convergence procedure.

We first estimate the torque difference among cylinders by measuring the crankshaft angular speed over one engine cycle. The nonuniformity revolving level of the engine speed is then fed back into the control system, which manipulates the spark ignition timing to suppress unbalanced combustion among the cylinders.

A given vector N̄ = (N_0, N_1, N_2, …, N_{n−1}) contains the sampled rotation speed values in one engine cycle. Let Y_full be the magnitude of the fundamental frequency in the DFT corresponding to one full period of the engine cycle, and let Y_half be the magnitude of the second harmonic frequency corresponding to half the period. The utility function is then defined as

r(t) = { 0, if c_1 Y_full + c_2 Y_half ≤ 0.3;  c_1 Y_full + c_2 Y_half, if c_1 Y_full + c_2 Y_half > 0.3,   (37)

where ω_n^k = e^{2πik/n} is the DFT kernel, |·| is the magnitude of a complex number, and c_1 and c_2 are the coefficients for the nonuniformity r(t).

The critic network is chosen as a 6-12-1 structure with six input neurons, 12 hidden layer neurons, and one output neuron. The six inputs to the critic network are the magnitude of the fundamental frequency in the DFT corresponding to one full period of the engine cycle, the magnitude of the second harmonic frequency corresponding to half the period, and the four ignition timings of engine cylinders 1-4. The hidden layer of the critic network adopts a sigmoidal function:

y = (1 − e^{−x}) / (1 + e^{−x}).   (38)

The action network is chosen as a 2-8-4 structure with two input neurons, eight hidden layer neurons, and four output neurons. The two inputs to the action network are the magnitude of the fundamental frequency in the DFT corresponding to one full period of the engine cycle and the magnitude of the second harmonic frequency corresponding to half the period. Both the hidden layer and the output layer use the sigmoidal function of (38). The outputs of the action network are the four ignition timings of engine cylinders 1-4. Thus, the fluctuation control of the engine idling is a MIMO nonlinear system.

All network weights are initialized randomly in the range (−1, +1). The discount factor is γ = 0.95. For the critic network, the learning rate is l_c = 0.3, the desired training error is T_c = 0.05, and the maximum training cycle is set to 50. For the action network, the learning rate is l_a = 0.3, the desired training error is T_a = 0.005, and the maximum training cycle is set to 500.
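Collected for reference, the reported settings in a single illustrative configuration object (the dict itself and the names in it are not from the paper):

import numpy as np

config = dict(
    critic_shape=(6, 12, 1),    # inputs: Y_full, Y_half, δ1..δ4 -> Q(t)
    action_shape=(2, 8, 4),     # inputs: Y_full, Y_half -> δ1..δ4
    gamma=0.95,                 # discount factor γ
    lc=0.3, Tc=0.05, critic_max_cycles=50,
    la=0.3, Ta=0.005, action_max_cycles=500,
)
rng = np.random.default_rng()
init_weights = lambda shape: rng.uniform(-1.0, 1.0, shape)   # weights in (−1, +1)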

The simulation results show that the ADHDP controller requires eight iterations to suppress the nonuniform revolving among the cylinders, and the control process converges uniformly with no overshoot (Figure 6). Cylinders 1-4 intelligently manipulate each spark ignition timing to suppress unbalanced combustion (Figure 7). The critic and action network errors

Figure 6. The fluctuation control procedure of the engine idling: (a) zeroth iteration; (b) first iteration; (c) third iteration; (d) eighth iteration.


Figure 7. The adjustment procedure of cylinders 1-4 spark ignition timing.

also converge to an extremely small value quickly (Figure 8(a)). If the errors of the critic and action networks are given an initial value of 0.2 at iteration 0, that is, before training, we can observe the process of convergence to a minimum more clearly (Figure 8(b)).

In addition, this control method has a faster iteration convergence effect: it is probably at least two times faster than previously published results [31-33].

    5. DISCUSSION AND CONCLUSION

This research presents a neural-network-based ADHDP method for nonlinear MIMO systems, which can be applied to complex control problems such as the fluctuation control of engine speed at idle. Based on an extension of the standard ADP method, a MIMO ADHDP method has been obtained. This paper employs the fluctuation control of an engine idling to show the feasibility of the presented method.

We should note that this study was carried out with relatively ideal speed signals. However, raw signals are rarely used directly in industrial practice; industrial signals are more or less pre-treated. Nevertheless, the simulation results still show that this method is effective. The controller can detect the speed fluctuation and then adjusts the spark ignition timing to eliminate the nonuniform revolving for such a nonlinear MIMO system.

The ADP method had not previously been adopted for the short-term fluctuation control of an engine idling. Thus, applying the ADHDP method to engine idling control is also a novel scheme. This controller can compensate for short-term fluctuations of the engine idling within one engine cycle caused by component differences, aging, disturbances, etc., making it an adaptive controller.

This controller also does not need to know which cylinder the fluctuation comes from, because it can intelligently adjust the four spark ignition timings to simultaneously suppress the nonuniformity of each cylinder. This performance demonstrates its strong intelligence in identification and control ability for nonlinear MIMO systems.

Enns and Si's scheme for a MIMO nonlinear system is a typical method; they adopted a cascaded neural network scheme. Our scheme differs from theirs in that we extend the action network to a multi-output form based on its original structure. Thus, our scheme is relatively simple in both ADP structure and computational formulas.

Although it utilizes neural networks, this method does not require training data. In fact, the ADHDP method is quite different from a plain neural network: the neural network is only used to approximate the cost function in a high-dimensional state space, whereas the ADHDP controller embeds the principles of dynamic programming and reinforcement learning. As the training iterations increase, the controller achieves its objective by minimizing the utility function. In addition, the presented MIMO ADHDP method has shown a faster iteration convergence effect than results reported so far.

Figure 8. The error convergence process: (a) the error convergence process of the critic and action networks without an initial value and (b) the error convergence process of the critic and action networks with an initial value of 0.2.

The fluctuation control of an engine idling is a typical complex MIMO nonlinear system. The computational formulas presented here are also applicable to other control objects; the only differences lie in the node numbers and training parameters of the neural networks as well as the utility function. Thus, although illustrated for engine control, the ADHDP control system framework should also be applicable to general MIMO nonlinear systems.

    ACKNOWLEDGEMENTS

The authors would like to thank the anonymous reviewers for their helpful comments and high-quality suggestions. This work was supported by the NSFC Projects under grant no. 50979058, the Ocean Engineering State Key Laboratory of Shanghai Jiao Tong University under grant GKZD-010011, the Innovation Program of Shanghai Municipal Education Commission under grant no. 11ZZ143, and the Shanghai IMS Automotive Electronic Systems Co. Ltd.

    REFERENCES

1. Govindhasamy JJ, McLoone SF, Irwin GW, French JJ, Doyle RP. Reinforcement learning for online control and optimisation. IEE Control Engineering Book Series 2005; 70(9):293-326.
2. Ernst D, Glavic M, Wehenkel L. Power systems stability control: reinforcement learning framework. IEEE Transactions on Power Systems 2004; 19(1):427-435.
3. Bellman R. Dynamic Programming. Princeton University Press: Princeton, NJ, 1957.
4. Larson RE, Casti JL. Principles of Dynamic Programming: Basic Analytical and Computational Methods. Marcel Dekker Inc: NY, 1978.
5. Werbos PJ. Advanced forecasting methods for global crisis warning and models of intelligence. General System Yearbook 1977; 22:25-38.
6. Werbos PJ. Approximate dynamic programming for real-time control and neural modeling. In Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches, White DA, Sofge DA (eds). Van Nostrand Reinhold: NY, 1992; 493-525.
7. Tesauro G. Practical issues in temporal difference learning. Machine Learning 1992; 8:257-277.
8. Bertsekas DP, Tsitsiklis JN. Neuro-Dynamic Programming. Athena Scientific: Belmont, MA, 1996.
9. Prokhorov DV, Wunsch DC. Adaptive critic designs. IEEE Transactions on Neural Networks 1997; 8(5):997-1007.
10. Si J, Barto AG, Powell WB, Wunsch D. Handbook of Learning and Approximate Dynamic Programming. John Wiley & Sons Inc: Hoboken, NJ, 2004.
11. Powell WB. Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley & Sons Inc: Hoboken, NJ, 2007.
12. Wang FY, Zhang H, Liu D. Adaptive dynamic programming: an introduction. IEEE Computational Intelligence Magazine 2009; 4(2):39-47. DOI: 10.1109/MCI.2009.932261.
13. Ferrari S, Stengel RF. Online adaptive critic flight control. Journal of Guidance, Control, and Dynamics 2004; 27(5):777-786.
14. Enns R, Si J. Helicopter trimming and tracking control using direct neural dynamic programming. IEEE Transactions on Neural Networks 2003; 14(4):929-939. DOI: 10.1109/TNN.2003.813839.
15. Liu D, Javaherian H, Kovalenko O, Huang T. Adaptive critic learning techniques for engine torque and air-fuel ratio control. IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics 2008; 38(4):988-993. DOI: 10.1109/TSMCB.2008.922019.
16. Shih P, Kaul B, Jagannathan S, Drallmeier J. Near optimal output-feedback control of nonlinear discrete-time systems in nonstrict feedback form with application to engines. Proceedings of the International Joint Conference on Neural Networks, Orlando, Florida, USA, 2007; 396-401.
17. Kulkarni NV, Krishnakumar K. Intelligent engine control using an adaptive critic. IEEE Transactions on Control Systems Technology 2003; 11(2):164-173. DOI: 10.1109/TCST.2003.809254.
18. Mohagheghi S, Venayagamoorthy GK, Harley RG. Adaptive critic design based neuro-fuzzy controller for a static compensator in a multimachine power system. IEEE Transactions on Power Systems 2006; 21(4):1744-1754. DOI: 10.1109/TPWRS.2006.882467.
19. Lu C, Si J, Xie X. Direct heuristic dynamic programming for damping oscillations in a large power system. IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics 2008; 38(4):1008-1013. DOI: 10.1109/TSMCB.2008.923157.
20. Murray JJ, Cox CJ, Lendaris GG, Saeks R. Adaptive dynamic programming. IEEE Transactions on Systems, Man, and Cybernetics-Part C: Applications and Reviews 2002; 32(2):140-153.
21. Lin CK. Adaptive critic autopilot design of bank-to-turn missiles using fuzzy basis function networks. IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics 2005; 35(2):197-207. DOI: 10.1109/TSMCB.2004.842246.
22. Bertsekas DP, Homer ML, Logan DA, Patek SD, Sandell NR. Missile defense and interceptor allocation by neuro-dynamic programming. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 2000; 30(1):42-51.
23. Werbos PJ. Using ADP to understand and replicate brain intelligence: the next level design. Proceedings of the IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, Honolulu, Hawaii, USA, 2007; 209-216.
24. Iftekharuddin KM, Li Y, Siddiqui F. A biologically inspired dynamic model for object recognition. Neurodynamics of Cognition and Consciousness, Understanding Complex Systems 2007; 2007:211-232.
25. Liu D, Xiong X, Zhang Y. Action-dependent adaptive critic designs. Proceedings of the IEEE-INNS International Joint Conference on Neural Networks, Washington, DC, 2001; 990-995.
26. Lee JH, Lee JM. Approximate dynamic programming based approach to process control and scheduling. Computers and Chemical Engineering 2006; 30:1603-1618.
27. Padhi R, Unnikrishnan N, Wang X, Balakrishnan SN. A single network adaptive critic (SNAC) architecture for optimal control synthesis for a class of nonlinear systems. Neural Networks 2006; 19(10):1648-1660.
28. Lin X, Lei S, Song C, Song S, Liu D. ADHDP for the pH value control in the clarifying process of sugar cane juice. Lecture Notes in Computer Science 2008; 5263:796-805. DOI: 10.1007/978-3-540-87732-5_88.
29. Si J, Wang YT. On-line learning control by association and reinforcement. IEEE Transactions on Neural Networks 2001; 12(2):264-276.
30. Lee JM, Lee JH. Value function-based approach to the scheduling of multiple controllers. Journal of Process Control 2008; 18:533-542.
31. Kim DE, Park J. Application of adaptive control to the fluctuation of engine speed at idle. Information Sciences 2007; 177(16):3341-3355. DOI: 10.1016/j.ins.2006.12.02.
32. Shim D, Park J, Khargonekar PP, Ribbens WB. Reducing automotive engine speed fluctuation at idle. IEEE Transactions on Control Systems Technology 1996; 4(4):404-410.
33. Kim DE, Park J. Neural network control for reducing engine speed fluctuation at idle. IEEE Transactions on Control Systems Technology 1999; 4:629-634.
