



Fast automatic two-stage nonlinear model identification based on the extreme learning machine

Jing Deng, Kang Li, George W. Irwin

Intelligent Systems and Control Group, School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast BT9 5AH, UK

Article info

Available online 12 May 2011

Keywords:

Extreme learning machine

Two-stage stepwise selection

Leave-one-out cross validation

RBF networks

Neurocomputing 74 (2011) 2422–2429. doi:10.1016/j.neucom.2010.11.035

Corresponding author: K. Li. Tel.: +44 2890974663. E-mail address: [email protected]

Abstract

It is convenient and effective to solve nonlinear problems with a model that has a linear-in-the-parameters (LITP) structure. However, the nonlinear parameters of each model term (e.g. the width of a Gaussian function) need to be pre-determined, either from expert experience or through exhaustive search. An alternative approach is to optimize them by a gradient-based technique (e.g. Newton's method). Unfortunately, all of these methods still require a considerable amount of computation. Recently, the extreme learning machine (ELM) has shown its advantages in terms of fast learning from data, but the sparsity of the constructed model cannot be guaranteed. This paper proposes a novel algorithm for the automatic construction of a nonlinear system model based on the extreme learning machine. This is achieved by effectively integrating the ELM and leave-one-out (LOO) cross validation with our two-stage stepwise construction procedure [1]. The main objective is to improve the compactness and generalization capability of the model constructed by the ELM method. Numerical analysis shows that the proposed algorithm only involves about half of the computation of the orthogonal least squares (OLS) based method. Simulation examples are included to confirm the efficacy and superiority of the proposed technique.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

In nonlinear system modelling and identification, two main issues need to be resolved. The first is to find a proper nonlinear model structure and the second is to optimize that model structure and the parameters. To address the first issue, various model types have been proposed, such as the nonlinear autoregressive with exogenous input (NARX) model, the nonlinear output error (NOE) model, artificial neural networks, and fuzzy systems [2]. Generally, most of these models have, or can be transformed into, a linear-in-the-parameters (LITP) structure which is easier to optimize.

The main problem in a LITP model is the determination of the nonlinear parameters in each term, e.g. the width of the Gaussian function and the power of a polynomial term. These parameters are normally found by exhaustive search or optimized by a gradient-based technique, which significantly increases the computational complexity. Unlike conventional schemes, a more general model construction concept, namely the extreme learning machine (ELM), has been proposed for LITP models [3]. In an ELM, all the nonlinear parameters are randomly chosen, independent of the training data. Training is thus transformed into a standard least-squares problem, leading to a significant improvement in the learning speed. It has been proven that an LITP model with such randomly generated nonlinear parameters can approximate any continuous target function [4]. Further, the ELM has been extended to a much wider class of model terms, including fuzzy rules as well as additive nodes [5].

Though the ELM is efficient in terms of nonlinear model construction, the resultant model is not compact. Because the hidden nodes are generated stochastically, some of the terms contribute little to the model and degrade its interpretability. Thus, a sparse model with satisfactory accuracy needs to be further optimized from these candidate terms based on the parsimonious principle. The most popular approach for this is the subset construction method, including forward stepwise selection, backward stepwise selection and the so-called best subset selection [6]. Amongst these techniques, forward stepwise selection is perhaps the most popular approach in the literature for model construction where a very large term pool has to be considered [7]. The best-known algorithm in forward selection is orthogonal least squares (OLS) [8,9], which is derived from an orthogonal (or QR) decomposition of the regression matrix. An alternative is the recently proposed fast recursive algorithm (FRA) [10], which has been shown to be more efficient and stable than OLS.

The main issue in forward stepwise construction is how to terminate the selection procedure to prevent over-fitting. Commonly used methods include the use of an information criterion, such as the final prediction error (FPE) or Akaike's information criterion (AIC) [2], which provides a trade-off between the training error and the model complexity. However, the tuning parameters that penalize the model complexity in these criteria are sensitive to both the specific application and the data sample size, so it is still easy to miss an optimal model. To overcome this problem, regularized forward model construction variants have also been described [11,12]. The main idea here is that each model weight is assigned a prior hyperparameter, and the most probable values of these hyperparameters are estimated iteratively from the data. As a result, those terms that are mainly determined by the noise have large hyperparameter values, and their corresponding model weights are forced towards zero. Sparsity is then achieved by removing irrelevant terms from the trained model.

To assess the generalization ability, the resultant model is usually applied to a set of test data. This involves splitting all the available measured data into different sets. However, the amount of data available is often limited, so it is desirable to use all the available data to train the model without sacrificing any generalization performance. A typical way of doing this is to employ cross validation. The limited data set is split into s parts, of which s-1 parts are used for model training and the single remaining part is used for model testing. This procedure is continued until all s different possible combinations have been used. The extreme case of cross validation is known as the leave-one-out (LOO) method [2,13], where only one sample is used for testing and the rest are left for training. The overall model error, or LOO error, is then the average test error over the data samples. By adopting this method in a forward construction context, model terms are selected based on their reduction in the LOO error, and the selection procedure automatically terminates at the point where this starts to increase. Though this approach can achieve improved model generalization, its computational complexity is extremely high. Generally, it is therefore feasible only for very small data sets, because the computational complexity is proportional to the total number of data samples. However, if the model has a linear-in-the-parameters structure, it has been shown that the LOO error can be calculated without splitting the data set explicitly [13].

Though a forward construction scheme can efficiently build a sparse model from a large candidate term pool, the final model is not optimal [14], since the calculation of the contribution of any new term depends on previously selected ones [1]. To relax this constraint, genetic search has been suggested to refine the model structure [15], but at the expense of a very high computational complexity. In [1], a two-stage stepwise construction method was recently proposed which combines both forward and backward construction procedures. This retains the computational efficiency of the FRA while improving the model compactness.

In the above two-stage construction algorithm, an initial model is first constructed by the FRA, where the contribution of a particular term of interest is measured by its reduction in an appropriate cost function. The significance of each selected term is then reviewed at a second stage of model refinement, and insignificant terms are replaced. Specifically, if the contribution of a previously selected term is less than that of any term from the candidate pool, it is replaced by that candidate. Thus, the cost function can be further reduced without increasing the model size. This checking cycle is iterated until no insignificant model term exists in the trained model, resulting in an optimized model structure with improved performance. The extension of the two-stage algorithm to the ELM was first proposed in [16], and then further improved in [17]. However, the computational complexity in [17] is still very high, which may discourage its application to large data sets.

In this paper, the leave-one-out cross validation technique is effectively integrated into our recently proposed two-stage stepwise construction method to enhance the sparsity of models constructed from the ELM. In this new algorithm, the contribution of each candidate model term is measured by its reduction in the LOO error, and the construction procedure is automatically stopped when this starts to increase. The second stage of model refinement can further improve the trained model by replacing any insignificant terms with ones from the candidate pool. Owing to the introduction of a residual matrix which can be updated recursively, the computation of the LOO error is significantly reduced. Results from two simulation examples are included to confirm the efficacy of the proposed method.

The paper is organized as follows. Section 2 gives the preliminaries. The automatic two-stage algorithm is then introduced in Section 3. Section 4 describes the results from two simulation examples to illustrate the effectiveness of the proposed technique, and Section 5 contains a concluding summary.

2. Preliminaries

2.1. General nonlinear system model

Consider a general nonlinear system approximated by a model that has a linear-in-the-parameters structure:

$$y(t) = \sum_{k=1}^{n} \theta_k \varphi_k(x(t)) + e(t) \qquad (1)$$

where y(t) is the system output at sample time t, x(t) is the input vector, φ_k(·) denotes a nonlinear function (e.g. polynomial or Gaussian), θ_k is a model parameter, and e(t) is the model prediction error. In the extreme learning machine, the nonlinear parameters in φ_k(x(t)), k = 1, …, n, are randomly assigned, and the model parameters θ are estimated through least squares. The generalization performance mainly depends on the number of nonlinear functions n: too small an n may result in an under-fitted model, while too large an n may lead to an over-fitted one.

Suppose a set of N data samples {x(t), y(t)}, t = 1, …, N, is used for model training; then (1) can be written in the matrix form

$$y = \Phi\theta + e \qquad (2)$$

where Φ = [φ_1, …, φ_n] ∈ R^{N×n} is the regression matrix with column vectors φ_i = [φ_i(x(1)), …, φ_i(x(N))]^T, i = 1, …, n; y = [y(1), …, y(N)]^T ∈ R^N; θ = [θ_1, …, θ_n]^T ∈ R^n; and e = [e(1), …, e(N)]^T ∈ R^N.

The model selection procedure aims to build a parsimonious representation. Suppose that the initial model has M candidate terms, and that n (n ≪ M) regressors p_1, …, p_n have been selected from the candidate pool, with those remaining in the pool denoted φ_i(·), i = n+1, …, M. The resulting model is represented by

$$y = P_n\theta_n + e \qquad (3)$$

where P_n = [p_1, …, p_n], θ_n = [θ_1, …, θ_n]^T and e = [e(1), …, e(N)]^T, and the least-squares estimate of θ_n is given by

$$\theta_n = (P_n^T P_n)^{-1} P_n^T y \qquad (4)$$

The above introduction reveals the three issues in the ELM that have to be dealt with in applications: (i) a large number of nonlinear functions is usually required; (ii) the singularity problem in P becomes more serious as the model size increases; and (iii) a large model generates a high computational overhead in deriving the linear parameters. In fact, these three problems are closely coupled. If the performance of the nonlinear model can be improved, the number of required nonlinear functions and the corresponding number of linear parameters can be significantly reduced, leading to an overall reduction of the computational complexity. Once P^T P is nonsingular, a number of efficient algorithms can be used for fast computation of the linear parameters. The basic idea in this paper is to introduce an integrated framework to select the nonlinear terms from the randomly generated candidate pool in the ELM, to further improve the model performance and to efficiently calculate the linear parameters as well.
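To make the LITP/ELM idea concrete, the following minimal sketch (Python with NumPy; all names, the width range and the toy target are illustrative assumptions, not the authors' code) builds the regression matrix of Eq. (2) from randomly parameterised Gaussian nodes and estimates only the linear parameters by least squares as in Eq. (4).

```python
import numpy as np

# A minimal ELM-style LITP sketch: the nonlinear parameters (centres, widths)
# are random; only the linear weights theta are fitted, as in Eqs. (2) and (4).
rng = np.random.default_rng(0)
N, n = 200, 20                                     # samples, random hidden nodes
x = rng.uniform(-10.0, 10.0, size=(N, 1))          # input data
y = 0.1 * x[:, 0] + np.sinc(x[:, 0] / np.pi)       # an illustrative target

centres = x[rng.integers(0, N, size=n)]            # centres drawn from the data
widths = rng.uniform(0.5, 2.0, size=n)             # random Gaussian widths
Phi = np.exp(-(x - centres.T) ** 2 / widths ** 2)  # regression matrix, Eq. (2)
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # linear parameters, Eq. (4)
y_hat = Phi @ theta                                # model prediction
```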

2.2. Computation of LOO error

Leave-one-out is a commonly used cross validation method for improving the model generalization capability [13]. It is achieved by using only one sample for model testing in each run, with the remaining (N-1) samples reserved for training. Thus, for t = 1, …, N, the jth model is estimated from the data set with the tth point removed. The prediction error can then be calculated as

$$e_j^{(-t)}(t) = y(t) - \hat{y}_j^{(-t)}(t) \qquad (5)$$

where ŷ_j^{(-t)}(t) is the jth model output estimated using the remaining (N-1) data samples. The LOO error is obtained by averaging all these prediction errors as

$$J_j = \frac{1}{N}\sum_{t=1}^{N}\bigl(e_j^{(-t)}(t)\bigr)^2 \qquad (6)$$

The model term with the minimum LOO error is selected at each step. This procedure is obviously computationally expensive, since the amount of calculation involved in using each data sample individually, as described above, is N times that of using the complete data set once. However, the following derivation shows that the LOO error can be obtained without explicitly splitting the training data set as above [13].

Referring to (3), define the matrix

$$M = P^T P \qquad (7)$$

Then the least-squares parameter estimate of θ is given by

$$\theta = [P^T P]^{-1} P^T y = M^{-1} P^T y \qquad (8)$$

The model residual at sample time t becomes

$$e(t) = y(t) - p(t)\theta = y(t) - p(t)M^{-1}P^T y \qquad (9)$$

where p(t) denotes the tth row of the regression matrix P. If the tth data sample is deleted from the estimation data set, the parameter θ is instead calculated as

$$\theta^{(-t)} = \{M^{(-t)}\}^{-1}\{P^{(-t)}\}^T y^{(-t)} \qquad (10)$$

From the definition of M, it can be shown that

$$M^{(-t)} = M - p(t)^T p(t) \qquad (11)$$

and

$$\{P^{(-t)}\}^T y^{(-t)} = P^T y - p(t)^T y(t) \qquad (12)$$

By using the well-known matrix inversion lemma $[A+BCD]^{-1} = A^{-1} - A^{-1}B[DA^{-1}B + C^{-1}]^{-1}DA^{-1}$ [18], the inverse of M^{(-t)} in (11) can be computed as

$$\{M^{(-t)}\}^{-1} = M^{-1} + \frac{M^{-1}p(t)^T p(t)M^{-1}}{1 - p(t)M^{-1}p(t)^T} \qquad (13)$$

The model error at data point t is now given as [13]

$$e^{(-t)}(t) = y(t) - p(t)\theta^{(-t)} = y(t) - p(t)\{M^{(-t)}\}^{-1}\{P^{(-t)}\}^T y^{(-t)} = \frac{y(t) - p(t)M^{-1}P^T y}{1 - p(t)M^{-1}p^T(t)} = \frac{e(t)}{1 - p(t)M^{-1}p^T(t)} \qquad (14)$$

The calculation of the LOO error is thus simplified by Eq. (14), which does not involve splitting the training data sequentially.
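As a sanity check on Eq. (14), the following hedged sketch (function names are illustrative) computes the LOO residuals of a linear-in-the-parameters model from the ordinary residuals and the diagonal of P M^{-1} P^T, and includes a brute-force re-fitting loop that should agree with it.

```python
import numpy as np

def loo_errors(P, y):
    """LOO residuals via Eq. (14): no explicit N-fold data splitting."""
    M_inv = np.linalg.inv(P.T @ P)              # M^{-1}, Eq. (7)
    e = y - P @ (M_inv @ (P.T @ y))             # ordinary residuals, Eq. (9)
    h = np.einsum('ij,jk,ik->i', P, M_inv, P)   # p(t) M^{-1} p(t)^T for every t
    return e / (1.0 - h)                        # Eq. (14)

def loo_errors_bruteforce(P, y):
    """Direct leave-one-out re-fitting, Eqs. (5) and (10), for comparison."""
    out = np.empty(len(y))
    for t in range(len(y)):
        keep = np.arange(len(y)) != t
        theta = np.linalg.lstsq(P[keep], y[keep], rcond=None)[0]
        out[t] = y[t] - P[t] @ theta
    return out

# Example usage on random data:
rng = np.random.default_rng(1)
P, y = rng.normal(size=(50, 5)), rng.normal(size=50)
assert np.allclose(loo_errors(P, y), loo_errors_bruteforce(P, y))
```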

3. Automatic two-stage model construction

In the forward stepwise construction scheme, model terms are selected one by one such that the LOO error is maximally reduced each time. Suppose k such terms have been selected, and the resulting regression matrix is denoted

$$P_k = [p_1, \ldots, p_k], \quad k = 1,\ldots,n \qquad (15)$$

From (14), the corresponding model residual at sample time t becomes

$$e_k^{(-t)}(t) = \frac{y(t) - p_k(t)M_k^{-1}P_k^T y}{1 - p_k(t)M_k^{-1}p_k^T(t)} = \frac{e_k(t)}{1 - p_k(t)M_k^{-1}p_k^T(t)} \qquad (16)$$

and the LOO error is given by

$$J_k = \frac{1}{N}\sum_{t=1}^{N}\bigl(e_k^{(-t)}(t)\bigr)^2 \qquad (17)$$

If one more regressor term p_{k+1} is selected, the regression matrix changes to P_{k+1} = [P_k, p_{k+1}]. The selected term should maximally reduce the LOO error compared with all the remaining available candidates. However, this choice involves a constrained minimization of J_{k+1}, which can be addressed in a second stage of model refinement.

3.1. Forward model selection—first stage

In order to simplify the computation of the LOO error, a residual matrix series is defined first:

$$R_k \triangleq \begin{cases} I - P_k M_k^{-1} P_k^T, & 0 < k \le n \\ I, & k = 0 \end{cases} \qquad (18)$$

According to [10,1], the matrices R_k, k = 0, …, n, have the following properties:

$$R_{k+1} = R_k - \frac{R_k p_{k+1} p_{k+1}^T R_k^T}{p_{k+1}^T R_k p_{k+1}}, \quad k = 0,1,\ldots,n-1 \qquad (19)$$

$$R_k^T = R_k, \quad (R_k)^2 = R_k, \quad k = 0,1,\ldots,n \qquad (20)$$

$$R_i R_j = R_j R_i = R_i, \quad i \ge j;\ i,j = 0,1,\ldots,n \qquad (21)$$

$$R_k \varphi = \begin{cases} 0, & \mathrm{rank}([P_k, \varphi]) = k \\ \varphi^{(k)} \ne 0, & \mathrm{rank}([P_k, \varphi]) = k+1 \end{cases} \qquad (22)$$

$$R_{1,\ldots,p,\ldots,q,\ldots,k} = R_{1,\ldots,q,\ldots,p,\ldots,k}, \quad p,q \le k \qquad (23)$$

The last property means that any change in the selection order of the regressor terms p_1, …, p_k does not change the residual matrix R_k. This property will help to reduce the computational effort in the second stage.
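The properties (19)-(21) are easy to verify numerically; the following short check (on random data, purely as an illustration) confirms that the rank-one update (19) reproduces the definition (18) and that R_k is a symmetric, idempotent projection.

```python
import numpy as np

def residual_matrix(P):
    """R = I - P (P^T P)^{-1} P^T, Eq. (18)."""
    return np.eye(len(P)) - P @ np.linalg.solve(P.T @ P, P.T)

rng = np.random.default_rng(3)
P = rng.normal(size=(30, 4))          # P_k with k = 4 regressors
p_new = rng.normal(size=(30, 1))      # candidate p_{k+1}

R_k = residual_matrix(P)
R_next = R_k - (R_k @ p_new @ p_new.T @ R_k.T) / (p_new.T @ R_k @ p_new)  # Eq. (19)

assert np.allclose(R_next, residual_matrix(np.hstack([P, p_new])))  # matches Eq. (18)
assert np.allclose(R_k, R_k.T) and np.allclose(R_k @ R_k, R_k)      # Eq. (20)
assert np.allclose(R_next @ R_k, R_next)                            # Eq. (21), i >= j
```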

Now, the residual vector can be written as

$$e_k = y - P_k M_k^{-1} P_k^T y = R_k y \qquad (24)$$

where e_k = [e_k(1), …, e_k(t), …, e_k(N)]^T. The denominator in (16) is the tth diagonal element of the residual matrix R_k. As a result, the LOO error in (17) can be rewritten as

$$J_k = \frac{1}{N}\sum_{t=1}^{N}\frac{e_k^2(t)}{d_k^2(t)} \qquad (25)$$

where d_k = diag(R_k). According to (19), the above LOO error can be further simplified by defining an auxiliary matrix A ∈ R^{k×M} and a vector b ∈ R^{M×1}, with elements given by

$$a_{i,j} \triangleq \begin{cases} (p_i^{(i-1)})^T p_j, & 1 \le j \le k \\ (p_i^{(i-1)})^T \varphi_j, & k < j \le M \end{cases} \qquad (26)$$

$$b_i \triangleq \begin{cases} (p_i^{(i-1)})^T y, & 1 \le i \le k \\ (\varphi_i^{(k)})^T y, & k < i \le M \end{cases} \qquad (27)$$

Referring to the updating of the residual matrix in (19), it follows that a_{k,j}, b_k, d_k and e_k can be computed recursively as

$$a_{k,j} = p_k^T \varphi_j - \sum_{l=1}^{k-1} a_{l,k}a_{l,j}/a_{l,l}, \quad k = 1,\ldots,n,\ j = 1,\ldots,M \qquad (28)$$

$$b_k = p_k^T y - \sum_{l=1}^{k-1} a_{l,k}b_l/a_{l,l}, \quad k = 1,\ldots,n \qquad (29)$$

$$p_k^{(k-1)} = p_k^{(k-2)} - \frac{a_{k-1,k}}{a_{k-1,k-1}}p_{k-1}^{(k-2)} \qquad (30)$$

$$d_k = d_{k-1} - (p_k^{(k-1)})^2/a_{k,k} \qquad (31)$$

$$e_k = e_{k-1} - \frac{b_k}{a_{k,k}}p_k^{(k-1)} \qquad (32)$$

The LOO error in (25) can now be calculated recursively using (31) and (32). In this forward construction stage, the significance of each model term is measured by its reduction in the LOO error. Thus, suppose that at the kth step one more term from the candidate pool is to be selected. The new LOO error of the model, which includes the previously selected k terms and the new candidate, is computed from (25), and the candidate that gives the minimum LOO error J_{k+1} is added to the model. Meanwhile, all the regressors in Φ are stored in their intermediate forms for the next selection. That is, if the kth term is added to the model, then all previously selected terms are saved as p_1^{(0)}, …, p_k^{(k-1)}, and all the remaining regressors in the candidate pool are saved as φ_i^{(k)}, i = k+1, …, M. The diagonal elements a_{j,j} in A and the elements b_j, for k+1 ≤ j ≤ M, are also pre-calculated for use in the next selection, and are given by

$$a_{j,j}^{(k+1)} = a_{j,j}^{(k)} - a_{k,j}^2/a_{k,k} \qquad (33)$$

$$b_j^{(k+1)} = b_j^{(k)} - a_{k,j}b_k/a_{k,k} \qquad (34)$$

This procedure is continued until the LOO error starts to increase, meaning that the forward selection stage is automatically terminated when

$$J_n < J_{n+1} \qquad (35)$$

resulting in a compact model with n terms.
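A compact sketch of this first stage is given below. Instead of the scalar recursions (26)-(34), it keeps the candidate columns explicitly residualised against the selected terms, which is numerically equivalent to the residual-matrix update (19); the LOO error (25) and the stopping rule (35) are used exactly as described. Function and variable names are illustrative.

```python
import numpy as np

def forward_selection_loo(Phi, y, tol=1e-10):
    """Stage one: add terms while the LOO error (25) keeps decreasing."""
    N, M = Phi.shape
    resid_cols = Phi.astype(float).copy()  # candidates residualised w.r.t. selected terms
    d = np.ones(N)                         # diag(R_k), Eq. (18) with k = 0
    e = y.astype(float).copy()             # e_k = R_k y, Eq. (24)
    selected = []
    J_prev = np.mean(y ** 2)               # J_0 (Step 1 of the algorithm)
    while len(selected) < M:
        best = None
        for j in range(M):
            if j in selected:
                continue
            phi = resid_cols[:, j]
            a = phi @ phi                  # a_{j,j}
            if a < tol:                    # candidate numerically in the span of P_k
                continue
            b = phi @ y                    # b_j
            d_new = d - phi ** 2 / a       # Eq. (31)
            e_new = e - (b / a) * phi      # Eq. (32)
            J = np.mean((e_new / d_new) ** 2)   # LOO error, Eq. (25)
            if best is None or J < best[0]:
                best = (J, j, a, d_new, e_new)
        if best is None or best[0] >= J_prev:   # stopping criterion, Eq. (35)
            break
        J_prev, j_sel, a_sel, d, e = best
        p_res = resid_cols[:, j_sel].copy()
        selected.append(j_sel)
        # residualise the remaining candidates against the newly selected term
        resid_cols -= np.outer(p_res, (p_res @ resid_cols) / a_sel)
    return selected, J_prev
```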

3.2. Backward model refinement—second stage

This stage involves the elimination of terms rendered insignificant by the constraint introduced in forward construction. Noting that the last selected term in forward construction is always more significant than those remaining in the candidate pool, the backward model refinement can be divided into three main procedures. Firstly, a selected term p_k, k = 1, …, n-1, is shifted to the nth position as if it had been the last one selected. Secondly, the LOO error of each candidate term is recalculated and compared with that of the shifted term; if the LOO error of a selected term is larger than that of a term from the candidate pool, it is replaced, leading to the required improvement in model generalization capability. This review is repeated until no insignificant term remains in the selected model. Finally, all the selected terms are used to form a new candidate pool, and the forward stepwise construction is implemented again, possibly further reducing the model size.

3.2.1. Re-ordering of regressor terms

Suppose a selected model term p_k is to be moved to the nth position in the regression matrix P_n. This can be achieved by repeatedly interchanging two adjacent regressors so that

$$p_q^* = p_{q+1}, \quad p_{q+1}^* = p_q, \quad q = k,\ldots,n-1 \qquad (36)$$

By noting the property in (23), it is clear that only R_q in the residual matrix series is changed at each step. It is updated using

$$R_q^* = R_{q-1} - \frac{R_{q-1}p_q^*(p_q^*)^T R_{q-1}^T}{(p_q^*)^T R_{q-1}p_q^*} \qquad (37)$$

Meanwhile, the following terms also need to be updated. In the matrix A, only the upper triangular elements a_{i,j}, i ≤ j, are used for model term selection. The qth and (q+1)th columns, with elements from row 1 to q-1, need to be modified (i = 1, …, q-1):

$$a_{i,q}^* = (p_i^{(i-1)})^T p_{q+1} = a_{i,q+1}, \qquad a_{i,q+1}^* = (p_i^{(i-1)})^T p_q = a_{i,q} \qquad (38)$$

The elements of the qth row, a_{q,j}, from column q to column M (j = q, …, M) are changed using

$$a_{q,j}^* = \begin{cases} a_{q+1,q+1} + a_{q,q+1}^2/a_{q,q}, & j = q \\ a_{q,q+1}, & j = q+1 \\ a_{q+1,j} + a_{q,q+1}a_{q,j}/a_{q,q}, & j \ge q+2 \end{cases} \qquad (39)$$

and the elements of the (q+1)th row, a_{q+1,j} (j = q+1, …, M), are changed by

$$a_{q+1,j}^* = \begin{cases} a_{q,q} - a_{q,q+1}^2/a_{q,q}^*, & j = q+1 \\ a_{q,j} - a_{q,q+1}a_{q,j}^*/a_{q,q}^*, & j \ge q+2 \end{cases} \qquad (40)$$

For the vector b, only the qth and (q+1)th elements are changed:

$$b_q^* = b_{q+1} + a_{q,q+1}b_q/a_{q,q} \qquad (41)$$

$$b_{q+1}^* = b_q - a_{q,q+1}b_q^*/a_{q,q}^* \qquad (42)$$

Finally, p_q^{(q-1)} and p_{q+1}^{(q)} are updated using

$$(p_q^*)^{(q-1)} = p_{q+1}^{(q)} + \frac{a_{q,q+1}}{a_{q,q}}p_q^{(q-1)} \qquad (43)$$

$$(p_{q+1}^*)^{(q)} = p_q^{(q-1)} - \frac{a_{q,q+1}}{a_{q,q}^*}(p_q^*)^{(q-1)} \qquad (44)$$

This procedure is continued until the kth regressor term has been shifted to the nth position, the new regression matrix and residual matrix series becoming

$$P_n^* = [p_1^*,\ldots,p_{k-1}^*,p_k^*,\ldots,p_{n-1}^*,p_n^*] = [p_1,\ldots,p_{k-1},p_{k+1},\ldots,p_n,p_k] \qquad (45)$$

$$\{R_k^*\} = [R_1,\ldots,R_{k-1},R_k^*,\ldots,R_n^*] \qquad (46)$$

3.2.2. LOO error comparison

[Fig. 1. LOO error at different stages: the forward selection stage is stopped at n, and the second stage is stopped at n′, n′ ≤ n. The figure plots the LOO error against the number of hidden nodes, with separate curves for the first and second stages.]

When the regressor term p_k has been moved to the nth position in the full regression matrix P_n, its contribution to the reduction of the leave-one-out cross validation error needs to be reviewed. The LOO errors of the previously selected n terms remain unchanged, while the new LOO errors of the regressor terms in the candidate pool must be recalculated. The corresponding terms for n+1 ≤ j ≤ M are changed as follows:

$$a_{j,j}^* = a_{j,j} + (a_{n,j}^*)^2/a_{n,n}^* \qquad (47)$$

$$b_j^* = b_j + b_n^* a_{n,j}^*/a_{n,n}^* \qquad (48)$$

$$(\varphi_j^{(n-1)})^* = \varphi_j^{(n)} + \frac{a_{n,j}^*}{a_{n,n}^*}(p_n^{(n-1)})^* \qquad (49)$$

$$d_j^* = d_n + \frac{1}{a_{n,n}^*}\bigl[(p_n^{(n-1)})^*\bigr]^2 - \frac{1}{a_{j,j}^*}\bigl[(\varphi_j^{(n-1)})^*\bigr]^2 \qquad (50)$$

$$e_j^* = e_n + \frac{b_n^*}{a_{n,n}^*}(p_n^{(n-1)})^* - \frac{b_j^*}{a_{j,j}^*}(\varphi_j^{(n-1)})^* \qquad (51)$$

Suppose a regressor φ_m from the candidate term pool has a smaller LOO error than the model term of interest, that is, J_m^* < J_n^*. In this case, φ_m replaces p_n^* in the selected regression matrix P_n, and p_n^* is put back into the candidate term pool. Meanwhile, the following related terms are updated. The vectors e_n and d_n are changed to

$$e_n^* = e_m, \quad d_n^* = d_m \qquad (52)$$

In the matrix A,

$$a_{i,n}^* = a_{i,m}, \quad a_{i,m}^* = a_{i,n} \quad (i = 1,\ldots,n-1) \qquad (53)$$

$$a_{n,j}^* = \begin{cases} a_{m,m}, & j = n \\ a_{n,m}, & j = m \\ \varphi_m^T\varphi_j - \sum_{l=1}^{n-1} a_{l,m}a_{l,j}/a_{l,l}, & \text{otherwise} \end{cases} \qquad (54)$$

$$(a_{j,j}^*)^{(n+1)} = \begin{cases} a_{n,n} - (a_{n,m}^*)^2/a_{n,n}^*, & j = m \\ a_{j,j}^* - (a_{n,j}^*)^2/a_{n,n}^*, & j \ne m \end{cases} \qquad (55)$$

In the vector b,

$$b_n^* = b_m \qquad (56)$$

$$(b_j^*)^{(n+1)} = \begin{cases} b_n - a_{n,m}^* b_n^*/a_{n,n}^*, & j = m \\ b_j - a_{n,j}^* b_n^*/a_{n,n}^*, & j \ne m \end{cases} \qquad (57)$$

Finally, p_n^{(n-1)} and φ_j^{(n)} for n < j ≤ M are updated according to

$$(p_n^{(n-1)})^* = \varphi_m^{(n-1)} \qquad (58)$$

$$(\varphi_j^{(n)})^* = \varphi_j^{(n-1)} - \frac{a_{n,j}^*}{a_{n,n}^*}(p_n^{(n-1)})^* \qquad (59)$$

3.2.3. Model refinement

The two procedures in Sections 3.2.1 and 3.2.2 are repeated until no insignificant centres remain in the full regression matrix P_n. As described for the forward selection stage, the reduction in LOO error involves a constrained minimization, since the selection of additional model terms depends on those previously chosen. This is also true in determining the stopping point at this step: the total number of RBF centres carried over from the first stage is not optimal. Fig. 1 illustrates the case where n centres are selected in the first stage and the LOO error is further reduced during the second stage at model size n. However, n is not the optimal model size at the second stage, since n′ is in fact the optimal number of centres. The third procedure therefore involves re-ordering the selected centres by their contributions to the network. This is done by putting all n selected hidden nodes into a smaller term pool and applying forward selection again. This process either terminates automatically at the point n′ or ends when all n terms have been re-selected.
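For illustration, the following deliberately naive sketch performs the second-stage review by direct LOO computation: each selected term is in turn compared against every candidate and swapped whenever a candidate yields a lower LOO error. The paper's recursive updates (36)-(59) avoid this wholesale recomputation; the function names and the loop cap are assumptions.

```python
import numpy as np

def loo_error(P, y):
    """LOO error (25) from the residual matrix R = I - P (P^T P)^{-1} P^T."""
    R = np.eye(len(y)) - P @ np.linalg.solve(P.T @ P, P.T)
    e, d = R @ y, np.diag(R)
    return np.mean((e / d) ** 2)

def refine_selection_loo(Phi, y, selected, max_loops=5):
    """Stage two: replace any selected term that a pool candidate beats on LOO error."""
    selected = list(selected)
    for _ in range(max_loops):
        changed = False
        for pos in range(len(selected) - 1, -1, -1):       # review each selected term
            others = [s for i, s in enumerate(selected) if i != pos]
            best_idx = selected[pos]
            best_J = loo_error(Phi[:, others + [best_idx]], y)
            for j in range(Phi.shape[1]):                  # compare with the candidate pool
                if j in selected:
                    continue
                J = loo_error(Phi[:, others + [j]], y)
                if J < best_J:
                    best_idx, best_J = j, J
            if best_idx != selected[pos]:
                selected[pos] = best_idx
                changed = True
        if not changed:                                    # no insignificant term remains
            break
    return selected
```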

3.3. Algorithm

The proposed algorithm for selecting a sub-model can now be summarized as follows.

Step 1. Initialization: Construct the candidate regression matrix based on the ELM, and let the model size k = 0. Then assign the initial values for the following terms (j = 1, …, M):
- J_0 = (1/N) Σ_{t=1}^{N} y(t)²;
- d_0 = [1, …, 1]^T ∈ R^{N×1}; e_0 = y;
- φ_j^{(0)} = φ_j; a_{j,j}^{(1)} = φ_j^T φ_j; b_j^{(1)} = φ_j^T y.

Step 2. Forward selection:
(a) At the kth step (1 ≤ k ≤ M), use (31), (32) and (25) to calculate d_j, e_j and the corresponding LOO error J_k for each candidate term.
(b) Find the candidate regressor that gives the minimal LOO error, and add it to the regression matrix P. Then compute a_{k,j} and pre-calculate φ_j^{(k)}, a_{j,j}^{(k+1)} and b_j^{(k+1)} for j = k+1, …, M.
(c) If the LOO error J_{k-1} > J_k, set k = k+1 and go back to step 2(a). Otherwise, go to Step 3.

Step 3. Backward model refinement:
(a) Exchange the position of p_k with p_{k+1} (k = n-1, …, 1), and update the related terms using (38)-(44).
(b) Continue the above step until the regressor p_k has been moved to the nth position.
(c) Update a_{j,j}, b_j, φ_j^{(n-1)}, d_j and e_j for each candidate regressor using (47)-(51), and compute the new LOO errors.
(d) If the LOO error J_m^* of a candidate term is less than J_n^*, then replace p_n^* with φ_m, and put p_n^* back into the candidate term pool. Update the related terms according to (52)-(59).
(e) If k > 1, set k = k-1 and go to step 3(a).
(f) If one or more regressor terms were changed in the last review, then set k = n-1 and repeat steps 3(a)-(e) to review all the terms again. Otherwise, the procedure is terminated.

Step 4. Final forward selection: Put all n selected regressor terms together to form a small candidate term pool, and apply the forward selection procedure again. This selection process automatically terminates at n′, n′ ≤ n.
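Under the assumption that the illustrative helpers forward_selection_loo and refine_selection_loo sketched earlier are available, the four steps can be strung together roughly as follows; the candidate-pool generator and all parameter choices are assumptions, not the authors' settings.

```python
import numpy as np

def elm_rbf_candidates(X, M, width_range=(0.0, 2.0), rng=None):
    """Step 1: a randomly parameterised Gaussian RBF candidate pool (the ELM idea)."""
    rng = np.random.default_rng(rng)
    centres = X[rng.integers(0, len(X), size=M)]        # centres drawn from the data
    widths = rng.uniform(*width_range, size=M) + 1e-3   # avoid a zero width
    sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dist / widths ** 2), centres, widths

def two_stage_elm(X, y, M=200, rng=0):
    Phi, centres, widths = elm_rbf_candidates(X, M, rng=rng)     # Step 1
    selected, _ = forward_selection_loo(Phi, y)                  # Step 2
    selected = refine_selection_loo(Phi, y, selected)            # Step 3
    order, _ = forward_selection_loo(Phi[:, selected], y)        # Step 4: re-select from the small pool
    selected = [selected[i] for i in order]
    theta = np.linalg.lstsq(Phi[:, selected], y, rcond=None)[0]  # linear parameters, Eq. (4)
    return selected, theta, centres, widths
```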

3.4. Computational complexity

The computation in the two-stage selection with leave-one-out cross validation is dominated by the first, forward selection stage. It will be shown that the proposed method is still more efficient than orthogonal least squares (OLS). Specifically, suppose there are initially M candidate regressors in the selection pool, and only n terms are to be included in the final model. With N data samples used for training, the computational complexity of OLS using standard Gram–Schmidt orthogonalisation, of the fast recursive algorithm (FRA), of the two-stage selection, and of their combinations with cross validation are now reviewed.

The computational complexity is measured by the total number of basic arithmetic operations (addition/subtraction and multiplication/division). For OLS with leave-one-out cross validation, the total computation is given by

$$C(\mathrm{OLS}) \approx NM(2n^2+11n) - n(n-1)(11N+M-1)/2 - n(n-1)(2n-1)(4N-1)/6 + 2N - 2M \qquad (60)$$

For the first stage of the proposed algorithm, the computation is the same as FRA with LOO, and is given by

$$C(\mathrm{FRA}) \approx NM(13n+4) - Nn(13n-9) - n(7n+3) + 7Mn - 2M - 2N \qquad (61)$$

The computation for the second, refinement stage includes the shifting of each selected term to the last position, the new LOO error comparison, and the exchange of a centre of interest with a candidate pool term. In the extreme case, all the terms of interest are insignificant compared with a candidate regressor; the total computation involved in one checking loop is then

$$C(\mathrm{2nd}) \approx 4N(M-n) + M(3n^2-n+6) + 2N(n^2-1) - n^3 + 5n^2 - 14n \qquad (62)$$

Generally, most of the shifted terms are more significant than the candidate ones, so the actual computation for the second stage is much less than (62), and the number of checking loops is normally less than 5. In practice n ≪ N and n ≪ M, so the computational effort mainly comes from the first term in the above equations. Table 1 compares these algorithms; it shows that the new method described here needs about half of the computation of OLS with LOO cross validation.

Table 1
Comparison of computational complexity (five checking loops are used at the second stage; N is the number of samples, M is the size of the initial term pool and n represents the final model size).

Algorithm      Computation
OLS            2NM(n^2 + 2n)
OLS + LOO      NM(2n^2 + 11n)
FRA            2NMn
FRA + LOO      13NMn
Two-stage      NM(4n + 15)/2
New            NM(13n + 20)

4. Simulation example

This section presents simulation results for nonlinear system modelling using RBF neural networks, which have the linear-in-the-parameters structure. The following three methods were applied and assessed. (1) The OLS method: the width of the activation function is chosen a priori, the data samples are used as the centres of the candidate RBF nodes, OLS is used to select the most significant terms, and the AIC (Akaike's information criterion) is adopted to control the model complexity. (2) Two-stage with LOO: the experimental conditions are the same as above, and the two-stage method with LOO is applied to build the model automatically. (3) The proposed method: the widths and centres of the RBF nodes are randomly selected instead of being set a priori, and the proposed method is used for automatic model construction.

Example 1: Consider a nonlinear system to be approximated by an RBF neural model [19], defined by

$$y(x) = 0.1x + \frac{\sin(x)}{x} + \sin(0.5x), \quad -10 \le x \le 10 \qquad (63)$$

A total of 400 data samples were generated with x uniformly distributed in the range [-10, 10]. Gaussian white noise with zero mean and variance 0.1² was added to the first 200 data samples. The RBF model employed the Gaussian kernel function φ(x, c_i) = exp(-‖x - c_i‖²/σ²) as the basis function. In OLS, the width of the Gaussian basis function was chosen as σ = 4. For the method proposed here, the support function width was chosen randomly in the interval [0, 2] and the hidden nodes were randomly selected from the data samples. With the first 200 noisy samples used for training and the remaining 200 noise-free data reserved for testing, the conventional OLS algorithm selected 11 centres at the point where the AIC value started to increase. The two-stage method with leave-one-out cross validation constructed a more compact network with eight RBF centres under both the conventional candidate pool and the ELM. The root mean-squared errors (RMSE) [19] over the training and testing data sets are given in Table 2. The RMSE is defined as

$$\mathrm{RMSE} = \sqrt{\frac{\mathrm{SSE}}{N}} = \sqrt{\frac{(y-\hat{y})^T(y-\hat{y})}{N}} \qquad (64)$$

where SSE is the sum-squared error, ŷ is the RBF neural network prediction, and N is the number of samples. As shown in Table 2, it is clear that:

Table 2
Comparison of model performance in experiment 1 (RMSE).

Algorithm        Network size   Training error   Testing error
OLS              11             0.1006           0.0326
Two-stage + LOO  8              0.1011           0.0201
New              8              0.1004           0.0337

- OLS produced an over-fitted network with a smaller training error and a larger testing error.
- Two-stage with LOO gave a better RBF network with a slightly increased training error and a smaller testing error.
- By using the extreme learning machine, the new method constructed a more compact RBF model than OLS.

Though the proposed method did not produce a better model than the other two methods in terms of testing error, its main advantage is that the network can be rapidly constructed without the need to test different RBF widths. Table 3 shows the refinement procedure at the second stage when using the extreme learning machine, where seven checking loops were performed to minimize the LOO error. A total of four RBF nodes were replaced at the second, network refinement stage, and the RMSE value was reduced without increasing the model size.

Table 3
Evolution of term selection in the model refining procedure in experiment 1.

Loop   Selected terms                      Training error (RMSE)   LOO error
0      [24,163,100,55,150,146,175,120]     0.1056                  0.0127
1      [24,163,100,55,150,146,120,5]       0.1060                  0.0123
2      [24,163,100,55,150,120,5,149]       0.1061                  0.0123
3      [24,163,100,55,120,5,149,121]       0.1031                  0.0116
4      [24,120,5,149,121,55,100,96]        0.1026                  0.0115
5      [24,96,100,55,121,149,5,110]        0.1018                  0.0113
6      [24,96,100,55,121,110,5,146]        0.1005                  0.0110
7      [24,96,100,55,110,5,146,16]         0.1004                  0.0110
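For readers who want to reproduce the flavour of this experiment with the sketches given earlier, the snippet below generates data as in Eq. (63) and applies the illustrative two_stage_elm routine; because the random pool, the noise seed and the implementation details are assumptions rather than the authors' code, the selected centres and RMSE values will differ from Tables 2 and 3.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-10.0, 10.0, size=(400, 1))
y_clean = 0.1 * x[:, 0] + np.sinc(x[:, 0] / np.pi) + np.sin(0.5 * x[:, 0])  # Eq. (63)
y = y_clean.copy()
y[:200] += rng.normal(scale=0.1, size=200)        # noise on the training half only

sel, theta, centres, widths = two_stage_elm(x[:200], y[:200], M=200, rng=1)
Phi_test = np.exp(-((x[200:] - centres[sel].T) ** 2) / widths[sel] ** 2)
rmse_test = np.sqrt(np.mean((y_clean[200:] - Phi_test @ theta) ** 2))       # Eq. (64)
print(len(sel), rmse_test)
```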

Example 2: The proposed algorithm was then applied to approximate a nonlinear dynamic system given by [13]

$$z(t) = \frac{z(t-1)z(t-2)z(t-3)u(t-2)[z(t-3)-1] + u(t-1)}{1 + z^2(t-2) + z^2(t-3)} \qquad (65)$$

where u(t) is the random system input, uniformly distributed in the range [-1, 1]. Again, a total of 400 data samples were generated, with the first 200 noisy data samples used for training and the remaining 200 noise-free data samples for testing. The noise series had a Gaussian distribution with zero mean and variance 0.1². The activation function had the same form as in experiment 1, with σ = 1.096 for the conventional network construction scheme. The input vector was x(t) = [y(t-1), y(t-2), y(t-3), u(t-1), u(t-2)]^T. In this case, the OLS method selected 26 RBF nodes while the proposed algorithm chose 13 nodes. As shown in Table 4, OLS again produced an over-fitted network with a smaller training error and a larger testing error. By contrast, the proposed method gave a compact network model with better generalization performance. Table 5 illustrates the reduction of the LOO error in each refining loop of the two-stage algorithm under the ELM scheme.

Table 4
Comparison of model performance in experiment 2 (RMSE).

Algorithm        Network size   Training error   Testing error
OLS              26             0.0913           0.0735
Two-stage + LOO  13             0.1020           0.0692
New              13             0.0999           0.0572

Table 5
Evolution of term selection in the model refining procedure in experiment 2.

Loop   Number of changed terms   Training error (RMSE)   LOO error
0      0                         0.1053                  0.0129
1      2                         0.1060                  0.0127
...    ...                       ...                     ...
11     8                         0.0999                  0.0113
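A hedged sketch of how the dynamic system (65) might be simulated to produce such data is given below; the zero initial conditions, the use of the noisy output in the regressor vector, and the seed are assumptions made for illustration.

```python
import numpy as np

def simulate_system(n_samples=403, noise_std=0.1, rng=None):
    """Simulate Eq. (65) and build NARX regressors x(t) = [y(t-1..3), u(t-1..2)]."""
    rng = np.random.default_rng(rng)
    u = rng.uniform(-1.0, 1.0, size=n_samples)
    z = np.zeros(n_samples)                      # assumed zero initial conditions
    for t in range(3, n_samples):
        num = z[t-1] * z[t-2] * z[t-3] * u[t-2] * (z[t-3] - 1.0) + u[t-1]
        z[t] = num / (1.0 + z[t-2] ** 2 + z[t-3] ** 2)           # Eq. (65)
    y = z + rng.normal(scale=noise_std, size=n_samples)          # noisy measurements
    X = np.column_stack([y[2:-1], y[1:-2], y[:-3], u[2:-1], u[1:-2]])
    return X, y[3:]

X, y_target = simulate_system()
sel, theta, centres, widths = two_stage_elm(X[:200], y_target[:200], M=200, rng=2)
```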

5. Concluding summary

In this paper, a previously proposed two-stage stepwise identification algorithm [1] is combined with leave-one-out cross validation to improve the compactness of models constructed by the extreme learning machine. At the first stage, the selection procedure is automatically terminated based on the LOO error. At the second stage, the contribution of each selected model term is reviewed, and insignificant ones are replaced; this procedure is repeated until no insignificant RBF nodes remain in the final model. The forward selection step is then implemented again to re-order the selected terms, producing a potential further reduction in the model size. Results from two simulation examples show that the proposed improved extreme learning machine scheme can efficiently produce a nonlinear model with satisfactory performance. Building an even more compact model by integrating other techniques, such as Bayesian regularization, is regarded as future work.

Acknowledgment

Jing Deng wishes to thank Queen's University Belfast for the award of an ORS scholarship to support his doctoral studies. This work was partially supported by the UK EPSRC under grants EP/F021070/1 and EP/G042594/1.

References

[1] K. Li, J.X. Peng, E.W. Bai, A two-stage algorithm for identification of nonlinear dynamic systems, Automatica 42 (7) (2006) 1189–1197.
[2] O. Nelles, Nonlinear System Identification, Springer, 2001.
[3] G. Huang, Q. Zhu, C. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (2006) 489–501.
[4] G. Huang, L. Chen, C. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Transactions on Neural Networks 17 (4) (2006) 879–892.
[5] G. Huang, L. Chen, Convex incremental extreme learning machine, Neurocomputing 70 (16–18) (2007) 3056–3062.
[6] A. Miller, Subset Selection in Regression, CRC Press, 2002.
[7] X. Hong, R.J. Mitchell, S. Chen, C.J. Harris, K. Li, G.W. Irwin, Model selection approaches for non-linear system identification: a review, International Journal of Systems Science 39 (10) (2008) 925–946.
[8] S. Chen, S.A. Billings, W. Luo, Orthogonal least squares methods and their application to non-linear system identification, International Journal of Control 50 (5) (1989) 1873–1896.
[9] S. Chen, C.F.N. Cowan, P.M. Grant, Orthogonal least squares learning algorithm for radial basis function networks, IEEE Transactions on Neural Networks 2 (2) (1991) 302–309.
[10] K. Li, J.X. Peng, G.W. Irwin, A fast nonlinear model identification method, IEEE Transactions on Automatic Control 50 (8) (2005) 1211–1216.
[11] S. Chen, E.S. Chng, K. Alkadhimi, Regularized orthogonal least squares algorithm for constructing radial basis function networks, International Journal of Control 64 (5) (1996) 829–837.
[12] D.J. Du, K. Li, M.R. Fei, G.W. Irwin, A new locally regularised recursive method for the selection of RBF neural network centres, in: R. Walter (Ed.), Proceedings of the 15th IFAC Symposium on System Identification, SYSID 2009, Saint-Malo, France, 2009.
[13] X. Hong, P.M. Sharkey, K. Warwick, Automatic nonlinear predictive model construction algorithm using forward regression and the PRESS statistic, IEE Proceedings: Control Theory and Applications 150 (3) (2003) 245–254.
[14] A. Sherstinsky, R.W. Picard, On the efficiency of the orthogonal least squares training method for radial basis function networks, IEEE Transactions on Neural Networks 7 (1) (1996) 195–200.
[15] K.Z. Mao, S.A. Billings, Algorithms for minimal model structure detection in nonlinear dynamic system identification, International Journal of Control 68 (2) (2007) 311–330.
[16] K. Li, G. Huang, S. Ge, A Fast Construction Algorithm for Feedforward Neural Networks, Springer-Verlag, 2010.
[17] Y. Lan, Y. Soh, G. Huang, Two-stage extreme learning machine for regression, Neurocomputing 73 (16–18) (2010) 3028–3038.
[18] L. Ljung, System Identification: Theory for the User, Prentice-Hall, Englewood Cliffs, NJ, 1987.
[19] J.X. Peng, K. Li, D.S. Huang, A hybrid forward algorithm for RBF neural network construction, IEEE Transactions on Neural Networks 17 (6) (2006) 1439–1451.

Jing Deng received his B.Sc. degrees from the National University of Defence Technology, China, in 2001 and 2005, and his M.Sc. degrees from Shanghai University, China, in 2005 and 2007. From March 2008 he started his Ph.D. research in the Intelligent Systems and Control (ISAC) Group at Queen's University Belfast, UK. His main research interests include system modelling, pattern recognition and fault detection.

Kang Li is a Reader in Intelligent Systems and Control at Queen's University Belfast. His research interests include advanced algorithms for the training and construction of neural networks, fuzzy systems and support vector machines, as well as advanced evolutionary algorithms, with applications to nonlinear system modelling and control, microarray data analysis, systems biology, environmental modelling and monitoring, and polymer extrusion. He has produced over 150 research papers and co-edited seven conference proceedings in his field. He is a Chartered Engineer, a member of the IEEE and the InstMC, and the current Secretary of the IEEE UK and Republic of Ireland Section.

George W. Irwin leads the Intelligent Systems and Control Research group and is Director of the University Virtual Engineering Centre at Queen's University Belfast. He has been elected a Fellow of the Royal Academy of Engineering and a Member of the Royal Irish Academy, and is a Chartered Engineer, an IEEE Fellow, a Fellow of the IEE and a Fellow of the Institute of Measurement and Control. His research covers identification, monitoring, and control, including neural networks, fuzzy neural systems and multivariate statistics, and he has published over 350 research papers and 12 edited books. He is currently working on wireless networked control systems, fault diagnosis of internal combustion engines and novel techniques for fast temperature measurement, and was a Technical Director of Anex6 Ltd., a spin-out company from his group specialising in process monitoring. He has been awarded a number of prizes including four IEE Premiums, a Best Paper award from the Czech Academy of Sciences and the 2002 Honeywell International Medal from the Institute of Measurement and Control. International recognitions include Honorary Professor at Harbin Institute of Technology and Shandong University, and Visiting Professor at Shanghai University. He is a former Editor-in-Chief of the IFAC journal Control Engineering Practice and past chair of the UK Automatic Control Council. He currently chairs the IFAC Publications Committee and serves on the editorial boards of several journals.