Neurocomputing 74 (2011) 2422–2429

Fast automatic two-stage nonlinear model identification based on the extreme learning machine

Jing Deng, Kang Li*, George W. Irwin
Intelligent Systems and Control Group, School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast BT9 5AH, UK

Available online 12 May 2011
doi:10.1016/j.neucom.2010.11.035

Keywords: Extreme learning machine; Two-stage stepwise selection; Leave-one-out cross validation; RBF networks

*Corresponding author. Tel.: +44 2890974663. E-mail address: [email protected] (K. Li).

Abstract
It is convenient and effective to solve nonlinear problems with a model that has a linear-in-the-parameters (LITP) structure. However, the nonlinear parameters (e.g. the width of the Gaussian function) of each model term need to be pre-determined, either from expert experience or through exhaustive search. An alternative approach is to optimize them by a gradient-based technique (e.g. Newton's method). Unfortunately, all of these methods remain computationally expensive. Recently, the extreme learning machine (ELM) has shown its advantages in terms of fast learning from data, but the sparsity of the constructed model cannot be guaranteed. This paper proposes a novel algorithm for the automatic construction of a nonlinear system model based on the extreme learning machine. This is achieved by effectively integrating the ELM and leave-one-out (LOO) cross validation with our two-stage stepwise construction procedure [1]. The main objective is to improve the compactness and generalization capability of the model constructed by the ELM method. Numerical analysis shows that the proposed algorithm involves only about half of the computation of the orthogonal least squares (OLS) based method. Simulation examples are included to confirm the efficacy and superiority of the proposed technique.
© 2011 Elsevier B.V. All rights reserved.
1. Introduction
In nonlinear system modelling and identification, two main issues need to be resolved. The first is to find a proper nonlinear model structure, and the second is to optimize that model structure and its parameters. To address the first issue, various model types have been proposed, such as the nonlinear autoregressive with exogenous input (NARX) model, the nonlinear output error (NOE) model, artificial neural networks, and fuzzy systems [2]. Generally, most of these models have, or can be transformed into, a linear-in-the-parameters (LITP) structure which is easier to optimize.
The main problem in a LITP model is the determination of the nonlinear parameters in each term, e.g. the width of the Gaussian function and the power of a polynomial term. These parameters are normally found by exhaustive search or optimized by a gradient-based technique, which significantly increases the computational complexity. Unlike conventional schemes, a more general model construction concept, namely the extreme learning machine (ELM), has been proposed for LITP models [3]. In an ELM, all the nonlinear parameters are randomly chosen, independent of the training data. Training is thus transformed into a standard least-squares problem, leading to a significant improvement in the learning speed. It has been proven that an LITP model with such randomly generated nonlinear parameters can approximate any continuous target function [4]. Further, the ELM has been extended to a much wider class of model terms, including fuzzy rules as well as additive nodes [5].
Though the ELM is efficient in terms of nonlinear model construction, the resultant model is not compact. Owing to the stochastic generation of the hidden nodes, some of the terms contribute little to the model. Thus, a sparse model with satisfactory accuracy needs to be further extracted from these candidate terms based on the parsimonious principle. The most popular approach for this is the subset selection method, including forward stepwise selection, backward stepwise selection and the so-called best subset selection [6]. Amongst these techniques, forward stepwise selection is perhaps the most popular approach in the literature for model construction where a very large term pool has to be considered [7]. The best-known algorithm in forward selection is orthogonal least squares (OLS) [8,9], which is derived from an orthogonal (or QR) decomposition of the regression matrix. An alternative is the recently proposed fast recursive algorithm (FRA) [10], which has been shown to be more efficient and stable than OLS.
The main issue in forward stepwise construction is terminating the selection procedure to prevent over-fitting. Commonly used methods include information criteria such as the final prediction error (FPE) or Akaike's information criterion (AIC) [2], which provide a trade-off between the training error and the model complexity. However, the tuning parameters that penalize the model complexity in these criteria are sensitive to both the specific application and the data sample size, so it is still easy to miss an optimal model. To overcome this problem, regularized forward model construction variants have also been described [11,12]. The main idea here is that each model weight is assigned a prior hyperparameter, and the most probable values of these hyperparameters are estimated iteratively from the data. As a result, those terms that are mainly determined by the noise have large hyperparameter values, and their corresponding model weights are forced towards zero. Sparsity is then achieved by removing irrelevant terms from the trained model.
To assess the generalization ability, the resultant model is usually applied to a set of test data. This involves splitting all the available measured data into different sets. However, the amount of data available is often limited, so it is desirable to use all of it to train the model without sacrificing generalization performance. A typical way of doing this is to employ cross validation: the limited data set is split into s parts, of which s−1 parts are used for model training and the single remaining part for model testing. This procedure is continued until all s different possible combinations have been used. The extreme case of cross validation is known as the leave-one-out (LOO) method [2,13], where only one sample is used for testing and the rest are left for training. The overall model error, or LOO error, is then the average test error over all data samples. By adopting this method in a forward construction context, model terms are selected based on their reduction in the LOO error, and the selection procedure automatically terminates at the point where this starts to increase. Though this approach can achieve improved model generalization, its computational complexity is extremely high; it is generally feasible only for very small data sets, because the computational complexity is proportional to the total number of data samples. However, if the model has a linear-in-the-parameters structure, it has been shown that the LOO error can be calculated without splitting the data set explicitly [13].
Though a forward construction scheme can efficiently build a sparse model from a large candidate term pool, the final model is not optimal [14], since the calculation of the contribution of any new term depends on previously selected ones [1]. To relax this constraint, genetic search has been suggested to refine the model structure [15], but at the expense of a very high computational complexity. In [1], a two-stage stepwise construction method was recently proposed which combines both forward and backward construction procedures. This retains the computational efficiency of the FRA while improving the model compactness.
In the above two-stage construction algorithm, an initial model is first constructed by the FRA, where the contribution of a particular term of interest is measured by its reduction in an appropriate cost function. The significance of each selected term is then reviewed at a second stage of model refinement, and insignificant terms are replaced. Specifically, if the contribution of a previously selected term is less than that of any term from the candidate pool, it is replaced with that term. Thus, the cost function can be further reduced without increasing the model size. This checking cycle is iterated until no insignificant model term exists in the trained model, resulting in an optimized model structure with improved performance. The extension of the two-stage algorithm to the ELM was first proposed in [16], and then further improved in [17]. However, the computational complexity in [17] is still very high, which may discourage its application to large data sets.
In this paper, the leave-one-out cross validation technique is effectively integrated into our recently proposed two-stage stepwise construction method to enhance the sparsity of models constructed from the ELM. In this new algorithm, the contribution of each candidate model term is measured by its reduction in the LOO error, and the construction procedure is automatically stopped when this starts to increase. The second stage of model refinement can further improve the trained model by replacing any insignificant terms with ones from the candidate pool. Owing to the introduction of a residual matrix which can be updated recursively, the cost of calculating the LOO error is significantly reduced. Results from two simulation examples are included to confirm the efficacy of the proposed method.
The paper is organized as follows. Section 2 gives the preliminaries. The automatic two-stage algorithm is then introduced in Section 3. Section 4 describes the results from two simulation examples to illustrate the effectiveness of the proposed technique, and Section 5 contains a concluding summary.
2. Preliminaries
2.1. General nonlinear system model
Consider a general nonlinear system approximated by a model that has a linear-in-the-parameters structure

$$ y(t) = \sum_{k=1}^{n} \theta_k \varphi_k(x(t)) + e(t) \qquad (1) $$

where $y(t)$ is the system output at sample time $t$, $x(t)$ is the input vector, $\varphi_k(x(t))$ denotes the nonlinear function (e.g. polynomial or Gaussian), $\theta_k$ is a model parameter, and $e(t)$ is the model prediction error. In the extreme learning machine, the nonlinear parameters in $\varphi_k(x(t))$, $k = 1, \ldots, n$ are randomly assigned, and the model parameters $\theta$ are estimated through least squares. The generalization performance mainly depends on the number of nonlinear functions $n$: too small an $n$ may result in an under-fitted model, while too large an $n$ may lead to an over-fitted one.
Suppose a set of $N$ data samples $\{x(t), y(t)\}_{t=1}^{N}$ is used for model training; then (1) can be written in the matrix form

$$ y = \Phi\theta + e \qquad (2) $$

where $\Phi = [\phi_1, \ldots, \phi_n] \in \mathbb{R}^{N \times n}$ is the regression matrix with column vectors $\phi_i = [\varphi_i(x(1)), \ldots, \varphi_i(x(N))]^T$, $i = 1, \ldots, n$, $y = [y(1), \ldots, y(N)]^T \in \mathbb{R}^N$, $\theta = [\theta_1, \ldots, \theta_n]^T \in \mathbb{R}^n$ and $e = [e(1), \ldots, e(N)]^T \in \mathbb{R}^N$.
The model selection procedure aims to build a parsimonious representation. Suppose that the initial model has $M$ candidate terms, and $n$ ($n \ll M$) regressors $p_1, \ldots, p_n$ have been selected from the candidate pool, with those remaining in the pool denoted as $\phi_i(\cdot)$, $i = n+1, \ldots, M$. The resulting model is represented by

$$ y = P_n\theta_n + e \qquad (3) $$

where $P_n = [p_1, \ldots, p_n]$, $\theta_n = [\theta_1, \ldots, \theta_n]^T$, $e = [e(1), \ldots, e(N)]^T$, and the least-squares estimate of $\theta_n$ is given by

$$ \theta_n = (P_n^T P_n)^{-1} P_n^T y \qquad (4) $$
The above introduction reveals three issues in the ELM that have to be dealt with in applications: (i) a large number of nonlinear functions is usually required; (ii) the singularity problem in $P$ becomes serious as the model size increases; and (iii) a large model generates a high computational overhead in deriving the linear parameters. These three problems are closely coupled. If the performance of the nonlinear model can be improved, the number of required nonlinear functions and the corresponding number of linear parameters can be significantly reduced, leading to an overall reduction in computational complexity. Once $P^T P$ is nonsingular, a number of efficient algorithms can be used for fast computation of the linear parameters. The basic idea in this paper is to introduce an integrated framework that selects the nonlinear terms from the randomly generated candidate pool in the ELM, further improving the model performance while also calculating the linear parameters efficiently.
2.2. Computation of LOO error
Leave-one-out is a commonly used cross validation method to improve the model generalization capability [13]. It is achieved by using only one sample for model testing in each run, with the remaining $(N-1)$ samples reserved for training. Thus, for $t = 1, \ldots, N$, the $j$th model is estimated from the data set with the $t$th point removed. The prediction error can then be calculated by

$$ e_j^{(-t)}(t) = y(t) - \hat{y}_j^{(-t)}(t) \qquad (5) $$

where $\hat{y}_j^{(-t)}(t)$ is the $j$th model output estimated using the remaining $(N-1)$ data samples. The LOO error is obtained by averaging all these prediction errors:

$$ J_j = \frac{1}{N} \sum_{t=1}^{N} (e_j^{(-t)}(t))^2 \qquad (6) $$
The model term with the minimum LOO error will be selected at each step. This procedure is obviously computationally expensive, since the amount of calculation involved in using each data sample individually, as described above, is $N$ times that of using the complete data set once. However, the following derivation shows that the LOO error can be obtained without explicitly splitting the training data set as above [13].
Refer to (3), and define a matrix $M$:

$$ M = P^T P \qquad (7) $$

Then the least-squares parameter estimate of $\theta$ is given by

$$ \theta = (P^T P)^{-1} P^T y = M^{-1} P^T y \qquad (8) $$

The model residual at sample time $t$ becomes

$$ e(t) = y(t) - p(t)\theta = y(t) - p(t) M^{-1} P^T y \qquad (9) $$

where $p(t)$ denotes the $t$th row of the regression matrix $P$. If the $t$th data sample is deleted from the estimation data set, the parameter $\theta$ is instead calculated as

$$ \theta^{(-t)} = \{M^{(-t)}\}^{-1} \{P^{(-t)}\}^T y^{(-t)} \qquad (10) $$

From the definition of $M$, it can be shown that

$$ M^{(-t)} = M - p(t)^T p(t) \qquad (11) $$

and

$$ \{P^{(-t)}\}^T y^{(-t)} = P^T y - p(t)^T y(t) \qquad (12) $$

By using the well-known matrix inversion lemma $[A + BCD]^{-1} = A^{-1} - A^{-1}B[DA^{-1}B + C^{-1}]^{-1}DA^{-1}$ [18], the inverse of $M^{(-t)}$ in (11) can be computed as

$$ \{M^{(-t)}\}^{-1} = M^{-1} + \frac{M^{-1} p(t)^T p(t) M^{-1}}{1 - p(t) M^{-1} p(t)^T} \qquad (13) $$

The model error at data point $t$ is now given as [13]

$$ e^{(-t)}(t) = y(t) - p(t)\theta^{(-t)} = \frac{y(t) - p(t) M^{-1} P^T y}{1 - p(t) M^{-1} p(t)^T} = \frac{e(t)}{1 - p(t) M^{-1} p(t)^T} \qquad (14) $$

The calculation of the LOO error is simplified by the use of Eq. (14), which does not involve splitting the training data sequentially.
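As a quick numerical check of this identity, the sketch below (Python with NumPy; the function names are ours) computes the LOO residuals both ways: via Eq. (14), and by explicitly refitting the model N times with one sample removed each time. The two agree to machine precision.

```python
import numpy as np

def loo_errors_fast(P, y):
    """LOO residuals of a linear-in-the-parameters model without refitting.

    Implements e^(-t)(t) = e(t) / (1 - p(t) M^{-1} p(t)^T), Eq. (14),
    where M = P^T P and p(t) is the t-th row of P.
    """
    Minv = np.linalg.inv(P.T @ P)
    e = y - P @ (Minv @ (P.T @ y))            # ordinary residuals, Eq. (9)
    h = np.einsum('ti,ij,tj->t', P, Minv, P)  # leverages p(t) M^{-1} p(t)^T
    return e / (1.0 - h)

def loo_errors_naive(P, y):
    """Reference implementation: refit N times, leaving one sample out each time."""
    N = len(y)
    out = np.empty(N)
    for t in range(N):
        keep = np.arange(N) != t
        th = np.linalg.lstsq(P[keep], y[keep], rcond=None)[0]
        out[t] = y[t] - P[t] @ th
    return out
```

The fast version costs one least-squares fit instead of N, which is exactly what makes LOO-based term selection affordable in the algorithm below.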
3. Automatic two-stage model construction
In the forward stepwise construction scheme, model terms are selected one-by-one, with the LOO error being maximally reduced each time. Suppose $k$ such terms have been selected, and the resulting regression matrix is denoted as

$$ P_k = [p_1, \ldots, p_k], \quad k = 1, \ldots, n \qquad (15) $$

From (14), the corresponding model residual at sample time $t$ becomes

$$ e_k^{(-t)}(t) = \frac{y(t) - p_k(t) M_k^{-1} P_k^T y}{1 - p_k(t) M_k^{-1} p_k^T(t)} = \frac{e_k(t)}{1 - p_k(t) M_k^{-1} p_k^T(t)} \qquad (16) $$

and the LOO error is given by

$$ J_k = \frac{1}{N} \sum_{t=1}^{N} (e_k^{(-t)}(t))^2 \qquad (17) $$
If one more regressor term $p_{k+1}$ is selected, the regression matrix changes to $P_{k+1} = [P_k, p_{k+1}]$. The selected term should maximally reduce the LOO error compared to all the remaining available candidates. However, this choice involves a constrained minimization of $J_{k+1}$, which can be addressed in a second stage of model refinement.
3.1. Forward model selection—first stage
In order to simplify the computation of the LOO error, a residual matrix series is defined first:

$$ R_k \triangleq \begin{cases} I - P_k M_k^{-1} P_k^T, & 0 < k \le n \\ I, & k = 0 \end{cases} \qquad (18) $$

According to [10,1], the matrices $R_k$, $k = 0, \ldots, n$ have the following properties:

$$ R_{k+1} = R_k - \frac{R_k p_{k+1} p_{k+1}^T R_k^T}{p_{k+1}^T R_k p_{k+1}}, \quad k = 0, 1, \ldots, n-1 \qquad (19) $$

$$ R_k^T = R_k; \quad (R_k)^2 = R_k, \quad k = 0, 1, \ldots, n \qquad (20) $$

$$ R_i R_j = R_j R_i = R_i, \quad i \ge j; \ i, j = 0, 1, \ldots, n \qquad (21) $$

$$ R_k \phi = \begin{cases} 0, & \mathrm{rank}([P_k, \phi]) = k \\ \phi^{(k)} \ne 0, & \mathrm{rank}([P_k, \phi]) = k+1 \end{cases} \qquad (22) $$

$$ R_{1,\ldots,p,\ldots,q,\ldots,k} = R_{1,\ldots,q,\ldots,p,\ldots,k}, \quad p, q \le k \qquad (23) $$

The last property means that any change in the selection order of the regressor terms $p_1, \ldots, p_k$ does not change the residual matrix $R_k$. This property will help to reduce the computational effort in the second stage.
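These projection-matrix properties are easy to verify numerically. The snippet below (Python with NumPy; a construction of ours, not from the paper) checks the symmetry and idempotency of Eq. (20) and the rank-one update of Eq. (19) on a small random example.

```python
import numpy as np

rng = np.random.default_rng(0)
P2 = rng.normal(size=(20, 2))   # two selected regressors p1, p2
p3 = rng.normal(size=(20, 1))   # one further regressor to be added

def residual_matrix(P):
    # R_k = I - P_k (P_k^T P_k)^{-1} P_k^T, Eq. (18)
    return np.eye(len(P)) - P @ np.linalg.inv(P.T @ P) @ P.T

R2 = residual_matrix(P2)
assert np.allclose(R2, R2.T)      # symmetry, Eq. (20)
assert np.allclose(R2 @ R2, R2)   # idempotency, Eq. (20)

# Rank-one update of Eq. (19) versus direct computation of R_3
R3_update = R2 - (R2 @ p3 @ p3.T @ R2.T) / (p3.T @ R2 @ p3)
R3_direct = residual_matrix(np.hstack([P2, p3]))
assert np.allclose(R3_update, R3_direct)
```

The update in Eq. (19) is what lets the algorithm grow the model one term at a time without ever re-inverting $M_k$ from scratch.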
Now, the residual vector can be given by

$$ e_k = y - P_k M_k^{-1} P_k^T y = R_k y \qquad (24) $$

where $e_k = [e_k(1), \ldots, e_k(t), \ldots, e_k(N)]^T$. The denominator in (16) is also the $t$th diagonal element of the residual matrix $R_k$. As a result, the LOO error in (17) can be rewritten as

$$ J_k = \frac{1}{N} \sum_{t=1}^{N} \frac{e_k^2(t)}{d_k^2(t)} \qquad (25) $$
where $d_k = \mathrm{diag}(R_k)$. According to (19), the above LOO error can be further simplified by defining an auxiliary matrix $A \in \mathbb{R}^{k \times M}$ and a vector $b \in \mathbb{R}^{M \times 1}$, with elements given by

$$ a_{i,j} \triangleq \begin{cases} (p_i^{(i-1)})^T p_j, & 1 \le j \le k \\ (p_i^{(i-1)})^T \phi_j, & k < j \le M \end{cases} \qquad (26) $$

$$ b_i \triangleq \begin{cases} (p_i^{(i-1)})^T y, & 1 \le i \le k \\ (\phi_i^{(k)})^T y, & k < i \le M \end{cases} \qquad (27) $$
Referring to the updating of the residual matrix in (19), the terms $a_{k,j}$, $b_k$, $d_k$ and $e_k$ can be computed recursively as follows:

$$ a_{k,j} = p_k^T \phi_j - \sum_{l=1}^{k-1} a_{l,k} a_{l,j} / a_{l,l}, \quad k = 1, \ldots, n, \ j = 1, \ldots, M \qquad (28) $$

$$ b_k = p_k^T y - \sum_{l=1}^{k-1} a_{l,k} b_l / a_{l,l}, \quad k = 1, \ldots, n \qquad (29) $$

$$ p_k^{(k-1)} = p_k^{(k-2)} - \frac{a_{k-1,k}}{a_{k-1,k-1}} p_{k-1}^{(k-2)} \qquad (30) $$

$$ d_k = d_{k-1} - (p_k^{(k-1)})^2 / a_{k,k} \qquad (31) $$

$$ e_k = e_{k-1} - \frac{b_k}{a_{k,k}} p_k^{(k-1)} \qquad (32) $$

where the square in (31) is taken element-wise.
The LOO error in (25) can now be calculated recursively using (31) and (32). In this forward construction stage, the significance of each model term is measured by its reduction in the LOO error. Thus, suppose at the $k$th step one more term from the candidate pool is to be selected. The new LOO error of the model, which includes the previously selected $k$ terms and the new candidate one, is computed from (25). The candidate that gives the minimum LOO error $J_{k+1}$ will be added to the model. Meanwhile, all the regressors in $\Phi$ are stored in their intermediate forms for the next selection. That is, if the $k$th term is added to the model, all previously selected terms are saved as $p_1^{(0)}, \ldots, p_k^{(k-1)}$, and all the remaining regressors in the candidate pool are saved as $\phi_i^{(k)}$, $i = k+1, \ldots, M$. The diagonal elements $a_{j,j}$ in $A$, and $b_j$ for $k+1 \le j \le M$, are also pre-calculated for use in the next selection:

$$ a_{j,j}^{(k+1)} = a_{j,j}^{(k)} - a_{k,j}^2 / a_{k,k} \qquad (33) $$

$$ b_j^{(k+1)} = b_j^{(k)} - a_{k,j} b_k / a_{k,k} \qquad (34) $$

This procedure continues until the LOO error starts to increase; that is, the forward selection stage is automatically terminated when

$$ J_n < J_{n+1} \qquad (35) $$

resulting in a compact model with $n$ terms.
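The first-stage loop can be sketched as follows (Python with NumPy; all names are ours). For clarity, this sketch recomputes the LOO error of each candidate model from scratch via the hat-matrix identity of Eq. (14), rather than using the paper's recursive updates (28)–(34); it is therefore slower than the algorithm described above but selects terms by the same criterion.

```python
import numpy as np

def loo_mse(P, y):
    """Mean-squared LOO error of the model spanned by the columns of P, Eq. (25)."""
    Minv = np.linalg.pinv(P.T @ P)
    e = y - P @ (Minv @ (P.T @ y))
    d = 1.0 - np.einsum('ti,ij,tj->t', P, Minv, P)  # diagonal of R_k
    return np.mean((e / d) ** 2)

def forward_select_loo(Phi, y):
    """Greedy forward selection that stops when the LOO error starts to rise."""
    M = Phi.shape[1]
    selected, remaining = [], list(range(M))
    best_J = np.mean(y ** 2)    # J_0: the model with no terms
    while remaining:
        # Score every remaining candidate by the LOO error it would yield
        scores = [(loo_mse(Phi[:, selected + [j]], y), j) for j in remaining]
        J, j = min(scores)
        if J >= best_J:         # LOO error stopped decreasing: terminate, Eq. (35)
            break
        best_J = J
        selected.append(j)
        remaining.remove(j)
    return selected, best_J
```

Because the stopping rule is the LOO error itself, no separate complexity-penalty parameter (as in AIC or FPE) needs to be tuned.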
3.2. Backward model refinement—second stage
This stage involves the elimination of insignificant terms retained due to the constraint introduced in forward construction. Noting that the last selected term in forward construction is always more significant than those remaining in the candidate pool, the backward model refinement can be divided into three main procedures. Firstly, a selected term $p_k$, $k = 1, \ldots, n-1$, is shifted to the $n$th position as if it were the last selected one. Secondly, the LOO error of each candidate term is recalculated and compared with that of the shifted term; if the LOO error of a selected term is larger than that of a term from the candidate pool, it is replaced, leading to the required improvement in model generalization capability. This review is repeated until no insignificant term remains in the selected model. Finally, all the selected terms are used to form a new candidate pool, and the forward stepwise construction is implemented again, possibly further reducing the model size.
3.2.1. Re-ordering of regressor terms
Suppose a selected model term $p_k$ is to be moved to the $n$th position in the regression matrix $P_n$. This can be achieved by repeatedly interchanging two adjacent regressors so that

$$ p_q^* = p_{q+1}, \quad p_{q+1}^* = p_q, \quad q = k, \ldots, n-1 \qquad (36) $$

By noting the property in (23), it is clear that only $R_q$ in the residual matrix series is changed at each step. This is updated using

$$ R_q^* = R_{q-1} - \frac{R_{q-1} p_q^* (p_q^*)^T R_{q-1}^T}{(p_q^*)^T R_{q-1} p_q^*} \qquad (37) $$

Meanwhile, the following terms also need to be updated:

• In matrix $A$, only the upper triangular elements $a_{i,j}$, $i \le j$, are used for model term selection. The $q$th and $(q+1)$th columns with elements from row 1 to $q-1$ need to be modified ($i = 1, \ldots, q-1$):

$$ a_{i,q}^* = (p_i^{(i-1)})^T p_{q+1} = a_{i,q+1}, \quad a_{i,q+1}^* = (p_i^{(i-1)})^T p_q = a_{i,q} \qquad (38) $$

The elements of the $q$th row $a_{q,j}$ from column $q$ to column $M$ ($j = q, \ldots, M$) are also changed using

$$ a_{q,j}^* = \begin{cases} a_{q+1,q+1} + a_{q,q+1}^2 / a_{q,q}, & j = q \\ a_{q,q+1}, & j = q+1 \\ a_{q+1,j} + a_{q,q+1} a_{q,j} / a_{q,q}, & j \ge q+2 \end{cases} \qquad (39) $$

and the elements of the $(q+1)$th row $a_{q+1,j}$ ($j = q+1, \ldots, M$) are changed by

$$ a_{q+1,j}^* = \begin{cases} a_{q,q} - a_{q,q+1}^2 / a_{q,q}^*, & j = q+1 \\ a_{q,j} - a_{q,q+1} a_{q,j}^* / a_{q,q}^*, & j \ge q+2 \end{cases} \qquad (40) $$

• For the vector $b$, only the $q$th and $(q+1)$th elements are changed. Thus

$$ b_q^* = b_{q+1} + a_{q,q+1} b_q / a_{q,q} \qquad (41) $$

$$ b_{q+1}^* = b_q - a_{q,q+1} b_q^* / a_{q,q}^* \qquad (42) $$

• Finally, $p_q^{(q-1)}$ and $p_{q+1}^{(q)}$ are updated using

$$ (p_q^*)^{(q-1)} = p_{q+1}^{(q)} + \frac{a_{q,q+1}}{a_{q,q}} p_q^{(q-1)} \qquad (43) $$

$$ (p_{q+1}^*)^{(q)} = p_q^{(q-1)} - \frac{a_{q,q+1}}{a_{q,q}^*} (p_q^*)^{(q-1)} \qquad (44) $$

This procedure is continued until the $k$th regressor term has been shifted to the $n$th position, the new regression matrix and residual matrix series becoming

$$ P_n^* = [p_1^*, \ldots, p_{k-1}^*, p_k^*, \ldots, p_{n-1}^*, p_n^*] = [p_1, \ldots, p_{k-1}, p_{k+1}, \ldots, p_n, p_k] \qquad (45) $$

$$ \{R_k^*\} = [R_1, \ldots, R_{k-1}, R_k^*, \ldots, R_n^*] \qquad (46) $$
3.2.2. LOO error comparison

[Fig. 1. LOO error at different stages (the forward selection stage is stopped at n, and the second stage is stopped at n′, n′ ≤ n).]

When the regressor term $p_k$ has been moved to the $n$th position in the full regression matrix $P_n$, its contribution to the reduction of the leave-one-out cross validation error needs to be
reviewed. The LOO error of the previously selected $n$ terms remains unchanged, while the new LOO errors of the regressor terms in the candidate pool must be recalculated. The corresponding terms for $n+1 \le j \le M$ to be changed are as follows:

$$ a_{j,j}^* = a_{j,j} + (a_{n,j}^*)^2 / a_{n,n}^* \qquad (47) $$

$$ b_j^* = b_j + b_n^* a_{n,j}^* / a_{n,n}^* \qquad (48) $$

$$ (\phi_j^{(n-1)})^* = \phi_j^{(n)} + \frac{a_{n,j}^*}{a_{n,n}^*} (p_n^{(n-1)})^* \qquad (49) $$

$$ d_j^* = d_n + \frac{1}{a_{n,n}^*} [(p_n^{(n-1)})^*]^2 - \frac{1}{a_{j,j}^*} [(\phi_j^{(n-1)})^*]^2 \qquad (50) $$

$$ e_j^* = e_n + \frac{b_n^*}{a_{n,n}^*} (p_n^{(n-1)})^* - \frac{b_j^*}{a_{j,j}^*} (\phi_j^{(n-1)})^* \qquad (51) $$
Suppose a regressor $\phi_m$ from the candidate term pool has a smaller LOO error than the model term of interest, that is $J_m^* < J_n^*$. In this case, $\phi_m$ will replace $p_n^*$ in the selected regression matrix $P_n$, and $p_n^*$ will be put back into the candidate term pool. Meanwhile, the following related terms are updated:
• The vectors $e_n$ and $d_n$ are changed to

$$ e_n^* = e_m, \quad d_n^* = d_m \qquad (52) $$

• In the matrix $A$,

$$ a_{i,n}^* = a_{i,m}, \quad a_{i,m}^* = a_{i,n} \quad (i = 1, \ldots, n-1) \qquad (53) $$

$$ a_{n,j}^* = \begin{cases} a_{m,m}, & j = n \\ a_{n,m}, & j = m \\ \phi_m^T \phi_j - \sum_{l=1}^{n-1} a_{l,m} a_{l,j} / a_{l,l}, & \text{otherwise} \end{cases} \qquad (54) $$

$$ (a_{j,j}^*)^{(n+1)} = \begin{cases} a_{n,n} - (a_{n,m}^*)^2 / a_{n,n}^*, & j = m \\ a_{j,j}^* - (a_{n,j}^*)^2 / a_{n,n}^*, & j \ne m \end{cases} \qquad (55) $$

• In the vector $b$,

$$ b_n^* = b_m \qquad (56) $$

$$ (b_j^*)^{(n+1)} = \begin{cases} b_n - a_{n,m}^* b_n^* / a_{n,n}^*, & j = m \\ b_j - a_{n,j}^* b_n^* / a_{n,n}^*, & j \ne m \end{cases} \qquad (57) $$

• Finally, $p_n^{(n-1)}$ and $\phi_j^{(n)}$ for $n < j \le M$ are updated according to

$$ (p_n^{(n-1)})^* = \phi_m^{(n-1)} \qquad (58) $$

$$ (\phi_j^{(n)})^* = \phi_j^{(n-1)} - \frac{a_{n,j}^*}{a_{n,n}^*} (p_n^{(n-1)})^* \qquad (59) $$
3.2.3. Model refinement
The above two procedures in Sections 3.2.1 and 3.2.2 are repeated until there are no insignificant centres remaining in the full regression matrix $P_n$. As described in the forward selection stage, the reduction in LOO error involves a constrained minimization, since the selection of additional model terms depends on those previously chosen. This is also true in determining the stopping point at this step: the total number of RBF centres obtained from the first stage is not optimal. Fig. 1 illustrates $n$ centres being selected in the first stage, with a further reduction of the LOO error achieved during the second stage at the same model size $n$. However, $n$ is not the optimal model size at the second stage, since $n'$ is in fact the optimal number of centres. The third procedure therefore re-orders the selected centres by their contributions to the network. This is done by putting all $n$ selected hidden nodes into a smaller term pool and applying forward selection again. This process will either be automatically terminated at the point $n'$, or when all $n$ terms have been re-selected.
3.3. Algorithm
The proposed algorithm for selecting a sub-model can now be summarized as follows:

Step 1. Initialization: Construct the candidate regression matrix based on the ELM, and let the model size $k = 0$. Then assign the initial values of the following terms ($j = 1, \ldots, M$):
• $J_0 = \frac{1}{N}\sum_{t=1}^{N} y(t)^2$;
• $d_0 = [1, \ldots, 1]^T \in \mathbb{R}^{N \times 1}$; $e_0 = y$;
• $\phi_j^{(0)} = \phi_j$; $a_{j,j}^{(1)} = \phi_j^T \phi_j$; $b_j^{(1)} = \phi_j^T y$.

Step 2. Forward selection:
(a) At the $k$th step ($1 \le k \le M$), use (31), (32), and (25) to calculate $d_j$, $e_j$ and the corresponding LOO error $J_k$ for each candidate term.
(b) Find the candidate regressor that gives the minimal LOO error, and add it to the regression matrix $P$. Then compute $a_{k,j}$ and pre-calculate $\phi_j^{(k)}$, $a_{j,j}^{(k+1)}$, and $b_j^{(k+1)}$ for $j = k+1, \ldots, M$.
(c) If the LOO error satisfies $J_{k-1} > J_k$, set $k = k+1$ and go back to step (a). Otherwise, go to Step 3.

Step 3. Backward model refinement:
(a) Interchange the position of $p_k$ with $p_{k+1}$ ($k = n-1, \ldots, 1$), and update the related terms using (38)–(44).
(b) Continue the above step until the regressor $p_k$ has been moved to the $n$th position.
(c) Update $a_{j,j}$, $b_j$, $\phi_j^{(n-1)}$, $d_j$, $e_j$ for each candidate regressor using (47)–(51), and compute the new LOO errors.
(d) If the LOO error $J_m^*$ of a candidate term is less than $J_n^*$, replace $p_n^*$ with $\phi_m$, and put $p_n^*$ back into the candidate term pool. Update the related terms according to (52)–(59).
(e) If $k > 1$, set $k = k-1$ and go to step 3(a).
(f) If one or more regressor terms were changed in the last review, set $k = n-1$ and repeat steps 3(a)–(e) to review all the terms again. Otherwise, the procedure is terminated.
Step 4. Final forward selection: Put all $n$ selected regressor terms together to form a small candidate term pool, and apply the forward selection procedure again. This selection process automatically terminates at $n'$, $n' \le n$.

3.4. Computational complexity
The computation in two-stage selection with leave-one-out cross validation is mainly dominated by the first forward selection stage. It will be shown that the proposed method is nevertheless more efficient than orthogonal least squares (OLS). Specifically, suppose there are initially $M$ candidate regressors in the selection pool, and only $n$ terms are to be included in the final model. With $N$ data samples used for training, the computational complexities of OLS using standard Gram–Schmidt orthogonalisation, the fast recursive algorithm (FRA), the two-stage selection, and their combinations with cross validation are now reviewed.
The computational complexity is measured by the total number of basic arithmetic operations (addition/subtraction and multiplication/division). For OLS with leave-one-out cross validation, the total computation is given by

$$ C(\mathrm{OLS}) \approx NM(2n^2 + 11n) - n(n-1)(11N + M - 1)/2 - n(n-1)(2n-1)(4N-1)/6 + 2N - 2M \qquad (60) $$

For the first stage of the proposed algorithm, the computation is the same as for FRA with LOO, and is given by

$$ C(\mathrm{FRA}) \approx NM(13n + 4) - Nn(13n - 9) - n(7n + 3) + 7Mn - 2M - 2N \qquad (61) $$

The computation for the second refinement stage includes the shifting of each selected term to the last position, the new LOO error comparison, and the exchange of a centre of interest with a candidate pool term. In the extreme case, all the terms of interest are insignificant compared to a candidate regressor. The total computation involved in one checking loop is then

$$ C(\mathrm{2nd}) \approx 4N(M - n) + M(3n^2 - n + 6) + 2N(n^2 - 1) - n^3 + 5n^2 - 14n \qquad (62) $$
Generally, most of the shifted terms are more significant than the candidate ones, so the actual computation for the second stage is much less than (62), and the number of checking loops repeated is normally less than 5. In practice $n \ll N$ and $n \ll M$, so the computational effort mainly comes from the first term in the above equations. Table 1 compares these algorithms; it shows that the new method described here needs about half of the computation of OLS with LOO cross validation.

Table 1. Comparison of computational complexity (five checking loops are used at the second stage; N is the number of samples, M is the size of the initial term pool, and n represents the final model size).

Algorithm     Computation
OLS           2NM(n^2 + 2n)
OLS+LOO       NM(2n^2 + 11n)
FRA           2NMn
FRA+LOO       13NMn
Two-stage     NM(4n + 15)/2
New           NM(13n + 20)
4. Simulation example
This section presents simulation results for nonlinear system modelling using RBF neural networks, which have the linear-in-the-parameters structure. The following three methods were applied and assessed: (1) the OLS method, in which the width of the activation function is chosen a priori, the data samples are used as the centres of the candidate RBF nodes, OLS is used to select the most significant terms, and the AIC (Akaike's information criterion) is adopted to control the model complexity; (2) two-stage with LOO, with the experimental conditions set as above, where the two-stage method with LOO is applied to build the model automatically; (3) the proposed method, in which the widths and centres of the RBF nodes are randomly selected instead of being set a priori, and the proposed method is then used for automatic model construction.
Example 1: Consider a nonlinear system to be approximated by an RBF neural model [19], defined by

$$ y(x) = 0.1x + \frac{\sin x}{x} + \sin(0.5x), \quad -10 \le x \le 10 \qquad (63) $$

A total of 400 data samples were generated with $x$ uniformly distributed in the range $[-10, 10]$. Gaussian white noise with zero mean and variance $0.1^2$ was added to the first 200 data samples. The RBF model employed the Gaussian kernel $\varphi(x, c_i) = \exp(-\|x - c_i\|^2/\sigma^2)$ as the basis function. In OLS, the width of the Gaussian basis function was chosen as $\sigma = 4$. For the method proposed here, the support function width was chosen randomly in the interval [0, 2] and the hidden nodes were randomly selected from the data samples. With the first 200 noisy samples used for training and the remaining 200 noise-free data reserved for testing, the conventional OLS algorithm selected 11 centres at the point where the AIC value started to increase. The two-stage method with leave-one-out cross validation constructed a more compact network with eight RBF centres under both the conventional candidate pool and the ELM. The root mean-squared errors (RMSE) [19] over the training and testing data sets are given in Table 2. The RMSE is defined as
$$ \mathrm{RMSE} = \sqrt{\frac{\mathrm{SSE}}{N}} = \sqrt{\frac{(y - \hat{y})^T (y - \hat{y})}{N}} \qquad (64) $$
where SSE is the sum-squared error, $\hat{y}$ is the RBF neural network prediction, and $N$ is the number of samples. As shown in Table 2, it is clear that:

• OLS produced an over-fitted network with a smaller training error and a larger testing error.
• Two-stage with LOO gave a better RBF network with a slightly increased training error and a smaller testing error.
• By using the extreme learning machine, the new method constructed a more compact RBF model than OLS.
Though the proposed method did not produce a better model than the other two methods in terms of testing error, the main
Table 2. Comparison of model performance in experiment 1 (RMSE).

Algorithm       Network size   Training error   Testing error
OLS             11             0.1006           0.0326
Two-stage+LOO   8              0.1011           0.0201
New             8              0.1004           0.0337
Table 3. Evolution of term selection in the model refining procedure in experiment 1.

Loop   Selected terms                       Training error (RMSE)   LOO error
0      [24,163,100,55,150,146,175,120]      0.1056                  0.0127
1      [24,163,100,55,150,146,120,5]        0.1060                  0.0123
2      [24,163,100,55,150,120,5,149]        0.1061                  0.0123
3      [24,163,100,55,120,5,149,121]        0.1031                  0.0116
4      [24,120,5,149,121,55,100,96]         0.1026                  0.0115
5      [24,96,100,55,121,149,5,110]         0.1018                  0.0113
6      [24,96,100,55,121,110,5,146]         0.1005                  0.0110
7      [24,96,100,55,110,5,146,16]          0.1004                  0.0110
Table 4. Comparison of model performance in experiment 2 (RMSE).

Algorithm       Network size   Training error   Testing error
OLS             26             0.0913           0.0735
Two-stage+LOO   13             0.1020           0.0692
New             13             0.0999           0.0572
Table 5
Evolution of term selection in the model refining procedure in experiment 2.

Loop   Number of changed terms   Training error (RMSE)   LOO error
0      0                         0.1053                  0.0129
1      2                         0.1060                  0.0127
⋮
11     8                         0.0999                  0.0113
advantage is that the network can be rapidly constructed without the need to test different RBF widths. Table 3 shows the refinement procedure at the second stage when using the extreme learning machine, where seven check loops were performed to minimize the LOO error. A total of four RBF nodes were replaced at the second network refinement stage, and the RMSE value was reduced without increasing the model size.
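The LOO error that drives both the stopping rule and this refinement stage need not be computed by refitting the model N times. For any linear-in-the-parameters model it is available in closed form from the PRESS residuals; the following sketch uses that standard hat-matrix identity rather than the paper's recursive implementation, and the function name is ours:

```python
import numpy as np

def loo_rmse(P, y):
    """LOO (PRESS) error for a linear-in-the-parameters model y ≈ P @ theta.

    For least squares, the i-th leave-one-out residual equals
    e_i / (1 - h_ii), where h_ii is the i-th diagonal entry of the
    hat matrix H = P (P^T P)^{-1} P^T, so no refitting is required.
    """
    theta, *_ = np.linalg.lstsq(P, y, rcond=None)
    e = y - P @ theta
    h = np.diag(P @ np.linalg.solve(P.T @ P, P.T))
    return float(np.sqrt(np.mean((e / (1.0 - h)) ** 2)))
```

In the refinement loop, a candidate swap of an RBF node would be kept only if it lowers this value, which is consistent with the monotonically decreasing LOO errors in Table 3.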
Example 2: The proposed algorithm was then applied to approximate a nonlinear dynamic system given by [13]
z(t) = \frac{z(t-1)z(t-2)z(t-3)u(t-2)[z(t-3)-1] + u(t-1)}{1 + z^{2}(t-2) + z^{2}(t-3)}    (65)
where u(t) is the random system input, uniformly distributed in the range [−1, 1]. Again, a total of 400 data samples were generated, with the first 200 noisy data samples used for training and the remaining 200 noise-free data samples for testing. The noise series had a Gaussian distribution with zero mean and a variance of 0.1². The activation function had the same form as in experiment 1, with σ = 1.096 for the conventional network construction scheme. The input vector was x(t) = [y(t−1), y(t−2), y(t−3), u(t−1), u(t−2)]^T. In this case, the OLS method selected 26 RBF nodes while the proposed algorithm chose 13 nodes. As Table 4 shows, OLS again produced an over-fitted network with a smaller training error and a larger testing error. By contrast, the proposed method gave a compact network model with better generalization performance. Table 5 illustrates the reduction of the LOO error in each refining loop of the two-stage algorithm under the ELM scheme.
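The data generation for this benchmark can be sketched directly from Eq. (65). In the sketch below, the seed, the zero initial conditions, and the function name are our assumptions, and the training half of the data would additionally be corrupted by the N(0, 0.1²) noise described above:

```python
import numpy as np

def simulate_benchmark(n=400, seed=0):
    """Simulate the system of Eq. (65) driven by u(t) ~ U[-1, 1]."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(-1.0, 1.0, size=n)
    z = np.zeros(n)  # zero initial conditions assumed
    for t in range(3, n):
        num = z[t-1] * z[t-2] * z[t-3] * u[t-2] * (z[t-3] - 1.0) + u[t-1]
        z[t] = num / (1.0 + z[t-2] ** 2 + z[t-3] ** 2)
    return u, z
```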
5. Concluding summary
In this paper, a previously proposed two-stage stepwise identification algorithm [1] is combined with leave-one-out cross validation to improve the compactness of models constructed by the extreme learning machine. At the first stage, the selection procedure is automatically terminated based on the LOO error. At the second stage, the contribution of each selected model term is reviewed, and insignificant ones are replaced. This procedure repeats until no insignificant RBF nodes remain in the final model. The forward selection step is then implemented again to re-order the selected terms, producing a potential further reduction in the model size. Results from two simulation examples show that the proposed improved extreme learning machine scheme can efficiently produce a nonlinear model with satisfactory performance. Integrating other techniques, such as Bayesian regularization, to build an even more compact model is regarded as future work.
Acknowledgment
Jing Deng wishes to thank Queen's University Belfast for the award of an ORS scholarship to support his doctoral studies. This work was partially supported by the UK EPSRC under grants EP/F021070/1 and EP/G042594/1.
References
[1] K. Li, J.X. Peng, E.W. Bai, A two-stage algorithm for identification of nonlinear dynamic systems, Automatica 42 (7) (2006) 1189–1197.
[2] O. Nelles, Nonlinear System Identification, Springer, 2001.
[3] G. Huang, Q. Zhu, C. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (2006) 489–501.
[4] G. Huang, L. Chen, C. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Transactions on Neural Networks 17 (4) (2006) 879–892.
[5] G. Huang, L. Chen, Convex incremental extreme learning machine, Neurocomputing 70 (16–18) (2007) 3056–3062.
[6] A. Miller, Subset Selection in Regression, CRC Press, 2002.
[7] X. Hong, R.J. Mitchell, S. Chen, C.J. Harris, K. Li, G.W. Irwin, Model selection approaches for non-linear system identification: a review, International Journal of Systems Science 39 (10) (2008) 925–946.
[8] S. Chen, S.A. Billings, W. Luo, Orthogonal least squares methods and their application to non-linear system identification, International Journal of Control 50 (5) (1989) 1873–1896.
[9] S. Chen, C.F.N. Cowan, P.M. Grant, Orthogonal least squares learning algorithm for radial basis function networks, IEEE Transactions on Neural Networks 2 (2) (1991) 302–309.
[10] K. Li, J.X. Peng, G.W. Irwin, A fast nonlinear model identification method, IEEE Transactions on Automatic Control 50 (8) (2005) 1211–1216.
[11] S. Chen, E.S. Chng, K. Alkadhimi, Regularized orthogonal least squares algorithm for constructing radial basis function networks, International Journal of Control 64 (5) (1996) 829–837.
[12] D.J. Du, K. Li, M.R. Fei, G.W. Irwin, A new locally regularised recursive method for the selection of RBF neural network centres, in: R. Walter (Ed.), Proceedings of the 15th IFAC Symposium on System Identification, SYSID 2009, Saint-Malo, France, 2009.
[13] X. Hong, P.M. Sharkey, K. Warwick, Automatic nonlinear predictive model construction algorithm using forward regression and the PRESS statistic, IEE Proceedings: Control Theory and Applications 150 (3) (2003) 245–254.
[14] A. Sherstinsky, R.W. Picard, On the efficiency of the orthogonal least squares training method for radial basis function networks, IEEE Transactions on Neural Networks 7 (1) (1996) 195–200.
[15] K.Z. Mao, S.A. Billings, Algorithms for minimal model structure detection in nonlinear dynamic system identification, International Journal of Control 68 (2) (1997) 311–330.
[16] K. Li, G. Huang, S. Ge, A Fast Construction Algorithm for Feedforward Neural Networks, Springer-Verlag, 2010.
[17] Y. Lan, Y. Soh, G. Huang, Two-stage extreme learning machine for regression, Neurocomputing 73 (16–18) (2010) 3028–3038.
[18] L. Ljung, System Identification: Theory for the User, Prentice-Hall, Englewood Cliffs, NJ, 1987.
[19] J.X. Peng, K. Li, D.S. Huang, A hybrid forward algorithm for RBF neural network construction, IEEE Transactions on Neural Networks 17 (6) (2006) 1439–1451.
Jing Deng received his B.Sc. degrees from the National University of Defence Technology, China, in 2001 and 2005, and his M.Sc. degrees from Shanghai University, China, in 2005 and 2007. In March 2008 he started his Ph.D. research in the Intelligent Systems and Control (ISAC) Group at Queen's University Belfast, UK. His main research interests include system modelling, pattern recognition and fault detection.
Kang Li is a Reader in Intelligent Systems and Control at Queen's University Belfast. His research interests include advanced algorithms for the training and construction of neural networks, fuzzy systems and support vector machines, as well as advanced evolutionary algorithms, with applications to nonlinear system modelling and control, microarray data analysis, systems biology, environmental modelling and monitoring, and polymer extrusion. He has produced over 150 research papers and co-edited seven conference proceedings in his field. He is a Chartered Engineer, a member of the IEEE and the InstMC, and the current Secretary of the IEEE UK and Republic of Ireland Section.

George W. Irwin leads the Intelligent Systems and Control Research group and is Director of the University Virtual Engineering Centre at Queen's University Belfast. He has been elected a Fellow of the Royal Academy of Engineering and a Member of the Royal Irish Academy, and is a Chartered Engineer, an IEEE Fellow, a Fellow of the IEE and a Fellow of the Institute of Measurement and Control. His research covers identification, monitoring and control, including neural networks, fuzzy neural systems and multivariate statistics, and he has published over 350 research papers and 12 edited books. He is currently working on wireless networked control systems, fault diagnosis of internal combustion engines and novel techniques for fast temperature measurement, and was a Technical Director of Anex6 Ltd., a spin-out company from his group specialising in process monitoring. He has been awarded a number of prizes, including four IEE Premiums, a Best Paper award from the Czech Academy of Sciences and the 2002 Honeywell International Medal from the Institute of Measurement and Control. International recognitions include Honorary Professor at Harbin Institute of Technology and Shandong University, and Visiting Professor at Shanghai University. He is a former Editor-in-Chief of the IFAC journal Control Engineering Practice and past chair of the UK Automatic Control Council. He currently chairs the IFAC Publications Committee and serves on the editorial boards of several journals.