

IDENTIFICATION WITH FINITELY MANY DATA POINTS: THE LSCR APPROACH

Marco C. Campi ¹, Erik Weyer ²

Dip. di Elettronica per l'Automazione, Università di Brescia, Via Branze 38, 25123 Brescia, Italy, [email protected]

Dept. of Electrical and Electronic Engineering, The University of Melbourne, Parkville, VIC 3010, Australia, [email protected]

14th IFAC Symposium on System Identification, Newcastle, Australia, 2006

Abstract: This paper gives an overview of LSCR (Leave-out Sign-dominant Correlation Regions), a general technique for system identification. Under normal conditions, observations contain information corrupted by disturbances and measurement noise so that only an approximate description of the underlying system can at best be obtained from a finite data set. This is similar to describing an object seen through a frosted glass. Differently from standard identification methods that deliver single models, LSCR generates a model set. As information increases, the model set shrinks around the true system and, for any finite sample size, the set is guaranteed to contain the true system with a precise probability chosen by the user. LSCR only assumes a minimum amount of prior information on the noise.

Keywords: System identification, non-asymptotic results, model set, correlation.

1. INTRODUCTION: FROM NOMINAL-MODEL TO MODEL-SET IDENTIFICATION

System identification is the science of deriving a model from data.

In practical applications, the number of data points an identification procedure can rely on is always finite and, at times, scarce. Nevertheless, most results in identification are of asymptotic nature, that is, they tell us what happens when the number of data points tends to infinity. While these results are useful in indicating fundamental qualitative links between the identification set-up and the expected properties of the identified model, when it comes to quantitative assessments they necessarily have to be used on an approximate basis. Moreover, contributions have appeared showing that the asymptotic theory can produce misleading quantitative results in some system identification endeavors, Garatti et al. (2004, 2006).

In this paper, an attempt to bridge this gap is made: We here introduce, review and expand the LSCR (Leave-out Sign-dominant Correlation Regions) approach. LSCR is a system identification methodology that provides rigorous results for any finite data set, no matter how small.

¹ Supported by MIUR (Ministero dell'Istruzione, dell'Università e della Ricerca) under the project "New methods for Identification and Adaptive Control for Industrial Systems".
² Supported by the Australian Research Council.


1.1 The need for something more than a nominal model

Suppose we want to identify a system S by selecting a model in a model class 𝓜. Even when S belongs to the model class, we cannot expect that the model M identified from a finite data sample coincides with S, due to a number of accidents affecting our data: measurement noise, presence of disturbances acting on the system, etc.

Example 1. Consider the system

    y_t = θ⁰ u_t + w_t,    (1)

where u_t is the input, y_t the output and w_t is an unmeasured disturbance. Let u_t = 1 for all t and identify θ by means of least squares:

    θ̂ = ( Σ_{t=1}^N u_t² )⁻¹ Σ_{t=1}^N u_t y_t = (1/N) Σ_{t=1}^N y_t,

where N is the number of data points. The estimation error

    θ̂ − θ⁰ = (1/N) Σ_{t=1}^N y_t − θ⁰ = (1/N) Σ_{t=1}^N (θ⁰ + w_t) − θ⁰ = (1/N) Σ_{t=1}^N w_t

does converge to zero as N → ∞ under natural assumptions on w_t. However, for any finite N, (1/N) Σ_{t=1}^N w_t cannot be expected to be zero and a system-model mismatch is present. □
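To make the computation concrete, here is a minimal simulation sketch of Example 1. The sample size, the value θ⁰ = 2, the Gaussian noise and the random seed are illustrative assumptions made here, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50                      # number of data points (illustrative choice)
theta0 = 2.0                # hypothetical true parameter
u = np.ones(N)              # constant input u_t = 1
w = rng.normal(size=N)      # disturbance (Gaussian here only for illustration)
y = theta0 * u + w          # system y_t = theta0*u_t + w_t

# least-squares estimate: (sum u_t^2)^{-1} * sum u_t y_t, i.e. the mean of y_t when u_t = 1
theta_hat = (u @ y) / (u @ u)

# the estimation error equals the empirical mean of the disturbance
print(theta_hat - theta0, w.mean())   # the two numbers coincide (up to rounding)
```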

If a probabilistic description of the uncertain elements is adopted, under general circumstances the only probabilistic claim we are in a position to make is that

    Pr{M = S} = 0,

clearly a useless statement if our intention is that of crediting the model with reliability. Thus, when an identification procedure only returns a nominal model, we certainly cannot trust it to coincide with the true system and, when this model is used for any purpose, it is done in the hope that the system-model mismatch does not affect the final result too badly. While this way of proceeding is practical, it is not grounded on a solid theoretical basis.

A scientific use of an identified nominal model requires instead that this model be complemented with additional information, information able to certify the model accuracy.

1.2 Model-set identification

A way to obtain certified results is to move from nominal-model to model-set identification. An example explains the idea.

Example 2. (continuation of Example 1). In system (1), suppose that w_t is white and Gaussian with zero mean and unit variance. Then, θ̂ − θ⁰ = (1/N) Σ_{t=1}^N w_t is a Gaussian random variable with variance equal to 1/N and we can easily attach a probability to the event {|θ̂ − θ⁰| ≤ ε} for any given finite N by looking up a Gaussian table. The practical use of this result is that, after estimating θ̂ from data, an interval [θ̂ − ε, θ̂ + ε] can be computed having guaranteed probability to contain the true parameter value. □
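Since θ̂ − θ⁰ is zero-mean Gaussian with variance 1/N, the Gaussian-table lookup can be written explicitly in terms of the standard normal distribution function Φ; a worked instance (with illustrative numbers) is:

```latex
\Pr\{|\hat\theta - \theta^0| \le \epsilon\}
  = \Pr\{\sqrt{N}\,|\hat\theta - \theta^0| \le \sqrt{N}\,\epsilon\}
  = 2\Phi(\sqrt{N}\,\epsilon) - 1,
\qquad \text{e.g. } N = 100,\ \epsilon = 0.2:\ 2\Phi(2) - 1 \approx 0.954 .
```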

In more general terms, the situation can be depicted as in Figure 1: Given any finite N, the parameter estimate is affected by random fluctuation, so that it has probability zero to exactly hit the system parameter value. Considering a region around θ̂, however, elevates the probability that θ⁰ belongs to the region to a nonzero, and therefore meaningful, value.

Fig. 1. Random fluctuation in parameter estimate.

This observation points to a simple, but important fact:

    exploiting the information conveyed by a finite number of data points can in standard situations at best generate a model set to which the system belongs. Any further attempt to provide sharper estimation results (e.g. one single model) goes beyond the available information level and generates results that cannot be guaranteed.

As a consequence, considering identification procedures returning parameter sets, as opposed to single parameter values, is a sensible approach in the attempt to provide guaranteed results.

Having made this conceptual point, it is also important to observe that this in no way denies the importance of nominal models: Nominal models do play an important practical role in many areas such as simulation, prediction and control. Thus,


constructing nominal models is often a significant step towards finding a solution of a problem. Yet, in order to judge the quality of the solution one should also account for uncertainty, as given by a model uncertainty set.

1.3 The role of a-priori information

Any system identification process is based on two sources of information:

(i) a-priori information, that is, we a-priori know that S belongs to a system set 𝓢;
(ii) a-posteriori information, that is, data collected from the system.

Without any a-priori information we can hardly generate any guaranteed result. In Example 2, we assumed quite a bit of a-priori information: The system structure was known; and the fact that the noise was Gaussian with variance 1 was exploited in the construction of the confidence region.

Having said that a-priori information cannot be totally renounced, it is also important to point out that a good theory should demand as little prior as possible. Indeed:

• stringent prior conditions reduce the theory's applicability;
• in real applications, stringent prior conditions may be difficult to verify even when they are actually satisfied.

Going back to Example 2, let us ask the following question: Can we reduce prior knowledge and still provide guaranteed results? Some answer is provided in the following example continuation.

Example 3. (continuation of Example 2). Suppose now that the variance of w_t is σ² and that it is unknown and no upper bound on its value is available. We still want to quantify the probability of the event {|θ̂ − θ⁰| ≤ ε} and we also want this probability to be guaranteed for any situation allowed by our a-priori knowledge, that is, for any value of σ².

Since θ̂ − θ⁰ is Gaussian with variance σ²/N, for any given ε > 0 (even very large), inf_{σ²} Pr{|θ̂ − θ⁰| ≤ ε} = 0, so that the only statement valid for all possible σ² is

    Pr{|θ̂ − θ⁰| ≤ ε} ≥ 0,

which is evidently a void statement.

A natural question is: Is the situation hopeless, so that we have to give up on finding guaranteed results, or can we attempt some other approach? To answer, let us try to put the problem under a different light: A well known fact from statistics is that

    (θ̂ − θ⁰) / √( (1/(N(N−1))) Σ_{t=1}^N (y_t − θ̂)² )

has a Student t-distribution (Richmond (1964)) with N − 1 degrees of freedom, independently of the value of σ². Thus, given a δ, using tables of the Student t-distribution one can determine an ε such that

    Pr{ |θ̂ − θ⁰| / √( (1/(N(N−1))) Σ_{t=1}^N (y_t − θ̂)² ) ≤ ε } = δ,

where the important fact is that the result holds no matter what σ² is. Hence:

    Pr{ |θ̂ − θ⁰| ≤ ε̄(y) } = δ    (2)

with

    ε̄(y) = ε √( (1/(N(N−1))) Σ_{t=1}^N (y_t − θ̂)² ).

The latter is the desired accuracy evaluation result: One selects an ε and computes the corresponding accuracy variable ε̄(y) using data. The table of the Student t-distribution is then used to find the probability of the confidence interval. □
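A minimal computational sketch of this construction is given below (assuming numpy and scipy are available; the sample size, the noise level and the confidence level are illustrative assumptions made here).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, theta0, sigma = 20, 2.0, 3.0          # illustrative values; sigma is unknown to the procedure
y = theta0 + sigma * rng.normal(size=N)  # y_t = theta0 + w_t with u_t = 1

theta_hat = y.mean()
delta = 0.95                                       # desired confidence
eps = stats.t.ppf((1 + delta) / 2, df=N - 1)       # quantile of Student t with N-1 degrees of freedom
eps_bar = eps * np.sqrt(np.sum((y - theta_hat) ** 2) / (N * (N - 1)))

# [theta_hat - eps_bar, theta_hat + eps_bar] contains theta0 with probability delta,
# whatever the (unknown) noise variance is
print(theta_hat - eps_bar, theta_hat + eps_bar)
```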

    Two remarks on Example 3 are in order:

(1) If σ² is unknown, we have seen that no accuracy result, save the void one, can be established for a fixed deterministic ε. Yet, by allowing ε to be data dependent (i.e. by substituting ε with ε̄(y)) a meaningful conclusion like (2) can be derived. This fact tells us that we need to let the data speak and the uncertainty set size has to depend on observed data.

(2) The random variable (θ̂ − θ⁰)/√((1/(N(N−1))) Σ_{t=1}^N (y_t − θ̂)²) is a function of the data, and the distribution of the data themselves depends on the noise variance σ². Despite this, the distribution of (θ̂ − θ⁰)/√((1/(N(N−1))) Σ_{t=1}^N (y_t − θ̂)²) is independent of σ². In the statistical literature, this is called a pivotal variable because its distribution does not depend on the variable elements of the problem. The existence of a pivotal variable is crucial in this example to establish a result that is guaranteed for any σ².

Unfortunately, finding pivotal variables is generally hard even for very simple examples:


Example 4. Consider now the autoregressive system

    y_t + θ⁰ y_{t-1} = w_t,

where again w_t is zero-mean Gaussian with unknown variance σ². Finding a pivotal variable for this situation is already a very hard problem. □

Thus, the approach outlined in Example 3 does not appear to be easy to generalize. Nevertheless, the concept of pivotal distribution, or, more generally, of pivotal result, has a general appeal and we shall come back to it at a later stage in this paper.

    1.4 Content of the present paper

In this paper, we provide an overview of LSCR as a methodology to address the above described problems of determining guaranteed model sets in system identification. LSCR delivers data-based confidence sets that contain the true parameter value with guaranteed probability for any finite data sample and it requires minimal knowledge on the noise affecting the system.

The main idea behind the LSCR method is to compute empirical correlation functions and to leave out those regions in parameter space where the correlation functions take on positive or negative values too many times. This principle, which is the reason for the name of the method, is based on the fact that for the true parameter value the correlation functions are sums of zero mean random variables and, therefore, it is unlikely that nearly all of them will be positive or nearly all of them will be negative.

Part I of the paper deals with the case where the system S belongs to the model class 𝓜.

In many cases, however, the structure of S is only partially known, and, even when S is known to belong to a certain class 𝓢, we at times deliberately look for a model in a restricted model class 𝓜 because 𝓢 is too large to work with. In these cases, asking for a probability that S ∈ 𝓜 loses any meaning and we should instead ask whether 𝓜 contains a suitable approximation or projection of S. Part II of this paper contains results for systems with unmodeled dynamics.

Further generalizations to a nonlinear set-up are given in Part III.

The presentation is mainly based on examples to help readability, while general results are only sketched.

It goes without saying that the perspective of this paper of reviewing LSCR is a matter of choice and other techniques exist to deal with finite sample identification. Without any claim or attempt of completeness, some of these techniques are briefly mentioned in the next section.

2. A SHORT LITERATURE REVIEW OF FINITE SAMPLE SYSTEM IDENTIFICATION

The pioneering work of Vapnik and Chervonenkis (1968, 1971) provided the foundations of what became the field of statistical learning theory, e.g. Vapnik (1998), Cherkassky and Mulier (1998), Vidyasagar (2002). By using exponential inequalities such as the Hoeffding inequality (Hoeffding (1963)) and combinatorial arguments, they derived rigorous uniform probabilistic bounds on the difference between expected values and empirical means computed using a finite number of data points. An underlying assumption was that the data points were independent and identically distributed, an assumption not often satisfied in a system identification setting. Since the 1990s, results have appeared which extend the uniform convergence results to dependent data sequences such as M-dependent sequences, and α-, β- and φ-mixing sequences, e.g. Yu (1994), Bosq (1998), Karandikar and Vidyasagar (2002). These ideas were applied to problems in identification and prediction, developing non-asymptotic bounds on the estimation and prediction accuracies. See e.g. Modha and Masry (1996, 1998), Campi and Kumar (1998), Goldenshluger (1998), Weyer et al. (1999), Weyer (2000), Meir (2000), Venkatesh and Dahleh (2001), Campi and Weyer (2002), Weyer and Campi (2002), Vidyasagar and Karandikar (2002, 2006), Bartlett (2003).

The finite sample results with roots in learning theory gave bounds on the difference between expected and empirical means, and the bounds depended on the number of data points, but not on the actually seen data. For this reason they could be quite conservative for the particular system at hand. A way around conservatism is to make active use of the data and construct the bounds using data coming from the system under investigation. Data based methods for evaluation of model quality using bootstrap and subsampling (e.g. Efron and Tibshirani (1993), Shao and Tu (1995), Politis (1998), Politis et al. (1999)) have been explored in Tjarnstrom and Ljung (2002), Bittanti and Lovera (2000) and Dunstan and Bitmead (2003). However, few truly rigorous finite sample results for model quality assessment are available for these techniques. Using subsampling techniques, Campi et al. (2004) obtained some guaranteed results for generalised FIR systems. See also Hjalmarsson and Ninness (2004) and Ninness and Hjalmarsson (2004) for non-asymptotic


variance expressions for frequency function estimates.

Data based finite sample results have also been derived in the context of model validation by Smith and Dullerud (1996), and in the set membership approach to system identification, e.g. Milanese and Vicino (1991), Vicino and Zappa (1996), Giarre et al. (1997a,b), Garulli et al. (2000, 2002), Milanese and Taragna (2005).

Along a different line of thought, finite sample properties of worst case identification in deterministic frameworks were studied in Dahleh et al. (1993, 1995), Poolla and Tikku (1994) and Harrison et al. (1996). Spall (1995) considered uncertainty bounds for M-estimators with small sample sizes. Non-parametric identification methods with a finite number of data points have been studied by Welsh and Goodwin (2002) (see also Heath (2001)), while Ding and Chen (2005) have studied the recursive least squares method in a finite sample context.

The LSCR method presented in this paper makes use of subsampling techniques, and in particular it extends the results of Hartigan (1969, 1970) to a dynamical setting. Loosely speaking, one could view LSCR as a stochastic set membership approach to system identification, where the setting we consider is the standard stochastic setting for system identification from e.g. Ljung (1999) or Soderstrom and Stoica (1988), but where the outcomes are sets of models as in set membership identification.

PART I: Known system structure

    3. LSCR: A PRELIMINARY EXAMPLE

We start with a preliminary example that readily provides some insight into the LSCR technique. More general results and comments are given in the next section.

Consider again the system of Example 4:

    y_t + θ⁰ y_{t-1} = w_t.    (3)

Assume we know that w_t is an independent process and that it has a symmetric distribution around zero. Apart from this, no knowledge on the noise is assumed: It can have any (unknown) distribution: Gaussian; uniform; flat with small-area spikes at high-value locations describing the chance of outliers; etc. Its variance can be any (unknown) number, from very small to very large. We do not even make any stationarity assumption on w_t and allow that its distribution varies with time. The assumption that w_t is independent can be interpreted by saying that we know the system structure: It is an autoregressive system of order 1.

9 data points were generated according to (3) and they are shown in Figure 2. The values of θ⁰ and w_t are for the moment not revealed to the reader (they will be disclosed later). Our goal is to form a confidence region for θ⁰ from the available data set.

    Fig. 2. Data for the preliminary example.

We next adopt a user's point of view and describe the procedure for solving this problem with LSCR. Later on, comments regarding the obtained results will be provided.

Rewrite the system as a model with generic parameter θ:

    y_t + θ y_{t-1} = w_t.

The predictor and prediction error associated with the model are

    ŷ_t(θ) = −θ y_{t-1},    ε_t(θ) = y_t − ŷ_t(θ) = y_t + θ y_{t-1}.

Next we compute the prediction errors ε_t(θ) for t = 1, . . . , 8, and calculate

    f_{t-1}(θ) = ε_{t-1}(θ) ε_t(θ),    t = 2, . . . , 8.

Note that the f_{t-1}(θ) are functions of θ that can indeed be computed from the available data set. Then, we take the average of some of these functions in many different ways. Precisely, we form 8 averages of the form:

    g_i(θ) = (1/4) Σ_{k∈I_i} f_k(θ),    i = 1, . . . , 8,    (4)

where the sets I_i are subsets of {1, . . . , 7} containing the elements highlighted by a bullet in the table below. For instance: I_1 = {1, 2, 4, 5}, I_2 = {1, 3, 4, 6}, etc. The last set, I_8, is an exceptional set: It is empty and we let g_8(θ) = 0. The functions g_i(θ), i = 1, . . . , 7, can be interpreted


    as empirical 1-step correlations of the predictionerror.

[Table: bullets marking, for each of the subsets I_1, . . . , I_8, which of the indices 1, . . . , 7 it contains.]

The functions g_i(θ), i = 1, . . . , 7, obtained for the data in Figure 2 are displayed in Figure 3.

Fig. 3. The g_i(θ) functions (empirical correlations).

Now, a simple reasoning leads us to conclude that these g_i(θ) functions have a tendency to intersect the θ-axis near θ⁰ and that, for θ = θ⁰, they take on positive or negative value with equal probability. Why is it so? Let us re-write one of these functions, say g_1(θ), as follows:

    g_1(θ) = (1/4) Σ_{k∈{1,2,4,5}} [y_k + θ y_{k-1}] [y_{k+1} + θ y_k]
           = (1/4) Σ_{k∈{1,2,4,5}} [(y_k + θ⁰ y_{k-1}) + (θ − θ⁰) y_{k-1}] [(y_{k+1} + θ⁰ y_k) + (θ − θ⁰) y_k]
           = (1/4) Σ_{k∈{1,2,4,5}} [w_k + (θ − θ⁰) y_{k-1}] [w_{k+1} + (θ − θ⁰) y_k]
           = (θ − θ⁰)² (1/4) Σ_{k∈{1,2,4,5}} y_{k-1} y_k
             + (θ − θ⁰) (1/4) Σ_{k∈{1,2,4,5}} w_k y_k
             + (θ − θ⁰) (1/4) Σ_{k∈{1,2,4,5}} y_{k-1} w_{k+1}
             + (1/4) Σ_{k∈{1,2,4,5}} w_k w_{k+1}.

If (1/4) Σ_{k∈{1,2,4,5}} w_k w_{k+1} = 0, the intersection with the θ-axis is obtained for θ = θ⁰. The vertical displacement (1/4) Σ_{k∈{1,2,4,5}} w_k w_{k+1} is a random variable with equal probability of being positive or negative; moreover, due to averaging, the vertical dispersion caused by noise is de-emphasized (it would be more de-emphasized with more data). So, we would like to claim that θ⁰ will be somewhere near where the average functions intersect the θ-axis. Moreover, following the above reasoning, we recognize that for θ = θ⁰ it is very unlikely that almost all the g_i(θ), i = 1, . . . , 7, functions have the same sign, and we therefore discard the rightmost and leftmost regions where at most one function is less than zero or greater than zero. The resulting interval [0.04, 0.48] is the confidence region for θ⁰.
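The whole construction of the preliminary example fits in a few lines of code. In the sketch below the data are regenerated from (3) with illustrative choices (the same θ⁰ and noise distribution that the text discloses later); only I_1 and I_2 are quoted in the text, so the remaining index sets are a hypothetical completion, chosen so that, together with the empty set, they are closed under symmetric difference.

```python
import numpy as np

rng = np.random.default_rng(2)
theta0 = 0.2
y = np.zeros(9)
for t in range(1, 9):
    # y_t + theta0*y_{t-1} = w_t, with w_t uniform on [-1, 1]
    y[t] = -theta0 * y[t - 1] + rng.uniform(-1, 1)

# index sets I_1,...,I_7 (I_8 is empty and gives g_8 = 0); the first two are those quoted
# in the text, the others are a hypothetical completion closed under symmetric difference
I_sets = [{1, 2, 4, 5}, {1, 3, 4, 6}, {2, 3, 5, 6}, {1, 2, 6, 7},
          {4, 5, 6, 7}, {2, 3, 4, 7}, {1, 3, 5, 7}]

kept = []
for th in np.linspace(-1, 1, 2001):
    eps = y[1:] + th * y[:-1]                       # eps_t(theta) = y_t + theta*y_{t-1}, t = 1,...,8
    f = eps[:-1] * eps[1:]                          # f_k(theta) = eps_k(theta)*eps_{k+1}(theta), k = 1,...,7
    g = [sum(f[k - 1] for k in I) / 4 for I in I_sets]
    if sum(gi > 0 for gi in g) >= 2 and sum(gi < 0 for gi in g) >= 2:
        kept.append(th)

if kept:
    print(min(kept), max(kept))                     # endpoints of the 50% confidence interval for theta0
```

With the paper's own data and index sets, the printed endpoints would be those of the interval reported above.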

A deeper reasoning (given in Appendix A so as not to break the continuity of the discourse here) reveals that a fairly strong claim on the obtained interval can be made:

    RESULT: the confidence region constructed this way has exact probability 1 − 2·2/8 = 0.5 to contain the true parameter value θ⁰.

A few comments are in order:

(1) The interval is stochastic because it depends on data; the true parameter value θ⁰ is not, and it has a fixed location that does not depend on any random element. Thus, what the above RESULT says is that the interval is subject to random fluctuation and covers the true parameter value θ⁰ in 50% of the cases.

Fig. 4. 10 more trials.

To better understand the nature of the result, we performed 10 more simulation trials, obtaining the results in Figure 4. Note that θ⁰ and w_t were as follows: θ⁰ = 0.2, w_t independent


with uniform distribution between −1 and +1.

(2) In this example, the probability is low (50%) and the interval is rather large. With more data, we obtain smaller intervals with higher probability (see also the next section).

(3) The LSCR algorithm was applied with no knowledge on the noise level or distribution and, yet, it returned an interval whose probability was exact, not an upper bound. What is the key here is that the above RESULT is a pivotal result, as the probability remains the same no matter what the noise characteristics are.

(4) The result was established along a totally different inference principle from standard Prediction Error Minimization (PEM) methods. In particular, differently from the asymptotic theory of PEM, LSCR does not construct the confidence region by quantifying the variability in an estimate.

(5) We also mention a technical aspect: The pivotal RESULT holds because {I_i, i = 1, . . . , 8} form a group under the symmetric difference operation, that is, (I_j ∪ I_k) − (I_j ∩ I_k) returns another set in {I_i, i = 1, . . . , 8} for any j and k. For instance, (I_1 ∪ I_2) − (I_1 ∩ I_2) = I_3.

    4. LSCR FOR GENERAL LINEAR SYSTEMS

    4.1 Data generating system

Consider now the general linear system in Figure 5.

Fig. 5. The system.

We assume that w_t and u_t are independent processes. This does not mean, however, that we are confining ourselves to an open-loop configuration, since closed-loop systems can be reduced to the set-up in Figure 5 by regarding w_t and u_t as external signals, see Figure 6.

Fig. 6. Closed-loop recast as open-loop.

G(θ⁰) and H(θ⁰) are stable rational transfer functions. H(θ⁰) is monic and has a stable inverse. w_t is a zero-mean independent sequence (noise). No a-priori knowledge of the noise level is assumed.

The basic assumption we make is that the system structure is known and, correspondingly, we take a full-order model class of the form:

    y_t = G(θ) u_t + H(θ) w_t,

such that the true transfer functions G(θ⁰) and H(θ⁰) are obtained for θ = θ⁰ and for no other parameter than this. We assume that θ is restricted to a set Θ such that H(θ) is monic and G(θ), H(θ) and H⁻¹(θ) are stable for all θ ∈ Θ.

Our goal is to construct an algorithm that works in the following way (see Figure 7): A finite input-output data set is given to the algorithm together with a probability p. The algorithm is required to return a confidence region that contains the true θ⁰ with probability p under assumptions on the noise as general as possible.


    Fig. 7. The algorithm.

    4.2 Construction of confidence regions

We start by describing procedures for the determination of confidence sets Θ̂_r based on correlation properties of the prediction error ε at different time instants (this generalizes the preliminary example in Section 3) and of confidence sets Θ̂ᵘ_s based on cross-correlation properties of ε and u. In general, confidence regions for θ⁰ can be constructed by taking the intersection of a few of the Θ̂_r and Θ̂ᵘ_s sets, and this is discussed at the end of this section.

Procedure for the construction of Θ̂_r

(1) Compute the prediction errors

    ε_t(θ) = y_t − ŷ_t(θ) = H⁻¹(θ) y_t − H⁻¹(θ) G(θ) u_t

for a finite number of values of t, say t = 1, 2, . . . , K;

(2) Select an integer r ≥ 1. For t = 1 + r, . . . , N + r = K, compute

    f_{t-r,r}(θ) = ε_{t-r}(θ) ε_t(θ);

(3) Let I = {1, . . . , N} and consider a collection 𝒢 of subsets I_i ⊆ I, i = 1, . . . , M, forming a group under the symmetric difference operation (i.e. (I_i ∪ I_j) − (I_i ∩ I_j) ∈ 𝒢, if I_i, I_j ∈ 𝒢). Compute

    g_{i,r}(θ) = Σ_{k∈I_i} f_{k,r}(θ),    i = 1, . . . , M;

(4) Select an integer q in the interval [1, (M + 1)/2) and find the region Θ̂_r such that at least q of the g_{i,r}(θ) functions are bigger than zero and at least q are smaller than zero. □
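Steps 2-4 of the procedure reduce to a simple membership test for each candidate θ. A minimal sketch follows; the function name and the convention that the prediction errors are supplied as a numpy array are assumptions made here for illustration.

```python
import numpy as np

def in_theta_hat_r(eps, groups, r, q):
    """Test whether the parameter that produced the prediction errors `eps`
    (eps[t-1] = eps_t(theta), t = 1,...,K) belongs to Theta_r.
    `groups` is a collection of index subsets of {1,...,N} closed under
    symmetric difference, `r` the correlation lag and `q` the exclusion level."""
    f = eps[:-r] * eps[r:]                                            # f_{k,r}(theta) = eps_k(theta)*eps_{k+r}(theta)
    g = np.array([f[[k - 1 for k in I]].sum() for I in groups if I])  # skip the empty set (its g is 0)
    return np.sum(g > 0) >= q and np.sum(g < 0) >= q                  # keep theta if at least q g's lie on each side of 0
```

For the preliminary example of Section 3, calling this test with the eight prediction errors ε_1(θ), . . . , ε_8(θ), the group of index sets of that section, r = 1 and q = 2 reproduces the interval construction described there.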

The above procedure is the same as the one used for the construction of the confidence set in the preliminary example in Section 3. In that example we had H⁻¹(θ) = 1 + θz⁻¹, G(θ) = 0, and K = 8, N = 7, r = 1, M = 8 and q = 2. The normalization 1/4 in the preliminary example was introduced for the purpose of interpreting the g_i(θ) functions as empirical averages, but it could have been dropped, similarly to point 3 in the above procedure, without affecting the final result.

In the procedure, the group 𝒢 can be freely selected. Thus, if e.g. I = {1, 2, 3, 4}, a suitable group is 𝒢 = {{1, 2}, {3, 4}, ∅, {1, 2, 3, 4}}; another one is 𝒢 = {{1}, {2, 3, 4}, ∅, {1, 2, 3, 4}}; yet another one is 𝒢 = all subsets of I. While the theory presented holds for any choice and the region Θ̂_r is guaranteed to be a confidence region in any case (see Theorem 1 below), the feasible choices are limited by computational considerations. For example, the set of all subsets cannot normally be chosen as it is a truly large set. Gordon (1974) discusses how to construct groups of moderate size where the subsets contain approximately half of the elements in I, and such a procedure is also summarized in Appendix B. These sets are particularly well suited for use in point 3 of the above procedure.
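A group of subsets closed under symmetric difference can also be generated mechanically by working with 0/1 indicator vectors over GF(2): the span of a few generator subsets is automatically closed under the operation. The sketch below illustrates this; it is a generic construction, not Gordon's specific one (which, as mentioned above, additionally makes the subsets contain about half of the elements of I).

```python
import numpy as np
from itertools import product

def subset_group(generators, N):
    """Return the group (under symmetric difference) generated by the given
    subsets of {1,...,N}, represented as 0/1 indicator vectors over GF(2)."""
    gen = np.zeros((len(generators), N), dtype=int)
    for i, S in enumerate(generators):
        gen[i, [k - 1 for k in S]] = 1
    group = []
    for coeffs in product([0, 1], repeat=len(generators)):   # all GF(2) combinations of the generators
        vec = np.array(coeffs) @ gen % 2                       # XOR of the selected indicator vectors
        group.append({k + 1 for k in np.flatnonzero(vec)})
    return group                                               # 2^l subsets, closed under symmetric difference

# e.g. three generators give a group of 8 subsets of {1,...,7}, including the empty set
G = subset_group([{1, 2, 4, 5}, {1, 3, 4, 6}, {1, 2, 6, 7}], 7)
```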

The intuitive idea behind the construction in the procedure is that, for θ = θ⁰, the functions g_{i,r}(θ) assume positive or negative values at random, so that it is unlikely that almost all of them are positive or that almost all of them are negative. Since point 4 in the construction of Θ̂_r discards regions where all g_{i,r}(θ)'s but a small fraction (q should be taken to be small compared to M) are of the same sign, we expect that θ⁰ ∈ Θ̂_r with high probability. This is put on solid mathematical grounds in Theorem 1 below, showing that the probability that θ⁰ ∈ Θ̂_r is actually 1 − 2q/M. Thus, q is a tuning parameter that has to be selected such that a desired probability of the confidence region is obtained. Moreover, as q increases, we exclude larger and larger regions of Θ and hence Θ̂_r shrinks and the probability that θ⁰ ∈ Θ̂_r decreases.

The procedure for the construction of the sets Θ̂ᵘ_s is in the same spirit. The only difference is that the empirical auto-correlations in point 2 are replaced by empirical cross-correlations between the input signal and the prediction error.

Procedure for the construction of Θ̂ᵘ_s

(1) Compute the prediction errors

    ε_t(θ) = y_t − ŷ_t(θ) = H⁻¹(θ) y_t − H⁻¹(θ) G(θ) u_t

for a finite number of values of t, say t = 1, 2, . . . , K;

(2) Select an integer s ≥ 0. For t = 1 + s, . . . , N + s = K, compute

    fᵘ_{t-s,s}(θ) = u_{t-s} ε_t(θ);

(3) Let I = {1, . . . , N} and consider a collection 𝒢 of subsets I_i ⊆ I, i = 1, . . . , M, forming a group under the symmetric difference operation. Compute

    gᵘ_{i,s}(θ) = Σ_{k∈I_i} fᵘ_{k,s}(θ),    i = 1, . . . , M;

(4) Select an integer q in the interval [1, (M + 1)/2) and find the region Θ̂ᵘ_s such that at least q of the gᵘ_{i,s}(θ) functions are bigger than zero and at least q are smaller than zero. □

The next theorem gives the exact probability that the true parameter θ⁰ belongs to one particular of the above constructed sets. The proof of this theorem, as well as comments on the technical assumption on densities, can be found in Campi and Weyer (2005).

Theorem 1. Assume that the variables w_t and w_t u_t admit densities and that w_t is symmetrically distributed around zero. Then, the sets Θ̂_r and Θ̂ᵘ_s constructed above are such that:

    Pr{θ⁰ ∈ Θ̂_r} = 1 − 2q/M,    (5)
    Pr{θ⁰ ∈ Θ̂ᵘ_s} = 1 − 2q/M.    (6)

The following comments pinpoint some important aspects of this result:

(1) The procedures return regions of guaranteed probability despite the fact that no a-priori knowledge on the noise level is assumed: The noise level enters the procedures through data only. This could be phrased by saying that the procedures let the data speak, without a-priori assuming what they have to tell us.


(2) As expected, the noise level does impact the final result, as the shape and size of the region depend on the noise via the data.

(3) Evaluations (5) and (6) are nonconservative in the sense that 1 − 2q/M is the exact probability, not a lower bound of it.

Each one of the sets Θ̂_r and Θ̂ᵘ_s is a non-asymptotic confidence set for θ⁰. However, each one of these sets is based on one correlation only and will usually be unbounded in some directions of the parameter space, and therefore not particularly useful. A general, practically useful confidence set can be obtained by intersecting a number of the sets Θ̂_r and Θ̂ᵘ_s, i.e.

    Θ̂ = ∩_{r=1}^{n_ε} Θ̂_r ∩ ∩_{s=1}^{n_u} Θ̂ᵘ_s.    (7)

An obvious question is how to choose n_ε and n_u in order to obtain well shaped confidence sets that are bounded and concentrated around the true parameter θ⁰. It turns out that the answer depends on the particular model class under consideration and this issue will be further discussed in Section 6.

We conclude this section with a fact which is immediate from Theorem 1.

Theorem 2. Under the assumptions of Theorem 1,

    Pr{θ⁰ ∈ Θ̂} ≥ 1 − (n_ε + n_u) 2q/M,

where Θ̂ is given by (7).

The inequality in the theorem is due to the fact that the events {θ⁰ ∉ Θ̂_r}, {θ⁰ ∉ Θ̂ᵘ_s}, r = 1, . . . , n_ε, s = 1, . . . , n_u, may be overlapping.

Theorem 2 can be used in connection with robust design procedures: If a problem solution is robust with respect to Θ̂ in the sense that a certain property is achieved for any θ ∈ Θ̂, then such a property is also guaranteed for the true system with the selected probability 1 − (n_ε + n_u) 2q/M.

5. EXAMPLES

Two examples illustrate the developed methodology. The first one is simple and permits an easy illustration of the method. The second is more challenging.

    5.1 First order ARMA system

Consider the ARMA system

    y_t + a⁰ y_{t-1} = w_t + c⁰ w_{t-1},    (8)

where a⁰ = 0.5, c⁰ = 0.2 and w_t is an independent sequence of zero-mean Gaussian random variables with variance 1. 1025 data points were generated according to (8). As a model class we used y_t + a y_{t-1} = w_t + c w_{t-1}, |a| < 1, |c| < 1, with associated predictor and prediction error given by

    ŷ_t(a, c) = −c ŷ_{t-1}(a, c) + (c − a) y_{t-1},
    ε_t(a, c) = y_t − ŷ_t(a, c) = y_t + a y_{t-1} − c ε_{t-1}(a, c).

In order to form a confidence region for θ⁰ = (a⁰, c⁰) we calculated

    f_{t-1,1}(a, c) = ε_{t-1}(a, c) ε_t(a, c),    t = 2, . . . , 1024,
    f_{t-2,2}(a, c) = ε_{t-2}(a, c) ε_t(a, c),    t = 3, . . . , 1025,

and then computed

    g_{i,1}(a, c) = Σ_{k∈I_i} f_{k,1}(a, c),    i = 1, . . . , 1024,
    g_{i,2}(a, c) = Σ_{k∈I_i} f_{k,2}(a, c),    i = 1, . . . , 1024,

using the group in Appendix B. Next we discarded those values of a and c for which zero was among the 12 largest or the 12 smallest values of g_{i,1}(a, c) or of g_{i,2}(a, c). Then, according to Theorem 2, (a⁰, c⁰) belongs to the constructed region with probability at least 1 − 2·2·12/1024 = 0.9531. The obtained confidence region is the blank area in Figure 8.
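In code, the exclusion rule used above amounts to checking, for each candidate (a, c), whether zero falls among the q most extreme values of the g functions. A minimal sketch is given below; the recursion initialization and the 0-based indexing of the group subsets are simplifying assumptions made here.

```python
import numpy as np

def pred_errors(y, a, c):
    """Prediction errors eps_t(a,c) = y_t + a*y_{t-1} - c*eps_{t-1}(a,c) of the ARMA(1,1) model
    (initialised with y and eps set to 0 before the first sample)."""
    eps = np.zeros(len(y) + 1)
    for t in range(1, len(y) + 1):
        eps[t] = y[t - 1] + a * (y[t - 2] if t >= 2 else 0.0) - c * eps[t - 1]
    return eps[1:]

def survives(y, a, c, groups, q=12):
    """True if (a, c) is kept: zero is not among the q largest nor the q smallest
    of the g_{i,1}(a,c) values, and likewise for g_{i,2}(a,c).
    `groups` holds 0-based index subsets closed under symmetric difference."""
    eps = pred_errors(y, a, c)
    for r in (1, 2):
        f = eps[:-r] * eps[r:]                          # empirical products eps_t * eps_{t+r}
        g = np.array([f[list(I)].sum() for I in groups])
        if np.sum(g > 0) < q or np.sum(g < 0) < q:      # fewer than q on one side of zero: exclude (a, c)
            return False
    return True
```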

The area marked with 'x' is where 0 is among the 12 smallest values of g_{i,1}, and the area marked with '+' is where 0 is among the 12 largest values of g_{i,1}. Likewise for g_{i,2}, with the squares representing where 0 belongs to the 12 largest elements and the circles the 12 smallest. The true value (a⁰, c⁰) is marked with a star. As we can see, each step in the construction of the confidence region excludes a particular region.

Using the algorithm for the construction of Θ̂ we have obtained a bounded confidence set with a guaranteed probability based on a finite number of data points. As no asymptotic theory is involved, this is a rigorous finite sample result. For comparison, we have in Figure 8 also plotted the 95% confidence ellipsoid obtained using the asymptotic theory (Ljung (1999), Chapter 9). The two confidence regions are of similar shape and size, confirming that the non-asymptotic confidence sets are practically useful, and, unlike the asymptotic confidence ellipsoids, they do have guaranteed probability for a finite sample size.

Fig. 8. Non-asymptotic confidence region for (a⁰, c⁰) (blank region) and asymptotic confidence ellipsoid. The true parameter (star) and the parameter estimated using a prediction error method are also marked.

    5.2 A closed-loop system

The following example was originally introduced in Garatti et al. (2004) to demonstrate that the asymptotic theory of PEM can at times deliver misleading results even with a large amount of data points. It is reconsidered here to show how LSCR works in this challenging situation.

    Consider the system of Figure 9 where

    Fig. 9. The closed-loop system.

    F(θ⁰) = b⁰ z⁻¹ / (1 + a⁰ z⁻¹),    a⁰ = −0.7,  b⁰ = 0.3,
    H(θ⁰) = 1 + h⁰ z⁻¹,    h⁰ = 0.5,

w_t is white Gaussian noise with variance 1 and the reference u_t is also white Gaussian, with variance 10⁻⁶. Note that the variance of the reference signal is very small as compared to the noise variance, that is, there is poor excitation. It is perhaps interesting to note that the present situation, though admittedly artificial, is a simplification of what often happens in practical identification, where poor excitation is due to the closed-loop operation of the system. 2050 measurements of u and y were generated to be used in identification.

We first describe what we obtained using PEM identification. A full order model was identified. The amplitude Bode diagrams of the transfer function from u to y of the identified model and of the real system are plotted in Figure 10. From the plot, a big mismatch between the real plant and the identified model is apparent, a fact that does not come too much as a surprise considering that the reference signal is poorly exciting. An analysis conducted in Garatti et al. (2004) shows that, when u_t ≡ 0, the asymptotic PEM identification cost has two isolated global minimizers, one is θ⁰ and a second one is a spurious parameter, say θ̄; when u_t ≢ 0 but small, as is the case in our actual experiment, θ̄ does not minimize the asymptotic cost anymore, but random fluctuations in the identification cost due to the finiteness of the data points may as well result in the estimate getting trapped near the spurious θ̄, generating a totally wrong identified model.

Fig. 10. The identified u → y transfer function.

But let us now see what we obtained as a 90% confidence region with the asymptotic theory. Figure 11 displays the confidence region in the frequency domain: Surprisingly, it concentrates around the identified model, so that in a real identification procedure, where the true transfer function is not known, we would conclude that the estimated model is reliable, a totally misleading result. We will come back to this point later and discuss a bit further the theoretical reason for such a bad behavior.

Return now to the LSCR approach. LSCR was used in a totally blind manner, that is, with no concern at all for the identification set-up characteristics; in particular, we did not pay any attention to the existence of local minima: The method is guaranteed by the theory and it will work in all possible situations covered by the theory.

Fig. 11. 90% confidence region for the identified u → y transfer function obtained with the asymptotic theory.

In the present setting, the prediction error is given by

    ε_t(θ) = 1/(1 + h z⁻¹) · y_t − b z⁻¹/((1 + a z⁻¹)(1 + h z⁻¹)) · (u_t − y_t)
           = (1 + (a + b) z⁻¹)/((1 + a z⁻¹)(1 + h z⁻¹)) · y_t − b z⁻¹/((1 + a z⁻¹)(1 + h z⁻¹)) · u_t.

The group 𝒢 was constructed as in Appendix B (2^l = 2048), and we computed

    g_{i,r}(θ) = Σ_{k∈I_i} ε_{k-r}(θ) ε_k(θ),    r = 1, 2, 3,

in the parameter space, making the standard assumptions that G(θ) and H(θ) (i.e. the u to y and w to y closed-loop transfer functions) are stable (|a + b| < 1) and that H(θ) has a stable inverse (|a| < 1, |h| < 1). We excluded the regions in the parameter space where 0 was among the 34 smallest or largest values of any of the three correlations above to obtain a 1 − 3·2·34/2048 = 0.9004 confidence set. The confidence set is shown in Figure 12. The set consists of two separate regions, one around the true parameter θ⁰ and one around θ̄, the spurious minimizer. This illustrates the global features of the approach: LSCR produces two separate regions as the overall confidence set because the information in the data is intrinsically ineffective in telling us which one of the two regions contains the true parameter.

Figures 13 and 14 show close-ups of the two regions. The ellipsoid in Figure 13 is the 90% confidence set of the asymptotic PEM theory: When the PEM estimate gets trapped near θ̄, the confidence ellipsoid all concentrates around this spurious θ̄ because the PEM asymptotic theory is local in nature (it is based on a Taylor expansion) and is therefore unable to explore locations far from the identified model. This is the reason why in Figure 11 we obtained a frequency domain confidence region unable to capture the real model uncertainty. The reader is referred to Garatti et al. (2004) for more details.

Fig. 12. 90% confidence set.

Fig. 13. Asymptotic 90% confidence ellipsoid (-) and the part of the non-asymptotic confidence set around θ̄ (- -).

Fig. 14. Close-up of the non-asymptotic confidence region around θ⁰.

    6. LSCR PROPERTIES

As we have seen in Section 4, Theorems 1 and 2 quantify the probability that θ⁰ belongs to the constructed regions. However, these theorems deal with only one side of the story. In fact, a good evaluation method must have two properties:

• the provided region must have guaranteed probability (and this is what Theorems 1 and 2 deliver);
• the region must be bounded and, in particular, it should concentrate around θ⁰ as the number of data points increases.

We next discuss how this second property can be achieved by choosing n_ε and n_u in (7). It turns out that the choice depends on the model class, and we here consider ARMA and ARMAX models, while general linear model classes are dealt with in Campi and Weyer (2005).

    6.1 ARMA models

    Data generating system and model class

The data generating system is given by

    y_t = ( C(θ⁰) / A(θ⁰) ) w_t,

where

    A(θ⁰) = 1 + a⁰₁ z⁻¹ + · · · + a⁰ₙ z⁻ⁿ,
    C(θ⁰) = 1 + c⁰₁ z⁻¹ + · · · + c⁰ₚ z⁻ᵖ,

and θ⁰ = [a⁰₁ · · · a⁰ₙ c⁰₁ · · · c⁰ₚ]ᵀ. In addition to the assumptions in Section 4.1 and in Theorem 1, we assume that A(θ⁰) and C(θ⁰) have no common factors and that w_t is wide-sense stationary with spectral density Φ_w(ω) = σ²_w > 0.

The model class is

    y_t = ( C(θ) / A(θ) ) w_t,

where

    A(θ) = 1 + a₁ z⁻¹ + · · · + aₙ z⁻ⁿ,
    C(θ) = 1 + c₁ z⁻¹ + · · · + cₚ z⁻ᵖ,

θ = [a₁ · · · aₙ c₁ · · · cₚ]ᵀ, and the assumptions in Section 4.1 are in place.

Confidence regions for ARMA models

We next give a result taken from Campi and Weyer (2005) which shows how a confidence region which concentrates around the true parameter as the number of data points increases can be obtained for ARMA systems.

Theorem 3. Let ε_t(θ) = ( A(θ) / C(θ) ) y_t be the prediction error associated with the ARMA model class. Then, θ = θ⁰ is the unique solution to the set of equations:

    E[ε_{t-r}(θ) ε_t(θ)] = 0,    r = 1, . . . , n + p.

Theorem 3 shows that if we simultaneously impose n + p correlation conditions, where n and p are the orders of the A(θ) and C(θ) polynomials, then the only solution is the true θ⁰. Guided by this idea, we consider n + p sample correlation conditions, and let n_ε = n + p in (7):

    Θ̂ = ∩_{r=1}^{n+p} Θ̂_r.

Theorem 2 guarantees that this set contains θ⁰ with probability 1 − (n + p)·2q/M, and Theorem 3 entails that the confidence set concentrates around θ⁰.

    6.2 ARMAX models

    Data generating system and model class

Consider now the system

    y_t = ( B(θ⁰) / A(θ⁰) ) u_t + ( C(θ⁰) / A(θ⁰) ) w_t,

where

    A(θ⁰) = 1 + a⁰₁ z⁻¹ + · · · + a⁰ₙ z⁻ⁿ,
    B(θ⁰) = b⁰₁ z⁻¹ + · · · + b⁰ₘ z⁻ᵐ,
    C(θ⁰) = 1 + c⁰₁ z⁻¹ + · · · + c⁰ₚ z⁻ᵖ,

and θ⁰ = [a⁰₁ · · · a⁰ₙ b⁰₁ · · · b⁰ₘ c⁰₁ · · · c⁰ₚ]ᵀ. In addition to the assumptions in Section 4.1 and in Theorem 1, we assume that A(θ⁰) and B(θ⁰) have no common factors and, similarly to the ARMA case, we assume a stationary environment. Precisely, w_t is wide-sense stationary with spectral density Φ_w(ω) = σ²_w > 0 and u_t is wide-sense stationary too and independent of w_t.

The model class is

    y_t = ( B(θ) / A(θ) ) u_t + ( C(θ) / A(θ) ) w_t,

where A(θ), B(θ), C(θ) have the same structure as for the true system.

Confidence regions for ARMAX models

The next theorem, taken from Campi and Weyer (2005), shows that we can choose correlation equations such that the solution is unique and equal to θ⁰, provided that the input signal u_t is white.

Theorem 4. Let ε_t(θ) = ( A(θ) / C(θ) ) y_t − ( B(θ) / C(θ) ) u_t be the prediction error associated with the ARMAX model class. If u_t is white with spectral density Φ_u(ω) = σ²_u > 0, then θ = θ⁰ is the unique solution to the set of equations:

    E[u_{t-s} ε_t(θ)] = 0,    s = 1, . . . , n + m,
    E[ε_{t-r}(θ) ε_t(θ)] = 0,    r = 1, . . . , p.

Guided by this result, we choose n_ε = p and n_u = n + m in (7) to arrive at the following confidence region for ARMAX models:

    Θ̂ = ∩_{r=1}^{p} Θ̂_r ∩ ∩_{s=1}^{n+m} Θ̂ᵘ_s.

Interestingly enough, the conclusion of Theorem 4 does not hold true for colored input sequences, see Campi and Garatti (2003). On the other hand, assuming that u_t is white is often unrealistic. This impasse can be circumvented by resorting to suitable prefiltering actions, as indicated in Campi and Weyer (2005).

    6.3 Properties of LSCR

To summarize, LSCR has the following properties:

• for suitable selections of the correlations, the region shrinks around θ⁰;
• for any sample size, θ⁰ belongs to the constructed region with given probability, despite the fact that no assumption on the level of the noise is made.

    7. COMPLEMENTS

In this Part I, the only restrictive assumption on the noise was that it had a symmetric distribution around zero. This assumption can be relaxed as briefly discussed here.

(i) Suppose that w_t in Section 4 has median 0, i.e. Pr{w_t ≥ 0} = Pr{w_t < 0} = 0.5 (note that this is a relaxation of the symmetric distribution condition). Then, the theory goes through by considering everywhere sign(ε_t(θ)) instead of ε_t(θ), where sign is the signum function: sign(x) = 1 if x > 0, sign(x) = −1 if x < 0 and sign(x) = 0 if x = 0.

(ii) When w_t is independent and identically but not symmetrically distributed, we can obtain symmetrically distributed data by considering the difference between two subsequent data points, that is, (y_t − y_{t-1}) = G(θ)(u_t − u_{t-1}) + H(θ)(w_t − w_{t-1}); here, w_t − w_{t-1}, t = 2, 4, 6, . . ., are independent and symmetrically distributed around 0 and we can refer to this difference system to construct confidence regions.

PART II: Presence of unmodeled dynamics

In this second part, we discuss the possibility of dealing with unmodeled dynamics within the LSCR framework: Despite the fact that the true system is not within the model class, we would like to derive guaranteed results for some parts of the system.

We start by describing the problem of identifying a full-order u to y transfer function, without deriving a model for the noise. Then, we turn to also consider unmodeled dynamics in the u to y transfer function. General ideas are only discussed by means of simple examples.

8. IDENTIFICATION WITHOUT NOISE DESCRIPTION: AN EXAMPLE

Consider the system

    y_t = θ⁰ u_t + n_t.

Suppose that the structure of the u to y transfer function is known. Instead, the noise n_t describes all other sources of variation in y_t apart from u_t and we do not want to make any assumption on how n_t is generated. Correspondingly, we want our results regarding the value of θ⁰ to be valid for any (unknown) deterministic noise sequence n_t with no constraints whatsoever. When the noise is stochastic, the result will then hold for any realization of the noise, that is, surely.

In this section, we assume that we have access to the system for experiment: We are allowed to generate a finite number, say 7, of input data and, based on the collected outputs, we are asked to construct a confidence interval for θ⁰ of guaranteed probability.

The problem looks very challenging indeed: Since the noise can be whatever, it seems that the observed data are unable to give us a hand in constructing a confidence region. In fact, for any given θ⁰ and u_t, a suitable choice of the noise sequence can lead to any observed output signal! Let us see how this problem can be circumvented.

Before proceeding, we feel it advisable to make clear what is meant here by guaranteed probability. We said that n_t is regarded as a deterministic sequence, and the result is required to hold true for any n_t, that is, uniformly in n_t. The stochastic element is instead the input sequence: We will


select u_t according to a random generation mechanism and we require that θ⁰ belongs to the constructed interval with a given probability value, where the probability is with respect to the random choice of u_t.

We first indicate the input design and then the procedure for construction of the confidence interval.

Input design

Let u_t, t = 1, . . . , 7, be independent and identically distributed with distribution

    u_t = +1 with probability 0.5,    u_t = −1 with probability 0.5.

Procedure for construction of the confidence interval

Rewrite the system as a model with generic parameter θ:

    y_t = θ u_t + n_t.

We construct a prediction by dropping the noise term n_t, whose characteristics are unknown:

    ŷ_t(θ) = θ u_t,    ε_t(θ) = y_t − ŷ_t(θ) = y_t − θ u_t.

Next, we compute the prediction errors ε_t(θ) from the observed data for t = 1, . . . , 7 and calculate

    f_t(θ) = u_t ε_t(θ),    t = 1, . . . , 7.

The rest of the construction is the same as for the preliminary example of Section 3: We consider the same group of subsets as given in the bullet table in that example and construct the g_i(θ) functions as in (4). Then, we extract the interval where at least two functions are below zero and at least two are above zero. The reader can verify that the theoretical analysis for the example in Section 3 goes through here to conclude that the obtained interval has probability 0.5 to contain the true θ⁰. Interestingly, the properties of w_t of being independent and symmetrically distributed have been replaced here by analogous properties of the input signal; the advantage is that, if the experiment can be designed, these properties can be easily enforced and no restrictive conditions on the noise are required anymore.
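A minimal sketch of the whole construction follows; the deterministic noise sequence below is an arbitrary stand-in, and the index sets are the same hypothetical completion used in the sketch of Section 3.

```python
import numpy as np

rng = np.random.default_rng(3)
theta0 = 1.0
n = np.array([0.6, 0.4, 0.7, 0.5, 0.3, 0.55, 0.45])    # arbitrary deterministic noise sequence
u = rng.choice([-1.0, 1.0], size=7)                     # randomised input: +/-1 with probability 0.5 each
y = theta0 * u + n                                      # collected outputs

I_sets = [{1, 2, 4, 5}, {1, 3, 4, 6}, {2, 3, 5, 6}, {1, 2, 6, 7},
          {4, 5, 6, 7}, {2, 3, 4, 7}, {1, 3, 5, 7}]     # hypothetical index sets (plus the empty set)

kept = []
for th in np.linspace(-1, 3, 4001):
    f = u * (y - th * u)                                # f_t(theta) = u_t * eps_t(theta)
    g = [sum(f[k - 1] for k in I) / 4 for I in I_sets]
    if sum(gi > 0 for gi in g) >= 2 and sum(gi < 0 for gi in g) >= 2:
        kept.append(th)

if kept:
    print(min(kept), max(kept))                         # interval containing theta0 with probability 0.5
```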

A simulation example was run where θ⁰ = 1 and the noise was the sequence shown in Figure 15. This noise sequence was obtained as a realization of a biased independent Gaussian process with mean 0.5 and variance 0.1. The obtained g_i(θ) functions and the corresponding confidence region are given in Figure 16.

Fig. 15. Noise sequence.

Fig. 16. The g_i(θ) functions (empirical correlations).

9. UNMODELED DYNAMICS IN THE U TO Y TRANSFER FUNCTION: AN EXAMPLE

Suppose that a system has the structure

    y_t = θ₀⁰ u_t + θ₁⁰ u_{t-1} + n_t,

while, for estimation purposes, we use the reduced order model

    y_t = θ u_t + n_t.

The noise has been indicated with a generic n_t to signify that it can be whatever, and not just a white signal. In fact, a perspective similar to the previous section is taken and we regard n_t as a generic unknown deterministic signal.

After determining a region for the parameter θ, one sensible question to ask is: Does this region contain with a given probability the system parameter θ₀⁰ linking u_t to y_t?

Reinterpreting the above question, we are asking whether the projection of the true transfer function θ₀⁰ + θ₁⁰ z⁻¹ onto the 1-dimensional space spanned by constant transfer functions is contained in the estimated set with a certain probability.


We generate an input signal u_t in the same way as in the previous section, this time over the time interval t = 0, . . . , 7, and inject it into the system. Then, the predictor and prediction error are constructed in the same way as in the previous section, while we add a sign function to f_t(θ):

    f_t(θ) = sign(u_t ε_t(θ)),    t = 1, . . . , 7.    (9)

Corresponding to the true parameter value, i.e. θ = θ₀⁰, an easy inspection reveals that sign(u_t ε_t(θ₀⁰)) = sign(u_t (θ₁⁰ u_{t-1} + n_t)) is an independent and symmetrically distributed process (it is in fact a Bernoulli process taking on the values ±1 with probability 0.5 each). Thus, with f_t(θ) as in (9), the situation is similar to what we had in the previous section and again the theory goes through to prove that an interval for θ constructed as in the previous section has probability 0.5 to contain θ₀⁰, despite the presence of unmodeled dynamics.

A simulation example was run with θ₀⁰ = 1, θ₁⁰ = 0.5 and where the noise was again the realization of a biased Gaussian process given in Figure 15. As sign(u_t ε_t(θ)) can only take on the values 1, −1 and 0, it is possible that two or more of the g_i(θ) functions will take on the same value on an interval (in technical terms, the assumption on the existence of densities in Theorem 1 does not hold). This tie can be broken by introducing a random ordering (e.g. by adding a random constant number between −0.1 and 0.1 to the g_i(θ) functions) and one can see that the theory remains valid. The obtained g_i(θ) functions and confidence region are in Figure 17.

Fig. 17. The g_i(θ) functions.
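Only the definition of f_t(θ) changes with respect to the previous section; a short sketch (with a hypothetical function name, and with the random constants implementing the tie-break mentioned above) is:

```python
import numpy as np

rng = np.random.default_rng(4)
offsets = rng.uniform(-0.1, 0.1, size=7)         # one random constant per g_i, fixed over theta (tie-break)

def g_functions(u, y, th, groups):
    """g_i(theta) built from f_t(theta) = sign(u_t * eps_t(theta)); the constant offsets
    implement the random ordering used to break ties between coinciding functions."""
    f = np.sign(u * (y - th * u))                 # sign(u_t * (y_t - theta*u_t)), t = 1,...,7
    return [sum(f[k - 1] for k in I) / 4 + offsets[i] for i, I in enumerate(groups)]
```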

Though presented via simple examples, the approaches illustrated in Sections 8 and 9 to deal with unmodeled dynamics have a general breadth of applicability.

PART III: Nonlinear systems

Interestingly, the identification set-up developed in the previous sections in a linear setting generalizes naturally to nonlinear systems. Such an extension is presented in this Part III with reference to a simple example.

10. IDENTIFICATION OF NONLINEAR SYSTEMS: AN EXAMPLE

Consider the following nonlinear system

    y_t = θ⁰ (y_{t-1}² − 1) + w_t,    (10)

where w_t is an independent and symmetrically distributed sequence and θ⁰ is an unknown parameter.

This system can be made explicit with respect to w_t as follows:

    w_t = y_t − θ⁰ (y_{t-1}² − 1),

and, by substituting θ⁰ with a generic θ and renaming the so-obtained right-hand side as w_t(θ), we have

    w_t(θ) = y_t − θ (y_{t-1}² − 1).

Note that w_t(θ) coincides in this example with the prediction error for the model with parameter θ. It is not true, however, that the above construction generates the prediction error for any nonlinear system. For example, if we make explicit the system y_t = θ⁰ y_{t-1} w_t with respect to w_t we get w_t = y_t/(θ⁰ y_{t-1}), and further substituting a generic θ we have w_t(θ) = y_t/(θ y_{t-1}). This is not the prediction error, since the predictor is in this case ŷ_t(θ) = 0, so that the prediction error is here given by y_t − ŷ_t(θ) = y_t. Note also that for linear systems w_t(θ) is always equal to the prediction error ε_t(θ), so that the construction suggested here for the generation of w_t(θ) generalizes the construction of ε_t(θ) for linear systems.

Now, an inspection of the proof in Appendix A reveals that the only property that has a role in determining the result that the confidence region contains θ⁰ with a given probability is that ε_t(θ⁰) = w_t. Since this same property holds here for w_t(θ), i.e. w_t(θ⁰) = w_t, we can argue that proceeding in the same way as for the preliminary example of Section 3, where ε_t(θ) is replaced by w_t(θ), still generates in the present context a guaranteed confidence region.

Is it all so simple? It is indeed, as far as the guarantee that θ⁰ is in the region is concerned. The other side of the coin is that second order


statistics are in general not enough to spot the real parameter value for nonlinear systems, so that results like those in Section 6 do not apply to conclude that the region shrinks around θ⁰.

In actual effect, if e.g. θ⁰ = 0 and E[w_t²] = 1, some easy computation reveals that E[w_{t-r}(θ) w_t(θ)] = 0 for any θ and for any r > 0, so that second-order statistics are useless.

Now, the good news is that LSCR can be upgraded to higher-order statistics with little effort. A general presentation of the related results can be found in Dalai et al. (2005). Here, it suffices to say that we can e.g. consider the third-order statistic E[w_t(θ)² w_{t+1}(θ)] and the theory goes through unaltered.

As an example, we generated 9 samples of y_t, t = 0, . . . , 8, for system (10) where w_t is zero-mean Gaussian with variance 1. Then, we constructed

    g_i(θ) = (1/4) Σ_{k∈I_i} w_k(θ)² w_{k+1}(θ),    i = 1, . . . , 8,

where the sets I_i are as in Section 3. These functions are displayed in Figure 18. The interval marked in blue where at least two functions are below zero and at least two are above zero has probability 0.5 to contain θ⁰ = 0.

Fig. 18. The g_i(θ) functions (empirical third-order correlations).
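A sketch of this third-order construction is given below; the data are regenerated rather than taken from the paper's simulation, and the index sets are again the hypothetical completion used earlier.

```python
import numpy as np

rng = np.random.default_rng(5)
theta0 = 0.0
y = np.zeros(9)
for t in range(1, 9):
    y[t] = theta0 * (y[t - 1] ** 2 - 1) + rng.normal()   # y_t = theta0*(y_{t-1}^2 - 1) + w_t

I_sets = [{1, 2, 4, 5}, {1, 3, 4, 6}, {2, 3, 5, 6}, {1, 2, 6, 7},
          {4, 5, 6, 7}, {2, 3, 4, 7}, {1, 3, 5, 7}]      # hypothetical index sets, as in earlier sketches

kept = []
for th in np.linspace(-2, 2, 4001):
    w = y[1:] - th * (y[:-1] ** 2 - 1)                   # w_t(theta), t = 1,...,8
    h = w[:-1] ** 2 * w[1:]                              # third-order terms w_k(theta)^2 * w_{k+1}(theta)
    g = [sum(h[k - 1] for k in I) / 4 for I in I_sets]
    if sum(gi > 0 for gi in g) >= 2 and sum(gi < 0 for gi in g) >= 2:
        kept.append(th)
# `kept` holds the retained theta values: a region containing theta0 with probability 1 - 2*2/8 = 0.5
```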

    11. CONCLUSIONS

In this paper we have provided an overview of the LSCR method for system identification. Most of the existing theory for system identification is asymptotic in the number of data points, while in practice one will only have a finite number of samples available. Although the asymptotic theory often delivers sensible results when applied heuristically to a finite number of data points, the results are not guaranteed. The LSCR method delivers guaranteed finite sample results, and it produces a set of models to which the true system belongs with a given probability for any finite number of data points.

As illustrated by the simulation examples, the method is not only grounded on a solid theoretical basis, but it also delivers practically useful confidence sets.

The LSCR method takes a global approach and can, when the situation requires, produce a confidence set which consists of disjoint regions, and hence it has advantages over confidence ellipsoids based on the asymptotic theory.

By allowing the user to choose the input signal, the LSCR method can be extended to deal with unmodeled dynamics, and it can also be extended to non-linear systems by considering higher-order statistics.

REFERENCES

Bartlett, P.L. (2003). Prediction algorithms: complexity, concentration and convexity. Proc. of the 13th IFAC Symposium on System Identification, Rotterdam, The Netherlands, pp. 1507-1517.

Bittanti, S. and M. Lovera (2000). Bootstrap-based estimates of uncertainty in subspace identification methods. Automatica, Vol. 36, pp. 1605-1615.

Bosq, D. (1998). Nonparametric Statistics for Stochastic Processes: Estimation and Prediction. Lecture Notes in Statistics 110. Springer Verlag.

Campi, M.C. and S. Garatti (2003). Correlation approach for ARMAX model identification: A counterexample to the uniqueness of the asymptotic solution. Internal report of the University of Brescia.

Campi, M.C. and P.R. Kumar (1998). Learning dynamical systems in a stationary environment. Systems and Control Letters, Vol. 34, pp. 125-132.

Campi, M.C. and E. Weyer (2002). Finite sample properties of system identification methods. IEEE Trans. on Automatic Control, Vol. 47, pp. 1329-1334.

Campi, M.C. and E. Weyer (2005). Guaranteed non-asymptotic confidence regions in system identification. Automatica, Vol. 41, pp. 1751-1764.

Campi, M.C., S.K. Ooi and E. Weyer (2004). Non-asymptotic quality assessment of generalised FIR models with periodic inputs. Automatica, Vol. 40, pp. 2029-2041.

Cherkassky, V. and F. Mulier (1998). Learning from Data. John Wiley.

    Dahleh, M.A., T.V. Theodosopoulos and J.N.Tsitsiklis (1993). The sample complexity ofworst-case identification of FIR linear sys-

    61

  • 7/30/2019 sysid06-semilpenary

    17/19

    tems. System and Control Letters, Vol. 20,pp. 157-166.

Dahleh, M.A., E.D. Sontag, D.N.C. Tse and J.N. Tsitsiklis (1995). Worst-case identification of nonlinear fading memory systems. Automatica, Vol. 31, pp. 503-508.

Dalai, M., E. Weyer and M.C. Campi (2005). Parametric identification of nonlinear systems: Guaranteed confidence regions. In Proc. of the 44th IEEE CDC, Seville, Spain, pp. 6418-6423.

Ding, F. and T. Chen (2005). Performance bounds on forgetting factor least-squares algorithms for time-varying systems with finite measurement data. IEEE Trans. on Circuits and Systems-I, Vol. 52, pp. 555-566.

Dunstan, W.J. and R.R. Bitmead (2003). Empirical estimation of parameter distributions in system identification. In Proc. of the 13th IFAC Symposium on System Identification, Rotterdam, The Netherlands.

Efron, B. and R.J. Tibshirani (1993). An Introduction to the Bootstrap. Chapman and Hall.

Garatti, S., M.C. Campi and S. Bittanti (2004). Assessing the quality of identified models through the asymptotic theory - When is the result reliable? Automatica, Vol. 40, pp. 1319-1332.

Garatti, S., M.C. Campi and S. Bittanti (2006). The asymptotic model quality assessment for instrumental variable identification revisited. Systems and Control Letters, to appear.

Garulli, A., L. Giarre and G. Zappa (2002). Identification of approximated Hammerstein models in a worst-case setting. IEEE Trans. on Automatic Control, Vol. 47, pp. 2046-2050.

Garulli, A., A. Vicino and G. Zappa (2000). Conditional central algorithms for worst-case set membership identification and filtering. IEEE Trans. on Automatic Control, Vol. 45, pp. 14-23.

Giarre, L., B.Z. Kacewicz and M. Milanese (1997a). Model quality evaluation in set membership identification. Automatica, Vol. 33, pp. 1133-1139.

Giarre, L., M. Milanese and M. Taragna (1997b). H∞ identification and model quality evaluation. IEEE Trans. on Automatic Control, Vol. 42, pp. 188-199.

Goldenshluger, A. (1998). Nonparametric estimation of transfer functions: rates of convergence and adaptation. IEEE Trans. on Information Theory, Vol. 44, pp. 644-658.

Gordon, L. (1974). Completely separating groups in subsampling. Annals of Statistics, Vol. 2, pp. 572-578.

Harrison, K.J., J.A. Ward and D.K. Gable (1996). Sample complexity of worst-case H∞ identification. Systems and Control Letters, Vol. 27, pp. 255-260.

Hartigan, J.A. (1969). Using subsample values as typical values. Journal of the American Statistical Association, Vol. 64, pp. 1303-1317.

Hartigan, J.A. (1970). Exact confidence intervals in regression problems with independent symmetric errors. Annals of Mathematical Statistics, Vol. 41, pp. 1992-1998.

Heath, W.P. (2001). Bias of indirect non-parametric transfer function estimates for plants in closed loop. Automatica, Vol. 37, pp. 1529-1540.

Hjalmarsson, H. and B. Ninness (2004). An exact finite sample variance expression for a class of frequency function estimates. In Proc. of the 43rd IEEE CDC, Bahamas.

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., Vol. 58, pp. 13-30.

Karandikar, R. and M. Vidyasagar (2002). Rates of uniform convergence of empirical means with mixing processes. Statistics and Probability Letters, Vol. 58, pp. 297-307.

Meir, R. (2000). Nonparametric time series prediction through adaptive model selection. Machine Learning, Vol. 39, pp. 5-34.

Ljung, L. (1999). System Identification - Theory for the User. 2nd Ed., Prentice Hall.

Milanese, M. and M. Taragna (2005). H∞ set membership identification: A survey. Automatica, Vol. 41, pp. 2019-2032.

Milanese, M. and A. Vicino (1991). Optimal estimation theory for dynamic systems with set membership uncertainty: an overview. Automatica, Vol. 27, pp. 997-1009.

Modha, D.S. and E. Masry (1996). Minimum complexity regression estimation with weakly dependent observations. IEEE Trans. on Information Theory, Vol. 42, pp. 2133-2145.

Modha, D.S. and E. Masry (1998). Memory-universal prediction of stationary random processes. IEEE Trans. on Information Theory, Vol. 44, pp. 117-133.

Ninness, B. and H. Hjalmarsson (2004). Variance error quantifications that are exact for finite model order. IEEE Trans. on Automatic Control, Vol. 49, pp. 1275-1291.

Politis, D.N. (1998). Computer-intensive methods in statistical analysis. IEEE Signal Processing Magazine, Vol. 15, pp. 39-55.

Politis, D.N., J.P. Romano and M. Wolf (1999). Subsampling. Springer.

Poolla, K. and A. Tikku (1994). On the time complexity of worst-case system identification. IEEE Trans. on Automatic Control, Vol. 39, pp. 944-950.

Richmond, S. (1964). Statistical Analysis. 2nd Ed., The Ronald Press Company, N.Y.


Shao, J. and D. Tu (1995). The Jackknife and Bootstrap. Springer.

Smith, R. and G.E. Dullerud (1996). Continuous-time control model validation using finite experimental data. IEEE Trans. on Automatic Control, Vol. 41, pp. 1094-1105.

Soderstrom, T. and P. Stoica (1988). System Identification. Prentice Hall.

Spall, J.C. (1995). Uncertainty bounds for parameter identification with small sample sizes. Proc. of the 34th IEEE CDC, New Orleans, Louisiana, USA, pp. 3504-3515.

Tjarnstrom, F. and L. Ljung (2002). Using the bootstrap to estimate the variance in the case of undermodeling. IEEE Trans. on Automatic Control, Vol. 47, pp. 395-398.

Vapnik, V. (1998). The Nature of Statistical Learning Theory. Springer.

Vapnik, V.N. and A.Ya. Chervonenkis (1968). Uniform convergence of the frequencies of occurrence of events to their probabilities. Soviet Math. Doklady, Vol. 9, pp. 915-968.

Vapnik, V.N. and A.Ya. Chervonenkis (1971). On the uniform convergence of relative frequencies to their probabilities. Theory of Prob. and its Appl., Vol. 16, pp. 264-280.

Venkatesh, S.R. and M.A. Dahleh (2001). On system identification of complex systems from finite data. IEEE Trans. on Automatic Control, Vol. 46, pp. 235-257.

Vicino, A. and G. Zappa (1996). Sequential approximation of feasible parameter sets for identification with set membership uncertainty. IEEE Trans. on Automatic Control, Vol. 41, pp. 774-785.

Vidyasagar, M. (2002). A Theory of Learning and Generalization. 2nd Ed., Springer Verlag.

Vidyasagar, M. and R. Karandikar (2002). A learning theory approach to system identification and stochastic adaptive control. In IFAC Symp. on Adaptation and Learning.

Vidyasagar, M. and R. Karandikar (2006). A learning theory approach to system identification and stochastic adaptive control. In G. Calafiore and F. Dabbene (Eds.), Probabilistic and Randomized Methods for Design under Uncertainty. Springer.

Welsh, J.S. and G.C. Goodwin (2002). Finite sample properties of indirect nonparametric closed-loop identification. IEEE Trans. on Automatic Control, Vol. 47, pp. 1277-1292.

Weyer, E. (2000). Finite sample properties of system identification of ARX models under mixing conditions. Automatica, Vol. 36, pp. 1291-1299.

Weyer, E. and M.C. Campi (2002). Non-asymptotic confidence ellipsoids for the least squares estimate. Automatica, Vol. 38, pp. 1539-1547.

Weyer, E., R.C. Williamson and I.M.Y. Mareels (1999). Finite sample properties of linear model identification. IEEE Trans. on Automatic Control, Vol. 44, pp. 1370-1383.

Yu, B. (1994). Rates of convergence for empirical processes of stationary mixing sequences. Annals of Probability, Vol. 22, pp. 94-116.

    Appendix A. PROOF OF THE RESULT

In the confidence region construction, we eliminate the regions in parameter space where all functions g_i(θ), i = 1, . . . , 7, are above zero, or at most one of them is below zero, and where all functions are below zero, or at most one of them is above zero. Therefore, the true parameter value θ0 falls outside the confidence region when - corresponding to θ = θ0 - all functions g_i(θ), i = 1, . . . , 7, are bigger than g_8(θ) = 0, or at most one of them is smaller than g_8(θ), and when all functions are smaller than g_8(θ), or at most one of them is bigger than g_8(θ). We claim that each one of these events has probability 1/8, so that the total probability that θ0 falls outside the confidence region is 4 · 1/8 = 0.5, as claimed in the RESULT.

We next concentrate on one condition only and compute the probability that all functions g_i(θ0), i = 1, . . . , 7, are bigger than g_8(θ0) = 0. The probability of the other conditions can be derived similarly.

The considered condition reads

w_1w_2 + w_2w_3 + w_4w_5 + w_5w_6 > 0
w_1w_2 + w_3w_4 + w_4w_5 + w_6w_7 > 0
w_2w_3 + w_3w_4 + w_5w_6 + w_6w_7 > 0
w_1w_2 + w_2w_3 + w_6w_7 + w_7w_8 > 0
w_1w_2 + w_3w_4 + w_5w_6 + w_7w_8 > 0
w_2w_3 + w_3w_4 + w_4w_5 + w_7w_8 > 0
w_4w_5 + w_5w_6 + w_6w_7 + w_7w_8 > 0.        (A.1)

To compute the probability that all these 7 inequalities are simultaneously true, let us ask the following question: what would we have written if, instead of comparing g_i(θ0), i = 1, . . . , 7, with g_8(θ0), we had compared g_i(θ0), i = 2, . . . , 8, with g_1(θ0)? The conditions would have been:

w_1w_2 + w_3w_4 + w_4w_5 + w_6w_7 > w_1w_2 + w_2w_3 + w_4w_5 + w_5w_6
w_2w_3 + w_3w_4 + w_5w_6 + w_6w_7 > w_1w_2 + w_2w_3 + w_4w_5 + w_5w_6
w_1w_2 + w_2w_3 + w_6w_7 + w_7w_8 > w_1w_2 + w_2w_3 + w_4w_5 + w_5w_6
w_1w_2 + w_3w_4 + w_5w_6 + w_7w_8 > w_1w_2 + w_2w_3 + w_4w_5 + w_5w_6
w_2w_3 + w_3w_4 + w_4w_5 + w_7w_8 > w_1w_2 + w_2w_3 + w_4w_5 + w_5w_6
w_4w_5 + w_5w_6 + w_6w_7 + w_7w_8 > w_1w_2 + w_2w_3 + w_4w_5 + w_5w_6
0 > w_1w_2 + w_2w_3 + w_4w_5 + w_5w_6,


or, moving everything to the left-hand side,

-w_2w_3 + w_3w_4 - w_5w_6 + w_6w_7 > 0
-w_1w_2 + w_3w_4 - w_4w_5 + w_6w_7 > 0
-w_4w_5 - w_5w_6 + w_6w_7 + w_7w_8 > 0
-w_2w_3 + w_3w_4 - w_4w_5 + w_7w_8 > 0
-w_1w_2 + w_3w_4 - w_5w_6 + w_7w_8 > 0
-w_1w_2 - w_2w_3 + w_6w_7 + w_7w_8 > 0
-w_1w_2 - w_2w_3 - w_4w_5 - w_5w_6 > 0.

If we now let w_2' = -w_2 and w_5' = -w_5, the latter condition rewrites as

w_2'w_3 + w_3w_4 + w_5'w_6 + w_6w_7 > 0
w_1w_2' + w_3w_4 + w_4w_5' + w_6w_7 > 0
w_4w_5' + w_5'w_6 + w_6w_7 + w_7w_8 > 0
w_2'w_3 + w_3w_4 + w_4w_5' + w_7w_8 > 0
w_1w_2' + w_3w_4 + w_5'w_6 + w_7w_8 > 0
w_1w_2' + w_2'w_3 + w_6w_7 + w_7w_8 > 0
w_1w_2' + w_2'w_3 + w_4w_5' + w_5'w_6 > 0.        (A.2)

Except for the primes showing up here and there, this latter set of inequalities is the same as the original set of inequalities (A.1) (the order in which the inequalities are listed has changed, but altogether the inequalities are the same - this is a consequence of the group property of the sets I_i). Moreover, since the w_t variables are symmetrically distributed, the change of sign implied by the primes is immaterial as far as the probability of satisfaction of the inequalities in (A.2) is concerned, so that we can conclude that (A.1) and (A.2) are satisfied with the same probability.

Now, instead of comparing the g_i(θ0)'s with g_1(θ0), we could have compared them with g_2(θ0), or with g_3(θ0), or ... or with g_7(θ0), arriving every time at the same conclusion that the probability does not change. Since these 8 events (g_1(θ0) is the smallest, g_2(θ0) is the smallest, etc.) are disjoint, cover all possibilities, and all have the same probability, we finally conclude that each and every event has exactly probability 1/8. It is therefore proven that the initial condition (A.1) is satisfied with probability 1/8, and this concludes the proof.
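The counting argument above lends itself to a quick numerical sanity check. The following sketch (added here for illustration, not part of the original proof) draws i.i.d. Gaussian samples w_1, ..., w_8 and estimates both the probability that all seven sums in (A.1) are positive and the probability that θ0 ends up outside the constructed region; the estimates should come out close to 1/8 and 0.5.

# Monte Carlo check of the probabilities claimed in the proof (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
groups = [{1, 2, 4, 5}, {1, 3, 4, 6}, {2, 3, 5, 6}, {1, 2, 6, 7},
          {1, 3, 5, 7}, {2, 3, 4, 7}, {4, 5, 6, 7}]          # I_1, ..., I_7
n_trials = 200000
all_positive = 0
outside = 0
for _ in range(n_trials):
    w = rng.standard_normal(9)                               # use w[1], ..., w[8]
    prod = np.array([w[k] * w[k + 1] for k in range(1, 8)])  # w_k w_{k+1}, k = 1, ..., 7
    gvals = np.array([prod[[k - 1 for k in I]].sum() for I in groups])
    if (gvals > 0).all():            # event: all g_i(theta0) bigger than g_8 = 0
        all_positive += 1
    below, above = (gvals < 0).sum(), (gvals > 0).sum()
    if below <= 1 or above <= 1:     # theta0 excluded from the confidence region
        outside += 1

print("P(all seven sums > 0)      ~", all_positive / n_trials, "(theory: 1/8 = 0.125)")
print("P(theta0 outside region)   ~", outside / n_trials, "(theory: 0.5)")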

Appendix B. GORDON'S CONSTRUCTION OF THE INCIDENCE MATRIX OF A GROUP

Given I = {1, . . . , N}, the incidence matrix for a group {I_i} of subsets of I is a matrix whose (i, j) element is 1 if j ∈ I_i and zero otherwise. In Gordon (1974), the following construction procedure for an incidence matrix R is proposed for the case where I = {1, . . . , 2^l - 1} and the group has 2^l elements. Let R(1) = [1], and recursively compute, for k = 2, 3, . . . , l,

R(2^k - 1) =
    [ R(2^{k-1} - 1)      R(2^{k-1} - 1)      0 ]
    [ R(2^{k-1} - 1)    J - R(2^{k-1} - 1)    e ]
    [      0^T                 e^T            1 ],

where J and e are, respectively, a matrix and a vector of all ones, and 0 is a vector of all zeros. Then, let

R = [ R(2^l - 1) ]
    [    0^T     ].

Gordon (1974) also gives constructions of groups when the number of data points is different from 2^l - 1.
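For completeness, the recursion is easy to code. The sketch below (an illustration written for this overview, not taken from Gordon (1974)) builds R for I = {1, . . . , 2^l - 1}; for l = 3 its eight rows are exactly the index sets that underlie the inequalities (A.1) in Appendix A.

# Sketch of Gordon's recursive construction of the incidence matrix R.
import numpy as np

def gordon_incidence(l):
    """Incidence matrix of a group of 2^l subsets of {1, ..., 2^l - 1}."""
    R = np.array([[1]], dtype=int)                    # R(1)
    for _ in range(2, l + 1):
        n = R.shape[0]
        J = np.ones((n, n), int)                      # all-ones matrix
        e = np.ones((n, 1), int)                      # all-ones column
        z = np.zeros((n, 1), int)                     # all-zeros column
        R = np.block([[R,      R,     z],
                      [R,  J - R,     e],
                      [np.zeros((1, n), int), np.ones((1, n), int), np.ones((1, 1), int)]])
    # Append a final all-zero row: it corresponds to the empty subset of the group.
    return np.vstack([R, np.zeros((1, R.shape[1]), int)])

R = gordon_incidence(3)                               # 8 x 7 incidence matrix
for i, row in enumerate(R, start=1):
    print("I_%d =" % i, sorted(int(j) + 1 for j in np.flatnonzero(row)))

Running the snippet prints I_1 = [1, 2, 4, 5], I_2 = [1, 3, 4, 6], ..., I_7 = [4, 5, 6, 7] and I_8 = [] (the empty set coming from the last, all-zero row of R), i.e. precisely the groups used in the examples of this paper.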